More Than React: Investigating The Role of Emoji Reaction in GitHub Pull Requests
Teyon Son, Tao Xiao, Dong Wang, Raula Gaikovina Kula, Takashi Ishio, Kenichi Matsumoto
Nara Institute of Science and Technology, Japan
Email: {son.teyon.sr7, tao.xiao.ts2, wang.dong.vt8, raula-k, ishio, matumoto}@is.naist.jp
Abstract—Context: Open source software development has become more social and collaborative, especially with the rise of social coding platforms like GitHub. In 2016, GitHub started to support more informal methods such as emoji reactions, with the goal of reducing commenting noise when reviewing any code changes to a repository. Interestingly, our preliminary results indicate that emoji reactions do not always reduce commenting noise (i.e., in eight out of 20 inspected emoji reactions), providing evidence that developers use emojis with ulterior intentions. From a reviewing context, the extent to which emoji reactions facilitate a more efficient review process is unknown.
Objective: In this registered report, we introduce the study protocols to investigate ulterior intentions and usages of emoji reactions, apart from reducing commenting noise, during the discussions in GitHub pull requests (PRs). As part of the report, we first perform a preliminary analysis of whether emoji reactions can reduce commenting noise in PRs and then introduce the execution plan for the study.
Method: We will use a mixed-methods approach in this study, i.e., quantitative and qualitative, with three hypotheses to test.
I. INTRODUCTION
In the past few years, open source software development has become more social and collaborative. Known as social coding, open source development promotes formal and informal collaboration by empowering the exchange of knowledge between developers [9]. GitHub, one of the most popular social coding platforms, attracts more than 72 million developers collaborating across 233 million repositories.1 Every day, thousands of people engage in conversations about code, design, bugs, and new ideas on GitHub. To promote collaboration, GitHub implements a vast number of social features (e.g., follow, fork, and star).
In 2016, GitHub introduced a new social function called "reactions" for developers to quickly express their feelings in issue reports and pull requests (PRs). Especially for discussing PRs, GitHub states:2
"In many cases, especially on popular projects, the result is a long thread full of emoji and not much content, which makes it difficult to have a discussion. With reactions, you can now reduce the noise in these threads" - GitHub
In the context of code review, we assume that a thread full of emoji may also contribute to the existing forms of confusion for reviewers during the code review process. For instance, Ebert et al. [10] pointed out that confusion delays the merge decision, decreases review quality, and results in additional discussions. Hirao et al. [16] found that patches can receive both positive and negative scores due to disagreement between reviewers, which leads to conflicts in the review process.
1https://github.com/search
2https://tinyurl.com/3rpdr6dp
Figure 1 depicts two typical cases where emoji reactions occur. Figure 1(a) shows a case where the reaction does reduce unnecessary commenting in the thread, and hence may lead to less confusion and conflict. The example illustrates how Author B reduces the commenting by simply reacting with a quick expression of approval through THUMBS UP. In contrast, as shown in Figure 1(b), there exists a case where the emoji usage has an ulterior intention and does not reduce comments in the discussion thread. In detail, Contributor D uses three positive emoji reactions (THUMBS UP, HOORAY, and HEART) to express appreciation for the PR, but then goes on to provide detailed comments on it. We posit that the intention of the emoji was to express appreciation for the PR, and it did not reduce the amount of commenting in the thread discussions. As part of our preliminary study, we also found other cases where the emoji did not reduce the commenting in the discussion. Under a closer manual inspection of 20 emoji reactions, we find eight cases where the emoji reactions did not reduce commenting noise.
Therefore, in this registered report, we present our study protocol to investigate ulterior intentions and usages of emoji reactions, apart from reducing commenting noise during the discussions. Specifically, we would like to (i) investigate the effect of emoji reaction related factors on the pull request process (i.e., review time), (ii) investigate whether PRs submitted by first-time contributors are more likely to receive reactions, (iii) analyze the relationship between the reaction and the intention of the comment, and (iv) explore the consistency between the sentiment of an emoji reaction and the sentiment of the comment. To enable other researchers to extend our study, we plan to make the study data publicly available.
II. PRELIMINARY STUDY
The goal of our preliminary study is to explore the extent to which emoji reaction usage reduces commenting noise in a PR. We selected the Eclipse projects as our case study, since Eclipse is a mature, thriving open source community with a number of contributors that actively submit and merge PRs.
Fig. 1. Examples of emoji reactions used in GitHub: (a) an emoji reaction that reduces commenting noise; (b) an emoji reaction that does not reduce commenting noise.
TABLE I
PRELIMINARY DATASET SUMMARY STATISTICS

                    # Repositories    # PRs           # PR Comments
With reactions      203 (48%)         6,867 (8%)      9,256 (4%)
Without reactions   217 (52%)         76,202 (92%)    249,354 (96%)
Total               420 (100%)        83,069 (100%)   258,610 (100%)
A. Data Collection
We collected a list of 683 Eclipse repositories by using the official API of GitHub [2]. Since we focus on the reactions that are used in PRs, we excluded the repositories that do not have any PRs. We obtained 420 repositories that have 83,069 PRs. Then, we extracted PRs with reactions using the GitHub GraphQL API [1], along with the reaction type, the reaction time, and the developer who posted the reaction. We excluded reactions and PRs that were posted by bots, since we investigate the reactions used by human developers. Based on our manual check, we performed the following two exclusions: (i) excluding accounts whose names match the regular expression 'github.app', and (ii) excluding dependabot, a popular bot used to automatically notify developers of dependency upgrades.3 In the end, we obtained 9,256 comments having emoji reactions across 203 Eclipse repositories, as shown in Table I.
B. Data Analysis
With our preliminary dataset, we conducted three exploratory analyses related to emoji reaction usage. First, we investigate the prevalence of reaction usage yearly, since this feature was initially introduced in 2016. To do so, we measured the proportion of PRs that have at least one emoji reaction. Second, we investigate which reactions developers commonly express during the PR process. To do so, we
grouped the eight existing emoji reactions into four categories:
Positive - the single or combined usage of THUMBS UP, LAUGH, HOORAY, HEART, and ROCKET reactions.
Negative - the single or combined usage of THUMBS DOWN and CONFUSED.
Neutral - the usage of the EYES reaction.
Mixed - the combined usage of reactions across the categories above.
3https://dependabot.com/
Fig. 2. Proportion of PRs with reactions by year in Eclipse repositories.
For example, if a PR comment receives both the THUMBS UP and the THUMBS DOWN reactions, we classify the case as Mixed. Third, we further investigate whether or not the reaction is used to reduce commenting noise during the review process. To do so, we randomly selected 20 PR comment samples from the preliminary dataset and manually classified them (i.e., as reducing commenting noise or not) among the first four authors. If, after the emoji reaction was posted, the reacting developer added no further comments on the existing topic, we classify the case as Reduce Noise. Otherwise, we classify the case as Not Reduce Noise. For example, in Figure 1(a), Author B reacted with THUMBS UP to the suggestion provided by Contributor A and added three commits without any additional comments. This case is labeled as Reduce Noise.
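To make the categorization above concrete, the following is a minimal sketch of the four-category mapping, assuming reactions are represented by the content strings returned by the GraphQL API (e.g., THUMBS_UP); the function name is illustrative.

POSITIVE = {"THUMBS_UP", "LAUGH", "HOORAY", "HEART", "ROCKET"}
NEGATIVE = {"THUMBS_DOWN", "CONFUSED"}
NEUTRAL = {"EYES"}

def categorize(reactions):
    """Map the set of reactions on one PR comment (assumed non-empty)
    to one of the four sentiment categories."""
    present = {name for name, group in
               [("Positive", POSITIVE), ("Negative", NEGATIVE),
                ("Neutral", NEUTRAL)]
               if reactions & group}
    return present.pop() if len(present) == 1 else "Mixed"

# Examples from the text:
assert categorize({"THUMBS_UP", "THUMBS_DOWN"}) == "Mixed"
assert categorize({"THUMBS_UP", "HEART", "ROCKET"}) == "Positive"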
Positive emoji reactions are widely used in PRs. Two preliminary results are summarized. First, we find that around 8% to 10% of PRs have at least one reaction in Eclipse repositories between 2016 and 2020. Figure 2 shows the proportion of PRs with reactions yearly. Second, we observe that most of the reactions in PR comments are Positive (i.e., the single or combined usage of THUMBS UP, LAUGH, HOORAY, HEART, and ROCKET), accounting for 98.1%. Upon further inspection, among these Positive reactions, the single usage of THUMBS UP alone almost reaches 86.5%. On the other hand, the usage of Negative, Neutral, and Mixed only accounts for 1.84%. Table II shows the distribution of the sentiments of emoji reaction usage, indicating that the positive reaction is the most prevalent.
Emoji reactions do not always reduce the commenting noise. Table III shows, from our manual classification, how often the emoji reactions reduced commenting noise.
TABLE II
DISTRIBUTION OF SENTIMENTS OF EMOJI REACTIONS

Emoji reaction sentiments    # PR Comments
Positive                     9,084 (98.10%)
Negative                     67 (0.72%)
Neutral                      74 (0.79%)
Mixed                        31 (0.33%)
Total                        9,256 (100%)
TABLE III
THE FREQUENCY COUNT OF MANUALLY CLASSIFIED SAMPLES BY WHETHER THE EMOJI REACTIONS REDUCED THE COMMENTING NOISE

Category                         Count
Reducing commenting noise        12
Not reducing commenting noise    8
The table shows that eight samples (40%) are classified as Not Reduce Noise. In these samples, after the emoji reactions were posted, developers posted additional comments to express and discuss issues on the PR.
Summary: Preliminary results show that around 8%
to 10% of PRs have reactions in Eclipse repositories.
We find cases where the emoji did not always reduce
commenting noise in the discussion. Under a closer
manual inspection of 20 emoji reactions, we find that
there are eight cases where the emoji reactions did not
reduce commenting noise.
III. STUDY PROTOCOLS
In this section, we present the design of our study. This section consists of our research questions with their motivations.
A. Research Questions
Inspired by the motivating examples and the preliminary
study, we formulate four research questions to guide our study:
RQ1: Does the emoji reaction used in the review discussion correlate with review time?
Prior studies [3, 19] have widely analyzed the impact of technical and non-technical factors on the review process (e.g., review outcome, review time). However, little is known about whether emoji reactions correlate with review time. It is possible that emoji reactions shorten review time, as they could reduce the noise during the review discussions. Thus, our motivation for the first research question is to explore the correlation between the emoji reactions used in the review discussion and review time.
RQ2: Does a PR submitted by a first-time contributor receive more emoji reactions?
As shown in Figure 1(b), we find that the emoji reaction might be used to express appreciation for submitting a PR. Our motivation for this research question is to understand whether contributors that have never submitted to the project before receive more emoji reactions. Furthermore, answering this research question will provide insights into a potential ulterior motive for an emoji reaction.
Our assumption is that:
H1: PRs submitted by first-time contributors receive more emoji reactions. Existing contributors express positive feelings to attract newcomers to the project.
RQ3: What is the relationship between the intention of
comments and their emoji reaction?
Our preliminary study findings show that emoji reactions
do not always reduce the commenting noise. Hence, our
motivation for the third research question is to explore
the relationship between the intention of comments and
their reactions.
Our assumption is that:
H2: Most emojis are uniformly distributed across
the different intentions. Specific intentions may ex-
plain the ulterior purpose of reacting with an emoji
reaction.
RQ4: Is the emoji reaction consistent with the comment sentiment?
From our preliminary study, we found that emojis with specific sentiments (i.e., THUMBS UP) are widely used in PRs. Our motivation for this research question is to investigate whether there is any inconsistency between the sentiments of the comments and the sentiments of the emoji reactions. Furthermore, we plan to manually check the reasons why inconsistency happened. We believe answering RQ4 would help newcomers better understand emoji usage in PR discussions.
Our assumption is that:
H3: The sentiments of emoji reactions are uniformly distributed across the same comment sentiments. Specific sentiments may explain the ulterior purpose of reacting with an emoji. This may be useful in understanding what information is needed in code review.
IV. DATA COLLECTION
To generalize the results of the study, we plan to expand our dataset using the active software development repositories shared by Hata et al. [13]. Each repository in this dataset has more than 500 commits and at least 100 commits during its most active two-year period. In total, this dataset contains 25,925 repositories across seven languages (i.e., C, C++, Java, JavaScript, Python, PHP, and Ruby). We will use the GraphQL
API [1] to obtain PRs created after March 13th, 2016, when the emoji reaction feature was introduced. The whole dataset will be used for all four research questions.
V. EXECUTION PLAN
In this section, we present the execution plan of our experiment. We will use a mixed method consisting of both quantitative and qualitative analysis to answer our research questions.
A. Research Method for RQ1:
For the first research question, we plan to use a quantitative method. To investigate the effect of emoji reaction related factors on the pull request process (i.e., review time), we plan to perform a statistical analysis using a non-linear regression model. This model allows us to capture the relationships between the independent variables and the dependent variable. The goal of our statistical analysis is not to predict the review time but to understand the associations between emoji reactions and review time.
For the independent variables, similar to prior studies [19, 23], we will select the following confounding factors:
PR size: The total numbers of added and deleted lines of
code changed by a PR.
Change file size: The number of files that were changed by a PR.
Purpose: The purpose of a PR, i.e., bug, document,
feature.
# Comments: The total number of comments in a PR
discussion thread.
# Author Comments: The total number of comments by
the author in a PR discussion thread.
# Reviewer Comments: The total number of comments by
the reviewers in a PR discussion thread.
Patch author experience: The number of prior PRs that
were submitted by the PR author.
Reviewers: The number of developers who posted a
comment to a review discussion.
Commit size: The number of commits in a PR.
Since we investigate the effect of the emoji reaction, we plan
to compute additional independent variables that are related to
emoji reaction:
With emoji reaction: Whether or not a PR includes any
emoji reaction (binary).
The number of emoji reactions: The count of emoji reactions in a PR.
For the dependent variable (i.e., review time), we measure the
time interval in hours from the time when the first comment
was posted until the time when the last comment was posted.
For the model construction, we will adopt the steps that are
similar to the prior studies, including (i) Estimating budget for
degrees of freedom, (ii) Normality adjustment, (iii) Correlation
and redundancy analysis, (iv) Allocating degrees of freedom,
and (v) Fitting statistical models.
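As a rough illustration of steps (ii), (iv), and (v), the following sketch fits such a model in Python using statsmodels, with restricted cubic splines from patsy for continuous factors; the file name, column names, and degrees-of-freedom allocation are illustrative assumptions, not our final modeling choices.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pr_metrics.csv")  # hypothetical per-PR metrics table

# (ii) Normality adjustment: log-transform the skewed review time.
df["log_review_time"] = np.log1p(df["review_time_hours"])

# (iv)-(v) Allocate degrees of freedom to continuous factors and fit;
# cr() fits natural cubic splines for factors kept after the
# correlation and redundancy analysis of step (iii).
model = smf.ols(
    "log_review_time ~ cr(pr_size, df=3) + cr(num_comments, df=3)"
    " + cr(author_experience, df=3) + num_reviewers"
    " + with_reaction + num_reactions",
    data=df,
).fit()

print(model.rsquared_adj)        # model stability (analysis step i)
print(model.wald_test_terms())   # explanatory power (analysis step ii)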
(a) Analysis Plan: We will analyze the constructed regression models in the following three steps: (i) Assessing model stability. To evaluate the performance of our models, we will report the adjusted R2 [12]. We will also use the bootstrap validation approach to estimate the optimism of the adjusted R2. (ii) Estimating the power of explanatory variables. Similar to prior work [23], we plan to test the statistical significance of the independent variables using p-values and employ Wald statistics to measure the impact of each independent variable. (iii) Examining relationships. Finally, we will examine and plot the direction of the relationship between each independent variable (especially the emoji reaction related variables) and the dependent variable.
B. Research Method for RQ2:
For RQ2, we plan to use a quantitative method. To do so, we will construct two groups of pull requests to compare: first-time contributors and non-first-time contributors (control group). For the first-time contributor group, we will identify all pull requests that are submitted by first-time contributors in our dataset. For the non-first-time contributor group, to construct a balanced control group, we will randomly select an equal number of pull requests that are submitted by non-first-time contributors. We will then divide the pull requests of each group into ones having emoji reactions and ones without emoji reactions, respectively.
(a) Analysis Plan: We will present a pivot chart to show the frequency of pull requests with and without emoji reactions by first-time contributors and non-first-time contributors. The x-axis will represent the two groups of first-time contributors and non-first-time contributors. Furthermore, each group will be divided into two parts: pull requests with emoji reactions and pull requests without emoji reactions. The y-axis will represent the frequency count of pull requests.
(b) Significance Testing: To select a suitable statistical test, we will adopt the Shapiro-Wilk test with alpha = 0.05. In the case where the p-value is greater than 0.05 (i.e., normality is not rejected), we will perform a two-tailed independent t-test with alpha = 0.05. Otherwise, we will adopt a two-tailed Mann-Whitney U test [21] with alpha = 0.05.
In addition, we will investigate the effect size. In the case where the data are normally distributed, we will use Hedges' g effect size [14], interpreted as follows: (1) |d| < 0.2 as Negligible, (2) 0.2 ≤ |d| < 0.5 as Small, (3) 0.5 ≤ |d| < 0.8 as Medium, or (4) 0.8 ≤ |d| as Large. If the data are not normally distributed, we will apply Cliff's delta (Romano et al., 2006) to measure the effect size, interpreted as follows: (1) |δ| < 0.147 as Negligible, (2) 0.147 ≤ |δ| < 0.33 as Small, (3) 0.33 ≤ |δ| < 0.474 as Medium, or (4) 0.474 ≤ |δ| as Large.
C. Research Method for RQ3:
For RQ3, we plan to use a quantitative method to classify the intentions of the comments. To categorize the intentions, we will use the taxonomy of intentions proposed by Huang et al. [17]. They manually categorized 5,408 sentences from issue reports of four projects on GitHub to generalize linguistic patterns for category identification.
The taxonomy of intention category is described below:
Information Giving (IG): Share knowledge and expe-
rience with other people, or inform other people about
new plans/updates (e.g., “The typeahead from Bootstrap
v2 was removed.”).
Information Seeking (IS): Attempt to obtain information
or help from other people (e.g., “Are there any developers
working on it?”).
Feature Request (FR): Request improvements to existing features or the implementation of new features (e.g., "Please add a titled panel component to Twitter Bootstrap.").
Solution Proposal (SP): Share possible solutions for
discovered problems (e.g., “I fixed this for UI Kit using
the following CSS.”).
Problem Discovery (PD): Report bugs, or describe un-
expected behaviors (e.g., “the firstletter issue was causing
a crash.”).
Aspect Evaluation (AE): Express opinions or evalua-
tions on a specific aspect (e.g., “I think BS3’s new theme
looks good, it’s a little flat style.”).
Meaningless (ML): Sentences with little meaning or
importance (e.g., “Thanks for the feedback!”).
To facilitate automation, they proposed a convolutional neural network (CNN) based classifier with high accuracy. For RQ3, we will use this classifier to automatically label the intentions of the comments. To evaluate the robustness of this classifier on our dataset, we will first use it to automatically classify the intentions of 30 randomly sampled comments. Then, we will manually check whether the labeled intentions of these 30 comments are correct. The result of this sanity check will be presented as a percentage of false positives, with under 10% being considered acceptable.
(a) Analysis Plan: To analyze the relationship between the intention of comments and their emoji reactions, we will use the association rule mining technique. To show the diversity of the different intentions from the classification, we will draw a histogram plot. To show the results of the relationship, a table will be drawn with descriptive statistics, including the support and confidence criteria. A sketch of this mining step is shown below.
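Assuming each comment is encoded as a one-hot transaction over intention labels and reaction types, the analysis could use mlxtend as follows; the thresholds shown are illustrative, not our preregistered values.

from mlxtend.frequent_patterns import apriori, association_rules

# transactions is a hypothetical boolean DataFrame: one row per comment,
# one column per item, e.g. intention labels ("IG", "FR", ...) and
# reaction types ("THUMBS_UP", "HEART", ...).
frequent = apriori(transactions, min_support=0.01, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])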
(b) Significance Testing: To inspect whether or not the classified intentions of comments are normally distributed, we will adopt the Shapiro-Wilk test with alpha = 0.05, which is widely used as a normality test. In addition, to inspect whether the intentions of comments are significantly different, we will use the Kruskal-Wallis non-parametric statistical test [4].
D. Research Method for RQ4:
For RQ4, we plan to use a combination of qualitative and quantitative methods. First, we will use a quantitative method to investigate whether there is any inconsistency between the sentiments of emoji reactions and the sentiments of the comments. We determine the sentiments of the emoji based on the definition discussed in Section II; hence, emoji sentiment can be categorized into the following types: Positive, Negative, Neutral, and Mixed. To extract the sentiment of the first responses, we plan to use SentiStrength-SE [18], the state-of-the-art sentiment analysis tool for software engineering text. Similar to the tool we plan to use for RQ3, the input is a reacted comment, and the output is the sentiment score of the given comment. The sentiment score varies from -5 (very negative) to 5 (very positive). Based on the above definition, we consider a case inconsistent if the sentiment of the comment differs from the sentiment of the emoji.
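A minimal sketch of this consistency check follows, assuming SentiStrength-SE scores have been exported per comment; the mapping from scores to categories is our illustrative reading of the definition above.

def comment_sentiment(score):
    """Map a SentiStrength-SE score (-5..5) to a sentiment category."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

def is_consistent(emoji_category, score):
    """True if the emoji sentiment matches the comment sentiment.
    A Mixed emoji category never matches a single comment score."""
    return emoji_category == comment_sentiment(score)

# Example: a HEART reaction (Positive) on a comment scored -2 -> inconsistent.
assert is_consistent("Positive", -2) is False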
Then, we will conduct a qualitative analysis to explore the possible reasons for inconsistency between the sentiments of emoji and the sentiments of the comments. To do so, we will apply the open coding approach [5] to classify the reasons for inconsistency. To discover as complete a list of reasons as possible, we strive for theoretical saturation [11]. Similar to prior work [15], we set our saturation criterion to 50, i.e., we continue to code randomly selected comments until no new reasons have been discovered for 50 consecutive comments. Furthermore, we will compute the kappa agreement score [22] to evaluate the classification quality. Similar to Hata et al. [13], the agreement on the coding guide will be assessed using a kappa agreement. The kappa result is interpreted as follows: values of 0 indicate no agreement, 0.01-0.20 none to slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 almost perfect agreement. Agreement scores of 0.81 or larger (i.e., almost perfect) are required before proceeding to the manual analysis. Based on prior experience, we estimate the sample size to range from 200 to 300 samples.
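For illustration, the kappa score itself could be computed with scikit-learn as follows; the two label lists stand in for the codes assigned by two authors and are hypothetical.

from sklearn.metrics import cohen_kappa_score

rater_a = ["irony", "appreciation", "irony", "other"]   # hypothetical codes
rater_b = ["irony", "appreciation", "other", "other"]
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")  # proceed to full manual analysis if >= 0.81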
(a) Analysis Plan: Similar to RQ3, we will depict a histogram plot to show the distribution of emoji types by the sentiments of the comments. To show the results of the inconsistency reasons, we will draw a histogram plot showing their frequencies. As part of our result presentation, we will include real examples from the analysis to illustrate the reason taxonomy.
(b) Significance Testing: Similar to RQ3, we will use the Shapiro-Wilk test to inspect whether or not the sentiments of emoji usage are normally distributed and the Kruskal-Wallis non-parametric statistical test to validate significant differences.
VI. IMPLICATIONS
We summarize our implications with the following takeaway messages for the key stakeholders:
Researchers: Answering RQ1 will help researchers understand the impact of emoji reactions, and hence may contribute to existing knowledge on the code review process. As informal communication such as emoji usage becomes prevalent, there is a need to understand its role in keeping the code review process efficient. We believe that emojis may also help remove toxic interactions and other forms of anti-patterns [8] in the code review process. In terms of the intentions of emoji reactions, our study will complement related work on emoji usage [6, 7].
Contributors: In terms of practitioners, we envision our study assisting projects to attract and retain existing and potential contributors. Especially for RQ2, the results of the study may provide some insights into how to attract newcomers and how to provide a more friendly and welcoming environment. Furthermore, answering RQ3 and RQ4 will provide contributors with insights into understanding the consistency of emoji reactions. As emoji usage becomes popular [20], the results of the study should provide guidelines on how to represent the common intentions when an emoji reaction is warranted.
VII. THREATS TO VALIDITY
We identified three key threats to our study. First, the Eclipse projects that were used in our preliminary study may not be representative of all types of GitHub projects. To increase generalizability, we will extend our study to include a sample of random GitHub projects [13]. Our second threat concerns the qualitative aspect of the study, as manual classification is prone to human error. This is because the interpretation of emoji usage may not be trivial; for instance, a positive emoji used in an ironic context does not convey positive sentiment, and such usage of emojis may influence our results. To mitigate this, we will employ the kappa agreement among multiple co-authors for each code. Third, our quantitative analysis and data collection may include some false positives, such as bot reactions and comments. Currently, we manually exclude these bots for the preliminary study. To mitigate this, we plan to carefully identify and systematically remove bots based on official documentation.
REFERENCES
[1] GitHub GraphQL API v4. URL https://docs.github.com/en/graphql.
[2] GitHub REST API. URL https://docs.github.com/en/rest.
[3] Olga Baysal, Oleksii Kononenko, Reid Holmes, and
Michael W. Godfrey. Investigating Technical and Non-
technical Factors Influencing Modern Code Review.
Empirical Software Engineering, page 932–959, 2016.
[4] Norman Breslow. A generalized Kruskal-Wallis test for comparing K samples subject to unequal patterns of censorship. Biometrika, pages 579–594, 1970.
[5] Kathy Charmaz. Constructing Grounded Theory. SAGE,
2014.
[6] Zhenpeng Chen, Yanbin Cao, Xuan Lu, Qiaozhu Mei,
and Xuanzhe Liu. Sentimoji: an emoji-powered learning
approach for sentiment analysis in software engineering.
In Proceedings of the 2019 27th ACM Joint Meeting
on European Software Engineering Conference and
Symposium on the Foundations of Software Engineering,
pages 841–852, 2019.
[7] Zhenpeng Chen, Yanbin Cao, Huihan Yao, Xuan Lu, Xin
Peng, Hong Mei, and Xuanzhe Liu. Emoji-powered sen-
timent and emotion detection from software developers’
communication data. ACM Transactions on Software
Engineering and Methodology (TOSEM), 30(2):1–48,
2021.
[8] Moataz Chouchen, Ali Ouni, Raula Gaikovina
Kula, Dong Wang, Patanamon Thongtanunam,
Mohamed Wiem Mkaouer, and Kenichi Matsumoto.
Anti-patterns in modern code review: Symptoms and
prevalence. In 2021 IEEE International Conference
on Software Analysis, Evolution and Reengineering
(SANER), pages 531–535, 2021.
[9] Laura Dabbish, Colleen Stuart, Jason Tsay, and Jim
Herbsleb. Social coding in github: transparency and col-
laboration in an open software repository. In Proceedings
of the ACM 2012 conference on computer supported
cooperative work, pages 1277–1286, 2012.
[10] F. Ebert, F. Castor, N. Novielli, and A. Serebrenik.
Confusion in code reviews: Reasons, impacts, and coping
strategies. In 2019 IEEE 26th International Conference
on Software Analysis, Evolution and Reengineering
(SANER), pages 49–60, 2019.
[11] Kathleen M Eisenhardt. Building theories from case
study research. Academy of management review, 1989.
[12] T. Hastie, R. Tibshirani, and J.H. Friedman. The
Elements of Statistical Learning: Data Mining, Inference,
and Prediction. Springer, 2009.
[13] Hideaki Hata, Christoph Treude, Raula Gaikovina Kula,
and Takashi Ishio. 9.6 million links in source
code comments: Purpose, evolution, and decay. In
Proceedings of the 41st International Conference on
Software Engineering, ICSE ’19, page 1211–1221. IEEE
Press, 2019.
[14] Larry V. Hedges and I. Olkin. Statistical Methods for
Meta-Analysis. Academic Press, 1985.
[15] Toshiki Hirao, Shane McIntosh, Akinori Ihara, and
Kenichi Matsumoto. The review linkage graph for
code review analytics: a recovery approach and em-
pirical study. In Proceedings of the 2019 27th
ACM Joint Meeting on European Software Engineering
Conference and Symposium on the Foundations of
Software Engineering, pages 578–589, 2019.
[16] Toshiki Hirao, Shane McIntosh, Akinori Ihara, and Kenichi Matsumoto. Code reviews with divergent review scores: An empirical study of the OpenStack and Qt communities. IEEE Transactions on Software Engineering, 2020.
[17] Qiao Huang, Xin Xia, David Lo, and Gail C Murphy.
Automating intention mining. IEEE Transactions on
Software Engineering, 46(10):1098–1119, 2018.
[18] M. R. Islam and M. Zibran. Sentistrength-se: Exploiting
domain specificity for improved sentiment analysis in
software engineering text. J. Syst. Softw., 145:125–146,
2018.
[19] O. Kononenko, T. Rose, O. Baysal, M. Godfrey,
D. Theisen, and B. de Water. Studying pull request
merges: A case study of shopify’s active merchant.
In 2018 IEEE/ACM 40th International Conference on
Software Engineering: Software Engineering in Practice
Track (ICSE-SEIP), pages 124–133, 2018.
[20] Renee Li, Pavitthra Pandurangan, Hana Frluckaj, and
Laura Dabbish. Code of conduct conversations in open
source software projects on github. Proceedings of the
ACM on Human-Computer Interaction, 5(CSCW1):1–
31, apr 2021.
[21] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50–60, 1947.
[22] Anthony Viera and Joanne Garrett. Understanding
interobserver agreement: The kappa statistic. Family
medicine, pages 360–3, 06 2005.
[23] Dong Wang, Tao Xiao, Patanamon Thongtanunam, Raula Gaikovina Kula, and Kenichi Matsumoto. Understanding shared links and their intentions to meet information needs in modern code review. Empirical Software Engineering (EMSE), volume 26, to appear, 2021. doi: 10.1007/s10664-021-09997-x.