Where Should I Look at? Recommending Lines that
Reviewers Should Pay Attention To
Yang Hong, Chakkrit (Kla) Tantithamthavorn, Patanamon (Pick) Thongtanunam
Monash University, Australia. The University of Melbourne, Australia.
Abstract—Code review is an effective quality assurance practice, yet it can be time-consuming since reviewers have to carefully review all newly added lines in a patch. Our analysis shows that, at the median, patch authors waited 15-64 hours to receive initial feedback from reviewers, which accounts for 16%-26% of the whole review time of a patch. Importantly, we also found that large patches tend to receive initial feedback from reviewers more slowly than smaller patches. Hence, it would be beneficial to reviewers to reduce their effort with an approach that pinpoints the lines they should pay attention to.
In this paper, we propose REVSPOT—a machine learning-based approach to predict problematic lines (i.e., lines that will receive a comment and lines that will be revised). Through a case study of three open-source projects (i.e., Openstack Nova, Openstack Ironic, and Qt Base), REVSPOT can accurately predict lines that will receive comments and lines that will be revised (with a Top-10 Accuracy of 81% and 93%, which is 56% and 15% better than the baseline approach), and these correctly predicted problematic lines are related to logic defects, which could impact the functionality of the system. Based on these findings, our REVSPOT could help reviewers reduce their reviewing effort by reviewing a smaller set of lines, increasing code review speed and reviewers' productivity.
Index Terms—Software Quality Assurance, Modern Code Review
I. INTRODUCTION
Code review is a well-established practice that is widely
adopted in modern software development in both commercial
and open-source projects [3, 6]. Code review is part of quality
assurance practices where a newly proposed patch (i.e., a set
of code changes) will be manually examined and critiqued
by reviewers (i.e., developers other than the patch author),
and revised until reviewers agree that the patch has sufficient
quality. Several studies found that code review provides a wide
spectrum of benefits including reducing software defects [4,
32, 52], improving maintainability [6, 34, 44] as well as
increasing knowledge transfer and team awareness [3, 28, 44].
Since code review involves manual work from developers,
qualitative studies reported that managing time is the
top challenge faced by developers when performing code
reviews [23, 28]. More specifically, MacLeod et al. [28]
found that receiving timely feedback and reviewing large
patches are the top challenges faced by patch authors and
reviewers, respectively. Motivated by the prior work [28], we
first conducted an exploratory study to better understand the
association between the waiting time to receive initial feedback
and the patch size. Based on the code review data of three
open-source projects (i.e., Openstack Nova, Openstack Ironic,
and Qt Base), we address the following research question.
RQ1) How long do the patch authors wait to receive initial
feedback from reviewers?
We found that at the median, patch authors waited 15-
64 hours to receive initial feedback from reviewers,
which accounts for 16%-26% of the whole review time
of a patch. Moreover, we found that larger patches tend
to receive initial feedback from reviewers slower than
smaller patches.
Our RQ1 results and findings of the prior works [23, 28]
suggest that it might be challenging for reviewers to review
the entire code change given limited time and effort; and patch
authors may also waste unnecessary time to wait for reviewer
feedback. These findings highlight the need for an approach to
pinpoint the lines that reviewers should pay attention to, which
could help developers reduce the reviewing effort as well
as the waiting time to receive feedback.
Therefore, we developed REVSPOT—a line-level recom-
mendation approach to predict lines that will receive comments
and lines that will be revised. Our intuition is that lines
that will receive comments or be revised could potentially
be problematic (called problematic lines, henceforth) that
reviewers should pay attention to. REVSPOT leverages a
machine learning technique (i.e., Random Forest) and a model-
agnostic technique (i.e., LIME [41]) to predict lines that
reviewers should pay attention to. To evaluate the effectiveness
of REVSPOT, we conducted quantitative and qualitative analyses.
For the quantitative analysis, we compared the prediction
accuracy of REVSPOT against the baseline (N-gram). For the
qualitative analysis, we manually examined the code review
defects in the problematic lines that REVSPOT can correctly
predict. Through a case study of three open-source projects,
we address the following research questions.
RQ2) How accurate is REVSPOT in predicting problem-
atic lines that reviewers should pay attention to?
For predicting lines that will receive comments,
REVSPOT achieves a Top-10 accuracy of 81% and a d2h
of 0.55, which is 56% and 15% better than the n-gram
approach. For predicting lines that will be revised,
REVSPOT achieves a Top-10 accuracy of 93% and a d2h
of 0.55, which is 15% and 15% better than the baseline.
RQ3) What kinds of defects in the problematic lines that
are correctly predicted and incorrectly predicted by
our REVSPOT?
The majority of problematic lines that REVSPOT can
correctly predict are related to logic defects. This finding
highlights that our REVSPOT can correctly predict
problematic lines that could impact the functionality of
the system.
Contribution. To the best of our knowledge, the contributions
of this paper are as follows:
An empirical investigation of the association between the
waiting time to receive initial feedback and the patch size
(RQ1).
REVSPOT—A machine learning-based approach
for predicting problematic lines that the reviewers need
to pay attention to (i.e., lines that will receive comments
and lines that will be revised).
A quantitative (RQ2) and qualitative (RQ3) evaluation of
the predictions from our REVSPOT approach.
A publicly-available code review dataset and source code
for predicting problematic lines.¹
Paper Organization. The paper is organized as follows.
Section II describes background and motivation analysis. Sec-
tion III presents our REVSPOT approach. Section IV presents
the experimental design of REVSPOT. Section V presents the
experimental results. Section VI discusses the related work.
Section VII discusses the threats to the validity. Section VIII
draws the conclusion.
II. BACKGROUND & MOTIVATION
In this section, we describe the code review process and
discuss the motivation based on literature.
A. Code Review Process
The code review process that is commonly used in software
projects nowadays is lightweight and based on a code review
tool (e.g., Gerrit) [42, 44]. Figure 1 illustrates an overview
of the code review process, which comprises five main
steps. Broadly speaking, once a patch author creates a new
patch (i.e., code changes), s/he first uploads it to the code
review tool in Step 1. Then, in Step 2, the patch author
invites reviewers to examine the proposed patch. In Step 3,
the reviewers examine the changed code in the proposed patch.
If the reviewers find problems or have concerns, they can
provide feedback to the specific lines of code (called inline
comments, henceforth) or post a message in the discussion
thread of the patch (called general comments, henceforth).
The reviewers then make a decision whether this patch can be
integrated into the main code repository by giving a review
score in Step 4. If the reviewers suggest that the patch is not
ready for an integration by giving a negative score, the patch
author may revise the patch and upload a new version of the
patch to address the inline comments or general comments
of reviewers in Step 5. Finally, when one of the reviewers
approves that the revised version of the patch has sufficient
quality, the patch is then merged into the main code repository.
¹ https://github.com/awsm-research/RevSpot-replication-package
Fig. 1: A usage scenario of our REVSPOT in a code review process.
B. Motivation & Usage Scenario
Intuitively, to assure the quality of the proposed patch,
reviewers should carefully examine the patch and identify
potential problems. However, such scrutinized code reviews
can be time-consuming. Prior studies found that reviewing
large patches, receiving timely feedback, and balancing time
between reviewing patches and other daily activities (e.g.,
coding, meeting) are the top challenges faced by develop-
ers [23, 28]. Moreover, Czerwonka et al. [13] also argued
that large patches tend to receive less useful feedback, while a
long review time can cause process stalls and hinder the patch
authors from effectively revising the code. Several studies also
showed that patch size is associated with the software qual-
ity [24, 32, 52] and the code review effectiveness [5, 43, 54].
Given the limited time of reviewers, it could be challenging
for reviewers to carefully examine the entire code change in a
patch. Hence, to reduce reviewing effort and expedite the code
review process, an approach that allows reviewers to focus on
a particular part of the patch could help them effectively identify
problems. Therefore, in this work, we developed REVSPOT to
pinpoint the lines that are likely problematic (i.e., will receive
a comment or be revised). Figure 1 depicts a usage scenario
of how REVSPOT can help reviewers during the code review
process. In particular, when examining a newly-proposed patch
(in Step 3), our REVSPOT could potentially help reviewers
to pay attention to these problematic lines and their context
rather than examine the entire code change in the patch.
III. REVSPOT: AN APPROACH TO RECOMMEND LINES
THAT REVIEWERS SHOULD PAY ATTENTION TO
In this section, we present the design challenges and an
overview of our REVSPOT.
Design Challenge. While practitioners and researchers [36,
38, 60, 61] argued that fine-grained recommendations are
desirable, a large body of work still mainly focuses on develop-
ing recommendation approaches at the patch level. Designing
recommendation approaches to directly predict at the line level
is very challenging due to the complex hierarchical structure of
patch and code review information. Prior studies also argued
that there exist no meaningful features at the line level that
can demonstrate accurate predictions [38, 60]. In addition,
most of the code review metrics [54] (e.g., churn, number of
modified files, and number of modified directories) are at the
patch level, which are not directly applicable to our line-level
recommendation. Thus, there is a lack of meaningful features
that can be used to predict at the line level.
Fig. 2: An overview of REVSPOT for predicting the problematic lines.
Overview. We design our REVSPOT as a two-step machine
learning approach (i.e., file-level and line-level) to address the
fine-grained recommendation challenge. We use the Bag-of-
Words to address the line-level feature challenge. We focus
on two prediction tasks, i.e., predicting lines that will receive
an inline comment and predicting lines that will be revised.
Our intuition is that lines that received an inline comment or
were revised are the lines that have a problem that reviewers
should pay attention to.
Figure 2 presents an overview of our approach to predict
problematic lines that reviewers should pay attention to. We
first prepare two types of datasets (one is used to build a model
for each prediction task). Then, we extract features using the
code tokens (i.e., Bag of Words). The underlying intuition of
code token features is that code tokens that frequently received
inline comments or were revised in the past are likely to
receive inline comments or be revised in the future. Since
our training datasets are highly imbalanced, we employ a
SMOTE technique [10] to mitigate the class imbalance issues
in the training dataset. After that, we build a file-level machine
learning-based approach that aims to predict files that will
receive comments or files that will be revised. Then, we
employ a Local Interpretable Model-Agnostic Explanations
(LIME) [41] to locate the most important tokens that con-
tribute to the predictions. In particular, we rank lines in each
file based on the number of top-k important tokens that appear
in that line. Finally, we evaluate the accuracy of our approach
and compare with a baseline approach.
(Step 1) Dataset Preparation. As discussed earlier, we
focus on two prediction tasks (i.e., predicting lines that will
receive a comment and predicting lines that will be revised),
we will need two types of datasets. Since our approach
is designed as a two-step machine learning approach (i.e.,
building a file-level ML prediction model, then using LIME
to predict the problematic line), we will need to prepare each
dataset at two granularity levels (i.e., one at the file level,
and another at the line level). Intuitively, deleted lines cannot
be revised and will not be included in the main repository.
Since we found that the deleted lines rarely received comments
from reviewers (1.8% of patches for Nova, 0.8% of patches
for Ironic, 2% of patches for Qt Base), we focus only on the
added lines of each patch.
To identify lines that will receive a comment, we start
from the patchset (i.e., the revision id) that first received
reviewer feedback. Then, we label the lines that receive an
inline comment as true, otherwise false. To identify lines that
will be revised, we start from the patchset that first received
reviewer feedback. Then, we label the lines that are revised in a
later patchset as true, otherwise false. Similarly, the file-level
dataset for each task relies on the line-level dataset, i.e., any
lines in a file that receive a comment or are revised are labelled
as true, otherwise false. In total, we produce 12 datasets (i.e.,
3 projects × 2 prediction tasks × 2 granularity levels).
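To make the labelling concrete, the following is a minimal sketch (in Python) of the line- and file-level labelling described above; the data structures and argument names are illustrative assumptions rather than the actual dataset schema.

```python
# Minimal sketch of the Step 1 labelling; the inputs below are hypothetical
# structures derived from the Gerrit review data described in Section IV-B.

def label_lines(added_lines, commented_line_nums, revised_line_nums):
    """Label each added line of the first-feedback patchset for both tasks.

    added_lines: dict {line_number: source_text} of added lines.
    commented_line_nums: set of line numbers that received inline comments.
    revised_line_nums: set of line numbers changed in a later patchset.
    """
    comment_labels, revise_labels = {}, {}
    for num in added_lines:
        comment_labels[num] = num in commented_line_nums   # task 1: will receive a comment
        revise_labels[num] = num in revised_line_nums      # task 2: will be revised
    return comment_labels, revise_labels

def label_file(comment_labels, revise_labels):
    # File-level labels: a file is positive if any of its lines is positive.
    return any(comment_labels.values()), any(revise_labels.values())
```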
(Step 2) Feature Extraction. Following the underlying
intuition of our approach, we represent each file using the
Bag-of-Words features. Unlike a prior study [60] that focuses
on every line in a file, we focus only on the changed (i.e.,
added) lines of the changed files. To do so, we perform a
code tokenization step to break each changed line in a file
into separate tokens and parse the added lines into a
sequence of tokens. As suggested by Rahman et al. [39],
removing non-alphanumeric characters ensures that
the analyzed code tokens will not be artificially repetitive.
Thus, we apply a set of regular expressions to remove non-
alphanumeric characters such as semi-colon (;) and equal
sign (=). We also replace the numeric literal with a special
token (i.e., <NUMBER>) to reduce the vocabulary size. Then,
we extract the frequency of code tokens for each file using
the CountVectorizer function of the Scikit-Learn Python
library [37]. We perform neither lowercasing, stemming, nor
lemmatization (i.e., techniques to reduce inflectional forms)
on our extracted tokens, since the programming languages of
our studied systems are case-sensitive; otherwise, the meaning
of code tokens may be discarded if stemming or lemmatization
is applied. The vocabulary size is a common concern
when using a Natural Language Processing (NLP) approach
for predictions. To reduce the dimensions of our Bag-of-Words
features, we remove commonly-appearing features (i.e.,
tokens that appear in more than 50% of the files) and rarely-
appearing features (i.e., tokens that appear only once in the
training dataset).
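As an illustration of this feature extraction step, the sketch below uses scikit-learn's CountVectorizer; the exact regular expressions, the tokenizer pattern, and the use of min_df=2 to approximate "tokens that appear only once" are assumptions for illustration, not the paper's exact implementation.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def preprocess_line(line):
    """Clean one added line; the regular expressions are illustrative."""
    line = re.sub(r"[;=(){}\[\],.:]", " ", line)          # drop punctuation-like symbols
    line = re.sub(r"\b\d+(\.\d+)?\b", "<NUMBER>", line)   # replace numeric literals
    return line

def extract_bow(train_files, test_files):
    """Build Bag-of-Words features over the added lines of each changed file.

    train_files/test_files: lists of strings; each string is the concatenated
    added lines of one changed file.
    """
    train_docs = [" ".join(preprocess_line(l) for l in f.splitlines()) for f in train_files]
    test_docs = [" ".join(preprocess_line(l) for l in f.splitlines()) for f in test_files]
    # No lowercasing/stemming; drop tokens in >50% of files (max_df) and
    # tokens seen in only one training document (min_df), an approximation
    # of "appearing only once in the training dataset".
    vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+",
                                 max_df=0.5, min_df=2)
    x_train = vectorizer.fit_transform(train_docs)
    x_test = vectorizer.transform(test_docs)
    return x_train, x_test, vectorizer
```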
(Step 3) Class Rebalancing. Prior studies raised concerns
that ML models trained on datasets with class imbalance
are often not accurate [2, 47]. Thus, Tantithamthavorn et
al. [46, 47] and Agrawal et al. [2] suggested applying SMOTE
to mitigate class imbalance issues. Since our datasets are
highly imbalanced (i.e., 18%-30% of changed files in a patch
will receive inline comments and 23%-32% of changed files
in a patch will be revised), we employ the Synthetic Minority
Oversampling TEchnique (SMOTE) provided by the
Imbalanced-Learn Python library [25] (with k nearest
neighbors of 10).
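A minimal sketch of this rebalancing step with the imbalanced-learn library is shown below; the random seed is an illustrative assumption.

```python
from imblearn.over_sampling import SMOTE

def rebalance(x_train, y_train, k=10, seed=0):
    """Oversample the minority class in the training data only.

    k=10 mirrors the k-nearest-neighbours setting reported in the paper;
    the fixed random seed is an assumption of this sketch.
    """
    smote = SMOTE(k_neighbors=k, random_state=seed)
    x_bal, y_bal = smote.fit_resample(x_train, y_train)
    return x_bal, y_bal
```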
(Step 4) File-Level Model Training. We build a file-
level prediction model using Bag-of-Words features from
Step 2. Many studies [1, 2, 16, 47, 48] demonstrated that
the performance of prediction models varies among different
classification techniques. Thus, we conduct an experiment on
different classification techniques. We consider the following
six well-known classification techniques [1, 2, 47, 48], i.e.,
Random Forest (RF), Decision Tree (DT), Logistic Regression
(LR), Naive Bayes (NB), k-Nearest Neighbours (KNN), and
Random Guessing. We use implementations of these tech-
niques provided by the SCIKIT-LEARN library.
(Step 5) File-Level Model Selection. Since our code review
datasets are time-ordered, we do not perform cross-validation, to
avoid the use of testing data in the training data [21, 38, 49].
To avoid any temporal evaluation bias, the dataset of each
project is sorted in chronological order and split into
two sets: training (80%) and testing (20%). This will ensure
that new patches in the testing set will not be used for training,
and the past patches in the training set will not be used for
testing. For each project, we trained our file-level prediction
models using the training set, while the testing set was used to
evaluate our file-level prediction models. In total, we build 36
file-level prediction models (i.e., 6 classification techniques ×
3 projects ×2 prediction tasks).
Then, we evaluate different machine learning models us-
ing an AUC evaluation measure, since AUC is a threshold-
independent measure and is not sensitive to class imbal-
ance [46, 47]. AUC is an Area Under the ROC Curve (i.e., the
true positive rate and the false positive rate). AUC values range
from 0 to 1, where a value of 1 indicates perfect discrimination
and a value of 0.5 indicates random guessing.
Figure 3 shows that Random Forest achieves an AUC of
0.78 (Openstack Nova), 0.78 (Openstack Ironic), and 0.70
(Qt Base), which is 10%-59% more accurate than the other
classification techniques for predicting files to receive inline
comments. Similarly, our Random Forest achieves an AUC
of 0.77 (Openstack Nova), 0.68 (Openstack Ironic), and 0.69
(Qt Base), which is 5%-57% more accurate than the other
classification techniques for predicting files to be revised.
Thus, we select Random Forest as our file-level model.
Fig. 3: The AUC of the file-level predictions of different classifiers: Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), Naive Bayes (NB), k-Nearest Neighbors (KNN), and Random Guessing (Random).
(Step 6) Line-Level Predictions. For each file that is
correctly predicted, we apply a Local Interpretable Model-
agnostic Explanations (LIME) technique to compute the im-
portance score of the token features (i.e., which tokens con-
tribute most to a given prediction). LIME is a model-agnostic
technique that aims to explain the individual prediction of the
file-level prediction model. Given a test instance m (i.e., a file)
in the testing dataset and a file-level prediction model, LIME
performs the following steps:
LIME performs a random perturbation to generate n
synthetic instances surrounding the test instance m.
LIME generates predictions of the synthetic instances
from the file-level prediction model.
LIME builds a local interpretable model (i.e., K-Lasso)
based on the synthetic instances and predictions from
the file-level model. The coefficients of the local model
indicate the importance scores of the token features for
the individual prediction of the test instance m.
The LIME score is in the range of -1 to 1. A positive LIME
score contributes to the positive probability of the prediction
(i.e., supporting scores). A negative LIME score contributes to
the negative probability of the prediction (i.e., contradicting
scores). To reduce false positives, we select only the top-10
important tokens for each file based on the descending order
of the token importance scores. Then, we compute the number
of important tokens that appear in each line. Thus, any line
that has at least one of the top-10 important tokens will be
predicted as problematic, otherwise non-problematic. A line
that includes more top-10 important tokens has a higher rank
than a line that includes fewer important tokens.
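The sketch below illustrates how this line-level ranking could be implemented with the lime package; the helper name, the feature-name handling, and the simple token-membership check are assumptions for illustration.

```python
from lime.lime_tabular import LimeTabularExplainer

def rank_lines(model, x_train_dense, feature_names, file_vector, file_lines, top_k=10):
    """Rank the added lines of a correctly predicted file by the number of
    top-k important LIME tokens they contain (a sketch of Step 6)."""
    explainer = LimeTabularExplainer(
        x_train_dense, feature_names=feature_names,
        class_names=["clean", "problematic"], discretize_continuous=False)
    explanation = explainer.explain_instance(
        file_vector, model.predict_proba, num_features=top_k)
    # Keep only tokens with positive (supporting) importance scores.
    top_tokens = {feature_names[i] for i, score in explanation.as_map()[1] if score > 0}
    # Lines containing more important tokens are ranked higher.
    return sorted(file_lines,
                  key=lambda line: sum(tok in line.split() for tok in top_tokens),
                  reverse=True)
```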
TABLE I: Overview statistics of the studied projects.
Project | Period | #Total Patches | % of Patches Receiving Comments | #Studied Patches | Received Inline Comments (% Patches / % Changed Files / % Changed Lines) | Revised (% Patches / % Changed Files / % Changed Lines)
OpenstackNova | Feb 2018-Mar 2021 | 6,117 | 67% | 3,678 | 48% / 30% / 9% | 42% / 32% / 34%
OpenstackIronic | Jan 2018-Mar 2021 | 2,148 | 52% | 1,050 | 41% / 20% / 8% | 38% / 27% / 30%
QtBase | Mar 2019-Mar 2021 | 11,655 | 50% | 5,537 | 29% / 18% / 12% | 31% / 23% / 30%
Eclipse CDT | Feb 2018-Feb 2021 | 1,493 | 26% | - | - | -
Eclipse Platform.UI | Jan 2018-Mar 2021 | 3,214 | 18% | - | - | -
IV. CASE STUDY DESIGN
In this section, we describe our studied projects and data
collection.
A. Studied Projects
In order to address our research questions, we aim to
conduct an empirical study on large and popular open-source
projects that are continuously and actively maintained by
globally distributed development teams or communities. In
selecting the project of interest, we identified two important
criteria that need to be satisfied:
Criterion 1: Actively use code reviews. Since the focus
of our paper is in the context of code review, the studied
projects should have invested considerable efforts on
code review. Therefore, we opt to study the projects that
actively use code reviews, i.e., a large number of patches
were submitted to the code review tool.
Criterion 2: Actively provide feedback. Since code
review feedback plays a vital role in the effectiveness of
code review, we focus on the projects that actively discuss
and provide feedback via a code review tool. Therefore,
we opt to study the projects that have a relatively high
ratio (i.e., above 50%) of patches that receive feedback
(either inline or general comments) from reviewers.
To select projects for our study, we started with five open-
source projects, i.e., Openstack Nova, Openstack Ironic, Qt
Base, Eclipse CDT and Eclipse PlatformUI. These projects
satisfy Criterion 1 since they actively used code review via
the Gerrit code review tool. In addition, these projects are
commonly used in prior code review studies [31–33, 50, 52–
54]. Note that we focused on the particular projects instead
of the whole ecosystem because the code review practices,
the source code, and programming language may vary across
different projects in an ecosystem. This also ensures that
the training dataset is highly curated and meaningful for
building accurate recommendations.
For these five projects, we downloaded the code review
history of the patches that were submitted in the past three
years. Then, we analyzed the percentage of patches that receive
feedback from reviewers. Based on Criterion 2, two projects,
i.e., Eclipse CDT and Eclipse PlatformUI were discarded since
they have a relatively low ratio of patches that received feed-
back (i.e., 26% for CDT, and 18% for PlatformUI). Therefore,
in this work, we conducted a study based on three open source
projects, i.e., Openstack Nova, Openstack Ironic, and Qt Base.
Table I describes an overview statistic of our studied projects.
B. Data Collection
To conduct an exploratory study and train REV SPOT, we
collected the historical data of code reviews which include the
metadata of a patch (e.g., creation date, patch author’s name,
review status), messages posted in the discussion thread, inline
comments, and revision history. For each version of a patch,
we also collected the source code files that were impacted
by the patch. To collect the historical data and related source
code files, we used the APIs provided by Gerrit (i.e., the code
review tool used by the studied projects). We performed the
following steps for data selection.
Excluding patches with uncompleted reviews. We
excluded the patches with uncompleted reviews (i.e.,
marked as open) since the reviews may still be ongoing
and it is likely that more comments and revisions may
be provided after the time when we collected the data.
Hence, we focus only on the patches that had been marked
as approved or abandoned.
Selecting the patch version that first received reviewer
feedback. It is possible that a patch author may upload
multiple versions of a patch before reviewers reviewed
and provided feedback (i.e., messages in the discussion
thread or inline comments). These early versions may
not be ready for a review yet (e.g., work-in-progress).
Hence, to learn the source code of a patch that reviewers
reviewed and provided feedback, we selected the patch
version that first received reviewer feedback (either mes-
sages in the discussion thread or inline comments that
were posted by a developer other than the patch author
or CI bots) instead of the first patch version that was
submitted to the code review tool.
Excluding non-source code files. A software
project involves various types of files (e.g., source code files,
configuration files). Therefore, we selected only the source code
files that are related to the main programming language
of the project (i.e., Python for Openstack, and C++ for
Qt).
After applying the aforementioned inclusion/exclusion cri-
teria to the patches of each project, our dataset consists of
3,678 patches for Nova, 1,050 patches for Ironic, and 5,537
patches for Qt Base. To the best of our knowledge, this is
the first and most recent dataset prepared for a new line-level
recommendation problem in code review.
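For reference, the sketch below shows one way the review data described above could be retrieved from a Gerrit server via its REST API; the query string, the option flags, and the omission of pagination and authentication are simplifying assumptions of this sketch.

```python
import json
import requests

GERRIT_URL = "https://review.opendev.org"  # e.g., the OpenStack Gerrit instance

def fetch_closed_changes(project, limit=100):
    """Fetch closed (merged or abandoned) changes of a project, including
    their revisions and messages, via Gerrit's /changes/ endpoint."""
    params = {
        "q": f"project:{project} (status:merged OR status:abandoned)",
        "o": ["ALL_REVISIONS", "MESSAGES", "DETAILED_ACCOUNTS"],
        "n": limit,
    }
    resp = requests.get(f"{GERRIT_URL}/changes/", params=params, timeout=60)
    # Gerrit prefixes JSON responses with a magic ")]}'" line.
    return json.loads(resp.text.split("\n", 1)[1])

def fetch_inline_comments(change_id, revision_id):
    """Fetch the inline comments posted on one revision of a change."""
    url = f"{GERRIT_URL}/changes/{change_id}/revisions/{revision_id}/comments"
    resp = requests.get(url, timeout=60)
    return json.loads(resp.text.split("\n", 1)[1])
```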
V. EXPERIMENTAL RESULTS
In this section, we present the motivation, approach, and
results with respect to our three research questions.
(RQ1) How long do the patch authors wait to receive initial
feedback from reviewers?
Motivation. A prior study found that receiving timely feedback
is the top challenge of patch authors [28]. Moreover, it
also found that reviewing large patches is challenging. We
hypothesize that the waiting time to receive feedback may
be correlated with the patch size. Hence, to gain better
intuition, we set out this RQ to analyze the waiting time to
receive the initial feedback of each patch and examine the
association between the waiting time and the patch size.
Approach. To answer RQ1, we first analyzed the descriptive
statistics of the waiting time to receive the initial feedback
(called waiting time, henceforth). To measure the waiting
time, we computed the time interval between the time when
the patch was uploaded (T1) and the time when a reviewer
provided the first feedback, i.e., either inline or general
comments (T2). In addition, we analyzed how much the
waiting time accounted for the total reviewing time. Hence, we
measured the proportion of waiting time as
$\frac{T_2 - T_1}{T_3 - T_1} \times 100\%$, where $T_3$ is the time when the review of the
patch was completed (i.e., its status was changed to merged
or abandoned).
To analyze the association between the waiting time and
patch size, we determined whether the waiting time of large
patches is significantly longer than the waiting time of small
patches. In this work, patches with a number of added lines
greater than the median are considered large, and the rest are
considered small. Then, we used the Wilcoxon
Rank Sum test (for unequal sample size) to confirm the
statistical difference of waiting time between large and small
patches.
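The sketch below summarizes this analysis; the patch record fields (t1, t2, t3, added_lines) are illustrative assumptions, and SciPy's Mann-Whitney U test is used as the Wilcoxon rank-sum test for unequal sample sizes.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def rq1_analysis(patches):
    """Sketch of the RQ1 analysis. `patches` is assumed to be a list of dicts
    with upload time t1, first-feedback time t2, and close time t3 (in hours),
    plus the number of added lines."""
    waiting = np.array([p["t2"] - p["t1"] for p in patches])
    proportion = np.array([(p["t2"] - p["t1"]) / (p["t3"] - p["t1"]) * 100
                           for p in patches])
    added = np.array([p["added_lines"] for p in patches])
    large = waiting[added > np.median(added)]
    small = waiting[added <= np.median(added)]
    # One-sided Wilcoxon rank-sum (Mann-Whitney U) test: large > small.
    _, p_value = mannwhitneyu(large, small, alternative="greater")
    return np.median(waiting), np.median(proportion), p_value
```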
Result. Figure 4 shows that at the median, the waiting time
of a patch is 15-64 hours, which accounts for 15%-26%
of the total reviewing time. We observed that 40%-60% of
the patches have a waiting time longer than one day (i.e.,
24 hours), which exceeds the maximum code review time
suggested by Google.² Figure 4 shows the distributions of
the waiting time with respect to large and small patches. We
also observed that the Qt Base project has a relatively shorter
waiting time than the OpenStack projects. This could be due
to the relatively smaller patch size (i.e., a median of 31 added
lines for OpenStack Nova, a median of 43 added lines for
OpenStack Ironic, and a median of 11 added lines for Qt Base).
Moreover, at the median, the waiting time of large patches
is 21-141 hours, while the waiting time of small patches is
9-34 hours. The Wilcoxon Rank Sum test also confirmed that
the waiting time of large patches is statistically greater than
the waiting time of small patches for all three studied projects
(p-value <0.05), indicating that large patches tend to receive
initial feedback slower than small patches.
² https://google.github.io/eng-practices/review/reviewer/looking-for.html
Fig. 4: The waiting time to receive the first feedback, its proportion of the total reviewing time, and the waiting time of large patches versus small patches.
Our RQ1 results may imply that reviewers may need more
time to review large patches as prior study showed that
reviewing a large patch is challenging [28]. Consequently,
patch authors will have a long wait for reviewer feedback,
which could be challenging for them to revise the patch
without introducing new defects [13, 23]. Hence, an approach
to pinpoint lines that reviewers should pay attention to could
help reviewers reduce their effort and help patch authors
receive timely feedback.
(RQ2) How accurate is REVSPOT in predicting problematic
lines that reviewers should pay attention to?
Motivation. Our RQ1 results showed that a large proportion
of patches have a waiting time longer than one day. We also
found that large patches tend to receive initial feedback slower
than small patches. Hence, it would be beneficial for reviewers
to reduce their reviewing effort by scoping down from the whole
patch to a smaller area of code that they should
pay attention to. To this end, we developed REVSPOT to predict
the lines that will receive comments and the lines that will be
revised. Thus, we formulate this RQ to evaluate the accuracy
of REVSPOT against the baseline.
Approach. To answer RQ2, we evaluate how accurately
REVSPOT can predict the lines that will receive comments and
the lines that will be revised. Hence, we used the following
four measures to evaluate REVSPOT (a small computation
sketch follows this list):
Top-10 Accuracy measures the percentage of files where
at least one of the top-10 lines ranked by REVSPOT is one
of the actual lines that receive comments (or the actual
lines that are revised). A high value of Top-10 Accuracy
indicates that REVSPOT can correctly recommend many
files in the patches that reviewers should pay attention to.
Fig. 5: (RQ2) The accuracy of the line-level predictions of REVSPOT compared with the baseline approach (N-gram): (a) predicting lines that will receive comments; (b) predicting lines that will be revised.
d2h (called distance-to-heaven) measures the root mean
square of the recall and false alarm rate (FAR) values,
computed as $\sqrt{\frac{(1-\mathrm{Recall})^2 + (0-\mathrm{FAR})^2}{2}}$. Since
recall and false alarm rate are trade-off measures, d2h
evaluates the performance based on both aspects [1]. A
d2h value of 0 indicates that REVSPOT achieves a perfect
prediction, i.e., achieving a recall of 1 and a false alarm
rate of 0. A high d2h value indicates that the performance
of an approach is far from perfect.
Recall measures the proportion between the number of
predicted lines that will receive comments or will be
revised and the number of actual lines that will receive
comments or will be revised.
False alarm rate (FAR) measures the proportion of clean
lines (i.e., lines that will not receive comments and will not
be revised) that are incorrectly predicted as problematic,
computed as $\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}$.
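As referenced above, the sketch below shows how these per-file measures could be computed; the input format is an illustrative assumption.

```python
import numpy as np

def line_level_metrics(y_true, y_pred, ranked_lines, actual_lines, k=10):
    """Per-file evaluation measures used in RQ2 (a sketch).

    y_true/y_pred: binary labels/predictions for each line of a file.
    ranked_lines/actual_lines: ranked predicted line numbers and the set of
    actual problematic line numbers, used for Top-k Accuracy.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    d2h = np.sqrt(((1 - recall) ** 2 + (0 - far) ** 2) / 2)
    top_k_hit = len(set(ranked_lines[:k]) & set(actual_lines)) > 0
    return recall, far, d2h, top_k_hit
```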
In this work, we use n-gram as a baseline approach. N-gram
is a widely-known statistical NLP approach that measures the
entropy (i.e., the perplexity of words) as a proxy for the naturalness
of a word while considering the probabilities of the preceding
words. Prior works have used n-gram models to identify buggy
lines in files [40] and commits [61]. Prior study also shows
that the entropy estimated by n-gram is associated with the
review decision [19]. In this work, we use an implementation
of Hellendoorn and Devanbu [18] to build n-gram models (n=
6). Similar to prior works [38, 40, 60, 61], we use lines that
did not receive a comment (or were not revised) in the training
dataset to build the n-gram model. We do not use the static
analysis tools as a baseline, since prior work [60] has already
shown that n-gram approaches outperform static analysis tools.
To statistically compare the accuracy (i.e., d2h, recall, and
FAR) of REVSPOT and the baseline, we used a one-sided
Wilcoxon signed-rank test, which performs a paired compari-
son of the performance measure for each file in each patch. In
addition, we measure the effect size (r), i.e., the magnitude of
the difference between the performance of REVSPOT and the
baseline, computed as $r = \frac{Z}{\sqrt{n}}$, where $Z$ is the Z-statistic
from the Wilcoxon signed-rank test and $n$ is the total
number of samples [56]. An effect size of $r > 0.5$ is considered
large, $0.3 < r \leq 0.5$ medium, and $0.1 < r \leq 0.3$ small,
otherwise negligible [15]. We did not use the commonly-used
Cohen's $D$ [12] and Cliff's $|\delta|$ [27] to measure the effect size
because both methods are based on distributions, not pair-wise
comparisons.
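The sketch below shows one way to perform this paired comparison with SciPy; recovering the Z-score from the one-sided p-value via the normal approximation is an assumption of this sketch.

```python
import numpy as np
from scipy.stats import norm, wilcoxon

def paired_comparison(revspot_scores, baseline_scores):
    """One-sided Wilcoxon signed-rank test (REVSPOT < baseline, for measures
    such as d2h and FAR where lower is better) and the effect size r = Z / sqrt(n)."""
    a = np.asarray(revspot_scores)
    b = np.asarray(baseline_scores)
    _, p_value = wilcoxon(a, b, alternative="less")
    z = norm.isf(p_value)          # Z-score corresponding to the one-sided p-value
    r = abs(z) / np.sqrt(len(a))   # effect size
    return p_value, r
```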
Results. Figure 5 presents the accuracy of the REVSPOT
approach and the baseline approach (i.e., the n-gram model).
For predicting the lines that will receive comments,
REVSPOT achieves a Top-10 accuracy of 81% and a d2h
of 0.55, which is 56% and 15% better than the n-gram
approach. Figure 5a shows that REVSPOT achieves a Top-10
accuracy of 81%. In addition, Figure 5a shows the n-gram
model achieves a Top-10 accuracy of 52%, which is 56%
lower than REVSPOT. These results indicate that, for more files
than the n-gram model, REVSPOT can recommend ten lines of
which at least one will actually receive a comment.
Figure 5a also shows that, at the median, REVSPOT achieves
a d2h of 0.55, while the n-gram model achieves a d2h of
0.65, suggesting that the accuracy of REVSPOT, considering
both recall and false alarm rate, is better than that of the n-gram
model. Importantly, REVSPOT achieves a median FAR of 0.42,
while the n-gram approach achieves a median FAR of 0.89,
highlighting that our approach achieves a 53% lower false
alarm rate than the n-gram approach, which is desirable by
software practitioners [1]. Since this work aims to pinpoint the
area of code that reviewers should pay attention to, a lower
false alarm rate is preferable. Finally, the one-sided Wilcoxon
signed-rank tests confirm that the d2h value and the false
alarm rate of REVSPOT are significantly lower than those of
the n-gram model (p-value < 0.001), with a medium effect size
(r = 0.32) for d2h and a large effect size (r = 0.85) for the false
alarm rate.
TABLE II: The types of defects that are related to the problematic lines that are correctly (Hit) and incorrectly (Miss) predicted by our REVSPOT approach.
Defect Type | Description | Receive Comments (Hit / Miss) | Revised (Hit / Miss)
Functionality defects: defects that impact the functionality of the system.
- Logic Defects | Defects on the correctness or existence of system logic | 39% / 26% | 35% / 22%
- Interface Defects | Mistakes in interacting with other parts of the system | 4% / 4% | 5% / 9%
Evolvability defects: defects that affect future development efforts.
- Structure Defects | Defects related to code organization | 16% / 25% | 33% / 39%
- Documentation Defects | Defects related to textual information in source code | 22% / 22% | 17% / 14%
- Visual Defects | Defects that hinder program readability | 19% / 23% | 10% / 16%
Fig. 6: Examples of the hit problematic lines that are related to logic defects: (a) the line that received a comment in QtBase-335298; (b) the line that was revised in OpenStackNova-773976.
For predicting the lines that will be revised, REVSPOT
achieves a Top-10 accuracy of 93% and a d2h of 0.55, which
is 15% and 15% better than the n-gram approach. Similar
to predicting lines that will receive comments, Figure 5b shows
that REVSPOT achieves a Top-10 accuracy of 93%, which is
15% better than the n-gram model. Figure 5b also shows that
our REVSPOT achieves a median d2h of 0.55 and a median
false alarm rate of 0.41, which are significantly lower than
those of the n-gram model (p-value < 0.001), with a small effect
size (r = 0.24) for d2h and a large effect size (r = 0.78) for the
false alarm rate. These results indicate that REVSPOT outperforms
the n-gram model in predicting the lines that will be revised.
The lower accuracy of the n-gram models may suggest
that the entropy value does not have a strong association with
the risk of a line being problematic. In other words, unnatural
lines of code may not indicate that such lines will
receive comments or will be revised. On the other hand,
our REVSPOT is based on the assumption that tokens that
frequently appeared in problematic lines in the past may
also appear in problematic lines in the future. The results
also confirm that our REVSPOT, which leverages a machine
learning technique to learn from problematic lines in the past,
can predict problematic lines in the future more accurately
than the n-gram approach.
(RQ3) What kinds of defects are in the problematic lines
that are correctly predicted and incorrectly predicted by our
REVSPOT?
Motivation. The results of RQ2 confirm that our REVSPOT
can accurately predict the lines that will receive comments
and the lines that will be revised. Since various defects can
be associated with the problematic lines [6, 30, 52], we set
out this RQ to qualitatively assess REVSPOT in terms of the
defect types that it can correctly and incorrectly predict.
Fig. 7: Examples of the missed problematic lines that are related to logic defects: (a) the line that received a comment in QtBase-321363; (b) the line that was revised in QtBase-323089.
Approach. To answer RQ3, we manually examined the prob-
lematic lines that are correctly predicted (i.e., Hit) and in-
correctly predicted (i.e., Miss) by REVSPOT in the testing
data. From the patches of the three studied projects, there
were 693 lines that actually received comments and 7,975
lines that were actually revised. Since the total number of
actual problematic lines is too large to manually
examine in its entirety, we randomly selected a statistically
representative sample with a confidence level of 95% and a
confidence interval of 5%.³ Therefore, we sampled 165 hit
lines and 197 missed lines that actually received comments,
and 337 hit lines and 358 missed lines that were actually
revised. Then, we manually identified the related type of defect
based on the taxonomy of code review defects proposed by
Mäntylä et al. [30]. To better understand the context, we
also observed the code around the problematic line when
identifying the defect type. The classification was performed by
the first author, and validated by other authors of this paper.
Results. Table II presents the types of defects that are related
to the problematic lines that are correctly (Hit) and incorrectly
(Miss) predicted by our REVSPOT approach.
The majority of problematic lines that REVSPOT can
correctly predict are related to logic defects. For the lines
that are correctly predicted to receive comments, we found
that 39% are related to logic defects, 22% of them are related
to documentation defects, 19% of them are related to visual
defects, 16% of them are related to structure defects, and
4% of them are related to interface defects. Table II also
shows a similar proportion of defects for the lines that are
correctly predicted to be revised. Figure 6 shows an example
of problematic lines that are correctly predicted by REVSPOT
³ https://www.surveysystem.com/sscalc.htm
and that are related to logic defects in QtBase-335298⁴ and
OpenStackNova-773976.⁵ Note that in Figure 6a, the yellow
area indicates an inline comment from a reviewer and in
Figure 6b, the red area indicates the code in the original
version of the patch, while the green area shows the code in
the revised version. This finding highlights that our REVSPOT
can correctly predict problematic lines that could impact
the functionality of the system and affect future software
development efforts.
Moreover, we observed that the proportion of function-
ality defects for the hit problematic lines is higher than that for
the missed problematic lines. In particular, 43% (39% + 4%) of the
hit problematic lines are related to functionality defects, while
30% (26% + 4%) of the missed problematic lines are related
to functionality defects. On the other hand, the proportion
of evolvability defects for the hit problematic lines is lower
than that for the missed problematic lines. This result suggests that our
REVSPOT is more likely to hit the functionality defects than
the evolvability defects.
On the other hand, Table II shows that 26% and 22% of
the missed problematic lines are also related to logic defects.
Our REVSPOT may have missed these problematic lines in part
due to rare tokens in the source code. Figure 7 shows examples
of problematic lines that REVSPOT misses in QtBase-321363⁶
and QtBase-323089.⁷ We observed that the missed problematic
lines mainly contain compound words (e.g., qHashMulti).
It is possible that these compound words rarely appear.
Consequently, the historical data of these compound words
is not sufficient to train REVSPOT. Note that although the
token d is not a compound word, it was discarded as it is
commonly-appearing (i.e., appearing in more than 50% of the
files). In addition, we observed that the tokens that a reviewer
had paid attention to can sometimes be non-alphanumeric
characters. For example, Figure 7a shows that a reviewer
highlighted the quote characters ("") and provided a comment
specifically on this token. However, during our feature
extraction, we removed non-alphanumeric characters since they
commonly appear across many files. These observations suggest
that REVSPOT may miss problematic lines in special cases, e.g.,
lines mainly containing compound words or special characters
that reviewers had paid attention to.
VI. RELATED WORK
In this section, we discuss related work with respect to the
importance of code review comments on software quality and
recommendation approaches to support code review.
⁴ https://codereview.qt-project.org/c/qt/qtbase/+/335298/1/src/corelib/io/qprocess_unix.cpp
⁵ https://review.opendev.org/c/openstack/nova/+/773976/2..12/nova/scheduler/utils.py#b1343
⁶ https://codereview.qt-project.org/c/qt/qtbase/+/321363/5/src/corelib/text/qregularexpression.cpp#1768
⁷ https://codereview.qt-project.org/c/qt/qtbase/+/323089/4..16/src/corelib/text/qregularexpression.cpp#b2162
A. The Importance of Review Comments on Software Quality
Prior studies demonstrated that code review comments play
a significant role in the code review process (e.g., the quality
of patches is improved based on the comments provided by
reviewers) [3, 23]. For example, Kononenko et al. [23] found
that the quality of code reviews heavily relies on code review
comments. Bacchelli and Bird [3] found that review comments
are associated with code improvement. However, providing
high-quality reviews often requires a substantial amount of effort
from reviewers. Thus, Kononenko et al. [23] pointed out
that receiving code review feedback could be slow (i.e.,
not responsive) due to the required amount of effort from
reviewers. This is also consistent with the finding of our
RQ1 that large patches tend to receive initial feedback from
reviewers more slowly than small patches. Importantly, Liang
et al. [26] discovered that larger patches tend to receive fewer
comments than smaller patches. Nevertheless, while code
review comments require a large amount of effort from
reviewers, many studies found that they are still very useful
for finding software defects [11, 30] or identifying impactful
design changes [59] in the source code.
B. Recommendation Approaches to Support Code Review
There exist many recommendation approaches to support
code review activities (e.g., review task prioritization, re-
viewer recommendation, automated code transformations). For
example, prior studies leveraged various machine learning
techniques to recommend which code changes should be
reviewed first based on the defect-proneness [22, 31] and
the characteristics of code changes [14, 29, 54]. Studies also
proposed various approaches to recommend who should be an
appropriate reviewer for a patch [17, 51, 55, 62, 63]. Tufano et
al. [57, 58] proposed an approach to automatically recommend
a revised version of a given patch. While these approaches
can alleviate developers’ effort during code review, reviewers
still need to manually identify which parts of a given patch are
potentially problematic and deserve their attention. Yet,
there exists only one study by Hellendoorn et al. [20] that is
most relevant to our work.
Hellendoorn et al. [20] proposed an approach to predict the
location of code review comments. While the concept of
Hellendoorn et al.’s approach is similar to our REVSPOT, it
differs from ours in several aspects. First, Hellendoorn et
al. [20] only focused on which part of code will receive a
comment, while our REVSPOT approach focuses on a broader
scope (i.e., lines that will receive comments and lines that
will be revised). Second, the granularity of predictions of
Hellendoorn et al. [20] focuses on the hunk level, which is
still coarse-grained (i.e., reviewers still waste effort to identify
which lines in a diff hunk they should pay attention to).
On the other hand, our REVSPOT approach tackles this chal-
lenge by automatically pinpointing to the lines that reviewers
should pay attention to, which is more fine-grained as desired
by many practitioners and researchers [36, 38, 40, 60, 61].
Importantly, our RQ2 shows that our approach can accu-
rately predict problematic lines, which outperforms the n-
gram approach. In addition, our RQ3 also found that most
of the problematic lines that are correctly predicted by our
approach are related to functionality defects which could
impact the functionality of the system, highlighting the key
novelty and significance of our REVSPOT approach. Based
on these findings, our REVSPOT approach could potentially
be beneficial to reviewers to reduce their effort in identifying
problematic lines and could help code authors receive initial
feedback faster from reviewers.
VII. THREATS TO VALIDITY
In this section, we discuss the potential threats to the validity
of our study.
Construct Validity. Prior works raised concerns that
the time-consuming data collection process may impact the
ground-truth data of our code review dataset [9, 38, 45]. For
example, some patches may receive initial feedback after our
data collection process; such patches would otherwise be
mislabeled as not receiving comments. To mitigate this threat,
we selected only the patches whose reviews were finished (i.e.,
marked as approved or abandoned). In addition, some lines
may receive feedback after our data collection process. Thus,
some lines that should have been labeled as problematic were
not labeled as such, producing some false negatives. Nevertheless,
additional approaches that improve the quality of the datasets
(i.e., recovering the missing problematic lines) may further
improve the performance of our approach.
Internal Validity. Prior works found that there exist differ-
ent characteristics of code review comments (e.g., suggestion
comments or enquiry comments) [8, 35]. To mitigate this
threat, we conducted a manual analysis to investigate the char-
acteristics of the inline comments in our studied datasets: after
the data collection process, we randomly selected a
statistically representative sample with a confidence level of
95% and a confidence interval of 5%.
Of the 668 selected samples, we found that 91% of samples
are related to suggestion comments (i.e., comments related
to suggestions for code improvement), while as little as 9%
are related to enquiry comments (e.g., ’What is this for?’).
Since these inline comments were written by humans, we do
not apply any filtering or selection to the inline comments to
ensure that the results are not biased to one type or another.
In addition, reviewers may provide comments via other
channels (e.g., messaging tools or face-to-face). Thus, our
line-level ground truths may not be complete (e.g., missing
comments for a particular line). Nevertheless, according to a
survey study, Bosu et al. [7] argued that developers usually
provide comments via a code review tool. Thus, missing
inline comments may have a minimal impact on our prediction
accuracy.
In our RQ1, other factors (e.g., review workload, the number
of modified files) may also be associated with the waiting
time to receive the first feedback. However, Thongtanunam et
al. [54] found that there is a weak relationship between such
other factors and waiting time.
External Validity. We evaluated our REVSPOT approach
using the three Gerrit-based open-source projects. Thus, our
results may not generalize to other projects or code review
platforms. Nevertheless, we provide the implementation source
code and datasets to promote the replication of our study and
foster the adoption of our approach.
VIII. CONCLUSION
In this paper, we first investigated the association between
the waiting time to receive initial feedback and its patch size.
We found that at the median, patch authors waited 15-64 hours
to receive initial feedback from reviewers, which accounts
for 16%-26% of the whole review time of a patch. We also
found that larger patches tend to receive initial feedback from
reviewers more slowly than smaller patches, highlighting the need
for an approach to pinpoint the lines that reviewers should pay
attention to, which could help developers reduce the
reviewing effort as well as the waiting time to receive feedback.
Thus, we developed REVSPOT—a line-level recommenda-
tion approach to predict lines that will receive a comment and
lines that will be revised. The results show that our REVSPOT
can accurately predict lines that will receive comments and
will be revised (with a Top-10 Accuracy of 81% and 93%,
which is 56% and 15% better than the baseline approach),
and these correctly predicted problematic lines are related
to logic defects, which could impact the functionality of the
system. Based on these findings, our REVSPOT approach could
potentially be beneficial to reviewers to reduce their effort
in identifying problematic lines to speed up the code review
process and improve reviewers’ productivity.
ACKNOWLEDGEMENT
Chakkrit Tantithamthavorn was supported by the Aus-
tralian Research Council’s Discovery Early Career Researcher
Award (DECRA) funding scheme (DE200100941). Patanamon
Thongtanunam was supported by the Australian Research
Council’s Discovery Early Career Researcher Award (DE-
CRA) funding scheme (DE210101091).
REFERENCES
[1] A. Agrawal, W. Fu, D. Chen, X. Shen, and T. Menzies, “How to
‘dodge’ complex software analytics,” IEEE Transactions on Software
Engineering, 2019.
[2] A. Agrawal and T. Menzies, “Is ‘better data’ better than ‘better data
miners’?” in Proceedings of the IEEE/ACM International Conference
on Software Engineering (ICSE), 2018, pp. 1050–1061.
[3] A. Bacchelli and C. Bird, “Expectations, outcomes, and challenges of
modern code review,” in Proceedings of the International Conference
on Software Engineering (ICSE), 2013, pp. 712–721.
[4] G. Bavota and B. Russo, “Four eyes are better than two: On the impact of
code reviews on software quality,” in Proceedings of IEEE International
Conference on Software Maintenance and Evolution (ICSME), 2015, pp.
81–90.
[5] O. Baysal, O. Kononenko, R. Holmes, and M. W. Godfrey, “Investigating
Technical and Non-Technical Factors Influencing Modern Code Review,”
Journal of Empirical Software Engineering (EMSE), vol. 21, no. 3, pp.
932–959, 2015.
[6] M. Beller, A. Bacchelli, A. Zaidman, and E. Juergens, “Modern code
reviews in open-source projects: Which problems do they fix?” in
Proceedings of the working conference on mining software repositories
(MSR), 2014, pp. 202–211.
[7] A. Bosu, J. C. Carver, C. Bird, J. Orbeck, and C. Chockley, “Process
Aspects and Social Dynamics of Contemporary Code Review: Insights
from Open Source Development and Industrial Practice at Microsoft,”
Transactions on Software Engineering (TSE), vol. 43, no. 1, pp. 56–75,
2017.
[8] A. Bosu, M. Greiler, and C. Bird, “Characteristics of useful code
reviews: An empirical study at microsoft,” in Proceedings of the
IEEE/ACM Working Conference on Mining Software Repositories
(MSR), 2015, pp. 146–156.
[9] G. G. Cabral, L. L. Minku, E. Shihab, and S. Mujahid, “Class imbal-
ance evolution and verification latency in just-in-time software defect
prediction,” in Proceedings of the International Conference on Software
Engineering (ICSE), 2019, pp. 666–676.
[10] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer,
“SMOTE: Synthetic Minority Over-sampling Technique,” Journal of
Artificial Intelligence Research, pp. 321–357, 2002.
[11] C. Y. Chong, P. Thongtanunam, and C. Tantithamthavorn, “Assessing the
students understanding and their mistakes in code review checklists–an
experience report of 1,791 code review checklists from 394 students,” in
Proceedings of the International Conference on Software Engineering:
Joint Software Engineering Education and Training track (ICSE-JSEET),
2021.
[12] J. Cohen, Statistical power analysis for the behavioral sciences. Rout-
ledge, 2013.
[13] J. Czerwonka, M. Greiler, and J. Tilford, “Code reviews do not find
bugs. how the current code review best practice slows us down,”
in Proceedings of the IEEE/ACM IEEE International Conference on
Software Engineering (ICSE), vol. 2, 2015, pp. 27–28.
[14] Y. Fan, X. Xia, D. Lo, and S. Li, “Early prediction of merged code
changes to prioritize reviewing tasks,Empirical Software Engineering,
vol. 23, no. 6, pp. 3346–3393, 2018.
[15] A. Field, Discovering statistics using IBM SPSS statistics. sage, 2013.
[16] B. Ghotra, S. McIntosh, and A. E. Hassan, “Revisiting the impact of
classification techniques on the performance of defect prediction mod-
els,” in Proceedings of the IEEE/ACM IEEE International Conference
on Software Engineering (ICSE), vol. 1, 2015, pp. 789–800.
[17] C. Hannebauer, M. Patalas, S. St ¨
unkel, and V. Gruhn, “Automatically
recommending code reviewers based on their expertise: An empirical
comparison,” in Proceedings of the IEEE/ACM International Conference
on Automated Software Engineering (ASE), 2016, pp. 99–110.
[18] V. J. Hellendoorn and P. Devanbu, “Are deep neural networks the best
choice for modeling source code?” in Proceedings of the Joint Meeting
on Foundations of Software Engineering (ESEC/FSE), 2017, pp. 763–
773.
[19] V. J. Hellendoorn, P. T. Devanbu, and A. Bacchelli, “Will they like this?
evaluating code contributions with language models,” in Proceedings of
the Working Conference on Mining Software Repositories (MSR), 2015,
pp. 157–167.
[20] V. J. Hellendoorn, J. Tsay, M. Mukherjee, and M. Hirzel, “Towards
automating code review at scale,” in Proceedings of the ACM Joint
Meeting on European Software Engineering Conference and Symposium
on the Foundations of Software Engineering (ESEC/FSE), 2021, pp.
1479–1482.
[21] M. Jimenez, R. Rwemalika, M. Papadakis, F. Sarro, Y. Le Traon, and
M. Harman, “The importance of accounting for real-world labelling
when predicting software vulnerabilities,” in Proceedings of the Joint
Meeting on European Software Engineering Conference and Symposium
on the Foundations of Software Engineering (ESEC/FSE), 2019, pp.
695–705.
[22] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha,
and N. Ubayashi, “A large-scale empirical study of just-in-time quality
assurance,” IEEE Transactions on Software Engineering, vol. 39, no. 6,
pp. 757–773, 2012.
[23] O. Kononenko, O. Baysal, and M. W. Godfrey, “Code review quality:
How developers see it,” in Proceedings of the international conference
on software engineering, 2016, pp. 1028–1038.
[24] O. Kononenko, O. Baysal, L. Guerrouj, Y. Cao, and M. W. Godfrey,
“Investigating code review quality: Do people and participation mat-
ter?” in Proceedings of the IEEE international conference on software
maintenance and evolution (ICSME), 2015, pp. 111–120.
[25] G. Lemaˆ
ıtre, F. Nogueira, and C. K. Aridas, “Imbalanced-learn: A
python toolbox to tackle the curse of imbalanced datasets in machine
learning,” Journal of Machine Learning Research, vol. 18, no. 17, pp.
1–5, 2017. [Online]. Available: http://jmlr.org/papers/v18/16-365.html
[26] J. Liang and O. Mizuno, “Analyzing involvements of reviewers through
mining a code review repository,” in Proceedings of the Joint Conference
of the 21st International Workshop on Software Measurement and
the 6th International Conference on Software Process and Product
Measurement, 2011, pp. 126–132.
[27] G. Macbeth, E. Razumiejczyk, and R. D. Ledesma, “Cliff’s Delta
Calculator: A Non-parametric Effect Size Program for Two Groups of
Observations,Universitas Psychologica, vol. 10, pp. 545–555, 2011.
[28] L. MacLeod, M. Greiler, M.-A. Storey, C. Bird, and J. Czerwonka,
“Code Reviewing in the Trenches,IEEE Software, vol. 35, pp. 34–42,
2018.
[29] C. Maddila, C. Bansal, and N. Nagappan, “Predicting pull request com-
pletion time: a case study on large scale cloud services,” in Proceedings
of the 2019 27th acm joint meeting on european software engineering
conference and symposium on the foundations of software engineering,
2019, pp. 874–882.
[30] M. V. M¨
antyl¨
a and C. Lassenius, “What types of defects are really dis-
covered in code reviews?” IEEE Transactions on Software Engineering,
vol. 35, no. 3, pp. 430–448, 2008.
[31] S. McIntosh and Y. Kamei, “Are fix-inducing changes a moving tar-
get? a longitudinal case study of just-in-time defect prediction,” IEEE
Transactions on Software Engineering (TSE), pp. 412–428, 2017.
[32] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan, “The impact of
code review coverage and code review participation on software quality:
A case study of the qt, vtk, and itk projects,” in Proceedings of the
Working Conference on Mining Software Repositories (MSR), 2014, pp.
192–201.
[33] ——, “An Empirical Study of the Impact of Modern Code Review Prac-
tices on Software Quality,Empirical Software Engineering (EMSE),
vol. 21, no. 5, pp. 2146–2189, 2016.
[34] R. Morales, S. McIntosh, and F. Khomh, “Do code review practices
impact design quality? a case study of the qt, vtk, and itk projects,”
in Proceedings of the international conference on software analysis,
evolution, and reengineering (SANER), 2015, pp. 171–180.
[35] T. Pangsakulyanont, P. Thongtanunam, D. Port, and H. Iida, “Assessing
mcr discussion usefulness using semantic similarity,” in Proceedings
of the International Workshop on Empirical Software Engineering in
Practice, 2014, pp. 49–54.
[36] L. Pascarella, F. Palomba, and A. Bacchelli, “Fine-Grained Just-In-Time
Defect Prediction,” Journal of Systems and Software (JSS), 2018.
[37] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-
plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
esnay, “Scikit-learn: Machine learning in Python,Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
[38] C. Pornprasit and C. Tantithamthavorn, “Jitline: A simpler, better,
faster, finer-grained just-in-time defect prediction,arXiv preprint
arXiv:2103.07068, 2021.
[39] M. Rahman, D. Palani, and P. C. Rigby, “Natural software revisited,”
in Proceedings of the IEEE/ACM International Conference on Software
Engineering (ICSE), 2019, pp. 37–48.
[40] B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and P. De-
vanbu, “On the” naturalness” of buggy code,” in Proceedings of the
IEEE/ACM International Conference on Software Engineering (ICSE),
2016, pp. 428–439.
[41] M. T. Ribeiro, S. Singh, and C. Guestrin, “” why should i trust you?”
explaining the predictions of any classifier,” in Proceedings of the ACM
SIGKDD international conference on knowledge discovery and data
mining (KDD), 2016, pp. 1135–1144.
[42] P. C. Rigby and C. Bird, “Convergent Contemporary Software Peer
Review Practices,” in Proceedings of the European Software Engineering
Conference and the International Symposium on the Foundations of
Software Engineering (ESEC/FSE), 2013, pp. 202–212.
[43] S. Ruangwan, P. Thongtanunam, A. Ihara, and K. Matsumoto, “The
Impact of Human Factors on the Participation Decision of Reviewers in
Modern Code Review,Empirical Software Engineering (EMSE), p. In
press, 2018.
[44] C. Sadowski, E. S¨
oderberg, L. Church, M. Sipko, and A. Bacchelli,
“Modern Code Review: A Case Study at Google,” in Proceedings of
ICSE (Companion), 2018, pp. 181–190.
[45] M. Tan, L. Tan, S. Dara, and C. Mayeux, “Online defect prediction for
imbalanced data,” in Proceedings of the International Conference on
Software Engineering (ICSE), vol. 2, 2015, pp. 99–108.
[46] C. Tantithamthavorn and A. E. Hassan, “An Experience Report on
Defect Modelling in Practice: Pitfalls and Challenges,” in In Proceedings
of the International Conference on Software Engineering: Software
Engineering in Practice Track (ICSE-SEIP), 2018, pp. 286–295.
[47] C. Tantithamthavorn, A. E. Hassan, and K. Matsumoto, “The Impact of
Class Rebalancing Techniques on The Performance and Interpretation of
Defect Prediction Models,” IEEE Transactions on Software Engineering
(TSE), p. To Appear, 2019.
[48] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto,
“Automated Parameter Optimization of Classification Techniques for
Defect Prediction Models,” in Proceedings of the 38th International
Conference on Software Engineering (ICSE), 2016, pp. 321–332.
[49] ——, “An empirical comparison of model validation techniques for
defect prediction models,” IEEE Transactions on Software Engineering,
vol. 43, no. 1, pp. 1–18, 2016.
[50] P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, “Revisiting
code ownership and its relationship with software quality in the scope
of modern code review,” in Proceedings of the 2016 IEEE/ACM 38th
International Conference on Software Engineering (ICSE), 2016, pp.
1039–1050.
[51] P. Thongtanunam, R. G. Kula, A. E. C. Cruz, N. Yoshida, and H. Iida,
“Improving code review effectiveness through reviewer recommenda-
tions,” in Proceedings of the International Workshop on Cooperative
and Human Aspects of Software Engineering (CHASE), 2014, pp. 119–
122.
[52] P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, “Investigating
code review practices in defective files: An empirical study of the
qt system,” in Proceedings of the IEEE/ACM Working Conference on
Mining Software Repositories (MSR), 2015, p. 168–179.
[53] ——, “Revisiting Code Ownership and its Relationship with Software
[60] S. Wattanakriengkrai, P. Thongtanunam, C. Tantithamthavorn, H. Hata,
and K. Matsumoto, “Predicting defective lines using a model-agnostic
Quality in the Scope of Modern Code Review,” in Proceedings of the
International Conference on Software Engineering (ICSE), 2016, pp.
1039–1050.
[54] ——, “Review participation in modern code review: An empirical
study of the android, qt, and openstack projects (journal-first abstract),”
in Proceedings of the International Conference on Software Analysis,
Evolution and Reengineering (SANER), 2018, pp. 475–475.
[55] P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida,
and K.-i. Matsumoto, “Who should review my code? a file location-
based code-reviewer recommendation approach for modern code re-
view,” in Proceedings of the IEEE International Conference on Software
Analysis, Evolution, and Reengineering (SANER), 2015, pp. 141–150.
[56] M. Tomczak and E. Tomczak, “The need to report effect size estimates
revisited. an overview of some recommended measures of effect size,”
Trends in sport sciences, vol. 1, no. 21, pp. 19–25, 2014.
[57] M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, and D. Poshyvanyk,
“On learning meaningful code changes via neural machine translation,”
in 2019 IEEE/ACM 41st International Conference on Software Engi-
neering (ICSE), 2019, pp. 25–36.
[58] R. Tufano, L. Pascarella, M. Tufano, D. Poshyvanykz, and G. Bavota,
“Towards automating code review activities,” in Proceedings of the
International Conference on Software Engineering (ICSE), 2021, pp.
163–174.
[59] A. Uchˆ
oa, C. Barbosa, D. Coutinho, W. Oizumi, W. K. Assunc¸ao, S. R.
Vergilio, J. A. Pereira, A. Oliveira, and A. Garcia, “Predicting design
impactful changes in modern code review: A large-scale empirical
study,” in Proceedings of the IEEE/ACM International Conference on
Mining Software Repositories (MSR), 2021, pp. 471–482.
technique,” arXiv preprint arXiv:2009.03612, 2020.
[61] M. Yan, X. Xia, Y. Fan, A. E. Hassan, D. Lo, and S. Li, “Just-in-time
defect identification and localization: A two-phase framework,IEEE
Transactions on Software Engineering (TSE), 2020.
[62] Y. Yu, H. Wang, G. Yin, and C. X. Ling, “Reviewer recommender of
pull-requests in github,” in Proceedings of IEEE International Confer-
ence on Software Maintenance and Evolution (ICSME), 2014, pp. 609–
612.
[63] M. B. Zanjani, H. Kagdi, and C. Bird, “Automatically recommending
peer reviewers in modern code review,” IEEE Transactions on Software
Engineering, vol. 42, no. 6, pp. 530–543, 2015.