Review Dynamics and Their Impact on Software Quality
Patanamon Thongtanunam, Member, IEEE, and Ahmed E. Hassan, Member, IEEE
Abstract—Code review is a crucial activity for ensuring the quality of software products. Unlike the traditional code review process of the
past where reviewers independently examine software artifacts, contemporary code review processes allow teams to collaboratively
examine and discuss proposed patches. While the visibility of reviewing activities including review discussions in a contemporary
code review tends to increase developer collaboration and openness, little is known about whether such visible information influences the
evaluation decision of a reviewer (i.e., whether knowing others' feedback about a patch before providing one's own feedback matters). Therefore,
in this work, we set out to investigate the review dynamics, i.e., a practice of providing a vote to accept a proposed patch, in a code
review process. To do so, we first characterize the review dynamics by examining the relationship between the evaluation decision of
a reviewer and the visible information about a patch under review (e.g., comments and votes that are provided by prior co-reviewers).
We then investigate the association between the characterized review dynamics and the defect-proneness of a patch. Through a case
study of 83,750 patches of the OpenStack and Qt projects, we observe that the amount of feedback (i.e., the votes and comments of
prior reviewers) and the co-working frequency of a reviewer with the patch author are highly associated with the likelihood that the
reviewer will provide a positive vote to accept a proposed patch. Furthermore, we find that the proportion of reviewers who provided
a vote consistent with prior reviewers is significantly associated with the defect-proneness of a patch. However, the associations of
these review dynamics are not as strong as the confounding factors (i.e., patch characteristics and overall reviewing activities). Our
observations shed light on the implicit influence of the visible information about a patch under review on the evaluation decision of a
reviewer. Our findings suggest that the code reviewing policies that are mindful of these practices may help teams improve code review
effectiveness. Nonetheless, such review dynamics should not be too concerning in terms of software quality.
Index Terms—Code review, Collaboration, Human Aspects, Software Quality, Peer Review, Biases
1 INTRODUCTION
Code review is one of the important quality practices
in a software development process. Broadly speaking,
a proposed patch (i.e., a set of code changes) must be
examined and critiqued by team members other than
the patch author before its integration into the main
software repository. The main goal of code reviews is
to improve the overall quality of a patch [4]. Recent
work shows that active and rigorous code reviews can
improve the quality of system design [48] and decrease
the number of post-release defects [5, 45, 63, 72].
Unlike the traditional code review process of the past
where reviewers independently examined software ar-
tifacts [2, 22], contemporary code review processes of-
ten provide a transparent environment with convenient
access to reviewer feedback (e.g., comments) in order
to enhance active and timely collaboration [4, 17, 56].
Developers in GitHub projects also perceive that such a
transparent code review process allows the evaluation
decision (i.e., whether or not a new patch should be
accepted) to become more democratic [42].
On the other hand, code reviewing is performed by
humans who may have subconscious biases that influ-
ence their objective evaluation [25]. Prior studies point
out that visible information other than the technical
content like reviewing discussion may also influence the
evaluation decision of code reviews [13, 79, 80].

P. Thongtanunam is with the University of Melbourne, Australia. E-mail: patanamon.t@unimelb.edu.au.
A. E. Hassan is with Queen's University, Canada. E-mail: ahmed@cs.queensu.ca.

Tsay et al. observe that core members sometimes use feedback and discussions of other team members to decide whether or not a patch should be accepted [80]. A survey study by Bosu et al. reports that the relationship between a patch author and a reviewer often affects the reviewer's decision to accept a review request [13]. A
recent study demonstrates that OpenStack developers
are starting to perceive unfairness in code reviews [24].
Moreover, Mozilla has recently started an experiment to
reduce the gender bias in code reviews [41]. Such biases
may potentially affect the long term success of a project.
Several intrinsic biases such as cognitive particular-
ism [78], favouritism for the familiar [53], and peer
bias [23] were also identified and discussed in the context
of peer reviews for grant proposals. Reviewing rules
and approaches were developed in order to mitigate
biases [37]. Other reviewing processes like the review of
academic publications use anonymization (e.g., double-
blind reviews) to achieve effective reviews [25, 62, 66,
77]. Yet, little is known about whether such practices should also be applied to the code review process.
In this study, we perform an empirical study to ex-
amine review dynamics (i.e., the practices of making an
evaluation decision for a code patch) and their associa-
tion with the defect-proneness of a patch. In particular,
we perform a two-fold analysis to characterize review
dynamics (Analysis 1) and investigate the impact of
the characterized review dynamics on software quality
(Analysis 2). Through a case study of 83,750 patches
that spread across the OpenStack and Qt open source
projects, we make the following observations.
Analysis 1: Characterizing Review Dynamics. To
operationalize review dynamics, we investigate the sig-
nals of visible information about a patch under review
that are associated with the evaluation decision of a
reviewer (e.g., whether to vote positively or negatively
for a patch). To capture visible information about a patch
under review, we use eight metrics that are grouped into
three dimensions: (1) feedback during the review (e.g.,
the prior votes of co-reviewers), (2) status (e.g., whether
a co-reviewer is a project core member), and (3) rela-
tionship (e.g., the co-working frequency of reviewers).
We then use a mixed-effects logistic regression model
to estimate the association between these metrics and the likelihood that a reviewer provides a positive vote to accept a patch.
Findings: While controlling for several patch character-
istics (e.g., patch size) that are known to influence the
evaluation decision [26, 32], we find that our newly
proposed metrics are associated with the evaluation
decision of a reviewer. In particular, our results show
that the proportion of positive votes and comments that
are provided by prior reviewers are significantly asso-
ciated with the likelihood that a reviewer will provide
a positive vote to accept the patch. We also find that
the co-working frequencies with the patch authors are
associated with the evaluation decision of a reviewer.
Analysis 2: Investigating the Impact of Review Dy-
namics on Software Quality. Based on the findings
of Analysis 1, we formulate six hypotheses that are
related to software quality. For example, we hypothesize
that an accepted patch where each of its reviewers
provided a positive vote consistent with prior reviewers
is more likely to be defective in the future. To test our
hypotheses, we define six metrics that are related to the
findings of Analysis 1. We then build and examine defect
models, i.e., the logistic regression models that estimate
the likelihood of a patch inducing fixes.
Findings: While our defect models control for several
confounding factors that are known to have an im-
pact on software quality [43], we find that the defect-
proneness of a patch is associated with the proportion of
reviewers who provided a positive vote consistent with
prior reviewers and the proportion of reviewers who
have a strong relationship with either the patch author
or co-reviewers of that patch in OpenStack. Yet, the
associations of review dynamics with defect-proneness
are not as strong as the patch characteristics and overall
reviewing activities.
Our findings shed light on the implicit influence that
the visible information in contemporary code review
processes may have on the evaluation decision of a
reviewer. Such review practices have a relatively small
impact on software quality. These findings suggest that
the code reviewing policies that are mindful of these
practices may help teams improve code review effective-
ness. However, teams should not be too concerned about
the influence of the visible information on software
quality. To facilitate future replication of our study, a
replication package of our work is available online.1
1.1 Novelty Statement
This paper is the first to present:
(1) An analysis of the visible information about a patch
under review and the evaluation decision of a patch.
(2) An analysis of the association between such visible
information and the defect-proneness of a patch.
1.2 Paper organization
Section 2 provides a background and a motivation for
our work. Section 3 describes the design of our case
study. Section 4 presents a preliminary analysis of review
dynamics. Sections 5 and 6 present the results of our
analyses. Section 7 discusses a broader implication of
our results. Section 8 discusses possible threats. Finally,
Section 9 draws conclusions.
2 BACKGROUND & MOTIVATION
In this section, we overview contemporary code review
processes and motivate our work based on manual
observations and related work.
2.1 Contemporary Code Review Processes
Code reviews have been widely used in open source
and industrial software projects [6, 15, 32, 58–60]. One
of the main motivations for using code reviews is to
improve the quality of new patches [4, 13, 59]. Several
prior studies show that code reviews impact software
quality [5, 45, 63, 72].
In recent years, code review practices have converged
to contemporary code reviews which are supported by
code review tools [56]. Typically, a code review tool (e.g.,
Gerrit2, ReviewBoard3, and Phabricator4) is a web-based
application that tightly integrates with Version Control
Systems (VCSs, e.g., Git). Broadly speaking, the code
review process consists of four main steps:
1. The patch author uploads a new patch to the code
review tool.
2. Reviewers evaluate the proposed patch and provide
feedback. The feedback can be either a review com-
ment or a vote. A positive vote indicates that the patch
can be integrated into the upstream VCS, while a
negative vote indicates that the patch is not ready for
integration.
3. The patch author revises the patch to address the
review feedback and uploads a new revision to the
code review tool.
4. The review process is repeated until reviewers pro-
vide sufficient positive votes indicating that the patch
is of sufficient quality to be integrated (i.e., a vote
score of +2).5
1. https://github.com/SAILResearch/replication review-dynamics
2. https://www.gerritcodereview.com/
3. https://www.reviewboard.org/
4. http://phabricator.org/
5. https://wiki.qt.io/Qt Contribution Guidelines and https://docs.
openstack.org/infra/manual/developers.html#code-review
2.2 A Transparent Environment of Code Reviews
To enable a collaborative code review process, a code
review tool often provides transparency in the form of
visibility to others’ activities on shared artifacts [17, 52,
60, 79]. For example, one can read others’ patches and
code reviews to learn about other parts of the system
without writing code [4]. One also can perceive others’
expertise through their visible history of code com-
mits [17]. Developers at Google note that team awareness
is one of the motivations of using such transparent code
reviews [60].
Unlike traditional code inspections where reviewers
have face-to-face meetings [22], contemporary code re-
views are performed through a virtual workspace (i.e.,
the online code review tool). Such a virtual commu-
nication and decision making may have an impact on
the effectiveness of code reviews [19, 46, 54]. Moreover,
based on conventional wisdom, a new patch should be evaluated solely based on its technical merits [61]. However, recent studies observe that in a transparent environment of code reviews, visible information about the author of a patch can affect the evaluation decision [11, 79]. Hence, it is possible that the visible in-
formation about co-reviewers may affect the evaluation
decision. Below, we discuss possible information that
may affect the evaluation decision of a patch based on
our manual observations and published literature.
2.2.1 Manual Observations
We randomly select 20 patches from the OpenStack
project and manually examine their review discussions.
Through our manual examination, we observe that for
some patches, reviewers provided a vote consistent with
prior reviewers and changed their votes based on the
review discussion. We provide examples of these two practices below.
Providing Consistent Votes. In the review ID 1411,
we observe that although a reviewer has a concern,
that reviewer still provides a positive vote based on
the vote of prior reviewers: “...It would be nice to fix ...,
but that is not urgent for the release. Given that (the prior
reviewer) already +1’d this patch as well, I will +2 it”.6We
also observe that reviewers provide a negative vote to
support the concerns of prior reviewers. For example, in
the review ID 1257, the first reviewer raises a concern.
Then, the second reviewer provides a negative vote
to support the concern of the first reviewer: "Backing up (the first reviewer)'s comment on.....".7 These examples
provide evidence that in addition to the technical merits
of a patch, the feedback of prior reviewers influence the
evaluation decision of subsequent reviewers.
Changing Votes. In the review ID 344949, we ob-
serve that one of the reviewers turns his positive vote
to a negative vote after the other co-reviewer raises a
concern: "...Code-Review-1. (the co-reviewer)'s right".8 The
6. https://review.openstack.org/#/c/1411
7. https://review.openstack.org/#/c/1257
8. https://review.openstack.org/#/c/344949
negative vote of a reviewer can also be turned into a
positive vote based on the feedback of other reviewers.
For example, in the review ID 16888, a reviewer (R)
provides a negative vote and proposes an alternative
approach for the patch.9 Then, a co-reviewer provides
a positive vote to the patch and posts a comment that
the patch is acceptable “I think there is some value in
R’s proposal, but I don’t think it might be a blocker for
approval." Finally, the reviewer R turns his negative vote
to a positive vote “After reading the reviews..., I agree
with the approach....”. These examples provide evidence
that the comments of other reviewers may influence the
evaluation decision of a reviewer.
2.2.2 Related Work
We now discuss other visible information that may also
affect an evaluation decision based on literature.
Feedback. While reviewers can discuss concerns and
potential defects with other team members in a review
discussion thread, such visible feedback (e.g., review
votes and comments) may affect the evaluation deci-
sion. This influence can be considered as social influence
where a popular option tends to continue receiving at-
tention [20, 38, 51]. Prior studies in political science show
that the perception around a community’s opinion can
influence the voting decision of an individual [40, 69]. In
the code review context, Tsay et al. point out that team
members sometimes apply pressure during a review
discussion to influence the patch acceptance of a core
member [80].
Status. Intuitively, the longer the provided review
feedback, the more effective the code review is [36].
However, several studies report that not all feedback receives attention during the code review [9, 59, 80].
For example, Rigby et al. observe that the feedback
of outsiders (e.g., non-core members) tends to receive
less attention than that of core members during code
reviews [59]. This can be considered as a cognitive bias
in terms of social hierarchy, where an implicit or explicit
rank order of individuals or groups influences others'
behaviors [14, 38]. Studies of peer reviews in academic
publications point out that the reputation of authors can
influence reviewers [27, 77]. In the code review context,
a recent study observes that the social status of a patch
author (e.g., the number of followers in GitHub projects)
is associated with the likelihood of patch acceptance [79].
Bosu et al. report that in open source projects, develop-
ers’ reputation (based on social network analysis) can
influence the evaluation decision of their patches [11].
Relationship. Since team members often interact with
each other during the code review process, they may
form these interactions into a relationship. Then, it is
possible that a reviewer may trust the contributions
(either a patch or review feedback) of team members
with whom they have a strong relationship. Such a
behavior can be considered as a type of ingroup-outgroup
9. https://review.openstack.org/#/c/16888
behavior, where individuals create a strong motivation
to cooperate with ingroup members [38]. For example, a
study of the peer review of academic publications found
that the publications of newcomers are underrepresented when the author names are revealed [62]. Similarly, due
to the social dynamics in code reviews, the collective
evaluation of a patch acceptance can be affected by the
interpersonal compatibility of reviewers [1, 49]. In the
context of code reviews, a recent study also reports
that the relationship with the patch author is one of
the important factors for open source developers to
decide whether they will review a patch [13]. On the
other hands, studies in open source projects report that
developers with a weak relationship with other team
members like newcomers face negative impressions (e.g.,
receiving slow feedback) from reviewers [39, 68].
3 CASE STUDY DESIGN
In this section, we provide an overview of our study,
the studied projects, our data preparation, and our em-
ployed statistical analysis approaches.
3.1 Study Overview
To better understand the review dynamics of modern
transparent code review processes and the impact of
such dynamics on software quality, we perform an em-
pirical study using a two-fold analysis. Figure 1 provides
an overview of our case study design, which we briefly
describe below.
Analysis 1: Characterizing Review Dynamics
To operationalize review dynamics, we analyze historical
data to identify signals that can relate to the evaluation
decision of each reviewer in each patch using a mixed-
effects logistic regression model. In particular, for each
patch, we extract the visible information available during
its code review using eight metrics grouped along three
dimensions, i.e., feedback during reviews, status, and
relationship. Then, we examine the association between
these metrics and the likelihood that a reviewer will
provide a positive vote to a patch.
Analysis 2: The Impact of Review Dynamics on Software
Quality
Once we identify important review dynamics, i.e., sig-
nificant signals of visible information metrics to an
evaluation decision, we examine whether such review
dynamics have an impact on software quality. To do
so, we formulate hypotheses based on the findings of
Analysis 1 and develop corresponding metrics for each
of our hypotheses. Then, we analyze defect models to
examine the association between our review dynamics
metrics and the likelihood that a patch will be defective.
3.2 Studied Projects
To understand the review dynamics of the modern code
review process, we perform an empirical study on large
open source projects that actively use code reviews.
To select the studied projects, we start with the open
source projects that are listed in the work of Bosu and
Carver [11]. Then, we check for the accessibility of the
data of the code review tool and the VCS of these
projects. For example, the ITK/VTK project was not
included in our study because its REST API access is
no longer provided.
We obtain OpenStack, Qt, LibreOffice, and Chromium.
These four projects have a large number of reviews
recorded in the Gerrit code review tool (see Table 1).
However, we observe that a large proportion of Libre-
Office reviews have only one reviewer. Since we want to
investigate the review dynamics and interactions among
reviewers during a code review, we exclude LibreOffice
from our study. Similarly, we observe that Chromium
reviewers rarely use a negative vote for both merged
and abandoned patches (see Table 1 in the #Patches with
+vote/-vote column). Since our Analysis 1 investigates
the signals that can relate to the evaluation decision
(i.e., providing a positive or negative vote), we exclude
Chromium from our study. Therefore, in this paper, we
perform an empirical study on the OpenStack and Qt
projects.
The OpenStack project is managed by the OpenStack
Foundation with code contributions from well-known
companies (e.g., IBM, EMC, Cisco).10 Qt was mainly
developed by the Qt company. Currently, the Qt project is led by the Nokia and Digia corporations; nevertheless, code contributions from the community are also welcomed.11
3.3 Data Preparation
Below, we describe the data extraction and cleaning
approaches.
Data Extraction. We collect the review data from
the Gerrit tool of the OpenStack12 and Qt13 projects
using the REST API. Our review datasets include review
ID, patch information (e.g., patch description, modified
files), revisions, review discussion threads, and the in-
volved personnel that are recorded in the tool. Then, we
measure the overall review activities which are described
in Table 2.
We also extract the score of each vote and its timestamp for our Analysis 1. To do so, we use a set of
regular expressions to extract the votes from the review
messages in the discussion threads. The timestamp when the corresponding message was posted is considered as the time when the vote was provided. We can
rely on this approach since Gerrit automatically records
10. https://www.openstack.org/
11. https://www.qt.io/developers/
12. https://review.opendev.org/
13. https://codereview.qt-project.org/
Fig. 1: An overview of the design of our case study. Review data and defect data are extracted and prepared, and review dynamics are measured. Analysis 1 (studying the reviewer data) characterizes review dynamics, i.e., the association between visible information and an evaluation decision, while Analysis 2 (studying the patch data) examines the association between review dynamics and software quality.
TABLE 1: An overview of the review datasets.
OpenStack — Period: 11/2011-07/2019; #Patches: 57,782; #Reviewers: 3,662; #Reviewers/Patch: Avg=6 (min=1, max=43); #Patches w/ >1 reviewers: 57,288 (99%); #Patches w/ +vote/-vote: 50,374/32,074; #Commits (%FixInducing): 35,885 (49%)
Qt — Period: 09/2013-07/2019; #Patches: 25,968; #Reviewers: 694; #Reviewers/Patch: Avg=2 (min=1, max=18); #Patches w/ >1 reviewers: 15,765 (60%); #Patches w/ +vote/-vote: 24,361/3,626; #Commits (%FixInducing): 18,887 (25%)
LibreOffice — Period: 07/2012-06/2019; #Patches: 40,171; #Reviewers: 243; #Reviewers/Patch: Avg=1 (min=1, max=8); #Patches w/ >1 reviewers: 3,639 (9%); #Patches w/ +vote/-vote: 16,248/2,490; #Commits (%FixInducing): -
Chromium — Period: 01/2018-04/2019; #Patches: 76,927; #Reviewers: 1,834; #Reviewers/Patch: Avg=2 (min=1, max=19); #Patches w/ >1 reviewers: 33,221 (43%); #Patches w/ +vote/-vote: 76,756/289; #Commits (%FixInducing): -
the history (e.g., a vote score and patch revision) in the
discussion threads.14 The first line of a review message
will contain the vote score following specific patterns
(e.g., ‘Patchset 1: Code-Review +1’) when a reviewer
provides a vote. We also manually validate the extraction
results.
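For illustration, the following Python sketch shows the kind of pattern matching described above; the regular expression and the function name are illustrative assumptions rather than the exact expressions used in our scripts.

```python
import re

# Illustrative pattern: Gerrit writes the vote into the first line of the
# review message, e.g., "Patchset 1: Code-Review +1" or "Patch Set 2: Code-Review-1".
VOTE_RE = re.compile(r"Code-Review\s*([+-]\d)", re.IGNORECASE)

def extract_vote(message: str):
    """Return the vote score (e.g., 1 or -1) found in the first line of a
    review message, or None if the message carries no vote."""
    first_line = message.splitlines()[0] if message else ""
    match = VOTE_RE.search(first_line)
    return int(match.group(1)) if match else None
```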
For the code datasets, we use PyDriller, a Python
framework [65], to collect code commits from the Git
VCSs of the studied projects and extract code characteris-
tics which are described in Table 2. Then, we use the SZZ
algorithm [64] to identify fix-inducing commits. To do so,
we identify fixing commits using keyword search (e.g.,
fix, bug, and defect). Then, we use the implementation of PyDriller to identify the set of commits that last modified the lines in the files that are changed by the fixing commit (i.e., the commit of interest).15 Finally, those commits that changed the same lines prior to the commit of interest are identified as fix-inducing commits.
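As a rough illustration of this step, the sketch below outlines the keyword-based identification of fixing commits and the line-based tracing with PyDriller's get_commits_last_modified_lines (the PyDriller 1.x API referenced in the footnote); the repository path and keyword list are assumptions, and the real pipeline includes additional validation.

```python
from pydriller import GitRepository, RepositoryMining

REPO_PATH = "path/to/project"          # hypothetical local clone of a studied repository
FIX_KEYWORDS = ("fix", "bug", "defect")

repo = GitRepository(REPO_PATH)
fix_inducing = set()
for commit in RepositoryMining(REPO_PATH).traverse_commits():
    if any(keyword in commit.msg.lower() for keyword in FIX_KEYWORDS):
        # For each file changed by the fixing commit, collect the commits that
        # last modified the same lines; SZZ flags these as fix-inducing.
        for _file, hashes in repo.get_commits_last_modified_lines(commit).items():
            fix_inducing.update(hashes)
```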
Once we collect the review and code datasets, we link
the data using the review ID. Review ID is a unique identifier that follows the Project~Branch~Change-ID format, where Project is the name of the VCS repository, Branch is the destination branch into which the patch will be merged, and Change-ID is a 41-character hash value. According to the contribution guideline of
the studied projects,16 the Change-ID is automatically
generated and inserted into the commit message when
the proposed patch is merged into the main VCS. Hence,
we use a regular expression to extract the Change-ID
in the commit message. Then, we use the extracted
Change-ID of a commit and its corresponding VCS
repository and branch to link between the code commits
and the reviews in our datasets. Note that in this work,
we study only commits that are merged into the main
branch.
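For illustration, a minimal sketch of the Change-ID extraction is shown below; the exact regular expression we used may differ, but Gerrit inserts the identifier as a "Change-Id:" trailer in the commit message.

```python
import re

# Gerrit's Change-Id trailer: the letter "I" followed by 40 hexadecimal characters.
CHANGE_ID_RE = re.compile(r"^Change-Id:\s*(I[0-9a-fA-F]{40})\s*$", re.MULTILINE)

def extract_change_id(commit_message: str):
    """Return the Change-Id found in a commit message, or None if absent."""
    match = CHANGE_ID_RE.search(commit_message)
    return match.group(1) if match else None
```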
14. https://gerrit-review.googlesource.com/Documentation/
user-review-ui.html#history
15. https://pydriller.readthedocs.io/en/latest/reference.html#
pydriller.git repository.GitRepository.get commits last modified
lines
16. https://docs.openstack.org/infra/manual/developers.html
Data Cleaning. To accurately understand review dy-
namics, we clean the studied review datasets by merging
duplicate accounts due to email aliases [9]. To do so,
we search for a pair of accounts that share the same
substring of the name or the email name (excluding the
email domain). We then manually verify each pair of po-
tential duplicates. In addition, we remove the messages
that are posted by automated tools in a review discus-
sion thread [76]. To do so, we study the documentation
of the studied projects17 to identify the automated tools
that are integrated with the code review tools.
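The following sketch illustrates the account-pairing heuristic (the account fields and the exact matching rules are simplifications of our scripts); every candidate pair is still verified manually.

```python
from itertools import combinations

def email_name(email: str) -> str:
    """Local part of an email address, i.e., everything before the domain."""
    return email.split("@")[0].lower()

def candidate_duplicates(accounts):
    """accounts: list of dicts with 'name' and 'email' keys (illustrative schema).
    Return pairs of accounts that share a name substring or the same email name."""
    pairs = []
    for a, b in combinations(accounts, 2):
        name_a, name_b = a["name"].lower(), b["name"].lower()
        shared_name = bool(name_a) and bool(name_b) and (name_a in name_b or name_b in name_a)
        same_email_name = email_name(a["email"]) == email_name(b["email"])
        if shared_name or same_email_name:
            pairs.append((a, b))
    return pairs
```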
Table 1 provides a summary of the review and code
datasets that will be used in this study. The number of
patches column indicates the number of patches that are
marked as merged or abandoned. We only consider the
proposed patches that were submitted after November
2011 for OpenStack and September 2013 for Qt since
prior work pointed out that the earlier period may be
the initial adoption period of code reviews in those
projects [43].
3.4 Statistical Analysis
To statistically analyze the data, we use a regression
model to fit our data while controlling for several con-
founding factors. Similar to prior studies [10, 36, 79, 81],
our main goal of using regression models is not for
prediction, but to understand the associations between
the metrics of interest and the likelihood that is estimated
by the regression models [70].
Confounding factors. Table 2 briefly describes the 22
metrics that we used as confounding factors. For both
Analysis 1 and Analysis 2, we use 15 code metrics as
confounding factors in terms of patch characteristics.
These code metrics are commonly known to have an as-
sociation with defect-proneness [10, 18, 30, 34, 43, 55, 73].
We also consider the description length of a patch as a
confounding factor since our prior work finds that the
17. https://docs.openstack.org/infra/manual/developers.html and
https://wiki.qt.io/CI Overview
TABLE 2: An overview of confounding factors.

Patch characteristics:
- #Added lines: The number of lines that were added by the studied patch.
- #Deleted lines: The number of lines that were deleted by the studied patch.
- LOC: The number of lines of code in the files before the change by the studied patch.
- Average complexity: The average Cyclomatic Complexity of the files that were impacted by the studied patch.
- #Files: The number of files that were impacted by the studied patch.
- #Directories: The number of directories that were impacted by the studied patch.
- Is bug fixing: Whether or not the studied patch is for fixing bugs.
- Entropy: The dispersion of modified lines across files [30].
- #Developers: The number of developers who made prior patches that impact the same modified files as the studied patch.
- #Prior patches: The number of prior patches that impact the same modified files as the studied patch.
- #Prior fixes: The number of prior patches that are for bug fixing and that impact the same modified files as the studied patch.
- Age: The average of the time intervals between the last patch that was made to each modified file and the studied patch.
- Is the patch author major author: Whether the authoring-specific ownership of the patch author is greater than 0.05, where the authoring-specific ownership is the proportion of prior patches that are made by the patch author and that impacted the same modified files of the studied patch [10].
- Is the patch author major reviewer: Whether or not the review-specific ownership of the patch author is greater than 0.05, where the review-specific ownership is the proportion of prior patches that are reviewed by the patch author and that impacted the same files as the modified files of the studied patch [73].
- Patch description length: Word count of the commit message that the patch author used to describe the studied patch [74].

Reviewer characteristics:
- Review involvement: The proportion of comments that the reviewer under study provides to the studied patch before the reviewer under study provides a vote [73, 74].
- Reviewer authoring experience: The proportion of prior patches that are made by the reviewer and that impacted at least one of the modified files of the studied patch [10].
- Reviewer reviewing experience: The proportion of prior patches that are reviewed by the reviewer under study and that impacted at least one of the modified files of the studied patch [73].

Reviewing activities:
- #Reviewers: The total number of reviewers who provided a vote for the studied patch [72].
- #Reviewing comments: The total number of comments of reviewers [44, 72, 74].
- %Positive votes: The proportion of reviewers who provided a positive vote for the studied patch [72, 74].
- Reviewing time: The length of time between the patch submission to the code review tool and the final evaluation outcome [72, 74].
description length is associated with the participation of
reviewers [74].
In addition, in Analysis 1, we also consider the past
involvement and experience of reviewers. This is because
prior work shows that the review quality can be asso-
ciated with the reviewer experience [36, 57, 73]. We use
three metrics to measure the characteristics of a reviewer
(see Table 2).
Since overall reviewing activities also can have an
impact on software quality [5, 36, 45, 72], we also use
four code review metrics as confounding factors for
Analysis 2 (see Table 2). These code review metrics
measure the reviewing activities that occur during the
review until the evaluation decision was made.
Our statistical analysis approach consists of four main
steps, which we describe in detail below.
3.4.1 Correlation Analysis
A high correlation among metrics may lead a regression
model to produce spurious associations between the
metrics and the estimated likelihood [29, 70]. Hence,
we use Spearman rank correlation (ρ) to assess the
correlation between each pair of metrics. For each pair
of metrics with a correlation |ρ| larger than 0.7, we
remove one metric from our study. To systematically
and consistently select metrics for our study, we use the
AutoSpearman function in the Rnalytica R package
to find the metric that has the most unique signal, i.e.,
having the least correlation with other metrics in the
dataset, without considering the response variable [33].
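As a rough Python analogue of this step (the study itself uses the AutoSpearman function from the Rnalytica R package), the sketch below drops one metric from every pair whose Spearman |ρ| exceeds 0.7; unlike AutoSpearman, it simply keeps the first metric of each offending pair rather than the one with the most unique signal.

```python
import pandas as pd

def spearman_filter(metrics: pd.DataFrame, threshold: float = 0.7):
    """Greedy sketch: return the metric names that survive pairwise
    Spearman correlation filtering at the given threshold."""
    corr = metrics.corr(method="spearman").abs()
    kept = list(metrics.columns)
    for i, m1 in enumerate(metrics.columns):
        for m2 in metrics.columns[i + 1:]:
            if m1 in kept and m2 in kept and corr.loc[m1, m2] > threshold:
                kept.remove(m2)   # AutoSpearman would instead keep the least-correlated metric
    return kept
```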
3.4.2 Fitting Statistical Models
We now describe our approach of fitting the regression
models for Analysis 1 and 2 (see Figure 1).
Review Dynamics Models. In our Analysis 1, we
analyze the association between the visible information
about a patch under review and the evaluation deci-
sion of a reviewer (i.e., whether a reviewer provides a
positive vote to accept a patch). Since the evaluation
decision of each reviewer is dependent on the actual
patch, we use a mixed-effects logistic regression model
to account for variance in the multilevel data (i.e., the
reviewer evaluation decision and the patch data) [67].
We use the glmer function in the lme4 R package with
the option of family=‘binomial’ to fit mixed-effects
logistic regression models to our data.
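For readers who prefer Python, a rough analogue of the glmer call is sketched below using statsmodels' Bayesian binomial mixed GLM with a random intercept per reviewer; the column names and input file are hypothetical, and the study itself fits the model with glmer in R's lme4 package.

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# One row per (reviewer, patch): the binary response is_positive_vote,
# the fixed effects, and the reviewer identifier (hypothetical input file).
reviews = pd.read_csv("review_dynamics.csv")

fixed = ("is_positive_vote ~ prior_positive_votes + prior_comments"
         " + reviewed_patches_for_author + entropy")
random = {"reviewer": "0 + C(reviewer_id)"}   # random intercept per reviewer

model = BinomialBayesMixedGLM.from_formula(fixed, random, data=reviews)
result = model.fit_vb()                        # variational Bayes fit
print(result.summary())
```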
Defect Models. In our Analysis 2, we analyze how
review dynamics in a patch are associated with defect-proneness. To do so, we use a simple logistic regression model to fit our data. We use the glm function in the stats R package with the option of family=‘binomial’.

Fig. 2: An overview of the log-likelihood ratio (LR) tests. The overall LR χ² is obtained by comparing the full model against the null model (y ~ 1), while the ΔLR χ² for the metrics of interest is obtained by comparing the full model against the model that excludes those metrics.
3.4.3 Assessing the Fit
To assess the goodness of fit, we use a log-likelihood
ratio test [31] and the Area Under the receiver operating
characteristic Curve (AUC) [28].
Log-likelihood ratio tests. We use a log-likelihood
ratio (LR) test to evaluate how well our regression
models fit to the data using the studied metrics. Figure
2 provides an overview of our LR tests.
1) The overall LR χ² indicates the overall goodness of fit for the model that uses all metrics (called a full model). To estimate the overall LR χ², we compare our full model against the null model (i.e., a model that fits the data without metrics except an intercept).
2) The LR χ² indicates the goodness of fit for the metrics of interest in the model. To estimate the LR χ², we compare the full model against the model that fits the data without using the metrics of interest.
The larger the overall LR χ² (or LR χ²) value is, the better the fit of our model based on all metrics (or a particular set of metrics).
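A minimal sketch of how the two LR χ² values can be computed is shown below (using statsmodels and hypothetical column names; the study computes them from the R models described above).

```python
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

patches = pd.read_csv("patches.csv")  # hypothetical dataset, one row per merged patch

full = smf.logit("is_fix_inducing ~ entropy + prior_fixes + positive_votes", data=patches).fit(disp=0)
null = smf.logit("is_fix_inducing ~ 1", data=patches).fit(disp=0)
reduced = smf.logit("is_fix_inducing ~ entropy + prior_fixes", data=patches).fit(disp=0)

overall_lr_chi2 = 2 * (full.llf - null.llf)      # full model vs. null model
metric_lr_chi2 = 2 * (full.llf - reduced.llf)    # full model vs. model without the metric of interest
p_value = stats.chi2.sf(metric_lr_chi2, df=1)    # df = number of dropped terms
```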
AUC. We use AUC to evaluate how well a model can
discriminate between two potential responses. We use
the auc function in the pROC R package to measure the
AUC for our models. An AUC value of 1 indicates the
best discriminant ability of the model, while an AUC
value of 0.5 indicates that the discriminant ability of the
model is no better than random guessing.
Validating the models. We use the bootstrap valida-
tion technique [21] to estimate the optimism of AUC. As
suggested by prior work [71], the bootstrap validation
technique tends to yield less bias than the traditional
k-fold cross validation techniques for logistic regression
models. To estimate the AUC optimism, we first generate
a bootstrap sample, i.e., a sample with replacement of
the studied dataset. Then, we build a model using the
bootstrap sample (i.e., a bootstrap model). Finally, the
AUC optimism is the difference in AUC between the
bootstrap model when applied to the original dataset
and the bootstrap sample. We repeat this process 100
times and compute the average AUC optimism.

Fig. 3: An example of the prior visible feedback that is used to compute metrics. In a review discussion thread, reviewer A (comments only), reviewer B (a +1 vote and three comments), and reviewer C (a -1 vote and comments) provide feedback before reviewer D provides a +1 vote; the feedback of A, B, and C is the observed feedback for reviewer D.

A small AUC optimism indicates that the performance
estimates of our models using the original data are
reliable.
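The sketch below outlines the optimism estimation for a single model family (scikit-learn's logistic regression is used purely for illustration; the study applies the procedure to the regression models described above).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

def auc_optimism(X, y, n_boot=100, seed=0):
    """Average difference between a bootstrap model's AUC on its own
    bootstrap sample and its AUC on the original data."""
    rng = np.random.RandomState(seed)
    optimism = []
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng)       # bootstrap sample (with replacement)
        model = LogisticRegression(max_iter=1000).fit(Xb, yb)
        auc_boot = roc_auc_score(yb, model.predict_proba(Xb)[:, 1])
        auc_orig = roc_auc_score(y, model.predict_proba(X)[:, 1])
        optimism.append(auc_boot - auc_orig)
    return float(np.mean(optimism))
```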
3.4.4 Analyzing Associations
We use Wald statistics to estimate the explanatory power
of each metric that contributes to the fit of the regression
model. To do so, we use the Anova function in the car R
package with the option of test.statistic=‘Chisq’
to estimate the explanatory power (Wald χ²) and its statistical significance (p-value). The larger the Wald χ² value is, the larger the explanatory power that a
particular metric contributes to the model. In addition,
we examine the direction of the association by observing
the regression coefficients of the regression model.
4 PRELIMINARY ANALYSIS
In Section 2.2.1, we observe that for some patches, re-
viewers provided a vote consistent with prior reviewers
and changed their votes based on the review discussion.
This observation provides evidence that feedback of
other prior reviewers influences the evaluation decision
of a reviewer. Therefore, we perform a preliminary anal-
ysis to examine whether such practices are associated
with defect-proneness.
Approach. We first measure the proportion of review-
ers who provide a positive vote consistent with prior
reviewers and the proportion of reviewers who change
their vote on the same patch revision. For our calcu-
lation, we assume that reviewers may observe all the
visible information when making an evaluation decision.
For example, in Figure 3, reviewer D may observe the
visible information, i.e., the prior feedback of reviewers
A, B, and C; and provide a vote consistent with a prior
reviewer (i.e., reviewer B). Therefore, the proportion of
reviewers who provide a consistent positive vote for
this review is 1/3. In this analysis, we did not measure
the proportion of reviewers who provide a consistent
negative vote, since in Section 2.2.1, we observe that
reviewers provide a positive vote consistent with the
prior reviewers.
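One plausible reading of this metric is sketched below; we assume the votes are ordered chronologically, that a score of 0 denotes a comment-only message, and that the denominator is the number of reviewers who cast a vote (which yields the 1/3 in the example of Figure 3).

```python
def consistent_positive_vote_ratio(votes):
    """votes: chronologically ordered (reviewer, score) tuples for one patch,
    e.g., [("A", 0), ("B", +1), ("C", -1), ("D", +1)].
    A reviewer is 'consistent' when they vote +1 after at least one
    prior reviewer has already voted +1."""
    consistent = voters = 0
    seen_positive = False
    for _reviewer, score in votes:
        if score == 0:          # comment without a vote
            continue
        voters += 1
        if score > 0 and seen_positive:
            consistent += 1     # in the example above, only reviewer D qualifies -> 1/3
        if score > 0:
            seen_positive = True
    return consistent / voters if voters else 0.0
```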
Unlike the proportion of positive votes in the overall
review activities (see Table 2) which measures the con-
sensus at the time when the patch was accepted, the
proportion of reviewers who provided a vote consistent
TABLE 3: Statistics of defect models with the observed review dynamics. The larger the Wald χ² value of a metric is, the larger the explanatory power of this particular metric in the model. (Values are reported as OpenStack; Qt.)
#TRUE/FALSE instances: 17,924/17,962; 4,857/14,030
Overall LR χ²: 6,677***; 1,919***
AUC: 0.74; 0.70
Review dynamics (Wald χ²)
%Consistent with prior positive votes: 14** (+); 0 (+)
%Vote changes: 0 (-); 5 (-)
Reviewing activities (Wald χ²)
%Positive votes: 180*** (-); 167*** (-)
#Reviewing comments: 736*** (+); 147*** (+)
Reviewing time: 188*** (-); 8** (+)
Patch characteristics (Wald χ²)
#Added lines: 14*** (+); 1 (+)
#Deleted lines: 1 (-); 34*** (+)
LOC: 176*** (-)
Avg. complexity: 143*** (-); 42*** (-)
#Directories: 168*** (+); 4 (-)
Is bug fixing: 328*** (+); 8** (+)
Entropy: 309*** (+); 174*** (+)
#Prior fixes: 561*** (+); 249*** (+)
Age: 130*** (-); 32*** (-)
Is the patch author major author: 100*** (+); 68*** (+)
Is the patch author major reviewer: 0 (-); 3 (+)
Stat. significance: *** p < 0.001, ** p < 0.01, * p < 0.05; otherwise p ≥ 0.05.
† Discarded due to the correlation analysis. #Reviewers, #Prior patches, #Developers, and #Files are also discarded due to the correlation analysis.
with prior positive votes does not measure the propor-
tion of the reviewers at the time when the patch was ac-
cepted. Indeed, our new metric measures the proportion
of reviewers at the time when each reviewer in a review
provided a vote. Hence, the proportion of reviewers who
provided a vote consistent with prior positive votes can
have a different value from the proportion of positive
votes. Finally, we examine the association between these
metrics and the likelihood of a patch inducing fixes
using defect models. We construct our defect models as
described in Section 3.4.
Preliminary Findings. Table 3 shows that there is a
significant positive association between the proportion
of reviewers who provide a consistent positive vote with
prior reviewers on the same patch (i.e., the proportion of
consistent positive votes) and the likelihood of inducing
fixes for the OpenStack dataset. This result indicates that the
proportion of consistent votes is associated with defect-
proneness even when reviewing and code confounding
factors are controlled. More specifically, although we
control for overall positive votes (i.e., the proportion of
positive votes) in our defect models, the explanatory
power (Wald χ²) of the proportion of consistent positive votes still accounts for 7% (14/180) of
the explanatory power that the proportion of positive
votes contributes to the OpenStack model.
Although the proportion of providing consistent votes
did not contribute a significant amount of explanatory
power to the Qt model, other visible information about a
patch under review may play a role. For example, Table
3 shows that the number of reviewing comments plays
a significant role in our defect models. It is possible that
the comments of prior reviewers may also influence the
evaluation decision of a reviewer. Therefore, we further
investigate the review dynamics of code reviews and
their association with defect-proneness in the following
two sections.
Summary: Despite conventional wisdom [61], our pre-
liminary analysis shows that the visible information
in a review discussion might influence the evaluation
decision of a reviewer. Such a practice is also associated
with defect-proneness for the OpenStack dataset. Yet,
review dynamics still remain largely unexplored.
5 ANALYSIS 1: CHARACTERIZING REVIEW
DYNAMICS
In this section, we present the approach and the results
of our Analysis 1.
5.1 Approach
To examine the review dynamics (i.e., the practices of a reviewer when voting for a patch), we compute metrics that
capture visible information about a patch under review.
Based on literature and our manual observations (see
Section 2.2), we use eight metrics which are grouped into
(1) feedback, (2) status, and (3) relationship dimensions.
Table 4 describes these eight metrics. Note that in the
calculation of these metrics, we use only the information
(i.e., votes and comments) that occurs before the re-
viewer under study provides a vote. For example, given
that reviewer D in Figure 3 is the reviewer under study,
we use only votes and comments that are provided
by the prior reviewers (i.e., reviewers A, B, and C) to
compute metrics for reviewer D.
We now build models to analyze the associations
between the visible information about a patch under re-
view and the evaluation decision of each reviewer of that
patch. In Analysis 1, we perform a study on both merged
and abandoned reviews. Since the evaluation decision
of each reviewer is dependent on the actual patch, we
use a mixed-effects logistic regression model to account
for variance in the multilevel data (i.e., the reviewer
evaluation decision and the patch data) [67]. We use our
metrics that capture the visible information about a patch
under review and confounding factors as fixed effects
while the unique reviewer ID is represented as a random
effect. Since we analyze the activities during a code
review, we use only the patch characteristics described
in Table 2 as confounding factors. We set the response
variable in our models as TRUE if the reviewer under
study provides a positive vote to a patch, and FALSE
otherwise. The formula used in our mixed-effects models is IsPositiveVote ~ x1 + x2 + ... + xn + (1 | ReviewerId), where x1, ..., xn are the fixed effects (i.e., our studied metrics and confounding factors).
18. The core members are identified based on the project doc-
umentation, i.e., https://wiki.openstack.org/wiki/Project Resources
and https://wiki.qt.io/Maintainers.
TABLE 4: An overview of our proposed metrics to capture visible information about a patch under review. Note that we only use the information (e.g., votes and comments) that is available before a reviewer under study provides a vote.

Feedback dimension (captures feedback of prior reviewers before the reviewer under study provides a vote):
- %Prior positive votes: The proportion of positive votes provided by prior reviewers.
- %Prior negative votes: The proportion of negative votes provided by prior reviewers.
- %Prior comments: The proportion of prior reviewer comments that were posted by prior reviewers for the recent patch revision relative to the comments for all the patch revisions.

Status dimension (captures reviewing activities of core members before the reviewer under study provides a vote):
- %Prior positive vote from core reviewers: The proportion of positive votes provided by documented core members.18
- Is the patch author a core member: TRUE if the patch author is a documented core member.11

Relationship dimension (captures the historical cooperation of the reviewer under study with the patch author and with prior co-reviewers of the same patch):
- %Reviewed patches for the patch author: The proportion between the number of prior patches that the reviewer under study has reviewed for the patch author and the number of prior patches reviewed by the reviewer under study.
- Avg. co-working frequency with prior positive co-reviewers: The average number of prior patches that the reviewer under study had co-reviewed with each prior positive co-reviewer, normalized by the number of prior patches reviewed by the reviewer under study.
- Avg. agreement level with prior positive co-reviewers: The average number of prior patches in which the reviewer under study and each prior positive co-reviewer had provided a consistent vote, normalized by the number of prior patches reviewed by the reviewer under study.
Since reviewers can
provide a vote to a patch several times, we focus only
on their latest vote. Finally, we build and analyze our
mixed-effects models (see Section 3.4).
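To make the relationship dimension concrete, the sketch below computes %Reviewed patches for the patch author for one reviewer; the history schema is an assumption made for illustration.

```python
def reviewed_patches_for_author_ratio(reviewer, author, history):
    """history: prior patches as dicts like {"author": ..., "reviewers": [...]},
    restricted to patches reviewed before the current one (illustrative schema).
    Returns the share of the reviewer's prior reviews that were authored by
    the current patch author."""
    reviewed = [p for p in history if reviewer in p["reviewers"]]
    if not reviewed:
        return 0.0
    for_author = sum(1 for p in reviewed if p["author"] == author)
    return for_author / len(reviewed)
```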
5.2 Results
Table 5 shows that our mixed-effects models achieve
an AUC of 0.80 and 0.76 and a statistically significant
overall LR χ² improvement over a null model for both
the OpenStack and Qt datasets. Our models also achieve
a very small AUC optimism value, indicating that our
models achieve a good and stable fit to the data.
Visible information about a patch under review has a
stronger association with the evaluation decision than
patch characteristics. Table 5 shows that 9 out of the
11 metrics that capture the visible information about a
patch under review contribute a significant amount of
explanatory power to our models. Moreover, Table 5
shows that the LR χ² values for our studied metrics account for 63% and 50% of the overall LR χ² in the OpenStack and Qt models, respectively. On the other hand, the LR χ² values of the patch characteristics only account for 15% and 26% of the overall LR χ². This result
indicates that our proposed metrics that capture the vis-
ible information about a patch under review are highly
associated with the evaluation decision of a reviewer in
our models. This result suggests that, beyond the patch characteristics with which a reviewer should be concerned, the evaluation decision of a reviewer tends to be more
related to the visible information in a code review tool.
The relationship with the patch author is highly
associated with the evaluation decision of a reviewer.
Table 5 shows that there is a significant positive as-
sociation between the proportion of reviewed patches
for the patch author and the likelihood of providing a
positive vote by a reviewer. In particular, the proportion
of reviewed patches for the patch author contributes
a relatively large amount of explanatory power which
accounts for 19% (2022/10713) and 16% (107/690) of the overall LR χ² in the OpenStack and Qt models, respectively.
This result indicates that the more prior patches that the
reviewer under study had reviewed for the patch author
in the past, the more likely that the reviewer will provide
a positive vote to the patch under review.
The proportion of prior positive votes is significantly
associated with the evaluation decision of a reviewer.
Table 5 shows that the explanatory power of the pro-
portion of prior positive votes accounts for 15% (1658/10713) and 2% (12/690) of the overall LR χ² in the OpenStack and
Qt models, respectively. Furthermore, the proportion of
prior positive votes has a positive association with the
likelihood of providing a positive vote by a reviewer,
indicating that the more co-reviewers who provided
positive votes before the reviewer under study provides
a vote, the more likely that the reviewer under study
will provide a positive vote to the patch under review.
Similarly, the proportion of prior negative votes con-
tributes a significant amount of explanatory power to
the Qt model. The negative association indicates that
the reviewer under study is more likely to provide
a negative vote when more co-reviewers provided a
negative vote.
The proportion of prior reviewer comments is sig-
nificantly associated with the evaluation decision of
a reviewer. Table 5 shows that the proportion of prior
reviewer comments contributes a relatively large amount
of explanatory power to the OpenStack and Qt models.
Table 5 also shows that the proportion of prior reviewer
comments has a negative association with the likelihood
of providing a positive vote by a reviewer. This result
indicates that the more comments that were provided
by co-reviewers to the recent patch revision before the
TABLE 5: Statistics of mixed-effects models. The goodness of fit for the metrics of interest (LR χ²) is shown as a proportion of the overall LR χ². (Values are reported as OpenStack; Qt.)
Response variable (#TRUE/FALSE): 198,314/61,275; 36,867/4,394
Overall LR χ²: 10,713***; 690***
AUC (|optimism|): 0.80 (0.005); 0.76 (0.011)
Variance of random effect: 1.07; 0.75
Intercept χ²: 2765***; 918***
(Dimension) Visible information about a patch under review, LR χ²: 6,800*** (63%); 345*** (50%)
(Feedback) %Positive votes, χ²: 1,658*** (+); 12*** (+)
(Feedback) %Negative votes, χ²: 59*** (-)
(Feedback) %Prior comments, χ²: 473*** (-); 32*** (-)
(Status) %Prior positive votes of core developers, χ²: 121*** (+); 0 (+)
(Status) Is the patch author a core member, χ²: 609*** (+); 0 (+)
(Relationship) %Reviewed patches for the patch author, χ²: 2,022*** (+); 107*** (+)
(Relationship) Avg. co-working frequency with prior positive co-reviewers, χ²: †; †
(Relationship) Avg. agreement level with prior positive co-reviewers, χ²: †; †
Patch characteristics, LR χ²: 654*** (15%); 180*** (26%)
#Added lines, χ²: 53*** (-); 4 (-)
#Modified directories, χ²: 2 (-)
Entropy, χ²: 750*** (-); 90*** (-)
Is the patch author major author, χ²: 260*** (+); 53*** (+)
Is the patch author major reviewer, χ²: 146*** (+); 8** (+)
Patch description length, χ²: 448*** (-); 19*** (-)
Statistical significance: *** p < 0.001, ** p < 0.01, * p < 0.05; otherwise p ≥ 0.05.
† Variable removed due to the correlation analysis.
reviewer under study provides a vote, the less likely that
the reviewer will provide a positive vote to the patch
under review.
In the OpenStack dataset, the core member status also
contributes explanatory power to the evaluation decision
of a reviewer. Table 5 shows that the proportion of prior
positive vote from core members and the core member
status of the patch author contribute a significant amount
of explanatory power to the OpenStack models. This re-
sult suggests that in OpenStack, the evaluation decision
of a reviewer may also relate to the core member status
of prior reviewers and the patch author.
5.3 Review Dynamics and Reviewer Characteristics
Although we find that our metrics that capture the
visible information about a patch under review have a
significant association with the evaluation decision of
a reviewer, such an association may not apply to all
reviewers. For example, expert reviewers may be more
able to identify a hidden problem in a patch than novice
reviewers [36, 57, 73]. Hence, we further investigate
whether reviewer characteristics influence the associa-
tions between our metrics that capture visible informa-
tion about a patch under review and the evaluation
decision of a reviewer.
Similar to prior work [79], we add an interaction
term between the metrics of interest (i.e., the proportion
of prior positive votes, prior reviewer comments, and
reviewed patches for the patch author) and reviewer
characteristics into our mixed-effects models. We use
three metrics that we use to measure reviewer character-
istics (see Table 2). We then refit and analyze our mixed-
effects logistic regression models.
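As an illustration (with hypothetical variable names), an interaction term enters the model formula as shown below; "*" expands to both main effects plus their product, and the reviewer random intercept is kept as in the original models fitted in R.

```python
# Sketch of a model formula with interaction terms between each metric of
# interest and a reviewer characteristic (all variable names are hypothetical).
formula = (
    "is_positive_vote ~ prior_positive_votes * reviewer_reviewing_experience"
    " + prior_comments * reviewer_reviewing_experience"
    " + reviewed_patches_for_author * reviewer_reviewing_experience"
    " + entropy + patch_description_length"        # confounding factors (subset)
)
# The random effect (1 | reviewer_id) is added exactly as in Analysis 1.
```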
We find that the overall LR χ² values of our new mixed-effects models that include reviewer characteristics and their interaction terms decrease by 6% ((10021 - 10713)/10713 for OpenStack) and increase by 13% ((784 - 690)/690 for Qt) from the original models in Table
5. However, the AUC values remain the same as the
original mixed-effects models. These results indicate re-
viewer characteristics have a relatively small impact on
the association between the visible information about
a patch under review and the evaluation decision of a
reviewer.
Summary: Visible information about a patch under
review has a significant association with the evaluation
decision. The proportion of prior positive votes, prior
reviewer comments, and the proportion of reviewed
patches for the patch author share a significant asso-
ciation with the evaluation decision of a reviewer.
6 ANALYSIS 2: IMPACT OF REVIEW DYNAMICS ON SOFTWARE QUALITY
The results of Analysis 1 show that our metrics that
capture the visible information about a patch under
review (i.e., feedback and relationship with the patch
author) are highly associated with the evaluation deci-
sion of a reviewer. For example, we find that the more
co-reviewers who provided positive votes before the
reviewer under study provides a vote, the more likely
that the reviewer under study will provide a positive
vote to the patch under review. Yet, little is known about
the risk that such review dynamics (i.e., a practice of
providing an evaluation decision by a reviewer in a code
review) can have on software quality. Hence, we set
out to investigate the relationship between the review
dynamics and defect-proneness.
TABLE 6: An overview of metrics for measuring review dynamics in a patch.

Finding from Analysis 1: Higher %prior positive votes → more likely to provide a positive vote.
Metric: %Consistent with prior positive votes — the proportion of reviewers who provided a positive vote when there is at least one prior positive vote.

Finding from Analysis 1: Higher %prior negative votes → less likely to provide a positive vote.
Metric: %Inconsistent with prior negative votes — the proportion of reviewers who provided a positive vote when there is at least one prior negative vote.

Finding from Analysis 1: Lower %prior reviewer comments → more likely to provide a positive vote.
Metric: %Consistent with prior reviewer comments — the proportion of reviewers who provided a positive vote when there are no prior reviewer comments.

Finding from Analysis 1: Higher %prior positive votes of core developers → more likely to provide a positive vote.
Metric: %Consistent with core positive voters — the proportion of reviewers who provided a positive vote when there is at least one prior positive vote by a core member.

Finding from Analysis 1: The patch author is a core member → more likely to provide a positive vote.
Metric: %Positive votes for the core member — the proportion of reviewers who provided a positive vote when the patch author is a documented core member.

Finding from Analysis 1: Higher %reviewed patches for the patch author → more likely to provide a positive vote.
Metric: %Strong relationship with the patch author — the proportion of reviewers who provided a positive vote and have a strong relationship with the patch author (i.e., %reviewed patches for the patch author is above the 80th percentile).
TABLE 7: Statistics of defect models. The goodness of fit for the metrics of interest (LR χ2) is shown as a proportion of the Overall LR χ2.

                                                 OpenStack         Qt
  #TRUE/FALSE Instances                          17,924/17,962     4,857/14,030
  AUC (|Optimism|)                               0.75 (0.001)      0.70 (0.0002)
  Overall LR χ2                                  6,887***          1,853***
  Review dynamics metrics LR χ2                  225*** (3%)       17* (1%)
    %Consistent with prior positive votes χ2     22*** (+)         1 (-)
    %Inconsistent with prior negative votes χ2   2 (+)             3 (-)
    %Following other comments χ2                 0 (-)
    %Consistent with core positive voters χ2     40*** (-)         4 (+)
    %Positive votes for the core member χ2       158*** (-)        2 (+)
    %Strong relationship with patch author χ2    14*** (+)         0 (-)
  Reviewing Activities LR χ2                     1,383*** (15%)    576*** (30%)
  Patch Characteristics LR χ2                    3,161*** (46%)    1,017*** (53%)

Statistical significance: *** p < 0.001, ** p < 0.01, * p < 0.05; entries without a marker are not statistically significant (p ≥ 0.05).
6.1 Approach
Based on the results of Analysis 1, we define six metrics
for measuring review dynamics. Table 6 describes our
six metrics. We hypothesize that the higher the metric
value, the more likely that a patch will be defective.
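As an illustration of how one of these metrics can be operationalized, the sketch below computes %Consistent with prior positive votes for a single patch from its chronologically ordered vote scores; the exact operationalization in our study may differ in details (e.g., the handling of author self-votes), so this is a simplified example:

    # Illustrative sketch (simplified): "%Consistent with prior positive votes"
    # for one patch, given its vote scores in the order they were cast
    # (+1/+2 are positive, -1/-2 are negative).
    def pct_consistent_with_prior_positive(votes):
        eligible = 0    # reviewers who voted after at least one prior positive vote
        consistent = 0  # ...and who themselves voted positively
        seen_positive = False
        for score in votes:
            if seen_positive:
                eligible += 1
                if score > 0:
                    consistent += 1
            if score > 0:
                seen_positive = True
        return consistent / eligible if eligible else None

    # The last two reviewers vote after a prior +2 and both vote positively: 2/2 = 1.0
    print(pct_consistent_with_prior_positive([-1, 2, 1, 1]))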
To test our hypotheses, we analyze defect models that
control for confounding factors. We build simple logistic regression models to estimate the likelihood of a patch inducing fixes. Note that we only studied merged
patches in Analysis 2 since we link the patches to the
commits in VCS in order to identify fix-inducing patches.
The review and patch characteristics described in Table
2 are used as confounding factors in our models. The
response variable is assigned as TRUE if a patch induces
future fixes, and FALSE otherwise. Finally, we build and
analyze our defect models as described in Section 3.4.
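For illustration, a minimal sketch of such a defect model is shown below; the column names are hypothetical stand-ins for our review dynamics metrics (Table 6) and confounding factors (Table 2), and the actual model construction and validation follow Section 3.4:

    # Illustrative sketch only; column names are hypothetical stand-ins.
    # fix_inducing is a 0/1 indicator of whether the patch induced future fixes.
    import pandas as pd
    import statsmodels.formula.api as smf

    patches = pd.read_csv("merged_patches.csv")  # hypothetical patch-level dataset

    formula = (
        "fix_inducing ~ pct_consistent_prior_positive"
        " + pct_inconsistent_prior_negative"
        " + pct_strong_relationship"
        # confounding factors: reviewing activities and patch characteristics
        " + num_reviewers + num_review_comments + churn + num_files"
    )
    defect_model = smf.logit(formula, data=patches).fit()
    print(defect_model.summary())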
6.2 Results
Table 7 shows that our defect models achieve AUC values of 0.75 (OpenStack) and 0.70 (Qt) and a significant overall LR χ2 improvement over a null model. These results indicate that our defect models perform better than random guessing. Table 7 also shows that our models yield a very small AUC optimism, indicating that our models are stable enough for interpretation.
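For readers unfamiliar with the bootstrap-based optimism estimate [21], the following sketch outlines the procedure; it uses scikit-learn and a plain logistic regression as hypothetical stand-ins for the model construction described in Section 3.4:

    # Illustrative sketch of the bootstrap optimism estimate for AUC (Efron [21]).
    # X is a NumPy feature matrix and y a binary label vector; the learner is a
    # hypothetical stand-in for the actual model construction.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def auc_optimism(X, y, n_boot=100, seed=0):
        rng = np.random.default_rng(seed)
        optimisms = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y), len(y))  # resample with replacement
            if len(np.unique(y[idx])) < 2:
                continue  # skip degenerate resamples with a single class
            model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
            auc_boot = roc_auc_score(y[idx], model.predict_proba(X[idx])[:, 1])
            auc_orig = roc_auc_score(y, model.predict_proba(X)[:, 1])
            optimisms.append(auc_boot - auc_orig)  # optimism of this iteration
        return float(np.mean(optimisms))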
Although our review dynamics metrics are significantly associated with defect-proneness even after controlling for several confounding factors, the explanatory power of our review dynamics metrics is relatively small. Table 7 shows that our review dynamics metrics have a significant association with the likelihood of a patch inducing fixes (i.e., the p-value of LR χ2 < 0.05). However, we observe that the review dynamics metrics do not contribute as much explanatory power as the review and patch characteristics do. The LR χ2 values of the review dynamics metrics account for only 3% and 1% of the overall LR χ2 of the OpenStack and Qt models, respectively.
We also check the proportion of patches that induced fixes among patches with a high proportion of votes consistent with prior positive votes (i.e., above the 80th percentile) for the OpenStack dataset. Then, we compare this proportion with the proportion of patches that induced fixes among patches with a low proportion of votes consistent with prior positive votes (i.e., below the 20th percentile). We find that there is little difference in the proportion of patches that induced fixes, i.e., 50% for the high metric values and 49% for the low metric values. These results suggest that despite the association between visible information about a patch under review and the evaluation decision, such associations have little impact on software quality.
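The comparison above can be reproduced with a few lines of code; the sketch below assumes a hypothetical patch-level table with the metric value and a fix-inducing flag:

    # Illustrative sketch (hypothetical column names): fix-inducing rate among
    # patches with a high vs. low "%Consistent with prior positive votes" value.
    import pandas as pd

    patches = pd.read_csv("openstack_patches.csv")  # hypothetical input
    metric = patches["pct_consistent_prior_positive"]

    high = patches[metric > metric.quantile(0.80)]
    low = patches[metric < metric.quantile(0.20)]

    print("fix-inducing rate (high):", high["fix_inducing"].mean())
    print("fix-inducing rate (low): ", low["fix_inducing"].mean())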
Summary: Despite the associations between visible
information about a patch under review and the eval-
uation decision, the associations between the review
dynamics metrics and the likelihood of inducing fixes
are not as strong as those of confounding factors.
7 DISCUSSION
We now discuss broader implications of our results
with regard to our three dimensions about the visible
information about a patch under review in a code review.
7.1 Feedback
Our results provide empirical evidence that reviewers
tend to adhere to community opinions when evaluating
a new proposed patch. In particular, the results of our
Analysis 1 show that a reviewer is more likely to provide
a positive vote if the patch has received many prior
positive votes. We also find that the proportion of prior
reviewer comments shares a significant relationship with
the likelihood that a reviewer provides a positive vote.
These results suggest that feedback (i.e., votes and com-
ments) of prior reviewers tends to affect an evaluation
decision of a reviewer. The code review guidelines of the studied projects suggest that reviewers focus on the technical content.19 Hence, hiding the review discussion and feedback from reviewers until they provide their first feedback on the patch may allow them to focus more on the technical content of the patch. This recommended practice is also consistent with traditional code inspection, where reviewers are required to independently study the code changes before starting the review meeting [22].
Nevertheless, prior work shows that collaborative
code reviews bring many benefits to software projects [4,
13, 42]. Studies in social science also argue that the
social processes underlying the peer review process can
increase the effectiveness of reviews [8, 47]. Our results of Analysis 2 also show that the review dynamics have a relatively small impact on software quality. Therefore,
we believe that configuring the code review tool to be
open and transparent should maximize the benefits of
performing code reviews.
7.2 Status
While several studies report that the status of team
members has an impact on code review participation
and outcome [9, 59, 80], our Analysis 1 shows that the core member status of the patch author and co-reviewers shares a significant association with an evaluation decision in the OpenStack dataset but not in the Qt dataset
(see Table 5). One possible explanation is that reviewers
may have their own strategies in reviewing. A survey by
Bosu et al. reports that some reviewers prefer to evaluate patches of developers who are known to propose good patches, while other reviewers may focus on patches of newcomers [13].
19. https://docs.openstack.org/doc-contrib-guide/docs-review-guidelines.html
7.3 Relationship
Our Analysis 1 shows that the proportion of reviewed patches for the patch author shares a significant association with the evaluation decision of a reviewer (see Table 5). Similarly, prior work finds that the historical cooperation of a reviewer with the patch author often affects the decision of whether to accept a review request [13]. These results suggest that the relationship of a reviewer with the patch author can affect the evaluation decision of the reviewer.
In addition, our Analysis 2 shows that the relationship
metrics are associated with the likelihood of a patch
inducing fixes in OpenStack (see Table 7). Based on our findings, it is recommended that a reviewer should be careful not to give undue weight to patches from patch authors who have a strong relationship with the reviewer. Hiding
the patch author name before the reviewer provides the
first feedback might allow the reviewers to focus on
the technical content. Nevertheless, hiding the author name may hinder many of the collaborative aspects of the review process [13]. Hence, additional studies are needed to further understand the impact of hiding the author name.
8 THREATS TO VALIDITY
External Validity. We perform a case study on two open source projects. Although the OpenStack and Qt projects are commonly used as case studies in prior research [11–13, 39, 43, 50, 74, 75, 83], the results may not generalize to
all software projects and all settings of the code review
process. However, the goal of this paper is not to build a theory that applies to all projects, but rather to shed light on the fact that in some projects, the visible information about a patch under review can affect the evaluation decision of a reviewer. Nonetheless, additional replication studies would help to generalize our findings. To foster future replication of our study, a replication package of our work is available online.1
Construct Validity. We extract a vote score from
review messages using regular expressions in order to
calculate metrics that capture visible information about
a patch under review. There might be a case where the
data extraction is inaccurate. However, the patterns for which we search come from the activity log that is automatically generated by the Gerrit code review tool.20 Moreover, we
manually validate the extraction results to ensure that
our regular expressions accurately identify a vote score.
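To make the extraction concrete, the sketch below shows a regular expression in the spirit of the ones we use; the exact Gerrit message format can vary across versions and configurations, so the pattern is illustrative rather than our exact implementation:

    # Illustrative sketch: extracting a Code-Review vote score from a Gerrit
    # review message. The message format may vary across Gerrit versions, so
    # this pattern is an example rather than the exact one used in our study.
    import re

    VOTE_PATTERN = re.compile(r"Code-Review([+-]\d)")

    def extract_vote(message):
        """Return the Code-Review score (e.g., 2 or -1), or None if absent."""
        match = VOTE_PATTERN.search(message)
        return int(match.group(1)) if match else None

    print(extract_vote("Patch Set 3: Code-Review+2"))   # 2
    print(extract_vote("Patch Set 1: Code-Review-1"))   # -1
    print(extract_vote("Patch Set 2: just a comment"))  # None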
We use the SZZ algorithm [64] and keyword search
to identify fix-inducing patches. There might be a case
where some fix-inducing patches are unidentified or
the number of fix-inducing patches is inflated by the algorithm [16, 84]. An approach that improves the accuracy of the SZZ algorithm [16, 82] may further improve the accuracy of our results.
20. https://gerrit-review.googlesource.com/Documentation/intro-gerrit-walkthrough.html
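For readers unfamiliar with SZZ [64], the sketch below illustrates its core idea: blame the pre-fix revision of the lines removed by a fix commit, and flag the blamed commits as fix-inducing candidates. The keyword heuristic and helper interface are simplified assumptions, and extracting the removed line numbers from the diff is omitted for brevity:

    # Illustrative sketch of the SZZ idea (simplified): identify fix commits by
    # keywords, then "git blame" the parent revision of the lines the fix removed.
    import re
    import subprocess

    FIX_KEYWORDS = re.compile(r"\b(fix(e[sd])?|bug|defect)\b", re.IGNORECASE)

    def is_fix_commit(message):
        return bool(FIX_KEYWORDS.search(message))

    def blame_removed_lines(repo, fix_commit, path, removed_line_numbers):
        """Return the commits that last touched the removed lines
        (candidates for fix-inducing commits)."""
        inducing = set()
        for line in removed_line_numbers:
            out = subprocess.run(
                ["git", "-C", repo, "blame", "-L", f"{line},{line}",
                 f"{fix_commit}^", "--", path],
                capture_output=True, text=True, check=True,
            ).stdout
            inducing.add(out.split()[0].lstrip("^"))  # blamed commit hash
        return inducing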
When we examine the relationship between review
dynamics and software quality, we focus on the review
dynamics in merged reviews. The discovered relation-
ships may be impacted when considering review dy-
namics in abandoned reviews. However, to determine
whether a review induced fixes in the historical data,
we need to link the review to the associated commit in
VCS. Since the abandoned reviews in the code review
tool are not integrated into the VCS, we cannot deter-
mine whether the abandoned reviews induced fixes or
not. Hence, we can only use merged reviews when we
investigate whether the review dynamics have an impact
on software quality.
Internal Validity. Our results are derived from the
statistical models that we fitted to our data. However, the
relationship in our statistical models does not represent
the causal effects of the review dynamics on the evalua-
tion decision and on software quality. Hence, qualitative
or experimental studies are needed in order to better
understand the reasons and effects of review dynamics.
There might be other factors that influence the evalu-
ation decision of a reviewer. For example, prior survey
studies report that the quality of a proposed patch has an
impact on the evaluation decision of the patch [13, 35].
Yet, the quality of a patch can be defined in many
aspects, e.g., code correctness, readability, and maintain-
ability [35]. In our study, we address this concern by
using patch characteristics that are commonly known
to share a link to the evaluation decision of the patch
and software quality as confounding factors in our mod-
els [7, 26, 32, 43, 73].
Nonetheless, it might still be possible that some patches are already of good quality at their first submission. In that case, many reviewers would provide positive votes, which can be considered as providing a vote consistent with the prior positive voters. To address this concern, we use one-sided Mann-
Whitney U tests to examine whether the proportion of
reviewers who provided a consistent vote with prior
positive voters (i.e., %positive votes) of the patches
with one revision is statistically greater than that of the
patches with multiple revisions. We find that %positive
votes of the patches with one revision is not significantly
greater than that of the patches with multiple revisions.
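For illustration, this test can be carried out as sketched below; the column names and input file are hypothetical:

    # Illustrative sketch (hypothetical column names): one-sided Mann-Whitney U
    # test of whether single-revision patches have greater %positive votes than
    # multi-revision patches.
    import pandas as pd
    from scipy.stats import mannwhitneyu

    patches = pd.read_csv("patches.csv")  # hypothetical input
    single = patches.loc[patches["num_revisions"] == 1, "pct_positive_votes"]
    multi = patches.loc[patches["num_revisions"] > 1, "pct_positive_votes"]

    stat, p_value = mannwhitneyu(single, multi, alternative="greater")
    print(f"U={stat:.0f}, p={p_value:.3f}")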
An overfit model may exaggerate spurious rela-
tionships between explanatory and response variables.
Babyak pointed out that automated variable selection
(e.g., forward stepwise selection), pretesting of candidate
predictors by checking the univariate relation between
each variable and the response, and dichotomization
of continuous variables can pose a considerable risk
for spurious findings in models [3]. To mitigate this
concern, we follow the modeling approach of Harrell
Jr. to carefully avoid the likelihood of overfitting in our
models [29].
9 CONCLUSIONS
Code review processes nowadays are often performed in a transparent environment, where various kinds of information (e.g., review comments) are visible, in order to enhance active and timely collaboration. However, such
visible information may have an impact on code review
practices and their effectiveness. Hence, in this paper,
we first investigate the review dynamics, i.e., a practice
of providing an evaluation decision by a reviewer. To
do so, we define eight metrics that capture the visible
information about a patch under review and analyze
their associations with the likelihood that a reviewer
will provide a positive vote to accept the patch. Then,
we further investigate whether the uncovered review
dynamics are associated with defect-proneness. Through
a case study of the OpenStack and Qt projects, we find
that:
– The proportion of prior positive votes and prior reviewer comments are highly associated with the evaluation decision of a reviewer.
– The co-working frequency of the reviewer with the patch author also has a positive association with the likelihood of providing a vote to accept a patch.
– While we have controlled for several confounding factors, the review dynamics metrics, e.g., the proportion of reviewers who provided a positive vote and the proportion of reviewers who have a strong relationship with the patch author, have a relatively small association with defect-proneness.
The key contribution of this work is to highlight that
the visible information in the code review tools implicitly
influences the evaluation decision of a reviewer. Our
findings are derived from two large open source projects
(OpenStack and Qt) which actively use tool-based code
review processes. However, the goal of this paper is
not to define a wide ranging theory that holds for
every setting. We expect reviewing practices to vary
from project to project based on the social and technical
norms of the team. Nevertheless, we believe that our findings would still be of value to teams in managing and improving their code review processes.
REFERENCES
[1] G. S. Aikenhead, “Collective decision making in the social context
of science,” Science Education, vol. 69, no. 4, pp. 453–475, 1985.
[2] A. Aurum, H. Petersson, and C. Wohlin, “State-of-the-Art: Soft-
ware Inspections After 25 Years,” Software Testing, Verification and
Reliability, vol. 12, no. 3, pp. 133–154, 2002.
[3] M. A. Babyak, “What You See May Not Be What You Get: A
Brief, Nontechnical Introduction to Overfitting in Regression-Type
Models,” Psychosomatic Medicine, vol. 66, no. 3, pp. 411–421, 2004.
[4] A. Bacchelli and C. Bird, “Expectations, Outcomes, and Chal-
lenges Of Modern Code Review,” in ICSE, 2013, pp. 712–721.
[5] G. Bavota and B. Russo, “Four Eyes Are Better Than Two: On the
Impact of Code Reviews on Software Quality,” in ICSME, 2015,
pp. 81–90.
[6] O. Baysal, O. Kononenko, R. Holmes, and M. W. Godfrey, “The
Secret Life of Patches: A Firefox Case Study,” in WCRE, 2012, pp.
447–455.
[7] ——, “Investigating Technical and Non-Technical Factors Influ-
encing Modern Code Review,” EMSE, vol. 21, no. 3, pp. 932–959,
2015.
[8] A. G. Bedeian, “Peer review and the social construction of knowl-
edge in the management discipline.” Academy of Management
Learning & Education, vol. 3, no. 2, pp. 198 – 216, 2004.
[9] C. Bird, A. Gourley, P. Devanbu, M. Gertz, and A. Swaminathan,
“Mining Email Social Networks,” in MSR, 2006, pp. 137–143.
[10] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu,
“Don’t Touch My Code! Examining the Effects of Ownership on
Software Quality,” in ESEC/FSE, 2011, pp. 4–14.
[11] A. Bosu and J. Carver, “Impact of Developer Reputation on Code
Review Outcomes in OSS Projects: An Empirical Investigation,”
in ESEM, 2014, pp. 33–42.
[12] A. Bosu, J. Carver, R. Guadagno, B. Bassett, D. McCallum, and
L. Hochstein, “Peer Impressions in Open Source Organizations:
A survey,” JSS, vol. 94, pp. 4–15, 2014.
[13] A. Bosu, J. C. Carver, C. Bird, J. Orbeck, and C. Chockley, “Process
Aspects and Social Dynamics of Contemporary Code Review:
Insights from Open Source Development and Industrial Practice
at Microsoft,” TSE, vol. 43, no. 1, pp. 56–75, 2017.
[14] J. S. Bunderson and R. E. Reagans, “Power, status, and learning in
organizations,” Organization Science, vol. 22, no. 5, pp. 1182–1194,
2011.
[15] J. C. Carver, B. Caglayan, M. Habayeb, B. Penzenstadler, and
A. Yamashita, “Collaborations and Code Reviews,” IEEE Software,
vol. 32, no. 5, pp. 27–29, 2015.
[16] D. A. Da Costa, S. McIntosh, W. Shang, U. Kulesza, R. Coelho,
and A. E. Hassan, “A Framework for Evaluating the Results of the
SZZ Approach for Identifying Bug-Introducing Changes,” TSE,
vol. 43, no. 7, pp. 641–657, 2017.
[17] L. Dabbish, C. Stuart, J. Tsay, and J. Herbsleb, “Social Coding
in GitHub: Transparency and Collaboration in an Open Software
Repository,” in CSCW, 2012, pp. 1277–1286.
[18] M. D’Ambros, M. Lanza, and R. Robbes, “Evaluating defect pre-
diction approaches: A benchmark and an extensive comparison,”
EMSE, vol. 17, no. 4-5, pp. 531–577, 2012.
[19] A. O. De Guinea, J. Webster, and D. S. Staples, “A meta-analysis of
the consequences of virtualness on team functioning,” Information
& Management, vol. 49, no. 6, pp. 301–308, 2012.
[20] M. Deutsch and H. B. Gerard, “A study of normative and infor-
mational social influences upon individual judgment.” The journal
of abnormal and social psychology, vol. 51, no. 3, p. 629, 1955.
[21] B. Efron, “How Biased is the Apparent Error Rate of a Prediction
Rule?” Journal of the American Statistical Association, vol. 81, no.
394, pp. 461–470, 1986.
[22] M. E. Fagan, “Design and Code Inspections to Reduce Errors in
Program Development,” IBM System Journal, vol. 15, no. 3, pp.
182–211, sep 1976.
[23] S. Fuller, Knowledge management foundations. Routledge, 2012.
[24] D. M. German, G. Robles, G. Poo-Caamaño, X. Yang, H. Iida, and K. Inoue, “‘Was my contribution fairly reviewed?’ A Framework to Study the Perception of Fairness in Modern Code Reviews,” in ICSE, no. 2, 2018, pp. 523–534.
[25] C. L. Goues, Y. Brun, S. Apel, E. Berger, S. Khurshid, and
Y. Smaragdakis, “Effectiveness of Anonymization in Double-Blind
Review,” Communications of the ACM, vol. 61, no. 6, pp. 30–33,
2018.
[26] G. Gousios, M. Pinzger, and A. van Deursen, “An Exploratory
Study of the Pull-based Software Development Model,” in ICSE,
2014, pp. 345–355.
[27] I. Hames, Peer Review and Manuscript Management in Scientific
Journals: Guidelines for Good Practice. John Wiley & Sons, 2007.
[28] J. A. Hanley and B. J. McNeil, “The Meaning and Use of the
Area under a Receiver Operating Characteristic (ROC) Curve,”
Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[29] F. E. Harrell Jr., Regression Modeling Strategies: With Application to
Linear Models, Logistic Regression, and Survival Analysis, 2nd ed.
Springer, 2015.
[30] A. E. Hassan, “Predicting Faults Using the Complexity of Code
Changes,” in ICSE, 2009, pp. 78–88.
[31] J. P. Huelsenbeck and K. A. Crandall, “Phylogeny Estimation and
Hypothesis Testing Using Maximum Likelihood,” Annu. Rev. Ecol.
Evol. Syst., vol. 28, pp. 437–466, 1997.
[32] Y. Jiang, B. Adams, and D. M. German, “Will My Patch Make It?
And How Fast? Case Study on the Linux Kernel,” in MSR, 2013,
pp. 101–110.
[33] J. Jiarpakdee, C. Tantithamthavorn, and C. Treude, “Autospear-
man: Automatically mitigating correlated metrics for interpreting
defect models,” in ICSME, 2018, pp. 92–103.
[34] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha,
and N. Ubayashi, “A Large-Scale Empirical Study of Just-in-Time
Quality Assurance,” TSE, vol. 39, no. 6, pp. 757–773, 2013.
[35] O. Kononenko, O. Baysal, and M. W. Godfrey, “Code Review
Quality: How Developers See It,” in ICSE, 2016, pp. 1028–1038.
[36] O. Kononenko, O. Baysal, L. Guerrouj, Y. Cao, and M. W. Godfrey,
“Investigating Code Review Quality: Do People and Participation
Matter?” in ICSME, 2015, pp. 111–120.
[37] M. Lamont and J. Guetzkow, How Quality Is Recognized by Peer
Review Panels: The Case of the Humanities. Cham: Springer
International Publishing, 2016, pp. 31–41.
[38] R. P. Larrick, “The social context of decisions,” Annual review of
organizational psychology and organizational behavior, vol. 3, pp. 441–
467, 2016.
[39] A. Lee, J. C. Carver, and A. Bosu, “Understanding the Impres-
sions, Motivations, and Barriers of One Time Code Contributors
to FLOSS Projects: A Survey,” in ICSE, 2017, pp. 187–197.
[40] S. D. McClurg, “The Electoral Relevance of Political Talk: Exam-
ining Disagreement and Expertise Effects in Social Networks on
Political Participation,” AJPS, vol. 50, no. 3, pp. 737–754, 2006.
[41] J. McConnell, “Mozilla experiment aims to reduce bias in code
reviews,” 2018. [Online]. Available: https://blog.mozilla.org/
blog/2018/03/08/gender-bias-code-reviews/
[42] N. McDonald and S. Goggins, “Performance and Participation in
Open Source Software on GitHub,” in CHI – Extended Abstract,
2013, pp. 139–144.
[43] S. McIntosh and Y. Kamei, “Are Fix-Inducing Changes a Mov-
ing Target ? A Longitudinal Case Study of Just-In-Time Defect
Prediction,” TSE, vol. 44, no. 5, pp. 412 – 428, 2017.
[44] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan, “The Impact
of Code Review Coverage and Code Review Participation on
Software Quality,” in Proceedings of MSR, 2014, pp. 192–201.
[45] ——, “An Empirical Study of the Impact of Modern Code Review
Practices on Software Quality,” EMSE, vol. 21, no. 5, pp. 2146–
2189, 2016.
[46] J. R. Mesmer-Magnus, L. A. DeChurch, M. Jimenez-Rodriguez,
J. Wildman, and M. Shuffler, “A meta-analytic investigation of
virtuality and information sharing in teams,” Organizational Be-
havior and Human Decision Processes, vol. 115, no. 2, pp. 214–225,
2011.
[47] J. A. Minson, J. S. Mueller, and R. P. Larrick, “The contingent
wisdom of dyads: When discussion enhances vs. undermines the
accuracy of collaborative judgments,” Management Science, vol. 64,
no. 9, pp. 4177–4192, 2017.
[48] R. Morales, S. McIntosh, and F. Khomh, “Do Code Review Prac-
tices Impact Design Quality? A Case Study of the Qt, VTK, and
ITK Projects,” in SANER, 2015, pp. 171–180.
[49] ´
A. M ¨
unnich, G. Maksa, and R. J. Mokken, “Collective judgement:
combining individual value judgements,” Mathematical Social Sci-
ences, vol. 37, no. 3, pp. 211–233, 1999.
[50] T. Pangsakulyanont, P. Thongtanunam, D. Port, and H. Iida, “As-
sessing MCR Discussion Usefulness using Semantic Similarity,”
in IWESEP, 2014, pp. 49–54.
[51] C. Pavitt, “An interactive input–process–output model of so-
cial influence in decision-making groups,” Small Group Research,
vol. 45, no. 6, pp. 704–730, 2014.
[52] J. Pissarra and J. C. Jesuino, “Idea generation through computer-
mediated communication: The effects of anonymity,” Journal of
Managerial Psychology, vol. 20, no. 3/4, pp. 275–291, 2005.
[53] A. L. Porter and F. A. Rossini, “Peer review of interdisciplinary
research proposals,” Science, Technology, and Human Values, vol. 10,
no. 3, pp. 33–38, 1985.
[54] P. Radostina K., “Face-to-face versus virtual teams: What have we
really learned?.” The Psychologist-Manager Journal, no. 1, p. 2, 2014.
[55] F. Rahman and P. Devanbu, “How, and why, process metrics are
better,” in ICSE, 2013, pp. 432–441.
[56] P. C. Rigby and C. Bird, “Convergent Contemporary Software
Peer Review Practices,” in ESEC/FSE, 2013, pp. 202–212.
[57] P. C. Rigby, D. M. German, L. Cowen, and M.-a. Storey, “Peer
Review on Open-Source Software Projects: Parameters, Statistical
Models, and Theory,” TOSEM, vol. 23, no. 4, pp. 35:1–35:33, 2014.
[58] P. C. Rigby, D. M. German, and M.-A. Storey, “Open Source
Software Peer Review Practices: A Case Study of the Apache
Server,” in ICSE, 2008, pp. 541–550.
[59] P. C. Rigby and M.-A. Storey, “Understanding Broadcast Based
Peer Review on Open Source Software Projects,” in ICSE, 2011,
pp. 541–550.
[60] C. Sadowski, E. Söderberg, L. Church, M. Sipko, and A. Bacchelli, “Modern Code Review: A Case Study at Google,” in ICSE-Companion, 2018, pp. 181–190.
[61] W. Scacchi, “Free/Open Source Software Development: Recent
Research Results and Methods,” Adv. Electr. Comput. Eng., vol. 69,
no. 6, pp. 243–295, 2007.
[62] M. Seeber and A. Bacchelli, “Does single blind peer review hinder
newcomers?.” SCIENTOMETRICS, vol. 113, no. 1, pp. 567 – 585,
2017.
[63] J. Shimagaki, Y. Kamei, S. McIntosh, A. E. Hassan, and
N. Ubayashi, “A Study of the Quality-Impacting Practices of
Modern Code Review at Sony Mobile,” in ICSE – Companion, 2016,
pp. 212–221.
[64] J. Śliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?” in MSR, 2005, pp. 1–5.
[65] D. Spadini, M. Aniche, and A. Bacchelli, “PyDriller: Python
framework for mining software repositories,” in ESEC/FSE, 2018,
pp. 908–911.
[66] G. Stasser and W. Titus, “Hidden profiles: A brief history,”
Psychological Inquiry, vol. 14, no. 3-4, pp. 304–313, 2003.
[67] M. R. Steenbergen and S. J. Bradford, “Modeling Multilevel Data
Structures,” AJPS, vol. 46, no. 1, pp. 218–237, 2002.
[68] I. Steinmacher, T. U. Conte, M. A. Gerosa, and D. F. Redmiles,
“Social Barriers Faced by Newcomers Placing Their First Contri-
bution in Open Source Software Projects,” in CSCW, 2015, pp.
1379–1392.
[69] E. Suhay, “Explaining Group Influence: The Role of Identity and
Emotion in Political Conformity and Polarization,” Political Behav.,
vol. 37, no. 1, pp. 221–251, 2015.
[70] C. Tantithamthavorn and A. E. Hassan, “An experience report on
defect modelling in practice,” in ICSE-Companion, 2018, pp. 286–
295.
[71] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Mat-
sumoto, “An empirical comparison of model validation tech-
niques for defect prediction models,” TSE, vol. 43, no. 1, pp. 1–18,
2017.
[72] P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, “Inves-
tigating Code Review Practices in Defective Files: An Empirical
Study of the Qt System,” in MSR, 2015, pp. 168–179.
[73] ——, “Revisiting Code Ownership and its Relationship with
Software Quality in the Scope of Modern Code Review,” in ICSE,
2016, pp. 1039–1050.
[74] ——, “Review Participation in Modern Code Review,” EMSE, vol. 22, no. 2, pp. 768–817, 2017.
[75] P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida, and K.-i. Matsumoto, “Who Should Review My Code? A File Location-Based Code-Reviewer Recommendation Approach for Modern Code Review,” in SANER, 2015, pp. 141–150.
[76] P. Thongtanunam, X. Yang, N. Yoshida, R. G. Kula, A. E. Camargo Cruz, K. Fujiwara, and H. Iida, “ReDA: A Web-based Visualization Tool for Analyzing Modern Code Review Dataset,” in ICSME, 2014, pp. 606–609.
[77] A. Tomkins, M. Zhang, and W. D. Heavlin, “Reviewer bias in single- versus double-blind peer review,” PNAS, vol. 114, no. 48, pp. 12708–12713, 2017.
[78] G. D. L. Travis and H. M. Collins, “New light on old boys: Cog-
nitive and institutional particularism in the peer review system,”
Science, Technology, and Human Values, vol. 16, no. 3, pp. 322–341,
1991.
[79] J. Tsay, L. Dabbish, and J. Herbsleb, “Influence of Social and
Technical Factors for Evaluating Contribution in GitHub,” in
ICSE, 2014, pp. 356–366.
[80] ——, “Let’s Talk About It: Evaluating Contributions through
Discussion in GitHub,” in FSE, 2014, pp. 144–154.
[81] B. Vasilescu, Y. Yu, H. Wang, P. Devanbu, and V. Filkov, “Quality
and Productivity Outcomes Relating to Continuous Integration in
GitHub,” in ESEC/FSE, 2015, pp. 805–816.
[82] R. Wu, H. Zhang, S. Kim, and S. Cheung, “ReLink: Recovering
Links between Bugs and Changes,” in FSE/ECSE, 2011, pp. 15–25.
[83] X. Xia, D. Lo, X. Wang, and X. Yang, “Who Should Review This
Change? Putting Text and File Location Analyses Together for
More Accurate Recommendations,” in ICSME, 2015, pp. 261–270.
[84] S. Yatish, J. Jiarpakdee, P. Thongtanunam, and C. Tantithamtha-
vorn, “Mining Software Defects: Should We Consider Affected
Releases?” in ICSE, 2019, pp. 654–665.
Patanamon Thongtanunam is a lecturer at the
School of Computing and Information Systems,
the University of Melbourne, Australia. Prior to
that, she was a research fellow of Japan Soci-
ety for the Promotion of Science (JSPS). She
received PhD in Information Science from Nara
Institute of Science and Technology, Japan. Her
research interests include empirical software en-
gineering, mining software repositories, software
quality, and human aspect. Her research has
been published at top-tier software engineering
venues like International Conference on Software Engineering (ICSE)
and Journal of Empirical Software Engineering (EMSE). More about
Patanamon and her work is available online at http://patanamon.com.
Ahmed E. Hassan is an IEEE Fellow, an ACM
SIGSOFT Influential Educator, an NSERC Stea-
cie Fellow, the Canada Research Chair (CRC) in
Software Analytics, and the NSERC/BlackBerry
Software Engineering Chair at the School
of Computing at Queen’s University, Canada.
His research interests include mining soft-
ware repositories, empirical software engineer-
ing, load testing, and log mining. He received
a PhD in Computer Science from the University
of Waterloo. He spearheaded the creation of the
Mining Software Repositories (MSR) conference and its research com-
munity. He also serves/d on the editorial boards of IEEE Transactions on
Software Engineering, Springer Journal of Empirical Software Engineer-
ing, and PeerJ Computer Science. Contact: ahmed@cs.queensu.ca.
More information at: http://sail.cs.queensu.ca/.
Double-blind review relies on the authors' ability and willingness to effectively anonymize their submissions. We explore anonymization effectiveness at ASE 2016, OOPSLA 2016, and PLDI 2016 by asking reviewers if they can guess author identities. We find that 74%-90% of reviews contain no correct guess and that reviewers who self-identify as experts on a paper's topic are more likely to attempt to guess, but no more likely to guess correctly. We present our findings, summarize the PC chairs' comments about administering double-blind review, discuss the advantages and disadvantages of revealing author identities part of the way through the process, and conclude by advocating for the continued use of double-blind review.