A Study of the Quality-Impacting Practices of
Modern Code Review at Sony Mobile
Junji Shimagaki
Sony Mobile, Japan
Junji.Shimagaki@
sonymobile.com
Yasutaka Kamei
Kyushu University, Japan
kamei@ait.kyushu-
u.ac.jp
Shane McIntosh
McGill University, Canada
shane.mcintosh@mcgill.ca
Ahmed E. Hassan
Queen’s University, Canada
ahmed@cs.queensu.ca
Naoyasu Ubayashi
Kyushu University, Japan
ubayashi@ait.kyushu-u.ac.jp
ABSTRACT
Nowadays, a flexible, lightweight variant of the code review
process (i.e., the practice of having other team members cri-
tique software changes) is adopted by open source and pro-
prietary software projects. While this flexibility is a blessing
(e.g., enabling code reviews to span the globe), it does not
mandate minimum review quality criteria like the formal
code inspections of the past. Recent work shows that lax
reviewing can impact the quality of open source systems.
In this paper, we investigate the impact that code review-
ing practices have on the quality of a proprietary system
that is developed by Sony Mobile. We begin by replicating
open source analyses of the relationship between software
quality (as approximated by post-release defect-proneness)
and: (1) code review coverage, i.e., the proportion of code
changes that have been reviewed and (2) code review partic-
ipation, i.e., the degree of reviewer involvement in the code
review process. We also perform a qualitative analysis, with
a survey of 93 stakeholders, semi-structured interviews with
15 stakeholders, and a follow-up survey of 25 senior engi-
neers. Our results indicate that while past measures of re-
view coverage and participation do not share a relationship
with defect-proneness at Sony Mobile, reviewing measures
that are aware of the Sony Mobile development context are
associated with defect-proneness. Our results have led to
improvements of the Sony Mobile code review process.
CCS Concepts
•Software and its engineering → Software testing
and debugging; Software defect analysis; •General and
reference → Empirical studies; Metrics;
Keywords
Code review, software quality
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
ICSE ’16 Companion, May 14-22, 2016, Austin, TX, USA
© 2016 ACM. ISBN 978-1-4503-4205-6/16/05...$15.00
DOI: http://dx.doi.org/10.1145/2889160.2889243
1. INTRODUCTION
Code review is recognized as an effective strategy to dis-
cover and fix software defects before a set of proposed code
changes are integrated into the codebase. In 1976, Fagan [9]
formalized the code inspection process, which mandates that
reviewers follow checklists and participate in group meetings
with the author and other stakeholders. Although code
inspections have been shown to be effective at detecting
errors during requirements analysis, design, and
implementation [32], their rigid nature makes them difficult to
adopt in today’s globally distributed, rapidly releasing
software projects [31].
Unlike the formal code inspections of the past, modern
code review is lightweight and flexible. Broadcast-based peer
review proceeds asynchronously and is broadly adopted by
Open Source Software (OSS) projects, e.g., Apache [26], that
welcome contributions from developers that span the globe.
Recent advances in tool support for code review (e.g.,
Gerrit¹) have enabled tighter integration of code review with
version control and issue tracking systems [25].
This flexibility of modern code review is both a blessing
and a curse. On the one hand, modern code reviewing pro-
cesses can easily scale out to support globally distributed
software teams. On the other hand, modern code reviews do
not mandate review checklists or in-person meetings, which
guaranteed a base level of reviewer participation in the code
inspection process of the past. Indeed, recent work shows
that lax code review practices can impact software quality in
large OSS projects. Reviewer involvement is known to share
a relationship with software code quality [17, 18] and soft-
ware design quality [21] in four large OSS projects. More-
over, Thongtanunam et al. [30] find that reviewers tend to
be less careful in the files that will eventually have defects.
Yet, little is known about the impact that lax reviewing
practices may have on systems that are developed in a
proprietary setting. There are several differences in
software project characteristics between OSS and proprietary
projects that may affect the prior findings. For example,
unlike the global and asynchronous development of OSS teams,
proprietary software teams are often colocated. Herbsleb and
Grinter [13] found that these colocated development teams
of proprietary software projects often use face-to-face
(offline) communication to make project decisions. Aranda and
Venolia [2] have shown that the software repositories of large
proprietary software teams often omit (or contain erroneous)
collaboration information because of the face-to-face nature
of colocated collaboration.
¹ https://www.gerritcodereview.com/
Table 1: An overview of the Sony Mobile project in
comparison to the previously studied projects [18].
                | Our Project | Qt v5.1 | VTK 5.10
# Commits       | 20,000      | 7,106   | 1,431
# Authors       | 1,000       | 422     | 55
# Components    | 500-1,500   | 1,337   | 170
# Reviewers     | 1,000       | 348     | 45
Review coverage | 81%         | 96%     | 39%
In this paper, we re-examine McIntosh et al.’s prior study [18]
in a proprietary setting at Sony Mobile. In addition to
replicating the quantitative analysis of the prior work, we
perform an extensive qualitative analysis, which includes:
(a) a survey of 93 stakeholders at Sony Mobile, (b) semi-
structured interviews with 15 of these stakeholders, and (c)
a follow-up survey of 25 senior software engineers. The cen-
tral question of our qualitative analysis is: “Why are certain
reviewing practices associated with better software quality?”
Triangulation of statistical analysis with stakeholder intu-
ition shows that the degree of reviewer involvement does
indeed have an impact on software quality.
This paper makes the following contributions:
• Identifying quality-impacting review practices in a
  proprietary development setting.
• An empirically grounded improvement plan for the
  code reviewing practices at Sony Mobile.
Paper organization. In Section 2, we introduce the code
integration process at Sony Mobile. In Sections 3 and 4,
we revisit the relationship between the degree of code re-
viewer involvement and software quality. In Section 5, we
present our survey of 93 stakeholders. Section 6 triangulates
our findings with those of related work to generate recom-
mendations about quality-impacting code review practices.
Section 7 discloses the threats to the validity of our study.
Finally, Section 8 draws conclusions.
2. CODE INTEGRATION AT SONY MOBILE
In this section, we explain the characteristics of the stud-
ied project, the Gerrit code review tools, and how code
changes are integrated into the software product.
2.1 Studied Project
Table 1 provides an overview of the studied project. The
project is an embedded system under development that is
derived from the Android codebase with built-in original
apps. The system runs on a smartphone device that is com-
mercially released. The system is developed by a network of
colocated teams in Asia, Europe, and North America.
The system consists of two types of components: those
that are developed in-house (e.g., Sony Mobile apps or
Android extensions) and those that originate from external
codebases (e.g., Android² and Qualcomm³). Roughly 30%
of the components of the studied project are developed
externally. Hence, a portion of the development effort at Sony
Mobile is dedicated to the integration of changes in those
external components while active development of Sony Mobile
components is underway.
² https://source.android.com/
³ http://codeaurora.org/
Figure 1: The Sony Mobile code integration process.
Furthermore, unlike typical OSS projects where contribu-
tors are globally distributed, the studied project is developed
by a network of colocated teams in Asia, Europe, and North
America. The colocated nature of much of the development
allows team members to interact with one another in-person
to a much larger extent than many OSS teams.
2.2 Code Integration Processes
At Sony Mobile, internally developed code must be
critiqued by other team members using the Gerrit code review
tool. Gerrit is a web-based code review tool that is broadly
adopted by OSS and proprietary projects.⁴
Figure 1 provides an overview of the Gerrit-enabled code
integration process at Sony Mobile. At Sony Mobile, there
are integration processes for: (a) internal branches, (b) ex-
ternal branches, and (c) official releases. We describe each
of the integration processes below.
a. Internal integration process. Developers fetch the
latest code from an internal branch (a-1), modify the code,
and make necessary local commits. Local commits must be
uploaded for review using Gerrit (a-2), where other
developers review the changes. For a commit to obtain submission
privileges (a-3), reviewers must provide positive review and
verification scores. If a negative score is given, the author must address
the feedback of the reviewers by creating a new commit and
requesting a re-review. After a commit has been granted
submission privileges, the Gerrit system allows the author
to submit the commit to the main project repository (a-4).
b. External integration process. Unlike internal
integration, external integration is handled by system
integrators. The system integrators have the permission to execute
the direct push operation,⁵ which is designed to bypass the
code review process. At Sony Mobile, this direct push
operation is only available to senior team members.
To perform a direct push, a system integrator first down-
loads the latest code from an external repository to their
personal workspace (b-1). Then, the integrator pushes the
latest code to the external branch in the internal code repos-
itory (b-2). Next, system integrators upload merge commits
to the Gerrit code review process (b-3). These merge
commits integrate the externally developed code into the
internal Sony Mobile branch. Note that while merge commits
are carefully reviewed to check for conflicts between internal
and external branches, cleanly merging external code churn
is often overlooked. Finally, merge commits can be
submitted after submission privileges are granted (a-4).
⁴ http://blogs.collab.net/git/why-gerrit-is-important-for-enterprise-git
⁵ https://gerrit-review.googlesource.com/Documentation/access-control.html#category_push_direct
c. Release integration process. As development
proceeds, a so-called release branch is created (see Figure 1).
In addition to the code review process that is applied to
internal code, these release branches are more strictly
monitored than development branches, and only urgent fixes are
granted submission privileges on these branches.
3. REPLICATION STUDY SETUP
We quantitatively analyze the historical code changes and
code review data of a large scale commercial project. We do
so by re-examining two research questions originally posed
by McIntosh et al. [18] regarding code review practices:
RQ1 Is there a relationship between code review cov-
erage and post-release defects?
RQ2 Is there a relationship between code review par-
ticipation and post-release defects?
Similar to the prior study [18], we address these RQs
by analyzing components of the studied system. To iden-
tify components in the Sony Mobile environment, we adopt
the modular programming concept suggested by Parnas [24]:
“...it should be possible to make drastic changes to one mod-
ule without a need to change others. ...”. We use the commit
activity of personnel who belong to a team, and the directory
structure of the Sony Mobile system to identify components.
We briefly describe each classification type below.
Role-based: Components are classified by connecting the
team personnel data that is recorded in human re-
sources databases with the commit activity data that
is recorded in the main project repository. This con-
nected data maps every Git commit to a team. Those
commits that are recorded by a team are considered
to be impacting one component.
Directory-based: Those commits that cannot be connected
to the activity of a team are split into components us-
ing the directory structure of the system. These com-
ponents are defined using the top-level directory name.
A similar method was applied in the prior work [18].
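The two-step classification above can be illustrated with a minimal Python sketch. It is not the tooling used in the study: the `team_of_author` lookup is a hypothetical stand-in for the human resources data, the `team:` and `dir:` labels are invented for readability, and assigning a commit that touches several top-level directories to each of them is an assumption that the text does not settle.

```python
from pathlib import PurePosixPath


def classify_commit(author, changed_paths, team_of_author):
    """Return the component label(s) that one commit is attributed to.

    author          -- commit author identifier
    changed_paths   -- repository-relative paths touched by the commit
    team_of_author  -- dict mapping author -> team (hypothetical HR lookup)
    """
    team = team_of_author.get(author)
    if team is not None:
        # Role-based: a commit recorded by a team impacts that one component.
        return [f"team:{team}"]
    # Directory-based fallback: use the top-level directory name.
    # Assumption: a commit spanning several top-level directories counts
    # toward each of them.
    top_levels = {PurePosixPath(path).parts[0] for path in changed_paths}
    return [f"dir:{name}" for name in sorted(top_levels)]
```

For example, a commit by an author who appears in the team lookup is labelled with that team, while a commit by an unknown author that touches `kernel/` and `vendor/` is attributed to both directory-based components.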
Next, we define the software metrics that we use to quan-
tify code review practices. Then, we describe our statistical
approach to construct and analyze regression models that
explain the incidence of post-release defects.
3.1 Software Metrics
In this paper, we study the relationship between several
potential quality-impacting metrics and the incidence of post-
release defects. Table 2 provides an overview of the studied
metrics. We define each metric below.
Defects. Similar to several studies [6, 18, 23], we discover
incidences of defects by scanning corresponding fixes for
defects. We focus on defect fixing activity that occurs on (or is
merged into) the release branch. Prior defects are defects
that were fixed before the product was released, while post-
release defects are defects that were discovered in the field
and fixed as part of a software update (see Figure 1).
Baseline metrics. In the literature, several metrics have
been shown to share a relationship with software quality.
To control for those confounding factors, we include them
as baseline metrics. Size and Complexity are measured
by reading the source code at the time of software release.
Churn, Relative Churn, and Entropy are measures of
the code change process. Total, Major, Minor, and
Ownership capture the degree of module responsibility that the
authors of a change have.
Review coverage metrics. We measure the proportion
of code change in a component that underwent code review.
Reviewed Commit is calculated by treating each commit
as a discrete unit of equal value. Reviewed Churn is
calculated by treating each changed line as a discrete unit of
equal value. As mentioned in Section 2.2, there is a
substantial amount of external code in flux (see Figure 1). To
account for this external code, we introduce In-House, i.e.,
the proportion of internally developed commits.
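A minimal sketch of how the three coverage ratios could be computed for one component, assuming that each commit record already carries its churn, a flag for whether a matching Gerrit review exists, and a flag for internal authorship. The `Commit` record and its field names are hypothetical and are not taken from the study's tooling.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    churn: int       # changed lines in this commit
    reviewed: bool   # True if a matching Gerrit review record exists
    in_house: bool   # True if the commit was developed internally


def coverage_metrics(commits):
    """Compute Reviewed Commit, Reviewed Churn, and In-House for one component."""
    if not commits:
        raise ValueError("a component needs at least one commit")
    total_commits = len(commits)
    total_churn = sum(c.churn for c in commits)
    reviewed_churn = sum(c.churn for c in commits if c.reviewed)
    return {
        # Every commit counts as one unit.
        "reviewed_commit": sum(c.reviewed for c in commits) / total_commits,
        # Every changed line counts as one unit (0.0 if nothing changed).
        "reviewed_churn": reviewed_churn / total_churn if total_churn else 0.0,
        # Proportion of internally developed commits.
        "in_house": sum(c.in_house for c in commits) / total_commits,
    }
```

The two review ratios can diverge: one large unreviewed commit lowers Reviewed Churn far more than it lowers Reviewed Commit.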
Review participation metrics. We study review partici-
pation along three dimensions: (1) the existence of a review
that was written by other team members, i.e., the number
of commits that were approved (Self Approval) or veri-
fied (Self Verify) by the authors themselves; (2) the time
spent reviewing code, i.e., the Review Window and the
proportion of Hastily reviewed changes; and (3) the effort
that was invested in improving the change, i.e., the number
of comments in the review discussion (Discuss Length),
the proportion of reviews without discussions (No Discuss),
and the churn of a change during its review (Patch Sd).
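Three of these participation measures are simple enough to sketch directly. The `Review` record below is hypothetical, and the use of the population standard deviation in the Patch Sd helper is an assumption, since the text does not say which estimator is used.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Review:
    author: str
    approvers: List[str]        # engineers who gave a positive review score
    comment_count: int          # comments in the review discussion
    revision_churn: List[int]   # churn of each uploaded revision, in order


def self_approval(reviews):
    """Number of changes whose only approver is the author."""
    return sum(1 for r in reviews if set(r.approvers) == {r.author})


def no_discuss(reviews):
    """Number of changes submitted without any review comment."""
    return sum(1 for r in reviews if r.comment_count == 0)


def normalized_sd(revision_churn):
    """Standard deviation of the revision churn divided by its mean."""
    average = mean(revision_churn)
    # Assumption: population standard deviation; 0.0 when the mean is zero.
    return pstdev(revision_churn) / average if average else 0.0


def patch_sd(reviews):
    """Sum of the normalized standard deviations over all changes."""
    return sum(normalized_sd(r.revision_churn) for r in reviews)
```

A change that is never revised, or whose revisions all have the same churn, contributes zero to Patch Sd, so the metric grows only when patches actually change during review.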
3.2 Model Construction
Similar to the prior study [18], we adopt the statistical
approach of Harrell Jr. [10]. While previous work uses mul-
tiple linear regression models [18], we use logistic regression
models because the proportion of components with multi-
ple post-release defects is too low for counting models or
linear fits. For our logistic fits, we label the components
that include at least one post-release defect as defective.
Conversely, those components that are free of post-release
defects are labelled as clean.
Normality adjustment. For highly skewed metrics, we
apply a logarithmic transformation to lessen the impact of
outliers. We apply this logarithmic transformation to Size,
Churn, and Relative Churn, which have high variance
values ($10^5$).
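The labelling rule and the transformation can be written down in a few lines. This is only an illustrative sketch: adding 1 before taking the logarithm is an assumption that keeps zero-valued metrics defined, since the text states only that a logarithmic transformation is applied.

```python
import math


def label_component(post_release_defects):
    """Label a component for the logistic fit."""
    return "defective" if post_release_defects >= 1 else "clean"


def log_transform(value):
    """Dampen the influence of outliers in a skewed metric.

    Assumption: log(1 + x) rather than log(x), so that zero stays finite.
    """
    return math.log1p(value)
```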
Variable reduction. We perform a two-step correlation
analysis to identify variables that are too highly correlated
to include in the same model. The first step is to calculate
the Spearman rank correlation between each pair of explana-
tory variables. We use hierarchical clustering to visualize the
Spearman correlation values. Similar to prior work [18, 30],
we consider a cluster of variables that has a Spearman cor-
relation value of at least 0.7 to be too highly correlated to
include together in the same model. We select one variable
from each such cluster to include in the model.
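The first step can be sketched in plain Python as follows. Two simplifications are assumptions rather than details from the text: the absolute value of the Spearman coefficient is compared against the threshold, and variables are grouped by linking every pair that reaches it (equivalent to cutting a single-linkage clustering at that level). Constant-valued metrics are not handled.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlated_groups(metrics, threshold=0.7):
    """Group metric names whose pairwise |rho| reaches the threshold."""
    parent = {name: name for name in metrics}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for a, b in combinations(metrics, 2):
        if abs(spearman(metrics[a], metrics[b])) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for name in metrics:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

One variable would then be kept from every group that contains more than one name.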
In the second step, we examine how well each variable can
be explained using a combination of the other variables. A
variable that can be well-explained using other variables is
redundant. We use the redundancy check implemented in
the redun function of the rms R package [11], which builds
models to predict the value of each explanatory variable
using the others. If the fit of a model for an explanatory
variable has an $R^2$ value of at least 0.9 (the default threshold
of the redun function), the variable is considered redundant
and is excluded from our defect models.
Table 2: An overview of the studied software metrics.
Group | Metric | Log. | Description
Base | Prior Defects | | Number of prior defects [32].
Base | Size | ✓ | Lines of code.
Base | Churn | ✓ | Total amount of changed lines of code [22].
Base | Relative Churn | ✓ | Churn normalized by Size [22].
Base | Total | | Number of unique committers of a component [6].
Base | Major | | Number of unique committers whose commits represent more than 5% of all commits [6].
Base | Minor | | Number of unique committers whose commits represent less than 5% of all commits [6].
Base | Ownership | | Highest commit occupation ratio of the Major committers [6].
Base | Entropy | | A measure of churn diversity among files in a component [12]. Calculated by $-\sum_{i}^{L_c} \frac{p_i \log_2 p_i}{\log_2 L_c}$, where $L_c$ is the churn of a component and $p_i$ is the churn fraction of file $i$.
Base | Complexity | | Summation of McCabe’s cyclomatic complexity number over files.
RQ1 | Reviewed Commit | | Ratio of reviewed commits. Calculated by dividing the number of commits in the Gerrit code review system by that in git log [17, 18].
RQ1 | Reviewed Churn | | Ratio of reviewed churn. Calculated in the same manner as Reviewed Commit [17, 18].
RQ1 | In-House† | | Ratio of internal contribution in the entire history of the current branch. Calculated in the same manner as Reviewed Commit.
RQ2 | Self Approval | | Number of commits that are self-approved (i.e., when the engineer who authored a code change is the only engineer who provided a positive review score that grants submission privileges) [17, 18].
RQ2 | Self Verify† | | Number of commits that are self-verified (i.e., when the engineer who authored a code change is the only engineer who actually tested it and provided a positive verification score).
RQ2 | Review Window | | Time interval between the upload of a commit and its submission [17, 18]. The median value across the changes to a component is used.
RQ2 | Hastily | | Number of commits that are hastily reviewed [17, 18].
RQ2 | Discuss Length | | Average summed length of review comments on a commit until its submission [17, 18].
RQ2 | No Discuss | | Number of commits that are submitted without any review comments [17, 18].
RQ2 | Patch Sd† | | Summation of normalized standard deviations of a patch churn until it is submitted. For example, if a change is revised 5 times with churn of (45,45,43,49,49) respectively, then its standard deviation is 2.9 and the mean is 46.6. In this case, Patch Sd is 2.9/46.6=.06.
Abbreviations (Definition): Log. (Logarithmic transformation), † (Newly added metrics in this study)
Model simplification. While the surviving metrics are
not highly correlated or redundant, they may not contribute
to the explanatory power of our defect models. To evaluate the
contribution of our metrics, we examine the reduction in
explanatory power between a preliminary model that contains
all explanatory variables and another preliminary model that
has all but one explanatory variable under test. Explanatory
power of each preliminary model is estimated using the
AIC (Akaike Information Criterion) [1]. If the AIC worsens
after removing a variable, the variable is said to have
an impact on the fit of the preliminary model and is
retained for our final model fit. Otherwise, if the AIC does
not worsen after removing a variable, it is excluded from
our final model fit. This process is repeated until the model
formula reaches a saturated form. We use MASS::stepAIC
(with dir="backward" option) to evaluate the AIC.
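The control flow of this backward elimination can be sketched independently of any statistics package. The sketch below is not a reimplementation of MASS::stepAIC: the `aic` argument is a hypothetical callback that fits a model on the given variables and returns its AIC (lower is better), and ties are resolved in favour of the smaller model, following the "does not worsen" wording above.

```python
def backward_eliminate(variables, aic):
    """Drop variables one at a time while doing so does not worsen the AIC.

    variables -- iterable of variable names in the preliminary model
    aic       -- hypothetical callback: tuple of names -> AIC of that model
    Returns the surviving variables and the AIC of the final model.
    """
    current = list(variables)
    best = aic(tuple(current))
    while current:
        # AIC of every model that omits exactly one remaining variable.
        candidates = [
            (aic(tuple(v for v in current if v != dropped)), dropped)
            for dropped in current
        ]
        candidate_aic, dropped = min(candidates)
        if candidate_aic > best:
            break  # every removal worsens the AIC, so all variables stay
        current.remove(dropped)
        best = candidate_aic
    return current, best
```

With a real model, `aic` would refit the logistic regression for each candidate formula; the toy callback used to exercise the sketch simply charges 2 per variable against a fixed deviance reduction.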
3.3 Model Analysis
Similar to the prior work [18], we analyze our models from
two perspectives: (1) model performance and (2) the impact
of each explanatory variable on the model performance.
Model performance. We analyze the performance of
constructed models using the discrimination index
$D_{xy} = 2(c - 0.5)$ [7, 10], where $c$ is the Area Under the
receiver operating characteristic Curve (AUC). The AUC is a
threshold-independent performance metric that measures a
classifier’s ability to discriminate between defective and clean
components (i.e., do the defective components tend to have higher
predicted probabilities than clean ones?). AUC is computed
by measuring the area under the curve that plots the true
positive rate against the false positive rate, while varying the
threshold that is used to determine whether a component is
classified as defective or clean. Values of AUC range from
0 (worst performance) through 0.5 (random guessing performance)
to 1 (best performance). Therefore, $D_{xy}$ values range from
$-1$ (worst performance) through 0 (random guessing performance)
to 1 (best performance). Furthermore, Hosmer et
al. [7] state that $D_{xy} \geq 0.4$ can be considered as acceptable
discrimination and $D_{xy} \geq 0.6$ as excellent discrimination.
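The relationship between the AUC and the discrimination index can be checked with a small sketch. It computes the AUC through its pairwise interpretation (the share of defective/clean pairs in which the defective component receives the higher predicted probability, with ties counted as one half), which is equivalent to the area under the ROC curve. The function names are illustrative.

```python
def auc(labels, scores):
    """AUC from pairwise comparisons; labels are 1 (defective) or 0 (clean)."""
    defective = [s for label, s in zip(labels, scores) if label == 1]
    clean = [s for label, s in zip(labels, scores) if label == 0]
    if not defective or not clean:
        raise ValueError("both classes are needed to compute the AUC")
    wins = sum(
        1.0 if d > c else 0.5 if d == c else 0.0
        for d in defective
        for c in clean
    )
    return wins / (len(defective) * len(clean))


def dxy(labels, scores):
    """Discrimination index: 2 * (AUC - 0.5)."""
    return 2 * (auc(labels, scores) - 0.5)
```

Read in the other direction, a discrimination index of 0.637 corresponds to an AUC of 0.637 / 2 + 0.5, i.e., roughly 0.82.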
The $D_{xy}$ is inherently overestimated because the model is
fitted and tested using the same data. To take model
stability into account, we subtract the bootstrap-derived
optimism [8] from the $D_{xy}$. The bootstrap technique draws
$n$ samples from the original dataset of size $n$ with
replacement. This procedure is repeated $B$ times to create $B$ new
datasets. In each of the $B$ datasets, the same model formula
is applied but coefficients as well as confidence intervals are
re-calculated. Using the bootstrap models, $D_{xy}$ is calculated
using both the bootstrap dataset and the original dataset.
Then, the difference $\Delta D_{xy}$ between the bootstrap $D_{xy}$ and
the original $D_{xy}$ is computed. Finally, the optimism is
computed by taking the average of the $\Delta D_{xy}$ values across the
$B$ iterations. In our study, we use $B = 1{,}000$ iterations.
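The procedure above can be sketched generically. The `fit` and `score` arguments are hypothetical callbacks (fit a model on a dataset; evaluate a fitted model on a dataset, e.g., by its discrimination index), so the sketch shows only the resampling logic and not the logistic regression itself. The fixed random seed is an assumption added for reproducibility.

```python
import random


def optimism(data, fit, score, B=1000, seed=0):
    """Average gap between a bootstrap model's score on its own sample
    and its score on the original data."""
    rng = random.Random(seed)
    n = len(data)
    gaps = []
    for _ in range(B):
        # Draw n observations from the original data with replacement.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        model = fit(sample)
        gaps.append(score(model, sample) - score(model, data))
    return sum(gaps) / B


def optimism_corrected(data, fit, score, B=1000, seed=0):
    """Apparent score on the original data minus the estimated optimism."""
    apparent = score(fit(data), data)
    return apparent - optimism(data, fit, score, B=B, seed=seed)
```

A small optimism means that models refitted on resampled data perform almost as well on the original data as on their own sample, which is why it is read as a sign of a stable fit.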
The explanatory power of our metrics. While the $D_{xy}$
evaluates the model fits, we would like to estimate the
impact that each explanatory variable has on our model
performance. To that end, we show the coefficients, standard
error, and $\chi^2$ values of the explanatory variables in our fits.
In the logistic regression model, the antilog of a coefficient
is equivalent to the variable’s odds ratio. Thus, a large
coefficient indicates that an explanatory variable makes a large
contribution to the performance of a model.
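As a worked illustration of the antilog relationship, the helper below converts a logistic regression coefficient into an odds ratio. The function name and the `delta` parameter are illustrative; for log-transformed metrics, a one-unit change refers to the transformed scale.

```python
import math


def odds_ratio(coefficient, delta=1.0):
    """Multiplicative change in the odds of being defective when the
    variable increases by `delta` units, all else being equal."""
    return math.exp(coefficient * delta)
```

For instance, a coefficient of +0.41 (the value reported for Prior Defects in Table 4) gives an odds ratio of about 1.5, i.e., each additional unit multiplies the odds of a component being defective by roughly one and a half, while a coefficient of 0 leaves the odds unchanged.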
4. REPLICATION STUDY RESULTS
In this section, we present the results of our baseline model
fits and our two research questions.
4.1 Baseline Model
Before we discuss the impact of review metrics on software
quality, we first analyze the performance of a baseline model,
which is trained using metrics that are known to share a
relationship with defect-proneness (e.g., Size and Churn).
Table 3: The performance of our defect models.
Model | Dxy original (Diff. with Base) | Optimism | Dxy corrected (Diff. with Base)
Base  | .637                           | .023     | .614
RQ1   | .695 (+.058)                   | .026     | .669 (+.055)
RQ2   | .718 (+.081)                   | .045     | .674 (+.060)
Table 4: The relationship between explanatory variables
and defect-proneness.
Group | Metric        | Coef. | S.E. | $\chi^2$ | Pr(> $\chi^2$)
Base  | Size          | +0.32 | .09  | 4.01     | .0453
Base  | Churn         | -0.32 | .09  | 4.16     | .0413
Base  | Total         | +0.07 | .02  | 8.61     | .0033
Base  | Prior Defects | +0.41 | .12  | 14.02    | .0002
RQ1   | In-House      | -1.44 | .36  | 15.61    | .0001
RQ2   | Patch Sd      | -0.12 | .04  | 8.77     | .0031
RQ2   | Self Verify   | +0.02 | .01  | 6.26     | .0124
Baseline model construction. We first perform our cor-
relation analysis, where we select Total instead of Owner-
ship and Major because Total is easier to interpret. We
then perform our redundancy analysis, where we find that
Minor and Relative Churn are redundant. Finally, we
perform our model simplification, where we find that En-
tropy and Complexity provide insignificant amounts of
explanatory power. Four metrics survive our model
construction steps: Size, Churn, Total, and Prior Defects.
Baseline model analysis. Table 3 shows that our base-
line model achieves a Dxy of 0.637, i.e., excellent discrimi-
nation [7]. The optimism value that we derive from 1,000
samples is small (0.023), which indicates that our model fit
is robust. The optimism-corrected Dxy is 0.614, which still
falls within the range of excellent discrimination [7].
Table 4 shows the contribution of each explanatory met-
ric in our model fit. We observe that Prior Defects,
Size, and Total have a positive impact on defect-proneness,
while surprisingly, Churn has a negative impact. We elab-
orate on the counterintuitive nature of the relationship be-
tween defect-proneness and Churn in Section 5.4.
Our baseline model achieves excellent discrimination,
with a Dxy of 0.637. Prior Defects, Size, and Total
share a strong, increasing relationship with
defect-proneness, while Churn shares a strong, decreasing
relationship with defect-proneness.
4.2 Review Coverage (RQ1)
To address RQ1, we add our review coverage metrics to
the baseline model and check whether the fit is improved.
Model construction (RQ1). In correlation analysis, we
select Reviewed Churn instead of Reviewed Commit be-
cause it is a more exact measure of the amount of code that
was reviewed. While redundancy analysis does not identify
any problematic metrics, model simplification shows that
Reviewed Churn does not provide a significant amount of
explanatory power. In summary, only the In-House review
coverage metric survives our model construction steps.
Model analysis (RQ1). Table 3 shows that our new model
achieves a Dxy of 0.695, outperforming the baseline model
by 0.058. The optimism value is also small (0.026), which in-
dicates that our model fit is robust. The optimism-corrected
Dxy is 0.669, which still provides excellent discrimination [7].
Table 4 shows the contribution of the In-House metric.
We observe that it has a large negative impact on defect-
proneness. In Section 5.2, we study why only In-House
contributes to our model fits.
Although our review coverage model outperforms our
baseline model, of the three studied review coverage met-
rics, only the proportion of In-House contributions con-
tributes significantly to our model fits.
Comparison with previous work. Similar to the
prior work [18], we find that Reviewed Commit and
Reviewed Churn provide little explanatory power, sug-
gesting that other reviewing factors are at play.
4.3 Review Participation (RQ2)
To address RQ2, we add our review participation metrics
to the RQ1 model and check if the model fit is improved. We
use the RQ1 model as the baseline to control for coverage.
Model construction (RQ2). Correlation analysis does
not identify any problematic pairs of explanatory variables.
Redundancy analysis reveals that No Discuss should be
excluded. Model simplification reveals that Discuss Length,
Hastily, Self Approval, and Review Window provide
insignificant amounts of explanatory power.
In short, we find that two review participation metrics sur-
vive our preliminary analyses: Patch Sd and Self Verify.
Interestingly, we find that metrics that measure code review-
ing time do not contribute distinct or significant amounts of
explanatory power. We investigate this more in Section 5.3.
Model analysis (RQ2). Table 3 shows that our new model
achieves a Dxy of 0.718, outperforming the baseline model
by 0.081. On the other hand, the optimism value is slightly
higher at 0.045, which suggests that our new model fit is less
robust. Still, the optimism-corrected Dxy is 0.674, which is
regarded as excellent discrimination [7].
Table 4 shows the impact of the Patch Sd and Self Verify
metrics. We observe that Patch Sd has a negative impact
on defect-proneness, while Self Verify has a positive im-
pact. In Section 5.3, we qualitatively study what kinds of
code reviewing practices are driving these results.
Our review participation model also outperforms our
baseline model. Of the studied review participation met-
rics, only the measure of accumulated effort to im-
prove code changes (Patch Sd) and the rate of author
self-verification (Self Verify) contribute significantly
to our model fits.
Comparison with previous work. Unlike the prior
work [18], code reviewing time and discussion length did
not provide explanatory power to the Sony Mobile model.
5. QUALITATIVE ANALYSIS
The purpose of our qualitative analysis is to elicit expert
opinion about our empirical results in Section 4. This analysis
consists of three parts (see Figure 2): (1) a preliminary
questionnaire (93 stakeholders), (2) follow-up interviews (6
sessions, 15 participants), and (3) a second questionnaire to
validate our findings (25 senior developers).
Figure 2: Approach for our qualitative analysis.
5.1 Methodology
We use the data collection and analysis approaches that
are proposed by Seaman [27] to derive implications. We
finally validate these implications with stakeholders. Figure
2 provides an overview of our qualitative analysis approach,
which is composed of three parts:
1. Pre-data collection. The main objective of the
pre-data collection is to identify stakeholders who have a breadth
of experience in software development and/or management.
Our aim is to triangulate our statistical results with stake-
holder experience. We do so by explaining our statistical
analysis and its results, and collecting and analyzing the
stakeholder feedback.
First, we held a result sharing meeting to explain our
statistical analysis and its results (1a in Figure 2). We invited
300 members of the software development team at Sony
Mobile to our meeting, of whom 93 attended (31%).
Table 5 shows the summary of attendee profiles. The pre-
sentation took 20 minutes to elaborate on the research back-
ground, analysis methods, results, and findings.
Next, we issued a questionnaire to all of the attendees
(1b). The questionnaire covered the stakeholder’s technical
background, impressions about the results, and their will-
ingness to participate in an individual interview.
When selecting interviewees, we focused on stakeholders
who made at least one pertinent comment on review met-
rics. The first author’s background in software development
at Sony Mobile also helped to identify key stakeholders. We
selected 5 software engineers and 1 project manager as
interviewees (O-1). The interviewees were asked to bring one or
more colleagues from their team to provide a more objective
perspective. In total, we interviewed 15 stakeholders.
2. Interviews. We conducted semi-structured interviews
to uncover the code review practices at Sony Mobile. Semi-
structured interviews begin with a set of prepared questions,
but the structure of the interview is flexible, allowing the
interviewer to dig into unexpected answers from the inter-
viewees by developing new questions during the session [27].
Many of the questions are directly connected with our re-
search interests, such as: “Why is self-verification a common
practice in your team?”, or “Why do you think that in-house
components developed by your team tend to have fewer de-
fects than external components?”. All of the conversations
were recorded and coded by the interviewer and were later
checked by the interviewees for correctness.
Table 5: Profiles of respondents.
          Software    Testing / Quality    Project
          Engineer    Assurance            Manager    Total
Staff     69          7                    5          81
Manager   6           4                    2          12
Total     75          11                   7          93
When analyzing the six interview transcripts, we grouped
team responses according to the prepared questions. We se-
lected the recurrent responses that were provided by at least
three teams (i.e., 50% of the interviewed teams) for further
validation in our secondary questionnaire. Using these
recurring responses, we formulated six implications (O-2).
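The selection rule above (keep a response only if at least three of the six interviewed teams gave it) can be sketched as follows. This is a minimal illustration, not the authors' tooling: the team names and response labels are hypothetical placeholders, and the function name `recurrent_responses` is our own.

```python
from collections import defaultdict


def recurrent_responses(team_responses, min_teams=3):
    """Return the responses that were given by at least `min_teams` distinct teams."""
    teams_by_response = defaultdict(set)
    for team, responses in team_responses.items():
        for response in responses:
            teams_by_response[response].add(team)
    return sorted(
        response
        for response, teams in teams_by_response.items()
        if len(teams) >= min_teams
    )


# Hypothetical coded responses for six teams (labels are invented for illustration).
team_responses = {
    "team1": ["bias in self-testing", "offline meetings"],
    "team2": ["bias in self-testing", "priority drives review order"],
    "team3": ["bias in self-testing", "offline meetings"],
    "team4": ["offline meetings"],
    "team5": ["priority drives review order"],
    "team6": [],
}

recurrent = recurrent_responses(team_responses, min_teams=3)
print(recurrent)
```

With six teams, a threshold of three corresponds to the 50% cut-off stated in the text; responses mentioned by only two teams are dropped.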
3. Validation. We validate the implications that we derive
from our semi-structured interviews with a follow-up
questionnaire, which was filled out by 25 senior software
engineers. In analyzing the follow-up questionnaire, we check
how many of the respondents agree with the derived
implications (O-3).
The validation questionnaire presents the respondents with
a list of our derived implications (presented in the
rectangular boxes of Sections 5.2 to 5.4) and asks whether
each implication agrees with their expertise. We calculate the
ratio of respondents who agree with each implication out of
all respondents who answered. We exclude blank answers when
calculating the ratio because these blank answers indicate
that the respondents do not have the necessary expertise to
answer the question.
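The ratio calculation described above can be sketched in a few lines. The example uses the counts reported for implication A in Section 5.2 (15 agree, 5 disagree); the 5 blank answers are our inference from the 25 questionnaire respondents, and the function name and the encoding of blanks as `None` are assumptions made for illustration.

```python
def agreement_ratio(answers):
    """Share of 'agree' among non-blank answers, or None if every answer is blank.

    Blank answers are encoded as None and are excluded from the denominator.
    """
    answered = [a for a in answers if a is not None]
    if not answered:
        return None
    return sum(a == "agree" for a in answered) / len(answered)


# Implication A: 15 agree, 5 disagree; 5 blanks assumed (25 respondents in total).
answers = ["agree"] * 15 + ["disagree"] * 5 + [None] * 5
ratio = agreement_ratio(answers)
print(f"{ratio:.0%}")
```

Excluding blanks means the denominator is 20 rather than 25, which is why 15 agreeing respondents correspond to 75% and not 60%.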
5.2 Implications on Review Coverage
We qualitatively analyze review coverage from software
quality and developer perspectives.
A) Why is In-House associated with software quality,
while Reviewed Commit and Reviewed Churn are not?
Motivation. In-House represents the same review coverage
concept as Reviewed Commit and Reviewed Churn. The only
difference is the measurement period: In-House considers
the entire history of the branch. Hence, we are
interested in understanding why In-House shares a stronger
relationship with software quality than the other review cov-
erage metrics.
Discussion. One project manager warns that external commits
need special attention:
“Things go wrong when different systems are combined and
integrated together. No matter how external commits ever
got positive review scores in their code review system, we
have to be aware that each system has its own test environ-
ment, approach, review criteria, which are different from
ours.” - Project manager, 5 years
As mentioned in Section 2.2, an internal commit in the Ger-
rit system needs positive review scores before it is granted
submission privileges. Although the integration of external
code follows the same review process, the individual external
commits are not reviewed by Sony Mobile engineers.
In-House measures the proportion of unreviewed commits
during the entire history of the component, whereas
Reviewed Churn and Reviewed Commit include the external
code changes that were merged during the current project
period. We suspect that this difference allows In-House to
capture software quality more accurately than the other two
metrics in the Sony Mobile context.
In-House captures the amount of unreviewed code in a
more appropriate way for the Sony Mobile context than
Reviewed Commit/Churn metrics do, likely because of
the difference in their measurement periods.
Validation. 15 senior developers (75%) agreed and 5
senior developers (25%) disagreed with this implication.
B) For developers, is it more difficult to improve soft-
ware quality of components with a low In-House rate?
Motivation. In the previous discussion, we studied the im-
pact of In-House on overall software quality. In this ques-
tion, we focus on developers to study whether or not they
are also affected by In-House.
Discussion. A software engineer working on components
of an external codebase explains his experiences:
“An internal codebase is much easier to work with, since I
can discuss with the people who wrote the code. An external
codebase takes more time from me to understand the code
and to develop patches.” - Software engineer, 6 years
Code reviewing provides a mechanism for knowledge sharing [3].
The more a codebase is developed and reviewed by internal
developers, the more knowledge they accumulate about it.
Developers naturally have less information about an external
codebase. We suspect that developers generally have greater
difficulty when they work on
components that have originated from an external codebase.
Since understanding plays a major role in defect repair [29],
we suspect that, if left unchecked, a high dependence on
external codebases may threaten software quality.
Developers require more time and effort to understand,
extend, or repair components with low In-House rates.
Validation. 23 senior developers (92%) agreed and 1
senior developer (4%) disagreed with this implication.
5.3 Implications on Review Participation
In addition to In-House, two of our review participa-
tion metrics have a significant impact on post-release defect-
proneness. In the interviews, we focused on review prac-
tices of different teams to investigate why certain review
practices became common and are associated with software
quality. We formulate a question corresponding to each of
the three studied dimensions of review participation (i.e.,
involvement, time, and discussion).
C) Why are higher rates of self-verification associated
with lower software quality?
Motivation. Our models show that Self Verify is associ-
ated with software quality (see Section 4). We are interested
in checking whether this self-verification trend agrees with
expert opinion.
Discussion. While on the whole, self-verification is a bad
practice, it has some advantages in the Sony Mobile con-
text. For example, an author can verify that the intended
functionality works precisely. On the other hand, an author
is not a good candidate to verify unintended workflows of
the program. One interviewee argues for the value of self-
verification:
“Our components span across a low-level hardware layer to
user-level application. I understand the architecture, and I
am the one who can test my commit properly. Automated
testing is not an option.” - Software engineer, 4 years
Yet, 4 out of 5 teams admitted that the self-verification
practice is coloured by the author’s subjective perspective,
which may bias the test result. However, even if developers
want to delegate the testing work to other engineers, there
are still many barriers to overcome, such as providing a
detailed description of the testing environment and
identifying personnel with the appropriate expertise to carry
out the test. Thus,
self-verification is an alluring shortcut for developers to take.
Self-verified commits may yield biased testing crite-
ria. However, practicalities of the complex development
environment and time pressure of releases make self-
verification an alluring shortcut for developers to take.
Validation. 20 senior developers (87%) agreed and 3
senior developers (13%) disagreed with this implication.
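To make the Self Verify measure concrete, the following sketch computes the share of changes whose verifier is also their author. It is an illustration under stated assumptions rather than the paper's exact definition: the `Change` record, its field names, and the choice of a simple proportion over a list of changes are ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A reviewed code change (hypothetical record, for illustration only)."""
    author: str
    verifier: str  # the engineer who tested (verified) the change


def self_verify_rate(changes):
    """Proportion of changes that were verified by their own author."""
    if not changes:
        return 0.0
    return sum(c.author == c.verifier for c in changes) / len(changes)


# Invented example data: two of the four changes are self-verified.
changes = [
    Change(author="alice", verifier="alice"),
    Change(author="bob", verifier="carol"),
    Change(author="carol", verifier="carol"),
    Change(author="dave", verifier="alice"),
]

rate = self_verify_rate(changes)
print(rate)
```

A component whose changes are mostly tested by their own authors would score close to 1.0 on such a measure, which is the situation the interviewees describe as convenient but prone to bias.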
D) Why do the metrics related to reviewing time have
little impact on our software quality models?
Motivation. The prior study [18] showed that reviewing
time shared a significant relationship with software quality
in 2 of 4 studied releases of OSS systems. In Section 4,
we find that the reviewing time metrics do not contribute
to our model fits. Hence, we are interested in investigating
why this might be the case for the Sony Mobile project.
Discussion. One potential reason for the discrepancy is
provided by a senior engineer:
“I have many reviews in my ‘to-do’ list, but the order in
which I do them is dictated by team priorities, e.g., usually
hot-fixes are reviewed first. Any other factors, e.g., size
or complexity of commits do not matter. Does that give
rise to longer reviewing time for a small and low-priority
commits?” - Senior application software architect, 10 years
The architect points out that issue priority may influence
reviewing time. The Hastily metric may have had a significant
impact on the OSS systems of the prior work [18] because issue
priority is often not set properly or is ignored in many OSS
projects [20]. Commercial projects tend to put more emphasis
on issue priority [16].
Issue priority may cause reviewing time measures to lose
meaning in the Sony Mobile context.
Validation. 25 senior developers (100%) agreed with
this implication.
E) Why is increased Patch Sd associated with lower
software quality?
Motivation. Our models show that Patch Sd is signifi-
cantly associated with post-release defect-proneness (see Sec-
tion 4). Hence, we are interested in investigating why.
Discussion. An interview with a senior software architect
highlights the importance of offline code improvement:
“We have a conventional code review meeting regularly.
When code change spans across different files, it is much
easier to work with direct communication rather than with
the Gerrit tools. We can talk in our language (Japanese)
too unlike in Gerrit.” - Senior platform software architect,
10 years
At Sony Mobile, many developers rely on in-person com-
munication (see Section 2.1) and claim that this is where
code improvement activities take place. We suspect that the
volume of discussion in Gerrit, i.e., Discuss Length, only
captures a limited amount of the code improvement activ-
ity. Hence, our discussion length metrics do not contribute
as much value to our models as Patch Sd does.
Self-improvement is another type of review activity that
Patch Sd captures, yet discussion metrics do not. A large
proportion of commits are revised several times without
reviewer feedback. Two engineers explained their
motivation for updating commits before review feedback:
“I review my code on the browser. The GUI difference with
my local text editor or IDE can make me more attentive to
catch easy mistakes.” - 2 platform software engineers with
3 and 4 years of experience
Self-motivated code improvement is a common phenomenon
in other software projects, too [5, 30]. Self-reviewing
practices also increase the chance of detecting underlying
software defects. Indeed, Patch Sd seems to capture the
efforts of a developer to improve software quality in a more
appropriate way for the Sony Mobile
context than other review participation metrics do.
Patch Sd captures developer effort in a way that is not
diminished by offline review discussion at Sony Mobile.
Validation. 17 senior developers (81%) agreed and 4
senior developers (19%) disagreed with this implication.
5.4 Miscellaneous Findings
Our final finding concerns the baseline metrics, which are
only indirectly related to our research interest in code
review practices.
F) Why is more churn associated with higher software
quality?
Motivation. Nagappan and Ball [22] found that the more
lines of code that churned, the higher the likelihood of de-
fects. In the studied Sony Mobile system, we observe the
opposite, i.e., increases in code churn are associated with
improvement in software quality. We are interested in in-
vestigating this counterintuitive indication of our models.
Discussion. Nagappan and Ball [22] studied a Windows
release (W2k3-SP1), which mainly contains security vulner-
ability fixes and enhancements to pre-installed programs.
Our studied Sony Mobile project has several different
characteristics. For example, the Sony Mobile code changes
also implement new features and apps beyond defect fixes (see
Section 2.1). New features are often contributed to in-house
components. In Section 5.2, we suggested that in-house
components tend to have a more positive impact on software
quality and developers’ knowledge than external
components. A senior application engineer also comments:
“New feature implementations of in-house components have
much larger lines of code than defect-fixes in external com-
ponents because we can write the code quicker.” - Senior
application engineer, 10 years
We suspect that these characteristic differences are at the
heart of the counterintuitive results that we arrive at with
respect to Churn. To investigate our suspicion, we compute
a subset of the total churn that measures only the churn
of defect fixes (Prior Defects Churn). We then use
Prior Defects Churn instead of Churn and refit our models.
Our new model fits show that Prior Defects Churn has a
significant positive impact on post-release defect-proneness,
i.e., increases in Prior Defects Churn are associated with
increases in post-release defect-proneness.
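The derivation of Prior Defects Churn as a subset of total churn can be sketched as below. This is a simplified illustration, not the authors' extraction scripts: we assume churn is counted as lines added plus lines deleted, and the `Commit` record, its field names, and the component names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """A pre-release commit (hypothetical record, for illustration only)."""
    component: str
    lines_added: int
    lines_deleted: int
    is_defect_fix: bool


def churn_by_component(commits, defect_fixes_only=False):
    """Sum churn (added + deleted lines, an assumed definition) per component.

    With defect_fixes_only=True, only commits flagged as defect fixes count,
    which yields the defect-fix subset of the total churn.
    """
    totals = {}
    for c in commits:
        totals.setdefault(c.component, 0)  # keep components with zero defect-fix churn
        if defect_fixes_only and not c.is_defect_fix:
            continue
        totals[c.component] += c.lines_added + c.lines_deleted
    return totals


# Invented example: a large feature commit and two small defect fixes.
commits = [
    Commit("camera", lines_added=400, lines_deleted=20, is_defect_fix=False),
    Commit("camera", lines_added=15, lines_deleted=5, is_defect_fix=True),
    Commit("modem", lines_added=30, lines_deleted=10, is_defect_fix=True),
]

churn = churn_by_component(commits)
prior_defects_churn = churn_by_component(commits, defect_fixes_only=True)
print(churn, prior_defects_churn)
```

In this invented data, the "camera" component has high total churn that is dominated by a feature commit, while its defect-fix churn is small, which mirrors the distinction between new feature work and defect fixes that the discussion draws.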
Characteristics of the types of changes and the target sys-
tem type may be leading to this counterintuitive relation-
ship between code churn and software quality.
Validation. 13 senior developers (68%) agreed and 6
senior developers (32%) disagreed with this implication.
6. ACTION PLANS AND RELATED WORK
Table 6 provides an overview of the findings and
implications. We find that all of the implications related to
code review practices are supported by at least 75% of the
senior engineers who responded to our follow-up questionnaire.
In this section, we discuss how the findings of our study:
(a) have formed an action plan to improve code reviewing
practices at Sony Mobile and (b) fit with the code review
literature. We structure the discussion along the review cov-
erage and participation dimensions of our study.
6.1 Review Coverage
Prior work suggests that coverage of the review process
is important. Fagan [9] and Kemerer and Paulk [14] found
that the introduction of an inspection process that covered
all of the design and code changes led to improvements in
software quality. Tanaka et al. [28] suggest that a software
team should meticulously review each change to the source
code. Bavota and Russo [4] find that unreviewed code is two
times more likely to introduce defects than reviewed code.
On the other hand, recent empirical studies suggest that
review coverage is not the most important characteristic of a
review process. For example, McIntosh et al. [17, 18] found
that review coverage only shares a significant link with the
incidence of post-release defects in two of four studied OSS
releases. Morales et al. [21] found that review coverage only
shares a significant link with software design quality in two of
four studied OSS releases. Furthermore, Meneely et al. [19]
find that the Chromium project enforces a 100% review cov-
erage policy, rendering review coverage moot.
Implications for Sony Mobile. We find that the rate of
churn in external code shares a significant link with software
quality. Moving forward, the quality assurance teams of
Sony Mobile have been made aware of this trend, and steps
are underway to improve test coverage of external code.
6.2 Review Participation
A recent line of work has highlighted the importance of
the investment that review participants make in the code
review process. Bacchelli and Bird [3] find that the mod-
ern code review process has become a mechanism to sup-
port collaborative problem solving. McIntosh et al. [17, 18]
and Morales et al. [21] find that review participation shares
a consistent link with the incidence of post-release defects
and design anti-patterns, respectively. Kononenko et al. [15]
also find that review participation metrics are inversely as-
sociated with the commits that introduce defects.
Indeed, additional perspectives are invaluable in the code
review process. Thongtanunam et al. [30] find that software
modules that involve multiple reviewers in the code review
process tend to be less susceptible to defects than software
modules that involve few reviewers.
Table 6: Summary of findings and implications.
(Ratio is the share of responding senior engineers who agreed.)
Dimension: Review Coverage
  Section 4.2 (Ratio: -): Although our review coverage model
    outperforms our baseline model, of the three studied
    review coverage metrics, only the proportion of In-House
    contributions contributes significantly to our model fits.
  Section 5.2.A (Ratio: 75%): In-House captures the amount of
    unreviewed code in a more appropriate way for the Sony
    Mobile context than Reviewed Commit/Churn metrics do,
    likely because of the difference in their measurement periods.
  Section 5.2.B (Ratio: 92%): Developers require more time and
    effort to understand, extend, or repair components with
    low In-House rates.
Dimension: Review Participation
  Section 4.3 (Ratio: -): Our review participation model also
    outperforms our baseline model. Of the studied review
    participation metrics, only the measure of accumulated
    effort to improve code changes (Patch Sd) and the rate of
    author self-verification (Self Verify) contribute
    significantly to our model fits.
  Section 5.3.C (Ratio: 87%): Self-verified commits may yield
    biased testing criteria. However, practicalities of the
    complex development environment and time pressure of
    releases make self-verification an alluring shortcut for
    developers to take.
  Section 5.3.D (Ratio: 100%): Issue priority may cause
    reviewing time measures to lose meaning in the Sony Mobile
    context.
  Section 5.3.E (Ratio: 81%): Patch Sd captures developer
    effort in a way that is not diminished by offline review
    discussion at Sony Mobile.
Dimension: Baseline
  Section 4.1 (Ratio: -): Our baseline model achieves excellent
    discrimination, with a Dxy of 0.637. Prior Defects, Size,
    and Total share a strong, increasing relationship with
    defect-proneness, while Churn shares a strong, decreasing
    relationship with defect-proneness.
  Section 5.4.F (Ratio: 68%): Characteristics of the types of
    changes and the target system type may be leading to this
    counterintuitive relationship between code churn and
    software quality.
Implications for Sony Mobile. We find that active
participation in the code review process indeed has a positive
impact on software quality, although improvement activities
can take various forms (e.g., offline code review meetings
and self-improvement). This result has led the Sony Mobile
development team to more confidently invest effort in
the code review process. Furthermore, the Sony Mobile team
has set out to improve its code review process by encouraging
(a) passive developers to more actively participate in the
code review process and (b) all developers to verify the code
changes of other colleagues instead of their own.
7. THREATS TO VALIDITY
In this section, we discuss the extent to which our results
are threatened by our experimental design choices.
Construct validity. We assume that prior/post-release
defects are defects that were fixed before/after the product
was released. However, right before the release of a commercial
system, defects that affect a wide range of components are
unlikely to be fixed, in order to mitigate the risk of
regression. Hence, the number of prior/post-release defects
that we compute may not exactly match the number of
prior/post-release defects that the software system actually
contains.
Internal validity. In our qualitative analysis, we selected
5 software engineers and 1 project manager as interviewees.
This selection may bias our results. To mitigate this bias,
we asked interviewees to bring colleagues from their teams
and also performed an external validation with 25 senior
engineers. Nonetheless, the implications we derived may
be biased by the interviewer’s background and the small
number of people who commented during the interviews.
External validity. Our study focuses on one proprietary
software system developed by Sony Mobile. This system
has several unique characteristics, such as the integration
of in-house and external codebases and being developed by
a network of colocated teams. This project might not be
representative of all proprietary software systems. However,
generalizability is not the main goal of the study. Our find-
ings have been useful to help Sony Mobile to put into action
a plan to improve their code review process. Our findings
may be useful for other teams with similar characteristics.
8. CONCLUSIONS
We quantitatively investigated the impact of code review-
ing practices, and complemented our findings with a qual-
itative analysis involving 93 stakeholders at Sony Mobile.
While prior metrics of review coverage do not share a rela-
tionship with defect-proneness in the Sony Mobile system,
the rate at which externally developed code (which is not
code reviewed by Sony Mobile engineers) is integrated with
Sony Mobile components shares a strong association with
defect-proneness. Furthermore, code review participation
also shares a significant link with software quality. For ex-
ample, author self-verification (i.e., when the engineer who
authored a code change is also the engineer responsible for
testing it) and the amount that code changes are improved
during the code review process are associated with software
quality. Our findings, which are confirmed by stakehold-
ers, have informed our process improvement plan for code
reviews at Sony Mobile.
9. ACKNOWLEDGMENTS
This research was partially supported by JSPS KAKENHI
Grant Number 15H05306. The authors would like to thank
D. Pursehouse, T. Nakagawa, S. Fujita, M. Tashiro, M. Matsuo,
and anonymous stakeholders at Sony Mobile for contributing
to the study. The findings and opinions expressed
in this paper are those of the authors and do not neces-
sarily represent or reflect those of Sony Mobile and/or its
subsidiaries and affiliates. Moreover, our results do not in
any way reflect the quality of Sony Mobile’s products.
10. REFERENCES
[1] H. Akaike. Information theory and an extension of
the maximum likelihood principle. In Proc. of the Int’l
Symp. on Information Theory, pages 267–281, 1973.
[2] J. Aranda and G. Venolia. The secret life of bugs: Going
past the errors and omissions in software repositories.
In Proc. of the Int’l Conf. on Software Eng., pages 298–
308, 2009.
[3] A. Bacchelli and C. Bird. Expectations, outcomes, and
challenges of modern code review. In Proc. of the Int’l
Conf. on Software Eng., pages 712–721, 2013.
[4] G. Bavota and B. Russo. Four eyes are better than
two: On the impact of code reviews on software qual-
ity. In Proc. of the Int’l Conf. on Software Maint. and
Evolution, pages 81–90, 2015.
[5] M. Beller, A. Bacchelli, A. Zaidman, and E. Juergens.
Modern code reviews in open-source projects: Which
problems do they fix? In Proc. of Working Conf. on
Mining Software Repositories, pages 202–211, 2014.
[6] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. De-
vanbu. Don’t touch my code!: Examining the effects
of ownership on software quality. In Proc. of the ACM
SIGSOFT Symp. and the European Conf. on Founda-
tions of Software Eng., pages 4–14, 2011.
[7] D. W. Hosmer Jr., S. Lemeshow, and R. X. Sturdivant.
Applied Logistic Regression. Wiley, 2013.
[8] B. Efron and R. J. Tibshirani. An Introduction to the
Bootstrap. Chapman & Hall, 1993.
[9] M. E. Fagan. Design and code inspections to reduce
errors in program development. IBM Syst. J., 38(2-
3):258–287, 1999.
[10] F. E. Harrell Jr. Regression modeling strategies.
Springer, 2001.
[11] F. E. Harrell Jr. rms: Regression Modeling Strategies,
2015. R package version 4.3-0.
[12] A. Hassan. Predicting faults using the complexity of
code changes. In Proc. of the Int’l Conf. on Software
Eng., pages 78–88, 2009.
[13] J. D. Herbsleb and R. E. Grinter. Architectures, coordi-
nation, and distance: Conway’s law and beyond. IEEE
Software, 16(5):63–70, 1999.
[14] C. Kemerer and M. Paulk. The impact of design
and code reviews on software quality: An empirical
study based on PSP data. IEEE Trans. Software Eng.,
35(4):534–550, 2009.
[15] O. Kononenko, O. Baysal, L. Guerrouj, Y. Cao, and
M. W. Godfrey. Investigating code review quality: Do
people and participation matter? In Proc. of the Int’l
Conf. on Software Maint. and Evolution, pages 111–
120, 2015.
[16] M. Lavallée and P. N. Robillard. Why good developers
write bad code: An observational case study of the im-
pacts of organizational factors on software quality. In
Proc. of the Int’l Conf. on Software Eng., pages 677–
687, 2015.
[17] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan.
The impact of code review coverage and code review
participation on software quality: A case study of the
Qt, VTK, and ITK projects. In Proc. of the Working
Conf. on Mining Software Repositories, pages 192–201,
2013.
[18] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan.
An empirical study of the impact of modern code review
practices on software quality. Empirical Software Eng.,
page To appear, 2015.
[19] A. Meneely, A. C. R. Tejeda, B. Spates, S. Trudeau,
D. Neuberger, K. Whitlock, C. Ketant, and K. Davis.
An empirical investigation of socio-technical code re-
view metrics and security vulnerabilities. In Proc. of
the Int’l Workshop on Social Software Eng., pages 37–
44, 2014.
[20] A. Mockus, R. T. Fielding, and J. D. Herbsleb. Two
case studies of open source software development:
Apache and Mozilla. ACM Trans. on Software Eng. and
Methodology, 11(3):309–346, 2002.
[21] R. Morales, S. McIntosh, and F. Khomh. Do code review
practices impact design quality? A case study of
the Qt, VTK, and ITK projects. In Proc. of the Int’l
Conf. on Software Analysis, Evolution and Reengineer-
ing, pages 171–180, 2015.
[22] N. Nagappan and T. Ball. Use of relative code churn
measures to predict system defect density. In Proc. of
the Int’l Conf. on Software Eng., pages 284–292, 2005.
[23] N. Nagappan, B. Murphy, and V. Basili. The influence
of organizational structure on software quality: An em-
pirical case study. In Proc. of the Int’l Conf. on Software
Eng., pages 521–530, 2008.
[24] D. L. Parnas. On the criteria to be used in decomposing
systems into modules. Commun. ACM, 15(12):1053–
1058, 1972.
[25] P. C. Rigby and C. Bird. Convergent contemporary
software peer review practices. In Proc. of the ACM
SIGSOFT Symp. and the European Conf. on Founda-
tions of Software Eng., pages 202–212, 2013.
[26] P. C. Rigby and M. Storey. Understanding broadcast
based peer review on open source software projects. In
Proc. of the Int’l Conf. on Software Eng., pages 541–
550, 2011.
[27] C. B. Seaman. Qualitative methods. Guide to Advanced
Empirical Software Engineering, Springer, pages 35–62,
2008.
[28] T. Tanaka, K. Sakamoto, S. Kusumoto, K. Matsumoto,
and T. Kikuno. Improvement of software process by
process description and benefit estimation. In Proc. of
the Int’l Conf. on Software Eng., pages 123–132, 1995.
[29] Y. Tao, Y. Dang, T. Xie, D. Zhang, and S. Kim. How
do software engineers understand code changes?: An
exploratory study in industry. In Proc. of the ACM
SIGSOFT Symp. on the Foundations of Software Eng.,
pages 51:1–51:11, 2012.
[30] P. Thongtanunam, S. McIntosh, A. E. Hassan, and
H. Iida. Investigating code review practices in defective
files: An empirical study of the Qt system. In Proc.
of the Working Conf. on Mining Software Repositories,
pages 168–179, 2015.
[31] L. G. Votta, Jr. Does every inspection need a meeting?
In Proc. of the ACM SIGSOFT Symp. on Foundations
of Software Eng., pages 107–114, 1993.
[32] T. Yu, V. Y. Shen, and H. E. Dunsmore. An analysis of
several software defect models. IEEE Trans. Software
Eng., 14(9):1261–1270, 1988.
... Among these three methodologies, survey is relatively frequent with eighteen papers being retrieved [5,7,11,13,21,22,26,31,34,37,38,55,58,60,80,86,96,117]. Thirteen papers are classified as solution [4,10,25,41,59,63,68,78,88,105,108,110,120]. The last experience methodology is the most rare case, only four papers being found [25,81,88,90]. ...
... Thirteen papers are classified as solution [4,10,25,41,59,63,68,78,88,105,108,110,120]. The last experience methodology is the most rare case, only four papers being found [25,81,88,90]. Answering RQ1: Our results show that 65% of CR researches published in premium SE venues use sound evaluation methodology (i.e., 73 papers), targeting particularly socio-technical and understanding of CR processes. ...
Article
Context Code Review (CR) is the cornerstone for software quality assurance and a crucial practice for software development. As CR research matures, it can be difficult to keep track of the best practices and state-of-the-art in methodology, dataset, and metric. Objective This paper investigates the potential of benchmarking by collecting methodology, dataset, and metric of CR studies. Methods A systematic mapping study was conducted. A total of 112 studies from 19,847 papers published in high-impact venues between the years 2011 and 2019 were selected and analyzed. Results First, we find that empirical evaluation is the most common methodology (65% of papers), with solution and experience being the least common methodology. Second, we highlight 50% of papers that use the quantitative method or mixed-method have the potential for replicability. Third, we identify 457 metrics that are grouped into sixteen core metric sets, applied to nine Software Engineering topics, showing different research topics tend to use specific metric sets. Conclusion We conclude that at this stage, we cannot benchmark CR studies. Nevertheless, a common benchmark will facilitate new researchers, including experts from other fields, to innovate new techniques and build on top of already established methodologies. A full replication is available at https://naist-se.github.io/code-review/.
... Among these three methodologies, survey is relatively frequent with eighteen papers being retrieved [5,7,11,13,21,22,26,31,34,37,38,55,58,60,80,86,96,117]. Thirteen papers are classified as solution [4,10,25,41,59,63,68,78,88,105,108,110,120]. The last experience methodology is the most rare case, only four papers being found [25,81,88,90]. ...
... Thirteen papers are classified as solution [4,10,25,41,59,63,68,78,88,105,108,110,120]. The last experience methodology is the most rare case, only four papers being found [25,81,88,90]. Answering RQ1: Our results show that 65% of CR researches published in premium SE venues use sound evaluation methodology (i.e., 73 papers), targeting particularly socio-technical and understanding of CR processes. ...
Preprint
Full-text available
Context: Code Review (CR) is the cornerstone for software quality assurance and a crucial practice for software development. As CR research matures, it can be difficult to keep track of the best practices and state-of-the-art in methodology, dataset, and metric. Objective: This paper investigates the potential of benchmarking by collecting methodology, dataset, and metric of CR studies. Method: A systematic mapping study was conducted. A total of 112 studies from 19,847 papers published in high-impact venues between the years 2011 and 2019 were selected and analyzed. Results: First, we find that empirical evaluation is the most common methodology (65% of papers), with solution and experience being the least common methodology. Second, we highlight 50% of papers that use the quantitative method or mixed-method have the potential for replica-bility. Third, we identify 457 metrics that are grouped into sixteen core metric sets, applied to nine Software Engineering topics, showing different research topics tend to use specific metric sets. Conclusion: We conclude that at this stage, we cannot benchmark CR studies. Nevertheless, a common benchmark will facilitate new researchers, including experts from other fields, to innovate new techniques and build on top of already established methodologies. A full replication is available at https://naist-se.github.io/code-review/.
... Over the past few years, a number of societies and organizations, e.g., IEEE Computer Society [29], Google [26], Microsoft [7], Sony [28], Samsung [22], and Xerox [2] have reported on their standards, guidelines, and practices with code review. However, despite many benefits and various experiences reported on code review, it remains a challenging practice of QA [15,19,21]. ...
Preprint
To reinforce the quality of code delivery, especially to improve future coding quality, one global Information and Communication Technology (ICT) enterprise has institutionalized a retrospective style inspection (namely retro-inspection), which is similar to Fagan inspection but differs in terms of stage, participants, etc. This paper reports an industrial case study that aims to investigate the experiences and lessons from this software practice. To this end, we collected and analyzed various empirical evidence for data triangulation. The results reflect that retro-inspection distinguishes itself from peer code review by identifying more complicated and underlying defects and by providing more indicative and suggestive comments. Many experienced inspectors indicate defects together with the rationale behind them and offer suggestions for correction and prevention. As a result, retro-inspection can benefit not only quality assurance (like Fagan inspection), but also internal audit, inter-division communication, and competence promotion. On the other hand, we identify several lessons of retro-inspection at this stage, e.g., developers' acceptance and organizers' predicament, for next-step improvement of this practice. To be specific, some recommendations are discussed for retro-inspection, e.g., more adequate preparation and more careful publicity. This study concludes that most of the expected benefits of retro-inspection can be empirically confirmed in this enterprise and its value on the progress to continuous maturity can be recognized organization-wide. The experiences on executing this altered practice in a large enterprise provide reference value on code quality assurance to other software organizations.
... As a popular software practice, code review is believed to be paramount to software quality for both commercial projects and Open Source Software (OSS) projects [9,41,43,48]. Through manually scrutinizing source code, reviewers aim to identify possible issues or improvement opportunities and thereby prevent issue-prone code snippets from being incorporated into project repositories [7]. ...
Preprint
Modern code review is a critical and indispensable practice in a pull-request development paradigm that prevails in Open Source Software (OSS) development. Finding a suitable reviewer in projects with massive participants thus becomes an increasingly challenging task. Many reviewer recommendation approaches (recommenders) have been developed to support this task which apply a similar strategy, i.e. modeling the review history first then followed by predicting/recommending a reviewer based on the model. Apparently, the better the model reflects the reality in review history, the higher recommender's performance we may expect. However, one typical scenario in a pull-request development paradigm, i.e. one Pull-Request (PR) (such as a revision or addition submitted by a contributor) may have multiple reviewers and they may impact each other through publicly posted comments, has not been modeled well in existing recommenders. We adopted the hypergraph technique to model this high-order relationship (i.e. one PR with multiple reviewers herein) and developed a new recommender, namely HGRec, which is evaluated by 12 OSS projects with more than 87K PRs, 680K comments in terms of accuracy and recommendation distribution. The results indicate that HGRec outperforms the state-of-the-art recommenders on recommendation accuracy. Besides, among the top three accurate recommenders, HGRec is more likely to recommend a diversity of reviewers, which can help to relieve the core reviewers' workload congestion issue. Moreover, since HGRec is based on hypergraph, which is a natural and interpretable representation to model review history, it is easy to accommodate more types of entities and realistic relationships in modern code review scenarios. As the first attempt, this study reveals the potentials of hypergraph on advancing the pragmatic solutions for code reviewer recommendation.
Article
Technical debt is a sub-optimal state of development in projects. In particular, the type of technical debt incurred by developers themselves (e.g., comments that mean the implementation is imperfect and should be replaced with another implementation) is called self-admitted technical debt (SATD). In theory, technical debt should not be left for a long period because it accumulates more cost over time, making it more difficult to process. Accordingly, developers have traditionally conducted code reviews to find technical debt. In fact, we observe that many SATD comments are introduced during modern code reviews (MCR), which are light-weight reviews with web applications. However, the nature of the SATD comments that are introduced in the review process remains uncertain: their impact, frequency, characteristics, and triggers. Herein, this study empirically examines the relationship between SATD and MCR. Our case study of 156,372 review records from the Qt and OpenStack systems shows that (i) review records involving SATD are about 6%–7% less likely to be accepted by reviews than those without SATD; (ii) review records involving SATD tend to require two to three more revisions compared with those without SATD; (iii) 28–48% of SATD comments are introduced during code reviews; (iv) SATD during reviews works for communicating between authors and reviewers; and (v) 20% of the SATD comments are introduced due to reviewers’ requests.
Conference Paper
In software engineering, code review controls code quality and prevents bugs. Although many commits to a codebase add features, some commits are code refactoring, including renaming of identifiers. Reviewing code refactoring requires somewhat different effort than reviewing functional changes. For instance, a review of a renamed identifier has to make sure that the new name not only is more descriptive and follows the naming convention of the institution, but also does not collide with any other identifiers. We propose in this paper a machine learning model to automatically identify commits consisting of pure identifier renaming, from only the diff files. This technique helps code review enforce naming and coding conventions of the institution, and lets quality assurance testers focus more on functional changes. In contrast to the traditional way of detecting such changes by parsing the full source code before and after the commit, which is less efficient and requires rigorous syntactical completeness and correctness, our novel approach based on neural networks is able to read only the diff and gives a confidence value of whether it is a renaming or not. Since there had been no existing labeled dataset on repository commits, we labeled a dataset with more than 1,000 repos from GitHub by Java syntax analysis. Then we trained a neural network to classify whether these commits are renamings, obtaining a test accuracy of 85.65% and a false positive rate of 2.03%. The methods in our experiment also have significance for general static analysis with neural network approaches.
Article
Context: Code review is a crucial step of the software development life cycle in order to detect possible problems in source code before merging the changeset to the codebase. Although there is no consensus on a formally defined life cycle of the code review process, many companies and open source software (OSS) communities converge on common rules and best practices. In spite of minor differences in different platforms, the primary purpose of all these rules and practices leads to a faster and more effective code review process. Non-conformance of developers to this process does not only reduce the advantages of the code review but can also introduce waste in later stages of the software development. Objectives: The aim of this study is to provide an empirical understanding of the bad practices followed in the code review process, that are code review (CR) smells. Methods: We first conduct a multivocal literature review in order to gather code review bad practices discussed in white and gray literature. Then, we conduct a targeted survey with 32 experienced software practitioners and perform follow-up interviews in order to get their expert opinion. Based on this process, a taxonomy of code review smells is introduced. To quantitatively demonstrate the existence of these smells, we analyze 226,292 code reviews collected from eight OSS projects. Results: We observe that a considerable number of code review smells exist in all projects with varying degrees of ratios. The empirical results illustrate that 72.2% of the code reviews among eight projects are affected by at least one code review smell. Conclusion: The empirical analysis shows that the OSS projects are substantially affected by the code review smells. The provided taxonomy could provide a foundation for best practices and tool support to detect and avoid code review smells in practice.
Conference Paper
Software code review is a well-established software quality practice. Recently, Modern Code Review (MCR) has been widely adopted in both open source and proprietary projects. To evaluate the impact that characteristics of MCR practices have on software quality, this paper comparatively studies MCR practices in defective and clean source code files. We investigate defective files along two perspectives: 1) files that will eventually have defects (i.e., future-defective files) and 2) files that have historically been defective (i.e., risky files). Through an empirical study of 11,736 reviews of changes to 24,486 files from the Qt open source project, we find that both future-defective files and risky files tend to be reviewed less rigorously than their clean counterparts. We also find that the concerns addressed during the code reviews of both defective and clean files tend to enhance evolvability, i.e., ease future maintenance (like documentation), rather than focus on functional issues (like incorrect program logic). Our findings suggest that although functionality concerns are rarely addressed during code review, the rigor of the reviewing process that is applied to a source code file throughout a development cycle shares a link with its defect proneness.
Article
Code review is the process of having other team members examine changes to a software system in order to evaluate its technical content and quality. A lightweight variant of this practice, often referred to as Modern Code Review (MCR), is widely adopted by software organizations today. Previous studies have established a relation between the practice of code review and the occurrence of post-release bugs. While the prior work studies the impact of code review practices on software release quality, it is still unclear what impact code review practices have on software design quality. Therefore, using the occurrence of 7 different types of anti-patterns (i.e., poor solutions to design and implementation problems) as a proxy for software design quality, we set out to investigate the relationship between code review practices and software design quality. Through a case study of the Qt, VTK and ITK open source projects, we find that software components with low review coverage or low review participation are often more prone to the occurrence of anti-patterns than those components with more active code review practices.
Article
Software code review, i.e., the practice of having other team members critique changes to a software system, is a well-established best practice in both open source and proprietary software domains. Prior work has shown that formal code inspections tend to improve the quality of delivered software. However, the formal code inspection process mandates strict review criteria (e.g., in-person meetings and reviewer checklists) to ensure a base level of review quality, while the modern, lightweight code reviewing process does not. Although recent work explores the modern code review process, little is known about the relationship between modern code review practices and long-term software quality. Hence, in this paper, we study the relationship between post-release defects (a popular proxy for long-term software quality) and: (1) code review coverage, i.e., the proportion of changes that have been code reviewed, (2) code review participation, i.e., the degree of reviewer involvement in the code review process, and (3) code reviewer expertise, i.e., the level of domain-specific expertise of the code reviewers. Through a case study of the Qt, VTK, and ITK projects, we find that code review coverage, participation, and expertise share a significant link with software quality. Hence, our results empirically confirm the intuition that poorly-reviewed code has a negative impact on software quality in large systems using modern reviewing tools.
Article
One of the guiding principles of open source software development is to use crowds of developers to keep a watchful eye on source code. Eric Raymond declared Linus' Law as "many eyes make all bugs shallow", with the socio-technical argument that high quality open source software emerges when developers combine together their collective experience and expertise to review code collaboratively. Vulnerabilities are a particularly nasty set of bugs that can be rare, difficult to reproduce, and require specialized skills to recognize. Does Linus' Law apply to vulnerabilities empirically? In this study, we analyzed 159,254 code reviews, 185,948 Git commits, and 667 post-release vulnerabilities in the Chromium browser project. We formulated, collected, and analyzed various metrics related to Linus' Law to explore the connection between collaborative reviews and vulnerabilities that were missed by the review process. Our statistical association results showed that source code files reviewed by more developers are, counter-intuitively, more likely to be vulnerable (even after accounting for file size). However, files are less likely to be vulnerable if they were reviewed by developers who had experience participating on prior vulnerability-fixing reviews. The results indicate that lack of security experience and lack of collaborator familiarity are key risk factors in considering Linus' Law with vulnerabilities.