Workload-Aware Reviewer Recommendation using a
Multi-objective Search-Based Approach
Wisam Haitham Abbood Al-Zubaidi
University of Wollongong
Australia
whaa807@uowmail.edu.au
Patanamon Thongtanunam
The University of Melbourne
Australia
patanamon.t@unimelb.edu.au
Hoa Khanh Dam
University of Wollongong
Australia
hoa@uow.edu.au
Chakkrit Tantithamthavorn
Monash University
Australia
chakkrit@monash.edu
Aditya Ghose
University of Wollongong
Australia
aditya@uow.edu.au
ABSTRACT
Reviewer recommendation approaches have been proposed to provide automated support in finding suitable reviewers to review a given patch. However, they mainly focused on reviewer experience, and did not take into account the review workload, which is another important factor for a reviewer to decide if they will accept a review invitation. We set out to empirically investigate the feasibility of automatically recommending reviewers while considering the review workload amongst other factors. We develop a novel approach that leverages a multi-objective meta-heuristic algorithm to search for reviewers guided by two objectives, i.e., (1) maximizing the chance of participating in a review, and (2) minimizing the skewness of the review workload distribution among reviewers. Through an empirical study of 230,090 patches with 7,431 reviewers spread across four open source projects, we find that our approach can recommend reviewers who are potentially suitable for a newly-submitted patch with 19%-260% higher F-measure than the five benchmarks. Our empirical results demonstrate that the review workload and other important information should be taken into consideration in finding reviewers who are potentially suitable for a newly-submitted patch. In addition, the results show the effectiveness of realizing this approach using a multi-objective search-based approach.
CCS CONCEPTS
• Software and its engineering → Search-based software engineering; Collaboration in software development.
KEYWORDS
Code Review, Reviewer Recommendation, Search-Based Software
Engineering
ACM Reference Format:
Wisam Haitham Abbood Al-Zubaidi, Patanamon Thongtanunam, Hoa Khanh
Dam, Chakkrit Tantithamthavorn, and Aditya Ghose. 2020. Workload-Aware
Reviewer Recommendation using a Multi-objective Search-Based Approach.
In Proceedings of the 16th ACM International Conference on Predictive Models
and Data Analytics in Software Engineering (PROMISE ’20), November 8–9,
2020, Virtual, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3416508.3417115
1 INTRODUCTION
Code review is one of the important quality assurance practices
in a software development process. The main goal of code review
is to improve the overall quality of a patch (i.e., a set of software
changes) through a manual examination done by developers other
than the patch author. Recently, many software organizations have adopted a lightweight variant of code review called Modern Code Review (MCR) [1, 15, 22, 25], which focuses on collaboration among team members to achieve high quality of a software product. Several studies have shown that active and rigorous code review can decrease the number of post-release defects [17, 29]. In addition, the collaborative practice of MCR provides additional benefits to team members such as knowledge transfer [1] and sharing code ownership [30].
Eective code reviews require active participation of review-
ers with related knowledge or experience [
2
,
23
]. Several studies
have shown that a patch tends to be less defective when it was
reviewed and discussed extensively by many reviewers [
3
,
14
,
29
].
Furthermore, recent work has shown that active review partici-
pation can only be achieved by inviting active and experienced
reviewers [
31
]. However, nding suitable reviewers is not a trivial
task, and this has to be done quickly to avoid negative impact on the
code review timeliness [
32
]. Hence, several studies have proposed
automated approaches to support the recommendation of reviewers
for a newly-submitted patch [2,19,32,34,36,37].
Intuitively, reviewing experience should be the key factor when selecting reviewers for a newly-submitted patch [2, 32, 34, 37]. However, recent studies pointed out that requesting only experts or active reviewers for a review could potentially burden them with many reviewing tasks [7, 16]. Prior work has also shown that the length of review queues, i.e., the number of pending review requests, can lead to review delays [4]. In fact, recent work [15, 25] reported that development teams in practice take the review workload into account when selecting reviewers for a newly-submitted patch. Furthermore, Ruangwan et al. [24] showed that invited reviewers often considered their workload (i.e., the number of remaining reviews) when deciding whether they should accept new invitations. These empirical findings highlight that the reviewer workload should be considered when selecting reviewers for a newly-submitted patch.
Hence, this work conducts an empirical study to investigate the feasibility of considering the review workload amongst other factors in automatically recommending reviewers. To do so, we develop a Workload-aware Reviewer Recommendation approach called WLRRec. Unlike the previous reviewer recommendation approaches which mainly focus on the reviewing experience [2, 19, 32, 36, 37], our WLRRec considers a wider range of information, including the review workload. More specifically, our WLRRec recommends reviewers based on 4+1 key reviewer metrics, i.e., four metrics including code ownership, reviewing experience, familiarity with the patch author, and review participation rate, and one metric representing the review workload. We use these metrics to define two objectives: (1) maximizing the chance of participating in a review, and (2) minimizing the skewness of the review workload distribution among reviewers. To find reviewers for a newly-submitted patch, our WLRRec leverages a multi-objective meta-heuristic algorithm, namely the non-dominated sorting genetic algorithm (NSGA-II) [9], to search for solutions that meet these two objectives.
Through an evaluation of 230,090 patches with 7,431 reviewers spread across four large open source software projects (Android, LibreOffice, Qt, and OpenStack), our results suggest that:
(1) when considering 4+1 key reviewer metrics, our WLRRec can recommend reviewers who are potentially suitable for a newly-submitted patch with 19%-260% higher F-measure than the five benchmarks;
(2) including an objective to minimize the skewness of the review workload distribution would be beneficial to find other potential reviewers that might be overlooked by the other approaches which focus only on reviewing experience; and
(3) the multi-objective meta-heuristic algorithm, NSGA-II, is effective in searching for reviewers who are potentially suitable for a newly-submitted patch.
Our empirical results demonstrate the potential of using a wider range of information and leveraging the multi-objective meta-heuristic algorithm to find reviewers who are potentially suitable for a newly-submitted patch.
2 BACKGROUND AND RELATED WORK
2.1 Modern Code Review
Modern code review is a light-weight variant of software inspection. The process is supported by an online code review tool (e.g., Gerrit). Recently, the modern code review process has become widely used in both open source and industrial software projects [22]. Modern code review is a collaborative code review process, where developers other than the author examine a patch (i.e., a set of code changes) submitted by the author. The modern code review process typically consists of five main steps:
(1) The author uploads a patch to a code review tool.
(2) The author makes a review request to reviewers.
(3) The reviewers who accept the review request examine the patch, provide comments and a vote, where a positive vote indicates that the patch is of sufficient quality and a negative vote indicates that the patch needs a revision.
(4) The author revises the patch to address the reviewer feedback and uploads a new revision.
(5) If the revised patch addresses the reviewer concerns, the patch is marked as merged for integration into the main code repository. If the patch requires a large rework, the patch is marked as abandoned.

Figure 1: An example of a review in the LibreOffice Project.
Figure 1provides an example of a code review in the LibreOce
project. In this example,
Justin
is an author of the patch. Two addi-
tional developers (i.e.,
Miklos
and
Szymon
) were invited to review
the patch, while
Jenkins
is the automated Continuous Integration
(CI) system. Although two reviewers were invited, only
Miklos
participated in this review by providing a vote of +2. Furthermore,
the review of this patch took three days from its creation date (on
September 5, 2017) to get a vote from the reviewer (on September
8, 2017). Hence, nding suitable reviewers would be benecial to
the code review process in terms of reducing the delay. However,
nding reviewers who will participate in a code review is challeng-
ing as many factors can play a role, which will be discussed in the
next subsection.
2.2 Related Work
2.2.1 Reviewer Recommendation Approaches. Finding reviewers can be a challenging task for a patch author, especially in globally-distributed software development teams like open source software projects. Thongtanunam et al. [32] showed that in 4% to 30% of the reviews in open source projects, the patch author could not find reviewers. This problem delays the reviewing process and consequently affects software development overall. To address this problem, a number of reviewer recommendation approaches have been proposed to help a patch author find reviewers for a newly-submitted patch [2, 19, 32, 36, 37]. The key idea of those existing approaches is to find reviewers with high reviewing experience. They assume that reviewers for a newly-submitted patch should be those who have reviewed many related patches in the past. For example, Balachandran [2] proposed ReviewBot, which leverages the line change history of the changed files in a newly-submitted patch to find reviewers. Thongtanunam et al. [32] proposed RevFinder, which identifies related past patches based on the file path similarity of the changed files in a newly-submitted patch. Zanjani et al. [37] showed that the reviewing experience can change over time. Thus, they proposed cHRev, which considers the frequency and recency of reviewing activities when computing the reviewing experience.

Figure 2: An overview of our approach (given a newly-submitted patch and past reviews, compute reviewer metrics, generate optimal solutions, and select a solution to obtain the recommended reviewers).
Recent reviewer recommendation approaches also take into account the historical interaction between reviewers and the patch author. For example, Yu et al. [36] built a review collaboration network to measure the co-reviewing frequency among developers. Hence, the reviewers for a newly-submitted patch are those who have high reviewing experience and high co-reviewing frequency with the author of that patch. Instead of using a heuristic approach like Yu et al. [36], Ouni et al. [19] developed RevRec, which uses the genetic algorithm (GA) to search for reviewers based on the reviewing experience and the review collaboration network. Their work however follows a single-objective optimization approach.
2.2.2 Reviewer Participation & Selection. While the reviewer rec-
ommendation approaches in the literature mainly focus on the
reviewing experience and the historical interaction between the
patch author and reviewers, several empirical studies have shown
that other factors also play a role in reviewer participation. Ko-
valenko et al. [
15
] showed that when selecting a reviewer, the patch
author considers a wide range of information such as knowledge,
ownership, qualities of reviewers. Rigby and Storey [
23
] reported
that in the mailing list-based code reviews, reviewers select a patch
for a review based on their interests and the priorities of a patch.
A survey study of Bosu et al. [
7
] also reported that reviewers tend
to decline a review request if the patch is not in the area of their
ownership or expertise. Recent empirical studies have shown that
the activeness of a reviewer is a strong indicator of the reviewer
participation [24,31].
Several recent studies pointed out that the reviewing workload should be considered when selecting reviewers. A survey study of Kovalenko et al. [15] reported that the reviewer workload is one of the factors that a patch author considers when selecting a reviewer. Sadowski et al. [25] reported that development teams at Google use a system that assigns reviews in a round-robin manner to take the review workload into account. Ruangwan et al. [24] also demonstrated empirical evidence that the number of remaining reviews has a negative impact on the likelihood that a reviewer accepts a review request. Moreover, Kononenko et al. [14] showed that the length of the review queue has a statistically significant impact on the review quality (i.e., whether developers catch or miss bugs during a code review).
2.2.3 Novelty. Motivated by the empirical ndings of prior stud-
ies, we develop a novel workload-aware reviewer recommendation
(WLRRec) approach. Our WLRRec uses 4+1 key reviewer metrics,
i.e., four metrics indicating whether a review request will be ac-
cepted, and one metric representing the review workload. Unlike
the previous reviewer recommendation approaches, our WLRRec
is the rst to take code ownership, a review participation rate, and
the workload of reviewers into account. Since there are trade-os
in considering these 4+1 reviewer metrics, we employ a multi-
objective search-based approach called the non-dominated sorting
genetic algorithm (NSGA-II) [
9
]. The closest related work is RevRec
of Ouli et al. [
19
] but they used a single-objective search-based
approach (rather than a multi-objective approach as we propose
here). In addition, RevRec does not consider the reviewer workload
into consideration when recommending reviewers.
3 A WORKLOAD-AWARE REVIEWER
RECOMMENDATION APPROACH
3.1 Approach Overview
Figure 2 provides an overview of our Workload-aware Reviewer Recommendation approach (WLRRec), which uses a search-based software engineering (SBSE) approach. Given a newly-submitted patch, our WLRRec will first compute reviewer metrics for reviewer candidates (i.e., reviewers who have reviewed at least one patch in the past). We use five metrics which measure the experience, historical participation, and reviewing workload of reviewer candidates. Then, we employ an evolutionary search technique to generate optimal solutions, where each solution is a set of reviewer candidates. The search for solution candidates is guided by two objectives: (1) to maximize the chance of participating in a review, and (2) to minimize the skewness of the review workload distribution among reviewers. The first objective considers the experience and historical participation of reviewers, while the second objective considers the review workload of reviewers. These two objectives are in conflict with each other. If we recommend only experts or active reviewers for newly-submitted patches, the reviewing workload will be highly skewed towards those experts or active reviewers. On the other hand, if we only focus on establishing a highly balanced reviewing workload, the reviewers that we recommend may not be experienced and familiar with a code patch (which may affect the review quality) or active (which may delay both the review and development process).
The intuition behind the rst objective is that as found in prior
studies, the areas of expertise and the areas of code with which
the reviewers are familiar are ones of the major reasons to accept
23
PROMISE ’20, November 8–9, 2020, Virtual, USA W. Al-Zubaidi, P. Thongtanunam, H. Dam, C. Tantithamthavorn, A. Ghose
a review request [
4
,
7
,
16
]. In addition, recent work has shown
that historical participation (i.e., the review participation rate and
historical interaction with a patch author) also plays an important
role in the decision of accepting a review request [
7
,
24
]. For the
second objective, several studies have pointed out that making
review requests to only experts or active reviewers could potentially
burden them with many reviewing tasks [
7
,
16
]. A recent work [
24
]
also shows that the number of remaining reviews (RR) is negatively
associated with the likelihood of accepting a review request. Hence,
we want to ensure that our reviewer recommendation approach
does not add more reviews to a particular group of reviewers (e.g.,
experts). MacLeod et al. [
16
] also suggested that requesting less
experienced (but available) reviewers could potentially speed up
the code review process and balance the team’s workload. These
motivate us to develop an approach that considers both of those
two objectives at the same time.
In this work, we leverage the non-dominated sorting genetic algorithm (NSGA-II) [9] to find the optimal solutions with respect to the two objectives. The algorithm is based on the principle that a population of solution candidates is evolved towards better solutions. At the end of the search process, the algorithm returns a set of optimal solutions, i.e., sets of reviewers that satisfy our objectives. Finally, we use the Pareto front to identify the optimal solution, i.e., a set of reviewers that will be recommended for a newly-submitted patch. Below, we describe our reviewer metrics and the approaches of generating and identifying the optimal solutions in detail.
3.2 Compute Reviewer Metrics
To recommend reviewers, we measure the experience, historical participation, and reviewing workload of reviewer candidates using five metrics. These metrics will be used in our fitness functions (i.e., objectives) in the multi-objective evolutionary approach. Below, we describe the intuition based on the literature and the calculation for each of our metrics.
Code Ownership (CO). CO measures the proportion of past reviews that had been authored by a reviewer candidate. Bird et al. [6] showed that the developer who authored many code changes should be accounted as an owner of those related areas of code. Hence, a reviewer for a newly-submitted patch should be an owner of the code that is impacted by the patch. Several studies also show that reviewers are likely to participate in a patch for which they have related experience [7, 24]. Hence, we measure CO based on the approach of Bird et al. [6]. More specifically, given a newly-submitted patch p for a review, we measure CO of a reviewer candidate r using the following calculation:

$CO(r, p) = \frac{1}{|M(p)|} \sum_{m \in M(p)} \frac{author(r, m)}{c(m)}$    (1)

where M(p) is the set of modules (i.e., directories) in the patch p, author(r, m) is the number of past reviews of the module m that were authored by the reviewer r, and c(m) is the total number of past reviews that have a module m.
Reviewing Experience (RE). RE measures the proportion of past reviews that had been reviewed by a reviewer candidate. Similar to CO, a recent study showed that developers can gain expertise on the related areas of code by actively participating in code reviews [30]. Several reviewer recommendation approaches are also based on a similar intuition, i.e., the appropriate reviewers are those who reviewed many similar patches in the past [19, 21, 32, 37]. Hence, given a newly-submitted patch p for a review, we measure RE of a reviewer candidate r using the calculation of Thongtanunam et al. [30], which is described as follows:

$RE(r, p) = \frac{1}{|M(p)|} \sum_{m \in M(p)} \frac{review(r, m)}{c(m)}$    (2)

where review(r, m) is the proportion of review contributions that the reviewer candidate r made to the past reviews K, computed as $\sum_{k \in K(r, m)} \frac{1}{R(k)}$ [30].
Familiarity with the Patch Author (FPA). Recent studies reported that in addition to the expertise, the relationship between the patch author and the reviewer often affects the decision of whether to accept a review request [7, 24]. To capture this relationship, FPA counts the number of past reviews that a reviewer candidate had done for the patch author of a newly-submitted patch. Hence, the higher the FPA value is, the more historical interaction there is between the reviewer candidate and the patch author, and the more likely it is that the reviewer candidate will participate in the code review of the newly-submitted patch.
Review Participation Rate (RPR). RPR measures the extent to which a candidate participated in code reviews in the past. More specifically, RPR measures the proportion of past reviews in which a reviewer candidate participated compared to the number of past reviews to which the reviewer candidate was requested. A recent work of Ruangwan et al. [24] showed that RPR is one of the most influential factors affecting the participation decision of reviewers. The higher the RPR value is, the more active the reviewer candidate is, and the more likely it is that the reviewer will participate in a code review of a newly-submitted patch.
Remaining Reviews (RR). A survey study reported that "too many review requests" is one of the reasons that reviewers did not respond to review requests [24]. Hence, we use RR to represent the review workload of a reviewer candidate. The quantitative analysis of Ruangwan et al. showed that RR is one of the most influential factors affecting the participation decision of reviewers [24]. In addition, Baysal et al. [4] showed that the length of the review queue, i.e., the number of pending review requests, can have an impact on the code review timeliness. Hence, to measure RR, we count the number of review requests that a reviewer candidate received but had not yet participated in at the time when the newly-submitted patch was created.
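To make the five metric definitions concrete, the sketch below shows one way the 4+1 metrics could be computed from a history of past reviews. The record layout (a list of past reviews with an author, a set of modules, and the sets of invited and participating reviewers) and the equal-share interpretation of review contributions are illustrative assumptions, not the exact implementation used in this paper; RR is also simplified here to count invitations that were never answered, ignoring timing.

```python
from collections import defaultdict

def compute_reviewer_metrics(candidate, patch_modules, patch_author, past_reviews):
    """Sketch of the 4+1 reviewer metrics (CO, RE, FPA, RPR, RR) for one candidate.

    Each past review is assumed to look like:
      {"author": "alice", "modules": {"ui", "core"},
       "invited": {"bob", "carol"}, "participated": {"bob"}}
    """
    total_per_module = defaultdict(int)       # c(m): past reviews touching module m
    authored_per_module = defaultdict(int)    # author(r, m)
    reviewed_per_module = defaultdict(float)  # review(r, m), shared among participants
    fpa = invited = participated = pending = 0

    for rev in past_reviews:
        n_participants = max(len(rev["participated"]), 1)
        for m in rev["modules"]:
            total_per_module[m] += 1
            if rev["author"] == candidate:
                authored_per_module[m] += 1
            if candidate in rev["participated"]:
                reviewed_per_module[m] += 1.0 / n_participants
        if candidate in rev["participated"] and rev["author"] == patch_author:
            fpa += 1                           # Familiarity with the Patch Author
        if candidate in rev["invited"]:
            invited += 1
            if candidate in rev["participated"]:
                participated += 1
            else:
                pending += 1                   # unanswered requests -> review workload

    n_modules = max(len(patch_modules), 1)
    co = sum(authored_per_module[m] / total_per_module[m]
             for m in patch_modules if total_per_module[m]) / n_modules
    re_ = sum(reviewed_per_module[m] / total_per_module[m]
              for m in patch_modules if total_per_module[m]) / n_modules
    rpr = participated / invited if invited else 0.0
    return {"CO": co, "RE": re_, "FPA": fpa, "RPR": rpr, "RR": pending}
```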
3.3 Generate Optimal Solutions
In this section, we describe the solution representation, the t-
ness functions for our two objectives, and the evolutionary search
approach to generate optimal solutions (i.e., sets of reviewers).
3.3.1 Solution Representation. We use a bit string to represent a solution candidate (i.e., a set of reviewer candidates). The length of the bit string is the number of all reviewer candidates. Each bit in the string has the value of 0 or 1. A bit value of 1 indicates that the corresponding reviewer is selected, while 0 indicates that the reviewer is excluded from a recommendation. For example, suppose a software project has five reviewer candidates (i.e., R1, R2, R3, R4, R5), and in a solution candidate S, reviewer candidates R1, R2, and R5 are selected for a recommendation. The solution candidate S can then be represented by the bit string 11001.
The reviewer candidates in our approach are those who have participated in code reviews in the past. However, due to the large number of developers who participated in code reviews, a solution candidate can be long, resulting in excessive computation time. Hence, we shorten our candidate solutions by removing reviewer candidates who have at least three metrics with zero values. This is because the more metrics with zero values, the lower the signal that the reviewer candidate will accept a review request. Note that we experimented with all possible metric thresholds for removing reviewer candidates (i.e., t = {1, 2, 3, 4, 5}). We found that removing reviewer candidates who have at least three metrics with zero values (t = 3) provides a reasonable length of candidate solutions (i.e., a median length of 23-381 reviewer candidates for a candidate solution), while having a minimal impact on the accuracy of recommendation.
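The following is a minimal sketch (under the same illustrative assumptions as above) of the bit-string encoding and of the pruning step that removes candidates with at least three zero-valued metrics; the helper names are ours, not part of the approach's implementation.

```python
import random

def prune_candidates(metrics_by_reviewer, zero_threshold=3):
    """Keep only candidates with fewer than `zero_threshold` zero-valued metrics (t = 3)."""
    return [r for r, m in metrics_by_reviewer.items()
            if sum(1 for v in m.values() if v == 0) < zero_threshold]

def random_solution(candidates):
    """A solution candidate is a bit string over the pruned candidates (1 = recommended)."""
    return [random.randint(0, 1) for _ in candidates]

def decode(solution, candidates):
    """Map a bit string back to the selected reviewers, e.g. 11001 -> [R1, R2, R5]."""
    return [r for bit, r in zip(solution, candidates) if bit == 1]

print(decode([1, 1, 0, 0, 1], ["R1", "R2", "R3", "R4", "R5"]))  # ['R1', 'R2', 'R5']
```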
3.3.2 Fitness Functions. We now describe the calculation of the fitness functions for our two objectives.

Maximizing the Chance of Participating in a Review (CPR). For our first objective, we aim to find reviewers with maximum code ownership (CO), reviewing experience (RE), and review participation rate (RPR), and those who are highly familiar with the patch author (FPA). In other words, the recommended reviewers of our approach are those who have related expertise, actively participated in code reviews in the past, and reviewed many past patches for the patch author of a newly-submitted patch. To consider these four factors when recommending reviewers, we formulate the following fitness function:

$CPR(S_i, p) = \sum_{r \in R} S_i(r) \left[ \alpha_1 CO(r, p) + \alpha_2 RE(r, p) + \alpha_3 FPA(r, p) + \alpha_4 RPR(r, p) \right]$    (3)

where p is a newly-submitted patch for a review, S_i is a candidate solution (i.e., a bit string of 0 or 1 for selecting reviewers), and R is the set of all reviewer candidates. Each factor is weighted by an alpha (α) value where α_1 + α_2 + α_3 + α_4 = 1. Then, the higher the value of CPR(S_i, p) is, the better the solution S_i (i.e., the set of selected reviewers) is for a newly-submitted patch p.
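A direct translation of Equation (3) into code might look as follows; the metric dictionary produced by the earlier sketch is an assumption for illustration.

```python
def cpr(solution, candidates, metrics_by_reviewer, alphas=(0.25, 0.25, 0.25, 0.25)):
    """Objective 1 (Eq. 3): weighted sum of CO, RE, FPA, and RPR over the
    selected reviewers. Higher values are better."""
    a1, a2, a3, a4 = alphas
    total = 0.0
    for bit, r in zip(solution, candidates):
        if bit == 1:  # S_i(r) = 1
            m = metrics_by_reviewer[r]  # metrics computed for the newly-submitted patch
            total += a1 * m["CO"] + a2 * m["RE"] + a3 * m["FPA"] + a4 * m["RPR"]
    return total
```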
Minimizing the Skewness of the Reviewing Workload Distribution (SRW). To ensure that our recommendations will not burden a particular group of reviewers, we aim to balance the reviewing workload among reviewers. In other words, the number of remaining reviews should not be skewed towards a particular group of reviewers. Hence, we set an objective to minimize the skewness of the review workload distribution among reviewers. To do so, we adapt the calculation of Shannon's entropy [27] to measure the skewness of the remaining reviews (RR) distribution among reviewers. This is similar to the work of Hassan [11] who used Shannon's entropy to measure the distribution of modified code across modified files.
Algorithm 1 NSGA-II pseudo-code [9]
1: P_0 ← randomly generate an initial population with population size N
2: Evaluate P_0 against the objective functions
3: Apply selection, crossover, and mutation
4: Q_0 ← create an offspring population
5: t = 0
6: while the number of generations is not reached do
7:     R_t ← Merge(P_t, Q_t)
8:     F ← fast-non-dominated-sort(R_t)
9:     P_{t+1} = ∅ and j = 1
10:    while |P_{t+1}| + |F_j| ≤ N do
11:        Calculate crowding-distance-assignment(F_j)
12:        P_{t+1} ← P_{t+1} ∪ F_j
13:        j = j + 1
14:    end while
15:    Sort(F_j, ≺_n)
16:    P_{t+1} ← P_{t+1} ∪ F_j[1 : (N − |P_{t+1}|)]
17:    Q_{t+1} ← generate a new population from P_{t+1}
18: end while
More specifically, given a newly-submitted patch p and a solution candidate S_i, we formulate the following fitness function:

$SRW(S_i, p) = \frac{1}{\log_2 |R|} \sum_{r \in R} H(r) \times \log_2 H(r), \quad H(r) = \frac{RR(r)}{\sum_{k \in R} RR(k)}$    (4)

where R is the set of all reviewer candidates, H(r) is the proportion of remaining reviews of a reviewer candidate r, and RR(r) is the number of remaining reviews of r, including the newly-submitted patch if the reviewer r is selected in the solution S_i.
For example, suppose the reviewer candidates are R1, R2, and R3 and their remaining reviews are 3, 5, and 2, respectively. Given that a solution S_i selects R1 and R3 as recommended reviewers for a newly-submitted patch p, the RR values will be 4 (= 3 + 1), 5, and 3 (= 2 + 1) for R1, R2, and R3, respectively. Then, the SRW of the solution S_i for the patch p is -0.98 ($= \frac{1}{\log_2 3}(\frac{4}{12}\log_2\frac{4}{12} + \frac{5}{12}\log_2\frac{5}{12} + \frac{3}{12}\log_2\frac{3}{12})$). The lower the SRW value is, the better the spread of workload among reviewers.
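The SRW fitness from Equation (4) and the worked example above can be reproduced with the following sketch (again an illustrative Python rendering rather than the original implementation):

```python
import math

def srw(solution, candidates, remaining_reviews):
    """Objective 2 (Eq. 4): normalised entropy-based skewness of the remaining-review
    distribution. Lower values indicate a more even workload spread."""
    # A selected reviewer gets one extra remaining review (the newly-submitted patch).
    rr = [remaining_reviews[r] + (1 if bit == 1 else 0)
          for bit, r in zip(solution, candidates)]
    total = sum(rr)
    acc = 0.0
    for x in rr:
        if x > 0:
            h = x / total                 # H(r)
            acc += h * math.log2(h)
    return acc / math.log2(len(candidates))

# Worked example from the text: RR = {R1: 3, R2: 5, R3: 2}, solution selects R1 and R3.
print(round(srw([1, 0, 1], ["R1", "R2", "R3"], {"R1": 3, "R2": 5, "R3": 2}), 2))  # -0.98
```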
3.3.3 Evolutionary Search. We employ a multi-objective meta-heuristic algorithm, namely the non-dominated sorting genetic algorithm (NSGA-II) [9], to search for solutions that meet the above two objectives. Algorithm 1 provides the pseudo-code of the NSGA-II algorithm. NSGA-II starts by randomly generating an initial population P_0 (i.e., a set of solution candidates). Then, the fitness of each solution candidate in the population P_0 is measured with respect to the two fitness functions described in Section 3.3.2. The initial population P_0 is then evolved into a new generation of solution candidates (i.e., an offspring population Q_0) through the selection and genetic operators, i.e., crossover and mutation. The selection operator ensures that the selection of solution candidates in the current population is proportional to their fitness values. The crossover operator takes two selected solution candidates as parents and swaps their bit strings to generate an offspring solution. The mutation operator randomly chooses certain bits in the string and inverts their values.
At each generation t, the current population P_t and its offspring population Q_t are merged into a new population R_t. Then, NSGA-II sorts the solution candidates in the population R_t using the fast non-dominated sorting technique. This technique compares each solution with the other solutions in the population R_t to find which solutions dominate and which solutions do not dominate other solutions. A solution S_1 is said to dominate another solution S_2 if S_1 is no worse than S_2 in all objectives and S_1 is strictly better than S_2 in at least one objective. After sorting, the fast non-dominated sorting technique provides Pareto fronts (i.e., sets of Pareto optimal solutions that are not dominated by any other solutions). For each Pareto front, NSGA-II calculates the crowding distance, which is the sum of the distances in terms of fitness values between each solution and its nearest neighbours in the same front. Then, the Pareto fronts are sorted in ascending order based on the crowding distance values. The population for the next generation P_{t+1} contains the first N solutions in the sorted Pareto fronts. The offspring population Q_{t+1} is then generated based on the population P_{t+1} through the selection and genetic operators. This evolution process is repeated until a fixed number of generations has been reached. In the final generation, NSGA-II returns a set of Pareto optimal solutions.
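To illustrate the dominance relation and the front construction described above, the following is a simplified (quadratic-time) sketch; NSGA-II's actual fast non-dominated sort and crowding-distance bookkeeping are more involved, and the experiments in this paper rely on an existing framework rather than hand-rolled code.

```python
def dominates(f1, f2):
    """f1 dominates f2 if it is no worse in every objective and strictly better in at
    least one. Objectives are expressed so that larger is better, e.g. (CPR, -SRW)."""
    return all(a >= b for a, b in zip(f1, f2)) and any(a > b for a, b in zip(f1, f2))

def non_dominated_fronts(fitnesses):
    """Group solution indices into successive Pareto fronts (front 0 = non-dominated set)."""
    remaining = set(range(len(fitnesses)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(fitnesses[j], fitnesses[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

# Solutions 0 and 2 are mutually non-dominated (front 0); solution 1 is dominated (front 1).
print(non_dominated_fronts([(0.9, 0.8), (0.5, 0.4), (0.4, 0.9)]))
```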
3.4 Select a Solution
From the previous step, NSGA-II returns a set of Pareto optimal solutions, i.e., the sets of reviewers that meet the optimal trade-off of our two objectives. This set of solutions can be presented to the users for them to select. In cases where no explicit user preferences are provided, the so-called knee point approach is applied to select the most preferred solution among the non-dominated solutions. This knee point approach has been widely used in the evolutionary search literature.

The knee point approach measures the Euclidean distance of each solution on the Pareto front from the reference point. Given that the reference point is the maximum chance of participating in a review (CPR_max) and the minimum skewness of the reviewing workload distribution (SRW_min), the Euclidean distance of a solution S_i is calculated as follows:

$Dist(S_i) = \sqrt{(CPR_{max} - CPR(S_i))^2 + (SRW_{min} - SRW(S_i))^2}$    (5)

The selected solution is the one closest to the reference point. Figure 3 provides an illustrative example of the knee point approach. Given that the Pareto optimal solutions returned by NSGA-II are S1, S2, S3, and S4 and their fitness values are as shown in the plot, solution S3 is closest to the reference point, i.e., it has the highest chance that the selected reviewers will accept a review request, while the reviewing workload is well distributed (low skewness). Hence, the solution S3 will be selected, and the reviewers associated with this solution will be recommended.
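A compact sketch of the knee-point selection in Equation (5) is shown below; it assumes the Pareto solutions are available as (solution, CPR, SRW) tuples, which is our own illustrative representation.

```python
import math

def knee_point(pareto_solutions):
    """Select the solution closest to the reference point (CPR_max, SRW_min) (Eq. 5)."""
    cpr_max = max(c for _, c, _ in pareto_solutions)
    srw_min = min(s for _, _, s in pareto_solutions)

    def dist(item):
        _, c, s = item
        return math.sqrt((cpr_max - c) ** 2 + (srw_min - s) ** 2)

    return min(pareto_solutions, key=dist)[0]
```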
4 EXPERIMENTAL DESIGN
4.1 Research Questions
To evaluate our WLRRec, we formulate the following research
questions.
(RQ 1) Can our WLRRec recommend reviewers who are potentially suitable for a newly-submitted patch?
Figure 3: An illustrative example of identifying a knee point from the Pareto optimal solutions (Objective 1: maximize the chance of participating in a review, based on reviewers' experience and historical participation; Objective 2: minimize the skewness of the workload distribution).
Table 1: An overview of the evaluation datasets

Project      Period               # Patches   # Reviewers
Android      10/2008 - 12/2014    36,771      2,049
Qt           5/2011 - 12/2014     65,815      1,238
OpenStack    7/2011 - 12/2014     108,788     3,734
LibreOffice  3/2012 - 11/2016     18,716      410
We set out this RQ as a sanity check to determine whether our approach can recommend actual reviewers for a newly-submitted patch. In addition, we compare our approach with the Random Search optimization technique [13], which has been a common baseline for most search-based meta-heuristic algorithms.
(RQ 2) Does our WRLRec benet from the multi-objective
search-based approach?
Since our WLRRec considers two objectives, we set out this
RQ to empirically evaluate how well the multi-objective
approach compared to the single-objective approach. To
answer this RQ, we implemented the traditional single-
objective genetic algorithm (GA) using either CPR or SRW
as an objective. We then compare the performance of our
WLRRec against these two single-objective approaches,
i.e., GA-CPR and GA-SRW. Indeed, the GA-CPR approach
is closely similar to RevRec of Ouli et al.[
19
], who used
the genetic algorithm (GA) and considered the reviewing
experience and historical interaction.
(RQ 3) Does the choice of search algorithms impact the performance of our WLRRec?
Our WLRRec leverages a multi-objective optimization algorithm to recommend reviewers. Aside from NSGA-II, there are other multi-objective optimization algorithms. Hence, we set out this RQ to evaluate our WLRRec when using two other multi-objective evolutionary algorithms: the Multiobjective Cellular Genetic Algorithm (MOCell) [18] and the Strength Pareto Evolutionary Algorithm 2 (SPEA2) [38].
4.2 Datasets
In this work, we use code review datasets of four large open source software projects that actively use modern code review, i.e., Android, Qt, OpenStack, and LibreOffice. These projects have a large number of patches recorded in the code review tool. For Android, Qt, and OpenStack, we use the review datasets of Hamasaki et al. [10], which were often used in prior studies [19, 24, 28, 32]. For LibreOffice, we use the review dataset of Yang et al. [35]. The datasets include patch information, review discussions, and developer information. Below, we describe our data preparation approach, which consists of three main steps.
(Step 1) Cleaning Datasets. In this paper, we use the patches that were marked as either merged or abandoned. In addition, we exclude the patches that were self-reviewed (i.e., only the author of the patch was a reviewer) and the patches related to version control system activities (e.g., branch-merging patches). Note that we search for the keyword "merge branch" or "merge" in the description of the patch to identify the branch-merging patches. These patches are excluded from our evaluation because there might be no reviewers who actually review those patches. Since the code review tools of the studied projects are tightly integrated with automated checking systems (e.g., Continuous Integration test systems), we remove the accounts of automated checking systems (e.g., Jenkins CI or sanity checks) from the datasets. Similar to prior work [24], we use a semi-automated approach to identify the accounts of automated checking systems. Finally, we use the approach of Bird et al. [5] to identify and merge alias emails of developers. In total, we obtain 230,090 patches with 7,431 reviewers spread across the four open source projects. Table 1 provides an overview of our datasets.
(Step 2) Splitting Datasets. To evaluate our approach, we split each dataset into two sub-datasets: (1) the most recent 10% of the patches and (2) the remaining 90% of the patches. The 10% sub-dataset is used to evaluate our approach, i.e., the patches in this sub-dataset are considered as newly-submitted patches. The 90% sub-dataset is used for building a pool of reviewer candidates and computing reviewer metrics. As described in Section 3.3.1, reviewer candidates are those (1) who have provided either a vote score or a comment on at least one patch in the 90% sub-dataset and (2) who have at least three reviewer metrics with a value greater than zero.
(Step 3) Generating Ground-Truth Data. For the ground-truth data of the patches in the 10% sub-dataset, we identify two groups of reviewers: (1) actual reviewers and (2) potential reviewers. Given a patch p in the 10% sub-dataset, the actual reviewers of the patch p are those who actually reviewed the patch p by either providing a vote score or a comment. These actual reviewers are typically used to evaluate reviewer recommendation approaches [19, 32, 37]. Although these reviewers actually reviewed the patch in the historical data, there is no guarantee that they are the only group of suitable reviewers. Moreover, a recent work of Kovalenko et al. [15] has shown that although a reviewer recommendation approach may achieve a good accuracy when the evaluation is based on historical data, developers may not always select the recommended reviewers when the approach is deployed, due to several factors, e.g., reviewers' availability.

Hence, to evaluate our approach, we expand our ground-truth data to include potential reviewers. Potential reviewers are the reviewers who should be able to review the changed files because they have reviewed some of these files in other patches. Specifically, given a patch p, the potential reviewers of the patch p are those who were the actual reviewers of other patches in the 10% sub-dataset that made changes to at least one file that is also one of the changed files of the patch p. For example, a patch p1 made changes to files A and B, while a patch p2 made changes to files B and C. Reviewers R1 and R2 are the actual reviewers of the patch p1, while reviewers R3 and R4 are the actual reviewers of the patch p2. Hence, the potential reviewers for the patch p1 are R3 and R4, as they reviewed file B (i.e., the common changed file in patches p1 and p2).
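The identification of potential reviewers can be sketched as follows; the patch representation (an id, a set of changed files, and a set of actual reviewers) is an assumption made for illustration.

```python
def potential_reviewers(target_patch, test_patches):
    """Reviewers of other test-set patches that changed at least one file also
    changed by `target_patch` (cf. Step 3)."""
    potential = set()
    for other in test_patches:
        if other["id"] == target_patch["id"]:
            continue
        if target_patch["files"] & other["files"]:   # shared changed file
            potential |= other["actual_reviewers"]
    return potential

# Example from the text: p1 changes {A, B} (reviewed by R1, R2) and
# p2 changes {B, C} (reviewed by R3, R4), so the potential reviewers of p1 are R3 and R4.
p1 = {"id": 1, "files": {"A", "B"}, "actual_reviewers": {"R1", "R2"}}
p2 = {"id": 2, "files": {"B", "C"}, "actual_reviewers": {"R3", "R4"}}
print(potential_reviewers(p1, [p1, p2]))  # {'R3', 'R4'}
```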
4.3 Evaluation Analysis
To evaluate our approach, we use four performance measures. Then, we perform a statistical analysis to determine the statistical difference between our approach and the other approaches.
4.3.1 Performance Measures. We evaluate our WLRRec from two perspectives. First, we evaluate our approach from the perspective of recommendation systems. Hence, for each newly-submitted patch (i.e., a patch in the 10% sub-dataset), we use Precision, Recall, and F-measure to measure the accuracy of our approach. Second, we evaluate our approach from the perspective of search-based software engineering. Hence, for each newly-submitted patch, we use Hypervolume to evaluate the performance of the search algorithms and to guide the search. This measure has been used in previous work (e.g., [20]) as a performance indicator for multi-objective optimization. We briefly describe the calculation of each performance measure as follows.
Precision measures the proportion of the recommended reviewers that are in the ground-truth data. We measure precision for a newly-submitted patch p_i as $P(p_i) = \frac{|rec(p_i) \cap g(p_i)|}{|rec(p_i)|}$, where rec(p_i) is the set of recommended reviewers and g(p_i) is the set of reviewers in the ground-truth data. Recall measures the proportion of reviewers in the ground-truth data that are recommended by the approach. We measure recall for a newly-submitted patch p_i as $R(p_i) = \frac{|rec(p_i) \cap g(p_i)|}{|g(p_i)|}$. F-measure is the harmonic mean of precision and recall, i.e., $F(p_i) = \frac{2 (P(p_i) \times R(p_i))}{P(p_i) + R(p_i)}$. Hypervolume is a quality indicator for the volume of the space covered by the non-dominated solutions from the search algorithm [39]. It indicates the convergence and diversity of the solutions on a Pareto front (i.e., the higher the hypervolume, the better the performance), and is computed as follows [26]: $HV = volume\left(\bigcup_{i=1}^{|S|} v_i\right)$, where S is the set of solutions from the Pareto front and v_i is the hypercube established between the solution i and the reference point.
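The per-patch precision, recall, and F-measure can be computed with a few lines of code, as in the sketch below (hypervolume is omitted, since it is typically taken from the optimization framework):

```python
def precision_recall_f1(recommended, ground_truth):
    """Per-patch precision, recall, and F-measure over sets of reviewer names."""
    hits = len(recommended & ground_truth)
    p = hits / len(recommended) if recommended else 0.0
    r = hits / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# One recommended reviewer out of two is in the ground truth of three reviewers.
print(precision_recall_f1({"R1", "R3"}, {"R1", "R2", "R4"}))  # (0.5, 0.333..., 0.4)
```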
4.3.2 Statistical Analysis. To compare the performance between our WLRRec and the other benchmarks, we compute the performance gain of WLRRec over a compared benchmark Y for a performance measure pm (i.e., precision, recall, F-measure, or hypervolume) using the following calculation: $Gain(pm, Y) = \frac{WLRRec_{pm} - Y_{pm}}{Y_{pm}} \times 100\%$.
In addition, we use the Wilcoxon Signed Rank test [8] (α = 0.05) to determine whether the performance of our WLRRec is statistically better than that of the other approaches. The Wilcoxon Signed Rank test is a paired non-parametric test which does not assume a normal distribution. In addition, we measure the effect size (i.e., the magnitude of difference) using the Vargha and Delaney's $\hat{A}_{XY}$ non-parametric effect size measure [33]. The $\hat{A}_{XY}$ measures the probability that the performance achieved by the approach X is better than the performance achieved by the approach Y. Considering a performance measure pm (i.e., precision, recall, F-measure, and hypervolume), the effect size is measured using the following calculation: $\hat{A}_{XY}(pm) = \frac{\#(X_{pm} > Y_{pm}) + 0.5\,\#(X_{pm} = Y_{pm})}{P}$, where X_{pm} is the performance pm of the approach X (i.e., our approach), Y_{pm} is the performance pm of the approach Y (e.g., random search), and P is the number of newly-submitted patches (i.e., the size of the 10% sub-dataset). The difference is considered trivial for $\hat{A}_{XY} \le 0.147$, small for $0.147 < \hat{A}_{XY} \le 0.33$, medium for $0.33 < \hat{A}_{XY} \le 0.474$, and large for $\hat{A}_{XY} > 0.474$ [12].
4.4 Experimental Settings
Our approach was implemented using the MOEA Framework. We employed the tournament selection method and set the size of the initial population to 100. The number of generations was set to 100,000. The crossover probability was set to 0.9, the mutation probability to 0.1, and the reproduction probability to 0.2. We set the parameters α_1, α_2, α_3, α_4 to 0.25 as default values.
5 RESULTS
(RQ1) Can our WLRRec recommend reviewers who are potentially suitable for a newly-submitted patch?
Results. As a sanity check, the first row of Table 2 presents the performance results of our WLRRec when using only the actual reviewers as the ground truth. We find that our WLRRec approach achieves a precision of 16%-20%, a recall of 30%-37%, an F-measure of 17%-28%, and a hypervolume of 72%-83%. These results suggest that our WLRRec can identify the reviewers who actually reviewed the patches in the past. However, as discussed in Section 4.2 (Step 3), there is no guarantee that these actual reviewers are the only group of suitable reviewers. Hence, the goal of our work is not limited to finding the exact group of actual reviewers, but extends to finding potential reviewers who might be able to review the patch in the future. For the remaining results, we evaluate our WLRRec when using the combination of actual and potential reviewers as the ground truth.
Our WLRRec can recommend reviewers who are likely to accept a review request with an F-measure of 0.32-0.43, which is 137%-260% better than the random search approach. Tables 2 and 3 show that our WLRRec can recommend reviewers who are likely to accept a review request with a precision value of 31% for Android, 34% for LibreOffice, 36% for QT, and 32% for OpenStack. Furthermore, Table 2 shows that our WLRRec achieves a recall value of 50% for Android, LibreOffice, and QT, and 53% for OpenStack. Table 3 also shows that our WLRRec achieves a recall value 201%-277% better than the random search approach. The hypervolume values of 72%-84% achieved by our WLRRec also indicate that our multi-objective search algorithm is 85%-214% better than a random search approach. The Wilcoxon Signed Rank tests (p < 0.001) and the large $\hat{A}_{XY}$ effect size values ($\hat{A}_{XY} > 0.474$) confirm that the performance of our WLRRec is statistically better than that of the random search approach in terms of precision, recall, F-measure, and hypervolume.
Discussion. The results of RQ1 suggest that our WLRRec can recommend reviewers who are potentially suitable for a newly-submitted patch. More specifically, we find that when considering the 4+1 key reviewer metrics, our WLRRec can recommend reviewers who actually reviewed the patches in the past. In addition, when expanding the ground-truth data to include potential reviewers (e.g., those who should be able to review the newly-submitted patch), about half of the potential reviewers can be identified by our WLRRec (cf. the recall values of our approach). These empirical results suggest the effectiveness of considering different factors when recommending reviewers for a newly-submitted patch.
(RQ2) Does our WRLRec benet from the
multi-objective search-based approach?
Results. Table 2 shows that the single-objective approach that maximizes the chance of participating in a review (GA-CPR) achieves a precision of 15%-17%, a recall of 18%-25%, and an F-measure of 17%-21%. On the other hand, the single-objective approach that minimizes the skewness of the reviewing workload distribution (GA-SRW) achieves a precision of 17%-20%, a recall of 21%-27%, and an F-measure of 18%-22%. Note that we did not measure hypervolume for the GA-CPR and GA-SRW approaches since this performance measure is not applicable to single-objective approaches.

Our WLRRec outperforms the single-objective approaches with a performance gain of 55%-142% for precision and 78%-178% for recall. Table 3 shows that our WLRRec achieves 88%-142% higher precision, 111%-178% higher recall, and 52%-124% higher F-measure than the GA-CPR approach. Similarly, we also find that our WLRRec achieves 55%-101% higher precision, 96%-138% higher recall, and 45%-111% higher F-measure than the GA-SRW approach. The Wilcoxon Signed Rank tests and the magnitude of the differences measured by the $\hat{A}_{XY}$ effect size also confirm that the difference is statistically significant with a large magnitude of difference ($\hat{A}_{XY} > 0.474$).
Discussion. The results of our RQ2 indicate that our WLRRec, which uses a multi-objective approach (i.e., maximizing the chance of participating in a review while minimizing the skewness of the reviewing workload distribution), is statistically better than the single-objective approaches when recommending reviewers who are potentially suitable for a newly-submitted patch. These results suggest that considering multiple objectives at the same time would allow us to find other potential reviewers that might be overlooked by the previous approaches [19, 32]. Furthermore, Table 2 shows that the GA-SRW approach (which considers only the workload distribution) achieves a recall relatively better than the GA-CPR approach, which only focuses on the reviewing experience, historical interaction, and activeness of reviewers (similar to RevRec [19]). These empirical results highlight the benefits of using multi-objective search-based approaches and considering the workload distribution of reviewers when recommending reviewers for a newly-submitted patch.
Table 2: The precision (P), recall (R), F-measure (F1), and hypervolume (HV) values ([0,1]) of our WLRRec. The first row presents the results when using the actual reviewers (Act.) as the ground truth (GT), while the other rows present the results when using the combination of actual and potential reviewers (Act.+Pot.) as the ground truth.

GT         Technique        Android              LibreOffice          QT                   OpenStack
                            P    R    F1   HV    P    R    F1   HV    P    R    F1   HV    P    R    F1   HV
Act.       WLRRec           0.20 0.30 0.28 0.72  0.16 0.31 0.17 0.74  0.19 0.37 0.27 0.83  0.20 0.34 0.22 0.82
Act.+Pot.  WLRRec           0.31 0.50 0.35 0.72  0.34 0.50 0.43 0.74  0.36 0.50 0.38 0.84  0.32 0.53 0.32 0.82
Act.+Pot.  (RQ1) Random     0.08 0.16 0.10 0.35  0.12 0.17 0.16 0.40  0.12 0.15 0.13 0.27  0.10 0.14 0.14 0.28
Act.+Pot.  (RQ2) GA-CPR     0.15 0.18 0.17 -     0.16 0.23 0.19 -     0.15 0.19 0.17 -     0.17 0.25 0.21 -
Act.+Pot.  (RQ2) GA-SRW     0.20 0.21 0.18 -     0.17 0.24 0.21 -     0.20 0.21 0.18 -     0.18 0.27 0.22 -
Act.+Pot.  (RQ3) MOCell     0.24 0.26 0.24 0.56  0.23 0.30 0.22 0.61  0.36 0.34 0.29 0.67  0.23 0.36 0.22 0.63
Act.+Pot.  (RQ3) SPEA2      0.27 0.40 0.29 0.52  0.26 0.32 0.22 0.58  0.36 0.34 0.29 0.57  0.21 0.34 0.27 0.57
Table 3: The performance gain of our proposed WLRRec over the other benchmarks.

Technique          Android                   LibreOffice               QT                        OpenStack
                   P     R     F1    HV      P     R     F1    HV      P     R     F1    HV      P     R     F1    HV
WLRRec–Random      288%  207%  260%  108%    175%  201%  161%  85%     203%  233%  189%  214%    217%  277%  137%  199%
WLRRec–GA-CPR      107%  178%  104%  -       113%  117%  124%  -       142%  163%  123%  -       88%   111%  52%   -
WLRRec–GA-SRW      55%   138%  92%   -       101%  108%  103%  -       82%   138%  111%  -       78%   96%   45%   -
WLRRec–MOCell      31%   95%   42%   28%     48%   68%   95%   21%     0%    48%   31%   25%     42%   45%   45%   31%
WLRRec–SPEA2       17%   26%   19%   37%     30%   59%   95%   29%     0%    48%   31%   47%     52%   55%   19%   43%
(RQ3) Does the choice of search algorithms impact the performance of our WLRRec?
Results. Table 2 shows that when using MOCell instead of NSGA-II to search for the optimal solutions, the approach achieves a precision of 23%-36%, a recall of 26%-36%, an F-measure of 22%-24%, and a hypervolume of 56%-67%. Similarly, using SPEA2 to search for the optimal solutions achieves a precision of 21%-26%, a recall of 32%-40%, an F-measure of 22%-29%, and a hypervolume of 52%-58%.

Our WLRRec with NSGA-II achieves 31%-95% and 19%-95% higher F-measure than the MOCell and SPEA2 approaches, respectively. Table 3 shows that our WLRRec, which uses NSGA-II, achieves 31%-48% higher precision, 45%-95% higher recall, 31%-95% higher F-measure, and 21%-31% higher hypervolume than the MOCell approach. Similarly, we also find that our WLRRec achieves 0%-52% higher precision, 26%-59% higher recall, 19%-95% higher F-measure, and 29%-47% higher hypervolume than the SPEA2 approach. The Wilcoxon Signed Rank tests and the magnitude of the differences measured by the $\hat{A}_{XY}$ effect size also show that the difference is statistically significant with a large magnitude of difference ($\hat{A}_{XY} > 0.474$).
Discussion. The results of our RQ3 indicate that the choice of the multi-objective search-based algorithm has an impact on the performance of our approach. More specifically, when using the other multi-objective algorithms (i.e., MOCell and SPEA2), the performance of our approach decreases in terms of recall, F-measure, and hypervolume. In addition, the higher hypervolume value of our WLRRec indicates that our approach finds solutions that satisfy the two objectives better than the other two multi-objective approaches. These empirical results suggest that the NSGA-II algorithm that we leveraged is an appropriate multi-objective approach to find solutions in this problem domain.
6 THREATS TO VALIDITY
Construct Threat to Validity is related to the ground-truth set of reviewers for evaluating our approach. While the actual reviewers are recorded in historical data, there is no guarantee that these actual reviewers are the only group of suitable reviewers. Ouni et al. [19] point out that potential reviewers may be assigned to a code change to which they do not contribute, due to several reasons including the current workload, the availability, and the social relationship with the patch author. To mitigate this threat, as suggested by Ouni et al. [19], we considered as the ground truth the set of potential reviewers who should be able to review the changed files because they have reviewed some of these files in other patches, which may be more realistic for evaluation than the set of actual reviewers. Moreover, the goal of our work is not limited to finding the exact group of actual reviewers, but extends to finding potential reviewers who might be able to review the patch in the future.

External Threats to Validity are related to the generalizability of our results. Although we empirically evaluated our approach on four large open-source systems from different application domains (Android, Qt, OpenStack, and LibreOffice), we do not claim that the same results would be achieved with other projects or other periods of time.
7 CONCLUSION AND FUTURE WORK
In this paper, we develop a multi-objective search-based approach called Workload-aware Reviewer Recommendation (WLRRec) to find reviewers for a newly-submitted patch. Our results suggest that: (1) when considering five reviewer metrics, our WLRRec can recommend reviewers who are potentially suitable for a newly-submitted patch with 19%-260% higher F-measure than the five benchmarks; (2) including an objective to minimize the skewness of the review workload distribution would be beneficial to find other potential reviewers that might be overlooked by the other approaches that focus on reviewing experience; and (3) the multi-objective meta-heuristic algorithm, NSGA-II, can be used to search for reviewers who are potentially suitable for a newly-submitted patch. Our empirical results shed light on the potential of using a wider range of information and leveraging a multi-objective meta-heuristic algorithm to find reviewers who are potentially suitable for a newly-submitted patch.

Our future work will incorporate other factors which may affect the quality and productivity of the code review process, leading to the formulation of new objectives and constraints which should be considered in the search for generating optimal solutions. Currently, our work considers only one newly-submitted code patch at a time. Thus, our future work will extend the consideration to all code patches that are currently sitting in the queue at the same time. This may require a new solution representation.
REFERENCES
[1] Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and challenges of modern code review. In Proceedings of ICSE. 712–721.
[2] Vipin Balachandran. 2013. Reducing human effort and improving quality in peer code reviews using automatic static analysis and reviewer recommendation. In Proceedings of ICSE. 931–940.
[3] Gabriele Bavota and Barbara Russo. 2015. Four eyes are better than two: On the impact of code reviews on software quality. In Proceedings of ICSME. 81–90.
[4] Olga Baysal, Oleksii Kononenko, Reid Holmes, and Michael W. Godfrey. 2015. Investigating Technical and Non-Technical Factors Influencing Modern Code Review. Journal of EMSE 21, 3 (2015), 932–959.
[5] Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, and Anand Swaminathan. 2006. Mining email social networks. In Proceedings of MSR. 137–143.
[6] Christian Bird, Nachiappan Nagappan, Brendan Murphy, Harald Gall, and Premkumar Devanbu. 2011. Don't touch my code!: examining the effects of ownership on software quality. In Proceedings of ESEC/FSE. ACM, 4–14.
[7] Amiangshu Bosu, Jeffrey C. Carver, Christian Bird, Jonathan Orbeck, and Christopher Chockley. 2017. Process Aspects and Social Dynamics of Contemporary Code Review: Insights from Open Source Development and Industrial Practice at Microsoft. TSE 43, 1 (2017), 56–75.
[8] J Cohen. 1988. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
[9] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. TEVC 6, 2 (2002), 182–197.
[10] Kazuki Hamasaki, Raula Gaikovina Kula, Norihiro Yoshida, AE Cruz, Kenji Fujiwara, and Hajimu Iida. 2013. Who does what during a code review? Datasets of OSS peer review repositories. In Proceedings of MSR. 49–52.
[11] Ahmed E Hassan. 2009. Predicting faults using the complexity of code changes. In Proceedings of ICSE. 78–88.
[12] Melinda R Hess and Jeffrey D Kromrey. 2004. Robust confidence intervals for effect sizes: A comparative study of Cohen's d and Cliff's delta under non-normality and heterogeneous variances. In Annual Meeting of the American Educational Research Association. 1–30.
[13] Dean C Karnopp. 1963. Random search techniques for optimization problems. Automatica 1, 2-3 (1963), 111–121.
[14] Oleksii Kononenko, Olga Baysal, Latifa Guerrouj, Yaxin Cao, and Michael W Godfrey. 2015. Investigating code review quality: Do people and participation matter?. In Proceedings of ICSME. 111–120.
[15] Vladimir Kovalenko, Nava Tintarev, Evgeny Pasynkov, Christian Bird, and Alberto Bacchelli. 2018. Does reviewer recommendation help developers? TSE, to appear (2018), 1–23.
[16] Laura MacLeod, Michaela Greiler, Margaret-Anne Storey, Christian Bird, and Jacek Czerwonka. 2018. Code Reviewing in the Trenches. IEEE Software 35 (2018), 34–42.
[17] Shane McIntosh, Yasutaka Kamei, Bram Adams, and Ahmed E Hassan. 2016. An empirical study of the impact of modern code review practices on software quality. Journal of EMSE 21, 5 (2016), 2146–2189.
[18] Antonio J Nebro, Juan J Durillo, Francisco Luna, Bernabé Dorronsoro, and Enrique Alba. 2009. MOCell: A cellular genetic algorithm for multiobjective optimization. International Journal of Intelligent Systems 24, 7 (2009), 726–746.
[19] Ali Ouni, Raula Gaikovina Kula, and Katsuro Inoue. 2016. Search-based peer reviewers recommendation in modern code review. In Proceedings of ICSME. IEEE, 367–377.
[20] Ali Ouni, Raula Gaikovina Kula, Marouane Kessentini, Takashi Ishio, Daniel M German, and Katsuro Inoue. 2017. Search-based software library recommendation using multi-objective optimization. IST 83 (2017), 55–75.
[21] Mohammad Masudur Rahman, Chanchal K Roy, and Jason A Collins. 2016. CORRECT: Code reviewer recommendation in GitHub based on cross-project and technology experience. In Proceedings of ICSE (Companion). 222–231.
[22] Peter C Rigby and Christian Bird. 2013. Convergent contemporary software peer review practices. In Proceedings of FSE. ACM, 202–212.
[23] Peter C Rigby and Margaret-Anne Storey. 2011. Understanding broadcast based peer review on open source software projects. In Proceedings of ICSE. 541–550.
[24] Shade Ruangwan, Patanamon Thongtanunam, Akinori Ihara, and Kenichi Matsumoto. 2019. The Impact of Human Factors on the Participation Decision of Reviewers in Modern Code Review. Journal of EMSE (2019).
[25] Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. 2018. Modern Code Review: A Case Study at Google. In Proceedings of ICSE (Companion). 181–190.
[26] Raphael Saraiva, Allysson Allex Araujo, Altino Dantas, Italo Yeltsin, and Jerffeson Souza. 2017. Incorporating decision maker's preferences in a multi-objective approach for the software release planning. Journal of the Brazilian Computer Society 23, 1 (2017), 11.
[27] Claude Elwood Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal 27, 3 (1948), 379–423.
[28] Patanamon Thongtanunam and Ahmed E. Hassan. 2020. Review Dynamics and Their Impact on Software Quality. TSE (2020), to appear. https://doi.org/10.1109/TSE.2020.2964660
[29] Patanamon Thongtanunam, Shane McIntosh, Ahmed E Hassan, and Hajimu Iida. 2015. Investigating Code Review Practices in Defective Files: An Empirical Study of the Qt System. In Proceedings of MSR. 168–179.
[30] Patanamon Thongtanunam, Shane McIntosh, Ahmed E. Hassan, and Hajimu Iida. 2016. Revisiting Code Ownership and its Relationship with Software Quality in the Scope of Modern Code Review. In Proceedings of ICSE. 1039–1050.
[31] Patanamon Thongtanunam, Shane McIntosh, Ahmed E Hassan, and Hajimu Iida. 2017. Review participation in modern code review. Journal of EMSE 22, 2 (2017), 768–817.
[32] Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula, Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. 2015. Who should review my code? A file location-based code-reviewer recommendation approach for modern code review. In Proceedings of SANER. 141–150.
[33] András Vargha and Harold D Delaney. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25, 2 (2000), 101–132.
[34] Xin Xia, David Lo, Xinyu Wang, and Xiaohu Yang. 2015. Who should review this change?: Putting text and file location analyses together for more accurate recommendations. In Proceedings of ICSME. 261–270.
[35] Xin Yang, Raula Gaikovina Kula, Norihiro Yoshida, and Hajimu Iida. 2016. Mining the modern code review repositories: A dataset of people, process and product. In Proceedings of MSR. ACM, 460–463.
[36] Yue Yu, Huaimin Wang, Gang Yin, and Charles X Ling. 2014. Reviewer recommender of pull-requests in GitHub. In Proceedings of ICSME. IEEE, 609–612.
[37] Motahareh Bahrami Zanjani, Huzefa Kagdi, and Christian Bird. 2016. Automatically recommending peer reviewers in modern code review. TSE 42, 6 (2016), 530–543.
[38] Eckart Zitzler, Marco Laumanns, and Lothar Thiele. 2001. SPEA2: Improving the strength Pareto evolutionary algorithm. TIK-Report 103 (2001).
[39] Eckart Zitzler and Lothar Thiele. 1999. Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto approach. TEVC 3, 4 (1999), 257–271.