
Workload-Aware Reviewer Recommendation using a Multi-objective Search-Based Approach

Wisam Haitham Abbood Al-Zubaidi, University of Wollongong, Australia (whaa807@uowmail.edu.au)
Patanamon Thongtanunam, The University of Melbourne, Australia (patanamon.t@unimelb.edu.au)
Hoa Khanh Dam, University of Wollongong, Australia (hoa@uow.edu.au)
Chakkrit Tantithamthavorn, Monash University, Australia (chakkrit@monash.edu)
Aditya Ghose, University of Wollongong, Australia (aditya@uow.edu.au)

ABSTRACT

Reviewer recommendation approaches have been proposed to provide automated support in finding suitable reviewers to review a given patch. However, they mainly focus on reviewer experience and do not take into account the review workload, which is another important factor for a reviewer when deciding whether to accept a review invitation. We set out to empirically investigate the feasibility of automatically recommending reviewers while considering the review workload amongst other factors. We develop a novel approach that leverages a multi-objective meta-heuristic algorithm to search for reviewers guided by two objectives, i.e., (1) maximizing the chance of participating in a review, and (2) minimizing the skewness of the review workload distribution among reviewers. Through an empirical study of 230,090 patches with 7,431 reviewers spread across four open source projects, we find that our approach can recommend reviewers who are potentially suitable for a newly-submitted patch with a 19%-260% higher F-measure than the five benchmarks. Our empirical results demonstrate that the review workload and other important information should be taken into consideration when finding reviewers who are potentially suitable for a newly-submitted patch. In addition, the results show the effectiveness of realizing this approach using a multi-objective search-based approach.

CCS CONCEPTS
• Software and its engineering → Search-based software engineering; Collaboration in software development.

KEYWORDS
Code Review, Reviewer Recommendation, Search-Based Software Engineering

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
PROMISE '20, November 8–9, 2020, Virtual, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8127-7/20/11...$15.00
https://doi.org/10.1145/3416508.3417115

ACM Reference Format:
Wisam Haitham Abbood Al-Zubaidi, Patanamon Thongtanunam, Hoa Khanh Dam, Chakkrit Tantithamthavorn, and Aditya Ghose. 2020. Workload-Aware Reviewer Recommendation using a Multi-objective Search-Based Approach. In Proceedings of the 16th ACM International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '20), November 8–9, 2020, Virtual, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3416508.3417115

1 INTRODUCTION

Code review is one of the important quality assurance practices in a software development process. The main goal of code review is to improve the overall quality of a patch (i.e., a set of software changes) through a manual examination done by developers other than the patch author. Recently, many software organizations have adopted a lightweight variant of code review called Modern Code Review (MCR) [1, 15, 22, 25], which focuses on collaboration among team members to achieve high quality of a software product. Several studies have shown that active and rigorous code review can decrease the number of post-release defects [17, 29]. In addition, the collaborative practice of MCR provides additional benefits to team members such as knowledge transfer [1] and sharing code ownership [30].

Effective code reviews require active participation of reviewers with related knowledge or experience [2, 23]. Several studies have shown that a patch tends to be less defective when it was reviewed and discussed extensively by many reviewers [3, 14, 29]. Furthermore, recent work has shown that active review participation can only be achieved by inviting active and experienced reviewers [31]. However, finding suitable reviewers is not a trivial task, and it has to be done quickly to avoid a negative impact on code review timeliness [32]. Hence, several studies have proposed automated approaches to support the recommendation of reviewers for a newly-submitted patch [2, 19, 32, 34, 36, 37].

Intuitively, reviewing experience should be the key factor when selecting reviewers for a newly-submitted patch [2, 32, 34, 37]. However, recent studies pointed out that requesting only experts or active reviewers for a review could potentially burden them with many reviewing tasks [7, 16]. Prior work has also shown that the length of review queues, i.e., the number of pending review requests, can lead to review delays [4]. In fact, recent work [15, 25] reported that development teams in practice take the review workload into account when selecting reviewers for a newly-submitted patch. Furthermore, Ruangwan et al. [24] showed that invited reviewers often considered their workload (i.e., the number of remaining reviews) when deciding whether they should accept new invitations. These empirical findings highlight that the reviewer workload should be considered when selecting reviewers for a newly-submitted patch.

Hence, this work conducts an empirical study to investigate the feasibility of considering the review workload amongst other factors in automatically recommending reviewers. To do so, we develop a Workload-aware Reviewer Recommendation approach called WLRRec. Unlike previous reviewer recommendation approaches, which mainly focus on reviewing experience [2, 19, 32, 36, 37], our WLRRec considers a wider range of information, including the review workload. More specifically, our WLRRec recommends reviewers based on 4+1 key reviewer metrics, i.e., four metrics including code ownership, reviewing experience, familiarity with the patch author, and review participation rate, and one metric representing the review workload. We use these metrics to define two objectives: (1) maximizing the chance of participating in a review, and (2) minimizing the skewness of the review workload distribution among reviewers. To find reviewers for a newly-submitted patch, our WLRRec leverages a multi-objective meta-heuristic algorithm, namely the non-dominated sorting genetic algorithm (NSGA-II) [9], to search for solutions that meet these two objectives.

Through an evaluation of 230,090 patches with 7,431 reviewers spread across four large open source software projects (Android, LibreOffice, Qt, and OpenStack), our results suggest that:
(1) when considering 4+1 key reviewer metrics, our WLRRec can recommend reviewers who are potentially suitable for a newly-submitted patch with a 19%-260% higher F-measure than the five benchmarks;
(2) including an objective to minimize the skewness of the review workload distribution would be beneficial to find other potential reviewers that might be overlooked by the other approaches which focus only on reviewing experience; and
(3) the multi-objective meta-heuristic algorithm, NSGA-II, is effective in searching for reviewers who are potentially suitable for a newly-submitted patch.
Our empirical results demonstrate the potential of using a wider range of information and leveraging a multi-objective meta-heuristic algorithm to find reviewers who are potentially suitable for a newly-submitted patch.

2 BACKGROUND AND RELATED WORK

2.1 Modern Code Review

Modern code review is a code review process that is a lightweight variant of software inspection. The process is supported by an online code review tool (e.g., Gerrit), and is now widely used in both open source and industrial software projects [22]. Modern code review is a collaborative process in which developers other than the author examine a patch (i.e., a set of code changes) submitted by the author. The modern code review process typically consists of five main steps:

(1) An author uploads a patch to a code review tool.
(2) The author makes a review request to reviewers.
(3) The reviewers who accept the review request examine the patch, then provide comments and a vote, where a positive vote indicates that the patch is of sufficient quality and a negative vote indicates that the patch needs a revision.
(4) The author revises the patch to address the reviewer feedback and uploads a new revision.
(5) If the revised patch addresses the reviewer concerns, the patch is marked as merged for integration into the main code repository. If the patch requires a large rework, the patch is marked as abandoned.
Figure 1: An example of a review in the LibreOffice Project.

Figure 1 provides an example of a code review in the LibreOffice project. In this example, Justin is the author of the patch. Two additional developers (i.e., Miklos and Szymon) were invited to review the patch, while Jenkins is the automated Continuous Integration (CI) system. Although two reviewers were invited, only Miklos participated in this review by providing a vote of +2. Furthermore, the review of this patch took three days from its creation date (September 5, 2017) to receiving a vote from the reviewer (September 8, 2017). Hence, finding suitable reviewers would benefit the code review process in terms of reducing delay. However, finding reviewers who will participate in a code review is challenging as many factors can play a role, which we discuss in the next subsection.

2.2 Related Work

2.2.1 Reviewer Recommendation Approaches. Finding reviewers can be a challenging task for a patch author, especially in globally-distributed software development teams like open source software projects. Thongtanunam et al. [32] showed that the patch authors of 4% to 30% of the reviews in open source projects could not find reviewers. This problem delays the reviewing process and consequently affects software development overall.
Figure 2: An overview of our approach

To address this problem, a number of reviewer recommendation approaches have been proposed to help a patch author find reviewers for a newly-submitted patch [2, 19, 32, 36, 37]. The key idea of these existing approaches is to find reviewers with high reviewing experience. They assume that reviewers for a newly-submitted patch should be those who have reviewed many related patches in the past. For example, Balachandran [2] proposed ReviewBot, which leverages the line change history of the changed files in a newly-submitted patch to find reviewers. Thongtanunam et al. [32] proposed RevFinder, which identifies related past patches based on the file path similarity of the changed files in a newly-submitted patch. Zanjani et al. [37] showed that reviewing experience can change over time. Thus, they proposed cHRev, which considers the frequency and recency of reviewing activities when computing reviewing experience.

Recent reviewer recommendation approaches also take into account the historical interaction between reviewers and the patch author. For example, Yu et al. [36] built a review collaboration network to measure the co-reviewing frequency among developers. Hence, the reviewers for a newly-submitted patch are those who have high reviewing experience and high co-reviewing frequency with the author of that patch. Instead of using a heuristic approach like Yu et al. [36], Ouni et al. [19] developed RevRec, which uses a genetic algorithm (GA) to search for reviewers based on reviewing experience and the review collaboration network. Their work, however, follows a single-objective optimization approach.

2.2.2 Reviewer Participation & Selection. While the reviewer recommendation approaches in the literature mainly focus on reviewing experience and the historical interaction between the patch author and reviewers, several empirical studies have shown that other factors also play a role in reviewer participation. Kovalenko et al. [15] showed that when selecting a reviewer, the patch author considers a wide range of information such as the knowledge, ownership, and qualities of reviewers. Rigby and Storey [23] reported that in mailing list-based code reviews, reviewers select a patch for review based on their interests and the priorities of a patch. A survey study by Bosu et al. [7] also reported that reviewers tend to decline a review request if the patch is not in their area of ownership or expertise. Recent empirical studies have shown that the activeness of a reviewer is a strong indicator of reviewer participation [24, 31].

Several recent studies pointed out that the reviewing workload should be considered when selecting reviewers. A survey study by Kovalenko et al. [15] reported that the reviewer workload is one of the factors that a patch author considers when selecting a reviewer. Sadowski et al. [25] reported that development teams at Google use a system that assigns reviews in a round-robin manner to take reviewers' workload into account. Ruangwan et al. [24] also demonstrated empirical evidence that the number of remaining reviews has a negative impact on the likelihood that a reviewer accepts a review request. Moreover, Kononenko et al. [14] showed that the length of the review queue has a statistically significant impact on review quality (i.e., whether developers catch or miss bugs during a code review).

2.2.3 Novelty. Motivated by the empirical findings of prior studies, we develop a novel workload-aware reviewer recommendation (WLRRec) approach. Our WLRRec uses 4+1 key reviewer metrics, i.e., four metrics indicating whether a review request will be accepted, and one metric representing the review workload. Unlike previous reviewer recommendation approaches, our WLRRec is the first to take code ownership, the review participation rate, and the workload of reviewers into account. Since there are trade-offs in considering these 4+1 reviewer metrics, we employ a multi-objective search-based approach called the non-dominated sorting genetic algorithm (NSGA-II) [9]. The closest related work is RevRec by Ouni et al. [19], but they used a single-objective search-based approach (rather than the multi-objective approach we propose here). In addition, RevRec does not take the reviewer workload into consideration when recommending reviewers.

3 A WORKLOAD-AWARE REVIEWER RECOMMENDATION APPROACH

3.1 Approach Overview

Figure 2 provides an overview of our Workload-aware Reviewer Recommendation approach (WLRRec), which uses a search-based software engineering (SBSE) approach. Given a newly-submitted patch, our WLRRec first computes reviewer metrics for reviewer candidates (i.e., reviewers who have reviewed at least one patch in the past). We use five metrics which measure the experience, historical participation, and reviewing workload of reviewer candidates. Then, we employ an evolutionary search technique to generate optimal solutions, where each solution is a set of reviewer candidates. The search for solution candidates is guided by two objectives: (1) to maximize the chance of participating in a review, and (2) to minimize the skewness of the review workload distribution among reviewers. The first objective considers the experience and historical participation of reviewers, while the second objective considers the review workload of reviewers. These two objectives are in conflict with each other. If we recommend only experts or active reviewers for newly-submitted patches, the reviewing workload will be highly skewed towards those experts or active reviewers. On the other hand, if we only focus on establishing a highly balanced reviewing workload, the reviewers that we recommend may not be experienced and familiar with a code patch (which may affect the review quality) or active (which may delay both the review and development process).

The intuition behind the first objective is that, as found in prior studies, the areas of expertise and the areas of code with which reviewers are familiar are among the major reasons to accept a review request [4, 7, 16]. In addition, recent work has shown that historical participation (i.e., the review participation rate and historical interaction with a patch author) also plays an important role in the decision to accept a review request [7, 24]. For the second objective, several studies have pointed out that making review requests to only experts or active reviewers could potentially burden them with many reviewing tasks [7, 16]. A recent work [24] also shows that the number of remaining reviews (RR) is negatively associated with the likelihood of accepting a review request. Hence, we want to ensure that our reviewer recommendation approach does not add more reviews to a particular group of reviewers (e.g., experts). MacLeod et al. [16] also suggested that requesting less experienced (but available) reviewers could potentially speed up the code review process and balance the team's workload. These findings motivate us to develop an approach that considers both objectives at the same time.

In this work, we leverage the non-dominated sorting genetic algorithm (NSGA-II) [9] to find the optimal solutions with respect to the two objectives. The algorithm is based on the principle that a population of solution candidates is evolved towards better solutions. At the end of the search process, the algorithm returns a set of optimal solutions, i.e., sets of reviewers that satisfy our objectives. Finally, we use a Pareto front to identify the optimal solution, i.e., the set of reviewers that will be recommended for a newly-submitted patch. Below, we describe our reviewer metrics and the approaches for generating and identifying the optimal solutions in detail.

3.2 Compute Reviewer Metrics

To recommend reviewers, we measure the experience, historical participation, and reviewing workload of reviewer candidates using five metrics. These metrics are used in the fitness functions (i.e., objectives) of our multi-objective evolutionary approach. Below, we describe the intuition based on the literature and the calculation for each of our metrics.

Code Ownership (CO). CO measures the proportion of past reviews that had been authored by a reviewer candidate. Bird et al. [6] showed that a developer who authored many code changes should be regarded as an owner of the related areas of code. Hence, a reviewer for a newly-submitted patch should be an owner of the code that is impacted by the patch. Several studies also show that reviewers are likely to participate in patches for which they have related experience [7, 24]. Hence, we measure CO based on the approach of Bird et al. [6]. More specifically, given a newly-submitted patch $p$ for a review, we measure the CO of a reviewer candidate $r$ as follows:

$$CO(r, p) = \frac{1}{|M(p)|} \sum_{m \in M(p)} \frac{author(r, m)}{c(m)} \tag{1}$$

where $M(p)$ is the set of modules (i.e., directories) in the patch $p$, $author(r, m)$ is the number of past reviews in module $m$ that were authored by the reviewer $r$, and $c(m)$ is the total number of past reviews that involve module $m$.
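To make the calculation concrete, the following minimal Python sketch computes CO from a list of past review records. The record schema (dicts with "author" and "modules" keys) and the function name are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def code_ownership(reviewer, patch_modules, past_reviews):
    """CO(r, p): mean, over the modules touched by the patch, of the
    fraction of past reviews in that module authored by the reviewer.
    `past_reviews` is assumed to be a list of dicts with 'author'
    and 'modules' keys (a hypothetical schema)."""
    authored = defaultdict(int)  # reviews in module m authored by `reviewer`
    total = defaultdict(int)     # all past reviews touching module m
    for review in past_reviews:
        for m in review["modules"]:
            total[m] += 1
            if review["author"] == reviewer:
                authored[m] += 1
    # Modules with no past reviews contribute 0 to the sum.
    ratios = [authored[m] / total[m] for m in patch_modules if total[m] > 0]
    return sum(ratios) / len(patch_modules) if patch_modules else 0.0
```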

Reviewing Experience (RE). RE measures the proportion of past reviews that had been reviewed by a reviewer candidate. Similar to CO, a recent study showed that developers can gain expertise in related areas of code by actively participating in code reviews [30]. Several reviewer recommendation approaches are based on a similar intuition, i.e., the appropriate reviewers are those who reviewed many similar patches in the past [19, 21, 32, 37]. Hence, given a newly-submitted patch $p$ for a review, we measure the RE of a reviewer candidate $r$ using the calculation of Thongtanunam et al. [30]:

$$RE(r, p) = \frac{1}{|M(p)|} \sum_{m \in M(p)} \frac{review(r, m)}{c(m)} \tag{2}$$

where $review(r, m)$ is the proportion of review contributions that the reviewer candidate $r$ made to the past reviews $K$ in module $m$, computed as $\sum_{k \in K(r,m)} \frac{1}{R(k)}$ [30].

Familiarity with the Patch Author (FPA). Recent studies reported that, in addition to expertise, the relationship between the patch author and the reviewer often affects the decision of whether to accept a review request [7, 24]. To capture this relationship, FPA counts the number of past reviews that a reviewer candidate had done for the author of a newly-submitted patch. Hence, the higher the FPA value, the more historical interaction there has been between the reviewer candidate and the patch author, and the more likely it is that the reviewer candidate will participate in the code review of the newly-submitted patch.

Review Participation Rate (RPR). RPR measures the extent to which a candidate participated in code reviews in the past. More specifically, RPR measures the proportion of past reviews in which a reviewer candidate participated, relative to the number of past reviews to which the reviewer candidate was invited. A recent work of Ruangwan et al. [24] showed that RPR is one of the most influential factors affecting the participation decision of reviewers. The higher the RPR value, the more active the reviewer candidate is, and the more likely it is that the reviewer will participate in a code review of a newly-submitted patch.

Remaining Reviews (RR). A survey study reported that "too many review requests" is one of the reasons that reviewers did not respond to review requests [24]. Hence, we use RR to represent the review workload of a reviewer candidate. The quantitative analysis of Ruangwan et al. showed that RR is one of the most influential factors affecting the participation decision of reviewers [24]. In addition, Baysal et al. [4] showed that the length of the review queue, i.e., the number of pending review requests, can have an impact on code review timeliness. Hence, to measure RR, we count the number of review requests that a reviewer candidate had received but had not yet participated in at the time when the newly-submitted patch was created.
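As with CO and RE, the three remaining metrics reduce to simple aggregations over past review records. The sketch below illustrates one plausible way to derive FPA, RPR, and RR, assuming each record carries its author, invited reviewers, participants, and creation time; this schema is a hypothetical stand-in for the actual dataset layout.

```python
def familiarity_with_author(reviewer, patch_author, past_reviews):
    """FPA: number of past reviews the candidate did for this author."""
    return sum(1 for rv in past_reviews
               if rv["author"] == patch_author and reviewer in rv["participants"])

def review_participation_rate(reviewer, past_reviews):
    """RPR: fraction of review invitations the candidate acted on."""
    invited = [rv for rv in past_reviews if reviewer in rv["invited"]]
    if not invited:
        return 0.0
    return sum(1 for rv in invited if reviewer in rv["participants"]) / len(invited)

def remaining_reviews(reviewer, past_reviews, now):
    """RR: invitations received but not yet acted on at time `now`."""
    return sum(1 for rv in past_reviews
               if reviewer in rv["invited"]
               and rv["created"] <= now
               and reviewer not in rv["participants"])
```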

3.3 Generate Optimal Solutions
In this section, we describe the solution representation, the fitness functions for our two objectives, and the evolutionary search approach used to generate optimal solutions (i.e., sets of reviewers).

3.3.1 Solution Representation. We use a bit string to represent a solution candidate (i.e., a set of reviewer candidates). The length of a bit string is the number of all reviewer candidates. Each bit in the string has the value of 0 or 1. A bit value of 1 indicates that the corresponding reviewer is selected, while 0 indicates that the reviewer is excluded from a recommendation. For example, suppose a software project has five reviewer candidates (i.e., R1, R2, R3, R4, R5), and in a solution candidate $S$, reviewer candidates R1, R2, and R5 are selected for a recommendation. The solution candidate $S$ can then be represented as the bit string 11001.
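A minimal sketch of this encoding, assuming reviewer candidates are kept in a fixed-order list (the names are taken from the example above):

```python
candidates = ["R1", "R2", "R3", "R4", "R5"]

def decode(bits, candidates):
    """Map a bit string (e.g., '11001') back to the selected reviewers."""
    return [r for r, b in zip(candidates, bits) if b == "1"]

def encode(selected, candidates):
    """Map a set of selected reviewers to its bit-string representation."""
    chosen = set(selected)
    return "".join("1" if r in chosen else "0" for r in candidates)

assert decode("11001", candidates) == ["R1", "R2", "R5"]
assert encode(["R1", "R2", "R5"], candidates) == "11001"
```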

The reviewer candidates in our approach are those who have participated in code reviews in the past. However, because a large number of developers have participated in code reviews, a solution candidate can become long, resulting in excess computation time. Hence, we shorten our candidate solutions by removing reviewer candidates who have at least three metrics with zero values, since more zero-valued metrics indicate a weaker signal that the reviewer candidate will accept a review request. Note that we experimented with all possible metric thresholds for removing reviewer candidates (i.e., $t = \{1, 2, 3, 4, 5\}$). We found that removing reviewer candidates who have at least three metrics with zero values ($t = 3$) provides a reasonable length of candidate solutions (i.e., a median length of 23-381 reviewer candidates per candidate solution), while having a minimal impact on the accuracy of recommendation.

3.3.2 Fitness Functions. We now describe the calculation of the fitness functions for our two objectives.

Maximizing the Chance of Participating in a Review (CPR). For our first objective, we aim to find reviewers with maximum code ownership (CO), reviewing experience (RE), and review participation rate (RPR), and those who are highly familiar with the patch author (FPA). In other words, the recommended reviewers of our approach are those who have related expertise, actively participated in code reviews in the past, and reviewed many past patches for the author of a newly-submitted patch. To consider these four factors when recommending reviewers, we formulate the following fitness function:

$$CPR(S_i, p) = \sum_{r \in R} S_i(r) \big( \alpha_1 CO(r, p) + \alpha_2 RE(r, p) + \alpha_3 FPA(r, p) + \alpha_4 RPR(r, p) \big) \tag{3}$$

where $p$ is a newly-submitted patch for a review, $S_i$ is a candidate solution (i.e., a bit string of 0s and 1s for selecting reviewers), and $R$ is the set of all reviewer candidates. Each factor is weighted by an alpha ($\alpha$) value, where $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$. The higher the value of $CPR(S_i, p)$, the better the solution $S_i$ (i.e., the set of selected reviewers) is for the newly-submitted patch $p$.
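A direct translation of Equation (3) into Python might look as follows; `metrics[r]` is assumed to hold the pre-computed CO, RE, FPA, and RPR values for reviewer candidate `r` (an illustrative data layout, not the authors' implementation).

```python
def cpr_fitness(solution_bits, candidates, metrics,
                alphas=(0.25, 0.25, 0.25, 0.25)):
    """CPR(S_i, p): weighted sum of CO, RE, FPA, and RPR over the
    reviewers selected by the bit string (higher is better)."""
    a1, a2, a3, a4 = alphas  # assumed to sum to 1
    total = 0.0
    for r, bit in zip(candidates, solution_bits):
        if bit == "1":
            m = metrics[r]  # e.g., {"CO": 0.3, "RE": 0.1, "FPA": 2, "RPR": 0.8}
            total += a1 * m["CO"] + a2 * m["RE"] + a3 * m["FPA"] + a4 * m["RPR"]
    return total
```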

Minimizing the Skewness of the Reviewing Workload Distribution (SRW). To ensure that our recommendations will not burden a particular group of reviewers, we aim to balance the reviewing workload among reviewers. In other words, the number of remaining reviews should not be skewed towards a particular group of reviewers. Hence, we set an objective to minimize the skewness of the review workload distribution among reviewers. To do so, we adapt the calculation of Shannon's entropy [27] to measure the skewness of the remaining reviews (RR) distribution among reviewers. This is similar to the work of Hassan [11], who used Shannon's entropy to measure the distribution of modified code across modified files. More specifically, given a newly-submitted patch $p$ and a solution candidate $S_i$, we formulate the following fitness function:

$$SRW(S_i, p) = \frac{1}{\log_2 |R|} \sum_{r \in R} \big( H(r) \times \log_2 H(r) \big), \qquad H(r) = \frac{RR'(r)}{\sum_{k \in R} RR'(k)} \tag{4}$$

where $R$ is the set of all reviewer candidates, $H(r)$ is the proportion of remaining reviews of a reviewer candidate $r$, and $RR'(r)$ is the number of remaining reviews (RR) of $r$, including the newly-submitted patch if the reviewer $r$ is selected in the solution $S_i$.

For example, suppose the reviewer candidates are R1, R2, and R3, and their remaining reviews are 3, 5, and 2, respectively. Given that a solution $S_i$ selects R1 and R3 as recommended reviewers for a newly-submitted patch $p$, the $RR'$ values will be 4 (= 3 + 1), 5, and 3 (= 2 + 1) for R1, R2, and R3, respectively. Then, the SRW of the solution $S_i$ for the patch $p$ is $-0.98$ $\big( = \frac{1}{\log_2 3} \big( \frac{4}{12}\log_2\frac{4}{12} + \frac{5}{12}\log_2\frac{5}{12} + \frac{3}{12}\log_2\frac{3}{12} \big) \big)$. The lower the SRW value, the better the spread of workload among reviewers.
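The worked example above can be reproduced with a few lines of Python; this is a minimal sketch of Equation (4), not the authors' implementation.

```python
import math

def srw_fitness(selected, remaining_reviews):
    """SRW(S_i, p): normalized entropy term over RR' (lower is better).
    `remaining_reviews` maps each candidate to its RR value; selected
    reviewers get one extra review for the newly-submitted patch."""
    rr_prime = {r: rr + (1 if r in selected else 0)
                for r, rr in remaining_reviews.items()}
    total = sum(rr_prime.values())
    h = [v / total for v in rr_prime.values() if v > 0]
    return sum(p * math.log2(p) for p in h) / math.log2(len(rr_prime))

# Reproduces the example: RR = {R1: 3, R2: 5, R3: 2}, S_i selects R1 and R3.
print(round(srw_fitness({"R1", "R3"}, {"R1": 3, "R2": 5, "R3": 2}), 2))  # -0.98
```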

3.3.3 Evolutionary Search. We employ a multi-objective meta-heuristic algorithm, namely the non-dominated sorting genetic algorithm (NSGA-II) [9], to search for solutions that meet the above two objectives. Algorithm 1 provides a pseudo-code of the NSGA-II algorithm.

Algorithm 1: NSGA-II pseudo-code [9]
1:  P_0 ← randomly generate an initial population of size N
2:  evaluate P_0 against the objective functions
3:  select parents and apply crossover and mutation
4:  Q_0 ← create an offspring population
5:  t = 0
6:  while the number of generations is not reached do
7:      R_t ← merge(P_t, Q_t)
8:      F ← fast-non-dominated-sort(R_t)
9:      P_{t+1} = ∅ and j = 1
10:     while |P_{t+1}| + |F_j| ≤ N do
11:         calculate crowding-distance-assignment(F_j)
12:         P_{t+1} ← P_{t+1} ∪ F_j
13:         j = j + 1
14:     end while
15:     sort(F_j, ≺_n)
16:     P_{t+1} ← P_{t+1} ∪ F_j[1 : (N − |P_{t+1}|)]
17:     Q_{t+1} ← generate a new population from P_{t+1}
18: end while

NSGA-II starts by randomly generating an initial population $P_0$ (i.e., a set of solution candidates). Then, the fitness of each solution candidate in the population $P_0$ is measured with respect to the two fitness functions described in Section 3.3.2. The initial population $P_0$ is then evolved into a new generation of solution candidates (i.e., an offspring population $Q_0$) through selection and the genetic operators, i.e., crossover and mutation. The selection operator ensures that solution candidates in the current population are selected in proportion to their fitness values. The crossover operator takes two selected solution candidates as parents and swaps their bit strings to generate an offspring solution. The mutation operator randomly chooses certain bits in the string and inverts their values.

At each generation $t$, the current population $P_t$ and its offspring population $Q_t$ are merged into a new population $R_t$. Then, NSGA-II sorts the solution candidates in the population $R_t$ using the fast non-dominated sorting technique. This technique compares each solution with the other solutions in the population $R_t$ to find which solutions dominate which others. A solution $S_1$ is said to dominate another solution $S_2$ if $S_1$ is no worse than $S_2$ in all objectives and $S_1$ is strictly better than $S_2$ in at least one objective. After sorting, the fast non-dominated sorting technique provides Pareto fronts (i.e., sets of Pareto optimal solutions that are not dominated by any other solutions). For each Pareto front, NSGA-II calculates the crowding distance, which is the sum of the distances in terms of fitness values between each solution and its nearest neighbours in the same front. Then, the Pareto fronts are sorted in ascending order based on the crowding distance values. The population for the next generation $P_{t+1}$ contains the first $N$ solutions in the sorted Pareto fronts. The offspring population $Q_{t+1}$ is then generated from the population $P_{t+1}$ through the selection and genetic operators. This evolution process is repeated until a fixed number of generations has been reached. In the final generation, NSGA-II returns a set of Pareto optimal solutions.
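The core of this procedure is the dominance comparison. Below is a compact sketch of the dominance test and a naive (quadratic) version of non-dominated sorting, purely for illustration; NSGA-II's actual fast-non-dominated-sort is more efficient, and here both objectives are assumed to be rescaled so that larger is better (e.g., fitness tuples of (CPR, -SRW)).

```python
def dominates(s1, s2):
    """s1 dominates s2 if it is no worse in all objectives and
    strictly better in at least one (both objectives maximized)."""
    no_worse = all(a >= b for a, b in zip(s1, s2))
    strictly_better = any(a > b for a, b in zip(s1, s2))
    return no_worse and strictly_better

def non_dominated_sort(population):
    """Naive O(n^2) grouping of fitness tuples into successive Pareto fronts."""
    remaining = list(population)
    fronts = []
    while remaining:
        front = [s for s in remaining
                 if not any(dominates(o, s) for o in remaining if o is not s)]
        fronts.append(front)
        remaining = [s for s in remaining if s not in front]
    return fronts
```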

3.4 Select a Solution
From the previous step, NSGA-II returns a set of Pareto optimal solutions, i.e., the sets of reviewers that meet the optimal trade-off between our two objectives. This set of solutions can be presented to users for them to select from. In cases where no explicit user preferences are provided, the so-called knee point approach, which has been widely used in the evolutionary search literature, is applied to select the most preferred solution among the non-dominated solutions.

The knee point approach measures the Euclidean distance of each solution on the Pareto front from a reference point. Given that the reference point is the maximum chance of participating in a review ($CPR_{max}$) and the minimum skewness of the reviewing workload distribution ($SRW_{min}$), the Euclidean distance of a solution $S_i$ is calculated as follows:

$$Dist(S_i) = \sqrt{(CPR_{max} - CPR(S_i))^2 + (SRW_{min} - SRW(S_i))^2} \tag{5}$$

The selected solution is the one closest to the reference point. Figure 3 provides an illustrative example of the knee point approach. Given that the Pareto optimal solutions returned by NSGA-II are S1, S2, S3, and S4, and their fitness values are as shown in the plot, solution S3 is closest to the reference point, i.e., it has the highest chance that the selected reviewers will accept a review request while the reviewing workload is well distributed (low skewness). Hence, the solution S3 will be selected, and the reviewers associated with this solution will be recommended.
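A minimal sketch of this selection step (Equation (5)), assuming each Pareto-optimal solution is represented by a (CPR, SRW) fitness pair; the numeric values below are made up for illustration.

```python
import math

def knee_point(front):
    """Pick the solution closest to the reference point
    (max CPR, min SRW) on the Pareto front."""
    cpr_max = max(cpr for cpr, _ in front)
    srw_min = min(srw for _, srw in front)
    return min(front, key=lambda s: math.hypot(cpr_max - s[0], srw_min - s[1]))

# Four hypothetical non-dominated (CPR, SRW) pairs.
front = [(0.9, -0.20), (0.8, -0.55), (0.6, -0.90), (0.3, -0.97)]
print(knee_point(front))  # (0.6, -0.9), the knee of this front
```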

4 EXPERIMENTAL DESIGN
4.1 Research Questions
To evaluate our WLRRec, we formulate the following research questions.
(RQ1) Can our WLRRec recommend reviewers who are potentially suitable for a newly-submitted patch?

Figure 3: An illustrative example of identifying a knee point from the Pareto optimal solutions (objective 1: maximize the chance of participating in a review, based on reviewers' experience and historical participation; objective 2: minimize the skewness of the workload distribution).

Table 1: An overview of the evaluation datasets

Project      Period              # Patches   # Reviewers
Android      10/2008 - 12/2014   36,771      2,049
Qt           5/2011 - 12/2014    65,815      1,238
OpenStack    7/2011 - 12/2014    108,788     3,734
LibreOffice  3/2012 - 11/2016    18,716      410

We set out this RQ as a sanity check to determine whether our approach can recommend actual reviewers for a newly-submitted patch. In addition, we compare our approach with the Random Search optimization technique [13], which has been a common baseline for most search-based meta-heuristic algorithms.

(RQ2) Does our WLRRec benefit from the multi-objective search-based approach?
Since our WLRRec considers two objectives, we set out this RQ to empirically evaluate how well the multi-objective approach performs compared to single-objective approaches. To answer this RQ, we implemented the traditional single-objective genetic algorithm (GA) using either CPR or SRW as the objective. We then compare the performance of our WLRRec against these two single-objective approaches, i.e., GA-CPR and GA-SRW. Indeed, the GA-CPR approach is closely similar to RevRec of Ouni et al. [19], who used the genetic algorithm (GA) and considered the reviewing experience and historical interaction.

(RQ3) Does the choice of search algorithms impact the performance of our WLRRec?
Our WLRRec leverages a multi-objective optimization algorithm to recommend reviewers. Aside from NSGA-II, there are other multi-objective optimization algorithms. Hence, we set out this RQ to evaluate our WLRRec when using two recently developed multi-objective evolutionary algorithms: the Multiobjective Cellular Genetic Algorithm (MOCell) [18] and the Strength-based Evolutionary Algorithm (SPEA2) [38].


4.2 Datasets
In this work, we use code review datasets of four large open source software projects that actively use modern code review, i.e., Android, Qt, OpenStack, and LibreOffice. These projects have a large number of patches recorded in the code review tool. For Android, Qt, and OpenStack, we use the review datasets of Hamasaki et al. [10], which were often used in prior studies [19, 24, 28, 32]. For LibreOffice, we use the review dataset of Yang et al. [35]. The datasets include patch information, review discussion, and developer information. Below, we describe our data preparation approach, which consists of three main steps.

(Step 1) Cleaning Datasets. In this paper, we use the patches that were marked as either merged or abandoned. In addition, we exclude the patches that are self-reviewed (i.e., only the author of the patch was a reviewer) and the patches related to version control system activities (e.g., branch-merging patches). Note that we search for the keyword "merge branch" or "merge" in the description of the patch to identify the branch-merging patches. These patches are excluded from our evaluation because there might be no reviewers who actually review them. Since the code review tools of the studied projects are tightly integrated with automated checking systems (e.g., Continuous Integration test systems), we remove the accounts of automated checking systems (e.g., Jenkins CI or sanity checks) from the datasets. Similar to prior work [24], we use a semi-automated approach to identify the accounts of automated checking systems. Finally, we use the approach of Bird et al. [5] to identify and merge alias emails of developers. In total, we obtain 230,090 patches with 7,431 reviewers spread across the four open source projects. Table 1 provides an overview of our datasets.

(Step 2) Splitting Datasets. To evaluate our approach, we split each dataset into two sub-datasets: (1) the most recent 10% of the patches and (2) the remaining 90% of the patches. The 10% sub-dataset is used to evaluate our approach, i.e., the patches in this sub-dataset are treated as newly-submitted patches. The 90% sub-dataset is used for building a pool of reviewer candidates and computing reviewer metrics. As described in Section 3.3.1, reviewer candidates are those (1) who have provided either a vote score or a comment on at least one patch in the 90% sub-dataset and (2) for whom at least three reviewer metrics have a value greater than zero.

(Step 3) Generating Ground-Truth Data. For the ground-truth data of the patches in the 10% sub-dataset, we identify two groups of reviewers: (1) actual reviewers and (2) potential reviewers. Given a patch $p$ in the 10% sub-dataset, the actual reviewers of the patch $p$ are those who actually reviewed the patch $p$ by either providing a vote score or a comment. These actual reviewers are typically used to evaluate reviewer recommendation approaches [19, 32, 37]. Although these reviewers actually performed reviews in the historical data, there is no guarantee that they are the only group of suitable reviewers. Moreover, a recent work of Kovalenko et al. [15] has shown that although a reviewer recommendation approach may achieve good accuracy when the evaluation is based on historical data, developers may not always select the recommended reviewers when the approach is deployed, due to several factors, e.g., reviewers' availability.

Hence, to evaluate our approach, we expand our ground-truth data to include potential reviewers. Potential reviewers are the reviewers who should be able to review the changed files because they have reviewed some of these files in other patches. Specifically, given a patch $p$, the potential reviewers of the patch $p$ are those who were the actual reviewers of other patches in the 10% sub-dataset that made changes to at least one file that is also one of the changed files of the patch $p$. For example, a patch $p_1$ made changes to files $A$ and $B$, while a patch $p_2$ made changes to files $B$ and $C$. Reviewers R1 and R2 are the actual reviewers of the patch $p_1$, while reviewers R3 and R4 are the actual reviewers of the patch $p_2$. Hence, the potential reviewers for the patch $p_1$ are R3 and R4, as they reviewed file $B$ (i.e., the common changed file in patches $p_1$ and $p_2$).
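The following sketch illustrates this expansion of the ground truth, assuming each test patch is a dict with "files" and "actual_reviewers" fields (a hypothetical schema mirroring the example above):

```python
def potential_reviewers(patch, test_patches):
    """Actual reviewers of other test patches that share at least one
    changed file with `patch`."""
    shared = set()
    for other in test_patches:
        if other is patch:
            continue
        if set(patch["files"]) & set(other["files"]):
            shared.update(other["actual_reviewers"])
    return shared

p1 = {"files": ["A", "B"], "actual_reviewers": {"R1", "R2"}}
p2 = {"files": ["B", "C"], "actual_reviewers": {"R3", "R4"}}
print(potential_reviewers(p1, [p1, p2]))  # {'R3', 'R4'}
```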

4.3 Evaluation Analysis
To evaluate our approach, we use four performance measures. Then, we perform a statistical analysis to determine the statistical difference between our approach and the other approaches.
4.3.1 Performance Measures. We evaluate our WLRRec from two perspectives. First, we evaluate our approach from the perspective of recommendation systems. Hence, for each newly-submitted patch (i.e., a patch in the 10% sub-dataset), we use Precision, Recall, and F-measure to measure the accuracy of our approach. Second, we evaluate our approach from the perspective of search-based software engineering. Hence, for each newly-submitted patch, we use Hypervolume, which has been used in previous work (e.g., [20]) as a performance indicator for multi-objective optimization, to evaluate the performance of the search algorithms and to guide the search.

We briefly describe the calculation of each performance measure as follows. Precision measures the proportion of the recommended reviewers that are in the ground-truth data. We measure precision for a newly-submitted patch $p_i$ as $P(p_i) = \frac{|rec(p_i) \cap g(p_i)|}{|rec(p_i)|}$, where $rec(p_i)$ is the set of recommended reviewers and $g(p_i)$ is the set of reviewers in the ground-truth data. Recall measures the proportion of reviewers in the ground-truth data that are recommended by the approach. We measure recall for a newly-submitted patch $p_i$ as $R(p_i) = \frac{|rec(p_i) \cap g(p_i)|}{|g(p_i)|}$. F-measure is the harmonic mean of precision and recall: $F(p_i) = \frac{2 (P(p_i) \times R(p_i))}{P(p_i) + R(p_i)}$. Hypervolume is a quality indicator for the volume of the objective space covered by the non-dominated solutions found by the search algorithm [39]. It indicates the convergence and diversity of the solutions on a Pareto front (the higher the hypervolume, the better the performance), and is calculated as $HV = volume\big(\bigcup_{i=1}^{|S|} v_i\big)$ [26], where $S$ is the set of solutions on the Pareto front and $v_i$ is the hypercube established between each solution $i$ and the reference point.
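For instance, the per-patch accuracy measures can be computed as follows (a straightforward sketch of the formulas above; the example sets are made up):

```python
def precision_recall_f1(recommended, ground_truth):
    """Per-patch precision, recall, and F-measure over reviewer sets."""
    rec, gt = set(recommended), set(ground_truth)
    hits = len(rec & gt)
    p = hits / len(rec) if rec else 0.0
    r = hits / len(gt) if gt else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# e.g., two of three recommendations appear in a four-reviewer ground truth
print(precision_recall_f1({"R1", "R2", "R5"}, {"R1", "R2", "R3", "R4"}))
```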

4.3.2 Statistical Analysis. To compare the performance of our WLRRec with the other benchmarks, we compute the performance gain of WLRRec over a compared benchmark $Y$ as $Gain(pm, Y) = \frac{WLRRec_{pm} - Y_{pm}}{Y_{pm}} \times 100\%$, where $pm$ denotes a performance measure (i.e., precision, recall, F-measure, or hypervolume).

In addition, we use the Wilcoxon Signed Rank test [8] ($\alpha = 0.05$) to determine whether the performance of our WLRRec is statistically better than that of the other approaches. The Wilcoxon Signed Rank test is a pair-wise non-parametric test which does not assume a normal distribution. In addition, we measure the effect size (i.e., the magnitude of the difference) using Vargha and Delaney's $\hat{A}_{XY}$ non-parametric effect size measure [33]. The $\hat{A}_{XY}$ measures the probability that the performance achieved by an approach $X$ is better than the performance achieved by an approach $Y$. Considering a performance measure $pm$ (i.e., precision, recall, F-measure, or hypervolume), the effect size is calculated as $\hat{A}_{XY}(pm) = \frac{\#(X_{pm} > Y_{pm}) + 0.5 \times \#(X_{pm} = Y_{pm})}{P}$, where $\#(X_{pm} > Y_{pm})$ is the number of patches where the performance $pm$ of the approach $X$ (i.e., our approach) is better than that of the approach $Y$ (e.g., random search), and $P$ is the number of newly-submitted patches (i.e., the size of the 10% sub-dataset). The difference is considered trivial for $\hat{A}_{XY} \leq 0.147$, small for $0.147 < \hat{A}_{XY} \leq 0.33$, medium for $0.33 < \hat{A}_{XY} \leq 0.474$, and large for $\hat{A}_{XY} > 0.474$ [12].
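A direct sketch of this effect size calculation over paired per-patch results (the values below are illustrative, not taken from the study):

```python
def vargha_delaney(x, y):
    """A_XY over paired per-patch performance values: the probability
    that approach X outperforms approach Y, with ties counted as half."""
    assert len(x) == len(y)
    wins = sum(1 for a, b in zip(x, y) if a > b)
    ties = sum(1 for a, b in zip(x, y) if a == b)
    return (wins + 0.5 * ties) / len(x)

# e.g., F-measures of X and Y on five newly-submitted patches
print(vargha_delaney([0.4, 0.5, 0.3, 0.6, 0.5],
                     [0.2, 0.5, 0.4, 0.3, 0.4]))  # 0.7
```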

4.4 Experimental Settings
Our approach was implemented in the MOEA Framework. We employed the tournament selection method and set the size of the initial population to 100. The number of generations was set to 100,000. The crossover probability was set to 0.9, the mutation probability to 0.1, and the reproduction probability to 0.2. We set the parameters $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ to 0.25 as default values.

5 RESULTS
(RQ1) Can our WLRRec recommend reviewers who are potentially suitable for a newly-submitted patch?

Results. As a sanity check, the first row of Table 2 presents the performance of our WLRRec when using only the actual reviewers as the ground-truth. We find that our WLRRec approach achieves a precision of 16%-20%, a recall of 30%-37%, an F-measure of 17%-28%, and a hypervolume of 72%-83%. These results suggest that our WLRRec can identify reviewers who actually reviewed the patches in the past. However, as discussed in Section 4.2 (Step 3), there is no guarantee that these actual reviewers are the only group of suitable reviewers. Hence, the goal of our work is not limited to finding the exact group of actual reviewers, but extends to finding potential reviewers who might be able to review the patch in the future. For the remaining results, we evaluate our WLRRec using the combination of actual and potential reviewers as the ground-truth.

Our WLRRec can recommend reviewers who are likely to accept a review request with an F-measure of 0.32-0.43, which is 137%-260% better than the random search approach. Tables 2 and 3 show that our WLRRec can recommend reviewers who are likely to accept a review request with a precision of 31% for Android, 34% for LibreOffice, 36% for Qt, and 32% for OpenStack. Furthermore, Table 2 shows that our WLRRec achieves a recall of 50% for Android, LibreOffice, and Qt, and 53% for OpenStack. Table 3 also shows that our WLRRec achieves a recall 201%-277% better than the random search approach. The hypervolume values of 72%-84% achieved by our WLRRec also indicate that our multi-objective search algorithm is 85%-214% better than the random search approach. The Wilcoxon Signed Rank tests ($p < 0.001$) and the large $\hat{A}_{XY}$ effect size values ($\hat{A}_{XY} > 0.474$) confirm that the performance of our WLRRec is statistically better than that of the random search approach in terms of precision, recall, F-measure, and hypervolume.

Discussion. The results of RQ1 suggest that our WLRRec can recommend reviewers who are potentially suitable for a newly-submitted patch. More specifically, we find that when considering the 4+1 key reviewer metrics, our WLRRec can recommend reviewers who actually reviewed the patches in the past. In addition, when expanding the ground-truth data to include potential reviewers (e.g., those who should be able to review the newly-submitted patch), about half of the potential reviewers can be identified by our WLRRec (cf. the recall values of our approach). These empirical results suggest the effectiveness of considering different factors when recommending reviewers for a newly-submitted patch.

(RQ2) Does our WLRRec benefit from the multi-objective search-based approach?

Results. Table 2 shows that the single-objective approach that maximizes the chance of participating in a review (GA-CPR) achieves a precision of 15%-17%, a recall of 18%-25%, and an F-measure of 17%-21%. On the other hand, the single-objective approach that minimizes the skewness of the reviewing workload distribution (GA-SRW) achieves a precision of 17%-20%, a recall of 21%-27%, and an F-measure of 18%-22%. Note that we did not measure hypervolume for the GA-CPR and GA-SRW approaches, since this performance measure is not applicable to single-objective approaches.

Our WLRRec outperforms the single-objective approaches with a performance gain of 55%-142% for precision and 78%-178% for recall. Table 3 shows that our WLRRec achieves 88%-142% higher precision, 111%-178% higher recall, and 52%-124% higher F-measure than the GA-CPR approach. Similarly, we also find that our WLRRec achieves 55%-101% higher precision, 96%-138% higher recall, and 45%-111% higher F-measure than the GA-SRW approach. The Wilcoxon Signed Rank tests and the magnitude of the differences measured by the $\hat{A}_{XY}$ effect size also confirm that the difference is statistically significant with a large magnitude of difference ($\hat{A}_{XY} > 0.474$).

Discussion. The results of our RQ2 indicate that our WLRRec, which uses a multi-objective approach (i.e., maximizing the chance of participating in a review while minimizing the skewness of the reviewing workload distribution), is statistically better than the single-objective approaches when recommending reviewers who are potentially suitable for a newly-submitted patch. These results suggest that considering multiple objectives at the same time would allow us to find other potential reviewers that might be overlooked in the previous approaches [19, 32]. Furthermore, Table 2 shows that the GA-SRW approach (which considers only the workload distribution) achieves a recall relatively better than the GA-CPR approach, which focuses only on the reviewing experience, historical interaction, and activeness of reviewers (similar to RevRec [19]). These empirical results highlight the benefits of using multi-objective search-based approaches and of considering the workload distribution of reviewers when recommending reviewers for a newly-submitted patch.


Table 2: The precision (P), recall (R), F-measure (F1), and hypervolume (HV) values ([0,1]) of our WLRRec. The first row presents the results when using the actual reviewers as a ground-truth, while the other rows present the results when using the combination of actual and potential reviewers as a ground-truth.

GT                Technique      | Android             | LibreOffice         | QT                  | OpenStack
                                 | P    R    F1   HV   | P    R    F1   HV   | P    R    F1   HV   | P    R    F1   HV
Act.              WLRRec         | 0.20 0.30 0.28 0.72 | 0.16 0.31 0.17 0.74 | 0.19 0.37 0.27 0.83 | 0.20 0.34 0.22 0.82
Act. + Pot. Rev.  WLRRec         | 0.31 0.50 0.35 0.72 | 0.34 0.50 0.43 0.74 | 0.36 0.50 0.38 0.84 | 0.32 0.53 0.32 0.82
Act. + Pot. Rev.  (RQ1) Random   | 0.08 0.16 0.10 0.35 | 0.12 0.17 0.16 0.40 | 0.12 0.15 0.13 0.27 | 0.10 0.14 0.14 0.28
Act. + Pot. Rev.  (RQ2) GA-CPR   | 0.15 0.18 0.17 -    | 0.16 0.23 0.19 -    | 0.15 0.19 0.17 -    | 0.17 0.25 0.21 -
Act. + Pot. Rev.  (RQ2) GA-SRW   | 0.20 0.21 0.18 -    | 0.17 0.24 0.21 -    | 0.20 0.21 0.18 -    | 0.18 0.27 0.22 -
Act. + Pot. Rev.  (RQ3) MOCell   | 0.24 0.26 0.24 0.56 | 0.23 0.30 0.22 0.61 | 0.36 0.34 0.29 0.67 | 0.23 0.36 0.22 0.63
Act. + Pot. Rev.  (RQ3) SPEA2    | 0.27 0.40 0.29 0.52 | 0.26 0.32 0.22 0.58 | 0.36 0.34 0.29 0.57 | 0.21 0.34 0.27 0.57

Table 3: The performance gain of our proposed WLRRec over the other benchmarks.

Comparison      | Android               | LibreOffice           | QT                    | OpenStack
                | P    R    F1   HV     | P    R    F1   HV     | P    R    F1   HV     | P    R    F1   HV
WLRRec–Random   | 288% 207% 260% 108%   | 175% 201% 161% 85%    | 203% 233% 189% 214%   | 217% 277% 137% 199%
WLRRec–GA-CPR   | 107% 178% 104% -      | 113% 117% 124% -      | 142% 163% 123% -      | 88%  111% 52%  -
WLRRec–GA-SRW   | 55%  138% 92%  -      | 101% 108% 103% -      | 82%  138% 111% -      | 78%  96%  45%  -
WLRRec–MOCell   | 31%  95%  42%  28%    | 48%  68%  95%  21%    | 0%   48%  31%  25%    | 42%  45%  45%  31%
WLRRec–SPEA2    | 17%  26%  19%  37%    | 30%  59%  95%  29%    | 0%   48%  31%  47%    | 52%  55%  19%  43%


(RQ3) Does the choice of search algorithms impact the performance of our WLRRec?

Results. Table 2 shows that when using MOCell instead of NSGA-II to search for the optimal solutions, the approach achieves a precision of 23%-36%, a recall of 26%-36%, an F-measure of 22%-24%, and a hypervolume of 56%-67%. Similarly, using SPEA2 to search for the optimal solutions achieves a precision of 21%-26%, a recall of 32%-40%, an F-measure of 22%-29%, and a hypervolume of 52%-58%.

Our WLRRec with NSGA-II achieves 31%-95% and 19%-95% higher F-measure than the MOCell and SPEA2 approaches, respectively. Table 3 shows that our WLRRec, which uses NSGA-II, achieves 31%-48% higher precision, 45%-95% higher recall, 31%-95% higher F-measure, and 21%-31% higher hypervolume than the MOCell approach. Similarly, we also find that our WLRRec achieves 0%-52% higher precision, 26%-59% higher recall, 19%-95% higher F-measure, and 29%-47% higher hypervolume than the SPEA2 approach. The Wilcoxon Signed Rank tests and the magnitude of the differences measured by the $\hat{A}_{XY}$ effect size also show that the difference is statistically significant with a large magnitude of difference ($\hat{A}_{XY} > 0.474$).

Discussion. The results of our RQ3 indicate that the choice of the multi-objective search-based algorithm has an impact on the performance of our approach. More specifically, when using the other multi-objective algorithms (i.e., MOCell and SPEA2), the performance of our approach decreases in terms of recall, F-measure, and hypervolume. In addition, the higher hypervolume values of our WLRRec indicate that our approach finds solutions that satisfy the two objectives better than the other two multi-objective approaches. These empirical results suggest that the NSGA-II algorithm that we leverage is an appropriate multi-objective approach for finding solutions in this problem domain.

6 THREATS TO VALIDITY
Construct Threats to Validity relate to the ground-truth set of reviewers used for evaluating our approach. While the actual reviewers are recorded in the historical data, there is no guarantee that these actual reviewers are the only group of suitable reviewers. Ouni et al. [19] point out that potential reviewers may be assigned to a code change to which they do not contribute, due to several reasons including their current workload, their availability, and their social relationship with the patch author. To mitigate this threat, as suggested by Ouni et al. [19], we considered as ground truth the set of potential reviewers who should be able to review the changed files because they have reviewed some of these files in other patches, which may be more realistic for evaluation than the set of actual reviewers alone. Moreover, the goal of our work is not limited to finding the exact group of actual reviewers, but extends to finding potential reviewers who might be able to review the patch in the future.

External Threats to Validity relate to the generalizability of our results. Although we empirically evaluated our approach on four large open-source systems from different application domains (Android, Qt, OpenStack, and LibreOffice), we do not claim that the same results would be achieved with other projects or other periods of time.

7 CONCLUSION AND FUTURE WORK
In this paper, we develop a multi-objective search-based approach called Workload-aware Reviewer Recommendation (WLRRec) to find reviewers for a newly-submitted patch. Our results suggest that: (1) when considering the five reviewer metrics, our WLRRec can recommend reviewers who are potentially suitable for a newly-submitted patch with a 19%-260% higher F-measure than the five benchmarks; (2) including an objective to minimize the skewness of the review workload distribution would be beneficial to find other potential reviewers that might be overlooked by the other approaches that focus on reviewing experience; and (3) the multi-objective meta-heuristic algorithm, NSGA-II, can be used to search for reviewers who are potentially suitable for a newly-submitted patch. Our empirical results shed light on the potential of using a wider range of information and leveraging a multi-objective meta-heuristic algorithm to find reviewers who are potentially suitable for a newly-submitted patch.

Our future work will incorporate other factors which may affect the quality and productivity of the code review process, leading to the formulation of new objectives and constraints that should be considered in the search for optimal solutions. Currently, our work considers only one newly-submitted code patch at a time. Thus, our future work will extend the approach to consider all code patches that are currently sitting in the review queue at the same time, which may require a new solution representation.

REFERENCES

[1] Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and challenges of modern code review. In Proceedings of ICSE. 712–721.
[2] Vipin Balachandran. 2013. Reducing human effort and improving quality in peer code reviews using automatic static analysis and reviewer recommendation. In Proceedings of ICSE. 931–940.
[3] Gabriele Bavota and Barbara Russo. 2015. Four eyes are better than two: On the impact of code reviews on software quality. In Proceedings of ICSME. 81–90.
[4] Olga Baysal, Oleksii Kononenko, Reid Holmes, and Michael W. Godfrey. 2015. Investigating Technical and Non-Technical Factors Influencing Modern Code Review. Journal of EMSE 21, 3 (2015), 932–959.
[5] Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, and Anand Swaminathan. 2006. Mining email social networks. In Proceedings of MSR. 137–143.
[6] Christian Bird, Nachiappan Nagappan, Brendan Murphy, Harald Gall, and Premkumar Devanbu. 2011. Don't touch my code!: Examining the effects of ownership on software quality. In Proceedings of ESEC/FSE. ACM, 4–14.
[7] Amiangshu Bosu, Jeffrey C. Carver, Christian Bird, Jonathan Orbeck, and Christopher Chockley. 2017. Process Aspects and Social Dynamics of Contemporary Code Review: Insights from Open Source Development and Industrial Practice at Microsoft. TSE 43, 1 (2017), 56–75.
[8] J. Cohen. 1988. Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates, Hillsdale, NJ.
[9] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. TEVC 6, 2 (2002), 182–197.
[10] Kazuki Hamasaki, Raula Gaikovina Kula, Norihiro Yoshida, A. E. Cruz, Kenji Fujiwara, and Hajimu Iida. 2013. Who does what during a code review? Datasets of OSS peer review repositories. In Proceedings of MSR. 49–52.
[11] Ahmed E. Hassan. 2009. Predicting faults using the complexity of code changes. In Proceedings of ICSE. 78–88.
[12] Melinda R. Hess and Jeffrey D. Kromrey. 2004. Robust confidence intervals for effect sizes: A comparative study of Cohen's d and Cliff's delta under non-normality and heterogeneous variances. In Annual Meeting of the American Educational Research Association. 1–30.
[13] Dean C. Karnopp. 1963. Random search techniques for optimization problems. Automatica 1, 2-3 (1963), 111–121.
[14] Oleksii Kononenko, Olga Baysal, Latifa Guerrouj, Yaxin Cao, and Michael W. Godfrey. 2015. Investigating code review quality: Do people and participation matter? In Proceedings of ICSME. 111–120.
[15] Vladimir Kovalenko, Nava Tintarev, Evgeny Pasynkov, Christian Bird, and Alberto Bacchelli. 2018. Does reviewer recommendation help developers? TSE, to appear (2018), 1–23.
[16] Laura MacLeod, Michaela Greiler, Margaret-Anne Storey, Christian Bird, and Jacek Czerwonka. 2018. Code Reviewing in the Trenches. IEEE Software 35 (2018), 34–42.
[17] Shane McIntosh, Yasutaka Kamei, Bram Adams, and Ahmed E. Hassan. 2016. An empirical study of the impact of modern code review practices on software quality. Journal of EMSE 21, 5 (2016), 2146–2189.
[18] Antonio J. Nebro, Juan J. Durillo, Francisco Luna, Bernabé Dorronsoro, and Enrique Alba. 2009. MOCell: A cellular genetic algorithm for multiobjective optimization. International Journal of Intelligent Systems 24, 7 (2009), 726–746.
[19] Ali Ouni, Raula Gaikovina Kula, and Katsuro Inoue. 2016. Search-based peer reviewers recommendation in modern code review. In Proceedings of ICSME. IEEE, 367–377.
[20] Ali Ouni, Raula Gaikovina Kula, Marouane Kessentini, Takashi Ishio, Daniel M. German, and Katsuro Inoue. 2017. Search-based software library recommendation using multi-objective optimization. IST 83 (2017), 55–75.
[21] Mohammad Masudur Rahman, Chanchal K. Roy, and Jason A. Collins. 2016. CORRECT: Code reviewer recommendation in GitHub based on cross-project and technology experience. In Proceedings of ICSE (Companion). 222–231.
[22] Peter C. Rigby and Christian Bird. 2013. Convergent contemporary software peer review practices. In Proceedings of FSE. ACM, 202–212.
[23] Peter C. Rigby and Margaret-Anne Storey. 2011. Understanding broadcast based peer review on open source software projects. In Proceedings of ICSE. 541–550.
[24] Shade Ruangwan, Patanamon Thongtanunam, Akinori Ihara, and Kenichi Matsumoto. 2019. The Impact of Human Factors on the Participation Decision of Reviewers in Modern Code Review. Journal of EMSE (2019).
[25] Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. 2018. Modern Code Review: A Case Study at Google. In Proceedings of ICSE (Companion). 181–190.
[26] Raphael Saraiva, Allysson Allex Araujo, Altino Dantas, Italo Yeltsin, and Jerffeson Souza. 2017. Incorporating decision maker's preferences in a multi-objective approach for the software release planning. Journal of the Brazilian Computer Society 23, 1 (2017), 11.
[27] Claude Elwood Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal 27, 3 (1948), 379–423.
[28] Patanamon Thongtanunam and Ahmed E. Hassan. 2020. Review Dynamics and Their Impact on Software Quality. TSE (2020), to appear. https://doi.org/10.1109/TSE.2020.2964660
[29] Patanamon Thongtanunam, Shane McIntosh, Ahmed E. Hassan, and Hajimu Iida. 2015. Investigating Code Review Practices in Defective Files: An Empirical Study of the Qt System. In Proceedings of MSR. 168–179.
[30] Patanamon Thongtanunam, Shane McIntosh, Ahmed E. Hassan, and Hajimu Iida. 2016. Revisiting Code Ownership and its Relationship with Software Quality in the Scope of Modern Code Review. In Proceedings of ICSE. 1039–1050.
[31] Patanamon Thongtanunam, Shane McIntosh, Ahmed E. Hassan, and Hajimu Iida. 2017. Review participation in modern code review. Journal of EMSE 22, 2 (2017), 768–817.
[32] Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula, Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. 2015. Who should review my code? A file location-based code-reviewer recommendation approach for modern code review. In Proceedings of SANER. 141–150.
[33] András Vargha and Harold D. Delaney. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25, 2 (2000), 101–132.
[34] Xin Xia, David Lo, Xinyu Wang, and Xiaohu Yang. 2015. Who should review this change? Putting text and file location analyses together for more accurate recommendations. In Proceedings of ICSME. 261–270.
[35] Xin Yang, Raula Gaikovina Kula, Norihiro Yoshida, and Hajimu Iida. 2016. Mining the modern code review repositories: A dataset of people, process and product. In Proceedings of MSR. ACM, 460–463.
[36] Yue Yu, Huaimin Wang, Gang Yin, and Charles X. Ling. 2014. Reviewer recommender of pull-requests in GitHub. In Proceedings of ICSME. IEEE, 609–612.
[37] Motahareh Bahrami Zanjani, Huzefa Kagdi, and Christian Bird. 2016. Automatically recommending peer reviewers in modern code review. TSE 42, 6 (2016), 530–543.
[38] Eckart Zitzler, Marco Laumanns, and Lothar Thiele. 2001. SPEA2: Improving the strength Pareto evolutionary algorithm. TIK-Report 103 (2001).
[39] Eckart Zitzler and Lothar Thiele. 1999. Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto approach. TEVC 3, 4 (1999), 257–271.
