Detection and Elimination of Systematic Labeling Bias in Code
Reviewer Recommendation Systems
K. Ayberk Tecimer
Technical University of Munich
Munich, Germany
ayberk.tecimer@tum.de
Eray Tüzün
Bilkent University
Ankara, Turkey
eraytuzun@cs.bilkent.edu.tr
Hamdi Dibeklioğlu
Bilkent University
Ankara, Turkey
dibeklioglu@cs.bilkent.edu.tr
Hakan Erdogmus
Carnegie Mellon University
Pittsburgh, USA
hakane@andrew.cmu.edu
ABSTRACT
Reviewer selection in modern code review is crucial for effective
code reviews. Several techniques exist for recommending reviewers
appropriate for a given pull request (PR). Most code reviewer rec-
ommendation techniques in the literature build and evaluate their
models based on datasets collected from real projects using open-
source or industrial practices. The techniques invariably presume
that these datasets reliably represent the “ground truth.”
In the context of a classification problem, ground truth refers to
the objectively correct labels of a class used to build models from
a dataset or evaluate a model’s performance. In a project dataset
used to build a code reviewer recommendation system, the recom-
mended code reviewer picked for a PR is usually assumed to be the
best code reviewer for that PR. However, in practice, the recom-
mended code reviewer may not be the best possible code reviewer,
or even a qualified one. Recent code reviewer recommendation
studies suggest that the datasets used tend to suffer from system-
atic labeling bias, making the ground truth unreliable. Therefore,
models and recommendation systems built on such datasets may
perform poorly in real practice.
In this study, we introduce a novel approach to automatically
detect and eliminate systematic labeling bias in code reviewer rec-
ommendation systems. The bias that we remove results from select-
ing reviewers that do not ensure a permanently successful fix for a bug-related PR. To demonstrate the effectiveness of our approach, we evaluated it on two open-source project datasets (HIVE and QT Creator) and with five code reviewer recommendation techniques (Profile-Based, RSTrace, Naive Bayes, k-NN, and Decision Tree).
Our debiasing approach appears promising since it improved the
Mean Reciprocal Rank (MRR) of the evaluated techniques up to 26%
in the datasets used.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
EASE 2021, June 21–23, 2021, Trondheim, Norway
©2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-9053-8/21/06. . . $15.00
https://doi.org/10.1145/3463274.3463336
CCS CONCEPTS
• Software and its engineering → Software development process management; Collaboration in software development.
KEYWORDS
modern code review, ground truth, labeling bias elimination, sys-
tematic labeling bias, data cleaning, code review recommendation
ACM Reference Format:
K. Ayberk Tecimer, Eray Tüzün, Hamdi Dibeklioğlu, and Hakan Erdogmus.
2021. Detection and Elimination of Systematic Labeling Bias in Code Re-
viewer Recommendation Systems . In Evaluation and Assessment in Software
Engineering (EASE 2021), June 21–23, 2021, Trondheim, Norway. ACM, New
York, NY, USA, 10 pages. https://doi.org/10.1145/3463274.3463336
1 INTRODUCTION
The code review process is an important step in the software devel-
opment lifecycle. Effective code reviews increase internal quality and reduce defect rates [25]. To increase the effectiveness of code
reviews, reviewers should be selected carefully. Several Code Re-
viewer Recommendation (CRR) techniques exist in the literature
[4, 15, 17, 21, 29, 31, 33–35]. These CRR techniques use different
strategies, but they invariably either build or evaluate their models
based on datasets gathered from industrial or open-source projects.
Hence they rely on the datasets accurately capturing the “ground
truth” regarding past reviewer selections. The models assume that a
code reviewer assigned to a review task, often captured by a pull re-
quest (PR), in a dataset is the best possible reviewer (that is, the assignment was made by carefully evaluating candidate reviewers and selecting the one best qualified for that PR). However, in practice, the selected code reviewer may not be the most qualified, or even sufficiently qualified, to review the submitted PR [9]. In several scenarios, reviewer assignments tend to be based on non-technical factors, which may invalidate the central assumption of the models built [9].
For instance, according to a study on code reviewer practices
at Microsoft [16], reviewers are assigned to PRs according to their availability and social relationship with the person who makes the reviewer assignments. Similarly, according to Dogan et al. [9],
availability is an important factor for reviewer assignments and
is frequently substituted for technical or competency factors. In
other words, recommendation labels in datasets may frequently
be wrong. Therefore, datasets originating from real practice can negatively affect the accuracy and reliability of the CRR techniques
that rely on them.
In machine learning, the kind of labeling error that exists in CRR
datasets is generally referred to as systematic labeling bias. Super-
vised learning techniques require labels in the training samples.
These labels indicate real/actual classes of interest in past data so
that models can be built to predict the classes of new data. For
instance, in order to distinguish between apples and oranges, an ac-
tual label (i.e., apple or orange) for each training sample is required.
Ground truth refers to these labels indicating the actual class of
the training samples. In more complex tasks of pattern recognition,
such as the classification of code review tasks according to who
should review them, 100% correct labels may not be obtained in
the training samples since there are several factors, including sub-
jective ones, to be considered. Although the labels are not perfect,
they are still considered as the ground truth. While the amount of
problematic labels in the ground truth may be relatively small or the
inconsistency in the class labels may be negligible, in some cases
the ground truth may include systematic problems that prevent the models from converging or learning generalizable patterns, or, as in the CRR case, from being as effective as they could be in practice. Such issues are generally due to basic or naïve assumptions in the labeling process [27] or intrinsic properties of the observed data [7]: this is what systematic labeling bias generally refers to.
With the goal of preventing these kinds of labeling problems in the
ground truth of CRR datasets, we formulate two research questions:
RQ1: How can we eliminate systematic labeling bias in CRR datasets?
RQ2: How does systematic labeling bias elimination affect the
performance of CRR techniques?
For RQ1, we explore possible solutions and introduce a new
approach to detect and eliminate potentially “incorrect” reviewer
labels in CRR datasets. For RQ2, we measure the eects of our
proposed approach by comparing before and after accuracy rates of
five CRR techniques: Naïve Bayes, k-NN, Decision Tree, RSTrace, and Profile-based.
Section 2 provides the background information on the modern
code review practice, CRR approaches, ground truth problems in
software engineering, and cognitive bias in software engineering.
Section 3 defines a success criterion for code reviews used to identify incorrect reviewer assignments. Furthermore, Section 3 focuses on code reviews associated with PRs and bugs and introduces our debiasing (data cleaning) approach. Section 4 describes our experiments, including the datasets, preprocessing steps, and experimental setup, and presents the results. Section 5 answers the research questions and discusses the limitations of our approach. Finally, Section 6 summarizes the contribution and discusses future work.
2 BACKGROUND AND RELATED WORK
Code review is a central quality practice and an important part of
the software development lifecycle. Traditional, Fagan-style code
review [10] is a formal, manual, synchronous, and well-documented process performed by a carefully assigned group on selected parts of the codebase. In contrast, modern code review is informal, tool-based, asynchronous, and focuses on reviewing only the latest changes [2, 25]. In the last two decades, modern code review has become dominant in both commercial projects [6, 25] and open-source ones [3].
In the following, we provide a summary of code review recom-
mendation techniques, ground truth problems and cognitive bias
examples in software engineering.
2.1 Code Review Recommendation Techniques
The CRR techniques discussed in the literature fall mainly under two categories: optimization-based approaches and learning-based
approaches. We mention representative works below to illustrate
the diversity of the approaches. For a more detailed overview of
CRR techniques, please refer to Cetin et al. [36].
Optimization-based approaches. Balachandran [4] proposed a heuristic that analyzes the change history to find suitable reviewers. They reach 60%-92% recommendation accuracy, which is better than a comparable code reviewer recommendation approach based on file change history. Lee et al. [17] proposed a graph-based technique to find reviewers in open-source projects. Their method achieves an average recall of 0.84 for Top-5 predictions and a recall of 0.94 for Top-10 predictions. A technique based on analyzing file-path similarity was developed by Thongtanunam et al. [31]; later, Xia et al. [33] extended this technique with text mining to leverage additional information in recommendations. Ouni et al. [21] proposed a search-based genetic algorithm to identify the most appropriate peer reviewers for code changes. The authors evaluate their approach on three different open-source projects (QT, OpenStack, and Android). Their experiments show that their genetic algorithm accurately recommends code reviewers with up to 59% precision and 74% recall. Zanjani et al. [35] presented a technique focusing on the previous review quality of candidate reviewers. They argue that providing specific information (e.g., quantification of review comments and their recency) significantly improves prior code reviewer recommendation approaches. Sulun et al. [30] proposed a graph-based technique using traceability relations between PRs, source code files, and bugs to recommend code reviewers for a given pull request.
Learning-based approaches. Jiang et al. [15] proposed a technique that builds a model with Support Vector Machines (SVM) to make reviewer recommendations in OSS projects. The authors evaluate their technique on 18,651 pull requests of five popular projects in GitHub. They indicate that their technique achieves accuracy from 72.9% to 93.5% for Top-3 recommendation. Xia et al. [34] presented a hybrid approach in which they combined latent-factor models and neighborhood methods. Their results demonstrate that the proposed approach performs better than the compared methods for all Top-k recommendations.
All the CRR techniques described above evaluate their models
based on datasets collected from real projects, where a reviewer
assigned to a code review task is assumed to be the right, or best,
reviewer for the job. However, in practice, this assumption is often
violated, making the ground truth suspect. Some authors [18, 21, 34, 35] acknowledge this problem explicitly as a limitation. For instance, Ouni et al. [21] discuss that reviewers who are assigned to a PR
may not do the job well for various reasons (such as workload or
availability), or the review may end up being poor quality because
the assignment was mainly determined by social factors rather
than competence. Lipcak and Rossi [18] state that evaluating CRR
techniques with top-k-style criteria may not be accurate as it is not
guaranteed that the actual reviewers in the test data were the best
candidates (or even sufficiently qualified) for the tasks for which
they were selected.
2.2 Ground Truth Problems in Software
Engineering
Outside CRR [9], ground truth problems exist in data-analytics solutions that support other common prediction and recommendation
tasks in software engineering.
Bird et al. [5] studied bug-fix datasets and found strong evidence of systematic bias due to mislabeling of bug fixes in version histories.
The performance of a defect prediction model that they tested was
adversely aected when the model was built from biased data.
Nguyen et al. [20] examined tagging and linkage biases in IBM Jazz software. Tagging bias results from treating all logged issues as bugs (some of which can represent other coding tasks, decisions, or enhancements). A linkage bias occurs when there is no traceability connection between a bug-fix PR and the corresponding bug report.
The authors found both linkage and tagging biases even in datasets
thought to be near accurate.
Herzig et al. [14] analyzed tangled changes in defect prediction. A
tangled change is caused by bundling multiple unrelated changes in
a single commit. Tangled changes introduce noise to the data, where
as much as 17% of all source files could be incorrectly associated
with bug reports due to such tangling. The authors state this can
negatively impact defect prediction models.
Ahluwalia et al. [1] investigated biases in datasets that were used to build defect prediction models. The authors stated that the bugs were usually discovered after several releases, and therefore may have still been dormant in the snapshot taken to build the models. According to the study, dormant bugs exist in up to 20% of existing releases in a dataset, distorting the ground truth by causing defective code to be mislabeled as defect-free. The authors analyzed 282 releases from six open-source projects and demonstrated the existence of the ground truth problem in bug datasets; however, they did not propose a solution for debiasing the data. Chen et al. [8] analyzed dormant bugs in the Apache codebase. They observed a higher dormant bug rate of 33% than that reported by Ahluwalia et al. Both studies demonstrate that the existence of dormant bugs could potentially affect the reliability of ground truth.
2.3 Cognitive Bias Examples in Software
Engineering
A cognitive bias is a type of systematic error in decision making
that causes suboptimal outcomes based on established beliefs and
misguided intuition. Our work on improving the performance of
CRR techniques addresses systematic labeling errors that in great
part stem from such biases.
Cognitive biases are pervasive in software engineering. Mohanani et al. [19] provide many examples of cognitive biases in
software engineering in their literature review and advocate their
mitigation as central to improving the quality of decision making.
Ralph [22] attributes the persistence of high software project failure rates despite the many advances in software technologies and processes to cognitive biases. Stacy and Macmillan [28] state that
cognitive biases have an adverse effect on software developers' thought processes. They advise preferring empirical investigation to intuition and seeking disconfirmatory information to reduce cognitive biases. Smith and Bahill [26] similarly argue that cognitive biases disturb rational decision making in systems engineering through a mechanism called attribute substitution: the substitution of a convenience factor for an objectively important factor in a decision. They suggest that raising the awareness of this mechanism among engineers could help alleviate its adverse effects in engineering decisions. Attribute substitution is particularly relevant to
our work since it is often a primary source of failed labels in CRR
datasets caused by past reviewer assignments based on convenience
and social factors rather than factors that are directly related to the
suitability of a reviewer for the specific review task.
3 IMPROVING CRR ACCURACY IN PAST
DATA
In the remainder of the paper, we assume that a code review task is
encapsulated in a pull request (PR). The purpose of code review is to
merge the PR successfully into the codebase after having addressed
all of the reviewers' comments, and close the PR, hopefully indefinitely. The task of the code reviewer is to ensure this successful resolution. According to Google best practices for code reviews [13], “the best reviewer is the person who will be able to give you the most thorough and correct review for the piece of code you are writing.” This definition of an appropriate, or successful, code reviewer aligns with that of a successful PR as one that avoids downstream rework related to the changes contained in the PR after the PR has been closed. However, this success criterion, as stated, is subjective and not easily measurable. This poses a problem since our goal is
to eliminate systematic labeling bias in CRR datasets: we need a
way to identify whether a label, in this case the assigned reviewer,
is right or wrong. The label will be wrong if the PR is unsuccess-
ful, in which case we need to be able to remove the bias either by
correcting the label (assigned reviewer), removing the data point
corresponding to that PR, or generating new data with checks that
increase label accuracy.
3.1 Manual Correction
If the success measure is qualitative, we can only use manual methods to remove the bias. For example, additional expert checks may be put
in place to tag low-quality reviews when a PR is closed. Or multiple
independent reviewers can be assigned to a single PR, and an expert
again can examine and tag low-quality reviews. The tagged reviews
can then be excluded from the generated reviewer assignment data.
This can also be done post hoc: the reviewer assignment data can
be cleaned by an expert after the fact using similar quality checks.
However, these approaches would not scale up well because they
can be too costly or too impractical, or both. Besides, manual expert
checks can still be error-prone, and we must still assume the expert
performing the checks and verifying the labels is unbiased: if the
expert is biased, we create a circular situation in which we attempt
to address one kind of bias by a method that introduces another
kind of bias.
3.2 Characterization of a Successful Code
Review
When PRs are associated with coding tasks or defects in a bug
tracking system, an objective success measure can be defined: if
the associated bug is not reopened again, the PR is deemed success-
ful. The PR process often involves multiple rounds of reviewers
commenting on the scope of the PR, possibly resulting in several
commits to address any outstanding concerns, and concludes with
a merge into the master branch to close the PR. The code review
task itself associated with the PR is successful if the PR is successful.
Similarly, the reviewers performing the code review for the PR are
themselves successful if the PR they helped close is successful.
While this success measure is objective, it still has a few caveats. First, it cannot be immediately determined whether a PR is successful: we can only decide the success status in retrospect, after sufficient time has elapsed following the closing of the PR. The second is that it relies on the assumption that reopening the underlying bug after the PR has been closed can only be due to the original PR not having been successful, and not due to, for example, something outside the scope of the PR having changed and causing a ripple effect (e.g., a future commit refactoring a dependency). The third caveat is that it assumes perfect bug identification: a new bug is created only when the reason for the bug cannot be reliably traced to an existing bug that can be reopened (no duplicate entries in
the bug tracking system). We will accept these caveats, and see
whether, even in their presence, the approach that we propose can
achieve reasonable improvements in the performance of existing
CRR techniques with this success measure.
3.3 The Debiasing Approach
The objective success measure defined in the previous section
provides us with an easy way to identify incorrect labels in CRR
datasets and remove them. The success measure requires linkages
between PRs and bugs logged in a bug tracking system. Almost all
datasets that have the required linkage focus on bug-related PRs.
The CRR techniques on which we have tested our approach do the same. So we limit ourselves to bugs and PRs associated with
bugs.
Fig. 1 shows the typical flow of a bug as it is tracked in a bug tracking system (e.g., JIRA). The circles (Reported, Need More Info, Open, In Progress, Closed) represent the different states. The PR process interjects this flow at the transition from the In Progress to the Closed state. After a developer fixes a bug, before the bug is closed, the developer creates a PR and one or more team members are invited to review the code. A PR conversation takes place discussing the fixes, which may result in a number of additional commits if the fixes were deemed unsuccessful. When the reviewers finally approve the fix, the developer merges the PR to the master and the bug's status is changed to Closed. The merged PR is eventually deployed. However, further testing, field testing by end users, or ongoing development work may at one point reveal that the bug was not fixed as intended, in which case the bug may be reopened by changing its status back from Closed to Open. When this happens, barring other rare reasons that may cause a previously identified bug to reappear exactly in the same context, we will discover in retrospect that the original PR was in fact not successful, and the reviewer assignment associated with this PR becomes a candidate for removal.
Therefore our debiasing approach is simple and based on the
removal of unsuccessful PRs. Given a CRR dataset consisting of a
set of PRs, reviewer assignments for each PR, a bug associated with
each PR, and the status history of each bug in the bug tracking system, we check for each PR whether the associated bug was reopened after
it was closed following the merge. If it was, we consider this PR
to be unsuccessful post hoc and remove the associated data point
from the dataset along with the reviewer assignment.
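The removal step itself is small. The following Python sketch illustrates it under simplifying assumptions: the record types, field names (reviewer, bug_id, merged_at), and status strings are hypothetical stand-ins for whatever schema a concrete CRR dataset uses, not the exact format of our data.

from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

@dataclass
class StatusTransition:
    from_status: str
    to_status: str
    timestamp: datetime

@dataclass
class PullRequest:
    pr_id: str
    reviewer: str                 # assigned reviewer (the label)
    bug_id: Optional[str]         # linked Jira issue key, None if unlinked
    merged_at: datetime

def bug_reopened_after(transitions: List[StatusTransition], merged_at: datetime) -> bool:
    # True if the linked bug moved from Closed back to Open after the PR was merged.
    return any(t.from_status == "Closed" and t.to_status == "Open" and t.timestamp > merged_at
               for t in transitions)

def debias(prs: List[PullRequest], bug_history: Dict[str, List[StatusTransition]]) -> List[PullRequest]:
    # Keep only PRs whose linked bug was never reopened after the merge.
    return [pr for pr in prs
            if not bug_reopened_after(bug_history.get(pr.bug_id, []), pr.merged_at)]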
Figure 1: The lifecycle of a bug and where it interacts with
the PR process
4 EXPERIMENTS AND RESULTS
4.1 Dataset Description and Preprocessing
Having defined the debiasing method, we evaluate it on two different datasets. These datasets belong to projects from two sources: Qt1, a company that develops cross-platform software, and Apache2, a widely used open-source cross-platform software foundation. The projects are QT Creator and HIVE, respectively. They were chosen because they are both open-source, have full PR and code review history, and their PR information is linked to the bug tracking information, as required.
For QT Creator, we extracted the PR history3 and bug history4 until December 2019. For HIVE, we used the version provided by SEOSS 33 [23], a dataset repository that includes data retrieved from several open-source software projects. In the data gathering stage, we used the Perceval tool from GrimoireLab5, which allows fetching datasets from both GitHub and Jira. Most of the PRs in the two datasets are associated with a Jira bug ID (96.34% in HIVE and 73.18% in QT Creator). This allowed us to trace PRs to Jira bugs.
1https://doc.qt.io/qt-5/index.html
2https://www.apache.org
3https://code.qt.io/cgit/playground/qt-creator/
4https://bugreports.qt.io/projects/QTCREATORBUG/issues/
5https://chaoss.github.io/grimoirelab/

Table 1: PR and Reviewer Statistics of the Datasets

Dataset      # Total PRs   # Unsuccessful PRs   U-to-S Ratio   # Reviewers
QT Creator   5927          406                  7%             152
HIVE         3621          196                  5%             108

Before applying the debiasing method to these datasets, we performed three preprocessing steps. As a first step, we removed the PRs that do not have any association with a Jira bug. In both
datasets, some seemingly distinct reviewer labels correspond to the
same reviewer. For instance, two different reviewer labels “jkobus” and “Jarek Kobus” may refer to the same reviewer. In the second step of the preprocessing, duplicate reviewer labels were found and merged automatically if different names correspond to the same email address. In the third step, we checked whether any PR's
associated bug in the Jira database was reopened after the PR was
merged. If so, we tagged these PRs as unsuccessful.
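As an illustration of the second step, the following sketch merges reviewer aliases by email using pandas; the column names (reviewer_name, reviewer_email) and the choice of the first name seen as the canonical label are assumptions made for the example only.

import pandas as pd

def merge_reviewer_aliases(df: pd.DataFrame) -> pd.DataFrame:
    # Collapse distinct reviewer names that share an email address into one label.
    canonical = df.groupby("reviewer_email")["reviewer_name"].first()
    out = df.copy()
    out["reviewer"] = out["reviewer_email"].map(canonical)
    return out

# Example: "jkobus" and "Jarek Kobus" share an email and collapse to one label.
prs = pd.DataFrame({
    "pr_id": [1, 2],
    "reviewer_name": ["jkobus", "Jarek Kobus"],
    "reviewer_email": ["jkobus@example.com", "jkobus@example.com"],
})
print(merge_reviewer_aliases(prs)["reviewer"].tolist())  # ['jkobus', 'jkobus']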
Table 1 shows the total number of PRs, unsuccessful PRs, and the ratio of unsuccessful PRs to successful PRs. The total number of PRs in QT Creator after preprocessing was 5,927. 406 of these pull requests were unsuccessful, corresponding to a failure/success ratio of 7%. The HIVE dataset had 3,621 PRs after preprocessing with 196 unsuccessful ones, yielding a failure/success ratio of 5%. There
were 152 distinct code reviewers in the QT Creator dataset and 108
distinct code reviewers in the HIVE dataset.
4.2 Evaluation Setup
To evaluate the reliability and usefulness of the debiasing approach,
we selected five different CRR techniques from the literature, namely, Profile-based [11], RSTrace [29], Naïve Bayes, k-NN (5-NN), and Decision Tree. Initially, we wanted to apply the approach to all CRR techniques discussed under the section on background and related work. However, only a few of the CRR techniques [11, 29] provide source code or pseudocode. Therefore, we selected those that we could actually run or implement. We had to implement the Profile-based technique ourselves since the source code was not available. For RSTrace, we used the available implementation shared in the original paper. For Naïve Bayes, 5-NN, and Decision Tree, we used the implementations provided by the Scikit-learn library.6
For the three machine learning techniques (Naïve Bayes, k-NN,
and Decision Tree), we used file paths in PRs as features. To convert these file paths to numeric values (for use in classification), we applied the vectorizers (CountVectorizer and TfidfVectorizer) from the Scikit-learn library. Hyperparameters of the learning-based models were optimized within a set of considered values. The considered values are given in Table 2.
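The following simplified sketch shows how such a setup can be wired together with Scikit-learn: file paths are space-joined and vectorized, and a grid search picks hyperparameters from a small candidate set. The toy data, the parameter grid, and the use of TimeSeriesSplit to keep training folds chronologically before test folds are illustrative assumptions, not the exact configuration of our experiments.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy samples: space-joined file paths touched by a PR, labeled with the assigned reviewer.
X = [
    "src/app/main.cpp src/app/util.cpp",
    "src/app/main.cpp",
    "docs/readme.md",
    "src/app/util.cpp docs/readme.md",
    "src/app/main.cpp src/app/render.cpp",
    "docs/install.md",
]
y = ["alice", "alice", "bob", "alice", "alice", "bob"]

pipeline = Pipeline([
    ("vec", CountVectorizer(token_pattern=r"\S+")),   # each file path is one token
    ("clf", MultinomialNB()),
])

# Candidate values in the spirit of Table 2; whole vectorizers can be swapped by step name.
param_grid = {
    "vec": [CountVectorizer(token_pattern=r"\S+"), TfidfVectorizer(token_pattern=r"\S+")],
    "clf__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=TimeSeriesSplit(n_splits=3))
search.fit(X, y)
print(search.best_params_)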
4.3 Performance Measures
The accuracy of the selected CRR techniques was assessed by two widely used measures: Top-k accuracy (namely, Top-1, Top-3, and Top-5) and Mean Reciprocal Rank (MRR). Top-k accuracy computes the ratio of test cases that have the correct label within the top k predictions to all cases. MRR is the average, over all test cases, of the inverse of the rank of the first correct answer.
6https://scikit-learn.org/
Table 2: List of the Considered Hyperparameters for the Learning-based Models

Model           Hyperparameter              Considered Values
Naïve-Bayes     Distribution type           {multinomial, Gaussian, Bernoulli}
5-NN            Distance type               {Manhattan, Euclidean}
                NN search algo.             {ball tree, KD tree, brute-force search}
                Weight function             {uniform, inverse distance}
Decision Tree   Split strategy              {Gini impurity, entropy}
                Measure of split quality    {best, random}
                Maximum depth               {1, 2, ..., 10}
We computed the relative improvements in the above measures after debiasing to demonstrate the effectiveness of our approach as follows:

$\frac{S_{\text{after}} - S_{\text{before}}}{S_{\text{before}}}$, (1)

where $S_{\text{before}}$ is the performance of a CRR technique before we applied debiasing to its training dataset and $S_{\text{after}}$ is the performance after we applied debiasing to the same dataset.
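Both measures and the relative improvement can be computed directly from ranked predictions, as in the following sketch; the example rankings and labels are illustrative.

from typing import List, Sequence

def top_k_accuracy(ranked: List[Sequence[str]], truth: List[str], k: int) -> float:
    # Fraction of test cases whose actual reviewer appears in the top-k ranking.
    return sum(1 for r, t in zip(ranked, truth) if t in r[:k]) / len(truth)

def mean_reciprocal_rank(ranked: List[Sequence[str]], truth: List[str]) -> float:
    # Average of 1/rank of the actual reviewer (0 when the reviewer is not ranked).
    total = sum(1.0 / (list(r).index(t) + 1) if t in r else 0.0 for r, t in zip(ranked, truth))
    return total / len(truth)

def relative_improvement(s_before: float, s_after: float) -> float:
    # (S_after - S_before) / S_before, as in Eq. (1).
    return (s_after - s_before) / s_before

rankings = [["alice", "bob", "carol"], ["bob", "carol", "alice"]]
actual = ["bob", "alice"]
print(top_k_accuracy(rankings, actual, k=1))   # 0.0
print(mean_reciprocal_rank(rankings, actual))  # (1/2 + 1/3) / 2 = 0.4166...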
4.4 Balancing
The selected CRR techniques were trained and tested with the HIVE and QT Creator datasets. Both datasets had quite low unsuccessful PR rates (5% to 7%). However, according to the literature [32], unsuccessful PRs are significantly more pervasive than they appear to be in our datasets. The actual reopened bug ratios in some popular open-source projects are much higher: for example, in Eclipse it was found to be 16.1% and in OpenOffice as high as 26.31%. A lower than actual ratio commonly stems from the practice of opening a new bug for convenience because searching for the original bug, identifying it, and reopening it may require effort. Developers also may not remember or know about the source bug due to turnover or time lapse, and inadvertently mistake a recurring one for a brand new bug.
The problem with a low unsuccessful PR rate in the dataset due
to missed reopened bugs is that debiasing through the removal of
the corresponding data points will inevitably yield only marginal
improvements in performance. We are interested in assessing how
much improvement can be achieved by debiasing if the training
dataset’s ratio were closer to the actual ratios observed in real
practice. Therefore, we under-sampled the data [12] by randomly removing successful PRs until the unsuccessful ratio is in the same ballpark range as the rates reported in the literature: starting from the first successful PR, we randomly removed three out of every four successful PRs. This effectively quadruples the unsuccessful PR ratios, bringing them closer to the more commonly observed values, and making the dataset more balanced for training.
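A minimal sketch of this under-sampling step is given below; the is_unsuccessful predicate, the 0.25 keep fraction, and the fixed seed are assumptions made for illustration.

import random

def balance(prs, is_unsuccessful, keep_fraction=0.25, seed=42):
    # Keep all unsuccessful PRs and roughly one out of every four successful ones,
    # which approximately quadruples the unsuccessful ratio.
    rng = random.Random(seed)
    return [pr for pr in prs if is_unsuccessful(pr) or rng.random() < keep_fraction]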
4.5 The Evaluation Process
Fig. 2 illustrates the evaluation process. The box Original (Og)
represents the preprocessed dataset containing successful (S) and
unsuccessful (U) PRs. Debiasing removes unsuccessful PRs, result-
ing in a Debiased (Db) dataset. In the first step, a CRR technique
is trained with both Og and Db datasets, resulting in two models.
The performance of the models is compared to assess the effect of
debiasing. We expect the performance of the model trained with
the debiased dataset to be better since debiasing attempts to remove
samples with bad labels.
Our datasets contain too few datapoints that exhibit labeling bias,
so we expect the improvement to be marginal. In the second step, we want to see how much the effects are amplified when systematic
labeling bias is as pervasive as it is reported in the literature. We
balance the PRs to increase the ratio of the unsuccessful PRs to
realistic levels. This is the Balanced (Ba) dataset at the top left.
The Ba dataset is then debiased by the same procedure as before,
resulting in the Balanced-Debiased (DbBa) dataset. The results are
again compared for the evaluated CRR technique. The improvement
in performance should now be more pronounced.
Finally, we perform an extra validation step to check that any observed relative improvement in performance is not due to a random
reduction in the sample size, but due to the targeted removal of only
badly labeled (unsuccessful) PRs: for this to be true, random removal
of datapoints instead of targeted removal should not improve the
performance. We form several datasets by randomly removing data
from the successful subset only (favoring this class for the reduc-
tion) and from both successful and unsuccessful PRs (not favoring
any class). These reductions give rise to the Reduced-Biased (RBi)
and Reduced-Unbiased (RUb) datasets shown at the bottom corners
of Fig. 2. We compare the performance of the given CRR technique
with these datasets to the performance with the dataset DbBa to
show that any improvement in performance with debiasing is not
merely accidental, but must be because of having deleted the badly
labeled samples.
4.6 Testing Strategies for the CRR Techniques
While testing both categories of techniques, we preserved the
chronological order of the data since we should not attempt to
predict past instances using future instances. Therefore, we made
sure the training samples always preceded the tested samples.
While evaluating the debiasing method on the optimization-
based CRR techniques (Profile-based and RSTrace), we incremen-
tally predicted each sample using all samples that preceded it, ex-
panding the models used for prediction one sample at a time.
However, this fine-grained, one-by-one strategy cannot be applied in learning-based approaches: the standard testing methodology is based on using multiple folds. Therefore, for the learning-based techniques (Naïve Bayes, k-NN [5-NN], and Decision Tree), we performed sliding window testing. We divided our dataset into 10 folds, chose a test fold, and trained our models using all the folds chronologically located before the test fold. In each iteration, we slid the window to the next fold and repeated the same procedure.
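The following sketch outlines this fold-based strategy, assuming the samples are already in chronological order; fit_predict is a stand-in for training one of the learning-based models on the earlier folds and predicting the test fold.

import numpy as np

def sliding_window_eval(X, y, fit_predict, n_folds=10):
    # Evaluate each fold using only the folds chronologically before it.
    # Assumes len(X) >= n_folds; fold 0 is never tested since nothing precedes it.
    bounds = np.linspace(0, len(X), n_folds + 1, dtype=int)
    accuracies = []
    for i in range(1, n_folds):
        train_end, test_end = bounds[i], bounds[i + 1]
        preds = fit_predict(X[:train_end], y[:train_end], X[train_end:test_end])
        accuracies.append(float(np.mean(np.asarray(preds) == np.asarray(y[train_end:test_end]))))
    return accuracies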
4.7 Results: Effect of Debiasing
In order to investigate the effect of removing unsuccessful PRs from the datasets, we evaluated the accuracy of the five CRR techniques on the QT Creator and HIVE datasets before and after the debiasing.
Notice that after debiasing, the datasets contain only successful
PRs. The CRR techniques were trained with the same parameters
to make the comparisons valid.
Table 3: Performance of CRR Techniques Before and After Debiasing on the Balanced Versions of HIVE and QT Creator

Top-3 Accuracy
                 HIVE                                    QT Creator
Technique        Before Debiasing   After Debiasing      Before Debiasing   After Debiasing
Naïve Bayes      23.52%             27.72%               26.23%             32.60%
5-NN             34.20%             38.99%               42.54%             48.28%
Decision Tree    37.51%             40.74%               40.94%             44.75%
Profile based    39.18%             39.44%               39.74%             40.03%
RSTrace          40.78%             41.12%               42.36%             42.67%

Top-5 Accuracy
                 HIVE                                    QT Creator
Technique        Before Debiasing   After Debiasing      Before Debiasing   After Debiasing
Naïve Bayes      27.75%             29.31%               35.59%             39.53%
5-NN             47.12%             50.20%               52.71%             55.77%
Decision Tree    46.69%             50.71%               53.77%             55.37%
Profile based    50.30%             51.13%               48.80%             49.44%
RSTrace          48.20%             48.34%               50.40%             50.57%

Table 3 summarizes the Top-3 and Top-5 correct classification rates before and after the debiasing using both datasets. The three learning-based CRR techniques perform better with debiasing than the two optimization-based ones according to the Top-3 measure.
For the Naïve Bayes technique, we observe the highest relative
improvement, 17% and 23%, on the HIVE and QT Creator datasets,
respectively.
The Top-1 and MRR results are compared in Fig. 5 and Fig. 6,
respectively. Higher accuracy rates were obtained for higher “k”
values in Top-k overall. For any k value, we did not observe a
considerable improvement in the accuracy of optimization-based
approaches through debiasing, whereas learning-based approaches
showed a clear improvement especially in Top-3 accuracy.
Unlike the Top-k measures that focus on best predictions, the
MRR measure considers the ranking of each prediction. It is there-
fore more representative of overall performance. Fig. 6 shows the
MRR scores before and after the debiasing. In terms of MRR, debi-
asing yields the best relative improvement on the learning-based
techniques. The improvement for the 5-NN technique on HIVE is
25% and for the Naïve Bayes technique on the QT Creator dataset is 26%.
These improvements are higher than the improvements observed
with the Top-k measures.
4.8 Does Debiasing Actually Work or Is It Just
Coincidence?
To illustrate that the improvements observed with debiasing are not accidental, we performed an additional validation step:
• Remove PRs randomly from the datasets and evaluate the accuracy of the CRR techniques before and after debiasing. This process compares the dataset version RUb (Reduced-Unbiased) in Fig. 2 to the version DbBa (Balanced-Debiased). The number of samples removed equals the number of unsuccessful PRs for a fair comparison. This is repeated 100 times to create different randomly reduced versions of a dataset to compare with the DbBa version. The performance results from the different random reductions are averaged.
• Remove only successful PRs from the datasets (thus introducing a bias against successful PRs) and evaluate the accuracy of the CRR techniques before and after debiasing. This process compares the dataset version RBi (Reduced-Biased) in Fig. 2 to the version DbBa. Again the random reduction is repeated 100 times and performance results averaged.

Figure 2: The evaluation process using different versions of the preprocessed datasets

Figure 3: Percentage of unsuccessful PRs detected at each day following a successful merge due to reopening of the underlying bug (HIVE dataset)
Table 4 summarizes the results of this step for the MRR measure.
The results are expected. All CRR techniques perform best with-
out unsuccessful PRs (DbBa) and worst with randomly removed
successful PRs (RBi). In all cases, the CRR techniques invariably
perform worse with randomly removed PRs (RUb) than without
unsuccessful PRs (DbBa). We conclude that the improvements observed with debiasing are not accidental since targeted removal of samples focusing on unsuccessful PRs always gives better results.

Figure 4: Percentage of untouched PR files after closure over time (HIVE dataset)

Figure 5: Top-1 Accuracy before and after debiasing on the balanced versions of the datasets
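The random-removal comparison described above can be sketched as follows; evaluate stands in for training and scoring a CRR technique on a reduced dataset, and all names are illustrative.

import random

def random_reduction(prs, is_unsuccessful, n_remove, successful_only, seed):
    # Remove n_remove PRs at random, either from the successful class only
    # (Reduced-Biased) or from all PRs regardless of class (Reduced-Unbiased).
    rng = random.Random(seed)
    candidates = [i for i, pr in enumerate(prs)
                  if not successful_only or not is_unsuccessful(pr)]
    removed = set(rng.sample(candidates, n_remove))
    return [pr for i, pr in enumerate(prs) if i not in removed]

def averaged_score(prs, is_unsuccessful, n_remove, successful_only, evaluate, repeats=100):
    # Average the technique's score over repeated random reductions.
    scores = [evaluate(random_reduction(prs, is_unsuccessful, n_remove, successful_only, seed))
              for seed in range(repeats)]
    return sum(scores) / len(scores)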
4.9 How Many Unsuccessful PRs Could be
Superfluous?
Figure 6: MRR Scores before and after debiasing on the balanced versions of the datasets

Table 4: MRR for Random vs. Targeted Removal of PR Samples

HIVE
                                                        Naive Bayes  5-NN    Decision Tree  Profile Based  RSTrace
With Unsuccessful PRs (Balanced)                        18.26%       27.32%  30.96%         33.22%         33.17%
Without Unsuccessful PRs (Balanced-Debiased)            20.97%       34.82%  35.89%         34.28%         33.26%
Without Unsuccessful-Successful PRs (Reduced-Unbiased)  18.13%       26.62%  30.75%         33.14%         33.09%
Without Successful PRs (Reduced-Biased)                 17.27%       23.24%  26.83%         32.56%         32.87%

QT Creator
                                                        Naive Bayes  5-NN    Decision Tree  Profile Based  RSTrace
With Unsuccessful PRs (Balanced)                        23.37%       33.76%  33.73%         32.27%         35.58%
Without Unsuccessful PRs (Balanced-Debiased)            29.58%       37.32%  38.35%         33.48%         35.67%
Without Unsuccessful-Successful PRs (Reduced-Unbiased)  22.12%       33.49%  32.23%         32.15%         35.41%
Without Successful PRs (Reduced-Biased)                 18.30%       30.72%  29.41%         31.43%         35.13%

In Section 3.2, we discussed a number of caveats for the PR success measure adopted. We now focus on the second caveat to assess to what extent reopened bugs could be attributed to reasons other
than the review/reviewer quality of the original associated PR. This
assessment considers the possibility that the original PR was indeed
successful and the reopened bug was a false positive due to changes
to the codebase unrelated to the original bug (e.g., changes in the underlying dependencies that make it look like the original bug suddenly resurfaced instead of being reported as a new bug). To do this, we analyze the elapsed time between the Closed and Reopened transitions of each reopened bug in the datasets. If the elapsed time is small, e.g., less than one day, resurfacing of the bug is unlikely to be related to external circumstances that the PR reviewers could not have detected. We also look at the percentage of the files involved in an unsuccessful PR that were not changed (untouched) in a commit following the closure of the associated bug. If this percentage is high after a certain time has elapsed, their
likelihood of causing the bug to resurface is low, hence the PR was
likely genuinely unsuccessful.
Fig. 3 shows that in the HIVE dataset, 45% of the bugs are reopened on the same day of the PR and 80% of the bugs are reopened within 24 days. Fig. 4 shows that 80% of the files involved in an unsuccessful PR remain untouched 19 days after the closure of the underlying bug. The results are similar on the second dataset. Because most bugs are reopened in the first few days of their closure and most files involved in an unsuccessful PR remain untouched during the initial days after closure, we believe mislabeling unsuccessful PRs, although possible in rare circumstances, is unlikely to be pervasive enough to compromise the PR success measure.
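The latency analysis behind Fig. 3 can be reproduced with a few lines, assuming each reopened bug provides its Closed and Reopened timestamps; the example dates are made up.

from datetime import datetime

def reopen_latencies_days(closed_reopened_pairs):
    # Elapsed days between a bug's Closed transition and its later Reopened transition.
    return [(reopened - closed).days for closed, reopened in closed_reopened_pairs]

def fraction_within(latencies, max_days):
    # Share of reopened bugs whose latency is at most max_days.
    return sum(1 for d in latencies if d <= max_days) / len(latencies)

pairs = [(datetime(2019, 1, 1), datetime(2019, 1, 1)),
         (datetime(2019, 1, 1), datetime(2019, 1, 20))]
print(fraction_within(reopen_latencies_days(pairs), max_days=0))  # 0.5 reopened same day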
To validate whether reopened PRs indeed had poor reviews, we performed a quality analysis on a random 10% sample (40 data points) of the QT Creator dataset. This analysis can be found in a supplement posted to Figshare7. Two authors independently inspected the quality of each review for this sample. We found that 32 out of the 40 (80%) had in fact poor reviews according to the criteria we used. We categorized poor reviews as Superficial (LGTM/missing comments) (21), Overruled (author indicated the reviewer had misjudged the changes) (3), and Poor-Effort (self-admittance of a substandard/rushed review) (6).
5 DISCUSSION
5.1 Research Questions
RQ1: How can we eliminate the systematic labeling bias in CRR datasets?
Since manual methods are not cost-efficient, can still be error-prone, and do not scale up well, we looked for an automatable
method based on an objective success measure that leverages link-
ages between PR data and bug data. A PR, the underlying code
review, and the assigned reviewer’s work were deemed successful
only if the bug the PR targeted was never reopened following a suc-
cessful merge and the associated closure of the bug. The debiasing
method we propose simply removes unsuccessful samples from the
PR data to eliminate possible biases in past reviewer selections. The
CRR techniques could then use this debiased data as their ground
truth to build their models and improve their performance.
RQ2: How does systematic labeling bias elimination affect the performance of CRR techniques?
We applied the automatic debiasing method to a diverse set of five CRR techniques using two open-source datasets. We observed that, provided the data had sufficiently high rates of badly labeled samples, as reported in the literature, the performance of the CRR
techniques in general improved after debiasing. The highest im-
provement was observed with learning-based CRR techniques. The
improvements in the optimization-based CRR techniques tested
were marginal.
The reason behind the difference in improvement between optimization-based and learning-based techniques is that the reviewer recommendation process of the optimization-based techniques tested does not depend as heavily on learning from past data as that of the learning-based techniques. Additionally, not every sample may be equally valuable since the optimization criterion may inadvertently already discount badly labeled samples. Therefore, a debiasing approach focused on removing badly labeled samples may not make much difference in such techniques.

7https://figshare.com/s/1b9ea55377d9f2c31a7a
We conclude that the proposed debiasing approach improves the
quality of the ground truth and is worthwhile for learning-based
CRR techniques.
5.2 Threats to Validity
Our method of identifying a badly labeled PR is subject to a construct threat [24]. It may not be possible to catch all reopened bugs: some may be completely missed, others may have yet to be reopened and so are not captured in the dataset. Leaving these false negatives in the dataset would reduce the efficacy of debiasing. Conversely, there may be false positives: PRs identified as unsuccessful due to a reopened bug may have actually been successful. We discussed some
possible reasons for such cases in Section 4.9, and by examining
two indicators in the datasets, concluded that these cases should
be rare.
Our original datasets had low unsuccessful PR rates, due to the
fact that some unsuccessful PRs are likely to be mislabeled as suc-
cessful because the recurrence of the associated bugs was missed
in the bug tracking system (we discussed possible reasons for this). At these low rates, the introduced bias is not significant, and removing it would not yield much benefit. We observed precisely this effect in the original data: improvements in performance measures were less than 1%, and thus not material. We thus looked at the
reopened bug rates in the literature and balanced the datasets by
randomly removing successful PRs to move the unsuccessful PR
ratio to within the reported ranges, and evaluated the debiasing
method with these reduced datasets. Because of this adjustment,
we must posit that any observed improvements are conditional on
a dataset having sufficient systematic labeling bias.
Threats to external validity are concerned with generalizability
[24]. We evaluated our approach on two open-source datasets and five different CRR techniques. We believe the CRR techniques used
are a reasonable representation of common approaches. However,
we acknowledge the limitations of using two datasets. We could
not identify further datasets that both contain sufficient samples
and integrate PR information with bug tracking.
Open-source projects typically have high turnover rates. Unlike
in closed-source projects, many contributors and reviewers become
inactive over time, and others join the project. Therefore, our
observations may not apply to closed-source projects.
To mitigate internal threats related to the implementation of the
CRR techniques and the data extraction methods, we provide the
source code of all techniques used in the evaluation as well as both datasets on Figshare8.
6 CONCLUSION AND FUTURE WORK
Good code reviewer selection is central to effective code reviews.
CRR techniques attempt to automate the code reviewer selection
problem, but they mostly build their models and evaluate them us-
ing historical data whose ground truth may be unreliable. Ground
truth problems often result from the susceptibility of human deci-
sion makers to cognitive biases, such as substituting a convenience
attribute for a competence attribute, in reviewer assignments. When the code reviews are performed in the context of PRs, and the dataset links PR data with bug tracking data, it is possible to identify PRs that fail to achieve their goal: a PR, and the developer assigned to reviewing it, can be deemed unsuccessful when the bug associated with the PR is reopened later. Our experiments showed that when such failed cases are pervasive enough (in the 20-28% range, consistent with reported rates), removing such data points from the dataset in general improves the performance of CRR techniques. Although the improvement was very marginal for the optimization-based CRR techniques tested, it was large for learning-based CRR techniques (up to 26% for Naïve Bayes).

8https://figshare.com/s/1b9ea55377d9f2c31a7a
Our work has implications for both practitioners and researchers.
Researchers can apply our proposed debiasing approach while cleaning their training data to improve the accuracy of their CRR models. Recommendation tools built on these models would then inherit these improvements. The debiasing approach could also be useful for flagging potentially ineffective reviews to improve the code review practices in an organization.
Although initial results look promising, we still need to test our
approach on other datasets and datasets collected from commercial
systems. We are in the process of looking for suitable candidates. It
would make sense for future work to focus on learning-based CRR techniques since the real returns are to be found in that space. If expanded evaluations prove the debiasing method to be widely effective, we plan to provide tool support for automated debiasing.
REFERENCES
[1]
Aalok Ahluwalia, Davide Falessi, and Massimiliano Di Penta. 2019. Snoring: A Noise in Defect Prediction Datasets. 2019 IEEE/ACM 16th International Conference
on Mining Software Repositories (MSR) (2019), 63–67. https://doi.org/10.1109/
MSR.2019.00019
[2]
Alberto Bacchelli and Christian Bird. 2013. Expectations, Outcomes, and Chal-
lenges of Modern Code Review. Proceedings of the 2013 International Conference
on Software Engineering (2013), 712–721.
[3]
Alberto Bacchelli and Christian Bird. 2018. Code Reviewing in the Trenches.
IEEE Software 35 (2018), 34–42. https://doi.org/10.1109/MS.2017.265100500
[4]
Vipin Balachandran. 2013. Reducing Human Effort and Improving Quality in Peer
Code Reviews using Automatic Static Analysis and Reviewer Recommendation.
2013 35th International Conference on Software Engineering (ICSE) (2013), 931–940.
https://doi.org/10.1109/ICSE.2013.6606642
[5]
Christian Bird, Adrian Bachmann, Eirik Aune, John Duffy, Abraham Bernstein,
Vladimir Filkov, and Premkumar Devanbu. 2009. Fair and Balanced? Bias in
Bug-Fix Datasets. 121–130. https://doi.org/10.1145/1595696.1595716
[6]
Amiangshu Bosu, Michaela Greiler, and Christian Bird. 2015. Characteristics of
Useful Code Reviews: An Empirical Study at Microsoft. Proceedings of the 12th
Working Conference on Mining Software Repositories (2015), 146–156.
[7]
Guillermo F. Cabrera, Christopher J. Miller, and Jeff Schneider. 2014. Systematic
labeling bias: De-biasing where everyone is wrong. In Proceedings - International
Conference on Pattern Recognition. https://doi.org/10.1109/ICPR.2014.756
[8]
Tse-Hsun Chen, Meiyappan Nagappan, Emad Shihab, and Ahmed E. Hassan.
2014. An Empirical Study of Dormant Bugs. In Proceedings of the 11th Work-
ing Conference on Mining Software Repositories (Hyderabad, India) (MSR 2014).
Association for Computing Machinery, New York, NY, USA, 82–91. https:
//doi.org/10.1145/2597073.2597108
[9]
Emre Doğan, Eray Tüzün, K. Ayberk Tecimer, and H. Altay Güvenir. 2019. Investigating the Validity of Ground Truth in Code Reviewer Recommendation Studies.
In 2019 ACM/IEEE International Symposium on Empirical Software Engineering
and Measurement (ESEM). 1–6. https://doi.org/10.1109/ESEM.2019.8870190
[10]
M E Fagan. 1976. Design and code inspections to reduce errors in program
development. IBM Systems Journal 15 (1976), 182–211.
[11]
Mikołaj Fejzer, Piotr Przymus, and Krzysztof Stencel. 2018. Profile based recom-
mendation of code reviewers. Journal of Intelligent Information Systems (2018).
https://doi.org/10.1007/s10844-017-0484-1
[12]
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo Prati, Bartosz
Krawczyk, and Francisco Herrera. 2018. Learning from Imbalanced Data Sets.
https://doi.org/10.1007/978-3-319-98074-4
[13]
Google. 2020. Code Review Developer Guide. https://github.com/google/eng-
practices/blob/master/review/index.md
[14]
Kim Herzig and Andreas Zeller. 2013. The Impact of Tangled Code Changes. In
Proceedings of the 10th Working Conference on Mining Software Repositories (MSR
’13). IEEE Press, Piscataway, NJ, USA, 121–130. http://dl.acm.org/citation.cfm?
id=2487085.2487113
[15]
Jing Jiang, Jia-Huan He, and Xue-Yuan Chen. 2015. CoreDevRec: Automatic Core
Member Recommendation for Contribution Evaluation. Journal of Computer
Science and Technology 30, 5 (2015), 998–1016. https://doi.org/10.1007/s11390-015-1577-3
[16]
Vladimir Kovalenko, Nava Tintarev, Evgeny Pasynkov, Christian Bird, and Al-
berto Bacchelli. 2018. Does reviewer recommendation help developers? IEEE
Transactions on Software Engineering (2018), 1. https://doi.org/10.1109/TSE.2018.
2868367
[17]
John Boaz Lee, A. Ihara, A. Monden, and K. Matsumoto. 2013. Patch Reviewer
Recommendation in OSS Projects. 2013 20th Asia-Pacific Software Engineering
Conference (APSEC) 2 (2013), 1–6. https://doi.org/10.1109/APSEC.2013.103
[18]
Jakub Lipcak and Bruno Rossi. 2018. A Large-Scale Study on Source Code
Reviewer Recommendation. 44th Euromicro Conference on Software Engineering
and Advanced Applications (SEAA 2018) (2018).
[19]
Rahul Mohanani, Iflaah Salman, Burak Turhan, Pilar Rodríguez, and Paul Ralph. 2020. Cognitive Biases in Software Engineering: A Systematic Mapping Study. IEEE Transactions on Software Engineering 46, 12 (2020), 1318–1339. https://doi.org/10.1109/TSE.2018.2877759
[20]
Thanh H D Nguyen, Bram Adams, and Ahmed E Hassan. 2010. A Case Study of
Bias in Bug-Fix Datasets. April 2014 (2010). https://doi.org/10.1109/WCRE.2010.37
[21]
Ali Ouni, Raula Gaikovina Kula, and Katsuro Inoue. 2016. Search-Based Peer
Reviewers Recommendation in Modern Code Review. 2016 IEEE International
Conference on Software Maintenance and Evolution (ICSME) (2016), 367–377. https:
//doi.org/10.1109/ICSME.2016.65
[22]
Paul Ralph. 2010. Toward a Theory of Debiasing Software Development. In
Lecture Notes in Business Information Processing, Vol. 93. 92–105. https://doi.org/10.1007/978-3-642-25676-9_8
[23]
Michael Rath and Patrick Mäder. 2019. The SEOSS 33 dataset — Requirements,
bug reports, code history, and trace links for entire projects. Data in Brief (2019).
https://doi.org/10.1016/j.dib.2019.104005
[24]
Per Runeson and Martin Höst. 2008. Guidelines for conducting and reporting
case study research in software engineering. Empirical Software Engineering 14,
2 (2008), 131. https://doi.org/10.1007/s10664-008-9102-8
[25]
Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto
Bacchelli. 2018. Modern Code Review: A Case Study at Google. Proceedings of
the 40th International Conference on Software Engineering Software Engineering in
Practice - ICSE-SEIP 18 (2018). https://doi.org/10.1145/3183519.3183525
[26]
Eric Smith and A Terry Bahill. 2009. Attribute Substitution in Systems Engineer-
ing. Systems Engineering 13 (2009), 130–148. https://doi.org/10.1002/sys.20138
[27]
Anders Søgaard, Barbara Plank, and Dirk Hovy. 2014. Selection Bias, Label Bias,
and Bias in Ground Truth. In Proceedings of COLING 2014, the 25th International
Conference on Computational Linguistics: Tutorial Abstracts.
[28]
Webb Stacy and Jean Macmillan. 1995. Cognitive Bias in Software Engineering.
Commun. ACM 38, 6 (1995), 57–63.
[29]
Emre Sülün, Eray Tüzün, and Uğur Doğrusöz. 2019. Reviewer recommendation
using software artifact traceability graphs. In PROMISE’19: Proceedings of the
Fifteenth International Conference on Predictive Models and Data Analytics in
Software Engineering. 66–75. https://doi.org/10.1145/3345629.3345637
[30]
Emre Sülün, Eray Tüzün, and Uğur Doğrusöz. 2021. RSTrace+: Reviewer sug-
gestion using software artifact traceability graphs. Information and Software
Technology 130 (2021), 106455. https://doi.org/10.1016/j.infsof.2020.106455
[31]
Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula,
Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. 2015. Who Should
Review My Code? 2015 IEEE 22nd International Conference on Software Analysis,
Evolution, and Reengineering (SANER) (2015), 141–150. https://doi.org/10.1109/
SANER.2015.7081824
[32]
Xin Xia, David Lo, Emad Shihab, Xinyu Wang, and Bo Zhou. 2014. Automatic,
high accuracy prediction of reopened bugs. Automated Software Engineering
(2014). https://doi.org/10.1007/s10515-014-0162-2
[33]
Xin Xia, David Lo, Xinyu Wang, and Xiaohu Yang. 2015. Who Should Review
This Change? 2015 IEEE International Conference on Software Maintenance and
Evolution (ICSME) (2015), 261–270. https://doi.org/10.1109/ICSM.2015.7332472
[34]
Zhenglin Xia, Hailong Sun, Jing Jiang, Xu Wang, and Xudong Liu. 2017. A Hybrid
Approach to Code Reviewer Recommendation with Collaborative Filtering. 2017
6th International Workshop on Software Mining (SoftwareMining) (2017), 24–31.
[35]
Motahareh Bahrami Zanjani, Huzefa Kagdi, and Christian Bird. 2016. Automatically Recom-
mending Peer Reviewers in Modern Code Review. IEEE Transactions on Software
Engineering 42, 6 (2016), 530–543. https://doi.org/10.1109/TSE.2015.2500238
[36]
H. Alperen Çetin, Emre Doğan, and Eray Tüzün. 2021. A review of code reviewer
recommendation studies: Challenges and future directions. Science of Computer
Programming 208 (2021), 102652. https://doi.org/10.1016/j.scico.2021.102652