Detection and Elimination of Systematic Labeling Bias in Code
Reviewer Recommendation Systems
K. Ayberk Tecimer
Technical University of Munich
Munich, Germany
ayberk.tecimer@tum.de
Eray Tüzün
Bilkent University
Ankara, Turkey
eraytuzun@cs.bilkent.edu.tr
Hamdi Dibeklioğlu
Bilkent University
Ankara, Turkey
dibeklioglu@cs.bilkent.edu.tr
Hakan Erdogmus
Carnegie Mellon University
Pittsburgh, USA
hakane@andrew.cmu.edu
ABSTRACT
Reviewer selection in modern code review is crucial for effective code reviews. Several techniques exist for recommending reviewers appropriate for a given pull request (PR). Most code reviewer recommendation techniques in the literature build and evaluate their models based on datasets collected from real projects using open-source or industrial practices. The techniques invariably presume that these datasets reliably represent the "ground truth."
In the context of a classification problem, ground truth refers to the objectively correct labels of a class used to build models from a dataset or evaluate a model's performance. In a project dataset used to build a code reviewer recommendation system, the code reviewer picked for a PR is usually assumed to be the best code reviewer for that PR. However, in practice, the selected code reviewer may not be the best possible code reviewer, or even a qualified one. Recent code reviewer recommendation studies suggest that the datasets used tend to suffer from systematic labeling bias, making the ground truth unreliable. Therefore, models and recommendation systems built on such datasets may perform poorly in real practice.
In this study, we introduce a novel approach to automatically detect and eliminate systematic labeling bias in code reviewer recommendation systems. The bias that we remove results from selecting reviewers that do not ensure a permanently successful fix for a bug-related PR. To demonstrate the effectiveness of our approach, we evaluated it on two open-source project datasets (HIVE and QT Creator) and with five code reviewer recommendation techniques (Profile-Based, RSTrace, Naïve Bayes, k-NN, and Decision Tree). Our debiasing approach appears promising: it improved the Mean Reciprocal Rank (MRR) of the evaluated techniques by up to 26% on the datasets used.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
EASE 2021, June 21–23, 2021, Trondheim, Norway
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-9053-8/21/06. . . $15.00
https://doi.org/10.1145/3463274.3463336
CCS CONCEPTS
• Software and its engineering → Software development process management; Collaboration in software development.
KEYWORDS
modern code review, ground truth, labeling bias elimination, systematic labeling bias, data cleaning, code review recommendation
ACM Reference Format:
K. Ayberk Tecimer, Eray Tüzün, Hamdi Dibeklioğlu, and Hakan Erdogmus. 2021. Detection and Elimination of Systematic Labeling Bias in Code Reviewer Recommendation Systems. In Evaluation and Assessment in Software Engineering (EASE 2021), June 21–23, 2021, Trondheim, Norway. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3463274.3463336
1 INTRODUCTION
The code review process is an important step in the software development lifecycle. Effective code reviews increase internal quality and reduce defect rates [25]. To increase the effectiveness of code reviews, reviewers should be selected carefully. Several Code Reviewer Recommendation (CRR) techniques exist in the literature [4, 15, 17, 21, 29, 31, 33–35]. These CRR techniques use different strategies, but they invariably either build or evaluate their models based on datasets gathered from industrial or open-source projects. Hence they rely on the datasets accurately capturing the "ground truth" regarding past reviewer selections. The models assume that a code reviewer assigned to a review task, often captured by a pull request (PR), in a dataset is the best possible reviewer (that the assignment was made by carefully evaluating candidate reviewers and selecting the one truly best qualified for that PR). However, in practice, the selected code reviewer may not be the most qualified, or even sufficiently qualified, to review the submitted PR [9]. In several scenarios, reviewer assignments tend to be based on non-technical factors, which may invalidate the central assumption of the models built [9].
For instance, according to a study on code reviewer practices at Microsoft [16], reviewers are assigned to PRs according to their availability and social relationship with the person who makes the reviewer assignments. Similarly, according to Doğan et al. [9], availability is an important factor for reviewer assignments and is frequently substituted for technical or competency factors. In other words, recommendation labels in datasets may frequently be wrong. Therefore, datasets originating from real practice can negatively affect the accuracy and reliability of the CRR techniques that rely on them.
In machine learning, the kind of labeling error that exists in CRR datasets is generally referred to as systematic labeling bias. Supervised learning techniques require labels in the training samples. These labels indicate the real/actual classes of interest in past data so that models can be built to predict the classes of new data. For instance, in order to distinguish between apples and oranges, an actual label (i.e., apple or orange) for each training sample is required. Ground truth refers to these labels indicating the actual class of the training samples. In more complex pattern recognition tasks, such as the classification of code review tasks according to who should review them, 100% correct labels may not be obtainable in the training samples since there are several factors, including subjective ones, to be considered. Although the labels are not perfect, they are still considered the ground truth. While the amount of problematic labels in the ground truth may be relatively small or the inconsistency in the class labels may be negligible, in some cases the ground truth may include systematic problems that prevent the models from converging or learning generalizable patterns, or, as in the CRR case, from being as effective as they could be in practice. Such issues are generally due to basic/naïve assumptions in the labeling process [27] or intrinsic properties of the observed data [7]: this is what systematic labeling bias in general refers to.
With the goal of preventing these kinds of labeling problems in the ground truth of CRR datasets, we formulate two research questions:
RQ1: How can we eliminate systematic labeling bias in CRR datasets?
RQ2: How does systematic labeling bias elimination affect the performance of CRR techniques?
For RQ1, we explore possible solutions and introduce a new approach to detect and eliminate potentially "incorrect" reviewer labels in CRR datasets. For RQ2, we measure the effects of our proposed approach by comparing before and after accuracy rates of five CRR techniques: Naïve Bayes, k-NN, Decision Tree, RSTrace, and Profile-based.
Section 2 provides background information on the modern code review practice, CRR approaches, ground truth problems in software engineering, and cognitive bias in software engineering. Section 3 defines a success criterion for code reviews that we use to identify incorrect assignments. Furthermore, Section 3 focuses on code reviews associated with PRs and bugs and introduces our debiasing (data cleaning) approach. Section 4 describes our experiments, including the datasets, preprocessing steps, and experimental setup, and presents the results. Section 5 answers the research questions and discusses the limitations of our approach. Finally, Section 6 summarizes the contribution and discusses future work.
2 BACKGROUND AND RELATED WORK
Code review is a central quality practice and an important part of the software development lifecycle. Traditional, Fagan-style code review [10] is a formal, manual, synchronous, and well-documented process performed by a carefully assigned group on selected parts of the codebase. In contrast, modern code review is informal, tool-based, asynchronous, and focuses on reviewing only the latest changes [2, 25]. In the last two decades, modern code review has become dominant in both commercial projects [6, 25] and open-source ones [3].
In the following, we provide a summary of code review recommendation techniques, ground truth problems, and cognitive bias examples in software engineering.
2.1 Code Review Recommendation Techniques
The CRR techniques discussed in the literature fall mainly under two categories: optimization-based approaches and learning-based approaches. We mention representative works below to illustrate the diversity of the approaches. For a more detailed overview of CRR techniques, please refer to Çetin et al. [36].
Optimization-based approaches. Balachandran [4] proposed a heuristic that analyzes the change history to find suitable reviewers. They reach 60% to 92% recommendation accuracy, which is better than a comparable code reviewer recommendation approach based on file change history. Lee et al. [17] proposed a graph-based technique to find reviewers in open-source projects. Their method achieves an average recall of 0.84 for Top-5 predictions and a recall of 0.94 for Top-10 predictions. A technique based on analyzing file-path similarity was developed by Thongtanunam et al. [31]; later, Xia et al. [33] extended this technique with text mining to leverage additional information in recommendations. Ouni et al. [21] proposed a search-based genetic algorithm to identify the most appropriate peer reviewers for code changes. The authors evaluated their approach on three different open-source projects (QT, OpenStack, and Android). Their experiments show that their genetic algorithm accurately recommends code reviewers with up to 59% precision and 74% recall. Zanjani et al. [35] presented a technique focusing on the previous review quality of candidate reviewers. They argue that providing specific information (e.g., quantification of review comments and their recency) significantly improves on prior code reviewer recommendation approaches. Sülün et al. [30] proposed a graph-based technique using traceability relations between PRs, source code files, and bugs to recommend code reviewers for a given pull request.
Learning-based approaches. Jiang et al. [15] proposed a technique that builds a model with Support Vector Machines (SVM) to make reviewer recommendations in OSS projects. The authors evaluated their technique on 18,651 pull requests of five popular projects on GitHub. They indicate that their technique achieves accuracy from 72.9% to 93.5% for Top-3 recommendation. Xia et al. [34] presented a hybrid approach in which they combined latent-factor models and neighborhood methods. Their results demonstrate that the proposed approach performs better than the compared methods for all Top-k recommendations.
All the CRR techniques described above evaluate their models based on datasets collected from real projects, where a reviewer assigned to a code review task is assumed to be the right, or best, reviewer for the job. However, in practice, this assumption is often violated, making the ground truth suspect. Some authors [18, 21, 34, 35] acknowledge this problem explicitly as a limitation. For instance, Ouni et al. [21] discuss that reviewers who are assigned to a PR may not do the job well for various reasons (such as workload or availability), or the review may end up being of poor quality because the assignment was mainly determined by social factors rather than competence. Lipcak and Rossi [18] state that evaluating CRR techniques with top-k-style criteria may not be accurate, as it is not guaranteed that the actual reviewers in the test data were the best candidates (or even sufficiently qualified) for the tasks for which they were selected.
2.2 Ground Truth Problems in Software Engineering
Outside CRR [9], ground truth problems exist in data-analytics solutions that support other common prediction and recommendation tasks in software engineering.
Bird et al. [5] studied bug-fix datasets and found strong evidence of systematic bias due to mislabeling of bug fixes in version histories. The performance of a defect prediction model that they tested was adversely affected when the model was built from biased data.
Nguyen et al. [20] examined tagging and linkage biases in IBM Jazz software. Tagging bias results from treating all logged issues as bugs (some of which can represent other coding tasks, decisions, or enhancements). A linkage bias occurs when there is no traceability connection between a bug-fix PR and the corresponding bug report. The authors found both linkage and tagging biases even in datasets thought to be nearly accurate.
Herzig and Zeller [14] analyzed tangled changes in defect prediction. A tangled change is caused by bundling multiple unrelated changes in a single commit. Tangled changes introduce noise to the data: as much as 17% of all source files could be incorrectly associated with bug reports due to such tangling. The authors state that this can negatively impact defect prediction models.
Ahluwalia et al. [1] investigated biases in datasets that were used to build defect prediction models. The authors stated that bugs were usually discovered after several releases, and therefore may have still been dormant in the snapshot taken to build the models. According to the study, dormant bugs exist in up to 20% of existing releases in a dataset, distorting the ground truth by causing defective code to be mislabeled as defect-free. The authors analyzed 282 releases from six open-source projects and demonstrated the existence of the ground truth problem in bug datasets; however, they did not propose a solution for debiasing the data. Chen et al. [8] analyzed dormant bugs in the Apache codebase. They observed a dormant bug rate of 33%, higher than that reported by Ahluwalia et al. Both studies demonstrate that the existence of dormant bugs could potentially affect the reliability of the ground truth.
2.3 Cognitive Bias Examples in Software Engineering
A cognitive bias is a type of systematic error in decision making that causes suboptimal outcomes based on established beliefs and misguided intuition. Our work on improving the performance of CRR techniques addresses systematic labeling errors that in great part stem from such biases.
Cognitive biases are pervasive in software engineering. Mohanani et al. [19] provide many examples of cognitive biases in software engineering in their literature review and advocate their mitigation as central to improving the quality of decision making. Ralph [22] attributes the persistence of high software project failure rates, despite the many advances in software technologies and processes, to cognitive biases. Stacy and Macmillan [28] state that cognitive biases have an adverse effect on software developers' thought processes. They advise preferring empirical investigation to intuition and seeking disconfirmatory information to reduce cognitive biases. Smith and Bahill [26] similarly argue that cognitive biases disturb rational decision making in systems engineering through a mechanism called attribute substitution: the substitution of a convenience factor for an objectively important factor in a decision. They suggest that raising the awareness of this mechanism among engineers could help alleviate its adverse effects in engineering decisions. Attribute substitution is particularly relevant to our work since it is often a primary source of failed labels in CRR datasets, caused by past reviewer assignments based on convenience and social factors rather than factors that are directly related to the suitability of a reviewer for the specific review task.
3 IMPROVING CRR ACCURACY IN PAST DATA
In the remainder of the paper, we assume that a code review task is encapsulated in a pull request (PR). The purpose of code review is to merge the PR successfully into the codebase after having addressed all of the reviewers' comments, and to close the PR, hopefully indefinitely. The task of the code reviewer is to ensure this successful resolution. According to Google best practices for code reviews [13], "the best reviewer is the person who will be able to give you the most thorough and correct review for the piece of code you are writing." This definition of an appropriate, or successful, code reviewer aligns with that of a successful PR as one that avoids downstream rework related to the changes contained in the PR after the PR has been closed. However, the success criterion is subjective as is, and not easily measurable. This poses a problem since our goal is to eliminate systematic labeling bias in CRR datasets: we need a way to identify whether a label, in this case the assigned reviewer, is right or wrong. The label will be wrong if the PR is unsuccessful, in which case we need to be able to remove the bias either by correcting the label (assigned reviewer), removing the data point corresponding to that PR, or generating new data with checks that increase label accuracy.
3.1 Manual Correction
If the success measure is qualitative, we can only use manual methods to remove the bias. For example, additional expert checks may be put in place to tag low-quality reviews when a PR is closed. Or multiple independent reviewers can be assigned to a single PR, and an expert again can examine and tag low-quality reviews. The tagged reviews can then be excluded from the generated reviewer assignment data. This can also be done post hoc: the reviewer assignment data can be cleaned by an expert after the fact using similar quality checks. However, these approaches would not scale up well because they can be too costly, too impractical, or both. Besides, manual expert checks can still be error-prone, and we must still assume that the expert performing the checks and verifying the labels is unbiased: if the expert is biased, we create a circular situation in which we attempt to address one kind of bias with a method that introduces another kind of bias.
3.2 Characterization of a Successful Code Review
When PRs are associated with coding tasks or defects in a bug tracking system, an objective success measure can be defined: if the associated bug is not reopened again, the PR is deemed successful. The PR process often involves multiple rounds of reviewers commenting on the scope of the PR, possibly resulting in several commits to address any outstanding concerns, and concludes with a merge into the master branch to close the PR. The code review task associated with the PR is successful if the PR is successful. Similarly, the reviewers performing the code review for the PR are themselves successful if the PR they helped close is successful.
While this success measure is objective, it still has a few caveats. First, it cannot be immediately determined whether a PR is successful: we can only decide the success status in retrospect, after sufficient time has elapsed following the closing of the PR. Second, it relies on the assumption that reopening the underlying bug after the PR has been closed can only be due to the original PR not having been successful, and not due to, for example, something outside the scope of the PR having changed and causing a ripple effect (e.g., a future commit refactoring a dependency). The third caveat is that it assumes perfect bug identification: a new bug is created only when the reason for the bug cannot be reliably traced to an existing bug that can be reopened (no duplicate entries in the bug tracking system). We accept these caveats and see whether, even in their presence, the approach that we propose can achieve reasonable improvements in the performance of existing CRR techniques with this success measure.
3.3 The Debiasing Approach
The objective success measure defined in the previous section provides us with an easy way to identify incorrect labels in CRR datasets and remove them. The success measure requires linkages between PRs and bugs logged in a bug tracking system. Almost all datasets that have the required linkage focus on bug-related PRs. The CRR techniques on which we have tested our approach do the same. So we limit ourselves to bugs and PRs associated with bugs.
Fig. 1 shows the typical flow of a bug as it is tracked in a bug tracking system (e.g., Jira). The circles (Reported, Need More Info, Open, In Progress, Closed) represent the different states. The PR process enters this flow at the transition from the In Progress to the Closed state. After a developer fixes a bug, before the bug is closed, the developer creates a PR and one or more team members are invited to review the code. A PR conversation takes place discussing the fixes, which may result in a number of additional commits if the fixes were deemed unsuccessful. When the reviewers finally approve the fix, the developer merges the PR to the master branch and the bug's status is changed to Closed. The merged PR is eventually deployed. However, further testing, field testing by end users, or ongoing development work may at one point reveal that the bug was not fixed as intended, in which case the bug may be reopened by changing its status back from Closed to Open. When this happens, barring other rare reasons that may cause a previously identified bug to reappear in exactly the same context, we discover in retrospect that the original PR was in fact not successful, and the reviewer assignment associated with this PR becomes a candidate for removal.
Therefore, our debiasing approach is simple and based on the removal of unsuccessful PRs. Given a CRR dataset consisting of a set of PRs, the reviewer assignments for each PR, a bug associated with each PR, and the status history of each bug in the bug tracking system, we check for each PR whether the associated bug was reopened after it was closed following the merge. If it was, we consider this PR to be unsuccessful post hoc and remove the associated data point from the dataset along with the reviewer assignment.
Figure 1: The lifecycle of a bug and where it interacts with the PR process
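To make the procedure concrete, the following Python sketch shows how this reopened-bug check could be applied to filter a CRR dataset. The record layout (bug_id, merged_at, reviewers) and the shape of the bug status history are illustrative assumptions, not the authors' implementation.

```python
from datetime import datetime

def is_unsuccessful(pr, bug_history):
    """True if the PR's bug was reopened (Closed -> Open) after the PR was merged.

    `pr` is assumed to be a dict with 'bug_id' and 'merged_at' keys, and
    `bug_history` maps bug IDs to (timestamp, from_status, to_status) tuples.
    These field names are hypothetical, chosen only for illustration.
    """
    transitions = bug_history.get(pr["bug_id"], [])
    return any(
        old == "Closed" and new == "Open" and when > pr["merged_at"]
        for when, old, new in transitions
    )

def debias(prs, bug_history):
    """Drop PRs (and thus their reviewer labels) whose bugs were later reopened."""
    return [pr for pr in prs if not is_unsuccessful(pr, bug_history)]

# Toy example: the single PR below is removed because its bug was reopened.
bug_history = {
    "HIVE-123": [
        (datetime(2019, 1, 5), "In Progress", "Closed"),
        (datetime(2019, 1, 20), "Closed", "Open"),  # reopened after the merge
    ],
}
prs = [{"pr_id": 1, "bug_id": "HIVE-123", "reviewers": ["alice"],
        "merged_at": datetime(2019, 1, 5)}]
print(debias(prs, bug_history))  # -> []
```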
4 EXPERIMENTS AND RESULTS
4.1 Dataset Description and Preprocessing
Having dened the debiasing method, we evaluate it on two dier-
ent datasets. These datasets belong to projects from two sources:
Qt
1
, a company that develops cross-platform software, and from
Apache
2
, a widely used open-source cross-platform software foun-
dation. The projects are QT Creator and HIVE, respectively. They
are chosen because they are both open-source, have full PR and
code review history and the PR information is linked to the bug
tracking information, as required.
For QT Creator, we extracted the PR history
3
and bug history
4
until December 2019. For HIVE , we used the version provided by
SEOSS 33 [
23
], a dataset repository that includes data retrieved
from several open-source software projects. In the data gathering
stage, we used the Perceval tool from GrimoireLab
5
, which allows
fetching datasets from both GitHub and Jira. Most of the PRs in the
two datasets are associated with Jira bug ID (in HIVE 96.34%, and
in QT-Creator 73.18%). This allowed us tracing PRs to Jira bugs.
Before we apply the debiasing method on these datasets, we
performed three preprocessing steps. As a rst step, we removed
1https://doc.qt.io/qt-5/index.html
2https://www.apache.org
3https://code.qt.io/cgit/playground/qt-creator/
4https://bugreports.qt.io/projects/QTCREATORBUG/issues/
5https://chaoss.github.io/grimoirelab/
Table 1: PR and Reviewer Statistics of the Datasets

Dataset      # Total PRs   # Unsuccessful PRs   U-to-S Ratio   # Reviewers
QT Creator   5927          406                  7%             152
HIVE         3621          196                  5%             108
In both datasets, some seemingly distinct reviewer labels correspond to the same reviewer. For instance, two different reviewer labels, "jkobus" and "Jarek Kobus", may refer to the same reviewer. In the second step of the preprocessing, duplicate reviewer labels were found and merged automatically if different names correspond to the same email address. In the third step, we checked whether any PR's associated bug in the Jira database was reopened after the PR was merged. If so, we tagged these PRs as unsuccessful.
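The second preprocessing step (merging reviewer aliases that share an email address) can be sketched as follows; the field names are assumptions for illustration, not the names used in the actual datasets.

```python
from collections import defaultdict

def merge_duplicate_reviewers(reviews):
    """Map every alias of the same email address to one canonical reviewer name.

    `reviews` is assumed to be a list of dicts with 'reviewer_name' and
    'reviewer_email' keys (hypothetical field names). The longest alias seen
    for an email address is used as the canonical name.
    """
    aliases = defaultdict(set)
    for review in reviews:
        aliases[review["reviewer_email"].lower()].add(review["reviewer_name"])
    canonical = {email: max(names, key=len) for email, names in aliases.items()}
    for review in reviews:
        review["reviewer_name"] = canonical[review["reviewer_email"].lower()]
    return reviews

reviews = [
    {"reviewer_name": "jkobus", "reviewer_email": "jkobus@example.com"},
    {"reviewer_name": "Jarek Kobus", "reviewer_email": "JKobus@example.com"},
]
merge_duplicate_reviewers(reviews)
print({review["reviewer_name"] for review in reviews})  # -> {'Jarek Kobus'}
```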
Table 1 shows the total number of PRs, the number of unsuccessful PRs, and the ratio of unsuccessful PRs to successful PRs. The total number of PRs in QT Creator after preprocessing was 5,927. Of these pull requests, 406 were unsuccessful, corresponding to a failure/success ratio of 7%. The HIVE dataset had 3,621 PRs after preprocessing, with 196 unsuccessful ones, yielding a failure/success ratio of 5%. There were 152 distinct code reviewers in the QT Creator dataset and 108 distinct code reviewers in the HIVE dataset.
4.2 Evaluation Setup
To evaluate the reliability and usefulness of the debiasing approach, we selected five different CRR techniques from the literature, namely, Profile-based [11], RSTrace [29], Naïve Bayes, k-NN (5-NN), and Decision Tree. Initially, we wanted to apply the approach to all CRR techniques discussed in the section on background and related work. However, only a few of the CRR techniques [11, 29] provide source code or pseudocode. Therefore, we selected those that we could actually run or implement. We had to implement the Profile-based technique ourselves since the source code was not available. For RSTrace, we used the available implementation shared in the original paper. For Naïve Bayes, 5-NN, and Decision Tree, we used the implementations provided by the Scikit-learn library (https://scikit-learn.org/).
For the three machine learning techniques (Naïve Bayes, k-NN, and Decision Tree), we used the file paths in PRs as features. To convert these file paths to numeric values (for use in classification), we applied the vectorizers (CountVectorizer and TfidfVectorizer) from the Scikit-learn library. Hyperparameters of the learning-based models were optimized within a set of considered values. The considered values are given in Table 2.
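As a rough illustration of this feature extraction and hyperparameter search, the sketch below vectorizes toy file paths and tunes a Decision Tree with scikit-learn. The example PRs, reviewer labels, and the use of GridSearchCV are assumptions for illustration, not the paper's exact pipeline (which also respects chronological ordering, as described in Section 4.6).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Each PR is represented by the file paths it touches, joined into one "document"
# (toy data; the real features come from the HIVE and QT Creator PR histories).
pr_file_paths = [
    "src/libs/utils/fileutils.cpp src/libs/utils/fileutils.h",
    "src/libs/utils/fileutils.cpp",
    "src/plugins/debugger/debuggerengine.cpp",
    "src/plugins/debugger/breakhandler.cpp",
]
reviewers = ["jkobus", "jkobus", "alice", "alice"]  # assigned reviewer per PR

# Split on '/' and whitespace so directories and file names become tokens.
vectorizer = TfidfVectorizer(token_pattern=r"[^\s/]+")
X = vectorizer.fit_transform(pr_file_paths)

# Hyperparameter grid in the spirit of the Decision Tree rows of Table 2.
param_grid = {
    "splitter": ["best", "random"],
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(1, 11)),
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=2)
search.fit(X, reviewers)
print(search.best_params_)
```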
4.3 Performance Measures
The accuracy of the selected CRR techniques was assessed by two widely used measures: Top-k accuracy (namely, Top-1, Top-3, and Top-5) and Mean Reciprocal Rank (MRR). Top-k accuracy computes the ratio of test cases that have the correct label within the top k predictions to all cases. MRR is the average, over all test cases, of the reciprocal of the rank of the first correct answer.
Table 2: List of the Considered Hyperparameters for the Learning-based Models

Model           Hyperparameter            Considered Values
Naïve Bayes     Distribution type         {multinomial, Gaussian, Bernoulli}
5-NN            Distance type             {Manhattan, Euclidean}
                NN search algorithm       {ball tree, KD tree, brute-force search}
                Weight function           {uniform, inverse distance}
Decision Tree   Split strategy            {best, random}
                Measure of split quality  {Gini impurity, entropy}
                Maximum depth             {1, 2, ..., 10}
We computed the relative improvements in the above measures after debiasing to demonstrate the effectiveness of our approach as follows:

    (S_after - S_before) / S_before,    (1)

where S_before is the performance of a CRR technique before we applied debiasing to its training dataset and S_after is the performance after we applied debiasing to the same dataset.
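The following sketch shows how Top-k accuracy, MRR, and the relative improvement of Eq. (1) can be computed from ranked recommendation lists; the rankings and reviewer names are made up for illustration.

```python
def top_k_accuracy(ranked_lists, actual, k):
    """Fraction of test cases whose actual reviewer appears in the top-k list."""
    hits = sum(1 for ranking, truth in zip(ranked_lists, actual) if truth in ranking[:k])
    return hits / len(actual)

def mean_reciprocal_rank(ranked_lists, actual):
    """Average of 1/rank of the actual reviewer (contributes 0 if not ranked at all)."""
    total = 0.0
    for ranking, truth in zip(ranked_lists, actual):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(actual)

def relative_improvement(s_before, s_after):
    """Eq. (1): (S_after - S_before) / S_before."""
    return (s_after - s_before) / s_before

# Toy rankings produced by a CRR technique for three PRs:
ranked = [["alice", "bob", "carol"], ["bob", "alice"], ["carol", "bob", "alice"]]
actual = ["bob", "alice", "alice"]
print(top_k_accuracy(ranked, actual, k=1))   # 0.0
print(mean_reciprocal_rank(ranked, actual))  # (1/2 + 1/2 + 1/3) / 3 = 0.444...
print(relative_improvement(0.30, 0.37))      # 0.233..., i.e. a ~23% relative gain
```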
4.4 Balancing
The selected CRR techniques were trained and tested with the HIVE and QT Creator datasets. Both datasets had quite low unsuccessful PR rates (5% to 7%). However, according to the literature [32], unsuccessful PRs are significantly more pervasive than they appear to be in our datasets. The actual reopened-bug ratios in some popular open-source projects are much higher: for example, in Eclipse it was found to be 16.1% and in OpenOffice as high as 26.31%. A lower-than-actual ratio commonly stems from the practice of opening a new bug for convenience, because searching for the original bug, identifying it, and reopening it may require effort. Developers also may not remember or know about the source bug due to turnover or time lapse, and inadvertently mistake a recurring bug for a brand new one.
The problem with a low unsuccessful PR rate in the dataset due to missed reopened bugs is that debiasing through the removal of the corresponding data points will inevitably yield only marginal improvements in performance. We are interested in assessing how much improvement can be achieved by debiasing if the training dataset's ratio were closer to the actual ratios observed in real practice. Therefore, we under-sampled the data [12] by randomly removing successful PRs until the unsuccessful ratio was in the same ballpark range as the rates reported in the literature: starting from the first successful PR, we randomly removed three out of every four successful PRs. This effectively quadruples the unsuccessful PR ratios, bringing them closer to the more commonly observed values and making the dataset more balanced for training.
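A minimal sketch of this under-sampling step, assuming each PR record carries an is_successful flag (an illustrative field name):

```python
import random

def undersample_successful(prs, keep_fraction=0.25, seed=42):
    """Keep all unsuccessful PRs but only a random fraction (here one in four)
    of the successful ones, raising the unsuccessful ratio roughly fourfold."""
    rng = random.Random(seed)
    return [pr for pr in prs
            if not pr["is_successful"] or rng.random() < keep_fraction]

# With a ~5% unsuccessful ratio, keeping one in four successful PRs pushes the
# ratio to roughly 17-20%, in the ballpark of the reopened-bug rates above.
```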
4.5 The Evaluation Process
Fig. 2 illustrates the evaluation process. The box Original (Og) represents the preprocessed dataset containing successful (S) and unsuccessful (U) PRs. Debiasing removes unsuccessful PRs, resulting in a Debiased (Db) dataset. In the first step, a CRR technique is trained with both the Og and Db datasets, resulting in two models. The performance of the models is compared to assess the effect of debiasing. We expect the performance of the model trained with the debiased dataset to be better, since debiasing attempts to remove samples with bad labels.
Our datasets contain too few data points that exhibit labeling bias, so we expect the improvement to be marginal. In the second step, we want to see how much the effects are amplified when systematic labeling bias is as pervasive as reported in the literature. We balance the PRs to increase the ratio of unsuccessful PRs to realistic levels. This is the Balanced (Ba) dataset at the top left. The Ba dataset is then debiased by the same procedure as before, resulting in the Balanced-Debiased (DbBa) dataset. The results are again compared for the evaluated CRR technique. The improvement in performance should now be more pronounced.
Finally, we perform an extra validation step to check that any observed relative improvement in performance is not due to a random reduction in the sample size, but due to the targeted removal of only badly labeled (unsuccessful) PRs: for this to be true, random removal of data points instead of targeted removal should not improve the performance. We form several datasets by randomly removing data from the successful subset only (favoring this class for the reduction) and from both successful and unsuccessful PRs (not favoring any class). These reductions give rise to the Reduced-Biased (RBi) and Reduced-Unbiased (RUb) datasets shown at the bottom corners of Fig. 2. We compare the performance of the given CRR technique with these datasets to the performance with the DbBa dataset to show that any improvement in performance with debiasing is not merely accidental, but must be because of having deleted the badly labeled samples.
4.6 Testing Strategies for the CRR Techniques
While testing both categories of techniques, we preserved the chronological order of the data, since we should not attempt to predict past instances using future instances. Therefore, we made sure the training samples always preceded the tested samples.
While evaluating the debiasing method on the optimization-based CRR techniques (Profile-based and RSTrace), we incrementally predicted each sample using all samples that preceded it, expanding the models used for prediction one sample at a time.
However, this fine-grained, one-by-one strategy cannot be applied to learning-based approaches: the standard testing methodology is based on using multiple folds. Therefore, for the learning-based techniques (Naïve Bayes, k-NN [5-NN], and Decision Tree), we performed sliding-window testing. We divided our dataset into 10 folds, chose a test fold, and trained our models using all the folds chronologically located before the test fold. In each iteration, we slid the window to the next fold and repeated the same procedure.
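A sketch of this chronological, fold-based evaluation is shown below; despite the "sliding window" name, each test fold is predicted from a model trained on all chronologically earlier folds, as described above. The dummy data and fold count are illustrative.

```python
def chronological_folds(samples, n_folds=10):
    """Yield (train, test) pairs in which fold i is tested with a model trained
    on all folds that precede it. `samples` must be sorted oldest-first."""
    fold_size = len(samples) // n_folds
    folds = [samples[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    for i in range(1, n_folds):
        train = [s for fold in folds[:i] for s in fold]
        yield train, folds[i]

# Example: iterate over ten chronological folds of dummy PR indices.
prs = list(range(1000))  # stand-ins for PRs ordered by creation time
for train, test in chronological_folds(prs):
    pass  # fit a CRR model on `train`, evaluate Top-k / MRR on `test`
```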
4.7 Results: Effect of Debiasing
In order to investigate the effect of removing unsuccessful PRs from the datasets, we evaluated the accuracy of the five CRR techniques on the QT Creator and HIVE datasets before and after debiasing. Notice that after debiasing, the datasets contain only successful PRs. The CRR techniques were trained with the same parameters to make the comparisons valid.
Table 3 summarizes the Top-3 and Top-5 correct classification rates before and after debiasing on both datasets. According to the Top-3 measure, the three learning-based CRR techniques improve more with debiasing than the two optimization-based ones.
Table 3: Performance of CRR Techniques Before and After Debiasing on the Balanced Versions of HIVE and QT Creator

Top-3 Accuracy
                      HIVE                                  QT Creator
Technique        Before Debiasing  After Debiasing    Before Debiasing  After Debiasing
Naïve Bayes      23.52%            27.72%             26.23%            32.60%
5-NN             34.20%            38.99%             42.54%            48.28%
Decision Tree    37.51%            40.74%             40.94%            44.75%
Profile based    39.18%            39.44%             39.74%            40.03%
RSTrace          40.78%            41.12%             42.36%            42.67%

Top-5 Accuracy
                      HIVE                                  QT Creator
Technique        Before Debiasing  After Debiasing    Before Debiasing  After Debiasing
Naïve Bayes      27.75%            29.31%             35.59%            39.53%
5-NN             47.12%            50.20%             52.71%            55.77%
Decision Tree    46.69%            50.71%             53.77%            55.37%
Profile based    50.30%            51.13%             48.80%            49.44%
RSTrace          48.20%            48.34%             50.40%            50.57%

For the Naïve Bayes technique, we observe the highest relative improvement, 17% and 23%, on the HIVE and QT Creator datasets, respectively.
The Top-1 and MRR results are compared in Fig. 5 and Fig. 6, respectively. Higher accuracy rates were obtained for higher k values in Top-k overall. For any k value, we did not observe a considerable improvement in the accuracy of the optimization-based approaches through debiasing, whereas the learning-based approaches showed a clear improvement, especially in Top-3 accuracy.
Unlike the Top-k measures, which focus on best predictions, the MRR measure considers the ranking of each prediction. It is therefore more representative of overall performance. Fig. 6 shows the MRR scores before and after debiasing. In terms of MRR, debiasing yields the best relative improvement on the learning-based techniques. The improvement for the 5-NN technique on HIVE is 25%, and for the Naïve Bayes technique on the QT Creator dataset it is 26%. These improvements are higher than the improvements observed with the Top-k measures.
4.8 Does Debiasing Actually Work or Is It Just Coincidence?
To illustrate that the improvements observed with debiasing are not accidental, we performed an additional validation step:
- Remove PRs randomly from the datasets and evaluate the accuracy of the CRR techniques before and after debiasing. This process compares the dataset version RUb (Reduced-Unbiased) in Fig. 2 to the version DbBa (Balanced-Debiased). The number of samples removed equals the number of unsuccessful PRs for a fair comparison. This is repeated 100 times to create different randomly reduced versions of a dataset to compare with the DbBa version. The performance results from the different random reductions are averaged.
Figure 2: The evaluation process using different versions of the preprocessed datasets

Figure 3: Percentage of unsuccessful PRs detected on each day following a successful merge due to reopening of the underlying bug (HIVE dataset)
- Remove only successful PRs from the datasets (thus introducing a bias against successful PRs) and evaluate the accuracy of the CRR techniques before and after debiasing. This process compares the dataset version RBi (Reduced-Biased) in Fig. 2 to the version DbBa. Again, the random reduction is repeated 100 times and the performance results are averaged.
Table 4 summarizes the results of this step for the MRR measure. The results are as expected. All CRR techniques perform best without unsuccessful PRs (DbBa) and worst with randomly removed successful PRs (RBi). In all cases, the CRR techniques invariably perform worse with randomly removed PRs (RUb) than without unsuccessful PRs (DbBa). We conclude that the improvements observed with debiasing are not accidental, since targeted removal of samples focusing on unsuccessful PRs always gives better results.

Figure 4: Percentage of untouched PR files after closure over time (HIVE dataset)

Figure 5: Top-1 Accuracy before and after debiasing on the balanced versions of the datasets
Figure 6: MRR Scores before and after debiasing on the balanced versions of the datasets

Table 4: MRR for Random vs. Targeted Removal of PR Samples

HIVE
                                                        Naive Bayes  5-NN    Decision Tree  Profile Based  RSTrace
With Unsuccessful PRs (Balanced)                        18.26%       27.32%  30.96%         33.22%         33.17%
Without Unsuccessful PRs (Balanced-Debiased)            20.97%       34.82%  35.89%         34.28%         33.26%
Without Unsuccessful-Successful PRs (Reduced-Unbiased)  18.13%       26.62%  30.75%         33.14%         33.09%
Without Successful PRs (Reduced-Biased)                 17.27%       23.24%  26.83%         32.56%         32.87%

QT Creator
                                                        Naive Bayes  5-NN    Decision Tree  Profile Based  RSTrace
With Unsuccessful PRs (Balanced)                        23.37%       33.76%  33.73%         32.27%         35.58%
Without Unsuccessful PRs (Balanced-Debiased)            29.58%       37.32%  38.35%         33.48%         35.67%
Without Unsuccessful-Successful PRs (Reduced-Unbiased)  22.12%       33.49%  32.23%         32.15%         35.41%
Without Successful PRs (Reduced-Biased)                 18.30%       30.72%  29.41%         31.43%         35.13%

4.9 How Many Unsuccessful PRs Could be Superfluous?
In Section 3.2, we discussed a number of caveats for the PR success measure adopted. We now focus on the second caveat to assess to what extent reopened bugs could be attributed to reasons other than the review/reviewer quality of the original associated PR.
This assessment considers the possibility that the original PR was indeed successful and the reopened bug was a false positive due to changes to the codebase unrelated to the original bug (e.g., changes in the underlying dependencies that make it look like the original bug suddenly resurfaced instead of being reported as a new bug). To do this, we analyze the elapsed time between the Closed and Reopened transitions of each reopened bug in the datasets. If the elapsed time is small, e.g., less than one day, the resurfacing of the bug is unlikely to be related to external circumstances that the PR reviewers could not have detected. We also look at the percentage of the files involved in an unsuccessful PR that were not changed (untouched) in a commit following the closure of the associated bug. If this percentage is high after a certain time has elapsed, the likelihood of those files causing the bug to resurface is low; hence the PR was likely genuinely unsuccessful.
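A sketch of the two indicators described above (reopen delay and the share of untouched PR files); the record layouts are assumptions, and the bug's closure is approximated by the PR's merge time, matching the flow in Fig. 1.

```python
from datetime import timedelta

def reopen_delay_days(pr, bug_history):
    """Days between the PR's merge (approx. the bug's closure) and the first
    subsequent Closed -> Open transition; None if the bug was never reopened.
    `pr` and `bug_history` use the same illustrative layout as before."""
    for when, old, new in bug_history.get(pr["bug_id"], []):
        if old == "Closed" and new == "Open" and when > pr["merged_at"]:
            return (when - pr["merged_at"]).days
    return None

def untouched_ratio(pr, later_commits, horizon_days=19):
    """Share of the PR's files not modified by any commit within `horizon_days`
    after the merge. `later_commits` is a list of (timestamp, files) pairs."""
    cutoff = pr["merged_at"] + timedelta(days=horizon_days)
    touched = set()
    for when, files in later_commits:
        if pr["merged_at"] < when <= cutoff:
            touched.update(files)
    pr_files = set(pr["files"])
    return len(pr_files - touched) / len(pr_files)
```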
Fig. 3 shows that in the HIVE dataset, 45% of the bugs are reopened on the same day as the PR's closure, and 80% of the bugs are reopened within 24 days. Fig. 4 shows that 80% of the files involved in an unsuccessful PR remain untouched 19 days after the closure of the underlying bug. The results are similar on the second dataset. Because most bugs are reopened in the first few days after their closure and most files involved in an unsuccessful PR remain untouched during the initial days after closure, we believe mislabeling of unsuccessful PRs, although possible in rare circumstances, is unlikely to be pervasive enough to compromise the PR success measure.
To validate whether the reopened PRs indeed had poor reviews, we performed a quality analysis on a random 10% sample (40 data points) from the QT Creator dataset. This analysis can be found in a supplement posted to Figshare (https://figshare.com/s/1b9ea55377d9f2c31a7a). Two authors independently inspected the quality of each review for this sample. We found that 32 of the 40 (80%) had in fact poor reviews according to the criteria we used. We categorized poor reviews as Superficial (LGTM/missing comments) (21), Overruled (the author indicated that the reviewer had misjudged the changes) (3), and Poor-Effort (self-admission of a substandard/rushed review) (6).
5 DISCUSSION
5.1 Research Questions
RQ1: How can we eliminate systematic labeling bias in CRR datasets?
Since manual methods are not cost-efficient, can still be error-prone, and do not scale up well, we looked for an automatable method based on an objective success measure that leverages linkages between PR data and bug data. A PR, the underlying code review, and the assigned reviewer's work were deemed successful only if the bug the PR targeted was never reopened following a successful merge and the associated closure of the bug. The debiasing method we propose simply removes unsuccessful samples from the PR data to eliminate possible biases in past reviewer selections. The CRR techniques can then use this debiased data as their ground truth to build their models and improve their performance.
RQ2: How does systematic labeling bias elimination affect the performance of CRR techniques?
We applied the automatic debiasing method to a diverse set of five CRR techniques using two open-source datasets. We observed that, provided the data had the sufficiently high rates of badly labeled samples reported in the literature, the performance of the CRR techniques in general improved after debiasing. The highest improvement was observed with the learning-based CRR techniques. The improvements in the optimization-based CRR techniques tested were marginal.
The reason behind the difference in improvement between optimization-based and learning-based techniques is that the reviewer recommendation process of the optimization-based techniques tested does not depend as heavily on learning from past data as that of the learning-based techniques. Additionally, not every sample may be equally valuable, since the optimization criterion may inadvertently already discount badly labeled samples. Therefore, a debiasing approach focused on removing badly labeled samples may not make much difference for such techniques.
We conclude that the proposed debiasing approach improves the quality of the ground truth and is worthwhile for learning-based CRR techniques.
5.2 Threats to Validity
Our method of identifying a badly labeled PR is subject to a construct threat [24]. It may not be possible to catch all reopened bugs: some may be completely missed, and others may not yet have been reopened and thus are not captured in the dataset. Leaving these false negatives in the dataset would reduce the efficacy of debiasing. Conversely, there may be false positives: PRs identified as unsuccessful due to a reopened bug may have actually been successful. We discussed some possible reasons for such cases in Section 4.9 and, by examining two indicators in the datasets, concluded that these cases should be rare.
Our original datasets had low unsuccessful PR rates, due to the fact that some unsuccessful PRs are likely to be mislabeled as successful because the recurrence of the associated bugs was missed in the bug tracking system (we discussed possible reasons for this). At these low rates, the introduced bias is not significant, and removing it would not yield much benefit. We observed precisely this effect in the original data: improvements in the performance measures were less than 1%, and thus not material. We therefore looked at the reopened-bug rates in the literature, balanced the datasets by randomly removing successful PRs to move the unsuccessful PR ratio to within the reported ranges, and evaluated the debiasing method with these reduced datasets. Because of this adjustment, we must posit that any observed improvements are conditional on a dataset having sufficient systematic labeling bias.
Threats to external validity are concerned with generalizability [24]. We evaluated our approach on two open-source datasets and five different CRR techniques. We believe the CRR techniques used are a reasonable representation of common approaches. However, we acknowledge the limitations of using two datasets. We could not identify further datasets that both contain sufficient samples and integrate PR information with bug tracking.
Open-source projects typically have high turnover rates. Unlike in closed-source projects, many contributors and reviewers become inactive over time, and others join the project. Therefore, our observations may not apply to closed-source projects.
To mitigate internal threats related to the implementation of the CRR techniques and the data extraction methods, we provide the source code of all techniques used in the evaluation, as well as both datasets, on Figshare (https://figshare.com/s/1b9ea55377d9f2c31a7a).
6 CONCLUSION AND FUTURE WORK
Good code reviewer selection is central to effective code reviews. CRR techniques attempt to automate the code reviewer selection problem, but they mostly build and evaluate their models using historical data whose ground truth may be unreliable. Ground truth problems often result from the susceptibility of human decision makers to cognitive biases, such as substituting a convenience attribute for a competence attribute in reviewer assignments. When the code reviews are performed in the context of PRs, and the dataset links PR data with bug tracking data, it is possible to identify PRs that fail to achieve their goal: a PR, and the developer assigned to reviewing it, can be deemed unsuccessful when the bug associated with the PR is reopened later. Our experiments showed that when such failed cases are pervasive enough (in the 20-28% range, consistent with reported rates), removing such data points from the dataset in general improves the performance of CRR techniques. Although the improvement was marginal for the optimization-based CRR techniques tested, it was large for the learning-based CRR techniques (up to 26% for Naïve Bayes).
Our work has implications for both practitioners and researchers. Researchers can apply our proposed debiasing approach while cleaning their training data to improve the accuracy of their CRR models. Recommendation tools built on these models would then inherit these improvements. The debiasing approach could also be useful for flagging potentially ineffective reviews to improve the code review practices in an organization.
Although initial results look promising, we still need to test our approach on other datasets and on datasets collected from commercial systems. We are in the process of looking for suitable candidates. It would make sense for future work to focus on learning-based CRR techniques, since the real returns are to be found in that space. If expanded evaluations prove the debiasing method to be widely effective, we plan to provide tool support for automated debiasing.
REFERENCES
[1]
Aalok Ahluwalia, Davide Falessi, and Massimiliano Di Penta. 2019. Snoring : a
Noise in Defect Prediction Datasets. 2019 IEEE/ACM 16th International Conference
on Mining Software Repositories (MSR) (2019), 63–67. https://doi.org/10.1109/
MSR.2019.00019
[2]
Alberto Bacchelli and Christian Bird. 2013. Expectations , Outcomes , and Chal-
lenges of Modern Code Review. Proceedings of the 2013 International Conference
on Software Engineering (2013), 712–721.
[3]
Alberto Bacchelli and Christian Bird. 2018. Code Reviewing in the Trenches.
IEEE Software 35 (2018), 34–42. https://doi.org/10.1109/MS.2017.265100500
[4]
Vipin Balachandran. 2013. Reducing Human Eort and Improving Quality in Peer
Code Reviews using Automatic Static Analysis and Reviewer Recommendation.
2013 35th International Conference on Software Engineering (ICSE) (2013), 931–940.
https://doi.org/10.1109/ICSE.2013.6606642
[5]
Christian Bird, Adrian Bachmann, Eirik Aune, John Duy, Abraham Bernstein,
Vladimir Filkov, and Premkumar Devanbu. 2009. Fair and Balanced? Bias in
Bug-Fix Datasets. 121–130. https://doi.org/10.1145/1595696.1595716
[6]
Amiangshu Bosu, Michaela Greiler, and Christian Bird. 2015. Characteristics of
Useful Code Reviews : An Empirical Study at Microsoft. Proceedings of the 12th
Working Conference on Mining Software Repositories (2015), 146–156.
[7]
Guillermo F. Cabrera, Christopher J. Miller, and Je Schneider. 2014. Systematic
labeling bias: De-biasing where everyone is wrong. In Proceedings - International
Conference on Pattern Recognition. https://doi.org/10.1109/ICPR.2014.756
[8]
Tse-Hsun Chen, Meiyappan Nagappan, Emad Shihab, and Ahmed E. Hassan.
2014. An Empirical Study of Dormant Bugs. In Proceedings of the 11th Work-
ing Conference on Mining Software Repositories (Hyderabad, India) (MSR 2014).
Association for Computing Machinery, New York, NY, USA, 82–91. https:
//doi.org/10.1145/2597073.2597108
[9]
Emre Doğan, Eray Tüzün, K. Ayberk Tecimer, and H. Altay Güvenir. 2019. Inves-
tigating the Validity of Ground Truth in Code ReviewerRe commendation Studies.
In 2019 ACM/IEEE International Symposium on Empirical Software Engineering
and Measurement (ESEM). 1–6. https://doi.org/10.1109/ESEM.2019.8870190
[10]
M E Fagan. 1976. Design and code inspections to reduce errors in program
development. IBM Systems Journal 15 (1976), 182–211.
[11]
Mikołaj Fejzer, Piotr Przymus, and Krzysztof Stencel. 2018. Prole based recom-
mendation of code reviewers. Journal of Intelligent Information Systems (2018).
https://doi.org/10.1007/s10844-017- 0484-1
[12]
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo Prati, Bartosz
Krawczyk, and Francisco Herrera. 2018. Learning from Imbalanced Data Sets.
https://doi.org/10.1007/978-3- 319-98074- 4
[13]
Google. 2020. Code Review Developer Guide. https://github.com/google/eng-
practices/blob/master/review/index.md
[14]
Kim Herzig and Andreas Zeller. 2013. The Impact of Tangled Code Changes. In
Proceedings of the 10th Working Conference on Mining Software Repositories (MSR
’13). IEEE Press, Piscataway, NJ, USA, 121–130. http://dl.acm.org/citation.cfm?
id=2487085.2487113
[15]
Jing Jiang, Jia-Huan He, and Xue-Yuan Chen. 2015. CoreDevRec: Automatic Core
Member Recommendation for Contribution Evaluation. Journal of Computer
Science and Technology 30, 5 (2015), 998–1016. https://doi.org/10.1007/s11390-
015-1577- 3
[16]
Vladimir Kovalenko, Nava Tintarev, Evgeny Pasynkov, Christian Bird, and Al-
berto Bacchelli. 2018. Does reviewer recommendation help developers? IEEE
Transactions on Software Engineering (2018), 1. https://doi.org/10.1109/TSE.2018.
2868367
[17]
John Boaz Lee, A. Ihara, A. Monden, and K. Matsumoto. 2013. Patch Reviewer
Recommendation in OSS Projects. 2013 20th Asia-Pacic Software Engineering
Conference (APSEC) 2 (2013), 1–6. https://doi.org/10.1109/APSEC.2013.103
[18]
Jakub Lipcak and Bruno Rossi. 2018. A Large-Scale Study on Source Code
Reviewer Recommendation. 44th Euromicro Conference on Software Engineering
and Advanced Applications (SEAA 2018) (2018).
[19]
Rahul Mohanani, Iaah Salman, Burak Turhan, Pilar Rodríguez, and Paul Ralph.
2020. Cognitive Biases in Software Engineering: A Systematic Mapping Study.
IEEE Transactions on Software Engineering 46, 12 (2020), 1318–1339. https:
//doi.org/10.1109/TSE.2018.2877759
[20]
Thanh H D Nguyen, Bram Adams, and Ahmed E Hassan. 2010. A Case Study of
Bias in Bug-Fix Datasets. April 2014 (2010). https://doi.org/10.1109/WCRE.2010.
37
[21]
Ali Ouni, Raula Gaikovina Kula, and Katsuro Inoue. 2016. Search-Based Peer
Reviewers Recommendation in Modern Code Review. 2016 IEEE International
Conference on Software Maintenance and Evolution (ICSME) (2016), 367–377. https:
//doi.org/10.1109/ICSME.2016.65
[22]
Paul Ralph. 2010. Toward a Theory of Debiasing Software Development. In
Lecture Notes in Business Information Processing, Vol. 93. 92–105. https://doi.org/
10.1007/978-3- 642-25676- 9{_}8
[23]
Michael Rath and Patrick Mäder. 2019. The SEOSS 33 dataset — Requirements,
bug reports, code history, and trace links for entire projects. Data in Brief (2019).
https://doi.org/10.1016/j.dib.2019.104005
[24]
Per Runeson and Martin Höst. 2008. Guidelines for conducting and reporting
case study research in software engineering. Empirical Software Engineering 14,
2 (2008), 131. https://doi.org/10.1007/s10664- 008-9102-8
[25]
Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto
Bacchelli. 2018. Modern Code Review : A Case Study at Google. Proceedings of
the 40th International Conference on Software Engineering Software Engineering in
Practice - ICSE-SEIP 18 (2018). https://doi.org/10.1145/3183519.3183525
[26]
Eric Smith and A Terry Bahill. 2009. Attribute Substitution in Systems Engineer-
ing. Systems Engineering 13 (2009), 130–148. https://doi.org/10.1002/sys.20138
[27]
Anders Søgaard, Barbara Plank, and Dirk Hovy. 2014. Selection Bias, Label Bias,
and Bias in Ground Truth. In Proceedings of COLING 2014, the 25th International
Conference on Computational Linguistics: Tutorial Abstracts.
[28]
Webb Stacy and Jean Macmillan. 1995. Cognitive Bias in Software Engineering.
Commun. ACM 38, 6 (1995), 57–63.
[29]
Emre Sülün, Eray Tüzün, and U
ˇ
gur Do
ˇ
grusöz. 2019. Reviewer recommendation
using software artifact traceability graphs. In PROMISE’19: Proceedings of the
Fifteenth International Conference on Predictive Models and Data Analytics in
Software Engineering. 66–75. https://doi.org/10.1145/3345629.3345637
[30] Emre Sülün, Eray Tüzün, and Uğur Doğrusöz. 2021. RSTrace+: Reviewer suggestion using software artifact traceability graphs. Information and Software Technology 130 (2021), 106455. https://doi.org/10.1016/j.infsof.2020.106455
[31] Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula, Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. 2015. Who Should Review My Code? 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER) (2015), 141–150. https://doi.org/10.1109/SANER.2015.7081824
[32] Xin Xia, David Lo, Emad Shihab, Xinyu Wang, and Bo Zhou. 2014. Automatic, high accuracy prediction of reopened bugs. Automated Software Engineering (2014). https://doi.org/10.1007/s10515-014-0162-2
[33] Xin Xia, David Lo, Xinyu Wang, and Xiaohu Yang. 2015. Who Should Review This Change? 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) (2015), 261–270. https://doi.org/10.1109/ICSM.2015.7332472
[34] Zhenglin Xia, Hailong Sun, Jing Jiang, Xu Wang, and Xudong Liu. 2017. A Hybrid Approach to Code Reviewer Recommendation with Collaborative Filtering. 2017 6th International Workshop on Software Mining (SoftwareMining) (2017), 24–31.
[35] Motahareh Bahrami Zanjani, Huzefa Kagdi, and Christian Bird. 2016. Automatically Recommending Peer Reviewers in Modern Code Review. IEEE Transactions on Software Engineering 42, 6 (2016), 530–543. https://doi.org/10.1109/TSE.2015.2500238
[36] H. Alperen Çetin, Emre Doğan, and Eray Tüzün. 2021. A review of code reviewer recommendation studies: Challenges and future directions. Science of Computer Programming 208 (2021), 102652. https://doi.org/10.1016/j.scico.2021.102652
Selecting reviewers for code changes is a critical step for an efficient code review process. Recent studies propose automated reviewer recommendation algorithms to support developers in this task. However, the evaluation of recommendation algorithms, when done apart from their target systems and users (i.e. code review tools and change authors), leaves out important aspects: perception of recommendations, influence of recommendations on human choices, and their effect on user experience. This study is the first to evaluate a reviewer recommender in vivo. We compare historical reviewers and recommendations for over 21,000 code reviews performed with a deployed recommender in a company environment and set to measure the influence of recommendations on users' choices, along with other performance metrics. Having found no evidence of influence, we turn to the users of the recommender. Through interviews and a survey we find that, though perceived as relevant, reviewer recommendations rarely provide additional value for the respondents. We confirm this finding with a larger study at another company. The confirmation of this finding brings up a case for more user-centric approaches to designing and evaluating the recommenders. Finally, we investigate information needs of developers during reviewer selection and discuss promising directions for the next generation of reviewer recommendation tools.