The Impact of Data Merging on the Interpretation of Cross-Project Just-In-Time Defect Models
Dayi Lin, Member, IEEE, Chakkrit (Kla) Tantithamthavorn, Member, IEEE, and Ahmed E. Hassan, Fellow, IEEE
Abstract—Just-In-Time (JIT) defect models are classification models that identify the code commits that are likely to introduce defects. Cross-project JIT models have been introduced to address the suboptimal performance of JIT models when historical data is limited. However, many studies built cross-project JIT models using a pool of mixed data from multiple projects (i.e., data merging), assuming that the properties of defect-introducing commits of a project are similar to those of other projects, which is likely not true. In this paper, we set out to investigate the interpretation of JIT defect models that are built from individual project data and from a pool of mixed project data with and without consideration of project-level variances. Through a case study of 20 datasets of open source projects, we found that (1) the interpretation of JIT models that are built from individual projects varies among projects; and (2) the project-level variances cannot be captured by a JIT model that is trained from a pool of mixed data from multiple projects without considering project-level variances (i.e., a global JIT model). On the other hand, a mixed-effect JIT model that considers project-level variances represents the different interpretations better, without sacrificing performance, especially when the contexts of projects are considered. The results hold for different mixed-effect learning algorithms. When the goal is to derive sound interpretation of cross-project JIT models, we suggest that practitioners and researchers should opt to use a mixed-effect modelling approach that considers individual projects and contexts.
Index Terms—Just-In-Time Defect Prediction, Data Merging, Mixed-Effect Model, Cross-Project Defect Prediction
1 INTRODUCTION
A Just-In-Time (JIT) defect model is a classification model that identifies the code commits that are likely to introduce defects [23, 25, 35, 40, 45, 47]. Such JIT models are critically important for continuous quality assurance practices to early prioritize code commits with the highest defect-proneness for code review and testing, given limited quality assurance resources. In addition, knowledge that is derived from such JIT models is often used to continuously chart quality improvement plans to avoid past pitfalls (i.e., what commit-level metrics are associated with the likelihood of introducing defects?) [15, 24, 31, 42, 50].
Recent work raises concerns that the performance of JIT models is often suboptimal for software projects with limited historical training data [5, 22]. Moreover, such data are also unavailable in the initial software development phases of many projects. To address this challenge, Fukushima et al. [5] show that cross-project JIT models (i.e., models trained using historical data from other projects) are as accurate as JIT models that are trained on a single project (i.e., within-project JIT models). Recently, Kamei et al. [22] show that the performance of JIT models that are built using a pool of mixed project data (i.e., merged data) from several projects is comparable to within-project performance.
• D. Lin is with the Centre for Software Excellence, Huawei, Canada. Email: dayi.lin@huawei.com
• A. E. Hassan is with the School of Computing, Queen's University, Canada. Email: ahmed@cs.queensu.ca
• C. Tantithamthavorn is with the Faculty of Information Technology, Monash University, Australia. Email: chakkrit@monash.edu
Despite the advantages of the data merging practice for building cross-project JIT models, prior work raises concerns that the distribution of metric values often varies across projects [61, 62]. The findings of these studies raise a critical concern that the interpretation of a cross-project JIT model that is built from a pool of mixed project data may not hold true for a JIT model that is built from an individual project. Yet, the impact of the data merging practice on the interpretation of JIT models remains largely unexplored.
In this paper, we set out to investigate the interpretation of three types of cross-project JIT models when compared to the interpretation of local JIT models, i.e., JIT models that are built from an individual project:
1) global JIT models: a cross-project JIT model that is built from a pool of mixed project data assuming that the data is collected from the same project;
2) project-aware JIT models: a cross-project JIT model that is built from a pool of mixed project data with a consideration of the different projects that the data is from; and
3) context-aware JIT models: a cross-project JIT model that is built from a pool of mixed project data with a consideration of the different contextual factors of the different projects that the data is from (e.g., programming language).
To do so, we build project-aware and context-aware JIT models using a mixed-effect modelling approach (i.e., taking both commit-level and project-level factors into consideration when modelling) with two commonly-used classifiers, logistic regression and random forest. We also extensively evaluate the performance of our JIT models using 9 measures, including 4 threshold-dependent measures (i.e., Precision, Recall, F1-score, G-score), one threshold-independent measure (i.e., AUC), and 4 practical measures (i.e., Initial False Alarms, PCI@20%, Popt, and Popt@20%). Through a case study of 20 datasets of open source projects, we address the following six research questions:
(RQ1) Does the interpretation of the local JIT models vary?
Our results show that the most important metric (i.e., the metric that has the highest impact on the likelihood of a commit introducing defects) of the local JIT models and the baseline likelihood of introducing defects (i.e., the likelihood of introducing defects when we remove the impact of all commit-level metrics) vary among projects, suggesting that the interpretation of global JIT models that are trained from merged data may be misleading when project-level variances and the characteristics of the studied projects are not considered.
(RQ2) How consistent is the interpretation of the global JIT model when compared to the local JIT models?
Our results show that the most important metric of the global JIT model is consistent with 55% of the studied projects' local JIT models, suggesting that the interpretation of global JIT models cannot capture the variation of the interpretation across all local JIT models. Moreover, the baseline likelihood of introducing defects of the global JIT model is not consistent with the local JIT models, suggesting that conclusions that are derived from the global JIT model are not generalizable to many of the studied projects.
(RQ3) How accurate are the predictions of the project-aware JIT model and the context-aware JIT model when compared to the global JIT model?
Our results show that project-aware and context-aware JIT models that consider project context achieve comparable performance across nine measured performance metrics with global JIT models that do not consider project context. The comparable performance allows us to further compare the interpretation among global, project-aware, and context-aware JIT models in the following RQs.
(RQ4) How consistent is the interpretation of the project-aware JIT model when compared to the local JIT models and the global JIT model?
Our results show that the project-aware JIT model can provide a more representative interpretation than the global JIT model, while also providing a better fit to the merged dataset from different projects, with a 26% increase in R² compared to the average of local JIT models and an 86% increase in R² compared to a global JIT model.
(RQ5) How consistent is the interpretation of the context-aware JIT model when compared to the local JIT models and the global JIT model?
Our results show that the context-aware JIT model can provide both a more representative interpretation and a better fit to the dataset than the global JIT model, with a 35% increase in R² compared to the average of local JIT models and a 100% increase in R² compared to a global JIT model. The inclusion of the contextual factors in the JIT model when using mixed-effect modelling approaches can yield more in-depth interpretation, while maintaining a good fit to the dataset.
(RQ6) Do our prior conclusions hold for other mixed-effect classifiers?
Similar to the project-aware and context-aware JIT models that are based on mixed-effect regression modelling, our results show that the project-aware and context-aware JIT models that use mixed-effect random forest modelling also achieve similar performance compared to random forest based global JIT models. In addition, both project-aware and context-aware mixed-effect random forest models achieve a better goodness-of-fit compared to the random forest based global JIT model.
Our findings suggest that, irrespective of classifiers, using a mixed-effect modelling approach with either a logistic regression or random forest classifier to build JIT models can provide better interpretation by taking project-level variances into consideration, without sacrificing the performance of JIT models. When the goal is to derive sound interpretation of cross-project JIT models, we suggest that practitioners and researchers should opt to use a mixed-effect modelling approach that considers individual projects and contexts.
Novelty Statements. To the best of our knowledge, this paper is the first to investigate the performance and interpretation of the context-aware and project-aware JIT models using a mixed-effect modelling approach. In particular, this paper makes the following contributions:
1) An investigation of the variation of the interpretation of local JIT models.
2) A comparison of the interpretation of a global JIT model to the interpretation of local JIT models.
3) An investigation of the performance of project-aware and context-aware JIT models with respect to global JIT models.
4) A comparison of the interpretation of a project-aware JIT model to the interpretation of local JIT models.
5) A comparison of the interpretation of a context-aware JIT model to the interpretation of local JIT models.
6) An evaluation of mixed-effect regression models and mixed-effect random forest models for JIT models.
7) An improved implementation of generalized mixed-effect random forest models [29], which improves prediction for projects unseen during training, and improves training speed with a fast implementation of random forests provided by the ranger package in R.
In addition to our conceptual contributions, this paper is also the first to develop and provide a detailed technical description of a mixed-effect modelling approach for cross-project JIT defect models. We also provide a replication package [29] including our datasets and scripts for the community to evaluate the results.
Paper organization. Section 2 introduces the importance of the interpretation of Just-In-Time defect models. Section 3 motivates the impact of data merging on the interpretation of Just-In-Time defect models. Section 4 discusses and motivates our six research questions with respect to prior work. Section 5 presents the design of our case study. Section 6 presents the results of each research question. Section 7 provides practical guidelines for future studies. Section 8 discusses the threats to the validity of our study. Finally, Section 9 concludes the paper.
2 BACKGROUND
2.1 Software Quality Assurance (SQA) Planning
Software Quality Assurance (SQA) planning is the process of developing proactive SQA plans to define the quality requirements of software products and processes [42]. Such proactive SQA plans will be used as guidance to prevent software defects that may slip through to future releases. However, the development of such SQA plans is still ad hoc and based on practitioners' beliefs; e.g., Devanbu et al. [4] found that practitioners form beliefs based on their personal experience, and these can vary across teams and projects. They also may not necessarily align with actual evidence derived from the projects. To cope with continuous software development practices, researchers suggested deriving insights and lessons learned from Just-In-Time (JIT) defect models in order to better understand what factors are the most important to describe the characteristics of defect-introducing commits [23, 25, 35, 40, 45, 47].
2.2 Just-In-Time (JIT) Defect Prediction
A Just-In-Time (JIT) defect model is a classification model that is trained on the characteristics of commits in order to predict if a commit will introduce defects in the future [23, 25, 35, 40, 45, 47] and to explain the characteristics of defect-introducing commits [15, 23, 31]. To date, JIT models have been widely adopted in many software organizations like Avaya [35], Cisco [47], and Blackberry [45]. In summary, Just-In-Time (JIT) defect models serve two main purposes.
To Predict. First, JIT defect models are used to predict the risk of defect-introducing commits early, so developers can prioritize SQA resources on the most risky commits in a cost-effective manner. JIT defect models are trained using numerous factors [23, 40], e.g., the number of added lines, deleted lines, code churn, and entropy. Prior studies found that various machine learning techniques (e.g., random forest and regression models) demonstrated a promising accuracy when predicting defect-introducing commits [5].
To Explain. Second, JIT defect models are used to provide immediate feedback on the most important characteristics of defect-introducing commits, to not only support QA leads and managers in developing Software Quality Assurance (SQA) plans, but also aid developers in pre-commit testing and code reviews [15]. Such feedback or insights are derived from the interpretation of the JIT models through model interpretation techniques (e.g., ANOVA for regression analysis, variable importance analysis for random forest). The goal of model interpretation is to better understand the most important characteristics of defect-introducing commits. Such data-informed insights can help QA leads and managers develop proactive software quality improvement plans to prevent pitfalls that led to defects in the past [31], and aid developers in pre-commit testing and code reviews. For example, if churn (i.e., the amount of changed lines per commit) shares the strongest relationship
Fig. 1. The predictions of JIT defect models are used to predict the risk of defect-introducing commits early, while the interpretation of JIT defect models is used to provide immediate feedback on the most important characteristics of defect-introducing commits to support SQA planning.
with the likelihood of defect-introducing commits, QA leads and managers should establish a software quality improvement plan to strictly control churn (e.g., each commit should not contain churn that exceeds 1,000 lines of code) to control the quality of the software process and mitigate the risk of introducing software defects into code commits.
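As a lightweight illustration of the two purposes above, the sketch below trains a logistic regression JIT model on a handful of commit-level metrics, uses it to rank incoming commits by risk, and inspects the fitted coefficients. The data, metric names, and classifier choice are hypothetical stand-ins (the paper's actual pipeline is in R), not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Hypothetical commit-level metrics: entropy, relative churn, and a fix flag.
n = 1000
X = np.column_stack([
    rng.normal(size=n),          # entropy (centred and scaled)
    rng.normal(size=n),          # relative churn (centred and scaled)
    rng.integers(0, 2, size=n),  # FIX: is the commit a defect fix?
])
# Synthetic labels: entropy drives defect-proneness in this toy setup.
p = 1 / (1 + np.exp(-(1.2 * X[:, 0] + 0.3 * X[:, 2] - 1.0)))
y = (rng.random(n) < p).astype(int)

jit = LogisticRegression().fit(X, y)

# Purpose 1 (predict): rank new commits by estimated risk score.
new_commits = rng.normal(size=(5, 3))
risk = jit.predict_proba(new_commits)[:, 1]
ranked = np.argsort(-risk)  # most risky commit first

# Purpose 2 (explain): inspect which metrics the model leans on.
for name, coef in zip(["entropy", "churn", "fix"], jit.coef_[0]):
    print(f"{name}: {coef:+.2f}")
```

In this toy setup the entropy coefficient dominates, mirroring how model interpretation surfaces the metrics most associated with defect-introducing commits.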
3 A MOTIVATING EXAMPLE
The interpretation of JIT defect models heavily relies on the dataset that was used in training. Traditionally, JIT models are often trained from an individual project (a.k.a. a local JIT model), so the interpretation is specific to the project that is used in training. However, when a company starts a new software project, the company may have limited access to historical data that can be used to train JIT defect models for the new project. Thus, cross-project JIT defect models have been proposed to address this challenge of limited historical data [13, 26, 57, 64].
Prior studies proposed to merge data from different projects (i.e., a pool of mixed project data) to develop universal or cross-project JIT defect models (a.k.a. a global JIT defect model) [5, 22, 58]. The intuition is that a larger, more diverse pool of defect data from several other projects may provide a more robust model fit that will apply better in a cross-project context. However, when datasets are combined from other projects, each project often has different data characteristics (e.g., different data distributions, different project characteristics). Thus, such global JIT defect models that are trained from a mixed-project dataset may produce misleading interpretations when compared to a local JIT defect model.
To illustrate the impact of data merging on the interpretation of JIT defect models, we conduct the following motivating analysis. We start by building three local JIT defect models, where each of them is trained on a single project (i.e., Accumulo, Postgres, Django). We also build a global JIT model that is trained from a mixed-project dataset of 20 projects (see Table 2). These JIT models are built using logistic regression, and the percentage importance scores are computed from an ANOVA Type-II analysis [17]. Finally, we compute the goodness-of-fit of the models using R² and compute a median AUC using an out-of-sample bootstrap
TABLE 1
(A Motivating Example) The percentage importance scores of three local JIT models (i.e., Postgres, Accumulo, and Django) when compared to a global JIT model that is trained from all projects.

Metric    Postgres  Accumulo  Django  Global
Entropy   16%       87%       34%     52%
NS        17%       3%        51%     16%
NF        36%       3%        1%      0%
Churn     1%        0%        0%      0%
LT        3%        5%        7%      0%
FIX       27%       1%        7%      32%
R²        0.31      0.20      0.23    0.14
AUC       0.66      0.78      0.75    0.71
model validation technique [52]. Based on Table 1, we draw the following observations.
While the three local JIT models and the global JIT model achieve a comparable AUC and R², the interpretation of each JIT model is different. Table 1 shows that the most important metric is the number of changed files (NF) for the Postgres project, Entropy for the Accumulo project, and the number of subsystems (NS) for the Django project. This indicates that software quality improvement plans should depend on the project.
On the other hand, the global JIT model that is trained from a pool of mixed project data cannot capture this variation. Table 1 shows that the most important metric is Entropy for the global JIT model, indicating that such insights may not be applicable to all projects.
This motivating example highlights the need for a modern regression alternative that can capture the project variation when aiming to generalize the most important metrics. Below, we discuss how misleading interpretation of JIT defect models impacts practitioners and researchers.
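To make the percentage importance scores of Table 1 concrete, the sketch below approximates a Type-II analysis for a logistic regression by dropping each metric in turn, measuring the likelihood-ratio chi-square, and normalizing to percentages (this mirrors the per-term statistics R's car::Anova reports; the Python code and synthetic data are illustrative, not the authors' scripts).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def type2_importance(df, response, predictors):
    """Percentage importance via drop-one likelihood-ratio chi-square."""
    rhs = " + ".join(predictors)
    full = smf.logit(f"{response} ~ {rhs}", data=df).fit(disp=0)
    chisq = {}
    for p in predictors:
        rest = [q for q in predictors if q != p] or ["1"]
        reduced = smf.logit(f"{response} ~ " + " + ".join(rest),
                            data=df).fit(disp=0)
        # LR statistic for dropping p after adjusting for all other terms.
        chisq[p] = 2 * (full.llf - reduced.llf)
    total = sum(chisq.values())
    return {p: 100 * v / total for p, v in chisq.items()}

# Synthetic commits where entropy carries most of the signal.
rng = np.random.default_rng(1)
n = 3000
df = pd.DataFrame({"entropy": rng.normal(size=n), "churn": rng.normal(size=n)})
logit = 1.5 * df["entropy"] + 0.2 * df["churn"] - 1.0
df["buggy"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

imp = type2_importance(df, "buggy", ["entropy", "churn"])
print(imp)
```

By construction, entropy receives the bulk of the importance here, analogous to the Accumulo column of Table 1.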
Importance for Practitioners. The interpretation of JIT defect models plays a critical role to not only support QA leads and managers in developing quality improvement plans, but also aid developers in code reviews and pre-commit testing [15, 16, 42]. Such interpretation could provide immediate feedback on pitfalls that led to software defects in the past, so practitioners can avoid them in the future. Unfortunately, the interpretation derived from a global JIT defect model, which is trained from a pool of mixed project data, cannot capture the variation from project to project, producing misleading insights. Such misleading insights could lead to suboptimal software development policies, wasting time and resources when adopted in practice.
Importance for Researchers. Recently, researchers aim to draw generalized conclusions from large-scale studies via the combination of datasets from multiple projects [5, 22]. However, such generalizations may not hold true for each project, posing a critical threat to the external validity of prior studies. Therefore, a modern regression alternative (i.e., a mixed-effect modelling approach) is needed to capture the project variation and context variation. However, there exists no study investigating whether a mixed-effect modelling approach can capture the project variation and context variation for JIT defect models.
4 RESEARCH QUESTIONS
Prior work shows that data merging from multiple projects without considering project context tends to perform well for cross-project JIT models [5, 22]. However, recent work raised concerns that the distribution of software metrics often varies among project contexts (e.g., domain, size, and programming language) [61–63]. Such variation of distributions likely leads JIT models to produce different metrics that influence defect-introducing commits among projects. Yet, little is known whether the metrics that influence defect-introducing commits vary among projects. Thus, we formulate the following research question:
RQ1: Does the interpretation of the local JIT models vary?
Recently, Kamei et al. [22] suggested that data merging from multiple projects without considering project contexts tends to perform well for cross-project JIT models, suggesting that a simple data merging technique (i.e., a pool of mixed project data) would likely suffice for cross-project JIT models. Yet, little is known whether the interpretation of a global JIT model is consistent with local JIT models. Thus, we formulate the following research question:
RQ2: How consistent is the interpretation of the global JIT model when compared to the local JIT models?
The practice of data merging without considering project-level variances has been widely used in many studies on cross-project JIT models [5, 22] and cross-project defect models [11–13, 57]. Such practice assumes that historical commits are collected from similar projects, which is likely not true. While prior studies reinforce the consideration of project-level variances for cross-project modelling [2, 33, 61, 62], Kamei et al. [22] argued that project-aware rank transformation does not work well for cross-project JIT models. In addition, Herbold et al. [12] also argued that project-aware data partitioning only yields a minor improvement for cross-project defect models.
Even though prior work has reinforced that project-level variances must be considered, little research has paid attention to a modern regression alternative, i.e., the mixed-effect modelling approach [1], especially in the context of JIT defect models. In addition, little is known whether the performance of project-aware JIT models (i.e., a JIT model that is trained on mixed project data while considering project-level variances) and context-aware JIT models (i.e., a JIT model that is trained on mixed project data while considering project characteristics) that use mixed-effect modelling approaches is comparable to global JIT models. Thus, we formulate the following research question:
RQ3: How accurate are the predictions of the project-aware JIT model and the context-aware JIT model when compared to the global JIT model?
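A project-aware model of this kind can be sketched as a logistic regression with a random intercept per project. Below is a minimal, hypothetical illustration using statsmodels' Bayesian mixed GLM; the paper's own models are fit in R, and the merged dataset, variable names, and variational-Bayes fitting here are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(7)

# Synthetic merged dataset: 6 projects with different baseline defect rates,
# emulating project-level variances in a pool of mixed project data.
n_per, projects = 150, [f"p{i}" for i in range(6)]
project_intercepts = rng.normal(scale=1.0, size=6)
rows = []
for j, proj in enumerate(projects):
    entropy = rng.normal(size=n_per)
    logit = project_intercepts[j] + 1.0 * entropy - 0.5
    buggy = (rng.random(n_per) < 1 / (1 + np.exp(-logit))).astype(int)
    rows.append(pd.DataFrame({"project": proj, "entropy": entropy,
                              "buggy": buggy}))
df = pd.concat(rows, ignore_index=True)

# Fixed effect: entropy; random intercept per project (the "project-aware"
# part that a global model fit on the pooled data would ignore).
model = BinomialBayesMixedGLM.from_formula(
    "buggy ~ entropy", {"project": "0 + C(project)"}, df)
result = model.fit_vb()  # variational Bayes fit

# Posterior means of the fixed effects (intercept, entropy) and of the
# per-project random intercepts.
print(result.fe_mean)
print(result.vc_mean)
```

The per-project intercepts in `result.vc_mean` play the role of the "baseline likelihood of introducing defects" that, per RQ1 and RQ2, varies among projects.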
Recently, Hassan et al. [10] pointed out that the mixed-effect modelling approach is able to capture the variation of the interpretation of models among different datasets in the context of mobile app reviews. Yet, little is known whether project-aware JIT models might produce more representative metrics that influence defect-introducing commits when compared to the global JIT model and the local JIT models. Thus, we formulate the following research question:
RQ4: How consistent is the interpretation of the project-aware JIT model when compared to the local JIT models and the global JIT model?
One limitation of the project-aware JIT model is that we cannot interpret the impact of project contexts (e.g., domain, size, and programming language) on the defect-proneness of the project. Prior work has shown that the distribution of software metrics often varies among project contexts [63]. Yet, it is unknown whether a context-aware JIT model might produce a more representative interpretation when compared to the global JIT model and the local JIT models. Thus, we formulate the following research question:
RQ5: How consistent is the interpretation of the context-aware JIT model when compared to the local JIT models and the global JIT model?
Many prior studies have explored different classifiers for defect modelling, such as logistic regression [23] and random forest [22]. In RQ1 to RQ5, we focus on the logistic regression classifier. To better understand if the conclusions from RQ1 to RQ5 hold for other classifiers like random forest, we formulate the following research question:
RQ6: Do our prior conclusions hold for other mixed-effect classifiers?
5 CASE STUDY DESIGN
In this section, we describe our selection criteria for the studied software projects, and the design of our case study to address the six research questions. Figure 2 presents an overview of our case study design.
5.1 Collecting Data
5.1.1 Studied Software Projects
In order to address our research questions, we defined two important criteria for selecting studied software projects:
1) Criterion 1 - Publicly-available datasets: To foster replications of our study, we select studied software projects that are hosted in a publicly-available data repository (i.e., GitHub).
2) Criterion 2 - Large and long-term development: To ensure the quality of our studied projects and avoid including any small projects on GitHub, we selected studied software projects that are large and have been developed for a long period of time.
We randomly selected 20 open source projects that meet these criteria from GitHub for our study. Table 2 gives an overview of the studied projects. We collected the commits of each project from GitHub on February 14th, 2018. We used CommitGuru [44] to extract commit-level metrics (i.e., metrics that influence the likelihood of a commit introducing defects) and identified defect-introducing commits using the SZZ algorithm [46] for each project.
TABLE 2
Summary of studied software projects. Parenthesized values show the percentage of defect-introducing commits.

Project name      Date of first commit  Lines of code  # of changes
accumulo          Oct 4, 2011           600,191        9,175 (21%)
angular           Jan 5, 2010           249,520        8,720 (25%)
brackets          Dec 7, 2011           379,446        17,624 (24%)
bugzilla          Aug 26, 1998          78,448         9,795 (37%)
camel             Mar 19, 2007          1,310,869      31,369 (21%)
cinder            May 3, 2012           434,324        14,855 (23%)
django            Jul 13, 2005          468,100        25,453 (42%)
fastjson          Jul 31, 2011          169,410        2,684 (26%)
gephi             Mar 2, 2009           129,259        4,599 (37%)
hibernate-orm     Jun 29, 2007          711,086        8,429 (32%)
hibernate-search  Aug 15, 2007          174,475        6,022 (35%)
imglib2           Nov 2, 2009           45,935         4,891 (29%)
jetty             Mar 16, 2009          519,265        15,197 (29%)
kylin             May 13, 2014          214,983        7,112 (25%)
log4j             Nov 16, 2000          37,419         3,275 (46%)
nova              May 27, 2010          430,404        49,913 (26%)
osquery           Jul 30, 2014          91,133         4,190 (23%)
postgres          Jul 9, 1996           1,277,645      44,276 (33%)
tomcat            Mar 27, 2006          400,869        19,213 (28%)
wordpress         Apr 1, 2003           390,034        37,937 (47%)
TABLE 3
Summary of commit-level metrics

Category   Name     Description
Diffusion  NS       Number of modified subsystems
           ND       Number of modified directories
           NF       Number of modified files
           Entropy  Distribution of modified code across each file
Size       LA       Lines of code added
           LD       Lines of code deleted
           LT       Lines of code in a file before the commit
Purpose    FIX      Whether or not the commit is a defect fix
5.1.2 Collecting Commit-level Metrics
Prior studies proposed many commit-level metrics that are associated with the likelihood of introducing defects [23, 25, 35, 45, 47]. Similar to Kamei et al. [22], we used CommitGuru [44] to collect eight metrics that span 3 categories. Table 3 provides a brief description of the commit-level metrics.
The Diffusion category measures how distributed a commit is. A highly distributed commit is more complex and more prone to defects, as shown in prior work [9, 35]. We collected the number of modified subsystems (NS), the number of modified directories (ND), the number of modified files (NF), and the distribution of modified code across each file (Entropy), to measure the diffusion of a commit. Similar to Hassan [9], we normalized the entropy by the maximum entropy log2(n) to take the differences in the number of files n across changes into account.
The Size category measures the size of a commit using the lines added (LA), lines deleted (LD), and lines total (LT). The intuition is that the size of a commit is a strong indicator of the commit's defect-proneness [36, 37].
The Purpose category measures whether a commit fixes a defect. The intuition is that a commit that fixes a defect is more likely to introduce another defect [6, 41].
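The entropy normalization described above can be sketched as follows; this is a minimal illustration of H / log2(n) over the modified-lines distribution of a commit, not CommitGuru's actual implementation.

```python
import math

def normalized_entropy(lines_per_file):
    """Shannon entropy of a commit's modified-lines distribution,
    normalized by the maximum entropy log2(n) for n touched files."""
    total = sum(lines_per_file)
    n = len(lines_per_file)
    if n <= 1 or total == 0:
        return 0.0
    h = -sum((c / total) * math.log2(c / total)
             for c in lines_per_file if c > 0)
    return h / math.log2(n)

# A commit spreading changes evenly over 4 files is maximally distributed...
print(normalized_entropy([10, 10, 10, 10]))  # 1.0
# ...while one dominated by a single file has entropy close to 0.
print(normalized_entropy([100, 1]))
```

The normalization keeps a 2-file commit and a 40-file commit on the same [0, 1] scale, which is why the differences in n across changes no longer distort the metric.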
Fig. 2. An overview of our case study design: collecting data (commit-level and project-level metrics from the studied GitHub projects via CommitGuru), preprocessing data (mitigating correlated metrics, data scaling), constructing JIT models (local, global, project-aware, and context-aware), and analyzing JIT models (evaluating the performance, evaluating the goodness-of-fit, and identifying the most important metric) across RQ1–RQ6.
5.1.3 Collecting Project-Level Metrics
To investigate the impact of project-level variances on the interpretation of JIT models, we collected 9 project-level metrics, which were used in prior work [61, 63, 64]. We briefly outline the project-level metrics below. Among the 9 project-level metrics, 6 of them can be extracted from the version control systems (i.e., Language, NLanguage, LOC, NFILE, NCOMMIT, NDEV), and 3 of them require manual tagging (Audience, UI, Database). For each numeric project-level metric (NLanguage, LOC, NFILE, NCOMMIT, NDEV), we separated the values into four groups based on the first, second, and third quartiles (i.e., least, less, more, most), as suggested by prior work [22].
Language (Java / JavaScript / Perl / Python / PHP / C / C++): The programming language that is used the most in the project. We identified the programming language using cloc, and chose the language that is used in the largest number of files as the programming language of the project.
NLanguage (Least / Less / More / Most): The total number of programming languages that are used in the project. We consider a programming language to be used in a project if more than 10% of the files are written in that language.
LOC (Least / Less / More / Most): The total lines of code of the source code in the project.
NFILE (Least / Less / More / Most): The total number of files in the project.
NCOMMIT (Least / Less / More / Most): The total number of commits in the VCS of the project.
NDEV (Least / Less / More / Most): The total number of unique developers in the system.
Audience (Developer / User): Whether the intended audience of the project is end users (e.g., wordpress) or development professionals (e.g., log4j).
UI (Toolkit / GUI / Non-interactive): The type of user interaction of the project; e.g., imglib2 is a toolkit, gephi has a GUI, and cinder is non-interactive.
Database (True / False): Whether the project stores its data in a database or not.
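The quartile-based grouping of the numeric project-level metrics can be sketched as follows. The LOC values come from Table 2; using pandas' qcut is our assumption for illustration, not necessarily how the original R scripts performed the split.

```python
import pandas as pd

# LOC values for a few of the studied projects (from Table 2).
loc = pd.Series(
    {"accumulo": 600191, "angular": 249520, "brackets": 379446,
     "bugzilla": 78448, "camel": 1310869, "cinder": 434324,
     "django": 468100, "fastjson": 169410},
    name="LOC",
)

# Split into four groups at the first, second, and third quartiles,
# labelled least / less / more / most as in the paper.
groups = pd.qcut(loc, q=4, labels=["least", "less", "more", "most"])
print(groups.to_dict())
```

Turning skewed counts such as LOC into four ordered categories lets the context-aware model treat project size as a factor rather than a raw magnitude.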
5.2 Preprocessing Data
5.2.1 Data Scaling
We observed that most commitlevel metrics are highly
skewed, and in different scales. To address this issue,
we centred and scaled the commitlevel metrics using the
scale function in R, except for the “FIX” metric which is
boolean.
5.2.2 Mitigating Correlated Metrics
Highly correlated metrics will produce incorrect interpreta
tion of JIT models [17, 18, 48]. We used the Spearman cor
relation to mitigate collinearity (i.e., the correlation between
2 metrics), and the redundancy analysis to mitigate multi
collinearity (i.e., the correlation across more than 2 metrics,
in other words, the ability of a metric being described by 2
or more other metrics).
We calculated the Spearman correlation among all the studied commit-level metrics, and manually removed the highly correlated metrics with a Spearman correlation coefficient greater than 0.7. We chose 0.7 as the threshold because it has been widely used in defect prediction research [15, 17, 19, 20, 31, 32, 48, 55]. In addition, Jiarpakdee et al. [15, 19] found that varying the Spearman correlation threshold between 0.5 and 0.7 has little impact on the conclusions. We avoid using other thresholds so that the conclusions of our paper are strictly controlled and not biased toward a particular choice of threshold. We chose the Spearman correlation because it is resilient to data that are not normally distributed. Figure 3 shows the hierarchically clustered Spearman ρ values of the commit-level metrics, computed from the merged data across all the studied projects.
Figure 3 shows that ND and NF are highly correlated. We removed ND and kept NF for our study, as suggested by prior work [23]. In addition, LA and LD are highly correlated. We replaced LA and LD with a relative churn metric (i.e., (LA + LD)/LT), as suggested by prior work [22, 23, 37].
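The correlation screening above can be sketched in Python; Spearman's ρ is the Pearson correlation of the rank-transformed values, and the commit counts below are illustrative, not taken from the studied datasets:

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative ND (directories) and NF (files) values for five commits; the
# two metrics co-vary strongly, so the pair exceeds the 0.7 threshold.
nd = [1, 2, 2, 4, 6]
nf = [2, 3, 4, 9, 12]
rho = spearman(nd, nf)

# LA and LD are replaced by a relative churn metric, e.g. for one commit:
la, ld, lt = 30, 10, 2000
relative_churn = (la + ld) / lt  # (LA + LD) / LT
```

When ρ for a pair exceeds 0.7, one of the two metrics is dropped (here, ND).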
Redundant metrics (i.e., metrics that do not carry a signal that is unique from the other metrics) will interfere with each other and produce a misleading interpretation of models. We used the redun function in R to detect redundant metrics. We found that, after the correlation analysis, no redundant metrics remain.
[Figure 3: Hierarchical overview of the correlation among the commit-level metrics (la, ld, ns, entropy, nd, nf, fix, lt), shown as hierarchically clustered Spearman ρ values. The dotted line shows the threshold (ρ = 0.7).]
5.2.3 Class Imbalance
We observed that the class labels of our JIT datasets are imbalanced, i.e., only a small proportion of all commits introduces defects. Prior work in just-in-time cross-project defect prediction used class rebalancing techniques to improve model performance [22]. However, Tantithamthavorn et al. [48, 49] have shown that balancing data tends to shift the ranking of the most important metrics. To avoid introducing any possible bias into the interpretation of our JIT models, we did not apply data rebalancing techniques in our study.
5.3 Constructing JIT models
To address our six research questions, we constructed four types of JIT models, i.e., local JIT models, global JIT models, project-aware JIT models, and context-aware JIT models, using the logistic regression and random forest classifiers.
5.3.1 Local JIT Models
A local JIT model is a model that is built from an individual project. We construct local JIT models using logistic regression for each of the 20 studied projects, where we used commit-level metrics as independent variables and whether a commit introduces defects as the dependent variable.
In a classic logistic regression model, the relationship between commit-level metrics (independent variables xi) and the likelihood of the commit introducing defects (dependent variable y) can be described as:

Ln(y / (1 − y)) = β0 + βixi + ε    (1)

where ε is the standard error. The coefficient βi indicates the relationship between the ith commit-level metric and the likelihood of the commit introducing defects, and the intercept β0 indicates the baseline likelihood of introducing defects of a project.
We used the implementation of logistic regression from
the glm function that is provided by the stats R package.
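The log-odds formulation in Formula 1 means that a fitted model turns a commit's metrics into a defect probability through the logistic (sigmoid) function. A Python sketch with hypothetical coefficient values (not taken from the paper's fitted models):

```python
import math

def predict_defect_probability(intercept, coefs, metrics):
    """Invert Ln(y/(1-y)) = b0 + sum(bi * xi): the logistic (sigmoid) function."""
    log_odds = intercept + sum(b * x for b, x in zip(coefs, metrics))
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical fitted values, for illustration only:
intercept = -1.5          # baseline log-odds of introducing a defect
coefs = [0.8, 0.3]        # e.g. coefficients for scaled Entropy and NS
low_risk = predict_defect_probability(intercept, coefs, [-1.0, 0.0])
high_risk = predict_defect_probability(intercept, coefs, [2.0, 1.0])
```

A positive coefficient increases the log-odds, so larger values of that metric make a commit more likely to be classified as defect-introducing.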
5.3.2 Global JIT Models
A global JIT model is trained using a pool of mixed project data, treating the data as if it were collected from a single project. We first combined the data from all the studied projects into a training dataset. We constructed a global JIT model using logistic regression (see Section 5.3.1) with the merged dataset, using the same dependent and independent variables and the same R package as the local JIT models. Since the goal of our paper focuses on the interpretation of cross-project JIT models, we trained our model using the full dataset.
As the global JIT model is trained using a merged dataset with data from all the studied projects, but the coefficient βi and the intercept β0 in the model are fixed, commits from different projects have the same relationship with the dependent variable y (the likelihood of a commit introducing defects). Such models are called fixed-effect models, and have been used in prior studies of JIT models [5, 22].
5.3.3 Project-Aware JIT Models
A fixed-effect JIT model with a merged dataset assumes that the relationship between commit-level metrics and the likelihood of a commit introducing defects, and the baseline likelihood of introducing defects of a project, are the same across different projects, which is likely not true. To build a project-aware JIT model that considers the differences among projects, we constructed a mixed-effect logistic regression model [1] for our study.
Unlike classic logistic regression models, a mixed-effect logistic regression model contains both fixed effects (independent variables at the commit level) and random effects (independent variables at the project level), and is therefore able to represent different relationships between independent variables and dependent variables at different hierarchical levels (i.e., different projects).
There are two types of mixed-effect models: (1) random intercept models, and (2) random slope and intercept models. Random intercept models have different intercepts for independent variables at the project level, but fixed coefficients for independent variables at the commit level. On the other hand, random slope and intercept models allow different intercepts for independent variables at the project level, and different coefficients for independent variables at the commit level. As we suspect that commit-level metrics from different projects have different relationships with the likelihood of a commit introducing defects, we constructed a random slope and intercept model for the project-aware JIT model. A random slope and intercept model takes the following form:
Ln(y / (1 − y)) = β0 + βixi + uj0 + ujkzk + ε    (2)

where y, xi, βi, and ε have the same definitions as in Formula 1; uj0 is the jth random intercept; zk is the kth random effect; and ujk is the jth random slope for the kth random effect.
In particular, we use a unique identifier of a project (i.e., the project name) as the random intercept, and use Entropy as the random slope against project in our model, i.e., uj0 is the random intercept for the jth project; z1 is Entropy and uj1 is the random slope for the Entropy of the jth project. We use the project name as the random intercept so that the project-aware JIT model can give each project a different intercept, similar to the local JIT models (Section 5.3.1), instead of treating data from different projects the same and only computing one general intercept (β0), as done when training a global JIT model (Section 5.3.2). We use Entropy as the random slope, since Table 4 shows that Entropy is the most important metric for 11 out of the 20 studied projects. We do not include other metrics as random slopes, since excessive use of random slopes increases the model's
number of degrees of freedom and therefore increases the risk of overfitting. The intuitive interpretation is that we let different projects have different baseline likelihoods of introducing defects (β0 + uj0), and allow Entropy to have a different relationship (uj1) with the likelihood of a commit introducing defects for each project. We use the rest of the commit-level metrics (NS, NF, LT, FIX and relative churn) as fixed effects (xi). We built the project-aware JIT models using the implementation of the mixed-effect regression model provided by the glmer function in the R package lme4.
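Conceptually, the fitted random slope and intercept model combines shared fixed effects with per-project offsets. A minimal Python sketch of the linear predictor in Formula 2, with hypothetical fitted values (the actual fitting is done by glmer):

```python
import math

# Shared fixed effects (hypothetical values, for illustration only).
BETA_0 = -1.2                          # global intercept
FIXED = {"ns": 0.3, "fix": 0.5}        # coefficients for fixed-effect metrics

# Per-project random effects u_j0 (intercept) and u_j1 (Entropy slope),
# as a fitted mixed-effect model would estimate them (illustrative values).
RANDOM = {
    "postgres": {"intercept": 0.35, "entropy_slope": 0.4},
    "wordpress": {"intercept": 0.80, "entropy_slope": -0.1},
}

def log_odds(project, entropy, metrics):
    """Linear predictor of the random slope and intercept model (Formula 2)."""
    u = RANDOM[project]
    lp = BETA_0 + u["intercept"] + u["entropy_slope"] * entropy
    lp += sum(FIXED[name] * value for name, value in metrics.items())
    return lp

def probability(project, entropy, metrics):
    return 1 / (1 + math.exp(-log_odds(project, entropy, metrics)))

# The same commit receives different risk estimates under different projects,
# because each project has its own intercept and Entropy slope.
commit = {"ns": 1.0, "fix": 1.0}
p1 = probability("postgres", 0.5, commit)
p2 = probability("wordpress", 0.5, commit)
```

This is exactly the project-level variance that a fixed-effect (global) model collapses into a single β0 and βi.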
5.3.4 Context-Aware JIT Models
In contrast to the project-aware JIT model, a context-aware JIT model further differentiates the various contextual factors of different projects, instead of simply giving projects their own unique intercepts. To construct a context-aware JIT model, we built a mixed-effect model (more specifically, a random slope and intercept model). In the context-aware JIT model, we introduced the 9 contextual factors (the 9 project-level metrics that are described in Section 5) as random intercepts (i.e., each level of each contextual factor has its unique intercept). All of the contextual factors are at the project level instead of the commit level; hence, they impact the project's baseline likelihood of introducing defects (i.e., in the form of an intercept instead of a slope). As a result, the intercept for the jth project (uj0) in the context-aware JIT model can be considered conceptually as the sum of the intercepts of the project's contextual factors:
uj0 = Σ(m=1 to M) vjm0    (3)
where M is the total number of contextual factors (i.e., 9), and vjm0 is the coefficient of the random intercept for the jth project's mth contextual factor, obtained through the fitting of the context-aware JIT model. We calculated uj0 conceptually as the sum of all coefficients of the random intercepts of a specific project, so that we can compare it to the local JIT models and the project-aware JIT model, where each project has only one intercept value.
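Formula 3 is a plain sum over the project's nine contextual factors; a sketch with hypothetical fitted coefficients (not the paper's values):

```python
# Hypothetical random-intercept coefficients v_jm0 for one project's nine
# contextual factors (illustrative values only).
contextual_intercepts = {
    "Language": 0.21, "NLanguage": -0.05, "LOC": 0.10,
    "NFILE": 0.02, "NCOMMIT": -0.12, "NDEV": 0.08,
    "Audience": -0.30, "UI": 0.15, "Database": 0.04,
}

# u_j0 is conceptually the sum of the project's contextual intercepts, which
# makes it comparable to the single intercept of a local JIT model.
u_j0 = sum(contextual_intercepts.values())
```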
We used the same random slope and fixed effects in the context-aware JIT model as in the project-aware JIT model. Again, we built the context-aware JIT models using the implementation of the mixed-effect regression model provided by the glmer function in the R package lme4.
5.4 Analyzing JIT models
5.4.1 Evaluating the performance of JIT models
A defect-introducing commit can be classified by a JIT model as defect-introducing (true positive, TP) or non-defect-introducing (false negative, FN); while a non-defect-introducing commit can be classified by a JIT model as defect-introducing (false positive, FP) or non-defect-introducing (true negative, TN).
To measure the performance of the constructed JIT models, we employed nine performance metrics that are commonly used in prior work [14, 22, 23], defined as follows:
Precision measures the ratio of correctly predicted defect-introducing commits to all commits that are predicted as defect-introducing (Precision = TP / (TP + FP)).
Recall measures the ratio of correctly predicted defect-introducing commits to all defect-introducing commits (Recall = TP / (TP + FN)).
F1-score measures the harmonic mean of recall and precision. There exists a trade-off between precision and recall. F1-score is a commonly used metric to combine precision and recall (F1-score = (2 × Precision × Recall) / (Precision + Recall)).
G-score is an alternative metric to F1-score that avoids the potential negative impact of imbalanced classes on the F1-score [27]. G-score is the harmonic mean between the probability of a true positive and the probability of a true negative (G-score = (2 × TPR × TNR) / (TPR + TNR), where TPR = TP / (TP + FN) and TNR = TN / (TN + FP)).
AUC measures the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. AUC is used to mitigate the potential bias of the choice of the probability threshold for precision and recall, and to mitigate the class imbalance that is commonly present in our datasets [49]. As suggested by Mandrekar [30], an AUC value of 0.5 indicates that a model performs equally to random guessing (no discrimination), 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and more than 0.9 is considered outstanding.
IFA measures the number of Initial False Alarms (IFA) that developers would encounter before they find the first defective commit. In contrast to the above-mentioned measures, IFA takes into consideration the human aspect of quality assurance. A low IFA may increase developers' trust in the JIT model.
PCI@20% measures the Proportion of Commits Inspected when 20% of the LOC modified by all commits are inspected. PCI@20% focuses on the effort that developers spend when inspecting the defect-introducing commits suggested by the JIT model.
Popt: An optimal JIT model ranks defect-introducing commits by decreasing actual bug density. When plotting the percentage of defect-introducing commits against the percentage of effort for both the optimal JIT model and the JIT model that is being evaluated, we can calculate the area between the two models' curves, ∆opt. Popt is calculated as 1 − ∆opt. Hence, a larger Popt means a smaller difference between the optimal JIT model and the JIT model being evaluated.
Popt@20%: Similar to the Popt measure, Popt@20% measures Popt before the cut-off of 20% of effort.
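The threshold-based measures above follow directly from the confusion-matrix counts. A Python sketch with an illustrative confusion matrix; the G-score here follows the harmonic-mean-of-rates reading of the definition above:

```python
def classification_metrics(tp, fp, fn, tn):
    """Threshold-based JIT performance measures from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    tnr = tn / (tn + fp)                         # true negative rate
    g_score = 2 * recall * tnr / (recall + tnr)  # harmonic mean of the rates
    return {"precision": precision, "recall": recall, "f1": f1, "g_score": g_score}

# Illustrative confusion matrix for a JIT model on an imbalanced dataset.
m = classification_metrics(tp=30, fp=20, fn=10, tn=140)
```

On such imbalanced data, the G-score stays informative even when the positive (defect-introducing) class is small.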
5.4.2 Evaluating the goodness-of-fit of JIT models
To measure how well the constructed models fit the data, we calculated the conditional coefficient of determination for generalized logistic regression models and mixed-effect models (R², or R²GLMM) [21, 38]:

R²GLMM = (σ²f + Σ(l=1 to u) σ²l) / (σ²f + Σ(l=1 to u) σ²l + σ²ε + σ²d)

where σ²f is the variance of the fixed effects, Σσ²l is the sum of all u variance components, σ²ε is the variance due to the additive dispersion, and σ²d is the distribution-specific variance. We used the implementation of the R²GLMM function that is provided by the MuMIn R package.
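The conditional R²GLMM is simply a ratio of variance components; a sketch with hypothetical variances (for a logit link, the distribution-specific variance is π²/3, as in the Nakagawa-style decomposition used by MuMIn):

```python
import math

def conditional_r2(var_fixed, var_random_components, var_additive, var_dist):
    """Conditional R2_GLMM: variance explained by the fixed and random
    effects, divided by the total variance."""
    explained = var_fixed + sum(var_random_components)
    total = explained + var_additive + var_dist
    return explained / total

# Hypothetical variance components (not the paper's fitted values).
r2 = conditional_r2(var_fixed=0.9,
                    var_random_components=[0.65, 0.34],  # intercept, slope
                    var_additive=0.0,
                    var_dist=math.pi ** 2 / 3)           # logit-link variance
```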
TABLE 4
The model statistics of the local JIT models for each studied project. For each project, the metric with the highest percentage of χ² is the most important metric. We also report the goodness-of-fit of the local JIT models using R² and the predictive accuracy using AUC.

| Project | Intercept (Coef.) | Entropy | NS | NF | Relative churn | LT | FIX | R² | AUC |
| cinder | −1.95 | ↑40*** | ↑48*** | ↑0◦ | ↑0◦ | ↑0** | ↑12*** | 0.43 | 0.86 |
| nova | −1.6 | ↑38*** | ↑51*** | ↓0◦ | ↓0◦ | ↑2*** | ↑9*** | 0.38 | 0.84 |
| postgres | −0.85 | ↑16*** | ↑17*** | ↑36*** | ↑1*** | ↓3*** | ↑27*** | 0.31 | 0.65 |
| angular | −1.53 | ↑18*** | ↑64*** | ↑0◦ | ↑1* | ↑0◦ | ↑17*** | 0.30 | 0.80 |
| osquery | −1.61 | ↑41*** | ↑43*** | ↑1◦ | ↑0◦ | ↑11*** | ↑4*** | 0.30 | 0.82 |
| brackets | −1.52 | ↑13*** | ↑85*** | ↓0◦ | ↑0◦ | ↑0◦ | ↑2*** | 0.29 | 0.79 |
| camel | −1.62 | ↑73*** | ↑0◦ | ↑0◦ | ↓0◦ | ↓26*** | ↑0** | 0.24 | 0.68 |
| django | −0.56 | ↑34*** | ↑51*** | ↓1*** | ↓0◦ | ↓7*** | ↑7*** | 0.23 | 0.76 |
| accumulo | −1.57 | ↑87*** | ↑3*** | ↑3*** | ↓0◦ | ↑5*** | ↑1** | 0.20 | 0.78 |
| bugzilla | −0.98 | ↑8*** | ↑42*** | ↓0◦ | ↓1* | ↑46*** | ↑3*** | 0.18 | 0.72 |
| fastjson | −1.24 | ↑82*** | ↑4** | ↑1◦ | ↑0◦ | ↑14*** | ↑0◦ | 0.18 | 0.72 |
| jetty | −1.13 | ↑70*** | ↓25*** | ↑2*** | ↓0◦ | ↑0◦ | ↑3*** | 0.18 | 0.73 |
| hibernate-search | −0.75 | ↑48*** | ↓38*** | ↑4*** | ↑1* | ↓9*** | ↑0◦ | 0.16 | 0.67 |
| hibernate-orm | −0.87 | ↑95*** | ↓2*** | ↑2*** | ↑0◦ | ↓1** | ↑0◦ | 0.14 | 0.67 |
| kylin | −1.28 | ↑86*** | ↑14*** | ↑0◦ | ↓0◦ | ↓0◦ | ↑0◦ | 0.14 | 0.71 |
| log4j | −0.29 | ↑41*** | ↑17*** | ↑0◦ | ↓0◦ | ↓33*** | ↑8*** | 0.11 | 0.67 |
| tomcat | −1.12 | ↑38*** | ↑13*** | ↑24*** | ↓0◦ | ↑9*** | ↑16*** | 0.11 | 0.64 |
| gephi | −0.54 | ↑88*** | ↓7*** | ↑1* | ↑1◦ | ↑3** | ↓0 | 0.09 | 0.64 |
| imglib2 | −0.98 | ↑92*** | ↑1◦ | ↓0◦ | ↓0◦ | ↑6*** | ↑1◦ | 0.09 | 0.67 |
| wordpress | −0.4 | ↓1*** | ↑24*** | ↑12*** | ↓0◦ | ↑1◦ | ↑63*** | 0.08 | 0.63 |
| global JIT model | −1.10 | ↑52*** | ↑16*** | ↑0*** | ↓0◦ | ↑0*** | ↑32*** | 0.14 | 0.70 |

For each metric column, a cell shows the direction of the coefficient (↑: positive, ↓: negative) and the percentage of χ².
Statistical significance of χ²: ◦ p ≥ 0.05; * p < 0.05; ** p < 0.01; *** p < 0.001.
5.4.3 Identifying the most important metric
We used the χ² value of each commit-level metric that is obtained from the ANOVA Type-II test to measure the impact of commit-level metrics on the likelihood of a commit introducing defects. The χ² value measures the impact of a particular independent variable on the dependent variable [32]. The larger the χ² value, the larger the impact that a commit-level metric has on the likelihood of a commit introducing defects. We also calculated the statistical significance (p-value) of χ². When p is less than a significance level (e.g., 5%), we can conclude that the independent variable has a statistically significant impact on the dependent variable. We used the ANOVA Type-II test because it yields a more stable ranking of metrics, as suggested by Tantithamthavorn et al. [48]. To show the impact of each metric more intuitively, we calculated the percentage of the χ² of each commit-level metric relative to the sum of all χ² values of a model, to rank the metrics by their impact for each model.
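The percentage-based ranking above is a simple normalization; a sketch with hypothetical χ² values (not from the paper's models):

```python
# Hypothetical ANOVA Type-II chi-square values for one model's metrics.
chisq = {"entropy": 420.0, "ns": 300.0, "nf": 40.0,
         "relative_churn": 5.0, "lt": 15.0, "fix": 120.0}

total = sum(chisq.values())
# Each metric's share of the model's total chi-square, in percent.
percent = {metric: 100 * value / total for metric, value in chisq.items()}
# Rank metrics by their share; the first entry is the most important metric.
ranking = sorted(percent, key=percent.get, reverse=True)
```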
Similarly, we used the χ² values of the project-level metrics (i.e., the random effects in the project-aware and context-aware JIT models) obtained from the likelihood ratio test (LRT) to measure the impact of project-level metrics on the likelihood of a commit introducing defects. We used the likelihood ratio test rather than directly comparing the variance of the random effects, as suggested by Bolker et al. [3], since the variance of a random effect is not reliable when the sampling distribution is skewed. We also divided the p-value by 2, as suggested by Bolker et al., because LRT-based null hypothesis tests are conservative when the null value (i.e., the variance of the random effects) is on the boundary of the feasible space (i.e., the variance of the random effects cannot be less than 0) [39].
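The LRT with the boundary correction can be sketched in Python for a single extra variance component (df = 1, where the chi-square survival function reduces to erfc); the log-likelihoods below are hypothetical:

```python
import math

def lrt_p_value(loglik_full, loglik_reduced, on_boundary=True):
    """Likelihood ratio test for one extra variance component (df = 1).
    For a chi-square with df = 1, p = erfc(sqrt(stat / 2)); when the null
    value sits on the boundary (variance >= 0), the p-value is halved."""
    stat = 2 * (loglik_full - loglik_reduced)
    p = math.erfc(math.sqrt(stat / 2))
    return p / 2 if on_boundary else p

# Hypothetical log-likelihoods with and without a random effect.
p = lrt_p_value(loglik_full=-1000.0, loglik_reduced=-1003.0)
```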
6 CASE STUDY RESULTS
In this section, we present the approach and results with
respect to each research question.
RQ1: Does the interpretation of the local JIT models
vary?
Approach. To address RQ1, we started from the 20 studied datasets. For each dataset, we constructed a local JIT model using logistic regression (see Section 5.3.1). Because the focus of this RQ is on interpretation, we trained each local JIT model with the whole dataset of the respective project. For each local JIT model, we extracted the coefficients of the intercept and the independent variables. Following Section 5.4, we analyzed the goodness-of-fit of the local JIT models and identified the most important metric of defect-introducing commits. Table 4 presents the model statistics of the local JIT models. Below, we discuss the results with respect to (1) the goodness-of-fit; (2) the most important metric; and (3) the baseline likelihood of introducing defects of the local JIT models.
Goodness-of-Fit. To ensure that the interpretation that is derived from the local JIT models is accurate, we first evaluate the R² goodness-of-fit of the local JIT models and their predictive accuracy using AUC. We find that the R² values of the local JIT models range from 0.08 to 0.43, while the AUC values of the local JIT models range from 0.63 to 0.86. Table 4 confirms that our logistic regression can explain the variability of the data from 8% to 43%. Despite some projects achieving a low goodness-of-fit (i.e., the regression models cannot fully capture the relationship between the metrics and the outcome), our local JIT models still outperform random guessing (AUC = 0.5).
Results. The most important metric of the local JIT models varies among projects. Table 4 shows the model statistics of the local JIT models. The most important metric (i.e.,
the metric with the highest percentage of χ²) for each project is highlighted in Table 4. We find that Entropy is the most important metric for 11 of the 20 (55%) studied projects, while the number of subsystems (NS) is the most important metric for 6 of the 20 (30%) projects. This finding suggests that the interpretation of cross-project JIT models (e.g., global JIT models) that are trained from merged data may be misleading when project-level variances and the characteristics of the studied projects are not considered.
We suspect that the different interpretations of the local JIT models have to do with the nature of the project characteristics, rather than the goodness-of-fit of the models. We find that local JIT models with a similar goodness-of-fit also produce different most important metrics. For example, both the postgres and osquery projects have a very similar goodness-of-fit, with R² values of 0.31 and 0.30, respectively. However, the most important metric for the postgres project is the number of files (NF), while the most important metric for the osquery project is the number of subsystems (NS). This inconsistency of the most important metric for local JIT models that share a similar goodness-of-fit can also be observed for the camel and django projects, and for the bugzilla and fastjson projects.
The different interpretations of the local JIT models are similar to the findings of Menzies et al. [34], who observed variations in the most important metric for each project when defect prediction models are trained using different samples from the same project data.
The baseline likelihood of introducing defects varies among projects. Table 4 shows that the local JIT models often have different intercepts (i.e., baseline likelihoods of introducing defects). The variance of the distribution of intercepts among the local JIT models is 0.21, indicating that different projects have different baseline likelihoods of introducing defects (e.g., some projects are more likely to have defect-introducing commits than others). This finding echoes the finding that the characteristics of the studied projects must be considered when constructing cross-project JIT models.
Summary: The most important metric of the local JIT models and the baseline likelihood of introducing defects vary among projects, suggesting that the interpretation of global JIT models that are trained from merged data may be misleading when project-level variances and the characteristics of the studied projects are not considered.
RQ2: How consistent is the interpretation of the global JIT model when compared to the local JIT models?
Approach. To address RQ2, we investigated the difference between the ANOVA importance scores of the global JIT model and those of the local JIT models. We started by merging the 20 studied datasets into a single dataset (called the merged data). We then constructed a global JIT model using logistic regression (see Section 5.3.2). Because the focus of this RQ is on interpretation, we trained the global JIT model with the whole merged dataset. Similar to RQ1, we analyzed the goodness-of-fit of the global JIT model and identified the most important metric of defect-introducing commits. Table 4 presents the model statistics of the global JIT model. Below, we discuss the results with respect to (1) the goodness-of-fit; (2) the most important metric; and (3) the baseline likelihood of introducing defects of the global JIT model.
Goodness-of-Fit. The R² value of the global JIT model is 30% lower than the average R² value of the local JIT models. Table 4 shows that the R² value of the global JIT model is 0.14, indicating that the global JIT model can explain 14% of the variability of the data. On the other hand, Table 4 shows that the average R² value of the local JIT models is 0.207.
Results. The most important metric of defect-introducing commits that is derived from the global JIT model is not always consistent with the local JIT models. Table 4 shows the model statistics of the global JIT model. We find that Entropy, the purpose of the commit (FIX), and the number of subsystems (NS) are the first, second, and third most important metrics of defect-introducing commits derived from the global JIT model. As shown in the results of RQ1, Entropy is the most important metric of defect-introducing commits for only 11 (55%) of the local JIT models. On the other hand, the number of subsystems (NS) is the most important metric of defect-introducing commits for another 6 (30%) local JIT models, where NS appears only at the third rank of the global JIT model.
The global JIT model cannot capture the variation of the baseline likelihood of introducing defects across the studied projects. Table 4 shows the intercept values (i.e., the baseline likelihood of introducing defects) of the global JIT model and the local JIT models. We find that the intercept value of the global JIT model is −1.1, while the intercept values of the local JIT models range from −1.95 to −0.29. This finding indicates that the global JIT model cannot capture the variation of the baseline likelihood of introducing defects for all of the studied projects. Such inconsistency of the baseline likelihood of introducing defects between the global JIT model and the local JIT models may produce misleading conclusions when interpreting the global JIT model.
The inconsistency between the interpretation of the global and local JIT models is similar to the findings of Menzies et al. [33], who observed that, in the context of release-level defect prediction, the interpretation of models that are trained from merging all projects' data is suboptimal compared to the interpretation of models that are trained from merging a cluster of similar projects' data.
Summary: The most important metric of the global JIT model is consistent with only 55% of the local JIT models, suggesting that the interpretation of a global JIT model cannot capture the variation of the interpretation across all local JIT models. Moreover, the global JIT model cannot capture the variation of the baseline likelihood of introducing defects for all of the studied projects, suggesting that conclusions that are derived from the global JIT model are not generalizable to many of the studied projects.
[Figure 4: The Scott-Knott ESD ranking and the distributions of the nine performance measures (Precision, Recall, F1, AUC, G-score, IFA, PCI@20%, Popt, and Popt@20%) of the local, global, project-aware, and context-aware JIT models. Note that the performance of the local JIT models should be used as a baseline reference, as the local JIT models are not available in the real-world cross-project scenario when access to historical data is limited.]
RQ3: How accurate are the predictions of the project-aware JIT model and the context-aware JIT model when compared to the global JIT model?
Approach. To address RQ3, we evaluate the predictive accuracy of the JIT models using nine performance measures (e.g., precision, recall) to investigate whether the predictive accuracy of the project-aware and context-aware JIT models is comparable to that of the global JIT models. Similar to prior work [22], we evaluate the JIT models using a cross-project evaluation scenario as follows. First, one project is set aside as the testing project. Second, the rest of the studied projects are merged together as one large training dataset. Then, we train a global JIT model, a project-aware JIT model, and a context-aware JIT model (see Section 5.3), and evaluate the models using the nine performance metrics (see Section 5.4.1). We repeat these steps for each of the 20 studied projects. In addition, we also evaluate the local JIT models as a baseline comparison. To do so, we apply an out-of-sample bootstrap model validation technique to estimate the model performance on the within-project dataset [53], i.e., for each project, a local JIT model is trained on a bootstrap sample drawn with replacement, and tested on the samples that do not appear in the bootstrap sample. Then, we compute the average performance value from the 100 repeated out-of-sample bootstrap validations.
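One iteration of the out-of-sample bootstrap can be sketched as follows; the dataset size and seed are illustrative:

```python
import random

def out_of_sample_bootstrap(n_commits, seed=0):
    """One iteration: train on a bootstrap sample drawn with replacement,
    test on the commits that never appear in the bootstrap sample."""
    rng = random.Random(seed)
    indices = list(range(n_commits))
    train = [rng.choice(indices) for _ in range(n_commits)]
    test = sorted(set(indices) - set(train))
    return train, test

# On average, roughly 1/e (~36.8%) of the commits end up out-of-sample.
train, test = out_of_sample_bootstrap(1000)
```

Repeating this 100 times and averaging the per-iteration performance gives the baseline estimate used for the local JIT models.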
Finally, we apply a Scott-Knott ESD test to cluster the distributions into statistically distinct ranks with non-negligible effect size differences [51, 53, 54]. The Scott-Knott ESD test is designed to overcome the confounding factor of overlapping groups that are produced by other post-hoc tests, such as Nemenyi's test. In particular, Nemenyi's test produces overlapping groups of techniques, implying that there exists no statistically significant difference among the techniques. In contrast, the Scott-Knott ESD test produces a ranking of the techniques while ensuring that (1) the magnitude of the difference for all of the distributions in each rank is negligible; and (2) the magnitude of the difference of the distributions between ranks is non-negligible. The Scott-Knott ESD test is based on the ANOVA assumptions of the original Scott-Knott test (e.g., normal distributions, homogeneous distributions, and the minimum sample size). Figure 4 presents the Scott-Knott ESD ranking and the distributions of the nine performance measures of the global, project-aware, and context-aware JIT models, and the local JIT models (as a baseline).
Results. Project-aware and context-aware JIT models achieve performance comparable to global JIT models. The Scott-Knott ESD rankings (see Figure 4) confirm that the project-aware and context-aware JIT models achieve similar performance (with negligible to small effect sizes) compared to the global JIT models across all 9 performance metrics. In addition, the Scott-Knott ESD rankings (see Figure 4) also align with the finding in prior studies [5, 22, 57] that cross-project JIT models that are trained on a mixed-project dataset yield performance comparable to local JIT models. The comparable performance of project-aware and context-aware JIT models with global JIT models lays the foundation for the comparison among the interpretations derived from the three studied types of JIT models.
Summary: Project-aware JIT models and context-aware JIT models that consider project context achieve performance comparable to global JIT models that do not consider project context, across the nine measured performance metrics. The comparable performance allows us to further compare the interpretation among global JIT models, project-aware JIT models and context-aware JIT models in the following RQs.
RQ4: How consistent is the interpretation of the project-aware JIT model when compared to the local JIT models and the global JIT model?
Approach. To address RQ4, we started from the merged data in RQ2. Then, we constructed a project-aware JIT
TABLE 5
Model statistics of the project-aware JIT model (R² = 0.26). The χ² value measures the impact of a particular independent variable on the dependent variable. The larger the χ² value, the larger the impact that a commit-level metric has on the likelihood of a commit introducing defects.

| Type | Variable | Variance | Coef. | χ² | Pr(>χ²) |
| Random slope | Entropy | 0.34 | – | 16856.3 | <2.2e-16*** |
| Random intercept | Project | 0.65 | – | 8208.6 | <2.2e-16*** |
| Fixed effect | FIX | – | 0.53 | 3284.22 | <2.2e-16*** |
| Fixed effect | NS | – | 0.27 | 3245.11 | <2.2e-16*** |
| Fixed effect | NF | – | 0.05 | 76.80 | <2.2e-16*** |
| Fixed effect | LT | – | 0.01 | 9.10 | 2.6e-3** |
| Fixed effect | Relative Churn | – | <0.01 | 0.15 | 0.69◦ |

Statistical significance of χ²: ◦ p ≥ 0.05; * p < 0.05; ** p < 0.01; *** p < 0.001.
model using a mixed-effect logistic regression model (see Section 5.3.3). Because the focus of this RQ is on interpretation, we trained the project-aware JIT model with the whole merged dataset. Similar to RQ1 and RQ2, we analyzed the goodness-of-fit and identified the most important metric of the project-aware JIT model. Table 5 presents the model statistics of the project-aware JIT model. To analyze the consistency of the most important metric, we calculated the errors of the coefficient of the most important metric (i.e., Entropy) between the project-aware JIT model and the local JIT models, as well as between the project-aware JIT model and the global JIT model. Similarly, to analyze the consistency of the baseline likelihood of introducing defects, we calculated the errors of the intercept between the project-aware JIT model and the local JIT models, as well as between the project-aware JIT model and the global JIT model. Below, we discuss the results with respect to (1) the goodness-of-fit; (2) the errors of the importance scores; and (3) the errors of the baseline likelihood of introducing defects.
Goodness-of-Fit. The R² value of the project-aware JIT model is 86% and 26% higher than that of the global JIT model and the average R² value of the local JIT models, respectively. We find that the R² value of the project-aware JIT model is 0.26, while the R² value of the global JIT model is 0.14 and the average R² value of the local JIT models is 0.207. This finding indicates that the project-aware JIT model, which considers project-level variances, can explain the variability of the data 86% better than the global JIT model that is trained from merged data.
Results. The coefficient estimates of the most important metric (i.e., Entropy) that are derived from the project-aware JIT model are 53% more accurate than those of the global JIT model. Figure 5 shows the distributions of the absolute errors of the coefficient of Entropy for the project-aware JIT model and for the global JIT model, respectively. We observe that the global JIT model produces a median absolute error (MAE) of 0.17, while the project-aware JIT model produces an MAE of 0.08, which is 53% lower than that of the global JIT model. The results show that the project-aware JIT model provides a more accurate interpretation of the relationship between Entropy and the likelihood of introducing defects than the global JIT model.
Fig. 5. Distribution of the absolute error of the coefficient of Entropy for the project-aware JIT model and for the global JIT model. The figure shows that the coefficient estimates of the most important metric (i.e., Entropy) that are derived from the project-aware JIT model are 53% more accurate than those of the global JIT model.

Fig. 6. Relationship between Entropy and defect-proneness for each studied project. Entropy ranges outside [0, 1] due to data scaling. The random slope of Entropy in the mixed-effect model is able to show the different relationships between Entropy and the likelihood of a commit introducing defects across projects.

In addition, Figure 6 shows the variation of the relationship between Entropy and the likelihood of introducing defects for each studied project in the project-aware model. We find that the random slope of Entropy in the mixed-effect model is able to show different relationships between Entropy and the likelihood of introducing defects across projects.
On the other hand, the global JIT model can only provide a fixed estimate of the relationship between Entropy and defect-proneness for all the studied projects, which contradicts our prior finding that the most important metric of the local JIT models varies among projects. This finding indicates that global JIT models must not be used to guide operational decisions.
The baseline likelihoods of introducing defects that are derived from the project-aware JIT model are 81% more accurate than those of the global JIT model. Figure 7 shows the distribution of the absolute errors of the intercept for the project-aware JIT model and for the global JIT model, respectively. We calculated that the global JIT model has an MAE of 0.43, while the project-aware JIT model has a lower MAE of 0.08. The results show that the project-aware JIT model provides a more accurate interpretation of the baseline likelihood of introducing defects than the global JIT model.
Fig. 7. Distribution of the absolute error of the intercept for the project-aware JIT model and for the global JIT model. The figure shows that the baseline likelihoods of introducing defects that are derived from the project-aware JIT model are 81% more accurate than those derived from the global JIT model.
Summary: The coefficient estimates of the most important metric (i.e., Entropy) that are derived from the project-aware JIT model are 53% more accurate than those of the global JIT model. In addition, the baseline likelihoods of introducing defects that are derived from the project-aware JIT model are 81% more accurate than those of the global JIT model.
RQ5: How consistent is the interpretation of the context-aware JIT model when compared to the local JIT models and the global JIT model?
Approach. To address RQ5, we repeated the approach of RQ4, replacing the project-aware JIT model with a context-aware JIT model using a mixed-effect logistic regression model (see Section 5.3.4) that considers many contextual factors of different projects. Because the focus of this RQ is on interpretation, we trained the context-aware JIT model with the whole merged dataset. Table 6 presents the model statistics of the context-aware JIT model. Similar to RQ4, we analyzed (1) the goodness-of-fit of the context-aware JIT model; (2) the consistency of the most important metric; and (3) the consistency of the baseline likelihood of introducing defects. Below, we discuss the results with respect to (1) the goodness-of-fit; (2) the errors of the importance scores; and (3) the errors of the baseline likelihood of introducing defects.
Goodness-of-Fit. The R2 value of the context-aware JIT model is 100% and 35% higher than that of the global JIT model and the average R2 value of the local JIT models, respectively. We find that the context-aware JIT model obtained an R2 of 0.28, while the R2 value of the global JIT model is 0.14 and the average R2 value of the local JIT models is 0.207. This finding indicates that the context-aware JIT model, which considers project-level variances and contextual factors, can explain the variability of the data 100% better than the global JIT model that is trained from the merged data. In addition, the R2 of the context-aware JIT model is also 8% higher than that of the project-aware JIT model.
Results. The coefficient estimates of the most important metric (i.e., Entropy) that are derived from the context-aware JIT model are 44% more accurate than those of the global JIT model. Figure 8 shows the distribution of the errors of the coefficients of Entropy for the context-aware JIT model and for the global JIT model, respectively. We calculated that the context-aware JIT model has an MAE of 0.09, which is 44% lower than the MAE of the global JIT model (0.16) that was calculated in RQ4. The results show that the context-aware JIT model provides a more accurate interpretation of the relationship between Entropy and the likelihood of a commit introducing defects than the global JIT model.

TABLE 6
Summary of the context-aware JIT model (R2 = 0.28). The χ2 value measures the impact of a particular independent variable on the dependent variable. The larger the χ2 value, the larger the impact that a commit-level metric has on the likelihood of a commit introducing defects.

Type               Variable         Variance  Coef.   χ2         Pr(>χ2)
Random slope       Entropy          0.34      -       15009.00   <2.2e-16 ***
Random intercept   Language         0.22      -       6076.16    <2.2e-16 ***
                   TLOC             0.12      -       1370.79    <2.2e-16 ***
                   NFILE            0.09      -       1182.11    <2.2e-16 ***
                   NCOMMIT          0.08      -       768.27     <2.2e-16 ***
                   NDEV             0.24      -       300.36     <2.2e-16 ***
                   Nlanguage        0.02      -       64.81      4.2e-16 ***
                   UI               <0.01     -       0.18       0.50 ◦
                   Database         <0.01     -       0.00       0.50 ◦
                   Audience         <0.01     -       0.00       0.50 ◦
Fixed effect       NS               -         0.27    3242.08    <2.2e-16 ***
                   FIX              -         0.53    3197.47    <2.2e-16 ***
                   NF               -         0.05    75.99      <2.2e-16 ***
                   LT               -         0.01    9.41       2.2e-3 **
                   Relative Churn   -         <0.01   0.16       0.69 ◦
Statistical significance of χ2: ◦ p ≥ 0.05; * p < 0.05; ** p < 0.01; *** p < 0.001.
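The χ2 values reported for each variable come from the fitted model's analysis-of-deviance output. A closely related, simpler statistic is the Wald χ2, which for a single coefficient is just the squared z-score. The sketch below shows that version with an illustrative standard error, not the paper's actual one:

```python
import math

def wald_chi2(coef, se):
    """Wald chi-square (1 degree of freedom) and its two-sided p-value
    for one coefficient; equivalent to squaring the z statistic."""
    z = coef / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # 2 * P(Z > |z|)
    return z * z, p_value

# Hypothetical: the FIX coefficient (0.53) with an assumed standard
# error of 0.10 would be highly significant.
chi2, p = wald_chi2(coef=0.53, se=0.10)
```

Either way, a large χ2 relative to the other variables indicates a variable with a large impact on the likelihood of a commit introducing defects.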
We also observed that the project-aware and context-aware JIT models produce similar coefficient estimates of Entropy, as well as of the other commit-level metrics. We calculated that the median absolute difference of the estimated coefficients of Entropy between the two models is 0.006. This observation suggests that, when taking the contextual factors into consideration, the mixed-effect model can yield a stable representation of the relationship between commit-level metrics and the likelihood of a commit introducing defects.
The baseline likelihoods of introducing defects that are derived from the context-aware JIT model are 64% more accurate than those of the global JIT model. Figure 9 shows the distribution of the errors of the intercept for the context-aware JIT model and for the global JIT model, respectively. We calculated that the context-aware JIT model has an MAE of 0.15, which is lower than the MAE of the global JIT model (0.42) that was calculated in RQ4. The results show that the context-aware JIT model provides a more accurate interpretation of the baseline likelihood of introducing defects of projects than the global JIT model. We also observed that the MAE of the context-aware JIT model is higher than that of the project-aware JIT model, indicating that there may be project factors other than the nine studied contextual factors that contribute to the differences among projects.
The main programming language, total lines of code, the number of files, the number of commits, the number of developers, and the number of programming languages are the statistically significant context factors. Table 6 provides a summary of the model statistics for the context-aware JIT model. It shows that these six contextual factors have a p-value < 0.05, indicating that they have a statistically significant impact on the likelihood of introducing defects. For example, projects that are written in Python tend to have the highest likelihood of introducing defects. This finding suggests that the consideration of contextual factors in the JIT model can provide a more in-depth understanding of the properties of defect-introducing commits without impacting the goodness-of-fit of the model.

Fig. 8. Distribution of the absolute error of the coefficients of Entropy for the context-aware JIT model and for the global JIT model. The figure shows that the coefficient estimates of the most important metric (i.e., Entropy) that are derived from the context-aware JIT model are 44% more accurate than those derived from the global JIT model.

Fig. 9. Distribution of the absolute error of the intercept for the context-aware JIT model and for the global JIT model. The figure shows that the baseline likelihoods of introducing defects that are derived from the context-aware JIT model are 64% more accurate than those derived from the global JIT model.
Summary: The coefficient estimates of the most important metric (i.e., Entropy) that are derived from the context-aware JIT model are 44% more accurate than those of the global JIT model. The baseline likelihoods of introducing defects that are derived from the context-aware JIT model are 64% more accurate than those of the global JIT model. In addition, the consideration of contextual factors in the JIT model can provide a more in-depth understanding of the properties of defect-introducing commits without impacting the goodness-of-fit of the model.
RQ6: Do our prior conclusions hold for other mixed-effect classifiers?
Approach. In the previous RQs, we used logistic regression to construct global JIT models, and mixed-effect logistic regression to construct project-aware JIT models and context-aware JIT models. To investigate whether the findings in our prior RQs hold for other mixed-effect classifiers, in this RQ, we evaluate the studied JIT models using random forest classifiers. In particular, we construct global JIT models using random forest classifiers, and construct project-aware and context-aware JIT models using mixed-effect random forest classifiers. Below we describe the modelling techniques in detail.
Random forest is an ensemble classifier that consists of multiple decision trees. Each decision tree in a random forest is trained with a randomly selected subset of the training data and features. A random forest classifies the dependent variable by taking the majority vote of the decision trees. In a classification task (e.g., JIT defect modelling), the ratio of the positive votes from the decision trees in a random forest can be used as the predicted probability of a commit being defect-introducing. In contrast to the generalized linear modelling that is used in the prior RQs, random forest models a non-linear relationship between the dependent and independent variables, and has been widely used alongside generalized linear models in the defect modelling domain [22, 23].
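To make the vote-ratio idea concrete, here is a deliberately tiny sketch in which each "tree" is reduced to a one-feature threshold stump trained on a bootstrap sample; real random forest implementations (such as ranger) grow full decision trees over many metrics:

```python
import random

def train_stump(rows):
    """A decision 'tree' reduced to a one-feature threshold stump:
    predict 1 when x exceeds the midpoint of the two class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > t else 0

def random_forest(data, n_trees=25, seed=0):
    """Train n_trees stumps on bootstrap samples; the predicted
    probability is the ratio of positive votes across the trees."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # bootstrap sample
        while len({y for _, y in sample}) < 2:      # need both classes
            sample = [rng.choice(data) for _ in data]
        trees.append(train_stump(sample))
    def predict_proba(x):
        return sum(tree(x) for tree in trees) / n_trees
    return predict_proba

# Toy data: one feature (e.g., a scaled Entropy value) and a 0/1 label
# marking whether the commit was defect-introducing.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
proba = random_forest(data)
```

The closure returned by `random_forest` plays the role of the fitted classifier: calling it on a new commit's feature value yields the share of trees that voted "defect-introducing".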
Applying mixed-effect modelling to non-linear models is a relatively new research area. Hajjem et al. [7] conducted the initial work on applying mixed-effect modelling to a non-linear model. In 2017, Hajjem et al. [8] extended their prior work and proposed generalized mixed-effect regression trees to allow binary outcomes (i.e., classification). Recently, Wang et al. [59] implemented a generalized mixed-effect random forest based on Hajjem et al.'s work. Recall that in Section 5.3.3 and Section 5.3.4, we defined a generalized linear mixed-effect model (mixed-effect logistic regression) as
Ln(y / (1 − y)) = β0 + βi xi + Σ_{m=1}^{M} v_{jm0} + u_{jk} zk + ε

In this formula, β0 + βi xi represents the fixed effects (commit-level metrics), while Σ_{m=1}^{M} v_{jm0} + u_{jk} zk represents the random effects (project-level metrics). Let F(xi) = β0 + βi xi and G(zk) = Σ_{m=1}^{M} v_{jm0} + u_{jk} zk; we then obtain the general form of mixed-effect models:

Ln(y / (1 − y)) = F(xi) + G(zk) + ε    (4)
For a generalized linear mixed-effect model, F(xi) is a linear function. When replacing F(xi) with a non-linear function (e.g., a random forest model), Formula 4 becomes a generalized non-linear mixed-effect model (e.g., a generalized mixed-effect random forest). Note that the random effect G(zk) remains a linear function.
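Fitting such a model typically alternates between the two components: fit F on the data with the current random effects subtracted, then re-estimate each group's random effect from the residuals. The toy sketch below strips this down to a continuous response and a trivial F (a global mean) so the alternation itself is visible; the actual GMERF of Hajjem et al. [8] and Wang et al. [59] uses an EM-style loop with a random forest for F and a logit link for binary outcomes:

```python
from statistics import mean

def fit_mixed(y_by_group, n_iter=20):
    """Toy alternating fit: f approximates the fixed-effect part F,
    and b holds one random intercept per group (the G part)."""
    groups = list(y_by_group)
    b = {g: 0.0 for g in groups}  # random intercepts, start at zero
    f = 0.0                       # fixed-effect part (here: a scalar)
    for _ in range(n_iter):
        # 1) fit F on the data with current random effects removed
        f = mean(v - b[g] for g in groups for v in y_by_group[g])
        # 2) re-estimate each group's random effect from its residuals
        b = {g: mean(v - f for v in y_by_group[g]) for g in groups}
    return f, b
```

For example, two groups centered at 1.1 and 2.9 converge to a shared fixed part of 2.0 with opposite random intercepts of about ±0.9, which is exactly the "shared trend plus per-project offset" decomposition the mixed-effect JIT models rely on.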
We followed the same process as stated in RQ3 to calculate the nine performance metrics for the random-forest-based global, project-aware, and context-aware JIT models. In addition, we calculated the goodness-of-fit of the three types of random-forest-based JIT models using the method stated in Section 5.4.2.
We improved the implementation of the generalized mixed-effect random forest from Wang et al. [59] using a fast implementation of random forest provided by the ranger package in R. In addition, we also improved Wang et al.'s implementation's strategy for predicting a project that is unseen during training, using our strategy that is explained in RQ3. The generalized mixed-effect random forest model was trained using the default parameter settings of the random forest implemented in the ranger package.
TABLE 7
Cross-project performance metrics of random-forest-based JIT models. The table shows that random-forest-based project-aware and context-aware JIT models achieve comparable performance to the random-forest-based global JIT model.

            Global JIT models   Project-aware JIT models   Context-aware JIT models
            Median    Mean      Median    Mean             Median    Mean
Precision   0.62      0.58      0.58      0.60             0.59      0.63
Recall      0.39      0.42      0.31      0.35             0.38      0.35
F1-score    0.46      0.46      0.38      0.39             0.45      0.35
G-score     0.53      0.54      0.47      0.46             0.53      0.41
AUC         0.72      0.73      0.74      0.75             0.74      0.75
IFA         1.00      1.05      1.00      1.40             1.00      1.45
PCI@20%     0.07      0.08      0.07      0.07             0.07      0.07
Popt        0.45      0.47      0.41      0.41             0.41      0.41
Popt@20%    0.27      0.27      0.27      0.27             0.28      0.27
Results. Random-forest-based project-aware and context-aware JIT models achieve comparable performance to the random-forest-based global JIT model. Table 7 shows the cross-project performance of the random-forest-based global, project-aware, and context-aware JIT models. Similar to what we observed in RQ3, the project-aware and context-aware JIT models that are constructed using a generalized mixed-effect random forest yield comparable performance to the random-forest-based global JIT models.
Random-forest-based project-aware and context-aware JIT models have a better goodness-of-fit than the random-forest-based global JIT model. The R2 of the random-forest-based global JIT model on the merged dataset is 0.29, while the R2 of the random-forest-based project-aware JIT model is 0.41, and the R2 of the random-forest-based context-aware JIT model is 0.40. The random-forest-based project-aware and context-aware JIT models achieve a 41% and 38% better goodness-of-fit, respectively, compared to the random-forest-based global JIT model. This observation is consistent with our findings in RQ4 and RQ5.
Summary: Similar to the project-aware and context-aware JIT models that are based on generalized linear mixed-effect modelling, the project-aware and context-aware JIT models that use generalized mixed-effect random forest modelling achieve performance similar to that of the random-forest-based global JIT model. In addition, the generalized mixed-effect random forest based project-aware and context-aware JIT models also achieve a better goodness-of-fit compared to the random-forest-based global JIT model.
7 PRACTICAL GUIDELINES
Based on the results of our study, we derive the following practical guideline: When the goal is to derive sound interpretations from JIT models that are trained from a mixed-project dataset,
1) The project-aware JIT models using a mixed-effect modelling approach (with only one project-level random intercept) should be used to consider the project factors.
2) The context-aware JIT models using a mixed-effect modelling approach (with multiple random intercepts for project-level contextual factors) should be used to consider the variation of the context factors and to understand the impact of the contextual factors on the risk of defect-introducing commits without impacting the goodness-of-fit of the models.
Below, we discuss the implications of our guideline for practitioners and researchers.
Implications for Practitioners. When having limited access to data, the use of a mixed-effect modelling approach to train on a mixed-project dataset allows practitioners to produce more accurate insights that are able to capture the different project and context characteristics (e.g., programming languages). Such accurate insights could help developers prioritize pre-commit testing efforts and code review focus, as well as help QA leads and managers develop the most effective quality improvement plans.
Implications for Researchers. When the goal is to develop an empirically-grounded theory from a mixed-project dataset in order to draw a general conclusion, the mixed-effect modelling approach should be used to allow researchers to gain a deeper understanding of whether specific conclusions are sensitive to particular project or context characteristics. Recent studies have employed mixed-effect modelling approaches. For example, Hassan et al. [10] employed a mixed-effect modelling approach to examine the relationship between the characteristics of mobile app reviews and the likelihood of a developer responding to a review. Thongtanunam et al. [56] employed a mixed-effect modelling approach to examine the relationship between the characteristics of code review dynamics and the likelihood of a patch introducing defects in the context of modern code review. The empirical evidence of our paper is another supporting data point that mixed-effect modelling approaches can capture both project and contextual factors. Hence, we advocate the use of mixed-effect modelling approaches in future studies.
8 THREATS TO VALIDITY
In this section, we discuss the threats to the validity of our study.
Internal Validity. Recently, Rodríguez-Pérez et al. [43] raised concerns that there are different variants of the SZZ algorithm. In this paper, we identified defect-introducing commits using Commit Guru [44], which uses the complete SZZ algorithm [46]. The SZZ algorithm is commonly used in prior work on JIT defect prediction [22, 23, 60]. The algorithm first identifies defect-fixing commits by matching commits with bug reports labeled as fixed in the issue tracking system, then employs the "diff" and "blame" functionality of the VCS to determine the defect-introducing lines, and locates the defect-introducing commits that modified those lines. This variant has two common limitations. First, it is possible that some commits may not be linked, or may be incorrectly linked, to the issue reports. Second, it is possible that the "diff" and "blame" functionality may involve cosmetic changes, comment changes, or new blank lines. Future research should consider addressing these limitations with an improved variation of the SZZ algorithm.
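The blame step described above can be sketched as follows (a toy in-memory stand-in with hypothetical commit names; a real SZZ implementation shells out to `git diff` and `git blame` and must handle renames, comments, and whitespace-only changes):

```python
def szz_introducing_commits(fix_deleted_lines, blame_at_parent):
    """Toy SZZ step: every line deleted by a defect-fixing commit is
    traced, via blame on the fix's parent revision, to the commit that
    last modified it; those commits are flagged as defect-introducing."""
    return {blame_at_parent[line] for line in fix_deleted_lines}

# Hypothetical example: the fix removed lines 10 and 12 of a file, and
# blame on the parent revision says who last touched each line.
blame = {10: "c1", 11: "c2", 12: "c1", 13: "c3"}
introducers = szz_introducing_commits({10, 12}, blame)  # {"c1"}
```

The two limitations discussed above map directly onto this sketch: a missing issue link means the fix commit is never fed in, and a cosmetic change pollutes `blame_at_parent` so an innocent commit gets flagged.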
External Validity. We used 20 open source projects in our study. Hence, our results may not be generalizable to other projects. However, we selected projects that have a large number of commits to combat potential bias in our results, and our study shows that there exists at least one set of projects for which merging the datasets would lead to inaccurate interpretations. Nonetheless, additional replication studies are needed to verify our results.
Construct Validity. Although we studied eight popular commit-level metrics from the literature [22], there are many other commit-level metrics that could be included in our study. However, as the goal of our study is to compare the interpretation of different types of cross-project JIT models under the same settings (i.e., the same training data and the same commit-level metrics), other metrics can be included in future work (e.g., code review metrics [31], test smell metrics [28]).
In addition, we studied a limited number (9) of project-level metrics that were used in prior work to study the context of projects [61, 63]. We would like to note that the selection of the project-level metrics depends on the problem, the context, and the operationalization of the hypotheses. Thus, the goal of our work is not to draw the generalization that Language is always the most important contextual factor of JIT defect models. Instead, one of the main goals of our work is to highlight the benefits of considering project characteristics when constructing JIT models from a mixed-project dataset to investigate the impact of the contextual factors on the likelihood of a commit being a defect-introducing one.
9 CONCLUSION
In this paper, we investigated the impact of data merging on the interpretation of cross-project JIT defect models. In particular, we investigated the interpretation of three types of cross-project JIT models (i.e., a global JIT model, a project-aware model, and a context-aware model) in comparison with the interpretation of local JIT models. Through a case study of 20 open source projects, we make the following key observations:
1) The most important metric (i.e., the metric that has the highest impact on the likelihood of a commit introducing defects) of the local JIT models and the baseline likelihood of introducing defects (i.e., the likelihood of introducing defects when all commit-level metrics are 0) vary among projects.
2) The most important metric of the global JIT model is consistent with 55% of the studied projects' local JIT models, suggesting that the interpretation of global JIT models cannot capture the variation of the interpretation of all local JIT models.
3) The project-aware JIT model can provide both a more representative interpretation and a better fit to the dataset than the global JIT model.
4) The context-aware JIT model can provide both a more representative interpretation and a better fit to the dataset than the global JIT model.
These findings lead us to draw the following suggestion: when training a defect model with a pool of mixed project data, one should opt to use a mixed-effect modelling approach that considers individual projects and contexts.
Finally, we would like to emphasize that the impact of data merging on the interpretation of cross-project JIT defect models does not necessarily apply to all studies, all scenarios, all datasets, and all analytical models in software engineering. Instead, the key message of our study is to shed light on the fact that the simple data merging practice impacts the interpretation of cross-project JIT models, as our research shows that there exists a set of projects for which merging the datasets would lead to incorrect conclusions. Thus, researchers and practitioners should consider using mixed-effect modelling. Irrespective of the learning algorithm, using a mixed-effect modelling approach to build JIT models can provide better interpretation by taking project-level variances into consideration, without sacrificing the performance of the JIT models. Thus, future studies should consider the mixed-effect modelling approach when the goal is to derive sound interpretations.
ACKNOWLEDGEMENT
C. Tantithamthavorn was partially supported by the Australian Research Council's Discovery Early Career Researcher Award (DECRA) funding scheme (DE200100941).
REFERENCES
[1] A. Agresti, Categorical Data Analysis. John Wiley & Sons, 2013.
[2] N. Bettenburg, M. Nagappan, and A. E. Hassan, "Towards improving statistical modeling of software engineering data: think locally, act globally!" Empirical Software Engineering, vol. 20, no. 2, pp. 294–335, 2015.
[3] B. M. Bolker, M. E. Brooks, C. J. Clark, S. W. Geange, J. R. Poulsen, M. H. H. Stevens, and J.-S. S. White, "Generalized linear mixed models: a practical guide for ecology and evolution," Trends in Ecology & Evolution, vol. 24, no. 3, pp. 127–135, 2009.
[4] P. Devanbu, T. Zimmermann, and C. Bird, "Belief & evidence in empirical software engineering," in Proceedings of the International Conference on Software Engineering (ICSE). IEEE, 2016, pp. 108–119.
[5] T. Fukushima, Y. Kamei, S. McIntosh, K. Yamashita, and N. Ubayashi, "An empirical study of just-in-time defect prediction using cross-project models," in Proceedings of the 11th Working Conference on Mining Software Repositories. ACM, 2014, pp. 172–181.
[6] P. J. Guo, T. Zimmermann, N. Nagappan, and B. Murphy, "Characterizing and predicting which bugs get fixed: an empirical study of Microsoft Windows," in 32nd International Conference on Software Engineering, vol. 1. IEEE, 2010, pp. 495–504.
[7] A. Hajjem, F. Bellavance, and D. Larocque, "Mixed-effects random forest for clustered data," Journal of Statistical Computation and Simulation, vol. 84, no. 6, pp. 1313–1328, 2014.
[8] A. Hajjem, D. Larocque, and F. Bellavance, "Generalized mixed effects regression trees," Statistics & Probability Letters, vol. 126, pp. 114–118, 2017.
[9] A. E. Hassan, "Predicting faults using the complexity of code changes," in IEEE 31st International Conference on Software Engineering (ICSE). IEEE, 2009, pp. 78–88.
[10] S. Hassan, C. Tantithamthavorn, C.-P. Bezemer, and A. E. Hassan, "Studying the dialogue between users and developers of free apps in the Google Play Store," Empirical Software Engineering (EMSE), 2017.
[11] S. Herbold, A. Trautsch, and J. Grabowski, "A comparative study to benchmark cross-project defect prediction approaches," IEEE Transactions on Software Engineering, 2017.
[12] ——, "Global vs. local models for cross-project defect prediction," Empirical Software Engineering, vol. 22, no. 4, pp. 1866–1902, 2017.
[13] S. Hosseini, B. Turhan, and D. Gunarathna, "A systematic literature review and meta-analysis on cross-project defect prediction," IEEE Transactions on Software Engineering, 2017.
[14] Q. Huang, X. Xia, and D. Lo, "Supervised vs unsupervised models: A holistic look at effort-aware just-in-time defect prediction," in 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2017, pp. 159–170.
[15] J. Jiarpakdee, C. Tantithamthavorn, H. K. Dam, and J. Grundy, "An empirical study of model-agnostic techniques for defect prediction models," IEEE Transactions on Software Engineering (TSE), 2020.
[16] J. Jiarpakdee, C. Tantithamthavorn, and J. Grundy, "Practitioners' perceptions of the goals and visual explanations of defect prediction models," in Proceedings of the International Conference on Mining Software Repositories (MSR), 2021, p. To Appear.
[17] J. Jiarpakdee, C. Tantithamthavorn, and A. E. Hassan, "The impact of correlated metrics on the interpretation of defect models," IEEE Transactions on Software Engineering (TSE), 2019.
[18] J. Jiarpakdee, C. Tantithamthavorn, and C. Treude, "AutoSpearman: Automatically mitigating correlated metrics for interpreting defect models," in Proceedings of the International Conference on Software Maintenance and Evolution (ICSME), 2018, pp. 92–103.
[19] ——, "AutoSpearman: Automatically Mitigating Correlated Software Metrics for Interpreting Defect Models," in ICSME, 2018, pp. 92–103.
[20] ——, "The impact of automated feature selection techniques on the interpretation of defect models," EMSE, 2020.
[21] P. C. Johnson, "Extension of Nakagawa & Schielzeth's R2GLMM to random slopes models," Methods in Ecology and Evolution, vol. 5, no. 9, pp. 944–946, 2014.
[22] Y. Kamei, T. Fukushima, S. McIntosh, K. Yamashita, N. Ubayashi, and A. E. Hassan, "Studying just-in-time defect prediction using cross-project models," Empirical Software Engineering, vol. 21, no. 5, pp. 2072–2106, 2016.
[23] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi, "A large-scale empirical study of just-in-time quality assurance," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, 2012.
[24] C. Khanan, W. Luewichana, K. Pruktharathikoon, J. Jiarpakdee, C. Tantithamthavorn, M. Choetkiertikul, C. Ragkhitwetsagul, and T. Sunetnanta, "JITBot: An explainable just-in-time defect prediction bot," in 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2020, pp. 1336–1339.
[25] S. Kim, E. J. Whitehead Jr, and Y. Zhang, "Classifying software changes: Clean or buggy?" IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.
[26] B. A. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: A systematic review," IEEE Transactions on Software Engineering, vol. 33, no. 5, 2007.
[27] R. Krishna and T. Menzies, "Bellwethers: A Baseline Method For Transfer Learning," IEEE Transactions on Software Engineering, p. To appear, 2018.
[28] S. Lambiase, A. Cupito, F. Pecorelli, A. De Lucia, and F. Palomba, "Just-in-time test smell detection and refactoring: The DARTS project," in Proceedings of the 28th International Conference on Program Comprehension, 2020, pp. 441–445.
[29] D. Lin, C. Tantithamthavorn, and A. E. Hassan, "Replication package of our paper," https://github.com/SAILResearch/suppmaterial-19-dayi-risk_data_merging_jit, 2019, (last visited: Nov 11, 2019).
[30] J. N. Mandrekar, "Receiver operating characteristic curve in diagnostic test assessment," Journal of Thoracic Oncology, vol. 5, no. 9, pp. 1315–1316, 2010.
[31] S. McIntosh and Y. Kamei, "Are Fix-Inducing Changes a Moving Target? A Longitudinal Case Study of Just-In-Time Defect Prediction," IEEE Transactions on Software Engineering, p. To appear, 2017.
[32] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan, "The impact of code review coverage and code review participation on software quality: A case study of the Qt, VTK, and ITK projects," in Proceedings of the 11th Working Conference on Mining Software Repositories. ACM, 2014, pp. 192–201.
[33] T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, and T. Zimmermann, "Local versus global lessons for defect prediction and effort estimation," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 822–834, 2013.
[34] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Transactions on Software Engineering, vol. 33, no. 1, pp. 2–13, 2006.
[35] A. Mockus and D. M. Weiss, "Predicting risk of software changes," Bell Labs Technical Journal, vol. 5, no. 2, pp. 169–180, 2000.
[36] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering. ACM, 2008, pp. 181–190.
[37] N. Nagappan and T. Ball, "Use of relative code churn measures to predict system defect density," in Proceedings of the 27th International Conference on Software Engineering. ACM, 2005, pp. 284–292.
[38] S. Nakagawa and H. Schielzeth, "A general and simple method for obtaining R2 from generalized linear mixed-effects models," Methods in Ecology and Evolution, vol. 4, no. 2, pp. 133–142, 2013.
[39] J. C. Pinheiro and D. M. Bates, Mixed-Effects Models in S and S-PLUS. Springer, New York, 2000.
[40] C. Pornprasit and C. Tantithamthavorn, "JITLine: A Simpler, Better, Faster, Finer-grained Just-In-Time Defect Prediction," in Proceedings of the International Conference on Mining Software Repositories (MSR), 2021, p. To Appear.
[41] R. Purushothaman and D. E. Perry, “Toward understanding the rhetoric of small source code changes,” IEEE Transactions on Software Engineering, vol. 31, no. 6, pp. 511–526, 2005.
[42] D. Rajapaksha, C. Tantithamthavorn, J. Jiarpakdee, C. Bergmeir, J. Grundy, and W. Buntine, “SQAPlanner: Generating Data-Informed Software Quality Improvement Plans,” arXiv preprint arXiv:2102.09687, 2021.
[43] G. Rodríguez-Pérez, G. Robles, and J. M. González-Barahona, “Reproducibility and credibility in empirical software engineering: A case study based on a systematic literature review of the use of the SZZ algorithm,” Information and Software Technology, vol. 99, pp. 164–176, 2018.
[44] C. Rosen, B. Grawi, and E. Shihab, “Commit Guru: Analytics and risk prediction of software commits,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2015. New York, NY, USA: ACM, 2015, pp. 966–969.
[45] E. Shihab, A. E. Hassan, B. Adams, and Z. M. Jiang,
“An industrial study on the risk of software changes,”
in Proceedings of the ACM SIGSOFT 20th International
Symposium on the Foundations of Software Engineering.
ACM, 2012, p. 62.
[46] J. Śliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?” in ACM SIGSOFT Software Engineering Notes, vol. 30, no. 4. ACM, 2005, pp. 1–5.
[47] M. Tan, L. Tan, S. Dara, and C. Mayeux, “Online defect prediction for imbalanced data,” in Proceedings of the 37th International Conference on Software Engineering - Volume 2. IEEE Press, 2015, pp. 99–108.
[48] C. Tantithamthavorn and A. E. Hassan, “An experience report on defect modelling in practice: Pitfalls and challenges,” in Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice. ACM, 2018, pp. 286–295.
[49] C. Tantithamthavorn, A. E. Hassan, and K. Matsumoto, “The impact of class rebalancing techniques on the performance and interpretation of defect prediction models,” IEEE Transactions on Software Engineering, 2018.
[50] C. Tantithamthavorn, J. Jiarpakdee, and J. Grundy, “Explainable AI for Software Engineering,” arXiv preprint arXiv:2012.01614, 2020.
[51] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Matsumoto, “Automated Parameter Optimization of Classification Techniques for Defect Prediction Models,” in Proceedings of the International Conference on Software Engineering (ICSE), 2016, pp. 321–332.
[52] ——, “An empirical comparison of model validation techniques for defect prediction models,” IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1–18, 2016.
[53] ——, “An Empirical Comparison of Model Validation Techniques for Defect Prediction Models,” IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1–18, 2017.
[54] ——, “The Impact of Automated Parameter Optimization on Defect Prediction Models,” IEEE Transactions on Software Engineering, 2018.
[55] P. Thongtanunam and A. E. Hassan, “Review dynamics and their impact on software quality,” IEEE Transactions on Software Engineering, to appear, 2020.
[56] ——, “Review dynamics and their impact on software quality,” IEEE Transactions on Software Engineering, 2020.
[57] B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect prediction,” Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[58] B. Turhan, A. Tosun, and A. Bener, “Empirical evaluation of mixed-project defect prediction models,” in 37th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 2011, pp. 396–403.
[59] J. Wang, E. R. Gamazon, B. L. Pierce, B. E. Stranger, H. K. Im, R. D. Gibbons, N. J. Cox, D. L. Nicolae, and L. S. Chen, “Imputing gene expression in uncollected tissues within and beyond GTEx,” The American Journal of Human Genetics, vol. 98, no. 4, pp. 697–708, 2016.
[60] S. Yathish, J. Jiarpakdee, P. Thongtanunam, and C. Tantithamthavorn, “Mining Software Defects: Should We Consider Affected Releases?” in Proceedings of the International Conference on Software Engineering (ICSE), 2019, pp. 654–665.
[61] F. Zhang, A. Mockus, I. Keivanloo, and Y. Zou, “Towards building a universal defect prediction model,” in Proceedings of the 11th Working Conference on Mining Software Repositories. ACM, 2014, pp. 182–191.
[62] ——, “Towards building a universal defect prediction model with rank transformed predictors,” Empirical Software Engineering, vol. 21, no. 5, pp. 2107–2145, 2016.
[63] F. Zhang, A. Mockus, Y. Zou, F. Khomh, and A. E. Hassan, “How does context affect the distribution of software maintainability metrics?” in Proceedings of the 29th IEEE International Conference on Software Maintenance (ICSM). IEEE, 2013, pp. 350–359.
[64] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large-scale experiment on data vs. domain vs. process,” in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering. ACM, 2009, pp. 91–100.
Dayi Lin is a Senior Researcher at the Centre for Software Excellence, Huawei, Canada. He obtained his Ph.D. in Computer Science from the Software Analysis and Intelligence Lab (SAIL) at Queen’s University, Canada. His research interests include mining software repositories, empirical software engineering, game engineering, and software engineering for machine learning systems. More information about Dayi is available on his website: http://lindayi.me.
Chakkrit Tantithamthavorn is a 2020 ARC DECRA Fellow and a Lecturer in Software Engineering in the Faculty of Information Technology, Monash University, Melbourne, Australia. His current fellowship focuses on the development of “Practical and Explainable Analytics to Prevent Future Software Defects”. His work has been published at several top-tier software engineering venues, such as the IEEE Transactions on Software Engineering (TSE), the Springer Journal of Empirical Software Engineering (EMSE), and the International Conference on Software Engineering (ICSE). More about Chakkrit and his work is available online at http://chakkrit.com.
Ahmed E. Hassan is an IEEE Fellow, an ACM SIGSOFT Influential Educator, an NSERC Steacie Fellow, the Canada Research Chair (CRC) in Software Analytics, and the NSERC/BlackBerry Software Engineering Chair at the School of Computing at Queen’s University, Canada. His research interests include mining software repositories, empirical software engineering, load testing, and log mining. He received a PhD in Computer Science from the University of Waterloo. He spearheaded the creation of the Mining Software Repositories (MSR) conference and its research community. He also serves or has served on the editorial boards of IEEE Transactions on Software Engineering, the Springer Journal of Empirical Software Engineering, and PeerJ Computer Science. More information at http://sail.cs.queensu.ca/.