IEEE TRANSACTIONS ON SOFTWARE ENGINEERING 1
The Impact of Correlated Metrics on
the Interpretation of Defect Models
Jirayus Jiarpakdee, Student Member, IEEE, Chakkrit Tantithamthavorn, Member, IEEE,
and Ahmed E. Hassan, Fellow, IEEE
Abstract—Defect models are analytical models for building empirical theories related to software quality. Prior studies often derive
knowledge from such models using interpretation techniques, e.g., ANOVA Type-I. Recent work raises concerns that correlated metrics
may impact the interpretation of defect models. Yet, the impact of correlated metrics in such models has not been investigated. In this
paper, we investigate the impact of correlated metrics on the interpretation of defect models and the improvement of the interpretation
of defect models when removing correlated metrics. Through a case study of 14 publicly-available defect datasets, we find that (1)
correlated metrics have the largest impact on the consistency, the level of discrepancy, and the direction of the ranking of metrics,
especially for ANOVA techniques. On the other hand, we ﬁnd that removing all correlated metrics (2) improves the consistency of the
produced rankings regardless of the ordering of metrics (except for ANOVA Type-I); (3) improves the consistency of ranking of metrics
among the studied interpretation techniques; (4) impacts the model performance by less than 5 percentage points. Thus, when one
wishes to derive sound interpretation from defect models, one must (1) mitigate correlated metrics especially for ANOVA analyses; and
(2) avoid using ANOVA Type-I even if all correlated metrics are removed.
Index Terms—Software Quality Assurance, Defect models, Hypothesis Testing, Correlated Metrics, Model Speciﬁcation.
1 INTRODUCTION
Defect models are constructed using historical software
project data to identify defective modules and explore the
impact of various phenomena (i.e., software metrics) on
software quality. The interpretation of such models is used
to build empirical theories that are related to software
quality (i.e., what software metrics share the strongest as-
sociation with software quality?). These empirical theories
are essential for project managers to chart software quality
improvement plans to mitigate the risk of introducing de-
fects in future releases (e.g., a policy to maintain code as
simple as possible).
Plenty of prior studies investigate the impact of many
phenomena on code quality using software metrics, for ex-
ample, code size, code complexity [31, 49, 71], change com-
plexity [42, 57, 59, 71, 88], antipatterns [41], developer activ-
ity [71], developer experience [61], developer expertise [5],
developer and reviewer knowledge [81], design [3, 10, 11,
14, 16], reviewer participation [50, 82], code smells [40], and
mutation testing [7]. To perform such studies, there are ﬁve
common steps: (1) formulating of hypotheses that pertain to
the phenomena that one wishes to study; (2) designing ap-
propriate metrics to operationalize the intention behind the
phenomena under study; (3) deﬁning a model speciﬁcation
(e.g., the ordering of metrics) to be used when constructing
an analytical model; (4) constructing an analytical model
using, for example, regression models [5, 57, 81, 82, 87] or
random forest models [23, 38, 55, 64]; and (5) examining the
[Author affiliations] J. Jiarpakdee and C. Tantithamthavorn are with the Faculty of Information Technology, Monash University, Australia. E-mail: jirayus.jiar@gmail.com, chakkrit.tantithamthavorn@monash.edu. A. E. Hassan is with the School of Computing, Queen's University.
ranking of metrics using a model interpretation technique
(e.g., ANOVA Type-I, one of the most commonly-used inter-
pretation techniques since it is the default built-in function
for logistic regression (glm) models in R) in order to test the
hypotheses.
For example, to study whether complex code increases
project risk, one might use the number of reported bugs
(bugs) to capture risk, and the McCabe’s cyclomatic com-
plexity (CC) to capture code complexity, while controlling
for code size (size). We note that one needs to use control
metrics to ensure that ﬁndings are not due to confounding
factors (e.g., large modules are more likely to have more
bugs). Then, one must construct an analytical model with a model specification of bugs ~ size + CC. One would then
use an interpretation technique (e.g. ANOVA Type-I) to
determine the ranking of metrics (i.e., which metrics have
a strong relationship with bugs).
Metrics of prior studies are often correlated [22, 32, 33,
35, 36, 74, 77, 85]. For example, Herraiz et al. [33], and
Gil et al. [22] point out that code complexity (CC) is often
correlated with code size (size). Zhang et al. [85] point
out that many metric aggregation schemes (e.g., averaging
or summing of McCabe’s cyclomatic complexity values at
the function level to derive ﬁle-level metrics) often produce
correlated metrics.
Recent studies raise concerns that correlated metrics may
impact the interpretation of defect models [77, 85]. Our pre-
liminary analysis (PA2) also shows that simply rearranging
the ordering of correlated metrics in the model speciﬁcation
(e.g., from bugs ~ size + CC to bugs ~ CC + size) would lead
to a different ranking of metrics—i.e., the importance scores
are sensitive to the ordering of correlated metrics in a
model speciﬁcation. Thus, if one wants to show that code
complexity is strongly associated with risk in a project,
one simply needs to put code complexity (CC) as the ﬁrst
metric in their models (i.e., bugs ~ CC + size), even though
a more careful analysis would show that CC is not associated
with bugs at all. The sensitivity of the model speciﬁcation
when correlated metrics are included in a model is a crit-
ical problem, since the contribution of many prior studies
can be altered by simply re-ordering metrics in the model
speciﬁcation if correlated metrics are not properly mitigated.
Unfortunately, a literature survey of Shihab [67] shows
that as much as 63% of defect studies that are published
during 2000-2011 do not mitigate correlated metrics prior to
constructing defect models.
In this paper, we set out to investigate (1) the impact of
correlated metrics on the interpretation of defect models.
After removing correlated metrics, we investigate (2) the
consistency of the interpretation of defect models; and (3)
its impact on the performance and stability of defect models.
In order to detect and remove correlated metrics, we apply
the variable clustering (VarClus) and the variance inﬂation
factor (VIF) techniques. We construct logistic regression
and random forest models using mitigated (i.e., no corre-
lated metrics) and non-mitigated datasets (i.e., not treated).
Finally, we apply 9 model interpretation techniques, i.e.,
ANOVA Type-I, 4 test statistics of ANOVA Type-II (i.e.,
Wald, Likelihood Ratio, F, and Chi-square), scaled and non-
scaled Gini Importance, and scaled and non-scaled Permu-
tation Importance. We then compare the performance and
interpretation of defect models that are constructed using
mitigated and non-mitigated datasets. Through a case study
of 14 publicly-available defect datasets of systems that span
both proprietary and open source domains, we address the
following four research questions:
(RQ1) How do correlated metrics impact the interpreta-
tion of defect models?
ANOVA Type-I and Type-II often produce the lowest
consistency and the highest level of discrepancy of
the top-ranked metric, and have the highest im-
pact on the direction of the ranking of metrics be-
tween mitigated and non-mitigated models when
compared to Gini and Permutation Importance. This
ﬁnding highlights the risks of not mitigating cor-
related metrics in the ANOVA analyses of prior
studies.
(RQ2) After removing all correlated metrics, how consis-
tent is the interpretation of defect models among
different model speciﬁcations?
After removing all correlated metrics, the top-
ranked metric according to ANOVA Type-II, Gini
Importance, and Permutation Importance are con-
sistent. However, the top-ranked metric according
to ANOVA Type-I is inconsistent, since the ranking
of metrics is impacted by its order in the model
speciﬁcation when analyzed using ANOVA Type-I
(which is the default analysis for the glm model in R
and is commonly-used in prior studies). This ﬁnding
suggests that ANOVA Type-I must be avoided even
if all correlated metrics are removed.
(RQ3) After removing all correlated metrics, how consis-
tent is the interpretation of defect models among
the studied interpretation techniques?
After removing all correlated metrics, we ﬁnd that
the consistency of the ranking of metrics among
the studied interpretation techniques is improved by
15%-64% for the top-ranked metric and 21%-71% for
the top-3 ranked metrics, respectively, highlighting
the beneﬁts of removing all correlated metrics on the
interpretation of defect models, i.e., once correlated metrics are mitigated, relying on a single interpretation technique may no longer pose a threat to the conclusions of a study.
(RQ4) Does removing all correlated metrics impact the
performance and stability of defect models?
Removing all correlated metrics impacts the AUC,
F-measure, and MCC performance of defect models
by less than 5 percentage points, suggesting that
researchers and practitioners should remove corre-
lated metrics with care especially for safety-critical
software domains.
Based on our ﬁndings, we suggest that: When the goal is to
derive sound interpretation from defect models, our results suggest
that future studies must (1) mitigate correlated metrics prior to
constructing a defect model, especially for ANOVA analyses; and
(2) avoid using ANOVA Type-I even if all correlated metrics are
removed, but instead opt to use ANOVA Type-II and Type-III for
additive and interaction models, respectively. Due to the variety
of the built-in interpretation techniques and their settings, our
paper highlights the essential need for future studies to report
the exact speciﬁcation (i.e., model formula) of their models and
settings (e.g., the calculation methods of the importance score) of
the used interpretation techniques.
1.1 Novelty Statements
To the best of our knowledge, this paper is the ﬁrst to
present:
(1) A series of preliminary analyses of the nature of corre-
lated metrics in defect datasets and their impact on the
interpretation of defect models (Appendix 2).
(2) An investigation of the impact of correlated metrics
on the consistency, the level of discrepancy, and the
direction of the produced rankings by the interpretation
techniques (RQ1).
(3) An empirical evaluation of the consistency of such rank-
ings after removing all correlated metrics (RQ2, RQ3).
(4) An investigation of the impact of removing all corre-
lated metrics on the performance and stability of defect
models (RQ4).
1.2 Paper Organization
Section 2 discusses the analytical modelling process, cor-
related metrics and concerns in the literature, and tech-
niques for mitigating correlated metrics. Section 3 describes
the design of our case study, while Section 4 presents
our results with respect to our four research questions.
Section 5 provides practical guidelines for future studies.
Section 6 discusses the threats to the validity of our study.
Section 7 draws conclusions. For the detailed explanation
of the studied correlation analysis techniques, commonly-
used analytical learners, and interpretation techniques, see
Appendix 1 of the supplementary materials.
Figure 1: An overview of the analytical modelling process: a hypothesis is operationalized as metrics (variables, e.g., m1 and m2), correlation analysis is applied, a model specification (e.g., y ~ m1 + m2) is defined, a model is constructed, and model interpretation identifies the highest-ranked metric.
2 BACKGROUND AND MOTIVATION
2.1 Analytical Modelling Process
Figure 1 provides an overview of the commonly-used an-
alytical modelling process. First, one must formulate a set
of hypotheses pertaining to phenomena of interest (e.g.,
whether the size of a module increases the risk associated
with that module). Second, one must determine a set of
metrics which operationalize the hypothesis of interest (e.g.,
the total lines of code for size, and the number of ﬁeld
reported bugs to capture the risk that is associated with a
module). Third, one must perform a correlation analysis to
remove correlated metrics. Fourth, one must define a model
speciﬁcation (e.g., the ordering of metrics) to be used when
constructing an analytical model. Fifth, one is then ready
to construct an analytical model using a machine learning
technique (e.g., a random forest model) or a statistical learn-
ing technique (e.g., a regression model). Finally, one ana-
lyzes the ranking of the metrics using model interpretation
techniques (e.g., ANOVA or Breiman’s Variable Importance)
in order to test the hypotheses of interest. The importance
ranking of the metrics is essential for project managers to
chart appropriate software quality improvement plans to
mitigate the risk of introducing defects in future releases.
For example, if code complexity is identified as the top-ranked metric, project managers can then suggest that developers reduce the complexity of their code to lower the risk of introducing defects.
2.2 Correlated Metrics and Concerns in the Literature
Correlated metrics are metrics (i.e., independent variables)
that share a strong linear correlation among themselves. In
this paper, we focus on two types of correlation among
metrics, i.e., collinearity and multicollinearity. Collinearity is
a phenomenon in which one metric can be linearly predicted
by another metric. On the other hand, multicollinearity is a
phenomenon in which one metric can be linearly predicted
by a combination of two or more metrics.
Prior work points out that software metrics are often correlated [22, 32, 33, 35, 36, 74, 77, 85]. However, little is known about the prevalence of correlated metrics in publicly-available defect datasets. Thus, we set out to investigate how many defect datasets contain correlated metrics that share a strong relationship with defect-proneness. Unfortunately, the results of our preliminary analysis (PA1) show that correlated metrics that share a strong relationship with defect-proneness are prevalent in 83 of the 101 (82%) publicly-available defect datasets.
In addition, prior work raises concerns that correlated
metrics may impact the interpretation of defect models [77,
85]. To better understand how correlated metrics impact the
interpretation of defect models, we set out to investigate
(1) the impact of the number of correlated metrics on the
importance scores of metrics, and (2) the impact of the ordering of correlated metrics in a model specification on the importance ranking of metrics. The results of our preliminary
analyses (PA2, and PA3) show that the importance scores
of metrics substantially decrease when there are correlated
metrics in the models for both ANOVA analyses of logis-
tic regression and Variable Importance analyses (i.e., Gini
and Permutation) of random forest. The importance scores
of metrics are also sensitive to the ordering of correlated
metrics (except for ANOVA Type-II).
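The order sensitivity of ANOVA Type-I can be illustrated with a small sketch: Type-I credits each metric with the sequential reduction in residual sum of squares, so a correlated metric entered first absorbs the variance it shares with later metrics. The toy example below (synthetic data in Python, not the paper's R setup; the variables size, cc, and bugs are hypothetical) fits ordinary least squares models by hand and compares the credit that size receives under the two orderings:

```python
import random

def ols_rss(predictors, y):
    # Fit ordinary least squares via the normal equations (an intercept is
    # added automatically) and return the residual sum of squares.
    cols = [[1.0] * len(y)] + predictors
    k = len(cols)
    A = [[sum(ci[n] * cj[n] for n in range(len(y))) for cj in cols] for ci in cols]
    b = [sum(ci[n] * y[n] for n in range(len(y))) for ci in cols]
    for p in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for c in range(p, k):
                A[r][c] -= f * A[p][c]
            b[r] -= f * b[p]
    beta = [0.0] * k
    for p in range(k - 1, -1, -1):
        beta[p] = (b[p] - sum(A[p][c] * beta[c] for c in range(p + 1, k))) / A[p][p]
    fitted = [sum(beta[i] * cols[i][n] for i in range(k)) for n in range(len(y))]
    return sum((y[n] - fitted[n]) ** 2 for n in range(len(y)))

random.seed(42)
n = 200
size = [random.gauss(0, 1) for _ in range(n)]
cc = [s + random.gauss(0, 0.1) for s in size]   # cc is strongly correlated with size
bugs = [s + random.gauss(0, 1) for s in size]   # the outcome is driven by size

rss_null = ols_rss([], bugs)
rss_size = ols_rss([size], bugs)
rss_cc = ols_rss([cc], bugs)
rss_both = ols_rss([size, cc], bugs)

# Sequential (Type-I) sums of squares for size under the two orderings:
ss_size_first = rss_null - rss_size    # size entered first: absorbs shared variance
ss_size_second = rss_cc - rss_both     # size entered after cc: only its unique part
```

Under the ordering bugs ~ size + cc, size receives nearly all of the explained variance; under bugs ~ cc + size, its sequential sum of squares collapses to the small part of size that cc cannot explain, mirroring the sensitivity described above.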
2.3 Techniques for Mitigating Correlated Metrics
There is a plethora of techniques that have been used to
mitigate irrelevant and correlated metrics in the domain of
defect prediction, e.g., dimensionality reduction [47, 57, 85],
feature selection [1, 54], and correlation analysis [15, 68, 82].
Dimensionality reduction transforms an initial set of metrics into a set of transformed metrics that is representative of the initial set of metrics. Prior work has adopted
dimensionality reduction techniques (e.g., Principal Com-
ponent Analysis) to mitigate correlated metrics and improve
the performance of defect models [47, 57, 85]. Since the set of transformed metrics does not retain the meaning of the initial set of metrics, and is not sensible for model interpretation and statistical inference [74], we exclude dimensionality reduction techniques from this paper.
Feature selection produces an optimal subset of met-
rics that are relevant and non-correlated. One of the
most commonly-used feature selection techniques is the
correlation-based feature selection technique (CFS) [25]
which searches for the best subset of metrics that share the
highest correlation with the outcome (e.g., defect-proneness)
while having the lowest correlation among each other. To
better understand whether feature selection techniques mit-
igate correlated metrics, we set out to perform a correlation
analysis on the metrics that are selected by feature selection
techniques. In this preliminary analysis (PA4), we focus on
the two commonly-used techniques in the domain of de-
fect prediction, i.e., Information Gain and correlation-based
feature selection techniques. The results of this preliminary
analysis show that the metrics that are selected by the two
studied feature selection techniques are correlated (with
a Spearman correlation coefﬁcient up to 0.98), suggesting
that the commonly-used feature selection techniques do not
mitigate correlated metrics.
Correlation analysis is used to measure the correlation
among metrics given a threshold. Prior work applies corre-
lation analysis techniques to identify and mitigate correlated
metrics [15, 68, 82, 83]. Based on a literature survey of
Hall et al. [26] and Shihab [67], we select the commonly-used
correlation analysis techniques: Variable Clustering analysis
(VarClus), and Variance Inﬂation Factor (VIF).
Since there are many analytical learners that can be
used to investigate the impact of correlated metrics on
defect models, the aforementioned surveys guide our selec-
tion of the two commonly-used analytical learners: logistic
regression [5, 6, 15, 43, 53, 57, 58, 65, 87] and random
forest [23, 24, 38, 55, 64]. These techniques are two of the
most commonly-used analytical learners for defect models
and they have built-in techniques for model interpretation
Figure 2: An overview diagram of the design of our case study: from a defect dataset we derive a non-mitigated dataset and, after removing correlated metrics, a mitigated dataset; defect models are then constructed from each, producing non-mitigated and mitigated models.
Table 1: A statistical summary of the studied datasets.

Project      Dataset        Modules  Metrics  Correlated Metrics  EPV  AUC(LR)  AUC(RF)
Apache       Lucene 2.4         340       20                   9   10     0.74     0.77
             POI 2.5            385       20                  11   12     0.80     0.90
             POI 3.0            442       20                  10   14     0.79     0.88
             Xalan 2.6          885       20                   8   21     0.79     0.85
             Xerces 1.4         588       20                  11   22     0.91     0.95
Eclipse      Debug 3.4        1,065       17                   9   15     0.72     0.81
             JDT                997       15                  10   14     0.81     0.82
             Mylyn            1,862       15                  10   16     0.78     0.74
             PDE              1,497       15                   9   14     0.72     0.72
             Platform 2.0     6,729       32                  24   30     0.82     0.84
             Platform 3.0    10,593       32                  24   49     0.79     0.81
             SWT 3.4          1,485       17                   7   38     0.87     0.97
Proprietary  Prop 1          18,471       20                  10  137     0.75     0.79
             Prop 4           8,718       20                  11   42     0.74     0.72
(i.e., ANOVA for logistic regression and Breiman’s Variable
Importance for random forest). Finally, we select 9 model
interpretation techniques, ANOVA Type-I, ANOVA Type-II
with 4 test statistics (i.e., Wald, Likelihood Ratio, F, and Chi-
square), scaled and non-scaled Gini Importance, and scaled
and non-scaled Permutation Importance. We provide the
detailed explanation of the studied correlation analysis tech-
niques, analytical learners, and interpretation techniques in
Table 2 and Appendix 1 of the supplementary materials.
3 CASE STUDY DESIGN
In this study, we use 14 datasets of systems that span
across proprietary and open-source systems. We discuss
the selection criteria of the studied datasets in Appendix 3
of the supplementary materials. Table 1 shows a statistical
summary of the studied datasets, while Figure 2 provides an
overview of the design of our case study. Below, we discuss the design of the case study that we perform in order to address our four research questions.
3.1 Remove Correlated Metrics
To investigate the impact of correlated metrics on the per-
formance and interpretation of defect models and address
our four research questions, we start by removing highly-
correlated metrics in order to produce mitigated datasets,
i.e., datasets where correlated metrics are removed. To do so,
we apply variable clustering analysis (VarClus) and variance inflation factor (VIF) analysis (see Appendix 1.1). We use the interpretation of Spearman correlation coefficients (|ρ|)
as provided by Kraemer et al. [45] to identify correlated
metrics, i.e., a Spearman correlation coefﬁcient of above 0.7
is considered a strong correlation. We use a VIF threshold
of 5 to identify inter-correlated metrics, as it is suggested by
Fox [18] and is commonly used in prior work [4, 50, 68, 69].
We use the implementation of the variable clustering anal-
ysis as provided by the varclus function of the Hmisc R
package [28]. We use the implementation of the VIF analysis
as provided by the vif function of the rms R package [30].
Finally, we report the results of correlation analysis and a set
of mitigated metrics for each of the studied defect datasets
in the online appendix [34].
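As a rough sketch of this correlation screen (not the Hmisc::varclus implementation used in the study), Spearman's ρ can be computed by ranking both metrics and applying Pearson's formula to the ranks; the example data below are hypothetical:

```python
def spearman(x, y):
    # Spearman's rho: Pearson correlation of the rank-transformed values,
    # with tied values receiving their average rank.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank across the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical file-level measurements: cc is nearly monotone in loc.
loc = [10, 25, 40, 55, 300, 80, 120]
cc = [1, 2, 4, 5, 30, 7, 12]
rho = spearman(loc, cc)
correlated = abs(rho) > 0.7  # the strong-correlation cut-off used in the paper
```

Because the example is perfectly monotone, ρ reaches 1 and the pair is flagged; in the study, one metric of each such pair would be removed before model construction.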
3.2 Construct Defect Models
To examine the impact of correlated metrics on the perfor-
mance and interpretation of defect models, we construct our
models using the non-mitigated datasets (i.e., datasets where
correlated metrics are not removed) and mitigated datasets
(i.e., datasets where correlated metrics are removed). To
construct defect models, we perform the following steps:
(CM1) Generate bootstrap samples. To ensure that our
conclusions are statistically sound and robust, we use the
out-of-sample bootstrap validation technique, which lever-
ages aspects of statistical inference [17, 19, 29, 72, 78]. We
first generate bootstrap samples of size N with replacement from the mitigated and non-mitigated datasets, where N is the size of the original dataset. We construct models using
the bootstrap samples, while we measure the performance
of the models using the samples that do not appear in
the bootstrap samples. On average, 36.8% of the original
dataset will not appear in the bootstrap samples, since the
samples are drawn with replacement [17]. We repeat the out-of-sample bootstrap process 100 times and report the average performance.
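A minimal sketch of one out-of-sample bootstrap iteration, assuming rows can be indexed by position (an illustration, not the paper's R implementation):

```python
import random

def out_of_sample_bootstrap(dataset, seed=0):
    # Draw a bootstrap sample of size N with replacement; the rows that are
    # never drawn (~36.8% of rows on average) form the out-of-sample test set.
    rng = random.Random(seed)
    n = len(dataset)
    drawn = [rng.randrange(n) for _ in range(n)]
    train = [dataset[i] for i in drawn]
    test = [dataset[i] for i in sorted(set(range(n)) - set(drawn))]
    return train, test

rows = list(range(1000))  # stand-in for a defect dataset of 1,000 modules
train, test = out_of_sample_bootstrap(rows)
```

Repeating this with 100 different seeds, training on each bootstrap sample, and evaluating on the corresponding out-of-sample rows reproduces the validation scheme described above.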
(CM2) Construct defect models. For each bootstrap
sample, we construct logistic regression and random forest
models. We use the implementation of logistic regression as
provided by the glm function of the stats R package [80]
and the lrm function of the rms R package [30] with the
default parameter setting. We use the implementation of
random forest as provided by the randomForest function
of the randomForest R package [9] with the default ntree
value of 100, since recent studies [76, 79] show that the performance of defect models is insensitive to the parameter settings of random forest. To ensure that the training and testing corpora share similar characteristics and are representative of the original dataset, we do not re-balance nor re-sample the training data, to avoid any impact on the interpretation
of defect models [75].
3.3 Analyze the Model Interpretation
To address RQ1, RQ2, and RQ3, we analyze the importance
ranking of metrics of the models that are constructed using
non-mitigated datasets and mitigated datasets. The analysis
of model interpretation is made up of 2 steps.
(MI1) Compute the importance score of metrics. We
investigate the impact of correlated metrics on the interpre-
tation of defect models using different model interpretation
techniques. Thus, we apply the 9 studied model interpre-
tation techniques, i.e., Type-I, Type-II (Wald, LR, F, Chisq),
scaled and non-scaled Gini Importance, and scaled and non-
scaled Permutation Importance. We provide the technical
description of the studied interpretation techniques in Ap-
pendix 1.3 of the supplementary materials.
(MI2) Identify the importance ranking of metrics.
To statistically identify the importance ranking of metrics,
we apply the improved Scott-Knott Effect Size Difference
(ESD) test (v2.0) [73]. The Scott-Knott ESD test is a mean
comparison approach that leverages a hierarchical cluster-
ing to partition a set of treatment means (i.e., means of
Table 2: A summary of the studied correlation analysis techniques, the two studied analytical learners, and the 9 studied interpretation techniques.

Correlation analysis techniques:
  Variable Clustering [56, 81-83]
  Variance Inflation Factor [4, 15, 50, 68, 69]
  Redundancy Analysis [2, 37, 56, 70, 77]

Analytical learner: Logistic Regression (glm and lrm) [5, 6, 57, 58, 87]
  Interpretation  Test Statistic         R function
  Type-I          Deviance               stats::anova(glm.model)
  Type-II         Wald                   car::Anova(glm.model, type=2, test.statistic='Wald')
  Type-II         Likelihood Ratio (LR)  car::Anova(glm.model, type=2, test.statistic='LR')
  Type-II         F                      car::Anova(glm.model, type=2, test.statistic='F')
  Type-II         Chi-square             rms::anova(lrm.model, test='Chisq')

Analytical learner: Random Forest [23, 24, 38, 55, 64]
  Interpretation          Test Statistic        R function
  Scaled Gini             MeanDecreaseGini      randomForest::importance(model, type=2, scale=TRUE)
  Non-scaled Gini         MeanDecreaseGini      randomForest::importance(model, type=2, scale=FALSE)
  Scaled Permutation      MeanDecreaseAccuracy  randomForest::importance(model, type=1, scale=TRUE)
  Non-scaled Permutation  MeanDecreaseAccuracy  randomForest::importance(model, type=1, scale=FALSE)
importance scores) into statistically distinct groups with
statistically non-negligible differences. The Scott-Knott ESD test assigns each metric to exactly one rank; however, several metrics may appear within one rank. Finally, we identify
the importance ranking of metrics for the non-mitigated and
mitigated models. Thus, each metric has a rank for each
model interpretation technique and for each of the mitigated
and non-mitigated models. We use the implementation of
Scott-Knott ESD test as provided by the sk_esd function of
the ScottKnottESD R package [73].
3.4 Analyze the Model Performance
To address RQ4, we analyze the performance of the mod-
els that are constructed using non-mitigated datasets and
mitigated datasets.
First, we use the Area Under the receiver operator char-
acteristic Curve (AUC) to measure the discriminatory power
of our models, as suggested by recent research [21, 46, 62].
The AUC is a threshold-independent performance measure
that evaluates the ability of models in discriminating between defective and clean modules. The AUC ranges from 0 (worst performance) to 1 (best performance), where 0.5 indicates performance that is no better than random guessing [27].
Second, we use the F-measure, i.e., a threshold-dependent measure. The F-measure is the harmonic mean of precision and recall, i.e., (2 · precision · recall)/(precision + recall), where precision = TP/(TP + FP) and recall = TP/(TP + FN). Similar to prior studies [1, 86], we use the default probability value of 0.5 as the threshold for the confusion matrix, i.e., if a module has a predicted probability above 0.5, it is considered defective; otherwise, the module is considered clean.
Third, we use the Matthews Correlation Coefficient (MCC), i.e., a threshold-dependent measure, as suggested by prior studies [48, 66]. MCC is a balanced measure based on true and false positives and negatives that is computed using the following equation: (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
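The three measures can be sketched in a few lines of stdlib code (an illustration, not the R implementations used in the study; the example scores and labels are hypothetical), with AUC computed via the rank-sum formulation:

```python
import math

def confusion(scores, labels, threshold=0.5):
    # Confusion matrix counts at the given probability threshold (label 1 = defective).
    tp = sum(1 for s, l in zip(scores, labels) if s > threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s > threshold and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s <= threshold and l == 1)
    tn = sum(1 for s, l in zip(scores, labels) if s <= threshold and l == 0)
    return tp, fp, fn, tn

def f_measure(scores, labels):
    # Harmonic mean of precision and recall at the default 0.5 threshold.
    tp, fp, fn, _ = confusion(scores, labels)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(scores, labels):
    # Matthews Correlation Coefficient from the confusion matrix counts.
    tp, fp, fn, tn = confusion(scores, labels)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom

def auc(scores, labels):
    # Probability that a randomly chosen defective module scores higher than
    # a randomly chosen clean one (ties count half).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]  # hypothetical predicted probabilities
labels = [1, 1, 0, 1, 0, 0]              # hypothetical defect labels
```

On this toy data, AUC is 8/9 while F-measure and MCC are 2/3 and 1/3, illustrating that the threshold-independent and threshold-dependent views of the same model can differ.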
4 CASE STUDY RESULTS
In this section, we present the results of our case study with
respect to our four research questions.
(RQ1) How do correlated metrics impact the interpreta-
tion of defect models?
Motivation. Prior work raises concerns that metrics are
often correlated and should be mitigated [22, 32, 33, 35, 74,
77, 85]. For example, Herraiz et al. [33], and Gil et al. [22]
point out that code complexity is often correlated with lines
of code. Unfortunately, a literature survey of Shihab [67]
shows that as much as 63% of prior defect studies do
not mitigate correlated metrics prior to constructing defect
models. Yet, little is known about the impact of correlated
metrics on the interpretation of defect models.
Approach. To address RQ1, we analyze the impact of correlated metrics on the interpretation of defect models along three dimensions, i.e., (1) the consistency of the
top-ranked metric, (2) the level of discrepancy of the top-
ranked metric, and (3) the direction of the ranking of metrics
for all non-correlated metrics between mitigated and non-
mitigated models.
To do so, we start from mitigated datasets (see Sec-
tion 3.1). We ﬁrst identify the top-ranked metric for each of
the 9 studied interpretation techniques. We use VarClus to
select only one of the metrics that is correlated with the top-
ranked metric in order to generate non-mitigated datasets.
We then append the correlated metric to the ﬁrst position
of the specification of the mitigated models. Thus, the specification for the mitigated models is y ~ m_top-ranked + ..., while the specification for the non-mitigated models is y ~ m_correlated + m_top-ranked + ..., where m_correlated is the metric that is correlated with the top-ranked metric (m_top-ranked). For each of the mitigated and non-mitigated
datasets, we construct defect models (see Section 3.2) and
apply the 9 studied model interpretation techniques (see
Section 3.3).
To analyze the consistency and the level of discrepancy of
the top-ranked metric, we compute the difference in the
ranks of the top-ranked metric between mitigated and non-mitigated models. For example, if a metric m_top-ranked appears at the 1st rank in both the mitigated and non-mitigated models, then the metric would have a rank difference of 0. However, if m_top-ranked appears at the 3rd rank of a non-mitigated model and at the 1st rank of a mitigated model, then the rank difference of m_top-ranked would be 2.
The consistency of the top-ranked metric measures the per-
centage of the studied datasets that the top-ranked metric
appears at the 1st rank in both mitigated and non-mitigated
models. On the other hand, the level of discrepancy of the
top-ranked metric measures the highest rank difference of
the top-ranked metric between mitigated and non-mitigated
models.
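The two measures can be sketched as follows, using hypothetical ranks for the top-ranked metric across five datasets:

```python
# Hypothetical ranks of the top-ranked metric in the non-mitigated models
# across five datasets (by construction, its rank in each mitigated model is 1).
non_mitigated_ranks = [1, 3, 1, 2, 1]
rank_diffs = [r - 1 for r in non_mitigated_ranks]

# Consistency: percentage of datasets where the metric stays at the 1st rank.
consistency = 100.0 * sum(1 for d in rank_diffs if d == 0) / len(rank_diffs)

# Level of discrepancy: the highest rank difference across the datasets.
discrepancy = max(rank_diffs)
```

Here the consistency is 60% (three of five datasets) and the level of discrepancy is 2, i.e., the worst case pushes the top-ranked metric down to the 3rd rank.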
To analyze the direction of the ranking of metrics for all non-correlated metrics between mitigated and non-mitigated models, we consider the non-correlated
Figure 3: The percentage of the studied datasets for each difference in the ranks between the top-ranked metric of the models that are constructed using the mitigated and non-mitigated datasets (one panel per technique: Type-I; Type-II (Wald); Type-II (LR); Type-II (F); Type-II (Chisq); scaled and non-scaled Gini; scaled Permutation; non-scaled Permutation; x-axis: rank difference of the top-ranked metric; y-axis: percentage of the studied datasets). The light blue bars represent the consistent rank of the metric between mitigated and non-mitigated models, while the red bars represent the inconsistent rank of the metric between mitigated and non-mitigated models.
metrics that appear in both mitigated and non-mitigated
models. We then apply a Spearman rank correlation test
(ρ) to compute the correlation between the ranks of all non-
correlated metrics between mitigated and non-mitigated
models. A Spearman correlation coefficient (ρ) of 1 in-
dicates that the ranking of non-correlated metrics between
mitigated and non-mitigated models is in the same direc-
tion. On the other hand, a Spearman correlation coefficient
(ρ) of -1 indicates that the ranking of non-correlated metrics
between mitigated and non-mitigated models is in the re-
verse direction. Since the produced ranking of each model
and defect dataset may be different, the use of a weighted
Spearman rank correlation test may lead these rankings
to be weighted differently. Thus, we select the traditional
Spearman rank correlation test for our study. We report the
distributions of the Spearman correlation coefficients (ρ) of
the ranking of metrics between mitigated and non-mitigated
models for all studied interpretation techniques in Figure S7
in the supplementary materials.
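This direction analysis reduces to a single scipy call. In the hypothetical example below, two of four non-correlated metrics shared by both models swap ranks:

```python
from scipy.stats import spearmanr

# Hypothetical ranks of four non-correlated metrics that appear
# in both the mitigated and the non-mitigated model.
mitigated_ranks = [1, 2, 3, 4]
non_mitigated_ranks = [1, 2, 4, 3]  # the last two metrics swap places

rho, _ = spearmanr(mitigated_ranks, non_mitigated_ranks)
print(round(rho, 2))  # → 0.8 (close to 1: the rankings point in the same direction)
```

A coefficient near -1 would instead indicate that the two models order the same metrics in reverse.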
Results. ANOVA Type-I produces the lowest consistency
of the top-ranked metric between mitigated and non-
mitigated models. We expect that the top-ranked metric in
the mitigated model will remain the top-ranked metric
in the non-mitigated model. Unfortunately, Figure 3 shows
that this expectation does not hold true in any of the
studied datasets for ANOVA Type-I: none of the top-ranked
metrics that appear at the 1st rank of mitigated models also
appear at the 1st rank of non-mitigated models. On the other
hand, we find that the top-ranked metric of mitigated models
appears as the top-ranked metric in non-mitigated models
for 84%, 67%, 55%, and 67% of the studied datasets for
ANOVA Type-II (Wald), Type-II (LR), Type-II (F), and Type-II
(Chisq), respectively. We suspect that the impact of correlated
metrics on the interpretation of Type-I has to do with the
sequential nature of the calculation of the Sum of Squares,
i.e., Type-I attributes as much variance as it can to the first
metric before attributing residual variance to the second
metric in the model specification.
ANOVA Type-I and Type-II produce the highest level
of discrepancy of the top-ranked metrics between miti-
gated and non-mitigated models. Figure 3 shows that the
rank difference for ANOVA Type-I and Type-II can be up to
-6 and -8, respectively. In other words, we find that the top-
ranked metric in mitigated models can appear at the 7th rank
and the 9th rank in non-mitigated models for ANOVA Type-
I and Type-II, respectively. We suspect that the highest level
of discrepancy (i.e., the largest rank difference) for ANOVA
Type-I and Type-II has to do with the sharp drop of the
importance scores when correlated metrics are included in
defect models (see PA1).
For ANOVA Type-I and Type-II, correlated metrics
have the largest impact on the direction of the ranking
of metrics of defect models. For ANOVA Type-I, we ﬁnd
that the Spearman correlation coefﬁcients range from -0.1 to
0.84 (see Figure S7). For ANOVA Type-II, we ﬁnd that the
Spearman correlation coefficients range from 0.1 to 1. A low
value of the Spearman correlation coefficient for the ANOVA
techniques indicates that the direction of the ranking of
the non-correlated metrics varies across ranks, suggesting
that the ranking of non-correlated metrics is inconsistent
across mitigated and non-mitigated models.
Gini and Permutation Importance approaches produce
the highest consistency and the lowest level of discrepancy
of the top-ranked metric, and have the least impact on the
direction of the ranking of metrics between mitigated and
non-mitigated models. Figure 3 shows that the top-ranked
metric of mitigated models appears as the top-ranked metric
in non-mitigated models for 88%, 92%, and 55% of the
studied datasets for Gini Importance, scaled Permutation
Importance, and non-scaled Permutation Importance, re-
spectively. Figure 3 also shows that the rank difference for
Gini and Permutation Importance is as low as -1 and -3,
respectively. Furthermore, we find that the Spearman corre-
lation coefficients range from 0.9 to 1 for Gini and Permu-
tation Importance (see Figure S7). These findings suggest
that the lower impact of correlated metrics on Gini and
Permutation Importance, compared to the ANOVA tech-
niques, has to do with the random process for constructing
multiple trees and the calculation of importance scores for
a random forest model. For example,
the random process of random forest may generate trees that
are constructed without correlated metrics. In addition, the
averaging of the importance scores from multiple trees may
decrease the negative impact of correlated metrics on trees
that are constructed with correlated metrics for random
forest models.
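The averaging effect can be seen in a small sklearn sketch (an illustrative assumption, not the paper's R randomForest setup; the three metrics are hypothetical): a correlated pair ends up sharing the Gini importance that either metric would carry alone, while an unrelated metric stays behind in the ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 500
m1 = rng.normal(size=n)
m2 = m1 + rng.normal(scale=0.1, size=n)   # correlated copy of m1
m3 = rng.normal(size=n)                   # unrelated metric
X = np.column_stack([m1, m2, m3])
y = (m1 + rng.normal(scale=0.5, size=n) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Gini importance, averaged over the randomized trees: m1 and m2
# split the importance between them, while m3 remains the lowest.
print(rf.feature_importances_)
```

Because each tree sees a random subset of metrics at each split, some trees lean on m1 and others on m2, so neither correlated metric is starved the way a sequential ANOVA would starve the later-listed one.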
ANOVA Type-I and Type-II often produce the lowest consis-
tency and the highest level of discrepancy of the top-ranked
metric, and have the highest impact on the direction of the
ranking of metrics between mitigated and non-mitigated mod-
els when compared to Gini and Permutation Importance. This
ﬁnding highlights the risks of not mitigating correlated metrics
in the ANOVA analyses of prior studies.
(RQ2) After removing all correlated metrics, how con-
sistent is the interpretation of defect models among
different model speciﬁcations?
Motivation. Our motivating analysis (see Appendix 2) and
the results of RQ1 conﬁrm that the ranking of the top-
ranked metric substantially changes when the ordering of
correlated metrics in a model speciﬁcation is rearranged,
suggesting that correlated metrics must be removed. How-
ever, after removing correlated metrics, little is known about
whether the interpretation of defect models becomes consis-
tent when rearranging the ordering of metrics.
Approach. To address RQ2, we analyze the ranking of the
top-ranked metric of the models that are constructed using
different ordering of metrics from mitigated datasets. To do
so, we start from the mitigated datasets that are produced in
Section 3.1. For each of these datasets, we construct defect
models (see Section 3.2) and apply the 9 studied model in-
terpretation techniques (see Section 3.3) in order to identify
the top-ranked metric according to each technique. Then,
we regenerate the models where the ordering of metrics is
rearranged—the top-ranked metric is at each position from
the ﬁrst to the last for each dataset. Finally, we compute the
percentage of datasets where the ranks of the top-ranked
metric are inconsistent among the rearranged datasets.
Results. After removing correlated metrics, the top-ranked
metrics according to Type-II, Gini Importance, and Per-
mutation Importance are consistent. However, the top-
ranked metric according to Type-I is still inconsistent
regardless of the ordering of metrics. We ﬁnd that Type-
II, Gini Importance, and Permutation Importance produce a
stable ranking of the top-ranked metric for all of the studied
datasets regardless of the ordering of metrics.
On the other hand, ANOVA Type-I is the only technique
that produces an inconsistent ranking of the top-ranked
metric. We ﬁnd that for 71% of the studied datasets, ANOVA
Type-I produces an inconsistent ranking of the top-ranked
metric when the ordering of metrics is rearranged. We
expect that the consistency of the ranking of the top-ranked
metrics can be improved by increasing the strictness of
the correlation threshold of the variable clustering analy-
sis (VarClus). Thus, we repeat the analysis using stricter
thresholds of the variable clustering analysis (VarClus). We
use Spearman correlation coefficient (|ρ|) threshold values
of 0.5 and 0.6. Unfortunately, even if we increase the strict-
ness of the correlation threshold value, Type-I produces an
inconsistent ranking of the top-ranked metric for 43% and
50% of the studied datasets for the thresholds of 0.5 and 0.6,
respectively.
The inconsistent ranking of the top-ranked metric ac-
cording to Type-I has to do with the sequential nature of
the calculation of the Sum of Squares (see Appendix 1.3).
In other words, Type-I attributes the importance scores as
much as it can to the ﬁrst metric before attributing the scores
to the second metric in the model speciﬁcation. Thus, Type-I
is sensitive to the ordering of metrics.
After removing all correlated metrics, the top-ranked metrics
according to ANOVA Type-II, Gini Importance, and Permuta-
tion Importance are consistent. However, the top-ranked metric
according to ANOVA Type-I is inconsistent, since the ranking
of metrics is impacted by its order in the model speciﬁcation
when analyzed using ANOVA Type-I (which is the default
analysis for the glm model in R and is commonly-used in
prior studies). This ﬁnding suggests that ANOVA Type-I must
be avoided even if all correlated metrics are removed.
(RQ3) After removing all correlated metrics, how con-
sistent is the interpretation of defect models among the
studied interpretation techniques?
Motivation. The ﬁndings of prior work often rely heavily
on one model interpretation technique [23, 37, 55, 56, 70,
77]. Therefore, the ﬁndings of prior work may pose a threat
to construct validity, i.e., the ﬁndings may not hold true if
one uses another interpretation technique. Thus, we set out
to investigate the consistency of the top-ranked and top-3
ranked metrics after removing correlated metrics.
Approach. To address RQ3, we start from the mitigated datasets
that are produced in Section 3.1 and the non-mitigated datasets
(i.e., the original datasets). We compare the two rank-
ings that are produced from mitigated and non-mitigated
models using the 9 interpretation techniques for each of
the studied datasets. Then, we compute the percentage of
datasets where the top-ranked metric is consistent among
the studied model interpretation techniques. Moreover, we
also compute the percentage of datasets where at least
one of the top-3 ranked metrics is consistent among the
studied model interpretation techniques. Finally, we present
the results using a heatmap (as shown in Figure 4) where
each cell indicates the percentage of datasets for which the
top-ranked metric is consistent between the two studied
model interpretation techniques.
Results. Before removing all correlated metrics, we find
that the studied model interpretation techniques do not
tend to produce the same top-ranked metric. We observe
that the consistency of the ranking of metrics across learning
techniques (i.e., logistic regression and random forest) is
as low as 0%-43% for the top-ranked metric (Figure 4a)
and 21%-57% for the top-3 ranked metrics (see Figure S8a),
respectively. Furthermore, according to the lower-left side of
the matrix in Figure 4a, we find that, before removing
correlated metrics, the top-ranked metric of Type-II (Chisq)
is inconsistent with the top-ranked metrics of Type-I and
Gini Importance for all of the studied datasets.
After removing all correlated metrics, we ﬁnd that the
consistency of the ranking of metrics among the studied
interpretation techniques is improved by 15%-64% for
(a) The top-ranked metric for non-mitigated models.
(b) The top-ranked metric for mitigated models.
Figure 4: The percentage of datasets where the top-ranked metric is consistent between the two studied model interpretation
techniques. While the lower-left side of the matrix (i.e., red shades) shows the percentage before removing correlated
metrics, the upper-right side of the matrix (i.e., blue shades) shows the percentage after removing correlated metrics. For
the consistency of the top-3 ranked metrics, see Figure S8 in the supplementary materials.
the top-ranked metric and 21%-71% for the top-3 ranked
metrics, respectively. In particular, we observe that the con-
sistency of the ranking of metrics across learning techniques
is improved by 28%-50% for the top-1 ranked metrics and
43%-71% for the top-3 ranked metrics, respectively. Most
importantly, we ﬁnd that scaled Permutation Importance
achieves the highest consistency of the ranking of metrics
across learning techniques. This ﬁnding highlights the ben-
eﬁts of removing correlated metrics on the interpretation
of defect models—the conclusions of studies that rely on
one interpretation technique may not pose a threat after
mitigating correlated metrics.
After removing all correlated metrics, we ﬁnd that the consis-
tency of the ranking of metrics among the studied interpreta-
tion techniques is improved by 15%-64% for the top-ranked
metric and 21%-71% for the top-3 ranked metrics, respectively,
highlighting the beneﬁts of removing all correlated metrics
on the interpretation of defect models, i.e., the conclusions of
studies that rely on one interpretation technique may not pose
a threat after mitigating correlated metrics.
(RQ4) Does removing all correlated metrics impact the
performance and stability of defect models?
Motivation. The results of RQ1 show that correlated met-
rics have a negative impact on the interpretation of defect
prediction models, while the results of RQ2 and RQ3 show
the beneﬁts of removing correlated metrics on the inter-
pretation of defect models. Thus, removing correlated met-
rics is highly recommended. However, removing correlated
metrics may pose a risk to the performance and stability
of defect models. Yet, little is known about whether removing such
correlated metrics impacts the performance and stability of
defect models.
Figure 5: The distributions of the performance difference (% pts) between non-mitigated and mitigated models for each of the studied datasets.

Approach. To address RQ4, we first start from the AUC,
F-measure, and MCC performance estimates and their per-
formance stability of the non-mitigated and mitigated mod-
els. The performance stability is measured by a standard
deviation of the performance estimates as produced by 100
iterations of the out-of-sample bootstrap for each model. We
then quantify the impact of removing all correlated metrics
by measuring the performance difference (i.e., the arithmetic
difference between the performance of the non-mitigated
and mitigated models) and the stability ratio (i.e., the ratio of
the S.D. of the performance estimates of the non-mitigated
models to that of the mitigated models, S.D.(non-mitigated) /
S.D.(mitigated)). Furthermore, in order to measure the effect
size of the impact, we compute Cliff's |δ| effect size for the
performance difference and the stability ratio across the
non-mitigated and mitigated models.
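The out-of-sample bootstrap S.D. and the stability ratio can be sketched as follows (a Python/sklearn illustration on synthetic data; the classifier, dataset, and the "drop the last three metrics" stand-in for mitigation are all hypothetical, not the paper's setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def bootstrap_auc_sd(X, y, iterations=100, seed=0):
    """S.D. of AUC over out-of-sample bootstrap iterations: train on a
    bootstrap sample, evaluate on the rows left out of that sample."""
    rng = np.random.default_rng(seed)
    n, aucs = len(y), []
    for _ in range(iterations):
        idx = rng.integers(0, n, n)              # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)    # out-of-sample rows
        model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        aucs.append(roc_auc_score(y[oob], model.predict_proba(X[oob])[:, 1]))
    return float(np.std(aucs))

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
sd_non_mitigated = bootstrap_auc_sd(X, y)
sd_mitigated = bootstrap_auc_sd(X[:, :5], y)     # stand-in for removing correlated metrics
print(sd_non_mitigated / sd_mitigated)           # stability ratio
```

A stability ratio near 1 means removing the metrics barely changed how much the performance estimate fluctuates across bootstrap iterations.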
Results. Removing all correlated metrics impacts the
AUC, F-measure, and MCC performance of defect models
by less than 5 percentage points. Figure 5 shows that
the distributions of the performance difference between
the models that are constructed using non-mitigated and
mitigated datasets are centered at zero. In addition, our
Cliff's |δ| effect size test shows that the differences between
the models that are constructed using mitigated and non-
mitigated datasets are negligible to small for the AUC, F-
measure, and MCC measures. However, the performance
difference of 5 percentage points may be very important
for safety-critical software domains. Thus, researchers and
practitioners should remove correlated metrics with care.
Removing all correlated metrics yields a negligible
difference (Cliff's |δ|) in the stability of the performance
of defect models. Figure 6 shows that the distributions
of the stability ratio of the models that are constructed
using non-mitigated and mitigated datasets are centered at
one (i.e., there is little difference in model stability after
removing all correlated metrics). Moreover, our Cliff's |δ|
effect size test shows that the difference of the stability ratio
between the models that are constructed using mitigated
and non-mitigated datasets is negligible.
Removing all correlated metrics impacts the AUC, F-measure,
and MCC performance of defect models by less than 5 per-
centage points, suggesting that researchers and practitioners
should remove correlated metrics with care especially for safety-
critical software domains.
5 PRACTICAL GUIDELINES
In this section, we offer practical guidelines for future stud-
ies. When the goal is to derive sound interpretation from
defect models:
(1) One must mitigate correlated metrics prior to con-
structing a defect model, especially for ANOVA anal-
yses, since RQ1 shows that (1) ANOVA Type-I and Type-
II often produce the lowest consistency and the highest
level of discrepancy of the top-ranked metric, and have
the highest impact on the direction of the ranking of
metrics between mitigated and non-mitigated models.
On the other hand, the results of RQ2 and RQ3 show
that removing all correlated metrics (2) improves the
consistency of the top-ranked metric regardless of the
ordering of metrics; and (3) improves the consistency of
the ranking of metrics among the studied interpreta-
tion techniques, suggesting that correlated metrics must
be mitigated. However, the results of RQ4 show that the
removal of such correlated metrics impacts the model
performance by less than 5 percentage points, suggest-
ing that researchers and practitioners should remove
correlated metrics with care especially for safety-critical
software domains.
(2) One must avoid using ANOVA Type-I even if all
correlated metrics are removed, since RQ2 shows that
Type-I produces an inconsistent ranking of the top-
ranked metric when the orders of metrics are rear-
ranged, indicating that Type-I is sensitive to the ordering
of metrics even when removing all correlated metrics.
Instead, researchers should opt to use ANOVA Type-II
and Type-III for additive and interaction logistic re-
gression models, respectively. Furthermore, the scaled
Permutation Importance approach is recommended for
random forest since RQ3 shows that such approach
achieves the highest consistency across learning tech-
niques.
Figure 6: The distributions of the stability ratio of non-mitigated to mitigated models for each of the studied datasets.

Finally, we would like to emphasize that mitigating cor-
related metrics is not necessary for all studies, all scenarios,
all datasets, and all analytical models in software engineer-
ing. Instead, the key message of our study is that correlated
metrics must be mitigated when the goal
is to derive sound interpretation from defect models that
are trained with correlated metrics (especially for ANOVA
Type-I). On the other hand, if the goal of the study is
to produce highly-accurate prediction models, one might
prioritize their resources on improving the model perfor-
mance rather than mitigating correlated metrics. Thus, fea-
ture selection and dimensionality reduction techniques can
be considered to mitigate irrelevant and correlated metrics,
and improve model performance.
6 THREATS TO VALIDITY
Construct Validity. In this work, we only construct regres-
sion models in an additive fashion (y ∼ m1 + ... + mn),
since metric interactions (i.e., the relationship between each
of the two interacting metrics depends on the value of the
other metrics) (1) are rarely explored in software engineer-
ing (except in [66]); (2) must be statistically insigniﬁcant
(e.g., absence) for ANOVA Type-II test [13, 18]; and (3)
are not compatible with random forest [8] which is one
of the most commonly-used analytical learners in software
engineering. On the other hand, the importance score of
the metric produced by ANOVA Type-III is evaluated after
all of the other metrics and all metric interactions of the
metric under examination have been accounted for. Thus, if
metric interactions are signiﬁcantly present, one should use
ANOVA Type-III and avoid using ANOVA Type-II. Due to
the same way in which the importance scores of metrics
according to ANOVA Type-II and Type-III are calculated
in a hierarchical nature (see Appendix 1.3) for an additive
model, we would like to note that the importance scores of
metrics according to ANOVA Type-II and Type-III are the
same for additive models.
Plenty of prior work shows that the parameters of clas-
sification techniques have an impact on the performance
of defect models [20, 44, 51, 52, 76, 79]. While we use a
default ntree value of 100 for random forest models, recent
studies [76, 79] show that the performance of defect models
is insensitive to the parameters of random forest. Thus, the
parameters of random forest models do not pose a threat
to the validity of our study.
Recent work points out that the selection [39, 76] and
the quality [84] of datasets might impact the conclusions
of a study. Thus, our conclusions may alter when changing
the set of the studied datasets. Moreover,
Tantithamthavorn et al. [78] point out that randomness may
introduce bias to the conclusions of a study. To mitigate this
threat and ensure that our results are reproducible, we set a
random seed in every step in our experiment design.
Internal Validity. Recent research uses ridge regression to
construct defect models on the dataset that contains corre-
lated metrics [63]. However, our additional analyses (Ap-
pendix 4 of the supplementary materials) show that ridge
regression improves the performance of defect models when
compared to logistic regression, yet produces a misleading
importance ranking of metrics. We observe that metrics that
are highly correlated appear at different ranks. This obser-
vation highlights the importance of mitigating correlated
metrics when interpreting defect models.
We studied a limited number of model interpretation
techniques. Thus, our results may not generalize to other
model interpretation techniques. Nonetheless, other model
interpretation techniques can be explored in future work.
We provide a detailed methodology for others who would
like to re-examine our ﬁndings using unexplored model
interpretation techniques.
External Validity. The analyzed datasets are part of several
corpora (e.g., NASA and PROMISE) of systems that span
both proprietary and open source domains. However, we
studied a limited number of defect datasets. Thus, the
results may not generalize to other datasets and domains.
Nonetheless, additional replication studies are needed.
In our study, we exclude (1) datasets that are not rep-
resentative of common practice or (2) datasets that would
not realistically benefit from our analysis (e.g., datasets
in which most of the software modules are defective) using
the selection criteria of the studied datasets (in Appendix 3).
Nevertheless, our proposed approaches are applicable to
any dataset. Practitioners are encouraged to explore our ap-
proaches on their own datasets with their own peculiarities.
The conclusions of our case study rely on one defect pre-
diction scenario (i.e., within-project defect models). How-
ever, there are a variety of defect prediction scenarios in
the literature (e.g., cross-project defect prediction [12, 86],
just-in-time defect prediction [38], heterogeneous defect pre-
diction [60]). Therefore, the practical guidelines may differ
in other scenarios of defect modeling.
7 CONCLUSION
In this paper, we set out to investigate (1) the impact of
correlated metrics on the interpretation of defect models.
After removing correlated metrics, we investigate (2) the
consistency of the interpretation of defect models; and (3)
its impact on the performance and stability of defect models.
Through a case study of 14 publicly-available defect datasets
of systems that span both proprietary and open source
domains, we conclude that (1) correlated metrics have the
largest impact on the consistency, the level of discrepancy,
and the direction of the ranking of metrics, especially for
ANOVA techniques. On the other hand, we ﬁnd that remov-
ing all correlated metrics (2) improves the consistency of
the produced rankings regardless of the ordering of metrics
(except for ANOVA Type-I); (3) improves the consistency
of ranking of metrics among the studied interpretation
techniques; (4) impacts the model performance by less than
5 percentage points.
Based on our ﬁndings, we offer practical guidelines for
future studies. When the goal is to derive sound interpreta-
tion from defect models:
1) One must mitigate correlated metrics prior to con-
structing a defect model, especially for ANOVA anal-
yses.
2) One must avoid using ANOVA Type-I even if all
correlated metrics are removed.
Due to the variety of the built-in interpretation tech-
niques and their settings, our paper highlights the essential
need for future research to report the exact speciﬁcation
of their models and settings of the used interpretation
techniques.
REFERENCES
[1] E. Arisholm, L. C. Briand, and E. B. Johannessen, “A System-
atic and Comprehensive Investigation of Methods to Build and
Evaluate Fault Prediction Models,” Journal of Systems and Software,
vol. 83, no. 1, pp. 2–17, 2010.
[2] J. G. Barnett, C. K. Gathuru, L. S. Soldano, and S. McIntosh, “The
Relationship between Commit Message Detail and Defect Prone-
ness in Java Projects on GitHub,” in Proceedings of the International
Conference on Mining Software Repositories (MSR), 2016, pp. 496–499.
[3] V. R. Basili, L. C. Briand, and W. L. Melo, “A Validation of Object-
oriented Design Metrics as Quality Indicators,” Transactions on
Software Engineering (TSE), vol. 22, no. 10, pp. 751–761, 1996.
[4] N. Bettenburg and A. E. Hassan, “Studying the Impact of Social
Structures on Software Quality,” in Proceedings of the International
Conference on Program Comprehension (ICPC), 2010, pp. 124–133.
[5] C. Bird, B. Murphy, and H. Gall, “Don’t Touch My Code ! Examin-
ing the Effects of Ownership on Software Quality,” in Proceedings
of the Joint Meeting of the European Software Engineering Confer-
ence and the Symposium on the Foundations of Software Engineering
(ESEC/FSE), 2011, pp. 4–14.
[6] C. Bird, N. Nagappan, P. Devanbu, H. Gall, and B. Murphy, “Does
Distributed Development Affect Software Quality?: An Empirical
Case Study of Windows Vista,” Communications of the ACM, vol. 52,
no. 8, pp. 85–93, 2009.
[7] D. Bowes, T. Hall, M. Harman, Y. Jia, F. Sarro, and F. Wu,
“Mutation-Aware Fault Prediction,” in Proceedings of the Interna-
tional Symposium on Software Testing and Analysis (ISSTA), 2016, pp.
330–341.
[8] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp.
5–32, 2001.
[9] L. Breiman, A. Cutler, A. Liaw, and M. Wiener, “randomForest
: Breiman and Cutler’s Random Forests for Classiﬁcation and
Regression. R package version 4.6-12.” Software available at URL:
https://cran.r-project.org/web/packages/randomForest, 2006.
[10] L. C. Briand, W. L. Melo, and J. Wust, “Assessing the Applica-
bility of Fault-proneness Models across Object-oriented Software
Projects,” Transactions on Software Engineering (TSE), vol. 28, no. 7,
pp. 706–720, 2002.
[11] L. C. Briand, J. Wüst, J. W. Daly, and D. V. Porter, “Exploring the
Relationships between Design Measures and Software Quality in
Object-oriented Systems,” Journal of Systems and Software, vol. 51,
no. 3, pp. 245–273, 2000.
[12] G. Canfora, A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella,
and S. Panichella, “Multi-objective Cross-project Defect Predic-
tion,” in Proceedings of the International Conference on Software
Testing, Veriﬁcation and Validation (ICST), 2013, pp. 252–261.
[13] J. M. Chambers, Statistical Models in S. Wadsworth, Pacific
Grove, California, 1992.
[14] S. R. Chidamber and C. F. Kemerer, “A Metrics Suite for Ob-
ject Oriented Design,” Transactions on Software Engineering (TSE),
vol. 20, no. 6, pp. 476–493, 1994.
[15] P. Devanbu, T. Zimmermann, and C. Bird, “Belief & Evidence in
Empirical Software Engineering,” in Proceedings of the International
Conference on Software Engineering (ICSE), 2016, pp. 108–119.
[16] M. Di Penta, L. Cerulo, Y.-G. Guéhéneuc, and G. Antoniol, “An
Empirical Study of the Relationships between Design Pattern
Empirical Study of the Relationships between Design Pattern
Roles and Class Change Proneness,” in Proceedings of the Interna-
tional Conference on Software Maintenance (ICSM), 2008, pp. 217–226.
[17] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap.
Boston, MA: Springer US, 1993.
[18] J. Fox, Applied regression analysis and generalized linear models. Sage
Publications, 2015.
[19] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical
Learning. Springer series in statistics, 2001, vol. 1.
[20] W. Fu, T. Menzies, and X. Shen, “Tuning for Software Analytics:
Is it really necessary?” Information and Software Technology, vol. 76,
pp. 135–146, 2016.
[21] B. Ghotra, S. McIntosh, and A. E. Hassan, “Revisiting the Impact of
Classiﬁcation Techniques on the Performance of Defect Prediction
Models,” in Proceedings of the International Conference on Software
Engineering (ICSE), 2015, pp. 789–800.
[22] Y. Gil and G. Lalouche, “On the Correlation between Size and
Metric Validity,” Empirical Software Engineering (EMSE), vol. 22,
no. 5, pp. 2585–2611, 2017.
[23] G. Gousios, M. Pinzger, and A. v. Deursen, “An Exploratory Study
of the Pull-based Software Development Model,” in Proceedings of
the International Conference on Software Engineering (ICSE), 2014, pp.
345–355.
[24] L. Guo, Y. Ma, B. Cukic, and H. Singh, “Robust Prediction of Fault-
proneness by Random Forests,” in Proceedings of the International
Symposium on Software Reliability Engineering (ISSRE), 2004, pp.
417–428.
[25] M. A. Hall and L. A. Smith, “Feature Subset Selection: A Correla-
tion Based Filter Approach,” 1997.
[26] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell, “A
Systematic Literature Review on Fault Prediction Performance in
Software Engineering,” Transactions on Software Engineering (TSE),
vol. 38, no. 6, pp. 1276–1304, 2012.
[27] J. A. Hanley and B. J. McNeil, “The meaning and use of the area
under a receiver operating characteristic (ROC) curve,” Radiology,
vol. 143, no. 4, pp. 29–36, 1982.
[28] F. E. Harrell Jr, “Hmisc: Harrell miscellaneous. R pack-
age version 3.12-2,” Software available at URL: http://cran.r-
project.org/web/packages/Hmisc, 2013.
[29] ——, Regression Modeling Strategies : With Applications to Lin-
ear Models, Logistic and Ordinal Regression, and Survival Analysis.
Springer, 2015.
[30] ——, “rms: Regression Modeling Strategies. R package version
5.1-1,” 2017.
[31] A. E. Hassan, “Predicting Faults using the Complexity of Code
Changes,” in Proceedings of the International Conference on Software
Engineering (ICSE), 2009, pp. 78–88.
[32] H. Hemmati, S. Nadi, O. Baysal, O. Kononenko, W. Wang,
R. Holmes, and M. W. Godfrey, “The MSR Cookbook: Mining a
Decade of Research,” in Proceedings of the International Conference
on Mining Software Repositories (MSR), 2013, pp. 343–352.
[33] I. Herraiz, D. M. German, and A. E. Hassan, “On the Distribu-
tion of Source Code File Sizes,” in Proceedings of the International
Conference on Software Technologies (ICSOFT), 2011, pp. 5–14.
[34] J. Jiarpakdee, C. Tantithamthavorn, and A. E. Hassan, “Online
Appendix for “The Impact of Correlated Metrics on the In-
terpretation of Defect Models”,” https://github.com/SAILResearch/collinearity-pitfalls, 2018.
[35] J. Jiarpakdee, C. Tantithamthavorn, A. Ihara, and K. Matsumoto,
“A Study of Redundant Metrics in Defect Prediction Datasets,”
in Proceedings of the International Symposium on Software Reliability
Engineering Workshops (ISSREW), 2016, pp. 51–52.
[36] J. Jiarpakdee, C. Tantithamthavorn, and C. Treude, “AutoSpearman:
Automatically Mitigating Correlated Metrics for Interpreting
Defect Models,” in Proceedings of the International Conference on
Software Maintenance and Evolution (ICSME), 2018, pp. 92–103.
[37] S. Kabinna, W. Shang, C.-P. Bezemer, and A. E. Hassan, “Exam-
ining the Stability of Logging Statements,” in Proceedings of the
International Conference on Software Analysis, Evolution, and Reengi-
neering (SANER), 2016, pp. 326–337.
[38] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha,
and N. Ubayashi, “A Large-Scale Empirical Study of Just-In-Time
Quality Assurance,” Transactions on Software Engineering (TSE),
vol. 39, no. 6, pp. 757–773, 2013.
[39] J. Keung, E. Kocaguneli, and T. Menzies, “Finding Conclusion
Stability for Selecting the Best Effort Predictor in Software Effort
Estimation,” Automated Software Engineering, vol. 20, no. 4, pp. 543–
567, 2013.
[40] F. Khomh, M. Di Penta, and Y.-G. Gueheneuc, “An Exploratory
Study of the Impact of Code Smells on Software Change-
proneness,” in Proceedings of the Working Conference on Reverse
Engineering (WCRE), 2009, pp. 75–84.
[41] F. Khomh, M. Di Penta, Y.-G. Guéhéneuc, and G. Antoniol, “An
Exploratory Study of the Impact of Antipatterns on Class Change-
and Fault-proneness,” Empirical Software Engineering (EMSE),
vol. 17, no. 3, pp. 243–275, 2012.
[42] S. Kim, T. Zimmermann, E. J. Whitehead Jr, and A. Zeller, “Predict-
ing Faults from Cached History,” in Proceedings of the International
Conference on Software Engineering (ICSE), 2007, pp. 489–498.
[43] S. Kirbas, B. Caglayan, T. Hall, S. Counsell, D. Bowes, A. Sen, and
A. Bener, “The Relationship between Evolutionary Coupling and
Defects in Large Industrial Software,” Journal of Software: Evolution
and Process, vol. 29, no. 4, 2017.
[44] A. G. Koru and H. Liu, “An Investigation of the Effect of Module
Size on Defect Prediction Using Static Measures,” Software Engi-
neering Notes (SEN), vol. 30, pp. 1–5, 2005.
[45] H. C. Kraemer, G. A. Morgan, N. L. Leech, J. A. Gliner, J. J. Vaske,
and R. J. Harmon, “Measures of Clinical Significance,” Journal of
the American Academy of Child and Adolescent Psychiatry,
vol. 42, no. 12, pp. 1524–1529, 2003.
[46] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, “Benchmarking
Classification Models for Software Defect Prediction: A Proposed
Framework and Novel Findings,” Transactions on Software Engi-
neering (TSE), vol. 34, no. 4, pp. 485–496, 2008.
[47] H. Lu, E. Kocaguneli, and B. Cukic, “Defect Prediction between
Software Versions with Active Learning and Dimensionality Re-
duction,” in Proceedings of the International Symposium on Software
Reliability Engineering (ISSRE), 2014, pp. 312–322.
[48] B. W. Matthews, “Comparison of the predicted and observed sec-
ondary structure of T4 phage lysozyme,” Biochimica et Biophysica
Acta (BBA)-Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.
[49] T. J. McCabe, “A Complexity Measure,” Transactions on Software
Engineering (TSE), no. 4, pp. 308–320, 1976.
[50] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan, “The Impact
of Code Review Coverage and Code Review Participation on
Software Quality,” in Proceedings of the International Conference on
Mining Software Repositories (MSR), 2014, pp. 192–201.
[51] T. Mende, “Replication of Defect Prediction Studies: Problems,
Pitfalls and Recommendations,” in Proceedings of the International
Conference on Predictive Models in Software Engineering (PROMISE),
2010, pp. 1–10.
[52] T. Mende and R. Koschke, “Revisiting the Evaluation of Defect
Prediction Models,” Proceedings of the International Conference on
Predictive Models in Software Engineering (PROMISE), pp. 7–16,
2009.
[53] A. Meneely, L. Williams, W. Snipes, and J. Osborne, “Predicting
Failures with Developer Networks and Social Network Analysis,”
in Proceedings of the International Symposium on Foundations of
Software Engineering (FSE), 2008, pp. 13–23.
[54] T. Menzies, J. Greenwald, and A. Frank, “Data Mining Static Code
Attributes to Learn Defect Predictors,” Transactions on Software
Engineering (TSE), vol. 33, no. 1, pp. 2–13, 2007.
[55] A. T. Misirli, E. Shihab, and Y. Kamei, “Studying High Impact Fix-
Inducing Changes,” Empirical Software Engineering (EMSE), vol. 21,
no. 2, pp. 605–641, 2016.
[56] R. Morales, S. McIntosh, and F. Khomh, “Do Code Review Prac-
tices Impact Design Quality? : A Case Study of the Qt, VTK,
and ITK Projects,” in Proceedings of the International Conference on
Software Analysis, Evolution and Reengineering (SANER), 2015, pp.
171–180.
[57] N. Nagappan and T. Ball, “Use of Relative Code Churn Measures
to Predict System Defect Density,” Proceedings of the International
Conference on Software Engineering (ICSE), pp. 284–292, 2005.
[58] N. Nagappan, T. Ball, and A. Zeller, “Mining Metrics to Predict
Component Failures,” in Proceedings of the International Conference
on Software Engineering (ICSE), 2006, pp. 452–461.
[59] N. Nagappan, A. Zeller, T. Zimmermann, K. Herzig, and B. Mur-
phy, “Change Bursts as Defect Predictors,” in Proceedings of the
International Symposium on Software Reliability Engineering (ISSRE),
2010, pp. 309–318.
[60] J. Nam, W. Fu, S. Kim, T. Menzies, and L. Tan, “Heterogeneous
Defect Prediction,” Transactions on Software Engineering (TSE), p. In
Press, 2017.
[61] F. Rahman and P. Devanbu, “Ownership, experience and defects: a
fine-grained study of authorship,” in Proceedings of the International
Conference on Software Engineering (ICSE), 2011, pp. 491–500.
[62] ——, “How, and Why, Process Metrics are Better,” in Proceedings
of the International Conference on Software Engineering (ICSE), 2013,
pp. 432–441.
[63] F. Rahman, D. Posnett, I. Herraiz, and P. Devanbu, “Sample size
vs. bias in defect prediction,” in Proceedings of the Joint Meeting of
the European Software Engineering Conference and the Symposium on
the Foundations of Software Engineering (ESEC/FSE), 2013, pp. 147–
157.
[64] G. K. Rajbahadur, S. Wang, Y. Kamei, and A. E. Hassan, “The
Impact of Using Regression Models to Build Defect Classifiers,”
in Proceedings of the International Conference on Mining Software
Repositories (MSR), 2017, pp. 135–145.
[65] B. Ray, D. Posnett, V. Filkov, and P. Devanbu, “A Large Scale Study
of Programming Languages and Code Quality in Github,” in
Proceedings of the International Symposium on Foundations of Software
Engineering (FSE), 2014, pp. 155–165.
[66] M. Shepperd, D. Bowes, and T. Hall, “Researcher Bias: The Use of
Machine Learning in Software Defect Prediction,” Transactions on
Software Engineering (TSE), vol. 40, no. 6, pp. 603–616, 2014.
[67] E. Shihab, “An Exploration of Challenges Limiting Pragmatic Soft-
ware Defect Prediction,” Ph.D. dissertation, Queen’s University,
2012.
[68] E. Shihab, C. Bird, and T. Zimmermann, “The Effect of Branch-
ing Strategies on Software Quality,” in Proceedings of the Interna-
tional Symposium on Empirical Software Engineering and Measurement
(ESEM), 2012, pp. 301–310.
[69] E. Shihab, Z. M. Jiang, W. M. Ibrahim, B. Adams, and A. E.
Hassan, “Understanding the Impact of Code and Process Metrics
on Post-release Defects: A Case Study on the Eclipse Project,”
in Proceedings of the International Symposium on Empirical Software
Engineering and Measurement (ESEM), 2010, pp. 4–10.
[70] J. Shimagaki, Y. Kamei, S. McIntosh, A. E. Hassan, and
N. Ubayashi, “A Study of the Quality-Impacting Practices of Mod-
ern Code Review at Sony Mobile,” in Proceedings of the International
Conference on Software Engineering (ICSE), 2016, pp. 212–221.
[71] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, “Evaluat-
ing Complexity, Code Churn, and Developer Activity Metrics as
Indicators of Software Vulnerabilities,” Transactions on Software
Engineering (TSE), vol. 37, no. 6, pp. 772–787, 2011.
[72] C. Tantithamthavorn, “Towards a Better Understanding of the
Impact of Experimental Components on Defect Prediction Mod-
elling,” in Companion Proceeding of the International Conference on
Software Engineering (ICSE), 2016, pp. 867–870.
[73] ——, “ScottKnottESD: The Scott-Knott Effect Size Difference
(ESD) Test. R package version 2.0,” Software available at URL:
https://cran.r-project.org/web/packages/ScottKnottESD, 2017.
[74] C. Tantithamthavorn and A. E. Hassan, “An Experience Report
on Defect Modelling in Practice: Pitfalls and Challenges,” in
Proceedings of the International Conference on Software Engineering:
Software Engineering in Practice Track (ICSE-SEIP), 2018, pp. 286–
295.
[75] C. Tantithamthavorn, A. E. Hassan, and K. Matsumoto, “The
Impact of Class Rebalancing Techniques on The Performance
and Interpretation of Defect Prediction Models,” Transactions on
Software Engineering (TSE), 2018.
[76] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Mat-
sumoto, “Automated Parameter Optimization of Classiﬁcation
Techniques for Defect Prediction Models,” in Proceedings of the
International Conference on Software Engineering (ICSE), 2016, pp.
321–332.
[77] ——, “Comments on “Researcher Bias: The Use of Machine
Learning in Software Defect Prediction”,” Transactions on Software
Engineering (TSE), vol. 42, no. 11, pp. 1092–1094, 2016.
[78] ——, “An Empirical Comparison of Model Validation Techniques
for Defect Prediction Models,” Transactions on Software Engineering
(TSE), vol. 43, no. 1, pp. 1–18, 2017.
[79] ——, “The Impact of Automated Parameter Optimization on De-
fect Prediction Models,” Transactions on Software Engineering (TSE),
p. In Press, 2018.
[80] R Core Team and contributors worldwide, “stats: The R Stats
Package. R package version 3.4.0,” 2017.
[81] P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, “Revis-
iting Code Ownership and its Relationship with Software Quality
in the Scope of Modern Code Review,” in Proceedings of the
International Conference on Software Engineering (ICSE), 2016, pp.
1039–1050.
[82] ——, “Review Participation in Modern Code Review,” Empirical
Software Engineering (EMSE), vol. 22, no. 2, pp. 768–817, 2017.
[83] Y. Tian, M. Nagappan, D. Lo, and A. E. Hassan, “What Are
the Characteristics of High-Rated Apps? A Case Study on Free
Android Applications,” in Proceedings of the International Conference
on Software Maintenance and Evolution (ICSME), 2015, pp. 301–310.
[84] S. Yathish, J. Jiarpakdee, P. Thongtanunam, and C. Tantithamtha-
vorn, “Mining Software Defects: Should We Consider Affected Re-
leases?” in Proceedings of the International Conference on Software
Engineering (ICSE), 2019, p. To Appear.
[85] F. Zhang, A. E. Hassan, S. McIntosh, and Y. Zou, “The Use of Sum-
mation to Aggregate Software Metrics Hinders the Performance
of Defect Prediction Models,” Transactions on Software Engineering
(TSE), vol. 43, no. 5, pp. 476–491, 2017.
[86] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy,
“Cross-project Defect Prediction,” in Proceedings of the Joint Meeting
of the European Software Engineering Conference and the Symposium
on the Foundations of Software Engineering (ESEC/FSE), 2009, pp.
91–100.
[87] T. Zimmermann, R. Premraj, and A. Zeller, “Predicting Defects
for Eclipse,” in Proceedings of the International Workshop on Predictor
Models in Software Engineering (PROMISE), 2007, pp. 9–19.
[88] T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl, “Mining
Version Histories to Guide Software Changes,” Transactions on
Software Engineering (TSE), vol. 31, no. 6, pp. 429–445, 2005.
Jirayus Jiarpakdee received the B.E. degree
from Kasetsart University, Thailand, and the
M.E. degree from NAIST, Japan. He is cur-
rently a Ph.D. candidate at Monash University,
Australia. His research interests include empir-
ical software engineering and mining software
repositories (MSR). The goal of his Ph.D. is to
apply the knowledge of statistical modelling, ex-
perimental design, and software engineering in
order to tackle experimental issues that have an
impact on the interpretation of defect prediction
models.
Chakkrit Tantithamthavorn is a lecturer at the
Faculty of Information Technology, Monash Uni-
versity, Australia. His work has been published
at several top-tier software engineering venues
(e.g., TSE, ICSE, EMSE). His research interests
include empirical software engineering and min-
ing software repositories (MSR). He received the
B.E. degree from Kasetsart University, Thailand,
the M.E. and Ph.D. degrees from NAIST, Japan.
More about Chakkrit and his work is available
online at http://chakkrit.com.
Ahmed E. Hassan is the Canada Research
Chair (CRC) in Software Analytics, and the
NSERC/BlackBerry Software Engineering Chair
at the School of Computing at Queen’s University. He received
his Ph.D. in Computer Science from the University of Waterloo. He
spearheaded the creation of the Mining Software
Repositories (MSR) conference and its research
community. More about Ahmed and his work is
available online at http://sail.cs.queensu.ca/.
... Prior studies have raised various potential issues that may impact the accuracy and interpretation of AI/ML models in SE. For example, the quality of datasets [20,56], data labelling techniques [70], feature selection techniques [21,31], collinearity analysis [29][30][31], class rebalancing techniques [55], model construction [20], parameter optimisation [2,3,17,55,57], model evaluation [32,47,58], and model interpretation [28,29]. ...
... Prior studies have raised various potential issues that may impact the accuracy and interpretation of AI/ML models in SE. For example, the quality of datasets [20,56], data labelling techniques [70], feature selection techniques [21,31], collinearity analysis [29][30][31], class rebalancing techniques [55], model construction [20], parameter optimisation [2,3,17,55,57], model evaluation [32,47,58], and model interpretation [28,29]. ...
Preprint
Android developers frequently update source code to improve the performance, security, or maintainability of Android apps. Such Android code updating activities are intuitively repetitive, manual, and time-consuming. In this paper, we propose AutoUpdate, a Transformer-based automated code update recommendation approach for Android Apps, which takes advantage of code abstraction (Abs) and Byte-Pair Encoding (BPE) techniques to represent source code. Since this is the first work to automatically update code in Android apps, we collect a history of 209,346 updated method pairs from 3,195 real-world Android applications available on Google Play stores that span 14 years (2008-2022). Through an extensive experiment on our curated datasets, the results show that AutoUpdate(1) achieves a perfect prediction of 25% based on the realistic time-wise evaluation scenario, which outperforms the two baseline approaches; (2) gains benefits at least 17% of improvement by using both Abs and BPE; (3) is able to recommend code updates for various purposes (e.g., fixing bugs, adding new feature, refactoring methods). On the other hand, the models (4) could produce optimistically high accuracy due to the unrealistic evaluation scenario (i.e., random splits), suggesting that researchers should consider time-wise evaluation scenarios in the future; (5) are less accurate for a larger size of methods with a larger number of changed tokens, providing a research opportunity for future work. Our findings demonstrate the significant advancement of NMT-based code update recommendation approaches for Android apps.
... Recently, the software engineering community has brought attention to explainable AI where the predictions of statistical models are interpreted through investigating the importance of each feature locally or globally [40,38,39,64,70,71]. In global interpretation, the association between independent variables and the dependent variable is examined for the whole data set. ...
Preprint
Test effectiveness refers to the capability of a test suite in exposing faults in software. It is crucial to be aware of factors that influence this capability. We aim at inferring the causal relationship between the two factors (i.e., Cover/Exec) and the capability of a test suite to expose and discover faults in software. Cover refers to the number of distinct test cases covering the statement and Exec equals the number of times a test suite executes a statement. We analyzed 459166 software faults from {12} Java programs. Bayesian statistics along with the back-door criterion was exploited for the purpose of causal inference. Furthermore, we examined the common pitfall measuring association, the mixture of causal and noncausal relationships, instead of causal association. The results show that Cover is of more causal association as against \textit{Exec}, and the causal association and noncausal one for those variables are statistically different. Software developers could exploit the results to design and write more effective test cases, which lead to discovering more bugs hidden in software.
... To avoid this, an additional preprocessing step transfers the data samples to a lower-dimensional space as shown in Fig. 1. In addition, Jiarpakdee et al. also suggest that even if the feature vector in and of itself is not already very high-dimensional, it may still be worthwhile to remove features [17]. Especially removing high correlations of individual variables from the data is usually desirable. ...
Conference Paper
Software metrics measure aspects related to the quality of software. Using software metrics as a method of quantification of software, various approaches were proposed for locating defect-prone source code units within software projects. Most of these approaches rely on supervised learning algorithms, which require labeled data for adjusting their parameters during the learning phase. Usually, such labeled training data is not available. Unsupervised algorithms do not require training data and can therefore help to overcome this limitation. In this work, we evaluate the effect of unsupervised learning - especially outlier mining algorithms - for the task of defect prediction, i.e., locating defect-prone source code units. We investigate the effect of various class balancing and feature compressing techniques as preprocessing steps and show how sliding windows can be used to capture time series of source code metrics. We evaluate the Isolation Forest and Local Outlier Factor, as representants of outlier mining techniques. Our experiments on three publicly available datasets, containing a total of 11 software projects, indicate that the consideration of time series can improve static examinations by up to 3%. The results further show that supervised algorithms can outperform unsupervised approaches on all projects. Among all unsupervised approaches, the Isolation Forest achieves the best accuracy on 10 out of 11 projects.
... This problem is called curse of dimensionality which can result in low accuracy and high misclassification rates of ML-based SFP classifier. Feature selection (FS) is one of the potentially active solutions to it (Al-Asadi and Tasdemir 2021; Catal and Diri 2009;Gao et al. 2011;Radjenović et al. 2013;He et al. 2015;Afzal and Torkar 2016;Yu et al. 2019;Jiarpakdee et al. 2019;Kondo et al. 2019). The feature selection (FS) can be a filter or wrapper or embedded technique to reduce the number of features and hence to increase the simplicity of SDP classifier operating at a higher speed with better understanding. ...
Article
Full-text available
Software fault prediction (SFP) plays a vital role into fostering high quality throughout the software development process. It allows to identify the fault-prone modules in early development phases and facilitates the focused and effective testing over the fault-prone modules. Machine learning (ML)-based classifiers are prominently being used for fault prediction in the software industry. The accuracy of the ML models depends upon the training data and its quality. The curse of high dimensionality adversely impacts the classification power of a ML model. The presence of inter-correlated, insignificant and/or redundant features (or attributes) in the training data hinders the performance of ML classifiers. Feature preprocessing (or feature selection (FS)) is the solution to this issue. Meta-heuristics is the key method to find out the most significant feature subset. In this paper, a novel feature selection method is devised using mathematical diversification for genetic evolution. It avoids the local optimums by utilizing arithmetic diversification among the candidate solutions (or populations). The survival of fittest is the working principle of evolving populations with crossover and mutation operations. The selected feature subset is fed to five classification algorithms, namely artificial neural network, support vector machine, decision tree, k-nearest neighbor and naïve Bayes. The proposed model is trained and tested over five datasets from NASA corpus, namely CM1, JM1, KC1, KC2 and PC1. In total, 100 SFP models are implemented (4 feature selection methods $$\times$$ 5 datasets $$\times$$ 5 classification algorithms). From the experiments, it is observed that the SFP models with proposed feature selection technique of evolving populations with mathematical diversification (FS-EPwMD) are better than other models. 
It can be concluded that the proposed SFP model built using proposed FS-EPwMD with artificial neural networks performs statistically best among all the competing 100 SFP models irrespective of the datasets used.
... 1 the impact of automated hyperparameter tuning (Binnig et al., 2018) 2 the presence and absence of class rebalancing techniques 3 the use of different performance metrics to assess pipeline performance (Jiarpakdee et al., 2019) 4 the use of different software packages to implement the same pipeline (Adekitan and Noma-Osaghae, 2019) 5 the impact of cross-validation on pipeline performance . ...
... 1 the impact of automated hyperparameter tuning (Binnig et al., 2018) 2 the presence and absence of class rebalancing techniques 3 the use of different performance metrics to assess pipeline performance (Jiarpakdee et al., 2019) 4 the use of different software packages to implement the same pipeline (Adekitan and Noma-Osaghae, 2019) 5 the impact of cross-validation on pipeline performance . ...
... However, few of extracted features can be redundant or irrelevant and can originate adverse impact on the performance of the ML classifier as SFP. Some features (or metrics) are correlated and degrades the performance of classifiers, hence should be removed (Jiarpakdee et al. 2019). Feature selection (FS) is one of the potentially active solutions to it (Ghotra et al. 2017). ...
Article
Full-text available
Software fault prediction (SFP) refers to the early prediction of fault-prone modules in software development which are susceptible to faults and can incur high development cost. Machine learning (ML)-based classifiers are extensively being used for SFP. Machine learning models utilize handcrafted metrics (or features), i.e., static code metrics for classification of software modules into one of the two categories, i.e., {buggy, clean}. It involves overhead of selecting the most significant features due to the presence of some correlated or non-significant features. With the shifting paradigm of machine learning to deep learning, it is desirable to improve the performance of SFP classifiers to keep pace up with the changing industrial needs. This study proposes a novel model (SCM-DLA-SFP) based on deep learning architecture (DLA) to predict the defects utilizing the static code metrics (SCMs). The defect dataset with SCMs is fed to the input layer of specially designed deep learning model, where the input is automatically conditioned using normalization. Then, the conditioned data pass through dense layers of deep neural network architecture to predict the faulty modules. The study utilizes five datasets from PROMISE repository namely camel, jedit, lucene, synapse and xalan. The proposed model SCM-DLA-SFP exhibits the performance of the average values of 88.01%, 79.83%, and 73.3% for AUC measure, accuracy criteria and F-measure, respectively. The comparison shows that proposed model is better on average than the state-of-the-art DL-based SFP methods by 16.28%, 19.61%, and 18.45% over AUC, accuracy and F-measure, respectively.
Article
Context : Interpretation has been considered as a key factor to apply defect prediction in practice. As interpretation from rule-based interpretable models can provide insights about past defects with high quality, many prior studies attempt to construct interpretable models for both accurate prediction and comprehensible interpretation. However, class imbalance is usually ignored, which may bring huge negative impact on interpretation. Objective : In this paper, we are going to investigate resampling techniques, a popular solution to deal with imbalanced data, on interpretation for interpretable models. We also investigate the feasibility to construct interpretable defect prediction models directly on original data. Further, we are going to propose a rule-based interpretable model which can deal with imbalanced data directly. Method : We conduct an empirical study on 47 publicly available datasets to investigate the impact of resampling techniques on rule-based interpretable models and the feasibility to construct such models directly on original data. We also improve gain function and tolerate lower confidence based on rule induction algorithms to deal with imbalanced data. Results : We find that (1) resampling techniques impact on interpretable models heavily from both feature importance and model complexity, (2) it is not feasible to construct meaningful interpretable models on original but imbalanced data due to low coverage of defects and poor performance, and (3) our proposed approach is effective to deal with imbalanced data compared with other rule-based models. Conclusion : Imbalanced data heavily impacts on the interpretable defect prediction models. Resampling techniques tend to shift the learned concept, while constructing rule-based interpretable models on original data may also be infeasible. Thus, it is necessary to construct rule-based models which can deal with imbalanced data well in further studies.
Conference Paper
Full-text available
With the rise of the Mining Software Repositories (MSR) field, defect datasets extracted from software repositories play a foundational role in many empirical studies related to software quality. At the core of defect data preparation is the identification of post-release defects. Prior studies leverage many heuristics (e.g., keywords and issue IDs) to identify post-release defects. However, such the heuristic approach is based on several assumptions, which pose common threats to the validity of many studies. In this paper, we set out to investigate the nature of the difference of defect datasets generated by the heuristic approach and the realistic approach that leverages the earliest affected release that is realistically estimated by a software development team for a given defect. In addition, we investigate the impact of defect identification approaches on the predictive accuracy and the ranking of defective modules that are produced by defect models. Through a case study of defect datasets of 32 releases, we find that that the heuristic approach has a large impact on both defect count datasets and binary defect datasets. Surprisingly, we find that the heuristic approach has a minimal impact on defect count models, suggesting that future work should not be too concerned about defect count models that are constructed using heuristic defect datasets. On the other hand, using defect datasets generated by the realistic approach lead to an improvement in the predictive accuracy of defect classification models.
Preprint
Full-text available
Over the past decade with the rise of the Mining Software Repositories (MSR) field, the modelling of defects for large and long-lived systems has become one of the most common applications of MSR. The findings and approaches of such studies have attracted the attention of many of our industrial collaborators (and other practitioners worldwide). The core of many of these studies is the development and use of analytical models for defects. In this paper, we discuss common pitfalls and challenges that we observed as practitioners attempt to develop such models or reason about the findings of such studies. The key goal of this paper is to document such pitfalls and challenges so practitioners can avoid them in future efforts. We also hope that other academics will be mindful of such pitfalls and challenges in their own work and industrial engagements.
Article
Full-text available
Many recent studies have documented the success of cross-project defect prediction (CPDP) to predict defects for new projects lacking in defect data by using prediction models built by other projects. However, most studies share the same limitations: it requires homogeneous data; i.e., different projects must describe themselves using the same metrics. This paper presents methods for heterogeneous defect prediction (HDP) that matches up different metrics in different projects. Metric matching for HDP requires a "large enough" sample of distributions in the source and target projects?which raises the question on how large is "large enough" for effective heterogeneous defect prediction. This paper shows that empirically and theoretically, "large enough" may be very small indeed. For example, using a mathematical model of defect prediction, we identify categories of data sets were as few as 50 instances are enough to build a defect prediction model. Our conclusion for this work is that, even when projects use different metric sets, it is possible to quickly transfer lessons learned about defect prediction.
Article
Full-text available
Empirical validation of code metrics has a long history of success. Many metrics have been shown to be good predictors of external features, such as correlation to bugs. Our study provides an alternative explanation to such validation, attributing it to the confounding effect of size. In contradiction to received wisdom, we argue that the validity of a metric can be explained by its correlation to the size of the code artifact. In fact, this work came about in view of our failure in the quest of finding a metric that is both valid and free of this confounding effect. Our main discovery is that, with the appropriate (non-parametric) transformations, the validity of a metric can be accurately (with R-squared values being at times as high as 0.97) predicted from its correlation with size. The reported results are with respect to a suite of 26 metrics, that includes the famous Chidamber and Kemerer metrics. Concretely, it is shown that the more a metric is correlated with size, the more able it is to predict external features values, and vice-versa. We consider two methods for controlling for size, by linear transformations. As it turns out, metrics controlled for size, tend to eliminate their predictive capabilities. We also show that the famous Chidamber and Kemerer metrics are no better than other metrics in our suite. Overall, our results suggest code size is the only “unique” valid metric.
Article
Full-text available
Software code review is a well-established software quality practice. Recently, Modern Code Review (MCR) has been widely adopted in both open source and proprietary projects. Our prior work shows that review participation plays an important role in MCR practices, since the amount of review participation shares a relationship with software quality. However, little is known about which factors influence review participation in the MCR process. Hence, in this study, we set out to investigate the characteristics of patches that: (1) do not attract reviewers, (2) are not discussed, and (3) receive slow initial feedback. Through a case study of 196,712 reviews spread across the Android, Qt, and OpenStack open source projects, we find that the amount of review participation in the past is a significant indicator of patches that will suffer from poor review participation. Moreover, we find that the description length of a patch shares a relationship with the likelihood of receiving poor reviewer participation or discussion, while the purpose of introducing new features can increase the likelihood of receiving slow initial feedback. Our findings suggest that the patches with these characteristics should be given more attention in order to increase review participation, which will likely lead to a more responsive review process.
Article
Defect prediction models that are trained on class-imbalanced datasets (i.e., where defective and clean modules are not equally represented) are highly susceptible to producing inaccurate predictions. Prior research compares the impact of class rebalancing techniques on the performance of defect prediction models, but arrives at contradictory conclusions due to different choices of datasets, classification techniques, and performance measures. Such contradictory conclusions make it hard to derive practical guidelines for whether class rebalancing techniques should be applied in the context of defect prediction models. In this paper, we investigate the impact of 4 popularly-used class rebalancing techniques on 10 commonly-used performance measures and on the interpretation of defect prediction models. We also construct statistical models to better understand in which experimental design settings class rebalancing techniques are beneficial for defect prediction models. Through a case study of 101 datasets that span proprietary and open-source systems, we recommend that class rebalancing techniques are necessary when quality assurance teams wish to increase the completeness of identifying software defects (i.e., Recall). However, class rebalancing techniques should be avoided when interpreting defect prediction models. We also find that class rebalancing techniques do not impact the AUC measure. Hence, AUC should be used as a standard measure when comparing defect prediction models.
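The Recall-versus-AUC effect the abstract reports can be sketched on synthetic data: naive random oversampling of the minority class typically raises Recall at the default threshold while leaving AUC (a ranking measure) roughly unchanged. Everything below is illustrative; the study itself compares four rebalancing techniques across 101 real datasets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, roc_auc_score

# Hypothetical imbalanced "defect" data: roughly 10% defective modules.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

def fit_recall_auc(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return (recall_score(y_te, clf.predict(X_te)),
            roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

base_recall, base_auc = fit_recall_auc(X_tr, y_tr)

# Naive random oversampling: duplicate minority instances until the
# training classes are balanced.
rng = np.random.default_rng(42)
minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=(y_tr == 0).sum() - len(minority), replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

bal_recall, bal_auc = fit_recall_auc(X_bal, y_bal)
# Recall typically rises after rebalancing; AUC stays roughly unchanged.
```

Duplicating minority instances mostly shifts the model's decision threshold rather than its ranking of modules, which is why threshold-sensitive measures like Recall move while AUC does not.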
Article
Evolutionary coupling (EC) is defined as the implicit relationship between 2 or more software artifacts that are frequently changed together. Changing software is widely reported to be defect-prone. In this study, we investigate the effect of EC on the defect proneness of large industrial software systems and explain why the effects vary. We analysed 2 large industrial systems: a legacy financial system and a modern telecommunications system. We collected historical data for 7 years from 5 different software repositories containing 176 thousand files. We applied correlation and regression analysis to explore the relationship between EC and software defects, and we analysed defect types, size, and process metrics to explain different effects of EC on defects through correlation. Our results indicate that there is generally a positive correlation between EC and defects, but the correlation strength varies. Evolutionary coupling is less likely to have a relationship to software defects for parts of the software with fewer files and where fewer developers contributed. Evolutionary coupling measures showed higher correlation with some types of defects (based on root causes) such as code implementation and acceptance criteria. Although EC measures may be useful to explain defects, the explanatory power of such measures depends on defect types, size, and process metrics.
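The correlation analysis the abstract describes can be sketched as a rank correlation between per-file co-change counts and defect counts; Spearman's rho is a common choice since both counts are skewed. The file-level numbers below are purely hypothetical.

```python
from scipy.stats import spearmanr

# Hypothetical per-file counts: how often each file was changed
# together with others (evolutionary coupling) and its defect count.
coupling = [1, 3, 5, 8, 12, 20, 25]
defects = [0, 1, 1, 2, 3, 4, 6]

rho, p_value = spearmanr(coupling, defects)
# A strong positive rank correlation, consistent with the abstract's
# finding that EC and defects are generally positively correlated.
```

In the study's setting, such a correlation would be computed per subsystem and compared across defect types and size/process strata to explain why its strength varies.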