

The Impact of Correlated Metrics on

the Interpretation of Defect Models

Jirayus Jiarpakdee, Student Member, IEEE, Chakkrit Tantithamthavorn, Member, IEEE,

and Ahmed E. Hassan, Fellow, IEEE

Abstract—Defect models are analytical models for building empirical theories related to software quality. Prior studies often derive

knowledge from such models using interpretation techniques, e.g., ANOVA Type-I. Recent work raises concerns that correlated metrics

may impact the interpretation of defect models. Yet, the impact of correlated metrics in such models has not been investigated. In this

paper, we investigate the impact of correlated metrics on the interpretation of defect models and the improvement of the interpretation

of defect models when removing correlated metrics. Through a case study of 14 publicly-available defect datasets, we find that (1)

correlated metrics have the largest impact on the consistency, the level of discrepancy, and the direction of the ranking of metrics,

especially for ANOVA techniques. On the other hand, we ﬁnd that removing all correlated metrics (2) improves the consistency of the

produced rankings regardless of the ordering of metrics (except for ANOVA Type-I); (3) improves the consistency of ranking of metrics

among the studied interpretation techniques; (4) impacts the model performance by less than 5 percentage points. Thus, when one

wishes to derive sound interpretation from defect models, one must (1) mitigate correlated metrics especially for ANOVA analyses; and

(2) avoid using ANOVA Type-I even if all correlated metrics are removed.

Index Terms—Software Quality Assurance, Defect models, Hypothesis Testing, Correlated Metrics, Model Speciﬁcation.


1 INTRODUCTION

Defect models are constructed using historical software

project data to identify defective modules and explore the

impact of various phenomena (i.e., software metrics) on

software quality. The interpretation of such models is used

to build empirical theories that are related to software

quality (i.e., what software metrics share the strongest as-

sociation with software quality?). These empirical theories

are essential for project managers to chart software quality

improvement plans to mitigate the risk of introducing de-

fects in future releases (e.g., a policy to maintain code as

simple as possible).

Plenty of prior studies investigate the impact of many

phenomena on code quality using software metrics, for ex-

ample, code size, code complexity [31, 49, 71], change com-

plexity [42, 57, 59, 71, 88], antipatterns [41], developer activ-

ity [71], developer experience [61], developer expertise [5],

developer and reviewer knowledge [81], design [3, 10, 11,

14, 16], reviewer participation [50, 82], code smells [40], and

mutation testing [7]. To perform such studies, there are ﬁve

common steps: (1) formulating hypotheses that pertain to

the phenomena that one wishes to study; (2) designing ap-

propriate metrics to operationalize the intention behind the

phenomena under study; (3) deﬁning a model speciﬁcation

(e.g., the ordering of metrics) to be used when constructing

an analytical model; (4) constructing an analytical model

using, for example, regression models [5, 57, 81, 82, 87] or

random forest models [23, 38, 55, 64]; and (5) examining the


ranking of metrics using a model interpretation technique

(e.g., ANOVA Type-I, one of the most commonly-used inter-

pretation techniques since it is the default built-in function

for logistic regression (glm) models in R) in order to test the

hypotheses.

For example, to study whether complex code increases

project risk, one might use the number of reported bugs

(bugs) to capture risk, and McCabe's cyclomatic com-

plexity (CC) to capture code complexity, while controlling

for code size (size). We note that one needs to use control

metrics to ensure that ﬁndings are not due to confounding

factors (e.g., large modules are more likely to have more

bugs). Then, one must construct an analytical model with

a model specification of bugs ~ size + CC. One would then

use an interpretation technique (e.g. ANOVA Type-I) to

determine the ranking of metrics (i.e., which metrics have

a strong relationship with bugs).
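As an illustration of this step, the following R sketch uses synthetic data (the variable names bugs, size, and CC are illustrative and are not drawn from the studied datasets) to construct such a model and derive a ranking with ANOVA Type-I; the second call previews the ordering concern discussed below.

# Illustrative sketch with synthetic data: fit bugs ~ size + CC and rank the
# metrics with ANOVA Type-I (the default anova() for glm models in R).
set.seed(1)
n    <- 500
size <- rexp(n, rate = 0.01)                      # code size (e.g., lines of code)
CC   <- 0.1 * size + rnorm(n, sd = 2)             # complexity, correlated with size
bugs <- rbinom(n, 1, plogis(-3 + 0.01 * size))    # risk driven by size only

anova(glm(bugs ~ size + CC, family = binomial))   # Type-I: deviance attributed to size first
anova(glm(bugs ~ CC + size, family = binomial))   # same data, reordered: CC now absorbs that deviance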

Metrics of prior studies are often correlated [22, 32, 33,

35, 36, 74, 77, 85]. For example, Herraiz et al. [33], and

Gil et al. [22] point out that code complexity (CC) is often

correlated with code size (size). Zhang et al. [85] point

out that many metric aggregation schemes (e.g., averaging

or summing of McCabe’s cyclomatic complexity values at

the function level to derive ﬁle-level metrics) often produce

correlated metrics.

Recent studies raise concerns that correlated metrics may

impact the interpretation of defect models [77, 85]. Our pre-

liminary analysis (PA2) also shows that simply rearranging

the ordering of correlated metrics in the model speciﬁcation

(e.g., from bugs ~ size + CC to bugs ~ CC + size) would lead

to a different ranking of metrics—i.e., the importance scores

are sensitive to the ordering of correlated metrics in a

model speciﬁcation. Thus, if one wants to show that code

complexity is strongly associated with risk in a project,


one simply needs to put code complexity (CC) as the ﬁrst

metric in their models (i.e., bugs ~ CC + size), even though

a more careful analysis would show that CC is not associated

with bugs at all. The sensitivity of the model speciﬁcation

when correlated metrics are included in a model is a crit-

ical problem, since the contribution of many prior studies

can be altered by simply re-ordering metrics in the model

speciﬁcation if correlated metrics are not properly mitigated.

Unfortunately, a literature survey of Shihab [67] shows

that as much as 63% of defect studies that are published

during 2000-2011 do not mitigate correlated metrics prior to

constructing defect models.

In this paper, we set out to investigate (1) the impact of

correlated metrics on the interpretation of defect models.

After removing correlated metrics, we investigate (2) the

consistency of the interpretation of defect models; and (3)

its impact on the performance and stability of defect models.

In order to detect and remove correlated metrics, we apply

the variable clustering (VarClus) and the variance inﬂation

factor (VIF) techniques. We construct logistic regression

and random forest models using mitigated (i.e., no corre-

lated metrics) and non-mitigated datasets (i.e., not treated).

Finally, we apply 9 model interpretation techniques, i.e.,

ANOVA Type-I, 4 test statistics of ANOVA Type-II (i.e.,

Wald, Likelihood Ratio, F, and Chi-square), scaled and non-

scaled Gini Importance, and scaled and non-scaled Permu-

tation Importance. We then compare the performance and

interpretation of defect models that are constructed using

mitigated and non-mitigated datasets. Through a case study

of 14 publicly-available defect datasets of systems that span

both proprietary and open source domains, we address the

following four research questions:

(RQ1) How do correlated metrics impact the interpreta-

tion of defect models?

ANOVA Type-I and Type-II often produce the lowest

consistency and the highest level of discrepancy of

the top-ranked metric, and have the highest im-

pact on the direction of the ranking of metrics be-

tween mitigated and non-mitigated models when

compared to Gini and Permutation Importance. This

ﬁnding highlights the risks of not mitigating cor-

related metrics in the ANOVA analyses of prior

studies.

(RQ2) After removing all correlated metrics, how consis-

tent is the interpretation of defect models among

different model speciﬁcations?

After removing all correlated metrics, the top-ranked metric according to ANOVA Type-II, Gini Importance, and Permutation Importance is consistent. However, the top-ranked metric according

to ANOVA Type-I is inconsistent, since the ranking

of metrics is impacted by its order in the model

speciﬁcation when analyzed using ANOVA Type-I

(which is the default analysis for the glm model in R

and is commonly-used in prior studies). This ﬁnding

suggests that ANOVA Type-I must be avoided even

if all correlated metrics are removed.

(RQ3) After removing all correlated metrics, how consis-

tent is the interpretation of defect models among

the studied interpretation techniques?

After removing all correlated metrics, we ﬁnd that

the consistency of the ranking of metrics among

the studied interpretation techniques is improved by

15%-64% for the top-ranked metric and 21%-71% for

the top-3 ranked metrics, respectively, highlighting

the beneﬁts of removing all correlated metrics on the

interpretation of defect models, i.e., the conclusions

of studies that rely on one interpretation technique

may not pose a threat after mitigating correlated

metrics.

(RQ4) Does removing all correlated metrics impact the

performance and stability of defect models?

Removing all correlated metrics impacts the AUC,

F-measure, and MCC performance of defect models

by less than 5 percentage points, suggesting that

researchers and practitioners should remove corre-

lated metrics with care especially for safety-critical

software domains.

Based on our ﬁndings, we suggest that: When the goal is to

derive sound interpretation from defect models, our results suggest

that future studies must (1) mitigate correlated metrics prior to

constructing a defect model, especially for ANOVA analyses; and

(2) avoid using ANOVA Type-I even if all correlated metrics are

removed, but instead opt to use ANOVA Type-II and Type-III for

additive and interaction models, respectively. Due to the variety

of the built-in interpretation techniques and their settings, our

paper highlights the essential need for future studies to report

the exact speciﬁcation (i.e., model formula) of their models and

settings (e.g., the calculation methods of the importance score) of

the used interpretation techniques.

1.1 Novelty Statements

To the best of our knowledge, this paper is the ﬁrst to

present:

(1) A series of preliminary analyses of the nature of corre-

lated metrics in defect datasets and their impact on the

interpretation of defect models (Appendix 2).

(2) An investigation of the impact of correlated metrics

on the consistency, the level of discrepancy, and the

direction of the produced rankings by the interpretation

techniques (RQ1).

(3) An empirical evaluation of the consistency of such rank-

ings after removing all correlated metrics (RQ2, RQ3).

(4) An investigation of the impact of removing all corre-

lated metrics on the performance and stability of defect

models (RQ4).

1.2 Paper Organization

Section 2 discusses the analytical modelling process, cor-

related metrics and concerns in the literature, and tech-

niques for mitigating correlated metrics. Section 3 describes

the design of our case study, while Section 4 presents

our results with respect to our four research questions.

Section 5 provides practical guidelines for future studies.

Section 6 discusses the threats to the validity of our study.

Section 7 draws conclusions. For the detailed explanation

of the studied correlation analysis techniques, commonly-

used analytical learners, and interpretation techniques, see

Appendix 1 of the supplementary materials.


Figure 1: An overview of the analytical modelling process.

2 BACKGROUND AND MOTIVATION

2.1 Analytical Modelling Process

Figure 1 provides an overview of the commonly-used an-

alytical modelling process. First, one must formulate a set

of hypotheses pertaining to phenomena of interest (e.g.,

whether the size of a module increases the risk associated

with that module). Second, one must determine a set of

metrics which operationalize the hypothesis of interest (e.g.,

the total lines of code for size, and the number of ﬁeld

reported bugs to capture the risk that is associated with a

module). Third, one must perform a correlation analysis to

remove correlated metrics. Fourth, one must define a model

speciﬁcation (e.g., the ordering of metrics) to be used when

constructing an analytical model. Fifth, one is then ready

to construct an analytical model using a machine learning

technique (e.g., a random forest model) or a statistical learn-

ing technique (e.g., a regression model). Finally, one ana-

lyzes the ranking of the metrics using model interpretation

techniques (e.g., ANOVA or Breiman’s Variable Importance)

in order to test the hypotheses of interest. The importance

ranking of the metrics is essential for project managers to

chart appropriate software quality improvement plans to

mitigate the risk of introducing defects in future releases.

For example, if code complexity is identiﬁed as the top-

ranked metric, project managers then can suggest develop-

ers to reduce the complexity of their code to reduce the risk

of introducing defects.

2.2 Correlated Metrics and Concerns in the Literature

Correlated metrics are metrics (i.e., independent variables)

that share a strong linear correlation among themselves. In

this paper, we focus on two types of correlation among

metrics, i.e., collinearity and multicollinearity. Collinearity is

a phenomenon in which one metric can be linearly predicted

by another metric. On the other hand, multicollinearity is a

phenomenon in which one metric can be linearly predicted

by a combination of two or more metrics.

Prior work points out that software metrics are often

correlated [22, 32, 33, 35, 36, 74, 77, 85]. However, little is known about the prevalence of correlated metrics in the publicly-available defect datasets. Thus, we set out to investigate in how many defect datasets the metrics that share a strong relationship with defect-proneness are correlated. Unfortunately, the results of our preliminary analysis (PA1) show that correlated metrics that share a strong relationship with defect-proneness are prevalent in 83 of the 101 (82%) publicly available defect datasets.

In addition, prior work raises concerns that correlated

metrics may impact the interpretation of defect models [77,

85]. To better understand how correlated metrics impact the

interpretation of defect models, we set out to investigate

(1) the impact of the number of correlated metrics on the

importance scores of metrics, and (2) the impact of the or-

dering of correlated metrics in a model speciﬁcation on the

importance ranking of metrics. The results of our preliminary

analyses (PA2, and PA3) show that the importance scores

of metrics substantially decrease when there are correlated

metrics in the models for both ANOVA analyses of logis-

tic regression and Variable Importance analyses (i.e., Gini

and Permutation) of random forest. The importance scores

of metrics are also sensitive to the ordering of correlated

metrics (except for ANOVA Type-II).

2.3 Techniques for Mitigating Correlated Metrics

There is a plethora of techniques that have been used to

mitigate irrelevant and correlated metrics in the domain of

defect prediction, e.g., dimensionality reduction [47, 57, 85],

feature selection [1, 54], and correlation analysis [15, 68, 82].

Dimensionality reduction transforms an initial set of

metrics into a set of transformed metrics that is representative of the initial set of metrics. Prior work has adopted

dimensionality reduction techniques (e.g., Principal Com-

ponent Analysis) to mitigate correlated metrics and improve

the performance of defect models [47, 57, 85]. Since the set of transformed metrics does not preserve the meaning of the initial set of metrics, and is thus not sensible for model interpretation and statistical inference [74], we exclude dimensionality

reduction techniques from this paper.

Feature selection produces an optimal subset of met-

rics that are relevant and non-correlated. One of the

most commonly-used feature selection techniques is the

correlation-based feature selection technique (CFS) [25]

which searches for the best subset of metrics that share the

highest correlation with the outcome (e.g., defect-proneness)

while having the lowest correlation among each other. To

better understand whether feature selection techniques mit-

igate correlated metrics, we set out to perform a correlation

analysis on the metrics that are selected by feature selection

techniques. In this preliminary analysis (PA4), we focus on

the two commonly-used techniques in the domain of de-

fect prediction, i.e., Information Gain and correlation-based

feature selection techniques. The results of this preliminary

analysis show that the metrics that are selected by the two

studied feature selection techniques are correlated (with

a Spearman correlation coefﬁcient up to 0.98), suggesting

that the commonly-used feature selection techniques do not

mitigate correlated metrics.

Correlation analysis is used to measure the correlation

among metrics given a threshold. Prior work applies corre-

lation analysis techniques to identify and mitigate correlated

metrics [15, 68, 82, 83]. Based on a literature survey of

Hall et al. [26] and Shihab [67], we select the commonly-used

correlation analysis techniques: Variable Clustering analysis

(VarClus), and Variance Inﬂation Factor (VIF).

Since there are many analytical learners that can be

used to investigate the impact of correlated metrics on

defect models, the aforementioned surveys guide our selec-

tion of the two commonly-used analytical learners: logistic

regression [5, 6, 15, 43, 53, 57, 58, 65, 87] and random

forest [23, 24, 38, 55, 64]. These techniques are two of the

most commonly-used analytical learners for defect models

and they have built-in techniques for model interpretation


(i.e., ANOVA for logistic regression and Breiman's Variable Importance for random forest). Finally, we select 9 model interpretation techniques: ANOVA Type-I, ANOVA Type-II with 4 test statistics (i.e., Wald, Likelihood Ratio, F, and Chi-square), scaled and non-scaled Gini Importance, and scaled and non-scaled Permutation Importance. We provide the detailed explanation of the studied correlation analysis techniques, analytical learners, and interpretation techniques in Table 2 and Appendix 1 of the supplementary materials.

Figure 2: An overview diagram of the design of our case study.

Table 1: A statistical summary of the studied datasets.

Project      Dataset        Modules  Metrics  Correlated Metrics  EPV  AUC(LR)  AUC(RF)
Apache       Lucene 2.4         340       20                   9   10     0.74     0.77
             POI 2.5            385       20                  11   12     0.80     0.90
             POI 3.0            442       20                  10   14     0.79     0.88
             Xalan 2.6          885       20                   8   21     0.79     0.85
             Xerces 1.4         588       20                  11   22     0.91     0.95
Eclipse      Debug 3.4        1,065       17                   9   15     0.72     0.81
             JDT                997       15                  10   14     0.81     0.82
             Mylyn            1,862       15                  10   16     0.78     0.74
             PDE              1,497       15                   9   14     0.72     0.72
             Platform 2.0     6,729       32                  24   30     0.82     0.84
             Platform 3.0    10,593       32                  24   49     0.79     0.81
             SWT 3.4          1,485       17                   7   38     0.87     0.97
Proprietary  Prop 1          18,471       20                  10  137     0.75     0.79
             Prop 4           8,718       20                  11   42     0.74     0.72

3 CASE STUDY DESIGN

In this study, we use 14 datasets of systems that span

across proprietary and open-source systems. We discuss

the selection criteria of the studied datasets in Appendix 3

of the supplementary materials. Table 1 shows a statistical

summary of the studied datasets, while Figure 2 provides an

overview of the design of our case study. Below, we discuss

the design of the case study that we perform in order to

address our four research questions.

3.1 Remove Correlated Metrics

To investigate the impact of correlated metrics on the per-

formance and interpretation of defect models and address

our four research questions, we start by removing highly-

correlated metrics in order to produce mitigated datasets,

i.e., datasets where correlated metrics are removed. To do so,

we apply variable clustering analysis (VarClus) and variance inflation factor analysis (VIF) (see Appendix 1.1). We use the interpretation of Spearman correlation coefficients (|ρ|)

as provided by Kraemer et al. [45] to identify correlated

metrics, i.e., a Spearman correlation coefﬁcient of above 0.7

is considered a strong correlation. We use a VIF threshold

of 5 to identify inter-correlated metrics, as suggested by

Fox [18] and is commonly used in prior work [4, 50, 68, 69].

We use the implementation of the variable clustering anal-

ysis as provided by the varclus function of the Hmisc R

package [28]. We use the implementation of the VIF analysis

as provided by the vif function of the rms R package [30].

Finally, we report the results of correlation analysis and a set

of mitigated metrics for each of the studied defect datasets

in the online appendix [34].
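For illustration, a minimal sketch of this step is shown below (it is not the authors' exact script); it assumes a data frame dat whose outcome column is bug and whose remaining columns are the software metrics.

library(Hmisc)   # varclus
library(rms)     # lrm, vif

metrics <- setdiff(names(dat), "bug")

# VarClus: hierarchical clustering of metrics based on Spearman correlation;
# metrics that cluster at |rho| > 0.7 are candidates for removal, keeping one
# representative metric per cluster (a manual, domain-informed choice).
vc <- varclus(as.matrix(dat[, metrics]), similarity = "spearman")
plot(vc)

# VIF: iteratively drop the metric with the largest variance inflation factor
# until all remaining metrics have a VIF below 5.
surviving <- metrics   # e.g., the representatives kept after VarClus
repeat {
  fit <- lrm(reformulate(surviving, response = "bug"), data = dat)
  v   <- vif(fit)
  if (max(v) < 5) break
  surviving <- setdiff(surviving, names(which.max(v)))
}
mitigated <- dat[, c(surviving, "bug")]   # the mitigated dataset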

3.2 Construct Defect Models

To examine the impact of correlated metrics on the perfor-

mance and interpretation of defect models, we construct our

models using the non-mitigated datasets (i.e., datasets where

correlated metrics are not removed) and mitigated datasets

(i.e., datasets where correlated metrics are removed). To

construct defect models, we perform the following steps:

(CM1) Generate bootstrap samples. To ensure that our

conclusions are statistically sound and robust, we use the

out-of-sample bootstrap validation technique, which lever-

ages aspects of statistical inference [17, 19, 29, 72, 78]. We

first generate a bootstrap sample of size N with replacement

from the mitigated and non-mitigated datasets. The gener-

ated sample is also of size N. We construct models using

the bootstrap samples, while we measure the performance

of the models using the samples that do not appear in

the bootstrap samples. On average, 36.8% of the original

dataset will not appear in the bootstrap samples, since the

samples are drawn with replacement [17]. We repeat the out-

of-sample bootstrap process 100 times and report the average performance.
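A minimal sketch of this step, assuming a data frame dat with an outcome column bug:

# Out-of-sample bootstrap: train on a bootstrap sample of size N (drawn with
# replacement); test on the rows that were never drawn (~36.8% on average).
set.seed(1234)
bootstraps <- lapply(1:100, function(i) {
  idx <- sample(nrow(dat), size = nrow(dat), replace = TRUE)
  list(train = dat[idx, ], test = dat[-unique(idx), ])
})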

(CM2) Construct defect models. For each bootstrap

sample, we construct logistic regression and random forest

models. We use the implementation of logistic regression as

provided by the glm function of the stats R package [80]

and the lrm function of the rms R package [30] with the

default parameter setting. We use the implementation of

random forest as provided by the randomForest function

of the randomForest R package [9] with the default ntree

value of 100, since recent studies [76, 79] show that the performance of defect models is insensitive to the parameters of random forest. To ensure that the training and testing corpora share similar characteristics and are representative of the original dataset, we do not re-balance nor re-sample

the training data to avoid any impact on the interpretation

of defect models [75].
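A minimal sketch of this step for one bootstrap sample (train), using the packages named above; the formula bug ~ . is illustrative.

library(rms)            # lrm
library(randomForest)   # randomForest

glm.model <- glm(bug ~ ., data = train, family = binomial)   # logistic regression (stats)
lrm.model <- lrm(bug ~ ., data = train)                       # logistic regression (rms)

train$bug <- as.factor(train$bug)   # a factor outcome makes randomForest perform classification
rf.model  <- randomForest(bug ~ ., data = train,
                          ntree = 100,        # default ntree value used in the paper
                          importance = TRUE)  # required for permutation importance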

3.3 Analyze the Model Interpretation

To address RQ1, RQ2, and RQ3, we analyze the importance

ranking of metrics of the models that are constructed using

non-mitigated datasets and mitigated datasets. The analysis

of model interpretation is made up of 2 steps.

(MI1) Compute the importance score of metrics. We

investigate the impact of correlated metrics on the interpre-

tation of defect models using different model interpretation

techniques. Thus, we apply the 9 studied model interpre-

tation techniques, i.e., Type-I, Type-II (Wald, LR, F, Chisq),

scaled and non-scaled Gini Importance, and scaled and non-

scaled Permutation Importance. We provide the technical

description of the studied interpretation techniques in Ap-

pendix 1.3 of the supplementary materials.

(MI2) Identify the importance ranking of metrics.

To statistically identify the importance ranking of metrics,

we apply the improved Scott-Knott Effect Size Difference

(ESD) test (v2.0) [73]. The Scott-Knott ESD test is a mean

comparison approach that leverages a hierarchical cluster-

ing to partition a set of treatment means (i.e., means of


importance scores) into statistically distinct groups with a statistically non-negligible difference. The Scott-Knott ESD test ranks each metric at only a single rank; however, several metrics may appear within one rank. Finally, we identify the importance ranking of metrics for the non-mitigated and mitigated models. Thus, each metric has a rank for each model interpretation technique and for each of the mitigated and non-mitigated models. We use the implementation of the Scott-Knott ESD test as provided by the sk_esd function of the ScottKnottESD R package [73].

Table 2: A summary of the studied correlation analysis techniques, the two studied analytical learners, and the 9 studied interpretation techniques.

Correlation Analysis: Variable Clustering [56, 81-83], Variance Inflation Factor [4, 15, 50, 68, 69], Redundancy Analysis [2, 37, 56, 70, 77]

Analytical Learner                  Interpretation Technique  Test Statistic         R function
Logistic Regression (glm and lrm)   Type-I                    Deviance               stats::anova(glm.model)
[5, 6, 57, 58, 87]                  Type-II                   Wald                   car::Anova(glm.model, type=2, test.statistic='Wald')
                                    Type-II                   Likelihood Ratio (LR)  car::Anova(glm.model, type=2, test.statistic='LR')
                                    Type-II                   F                      car::Anova(glm.model, type=2, test.statistic='F')
                                    Type-II                   Chi-square             rms::anova(lrm.model, test='Chisq')
Random Forest                       Scaled Gini               MeanDecreaseGini       randomForest::importance(model, type=2, scale=TRUE)
[23, 24, 38, 55, 64]                Non-scaled Gini           MeanDecreaseGini       randomForest::importance(model, type=2, scale=FALSE)
                                    Scaled Permutation        MeanDecreaseAccuracy   randomForest::importance(model, type=1, scale=TRUE)
                                    Non-scaled Permutation    MeanDecreaseAccuracy   randomForest::importance(model, type=1, scale=FALSE)
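For illustration, a minimal sketch of the interpretation step (MI1-MI2) using the R functions listed in Table 2; it assumes the models from Section 3.2 and a data frame scores that collects, per bootstrap iteration, the importance score of each metric (one column per metric).

library(car)             # Anova (ANOVA Type-II)
library(ScottKnottESD)   # sk_esd

# MI1: compute importance scores (one technique per learner shown here).
car::Anova(glm.model, type = 2, test.statistic = "Wald")
randomForest::importance(rf.model, type = 1, scale = TRUE)   # scaled Permutation Importance

# MI2: Scott-Knott ESD test on the per-metric importance-score distributions
# collected over the 100 bootstrap iterations; metrics in the same group share a rank.
ranking <- sk_esd(scores)
ranking$groups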

3.4 Analyze the Model Performance

To address RQ4, we analyze the performance of the mod-

els that are constructed using non-mitigated datasets and

mitigated datasets.

First, we use the Area Under the receiver operator char-

acteristic Curve (AUC) to measure the discriminatory power

of our models, as suggested by recent research [21, 46, 62].

The AUC is a threshold-independent performance measure

that evaluates the ability of models in discriminating be-

tween defective and clean modules. The values of AUC

range between 0 (worst performance), 0.5 (no better than

random guessing), and 1 (best performance) [27].

Second, we use the F-measure, i.e., a threshold-dependent measure. F-measure is the harmonic mean of precision (TP/(TP+FP)) and recall (TP/(TP+FN)), i.e., (2 × precision × recall) / (precision + recall).

Similar to prior studies [1, 86], we use the default probability

value of 0.5 as a threshold value for the confusion matrix,

i.e., if a module has a predicted probability above 0.5, it is

considered defective; otherwise, the module is considered

clean.

Third, we use the Matthews Correlation Coefficient (MCC), i.e., a threshold-dependent measure, as suggested by prior studies [48, 66]. MCC is a balanced measure based on true and false positives and negatives, computed as MCC = (TP × TN - FP × FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
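A minimal sketch of the performance computation for one bootstrap iteration, assuming the glm.model and held-out test set from the previous steps, with bug coded as 0/1; the pROC package for the AUC is our assumption rather than a package named in the paper.

library(pROC)

prob <- predict(glm.model, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)   # default probability threshold of 0.5

auc.value <- as.numeric(auc(roc(test$bug, prob, quiet = TRUE)))

TP <- as.numeric(sum(pred == 1 & test$bug == 1))
FP <- as.numeric(sum(pred == 1 & test$bug == 0))
TN <- as.numeric(sum(pred == 0 & test$bug == 0))
FN <- as.numeric(sum(pred == 0 & test$bug == 1))

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f.measure <- 2 * precision * recall / (precision + recall)
mcc       <- (TP * TN - FP * FN) /
             sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))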

4 CASE STUDY RESULTS

In this section, we present the results of our case study with

respect to our four research questions.

(RQ1) How do correlated metrics impact the interpreta-

tion of defect models?

Motivation. Prior work raises concerns that metrics are

often correlated and should be mitigated [22, 32, 33, 35, 74,

77, 85]. For example, Herraiz et al. [33], and Gil et al. [22]

point out that code complexity is often correlated with lines

of code. Unfortunately, a literature survey of Shihab [67]

shows that as much as 63% of prior defect studies do

not mitigate correlated metrics prior to constructing defect

models. Yet, little is known about the impact of correlated

metrics on the interpretation of defect models.

Approach. To address RQ1, we analyze the impact of

correlated metrics on the interpretation of defect models

along three dimensions, i.e., (1) the consistency of the

top-ranked metric, (2) the level of discrepancy of the top-

ranked metric, and (3) the direction of the ranking of metrics

for all non-correlated metrics between mitigated and non-

mitigated models.

To do so, we start from mitigated datasets (see Sec-

tion 3.1). We ﬁrst identify the top-ranked metric for each of

the 9 studied interpretation techniques. We use VarClus to

select only one of the metrics that is correlated with the top-

ranked metric in order to generate non-mitigated datasets.

We then append the correlated metric to the ﬁrst position

of the specification of the mitigated models. Thus, the specification for the mitigated models is y ~ m_top-ranked + ..., while the specification for the non-mitigated models is y ~ m_correlated + m_top-ranked + ..., where m_correlated is the metric that is correlated with the top-ranked metric (m_top-ranked). For each of the mitigated and non-mitigated

datasets, we construct defect models (see Section 3.2) and

apply the 9 studied model interpretation techniques (see

Section 3.3).

To analyze the consistency and the level of discrepancy of

the top-ranked metric, we compute the difference in the

ranks of the top-ranked metric between mitigated and non-

mitigated models. For example, if a metric mtop ranked ap-

pears in the 1st rank in both of mitigated and non-mitigated

models, then the metric would have a rank difference of 0.

However, if mtop ranked appears in the 3rd rank of a non-

mitigated model and appears in the 1st rank of a mitigated

model, then the rank difference of mtop ranked would be 2.

The consistency of the top-ranked metric measures the per-

centage of the studied datasets in which the top-ranked metric

appears at the 1st rank in both mitigated and non-mitigated

models. On the other hand, the level of discrepancy of the

top-ranked metric measures the highest rank difference of

the top-ranked metric between mitigated and non-mitigated

models.

Figure 3: The percentage of the studied datasets for each difference in the ranks between the top-ranked metric of the models that are constructed using the mitigated and non-mitigated datasets. The light blue bars represent a consistent rank of the metric between mitigated and non-mitigated models, while the red bars represent an inconsistent rank of the metric between mitigated and non-mitigated models.

To analyze the direction of the ranking of metrics for all non-correlated metrics between mitigated and non-mitigated models, we start with the ranking of important metrics that appear in both mitigated and non-mitigated models. We then apply a Spearman rank correlation test (ρ) to compute the correlation between ranks of all non-correlated metrics between mitigated and non-mitigated models. A Spearman correlation coefficient (ρ) of 1 indicates that the ranking of non-correlated metrics between mitigated and non-mitigated models is in the same direction. On the other hand, a Spearman correlation coefficient (ρ) of -1 indicates that the ranking of non-correlated metrics between mitigated and non-mitigated models is in the reverse direction. Since the produced ranking of each model and defect dataset may be different, the use of a weighted Spearman rank correlation test may lead these rankings to be weighted differently. Thus, we select a traditional Spearman rank correlation test for our study. We report the distributions of the Spearman correlation coefficients (ρ) of the ranking of metrics between mitigated and non-mitigated models for all studied interpretation techniques in Figure S7 in the supplementary materials.
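A minimal sketch of the direction analysis, assuming two named vectors rank.mitigated and rank.nonmitigated that hold the Scott-Knott ESD ranks of the non-correlated metrics of one dataset:

common <- intersect(names(rank.mitigated), names(rank.nonmitigated))
rho <- cor(rank.mitigated[common], rank.nonmitigated[common],
           method = "spearman")
# rho close to 1: the two rankings point in the same direction;
# rho close to -1: the ranking is reversed in the non-mitigated model.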

Results. ANOVA Type-I produces the lowest consistency

of the top-ranked metric between mitigated and non-

mitigated models. We expect that the top-ranked metric in

the mitigated model will remain as the top-ranked metric

in the non-mitigated model. Unfortunately, Figure 3 shows

that this expectation does not hold true in any of the

studied datasets for ANOVA Type-I. Figure 3 shows that, for

ANOVA Type-I, none of the top-ranked metric that appears

in the 1st rank of mitigated models also appears in the 1st

rank of non-mitigated models. On the other hand, we ﬁnd

that the top-ranked metric of mitigated models appears at

the top-ranked metric in non-mitigated models for 84%,

67%, 55%, and 67% of the studied datasets for ANOVA

Type-II (Wald), Type-II (LR), Type-II (F), Type-II (Chisq), re-

spectively. We suspect that the impact of correlated metrics

on the interpretation of Type-I has to do with the sequential

nature of the calculation of the Sum of Squares, i.e., Type-

I attributes as much variance as it can to the ﬁrst metric

before attributing residual variance to the second metric in

the model speciﬁcation.

ANOVA Type-I and Type-II produce the highest level

of discrepancy of the top-ranked metrics between miti-

gated and non-mitigated models. Figure 3 shows that the

rank difference for ANOVA Type-I and Type-II can be up to

-6 and -8, respectively. In other words, we ﬁnd that the top-

ranked metric in mitigated models appears at the 7th rank

and the 9th rank in non-mitigated models for ANOVA Type-

I and Type-II, respectively. We suspect that the highest level

of discrepancy (i.e., the largest rank difference) for ANOVA

Type-I and Type-II has to do with the sharp drop of the

importance scores when correlated metrics are included in

defect models (see PA1).

For ANOVA Type-I and Type-II, correlated metrics

have the largest impact on the direction of the ranking

of metrics of defect models. For ANOVA Type-I, we ﬁnd

that the Spearman correlation coefﬁcients range from -0.1 to

0.84 (see Figure S7). For ANOVA Type-II, we ﬁnd that the

Spearman correlation coefﬁcients range from 0.1 to 1. A low

value of the Spearman correlation coefﬁcients in ANOVA

techniques indicates that the direction of the ranking of

metrics for all non-correlated metrics varies,

suggesting that the ranking of non-correlated metrics is

inconsistent across mitigated and non-mitigated models.

Gini and Permutation Importance approaches produce

the highest consistency and the lowest level of discrepancy

of the top-ranked metric, and have the least impact on the

direction of the ranking of metrics between mitigated and

non-mitigated models. Figure 3 shows that the top-ranked

metric of mitigated models appears as the top-ranked metric

in non-mitigated models for 88%, 92%, and 55% of the

studied datasets for Gini Importance, and scaled and non-

scaled Permutation Importance, respectively. Figure 3 also

shows that the rank difference for Gini and Permutation

Importance is as low as -1 and -3, respectively. Furthermore,

we ﬁnd that the Spearman correlation coefﬁcients range

from 0.9 to 1 for Gini and Permutation Importance (see

Figure S7). These ﬁndings suggest that the lower impact that

correlated metrics have on Gini and Permutation Impor-

tance than ANOVA techniques have to do with the random

process for constructing multiple trees and the calculation of

importance scores for a random forest model. For example,


the random process of random forest may generate trees that

are constructed without correlated metrics. In addition, the

averaging of the importance scores from multiple trees may

decrease the negative impact of correlated metrics on trees

that are constructed with correlated metrics for random

forest models.

ANOVA Type-I and Type-II often produce the lowest consis-

tency and the highest level of discrepancy of the top-ranked

metric, and have the highest impact on the direction of the

ranking of metrics between mitigated and non-mitigated mod-

els when compared to Gini and Permutation Importance. This

ﬁnding highlights the risks of not mitigating correlated metrics

in the ANOVA analyses of prior studies.

(RQ2) After removing all correlated metrics, how con-

sistent is the interpretation of defect models among

different model speciﬁcations?

Motivation. Our motivating analysis (see Appendix 2) and

the results of RQ1 conﬁrm that the ranking of the top-

ranked metric substantially changes when the ordering of

correlated metrics in a model speciﬁcation is rearranged,

suggesting that correlated metrics must be removed. How-

ever, after removing correlated metrics, little is known if

the interpretation of defect models would become consistent

when rearranging the ordering of metrics.

Approach. To address RQ2, we analyze the ranking of the

top-ranked metric of the models that are constructed using

different ordering of metrics from mitigated datasets. To do

so, we start from mitigated datasets that are produced by

Section 3.1. For each of the datasets, we construct defect

models (see Section 3.2) and apply the 9 studied model in-

terpretation techniques (see Section 3.3) in order to identify

the top-ranked metric according to each technique. Then,

we regenerate the models where the ordering of metrics is

rearranged—the top-ranked metric is at each position from

the ﬁrst to the last for each dataset. Finally, we compute the

percentage of datasets where the ranks of the top-ranked

metric are inconsistent among the rearranged datasets.
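A minimal sketch of this rearrangement, assuming metrics holds the mitigated metric names, top the top-ranked metric, and dat the mitigated dataset:

# Build one model specification per position of the top-ranked metric,
# from the first position to the last, and apply ANOVA Type-I to each.
others <- setdiff(metrics, top)
specs  <- lapply(0:length(others), function(pos) {
  reformulate(append(others, top, after = pos), response = "bug")
})
rankings <- lapply(specs, function(f) anova(glm(f, data = dat, family = binomial)))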

Results. After removing correlated metrics, the top-ranked

metrics according to Type-II, Gini Importance, and Per-

mutation Importance are consistent. However, the top-

ranked metric according to Type-I is still inconsistent

regardless of the ordering of metrics. We ﬁnd that Type-

II, Gini Importance, and Permutation Importance produce a

stable ranking of the top-ranked metric for all of the studied

datasets regardless of the ordering of metrics.

On the other hand, ANOVA Type-I is the only technique

that produces an inconsistent ranking of the top-ranked

metric. We ﬁnd that for 71% of the studied datasets, ANOVA

Type-I produces an inconsistent ranking of the top-ranked

metric when the ordering of metrics is rearranged. We

expect that the consistency of the ranking of the top-ranked

metrics can be improved by increasing the strictness of

the correlation threshold of the variable clustering analy-

sis (VarClus). Thus, we repeat the analysis using stricter

thresholds of the variable clustering analysis (VarClus). We

use Spearman correlation coefficient (|ρ|) threshold values

of 0.5 and 0.6. Unfortunately, even if we increase the strict-

ness of the correlation threshold value, Type-I produces an

inconsistent ranking of the top-ranked metric for 43% and

50% of the studied datasets for the thresholds of 0.5 and 0.6,

respectively.

The inconsistent ranking of the top-ranked metric ac-

cording to Type-I has to do with the sequential nature of

the calculation of the Sum of Squares (see Appendix 1.3).

In other words, Type-I attributes the importance scores as

much as it can to the ﬁrst metric before attributing the scores

to the second metric in the model speciﬁcation. Thus, Type-I

is sensitive to the ordering of metrics.

After removing all correlated metrics, the top-ranked metric according to ANOVA Type-II, Gini Importance, and Permutation Importance is consistent. However, the top-ranked metric

according to ANOVA Type-I is inconsistent, since the ranking

of metrics is impacted by its order in the model speciﬁcation

when analyzed using ANOVA Type-I (which is the default

analysis for the glm model in R and is commonly-used in

prior studies). This ﬁnding suggests that ANOVA Type-I must

be avoided even if all correlated metrics are removed.

(RQ3) After removing all correlated metrics, how con-

sistent is the interpretation of defect models among the

studied interpretation techniques?

Motivation. The ﬁndings of prior work often rely heavily

on one model interpretation technique [23, 37, 55, 56, 70,

77]. Therefore, the ﬁndings of prior work may pose a threat

to construct validity, i.e., the ﬁndings may not hold true if

one uses another interpretation technique. Thus, we set out

to investigate the consistency of the top-ranked and top-3

ranked metrics after removing correlated metrics.

Approach. To address RQ3, we start from mitigated datasets

that are produced by Section 3.1 and non-mitigated datasets

(i.e., the original datasets). We compare the two rank-

ings that are produced from mitigated and non-mitigated

models using the 9 interpretation techniques for each of

the studied datasets. Then, we compute the percentage of

datasets where the top-ranked metric is consistent among

the studied model interpretation techniques. Moreover, we

also compute the percentage of datasets where at least

one of the top-3 ranked metrics is consistent among the

studied model interpretation techniques. Finally, we present

the results using a heatmap (as shown in Figure 4) where

each cell indicates the percentage of datasets in which the top-

ranked metric is consistent among the two studied model

interpretation techniques.
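A minimal sketch of this computation, assuming top.metric is a data frame with one row per studied dataset and one column per interpretation technique, holding the name of the top-ranked metric:

techniques  <- names(top.metric)
consistency <- outer(techniques, techniques, Vectorize(function(a, b) {
  mean(top.metric[[a]] == top.metric[[b]]) * 100   # % of datasets that agree
}))
dimnames(consistency) <- list(techniques, techniques)
consistency   # the values shown in the heatmap of Figure 4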

Results. Before removing all correlated metrics, we find

that the studied model interpretation techniques do not

tend to produce the same top-ranked metric. We observe

that the consistency of the ranking of metrics across learning

techniques (i.e., logistic regression and random forest) is

as low as 0%-43% for the top-ranked metric (Figure 4a)

and 21%-57% for the top-3 ranked metrics (see Figure S8a),

respectively. Furthermore, according to the lower-left side of

the matrix of Figure 4a, we find that, before removing

correlated metrics, the top-ranked metric of Type-II (Chisq)

is inconsistent with the top-ranked metrics of Type-I and

Gini Importance for all of the studied datasets.

After removing all correlated metrics, we ﬁnd that the

consistency of the ranking of metrics among the studied

interpretation techniques is improved by 15%-64% for

the top-ranked metric and 21%-71% for the top-3 ranked metrics, respectively. In particular, we observe that the consistency of the ranking of metrics across learning techniques is improved by 28%-50% for the top-1 ranked metrics and 43%-71% for the top-3 ranked metrics, respectively. Most importantly, we find that scaled Permutation Importance achieves the highest consistency of the ranking of metrics across learning techniques. This finding highlights the benefits of removing correlated metrics on the interpretation of defect models: the conclusions of studies that rely on one interpretation technique may not pose a threat after mitigating correlated metrics.

Figure 4: The percentage of datasets where the top-ranked metric is consistent between the two studied model interpretation techniques: (a) the top-ranked metric for non-mitigated models; (b) the top-ranked metric for mitigated models. While the lower-left side of the matrix (i.e., red shades) shows the percentage before removing correlated metrics, the upper-right side of the matrix (i.e., blue shades) shows the percentage after removing correlated metrics. For the consistency of the top-3 ranked metrics, see Figure S8 in the supplementary materials.

After removing all correlated metrics, we ﬁnd that the consis-

tency of the ranking of metrics among the studied interpreta-

tion techniques is improved by 15%-64% for the top-ranked

metric and 21%-71% for the top-3 ranked metrics, respectively,

highlighting the beneﬁts of removing all correlated metrics

on the interpretation of defect models, i.e., the conclusions of

studies that rely on one interpretation technique may not pose

a threat after mitigating correlated metrics.

(RQ4) Does removing all correlated metrics impact the

performance and stability of defect models?

Motivation. The results of RQ1 show that correlated met-

rics have a negative impact on the interpretation of defect

prediction models, while the results of RQ2 and RQ3 show

the beneﬁts of removing correlated metrics on the inter-

pretation of defect models. Thus, removing correlated met-

rics is highly recommended. However, removing correlated

metrics may pose a risk to the performance and stability

of defect models. Yet, little is known if removing such

correlated metrics impacts the performance and stability of

defect models.

Approach. To address RQ4, we ﬁrst start from the AUC,

F-measure, and MCC performance estimates and the performance stability of the non-mitigated and mitigated models. The performance stability is measured by the standard deviation of the performance estimates produced by the 100 iterations of the out-of-sample bootstrap for each model. We then quantify the impact of removing all correlated metrics by measuring the performance difference (i.e., the arithmetic difference between the performance of the non-mitigated and mitigated models) and the stability ratio (i.e., the ratio of the standard deviation of the performance estimates of the non-mitigated models to that of the mitigated models, S.D.(non-mitigated) / S.D.(mitigated)). Furthermore, in order to measure the effect size of the impact, we measure Cliff's |δ| effect size for the performance difference and the stability ratio across the non-mitigated and mitigated models.

Figure 5: The distributions of the performance difference (% pts) between non-mitigated and mitigated models for each of the studied datasets.
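A minimal sketch of this comparison for one dataset and one performance measure, assuming auc.nonmitigated and auc.mitigated hold the 100 bootstrap AUC estimates; the effsize package for Cliff's delta is our assumption rather than a package named in the paper.

library(effsize)   # cliff.delta

perf.difference <- mean(auc.nonmitigated) - mean(auc.mitigated)
stability.ratio <- sd(auc.nonmitigated) / sd(auc.mitigated)
cliff.delta(auc.nonmitigated, auc.mitigated)   # effect size of the difference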

Results. Removing all correlated metrics impacts the

AUC, F-measure, and MCC performance of defect models

by less than 5 percentage points. Figure 5 shows that

the distributions of the performance difference between

the models that are constructed using non-mitigated and

mitigated datasets are centered at zero. In addition, our

Cliff's |δ| effect size test shows that the differences between

the models that are constructed using mitigated and non-


mitigated datasets are negligible to small for the AUC, F-

measure, and MCC measures. However, the performance

difference of 5 percentage points may be very important

for safety-critical software domains. Thus, researchers and

practitioners should remove correlated metrics with care.

Removing all correlated metrics yields a negligible

difference (Cliff's |δ|) for the stability of the performance

of defect models. Figure 6 shows that the distributions

of the stability ratio of the models that are constructed

using non-mitigated and mitigated datasets are centered at

one (i.e., there is little difference in model stability after

removing all correlated metrics). Moreover, our Cliff's |δ|

effect size test shows that the difference of the stability ratio

between the models that are constructed using mitigated

and non-mitigated datasets is negligible.

Removing all correlated metrics impacts the AUC, F-measure,

and MCC performance of defect models by less than 5 per-

centage points, suggesting that researchers and practitioners

should remove correlated metrics with care especially for safety-

critical software domains.

5 PRACTICAL GUIDELINES

In this section, we offer practical guidelines for future stud-

ies. When the goal is to derive sound interpretation from

defect models:

(1) One must mitigate correlated metrics prior to con-

structing a defect model, especially for ANOVA anal-

yses, since RQ1 shows that (1) ANOVA Type-I and Type-

II often produce the lowest consistency and the highest

level of discrepancy of the top-ranked metric, and have

the highest impact on the direction of the ranking of

metrics between mitigated and non-mitigated models.

On the other hand, the results of RQ2 and RQ3 show

that removing all correlated metrics (2) improves the

consistency of the top-ranked metric regardless of the

ordering of metrics; and (3) improves the consistency of

the ranking of metrics among the studied interpreta-

tion techniques, suggesting that correlated metrics must

be mitigated. However, the results of RQ4 show that the

removal of such correlated metrics impacts the model

performance by less than 5 percentage points, suggest-

ing that researchers and practitioners should remove

correlated metrics with care especially for safety-critical

software domains.

(2) One must avoid using ANOVA Type-I even if all

correlated metrics are removed, since RQ2 shows that

Type-I produces an inconsistent ranking of the top-

ranked metric when the orders of metrics are rear-

ranged, indicating that Type-I is sensitive to the ordering

of metrics even when removing all correlated metrics.

Instead, researchers should opt to use ANOVA Type-II

and Type-III for additive and interaction logistic re-

gression models, respectively. Furthermore, the scaled

Permutation Importance approach is recommended for

random forest since RQ3 shows that such approach

achieves the highest consistency across learning tech-

niques.

Finally, we would like to emphasize that mitigating cor-

related metrics is not necessary for all studies, all scenarios,

all datasets, and all analytical models in software engineering. Instead, the key message of our study is that correlated metrics must be mitigated when the goal is to derive sound interpretation from defect models that are trained with correlated metrics (especially for ANOVA Type-I). On the other hand, if the goal of a study is to produce highly-accurate prediction models, one might prioritize resources on improving the model performance rather than mitigating correlated metrics. In that case, feature selection and dimensionality reduction techniques can be considered to mitigate irrelevant and correlated metrics and improve model performance.

Figure 6: The distributions of the stability ratio of non-mitigated to mitigated models for each of the studied datasets.

6 THREATS TO VALIDITY

Construct Validity. In this work, we only construct regression models in an additive fashion (y ~ m1 + ... + mn), since metric interactions (i.e., the relationship between each of the two interacting metrics depends on the value of the other metric) (1) are rarely explored in software engineering (except in [66]); (2) must be statistically insignificant (i.e., absent) for the ANOVA Type-II test [13, 18]; and (3)

are not compatible with random forest [8] which is one

of the most commonly-used analytical learners in software

engineering. On the other hand, the importance score of

the metric produced by ANOVA Type-III is evaluated after

all of the other metrics and all metric interactions of the

metric under examination have been accounted for. Thus, if

metric interactions are signiﬁcantly present, one should use

ANOVA Type-III and avoid using ANOVA Type-II. Due to

the same way in which the importance scores of metrics

according to ANOVA Type-II and Type-III are calculated

in a hierarchical nature (see Appendix 1.3) for an additive

model, we would like to note that the importance scores of

metrics according to ANOVA Type-II and Type-III are the

same for such additive models.

Plenty of prior work shows that the parameters of clas-

siﬁcation techniques have an impact on the performance

of defect models [20, 44, 51, 52, 76, 79]. While we use a

default ntree value of 100 for random forest models, recent

studies [76, 79] show that the performance of defect models is insensitive to the parameters of random forest. Thus,

the parameters of random forest models do not pose a threat

to the validity of our study.

Recent work points out that the selection [39, 76] and the quality [84] of datasets might impact the conclusions of a study. Thus, our conclusions may change when a different set of studied datasets is used. Moreover,

Tantithamthavorn et al. [78] point out that randomness may

introduce bias to the conclusions of a study. To mitigate this

threat and ensure that our results are reproducible, we set a

random seed in every step in our experiment design.

Internal Validity. Recent research uses ridge regression to

construct defect models on the dataset that contains corre-

lated metrics [63]. However, our additional analyses (Ap-

pendix 4 of the supplementary materials) show that ridge

regression improves the performance of defect model when

comparing to logistic regression, yet produces a misleading

importance ranking of metrics. We observe that metrics that

are highly correlated appear at different ranks. This obser-

vation highlights the importance of mitigating correlated

metrics when interpreting defect models.

We studied a limited number of model interpretation

techniques. Thus, our results may not generalize to other

model interpretation techniques. Nonetheless, other model

interpretation techniques can be explored in future work.

We provide a detailed methodology for others who would

like to re-examine our ﬁndings using unexplored model

interpretation techniques.

External Validity. The analyzed datasets are part of several

corpora (e.g., NASA and PROMISE) of systems that span

both proprietary and open source domains. However, we

studied a limited number of defect datasets. Thus, the

results may not generalize to other datasets and domains.

Nonetheless, additional replication studies are needed.

In our study, we exclude (1) datasets that are not rep-

resentative of common practice or (2) datasets that would

not realistically beneﬁt from our analysis (e.g., datasets

in which most of the software modules are defective) using the

selection criteria of the studied datasets (in Appendix 3).

Nevertheless, our proposed approaches are applicable to

any dataset. Practitioners are encouraged to explore our ap-

proaches on their own datasets with their own peculiarities.

The conclusions of our case study rely on one defect pre-

diction scenario (i.e., within-project defect models). How-

ever, there are a variety of defect prediction scenarios in

the literature (e.g., cross-project defect prediction [12, 86],

just-in-time defect prediction [38], heterogeneous defect pre-

diction [60]). Therefore, the practical guidelines may differ

for other scenarios. Thus, future research should revisit our

study in other scenarios of defect models.

7 CONCLUSION

In this paper, we set out to investigate (1) the impact of correlated metrics on the interpretation of defect models. After removing correlated metrics, we investigate (2) the consistency of the interpretation of defect models; and (3) the impact of such removal on the performance and stability of defect models.

Through a case study of 14 publicly-available defect datasets

of systems that span both proprietary and open source

domains, we conclude that (1) correlated metrics have the

largest impact on the consistency, the level of discrepancy,

and the direction of the ranking of metrics, especially for

ANOVA techniques. On the other hand, we ﬁnd that remov-

ing all correlated metrics (2) improves the consistency of

the produced rankings regardless of the ordering of metrics

(except for ANOVA Type-I); (3) improves the consistency

of ranking of metrics among the studied interpretation

techniques; (4) impacts the model performance by less than

5 percentage points.

Based on our ﬁndings, we offer practical guidelines for

future studies. When the goal is to derive sound interpreta-

tion from defect models:

1) One must mitigate correlated metrics prior to constructing a defect model, especially for ANOVA analyses (a minimal sketch of such mitigation follows this list).
2) One must avoid using ANOVA Type-I even if all correlated metrics are removed.
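A minimal sketch of the first guideline is shown below, using a simple Spearman correlation filter; the |rho| >= 0.7 threshold and the greedy removal order are illustrative choices, and automated alternatives such as AutoSpearman [36] can be used instead.

# 'd' is a hypothetical defect dataset whose columns are software metrics plus
# a binary 'defect' label
metrics <- d[, setdiff(colnames(d), "defect")]
rho     <- cor(metrics, method = "spearman")     # pairwise Spearman correlation

threshold <- 0.7                                 # illustrative cutoff
keep      <- colnames(metrics)
for (m in colnames(metrics)) {
  if (!(m %in% keep)) next
  # greedily drop every remaining metric that is highly correlated with m
  correlated <- setdiff(names(which(abs(rho[m, ]) >= threshold)), m)
  keep       <- setdiff(keep, correlated)
}

# Construct the defect model using only the surviving metrics
model <- glm(reformulate(keep, response = "defect"), data = d,
             family = binomial())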

Due to the variety of built-in interpretation techniques and their settings, our paper highlights the essential need for future research to report the exact specification of their models and the settings of the interpretation techniques that they use.

REFERENCES

[1] E. Arisholm, L. C. Briand, and E. B. Johannessen, “A System-

atic and Comprehensive Investigation of Methods to Build and

Evaluate Fault Prediction Models,” Journal of Systems and Software,

vol. 83, no. 1, pp. 2–17, 2010.

[2] J. G. Barnett, C. K. Gathuru, L. S. Soldano, and S. McIntosh, “The

Relationship between Commit Message Detail and Defect Prone-

ness in Java Projects on GitHub,” in Proceedings of the International

Conference on Mining Software Repositories (MSR), 2016, pp. 496–499.

[3] V. R. Basili, L. C. Briand, and W. L. Melo, “A Validation of Object-

oriented Design Metrics as Quality Indicators,” Transactions on

Software Engineering (TSE), vol. 22, no. 10, pp. 751–761, 1996.

[4] N. Bettenburg and A. E. Hassan, “Studying the Impact of Social

Structures on Software Quality,” in Proceedings of the International

Conference on Program Comprehension (ICPC), 2010, pp. 124–133.

[5] C. Bird, B. Murphy, and H. Gall, “Don’t Touch My Code! Examin-

ing the Effects of Ownership on Software Quality,” in Proceedings

of the Joint Meeting of the European Software Engineering Confer-

ence and the Symposium on the Foundations of Software Engineering

(ESEC/FSE), 2011, pp. 4–14.

[6] C. Bird, N. Nagappan, P. Devanbu, H. Gall, and B. Murphy, “Does

Distributed Development Affect Software Quality?: An Empirical

Case Study of Windows Vista,” Communications of the ACM, vol. 52,

no. 8, pp. 85–93, 2009.

[7] D. Bowes, T. Hall, M. Harman, Y. Jia, F. Sarro, and F. Wu,

“Mutation-Aware Fault Prediction,” in Proceedings of the Interna-

tional Symposium on Software Testing and Analysis (ISSTA), 2016, pp.

330–341.

[8] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp.

5–32, 2001.

[9] L. Breiman, A. Cutler, A. Liaw, and M. Wiener, “randomForest

: Breiman and Cutler’s Random Forests for Classiﬁcation and

Regression. R package version 4.6-12.” Software available at URL:

https://cran.r-project.org/web/packages/randomForest, 2006.

[10] L. C. Briand, W. L. Melo, and J. Wust, “Assessing the Applica-

bility of Fault-proneness Models across Object-oriented Software

Projects,” Transactions on Software Engineering (TSE), vol. 28, no. 7,

pp. 706–720, 2002.

[11] L. C. Briand, J. Wüst, J. W. Daly, and D. V. Porter, “Exploring the

Relationships between Design Measures and Software Quality in

Object-oriented Systems,” Journal of Systems and Software, vol. 51,

no. 3, pp. 245–273, 2000.

[12] G. Canfora, A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella,

and S. Panichella, “Multi-objective Cross-project Defect Predic-

tion,” in Proceedings of the International Conference on Software

Testing, Veriﬁcation and Validation (ICST), 2013, pp. 252–261.

[13] J. M. Chambers, Statistical Models in S. Pacific Grove, CA: Wadsworth, 1992.

[14] S. R. Chidamber and C. F. Kemerer, “A Metrics Suite for Ob-

ject Oriented Design,” Transactions on Software Engineering (TSE),

vol. 20, no. 6, pp. 476–493, 1994.

[15] P. Devanbu, T. Zimmermann, and C. Bird, “Belief & Evidence in

Empirical Software Engineering,” in Proceedings of the International

Conference on Software Engineering (ICSE), 2016, pp. 108–119.

[16] M. Di Penta, L. Cerulo, Y.-G. Guéhéneuc, and G. Antoniol, “An Empirical Study of the Relationships between Design Pattern Roles and Class Change Proneness,” in Proceedings of the International Conference on Software Maintenance (ICSM), 2008, pp. 217–226.

[17] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap.

Boston, MA: Springer US, 1993.

[18] J. Fox, Applied regression analysis and generalized linear models. Sage

Publications, 2015.

[19] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical

Learning. Springer series in statistics, 2001, vol. 1.

[20] W. Fu, T. Menzies, and X. Shen, “Tuning for Software Analytics:

Is it really necessary?” Information and Software Technology, vol. 76,

pp. 135–146, 2016.

[21] B. Ghotra, S. McIntosh, and A. E. Hassan, “Revisiting the Impact of

Classiﬁcation Techniques on the Performance of Defect Prediction

Models,” in Proceedings of the International Conference on Software

Engineering (ICSE), 2015, pp. 789–800.

[22] Y. Gil and G. Lalouche, “On the Correlation between Size and

Metric Validity,” Empirical Software Engineering (EMSE), vol. 22,

no. 5, pp. 2585–2611, 2017.

[23] G. Gousios, M. Pinzger, and A. v. Deursen, “An Exploratory Study

of the Pull-based Software Development Model,” in Proceedings of

the International Conference on Software Engineering (ICSE), 2014, pp.

345–355.

[24] L. Guo, Y. Ma, B. Cukic, and H. Singh, “Robust Prediction of Fault-

proneness by Random Forests,” in Proceedings of the International

Symposium on Software Reliability Engineering (ISSRE), 2004, pp.

417–428.

[25] M. A. Hall and L. A. Smith, “Feature Subset Selection: A Correla-

tion Based Filter Approach,” 1997.

[26] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell, “A

Systematic Literature Review on Fault Prediction Performance in

Software Engineering,” Transactions on Software Engineering (TSE),

vol. 38, no. 6, pp. 1276–1304, 2012.

[27] J. A. Hanley and B. J. McNeil, “The meaning and use of the area

under a receiver operating characteristic (ROC) curve.” Radiology,

vol. 143, no. 4, pp. 29–36, 1982.

[28] F. E. Harrell Jr, “Hmisc: Harrell miscellaneous. R pack-

age version 3.12-2,” Software available at URL: http://cran.r-

project.org/web/packages/Hmisc, 2013.

[29] ——, Regression Modeling Strategies : With Applications to Lin-

ear Models, Logistic and Ordinal Regression, and Survival Analysis.

Springer, 2015.

[30] ——, “rms: Regression Modeling Strategies. R package version

5.1-1,” 2017.

[31] A. E. Hassan, “Predicting Faults using the Complexity of Code

Changes,” in Prooceedings of the International Conference on Software

Engineering (ICSE), 2009, pp. 78–88.

[32] H. Hemmati, S. Nadi, O. Baysal, O. Kononenko, W. Wang,

R. Holmes, and M. W. Godfrey, “The MSR Cookbook: Mining a

Decade of Research,” in Proceedings of the International Conference

on Mining Software Repositories (MSR), 2013, pp. 343–352.

[33] I. Herraiz, D. M. German, and A. E. Hassan, “On the Distribu-

tion of Source Code File Sizes,” in Proceedings of the International

Conference on Software Technologies (ICSOFT), 2011, pp. 5–14.

[34] J. Jiarpakdee, C. Tantithamthavorn, and A. E. Hassan, “Online

Appendix for “The Impact of Correlated Metrics on the In-

terpretation of Defect Models”,” https://github.com/SAILResearch/collinearity-pitfalls, 2018.

[35] J. Jiarpakdee, C. Tantithamthavorn, A. Ihara, and K. Matsumoto,

“A Study of Redundant Metrics in Defect Prediction Datasets,”

in Proceedings of the International Symposium on Software Reliability

Engineering Workshops (ISSREW), 2016, pp. 51–52.

[36] J. Jiarpakdee, C. Tantithamthavorn, and C. Treude, “Autospear-

man: Automatically mitigating correlated metrics for interpreting

defect models,” in Proceeding of the International Conference on

Software Maintenance and Evolution (ICSME), 2018, pp. 92–103.

[37] S. Kabinna, W. Shang, C.-P. Bezemer, and A. E. Hassan, “Exam-

ining the Stability of Logging Statements,” in Proceedings of the

International Conference on Software Analysis, Evolution, and Reengi-

neering (SANER), 2016, pp. 326–337.

[38] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha,

and N. Ubayashi, “A Large-Scale Empirical Study of Just-In-Time

Quality Assurance,” Transactions on Software Engineering (TSE),

vol. 39, no. 6, pp. 757–773, 2013.

[39] J. Keung, E. Kocaguneli, and T. Menzies, “Finding Conclusion

Stability for Selecting the Best Effort Predictor in Software Effort

Estimation,” Automated Software Engineering, vol. 20, no. 4, pp. 543–

567, 2013.

[40] F. Khomh, M. Di Penta, and Y.-G. Gueheneuc, “An Exploratory

Study of the Impact of Code Smells on Software Change-

proneness,” in Proceedings of the Working Conference on Reverse

Engineering (WCRE), 2009, pp. 75–84.

[41] F. Khomh, M. Di Penta, Y.-G. Guéhéneuc, and G. Antoniol, “An

Exploratory Study of the Impact of Antipatterns on Class Change-

and Fault-proneness,” Empirical Software Engineering (EMSE),

vol. 17, no. 3, pp. 243–275, 2012.

[42] S. Kim, T. Zimmermann, E. J. Whitehead Jr, and A. Zeller, “Predict-

ing Faults from Cached History,” in Proceedings of the International

Conference on Software Engineering (ICSE), 2007, pp. 489–498.

[43] S. Kirbas, B. Caglayan, T. Hall, S. Counsell, D. Bowes, A. Sen, and

A. Bener, “The Relationship between Evolutionary Coupling and

Defects in Large Industrial Software,” Journal of Software: Evolution

and Process, vol. 29, no. 4, 2017.

[44] A. G. Koru and H. Liu, “An Investigation of the Effect of Module

Size on Defect Prediction Using Static Measures,” Software Engi-

neering Notes (SEN), vol. 30, pp. 1–5, 2005.

[45] H. C. Kraemer, G. A. Morgan, N. L. Leech, J. A. Gliner, J. J. Vaske,

and R. J. Harmon, “Measures of Clinical Signiﬁcance,” Journal of

the American Academy of Child & Adolescent Psychiatry (JAACAP),

vol. 42, no. 12, pp. 1524–1529, 2003.

[46] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, “Benchmarking

Classiﬁcation Models for Software Defect Prediction: A Proposed

Framework and Novel Findings,” Transactions on Software Engi-

neering (TSE), vol. 34, no. 4, pp. 485–496, 2008.

[47] H. Lu, E. Kocaguneli, and B. Cukic, “Defect Prediction between

Software Versions with Active Learning and Dimensionality Re-

duction,” in Proceedings of the International Symposium on Software

Reliability Engineering (ISSRE), 2014, pp. 312–322.

[48] B. W. Matthews, “Comparison of the predicted and observed sec-

ondary structure of T4 phage lysozyme,” Biochimica et Biophysica

Acta (BBA)-Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.

[49] T. J. McCabe, “A Complexity Measure,” Transactions on Software

Engineering (TSE), no. 4, pp. 308–320, 1976.

[50] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan, “The Impact

of Code Review Coverage and Code Review Participation on

Software Quality,” in Proceedings of the International Conference on

Mining Software Repositories (MSR), 2014, pp. 192–201.

[51] T. Mende, “Replication of Defect Prediction Studies: Problems,

Pitfalls and Recommendations,” in Proceedings of the International

Conference on Predictive Models in Software Engineering (PROMISE),

2010, pp. 1–10.

[52] T. Mende and R. Koschke, “Revisiting the Evaluation of Defect

Prediction Models,” Proceedings of the International Conference on

Predictive Models in Software Engineering (PROMISE), pp. 7–16,

2009.

[53] A. Meneely, L. Williams, W. Snipes, and J. Osborne, “Predicting

Failures with Developer Networks and Social Network Analysis,”

in Proceedings of the International Symposium on Foundations of

Software Engineering (FSE), 2008, pp. 13–23.

[54] T. Menzies, J. Greenwald, and A. Frank, “Data Mining Static Code

Attributes to Learn Defect Predictors,” Transactions on Software

Engineering (TSE), vol. 33, no. 1, pp. 2–13, 2007.

[55] A. T. Misirli, E. Shihab, and Y. Kamei, “Studying High Impact Fix-

Inducing Changes,” Empirical Software Engineering (EMSE), vol. 21,

no. 2, pp. 605–641, 2016.

[56] R. Morales, S. McIntosh, and F. Khomh, “Do Code Review Prac-

tices Impact Design Quality? : A Case Study of the Qt, VTK,

and ITK Projects,” in Proceedings of the International Conference on

Software Analysis, Evolution and Reengineering (SANER), 2015, pp.

171–180.

[57] N. Nagappan and T. Ball, “Use of Relative Code Churn Measures

to Predict System Defect Density,” Proceedings of the International

Conference on Software Engineering (ICSE), pp. 284–292, 2005.

[58] N. Nagappan, T. Ball, and A. Zeller, “Mining Metrics to Predict

Component Failures,” in Proceedings of the International Conference

on Software Engineering (ICSE), 2006, pp. 452–461.

[59] N. Nagappan, A. Zeller, T. Zimmermann, K. Herzig, and B. Mur-

phy, “Change Bursts as Defect Predictors,” in Proceedings of the

International Symposium on Software Reliability Engineering (ISSRE),

2010, pp. 309–318.

[60] J. Nam, W. Fu, S. Kim, T. Menzies, and L. Tan, “Heterogeneous

Defect Prediction,” Transactions on Software Engineering (TSE), p. In

Press, 2017.

[61] F. Rahman and P. Devanbu, “Ownership, experience and defects: a

ﬁne-grained study of authorship,” in Proceedings of the International


Conference on Software Engineering (ICSE), 2011, pp. 491–500.

[62] ——, “How, and Why, Process Metrics are Better,” in Proceedings

of the International Conference on Software Engineering (ICSE), 2013,

pp. 432–441.

[63] F. Rahman, D. Posnett, I. Herraiz, and P. Devanbu, “Sample size

vs. bias in defect prediction,” in Proceedings of the Joint Meeting of

the European Software Engineering Conference and the Symposium on

the Foundations of Software Engineering (ESEC/FSE), 2013, pp. 147–

157.

[64] G. K. Rajbahadur, S. Wang, Y. Kamei, and A. E. Hassan, “The

Impact of Using Regression Models to Build Defect Classiﬁers,”

in Proceedings of the International Conference on Mining Software

Repositories (MSR), 2017, pp. 135–145.

[65] B. Ray, D. Posnett, V. Filkov, and P. Devanbu, “A Large Scale Study

of Programming Languages and Code Quality in Github,” in

Proceedings of the International Symposium on Foundations of Software

Engineering (FSE), 2014, pp. 155–165.

[66] M. Shepperd, D. Bowes, and T. Hall, “Researcher Bias: The Use of

Machine Learning in Software Defect Prediction,” Transactions on

Software Engineering (TSE), vol. 40, no. 6, pp. 603–616, 2014.

[67] E. Shihab, “An Exploration of Challenges Limiting Pragmatic Soft-

ware Defect Prediction,” Ph.D. dissertation, Queen’s University,

2012.

[68] E. Shihab, C. Bird, and T. Zimmermann, “The Effect of Branch-

ing Strategies on Software Quality,” in Proceedings of the Interna-

tional Symposium on Empirical Software Engineering and Measurement

(ESEM), 2012, pp. 301–310.

[69] E. Shihab, Z. M. Jiang, W. M. Ibrahim, B. Adams, and A. E.

Hassan, “Understanding the Impact of Code and Process Metrics

on Post-release Defects: A Case Study on the Eclipse Project,”

in Proceedings of the International Symposium on Empirical Software

Engineering and Measurement (ESEM), 2010, pp. 4–10.

[70] J. Shimagaki, Y. Kamei, S. McIntosh, A. E. Hassan, and

N. Ubayashi, “A Study of the Quality-Impacting Practices of Mod-

ern Code Review at Sony Mobile,” in Proceedings of the International

Conference on Software Engineering (ICSE), 2016, pp. 212–221.

[71] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, “Evaluat-

ing Complexity, Code Churn, and Developer Activity Metrics as

Indicators of Software Vulnerabilities,” Transactions on Software

Engineering (TSE), vol. 37, no. 6, pp. 772–787, 2011.

[72] C. Tantithamthavorn, “Towards a Better Understanding of the

Impact of Experimental Components on Defect Prediction Mod-

elling,” in Companion Proceeding of the International Conference on

Software Engineering (ICSE), 2016, pp. 867–870.

[73] ——, “ScottKnottESD : The Scott-Knott Effect Size Difference

(ESD) Test. R package version 2.0,” Software available at URL:

https://cran.r-project.org/web/packages/ScottKnottESD, 2017.

[74] C. Tantithamthavorn and A. E. Hassan, “An Experience Report

on Defect Modelling in Practice: Pitfalls and Challenges,” in In

Proceedings of the International Conference on Software Engineering:

Software Engineering in Practice Track (ICSE-SEIP), 2018, pp. 286–

295.

[75] C. Tantithamthavorn, A. E. Hassan, and K. Matsumoto, “The

Impact of Class Rebalancing Techniques on The Performance

and Interpretation of Defect Prediction Models,” Transactions on

Software Engineering (TSE), 2018.

[76] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. Mat-

sumoto, “Automated Parameter Optimization of Classiﬁcation

Techniques for Defect Prediction Models,” in Proceedings of the

International Conference on Software Engineering (ICSE), 2016, pp.

321–332.

[77] ——, “Comments on “Researcher Bias: The Use of Machine

Learning in Software Defect Prediction”,” Transactions on Software

Engineering (TSE), vol. 42, no. 11, pp. 1092–1094, 2016.

[78] ——, “An Empirical Comparison of Model Validation Techniques

for Defect Prediction Models,” Transactions on Software Engineering

(TSE), vol. 43, no. 1, pp. 1–18, 2017.

[79] ——, “The Impact of Automated Parameter Optimization on De-

fect Prediction Models,” Transactions on Software Engineering (TSE),

p. In Press, 2018.

[80] R Core Team and contributors worldwide, “stats: The R Stats

Package. R Package. Version 3.4.0,” 2017.

[81] P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, “Revis-

iting Code Ownership and its Relationship with Software Quality

in the Scope of Modern Code Review,” in Proceedings of the

International Conference on Software Engineering (ICSE), 2016, pp.

1039–1050.

[82] ——, “Review Participation in Modern Code Review,” Empirical

Software Engineering (EMSE), vol. 22, no. 2, pp. 768–817, 2017.

[83] Y. Tian, M. Nagappan, D. Lo, and A. E. Hassan, “What Are

the Characteristics of High-Rated Apps? A Case Study on Free

Android Applications,” in Proceedings of the International Conference

on Software Maintenance and Evolution (ICSME), 2015, pp. 301–310.

[84] S. Yathish, J. Jiarpakdee, P. Thongtanunam, and C. Tantithamtha-

vorn, “Mining Software Defects: Should We Consider Affected Re-

leases?” in In Proceedings of the International Conference on Software

Engineering (ICSE), 2019, p. To Appear.

[85] F. Zhang, A. E. Hassan, S. McIntosh, and Y. Zou, “The Use of Sum-

mation to Aggregate Software Metrics Hinders the Performance

of Defect Prediction Models,” Transactions on Software Engineering

(TSE), vol. 43, no. 5, pp. 476–491, 2017.

[86] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy,

“Cross-project Defect Prediction,” in Proceedings of the Joint Meeting

of the European Software Engineering Conference and the Symposium

on the Foundations of Software Engineering (ESEC/FSE), 2009, pp.

91–100.

[87] T. Zimmermann, R. Premraj, and A. Zeller, “Predicting Defects

for Eclipse,” in Proceedings of the International Workshop on Predictor

Models in Software Engineering (PROMISE), 2007, pp. 9–19.

[88] T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl, “Mining

Version Histories to Guide Software Changes,” Transactions on

Software Engineering (TSE), vol. 31, no. 6, pp. 429–445, 2005.

Jirayus Jiarpakdee received the B.E. degree

from Kasetsart University, Thailand, and the

M.E. degree from NAIST, Japan. He is cur-

rently a Ph.D. candidate at Monash University,

Australia. His research interests include empir-

ical software engineering and mining software

repositories (MSR). The goal of his Ph.D. is to

apply the knowledge of statistical modelling, ex-

perimental design, and software engineering in

order to tackle experimental issues that have an

impact on the interpretation of defect prediction

models.

Chakkrit Tantithamthavorn is a lecturer at the

Faculty of Information Technology, Monash Uni-

versity, Australia. His work has been published

at several top-tier software engineering venues

(e.g., TSE, ICSE, EMSE). His research interests

include empirical software engineering and min-

ing software repositories (MSR). He received the

B.E. degree from Kasetsart University, Thailand,

the M.E. and Ph.D. degrees from NAIST, Japan.

More about Chakkrit and his work is available

online at http://chakkrit.com.

Ahmed E. Hassan is the Canada Research

Chair (CRC) in Software Analytics, and the

NSERC/BlackBerry Software Engineering Chair

at the School of Computing at Queen’s Univer-

sity, Canada. He received a PhD in Computer

Science from the University of Waterloo. He

spearheaded the creation of the Mining Software

Repositories (MSR) conference and its research

community. More about Ahmed and his work is

available online at http://sail.cs.queensu.ca/.