Autom Softw Eng
DOI 10.1007/s10515-010-0069-5
Defect prediction from static code features:
current results, limitations, new approaches
Tim Menzies · Zach Milton · Burak Turhan ·
Bojan Cukic · Yue Jiang · Ayşe Bener
Received: 17 November 2009 / Accepted: 5 May 2010
© Springer Science+Business Media, LLC 2010
Abstract Building quality software is expensive and software quality assurance
(QA) budgets are limited. Data miners can learn defect predictors from static code
features which can be used to control QA resources; e.g. to focus on the parts of the
code predicted to be more defective.
Recent results show that better data mining technology is not leading to better
defect predictors. We hypothesize that we have reached the limits of the standard
learning goal of maximizing area under the curve (AUC) of the probability of false
alarms and probability of detection “AUC(pd, pf)”; i.e. the area under the curve of a
probability of false alarm versus probability of detection.
Accordingly, we explore changing the standard goal. Learners that maximize
AUC(effort, pd)” find the smallest set of modules that contain the most errors.
This research was supported by NSF grant CCF-0810879 and the Turkish Scientific Research
Council (Tubitak EEEAG 108E014). For an earlier draft, see http://menzies.us/pdf/08bias.pdf.
T. Menzies (✉) · Z. Milton · B. Cukic · Y. Jiang
West Virginia University, Morgantown, USA
e-mail: tim@menzies.us
Z. Milton
e-mail: zmilton@mix.wvu.edu
B. Cukic
e-mail: bojan.cukic@mail.csee.wvu.edu
Y. Jiang
e-mail: yjiang1@mix.wvu.edu
B. Turhan
University of Oulu, Oulu, Finland
e-mail: burak.turhan@oulu.fi
A. Bener
Boğaziçi University, Istanbul, Turkey
e-mail: bener@boun.edu.tr
WHICH is a meta-learner framework that can be quickly customized to different
goals. When customized to AUC(effort, pd), WHICH out-performs all the data min-
ing methods studied here. More importantly, measured in terms of this new goal,
certain widely used learners perform much worse than simple manual methods.
Hence, we advise against the indiscriminate use of learners. Learners must be
chosen and customized to the goal at hand. With the right architecture (e.g. WHICH),
tuning a learner to specific local business goals can be a simple task.
Keywords Defect prediction · Static code features · WHICH
1 Introduction
A repeated result is that static code features such as lines of code per module, number
of symbols in the module, etc., can be used by a data miner to predict which modules
are more likely to contain defects.1 Such defect predictors can be used to allocate the
appropriate verification and validation budget to different code modules.
The current high water mark in this field has been curiously static for several
years. For example, for three years we have been unable to improve on our 2006
results (Menzies et al. 2007b). Other studies report the same ceiling effect: many
methods learn defect predictors whose performance is statistically indistinguishable
from the best results. For example, after a careful study of 19 data miners for learning
defect predictors, seeking to maximize the area under the detection-vs-false-alarm
curve, Lessmann et al. (2008) conclude
. . . the importance of the classification model is less than generally assumed
. . . practitioners are free to choose from a broad set of models when building
defect predictors.
This article argues for a very different conclusion. The results of Lessmann et al. are
certainly correct for the goal of maximizing detection and minimizing false alarm
rates. However, this is not the only possible goal of a defect predictor. WHICH (Mil-
ton 2008) is a meta-learning scheme where domain specific goals can be inserted
into the core of the learner. When those goals are set to one particular business goal
(e.g. “finding the fewest modules that contain the most errors”) then the ceiling effect
disappears:
WHICH significantly out-performs other learning schemes.
More importantly, certain widely used learners perform worse than simple manual
methods.
That is, contrary to the views of Lessmann et al., the selection of a learning method
appropriate to a particular goal is very critical. Learners that appear useful when
1See e.g. Weyuker et al. (2008), Halstead (1977), McCabe (1976), Chapman and Solomon (2002), Na-
gappan and Ball (2005a,2005b), Hall and Munson (2000), Nikora and Munson (2003), Khoshgoftaar
(2001), Tang and Khoshgoftaar (2004), Khoshgoftaar and Seliya (2003), Porter and Selby (1990), Tian
and Zelkowitz (1995), Khoshgoftaar and Allen (2001), Srinivasan and Fisher (1995).
pursuing certain goals, can be demonstrably inferior when pursuing others. We rec-
ommend WHICH as a simple method to create such customizations.
The rest of this paper is structured as follows. Section 2 describes the use of static
code features for learning defect predictors. Section 3 documents the ceiling effect
that has stalled progress in this field. After that, Sects. 4 and 5 discuss a novel method
to break through the ceiling effect.
2 Background
This section motivates the use of data mining for static code features and reviews
recent results. The rest of the paper will discuss limits with this approach, and how to
overcome them.
2.1 Blind spots
Our premise is that building high quality software is expensive. Hence, during de-
velopment, developers skew their limited quality assurance (QA) budgets towards
artifacts they believe most require extra QA. For example, it is common at NASA
to focus QA more on the on-board guidance system than the ground-based database
which stores scientific data collected from a satellite.
This skewing process can introduce an inappropriate bias to quality assurance
(QA). If the QA activities concentrate on project artifacts, say A,B,C,D, then that
leaves blind spots in E,F,G,H,I, . . . . Blind spots can compromise high assurance
software. Leveson remarks that in modern complex systems, unsafe operations often
result from an unstudied interaction between components (Leveson 1995). For exam-
ple, Lutz and Mikulski (2003) found a blind spot in NASA deep-space missions: most
of the mission critical in-flight anomalies resulted from errors in ground software that
fails to correctly collect in-flight data.
To avoid blind spots, one option is to rigorously assess all aspects of all software
modules, however, this is impractical. Software project budgets are finite and QA
effectiveness increases with QA effort. A linear increase in the confidence C that we
have found all faults can take exponentially more effort. For example, to detect one-
in-a-thousand module faults, moving C from 90% to 94% to 98% takes 2301, 2812,
and 3910 black box tests (respectively).2 Lowry et al. (1998) and Menzies and Cukic
(2000) offer numerous other examples where assessment effectiveness is exponential
on effort.
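
As a concrete illustration of footnote 2's N(C, x) formula, the short Python sketch
below (not part of the original study) reproduces the test counts quoted above.

    import math

    def black_box_tests(confidence, fault_rate):
        # N(C, x) = log(1 - C) / log(1 - x): random black-box tests needed to reach
        # confidence C of seeing a fault that a random input triggers with probability x.
        return round(math.log(1 - confidence) / math.log(1 - fault_rate))

    # One-in-a-thousand faults: C = 90%, 94%, 98% needs 2301, 2812, and 3910 tests.
    print([black_box_tests(c, 1e-3) for c in (0.90, 0.94, 0.98)])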
Exponential cost increase quickly exhausts finite QA resources. Hence, blind spots
can’t be avoided and must be managed. Standard practice is to apply the best available
assessment methods on the sections of the program that the best available domain
knowledge declares is the most critical. We endorse this approach. Clearly, the most
2 A randomly selected input to a program will find a fault with probability x. Voas observes (Voas and
Miller 1995) that after N random black-box tests, the chance of the inputs not revealing any fault is
(1 − x)^N. Hence, the chance C of seeing the fault is C = 1 − (1 − x)^N, which can be rearranged to
N(C, x) = log(1 − C)/log(1 − x). For example, N(0.90, 10^−3) = 2301.
critical sections require the best known assessment methods, in hope of minimizing
the risk of safety or mission critical failure occurring post deployment. However,
this focus on certain sections can blind us to defects in other areas which, through
interactions, may cause similarly critical failures. Therefore, the standard practice
should be augmented with a lightweight sampling policy that (a) explores the rest of
the software and (b) raises an alert on parts of the software that appear problematic.
This sampling approach is incomplete by definition. Nevertheless, it is the only option
when resource limits block complete assessment.
2.2 Lightweight sampling
2.2.1 Data mining
One method for building a lightweight sampling policy is data mining over static
code features. For this paper, we define data mining as the process of summarizing
tables of data where rows are examples and columns are the features collected for
each example.3 One special feature is called the class. The Appendix to this paper
describes various kinds of data miners including:
– Naïve Bayes classifiers use statistical combinations of features to predict for class
value. Such classifiers are called “naive” since they assume all the features are
statistically independent. Nevertheless, a repeated empirical result is that, on av-
erage, seemingly naïve Bayes classifiers perform as well as other seemingly more
sophisticated schemes (e.g. see Table 1 in Domingos and Pazzani 1997).
– Rule learners like RIPPER (Cohen 1995a) generate lists of rules. When classifying
a new code module, we take features extracted from that module and iterate over
the rule list. The output classification is the first rule in the list whose condition is
satisfied.
– Decision tree learners like C4.5 (Quinlan 1992b) build one single-parent tree
whose internal nodes test for feature values and whose leaves refer to class ranges.
The output of a decision tree is a branch of satisfied tests leading to a single leaf
classification.
There are many alternatives and extensions to these learners. Much recent work has
explored the value of building forests of decision trees using randomly selected sub-
sets of the features (Breimann 2001; Jiang et al. 2008b). Regardless of the learning
method, the output is the same: combinations of standard features that predict for
different class values.
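
As an illustration of the tabular learning task described above, the following sketch
trains two such learners on a toy feature table. It uses scikit-learn rather than the
tooling of the original experiments, and the feature values are invented for illustration.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    # rows = modules, columns = static code features (here: loc, v(g)), class = defective
    X = np.array([[20, 2], [310, 41], [35, 4], [420, 57], [15, 1], [280, 33]])
    y = np.array([0, 1, 0, 1, 0, 1])  # defective = false/true

    nb = GaussianNB().fit(X, y)                         # statistical feature combination
    dt = DecisionTreeClassifier(max_depth=3).fit(X, y)  # single-parent tree of feature tests

    print(nb.predict([[50, 5]]), dt.predict([[50, 5]]))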
2.2.2 Static code features
Defect predictors can be learned from tables of data containing static code features,
whose class label is defective and whose values are true or false. In those tables:
3 Technically, this is supervised learning in the absence of a background theory. For notes on unsupervised
learning, see papers discussing clustering such as Bradley et al. (1998). For notes on using a background
theory, see (e.g.) papers discussing the learning or tuning of Bayes nets (Fenton and Neil 1999).
Fig. 1 Static code features

  McCabe (m)       v(g)    cyclomatic_complexity
                   iv(G)   design_complexity
                   ev(G)   essential_complexity

  locs             loc     loc_total (one line = one count)
                   loc(other): loc_blank, loc_code_and_comment, loc_comments,
                               loc_executable, number_of_lines
                               (opening to closing brackets)

  Halstead (h)     N1      num_operators
                   N2      num_operands
                   µ1      num_unique_operators
                   µ2      num_unique_operands

  Halstead (H)     N       length:     N = N1 + N2
                   V       volume:     V = N · log2 µ
                   L       level:      L = V*/V where V* = (2 + µ2) · log2(2 + µ2)
                   D       difficulty: D = 1/L
                   I       content:    I = L̂ · V where L̂ = (2/µ1) · (µ2/N2)
                   E       effort:     E = V/L̂
                   B       error_est
                   T       prog_time:  T = E/18 seconds
– Rows describe data from one module. Depending on the language, modules may
be called “functions”, “methods”, “procedures” or “files”.
– Columns describe one of the static code features of Fig. 1. The Appendix of this
paper offers further details on these features.
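
For readers who want to reproduce the derived Halstead measures of Fig. 1 from the
four base counts, a minimal sketch follows. It assumes the standard Halstead
vocabulary µ = µ1 + µ2; the base counts themselves must come from a code parser.

    import math

    def halstead(N1, N2, mu1, mu2):
        """Derived measures of Fig. 1 from num_operators, num_operands,
        num_unique_operators and num_unique_operands."""
        N = N1 + N2                              # length
        mu = mu1 + mu2                           # vocabulary (assumed: standard Halstead)
        V = N * math.log2(mu)                    # volume
        V_star = (2 + mu2) * math.log2(2 + mu2)  # minimal volume
        L = V_star / V                           # level
        D = 1 / L                                # difficulty
        L_hat = (2 / mu1) * (mu2 / N2)           # estimated level
        I = L_hat * V                            # content
        E = V / L_hat                            # effort
        T = E / 18                               # programming time, in seconds
        return dict(N=N, V=V, L=L, D=D, I=I, E=E, T=T)

    print(halstead(N1=60, N2=45, mu1=12, mu2=20))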
These static code features are collected from prior development work. The defective
class summarizes the results of a whole host of QA methods that were applied to that
historical data. If any manual or automatic technique registered a problem with this
module, then it was marked “defective =true”. For these data sets, the data mining
goal is to learn a binary prediction for defective from past projects that can be applied
to future projects.
This paper argues that such defect predictors are useful and describes a novel
method for improving their performance. Just in case we overstate our case, it is
important to note that defect predictors learned from static code features can only
augment, but never replace, standard QA methods. Given a limited budget for QA, the
manager’s task is to decide which set of QA methods M1, M2, . . . that cost C1, C2, . . .
should be applied. Sometimes, domain knowledge is available that can indicate that
certain modules deserve the most costly QA methods. If so, then some subset of the
Fig. 2 Sample of percentage of defects seen in different modules. Note that only a very small percentage
of modules have more than one defect. For more details on these data sets, see Fig. 3

  N defects    Percentage of modules with N defects
               cm1      kc1      kc3      mw1      pc1
  1            10.67     6.50     1.96     5.69     4.15
  2             2.17     3.04     1.53     0.74     1.53
  3             1.19     2.18     2.83              0.45
  4             0.99     0.76              0.25     0.09
  5             0.40     0.33                       0.09
  6             0.20     0.43                       0.09
  7             0.40     0.28                       0.09
  8                      0.24
  9                      0.05                       0.09
  10                     0.05
  11
  12                     0.05
  Totals       16.01    13.90     6.32     6.68     6.58
system may receive more attention by the QA team. We propose defect predictors as
a rapid and cost effective lightweight sampling policy for checking if the rest of the
system deserves additional attention. As argued above, such a sampling method is
essential for generating high-quality systems under the constraint of limited budgets.
2.3 Frequently asked questions
2.3.1 Why binary classifications?
The reader may wonder why we pursue such a simple binary classification scheme
(defective ∈ {true, false}) and not, say, number of defects or severity of defects. In
reply, we say:
– We do not use severity of defects since in large scale data collections, such as those
used below, it is hard to distinguish defect “severity” from defect “priority”. All too
often, we have found that developers will declare a defect “severe” when they are
really only stating a preference on what bugs they wish to fix next. Other authors
have the same reservations:
  – Nikora cautions that “without a widely agreed upon definition of severity, we
    cannot reason about it” (Nikora 2004).
  – Ostrand et al. make a similar conclusion: “(severity) ratings were highly sub-
    jective and also sometimes inaccurate because of political considerations not
    related to the importance of the change to be made. We also learned that they
    could be inaccurate in inconsistent ways” (Ostrand et al. 2004).
– We do not use number of defects as our target variable since, as shown in Fig. 2,
only a vanishingly small percentage of our modules have more than one issue report.
That is, our data has insufficient examples to utilize (say) the one method in the kc1
data set with a dozen defects.
2.3.2 Why not use regression?
Other researchers (e.g. Mockus et al. 2005; Zimmermann and Nagappan 2009) use a
logistic regression model to predict software quality features. Such models have the
general form

    Probability(Y) = e^(c + a1·X1 + a2·X2 + ···) / (1 + e^(c + a1·X1 + a2·X2 + ···))

where the ai are the logistic regression predicted constants and the Xi are the independent
variables used for building the logistic regression model. For example, in the case of
Zimmermann et al.’s work, those variables are measures of code change, complexity,
and pre-release bugs. These are used to predict number of defects.
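
A one-function sketch of the logistic form above (with hypothetical coefficients; not
code from any of the cited studies):

    import math

    def logistic_probability(xs, coefs, c):
        # Probability(Y) = e^z / (1 + e^z), with z = c + a1*X1 + a2*X2 + ...
        z = c + sum(a * x for a, x in zip(coefs, xs))
        return math.exp(z) / (1.0 + math.exp(z))

    print(logistic_probability([3.0, 120.0], coefs=[0.4, 0.01], c=-2.5))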
Another regression variant is the negative binomial regression (NBM) model used
by Ostrand et al. (2004) to predict defects in AT&T software. Let yi equal the number
of faults observed in file i and xi be a vector of characteristics for that file. NBM
assumes that yi given xi has a Poisson distribution with mean λi computed from
λi = γi · e^(β·xi), where γi is drawn from a gamma distribution with mean 1 and variance
σ2 ≥ 0 (Ostrand et al. compute σ2 and β using a maximum likelihood procedure).
Logistic regression and NBM fit one model to the data. When data is multi-modal,
it is useful to fit multiple models. A common method for handling arbitrary distrib-
utions to approximate complex distributions is via a set of piecewise linear models.
Model tree learners, such as Quinlan’s M5’ algorithm (Quinlan 1992a), can learn
such piecewise linear models. M5’ also generates a decision tree describing when to
use which linear model.
We do not use regression for several reasons:
– Regression assumes a continuous target variable and, as discussed above, our target
variable is binary and discrete.
– There is no definitive result showing that regression methods are better/worse than
the data miners used in this study. In one of the more elaborate recent studies,
Lessmann et al. found no statistically significant advantage of logistic regression
over a large range of other algorithms (Lessmann et al. 2008) (the Lessmann et al.
result is discussed, at length, below).
– In previous work, we have assessed various learning methods (including regression
methods and model trees) in terms of their ability to be guided by various business
considerations. Specifically, we sought learners that could tune their conclusions
to user-supplied utility weights about false alarms, probability of detection, etc. Of
the fifteen defect prediction methods used in that study, regression and model trees
were remarkably the worst at being guided in this way (Menzies and Stefano
2003). The last section of this paper discusses a new learner, called WHICH, that
was specially designed to support simple tuning to user-specific criteria.
2.3.3 Why static code features?
Another common question is why just use static code features? Fenton (1994) divides
software metrics into process, product, and personnel and uses these to collect infor-
mation on how the software was built, what was built, and who built it. Static code
measures are just product metrics and, hence, do not reflect process and personnel
details. For this reason, other researchers use more than just static code measures. For
example:
– Reliability engineers use knowledge of how the frequency of faults seen in a run-
ning system changes over time (Musa et al. 1987; Littlewood and Wright 1997).
– Other researchers explore churn; i.e. the rate at which the code base changes (Hall
and Munson 2000).
– Other researchers reason about the development team. For example, Nagappan et
al. comment on how organizational structure affects software quality (Nagappan
and Murphy 2008) while Weyuker et al. document how large team sizes change
defect rates (Weyuker et al. 2008).
When replying to this question, we say that static code features are one of the few
measures we can collect in a consistent manner across many projects. Ideally, data
mining occurs in some CMM level 5 company where processes and data collection
are precisely defined. In that ideal case, there exist extensive data sets collected over
many projects and many years. These data sets are in a consistent format and there
is no ambiguity in the terminology of the data (e.g. no confusion between “severity”
and “priority”).
We do not work in that ideal situation. Since 1998, two of the authors (Menzies and
Cukic) have been research consultants to NASA. Working with NASA’s Independent
Software Verification and Validation Facility (IV&V), we have tried various methods
to add value to the QA methods of that organization. As we have come to learn,
NASA is a very dynamic organization. The NASA enterprise has undergone major
upheavals following the 2003 loss of the Columbia shuttle, then President Bush’s new
vision for interplanetary space exploration in 2004, and now (2010) the cancellation
of that program. As research consultants, we cannot precisely define data collection
in such a dynamic environment. Hence, we do not ask “what are the right features to
collect?”. Instead, we can only ask “what features can we access, right now?” This
question is relevant to NASA as well as any organization where data collection is not
controlled by a centralized authority such as:
– agile software projects;
– out-sourced projects;
– open-sourced projects;
– and organizations that make extensive use of sub-contractors and sub-sub contrac-
tors.
In our experience, the one artifact that can be accessed in a consistent manner across
multiple different projects is the source code (this is particularly true in large projects
staffed by contractors, sub-contractors, and sub-sub contractors). Static code features
can be automatically and cheaply extracted from source code, even for very large
systems (Nagappan and Ball 2005a). By contrast, other methods such as manual code
reviews are labor-intensive. Depending on the review methods, 8 to 20 LOC/minute
can be inspected and this effort repeats for all members of the review team, which
can be as large as four or six (Menzies et al. 2002).
For all the above reasons, many industrial practitioners and researchers (includ-
ing ourselves) use static attributes to guide software quality predictions (see the list
shown in the introduction). Verification and validation (V&V) textbooks (Rakitin
2001) advise using static code complexity attributes to decide which modules are
worthy of manual inspections. At the NASA IV&V facility, we know of several large
government software contractors that will not review software modules unless tools
like McCabe predict that some of them might be fault prone.
2.3.4 What can be learned from static code features?
The previous section argued that, for pragmatic reasons, all we can often collect are
static code measures. This is not to say that if we use those features, then they yield
useful or interesting results. Hence, a very common question we hear about is “what
evidence is there that anything useful can be learned from static code measures?”.
There is a large body of literature arguing that static code features are an inade-
quate characterization of the internals of a function:
– Fenton offers an insightful example where the same functionality is achieved via
different language constructs resulting in different static measurements for that
module (Fenton and Pfleeger 1997). Using this example, Fenton argues against
the use of static code features.
– Shepperd and Ince present empirical evidence that the McCabe static attributes
offer nothing more than uninformative attributes like lines of code. They comment
“for a large class of software it (cyclomatic complexity) is no more than a proxy
for, and in many cases outperformed by, lines of code” (Shepperd and Ince 1994).
– In a similar result, Fenton and Pfleeger note that the main McCabe attributes (cy-
clomatic complexity, or v(g)) are highly correlated with lines of code (Fenton and
Pfleeger 1997).
If static code features were truly useless, then the defect predictors learned from them
would satisfy two predictions:
Prediction 1: They would perform badly (not predict for defects);
Prediction 2: They would have no generality (predictors learned from one data set
would not be insightful on another).
At least in our experiences, these predictions do not hold. This evidence falls into two
groups: field studies and a controlled laboratory study. In the field studies:
Our prediction technology was commercialized in the Predictive tool and sold
across the United States to customers in the telecom, energy, technology, and gov-
ernment markets, including organizations such as Compagnie Financière Alcatel
(Alcatel); Chevron Corporation; LogLogic, Inc.; and Northrop Grumman Corpo-
ration. As an example of the use of Predictive, one company (GB Tech, Inc.) used
it to manage safety critical software for a United States manned strike fighter. This
code had to be tested extensively to ensure safety (the software controlled a lithium
ion battery, which can overcharge and possibly explode). First, a more expensive
tool for structural code coverage was applied. Later, the company ran that tool and
Predictive on the same code. Predictive produced consistent results with the more
expensive tools while being able to faster process a larger code base than the more
expensive tool (Turner 2006).
We took the defect prediction technology of this paper (which was developed at
NASA in the USA) and applied it to a software development company from an-
other country (a Turkish software company). The results were very encouraging:
when inspection teams focused on the modules that trigger our defect predictors,
they found up to 70% of the defects using 40% of the effort (measured in staff
hours). Based on those results, we were subsequently invited by two companies
to build tools to incorporate our defect prediction methods into their routine daily
processes (Tosun et al. 2009).
A subsequent, more detailed, study on the Turkish software compared how much
code needs to be inspected using a random selection process vs selection via our
defect predictors. Using the random testing strategy, 87% of the files would have
to be inspected in order to detect 87% of the defects. However, if the inspection
process was restricted to the 25% of the files that trigger our defect predictors,
then 88% of the defects could be found. That is, the same level of defect detection
(after inspection) can be achieved using (87 − 25)/87 = 71% less effort (Tosun and Bener
2010).
The results of these field studies run counter to Prediction 1. However, they are not
reproducible results. In order to make a claim that other researchers can verify, we
designed a controlled experiment to assess Predictions 1 and 2 in a reproducible man-
ner (Turhan et al. 2009). That experiment was based on the public domain data sets
of Fig. 3. These data sets are quite diverse and are written in different languages (C,
C++, JAVA); written in different countries (United States and Turkey); and written
for different purposes (control and monitoring of white goods, NASA flight systems,
ground-based software).
Before we can show that experiment, we must first digress to define performance
measures for defect prediction. When such a predictor fires then {A, B, C, D} denotes
the true negatives, false negatives, false positives, and true positives (respectively).
From these measures we can compute:

    pd = recall = D/(B + D)
    pf = C/(A + C)

In the above, pd is the probability of detecting a faulty module while pf is the prob-
ability of false alarm. Other performance measures are accuracy = (A + D)/(A + B + C + D)
and precision = D/(C + D). Figure 4 shows an example of the calculation of these measures.
Elsewhere (Menzies et al. 2007a), we show that accuracy and precision are highly
unstable performance indicators for data sets like Fig. 3 where the target concept
occurs with relative infrequency: in Fig. 3, only 1/7th (median value) of the modules
are marked as defective. Therefore, for the rest of this paper, we will not refer to
accuracy or precision.
Having defined performance measures, we can now check Predictions 1 & 2; i.e.
that static code features lead to poor fault predictors and that defect predictors have no gen-
erality between data sets. If D denotes all the data in Fig. 3, and Di denotes one
particular data set Di ∈ D, then we can conduct two kinds of experiments:
Fig. 3 Tables of data, sorted in order of number of examples, from http://promisedata.org/data. The rows
labeled “NASA” come from NASA aerospace projects while the other rows come from a Turkish software
company writing applications for domestic appliances. All this data conforms to the format of Sect. 2.2.2

  Project  Source                            Language  Description                                    # modules  Features  % defective
  pc1      NASA                              C++       Flight software for earth orbiting satellites      1,109        21         6.94
  kc1      NASA                              C++       Storage management for ground data                   845        21        15.45
  kc2      NASA                              C++       Storage management for ground data                   522        21        20.49
  cm1      NASA                              C++       Spacecraft instrument                                498        21         9.83
  kc3      NASA                              JAVA      Storage management for ground data                   458        39         9.38
  mw1      NASA                              C++       A zero gravity experiment related to combustion      403        37         7.69
  ar4      Turkish white goods manufacturer  C         Refrigerator                                         107        30        18.69
  ar3      Turkish white goods manufacturer  C         Dishwasher                                            63        30        12.70
  mc2      NASA                              C++       Video guidance system                                 61        39        32.29
  ar5      Turkish white goods manufacturer  C         Washing machine                                       36        30        22.22
  Total:                                                                                                  4,102
Fig. 4 Performance measures

                         Module found in defect logs?
                         no          yes
  Signal       no        A = 395     B = 67
  detected?    yes       C = 19      D = 39

  pf   = Prob. false alarm = 5%
  pd   = Prob. detected    = 37%
  acc  = accuracy          = 83%
  prec = precision         = 67%
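
The measures above are easy to script; the short sketch below reproduces the Fig. 4
numbers.

    def performance_measures(A, B, C, D):
        # A, B, C, D = true negatives, false negatives, false positives, true positives
        pd = D / (B + D)                  # probability of detection (recall)
        pf = C / (A + C)                  # probability of false alarm
        acc = (A + D) / (A + B + C + D)   # accuracy
        prec = D / (C + D)                # precision
        return pd, pf, acc, prec

    # Fig. 4: A=395, B=67, C=19, D=39 -> pd ~ 37%, pf ~ 5%, acc ~ 83%, prec ~ 67%
    print(performance_measures(395, 67, 19, 39))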
Fig. 5 Results of round-robin and self experiments. From Turhan et al. (2009). All the pd and pf results
are statistically different at the 95% level (according to a Mann-Whitney test)

  Experiment   Notes                               Median pd%   Median pf%
  RR           round-robin                                 94           68
  RR2          round-robin + relevancy filtering           69           27
  SELF         self test                                   75           29
SELF: Self-learning experiments where we train on 90% of Di then test on the re-
maining 10%. Note that such self-learning experiments will let us comment
on Prediction 1.
RR: Round-robin experiments where we test on 10% (randomly selected) of data
set Di after training on the remaining nine data sets D − Di. Note that such
round-robin experiments will let us comment on Prediction 2.
It turns out that the round-robin results are unimpressive due to an irrelevancy effect,
discussed below. Hence, it is also useful to conduct:
RR2: Round-robin experiments where a relevancy filter is used to filter away irrele-
vant parts of the training data.
After repeating experiments RR, SELF, RR2 twenty times for each data set
Di ∈ D, the median results are shown in Fig. 5. At first glance, the round-robin results
of RR seem quite impressive: a 94% probability of detection. Sadly, these high detec-
tion probabilities are associated with an unacceptably high false alarm rate of 68%.
In retrospect, this high false alarm rate might have been anticipated. A median
sized data set from Fig. 3 (e.g. mw1) has around 450 modules. In a round-robin ex-
periment, the median size of the training set is over 3600 modules taken from nine
other projects. In such an experiment, it is highly likely that the defect predictor will
be learned from numerous irrelevant details from other projects.
To counter the problem of irrelevant training data, the second set of round-robin
experiments constructed training sets for Di from the union of the 10 nearest neigh-
bors within D − Di. The RR2 results of Fig. 5 show the beneficial effects of relevancy
filtering: false alarm rates were reduced by 68/27 = 252% with only a much smaller
reduction in pd of 94/69 = 136%.
Returning now to Prediction 1, the SELF and RR2 pd ≥ 69% results are much
larger than those seen in industrial practice:
– A panel at IEEE Metrics 2002 (Shull et al. 2002) concluded that manual software
reviews can find 60% of defects.4
– Raffo found that the defect detection capability of industrial review methods
can vary from pd = TR(35, 50, 65)%5 for full Fagan inspections (Fagan 1976) to
pd = TR(13, 21, 30)% for less-structured inspections (Raffo 2005).
4 That panel supported neither Fagan’s claim (Fagan 1986) that inspections can find 95% of defects before
testing nor Shull’s claim that specialized directed inspection methods can catch 35% more defects than other
methods (Shull et al. 2000).
5 TR(a, b, c) is a triangular distribution with min/mode/max of a, b, c.
That is, contrary to Prediction 1, defect predictors learned from static code features
perform well, relative to standard industrial methods.
Turning now to Prediction 2, note that the RR2 round-robin results (with relevancy
filtering) are close to the SELF results:
– The pd results differ by only 75/69 − 1 ≈ 8%;
– The pf results differ by only 29/27 − 1 ≈ 7%.
That is, contrary to Prediction 2, there is generality in the defect predictions learned
from static code features. Learning from local data is clearly best (SELF’s pd results
are better than RR2’s); however, nearly the same performance results as seen in SELF
can be achieved by applying defect data from one site (e.g. NASA flight systems) to
another (e.g. Turkish white goods software).
2.4 Summary
For all the above reasons, we research defect predictors based on static code features.
Such predictors are:
– Useful: they out-perform standard industrial methods. Also, just from our own
experience, we can report that they have been successfully applied in software
companies in the United States and Turkey.
– Generalizable: as the RR2 results show, the predictions of these models generalize
across data sets taken from different organizations working in different countries.
– Easy to use: they can automatically process thousands of modules in a matter of
seconds. Alternative methods such as manual inspections are much slower (8 to 20
LOC per minute).
– Widely-used: we can trace their use as far back as 1990 (Porter and Selby 1990).
We are also aware of hundreds of publications that explore this method (for a
partial sample, see the list shown in the introduction).
3 Ceiling effects in defect predictors
Despite several years of exploring different learners and data pre-processing methods,
the performance of our learners has not improved. This section documents that ceiling
effect and the rest of this paper explores methods to break through the ceiling effect.
In 2006 (Menzies et al. 2007b), we defined a repeatable defect prediction exper-
iment which, we hoped, others could improve upon. That experiment used public
domain data sets and open source data miners. Surprisingly, a simple naïve Bayes
classifier (with some basic pre-processing of the numerics) out-performed the other
studied methods. For details on naïve Bayes classifiers, see the Appendix.
We made the experiment repeatable in the hope that other researchers could im-
prove or refute our results. So far, to the best of our knowledge, no study using just
static code features has out-performed our 2006 result. Our own experiments (Jiang
et al. 2008b) found little or no improvement from the application of numerous data
mining methods. Figure 6 shows some of those results using (in order, left to right):
aode average one-dependence estimators (Yang et al. 2006); bag bagging (Brieman
1996); bst boosting (Freund and Schapire 1997); IBk instance-based learning (Cover
and Hart 1967); C4.5 C4.5 (Quinlan 1992b); jrip RIPPER (Cohen 1995b); lgi logistic
regression (Breiman et al. 1984); nb naïve Bayes (second from the right); and rf
random forests (Breimann 2001).

Fig. 6 Box plot for AUC(pf, pd) seen with 9 learners when, 100 times, a random 90% selection of the
data is used for training and the remaining data is used for testing. The rectangles show the inter-quartile
range (the 25% to 75% quartile range). The line shows the minimum to maximum range, unless that range
extends beyond 1.5 times the inter-quartile range (in which case dots are used to mark these extreme
outliers). From Jiang et al. (2008b)

These box plots show the area under the curve (AUC) of a pf-vs-pd curve. To generate
such an “AUC(pf, pd)” curve (a sketch follows the list below):
– A learner is executed multiple times on different subsets of data;
– The pd, pf results are collected from each execution;
– The results are sorted on increasing order of pf;
– The results are plotted on a 2-D graph using pf for the x-axis and pd for the y-axis.
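
A minimal sketch of that construction (trapezoidal area over the sorted (pf, pd)
points; the sample points below are invented for illustration):

    def auc_pf_pd(points):
        pts = sorted(points)                        # sort on increasing pf
        area = 0.0
        for (x1, y1), (x2, y2) in zip(pts, pts[1:]):
            area += (x2 - x1) * (y1 + y2) / 2.0     # trapezoid between adjacent points
        return area

    print(auc_pf_pd([(0.05, 0.37), (0.27, 0.69), (0.68, 0.94)]))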
A statistical analysis of the Fig. 6 results showed that only boosting on discretized data
offers a statistically better result than naïve Bayes. However, we cannot recommend
boosting: boosting is orders of magnitude slower than naïve Bayes; and the median
improvement over naïve Bayes is negligible.
Other researchers have also failed to improve on our results. For example, Fig. 7
shows results from a study by Lessmann et al. on statistical differences between 19
learners used for defect prediction (Lessmann et al. 2008). At first glance, our pre-
ferred naïve Bayes method (shown as “NB” on the sixth line of Fig. 7) seems to
perform poorly: it is ranked in the lower third of all 19 methods. However, as with
all statistical analysis, it is important to examine not only central tendencies but also
the variance in the performance measure. The vertical dotted lines in Fig. 7 show
Lessmann et al.’s statistical analysis that divided the results into regions: the performance
of the top 16 methods is statistically insignificantly different from one method to another
(including our preferred “NB” method). Lessmann et al. comment:
    Only four competitors are significantly inferior to the overall winner (k-NN,
    K-star, RBF net, VP). The empirical data does not provide sufficient evidence
    to judge whether RndFor (Random Forest) performs significantly better than
    QDA (Quadratic Discriminant Analysis) or any classifier with better average
    rank.

Fig. 7 Range of AUC(pf, pd) ranks seen in 19 learners building defect predictors when, 10 times, a ran-
dom 66% selection of the data is used for training and the remaining data is used for testing. In ranked
data, values from one method are replaced by their rank in the space of all sorted values (so smaller ranks
mean better performance). In this case, the performance value was the area under the false-positive vs
true-positive curve (and larger values are better). Vertical lines divide the results into regions where the
results are statistically similar. For example, all the methods whose top ranks are 4 to 12 are statistically
insignificantly different. From Lessmann et al. (2008)
In other words, Lessmann et al. are reporting a ceiling effect where a large number of
learners exhibit performance results that are indistinguishable.
4 Breaking through the ceiling
This section discusses methods for breaking through the ceiling effects documented
above.
One constant in the results of Figs. 6 and 7 is the performance goal used in those
studies: both sets of results assumed the goal of the learning was to maximize AUC(pf,
pd), i.e. the area under a pf-vs-pd curve. As shown below, if we change the goal of
the learning, then we can break the ceiling effect and find better (and worse) methods
for learning defect predictors from static code measures.
Depending on the business case that funded the data mining study, different goals
may be most appropriate. To see this, consider the typical pf-vs-pd-vs-effort curve of
Fig. 8:
– The pf, pd performance measures were defined above.
– Effort is the percentage of the code base found in the modules predicted to be
faulty (so if all modules are predicted to be faulty, then 100% of the code base must
be processed by some other, slower, more expensive QA method).
Fig. 8 Pf-vs-pd-vs-effort

For the moment, we will just focus on the pf, pd plane of Fig. 8. A perfect detector
has no false alarms and finds all faulty modules; i.e. (pf, pd) = (0, 1). As shown in
Fig. 8, the AUC(pf, pd) curve can bend towards this ideal point but may never reach it:
– Detectors learned from past experience have to make some inductive leaps and,
in doing so, make some mistakes. That is, the only way to achieve high pds is to
accept some level of pfs.
– The only way to avoid false alarms is to decrease the probability that the detector
will trigger. That is, the only way to achieve low pfs is to decrease pd.
Different businesses prefer different regions of the Fig. 8 curve:
– Mission-critical systems are risk averse and may accept very high false alarm rates,
just as long as they catch any life-threatening possibility.
– For less critical software, cost averse managers may accept lower probabilities of
detection, just as long as they do not waste budgets on false alarms.
That is, different businesses have different goals:
– Goal 1: Risk averse developments prefer high pd;
– Goal 2: Cost averse developments accept mid-range pd, provided they get low pf.
Arisholm and Briand (2006) propose yet another goal:
– Goal 3: A budget-conscious team wants to know that if X% of the modules are pre-
dicted to be defective, then those modules contain more than X% of the defects. Other-
wise, they argue, the cost of generating the defect predictor is not worth the effort.
The effort-based evaluation of Goal 3 uses a dimension not explored by the prior
work that reported ceiling effects (Lessmann et al. or our work Jiang et al. 2008a;
Menzies et al. 2007b). Hence, for the rest of this paper, we will assess the impacts of
the Arisholm and Briand goal of maximizing the “AUC(effort, pd)”.
4.1 Experimental set up
4.1.1 Operationalizing AUC(effort, pd)
To operationalize Goal 3 from the Arisholm and Briand evaluation, we assume that:
– After a data miner predicts a module is defective, it is inspected by a team of human
experts.
– This team correctly recognizes some subset % of the truly defective modules (and
% = 1 means that the inspection teams are perfect at their task).
Fig. 9 Effort-vs-PD

– Our goal is to find learners that find the largest number of defective modules in the
smallest number of modules (measured in terms of LOC).
For Arisholm and Briand to approve of a data miner, it must fall in the region
pd > effort. The minimum curve in Fig. 9 shows the lower boundary of this region and
a “good” detector (according to AUC(effort, pd)) must fall above this line. Regarding
the x-axis and y-axis of this figure:
– The x-axis shows all the modules, sorted on size. For example, if we had 100
modules of 10 LOC, 10 modules of 15 LOC, and 1 module of 20 LOC then the
x-axis would be 111 items long with the 10 LOC modules on the left-hand side
and the 20 LOC module on the right-hand side.
– Note that the y-axis of this figure assumes % = 1; i.e. inspection teams correctly
recognize all defective modules. Other values of % are discussed below.
4.1.2 Upper and lower bounds on performance
It is good practice to compare the performance of some technique against theoretical
upper and lower bounds (Cohen 1995a). Automatic data mining methods are interest-
ing if they out-perform manual methods. Therefore, for a lower bound on expected
performance, we compare them against some manual methods proposed by Koru et
al. (2007, 2008, 2009):
– They argue that the relationship between module size and number of defects is not
linear, but logarithmic; i.e. smaller modules are proportionally more troublesome.
– The manualUp and manualDown curves of Fig. 9 show the results expected by
Koru et al. from inspecting modules in increasing/decreasing order of size (respec-
tively).
– With manualUp, all modules are selected and sorted in increasing order of size, so
that curve runs from 0 to 100% of the LOC.
In a result consistent with Koru et al., our experiments show manualUp usually de-
feating manualDown. As shown in Fig. 9, manualUp scores higher on effort-vs-PD
than manualDown. Hence, we define an upper bound on our performance as follows.
Consider an optimal oracle that restricts module inspections to just the modules that
are truly defective. If manualUp is applied to just these modules, then this would
show the upper bound on detector performance. For example, Fig. 9 shows this best
curve where 30% of the LOC are in defective modules.
In our experiments, we ask our learners to make a binary decision (defective,
nonDefective). All the modules identified as defective are then sorted in order of
increasing size (LOC). We then assess their performance by AUC(effort, pd). For ex-
ample, the bad learner in Fig. 9 performs worse than the good learner since the latter
has a larger area under its curve.
In order to provide an upper bound on our AUC scores, we report them as a ratio of the
area under the best curve. All the performance scores mentioned in the rest of this
paper are hence normalized AUC(effort, pd) values ranging from 0% to 100% of the
best curve.
Note that normalization simplifies our assessment criteria. If the effectiveness of
the inspection team is independent of the method used to select the modules that
they inspect, then % is the same across all data miners. By expressing the value of a
defect predictor as a ratio of the area under the best curve, this % cancels out so we
can assess the relative merits of different defect predictors independently of %.
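
A minimal sketch of the normalized AUC(effort, pd) assessment of this section, under
the assumptions stated here and in Sect. 4.1.3 (inspection effort linear in LOC; a flat
line from the detector's termination point out to 100% effort; the best curve inspects
only the truly defective modules, smallest first). The toy data at the end is invented.

    def effort_pd_area(modules, selector):
        """modules: list of (loc, defective) pairs; selector: predicate choosing modules to inspect."""
        total_loc = sum(loc for loc, _ in modules)
        total_defective = sum(1 for _, d in modules if d)
        picked = sorted((m for m in modules if selector(m)), key=lambda m: m[0])
        area = effort = pd = 0.0
        for loc, defective in picked:
            step = loc / total_loc
            new_pd = pd + (1.0 / total_defective if defective else 0.0)
            area += step * (pd + new_pd) / 2.0      # trapezoid under this inspection step
            effort, pd = effort + step, new_pd
        return area + (1.0 - effort) * pd           # flat line out to 100% effort

    def normalized_auc(modules, predicted_defective):
        best = effort_pd_area(modules, lambda m: m[1])          # optimal oracle curve
        return effort_pd_area(modules, predicted_defective) / best

    # toy usage: (loc, truly defective); the detector flags every module over 100 LOC
    mods = [(20, False), (150, True), (40, False), (300, True), (60, True)]
    print(normalized_auc(mods, lambda m: m[0] > 100))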
4.1.3 Details
Three more details will complete our discussion of Fig. 9. Defect detectors usually
do not trigger on all modules. For example, the good curve of Fig. 9 triggers on
B = 43% of the code while only detecting 85% of the defective modules. Similarly,
the bad curve stops after finding 30% of the defective modules in 24% of the code.
To complete the effort-vs-PD curve, we must fill in the gap between the termination
point and X = 100. Later in this article, we will assume that test engineers inspect the
modules referred to by the data miner. Visually, for the good curve, this assumption
would correspond to a flat line running to the right from point C = 85 (i.e. the 85%
of the code triggered by the learner that generated the good curve).
Secondly, the following observation will become significant when we tune a
learner to AUC(effort, pd). Even though Fig. 9 shows effort-vs-PD, it can also in-
directly show false alarms. Consider the plateau in the good curve of Fig. 9, marked
with “D”, at around effort = 10, PD = 45. Such plateaus mark false alarms where the
detectors are selecting modules that have no defects. That is, to maximize the area
under an effort-vs-PD curve, we could assign a heavy penalty against false alarms that
lead to plateaus.
Thirdly, Fig. 9 assumes that inspection effort is linear in the size of a module. We make
this assumption since a previous literature review reported that current inspection
models all report linear effort models (Menzies et al. 2002). Nevertheless, Fig. 9
could be extended to other effort models as follows: stretch the x-axis to handle, say,
non-linear effects such as longer modules that take exponentially more time to read
and understand.
4.2 Initial results
Figure 9’s bad and manualUp curves show our first attempt at applying this new eval-
uation bias. These curves were generated by applying manualUp and the C4.5 tree
learner (Quinlan 1992b) to one of the data sets studied by Lessmann et al. Observe
how the automatic method performed far worse than a manual one. To explain this
poor performance, we comment that data miners grow their models using a search
bias B1 and then assess them using a different evaluation bias B2. For example:
– During training, a decision-tree learner may stop branching if the diversity of the
instances in a leaf of a branch6 falls below some heuristic threshold.
– During testing, the learned decision-tree might be tested on a variety of criteria
such as Lessmann et al.’s AUC measure or our operationalization of AUC(effort,
pd).
It is hardly surprising that C4.5 performed so poorly in Fig. 9. C4.5 was not de-
signed to optimize AUC(effort, pd) (since B1 was so different from B2). Some learning
schemes support biasing the learning according to the overall goal of the system; for
example:
– The cost-sensitive learners discussed by Elkan (2001).
– The ROC ensembles discussed by Fawcett (2001), where the conclusion is a
summation of the conclusions of the ensemble of ROC curves,7 proportionally
weighted, to yield a new learner.
– Our cost curve meta-learning scheme permits an understanding of the performance
of a learner across the entire space of pd-vs-pf trade-offs (Jiang et al. 2008a).
At best, such biasing only indirectly controls the search criteria. If the search crite-
ria are orthogonal to the success criteria of, say, maximizing effort-vs-pd, then cost-
sensitive learning or ensemble combinations or cost curve meta-learning will not be
able to generate a learner that supports that business application. Accordingly, we de-
cided to experiment with a new learner, called WHICH, whose internal search criteria
can be tuned to a range of goals such as AUC(effort, pd).
5 WHICH
The previous section argued for a change in the goals of data miners. WHICH (Milton
2008) is a meta-learning scheme that uses a configurable search bias to grow its mod-
els. This section describes WHICH, how to customize it, and what happened when
we applied those customizations to the data of Fig. 3.
6 For numeric classes, this diversity measure might be the standard deviation of the class feature. For
discrete classes, the diversity measure might be the entropy measure used in C4.5.
7 ROC = receiver operating characteristic curves, such as Lessmann et al.’s plots of PD-vs-PF or PD-vs-
precision.
5.1 Details
WHICH loops over the space of possible feature ranges, evaluating various combina-
tions of features (a sketch of this search appears at the end of Sect. 5.1):
(1) Data from continuous features is discretized into “N” equal-width bins. We tried
various bin sizes and, for this study, the best results were seen using N ∈ {2, 4, 8}
bins of width (max − min)/N.
(2) WHICH maintains a stack of feature combinations, sorted by a customizable
search bias B1. For this study, WHICH used the AUC(effort, pd) criteria, dis-
cussed below.
(3) Initially, WHICH’s “combinations” are just each range of each feature. Subse-
quently, they can grow to two or more features.
(4) Two combinations are picked at random, favoring those combinations that are
ranked highly by B1.
(5) The two combinations are themselves combined, scored, then sorted into the
stacked population of prior combinations.
(6) Go to step 4.
For the reader aware of the artificial intelligence (AI) literature, we remark that
WHICH is a variant of beam search. Rather than use a fixed beam size, WHICH
uses a fuzzy beam where combinations deeper in the stack are exponentially less
likely to be selected. Also, while a standard beam search just adds child states to
the current frontier, WHICH can add entire sibling branches in the search tree (these
sibling branches are represented as other combinations on the stack).
After numerous loops, WHICH returns the highest ranked combination of fea-
tures. During testing, modules that satisfy this combination are predicted as being
“defective”. These modules are sorted on increasing order of size and the statistics of
Fig. 9are collected.
The looping termination criteria were set using our engineering judgment. In stud-
ies with UCI data sets (Blake and Merz 1998), Milton showed that the score of the
top-of-stack condition usually stabilizes in less than 100 picks (Milton 2008) (those
results are shown in Fig. 10). Hence, to be cautious, we looped 200 times.
The following expression guides WHICH’s search:

    B1 = √( (PD²·α + (1 − PF)²·β + (1 − effort)²·γ) / (α + β + γ) )    (1)
The (PD, PF, effort) values are normalized to fall between zero and one. The
(α, β, γ) terms in (1) model the relative utility of PD, PF, effort respectively. These
values range 0 ≤ (α, β, γ) ≤ 1. Hence:
– 0 ≤ B1 ≤ 1;
– larger values of B1 are better;
– increasing (effort, PF, PD) leads to (decreases, decreases, increases) in B1 (re-
spectively).
Initially, we gave PD and effort equal weights and ignored PF; i.e. α = 1, β = 0,
γ = 1. This yielded disappointing results: the performance of the learned detectors
Fig. 10 Top-of-stack scores of the WHICH stack seen after multiple “picks” (selection and scoring of
two conditions picked at random, then combined) for seven data sets from the UCI data mining repository
(Blake and Merz 1998). Usually, the top-of-stack score stabilizes after just a dozen picks. However,
occasionally, modest improvements are seen after a few hundred “picks” (see the plot marked with an “A”)
varied wildly across our cross-validation experiments. An examination of our data
revealed why: there exists a small number of modules with very large LOCs. For
example, in one data set with 126 modules, most have under 100 lines of code but
a few of them are over 1000 lines of code long. The presence of small numbers of
very large modules means that γ = 1 is not recommended. If the very large modules
fall into a particular subset of some cross-validation, then the performance associated
with WHICH’s rule can vary unpredictably from one run to another.
Accordingly, we had to use PF as a surrogate measure for effort. Recall from the
above discussion that we can restrain decreases in PD by assigning a heavy penalty
to the false alarms that lead to plateaus in an effort-vs-PD curve. In the following
experiments, we used a B1 equation that disables effort but places a very large penalty
on PF; i.e.

    α = 1, β = 1000, γ = 0    (2)
We acknowledge that the choices in (1) and (2) are somewhat arbitrary. In defense of these
decisions, we note that in the following results, these decisions lead to a learner that
significantly out-performed standard learning methods.
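
To make the above concrete, here is a minimal sketch of a WHICH-style search (a
simplification written for illustration, not Milton's original code): ranges are scored
with the B1 of (1)-(2) and kept in a list sorted by score; two entries are picked with a
bias towards the top, unioned, scored, and re-inserted; after 200 picks the top
combination becomes the rule "predict defective if the module matches any selected
(feature, bin) range". The toy rows at the end are invented.

    import random

    def discretize(rows, n_bins=2):
        """Equal-width binning of each numeric feature column (WHICH-2 uses 2 bins)."""
        cols = list(zip(*[features for features, _ in rows]))
        bounds = [(min(c), max(c)) for c in cols]
        def bin_of(v, lo, hi):
            return 0 if hi == lo else min(n_bins - 1, int((v - lo) / (hi - lo) * n_bins))
        return [([bin_of(v, lo, hi) for v, (lo, hi) in zip(features, bounds)], label)
                for features, label in rows]

    def b1(combo, rows, alpha=1.0, beta=1000.0, gamma=0.0):
        """The search bias of (1) with the weights of (2): reward pd, heavily penalize pf."""
        tp = fp = tn = fn = 0
        for features, defective in rows:
            fired = any(features[f] == b for f, b in combo)
            if fired and defective:
                tp += 1
            elif fired:
                fp += 1
            elif defective:
                fn += 1
            else:
                tn += 1
        pd = tp / (tp + fn) if tp + fn else 0.0
        pf = fp / (fp + tn) if fp + tn else 0.0
        effort = 0.0  # gamma = 0 disables the effort term (Sect. 5.1)
        return ((alpha * pd ** 2 + beta * (1 - pf) ** 2 + gamma * (1 - effort) ** 2)
                / (alpha + beta + gamma)) ** 0.5

    def which(rows, n_bins=2, picks=200):
        data = discretize(rows, n_bins)
        n_features = len(data[0][0])
        # initial combinations: every single (feature, bin) range
        stack = [frozenset([(f, b)]) for f in range(n_features) for b in range(n_bins)]
        stack.sort(key=lambda c: b1(c, data), reverse=True)
        def pick():  # fuzzy beam: entries deeper in the stack are exponentially less likely
            i = 0
            while i < len(stack) - 1 and random.random() < 0.5:
                i += 1
            return stack[i]
        for _ in range(picks):
            combined = pick() | pick()
            if combined not in stack:
                stack.append(combined)
                stack.sort(key=lambda c: b1(c, data), reverse=True)
        return stack[0]

    # toy usage: rows are ([static code features ...], defective?)
    rows = [([20, 2], False), ([300, 40], True), ([25, 3], False), ([410, 55], True)]
    print(which(rows))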
5.2 Results
Figure 11 shows results from experimental runs with different learners on the data
sets of Fig. 3. Each run randomized the order of the data ten times, then performed a
N = 3-way cross-val study (N = 3 was used since some of our data sets were quite
small). For each part of the cross-val study, pd-vs-effort curves were generated using:
– Manual methods: manualUp and manualDown;
– Standard data miners: the C4.5 decision tree learner, the RIPPER rule
learner, and our previously recommended naïve Bayes method. For more details
on these learners, see the Appendix. Note that these standard miners included meth-
ods that we have advocated in prior publications.
Fig. 11 Results from all data sets of Fig. 3, combined from 10 repeats of a 3-way cross-val, sorted by
median score. Each row shows the 25 to 75% percentile range of the normalized AUC(effort, pd) results (and
the large black dot indicates the median score). Two rows have different ranks (in the left-hand-side column)
if their median AUC scores are different and a Mann-Whitney test (95% confidence) indicates that the two
rows have different wins + ties results. Note that we do not recommend WHICH-4 and WHICH-8 since
these discretization policies performed much worse than WHICH-2
– Three versions of WHICH: this study applied several variants of WHICH.
WHICH-2, WHICH-4, and WHICH-8 discretize numeric ranges into 2, 4, and
8 bins (respectively).
– MICRO-20: MICRO-20 was another variant motivated by the central limit theo-
rem. According to the central limit theorem, the sum of a large enough sample
will be approximately normally distributed (the theorem explains the prevalence
of the normal probability distribution). The sample can be quite small, sometimes
even as low as 20. Accordingly, MICRO-20 was a variant of WHICH-2 that learns
from just 20 + 20 examples of defective and non-defective modules (selected at
random; see the sketch after this list).
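
A minimal sketch of the MICRO-20 sampling policy (illustrative only):

    import random

    def micro20_sample(rows, n=20):
        # rows are (features, defective) pairs; train on n defective plus n clean modules
        defective = [r for r in rows if r[1]]
        clean = [r for r in rows if not r[1]]
        return (random.sample(defective, min(n, len(defective)))
                + random.sample(clean, min(n, len(clean))))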
5.2.1 Overall results
Figure 11 shows the results for all the data sets of Fig. 3, combined:
– Each row shows the normalized AUC(effort, pd) results for a particular learner
over 30 experiments (10 repeats of a three-way cross-val). These results are shown as a 25%
to 75% quartile range (and the large black dot indicates the median score).
– The left-hand-side column of each row shows the results of a Mann-Whitney (95%
confidence) test on each row. Row i has a different rank to row i + 1 if their median
scores are different and the Mann-Whitney test indicates that the two rows have
different wins + ties results. See the appendix for a discussion on why the Mann-
Whitney test was used on these results.
In Fig. 11, WHICH performs relatively and absolutely better than all of the other
methods studied in this paper:
– Relative performance: WHICH-2 and the MICRO-20 learner have the highest
ranks;
– Absolute performance: in our discussion of Fig. 9, the best curve was presented
as the upper bound on performance for any learner tackling AUC(effort, pd).
WHICH’s performance rises close to this upper bound, reaching 70.9% and 80%
(median and 75th percentile) of the best possible performance.
Several other results from Fig. 11 are noteworthy.
– There is no apparent benefit in detailed discretization: WHICH-2 outperforms
WHICH-4 and WHICH-8.
– In a result consistent with our prior publications (Menzies et al. 2007b), our naïve
Bayes classifier out-performs other standard data miners (C4.5 and RIPPER).
– In a result consistent with Koru et al.'s logarithmic defect hypothesis, manualUp
defeats manualDown.
– In Fig. 11, standard data miners are defeated by a manual method (manualUp). The
size of the defeat is very large: the median values drop from 61.1% for manualUp
to 27.6% for C4.5.
This last result is very sobering. In Fig. 11, two widely used methods (C4.5 and
RIPPER) are defeated by manualDown; i.e. by a manual inspection method that Koru
et al. would argue is the worst possible inspection policy. These results call into
question numerous prior defect prediction results, including several papers written
by the authors.
5.2.2 Individual results
Figure 11 combines results from all data sets. Figures 12, 13, 14, and 15 look at each
data set in isolation. The results divide into three patterns:
– In the eight data sets of pattern #1 (shown in Figs. 12 and 13), WHICH-2 both has
the highest median performance and is found to be in the top rank by the
Mann-Whitney statistical analysis.
– In the two data sets of pattern #2 (shown in Fig. 14), WHICH-2 does not score the
highest median performance, but is still found in the top rank.
– In the one data set that shows pattern #3 (shown in Fig. 15), WHICH-2 is soundly
defeated by manual methods (manualUp). However, in this case, the WHICH-2
variant MICRO-20 falls into the second rank.
In summary, when looking at each data set in isolation, WHICH performs very well
in 9/10 of the data sets.
5.3 External validity
We argue that the data sets used in this paper are far broader (and hence, more externally valid) than those seen in prior defect prediction papers. All the data sets explored by
Lessmann et al. (2008) and our prior work (Menzies et al. 2007b) come from NASA
aerospace applications. Here, we use that data, plus three extra data sets from a Turkish company writing software controllers for dishwashers (ar3), washing machines
(ar4) and refrigerators (ar5). The development practices of these two organizations
are very different:
– The Turkish software was built in a profit- and revenue-driven commercial organization, whereas NASA is a cost-driven government entity.
Fig. 12 Four examples of pattern #1: WHICH-2 ranked #1 and has highest median. This figure is reported
in the same format as Fig. 11
– The Turkish software was developed by very small teams (2–3 people) working
in the same physical location, while the NASA software was built by much larger
teams spread around the United States.
– The Turkish development was carried out in an ad-hoc, informal way rather than
the formal, process-oriented approach used at NASA.
Our general conclusion, that WHICH is preferred to other methods when optimizing
for AUC(effort, pd), holds for 6/7 of the NASA data sets and 3/3 of the Turkish sets. The
fact that the same result holds for such radically different organizations is a strong
argument for the external validity of our results.
Fig. 13 Three examples of pattern #1: WHICH-2 ranked #1 and has the highest median. This figure is
reported in the same format as Fig. 11
While the above results, based on ten data sets, are no promise of the efficacy of
WHICH on future data sets, these results are strong evidence that, when a learner is
assessed using AUC(effort, pd), then:
– Of all the learners studied here, WHICH or MICRO-20 is preferred over the other
learners.
– Standard learners such as naïve Bayes, the RIPPER rule learner, and the C4.5 decision tree learner perform much worse than simple manual methods. Hence, we
must strongly deprecate their use when optimizing for AUC(effort, pd).
6 Discussion
The goal of this paper was to comment on Lessmann et al.'s results by offering one
example where knowledge of the evaluation biases alters which learner "wins" a
comparative evaluation study. The current version of WHICH offers that example.
While that goal was reached, there are many open issues that could be fruitfully
explored in future work. Those issues divide into methodological issues and algorithmic issues.
Fig. 14 Two examples of pattern #2: While WHICH-2 did not achieve the highest medians, it was still
ranked #1 compared to eight other methods. This figure is reported in the same format as Fig. 11
Fig. 15 The only example of pattern #3: WHICH-2 loses (badly) but MICRO-20 still ranks high. This
figure is reported in the same format as Fig. 11
6.1 Methodological issues
This paper has commented that the use of a new goal (AUC(effort, pd)) resulted in
improved performance for certain learners tuned to that new goal. It should be noted
that trying different goals for learners at random is probably too expensive. Such an
analysis may never terminate since the space of possible goals is very large.
We do not recommend random goal selection. Quite the reverse, in fact. We would
propose that:
– Before commencing data mining, there must be some domain analysis with the
goal of determining the success criteria that most interest the user population.
(For a sample of such goals, recall the discussion at the start of Sect. 4 regarding
mission-critical and other systems.)
– Once the business goals have been modeled, the data miners should be customized to those goals.
That is, rather than conduct studies with randomized business goals, we argue that it
is better to let business considerations guide the goal exploration. Of course, such an
analysis would be pointless unless the learning tool can be adapted to the business
goals. WHICH was specially designed to enable the rapid customization of the learner
to different goals. For example, while the current version supports AUC(effort, pd),
that can be easily changed to other goals.
6.2 Algorithmic issues
The algorithmic issues concern the inner details of WHICH:
– Are there better values for (α, β, γ) than those of (2)?
– The above study only explored AUC(effort, pd), and this is only one possible goal
of a defect predictor. It could be insightful to explore other goals (e.g. how to skew
a learner to maximize precision; or how to choose an evaluation criterion that leads
to the least variance in the performance of the learner).
– It is possible to restrict the size of the stack to some maximum depth (and discard new
combinations that score less than the bottom-of-stack). For the study
shown here, we used an unrestricted stack size.
– Currently, WHICH sorts new items into the stack using a linear-time search from the
top-of-stack. This is simple to implement via a linked list structure, but a faster
alternative would be a binary search over skip lists (Pugh 1990).
– Other rule learners employ a greedy back-select to prune conditions. To implement
such a search, check if removing any part of the combined condition improves the
score. If not, terminate the back-select. Otherwise, remove that part and recurse on
the shorter condition (a sketch of this procedure appears after this list). Such a back-select is coded in the current version of WHICH,
but the above results were obtained with back-select disabled.
– Currently our default value for MaxLoops is 200. This may be an overly cautious
setting. Given the results of Fig. 10, MaxLoops might be safely initialized to 20
and only increased if no dramatic improvement is seen in the first loop. For most
domains, this would yield a ten-fold speed-up of our current implementation.
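The greedy back-select mentioned above can be sketched in a few lines. This is an illustration of the described procedure, not the WHICH source; the score argument stands in for whatever goal function (e.g. AUC(effort, pd)) the learner is being tuned to.

```python
def back_select(condition, score):
    """Greedy back-select: drop any single test whose removal improves the score,
    then recurse on the shorter condition; stop when no single removal helps.
    'condition' is a list of attribute-range tests; 'score' is any goal function."""
    best = score(condition)
    for i in range(len(condition)):
        shorter = condition[:i] + condition[i + 1:]
        if shorter and score(shorter) > best:
            return back_select(shorter, score)
    return condition
```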
We encourage further experimentation with WHICH. The current version is released
under the GPL 3.0 license and can be downloaded from http://unbox.org/wisp/tags/
which.
7 Conclusion
Given limited QA budgets, it is not possible to apply the most effective QA method to
all parts of a system. The manager’s job is to decide what needs to be tested most, or
tested least. Static code defect predictors are one method for auditing those decisions.
Learned from historical data, these detectors can check which parts of the system
deserve more QA effort. As discussed in Sect. 2.4, defect predictors learned from
static code measures are useful and easy to use. Hence, as shown by the list offered in
the introduction, they are very widely used.
Based on our own results, and those of Lessmann et al., it seems natural to con-
clude that many learning methods have equal effectiveness at learning defect pre-
dictors from static code features. In this paper, we have shown that this ceiling effect
does not necessarily hold when studying performance criteria other than AUC(pf, pd).
When defect predictors are assessed by other criteria such as “read less, see more de-
fects” (i.e. AUC(effort, pd)), then the selection of the appropriate learner becomes
critical:
– A learner tuned to "read less, see more defects" performs best.
– A simple manual analysis out-performs certain standard learners such as NB, C4.5,
and RIPPER. The use of these learners is therefore deprecated for "read less, see more
defects".
Our conclusion is that knowledge of the goal of the learning can and should be used to
select a preferred learner for a particular domain. The WHICH meta-learning frame-
work is one method for quickly customizing a learner to different goals.
We hope that this paper prompts a new cycle of defect prediction research focused
on selecting the best learner(s) for particular business goals. In particular, based on
this paper, we now caution that the following is an open and urgent question: "which
learners perform better than simple manual methods?"
Appendix
Learners used in this study
WHICH, manualUp, and manualDown were described above. The other learners used
in this study come from the WEKA toolkit (Witten and Frank 2005) and are described
below.
Naive Bayes classifiers, or NB, offer a relationship between fragments of evidence E_i, the prior probability of a class hypothesis P(H), and the a posteriori probability of that hypothesis given the evidence P(H|E) (in our case, we have
two hypotheses: H ∈ {defective, nonDefective}). The relationship comes from Bayes'
Theorem:

    P(H|E) = P(H) · ∏_i P(E_i|H) / P(E)

For numeric features, a feature's mean µ and standard deviation σ are used in a Gaussian probability function (Witten and Frank 2005):

    f(x) = (1 / (σ√(2π))) · e^(−(x−µ)² / (2σ²))

Simple naive Bayes classifiers are called
"naive" since they assume independence of each feature. Potentially, this is a significant problem for data sets where the static code measures are highly correlated
(e.g. the number of symbols in a module increases linearly with the module's lines
of code). However, Domingos and Pazzani have shown theoretically that the independence assumption is a problem in a vanishingly small percentage of cases (Domingos and
Pazzani 1997). This result explains (a) the repeated empirical result that, on average,
seemingly naive Bayes classifiers perform as well as other, seemingly more sophisticated schemes (e.g. see Table 1 in Domingos and Pazzani (1997)); and (b) our prior
experiments where naive Bayes did not perform worse than other learners that continually re-sample the data for dependent instances (e.g. decision-tree learners like
C4.5 that recurse on each "split" of the data (Quinlan 1992b)).
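As an illustration of the above, here is a minimal Gaussian naive Bayes scorer. It is a from-scratch sketch rather than the WEKA implementation used in the study; the data layout (rows of numeric features plus a parallel list of class labels) and the small variance floor are our own assumptions.

```python
import math

def train_nb(rows, labels):
    """Collect per-class priors and per-feature (mean, std) estimates."""
    priors, stats = {}, {}
    for cls in set(labels):
        subset = [r for r, l in zip(rows, labels) if l == cls]
        priors[cls] = len(subset) / len(rows)
        stats[cls] = []
        for j in range(len(rows[0])):
            vals = [r[j] for r in subset]
            mu = sum(vals) / len(vals)
            sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1e-6  # variance floor
            stats[cls].append((mu, sd))
    return priors, stats

def gaussian(x, mu, sd):
    """Gaussian probability density f(x) with mean mu and standard deviation sd."""
    return math.exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def classify(row, priors, stats):
    """Return the class with the highest log-posterior, assuming feature independence."""
    best, best_score = None, -math.inf
    for cls, prior in priors.items():
        score = math.log(prior)
        for x, (mu, sd) in zip(row, stats[cls]):
            score += math.log(gaussian(x, mu, sd) + 1e-12)
        if score > best_score:
            best, best_score = cls, score
    return best
```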
This study used J48 (Witten and Frank 2005), a JAVA port of release 8 of Quinlan's C4.5 decision
tree learner (Quinlan 1992b). C4.5 is an iterative dichotomization
algorithm that seeks the attribute-value splitter that most simplifies the data that
falls into the different splits. Each such splitter becomes the root of a tree. Sub-trees are
generated by calling iterative dichotomization recursively on each of the splits. C4.5
is defined for discrete class classification and uses an information-theoretic measure
to describe the diversity of classes within a data set. A leaf generated by C4.5 stores
the most frequent class seen during training. During testing, an example falls into one
of the branches of the decision tree and is assigned the class from the leaf of that
branch. C4.5 tends to produce big "bushy" trees, so the algorithm includes a pruning
step: sub-trees are eliminated if their removal does not greatly change the error rate
of the tree.
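The splitter selection at the heart of iterative dichotomization can be sketched as follows. This is a bare-bones illustration of information-gain splitting on discrete attribute values; J48 additionally handles numeric thresholds, missing values, and pruning.

```python
import math
from collections import Counter

def entropy(labels):
    """Bits needed to describe the class mix in 'labels' (the diversity measure)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, col):
    """Reduction in class entropy after splitting on the values of column 'col'."""
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[col], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in splits.values())
    return entropy(labels) - remainder

def best_splitter(rows, labels):
    """Column index with the largest information gain: the root of the next sub-tree."""
    return max(range(len(rows[0])), key=lambda col: info_gain(rows, labels, col))
```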
JRip is a JAVA port of the RIPPER (Cohen 1995b) rule-covering algorithm. One
rule is learned at each pass for one class. All the examples that satisfy the rule condition are marked as covered and are removed from the data set. The algorithm then
recurses on the remaining data (a sketch of this covering loop appears below). JRip takes a rather unique stance to rule generation
and has operators for pruning, description length, and rule-set optimization. For a full
description of these techniques, see Dietterich (1997). In summary, after building a
rule, RIPPER performs a back-select to see what parts of a condition can be pruned
without degrading the performance of the rule. Similarly, after building a set of rules,
RIPPER tries pruning away some of the rules. The learned rules are built while minimizing their description length: the size of the learned rules, plus a measure
of the rule errors. Finally, after building rules, RIPPER tries replacing rules with straw-man
alternatives (i.e. rules grown very quickly by some naive method).
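The covering loop itself is simple; the sketch below illustrates it with a deliberately crude one-test rule grower standing in for RIPPER's grow-then-prune step. The data layout (a list of (row, label) pairs) is our own assumption.

```python
def learn_one_rule(examples, target):
    """Pick the single (column, value) test that best isolates the 'target' class.
    Real RIPPER grows multi-test rules and then prunes them; one test keeps this short."""
    best, best_score = None, 0.0
    for col in range(len(examples[0][0])):
        for val in {row[col] for row, _ in examples}:
            matched = [label for row, label in examples if row[col] == val]
            score = matched.count(target) / len(matched)
            if matched.count(target) and score > best_score:
                best, best_score = (col, val), score
    return best

def sequential_cover(examples, target):
    """RIPPER-style covering loop: learn a rule, drop the examples it covers, repeat."""
    rules = []
    while any(label == target for _, label in examples):
        rule = learn_one_rule(examples, target)
        if rule is None:
            break
        rules.append(rule)
        col, val = rule
        examples = [(row, label) for row, label in examples if row[col] != val]
    return rules
```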
Details on static code features
This section offers some details on the Halstead and McCabe features.
The Halstead features were derived by Maurice Halstead in 1977. He argued that
modules that are hard to read are more likely to be fault prone (Halstead 1977). Halstead estimates reading complexity by counting the number of operators and operands
in a module: see the h features of Fig. 1. These three raw h Halstead features were
then used to compute H: the eight derived Halstead features, using the equations
shown in Fig. 1 (a sketch of the standard textbook forms of these derivations appears after the list of intermediaries below). In between the raw and derived Halstead features are certain intermediaries:
– µ = µ1 + µ2;
– the minimum operator count: µ1* = 2;
– µ2* is the minimum operand count (number of module parameters).
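For readers without access to Fig. 1, the sketch below computes the usual textbook forms of the derived Halstead features from the raw counts; the exact equations used in the study are those of Fig. 1 and may differ in detail (for example, in the constants used for the bug and time estimates).

```python
import math

def halstead(mu1, mu2, N1, N2):
    """Textbook Halstead derivations from the raw counts:
    mu1/mu2 = unique operators/operands, N1/N2 = total operators/operands."""
    mu2 = max(mu2, 1)                 # guard against modules with no operands
    mu = mu1 + mu2                    # vocabulary
    N = N1 + N2                       # length
    V = N * math.log2(mu)             # volume
    D = (mu1 / 2.0) * (N2 / mu2)      # difficulty
    L = 1.0 / D if D else 0.0         # level
    E = D * V                         # effort
    B = V / 3000.0                    # delivered-bugs estimate (one common form)
    T = E / 18.0                      # time estimate, in seconds (one common form)
    return {"mu": mu, "N": N, "V": V, "D": D, "L": L, "E": E, "B": B, "T": T}
```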
An alternative to the Halstead features is the set of complexity features proposed by
Thomas McCabe in 1976. Unlike Halstead, McCabe argued that the complexity of
pathways between module symbols is more insightful than just a count of the symbols (McCabe 1976). The McCabe measures are defined as follows.
– A module is said to have a flow graph; i.e. a directed graph where each node corresponds to a program statement, and each arc indicates the flow of control from
one statement to another.
– The cyclomatic complexity of a module is v(G) = e − n + 2, where G is a program's
flow graph, e is the number of arcs in the flow graph, and n is the number of nodes
in the flow graph (Fenton and Pfleeger 1995); a small worked example appears after this list.
– The essential complexity, ev(G), of a module is the extent to which a flow graph
can be "reduced" by decomposing all the subflowgraphs of G that are D-structured
primes (also sometimes referred to as "proper one-entry one-exit subflowgraphs"
(Fenton and Pfleeger 1995)). ev(G) = v(G) − m, where m is the number of subflowgraphs of G that are D-structured primes (Fenton and Pfleeger 1995).
– Finally, the design complexity, iv(G), of a module is the cyclomatic complexity of
a module's reduced flow graph.
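A small worked example of the cyclomatic complexity formula: for a single if/else, the flow graph below has five nodes and five arcs, giving v(G) = 5 − 5 + 2 = 2 (one linearly independent decision). The node names are, of course, hypothetical.

```python
def cyclomatic_complexity(arcs, nodes):
    """v(G) = e - n + 2 for a single connected flow graph."""
    return len(arcs) - len(nodes) + 2

# Flow graph of a single if/else: entry -> test -> {then, else} -> exit
nodes = {"entry", "test", "then", "else", "exit"}
arcs = {("entry", "test"), ("test", "then"), ("test", "else"),
        ("then", "exit"), ("else", "exit")}
print(cyclomatic_complexity(arcs, nodes))  # prints 2
```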
Choice of statistical test
For several reasons, this study uses the Mann-Whitney test. Firstly, many authors,
including Demsar (2006), remark that ranked statistical tests such as Mann-Whitney
are not susceptible to errors caused by non-Gaussian performance distributions. Accordingly, we do not use t-tests since they make a Gaussian assumption.
Also, recall that Fig. 9 shows the results of a two-stage process: first, select some
detectors; second, rank them and watch the effort-vs-pd curve grow as we sweep
right across Fig. 9 (this two-stage process is necessary to baseline the learners against
manualUp and manualDown, as well as allowing us to express the results as a ratio
of the best curve). The second stage of this process violates the paired assumptions
of, say, the Wilcoxon test, since different test cases may appear depending on which
modules are predicted to be defective. Accordingly, we require a non-paired test like
Mann-Whitney to compare distributions (rather than pairs of treatments applied to the
same test case).
Further, while much has been written of the inadequacy of other statistical tests
(Demsar 2006; Huang and Ling 2005), to the best of our knowledge, there is no
current negative critique of Mann Whitney as a statistical test for data miners.
Lastly, unlike some other tests (e.g. Wilcoxon), Mann-Whitney does not demand
that the two compared populations are of the same size. Hence, it is possible to run
one test that compares each row of (e.g.) Fig. 12 to every other row in the same
division. This simplifies the presentation of the results (e.g. avoids the need for a
display of, say, the Bonferroni-Dunn test shown in Fig. 2 of Demsar (2006)).
References
Arisholm, E., Briand, L.: Predicting fault-prone components in a java legacy system. In: 5th
ACM-IEEE International Symposium on Empirical Software Engineering (ISESE), Rio de
Janeiro, Brazil, September 21–22 (2006). Available from http://simula.no/research/engineering/
publications/Arisholm.2006.4
Blake, C., Merz, C.: UCI repository of machine learning databases (1998). URL: http://www.ics.uci.edu/
~mlearn/MLRepository.html
Bradley, P.S., Fayyad, U.M., Reina, C.: Scaling clustering algorithms to large databases. In: Knowl-
edge Discovery and Data Mining, pp. 9–15 (1998). Available from http://citeseer.ist.psu.edu/
bradley98scaling.html
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and regression trees. Tech. rep.,
Wadsworth International, Monterey, CA (1984)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
Chapman, M., Solomon, D.: The relationship of cyclomatic complexity, essential complexity
and error rates. In: Proceedings of the NASA Software Assurance Symposium, Coolfont
Resort and Conference Center in Berkley Springs, West Virginia (2002). Available from
http://www.ivv.nasa.gov/business/research/osmasas/conclusion2002/Mike_Chapman_The_
Relationship_of_Cyclomatic_Complexity_Essential_Complexity_and_Error_Rates.ppt
Cohen, P.: Empirical Methods for Artificial Intelligence. MIT Press, Cambridge (1995a)
Cohen, W.: Fast effective rule induction. In: ICML’95, pp. 115–123 (1995b). Available on-line from
http://www.cs.cmu.edu/~wcohen/postscript/ml-95-ripper.ps
Cover, T.M., Hart, P.E.: Nearest neighbour pattern classification. IEEE Trans. Inf. Theory iT-13, 21–27
(1967)
Demsar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30
(2006). Available from http://jmlr.csail.mit.edu/papers/v7/demsar06a.html
Dietterich, T.: Machine learning research: four current directions. AI Mag. 18(4), 97–136 (1997)
Domingos, P., Pazzani, M.J.: On the optimality of the simple Bayesian classifier under zero-one loss.
Mach. Learn. 29(2–3), 103–130 (1997)
Elkan, C.: The foundations of cost-sensitive learning. In: Proceedings of the Seventeenth Inter-
national Joint Conference on Artificial Intelligence (IJCAI’01) (2001). Available from http://
www-cse.ucsd.edu/users/elkan/rescale.pdf
Fagan, M.: Design and code inspections to reduce errors in program development. IBM Syst. J. 15(3),
182–211 (1976)
Fagan, M.: Advances in software inspections. IEEE Trans. Softw. Eng. SE-12, 744–751 (1986)
Fawcett, T.: Using rule sets to maximize roc performance. In: 2001 IEEE International
Conference on Data Mining (ICDM-01) (2001). Available from http://home.comcast.net/
~tom.fawcett/public_html/papers/ICDM-final.pdf
Fenton, N.E., Neil, M.: A critique of software defect prediction models. IEEE Trans. Softw. Eng. 25(5),
675–689 (1999). Available from http://citeseer.nj.nec.com/fenton99critique.html
Fenton, N.E., Pfleeger, S.: Software Metrics: A Rigorous & Practical Approach, 2nd edn. International
Thompson Press (1995)
Fenton, N.E., Pfleeger, S.: Software Metrics: A Rigorous & Practical Approach. International Thompson
Press (1997)
Fenton, N., Pfleeger, S., Glass, R.: Science and substance: a challenge to software engineers. IEEE Softw.,
86–95 (1994)
Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to
boosting. JCSS: J. Comput. Syst. Sci. 55 (1997)
Hall, G., Munson, J.: Software evolution: code delta and code churn. J. Syst. Softw. 111–118 (2000)
Halstead, M.: Elements of Software Science. Elsevier, Amsterdam (1977)
Huang, J., Ling, C.: Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowledge
Data Eng. 17(3), 299–310 (2005)
Jiang, Y., Cukic, B., Ma, Y.: Techniques for evaluating fault prediction models. Empir. Softw. Eng., 561–
595 (2008a)
Jiang, Y., Cukic, B., Menzies, T.: Does transformation help? In: Defects (2008b). Available from
http://menzies.us/pdf/08transform.pdf
Khoshgoftaar, T.: An application of zero-inflated Poisson regression for software fault prediction. In: Pro-
ceedings of the 12th International Symposium on Software Reliability Engineering, Hong Kong, pp.
66–73 (2001)
Khoshgoftaar, T., Allen, E.: Model software quality with classification trees. In: Pham, H. (ed.): Recent
Advances in Reliability and Quality Engineering, pp. 247–270. World Scientific, Singapore (2001)
Khoshgoftaar, T.M., Seliya, N.: Fault prediction modeling for software quality estimation: comparing commonly used techniques. Empir. Softw. Eng. 8(3), 255–283 (2003)
Koru, A., Zhang, D., Liu, H.: Modeling the effect of size on defect proneness for open-source
software. In: Proceedings PROMISE’07 (ICSE) (2007). Available from http://promisedata.org/
pdf/mpls2007KoruZhangLiu.pdf
Koru, A., Emam, K.E., Zhang, D., Liu, H., Mathew, D.: Theory of relative defect proneness: replicated
studies on the functional form of the size-defect relationship. Empir. Softw. Eng., 473–498 (2008)
Koru, A., Zhang, D., El Emam, K., Liu, H.: An investigation into the functional form of the size-defect
relationship for software modules. Softw. Eng. IEEE Trans. 35(2), 293–304 (2009)
Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking classification models for software defect
prediction: a proposed framework and novel findings. IEEE Trans. Softw. Eng. (2008)
Leveson, N.: Safeware System Safety and Computers. Addison-Wesley, Reading (1995)
Littlewood, B., Wright, D.: Some conservative stopping rules for the operational testing of safety-critical
software. IEEE Trans. Softw. Eng. 23(11), 673–683 (1997)
Lowry, M., Boyd, M., Kulkarni, D.: Towards a theory for integration of mathematical verification and
empirical testing. In: Proceedings, ASE’98: Automated Software Engineering, pp. 322–331 (1998)
Lutz, R., Mikulski, C.: Operational anomalies as a cause of safety-critical requirements evolution. J. Syst.
Softw. (2003). Available from http://www.cs.iastate.edu/~rlutz/publications/JSS02.ps
McCabe, T.: A complexity measure. IEEE Trans. Softw. Eng. 2(4), 308–320 (1976)
Menzies, T., Cukic, B.: When to test less. IEEE Softw. 17(5), 107–112 (2000). Available from
http://menzies.us/pdf/00iesoft.pdf
Menzies, T., Stefano, J.S.D.: How good is your blind spot sampling policy? In: 2004 IEEE Conference on
High Assurance Software Engineering (2003). Available from http://menzies.us/pdf/03blind.pdf
Menzies, T., Raffo, D., Setamanit, S., Hu, Y., Tootoonian, S.: Model-based tests of truisms. In: Proceedings
of IEEE ASE 2002 (2002). Available from http://menzies.us/pdf/02truisms.pdf
Menzies, T., Dekhtyar, A., Distefano, J., Greenwald, J.: Problems with precision. IEEE Trans. Softw. Eng.
(2007a). http://menzies.us/pdf/07precision.pdf
Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn defect predictors. IEEE
Trans. Softw. Eng. (2007b). Available from http://menzies.us/pdf/06learnPredict.pdf
Milton, Z.: Which rules. M.S. thesis (2008)
Mockus, A., Zhang, P., Li, P.L.: Predictors of customer perceived software quality. In: ICSE ’05: Proceed-
ings of the 27th International Conference on Software Engineering, pp. 225–233. ACM, New York
(2005)
Musa, J., Iannino, A., Okumoto, K.: Software Reliability: Measurement, Prediction, Application.
McGraw-Hill, New York (1987)
Nagappan, N., Ball, T.: Static analysis tools as early indicators of pre-release defect density. In: ICSE 2005,
St. Louis (2005a)
Nagappan, N., Ball, T.: Static analysis tools as early indicators of pre-release defect density. In: ICSE, pp.
580–586 (2005b)
Nagappan, N., Murphy, B., Basili, V.: The influence of organizational structure on software quality: An
empirical case study. In: ICSE'08 (2008)
Nikora, A.: Personnel communication on the accuracy of severity determinations in NASA databases
(2004)
Nikora, A., Munson, J.: Developing fault predictors for evolving software systems. In: Ninth International
Software Metrics Symposium (METRICS’03) (2003)
Ostrand, T.J., Weyuker, E.J., Bell, R.M.: Where the bugs are. In: ISSTA’04: Proceedings of the 2004 ACM
SIGSOFT International Symposium on Software Testing and Analysis, pp. 86–96. ACM, New York
(2004)
Porter, A., Selby, R.: Empirically guided software development using metric-based classification trees.
IEEE Softw. 46–54 (1990)
Pugh, W.: Skip lists: a probabilistic alternative to balanced trees. Commun. ACM 33(6), 668–676 (1990).
Available from ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf
Quinlan, J.R.: Learning with continuous classes. In: 5th Australian Joint Conference on Artificial Intelli-
gence, pp. 343–348 (1992a). Available from http://citeseer.nj.nec.com/quinlan92learning.html
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufman, San Mateo (1992b). ISBN:
1558602380
Raffo, D.: Personnel communication (2005)
Rakitin, S.: Software Verification and Validation for Practitioners and Managers, 2nd edn. Artech House,
Norwood (2001)
Shepperd, M., Ince, D.: A critique of three metrics. J. Syst. Softw. 26(3), 197–210 (1994)
Shull, F., Rus, I., Basili, V.: How perspective-based reading can improve requirements inspec-
tions. IEEE Comput. 33(7), 73–79 (2000). Available from http://www.cs.umd.edu/projects/
SoftEng/ESEG/papers/82.77.pdf
Shull, F., Boehm, B., Basili, V., Brown, A., Costa, P., Lindvall, M., Port, D., Rus, I., Tesoriero, R.,
Zelkowitz, M.: What we have learned about fighting defects. In: Proceedings of 8th International
Software Metrics Symposium, Ottawa, Canada, pp. 249–258 (2002). Available from http://fc-md.
umd.edu/fcmd/Papers/shull_defects.ps
Srinivasan, K., Fisher, D.: Machine learning approaches to estimating software development effort. IEEE
Trans. Soft. Eng. 126–137 (1995)
Tang, W., Khoshgoftaar, T.M.: Noise identification with the k-means algorithm. In: ICTAI, pp. 373–378
(2004)
Tian, J., Zelkowitz, M.: Complexity measure evaluation and selection. IEEE Trans. Softw. Eng. 21(8),
641–649 (1995)
Tosun, A., Bener, A.: AI-based software defect predictors: Applications and benefits in a case study. In:
IAAI'10 (2010)
Tosun, A., Bener, A., Turhan, B.: Practical considerations of deploying AI in defect prediction: a case study
within the Turkish telecommunication industry. In: PROMISE'09 (2009)
Turhan, B., Menzies, T., Bener, A., Distefano, J.: On the relative value of cross-company and within-
company data for defect prediction. Empir. Softw. Eng. 68(2), 278–290 (2009). Available from
http://menzies.us/pdf/08ccwc.pdf
Turner, J.: A predictive approach to eliminating errors in software code (2006). Available from
http://www.sti.nasa.gov/tto/Spinoff2006/ct_1.html
Voas, J., Miller, K.: Software testability: the new verification. IEEE Softw. 17–28 (1995). Available from
http://www.cigital.com/papers/download/ieeesoftware95.ps
Weyuker, E., Ostrand, T., Bell, R.: Do too many cooks spoil the broth? Using the number of developers to
enhance defect prediction models. Empir. Softw. Eng. (2008)
Witten, I.H., Frank, E.: Data Mining, 2nd edn. Morgan Kaufmann, Los Altos (2005)
Yang, Y., Webb, G.I., Cerquides, J., Korb, K.B., Boughton, J.R., Ting, K.M.: To select or to weigh: a com-
parative study of model selection and model weighing for spode ensembles. In: ECML, pp. 533–544
(2006)
Zimmermann, T., Nagappan, N., Gall, H., Giger, E., Murphy, B.: Cross-project defect prediction. In:
ESEC/FSE'09 (2009)