Regularities in Learning Defect Predictors


Abstract

Collecting large consistent data sets of real world software projects from a single source is problematic. In this study, we show that bug reports need not necessarily come from the local projects in order to learn defect prediction models. We demonstrate that data imported from different sites can be suitable for predicting defects at the local site. In addition to our previous work on commercial software, we now explore the open source domain with two versions of an open source anti-virus software (Clam AV) and a subset of bugs in two versions of the GNU gcc compiler, to mark the regularities in learning predictors for a different domain. Our conclusion is that there are surprisingly uniform aspects of software that can be discovered as simple and repeated patterns in local or imported data using just a handful of examples.

Keywords: Defect prediction, Code metrics, Software quality, Cross-company
Burak Turhan, Ayse Bener
Dept. of Computer Engineering
Bogazici University
34342 Bebek, Istanbul, Turkey
Tim Menzies
Lane Dept. of CS&EE
West Virginia University
Morgantown, WV, USA
Collecting large consistent data sets for real world software projects is problematic. Therefore, we explore how little data is required before predictor performance plateaus; i.e., further data does not improve the performance score. In this case study, we explore three embedded controller software projects, two versions of an open source anti-virus software (Clam AV) and a subset of bugs in two versions of the GNU gcc compiler, to mark the regularities in learning predictors for different domains. We show that only a small number of bug reports, around 30, is required to learn stable defect predictors. Further, we show that these bug reports need not necessarily come from the local projects. We demonstrate that data imported from different sites can be suitable for learning defect predictors for the local site. Our conclusion is that software construction is a surprisingly uniform endeavor with simple and repeated patterns that can be discovered in local or imported data using just a handful of examples.
1. Introduction
It is surprisingly difficult to find relevant data within a single organization to fully specify all the internal parameters inside a complete software process model. For example, after 26 years of trying, researchers from Barry Boehm's team at the University of Southern California collected fewer than 200 sample projects for their COCOMO effort estimation database [18].
There are many reasons for this, not the least being the business sensitivity associated with the data. Software projects are notoriously difficult to control: recall the 1995 report of the Standish group that described a $250 billion American software industry where 31% of projects were canceled and 53% of projects incurred costs exceeding 189% of the original estimate [35]. Understandably, corporations are reluctant to expose their poor performance to public scrutiny.

(The research described in this paper was supported by the Bogazici University research fund under grant number BAP-06HA104 and at West Virginia University under grants with NASA's Software Assurance Research Program. Reference herein to any specific commercial product, process, or service by trademark, manufacturer, or otherwise does not constitute or imply its endorsement by the United States Government.)
Despite this data shortage, remarkably effective predictors for software products have been generated. In previous work we have built:

- Software effort estimators using only very high-level knowledge of the code being developed [21] (typically, just some software process details). Yet these estimators offer predictions that are remarkably close to actual development times [17].
- Software defect predictors using only static code features. Fenton (amongst others) argues persuasively that such features are very poor characterizations of the inner complexities of software modules [7]. Yet these seemingly naive defect predictors out-perform current industrial best practices [19].
The success of such simple models seems highly unlikely. Organizations can work in different domains, have different processes, and define/measure defects and other aspects of their product and process in different ways. Worse, all too often, organizations do not precisely define their processes, products, measurements, etc. Nevertheless, it is true that very simple models suffice for generating approximately correct predictions for (say) software development time [17] and the location of software defects [19].
One candidate explanation for the strange predictability in software development is that, despite all the seemingly random factors influencing software construction, the net result follows very tight statistical patterns.
Other researchers have argued for similar results [1, 14, 26-28] but here we offer new evidence. Specifically, we document the early plateau effect seen when learning defect predictors and show that it also holds in completely novel domains:

- The performance of a data miner improves as the size of the training set grows.
- At some point, the performance plateaus and further training data does not improve that performance.
- Previously, we showed in one domain (NASA aerospace software) that for defect prediction, plateaus occur remarkably early (after just 30 examples).
- In this paper, we show that this early plateau also occurs in Turkish whitegoods control software and two other open source software systems.
Observing an effect in one domain (NASA) might be a coincidence. However, after observing exactly the same effect in two more unrelated domains (Turkish whitegoods & open-source), we can no longer dismiss the effect as quirks in one domain. Rather, we assert that:

- The regularities we observe in software are very regular indeed;
- We can depend on those regularities to generate effective defect predictors using minimal information from the local site.

This second point is a strong endorsement for our approach since, as discussed above, it can be very difficult to access details and extensive project data from real world software projects.
The rest of this paper is structured as follows. We give examples of statistical patterns from other research in Section 2. Then we introduce defect predictors and observed regularities in defect prediction research in Section 3. We discuss the external validity of these regularities in Section 4. Section 5 extends these regularities to the open source domain. Discussions of our observations are given in Section 6 and we conclude our research in Section 7.

Note that the Figure 6 and Figure 8 results of Section 3 have appeared previously [19, 20, 23]. The rest of this paper is new work.
2. Background
Previous research reports much evidence that software products conform tightly to simple and regular statistical models. For example, Veldhuizen shows that library reuse characteristics in three unix-based systems (i.e. Linux, SunOS and MacOS X) can be explained by Zipf's Law; that is, the usage frequencies of the most frequently used library routines are inversely proportional to their frequency ranks [38]. As shown in Figure 1, the distribution is highly regular.
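Zipf's Law here means that the frequency of the routine at rank r is proportional to 1/r, so rank times frequency is roughly constant. A minimal sketch with synthetic data (illustrative only, not the measurements from [38]):

```python
# Illustrative check of an ideal Zipf distribution: frequency is inversely
# proportional to rank, so rank * frequency is constant.
# (Synthetic numbers for illustration; not Veldhuizen's measurements.)

def zipf_frequencies(n_routines, calls_for_top=10000):
    """Frequency of the r-th most used routine under an ideal Zipf law."""
    return [calls_for_top / rank for rank in range(1, n_routines + 1)]

freqs = zipf_frequencies(5)
products = [rank * f for rank, f in enumerate(freqs, start=1)]
# Under an ideal Zipf law, every rank*frequency product equals the constant.
assert all(abs(p - products[0]) < 1e-9 for p in products)
```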
Figure 1. Distribution of reuse frequencies in three unix-based systems. From [38].

                       100*(pred - actual)/actual
subset                  50%        65%        75%
                     percentile percentile percentile
mode=embedded           -9         26         60
project=X               -6         16         46
all                     -4         12         31
year=1975               -3         19         39
mode=semi-detached      -3         10         22
(unlabeled)             -3         11         29
center=5                -3         20         50
mission.planning        -1         25         50
project=gro             -1          9         19
center=2                 0         11         21
year=1980                4         29         58
avionics.monitoring      6         32         56
median                  -3         19         39

Figure 2. 158 effort estimation methods applied to 12 subsets of the COC81 data.

Figure 2 shows a summary of our prior work on effort estimation [17] using the COCOMO features. These features
lack detailed knowledge of the system under development. Rather, they are just two dozen ultra-high-level descriptors of (e.g.) developer experience, platform volatility, software process maturity, etc. In Figure 2, different effort estimation methods are applied to the COC81 data used by Boehm to develop the original COCOMO effort estimation model [2]. Twenty times, ten test instances were selected at random and effort models were built from the remaining projects using 158 different estimation methods (for a description of those methods, see [11]). The resulting predictions were compared to the actual development times using relative error, 100*(pred - actual)/actual. COC81's data can be divided into the 12 (overlapping) subsets shown on the left-hand side of Figure 2. The right-hand columns of Figure 2 show MRE at the median, 65%, and 75% percentiles. The median predictions are within 3% of the actual values. Such a close result would be impossible if software did not conform to very tight statistical regularities.

Figure 3. Distribution of change-prone class percentages in two open source projects. From [14].
Figure 3 shows Koru and Liu's analysis of two large-scale open source projects (K-office and Mozilla). As shown in those figures, these different systems conform to nearly identical Pareto distributions of change-prone classes: 80% of changes occur in 20% of classes [14]. The same 80:20 changes:fault distribution has been observed by Ostrand, Weyuker and Bell in very large scale telecommunication projects from AT&T [1, 26-28]. Furthermore, the defect trends in Eclipse also follow similar patterns, where they are better explained by a Weibull distribution [39].
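Relative-error percentiles of the kind reported in Figure 2 can be reproduced from raw predictions; a minimal sketch with toy numbers (not the COC81 data):

```python
# Sketch: computing Figure 2 style relative-error percentiles from
# predicted and actual efforts. Toy numbers, for illustration only.

def relative_errors(predicted, actual):
    """100*(pred - actual)/actual for each prediction."""
    return [100.0 * (p - a) / a for p, a in zip(predicted, actual)]

def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted list."""
    idx = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[idx]

errs = sorted(relative_errors([95, 110, 130, 210], [100, 100, 120, 200]))
# Summarize at the 50%, 65%, and 75% percentiles, as in Figure 2.
summary = tuple(percentile(errs, f) for f in (0.50, 0.65, 0.75))
```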
3. Regularities in Defect Predictors
Figures 1, 2, 3 and 4 show that statistical regularities simplify predictions of certain kinds of properties. Previously [19], we have exploited those regularities with data miners that learn defect predictors from static code attributes. Those predictors were learned either from projects previously developed in the same environment or from a continually expanding base of the current project's artifacts.

Figure 4. Distribution of faulty modules in Eclipse. From [39].

To do so, tables of examples are formed where one column has a boolean value for "defects detected" and the other columns describe software features such as lines of code; number of unique symbols [8]; or the maximum number of possible execution pathways [16]. Each row in the table holds data from one "module", the smallest unit of functionality. Depending on the language, these may be called "functions", "methods", or "procedures". The data mining task is to find combinations of features that predict for the value in the defects column.
The value of static code features as defect predictors has been widely debated. Some researchers vehemently oppose them [6, 31], while many more endorse their use [12, 19, 24, 25, 29, 34, 36, 37]. Standard verification and validation (V&V) textbooks [30] advise using static code complexity attributes to decide which modules are worthy of manual inspections. The authors are aware of several large government software contractors that won't review software modules unless tools like the McCabe static source code analyzer predict that they exhibit high code complexity measures.
Nevertheless, static code attributes can never be a full characterization of a program module. Fenton offers an insightful example where the same functionality is achieved using different programming language constructs, resulting in different static measurements for that module [7]. Fenton uses this example to argue the uselessness of static code attributes for fault prediction.
Using NASA data, our fault prediction models find defect predictors [19] with a mean probability of detection (pd) and probability of false alarm (pf) of (pd, pf) = (71%, 25%). These values should be compared to baselines in data mining and industrial practice. Raffo (personal communication) found that industrial reviews find pd = TR(35, 50, 65)% of a system's errors (for full Fagan inspections [4]) to pd = TR(13, 21, 30)% for less-structured inspections. Similar values were reported at an IEEE Metrics 2002 panel. That panel declined to endorse claims by Fagan [5] and Schull [33] regarding the efficacy of their inspection or directed inspection methods. Rather, it concluded that manual software reviews can find 60% of defects [32].
That is, contrary to the pessimism of Fenton, our (pd, pf) = (71%, 25%) results are better than currently used industrial methods such as the pd = 60% reported at the 2002 IEEE Metrics panel or the median pd of 21% to 50% reported by Raffo. Better yet, automated defect predictors can be generated with a fraction of the effort of alternative methods, even for very large systems [24]. Other methods such as formal methods or manual code reviews are more labor-intensive: depending on the review method, 8 to 20 lines of code (LOC) per minute can be inspected, and this effort repeats for all members of the review team (typically, four or six people [22]).
In prior work [20, 23], we have used the NASA and SOFTLAB data of Figure 5 to explore learning defect predictors using data miners. To learn defect predictors we use a Naive Bayes data miner, since prior work [19] could not find a better data miner for learning defect predictors. In all our experiments, the data was pre-processed as follows:
- Since the number of features in each data table is not consistent, we restricted our data to only the features shared by all data sets.
- Previously [19], we observed that all the numeric distributions in the Figure 5 data are exponential in nature. It is therefore useful to apply a "log-filter" to all numerics N with ln(N). This spreads out the exponential curves more evenly across the space from the minimum to the maximum values (to avoid numerical errors with ln(0), all numbers under 0.000001 are replaced with ln(0.000001)).
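The log-filter described above can be sketched as follows (a minimal sketch; the function name is ours):

```python
import math

def log_filter(features):
    """Replace each numeric feature N with ln(N), flooring values below
    0.000001 to avoid numerical errors with ln(0), as described above."""
    return [math.log(max(n, 0.000001)) for n in features]

# Example: an exponential-looking feature (e.g. lines of code) is spread
# more evenly across its range after the filter.
loc = [0, 5, 50, 500, 5000]
filtered = log_filter(loc)
# The floor maps 0 to ln(0.000001) instead of failing on ln(0).
assert filtered[0] == math.log(0.000001)
```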
Inspired by the recent systematic review of within- vs. cross-company effort estimation studies by Kitchenham et al. [13], we have done extensive experiments on the Promise data tables to analyze predictor behavior using a) local data (within the same company) and b) imported data (cross-company). For each NASA and SOFTLAB table of Figure 5, we built test sets from 10% of the rows, selected at random. Then we learned defect predictors from:

- the other 90% of the rows of the corresponding table (i.e. local data);
- 90% of the rows of the other tables combined (i.e. imported data).

(TR(a, b, c) is a triangular distribution with min/mode/max of a, b, c.)
source       data    examples     features  %defective
                     (# modules)
NASA         pc1     1,109        22          6.94
NASA         kc1       845        22         15.45
NASA         kc2       522        22         20.49
NASA         cm1       498        22          9.83
NASA         kc3       458        22          9.38
NASA         mw1       403        22          7.69
NASA         mc2        61        22         32.29
SOFTLAB      ar3        63        29         12.70
SOFTLAB      ar4       107        29         18.69
SOFTLAB      ar5        36        29         22.22
OPEN SOURCE  cav90    1184        26          0.40
OPEN SOURCE  cav91    1169        26          0.20
OPEN SOURCE  gcc       116        26        100.0

Figure 5. Datasets used in this study. All the NASA and SOFTLAB data sets are available in the PROMISE repository.
We repeated this procedure 100 times, each time randomizing the order of the rows in each table, in order to control order effects (where the learned theory is unduly affected by the order of the examples). We measured the
performance of each predictor using pd, pf and balance. If {A, B, C, D} are the true negatives, false negatives, false positives, and true positives (respectively) found by a defect predictor, then:

pd = recall = D / (B + D)                                  (1)
pf = C / (A + C)                                           (2)
balance = 1 - sqrt((0 - pf)^2 + (1 - pd)^2) / sqrt(2)      (3)

All these values range from zero to one. Better, larger balances fall closer to the desired zone of no false alarms (pf = 0) and 100% detection (pd = 1). We then used the Mann-Whitney U test [15] to test for statistical significance.
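Equations (1)-(3) can be computed directly from the four confusion-matrix counts; a minimal sketch (the function name is ours):

```python
import math

def pd_pf_balance(A, B, C, D):
    """A, B, C, D = true negatives, false negatives, false positives,
    true positives, as in equations (1)-(3) above."""
    pd = D / (B + D)   # probability of detection (recall), eq. (1)
    pf = C / (A + C)   # probability of false alarm, eq. (2)
    # balance: one minus the normalized Euclidean distance from the
    # ideal point (pf = 0, pd = 1); larger is better, eq. (3).
    bal = 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    return pd, pf, bal

# A perfect predictor sits at the ideal point and scores balance = 1.
assert pd_pf_balance(A=90, B=0, C=0, D=10) == (1.0, 0.0, 1.0)
```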
Our results are visualized in Figure 6 as quartile charts. Quartile charts are generated from sorted sets of results, divided into four subsets of (approximately) the same cardinality. For example, the sorted set

{4, 7, 15, 20, 31, 40, 52, 64, 70, 81, 90}

has a median of 40, with {4, 7, 15, 20, 31} below it and {70, 81, 90} forming the fourth quartile. These quartiles can be drawn as follows: the upper and lower quartiles are marked with black lines; the median is marked with a black dot; and vertical bars are added to mark the 50% percentile value.
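A chart of this kind can be rendered even in plain text; a minimal sketch of our own (not the paper's plotting code), using dashes for the inter-quartile range and a star for the median:

```python
def quartile_chart(values, width=50):
    """Render a sorted result set as a one-line quartile chart: dashes
    span the lower to upper quartile, with a '*' at the median."""
    values = sorted(values)
    n = len(values)
    q1, med, q3 = values[n // 4], values[n // 2], values[(3 * n) // 4]
    lo, hi = values[0], values[-1]

    def scale(v):
        # Map a value onto a character position in [0, width).
        return round((v - lo) / (hi - lo) * (width - 1))

    chart = [' '] * width
    for i in range(scale(q1), scale(q3) + 1):
        chart[i] = '-'
    chart[scale(med)] = '*'
    return ''.join(chart)

print(quartile_chart([4, 7, 15, 20, 31, 40, 52, 64, 70, 81, 90]))
```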
Nasa on local Nasa data.
treatment   min  Q1  median  Q3   max
pd Local      0  60      75  82   100
pf Local      0  24      29  35    73

Nasa on imported Nasa data.
treatment   min  Q1  median  Q3   max
pd Imp.      81  84      94  98    99
pf Imp.      26  60      68  90    94

Softlab on local Softlab data.
treatment   min  Q1  median  Q3   max
pd Local     35  40      88  100  100
pf Local      3   5      29   40   42

Softlab on imported Nasa data.
treatment   min  Q1  median  Q3   max
pd Imp.      88  88      95  100  100
pf Imp.      52  59      65   68   68

Figure 6. Quartile charts for NASA and SOFTLAB data. Numeric results on left; quartile charts on right. "Q1" and "Q3" denote the 25% and 75% percentile points (respectively).

In a finding consistent with our general thesis (that software artifacts conform to very regular statistical patterns), Figure 6 shows the same stable and useful regularity occurring in both the seven NASA data sets and the three SOFTLAB data sets:

- Using imported data dramatically increased the probability of detecting defective modules (for NASA: 75% to 94% median pd; for SOFTLAB: 88% to 95% median pd);
- But imported data also dramatically increased the false alarm rate (for NASA: 29% to 68% median pf; for SOFTLAB: 29% to 65% median pf). Our suspicion is that the reason for the high false alarm rates was the irrelevancies introduced by imported data.
We then designed a set of experiments on the NASA data tables to see the effects of sampling strategies on performance and to determine the lower limit on the number of samples needed to learn defect predictors, i.e. the point where a plateau is observed in the performance measure. In those experiments we applied over-, under- and micro-sampling (see Figure 7) to the data tables and observed that:

- the performance of predictors does not improve with over-sampling;
- under-sampling improves the performance of certain predictors, e.g. decision trees, but not of Naive Bayes [20, 23];
- with micro-sampling, the performance of predictors stabilizes after a small number of defective and defect-free examples, i.e. 50 to 100 samples.
Micro-sampling is a generalization of under-sampling: given N defective modules in a data set, M ∈ {25, 50, 75, ...}, M ≤ N, defective modules were selected at random. Another M non-defective modules are selected, at random. The combined 2M data set is then passed to a 10*10-way cross validation. Formally, under-sampling is micro-sampling where M = N: micro-sampling explores training sets of sizes up to N, while standard under-sampling explores just one data set of size 2N.

Figure 8 shows the results of a micro-sampling study where M ∈ {25, 50, 75, ...} defective modules were selected at random, along with an equal number M of defect-free modules. Note the visual pattern: increasing data does not necessarily improve balance. Mann-Whitney tests were applied to test this visual pattern. Detectors learned from small numbers M of instances do as well as detectors learned from any other number of instances:

- For six data sets, {CM1, KC2, KC3, MC2, MW1, PC1}, micro-sampling at M = 25 did just as well as any larger sample size.
- For one data set, {KC1}, best results were seen at M = {575}. However, in all but one case M = 25 did as well as any larger value.
Figure 7. Micro-sampling.
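The micro-sampling procedure defined in Figure 7 can be sketched as follows (a minimal sketch; the data structures are ours, and the subsequent 10*10-way cross validation is left out):

```python
import random

def micro_sample(defective, non_defective, M, seed=None):
    """Draw M defective and M non-defective modules at random, per the
    micro-sampling definition in Figure 7. M must not exceed the number
    of defective modules N; the balanced 2M set would then be passed to
    a 10*10-way cross validation."""
    assert M <= len(defective)
    rng = random.Random(seed)
    sample = rng.sample(defective, M) + rng.sample(non_defective, M)
    rng.shuffle(sample)
    return sample  # 2M modules, balanced between the two classes

# Example with toy module records: (name, defective?) tuples.
defective = [("mod%d" % i, True) for i in range(40)]
clean = [("mod%d" % i, False) for i in range(400)]
for M in (25, 50, 75):
    if M <= len(defective):
        batch = micro_sample(defective, clean, M, seed=1)
        # Each batch is balanced: half defective, half defect-free.
        assert len(batch) == 2 * M
        assert sum(1 for _, d in batch if d) == M
```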
The last observation suggests that the number of cases that must be reviewed in order to arrive at the performance ceiling of a defect predictor is very small: as low as 50 randomly selected modules (25 defective and 25 non-defective). In Figure 8, for the Nasa tables, we plot the number of training examples (in increments of 25) against the balance performance of the Naive Bayes predictor. It is clear that the performance does not improve with more training examples; indeed, it may deteriorate with more examples.
We now report the results of the same experiment for the SOFTLAB data. Note that SOFTLAB tables ar3, ar4 and ar5 have {63, 107, 36} modules respectively, with a total of 206 modules of which only 36 are defective. Thus, the local results in Figure 6 are achieved using a minimum of (36 + 63) × 0.90 ≈ 90 and a maximum of (107 + 63) × 0.90 = 153 examples. Nevertheless we repeat the micro-sampling experiment for the SOFTLAB data tables, with increments of 5 due to the relatively lower number of defects. In Figure 9, we plot the results for the SOFTLAB data tables. We observe the same pattern as in Figure 8: performance tends to stabilize after a small number of training examples.
Figure 8. Micro-sampling results for NASA data
Figure 9. Micro-sampling results for SOFTLAB data
4. External Validity
The results for the NASA and SOFTLAB data tables suggest that practical defect predictors can be learned using only a handful of examples. In order to allow for generalization, it is appropriate to question the external validity of the above results. The data used for those results come from very different sources:

- The SOFTLAB data were collected from a Turkish white-goods manufacturer (see the datasets {ar3, ar4, ar5} in Figure 5) building controller software for a washing machine, a dishwasher and a refrigerator.
- The NASA software, on the other hand, comprises ground and flight control projects for aerospace applications, each developed by different teams at different locations.
The development practices of these two organizations are very different:

- The SOFTLAB software were built in a profit- and revenue-driven commercial organization, whereas NASA is a cost-driven government entity.
- The SOFTLAB software were developed by very small teams (2-3 people) working in the same physical location, while the NASA software was built by much larger teams spread around the United States.
- The SOFTLAB development was carried out in an ad-hoc, informal way rather than the formal, process-oriented approach used at NASA.
The fact that the same defect detection patterns hold for such radically different kinds of organizations is a strong argument for the external validity of our results. However, an even stronger argument would be that the patterns we first saw at NASA/SOFTLAB are also found in software developed at other sites. The rest of this paper collects evidence for that stronger argument.
cav90 on local cav90 data.
treatment min Q1 median Q3 max
pd Local 0 40 40 60 100
pf Local 22 32 35 37 44
cav90 on imported Nasa data.
treatment min Q1 median Q3 max
pd Imp. 49 49 51 51 53
pf Imp. 35 36 37 37 38
cav91 on local cav91 data.
treatment min Q1 median Q3 max
pd Local 0 67 67 100 100
pf Local 17 25 28 30 39
cav91 on imported Nasa data.
treatment min Q1 median Q3 max
pd Imp. 77 77 77 77 77
pf Imp. 36 38 38 39 40
gcc defects on imported cav91 data.
treatment min Q1 median Q3 max
pd Imp. 53 53 53 53 53
gcc defects on imported Nasa data.
treatment min Q1 median Q3 max
pd Imp. 60 60 60 60 60
Figure 10. Quartile charts for OPENSOURCE data.
5. Validity in Open Source Domain
This section checks for the above patterns in two open source projects:

- two versions of an anti-virus project: Clam AV v0.90 and v0.91;
- a subset of defective modules of the GNU gcc compiler.

In Figure 5, these are denoted as cav90, cav91 and gcc, respectively. Note that these are very different projects, built by different developers with very different purposes. Also note that the development processes for the open source projects are very different to those of the NASA and SOFTLAB projects studied above. Whereas the NASA and SOFTLAB software were developed by centrally-controlled, top-down management teams, cav90, cav91 and gcc were built in a highly distributed manner. Further, unlike our other software, gcc has been under extensive usage and maintenance for over a decade.
The local/imported data experiments are once more applied to the cav90, cav91 and gcc data tables, and the results are visualized in Figure 10. We again used the Nasa tables as imported data, since they provide a large basis with 5000+ samples. Note that the partial gcc data includes only a sample of defects; thus we are not able to make a 'local data' analysis for gcc. Rather, we report the detection rates of predictors built on imported data from Nasa and cav91. These predictors can correctly detect up to a median of 60% of the subset of bugs that we were able to manually match to functional modules.
Recall that cav91 has a defect rate of 0.20%. The probability of detection rates for cav91 are a median of 67% and 77% for local and imported data respectively, which is further evidence of the usefulness of statistical predictors.
At first glance, the patterns in Figure 10 seem a bit different than those in Figure 6. There are still increases in probability of detection and false alarm rates, though not dramatic ones. However, this is not a counter-example to our claim. We explain this behavior with the following observations about our experiments:

- in commercial software analysis, local data corresponds to single projects developed by a relatively small team of people in the same company and with certain business knowledge;
- in commercial software analysis, imported data corresponds to a variety of projects developed by various people in different companies and spans a larger business knowledge;
- in open source software analysis, local data corresponds to single projects developed by a larger team of people at different geographical locations with various backgrounds.
We argue that the above assertions differentiate the meaning of local data for open source projects from that for commercial projects. The nature of open source development brings the definition of local data closer to commercial imported data, since both are developed by people at different sites with different backgrounds. That is why adding commercial imported data does not add as much detection capability as it does for commercial local data. Furthermore, the false alarms are not that high, since there are fewer irrelevancies in local open source data than in commercial imported data, which is the cause of high false alarms: open source local data consist of a single project, while commercial imported data consist of several projects, which introduce irrelevancies.
We also repeat the 10*10-way cross-validation micro-sampling experiment for the cav90 and cav91 data tables, this time with increments of 5 due to the limited defect data (Figure 8 used increments of 25). In Figure 11 we see a similar pattern:

- Balance values tend to converge using only 20-30 samples of defective and non-defective modules.
- Using fewer examples produces unstable results, since there are not enough samples to learn a theory.
- For cav90, using more than 60 training examples affects the stability of the results.

Figure 11. Micro-sampling results for OPENSOURCE data.
The importance of the Figure 11 results is that our results from two commercial domains, concerning the minimum number of examples needed to train a predictor, are once more validated in a third, open-source domain.
6. Discussions
We can summarize our results on the commercial and open-source data as follows:

- Using imported data increases not only the detection capability of predictors, but also their false alarm rates. This effect is less visible in the open source domain, for the reasons discussed above.
- Predictors can be learned using only a handful of data samples. In both the commercial and open-source domains, the performance of predictors converges after a small number of training examples. Providing further training instances may cause variations in predictor performance.
- These results generalize across different cultural and organizational entities, since the same patterns are observed in the NASA, SOFTLAB and OPENSOURCE data tables.
Note that these results are observed in a variety of projects (7 NASA, 3 SOFTLAB, 2 OPENSOURCE), from different domains (commercial and open-source), spanning a wide time interval (from the 1980's to 2007). Therefore, we assert that:

There exist repeated patterns in software that can be discovered and explained by simple models with minimal information, no matter what the underlying, seemingly random and complex processes or phenomena are.
This assertion should be interpreted carefully. We do not mean that everything about software can be controlled through patterns. Rather, we argue that these patterns exist and are easy to discover. It is practical and cost-effective to use them as guidelines in order to understand the behavior of software and to take corrective actions. In this context we propose two directions for ASE research in our conclusions.
7. Conclusion
Automated analysis methods cannot offer a 100% certification guarantee; however, they can usefully augment more expensive methods:

- Defect predictors learned on static code attributes achieve detection rates better than manual inspections.
- Feeding these predictors with third party data (i.e. imported data) improves their detection rates further. However, this comes at the cost of increased false alarm rates.
- These predictors can be learned with as few as 30 samples.
Building such models is easy, fast and effective in guiding manual test effort to the correct locations. More importantly, these automated analysis methods are applicable in different domains. We have previously shown their validity in two commercial domains, and in this paper we observe similar patterns in two projects from the open source domain. We have also shown that although these models are sensitive to the information level of the training data (i.e. local/imported), they are not affected by the organizational differences that generate the data.
Based on our results, we argue that no matter how complicated the underlying processes may seem, software has a statistically predictable nature. Going one step further, we claim that the patterns in software are not limited to individual projects or domains; rather, they generalize across different projects and domains.
Therefore, we suggest two directions for ASE research:

- One direction should explore software analysis using rigorous formalisms that offer ironclad guarantees of the correctness of code (e.g. interactive theorem proving, model checking, or the correctness-preserving transformations discussed by Doug Smith in his ASE'07 keynote address). This approach is required for the analysis of mission-critical software that must always function correctly.
- Another direction should explore automatic methods with a stronger focus on maximizing the effectiveness of the analysis while minimizing the associated cost.
There has been some exploration of the second approach using lightweight formal methods (e.g. [3]) or formal methods that emphasize the usability and ease of use of the tool (e.g. [9, 10]). However, our results also show that there exists an under-explored space of extremely cost-effective automated analysis methods.
References

[1] R. Bell, T. Ostrand, and E. Weyuker. Looking for bugs in all the right places. In ISSTA '06: Proceedings of the 2006 International Symposium on Software Testing and Analysis, July 2006.
[2] B. Boehm. Software Engineering Economics. Prentice Hall, 1981.
[3] S. Easterbrook, R. R. Lutz, R. Covington, J. Kelly, Y. Ampo,
and D. Hamilton. Experiences using lightweight formal
methods for requirements modeling. IEEE Transactions on
Software Engineering, pages 4–14, 1998.
[4] M. Fagan. Design and code inspections to reduce errors in
program development. IBM Systems Journal, 15(3), 1976.
[5] M. Fagan. Advances in software inspections. IEEE Trans.
on Software Engineering, pages 744–751, July 1986.
[6] N. Fenton and N. Ohlsson. Quantitative analysis of faults and failures in a complex software system. IEEE Transactions on Software Engineering, pages 797-814, August 2000.
[7] N. E. Fenton and S. Pfleeger. Software Metrics: A Rigorous
& Practical Approach. International Thompson Press, 1997.
[8] M. Halstead. Elements of Software Science. Elsevier, 1977.
[9] M. Heimdahl and N. Leveson. Completeness and consis-
tency analysis of state-based requirements. IEEE Transac-
tions on Software Engineering, May 1996.
[10] C. Heitmeyer, R. Jeffords, and B. Labaw. Automated
consistency checking of requirements specifications.
ACM Transactions on Software Engineering and Methodology,
5(3):231–261, July 1996.
[11] O. Jalali. Evaluation bias in effort estimation. Master’s the-
sis, Lane Department of Computer Science and Electrical
Engineering, West Virginia University, 2007.
[12] T. M. Khoshgoftaar and N. Seliya. Fault prediction mod-
eling for software quality estimation: Comparing com-
monly used techniques. Empirical Software Engineering,
8(3):255–283, 2003.
[13] B. A. Kitchenham, E. Mendes, and G. H. Travassos. Cross-
vs. within-company cost estimation studies: A systematic
review. IEEE Transactions on Software Engineering, pages
316–329, May 2007.
[14] A. G. Koru and H. Liu. Identifying and characterizing
change-prone classes in two large-scale open-source prod-
ucts. JSS, Jan 2007.
[15] H. B. Mann and D. R. Whitney. On a test of whether one of
two random variables is stochastically larger than the other.
Ann. Math. Statist., 18(1):50–60, 1947.
[16] T. McCabe. A complexity measure. IEEE Transactions on
Software Engineering, 2(4):308–320, Dec. 1976.
[17] T. Menzies, Z. Chen, J. Hihn, and K. Lum. Selecting best
practices for effort estimation. IEEE Transactions on Software
Engineering, November 2006.
[18] T. Menzies, O. Elrawas, B. Boehm, R. Madachy, J. Hihn,
D. Baker, and K. Lum. Accurate estimates without calibration.
In International Conference on Software Process, 2008.
[19] T. Menzies, J. Greenwald, and A. Frank. Data mining
static code attributes to learn defect predictors.
IEEE Transactions on Software Engineering, January 2007.
[20] T. Menzies, Z. Milton, A. Bener, B. Cukic, G. Gay, Y. Jiang,
and B. Turhan. Overcoming ceiling effects in defect predic-
tion. Submitted to IEEE TSE, 2008.
[21] T. Menzies, D. Port, Z. Chen, J. Hihn, and S. Stukes.
Specialization and extrapolation of induced domain models:
Case studies in software effort estimation. In IEEE ASE, 2005.
[22] T. Menzies, D. Raffo, S. Setamanit, Y. Hu, and
S. Tootoonian. Model-based tests of truisms. In Proceedings
of IEEE ASE 2002, 2002.
[23] T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, and
Y. Jiang. Implications of ceiling effects in defect predictors.
In Proceedings of PROMISE 2008 Workshop (ICSE), 2008.
[24] N. Nagappan and T. Ball. Static analysis tools as early
indicators of pre-release defect density. In ICSE 2005,
St. Louis, 2005.
[25] A. Nikora and J. Munson. Developing fault predictors for
evolving software systems. In Ninth International Software
Metrics Symposium (METRICS’03), 2003.
[26] T. Ostrand and E. Weyuker. The distribution of faults in a
large industrial software system. ISSTA ’02: Proceedings of
the 2002 ACM SIGSOFT international symposium on Soft-
ware testing and analysis, Jul 2002.
[27] T. Ostrand, E. Weyuker, and R. Bell. Where the bugs are.
ACM SIGSOFT Software Engineering Notes, 29(4), July 2004.
for the identification of fault-prone files. ISSTA ’07: Pro-
ceedings of the 2007 international symposium on Software
testing and analysis, Jul 2007.
[29] A. Porter and R. Selby. Empirically guided software devel-
opment using metric-based classification trees. IEEE Soft-
ware, pages 46–54, March 1990.
[30] S. Rakitin. Software Verification and Validation for Practi-
tioners and Managers, Second Edition. Artech House, 2001.
[31] M. Shepperd and D. Ince. A critique of three metrics. The
Journal of Systems and Software, 26(3):197–210, September 1994.
[32] F. Shull, V. Basili, B. Boehm, A. Brown, P. Costa,
M. Lindvall, D. Port, I. Rus, R. Tesoriero, and M. Zelkowitz.
What we have learned about fighting defects. In Proceedings
of 8th International Software Metrics Symposium, Ottawa,
Canada, pages 249–258, 2002.
[33] F. Shull, I. Rus, and V. Basili. How perspective-based
reading can improve requirements inspections. IEEE Computer,
33(7):73–79, 2000.
[34] K. Srinivasan and D. Fisher. Machine learning approaches
to estimating software development effort. IEEE Trans. Soft.
Eng., pages 126–137, February 1995.
[35] The Standish Group Report: Chaos, 1995.
[36] W. Tang and T. M. Khoshgoftaar. Noise identification with
the k-means algorithm. In ICTAI, pages 373–378, 2004.
[37] J. Tian and M. Zelkowitz. Complexity measure evaluation
and selection. IEEE Transaction on Software Engineering,
21(8):641–649, Aug. 1995.
[38] T. L. Veldhuizen. Software libraries and their reuse:
Entropy, Kolmogorov complexity, and Zipf's law. arXiv, cs.SE,
Aug 2005.
[39] H. Zhang. On the distribution of software faults. IEEE
Transactions on Software Engineering, 34(2):301–302, 2008.