LEARNING BETTER INSPECTION OPTIMIZATION POLICIES
MARKUS LUMPE, RAJESH VASA
Faculty of ICT, Swinburne University of Technology
Hawthorn, Australia
{mlumpe,rvasa}@swin.edu.au
TIM MENZIES, REBECCA RUSH
CSEE, West Virginia University
Morgantown, West Virginia
tim@menzies.us, rrush4@mix.wvu.edu
BURAK TURHAN
Info Processing Science, University of Oulu
Oulu, Finland
turhanb@computer.org
Received (Day Month Year)
Revised (Day Month Year)
Accepted (Day Month Year)
Recent research has shown the value of social metrics for defect prediction. Yet many repositories lack the information required for a social analysis. So, what other means exist to infer how developers interact around their code? One option is static code metrics that have already demonstrated their usefulness in analyzing change in evolving software systems. But do they also help in defect prediction? To address this question we selected a set of static code metrics to determine what classes are most "active" (i.e., the classes where the developers spend much time interacting with each other's design and implementation decisions) in 33 open-source Java systems that lack details about individual developers. In particular, we assessed the merit of these activity-centric measures in the context of "inspection optimization", a technique that allows for reading the fewest lines of code in order to find the most defects. For the task of inspection optimization these activity measures perform as well as (usually, within 4%) a theoretical upper bound on the performance of any set of measures. As a result, we argue that activity-centric static code metrics are an excellent predictor for defects.
Keywords: Data Mining; Defect Prediction; Static Measures.
1. Introduction
A remarkable recent discovery is that social metrics, which model the sociology of the programmers working on the code, can be an effective predictor for defect injection and removal [1, 2, 3, 4]. For example, Guo et al. [4] demonstrate that the reputation of the developer who reported a defect naturally relates to the odds that this defect will eventually get fixed.
Models that offer predictions on the likely location of defects have traditionally relied on static code metrics [16, 17, 20]. Yet, the premise of social metrics research is that code repositories contain more than just static code measures and that these measures provide a valuable dimension worth investigating. But not all code repositories contain detailed knowledge about how developers interact around a code base. Consider, for example, the Helix project [5], which has studied 40+ multi-year large open-source Java systems under active development. Many developers contributed to those systems, but their code repositories are very weak sources of information regarding a developer's social context. This occurs because the systems in use to support software development do not always capture the social dimension consistently. Additionally, aspects like "reputation" [4] are fuzzy and there is no widely accepted standard to measure these social dimensions.
Nevertheless, social aspects do add a valuable and useful dimension that we should aim to measure objectively. In this paper, we show that it is possible to use static code measures to capture how programmers interact with their code by taking into consideration software evolution, that is, we add the dimension of time. Specifically, it is feasible to find what parts of the code are most "active," that is, are the focus of much of the shared attention of all developers working to organize behavior and functionality at suitable system-specific levels [6, 7, 8]. This opens intriguing options for guiding quality assurance (QA) processes. In particular, we demonstrate that a small set of activity-centric static code metrics [7, 8] can serve as a good predictor for defects in object-oriented software.
Now, defect prediction techniques, in general, rely heavily on the available input [64, 65] and, depending on the amount of processing required, can be characterized as either lightweight or complex quality assurance methods. Early approaches were based on univariate logistic regression [43, 66]. Later models for defect prediction incorporated multiple explanatory variables in the analysis, in recognition of the fact that the actual probability of defects is a function of several factors [36, 37, 38, 39]. Recently, machine learning [15, 40, 41] has become a formidable contender in the area of defect prediction that offers a promising alternative to standard regression-based methods. However, the more complex these approaches become, the more difficult they are to master, especially when the reasons why the underlying model characterizes some modules as more defect-prone than others are hard to grasp. This can hamper the adoption of these techniques in industry.
An ideal approach for defect prediction, we advocate, would be relatively straightforward, based on simple measures, easy to understand, and directly associated with the developer's mental model for effective software development [6]. This is the domain of activity-centric static code metrics [7, 8]. In particular, we present evidence in this paper, based on experiments with 33 open-source Java software systems, that activity-centric metrics perform very close to the theoretical upper bound on defect prediction performance [67]. To compute that upper bound, we adopt the defect density inspection bias proposed by Arisholm & Briand [20], which aims at an optimal inspection policy in order to locate defects in the code base.
Table 1. Defects found with and without specialized verification teams. From [11].

phase                    with verification team   no verification team
requirements             16                       0
high-level design        20                       2
low-level design         31                       8
coding & user testing    24                       34
integration & testing    6                        14
totals                   97                       58
Such a policy seeks to identify the most faults while reading the least amount of code and fits within the developer's workflow, as it yields an inspection strategy that orders classes based on their defect probability.
The rest of this paper is structured as follows. The next section presents the economic case for defect detection (find more bugs, earlier), then introduces the concepts of static code defect predictors and inspection optimization. We then turn to the experiments showing the value of activity measures. We demonstrate that in our selected systems, activity-based defect predictors work within 4% of a theoretical upper bound on predictor performance (this is the basis for our claim that a small set of static metrics can generate an excellent performance within the context of inspection optimization). The validity of our conclusions is then discussed, which will lead into a review of possible future directions for this work.
2. Background
This section reviews the core motivation of this work: the reduction of software construction costs by an earlier detection of defects. We start with a discussion of some of the practical considerations governing defect detection in the software life cycle. Then, we shift our focus to lightweight sampling policies. In particular, we explore one special kind: static code defect predictors. Finally, we explore the use of data miners for the task of inspection optimization.
2.1. Defect Detection Economics
Boehm & Papaccio advise that reworking software is far cheaper earlier in the life cycle than later, by factors of 50 to 200 [9]. This effect has been widely documented by other researchers. A panel at IEEE Metrics 2002 concluded that finding and fixing severe software problems after delivery is often 100 times more expensive than finding and fixing them during the requirements and design phase [10]. Also, Arthur et al. [11] conducted a small controlled experiment where a dozen engineers at NASA's Langley Research Center were split into development and specialized verification teams. The same application was written with and without specialized verification teams. Table 1 shows the results: (a) more issues were found using specialized verification than without; (b) the issues were found much earlier.
Table 2. Cost-to-fix escalation factors. From [12].

                                Phase issue found (f)
Phase issue introduced (i)      Requirements  Design  Code  Test  Integration  Operations
1 Requirements                  1             5       10    50    130          368
2 Design                                      1       2     10    26           74
3 Code                                                1     5     13           37
4 Test                                                      1     3            7
5 Integration                                                     1            3
Delta = mean C[f,i] / C[f-1,i]                5       2     5     2.7          2.8

Note: C[f, i] denotes the cost-to-fix escalation factor relative to fixing an issue in the phase where it was found (f) versus the phase where it was introduced (i). The last row shows the cost-to-fix delta if the issue introduced in phase i is fixed immediately afterwards in phase f = i + 1.
That is, if the verification team found the same bugs as the development team, but found them earlier, the cost-to-fix would be reduced by a significant factor. For example, consider Table 2, which shows the cost of quickly fixing an issue relative to leaving it for a later phase (data from four NASA projects [12]). The last line of that table reveals that delaying issue resolution even by one phase increases the cost-to-fix by a factor of Delta = 2 to 5. Using this data, Dabney et al. [12] calculate that a dollar spent on verification returns to NASA, on those four projects, $1.21, $1.59, $5.53, and $10.10, respectively.
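The Delta row can be recomputed directly from the escalation factors in Table 2. The following is a small sketch of that check (Python is our illustration language, not part of the original study; the dictionary simply transcribes the table):

    from statistics import mean

    # Cost-to-fix escalation factors C[f][i] transcribed from Table 2,
    # indexed by the phase where the issue was found (f) and the phase
    # where it was introduced (i).
    C = {
        1: {1: 1},
        2: {1: 5, 2: 1},
        3: {1: 10, 2: 2, 3: 1},
        4: {1: 50, 2: 10, 3: 5, 4: 1},
        5: {1: 130, 2: 26, 3: 13, 4: 3, 5: 1},
        6: {1: 368, 2: 74, 3: 37, 4: 7, 5: 3},
    }

    # Delta for found-phase f: mean over introduction phases i of C[f][i] / C[f-1][i].
    for f in range(2, 7):
        delta = mean(C[f][i] / C[f - 1][i] for i in C[f - 1])
        print(f"f = {f}: delta = {delta:.1f}")
    # Prints 5.0, 2.0, 5.0, 2.7, 2.8 -- the last row of Table 2.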
The above notes lead to one very strong conclusion: find bugs earlier. But how? Software assessment budgets are finite while assessment effectiveness increases exponentially with assessment effort. However, the state space explosion problem imposes strict limits on how much of a system can be explored via automatic formal methods [68, 69]. As to other testing methods, a linear increase in the confidence C that we have found all defects can take exponentially more effort. For example, for one-in-a-thousand defects, moving C from 90% to 94% to 98% takes 2301, 2812, and 3910 black box probes, respectively (see footnote a below). Exponential costs quickly exhaust finite resources. Standard practice is to apply the best available assessment methods on the sections of the program that the best available domain knowledge declares most critical. We endorse this approach. Clearly, the most critical sections require the best known assessment methods. However, this focus on certain sections can blind us to defects in other areas. Therefore, standard practice should be augmented with a lightweight sampling policy to explore the rest of the system. This sampling policy will always be incomplete. Nevertheless, it is the only option when resources do not permit a complete assessment of the whole system.
a
Arandomlyselectedinputtoaprogramwillfindafaultwithprobabilityp.AfterN random
black-box tests, the chances of the inputs not revealing any fault is (1 p )
N
.Hence,thechances
C of seeing the fault is 1 (1 p)
N
which can be rearranged to N (C, p)=
log(1C)
log(1p)
.Forexample,
N(0.90, 10
3
)=2301.
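The footnote's formula is straightforward to evaluate. A minimal sketch (our illustration, not the authors' tooling), using the one-in-a-thousand fault probability quoted above:

    import math

    def probes_needed(confidence, p):
        """N(C, p) = log(1 - C) / log(1 - p): the number of random black-box
        tests needed to reach confidence C of seeing a fault that a randomly
        selected input triggers with probability p."""
        return math.log(1.0 - confidence) / math.log(1.0 - p)

    p = 1e-3  # one-in-a-thousand defects
    for c in (0.90, 0.94, 0.98):
        print(f"C = {c:.2f}: N = {probes_needed(c, p):.0f}")
    # Prints N = 2301, 2812, and 3910, matching the figures quoted in the text.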
2.2. Static Code Defect Prediction
A typical object-oriented software project can contain hundreds to thousands of classes. In order to guarantee general and project-related fitness attributes for those classes, it is commonplace to apply some quality assurance (QA) techniques to assess the classes' inherent quality. These techniques include inspections, unit tests, static source code analyzers, etc. A record of the results of this QA is a defect log. We can use these logs to learn defect predictors, if the information contained in the data provides not only a precise account of the encountered faults (i.e., the "bugs"), but also a thorough description of static code features such as Lines of Code (LOC), complexity measures (e.g., McCabe's cyclomatic complexity [31]), and other suitable object-oriented design metrics [6, 7, 8, 14].
For this, data miners can learn a predictor for the number of defective classes from past projects so that it can be applied for QA assessment in future projects. Such a predictor allows focusing the QA budgets on where they might be most cost effective. This is an important task as, during development, developers have to skew their quality assurance activities towards artifacts they believe require the most effort, due to limited project resources.
Now, static code defect predictors yield a lightweight sampling policy that, based on suitable static code measures, can effectively guide the exploration of a system and raise an alert on sections that appear problematic. One reason to favor static code measures is that they can be automatically extracted from the code base with very little effort, even for very large software systems [16]. The industrial experience is that defect prediction scales well to a commercial context. Defect prediction technology has been commercialized in Predictive [17], a product suite to analyze and predict defects in software projects. One company used it to manage the safety-critical software for a fighter aircraft (the software controlled a lithium ion battery, which can over-charge and possibly explode). After applying a more expensive tool for structural code coverage, the company ran Predictive on the same code base. Predictive produced results consistent with the more expensive tool, but was able to process a larger code base faster than the more expensive tool [17].
In addition, defect predictors developed at NASA [15] have also been used in software development companies outside the US (in Turkey). When the inspection teams focused on the modules that trigger the defect predictors, they found up to 70% of the defects using just 40% of their QA effort (measured in staff hours) [18]. Finally, a subsequent study on the Turkish software compared how much code needs to be inspected using random selection vs. selection via defect predictors. Using random testing, 87% of the files would have to be inspected in order to detect 87% of the defects. However, if the inspection process was restricted to the 25% of the files that trigger the defect predictors, then 88% of the defects could be found. That is, the same level of defect detection (after inspection) can be achieved using (87 - 25)/87 = 71% less effort [19].
[Figure 1: x-axis = % LOC read; y-axis = % defects found; three curves: baseline (sort on LOC), active (sort on LOC, predictions), and optimal (sort on LOC, actual).]
Fig. 1. Percentage of defects found after sorting the code using different inspection ordering policies. Note that, in this case, developers were continually modifying a small number of very active classes handling complex interfacing tasks. Hence, for the blue (optimal) curve, reading just this 1% of the code found nearly a quarter of the defects.
2.3. Inspection Optimization
Inspection optimization is a term proposed by Arisholm & Briand [20]. It is a technique for assessing the value of, say, a static code defect predictor. They define it as follows:
    If X% of the classes are predicted to be defective, then the actual faults identified in those classes must account for more than X% of all defects in the system being analyzed. Otherwise, the cost of generating the defect predictor is not worth the effort.
In essence, this is inspection optimization: find some ordering of project artifacts such that humans have to read the least code in order to discover the most faults, which we model as outlined below:
- After a data miner predicts a class is defective, a secondary human team examines the code.
- This team correctly recognizes ε% of the truly defective classes (ε = 100% means that the inspection team is perfect at its task and finds every defect present).
- A good learner is one that finds the most defective classes (measured in terms of probability of detection, pd) in the smallest classes (measured in terms of lines of code, LOC).
Inspection optimization can be visualized using Figure 1, which illustrates three plausible inspection ordering policies:
- The blue optimal policy combines knowledge of class size and the location of the actual defects.
- The green activity policy guesses defect locations using a defect predictor learned from the activity measures.
- The red baseline policy ignores defect counts and just sorts the classes in ascending order of size.
Each of these ordering policies sorts the code base along the x-axis. The code is then inspected, left to right, across that order, so that, by the end of the x-axis, we have read 100% of the code. Along the way, we encounter classes containing y% of defects (a.k.a. recall). A better policy finds more defects sooner, that is, it yields a larger area under the curve of %LOC-vs-recall. In Figure 1, we note that the green activity policy does better than the red baseline (and comes close, within 95%, of the blue optimal).
These three policies are defined by an equation modeling the distance to some utopia point of most defects and smallest LOC:

    0 ≤ score(D_c, L_c, α) = sqrt( (α · D_c^2 + (1 - L_c)^2) / (α + 1) ) ≤ 1
Here, D_c and L_c are the number of defects and lines of code in class C (normalized to range between 0 and 1), whereas α is a constant controlling the sorting. At α = 0, we ignore defects and sort only on LOC. This implements the baseline policy. This baseline policy is the Koru ordering advocated by researchers who argue that smaller classes have a relatively higher density of errors [21, 22, 23]. Note that if the activity policy cannot out-perform baseline, then our notion of activity is superfluous.
The other policies use α = 1. For the activity policy, we have to:
- Train a learner using the measures of Table 3 without LOC,
- Set D_c via the learned model,
- Sort using score, D_c, LOC, and α = 1,
- Calculate Figure 1 and determine the area under the %LOC-vs-recall curve.
The optimal policy does the same, but sets D_c using the historical defect logs. Note that optimal is different to activity since the former knows exactly where the defects are, whereas the latter must guess the defect locations using the learned model.
In practice, the optimal policy is impossible to apply, since it implies that we would know the number of defects before the classes are inspected. However, it is the theoretical upper bound on the performance of inspection optimization. Hence, we report activity and baseline performances as a ratio of the area under the curve of optimal.
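To make the three ordering policies concrete, the sketch below (our illustration, not the authors' code; the per-class records, the normalization, and the trapezoidal area computation are assumptions for the example) sorts classes by the score defined above and reports the baseline/optimal and activity/optimal area ratios:

    def score(defects, loc, alpha):
        """Sort key from the score equation above; `defects` and `loc` are
        already normalized to the range [0, 1]."""
        return ((alpha * defects ** 2 + (1.0 - loc) ** 2) / (alpha + 1.0)) ** 0.5

    def auc_loc_vs_recall(classes, key):
        """Area under the %LOC-read vs. %defects-found curve when classes are
        inspected in descending order of key(c)."""
        ordered = sorted(classes, key=key, reverse=True)
        total_loc = sum(c["loc"] for c in classes)
        total_defects = sum(c["defects"] for c in classes)
        area, prev_x, prev_y = 0.0, 0.0, 0.0
        loc_read, found = 0, 0
        for c in ordered:
            loc_read += c["loc"]
            found += c["defects"]
            x, y = loc_read / total_loc, found / total_defects
            area += (x - prev_x) * (prev_y + y) / 2.0  # trapezoid rule
            prev_x, prev_y = x, y
        return area

    # Hypothetical per-class records: actual defects, LOC, predicted defects.
    classes = [
        {"loc": 120, "defects": 0, "predicted": 0.2},
        {"loc": 40,  "defects": 3, "predicted": 2.1},
        {"loc": 900, "defects": 1, "predicted": 0.4},
        {"loc": 60,  "defects": 2, "predicted": 1.5},
    ]
    max_loc = max(c["loc"] for c in classes)
    max_def = max(c["defects"] for c in classes) or 1
    max_pred = max(c["predicted"] for c in classes) or 1

    baseline = auc_loc_vs_recall(classes, lambda c: score(0.0, c["loc"] / max_loc, 0))
    optimal = auc_loc_vs_recall(classes, lambda c: score(c["defects"] / max_def, c["loc"] / max_loc, 1))
    activity = auc_loc_vs_recall(classes, lambda c: score(c["predicted"] / max_pred, c["loc"] / max_loc, 1))

    print(f"baseline/optimal = {baseline / optimal:.2f}")
    print(f"activity/optimal = {activity / optimal:.2f}")

Because the score is only used as a sort key, the ratios reported by such a sketch do not depend on whether the square root is taken.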
This ratio calculation has another advantage. Note that the effectiveness ε of the secondary human inspection team is the same regardless of the oracle that sorts the code. Hence, in the ratio calculation, ε cancels out and we can ignore it in our analysis.
Table 3. Measures used in this study (collected separately for each class).

Measure                  Description                                Rationale for selection
Bugs                     Annotations in the source control logs     Used to check our predictions
LOC                      Lines of code in the class                 Used to estimate inspection effort
Getters                  get methods                                Read responsibility allocation
Setters                  set methods                                Write responsibility allocation
NoM                      All methods                                Breadth of functional decomposition
InDegree                 Other classes depending on this class      Coupling within design
OutDegree                Other classes this class depends upon      Breadth of delegation
Clustering Coefficient   Degree to which classes cluster together   Density of design
3. Activity
The novel feature of this paper is augmenting the usual static code measures with
the concept of activity. As discussed below, we find that activity can be a very
useful concept for inspection optimization.
When do we call a software artifact, say a class, "active"? We contend that activity arises when code is being modified, typically via enhancement or correction. This is change, and we can detect and measure it through the evolution of the associated volumetric and structural properties of a class [6].
However, one surprising observation from the Helix studies [6, 7] has been that (a) only a small set of highly active classes undergoes change frequently and (b) predictable patterns of modification emerge very early in the lifetime of a software system. Therefore, we ask whether the same metrics used to analyze the Helix data set can also guide defect discovery, since change and defects are closely related concepts. In particular, we argue that change can lead to defects via:
- Defect discovery: Since active classes are used more frequently by developers, developers are most likely to discover their defects earlier.
- Defect injection: When developers work with active classes, they make occasional mistakes, some of which lead to defects. Since developers work on active classes more than other classes, most developer defects accumulate in the active classes.
(The second point was first proposed by Nagappan & Ball, who say that code that changes many times pre-release will likely have more post-release defects than code that changes less over the same period of time [13].)
Table 3 summarizes our choices of measures of activity, each tagged with a rationale motivating its selection. These measures capture volumetric and structural properties of a class and provide us with an empirical component for detecting and measuring change. Furthermore, these measures are sufficiently broad to encompass, from a design perspective, the amount of functionality as well as how the developers have structurally organized the solution and how they chose to decompose the functionality.
Table 4. Activity-centric metrics definitions.

NoM: Counts all member functions defined by class C.
Getters: Counts all non-overloaded member functions in class C with arity zero, whose name starts with "get."
Setters: Counts all non-overloaded member functions in class C with arity one, whose name starts with "set."
InDegree: Let n be the type node for class C. Then |{(n', n) ∈ L | n' ≠ n}| is the in-degree of class C.
OutDegree: Let n be the type node for class C. Then |{(n, n') ∈ L | n' ≠ n}| is the out-degree of class C.
Clustering Coefficient: Let n be the type node for class C. Then 2 |{(n_i, n_j) ∈ L | n_i, n_j ∈ N_n}| / (|N_n| (|N_n| - 1)) is the clustering coefficient of class C, where N_n is the neighborhood of n, with N_n = {n' | (n', n) ∈ L ∨ (n, n') ∈ L}.
The measures NoM, Getters, and Setters define simple class-based counts (cf. Table 4). For the complexity measures InDegree, OutDegree, and Clustering Coefficient, however, we need to construct a complete class dependency graph first. The class dependency graph captures the dependencies between classes. That is, when a class uses either data or functionality from another class, there is a dependency between these classes. In the context of Java software, a dependency is created if a class inherits from a class, implements an interface, invokes a method on another class (including constructors), declares a field or local variable, uses an exception, or refers to class types within a method declaration. Thus, a class dependency graph is an ordered pair (N, L), where N is a finite, nonempty set of types (i.e., classes and interfaces) and L is a finite, possibly empty, set of directed links between types (i.e., L ⊆ N × N) expressing the dependencies between classes. For the purpose of the metrics extraction, we analyze each node n ∈ N in the graph to compute the structural complexity metrics of the class C that type node n represents, as shown in Table 4.
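As a sketch of how the three graph-based measures of Table 4 can be computed once the class dependency graph (N, L) is available (the example links are hypothetical; extracting the real graph from Java byte code is assumed to happen elsewhere):

    # Hypothetical class dependency graph (N, L): a set of directed links.
    links = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "A")}

    def in_degree(n, links):
        """Number of other classes that depend on class n."""
        return sum(1 for (src, dst) in links if dst == n and src != n)

    def out_degree(n, links):
        """Number of other classes that class n depends upon."""
        return sum(1 for (src, dst) in links if src == n and dst != n)

    def clustering_coefficient(n, links):
        """2 * |links among the neighbours of n| / (|N_n| * (|N_n| - 1)),
        following the definition in Table 4."""
        neighbours = ({s for (s, d) in links if d == n}
                      | {d for (s, d) in links if s == n}) - {n}
        k = len(neighbours)
        if k < 2:
            return 0.0
        linked = sum(1 for (a, b) in links if a in neighbours and b in neighbours)
        return 2.0 * linked / (k * (k - 1))

    for cls in ("A", "B", "C", "D"):
        print(cls, in_degree(cls, links), out_degree(cls, links),
              round(clustering_coefficient(cls, links), 2))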
An important feature of these measures is that they are relatively easy to collect. For example, one measure we rely on for defect prediction is the Number of Getter Methods (Getters) that developers have added to a class. Parsers for such simple measures are easy to obtain from early design representations (e.g., UML models) and can, with little effort, be adapted to new languages. Moreover, all measures are pairwise independent [7, 8] (measured using Spearman's rank correlation).
Table 5. Java systems used in this study.

System     Description
ant        Build management system
ivy        Dependency manager
jedit      Text editor
lucene     Text search engine
poi        API for Office Open XML standards
synapse    Enterprise service bus
velocity   Template language engine
xalan      XSLT processor
xerces     XML processor
In particular, Getters and Setters do not occur in pairs and are not simply being used as a means to expose the private fields of a class [8]. In general, the odds are only 1:3 that if a class defines a getter, this class will also provide a matching setter method.
4. Activity and Inspection Optimization
To assess the value of the selected activity-centric metrics (cf. Table 3), we distilled them for 33 open-source Java projects from the Helix project and used the resulting information to build defect predictors. As shown below, the median value for the learning/oracle ratio is 96%, that is, very close to the theoretical upper bound possible for any defect predictor for the task of inspection optimization.
4.1. Data Selection
The data used in this study was built as a join between two complementary data sets:
- The PROMISE repository [24] contains defect information for various open-source object-oriented systems. The defect data for this study was collected by Jureczko [25].
- The Helix repository [5, 6] provides static source code metrics for a compilation of release histories of non-trivial Java open-source software systems.
The joined data sets represent 33 releases of the projects listed in Table 5. All projects are "long term" (at least 15 releases spanning a development period of 36 months or more) and comprise more than 100 classes each. In addition, every project can be characterized as either an application, a framework, or a library, a broad "binning" strategy that reflects the inherent, yet recurring, differences in software design and composition. For a detailed description of these data sets, see Vasa's PhD thesis [6].
For LOC (i.e., the Lines of Code) we use an estimator based on the size of the compiled byte code rather than the actual source code.
[Figure 2: scatter of LOC (Bytecode) vs. LOC (Source Code); Spearman's rho = 0.9774.]
Fig. 2. Lines of Code (LOC) extracted from byte code is a very strong approximation of the LOC extracted directly from source code.
The byte code provides us with a noise-free image of the class's defined functionality. The LOC of a class C is given as the sum of the following components extracted from the binaries:
- out-degree of C line(s), for import statements,
- 1 line for the class declaration,
- 1 line for the super class declaration if not java.lang.Object,
- 1 line for each interface implemented by C,
- 1 line for each field defined in class C,
- 1 line for each method m defined in class C, plus
  - # parameters of m,
  - # throws defined by m,
  - the MaxLocals attribute (i.e., local variables) of m,
  - # byte code instructions in m.
We selected these components as they provide a very consistent approximation of the size of the source code, independent of the actual coding style used. The LOC estimator correlates very well with the lines of source code (cf. Figure 2). Furthermore, for the purpose of inspection optimization, an added benefit of processing byte code rather than source is that the data miner will only report those classes that actually appear in the released version. That is, the secondary human inspection team is given further guidance to focus its QA effort. Previous research [26, 27, 28, 29] found that, in general, not all parts of the code base are included in the final release build. This is due to the release build configuration settings used. Hence, processing 10% of the classes as per byte code is equivalent to analyzing 10% of the active source code classes (i.e., the classes that must be inspected in the QA process).
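A sketch of the estimator itself (our illustration; the field names of the per-class record are hypothetical, and the byte-code facts are assumed to have been extracted by a separate parser):

    def estimate_loc(cls):
        """LOC estimate for one class, summed from byte-code-level facts
        following the components listed above."""
        loc = cls["out_degree"]           # import statements
        loc += 1                          # class declaration
        if cls["superclass"] != "java.lang.Object":
            loc += 1                      # super class declaration
        loc += len(cls["interfaces"])     # one line per implemented interface
        loc += len(cls["fields"])         # one line per field
        for m in cls["methods"]:          # one line per method, plus ...
            loc += 1
            loc += m["num_parameters"]    # parameters of m
            loc += m["num_throws"]        # throws declared by m
            loc += m["max_locals"]        # MaxLocals attribute (local variables)
            loc += m["num_instructions"]  # byte code instructions
        return loc

    # Hypothetical extracted record for a single class:
    example = {
        "out_degree": 3,
        "superclass": "java.lang.Object",
        "interfaces": ["java.io.Serializable"],
        "fields": ["name", "size"],
        "methods": [
            {"num_parameters": 1, "num_throws": 0, "max_locals": 2, "num_instructions": 14},
            {"num_parameters": 0, "num_throws": 1, "max_locals": 1, "num_instructions": 6},
        ],
    }
    print(estimate_loc(example))  # prints 34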
Table 6. Percentile distributions, defects per class.
System # Classes 10% 30% 50% 70% 90%
ant-1.3 125 0 0 0 0 1
ant-1.4 178 0 0 0 0 1
ant-1.5 293 0 0 0 0 1
ant-1.6 351 0 0 0 0 2
ant-1.7 493 0 0 0 0 2
ivy-1.4 241 0 0 0 0 0
ivy-2 352 0 0 0 0 1
jedit-3.2 272 0 0 0 1 4
jedit-4 306 0 0 0 0 2
jedit-4.1 312 0 0 0 0 2
jedit-4.2 367 0 0 0 0 1
jedit-4.3 492 0 0 0 0 0
lucene-2 195 0 0 0 1 4
poi-2 314 0 0 0 0 1
synapse-1 157 0 0 0 0 0
synapse-1.1 222 0 0 0 0 1
synapse-1.2 256 0 0 0 1 2
velocity-1.6 229 0 0 0 1 2
xalan-2.4 428 0 0 0 0 1
xalan-2.5 763 0 0 0 1 2
xalan-2.6 875 0 0 0 1 2
xerces-1 162 0 0 0 2 2
xerces-1.2 438 0 0 0 0 1
xerces-1.3 452 0 0 0 0 1
lucene-2.2 247 0 0 1 2 4
lucene-2.4 428 0 0 1 2 5
poi-1.5 237 0 0 1 1 4
poi-2.5 348 0 0 1 2 2
poi-3 442 0 0 1 1 2
velocity-1.4 196 0 1 1 1 2
velocity-1.5 214 0 0 1 2 4
xalan-2.7 908 1 1 1 1 2
xerces-1.4 329 0 0 1 2 4
Note: The table is sorted by the median defects (see the 50% percentile column). For example, in xalan-2.7 the median (50th percentile) defects per class is 1, whereas in lucene-2.4, 10% of classes have 5 defects or more.
The joined, activity-based data sets are constructed as follows:
1. From the PROMISE repository we fetch the bug information for release N per class.
2. We extract from the Helix repository the static code metrics, including LOC, for release N per class.
3. Using the fully qualified class name as key, both sources of information are merged into the activity data set for release N per class.
Table 6 shows the distribution of defects seen in our classes. Usually, most classes have no defects, but the most defective 10% of classes have from 1 to 5 (or more) recorded defects each.
4.2. Experimental Setup
For the purpose of finding a predictor for inspection optimization, we employed a technique called N-way cross-validation [30]. The data set is divided into N = 10 buckets. For each bucket in the N-way, a predictor is learned on nine of the buckets, then tested on the remaining bucket. These N studies implement N hold-out studies where a model is tested on data not used in training.
To appreciate cross-validation, consider another approach called self-test, where the learned model is assessed on the same data that was used to create it. Self-tests are deprecated by the research community [30]. If the goal is to understand how well a defect predictor will work on future projects, it is best to assess the predictor via hold-out modules not used in the generation of that predictor.
In the WEKA 3.7.3 implementation of the cross-validation procedure used in this study, results are reported once for each test instance as that instance appears in one of the N hold-outs (see footnote b below). So a data set containing C examples will generate C predictions, regardless of the value of N used for the number of hold-outs.
4.3. Selection of Learners
As mentioned above, there are very many methods for converting static code measures into defect predictors [15, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]. We adopted Holte's simplicity-first heuristic [42] and applied a simple linear regression (LSR) algorithm available in WEKA [30], with no pre-processing.
Note that WEKA's LSR tool uses a simple greedy back-select, which is applied after the linear model has been generated. In that back-select, WEKA steps through all the attributes, removing the one with the smallest standardized coefficient until no improvement is observed in the estimate of the model error given by the Akaike information criterion. As a consequence, some attributes may be absent from the final learned model.
Initially, we planned to test various learners, feature extractors, instance selectors, and discretization methods (as we have done in the past [15, 40, 41]). But our results were so encouraging that there was little room for further improvement over simple LSR.
Footnote b: Note that prior to WEKA 3.7.2, the cross-val procedure (java -cp weka.jar $learner -t file.arff) incorrectly returns self-test results.
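For readers who want to reproduce the general evaluation pattern outside WEKA, the following sketch uses a scikit-learn linear regression as a stand-in (an assumption on our part: it omits WEKA's Akaike-based back-select, and the feature matrix here is synthetic); it shows how each class receives exactly one prediction, from the fold in which it was held out:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_predict

    # Synthetic stand-in data: one row per class, columns standing in for
    # the activity measures (Getters, Setters, NoM, InDegree, OutDegree,
    # Clustering Coefficient); target = recorded defect count.
    rng = np.random.default_rng(0)
    X = rng.random((200, 6))
    y = rng.poisson(2 * X[:, 2])  # defects loosely tied to one column, for illustration

    # 10-fold cross-validation: every class is predicted exactly once,
    # by a model that never saw it during training.
    predicted = cross_val_predict(LinearRegression(), X, y, cv=10)

    residuals = y - predicted
    print("percentiles of actual - predicted:",
          np.percentile(residuals, [10, 30, 50, 70, 90]).round(2))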
Table 7. Percentile distributions of actual - predicted number of defects per class.
System 10% 30% 50% 70% 90%
lucene-2.4 -2.0 -0.9 -0.3 0.4 2.6
velocity-1.6 -1.4 -0.6 -0.3 0.0 1.4
xerces-1.0 -1.1 -0.6 -0.3 0.7 1.0
ant-1.4 -0.4 -0.3 -0.2 -0.1 0.8
lucene-2.0 -2.0 -0.8 -0.2 0.1 1.7
poi-1.5 -1.6 -0.9 -0.2 0.0 2.3
synapse-1.2 -0.8 -0.4 -0.2 0.0 1.0
xalan-2.5 -0.9 -0.5 -0.2 0.5 0.8
xalan-2.6 -0.9 -0.5 -0.2 0.4 1.2
xalan-2.7 -0.5 -0.3 -0.2 0.1 0.7
xerces-1.2 -0.5 -0.2 -0.2 0.0 0.9
xerces-1.4 -2.2 -0.4 -0.2 0.4 0.8
ant-1.3 -0.5 -0.3 -0.1 0.0 0.6
ant-1.7 -0.8 -0.3 -0.1 0.0 0.8
ivy-2.0 -0.3 -0.1 -0.1 0.0 0.3
lucene-2.2 -2.5 -0.8 -0.1 0.4 2.0
poi-2.0 -0.2 -0.1 -0.1 -0.1 0.7
poi-3.0 -1.1 -0.4 -0.1 0.0 1.0
synapse-1.0 -0.3 -0.2 -0.1 0.0 0.0
synapse-1.1 -0.7 -0.4 -0.1 0.0 0.9
velocity-1.5 -1.5 -0.7 -0.1 0.3 1.5
xalan-2.4 -0.5 -0.2 -0.1 0.0 0.7
ant-1.5 -0.3 -0.1 0.0 0.0 0.3
ant-1.6 -0.8 -0.3 0.0 0.0 0.9
ivy-1.4 -0.2 -0.1 0.0 0.0 0.0
jedit-3.2 -2.3 -0.8 0.0 0.0 1.8
jedit-4.0 -1.3 -0.5 0.0 0.0 1.0
jedit-4.1 -1.1 -0.4 0.0 0.0 1.0
jedit-4.2 -0.6 -0.2 0.0 0.0 0.3
jedit-4.3 -0.1 0.0 0.0 0.0 0.0
poi-2.5 -1.3 -0.6 0.0 0.7 0.9
velocity-1.4 -1.2 -0.2 0.0 0.1 1.0
xerces-1.3 -1.4 -0.2 0.0 0.0 0.7
Note: For example, the median (50th percentile) value of actual - predicted lies between -0.3 and 0.
5. Results
5.1. Sanity Check
Table 7 shows the distribution of actual - predicted defects for our classes, where actual comes from historical logs and predicted comes from the C predictions seen in our 10-way. This result is our sanity check: if the actual - predicted values were large, then we would doubt the value of activity-based defect prediction. Note that, in the median case (shown in the middle 50% column), the predictions are very close to actuals (-0.3 to 0). Since our estimates are close to actuals, we may continue.
[Figure 3: percent of optimal (y-axis, 50 to 100) for each of the 33 data sets (x-axis), comparing baseline (sort on LOC) and active (sort on LOC, predictions).]
Fig. 3. Performance results expressed as a ratio of the optimal policy. Data sets are sorted according to the activity results. Median values for baseline and activity are 91% and 96% of optimal, respectively.
5.2. Baseline and Activity vs. Optimal
Figure 3 shows the ratio of the optimal policy achieved with the activity policy (the green curve) and the baseline policy (the red curve). These curves are statistically significantly different (Wilcoxon, 95% confidence). For both curves, the results are expressed as a ratio of the optimal policy that uses historical knowledge to determine the number of defects in each class.
We observe that the results of the baseline policy are far more erratic than those of the activity policy. The spread of a distribution is the difference between its 75th and 25th percentiles. The spreads of the values in Figure 3 are:
- Activity: 98 - 91 = 7
- Baseline: 95 - 82 = 13
That is, the results of the activity policy are more predictable (fall into a narrower range), whereas the results from baseline can spread nearly twice as far. Moreover, the activity results not only are more predictable, but also out-perform the baseline policy. The median value of the red baseline policy results (i.e., inspecting the code based on increasing class size) is 91% of optimal.
Note that baseline is rarely even slightly better than activity, and often it is much worse:
- When baseline out-performs activity (in only 3/33 of our comparisons), it does so only by a small margin.
- In the 30/33 data sets where baseline does worse than activity, it sometimes does much worse (see the velocity-1.5 and velocity-1.6 results, which fall to 70% of optimal).
The median value of the activity policy results is 96%, which is within 4% of optimal. Further, the top ten results of activity all score 100% of optimal (see the right-hand side of the green curve in Figure 3). That is, for the purpose of optimizing inspection, there is little to no room for improvement on top of the activity-centric measures. Hence, we strongly recommend the activity policy.
5.3. Summary
Our key observations in this study are as follows:
- According to Table 7, activity-centric measures combined with linear regression lead to defect predictors with low error rates in open-source object-oriented systems.
- According to Figure 3, for the task of inspection optimization, activity-centric defect prediction works significantly better than the baseline and very close to the optimum.
6. Discussion
The results here are quite unequivocal: activity is a strong predictor for software defects, and this effect can be detected with a simple model such as linear regression. Hence, we need to explain why this effect has not been reported before. We conjecture that the use of a small set of activity-centric static metrics is too simple and too novel a concept to have been reported previously.
6.1. Too Simple?
We can broadly classify object-oriented software quality research as (a) studies with more focus on prediction models than on the metrics, and (b) studies with more focus on metrics validation than on the models (as in this study). It is no surprise that the former kind of studies did not explicitly investigate the concept of activity, as they usually operate within existing sets of common metrics in order to choose the best model among many. The literature offers many complex methods for data mining, such as support-vector machines, random forests, and tabu search to tune the parameters of a genetic algorithm (i.e., [32, 33, 34, 35]). In this era of increasing learner complexity, something as easy as linear regression on a small set of static code measures aimed especially at activity may have been discounted before being explored rigorously. Therefore, our first explanation is that the use of activity as a concept is so simple that it escaped the attention of this type of research.
Nevertheless, we cannot ignore the latter type of studies, in which the focus has usually been validating object-oriented metrics as predictors of defects through correlational methods (i.e., [43, 44, 45, 46, 47]). Briand & Wüst provide an extensive survey of empirical studies of quality in object-oriented systems, and observe that the majority of studies fall into this category [48]. However, they also state that only half of the studies employ multivariate prediction models, while the other half just report univariate relations between object-oriented metrics and defects. Further, only half of the studies with a prediction model conduct a proper performance analysis through cross-validation. After this filtering, the remaining work contains hard-to-compare empirical studies, where the size and the number of data sets are so small that the combined results are conflicting and do not reveal a common trend, possibly due to the varying contexts of the studies.
Another aspect of related studies is that they consider certain subgroups of object-oriented metrics relating to concepts such as coupling, cohesion, inheritance and polymorphism, and size [48]. Briand & Wüst report that the significance of the relation between different subgroups of metrics and defects is mostly inconclusive, and only a number of size and coupling measures are consistent. We have further run a smaller-scale review of major studies conducted with the guidelines of the original survey [20, 38, 43, 44, 45, 47, 49, 50, 51, 52]. Similar to Briand & Wüst, we observed that the table of metrics vs. different systems used to assess those metrics was sparsely populated.
6.2. Too Novel?
The starting point for this research was the observation in the Helix data sets that most classes stabilize very early in their life cycle, while a very small number of active classes garner the most attention from developers [6, 7]. As discussed above in Section 3, this is not the standard picture of the life cycle of a class. To us, this observation was so unique that it prompted the question "does the amount a class is used by developers predict for system defects?" (i.e., this study). However, without that initial surprising observation, we would not have conducted the study reported in this paper.
Compared to other studies (e.g., the studies surveyed in [48]), the size and number of data sets used in our study is extensive and reveals a clear benefit of using activity-centric metrics in the context of open-source object-oriented systems. In contrast to our concept of activity, Turhan et al. [53] investigate popularity. Their approach is to augment standard static code metrics within a call-graph-based ranking framework, which is inspired by the PageRank algorithm [54]. Rather than constructing learners with a standard set of metrics that value each module equally, Turhan et al. first rank the modules using the dependency graph information and weigh the
information learned from "popular" modules more. Their approach reduced the false alarm rates significantly. However, this technique is an indirect way of utilizing activity, and does not include the explicit activity-centric metrics that are used in this study. Similarly, Zimmermann et al. include eigenvector centrality, a measure of closeness centrality of network nodes similar to PageRank, in their analysis of complexity vs. network metrics for predicting defects from software dependencies [55]. Though eigenvector centrality was found to be correlated with defects for the Windows system they evaluated, this metric did not stand out among other network- or complexity-based metrics enough to allow a discussion of "activity" (see the next section below for a possible cause). Finally, Kpodjedo et al. monitored their proposed, again PageRank-inspired, Class Rank metric among several versions of a single system and found moderate evidence in its favor [56]. In this paper, we handle activity as a concept rather than relying on a single measure, and we achieve near-optimal results compared to the moderate improvements of similar work.
6.3. Hidden?
It is possible that activity was buried under other effects. When we look at the measures that we have used in previous studies (e.g., [15]), we can see some overlap between those measures and the ones used here (cf. Table 3). Miller [57], Witten & Frank [30], and Wagner [58] offer theoretical analyses discussing how an excess of attributes containing multiple strong predictors for the target class can confuse learning. For example, both Wagner and Miller note that in a model comprising N variables, any noise in variable N_i adds to the noise in the output variables.
We have also observed supporting evidence for this explanation in our small-scale quality-in-object-oriented-systems review. In all cases where both a univariate and a multivariate analysis are used, it is common for metrics that have been verified by the univariate model not to be included in the multivariate model for the same data [43, 44, 47, 49, 51, 52]. El Emam et al. use this phenomenon to control for the confounding effects of size on metrics believed to serve as suitable predictors for defects [22]. Similarly, the multivariate model metrics may include those that are not verified by the univariate model [20, 38, 49], for which Guyon et al. provide simple examples showing that the prediction power can be significantly increased when features are used together rather than individually [59]. Hence, even though some measures exist in a data set, noise from the other variables may have drowned out their effect.
7. Validity & Future Work
Internal Validity: Apart from joining the PROMISE data sets (for defect counts) with the Helix data sets (for the activity-centric measures), we did not pre-process the data sets in any way. This was done to enable replication of our results.
Construct Validity: We have made the case above that the measures listed in Table 3 reflect the "activity" of different classes, that is, how often a developer will
modify or extend the services of a class as an expression of the attractiveness of this class for the developer's design choices. This case has not been tested here. Hence:
Future work 1: Analyze participant observation of developers to determine what classes they inspect as part of their workflow.
External Validity: Our use of cross-validation means that all the results reported above come from the application of our models to data not seen during training. This gives us some confidence that these results will hold for future data sets. As to our selection of data sets, the material used in this study represents real-world use, collected from real-world projects. Measured in terms of the number of data sets, this paper is one of the largest defect prediction studies that we are aware of. Nevertheless, there is a clear bias in our sample: open-source Java systems. Hence:
Future work 2: Test the validity of our conclusions on closed-source, non-object-oriented, and non-Java projects.
Conclusion Validity: We take great care to only state our conclusions in terms of areas under a %LOC-vs-recall curve. For the purpose of finding the most defects after inspecting the fewest lines of code (i.e., the inspection optimization criterion proposed by Arisholm & Briand [20]), the activity-centric metrics exhibit an excellent performance (median results at 96% of the optimum).
While the area under a %LOC-vs-recall curve is an interesting measure, it is not the only one seen in the literature. Hence:
Future work 3: Explore the value of activity for other evaluation criteria. Those other criteria may include:
- Counting the number of files inspected, rather than the total LOC, as done, for example, by Weyuker, Ostrand, and Bell [60, 61],
- Precision, as advocated, for example, by Zhang & Zhang [62] (but deprecated by Menzies et al. [63]),
- The area under the curve of the pd-vs-pf curves, as used by Lessmann et al. [32].
8. Conclusion
We have shown above that a repository containing just static code measures can still be used to infer interaction patterns amongst developers. Specifically, we studied the "active" classes, that is, the classes where the developers spend much time interacting with each other's design and implementation decisions. In 33 open-source Java systems, we found that defect predictors based on static code measures that model "activity" perform at 96% of a theoretical upper bound. This upper bound was derived assuming that the goal of the detectors was "inspection optimization," that is, read the fewest lines of code to find the most defects.
Though we have focused on inspection optimization and limited our discussion around it, the application of our techniques is not limited to the scope of this particular QA method. For example, our techniques can be directly applied to address the regression test case selection (or regression test prioritization) problem, especially in very large systems. The important challenges for such systems are
(a) to identify the specific parts of the system against which regression tests should be developed and (b) to determine which tests should have priority over others within the existing (possibly huge) regression test library. In practice, it usually takes from a few hours to weeks for developers to get feedback from regression test results (without considering the cost of the mental context-switch overheads for developers). Our techniques can be used to address both problems: (a) they point to the most problematic parts, so regression tests should cover those parts, and (b) they provide a prioritization of problematic parts, so a small portion of all tests consisting of the high-priority ones could be run more frequently to provide faster feedback to developers. While the scope of our hypothetical example is the whole system, it is straightforward to scale it down to the operational level, where developers can also benefit from our techniques directly: developers can be guided to develop and run local regression tests on the critical parts on their local machines, as pointed out by our techniques. In summary, applications of our techniques in different QA activities allow cost reductions through efficient management of resources and faster (early) feedback cycles to stakeholders.
There is another aspect of activity-centric measures that recommends their use. In this paper, we show that simple linear regression over these measures works very well indeed. That is, the machinery required to convert these measures into defect predictors is far less complex than alternative approaches, such as:
- Lessmann's random forests and support-vector machines [32],
- The many methods explored by Khoshgoftaar [33, 34, 35],
- Defect prediction via multiple explanatory variables [38, 39],
- Our own defect predictors via feature selection [15], instance selection [40], or novel learners built for particular tasks [41].
The comparative simplicity of activity-centric prediction suggests that previous work [31, 32, 33, 34, 35, 36, 37, 38, 39], including our own research [15, 40, 41], may have needlessly complicated a very simple concept, that is, defects are introduced and discovered due to all the activity around a small number of most active classes.
9. Acknowledgments
This work was conducted at Swinburne University of Technology, West Virginia University, and the University of Oulu with partial funding from (1) the New Zealand Foundation for Research, Science and Technology, (2) the United States National Science Foundation, CISE grant 71608561, (3) a research subcontract with the Qatar University NPRP 09-1205-2-470, and (4) TEKES under the Cloud-SW project in Finland.
References
[1] C. Bird, N. Nagappan, H. Gall, B. Murphy, and P. Devanbu, "Putting It All Together: Using Socio-technical Networks to Predict Failures," in Proceedings of the 20th
International Symposium on Software Reliability Engineering, Nov. 2009, pp. 109-119.
[2] C. Bird, N. Nagappan, P. Devanbu, H. Gall, and B. Murphy, "Does distributed development affect software quality? An empirical case study of Windows Vista," in Proceedings of the 31st International Conference on Software Engineering, May 2009, pp. 518-528.
[3] N. Nagappan, B. Murphy, and V. Basili, "The Influence of Organizational Structure on Software Quality: An Empirical Case Study," in Proceedings of the 30th International Conference on Software Engineering, May 2008, pp. 521-530.
[4] P. J. Guo, T. Zimmermann, N. Nagappan, and B. Murphy, "Characterizing and predicting which bugs get fixed: An empirical study of Microsoft Windows," in Proceedings of the 32nd International Conference on Software Engineering, May 2010, pp. 495-504.
[5] R. Vasa, M. Lumpe, and A. Jones, "Helix - Software Evolution Data Set," http://www.ict.swin.edu.au/research/projects/helix, Dec. 2010.
[6] R. Vasa, "Growth and Change Dynamics in Open Source Software Systems," Ph.D. dissertation, Swinburne University of Technology, Faculty of Information and Communication Technologies, Oct. 2010.
[7] R. Vasa, M. Lumpe, P. Branch, and O. Nierstrasz, "Comparative Analysis of Evolving Software Systems Using the Gini Coefficient," in Proceedings of the 25th IEEE International Conference on Software Maintenance. Edmonton, Alberta: IEEE Computer Society, Sep. 2009, pp. 179-188.
[8] M. Lumpe, S. Mahmud, and R. Vasa, "On the Use of Properties in Java Applications," in Proceedings of the 21st Australian Software Engineering Conference, Auckland, New Zealand, Apr. 2010, pp. 235-244.
[9] B. Boehm and P. Papaccio, "Understanding and controlling software costs," IEEE Trans. on Software Engineering, 14(10), 1462-1477, October 1988.
[10] F. Shull, V. R. Basili, B. Boehm, A. W. Brown, P. Costa, M. Lindvall, D. Port, I. Rus, R. Tesoriero, and M. V. Zelkowitz, "What We Have Learned About Fighting Defects," in Proceedings of the 8th International Software Metrics Symposium, Ottawa, Canada, 2002.
[11] J. D. Arthur, M. K. Groner, K. J. Hayhurst, and C. M. Holloway, "Evaluating the Effectiveness of Independent Verification and Validation," IEEE Computer, October 1999.
[12] J. B. Dabney, G. Barber, and D. Ohi, "Predicting Software Defect Function Point Ratios Using a Bayesian Belief Network," in Proceedings of the PROMISE workshop, 2006.
[13] N. Nagappan and T. Ball, "Use of relative code churn measures to predict system defect density," in ICSE'05, 2005, pp. 284-292.
[14] S. R. Chidamber and C. F. Kemerer, "A Metrics Suite for Object Oriented Design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476-493, Jun. 1994.
[15] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Transactions on Software Engineering, vol. 33, no. 1, pp. 2-13, Jan. 2007.
[16] N. Nagappan and T. Ball, "Static analysis tools as early indicators of pre-release defect density," in Proceedings of the 27th International Conference on Software Engineering, May 2005, pp. 580-586.
[17] J. Turner, "A predictive approach to eliminating errors in software code," 2006, available from http://www.sti.nasa.gov/tto/Spinoff2006/ct_1.html.
[18] A. Tosun, B. Turhan, and A. Bener, "Practical considerations of deploying ai in defect prediction: A case study within the turkish telecommunication industry," in
ings of the 5th International Conference on Predictor Models in Software Engineering.
ACM, May 2009, pp. 1-9.
[19] A. Tosun, A. Bener, and R. Kal e, “Ai- based software defect predictors: Applica-
tions and benefits in a case study,” in Twenty-Second IAAI Conference on Artificial
Intelligence,Jul.2010,pp.1748-1755.
[20] E. Arisholm and L. Briand, "Predicting fault-prone components in a Java legacy system," in Proceedings of the 5th ACM/IEEE International Symposium on Empirical Software Engineering, Sep. 2006, pp. 8-17.
[21] A. Koru, D. Zhang, K. El Emam, and H. Liu, "An investigation into the functional form of the size-defect relationship for software modules," IEEE Transactions on Software Engineering, vol. 35, no. 2, pp. 293-304, Mar. 2009.
[22] A. Koru, K. El Emam, D. Zhang, H. Liu, and D. Mathew, "Theory of relative defect proneness: Replicated studies on the functional form of the size-defect relationship," Empirical Software Engineering, pp. 473-498, Oct. 2008.
[23] A. Koru, D. Zhang, and H. Liu, "Modeling the effect of size on defect proneness for open-source software," in International Workshop on Predictor Models in Software Engineering (PROMISE'07), May 2007, article No. 10.
[24] G. Boetticher, T. Menzies, and T. Ostrand, "The PROMISE Repository of Empirical Software Engineering Data," 2007, http://promisedata.org/.
[25] M. Jureczko and D. Spinellis, "Using Object-Oriented Design Metrics to Predict Software Defects," Models and Methods of System Dependability. Oficyna Wydawnicza Politechniki Wrocławskiej, pp. 69-81, 2010.
[26] J. Krinke, "Identifying Similar Code with Program Dependence Graphs," in Proceedings of the Eighth Working Conference on Reverse Engineering (WCRE'01). IEEE Computer Society, Oct. 2001, pp. 301-309.
[27] G. Antoniol, G. Canfora, G. Casazza, and A. De Lucia, "Information Retrieval Models for Recovering Traceability Links Between Code and Documentation," in Proceedings of the International Conference on Software Maintenance (ICSM'00), 2000, pp. 40-49.
[28] K. Kontogiannis, R. DeMori, E. Merlo, M. Galler, and M. Bernstein, "Pattern Matching for Clone and Concept Detection," Journal of Automated Software Engineering, vol. 3, pp. 77-108, 1996.
[29] J. H. Johnson, "Substring Matching for Clone Detection and Change Tracking," in Proceedings of the International Conference on Software Maintenance (ICSM'94), 1994, pp. 120-126.
[30] I. H. Witten and E. Frank, Data Mining, 2nd edition. Los Altos, US: Morgan Kaufmann, 2005.
[31] T. McCabe, "A Complexity Measure," IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308-320, Dec. 1976.
[32] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, "Benchmarking classification models for software defect prediction: A proposed framework and novel findings," IEEE Transactions on Software Engineering, vol. 34, no. 4, pp. 485-496, Aug. 2008.
[33] T. M. Khoshgoftaar, N. Seliya, and K. Gao, "Assessment of a New Three-Group Software Quality Classification Technique: An Empirical Case Study," Empirical Software Engineering, vol. 10, pp. 183-218, Apr. 2005.
[34] T. M. Khoshgoftaar, S. Zhong, and V. Joshi, "Enhancing software quality estimation using ensemble-classifier based noise filtering," Intell. Data Anal., vol. 9, pp. 3-27, Jan. 2005.
[35] T. M. Khoshgoftaar, X. Yuan, and E. B. Allen, "Balancing Misclassification Rates in Classification-Tree Models of Software Quality," Empirical Software Engineering, vol. 5, pp. 313-330, Dec. 2000.
[36] R. Harrison, S. Counsell, and R. Nithi, "An investigation into the applicability and validity of object-oriented design metrics," Empirical Software Engineering, vol. 3, no. 3, pp. 255-273, Sep. 1998.
[37] L. C. Briand, S. Morasca, and V. R. Basili, "Defining and Validating Measures for Object-Based High-Level Design," IEEE Transactions on Software Engineering, vol. 25, no. 5, pp. 722-743, Sep. 1999.
[38] L. Briand, J. Wüst, J. Daly, and D. Victor Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," Journal of Systems and Software, vol. 51, no. 3, pp. 245-273, May 2000.
[39] N. Nagappan, T. Ball, and A. Zeller, "Mining metrics to predict component failures," in Proceedings of the 28th International Conference on Software Engineering. ACM, May 2006, pp. 452-461.
[40] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540-578, Oct. 2009.
[41] T. Menzies, O. Jalali, J. Hihn, D. Baker, and K. Lum, "Stable rankings for different effort models," Automated Software Engineering, vol. 17, no. 4, pp. 409-437, Dec. 2010.
[42] R. Holte, "Very simple classification rules perform well on most commonly used datasets," Machine Learning, vol. 11, pp. 63-90, Apr. 1993.
[43] V. R. Basili, L. C. Briand, and W. L. Melo, "A validation of object-oriented design metrics as quality indicators," IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751-761, Oct. 1996.
[44] L. C. Briand, J. Wüst, S. V. Ikonomovski, and H. Lounis, "Investigating quality factors in object-oriented designs: an industrial case study," in Proceedings of the 21st International Conference on Software Engineering. Los Alamitos, CA, USA: IEEE Computer Society Press, May 1999, pp. 345-354.
[45] M. Cartwright and M. Shepperd, "An empirical investigation of an object-oriented software system," IEEE Transactions on Software Engineering, vol. 26, pp. 786-796, Aug. 2000.
[46] K. El Emam, S. Benlarbi, N. Goel, and S. N. Rai, "The confounding effect of class size on the validity of object-oriented metrics," IEEE Transactions on Software Engineering, vol. 27, no. 7, pp. 630-650, Jul. 2001.
[47] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: a replicated case study," Software Process: Improvement and Practice, vol. 14, no. 1, pp. 39-62, Jan. 2009.
[48] L. C. Briand and J. Wüst, "Empirical studies of quality models in object-oriented systems," Advances in Computers, vol. 56, pp. 98-167, 2002.
[49] K. El Emam, W. Melo, and J. C. Machado, "The prediction of faulty classes using object-oriented design metrics," Journal of Systems and Software, vol. 56, pp. 63-75, Feb. 2001.
[50] L. C. Briand, W. L. Melo, and J. Wüst, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 706-720, Jul. 2002.
[51] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Investigating the effect of coupling metrics on fault proneness in object-oriented systems," Software Quality Professional, vol. 8, no. 4, pp. 4-16, Sep. 2006.
[52] K. K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, "Investigating effect of Design Metrics on Fault Proneness in Object-Oriented Systems," Journal of Object Technology, vol. 6, no. 10, pp. 127-141, Nov. 2007.
[53] B. Turhan, G. Kocak, and A. Bener, "Software Defect Prediction Using Call Graph Based Ranking (CGBR) Framework," in Proceedings of the 34th Euromicro Conference on Software Engineering and Advanced Applications, Sep. 2008, pp. 191-198.
[54] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," in Proceedings of the 7th International Conference on World Wide Web, Apr. 1998, pp. 107-117.
[55] T. Zimmermann and N. Nagappan, "Predicting defects using network analysis on dependency graphs," in Proceedings of the 30th International Conference on Software Engineering (ICSE '08). New York, NY, USA: ACM, 2008, pp. 531-540.
[56] S. Kpodjedo, F. Ricca, G. Antoniol, and P. Galinier, "Evolution and search based metrics to improve defects prediction," International Symposium on Search Based Software Engineering, pp. 23-32, 2009.
[57] A. Miller, Subset Selection in Regression (second edition). Chapman & Hall, 2002.
[58] S. Wagner, "Global Sensitivity Analysis of Predictor Models in Software Engineering," in International Workshop on Predictor Models in Software Engineering (PROMISE'07), May 2007, article No. 3.
[59] I. Guyon, A. Elisseeff, and L. Kaelbling, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1157-1182, Mar. 2003.
[60] T. J. Ostrand, E. J. Weyuker, and R. M. Bell, "Where the bugs are," in Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY, USA: ACM, Jul. 2004, pp. 86-96.
[61] E. Weyuker, T. Ostrand, and R. Bell, "Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models," Empirical Software Engineering, vol. 13, no. 5, pp. 539-559, Oct. 2008.
[62] H. Zhang and X. Zhang, "Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors'," IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 635-637, Sep. 2007.
[63] T. Menzies, A. Dekhtyar, J. Distefano, and J. Greenwald, "Problems with Precision: A Response to 'Comments on Data Mining Static Code Attributes to Learn Defect Predictors'," IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 637-640, Sep. 2007.
[64] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller, "Predicting Faults from Cached History," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07). Washington, DC, USA: ACM, 2007, pp. 489-498.
[65] F. Rahman, D. Posnett, A. Hindle, E. Barr, and P. Devanbu, "BugCache for Inspections: Hit or Miss?," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE '11). Szeged, Hungary, Sep. 2011, pp. 322-331.
[66] W. Li and S. Henry, "Object-oriented metrics that predict maintainability," Journal of Systems and Software, vol. 23, no. 2, pp. 111-122, 1993.
[67] P. Cohen, Empirical Methods for Artificial Intelligence. MIT Press, 1995.
[68] M. Lumpe, L. Grunske, and J.-G. Schneider, "State Space Reduction Techniques for Component Interfaces," in Proceedings of the 11th International Symposium on Component-Based Software Engineering (CBSE 2008). LNCS 5282, Springer, Heidelberg, Germany, pp. 130-145, Oct. 2008.
[69] M. Lumpe and R. Vasa, "Partition Refinement of Component Interaction Automata: Why Structure Matters More Than Size," Electronic Proceedings in Theoretical Computer Science, vol. 37, pp. 12-26, Oct. 2010.