Does Measuring Code Change Improve Fault Prediction?
Robert M. Bell, Thomas J. Ostrand, Elaine J. Weyuker
AT&T Labs - Research
180 Park Avenue
Florham Park, NJ 07932
(rbell,ostrand,weyuker)@research.att.com
ABSTRACT
Background: Several studies have examined code churn
as a variable for predicting faults in large software systems.
High churn is usually associated with more faults appearing
in code that has been changed frequently.
Aims: We investigate the extent to which faults can be
predicted by the degree of churn alone, whether other code
characteristics occur together with churn, and which combi-
nations of churn and other characteristics provide the best
predictions. We also investigate different types of churn, in-
cluding both additions to and deletions from code, as well
as overall amount of change to code.
Method: We have mined the version control database
of a large software system to collect churn and other soft-
ware measures from 18 successive releases of the system.
We examine the frequency of faults plotted against various
code characteristics, and evaluate a diverse set of prediction
models based on many different combinations of indepen-
dent variables, including both absolute and relative churn.
Results: Churn measures based on counts of lines added,
deleted, and modified are very effective for fault prediction.
Individually, counts of adds and modifications outperform
counts of deletes, while the sum of all three counts was most
effective. However, these counts did not improve prediction
accuracy relative to a model that included a simple count
of the number of times that a file had been changed in the
prior release.
Conclusions: Including a measure of change in the prior
release is an essential component of our fault prediction
method. Various measures seem to work roughly equiva-
lently.
Categories and Subject Descriptors: D.2.5 [Software
Engineering]: Testing and Debugging – Debugging aids
General Terms: Experimentation
Keywords: software faults, fault prediction, code churn,
fault-percentile average, empirical study
1. INTRODUCTION
Several researchers have noted that the amount of code
change in some part of the code of a software system can
be an effective indicator of the fault-proneness of that part,
either in the modified release or in succeeding releases. This
change is often referred to as churn, and we will use that
term to refer to any sort of change in the remainder of the
paper.
For example, we have found that the number of times
that a file has been changed in the previous two releases is
a powerful indicator of fault-proneness in the negative bino-
mial prediction models that we have investigated [11, 12]. In
the present paper, we examine finer-grained measurements
of churn, including the counts of numbers of lines within a
file that have been added, deleted, or modified in previous
releases.
The studies described in this paper are performed on 18
quarterly releases spanning approximately five years of a
provisioning system. In previous work, we have referred to
this system as System 7 [10]. It includes files written in six
different programming languages (C, C++, Java, SQL, a proprietary
variant of C, and a proprietary variant of C++), as well as
.h files. About 60% of the files were written
in Java with the remaining 40% of the files roughly equally
distributed among the other six file types. Table 1 provides
basic data on the size of the system’s releases, the occur-
rence of faults, and the percent of each release that was
changed. The first 16 rows of the table are updated from
the data that appeared in [10]. The present table includes
one file type (.h files) that was not included in the previ-
ous paper and two additional releases, as well as updates to
the fault and change counts due to the ongoing development
and maintenance of the system.
2. MEASURING CHANGE
Changes to a file can be measured in many ways in ad-
dition to simply determining whether or not the file was
changed during a release. In this paper we consider ways of
quantifying the degree of change, in order to see whether
that added information is helpful when predicting fault-
proneness. In our previous work [11, 12, 13], we measured
the number of changes within a release, that is, the number
of times that the file had gone through a check-out/check-in
cycle.
The numbers of changes in Releases N-1 and N-2 are two
attributes that are part of our Standard Model to predict
fault-proneness of files in Release N. We refer to these in
the sequel as Prior Changes and Prior-Prior Changes, respectively.
Release   Files   KLOC   % New files   % Changed files   % Faulty files   Faults   Faults/file   Faults/KLOC
1 2470 885.3 100.0 .0 5.6 284 .11 .32
2 2487 912.3 .7 10.2 3.7 168 .07 .18
3 2507 924.8 .8 7.5 5.1 404 .16 .44
4 2569 943.8 2.4 8.4 4.9 243 .09 .26
5 2610 966.0 1.6 15.1 4.4 192 .07 .20
6 2707 989.3 3.6 10.5 5.1 288 .11 .29
7 2872 1034.8 5.8 8.1 6.8 580 .20 .56
8 3005 1082.0 4.4 14.6 5.7 355 .12 .33
9 3159 1122.6 4.9 13.2 7.0 453 .14 .40
10 3506 1229.9 9.9 18.9 9.1 689 .20 .56
11 3552 1288.2 1.3 22.9 6.7 576 .16 .45
12 3517 1293.8 .1 11.2 2.5 161 .05 .12
13 3713 1364.0 5.3 2.9 6.1 715 .19 .52
14 3769 1408.1 1.5 17.9 4.8 454 .12 .32
15 3797 1441.9 .7 10.5 4.5 389 .10 .27
16 3829 1466.4 .8 8.5 4.0 352 .09 .24
17 3891 1489.0 1.6 6.8 3.4 275 .07 .18
18 3920 1520.1 .7 8.2 4.0 522 .13 .34
Average 6.8 11.0 5.2 7100 (total) .12 .33
Table 1: Size, Faults, and Changes for System 7 (Provisioning System)
We note here that all model variables, with the
exception of size, relate to prior releases and are named ac-
cordingly. The other attributes in the Standard Model are
the count of faults in Release N-1, the size of the files in
1000’s of lines of code (KLOC), the age of the file in terms
of the number of previous releases the file was part of the
system, and the file type.
In this paper, we use finer-grained change information to
measure the volume of the editing changes that occur each
time a user checks out the latest version of the file, edits
it, and then checks in the edited version. Specifically, we
consider three types of changes that can occur within an
editing session: lines added to the file, lines deleted, and
lines modified. The definitions we use are as follows.
Changes are always measured for contiguous sets of lines
in the file. If a user edits two groups of lines that are sep-
arated by at least one unchanged line, that is considered
two separate actions, and the following definitions must be
applied separately to both of the actions.
If an editing action involves only adding lines or only delet-
ing lines, the result is simply the count of the number of lines
added or deleted. If an action involves modifying individ-
ual lines without inserting or removing additional lines, the
result is again simply the number of lines that are modi-
fied. If an action involves replacing k contiguous lines with
k+i lines (k and i both positive integers), then the result
is k lines modified and i lines added. If an action involves
replacing k contiguous lines with k-i lines (k and i both
positive integers, and k > i), then the result is k-i lines
modified and i lines deleted.
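Expressed procedurally, these rules reduce to comparing the number of original lines k and replacement lines m in a contiguous editing action. The following Python sketch is an illustration of the counting rules only, not the tooling used in the study:

```python
def classify_action(k, m):
    """Count churn for one contiguous editing action that replaces k
    original lines with m new lines.

    Returns (lines_added, lines_deleted, lines_modified) under the
    definitions above.
    """
    if k == 0:              # pure insertion: m lines added
        return (m, 0, 0)
    if m == 0:              # pure deletion: k lines deleted
        return (0, k, 0)
    if m >= k:              # k lines replaced by k+i lines: k modified, i added
        return (m - k, 0, k)
    return (0, k - m, m)    # k lines replaced by k-i lines: k-i modified, i deleted
```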
It is possible to have alternative views of the changes
that are made in going from the checked-out version to the
checked-in version. In particular, one could define editing
changes only in terms of adds and deletes. For example,
the replacement of k lines with k+i lines could be defined
as deleting k lines and adding k+i lines. However, we believe it is
more meaningful to distinguish this presumably single
Change                                 Lines Added   Lines Deleted   Lines Modified
[7] x+1 -> [7] x+2                     0             0               1
[21:30] -> 1 line                      0             9               1
[41:45] -> 8 lines                     3             0               5
[99:100] -> [99] 12 new lines [100]    12            0               0
Total                                  15            9               7
Table 2: Examples of changes to a file
editing action from the situation where a developer performs
two potentially unrelated edits, deleting klines and adding
k+iin widely-separated sections of code.
A single editing session can involve multiple adds, deletes,
and modifications. For example, suppose a user checks out
a file with 100 lines, and makes the following changes before
checking the file back in. The user changes the expression
x+1 in line 7 to x+2, replaces lines 21-30 with a single line,
replaces the original lines 41-45 with a new set of 8 lines, and
inserts 12 new lines just in front of the last line of the file,
resulting in a new file with 106 lines. Table 2 shows the way
these changes are counted. The numbers in square brackets
represent the line numbers in the original checked-out file.
A file may undergo multiple editing sessions in a single
release. The churn counts for the file in the release are the
sums of the respective adds, deletes, and modifications done
in all those sessions.
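Applied to the four editing actions of the example above and summed, the per-action counts reproduce the totals in Table 2. This is a purely illustrative check using the sketch given earlier:

```python
# Each action is (k original lines, m replacement lines).
session = [
    (1, 1),    # x+1 in line 7 changed to x+2       -> 1 modified
    (10, 1),   # lines 21-30 replaced by 1 line     -> 1 modified, 9 deleted
    (5, 8),    # lines 41-45 replaced by 8 lines    -> 5 modified, 3 added
    (0, 12),   # 12 lines inserted before line 100  -> 12 added
]

added = deleted = modified = 0
for k, m in session:
    a, d, mod = classify_action(k, m)
    added += a
    deleted += d
    modified += mod

print(added, deleted, modified)   # 15 9 7, as in Table 2
```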
3. PRELIMINARIES
We start by looking at simple properties of file size and
fault occurrence, according to the change status of the files.
For each release, we categorize each file’s status as being one
of: new to the system, changed from the prior release, or
unchanged. Figure 1 shows the mean file size, by release, for
each change status. At Release 1, we treat all files as new.
Consequently, the majority of instances of new files occur at
Release 1.
Figure 1: Sizes of files, by change status, Releases 1-18 (mean LOC: New = 303, Unchanged = 279, Changed = 1081)
Figure 2: Faults per file, by change status, Releases 1-18 (mean faults per file: New = 0.24, Unchanged = 0.02, Changed = 0.80)
Changed files are notably larger than the other types, im-
plying that large files are more likely to be changed. Note
that new files introduced after the first two releases were
much smaller on average than those created in Releases 1
and 2.
Figure 2 shows the average number of faults per file, by
release and change status. Unchanged files always have ex-
tremely low fault counts, never exceeding 0.06 faults per file.
Fault rates are much higher for the other two types of file,
and changed files have higher fault rates than new files in
every release from 2 through 18 except Release 17. This
is consistent with the data from our previous work, where
we noted that files that have been changed in the previous
release are much more likely to have faults than unchanged
files.
Because the average size of changed files is so much larger
than that of new files, the fault rate per KLOC is usually
lower for changed files, as seen in Figure 3.
Table 3 summarizes the results in the last three figures by
aggregating over release. In addition, the last two columns
of the table show the aggregate counts and percentages of
faults by change status. Despite being only 11 percent of
files overall (Table 1), changed files contain 72 percent of all
faults and 75 percent of faults after Release 1.
Going beyond merely the change status of files, Table 4
shows fault rates for changed files, broken down by the types
of changes that occurred. As the number of types of changes
grows from one to all three, both the average number of lines
changed and the subsequent fault rates increase.
Figure 3: Faults per KLOC, by change status, Releases 1-18 (mean faults per KLOC: New = 0.79, Unchanged = 0.08, Changed = 0.74)
Figure 4: Faults per file as a function of the size of change (Lines Added + Lines Deleted + Lines Modified)
Performing more than one type of change is more fault-prone than
doing a single one, and performing all three types during a single
release results in more than 4 times as many faults as doing
a single type of change. When only one type of change
occurs in a file, Adds are the most fault-prone, Modifications
second, and Deletes yield the fewest faults.
Figure 4 shows faults per file, for files changed in the prior
release, as a function of the total number of lines changed.
Files are sorted into ten bins based on the number of lines
changed, with each plotting symbol representing about 600
to 700 points. Not surprisingly, the number of faults oc-
curring in changed files rises with the total number of lines
changed. Obviously, the more lines that are changed, the
greater the opportunity for mistakes to occur.
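The binning behind Figure 4 can be reproduced with a few lines of analysis code; the sketch below uses synthetic data and illustrative column names rather than the study's actual dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the files changed in the prior release.
changed = pd.DataFrame({
    "lines_changed": rng.lognormal(mean=3.0, sigma=1.5, size=6500).round() + 1,
    "faults": rng.poisson(0.8, size=6500),
})

# Ten roughly equal-sized bins by total lines changed, as in Figure 4.
changed["bin"] = pd.qcut(changed["lines_changed"], q=10, duplicates="drop")
faults_per_file = changed.groupby("bin", observed=True)["faults"].mean()
print(faults_per_file)
```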
4. PREDICTION MODELS AND PREDICTION RESULTS
To determine the effectiveness of different variables for
fault prediction, we constructed models containing various
combinations of the original variables from the Standard
Model, together with subsets of the churn variables. Models
to predict fault counts for Release N are constructed using
negative binomial regression, with values from Releases 1
through N-1 as training data. See [11] for details on the
negative binomial regression model for fault prediction as
well as for example results.
Models were constructed, applied, and evaluated following
our standard procedure, as follows: The dependent output
variable is the number of faults for each file in a release.
The non-churn independent variables are chosen from a set
File status           Number of files   Average LOC   Faults per file   Faults per KLOC   Number of faults   Percent of faults
New (Release 1) 2470 358 .11 .32 284 4.0
New (Releases 2-18) 1493 210 .44 2.11 661 9.3
Changed 6389 1081 .80 .74 5121 72.1
Unchanged 47528 279 .02 .08 1034 14.6
Table 3: Faults Rates, by File Status
File status              Number of files   Percentage of files   Average LOC   Average lines changed   Faults per file   Faults per KLOC
Adds only 597 9.4 769 21 .30 .39
Deletes only 296 4.6 513 5 .04 .07
Modifications only 683 10.7 605 4 .19 .32
Adds & Deletes 126 2.0 702 21 .50 .71
Adds & Mods 1894 29.6 940 37 .55 .59
Deletes & Mods 168 2.6 521 23 .36 .69
Adds, Deletes, & Mods 2625 41.1 1495 210 1.38 .92
Table 4: Faults Rates for Changed Files, by Type(s) of Change
that includes code attributes, which are obtainable from the
release to be predicted, and history attributes, which are ob-
tainable from previous releases. The code attributes include
LOC, file status, file age, and file type. History attributes
include prior changes and prior faults. All models in this
paper include release number, implemented as a series of
dummy variables for Releases 1 through N-2.
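As a rough sketch of this model-fitting step (illustrative column names and data layout; the dispersion parameter is fixed here, whereas a full negative binomial fit would estimate it), the regression of per-file fault counts on code and history attributes could be set up as follows:

```python
import pandas as pd
import statsmodels.api as sm

# One row per (file, release) for Releases 1 through N-1; column names
# are illustrative, not the study's actual field names.
train = pd.read_csv("file_history.csv")

# Dependent variable: fault count for each file in a release.
y = train["faults"]

# Independent variables: code attributes plus history attributes, with
# categorical predictors expanded into dummy variables (including release).
X = pd.get_dummies(
    train[["log_kloc", "prior_changes", "prior_faults",
           "file_type", "age_group", "release"]],
    columns=["file_type", "age_group", "release"],
    drop_first=True,
).astype(float)
X = sm.add_constant(X)

# Negative binomial regression of fault counts on the predictors.
model = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()

# Predictions for Release N then use that release's code attributes and the
# history attributes of Releases N-1 and N-2 (same column layout as X).
# predicted = model.predict(X_release_N)
```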
4.1 Evaluation
After a model has been constructed, it is applied to generate
predictions for Release N using code attributes of Release N
and history attributes of the previous two Releases
(N-1 and N-2). The files of Release N are sorted in
descending order of the resulting predictions of fault counts.
To evaluate a model, we sum the actual number of faults
that occur in the files at the top X% of the sorted list, and
determine the percentage of all faults in Release N that is
included in the top X%. This percentage is the yield of the
model, relative to the chosen value of X. While X can be
any suitable value, we have typically presented results with
X=20, as repeated studies on a wide variety of systems have
found that 80% or more of the faults are contained within
20% of the system’s files. For the systems we have studied,
we have frequently found even higher fault concentration.
For example, in each of the 18 releases of System 7, all the
faults were contained in 9.1% or fewer of the system’s files,
as shown in Table 1.
In our previous work, the top 20% of the files identified
by negative binomial regression models for 6 large systems
contained 83% to 95% of the faults, and 76% of the faults
in one other system.
Because evaluating a model at any particular value of X may
seem arbitrary, we defined the notion of fault-percentile
average, which essentially averages the top X% figure over
all values of X. A full de-
scription of the fault-percentile average is in Reference [13].
The mean fault-percentile averages for the systems studied
in that paper ranged from 88.1 to 92.8 when the predictions
were generated by negative binomial regression using the
Standard Model.
While we find the top X% metric to be the most useful
for interpreting effectiveness of the prediction models, we
prefer fault-percentile average for comparison of alternative
models. The top X% metric can be sensitive to whether a
few faulty files just make or miss the threshold. In contrast,
we have found the fault-percentile average to be more stable.
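Both evaluation measures can be computed directly from the predicted and actual per-file fault counts. The sketch below reflects our reading of the definitions (the precise percentile and tie-breaking conventions are given in [13]):

```python
import numpy as np

def top_x_yield(predicted, actual, x=20.0):
    """Percentage of all actual faults found in the top x% of files when
    files are sorted by predicted fault count, descending."""
    order = np.argsort(-np.asarray(predicted, dtype=float))
    actual = np.asarray(actual, dtype=float)[order]
    cutoff = int(round(len(actual) * x / 100.0))
    return 100.0 * actual[:cutoff].sum() / actual.sum()

def fault_percentile_average(predicted, actual):
    """Fault-percentile average: the top-X% yield averaged over all cutoffs
    X = 1/n, 2/n, ..., 1 (one cutoff per file)."""
    order = np.argsort(-np.asarray(predicted, dtype=float))
    actual = np.asarray(actual, dtype=float)[order]
    cumulative_yield = np.cumsum(actual) / actual.sum()
    return 100.0 * cumulative_yield.mean()
```

Under this formulation, a model that ranks every faulty file ahead of every fault-free file produces an FPA close to 100.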
4.2 Prediction Accuracy for Simple Prediction
Models
Figure 5 shows fault-percentile average (FPA) values, by
release, for four selected predictor variables, representing the
four main components of our prediction models. LOC is the
total length of the file, including comment lines. To reduce
skewness, the value used as a predictor variable in models is
log(KLOC). Language is a categorical variable representing
the file type. Age represents the number of prior releases
containing the file. Prior Changes is a count of the number
of times the file was changed in the prior release.
The two most effective predictors are easily log(KLOC)
and Prior Changes, which produced very similar FPAs ex-
cept at Release 13.
Table 5 displays mean FPAs across Releases 3 to 16 for
models that use the four predictor variables shown in Fig-
ure 5, as well as models based on a number of other measures
of churn in the prior release. We note that whether or not
a predictor variable was transformed does not matter for
this analysis because the FPA metric depends only on the
ranking of predictions, not on the actual values. Among
the various measures of changes in the prior release, Prior
Changes and the total number of lines changed performed
best on the FPA metric. A count of Prior Developers is not
far behind. Among types of changes, lines Added and Modi-
fied are most predictive; lines Deleted performs substantially
worse.
Prior Changed is a simple binary indicator for whether
any changes were applied to the file in the prior release.
It is basically a simplification of Prior Changes, lumping
together all files for which Prior Changes > 0. While clearly
some signal is lost, the model using Prior Changed performs
nearly as well as the Prior Changes model.
Figure 5: Fault percentile average, by release, for four simple predictor variables (LOC, Prior Changes, Language, Age)
Predictor Mean FPA
log(KLOC) 85.9
Prior Changes 84.8
Prior Adds+Deletes+Mods 84.7
Prior Developers 84.1
Prior Lines Added 83.5
Prior Lines Modified 82.9
Prior Changed 82.3
Prior Faults 75.9
Prior Lines Deleted 75.6
Language 69.4
Age 56.5
Table 5: Mean FPA for selected simple predictor
variables
Results for Prior Faults are substantially inferior to all
other prior change measures except for Lines Deleted.
4.3 Prediction Accuracy for Multivariate Prediction Models
In this section, we investigate the performance of various
churn measures in a multivariate model that controls for
other code and history variables. We begin with a simplifi-
cation of our Standard Model that includes only log(LOC),
programming language and file age (the number of previ-
ous releases that a file was in the system). We will call this
the Base Model; it is simply the Standard Model without any
churn-related predictor variables.
As mentioned earlier, this system includes seven different
file types: C, C++, Java, SQL, a proprietary variant of C,
a proprietary variant of C++, and .h files. Be-
cause analysis of previous systems indicated that fault rates
declined quickly over the first few releases of a file before
flattening out, we treat age (the number of prior releases) as
a categorical variable with four values: 0, 1, 2-4, and greater
than 4.
The first line of Table 6 shows that this Base Model pro-
duces a mean fault percentile average of 90.93 when averaged
across Releases 3 to 18. When we consider log(KLOC) alone,
the FPA is 85.9 when averaged across these releases. This is
shown in Table 5, so for this system, adding Language and
Age yields about a 5.0 percentage point increment in the
mean FPA over using size alone.
Subsequent lines of Table 6 display the additional im-
provement associated with a selected list of churn measures.
In order to have measures of the density of changes in a
file, we also evaluated relative churn, ratios of the counts of
Adds, Deletes, and Modifications to the LOC for the file.
Each ratio was truncated at one if the line count exceeded
the LOC. These fault densities are shown in column five of
Table 3 and the last column of Table 4.
Because the various counts and line count ratios can be
very skewed, we evaluated three versions of each change mea-
sure (except the binary-valued PriorChanged indicator): the
raw variable, the square root, and the fourth root. We only
show the most effective version of each churn measure in the
table.
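Concretely, the predictor variants evaluated here can be derived as follows (hypothetical data and column names; only the construction of the variants is illustrated):

```python
import numpy as np
import pandas as pd

# Hypothetical per-file churn counts from the prior release.
df = pd.DataFrame({"prior_adds": [21, 5, 210],
                   "prior_deletes": [0, 9, 40],
                   "prior_mods": [4, 1, 120],
                   "loc": [769, 513, 1495]})

churn = df["prior_adds"] + df["prior_deletes"] + df["prior_mods"]

# Relative churn: ratio of lines changed to LOC, truncated at one.
relative_churn = (churn / df["loc"]).clip(upper=1.0)

# Three versions of each (non-binary) measure: raw, square root, fourth root.
variants = pd.DataFrame({
    "churn": churn,
    "churn_sqrt": np.sqrt(churn),
    "churn_4th_root": churn ** 0.25,
    "relative_churn_4th_root": relative_churn ** 0.25,
})
```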
In Table 6 we observe that the two best churn measures
were Prior Changes and Prior Adds+Deletes+Mods—with
increments of about 2.4 percentage points each. This is con-
sistent with what we observed in Table 5.
For Prior Changes, the best form was the square root, al-
though the fourth root was close. In contrast, the fourth root
worked best for Prior Adds+Deletes+Mods, which is more
skewed. In general, the untransformed measures performed
worse than their transformed counterparts.
Next to each increment in the mean FPA, we show an
estimated standard error for that increment. These standard
errors are estimated based on release to release variability in
the increments associated with each change measure. The
resulting t-statistics and P-values are the results of paired
two-sample t tests. We note that all these measures produce
statistically significant improvements in mean FPA relative
to the Base Model without any churn measures.
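A sketch of this comparison, with placeholder FPA values standing in for the actual per-release results:

```python
from scipy import stats

# Per-release FPA values for the Base Model and for the Base Model plus one
# churn measure (placeholder numbers; one entry per evaluated release).
fpa_base = [90.1, 91.0, 90.5, 91.3, 90.8]
fpa_with_churn = [92.6, 93.2, 93.0, 93.5, 93.1]

increments = [a - b for a, b in zip(fpa_with_churn, fpa_base)]
mean_increase = sum(increments) / len(increments)

# Paired two-sample t test on the release-to-release increments.
t_stat, p_value = stats.ttest_rel(fpa_with_churn, fpa_base)
print(mean_increase, t_stat, p_value)
```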
Other churn measures ranked similarly to those in Ta-
ble 5. Increments in mean FPA for the relative churn ratios
were uniformly smaller than for the absolute churn measures
themselves (both after transformations).
Table 7 shows corresponding results starting with an ini-
tial model that includes Prior Changes. The addition of
Prior Changes to the base model dramatically changes the
marginal effectiveness of the other churn measures, of which
only three retain statistical significance at the 0.05 level.
Prior Cumulative Developers rises from the middle of the
pack to the top of the list. The effectiveness of this predic-
tor was previously reported for this system [10] as well as
for three other systems [12].
The other statistically significant variable at this point
was Prior-Prior Changes (i.e., changes at Release N-2),
the least effective measure in Table 6. What sets these two
variables apart from the others is that each one incorporates
information about churn going back beyond the immediately
prior release.
Predictor Variables                     Mean FPA   Increase vs. Base   Standard error   t-value   P-value
Base: log(KLOC), Language, Age          90.93      NA                  NA               NA        NA
(Prior Changes)^(1/2)                   93.35      2.42                0.24             9.90      0.0000
(Prior Adds+Deletes+Mods)^(1/4)         93.28      2.35                0.26             9.08      0.0000
(Prior Adds+Deletes+Mods/LOC)^(1/4)     93.19      2.25                0.28             8.15      0.0000
(Prior Developers)^(1/2)                93.17      2.24                0.23             9.68      0.0000
(Prior Lines Added)^(1/4)               93.15      2.21                0.26             8.41      0.0000
(Prior Lines Added/LOC)^(1/4)           93.03      2.10                0.29             7.25      0.0000
Prior Changed                           92.95      2.01                0.23             8.77      0.0000
(Prior Cum Developers)^(1/2)            92.93      2.00                0.17             11.65     0.0000
(Prior Lines Modified)^(1/4)            92.91      1.98                0.20             9.85      0.0000
(Prior Lines Modified/LOC)^(1/4)        92.81      1.87                0.20             9.48      0.0000
(Prior Faults)^(1/4)                    92.21      1.28                0.16             8.03      0.0000
(Prior New Developers)^(1/2)            92.06      1.13                0.25             4.56      0.0004
(Prior Lines Deleted)^(1/4)             92.06      1.13                0.16             6.93      0.0000
(Prior Lines Deleted/LOC)^(1/4)         92.00      1.06                0.18             5.88      0.0000
(Prior-Prior Changes)^(1/4)             91.96      1.03                0.14             7.21      0.0000
Table 6: FPA Improvements for selected churn measures, relative to model without any churn variables
Predictor Variables                     Mean FPA   Increase vs. Base   Standard error   t-value   P-value
Base: log(KLOC), Language, Age,
  (Prior Changes)^(1/2)                 93.35      NA                  NA               NA        NA
(Prior Cum Developers)^(1/2)            93.67      0.32                0.07             4.32      0.0006
(Prior-Prior Changes)^(1/4)             93.50      0.14                0.05             3.06      0.0080
(Prior Adds+Deletes+Mods)^(1/4)         93.38      0.03                0.03             1.08      0.2959
(Prior Lines Added)^(1/4)               93.37      0.02                0.03             0.84      0.4132
(Prior Faults)^(1/4)                    93.36      0.01                0.02             0.47      0.6435
(Prior Lines Modified)^(1/4)            93.35      -0.00               0.01             -0.21     0.8387
Prior New Developers                    93.35      -0.00               0.00             -1.45     0.1678
Prior Developers                        93.35      -0.00               0.00             -0.98     0.3440
(Prior Lines Deleted)^(1/4)             93.34      -0.01               0.01             -0.95     0.3549
(Prior Adds+Deletes+Mods/LOC)^(1/2)     93.34      -0.01               0.04             -0.27     0.7910
(Prior Lines Modified/LOC)^(1/2)        93.34      -0.01               0.01             -1.02     0.3237
(Prior Lines Added/LOC)^(1/2)           93.34      -0.01               0.05             -0.30     0.7669
(Prior Lines Deleted/LOC)^(1/4)         93.33      -0.02               0.01             -2.43     0.0279
Prior Changed                           93.30      -0.06               0.05             -1.17     0.2610
Table 7: FPA Improvements for selected churn measures, relative to model with Prior Changes
For each of the other churn measures, there
is apparently sufficient correlation with the Prior Changes
measure to blunt any statistically significant improvement.
Adding Prior Cumulative Developers to the initial model
produced a mean FPA of 93.67. Beyond that point, the
largest observed increment in the Mean FPA is only 0.03
percentage points (for Prior-Prior Changes), and no incre-
ment was statistically significant (not shown). For compari-
son, we note that our Standard Model achieves a mean FPA
value of 93.44 for the subject system.
5. THREATS TO VALIDITY
Our various measures of churn tend to be highly corre-
lated. Consequently, it is difficult to determine with much
confidence that a particular measure is more effective than
all others for fault prediction. Instead, we can mainly assess
the marginal value of certain measures in the presence of
others.
This study has been carried out on 18 releases of one large
system, and similar results may or may not be found for
other systems. Many aspects of software development and
maintenance can vary from one system to another, includ-
ing design processes, development approaches, programming
languages, and testing strategies. Any of these may intro-
duce significant differences in the frequency or locations of
faults, and render the prediction models less successful. In
particular, development processes that emphasize minimal
rewriting of code, such as cleanroom development, may have
a large impact on the number and type of code modifica-
tions. In such environments, the relations presented here
may not hold. We intend to repeat the investigation on sev-
eral of the other large systems that we have access to, to see
whether we observe similar patterns.
6. RELATED WORK
Fault prediction researchers have investigated a wide vari-
ety of predictor variables. Many authors have utilized code
mass, various complexity metrics, design information and
variations on the history variables that make up our basic
model. Work of this type has been reported by Graves et
al. [1], Khoshgoftaar et al. [2], Menzies et al. [3, 4], Mockus
and Weiss [5], Nagappan and Ball [6, 7], Ohlsson and Al-
berg [9], Ostrand et al. [11], and Zimmermann and Nagap-
pan [14], among others.
Only a few authors have incorporated churn into predic-
tion models. Mockus and Weiss [5] developed a model to
predict the probability that a given change to the software
system would result in a software failure. A given change
is defined as all the modifications to the system that result
from a single Maintenance Request, and may consist of mul-
tiple individual modifications to multiple code units. Their
full prediction model included among its predictor variables
the number of files changed, total LOC in the changed files,
lines added, lines deleted, and the total number of deltas
(check-in/check-out cycles) to a file. Stepwise regression
yielded a reduced model that included number of deltas and
lines added. This methodology has been incorporated into a
“risk assessment tool”, but the authors do not state whether
it uses the full or the reduced model.
Nagappan and Ball [6] constructed models to predict ex-
pected defect density (defects/KLOC) using either absolute
or relative churn measures for predictor variables.
              M1      M2      Faults/KLOC
M1            1.000   0.731   0.258
M2                    1.000   0.219
Faults/KLOC                   1.000
Table 8: Rank Correlations between Relative Churn Measures and Fault Density
The absolute measures include added + changed lines, deleted lines,
and total number of changes made (apparently equivalent
to the deltas of Mockus and Weiss). Relative measures nor-
malize the corresponding absolute measure in terms of total
LOC or total file count. The Nagappan-Ball measures are
computed for binaries that are the result of compilation of
many source files. Since files are the largest code unit in our
software, there are no measures that are comparable to their
measures that are based on the files within a binary. How-
ever, the following two relative measures of code churn are
based only on the lines of code in a file, and are comparable
to measures that can be computed for the software in the
provisioning system:
M1: (added lines + changed lines) / LOC
M2: (deleted lines) / LOC
Nagappan and Ball found rank correlations of 0.8 and
above between M1 or M2, and fault density. We observed
considerably lower correlations, although the relative values
of the correlations were similar. Table 8 shows rank corre-
lations between fault density for the provisioning system we
analyzed, and the measures M1, M2, as well as the inter-
measure correlations.
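The comparable measures and rank correlations can be computed along the following lines (illustrative column names; added plus modified lines stand in for the "changed lines" of M1):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-file data for the provisioning system.
files = pd.read_csv("changed_files.csv")

m1 = (files["lines_added"] + files["lines_modified"]) / files["loc"]
m2 = files["lines_deleted"] / files["loc"]
fault_density = files["faults"] / (files["loc"] / 1000.0)   # faults per KLOC

# Spearman rank correlations, as reported in Table 8.
rho_m1, _ = spearmanr(m1, fault_density)
rho_m2, _ = spearmanr(m2, fault_density)
rho_m1_m2, _ = spearmanr(m1, m2)
```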
Nagappan and Ball’s relative measure models produced
R^2 values of .8 or higher on the Windows code, significantly
better than the absolute measure models, leading to their
conclusion that relative churn can be an efficient and effec-
tive predictor of defect density. In contrast to the results
for Windows 2003, we found virtually no difference between
the effectiveness of absolute and relative churn measures.
In fact, the relative measures proved uniformly slightly less
effective than the absolute.
In [7], Nagappan and Ball examine models that combine
churn and system dependency information to predict post-
release failures of system binaries that are constructed from
a set of source files. They use 3 churn-related metrics: over-
all count of lines added, deleted, or modified, number of
files in the binary that were changed, and total number of
changes made to the files in the binary. Comparison of the
post-release failures predicted by these models against ac-
tual failures showed R^2 values better than .6, with P < .0005,
leading them to conclude that, at least for the subject sys-
tem (Windows Server 2003), churn and system dependency
information are reliable early indicators of failures.
Nagappan et al. [8] investigate the ability of change bursts,
sequences of consecutive changes to a software system, to
predict the system’s fault-prone components. They define
churn as the total of lines added, deleted, and modified dur-
ing changes made to a component, and use the total churn
over a component’s lifetime, the churn within bursts, and
the maximum churn occurring in any burst as three of the
predictor variables in their model. When measured for Win-
dows Vista, bursts turn out to be highly effective predictors
of fault-prone components.
It is worth noting that each of these studies assessed their
models by data splitting within a single release of the software.
The churn data is based on changes made between the release's
baseline version and the final version that produced the
production binaries.
7. CONCLUSIONS
Consistent with our earlier observations and results, we
found that most faults in the system analyzed here occurred
in files that had been changed in the prior release. The im-
portance of changes is so high that even a simple changed/not-
changed variable is capable of providing respectable predic-
tions of fault-prone files.
Confirming other research studies, we have seen that low-
level measurements of code changes can be very effective for
fault prediction. Counts of additions, deletions and changes
to a code base can be derived from the version control history
of a system, and be used as input to a fault prediction model.
Of the three specific types of file changes, lines added de-
livered the most accurate predictions, followed by lines mod-
ified. Lines deleted had substantially less predictive value
than the other two. The sum of all three counts, essentially
a count of the total lines changed in a file, proved the most
effective way to use the individual file churn data and was
as good as any other variable that we tried.
However, the line counts did not improve prediction accu-
racy for this system relative to our Standard Model, which
already included an alternative measure of churn in the prior
release. It appears that either our old measure, Prior Changes,
or a sum of Adds+Deletes+Mods can be equally effective.
8. REFERENCES
[1] T.L. Graves, A.F. Karr, J.S. Marron, and H. Siy.
Predicting Fault Incidence Using Software Change
History. IEEE Trans. on Software Engineering, Vol 26,
No. 7, July 2000, pp. 653-661.
[2] T.M. Khoshgoftaar, E.B. Allen, J. Deng. Using
Regression Trees to Classify Fault-Prone Software
Modules. IEEE Trans. on Reliability, Vol 51, No. 4,
Dec 2002, pp. 455-462.
[3] T. Menzies, J. Greenwald and A. Frank. Data Mining
Static Code Attributes to Learn Defect Predictors.
IEEE Trans. Software Eng., 33(1), pp. 2-13, 2007.
[4] T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic
and Y. Jiang. Implications of Ceiling Effects in Defect
Predictors. Proc. 4th Int. Workshop on Predictor
Models in Software Engineering (PROMISE08), pp.
47-54, 2008.
[5] A. Mockus and D.M. Weiss. Predicting Risk of
Software Changes. Bell Labs Technical Journal,
April-June 2000, pp. 169-180.
[6] N. Nagappan and T. Ball. Use of Relative Code Churn
Measures to Predict System Defect Density. Proc.
27th Int. Conference on Software Engineering
(ICSE05), 2005.
[7] N. Nagappan and T. Ball. Using Software
Dependencies and Churn Metrics to Predict Field
Failures: An Empirical Case Study. Proc. Empirical
Software Engineering and Measurement Conference
(ESEM), Madrid, Spain, 2007.
[8] N. Nagappan, A. Zeller, T. Zimmermann, K. Herzig,
and B. Murphy. Change Bursts as Defect Predictors.
Proc. 21st IEEE Int. Symposium on Software
Reliability Engineering (ISSRE2010).
[9] N. Ohlsson and H. Alberg. Predicting Fault-Prone
Software Modules in Telephone Switches. IEEE Trans.
on Software Engineering, Vol 22, No 12, December
1996, pp. 886-894.
[10] T.J. Ostrand, E.J. Weyuker, and R.M. Bell.
Programmer-based Fault Prediction. Proc. Int.
Conference on Predictive Models (PROMISE10),
Timisoara, Romania, September 2010.
[11] T.J. Ostrand, E.J. Weyuker, and R.M. Bell.
Predicting the Location and Number of Faults in
Large Software Systems. IEEE Trans. on Software
Engineering, Vol 31, No 4, April 2005.
[12] E.J. Weyuker, T.J. Ostrand, and R.M. Bell. Do Too
Many Cooks Spoil the Broth? Using the Number of
Developers to Enhance Defect Prediction Models.
Empirical Software Eng., Vol 13, No. 5, October 2008.
[13] E.J. Weyuker, T.J. Ostrand, and R.M. Bell.
Comparing the Effectiveness of Several Modeling
Methods for Fault Prediction. Empirical Software Eng.
Vol 15, No. 3, June 2010.
[14] T. Zimmermann and N. Nagappan. Predicting Defects
Using Network Analysis on Dependency Graphs. Proc.
13th Int. Conference on Software Engineering
(ICSE08), p.531-540, 2008.
... Changes made to source code during software development can be quantified in the form of code change metrics based on data collected throughout the software lifecycle across successive versions. Such metrics have proven useful for identifying defect-prone modules [7], [86], especially code churn measures [30], [81], change bursts [87], [11], and code deltas [30], [88], [29]. In most studies of software defect prediction, the code change metrics representing the module of a project are calculated using differences between the source code of two successive versions of the module [6], [89], [90], but it can also be calculated taking into account changes from the several continuous versions of the module [91], [86], [87], [8]. ...
... Such metrics have proven useful for identifying defect-prone modules [7], [86], especially code churn measures [30], [81], change bursts [87], [11], and code deltas [30], [88], [29]. In most studies of software defect prediction, the code change metrics representing the module of a project are calculated using differences between the source code of two successive versions of the module [6], [89], [90], but it can also be calculated taking into account changes from the several continuous versions of the module [91], [86], [87], [8]. ...
Full-text available
Article
To ensure the delivery of high quality software, it is necessary to ensure that all of its artifacts function properly, which is usually done by performing appropriate tests with limited resources. It is therefore desirable to identify defective artifacts so that they can be corrected before the testing process. So far, researchers have proposed various predictive models for this purpose. Such models are typically trained on data representing previous project versions of a software and then used to predict which of the software artifacts in the new version are likely to be defective. However, the data representing a software project usually consists of measurable properties of the project or its modules, and leaves out information about the timeline of the software development process. To fill this gap, we propose a new set of metrics, namely aggregated change metrics, which are created by aggregating the data of all changes made to the software between two versions, taking into account the chronological order of the changes. In experiments conducted on open source projects written in Java, we show that the stability and performance of commonly used classification models are improved by extending a feature set to include both measurable properties of the analyzed software and the aggregated change metrics.
... Some studies refer to these metrics as process metrics (e.g., refactoring, revision, and authors). Those metrics used to predict software fault proneness [9,10,11,12,13,14,15]. Change metrics were used in 8% of the research papers to predict software fault proneness [16]. ...
... In our work, we used one of the change metrics as a response variable. This work's change metric is derived from the code churn metric, representing the total number of lines added and deleted to a software class as defined in [12,13]. The metric is in a binary format, which means any added or deleted line to a file will be given one value (change prone) in the model, and a false value is given to a file with zero value of line added and deleted (not changeprone). ...
... Developers often commit unrelated changes, which incorrectly labels bug-free code as buggy (Herzig and Zeller 2013). Additionally, change proneness is often highly correlated to bug proneness (Moser et al. 2008;Bell et al. 2011;Bavota et al. 2015;Rahman and Roy 2017;Pascarella et al. 2020). Therefore, if a code metric is a good predictor of change proneness, it is likely to be a good predictor of bug proneness as well. ...
Full-text available
Article
Evaluating and predicting software maintenance effort using source code metrics is one of the holy grails of software engineering. Unfortunately, previous research has provided contradictory evidence in this regard. The debate is still open: as a community we are not certain about the relationship between code metrics and maintenance impact. In this study we investigate whether source code metrics can indeed establish maintenance effort at the previously unexplored method level granularity. We consider \(\sim \)730K Java methods originating from 47 popular open source projects. After considering seven popular method level code metrics and using change proneness as a maintenance effort indicator, we demonstrate why past studies contradict one another while examining the same data. We also show that evaluation context is king. Therefore, future research should step away from trying to devise generic maintenance models and should develop models that account for the maintenance indicator being used and the size of the methods being analyzed. Ultimately, we show that future source code metrics can be applied reliably and that these metrics can provide insight into maintenance effort when they are applied in a judiciously context-sensitive manner.
... The CK suite has 6 metrics which are Weighted method per class (WMC), Coupling between object classes (CBO), Lack of cohesion in methods (LCOM), Depth of inheritance tree (DIT), Number of children (NOC) and Response for a class (RFC). -Change or Process Metrics (Moser et al. 2008;Krishnan et al. 2011;Bell et al. 2011;Nagappan et al. 2010): Changes made during the software development process are collected throughout the software life cycle across its multiple releases. Some process metrics are code churn measures, change bursts, and code deltas. ...
Full-text available
Article
Understanding software evolution is essential for software development tasks, including debugging, maintenance, and testing. As a software system evolves, it grows in size and becomes more complex, hindering its comprehension. Researchers proposed several approaches for software quality analysis based on software metrics. One of the primary practices is predicting defects across software components in the codebase to improve agile product quality. While several software metrics exist, graph-based metrics have rarely been utilized in software quality. In this paper, we explore recent network comparison advancements to characterize software evolution and focus on aiding software metrics analysis and defect prediction. We support our approach with an automated tool named GraphEvoDef. Particularly, GraphEvoDef provides three major contributions: (1) detecting and visualizing significant events in software evolution using call graphs, (2) extracting metrics that are suitable for software comprehension, and (3) detecting and estimating the number of defects in a given code entity (e.g., class). One of our major findings is the usefulness of the Network Portrait Divergence metric, borrowed from the information theory domain, to aid the understanding of software evolution. To validate our approach, we examined 29 different open-source Java projects from GitHub and then demonstrated the proposed approach using 9 use cases with defect data from the the PROMISE dataset. We also trained and evaluated defect prediction models for both classification and regression tasks. Our proposed technique has an 18% reduction in the mean square error and a 48% increase in squared correlation coefficient over the state-of-the-art approaches in the defect prediction domain.
... The CISE works by first finding the classes affected by the changes (by performing a CIA), then performing estimation for change in size with an accuracy of 93.7%. Similarly, there are existing works on developing defect prediction solutions by measuring code changes in terms of SLOCs additions, deletions, or modifications of software releases [Bell et al. 2011;Hassan 2009]. Capiluppi and Izquierdo-Cortázar [2013] study the developer activity patterns in FLOSS projects by clustering developers around different time slots and days of a week. ...
Full-text available
Article
Software development effort estimation (SDEE) generally involves leveraging the information about the effort spent in developing similar software in the past. Most organizations do not have access to sufficient and reliable forms of such data from past projects. As such, the existing SDEE methods suffer from low usage and accuracy. We propose an efficient SDEE method for open source software, which provides accurate and fast effort estimates. The significant contributions of our article are (i) novel SDEE software metrics derived from developer activity information of various software repositories, (ii) an SDEE dataset comprising the SDEE metrics’ values derived from approximately 13,000 GitHub repositories from 150 different software categories, and (iii) an effort estimation tool based on SDEE metrics and a software description similarity model . Our software description similarity model is basically a machine learning model trained using the PVA on the software product descriptions of GitHub repositories. Given the software description of a newly envisioned software, our tool yields an effort estimate for developing it. Our method achieves the highest standardized accuracy score of 87.26% (with Cliff’s δ = 0.88 at 99.999% confidence level) and 42.7% with the automatically transformed linear baseline model. Our software artifacts are available at https://doi.org/10.5281/zenodo.5095723.
... Selecting the classifier to use represents a relevant problem for the configuration of bug prediction models [1]. In the past, most of the bug prediction models devised made use of Logistic Regression [5,19,61,62,70], Decision Trees [4,60,88], Radial Basis Function Network [65,100], Support Vector Machines [48,67,93], Decision Tables [46,57], Multi-Layer Perceptron [23], or Bayesian Network [76]. ...
Full-text available
Article
Bug prediction aims at locating defective source code components relying on machine learning models. Although some previous work showed that selecting the machine-learning classifier is crucial, the results are contrasting. Therefore, several ensemble techniques, i.e., approaches able to mix the output of different classifiers, have been proposed. In this paper, we present a benchmark study in which we compare the performance of seven ensemble techniques on 21 open-source software projects. Our aim is twofold. On the one hand, we aim at bridging the limitations of previous empirical studies that compared the accuracy of ensemble approaches in bug prediction. On the other hand, our goal is to verify how ensemble techniques perform in different settings such as cross- and local-project defect prediction. Our empirical experimentation results show that ensemble techniques are not a silver bullet for bug prediction. In within-project bug prediction, using ensemble techniques improves the prediction performance with respect to the best stand-alone classifier. We confirm that the models based on Validation and Voting achieve slightly better results. However, they are similar to those obtained by other ensemble techniques. Identifying buggy classes using external sources of information is still an open problem. In this setting, the use of ensemble techniques does not provide evident benefits with respect to stand-alone classifiers. The statistical analysis highlights that local and global models are mostly equivalent in terms of performance. Only one ensemble technique (i.e., ASCI) slightly exploits local learning to improve performance.
Full-text available
Preprint
Software development effort estimation (SDEE) generally involves leveraging the information about the effort spent in developing similar software in the past. Most organizations do not have access to sufficient and reliable forms of such data from past projects. As such, the existing SDEE methods suffer from low usage and accuracy. We propose an efficient SDEE method for open source software, which provides accurate and fast effort estimates. The significant contributions of our paper are i) Novel SDEE software metrics derived from developer activity information of various software repositories, ii) SDEE dataset comprising the SDEE metrics' values derived from $\approx13,000$ GitHub repositories from 150 different software categories, iii) an effort estimation tool based on SDEE metrics and a software description similarity model. Our software description similarity model is basically a machine learning model trained using the Paragraph Vectors algorithm on the software product descriptions of GitHub repositories. Given the software description of a newly-envisioned software, our tool yields an effort estimate for developing it. Our method achieves the highest Standard Accuracy score of 87.26% (with cliff's $\delta$=0.88 at 99.999% confidence level) and 42.7% with the Automatic Transformed Linear Baseline model. Our software artifacts are available at https://doi.org/10.5281/zenodo.5095723.
Chapter
Software engineering repositories have been attracted by researchers to mine useful information about the different quality attributes of the software. These repositories have been helpful to software professionals to efficiently allocate various resources in the life cycle of software development. Software fault prediction is a quality assurance activity. In fault prediction, software faults are predicted before actual software testing. As exhaustive software testing is impossible, the use of software fault prediction models can help the proper allocation of testing resources. Various machine learning techniques have been applied to create software fault prediction models. In this study, ensemble models are used for software fault prediction. Change metrics-based data are collected for an open-source android project from GIT repository and code-based metrics data are obtained from PROMISE data repository and datasets kc1, kc2, cm1, and pc1 are used for experimental purpose. Results showed that ensemble models performed better compared to machine learning and hybrid search-based algorithms. Bagging ensemble was found to be more effective in the prediction of faults in comparison to soft and hard voting.
Full-text available
Book
The pandemic "Corona" has put us this year before a difficult time. With care we have kept to the hygiene rules not to get an infection with the virus Covid-19. With mask we have got into coaches and trains, have made our purchases or on work worked. Home office was the catchword of these days. The universities and research facilities have maintained only a small emergency company and lectures were held as online lectures. From home we have tried to do our scientific works. In 1-to-1 tele-phone calls or phone conferences we have organized with our colleagues the work and have discussed important results of the research. Under it the efficiency suffers what is easy to understand. In the beginning of the pandemic fell the deadline of our conference. Insecurity spread. The figures of the infected persons increased rapidly. The virus spreads out in more and more countries and was further carried by continent to continent. Soon stood the whole world in the spell of Corona. A conference was the last to this in this situation most thought. In this situation appeared once again which high demands for a scientist are made. It belongs to the job of a scientist that he presents his scientific results in conferences and makes thus his results of a wide public immediately available. A scientist should have well organized his research, should be able to do his scientific tasks and duties in a flexible way, and should have financed his research with suitable financial means. Only those who were meeting these rules could successfully continue in their profes-sional research work. The best of the best of us are represented with their papers in this volume. They presented themselves personal or in online presentations in the conference. The ac-ceptance rate for the submitted paper of our conference was 33% percent for long paper as well as short papers. Because of many refusals because of missing financial means or other reasons the acceptance rate decreased to few percent. This shows once more the excellent quality of these scientists. Their papers are of most excellent quali-ty and expand the state-of-the-art in an excellent way. This proceeding volume pre-sent mostly theoretical work on known topics such as clustering, classification and prediction, graph mining but also application-oriented theoretical work for different purposes based deep learning and others. One invited talk was given by Prof. Dan A. Simovici on a theoretical subject on “Information theoretic approaches in data min-ing”. His paper is also included in the proceedings. The proceedings will be freely accessible as an OPEN-ACCESS Proceedings of a wide public so that, the new acquired knowledge on the different subjects is able to spread around quickly worldwide. You can find the proceedings at http://www.ibai-publishing.org/html/proceeding2020.php. In this time, flexibility was a must Because the situation in the USA was still diffi-cult, we have moved the conference to Amsterdam in the Netherlands. Here a variety of the participants was able to do outward journeys. The ones who could not travel, were online present. Extended versions of selected papers will appear in the international journal Trans-actions on Machine Learning and Data Mining (www.ibai-publishing.org/journal/mldm). We invite you to join us in 2021 in New York again to the next International Con-ference on Machine Learning and Data Mining MLDM. 
The conference will run again under the umbrella of the Worldcongress (www.worldcongressdsa.com) “The Frontiers in Intelligent Data and Signal Analysis, DSA2020” that combines under his roof the following three events: International Conferences Machine Learning and Data Mining MLDM, the Industrial Conference on Data Mining ICDM , and the In-ternational Conference on Mass Data Analysis of Signals and Images in Artificial Intelligence and Pattern Recognition with Application in with Applications in Medi-cine, r/g/b Biotechnology, Food Industries and Dietetics, Biometry and Security, Ag-riculture, Drug Discover, and System Biology MDA-AI&PR. We will give then the tutorials on Data Mining, Case-Based Reasoning, and Intel-ligent Image Analysis again (http://www.data-mining-forum.de/tutorials.php) again. The workshops running in connection with ICDM will also be given (http://www.data-mining-forum.de/workshops.php). We would warmly invite you with pleasure to contribute to this conference. Please come and join us. We are awaiting you.
Article
At the beginning of the testing phase and before the deployment phase of a project's development cycle, we need to predict files with a high chance of change. Software products are always prone to change due to several reasons, including fixing errors or improvements. In this work, we used the Eclipse (releases from 2.0 to 3.5) to investigate how prediction models can perform when learning from a release and predicting in the subsequent one, which contains new files that models have not seen. We compared the performance of these models with models that are trained and tested on the same release. We found no differences between predicting the same release or subsequent release on two pre Europa releases. Predicting change in newly created files helps improve maintenance planning for software project managers and reduce cost. It will also help to enhance the quality of software by improving the practices of developers. This study used the Adaptive Boost classifier with the decision tree J48 algorithm and combined it with the re‐sampling method. We find this to be better than using a meta classifier alone or combine the re‐sampling with the standard classification. We compared our results with related works and found that our results are outperforming.
Full-text available
Article
Advance knowledge of which files in the next release of a large software system are most likely to contain the largest numbers of faults can be a very valuable asset. To accomplish this, a negative binomial regression model has been developed and used to predict the expected number of faults in each file of the next release of a system. The predictions are based on the code of the file in the current release, and fault and modification history of the file from previous releases. The model has been applied to two large industrial systems, one with a history of 17 consecutive quarterly releases over 4 years, and the other with nine releases over 2 years. The predictions were quite accurate: For each release of the two systems, the 20 percent of the files with the highest predicted number of faults contained between 71 percent and 92 percent of the faults that were actually detected, with the overall average being 83 percent. The same model was also used to predict which files of the first system were likely to have the highest fault densities (faults per KLOC). In this case, the 20 percent of the files with the highest predicted fault densities contained an average of 62 percent of the system's detected faults. However, the identified files contained a much smaller percentage of the code mass than the files selected to maximize the numbers of faults. The model was also used to make predictions from a much smaller input set that only contained fault data from integration testing and later. The prediction was again very accurate, identifying files that contained from 71 percent to 93 percent of the faults, with the average being 84 percent. Finally, a highly simplified version of the predictor selected files containing, on average, 73 percent and 74 percent of the faults for the two systems.
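The abstract does not give the model's exact form, but the general recipe, negative binomial regression on prior-release predictors followed by ranking files and counting the faults captured by the top 20 percent, can be sketched in Python with statsmodels. The column names (faults, loc, prior_faults, prior_changes) are illustrative, not the study's actual variables, and the dispersion parameter is left at the statsmodels default.

    # Hedged sketch: negative binomial regression of per-file fault counts,
    # evaluated by the fault share captured in the top 20% of ranked files.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    def top20_fault_share(train: pd.DataFrame, test: pd.DataFrame) -> float:
        model = smf.glm("faults ~ np.log(loc) + prior_faults + prior_changes",
                        data=train,
                        family=sm.families.NegativeBinomial()).fit()
        test = test.assign(predicted=model.predict(test))
        top = test.nlargest(int(np.ceil(0.2 * len(test))), "predicted")
        return top["faults"].sum() / test["faults"].sum()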
Full-text available
Conference Paper
Context: There are many methods that input static code features and output a predictor for faulty code modules. These data mining methods have hit a "performance ceiling"; i.e., some inherent upper bound on the amount of information offered by, say, static code features when identifying modules which contain faults. Objective: We seek an explanation for this ceiling effect. Perhaps static code features have "limited information content"; i.e. their information can be quickly and completely discovered by even simple learners. Method: An initial literature review documents the ceiling effect in other work. Next, using three sub-sampling techniques (under-, over-, and micro-sampling), we look for the lower useful bound on the number of training instances. Results: Using micro-sampling, we find that as few as 50 instances yield as much information as larger training sets. Conclusions: We have found much evidence for the limited information hypothesis. Further progress in learning defect predictors may not come from better algorithms. Rather, we need to be improving the information content of the training data, perhaps with case-based reasoning methods.
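A minimal sketch of a micro-sampling experiment in this spirit, assuming a feature matrix X with binary fault labels y and a separate test set: train a simple learner on progressively larger balanced subsamples and check whether performance stops improving at very small sizes. The learner, sample sizes, and metric are illustrative choices, not the study's.

    # Hedged sketch: performance as a function of tiny balanced training samples.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import recall_score

    def micro_sampling_curve(X, y, X_test, y_test,
                             sizes=(25, 50, 100, 200, 400), repeats=20, seed=0):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(seed)
        results = {}
        for n in sizes:
            scores = []
            for _ in range(repeats):
                # draw n/2 defective and n/2 non-defective training instances
                pos = rng.choice(np.flatnonzero(y == 1), n // 2, replace=True)
                neg = rng.choice(np.flatnonzero(y == 0), n // 2, replace=True)
                idx = np.concatenate([pos, neg])
                clf = GaussianNB().fit(X[idx], y[idx])
                scores.append(recall_score(y_test, clf.predict(X_test)))
            results[n] = float(np.mean(scores))
        return results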
Full-text available
Conference Paper
Background: Previous research has provided evidence that a combination of static code metrics and software history metrics can be used to predict with surprising success which files in the next release of a large system will have the largest numbers of defects. In contrast, very little research exists to indicate whether information about individual developers can profitably be used to improve predictions. Aims: We investigate whether files in a large system that are modified by an individual developer consistently contain either more or fewer faults than the average of all files in the system. The goal of the investigation is to determine whether information about which particular developer modified a file is able to improve defect predictions. We also continue an earlier study to evaluate the use of counts of the number of developers who modified a file as predictors of the file's future faultiness. Method: We analyzed change reports filed by 107 programmers for 16 releases of a system with 1,400,000 LOC and 3100 files. A "bug ratio" was defined for programmers, measuring the proportion of faulty files in release R out of all files modified by the programmer in release R-1. The study compares the bug ratios of individual programmers to the average bug ratio, and also assesses the consistency of the bug ratio across releases for individual programmers. Results: Bug ratios varied widely among all the programmers, as well as for many individual programmers across all the releases that they participated in. We found a statistically significant correlation between the bug ratios for programmers for the first half of changed files versus the ratios for the second half, indicating a measurable degree of persistence in the bug ratio. However, when the computation was repeated with the bug ratio controlled not only by release, but also by file size, the correlation disappeared. In addition to the bug ratios, we confirmed that counts of the cumulative number of different developers changing a file over its lifetime can help to improve predictions, while other developer counts are not helpful. Conclusions: The results from this preliminary study indicate that adding information to a model about which particular developer modified a file is not likely to improve defect predictions. The study is limited to a single large system, and its results may not hold more widely. The bug ratio is only one way of measuring the "fault-proneness" of an individual programmer's coding, and we intend to investigate other ways of evaluating bug introduction by individuals.
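A small sketch of how such a bug ratio could be computed from change and fault records, assuming a table of (developer, file, release) modifications and a table of (file, release) faulty files. These table layouts are hypothetical, and the release arithmetic assumes numeric release identifiers.

    # Hedged sketch: per-developer bug ratio, i.e. the proportion of files a
    # developer modified in release R-1 that were faulty in release R.
    import pandas as pd

    def bug_ratios(changes: pd.DataFrame, faulty_files: pd.DataFrame) -> pd.Series:
        """changes: columns (developer, file, release), one row per modification.
        faulty_files: columns (file, release), files with at least one fault."""
        faulty = set(zip(faulty_files["file"], faulty_files["release"]))
        touched = changes.drop_duplicates(["developer", "file", "release"]).copy()
        # a file modified in release R-1 counts as "bad" if it is faulty in release R
        touched["bad"] = [(f, r + 1) in faulty
                          for f, r in zip(touched["file"], touched["release"])]
        return touched.groupby("developer")["bad"].mean()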
Full-text available
Article
Fault prediction by negative binomial regression models is shown to be effective for four large production software systems from industry. A model developed originally with data from systems with regularly scheduled releases was successfully adapted to a system without releases to identify 20% of that system’s files that contained 75% of the faults. A model with a pre-specified set of variables derived from earlier research was applied to three additional systems, and proved capable of identifying averages of 81, 94 and 76% of the faults in those systems. A primary focus of this paper is to investigate the impact on predictive accuracy of using data about the number of developers who access individual code units. For each system, including the cumulative number of developers who had previously modified a file yielded no more than a modest improvement in predictive accuracy. We conclude that while many factors can “spoil the broth” (lead to the release of software with too many defects), the number of developers is not a major influence.
Full-text available
Article
We compare the effectiveness of four modeling methods—negative binomial regression, recursive partitioning, random forests and Bayesian additive regression trees—for predicting the files likely to contain the most faults for 28 to 35 releases of three large industrial software systems. Predictor variables included lines of code, file age, faults in the previous release, changes in the previous two releases, and programming language. To compare the effectiveness of the different models, we use two metrics—the percent of faults contained in the top 20% of files identified by the model, and a new, more general metric, the fault-percentile-average. The negative binomial regression and random forests models performed significantly better than recursive partitioning and Bayesian additive regression trees, as assessed by either of the metrics. For each of the three systems, the negative binomial and random forests models identified 20% of the files in each release that contained an average of 76% to 94% of the faults.
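The two evaluation metrics can be sketched compactly: the code below computes the share of faults in the top 20 percent of ranked files and, as one reasonable reading of the fault-percentile-average, the captured-fault share averaged over every possible cutoff. This is an illustrative implementation, not the papers' exact code.

    # Hedged sketch: ranking-based evaluation of per-file fault predictions.
    import numpy as np

    def ranking_metrics(predicted, actual):
        predicted = np.asarray(predicted)
        actual = np.asarray(actual, dtype=float)
        order = np.argsort(-predicted)              # best-ranked files first
        faults = actual[order]
        cum_share = np.cumsum(faults) / faults.sum()
        k20 = max(1, int(np.ceil(0.2 * len(faults))))
        return {"top20": cum_share[k20 - 1], "fpa": cum_share.mean()}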
Full-text available
Article
The value of using static code attributes to learn defect predictors has been widely debated. Prior work has explored issues like the merits of "McCabes versus Halstead versus lines of code counts” for generating defect predictors. We show here that such debates are irrelevant since how the attributes are used to build predictors is much more important than which particular attributes are used. Also, contrary to prior pessimism, we show that such defect predictors are demonstrably useful and, on the data studied here, yield predictors with a mean probability of detection of 71 percent and mean false alarms rates of 25 percent. These predictors would be useful for prioritizing a resource-bound exploration of code that has yet to be inspected.
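The two reported quantities, probability of detection and false alarm rate, follow directly from a confusion matrix; a minimal sketch, assuming binary ground-truth labels and binary predictions:

    # Hedged sketch: probability of detection (recall on defective modules) and
    # probability of false alarm (fraction of defect-free modules flagged).
    import numpy as np

    def pd_pf(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        return tp / (tp + fn), fp / (fp + tn)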
Full-text available
Conference Paper
In software development, every change induces a risk. What happens if code changes again and again in some period of time? In an empirical study on Windows Vista, we found that the features of such change bursts have the highest predictive power for defect-prone components. With precision and recall values well above 90%, change bursts significantly improve upon earlier predictors such as complexity metrics, code churn, or organizational structure. As they only rely on version history and a controlled change process, change bursts are straightforward to detect and deploy.
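The abstract does not spell out the burst definition, but a simple reading, runs of consecutive changes separated by at most a fixed gap, can be sketched as follows. The gap and minimum-size thresholds and the derived features are illustrative assumptions, not the study's exact settings.

    # Hedged sketch: extract simple change-burst features from one file's
    # sorted change timestamps.
    def change_bursts(change_times, gap=1.0, min_size=3):
        """change_times: sorted timestamps of changes to one file."""
        bursts = []
        current = [change_times[0]] if change_times else []
        for prev, curr in zip(change_times, change_times[1:]):
            if curr - prev <= gap:
                current.append(curr)
            else:
                if len(current) >= min_size:
                    bursts.append(current)
                current = [curr]
        if len(current) >= min_size:
            bursts.append(current)
        # simple burst features: count, largest burst, total changes in bursts
        return {"num_bursts": len(bursts),
                "max_burst": max((len(b) for b in bursts), default=0),
                "changes_in_bursts": sum(len(b) for b in bursts)}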
Conference Paper
In software development, resources for quality assurance are limited by time and by cost. In order to allocate resources effectively, managers need to rely on their experience backed by code complexity metrics. But often dependencies exist between various pieces of code over which managers may have little knowledge. These dependencies can be construed as a low-level graph of the entire system. In this paper, we propose to use network analysis on these dependency graphs. This allows managers to identify central program units that are more likely to face defects. In our evaluation on Windows Server 2003, we found that the recall for models built from network measures is 10 percentage points higher than for models built from complexity metrics. In addition, network measures could identify 60% of the binaries that the Windows developers considered as critical, twice as many as identified by complexity metrics.
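A minimal sketch of the network-analysis idea, assuming a list of dependency edges between program units and using betweenness centrality as one example measure (the study used a broader set of network measures):

    # Hedged sketch: rank program units by a centrality measure on the
    # dependency graph.
    import networkx as nx

    def rank_by_centrality(dependency_edges, top_n=10):
        """dependency_edges: iterable of (caller, callee) pairs between units."""
        graph = nx.DiGraph()
        graph.add_edges_from(dependency_edges)
        centrality = nx.betweenness_centrality(graph)
        ranked = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_n]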
Article
Reducing the number of software failures is one of the most challenging problems of software production. We assume that software development proceeds as a series of changes and model the probability that a change to software will cause a failure. We use predictors based on the properties of a change itself. Such predictors include size in lines of code added, deleted, and unmodified; diffusion of the change and its component subchanges, as reflected in the number of files, modules, and subsystems touched, or changed; several measures of developer experience; and the type of change and its subchanges (fault fixes or new code). The model is built on historic information and is used to predict the risk of new changes. In this paper we apply the model to 5ESS® software updates and find that change diffusion and developer experience are essential to predicting failures. The predictive model is implemented as a Web-based tool to allow timely prediction of change quality. The ability to predict the quality of change enables us to make appropriate decisions regarding inspection, testing, and delivery. Historic information on software changes is recorded in many commercial software projects, suggesting that our results can be easily and widely applied in practice.
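A hedged sketch of a model in this spirit: a logistic regression relating the probability that a change induces a failure to properties of the change itself. The column names (failed, lines_added, files_touched, developer_experience, is_fix) are illustrative placeholders, and the original work's exact model form may differ.

    # Hedged sketch: per-change failure-risk model from change properties.
    import statsmodels.formula.api as smf

    def fit_change_risk_model(changes_df):
        model = smf.logit(
            "failed ~ lines_added + files_touched + developer_experience + is_fix",
            data=changes_df).fit()
        return model

    # new changes can then be scored with model.predict(new_changes_df)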
Conference Paper
Commercial software development is a complex task that requires a thorough understanding of the architecture of the software system. We analyze the Windows Server 2003 operating system in order to assess the relationship between its software dependencies, churn measures and post-release failures. Our analysis indicates the ability of software dependencies and churn measures to be efficient predictors of post-release failures. Further, we investigate the relationship between the software dependencies and churn measures and their ability to assess failure-proneness probabilities at statistically significant levels.