Does Measuring Code Change Improve Fault Prediction?
Robert M. Bell, Thomas J. Ostrand, Elaine J. Weyuker
AT&T Labs - Research
180 Park Avenue
Florham Park, NJ 07932
(rbell,ostrand,weyuker)@research.att.com
ABSTRACT
Background: Several studies have examined code churn as a variable for predicting faults in large software systems. High churn is usually associated with more faults appearing in code that has been changed frequently.
Aims: We investigate the extent to which faults can be predicted by the degree of churn alone, whether other code characteristics occur together with churn, and which combinations of churn and other characteristics provide the best predictions. We also investigate different types of churn, including both additions to and deletions from code, as well as overall amount of change to code.
Method: We have mined the version control database of a large software system to collect churn and other software measures from 18 successive releases of the system. We examine the frequency of faults plotted against various code characteristics, and evaluate a diverse set of prediction models based on many different combinations of independent variables, including both absolute and relative churn.
Results: Churn measures based on counts of lines added, deleted, and modified are very effective for fault prediction. Individually, counts of adds and modifications outperform counts of deletes, while the sum of all three counts was most effective. However, these counts did not improve prediction accuracy relative to a model that included a simple count of the number of times that a file had been changed in the prior release.
Conclusions: Including a measure of change in the prior release is an essential component of our fault prediction method. Various measures seem to work roughly equivalently.
Categories and Subject Descriptors: D.2.5 [Software
Engineering]: Testing and Debugging – Debugging aids
General Terms: Experimentation
Keywords: software faults, fault prediction, code churn,
fault-percentile average, empirical study
1. INTRODUCTION
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PROMISE ’11, September 20–21, 2011, Banff, Canada
Copyright 2011 ACM 978-1-4503-0709-3/11/09 ...$10.00.

Several researchers have noted that the amount of code change in some part of the code of a software system can be an effective indicator of the fault-proneness of that part, either in the modified release or in succeeding releases. This change is often referred to as churn, and we will use that term to refer to any sort of change in the remainder of the paper. For example, we have found that the number of times that a file has been changed in the previous two releases is a powerful indicator of fault-proneness in the negative binomial prediction models that we have investigated [11, 12]. In the present paper, we examine finer-grained measurements of churn, including the counts of numbers of lines within a file that have been added, deleted, or modified in previous releases.

The studies described in this paper are performed on 18 quarterly releases spanning approximately five years of a provisioning system. In previous work, we have referred to this system as System 7 [10]. It includes files written in six different programming languages, namely C, C++, Java, SQL, a proprietary variant of C, and a proprietary variant of C++, as well as .h files. About 60% of the files were written in Java, with the remaining 40% roughly equally distributed among the other six file types. Table 1 provides basic data on the size of the system's releases, the occurrence of faults, and the percent of each release that was changed. The first 16 rows of the table are updated from the data that appeared in [10]. The present table includes one file type (.h files) that was not included in the previous paper and two additional releases, as well as updates to the fault and change counts due to the ongoing development and maintenance of the system.

2. MEASURING CHANGE

Changes to a file can be measured in many ways in addition to simply determining whether or not the file was changed during a release.
In this paper we consider ways of quantifying the degree of change, in order to see whether that added information is helpful when predicting fault-proneness. In our previous work [11, 12, 13], we measured the number of changes within a release, that is, the number of times that the file had gone through a check-out/check-in cycle. The numbers of changes in Releases N-1 and N-2 are two attributes that are part of our Standard Model to predict fault-proneness of files in Release N. We refer to these in the sequel as Prior Changes and Prior-Prior Changes respectively, and note here that all model variables, with the exception of size, relate to prior releases and are named accordingly. The other attributes in the Standard Model are the count of faults in Release N-1, the size of the files in 1000's of lines of code (KLOC), the age of the file in terms of the number of previous releases the file was part of the system, and the file type.

| Release | Files | KLOC | % New files | % Changed files | % Faulty files | Faults | Faults/file | Faults/KLOC |
|---|---|---|---|---|---|---|---|---|
| 1 | 2470 | 885.3 | 100.0 | .0 | 5.6 | 284 | .11 | .32 |
| 2 | 2487 | 912.3 | .7 | 10.2 | 3.7 | 168 | .07 | .18 |
| 3 | 2507 | 924.8 | .8 | 7.5 | 5.1 | 404 | .16 | .44 |
| 4 | 2569 | 943.8 | 2.4 | 8.4 | 4.9 | 243 | .09 | .26 |
| 5 | 2610 | 966.0 | 1.6 | 15.1 | 4.4 | 192 | .07 | .20 |
| 6 | 2707 | 989.3 | 3.6 | 10.5 | 5.1 | 288 | .11 | .29 |
| 7 | 2872 | 1034.8 | 5.8 | 8.1 | 6.8 | 580 | .20 | .56 |
| 8 | 3005 | 1082.0 | 4.4 | 14.6 | 5.7 | 355 | .12 | .33 |
| 9 | 3159 | 1122.6 | 4.9 | 13.2 | 7.0 | 453 | .14 | .40 |
| 10 | 3506 | 1229.9 | 9.9 | 18.9 | 9.1 | 689 | .20 | .56 |
| 11 | 3552 | 1288.2 | 1.3 | 22.9 | 6.7 | 576 | .16 | .45 |
| 12 | 3517 | 1293.8 | .1 | 11.2 | 2.5 | 161 | .05 | .12 |
| 13 | 3713 | 1364.0 | 5.3 | 2.9 | 6.1 | 715 | .19 | .52 |
| 14 | 3769 | 1408.1 | 1.5 | 17.9 | 4.8 | 454 | .12 | .32 |
| 15 | 3797 | 1441.9 | .7 | 10.5 | 4.5 | 389 | .10 | .27 |
| 16 | 3829 | 1466.4 | .8 | 8.5 | 4.0 | 352 | .09 | .24 |
| 17 | 3891 | 1489.0 | 1.6 | 6.8 | 3.4 | 275 | .07 | .18 |
| 18 | 3920 | 1520.1 | .7 | 8.2 | 4.0 | 522 | .13 | .34 |
| Average | | | 6.8 | 11.0 | 5.2 | 7100 (total) | .12 | .33 |

Table 1: Size, Faults, and Changes for System 7 (Provisioning System)
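Counts like these can be mined directly from a version control system. As an illustrative sketch (assuming a git repository for concreteness, although the system studied here used a different version control system; `git log --numstat --pretty=format:` is a standard git invocation that emits one "added&lt;TAB&gt;deleted&lt;TAB&gt;path" line per file per commit), per-file added/deleted totals and change counts can be accumulated as follows:

```python
from collections import defaultdict

def parse_numstat(text):
    """Accumulate per-file churn from `git log --numstat --pretty=format:`
    output. Returns {path: (lines_added, lines_deleted, times_changed)}."""
    totals = defaultdict(lambda: [0, 0, 0])
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3 or parts[0] == "-":   # skip blank lines and binaries
            continue
        a, d, path = parts
        totals[path][0] += int(a)
        totals[path][1] += int(d)
        totals[path][2] += 1                     # one check-in touching this file
    return {p: tuple(t) for p, t in totals.items()}

# two commits touching src/a.c, one touching src/b.c, one binary file
sample = "3\t1\tsrc/a.c\n\n0\t5\tsrc/b.c\n2\t2\tsrc/a.c\n-\t-\tlogo.png\n"
print(parse_numstat(sample))   # {'src/a.c': (5, 3, 2), 'src/b.c': (0, 5, 1)}
```

The third component of each tuple corresponds to the Prior Changes count used by the Standard Model; the first two correspond to the line-level churn measures examined below.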
In this paper, we use finer-grained change information to measure the volume of the editing changes that occur each time a user checks out the latest version of the file, edits it, and then checks in the edited version. Specifically, we consider three types of changes that can occur within an editing session: lines added to the file, lines deleted, and lines modified. The definitions we use are as follows.

Changes are always measured for contiguous sets of lines in the file. If a user edits two groups of lines that are separated by at least one unchanged line, that is considered two separate actions, and the following definitions must be applied separately to both of the actions.

If an editing action involves only adding lines or only deleting lines, the result is simply the count of the number of lines added or deleted. If an action involves modifying individual lines without inserting or removing additional lines, the result is again simply the number of lines that are modified. If an action involves replacing k contiguous lines with k+i lines (k and i both positive integers), then the result is k lines modified and i lines added. If an action involves replacing k contiguous lines with k-i lines (k and i both positive integers, and k > i), then the result is k-i lines modified and i lines deleted.

It is possible to have alternative views of the changes that are made in going from the checked-out version to the checked-in version. In particular, one could define editing changes only in terms of adds and deletes. For example, the replacement of k lines with k+i lines could be defined as deleting k and adding k+i.
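The counting rules above can be captured in a few lines. A minimal sketch (the function name and the (k, m) parameterization, replacing k original lines with m new lines, are ours):

```python
def classify_action(k, m):
    """Churn counts for one contiguous editing action that replaces
    k original lines with m new lines (k = 0 is a pure insertion,
    m = 0 a pure deletion). Returns (added, deleted, modified)."""
    if m >= k:                  # growth: all k originals count as modified
        return (m - k, 0, k)
    return (0, k - m, m)        # shrinkage: m modified, k - m deleted

# e.g. replacing a block of 10 lines with a single line:
# 1 line modified, 9 lines deleted
print(classify_action(10, 1))   # (0, 9, 1)
```

The same function covers all four cases in the definitions: pure adds (k = 0), pure deletes (m = 0), in-place modifications (m = k), and the two replacement cases.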
However, we believe it is more meaningful to distinguish this presumably single editing action from the situation where a developer performs two potentially unrelated edits, deleting k lines and adding k+i in widely-separated sections of code.

A single editing session can involve multiple adds, deletes, and modifications. For example, suppose a user checks out a file with 100 lines, and makes the following changes before checking the file back in. The user changes the expression x+1 in line 7 to x+2, replaces lines 21-30 with a single line, replaces the original lines 41-45 with a new set of 8 lines, and inserts 12 new lines just in front of the last line of the file, resulting in a new file with 106 lines. Table 2 shows the way these changes are counted. The numbers in square brackets represent the line numbers in the original checked-out file.

| Change | Lines Added | Lines Deleted | Lines Modified |
|---|---|---|---|
| [7] x+1 → [7] x+2 | 0 | 0 | 1 |
| [21:30] → 1 line | 0 | 9 | 1 |
| [41:45] → 8 lines | 3 | 0 | 5 |
| [99:100] → [99] 12 lines [100] | 12 | 0 | 0 |
| Total | 15 | 9 | 7 |

Table 2: Examples of changes to a file

A file may undergo multiple editing sessions in a single release. The churn counts for the file in the release are the sums of the respective adds, deletes, and modifications done in all those sessions.

3. PRELIMINARIES

We start by looking at simple properties of file size and fault occurrence, according to the change status of the files. For each release, we categorize each file's status as being one of: new to the system, changed from the prior release, or unchanged. Figure 1 shows the mean file size, by release, for each change status. At Release 1, we treat all files as new. Consequently, the majority of instances of new files occur at Release 1.
Figure 1: Sizes of files, by change status, Releases 1-18 (mean LOC: New=303, Unchanged=279, Changed=1081)

Figure 2: Faults per file, by change status, Releases 1-18 (mean faults per file: New=0.24, Unchanged=0.02, Changed=0.80)

Changed files are notably larger than the other types, implying that large files are more likely to be changed. Note that new files introduced after the first two releases were much smaller on average than those created in Releases 1 and 2.

Figure 2 shows the average number of faults per file, by release and change status. Unchanged files always have extremely low fault counts, never exceeding 0.06 faults per file. Fault rates are much higher for the other two types of file, and changed files have higher fault rates than new files in every release from 2 through 18 except Release 17. This is consistent with the data from our previous work, where we noted that files that have been changed in the previous release are much more likely to have faults than unchanged files. Because the average size of changed files is so much larger than that of new files, the fault rate per KLOC is usually lower for changed files, as seen in Figure 3.

Table 3 summarizes the results in the last three figures by aggregating over release. In addition, the last two columns of the table show the aggregate counts and percentages of faults by change status. Despite being only 11 percent of files overall (Table 1), changed files contain 72 percent of all faults and 75 percent of faults after Release 1.

Going beyond merely the change status of files, Table 4 shows fault rates for changed files, broken down by the types of changes that occurred.
As the number of types of changes grows from one to all three, both the average number of lines changed and the subsequent fault rates increase. Performing more than one type of change is more fault-prone than doing a single one, and performing all three types during a single release results in more than 4 times as many faults as doing a single type of change. When only one type of change occurs in a file, Adds are the most fault-prone, Modifications second, and Deletes yield the fewest faults.

Figure 3: Faults per KLOC, by change status, Releases 1-18 (mean faults per KLOC: New=0.79, Unchanged=0.08, Changed=0.74)

Figure 4: Faults per size of change (faults per file vs. Lines Added + Lines Deleted + Lines Modified)

Figure 4 shows faults per file, for files changed in the prior release, as a function of the total number of lines changed. Files are sorted into ten bins based on the number of lines changed, with each plotting symbol representing about 600 to 700 points. Not surprisingly, the number of faults occurring in changed files rises with the total number of lines changed. Obviously, the more lines that are changed, the greater the opportunity for mistakes to occur.

4. PREDICTION MODELS AND PREDICTION RESULTS

To determine the effectiveness of different variables for fault prediction, we constructed models containing various combinations of the original variables from the Standard Model, together with subsets of the churn variables. Models to predict fault counts for Release N are constructed using negative binomial regression, with values from Releases 1 through N-1 as training data. See [11] for details on the negative binomial regression model for fault prediction as well as for example results.
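The negative binomial fit itself is described in [11]. As a simplified stand-in for readers who want to experiment, a Poisson regression (negative binomial without the overdispersion parameter) with a single predictor can be fit by Newton's method in plain Python; this is a sketch of the general technique, not the paper's actual model:

```python
import math

def fit_poisson(x, y, iters=30):
    """Fit log E[y] = a + b*x by Newton's method on the Poisson
    log-likelihood, solving the 2x2 system explicitly at each step."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = math.exp(a + b * xi)    # current fitted mean
            g0 += yi - mu                # gradient w.r.t. a
            g1 += (yi - mu) * xi         # gradient w.r.t. b
            h00 += mu                    # (negated) Hessian entries
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step: H^-1 g
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# toy data that follows an exact log-linear trend: the mean doubles per unit x,
# so the fit recovers a = 0, b = ln 2
a, b = fit_poisson([0, 1, 2], [1, 2, 4])
```

In real use, a churn predictor such as Prior Changes would take the place of x, all Standard Model covariates would be included, and a negative binomial likelihood would replace the Poisson.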
Models were constructed, applied, and evaluated following our standard procedure, as follows. The dependent output variable is the number of faults for each file in a release. The non-churn independent variables are chosen from a set that includes code attributes, which are obtainable from the release to be predicted, and history attributes, which are obtainable from previous releases. The code attributes include LOC, file status, file age, and file type. History attributes include prior changes and prior faults. All models in this paper include release number, implemented as a series of dummy variables for Releases 1 through N-2.

| File status | Number of files | Average LOC | Faults per file | Faults per KLOC | Number of faults | Percent of faults |
|---|---|---|---|---|---|---|
| New (Release 1) | 2470 | 358 | .11 | .32 | 284 | 4.0 |
| New (Releases 2-18) | 1493 | 210 | .44 | 2.11 | 661 | 9.3 |
| Changed | 6389 | 1081 | .80 | .74 | 5121 | 72.1 |
| Unchanged | 47528 | 279 | .02 | .08 | 1034 | 14.6 |

Table 3: Fault Rates, by File Status

| File status | Number of files | Percentage of files | Average LOC | Average lines changed | Faults per file | Faults per KLOC |
|---|---|---|---|---|---|---|
| Adds only | 597 | 9.4 | 769 | 21 | .30 | .39 |
| Deletes only | 296 | 4.6 | 513 | 5 | .04 | .07 |
| Modifications only | 683 | 10.7 | 605 | 4 | .19 | .32 |
| Adds & Deletes | 126 | 2.0 | 702 | 21 | .50 | .71 |
| Adds & Mods | 1894 | 29.6 | 940 | 37 | .55 | .59 |
| Deletes & Mods | 168 | 2.6 | 521 | 23 | .36 | .69 |
| Adds, Deletes, & Mods | 2625 | 41.1 | 1495 | 210 | 1.38 | .92 |

Table 4: Fault Rates for Changed Files, by Type(s) of Change

4.1 Evaluation

After a model has been constructed, it is applied to generate predictions for Release N using code attributes of Release N and history attributes of the previous two releases (N-1 and N-2). The files of Release N are sorted in descending order of the resulting predictions of fault counts. To evaluate a model, we sum the actual number of faults that occur in the files at the top X% of the sorted list, and determine the percentage of all faults in Release N that is included in the top X%. This percentage is the yield of the model, relative to the chosen value of X.
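The yield computation is straightforward; a sketch (the rounding of the file cutoff is our choice, since the text does not specify it):

```python
import math

def top_x_yield(predicted, faults, x=20):
    """Percent of a release's faults contained in the top x% of files,
    with files ranked by predicted fault count (descending)."""
    order = sorted(range(len(predicted)), key=lambda i: -predicted[i])
    cutoff = math.ceil(len(order) * x / 100)   # number of files in the top x%
    found = sum(faults[i] for i in order[:cutoff])
    return 100.0 * found / sum(faults)

# 5 files; the single top-ranked file (top 20%) holds 4 of the 8 faults
print(top_x_yield([5, 1, 0, 0, 2], [4, 1, 0, 0, 3]))   # 50.0
```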
While X can be any suitable value, we have typically presented results with X=20, as repeated studies on a wide variety of systems have found that 80% or more of the faults are contained within 20% of the system's files. For the systems we have studied, we have frequently found even higher fault concentration. For example, in each of the 18 releases of System 7, all the faults were contained in 9.1% or fewer of the system's files, as shown in Table 1. In our previous work, the top 20% of the files identified by negative binomial regression models for 6 large systems contained 83% to 95% of the faults, and 76% of the faults in one other system.

To dispel the potential concern that using any particular value of X to evaluate a model may seem arbitrary, we defined the notion of fault-percentile average, which essentially averages the top X% figure over all values of X. A full description of the fault-percentile average is in Reference [13]. The mean fault-percentile averages for the systems studied in that paper ranged from 88.1 to 92.8 when the predictions were generated by negative binomial regression using the Standard Model.

While we find the top X% metric to be the most useful for interpreting effectiveness of the prediction models, we prefer fault-percentile average for comparison of alternative models. The top X% metric can be sensitive to whether a few faulty files just make or miss the threshold. In contrast, we have found the fault-percentile average to be more stable.

4.2 Prediction Accuracy for Simple Prediction Models

Figure 5 shows fault-percentile average (FPA) values, by release, for four selected predictor variables, representing the four main components of our prediction models. LOC is the total length of the file, including comment lines. To reduce skewness, the value used as a predictor variable in models is log(KLOC). Language is a categorical variable representing the file type. Age represents the number of prior releases containing the file.
Prior Changes is a count of the number of times the file was changed in the prior release. The two most effective predictors are easily log(KLOC) and Prior Changes, which produced very similar FPAs except at Release 13.

Figure 5: Fault percentile average, by release, for four simple predictor variables (LOC, Prior Changes, Language, Age)

Table 5 displays mean FPAs across Releases 3 to 16 for models that use the four predictor variables shown in Figure 5, as well as models based on a number of other measures of churn in the prior release. We note that whether or not a predictor variable was transformed does not matter for this analysis because the FPA metric depends only on the ranking of predictions, not on the actual values. Among the various measures of changes in the prior release, Prior Changes and the total number of lines changed performed best on the FPA metric. A count of Prior Developers is not far behind. Among types of changes, lines Added and Modified are most predictive; lines Deleted performs substantially worse.

Prior Changed is a simple binary indicator for whether any changes were applied to the file in the prior release. It is basically a simplification of Prior Changes, lumping together all files for which Prior Changes > 0. While clearly some signal is lost, the model using Prior Changed performs nearly as well as the Prior Changes model.

| Predictor | Mean FPA |
|---|---|
| log(KLOC) | 85.9 |
| Prior Changes | 84.8 |
| Prior Adds+Deletes+Mods | 84.7 |
| Prior Developers | 84.1 |
| Prior Lines Added | 83.5 |
| Prior Lines Modified | 82.9 |
| Prior Changed | 82.3 |
| Prior Faults | 75.9 |
| Prior Lines Deleted | 75.6 |
| Language | 69.4 |
| Age | 56.5 |

Table 5: Mean FPA for selected simple predictor variables

Results for Prior Faults are substantially inferior to all other prior change measures except for Lines Deleted.
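Since the FPA depends only on the ranking of the predictions, it can be computed directly from the sorted file list. A sketch of one common formulation (the percentile convention and tie handling are our choices; the full definition is in [13]):

```python
def fault_percentile_average(predicted, faults):
    """Average, over all faults, of the percentile of the file containing
    the fault, with files sorted in ascending order of predicted risk.
    Returns a fraction; multiply by 100 for FPA values as in the tables."""
    n = len(predicted)
    order = sorted(range(n), key=lambda i: predicted[i])   # least risky first
    total = sum(faults)
    return sum(faults[i] * rank / n
               for rank, i in enumerate(order, start=1)) / total

# a perfect ranking scores high; reversing the predictions scores low
print(fault_percentile_average([1, 2, 3, 4], [0, 0, 1, 3]))   # 0.9375
```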
4.3 Prediction Accuracy for Multivariate Prediction Models

In this section, we investigate the performance of various churn measures in a multivariate model that controls for other code and history variables. We begin with a simplification of our Standard Model that includes only log(LOC), programming language and file age (the number of previous releases that a file was in the system). We will call this the Base Model; it is the Standard Model without any churn-related predictor variables.

As mentioned earlier, this system includes seven different file types: C, C++, Java, SQL, a proprietary variant of C, a proprietary variant of C++, and .h files. Because analysis of previous systems indicated that fault rates declined quickly over the first few releases of a file before flattening out, we treat age (the number of prior releases) as a categorical variable with four values: 0, 1, 2-4, and greater than 4.

The first line of Table 6 shows that this Base Model produces a mean fault percentile average of 90.93 when averaged across Releases 3 to 18. When we consider log(KLOC) alone, the FPA is 85.9 when averaged across these releases, as shown in Table 5; so for this system, adding Language and Age yields about a 5.0 percentage point increment in the mean FPA over using size alone.

Subsequent lines of Table 6 display the additional improvement associated with a selected list of churn measures. In order to have measures of the density of changes in a file, we also evaluated relative churn: ratios of the counts of Adds, Deletes, and Modifications to the LOC for the file. Each ratio was truncated at one if the line count exceeded the LOC. These fault densities are shown in column five of Table 3 and the last column of Table 4.

Because the various counts and line count ratios can be very skewed, we evaluated three versions of each change measure (except the binary-valued Prior Changed indicator): the raw variable, the square root, and the fourth root.
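The relative churn ratios and the three transform versions can be expressed as one-liners (the function names are ours):

```python
def relative_churn(lines, loc):
    """Ratio of changed lines to file LOC, truncated at one as described."""
    return min(lines / loc, 1.0)

def versions(v):
    """The three versions evaluated for each change measure:
    the raw variable, the square root, and the fourth root."""
    return v, v ** 0.5, v ** 0.25

print(relative_churn(250, 100))   # 1.0  (truncated: more churn than LOC)
print(versions(16))               # (16, 4.0, 2.0)
```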
We only show the most effective version of each churn measure in the table. In Table 6 we observe that the two best churn measures were Prior Changes and Prior Adds+Deletes+Mods, with increments of about 2.4 percentage points each. This is consistent with what we observed in Table 5. For Prior Changes, the best form was the square root, although the fourth root was close. In contrast, the fourth root worked best for Prior Adds+Deletes+Mods, which is more skewed. In general, the untransformed measures performed worse than their transformed counterparts.

Next to each increment in the mean FPA, we show an estimated standard error for that increment. These standard errors are estimated based on release-to-release variability in the increments associated with each change measure. The resulting t-statistics and P-values are the results of paired two-sample t tests. We note that all these measures produce statistically significant improvements in mean FPA relative to the Base Model without any churn measures.

Other churn measures ranked similarly to those in Table 5. Increments in mean FPA for the relative churn ratios were uniformly smaller than for the absolute churn measures themselves (both after transformations).

Table 7 shows corresponding results starting with an initial model that includes Prior Changes. The addition of Prior Changes to the base model dramatically changes the marginal effectiveness of the other churn measures, of which only three retain statistical significance at the 0.05 level. Prior Cumulative Developers rises from the middle of the pack to the top of the list. The effectiveness of this predictor was previously reported for this system [10] as well as for three other systems [12]. The other statistically significant variable at this point was Prior-Prior Changes (i.e., changes at Release N-2), the least effective measure in Table 6.
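Concretely, the paired two-sample test reduces to a one-sample t test on the per-release FPA differences between a churn model and the Base Model; a stdlib sketch (the sample increments are invented for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(increments):
    """t statistic and standard error for per-release FPA increments of a
    churn model over the Base Model (a paired two-sample test is a
    one-sample test on the differences), with len(increments) - 1
    degrees of freedom."""
    se = stdev(increments) / sqrt(len(increments))   # release-to-release SE
    return mean(increments) / se, se

t, se = paired_t([2.0, 2.4, 2.8])   # hypothetical per-release increments
```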
What sets these two variables apart from the others is that each one incorporates information about churn going back beyond the immediately prior release. For each of the other churn measures, there is apparently sufficient correlation with the Prior Changes measure to blunt any statistically significant improvement.

| Predictor Variables | Mean FPA | Increase vs. Base | Standard error | t-value | P-value |
|---|---|---|---|---|---|
| Base: log(KLOC), Language, Age | 90.93 | NA | NA | NA | NA |
| (Prior Changes)^1/2 | 93.35 | 2.42 | 0.24 | 9.90 | 0.0000 |
| (Prior Adds+Deletes+Mods)^1/4 | 93.28 | 2.35 | 0.26 | 9.08 | 0.0000 |
| (Prior Adds+Deletes+Mods/LOC)^1/4 | 93.19 | 2.25 | 0.28 | 8.15 | 0.0000 |
| (Prior Developers)^1/2 | 93.17 | 2.24 | 0.23 | 9.68 | 0.0000 |
| (Prior Lines Added)^1/4 | 93.15 | 2.21 | 0.26 | 8.41 | 0.0000 |
| (Prior Lines Added/LOC)^1/4 | 93.03 | 2.10 | 0.29 | 7.25 | 0.0000 |
| Prior Changed | 92.95 | 2.01 | 0.23 | 8.77 | 0.0000 |
| (Prior Cum Developers)^1/2 | 92.93 | 2.00 | 0.17 | 11.65 | 0.0000 |
| (Prior Lines Modified)^1/4 | 92.91 | 1.98 | 0.20 | 9.85 | 0.0000 |
| (Prior Lines Modified/LOC)^1/4 | 92.81 | 1.87 | 0.20 | 9.48 | 0.0000 |
| (Prior Faults)^1/4 | 92.21 | 1.28 | 0.16 | 8.03 | 0.0000 |
| (Prior New Developers)^1/2 | 92.06 | 1.13 | 0.25 | 4.56 | 0.0004 |
| (Prior Lines Deleted)^1/4 | 92.06 | 1.13 | 0.16 | 6.93 | 0.0000 |
| (Prior Lines Deleted/LOC)^1/4 | 92.00 | 1.06 | 0.18 | 5.88 | 0.0000 |
| (Prior-Prior Changes)^1/4 | 91.96 | 1.03 | 0.14 | 7.21 | 0.0000 |

Table 6: FPA Improvements for selected churn measures, relative to model without any churn variables

| Predictor Variables | Mean FPA | Increase vs. Base | Standard error | t-value | P-value |
|---|---|---|---|---|---|
| Base: log(KLOC), Language, Age, (Prior Changes)^1/2 | 93.35 | NA | NA | NA | NA |
| (Prior Cum Developers)^1/2 | 93.67 | 0.32 | 0.07 | 4.32 | 0.0006 |
| (Prior-Prior Changes)^1/4 | 93.50 | 0.14 | 0.05 | 3.06 | 0.0080 |
| (Prior Adds+Deletes+Mods)^1/4 | 93.38 | 0.03 | 0.03 | 1.08 | 0.2959 |
| (Prior Lines Added)^1/4 | 93.37 | 0.02 | 0.03 | 0.84 | 0.4132 |
| (Prior Faults)^1/4 | 93.36 | 0.01 | 0.02 | 0.47 | 0.6435 |
| (Prior Lines Modified)^1/4 | 93.35 | -0.00 | 0.01 | -0.21 | 0.8387 |
| Prior New Developers | 93.35 | -0.00 | 0.00 | -1.45 | 0.1678 |
| Prior Developers | 93.35 | -0.00 | 0.00 | -0.98 | 0.3440 |
| (Prior Lines Deleted)^1/4 | 93.34 | -0.01 | 0.01 | -0.95 | 0.3549 |
| (Prior Adds+Deletes+Mods/LOC)^1/2 | 93.34 | -0.01 | 0.04 | -0.27 | 0.7910 |
| (Prior Lines Modified/LOC)^1/2 | 93.34 | -0.01 | 0.01 | -1.02 | 0.3237 |
| (Prior Lines Added/LOC)^1/2 | 93.34 | -0.01 | 0.05 | -0.30 | 0.7669 |
| (Prior Lines Deleted/LOC)^1/4 | 93.33 | -0.02 | 0.01 | -2.43 | 0.0279 |
| Prior Changed | 93.30 | -0.06 | 0.05 | -1.17 | 0.2610 |

Table 7: FPA Improvements for selected churn measures, relative to model with Prior Changes

Adding Prior Cumulative Developers to the initial model produced a mean FPA of 93.67. Beyond that point, the largest observed increment in the mean FPA is only 0.03 percentage points (for Prior-Prior Changes), and no increment was statistically significant (not shown). For comparison, we note that our Standard Model achieves a mean FPA value of 93.44 for the subject system.

5. THREATS TO VALIDITY

Our various measures of churn tend to be highly correlated. Consequently, it is difficult to determine with much confidence that a particular measure is more effective than all others for fault prediction. Instead, we can mainly assess the marginal value of certain measures in the presence of others.

This study has been carried out on 18 releases of one large system, and similar results may or may not be found for other systems.
Many aspects of software development and maintenance can vary from one system to another, including design processes, development approaches, programming languages, and testing strategies. Any of these may introduce significant differences in the frequency or locations of faults, and render the prediction models less successful. In particular, development processes that emphasize minimal rewriting of code, such as cleanroom development, may have a large impact on the number and type of code modifications. In such environments, the relations presented here may not hold. We intend to repeat the investigation on several of the other large systems that we have access to, to see whether we observe similar patterns.

6. RELATED WORK

Fault prediction researchers have investigated a wide variety of predictor variables. Many authors have utilized code mass, various complexity metrics, design information and variations on the history variables that make up our basic model. Work of this type has been reported by Graves et al. [1], Khoshgoftaar et al. [2], Menzies et al. [3, 4], Mockus and Weiss [5], Nagappan and Ball [6, 7], Ohlsson and Alberg [9], Ostrand et al. [11], and Zimmermann and Nagappan [14], among others.

Only a few authors have incorporated churn into prediction models. Mockus and Weiss [5] developed a model to predict the probability that a given change to the software system would result in a software failure. A given change is defined as all the modifications to the system that result from a single Maintenance Request, and may consist of multiple individual modifications to multiple code units. Their full prediction model included among its predictor variables the number of files changed, total LOC in the changed files, lines added, lines deleted, and the total number of deltas (check-in/check-out cycles) to a file. Stepwise regression yielded a reduced model that included number of deltas and lines added.
This methodology has been incorporated into a "risk assessment tool", but the authors do not state whether it uses the full or the reduced model.

Nagappan and Ball [6] constructed models to predict expected defect density (defects/KLOC) using either absolute or relative churn measures for predictor variables. The absolute measures include added + changed lines, deleted lines, and total number of changes made (apparently equivalent to the deltas of Mockus and Weiss). Relative measures normalize the corresponding absolute measure in terms of total LOC or total file count. The Nagappan-Ball measures are computed for binaries that are the result of compilation of many source files. Since files are the largest code unit in our software, there are no measures that are comparable to their measures that are based on the files within a binary. However, the following two relative measures of code churn are based only on the lines of code in a file, and are comparable to measures that can be computed for the software in the provisioning system:

M1: (added lines + changed lines)/LOC
M2: (deleted lines)/LOC

Nagappan and Ball found rank correlations of 0.8 and above between M1 or M2, and fault density. We observed considerably lower correlations, although the relative values of the correlations were similar. Table 8 shows rank correlations between fault density for the provisioning system we analyzed and the measures M1 and M2, as well as the inter-measure correlations.

|  | M1 | M2 | Faults/KLOC |
|---|---|---|---|
| M1 | 1.000 | 0.731 | 0.258 |
| M2 |  | 1.000 | 0.219 |
| Faults/KLOC |  |  | 1.000 |

Table 8: Rank Correlations between Relative Churn Measures and Fault Density

Nagappan and Ball's relative measure models produced R^2 values of .8 or higher on the Windows code, significantly better than the absolute measure models, leading to their conclusion that relative churn can be an efficient and effective predictor of defect density.
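The rank correlations in Table 8 are Spearman coefficients, i.e. Pearson correlation computed on ranks. A minimal sketch that assumes no tied values (ties would require average ranks, omitted here):

```python
def spearman(xs, ys):
    """Spearman rank correlation, assuming no ties within xs or ys."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank                      # rank of each original position
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5      # Pearson r on the ranks

# any monotone relationship yields a perfect rank correlation
print(spearman([1, 5, 9, 20], [0.1, 0.2, 0.3, 0.4]))   # 1.0
```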
In contrast to the results for Windows 2003, we found virtually no difference between the effectiveness of absolute and relative churn measures. In fact, the relative measures proved uniformly slightly less effective than the absolute.

In [7], Nagappan and Ball examine models that combine churn and system dependency information to predict post-release failures of system binaries that are constructed from a set of source files. They use 3 churn-related metrics: overall count of lines added, deleted, or modified; number of files in the binary that were changed; and total number of changes made to the files in the binary. Comparison of the post-release failures predicted by these models against actual failures showed R^2 values better than .6, with P<.0005, leading them to conclude that, at least for the subject system (Windows Server 2003), churn and system dependency information are reliable early indicators of failures.

Nagappan et al. [8] investigate the ability of change bursts, sequences of consecutive changes to a software system, to predict the system's fault-prone components. They define churn as the total of lines added, deleted, and modified during changes made to a component, and use the total churn over a component's lifetime, the churn within bursts, and the maximum churn occurring in any burst as three of the predictor variables in their model. When measured for Windows Vista, bursts turn out to be highly effective predictors of fault-prone components.

It is worth noting that each of these studies assessed their models by data splitting the information from a single release of the software. The churn data is based on changes made between the release's baseline version up to the final version that produced the production binaries.

7. CONCLUSIONS

Consistent with our earlier observations and results, we found that most faults in the system analyzed here occurred in files that had been changed in the prior release.
The importance of changes is so high that even a simple changed/not-changed variable is capable of providing respectable predictions of fault-prone files.

Confirming other research studies, we have seen that low-level measurements of code changes can be very effective for fault prediction. Counts of additions, deletions, and changes to a code base can be derived from the version control history of a system and used as input to a fault prediction model.

Of the three specific types of file changes, lines added delivered the most accurate predictions, followed by lines modified. Lines deleted had substantially less predictive value than the other two. The sum of all three counts, essentially a count of the total lines changed in a file, proved the most effective way to use the individual file churn data and was as good as any other variable that we tried.

However, the line counts did not improve prediction accuracy for this system relative to our Standard Model, which already included an alternative measure of churn in the prior release. It appears that either our old measure, Prior Changes, or a sum of Adds+Deletes+Mods can be equally effective.

8. REFERENCES

[1] T.L. Graves, A.F. Karr, J.S. Marron, and H. Siy. Predicting Fault Incidence Using Software Change History. IEEE Trans. on Software Engineering, Vol 26, No. 7, July 2000, pp. 653-661.
[2] T.M. Khoshgoftaar, E.B. Allen, and J. Deng. Using Regression Trees to Classify Fault-Prone Software Modules. IEEE Trans. on Reliability, Vol 51, No. 4, Dec 2002, pp. 455-462.
[3] T. Menzies, J. Greenwald, and A. Frank. Data Mining Static Code Attributes to Learn Defect Predictors. IEEE Trans. on Software Engineering, Vol 33, No. 1, 2007, pp. 2-13.
[4] T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, and Y. Jiang. Implications of Ceiling Effects in Defect Predictors. Proc. 4th Int. Workshop on Predictor Models in Software Engineering (PROMISE08), 2008, pp. 47-54.
[5] A. Mockus and D.M. Weiss. Predicting Risk of Software Changes.
Bell Labs Technical Journal, April-June 2000, pp. 169-180.
[6] N. Nagappan and T. Ball. Use of Relative Code Churn Measures to Predict System Defect Density. Proc. 27th Int. Conference on Software Engineering (ICSE05), 2005.
[7] N. Nagappan and T. Ball. Using Software Dependencies and Churn Metrics to Predict Field Failures: An Empirical Case Study. Proc. Empirical Software Engineering and Measurement Conference (ESEM), Madrid, Spain, 2007.
[8] N. Nagappan, A. Zeller, T. Zimmermann, K. Herzig, and B. Murphy. Change Bursts as Defect Predictors. Proc. 21st IEEE Int. Symposium on Software Reliability Engineering (ISSRE2010), 2010.
[9] N. Ohlsson and H. Alberg. Predicting Fault-Prone Software Modules in Telephone Switches. IEEE Trans. on Software Engineering, Vol 22, No. 12, December 1996, pp. 886-894.
[10] T.J. Ostrand, E.J. Weyuker, and R.M. Bell. Programmer-based Fault Prediction. Proc. Int. Conference on Predictive Models in Software Engineering (PROMISE10), Timisoara, Romania, September 2010.
[11] T.J. Ostrand, E.J. Weyuker, and R.M. Bell. Predicting the Location and Number of Faults in Large Software Systems. IEEE Trans. on Software Engineering, Vol 31, No. 4, April 2005.
[12] E.J. Weyuker, T.J. Ostrand, and R.M. Bell. Do Too Many Cooks Spoil the Broth? Using the Number of Developers to Enhance Defect Prediction Models. Empirical Software Engineering, Vol 13, No. 5, October 2008.
[13] E.J. Weyuker, T.J. Ostrand, and R.M. Bell. Comparing the Effectiveness of Several Modeling Methods for Fault Prediction. Empirical Software Engineering, Vol 15, No. 3, June 2010.
[14] T. Zimmermann and N. Nagappan. Predicting Defects Using Network Analysis on Dependency Graphs. Proc. 30th Int. Conference on Software Engineering (ICSE08), 2008, pp. 531-540.