Empirical Evaluation of Mixed-Project Defect Prediction Models

Burak Turhan
Department of Information Processing Science
University of Oulu
90014, Oulu, Finland
burak.turhan@oulu.fi
Ayşe Tosun
Department of Computer Engineering
Boğaziçi University
34342, Istanbul, Turkey
ayse.tosun@boun.edu.tr
Ayşe Bener
Ted Rogers School of ITM
Ryerson University
M5B-2K3, Toronto, ON, Canada
ayse.bener@ryerson.ca
Abstract—Defect prediction research mostly focuses on optimizing
the performance of models that are constructed for
isolated projects. On the other hand, recent studies try to utilize
data across projects for building defect prediction models. We
combine both approaches and investigate the effects of using
mixed (i.e. within and cross) project data on defect prediction
performance, which has not been addressed in previous studies.
We conduct experiments to analyze models learned from mixed
project data using ten proprietary projects from two different
organizations. We observe that code metric based mixed-
project models yield only minor improvements in the prediction
performance for a limited number of cases that are difficult
to characterize. Based on existing studies and our results, we
conclude that using cross project data for defect prediction
is still an open challenge that should only be considered in
environments where there is no local data collection activity,
and using data from other projects in addition to a project’s
own data does not pay off in terms of performance.
Keywords-cross project; within project; mixed project; defect
prediction; product metrics;
I. INTRODUCTION
Defect predictors are decision support systems for prior-
itizing the list of software modules to be tested, in order to
allocate limited testing resources effectively, and to detect
as many defects as possible with minimum effort. Defect
prediction studies usually formulate the problem as a su-
pervised learning problem, where the outcome of a defect
predictor model depends on historical data. Expected use of
such models in practice is to train and calibrate them with
past project data and then to apply them to new projects. Though
there are many publications on the problem (some examples
include [1]–[5]), almost all ignore the practical aspect that
the purpose of a defect predictor is to identify the defects of
new projects, which are different from those used in model
construction. The majority of publications focus on the
algorithmic models and report simulation results of defect
predictors that are trained on a specific project and tested on
the reserved portion of the same project. While this approach
aims at validating the effectiveness of these models, it does
not address the practical purposes. Though there are studies
that apply defect predictors to the consecutive versions of
the same project, they are longitudinal case studies and do
not address predictions across different projects [6], [7].
We are curious about why defect prediction research
fails to utilize data across projects. Is it because such an
approach is useless in defect prediction context? We are
optimistic about the answer. Just consider the problem of
cost estimation, which is technically similar to defect pre-
diction, i.e. a supervised learning problem utilizing past data.
Though the effectiveness of resulting models may vary, cost
estimation research has made use of cross project data for a
long time. A systematic review comparing within company
vs. cross company cost estimation models concluded that
some companies may benefit from cross company cost
estimations, while others may not [8]. Data gathered from
different projects are extensively used in cost estimation
studies, e.g. the COCOMO models and the ISBSG dataset [9], [10].
Our optimism not only relies on the analogy with cost
estimation, but also on the recent research results in
cross-project defect prediction studies (see Section II). Another
motivation for pursuing the research on cross project data
for defect prediction is that successful applications will have
significant implications in practice. Companies will be able
to employ defect prediction techniques in their projects,
even if they have no or limited historical local data to
build models with. Another scenario is that companies may
already have their defect prediction models in place and
making use of external data may improve the performance
of models learned from local project data. However, there
are no studies addressing the latter case, i.e. the effects of
incorporating cross project data in existing within project
defect predictors, which we address in this paper. Therefore,
we identify the following research goal for this study:
Previous studies focused on the two ends of the
spectrum, i.e. using either within or cross project
data for defect prediction. We want to check
whether using additional data from other projects
improves the performance of an existing, project
specific defect prediction model, i.e. what happens
when within and cross project data are mixed?
We demonstrate a mixed data approach by using both
within and cross project data, and analyze the spectrum in
between. In our experiments we use code metrics of ten
proprietary projects from two different sources, whose data
2011 37th EUROMICRO Conference on Software Engineering and Advanced Applications
978-0-7695-4488-5/11 $26.00 © 2011 IEEE
DOI 10.1109/SEAA.2011.59
are publicly available. Our contributions are to investigate
the merits of mixed-project predictions and to explore if
(and when) they may be effective; an issue that has not been
addressed by any previous work.
The rest of the paper is organized as follows: The next
section presents a discussion of the previous cross project
defect prediction studies. Section III describes the details of
the data, methods and the setup we used in our experiments.
Then we present and discuss the results of our experiments
in Section IV followed by the threats to validity in Section
V. Finally, we conclude our work in Section VI.
II. RELATED WORK
To the best of our knowledge, the earliest work on cross-
project prediction is by Briand et al. [11]. They use logistic
regression and MARS models to learn defect predictors from
an open-source project (Xpose), and apply the same
models to another open-source project (Jwriter), which is
developed by the same team with different design strategies
and coding standards. They observed that cross-project
prediction is indeed better than a random and a simple,
class-size based model. Yet, cross-project performance of
the model was lower compared to its performance on the
training project. They argue that cross-project predictions
can be more effective in more homogeneous settings, adding
that such an environment may not exist in real life. They
identify the challenge as of high practical interest, and not
straightforward to solve.
Turhan et al. made a thorough analysis of cross project
prediction using 10 projects collected from two different data
sources (i.e. the same projects analyzed in this paper) [12]. They
identified clear patterns that cross project predictions dramat-
ically increase the probability of detecting defective modules
(from median value of 75% to 97%), but the false alarm rates
as well (from median value of 29% to 64%). They claim that
improvements in detection rates are due to extra information
captured from cross project data and the increased false
alarms can be explained by the irrelevancies, which cross
project data also contain. They propose a nearest-neighbor
based data selection technique to filter the irrelevancies in
cross project data and achieve performances that are close to,
but still worse than within project predictions. They conclude
that within company prediction is the best path to follow and
cross project prediction with data filtering can be used as a
stop-gap technique before a local repository is constructed.
Turhan et al.'s results were replicated by Nelson et al. in a
follow-up study [13].
Zimmermann et al. consider different factors that may af-
fect the results of cross project predictions. They categorize
projects according to their domain, process characteristics
and code measures. In their initial case study they run
experiments to predict defects in Internet Explorer (IE)
using models trained with Mozilla Firefox and vice versa.
These products are in the same domain and have similar
features, but development teams employ different processes.
Their results show that Firefox can predict the defects in
IE successfully (i.e. 76.47% precision and 81.25% recall),
however the opposite direction does not work (i.e. 4.12%
recall). Zimmermann et al. then collect data from 10 addi-
tional projects and perform 622 pairwise predictions across
project components. This is a slightly different approach
than Turhan et al.'s, who constructed predictors from a common
pool of cross project data with data filtering in order to
satisfy Briand et al.’s homogeneity argument. Zimmermann
et al. classify a prediction as successful if precision, recall
and accuracy values are all above 75%, which results in only
21 successful predictions corresponding to 3.4% success
rate. They do not mention the performance of predictions
that are below the threshold. They derive a decision tree
using these prediction results to estimate the expected per-
formance from a cross project predictor in order to guide
practitioners. An interesting pattern in their predictions is
that open-source projects are good predictors of closed-source
projects, however open-source projects cannot be predicted
by any other projects. In a following study, Turhan et al.
investigated whether the patterns in their previous work [12]
are also observed in open-source software and analyzed three
additional projects [14]. Similar to Zimmermann et al., they
found that the patterns they observed earlier are not easily
detectable in predicting open-source software defects using
proprietary cross project data.
Cruz et al. train a defect prediction model with an open-
source project (Mylyn) and test the performance of the same
model on six other projects [15]. However, before training
and testing the model, they try to obtain similar distributions
in training and test samples through data transformations
(i.e. power transformation). They also remove outliers in
data by trimming the tails of distributions. They observe
that using transformed training and test data yields better
cross-project prediction performances [15].
Jureczko and Madeyski look for clusters of similar
projects in a pool of 92 versions from 38 proprietary,
open-source and academic projects [16]. Their idea is to
reuse the same defect predictor model among the projects that
fall in the same cluster. They use a comprehensive set of
code metrics to represent the projects and compare the
performances of prediction models that are trained on the
same project vs. other projects in the same cluster. They
identify three statistically significant clusters (out of 10),
where cross project predictions are better than within project
predictions in terms of the number of classes that must be
visited to detect 80% of the defects.
Liu et al. employ a search-based strategy, using genetic
algorithms, to select data points from seven NASA MDP
projects in order to build cross-project defect prediction
models [17]. They use 17 different machine learning meth-
ods and majority voting to build defect predictors. They con-
sistently observe lower misclassification errors than trivial
cross-project models (i.e. using all available cross-project
data without data selection). They argue that single project
data may fail to represent the overall quality trends, and
recommend development organizations to combine multiple
project repositories using their approach for exploiting the
capabilities of their defect prediction models [17]. However,
they do not provide a comparison with baseline within
project defect predictors.
III. STUDY DESIGN FOR MIXED-PROJECT MODELS
In this paper, we focus on an alternative way of utilizing
cross project data. While the overall goal of this line of
research is to construct predictors with no local data, we
take a more practice oriented step and investigate whether
existing defect predictors can be improved by incorporating
other projects’ data. In the following sections, we describe
our methods, data and experimental setup for our analyses.
A. Data and Methods
We use data from 10 proprietary projects from two different
sources, which are publicly available in the PROMISE
repository [18], [19]. Project related information and descriptive
statistics are given in Table I. Seven rows with corresponding
source columns labeled as “NASA” come from NASA
aerospace projects, and three rows with “SOFTLAB” label
in source columns come from a Turkish software company
developing embedded controllers for home appliances.
One caveat of cross project prediction is that all projects
need to have the same set of metrics in order to be able
to pool data from different projects. Therefore, though the
projects in Table I have more available metrics, we are
limited to use only those that are common in all analyzed
projects. The final set of 17 metrics includes complexity,
control flow, size and Halstead metrics and a complete list
is provided in Figure 1.
Before using the data, we applied a log-transformation
(i.e. replaced all numeric values with their logarithms) as
recommended by previous studies that analyzed the same
datasets [2], [12]. For the same reason, we used a naive Bayes
classifier to label methods as defect-prone or defect-free.
For instance, Menzies et al. demonstrated the effectiveness
of this technique in a series of data mining experiments
on these datasets [2], [20]. Further, Lessmann et al. com-
pared commonly used classification techniques on the same
datasets and found no significant differences between the
performances of top 15 classifiers, including naive Bayes.
They concluded that the choice of the classifier is not that
important for building defect predictors [4].
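The preprocessing and classification step described above can be sketched as follows. This is a minimal illustration in Python, not the original Matlab scripts. The Gaussian form of naive Bayes, the floor value EPS used to avoid taking the logarithm of zero, the variance smoothing constant, and all function names are assumptions introduced here; the text does not specify them.

```python
import math

EPS = 1e-6  # assumed floor so that log(0) is never evaluated


def log_transform(rows):
    """Replace every numeric metric value with its natural logarithm."""
    return [[math.log(max(v, EPS)) for v in row] for row in rows]


def train_naive_bayes(rows, labels):
    """Fit a class prior and per-feature Gaussian parameters for each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        # The small constant keeps every variance strictly positive (assumption).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*members), means)
        ]
        model[cls] = (n / len(rows), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one instance."""
    def log_posterior(cls):
        prior, means, variances = model[cls]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)
```

In this sketch, label 1 would stand for defect-prone and 0 for defect-free methods, and the same log_transform must be applied to training and test instances.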
In our experiments we use cross project data after ap-
plying the filtering method proposed in [12], i.e. nearest-
neighbor (NN)-filtering, due to its simplicity and docu-
mented effectiveness. With this filtering, it is expected to
obtain a subset of available cross project data that shows
similar characteristics to those of the test project's data. Note
that the granularity of data is at the functional method
level, hence NN-filtering identifies the methods with similar
characteristics (i.e. does not make project-level comparisons
for similarity).

Group               Symbol  Metric
Complexity & flow   v(g)    cyclomatic complexity
                    iv(G)   design complexity
                            branch count
Lines of Code               loc total
                            loc code and comment
                            loc comments
                            loc executable
Halstead (base)     N1      num operators
                    N2      num operands
                    µ1      num unique operators
                    µ2      num unique operands
Halstead (derived)  N       length
                    V       volume
                    D       difficulty
                    E       effort
                    B       error est
                    T       prog time
Figure 1. Common metrics for all projects (see [19] for descriptions).
In order to implement NN-filter, we first calculate the
pairwise distances between the test set and the candidate
training set samples (i.e. all cross project data). Let N be
the size of the test set. For each test instance, we pick
its k = 10 nearest neighbors from the candidate training set.
This yields a total of 10 × N similar instances.
Note that these 10 × N instances may not be unique (i.e. a
single data sample can be a nearest neighbor of many data
samples in the test set). Using only the unique ones, we form
the training set and use it in our experiments [12].
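The NN-filtering steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the distance is taken to be Euclidean (the text only says "pairwise distances"), uniqueness is decided by the position of an instance in the candidate set, and the name nn_filter is hypothetical.

```python
import math


def nn_filter(test_rows, candidate_rows, k=10):
    """Return the unique candidate instances that are among the k nearest
    neighbours (Euclidean distance, assumed) of at least one test instance."""
    selected = set()
    for t in test_rows:
        # Candidate indices ordered by distance to this test instance.
        order = sorted(
            range(len(candidate_rows)),
            key=lambda i: math.dist(t, candidate_rows[i]),
        )
        selected.update(order[:k])
    # Keep each selected candidate once, in its original order.
    return [candidate_rows[i] for i in sorted(selected)]
```

With N test instances the result holds at most k × N instances, and fewer whenever several test instances share a neighbour.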
We use three performance measures to assess the performance
of the defect predictors: probability of detection,
probability of false alarm, and balance. Since our datasets are
unbalanced (i.e. relatively fewer defective modules
than non-defective ones), we did not use measures such as
accuracy and precision, as recommended in [2], [21]. Using
a confusion matrix, we count the number of true positives
(tp), true negatives (tn), false positives (fp), false negatives
(fn) and derive the performance measures described below
[2].
Probability of detection (pd) is a measure of accuracy
for correctly identifying defective modules. It should
be as high as possible (ideal case is when pd = 1):
$$pd = \frac{tp}{tp + fn} \tag{1}$$
Probability of false alarm (pf) is an error measure for
incorrectly flagging non-defective modules. False alarms
cause testing efforts to be spent in vain. Thus, a defect
predictor should lower pf as much as possible (ideal case is when pf = 0):
$$pf = \frac{fp}{fp + tn} \tag{2}$$
Table I
TEN PROJECTS USED IN THIS STUDY, SORTED BY NUMBER OF EXAMPLES.

source   project  language  description                                      examples (# methods)  features  %defective
NASA     pc1      C++       Flight software for earth orbiting satellite     1,109                 21        6.94
NASA     kc1      C++       Storage management for ground data               845                   21        15.45
NASA     kc2      C++       Storage management for ground data               522                   21        20.49
NASA     cm1      C++       Spacecraft instrument                            498                   21        9.83
NASA     kc3      JAVA      Storage management for ground data               458                   39        9.38
NASA     mw1      C++       A zero gravity experiment related to combustion  403                   37        7.69
SOFTLAB  ar4      C         Embedded controller for white-goods              107                   30        18.69
SOFTLAB  ar3      C         Embedded controller for white-goods              63                    30        12.70
NASA     mc2      C++       Video guidance system                            61                    39        32.29
SOFTLAB  ar5      C         Embedded controller for white-goods              36                    30        22.22
Total                                                                        4,102

Balance (bal) is a single measure to indicate the tradeoff
between pd and pf rates. It is one minus the normalized
Euclidean distance from the desired point (pd, pf) = (1, 0) to the
observed (pd, pf) in a ROC curve. Larger bal rates indicate that the
performance is closer to the ideal case.
$$bal = 1 - \frac{\sqrt{(1 - pd)^2 + pf^2}}{\sqrt{2}} \tag{3}$$
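The three measures defined above can be computed directly from the confusion-matrix counts. The sketch below is a minimal Python illustration; the function name is hypothetical and it assumes that both classes occur in the test set, so that neither denominator is zero.

```python
import math


def performance_measures(tp, tn, fp, fn):
    """Compute (pd, pf, bal) from confusion-matrix counts.

    Assumes tp + fn > 0 and fp + tn > 0.
    """
    pd = tp / (tp + fn)   # probability of detection
    pf = fp / (fp + tn)   # probability of false alarm
    bal = 1 - math.sqrt((1 - pd) ** 2 + pf ** 2) / math.sqrt(2)
    return pd, pf, bal
```

As a worked example with made-up counts, tp = 9, fn = 11, fp = 6 and tn = 94 give pd = 0.45 and pf = 0.06, for which the balance evaluates to roughly 0.61.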
We present our results using charts to visualize the quar-
tiles of performance measures of a defect predictor, marking
the minimum, 25th percentile, median, 75th percentile and
the maximum values. The interpretation of quartile charts is
straightforward (similar to box-plots), and we would like to
refer the reader to previous studies for details [2], [12].
Finally, we use a non-parametric test, i.e. Mann-Whitney
U Test, to check for statistical differences between the
performances of different predictors. In all tests we use
the significance level α = 0.05. Experiment scripts are
implemented in Matlab R2007a.
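For illustration, a two-sided Mann-Whitney U test can be sketched in Python as below. This is not the original Matlab implementation: it uses the normal approximation with a tie correction and no continuity correction, which is an assumption made here, and the original scripts may have used an exact method for small samples. The function name is hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test, normal approximation with tie
    correction. Returns (U, p_value)."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n1, n2, n = len(x), len(y), len(x) + len(y)
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # mean of the 1-based ranks i+1 .. j
        t = j - i                            # size of this group of ties
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u, 1.0                        # all observations are identical
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return u, math.erfc(abs(z) / math.sqrt(2))
```

A difference between two sets of results would then be called significant when the returned p-value is below α = 0.05.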
B. Experiment Design
Our experiments designed for comparing within project
(WP) and mixed project (WP+CP) defect predictors follow
the procedure explained in Figure 2. Between lines 4 and
6, common metrics among projects are identified and the
rest are ignored. Through lines 8 and 19, we (i) prepare WP
training sets from 90% of each project, selected at random
(line 9) ; (ii) prepare test sets with the unused 10% of the
previous step (line 10); (iii) and prepare a training set for
WP+CP model by applying NN-filtering on CP project data,
which consists of the pool of all projects other than the
one to be tested (lines 12 to 19). Between lines 22 and 26,
we form a loop that iteratively (in increments of 10) merges
random CP samples with WP data, then builds and evaluates
a model using this mixed data. In line 29, we select the best
mixed (i.e. WP + CP data) model with the smallest possible
number of CP instances, according to the balance measure.
In lines 31 and 32, we build and evaluate WP only model and
store its prediction performances. We repeat this procedure
20 times for each project (line 7).
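The selection step in line 29 can be sketched as follows. This is a minimal Python illustration with hypothetical names. It reads "best mixed model with the smallest possible number of CP instances" as: take the highest bal, and among models with equal bal prefer the one built with the fewest CP instances. That tie-breaking rule is an interpretation, not something the text states explicitly.

```python
def candidate_sizes(num_filtered_cp, step=10):
    """Sizes j = 10, 20, ... of the random CP sample merged with WP data."""
    return list(range(step, num_filtered_cp + 1, step))


def select_best_smallest(results):
    """Pick the mixed model to report.

    results: iterable of (num_cp_instances, pd, pf, bal) tuples, one per
    value of j. Highest bal wins; ties go to the fewest CP instances.
    """
    return min(results, key=lambda r: (-r[3], r[0]))
```

For example, if the models built with 20 and 30 CP instances reach the same bal, the one with 20 instances is reported.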
In all, we conduct (2 models) × (20 randomized selections) ×
(10 projects) = 400 experiments to compare within and mixed
project predictors.

Table II
SUMMARY OF MANN-WHITNEY U-TEST RESULTS FOR NASA PROJECTS
WHEN MOVING FROM WITHIN PROJECT TO MIXED PROJECT
PREDICTIONS.

WP → WP+CP
PD         PF         BAL        Data sets
same       same       same       cm1, kc1, kc2, mw1, pc1, kc3
same       decreased  increased  mc2

Table III
SUMMARY OF MANN-WHITNEY U-TEST RESULTS FOR SOFTLAB
PROJECTS WHEN MOVING FROM WITHIN PROJECT TO MIXED PROJECT
PREDICTIONS.

WP → WP+CP
PD         PF         BAL        Data sets
same       same       same       ar3
increased  increased  increased  ar4
increased  same       increased  ar5
IV. RESULTS AND DISCUSSIONS
Table II shows project-wise overall results for NASA
projects. It is clear that, except for mc2, adding cross project
samples to within project data does not significantly improve
the prediction performances of models. In the mc2 project, an
improvement is achieved through a reduction of false alarms
from median 36% to 27%.
A summary of results for SOFTLAB projects is provided
in Table III. In these projects a different pattern than in NASA
projects is observed: in two out of three projects, adding CP
samples to WP data increases prediction performance. In ar4,
pd rates as well as pf rates have increased (from median pd-pf
rates of 45%-6% to 75%-25%); however, the improvement
in pd is more significant in terms of its effect on the bal
value (from median 61% to 75%). In the ar5 project, pd rates
have also significantly improved with a mixed model, from
median 88% to 100%, yielding a 2% increase in bal, with no
significant change in pf rates. There is no observable effect
in the ar3 project.
1: DATA = {CM1, KC1, KC2, KC3, PC1, MW1, MC2, AR3, AR4, AR5}
2: LEARNER = {naive Bayes}
3: CFEATURES = Find common features in DATA
4: for data ∈ DATA do
5:   data = Select CFEATURES from data
6: end for
7: for i = 1 to 20 do
8:   for data ∈ DATA do
9:     WPTRAIN = Select random 90% of data
10:    TEST = data − WPTRAIN
11:    CPTRAIN = DATA − data
12:    {NN filtering: Select 10 nearest neighbours in CPTRAIN for each test instance}
13:    for test ∈ TEST do
14:      dist = NNDistance(test, CPTRAIN)
15:      NNCP ← Select 10 instances in CPTRAIN with min(dist)
16:    end for
17:
18:    {Remove duplicate CP instances}
19:    NNCPTRAIN = UNIQUE(NNCP)
20:
21:    {Model WPCP: Add random X instances from NNCPTRAIN}
22:    for j = 10 to size(NNCPTRAIN) do
23:      NNCPSAMPLE = WPTRAIN + Select random j instances in NNCPTRAIN
24:      WPCPMODEL = Train LEARNER with NNCPSAMPLE
25:      [model_pd(j, 1), model_pf(j, 1), model_bal(j, 1)] = WPCPMODEL on TEST
26:    end for
27:
28:    {Select the best smallest mixed data model}
29:    [wpcp_pd, wpcp_pf, wpcp_bal] ← Select max(model_bal) on TEST
30:
31:    WPMODEL = Train LEARNER with WPTRAIN
32:    [wp_pd, wp_pf, wp_bal] = WPMODEL on TEST
33:  end for
34: end for
Figure 2. Pseudo-code for the experimental setup.
A possible explanation for these observations is the size
and defect rates of the projects. Specifically, SOFTLAB
projects have fewer methods with relatively high defect rates
compared to NASA projects. We can clearly observe the
improvements in SOFTLAB projects, which are considerably
smaller than NASA projects. We also observe the same
improvement in the smallest NASA project, mc2.
We investigated this issue further by visualizing the per-
formances on individual projects with the quartile charts
provided in Figure 3. Charts for individual projects are sorted
with respect to size of the projects in descending order. For
example, the first chart belongs to pc1, which is the largest
project (1109 methods), whereas the last four charts belong
to smaller projects (method counts: ar4 = 107, ar3 = 63, mc2
= 61, ar5 = 36). Please note that the project sizes drop
sharply from |mw1| = 403 to |ar4| = 107
methods. There is a pattern in Figure 3 that adding cross
project samples into within project data does not affect the
prediction performances in relatively larger projects (pc1,
kc2, kc3, kc1, cm1, mw1), which contain more than 400
methods. On the other hand, the last four projects, which are
relatively smaller projects with roughly 100 methods or fewer, show
a completely different pattern: in ar4, mc2 and ar5, adding
cross project data improves the prediction performance of
defect models.
Mann-Whitney U-tests also validate that the improve-
ments in ar4 and ar5 projects are significant, and median pf
for mc2 decreased from 36% to 27% yielding a 4% increase
in bal. Apart from their small sizes, these projects have
relatively high defect rates. The exception in ar3 project can
be explained either as being an outlier or by its relatively
lower defect rate, 13%, compared to other projects in this
group (ar4, mc2, ar5), which have (19%, 32%, 22%) defect
rates.
Another explanation for this pattern may be the common,
strict processes enforced by NASA during the development
of these larger projects. For this reason, cross project data
may not yield additional benefits in projects following
similar strict processes. In this case, we should consider
mc2 project as an outlier in NASA projects. In addition,
three of the four projects, making up the smaller group,
are developed by a company working in a specific domain,
where common processes are not strictly enforced as in
NASA. Specifically [12]:
- The SOFTLAB software were built in a profit- and
  revenue-driven commercial organization, whereas
  NASA is a cost-driven government entity.
- The SOFTLAB software were developed by small
  teams (2-3 people) working in the same physical
  location while the NASA software were built by much
  larger teams spread around the United States.
- The SOFTLAB development activities were carried out
  in an ad-hoc, informal way rather than the formal, process
  oriented approach used at NASA.

Project (pc1)
      model   min   Q1  med   Q3  max
pd    WP       50   50   63   75   88
      WP+CP    50   50   63   75   88
pf    WP       19   25   26   30   34
      WP+CP    19   25   26   30   33
bal   WP       58   61   68   75   83
      WP+CP    58   61   68   75   83

Project (kc1)
      model   min   Q1  med   Q3  max
pd    WP       67   76   80   86   94
      WP+CP    67   76   79   86   91
pf    WP       24   29   31   33   34
      WP+CP    22   26   28   30   33
bal   WP       68   71   74   76   80
      WP+CP    69   72   75   78   81

Project (kc2)
      model   min   Q1  med   Q3  max
pd    WP       55   73   77   91  100
      WP+CP    55   73   77   91  100
pf    WP       15   17   27   28   34
      WP+CP    12   16   26   27   32
bal   WP       61   73   77   80   88
      WP+CP    63   73   77   81   90

Project (cm1)
      model   min   Q1  med   Q3  max
pd    WP       40   60   80   80  100
      WP+CP    40   60   80   80  100
pf    WP       16   24   31   36   44
      WP+CP    16   22   29   31   33
bal   WP       53   64   70   75   84
      WP+CP    53   65   74   77   86

Project (kc3)
      model   min   Q1  med   Q3  max
pd    WP       25   63   75   75  100
      WP+CP    25   63   75   75  100
pf    WP        7   21   26   27   32
      WP+CP     7   18   22   26   29
bal   WP       44   68   74   79   86
      WP+CP    45   68   75   79   90

Project (mw1)
      model   min   Q1  med   Q3  max
pd    WP       33   50   67  100  100
      WP+CP    33   50   67  100  100
pf    WP       14   20   24   28   54
      WP+CP    11   20   24   28   54
bal   WP       46   56   71   76   90
      WP+CP    47   56   71   76   92

Project (ar4)
      model   min   Q1  med   Q3  max
pd    WP       40   45   45   50   50
      WP+CP    70   75   75   75   75
pf    WP        3    6    6    6    6
      WP+CP    22   24   25   25   26
bal   WP       58   61   61   64   64
      WP+CP    74   74   75   75   75

Project (ar3)
      model   min   Q1  med   Q3  max
pd    WP       88   88   88   88   88
      WP+CP    88   88   88   88   88
pf    WP       40   40   40   40   40
      WP+CP    40   40   40   40   40
bal   WP       70   70   70   70   70
      WP+CP    70   70   70   70   70

Project (mc2)
      model   min   Q1  med   Q3  max
pd    WP       40   60   80   80  100
      WP+CP    40   60   80   80  100
pf    WP        0   27   36   45   73
      WP+CP     0   18   27   36   55
bal   WP       43   57   67   71   81
      WP+CP    47   59   71   76   81

Project (ar5)
      model   min   Q1  med   Q3  max
pd    WP       88   88   88   88  100
      WP+CP    88  100  100  100  100
pf    WP       29   29   29   29   29
      WP+CP    29   29   29   29   29
bal   WP       78   78   78   78   80
      WP+CP    78   80   80   80   80

Figure 3. Project-wise results for within project data vs. mixed project data experiments, ordered by decreasing project size (in the original chart layout: first from left to right, then top to bottom).
Nevertheless, it is evident that mixed-project models yield
only limited improvements in a few cases, i.e. 3/10 projects.
These rare improvements achieved by mixed-project data
predictions can be attributed either to project size and defect
rate, or to organizational processes. However, we should also
note that regardless of the dominant reason, mixed-project
predictions do not affect the results in the negative direction.
We may explain the lack of improvements with mixed-
project models with the findings of [20]. In their study,
Menzies et al. argue that static code metrics can reflect only
a limited amount of information about projects (as noted
by other researchers as well [22], [23]), and this limited
information can be captured by models using only a small
number of data points (i.e. as few as 50 methods). Our
results complement their findings. Our defect models based
on static code features are likely to capture available project
information and converge to a maximum performance, such
that adding more data from other projects does not reveal
additional information about the test project. Therefore, a
future direction could be to test this argument using other
types of project metrics (e.g. code churn, developer networks)
across different projects.
V. THREATS TO VALIDITY
We assess possible threats to the validity of our study in
two categories: internal and external validity.
A. Internal Validity
Issues that may affect the internal validity of our results
are instrumentation and selection. To control the problems
due to instrumentation, we have used commonly employed
tools and methods in our study design. We use method level
static code metrics to represent software projects. Although
static code metrics cannot capture important issues such
as development process, skills and experiences of the team,
they are widely accepted and used in software engineering
studies. We applied methods like naive Bayes and data
filtering, which have been used in earlier studies. We used
the Matlab environment for implementing our test scripts.
For controlling problems due to selection, we took great
care in choosing the data that we used in our experiments.
Specifically, we removed a portion of available data sets
from NASA, for which there have been concerns about data
quality [18]. SOFTLAB data sets were collected via auto-
mated metric extraction tools and their defect labels were
finalized after discussions with the original development and
test teams. Finally, in the experimental rig, we repeated
several stratified cross validation steps to avoid sampling
bias.
B. External Validity
Problems with external validity concern the
generalization of results. To address this issue at least to
some extent, we use 10 different projects and two different
data sources: NASA and SOFTLAB. Other software engi-
neering researchers argued that conclusions derived from
NASA data sets are relevant to software engineering industry
[2], [24]. These data sets come from various contractors
operating in different industries, such as governmental or
commercial organizations. Therefore, they represent various
sectors in manufacturing and service industries. We also
use additional datasets from a different source, i.e. SOFT-
LAB data sets. Nevertheless, it is difficult to draw general
conclusions from empirical studies in software engineering
and our results are limited to the analyzed data and context
[25].
VI. CONCLUSIONS
Defect prediction is an important area of research with
potential applications in industry for early detection of problematic
parts in software, which allows significant reductions
in project costs and schedules. Surprisingly, most defect
prediction studies focus only on optimizing the performance
of models that are constructed using data from individual
projects. On the other hand, some recent studies
approach the problem from a different perspective:
looking for ways to utilize cross project data for building
defect predictors. These studies identify cross project defect
prediction as a challenge with cost effective opportunities,
such as using a set of open-source data repositories for
building or tuning models for proprietary projects.
We noticed that existing studies focus on the two ends
of the spectrum, that is using either within or cross project
data, leading to the motivation behind this paper. We in-
vestigated the case where models are constructed from a
mix of within and cross project data. We checked for
any improvements to within project defect predictions after
adding data from other projects. We conducted experiments
with 10 proprietary projects to analyze the behaviour of such
hybrid models, and observed that mixed project models yield
only minor improvements over within project models in a
limited number of cases. These cases can be characterized as
small projects (<100 methods) with relatively higher defect
rates (19%). Such characteristics are usually observed
in early development phases of projects. However, we also
noticed that these results might be due to different levels
of adherence to strict processes used in the development of
the analyzed projects; hence the results might be company
specific. Regardless, we observe no negative effects.
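To make the notion of a mixed-project model concrete, the following Python sketch builds a mixed training set by adding filtered cross project rows to a project's own rows. It assumes a nearest-neighbour filter with Euclidean distance; the function names, the value of k, and the two toy metrics are hypothetical and are not the exact configuration of our experiments.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nn_filter(within_rows, cross_rows, k):
    """Keep cross project rows that are among the k nearest neighbours
    of at least one within project row (duplicates are kept only once)."""
    selected = set()
    for features, _ in within_rows:
        ranked = sorted(
            range(len(cross_rows)),
            key=lambda i: euclidean(features, cross_rows[i][0]),
        )
        selected.update(ranked[:k])
    return [cross_rows[i] for i in sorted(selected)]


def mixed_training_set(within_rows, cross_rows, k=1):
    """Within project rows followed by the filtered cross project rows."""
    return within_rows + nn_filter(within_rows, cross_rows, k)


# Hypothetical rows: ((lines_of_code, cyclomatic_complexity), defect_label)
within = [((10.0, 2.0), 0), ((50.0, 9.0), 1)]
cross = [((12.0, 2.0), 0), ((48.0, 10.0), 1), ((400.0, 60.0), 1)]
mixed = mixed_training_set(within, cross, k=1)
```

In this toy example the very large cross project method is discarded because it is not close to any local method, while the two similar methods are added to the local data. A learner such as naive Bayes would then be trained on `mixed` and compared against the same learner trained on `within` alone.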
In summary, we hesitate to recommend cross or mixed-project
models over within-project models given the current
state of research results. Pure cross project data should be
used in data-starved environments as a temporary solution
until local data collection practices are in place, and models
learned from mixed data do not perform better than project
specific models. However, there are promising results, and
cross-project defect prediction is still an open challenge.
In our future work we plan to focus on open-source
projects, especially to learn predictors for proprietary
projects; a difficult, but promising challenge. We are also
planning to investigate the same problem using other types
of project metrics, e.g. code churn and developer networks.
ACKNOWLEDGMENT
This research is supported in part by (i) TEKES under
the Cloud-SW project in Finland and (ii) the Turkish State Planning
Organization (DPT) under the project number 2007K120610
in Turkey.
REFERENCES
[1] A. G. Koru and H. Liu, “Building effective defect-prediction
models in practice,” IEEE Software, vol. 22, no. 6, pp. 23–29,
2005.
[2] T. Menzies, J. Greenwald, and A. Frank, “Data mining
static code attributes to learn defect predictors,” IEEE
Transactions on Software Engineering, vol. 33, no. 1, pp. 2–13,
Jan. 2007.
[3] T. M. Khoshgoftaar and Y. Liu, “A multi-objective software
quality classification model using genetic programming,”
IEEE Transactions on Reliability, vol. 56, no. 2, pp. 237–245,
2007.
[4] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, “Benchmarking
classification models for software defect prediction:
A proposed framework and novel findings,” IEEE Trans.
Software Eng., vol. 34, no. 4, pp. 485–496, 2008.
[5] B. Turhan and A. Bener, “Analysis of naive Bayes’ assumptions
on software fault data: An empirical study,” Data and
Knowledge Engineering, vol. 68, no. 2, pp. 278–290, 2009.
[6] E. J. Weyuker, T. J. Ostrand, and R. M. Bell, “Do too
many cooks spoil the broth? using the number of developers
to enhance defect prediction models,” Empirical Software
Engineering, vol. 13, no. 5, pp. 539–559, 2008.
[7] A. Tosun, A. Bener, B. Turhan, and T. Menzies, “Practical
considerations in deploying statistical methods for defect prediction:
A case study within the Turkish telecommunications
industry,” Information and Software Technology, vol. In Press,
Corrected Proof, pp. –, 2010.
[8] B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross
versus within-company cost estimation studies: A systematic
review,” IEEE Trans. Software Eng., vol. 33, no. 5, pp. 316–
329, 2007.
[9] B. Boehm, E. Horowitz, R. Madachy, D. Reifer, B. K. Clark,
B. Steece, A. W. Brown, S. Chulani, and C. Abts, Software
Cost Estimation with Cocomo II. Prentice Hall, 2000.
[10] C. Lokan, T. Wright, P. R. Hill, and M. Stringer, “Organizational
benchmarking using the ISBSG data repository,” IEEE
Software, vol. 18, pp. 26–32, 2001.
[11] L. C. Briand, W. L. Melo, and J. Wust, “Assessing the
applicability of fault-proneness models across object-oriented
software projects,” IEEE Trans. Softw. Eng., vol. 28, pp. 706–
720, July 2002.
[12] B. Turhan, T. Menzies, A. B. Bener, and J. S. D. Stefano,
“On the relative value of cross-company and within-company
data for defect prediction,” Empirical Software Engineering,
vol. 14, no. 5, pp. 540–578, 2009.
[13] A. Nelson, T. Menzies, and G. Gay, “Sharing experiments
using open source software,” Software: Practice and Experience,
2010.
[14] B. Turhan, A. B. Bener, and T. Menzies, “Regularities in
learning defect predictors,” in PROFES, 2010, pp. 116–130.
[15] A. E. C. Cruz and K. Ochimizu, “Towards logistic re-
gression models for predicting fault-prone code across soft-
ware projects,” in Proceedings of the 2009 3rd International
Symposium on Empirical Software Engineering and Mea-
surement, ser. ESEM ’09. Washington, DC, USA: IEEE
Computer Society, 2009, pp. 460–463.
[16] M. Jureczko and L. Madeyski, “Towards identifying software
project categories with regard to defect prediction,” in
PROMISE ’10, New York, NY, USA: ACM, 2010.
[17] Y. C. Liu, T. M. Khoshgoftaar, and N. Seliya, “Evolution-
ary optimization of software quality modeling with multiple
repositories,” IEEE Transactions on Software Engineering,
vol. 36, pp. 852–864, 2010.
[18] G. Boetticher, T. Menzies, and T. Ostrand, “PROMISE repos-
itory of empirical software engineering data,” 2007.
[19] NASA-IV&V, “Metrics data program, available from
http://mdp.ivv.nasa.gov,” Internet; accessed 2007.
[20] T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, and
Y. Jiang, “Implications of ceiling effects in defect predictors,”
in PROMISE ’08, New York, NY, USA: ACM, 2008, pp. 47–
54.
[21] T. Menzies, A. Dekhtyar, J. Distefano, and J. Greenwald,
“Problems with precision,” IEEE Transactions on Software
Engineering, September 2007.
[22] M. J. Shepperd and D. C. Ince, “A critique of three metrics,”
Journal of Systems and Software, vol. 26, no. 3, pp. 197–210,
1994.
[23] N. Fenton and S. L. Pfleeger, Software Metrics: A Rigorous
and Practical Approach, 2nd ed. London, UK: International
Thomson Computer Press, 1997.
[24] V. Basili, F. McGarry, R. Pajerski, and M. Zelkowitz,
“Lessons learned from 25 years of process improvement: the
rise and fall of the NASA software engineering laboratory,” in
Proceedings of the 24th International Conference on Software
Engineering, 2002, pp. 69–79.
[25] V. R. Basili, F. Shull, and F. Lanubile, “Building knowledge
through families of experiments,” IEEE Transactions on Software
Engineering, vol. 25, pp. 456–473, 1999.