Software Defect Prediction Using Call Graph Based Ranking (CGBR)
Framework
Burak Turhan, Gozde Kocak, Ayse Bener
Dept. of Computer Engineering, Bogazici University, Turkey
{turhanb, gozde.kocak, bener}@boun.edu.tr
Abstract
Recent research on static code attribute (SCA) based defect prediction suggests that a performance ceiling has been reached and that this barrier can be exceeded by increasing the information content in the data [18]. In this research we propose the static call graph based ranking (CGBR) framework, which can be applied to any defect prediction model based on SCA. In this framework, we model both intra module properties and inter module relations. Our results show that defect predictors using the CGBR framework detect the same number of defective modules while yielding significantly lower false alarm rates. On public industrial data, we also show that the CGBR framework can improve testing effort by 23%.
1. Introduction
Defect rate is a key indicator of software product quality. High defect rates increase the cost of development and maintenance and lead to customer dissatisfaction. The aim is to keep the defect rate as low as possible and to catch defects as early as possible. Statistics show that in successful projects the testing phase takes up to 50% of the total project cost [9, 25]. Therefore, an effective testing strategy would shorten the testing period, minimize the number of defects and improve software quality.
Recent research shows that companies can benefit from defect prediction tools to manage their testing phases effectively. Using test resources effectively would significantly lower a company's costs.
Numerous software metrics and statistical models
have been developed in order to predict defects in
software [7, 10, 13, 14, 15, 16, 20, 28]. There are
several approaches to tackle the problem. Some
researchers use size and complexity metrics that are
collected at different phases of development, some
use process and human centric metrics to measure the
quality of the development process, and some others
prefer combining multiple approaches [7, 10, 13, 14,
15, 19].
Architectural characteristics of software also play an important role in predicting defects. Design metrics are useful for assessing the quality of a design. Architects use design metrics to decide which parts to rework, to evaluate aspects of a design, and to better focus testing and/or re-engineering effort on problem areas. For example, fan-in and fan-out metrics give important clues about the complexity, code redundancy and maintainability of the software [12].
Although there are some doubts about the value of static code attributes (i.e. size and complexity metrics) [24], they have been shown to be useful, are easy to collect, and are widely employed in research [16, 17, 23, 26, 27, 29]. Basic static code attributes, such as lines of code, Halstead attributes and McCabe attributes, can be cheaply and automatically collected even for large systems. However, Menzies et al. have recently shown empirically that static code attributes have limited information content, i.e. their information can be quickly and completely discovered by even simple learners [18]. They also argue that defect predictors based on static code attributes have reached a performance ceiling, and that further improvement requires increasing the information content of the data, either by incorporating metrics from different sources (e.g. the requirements phase) or by defining human-in-the-loop approaches.
Considering the results of Menzies et al., we aim at increasing the information content of static code attributes by investigating the architectural structure of the code. More precisely, static code attributes assume the independence of software modules and measure individual module complexities without considering their interactions.¹
¹ Fan-in and fan-out metrics are frequency counts of module calls and do not represent the architectural structure of the overall software. Thus, they do not violate the module independence assumption.
In this research we
adjust static code attributes by modeling their
interconnection schema. Previous research mostly
focused on intra module metrics [8, 21, 30]. In this
research we combine both intra module and inter
module metrics.
We claim that module interaction and structure
play an important role in defect proneness of a given
code. Therefore we propose a model to investigate
the module interactions. We use the static call graph
structures in our research. Then we calculate call
graph based ranking (CGBR) values and assign ranks
to each module. We have used the PageRank algorithm (a web link based ranking algorithm) in constructing our call graph based ranking algorithm [4, 6].
In our experiments we used data from a large-
scale white-goods manufacturer [5]. We constructed
call graphs for every project and then calculated CGBR values for all modules in each project. Then we adjusted our data attributes with the CGBR values. Afterwards we used the Naïve Bayes model for defect prediction [16, 28]. Our results show that the probability of false alarm decreases significantly while recall is retained; thus the testing effort is improved.
The rest of the paper is organized as follows: in Section 2 we review the literature, and in Section 3 we explain our data sets. The call graph based ranking algorithm is discussed in Section 4. Section 5 explains our experiments and results. The last section gives conclusions and future directions.
2. Related work
Various methods such as linear regression
analysis, discriminant analysis, decision trees, neural
networks and Naïve Bayes classification are used for
defect prediction in previous research [11, 13, 14, 16,
17, 19, 20, 26, 27, 28, 29]. Menzies et al. state that "the current status of the research in learning defect predictors is curiously static" [18]. They argue that when complex algorithms no longer improve prediction performance, we should find ways to gain better insight into the training data. They suggest that improving the information content of the training data should replace the investigation of more complex learners [18].
Static code measures defined by McCabe and
Halstead are module-based metrics like lines of code,
complexity and the number of unique symbols. While some researchers have opposed using static code measures [24], recent research showed that a Naïve Bayes classifier with log-filtered static code measures produces significantly better results than rule based methods [16]. Additionally, various studies have shown the usefulness of static code attributes [13, 14, 16, 17, 26, 27, 28, 29].
Since we aim at combining static code attributes
with call graph information for defect prediction, we
should carefully differentiate our research from other
call graph related studies. In addition to the
abovementioned studies on static code attributes
(SCA), there are recent studies investigating call
graph-defect relations [8, 21, 30]. While SCA related
studies examine only intra module metrics, call graph
related studies examine only inter module relations.
Our contribution in this research is to combine
both approaches. We show that combining inter and
intra module metrics not only increases the
performance of defect predictors but also decreases
the required testing effort for locating defects in the
source code.
3. Data
In order to validate our approach, we have built
our defect predictors from static code attributes
adjusted by module interactions (i.e. CGBR values).
To train these predictors, static code attributes are
formed as a table, where columns describe software
attributes such as lines of code, McCabe and Halstead
metrics. Each row in the data table corresponds to data from a software module, where we define a module as a function/procedure in the source code. We also include another column that indicates the corresponding module's fault data (i.e. whether it is defective or defect free). The goal of a defect predictor is to find relations among the attribute columns that predict the values in the defect column. If such relations are found, the test team can use them to determine where to start testing [16].
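A minimal sketch of this data layout, assuming the metrics and labels are stored one module per row in a CSV file (the file name, column order and label encoding are illustrative assumptions, not the original data format):

import csv

def load_modules(path="ar4.csv"):
    # each row: 29 static code attributes followed by the defect label
    features, labels = [], []
    with open(path) as f:
        for row in csv.reader(f):
            *metrics, defect = row
            features.append([float(v) for v in metrics])
            labels.append(defect.strip() in ("1", "true", "defective"))
    return features, labels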
We use data from a local company that develops
embedded systems software [5]. This data source
includes three different projects. These projects are
embedded controller software for white-goods and
named AR3, AR4, and AR5. We collected 29 static code attributes, such as LOC, Halstead, and McCabe metrics, from these projects. We also collected defect information for each module in AR3, AR4 and AR5, and manually matched these defects with the software modules. Table 1 shows general properties of the three projects used in this study.
The projects are written in the C programming language and we have access to their source code. Public data sets do not contain call graph information [22]. Therefore, we used the available source code to build the static call graphs. To build a call graph we created an NxN matrix, where N is the
number of modules. In this matrix, the entry in row i and column j is 1 if module i calls module j and 0 otherwise. Thus a row lists the modules called by a given module, and a column shows which modules call it (the column sum gives the number of callers). Table 2 shows an example call graph matrix.
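As an illustration, a matrix like the example in Table 2 could be assembled from a list of (caller, callee) pairs emitted by a call graph extraction tool (a sketch under that assumption; the authors' own tool is not reproduced here):

def call_matrix(modules, calls):
    # modules: list of module names; calls: iterable of (caller, callee) pairs
    index = {name: i for i, name in enumerate(modules)}
    matrix = [[0] * len(modules) for _ in modules]
    for caller, callee in calls:
        matrix[index[caller]][index[callee]] = 1   # binary: does caller call callee?
    return matrix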
Table 1. Datasets used in this study
Project   Attributes   # Modules   Defect Rate
AR3       29           63          0.12
AR4       29           107         0.18
AR5       29           36          0.20
We built the call graphs for every project and
calculated call graph based ranking values for all
modules of the three projects. We adjusted static code
attributes as described in the next section. Then we
ran simulations as if each project were new. Thus,
each time we learned defect predictors from two
completed projects and tested the predictor on the
new project.
Table 2. Sample call graph matrix
           Module A   Module B   Module C   ...   Module X
Module A   0          0          1          ...   1
Module B   1          0          0          ...   0
Module C   1          1          0          ...   1
...        ...        ...        ...        ...   ...
Module X   0          0          0          ...   0
4. Call graph based ranking algorithm
Call graphs can be used in tracing the software
code module by module. Specifically, each node in
the call graph represents a procedure and each edge
(a, b) indicates that procedure “a” calls procedure
“b”. Call graphs can be used to better understand program behavior, i.e. the data flow between procedures. One advantage of call graphs is that they make it simple to locate procedures that are rarely or frequently called.
In our research we were inspired by search capabilities on the web. Most web search engines use the PageRank algorithm developed by Page and Brin [6]. The PageRank algorithm computes the most relevant results of a search by ranking web pages. We have adapted this ranking methodology to software modules (Figure 1). We hypothesize that frequently used modules and less used modules should have different defect characteristics. For instance, if a module is frequently used and the developers/testers are aware of that, they will be more careful in implementing/testing that module. On the contrary, less used modules may not reveal their defective nature, as they are not exercised frequently and their existing defects can only be detected with thorough testing.
Figure 1. Simple Example
The PageRank theory holds that an imaginary surfer, who randomly clicks on links, will continue surfing with a certain probability. The probability, at any step, that the surfer continues is the damping factor d. The damping factor is in the range 0 < d < 1 and is usually set to 0.85 [4, 6].
In our case, we do not take the damping factor as a constant; rather, we calculate it dynamically for each project. In our context, the web surfer analogy corresponds to calling other software modules. Therefore, for software with N modules, we define the damping factor d as the ratio of actual module calls to all possible module calls.
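A minimal sketch of this project-specific damping factor, assuming the call graph is stored as the 0/1 matrix described in Section 3:

def damping_factor(call_matrix):
    # d = actual module calls / all possible module calls (N * N)
    n = len(call_matrix)
    actual_calls = sum(sum(row) for row in call_matrix)
    return actual_calls / float(n * n)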
We use the same algorithm as PageRank: modules correspond to web pages, and the calling relationships between modules correspond to the links between web pages. We assign equal initial values (i.e. 1) to all modules and iteratively calculate module ranks. We call the results the Call Graph Based Ranking (CGBR) values; the CGBR framework assigns a rank to every module. The CGBR formula is given in Equation 1.
CGBR(A) = (1 - d) + d * Σ_i ( CGBR(T_i) / C(T_i) )    (1)
where CGBR(A) is the call graph based rank of module A, CGBR(T_i) is the call graph based rank of a module T_i that calls module A, C(T_i) is the number of outbound calls of module T_i, and d is the damping factor. The CGBR values usually converge after about 20 iterations [4]. The pseudo code of the CGBR
computation, adapted from PageRank, is given in Figure 2.
Figure 2. The pseudo code of the
CGBR computation.
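Since the pseudo code itself is not reproduced here, the following is a minimal sketch of the iteration in Equation 1, assuming the call graph is the 0/1 matrix of Section 3 (calls[i][j] == 1 if module i calls module j); it is an illustrative reconstruction, not the authors' tool:

def cgbr(calls, iterations=20):
    n = len(calls)
    d = sum(map(sum, calls)) / float(n * n)      # project-specific damping factor
    out_degree = [sum(row) for row in calls]     # C(T_i): outbound calls per module
    rank = [1.0] * n                             # equal initial values
    for _ in range(iterations):                  # ~20 iterations suffice [4]
        new_rank = []
        for a in range(n):
            incoming = sum(rank[t] / out_degree[t]
                           for t in range(n) if calls[t][a] and out_degree[t])
            new_rank.append((1 - d) + d * incoming)
        rank = new_rank
    return rank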
Considering the simple example for call graph
based ranking algorithm (recall Figure 1), there are 5
modules that interact with each other.
Table 3. Call graph matrix of the example
Caller \ Called   A   B   C   D   E
A                 0   1   1   0   1
B                 0   0   1   0   0
C                 1   0   0   0   0
D                 0   0   1   0   0
E                 0   0   0   0   0
For example, module A calls three other modules: B, C and E. The first row in Table 3 shows the modules called by module A; these calls are represented by a 1. In addition, looking at the first column of Table 3, we can see that module A is called only by module C. Module D is dead code unless it is a main routine, since it is not called by any other module. For this simple example the damping factor is d = 6/(5*5) = 0.24, since there are a total of 6 module calls in this software. It can be seen in Figure 1 that module C, which has the highest number of incoming calls, will have the greatest call graph based ranking value.
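As a concrete check, the sketch given after Figure 2 can be run on the Table 3 matrix (the call pattern is taken from the example; the code itself remains illustrative):

calls = [
    [0, 1, 1, 0, 1],   # A calls B, C, E
    [0, 0, 1, 0, 0],   # B calls C
    [1, 0, 0, 0, 0],   # C calls A
    [0, 0, 1, 0, 0],   # D calls C
    [0, 0, 0, 0, 0],   # E calls nothing (and D is never called)
]
ranks = cgbr(calls)    # d = 6/25 = 0.24; module C receives the highest rank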
5. Experiments
5.1. Design
In our experiments, we leave one of the projects out as the test case and build predictors on the remaining two projects. We repeat this for all three datasets. We use a random 90% of the training set for the models and repeat this procedure 20 times in order to overcome ordering effects [16, 17].
Before training our model, we applied log-filtering to the datasets and normalized our CGBR values. We classified the CGBR values into 10 bins and gave each bin a weight from 0.1 to 1.0. We then adjusted the static code attributes by multiplying each row of the data table with the corresponding weight.
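The preprocessing above can be sketched as follows; the exact bin boundaries and the log-filter offset are assumptions based on the text, not the authors' exact settings:

import math

def log_filter(features, eps=1e-6):
    # log-filtering as in [16]: replace each numeric value with its logarithm
    return [[math.log(v + eps) for v in row] for row in features]

def cgbr_weights(ranks):
    # normalize CGBR values, split them into 10 bins, map bins to weights 0.1 .. 1.0
    lo, hi = min(ranks), max(ranks)
    weights = []
    for r in ranks:
        normalized = (r - lo) / (hi - lo) if hi > lo else 1.0
        bin_index = min(int(normalized * 10), 9)
        weights.append((bin_index + 1) / 10.0)
    return weights

def adjust(features, weights):
    # sample weighting: scale every attribute of module i by its CGBR weight
    return [[w * v for v in row] for row, w in zip(features, weights)]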
Note that we use sample weighting rather than attribute weighting. Attribute weighting corresponds to assigning weights to the columns of the data table, which we have investigated elsewhere [28]. Feature subset selection is a special case of attribute weighting in which informative features are assigned a weight of 1 and the rest a weight of 0. However, the purpose of this research is to locate software modules for assigning testing priority rather than to decide which attributes to use.
We use a simple data miner, Naïve Bayes [16, 17, 18], implemented as described in Menzies et al. [16]. It is derived from Bayes' theorem, in which the prior probability and the evidence explain the posterior probability of an event.
P(H | E) = ( P(H) / P(E) ) * Π_i P(E_i | H)    (2)
From Equation 2, H denotes the class label of a module, i.e. defective or defect-free, and the evidence E_i consists of the attributes collected from the projects.
Figure 3. (pd, pf) and balance results with original and CGBR adjusted data
In order to find the posterior probability of a
new module given its attributes, we need to use the
prior probabilities of two classes and the evidence from
historical data. For numeric attributes with pairs of
mean and standard deviation (µ, σ), the probability of
each attribute given H is calculated with a Gaussian
density function [1] (Equation 3).
f(x) = ( 1 / ( sqrt(2π) σ ) ) * e^( -(x - μ)² / (2σ²) )    (3)
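A minimal sketch of the Gaussian Naïve Bayes scoring described by Equations 2 and 3 (an illustrative reconstruction under the data layout sketched in Section 3, not the authors' exact implementation):

import math

def gaussian(x, mu, sigma):
    sigma = max(sigma, 1e-6)                       # guard against zero variance
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def train(features, labels):
    model = {}
    for cls in (True, False):                      # defective / defect-free
        rows = [f for f, l in zip(features, labels) if l == cls]
        prior = len(rows) / float(len(features))
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column)
            stats.append((mu, math.sqrt(var)))
        model[cls] = (prior, stats)
    return model

def predict_defective(model, row):
    scores = {}
    for cls, (prior, stats) in model.items():
        likelihood = prior
        for x, (mu, sigma) in zip(row, stats):
            likelihood *= gaussian(x, mu, sigma)   # P(E_i | H), Equation 3
        scores[cls] = likelihood                   # P(E) cancels when comparing classes
    return scores[True] > scores[False]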
In order to compare our approach with previous research, we use three performance measures: probability of detection (pd), probability of false alarm (pf), and balance (bal), as defined in Menzies et al. [16]. Computing pd and pf requires the confusion matrix in Table 4. The performance measures are calculated using Equations 4, 5 and 6, respectively.
Table 4. Confusion matrix (rows: real class, columns: estimated class)
Real \ Estimated   Defective   Nondefective
Defective          A           C
Nondefective       B           D
pd = A / ( A + C) (4)
pf = B / (B + D) (5)
balance = 1 - sqrt( (0 - pf)² + (1 - pd)² ) / sqrt(2)    (6)
In the perfect scenario, a predictor catches all defects and never marks a defect-free module as defective; that is, pf should be low and pd should be high. Balance is a third performance measure and is used to choose the (pd, pf) pairs that are closest to this perfect scenario.
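The three measures, written out for the confusion matrix of Table 4 (a direct transcription of Equations 4-6):

import math

def performance(a, b, c, d):
    pd = b is None and None or a / float(a + c)   # probability of detection, Eq. 4
    pd = a / float(a + c)                         # probability of detection, Eq. 4
    pf = b / float(b + d)                         # probability of false alarm, Eq. 5
    bal = 1 - math.sqrt(pf ** 2 + (1 - pd) ** 2) / math.sqrt(2)   # balance, Eq. 6
    return pd, pf, bal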
5.2. Results from the experiments
Figure 3 shows (pd, pf, bal) results for the original
attributes (i.e. without CGBR adjustment) and the
dataset with CGBR adjustment respectively. These
figures show the box-plots of the (pd, pf) and balance
results. The leftmost and the rightmost lines in the
boxes correspond to the 25% and 75% quartiles and the
line in the middle of the box is the median. The
notches around the median correspond to the 95%
confidence interval of the median for paired t-test. The
dashed lines in the box-plots indicate 1.5 times the inter-quartile range, i.e. the distance between the 25% and 75% quartiles. Data points outside these lines are considered outliers.
Figure 3 shows that using the CGBR adjusted data slightly increases the median probability of detection and significantly decreases the median probability of false alarm. Moreover, CGBR's pd results are spread over a narrower interval, from which we conclude that CGBR produces more stable results than the standard approach. Note that 50% of the results yield detection rates over 88% and false alarm rates below 18%. Finally, the balance performance measure improves significantly.
We conducted another experiment to reflect the
practical usage of the model. In the new experiment we
use all available data in the training projects and test
the performance of the model on the third project.
Table 5 shows averaged values over all projects for the
median (pd, pf) rates of the two models and the required testing effort in terms of LOC [24]. It also includes the estimated LOC to be inspected under a random strategy for the corresponding pd rates of 79% and 77%. In order to detect 79% of the actual defects in the code, the estimated LOC for inspection with the random strategy is 13,866, whereas the estimated LOC with the original data is 12,513. Therefore, using the Naïve Bayes predictor with the original data achieves an improvement of 10% in testing effort. Similarly, for the 77% detection rate of the CGBR framework, the estimated LOC for the random strategy is 13,515 and the required LOC for the CGBR adjusted data is 10,439. The false alarm rate also drops from 30% to 20%. Hence the CGBR framework decreases both the probability of false alarm and the testing effort, where the improvement is 23%.
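Assuming the improvement on testing effort is the relative LOC reduction against the random strategy (an assumption that reproduces the reported figures):

original_improvement = (13866 - 12513) / 13866.0   # ~0.10 for Naive Bayes with original data
cgbr_improvement     = (13515 - 10439) / 13515.0   # ~0.23 for Naive Bayes with CGBR adjusted data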
Table 5. Analysis of the second experiment
                                               Naive Bayes with   Naive Bayes with
                                               Original Data      CGBR Adjusted Data
pd                                             79%                77%
pf                                             30%                20%
Estimated LOC for inspection                   12,513             10,439
Required LOC for inspection (random approach)  13,866             13,515
Improvement on testing effort                  10%                23%
The results for individual projects also validate our
findings. Table 6 shows the (pd, pf) rates with CGBR
adjusted data and with the original data. For all
projects we observe the same pattern: pd rates do not
change significantly and pf rates decrease for all
projects. For AR5, the pf rate is decreased at the cost of missing some defective modules. However, AR5 is a small dataset with only 8 defective modules out of a total of 36 modules. The CGBR adjusted model therefore misses only a single defective module compared to the original model.
We explain our observations as follows: adjusting the dataset with CGBR values provides additional information content compared to pure static code attributes. The adjustment incorporates inter module information through the architectural structure, in addition to the intra module information available in static code attributes.
Table 6. (pd, pf) rates for all projects with CGBR adjusted and original data
              AR3   AR4   AR5   Avg.
CGBR   pd     88    55    88    77
       pf     33    10    18    20
Orig.  pd     86    50    100   79
       pf     42    18    29    30
5.3. Discussion of the experiments
The benefits of using datasets adjusted with CGBR values are important and practical. Low pf rates make predictors practical to use, because they keep the amount of code to be manually inspected at a minimum. As pf rates increase, predictors become impractical, and not much better than random testing strategies. Our proposed approach decreases the pf rates and is therefore practical to use.
Note that the CGBR adjusted model locates an equivalent number of defects to the standard model, while producing fewer false alarms. Our results show that, if testers were to use the predictions from the CGBR adjusted model, they would spend 13% less testing effort than with the original Naïve Bayes model and 23% less than with random testing.
In addition to the discussion of the results, we investigated the correlation between the CGBR values and the defect content of the modules. Spearman's correlation coefficient indicates a negative correlation of -0.149. This means that there are relatively fewer defects when a module's CGBR value is high, which is what we speculated in the first place. If a module is called frequently by other modules, any defect in it is recognized and corrected quickly by the developers. On the other hand, modules that are not called by other modules will be more defective because they are not exercised frequently, so developers are not as aware of them as they are of frequently used modules. As a result, defects can easily hide in less used modules.
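A minimal sketch of this correlation check, assuming the CGBR values and per-module defect counts are available as parallel lists (scipy is used here purely for illustration; the -0.149 value is the one reported above):

from scipy.stats import spearmanr

def cgbr_defect_correlation(cgbr_values, defect_counts):
    rho, p_value = spearmanr(cgbr_values, defect_counts)
    return rho, p_value        # the study reports rho = -0.149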
6. Threats to validity
Public data sets do not contain call graph matrices [22]. This is a challenge, but we overcame it by using datasets from a company whose source code we had access to. Consequently, this study should be externally validated with other industrial and open source projects of different sizes. Although open source projects allow access to the code, matching defects at the module level and obtaining call graphs are difficult tasks. For this research, we implemented a tool for constructing
static call graphs from ANSI C source code and manually matched defects to modules.
Furthermore, the datasets we have used are
implemented in a procedural language, i.e. C, and we
are well aware that the same study should be repeated
with object oriented projects, where we suspect that
these effects would be more visible.
7. Conclusion
In this research, we propose a novel combination of
the static code attributes (SCA) and architectural
structure of software to make predictions about defect
content of software modules. We model intra module
properties by using SCA. Further, we model inter
module relations by weighting software modules,
where the weights are derived from static call graphs
using the well-known PageRank algorithm.
Our results are strongly consistent with recent discussions by Menzies et al., who state that SCA have limited information content and that further research should focus on increasing that content rather than on using new models for defect prediction [18]. They combine requirement level metrics with SCA, whereas we combine inter module relations with SCA in order to increase the information content in software data. We achieve this by modeling inter module relations with call graph based ranking (CGBR) values.
Our results show that data adjusted with CGBR values
do not change the median probability of detection, and
significantly decrease the median probability of false
alarm.
The proposed CGBR framework has important practical aspects. Defect predictors are helpful when they guide testers to the software modules that should be manually inspected. Therefore, defect predictors should detect as many defects as possible while raising few false alarms. False alarms waste testing effort, since they unnecessarily mark a defect-free module as defective. The CGBR framework's detection performance is no worse than that of state-of-the-art defect predictors, while its false alarm rates are significantly lower.
We should also note that CGBR is a framework rather than a model. In this research we have used it with the Naïve Bayes data miner; however, it can be safely applied to any model that may be used in practice.
Another contribution of this research is easing call graph analysis. Call graphs are generally inspected visually, and for large systems such visual inspection becomes difficult. The CGBR framework enables automated identification of heavily or scarcely used modules.
Our future work consists of replicating this study on
projects developed in object-oriented languages. We
also plan to expand the number of data miners, to show
the applicability of CGBR framework.
8. Acknowledgements
This research is supported by Boğaziçi University
research fund under grant number BAP 06HA104 and
the Turkish Scientific Research Council (TUBITAK)
under grant number EEEAG 108E014.
9. References
[1] Alpaydin, E., Introduction to Machine Learning, The
MIT Press, 2004.
[2] Arisholm, E. and Briand, L. C., “Predicting Fault-prone Components in a Java Legacy System”, ISESE’06, 2006.
[3] Basili, V. R., Briand, L. C., and Melo, W. L., "A
Validation of Object-Oriented Design Metrics as Quality
Indicators", Software Engineering, vol. 22, 1996, pp. 751-
761.
[4] Benincasa, C., Calden, A., Hanlon, E., Kindzerske, M.,
Law, K., Lam, E., Rhoades, J., Roy, I., Satz, M., Valentine,
E., and Whitaker, N., “Page Rank Algorithm”, Department of Mathematics and Statistics, University of Massachusetts, Amherst, May 2006.
[5] Boetticher, G., Menzies, T. and Ostrand, T., PROMISE
Repository of empirical software engineering data
http://promisedata.org/repository, West Virginia University,
Department of Computer Science, 2007.
[6] Brin, S. and Page, L., “The anatomy of a large-scale hypertextual Web search engine”, Computer Networks and ISDN Systems, 1998, pp. 107-117.
[7] Fenton, N. E., “A Critique of Software Defect Prediction
Models”, IEEE Transaction on Software Engineering ,
vol.25, no. 5, 1999.
[8] Hall, M. W. and Kennedy, K., “Efficient Call Graph
Analysis”, LOPLAS, vol.1, issue 3, 1992, p. 227-242.
[9] Harrold, M. J. , “Testing: a roadmap”, In Proceedings of
the Conference on the Future of Software Engineering, ACM
Press, New York, NY, 2000, pp. 61-72.
[10] Henry, S. and Kafura, D., “Software structure metrics based on information flow”, IEEE Trans. on Software Engineering, 7(5):510-518, September 1981.
[11] Khoshgoftaar T. M., and Seliya N., ”Fault Prediction
Modeling for Software Quality Estimation: Comparing
Commonly Used Techniques”, Empirical Software
Engineering, vol. 8, no. 3, 2003, pp. 255-283.
[12] Kitchenham, B. A., Pickard, L. M., and Linkman, S. J.,
"An evaluation of some design metrics", Software
Engineering Journal archive, vol. 5, Issue 1, 1990, pp. 50-58.
[13] Koru, A. G. and Liu, H., “Building effective defect-
prediction models in practice Software”, IEEE, vol. 22, Issue
6, Nov.-Dec. 2005, pp. 23 – 29.
[14] Koru, A. G. and Liu, H., “An Investigation of the Effect
of Module Size on Defect Prediction Using Static Measures.”
Proceeding of PROMISE 2005, St. Louis, Missouri, 2005,
pp. 1-6.
[15] Malaiya, Y. K., and Denton, J., “Module Size
Distribution and Defect Density”, ISSRE 2000, 2000, pp. 62-
71.
[16] Menzies, T., Greenwald, J. and Frank, A., “Data Mining
Static Code Attributes to Learn Defect Predictors”, IEEE
Transactions on Software Engineering, vol. 33, no.1, 2007,
pp. 2-13.
[17] Menzies, T., Turhan, B., Bener, A., and Distefano, J.,
“Cross- vs within-company defect prediction studies”,
Technical report, Computer Science, West Virginia
University, 2007.
[18] Menzies, T., Turhan, B., Bener, A., Gay, G., Cukic, B.,
Jiang, Y., "Implications of Ceiling Effects in Defect
Predictors", to appear in PROMISE Workshop in ICSE 2008,
Leipzig, Germany, May 2008
[19] Munson, J. and Khoshgoftaar, T. M., ”Regression
modelling of software quality: empirical investigation”,
Journal of Electronic Materials, vol. 19, no. 6, 1990, pp.
106-114.
[20] Munson, J. and Khoshgoftaar, T. M., ”The Detection of
Fault-Prone Programs”, IEEE Trans. on Software Eng., vol.
18, no. 5, 1992, pp. 423-433.
[21] Nagappan, N. and Ball, T., "Explaining Failures Using
Software Dependencies and Churn Metrics", In Proceedings
of the 1st International Symposium on Empirical Software
Engineering and Measurement, Madrid, Spain.
[22] Nasa/Wvu IV&V Facility, Metrics Data Program,
available from http://mdp.ivv.nasa.gov; Internet.
[23] Oral, A.D. and Bener, A., 2007, Defect Prediction for
Embedded Software, ISCIS 2007, 7-9 November, Ankara,
Turkey.
[24] Shepperd, M. and Ince, D., “A Critique of Three Metrics”, J. Systems and Software, vol. 26, no. 3, pp. 197-210, Sept. 1994.
[25] Tahat, B. V. , Korel, B. and Bader, A., "Requirement-
Based Automated Black-Box Test Generation", In
Proceedings of 25th Annual International Computer Software
and Applications Conference, Chicago, Illinois, 2001, pp.
489-495.
[26] Tosun, A., Turhan, B., Bener, A., “Ensemble of Defect
Predictors: An Industrial Application in Embedded Systems
Domain”, Technical Report, Computer Engineering,
Bogazici University, 2008.
[27] Turhan, B. and Bener, A., “A multivariate analysis of static code attributes for defect prediction”, In Proceedings of the Seventh International Conference on Quality Software, Los Alamitos, CA, USA, 11-12 October, IEEE, 2007, pp. 231-237.
[28] Turhan, B., and Bener, A., “Software Defect Prediction:
Heuristics for Weighted Naïve Bayes”, ICSOFT 2007,
Barcelona, Spain, 2007, pp. 22-25.
[29] Turhan, B., Bener, A., “Data Sampling for Cross
Company Defect Predictors”, Technical Report, Computer
Engineering, Bogazici University, 2008.
[30] Zimmermann, T. and Nagappan, N., "Predicting
Subsystem Failures using Dependency Graph Complexities,"
The 18th IEEE International Symposium on Software
Reliability (ISSRE '07), 2007, pp. 227-236.