A Multivariate Analysis of Static Code Attributes for Defect Prediction

Abstract

Defect prediction is important in order to reduce test times by allocating valuable test resources effectively. In this work, we propose a model using multivariate approaches in conjunction with Bayesian methods for defect predictions. The motivation behind using a multivariate approach is to overcome the independence assumption of univariate approaches about software attributes. Using Bayesian methods gives practitioners an idea about the defectiveness of software modules in a probabilistic framework rather than the hard classification methods such as decision trees. Furthermore the software attributes used in this work are chosen among the static code attributes that can easily be extracted from source code, which prevents human errors or subjectivity. These attributes are preprocessed with feature selection techniques to select the most relevant attributes for prediction. Finally we compared our proposed model with the best results reported so far on public datasets and we conclude that using multivariate approaches can perform better. Keywords: Defect prediction, Software Metrics, Naïve Bayes. Topics: Software Quality, Methods and Tools.
Burak Turhan, Ayşe Bener
Department of Computer Engineering, Bogazici University
34342, Bebek, Istanbul, Turkey
{turhanb, bener}
1. Introduction
Testing is the most costly and time-consuming part
of the software development lifecycle, regardless of the
development process used. Effective testing therefore
leads to a significant decrease in project costs and
schedules. The aim of defect prediction is to indicate
testing priorities, so that exhaustive testing can be
avoided. Using an automated model may help project
managers allocate testing resources effectively. Such
models can predict the degree of defectiveness if
relevant features of the software are supplied to them.
These relevant features are obtained by using software
metrics.
Researchers usually prefer to focus on the selection
of a subset of the available features [10]. Feature subset
selection is mainly preferred because of its
interpretability, since the selected features correspond
to actual and, on some occasions, controllable
measurements from software. This gives the ability to
generate rules about the desired values of metrics for
'good' software. It is easier to explain such rules to
programmers and managers [6].
This is also the answer to why most of the studies
use decision trees as predictors. Decision trees can be
interpreted as a set of rules and they can be understood
by less technically involved people [6]. But decision
trees are hard classification methods that can predict a
module as either defective or non-defective.
Alternatively, Bayesian approaches provide a
probabilistic framework and yield soft classification
methods with posterior probabilities attached to the
predictions [1]. This is why we employed Bayesian
approaches in this work.
On the other hand, feature subset selection requires
an exhaustive search to find the optimal subset, which
quickly becomes infeasible as the number of features
grows. Thus, feature selection algorithms use greedy
approaches like backward or forward selection [7]. In
forward selection, one starts with an empty set of
features, and a feature is selected only if it increases the
performance of the predictor; otherwise it is discarded.
Backward selection is similar in the sense that one
starts with all features, and a feature is removed if it
does not affect the performance of the predictor. These
approaches evaluate the features one at a time and
do not consider the effects of features taken as pairs,
triples and n-tuples. While a single feature may not
affect the estimation performance significantly, pairs,
triples or n-tuples of features may [7]. In order to
overcome this problem, this study employs feature
extraction techniques and compares the results with a
baseline study, where the InfoGain algorithm is used to
rank and select a subset of features [10].
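The limitation of one-at-a-time evaluation can be illustrated with a minimal sketch of greedy forward selection. The scoring function `toy_score` and the feature names are hypothetical and stand in for the cross-validated performance of a real predictor; they are not taken from the experiments in this paper.

```python
from typing import Callable, List, Sequence

def forward_selection(features: Sequence[str],
                      score: Callable[[List[str]], float]) -> List[str]:
    """Greedy forward selection: keep a feature only if it improves the score."""
    selected: List[str] = []
    best = score(selected)
    for f in features:                  # features are examined one at a time
        trial = score(selected + [f])
        if trial > best:                # keep the feature only on improvement
            selected.append(f)
            best = trial
    return selected

def toy_score(subset: List[str]) -> float:
    """Hypothetical score: 'a' and 'b' help only as a pair, 'c' helps alone."""
    s = 0.5
    if "c" in subset:
        s += 0.1
    if "a" in subset and "b" in subset:
        s += 0.3
    return s

chosen = forward_selection(["a", "b", "c"], toy_score)
print(chosen, toy_score(chosen), toy_score(["a", "b", "c"]))
```

In this toy setting the greedy pass keeps only `c`, because neither `a` nor `b` improves the score on its own, although the full set scores higher. This is the kind of pairwise effect that motivates feature extraction.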
The major contribution of this research is to incorporate
multivariate approaches rather than univariate ones.
Univariate approaches assume the independence of
features whereas multivariate approaches take the
relations between features into consideration.
Obviously univariate models are simpler than
multivariate models. While it is good practice to start
modeling with simple models, the problem at hand
should also be investigated by using more complex
models. It should then be validated, by measuring
performance, whether using more complex models is
worth the extra complexity introduced in the modeling.
This research performs experiments with both simple
and complex models and compares their performances.
In the following section, feature extraction methods
used in this research are briefly described. Then,
models used for defect prediction are explained. After
describing the experimental design and the results,
conclusions will be given.
2. Feature Extraction Methods
In feature extraction, new features are formed by
combining the existing ones. This new set of features
may not be as easy to interpret as the original one [6].
However, there are cases where the new features turn
out to be interpretable [5]. The new features may also
lead to better prediction performance by removing
irrelevant and non-informative features. An advantage
of the feature extraction methods used in this study is
that they project data to an orthogonal feature space.
One has to decide between ease of interpretability and
better prediction performance in such cases. In this
research the authors prefer better performance and
therefore explore feature extraction methodologies.
Principal Component Analysis (PCA) has been used
in other defect prediction studies [11], [13], [8],
[14], [2]. We also use PCA in this research. PCA
reveals the optimum linear structure of data points, but
it is unable to find nonlinear relations, if such relations
exist in the data. In order to investigate nonlinear
relations, we use the Isomap algorithm as another
feature extraction technique.
2.1. Isomap
Isomap inherits the advantages of PCA and extends
them to learn nonlinear structures that are hidden in
high dimensional data. Computational efficiency,
global optimality, and guarantee of asymptotic
convergence are its major features [16].
In general, Euclidean distance is used to calculate
the similarity of two instances. However, the use of the
Euclidean distance to represent pairwise distances
makes the model unable to preserve the intrinsic
geometry of the data. Two nearby points, in terms of
Euclidean distance, may indeed be distant, because
their actual distance is the path between these points
along the manifold. The length of the path along the
manifold is referred to as the geodesic distance [16]. A
2-D spiral is an example of a manifold, which is
actually a 1-D line that is folded and embedded in 2-D
(See Figure 1, adapted from [9]). Applying Isomap on
the spiral unfolds it to its true structure. Isomap simply
performs classical Multidimensional Scaling [4] on the
pairwise geodesic distance matrix.
Figure 1. Geodesic distance metric: Points X and Y are at
distinct ends of the spiral. Using Euclidean distance, the true
structure of the spiral, i.e. a 1-D line folded and embedded
in 2-D, cannot be revealed.
Geodesic distance represents similar (or different)
data points more accurately than the Euclidean
distance, but the question is how to estimate it? Here
the local linearity principle is used and it is assumed
that neighboring points lie on a linear patch of the
manifold, so for nearby points the Euclidean distances
correctly estimate the geodesic distances. For distant
points, the geodesic distances are estimated by adding
up neighboring distances over the manifold using a
shortest-path algorithm.
Isomap finds the true dimensionality of nonlinear
structures. The interpretation of projection axes can be
meaningful in some cases [5]. Isomap uses a single
parameter to define the neighborhood of a data point,
i.e. for the k-nearest neighbors of a data point, pairwise
geodesic distances are assumed to be equivalent to
Euclidean distances. This parameter should be fine
tuned, preferably by cross-validation, to obtain
optimum results. The data sample is transformed to have
a linear structure in the new projection space; e.g. the
spiral is unfolded to a line.
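The geodesic distance estimation described above can be sketched on a toy spiral. This is a minimal illustration, not the implementation used in the experiments: the spiral parametrisation, the choice k = 2 and the Floyd-Warshall shortest-path routine are assumptions made for brevity, and the final multidimensional scaling step of Isomap is omitted.

```python
import math

def spiral(n=80):
    """Toy manifold: n points on the spiral r = theta, theta in [1.5*pi, 4.5*pi]."""
    pts = []
    for i in range(n):
        theta = 1.5 * math.pi + 3.0 * math.pi * i / (n - 1)
        pts.append((theta * math.cos(theta), theta * math.sin(theta)))
    return pts

def geodesic_distances(points, k=2):
    """Estimate pairwise geodesic distances from a k-nearest-neighbour graph."""
    n = len(points)
    euclid = [[math.dist(p, q) for q in points] for p in points]
    inf = float("inf")
    geo = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    # Local linearity: between neighbours, geodesic ~ Euclidean distance.
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: euclid[i][j])[1:k + 1]
        for j in nearest:
            geo[i][j] = geo[j][i] = euclid[i][j]
    # Distant points: add up neighbouring distances along shortest paths
    # (Floyd-Warshall, chosen here only because it is short to write).
    for m in range(n):
        for i in range(n):
            via = geo[i][m]
            for j in range(n):
                if via + geo[m][j] < geo[i][j]:
                    geo[i][j] = via + geo[m][j]
    return euclid, geo

points = spiral()
euclid, geo = geodesic_distances(points, k=2)
end_to_end_euclid = euclid[0][-1]   # straight line between the two ends
end_to_end_geo = geo[0][-1]         # path length along the spiral
print(round(end_to_end_euclid, 2), round(end_to_end_geo, 2))
```

For the two ends of this spiral the estimated geodesic distance is several times the Euclidean distance, which is the situation sketched in Figure 1. Too large a value of k would add edges that jump between turns of the spiral and shorten the estimated paths, which is why the neighborhood parameter needs tuning.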
3. Predictor Models
This section explains the predictor models used for defect
prediction. The Naive Bayes classifier is taken as a
baseline, since it has been shown to achieve the best results
reported so far [10]. We remove the assumptions of the Naive
Bayes classifier one at a time and construct the linear
and quadratic discriminants. The assumption in Naive
Bayes is that the features of a data sample are
independent, thus it employs the univariate normal
distribution. We believe this assumption is not valid for
software data, since there are correlations between
software data features. So we use a multivariate normal
distribution to model the correlations among features.
In the next section univariate and multivariate normal
distributions are briefly explained.
3.1. Univariate vs. Multivariate Normal
In the univariate normal distribution, $x \sim N(\mu, \sigma^2)$,
$x$ is said to be normally distributed with mean $\mu$ and
standard deviation $\sigma$, and the probability density
function (pdf) is defined as:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad (1)$$
The term inside the exponential in Equation 1
is the normalized Euclidean distance, where the
distance of a data sample $x$ to the sample mean $\mu$ is
measured in terms of standard deviations $\sigma$. This
scales the distances of different features in
case feature values vary significantly. This measure
does not consider the correlations among features.
In the multivariate case, $x$ is a d-dimensional vector
that is normally distributed, $x \sim N_d(\mu, \Sigma)$, and the pdf
of a multivariate normal distribution is defined as:
$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right] \qquad (2)$$
where $\Sigma$ is the covariance matrix and $\mu$ is the mean
vector. The term inside the exponential in
Equation 2 is another distance function, called the
Mahalanobis distance [1]. In this case, the distance to
the mean vector is normalized by the covariance matrix
and the correlations of features are also considered.
This results in less contribution of highly correlated
features and features with high variance.
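The difference between the two distance functions can be seen in a small two-feature sketch. The covariance matrix (unit variances, correlation 0.9) and the two sample points are illustrative assumptions, not values from the datasets used in this paper.

```python
def normalized_euclidean_sq(x, mu, sigma):
    """Squared distance in units of standard deviations, ignoring correlation."""
    return sum(((xj - mj) / sj) ** 2 for xj, mj, sj in zip(x, mu, sigma))

def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^-1 (x - mu) for 2 features."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))   # inverse of a 2x2 matrix
    dx = (x[0] - mu[0], x[1] - mu[1])
    return (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))

mu = (0.0, 0.0)
sigma = (1.0, 1.0)
cov = ((1.0, 0.9), (0.9, 1.0))      # assumed: two strongly correlated features

along = (1.0, 1.0)                  # deviates in the direction of the correlation
against = (1.0, -1.0)               # deviates against the correlation

for point in (along, against):
    print(point,
          normalized_euclidean_sq(point, mu, sigma),
          mahalanobis_sq_2d(point, mu, cov))
```

Both points are equally far from the mean under the normalized Euclidean distance, whereas the Mahalanobis distance treats the point that follows the correlation as close to the mean and the point that violates it as far away.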
Our assumption is that software data features are
correlated and a multivariate model would be more
appropriate than the univariate model. Besides,
multivariate normal distribution is analytically simple,
tractable and robust to departures from normality [1].
As the no free lunch theorem states [17], nothing comes
for free, and using a multivariate model increases the
number of parameters to estimate. In the univariate
case, only 2 parameters, $\mu$ and $\sigma$, are estimated, while in
the multivariate case, d parameters for $\mu$ and d x d
parameters for $\Sigma$ need to be estimated.
3.2. Multivariate Classification
In software defect prediction, one aims to
discriminate between two classes $C_i$, where samples in
one class are non-defective and samples in the other
are defective. We combine the multivariate normal
distribution and the Bayes rule, use different
assumptions, and achieve different discriminants with
different complexity levels (See Table 1). We prefer the
discriminant point of view, since it is geometrically
interpretable. A discriminant in general is a hyperplane
that separates the d-dimensional space into 2 disjoint
subspaces. The general structure of a discriminant is
explained next.
Table 1. Complexities of predictors in a K-class problem with d features.
Predictor   # Parameters
QD          (K x (d x d)) + (K x d) + (K)
LD          (d x d) + (K x d) + (K)
NB          (d) + (K x d) + (K)
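To give a feeling for the sizes involved, the counts of Table 1 can be evaluated for a two-class problem with d = 38, the number of static code attributes per dataset reported in this paper. The function name is illustrative; the formulas are those of Table 1 as written, which count a full d x d covariance matrix without exploiting its symmetry.

```python
def n_parameters(predictor: str, K: int, d: int) -> int:
    """Number of parameters per Table 1: covariance + means + priors."""
    covariance = {"QD": K * d * d,   # one full covariance matrix per class
                  "LD": d * d,       # one shared covariance matrix
                  "NB": d}[predictor]  # one shared diagonal (variances only)
    return covariance + K * d + K

counts = {p: n_parameters(p, K=2, d=38) for p in ("QD", "LD", "NB")}
print(counts)   # {'QD': 2966, 'LD': 1522, 'NB': 116}
```

With datasets of only a few hundred modules, the gap between 116 and 2966 parameters indicates why the more complex models are harder to estimate reliably.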
Bayes theorem states that the posterior distribution
of a sample is proportional to the prior distribution and
the likelihood of the given sample. More formally:
$$P(C_i|x) = \frac{P(x|C_i)\,P(C_i)}{P(x)} \qquad (4)$$
Equation 4 is read as:
"The probability of a given data instance x to
belong to class $C_i$ is equal to the multiplication of the
likelihood that x is coming from the distribution that
generates $C_i$ and the probability of observing $C_i$'s in
the whole sample, normalized by the evidence."
Evidence is given by:
$$P(x) = \sum_i P(x|C_i)\,P(C_i) \qquad (5)$$
and it is a normalization constant for all classes, thus it
can be safely discarded. Then Equation 4 becomes:
$$P(C_i|x) \propto P(x|C_i)\,P(C_i) \qquad (6)$$
In a classification problem we compute the posterior
probabilities $P(C_i|x)$ for each class and choose the one
with the highest posterior. This is equivalent to
defining a discriminant function $g_i(x)$ for class $C_i$,
where $g_i(x)$ is derived from Equation 6 by taking
logarithms for convenience:
$$g_i(x) = \log P(x|C_i) + \log P(C_i) \qquad (7)$$
In order to obtain a discriminant value, one needs
to compute the prior and likelihood terms. The prior
probability $P(C_i)$ can be estimated from the sample by
counting. The critical issue is to choose a suitable
distribution for the likelihood term $P(x|C_i)$. This is
where the multivariate normal distribution comes into
play. In this study the likelihood term is modeled by the
multivariate normal distribution.
Computing discriminant values for each class and
assigning the instance to the class with the highest
value is equivalent to using Bayes Theorem for
choosing the class with the highest posterior
probability. For the 2-class case, it is sufficient to
construct a single discriminant as the difference of the
two class discriminants, $g(x) = g_i(x) - g_j(x)$ for
classes $C_i$ and $C_j$.
Using the discriminant point of view, we will explain
different predictors in the following section. In all
cases, an instance $x$ is classified as $C_i$ such that
$$i = \arg\max_k \, g_k(x)$$
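The equivalence between choosing the highest posterior and choosing the highest discriminant value can be checked numerically. In this sketch the priors and likelihood values are made-up numbers, and index 0 stands for the non-defective class and index 1 for the defective class purely for illustration.

```python
import math

def posteriors(likelihoods, priors):
    """Bayes rule: P(C_i|x) = P(x|C_i) P(C_i) / sum_k P(x|C_k) P(C_k)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def discriminants(likelihoods, priors):
    """g_i(x) = log P(x|C_i) + log P(C_i); the evidence term is dropped."""
    return [math.log(l) + math.log(p) for l, p in zip(likelihoods, priors)]

priors = [0.9, 0.1]          # assumed: 10% of the modules are defective
likelihoods = [0.02, 0.30]   # assumed values of P(x|C_i) for one module x

post = posteriors(likelihoods, priors)
g = discriminants(likelihoods, priors)
by_posterior = post.index(max(post))
by_discriminant = g.index(max(g))
print(post, by_posterior, by_discriminant)
```

Both rules pick the same class, but only the posterior carries the soft information that the module is defective with probability 0.625 rather than with certainty.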
3.3. Quadratic Discriminant
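Assumption: Each class has a distinct $\Sigma_i$ and $\mu_i$.
Derivation: Combining Equation 2 and Equation 6
and defining new variables $W_i$, $w_i$ and $w_{i0}$, the
quadratic discriminant is obtained as
$$g_i(x) = x^T W_i x + w_i^T x + w_{i0}$$
where
$$W_i = -\frac{1}{2} S_i^{-1}, \qquad w_i = S_i^{-1} m_i, \qquad w_{i0} = -\frac{1}{2} m_i^T S_i^{-1} m_i - \frac{1}{2}\log|S_i| + \log \hat{P}(C_i)$$
and $S_i$, $m_i$ and $\hat{P}(C_i)$ are the maximum likelihood estimates
of $\Sigma_i$, $\mu_i$ and $P(C_i)$ respectively.
The quadratic model considers the correlation of the
features differently for each class. In the case of K classes,
the number of parameters to estimate is K x (d x d) for
covariance estimates and (K x d) for mean estimates.
Also K prior probability estimates are needed.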
3.4. Linear Discriminant
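Assumption: Each class has a common $\Sigma$ and a distinct $\mu_i$.
Derivation: The assumption states that classes share a
common covariance matrix. The estimator is found
either by using the whole data sample or by the weighted
average of class covariances, which is given as
$$S = \sum_i \hat{P}(C_i)\, S_i$$
Placing this term in Equation 7 we get
$$g_i(x) = -\frac{1}{2}(x - m_i)^T S^{-1} (x - m_i) + \log \hat{P}(C_i)$$
which is now a linear discriminant in the form of
$$g_i(x) = w_i^T x + w_{i0}, \qquad w_i = S^{-1} m_i, \qquad w_{i0} = -\frac{1}{2} m_i^T S^{-1} m_i + \log \hat{P}(C_i)$$
since the quadratic term $x^T S^{-1} x$ is common to all classes and can be dropped.
This model considers the correlation of the features
but assumes the variances and correlations of features
are the same for both classes. The number of
parameters to estimate for the covariance matrix is now
independent of K. For covariance estimates (d x d), for
mean estimates (K x d) and for priors K parameters
should be estimated.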
3.5. Naïve Bayes
Assumption: Each class has a common $\Sigma$ with off-diagonal
entries equal to 0, and a distinct $\mu_i$.
Derivation: The assumption states the independence of
features by using a diagonal covariance matrix. Then
the model reduces to a univariate model given in
Equation 17.
$$g_i(x) = -\frac{1}{2} \sum_{j=1}^{d} \left(\frac{x_j - m_{ij}}{s_j}\right)^2 + \log \hat{P}(C_i) \qquad (17)$$
This model does not take the correlation of the
features into account and it measures the deviation
from the mean in terms of standard deviations. For
Naive Bayes, (d) covariance, (K x d) mean and K prior
parameters should be estimated.
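The three sets of assumptions can be compared in a minimal two-feature sketch. It evaluates the Gaussian discriminant $-\frac{1}{2}\log|S| - \frac{1}{2}(x-m)^T S^{-1}(x-m) + \log P(C_i)$ with a per-class covariance (QD), a prior-weighted pooled covariance (LD), or a pooled covariance with the off-diagonal entries set to zero (NB). The toy data, the test point and all function names are assumptions for illustration; the code is restricted to 2 features so that the matrix inverse can be written out by hand.

```python
import math

def mean(rows):
    return [sum(r[j] for r in rows) / len(rows) for j in range(2)]

def covariance(rows):
    """Maximum likelihood (divide by N) estimate of the 2x2 covariance."""
    m, n = mean(rows), len(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / n
             for b in range(2)] for a in range(2)]

def discriminant(x, m, S, prior):
    """-1/2 log|S| - 1/2 (x-m)^T S^-1 (x-m) + log prior, for 2 features."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    dx0, dx1 = x[0] - m[0], x[1] - m[1]
    maha = (S[1][1] * dx0 * dx0 - (S[0][1] + S[1][0]) * dx0 * dx1
            + S[0][0] * dx1 * dx1) / det
    return -0.5 * math.log(det) - 0.5 * maha + math.log(prior)

def fit(classes, model):
    """classes: one list of samples per class; model: 'QD', 'LD' or 'NB'."""
    total = sum(len(rows) for rows in classes)
    priors = [len(rows) / total for rows in classes]
    means = [mean(rows) for rows in classes]
    covs = [covariance(rows) for rows in classes]
    if model != "QD":
        # LD and NB share one covariance: prior-weighted average of class ones.
        pooled = [[sum(p * S[a][b] for p, S in zip(priors, covs))
                   for b in range(2)] for a in range(2)]
        if model == "NB":
            pooled[0][1] = pooled[1][0] = 0.0   # independence assumption
        covs = [pooled] * len(classes)
    return list(zip(means, covs, priors))

def predict(x, params):
    scores = [discriminant(x, m, S, p) for m, S, p in params]
    return scores.index(max(scores))

# Toy data: two classes with the same strongly correlated features.
class0 = [(-2, -1), (-1, -2), (1, 2), (2, 1)]
class1 = [(0, -1), (1, -2), (3, 2), (4, 1)]
x = (1.2, 1.5)
labels = {m: predict(x, fit([class0, class1], m)) for m in ("QD", "LD", "NB")}
print(labels)
```

On this toy data the point lies along the correlation direction of the first class, so QD and LD assign it to class 0, whereas NB, which ignores the correlation, assigns it to class 1 because that mean is closer in standard deviations.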
4. Experiments and Results
Design of experiments and evaluation of results in
software defect prediction problems have particular
importance. Most of the experiment designs have
important flaws such as self tests and insufficient
performance measures as reported in [10]. Most
research reported only the accuracy of predictors as a
performance indicator. Examining defect prediction
datasets, it is easily seen that they are not balanced. In
other words, the number of defective instances is much
less than the number of nondefective instances. As
pointed out in [10], one can achieve 95% accuracy on a
5% defective dataset by building a dummy classifier
that always classifies instances as nondefective. A
framework of MxN experiment design, which means M
replications of N holdout (cross validation)
experiments, is also given in [10] and additional
performance measures are reported, such as probability
of detection (pd) and probability of false alarm (pf).
This research follows the same notation.
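An MxN design can be sketched as an index generator. The defaults M = N = 10 correspond to the 10 repetitions of 10-fold cross-validation used in this paper; the function name, the seed and the way the bins are cut are assumptions for illustration.

```python
import random

def mxn_splits(n_samples, m=10, n=10, seed=0):
    """Yield (train, test) index lists for M replications of N-fold CV."""
    rng = random.Random(seed)
    for _ in range(m):
        order = list(range(n_samples))
        rng.shuffle(order)                       # new random order per replication
        bins = [order[b::n] for b in range(n)]   # N bins of nearly equal size
        for b in range(n):
            test = bins[b]
            train = [i for k, other in enumerate(bins) if k != b for i in other]
            yield train, test

# 125 modules, the size of the smallest dataset used in this paper.
splits = list(mxn_splits(125))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Every module appears in the test bin exactly once per replication, so no model is ever evaluated on data it was trained on, which avoids the self test flaw mentioned above.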
Figure 2. Experiment Design.
The experiments conducted in [10] are replicated
and extended in this study. The framework for experiment
design in [10] is followed and updated as in Figure 2.
In order to extract features, PCA and Isomap are
performed on the log filtered data attributes. An
advantage of log filtering is that it scales the features so
that extreme values are handled. Another advantage of
log filtering is that the normal distribution fits the data
better. In other words, data attributes are assumed to be
lognormally distributed. 5 to 30 features are extracted for
all datasets using PCA and Isomap. The best subset of
features reported in [10] is also used in the
experiments. This subset of features differs in each
dataset. The best performing dimensionalities achieved
by PCA and Isomap are also different for each dataset.
These observations support the idea that there is no
global set of features that describes the software. So,
as many software metrics as possible should be
collected and analyzed, as long as it is feasible to
collect them.
A 10-fold cross-validation approach is used in the
experiments. That is, datasets are divided into 10 bins,
9 bins are used for training and 1 bin is used for testing.
Repeating these 10 folds ensures that each bin is used
for training and testing while minimizing sampling
bias. Each holdout experiment is also repeated 10 times
and in each repetition the datasets are randomized to
overcome any ordering effect and to achieve reliable
statistics. Reported results are the mean values of these
100 experiments for each dataset. Quadratic
discriminant (QD), linear discriminant (LD) and Naive
Bayes (NB) are the predictors used in this research. As
performance measures, pd, pf and balance (bal) are
used. pd is a measure for correctly detecting
defective modules: it is the ratio of the number of
modules correctly predicted as defective to the number
of actually defective modules. Obviously, higher pd
values are desired. As the name suggests, pf is a
measure for false alarms and it is interpreted as the
probability of predicting a module as defective when it
is in fact not. pf is desired to have low values. The
balance measure is used to choose the optimal (pd, pf)
pairs such that the area under the ROC curve is
maximized. It is based on the normalized Euclidean
distance from the desired point (pf, pd) = (0, 1) to the
observed (pf, pd) point in a ROC curve, subtracted from 1
so that higher values are better [10]:
$$bal = 1 - \frac{\sqrt{(0 - pf)^2 + (1 - pd)^2}}{\sqrt{2}}$$
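The three measures can be computed from actual and predicted labels as in the sketch below. The balance is taken as one minus the normalized distance to the ideal point, following [10]; this form reproduces the bal values of Table 3, e.g. (pd, pf) = (72, 13) gives bal = 78. The function names and the example labels are illustrative, and the code assumes that both classes occur in the actual labels.

```python
import math

def balance(pd, pf):
    """1 - normalized Euclidean distance from (pf, pd) to the ideal (0, 1)."""
    return 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)

def pd_pf_bal(actual, predicted):
    """Labels are 1 for defective and 0 for non-defective modules."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # defects detected
    fn = sum(1 for a, p in pairs if a and not p)      # defects missed
    fp = sum(1 for a, p in pairs if not a and p)      # false alarms
    tn = sum(1 for a, p in pairs if not a and not p)
    pd = tp / (tp + fn)
    pf = fp / (fp + tn)
    accuracy = (tp + tn) / len(pairs)
    return pd, pf, balance(pd, pf), accuracy

# Dummy classifier on a 5% defective dataset: always predicts non-defective.
actual = [1] * 5 + [0] * 95
predicted = [0] * 100
dummy_pd, dummy_pf, dummy_bal, dummy_acc = pd_pf_bal(actual, predicted)
print(dummy_pd, dummy_pf, round(dummy_bal, 3), dummy_acc)
```

The dummy classifier reaches 95% accuracy but detects no defects at all (pd = 0), and its balance is only about 0.29, which shows why accuracy alone is a misleading indicator on such unbalanced data.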
Table 2. Dataset Descriptions
Name   #Modules   DefectRate (%)
CM1    505        9
PC1    1107       6
PC2    5589       0.6
PC3    1563       10
PC4    1458       12
KC3    458        9
KC4    125        4
MW1    403        9
For evaluation, 8 different public datasets obtained
from the NASA MDP repository [12] are used. Sample
sizes vary from 125 to 5589 modules. Each dataset has
38 features representing static code attributes. As seen
in Table 2, defect rates are very low, which justifies
the use of the above mentioned performance measures. All
implementations are done in the MATLAB environment
using standard toolboxes.
Results are tabulated in Table 3. Mean results of
(pd, pf) pairs selected by the bal measure after 10x10
holdout experiments are given. For PCA and ISO
labeled entries, these results are selected from 5 to 30
features obtained by PCA and Isomap respectively.
For SUB labeled entries, the best subset of features
obtained by InfoGain are used as reported in [10]. In
Table 3, results indicated in bold face are statistically
significantly better than other methods with α = 0.05 after
applying a t-test, considering the pd performance measure.
Table 3. Results
Data   Method    pd(%)   pf(%)   bal(%)
CM1    …         …       32      74
PC1    …         …       25      71
PC2    PCA+NB    72      13      78
PC3    …         …       31      72
PC4    PCA+QD    88      20      83
KC3    …         …       25      77
KC4    ISO+LD    78      27      75
MW1    …         …       34      69
Mean             77      25      76
Subset selection is better than feature extraction
methods in only 1 out of 8 datasets (CM1). In the
remaining datasets, the best performances are obtained
by applying either PCA or Isomap instead of InfoGain.
In PC1, PC2, PC3 and PC4, the best mean performances
are achieved by applying PCA, and in KC3, KC4 and
MW1 Isomap yielded better results. It is observed that
Isomap gives the best performances on relatively small
datasets. As the number of modules increases, PCA
performs better.
Except for the PC3 dataset, our replicated results are
similar to the mean results reported in [10]. But variances
of replicated experiments (i.e. subsetting) are larger
than PCA and Isomap approach especially for
measure. NB and LD are observed to behave similarly
whereas QD results are different from NB and LD in
terms of performance. It is observed for QD that, as the
number of features increases, performances get worse,
especially for the pf measure, and the variances increase.
A possible reason for this is the complexity of the model
(i.e. too many parameters to estimate).
As for the predictors, Naive Bayes (NB) is chosen 4
times, linear discriminant (LD) is chosen 3 times and
quadratic discriminant (QD) is chosen only once.
From these results, it can be concluded that claims
stating any of these predictors as the 'globally' correct
one, should be avoided. As expected, no specific
configuration of a feature selection and a predictor is
always better than the others. Even though NB is the
majority winner, it is clearly seen that performances on
some datasets are increased by using multivariate
methods: QD and LD. Applying QD gives the best
result in PC4 dataset, but it is not statistically
significant. It can be concluded that QD can be
discarded because of its complexity. In cases where
LD wins, statistical significances are observed, so the
additional complexity introduced can be justified.
There may be other predictors performing better than
these. Constructing better predictors is an open ended
problem and, as better results are reported, the problem
gets more difficult due to the ceiling effect, i.e. it is
harder to confirm the hypothesis that predictor A performs
better than predictor B when A and B are at or close to
the maximum achievable performance [3].
Overall performance of the approach improves on
the best results reported so far [10]. Previous research
reported mean (pd, pf) = (71, 25), which yields bal = 72
averaged over all datasets. Replication of these
experiments yields mean (pd, pf) = (64, 19) and bal =
71. After experimenting with all possible combinations
of InfoGain, PCA and Isomap with NB, LD and QD, an
improvement is observed by picking the best
combinations for all datasets. Improved results yield
mean (pd, pf) = (77, 25), where bal = 76. While no
change in the pf measure is observed, the pd measure is
improved by 6 percentage points.
A final comment should be made about the running
times of the algorithms. As expected, QD takes more time
than LD and NB. However, this difference is not
substantial. The dominant factor that affects the running
times is the sample size.
5. Conclusions and Future Work
In this research, software defect prediction is
considered as a data mining problem. Several
experiments are conducted, including the replication of
previous research on publicly available datasets from
the NASA repository. Performances of different predictors
together with different feature extraction methods are
evaluated. Results are compared with the best
performances reported so far and some improvements
are observed.
The previous research advises that one should not
seek a globally best subset of features, but rather focus
on building predictors that combine information from
multiple features. In addition, the authors believe that
research should focus on a balanced combination of
those. In other words, building successful predictors
depends on how useful the information supplied to them is.
While conducting research on better predictors, research
on obtaining useful information from features should
also be carried out. A contribution of this research is
using linear and nonlinear feature extraction methods in
order to combine information from multiple features. In
software defect prediction there is more research on
feature subset selection than on feature extraction. Results
suggest that it is worth exploring feature extraction
further to deepen our knowledge of it.
Another contribution of this research is the
modeling of correlations among features. Improved
results are obtained by using multivariate statistical
methods. Furthermore, the probabilities of predictions
are provided by employing Bayesian approaches,
which can give project managers and practitioners a
better understanding of the defectiveness of software
modules.
Further research should investigate the validation of
the log normal distribution assumption of software data
used in this research. It is better practice to apply
goodness of fit tests, rather than assuming a normal
distribution. Other exponential family distributions
should also be investigated. Another research area is to
investigate filters to transform data into suitable
This research is supported in part by Bogazici
University research fund under grant number BAP-
Authors would like to thank Koray Balcı,
who has contributed to the earlier versions of this
References
[1] E. Alpaydin, Introduction to Machine Learning, The MIT Press, October 2004.
[2] E. Ceylan, F. O. Kutlubay, and A. B. Bener, “Software defect identification using machine learning techniques”, In Proceedings of the 32nd EUROMICRO Conference on Software Engineering and Advanced Applications, IEEE Computer Society, Washington, DC, USA, 2006, pp. 240–
[3] P. R. Cohen, Empirical Methods for Artificial Intelligence, The MIT Press, London, England, 1995.
[4] T. Cox and M. Cox, Multidimensional Scaling, Chapman & Hall, London, 1994.
[5] V. de Silva and J. B. Tenenbaum, “Global versus local methods in nonlinear dimensionality reduction”, In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, 15, MIT Press, Cambridge, MA, 2003, pp. 705–712.
[6] N. E. Fenton and M. Neil, “A critique of software defect prediction models”, IEEE Transactions on Software Engineering, 25(5), 1999, pp. 675–689.
[7] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection”, Journal of Machine Learning Research, 3, 2003, pp. 1157–1182.
[8] T. M. Khoshgoftaar and J. C. Munson, “Predicting software development errors using software complexity metrics”, IEEE Journal on Selected Areas in Communications, 8(2), Feb. 1990, pp. 253–261.
[9] J. A. Lee, A. Lendasse, N. Donckers, and M. Verleysen, “A robust nonlinear projection method”, In Proceedings of ESANN 2000, European Symposium on Artificial Neural Networks, Bruges (Belgium), 2000, pp. 13–20.
[10] T. Menzies, J. Greenwald, and A. Frank, “Data mining static code attributes to learn defect predictors”, IEEE Transactions on Software Engineering, 33(1), 2007, pp. 2–
[11] J. Munson and Y. M. Khoshgoftaar, “Regression modelling of software quality: empirical investigation”, Electron. Mater., 19(6), 1990, pp. 106–114.
[12] NASA/WVU IV&V Facility, Metrics Data Program, available from
[13] M. Neil, “Multivariate assessment of software products”, Softw. Test., Verif. Reliab., 1(4), 1992, pp. 17–37.
[14] D. E. Neumann, “An enhanced neural network technique for software risk analysis”, IEEE Transactions on Software Engineering, 28(9), 2002, pp. 904–912.
[15] G. Boetticher, T. Menzies and T. Ostrand, PROMISE Repository of empirical software engineering data, West Virginia University, Department of Computer Science, 2007.
[16] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction”, Science, 290, 2000, pp. 2319–2323.
[17] D. H. Wolpert and W. G. Macready, “No free lunch theorems for optimization”, IEEE Transactions on Evolutionary Computation, 1(1), April 1997, pp. 67–82.
... A training model is built with the objects belonging to the known classes. The modules of the software code with faults can be predicted by using classification methods [13][14][15][16]. In the literature, classification algorithms such as decision trees, Bayesian classifiers, rule-based classifiers, artificial neural networks (ANN), k-nearest neighbor classifiers, support vector machines, and collective learning methods are frequently used. ...
... Its calculation applies the following formula. The FM is calculated by the formula given in Equation (14). ...
Full-text available
Alongside the modern software development life cycle approaches, software testing has gained more importance and has become an area researched actively within the software engineering discipline. In this study, machine learning and deep learning-related software fault predictions were made through a data set named SFP XP-TDD, which was created using three different developed software projects. A data set of five different classifiers widely used in the literature and their Rotation Forest classifier ensemble versions were trained and tested using this data set. Numerous publications in the literature discussed software fault predictions through ML algorithms addressing solutions to different problems. Some of these articles indicated the usage of feature selection algorithms to improve classification performance, while others reported operating ensemble machine learning algorithms for software fault predictions. Besides, a detailed literature review revealed that there were few studies involving software fault prediction with DL algorithms due to the small sample sizes in the data sets and the low success rates in the tests performed on these datasets. As a result, the major contribution of this research was to statistically demonstrate that DL algorithms outperformed ML algorithms in data sets with large sample values via employing three separate software fault prediction datasets. The experimental outcomes of a model that includes a layer of recurrent neural networks (RNNs) were enclosed within this study. Alongside the aforementioned and generated data sets, the study also utilized the Eclipse and Apache Active MQ data sets in to test the effectiveness of the proposed deep learning method.
... Learning fault predictors has been generally given as a well organized method in the area of Software Quality Assurance. Concern them can direct to describe testing main concerns enhanced to avoid fatiguing testing, the generally expensive element of software improvement life sequence [26]. Many fault databanks has been composed from diverse schemes to investigate a variety of statistical and machine learning methods. ...
In quality of software, a fault discovery course is anticipated; intended to recover the taking up of various methods using cluster classifiers. Initially the classifiers are qualified on software record and then utilized to forecast if a forthcoming transformation originates a defect. Shortcomings of previous classifier based error prediction methods are inadequate presentation for realistic utilization and slow-moving forecast times due to a huge number of learned machine characteristics. Feature selection is a procedure in choosing a subset of pertinent characteristics so that the eminence of forecast replica can be enhanced. So that prediction recital of grouping techniques will be enhanced or sustained, whereas learning instance is considerably abridged. This effort commences by presenting a general idea of the datasets for error prediction, and then features a novel procedure for feature assortment by means of wrapper methods namely Fuzzy Neural Network (FNN) and Kernel Based Support Vector Machine (KSVM). The features chosen from FNN and KSVM are measured as significant characters. This effort examines numerous feature selection wrapper methods that are normally appropriate to grouping based error prediction. The system castoffs not as much of significant characters until optimal grouping recital are attained. The whole number of characters utilized for guidance is considerably reduced, frequently to lower than 15% of the unique. The general performance metrics is make used to estimate grouping systems such as accurateness, Recall, Precision, and F-Measure. It demonstrates that the anticipated Hybrid Hierarchical K-Centers (HHKC) grouping executes enhanced software quality compared to conventional grouping methods.
... Recently, a few researchers proposed deep learning-based SDP models [48,49]. Turhan and Bener [50] and Pain and Dugan [51] applied a Bayesian network to the NASA dataset to build an SDP model and obtained effective results. Elish and Elish [33] performed experiments over NASA datasets using SVM and concluded that SVM surpasses the performance of the basic logistic regression (LR) model. ...
Predicting defects during software testing greatly reduces testing effort and helps deliver a high-quality software system. Owing to the skewed distribution of public datasets, software defect prediction (SDP) suffers from the class imbalance problem, which leads to unsatisfactory results. Overfitting is also one of the biggest challenges for SDP. In this study, the authors performed an empirical study of these two problems and investigated probable solutions. They conducted 4840 experiments over five different classifiers using eight NASA projects and 14 PROMISE repository datasets. They proposed and investigated varying the kernel function of an extreme learning machine (ELM) along with kernel principal component analysis (K-PCA) and found better results compared with other classical SDP models. They used the synthetic minority oversampling technique as a sampling method to address the class imbalance problem and k-fold cross-validation to avoid the overfitting problem. They found that ELM-based SDP has a high receiver operating characteristic curve on 11 out of 22 datasets. The proposed model has higher precision and F-score values on ten and nine datasets, respectively, compared with other state-of-the-art models. On 17 datasets, the Matthews correlation coefficient (MCC) of the proposed model surpasses that of other classical models.
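For readers unfamiliar with the MCC mentioned above, the following is a minimal sketch of how it is computed from confusion-matrix counts. The function name `mcc` and the example counts are illustrative assumptions, not taken from the cited study; returning 0 when the denominator vanishes is one common convention, not the only one.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an empty row or column of the matrix gives 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for an imbalanced defect dataset (40 defective, 160 clean).
print(round(mcc(tp=30, fp=20, tn=140, fn=10), 3))
```

Because MCC uses all four cells of the confusion matrix, it is often preferred over accuracy when defective modules are a small minority.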
... they also concluded that SVM outperforms the LR model. Turhan and Bener (2007a) use a multivariate BN over the NASA dataset to deliver high specificity; they report that their results outperform statistical techniques. Pandey, Mishra, and Tripathi (2020) utilize heterogeneous EL over 12 NASA datasets and conclude that it achieves a higher ROC than most state-of-the-art methods and that EL mitigates the overfitting problem. ...
Several prediction approaches are found in the field of software engineering, such as prediction of effort, security, quality, fault, cost, and re-usability. All these prediction approaches are still at a rudimentary stage. Experiments and research are being conducted to build a robust model. Software Fault Prediction (SFP) is the process of developing a model that software practitioners can use to detect faulty classes/modules before the testing phase. Predicting defective modules before the testing phase helps the software development team leader allocate resources more effectively and reduces the testing effort. In this article, we present a Systematic Literature Review (SLR) of studies from 1990 to June 2019 that apply machine learning and statistical methods to software fault prediction. We have cited 208 research articles, of which we studied 154 relevant articles. We investigated the competence of machine learning on existing datasets and research projects. To the best of our knowledge, existing SLRs considered only a few parameters of SFP’s performance, and they only partially examined the various threats and challenges of SFP techniques. In this article, we aggregated those parameters and analyzed them accordingly, and we also illustrate the different challenges in the SFP domain. We also compared the performance of SFP models based on machine learning and on statistical techniques. Our empirical study and analysis demonstrate that the ability of machine learning techniques to classify a class/module as fault-prone or non-fault-prone is better than that of classical statistical models. The performance of machine learning-based SFP methods in predicting fault susceptibility is better than that of conventional statistical methods. The empirical evidence of our survey shows that machine learning techniques can be used to identify fault proneness and are able to produce well-generalized results. We have also investigated a few challenges in the fault prediction discipline, i.e., quality of data, over-fitting of models, and the class imbalance problem. We have also summarized 154 articles in tabular form for quick identification.
... Indeed, considerable research has been done to propose automatic vulnerability prediction (AVP) approaches based on machine learning (ML) and manually-defined static code features, such as software metrics ([2]-[9]) and text-based features [7], [10]. These works were motivated by the success of similar works [11]-[15] that have been done to predict software defects and by the fact that several code attributes, such as complexity, size and coupling (which can be quantified by corresponding software metrics), are proven in practice to be correlated to vulnerabilities. As reported in [16], the task of defining features is tedious, subjective and sometimes error-prone because of the complexity of the problem. ...
Deep Learning (DL) techniques have been successfully applied to solve challenging problems in the field of Natural Language Processing (NLP). Since source code and natural text share several similarities, it was possible to adopt text classification techniques, such as word embedding, to propose DL-based Automatic Vulnerabilities Prediction (AVP) approaches. Although the obtained results were interesting, they were not as good as those obtained in NLP. In this paper, we propose an improved DL-based AVP approach based on the technique of character n-gram embedding. We evaluate the proposed approach for 4 types of vulnerabilities using a large C/C++ open-source codebase. The results show that our approach yields excellent performance that surpasses that of previous approaches.
... Similarly to Menzies and Greenwald, Turhan and Bener [51] found the Bayesian method to be the best in static source code metric based defect prediction. They applied it in conjunction with a multivariate approach and got promising results, recall around 80% and precision around 30%. ...
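As a worked illustration (not a figure reported in the excerpt), the F-measure implied by these rounded values can be computed as the harmonic mean of precision $P \approx 0.30$ and recall $R \approx 0.80$:

\[
F = \frac{2PR}{P + R} = \frac{2 \times 0.30 \times 0.80}{0.30 + 0.80} = \frac{0.48}{1.10} \approx 0.44
\]

The low precision dominates the harmonic mean, which is why a model with high recall can still have a modest F-measure.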
Forecasting defect proneness of source code has long been a major research concern. Having an estimation of those parts of a software system that most likely contain bugs may help focus testing efforts, reduce costs, and improve product quality. Many prediction models and approaches have been introduced during the past decades that try to forecast bugged code elements based on static source code metrics, change and history metrics, or both. However, there is still no universal best solution to this problem, as the most suitable features and models vary from dataset to dataset and depend on the context in which we use them. Therefore, novel approaches and further studies on this topic are highly necessary. In this paper, we employ a chemometric approach - Partial Least Squares with Discriminant Analysis (PLS-DA) - for predicting bug-prone classes in Java programs using static source code metrics. To the best of our knowledge, PLS-DA has never been used before as a statistical approach in the software maintenance domain for predicting software errors. In addition, we have used rigorous statistical treatments, including bootstrap resampling and randomization (permutation) tests, for evaluating and presenting the software engineering results. We show that our PLS-DA based prediction model achieves superior performance compared to the state-of-the-art approaches (i.e. F-measure of 0.44-0.47 at 90% confidence level) when no data re-sampling is applied, and comparable performance when up-sampling is applied on the largest open bug dataset, while training the model is significantly faster, so finding optimal parameters is much easier. In terms of completeness, which measures the number of bugs contained in the Java classes predicted to be defective, PLS-DA outperforms every other algorithm: it found 69.3% and 79.4% of the total bugs with no re-sampling and up-sampling, respectively.
... The false positive rate is also called the probability of false detection (PF) [65], [67]. ...
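A minimal sketch of how PF, together with the probability of detection (PD, i.e. recall), is derived from confusion-matrix counts. The helper names `rate` and `pd_pf` and the example counts are hypothetical; returning `None` for an undefined ratio is an assumed convention for this illustration.

```python
from typing import Optional, Tuple


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def pd_pf(tp: int, fp: int, tn: int, fn: int) -> Tuple[Optional[float], Optional[float]]:
    """PD = TP / (TP + FN); PF = FP / (FP + TN)."""
    return rate(tp, tp + fn), rate(fp, fp + tn)


# Hypothetical counts: 50 defective modules, 100 defect-free modules.
pd_value, pf_value = pd_pf(tp=40, fp=30, tn=70, fn=10)
print(pd_value, pf_value)
```

PF is undefined when a test set contains no defect-free modules, which is why the sketch returns `None` rather than raising an error.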
Assessing the quality of software is both important and difficult. For this purpose, software fault prediction (SFP) models have been extensively used. However, selecting the right model and declaring the best out of multiple models depend on the performance measures. We analyze 14 frequently used, non-graphic classifier performance measures used in SFP studies. This analysis would help machine learning practitioners and researchers in SFP to select the most appropriate performance measure for model evaluation. We analyze the performance measures for resilience against producing invalid values through our proposed plausibility criterion. After that, consistency and discriminancy analyses are performed to find the best out of the 14 performance measures. Finally, we rank the selected performance measures from better to worse on both balanced and imbalanced datasets. Our analyses conclude that F-measure and G-mean1 are equally the best candidates to evaluate SFP models, provided the results are analyzed carefully, as there is a risk of invalid values in certain scenarios.
... various purposes in software testing process. The studies [7]-[10] used machine learning-based defect prediction models to classify each software module as "buggy" or "buggy free". The study [11] created three defect prediction models for three different testing phases in order to monitor defects on large enterprise software. ...
... This indicator is widely used in defect prediction [43]-[45]. ...
Cross-Project Defect Prediction (CPDP) is an active topic for predicting defects on projects (target projects) with scarce labeled data by reusing the classification models from other projects (source projects). Traditional CPDP methods require common features between the data of two projects and utilize them to construct defect prediction models. However, when cross-project data do not satisfy the requirement, i.e., Heterogeneous CPDP (HCPDP) scenario, these methods become infeasible. In this paper, we propose a novel HCPDP method called Heterogeneous Domain Adaptation (HDA) to address the issue. HDA treats the cross-project data as being from two different domains with heterogeneous feature sets. It employs the domain adaptation method to embed the data from the two domains into a comparable feature space with a lower dimension, then measures the difference between the two mapped domains of data using the dictionaries learned from them with dictionary learning technique. We comprehensively evaluate HDA on 94 cross-project pairs of 12 projects from three open-source defect datasets with three performance indicators, i.e., F-measure, Balance, and AUC. Compared with two state-of-the-art HCPDP methods, the experimental results indicate that HDA improves 0.219 and 0.336 in terms of F-measure, 0.185 and 0.215 in terms of Balance, and 0.131 and 0.035 in terms of AUC. In addition, HDA achieves comparable results compared with Within-Project Defect Prediction (WPDP) setting and a state-of-the-art unsupervised learning method in most cases.
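The abstract above does not define the Balance indicator. The sketch below assumes the definition commonly used in the defect prediction literature, where Balance is one minus the normalized Euclidean distance from the ideal point (PF = 0, PD = 1); the function name `balance` and the input values are illustrative only.

```python
import math


def balance(pd: float, pf: float) -> float:
    """Balance = 1 - sqrt((0 - pf)^2 + (1 - pd)^2) / sqrt(2).

    Assumed definition: distance from the ideal point (pf=0, pd=1),
    scaled so that the result lies in [0, 1].
    """
    return 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)


# Hypothetical detection and false-alarm rates.
print(round(balance(pd=0.8, pf=0.3), 3))
```

A perfect predictor scores 1 and the worst possible predictor (PD = 0, PF = 1) scores 0.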
Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
Planning systems generate partially ordered sequences of actions (or plans) that solve a goal. They start from a specification of the valid actions (also called operators), which includes both the conditions under which an action applies (the preconditions) ...
The use of software complexity metrics in the determination of software quality has met with limited success. Many metrics measure similar aspects of program differences. Some lack a sound theoretical foundation. Attempts to use these metrics in quantitative modelling scenarios have been frustrated by a lack of understanding of the precise nature of exactly what is being measured. This is particularly true in the application of these metrics to predictive models. The paper investigates some basic issues associated with the modelling process, including problems of shared variance among metrics and the possible relationship between complexity metrics and measures of program quality. The modelling techniques are applied to a sample data set to explore the differences between modelling techniques with raw complexity metrics and complexity metrics that have been simplified through factor analysis. The ultimate objective is to provide the foundation for the use of complexity metrics in predictive models. This, in turn, will permit the effective use of these measures in the management of complex software projects.
Software engineering is a tedious job that involves people, tight deadlines and limited budgets. Delivering what the customer wants involves minimizing the defects in the programs. Hence, it is important to establish quality measures early on in the project life cycle. The main objective of this research is to analyze problems in software code and propose a model that will help catch those problems earlier in the project life cycle. Our proposed model uses machine learning methods. Principal component analysis is used for dimensionality reduction, and decision tree, multilayer perceptron and radial basis functions are used for defect prediction. The experiments in this research are carried out with different software metric datasets that are obtained from real-life projects of three big software companies in Turkey. The improved method that we propose yields satisfactory results in terms of defect prediction.
Predictive models that incorporate a functional relationship of program error measures with software complexity metrics and metrics based on factor analysis of empirical data are developed. Specific techniques for assessing regression models are presented for analyzing these models. Within the framework of regression analysis, the authors examine two separate means of exploring the connection between complexity and errors. First, the regression models are formed from the raw complexity metrics. Essentially, these models confirm a known relationship between program lines of code and program errors. The second methodology involves the regression of complexity factor measures and measures of errors. These complexity factors are orthogonal measures of complexity from an underlying complexity domain model. From this more global perspective, it is believed that there is a relationship between program errors and complexity domains of program structure and size (volume). Further, the strength of this relationship suggests that predictive models are indeed possible for the determination of program errors from these orthogonal complexity domains.
Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs (30,000 auditory nerve fibers or 10^6 optic nerve fibers) a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.