
(Research Paper)

A Multivariate Analysis of Static Code Attributes for Defect Prediction

Burak Turhan, Ayşe Bener

Department of Computer Engineering, Bogazici University

34342, Bebek, Istanbul, Turkey

{turhanb, bener}@boun.edu.tr

Abstract

Defect prediction is important for reducing test times by allocating valuable test resources effectively. In this work, we propose a model that uses multivariate approaches in conjunction with Bayesian methods for defect prediction. The motivation behind using a multivariate approach is to overcome the independence assumption that univariate approaches make about software attributes. Using Bayesian methods gives practitioners an idea about the defectiveness of software modules in a probabilistic framework, rather than the hard classifications produced by methods such as decision trees. Furthermore, the software attributes used in this work are chosen among the static code attributes that can easily be extracted from source code, which prevents human error and subjectivity. These attributes are preprocessed with feature selection techniques to select the most relevant attributes for prediction. Finally, we compare our proposed model with the best results reported so far on public datasets and conclude that multivariate approaches can perform better.

Keywords: Defect prediction, Software Metrics, Naïve

Bayes.

Topics: Software Quality, Methods and Tools.

1. Introduction

Testing is the most costly and time-consuming part of the software development lifecycle, regardless of the development process used. Therefore, effective testing leads to significant decreases in project costs and schedules. The aim of defect prediction is to give an idea about testing priorities, so that exhaustive testing is avoided. Using an automated model may help project managers allocate testing resources effectively. These models can predict the degree of defectiveness if relevant features of the software are supplied to them. These relevant features are obtained by using software metrics.

Researchers usually prefer focusing on the selection of a subset of available features [10]. Feature subset selection is mainly preferred because of its interpretability, since the selected features correspond to actual and, on some occasions, controllable measurements of software. This gives the ability to generate rules about the desired values of metrics for 'good' software. It is easier to explain such rules to programmers and managers [6].

This also explains why most studies use decision trees as predictors. Decision trees can be interpreted as a set of rules and can be understood by less technically involved people [6]. But decision trees are hard classification methods that predict a module as either defective or non-defective. Alternatively, Bayesian approaches provide a probabilistic framework and yield soft classification methods with posterior probabilities attached to the predictions [1]. This is why we employ Bayesian approaches in this work.

On the other hand, feature subset selection requires an exhaustive search to choose the optimal subset. Thus, feature selection algorithms use greedy approaches like backward or forward selection [7]. In forward selection, one starts with an empty set of features, and a feature is selected only if it increases the performance of the predictor; otherwise it is discarded. Backward selection is similar in the sense that one starts with all features, and a feature is removed if it does not affect the performance of the predictor. These approaches evaluate the features one at a time and do not consider the effects of features taken as pairs, triples or n-tuples. While a single feature may not affect the estimation performance significantly, pairs, triples or n-tuples of features may [7]. In order to overcome this problem, this study employs feature extraction techniques and compares the results with a baseline study, where the InfoGain algorithm is used to rank and select a subset of features [10].

The major contribution of this research is to incorporate multivariate approaches rather than univariate ones. Univariate approaches assume the independence of features, whereas multivariate approaches take the relations between features into consideration. Obviously, univariate models are simpler than multivariate models. While it is good practice to start modeling with simple models, the problem at hand should also be investigated using more complex models; one should then validate, by measuring performance, whether the more complex models are worth the extra complexity introduced into the modeling. This research performs experiments with both simple and complex models and compares their performances.

In the following section, the feature extraction methods used in this research are briefly described. Then, the models used for defect prediction are explained. After describing the experimental design and the results, conclusions are given.

2. Feature Extraction Methods

In feature extraction, new features are formed by combining the existing ones. This new set of features may not be as easily interpretable as before [6]. On the contrary, there are cases where they turn out to be interpretable [5]. The new features may also lead to better prediction performance by removing irrelevant and non-informative features. An advantage of the feature extraction methods used in this study is that they project the data onto an orthogonal feature space. One has to decide between ease of interpretability and better prediction performance in such cases. In this research, the authors prefer better performance and therefore explore feature extraction methodologies.

Principal Component Analysis (PCA) has been used in other defect prediction studies [11], [13], [8], [14], [2]. We also use PCA in this research. PCA reveals the optimum linear structure of data points, but it is unable to find nonlinear relations, if such relations exist in the data. In order to investigate nonlinear relations, we use the Isomap algorithm as another feature extraction technique.
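As an illustrative sketch (not the authors' MATLAB implementation), PCA-based feature extraction on a module-by-metric matrix might look like the following; the data, its size and the target dimensionality of 5 are invented for illustration, and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical module-by-metric matrix: 100 modules, 38 static code attributes
X = rng.lognormal(size=(100, 38))

# Log filter first (as in Section 4), then project onto orthogonal components
X_log = np.log(X + 1e-6)
pca = PCA(n_components=5)
X_new = pca.fit_transform(X_log)
print(X_new.shape)  # (100, 5)
```

The extracted components are orthogonal by construction, which is the property the text highlights.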

2.1. Isomap

Isomap inherits the advantages of PCA and extends them to learn nonlinear structures that are hidden in high-dimensional data. Computational efficiency, global optimality, and a guarantee of asymptotic convergence are its major features [16].

In general, the Euclidean distance is used to calculate the similarity of two instances. However, using the Euclidean distance to represent pairwise distances makes the model unable to preserve the intrinsic geometry of the data. Two nearby points, in terms of Euclidean distance, may indeed be distant, because their actual distance is the path between these points along the manifold. The length of the path along the manifold is referred to as the geodesic distance [16]. A 2-D spiral is an example of a manifold: it is actually a 1-D line that is folded and embedded in 2-D (see Figure 1, adapted from [9]). Applying Isomap to the spiral unfolds it to its true structure. Isomap simply performs classical Multidimensional Scaling [4] on the pairwise geodesic-distance matrix.

Figure 1. Geodesic distance metric: Points X and Y are at distinct ends of the spiral. Using the Euclidean distance, the true structure of the spiral, i.e. a 1-D line folded and embedded in 2-D, cannot be revealed.

The geodesic distance represents similar (or different) data points more accurately than the Euclidean distance, but the question is how to estimate it. Here the local linearity principle is used: it is assumed that neighboring points lie on a linear patch of the manifold, so for nearby points the Euclidean distances correctly estimate the geodesic distances. For distant points, the geodesic distances are estimated by adding up neighboring distances over the manifold using a shortest-path algorithm.

Isomap finds the true dimensionality of nonlinear structures. The interpretation of the projection axes can be meaningful in some cases [5]. Isomap uses a single parameter to define the neighborhood of data points; i.e., for the k-nearest neighbors of a data point, pairwise geodesic distances are assumed to be equivalent to Euclidean distances. This parameter should be fine-tuned, preferably by cross-validation, to obtain optimum results. The data sample is transformed to have a linear structure in the new projection space; e.g. the spiral is unfolded to a line.
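The spiral-unfolding idea above can be sketched with scikit-learn's Isomap on a synthetic 2-D spiral (an assumption-laden illustration, not the authors' setup; the neighborhood size k = 8 is an arbitrary choice that would normally be tuned):

```python
import numpy as np
from sklearn.manifold import Isomap

# A 2-D spiral: intrinsically a 1-D curve embedded in 2-D
t = np.linspace(0, 4 * np.pi, 400)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])

# Isomap builds a k-nearest-neighbor graph, estimates geodesic distances by
# shortest paths over that graph, then applies classical MDS
iso = Isomap(n_neighbors=8, n_components=1)
X_unfolded = iso.fit_transform(X)

# The 1-D embedding should track the curve parameter t (up to sign/scale)
corr = np.corrcoef(X_unfolded[:, 0], t)[0, 1]
print(X_unfolded.shape)
```

A strong |corr| indicates the embedding recovered the ordering of points along the unfolded line.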

3. Predictor Models

This section explains the predictor models used for defect prediction. As a baseline, the Naive Bayes classifier is taken, since it has been shown to achieve the best results reported so far [10]. We remove the assumptions of the Naive Bayes classifier one at a time and construct the linear and quadratic discriminants. The assumption in Naive Bayes is that the features of a data sample are independent; thus it employs the univariate normal distribution. We believe this assumption is not valid for software data, since there are correlations between software features. So we use a multivariate normal distribution to model the correlations among features. In the next section, univariate and multivariate normal distributions are briefly explained.

3.1. Univariate vs. Multivariate Normal Distribution

In the univariate normal distribution, x ~ N(μ, σ²), x is said to be normally distributed with mean μ and standard deviation σ, and the probability density function (pdf) is defined as:

p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))    (1)

The term inside the exponential in Equation 1 is the normalized Euclidean distance, where the distance of a data sample x to the sample mean μ is measured in terms of standard deviations σ. This scales the distances of different features in case feature values vary significantly. This measure does not consider the correlations among features.

In the multivariate case, x is a d-dimensional vector that is normally distributed, x ~ N(μ, Σ), and the pdf of the multivariate normal distribution is defined as:

p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(−(1/2) (x − μ)^T Σ^(−1) (x − μ))    (2)

where Σ is the covariance matrix and μ is the mean vector. The term inside the exponential in Equation 2 is another distance function, called the Mahalanobis distance [1]. In this case, the distance to the mean vector is normalized by the covariance matrix, and the correlations of the features are also considered. This results in a smaller contribution from highly correlated features and features with high variance.

Our assumption is that software data features are correlated, and a multivariate model would therefore be more appropriate than a univariate model. Besides, the multivariate normal distribution is analytically simple, tractable and robust to departures from normality [1]. As the no free lunch theorem states [17], nothing comes for free, and using a multivariate model increases the number of parameters to estimate. In the univariate case, only 2 parameters, μ and σ, are estimated, while in the multivariate case, d parameters for μ and d × d parameters for Σ need to be estimated.
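To make the distinction concrete, the following small sketch (with invented, strongly correlated two-feature data) compares the normalized Euclidean distance inside Equation 1 with the Mahalanobis distance inside Equation 2:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated features, as we would expect for static code metrics (illustrative)
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=2000)

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)        # multivariate: d*d covariance parameters
s = X.var(axis=0, ddof=1)          # univariate: d variance parameters

x = np.array([2.0, -2.0])          # a point lying against the correlation direction
d = x - mu
mahalanobis_sq = d @ np.linalg.inv(S) @ d
normalized_euclidean_sq = np.sum(d**2 / s)

# The point violates the correlation structure, so the Mahalanobis distance
# flags it as far more unusual than the per-feature normalized distance does
print(mahalanobis_sq > normalized_euclidean_sq)
```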

3.2. Multivariate Classification

In software defect prediction, one aims to discriminate between classes C_0 and C_1, where samples in C_0 are non-defective and samples in C_1 are defective. We combine the multivariate normal distribution with the Bayes rule, use different assumptions, and obtain different discriminants with different complexity levels (see Table 1). We prefer the discriminant point of view, since it is geometrically interpretable. A discriminant, in general, is a hyperplane that separates the d-dimensional space into 2 disjoint subspaces. The general structure of a discriminant is explained next.

Table 1. Complexities of predictors in a K-class problem with d features.

Predictor   # Parameters
QD          (K × (d × d)) + (K × d) + (K)
LD          (d × d) + (K × d) + (K)
NB          (d) + (K × d) + (K)

Bayes' theorem states that the posterior probability of a sample is proportional to the prior probability and the likelihood of the given sample. More formally:

P(C_i | x) = P(x | C_i) P(C_i) / P(x)    (3)

Equation 3 is read as: "The probability that a given data instance x belongs to class C_i is equal to the product of the likelihood that x comes from the distribution that generates C_i and the probability of observing C_i's in the whole sample, normalized by the evidence."

The evidence is given by:

P(x) = Σ_i P(x | C_i) P(C_i)    (4)

and it is a normalization constant for all classes, so it can safely be discarded. Then Equation 3 becomes:

P(C_i | x) ∝ P(x | C_i) P(C_i)    (5)

In a classification problem, we compute the posterior probabilities P(C_i | x) for each class and choose the one with the highest posterior. This is equivalent to defining a discriminant function g_i(x) for class C_i, where g_i(x) is derived from Equation 5 by taking logarithms for convenience:

g_i(x) = log P(x | C_i) + log P(C_i)    (6)

In order to obtain a discriminant value, one needs to compute the prior and likelihood terms. The prior probability P(C_i) can be estimated from the sample by counting. The critical issue is to choose a suitable distribution for the likelihood term P(x | C_i). This is where the multivariate normal distribution comes in: in this study, the likelihood term is modeled by the multivariate normal distribution.

Computing the discriminant values for each class and assigning the instance to the class with the highest value is equivalent to using Bayes' theorem to choose the class with the highest posterior probability. For the 2-class case, it is sufficient to construct a single discriminant g(x) = g_0(x) − g_1(x). Using the discriminant point of view, we explain the different predictors in the following sections. In all cases, an instance x is classified as C_i such that

i = argmax_k g_k(x)

3.3. Quadratic Discriminant

Assumption: Each class has a distinct Σ_i and μ_i.

Derivation: Combining Equation 2 and Equation 6 gives

g_i(x) = −(1/2) log|S_i| − (1/2) (x − m_i)^T S_i^(−1) (x − m_i) + log P(C_i)    (7)

and by defining new variables W_i, w_i and w_i0, the quadratic discriminant is obtained as

g_i(x) = x^T W_i x + w_i^T x + w_i0    (8)

where

W_i = −(1/2) S_i^(−1)    (9)

w_i = S_i^(−1) m_i    (10)

w_i0 = −(1/2) m_i^T S_i^(−1) m_i − (1/2) log|S_i| + log P(C_i)    (11)

and S_i, m_i and P(C_i) are the maximum likelihood estimates of Σ_i, μ_i and P(C_i), respectively.

The quadratic model considers the correlation of the features separately for each class. In the case of K classes, the number of parameters to estimate is K × (d × d) for the covariance estimates and K × d for the mean estimates. In addition, K prior probability estimates are needed.

3.4. Linear Discriminant

Assumption: Each class has a common Σ and a distinct μ_i.

Derivation: The assumption states that the classes share a common covariance matrix. The estimate is found either by using the whole data sample or by taking the weighted average of the class covariances, given as

S = Σ_i P(C_i) S_i    (12)

Substituting this term into Equation 7 and dropping the terms that are now common to all classes, we get

g_i(x) = x^T S^(−1) m_i − (1/2) m_i^T S^(−1) m_i + log P(C_i)    (13)

which is now a linear discriminant of the form

g_i(x) = w_i^T x + w_i0    (14)

where

w_i = S^(−1) m_i    (15)

w_i0 = −(1/2) m_i^T S^(−1) m_i + log P(C_i)    (16)

This model considers the correlation of the features but assumes that the variances and correlations of the features are the same for both classes. The number of parameters to estimate for the covariance matrix is now independent of K: d × d parameters for the covariance estimate, K × d for the mean estimates and K for the priors.

3.5. Naïve Bayes

Assumption: Each class has a common Σ with off-diagonal entries equal to 0, and a distinct μ_i.

Derivation: The assumption states the independence of the features by using a diagonal covariance matrix. The model then reduces to the univariate model given in Equation 17.

g_i(x) = −(1/2) Σ_{j=1}^{d} ((x_j − m_ij) / s_j)² + log P(C_i)    (17)

This model does not take the correlation of the features into account; it measures the deviation from the mean in terms of standard deviations. For Naive Bayes, d covariance, K × d mean and K prior parameters should be estimated.
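The three discriminants above can be sketched end to end as follows. This is an illustrative reimplementation on invented two-feature data (the paper's experiments were done in MATLAB), with comments referencing the equations derived above; the test point is hypothetical:

```python
import numpy as np

def fit(X, y):
    """Estimate per-class means, covariances and priors (MLE, priors by counting)."""
    classes = np.unique(y)
    m = {c: X[y == c].mean(axis=0) for c in classes}
    S = {c: np.cov(X[y == c], rowvar=False, bias=True) for c in classes}
    P = {c: np.mean(y == c) for c in classes}
    return classes, m, S, P

def g_qd(x, m, S, P):   # Equation 7: distinct S_i per class
    d = x - m
    return (-0.5 * np.log(np.linalg.det(S))
            - 0.5 * d @ np.linalg.inv(S) @ d + np.log(P))

def g_ld(x, m, S, P):   # Equation 13: shared (pooled) covariance S
    return x @ np.linalg.inv(S) @ m - 0.5 * m @ np.linalg.inv(S) @ m + np.log(P)

def g_nb(x, m, s, P):   # Equation 17: diagonal covariance, shared std devs s
    return -0.5 * np.sum(((x - m) / s) ** 2) + np.log(P)

rng = np.random.default_rng(2)
X0 = rng.multivariate_normal([0, 0], [[1, .7], [.7, 1]], 200)  # non-defective (C_0)
X1 = rng.multivariate_normal([3, 3], [[1, .7], [.7, 1]], 200)  # defective (C_1)
X, y = np.vstack([X0, X1]), np.array([0] * 200 + [1] * 200)

classes, m, S, P = fit(X, y)
S_pool = sum(P[c] * S[c] for c in classes)   # Equation 12: weighted average
s_diag = np.sqrt(np.diag(S_pool))

x = np.array([2.5, 2.8])   # a hypothetical module's extracted features
pred_qd = max(classes, key=lambda c: g_qd(x, m[c], S[c], P[c]))
pred_ld = max(classes, key=lambda c: g_ld(x, m[c], S_pool, P[c]))
pred_nb = max(classes, key=lambda c: g_nb(x, m[c], s_diag, P[c]))
print(pred_qd, pred_ld, pred_nb)
```

Taking the argmax over the discriminant values realizes the classification rule i = argmax_k g_k(x) from Section 3.2.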

4. Experiments and Results

The design of experiments and the evaluation of results in software defect prediction have particular importance. Most experiment designs have important flaws, such as self-tests and insufficient performance measures, as reported in [10]. Most research reports only the accuracy of predictors as a performance indicator. Examining defect prediction datasets, it is easily seen that they are not balanced; in other words, the number of defective instances is much smaller than the number of non-defective instances. As pointed out in [10], one can achieve 95% accuracy on a 5% defective dataset by building a dummy classifier that always classifies instances as non-defective. A framework of M×N experiment design, meaning M replications of N holdout (cross-validation) experiments, is also given in [10], and additional performance measures are reported, such as the probability of detection (pd) and the probability of false alarm (pf). This research follows the same notation.

Figure 2. Experiment Design.

The experiments conducted in [10] are replicated and extended in this study. The framework for experiment design in [10] is followed and updated as in Figure 2. In order to extract features, PCA and Isomap are performed on the log-filtered data attributes. One advantage of log filtering is that it scales the features so that extreme values are handled; another is that the normal distribution then fits the data better. In other words, the data attributes are assumed to be lognormally distributed. 5 to 30 features are extracted for all datasets using PCA and Isomap. The best subset of features reported in [10] is also used in the experiments; this subset differs for each dataset. The best-performing dimensionalities achieved by PCA and Isomap are also different for each dataset. These observations support the idea that there is no global set of features that describes software. So, as many software metrics as possible should be collected and analyzed, as long as it is feasible to collect them.

A 10-fold cross-validation approach is used in the experiments. That is, the datasets are divided into 10 bins; 9 bins are used for training and 1 bin for testing. Repeating this over all 10 folds ensures that each bin is used for both training and testing while minimizing sampling bias. Each holdout experiment is also repeated 10 times, and in each repetition the datasets are randomized to overcome any ordering effect and to achieve reliable statistics. The reported results are the mean values of these 100 experiments for each dataset. The quadratic discriminant (QD), the linear discriminant (LD) and Naive Bayes (NB) are the predictors used in this research. As performance measures, pd, pf and balance (bal) are reported.
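The 10×10 cross-validation procedure described above can be sketched as follows, assuming scikit-learn; the stand-in dataset (500 modules, 38 metrics, a 10% defect rate) and the Gaussian Naive Bayes predictor are illustrative placeholders, not the NASA MDP data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
# Hypothetical stand-in for a NASA MDP dataset: 38 metrics, 10% defective
X = rng.lognormal(size=(500, 38))
y = np.zeros(500, dtype=int)
y[:50] = 1

scores = []
for rep in range(10):                        # 10 randomized repetitions
    idx = rng.permutation(len(y))            # shuffle to remove ordering effects
    Xr, yr = np.log(X[idx] + 1e-6), y[idx]   # log filter, then 10-fold CV
    for train, test in StratifiedKFold(n_splits=10).split(Xr, yr):
        model = GaussianNB().fit(Xr[train], yr[train])
        scores.append(model.score(Xr[test], yr[test]))
print(len(scores))  # 100 train/test results, whose mean is reported per dataset
```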

pd is a measure of correctly detecting defective modules: it is the ratio of the number of modules correctly predicted as defective to the number of actually defective modules. Obviously, higher pd values are desired. As the name suggests, pf is a measure of false alarms and is interpreted as the probability of predicting a module as defective when it is not. pf is desired to have low values. The balance measure is used to choose the optimal (pd, pf) pairs such that the area under the ROC curve is maximized; it is defined from the normalized Euclidean distance between the desired point (pf, pd) = (0, 1) and the observed (pf, pd) point in the ROC curve.
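These measures can be computed from a binary confusion matrix; the sketch below follows the definitions above, using the 1 − distance/√2 normalization of the balance measure from [10] (the example labels are invented):

```python
import numpy as np

def pd_pf_bal(actual, predicted):
    """pd, pf and balance from binary labels (1 = defective)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    tp = np.sum((actual == 1) & (predicted == 1))
    fn = np.sum((actual == 1) & (predicted == 0))
    fp = np.sum((actual == 0) & (predicted == 1))
    tn = np.sum((actual == 0) & (predicted == 0))
    pd = tp / (tp + fn)                  # probability of detection
    pf = fp / (fp + tn)                  # probability of false alarm
    # Normalized Euclidean distance from the ideal ROC point (pf, pd) = (0, 1)
    bal = 1 - np.sqrt(pf**2 + (1 - pd)**2) / np.sqrt(2)
    return pd, pf, bal

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
pd, pf, bal = pd_pf_bal(actual, predicted)
print(round(pd, 2), round(pf, 3), round(bal, 2))  # 0.75 0.167 0.79
```

Note how the dummy "always non-defective" classifier from the text would score pd = 0 despite its high accuracy, which is exactly why these measures are preferred on imbalanced data.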

Table 2. Dataset Descriptions

Name   #Modules   DefectRate (%)
CM1    505        9
PC1    1107       6
PC2    5589       0.6
PC3    1563       10
PC4    1458       12
KC3    458        9
KC4    125        4
MW1    403        9

For evaluation, 8 different public datasets obtained from the NASA MDP repository [12] are used. Sample sizes vary from 125 to 5589 modules, and each dataset has 38 features representing static code attributes. As seen in Table 2, the defect rates are very low, which supports the use of the above-mentioned performance measures. All implementations are done in the MATLAB environment using standard toolboxes.

The results are tabulated in Table 3: the mean results of the (pd, pf) pairs selected by the bal measure after 10×10 holdout experiments are given. For PCA- and ISO-labeled entries, these results are selected from the 5 to 30 features obtained by PCA and Isomap, respectively.

For SUB-labeled entries, the best subset of features obtained by InfoGain is used, as reported in [10].

Table 3. Results

Data   Predictor   pd(%)   pf(%)   bal(%)
CM1    SUB+NB      84      32      74
PC1    PCA+NB      68      25      71
PC2    PCA+NB      72      13      78
PC3    PCA+LD      76      31      72
PC4    PCA+QD      88      20      83
KC3    ISO+NB      81      25      77
KC4    ISO+LD      78      27      75
MW1    ISO+LD      73      34      69
Average:           77      25      76

In Table 3, results indicated in bold face are statistically significantly better than the other methods at α = 0.05 after applying a t-test on the pd performance measure.

Subset selection is better than the feature extraction methods in only 1 out of 8 datasets (CM1). In the remaining datasets, the best performances are obtained by applying either PCA or Isomap instead of InfoGain. In PC1, PC2, PC3 and PC4, the best mean performances are achieved by applying PCA, while in KC3, KC4 and MW1 Isomap yields better results. It is observed that Isomap gives the best performances on relatively small datasets; as the number of modules increases, PCA performs better.

Except for the PC3 dataset, our replicated results are similar to the mean results reported in [10], but the variances of the replicated experiments (i.e. subsetting) are larger than those of the PCA and Isomap approaches, especially for the pf measure. NB and LD are observed to behave similarly, whereas QD results differ from NB and LD in terms of performance. For QD, it is observed that as the number of features increases, performance gets worse, especially for the pf measure, and the variances increase. A possible reason for this is the complexity of the model (i.e. too many parameters to estimate).

As for the predictors, Naive Bayes (NB) is chosen 4 times, the linear discriminant (LD) is chosen 3 times and the quadratic discriminant (QD) is chosen only once. From these results, it can be concluded that claims stating that any of these predictors is the 'globally' correct one should be avoided. As expected, no specific configuration of a feature selection method and a predictor is always better than the others. Even though NB is the majority winner, it is clearly seen that performances on some datasets are increased by using the multivariate methods QD and LD. Applying QD gives the best result on the PC4 dataset, but the difference is not statistically significant, so QD can be discarded because of its complexity. In the cases where LD wins, statistical significance is observed, so the additional complexity introduced can be justified. There may be other predictors that perform better than these. Constructing better predictors is an open-ended problem, and as better results are reported, the problem gets more difficult due to the ceiling effect; i.e., it is harder to confirm the hypothesis that predictor A performs better than predictor B when A and B perform at, or close to, the maximum achievable performance [3].

The overall performance of the approach improves on the best results reported so far [10]. Previous research reported a mean (pd, pf) = (71, 25), which yields bal = 72 averaged over all datasets. Our replication of those experiments yields a mean (pd, pf) = (64, 19) and bal = 71. After experimenting with all possible combinations of InfoGain, PCA and Isomap with NB, LD and QD, an improvement is observed by picking the best combination for each dataset. The improved results yield a mean (pd, pf) = (77, 25), where bal = 76. While no change in the pf measure is observed, the pd measure is improved by 6 percentage points.

A final comment should be made about the running times of the algorithms. As expected, QD takes more time than LD and NB; however, this difference is not significant. The dominant factor affecting the running times is the sample size.

5. Conclusions and Future Work

In this research, software defect prediction is treated as a data mining problem. Several experiments are conducted, including the replication of previous research on publicly available datasets from the NASA repository. The performances of different predictors together with different feature extraction methods are evaluated. The results are compared with the best performances reported so far, and some improvements are observed.

Previous research advises that one should not seek a globally best subset of features, but rather focus on building predictors that combine information from multiple features. In addition, the authors believe that research should focus on a balanced combination of the two; in other words, building successful predictors depends on how much useful information is supplied to them. While carrying out research on better predictors, research on obtaining useful information from features should also be carried out. A contribution of this research is the use of linear and nonlinear feature extraction methods to combine information from multiple features. In software defect prediction there is more research on feature subset selection than on feature extraction; our results suggest that it is worth exploring feature extraction further.

Another contribution of this research is the modeling of correlations among features: improved results are obtained by using multivariate statistical methods. Furthermore, the probabilities of the predictions are provided by employing Bayesian approaches, which can give project managers and practitioners a better understanding of the defectiveness of software modules.

Further research should investigate the validation of the lognormal distribution assumption for the software data used in this research; it is better practice to apply goodness-of-fit tests than to assume a normal distribution. Other exponential-family distributions should also be investigated. Another research direction is to investigate filters that transform the data into suitable distributions.

Acknowledgements

This research is supported in part by the Bogazici University research fund under grant number BAP-06HA104. The authors would like to thank Koray Balcı, who contributed to earlier versions of this manuscript.

References

[1] E. Alpaydin, Introduction to Machine Learning, The MIT Press, October 2004.

[2] E. Ceylan, F. O. Kutlubay, and A. B. Bener, "Software defect identification using machine learning techniques", in Proceedings of the 32nd EUROMICRO Conference on Software Engineering and Advanced Applications, IEEE Computer Society, Washington, DC, USA, 2006, pp. 240–247.

[3] P. R. Cohen, Empirical Methods for Artificial Intelligence, The MIT Press, London, England, 1995.

[4] T. Cox and M. Cox, Multidimensional Scaling, Chapman & Hall, London, 1994.

[5] V. de Silva and J. B. Tenenbaum, "Global versus local methods in nonlinear dimensionality reduction", in S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, MIT Press, Cambridge, MA, 2003, pp. 705–712.

[6] N. E. Fenton and M. Neil, "A critique of software defect prediction models", IEEE Transactions on Software Engineering, 25(5), 1999, pp. 675–689.

[7] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection", Journal of Machine Learning Research, 3, 2003, pp. 1157–1182.

[8] T. M. Khoshgoftaar and J. C. Munson, "Predicting software development errors using software complexity metrics", IEEE Journal on Selected Areas in Communications, 8(2), Feb. 1990, pp. 253–261.

[9] J. A. Lee, A. Lendasse, N. Donckers, and M. Verleysen, "A robust nonlinear projection method", in Proceedings of ESANN 2000, European Symposium on Artificial Neural Networks, Bruges, Belgium, 2000, pp. 13–20.

[10] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors", IEEE Transactions on Software Engineering, 33(1), 2007, pp. 2–13.

[11] J. Munson and T. M. Khoshgoftaar, "Regression modelling of software quality: empirical investigation", J. Electron. Mater., 19(6), 1990, pp. 106–114.

[12] NASA/WVU IV&V Facility, Metrics Data Program, available from http://mdp.ivv.nasa.gov.

[13] M. Neil, "Multivariate assessment of software products", Software Testing, Verification and Reliability, 1(4), 1992, pp. 17–37.

[14] D. E. Neumann, "An enhanced neural network technique for software risk analysis", IEEE Transactions on Software Engineering, 28(9), 2002, pp. 904–912.

[15] G. Boetticher, T. Menzies and T. Ostrand, PROMISE Repository of empirical software engineering data, http://promisedata.org/repository, West Virginia University, Department of Computer Science, 2007.

[16] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction", Science, 290, 2000, pp. 2319–2323.

[17] D. H. Wolpert and W. G. Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation, 1(1), April 1997, pp. 67–82.