
A comparative study for estimating software development effort intervals

Ayşe Bakır · Burak Turhan · Ayşe Bener

Published online: 9 September 2010
© Springer Science+Business Media, LLC 2010

Abstract  Software cost/effort estimation is still an open challenge. Many researchers have proposed various methods that usually focus on point estimates. To date, software cost estimation has been treated as a regression problem. However, in order to prevent overestimates and underestimates, it is more practical to predict the interval of estimations instead of the exact values. In this paper, we propose an approach that converts cost estimation into a classification problem and that classifies new software projects into one of the effort classes, each of which corresponds to an effort interval. Our approach integrates cluster analysis with classification methods. Cluster analysis is used to determine effort intervals, while different classification algorithms are used to find the corresponding effort classes. The proposed approach is applied to seven public datasets. Our experimental results show that the hit rates obtained for effort estimation are around 90–100%, which is much higher than those obtained by related studies. Furthermore, in terms of point estimation, our results are comparable to those in the literature although a simple mean/median is used for estimation. Finally, the dynamic generation of effort intervals is the most distinctive part of our study, and it results in time and effort gains for project managers through the removal of human intervention.

Keywords  Software effort estimation · Interval prediction · Classification · Cluster analysis · Machine learning

A. Bakır (✉)
Department of Computer Engineering, Boğaziçi University, 34342 Bebek, Istanbul, Turkey
e-mail: ayse.bakir@boun.edu.tr

B. Turhan
Department of Information Processing Science, University of Oulu, 90014 Oulu, Finland
e-mail: burak.turhan@oulu.fi

A. Bener
Ted Rogers School of Information Technology Management, Ryerson University, Toronto M5B 2K3, Canada
e-mail: ayse.bener@ryerson.ca


Software Qual J (2011) 19:537–552

DOI 10.1007/s11219-010-9112-9

1 Introduction

As software becomes more important in many domains, the focus on its overall quality in

terms of technical product quality and process quality also increases. As a result, software

is blamed for business failures and the increased cost of business in many industries (Lum

et al. 2003). The underestimation of software effort causes cost overruns that lead to cost

cutting. Cost cutting means that some of the life cycle activities either can be skipped or

cannot be completed as originally planned. This causes a drop in software product quality.

To avoid the cost/quality death spiral, accurate cost estimates are vital (Menzies and Hihn

2006).

Software cost estimation is one of the critical steps in the software development life

cycle (Boehm 1981; Leung and Fan 2001). It is the process of predicting the effort required

to develop a software project. Such predictions assist project managers when they make important decisions, such as bidding for a new project or planning and allocating resources.

Inaccurate cost estimations may cause project managers to make wrong decisions. As

Leung and Fan state, underestimations may result in approving projects that would exceed

their budgets and schedules (Leung and Fan 2001). Overestimations, on the other hand,

may result in rejecting other useful projects and wasting resources.

Point estimates are generally used for project staffing and scheduling (Sentas et al. 2005). However, managers may easily make wrong decisions if they rely only on point estimates and the associated error margins generated by cost estimation methods. Although most methods proposed in the literature produce point estimates, Stamelos and Angelis state that producing interval estimates is safer (Stamelos and Angelis 2001). They emphasize that point estimates, which carry a high level of uncertainty as a result of unclear requirements and their implications for the project, have a strong impact on project managers and may cause them to make wrong decisions. Interval estimates may be used for predicting the cost of any current project in terms of completed ones. In addition, while bidding for a new project, an interval estimate can easily be converted to a point estimate by evaluating the values that fall into the same interval.

Up to now, interval estimation has consisted of finding either the confidence intervals for point estimates or the posterior probabilities of predefined intervals and then fitting regression-based methods to these intervals (Angelis and Stamelos 2000; Jorgensen 2002; Sentas et al. 2003, 2005; Stamelos and Angelis 2001; Stamelos et al. 2003). However, none of these approaches addresses the problem of cost estimation as a pure classification problem. In this paper, we aim to convert cost estimation into a classification problem by using interval estimation as a tool. The proposed approach integrates classification methods with cluster analysis, which, to the best of our knowledge, is applied for the first time in the software engineering domain. In addition, by using cluster analysis, effort classes are determined dynamically instead of using manually predefined intervals. The approach uses historical data of completed projects, including their effort values.

The proposed approach includes three main phases: (1) clustering effort data so that each cluster contains similar projects; (2) labeling each cluster with a class number and determining the effort intervals for each cluster; and (3) classifying new projects into one of the effort classes. We used various datasets to validate our approach, and our results revealed much higher estimation accuracies than those in the literature. According to our experimental study, we obtained higher hit rates for effort estimation. We also obtained point estimates with simple approaches such as mean/median regression, and our performance has been comparable to those in the literature.


The rest of the paper is organized as follows: Sect. 2 discusses related work from the literature. Section 3 describes the proposed approach in detail, while Sect. 4 presents the experiments conducted. Section 5 comprises a presentation of the results and discussions. Finally, conclusions and future work are presented in Sect. 6.

2 Related work

Previous work on software cost estimation mostly produced point estimates by using regression methods (Baskeles et al. 2007; Boetticher 2001; Briand et al. 1992; Draper and Smith 1981; Miyazaki et al. 1994; Shepperd and Schofield 1997; Srinivasan and Fisher 1995; Tadayon 2005). According to Boehm, the two most popular regression methods are ordinary least squares regression (OLS) and robust regression (Boehm et al. 2000). OLS is a general linear model that uses least squares, whereas robust regression is an improved version of OLS (Draper and Smith 1981; Miyazaki et al. 1994). Besides regression, various machine learning methods are used for cost estimation. For example, back-propagation multilayer perceptrons and support vector machines (SVM) have been used for effort estimation in Baskeles et al. (2007) and Boetticher (2001), and Briand et al. (1992) introduce a cost estimation method based on optimized set reduction. Other methods for point estimation include estimation by analogy and neural networks. In Shepperd and Schofield (1997), high accuracies are obtained by using analogy with prediction models, whereas in Tadayon (2005), a significant improvement is made on large datasets through the use of an adaptive neural network model.

Fewer studies focus on interval estimation. They can be grouped into two main categories: (1) those that produce confidence intervals for point estimates and (2) those that produce probabilities of predefined intervals. In category 1, interval estimates are generated during the estimation process, whereas in category 2, intervals are predefined before the estimation process.

The first study that empirically evaluated effort prediction interval models in the literature is Angelis and Stamelos (2000). It compares the effort prediction intervals derived from a bootstrap-based model with the prediction intervals derived from regression-based effort estimation models. However, the said study displays a confusion of terms, and a critique was consequently made to clarify the ambiguity (Jorgensen 2002; Jorgensen and Teigen 2002). In another study, an interval estimation method based on expert judgment is proposed (Jorgensen 2003). Statistical simulation techniques for calculating confidence intervals for project portfolios are presented in Stamelos and Angelis (2001).

Two important studies for category 2 are Sentas et al. (2005), in which ordinal regression is used to model the probabilities of both effort and productivity intervals, and Sentas et al. (2003), which uses multinomial logistic regression for modeling productivity intervals. Both studies also include point estimate results of the proposed models. In addition, predefined intervals of productivity are used in a Bayesian belief network to support expert opinion in Stamelos et al. (2003). An empirical comparison of the models that produce point estimates and predefined interval estimates is given in Bibi et al. (2004). Firstly, in contrast to these studies, effort intervals are not predefined manually in this paper. Instead, they are determined by clustering analysis. Secondly, instead of using regression-based methods, we use classification algorithms that originate from the machine learning domain. Thirdly, point estimates can still be derived from these intervals, as we will show in the following sections.

NASA's Software Engineering Laboratory also specified some guidelines for the estimation of effort prediction intervals (NASA 1990). However, these guidelines may affect the external validity of the results, since the projects they are based on may not share the characteristics of projects in other organizations.

Clustering analysis is not a new concept in the software cost estimation domain. Lee et al. (1998) integrate clustering with neural networks in order to estimate the development cost. They find similar projects with clustering and use them to train the network. In Gallego et al. (2007), the cost data are clustered, and then different regression models are fitted to each cluster. Similar to these studies, we also use cluster analysis for grouping similar projects. The difference of our research in comparison to these studies is that we combine clustering with classification methods for effort estimation.

3 The approach

There are three main steps in our approach: (1) grouping similar projects together by cluster analysis; (2) determining the effort intervals for each cluster and specifying the effort classes; and (3) classifying new projects into one of the effort classes. The assumption behind applying cluster analysis to effort data is that similar projects have similar development effort. The class-labeled clusters then become the input data for the classification algorithm, which converts cost estimation into a classification process.

3.1 Cluster analysis

Cluster analysis is a technique for grouping data and finding similar structures in data. In the software cost estimation domain, clustering corresponds to grouping projects into clusters based on their attributes. Similar projects are assigned to the same cluster, whereas dissimilar projects belong to different clusters.

In this study, we use an incremental clustering algorithm called the leader cluster algorithm (Alpaydin 2004) for cluster analysis. In this algorithm, the number of clusters is not predefined; instead, the clusters are generated incrementally. Since one of our main objectives is to generate the effort intervals dynamically, this algorithm is selected to group similar software projects. Other clustering techniques that generate the clusters dynamically could also be used, but this is beyond the scope of this work. The pseudocode of the leader cluster algorithm is given in Fig. 1 (Bakar et al. 2005).

In order to determine the similarity between two projects, Euclidean distance is used. It

is a widely preferred distance metric for software engineering datasets (Lee et al. 1998).
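As a concrete illustration, the following is a minimal Python sketch of the leader cluster algorithm in Fig. 1, assuming numeric project attribute vectors stored in a NumPy array and a fixed distance threshold. The function name leader_cluster is our own, and the single pass over the data is a simplification: Fig. 1 repeats step 2 until the total squared error is small enough.

import numpy as np

def leader_cluster(projects, threshold):
    # Incremental leader clustering (cf. Fig. 1): assign each project to the
    # nearest existing cluster if it lies within `threshold` (Euclidean
    # distance), otherwise open a new cluster.
    centers = [projects[0].astype(float)]   # the first item seeds the first cluster
    members = [[0]]                         # project indices per cluster
    for i, x in enumerate(projects[1:], start=1):
        dists = [np.linalg.norm(x - c) for c in centers]
        nearest = int(np.argmin(dists))
        if dists[nearest] < threshold:
            members[nearest].append(i)
            # recompute the center of the updated cluster
            centers[nearest] = projects[members[nearest]].mean(axis=0)
        else:
            centers.append(x.astype(float))
            members.append([i])
    labels = np.empty(len(projects), dtype=int)
    for k, idx in enumerate(members):
        labels[idx] = k
    return labels, np.array(centers)

The threshold plays the role of the distance bound in Fig. 1; lowering it produces more, tighter clusters.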

3.2 Effort classes

After the clusters and their centers are determined, the effort intervals are calculated for each cluster.

In order to specify the effort intervals and classes, firstly, the minimum and maximum values of the efforts of the projects residing in the same cluster are found. Secondly, these minimum and maximum values are selected as the lower and upper bounds, respectively, of the interval that will represent that cluster. Finally, each cluster is given a class label, which will be used for classifying new projects.
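Continuing the sketch above, this step is a one-liner per cluster, assuming labels from the hypothetical leader_cluster function and a NumPy array efforts of actual effort values; the cluster index itself serves as the class label.

def effort_intervals(labels, efforts):
    # Map each cluster label to its effort interval [min, max].
    return {k: (efforts[labels == k].min(), efforts[labels == k].max())
            for k in set(labels)}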

3.3 Classification

The class of a new project is estimated by using the class-labeled data generated in the previous step. The resulting class corresponds to the effort interval that contains the effort value of the new project.

We use three different classification algorithms for this step: one is parametric (linear discrimination) and the others are non-parametric (k-nearest neighbor and decision tree). These three algorithms are chosen to show how our approach performs with algorithms of different complexities. Linear discrimination is the simplest, whereas the decision tree is the most complex one. k-nearest neighbor has moderate complexity depending on the size of the training set.

3.3.1 Linear discrimination

Linear discrimination (LD) is a discriminant-based approach that tries to fit a model directly for the discriminant between the class regions, without first estimating the likelihoods or posteriors (Alpaydin 2004). It assumes that the projects of a class are linearly separable from the projects of other classes and requires no knowledge of the densities inside the class regions. The linear discriminant function is:

g_i(x \mid w_i, w_{i0}) = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}    (1)

where g_i is the model, w_i and w_{i0} are the model parameters, and x is the software project with d attributes. It is used to separate two or more classes.

Learning involves the optimization of the model parameters to maximize the classification accuracy on a given set of projects. Because of its simplicity and comprehensibility, linear discrimination is frequently used before trying a more complicated model.

3.3.2 k-nearest neighbor

The k-nearest neighbor (k-NN) algorithm is a simple but powerful learning method that is particularly suited for classification problems.

1. Assign the first data item to the first cluster.
2. Consider the next data item:
   find the distances between the new item and the existing cluster centers;
   if (distance < threshold), assign this item to the nearest cluster and recompute the value for that cluster center;
   else, assign it to a new cluster.
3. Repeat step 2 until the total squared error is small enough.

Fig. 1 Pseudocode for the leader cluster algorithm (Bakar et al. 2005)


k-NN assumes that all projects correspond to points in the n-dimensional Euclidean space R^n, where n is the number of project attributes. The algorithm's output is the class that has the most examples among the k nearest neighbors of the input project. The neighbors are found by calculating the Euclidean distance from each project to the input project.

The selection of k is very important. It is generally set to an odd number to minimize ties, as confusion generally appears between any two neighboring classes (Alpaydin 2004). Although the algorithm is easy to implement, the amount of computation increases as the training set grows in size.

3.3.3 Decision tree

Decision trees (DT) are hierarchical data structures that are based on a divide-and-conquer strategy (Quinlan 1993). They can be used for both classification and regression and require no assumptions concerning the data. In the case of classification, they are called classification trees.

The nodes of a classification tree correspond to the attributes that best split the data into disjoint groups, while the leaves correspond to the average effort of that split. The quality of the split is determined by an impurity measure. The tree is constructed by partitioning the data recursively until no further partitioning is possible, choosing at each step the split that minimizes the impurity (Alpaydin 2004).

Concerning the estimation of software effort, the effort of the new project can be determined by traversing the tree from top to bottom along the appropriate paths.
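For readers who want to experiment, here is a hedged sketch of the three classifiers using scikit-learn counterparts rather than the MATLAB implementations used in the paper; n_neighbors=1 mirrors the nearest-neighbor setting reported in Sect. 4.3, and the helper predict_effort_class is our own.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# X: project attribute matrix, y: effort class labels from the clustering step
classifiers = {
    "LD":   LinearDiscriminantAnalysis(),
    "k-NN": KNeighborsClassifier(n_neighbors=1, metric="euclidean"),
    "DT":   DecisionTreeClassifier(),
}

def predict_effort_class(clf, X_train, y_train, X_new):
    # Train a classifier on class-labeled historical projects and
    # classify new projects into one of the effort interval classes.
    return clf.fit(X_train, y_train).predict(X_new)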

4 Experimental study

Our purpose in this study is to convert the effort estimation problem into a classification problem that includes the following phases: (1) clustering the effort data; (2) labeling each cluster with a class number and determining the effort intervals for each cluster; and (3) classifying the new projects. In addition, the point estimation performance of the approach is tested by taking either the mean or the median of the effort values of the projects included in the estimated class.

In this section, details about the validation of our approach on a number of datasets will be given. MATLAB is used as the tool for all the analyses in this study.

4.1 Dataset description

In our experiments, data from two different sources are used: the Promise Data Repository and the Software Engineering Research Laboratory (SoftLab) Repository (Boetticher et al. 2007; SoftLab 2009). Seven datasets are used in this study. Four of them, cocomonasa_v1, coc81, desharnais_1_1 and nasa93, are taken from the Promise Data Repository. The others, sdr05, sdr06 and sdr07, are taken from the SoftLab (2009) Repository. The latter datasets contain data from different local software companies in Turkey, collected by using the COCOMO II Data Collection Questionnaire (Boehm 1999).

The datasets include a number of nominal attributes and two real-valued attributes: Lines of Code and Actual Effort. An example dataset is given in Table 1. Each row in Table 1 corresponds to a different project. These projects are represented by the nominal attributes from the COCOMO II model, along with their size in terms of LOC and the actual effort spent for completing the projects.

We have used several datasets in the same format as provided in Table 1 in order to validate our approach on a wide range of effort estimation data and to generalize our results as much as possible. A list of all the datasets used in this study is given in Table 2.

4.2 Design

Before applying any method, all of the datasets are normalized in order to remove the scaling effects on different dimensions. By using min–max normalization, project attribute values are converted into the [0, 1] interval (Shalabi and Shaaban 2006). After normalization, the need for a dimension reduction technique to extract the relevant features arises. In this paper, principal component analysis (PCA) is used (Alpaydin 2004). The main purpose of PCA is to reduce the dimensions of the dataset so that it can still be efficiently represented without losing much information. Specifically, PCA seeks dimensions in which the variances are maximized. By applying PCA to each cluster after clustering, the model shown in Fig. 2 is developed.

Table 1 An example dataset

Project   Nominal attributes (as defined in COCOMO II)   LOC   Effort

P1 1.00,1.08,1.30,1.00,1.00,0.87,1.00,0.86,1.00,0.70,1.21,1.00,0.91,1.00,1.08 70 278

P2 1.40,1.08,1.15,1.30,1.21,1.00,1.00,0.71,0.82,0.70,1.00,0.95,0.91,0.91,1.08 227 1,181

P3 1.00,1.08,1.15,1.30,1.06,0.87,1.07,0.86,1.00,0.86,1.10,0.95,0.91,1.00,1.08 177.9 1,248

P4 1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08 115.8 480

P5 1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08 29.5 120

P6 1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08 19.7 60

P7 1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08 66.6 300

P8 1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08 5.5 18

P9 1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08 10.4 50

P10 1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08 14 60

P11 1.00,1.00,1.15,1.11,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00 16 114

P12 1.15,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00 6.5 42

P13 1.00,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00 13 60

P14 1.00,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00 8 42

Table 2 An overview of the datasets

Data source      Dataset name                       # of Projects
Promise          cocomonasa_v1                      60
Promise          coc81                              63
Promise          desharnais_1_1 (updated version)   77
Promise          nasa93                             93
SoftLab (2009)   sdr05                              25
SoftLab (2009)   sdr06                              24
SoftLab (2009)   sdr07                              40


Our aim in applying PCA separately to each cluster is to extract separate features for each cluster so that we can obtain better results for both classification and point estimation. The dataset given in Table 1 is used as an example in Fig. 2 to show how our cost data are processed.

In Fig. 2, the projects in the cost dataset are illustrated as P1…P14. After the dataset is normalized, the projects are shown as P1′…P14′. The four clusters generated are named C1, C2, C3 and C4, which correspond to effort interval classes. As described earlier, the lower and upper bounds for an effort interval class are determined dynamically by the minimum and the maximum effort values of the projects that reside in the corresponding cluster.
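A minimal sketch of this preprocessing, assuming scikit-learn: min–max normalization of the attribute matrix (with the Effort column excluded, as it is the target), followed by a separate PCA per cluster that retains 90% of the variance, as set in Sect. 4.3. The function names are our own.

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

def normalize(X):
    # Min-max normalization: scale every attribute into [0, 1].
    return MinMaxScaler().fit_transform(X)

def reduce_per_cluster(X, labels):
    # Fit a separate PCA for each cluster, keeping 90% of the variance,
    # so that each cluster gets its own extracted features.
    reduced = {}
    for k in set(labels):
        pca = PCA(n_components=0.90)   # proportion of variance to retain
        reduced[k] = pca.fit_transform(X[labels == k])
    return reduced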

4.3 Model

Normalized effort estimation data are given as input to this model. Firstly, the leader cluster algorithm is applied to the normalized data to obtain project groups. Here, we selected the number of clusters that minimizes the total squared error while keeping the distance below the defined threshold value. The optimum value for the number of clusters is found by testing all possibilities and calculating the total squared error. Secondly, with PCA, each cluster's dimensions are reduced individually by using their own covariance matrices (the proportion of variance is set to 0.90). The aim here is to prevent data loss within the clusters. PCA is applied to the entire data except the Effort column, which is the value that we want to estimate. Thirdly, each cluster is assigned a class label, and the effort intervals for each of them are determined. As stated in Sect. 3.2, the minimum and maximum values are selected as the interval bounds. Then, the effort data containing the projects with corresponding class labels are given to each of the classification algorithms described in Sect. 3. For the k-nearest neighbor algorithm, the nearest neighbor (k = 1) is used. For the linear discrimination and decision tree algorithms, the built-in MATLAB implementations have been used. Since separate training and test sets do not exist, the classification process is performed in a 10 × 10 cross-validation loop. The data are shuffled 10 times into random order and then divided into 10 bins in the cross-validation loop. The training set is built from nine of the bins, and the remaining bin is used as the validation set. Classification algorithms are first trained on the training set, and then estimations and error calculations are made on the validation set. The errors are collected during the 100 cross-validation iterations, and then the MMRE, MdMRE and PRED values are calculated. Since we have three classification methods, we have three sets of measurements.

[Fig. 2 Our proposed model: data → min–max normalization → leader cluster → PCA on each cluster → effort intervals for each cluster → 10 × 10 cross-validation with k-NN, LD and DT]

In addition, point estimates are calculated at the classification stage in order to determine our point estimation performance. For this process, we decided to use the mean and the median as our point estimators, since they have been used by other studies in the literature. For example, Sentas et al. (2003) represent each interval by a single representative value: the mean point or the median point. At the classification step, when the correct effort class is estimated, the mean and median of the effort values of the projects belonging to that class are calculated.
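A condensed sketch of this evaluation loop, assuming scikit-learn and NumPy: the shuffle-then-10-fold structure mirrors the 10 × 10 cross-validation described above, and using the median effort of the predicted class as the point estimate is one of the two options the paper tests. The helper name and its signature are our own.

import numpy as np
from sklearn.model_selection import KFold

def cross_validate_10x10(clf, X, y_class, efforts, shuffles=10, folds=10):
    # 10x10 cross-validation: collect per-fold hit rates and the MRE values
    # obtained by taking the median effort of the predicted class.
    hit_rates, mres = [], []
    for seed in range(shuffles):
        kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
        for train, test in kf.split(X):
            pred = clf.fit(X[train], y_class[train]).predict(X[test])
            hit_rates.append(np.mean(pred == y_class[test]))
            for p, actual in zip(pred, efforts[test]):
                estimate = np.median(efforts[train][y_class[train] == p])
                mres.append(abs(estimate - actual) / actual)
    return np.array(hit_rates), np.array(mres)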

4.4 Accuracy measures

Although our aim is to convert cost estimation to a classification problem, we want to give the point estimate results of the proposed approach in order to make a comparison with other studies. Thus, we have employed two types of accuracy measures in our experimental study: (1) the misclassification rate for classification and (2) the mean MRE (MMRE), the median MRE (MdMRE) and PRED(25) for point estimates.

4.4.1 Misclassification rate

The misclassification rate is simply the proportion of the number of misclassified software projects in a test set to the total number of projects to be classified in the same test set. It is calculated for each classification algorithm in each model. The formula for calculating the misclassification rate is as follows:

MR = \frac{1}{N_t} \sum_{n=1}^{N_t} \begin{cases} 1 & \text{if } y_n \neq y'_n \\ 0 & \text{otherwise} \end{cases}    (2)

where N_t is the total number of test samples, y'_n is the estimated effort class and y_n is the actual effort class.

The misclassification rate can be thought of as the complement of the hit rate that has been mentioned in interval prediction studies; thus, our results are still comparable to those studies:

100\% = \text{Misclassification Rate} + \text{Hit Rate}    (3)


4.4.2 MMRE, MdMRE and PRED(25)

These measures are calculated from the relative error, i.e., the difference between the actual and the estimated value.

The magnitude of relative error (MRE) is calculated by the following formula:

MRE = \frac{\lvert \text{predicted} - \text{actual} \rvert}{\text{actual}}    (4)

The mean magnitude of relative error (MMRE) is the mean of the MRE values, and the median MRE (MdMRE) is the median of the MRE values. Prediction at level r, or PRED(r), is used to examine the cumulative frequency of MRE for a specific error level. For T estimations, the formula is as follows:

PRED(r) = \frac{100}{T} \sum_{i=1}^{T} \begin{cases} 1 & \text{if } MRE_i \leq \frac{r}{100} \\ 0 & \text{otherwise} \end{cases}    (5)

In this study, we take the desired error level as r = 25. PRED(25) is preferred over MMRE and MdMRE in terms of evaluating the stability and robustness of the estimations (Conte et al. 1986; Stensrud et al. 2003). In order to say that a model performs well, the MMRE and MdMRE values should be low and the PRED(25) values should be high.
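These measures are straightforward to compute; a short sketch, assuming NumPy arrays of class labels and of the MRE values collected over the cross-validation iterations:

import numpy as np

def misclassification_rate(actual_cls, predicted_cls):
    # Eq. 2: fraction of test projects assigned to the wrong effort class.
    return np.mean(np.asarray(actual_cls) != np.asarray(predicted_cls))

def mmre(mre):  return np.mean(mre)     # mean magnitude of relative error
def mdmre(mre): return np.median(mre)   # median magnitude of relative error

def pred(mre, r=25):
    # Eq. 5: PRED(r) = percentage of estimates with MRE <= r/100.
    return 100.0 * np.mean(np.asarray(mre) <= r / 100.0)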

4.5 Scope and limitations

In this paper, we address the cost estimation problem as a classification problem and propose an approach that integrates classification methods with cluster analysis. This approach uses historical cost data and different machine learning techniques in order to make predictions. Although our main aim is to predict effort intervals, we also demonstrate that point estimates can be achieved through our approach as well. Therefore, the scope of our work is relevant for practitioners who employ cost estimation practices.

One of the limitations of our approach is that we test only one clustering method to obtain the effort classes. Other clustering techniques that create dynamic clusters could also be used instead of the leader cluster algorithm. As a second limitation, we obtain point estimates through simple approaches such as mean/median regression. Regression-based models could be used to increase the point estimation performance. However, our aim is not to demonstrate the superiority of one algorithm over the others; instead, we provide an implementation of our ideas using public datasets in order to demonstrate the applicability of our approach.

We address the threats to the validity of our work under three categories: (1) internal

validity, (2) external validity and (3) construct validity.

Internal validity fundamentally questions to what extent the cause–effect relationship between dependent and independent variables exists. For addressing the threats to the internal validity of our results, we used seven datasets and applied 10 × 10 cross-validation to overcome ordering effects.

External validity, i.e. the generalizability of results, addresses the extent to which the findings of a particular study are applicable outside the specifications of that study. To ensure the generalizability of our results, we paid extra attention to including as many datasets from various sources as possible and used seven datasets from two different sources in our study. Our datasets contain a wide diversity of projects in terms of their sources, their domains and the time period during which they were developed. Datasets composed of software development projects from different organizations around the world are used to generalize our results.


Construct validity (i.e. face validity) assures that we are measuring what we actually intended to measure. In our research, we use MR, MMRE, MdMRE and PRED(25) for measuring and comparing the performance of the model. The majority of effort estimation studies use estimation-error-based measures for measuring and comparing the performance of different methods. We also used error-based measures in our study, since they are a practical option for the majority of researchers. Moreover, using error-based measures enables our study to be benchmarked against previous effort estimation research.

5 Results and discussions

The proposed approach is applied to and validated on all seven datasets. The results are given in terms of the accuracy measures mentioned in Sect. 4.

The effort clusters created for each dataset are given in Table 3. In order to show the clustering efficiency, the minimum and maximum numbers of projects assigned to a cluster are also given.

The classification results for effort interval estimation are given in Fig. 3. k-NN and LD perform similarly for coc81, desharnais_1_1, nasa93 and sdr05. They both give a misclassification rate of 0% for coc81 and sdr05. For cocomonasa_v1 and sdr06, k-NN outperforms the others, whereas LD is the best one for sdr07. In total, the proposed model gives a misclassification rate of 0% in five cases at best and 17% in the worst case.

Table 3 Effort clusters for each dataset

Dataset          # of Clusters   # of Projects (Min)   # of Projects (Max)
coc81            4               2                     44
cocomonasa_v1    5               3                     36
desharnais_1_1   9               2                     21
nasa93           6               3                     44
sdr05            3               3                     16
sdr06            3               2                     12
sdr07            4               6                     16

Fig. 3 Effort misclassification rates for each dataset


The outcomes concerning effort interval estimation yield some important results. Considering the classifiers, k-NN is the best performing one, and LD follows it with a slight difference, whereas DT is the worst performing one.

Since our main aim is effort interval classification, we focus on the misclassification rate to measure how good our classification performance is. The misclassification rates are 0% for most cases and around 17% in the worst case. There are not many studies in the literature that investigate effort interval classification. The most recent study on this topic is that of Sentas et al. (2005), in which ordinal regression is used to model the probabilities of both effort and productivity intervals. In the said study, hit rates of around 70% are obtained for productivity interval estimation on the coc81 dataset. In our study, however, the hit rates for all datasets are between 90 and 100%. The main reason for this is that we use similar projects in order to predict the project cost. This is achieved by clustering the projects according to their attributes. Furthermore, the intervals in the above-mentioned study are manually predefined, whereas we dynamically create them by clustering. In Table 4, we compare our results with those of Sentas et al.

We also analyzed our results in terms of point estimation. We used a simple approach based on the means and medians of the intervals for point estimation. We should once again note that our main aim is to determine the effort intervals. However, we also show how our results can be easily converted to point estimates and can produce results comparable to previous ones.

In Table 5, we present the point estimation results in terms of the three measures mentioned in the previous section. Point estimates are determined by taking either the mean or the median of the effort values of the projects. In terms of point estimation performance, k-NN and LD perform nearly the same and better than DT for all datasets. The performance of all classifiers improves on all measures when the median is used for point estimation. Especially for the MMRE and MdMRE measures, the improvement is obvious. MMRE and MdMRE results decrease to 13%, and PRED results increase to 86% for some datasets. Note that a PRED(25) value of 86% means that 86% of all estimations are within 25% of the actual values, which shows the stability and robustness of the model we propose.

Combining clustering with classification methods has helped us to achieve favorable results by eliminating the effects of unrelated data. Our experimental results show that we achieved much higher hit rates than those of previous studies. Although we simply use the mean and the median of the effort interval values, the point estimation results are also comparable to those in the literature. If a different model is fitted to each interval separately, it is expected that our estimation results will further improve.

6 Conclusions and future work

Although various methods have been proposed in the literature, in this paper we handle the cost estimation problem in a different manner. We treat cost estimation as a classification problem rather than a regression problem and propose an approach that classifies new software projects into one of the dynamically created effort classes, each corresponding to an effort interval. Predicting intervals instead of exact values is a more practical way of preventing overestimation and underestimation. This approach integrates classification methods with cluster analysis, which is, to the best of our knowledge, performed for the first time in the software engineering domain. In contrast to previous studies, the intervals are not predefined but dynamically created through clustering.

Table 4 Comparison of the results

                 Hit rate (%)
                 Min      Max
Sentas et al.    60.38    79.24
Our model        97       100

The proposed approach is validated on seven datasets taken from public repositories,

and the results are presented in terms of widely used performance measures. These results

point out the three important advantages our approach offers:

1. We obtain much higher effort estimation hit rates (around 90–100%) in comparison to

other studies in the literature.

2. For point estimation, the MMRE, MdMRE and PRED(25) values are comparable to those in the literature for most of the datasets, although we use methods as simple as mean and median regression.

3. Effort intervals are generated dynamically according to historical data. This method

removes the need for project managers to specify effort intervals manually and hence

prevents the waste of time and effort.

Future work includes the use of different clustering techniques to find effort classes and the fitting of probabilistic models to the intervals. Also, regression-based models can be used for point estimation instead of taking the mean and the median of interval values, which would enhance the point estimation performance.

Table 5 Point estimation results (%)

Dataset          Classifier   Using the mean of projects     Using the median of projects
                              MMRE   MdMRE   PRED            MMRE   MdMRE   PRED
coc81            LD           189    183     33              131    131     33.6
                 k-NN         189    183     33              131    131     33.6
                 DT           192    190     29.6            134    131     30.2
cocomonasa_v1    LD           69     45      42.2            51     32      54.8
                 k-NN         69     45      42              51     32      54.6
                 DT           76     50      26.8            58     40      39.4
desharnais_1_1   LD           13     12      84.14           13     12      86.42
                 k-NN         13     12      84.14           13     12      86.71
                 DT           16     15      79              15     15      81.85
nasa93           LD           70     52      55.5            52     40      57.7
                 k-NN         69     52      55.5            52     40      57.7
                 DT           72     52      51.2            55     41      53.4
sdr05            LD           45     28      45.5            37     26      52
                 k-NN         45     28      45.5            37     26      52
                 DT           59     44      28.5            52     38      35
sdr06            LD           31     31      50.5            25     23      67
                 k-NN         30     31      50.5            24     23      67
                 DT           34     36      44.5            27     25      61
sdr07            LD           14     14      84.66           14     14      79.6
                 k-NN         14     13      81.33           14     14      76.3
                 DT           14     13      81.33           14     14      76.3

Acknowledgments This research is supported in part by Tubitak under grant number EEEAG108E014.

References

Alpaydin, E. (2004). Introduction to machine learning. Cambridge: The MIT Press.

Angelis, L., & Stamelos, I. (2000). A simulation tool for efficient analogy based cost estimation. Journal of Empirical Software Engineering, 5(1), 35–68.

Bakar, Z. A., Deris, M. M., & Alhadi, A. C. (2005). Performance analysis of partitional and incremental

clustering, Seminar Nasional Aplikasi Teknologi Informasi (SNATI).

Baskeles, B., Turhan, B., & Bener, A. (2007). Software effort estimation using machine learning methods. In

Proceedings of the 22nd international symposium on computer and information sciences (ISCIS 2007),

Ankara, Turkey, pp. 126–131.

Bibi, S., Stamelos, I., & Angelis, L. (2004). Software cost prediction with predefined interval estimates. In First Software Measurement European Forum, Rome, Italy, January 2004.

Boehm, B. W. (1981). Software engineering economics. Advances in computer science and technology

series. Upper Saddle River, NJ: Prentice Hall PTR.

Boehm, B. W. (1999). COCOMO II and COQUALMO Data Collection Questionnaire. University of

Southern California, Version 2.2.

Boehm, B., Abts, C., & Chulani, S. (2000). Software development cost estimation approaches—A survey.

Annals of Software Engineering.

Boetticher, G. D. (2001). Using machine learning to predict project effort: empirical case studies in data-

starved domains. In First international workshop on model-based requirements engineering,

pp. 17–24.

Boetticher, G., Menzies, T., & Ostrand, T. (2007). PROMISE repository of empirical software engineering

data. West Virginia University, Department of Computer Science. http://www.promisedata.org/

repository.

Briand, L. C., Basili, V. R., & Thomas, W. M. (1992). A pattern recognition approach for software

engineering data analysis. IEEE Transactions on Software Engineering, 18(11), 931–942.

Conte, S. D., Dunsmore, H. E., & Shen, V. Y. (1986). Software engineering metrics and models. Menlo

Park, CA: Benjamin-Cummings.

Draper, N., & Smith, H. (1981). Applied regression analysis. London: Wiley.

Gallego, J. J. C., Rodriguez, D., Sicilia, M. A., Rubio, M. G., & Crespo, A. G. (2007). Software project

effort estimation based on multiple parametric models generated through data clustering. Journal of

Computer Science and Technology, 22(3), 371–378.

Jorgensen, M. (2002). Comments on 'A simulation tool for efficient analogy based cost estimation'. Empirical Software Engineering, 7, 375–376.

Jorgensen, M. (2003). An effort prediction interval approach based on the empirical distribution of previous

estimation accuracy. Information and Software Technology, 45, 123–126.

Jorgensen, M., & Teigen, K. H. (2002). Uncertainty intervals versus interval uncertainty: An alternative

method for eliciting effort prediction intervals in software development projects. In International

conference on project management (ProMAC), Singapore, pp. 343–352.

Lee, A., Cheng, C. H., & Balakrishnan, J. (1998). Software development cost estimation: Integrating neural

network with cluster analysis. Information and Management, 34, 1–9.

Leung, H., & Fan, Z. (2001). Software cost estimation. Handbook of software engineering and knowledge

engineering. ftp://cs.pitt.edu/chang/handbook/42b.pdf.

Lum, K., Bramble, M., Hihn, J., Hackney, J., Khorrami, M., & Monson, E. (2003). Handbook for software

cost estimation. NASA Jet Propulsion Laboratory, JPL D-26303.

Menzies, T., & Hihn, J. (2006). Evidence-based cost estimation for better-quality software. IEEE Software,

23(4), 64–66.

Miyazaki, Y., Terakado, M., Ozaki, K., & Nozaki, H. (1994). Robust regression for developing software

estimation models. Journal of Systems and Software, 1, 3–16.

NASA. (1990). Manager’s handbook for software development. Goddard Space Flight Center, Greenbelt,

MD, NASA Software Engineering Laboratory.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.


Sentas, P., Angelis, L., & Stamelos, I. (2003). Multinomial logistic regression applied on software productivity prediction. In 9th Panhellenic conference in informatics, Thessaloniki.

Sentas, P., Angelis, L., Stamelos, I., & Bleris, G. (2005). Software productivity and effort prediction with

ordinal regression. Information and Software Technology, 47, 17–29.

Shalabi, L. A., & Shaaban, Z. (2006). Normalization as a preprocessing engine for data mining and the

approach of preference matrix. In IEEE proceedings of the international conference on dependability

of computer systems (DEPCOS-RELCOMEX’06).

Shepperd, M., & Schofield, M. (1997). Estimating software project effort using analogies. IEEE Transactions on Software Engineering, 23(12), 736–743.

SoftLab. (2009). Software research laboratory, Department of Computer Engineering, Bogazici University.

http://www.softlab.boun.edu.tr.

Srinivasan, K., & Fisher, D. (1995). Machine learning approaches to estimating software development

effort. IEEE Transactions on Software Engineering, 21(2), 126–137.

Stamelos, I., & Angelis, L. (2001). Managing uncertainty in project portfolio cost estimation. Information

and Software Technology, 43(13), 759–768.

Stamelos, I., Angelis, L., Dimou, P., & Sakellaris, E. (2003). On the use of Bayesian belief networks for the prediction of software productivity. Information and Software Technology, 45, 51–60.

Stensrud, E., Foss, T., Kitchenham, B., & Myrtveit, I. (2003). A further empirical investigation of the

relationship between MRE and project size. Empirical Software Engineering.

Tadayon, N. (2005). Neural network approach for software cost estimation. International Conference on

Information Technology: Coding and Computing, 2, 815–818.

Author Biographies

Ayşe Bakır received her MSc degree in computer engineering from Bogazici University in 2008 and her BSc degree in computer engineering from Gebze Institute of Technology in 2006. Her research interests include software quality modeling and software cost estimation.

Burak Turhan received his PhD in computer engineering from

Bogazici University. After his postdoctoral studies at the National

Research Council of Canada, he joined the Department of Information

Processing Science at the University of Oulu. His research interests

include empirical studies on software quality, cost/defect prediction

models, test-driven development and the evaluation of new approaches

for software development.


Ayşe B. Bener is an associate professor in the Ted Rogers School of Information Technology Management. Prior to joining Ryerson, Dr. Bener was a faculty member and Vice Chair in the Department of Computer Engineering at Boğaziçi University. Her research interests are software defect prediction, process improvement and software economics. Bener has a PhD in information systems from the London School of Economics. She is a member of the IEEE, the IEEE Computer Society and the ACM.
