A comparative study for estimating software development effort intervals


Ayşe Bakır, Burak Turhan, Ayşe Bener
Published online: 9 September 2010
Springer Science+Business Media, LLC 2010
Abstract Software cost/effort estimation is still an open challenge. Many researchers
have proposed various methods that usually focus on point estimates. Until today, software
cost estimation has been treated as a regression problem. However, in order to prevent
overestimates and underestimates, it is more practical to predict the interval of estimations
instead of the exact values. In this paper, we propose an approach that converts cost
estimation into a classification problem and that classifies new software projects in one of
the effort classes, each of which corresponds to an effort interval. Our approach integrates
cluster analysis with classification methods. Cluster analysis is used to determine effort
intervals while different classification algorithms are used to find corresponding effort
classes. The proposed approach is applied to seven public datasets. Our experimental
results show that the hit rates obtained for effort estimation are around 90–100%, which is
much higher than those obtained by related studies. Furthermore, in terms of point estimation,
our results are comparable to those in the literature although a simple mean/median
is used for estimation. Finally, the dynamic generation of effort intervals is the most
distinctive part of our study, and it results in time and effort gain for project managers
through the removal of human intervention.
Keywords: Software effort estimation, Interval prediction, Classification,
Cluster analysis, Machine learning
A. Bakır (corresponding author)
Department of Computer Engineering, Boğaziçi University, 34342 Bebek, Istanbul, Turkey
B. Turhan
Department of Information Processing Science, University of Oulu, 90014 Oulu, Finland
A. Bener
Ted Rogers School of Information Technology Management, Ryerson University,
Toronto M5B 2K3, Canada
Software Qual J (2011) 19:537–552
DOI 10.1007/s11219-010-9112-9
1 Introduction
As software becomes more important in many domains, the focus on its overall quality in
terms of technical product quality and process quality also increases. As a result, software
is blamed for business failures and the increased cost of business in many industries (Lum
et al. 2003). The underestimation of software effort causes cost overruns that lead to cost
cutting. Cost cutting means that some of the life cycle activities either can be skipped or
cannot be completed as originally planned. This causes a drop in software product quality.
To avoid the cost/quality death spiral, accurate cost estimates are vital (Menzies and Hihn).
Software cost estimation is one of the critical steps in the software development life
cycle (Boehm 1981; Leung and Fan 2001). It is the process of predicting the effort required
to develop a software project. Such predictions assist project managers when they make
important decisions such as bidding for a new project, planning and allocating resources.
Inaccurate cost estimations may cause project managers to make wrong decisions. As
Leung and Fan state, underestimations may result in approving projects that would exceed
their budgets and schedules (Leung and Fan 2001). Overestimations, on the other hand,
may result in rejecting other useful projects and wasting resources.
Point estimates are generally used for project staffing and scheduling (Sentas et al.
2005). However, managers may easily make wrong decisions if they rely only on point
estimates and the associated error margins generated by cost estimation methods. Although
most methods proposed in the literature produce point estimates, Stamelos and Angelis
state that producing interval estimates is safer (Stamelos and Angelis 2001). They
emphasize that point estimates have a high impact on project managers, causing them to
make wrong decisions, since they include a high level of uncertainty as a result of unclear
requirements and their implications in the project. Interval estimates may be used for
predicting the cost of any current project in terms of completed ones. In addition, while
bidding for a new project, an interval estimate can easily be converted to a point estimate
by evaluating the values that fall into the same interval.
Up to now, interval estimation has consisted of finding either the confidence intervals
for point estimates or the posterior probabilities of predefined intervals and then fitting
regression-based methods to these intervals (Angelis and Stamelos 2000; Jorgensen 2002;
Sentas et al. 2003, 2005; Stamelos and Angelis 2001; Stamelos et al. 2003). However, none
of these approaches addresses the problem of cost estimation as a pure classification
problem. In this paper, we aim to convert cost estimation into a classification problem by
using interval estimation as a tool. The proposed approach integrates classification methods
with cluster analysis, which, to the best of our knowledge, is applied for the first time in the
software engineering domain. In addition, by using cluster analysis, effort classes are
determined dynamically instead of using manually predefined intervals. The approach uses
historical data of completed projects including their effort values.
The proposed approach includes three main phases: (1) clustering effort data so that
each cluster contains similar projects; (2) labeling each cluster with a class number and
determining the effort intervals for each cluster; and (3) classifying new projects to one of
the effort classes. We used various datasets to validate our approach, and our results
revealed much higher estimation accuracies than those in the literature. According to our
experimental study, we obtained higher hit rates for effort estimation. We also obtained
point estimates with simple approaches such as mean/median regression, and our performance
has been comparable to those in the literature.
The rest of the paper is organized as follows: Sect. 2 discusses related work from the
literature. Section 3 describes the proposed approach in detail, while Sect. 4 presents the
experiments conducted. Section 5 comprises a presentation of the results and discussions.
Finally, conclusions and future work are presented in Sect. 6.
2 Related work
Previous work on software cost estimation mostly produced point estimates by using
regression methods (Baskeles et al. 2007; Boetticher 2001; Briand et al. 1992; Draper and
Smith 1981; Miyazaki et al. 1994; Shepperd and Schofield 1997; Srinivasan and Fisher
1995; Tadayon 2005). According to Boehm, the two most popular regression methods are
ordinary least squares (OLS) regression and robust regression (Boehm et al. 2000). OLS is a
general linear model that uses least squares, whereas robust regression is the improved
version of OLS (Draper and Smith 1981; Miyazaki et al. 1994). Besides regression, various
machine learning methods are used for cost estimation. For example, back-propagation
multilayer perceptrons and support vector machines (SVM) have been used for effort
estimation in Baskeles et al. (2007) and Boetticher (2001), and Briand et al. (1992)
introduce a cost estimation method based on optimized set reduction (Baskeles et al. 2007;
Boetticher 2001; Briand et al. 1992). Other methods for point estimation include estimation
by analogy and neural networks. In Shepperd and Schofield (1997), high accuracies are
obtained by using analogy with prediction models, whereas in Tadayon (2005) and
Shepperd and Schofield (1997), a significant improvement is made on large datasets
through the use of an adaptive neural network model (Shepperd and Schofield 1997;
Tadayon 2005).
Fewer studies focus on interval estimation. They can be grouped into two main categories:
(1) those that produce confidence intervals for point estimates and (2) those that
produce probabilities of predefined intervals. In category 1, interval estimates are generated
during the estimation process, whereas in category 2, intervals are predefined before
the estimation process.
The first study that has empirically evaluated effort prediction interval models in the
literature is Angelis and Stamelos (2000). It compares the effort prediction intervals
derived from a bootstrap-based model with the prediction intervals derived from regression-based
effort estimation models. However, the said study displays a confusion of
terms, and a critique was consequently made by Jorgensen in (2002) to clarify the
ambiguity (Jorgensen and Teigen 2002). In another study, an interval estimation method
based on expert judgment is proposed (Jorgensen 2003). Statistical simulation techniques
for calculating confidence intervals for project portfolios are presented in Stamelos and
Angelis (2001).
Two important studies for category 2 are Sentas et al. (2005), in which ordinal
regression is used to model the probabilities of both effort and productivity intervals, and
Sentas et al. (2003), which uses multinomial logistic regression for modeling productivity
intervals (Sentas et al. 2003, 2005). Both studies also include point estimate results of the
proposed models. Also, in Sentas et al. (2003), predefined intervals of productivity are used
in a Bayesian belief network to support expert opinion (Sentas et al. 2003). An empirical
comparison of the models that produce point estimates and predefined interval estimates is
given in Bibi et al. (2004). Firstly, in contrast to these studies, effort intervals are not
predefined manually in this paper. Instead, they are determined by clustering analysis.
Secondly, instead of using regression-based methods, we use classification algorithms that
originate from the machine learning domain. Thirdly, point estimates can still be derived
from these intervals as we will show in the following sections.
NASA’s Software Engineering Laboratory also specified some guidelines for the estimation
of effort prediction intervals (NASA 1990). However, these guidelines may affect
the external validity of the results since they do not reflect the same characteristics of the
projects in other organizations.
Clustering analysis is not a new concept in the software cost estimation domain. Lee
et al. integrate clustering with neural networks in order to estimate the development cost
(Lee et al. 1998). They have found similar projects with clustering and used them to train
the network. In Gallego et al. (2007), the cost data are clustered, and then different
regression models are fitted to each cluster (Gallego et al. 2007). Similar to these studies,
we also use cluster analysis here for grouping similar projects. The difference between our
research and these studies is that we combine clustering with classification
methods for effort estimation.
3 The approach
There are three main steps in our approach: (1) grouping similar projects together by
cluster analysis; (2) determining the effort intervals for each cluster and specifying the
effort classes; and (3) classifying new projects into one of the effort classes. The
assumption behind applying cluster analysis to effort data is that similar projects have
similar development effort. The class-labeled clusters then become the input data for the
classification algorithm, which converts cost estimation into a classification process.
3.1 Cluster analysis
Cluster analysis is a technique for grouping data and finding similar structures in data. In
the software cost estimation domain, clustering corresponds to grouping projects into clusters
based on their attributes. Similar projects are assigned to the same cluster, whereas dissimilar
projects belong to different clusters.
In this study, we use an incremental clustering algorithm called leader cluster (Alpaydin
2004) for cluster analysis. In this algorithm, the number of clusters is not predefined;
instead, the clusters are generated incrementally. Since one of our main objectives is to
generate the effort intervals dynamically, this algorithm is selected to group similar software
projects. Other clustering techniques that generate the clusters dynamically can also
be used, but this is out of the scope of this work. The pseudocode of the leader cluster
algorithm is given in Fig. 1 (Bakar et al. 2005).
In order to determine the similarity between two projects, Euclidean distance is used. It
is a widely preferred distance metric for software engineering datasets (Lee et al. 1998).
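As an illustration only, the leader cluster procedure with Euclidean distance could be sketched in Python as follows. This is a minimal single-pass sketch, not the MATLAB code used in the study: the function name `leader_cluster`, the sample attribute vectors and the threshold value are assumptions, and the outer repetition that tunes the result until the total squared error is small enough is omitted.

```python
import math

def leader_cluster(items, threshold):
    """One pass of leader clustering (hypothetical helper, for illustration).

    Returns (centers, labels), where labels[i] is the cluster index of items[i].
    """
    centers = []  # running mean of each cluster
    counts = []   # number of items assigned to each cluster
    labels = []
    for item in items:
        # Find the nearest existing cluster center, if any.
        nearest, nearest_dist = None, math.inf
        for idx, center in enumerate(centers):
            d = math.dist(item, center)  # Euclidean distance
            if d < nearest_dist:
                nearest, nearest_dist = idx, d
        if nearest is not None and nearest_dist < threshold:
            # Join the nearest cluster and recompute its center (running mean).
            counts[nearest] += 1
            n = counts[nearest]
            centers[nearest] = [c + (x - c) / n
                                for c, x in zip(centers[nearest], item)]
            labels.append(nearest)
        else:
            # Too far from every center: the item starts a new cluster.
            centers.append(list(item))
            counts.append(1)
            labels.append(len(centers) - 1)
    return centers, labels

# Hypothetical normalized projects with two attributes each.
projects = [(0.10, 0.20), (0.15, 0.22), (0.90, 0.80), (0.88, 0.85), (0.50, 0.10)]
centers, labels = leader_cluster(projects, threshold=0.2)
print(labels)  # [0, 0, 1, 1, 2]
```

Because the number of clusters depends only on the threshold and the order of the items, no cluster count has to be fixed in advance, which is the property the approach relies on.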
3.2 Effort classes
After the clusters and their centers are determined, the effort intervals are calculated for
each cluster.
In order to specify the effort intervals and classes, firstly, the minimum and maximum
values of the efforts of the projects residing in the same cluster are found. Secondly, these
minimum and maximum values are selected as the lower and upper bounds of the interval
that will represent that cluster. Finally, each cluster is given a class label, which will be
used for classifying new projects.
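The derivation of effort intervals from clusters can be illustrated with a short Python sketch. The helper name `effort_intervals` and the cluster assignments are hypothetical; the effort values are taken from the example dataset only to make the output concrete.

```python
from collections import defaultdict

def effort_intervals(labels, efforts):
    """Map each cluster label to its (minimum effort, maximum effort) interval."""
    grouped = defaultdict(list)
    for label, effort in zip(labels, efforts):
        grouped[label].append(effort)
    return {label: (min(vals), max(vals)) for label, vals in grouped.items()}

# Hypothetical cluster assignments for six projects.
labels = [0, 0, 1, 1, 1, 2]
efforts = [18, 60, 278, 480, 300, 1181]
intervals = effort_intervals(labels, efforts)
print(intervals)  # {0: (18, 60), 1: (278, 480), 2: (1181, 1181)}
```

A cluster containing a single project yields a degenerate interval whose bounds coincide, as for class 2 above.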
3.3 Classification
The class of a new project is estimated by using the class-labeled data generated in the
previous step. The resulting class corresponds to the effort interval that contains the effort
value of the new project.
We use three different classification algorithms for this step: one is parametric (linear
discrimination) and the others are non-parametric (k-nearest neighbor and decision tree).
These three algorithms are chosen to show how our approach performs with the algorithms
of different complexities. Linear discrimination is the simplest, whereas the decision tree is
the most complex one. k-nearest neighbor has moderate complexity depending on the size
of the training set.
3.3.1 Linear discrimination
Linear discrimination (LD) is a discriminant-based approach that tries to fit a model
directly for the discriminant between the class regions, without first estimating the likelihoods
or posteriors (Alpaydin 2004). It assumes that the projects of a class are linearly
separable from the projects of other classes and requires no knowledge of the densities
inside the class regions. The linear discriminant function is as follows:
$$g_i(x \mid w_i, w_{i0}) = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} \qquad (1)$$
where $g_i$ is the model, $w_i$ and $w_{i0}$ are the model parameters and $x$ is the software project
with $d$ attributes. It is used to separate two or more classes.
Learning involves the optimization of the model parameters to maximize the classification
accuracy on a given set of projects. Because of its simplicity and comprehensibility,
linear discrimination is frequently used before trying a more complicated model.
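A small Python sketch may help to show how a linear discriminant is evaluated. It assumes the usual form of a weighted sum of the attributes plus a bias term, and the usual multi-class rule of choosing the class whose discriminant value is largest. The weights below are hypothetical values chosen for illustration, not parameters learned in this study.

```python
def discriminant(x, w, w0):
    """Value of one linear discriminant: sum_j w[j] * x[j] + w0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0

def classify(x, models):
    """Pick the class whose discriminant is largest.

    `models` maps a class label to its (w, w0) pair (hypothetical structure).
    """
    return max(models, key=lambda label: discriminant(x, *models[label]))

# Hypothetical parameters for two effort classes over two attributes.
models = {
    "C1": ([1.0, 0.0], 0.0),
    "C2": ([-1.0, 0.0], 1.0),
}
print(classify([0.8, 0.3], models))  # C1
print(classify([0.1, 0.3], models))  # C2
```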
3.3.2 k-nearest neighbor
The k-nearest neighbor (k-NN) algorithm is a simple but also powerful learning method
that is particularly suited for classification problems.
1. Assign the first data item to the first cluster.
2. Consider the next data item:
   Find the distances between the new item and the existing cluster centers.
   If (distance < threshold)
      Assign this item to the nearest cluster
      Recompute the value for that cluster center
   Else
      Assign it to a new cluster
3. Repeat step 2 until the total squared error is small enough.
Fig. 1 Pseudocode for leader cluster algorithm (Bakar et al. 2005)
k-NN assumes that all projects correspond to points in the n-dimensional Euclidean
space $\mathbb{R}^n$, where n is the number of the project attributes. The algorithm’s output is the
class that has the most examples among the k neighbors of the input project. The
neighbors are found by calculating the Euclidean distance from each project to the input
project.
The selection of k is very important. It is generally set as an odd number to minimize
ties as confusion generally appears between any two neighboring classes (Alpaydin 2004).
Although the algorithm is easy to implement, the amount of computation increases as the
training set grows in size.
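The voting rule described above can be sketched in a few lines of Python. The function name `knn_classify` and the training points are hypothetical; the sketch computes the distance to every training project, which also shows why the cost grows with the size of the training set.

```python
import math
from collections import Counter

def knn_classify(query, train_x, train_y, k=1):
    """Return the majority class among the k training points nearest to `query`."""
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: math.dist(query, pair[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical normalized projects and their effort classes.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
train_y = [1, 1, 2, 2, 2]
print(knn_classify((0.15, 0.15), train_x, train_y, k=3))  # 1
print(knn_classify((0.80, 0.80), train_x, train_y, k=1))  # 2
```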
3.3.3 Decision tree
Decision trees (DT) are hierarchical data structures that are based on a divide-and-conquer
strategy (Quinlan 1993). They can be used for both classification and regression and
require no assumptions concerning the data. In the case of classification, they are called
classification trees.
The nodes of a classification tree correspond to the attributes that best split data into
disjoint groups, while the leaves correspond to the average effort of that split. The quality
of the split is determined by an impurity measure. The tree is constructed by partitioning
the data recursively until no further partitioning is possible while choosing the split that
minimizes the impurity at every occasion (Alpaydin 2004).
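To make the role of the impurity measure concrete, the following Python sketch selects the threshold on a single attribute that minimizes the weighted impurity of the two resulting groups. Entropy is used here as an assumption, since the impurity measure is not named above; the helper names and the class labels are hypothetical, while the LOC values come from the example dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted entropy of the groups value <= threshold and value > threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return sum(len(part) / n * entropy(part) for part in (left, right) if part)

def best_threshold(values, labels):
    """Candidate thresholds are the observed values except the largest one."""
    candidates = sorted(set(values))[:-1]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))

loc = [5.5, 10.4, 14, 66.6, 115.8, 227]
effort_class = [1, 1, 1, 2, 2, 2]  # hypothetical class labels
print(best_threshold(loc, effort_class))  # 14
```

A full tree would apply the same selection recursively to each group until no further partitioning is possible.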
Concerning the estimation of software effort, the effort of the new project can be
determined by traversing the tree from top to bottom along the appropriate paths.
4 Experimental study
Our purpose in this study is to convert the effort estimation problem into a classification
problem that includes the following phases: (1) clustering the effort data; (2) labeling each
cluster with a class number and determining the effort intervals for each cluster; and (3)
classifying the new projects. In addition, the point estimation performance of the approach
is tested by taking either the mean or the median of the effort values of the projects
included in the estimated class.
In this section, details about the validation of our approach on a number of datasets will
be given. MATLAB is used as a tool for all the analyses stated in this study.
4.1 Dataset description
In our experiments, data from two different sources are used: the Promise Data Repository
and the Software Engineering Research Laboratory (SoftLab) Repository (Boetticher et al.
2007; SoftLab 2009). Seven datasets are used in this study. Four of them, which are
cocomonasa_v1, coc81, desharnais_1_1 and nasa93, are taken from the Promise Data
Repository. The others, which are sdr05, sdr06 and sdr07, are taken from the SoftLab
(2009) Repository. These datasets contain data from different local software companies in
Turkey, which are collected by using the COCOMO II Data Collection Questionnaire
(Boehm 1999).
The datasets include a number of nominal attributes and two real-valued attributes:
Lines of Code and Actual Effort. An exemplary dataset is given in Table 1. Each row in
Table 1 corresponds to a different project. These projects are represented by the nominal
attributes from the COCOMO II model along with their size in terms of LOC and the actual
effort spent for completing the projects.
We have used several datasets in the same format as provided in Table 1 in order to
validate our approach on a wide range of effort estimation data and to generalize our
results as much as possible. A list of all the datasets used in this study is given in Table 2.
4.2 Design
Before applying any method, all of the datasets are normalized in order to remove the
scaling effects on different dimensions. By using min–max normalization, project attribute
values are converted into the [0, 1] interval (Shalabi and Shaaban 2006). After normalization,
the need for a dimension reduction technique to extract the relevant features arises.
In this paper, principal component analysis (PCA) is used (Alpaydin 2004). The main
purpose of PCA is to reduce the dimensions of the dataset so that it can still be efficiently
represented without losing much information. Specifically, PCA seeks dimensions in
which the variances are maximized. By applying PCA to each cluster after clustering, the
model shown in Fig. 2 is developed. Our aim in applying PCA separately to each cluster is
to extract separate features for each cluster so that we can obtain better results for both
classification and point estimation. The dataset given in Table 1 is used as an example in
Fig. 2 to show how our cost data are processed.
Table 1 An example dataset
Project  Nominal attributes (as defined in COCOMO II)                                LOC    Effort
P1       1.00,1.08,1.30,1.00,1.00,0.87,1.00,0.86,1.00,0.70,1.21,1.00,0.91,1.00,1.08  70     278
P2       1.40,1.08,1.15,1.30,1.21,1.00,1.00,0.71,0.82,0.70,1.00,0.95,0.91,0.91,1.08  227    1,181
P3       1.00,1.08,1.15,1.30,1.06,0.87,1.07,0.86,1.00,0.86,1.10,0.95,0.91,1.00,1.08  177.9  1,248
P4       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08  115.8  480
P5       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08  29.5   120
P6       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08  19.7   60
P7       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08  66.6   300
P8       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08  5.5    18
P9       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08  10.4   50
P10      1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08  14     60
P11      1.00,1.00,1.15,1.11,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00  16     114
P12      1.15,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00  6.5    42
P13      1.00,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00  13     60
P14      1.00,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00  8      42
Table 2 An overview of the datasets
Data source     Dataset name                      # of Projects
Promise         cocomonasa_v1                     60
Promise         coc81                             63
Promise         desharnais_1_1 (updated version)
Promise         nasa93                            93
SoftLab (2009)  sdr05                             25
SoftLab (2009)  sdr06                             24
SoftLab (2009)  sdr07                             40
In Fig. 2, the projects in the cost dataset are illustrated as P1–P14. After the dataset is
normalized, projects are shown as P1′–P14′. The four clusters generated are named as C1,
C2, C3 and C4, which correspond to effort interval classes. As described earlier, the lower
and upper bounds for an effort interval class are determined dynamically by the minimum
and the maximum effort values of the projects that reside in the corresponding cluster.
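The two preprocessing steps of this design, min–max normalization and the choice of how many principal components to keep, can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the eigenvalues are made-up numbers standing in for the eigenvalues of a cluster's covariance matrix, and the default proportion of 0.90 is the value reported for this study in Sect. 4.3.

```python
def min_max_normalize(rows):
    """Rescale every column of `rows` into [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / span if span else 0.0
         for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]

def components_for_variance(eigenvalues, proportion=0.90):
    """Smallest number of leading components whose eigenvalues explain
    at least `proportion` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= proportion:
            return k
    return len(ordered)

# (LOC, Effort) pairs of three projects from the example dataset.
rows = [[70, 278], [227, 1181], [5.5, 18]]
normalized = min_max_normalize(rows)
print(normalized[1], normalized[2])  # [1.0, 1.0] [0.0, 0.0]

# Hypothetical eigenvalues of one cluster's covariance matrix.
print(components_for_variance([5.0, 3.0, 1.5, 0.5]))  # 3
```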
4.3 Model
Fig. 2 Our proposed model (stages shown: min–max normalization, normalized data, leader cluster, PCA on each cluster, find effort intervals for each cluster, 10 × 10 cross-validation)
Normalized effort estimation data are given as input to this model. Firstly, the leader cluster
algorithm is applied to the normalized data to obtain project groups. Here, we selected the
number of clusters that minimizes the total squared error while keeping the distance below
the defined threshold value. The optimum value for the number of clusters is found by
testing all possibilities and calculating the total squared error. Secondly, with PCA, each
cluster’s dimensions are reduced individually by using their own covariance matrices (the
proportion of variance is set to 0.90). The aim here is to prevent data loss within the
clusters. PCA is applied to the entire data except the Effort column, which is the value that
we want to estimate. Thirdly, each cluster is assigned a class label, and the effort intervals
for each of them are determined. As stated in Sect. 3.2, minimum and maximum values are
selected as the interval bounds. Then, the effort data containing the projects with corresponding
class labels are given to each of the classification algorithms described in Sect. 3.
For the k-nearest neighbor algorithm, the nearest neighbor is selected (i.e. k = 1). For linear
discrimination and decision tree algorithms, the predefined implementations of MATLAB have been
used. Since separate training and test sets do not exist, the classification process is performed
in a 10 × 10 cross-validation loop. The data are shuffled 10 times into random order
and then divided into 10 bins in the cross-validation loop. The training set is built from nine
of the bins, and the remaining bin is used as the validation set. Classification algorithms are
first trained on the training set, and then, estimations and error calculations are made on the
validation set. The errors are collected during 100 cross-validation iterations, and then
MMRE, MdMRE and PRED values are calculated. Since we have three classification
methods, we have three sets of measurements.
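The 10 × 10 cross-validation loop can be sketched in Python as an index generator. The function name `ten_by_ten_folds`, the fixed random seed and the dataset size of 60 are assumptions made for illustration; training and error calculation would take place inside the loop that consumes the generated splits.

```python
import random

def ten_by_ten_folds(n_items, repeats=10, bins=10, seed=0):
    """Yield (train_indices, validation_indices) for repeats * bins iterations."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_items))
        rng.shuffle(order)                        # new random order per repeat
        folds = [order[i::bins] for i in range(bins)]
        for held_out in range(bins):
            validation = folds[held_out]          # one bin for validation
            train = [i for j, fold in enumerate(folds)
                     if j != held_out for i in fold]  # remaining nine bins
            yield train, validation

splits = list(ten_by_ten_folds(60))
print(len(splits))  # 100
```

Each project appears in exactly one validation bin per shuffle, so every project is validated 10 times over the 100 iterations.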
In addition, point estimates are calculated at the classification stage in order to determine
our point estimation performance. For this process, we decided to use the mean and
the median as our point estimators since they have been used by other studies in the
literature. For example, Sentas et al. (2003) represent each interval by a single representative
value: the mean point or the median point (Sentas et al. 2003). At the classification
step, when the correct effort class is estimated, the mean and median of the effort values of
the projects belonging to that class are calculated.
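As a brief illustration, the two point estimators can be computed with Python's standard `statistics` module. The helper name `point_estimates` and the choice of three effort values as the members of one class are hypothetical.

```python
from statistics import mean, median

def point_estimates(class_efforts):
    """Return (mean, median) of the effort values in the estimated class."""
    return mean(class_efforts), median(class_efforts)

# Hypothetical members of one effort class.
class_efforts = [278, 300, 480]
mean_estimate, median_estimate = point_estimates(class_efforts)
print(round(mean_estimate, 2), median_estimate)  # 352.67 300
```

The median is less affected by a single large project in the class than the mean, which is why both are reported.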
4.4 Accuracy measures
Although our aim is to convert cost estimation to a classification problem, we want to give
the point estimate results of the proposed approach in order to make a comparison with
other studies. Thus, we have employed two types of accuracy measures in our experimental
study: (1) misclassification rate for classification and (2) MeanMRE (MMRE), MedianMRE (MdMRE)
and PRED (25) for point estimates.
4.4.1 Misclassification rate
The misclassification rate is simply the proportion of the number of misclassified software
projects in a test set to the total number of projects to be classified in the same test set. It is
calculated for each classification algorithm in each model. The formula for calculating the
misclassification rate is as follows:
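$$\mathrm{MR} = \frac{1}{N} \sum_{n=1}^{N} \begin{cases} 1 & \text{if } y'_n \neq y_n \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
where $N$ is the total number of projects in the test set, $y'_n$ is the estimated effort class and $y_n$ is the
actual effort class of project $n$.
The misclassification rate can be thought of as the complement of the hit rate that has been
mentioned in interval prediction studies; thus, our results are still comparable to those studies:
$$100\% = \text{Misclassification Rate} + \text{Hit Rate} \qquad (3)$$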
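The relation between the misclassification rate and the hit rate can be checked with a short Python sketch. The predicted and actual class labels are hypothetical, and the rates are expressed as fractions rather than percentages.

```python
def misclassification_rate(predicted, actual):
    """Fraction of projects whose predicted class differs from the actual class."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

# Hypothetical effort classes for six test projects.
predicted = [1, 2, 2, 3, 1, 1]
actual = [1, 2, 3, 3, 1, 2]
mr = misclassification_rate(predicted, actual)
hit_rate = 1 - mr
print(round(mr, 3), round(hit_rate, 3))  # 0.333 0.667
```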
4.4.2 MMRE, MedianMRE and PRED (25)
These are the measures that are calculated from the relative error and the difference
between the actual and the estimated value.
The magnitude of relative error (MRE) is calculated by the following formula:
$$\mathrm{MRE} = \frac{|\text{predicted} - \text{actual}|}{\text{actual}} \qquad (4)$$
The mean magnitude of relative error (MMRE) is the mean of the MRE values, and
MedianMRE (MdMRE) is the median of the MRE values. Prediction at level r, or
PRED(r), is used to examine the cumulative frequency of MRE for a specific error level.
For T estimations, the formula is as follows:
$$\mathrm{PRED}(r) = \frac{100}{T} \sum_{i=1}^{T} \begin{cases} 1 & \text{if } \mathrm{MRE}_i \le \frac{r}{100} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
In this study, we take the desired error level as r = 25. PRED (25) is preferred over
MMRE and MdMRE, in terms of evaluating the stability and robustness of the estimations
(Conte et al. 1986; Stensrud et al. 2003). In order to say that a model performs well, the
MMRE and MdMRE values should be low and the PRED (25) values should be high.
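The point estimation measures can be illustrated with the following Python sketch. The function names and the predicted and actual effort values are hypothetical. PRED is computed here as the percentage of estimates whose MRE does not exceed the chosen level, which is the common definition assumed in this sketch.

```python
from statistics import mean, median

def mre(predicted, actual):
    """Magnitude of relative error of one estimate."""
    return abs(predicted - actual) / actual

def pred(mres, level=25):
    """Percentage of estimates whose MRE is at most level / 100."""
    return 100 * sum(1 for m in mres if m <= level / 100) / len(mres)

# Hypothetical actual and predicted effort values for four projects.
actual = [100, 200, 400, 50]
predicted = [110, 150, 400, 80]
mres = [mre(p, a) for p, a in zip(predicted, actual)]

mmre = mean(mres)     # mean of 0.1, 0.25, 0.0, 0.6
mdmre = median(mres)  # median of the same values
pred25 = pred(mres)   # three of four estimates are within 25%
print(round(mmre, 4), round(mdmre, 4), pred25)  # 0.2375 0.175 75.0
```

In this example the single poor estimate (MRE of 0.6) raises MMRE noticeably but leaves MdMRE and PRED (25) largely unaffected, which illustrates why the measures are reported together.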
4.5 Scope and limitations
In this paper, we address the cost estimation problem as a classification problem and
propose an approach that integrates classification methods with cluster analysis. This
approach uses historical cost data and different machine learning techniques in order to
make predictions. Although our main aim is to predict effort intervals, we also demonstrate
that point estimates can be achieved through our approach as well. Therefore, the scope of
our work is relevant for practitioners who employ cost estimation practices.
One of the limitations of our approach is that we test only one clustering method in order to
obtain the effort classes. Other clustering techniques that create dynamic clusters can also be
used instead of the leader cluster. As a second limitation, we obtain point estimates through
simple approaches such as the mean/median regression. Regression-based models can be
used to increase the point estimation performance. However, our aim is not to demonstrate the
superiority of one algorithm over the others; instead, we provide an implementation of our
ideas using public datasets in order to demonstrate the applicability of our approach.
We address the threats to the validity of our work under three categories: (1) internal
validity, (2) external validity and (3) construct validity.
Internal validity fundamentally questions to what extent the cause–effect relationship
between dependent and independent variables exists. For addressing the threats to the
internal validity of our results, we used seven datasets and applied 10 × 10 cross-validation
to overcome the ordering effects.
External validity, i.e. the generalizability, of results addresses the extent to which the
findings of a particular study are applicable outside the specifications of that study. To
ensure the generalizability of our results, we paid extra attention to include as many
datasets coming from various resources as possible and used seven datasets from two
different sources in our study. Our datasets contain a wide diversity of projects in terms of
their sources, their domains and the time period during which they were developed.
Datasets composed of software development projects from different organizations around
the world are used to generalize our results.
Construct validity (i.e. face validity) assures that we are measuring what we actually
intended to measure. In our research, we use MR, MMRE, MdMRE and PRED (25) for
measuring and comparing the performance of the model. The majority of effort estimation
studies use estimation-error-based measures for measuring and comparing the performance
of different methods. We also used error-based measures in our study since they are a
practical option for the majority of researchers. Moreover, using error-based methods
enables our study to be benchmarked with previous effort estimation research.
5 Results and discussions
The proposed approach is applied to and validated on all of the seven datasets. The results
are given in terms of accuracy measures mentioned in Sect. 4.
The effort clusters created for each dataset are given in Table 3. In order to show the
clustering efficiency, the minimum and maximum numbers of projects assigned to a cluster
are also given.
The classification results for effort interval estimation are given in Fig. 3. k-NN and LD
perform similarly for coc81, desharnais_1_1, nasa93 and sdr05. They both give a mis-
classification rate of 0% for coc81 and sdr05. For cocomonasa_v1 and sdr06, k-NN out-
performs the others, whereas LD is the best one for sdr07. In total, the proposed model
gives a misclassification rate of 0% for five cases in the best case and 17% in the worst
Table 3 Effort clusters for each dataset
Dataset         # of Clusters  # of Projects (Min)  # of Projects (Max)
coc81           4              2                    44
cocomonasa_v1   5              3                    36
desharnais_1_1  9              2                    21
nasa93          6              3                    44
sdr05           3              3                    16
sdr06           3              2                    12
sdr07           4              6                    16
Fig. 3 Effort misclassification rates for each dataset
The outcomes concerning effort interval estimation yield some important results.
Considering classifiers, k-NN is the best performing one and LD follows it with a slight
difference, whereas DT is the worst performing one.
Since our main aim is effort interval classification, we use the misclassification rate
to measure our classification performance. The misclassification rates are 0% for most
cases and around 17% in the worst case. Few studies in the literature investigate effort
interval classification. The most recent study on this topic is that of Sentas et al., in
which ordinal regression is used to model the probabilities of both effort and
productivity intervals (Sentas et al. 2005). In that study, hit rates of around 70% are
obtained for productivity interval estimation on the coc81 dataset. In our study, however,
the hit rates for all datasets are between 90 and 100%. The main reason for this is that
we use similar projects to predict the project cost, which is achieved by clustering the
projects according to their attributes. Furthermore, the intervals in the above-mentioned
study are manually predefined, whereas we create them dynamically by clustering. In
Table 4, we compare our results with those of Sentas et al.
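The relation between the misclassification rate and the hit rate discussed above can be sketched as follows. The sketch assumes that a hit means the predicted effort class equals the true effort class, so that the hit rate is one minus the misclassification rate; the class labels are hypothetical and do not come from any of the datasets.

```python
def misclassification_rate(true_classes, predicted_classes):
    """Fraction of projects assigned to the wrong effort class."""
    if len(true_classes) != len(predicted_classes):
        raise ValueError("both label lists must have the same length")
    wrong = sum(1 for t, p in zip(true_classes, predicted_classes) if t != p)
    return wrong / len(true_classes)


def hit_rate(true_classes, predicted_classes):
    """Fraction of projects assigned to the correct effort class (assumed: 1 - MR)."""
    return 1.0 - misclassification_rate(true_classes, predicted_classes)


# Hypothetical effort classes for six test projects (illustration only).
true_classes = [0, 0, 1, 1, 2, 2]
predicted_classes = [0, 0, 1, 2, 2, 2]

print("Misclassification rate:", misclassification_rate(true_classes, predicted_classes))
print("Hit rate:", hit_rate(true_classes, predicted_classes))
```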
We also analyzed our results in terms of point estimation, using a simple approach
based on the means and medians of the intervals. We note again that our main aim is to
determine the effort intervals. However, we also show how our results can easily be
converted to point estimates that are comparable to previously reported ones.
In Table 5, we present point estimation results in terms of the three measures mentioned
in the previous section. Point estimates are determined by taking either the mean or the
median of the effort values of the projects. In terms of the point estimation performance,
k-NN and LD perform nearly the same and better than DT for all datasets. The performance
of all classifiers improves for all measures when the median is used for point estimation.
The improvement is most visible for the MMRE and MdMRE measures. MMRE and
MdMRE results decrease to 13%, and PRED results increase to 86% for some datasets. Note
that a PRED (25) value of 86% means that 86% of all estimates are within 25% of the
actual effort values, which shows the stability and robustness of the model we propose.
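The conversion from a predicted effort class to a point estimate can be sketched as follows. The sketch assumes that the point estimate is the mean or the median of the effort values of the training projects that belong to the predicted class; the function name and the effort values are hypothetical.

```python
from statistics import mean, median


def point_estimate(predicted_class, training_efforts, training_classes, use_median=True):
    """Return the median (or mean) effort of the training projects in the predicted class."""
    members = [
        effort
        for effort, cls in zip(training_efforts, training_classes)
        if cls == predicted_class
    ]
    if not members:
        raise ValueError("no training project belongs to the predicted class")
    return median(members) if use_median else mean(members)


# Hypothetical training data: effort values and their effort classes.
training_efforts = [10, 12, 40, 100, 110, 300]
training_classes = [0, 0, 0, 1, 1, 1]

print("Class 1, mean:", point_estimate(1, training_efforts, training_classes, use_median=False))
print("Class 1, median:", point_estimate(1, training_efforts, training_classes))
```

In this toy example the single large project in class 1 pulls the mean well above the median, which illustrates one way in which a median-based estimate can be less affected by skewed effort values within an interval.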
Combining clustering with classification methods has helped us to achieve favorable
results by eliminating the effects of unrelated data. Our experimental results show that we
achieved much higher hit rates than those of previous studies. Although we simply use the
mean and the median of the effort interval values, the point estimation results are also
comparable to those in the literature. If a different model is fitted to each interval
separately, we expect our estimation results to improve further.
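The combination of clustering and classification can be sketched end to end as follows. This is an illustration under stated assumptions rather than the implementation used in the study: a simple leader (incremental) clustering with a hypothetical distance threshold stands in for the clustering step, a plain nearest-neighbor vote stands in for the classifiers, and the project attributes and effort values are made up.

```python
import math


def leader_cluster(points, threshold):
    """Assign each point to the first leader within `threshold`, else start a new cluster."""
    leaders, labels = [], []
    for point in points:
        for idx, leader in enumerate(leaders):
            if math.dist(point, leader) <= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(point)
            labels.append(len(leaders) - 1)
    return labels


def effort_intervals(labels, efforts):
    """Derive one (min, max) effort interval per cluster."""
    intervals = {}
    for label, effort in zip(labels, efforts):
        low, high = intervals.get(label, (effort, effort))
        intervals[label] = (min(low, effort), max(high, effort))
    return intervals


def knn_class(query, points, labels, k=1):
    """Predict the effort class of `query` by majority vote of its k nearest projects."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = [labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)


# Hypothetical normalized project attributes and effort values.
projects = [(0.10, 0.20), (0.15, 0.25), (0.80, 0.90), (0.85, 0.80), (0.20, 0.10)]
efforts = [10, 14, 120, 150, 12]

labels = leader_cluster(projects, threshold=0.3)   # effort classes from clustering
intervals = effort_intervals(labels, efforts)      # effort interval per class
new_project = (0.82, 0.85)
predicted = knn_class(new_project, projects, labels, k=1)

print("Effort classes:", labels)
print("Effort intervals:", intervals)
print("Predicted interval for new project:", intervals[predicted])
```

The intervals are derived from the historical projects in each cluster, so no interval boundaries have to be specified by hand; only the clustering parameter (here the assumed threshold) is chosen.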
Table 4 Comparison of the results
                Hit rate (%) Min  Hit rate (%) Max
Sentas et al.   60.38             79.24
Our model       97                100
6 Conclusions and future work
Although various methods have been proposed in the literature, in this paper we handle
the cost estimation problem in a different manner. We treat cost estimation as a
classification problem rather than a regression problem and propose an approach that
classifies new software projects into one of the dynamically created effort classes, each
corresponding to an effort interval. Predicting intervals instead of exact values is a
more practical way to prevent overestimation and underestimation. This approach
integrates classification methods with cluster analysis, which is, to the best of our
knowledge, performed for the first time in the software engineering domain. In contrast to
previous studies, the intervals are not predefined but dynamically created through
clustering.
The proposed approach is validated on seven datasets taken from public repositories,
and the results are presented in terms of widely used performance measures. These results
point out the three important advantages our approach offers:
1. We obtain much higher effort estimation hit rates (around 90–100%) in comparison to
other studies in the literature.
2. For point estimation results, the MdMRE, MMRE and PRED (25) values are comparable
to those in the literature for most of the datasets although we use simple estimates,
namely the mean and the median of the effort interval values.
3. Effort intervals are generated dynamically according to historical data. This method
removes the need for project managers to specify effort intervals manually and hence
prevents the waste of time and effort.
Future work includes the use of different clustering techniques to find effort classes and
the fitting of probabilistic models to the intervals. Also, regression-based models can be
used for point estimation instead of taking the mean and the median of interval values,
which would enhance the point estimation performance.
Table 5 Point estimation results (%)
                            Using the mean of projects   Using the median of projects
Dataset         Classifier  MMRE  MdMRE  PRED (25)       MMRE  MdMRE  PRED (25)
coc81           LD          189   183    33              131   131    33.6
                k-NN        189   183    33              131   131    33.6
                DT          192   190    29.6            134   131    30.2
cocomonasa_v1   LD          69    45     42.2            51    32     54.8
                k-NN        69    45     42              51    32     54.6
                DT          76    50     26.8            58    40     39.4
desharnais_1_1  LD          13    12     84.14           13    12     86.42
                k-NN        13    12     84.14           13    12     86.71
                DT          16    15     79              15    15     81.85
nasa93          LD          70    52     55.5            52    40     57.7
                k-NN        69    52     55.5            52    40     57.7
                DT          72    52     51.2            55    41     53.4
sdr05           LD          45    28     45.5            37    26     52
                k-NN        45    28     45.5            37    26     52
                DT          59    44     28.5            52    38     35
sdr06           LD          31    31     50.5            25    23     67
                k-NN        30    31     50.5            24    23     67
                DT          34    36     44.5            27    25     61
sdr07           LD          14    14     84.66           14    14     79.6
                k-NN        14    13     81.33           14    14     76.3
                DT          14    13     81.33           14    14     76.3
Acknowledgments This research is supported in part by Tubitak under grant number EEEAG108E014.
References
Alpaydin, E. (2004). Introduction to machine learning. Cambridge: The MIT Press.
Angelis, L., & Stamelos, I. (2000). A simulation tool for efficient analogy based cost estimation. Journal of
Empirical Software Engineering, 5(1), 35–68.
Bakar, Z. A., Deris, M. M., & Alhadi, A. C. (2005). Performance analysis of partitional and incremental
clustering, Seminar Nasional Aplikasi Teknologi Informasi (SNATI).
Baskeles, B., Turhan, B., & Bener, A. (2007). Software effort estimation using machine learning methods. In
Proceedings of the 22nd international symposium on computer and information sciences (ISCIS 2007),
Ankara, Turkey, pp. 126–131.
Bibi, S., Stamelos, I., & Angelis, L. (2004). Software cost prediction with predefined interval estimates. In
First Software Measurement European Forum, Rome, Italy, January 2004.
Boehm, B. W. (1981). Software engineering economics. Advances in computer science and technology
series. Upper Saddle River, NJ: Prentice Hall PTR.
Boehm, B. W. (1999). COCOMO II and COQUALMO Data Collection Questionnaire. University of
Southern California, Version 2.2.
Boehm, B., Abts, C., & Chulani, S. (2000). Software development cost estimation approaches—A survey.
Annals of Software Engineering.
Boetticher, G. D. (2001). Using machine learning to predict project effort: empirical case studies in
data-starved domains. In First international workshop on model-based requirements engineering,
pp. 17–24.
Boetticher, G., Menzies, T., & Ostrand, T. (2007). PROMISE repository of empirical software engineering
data. West Virginia University, Department of Computer Science.
Briand, L. C., Basili, V. R., & Thomas, W. M. (1992). A pattern recognition approach for software
engineering data analysis. IEEE Transactions on Software Engineering, 18(11), 931–942.
Conte, S. D., Dunsmore, H. E., & Shen, V. Y. (1986). Software engineering metrics and models. Menlo
Park, CA: Benjamin-Cummings.
Draper, N., & Smith, H. (1981). Applied regression analysis. London: Wiley.
Gallego, J. J. C., Rodriguez, D., Sicilia, M. A., Rubio, M. G., & Crespo, A. G. (2007). Software project
effort estimation based on multiple parametric models generated through data clustering. Journal of
Computer Science and Technology, 22(3), 371–378.
Jorgensen, M. (2002). Comments on ‘a simulation tool for efficient analogy based cost estimation’.
Empirical Software Engineering, 7, 375–376.
Jorgensen, M. (2003). An effort prediction interval approach based on the empirical distribution of previous
estimation accuracy. Information and Software Technology, 45, 123–126.
Jorgensen, M., & Teigen, K. H. (2002). Uncertainty intervals versus interval uncertainty: An alternative
method for eliciting effort prediction intervals in software development projects. In International
conference on project management (ProMAC), Singapore, pp. 343–352.
Lee, A., Cheng, C. H., & Balakrishnan, J. (1998). Software development cost estimation: Integrating neural
network with cluster analysis. Information and Management, 34, 1–9.
Leung, H., & Fan, Z. (2001). Software cost estimation. Handbook of software engineering and knowledge
Lum, K., Bramble, M., Hihn, J., Hackney, J., Khorrami, M., & Monson, E. (2003). Handbook for software
cost estimation. NASA Jet Propulsion Laboratory, JPL D-26303.
Menzies, T., & Hihn, J. (2006). Evidence-based cost estimation for better-quality software. IEEE Software,
23(4), 64–66.
Miyazaki, Y., Terakado, M., Ozaki, K., & Nozaki, H. (1994). Robust regression for developing software
estimation models. Journal of Systems and Software, 1, 3–16.
NASA. (1990). Manager’s handbook for software development. Goddard Space Flight Center, Greenbelt,
MD, NASA Software Engineering Laboratory.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufman.
Sentas, P., Angelis, L., & Stamelos, I. (2003). Multinominal logistic regression applied on software
productivity prediction. In 9th Panhellenic conference in informatics, Thessaloniki.
Sentas, P., Angelis, L., Stamelos, I., & Bleris, G. (2005). Software productivity and effort prediction with
ordinal regression. Information and Software Technology, 47, 17–29.
Shalabi, L. A., & Shaaban, Z. (2006). Normalization as a preprocessing engine for data mining and the
approach of preference matrix. In IEEE proceedings of the international conference on dependability
of computer systems (DEPCOS-RELCOMEX’06).
Shepperd, M., & Schofield, C. (1997). Estimating software project effort using analogies. IEEE
Transactions on Software Engineering, 23(12), 736–743.
SoftLab. (2009). Software research laboratory, Department of Computer Engineering, Bogazici University.
Srinivasan, K., & Fisher, D. (1995). Machine learning approaches to estimating software development
effort. IEEE Transactions on Software Engineering, 21(2), 126–137.
Stamelos, I., & Angelis, L. (2001). Managing uncertainty in project portfolio cost estimation. Information
and Software Technology, 43(13), 759–768.
Stamelos, I., Angelis, L., Dimou, P., & Sakellaris, E. (2003). On the use of bayesian belief networks for the
prediction of software productivity. Information and Software Technology, 45, 51–60.
Stensrud, E., Foss, T., Kitchenham, B., & Myrtveit, I. (2003). A further empirical investigation of the
relationship between MRE and project size. Empirical Software Engineering.
Tadayon, N. (2005). Neural network approach for software cost estimation. International Conference on
Information Technology: Coding and Computing, 2, 815–818.
Author Biographies
Ayşe Bakır received her MSc degree in computer engineering from
Bogazici University in 2008 and her BSc degree in computer
engineering from Gebze Institute of Technology in 2006. Her research
interests include software quality modeling and software cost
Burak Turhan received his PhD in computer engineering from
Bogazici University. After his postdoctoral studies at the National
Research Council of Canada, he joined the Department of Information
Processing Science at the University of Oulu. His research interests
include empirical studies on software quality, cost/defect prediction
models, test-driven development and the evaluation of new approaches
for software development.
Ayşe B. Bener is an associate professor in the Ted Rogers School of
Information Technology Management. Prior to joining Ryerson,
Dr. Bener was a faculty member and Vice Chair in the Department of
Computer Engineering at Boğaziçi University. Her research interests
are software defect prediction, process improvement and software
economics. Bener has a PhD in information systems from the London
School of Economics. She is a member of the IEEE, the IEEE
Computer Society and the ACM.
552 Software Qual J (2011) 19:537–552
... While building models for effort estimation, we used both datasets, which we prepared by collecting data from various local software companies in Turkey, as well as publicly available datasets such as COCOMO database. Among the datasets we collected are SDR datasets [48,49,50,51]. We used the COCOMO II Data Collection Questionnaire in order to collect SDR datasets. ...
... Other relatively more complex learners are also frequently employed in software engineering research as predictive methods. Neural networks [49] and decision trees [50,80] are examples to such learners that were used by the authors. We will not go deep into the mechanics of these learners, however, we will provide the general idea and possible inherent biases. ...
... We will futher discuss ensembles of other learners in the next section as a way to improve the predictive power of learners. A capable and commonly used learner type is the decision trees [50,80]. Unlike NNs that take instances one-by-one, decision trees require all the data to be available before they can start learning. ...
In this chapter, we share our experience and views on software data analytics in practice with a retrospect to our previous work. Over ten years of joint research projects with the industry, we have encountered similar data analytics patterns in diverse organizations and in different problem cases. We discuss these patterns following a 'software analytics' framework: problem identification, data collection, descriptive statistics and decision making. We motivate the discussion by building our arguments and concepts around our experiences of the research process in six different industry research projects in four different organizations.
... 9 Table 6a reports on the quality of the solutions found using CoGEE N SGAII , GA-SAE, and GA-CI, with respect to the three Pareto Front quality indicators: I HV , I GD , and I C . 7. The SA values are obtained by using an ensemble composed of the models on the Pareto Front. ...
... Sentas et al. [91] investigated an ordinal regression technique to model the probability of correctly classifying a new project to one of the predefined cost categories, each of which corresponds to an effort interval. The work by Bakir et al. [7] differs from the previous studies in that the effort intervals are not predefined manually but determined by clustering analysis. They evaluated their approach using seven public datasets. ...
Full-text available
Replication studies increase our confidence in previous results when the findings are similar each time, and help mature our knowledge by addressing both internal and external validity aspects. However, these studies are still rare in certain software engineering fields. In this paper, we replicate and extend a previous study, which denotes the current state-of-the-art for multi-objective software effort estimation, namely CoGEE. We investigate the original research questions with an independent implementation and the inclusion of a more robust baseline (LP4EE), carried out by the first author, who was not involved in the original study. Through this replication, we strengthen both the internal and external validity of the original study. We also answer two new research questions investigating the effectiveness of CoGEE by using four additional evolutionary algorithms (i.e., IBEA, MOCell, NSGA-III, SPEA2) and a well-known Java framework for evolutionary computation, namely JMetal (rather than the previously used R software), which allows us to strengthen the external validity of the original study. The results of our replication confirm that: (1) CoGEE outperforms both baseline and state-of-the-art benchmarks statistically significantly (p < 0.001); (2) CoGEEs multi-objective nature makes it able to reach such a good performance; (3) CoGEEs estimation errors lie within claimed industrial human-expert-based thresholds. Moreover, our new results show that the effectiveness of CoGEE is generally not limited to nor dependent on the choice of the multi-objective algorithm. Using CoGEE with either NSGA-II, NSGA-III, or MOCell produces human competitive results in less than a minute. The Java version of CoGEE has decreased the running time by over 99.8% with respect to its R counterpart. We have made publicly available the Java code of CoGEE to ease its adoption, as well as, the data used in this study in order to allow for future replication and extension of our work.
... Selain Metode-metode machine learning yang telah digunakan untuk estimasi usaha pengembangan perangkat lunak diantaranya k-Nearnest Neighbor (k-NN) [11], [12], [13], [14]. Artificial Neural Networks (ANN) atau Neural Network (NN) [11], [15], [16], Support Vector Machines [11], [17], Naive Bayes (NB) [18], Bayesian Networks (BN) [19], Decision Trees (DT) [12], Linear Regression (LR) [20]. ...
... Selain Metode-metode machine learning yang telah digunakan untuk estimasi usaha pengembangan perangkat lunak diantaranya k-Nearnest Neighbor (k-NN) [11], [12], [13], [14]. Artificial Neural Networks (ANN) atau Neural Network (NN) [11], [15], [16], Support Vector Machines [11], [17], Naive Bayes (NB) [18], Bayesian Networks (BN) [19], Decision Trees (DT) [12], Linear Regression (LR) [20]. ...
... Empirical comparisons between the models producing point and predefined interval estimates were conducted in [9] but no general conclusions were found as the best performed method could behave relatively badly depending on the dataset. Later in [7], clustering analysis was applied to automatically define the effort categories. Its main contribution was the removal of human intervention in predefining the effort intervals. ...
... 6: (3) Final Probabilistic Prediction: Three methods to calculate the final probabilistic prediction for the testing project x using Equations (4)-(6). (4) PI Construction: Convert the derived Gaussian PDF prediction to CDF and derive the PI [y lb , y ub ] with CL α using Equation (7) or Equation (8). 8: Output: PI [y lb , y ub ] with CL α. ...
Software effort estimation (SEE) usually suffers from inherent uncertainty arising from predictive model limitations and data noise. Relying on point estimation only may ignore the uncertain factors and lead project managers (PMs) to wrong decision making. Prediction intervals (PIs) with confidence levels (CLs) present a more reasonable representation of reality, potentially helping PMs to make better-informed decisions and enable more flexibility in these decisions. However, existing methods for PIs either have strong limitations or are unable to provide informative PIs. To develop a “better” effort predictor, we propose a novel PI estimator called Synthetic Bootstrap ensemble of Relevance Vector Machines (SynB-RVM) that adopts Bootstrap resampling to produce multiple RVM models based on modified training bags whose replicated data projects are replaced by their synthetic counterparts. We then provide three ways to assemble those RVM models into a final probabilistic effort predictor, from which PIs with CLs can be generated. When used as a point estimator, SynB-RVM can either significantly outperform or have similar performance compared with other investigated methods. When used as an uncertain predictor, SynB-RVM can achieve significantly narrower PIs compared to its base learner RVM. Its hit rates and relative widths are no worse than the other compared methods that can provide uncertain estimation.
... Software engineering studies using intervals can be generally categorized as either using CIs in combination with capture-recapture methods [71,72,87,88], or constructing PIs from expert opinion elicitation [40], bootstrapping [1], regression models [12], a combination of cluster analysis and classification methods [3],a multi-objective evolutionary algorithm [79], or based on prior estimates [41]. We note that linear regression is not typically used in the construction of prediction intervals with respect to software engineering predictions. ...
... Thus, we cannot allow exact replication of our study. However, in order to enhance the study usability and replicability we developed and shared Meta tune 3 , an opensource Python package for constructing and analyzing prediction intervals. In order to support the use of Meta tune, we provide a README 5 and a demo 4 . ...
Recent studies have shown that tuning prediction models increases prediction accuracy and that Random Forest can be used to construct prediction intervals. However, to our best knowledge, no study has investigated the need to, and the manner in which one can, tune Random Forest for optimizing prediction intervals { this paper aims to fill this gap. We explore a tuning approach that combines an effectively exhaustive search with a validation technique on a single Random Forest parameter. This paper investigates which, out of eight validation techniques, are beneficial for tuning, i.e., which automatically choose a Random Forest configuration constructing prediction intervals that are reliable and with a smaller width than the default configuration. Additionally, we present and validate three meta-validation techniques to determine which are beneficial, i.e., those which automatically chose a beneficial validation technique. This study uses data from our industrial partner (Keymind Inc.) and the Tukutuku Research Project, related to post-release defect prediction and Web application effort estimation, respectively. Results from our study indicate that: i) the default configuration is frequently unreliable, ii) most of the validation techniques, including previously successfully adopted ones such as 50/50 holdout and bootstrap, are counterproductive in most of the cases, and iii) the 75/25 holdout meta-validation technique is always beneficial; i.e., it avoids the likely counterproductive effects of validation techniques.
... The target users are students from some secondary schools, Kaduna Polytechnic and Apprentices as contained in table 2. They were interviewed and discussed to know their view on vocational career information system. Few out of the questions asked during the interview are (1) what are expected systems requirements, (2) what are expected functional requirements, (3) what are your security expectation. ...
... Sedangkan metode machine learning yang sudah digunakan untuk estimasi usaha pengembangan perangkat lunak diantaranya k-Nearnest Neighbor (k-NN) [7], [8]. Artificial Neural Networks (ANN) atau Neural Network (NN) [7], [9], [10], Support Vector Machines [7], [11], Naive Bayes (NB) [12], Decision Trees (DT) [13], Linear Regression (LR) [14]. ...
... In recent years, data-analysis based software cost/effort and quality estimation gains more and more popularity, Notice that we only consider the publications which clearly address if they conduct any MDT before data analysis. All the publications collected are listed in [198] ABE SWR-FS PROMISE, ISBSG LD Azzeh [201] ABE SFS PROMISE LD Corazza, et al. [196] ABE, SVM Filters PROMISE, Tukutuku LD Kocaguneli, et al. [164] ABE SFS PROMISE MEI Seo and Bae [199] ABE SFS PROMISE, ISBSG, Company LD SQJ Liu, et al. [194] CART, ABE Filters ISBSG LD Bakir, et al. [208] SVM, ABE PCA PROMISE, ISBSG LD Hsu and Huang [171] ANN, CART, ABE Filters PROMISE, ISBSG LD Bakir, et al. [209] Khatibi Bardsiri, et al. [195] CART [217] re-confirmed the ability of MEI in terms of accuracy, and more importantly, the impact of missingness mechanism on imputation accuracy is not statistically significant. ...
Full-text available
There are two critical elements to software development, i.e. quality and effort. Quality is not the final goal for software development. A more important idea behind quality is the ability to fix the problem, maintain and upgrade the software rapidly. Generally, practitioners refer a bug to the failure or error in software programs. Bugs may seriously interfere with the program functionality or user experience. The effort, instead, is less related to the end-users; however, it critically decides if the software could be released in time. The notorious Brooks’ law raises the idea of adding manpower to a late software project makes it even later. The implied characteristics of software effort are difficult to control in practice. Therefore, software practitioners are still calling for more well-established methods to estimate quality and effort. There are a plenty of research that identifies and discusses potential project factors on software economic elements. Former studies have identified multiple factors of team and project that determine software economics. However, there is serious conclusive inconsistency. Team size, as a rudimentary factor during software development, is often neglected in any effort or quality estimation process. Former empirical research is still lacking holistic investigation between team size and overall economic elements. This part shall provide empirical evidence of the impact of team size and its interactions on software economics. The major difficulty is to identify and investigate the “role players” that impacts quality and effort in the development. This thesis reports the research that aims to estimate quality and efforts of software development from a holistic perspective, including missing data, team size, language, platform, reuse and other project factors. The 1st part of this thesis aims at identifying the potential effect of software reuse in the context of embedded software development. 
Software reuse has been advocated as a technique with a great capacity to improve product quality and reduce development effort and cost. However, the benefit of reuse is still doubted by serious conclusive and methodological inconsistency. Experts are still calling for more solid empirical studies with objective data on the effects of software reuse on new product performance. The validation of the benefits could build a strong guidance for the software industry. The 2nd part deals with missing data issues in software estimation. With known critical software project factors, appropriate preprocessing is necessary for further machine learning (ML) based empirical software engineering (SE). Historical datasets are widely II used to build models for prediction. However, the missingness inside dataset seriously affects the ability to discover knowledge from constructing effective analogy-based estimation model. Literature review reveals that listwise deletion gains the most popularity but reduce the sample size. And the issue of missing data in empirical software engineering is less addressed. The 3rd part of this thesis investigates and improves one commonly adopted data imputation technique: k nearest neighbor (kNN) based method, instead of ignoring missing observations to make data incomplete. KNN based imputation is improved to predict each missing value with special parameter settings under various missing data patterns. The optimization strategy includes multiple parameter combinations and feature relevance technique. We compare the novel imputation techniques with mean imputation (MEI) and other commonly used kNN ones. Then we conduct various estimation learners on eight real famous software quality datasets to discover the impact of the kNN based imputation methods. To solve and provide better missing data imputation methods helps use possible data for estimation. 
The 4th part of this thesis exploits the best data preprocessing (DP) combination for various ML methods to maximize the utility of project factors. Due to the complex nature of the software development process, traditional parametric models and statistical methods often appear to be inadequate to model the increasingly complicated relationship between project development effort and the project features (or effort drivers). ML methods, with several reported successful applications, have gained popularity for software effort estimation in recent years. DP has been claimed by many researchers as a fundamental stage of ML methods; however, very few works have focused on the effects of DP techniques. This part strongly addresses this issue from the perspective of the data mining. The thesis reports a real-life study of the impact of reuse on quality, effort and related economic consequences of embedded software development based on first-hand objective data from 30 projects in a small-sized company. The thesis validates the empirical relationships between team/project factors and software economic measurement, including productivity, quality, effort, and time-to-market. The data analysis bases on a renowned dataset, ISBSG (The International Software Benchmarking Standards Group). It also validates and improves classic imputation techniques on well-known datasets with full project factors in the context of empirical ML based SE; (4) III systematically assesses the effectiveness of DP techniques on ML methods in the context of software effort estimation. In this thesis, we first conduct a literature survey on the recent publications using DP techniques, followed by a systematic empirical study to analyse the strengths and weaknesses of individual data preprocessing techniques as well as their combinations. This thesis reveals that (1) a higher reuse rate enhances productivity and quality and reduces the cost of embedded software development. 
(2) Multiple factors, including team size, language type, and organization type, turn out to have a significant impact on software economics; (3) the proposed cross-validation based kNN imputation performs better in the context of software quality prediction; (4) DP techniques may significantly influence the final prediction. They sometimes might have negative impacts on prediction performance of ML methods. To improve prediction models, meticulous parameter selection and tuning are necessary according to the characteristics of ML methods, as well as the datasets used for software effort estimation. Future work includes (1) mining software reuse repository to discover more knowledge of software reuse benefits, (2) further quantify the relationship between team size and software economic elements, (3) further improvement on kNN imputation in the domain of both effort and quality estimation, (4) more empirical findings in terms of investigating DP combination in empirical SE.
Conference Paper
Model-based estimation often uses impact factors and historical data to predict the effort of new projects. Estimation accuracy of this approach is highly dependent on how well impact factors are selected. This paper comparatively assesses six methods for prune parameters of effort estimation models, including Stepwise regression, Lasso, constrained regression, GRASP, Tabu search, and PCA. Four data sets were used for evaluation, showing that estimation accuracy varies among the methods but no method consistently outperforms the rest. Stepwise regression prunes estimation model parameters the most while it does not sacrifice much estimation performance. Our study provides further evidence to support the use of Stepwise regression for selecting factors in effort estimation.
In recent years data mining has been experiencing growing popularity. It has been applied for various purposes and become commonly used in day-to-day operations for knowledge discovery, especially in areas where uncertainty is substantial. Data mining is replacing traditional error prone and often ineffective techniques or is used in conjunction. Due to a large number of projects either struggling or even failing the researchers recognize its potential application in the project management discipline in order to increase project success rates. It can be used for different estimation problems like effort, duration, quality or maintenance cost. This paper presents a critical review of potential applications of data mining techniques contributing to the project management field.
Partitional and incremental clustering are common models for mining data in large databases. However, some models are better than others depending on the types of data, time complexity, and space requirements. This paper describes the performance of partitional and incremental models based on the number of clusters and threshold values. Experimental studies show that partitional clustering performed better when the number of clusters increased, while incremental clustering performed better when the threshold value decreased.
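As an illustration of how a threshold value drives incremental clustering, the following is a minimal sketch of a single-pass, leader-style scheme. The abstract does not say which incremental algorithm was used, so the function name `incremental_cluster`, the Euclidean distance, and the centroid update rule are all assumptions made for this example.

```python
import math


def incremental_cluster(points, threshold):
    """Single-pass clustering (illustrative, not the paper's algorithm).

    Each point joins the nearest existing cluster if its distance to that
    cluster's centroid is at most `threshold`; otherwise it starts a new one.
    """
    centroids = []  # running mean of each cluster
    members = []    # list of points per cluster
    for p in points:
        best, best_d = None, None
        for i, c in enumerate(centroids):
            d = math.dist(p, c)  # Euclidean distance (an assumption)
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= threshold:
            members[best].append(p)
            n = len(members[best])
            # Recompute the centroid as the mean of the cluster's members.
            centroids[best] = tuple(
                sum(q[k] for q in members[best]) / n for k in range(len(p))
            )
        else:
            centroids.append(tuple(p))
            members.append([p])
    return members


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
coarse = incremental_cluster(points, threshold=2.0)
fine = incremental_cluster(points, threshold=0.5)
```

In this sketch a smaller threshold yields more, tighter clusters, which is the parameter the abstract varies; unlike a partitional method, the number of clusters is not fixed in advance.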
Defining the productivity required to complete a software development project successfully and within time and budget constraints is a reasoning problem that should be modelled under uncertainty. One way of achieving this is to estimate an interval accompanied by a probability instead of a particular value. In this paper we compare traditional methods that focus on point estimates, methods that focus on both point and interval estimates, and methods that produce only predefined interval estimates. In the case of predefined intervals, software cost estimation becomes a classification problem. All the above methods are applied to two different data sets, namely the COCOMO81 dataset and the Maxwell dataset. The ability of classification techniques to resolve one classification problem in cost estimation, namely determining the software development mode based on project attributes, is also assessed and compared to reported results.
In software cost estimation various methods have been proposed to yield a prediction of the productivity of a software project. Most of the methods produce point estimates. However, in practice it is more realistic and useful to have a method providing interval predictions. Although some methods accompany a point estimate with a prediction interval, it is also reasonable to use a method predicting the interval in which the cost will fall. In this paper, we consider a method called Multinomial Logistic Regression using as dependent variable the predefined cost intervals and as predictor variables the attributes, similar to the ones characterizing completed projects of the available data set. The method builds a model, which classifies any new software project, according to estimated probabilities, in one of the predefined intervals. The proposed method was applied to a well-known data set and was validated with respect to its fitting and predictive accuracy.
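To make the classification step concrete, here is a minimal sketch of how an already fitted multinomial logistic model could assign a new project to a predefined cost interval. The intervals, the two predictor attributes, and every coefficient in `COEFFS` are hypothetical values invented for illustration; they are not taken from the paper or from any data set.

```python
import math

# Hypothetical predefined cost intervals (e.g. person-months).
INTERVALS = [(0, 50), (50, 200), (200, float("inf"))]

# Hypothetical fitted coefficients, one row per interval:
# [intercept, weight for size, weight for complexity].
# The first interval is the reference class with all-zero coefficients.
COEFFS = [
    [0.0, 0.0, 0.0],
    [-2.0, 0.05, 0.5],
    [-6.0, 0.10, 1.0],
]


def class_probabilities(features):
    """Softmax over the per-class linear predictors."""
    scores = [
        w[0] + sum(wi * x for wi, x in zip(w[1:], features)) for w in COEFFS
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_interval(features):
    """Return the most probable interval and its estimated probability."""
    probs = class_probabilities(features)
    k = max(range(len(probs)), key=probs.__getitem__)
    return INTERVALS[k], probs[k]


small_interval, small_prob = predict_interval((10, 1))
large_interval, large_prob = predict_interval((100, 3))
```

The returned probability accompanies the interval, so a project manager sees both the predicted cost range and how confident the model is in it.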
When estimating software development effort, it may be useful to describe the uncertainty of the estimate through an effort prediction interval (PI). An effort PI consists of a minimum and a maximum effort value and a confidence level. We introduce and evaluate a software development effort PI approach that is based on the assumption that the estimation accuracy of earlier software projects predicts the effort PIs of new projects. First, we demonstrate the applicability and different variants of the approach on a data set of 145 software development tasks. Then, we experimentally compare the performance of one variant of the approach with human (software professionals') judgment and regression analysis-based effort PIs on a data set of 15 development tasks. Finally, based on the experiment and analytical considerations, we discuss when to base effort PIs on human judgment, regression analysis, or our approach.
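One way to read the idea that past estimation accuracy predicts new effort PIs is sketched below: take the ratios of actual to estimated effort from earlier projects and apply their empirical quantiles to a new point estimate. This is only an illustrative variant under stated assumptions; the function names, the use of the actual/estimate ratio as the accuracy measure, and the linear-interpolation quantile are choices made here, not details given in the abstract.

```python
def _quantile(sorted_vals, q):
    """Quantile by linear interpolation between the closest ranks."""
    pos = q * (len(sorted_vals) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] * (1 - frac) + sorted_vals[upper] * frac


def empirical_effort_pi(past_estimates, past_actuals, new_estimate,
                        confidence=0.9):
    """Return (minimum, maximum) effort for the given confidence level.

    Assumes the spread of actual/estimate ratios seen in earlier projects
    carries over to the new project.
    """
    ratios = sorted(a / e for e, a in zip(past_estimates, past_actuals))
    tail = (1.0 - confidence) / 2.0
    low = _quantile(ratios, tail)
    high = _quantile(ratios, 1.0 - tail)
    return new_estimate * low, new_estimate * high


# Hypothetical history: five tasks, each estimated at 100 hours.
past_estimates = [100, 100, 100, 100, 100]
past_actuals = [80, 100, 120, 150, 200]
pi_min, pi_max = empirical_effort_pi(
    past_estimates, past_actuals, new_estimate=200, confidence=0.5
)
```

With this hypothetical history, a 200-hour estimate maps to a 50% interval bounded by the first and third quartiles of the past ratios. A skewed history, where overruns are larger than underruns, yields an asymmetric interval around the point estimate.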