Empir Software Eng (2012) 17:62–74
DOI 10.1007/s10664-011-9182-8
On the dataset shift problem in software engineering
prediction models
Burak Turhan
Published online: 12 October 2011
© Springer Science+Business Media, LLC 2011
Editors: Martin Shepperd and Tim Menzies
Abstract A core assumption of any prediction model is that test data distribution
does not differ from training data distribution. Prediction models used in software
engineering are no exception. In reality, this assumption can be violated in many
ways, resulting in inconsistent and non-transferable observations across different
cases. The goal of this paper is to explain the phenomenon of conclusion instability
through the dataset shift concept, from a software effort and fault prediction
perspective. Different types of dataset shift are explained with examples from software
engineering, and techniques for addressing associated problems are discussed. While
dataset shifts in the form of sample selection bias and imbalanced data are well
known in software engineering research, understanding the other types is relevant
for interpreting non-transferable results across different sites and studies. The
software engineering community should be aware of, and account for, dataset shift
related issues when evaluating the validity of research outcomes.
Keywords Dataset shift ·Prediction models ·Effort estimation ·Defect prediction
1 Introduction
The software engineering community has widely adopted the use of prediction models
for estimating, e.g., development and maintenance cost/effort, fault count/density,
and the reliability of software projects. With the goal of building generalized prediction
models in the software development context, the community is faced with the challenge
of non-transferable results across different projects/studies. In order to determine
whether individual results are indeed transferable across different sites, first the
B. Turhan (B)
Department of Information Processing Science, University of Oulu,
POB.3000, 90014, Oulu, Finland
validity of the assumptions yielding these results should be examined, to understand
their effects on the results. This paper aims to introduce the dataset shift
concept (Candela et al. 2009; Hand 2006) to interpret the variability of results across
different sites from a data-oriented perspective, in addition to critiques of varying
performance assessment techniques (Shepperd and Kadoda 2001).
Prediction systems aim to make generalizations from past experiences for esti-
mating desired properties of future events. Such systems commonly operate under
the assumption that future events to be estimated will be near identical to past
events. More formally, a prediction system utilizes past (i.e. training) covariate-
response pairs {x_train, y_train}, sampled from a joint distribution of the form
p(X, Y) = p(Y|X)p(X), with true conditional model p(Y|X) and prior p(X), in order
to learn an estimated conditional p̂(Y|X) (i.e. the prediction model) and to make
predictions ŷ_test for future response variables given future (i.e. test) covariates x_test.
A prediction model is considered good to the extent that the estimated model
p̂(Y|X) is a close approximation to the underlying true model p(Y|X). This may be
evaluated by various error criteria, through many procedures, by measuring the delta
between the true and estimated model responses; as the CFP for this special
issue suggests, these may be causing the conclusion instability problem. Though non-
standardized evaluation measures and procedures may well be one cause, this paper
will assume an ideal state where evaluation is not an issue, and will rather focus on
the other possible causes of the conclusion instability problem, namely data related
issues. Specifically, any factor (i.e. p(X), p(Y), or other significant confounding
factors) that affects the joint density p(X, Y) and that changes between training and
test environments will also affect the performance of the prediction model p̂(Y|X),
unless accounted for during modeling.
The rest of this paper elaborates on the different types of dataset shift and
their implications for the software engineering field, provides pointers to computational
methods for dealing with certain types of dataset shift, and concludes by offering
explanations that match dataset shift types to certain reported (in)consistencies in
the software engineering literature.
2 Types of Dataset Shift
The problem of dataset shift has recently attracted the attention of noted
researchers (Candela et al. 2009; Hand 2006). In particular, Storkey classifies the
different forms of dataset shift into six groups, which are discussed in the following
sections (Storkey 2009). Some groups, such as sample selection bias and imbalanced
data, are well known, while others may be relevant but are not appropriately or
explicitly addressed in software engineering research results. The descriptions of the
different dataset shift types are based on Storkey (2009) and are interpreted from a
software engineering point of view. Note that the different types of dataset shift are
not mutually exclusive; on the contrary, they may be related in many cases.
2.1 Covariate Shift
Covariate shift occurs when the distributions of covariates in training and test data
differ, i.e. p(X_train) ≠ p(X_test). In this case, if the prediction model p̂(Y|X) is able
to correctly approximate the true model p(Y|X) globally, i.e. at least for p(X_test)
or p(X) in general, then there are no implications. Otherwise, the prediction model
might be a good approximation to the true model in the X-space only for the locality
defined by p(Xtrain), but not necessarily for p(Xtest). For instance, the true global
model can be a high-degree polynomial and can be successfully approximated by
a linear prediction model with a positive slope for the region of X-space covered
by the training samples, whereas the region covered by the test samples may show
linear characteristics with a negative slope, i.e. a different local model.
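The local-versus-global argument above can be sketched numerically; the cubic below and both covariate regions are made-up values used only for illustration:

```python
# A hypothetical true model f(x) = x^3 - x: a linear fit on the training
# region has a positive slope, while f is locally decreasing on the test region.
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

f = lambda x: x ** 3 - x                       # hypothetical true global model
train_x = [1.0 + 0.1 * i for i in range(11)]   # p(X_train): region [1.0, 2.0]
test_x = [-0.5 + 0.1 * i for i in range(11)]   # p(X_test): region [-0.5, 0.5]
a, b = linear_fit(train_x, [f(x) for x in train_x])
# b > 0, yet f decreases across the test region: a locally good linear
# approximation transfers badly under covariate shift.
```

The fitted slope b is positive because the cubic is increasing over the training region, while the same cubic is decreasing over the shifted test region.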
A typical implication for the software engineering domain concerns size-based
prediction models. Size is a commonly used covariate for effort or fault estimation,
and the corresponding models might be effective for project sizes within the traditional
operational values of a company. Due to new business opportunities, or changes
in technologies and development techniques, the size of new projects might differ
from those of the past.1 Such conditions require re-examination of existing prediction models
for covariate shift. Using data from projects of different scales and types, within or
across companies (this will also be addressed in Section 2.6), also requires control for
covariate shift. For instance, COCOMO incorporates local calibration for handling
this issue (Boehm et al. 2000).
Another important implication is for the design of simulation studies, where
training and test set splits, with or without cross validation, might result in covariate
shift among resulting datasets, and should be compensated for. Covariate shift
problem is an active topic in machine learning community, and proposed solutions
range from importance weighting of samples to kernel based methods (Bickel et al.
2009; Huang et al. 2006; Sugiyama et al. 2008), which will be discussed in Section 3.
2.2 Prior Probability Shift
Prior probability shift occurs when the predictive model is obtained via the
application of Bayes rule, p(Y|X) = p(X|Y)p(Y)/p(X), and the distributions of the
response variable differ in training and test data, i.e. p(Y_train) ≠ p(Y_test). In this
case the prediction model fails to properly approximate the true model. Note that the
direction of the causal relation between covariates and response variables changes in
prior probability shift, i.e. covariates are dependent on the response, whereas the
opposite holds for covariate shift (Storkey 2009).
In a software engineering context, prior probability shift may occur, for instance,
in fault prediction studies when the fault characteristics change in new projects
as a result of process improvement for better development, testing and quality
assurance activities that are not captured by the covariates, or simply due to different
characteristics of the test project. Furthermore, specialization in a business domain
and increased developer experience over time, and changes in business domain
may similarly affect the fault characteristics. Using fault models across projects
with different fault characteristics requires compensation for prior probability shift.
It is easy to augment the model accordingly by simply accounting for the shifted
distribution p(Ytest), if the new distribution is known. While this can be simulated
1 For simplicity, it is assumed that such changes have no confounding effects other than on
software size.
in empirical studies,2 where this is known a priori, in practice this information
will not be available. Therefore, p(Y_test) should be parametrized and computed by
marginalizing over the parameter space and the training covariates (Storkey 2009).
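The prior correction described above can be sketched as a simple posterior re-weighting; the class names and probabilities below are hypothetical:

```python
# Adjust a posterior learned under the training prior by the ratio of test
# to training priors, then re-normalize: p'(y|x) ∝ p(y|x) * p'(y) / p(y).
def reweight_posterior(posterior, prior_train, prior_test):
    """All arguments are dicts mapping class -> probability."""
    unnorm = {c: posterior[c] * prior_test[c] / prior_train[c]
              for c in posterior}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# A model trained where half the modules were faulty, deployed in a setting
# where only 10% are (both figures are made up).
posterior = {"faulty": 0.7, "clean": 0.3}
adjusted = reweight_posterior(posterior,
                              {"faulty": 0.5, "clean": 0.5},
                              {"faulty": 0.1, "clean": 0.9})
# The adjusted faulty probability drops well below the unadjusted 0.7.
```

This is exactly the "augment the model by accounting for the shifted distribution" step when the new prior happens to be known.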
2.3 Sample Selection Bias
Sample selection bias is a statistical issue (and, for the purposes of this paper, a
dataset shift type) that is well known and accounted for in software engineering
studies; it is listed here for the sake of completeness. In software prediction studies,
it should be considered that companies collecting relevant data mostly have higher
maturity levels than the general software engineering industry, and that selected
projects may not reflect the usual operating environment of a company; hence,
conclusions
from such data may not be externally valid. Increasing the scale of these kinds
of studies to address this validity problem, however, eventually surrenders to the
paradox of empirical research: it becomes practically impossible to draw meaningful
conclusions from a resulting population that is too general to characterize
(also see Section 2.6). Nevertheless, at the micro level, sample selection bias might be
introduced in a controlled manner to deal with (1) covariate shift, through relevancy
filtering (Kocaguneli et al. 2010; Turhan et al. 2009); (2) imbalanced data,
through sampling techniques (Menzies et al. 2008); and (3) source component shift,
through logical grouping of data (Bakır et al. 2010; Premraj and Zimmermann).
2.4 Imbalanced Data
Imbalanced data is concerned with cases where certain type(s) of events of
interest are observed rarely compared to other event types. This is a common issue in
software fault prediction studies, where the number of non-faulty entities dominates
the data sample, as opposed to the number of faulty entities. Since the goal is
to learn theories about this rare event (i.e. faults), a widely accepted solution is
to introduce a sample selection bias on purpose via the application of over/under
sampling techniques. In practice, this causes a prior probability shift when the learned
model is applied in the test settings, and it should be addressed with an adjustment
of the estimates of the model as explained in Section 2.2. While Storkey argues that
failure to compensate may be seen as a way to address the relative importance of false
positives and true positives (Storkey 2009), it gives no control to the practitioner for
adjusting this relative importance. In the worst case, if the losses associated with false
positives and true positives are equivalent, or if a specific loss function is desired,
then the fault prediction models will not be optimized with respect to the preferred
criteria, unless the prior probability shift is addressed.
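As a small illustration of this interaction, assuming a made-up 5% field fault rate, deliberate undersampling produces exactly the kind of prior gap that Section 2.2 says must be corrected:

```python
import random

# Deliberate sample selection bias: undersample the majority class to
# balance training data. The training prior then no longer matches the
# field prior, and that gap is a prior probability shift.
random.seed(1)
labels = ["faulty"] * 50 + ["clean"] * 950        # hypothetical 5% fault rate
minority = [l for l in labels if l == "faulty"]
majority = [l for l in labels if l == "clean"]
balanced = minority + random.sample(majority, len(minority))
prior_train = balanced.count("faulty") / len(balanced)  # 0.5 after sampling
prior_field = labels.count("faulty") / len(labels)      # 0.05 before sampling
# prior_train != prior_field: predictions from a model trained on
# `balanced` need the Section 2.2 adjustment before deployment.
```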
2 However, this would introduce an unfair bias, since it would mean using, during the model
construction phase, information related to an attribute that is to be predicted (i.e. the defect rate).
The model is supposed to predict that attribute in the first place, and should be blind to such prior
facts that exist in the test data.
2.5 Domain Shift
Domain shift refers to cases where a concept can be measured in different ways,
and the method of measurement changes between training and test conditions.
Of immediate relevance to software engineering are, again, size-based models. Size is a
concept that can be measured in different ways, from code or, earlier, in terms of
function points (FP). For instance, different values (with different units) are obtained
with the application of different counting methods for the latter (Demirors and Gencel 2009),
and similar problems arise in counting the lines of code (LOC). However, such size
measures are usually reported only as LOC or FP without providing measurement
details. Before applying a size-based model, e.g. to a new project, it must be assured
that the training and test data are collected in a consistent manner. Otherwise,
domain shift may lead to inconsistencies across studies for similar reasons as using
different evaluation measures for performance (Shepperd and Kadoda 2001).
2.6 Source Component Shift
Source component shift takes place when individual data points, or groups of data
points, in the sample originate from different sources with unique characteristics and
in varying proportions (Storkey 2009). This description is almost a perfect characterization of
software engineering datasets, where data from several projects or companies with
different properties are typically merged (Boehm et al. 2000; Lokan et al. 2001).
In this type of shift, due to their specific characteristics, each source component
might span a different region of the problem space, leading to overly generic models.
Experimental procedures, i.e. cross validation or train/test splits, would probably
cause covariate shift as well. Furthermore, as the proportion of sources may vary
among training and test conditions, prior probability shifts both in the ratio of
sources and response variables are inevitable. In software engineering, specifically in
cross-project prediction studies, source component shift has been referred to as data
heterogeneity (or homogeneity). The main idea of such studies has been to process
datasets (e.g. repartitioning through relevancy filtering, analysis of variance, or actual
data source) to achieve more homogeneous structures (i.e. identifying the source
components) to work with (Briand et al. 2002; Kitchenham et al. 2007; Turhan
et al. 2009; Wieczorek and Ruhe 2002; Zimmermann et al. 2009).
3 Managing Dataset Shift
This section aims to answer the question that, by now, should have arisen in the
reader's mind: "What needs to be done?" There is no silver bullet, but certain
techniques exist to address the different types of dataset shift to some extent.3 In this
section, some representative techniques are discussed in two broad groups, namely
instance-based and distribution-based techniques.
3 Domain shift is not included in the discussion, since it is a measurement related issue that should
be handled separately by the researcher/practitioner.
3.1 Instance Based Techniques
3.1.1 Outlier Detection
Outliers are data points that are not of interest from a statistical modeling point of
view, but that have the potential to significantly affect the dataset structure and model
parameters (Chandola et al. 2009). In other words, outliers may cause (covariate)
shifts in datasets and consequently lead to false generalizations. While the choice to
remove or keep outliers in data analysis is a topic of discussion, it is paramount to be
aware of their existence, to make the necessary assessments, and to take the required actions.
In this respect, studies reporting mean and standard deviation values for descriptive
statistics and performance results would be misleading if outliers exist and are
not removed, since these statistics are sensitive to outliers. Median and quartile
values should be preferred in such cases, as they are more robust to outliers than the mean
and standard deviation. Further, the robustness of different models against outliers
may vary; therefore, software engineering studies that compare different models
to select the "best" alternative should take this into account. In a hypothetical
scenario, the observed superiority of a complex model A over a simpler model
B on a given dataset may be purely due to model A's robustness against outliers,
and model B may perform as well as model A when outliers are removed. Hence,
it should be noted that while it is tempting to use readily available tools (e.g.
WEKA (Hall et al. 2009)) for constructing predictive models, it is important to be
aware of the limitations and strengths of the underlying models rather than treating
them as black boxes.
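The sensitivity argument can be illustrated with a toy sample; the effort values (person-months) below are hypothetical, with the last project as an outlier:

```python
import statistics

# Mean vs. median under a single outlier in made-up effort data.
efforts = [12, 14, 15, 16, 18, 400]       # last project is the outlier
mean_all, median_all = statistics.mean(efforts), statistics.median(efforts)
mean_trim = statistics.mean(efforts[:-1])      # outlier removed
median_trim = statistics.median(efforts[:-1])
# Dropping one point moves the mean from roughly 79 down to 15,
# while the median barely moves (15.5 to 15).
```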
In some cases outliers themselves may be the objects of interest in prediction.
Therefore, it is important to differentiate outlier detection from outlier removal.
Analysis of outliers could reveal the limitations of (a class of) predictive models.
Further, the context plays an important role in defining the outliers. For instance, it
does not have any practical value to construct a defect prediction model with an
exceptional goodness of fit if the removed outliers also account for a non-negligible
number of defect logs. This would yield the wrong impression that the
predictive model is able to detect almost all defects, when in fact the model would be
limited to detecting certain types of defects and would miss a significant number
of them. Hence, the initial goal, which is predicting defects in this case, should neither
be forgotten nor sacrificed for statistical excellence.
Though most software engineering studies tend to omit reporting the issue,
outlier detection has been addressed explicitly in a number of software engineering
publications. For instance, Briand et al. emphasize the importance of identifying (and
removing) both univariate and multivariate outliers, and advise using Pregibon beta
for logistic regression, Cook's distance for ordinary least squares, and scatter plots in
general (Briand and Wust 2002). Further, Kitchenham et al. employ jackknifed cross-
validation to achieve robustness against outliers (Kitchenham et al. 2007), and Keung
et al. developed a methodology, named Analogy-X, to identify outliers (Keung et al.
2008). Boxplots are also commonly used in software engineering studies for visual
inspection of datasets and results (Menzies et al. 2008; Turhan et al. 2009). Another
useful technique for visualizing and detecting outliers in time-series data is VizTrees
by Lin et al. (2004). It is beyond the scope of this paper to give details on the well-
founded field of outlier detection. We refer the reader to the extensive survey by
Chandola et al., where they group and discuss specific methods under classification-
based, nearest-neighbor-based, clustering-based and statistical techniques (Chandola
et al. 2009).
3.1.2 Relevancy Filtering
There is no agreed-upon definition of an outlier; its definition is rather context
dependent (Chandola et al. 2009). Relevancy filtering exploits this fact to introduce
a controlled sample selection bias that is tweaked towards the test set samples.
Specifically, relevancy filtering uses, for model construction, only those training set
samples that are close to the test set samples according to a similarity/distance metric.
Please note that only the covariates of the test data should be used in
calculating the similarity metric, and not the responses.4 An important implication
of using relevancy filtering is that there is no global static model; rather, a
new model is dynamically constructed from a different subset of the training data
each time a new test instance/batch arrives. A limitation of relevancy filtering is
that, for performance reasons, the underlying model must be simple enough to
enable on-the-fly construction.
Relevancy filtering addresses covariate shift by forcing a training distribution
that matches the test distribution as well as possible. However, there is a risk of
introducing a prior probability shift since the distribution of responses in the training
set cannot be guaranteed to be preserved. Nevertheless, relevancy filtering has yielded
promising results in software fault prediction (Turhan et al. 2009) and cost estimation
(Kocaguneli et al. 2010; Kocaguneli and Menzies 2011) studies for the selection
of relevant training data to construct prediction models. For example, Turhan et al.
were able to significantly improve the performance of naive Bayes prediction models
across different projects (i.e. learn from a set of projects, then apply to another
project) using nearest-neighbor similarity (Turhan et al. 2009).
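A minimal sketch of such nearest-neighbor relevancy filtering follows; the covariates (size in KLOC plus a complexity score) and the fault counts are hypothetical:

```python
# Keep, for each test covariate vector, its k nearest training rows.
# Only covariates enter the distance; training responses just ride along,
# and test responses are never used.
def relevancy_filter(train, test_covariates, k=2):
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    selected = set()
    for tx in test_covariates:
        ranked = sorted(range(len(train)),
                        key=lambda i: dist(train[i][0], tx))
        selected.update(ranked[:k])
    return [train[i] for i in sorted(selected)]

# (covariates, fault count) pairs -- all values made up.
train = [([1, 2], 0), ([2, 1], 1), ([50, 40], 9), ([60, 45], 12)]
test_covariates = [[55, 42]]                 # one large test project
subset = relevancy_filter(train, test_covariates, k=2)
# Only the two large training projects survive; a model is then built
# on this subset for this particular test batch.
```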
3.1.3 Instance Weighting
While relevancy filtering can be considered as hard-filtering (i.e. an instance is either
in or out), instance weighting is a soft-filtering method that assigns a weight to
each training instance based on its relative importance with respect to the test set
distribution. Similar to relevancy filtering, instance weighting specifically addresses
the covariate shift problem. In this technique, determining the weights becomes the
essential issue. Once the weights are set, it is possible to use the weighted versions
of commonly used models (e.g. weighted least squares regression, weighted naive
Bayes (Zhang and Sheng 2004)).
Recent progress in machine learning research has led to the development of
techniques for accurate weight identification. Shimodaira proposes identifying weights
based on the ratio of the test set distribution to the training data distribution (Shimodaira
2000). Figure 1 (from Shimodaira 2000) shows the effectiveness of the approach
on a toy example with true data generated by adding white noise to a third degree
4 In practice, this warning applies to simulation studies, since test responses are typically not known
in real settings.
Fig. 1 Example of instance weighting for comparing WLS and OLS performance in the existence of
covariate shift, from Shimodaira (2000)
polynomial. The plot on the left shows the fit of final models (weighted least squares
(WLS) vs. ordinary least squares (OLS)) on the training set and the plot on the right
shows the fit of WLS on the test set. Please note the covariate shift in the test set and
how it is successfully handled by WLS. It may also be argued that the region of input
space that contributes data points to the training set, but not to the test set, contains
outliers. In both situations, instance weighting overcomes covariate shift, whether
due to a change in the distribution or to outliers.
Another method for instance weighting, KLIEP, has been recently proposed by
Sugiyama et al. which minimizes the Kullback–Leibler divergence between training
and test set densities using kernel-based methods, without explicit density
estimation (Sugiyama et al. 2008). Sugiyama et al. argue that KLIEP outperforms other
techniques, such as Bickel et al.'s formalization as an optimization problem and the
kernel mean matching (KMM) method proposed by Huang et al. (2006). Please refer to the
relevant publications for more details.
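A toy reconstruction of this importance-weighting scheme, assuming for simplicity that both covariate densities are Gaussian (every number below is made up and the true model is deliberately nonlinear):

```python
import math
import statistics

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) \
        / (sigma * math.sqrt(2 * math.pi))

def wls(xs, ys, ws):
    """Weighted least squares for y = a + b*x (closed form)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    b = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) \
        / sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    return my - b * mx, b

train_x = [0.5, 1.0, 1.5, 4.0, 4.5, 5.0]
train_y = [x ** 2 for x in train_x]        # nonlinear true model
test_x = [4.0, 4.2, 4.5, 4.8, 5.0]         # test mass sits on the right only
# Shimodaira-style importance weights: p_test(x) / p_train(x).
ws = [gauss_pdf(x, statistics.mean(test_x), statistics.stdev(test_x))
      / gauss_pdf(x, statistics.mean(train_x), statistics.stdev(train_x))
      for x in train_x]
a_w, b_w = wls(train_x, train_y, ws)                    # shift-aware fit
a_o, b_o = wls(train_x, train_y, [1.0] * len(train_x))  # plain OLS fit
err = lambda a, b: sum(abs(a + b * x - x ** 2) for x in test_x) / len(test_x)
# err(a_w, b_w) comes out markedly smaller than err(a_o, b_o): weighting
# concentrates the fit where the test covariates actually live.
```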
3.2 Distribution Based Methods
3.2.1 Stratification
In simulation-based settings, stratification (or stratified sampling) ensures that prior
probability shift does not occur. Studies using cross validation or training-test
data splits should prefer stratification over random splits, in order to preserve the
ratio of different response groups in the newly created data bins. In classification
problems such as fault prediction, its application is straightforward. In regression-
type fault prediction and effort estimation studies, possible solutions are to devise
logical groups (e.g. five categories of effort ranging from very low to very high) based
on pre-determined thresholds or to identify some number of clusters of response
variable values, and then to reflect the ratio of these logical groups/clusters in the
whole dataset into the training and test sets.
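A minimal stratified split along these lines might look as follows; the 20/80 fault ratio is hypothetical:

```python
import random

# Split so that each class keeps its overall ratio in both halves.
def stratified_split(items, labels, test_frac=0.25, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_frac))
        test += [(items[i], labels[i]) for i in idx[:cut]]
        train += [(items[i], labels[i]) for i in idx[cut:]]
    return train, test

labels = ["faulty"] * 20 + ["clean"] * 80      # made-up 20/80 fault data
items = list(range(100))
train, test = stratified_split(items, labels)
fault_ratio = lambda split: sum(1 for _, l in split if l == "faulty") / len(split)
# Both halves preserve the 20% fault ratio of the whole dataset.
```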
Fig. 2 Example cost curves for comparing two fault prediction models, from Jiang et al. (2008a)
Stratification is also helpful in cases of imbalanced data. In such cases, random splits
may result in the exclusion of the minority instances from most of the newly created
bins, which would cause numeric errors in performance metric calculation and render
the corresponding bin models meaningless.
However, it should be noted that while stratification avoids prior probability
shift in simulation settings, it also assumes the real world data will not suffer from
prior probability shift after the model is deployed. Therefore, it is useful to utilize
stratification in simulation settings for post hoc analysis, but this does not solve the
problems associated with prior probability shift between simulation and real settings.
This problem is addressed with cost curves, which are explained next.
3.2.2 Cost Curves
In addition to the problem stated above, the problems associated with the application
of over/under sampling to imbalanced data were discussed in Section 2.4, arguing that
sampling strategies cause prior probability shift and affect the loss functions in an
uncontrolled manner. Sampling techniques may be preferred over stratification, and
are commonly used, as a design decision in predictive model construction; yet
both approaches are prone to prior probability shift and to changing relative costs in
real settings.
For binary classification problems (i.e. fault prediction), Drummond and Holte’s
cost curves provide control over all these complications at once (Drummond and
Holte 2006). Cost curves visualize predictor model performances “over the full range
of possible class distributions and misclassification costs” for binary classification
problems (Drummond and Holte 2006). Cost curves provide decision support by
visual inspection of all possible future scenarios for class ratios and loss functions.
Therefore, it is advised to include cost curve analysis in empirical studies of predictive
models in order to see the capabilities of reported models over the space of all
possible future states.
Cost curves have been investigated by Jiang et al. in the software engineering domain,
along with a comparison against alternative methods for model selection, including lift
charts and ROC curves (Jiang et al. 2008a). An example of cost curves is shown in Fig. 2
(from Jiang et al. 2008a), which compares the fault prediction performance of logistic
regression and naive Bayes classifiers on the KC4 dataset from the NASA MDP repository.
In order to determine the cost curve for a model, the points on a ROC curve are
converted to lines, and the convex hull with the minimum area, including the x-axis,
is found. In the example, logistic regression turns out to be the better model, since its
curve is consistently closer to the optimal cost curve (i.e. the x-axis) for all scenarios.
Please refer to Drummond and Holte (2006) for more details on cost curves and
to Jiang et al. (2008a, b) for applications in software fault prediction.
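The ROC-to-cost-line conversion can be sketched directly from Drummond and Holte's normalized expected cost, NE(PC) = (1 - TPR - FPR) * PC + FPR; the two operating points below are made up:

```python
# Each ROC point (FPR, TPR) becomes a line over the probability-cost axis;
# a model's cost curve is the pointwise minimum over its lines, including
# the trivial always-negative (0,0) and always-positive (1,1) classifiers.
def cost_line(fpr, tpr):
    return lambda pc: (1.0 - tpr - fpr) * pc + fpr

def lower_envelope(roc_points, pcs):
    lines = [cost_line(f, t) for f, t in roc_points + [(0.0, 0.0), (1.0, 1.0)]]
    return [min(line(pc) for line in lines) for pc in pcs]

pcs = [i / 20 for i in range(21)]            # probability-cost values in [0, 1]
model_a = lower_envelope([(0.1, 0.7)], pcs)  # hypothetical operating point
model_b = lower_envelope([(0.3, 0.8)], pcs)  # hypothetical operating point
# Neither model dominates: A is cheaper at low PC values, B at high ones,
# which is exactly the scenario-dependent comparison cost curves enable.
```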
3.2.3 Mixture Models
In order to address source component shift, where data are known to have different
origins, the individual source components in the data should be accounted for. These
components can either be identified manually, when there is information about the
origin of the data, or be estimated and handled automatically with mixture models such as
"mixture of Gaussians" and "mixture of experts" (Alpaydin 2010; Storkey 2009).
In manual identification, meta-knowledge can be used to define alternative
source components, e.g. company, domain, project team, or individual team members.
As an example of manual identification, Wieczorek and Ruhe compared
company-specific vs. multi-company data for cost estimation using the Laturi database,
consisting of 206 projects from 26 different companies, i.e. source components
defined at the company level (Wieczorek and Ruhe 2002). While they did not find
any significant advantage of using company-specific data, the systematic review by
Kitchenham et al. revealed that some companies may achieve better cost estimations
with company-specific data (Kitchenham et al. 2007). Considering the different ways
of defining source components, Wieczorek and Ruhe recommend domain clustering
as an alternative, and Bakır et al. report such an application in the embedded systems
domain (Bakır et al. 2010).
When it is not possible to identify the number of source components in the
data, the latter approach (automated mixture models) can be used instead. Mixture
models address multi-modalities (i.e. different source components) in the data as
opposed to the uni-modality assumption of their counterparts. Therefore, the idea
of mixture models is to identify multiple density distributions (commonly from the
same family, e.g. Gaussian) and to fit different models for each density component.
In practice, a test datum is assigned to a single source component and a prediction
is achieved based on the specified model for that component. As an alternative, an
aggregation of all predictions can be taken into account based on a weighting of the
posterior probabilities that the test datum belongs to a certain source component.
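A bare-bones EM fit of a two-component one-dimensional Gaussian mixture sketches the automatic route; the pooled "effort" values from two hypothetical companies are made up:

```python
import math

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, iters=50):
    """A few EM steps for a 2-component 1-D Gaussian mixture."""
    mu, var, mix = [min(xs), max(xs)], [1.0, 1.0], [0.5, 0.5]  # crude init
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [mix[k] * gauss_pdf(x, mu[k], var[k]) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate mixing weights, means, and variances.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mix[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return mix, mu, var

# Effort data pooled from two hypothetical companies operating at very
# different scales -- the multi-modality a single Gaussian would miss.
xs = [10, 11, 12, 13, 9, 100, 105, 110, 95, 102]
mix, mu, var = em_gmm_1d(xs)
# mu recovers the two company-level centers (near 11 and 102); a separate
# prediction model would then be fit per recovered component.
```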
For an application of mixture models to fault prediction, please see Guo and Lyu.

4 Summary
In order to provide a possible explanation for the conclusion instability problem, this
paper introduced the dataset shift concept, discussed its implications for predictive
model construction for software engineering problems, and provided pointers to
representative techniques to address associated issues.
Dataset shift offers viable justifications for certain results in software engineering:

- Covariate shift might be the cause of the inconsistencies reported by Shepperd
  and Kadoda across different training and test set splits, and of the seemingly
  better performance of case-based reasoning (Shepperd and Kadoda 2001), as well as
  of the leave-one-out cross validation related issues raised by Myrtveit et al.
  (Myrtveit et al. 2005).
- Covariate shift also explains why COCOMO with local calibration is consistently
  selected among the best models in Menzies et al.'s extensive effort estimation
  experiments, along with source component shift, as they logically grouped their
  data (Menzies et al. 2010).
- The benefits of relevancy filtering in Kocaguneli et al.'s effort estimation
  (Kocaguneli et al. 2010) and Turhan et al.'s fault prediction (Turhan et al. 2009)
  studies can be attributed to compensation for covariate shift through a controlled
  sample selection bias.
- Source component shift, together with covariate shift, is referred to as data
  heterogeneity/homogeneity and is addressed to some extent in certain software
  engineering studies (Briand et al. 2002; Kitchenham et al. 2007; Turhan et al.
  2009; Wieczorek and Ruhe 2002; Zimmermann et al. 2009).
- The contradicting results in cross-project/company studies can be explained by
  covariate shift, prior probability shift, source component shift, or a
  combination of those (Kitchenham et al. 2007; Turhan et al. 2009; Zimmermann
  et al. 2009).
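The relevancy-filtering idea mentioned above can be made concrete with a small sketch in the spirit of the nearest-neighbour filter of Turhan et al. (2009): for each local (test) instance, keep only its k nearest cross-project training instances, deliberately biasing sample selection toward the test covariate distribution. The synthetic data, k = 10, and the naive Bayes learner below are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(1)

# Cross-project training pool and local test data (synthetic stand-ins);
# the local covariates are shifted relative to the cross-project pool.
X_cross = rng.normal(0.0, 1.0, size=(300, 4))
y_cross = (X_cross[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)
X_local = rng.normal(0.5, 1.0, size=(40, 4))

# Relevancy filtering: for each local instance, select its k nearest
# cross-project neighbours; the union of selections is the training set.
k = 10
nn = NearestNeighbors(n_neighbors=k).fit(X_cross)
_, idx = nn.kneighbors(X_local)
selected = np.unique(idx.ravel())

# Train only on the filtered, test-relevant subset of cross-project data.
model = GaussianNB().fit(X_cross[selected], y_cross[selected])
predictions = model.predict(X_local)
```

Note that the selection uses only the local covariates, never local labels, so such filtering is applicable before any local fault or effort data have been collected.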
Dataset shift is a recently coined and active research topic in the machine learning
community, and the classification of dataset shift types and techniques described in
this paper is not necessarily complete. Nevertheless, dataset shift provides a means
to study the conclusion instability problem in software engineering predictions, and
the community can benefit from validating and interpreting its results from this
perspective.
Acknowledgements This research is partly funded by the Finnish Funding Agency for Technology
and Innovation (TEKES) under Cloud Software Program. The author would like to thank the
anonymous reviewers for their suggestions which greatly improved the paper.
References

Alpaydin E (2010) Introduction to machine learning, 2nd edn. The MIT Press, Cambridge, MA
Bakır A, Turhan B, Bener A (2010) A new perspective on data homogeneity in software cost
estimation: a study in the embedded systems domain. Softw Qual J 18(1):57–80
Bickel S, Brückner M, Scheffer T (2009) Discriminative learning under covariate shift. J Mach Learn
Res 10:2137–2155
Boehm B, Horowitz E, Madachy R, Reifer D, Clark BK, Steece B, Brown AW, Chulani S, Abts C
(2000) Software cost estimation with Cocomo II. Prentice Hall, Englewood Cliffs, NJ
Briand L, Wust J (2002) Empirical studies of quality models in object-oriented systems. Adv Comput
Briand LC, Melo WL, Wust J (2002) Assessing the applicability of fault-proneness models across
object-oriented software projects. IEEE Trans Softw Eng 28:706–720
Quiñonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence ND (eds) (2009) Dataset shift in machine
learning. The MIT Press, Cambridge, MA
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv
Demirors O, Gencel C (2009) Conceptual association of functional size measurement methods. IEEE
Softw 26(3):71–78
Drummond C, Holte RC (2006) Cost curves: an improved method for visualizing classifier perfor-
mance. Mach Learn 65(1):95–130
Guo P, Lyu MR (2000) Software quality prediction using mixture models with EM algorithm.
In: Proceedings of the the first Asia-Pacific conference on quality software (APAQS’00). IEEE
Computer Society, Washington, DC, USA, pp 69–78
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining
software: an update. SIGKDD Explor 11(1)
Hand DJ (2006) Classifier technology and the illusion of progress. Stat Sci 21(1):1–15
Huang J, Smola AJ, Gretton A, Borgwardt KM, Schölkopf B (2006) Correcting sample selection bias
by unlabeled data. Neural Information Processing Systems, pp 601–608
Jiang Y, Cukic B, Ma Y (2008a) Techniques for evaluating fault prediction models. Empir Softw Eng
Jiang Y, Cukic B, Menzies T (2008b) Cost curve evaluation of fault prediction models. In: Proceed-
ings of the 19th int’l symposium on software reliability engineering (ISSRE 2008), Redmond,
WA, pp 197–206
Keung JW, Kitchenham BA, Jeffery DR (2008) Analogy-X: providing statistical inference to
analogy-based software cost estimation. IEEE Trans Softw Eng 34(4):471–484
Kitchenham BA, Mendes E, Travassos GH (2007) Cross versus within-company cost estimation
studies: a systematic review. IEEE Trans Softw Eng 33(5):316–329
Kocaguneli E, Menzies T (2011) How to find relevant data for effort estimation? In: Proceedings of
the 5th ACM/IEEE international symposium on empirical software engineering and measure-
ment (ESEM’11)
Kocaguneli E, Gay G, Menzies T, Yang Y, Keung JW (2010) When to use data from other projects
for effort estimation. In: Proceedings of the IEEE/ACM international conference on automated
software engineering (ASE ’10). ACM, New York, pp 321–324
Lin J, Keogh E, Lonardi S, Lankford J, Nystrom DM (2004) Visually mining and monitoring massive
time series. In: Proceedings of 10th ACM SIGKDD international conference on knowledge and
data mining. ACM Press, pp 460–469
Lokan C, Wright T, Hill PR, Stringer M (2001) Organizational benchmarking using the ISBSG data
repository. IEEE Softw 18:26–32
Menzies T, Jalali O, Hihn J, Baker D, Lum K (2010) Stable rankings for different effort models.
Autom Softw Eng 17(4):409–437
Menzies T, Turhan B, Bener A, Gay G, Cukic B, Jiang Y (2008) Implications of ceiling effects
in defect predictors. In: Proceedings of the 4th international workshop on predictor models in
software engineering (PROMISE ’08). ACM, New York, pp 47–54
Myrtveit I, Stensrud E, Shepperd M (2005) Reliability and validity in comparative studies of software
prediction models. IEEE Trans Softw Eng 31(5):380–391
Premraj R, Zimmermann T (2007) Building software cost estimation models using homogenous
data. In: Proceedings of the first international symposium on empirical software engineering and
measurement (ESEM ’07). IEEE Computer Society, Washington, DC, USA, pp 393–400
Shepperd M, Kadoda G (2001) Comparing software prediction techniques using simulation. IEEE
Trans Softw Eng 27(11):1014–1022
Shimodaira H (2000) Improving predictive inference under covariate shift by weighting the log-
likelihood function. J Stat Plan Inference 90(2):227–244
Storkey A (2009) When training and test sets are different: characterizing learning transfer.
In: Quiñonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence ND (eds) Dataset shift in
machine learning, chapter 1. The MIT Press, Cambridge, MA, pp 3–28
Sugiyama M, Suzuki T, Nakajima S, Kashima H, von Bünau P, Kawanabe M (2008)
Direct importance estimation for covariate shift adaptation. Ann Inst Stat Math 60(4):699–
Turhan B, Menzies T, Bener AB, Di Stefano J (2009) On the relative value of cross-company and
within-company data for defect prediction. Empir Softw Eng 14(5):540–578
Wieczorek I, Ruhe M (2002) How valuable is company-specific data compared to multi-company
data for software cost estimation? In: Proceedings of the 8th international symposium on soft-
ware metrics (METRICS ’02). IEEE Computer Society, Washington, DC, USA, p 237
Zhang H, Sheng S (2004) Learning weighted naive Bayes with accurate ranking. In: Proceedings of
the 4th IEEE international conference on data mining, pp 567–570
Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction.
In: Proceedings of the 7th joint meeting of the European software engineering conference and
the ACM SIGSOFT symposium on the foundations of software engineering. ACM
Burak Turhan holds a PhD in Computer Engineering from Bogazici University, Turkey. He is a
postdoctoral researcher in the Department of Information Processing Science at the University of
Oulu, Finland. His research interests include empirical studies of software engineering on software
quality, defect prediction, and cost estimation, as well as data mining for software engineering and
agile/lean software development, with a special focus on test-driven development. He is a member of
ACM, IEEE, and the IEEE Computer Society.
... • SE model flaws: What flaws has the accumulation of decades of model introduction revealed? [11,12,13,14,15] -Inconsistency in effort -Inconsistency in defects -Inconsistency in process • Solutions: Methods to handle instability of models [11,12,13,14,15] -Envy-based learning -Ensembles • Static [11,12,13,14,15] -Temporal -GAC-based ...
... • SE model flaws: What flaws has the accumulation of decades of model introduction revealed? [11,12,13,14,15] -Inconsistency in effort -Inconsistency in defects -Inconsistency in process • Solutions: Methods to handle instability of models [11,12,13,14,15] -Envy-based learning -Ensembles • Static [11,12,13,14,15] -Temporal -GAC-based ...
... • SE model flaws: What flaws has the accumulation of decades of model introduction revealed? [11,12,13,14,15] -Inconsistency in effort -Inconsistency in defects -Inconsistency in process • Solutions: Methods to handle instability of models [11,12,13,14,15] -Envy-based learning -Ensembles • Static [11,12,13,14,15] -Temporal -GAC-based ...
Conference Paper
In the age of big data, data science is an essential skill that should be equipped by software engineers, practitioners, developers, and researchers who want to understand the state of the art in using data science for software engineering (SE). It can be used to forecast useful information about new projects based on previous projects. Different fields/areas/branches of Software Engineering and Computer Science have a pretty wide and immense scope of Data Science for the purpose of improvement of available or research for new technologies, algorithm and mining, manipulating or clustering of Data. As Data Science is being used in almost every application, programs or systems, Data Science has become the basic and fundamental element for Computer Science and Software and Engineering. With the time the demand of innovation and updating of software is increasing with that requirement of Data Science for Software Engineering is increasing simultaneously. This paper will discuss the tasks involved in deploying machine-learning algorithms in organizations, followed by discretization, clustering, dichotomization, and statistical analysis.It will address issues such as how to adapt data from other organizations to local problems when local data is scarce. When privacy concerns prevent access, how can data be privatized while still being mined? When working with data of questionable quality, how do you prune spurious information? How to simplify data mining results when data or models appear to be overly complex. Methods for generating predictions when data is insufficient to support complex models. What to do when the world changes and old models need to be updated. When the effect is too complex for a single model, how do you reason across ensembles of models? [1,16,17,18,19,20]
... Several methods, including those based on distance measures [21] and dimensionality reduction followed by statistical hypothesis testing [22], have been proposed for this purpose. Techniques for mitigating the effect of dataset mismatch include importance weighting [23] and utilizing stratification, cost curves, or mixture models [24], among others. ...
Full-text available
This chapter presents a regulatory science perspective on the assessment of machine learning algorithms in diagnostic imaging applications. Most of the topics are generally applicable to many medical imaging applications, while brain disease-specific examples are provided when possible. The chapter begins with an overview of US FDA’s regulatory framework followed by assessment methodologies related to ML devices in medical imaging. Rationale, methods, and issues are discussed for the study design and data collection, the algorithm documentation, and the reference standard. Finally, study design and statistical analysis methods are overviewed for the assessment of standalone performance of ML algorithms as well as their impact on clinicians (i.e., reader studies). We believe that assessment methodologies and regulatory science play a critical role in fully realizing the great potential of ML in medical imaging, in facilitating ML device innovation, and in accelerating the translation of these technologies from bench to bedside to the benefit of patients.
... Drift characterization commonly distinguishes sudden, gradual, incremental and recurrent drift [59]. In the context of software analytics tools, previous work also suggested approaches to analyze and catalog different types of data drift [40,57]. MLOps engineers need to solve the problem of adapting deployed models to concept changes, ideally predicting those changes for future data points. ...
Full-text available
Nowadays, software analytics tools using machine learning (ML) models to, for example, predict the risk of a code change are well established. However, as the goals of a project shift over time, and developers and their habits change, the performance of said models tends to degrade (drift) over time, until a model is retrained using new data. Current retraining practices typically are an afterthought (and hence costly), requiring a new model to be retrained from scratch on a large, updated data set at random points in time; also, there is no continuity between the old and new model. In this paper, we propose to use lifelong learning (LL) to continuously build and maintain ML-based software analytics tools using an incremental learner that progressively updates the old model using new data. To avoid so-called ''catastrophic forgetting'' of important older data points, we adopt a replay buffer of older data, which still allows us to drastically reduce the size of the overall training data set, and hence model training time. We empirically evaluate our LL approach on two industrial use cases, i.e., a brown build detector and a Just-in-Time risk prediction tool, showing how LL in practice manages to at least match traditional retraining-from-scratch performance in terms of F1-score, while using 3.3-13.7x less data at each update, thus considerably speeding up the model updating process. Considering both the computational effort of updates and the time between model updates, the LL setup needs 2-40x less computational effort than retraining-from-scratch setups.
... Bhat and Farooq (2021) propose a filter approach (BurakMHD filter) for selecting relevant training data in CPDP and conclude the BurakMHD filter compared to Burak filter Turhan et al. (2009) and Peter filter Peters (2013) improves the CPDP performance. Turhan (2012) asserted the dataset shift problem between software defect datasets is the main reason for the substandard performance of CPDP. Hosseini et al. (2018) infer that search-based methods integrated with feature selection are a propitious way for training data selection in CPDP. ...
Full-text available
The software defect prediction approaches are evaluated, in within-project context only, with only a few other approaches, according to distinct scenarios and performance indicators. So, we conduct various experiments to evaluate well-known defect prediction approaches using different performance indicators. The evaluations are performed in the scenario of ranking the entities — with and without considering the effort to review the entities and classifying entities in within-project as well as cross-project contexts. The effect of imbalanced datasets on the ranking of the approaches is also evaluated. Our results indicate that in within-project as well as cross-project context, process metrics, the churn of source code, and entropy of source code perform significantly better under the context of classification and ranking — with and without effort consideration. The previous defect metrics and other single metric approaches (like lines of code) perform worst. The ranking of the approaches is not changed by imbalanced datasets. We suggest using the process metrics, the churn of source code, and entropy of source code metrics as predictors in future defect prediction studies and taking care while using the single metric approaches as predictors. Moreover, different evaluation scenarios generate different ordering of approaches in within-project and cross-project contexts. Therefore, we conclude that each problem context has distinct characteristics, and conclusions of within-project studies should not be generalized to cross-project context and vice versa.
... Data imbalance: In most cases, defects are not equally dispersed where the number of defective codes is much lower than the number of non-defective codes. This problem is called class imbalance [38]. The class imbalance substantially affects the performance and generalizability of classification models [39]. ...
Context: Despite recent attention given to Software Defect Prediction (SDP), the lack of any systematic effort to assess existing empirical evidence on the application of Deep Learning (DL) in SDP indicates that it is still relatively under-researched. Objective: To synthesize literature on SDP using DL, pertaining to measurements, models, techniques, datasets, and achievements; to obtain a full understanding of current SDP-related methodologies using DL; and to compare the DL models’ performances with those of Machine Learning (ML) models in classifying software defects. Method: We completed a thorough review of the literature in this domain. To answer the research issues, results from primary investigations were synthesized. The preliminary findings for DL vs. ML in SDP were verified by using meta-analysis (MA). Result: We discovered 63 primary studies that passed the systematic literature review quality evaluation. However, only 19 primary studies passed the MA quality evaluation. The five most popular performance mea- surements employed in SDP were f-measure, recall, accuracy, precision, and Area Under the Curve (AUC). The top five DL techniques used in building SDP models were Convolutional Neural Network (CNN), Deep Neural Network (DNN), Long Short-Term Memory (LSTM), Deep Belief Network (DBN), and Stacked Denoising Autoencoder (SDAE). PROMISE and NASA datasets were found to be used more frequently to train and test DL models in SDP. The MA results show that DL was favored over ML in terms of study and dataset across accuracy, f-measure, and AUC. Conclusion: The application of DL in SDP remains a challenge, but it has the potential to achieve better predictive performance when the performance-influencing parameters are optimized. We provide a reference point for future research which could be used to improve research quality in this domain.
... Its performance could be enhanced as the dimension of the training dataset rises [25]. Still, it largely depends on the distribution gap between training and test datasets: a highly divergent test dataset would test an ML prediction model on a feature space that it was not trained on, resulting in poor testing and results; additionally, a highly overlapping test dataset would not test the model for its generalization ability [30]. Specifically, DL employs algorithms such as DNNs and convolutional neural networks (CNNs) [15]. ...
Full-text available
Arterial hypertension (AH) is a progressive issue that grows in importance with the increased average age of the world population. The potential role of artificial intelligence (AI) in its prevention and treatment is firmly recognized. Indeed, AI application allows personalized medicine and tailored treatment for each patient. Specifically, this article reviews the benefits of AI in AH management, pointing out diagnostic and therapeutic improvements without ignoring the limitations of this innovative scientific approach. Consequently, we conducted a detailed search on AI applications in AH: the articles (quantitative and qualitative) reviewed in this paper were obtained by searching journal databases such as PubMed and subject-specific professional websites, including Google Scholar. The search terms included artificial intelligence, artificial neural network, deep learning, machine learning, big data, arterial hypertension, blood pressure, blood pressure measurement, cardiovascular disease, and personalized medicine. Specifically, AI-based systems could help continuously monitor BP using wearable technologies; in particular, BP can be estimated from a photoplethysmograph (PPG) signal obtained from a smartphone or a smartwatch using DL. Furthermore, thanks to ML algorithms, it is possible to identify new hypertension genes for the early diagnosis of AH and the prevention of complications. Moreover, integrating AI with omics-based technologies will lead to the definition of the trajectory of the hypertensive patient and the use of the most appropriate drug. However, AI is not free from technical issues and biases, such as over/underfitting, the “black-box” nature of many ML algorithms, and patient data privacy. 
In conclusion, AI-based systems will change clinical practice for AH by identifying patient trajectories for new, personalized care plans and predicting patients’ risks and necessary therapy adjustments due to changes in disease progression and/or therapy response.
... Generally, defects are not distributed evenly, and the number of defective modules is fewer compared to the non-defective modules. This problem is called the class imbalance problem [34]. Data imbalance dramatically influences the performance of the classification models as well as decreases their generalisability [13]. ...
Full-text available
In software development, defects influence the quality and cost in an undesirable way. Software defect prediction (SDP) is one of the techniques which improves the software quality and testing efficiency by early identification of defects(bug/fault/error). Thus, several experiments have been suggested for defect prediction (DP) techniques. Mainly DP method utilises historical project data for constructing prediction models. SDP performs well within projects until there is an adequate amount of data accessible to train the models. However, if the data are inadequate or limited for the same project, the researchers mainly use Cross-Project Defect Prediction (CPDP). CPDP is a possible alternative option that refers to anticipating defects using prediction models built on historical data from other projects. CPDP is challenging due to its data distribution and domain difference problem. The proposed framework is an effective two-stage approach for CPDP, i.e., model generation and prediction process. In model generation phase, the conglomeration of different pre-processing, including feature selection and class reweights technique, is used to improve the initial data quality. Finally, a fine-tuned efficient bagging and boosting based hybrid ensemble model is developed, which avoids model over -fitting/under-fitting and helps enhance the prediction performance. In the prediction process phase, the generated model predicts the historical data from other projects, which has defects or clean. The framework is evaluated using25 software projects obtained from public repositories. The result analysis shows that the proposed model has achieved a 0.71±0.03 f1-score, which significantly improves the state-of-the-art approaches by 23 % to 60 %.
Traditional statistical learning algorithms perform poorly in case of learning from an imbalanced dataset. Software defect prediction (SDP) is a useful way to identify defects in the primary phases of the software development life cycle. This SDP methodology will help to remove software defects and induce to build a cost-effective and good quality of software products. Several statistical and machine learning models have been employed to predict defects in software modules. But the imbalanced nature of this type of datasets is one of the key characteristics, which needs to be exploited, for the successful development of a defect prediction model. Imbalanced software datasets contain non-uniform class distributions with most of the instances belonging to a specific class compared to that of the other class. We propose a novel hybrid model based on Hellinger distance-based decision tree (HDDT) and artificial neural network (ANN), which we call as hybrid HDDT-ANN model, for analysis of software defect prediction (SDP) data. This is a newly developed model which is found to be quite effective in predicting software bugs. A comparative study of several supervised machine learning models with our proposed model using different performance measures is also produced. Hybrid HDDT-ANN also takes care of the strength of a skew-insensitive distance measure, known as Hellinger distance, in handling class imbalance problems. A detailed experiment was performed over ten NASA SDP datasets to prove the superiority of the proposed method. KeywordsSoftware defect predictionClass imbalanceHellinger distanceArtificial neural networkHybrid model
Because defects in software modules (e.g., classes) might lead to product failure and financial loss, software defect prediction enables us to better understand and control software quality. Software development is a dynamic evolutionary process that may result in data distributions (e.g., defect characteristics) varying from version to version. In this case, effective cross‐version defect prediction (CVDP) is not easy to achieve. In this paper, we aim to investigate whether the defect prediction method of the threshold‐based active learning (TAL) can tackle the problem of the different data distribution between successive versions. Our TAL method includes two stages. At the active learning stage, a committee of investigated metrics is constructed to vote on the unlabeled modules of the current version. We pick up the unlabeled module with the median of voting scores to domain experts. The domain experts test and label the selected unlabeled module. Then, we merge the selected labeled module and the remaining modules with pseudo‐labels from the current version into the labeled modules of the prior version to form enhanced training data. Based on the training data, we derive the metric thresholds used for the next iteration. At the defect prediction stage, the iterations stop when a predefined threshold is reached. Finally, we use the cutoff threshold of voting scores, that is, 50%, to predict the defect‐prone of the remaining unlabeled modules. We evaluate the TAL method on 31 versions of 10 projects with three prevalent performance indicators. The results show that TAL outperforms the baseline methods, including three variations methods, two common supervised methods, and the state‐of‐the‐art method Hybrid Active Learning and Kernel PCA (HALKP). The results indicate that TAL can effectively address the different data distribution between successive versions. 
Furthermore, to keep the cost of extensive testing low in practice, selecting 5% of candidate modules from the current version is sufficient for TAL to achieve a good performance of defect prediction.
Full-text available
We systematically study the capacity of two large language models for code - CodeT5 and Codex - to generalize to out-of-domain data. In this study, we consider two fundamental applications - code summarization, and code generation. We split data into domains following its natural boundaries - by an organization, by a project, and by a module within the software project. This makes recognition of in-domain vs out-of-domain data at the time of deployment trivial. We establish that samples from each new domain present both models with a significant challenge of distribution shift. We study how well different established methods can adapt models to better generalize to new domains. Our experiments show that while multitask learning alone is a reasonable baseline, combining it with few-shot finetuning on examples retrieved from training data can achieve very strong performance. In fact, according to our experiments, this solution can outperform direct finetuning for very low-data scenarios. Finally, we consider variations of this approach to create a more broadly applicable method to adapt to multiple domains at once. We find that in the case of code generation, a model adapted to multiple domains simultaneously performs on par with those adapted to each domain individually.
Full-text available
A class of predictive densities is derived by weighting the observed samples in maximizing the log-likelihood function. This approach is effective in cases such as sample surveys or design of experiments, where the observed covariate follows a different distribution than that in the whole population. Under misspecification of the parametric model, the optimal choice of the weight function is asymptotically shown to be the ratio of the density function of the covariate in the population to that in the observations. This is the pseudo-maximum likelihood estimation of sample surveys. The optimality is defined by the expected Kullback–Leibler loss, and the optimal weight is obtained by considering the importance sampling identity. Under correct specification of the model, however, the ordinary maximum likelihood estimate (i.e. the uniform weight) is shown to be optimal asymptotically. For moderate sample size, the situation is in between the two extreme cases, and the weight function is selected by minimizing a variant of the information criterion derived as an estimate of the expected loss. The method is also applied to a weighted version of the Bayesian predictive density. Numerical examples as well as Monte-Carlo simulations are shown for polynomial regression. A connection with the robust parametric estimation is discussed.
Full-text available
Moments before the launch of every space vehicle, engineering discipline specialists must make a critical go/no-go decision. The cost of a false positive, allowing a launch in spite of a fault, or a false negative, stopping a potentially successful launch, can be measured in the tens of millions of dollars, not including the cost in morale and other more intangible detriments. The Aerospace Corporation is responsible for providing engineering assessments critical to the go/no-go decision for every Department of Defense space vehicle. These assessments are made by constantly monitoring streaming telemetry data in the hours before launch. We will introduce VizTree, a novel time-series visualization tool to aid the Aerospace analysts who must make these engineering assessments. VizTree was developed at the University of California, Riverside and is unique in that the same tool is used for mining archival data and monitoring incoming live telemetry. The use of a single tool for both aspects of the task allows a natural and intuitive transfer of mined knowledge to the monitoring task. Our visualization approach works by transforming the time series into a symbolic representation, and encoding the data in a modified suffix tree in which the frequency and other properties of patterns are mapped onto colors and other visual properties. We demonstrate the utility of our system by comparing it with state-of-the-art batch algorithms on several real and synthetic datasets.
Conference Paper
Full-text available
Context: There are many methods that input static code features and output a predictor for faulty code modules. These data mining methods have hit a "performance ceiling"; i.e., some inherent upper bound on the amount of information offered by, say, static code features when identifying modules which contain faults. Objective: We seek an explanation for this ceiling effect. Per-haps static code features have "limited information content"; i.e. their information can be quickly and completely discovered by even simple learners. Method: An initial literature review documents the ceiling effect in other work. Next, using three sub-sampling techniques (under-, over-, and micro-sampling), we look for the lower useful bound on the number of training instances. Results: Using micro-sampling, we find that as few as 50 in-stances yield as much information as larger training sets. Conclusions: We have found much evidence for the limited in-formation hypothesis. Further progress in learning defect predic-tors may not come from better algorithms. Rather, we need to be improving the information content of the training data, perhaps with case-based reasoning methods.
Full-text available
There exists a large and growing number of proposed estimation methods but little conclusive evidence ranking one method over another. Prior effort estimation studies suffered from “conclusion instability”, where the rankings offered to different methods were not stable across (a)different evaluation criteria; (b)different data sources; or (c)different random selections of that data. This paper reports a study of 158 effort estimation methods on data sets based on COCOMO features. Four “best” methods were detected that were consistently better than the “rest” of the other 154 methods. These rankings of “best” and “rest” methods were stable across (a)three different evaluation criteria applied to (b)multiple data sets from two different sources that were (c)divided into hundreds of randomly selected subsets using four different random seeds. Hence, while there exists no single universal “best” effort estimation method, there appears to exist a small number (four) of most useful methods. This result both complicates and simplifies effort estimation research. The complication is that any future effort estimation analysis should be preceded by a “selection study” that finds the best local estimator. However, the simplification is that such a study need not be labor intensive, at least for COCOMO style data sets. KeywordsCOCOMO-Effort estimation-Data mining-Evaluation
Functional size determines how much functionality software provides by measuring the aggregate amount of its cohesive execution sequences. Alan Albrecht first introduced the concept in 1979. Since he originally described the function point analysis (FPA) method, researchers and practitioners have developed variations of functional size metrics and methods. The authors discuss the conceptual similarities and differences between functional size measurement methods and introduce a model for unification.
This chapter introduces the general learning transfer problem and formulates it in terms of a change of scenario. Standard regression and classification models can be characterized as conditional models. Assuming that the conditional model is true, covariate shift is not an issue. However, if this assumption does not hold, conditional modeling will fail. The chapter then characterizes a number of different cases of dataset shift, including simple covariate shift, prior probability shift, sample selection bias, imbalanced data, domain shift, and source component shift. Each of these situations is cast within the framework of graphical models and a number of approaches to addressing each of these problems are reviewed. The chapter also presents a framework for multiple dataset learning that prompts the possibility of using hierarchical dataset linkage.
Prediction of software defects works well within projects as long as there is a sufficient amount of data available to train any models. However, this is rarely the case for new software projects and for many companies. So far, only a few studies have focused on transferring prediction models from one project to another. In this paper, we study cross-project defect prediction models on a large scale. For 12 real-world applications, we ran 622 cross-project predictions. Our results indicate that cross-project prediction is a serious challenge, i.e., simply using models from projects in the same domain or with the same process does not lead to accurate predictions. To help software engineers choose models wisely, we identified factors that do influence the success of cross-project predictions. We also derived decision trees that can provide early estimates for precision, recall, and accuracy before a prediction is attempted.
Dataset shift is a common problem in predictive modeling that occurs when the joint distribution of inputs and outputs differs between training and test stages. Covariate shift, a particular case of dataset shift, occurs when only the input distribution changes. Dataset shift is present in most practical applications, for reasons ranging from the bias introduced by experimental design to the irreproducibility of the testing conditions at training time. (An example is email spam filtering, which may fail to recognize spam that differs in form from the spam the automatic filter has been built on.) Despite this, and despite the attention given to the apparently similar problems of semi-supervised learning and active learning, dataset shift has received relatively little attention in the machine learning community until recently. This volume offers an overview of current efforts to deal with dataset and covariate shift. The chapters offer a mathematical and philosophical introduction to the problem, place dataset shift in relationship to transfer learning, transduction, local learning, active learning, and semi-supervised learning, provide theoretical views of dataset and covariate shift (including decision theoretic and Bayesian perspectives), and present algorithms for covariate shift. Contributors: Shai Ben-David, Steffen Bickel, Karsten Borgwardt, Michael Brückner, David Corfield, Amir Globerson, Arthur Gretton, Lars Kai Hansen, Matthias Hein, Jiayuan Huang, Takafumi Kanamori, Klaus-Robert Müller, Sam Roweis, Neil Rubens, Tobias Scheffer, Marcel Schmittfull, Bernhard Schölkopf, Hidetoshi Shimodaira, Alex Smola, Amos Storkey, Masashi Sugiyama, Choon Hui Teo. Neural Information Processing series.
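A standard remedy for the covariate shift case described above is importance weighting: reweight each training instance by the density ratio p_test(x)/p_train(x). The one-dimensional sketch below is a minimal illustration, assuming Gaussian density estimates and an invented target y = x; it is not taken from the volume's algorithms.

```python
import math
import random

def gauss_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit_gaussian(xs):
    """Fit mean and standard deviation by plain moment matching."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)

random.seed(0)
# Covariate shift: training inputs centered at x = 0, test inputs at x = 2.
train_x = [random.gauss(0.0, 1.0) for _ in range(2000)]
test_x = [random.gauss(2.0, 1.0) for _ in range(2000)]

mu_tr, sd_tr = fit_gaussian(train_x)
mu_te, sd_te = fit_gaussian(test_x)

# Importance weight per training point: estimated p_test(x) / p_train(x).
weights = [gauss_pdf(x, mu_te, sd_te) / gauss_pdf(x, mu_tr, sd_tr) for x in train_x]

# Toy target y = x, so the test-distribution mean of y is about 2.0.
y_train = train_x
plain = sum(y_train) / len(y_train)                                   # near 0.0
weighted = sum(w * y for w, y in zip(weights, y_train)) / sum(weights)
```

The unweighted training mean sits near 0, while the self-normalized weighted estimate moves toward the test-distribution value of 2, at the cost of higher variance because only the rare training points in the test region carry large weights.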