Conference Paper (PDF available)

Regularities in Learning Defect Predictors

Abstract

Collecting large, consistent data sets of real-world software projects from a single source is problematic. In this study, we show that bug reports need not come from local projects in order to learn defect prediction models. We demonstrate that data imported from different sites can be made suitable for predicting defects at the local site. In addition to our previous work on commercial software, we now explore the open-source domain with two versions of an open-source anti-virus tool (Clam AV) and a subset of bugs in two versions of the GNU gcc compiler, to mark the regularities in learning predictors for a different domain. Our conclusion is that there are surprisingly uniform aspects of software that can be discovered as simple and repeated patterns in local or imported data using just a handful of examples. Keywords: Defect prediction, Code metrics, Software quality, Cross-company
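To make the setting concrete, the following is a minimal sketch of cross-company defect prediction in this spirit: a classifier is trained on static code metrics imported from other projects and applied to the modules of a local project. The file names, metric columns and the use of scikit-learn's Gaussian Naive Bayes are illustrative assumptions, not details taken from the paper.

    # Sketch: train a defect predictor on data imported from other projects
    # and apply it to the local project's modules. CSV files and column
    # names are hypothetical placeholders.
    import pandas as pd
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import confusion_matrix

    imported = pd.read_csv("imported_projects_metrics.csv")  # cross-company data
    local = pd.read_csv("local_project_metrics.csv")         # local project data

    features = [c for c in imported.columns if c != "defective"]
    model = GaussianNB().fit(imported[features], imported["defective"])
    predicted = model.predict(local[features])

    # Probability of detection (pd) and probability of false alarm (pf).
    tn, fp, fn, tp = confusion_matrix(local["defective"], predicted).ravel()
    print("pd =", tp / (tp + fn), "pf =", fp / (fp + tn))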
... In this section, we provide a discussion of cross-project prediction studies, an emerging area with a very limited number of published studies, based on selected studies representing the major research efforts on the topic. We have identified nine empirical studies [9,10,11,12,13,14,15,16,17]. We focus our discussion on these studies rather than providing a general review of the defect prediction literature, which is out of the scope of this paper (we refer the reader to [3] for a systematic review of defect prediction studies in general). ...
... In a follow-up study, Turhan et al. investigate whether the patterns in their previous work [11] are also observed in open-source software and analyze three additional projects [14]. Similar to Zimmermann et al., they find that the patterns they observed earlier are not easily detectable when predicting open-source software defects using proprietary cross-project data. ...
... Data from systems that include the label "NASA" in their names come from NASA aerospace projects, and these were collected as part of the Metrics Data Program (MDP). On the other hand, systems with the label "SOFTLAB" are from a Turkish software company developing embedded controllers for home appliances, and the related data were collected by the authors for an earlier work [14]. The projects in these two groups are all single-version, and the metrics, along with the defect information, are available at the function (method) level. ...
Article
Context: Defect prediction research mostly focuses on optimizing the performance of models constructed for isolated projects (i.e. within-project (WP)) through retrospective analyses. On the other hand, recent studies try to utilize data across projects (i.e. cross-project (CP)) for building defect prediction models for new projects. There are no cases where within- and cross-project (i.e. mixed) data are used together. Objective: Our goal is to investigate the merits of using mixed project data for binary defect prediction. Specifically, we want to check whether it is feasible, in terms of defect detection performance, to use data from other projects (i) when there is an existing within-project history and (ii) when there are limited within-project data. Method: We use data from 73 versions of 41 publicly available projects. We simulate the two above-mentioned cases and compare the performance of naive Bayes classifiers trained on within-project data vs. mixed project data. Results: For the first case, we find that the performance of mixed project predictors significantly improves over full within-project predictors (p-value < 0.001), although the effect size is small (Hedges' g = 0.25). For the second case, we find that mixed project predictors are comparable to full within-project predictors while using only 10% of the available within-project data (p-value = 0.002, g = 0.17). Conclusion: We conclude that the extra effort associated with collecting data from other projects does not pay off in terms of practical performance improvement when there is already an established within-project defect predictor built on the full project history. However, when there is limited project history, e.g. in the early phases of development, mixed project predictions are justifiable as they perform as well as full within-project models.
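A minimal sketch of the second scenario above (limited local history augmented with cross-project data), assuming pandas DataFrames with a binary "defective" column; the 10% sampling fraction and the naive Bayes learner follow the abstract, everything else is an illustrative assumption.

    # Sketch: mixed-project predictor = a small within-project (WP) sample
    # combined with cross-project (CP) data, learned with naive Bayes.
    import pandas as pd
    from sklearn.naive_bayes import GaussianNB

    def mixed_project_model(wp: pd.DataFrame, cp: pd.DataFrame, wp_fraction: float = 0.1):
        wp_sample = wp.sample(frac=wp_fraction, random_state=0)  # limited local history
        train = pd.concat([wp_sample, cp], ignore_index=True)
        features = [c for c in train.columns if c != "defective"]
        model = GaussianNB().fit(train[features], train["defective"])
        return model, features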
... As a result, 13 additional relevant articles [14,31,32,43,45,61,71,84,88,99,101,108,129] were found with respect to the identified 46 articles. Next, we used "cross company" + "defect prediction" as the search terms, identifying 3 additional relevant articles [104,106,110] with respect to the 59 (=46 + 13) articles. After that, we used "cross project" + "fault prediction" as the search terms, identifying 1 additional relevant article [100] with respect to the 62 (=59 + 3) articles. ...
Article
Background. Recent years have seen an increasing interest in cross-project defect prediction (CPDP), which aims to apply defect prediction models built on source projects to a target project. Currently, a variety of (complex) CPDP models have been proposed with promising prediction performance. Problem. Most, if not all, of the existing CPDP models are not compared against the simple module size models that are easy to implement and have shown good defect prediction performance in the literature. Objective. We aim to investigate how far we have really progressed by comparing the defect prediction performance of the existing CPDP models with that of simple module size models. Method. We first use module size in the target project to build two simple defect prediction models, ManualDown and ManualUp, which do not require any training data from source projects. ManualDown considers a larger module as more defect-prone, while ManualUp considers a smaller module as more defect-prone. Then, we take the following measures to ensure a fair comparison of the defect prediction performance of the existing CPDP models and the simple module size models: using the same publicly available data sets, using the same performance indicators, and using the prediction performance reported in the original cross-project defect prediction studies. Result. The simple module size models have a prediction performance comparable or even superior to most of the existing CPDP models in the literature, including many newly proposed models. Conclusion. The results caution us that, if prediction performance is the goal, the real progress in CPDP is not being achieved as it might have been envisaged. We hence recommend that future studies include ManualDown/ManualUp as baseline models for comparison when developing new CPDP models to predict defects in a complete target project.
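The two baseline models are simple enough to sketch in a few lines. The sketch below assumes each target-project module carries a lines-of-code column named "loc", which is an illustrative choice rather than a detail from the abstract.

    # Sketch of the ManualDown / ManualUp baselines: rank the target project's
    # modules purely by size, with no training data from source projects.
    import pandas as pd

    def manual_down(target: pd.DataFrame) -> pd.DataFrame:
        # Larger modules are treated as more defect-prone.
        return target.sort_values("loc", ascending=False)

    def manual_up(target: pd.DataFrame) -> pd.DataFrame:
        # Smaller modules are treated as more defect-prone.
        return target.sort_values("loc", ascending=True)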
... Turhan et al. [39] investigated two open-source systems, two versions of Clam AV and a subset of the defect data of the GNU gcc compiler, to study the impact of using external data to learn defect predictors that are then used to make predictions on the local data, with naive Bayes as the classifier. The reported results show that the ability of such models to detect defective modules increases. ...
... An interesting pattern in their predictions is that open-source projects are good predictors of closed-source projects; however, open-source projects cannot be predicted by any other projects. In a follow-up study, Turhan et al. investigated whether the patterns in their previous work [12] are also observed in open-source software and analyzed three additional projects [14]. Similar to Zimmermann et al., they found that the patterns they observed earlier are not easily detectable when predicting open-source software defects using proprietary cross-project data. ...
Conference Paper
Defect prediction research mostly focuses on optimizing the performance of models that are constructed for isolated projects. On the other hand, recent studies try to utilize data across projects for building defect prediction models. We combine both approaches and investigate the effects of using mixed (i.e. within and cross) project data on defect prediction performance, which has not been addressed in previous studies. We conduct experiments to analyze models learned from mixed project data using ten proprietary projects from two different organizations. We observe that code-metric-based mixed project models yield only minor improvements in prediction performance, for a limited number of cases that are difficult to characterize. Based on existing studies and our results, we conclude that using cross-project data for defect prediction is still an open challenge that should only be considered in environments where there is no local data collection activity, and that using data from other projects in addition to a project's own data does not pay off in terms of performance.
Article
Predicting the changes in the next release of software during the early phases of development is gaining wide importance. Such a prediction helps in allocating resources appropriately and thus reduces the costs associated with software maintenance. However, predicting changes using the historical data (data of past releases) of the software is not always possible due to the unavailability of such data. It would therefore be highly advantageous if we could train the model using data from other projects rather than the same project. In this paper, we have performed cross-project predictions using 12 datasets obtained from three open-source Apache projects: Abdera, POI and Rave. In this study, cross-project predictions include both inter-project (different projects) and inter-version (different versions of the same project) predictions. For cross-project predictions, we investigated whether the characteristics of the datasets are valuable for selecting the training set for a known testing set. We concluded that cross-project predictions give high accuracy and that the distributional characteristics of the datasets are extremely useful for selecting the appropriate training set. Within cross-project predictions, we also examined the accuracy of inter-version predictions.
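One way to operationalize selecting a training set by distributional characteristics, sketched below under the assumption that all candidate training sets and the testing set share the same metric columns, is to pick the candidate whose per-metric means lie closest to those of the testing set; the distance measure here is an illustrative assumption, not necessarily the one used in the study.

    # Sketch: choose the candidate training set whose metric distribution is
    # closest to the (unlabeled) testing set, using the Euclidean distance
    # between per-metric means as a simple distributional characteristic.
    import numpy as np
    import pandas as pd

    def select_training_set(candidates, test, features):
        """candidates: dict name -> DataFrame; test: DataFrame; features: list of columns."""
        test_profile = test[features].mean().to_numpy()
        distances = {
            name: float(np.linalg.norm(df[features].mean().to_numpy() - test_profile))
            for name, df in candidates.items()
        }
        return min(distances, key=distances.get)  # name of the closest candidate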
Thesis
Defect Prediction Models aim at identifying error-prone modules of a software system to guide quality assurance activities, for example tests or code reviews. Due to their potential cost savings, such models have been actively researched for more than a decade, resulting in over 100 published research papers. Additionally, defect prediction models are often used for the empirical validation of software metrics and research hypotheses. Despite the large body of existing research, the evaluation of defect prediction models has received only little attention. This is underlined by the large number of, sometimes only slightly differing, evaluation approaches used by researchers. This thesis systematically investigates advantages and drawbacks of experimental design decisions and proposes guidelines for adequate evaluation procedures for defect prediction models. First, different evaluation approaches are identified and summarized in a literature survey. Afterwards, we investigate the most common methods in detail. By using publicly available data sets, we demonstrate that different evaluation procedures have advantages and drawbacks, and may lead to different results. Additionally, we show that very simple models, based only on the size of modules, are able to achieve surprisingly good performance. The reason for this is an implicit assumption underlying almost all evaluation approaches, namely that the treatment costs for additional quality assurance activities are distributed uniformly across modules. We introduce the notion of effort-awareness that takes non-uniform treatment costs into account. Most models that perform well according to a uniform cost assumption are not cost-effective according to effort-aware performance measures. The performance can be increased significantly by using effort-aware predictions, both from a practical and from a statistical perspective. Whether effort-aware models based only on static code metrics are cost-effective in practice remains questionable, as shown in a case study. In summary, our experiments show that the most commonly used evaluation procedures often lead to overly optimistic performance estimates and will not lead to cost-effective defect prediction models for many practical usage scenarios. The guidelines derived from these experiments, and in particular the notion of effort-aware prediction and evaluation, can help to build defect prediction models usable in practice.
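The notion of effort-awareness mentioned above can be illustrated with a simple cost-effectiveness style measure: inspect modules in decreasing order of predicted defect-proneness per line of code and report the fraction of defects caught within a fixed inspection budget. The 20% budget and the column names below are illustrative assumptions.

    # Sketch of an effort-aware measure: rank modules by predicted score per LOC
    # and count the share of defects found within a budget of 20% of total LOC.
    import pandas as pd

    def defects_found_at_effort(df: pd.DataFrame, budget: float = 0.2) -> float:
        """df needs columns: 'score' (predicted defect-proneness), 'loc', 'defective' (0/1)."""
        ranked = df.assign(density=df["score"] / df["loc"]).sort_values("density", ascending=False)
        effort = ranked["loc"].cumsum() / ranked["loc"].sum()
        within_budget = ranked[effort <= budget]
        return within_budget["defective"].sum() / df["defective"].sum()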
Article
Reliably predicting software defects is one of the holy grails of software engineering. Researchers have devised and implemented a plethora of defect/bug prediction approaches varying in terms of accuracy, complexity and the input data they require. However, the absence of an established benchmark makes it hard, if not impossible, to compare approaches. We present a benchmark for defect prediction, in the form of a publicly available dataset consisting of several software systems, and provide an extensive comparison of well-known bug prediction approaches, together with novel approaches we devised. We evaluate the performance of the approaches using different performance indicators: classification of entities as defect-prone or not, ranking of the entities, with and without taking into account the effort to review an entity. We performed three sets of experiments aimed at (1) comparing the approaches across different systems, (2) testing whether the differences in performance are statistically significant, and (3) investigating the stability of approaches across different learners. Our results indicate that, while some approaches perform better than others in a statistically significant manner, external validity in defect prediction is still an open problem, as generalizing results to different contexts/learners proved to be a partially unsuccessful endeavor. Keywords: Defect prediction, Source code metrics, Change metrics
Article
Background: The accurate prediction of where faults are likely to occur in code can help direct test effort, reduce costs and improve the quality of software. Objective: We investigate how the context of models, the independent variables used and the modelling techniques applied influence the performance of fault prediction models. Method: We used a systematic literature review to identify 208 fault prediction studies published from January 2000 to December 2010. We synthesise the quantitative and qualitative results of 36 studies which report sufficient contextual and methodological information according to the criteria we develop and apply. Results: The models that perform well tend to be based on simple modelling techniques such as Naïve Bayes or Logistic Regression. Combinations of independent variables have been used by models that perform well. Feature selection has been applied to these combinations when models are performing particularly well. Conclusion: The methodology used to build models seems to influence predictive performance. Although there is a set of fault prediction studies in which confidence is possible, more studies are needed that use a reliable methodology and which report their context, methodology and performance comprehensively.
Conference Paper
Background: The prediction performance of a case-based reasoning (CBR) model is influenced by the combination of the following parameters: (i) similarity function, (ii) number of nearest neighbor cases, (iii) weighting technique used for attributes, and (iv) solution algorithm. Each combination of the above parameters is considered an instantiation of the general CBR-based prediction method. The selection of an instantiation for a new data set with specific characteristics (such as size, defect density and language) is called customization of the general CBR method. Aims: For the purpose of defect prediction, we address the question of which combination of parameters works best in which situation. Three more specific questions were studied: (RQ1) Does one size fit all? Is one instantiation always the best? (RQ2) If not, which individual and combined parameter settings occur most frequently in generating the best prediction results? (RQ3) Are there context-specific rules to support the customization? Method: In total, 120 different CBR instantiations were created and applied to 11 data sets from the PROMISE repository. Predictions were evaluated in terms of their mean magnitude of relative error (MMRE) and the percentage Pred(α) of objects fulfilling a prediction quality level α. For the third research question, dependency network analysis was performed. Results: The most frequent parameter options for CBR instantiations were neural-network-based sensitivity analysis (as the weighting technique), the unweighted average (as the solution algorithm), and the maximum number of nearest neighbors (as the number of nearest neighbors). Using dependency network analysis, a set of recommendations for customization was provided. Conclusion: An approach to support customization is provided. It was confirmed that applying context-specific rules across groups of similar data sets is risky and produces poor results.
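To make the parameter space concrete, the sketch below spells out one possible CBR instantiation (Euclidean similarity, k nearest neighbours, equal attribute weights, unweighted average as the solution algorithm) together with the MMRE and Pred(α) measures mentioned in the abstract; it illustrates one combination only and is not the configuration recommended by the study.

    # Sketch of one CBR instantiation: Euclidean similarity, k nearest neighbours,
    # equal attribute weights, unweighted average as the solution algorithm,
    # evaluated with MMRE and Pred(alpha).
    import numpy as np

    def cbr_predict(train_x: np.ndarray, train_y: np.ndarray, query_x: np.ndarray, k: int = 3) -> float:
        distances = np.linalg.norm(train_x - query_x, axis=1)
        nearest = np.argsort(distances)[:k]
        return float(train_y[nearest].mean())    # unweighted average of the k analogues

    def mmre(actual: np.ndarray, predicted: np.ndarray) -> float:
        return float(np.mean(np.abs(actual - predicted) / actual))

    def pred(actual: np.ndarray, predicted: np.ndarray, alpha: float = 0.25) -> float:
        mre = np.abs(actual - predicted) / actual
        return float(np.mean(mre <= alpha))      # fraction of predictions within alpha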
Conference Paper
Predicting the fault-proneness of program modules when fault labels for the modules are unavailable is a practical problem frequently encountered in the software industry. Because fault data from previous software versions are not available, supervised learning approaches cannot be applied, leading to the need for new methods, tools, or techniques. In this study, we propose a clustering and metrics-thresholds based software fault prediction approach for this challenging problem and explore it on three datasets collected from a Turkish white-goods manufacturer developing embedded controller software. Experiments reveal that unsupervised software fault prediction can be automated and that reasonable results can be produced with techniques based on metrics thresholds and clustering. The results of this study demonstrate the effectiveness of metrics thresholds and show that the standalone application of metrics thresholds (one-stage) is currently easier than the clustering and metrics-thresholds based (two-stage) approach, because the number of clusters is selected heuristically in the clustering-based method.
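A minimal sketch of the two-stage idea (cluster the modules first, then flag whole clusters whose mean metric values exceed given thresholds); the use of k-means and the cluster-level decision rule are illustrative assumptions and do not reproduce the exact procedure of the study.

    # Sketch: unsupervised fault prediction via clustering plus metric thresholds.
    # Modules in clusters whose mean metrics exceed the thresholds on at least
    # one metric are flagged as fault-prone.
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_threshold_predict(metrics: np.ndarray, thresholds: np.ndarray, n_clusters: int = 3) -> np.ndarray:
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(metrics)
        fault_prone = np.zeros(len(metrics), dtype=bool)
        for c in range(n_clusters):
            centre = metrics[labels == c].mean(axis=0)
            if np.any(centre > thresholds):      # cluster looks risky on some metric
                fault_prone[labels == c] = True
        return fault_prone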
Conference Paper
This paper describes methods for automatically analyzing formal, state-based requirements specifications for completeness and consistency. The approach uses a low-level functional formalism, simplifying the analysis process. State space explosion problems are eliminated by applying the analysis at a high level of abstraction; i.e., instead of generating a reachability graph for analysis, the analysis is performed directly on the model. The method scales up to large systems by decomposing the specification into smaller, analyzable parts and then using functional composition rules to ensure that verified properties hold for the entire specification. The analysis algorithms and tools have been validated on TCAS II, a complex, airborne collision-avoidance system required on all commercial aircraft with more than 30 passengers that fly in U.S. airspace.
Article
This article examines the metrics of the software science model, cyclomatic complexity, and an information flow metric of Henry and Kafura. These were selected on the basis of their popularity within the software engineering literature and the significance of the claims made by their progenitors. Claimed benefits are summarized. Each metric is then subjected to an in-depth critique. All are found wanting. We maintain that this is not due to mischance, but indicates deeper problems of methodology used in the field of software metrics. We conclude by summarizing these problems.
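For reference, the metrics under critique are commonly defined as follows (standard textbook formulations, not quoted from the article itself):

    % Halstead volume (software science), McCabe cyclomatic complexity,
    % and the Henry-Kafura information flow metric (standard formulations).
    V = N \log_2 \eta, \quad N = N_1 + N_2, \; \eta = \eta_1 + \eta_2   % operator/operand occurrences and distinct counts
    V(G) = E - N + 2P   % E edges, N nodes, P connected components of the control-flow graph
    C_{HK} = \mathrm{length} \times (\mathrm{fan\text{-}in} \times \mathrm{fan\text{-}out})^2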
Article
A summary is presented of the current state of the art and recent trends in software engineering economics. It provides an overview of economic analysis techniques and their applicability to software engineering and management. It surveys the field of software cost estimation, including the major estimation techniques available, the state of the art in algorithmic cost models, and the outstanding research issues in software cost estimation.
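As one concrete example of the algorithmic cost models the survey covers, the basic COCOMO effort equation has the form below; the coefficient values shown are the standard organic-mode constants and are given for illustration only, not taken from this article.

    % Basic COCOMO effort estimate in person-months from size in KLOC;
    % for organic-mode projects a = 2.4 and b = 1.05 (standard values).
    E = a \cdot (\mathrm{KLOC})^{b}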
Article
This article describes a formal analysis technique, called consistency checking, for automatic detection of errors, such as type errors, nondeterminism, missing cases, and circular definitions, in requirements specifications. The technique is designed to analyze requirements specifications expressed in the SCR (Software Cost Reduction) tabular notation. As background, the SCR approach to specifying requirements is reviewed. To provide a formal semantics for the SCR notation and a foundation for consistency checking, a formal requirements model is introduced; the model represents a software system as a finite-state automaton, which produces externally visible outputs in response to changes in monitored environmental quantities. Results of two experiments are presented which evaluated the utility and scalability of our technique for consistency checking in a real-world avionics application. The role of consistency checking during the requirements phase of software development is discussed.
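The two error classes at the core of consistency checking, nondeterminism and missing cases, can be illustrated with a small sketch that enumerates a finite state space and checks which table rows are enabled; the table encoding below is an illustrative assumption, not the SCR toolset's actual representation.

    # Sketch: detect nondeterminism (two rows enabled for the same state) and
    # missing cases (no row enabled) in a small condition table over finite domains.
    from itertools import product

    def check_table(rows, variables):
        """rows: list of predicates over a state dict; variables: dict name -> finite domain."""
        names = list(variables)
        for values in product(*(variables[n] for n in names)):
            state = dict(zip(names, values))
            enabled = [i for i, row in enumerate(rows) if row(state)]
            if len(enabled) > 1:
                print("nondeterminism at", state, "rows", enabled)
            elif not enabled:
                print("missing case at", state)

    # Example: a two-row table over two booleans that forgets the case x and y.
    check_table(
        rows=[lambda s: s["x"] and not s["y"], lambda s: not s["x"]],
        variables={"x": [False, True], "y": [False, True]},
    )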
Article
Substantial net improvements in programming quality and productivity have been obtained through the use of formal inspections of design and of code. Improvements are made possible by a systematic and efficient design and code verification process, with well-defined roles for inspection participants. The manner in which inspection data is categorized and made suitable for process analysis is an important factor in attaining the improvements. It is shown that by using inspection results, a mechanism for initial error reduction followed by ever-improving error rates can be achieved.
Article
OBJECTIVE - The objective of this paper is to determine under what circumstances individual organisations would be able to rely on cross-company-based estimation models. METHOD - We performed a systematic review of studies that compared predictions from cross-company models with predictions from within-company models based on analysis of project data. RESULTS - Ten papers compared cross-company and within-company estimation models; however, only seven of the papers presented independent results. Of those seven, three found that cross-company models were as good as within-company models, and four found cross-company models were significantly worse than within-company models. Experimental procedures used by the studies differed, making it impossible to undertake a formal meta-analysis of the results. The main trend distinguishing study results was that studies with small single-company data sets (i.e.