Article · PDF available

Abstract and Figures

As the application layer in embedded systems comes to dominate the hardware, ensuring software quality becomes a real challenge. Software testing is the most time-consuming and costly project phase, particularly in the embedded software domain. Misclassifying safe code as defective increases project cost and hence lowers margins. In this research, we present a defect prediction model based on an ensemble of classifiers. We have collaborated with an industrial partner from the embedded systems domain, and we use our generic defect prediction models with data coming from embedded projects. The embedded systems domain is similar to mission-critical software in that the goal is to catch as many defects as possible; the expectation from a predictor is therefore a very high probability of detection (pd). On the other hand, most embedded systems in practice are commercial products, and companies would like to lower their costs to remain competitive in their market by keeping their false alarm (pf) rates as low as possible and improving their precision rates. In our experiments, we used data collected from our industry partners as well as publicly available data. Our results reveal that the ensemble of classifiers significantly decreases pf down to 15% while increasing precision by 43%, keeping balance rates at 74%. The cost-benefit analysis of the proposed model shows that it is enough to inspect 23% of the code on local datasets to detect around 70% of the defects.
Keywords: Defect prediction – Ensemble of classifiers – Static code attributes – Embedded software
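Under the definitions standard in this literature, the pd, pf, precision, and balance rates the abstract reports all derive from confusion-matrix counts. The sketch below is our own illustrative helper (not the paper's code), with counts chosen to roughly echo the abstract's figures:

```python
import math

def prediction_metrics(tp, fn, fp, tn):
    """Defect-prediction metrics from confusion-matrix counts.

    tp: defective modules flagged defective   fn: defective modules missed
    fp: safe modules flagged defective        tn: safe modules passed
    """
    pd = tp / (tp + fn)          # probability of detection (recall)
    pf = fp / (fp + tn)          # probability of false alarm
    precision = tp / (tp + fp)   # fraction of flagged modules truly defective
    # balance: normalized Euclidean distance from the ideal point (pd=1, pf=0)
    balance = 1 - math.sqrt(((0 - pf) ** 2 + (1 - pd) ** 2) / 2)
    return pd, pf, precision, balance

# Illustrative counts: 70 of 100 defects caught, 15 false alarms among 100 safe modules.
pd, pf, precision, balance = prediction_metrics(tp=70, fn=30, fp=15, tn=85)
```

With these counts, pd = 0.70 and pf = 0.15, and balance comes out near 0.76, showing how a low pf and a high pd jointly drive the balance rate.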
... Recently, some ensemble models were proposed in the literature (Misirli et al., 2011; Peng et al., 2011; Laradji et al., 2015; Petrić et al., 2016; Tong et al., 2018; Malhotra and Jain, 2020) for WPDP. However, these models depend either on the single-inducer system to generate single-inducer ensemble classifiers (Misirli et al., 2011), or they are built using traditional sequential or parallel ensemble classifiers (Peng et al., 2011; Laradji et al., 2015; Petrić et al., 2016; Tong et al., 2018; Malhotra and Jain, 2020), which involve fewer decisions and result in reduced classification performance (Opitz and Maclin, 1999; Zhou, 2012). In addition, the empirical evaluations conducted in the above models (Laradji et al., 2015; Misirli et al., 2011; Peng et al., 2011; Petrić et al., 2016; Tong et al., 2018; Malhotra and Jain, 2020) are limited to a few projects. ...
Article
Full-text available
Predicting the defect-proneness of a module can reduce the time, effort, manpower, and consequently the cost of developing a software project. Since the causes of software defects are difficult to identify, a wide range of machine learning models are still being developed to build high-performing prediction systems. For this reason, a hybrid approach called diverse ensemble learning technique (DELT), which adopts two diversity-generation schemes, bootstrap aggregation and multi-inducer concepts, is proposed for the within-project defect prediction (WPDP) problem in order to mitigate the low classification rates of the prediction model. To predict the final class label for any unlabeled test module, the proposed DELT employs the principle of majority voting. An extensive set of experiments is conducted on 43 publicly available PROMISE and NASA datasets. The experimental results are promising, as the approach improves generalization performance in classifying the defect-proneness of software modules.
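The two diversity schemes described above can be illustrated in scikit-learn (this is our own sketch on synthetic data, not the authors' DELT implementation; the base learners and parameters are assumptions): each heterogeneous inducer is bagged over bootstrap samples, and an outer hard vote takes the majority label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Diversity scheme 1 (bagging): each inducer is trained on bootstrap samples.
# Diversity scheme 2 (multi-inducer): the inducers are heterogeneous learners.
ensemble = VotingClassifier(
    estimators=[
        ("nb", BaggingClassifier(GaussianNB(), n_estimators=10, random_state=0)),
        ("dt", BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)),
        ("lr", BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=10, random_state=0)),
    ],
    voting="hard",  # majority vote on predicted class labels
)
ensemble.fit(X[:300], y[:300])
accuracy = ensemble.score(X[300:], y[300:])
```

The hard vote is the "majority voting" principle: the final label for a test module is the class predicted by most constituent models.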
... Base classifiers, by contrast, reach an overall average F-score ranging between 73% (simple model) and 83% (Advanced model-2) for the PROMISE dataset, and ROC(AUC) between 60% (Advanced model-4) and 79% (Advanced model-2). Thus, we can say that the ensemble design combines the strengths of multiple predictors and adds to the state of the art in the fault prediction problem [35,72]. ...
... Adaptive Boosting (AdaBoost) is a well-known boosting technique. Bootstrap aggregating (Bagging) [29] is a bootstrap method proposed by Breiman [72] that repeatedly draws training samples from the training set by sampling with replacement. It allocates equal weight to the resulting models, thereby reducing the variance associated with classification, which in turn improves the classification process. ...
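The sampling-with-replacement and equal-weight voting steps just described can be sketched in a few lines of NumPy (our own toy code; `fit_stump` is a hypothetical base learner we introduce for illustration, not something from the excerpt):

```python
import numpy as np

def bagging_predict(train_X, train_y, test_X, fit_fn, n_models=25, seed=0):
    """Bagging sketch: each model is fit on a bootstrap sample (drawn WITH
    replacement, same size as the original set); predictions are combined
    by an equal-weight majority vote, which reduces variance."""
    rng = np.random.default_rng(seed)
    n = len(train_X)
    votes = np.zeros(len(test_X))
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)        # bootstrap: sample with replacement
        model = fit_fn(train_X[idx], train_y[idx])
        votes += model(test_X)                  # each model casts a 0/1 vote
    return (votes / n_models >= 0.5).astype(int)  # equal-weight majority

def fit_stump(X, y):
    """Toy base learner: a decision stump on feature 0."""
    if (y == 0).all() or (y == 1).all():        # degenerate bootstrap sample
        c = int(y[0])
        return lambda Z: np.full(len(Z), c)
    t = (X[y == 0, 0].mean() + X[y == 1, 0].mean()) / 2
    return lambda Z: (Z[:, 0] > t).astype(int)

X = np.array([[0.1], [0.2], [0.3], [1.1], [1.2], [1.3]])
y = np.array([0, 0, 0, 1, 1, 1])
preds = bagging_predict(X, y, X, fit_stump)
```

Because every model gets one equal-weight vote, individual stumps fit on unlucky bootstrap samples are outvoted by the majority.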
Article
Background: Fault prediction is a key problem in the software engineering domain. In recent years, there has been increasing interest in exploiting machine learning techniques to make informed decisions that improve software quality based on available data. Aim: The study aims to build and examine the predictive capability of advanced fault prediction models based on product and process metrics by using machine learning classifiers and ensemble design. Method: The authors developed a methodological framework consisting of three phases: (i) metrics identification, (ii) experimentation using base ML classifiers and ensemble design, and (iii) evaluation of performance and cost sensitiveness. The study has been conducted on 32 projects from the PROMISE, BUG, and JIRA repositories. Result: The results show that advanced fault prediction models built using ensemble methods achieve an overall median F-score ranging between 76.50% and 87.34% and ROC(AUC) between 77.09% and 84.05%, with better predictive capability and cost sensitiveness. Non-parametric tests have also been applied to test the statistical significance of the classifiers. Conclusion: The proposed advanced models have performed impressively well for inter-project fault prediction on projects from the PROMISE, BUG, and JIRA repositories.
... As noted in Section 1, our focus is to improve the accurate polarity classification probability of a given unit by analyzing the polarity labels of the units from the existing SE-specific tools. This concept is similar to the bagging or boosting combination of classifiers in machine learning [39,53,75,78]. However, for the combination to perform better than the stand-alone SE-specific classifier, we first need to confirm whether another tool can offer correct polarity classification for a given unit when a tool is wrong on that unit. ...
... Following standard principles of machine learning, we hyper-tuned each algorithm on the six benchmark datasets under different configurations and picked the configuration that offered the best performance. We use the Python scikit-learn GridSearchCV function for our hyper-tuning, which takes a list of configuration parameters. Random Forest is by design an ensemble algorithm, and it has been found to be a good predictor in other SE-specific tasks (e.g., defect prediction [39]). We name the best-performing Random Forest (RF) model in our analysis 'Sentisead'. ...
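The hyper-tuning step described in the excerpt follows standard scikit-learn usage; below is a minimal sketch with an illustrative parameter grid and synthetic data (the study's actual grids, features, and datasets are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# GridSearchCV exhaustively cross-validates every combination in param_grid
# and keeps the configuration with the best score.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    scoring="f1",   # select the configuration with the best F1-score
    cv=3,
)
grid.fit(X, y)
best_model = grid.best_estimator_   # refit on the full data with the best params
```

`grid.best_params_` then records which configuration won, and `best_model` is ready for prediction.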
Preprint
Full-text available
Sentiment analysis in software engineering (SE) has shown promise to analyze and support diverse development activities. We report the results of an empirical study that we conducted to determine the feasibility of developing an ensemble engine by combining the polarity labels of stand-alone SE-specific sentiment detectors. Our study has two phases. In the first phase, we pick five SE-specific sentiment detection tools from two recently published papers by Lin et al. [31, 32], who first reported negative results with standalone sentiment detectors and then proposed an improved SE-specific sentiment detector, POME [31]. We report the study results on 17,581 units (sentences/documents) coming from six currently available sentiment benchmarks for SE. We find that the existing tools can be complementary to each other in 85-95% of the cases, i.e., one is wrong, but another is right. However, a majority voting-based ensemble of those tools fails to improve the accuracy of sentiment detection. We develop Sentisead, a supervised tool by combining the polarity labels and bag of words as features. Sentisead improves the performance (F1-score) of the individual tools by 4% (over Senti4SD [5]) - 100% (over POME [31]). In a second phase, we compare and improve Sentisead infrastructure using Pre-trained Transformer Models (PTMs). We find that a Sentisead infrastructure with RoBERTa as the ensemble of the five stand-alone rule-based and shallow learning SE-specific tools from Lin et al. [31, 32] offers the best F1-score of 0.805 across the six datasets, while a stand-alone RoBERTa shows an F1-score of 0.801.
... Most of the existing studies [11][12][13][14][15][16] have applied ensemble methods for within-project software defect prediction, not for cross-project software defect prediction. Only a few studies have focused on cross-project software defect prediction and applied ensemble methods with Naïve Bayes (NB) [17], Support Vector Machine (SVM) [18], or Decision Tree (DT) and NB as base learners [19]. ...
... As base classifiers, they used Multilayer Perceptron (MLP), Radial Basis Function (RBF) network, Bayesian Belief Network (BBN), NB, SVM, and DT. Misirli et al. [16] presented an ensemble method that combines NB, ANN, and voting feature intervals for locating software defects. ...
... Thus, more studies should be conducted which provide comparisons of ET amongst each other and with different non-ensemble techniques for varied applications. Moreover, apart from AUC, other stable performance measures such as MCC [90] or Balance [63] should be widely used by researchers to report and compare the results of SDP/SCP models developed using various ET. Different ET that address the class imbalance issue may also be compared based on "cost-effectiveness" to provide a comprehensive picture to other researchers and software practitioners. ...
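MCC, one of the stable performance measures the excerpt recommends, is computed directly from confusion-matrix counts; the helper below is our own illustration, not code from the cited work:

```python
import math

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient (MCC) from confusion-matrix counts.
    Ranges from -1 (total disagreement) to +1 (perfect prediction); unlike
    accuracy, it stays informative when the classes are imbalanced."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

perfect = mcc(tp=50, fn=0, fp=0, tn=50)   # a flawless predictor scores 1.0
typical = mcc(tp=70, fn=30, fp=15, tn=85)
```

Because MCC uses all four cells of the confusion matrix, a model that merely predicts the majority class scores near zero rather than near its (misleadingly high) accuracy.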
Article
Background: The use of ensemble techniques has steadily gained popularity in several software quality assurance activities. These aggregated classifiers have proven to be superior to their constituent base models. Though ensemble techniques have been widely used in key areas such as Software Defect Prediction (SDP) and Software Change Prediction (SCP), the current state of the art concerning the use of these techniques needs scrutiny. Aim: The study aims to assess, evaluate, and uncover possible research gaps with respect to the use of ensemble techniques in SDP and SCP. Method: This study conducts an extensive literature review of 77 primary studies on the basis of the category, application, rules of formulation, performance, and possible threats of the proposed/utilized ensemble techniques. Results: Ensemble techniques were primarily categorized on the basis of similarity, aggregation, relationship, diversity, and dependency of their base models. They were also found effective in several applications, such as serving as the learning algorithm for developing SDP/SCP models and addressing the class imbalance issue. Conclusion: The results of the review ascertain the need for more studies to propose, assess, validate, and compare various categories of ensemble techniques for diverse applications in SDP/SCP, such as transfer learning and online learning.
... That software defect prediction model construction can benefit from directly optimizing the model performance measure for the ranking task has been discussed in [14]. Several classification techniques have been used for predicting defects in software, such as SVM, decision trees, random forest, naïve Bayes, and ANN. The SVM classification technique was used for defect prediction in [15][16], decision tree algorithms were employed for the same effort in [17][18][19], naïve Bayes in [20][21][22], and ANN algorithms in [23][24]. ...
Article
Full-text available
Software engineering is the discipline used to analyze software behavior. SDP relies on software metrics and their attributes, such as lines of code. The main goals of a software defect prediction model include ordering new software modules based on their defect-proneness and classifying them as defect-prone or not. The main purpose of SDP for ranking is to predict which modules have the most defects, in order to guide software quality enhancement. The goal of SDP for the ranking task is to predict the relative number of defects, although estimating the precise number of defects per module is preferable to estimating module ranks, because the precise number of defects gives more information than the ranks. Previous work applied a software defect prediction technique based on ANN. In this research work, a KNN-based technique is applied for software defect prediction. The analysis shows that the proposed technique achieves higher accuracy and lower execution time than the existing ANN technique.
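A minimal KNN-based defect classifier of the kind the abstract describes might look as follows in scikit-learn (our own sketch on synthetic data; the paper's datasets, features, and choice of k are not reproduced):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for module-level metric data with defect labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# KNN labels each module by the majority class among its k nearest neighbors
# in metric space; no training beyond storing the data is required.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
accuracy = knn.score(X_te, y_te)
```

Because KNN defers all computation to prediction time, its "training" is essentially free, which is one plausible source of the execution-time advantage the abstract claims.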
... EML combines individual classifiers' predictions by using their strengths and diluting their weaknesses, which improves prediction performance over the individual classifiers. EML has proven effective compared to individual classifiers in problems such as app review classification [21] and software defect prediction [27]. Our method uses three classifiers, Naive Bayes, Random Forest, and Support Vector Machine, as they are simple classifiers to work with textual data [30][21] and provide a probability of an instance belonging to a class, which is indicative of the uncertainty (e.g., for a binary classification problem, the closer the probability of a prediction is to 0.5, the higher the uncertainty of the classifier's prediction) [15]. ...
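The uncertainty notion in the excerpt (class probabilities near 0.5 are the least certain) can be made explicit with a small helper; the function name and scaling below are our own illustration, not the cited method's:

```python
def uncertainty(p_class1):
    """Map a binary class probability to [0, 1]:
    1.0 at p = 0.5 (maximally uncertain), 0.0 at p = 0 or p = 1 (certain)."""
    return 1 - 2 * abs(p_class1 - 0.5)

ambiguous = uncertainty(0.5)    # maximally uncertain prediction
confident = uncertainty(0.95)   # close to 0: a confident prediction
```

In an ensemble like the one described, such a score could flag instances on which a classifier's vote deserves less trust.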
Article
Sentiment analysis in software engineering (SE) has shown promise to analyze and support diverse development activities. Recently, several tools have been proposed to detect sentiments in software artifacts. While the tools improve accuracy over off-the-shelf tools, recent research shows that their performance could still be unsatisfactory. A more accurate sentiment detector for SE can help reduce noise in the analysis of software scenarios where sentiment analysis is required. Recently, combinations, i.e., hybrids, of stand-alone classifiers have been found to offer better performance than the stand-alone classifiers for fault detection. However, we are aware of no such approach for sentiment detection for software artifacts. We report the results of an empirical study that we conducted to determine the feasibility of developing an ensemble engine by combining the polarity labels of stand-alone SE-specific sentiment detectors. Our study has two phases. In the first phase, we pick five SE-specific sentiment detection tools from two recently published papers by Lin et al. [31, 32], who first reported negative results with stand-alone sentiment detectors and then proposed an improved SE-specific sentiment detector, POME [31]. We report the study results on 17,581 units (sentences/documents) coming from six currently available sentiment benchmarks for software engineering. We find that the existing tools can be complementary to each other in 85-95% of the cases, i.e., one is wrong but another is right. However, a majority voting-based ensemble of those tools fails to improve the accuracy of sentiment detection. We develop Sentisead, a supervised tool that combines the polarity labels and bag of words as features. Sentisead improves the performance (F1-score) of the individual tools by 4% (over Senti4SD [5]) – 100% (over POME [31]). The initial development of Sentisead occurred before we observed the use of deep learning models for SE-specific sentiment detection.
In particular, recent papers show the superiority of advanced language-based pre-trained transformer models (PTM) over rule-based and shallow learning models. Consequently, in a second phase, we compare and improve Sentisead infrastructure using the PTMs. We find that a Sentisead infrastructure with RoBERTa as the ensemble of the five stand-alone rule-based and shallow learning SE-specific tools from Lin et al. [31, 32] offers the best F1-score of 0.805 across the six datasets, while a stand-alone RoBERTa shows an F1-score of 0.801.
Chapter
Issue reports related to a software system provide an important source of information. However, an issue report has to go through various phases before it gets fixed. One such phase is issue type assignment, which is currently done manually. Manual issue type assignment is not only time-consuming but also error-prone. Automated systems can help reduce the time and errors in issue type assignment. In this paper, we work on the characterization and prediction of different issue types. We perform a characterization study with respect to three parameters: distribution, mean time to repair, and the top terms present in various issue types. We compare several classic and ensemble machine learning classifiers with respect to accuracy and prediction time. The experimental results on the Apache Lucene project show that the machine-learning-based issue type classification approach is effective, giving a maximum accuracy of 67%. The multinomial Naïve Bayes and ensemble (hard voting) classifiers give the best results.
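A hard-voting text classifier of the kind the chapter reports best results for could be sketched as follows, assuming scikit-learn; the issue titles, labels, and second base learner below are our own toy stand-ins, not the study's actual Apache Lucene data or features:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy issue titles with manually assigned issue types.
titles = ["fix crash on search", "add fuzzy query support",
          "improve docs for analyzer", "fix memory leak in indexer",
          "add highlighting feature", "update javadoc examples"]
labels = ["bug", "feature", "doc", "bug", "feature", "doc"]

# TF-IDF features feed a hard-voting ensemble: each base classifier casts
# one vote, and the majority issue type wins.
clf = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("nb", MultinomialNB()),
                    ("lr", LogisticRegression(max_iter=1000))],
        voting="hard",
    ),
)
clf.fit(titles, labels)
pred = clf.predict(["fix null pointer crash"])
```

On real issue data, the base classifiers and features would of course be chosen and tuned per project, as the chapter does for Apache Lucene.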
Article
Several prediction approaches exist in the arena of software engineering, such as prediction of effort, security, quality, fault, cost, and re-usability. All these prediction approaches are still in a rudimentary phase, and experiments and research are being conducted to build robust models. Software Fault Prediction (SFP) is the process of developing models that software practitioners can use to detect faulty classes/modules before the testing phase. Predicting defective modules before the testing phase helps the software development team leader allocate resources more optimally and reduces the testing effort. In this article, we present a Systematic Literature Review (SLR) of studies from 1990 to June 2019 applying machine learning and statistical methods to software fault prediction. We cite 208 research articles, of which we studied 154 relevant articles. We investigated the competence of machine learning on existing datasets and research projects. To the best of our knowledge, existing SLRs considered only a few parameters of SFP performance, and they only partially examined the various threats and challenges of SFP techniques. In this article, we aggregate those parameters and analyze them accordingly, and we also illustrate the different challenges in the SFP domain. We also compare the performance of machine learning and statistical techniques for SFP models. Our empirical study and analysis demonstrate that machine learning techniques classify classes/modules as fault-prone or not better than classical statistical models. The empirical evidence of our survey reports that machine learning techniques have the capability to identify fault-proneness and are able to produce well-generalized results.
We have also investigated a few challenges in the fault prediction discipline, i.e., quality of data, over-fitting of models, and the class imbalance problem. We have also summarized the 154 articles in tabular form for quick identification.
Article
Full-text available
Advance knowledge of which files in the next release of a large software system are most likely to contain the largest numbers of faults can be a very valuable asset. To accomplish this, a negative binomial regression model has been developed and used to predict the expected number of faults in each file of the next release of a system. The predictions are based on the code of the file in the current release, and fault and modification history of the file from previous releases. The model has been applied to two large industrial systems, one with a history of 17 consecutive quarterly releases over 4 years, and the other with nine releases over 2 years. The predictions were quite accurate: For each release of the two systems, the 20 percent of the files with the highest predicted number of faults contained between 71 percent and 92 percent of the faults that were actually detected, with the overall average being 83 percent. The same model was also used to predict which files of the first system were likely to have the highest fault densities (faults per KLOC). In this case, the 20 percent of the files with the highest predicted fault densities contained an average of 62 percent of the system's detected faults. However, the identified files contained a much smaller percentage of the code mass than the files selected to maximize the numbers of faults. The model was also used to make predictions from a much smaller input set that only contained fault data from integration testing and later. The prediction was again very accurate, identifying files that contained from 71 percent to 93 percent of the faults, with the average being 84 percent. Finally, a highly simplified version of the predictor selected files containing, on average, 73 percent and 74 percent of the faults for the two systems.
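The paper's headline evaluation (rank files by predicted fault count, inspect the top 20 percent, and count the faults those files contain) can be sketched with toy data; the numbers below are ours, purely for illustration, not the paper's:

```python
def faults_in_top_fraction(predicted, actual, fraction=0.20):
    """Fraction of actual faults found in the files with the highest
    predicted fault counts, inspecting only the given fraction of files."""
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    k = max(1, int(len(ranked) * fraction))       # files to inspect
    captured = sum(actual[i] for i in ranked[:k])
    return captured / sum(actual)

# Toy example: 10 files; the predictions roughly track the true fault counts,
# so the top 2 files (20%) hold 20 of the 25 actual faults.
predicted = [9.1, 0.2, 0.4, 7.5, 0.1, 0.3, 0.2, 0.5, 0.1, 0.6]
actual    = [12,  0,   1,   8,   0,   0,   1,   1,   0,   2]
share = faults_in_top_fraction(predicted, actual)
```

The same computation, applied to the negative binomial regression's predictions, yields the 71-92 percent figures the abstract reports.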
Article
Substantial net improvements in programming quality and productivity have been obtained through the use of formal inspections of design and of code. Improvements are made possible by a systematic and efficient design and code verification process, with well-defined roles for inspection participants. The manner in which inspection data is categorized and made suitable for process analysis is an important factor in attaining the improvements. It is shown that by using inspection results, a mechanism for initial error reduction followed by ever-improving error rates can be achieved.
Article
Early identification of fault-prone modules is desirable both from developer and customer perspectives because it supports planning and scheduling activities that facilitate cost avoidance and improved time to market. Large-scale software systems are rarely built from scratch, and usually involve modification and enhancement of existing systems. This suggests that development planning and software quality could greatly be enhanced, because knowledge about product complexity and quality of previous releases can be taken into account when making improvements in subsequent projects. In this article we present results from empirical studies at Ericsson Telecom AB that examine the use of metrics to predict fault-prone modules in successive product releases. The results show that such prediction appears to be possible and has potential to enhance project maintenance.