
DeepLineDP: Towards a Deep Learning Approach for Line-Level Defect Prediction


Abstract

Defect prediction is proposed to help practitioners effectively prioritize limited Software Quality Assurance (SQA) resources on the most risky files that are likely to have post-release software defects. However, prior studies have two main limitations: (1) the granularity levels of defect predictions are still coarse-grained and (2) the surrounding tokens and surrounding lines have not yet been fully utilized. In this paper, we perform a survey study to better understand how practitioners perform code inspection in the modern code review process, and their perception of line-level defect prediction. According to the responses from 36 practitioners, we found that 50% of them spent anywhere from 10 minutes to more than one hour reviewing a single file, while 64% of them still perceived that the code inspection activity is challenging to extremely challenging. In addition, 64% of the respondents perceived that a line-level defect prediction tool would potentially be helpful in identifying defective lines. Motivated by the practitioners' perspective, we present DeepLineDP, a deep learning approach that automatically learns the semantic properties of the surrounding tokens and lines in order to identify defective files and defective lines. Through a case study of 32 releases of 9 software projects, we find that the risk score of code tokens varies greatly depending on their location. Our DeepLineDP is 14%-24% more accurate than other file-level defect prediction approaches; is 50%-250% more cost-effective than other line-level defect prediction approaches; and achieves reasonable performance when transferred to other software projects. These findings confirm that the surrounding tokens and surrounding lines should be considered to identify the fine-grained locations of defective files (i.e., defective lines).
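The abstract describes identifying defective lines from the semantic signal of each line's tokens. The sketch below shows the general shape of attention-based line ranking: given per-token attention weights (as a HAN-style model would produce), a line's risk is aggregated from its tokens, and lines are ranked for inspection. The function name, the example data, and the sum aggregation rule are illustrative assumptions, not DeepLineDP's exact method.

```python
def rank_lines_by_attention(token_attention):
    """token_attention: one list of per-token attention weights per source line.
    Returns (line_number, risk_score) pairs, most risky first."""
    scores = [(line_no, sum(weights))
              for line_no, weights in enumerate(token_attention, 1)]
    # Rank lines from most to least risky for inspection prioritization.
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Two lines of a toy file:
#   line 1: int x = 0 ;
#   line 2: x = x / y ;   (potential divide-by-zero)
token_attention = [[0.01, 0.02, 0.01, 0.01, 0.01],
                   [0.05, 0.02, 0.05, 0.40, 0.35, 0.02]]
ranking = rank_lines_by_attention(token_attention)
print(ranking)  # line 2 is ranked first
```

A reviewer would then inspect lines in ranking order, spending the limited SQA effort on the highest-scoring lines first.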
... Deep learning has been intensively used in a variety of domains, such as natural language processing [19] and image processing. Deep learning is becoming increasingly prevalent in the field of software engineering [17], [20]. In this paper, we focus on building a deep neural network model for predicting defect density. ...
... The DNN is an improved version of the traditional artificial neural network with multiple dense layers. DNN models have recently become very popular due to their excellent ability to learn not only the nonlinear input-output mapping but also the underlying structure of the input data vectors [20]. ...
... Stochastic Gradient Descent is extensively employed throughout the training procedure for weight optimization. The DNN, on the other hand, necessitates tuning of various hyperparameters, such as the number of neurons, hidden layers, and iterations, which might make solving a complex model computationally costly [11], [20], [21]. ...
Article
Delivering a reliable and high-quality software system to a client is a big challenge in the software development and evolution process. One of the software measures that confirm the quality of the system is defect density. Practitioners usually need this measure during the software development process, or during a period of operation, to indicate the reliability of a software system. However, since predicting defect density before testing the modules is time-consuming, managers need to build a prediction model that can help in detecting the defective modules. This process can reduce the testing cost and improve the utilization of testing resources. The most intrinsic feature of software defect datasets is the data sparsity in the defect density, which might bias the final prediction. Therefore, we use deep learning to build defect density prediction models and handle the inherent challenge of data sparsity in defect density. Deep learning has been shown to be effective with sparse data. The constructed model has been evaluated against well-known machine learning methods over 28 public datasets. The obtained results confirmed that the deep learning model is generally more adequate than the other machine learning models over the datasets with high and very high sparsity ratios, and a competitive choice when the sparsity ratio is either medium or low.
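The key data property in this abstract is the sparsity of the defect-density target. One simple way to quantify it (the paper's exact definition is not given here, so this formulation is an assumption) is the fraction of modules whose defect density is zero:

```python
def sparsity_ratio(defect_density):
    """Fraction of modules with zero defect density (higher = sparser target)."""
    return sum(d == 0 for d in defect_density) / len(defect_density)

# Hypothetical defect densities for eight modules; five are defect-free.
print(sparsity_ratio([0, 0, 0.4, 0, 1.2, 0, 0, 0.1]))  # 0.625
```

A dataset where most modules have a ratio like this or higher is the "high sparsity" regime in which the abstract reports deep learning to be most adequate.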
... Similarly, line-level defect prediction has recently received considerable attention from the research community [38,39,60]. For example, Pornprasit and Tantithamthavorn [38] and Wattanakriengkrai et al. [60] proposed a machine learning-based approach with the LIME model-agnostic technique (BoW+RF+LIME) to predict which lines are likely to be defective in the future. ...
... For example, Fu and Tantithamthavorn [18] proposed a GPT-2-based Agile story point estimation approach, leveraging Integrated Gradient attention to interpret the GPT-2 model and understand which words in a JIRA issue report contributed to the estimation of Agile story points. Similarly, Pornprasit and Tantithamthavorn [39] proposed a Hierarchical Attention Network (HAN) architecture for line-level defect prediction, leveraging the attention mechanism of the HAN architecture to understand which code tokens in a source file contributed to the prediction of defective files. However, Jain et al. [21] argued that the learned attention weights are frequently uncorrelated with gradient-based measures of feature importance, while Wiegreffe et al. [61] argued that such attention weights could provide meaningful explanations, depending on the definition and the rigor of the experimental design. ...
... This concept is directly aligned with findings from the AI discipline by Wiegreffe et al. [61]. However, this concept is still novel for line-level vulnerability prediction, since prior works [18,39] only focused on explaining Agile story point estimations and defect predictions, not CodeBERT-based line-level vulnerability predictions. Our results in RQ2 and RQ3 confirm that the use of the self-attention mechanism outperforms other model-agnostic techniques for line-level vulnerability prediction. ...
Conference Paper
Software vulnerabilities are prevalent in software systems, causing a variety of problems including deadlock, information loss, or system failures. Thus, early prediction of software vulnerabilities is critically important in safety-critical software systems. Various ML/DL-based approaches have been proposed to predict vulnerabilities at the file/function/method level. Recently, IVDetect (a graph-based neural network) was proposed to predict vulnerabilities at the function level. Yet, the IVDetect approach is still inaccurate and coarse-grained. In this paper, we propose LineVul, a Transformer-based line-level vulnerability prediction approach, in order to address several limitations of the state-of-the-art IVDetect approach. Through an empirical evaluation of a large-scale real-world dataset with 188k+ C/C++ functions, we show that LineVul achieves (1) 160%-379% higher F1-measure for function-level predictions; (2) 12%-25% higher Top-10 Accuracy for line-level predictions; and (3) 29%-53% less Effort@20%Recall than the baseline approaches, highlighting the significant advancement of LineVul towards more accurate, more cost-effective line-level vulnerability predictions. Our additional analysis also shows that LineVul is highly accurate (75%-100%) in predicting vulnerable functions affected by the Top-25 most dangerous CWEs, highlighting the potential impact of LineVul in real-world usage scenarios.
... Deep learning (DL)-based techniques have gained enormous popularity in the software engineering community. Recent works have used DL in attempts to solve various software engineering problems, such as automated code completion [10,32], code clone detection, software defect prediction [28], and code repair [17,25,33,35]. Depending on the nature of the analysis, researchers employ different representations of the code to feed as input to the model, as discussed in Section 2.3. ...
... Although different automated approaches have been proposed to support code review activities (e.g., review prioritization [25,26], just-in-time defect prediction [17,27,28] and localization [29][30][31][32], reviewer recommendation [33][34][35][36][37], and AI-assisted code review [38,39]), such activities still require manual effort, which can be time-consuming for developers [4,5]. Indeed, prior work reported that reviewers can spend a large amount of time reviewing code (e.g., developers in open-source software projects spend more than six hours per week on average reviewing code [40]). ...
Conference Paper
Code review is a software quality assurance practice, yet it remains time-consuming (e.g., due to slow feedback from reviewers). Recent Neural Machine Translation (NMT)-based code transformation approaches were proposed to automatically generate an approved version of the changed methods in a given submitted patch. The existing approaches can change code tokens in any area of a changed method. However, not all code tokens need to be changed. Intuitively, the changed code tokens in a method should receive more attention than the others, as they are more prone to be defective. In this paper, we present an NMT-based Diff-Aware Code Transformation approach (D-ACT) that leverages token-level change information to enable NMT models to better focus on the changed tokens in a changed method. We evaluate our D-ACT and the baseline approaches based on a time-wise evaluation (which is ignored by the existing work) with 5,758 changed methods. Under the time-wise evaluation scenario, our results show that (1) D-ACT can correctly transform 107-245 changed methods, which is at least 62% higher than the existing approaches; (2) the performance of the existing approaches drops by 57% to 94% under the time-wise evaluation; and (3) D-ACT is improved by 17%-82%, with an average of 29%, when considering the token-level change information. Our results suggest that (1) NMT-based code transformation approaches for code review should be evaluated under the time-wise evaluation; and (2) the token-level change information can substantially improve the performance of NMT-based code transformation approaches for code review.
... Baselines: For this task, we also chose baselines similar to those for the function-level bug localization task, namely CodeBERT, GraphCodeBERT, and PLBART (Ahmad et al., 2021). In addition, we include two additional baselines that have been used to detect vulnerabilities in software engineering: DeepLineDP (Pornprasit and Tantithamthavorn, 2022) and LineVul (Fu and Tantithamthavorn, 2022). They work by simply performing prediction at the function level, then using attention scores from the backbone neural architecture to retrieve line scores and predict vulnerability at the line level. ...
Preprint
Automated software debugging is a crucial task for improving the productivity of software developers. Many neural-based techniques have proven effective for debugging-related tasks such as bug localization and program repair (or bug fixing). However, these techniques often focus on only one of them, or approach them in a stage-wise manner, ignoring the mutual benefits between them. In this work, we propose a novel unified Detect-Localize-Repair framework, named CodeT5-DLR, based on the pretrained programming language model CodeT5, to seamlessly address these tasks. Specifically, we propose three objectives to adapt the generic CodeT5 for debugging: a bug detection objective to determine whether a given code snippet is buggy or not, a bug localization objective to identify the buggy lines, and a program repair objective to translate the buggy code into its fixed version. We evaluate it on each of these tasks and their combined setting on two newly collected line-level debugging datasets in Java and Python. Extensive results show that our model significantly outperforms existing baselines from both the NLP and software engineering domains.
Article
Context: Interpretation has been considered a key factor in applying defect prediction in practice. As the interpretation provided by rule-based interpretable models can offer high-quality insights about past defects, many prior studies have attempted to construct interpretable models for both accurate prediction and comprehensible interpretation. However, class imbalance is usually ignored, which may have a huge negative impact on interpretation. Objective: In this paper, we investigate resampling techniques, a popular solution for dealing with imbalanced data, and their effect on the interpretation of interpretable models. We also investigate the feasibility of constructing interpretable defect prediction models directly on the original data. Further, we propose a rule-based interpretable model that can deal with imbalanced data directly. Method: We conduct an empirical study on 47 publicly available datasets to investigate the impact of resampling techniques on rule-based interpretable models and the feasibility of constructing such models directly on the original data. We also improve the gain function and tolerate lower confidence in rule induction algorithms to deal with imbalanced data. Results: We find that (1) resampling techniques heavily impact interpretable models in terms of both feature importance and model complexity, (2) it is not feasible to construct meaningful interpretable models on original but imbalanced data due to low coverage of defects and poor performance, and (3) our proposed approach is effective in dealing with imbalanced data compared with other rule-based models. Conclusion: Imbalanced data heavily impacts interpretable defect prediction models. Resampling techniques tend to shift the learned concept, while constructing rule-based interpretable models on the original data may also be infeasible. Thus, it is necessary for further studies to construct rule-based models that can deal well with imbalanced data.
Conference Paper
Code review is an effective quality assurance practice, but can be labor-intensive since developers have to manually review the code and provide written feedback. Recently, a Deep Learning (DL)-based approach was introduced to automatically recommend code review comments based on changed methods. While the approach showed promising results, it requires expensive computational resources and time, which limits its use in practice. To address this limitation, we propose CommentFinder, a retrieval-based approach to recommend code review comments. Through an empirical evaluation of 151,019 changed methods, we evaluate the effectiveness and efficiency of CommentFinder against the state-of-the-art approach. We find that when recommending the best-1 review comment candidate, our CommentFinder is 32% better than the prior work at recommending the correct code review comment. In addition, CommentFinder is 49 times faster than the prior work. These findings highlight that CommentFinder could help reviewers reduce their manual effort by recommending code review comments, while requiring less computational time.
Conference Paper
As software vulnerabilities grow in volume and complexity, researchers have proposed various Artificial Intelligence (AI)-based approaches to help under-resourced security analysts find, detect, and localize vulnerabilities. However, security analysts still have to spend a huge amount of effort to manually fix or repair such vulnerable functions. Recent work proposed an NMT-based Automated Vulnerability Repair approach, but it is still far from perfect due to various limitations. In this paper, we propose VulRepair, a T5-based automated software vulnerability repair approach that leverages pre-training and BPE components to address various technical limitations of prior work. Through an extensive experiment with over 8,482 vulnerability fixes from 1,754 real-world software projects, we find that our VulRepair achieves a Perfect Prediction of 44%, which is 13%-21% more accurate than competitive baseline approaches. These results lead us to conclude that our VulRepair is considerably more accurate than the two baseline approaches, highlighting the substantial advancement of NMT-based Automated Vulnerability Repair. Our additional investigation also shows that our VulRepair can accurately repair as many as 745 out of 1,706 real-world well-known vulnerabilities (e.g., Use After Free, Improper Input Validation, OS Command Injection), demonstrating the practicality and significance of our VulRepair for generating vulnerability repairs and helping under-resourced security analysts fix vulnerabilities.
Article
Story point estimation is the task of estimating the overall effort required to fully implement a product backlog item. Various estimation approaches (e.g., Planning Poker, Analogy, and expert judgment) are widely used, yet they are still inaccurate and may be subjective, leading to ineffective sprint planning. Recent work proposed Deep-SE, a deep learning-based Agile story point estimation approach, yet it is still inaccurate, not transferable to other projects, and not interpretable. In this paper, we propose GPT2SP, a Transformer-based Agile story point estimation approach. Our GPT2SP employs a GPT-2 pre-trained language model, allowing our GPT2SP models to better capture the relationships among words while considering the context surrounding a given word and its position in the sequence, to be transferable to other projects, and to be interpretable. Through an extensive evaluation on 23,313 issues that span 16 open-source software projects with 10 existing baseline approaches for within- and cross-project scenarios, our results show that our GPT2SP approach achieves a median MAE of 1.16, which is (1) 34%-57% more accurate than existing baseline approaches for within-project estimations; and (2) 39%-49% more accurate than existing baseline approaches for cross-project estimations. The ablation study also shows that the GPT-2 architecture used in our approach substantially improves Deep-SE by 6%-47%, highlighting the significant advancement of AI for Agile story point estimation. Finally, we develop a proof-of-concept tool to help practitioners better understand the most important words that contributed to the story point estimation of a given issue, with the best supporting examples from past estimates. Our survey study with 16 Agile practitioners shows that story point estimation is perceived as an extremely challenging task.
In addition, our AI-based story point estimation with explanations is perceived as more useful and trustworthy than without explanations, highlighting the practical need for our Explainable AI-based story point estimation approach.
Conference Paper
Just-In-Time (JIT) defect prediction (i.e., an AI/ML model to predict defect-introducing commits) is proposed to help developers prioritize their limited Software Quality Assurance (SQA) resources on the most risky commits. However, the explainability of JIT defect models remains largely unexplored (i.e., practitioners still do not know why a commit is predicted as defect-introducing). Recently, LIME has been used to generate explanations for any AI/ML model. However, the random perturbation approach used by LIME to generate synthetic neighbors is still suboptimal: it may generate synthetic neighbors that are not similar to the instance to be explained, producing low accuracy of the local models and leading to inaccurate explanations for just-in-time defect models. In this paper, we propose PyExplainer, a local rule-based model-agnostic technique for generating explanations (i.e., why a commit is predicted as defective) of JIT defect models. Through a case study of two open-source software projects, we find that our PyExplainer produces (1) synthetic neighbors that are 41%-45% more similar to the instance to be explained; (2) 18%-38% more accurate local models; and (3) explanations that are 69%-98% more unique and 17%-54% more consistent with the actual characteristics of defect-introducing commits in the future than LIME (a state-of-the-art model-agnostic technique). This could help practitioners focus on the most important aspects of the commits to mitigate the risk of being defect-introducing. Thus, the contributions of this paper build an important step towards Explainable AI for Software Engineering, making software analytics more explainable and actionable. Finally, we publish our PyExplainer as a Python package to support practitioners and researchers (https://github.com/awsm-research/PyExplainer).
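The general idea behind local rule-based explanation described in this abstract can be sketched in a few lines: generate synthetic neighbors around the commit being explained, label them with the black-box defect model, and induce a simple rule that mimics the model locally. Everything below is a simplified stand-in for illustration: the black-box model, the two commit features, the random perturbation (PyExplainer itself instead uses crossover and mutation to keep neighbors close to the instance), and the one-feature threshold rule.

```python
import random

def black_box(commit):
    # Stand-in global defect model (normally an arbitrary trained AI/ML model).
    return commit["churn"] > 100 or commit["files"] > 8

def explain_locally(commit, n_neighbors=200, seed=0):
    rng = random.Random(seed)
    # Build a local neighborhood by perturbing each feature around the instance.
    neighbors = []
    for _ in range(n_neighbors):
        nb = {k: max(0, v + rng.randint(-v, v)) for k, v in commit.items()}
        neighbors.append((nb, black_box(nb)))
    # Induce the single threshold rule "feature > t" with the best local fidelity.
    best = None
    for feat in commit:
        for nb, _ in neighbors:
            t = nb[feat]
            fidelity = sum((n[feat] > t) == y for n, y in neighbors) / n_neighbors
            if best is None or fidelity > best[2]:
                best = (feat, t, fidelity)
    return best  # (feature, threshold, local fidelity)

feat, threshold, fidelity = explain_locally({"churn": 120, "files": 3})
print("risky when %s > %d (local fidelity %.2f)" % (feat, threshold, fidelity))
```

The returned rule ("risky when churn exceeds some threshold") is the kind of explanation a practitioner can act on, in contrast to a bare prediction score.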
Article
The success of software projects depends on complex decision making (e.g., which tasks should a developer do first, who should perform this task, is the software of high quality, is a software system reliable and resilient enough to deploy, etc.). Bad decisions cost money (and reputation) so we need better tools for making better decisions. This article discusses the "why" and "how" of explainable and actionable software analytics. For the task of reducing the risk of software defects, we show initial results from a successful case study that offers more actionable software analytics. Also, we present an interactive tutorial on the subject of Explainable AI tools for SE in our Software Analytics Cookbook (https://xai4se.github.io/book/), and we discuss some open questions that need to be addressed.
Article
Software Quality Assurance (SQA) planning aims to define proactive plans, such as defining a maximum file size, to prevent the occurrence of software defects in future releases. To aid this, defect prediction models have been proposed to generate insights into the most important factors that are associated with software quality. Such insights derived from traditional defect models are far from actionable, i.e., practitioners still do not know what they should do or avoid to decrease the risk of having defects, and what the risk threshold is for each metric. A lack of actionable guidance and risk thresholds can lead to inefficient and ineffective SQA planning processes. In this paper, we investigate practitioners' perceptions of current SQA planning activities, the current challenges of such SQA planning activities, and propose four types of guidance to support SQA planning. We then propose and evaluate our AI-driven SQAPlanner approach, a novel approach for generating four types of guidance and their associated risk thresholds in the form of rule-based explanations for the predictions of defect prediction models. Finally, we develop and evaluate a visualization for our SQAPlanner approach. Through the use of a qualitative survey and empirical evaluation, our results lead us to conclude that SQAPlanner is needed, effective, stable, and practically applicable. We also find that 80% of our survey respondents perceived that our visualization is more actionable. Thus, our SQAPlanner paves the way for novel research in actionable software analytics, i.e., generating actionable guidance on what practitioners should and should not do to decrease the risk of having defects to support SQA planning.
Article
Defect prediction models are proposed to help a team prioritize the source code files that need Software Quality Assurance (SQA) based on their likelihood of having defects. However, developers may waste unnecessary effort on a whole file while only a small fraction of its source code lines are defective. Indeed, we find that as little as 1%-3% of the lines of a file are defective. Hence, in this work, we propose a novel framework (called LINE-DP) to identify defective lines using a model-agnostic technique, i.e., an Explainable AI technique that provides information about why the model makes such a prediction. Broadly speaking, our LINE-DP first builds a file-level defect model using code token features. Then, our LINE-DP uses a state-of-the-art model-agnostic technique (i.e., LIME) to identify risky tokens, i.e., code tokens that lead the file-level defect model to predict that the file will be defective. Then, the lines that contain risky tokens are predicted as defective lines. Through a case study of 32 releases of nine Java open source systems, our evaluation results show that our LINE-DP achieves an average recall of 0.61, a false alarm rate of 0.47, a top 20% LOC recall of 0.27, and an initial false alarm of 16, which are statistically better than six baseline approaches. Our evaluation shows that our LINE-DP requires an average computation time of 10 seconds, including model construction and defective line identification time. In addition, we find that 63% of the defective lines that can be identified by our LINE-DP are related to common defects (e.g., argument change, condition change). These results suggest that our LINE-DP can effectively identify defective lines that contain common defects while requiring a smaller amount of inspection effort and a manageable computation cost. The contribution of this paper builds an important step towards line-level defect prediction by leveraging a model-agnostic technique.
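The second stage of the LINE-DP pipeline described above (mapping risky tokens to defective lines) can be sketched as follows. The token importance scores here are made-up stand-ins for real LIME output, and the rule "keep positively contributing tokens among the top-k" is an illustrative assumption rather than the paper's exact procedure.

```python
def flag_defective_lines(lines, token_importance, top_k=20):
    """lines: source lines of a file predicted as defective.
    token_importance: token -> LIME-style contribution score (positive pushes
    the file-level model toward 'defective')."""
    # Keep the top-k tokens that push the prediction toward "defective".
    ranked = sorted(token_importance.items(), key=lambda kv: kv[1], reverse=True)
    risky = {tok for tok, score in ranked[:top_k] if score > 0}
    # Any line containing a risky token is flagged as a defective line.
    return [line_no for line_no, line in enumerate(lines, 1)
            if any(tok in line.split() for tok in risky)]

lines = ["int total = 0 ;",
         "total = total / count ;",   # risky division
         "return total ;"]
token_importance = {"/": 0.8, "count": 0.5, "return": -0.2, "int": -0.1}
print(flag_defective_lines(lines, token_importance))  # [2]
```

Developers would then inspect only the flagged lines rather than the whole file, which is the source of the inspection-effort savings the abstract reports.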
Article
An increasingly popular set of techniques adopted by software engineering (SE) researchers to automate development tasks are those rooted in the concept of Deep Learning (DL). The popularity of such techniques largely stems from their automated feature engineering capabilities, which aid in modeling software artifacts. However, due to the rapid pace at which DL techniques have been adopted, it is difficult to distill the successes, failures, and opportunities of the current research landscape. In an effort to bring clarity to this cross-cutting area of work, from its modern inception to the present, this article presents a systematic literature review of research at the intersection of SE & DL. The review canvasses work appearing in the most prominent SE and DL conferences and journals and spans 128 papers across 23 unique SE tasks. We center our analysis around the components of learning, a set of principles that governs the application of machine learning (ML) techniques to a given problem domain, discussing several aspects of the surveyed work at a granular level. The end result of our analysis is a research roadmap that both delineates the foundations of DL techniques applied to SE research and highlights likely areas of fertile exploration for the future.