Preprint

PatUntrack: Automated Generating Patch Examples for Issue Reports without Tracked Insecure Code


Abstract

Security patches are essential for enhancing the stability and robustness of projects in the software community. While vulnerabilities are officially expected to be patched before being disclosed, patching vulnerabilities is complicated and remains a struggle for many organizations. To patch vulnerabilities, security practitioners typically track vulnerable issue reports (IRs) and analyze the relevant insecure code to generate potential patches. However, the relevant insecure code may not be explicitly specified, and practitioners cannot track the insecure code in the repositories, thus limiting their ability to generate patches. In such cases, providing examples of insecure code and the corresponding patches would help security developers better locate and fix the insecure code. In this paper, we propose PatUntrack to automatically generate patch examples from IRs without tracked insecure code. It auto-prompts Large Language Models (LLMs) to make them applicable to vulnerability analysis. It first generates the complete description of the Vulnerability-Triggering Path (VTP) from vulnerable IRs. Then, it corrects hallucinations in the VTP description with external golden knowledge. Finally, it generates Top-K pairs of insecure code and patch examples based on the corrected VTP description. To evaluate the performance, we conducted experiments on 5,465 vulnerable IRs. The experimental results show that PatUntrack achieves the highest performance and improves on the traditional LLM baselines by +14.6% (Fix@10) on average in patch example generation. Furthermore, PatUntrack was applied to generate patch examples for 76 newly disclosed vulnerable IRs. 27 out of 37 replies from the authors of these IRs confirmed the usefulness of the patch examples generated by PatUntrack, indicating that they can benefit from these examples when patching the vulnerabilities.
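The three stages the abstract describes can be sketched as a simple pipeline. This is a minimal illustration only: the function names (describe_vtp, correct_with_knowledge, generate_pairs) and the string-based "knowledge" are invented placeholders, not PatUntrack's actual API; in the real system each stage is backed by an auto-prompted LLM call.

```python
# Hypothetical sketch of the three-stage pipeline described in the abstract.
# All names are illustrative placeholders, not PatUntrack's real interface.

def describe_vtp(issue_report: str) -> str:
    """Stage 1: draft a Vulnerability-Triggering Path (VTP) description from an IR."""
    return f"VTP draft for: {issue_report}"

def correct_with_knowledge(vtp: str, knowledge: dict) -> str:
    """Stage 2: revise hallucinated details against external golden knowledge."""
    for wrong, right in knowledge.items():
        vtp = vtp.replace(wrong, right)
    return vtp

def generate_pairs(vtp: str, k: int):
    """Stage 3: emit Top-K (insecure code, patch example) pairs."""
    return [(f"insecure_snippet_{i} // {vtp}", f"patched_snippet_{i}")
            for i in range(k)]

report = "SQL built by string concatenation in login handler"
vtp = correct_with_knowledge(describe_vtp(report), {"draft": "description"})
pairs = generate_pairs(vtp, k=3)
```

The point of the sketch is the data flow: the IR never needs a tracked code location; the corrected VTP description alone drives the example generation.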


Article
Full-text available
Benefiting from automated feature extraction and strong performance in software vulnerability identification, deep learning techniques have attracted extensive attention in data-driven software vulnerability detection. Many deep learning-based methods have been proposed to speed up and automate the process of vulnerability identification. Although these methods have shown significant advantages over traditional machine learning ones, there is an apparent gap between the deep learning-based detection systems and human experts in understanding potentially vulnerable code semantics. In some real-world vulnerability prediction scenarios, the performance of deep learning-based methods drops by more than 50% compared to these methods’ performance in experimental scenarios. We define this phenomenon as the perception gap by examining and reviewing the early software vulnerability detection approaches. Then, from the perspective of the perception gap, this paper profoundly explores the current software vulnerability detection methods and how existing solutions endeavor to narrow the perception gap and push forward the development of the field of interest. Finally, we summarize the challenges of this new field and discuss the possible future.
Article
Full-text available
Security remains under-addressed in many organisations, illustrated by the number of large-scale software security breaches. Preventing breaches can begin during software development if attention is paid to security during the software’s design and implementation. One approach to security assurance during software development is to examine communications between developers as a means of studying the security concerns of the project. Prior research has investigated models for classifying project communication messages (e.g., issues or commits) as security related or not. A known problem is that these models are project-specific, limiting their use by other projects or organisations. We investigate whether we can build a generic classification model that can generalise across projects. We define a set of security keywords by extracting them from relevant security sources, dividing them into four categories: asset, attack/threat, control/mitigation, and implicit. Using different combinations of these categories and including them in the training dataset, we built a classification model and evaluated it on industrial, open-source, and research-based datasets containing over 45 different products. Our model, based on harvested security keywords as a feature set, shows average recall from 55 to 86%, minimum recall from 43 to 71%, maximum recall from 60 to 100%, an average f-score between 3.4% and 88%, an average g-measure of at least 66% across all datasets, and an average AUC of ROC from 69 to 89%. In addition, models that use externally sourced features outperformed models that use project-specific features on average by a margin of 26–44% in recall, 22–50% in g-measure, 0.4–28% in f-score, and 15–19% in AUC of ROC. Further, our results outperform a state-of-the-art prediction model for security bug reports in all cases.
Using sound statistical and effect size tests, we find that (1) using harvested security keywords as features to train a text classification model improves classification models and generalises to other projects significantly; (2) including the features in the training dataset before model construction improves classification models significantly; and (3) different security categories represent predictors for different projects. Finally, we introduce new and promising approaches to construct models that can generalise across different independent projects.
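The keyword-category idea above can be illustrated with a toy feature extractor: count hits per category (asset, attack/threat, control/mitigation, implicit) and feed those counts to a classifier. The keyword lists here are invented examples, not the harvested sets used in the study.

```python
# Toy sketch of keyword-category features for classifying project
# communications as security-related. Keyword lists are illustrative only.

KEYWORDS = {
    "asset": {"password", "credential", "token"},
    "attack": {"injection", "overflow", "xss"},
    "control": {"sanitize", "encrypt", "validate"},
    "implicit": {"leak", "expose", "bypass"},
}

def category_features(message: str) -> dict:
    """Count keyword hits per category; the counts would feed a classifier."""
    words = set(message.lower().split())
    return {cat: len(words & kws) for cat, kws in KEYWORDS.items()}

feats = category_features("Sanitize the password field to stop injection")
```

Because the keywords come from external security sources rather than any single project's vocabulary, the same feature set can be reused across projects, which is the generalisation the paper tests.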
Article
Full-text available
Background In order that the general public is not vulnerable to hackers, security bug reports need to be handled by small groups of engineers before being widely discussed. But learning how to distinguish security bug reports from other bug reports is challenging since they may occur rarely. Data mining methods that can find such scarce targets require extensive optimization effort. Goal The goal of this research is to aid practitioners as they struggle to optimize methods that try to distinguish between rare security bug reports and other bug reports. Method Our proposed method, called SWIFT, is a dual optimizer that optimizes both learner and pre-processor options. Since this is a large space of options, SWIFT uses a technique called 𝜖-dominance that learns how to avoid operations that do not significantly improve performance. Result When compared to recent state-of-the-art results (from FARSEC, published in TSE’18), we find that SWIFT’s dual optimization of both pre-processor and learner is more useful than optimizing each of them individually. For example, in a study of security bug reports from the Chromium dataset, the median recalls of FARSEC and SWIFT were 15.7% and 77.4%, respectively. For another example, in experiments with data from the Ambari project, the median recalls improved from 21.5% to 85.7% (FARSEC to SWIFT). Conclusion Overall, our approach can quickly optimize models that achieve better recalls than the prior state-of-the-art. These increases in recall are associated with moderate increases in false positive rates (from 8% to 24%, median). For future work, these results suggest that dual optimization is both practical and useful.
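The 𝜖-dominance idea can be shown with a toy configuration search: a candidate replaces the incumbent only if it improves the score by more than 𝜖, so near-duplicate options are pruned cheaply. This is an illustration of the concept only; SWIFT's actual dual optimization of pre-processor and learner options is considerably more involved, and the scores below are made up.

```python
# Toy sketch of epsilon-dominance pruning over (pre-processor, learner)
# configurations. Scores are hypothetical recall values, not from the paper.

def epsilon_search(candidates, score, epsilon=0.05):
    """Greedy search that ignores improvements smaller than epsilon."""
    best, best_score = None, float("-inf")
    for c in candidates:
        s = score(c)
        if s > best_score + epsilon:   # only a "significantly" better option survives
            best, best_score = c, s
    return best, best_score

scores = {("smote", "rf"): 0.70, ("none", "nb"): 0.72, ("smote", "svm"): 0.85}
best, recall = epsilon_search(scores, scores.get, epsilon=0.05)
```

Note how ("none", "nb") is skipped: its 0.02 gain over the incumbent is below 𝜖, which is exactly the kind of insignificant improvement 𝜖-dominance is designed to avoid evaluating further.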
Conference Paper
Full-text available
When developers gain thorough understanding and knowledge of software security, they can produce more secure software. This study aims at empirically identifying and understanding the security issues posted on a random sample of GitHub repositories. We tried to understand the presence of security issues and their key themes and topics. We applied a mixed-methods approach, combining topic modeling techniques and qualitative analysis. Our findings have revealed that a) the rate of security-related issues was rather small (approx. 3% of all issues), b) the majority of the security issues were related to identity management and cryptography topics. We present 7 high-level themes of problems that developers face in implementing security features.
Article
Full-text available
Many standards exist to guide the process of risk assessment, particularly in the field of information security. This leads to many, subtly different, definitions of risk analysis, evaluation and assessment. Consequently, researchers often confuse these terms and disciplines, which leads to further confusion within the community. In this sense, it is important to come to a common understanding of the processes and terminology to clarify research in this area. A common approach to achieve this goal is to carry out a literature review. This paper takes a formal approach to the literature review based on the ideas of the Cochrane group. The result is a systematic review of risk assessment in the field of information security. We present a systematic review of over 80 research papers published between 2004 and 2014. The main contribution of our paper is to construct a classification of these published papers into seven types. This classification aims to help researchers obtain a clear and unbiased picture of the terminology, developments and trends of information security risk assessment in the academic sector.
Article
Full-text available
Security has always been a popular and critical topic. With the rapid development of information technology, it continues to attract people’s attention. However, since security has a long history, it covers a wide range of topics which change a lot, from classic cryptography to recently popular mobile security. There is a need to investigate security-related topics and trends, which can be a guide for security researchers, security educators and security practitioners. To address the above-mentioned need, in this paper, we conduct a large-scale study on security-related questions on Stack Overflow. Stack Overflow is a popular online question and answer site for software developers to communicate, collaborate, and share information with one another. There are many different topics among the numerous questions posted on Stack Overflow, and security-related questions occupy a large proportion and hold an important and significant position. We first use two heuristics to extract from the dataset the questions that are related to security based on the tags of the posts. Then, we use an advanced topic model, Latent Dirichlet Allocation (LDA) tuned using a Genetic Algorithm (GA), to cluster different security-related questions based on their texts. After obtaining the different topics of security-related questions, we use their metadata to make various analyses. We summarize all the topics into five main categories, and investigate the popularity and difficulty of different topics as well. Based on the results of our study, we conclude several implications for researchers, educators and practitioners.
Conference Paper
Full-text available
As critical and sensitive systems increasingly rely on complex software systems, identifying software vulnerabilities is becoming increasingly important. It has been suggested in previous work that some bugs are only identified as vulnerabilities long after the bug has been made public. These bugs are known as Hidden Impact Bugs (HIBs). This paper presents a hidden impact bug identification methodology by means of text mining bug databases. The presented methodology utilizes the textual description of the bug report for extracting textual information. The text mining process extracts syntactical information of the bug reports and compresses the information for easier manipulation. The compressed information is then utilized to generate a feature vector that is presented to a classifier. The proposed methodology was tested on Linux vulnerabilities that were discovered in the time period from 2006 to 2011. Three different classifiers were tested and 28% to 88% of the hidden impact bugs were identified correctly by using the textual information from the bug descriptions alone. Further analysis of the Bayesian detection rate showed the applicability of the presented method according to the requirements of a development team.
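The text-mining step described above — turning a bug report's textual description into a feature vector for a classifier — can be sketched in a few lines. The vocabulary and the threshold rule below are invented stand-ins; the paper's actual pipeline extracts and compresses syntactical information and trains real classifiers.

```python
# Sketch of the feature-extraction step: a bug description becomes a
# term-frequency vector over a fixed vocabulary, and a trivial threshold
# rule stands in for the trained classifier. All specifics are illustrative.

from collections import Counter

VOCAB = ["overflow", "crash", "memory", "free", "pointer"]

def feature_vector(description: str):
    """Term frequencies of vocabulary words in the bug description."""
    counts = Counter(description.lower().split())
    return [counts[t] for t in VOCAB]

def looks_like_hidden_impact(vec, threshold=2):
    """Stand-in classifier: flag reports with enough security-leaning terms."""
    return sum(vec) >= threshold

vec = feature_vector("kernel crash after double free of a stale pointer")
```

The appeal of the approach is that it needs nothing but the report text, so it can flag hidden impact bugs before anyone has labeled them as vulnerabilities.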
Conference Paper
Full-text available
Little is known about the duration and prevalence of zero-day attacks, which exploit vulnerabilities that have not been disclosed publicly. Knowledge of new vulnerabilities gives cyber criminals a free pass to attack any target of their choosing, while remaining undetected. Unfortunately, these serious threats are difficult to analyze, because, in general, data is not available until after an attack is discovered. Moreover, zero-day attacks are rare events that are unlikely to be observed in honeypots or in lab experiments. In this paper, we describe a method for automatically identifying zero-day attacks from field-gathered data that records when benign and malicious binaries are downloaded on 11 million real hosts around the world. Searching this data set for malicious files that exploit known vulnerabilities indicates which files appeared on the Internet before the corresponding vulnerabilities were disclosed. We identify 18 vulnerabilities exploited before disclosure, of which 11 were not previously known to have been employed in zero-day attacks. We also find that a typical zero-day attack lasts 312 days on average and that, after vulnerabilities are disclosed publicly, the volume of attacks exploiting them increases by up to 5 orders of magnitude.
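The core test in the field-data method above is a date comparison: a file exploiting a known vulnerability signals a zero-day attack when its first appearance in the wild precedes the vulnerability's public disclosure. A minimal sketch, with made-up file names, CVE identifiers, and dates:

```python
# Sketch of zero-day identification from field data: compare each exploit
# file's first-seen date against its vulnerability's disclosure date.
# File names, CVE ids, and dates below are fabricated for illustration.

from datetime import date

first_seen = {"exploit_a.bin": date(2010, 3, 1), "exploit_b.bin": date(2011, 6, 9)}
exploits   = {"exploit_a.bin": "CVE-X-1",        "exploit_b.bin": "CVE-X-2"}
disclosed  = {"CVE-X-1": date(2010, 11, 20),     "CVE-X-2": date(2011, 5, 2)}

zero_days = [
    cve for f, cve in exploits.items()
    if first_seen[f] < disclosed[cve]   # seen in the wild before disclosure
]
```

Scaled to binary-download telemetry from millions of hosts, this simple predicate is what surfaced the 18 before-disclosure exploits the paper reports.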
Conference Paper
Full-text available
Identifying software vulnerabilities is becoming more important as critical and sensitive systems increasingly rely on complex software systems. It has been suggested in previous work that some bugs are only identified as vulnerabilities long after the bug has been made public. These vulnerabilities are known as hidden impact vulnerabilities. This paper discusses existing bug data mining classifiers and presents an analysis of vulnerability databases showing the necessity to mine common publicly available bug databases for hidden impact vulnerabilities. We present a vulnerability analysis from January 2006 to April 2011 for two well known software packages: Linux kernel and MySQL. We show that 32% (Linux) and 62% (MySQL) of vulnerabilities discovered in this time period were hidden impact vulnerabilities. We also show that the percentage of hidden impact vulnerabilities has increased from 25% to 36% in Linux and from 59% to 65% in MySQL in the last two years. We then propose a hidden impact vulnerability identification methodology based on a text-mining classifier for bug databases. Finally, we discuss potential challenges faced by a development team when using such a classifier.
Article
Machine learning and its promising branch deep learning have proven to be effective in a wide range of application domains. Recently, several efforts have shown success in applying deep learning techniques for automatic vulnerability discovery, as alternatives to traditional static bug detection. In principle, these learning-based approaches are built on top of classification models using supervised learning. Depending on the different granularities to detect vulnerabilities, these approaches rely on learning models which are typically trained with well-labeled source code to predict whether a program method, a program slice, or a particular code line contains a vulnerability or not. The effectiveness of these models is normally evaluated against conventional metrics including precision, recall and F1 score. In this paper, we show that despite yielding promising numbers, the above evaluation strategy can be insufficient and even misleading when evaluating the effectiveness of current learning-based approaches. This is because the underlying learning models only produce the classification results or report individual/isolated program statements, but are unable to pinpoint bug-triggering paths, which is an effective way for bug fixing and the main aim of static bug detection. Our key insight is that a program method or statement can only be stated as vulnerable in the context of a bug-triggering path. In this work, we systematically study the gap between recent learning-based approaches and conventional static bug detectors in terms of fine-grained metrics called BTP metrics using bug-triggering paths. We then characterize and compare the quality of the prediction results of existing learning-based detectors under different granularities. Finally, our comprehensive empirical study reveals several key issues and challenges in developing classification models to pinpoint bug-triggering paths and calls for more advanced learning-based bug detection techniques.
Article
The detection of software vulnerabilities (or vulnerabilities for short) is an important problem that has yet to be tackled, as manifested by the many vulnerabilities reported on a daily basis. This calls for machine learning methods for vulnerability detection. Deep learning is attractive for this purpose because it alleviates the requirement to manually define features. Despite the tremendous success of deep learning in other application domains, its applicability to vulnerability detection is not systematically understood. In order to fill this void, we propose the first systematic framework for using deep learning to detect vulnerabilities in C/C++ programs with source code. The framework, dubbed Syntax-based, Semantics-based, and Vector Representations (SySeVR), focuses on obtaining program representations that can accommodate syntax and semantic information pertinent to vulnerabilities. Our experiments with four software products demonstrate the usefulness of the framework: we detect 15 vulnerabilities that are not reported in the National Vulnerability Database. Among these 15 vulnerabilities, seven are unknown and have been reported to the vendors, and the other eight have been “silently” patched by the vendors when releasing newer versions of the pertinent software products.
Article
Software developers spend 35-50 percent of their time validating and debugging software. The cost of debugging, testing, and verification is estimated to account for 50-75 percent of the total budget of software development projects, amounting to more than $100 billion annually. While tools, languages, and environments have reduced the time spent on individual debugging tasks, they have not significantly reduced the total time spent debugging, nor the cost of doing so. Therefore, a hyperfocus on elimination of bugs during development is counterproductive; programmers should instead embrace debugging as an exercise in problem solving.
Article
Context Systematic literature reviews (SLRs) rely on a rigorous and auditable methodology for minimizing biases and ensuring reliability. A common kind of bias arises when selecting studies using a set of inclusion/exclusion criteria. This bias can be decreased through dual revision, which makes the selection process more time-consuming and remains prone to generating bias depending on how each researcher interprets the inclusion/exclusion criteria. Objective To reduce the bias and time spent in the study selection process, this paper presents a process for selecting studies based on the use of Cohen’s Kappa statistic. We have defined an iterative process based on the use of this statistic during which the criteria are refined until almost perfect agreement (k > 0.8) is obtained. At this point, the two researchers interpret the selection criteria in the same way, and thus, the bias is reduced. Starting from this agreement, dual review can be eliminated; consequently, the time spent is drastically shortened. Method The feasibility of this iterative process for selecting studies is demonstrated through a tertiary study in the area of software engineering on works that were published from 2005 to 2018. Results The time saved in the study selection process was 28% (for 152 studies), and if the number of studies is sufficiently large, the time saved tends asymptotically to 50%. Conclusions Researchers and students may take advantage of this iterative process for selecting studies when conducting SLRs to reduce bias in the interpretation of inclusion and exclusion criteria. It is especially useful for research with few resources.
Article
The links between issues in an issue-tracking system and commits resolving the issues in a version control system are important for a variety of software engineering tasks (e.g., bug prediction, bug localization and feature location). However, only a small portion of such links are established by manually including issue identifiers in commit logs, leaving a large portion of them lost in the evolution history. To recover issue-commit links, heuristic-based and learning-based techniques leverage the metadata and text/code similarity in issues and commits; however, they fail to capture the embedded semantics in issues and commits and the hidden semantic correlations between issues and commits. As a result, this semantic gap inhibits the accuracy of link recovery. To bridge this gap, we propose a semantically-enhanced link recovery approach, named DeepLink, which is built on top of deep learning techniques. Specifically, we develop a neural network architecture, using word embedding and recurrent neural network, to learn the semantic representation of natural language descriptions and code in issues and commits as well as the semantic correlation between issues and commits. In experiments, to quantify the prevalence of missing issue-commit links, we analyzed 1078 highly-starred GitHub Java projects (i.e., 583,795 closed issues) and found that only 42.2% of issues were linked to corresponding commits. To evaluate the effectiveness of DeepLink, we compared DeepLink with a state-of-the-art link recovery approach FRLink using ten GitHub Java projects and demonstrated that DeepLink can outperform FRLink in terms of F-measure.
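The manual-link convention that the paper starts from — a commit counts as linked only when its log explicitly mentions an issue identifier such as "#123" — is easy to sketch, and the sketch also shows why so many links are lost. DeepLink itself goes much further and learns semantic links with word embeddings and recurrent networks; the snippet below only illustrates the explicit-reference baseline.

```python
# Sketch of explicit issue-commit linking: harvest issue ids mentioned in a
# commit message. Commits that omit the id (the common case the paper
# quantifies) yield no link at all.

import re

ISSUE_REF = re.compile(r"#(\d+)")

def explicit_links(commit_log: str) -> set:
    """Issue ids explicitly referenced in a commit message."""
    return {int(m) for m in ISSUE_REF.findall(commit_log)}

linked = explicit_links("Fix NPE in parser, closes #42 (follow-up to #7)")
unlinked = explicit_links("Fix NPE in parser")   # same fix, no id: link lost
```

The second call returns nothing even though the commit plainly resolves the same issue, which is the semantic gap DeepLink's learned representations are meant to bridge.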
Conference Paper
We revisit the performance of template-based APR to build comprehensive knowledge about the effectiveness of fix patterns, and to highlight the importance of complementary steps such as fault localization or donor code retrieval. To that end, we first investigate the literature to collect, summarize and label recurrently-used fix patterns. Based on the investigation, we build TBar, a straightforward APR tool that systematically attempts to apply these fix patterns to program bugs. We thoroughly evaluate TBar on the Defects4J benchmark. In particular, we assess the actual qualitative and quantitative diversity of fix patterns, as well as their effectiveness in yielding plausible or correct patches. Eventually, we find that, assuming a perfect fault localization, TBar correctly/plausibly fixes 74/101 bugs. Replicating a standard and practical pipeline of APR assessment, we demonstrate that TBar correctly fixes 43 bugs from Defects4J, an unprecedented performance in the literature (including all approaches, i.e., template-based, stochastic mutation-based or synthesis-based APR).
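A "fix pattern" in template-based APR is essentially a rewrite rule applied to a suspicious statement to produce candidate patches. The toy patterns below (a null-guard insertion and a comparison-operator mutation over Java-like code held as strings) are invented for illustration and are not TBar's actual pattern set or implementation.

```python
# Toy sketch of template-based repair: each "fix pattern" maps a suspicious
# statement to a candidate patch. Patterns and statements are illustrative.

def null_check_pattern(line: str, var: str) -> str:
    """Wrap a statement in a null guard for `var` (Java-like code as text)."""
    return f"if ({var} != null) {{ {line.strip()} }}"

def mutate_operator_pattern(line: str) -> str:
    """Flip a common off-by-one comparison operator."""
    return line.replace("<=", "<") if "<=" in line else line.replace("<", "<=")

suspicious = "result = items.get(i).name;"
candidates = [null_check_pattern(suspicious, "items.get(i)"),
              mutate_operator_pattern("while (i <= n)")]
```

A template-based tool systematically tries such patterns at the locations a fault localizer flags, then keeps the candidates that pass the test suite — which is why TBar's results are so sensitive to fault-localization accuracy.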
Conference Paper
Automated Program Repair (APR) is one of the most recent advances in automated debugging, and can directly fix buggy programs with minimal human intervention. Although various advanced APR techniques (including search-based or semantic-based ones) have been proposed, they mainly work at the source-code level and it is not clear how bytecode-level APR performs in practice. Also, empirical studies of the existing techniques on bugs beyond what has been reported in the original papers are rather limited. In this paper, we implement the first practical bytecode-level APR technique, PraPR, and present the first extensive study on fixing real-world bugs (e.g., Defects4J bugs) using JVM bytecode mutation. The experimental results show that surprisingly even PraPR with only the basic traditional mutators can produce genuine fixes for 17 bugs; with simple additional commonly used APR mutators, PraPR is able to produce genuine fixes for 43 bugs, significantly outperforming state-of-the-art APR, while being over 10X faster. Furthermore, we performed an extensive study of PraPR and other recent APR tools on a large number of additional real-world bugs, and demonstrated the overfitting problem of recent advanced APR tools for the first time. Lastly, PraPR has also successfully fixed bugs for other JVM languages (e.g., for the popular Kotlin language), indicating PraPR can greatly complement existing source-code-level APR.
Conference Paper
In cybersecurity, vulnerability discovery in source code is a fundamental problem. To automate vulnerability discovery, machine learning (ML) based techniques have attracted tremendous attention. However, existing ML-based techniques focus on component- or file-level detection, and thus considerable human effort is still required to pinpoint the vulnerable code fragments. Using source code files also limits the generalisability of the ML models across projects. To address such challenges, this paper targets function-level vulnerability discovery in the cross-project scenario. A function representation learning method is proposed to obtain high-level and generalizable function representations from the abstract syntax tree (AST). First, the serialized ASTs are used to learn project-independent features. Then, a customized bi-directional LSTM neural network is devised to learn the sequential AST representations from the large number of raw features. The new function-level representation demonstrated promising performance gains, using a unique dataset where we manually labeled 6000+ functions from three open-source projects. The results confirm the huge potential of the new AST-based function representation learning.
Article
Software security vulnerabilities are one of the critical issues in the realm of computer security. Due to their potential high severity impacts, many different approaches have been proposed in the past decades to mitigate the damages of software vulnerabilities. Machine-learning and data-mining techniques are also among the many approaches to address this issue. In this article, we provide an extensive review of the many different works in the field of software vulnerability analysis and discovery that utilize machine-learning and data-mining techniques. We review different categories of works in this domain, discuss both advantages and shortcomings, and point out challenges and some uncharted territories in the field.
Article
Background Software fault prediction is the process of developing models that can be used by software practitioners in the early phases of the software development life cycle for detecting faulty constructs such as modules or classes. There are various machine learning techniques used in the past for predicting faults. Method In this study we perform a systematic review of studies from January 1991 to October 2013 in the literature that use machine learning techniques for software fault prediction. We assess the performance capability of the machine learning techniques in existing research for software fault prediction. We also compare the performance of the machine learning techniques with statistical techniques and with other machine learning techniques. Further, the strengths and weaknesses of machine learning techniques are summarized. Results In this paper we have identified 64 primary studies and seven categories of machine learning techniques. The results demonstrate the prediction capability of the machine learning techniques for classifying a module/class as fault prone or not fault prone. The models using machine learning techniques for estimating software fault proneness outperform the traditional statistical models. Conclusion Based on the results obtained from the systematic review, we conclude that machine learning techniques have the ability to predict software fault proneness and can be used by software practitioners and researchers. However, the application of machine learning techniques in software fault prediction is still limited, and more studies should be carried out in order to obtain well-formed and generalizable results. We provide future guidelines to practitioners and researchers based on the results obtained in this work.
Article
This paper presents an approach based on machine learning to predict which components of a software application contain security vulnerabilities. The approach is based on text mining the source code of the components. Namely, each component is characterized as a series of terms contained in its source code, with the associated frequencies. These features are used to forecast whether each component is likely to contain vulnerabilities. In an exploratory validation with 20 Android applications, we discovered that a dependable prediction model can be built. Such a model could be useful to prioritize the validation activities, e.g., to identify the components needing special scrutiny.
Conference Paper
Application security is becoming increasingly prevalent during software and especially web application development. Consequently, countermeasures are continuously being discussed and built into applications, with the goal of reducing the risk that unauthorized code will be able to access, steal, modify, or delete sensitive data. In this paper we gauged the presence and atmosphere surrounding security-related discussions on GitHub, as mined from discussions around commits and pull requests. First, we found that security related discussions account for approximately 10% of all discussions on GitHub. Second, we found that more negative emotions are expressed in security-related discussions than in other discussions. These findings confirm the importance of properly training developers to address security concerns in their applications as well as the need to test applications thoroughly for security vulnerabilities in order to reduce frustration and improve overall project atmosphere.