ORIGINAL ARTICLE
Deep neural-based vulnerability discovery demystified: data, model and performance
Guanjun Lin¹, Wei Xiao², Leo Yu Zhang³, Shang Gao³, Yonghang Tai⁴, Jun Zhang⁵
Received: 20 July 2020 / Accepted: 25 March 2021 / Published online: 17 May 2021
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021
Abstract
Detecting source-code level vulnerabilities at the development phase is a cost-effective way to prevent potential attacks at the software deployment stage. Many machine learning-based solutions, including deep learning-based ones, have been proposed to aid vulnerability discovery. However, these approaches were mainly evaluated on self-constructed or self-collected datasets, and the lack of a unified baseline dataset makes it difficult to compare their effectiveness. To bridge this gap, we construct a function-level vulnerability dataset from scratch, provided as pairs of source code and labels. To evaluate the constructed dataset, we build a function-level vulnerability detection framework that incorporates six mainstream neural network models as vulnerability detectors. We perform experiments to investigate the performance of the neural model-based detectors using source code as raw input with continuous Bag-of-Words neural embeddings. Empirical results reveal that the variants of recurrent neural networks and the convolutional neural network perform well on our dataset, as the former are capable of handling contextual information and the latter learns features from small context windows. In terms of generalization ability, the fully connected network outperforms the other network architectures. The performance evaluation can serve as a reference benchmark for neural model-based vulnerability detection at function-level granularity. Our dataset can serve as ground truth for ML-based function-level vulnerability detection and a baseline for evaluating relevant approaches.
Keywords: Vulnerability discovery · Deep learning · Function-level · Baseline dataset · Performance evaluation
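The abstract describes feeding function source code to the detectors through continuous Bag-of-Words (CBOW) embeddings. As a rough illustration only, the sketch below shows how CBOW training pairs (context tokens predicting a centre token) could be formed from one function. The regex tokenizer, the window size, and the names `tokenize` and `cbow_pairs` are assumptions made for this example and are not taken from the article.

```python
import re

# Hypothetical, simplified lexer for C-like code: identifiers, integer
# literals, and single punctuation characters. Not the authors' tokenizer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(source):
    """Split a function's source text into a flat list of tokens."""
    return TOKEN_RE.findall(source)


def cbow_pairs(tokens, window=2):
    """Build (context_tokens, target_token) pairs as used to train CBOW.

    For each position, the context is up to `window` tokens on each side
    and the target is the token at that position.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


# Illustrative input only: a small function with an unchecked copy.
example = "int f(char *s) { char buf[8]; strcpy(buf, s); return 0; }"
tokens = tokenize(example)
pairs = cbow_pairs(tokens, window=2)
```

An embedding model would then be trained to predict each target from the average of its context vectors; that training step is omitted here.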
1 Introduction
Computer software is ubiquitous and affects all aspects of
our lives daily. Vulnerabilities in the software might be
exploited by attackers, thus leading to severe consequences
Guanjun Lin and Wei Xiao have contributed equally to this
work, and Yonghang Tai is the corresponding author.
Yonghang Tai: taiyonghang@ynnu.edu.cn
Guanjun Lin: daniellin1986d@gmail.com
Wei Xiao: xiaowei@ccut.edu.cn
Leo Yu Zhang: leo.zhang@deakin.edu.au
Shang Gao: shang.gao@deakin.edu.au
Jun Zhang: junzhang@swin.edu.au
1 School of Information Engineering, Sanming University, Sanming, Fujian Province, China
2 School of Computer Science and Engineering, Changchun University of Technology, Changchun, Jilin Province, China
3 School of Information Technology, Deakin University, Geelong, VIC 3216, Australia
4 Yunnan Key Laboratory of Opto-electronic Information Technology, Yunnan Normal University, Kunming, Yunnan, China
5 School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, VIC 3122, Australia
Neural Computing and Applications (2021) 33:13287–13300
https://doi.org/10.1007/s00521-021-05954-3
... In recent years, more and more deep learning methods have been applied to software vulnerability detection [5]. Some deep learning-based vulnerability detection systems utilize the syntax, semantics, and structure of software code to improve detection performance. ...
Article
Automated vulnerability detection is crucial to protect software systems. However, state-of-the-art approaches mainly focus on a single view of the source code, which often leads to incomplete code representation and low detection accuracy. To solve these problems, this paper proposes a novel automatic vulnerability detection model, DMVL4AVD, based on deep multi-view learning that represents source codes from three distinct views: code sequences, code property graphs, and code metrics. Different deep models are employed to extract features from each view. Firstly, the [CLS] vectors derived from encoder layers 1 to 12 of GraphCodeBERT are used as code sequence features which contain rich semantic information. Next, the gated graph neural network (GGNN) is exploited to learn the features of nodes in the code property graph, encompassing both syntactic and dependency information of the source code. During the extraction of graph features, node representation is augmented by incorporating the degree centrality of each node, along with its corresponding code and type attributes, resulting in a more comprehensive depiction of the graph's structure. Statistical metrics generated by the code analysis tool SourceMonitor are then processed through a 1-dimensional (1-D) CNN to produce metric features. Fused features from these three views are learned by a multilayer perceptron (MLP) to yield final classification results. Experimental results demonstrate the superiority of DMVL4AVD over existing approaches. The model performs significantly better than the studied baselines, achieving an average increase in accuracy of 6.79% and an average boost of 6.94% in precision compared to the approaches in the literature.
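The abstract above mentions augmenting each graph node's representation with its degree centrality. A minimal sketch of that idea follows, assuming an undirected view of the graph and the usual normalisation degree / (n - 1). The function names and the toy features are hypothetical, and the cited model may compute centrality differently, for example on directed edges.

```python
def degree_centrality(num_nodes, edges):
    """Return degree / (num_nodes - 1) for every node of an undirected graph.

    Self-loops and duplicate edges are ignored (an assumption of this sketch).
    """
    neighbors = {node: set() for node in range(num_nodes)}
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    denom = max(num_nodes - 1, 1)
    return [len(neighbors[node]) / denom for node in range(num_nodes)]


def augment_features(features, centrality):
    """Append each node's centrality value to its existing feature vector."""
    return [row + [c] for row, c in zip(features, centrality)]


# Toy graph with 4 nodes; the two-dimensional node features are made up.
edges = [(0, 1), (0, 2), (0, 3), (2, 3)]
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
centrality = degree_centrality(4, edges)
augmented = augment_features(features, centrality)
```

Node 0 is connected to every other node, so its centrality is 1.0 and its augmented feature vector gains that value as a third component.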
... For example, Krishnan et al. conducted a detailed comparison of deep learning methods for Web application security and found that, although these methods performed well in detecting complex vulnerabilities, they still had obvious limitations in dealing with imbalanced datasets and adversarial attacks [2]. In addition, Wartschinski et al. compared deep learning models applied to Python code bases, revealing the advantages and disadvantages of different models in dealing with natural code bases [3]. This study further verified and supplemented the performance of these deep learning models in actual vulnerability detection tasks, and proposed improvements to model design and data processing. ...
Article
With the rapid development of information technology, software security issues have become increasingly prominent, especially the importance of software vulnerability detection technology, which has continued to increase. This paper reviews vulnerability detection technology based on neural networks, with a special focus on two widely used models: Devign and CodeBERT. By analyzing the working principles and processes of these two models, their performance and advantages in dealing with vulnerability detection tasks in complex codes are explored. With the advancement of technology, an analysis is conducted on the update and development of these two different technologies from various perspectives. At the same time, this paper also analyzes the challenges that existing models may encounter and proposes future development directions, including data processing, model design improvements, and innovations in feature extraction technology, in order to improve the accuracy and efficiency of vulnerability detection technology.
... Tang et al. [50] surveyed two models to investigate the best methods among neural network architectures, vector representation methods, and symbolization methods. Lin et al. [30] constructed a dataset covering nine software projects to evaluate the vulnerability detection ability and generalization of six neural network models. Meanwhile, Ban et al. [9] evaluated six learning-based models in a cross-project setting considering three software projects. ...
Preprint
Though many deep learning-based models have made great progress in vulnerability detection, we have no good understanding of these models, which limits further advances in model capability, understanding of how the models detect vulnerabilities, and their efficient and safe use in practice. In this paper, we extensively and comprehensively investigate two types of state-of-the-art learning-based approaches (sequence-based and graph-based) by conducting experiments on a recently built large-scale dataset. We investigate seven research questions along five dimensions, namely model capabilities, model interpretation, model stability, ease of use of the models, and model economy. We experimentally demonstrate the superiority of sequence-based models and the limited abilities of both an LLM (ChatGPT) and graph-based models. We explore the types of vulnerability that learning-based models are skilled at detecting and reveal the instability of the models even when the input is changed only subtly and in a semantically equivalent way. We empirically explain what the models have learned. We summarize the pre-processing steps and the requirements for using the models easily. Finally, we offer an initial summary of the information vital to using these models economically and safely in practice.
Article
Aiming at the problems of non-uniform experimental platforms and heterogeneous datasets faced in the field of vulnerability detection, we studied the application of word vector models in C/C++ function vulnerability detection. Five word vector models were used for the knowledge representation of the abstract syntax tree structure generated by the source code, and six neural network models were used for vulnerability detection. The experimental results show that function-level code has shallow semantic relationships and tight connections within code blocks.
Article
In today's world, cybersecurity risks are getting trickier. It's super important to think ahead about how vulnerable systems might be taken advantage of, so that smart defense tactics can be put in place. The goal here is to build a system that predicts how susceptible vulnerabilities are to exploitation. We're using the Common Vulnerability Scoring System (CVSS) metrics for this task. Starting from a detailed dataset from the National Vulnerability Database (NVD), this project converts the data from JSON format into a CSV table. After that, it finds key characteristics and uses machine learning to estimate how likely vulnerabilities are to be exploited. The process involves breaking down the CVSS information into its crucial parts: Attack Vector (AV), Attack Complexity (AC), Privileges Required (PR), User Interaction (UI), Scope (S), Confidentiality (C), Integrity (I), and Availability (A). All these elements become inputs for our model, which we then tweak and check using various methods to ensure it's accurate and reliable. The results reveal just how important the selected features and the predictive model are for calculating vulnerability susceptibility. This gives valuable insights for everyone in cybersecurity. Our initiative stresses the importance of preprocessing data, picking relevant features, and using predictive models to make cybersecurity strategies stronger. Going forward, we'll work on improving the model with more data and explore advanced algorithms to boost prediction accuracy. In short, our project shows how data-driven approaches can really help improve cybersecurity defenses and lessen the risks linked with exploitable weaknesses.
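The feature-extraction step described above, splitting CVSS information into AV, AC, PR, UI, S, C, I and A, can be sketched as follows. This is an illustrative example only: it assumes CVSS v3.x base vector strings such as `CVSS:3.1/AV:N/AC:L/...`, and the one-hot encoding and the names `parse_cvss_vector` and `one_hot` are assumptions rather than the cited project's code.

```python
# Allowed values per CVSS v3.x base metric, in the standard vector order.
CATEGORIES = {
    "AV": ["N", "A", "L", "P"],  # Attack Vector
    "AC": ["L", "H"],            # Attack Complexity
    "PR": ["N", "L", "H"],       # Privileges Required
    "UI": ["N", "R"],            # User Interaction
    "S": ["U", "C"],             # Scope
    "C": ["H", "L", "N"],        # Confidentiality
    "I": ["H", "L", "N"],        # Integrity
    "A": ["H", "L", "N"],        # Availability
}


def parse_cvss_vector(vector):
    """Parse a CVSS v3.x vector string into a {metric: value} dict."""
    prefix, *parts = vector.split("/")
    if not prefix.startswith("CVSS:3"):
        raise ValueError("expected a CVSS v3.x vector string")
    metrics = {}
    for part in parts:
        name, _, value = part.partition(":")
        metrics[name] = value
    parsed = {}
    for name, allowed in CATEGORIES.items():
        if metrics.get(name) not in allowed:
            raise ValueError(f"missing or invalid value for metric {name}")
        parsed[name] = metrics[name]
    return parsed


def one_hot(parsed):
    """Encode the parsed base metrics as a flat 0/1 feature list."""
    features = []
    for name, allowed in CATEGORIES.items():
        features.extend(1 if parsed[name] == value else 0 for value in allowed)
    return features


parsed = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
features = one_hot(parsed)
```

Each vector yields 22 binary features, with exactly one set per metric. Rows of this form could be written to a CSV table and passed to any standard classifier.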
Article
As more and more 5G networks are deployed, the limitations of 5G networks are not only being discovered but are driving exploratory research into 6G networks as a next-generation solution. Part of these investigations includes the fundamental security and privacy problems associated with 6G technologies. Therefore, to consolidate and solidify this foundational research as a basis for future investigations, we have prepared this survey on the current state-of-play in 6G-related security and privacy. The survey begins with a historical review of previous network technologies and how they have informed the current trends in 6G networking. We then discuss four key aspects of 6G networks – real-time intelligent edge computing, distributed artificial intelligence, intelligent radio, and 3D intercoms – and some promising emerging technologies in each area, along with the relevant security and privacy issues. The survey concludes with a report on potential 6G applications. Some of the references used in this paper along with further details of several points raised can be found at: security-privacyin5g-6g.github.io.
Research
Thousands of security vulnerabilities are discovered in production software each year, either reported publicly to the Common Vulnerabilities and Exposures database or discovered internally in proprietary code. Vulnerabilities often manifest themselves in subtle ways that are not obvious to code reviewers or the developers themselves. With the wealth of open source code available for analysis, there is an opportunity to learn the patterns of bugs that can lead to security vulnerabilities directly from data. In this paper, we present a data-driven approach to vulnerability detection using machine learning, specifically applied to C and C++ programs. We first compile a large dataset of hundreds of thousands of open-source functions labeled with the outputs of a static analyzer. We then compare methods applied directly to source code with methods applied to artifacts extracted from the build process, finding that source-based models perform better. We also compare the application of deep neural network models with more traditional models such as random forests and find the best performance comes from combining features learned by deep models with tree-based models. Ultimately, our highest performing model achieves an area under the precision-recall curve of 0.49 and an area under the ROC curve of 0.87.
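The two figures reported above, area under the ROC curve and area under the precision-recall curve, can be computed from a detector's scores as sketched below. This is a minimal illustration: ROC AUC is computed through its pairwise-ranking interpretation, and average precision is used as one common estimator of precision-recall area, which is not necessarily the estimator used in the cited work. The labels and scores are invented.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive sample")
    return total / hits


# Invented example: 1 = vulnerable function, 0 = non-vulnerable function.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

On heavily imbalanced vulnerability data the precision-recall figure is usually the more demanding of the two, which is consistent with the gap between 0.49 and 0.87 reported above.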
Article
The Software Assurance Reference Dataset (SARD) is a growing collection of over 170 000 programs with precisely located bugs. The programs are in C, C++, Java, PHP, and C# and cover more than 150 classes of weaknesses, such as SQL injection, cross-site scripting (XSS), buffer overflow, and use of a broken cryptographic algorithm. Most are automatically generated synthetic programs, each a few pages of code long, but there are also over 7000 full-sized applications. In addition, SARD has production code and hundreds of cases written by hand. The code is of typical quality: it is neither pristine nor obfuscated. Many cases have corresponding "good" cases, in which weaknesses are fixed, to test for false positives. The SARD web interface allows users to browse test cases and test suites or search for test cases by programming language, weakness type, file name, size, words in the description, and several other criteria. The user can select and download any or all of the resulting cases.
Article
Deep Learning (DL) is a disruptive technology that has changed the landscape of cyber security research. Deep learning models have many advantages over traditional Machine Learning (ML) models, particularly when there is a large amount of data available. Android malware detection or classification qualifies as a big data problem because of the fast booming number of Android malware, the obfuscation of Android malware, and the potential protection of huge values of data assets stored on the Android devices. It seems a natural choice to apply DL on Android malware detection. However, there exist challenges for researchers and practitioners, such as choice of DL architecture, feature extraction and processing, performance evaluation, and even gathering adequate data of high quality. In this survey, we aim to address the challenges by systematically reviewing the latest progress in DL-based Android malware detection and classification. We organize the literature according to the DL architecture, including FCN, CNN, RNN, DBN, AE, and hybrid models. The goal is to reveal the research frontier, with the focus on representing code semantics for Android malware detection. We also discuss the challenges in this emerging field and provide our view of future research opportunities and directions.
Article
The constantly increasing number of disclosed security vulnerabilities has become an important concern in the software industry and in the field of cybersecurity, suggesting that the current approaches for vulnerability detection demand further improvement. The booming of the open-source software community has made vast amounts of software code available, which allows machine learning and data mining techniques to exploit abundant patterns within software code. Particularly, the recent breakthrough application of deep learning to speech recognition and machine translation has demonstrated the great potential of neural models for understanding natural languages. This has motivated researchers in the software engineering and cybersecurity communities to apply deep learning for learning and understanding vulnerable code patterns and semantics indicative of the characteristics of vulnerable code. In this survey, we review the current literature adopting deep-learning-/neural-network-based approaches for detecting software vulnerabilities, aiming at investigating how the state-of-the-art research leverages neural techniques for learning and understanding code semantics to facilitate vulnerability discovery. We also identify the challenges in this new field and share our views of potential research directions.
Article
Machine learning (ML) has great potential in automated code vulnerability discovery. However, automated discovery applications driven by off-the-shelf machine learning tools often perform poorly due to the shortage of high-quality training data. The scarcity of vulnerability data is almost always a problem for any developing software project during its early stages, which is referred to as the cold-start problem. This article proposes a framework that utilizes transferable knowledge from pre-existing data sources. In order to improve the detection performance, multiple vulnerability-relevant data sources were selected to form a broader base for learning transferable knowledge. The selected vulnerability-relevant data sources are cross-domain, including historical vulnerability data from different software projects and data from the Software Assurance Reference Dataset (SARD) consisting of synthetic vulnerability examples and proof-of-concept test cases. To extract the information applicable in vulnerability detection from the cross-domain data sets, we designed a deep-learning-based framework with Long Short-Term Memory (LSTM) cells. Our framework combines the heterogeneous data sources to learn unified representations of the patterns of vulnerable source code. Empirical studies showed that the unified representations generated by the proposed deep learning networks are feasible and effective, and are transferable for real-world vulnerability detection. Our experiments demonstrated that, by leveraging two heterogeneous data sources, our vulnerability detection outperformed the static vulnerability discovery tool Flawfinder. The findings of this article may stimulate further research in ML-based vulnerability detection using heterogeneous data sources.
Article
Social and Internet traffic analysis is fundamental in detecting and defending cyber attacks. Traditional approaches resorting to manually defined rules are gradually replaced by automated approaches empowered by machine learning. This revolution is accelerated by huge datasets which support machine-learning models with outstanding performance. In the context of a data-driven paradigm, this article reviews recent analytic research on cyber traffic over social networks and the Internet by using a set of common concepts of similarity, correlation, and collective indication, and by sharing security goals for classifying network host or applications and users or Tweets. The ability to do so is not determined in isolation, but rather drawn for a wide use of many different network or social flows. Furthermore, the flows exhibit many characteristics, such as fixed sized and multiple messages between source and destination. This article demonstrates a new research methodology of data-driven cyber security (DDCS) and its application in social and Internet traffic analysis. The framework of the DDCS methodology consists of three components, that is, cyber security data processing, cyber security feature engineering, and cyber security modeling. Challenges and future directions in this field are also discussed.
Article
Machine learning-based solutions have been successfully employed for the automatic detection of malware on Android. However, machine learning models lack robustness to adversarial examples, which are crafted by adding carefully chosen perturbations to the normal inputs. So far, the adversarial examples can only deceive detectors that rely on syntactic features (e.g., requested permissions, API calls, etc.), and the perturbations can only be implemented by simply modifying the application's manifest. As recent Android malware detectors rely more on semantic features from Dalvik bytecode than on the manifest, existing attacking/defending methods are no longer effective. In this paper, we introduce a new attacking method that generates adversarial examples of Android malware which evade detection by the current models. To this end, we propose a method of applying optimal perturbations onto an Android APK that can successfully deceive the machine learning detectors. We develop an automated tool to generate the adversarial examples without human intervention. In contrast to existing works, the adversarial examples crafted by our method can also deceive recent machine learning-based detectors that rely on semantic features such as the control-flow graph. The perturbations can also be implemented directly onto the APK's Dalvik bytecode rather than the Android manifest to evade recent detectors. We demonstrate our attack on two state-of-the-art Android malware detection schemes, MaMaDroid and Drebin. Our results show that the malware detection rates decreased from 96% to 0% in MaMaDroid, and from 97% to 0% in Drebin, with just a small amount of code inserted into the APK.