Figure - available from: Neural Computing and Applications
The bubble chart depicts the distribution of severity levels of vulnerabilities in our dataset. The four colors in the bubble chart correspond to the four severity levels “Low,” “Medium,” “High,” and “Critical.” Each level has its own severity score range. The size of a circle represents the proportion of vulnerabilities with a given severity score.


Source publication
Article
Full-text available
Detecting source-code level vulnerabilities at the development phase is a cost-effective solution to prevent potential attacks from happening at the software deployment stage. Many machine learning solutions, including deep learning-based ones, have been proposed to aid the process of vulnerability discovery. However, these approaches were mainly evalua...

Citations

... In recent years, more and more deep learning methods have been applied to software vulnerability detection [5]. Some deep learning-based vulnerability detection systems utilize the syntax, semantics, and structure of software code to improve detection performance. ...
Article
Full-text available
Automated vulnerability detection is crucial to protect software systems. However, state-of-the-art approaches mainly focus on a single view of the source code, which often leads to incomplete code representation and low detection accuracy. To solve these problems, this paper proposes a novel automatic vulnerability detection model, DMVL4AVD, based on deep multi-view learning that represents source code from three distinct views: code sequences, code property graphs, and code metrics. Different deep models are employed to extract features from each view. First, the [CLS] vectors derived from encoder layers 1 to 12 of GraphCodeBERT are used as code sequence features, which contain rich semantic information. Next, a gated graph neural network (GGNN) is exploited to learn the features of nodes in the code property graph, encompassing both syntactic and dependency information of the source code. During the extraction of graph features, each node representation is augmented with the node's degree centrality and its corresponding code and type attributes, resulting in a more comprehensive depiction of the graph's structure. Statistical metrics generated by the code analysis tool SourceMonitor are then processed through a 1-dimensional (1-D) CNN to produce metric features. The fused features from these three views are fed to a multilayer perceptron (MLP) to yield the final classification results. Experimental results demonstrate the superiority of DMVL4AVD over existing approaches. The model performs significantly better than the studied baselines, achieving an average increase in accuracy of 6.79% and an average boost of 6.94% in precision compared to the approaches in the literature.
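The fusion step described above lends itself to a short illustration. Below is a minimal PyTorch sketch of late fusion of three per-function feature views through an MLP classifier; the dimensions, class names, and random inputs are illustrative assumptions, not the authors' DMVL4AVD implementation.

```python
# Minimal PyTorch sketch of late fusion of three per-function feature views with an
# MLP classifier, loosely following the multi-view description above. All dimensions,
# names, and the random inputs are illustrative assumptions, not the DMVL4AVD code.
import torch
import torch.nn as nn

class MultiViewFusionClassifier(nn.Module):
    def __init__(self, seq_dim=768, graph_dim=256, metric_dim=32, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(seq_dim + graph_dim + metric_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 2),  # vulnerable vs. non-vulnerable
        )

    def forward(self, seq_feat, graph_feat, metric_feat):
        # seq_feat: sequence-view features (e.g. [CLS] vectors from a code encoder)
        # graph_feat: graph-view features (e.g. pooled GGNN node states)
        # metric_feat: metric-view features (e.g. 1-D CNN output over static code metrics)
        fused = torch.cat([seq_feat, graph_feat, metric_feat], dim=-1)
        return self.mlp(fused)

# Toy usage with random feature vectors for a batch of 4 functions.
model = MultiViewFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 256), torch.randn(4, 32))
print(logits.shape)  # torch.Size([4, 2])
```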
... For example, Krishnan et al. conducted a detailed comparison of deep learning methods for Web application security and found that, although these methods performed well in detecting complex vulnerabilities, they still had obvious limitations in dealing with imbalanced datasets and adversarial attacks [2]. In addition, Wartschinski et al. compared deep learning models applied to Python code bases, revealing the advantages and disadvantages of different models in dealing with natural code bases [3]. This study further verified and complemented findings on the performance of these deep learning models in actual vulnerability detection tasks and proposed improvements to model design and data processing. ...
Article
Full-text available
With the rapid development of information technology, software security issues have become increasingly prominent, and software vulnerability detection technology has grown correspondingly important. This paper reviews vulnerability detection technology based on neural networks, with a special focus on two widely used models: Devign and CodeBERT. By analyzing the working principles and processes of these two models, their performance and advantages in handling vulnerability detection tasks on complex code are explored. The evolution and ongoing development of these two technologies are then analyzed from various perspectives. This paper also analyzes the challenges that existing models may encounter and proposes future development directions, including improvements in data processing, model design, and feature extraction techniques, in order to improve the accuracy and efficiency of vulnerability detection.
... Tang et al. [50] surveyed two models to investigate the best choices among neural network architectures, vector representation methods, and symbolization methods. Lin et al. [30] constructed a dataset comprising nine software projects to evaluate six neural network models' vulnerability detection ability and generalization. Meanwhile, Ban et al. [9] evaluated six learning-based models in a cross-project setting covering three software projects. ...
Preprint
Though many deep learning-based models have made great progress in vulnerability detection, we still lack a good understanding of these models, which limits the further advancement of model capability, of our understanding of how the models detect vulnerabilities, and of the efficiency and safety of their practical application. In this paper, we extensively and comprehensively investigate two types of state-of-the-art learning-based approaches (sequence-based and graph-based) by conducting experiments on a recently built large-scale dataset. We investigate seven research questions across five dimensions, namely model capabilities, model interpretation, model stability, ease of use, and model economy. We experimentally demonstrate the superiority of sequence-based models and the limited abilities of both an LLM (ChatGPT) and graph-based models. We explore the types of vulnerability that learning-based models are skilled at detecting and reveal the instability of the models when the input is subtly changed in a semantics-preserving way. We empirically explain what the models have learned. We summarize the pre-processing steps and requirements for easily using the models. Finally, we distill the information needed for the economical and safe practical use of these models.
... This is a relatively traditional approach. In addition, studies [45][46][47][48] used basic deep learning methods combined with the bag-of-words technique to detect source code vulnerabilities on traditional datasets such as Asterisk, Ffmpeg, and Httpd. However, we believe that these datasets are relatively balanced, so using them for detection will yield good results. ...
Article
Full-text available
Improving and enhancing the effectiveness of software vulnerability detection methods is urgently needed today. In this study, we propose a new source code vulnerability detection method based on intelligent and advanced computational algorithms. It is a combination of four main processing techniques: (i) Source Embedding, (ii) Feature Learning, (iii) Resampling Data, and (iv) Classification. The Source Embedding method analyzes and standardizes the source code based on the Joern tool and a data mining algorithm. The Feature Learning model aggregates and extracts node-level source code attributes using machine learning and deep learning methods. The Resampling Data technique balances the experimental dataset. Finally, the Classification model detects source code vulnerabilities. The novelty of the new intelligent cognitive computing method is the combined and synchronized use of many different data extraction techniques to compute, represent, and extract the properties of the source code. With this new approach, many significant and unusual properties and features of vulnerabilities are synthesized and extracted. To demonstrate the superiority of the proposed method, we run experiments detecting source code vulnerabilities on the Verum dataset; details are presented in the experimental section. The experimental results show that the proposed method achieves good results on all measures, which, according to our survey to date, are the best reported for the source code vulnerability detection task on the Verum dataset. These results show that the proposal is meaningful not only scientifically but also practically, since using intelligent cognitive computing techniques to analyze and evaluate source code improves the efficiency of the source code analysis and vulnerability detection process.
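To make the resampling and classification stages more concrete, here is a minimal Python sketch. The paper does not specify SMOTE or a random forest, so both are assumed stand-ins for its Resampling Data and Classification components, and the feature matrix is a random placeholder for the learned code features.

```python
# Minimal sketch of the Resampling Data and Classification stages. SMOTE and the
# random forest are assumed stand-ins (the paper does not prescribe them), and the
# feature matrix below is a random placeholder for learned code representations.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 128))          # placeholder function embeddings
y = (rng.random(2000) < 0.1).astype(int)  # ~10% vulnerable: an imbalanced label set

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # rebalance the training split only

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```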
... After we completed our study, we found two related empirical studies. Lin et al. [26] evaluated 6 DL models' generalization for 9 software projects. Ban et al. [3] evaluated 6 machine learning models (1 of which is a neural network) in a cross-project setting with 3 software projects, and also studied training on 2 bug types vs. a single bug type. ...
Preprint
Deep learning (DL) models of code have recently reported great progress for vulnerability detection. In some cases, DL-based models have outperformed static analysis tools. Although many great models have been proposed, we do not yet have a good understanding of these models. This limits the further advancement of model robustness, debugging, and deployment for vulnerability detection. In this paper, we surveyed and reproduced 9 state-of-the-art (SOTA) deep learning models on 2 widely used vulnerability detection datasets: Devign and MSR. We investigated 6 research questions in three areas, namely model capabilities, training data, and model interpretation. We experimentally demonstrated the variability between different runs of a model and the low agreement among different models' outputs. We investigated models trained for specific types of vulnerabilities compared to a model that is trained on all the vulnerabilities at once. We explored the types of programs DL may consider "hard" to handle. We investigated the relations of training data sizes and training data composition with model performance. Finally, we studied model interpretations and analyzed important features that the models used to make predictions. We believe that our findings can help better understand model results, provide guidance on preparing training data, and improve the robustness of the models. All of our datasets, code, and results are available at https://figshare.com/s/284abfba67dba448fdc2.
... Mainstream code embeddings, such as Word2Vec, utilize distributed representations to capture the meanings of words distributed across the components of fixed-length vectors [26], allowing neural networks to learn rich meanings from code. They have been applied in many existing studies [7–13,27,28] for learning representations of vulnerable code. However, static or non-contextual embedding models cannot generate different representations for the same code token appearing in different code contexts. ...
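A tiny gensim sketch can illustrate the limitation noted in this excerpt: a static embedding model assigns one fixed vector per token, no matter the surrounding code. The toy token streams and hyperparameters below are assumptions for illustration only.

```python
# Toy gensim sketch: a static embedding assigns one fixed vector per code token,
# regardless of the code context in which the token appears. The token streams and
# hyperparameters are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["char", "*", "buf", "=", "malloc", "(", "len", ")", ";"],
    ["memcpy", "(", "buf", ",", "src", ",", "len", ")", ";"],
    ["free", "(", "buf", ")", ";"],
]
model = Word2Vec(corpus, vector_size=32, window=3, min_count=1, sg=1, seed=0)

# "buf" occurs in allocation, copy, and free contexts, yet it always maps to the
# same 32-dimensional vector, so context-specific meaning is lost.
print(model.wv["buf"][:5])
print((model.wv["buf"] == model.wv["buf"]).all())  # True: one representation per token
```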
Article
Full-text available
Detecting vulnerabilities in programs is an important yet challenging problem in cybersecurity. The recent advancement in techniques of natural language understanding enables the data-driven research on automated code analysis to embrace Pre-trained Contextualized Models (PCMs). These models are pre-trained on large corpora and can be fine-tuned for various downstream tasks, but their feasibility and effectiveness for software vulnerability detection have not been systematically studied. In this paper, we explore six prevalent PCMs and compare them with three mainstream Non-Contextualized Models (NCMs) in terms of generating effective function-level representations for vulnerability detection. We found that, although the detection performance of PCMs outperformed that of the NCMs, training and fine-tuning PCMs were computationally expensive. The budgets for deployment and inference are also considerable in practice, which may prevent the wide adoption of PCMs in the field of interest. However, we discover that, when the PCMs were compressed using the technique of knowledge distillation, they achieved similar detection performance but with significantly improved efficiency compared with their uncompressed counterparts when using 40,000 synthetic C functions for fine-tuning and approximately 79,200 real-world C functions for training. Among the distilled PCMs, the distilled CodeBERT achieved the most cost-effective performance. Therefore, we proposed a framework encapsulating the Distilled CodeBERT for an end-to-end Vulnerable function Detection (named DistilVD). To examine the performance of the proposed framework in real-world scenarios, DistilVD was tested on four open-source real-world projects with a small amount of training data. Results showed that DistilVD outperformed the five baseline approaches. Further evaluations on multi-class vulnerability detection also confirmed the effectiveness of DistilVD for detecting various vulnerability types.
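As a rough illustration of the kind of fine-tuned detector discussed here, the sketch below wires a CodeBERT checkpoint to a binary classification head with Hugging Face transformers. It is not DistilVD: the knowledge-distillation step is omitted, and the checkpoint, label semantics, and example function are assumptions.

```python
# Rough sketch of a fine-tunable CodeBERT-based binary detector with Hugging Face
# transformers. This is not DistilVD: the knowledge-distillation step is omitted, and
# the checkpoint, label semantics, and example function are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2  # 0 = non-vulnerable, 1 = vulnerable (assumed)
)

func = "void copy(char *dst, char *src) { strcpy(dst, src); }"
inputs = tokenizer(func, truncation=True, max_length=512, return_tensors="pt")

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)  # untrained classification head, so near-random until fine-tuned
```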
... This propagation of information between semantically relevant statements theoretically allows the model to better make use of relevant contextual lines. While GNNs have been shown to perform well in function-level vulnerability classification (graph classification) [9,10,12,37,58], the effects of vectorised methods and GNN models on statement-level vulnerability classification (node classification) have yet to be explored. ...
Preprint
Full-text available
Current machine-learning based software vulnerability detection methods are primarily conducted at the function-level. However, a key limitation of these methods is that they do not indicate the specific lines of code contributing to vulnerabilities. This limits the ability of developers to efficiently inspect and interpret the predictions from a learnt model, which is crucial for integrating machine-learning based tools into the software development workflow. Graph-based models have shown promising performance in function-level vulnerability detection, but their capability for statement-level vulnerability detection has not been extensively explored. While interpreting function-level predictions through explainable AI is one promising direction, we herein consider the statement-level software vulnerability detection task from a fully supervised learning perspective. We propose a novel deep learning framework, LineVD, which formulates statement-level vulnerability detection as a node classification task. LineVD leverages control and data dependencies between statements using graph neural networks, and a transformer-based model to encode the raw source code tokens. In particular, by addressing the conflicting outputs between function-level and statement-level information, LineVD significantly improves the prediction performance without vulnerability status for function code. We have conducted extensive experiments against a large-scale collection of real-world C/C++ vulnerabilities obtained from multiple real-world projects, and demonstrate an increase of 105% in F1-score over the current state-of-the-art.
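To ground the idea of statement-level detection as node classification, here is a minimal torch_geometric sketch: each statement is a node in a dependence graph and receives its own vulnerable/non-vulnerable prediction. It is far simpler than LineVD (no transformer token encoder, no function-level signal), and the features, edges, and dimensions are placeholders.

```python
# Minimal torch_geometric sketch of statement-level detection framed as node
# classification: each statement is a node in a dependence graph and gets its own
# prediction. Much simpler than LineVD; features, edges, and dimensions are placeholders.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class StatementClassifier(torch.nn.Module):
    def __init__(self, in_dim=64, hidden=32):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, 2)  # per-statement: vulnerable vs. not

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

num_stmts = 10
x = torch.randn(num_stmts, 64)  # one feature vector per statement
edge_index = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8],
                           [1, 2, 3, 4, 5, 6, 7, 8, 9]])  # toy control/data dependencies
logits = StatementClassifier()(x, edge_index)
print(logits.shape)  # torch.Size([10, 2]): one prediction per statement
```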
... The experimental study demonstrated that the suggested framework provided a useful feature set for vulnerability detection. In [47], a framework for vulnerability detection with six advanced mainstream network models built in ensured one-click execution for model training and testing on the suggested dataset. Empirical findings demonstrate that the variants of recurrent neural networks and convolutional neural networks exhibit good performance on the proposed dataset. ...
Preprint
Full-text available
In computer security, semantic learning is helpful in understanding vulnerability requirements, realizing source code semantics, and constructing vulnerability knowledge. Nevertheless, learning how to extract and select the most valuable features for software vulnerability detection remains difficult. We hypothesize that the source projects' vulnerable functions contain project-independent vulnerable patterns shared among vulnerabilities across different software projects. Trained Gated Graph Sequence Neural Networks (GGNNs) can be utilized to discover these patterns in order to detect vulnerabilities in a target project. It is necessary to identify vulnerable programming patterns by considering their context. Therefore, we use functional connectivity (FC) based on Gated Graph Neural Networks as a feature representation that captures long-term dependencies and yields a high-level, context-dependent representation of potential vulnerabilities. The experimental findings indicate that the suggested model can select relevant discriminative features and achieves performance superior to benchmark methods.
... However, it grew to approximately 153,955 in 2021. Software vulnerabilities [4][5][6], as a threat, are increasing in frequency, scale, and severity, much like natural disasters, and may lead to unintended and severe consequences. Once a vulnerability in a key system is exploited by attackers, millions of computer systems may be affected [7]. ...
Article
Full-text available
Due to multitudinous vulnerabilities in sophisticated software programs, the detection performance of existing approaches requires further improvement. Multiple vulnerability detection approaches have been proposed to aid code inspection. Among them, there is a line of approaches that apply deep learning (DL) techniques and achieve promising results. This paper attempts to utilize CodeBERT, a deep contextualized model, as an embedding solution to facilitate the detection of vulnerabilities in C open-source projects. The application of CodeBERT for code analysis allows the rich and latent patterns within software code to be revealed, with the potential to facilitate various downstream tasks such as the detection of software vulnerabilities. CodeBERT inherits the architecture of BERT, providing a stacked transformer encoder in a bidirectional structure. This facilitates the learning of vulnerable code patterns, which requires long-range dependency analysis. Additionally, the multihead attention mechanism of the transformer enables the model to focus on multiple key variables of a data flow, which is crucial for analyzing and tracing potentially vulnerable data flows, eventually resulting in optimized detection performance. To evaluate the effectiveness of the proposed CodeBERT-based embedding solution, four mainstream embedding methods are compared for generating software code embeddings, including Word2Vec, GloVe, and FastText. Experimental results show that CodeBERT-based embedding outperforms the other embedding models on the downstream vulnerability detection tasks. To further boost performance, we proposed to include synthetic vulnerable functions and perform synthetic and real-world data fine-tuning to facilitate the model's learning of C-related vulnerable code patterns. Meanwhile, we explored a suitable configuration of CodeBERT. The evaluation results show that the model with the new parameters outperforms some state-of-the-art detection methods on our dataset.
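A short sketch of the embedding idea described above: CodeBERT encodes a C function, and the final-layer [CLS] hidden state serves as a fixed-length function embedding for a downstream detector. The example function is made up, and the paper's fine-tuning on synthetic and real-world vulnerable functions is not reproduced here.

```python
# Short sketch of CodeBERT as an embedding extractor: the final-layer [CLS] hidden
# state serves as a fixed-length function embedding for a downstream detector. The
# example C function is made up, and the paper's fine-tuning stage is not reproduced.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

func = "int get(int *a, int n, int i) { return a[i]; }  /* no bounds check */"
inputs = tokenizer(func, truncation=True, max_length=512, return_tensors="pt")

encoder.eval()
with torch.no_grad():
    last_hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
embedding = last_hidden[:, 0, :]                        # [CLS] vector as the code embedding
print(embedding.shape)  # torch.Size([1, 768]): input features for a classifier
```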
... These issues were tackled by distributed word embedding techniques, such as Word2Vec [22], GloVe (global vectors for word representation) [23], and FastText [24]. For example, techniques such as Word2Vec can learn word probabilities from contextual information, are capable of capturing word similarities, and form the foundation for many existing studies that require learning code semantics for code analysis tasks such as vulnerability detection [5,14,25–28]. ...
Article
Full-text available
Exploitable vulnerabilities in software systems are major security concerns. To date, machine learning (ML) based solutions have been proposed to automate and accelerate the detection of vulnerabilities. Most ML techniques aim to isolate a unit of source code, be it a line or a function, as being vulnerable. We argue that a code segment is vulnerable if it exists in certain semantic contexts, such as the control flow and data flow; therefore, it is important for the detection to be context aware. In this paper, we evaluate the performance of mainstream word embedding techniques in the scenario of software vulnerability detection. Based on the evaluation, we propose a supervised framework leveraging pre-trained context-aware embeddings from language models (ELMo) to capture deep contextual representations, further summarized by a bidirectional long short-term memory (Bi-LSTM) layer for learning long-range code dependency. The framework takes directly a source code function as an input and produces corresponding function embeddings, which can be treated as feature sets for conventional ML classifiers. Experimental results showed that the proposed framework yielded the best performance in its downstream detection tasks. Using the feature representations generated by our framework, random forest and support vector machine outperformed four baseline systems on our data sets, demonstrating that the framework incorporated with ELMo can effectively capture the vulnerable data flow patterns and facilitate the vulnerability detection task.
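The last stage of this framework (contextual token embeddings summarized by a Bi-LSTM, then handed to a conventional classifier) can be sketched as follows. Random tensors stand in for the pre-trained ELMo token embeddings, and the dimensions, labels, and random-forest choice are assumptions rather than the authors' exact configuration.

```python
# Sketch of the downstream stage described above: a Bi-LSTM summarizes per-token
# contextual embeddings into one function embedding, which then feeds a conventional
# classifier. Random tensors stand in for ELMo token embeddings; dimensions, labels,
# and the random-forest choice are assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

class BiLSTMSummarizer(nn.Module):
    def __init__(self, token_dim=1024, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(token_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_embeddings):          # (batch, seq_len, token_dim)
        outputs, _ = self.lstm(token_embeddings)  # (batch, seq_len, 2 * hidden)
        return outputs.mean(dim=1)                # mean-pool into a function embedding

summarizer = BiLSTMSummarizer()
tokens = torch.randn(16, 60, 1024)                # 16 functions x 60 tokens of "ELMo"-like vectors
with torch.no_grad():
    func_embeddings = summarizer(tokens).numpy()  # (16, 256) feature set

labels = np.array([0, 1] * 8)                     # placeholder vulnerability labels
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(func_embeddings, labels)
print(clf.predict(func_embeddings[:3]))
```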