
A novel approach for software vulnerability detection based on intelligent cognitive computing

Authors:
  • University of Economics - Technology for Industries

Abstract and Figures

Improving the effectiveness of software vulnerability detection methods is urgently needed. In this study, we propose a new source code vulnerability detection method based on intelligent and advanced computational algorithms. It combines four main processing techniques: (i) Source Embedding, (ii) Feature Learning, (iii) Resampling Data, and (iv) Classification. The Source Embedding method analyzes and standardizes the source code using the Joern tool and a data mining algorithm. The Feature Learning model aggregates and extracts node-based source code attributes using machine learning and deep learning methods. The Resampling Data technique balances the experimental dataset. Finally, the Classification model detects source code vulnerabilities. The novelty of the proposed intelligent cognitive computing method lies in the combined, synchronous use of several data extraction techniques to compute, represent, and extract the properties of the source code. With this calculation method, many significant unusual properties and features of vulnerabilities are synthesized and extracted. To demonstrate the superiority of the proposed method, we run vulnerability detection experiments on the Verum dataset; details are presented in the experimental section. The experimental results show that the proposed method achieves good results on all measures. According to our survey, these are the best results reported to date for source code vulnerability detection on the Verum dataset. The proposal is therefore meaningful not only scientifically but also practically, since using intelligent cognitive computing techniques to analyze and evaluate source code improves the efficiency of source code analysis and vulnerability detection.
The Journal of Supercomputing (2023) 79:17042–17078
https://doi.org/10.1007/s11227-023-05282-4
ChoDoXuan1· DaoHoangMai2· MaCongThanh1· BuiVanCong3
Accepted: 10 April 2023 / Published online: 5 May 2023
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023, corrected publication 2023
Keywords: Source code vulnerability · Source code vulnerability detection · Code property graph · Source embedding · Data rebalancing · Feature learning · Classification
... Recently, there have been many studies and approaches to propose methods to detect source code vulnerabilities. The approaches are mainly based on two main methods [3,4]. The first is based on known and pre-identified vulnerabilities. ...
... • For analyzing source code characteristics, there are two main methods that can be applied at this stage: [5][6][7][8][9] using natural language processing methods and graph construction methods. For graph-based approaches, there are 4 main graph types commonly used to represent source code including: Abstract syntax tree (AST) [10,11], Control Flow Graph (CFG) [12,13], Program Dependence Graph (PDG) [14,15], CPG [4,16,17]. Each of the above graph types has certain advantages and disadvantages. ...
... Besides, the study [17] proposed to use the gated graph neural networks (GGNN) for feature extraction. A most recent research by Cho et al. [4] proposed a model combining MLP and GCN for feature extraction. However, we realize that there is a huge weakness that previous studies have not approached to overcome: enriching information for CPG before analyzing and extracting features of the graph. ...
Article
Full-text available
Because the damage caused by source code vulnerabilities to agencies and organizations is increasing, early detection and warning of these vulnerabilities is essential. Recently, approaches that convert source code into a Code Property Graph (CPG) and then apply graph deep learning techniques, machine learning models, or deep learning have shown some effectiveness. However, traditional approaches still need improvement in two areas: (i) the technique for extracting source code features from the CPG; (ii) source code classification techniques. To overcome these two problems, this article proposes a new model called CSGD. The CSGD model combines three main techniques: Code sage, Graph Convolution Network (GCN), and Dropout. These three techniques are combined to perform two main functions: Feature Intelligent Extraction and Rebalancing Data. Feature Intelligent Extraction is a model combining GCN and Code sage to synthesize and extract source code features in CPG form. Code sage synthesizes and enriches information for the CPG vertices. GCN converts the graph into a single feature vector. Finally, the Rebalancing Data technique in the CSGD model generates additional data for under-represented labels based on the Dropout function. To evaluate the effectiveness of the CSGD model, this study uses two experimental datasets that are widely used for source code vulnerability detection: Verum and FFmpeg + Qume. The experimental results show that the CSGD model performs well on both datasets. In addition, the model outperforms other approaches by 1% to 6% on the Verum dataset and by 1% to 5% on the FFmpeg + Qume dataset. This is the best result for the source code vulnerability detection task on the FFmpeg + Qume and Verum datasets.
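As a rough illustration of how a GCN can turn a graph into a single feature vector, the sketch below applies one graph-convolution step (self-loops, symmetric degree normalization, a linear map, ReLU) and then mean-pools the node features. It is a generic textbook-style layer written with plain lists; the function names, toy graph, and weights are hypothetical and are not taken from the CSGD implementation, whose architecture details are not given in the abstract.

```python
import math

def gcn_layer(adj, feats, weight):
    """One GCN step on plain lists: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so every node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    d_in, d_out = len(weight), len(weight[0])
    # Aggregate neighbour features: norm @ feats.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n))
            for f in range(d_in)] for i in range(n)]
    # Linear map followed by ReLU: agg @ weight.
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(d_in)))
             for o in range(d_out)] for i in range(n)]

def mean_pool(node_feats):
    """Readout: average node features into one graph-level vector."""
    n = len(node_feats)
    return [sum(col) / n for col in zip(*node_feats)]

# Toy 3-node path graph (0 - 1 - 2) with 2-dimensional node features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, assumed for illustration
graph_vector = mean_pool(gcn_layer(adj, feats, weight))
```

Mean pooling is only one possible readout; sum or max pooling would fit the same sketch.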
... When it comes to identifying vulnerabilities in source code, methodologies are typically categorized into two primary classes [1,4]. The initial detection method relies on the utilization of the Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE). ...
... In addition, the study [16] suggested utilizing gated graph neural networks (GGNN) for the purpose of feature extraction. In a recent study conducted by Cho et al. [4], a model was proposed that combines Multilayer Perceptron (MLP) and Graph Convolutional Network (GCN) for the purpose of feature extraction. MLP was responsible for obtaining edge features of the CPG, while GCN evaluated and extracted nodes of the CPG. ...
... It has been observed that the problem of rebalancing data is commonly addressed using the Synthetic Minority Oversampling Technique (SMOTE) method [4,16,20]. Despite the numerous benefits of this method, it is nevertheless burdened with several drawbacks that must be addressed. ...
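For readers unfamiliar with SMOTE, the core idea can be sketched in a few lines: each synthetic minority sample is placed at a random point on the segment between an existing minority sample and one of its nearest minority neighbours. This is a simplified, standard-library sketch under our own assumptions (the function name, `k`, and the toy data are hypothetical); production code would normally use an established library implementation.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> nb
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic

# Toy minority class with three 2-dimensional feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=5)
```

Because every synthetic point lies between two existing samples, SMOTE cannot create samples outside the region already covered by the minority class.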
Article
Full-text available
Detecting software vulnerabilities is a very urgent problem today. One of the common approaches for detecting software vulnerabilities is source code analysis. In this paper, to improve the effectiveness of the software vulnerability detection model based on source code analysis, we propose a novel model called GRD. The GRD model performs source code analysis to find and conclude about source code vulnerabilities based on a combination of two main methods: Feature Intelligent Extraction and Rebalancing Data. In particular, Feature Intelligent Extraction, which includes two models: deep graph networks and natural language processing (NLP) techniques, is responsible for synthesizing and extracting features of source code in the code property graph (CPG) form. Rebalancing Data has the function of balancing data to improve the efficiency of the source code classification task. The main characteristics of our proposal in this paper include two main phases as follows. The first phase extracts and synthesizes source code features into the CPG form. At this phase, the article proposes using Graph Convolution Network (GCN) to extract CPG features, and RoBERTa to extract source code snippets on the node of CPG. In the second phase, based on the feature vectors of the source code obtained in phase 1, the article proposes using the Dropout technique to generate data to balance among labels. Finally, the feature vectors generated after the Dropout technique are used to predict source code vulnerabilities. The study evaluates the proposed model on two common datasets: Verum and FFMQ. The experimental results in the article have shown the superiority of the proposed model compared to other approaches on all measures.
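The abstract does not spell out how Dropout is used to generate data, so the following is only one plausible reading: new feature vectors for the under-represented label are produced by applying random dropout masks to existing vectors of that label until all labels have the same count. All names and parameters here (`balance_with_dropout`, `p`, the toy vectors) are assumptions for illustration.

```python
import random

def dropout(vec, p, rng):
    """Inverted dropout: zero each component with probability p and
    rescale the surviving components by 1 / (1 - p)."""
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in vec]

def balance_with_dropout(features, labels, p=0.1, seed=0):
    """Oversample smaller classes with dropout-perturbed copies."""
    rng = random.Random(seed)
    by_label = {}
    for vec, lab in zip(features, labels):
        by_label.setdefault(lab, []).append(vec)
    target = max(len(vecs) for vecs in by_label.values())
    out_x, out_y = list(features), list(labels)
    for lab, vecs in by_label.items():
        # Generate as many perturbed copies as the class is short of.
        for _ in range(target - len(vecs)):
            out_x.append(dropout(rng.choice(vecs), p, rng))
            out_y.append(lab)
    return out_x, out_y

# Toy data: four "normal" vectors (label 0), one "vulnerable" vector (label 1).
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2], [0.4, 0.3], [0.9, 0.8]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = balance_with_dropout(X, y)
```

Unlike SMOTE-style interpolation, this perturbs a single existing vector at a time, so it needs no neighbour search.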
... The use of Artificial Intelligence (AI) has been a constant in this field. Many works apply traditional AI algorithms such as support vector machines [8,9], K-nearest neighbor [8,9] or graph neural networks [10,11], once having extracted code features like the program dependency graph [10,11], the control flow graph [12] or the number of lines of code [8], among others. ...
... A similar distribution is identified in all cases. Considering a coverage of 85% or more (recall Section 4), 5 classes are defined for CC [1, 2, 3, 4, 5] and 6 for HD [1-30], where HD values are divided into groups of 5. Note that 0 and floats with 0 as the integer part are discarded, as they are not considered representative enough. ...
Preprint
Full-text available
Large Language Models (LLMs) are being extensively used for cybersecurity purposes. One of them is the detection of vulnerable codes. For the sake of efficiency and effectiveness, compression and fine-tuning techniques are being developed, respectively. However, they involve spending substantial computational efforts. In this vein, we analyse how Linear Probes (LPs) can be used to provide an estimation on the performance of a compressed LLM at an early phase, before fine-tuning. We also show their suitability to set the cut-off point when applying layer pruning compression. Our approach, dubbed LPASS, is applied in BERT and Gemma for the detection of 12 of MITRE's Top 25 most dangerous vulnerabilities on 480k C/C++ samples. LPs can be computed in 142.97 s and provide key findings: (1) 33.3% and 72.2% of layers can be removed, respectively, with no precision loss; (2) they provide an early estimate of the post-fine-tuning and post-compression model effectiveness, with 3% and 8.68% as the lowest and average precision errors, respectively. LPASS-based LLMs outperform the state of the art, reaching 86.9% of accuracy in multi-class vulnerability detection. Interestingly, LPASS-based compressed versions of Gemma outperform the original ones by 1.6% of F1-score at a maximum while saving 29.4% and 23.8% of training and inference time and 42.98% of model size.
... In the study [7], Beatrice Casey and colleagues demonstrated that source code representation is an important aspect that can influence how models analyze source code, bringing high accuracy in classifying vulnerabilities. There are two main approaches to detecting source code vulnerabilities [8][9][10][11][12]: ...
... In previous studies [7,10,18,28,29], the importance of analyzing and representing source code, especially for complex programs written in low-level languages such as C or C++, was demonstrated. Listing node and edge features through CPG is particularly useful for representing the original source code, as it provides a comprehensive and concise representation, including control flow and data flow, beyond the Abstract Syntax Tree (AST) and Program Dependence Graph (PDG). ...
Article
The software production sector gains advantages from automated code generating techniques, yet encounters issues related to vulnerabilities in the resulting code. This research presents a hybrid paradigm, termed GBD, for detecting vulnerabilities in software written in C and C++. It integrates Graph Convolution Network (GCN), Bidirectional Encoder Representations from Transformers (BERT), and Dropout. During Phase 2 of the GBD model, the subsequent tasks are executed concurrently: (i) obtaining node and edge features utilizing the GCN graph convolution network; (ii) deriving segment features employing the BERT model; (iii) constructing a source code profile via the Code Property Graph (CPG). Phase 3 of the model implements the Dropout strategy to mitigate overfitting. Phase 4 is the classifier that ascertains the presence of vulnerabilities in the source code. Experimental findings demonstrate the superiority of the proposed model relative to alternative methods, attaining a prediction accuracy of 61.21% for vulnerable code and 88.94% for normal files. Additionally, the classification outcomes demonstrate that with a token length of 512, the GBD model yields the most uniform results across all metrics: Accuracy (86.65%), Precision (38.59%), Recall (66.21%), and F1-score (48.76%). This corresponds with our analysis of the Verum experimental dataset, indicating that over 70% of the source code files have lengths exceeding 256 but less than 512. Furthermore, the GBD model exhibits strong performance across both individual and multiple datasets. For example, in the Verum dataset, the GBD model surpasses five alternative methodologies—REVEAL [1], Russell [2], VulDeePecker [3], SySeVR [4], and Devign [5] - by 4% in Accuracy and between 15% and 57% in Precision, Recall, and F1-score. In comparison to SySeVR [4], the GBD model exceeds it by 3% to 25% across all metrics. In comparison to Devign [5], GBD achieves improvements of 5% to 39% in Precision, Recall, and F1-score. 
Upon assessment of the FFmpeg+Qume dataset, the GBD model attains an Accuracy improvement ranging from 0.2% to 10% above all other studies. In terms of Precision, GBD surpasses alternative methods by 0.3% to 9%. In terms of Recall, GBD is marginally worse than REVEAL by 1.5%, although it surpasses all other methodologies by 10% to over 31%. In terms of F1-score, GBD is 0.3% inferior to REVEAL but surpasses other studies by 7% to 30%. The results indicate that the GBD model is effective on both individual and multiple datasets.
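The F1-score reported for the 512-token setting can be cross-checked against the reported Precision and Recall, since F1 is their harmonic mean. A minimal check (the function name is ours):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Precision 38.59% and Recall 66.21%, as reported for token length 512.
print(round(f1_score(38.59, 66.21), 2))  # 48.76
```

The computed value agrees with the reported F1-score of 48.76%.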
... Do Xuan et al. (2023) proposed a new source code vulnerability detection method based on intelligent cognitive computing. This method combined source code embedding, feature learning, data resampling, and classification models, utilizing various data extraction techniques to achieve the detection and classification of source code vulnerabilities [9]. ...
Article
Full-text available
To enhance the intelligent classification of computer vulnerabilities and improve the efficiency and accuracy of network security management, this study delves into the application of a comprehensive classification system that integrates the Memristor Neural Network (MNN) and an improved Temporal Convolutional Neural Network (TCNN) in network security management. This system not only focuses on the precise classification of vulnerability data but also emphasizes its core role in strengthening the network security management framework. Firstly, the study designs and implements a neural network model based on memristors. The MNN, by simulating the memory effect of biological neurons, effectively captures the complex nonlinear relationships within vulnerability data, thereby enhancing the data insight capabilities of the network security management system. Subsequently, structural optimization and parameter adjustments are made to the TCNN model, incorporating residual connections and attention mechanisms to improve its classification performance, making it more adaptable to the dynamically changing network security environment. Through data preprocessing, feature extraction, and model training, this study conducts experimental validation on a public vulnerability dataset. The experimental results indicate that: The MNN model demonstrates excellent performance across evaluation metrics such as Accuracy (ACC), Precision (P), Recall (R), and F1 Score, achieving an ACC of 89.5%, P of 90.2%, R of 88.7%, and F1 of 89.4%. The improved TCNN model shows even more outstanding performance on the aforementioned evaluation metrics. After structural optimization and parameter adjustments, the TCNN model’s ACC increases to 93.8%, significantly higher than the MNN model. The P value also improves, reaching 91.5%, indicating enhanced capability in reducing false positives and improving vulnerability identification accuracy. 
The integrated classification system, leveraging the strengths of both the MNN and improved TCNN models, achieves an ACC of 95.2%. This improvement not only demonstrates the system’s superior capability in accurately classifying vulnerability data but also proves the synergistic effect of MNN and TCNN models in addressing complex network security environments. The comprehensive classification system proposed in this study significantly enhances the classification performance of computer vulnerabilities, providing robust technical support for network security management. The system exhibits higher accuracy and stability in handling complex vulnerability datasets, making it highly valuable for practical applications and research.
... In contrast, for NLP methods, code embedding techniques are considered a primary choice. Some NLP models and LLMs currently of interest in this area include Word2vec [56], RoBERTa [12,55], CodeT5 [7][8][9], CodeBert [10,11], ...
Article
Full-text available
Detecting vulnerabilities in C/C++ source code has become a critical challenge in information security, especially as the growing number and severity of new vulnerabilities increasingly impact organizations. In this context, Large Language Models (LLMs) have emerged as a promising approach; however, building a model capable of effectively predicting and classifying various types of vulnerabilities from diverse datasets remains a complex problem, demanding innovative and comprehensive solutions. Our research proposes a breakthrough approach by developing the FG-CVD ensemble learning model, an advanced architecture that combines code embedding techniques and Knowledge Argument to enhance feature representation and the ability to learn complex relationships within source code. These improvements are specifically designed on the foundation of code embedding and the Transformer architecture of LLMs to boost the detection and classification of sophisticated vulnerability patterns. To evaluate the model’s effectiveness, we conducted extensive experiments on four representative datasets: Reveal, BigVul, RealVul, and FFMQ + QEmu. The experimental results demonstrated FG-CVD’s superior performance with an average accuracy of 85%, a prediction precision of 43%, a recall of 65%, and an F1-score of 47%. Notably, the model exhibited flexible adaptability to datasets with different structures and efficiently addressed data imbalance between labels. Moreover, through rigorous cross-dataset testing, the model showcased strong generalization capabilities and high stability, underscoring not only the academic value of the approach but also its practical potential, outperforming traditional approaches across a range of metrics and experimental scenarios.
... Beyond these identified concerns, we advocate for optimising the detection of source code vulnerabilities by incorporating advanced models after our proposed model (Bui & Do, 2024), such as representation learning (M. Keke Huang et al., 2024; Li Ming et al., 2024) and contrastive learning (Do Xuan et al., 2023; Zaharia et al., 2022), models currently undergoing extensive research. ...
Article
Full-text available
To enhance the effectiveness of vulnerability detection in software developed using the C and C++ programming languages, our study introduces a novel correlation calculation method for analyzing and evaluating Code Property Graphs (CPG). The intelligent computation method proposed in this study comprises three key stages. In the first stage, we present a method for extracting features from the CPG of the source code. To accomplish this, we integrate three distinct data exploration methods: a Graph Convolutional Network (GCN) to extract node features from the CPG, a Convolutional Neural Network (CNN) to extract edge features from the CPG, and the Doc2vec natural language processing algorithm to extract features of the source code held in CPG nodes. The second stage synthesizes the CPG source code features: building on the features acquired in the first stage, our paper introduces a synthesis and construction method to generate feature vectors for the source code. The third and final stage detects source code vulnerabilities. The experimental results demonstrate that our proposed model achieves higher efficiency than other studies, with an improvement ranging from 3% to 4%.
... In some cases, metrics like the code's entropy, number of characters, and number of conditional sentences, among others, are applied [20], [28]. Other works also use cyclomatic complexity [9], [20], use dependency and control flow graphs [10], [22], [24], [25], [27], [30]-[34], work at token level [23], use vectors to represent assorted information [8], [19], [21], [26], or directly apply the code [29]. Once features are extracted, machine learning algorithms are used in all cases, neural networks being the most common, especially MLP, though some works also use DNN, like [29]. ...
Preprint
Full-text available
The complexity of implementations and the interconnection of assorted systems and devices facilitate the emergence of vulnerabilities. Detection systems are developed to fight against this security issue, and the use of Artificial Intelligence (AI) is a common practice. However, the use of AI is not without its problems, especially those affecting the training phase. This paper tackles this issue following a two-fold approach. First, an AI-based vulnerability detection system based on code and token metrics, dubbed VulCoT, is developed. It reaches state-of-the-art performance while being suitable for C#, C/C++ and PHP. Second, the impact of poisoning attacks on VulCoT is analysed. Results show that VulCoT is especially affected beyond 20% of false data. Remarkably, the detection of some of the most frequent Common Weakness Enumeration entries is altered even with lower poison rates. Overall, KNN and SVM are more appropriate for system protection in C# and C/C++, and MLP in PHP. Indeed, PHP is the language least affected by attacks, while C# and C/C++ present comparable results.
Article
Full-text available
Design flaws, code errors, or inadequate countermeasures may occur in software development. Some of them lead to vulnerabilities in the code, opening the door to attacks. Assorted techniques are developed to detect vulnerable code samples, making artificial intelligence techniques, such as Machine Learning (ML), a common practice. Nonetheless, the security of ML is a major concern. This includes the case of ML-based detection whose training process is affected by data poisoning. More generally, vulnerability detection can be evaded unless poisoning attacks are properly handled. This paper tackles this problem. A novel vulnerability detection system based on ML-based image processing, using a Convolutional Neural Network (CNN), is proposed. The system, hereinafter called IVul, is evaluated under the presence of backdoor attacks, a precise type of poisoning in which a pattern is introduced in the training data to alter the expected behavior of the learned models. IVul is evaluated with more than three thousand code samples associated with two representative programming languages (C# and PHP). IVul outperforms other comparable state-of-the-art vulnerability detectors in the literature, reaching 82% to 99% detection accuracy. Besides, results show that the type of attack may affect a particular language more than another, though, in general, PHP is more resilient to the proposed attacks than C#.
Article
Full-text available
To enhance the effectiveness of the Advanced Persistent Threat (APT) detection process, this research proposes a new approach to build and analyze the behavior profiles of APT attacks in network traffic. To achieve this goal, this study carries out two main objectives, including (i) building the behavior profile of APT IP in network traffic using a new intelligent computation method; (ii) analyzing and evaluating the behavior profile of APT IP based on a deep graph network. Specifically, to build the behavior profile of APT IP, this article describes using a combination of two different data mining methods: Bidirectional Long Short-Term Memory (Bi) and Attention (A). Based on the obtained behavior profile, the Dynamic Graph Convolutional Neural Network (DGCNN) is proposed to extract the characteristics of APT IP and classify them. With the flexible combination of different components in the model, the important information and behavior of APT attacks are demonstrated, not only enhancing the accuracy of detecting attack campaigns but also reducing false predictions. The experimental results in the paper show that the method proposed in this study has brought better results than other approaches on all measurements. In particular, the accuracy of APT attack prediction results (Precision) reached from 84 to 91%, higher than other studies of over 7%. These experimental results have proven that the proposed BiADG model for detecting APT attacks in this study is proper and reasonable. In addition, those experimental results have not only proven the effectiveness and superiority of the proposed method in detecting APT attacks but have also opened up a new approach for other cyber-attack detections such as distributed denial of service, botnets, malware, phishing, etc.
Article
Full-text available
Following advances in machine learning and deep learning processing, cyber security experts are committed to creating deep intelligent approaches for automatically detecting software vulnerabilities. Nowadays, many practices are for C and C++ programs, and methods rarely target PHP application. Moreover, many of these methods use LSTM (Long Short-Term Memory) but not GNN (Graph Neural Networks) to learn the token dependencies within the source code through different transformations. That may lose a lot of semantic information in terms of code representation. This article presents a novel Graph Neural Network vulnerability detection approach, VulEye, for PHP applications. VulEye can assist security researchers in finding vulnerabilities in PHP projects quickly. VulEye first constructs the PDG (Program Dependence Graph) of the PHP source code, slices PDG with sensitive functions contained in the source code into sub-graphs called SDG (Sub-Dependence Graph), and then makes SDG the model input to train with a Graph Neural Network model which contains three stack units with a GCN layer, Top-k pooling layer, and attention layer, and finally uses MLP (Multi-Layer Perceptron) and softmax as a classifier to predict if the SDG is vulnerable. We evaluated VulEye on the PHP vulnerability test suite in Software Assurance Reference Dataset. The experiment reports show that the best macro-average F1 score of the VulEye reached 99% in the binary classification task and 95% in the multi-classes classification task. VulEye achieved the best result compared with the existing open-source vulnerability detection implements and other state-of-art deep learning models. Moreover, VulEye can also locate the precise area of the flaw, since our SDG contains code slices closely related to vulnerabilities with a key triggering sensitive/sink function.
Article
Full-text available
Software developers represent the bastion of application security against the overwhelming cyber-attacks which target all organizations and affect their resilience. As security weaknesses which may be introduced during the process of code writing are complex and matching different and variate skills, most applications are launched intrinsically vulnerable. We have advanced our research for a security scanner able to use automated learning techniques based on machine learning algorithms to recognize patterns of security weaknesses in source code. To make the scanner independent on the programming language, the source code is converted to a vectorial representation using natural language processing methods, which are able to retain semantical traits of the original code and at the same time to reduce the dependency on the lexical structure of the program. The security flaws detection performance is in the ranges accepted by software security professionals (recall > 0.94) even when vulnerable samples are very low represented in the dataset (e.g., less than 4% vulnerable code for a specific CWE in the dataset). No significant change or adaptation is needed to change the source code language under scrutiny. We apply this approach on detecting Common Weaknesses Enumeration (CWE) vulnerabilities in datasets provided by NIST (Test suites–NIST Software Assurance Reference Dataset).
Article
Full-text available
Automatically detecting buffer overflow vulnerabilities is an important research topic in software security. Recent studies have shown that vulnerability detection performance utilizing deep learning-based techniques can be significantly enhanced. However, due to information loss during code representation, existing approaches cannot learn the features associated with vulnerabilities, leading to a high false negative rate (FNR) and low precision. To resolve the existing problems, we propose a method for buffer overflow vulnerability detection based on graph feature extraction (BovdGFE) in C/C++ programs. BovdGFE constructs the buffer overflow function samples. Then, we present a new representation structure, code representation sequence (CoRS), which incorporates the control flow, data dependencies, and syntax structure of the vulnerable code for reducing information loss during code representation. After the function samples are transformed into CoRS, a deep learning model is used to learn vulnerable features and perform vulnerability classification. The results of the experiments show that BovdGFE improves the precision and FNR by 6.3% and 3.9% respectively compared with state-of-the-art methods, which can significantly improve the capability of vulnerability detection.
Software vulnerabilities pose a serious problem, because cyber-attackers often find ways to attack a system by exploiting them. Software vulnerabilities can be detected using two main methods: i) signature-based detection, i.e., methods that compare code against a list of known security vulnerabilities; ii) behavior analysis-based detection using classification algorithms, i.e., methods based on analyzing the software code. To detect software security vulnerabilities more accurately, this study proposes a new approach that combines a technique for analyzing and standardizing software code with the random forest (RF) classification algorithm. The novelty and advantage of the proposed method is that, instead of trying to define the behaviors of functions in order to determine which are abnormal, it uses the Word2vec natural language processing model to normalize and extract features of functions. Finally, to detect security vulnerabilities in the functions, this study proposes to use a popular and effective supervised machine learning algorithm.
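The abstract does not describe how the code is standardized before Word2vec is applied. One common preprocessing step, shown here purely as an assumption, is to tokenize each function and replace user-defined identifiers with placeholder names so that the embedding model sees a stable vocabulary. The keyword list, the kept API names, and the `ID<n>` placeholder scheme are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny vocabulary of names to keep unchanged:
# a few C keywords plus library calls that matter for vulnerability patterns.
KEEP = {
    "int", "char", "void", "if", "else", "for", "while", "return", "sizeof",
    "strcpy", "memcpy", "malloc", "free",
}

IDENTIFIER_RE = re.compile(r"[A-Za-z_]\w*")
# Identifiers first, then integer literals, then any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize_tokens(code):
    """Tokenize C-like code and rename user identifiers to ID1, ID2, ..."""
    mapping = {}
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if IDENTIFIER_RE.fullmatch(tok) and tok not in KEEP:
            # The same identifier always maps to the same placeholder.
            if tok not in mapping:
                mapping[tok] = f"ID{len(mapping) + 1}"
            tokens.append(mapping[tok])
        else:
            tokens.append(tok)
    return tokens


tokens = normalize_tokens("char buf[8]; strcpy(buf, src);")
print(tokens)
```

Token sequences of this kind could then be passed to a Word2vec implementation, with the resulting function vectors used as input to a random forest classifier.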
Vulnerability detection on source code can prevent the risk of cyber-attacks as early as possible. However, the lack of fine-grained code analysis leaves existing solutions with low performance; in addition, the explosive growth of open-source projects has dramatically increased the complexity and diversity of source code. This paper presents HGVul, a code vulnerability detection method based on a heterogeneous intermediate representation of source code. The key to the proposed method is the fine-grained handling of a heterogeneous source-level intermediate representation (SIR) without expert knowledge. It first extracts a graph SIR of the code that carries multiple kinds of syntactic and semantic information. Then, HGVul splits the SIR into different subgraphs according to the semantic relations, so that the information conveyed by each type of edge can be captured separately. Next, a graph neural network with attention operations is deployed on each subgraph to learn a representation that captures the subtle effects of neighboring nodes. Finally, the learned code feature representations are used to perform vulnerability detection. Experiments are conducted on multiple datasets. The F1 of HGVul reaches 96.1% on the sample-balanced Big-Vul-VP dataset and 88.3% on the unbalanced Big-Vul dataset. Further experiments on datasets from actual open-source projects confirm the better performance of HGVul.
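The subgraph-splitting step can be sketched in a few lines. This is a minimal illustration of partitioning a heterogeneous graph by edge type, not HGVul's implementation: the node names and the edge-type labels (`ast_child`, `control_flow`, `control_dep`, `data_dep`) are hypothetical.

```python
from collections import defaultdict


def split_by_edge_type(edges):
    """Group (src, dst, edge_type) triples into one edge list per edge type."""
    subgraphs = defaultdict(list)
    for src, dst, edge_type in edges:
        subgraphs[edge_type].append((src, dst))
    return dict(subgraphs)


def neighbors(subgraph_edges, node):
    """Successors of `node` within a single-relation subgraph, sorted by name."""
    return sorted({dst for src, dst in subgraph_edges if src == node})


# Hypothetical heterogeneous graph for a small function with an unsafe copy.
edges = [
    ("func", "decl_buf", "ast_child"),
    ("func", "if_len", "ast_child"),
    ("decl_buf", "if_len", "control_flow"),
    ("if_len", "call_strcpy", "control_dep"),
    ("decl_buf", "call_strcpy", "data_dep"),
]

subgraphs = split_by_edge_type(edges)
for edge_type, sub_edges in subgraphs.items():
    print(edge_type, sub_edges)
```

Each resulting edge list holds a single relation, so a separate model (in the paper, a graph neural network with attention) can be applied per relation before the learned representations are combined.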
In this paper, we address the problem of automatic repair of software vulnerabilities with deep learning. The major problem with data-driven vulnerability repair is that the few existing datasets of known confirmed vulnerabilities consist of only a few thousand examples. However, training a deep learning model often requires hundreds of thousands of examples. In this work, we leverage the intuition that the bug fixing task and the vulnerability fixing task are related and that the knowledge learned from bug fixes can be transferred to fixing vulnerabilities. In the machine learning community, this technique is called transfer learning. In this paper, we propose an approach for repairing security vulnerabilities named VRepair which is based on transfer learning. VRepair is first trained on a large bug fix corpus and is then tuned on a vulnerability fix dataset, which is an order of magnitude smaller. In our experiments, we show that a model trained only on a bug fix corpus can already fix some vulnerabilities. Then, we demonstrate that transfer learning improves the ability to repair vulnerable C functions. We also show that the transfer learning model performs better than a model trained with a denoising task and fine-tuned on the vulnerability fixing task. To sum up, this paper shows that transfer learning works well for repairing security vulnerabilities in C compared to learning on a small dataset.
The use of cloud services, web-based software systems, the Internet of Things (IoT), Machine Learning (ML), Artificial Intelligence (AI), and other wireless sensor devices in the health sector has resulted in significant advancements and benefits. Early disease detection, increased accessibility, and high diagnostic reach have all been made possible by digital healthcare. Despite this remarkable achievement, healthcare data protection has become a serious issue for all parties involved. According to data breach statistics, the healthcare data industry is one of the major targets for cyber criminals. Indeed, healthcare data breaches have increased at an alarming rate in recent years. Practitioners are developing a variety of tools, strategies, and approaches to solve healthcare data security concerns. In this paper, the authors highlight the crucial measurements and parameters, in relation to large organizational settings, for securing a vast amount of data. Security measures are those that prevent developers and organizations from achieving their objectives. The goal of this work is to identify and prioritize the security approaches that are used to locate and solve problems, using different versions of two approaches that have been used to analyze big data security in the past. The authors use the Fuzzy Analytic Hierarchy Process (Fuzzy AHP) approach to examine the priorities and overall data security. In addition, the most important features in terms of weight have been quantitatively analyzed. Experts will find the findings and conclusions useful in improving big data security.
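The abstract does not say which Fuzzy AHP variant is used. As a hedged illustration of how such weights can be derived, the sketch below assumes triangular fuzzy numbers, the geometric-mean aggregation commonly attributed to Buckley, and centroid defuzzification. The three criteria and their pairwise judgments are hypothetical.

```python
import math


def fuzzy_ahp_weights(matrix):
    """Crisp priority weights from a pairwise matrix of triangular fuzzy numbers.

    Each cell is (l, m, u). Assumed method: fuzzy geometric mean per row,
    fuzzy normalization, centroid defuzzification, then renormalization.
    """
    n = len(matrix)
    # Fuzzy geometric mean of each row, computed component-wise.
    geo = [
        tuple(math.prod(cell[k] for cell in row) ** (1.0 / n) for k in range(3))
        for row in matrix
    ]
    total_l = sum(g[0] for g in geo)
    total_m = sum(g[1] for g in geo)
    total_u = sum(g[2] for g in geo)
    # Dividing by a fuzzy total swaps its lower and upper bounds.
    fuzzy = [(g[0] / total_u, g[1] / total_m, g[2] / total_l) for g in geo]
    # Centroid of a triangular fuzzy number is the mean of its three points.
    crisp = [sum(w) / 3.0 for w in fuzzy]
    total = sum(crisp)
    return [c / total for c in crisp]


ONE = (1.0, 1.0, 1.0)

# Hypothetical judgments for three criteria A, B, C:
# A is moderately preferred to B, strongly preferred to C; B slightly over C.
matrix = [
    [ONE, (2.0, 3.0, 4.0), (4.0, 5.0, 6.0)],
    [(1 / 4, 1 / 3, 1 / 2), ONE, (1.0, 2.0, 3.0)],
    [(1 / 6, 1 / 5, 1 / 4), (1 / 3, 1 / 2, 1.0), ONE],
]

weights = fuzzy_ahp_weights(matrix)
print(weights)
```

The resulting weights sum to one and rank the criteria in the order implied by the pairwise judgments; a full analysis would also check the consistency of the judgments, which this sketch omits.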