LineVD: Statement-level Vulnerability Detection using Graph
Neural Networks
David Hin
CREST - The Centre for Research on Engineering
Software Technologies, University of Adelaide
Cyber Security Cooperative Research Centre
Adelaide, Australia, 5005
david.hin@adelaide.edu.au
Andrey Kan
AWS AI Labs*
Adelaide, SA, Australia, 5005
avkan@amazon.com
Huaming Chen
CREST - The Centre for Research on Engineering
Software Technologies, University of Adelaide
Cyber Security Cooperative Research Centre
Adelaide, Australia, 5005
huaming.chen@adelaide.edu.au
M. Ali Babar
CREST - The Centre for Research on Engineering
Software Technologies, University of Adelaide
Cyber Security Cooperative Research Centre
Adelaide, Australia, 5005
ali.babar@adelaide.edu.au
ABSTRACT
Current machine-learning based software vulnerability detection methods are primarily conducted at the function-level. However, a key limitation of these methods is that they do not indicate the specific lines of code contributing to vulnerabilities. This limits the ability of developers to efficiently inspect and interpret the predictions from a learnt model, which is crucial for integrating machine-learning based tools into the software development workflow. Graph-based models have shown promising performance in function-level vulnerability detection, but their capability for statement-level vulnerability detection has not been extensively explored. While interpreting function-level predictions through explainable AI is one promising direction, we herein consider the statement-level software vulnerability detection task from a fully supervised learning perspective. We propose a novel deep learning framework, LineVD, which formulates statement-level vulnerability detection as a node classification task. LineVD leverages control and data dependencies between statements using graph neural networks, and a transformer-based model to encode the raw source code tokens. In particular, by addressing the conflicting outputs between function-level and statement-level information, LineVD significantly improves the prediction performance without requiring the vulnerability status of the enclosing function code. We have conducted extensive experiments against a large-scale collection of real-world C/C++ vulnerabilities obtained from multiple real-world projects, and demonstrate an increase of 105% in F1-score over the current state-of-the-art.
CCS CONCEPTS
• Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network reliability.
*This work was done prior to joining Amazon.
MSR’ 2022, May 23–24, 2022, Pittsburgh, PA, USA
2022. ACM ISBN 978-1-4503-XXXX-X/18/06. . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
KEYWORDS
Software Vulnerability Detection, Program Representation, Deep
Learning
ACM Reference Format:
David Hin, Andrey Kan, Huaming Chen, and M. Ali Babar. 2022. LineVD:
Statement-level Vulnerability Detection using Graph Neural Networks. In
MSR ’22: Proceedings of the 19th International Conference on Mining Software
Repositories, May 23-24, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA,
12 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Identifying potential software vulnerabilities is a crucial step in defending against cyber attacks [25]. However, it can be difficult and time-consuming for developers to determine which parts of the software will contribute to vulnerabilities in large software systems. Consequently, interest in more accurate and efficient automated software vulnerability detection (SVD) solutions has been increasing [21, 38]. Automated SVD can be broadly classified into two categories: (1) traditional methods, which can include both static and dynamic analysis, and (2) data-driven solutions, which leverage data mining and machine learning to predict the presence of software vulnerabilities [22]. Traditional static tools are often rule-based, leveraging knowledge from security domain experts. This can result in inconsistent performance due to a high number of false positive alerts, or completely missing more complex vulnerabilities [5]. As it is challenging to define vulnerable patterns in an accurate and comprehensive way, data-driven solutions have become a promising alternative.

One primary reason for the increasing popularity of data-driven solutions can be attributed to the ever growing quantity of open-source vulnerability data. The accumulating security vulnerabilities in open-source software are regularly updated and reported to sources like the National Vulnerability Database [1]. There have been many efforts that utilize the publicly available information to reduce the need for manually defining patterns and to learn from the data [27, 49]. As a result, data-driven solutions have generally shown outstanding performance in comparison to traditional static application security testing tools [13]. One potential reason could be
due to the learning ability of the models to incorporate latent information from patterns where vulnerabilities would appear in source code, in addition to the underlying causes of the vulnerability.

Despite the success of current data-driven approaches in the identification of software vulnerabilities, they are often limited to a coarse level of granularity. The model outputs often present developers with limited information for prediction outcome validation and interpretation, leading to extra effort when evaluating and mitigating the software vulnerabilities. Consequently, many proposed SVD solutions have transitioned to either function-level [9, 10, 32, 58] or slice-level [12, 33–35] predictions, which are a major improvement over file-level predictions [16, 24, 50]. Some other works further leverage supplementary information, such as commit-level code changes with accompanying log messages, to build the prediction model [23, 48]. While the goal is to help practitioners prioritize the defective code, vulnerabilities can often be localized to a few key lines [17]. Hence, reviewing large functions could still be a considerable burden. From a preliminary analysis on a large C/C++ vulnerability detection dataset [19], we find that vulnerable functions in the dataset are on average 95 lines of code.
Fig. 1 shows an example of how statement-level SVD can be beneficial. To save space, we choose a vulnerability from a smaller function, which contains an integer overflow vulnerability from the Linux kernel (CVE-2018-12896) that can ultimately be exploited to cause denial-of-service. With explicit statement-level predictions, it can be easier to interpret why the function has been predicted as vulnerable (or alternatively, verify that the prediction was erroneous). Highlighting these lines helps a developer narrow down the lines needing further inspection based on model confidence. In this case, a statement-level SVD model flags the addition assignment operation on line 22, which contains the vulnerable integer casting operation, as most suspicious, allowing a developer to more efficiently validate and mitigate the vulnerability.
Rening SVD granularity towards the statement-level is still in
its infancy; latest work by Li et. al [
32
] has indicated the possibility
of leveraging the interpretable ML model, namely GNNExplainer,
to derive the vulnerable statements as the interpretation of learnt
model. However, in our work, we nd that the performance is not
sucient and eective when classifying and ranking the latent vul-
nerable statements. Alternatively, we aim to explore the feasibility
and eectiveness of directly training and predicting on vulnerabili-
ties at the statement level for SVD granularity renement, which
would allow data-driven solutions to directly utilize any available
statement-level information in a fully supervised manner.
In this paper, we propose a novel framework for statement-level SVD, namely LineVD. We focus on revisiting core components of data-driven SVD from the perspective of statement-level classification, which can serve as a way of allowing developers to better interpret vulnerability predictions. We have explored statement-level SVD in a thorough way to find the best data-driven architecture to achieve optimal performance. Covering various feature extraction methods and model architectures to tackle the latent challenges in SVD, we show that LineVD provides sufficient capacity to incorporate contextual information for each statement in an efficient manner. The model has been extensively evaluated in realistic scenarios; namely, heavily imbalanced
 1  void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
 2  {
 3      const struct k_clock *kc = timr->kclock;
 4      ktime_t now, remaining, iv;
 5      struct timespec64 ts64;
 6      bool sig_none;
 7
 8      sig_none = timr->it_sigev_notify == SIGEV_NONE;
 9      iv = timr->it_interval;
10
11      if (iv) {
12          cur_setting->it_interval = ktime_to_timespec64(iv);
13      } else if (!timr->it_active) {
14          if (!sig_none)
15              return;
16      }
17
18      kc->clock_get(timr->it_clock, &ts64);
19      now = timespec64_to_ktime(ts64);
20
21      if (iv && (timr->it_requeue_pending & REQUEUE_PENDING || sig_none))
22          timr->it_overrun += (int)kc->timer_forward(timr, now);
23      // Added: timr->it_overrun += kc->timer_forward(timr, now);
24
25      remaining = kc->timer_remaining(timr, now);
26
27      if (remaining <= 0) {
28          if (!sig_none)
29              cur_setting->it_value.tv_nsec = 1;
30      } else {
31          cur_setting->it_value = ktime_to_timespec64(remaining);
32      }
33  }
Figure 1: Vulnerable lines predicted by LineVD for a code snippet from CVE-2018-12896, indicated by the presence of a red background. A darker red background color indicates higher prediction confidence by LineVD. Ground-truth vulnerable lines have red line numbers, while dark red line numbers indicate a data or control dependency on an added line (indicated by line 23).
labels and cross-project testing. In summary, this paper makes the
following contributions:
• We propose LineVD, a novel and efficient statement-level SVD approach. While our investigation shows declined performance for the current state-of-the-art interpretation-based SVD model, LineVD achieves a significant improvement with an increase of 105% in F1-score.
• We investigate the performance effects of each stage of building a GNN-based statement-level SVD model, including the node embedding methods and GNN model selection. Based on these findings, LineVD is designed to largely improve performance by learning from function-level and statement-level information simultaneously.
• LineVD is the first approach to jointly learn from function-level and statement-level information via graph neural networks to enhance SVD performance, and in our empirical evaluation it significantly outperforms models using only one type of information.
• We publish our dataset, source code, and models with supporting scripts [6], which provides a ready-to-use implementation for future work on benchmarking and comparison.
2 BACKGROUND
In this section, we introduce relevant key concepts relating to source code embedding and graph neural networks (GNNs), which have played key roles in providing effective and explainable capabilities to SVD prediction models in recent literature.
2.1 Source code embedding
To tackle a language-related task with machine learning models, it is necessary to transform the related textual corpora into vector representations. The corpora are usually specific to a given domain; e.g., one recently popular topic is source code-related tasks. The process is commonly referred to as building a language model and extracting language embeddings, which can be at the word, sentence, or document level [42]. Source code modelling is the application of language modelling to source code, which can be considered a special type of structured natural language [30], and has demonstrated encouraging results for a wide range of downstream tasks. These include code completion [40] and code clone detection [55], among others [54].

For software vulnerability detection, prior approaches have utilized document embedding methods like Doc2Vec [29], or word embedding methods such as GloVe [46] and Word2Vec [42] to generate pre-trained vectors for singular tokens, which are then aggregated in some way. For example, Cao et al. [9] utilized averaged Word2Vec embeddings to transform raw code statements into vector representations.
Recently, transformer-based models have been applied to source code modelling, allowing large source code understanding models to be pre-trained. They are expected to obtain higher-quality embeddings for source code, which can be leveraged for downstream tasks requiring less labeled data and training resources. One major advancement is CodeBERT [20], based on the RoBERTa [41] architecture, which was trained on six different programming languages: Python, Java, JavaScript, PHP, Ruby, and Go. Specifically, CodeBERT was trained on 2.1 million natural language and programming language bimodal samples, and 6.4 million programming language unimodal samples. It should be noted that CodeBERT and Doc2Vec can produce contextual embeddings, in contrast to GloVe and Word2Vec embeddings, which are static and hence have the same embedding for each word regardless of its context.
Tokenization approaches for source code can vary significantly. For word embedding-based approaches, code tokens can be split by whitespace alongside punctuation, such as parentheses and semicolons [34]. Alternatively, punctuation can be completely removed [32]. In addition, code identifiers are sometimes further tokenized according to common naming conventions, such as underscores or camel case. This helps reduce the number of out-of-vocabulary tokens, which can otherwise be significantly higher than in regular natural language due to the way developers name identifiers. An alternative approach to manually defined tokenization rules is unsupervised subword tokenization. In CodeBERT, byte-pair encoding (BPE) [52] is used to tokenize the source code tokens. In particular, long variable names are split into subwords based on the BPE algorithm. For example, 'add_one' may be tokenized to 'add', '_', and 'one'. This is arguably a more consistent way of reducing out-of-vocabulary issues in source code identifiers, as this approach is not reliant on pre-defined rules.
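As a rough illustration of this subword behaviour, the pre-trained CodeBERT BPE tokenizer released through the HuggingFace transformers library can be applied to a single code statement. This is a minimal sketch assuming the publicly available microsoft/codebert-base checkpoint; the exact subword splits depend on the released vocabulary.

    from transformers import AutoTokenizer

    # Load the released CodeBERT BPE tokenizer (assumes the
    # "microsoft/codebert-base" checkpoint is downloadable or cached locally).
    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

    statement = "timr->it_overrun += (int)kc->timer_forward(timr, now);"
    print(tokenizer.tokenize(statement))        # identifiers split into BPE subword pieces
    print(tokenizer(statement)["input_ids"])    # token ids, including <s> and </s> specials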
2.2 Graph Neural Network
Recently, graph neural networks have demonstrated superior performance at mining graph data structures for social networks [7], spatial-temporal traffic networks [11], and so on.
Figure 2: Program Dependence Graph for Fig. 1. Black lines represent control dependency edges; dashed red lines represent data dependency edges.
For the downstream source code modelling tasks, transformer-based models have demonstrated promising results [3]. However, the complex syntactic and semantic characteristics inherently present in programming languages are not explicitly leveraged.

Rather than solely representing source code as a sequence of tokens, the intrinsic structural information of source code can also be effectively modelled by representing a source code snippet (or program) as a graph $G = (V, E)$, allowing a model to more easily learn latent relationships within the source code, where $V$ is the set of nodes of the program graph and $E$ is the edge matrix. Depending on the type of program-related structure, the edge matrix $E$ denotes the corresponding syntactic and semantic information of the source code. Using this graph-based representation in combination with graph neural networks has resulted in improved performance for function-level vulnerability detection, in which a single graph represents a single source code function (see Section 7 for further details). The program dependence graph (PDG) of a program (see Fig. 2) is the overall focus in this work, as software vulnerabilities often involve data and control flows [32, 47, 57].
3 THE LINEVD FRAMEWORK
In this section, we present the LineVD framework by first summarising the problem definition. The overall architecture is subsequently discussed with details of its three main components. In particular, we demonstrate how we have incorporated the large-scale pretrained model and developed the novel graph construction method.
3.1 Problem Definition
We formalize the identification of vulnerable statements in a function as a binary node classification problem, i.e., learning to predict which source code statements in a function are vulnerable. Let us define a sample of data as $((V_i, Y_i) \mid V_i \in \mathcal{V}, Y_i \in \{0, 1\}, i = \{1, 2, \ldots, n\})$,
Figure 3: LineVD Overall Architecture (Input function → Feature Extraction with the CodeBERT tokenizer and encoder → Graph Construction with control and data dependency edges → Classifier Learning producing vulnerable statements).
where $\mathcal{V}$ is the set of all nodes representing a statement of code in the dataset and $Y_i$ is the statement-level label, where 1 is vulnerable and 0 is non-vulnerable. We collectively denote the set of labels $Y_i$ as $\mathcal{Y}$, and $n$ is the number of nodes in the dataset.

For each $V_i$, we utilize the $n$-hop neighborhood graph $\mathcal{G}_i = (N\mathcal{V}_i, N\mathcal{E}_i, X_i)$ to encode $V_i$ with the contextual information from neighboring nodes. $N\mathcal{V}_i$ indicates the neighborhood nodes for $V_i$, $N\mathcal{E}_i$ represents the corresponding edges as an adjacency matrix, $X_i \in \mathbb{R}^{m \times d}$ is the node feature matrix for $V_i$, and $m$ is the number of nodes in $N\mathcal{V}_i$. The goal of LineVD is to learn a mapping $f: \mathcal{V} \to \mathcal{Y}$ to determine the label of a given node; i.e., whether a statement is vulnerable or not. The prediction function $f$ can be learned by minimizing the following loss function:

$\min \sum_{i=1}^{n} \mathcal{L}\bigl(f(\mathcal{G}_i, Y_i \mid V_i)\bigr)$  (1)

where $\mathcal{L}$ is the cross-entropy loss function.
3.2 Approach Overview
In this section, we present a GNN-based approach to identify vulnerable statements. One fundamental element is that the identified data and control dependencies between statements can sufficiently serve as contextual information for the statement-level SVD task. Furthermore, we propose a novel architecture to better leverage the semantics conveyed within and between statements; the overall framework is illustrated in Fig. 3. LineVD can be divided into three main components, described in the following sections.
3.2.1 Feature Extraction. Given a snippet of source code, achieving an informative and comprehensive code representation is critical for subsequent model construction. LineVD first extracts code features with a transformer-based method, which has proven effective for source code related tasks through self-supervised learning objectives.

LineVD takes a single function of source code as the raw input. After processing and splitting the function into individual statements $V_i$, each sample is first tokenized via CodeBERT's pretrained BPE tokenizer. Following the collection of $V = \{V_1, V_2, \ldots, V_n\}$, the entire function and the individual statements comprising the function are passed into CodeBERT. Thus, the function-level and statement-level code representations can be acquired.

Specifically, LineVD embeds the function-level and statement-level code separately, rather than aggregating the statement-level embeddings to obtain the function-level embedding. CodeBERT is a bimodal model, meaning it was trained on both the natural language description of a function in addition to the function code itself. As input, it uses a special separator token to distinguish the natural language description from the function code. While natural language descriptions for the functions are not accessible, we follow the general practice in the literature of prepending each input with an additional separator token, leaving the description blank. For the output of CodeBERT, we utilize the embedding of the classification token, which is suited for code summarisation tasks. This allows us to better leverage the powerful pretrained source code summarisation capability of the CodeBERT model.

Overall, the feature extraction component of LineVD using CodeBERT produces $n + 1$ feature embeddings: one embedding for the overall function, and $n$ embeddings for the statements, which we denote as $X_v = \{x_{v1}, x_{v2}, \ldots, x_{vn}\}$.
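A minimal sketch of this extraction step is shown below, using the HuggingFace transformers library and PyTorch. The checkpoint name, the blank-description input format, and taking the hidden state at the classification-token position are illustrative assumptions rather than the exact released implementation.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    codebert = AutoModel.from_pretrained("microsoft/codebert-base")
    codebert.eval()

    def cls_embeddings(texts):
        # Pair each code sequence with an empty natural language description,
        # then keep the hidden state at the classification-token position.
        batch = tokenizer(["" for _ in texts], texts, padding=True,
                          truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = codebert(**batch)
        return out.last_hidden_state[:, 0, :]          # shape: (len(texts), 768)

    func = "int add_one(int x) { return x + 1; }"
    statements = ["int add_one(int x) {", "return x + 1;", "}"]
    func_emb = cls_embeddings([func])                   # 1 function-level embedding
    stmt_embs = cls_embeddings(statements)               # n statement-level embeddings

Together these give the n + 1 embeddings described above.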
3.2.2 Graph Construction. In LineVD, we focus on the data and control dependency information, for which we introduce the graph attention network (GAT) model [53]. As discussed in Sec. 2.2, graph neural networks (GNNs) learn from graph-structured data based on an information diffusion mechanism rather than squashing the information into a flat vector; the node states are updated according to the graph connectivity to preserve the important information, i.e., the topological dependency information [51].

As shown in Fig. 3, a graph attention network is used to learn topological dependency information from the graph. A GAT layer first takes the $n$ statement embeddings output by CodeBERT along with the edges between each node. The graph structure of the function, including the node and edge information, is extracted and provided to the GAT. Self-loops are added to include a given node within its own neighborhood. We initialize the GAT layer state vector $\{h^{(l)}_1, h^{(l)}_2, h^{(l)}_3, \ldots, h^{(l)}_n\}$ with $X_v$, where $l$ indicates the current layer. LineVD propagates information by embedding the data- and control-dependent statements (i.e., the program dependence graph)
between neighboring statements in an incremental manner. Therefore, two graph attention networks are implemented in the LineVD architecture. Overall, GAT is defined by its use of attention over the features of neighbors in its aggregation function, as in Eq. (2)-(5):

$z_i^{(l)} = W^{(l)} h_i^{(l)}$  (2)

$e_{i,j}^{(l)} = \mathrm{LeakyReLU}\bigl(\vec{a}^{(l)T} (z_i^{(l)} \,\|\, z_j^{(l)})\bigr)$  (3)

$\alpha_{i,j}^{(l)} = \dfrac{\exp(e_{i,j}^{(l)})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{i,k}^{(l)})}$  (4)

$h_i^{(l+1)} = \sigma\Bigl(\sum_{j \in \mathcal{N}(i)} \alpha_{i,j}^{(l)} z_j^{(l)}\Bigr)$  (5)

where $l$ is the current layer, $h_i^{(l)}$ is the node embedding vector at the current layer, $W^{(l)}$ is a learnable weight matrix, and $\vec{a}$ is a learnable weight vector.
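To make Eq. (2)-(5) concrete, the following is a minimal single-head PyTorch sketch of one such attention layer over a dense adjacency matrix with self-loops. It is an illustrative re-implementation of the standard GAT update, not the LineVD source code; the choice of ELU for the nonlinearity sigma is an assumption.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GATLayer(nn.Module):
        """Single-head graph attention layer following Eq. (2)-(5)."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.W = nn.Linear(in_dim, out_dim, bias=False)          # W^(l) in Eq. (2)
            self.a = nn.Parameter(0.01 * torch.randn(2 * out_dim))   # attention vector in Eq. (3)

        def forward(self, h, adj):
            # h: (n, in_dim) statement embeddings; adj: (n, n) adjacency with self-loops.
            z = self.W(h)                                             # Eq. (2)
            zi = z.unsqueeze(1).expand(-1, z.size(0), -1)             # broadcast z_i
            zj = z.unsqueeze(0).expand(z.size(0), -1, -1)             # broadcast z_j
            e = F.leaky_relu((torch.cat([zi, zj], dim=-1) * self.a).sum(-1))  # Eq. (3)
            e = e.masked_fill(adj == 0, float("-inf"))                # restrict to neighborhood N(i)
            alpha = torch.softmax(e, dim=-1)                          # Eq. (4)
            return F.elu(alpha @ z)                                   # Eq. (5), sigma chosen as ELU

Stacking two such layers over the PDG adjacency matrix propagates information across a 2-hop neighborhood of each statement.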
3.2.3 Classifier Learning. As multilayer perceptron (MLP) models are consistently ranked among the top classifiers [31, 43], we leverage the learning capability of a deep neural network to train the classifier. In this work, the goal is to train a model that can jointly learn from the function-level and statement-level code simultaneously.

To achieve this, we consider that both function-level and statement-level code contribute equally to the prediction outcomes. Thus, we build a shared set of linear and dropout layers that takes as input the function-level CodeBERT embedding and the statement embeddings obtained from the GAT layers. The ReLU [44] function serves as the activation function. In addition, LineVD keeps its prediction outcomes consistent: it applies an element-wise multiplication between the output class of each statement and the output class of the function-level embedding, which can be either one or zero. This operation resolves conflicting outputs between the function-level and statement-level embeddings; for example, if the output class of the function-level embedding is zero, then all statement-level outputs are also zero. The intuition is that a non-vulnerable function cannot have vulnerable lines. LineVD outputs the predictions corresponding to the statements in the input function. The cross-entropy loss function is used to train LineVD.
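The following sketch illustrates how such a shared prediction head and the element-wise masking of statement outputs by the function-level output could be wired together. Layer sizes, the dropout rate, and the exact loss weighting are illustrative assumptions rather than the tuned configuration.

    import torch
    import torch.nn as nn

    class JointClassifier(nn.Module):
        def __init__(self, dim=768, hidden=256, dropout=0.2):
            super().__init__()
            # Shared linear/dropout layers applied to both the function embedding
            # and the GAT-refined statement embeddings.
            self.shared = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(dropout),
                nn.Linear(hidden, 2),
            )

        def forward(self, func_emb, stmt_embs):
            # func_emb: (1, dim) CodeBERT function embedding; stmt_embs: (n, dim).
            func_logits = self.shared(func_emb)          # (1, 2)
            stmt_logits = self.shared(stmt_embs)         # (n, 2)
            func_pred = func_logits.argmax(dim=-1)       # 0 or 1
            stmt_pred = stmt_logits.argmax(dim=-1)       # (n,)
            # A statement is only flagged if the function itself is predicted vulnerable.
            final_pred = stmt_pred * func_pred           # element-wise masking
            return func_logits, stmt_logits, final_pred

In a setup like this, cross-entropy would be computed on the function-level and statement-level logits during training, while the masked predictions give the final statement-level output at inference time.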
4 EXPERIMENTAL DESIGN AND SETUP
In this section, we report the details of the experimental design, including the research questions and evaluation process. In particular, we present the dataset used for the empirical evaluation. We further discuss the applied evaluation metrics, which are chosen to thoroughly quantify model performance.
4.1 Research Questions
To explore the statement-level vulnerability detection task and investigate the performance of LineVD, we motivate and answer the following research questions:

RQ1: How much performance improvement can LineVD achieve in comparison with the state-of-the-art interpretation-based SVD model? To evaluate the relative improvement of LineVD in the statement-level SVD task, we compare against the state-of-the-art interpretation-based model. The comparison uses two kinds of measures: binary classification metrics and ranked metrics for vulnerable statements.

RQ2: How do different code embedding methods affect statement-level vulnerability detection? Code embedding methods have not yet been explored for statement-level SVD compared to SVD at other levels of granularity.

RQ3: How do graph neural networks and function-level information contribute to LineVD performance? The effect of information propagation using graph neural networks on statement-level SVD has yet to be explored.

RQ4: How does LineVD perform in a cross-project classification scenario? While training on a dataset containing multiple projects already reduces misrepresentation of model generalisability, it is still possible for samples from the same project to appear in both the training and test set. Using a cross-project scenario can better represent how the model performs on a completely unseen project, rather than only unseen samples.

RQ5: Which statement types are best distinguished by LineVD for real-world data? Investigating the model prediction outcomes from the perspective of statement types, particularly for real-world data, can help to understand where the model performs best and where it fails, which can guide future work and improvements in statement-level SVD.
4.2 Datasets
Recent research suggests that SVD models should be evaluated on data that represents the distinct characteristics of real-world vulnerabilities [10]. This means evaluating on source code extracted from real-world projects (i.e., non-synthetic) while maintaining an imbalanced ratio, which is inherent for vulnerabilities in software projects. Datasets that do not satisfy these conditions would give an inconsistent picture of model performance when applied in real-world scenarios. Another dataset requirement is a sufficiently large number of samples, ideally spanning multiple projects in order to acquire a model that can generalize well to unseen code. The final requirement is access to ground-truth labels at the statement level, or traceability to the before-fix code, i.e., the original git commit.

Extracted from over 300 different open-source C/C++ GitHub projects, Big-Vul contains trustworthy source code vulnerabilities spanning 91 different vulnerability types, which are linked to the public Common Vulnerabilities and Exposures (CVE) database [19]. A substantial amount of manual effort has also been dedicated to ensuring the quality of the dataset. It also provides enriched information including CVE IDs, CVE severity scores, and particularly the code changes, along with other metadata. To the best of our knowledge, Big-Vul is thus the best fit for the requirements of modelling code-centric vulnerability detection, containing approximately 10,000 vulnerable samples and 177,000 non-vulnerable samples. This large diversity in projects, rather than focusing on a limited
number of projects in particular, allows for better representation
of all open-source C/C++ projects (see Table 5 for the top 10 most
common projects in the dataset).
4.2.1 Ground-truth Labels. To obtain the ground-truth labels for the vulnerable and non-vulnerable lines, we follow the assertions from the literature [19, 32] rather than proposing our own heuristics: (1) removed lines in a vulnerability-fixing commit serve as an indicator of a vulnerable line, and (2) all lines that are control or data dependent on the added lines are also treated as vulnerable. The reasoning for the second point is that any added lines in a vulnerability-fixing commit were added to help patch the vulnerability. Hence, lines that were not modified in the vulnerability-fixing commit, but are related to these added lines, can be considered as related to the vulnerability.

To obtain labels corresponding to lines that are dependent on the added lines, we first obtain the code changes from the before and after versions of the sample, where a sample in Big-Vul refers to a function-level code snippet. For the before version, we remove all the added lines, and for the after version, we remove all the deleted lines. In both cases, we keep blank placeholder lines to ensure line number consistency. The code graph extracted from the after version can be used to find all lines that are control or data dependent on the added lines, whose line numbers correspond to the before version. This set of lines is combined with the set of deleted lines to obtain the final set of vulnerable lines for a single sample. Commented lines are excluded from the code graph, and hence are not used for training or prediction. This can be seen in Fig. 1 and 2; in this case, there is only a modified line, which is treated as both a deleted line (22) and an added line (23). The control and data dependency edges in this case happen to be the same for both the before and after versions, and hence we can use Fig. 2 to identify the lines that are control/data dependent on line 23, which are lines 3, 19, and 21.
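A simplified sketch of this labelling step is shown below. It assumes the diff has already been parsed into sets of deleted and added line numbers, and that control/data dependency edges from the after-version code graph are available as line-number pairs; it illustrates the labelling logic only, not the released pipeline.

    def vulnerable_lines(deleted_lines, added_lines, after_pdg_edges):
        """Derive ground-truth vulnerable lines of the before-fix function.

        deleted_lines: line numbers removed by the vulnerability-fixing commit.
        added_lines: line numbers introduced by the fixing commit (after version).
        after_pdg_edges: (src_line, dst_line) control/data dependency edges from
                         the after version, aligned to the before-version line numbers.
        """
        dependent = set()
        for src, dst in after_pdg_edges:
            # Lines connected to an added line by a control or data dependency
            # are treated as related to the vulnerability.
            if src in added_lines and dst not in added_lines:
                dependent.add(dst)
            if dst in added_lines and src not in added_lines:
                dependent.add(src)
        return set(deleted_lines) | dependent

    # Fig. 1 example: line 22 is deleted, line 23 is added, and the after-version PDG
    # connects line 23 with lines 3, 19, and 21.
    print(vulnerable_lines({22}, {23}, [(23, 3), (23, 19), (21, 23)]))   # {3, 19, 21, 22}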
4.2.2 Dataset Cleaning. We performed multiple filtering steps on the Big-Vul dataset. First, we remove all comments from the code. Second, we ignore code changes that are purely cosmetic, such as changes to whitespace, and consequently remove any functions with no non-cosmetic code changes. Third, we removed improperly truncated functions. A few samples in the original dataset were truncated incorrectly, resulting in unparsable and invalid code samples. For example, a function that was originally 50 lines may be incorrectly truncated to 40 lines for no apparent reason. The cause may be an error in how the dataset was originally constructed; however, there were only 30 such samples in the whole dataset. We use a random training/validation/test split ratio of 80:10:10. For the training set, we undersample the non-vulnerable samples to produce an approximately balanced dataset at the function level, while the test and validation sets are left at the original imbalanced ratio. We choose to balance the samples at the function level as it is non-trivial to balance at the statement level while maintaining the contextual dependencies between statements within a function.
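A rough sketch of this split-and-undersample step, assuming a pandas DataFrame of function-level samples with a binary vul column (the column name and seed are illustrative):

    import pandas as pd

    def split_and_balance(df, seed=42):
        # Random 80:10:10 split at the function level.
        df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
        n = len(df)
        train = df[: int(0.8 * n)]
        val = df[int(0.8 * n): int(0.9 * n)]
        test = df[int(0.9 * n):]

        # Undersample non-vulnerable functions in the training set only;
        # validation and test sets keep the original imbalanced ratio.
        vuln = train[train["vul"] == 1]
        nonvuln = train[train["vul"] == 0].sample(n=len(vuln), random_state=seed)
        balanced_train = pd.concat([vuln, nonvuln]).sample(frac=1.0, random_state=seed)
        return balanced_train, val, test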
4.3 Evaluation Metrics
We report F1-score, precision, recall, area under the receiver operating characteristic curve (ROCAUC), and area under the precision-recall curve (PRAUC). While ROCAUC is widely used to directly measure the predictive power of a model without choosing a specific threshold, it cannot fully reflect effectiveness on imbalanced datasets. Hence, in addition to reporting ROCAUC for comparison with past literature, we also use PRAUC, which is better suited to imbalanced problems.

While binary classification is the primary focus, as described in Section 3.1, we also report ranked metrics to evaluate the performance of the most confident predictions of the model. The ranked metrics include mean average precision (MAP), normalized discounted cumulative gain (nDCG), and mean first ranking (MFR). Here, we define first ranking as the rank of the first correctly predicted vulnerable statement. We care about how well the model performs at a certain number of $k$ most confident lines, as we can thus further limit the amount of code that needs to be reviewed by the developer. In this work, we use $k = 5$ as an effective representation of performance at the top, following the recommendation of Li et al. [32]. However, in practical usage, this threshold could be adjusted by the user. Finally, we report the accuracy of interpretation given N nodes, where N is equal to 5 (denoted as N5 in Table 1). This can be interpreted as function-level accuracy given only the prediction results of the top five lines.
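As a small illustration of the ranking view used for MFR (and of restricting attention to the top-k statements), the sketch below computes the first ranking and precision@k for a single function from per-statement confidence scores and ground-truth labels; it is an assumed formulation consistent with the definitions above, not the evaluation script itself.

    def first_ranking(scores, labels):
        """1-based rank of the first truly vulnerable statement when statements are
        sorted by descending confidence; None if the function has no vulnerable line."""
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for rank, idx in enumerate(order, start=1):
            if labels[idx] == 1:
                return rank
        return None

    def precision_at_k(scores, labels, k=5):
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return sum(labels[i] for i in top) / k

    # Example: the vulnerable statement has the second-highest confidence.
    scores = [0.91, 0.85, 0.10, 0.40]
    labels = [0, 1, 0, 0]
    print(first_ranking(scores, labels))    # 2  (MFR averages this over all functions)
    print(precision_at_k(scores, labels))   # 0.2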
Due to the imbalanced nature of the dataset, we first find the best threshold for the F1-score using the validation set before calculating the F1-score on the test set. When determining the significance of improvements, we use the Wilcoxon signed-rank test [56] on the F1-scores of ten runs with random seeds. The best models according to the automated hyperparameter tuning are used for the test.
4.4 Hyperparameters
Hyperparameters of LineVD were tuned using Ray Tune [36] with randomized grid search. The hyperparameter details can be found in the publicly released source code. The reported scores correspond to the mean test results across ten runs using the best hyperparameters based on the validation-set loss.
4.5 Experimental Design of Research
Questions
4.5.1 RQ1. In RQ1, we focus on the comparison between our proposed model and existing methods from the literature. For the SVD task, a practical setting is that the vulnerability status remains unknown for a given piece of code, whether at the function level or the file level. Thus, leveraging interpretable ML models for the validation and interpretation of the prediction results of SVD models is the dominant approach. In this work, we compare LineVD with the state-of-the-art interpretation-based SVD model, namely IVDetect [32]. IVDetect is designed as an interpretable vulnerability detector that combines artificial intelligence to detect vulnerabilities with an intelligent assistant that provides statement-level interpretations of the vulnerabilities. It has presented state-of-the-art detection and localization capabilities among graph-based software vulnerability detection and interpretation models. While it included an empirical evaluation against existing
function-level deep learning-based approaches, we provide a full replication of IVDetect for comparison in RQ1 in terms of the performance of vulnerable statement identification.
4.5.2 RQ2. In RQ2, we investigate the impact of utilising pre-trained embedding methods for statement-level vulnerability prediction. Since vectorising the code statement information (i.e., the node information in this work) is a critical step, we include different feature embedding methods, namely CodeBERT, Doc2Vec, and averaged GloVe embeddings, to understand their impact on prediction performance. The two baselines were chosen to represent the embedding methods usually seen in vulnerability detection models [9, 12, 32, 34, 35]. GloVe and Word2Vec embeddings usually perform similarly in the vulnerability detection context [45], and thus we only use GloVe, which was also the word embedding technique used in IVDetect [32]. For classification of the embeddings, we use multiple hidden layers, and freeze all parameters of the CodeBERT model during training.
4.5.3 RQ3. In RQ3, we explore the effect of introducing GNN layers into the feature extraction component of the model. In this experiment, we compare the use of two popular GNN variants: Graph Convolutional Networks (GCN) [28] and Graph Attention Networks (GAT) [53]. These two GNN types were chosen to explore the effect of different GNN architectures on statement-level vulnerability classification. The GNN layers are inserted after the feature extraction component and before the final hidden layers. We also test two different types of graphs extracted from the source code: Program Dependence Graphs (PDG) and Control Dependence Graphs (CDG). PDGs consist of both data dependency edges and control dependency edges. Data dependency edges describe how data flows between statements in the program (which statements are influenced by which variables), while control dependency edges describe the order in which statements execute, as well as whether they execute or not. We use both variants to explore whether GCNs and GATs can utilize data dependency information from related statements to better distinguish vulnerable statements. The control and data dependencies are extracted using the Joern program [57]. We remove samples that either cannot be parsed by Joern, or are correctly parsed but do not contain control or data dependency edges. Since the nodes produced by Joern are not necessarily at the line level, we group together nodes with the same line number, and remove nodes with no line numbers (e.g., metadata nodes), as sketched below.
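A rough sketch of this grouping step is shown below, assuming Joern nodes have been exported as dictionaries with lineNumber and code fields (the field names and the choice of representative node are illustrative assumptions):

    from collections import defaultdict

    def group_nodes_by_line(nodes):
        """Collapse Joern CPG nodes onto one statement-level node per source line.

        nodes: iterable of dicts, e.g. {"id": 7, "lineNumber": 22, "code": "..."}.
        Nodes without a line number (metadata nodes) are dropped.
        """
        by_line = defaultdict(list)
        for node in nodes:
            line = node.get("lineNumber")
            if line is None:
                continue                      # drop metadata nodes
            by_line[line].append(node)
        # Keep the node with the longest code string as the line's representative
        # (an illustrative choice for recovering the full statement).
        return {line: max(grp, key=lambda n: len(n.get("code", "")))
                for line, grp in by_line.items()}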
In addition to the GNN component, we simultaneously explore how to incorporate function-level information, which may help LineVD reduce the false positive rate of statement-level classification. This is done by also training on the function-level label, with the function embedded using CodeBERT.
4.5.4 RQ4. In RQ4, we explore how LineVD performs in a cross-project scenario; i.e., samples from the target project do not appear in the source projects used for model training. We simulate this by producing multiple splits where the test set is made up of a single project, and the rest of the projects are used in the training and validation sets. We choose the top 10 projects with the most vulnerable samples in the dataset to use as cross-project splits.
4.5.5 RQ5. In RQ5, we examine which statement types (e.g., if-statement, goto-statement) LineVD can correctly distinguish. We use the node types given by Joern. For the "Control Structure" node type, we replace it with the given control structure type (e.g., if, while, for). For "Function Call" node types, we split them into two different categories: "built-in" function calls, which are functions that appear in the C standard library, and "external" function calls, which are any other functions not found in the standard C library. This is an important distinction, as each sample in the dataset consists of a code snippet at the function level. This means the only information present in an external function call is the identifier name itself, along with its arguments, rather than the contents of the external function. For "Operator" node types, we group them into the following categories: assignment, arithmetic, comparison, access, and logical. Any operator node types that do not fall into these categories are grouped into "other".
5 RESULTS
We run our experiments on a computing cluster utilizing multiple
NVIDIA Tesla V100 GPUs and Xeon E5-2698v3 CPUs operating at
2.30 GHz.
5.1 RQ1: How much performance advantage
can LineVD achieve in comparison with the
state-of-the-art SVD model?
Table 1 and Table 2 summarise the performance comparison of LineVD with respect to IVDetect, a recently proposed fine-grained vulnerability detection approach, using various evaluation measures. We show that LineVD significantly outperforms IVDetect in all metrics. In Table 1, the ranked metrics are included to show the accuracy of the different models. A higher accuracy means that the model can generate more precise vulnerability detection at the statement level. As seen in Table 1, LineVD largely outperforms IVDetect in all ranked metrics ($p < 0.01$). For the ranked metrics, LineVD improves N5 by 0.205 in accuracy value, from 0.695 to 0.900, and increases MAP@5 by 0.336 over IVDetect, from 0.424 to 0.760. For MFR, LineVD improves the performance by 4.373 ranks, which significantly increases the efficiency of correctly locating the first vulnerable statement, by 114.5%.

When examining the distribution of rankings for LineVD, we find that it is extremely skewed towards the lower end, with some highly incorrect predictions raising the MFR. The distribution of first-rank scores is plotted in Fig. 4, which shows that the majority (89%) of first ranking scores are between 1 and 5. This suggests that LineVD may struggle with certain types of longer functions, but ranks correctly in most cases.

Table 2 provides the binary classification metrics, which is the most straightforward evaluation (whether or not a line is correctly predicted as a vulnerable statement), and the one directly aligned with our problem definition. Overall, LineVD outperforms IVDetect by 104% ($p < 0.01$) in F1-score. Specifically, for practical usage, we notice that the recall score has also been substantially improved, from 0.140 to 0.533, indicating that LineVD has boosted the ability to correctly determine and locate the vulnerable statements.

We also note that another advantage of our architecture is its efficiency in generating statement-level predictions compared to IVDetect. The use of an explanation model significantly increases the inference time for a single sample, as an entire model must
Figure 4: Histogram of first rankings of LineVD on the default test set. First ranking is defined as the first true-positive statement in a sorted list of softmax scores assigned to each statement for a given function. E.g., in 575 samples, a vulnerable statement is first in the ranked list of vulnerable statement predictions; in 67 samples, a vulnerable statement is second in the list, etc.
be trained to find a subgraph that explains the model's prediction, compared to the single forward pass required by LineVD. Using an NVIDIA Tesla V100, a single inference using GNNExplainer can take over a minute, depending on its configuration, while the single forward pass in LineVD takes less than a second.
5.2 RQ2: How do different code embedding methods affect statement-level vulnerability detection?
We evaluate the effectiveness of CodeBERT relative to other code embedding methods. In Table 3, CodeBERT outperforms the baseline feature embedding methods, Doc2Vec and GloVe, in the context of statement-level vulnerability prediction, by 134% ($p < 0.01$) and 16% ($p < 0.05$) in F1-score respectively. This is to be expected, as the pre-trained CodeBERT model has over 125 million parameters, which provide an enhanced capability to encode richer information from code snippets in a larger and deeper model, compared to the other models with fewer layers and parameters.
Table 1: RQ1: Statement-level Performance (Ranked)

Methods    N5     MAP@5  NDCG@5  MFR
IVDetect   0.695  0.424  0.517   8.192
LineVD     0.900  0.760  0.804   3.819

Table 2: RQ1: Statement-level Performance (Classification)

Methods    F1     Rec    Prec   ROCAUC  PRAUC
IVDetect   0.176  0.140  0.238  0.463   0.520
LineVD     0.360  0.533  0.271  0.913   0.642
We also note that, while CodeBERT has the advantage over GloVe and Doc2Vec of being pre-trained on a large corpus, it has the disadvantage of being trained on an external dataset consisting of non-C/C++ code in six other programming languages. In comparison, the GloVe and Doc2Vec models are directly trained on our C/C++ dataset. While Doc2Vec should in principle provide a better encoding of the contextual information of statements, GloVe with the averaged aggregation method achieves the second-best performance. Nonetheless, CodeBERT provides the best code embedding method in this comparison against the other most popular code embedding methods used for SVD to date.
Table 3: RQ2: Feature Embedding Methods for Statement-level Vulnerability Classification

Embedding  F1     Rec    Prec   ROCAUC  PRAUC
Doc2Vec    0.064  0.167  0.040  0.580   0.508
GloVe      0.129  0.166  0.106  0.666   0.529
CodeBERT   0.150  0.254  0.121  0.703   0.534
5.3 RQ3: Can graph neural networks and function-level information benefit statement-level classification?

Table 4 shows the effect of different experimental settings involving the GNN type, the program graph type, and whether or not the function-level classification component is included in the model. As can be seen from Table 4, using a graph attention network to learn features from PDG information is the best fit, and we have hence included it as the graph component in LineVD. When comparing this GNN feature extraction combination with the model variation without a GNN, we achieve an increase of 24% in F1-score ($p < 0.01$), from 0.296 to 0.360, indicating that the presence of the GNN benefits the model's learning capability. However, the performance of the other graph-based combinations is generally comparable to using the model without a GNN. This suggests that the program graph type and GNN type non-trivially affect the performance of the model. In particular, only using control dependency edges generally results in lower performance than using both control and data dependencies, suggesting that data dependency edges are important for predicting statement-level vulnerabilities.

In addition, we find that GCN achieves worse performance than GAT for all model types, which is to be expected, as the feature aggregation in GCN treats all neighboring nodes equally, unlike GAT, which attends to certain neighbors. However, when only statement-level information is used for classification (i.e., no function-level information involved), the use of GCN with either of the graph types, or a GAT with the CDG, results in worse performance than using no GNN. Finally, we find that enriching the model with function-level information significantly increases the statement-level performance, regardless of whether a GNN is used. Using the GAT+PDG combination, we achieve a performance increase of 140% ($p < 0.01$).
Table 4: RQ3: Graph-based Feature Variants for Statement-level Vulnerability Classification

Model Type     F1     Rec    Prec   ROCAUC  PRAUC
GAT+CDG        0.115  0.120  0.112  0.657   0.528
GAT+CDG+Func   0.304  0.491  0.221  0.907   0.624
GAT+PDG        0.150  0.254  0.121  0.703   0.534
GAT+PDG+Func   0.360  0.533  0.271  0.913   0.642
GCN+CDG        0.084  0.143  0.060  0.632   0.514
GCN+CDG+Func   0.283  0.558  0.190  0.911   0.597
GCN+PDG        0.085  0.125  0.067  0.599   0.513
GCN+PDG+Func   0.310  0.460  0.235  0.905   0.616
No GNN         0.129  0.166  0.106  0.666   0.529
No GNN+Func    0.296  0.537  0.205  0.921   0.619
Table 5: RQ4: Cross-project Statement-level Prediction

Project      F1     Rec    Prec   ROCAUC  PRAUC  Vuln
Chromium     0.298  0.470  0.219  0.923   0.625  3103
Linux        0.301  0.502  0.216  0.925   0.630  1847
Android      0.290  0.494  0.208  0.922   0.625  962
ImageMagick  0.333  0.504  0.249  0.925   0.644  331
PHP          0.290  0.487  0.208  0.928   0.622  200
TCPDump      0.284  0.452  0.207  0.925   0.622  197
OpenSSL      0.298  0.508  0.211  0.926   0.628  157
Krb5         0.259  0.535  0.186  0.903   0.605  139
QEMU         0.250  0.478  0.177  0.910   0.601  120
FFmpeg       0.269  0.483  0.193  0.917   0.612  115
5.4 RQ4: How does LineVD perform in a cross-project classification scenario?

From Table 5, we can see that performance is generally consistent across the different projects. Due to the nature of the various projects, the number of vulnerable samples varies per project. The top 10 software projects are reported in Table 5. While the performance varies, in general, the scores are all slightly lower than the results from the random splits in Table 2. This is to be expected, as even partial inclusion of within-project information such as variable and function names may assist the model in distinguishing vulnerable samples.

In Table 5, the investigated projects are ranked according to the number of vulnerabilities, indicated in the last column 'Vuln'. We note that the Chromium and Linux splits contain a disproportionately large number of vulnerable samples; the Chromium split alone accounts for 30% of all vulnerable samples in the dataset. Despite this, the performance is competitive with the other splits, indicating that the model generalises well to different settings and achieves comparable performance even with a smaller training set (Table 5).
Table 6: RQ5: Analysis of Statement-level Predictions

Statement Type          TP    FP    TN     FN    F1
Function Declaration    530   364   17572  126   0.68
While Statement         48    72    1374   23    0.50
Builtin Function Call   142   246   4537   72    0.47
Logical Operation       90    159   4601   60    0.45
Switch Statement        18    31    1424   31    0.37
For Statement           75    195   3632   88    0.35
If Statement            749   1930  50969  855   0.35
Assignment Operation    1206  3792  81490  1051  0.33
Other Operation         62    200   5401   54    0.33
Jump Target             18    53    11119  19    0.33
Arithmetic Operation    27    106   1379   10    0.32
Return Statement        166   526   27332  186   0.32
External Function Call  644   2212  63378  582   0.32
Comparison Operation    19    64    1489   17    0.32
Access Operation        44    168   3488   48    0.29
Cast Operation          25    115   3146   28    0.26
Continue                2     13    1039   1     0.22
Break                   5     42    7966   43    0.11
Goto Statement          3     15    5115   48    0.09
5.5 RQ5: Which statement types are best distinguished by LineVD?

Table 6 shows the raw prediction results for different statement types. We report the confusion matrix values for each statement type and sort them by F1-score. We see that assignment operations and external function calls are the most commonly occurring vulnerable lines, and that the model most often correctly classifies Function Declaration statements. This could be due to the significantly higher amount of information in function declarations compared to other statements, which can consist of the return type, function name, parameter types, and parameter names. Operation-related statements generally perform similarly, except for logical operations, which LineVD can generally distinguish better, and access and cast operations, which achieve relatively poorer performance.

LineVD has higher performance on built-in function call statements than on external function calls. This is somewhat expected, as the same built-in functions are likely to appear across more samples, unlike user-defined function names. In addition, due to the function-level nature of the dataset, the only information in an external function call statement is the function name itself and the arguments passed. Without knowledge of what occurs within an external function call, it can be difficult to find distinguishing patterns relating to their vulnerability.

LineVD struggles most with continue, break, and goto statements. These are all control structure nodes that affect the control dependency graph, and do not contain any direct statement-level information that could distinguish them from each other. Hence, to distinguish the vulnerability of these statements, the model must rely on patterns relating to the context in which they appear. Despite using a GNN to capture these relationships, these statements still cannot be classified correctly. Hence, while the use of a
GNN does improve overall performance, these results indicate that neither GCN nor GAT is sufficient for effectively distinguishing these particular control statements. In contrast, statement types like function declarations, function calls, and for/while/if control structures often contain significantly more information within the statement, and hence can be distinguished more easily. This focus on more complex control structure statement types and built-in function calls is suitable for practical usage, as they are more likely to be directly involved in the vulnerability.
6 DISCUSSION
In this section, we discuss the threats to validity and limitations of
our work, along with ideas for future work.
6.1 Threats to Validity
The rst threat relates to the potential sub-optimal hyperparame-
ter tuning; we rely on a random grid search for determining the
combinations of hyperparameters; however, it would be close to
impossible to exhaustively test all combinations of hyperparame-
ters. To mitigate this threat, we adopt best practices where possible
when choosing the default range of values for the hyperparameter
search, and tune each model variation with the same number of
random parameter congurations.
There has also been recent work on the eect of time-based
validation on software defect prediction [
18
], which could also
be relevant to vulnerability detection. We mitigate this threat by
also exploring cross-project vulnerability prediction, since time-
based validation primarily applies to within-project prediction. Fur-
thermore, any graph-based model should intuitively have similar
capabilities in regards to learning patterns associated with vulnera-
bilities, and hence the performance loss (or gain) should be linear
across the dierent baseline models. However, this is something
that could be tested more thoroughly in future works.
6.2 Limitations and Future Work
While LineVD outperforms the current state-of-the-art, the per-
formance is still quite low, meaning there is still much room for
improvement in the statement-level SVD task. A particular area of
interest is nding the best way to produce better code embeddings.
While CodeBERT can perform well, it is not trained on C/C++.
The use of a large language model pretrained on the specic tar-
get language (e.g. C-BERT [
8
]) could improve the performance of
downstream tasks. However, at the time of writing, there were no
openly released C/C++-based large pretrained models.
Another limitation concerns the learning capability of the GNN layer, which inherently only propagates information from immediately neighboring nodes (i.e., 1-hop). To propagate longer-range information, multiple graph neural network layers can be stacked. For example, two GNN layers would propagate information across a 2-hop neighborhood for each node. However, using more than just a few layers in most graph neural network architectures has been shown to result in over-squashing, where the exponentially growing information between each layer cannot be captured within a fixed-length vector [4], resulting in bottlenecked performance. This is consistent with our findings; we tuned the number of GNN layers during experimentation but did not find any significant increase in performance beyond two layers. Future work could explore novel GNN architectures that can better capture vulnerability patterns in program graphs.
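To make the 2-hop propagation concrete, the minimal DGL/PyTorch sketch below stacks two graph convolution layers so that each statement node aggregates information from nodes up to two dependency edges away; the feature dimensions and toy graph are assumptions for illustration and do not reproduce the full LineVD architecture.

```python
import torch
import torch.nn as nn
import dgl
from dgl.nn import GraphConv

class TwoHopGCN(nn.Module):
    """Two stacked GCN layers: each node aggregates its 2-hop neighborhood."""

    def __init__(self, in_feats, hidden_feats, num_classes=2):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden_feats)    # first hop
        self.conv2 = GraphConv(hidden_feats, num_classes)  # second hop

    def forward(self, graph, node_feats):
        h = torch.relu(self.conv1(graph, node_feats))
        return self.conv2(graph, h)  # per-statement (node) logits

# Toy program graph: 4 statement nodes connected by a few dependency edges.
src, dst = torch.tensor([0, 1, 2]), torch.tensor([1, 2, 3])
graph = dgl.add_self_loop(dgl.graph((src, dst), num_nodes=4))
feats = torch.randn(4, 768)  # e.g. statement embeddings (dimension is illustrative)
logits = TwoHopGCN(768, 128)(graph, feats)  # shape: (4, 2)
```

Adding a third GraphConv layer would extend the receptive field to 3 hops, but, consistent with the over-squashing effect discussed above, we did not observe performance gains beyond two layers.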
7 RELATED WORK
To summarize the state-of-the-art in software vulnerability detection research, we introduce the related work from two perspectives: 1) the application of GNNs to SVD, and 2) the explainability of machine learning models for SVD.
7.1 Software vulnerability detection with GNN
Detecting software vulnerabilities is an ongoing research topic for securing software systems against cyber attacks. Recent advances have advocated the application of GNNs and their variants to function-level vulnerability detection, with graphs considered among the best ways of representing source code in the SVD context [2, 58], and have demonstrated enhanced performance over other approaches [9, 17, 32].
The introduction of GNNs for modelling vulnerabilities was originally inspired by the vulnerability discovery approach proposed by Yamaguchi et al. [57] using Code Property Graphs, a type of program graph incorporating program dependency edges [26, 39], control flow edges, and the abstract syntax tree of the program, which provide an additional source of information to learn from [14]. Hence, the performance improvements using GNNs can primarily be attributed to leveraging the domain knowledge that lines of source code within a program have specific relationships to other lines; i.e., training using both semantic and syntactical information, rather than only syntactical information.
To achieve this goal, the lines of source code, which are also the nodes in the graph, are first vectorised based on their code tokens. The vectorised information is then concatenated to form the initial node features, allowing GNNs to capture both semantic and syntactical information. This propagation of information between semantically relevant statements theoretically allows the model to better make use of relevant contextual lines. While GNNs have been shown to perform well in function-level vulnerability classification (graph classification) [9, 10, 12, 37, 58], the effects of vectorisation methods and GNN models on statement-level vulnerability classification (node classification) have yet to be explored.
7.2 Interpreting machine-learning based models for SVD
One way to improve the practical performance of machine-learning based methods for SVD is the development of explainable detection results, which can provide a fine-grained vulnerability prediction outcome. This raises the importance of research on interpretable machine-learning based models. Beyond proposing novel machine-learning based models for function-level vulnerability detection, existing works are limited to providing partial information for explanation generation, i.e., tokens in [59] and intermediate code in [33]. Another route to the statement-level SVD task is to localize the specific vulnerable statements under the assumption that the input function is already known to be vulnerable [15]. However, this presents a notable limitation, as vulnerability detection and localization are usually demanded at the same time.
A recently explored benefit of GNNs for SVD is direct access to explainable GNN approaches [32]. These can be attached to any GNN-based function-level SVD model to obtain fine-grained predictions; i.e., statement-level predictions if the nodes themselves represent statements. However, whether this is the best way to build a statement-level vulnerability classifier has yet to be explored. For example, one disadvantage of using a function-level detector as the base model for interpretation is the inability to directly leverage any statement-level information during the training process, in addition to the significantly longer inference times.
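Such pipelines attach a dedicated GNN explanation technique to an already-trained function-level detector. As a simplified illustration of the general idea only, and not the method used in [32] or in LineVD, the sketch below derives per-statement importance scores from a toy function-level GNN via input-gradient saliency.

```python
import torch
import torch.nn as nn
from dgl.nn import GraphConv

class FunctionLevelGCN(nn.Module):
    """Toy function-level (graph-level) vulnerability scorer."""

    def __init__(self, in_feats, hidden_feats):
        super().__init__()
        self.conv = GraphConv(in_feats, hidden_feats)
        self.readout = nn.Linear(hidden_feats, 1)

    def forward(self, graph, node_feats):
        h = torch.relu(self.conv(graph, node_feats))
        return self.readout(h.mean(dim=0))  # one logit per function

def statement_saliency(model, graph, node_feats):
    """Gradient of the function-level logit w.r.t. each statement's features.

    The graph is assumed to already contain self-loops (see the earlier
    sketch); larger gradient norms suggest statements that influenced the
    function-level prediction more strongly.
    """
    node_feats = node_feats.clone().requires_grad_(True)
    model(graph, node_feats).sum().backward()
    return node_feats.grad.norm(dim=1)  # one importance score per statement
```

In contrast to such post-hoc attribution, LineVD trains directly on statement-level labels, which avoids the extra explanation pass at inference time.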
8 CONCLUSION
We introduce LineVD, a novel deep learning approach for statement-level vulnerability detection, which allows developers to more efficiently evaluate potentially vulnerable functions. LineVD achieves a new state-of-the-art on statement-level vulnerability detection on real-world open source projects by leveraging graph neural networks and statement-level information during training. The significant improvement over the latest fine-grained machine-learning based model indicates the effectiveness of directly utilizing statement-level information for statement-level SVD. Finally, LineVD achieves reasonable cross-project performance, indicating its effectiveness and generalization capabilities even for completely unseen software projects. Future directions include exploring alternative pretrained feature embedding methods and novel GNN architectures that can better accommodate the underlying nature of software source code and its vulnerabilities.
REFERENCES
[1] [n. d.]. NIST: National Vulnerability Database. https://nvd.nist.gov/.
[2] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. arXiv preprint arXiv:2103.06333 (2021).
[3] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In International Conference on Learning Representations.
[4] Uri Alon and Eran Yahav. 2020. On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205 (2020).
[5] Bushra Aloraini, Meiyappan Nagappan, Daniel M German, Shinpei Hayashi, and Yoshiki Higo. 2019. An empirical study of security warnings from static application security testing tools. Journal of Systems and Software 158 (2019), 110427.
[6] Authors. [n. d.]. Reproduction package for MSR double-blind review. Retrieved Jan, 2022 from https://github.com/davidhin/linevd
[7] Ranran Bian, Yun Sing Koh, Gillian Dobbie, and Anna Divoli. 2019. Network embedding and change modeling in dynamic heterogeneous networks. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 861–864.
[8] Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, et al. 2020. Exploring software naturalness through neural language models. arXiv preprint arXiv:2006.12641 (2020).
[9] Sicong Cao, Xiaobing Sun, Lili Bo, Ying Wei, and Bin Li. 2021. BGNN4VD: Constructing Bidirectional Graph Neural-Network for Vulnerability Detection. Information and Software Technology 136 (2021), 106576.
[10] Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. 2021. Deep learning based vulnerability detection: Are we there yet. IEEE Transactions on Software Engineering (2021).
[11] Cen Chen, Kenli Li, Sin G Teo, Xiaofeng Zou, Kang Wang, Jie Wang, and Zeng Zeng. 2019. Gated residual recurrent graph neural networks for traffic prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 485–492.
[12] Xiao Cheng, Haoyu Wang, Jiayi Hua, Guoai Xu, and Yulei Sui. 2021. DeepWukong: Statically detecting software vulnerabilities using deep graph neural network. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 3 (2021), 1–33.
[13] Roland Croft, Dominic Newlands, Ziyu Chen, and M Ali Babar. 2021. An Empirical Study of Rule-Based and Learning-Based Approaches for Static Application Security Testing. arXiv preprint arXiv:2107.01921 (2021).
[14] Lei Cui, Zhiyu Hao, Yang Jiao, Haiqiang Fei, and Xiaochun Yun. 2020. VulDetector: Detecting Vulnerabilities Using Weighted Feature Graph Comparison. IEEE Transactions on Information Forensics and Security 16 (2020), 2004–2017.
[15] Yangruibo Ding, Sahil Suneja, Yunhui Zheng, Jim Laredo, Alessandro Morari, Gail Kaiser, and Baishakhi Ray. 2022. VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 1–12.
[16] Xiaoning Du, Bihuan Chen, Yuekang Li, Jianmin Guo, Yaqin Zhou, Yang Liu, and Yu Jiang. 2019. Leopard: Identifying vulnerable code for vulnerability assessment through program metrics. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 60–71.
[17] Xu Duan, Jingzheng Wu, Shouling Ji, Zhiqing Rui, Tianyue Luo, Mutian Yang, and Yanjun Wu. 2019. VulSniper: Focus Your Attention to Shoot Fine-Grained Vulnerabilities. In IJCAI. 4665–4671.
[18] Davide Falessi, Jacky Huang, Likhita Narayana, Jennifer Fong Thai, and Burak Turhan. 2020. On the need of preserving order of data when validating within-project defect classifiers. Empirical Software Engineering 25, 6 (2020), 4805–4830.
[19] Jiahao Fan, Yi Li, Shaohua Wang, and Tien N Nguyen. 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. In Proceedings of the 17th International Conference on Mining Software Repositories. 508–512.
[20] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv preprint arXiv:2002.08155 (2020). http://arxiv.org/abs/2002.08155. Accepted to Findings of EMNLP 2020.
[21] Seyed Mohammad Ghaffarian and Hamid Reza Shahriari. 2017. Software vulnerability analysis and discovery using machine-learning and data-mining techniques: A survey. ACM Computing Surveys (CSUR) 50, 4 (2017), 1–36.
[22] Hazim Hanif, Mohd Hairul Nizam Md Nasir, Mohd Faizal Ab Razak, Ahmad Firdaus, and Nor Badrul Anuar. 2021. The rise of software vulnerability: Taxonomy of software vulnerabilities detection and machine learning approaches. Journal of Network and Computer Applications (2021), 103009.
[23] Thong Hoang, Hong Jin Kang, David Lo, and Julia Lawall. 2020. CC2Vec: Distributed representations of code changes. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 518–529.
[24] Aram Hovsepyan, Riccardo Scandariato, and Wouter Joosen. 2016. Is Newer Always Better? The Case of Vulnerability Prediction Models. In Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. 1–6.
[25] Julian Jang-Jaccard and Surya Nepal. 2014. A survey of emerging threats in cybersecurity. J. Comput. System Sci. 80, 5 (2014), 973–993.
[26] Andrew Johnson, Lucas Waye, Scott Moore, and Stephen Chong. 2015. Exploring and enforcing security guarantees via program dependence graphs. ACM SIGPLAN Notices 50, 6 (2015), 291–302.
[27] Seulbae Kim, Seunghoon Woo, Heejo Lee, and Hakjoo Oh. 2017. Vuddy: A scalable approach for vulnerable code clone discovery. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 595–614.
[28] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (Palais des Congrès Neptune, Toulon, France) (ICLR '17). https://openreview.net/forum?id=SJU4ayYgl
[29] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning. PMLR, 1188–1196.
[30] Triet HM Le, Hao Chen, and Muhammad Ali Babar. 2020. Deep learning for source code modeling and generation: Models, applications, and challenges. ACM Computing Surveys (CSUR) 53, 3 (2020), 1–38.
[31] Stefan Lessmann, Bart Baesens, Christophe Mues, and Swantje Pietsch. 2008. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering 34, 4 (2008), 485–496.
[32] Yi Li, Shaohua Wang, and Tien N Nguyen. 2021. Vulnerability Detection with Fine-grained Interpretations. In The 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM.
[33] Zhen Li, Deqing Zou, Shouhuai Xu, Zhaoxuan Chen, Yawei Zhu, and Hai Jin. 2021. VulDeeLocator: a deep learning-based fine-grained vulnerability detector. IEEE Transactions on Dependable and Secure Computing (2021), 1–17.
[34] Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, and Zhaoxuan Chen. 2021. SySeVR: A framework for using deep learning to detect software vulnerabilities. IEEE Transactions on Dependable and Secure Computing (2021).
[35] Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. 2018. VulDeePecker: A Deep Learning-Based System for Vulnerability Detection. In Network and Distributed Systems Security (NDSS) Symposium 2018. 1–12.
[36] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez, and Ion Stoica. 2018. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118 (2018).
[37] Guanjun Lin, Wei Xiao, Leo Yu Zhang, Shang Gao, Yonghang Tai, and Jun Zhang. 2021. Deep neural-based vulnerability discovery demystified: data, model and performance. Neural Computing and Applications (2021), 1–14.
[38] Bingchang Liu, Liang Shi, Zhuhua Cai, and Min Li. 2012. Software vulnerability discovery techniques: A survey. In 2012 fourth international conference on multimedia information networking and security. IEEE, 152–156.
[39] Chao Liu, Chen Chen, Jiawei Han, and Philip S Yu. 2006. GPLAG: detection of software plagiarism by program dependence graph analysis. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 872–881.
[40] Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-task learning based pre-trained language model for code completion. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 473–485.
[41] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[42] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
[43] Hajra Naeem and Manar H Alalfi. 2020. Identifying Vulnerable IoT Applications using Deep Learning. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 582–586.
[44] Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In ICML.
[45] Hai Ngoc Nguyen, Songpon Teerakanok, Atsuo Inomata, and Tetsutaro Uehara. 2021. The Comparison of Word Embedding Techniques in RNNs for Vulnerability Detection. In ICISSP. 109–120.
[46] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
[47] Nam H Pham, Tung Thanh Nguyen, Hoan Anh Nguyen, and Tien N Nguyen. 2010. Detection of recurring software vulnerabilities. In Proceedings of the IEEE/ACM international conference on Automated software engineering. 447–456.
[48] Chanathip Pornprasit and Chakkrit Tantithamthavorn. 2021. JITLine: A Simpler, Better, Faster, Finer-grained Just-In-Time Defect Prediction. In 2021 International Conference on Mining Software Repositories (MSR'21). 1–11.
[49] Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K Roy, and Cristina V Lopes. 2016. SourcererCC: Scaling code clone detection to big-code. In Proceedings of the 38th International Conference on Software Engineering. 1157–1168.
[50] Riccardo Scandariato, James Walden, Aram Hovsepyan, and Wouter Joosen. 2014. Predicting vulnerable software components via text mining. IEEE Transactions on Software Engineering 40, 10 (2014), 993–1006.
[51] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE transactions on neural networks 20, 1 (2008), 61–80.
[52] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015).
[53] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph Attention Networks. 6th International Conference on Learning Representations (2017).
[54] Deze Wang, Yue Yu, Shanshan Li, Wei Dong, Ji Wang, and Liao Qing. 2021. MulCode: A Multi-task Learning Approach for Source Code Understanding. In 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 48–59.
[55] Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 261–271.
[56] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in statistics. Springer, 196–202.
[57] Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and discovering vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and Privacy. IEEE, 590–604.
[58] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems 32 (2019), 8026–8037.
[59] Deqing Zou, Yawei Zhu, Shouhuai Xu, Zhen Li, Hai Jin, and Hengkai Ye. 2021. Interpreting deep learning-based vulnerability detector predictions based on heuristic searching. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 2 (2021), 1–31.