LineVD: Statement-level Vulnerability Detection using Graph
Neural Networks
David Hin
CREST - The Centre for Research on Engineering
Software Technologies, University of Adelaide
Cyber Security Cooperative Research Centre
Adelaide, Australia, 5005
david.hin@adelaide.edu.au
Andrey Kan
AWS AI Labs*
Adelaide, SA, Australia, 5005
avkan@amazon.com
Huaming Chen
CREST - The Centre for Research on Engineering
Software Technologies, University of Adelaide
Cyber Security Cooperative Research Centre
Adelaide, Australia, 5005
huaming.chen@adelaide.edu.au
M. Ali Babar
CREST - The Centre for Research on Engineering
Software Technologies, University of Adelaide
Cyber Security Cooperative Research Centre
Adelaide, Australia, 5005
ali.babar@adelaide.edu.au
ABSTRACT
Current machine-learning based software vulnerability detection methods are primarily conducted at the function-level. However, a key limitation of these methods is that they do not indicate the specific lines of code contributing to vulnerabilities. This limits the ability of developers to efficiently inspect and interpret the predictions from a learnt model, which is crucial for integrating machine-learning based tools into the software development workflow. Graph-based models have shown promising performance in function-level vulnerability detection, but their capability for statement-level vulnerability detection has not been extensively explored. While interpreting function-level predictions through explainable AI is one promising direction, we herein consider the statement-level software vulnerability detection task from a fully supervised learning perspective. We propose a novel deep learning framework, LineVD, which formulates statement-level vulnerability detection as a node classification task. LineVD leverages control and data dependencies between statements using graph neural networks, and a transformer-based model to encode the raw source code tokens. In particular, by addressing the conflicting outputs between function-level and statement-level information, LineVD significantly improves the prediction performance without requiring vulnerability status for function code. We have conducted extensive experiments against a large-scale collection of real-world C/C++ vulnerabilities obtained from multiple real-world projects, and demonstrate an increase of 105% in F1-score over the current state-of-the-art.
CCS CONCEPTS
• Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network reliability.
*This work was done prior to joining Amazon.
MSR’ 2022, May 23–24, 2022, Pittsburgh, PA, USA
2022. ACM ISBN 978-1-4503-XXXX-X/18/06. . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
KEYWORDS
Software Vulnerability Detection, Program Representation, Deep
Learning
ACM Reference Format:
David Hin, Andrey Kan, Huaming Chen, and M. Ali Babar. 2022. LineVD:
Statement-level Vulnerability Detection using Graph Neural Networks. In
MSR ’22: Proceedings of the 19th International Conference on Mining Software
Repositories, May 23-24, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA,
12 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Identifying potential software vulnerabilities is a crucial step in defending against cyber attacks [25]. However, it can be difficult and time-consuming for developers to determine which parts of the software will contribute to vulnerabilities in large software systems. Consequently, interest in more accurate and efficient automated software vulnerability detection (SVD) solutions has been increasing [21, 38]. Automated SVD can be broadly classified into two categories: (1) traditional methods, which can include both static and dynamic analysis, and (2) data-driven solutions, which leverage data mining and machine learning to predict the presence of software vulnerabilities [22]. Traditional static tools are often rule-based, leveraging knowledge from security domain experts. This can result in inconsistent performance due to a high number of false positive alerts, or completely missing more complex vulnerabilities [5]. As it is challenging to define vulnerable patterns in an accurate and comprehensive way, data-driven solutions have become a promising alternative.
One primary reason for the increasing popularity of data-driven solutions can be attributed to the ever growing quantity of open-source vulnerability data. The accumulating security vulnerabilities in open-source software are regularly updated and reported to sources like the National Vulnerability Database [1]. There have been many efforts that utilize the publicly available information to reduce the need for manually defining patterns and to learn from the data [27, 49]. As a result, data-driven solutions have generally shown outstanding performance in comparison to traditional static application security testing tools [13]. One potential reason could be
due to the learning ability of the models to incorporate latent information from patterns where vulnerabilities would appear in source code, in addition to the underlying causes of the vulnerability.
Despite the success of current data-driven approaches in the identification of software vulnerabilities, they are often limited to a coarse level of granularity. The model outputs often present developers with limited information for prediction outcome validation and interpretation, leading to extra effort when evaluating and mitigating the software vulnerabilities. Consequently, many proposed SVD solutions have transitioned to either function-level [9, 10, 32, 58] or slice-level [12, 33–35] predictions, which are a major improvement over file-level predictions [16, 24, 50]. Some other works further leverage supplementary information, such as commit-level code changes with accompanying log messages, to build the prediction model [23, 48]. While the goal is to help practitioners prioritize the defective code, vulnerabilities can often be localized to a few key lines [17]. Hence, reviewing large functions could still be a considerable burden. From a preliminary analysis on a large C/C++ vulnerability detection dataset [19], we find that vulnerable functions in the dataset are on average 95 lines of code.
Fig. 1 shows an example of how statement-level SVD can be beneficial. To save space, we choose a vulnerability from a smaller function, which contains an integer overflow vulnerability from the Linux kernel (CVE-2018-12896) that can ultimately be exploited to cause denial-of-service. With explicit statement-level predictions, it can be easier to interpret why the function has been predicted as vulnerable (or alternatively, verify that the prediction was erroneous). Highlighting lines based on model confidence helps a developer narrow down the lines needing further inspection. In this case, a statement-level SVD model flags the addition assignment operation on line 22, which contains the vulnerable integer casting operation, as most suspicious, allowing a developer to more efficiently validate and mitigate the vulnerability.
Refining SVD granularity towards the statement level is still in its infancy; the latest work by Li et al. [32] has indicated the possibility of leveraging an interpretable ML model, namely GNNExplainer, to derive the vulnerable statements as the interpretation of the learnt model. However, in our work, we find that its performance is not sufficient or effective when classifying and ranking the latent vulnerable statements. Alternatively, we aim to explore the feasibility and effectiveness of directly training and predicting on vulnerabilities at the statement level for SVD granularity refinement, which would allow data-driven solutions to directly utilize any available statement-level information in a fully supervised manner.
In this paper, we propose a novel framework for statement-level SVD, namely LineVD. We focus on revisiting core components of data-driven SVD from the perspective of statement-level classification, which can serve as a way of allowing developers to better interpret vulnerability predictions. We have explored statement-level SVD in a thorough way to find the best data-driven architecture to achieve optimal performance. Covering various feature extraction methods and model architectures to tackle the latent challenges in SVD, we show that LineVD provides sufficient capacity to incorporate contextual information for each statement in an efficient manner. The extensive evaluation of the model has been conducted in realistic scenarios; namely, heavily imbalanced
 1 void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
 2 {
 3     const struct k_clock *kc = timr->kclock;
 4     ktime_t now, remaining, iv;
 5     struct timespec64 ts64;
 6     bool sig_none;
 7
 8     sig_none = timr->it_sigev_notify == SIGEV_NONE;
 9     iv = timr->it_interval;
10
11     if (iv) {
12         cur_setting->it_interval = ktime_to_timespec64(iv);
13     } else if (!timr->it_active) {
14         if (!sig_none)
15             return;
16     }
17
18     kc->clock_get(timr->it_clock, &ts64);
19     now = timespec64_to_ktime(ts64);
20
21     if (iv && (timr->it_requeue_pending & REQUEUE_PENDING || sig_none))
22         timr->it_overrun += (int)kc->timer_forward(timr, now);
23     // Added: timr->it_overrun += kc->timer_forward(timr, now);
24
25     remaining = kc->timer_remaining(timr, now);
26
27     if (remaining <= 0) {
28         if (!sig_none)
29             cur_setting->it_value.tv_nsec = 1;
30     } else {
31         cur_setting->it_value = ktime_to_timespec64(remaining);
32     }
33 }
Figure 1: Vulnerable lines predicted by LineVD for a code
snippet from CVE-2018-12896, which are indicated by the
presence of a red background. A darker red background
color indicates higher prediction confidence by LineVD.
Ground-truth vulnerable lines have red line numbers, while
dark red line numbers indicate a data or control dependency
on an added line (indicated by line 23).
labels and cross-project testing. In summary, this paper makes the
following contributions:
• We propose a novel and efficient statement-level SVD approach, LineVD. While our investigation shows that the current state-of-the-art interpretation-based SVD model performs poorly at this task, LineVD achieves a significant improvement with an increase of 105% in F1-score.
• We investigate the performance effects of each stage of building a GNN-based statement-level SVD model, including the node embedding methods and GNN model selection. Based on these findings, LineVD is developed to largely improve the performance by learning from the function- and statement-level information simultaneously.
• LineVD is the first approach to jointly learn from function-level and statement-level information via graph neural networks to enhance SVD performance; in the empirical evaluation it has significantly outperformed models using only one type of information.
• We publish our dataset, source code, and models with supporting scripts [6], providing a ready-to-use implementation for future work on benchmarking and comparison.
2 BACKGROUND
In this section, we introduce relevant key concepts relating to source code embedding and graph neural networks (GNNs), which have played key roles in providing effective and explainable capabilities for SVD prediction models in recent literature.
2.1 Source code embedding
To tackle a language-related task with machine learning models, it is necessary to transform the related textual corpora into vector representations. The corpora are usually specific to a given domain; e.g., one recently popular topic is source code-related tasks. The process is commonly referred to as building a language model and extracting language embeddings, which can be at the word, sentence, or document level [42]. Source code modelling is the application of language modelling to source code, which can be considered a special type of structured natural language [30], and has demonstrated encouraging results for a wide range of downstream tasks. These include code completion [40] and code clone detection [55], among others [54].
For software vulnerability detection, prior approaches have utilized document embedding methods like Doc2Vec [29], or word embedding methods such as GloVe [46] and Word2Vec [42] to generate pre-trained vectors for singular tokens, which are then aggregated in some way. For example, Cao et al. [9] utilized averaged Word2Vec embeddings to transform raw code statements into vector representations.
Recently, transformer-based models have been applied to source code modelling, allowing for large source code understanding models to be pre-trained. They are expected to obtain higher-quality embeddings for source code, which can be leveraged for downstream tasks requiring less labeled data and training resources. One major advancement is CodeBERT [20], based on the RoBERTa [41] architecture, which was trained on six different programming languages: Python, Java, JavaScript, PHP, Ruby, and Go. Specifically, CodeBERT was trained on 2.1 million natural language and programming language bimodal samples, and 6.4 million programming language unimodal samples. It should be noted that CodeBERT and Doc2Vec can produce contextual embeddings, in contrast to GloVe and Word2Vec embeddings, which are static and hence have the same embedding for each word regardless of its context.
Tokenization approaches for source code can vary significantly. For word embedding-based approaches, code tokens can be split by whitespace alongside punctuation, such as parentheses and semicolons [34]. Alternatively, punctuation can be completely removed [32]. In addition, code identifiers are sometimes further tokenized according to common naming conventions, such as underscores or camel case. This helps reduce the number of out-of-vocabulary tokens, which can otherwise be significantly higher than in regular natural language due to the nature of how developers name identifiers. An alternative approach to manually defined tokenization rules is unsupervised subword tokenization. In CodeBERT, byte-pair encoding (BPE) [52] is used to tokenize the source code tokens. In particular, long variable names are split into subwords based on the BPE algorithm. For example, 'add_one' may be tokenized to 'add', '_', and 'one'. This is arguably a more consistent way of reducing out-of-vocabulary issues in source code identifiers, as this approach is not reliant on pre-defined rules.
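As an illustration of this subword tokenization, the following Python sketch splits a statement from Fig. 1 into BPE tokens. It assumes the Hugging Face transformers package and the publicly available microsoft/codebert-base checkpoint; the exact subword boundaries depend on the pretrained vocabulary.

# Minimal sketch of BPE subword tokenization for a code statement.
# Assumes the Hugging Face `transformers` package and the public
# `microsoft/codebert-base` checkpoint; actual subword boundaries
# depend on the pretrained vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

statement = "timr->it_overrun += (int)kc->timer_forward(timr, now);"
tokens = tokenizer.tokenize(statement)
print(tokens)   # e.g. subwords such as 'tim', 'r', '->', 'it', '_over', 'run', ...
ids = tokenizer.convert_tokens_to_ids(tokens)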
2.2 Graph Neural Network
Recently, graph neural networks have demonstrated superior performance at mining graph data structures for social networks [7], spatial-temporal traffic networks [11], and so on.
Figure 2: Program Dependence Graph for Fig. 1. Black lines represent control dependency edges; dashed red lines represent data dependency edges.
For downstream source code modelling tasks, transformer-based models have demonstrated promising results [3]. However, the complex syntactic and semantic characteristics that are inherently present in programming languages are not explicitly leveraged.
Rather than solely representing source code as a sequence of tokens, the intrinsic structural information of source code can also be effectively modelled by representing a source code snippet (or program) as a graph $G = (V, E)$, allowing a model to more easily learn latent relationships within the source code, where $V$ is the set of nodes representing the program graph and $E$ is the edge matrix. With different types of program-related structures, the edge matrix $E$ denotes the corresponding syntactic and semantic information of the source code. Using this graph-based representation in combination with graph neural networks has resulted in improved performance for function-level vulnerability detection, in which a single graph represents a single source code function (see Section 7 for further details). The program dependence graph (PDG) of a program (see Fig. 2) is the overall focus in this work, as software vulnerabilities often involve data and control flows [32, 47, 57].
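As a minimal illustration of this graph representation, the sketch below builds a small statement-level dependence graph with a few hand-picked edges loosely based on Fig. 1 and 2; it assumes the networkx package and is not how LineVD extracts graphs in practice (a static analysis tool is used for that, as described in Section 4).

# Minimal sketch of a statement-level program dependence graph G = (V, E).
# Node identifiers are line numbers from Fig. 1 (hand-picked for illustration);
# in practice the edges are produced by a static analysis tool, not listed manually.
import networkx as nx

pdg = nx.DiGraph()
# control dependency edges (condition -> dependent statement)
pdg.add_edge(21, 22, etype="control")
pdg.add_edge(27, 29, etype="control")
# data dependency edges (definition -> use)
pdg.add_edge(9, 11, etype="data")    # iv defined on line 9, used on line 11
pdg.add_edge(19, 22, etype="data")   # now defined on line 19, used on line 22

for u, v, attrs in pdg.edges(data=True):
    print(u, "->", v, attrs["etype"])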
3 THE LINEVD FRAMEWORK
In this section, we present the LineVD framework by first summarising the problem definition. The overall architecture is subsequently discussed with details of its three main components. In particular, we demonstrate how we have incorporated the large-scale pretrained model and developed the novel graph construction method.
3.1 Problem Definition
We formalize the identification of vulnerable statements in a function as a binary node classification problem, i.e., learning to predict which source code statements in a function are vulnerable. Let us define a sample of data as $((V_i, Y_i) \mid V_i \in \mathcal{V},\; Y_i \in \{0, 1\},\; i = 1, 2, \ldots, n)$,
MSR’ 2022, May 23–24, 2022, Pisburgh, PA, USA David Hin, Andrey Kan, Huaming Chen, and M. Ali Babar
Figure 3: LineVD Overall Architecture. An input function is split into statements and tokenized, CodeBERT extracts function-level and statement-level features, the graph construction stage adds control and data dependency edges from graph extraction, and the classifier learning stage outputs the vulnerable statements.
where $\mathcal{V}$ is the set of all nodes representing a statement of code in the dataset and $Y_i$ is the statement-level label, where 1 is vulnerable and 0 is non-vulnerable. We collectively denote the set of labels $Y_i$ as $\mathcal{Y}$, and $n$ is the number of nodes in the dataset.
For each $V_i$, we utilize the $n$-hop neighborhood graph $\mathcal{G}_i = (N\mathcal{V}_i, N\mathcal{E}_i, X_i)$ to encode $V_i$ with the contextual information from neighboring nodes. $N\mathcal{V}_i$ indicates the neighborhood nodes for $V_i$, $N\mathcal{E}_i$ represents the corresponding edges as an adjacency matrix, $X_i \in \mathbb{R}^{m \times d}$ is the node feature matrix for $V_i$, and $m$ is the number of nodes in $N\mathcal{V}_i$. The goal of LineVD is to learn a mapping $f : \mathcal{V} \to \mathcal{Y}$ to determine the label of a given node; i.e., whether a statement is vulnerable or not. The prediction function $f$ can be learned by minimizing the following loss function:

$$\min \sum_{i=1}^{n} \mathcal{L}\left(f(\mathcal{G}_i, Y_i \mid V_i)\right) \quad (1)$$

where $\mathcal{L}$ is the cross-entropy loss function.
3.2 Approach Overview
In this section, we present a GNN-based approach to identify vulnerable statements. One fundamental assumption is that the identified data and control dependencies between statements can sufficiently serve as the contextual information for the statement-level SVD task. Furthermore, we propose a novel architecture to better leverage the semantics conveyed within and between statements; the overall framework is illustrated in Fig. 3. LineVD can be divided into three main components, described in the following sections.
3.2.1 Feature Extraction. Given a snippet of source code, achieving an informative and comprehensive code representation is critical for subsequent model construction. LineVD is first designed to extract code features using a transformer-based method, which is considered effective for source code related tasks with self-supervised learning objectives.
LineVD takes a single function of source code as the raw input. The function is processed and split into individual statements $V_i$, and each sample is first tokenized via CodeBERT's pretrained BPE tokenizer. Following the collection of $V = \{V_1, V_2, \ldots, V_n\}$, the entire function and the individual statements comprising the function are passed into CodeBERT. Thus, the function-level and statement-level code representations can be acquired.
Specifically, LineVD embeds the function-level and statement-level code separately, rather than aggregating the statement-level embeddings to obtain the function-level embedding. CodeBERT is a bimodal model, meaning it was trained on both the natural language description of a function and the function code itself. As input, it uses a special separator token to distinguish the natural language description from the function code. While natural language descriptions for the functions are not accessible, a common practice from the literature is applied in this work: each input is prepended with an additional separator token, leaving the description blank. For the output of CodeBERT, we utilize the embedding of the classification token, which is suited for code summarisation tasks. This allows us to better leverage the powerful pretrained source code summarisation capability of the CodeBERT model.
Overall, the feature extraction component of LineVD using CodeBERT produces $n + 1$ feature embeddings: one embedding for the overall function, and $n$ embeddings, one for each statement, which we denote as $X^v = \{x^v_1, x^v_2, \ldots, x^v_n\}$.
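A minimal sketch of this feature extraction step is shown below, assuming the Hugging Face transformers package and the public microsoft/codebert-base checkpoint; it omits the extra separator token for the blank natural language description and other details of the released implementation.

# Sketch of the feature extraction step: one CodeBERT embedding for the whole
# function and one per statement, taken from the classification ([CLS]) token.
# Assumes the Hugging Face `transformers` package and the public
# `microsoft/codebert-base` checkpoint; LineVD's released code may differ in detail.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
codebert = AutoModel.from_pretrained("microsoft/codebert-base").eval()

def embed(texts):
    """Return one [CLS] embedding per input string (shape: len(texts) x 768)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = codebert(**batch)
    return out.last_hidden_state[:, 0, :]   # [CLS] token embedding

function_code = "int add_one(int x) { return x + 1; }"
statements = ["int add_one(int x) {", "return x + 1;", "}"]

func_emb = embed([function_code])   # 1 x 768 function-level embedding
stmt_embs = embed(statements)       # n x 768 statement-level embeddings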
3.2.2 Graph Construction. In LineVD, we focus on the data and control dependency information, for which we have introduced the graph attention network (GAT) model [53]. As discussed in Sec. 2.2, graph neural networks (GNNs) learn from graph-structured data based on an information diffusion mechanism rather than squashing the information into a flat vector; they update the node states according to the graph connectivity to preserve the important information, i.e., the topological dependency information [51].
As shown in Fig. 3, a graph attention network is used to construct the model for learning topological dependency information from the graph. A GAT layer first takes the $n$ output statement embeddings from CodeBERT along with the edges between each node. The graph structure for the function, including the node and edge information, is extracted and provided to the GAT. Self-loops are added to include a given node within its own neighborhood graph. We initialize the GAT layer state vectors $\{h^{(l)}_1, h^{(l)}_2, h^{(l)}_3, \ldots, h^{(l)}_n\}$ with $X^v$, where $l$ indicates the current layer. LineVD propagates information along the data and control dependencies (i.e., the program dependence graph)
between neighboring statements in an incremental manner. Therefore, two graph attention networks are implemented in the LineVD architecture. Overall, GAT is defined by its use of attention over the features of neighbors in its aggregation function, given in Eqs. (2)–(5):
$$z^{(l)}_i = W^{(l)} h^{(l)}_i \quad (2)$$

$$e^{(l)}_{ij} = \text{LeakyReLU}\left(\vec{a}^{(l)T} \left(z^{(l)}_i \,\|\, z^{(l)}_j\right)\right) \quad (3)$$

$$\alpha^{(l)}_{ij} = \frac{\exp\left(e^{(l)}_{ij}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(e^{(l)}_{ik}\right)} \quad (4)$$

$$h^{(l+1)}_i = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha^{(l)}_{ij} z^{(l)}_j\right) \quad (5)$$
where $l$ is the current layer, $h^{(l)}_i$ is the node embedding vector at the current layer, $W^{(l)}$ is a learnable weight matrix, and $\vec{a}$ is a learnable weight vector.
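For illustration, the following sketch applies the aggregation of Eqs. (2)–(5) using DGL's GATConv layer; the number of heads, feature dimensions, and layer count here are placeholders rather than LineVD's tuned configuration.

# Sketch of GAT layers applying Eqs. (2)-(5) over statement nodes.
# Uses DGL's GATConv as an illustrative stand-in; LineVD's released
# implementation may configure heads, dimensions and layer count differently.
import torch
import dgl
from dgl.nn import GATConv

num_stmts, feat_dim, hidden_dim = 5, 768, 128

# PDG edges as (source, destination) node index pairs, plus self-loops
src = torch.tensor([0, 1, 2, 3])
dst = torch.tensor([1, 2, 3, 4])
g = dgl.add_self_loop(dgl.graph((src, dst), num_nodes=num_stmts))

x = torch.randn(num_stmts, feat_dim)    # statement embeddings from CodeBERT

gat1 = GATConv(feat_dim, hidden_dim, num_heads=1)
gat2 = GATConv(hidden_dim, hidden_dim, num_heads=1)

h = torch.relu(gat1(g, x)).squeeze(1)   # (num_stmts, hidden_dim)
h = torch.relu(gat2(g, h)).squeeze(1)   # second hop of propagation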
3.2.3 Classifier Learning. As multilayer perceptron (MLP) models are frequently evaluated as among the top classifiers [31, 43], we leverage the learning capability of a deep neural network to train the classifier. In this work, the goal is to train a model that can jointly learn from the function-level and statement-level code simultaneously.
To achieve this, we consider that both function-level and statement-level code contribute equally to the prediction outcomes. Thus, we build a shared set of linear and dropout layers taking as input the function-level CodeBERT embedding and the statement embeddings obtained from the GAT layer. The ReLU [44] function serves as the activation function. In addition, LineVD keeps its prediction outcomes consistent: it incorporates an element-wise multiplication between the output class of each statement and the output class of the function-level embedding, which is either one or zero. This operation resolves conflicting outputs between the function-level and statement-level embeddings; i.e., if the output class of the function-level embedding is zero, then all statement-level outputs are also zero. The intuition is that a non-vulnerable function cannot have vulnerable lines. LineVD outputs the predictions corresponding to the statements in the input function. The cross-entropy loss function is used to train LineVD.
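The sketch below illustrates this masking idea with a shared MLP head and an element-wise multiplication between the function-level and statement-level predicted classes; the layer sizes and the way the two cross-entropy losses are combined are illustrative assumptions rather than the exact released configuration.

# Sketch of the classifier-learning step: a shared MLP head scores both the
# function embedding and the statement embeddings, and the statement
# predictions are masked by the function-level prediction so that a function
# classified as non-vulnerable yields no vulnerable statements.
# Dimensions, layer sizes and the loss combination are illustrative only.
import torch
import torch.nn as nn

hidden_dim, num_stmts = 128, 5

shared_head = nn.Sequential(
    nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 2)
)

func_emb = torch.randn(1, hidden_dim)           # function-level embedding
stmt_embs = torch.randn(num_stmts, hidden_dim)  # statement embeddings from the GAT

func_logits = shared_head(func_emb)             # (1, 2)
stmt_logits = shared_head(stmt_embs)            # (num_stmts, 2)

func_pred = func_logits.argmax(dim=1)               # 0 = non-vulnerable, 1 = vulnerable
stmt_pred = stmt_logits.argmax(dim=1) * func_pred   # element-wise masking

# Training combines cross-entropy losses on both outputs (illustrative):
loss_fn = nn.CrossEntropyLoss()
stmt_labels = torch.zeros(num_stmts, dtype=torch.long)
func_label = torch.tensor([0])
loss = loss_fn(stmt_logits, stmt_labels) + loss_fn(func_logits, func_label)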
4 EXPERIMENTAL DESIGN AND SETUP
In this section, we report the experimental design details, including the research questions and evaluation process. In particular, we present the dataset details for the empirical evaluation. We further discuss the applied evaluation metrics, which are chosen to thoroughly quantify model performance.
4.1 Research Questions
To explore the statement-level vulnerability detection task and investigate the performance of LineVD, we motivate and answer the following research questions:
RQ1: How much performance improvement can LineVD achieve in comparison with the state-of-the-art interpretation-based SVD model? To evaluate the relative improvement of LineVD in the statement-level SVD task, we choose to compare against the state-of-the-art interpretation-based model. The comparison is conducted using two different kinds of measures: binary classification metrics and ranked metrics for vulnerable statements.
RQ2: How do different code embedding methods affect statement-level vulnerability detection? Code embedding methods have not yet been explored for statement-level SVD, in contrast to SVD at other levels of granularity.
RQ3: How do graph neural networks and function-level information contribute to LineVD performance? The effect of information propagation using graph neural networks on statement-level SVD has yet to be explored.
RQ4: How does LineVD perform in a cross-project classification scenario? While training on a dataset containing multiple projects already reduces misrepresentation of model generalisability, it is still possible for samples from the same project to appear in both the training and test set. Using a cross-project scenario can better represent how the model performs on a completely unseen project, rather than only unseen samples.
RQ5: Which statement types are best distinguished by LineVD for real-world data? Investigating the model prediction outcomes from the perspective of statement types, particularly for real-world data, can help to understand where the model performs best and where it fails, which can guide future work and improvements in statement-level SVD.
4.2 Datasets
Recent research suggests that SVD models should be evaluated on data that represents the distinct characteristics of real-world vulnerabilities [10]. This means evaluating on source code extracted from real-world projects (i.e., non-synthetic) while maintaining an imbalanced ratio, which is inherent to vulnerabilities in software projects. Using datasets that do not satisfy these conditions would result in inconsistent model performance when applied in real-world scenarios. Another dataset requirement is a sufficiently large number of samples, ideally spanning multiple projects, in order to acquire a model that can generalize well to unseen code. The final requirement is access to ground-truth labels at the statement level, or traceability to the before-fix code, i.e., the original git commit.
By extracting code vulnerabilities from over 300 different open-source C/C++ GitHub projects, Big-Vul contains trustworthy source code vulnerabilities spanning 91 different vulnerability types, which are linked to the public Common Vulnerabilities and Exposures (CVE) database [19]. A substantial amount of manual effort has also been dedicated to ensuring the quality of the dataset. It also provides enriched information including CVE IDs, CVE severity scores, and, in particular, the code changes, along with other metadata. To the best of our knowledge, Big-Vul provides the best fit for the requirements of modelling code-centric vulnerability detection, containing approximately 10,000 vulnerable samples and 177,000 non-vulnerable samples. This large diversity in projects, rather than focusing on a limited
MSR’ 2022, May 23–24, 2022, Pisburgh, PA, USA David Hin, Andrey Kan, Huaming Chen, and M. Ali Babar
number of projects in particular, allows for better representation
of all open-source C/C++ projects (see Table 5 for the top 10 most
common projects in the dataset).
4.2.1 Ground-truth Labels. To obtain the ground-truth labels for the vulnerable and non-vulnerable lines, we follow the assertions from the literature [19, 32] rather than proposing our own heuristics: (1) removed lines in a vulnerability-fixing commit serve as an indicator of a vulnerable line, and (2) all lines that are control or data dependent on the added lines are also treated as vulnerable. The reasoning for the second point is that any added lines in a vulnerability-fixing commit were added to help patch the vulnerability. Hence, lines that were not modified in the vulnerability-fixing commit, but are related to these added lines, can be considered related to the vulnerability.
To obtain labels corresponding to lines that are dependent on the added lines, we first obtain the code changes from the before and after versions of the sample, where a sample in Big-Vul refers to a function-level code snippet. For the before version, we remove all the added lines, and for the after version, we remove all the deleted lines. In both cases, we keep blank placeholder lines to ensure line number consistency. The code graph extracted from the after version can be used to find all lines that are control or data dependent on the added lines, whose line numbers correspond to the before version. This set of lines can be combined with the set of deleted lines to obtain the final set of vulnerable lines for a single sample. Commented lines are excluded from the code graph, and hence are not used for training or prediction. This can be seen in Fig. 1 and 2; in this case, there is only a modified line, which is treated as both a deleted line (22) and an added line (23). The control and data dependency edges happen to be the same for both the before and after versions, and hence we can use Fig. 2 to identify the lines that are control/data dependent on line 23, which are lines 3, 19, and 21.
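A simplified sketch of this labelling procedure is given below; it assumes dependency edges have already been extracted (e.g., by Joern) with line numbers mapped back to the before version, and it treats dependency edges as undirected for brevity.

# Sketch of the ground-truth labelling heuristic: deleted lines in the fixing
# commit are vulnerable, and so are lines that share a control/data dependency
# edge with an added line in the after-version graph (simplified here).
import networkx as nx

def vulnerable_lines(deleted_lines, added_lines, after_pdg_edges):
    """deleted_lines/added_lines: sets of line numbers from the fixing commit.
    after_pdg_edges: (line, line) dependency edges from the after version,
    with line numbers kept consistent via blank placeholder lines."""
    pdg = nx.Graph()
    pdg.add_edges_from(after_pdg_edges)
    dependent = set()
    for added in added_lines:
        if added in pdg:
            dependent.update(pdg.neighbors(added))  # lines dependent on an added line
    return set(deleted_lines) | (dependent - set(added_lines))

# Example from Fig. 1/2: line 22 deleted, line 23 added, lines 3, 19 and 21
# share dependency edges with line 23 in the after-version graph.
labels = vulnerable_lines({22}, {23}, [(3, 23), (19, 23), (21, 23)])
print(sorted(labels))   # [3, 19, 21, 22]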
4.2.2 Dataset Cleaning. We performed multiple filtering steps on the Big-Vul dataset. First, we remove all comments from the code. Second, we ignore code changes that are purely cosmetic, such as changes to whitespace, and consequently remove any functions with no non-cosmetic code changes. Third, we remove improperly truncated functions. A few samples in the original dataset were truncated incorrectly, resulting in unparsable and invalid code samples; for example, a function that was originally 50 lines may be incorrectly truncated to 40 lines for no apparent reason. The cause may be an error in how the dataset was originally constructed; however, there were only 30 such samples in the whole dataset. We use a random training/validation/test split ratio of 80:10:10. For the training set, we undersample the number of non-vulnerable samples to produce an approximately balanced dataset at the function level, while the test and validation sets are left in the original imbalanced ratio. We choose to balance the samples at the function level as it is non-trivial to balance at the statement level while maintaining the contextual dependencies between statements within a function.
4.3 Evaluation Metrics
We report F1 score, precision, recall, area under the receiver operating characteristic curve (ROCAUC), and area under the precision-recall curve (PRAUC). While ROCAUC is widely used to directly measure the predictive power of the model without choosing a specific threshold, it cannot fully reflect the effectiveness when dealing with imbalanced datasets. Hence, in addition to reporting ROCAUC for comparison with past literature, we also use PRAUC, which is better suited for imbalanced problems.
While binary classification is the primary focus, as described in Section 3.1, we also report ranked metrics to evaluate the performance of the most confident predictions of the model. The ranked metrics include mean average precision (MAP), normalized discounted cumulative gain (nDCG), and mean first ranking (MFR). Here, we define first ranking as the rank of the first correctly predicted vulnerable statement. We care about how well the model performs at a certain number of $k$ most confident lines, as we can thus further limit the amount of code that needs to be reviewed by the developer. In this work, we choose $k = 5$ as an effective representation of performance at the top, following the recommendation of Li et al. [32]. However, in practical usage, this threshold could be adjusted by the user. Finally, we report accuracy of interpretation given N nodes, where N is equal to 5 (denoted as N5 in Table 1). This can be interpreted as function-level accuracy given only the prediction results of the top five lines.
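For clarity, the following sketch computes the first ranking and a top-k hit for a single function from per-statement scores; the exact metric definitions follow prior work, and this is only an illustrative computation.

# Sketch of the ranked metrics on one function: statements are sorted by the
# model's vulnerability score, "first ranking" is the 1-based rank of the first
# truly vulnerable statement, and the top-k check (k = 5) underlies N5/MAP@5.
def first_ranking(scores, labels):
    """scores: per-statement vulnerability scores; labels: 1 = vulnerable."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, idx in enumerate(ranked, start=1):
        if labels[idx] == 1:
            return rank
    return None

def hit_at_k(scores, labels, k=5):
    fr = first_ranking(scores, labels)
    return fr is not None and fr <= k

scores = [0.10, 0.80, 0.30, 0.75, 0.05]
labels = [0, 0, 0, 1, 0]
print(first_ranking(scores, labels))   # 2 -> contributes 2 to MFR
print(hit_at_k(scores, labels, k=5))   # True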
Due to the imbalanced nature of the dataset, we first find the best threshold for the F1-score using the validation set before calculating the F1-score on the test set. When determining the significance of improvements, we use the Wilcoxon signed-rank test [56] on the F1-scores of ten runs with random seeds. The best models according to the automated hyperparameter tuning are used for the test.
4.4 Hyperparameters
Hyperparameters of LineVD were tuned using Ray Tune [36] with randomized grid search. The hyperparameter details can be found in the publicly released source code. The reported scores correspond to the mean test results across ten runs using the best hyperparameters based on the loss on the validation set.
4.5 Experimental Design of Research Questions
4.5.1 RQ1. In RQ1, we focus on the comparison between our proposed model and existing methods from the literature. For the SVD task, a practical experimental setting is that the vulnerability status remains unknown for a given piece of code, regardless of whether it is at the function level or file level. Thus, leveraging interpretable ML models to validate and interpret the prediction results of SVD models is a dominant approach. In this work, we compare LineVD with the state-of-the-art interpretation-based SVD model, namely IVDetect [32]. IVDetect is designed as an interpretable vulnerability detector which combines artificial intelligence to detect vulnerabilities with an intelligence assistant that provides statement-level interpretations of the detected vulnerabilities. It has presented state-of-the-art detection and localization capabilities among graph-based software vulnerability detection and interpretation models. While it has included an empirical evaluation against the existing
function-level deep learning-based approaches, we provide a full replication of IVDetect for the RQ1 comparison in terms of vulnerable statement identification performance.
4.5.2 RQ2. In RQ2, we investigate the impact of utilising pretrained embedding methods for statement-level vulnerability prediction. As vectorising the code statement information (also the node information in this work) is a critical step, we include different feature embedding methods, namely CodeBERT, Doc2Vec, and averaged GloVe embeddings, to understand their impact on prediction performance. The latter two baselines were chosen to represent the embedding methods usually seen in vulnerability detection models [9, 12, 32, 34, 35]. GloVe and Word2Vec embeddings usually perform similarly in the vulnerability detection context [45], and thus we only use GloVe, which was also the word embedding technique used in IVDetect [32]. For classification of the embeddings, we use multiple hidden layers, and freeze all parameters of the CodeBERT model during training.
4.5.3 RQ3. In RQ3, we explore the effect of introducing GNN layers into the feature extraction component of the model. In this experiment, we compare the use of two popular GNN variants: Graph Convolutional Networks (GCN) [28] and Graph Attention Networks (GAT) [53]. These two GNN types were chosen to explore the effect of different GNN architectures on statement-level vulnerability classification. The GNN layers are inserted after the feature extraction component and before the final hidden layers. We also test two different types of graphs extracted from the source code: Program Dependence Graphs (PDG) and Control Dependence Graphs (CDG). PDGs consist of both data dependency edges and control dependency edges. Data dependency edges describe how data flows between statements in the program (which statements are influenced by which variables), while control dependency edges describe the order in which statements execute, as well as whether they execute at all. We use both variants to explore whether GCNs and GATs can utilize data dependency information from related statements to better distinguish vulnerable statements. The control and data dependencies are extracted using the Joern program [57]. We remove samples that either cannot be parsed by Joern, or are correctly parsed but do not contain control or data dependency edges. Since the nodes produced by Joern are not necessarily at the line level, we group together nodes with the same line number (see the sketch below), and remove nodes with no line numbers (e.g., metadata nodes).
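A simplified sketch of this grouping step is shown below; the node and edge format is a stand-in for Joern's exported graph rather than its actual output schema.

# Sketch of collapsing statement-internal Joern nodes to line-level nodes:
# nodes sharing a line number are merged, edges between nodes on the same line
# are collapsed, and nodes without a line number (e.g. metadata nodes) are dropped.
def group_nodes_by_line(nodes, edges):
    """nodes: {node_id: line_number_or_None}; edges: (src_id, dst_id) pairs.
    Returns the set of line-level dependency edges."""
    node_to_line = {n: ln for n, ln in nodes.items() if ln is not None}
    line_edges = set()
    for src, dst in edges:
        if src in node_to_line and dst in node_to_line:
            a, b = node_to_line[src], node_to_line[dst]
            if a != b:
                line_edges.add((a, b))
    return line_edges

nodes = {1: 21, 2: 21, 3: 22, 4: None}    # two nodes on line 21, one metadata node
edges = [(1, 2), (2, 3), (3, 4)]
print(group_nodes_by_line(nodes, edges))  # {(21, 22)}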
In addition to the GNN component, we simultaneously explore how to incorporate the function-level information, which may benefit LineVD by reducing the false positive rate of statement-level classification. This is done by training on the function-level label, with the function embedded using CodeBERT.
4.5.4 RQ4. In RQ4, we explore how LineVD performs in a cross-project scenario; i.e., samples from the target project do not appear in the source projects used for model training. We simulate this by producing multiple splits where the test set is made up of a single project, and the remaining projects are used in the training and validation sets. We choose the top 10 projects with the most vulnerable samples in the dataset to use as cross-project splits.
4.5.5 RQ5. In RQ5, we examine which statement types (e.g., if-statement, goto-statement) LineVD can correctly distinguish. We use the node types given by Joern. For the "Control Structure" node type, we replace it with the given control structure type (e.g., if, while, for). For "Function Call" node types, we split them into two different categories: "built-in" function calls, which are functions that appear in the C standard library, and "external" function calls, which are any other functions not found in the standard C library. This is an important distinction, as each sample in the dataset consists of a code snippet at the function level. This means the only information present in external function calls is the identifier name itself, along with its arguments, rather than the contents of the external function. For "Operator" node types, we group them into the following categories: assignment, arithmetic, comparison, access, and logical. Any operator node types that do not fall into these categories are grouped into "other".
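The sketch below illustrates how such a bucketing could be implemented; the Joern type strings and the category map are illustrative assumptions, and the released code defines the exact mapping used for Table 6.

# Sketch of bucketing statements into the categories reported in Table 6.
# The Joern type strings and the category map below are illustrative; the
# released code defines the exact mapping.
OPERATOR_CATEGORIES = {
    "assignment": "Assignment Operation",
    "addition": "Arithmetic Operation",
    "lessThan": "Comparison Operation",
    "indirectFieldAccess": "Access Operation",
    "logicalAnd": "Logical Operation",
}
C_STDLIB = {"memcpy", "strlen", "malloc", "free", "printf"}  # illustrative subset

def statement_category(node_type, name=None):
    if node_type == "CONTROL_STRUCTURE":
        return f"{name.capitalize()} Statement"          # e.g. "If Statement"
    if node_type == "CALL" and name and name.startswith("<operator>."):
        op = name.split(".", 1)[1]
        return OPERATOR_CATEGORIES.get(op, "Other Operation")
    if node_type == "CALL":
        return "Builtin Function Call" if name in C_STDLIB else "External Function Call"
    return node_type.replace("_", " ").title()

print(statement_category("CALL", "<operator>.assignment"))  # Assignment Operation
print(statement_category("CALL", "memcpy"))                 # Builtin Function Call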
5 RESULTS
We run our experiments on a computing cluster utilizing multiple
NVIDIA Tesla V100 GPUs and Xeon E5-2698v3 CPUs operating at
2.30 GHz.
5.1 RQ1: How much performance advantage can LineVD achieve in comparison with the state-of-the-art SVD model?
Table 1 and Table 2 summarise the performance comparison of LineVD with respect to IVDetect, a recently proposed fine-grained vulnerability detection approach, using various evaluation measures. We show that LineVD significantly outperforms the IVDetect method on all metrics. In Table 1, the ranked metrics are included to show the accuracy of the different models; a higher accuracy means that the model can generate more precise vulnerability detection at the statement level. As seen in Table 1, LineVD largely outperforms IVDetect on all ranked metrics (p < 0.01). For the ranked metrics, LineVD improves N5 by 0.205, from an accuracy of 0.695 to 0.900, and increases MAP@5 by 0.336 over IVDetect, from 0.424 to 0.760. For MFR, LineVD improves the performance by 4.373 ranks, which significantly increases the efficiency of correctly locating the first vulnerable statement, by 114.5%.
When examining the distribution of rankings for LineVD, we find that it is extremely skewed towards the lower end, with some highly incorrect predictions raising the MFR. The distribution of first-rank scores is plotted in Fig. 4, which shows that the majority (89%) of first ranking scores are between 1 and 5. This suggests that LineVD may struggle with certain types of longer functions, but correctly ranks them in most cases.
Table 2 provides the binary classification metrics, which are the most straightforward evaluation (whether or not a line is correctly predicted as a vulnerable statement) and the one directly aligned with our problem definition. Overall, LineVD outperforms IVDetect by 104% (p < 0.01) in F1-score. In particular, for practical usage, we notice that the recall score has also been substantially improved, from 0.140 to 0.533, indicating that LineVD has boosted the ability to correctly determine and locate the vulnerable statements.
We also note another advantage of our architecture regarding its efficiency in generating statement-level predictions compared to IVDetect. The use of an explanation model significantly increases the inference time for a single sample, as an entire model must
MSR’ 2022, May 23–24, 2022, Pisburgh, PA, USA David Hin, Andrey Kan, Huaming Chen, and M. Ali Babar
Figure 4: Histogram of first rankings of LineVD on the default test set. First ranking is defined as the first true-positive statement in a sorted list of softmax scores assigned to each statement for a given function. E.g., in 575 samples, a vulnerable statement is first in the ranked list of vulnerable statement predictions; in 67 samples, a vulnerable statement is second in the list, etc.
be trained to find a subgraph that explains the model's prediction, compared to the single forward pass required by LineVD. Using an NVIDIA Tesla V100, a single inference using GNNExplainer can take over a minute, depending on its configuration, while the single forward pass in LineVD takes less than a second.
5.2 RQ2: How do different code embedding methods affect statement-level vulnerability detection?
We evaluate the effectiveness of CodeBERT against other code embedding methods. As shown in Table 3, CodeBERT outperforms the baseline feature embedding methods, Doc2Vec and GloVe, in the context of statement-level vulnerability prediction, by 134% (p < 0.01) and 16% (p < 0.05) in F1-score respectively. This is to be expected, as the pre-trained CodeBERT model has over 125 million parameters, which provide an enhanced capability to encode richer information from code snippets in a larger and deeper model compared to the other models with fewer layers and parameters.
Table 1: RQ1: Statement-level Performance (Ranked)
Methods N5 MAP@5 NDCG@5 MFR
IVDetect 0.695 0.424 0.517 8.192
LineVD 0.900 0.760 0.804 3.819
Table 2: RQ1: Statement-level Performance (classification)
Methods F1 Rec Prec ROCAUC PRAUC
IVDetect 0.176 0.140 0.238 0.463 0.520
LineVD 0.360 0.533 0.271 0.913 0.642
We also note that, while CodeBERT has the advantage over GloVe and Doc2Vec of being pre-trained on a large corpus, it has the disadvantage of being trained on an external dataset consisting of non-C/C++ code in other programming languages. In comparison, the GloVe and Doc2Vec models are trained directly on our C/C++ dataset. Although Doc2Vec should in principle better encode the contextual information of a statement, GloVe achieves the second-best performance using the averaged aggregation method. Nonetheless, CodeBERT provides the best code embedding method in these comparative experiments against the most popular code embedding methods for SVD to date.
Table 3: RQ2: Feature Embedding Methods for Statement-level Vulnerability Classification
Embedding F1 Rec Prec ROCAUC PRAUC
Doc2Vec 0.064 0.167 0.040 0.580 0.508
GloVe 0.129 0.166 0.106 0.666 0.529
CodeBERT 0.150 0.254 0.121 0.703 0.534
5.3 RQ3: Can graph neural networks and function-level information benefit statement-level classification?
Table 4 shows the effect of different experimental settings involving GNN type, program graph type, and whether or not the function-level classification component is included in the model. As can be seen from Table 4, using a graph attention network to learn features from PDG information is the best fit, and we have hence included it as the graph component in LineVD. When comparing this GNN feature extraction combination with the model variation without a GNN, we achieve an increase of 24% in F1 score (p < 0.01), from 0.296 to 0.360, indicating that the presence of the GNN benefits the model's learning capability. However, the performance of other graph-based combinations is generally comparable to using the model without a GNN. This suggests that the program graph type and GNN type non-trivially affect the performance of the model. In particular, using only control dependency edges generally results in lower performance than using both control and data dependencies, suggesting that data dependency edges are important for predicting statement-level vulnerabilities.
In addition, we find that GCN achieves worse performance than GAT for all model types, which is to be expected, as the feature aggregation in GCN treats all neighboring nodes equally, unlike GAT, which attends to certain neighbors. However, when using only statement-level information for classification (i.e., no function-level information involved), the use of GCN with either of the graph types, or a GAT with CDG, results in worse performance than using no GNN. Finally, we find that enriching the model with function-level information significantly increases the statement-level performance, regardless of whether a GNN is used. Using the GAT+PDG combination, we achieve a performance increase of 140% (p < 0.01).
Table 4: RQ3: Graph-based Feature Variants for Statement-level Vulnerability Classification
Model Type F1 Rec Prec ROCAUC PRAUC
GAT+CDG 0.115 0.120 0.112 0.657 0.528
GAT+CDG+Func 0.304 0.491 0.221 0.907 0.624
GAT+PDG 0.150 0.254 0.121 0.703 0.534
GAT+PDG+Func 0.360 0.533 0.271 0.913 0.642
GCN+CDG 0.084 0.143 0.060 0.632 0.514
GCN+CDG+Func 0.283 0.558 0.190 0.911 0.597
GCN+PDG 0.085 0.125 0.067 0.599 0.513
GCN+PDG+Func 0.310 0.460 0.235 0.905 0.616
No GNN 0.129 0.166 0.106 0.666 0.529
No GNN+Func 0.296 0.537 0.205 0.921 0.619
Table 5: RQ4: Cross-project Statement-level Prediction
Project F1 Rec Prec ROCAUC PRAUC Vuln
Chromium 0.298 0.470 0.219 0.923 0.625 3103
Linux 0.301 0.502 0.216 0.925 0.630 1847
Android 0.290 0.494 0.208 0.922 0.625 962
ImageMagick 0.333 0.504 0.249 0.925 0.644 331
PHP 0.290 0.487 0.208 0.928 0.622 200
TCPDump 0.284 0.452 0.207 0.925 0.622 197
OpenSSL 0.298 0.508 0.211 0.926 0.628 157
Krb5 0.259 0.535 0.186 0.903 0.605 139
QEMU 0.250 0.478 0.177 0.910 0.601 120
FFmpeg 0.269 0.483 0.193 0.917 0.612 115
5.4 RQ4: How does LineVD perform in a cross-project classification scenario?
From Table 5, we can see that performance is generally consistent across the different projects. Due to the nature of the various projects, there are varying numbers of vulnerable samples for each project. The top 10 software projects are reported in Table 5. While the performance varies, in general the results are all slightly lower than those from the random splits in Table 2. This is to be expected, as even partial inclusion of within-project information such as variable and function names may assist the model in distinguishing vulnerable samples.
In Table 5, the investigated projects are ranked according to the number of vulnerabilities, which is indicated in the last column 'Vuln'. We note that the Chromium and Linux splits contain a disproportionately large number of vulnerable samples; the Chromium split alone accounts for 30% of all vulnerable samples in the dataset. Despite this, the performance is competitive with the other splits, indicating that the model attains high generalisability across different settings and achieves comparable performance even with a smaller training set, as shown in Table 5.
Table 6: RQ5: Analysis of Statement-level Predictions
Statement Type TP FP TN FN F1
Function Declaration 530 364 17572 126 0.68
While Statement 48 72 1374 23 0.50
Builtin Function Call 142 246 4537 72 0.47
Logical Operation 90 159 4601 60 0.45
Switch Statement 18 31 1424 31 0.37
For Statement 75 195 3632 88 0.35
If Statement 749 1930 50969 855 0.35
Assignment Operation 1206 3792 81490 1051 0.33
Other Operation 62 200 5401 54 0.33
Jump Target 18 53 11119 19 0.33
Arithmetic Operation 27 106 1379 10 0.32
Return Statement 166 526 27332 186 0.32
External Function Call 644 2212 63378 582 0.32
Comparison Operation 19 64 1489 17 0.32
Access Operation 44 168 3488 48 0.29
Cast Operation 25 115 3146 28 0.26
Continue 2 13 1039 1 0.22
Break 5 42 7966 43 0.11
Goto Statement 3 15 5115 48 0.09
5.5 RQ5: Which statement types are best distinguished by LineVD?
Table 6 shows the raw prediction results for the different statement types. We report the confusion matrix values for each statement type and sort them by F1 score. We see that assignment operations and external function calls are the most commonly occurring vulnerable lines, and that the model most often correctly classifies Function Declaration statements. This could be due to the significantly higher level of information in function declarations compared to other statements, which can consist of the return type, function name, parameter types, and parameter names. Operation-related statements generally perform similarly, except for logical operations, which LineVD can generally distinguish better, and access and cast operations, which achieve relatively poorer performance.
LineVD performs better on built-in function call statements than on external function calls. This is somewhat expected, as the same built-in functions are likely to appear across more samples, unlike user-defined function names. In addition, due to the function-level nature of the dataset, the only information in external function call statements comes from the function name itself and the arguments passed. Without knowledge of what occurs within an external function call, it can be difficult to find distinguishing patterns relating to their vulnerability.
LineVD struggles most with continue, break, and goto statements. These are all control structure nodes that affect the control dependency graph, and they do not contain any direct statement-level information that could distinguish them from each other. Hence, to distinguish the vulnerability of these statements, the model must rely on patterns relating to the context in which they appear. Despite using a GNN to capture these relationships, these statements still cannot be classified correctly. Hence, while the use of a
MSR’ 2022, May 23–24, 2022, Pisburgh, PA, USA David Hin, Andrey Kan, Huaming Chen, and M. Ali Babar
GNN does improve overall performance, these results indicate that neither GCN nor GAT is sufficient for effectively distinguishing these particular control statements. In contrast, statement types like function declarations, function calls, and for/while/if control structures often contain significantly more information within the statement, and hence can be distinguished more easily. This focus on more complex control structure statement types and built-in function calls is suitable for practical usage, as they are more likely to be directly involved in the vulnerability.
6 DISCUSSION
In this section, we discuss the threats to validity and limitations of
our work, along with ideas for future work.
6.1 Threats to Validity
The first threat relates to potentially sub-optimal hyperparameter tuning; we rely on a random grid search for determining the combinations of hyperparameters; however, it would be close to impossible to exhaustively test all combinations of hyperparameters. To mitigate this threat, we adopt best practices where possible when choosing the default range of values for the hyperparameter search, and tune each model variation with the same number of random parameter configurations.
There has also been recent work on the effect of time-based validation on software defect prediction [18], which could also be relevant to vulnerability detection. We mitigate this threat by also exploring cross-project vulnerability prediction, since time-based validation primarily applies to within-project prediction. Furthermore, any graph-based model should intuitively have similar capabilities in regards to learning patterns associated with vulnerabilities, and hence the performance loss (or gain) should be linear across the different baseline models. However, this is something that could be tested more thoroughly in future works.
6.2 Limitations and Future Work
While LineVD outperforms the current state-of-the-art, the performance is still quite low, meaning there is still much room for improvement in the statement-level SVD task. A particular area of interest is finding the best way to produce better code embeddings. While CodeBERT can perform well, it is not trained on C/C++. The use of a large language model pretrained on the specific target language (e.g., C-BERT [8]) could improve the performance of downstream tasks. However, at the time of writing, there were no openly released C/C++-based large pretrained models.
Another limitation regards the learning capability of the GNN layer, which inherently only propagates information from immediately neighboring nodes (i.e., 1-hop). To propagate longer-range information, multiple graph neural network layers can be stacked; for example, two GNN layers would propagate information across a 2-hop neighborhood for each node. However, using more than just a few layers in most graph neural network architectures has been shown to result in over-squashing, where the exponentially growing information between each layer cannot be captured within a fixed-length vector [4], resulting in bottlenecked performance. This is consistent with our findings: we tuned the number of GNN layers during experimentation but did not find any significant increase in performance beyond two layers. Future work could explore novel GNN architectures that can better capture vulnerability patterns in program graphs.
7 RELATED WORK
To recap the state-of-the-art in software vulnerability detection research, we introduce the related work from two perspectives: 1) the application of GNNs to SVD, and 2) the explainability of machine learning models for SVD.
7.1 Software vulnerability detection with GNN
Detecting software vulnerabilities is an ongoing topic for securing software systems against cyber attacks. Recent advances have advocated the application of GNNs and their variants for function-level vulnerability detection, which is considered the best way of representing source code in the SVD context [2, 58], and have demonstrated enhanced performance over other approaches [9, 17, 32].
The introduction of GNNs for modelling vulnerabilities was originally inspired by the vulnerability discovery approach proposed by Yamaguchi et al. [57] using Code Property Graphs, a type of program graph incorporating program dependency edges [26, 39], control flow edges, and the abstract syntax tree of the program, which provide an additional source of information to learn from [14]. Hence, the performance improvements using GNNs can primarily be attributed to leveraging the domain knowledge that lines of source code within a program have specific relationships to other lines; i.e., training using both semantic and syntactical information, rather than only syntactical information.
To achieve this goal, the lines of source code, which also form the nodes in the graph, are first vectorised according to their code tokens. The vectorised information is then concatenated to form the initial node features, allowing GNNs to capture both semantic and syntactical information. This propagation of information between semantically relevant statements theoretically allows the model to better make use of relevant contextual lines. While GNNs have been shown to perform well in function-level vulnerability classification (graph classification) [9, 10, 12, 37, 58], the effects of vectorisation methods and GNN models on statement-level vulnerability classification (node classification) have yet to be explored.
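To make the distinction concrete, the minimal sketch below (PyTorch, toy tensors assumed) shows what a node-classification head looks like: each statement node receives its own logit and label, so supervision is applied per statement rather than once per function.

```python
# Minimal sketch of a node-classification head for statement-level SVD.
# Toy tensors stand in for real GNN outputs and ground-truth statement labels.
import torch
import torch.nn as nn

num_statements, hidden = 4, 256
node_embeddings = torch.randn(num_statements, hidden)  # per-statement GNN outputs
statement_labels = torch.tensor([0., 0., 1., 0.])       # 1 = vulnerable statement

node_head = nn.Linear(hidden, 1)                        # one logit per node
logits = node_head(node_embeddings).squeeze(-1)         # shape: (num_statements,)
loss = nn.functional.binary_cross_entropy_with_logits(logits, statement_labels)
loss.backward()                                         # gradients computed per statement
```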
7.2 Interpretable machine-learning based models for SVD
One way to improve the practical utility of machine-learning based methods for SVD is the development of explainable detection results, which could provide fine-grained vulnerability prediction outcomes. This raises the importance of research into interpretable machine-learning based models. Beyond their novel machine-learning based models for function-level vulnerability detection, existing works are limited to providing only partial information for explanation generation, i.e., tokens in [59] and intermediate code in [33]. Another approach to the statement-level SVD task is to localize the specific vulnerable statements under the assumption that the input function is already known to be vulnerable [15]. However, this is a notable limitation, as vulnerability detection and localization are usually demanded at the same time.
A recently explored benefit of GNNs for SVD is direct access to explainable GNN approaches [32]. These can be attached to any GNN-based function-level SVD model to obtain fine-grained predictions; i.e., statement-level predictions if the nodes themselves represent statements. However, whether this is the best way to build a statement-level vulnerability classifier has yet to be explored. For example, one disadvantage of using a function-level detector as the base model for interpretation is the inability to directly leverage any statement-level information during the training process, in addition to the significantly longer inference times.
8 CONCLUSION
We introduce LineVD, a novel deep learning approach for statement-level vulnerability detection, which can allow developers to more efficiently evaluate potentially vulnerable functions. LineVD achieves a new state-of-the-art on statement-level vulnerability detection on real-world open source projects by leveraging graph neural networks and statement-level information during training. The significant improvement in comparison to the latest fine-grained machine-learning based model indicates the effectiveness of directly utilizing statement-level information for statement-level SVD. Finally, LineVD achieves reasonable cross-project performance, indicating its effectiveness and generalization capabilities even for completely unseen software projects. Future directions include exploring alternative pretrained feature embedding methods and novel GNN architectures that can better accommodate the underlying nature of software source code and its vulnerabilities.
REFERENCES
[1] [n. d.]. NIST: National Vulnerability Database. https://nvd.nist.gov/.
[2] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. arXiv preprint arXiv:2103.06333 (2021).
[3]
Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning
to Represent Programs with Graphs. In International Conference on Learning
Representations.
[4]
Uri Alon and Eran Yahav. 2020. On the bottleneck of graph neural networks and
its practical implications. arXiv preprint arXiv:2006.05205 (2020).
[5]
Bushra Aloraini, Meiyappan Nagappan, Daniel M German, Shinpei Hayashi,
and Yoshiki Higo. 2019. An empirical study of security warnings from static
application security testing tools. Journal of Systems and Software 158 (2019),
110427.
[6] Authors. [n. d.]. Reproduction package for MSR double-blind review. Retrieved Jan, 2022 from https://github.com/davidhin/linevd
[7]
Ranran Bian, Yun Sing Koh, Gillian Dobbie, and Anna Divoli. 2019. Network
embedding and change modeling in dynamic heterogeneous networks. In Proceed-
ings of the 42nd International ACM SIGIR Conference on Research and Development
in Information Retrieval. 861–864.
[8] Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, et al. 2020. Exploring software naturalness through neural language models. arXiv preprint arXiv:2006.12641 (2020).
[9]
Sicong Cao, Xiaobing Sun, Lili Bo, Ying Wei, and Bin Li. 2021. BGNN4VD:
Constructing Bidirectional Graph Neural-Network for Vulnerability Detection.
Information and Software Technology 136 (2021), 106576.
[10]
Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. 2021.
Deep learning based vulnerability detection: Are we there yet. IEEE Transactions
on Software Engineering (2021).
[11] Cen Chen, Kenli Li, Sin G Teo, Xiaofeng Zou, Kang Wang, Jie Wang, and Zeng Zeng. 2019. Gated residual recurrent graph neural networks for traffic prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 485–492.
[12] Xiao Cheng, Haoyu Wang, Jiayi Hua, Guoai Xu, and Yulei Sui. 2021. DeepWukong: Statically detecting software vulnerabilities using deep graph neural network. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 3 (2021), 1–33.
[13] Roland Croft, Dominic Newlands, Ziyu Chen, and M Ali Babar. 2021. An Empirical Study of Rule-Based and Learning-Based Approaches for Static Application Security Testing. arXiv preprint arXiv:2107.01921 (2021).
[14] Lei Cui, Zhiyu Hao, Yang Jiao, Haiqiang Fei, and Xiaochun Yun. 2020. VulDetector: Detecting Vulnerabilities Using Weighted Feature Graph Comparison. IEEE Transactions on Information Forensics and Security 16 (2020), 2004–2017.
[15] Yangruibo Ding, Sahil Suneja, Yunhui Zheng, Jim Laredo, Alessandro Morari, Gail Kaiser, and Baishakhi Ray. 2022. VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 1–12.
[16]
Xiaoning Du, Bihuan Chen, Yuekang Li, Jianmin Guo, Yaqin Zhou, Yang Liu, and
Yu Jiang. 2019. Leopard: Identifying vulnerable code for vulnerability assessment
through program metrics. In 2019 IEEE/ACM 41st International Conference on
Software Engineering (ICSE). IEEE, 60–71.
[17] Xu Duan, Jingzheng Wu, Shouling Ji, Zhiqing Rui, Tianyue Luo, Mutian Yang, and Yanjun Wu. 2019. VulSniper: Focus Your Attention to Shoot Fine-Grained Vulnerabilities. In IJCAI. 4665–4671.
[18] Davide Falessi, Jacky Huang, Likhita Narayana, Jennifer Fong Thai, and Burak Turhan. 2020. On the need of preserving order of data when validating within-project defect classifiers. Empirical Software Engineering 25, 6 (2020), 4805–4830.
[19] Jiahao Fan, Yi Li, Shaohua Wang, and Tien N Nguyen. 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. In Proceedings of the 17th International Conference on Mining Software Repositories. 508–512.
[20] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of EMNLP 2020. http://arxiv.org/abs/2002.08155
[21] Seyed Mohammad Ghaffarian and Hamid Reza Shahriari. 2017. Software vulnerability analysis and discovery using machine-learning and data-mining techniques: A survey. ACM Computing Surveys (CSUR) 50, 4 (2017), 1–36.
[22]
Hazim Hanif, Mohd Hairul Nizam Md Nasir, Mohd Faizal Ab Razak, Ahmad Fir-
daus, and Nor Badrul Anuar. 2021. The rise of software vulnerability: Taxonomy
of software vulnerabilities detection and machine learning approaches. Journal
of Network and Computer Applications (2021), 103009.
[23]
Thong Hoang, Hong Jin Kang, David Lo, and Julia Lawall. 2020. Cc2vec: Dis-
tributed representations of code changes. In Proceedings of the ACM/IEEE 42nd
International Conference on Software Engineering. 518–529.
[24]
Aram Hovsepyan, Riccardo Scandariato, and Wouter Joosen. 2016. Is Newer
Always Better? The Case of Vulnerability Prediction Models. In Proceedings of
the 10th ACM/IEEE International Symposium on Empirical Software Engineering
and Measurement. 1–6.
[25]
Julian Jang-Jaccard and Surya Nepal. 2014. A survey of emerging threats in
cybersecurity. J. Comput. System Sci. 80, 5 (2014), 973–993.
[26]
Andrew Johnson, Lucas Waye, Scott Moore, and Stephen Chong. 2015. Explor-
ing and enforcing security guarantees via program dependence graphs. ACM
SIGPLAN Notices 50, 6 (2015), 291–302.
[27]
Seulbae Kim, Seunghoon Woo, Heejo Lee, and Hakjoo Oh. 2017. Vuddy: A
scalable approach for vulnerable code clone discovery. In 2017 IEEE Symposium
on Security and Privacy (SP). IEEE, 595–614.
[28] Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations (Palais des Congrès Neptune, Toulon, France) (ICLR '17). https://openreview.net/forum?id=SJU4ayYgl
[29]
Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and
documents. In International conference on machine learning. PMLR, 1188–1196.
[30]
Triet HM Le, Hao Chen, and Muhammad Ali Babar. 2020. Deep learning for
source code modeling and generation: Models, applications, and challenges. ACM
Computing Surveys (CSUR) 53, 3 (2020), 1–38.
[31] Stefan Lessmann, Bart Baesens, Christophe Mues, and Swantje Pietsch. 2008. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering 34, 4 (2008), 485–496.
[32] Yi Li, Shaohua Wang, and Tien N Nguyen. 2021. Vulnerability Detection with Fine-grained Interpretations. In The 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM.
[33] Zhen Li, Deqing Zou, Shouhuai Xu, Zhaoxuan Chen, Yawei Zhu, and Hai Jin. 2021. VulDeeLocator: a deep learning-based fine-grained vulnerability detector. IEEE Transactions on Dependable and Secure Computing (2021), 1–17.
[34]
Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, and Zhaoxuan Chen. 2021.
SySeVR: A framework for using deep learning to detect software vulnerabilities.
IEEE Transactions on Dependable and Secure Computing (2021).
[35] Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. 2018. VulDeePecker: A Deep Learning-Based System for Vulnerability Detection. In Network and Distributed Systems Security (NDSS) Symposium 2018. 1–12.
[36]
Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez,
and Ion Stoica. 2018. Tune: A research platform for distributed model selection
and training. arXiv preprint arXiv:1807.05118 (2018).
[37] Guanjun Lin, Wei Xiao, Leo Yu Zhang, Shang Gao, Yonghang Tai, and Jun Zhang. 2021. Deep neural-based vulnerability discovery demystified: data, model and performance. Neural Computing and Applications (2021), 1–14.
[38]
Bingchang Liu, Liang Shi, Zhuhua Cai, and Min Li. 2012. Software vulnerabil-
ity discovery techniques: A survey. In 2012 fourth international conference on
multimedia information networking and security. IEEE, 152–156.
[39]
Chao Liu, Chen Chen, Jiawei Han, and Philip S Yu. 2006. GPLAG: detection of
software plagiarism by program dependence graph analysis. In Proceedings of
the 12th ACM SIGKDD international conference on Knowledge discovery and data
mining. 872–881.
[40]
Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-task learning based pre-
trained language model for code completion. In Proceedings of the 35th IEEE/ACM
International Conference on Automated Software Engineering. 473–485.
[41]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A
robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692
(2019).
[42] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
[43] Hajra Naeem and Manar H Alalfi. 2020. Identifying Vulnerable IoT Applications using Deep Learning. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 582–586.
[44] Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In ICML.
[45]
Hai Ngoc Nguyen, Songpon Teerakanok, Atsuo Inomata, and Tetsutaro Uehara.
2021. The Comparison of Word Embedding Techniques in RNNs for Vulnerability
Detection.. In ICISSP. 109–120.
[46] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
[47]
Nam H Pham, Tung Thanh Nguyen, Hoan Anh Nguyen, and Tien N Nguyen. 2010.
Detection of recurring software vulnerabilities. In Proceedings of the IEEE/ACM
international conference on Automated software engineering. 447–456.
[48]
Chanathip Pornprasit and Chakkrit Tantithamthavorn. 2021. JITLine: A Simpler,
Better, Faster, Finer-grained Just-In-Time Defect Prediction. In 2021 International
Conference on Mining Software Repositories (MSR’21). 1–11.
[49] Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K Roy, and Cristina V Lopes. 2016. SourcererCC: Scaling code clone detection to big-code. In Proceedings of the 38th International Conference on Software Engineering. 1157–1168.
[50]
Riccardo Scandariato, James Walden, Aram Hovsepyan, and Wouter Joosen. 2014.
Predicting vulnerable software components via text mining. IEEE Transactions
on Software Engineering 40, 10 (2014), 993–1006.
[51]
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele
Monfardini. 2008. The graph neural network model. IEEE transactions on neural
networks 20, 1 (2008), 61–80.
[52]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine
translation of rare words with subword units. arXiv preprint arXiv:1508.07909
(2015).
[53]
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro
Liò, and Yoshua Bengio. 2017. Graph Attention Networks. 6th International
Conference on Learning Representations (2017).
[54] Deze Wang, Yue Yu, Shanshan Li, Wei Dong, Ji Wang, and Liao Qing. 2021. MulCode: A Multi-task Learning Approach for Source Code Understanding. In 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 48–59.
[55] Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 261–271.
[56]
Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Break-
throughs in statistics. Springer, 196–202.
[57] Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and discovering vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and Privacy. IEEE, 590–604.
[58] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems 32 (2019), 8026–8037.
[59]
Deqing Zou, Yawei Zhu, Shouhuai Xu, Zhen Li, Hai Jin, and Hengkai Ye. 2021.
Interpreting deep learning-based vulnerability detector predictions based on
heuristic searching. ACM Transactions on Software Engineering and Methodology
(TOSEM) 30, 2 (2021), 1–31.