DeepLineDP: Towards a Deep Learning
Approach for Line-Level Defect Prediction
Chanathip Pornprasit, Student Member, IEEE, Chakkrit (Kla) Tantithamthavorn, Member, IEEE
C. Pornprasit and C. Tantithamthavorn are with the Faculty of Information Technology, Monash University, Melbourne, Australia.
E-mail: chanathip.sit@gmail.com, chakkrit@monash.edu
Abstract—Defect prediction is proposed to assist practitioners effectively prioritize limited Software Quality Assurance (SQA) resources
on the most risky files that are likely to have post-release software defects. However, there exist two main limitations in prior studies:
(1) the granularity levels of defect predictions are still coarse-grained and (2) the surrounding tokens and surrounding lines have not
yet been fully utilized. In this paper, we perform a survey study to better understand how practitioners perform code inspection in
modern code review process, and their perception on a line-level defect prediction. According to the responses from 36 practitioners,
we found that 50% of them spent at least 10 minutes to more than one hour to review a single file, while 64% of them still perceived
that code inspection activity is challenging to extremely challenging. In addition, 64% of the respondents perceived that a line-level
defect prediction tool would potentially be helpful in identifying defective lines. Motivated by the practitioners’ perspective, we present
DeepLineDP, a deep learning approach to automatically learn the semantic properties of the surrounding tokens and lines in order
to identify defective files and defective lines. Through a case study of 32 releases of 9 software projects, we find that the risk score
of code tokens varies greatly depending on their location. Our DeepLineDP is 17%-37% more accurate than other file-level defect
prediction approaches; is 47%-250% more cost-effective than other line-level defect prediction approaches; and achieves a reasonable
performance when transferred to other software projects. These findings confirm that the surrounding tokens and surrounding lines
should be considered to identify the fine-grained locations of defective files (i.e., defective lines).
Index Terms—Software Quality Assurance, Line-level Defect Prediction, Deep Learning, Explainable AI
1 INTRODUCTION
Software defects are prevalent and costly. Thus, Soft-
ware Quality Assurance (SQA) practices play a critical
role in ensuring the absence of software defects. Despite
heavy investment in several SQA tools during the
development phase (e.g., CI/CD, code review, static
analysis), software defects may still slip through to the
official releases of a software product (i.e., post-release
defects) [62, 63]. Vassallo et al. [66] found that practition-
ers still control for quality only at the end of a sprint
and at the release preparation stage. However, real-
world software projects are extremely large and complex.
Hence, it is practically infeasible to exhaustively perform
SQA activities on all of the files of the codebase.
To address this challenge, defect prediction is pro-
posed to help developers prioritize their limited SQA
resources on the most risky files that are likely to have
post-release software defects. Recent studies found that
deep learning approaches, which automatically learn
syntactic and semantic features, outperform traditional
machine learning approaches that use process and
product metrics as features [9, 34, 71]. However, there
exist two main limitations in prior studies.
(1) The granularity levels of deep learning-based
defect predictions are still coarse-grained. Defect pre-
diction models have been proposed at various levels of
granularity (e.g., packages [27], components [63], mod-
ules [28], files [9, 27, 39, 47, 71], methods [15], and
commits [18, 19, 29]). However, not all lines in a defective
file are actually defective. Therefore, developers still
waste a large amount of SQA effort to inspect clean lines
that may not lead to post-release defects (i.e., Table 1
shows that only 0.03%-2.9% of the lines of the whole release
are defective). As such, line-level defect prediction is
needed to help developers prioritize their SQA effort
on the high-risk areas of source code so SQA effort
can be allocated in a cost-effective manner. However,
line-level defect prediction is a challenging problem and
still remains largely unexplored (e.g., no existing deep
learning-based approach considers sequences of tokens).
(2) The surrounding tokens and the surrounding
lines have not yet been fully utilized. Source code
has a hierarchical structure (i.e., tokens forming lines
and lines forming files) and is contextually dependent.
Thus, the same code token that appears in different
lines may have different lexical meaning depending on
its location (e.g., variable declaration or assigning a
value to a variable). Therefore, the riskiness of code
tokens should be different depending on their location.
However, current deep learning approaches for file-level
defect prediction [9, 34, 70, 71] can capture only the
long-term sequences of code tokens without considering
surrounding lines, assuming that the same code token at
different lines is equally important to predict defective
files—which is likely not true.
In this paper, we began by conducting a survey
study to investigate the state-of-practice of code inspec-
tion in modern code review, and practitioners’ percep-
tion on a line-level defect prediction tool. We recruited
participants through social media platforms targeting
professionals and software developers (i.e., LinkedIn,
Facebook Groups). Through an analysis of the 36 re-
sponses obtained from the participants, we addressed
the following two motivating questions (MQs):
(MQ1) How do practitioners perform code inspection
in modern code review process?
Results. Given a changed file in a pull re-
quest, 80.6% of the respondents currently inspect
source code in a top-down order. Once defective
lines are identified, 72.2% of the respondents in-
spect the defective lines and their related method
calls, while 52.8% of the respondents inspect
the defective lines and their surrounding lines.
Also, 50% of the respondents spent at least 10
minutes to more than one hour to review a
single file, indicating that current code review
activities are still time-consuming. Importantly,
64% of the respondents perceived that code in-
spection activity is very challenging to extremely
challenging.
(MQ2) Would a line-level defect prediction tool be
helpful for practitioners?
Results. 44% of the respondents perceived that
a line-level defect prediction tool would poten-
tially be helpful in identifying defective lines. In
addition, 64% of the respondents would consider
using the tool if it is publicly available for free.
Motivated by the findings from our survey with prac-
titioners, we present DeepLineDP, a deep learning ap-
proach to learn the surrounding code tokens and the
surrounding lines of source code in order to predict
defective files and locate defective lines, which performs
as follows. First, we generate a vector representation
for each code token. Then, we employ a bidirectional
GRU unit [4] to capture the surrounding tokens of
that source code line. Next, we employ a Hierarchical
Attention Network (HAN) architecture [76] to learn the
hierarchical structure of source code to predict defective
files. After that, we compute the risk score of code tokens
that contribute to the prediction of a given defective file
using the attention mechanism [67]. We then generate a
ranking of risky lines using the summation of the risk
score of the code tokens that appear in that line. Through
a case study of 32 releases of 9 software projects, we
address the following four research questions (RQs):
(RQ1) Can our DeepLineDP be used to differentiate
the riskiness of code tokens in defective and
clean lines?
Results. The risk score of the same code token
that appears in different lines varies greatly de-
pending on their locations, since we observe a
maximum difference of 1. For example, the risk
score of a code token may be extremely risky
in one line (a risk score of 1), while not being
risky at all in another line (a risk score of 0).
In addition, we find that the risk score of code
tokens in defective lines are significantly higher
than the risk score of code tokens in clean lines,
suggesting that our DeepLineDP approach can be
used to differentiate the riskiness of code tokens
in defective and clean lines.
(RQ2) What is the accuracy of our DeepLineDP for
predicting defective files?
Results. Our DeepLineDP achieves a median
AUC of 0.81 and a median Balanced Accuracy
of 0.63, which is 17%-37% and 3%-26% more
accurate than the state-of-the-art in terms of the
median AUC and the median Balanced Accu-
racy, respectively. The ScottKnott ESD test also
confirms that our DeepLineDP approach always
appears at the top rank for both measures.
(RQ3) What is the cost-effectiveness of our
DeepLineDP for locating defective lines?
Results. Our DeepLineDP is 47%-250%
more cost-effective than the state-of-the-art
line-level defect prediction approaches for
Recall@Top20%LOC, while achieving 38% and
49% less Effort@Top20%Recall than using the
ErrorProne static analysis tool and the N-gram
model, respectively.
(RQ4) What is the accuracy of our DeepLineDP for
line-level cross-project defect predictions?
Results. Our DeepLineDP models are trans-
ferable to other software projects, since our
DeepLineDP models still achieve an AUC of
0.63-0.79 and a Recall@Top20%LOC of 0.31-0.46.
However, they can be slightly less accurate and
less cost-effective than the models that are trained
from the previous release of its own project. Nev-
ertheless, our DeepLineDP models still achieve a
reasonable performance.
Novelty and Contributions. To the best of our knowl-
edge, the novelty and main contributions of our work
are as follows:
1) A survey with 36 practitioners to understand their
current code inspection practices in modern code
review and their perceptions to adopt a line-level
defect prediction.
2) DeepLineDP—a deep learning-based approach for
line-level defect prediction that aims to automati-
cally learn the surrounding tokens and surrounding
lines to predict defective files and defective lines.
3) An extensive experiment using both within-project
and cross-project evaluation settings at the file and
line levels with 3 traditional measures and 3 effort-
aware measures.
4) Our empirical finding shows that our DeepLineDP
models (1) can differentiate the riskiness of the same
code token that appears in different lines; (2) can
accurately predict the defective files; (3) can effec-
tively generate the ranking of defective lines; and (4)
trained on a software project can be transferable to
other software projects while achieving a reasonable
performance.
To foster the replication of our study, we publish the
implementation of our DeepLineDP and the baselines at
https://github.com/awsm-research/DeepLineDP.
Paper Organization. The rest of this paper is organized
as follows. Section 2 presents a survey study of the
current code inspection practices in modern code review.
Section 3 presents the architecture of our DeepLineDP
approach. Section 4 presents the experimental design
and results. Section 5 discloses the threats to the validity.
Section 6 discusses the related work. Finally, we draw
conclusions in Section 7.
2 A MOTIVATING SURVEY
In this section, we aim to investigate (1) how practitioners
perform code inspection in modern code review and
(2) whether a line-level defect prediction tool would be
helpful for practitioners. Below, we describe the approach
used to conduct the survey and present the results of our
survey study.
2.1 Approach
Similar to Kitchenham and Pfleeger [32], we conducted
our study according to the following steps: (1) design
and develop a survey, (2) recruit and select participants,
and (3) verify and analyze data. We explain the
details of each step below.
(Step 1) Design and develop a survey. The purpose
of our survey is to investigate the current practices in
modern code review process. We designed our survey as
a cross-sectional study where participants provided their
responses at one fixed point in time. The survey consists
of 7 closed-ended questions and 6 open-ended questions.
For the closed-ended questions, we used multiple-choice
questions and 5-point Likert scales. Our survey consists
of three parts: a preliminary question; understanding
current code inspection practices; and practitioners'
perceptions of a line-level defect prediction tool. Our
preliminary question ("Do you perform code review?")
ensures that our survey results are obtained from
the right target participants; it is followed by questions on
roles, levels of experience, and primary programming language skills.
The next part is focused on a set of questions to under-
stand the current code inspection practices. Then, the
final part is focused on practitioners’ perceptions on a
line-level defect prediction tool. To do so, we presented
a usage scenario and an example visualization. Figure 1
shows an example defective file that is correctly predicted by
our DeepLineDP. The shade colour of each code token
varies based on the risk score, from dark red (very risky)
to light red (less risky). The example was obtained from
the file ../store/Directory.java from the release
3.1 of the Lucene project, which has 426 lines of code.
With our DeepLineDP, the model correctly predicts that
lines 227-228 are the most risky lines, with risk scores of
0.99 and 0.97, respectively, due to the tokens
createOutput() and openInput().
Fig. 1: An example visualization to highlight the most
risky tokens obtained from our DeepLineDP approach.
We used Google Forms to create our online survey. When accessing the
survey, each participant was provided with an explana-
tory statement that describes the purpose of the study,
why the participant is chosen for this study, possible
benefits and risks, and confidentiality. The survey takes
approximately 10 minutes to complete and is anony-
mous. Our survey has been rigorously reviewed and
approved by the Monash University Human Research
Ethics Committee (MUHREC ID: 30739).
(Step 2) Recruit and select participants. We recruited
the practitioners via LinkedIn since we can specify the
target group that we would like to access. We then
sent a survey invitation to the target groups via the
LinkedIn direct message. To ensure that our survey is
not biased, we selected participants from various large
companies. Due to LinkedIn's limit on the number of
invitations, we were able to send at most
100 invitations per week. In total, we sent a survey
invitation to 100 practitioners via LinkedIn, yielding
a response rate of 15% (15/100). Then, we also received 28
additional responses through the survey advertisement
at several software developer’s communities (e.g., Face-
book groups). Finally, we obtained a total of 43 responses
over one week recruitment.
(Step 3) Verify and analyze data. To verify the completeness
of the responses in our survey (i.e., whether all
questions were appropriately answered), we manually
read all of the responses to the open-ended questions. To ensure that our
survey results are derived from the right target participants,
we excluded 7 responses from participants who do not perform
code review. Finally, we obtained a set of 36 responses.
We presented the results of closed-ended responses in
a Likert scale with stacked bar plots. We manually
analyzed the responses of the open-ended questions to
extract in-depth insights.
2.2 Respondent Demographics
The 36 survey respondents have the following roles in a software development team: (full-stack) software engineer or software developer (88.8%), quality assurance engineer (5.6%), team leader (2.8%), and others (2.8%). The respondents have varying years of professional experience: less than 5 years (52.8%), 6-10 years (33.3%), 11-15 years (8.3%), 16-20 years (2.8%), and more than 20 years (2.8%). The respondents described their experience in programming languages as Java (24.8%), JavaScript (16.7%), C# (16.7%), Python (13.9%), C/C++ (5.6%), Go (8.3%), Kotlin (5.6%), PHP (2.8%), Scala (2.8%), and Swift (2.8%). These demographics indicate that the responses are collected from practitioners with a variety of roles, years of experience, and programming language expertise. Thus, our findings are likely not bound to specific groups of practitioners.

[Fig. 2 presents stacked bar plots of the responses to the following survey questions.
Part 1: Understanding the current code inspection practices.
(Q1.1) Please rate the degree of challenge of code inspection activity.
(Q1.2) Please justify your answer in Q1.1.
(Q1.3) Do you currently use any automated code review tools to identify the lines that are likely to be defective?
(Q1.4) If the answer in Q1.3 is *yes*, what are the automated code review tools that you mainly use?
(Q1.5) If the answer in Q1.3 is *yes*, why are such automated code review tools used?
(Q1.6) If the answer in Q1.3 is *no*, why are such automated code review tools not used?
(Q1.7) Given a changed file in a pull request, in the code inspection activity, what is the order of lines when you inspect source code?
(Q1.8) Once the defective lines of code are identified, what is the scope of source code when inspecting source code? (Checkbox)
(Q1.9) How long do you usually spend on reviewing one file?
Part 2: Practitioners' perceptions on line-level defect prediction.
(Q2.1) Do you think this tool would be helpful in identifying defective lines?
(Q2.2) Please justify your answer for Q2.1 in a few sentences.
(Q2.3) If this tool is publicly available (for free/with no cost), would you consider using our tool?
(Q2.4) Please justify your answer for Q2.3 in a few sentences.]
Fig. 2: (MQ1/MQ2) A summary of the survey questions and the results obtained from 36 participants.
2.3 Survey Results
Figure 2 shows a summary of the survey questions and
the results of MQ1 and MQ2.
(MQ1) How do practitioners perform code inspection
in modern code review process?
Findings. Developers inspect source code in a top-
down order. Once defective lines are identified, they
will inspect the method calls and their surrounding
lines. Each file often takes at least 10 minutes to more
than one hour. To speed up code review process, auto-
mated code review tools are currently used. However,
64% of the respondents still perceived that their current
code inspection activity remains very challenging, highlighting
the need for a line-level defect prediction tool to
help developers identify the areas of code that are likely
to be defective in the future. Below, we summarize the
key findings from our survey.
• Given a changed file in a pull request, 80.6% of the
respondents currently inspect source code in a top-
down order (i.e., from the top to the bottom). Other
respondents inspect source code in a random order,
or inspect the main code change followed by the
top-down order.
• Once the defective lines are identified, 72.2% of
the respondents focused on inspecting the defective
lines and their related method calls, while 52.8% of
the respondents inspect the defective lines and their
surrounding lines, 41.7% of the respondents inspect
the defective lines and their whole function, and
17.6% of the respondents inspect only the defective
lines.
• 50% of the respondents spent at least 10 minutes
to more than one hour to review a single file,
indicating that current code review activities are still
time-consuming.
• To aid the code review process, 44.4% of the respon-
dents currently use automated code review tools
(e.g., SonarQube and ESLint), since such tools can
facilitate their code review tasks (R10: It provides
massive rules for code reviews, support many languages
and the review report is intelligible, R25: Help us high-
light and simplify the work that they need to do.)
• 64% of the respondents perceived that code inspec-
tion activity is very challenging to extremely chal-
lenging. They stated the reasons that it’s challenging
to review code that they do not own (R20: It’s chal-
lenging when I have to review the code in the repo that
I do not own), understanding the logic behind code
is time consuming (R24: People have different way of
thinking. My time is usually spent on understanding
their reasoning behind the code before I leave a comment
to challenge their design or logic. ), and there are a
lot of code to review (R26: Many lines of code to be
reviewed).
(MQ2) Would a line-level defect prediction tool be
helpful for practitioners?
Findings. 44% of the respondents perceived that a
line-level defect prediction tool would potentially be
helpful in identifying defective lines, due to various
reasons:
• Reduce time to do code review (R18: It could be useful
and take less time to review the code).
• Easy to understand (R8: the highlight and score are
clear and easy to understand, R26: Red alert attracts your
eyes so it is easier to spot what is wrong).
• Provide some useful information (R14: Show high-
light with probability is informative).
In addition, we find that the respondents who are
(full-stack) software engineers or software developers,
have less than 5 years of experience, and focus on
the surrounding lines and the whole function of the
identified defective lines, tend to perceive that the tool
is helpful in identifying defective lines.
64% of the respondents would consider using a line-
level defect prediction tool if it is publicly available
for free, due to various reasons:
• Expedite software development process (R12: If it
make process go faster, why not?).
• Interesting (R14: It’s interesting to use the tools).
3 LEARNING THE SURROUNDING TOKENS AND SURROUNDING LINES FOR LINE-LEVEL DEFECT PREDICTION
In this section, we present DeepLineDP, an approach
to address the challenges of line-level defect prediction.
Our DeepLineDP approach is designed to capture defective
code lines through the use of a bidirectional GRU unit
to learn the surrounding tokens and surrounding lines
with a hierarchical attention network. Figure 3 provides
an overview of our approach.
Overview. Our approach begins with data collection
and data preparation to generate datasets that contain
source code files and file-level and line-level ground-truth
labels (i.e., files that are affected by a post-release defect
and lines that were changed or deleted to address the defect).
For each source code file, we perform several preprocessing
steps, including code abstraction and vocabulary
size management. Next, we perform source code
representation to generate a vector representation. After
that, we use a hierarchical attention network to learn the
hierarchical structure of source code for file-level defect
prediction, while enabling line-level defect prediction.
Then, the prediction layer produces the probability score
of the prediction of defective files. Finally, we use the
token-level attention layer to identify defective lines
based on the most important tokens that contribute to
the prediction of defective files.
3.1 Source Code Preprocessing
The goal of the source code preprocessing steps is to
extract syntactic, semantic, and contextual code features
of each code token. For each Java file, we use JavaParser (http://javaparser.org/)
to build an Abstract Syntax Tree (AST) to generate a
stream of tokens and determine their type by extracting
syntactic information (e.g., whether a token represents
an identifier, a method declaration). Then, each source
code file is parsed into a set of lines and each line is
parsed into a sequence of code tokens.
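To make this preprocessing concrete, below is a minimal sketch that groups Java tokens by their source code line. It uses the Python javalang library as an illustrative stand-in for JavaParser; the function name and example snippet are ours, not part of the paper's implementation.

from collections import defaultdict
from javalang.tokenizer import tokenize

def tokenize_by_line(source: str) -> dict:
    # Group code tokens by source line, preserving token order;
    # the token class name (e.g., Identifier, Keyword) carries the
    # syntactic information extracted during tokenization.
    lines = defaultdict(list)
    for token in tokenize(source):
        line_no = token.position.line
        lines[line_no].append((type(token).__name__, token.value))
    return dict(lines)

code = 'int i = 5;\nif (j == 0) { i++; }'
for line_no, tokens in sorted(tokenize_by_line(code).items()):
    print(line_no, tokens)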
Vocabulary Size Management. A large vocabulary size
often requires high computing time and large mem-
ory usage [9, 42, 64, 74]. To alleviate this issue, we
replace integers, real numbers, exponential notation, and
hexadecimal numbers with a generic ⟨num⟩ token, and
replace constant strings with a generic ⟨str⟩ token. We
remove special characters (i.e., { . , " ' ( ) : (space) } !)
since Rahman and Rigby [52] found that these special
tokens introduce noise to prediction models. We replace
tokens that exist in the test sets but do not exist in the
training set with a special ⟨unk⟩ token, addressing the out-
of-vocabulary problem. We ignore blank lines, as they do
not contribute to the actual behavior of the code.
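A minimal sketch of these abstraction rules is shown below; the regular expression and special-character set are illustrative approximations of the rules above, not the exact implementation.

import re

# Illustrative approximations of the abstraction rules described above.
NUM_RE = re.compile(r'^(0[xX][0-9a-fA-F]+|\d+(\.\d+)?([eE][+-]?\d+)?)$')
SPECIAL = {'{', '.', ',', '"', "'", '(', ')', ':', '}', '!'}

def abstract_token(token: str) -> str:
    # Numeric literals become <num>; string literals become <str>.
    if NUM_RE.match(token):
        return '<num>'
    if token.startswith('"') or token.startswith("'"):
        return '<str>'
    return token

def preprocess_line(tokens):
    # Abstract literals and drop the noisy special-character tokens.
    return [abstract_token(t) for t in tokens if t not in SPECIAL]

def map_oov(tokens, train_vocab):
    # Replace tokens unseen in the training set with <unk>.
    return [t if t in train_vocab else '<unk>' for t in tokens]

train_vocab = {'int', 'i', '=', ';', '<num>'}
line = preprocess_line(['int', 'i', '=', '5', ';'])
print(map_oov(line, train_vocab))  # ['int', 'i', '=', '<num>', ';']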
3.2 Token Embedding Layer
Unlike prior studies that represent source code as a long-
term sequence [9], our goal is to maintain the hierarchical
structure of source code (i.e., tokens forming lines and
lines forming files). Thus, we represent the source code
file by maintaining the order of lines and tokens in
the original code. Each file is now a sequence of lines
$\langle l_1, l_2, \ldots, l_n \rangle$ and each line is a sequence of code tokens
$\langle w_1, w_2, \ldots, w_n \rangle$. For each token, we generate a vector
representation using the Word2Vec function provided
by the gensim Python library. In particular, we build
a project-specific language model using our training
dataset for each project to ensure that the vector repre-
sentation is derived from the domain-specific vocabular-
ies and to obtain the optimal distributed representations.
That means the vector representations generated from
domain-specific language models tend to be more
meaningful (i.e., capturing the distributed relationships
between tokens and their surrounding tokens better
than a generic pre-trained language model). When
training the Word2Vec model, we use the Continuous
Bag of Words (CBOW) architecture to learn the distributed
representations of tokens, since the CBOW architecture
considers the surrounding tokens to generate
a vector representation of the target token.

[Fig. 3: An overview diagram of our DeepLineDP approach: data collection and data preparation (issue reports and version control data yield the defect datasets of Yatish et al., with file-level and line-level ground truths), code preprocessing and tokenization of each source code file, the token embedding layer, the hierarchical attention network (BiGRU token and line encoders with token attention and line attention layers, followed by a fully-connected layer), and the file-level and line-level defect prediction outputs (the probability score, and the analysis of attention scores that produces the risk score of code tokens, from most to least risky, and the ranking of defective lines).]
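The following sketch shows how such a project-specific CBOW model could be trained with the gensim library; the toy corpus and hyperparameter values are illustrative, not the settings used in our experiments.

from gensim.models import Word2Vec

# Each training "sentence" is one preprocessed line of code tokens.
lines = [
    ['int', 'i', '=', '<num>'],
    ['if', 'j', '==', '<num>'],
]

model = Word2Vec(
    sentences=lines,
    vector_size=50,  # embedding dimension d (gensim >= 4.0)
    window=5,        # number of surrounding tokens considered
    min_count=1,
    sg=0,            # sg=0 selects the CBOW architecture
)

vec = model.wv['<num>']  # the d-dimensional vector of a token
print(vec.shape)         # (50,)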
3.3 Learning the Hierarchical Structure of Source Code
We use Hierarchical Attention Network [76] to learn
the hierarchical structure of source code (see Figure 4).
Specifically, we use a two-layer attention network (a
token layer and a line layer). This network consists of
four parts: a token encoder, a token-level attention layer,
a line encoder, and a line-level attention layer. Assuming
that a source code file $f \in F$ has a sequence of lines
$L = [l_1, l_2, \ldots, l_{|L|}]$, where $l_i$ contains a sequence of tokens
$T = [w_{i1}, w_{i2}, \ldots, w_{i|T|}]$, and $w_{it}$ represents the $t$-th code
token in the $i$-th line, $t \in [1, |T|]$.
Token Encoder. Given a line $l_i$ with a sequence of
tokens $T$ and a word embedding matrix $W \in \mathbb{R}^{|V_C| \times d}$,
where $V_C$ is the vocabulary containing all tokens extracted
from the training source code files and $d$ is the
word embedding size of the representation of tokens,
we first obtain a vector representation of each token
of each line, $v_{it} = W(w_{it})$, from Word2Vec, where $v_{it}$
indicates the vector representation of token $w_{it}$ in the
word embedding matrix $W$.
We employ a bidirectional GRU [4] to summarize information
from the context of a token in both directions.
GRU is proposed to solve the vanishing gradient problem,
which is commonly found in a standard RNN architecture
when training with a long sequence of tokens.
To capture the contextual information, the bidirectional
GRU includes a forward GRU $\overrightarrow{h_{it}} = \overrightarrow{\mathrm{GRU}}(v_{it}), t \in [1, |T|]$,
which reads the line $l_i$ from left ($w_{i1}$) to right ($w_{i|T|}$),
and a backward GRU $\overleftarrow{h_{it}} = \overleftarrow{\mathrm{GRU}}(v_{it}), t \in [|T|, 1]$,
which reads the line $l_i$ from right ($w_{i|T|}$) to left ($w_{i1}$). We obtain
an annotation for a given token $w_{it}$ by concatenating the
forward hidden state $\overrightarrow{h_{it}}$ and the backward hidden state
$\overleftarrow{h_{it}}$, i.e., $h_{it} = [\overrightarrow{h_{it}}; \overleftarrow{h_{it}}]$.
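A minimal PyTorch sketch of this token encoder is given below; the vocabulary size, embedding dimension, and hidden size are illustrative, not the paper's hyperparameters.

import torch
import torch.nn as nn

d, hidden = 50, 64
embed = nn.Embedding(num_embeddings=1000, embedding_dim=d)
token_gru = nn.GRU(input_size=d, hidden_size=hidden,
                   bidirectional=True, batch_first=True)

token_ids = torch.randint(0, 1000, (1, 12))  # one line of 12 tokens
v = embed(token_ids)                         # token vectors v_it: (1, 12, d)
h, _ = token_gru(v)                          # annotations h_it: (1, 12, 2*hidden)
# h[:, t, :hidden] is the forward state and h[:, t, hidden:] the backward
# state; their concatenation is the annotation h_it of token t.
print(h.shape)  # torch.Size([1, 12, 128])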
Token Attention. Based on the intuition that not all
tokens contribute equally to the semantic representation
of the source code line, we use the attention mecha-
nism [67] to highlight the tokens that are more important
to the semantics of the source code lines and aggregate
the representation of those informative tokens to form a
line vector.

[Fig. 4: An overview of a two-layer hierarchical attention network to learn the hierarchical structure of source code, using a bidirectional GRU unit to capture the code context (i.e., surrounding tokens): the token embedding layer feeds a token encoder and token attention, whose line vectors feed a line encoder and line attention, followed by a fully-connected layer (MLP) and a Sigmoid prediction layer that outputs the probability score.]
We first feed the token annotation $h_{it}$ through a one-layer
Multi-Layer Perceptron (MLP) to get a hidden
representation $u_{it}$ of $h_{it}$, i.e., $u_{it} = \tanh(W_w h_{it} + b_w)$.
Similar to prior studies [19, 76], we define a token-level
context vector ($u_w$) that can be seen as a high-level
representation of the answer to the fixed query "what is
the most informative token" over the tokens. The token
context vector $u_w$ is randomly initialized and learned
during the training process. Then, we measure the importance
of the token as the similarity of $u_{it}$ with the token-level
context vector $u_w$, and compute a normalized
contribution (attention) $\alpha_{it}$ of token $w_{it}$ of the line $l_i$
through a softmax function [7]: $\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_t \exp(u_{it}^{\top} u_w)}$. For
each line $l_i$, its vector is computed as a weighted sum
of the annotation vectors of the tokens based on their
importance as follows: $s_i = \sum_t \alpha_{it} h_{it}$.
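The token attention layer can be sketched in PyTorch as follows; the module and dimensions are ours, for illustration only.

import torch
import torch.nn as nn

class TokenAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Linear(dim, dim)             # u_it = tanh(W_w h_it + b_w)
        self.u_w = nn.Parameter(torch.randn(dim))  # token-level context vector

    def forward(self, h):                          # h: (batch, tokens, dim)
        u = torch.tanh(self.mlp(h))
        alpha = torch.softmax(u @ self.u_w, dim=1)       # attention alpha_it per token
        line_vec = (alpha.unsqueeze(-1) * h).sum(dim=1)  # s_i = sum_t alpha_it h_it
        return line_vec, alpha

attn = TokenAttention(dim=128)
h = torch.randn(1, 12, 128)    # annotations from the token encoder
s_i, alpha = attn(h)
print(s_i.shape, alpha.shape)  # torch.Size([1, 128]) torch.Size([1, 12])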
Line Encoder. Given the source code lines $l_i$, we use
a bidirectional GRU to encode the lines as follows:
$\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(l_i), i \in [1, |L|]$ and $\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(l_i), i \in [|L|, 1]$.
Similar to the Token Encoder, we obtain an annotation
of the source code line $l_i$ by concatenating the forward
hidden state $\overrightarrow{h_i}$ and the backward hidden state $\overleftarrow{h_i}$ of
this line. The annotation of the line $l_i$ is denoted as
$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, which summarizes the line $l_i$ considering
its neighboring lines.
Line Attention. We again use an attention mechanism
to highlight the lines that are more important to the
semantics of the source code files and aggregate the
representation of those informative lines to form a file
vector. We first feed the line annotation $h_i$ through a
one-layer MLP to get a hidden representation
$u_i = \tanh(W_l h_i + b_l)$. Then, we compute a normalized
contribution (attention) $\alpha_i$ of line $l_i$ through
a softmax function [7]: $\alpha_i = \frac{\exp(u_i^{\top} u_l)}{\sum_i \exp(u_i^{\top} u_l)}$. Here, $u_l$ is a
line-level context vector which is compared with $u_i$
to measure the importance of a code line. For each file,
$e_f$ is the embedding vector of the file, computed
as a weighted sum of the annotation vectors of the lines
based on their importance as follows: $e_f = \sum_i \alpha_i h_i$.
3.4 File-level and Line-Level Defect Prediction Layer
The embedding vector $e_f$ is a high-level representation
of the source code file, which can be used as features
for file-level defect prediction. This vector is fed to a
one-layer Multi-Layer Perceptron (MLP) (a.k.a. a fully-connected
layer) to produce a prediction score $h = w_h \cdot e_f + b_h$,
where $w_h$ is the weight matrix used to connect
the embedding vector $e_f$ with the hidden layer, and $b_h$
is the bias value. Finally, the prediction score is passed
to an output layer to compute a probability score for
a given source code file. We use the Sigmoid function to
compute the predicted probability of a given source code
file as follows: $p(y_i = 1 | f_i) = \frac{1}{1 + \exp(-h \cdot w_o)}$, where $y_i \in Y$
is the probability score of the $i$-th file and $f_i$ is the file
whose label we want to predict.
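Below is a minimal PyTorch sketch connecting the line encoder, line attention, and the prediction layer; all sizes are illustrative and training details are omitted.

import torch
import torch.nn as nn

hidden = 64
line_gru = nn.GRU(input_size=128, hidden_size=hidden,
                  bidirectional=True, batch_first=True)
line_mlp = nn.Linear(2 * hidden, 2 * hidden)   # u_i = tanh(W_l h_i + b_l)
u_l = nn.Parameter(torch.randn(2 * hidden))    # line-level context vector
fc = nn.Linear(2 * hidden, 1)                  # fully-connected prediction layer

line_vecs = torch.randn(1, 40, 128)            # 40 line vectors s_i of one file
h, _ = line_gru(line_vecs)                     # line annotations h_i
u = torch.tanh(line_mlp(h))
alpha = torch.softmax(u @ u_l, dim=1)          # per-line attention alpha_i
e_f = (alpha.unsqueeze(-1) * h).sum(dim=1)     # file embedding e_f
p = torch.sigmoid(fc(e_f))                     # probability the file is defective
print(p.item())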
To identify defective lines, we first extract the attention
score, which ranges from -1 to 1, of each code token in a
predicted defective file. The attention score, calculated by the
attention mechanism [67], is used as a proxy to indicate
the riskiness of code tokens. The concept of the attention
score is similar to the importance score of the random
forest's variable importance that is widely used in software
engineering, but with a different calculation. To
do so, we rank the attention scores obtained from the
token attention layer in descending order. Then, we
select the top-$k$ tokens that have the highest attention
scores and compute the line-level risk score as the
summation of the scores of any tokens in that line that appear
in the top-$k$. Finally, we produce a ranking of the most
risky lines based on the risk score of each line. In this
paper, we choose $k = 1{,}500$.
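The ranking step can be sketched as follows, assuming the per-token attention scores and their line numbers have already been extracted from the token attention layer; the scores and the small k are made up for readability.

def rank_risky_lines(token_scores, k=1500):
    # token_scores: list of (line_number, attention_score) pairs.
    # Keep only the k tokens with the highest attention scores.
    top_k = sorted(token_scores, key=lambda x: x[1], reverse=True)[:k]
    # Line-level risk score = sum of the retained token scores per line.
    line_scores = {}
    for line_no, score in top_k:
        line_scores[line_no] = line_scores.get(line_no, 0.0) + score
    # Rank lines from most to least risky.
    return sorted(line_scores.items(), key=lambda x: x[1], reverse=True)

scores = [(227, 0.52), (227, 0.47), (228, 0.97), (10, 0.01), (11, 0.02)]
# With k=3, lines 227 and 228 rank highest (risk scores of ~0.99 and 0.97).
print(rank_risky_lines(scores, k=3))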
4 STUDY DESIGN AND RESULTS
The goal of this paper is to empirically assess our
hypothesis that the hierarchical structure and code
context (i.e., surrounding tokens and surrounding lines)
should be considered to identify the fine-grained loca-
tions of defective code (i.e., defective files and defective
lines). Below, we present the study design followed by
the results.
Line-level Defect Datasets. In this paper, we use the
benchmark line-level defect dataset prepared by Wat-
tanakriengkrai et al. [73]. The dataset consists of 32
software releases that span across 9 open-source software
systems. Table 1 shows a statistical summary of our
studied dataset. Each release contains 731 to 8,846 files,
74,349-567,804 lines of code, and 58,659-621,238 code
tokens. Below, we discuss the detailed steps that were
used to collect file-level and line-level ground-truths.
(Step-1) Collect issue reports. For each studied system,
all issue reports were retrieved from the JIRA Issue
Tracking System (ITS). Then, the unique identifier of
the issue reports (IssueID), the issue report type (e.g.,
a bug or a new feature), and the affected releases (i.e.,
the releases that are affected by a given issue report)
were extracted. Since the goal of release-based defect
prediction is to predict whether a file will be affected by an
issue report in the future, only the issue reports that were
classified as bugs, were reported after the studied release,
and affected the studied release were considered.
(Step-2) Collect a snapshot of source code. For each release,
a snapshot of source code at the release date (i.e., the
list of files of a given release) was collected from the Git
Version Control System (VCS).
(Step-3) Identify defect-fixing commits. Defective files
are defined as files that are affected by post-release
defects [44, 77]. To identify defective files, the commits that
addressed a post-release defect (i.e., defect-fixing commits)
had to be identified. To do so, defect-fixing
commits that are associated with the defect reports
affecting the studied release were located by using regular
expressions to search for the issue IDs in the commit
messages. Then, files that were fixed for the defect reports
were labelled as defective, otherwise clean.
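As a minimal illustration of this linking step, the sketch below searches commit messages for JIRA-style issue keys; the pattern, project key, and example data are hypothetical and do not reproduce the exact heuristic used to build the dataset.

import re

ISSUE_KEY = re.compile(r'\b(LUCENE-\d+)\b')  # hypothetical JIRA-style key pattern

def issue_ids_in(commit_message: str):
    # Extract all issue IDs mentioned in a commit message.
    return set(ISSUE_KEY.findall(commit_message))

bug_reports = {'LUCENE-3102'}  # issue IDs classified as bugs for the release
msg = 'LUCENE-3102: fix a deadlock when closing the IndexWriter'
fix_ids = issue_ids_in(msg) & bug_reports
print(fix_ids)  # a non-empty set marks this as a defect-fixing commit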