# D-ACT: Towards Diff-Aware Code Transformation for Code Review Under a Time-Wise Evaluation

Chanathip Pornprasit
Monash University
Australia
chanathip.pornprasit@monash.edu
Chakkrit Tantithamthavorn§
Monash University
Australia
chakkrit@monash.edu
Patanamon Thongtanunam
The University of Melbourne
Australia
patanamon.t@unimelb.edu.au
Chunyang Chen
Monash University
Australia
chunyang.chen@monash.edu
Abstract—Code review is a software quality assurance practice,
yet remains time-consuming (e.g., due to slow feedback from
reviewers). Recent Neural Machine Translation (NMT)-based
code transformation approaches were proposed to automatically
generate an approved version of changed methods for a given
submitted patch. The existing approaches could change code
tokens in any area in a changed method. However, not all code
tokens need to be changed. Intuitively, the changed code tokens
in the method should be paid more attention to than the others
as they are more prone to be defective. In this paper, we present
an NMT-based Diff-Aware Code Transformation approach (D-
ACT) by leveraging token-level change information to enable
the NMT models to better focus on the changed tokens in
a changed method. We evaluate our D-ACT and the baseline
approaches based on a time-wise evaluation (that is ignored by
the existing work) with 5,758 changed methods. Under the time-
wise evaluation scenario, our results show that (1) D-ACT can
correctly transform 107 - 245 changed methods, which is at least
62% higher than the existing approaches; (2) the performance
of the existing approaches drops by 57% to 94% when the time-
wise evaluation is considered; and (3) D-ACT is improved by 17%
- 82% with an average of 29% when considering the token-level
change information. Our results suggest that (1) NMT-based code
transformation approaches for code review should be evaluated
under the time-wise evaluation; and (2) the token-level change
information can substantially improve the performance of NMT-
based code transformation approaches for code review.
Index Terms—Modern Code Review, Deep Learning, Neural
Machine Translation
I. INTRODUCTION
Code review is the practice where reviewers (i.e., developers
other than the author) review a patch (i.e., a set of code
changes) to ensure that it meets quality standards before
being approved to be integrated into a codebase. Recent
studies showed that developers perform code review to identify
security issues [1]; and detect code smells [2], issues of
refactored code [3], and software defects [4, 5]. Although
code review brings many benefits to software development,
developers still face challenges while revising their submitted
patches. In particular, one of the top challenges of developers
is receiving timely feedback [5]. Rigby et al. [6] also reported
that developers may have to wait for 15 to 20 hours to receive
feedback for their initial version of the submitted patches.
§The corresponding author.

Fig. 1: A usage scenario of our Diff-Aware Code Transformation (D-ACT) approach.

Neural Machine Translation (NMT)-based code transformation approaches have been proposed to facilitate developers revising their submitted patches. For example, Thongtanunam et
al. [7] proposed AutoTransform, which leverages a Trans-
former model [8] and Byte-Pair Encoding (BPE) subword
tokenization [9] for code transformation. Similarly, Tufano et
al. [10] proposed a T5-based pre-trained code model for
automated code review activities (called TufanoT5 onwards).
Intuitively, the changed code tokens in the method should be
paid more attention to than the others as they are more prone to
be defective [11]. In addition, prior work found that changed
code in the past is more likely to be changed again in the
future (e.g., ﬁxing software defects) [12]. Unfortunately, the
existing code transformation approaches [7, 10, 13, 14] only
learn the sequence of code tokens without knowing which
code tokens are previously changed from the codebase.
In this paper, we are the ﬁrst to present a Diff-Aware
Code Transformation (D-ACT) approach for code review that
leverages the token-level code difference information and
CodeT5 [15] (the state-of-the-art pre-trained code model).
Different from the existing approaches [7, 10], we design our
D-ACT approach to transform the initial version (i.e., the ﬁrst
version of a submitted patch) to the approved version (i.e., the
version after being reviewed and approved by reviewers) while
considering the token-level code difference information which
is extracted from the code difference between the codebase
version and the initial version. Our intuition is that changed
code tokens are more likely to introduce defects than the
others [11]; thus, explicitly providing such information may
help the NMT models better learn the code transformation.
While temporal information is well-regarded in the lit-
erature [16–19], such temporal information remains largely
ignored by the existing code transformation approaches for
code review [7, 10]. Thus, we are the ﬁrst to evaluate the
NMT-based code transformation approaches for code reviews
under the time-wise scenario. Speciﬁcally, we consider the
chronological order of patches in our evaluation, meaning
that no future patches are used to train the models. Finally,
we conduct experiments based on our datasets collected
from three large-scale software projects, i.e., Google, Ovirt,
and Android. Through an experimental evaluation of 57,615
changed methods that span across the three software projects,
we address the following research questions:
(RQ1) How do our approach and state-of-the-art code
transformation approaches for code review perform under
the time-wise evaluation scenario?
Result. Our D-ACT can correctly transform 107-245 changed
methods (i.e., beam size = 10), which is at least 62% higher
than TufanoT5 [10] and 274% higher than AutoTransform [7].
Besides, when the state-of-the-art code transformation ap-
proaches (i.e., TufanoT5 and AutoTransform) are evaluated
under the time-wise scenario, the numbers of perfect matches
decrease by at least 57% and 92%, respectively.
(RQ2) What are the contributions of the components
(i.e., the token-level code difference information and the
pre-trained model) of our approach?
Result. Using the token-level code difference information
can increase the number of correct transformations by 17%-
82% when compared to without using such information. In
addition, using CodeT5 can increase the number of correct
transformations by at least 29% when compared to using
the other pre-trained code models (i.e., CodeBERT [20],
GraphCodeBERT [21], PLBART [22]).
In summary, these results suggest that (1) NMT-based
code transformation approaches for code review should be
evaluated under the time-wise evaluation; and (2) the token-
level code difference information can substantially improve the
performance of NMT-based code transformation approaches
for code review.
Open Science. To support future studies, our replication
package is available on GitHub [23].
Paper Organization. Section II describes the background
and related work in order to situate this paper with respect to
the literature. Section III presents our proposed D-ACT. Sec-
tion IV describes the experimental design. Section V presents
the experimental results of our study. Section VI discusses our
experimental results. Section VII discloses possible threats to
the validity. Section VIII draws the conclusions.
II. BACKGROUND & RELATED WORK
In this section, we describe the background and related
work to situate the paper with respect to the literature, and
describe the limitations of state-of-the-art code transformation
approaches for code review.
A. Code Review
Code review is a software quality assurance practice where
developers other than patch authors are required to provide
feedback to artifacts to ensure that the artifacts meet quality
standards. Nowadays, code review practices are commonly
performed in asynchronous tool-based settings (e.g., Gerrit,
GitHub, Phabricator, Review Board) [6, 24]. Figure 1 presents
an overview of a code review process, which comprises
three main steps. In Step 1, developers (i.e., patch authors)
create the ﬁrst version of a patch (i.e., the initial version)
by modifying source code in the codebase (i.e., the codebase
version) or creating a new one. Then, developers submit
the created patch to a code review platform (e.g., Gerrit).
Next, in Step 2, reviewers review and provide feedback on
the submitted patch. Then, the developers revise the patch and
submit it to the code review platform again. Finally, in Step 3, once the reviewers agree that the
patch can be integrated into the codebase, they will approve
the patch (i.e., the approved version) to be integrated into the codebase.
B. Automated Code Transformation for Code Review
Although different automated approaches were proposed to
support code review activities (e.g., review prioritization [25,
26], just-in-time defect prediction [17, 27, 28] and localiza-
tion [29–32], reviewer recommendation [33–37], AI-assisted
code reviewer [38, 39]), such activities still require manual
effort which can be time-consuming for developers [4, 5].
Indeed, prior work reported that reviewers could spend a
large amount of time on reviewing code (e.g., developers in
open-source software projects have to spend more than six
hours per week on average reviewing code [40]). Furthermore,
it is also challenging for patch authors to receive timely
feedback. For example, developers at Microsoft may wait for
15 to 20 hours to receive feedback for their ﬁrst version
of the submitted patches [6]. Therefore, an automated code
transformation approach could potentially save developers’
effort by automatically revising submitted patches.
Neural Machine Translation (NMT)-based code transforma-
tion approaches have been proposed to facilitate developers by
generating an approved version of a submitted patch. Broadly
speaking, the NMT-based code transformation approaches
will transform the method of the initial version minitial in
a submitted patch to the method of the approved version
mapproved. To do so, such approaches learn the mapping
between minitial and mapproved by computing the conditional
probability p(mapproved | minitial).
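As a toy illustration of this formulation, the joint probability of an approved method can be factorized into per-token conditional probabilities; the sketch below uses made-up probability values, not the output of any real model.

```python
import math

def sequence_log_prob(step_probs):
    """Log of p(m_approved | m_initial), factorized autoregressively:
    p(t_1..t_n | source) = prod_i p(t_i | source, t_1..t_{i-1}).
    `step_probs` holds a model's probability of each reference token at
    each decoding step (hypothetical values, purely for illustration)."""
    return sum(math.log(p) for p in step_probs)

# Four decoding steps of a hypothetical approved method.
log_p = sequence_log_prob([0.9, 0.8, 0.95, 0.7])
print(round(math.exp(log_p), 4))  # → 0.4788
```

Working in log space avoids numerical underflow when the sequences are long, which is why NMT implementations rank candidates by cumulative log-probability.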
Recent studies have proposed various NMT-based code
transformation approaches for code review. Tufano et al. [14]
were the ﬁrst to leverage the RNN architecture and code
abstraction to learn meaningful code changes in code review.
Later, Thongtanunam et al. [7] presented AutoTransform to
address the limitation of Tufano et al.'s approach [14] by
leveraging a Byte-Pair Encoding (BPE) subword tokeniza-
tion [9] and a Vanilla Transformer architecture [8] to better
handle the changed methods where the approved versions contain newly-introduced code tokens.

Fig. 2: A real-world example of a code review from the Ovirt project. As shown in this example, reviewers tend to provide comments related to the changed code tokens (i.e., empty) rather than the others, where this token is eventually changed to EMPTY in the approved version. However, the existing code transformation approaches for code review may make changes to the code tokens that should remain unchanged (highlighted in yellow), since these approaches do not know which code tokens should be paid more attention to.

However, such approaches may incorrectly transform minitial to mapproved since
the models are trained from limited knowledge of source code
(i.e., a limited amount of training data).
Recently, Tufano et al. [10] proposed to build a pre-trained
language model using the Text-to-Text Transfer Transformer
(T5) architecture [41] in order to learn general knowledge of
source code (called TufanoT5, henceforth). By using a transfer
learning approach [42], the model is then ﬁne-tuned on the
code review dataset to perform the code transformation task.
With the use of a transfer learning approach (i.e., pre-trained
and then ﬁne-tuned), the model is able to generate better vector
representation, producing more accurate generations of source
code for the code transformation task. Although prior studies
have shown promising results of the previous NMT-based
code transformation approaches for code reviews [7, 10], the
performance of NMT-based code transformation approaches
heavily relies on the knowledge that is used to train a model
and the methodology that is used to evaluate the model.
Lack of Code-Diff Information. As pointed out by
Beller et al. [11], code changes are prone to be more defective
than others, thus the area of code changes should require more
attention than the others. In addition, Xie et al. [12] also found
that changed code in the past is more likely to be changed
again in the future (e.g., ﬁxing software defects). However, the
existing code transformation approaches [7, 10, 13, 14] only
learn the sequence of code tokens without knowing which code
tokens are previously changed from the codebase. Therefore,
it is possible that the existing code transformation approaches
may transform code tokens that should remain unchanged and
also may not transform code tokens that should be changed,
as pointed out by Thongtanunam et al. [7].
To illustrate this scenario, Figure 2 presents a real-world
example of a code review from the Ovirt project. Generally,
the existing code transformation approaches require a sequence
of code tokens in the initial version without knowing that
empty is the only token that was changed from the codebase
version. With the existing code transformation approaches,
they may transform the code tokens EventArgs and Prop-
ertyChanged to eventArgs and propertyChanged
(highlighted in yellow) in the generated approved version,
while these two code tokens should remain unchanged in
the actual approved version. To address this challenge, we
set out to investigate the impact of the token-level code
difference information on the performance of automated code
transformation approaches.
Lack of Temporal Information. Code review is a practice
that is conducted in temporal order. Thus, temporal infor-
mation must be considered in the evaluation setup. Prior
studies raised concerns that the evaluation setup may have a
negative impact on the performance of automated approaches
for software engineering tasks (called experimental bias). For
example, prior work [16, 17] found that temporal information
must be considered when evaluating the just-in-time defect
prediction models. Similarly, Jimenez et al. [18] also found
that unrealistic labelling due to a lack of temporal information
has a negative impact on the performance of vulnerability
prediction approaches. Liu et al. [19, 43] also found that
by changing the experimental setup to consider temporal
information, the performance of ML-based malware detec-
tion approaches is substantially decreased. While temporal
information is well-regarded in the literature, such temporal
information remains largely ignored by the existing code
transformation approaches for code review [7, 10, 13, 14].
Thus, it is possible that these approaches may learn some
of the future patches to generate an approved version of the
methods in old patches, which is not well aligned with the
realistic evaluation scenario. To address this challenge, we set
out to investigate the impact of the time-wise scenario on the
performance of automated code transformation approaches.
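A chronological split that respects this constraint can be sketched as follows; the field name `submitted` and the split fractions are our own illustration, not the authors' data schema.

```python
from datetime import date

def time_wise_split(patches, train_frac=0.8, valid_frac=0.1):
    """Order patches chronologically before splitting, so that no future
    patch leaks into the training set (a minimal sketch of a time-wise
    evaluation setup)."""
    ordered = sorted(patches, key=lambda p: p["submitted"])
    n = len(ordered)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]
    return train, valid, test

patches = [{"id": i, "submitted": date(2020, 1, 1 + i)} for i in range(10)]
train, valid, test = time_wise_split(patches)
# Every training patch predates every test patch.
assert max(p["submitted"] for p in train) < min(p["submitted"] for p in test)
```

A random split, by contrast, would interleave past and future patches across the three sets, letting the model "see the future" during training.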
III. D-ACT: DIFF-AWARE CODE TRANSFORMATION FOR CODE REVIEW
In this section, we present our D-ACT, an NMT-based code
transformation approach that leverages the token-level code
difference information and CodeT5 [15], the state-of-the-art pre-trained code model.
Our underlying intuition of using the token-level code
difference information is that changed code tokens are more
likely to introduce defects than the others [11]; thus, explicitly
providing such information may help NMT models better
learn the code transformation. Figure 2 illustrates an example
of the input of our approach. Generally speaking, D-ACT
uses special tokens (i.e., <START_MOD> and <END_MOD>)
to explicitly specify which code tokens in the initial version
were changed from the codebase version. We hypothesize that
with this information, the NMT model will better capture the
relationship between all code tokens and the given special
tokens, reducing the probability that code tokens outside of the
special tokens will be changed. We also opt to use the CodeT5
pre-trained model because the model was speciﬁcally trained
to learn the syntactic and semantic information of code tokens.
This is different from the pre-trained T5 model of Tufano et
al. [10], which learns source code without focusing on the syntactic information of code tokens.
A. Overview
Figure 3 presents an overview of our D-ACT approach,
which consists of three main steps: (Step 1) data preparation,
(Step 2) training phase, and (Step 3) inference phase. In
Step 1, we extract three versions of a changed method, i.e.,
codebase (mcodebase), initial (minitial ), and approved versions
(mapproved). Then, we integrate the token-level code difference
information into minitial (i.e., minitial+diﬀ ). Once we prepare
the dataset, we use pairs of minitial+diﬀ and mapproved to
ﬁne-tune the CodeT5 pre-trained model in Step 2. Finally,
in Step 3, for each patch in the testing dataset, we use the
ﬁne-tuned CodeT5 model to generate mapproved based on a
given minitial+diﬀ. We describe the details of each step of
our D-ACT below.
B. (Step 1) Data Preparation
The key goal of our data preparation step is to integrate
the token-level code difference information into the minitial.
To do so, for each changed ﬁle in each patch, we ﬁrst
identify changed methods based on its codebase, initial, and
approved versions. Specifically, we employ a source code analysis tool, namely Iterative Java Matcher (IJM) [44]1, to
1 https://github.com/VeitFrick/IJM
extract pairs of changed methods between the codebase and initial versions of a changed file, i.e., ⟨mcodebase, minitial⟩; and pairs of changed methods between the initial and approved versions, i.e., ⟨minitial, mapproved⟩. Then, we merge each ⟨mcodebase, minitial⟩ and ⟨minitial, mapproved⟩ into a triplet of the changed method ⟨mcodebase, minitial, mapproved⟩ based on the whole content of minitial. Note that in our approach, we also consider the newly-added minitial, i.e., ⟨Ø, minitial, mapproved⟩, but we do not consider deleted methods ⟨mcodebase, minitial, Ø⟩ or ⟨mcodebase, Ø, Ø⟩ since our approach aims to transform the existing minitial in a patch to the version that is reviewed and approved.
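The merging step above can be sketched as follows; the one-line method bodies are hypothetical, and the real approach extracts the method pairs with IJM rather than receiving them as ready-made strings.

```python
def build_triplets(codebase_initial_pairs, initial_approved_pairs):
    """Merge pairs <m_codebase, m_initial> and <m_initial, m_approved> into
    triplets <m_codebase, m_initial, m_approved>, keyed on the whole content
    of m_initial. Newly-added methods get an empty codebase side; deleted
    methods are dropped. A sketch of the described step, not the authors'
    code."""
    codebase_by_initial = {init: base
                           for base, init in codebase_initial_pairs
                           if init is not None}
    triplets = []
    for init, approved in initial_approved_pairs:
        if init is None or approved is None:  # deleted methods: not considered
            continue
        base = codebase_by_initial.get(init, "")  # "" marks newly-added (Ø)
        triplets.append((base, init, approved))
    return triplets

# Hypothetical method bodies: f was modified twice; g was deleted; h is new.
pairs_ci = [("int f(){return 0;}", "int f(){return 1;}"),
            ("int g(){}", None)]
pairs_ia = [("int f(){return 1;}", "int f(){return 2;}"),
            ("int h(){}", "int h(){return 3;}")]
print(build_triplets(pairs_ci, pairs_ia))
```

Keying the merge on the whole content of m_initial is what links the two pair lists: a method appears in the triplet only if the same initial version occurs on both sides.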
After we obtain a triplet of a changed method, we analyze
the token-level difference information and integrate it into
minitial. We integrate the token-level code difference informa-
tion into minitial by inserting special tokens <START_MOD>
and <END_MOD> to specify the code tokens that were changed
from mcodebase. Figure 2 shows an example of minitial with
the token-level code difference information (minitial+diﬀ ). To
do so, we ﬁrst identify the code tokens in minitial that were
changed from mcodebase. Speciﬁcally, we tokenize minitial
and mcodebase in sequences of tokens using the Java parser
namely javalang2. Then, we use the approach of Liu et
al. [45] to align code tokens between minitial and mcodebase
and identify the changed code tokens (i.e., either replaced or
inserted). Finally, in minitial, we insert <START_MOD> and
<END_MOD> at the beginning and the end of each sequence
of the changed tokens to produce minitial+diﬀ .
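The special-token insertion can be sketched as below. Note that this sketch uses `difflib`'s sequence alignment as a stand-in for the token-alignment approach of Liu et al. [45] used in the paper, and the token sequences are simplified from the Figure 2 example.

```python
import difflib

START, END = "<START_MOD>", "<END_MOD>"

def mark_changed_tokens(codebase_tokens, initial_tokens):
    """Wrap the tokens of m_initial that were replaced or inserted relative
    to m_codebase with <START_MOD>/<END_MOD>, producing m_initial+diff."""
    sm = difflib.SequenceMatcher(a=codebase_tokens, b=initial_tokens)
    marked = []
    for op, _a1, _a2, b1, b2 in sm.get_opcodes():
        if op in ("replace", "insert"):
            marked += [START] + initial_tokens[b1:b2] + [END]
        elif op == "equal":
            marked += initial_tokens[b1:b2]
        # "delete": the token exists only in m_codebase; nothing to emit
    return marked

codebase = "raise ( this , EventArgs . Empty )".split()
initial = "raise ( this , EventArgs . empty )".split()
print(" ".join(mark_changed_tokens(codebase, initial)))
# → raise ( this , EventArgs . <START_MOD> empty <END_MOD> )
```

Only the replaced token `empty` is wrapped; the unchanged tokens around it pass through untouched, which is exactly the signal the NMT model is meant to exploit.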
C. (Step 2) Training Phase
During the training phase, we leverage the CodeT5
pre-trained code model to learn code transformation in
code review through the relationship between the pairs of
⟨minitial+diff, mapproved⟩ in the training dataset. Prior to training
a model, tokenization plays an important role to break source
code into a meaningful sequence of code tokens. As suggested
by prior work [7, 46, 47], we leverage the Byte-Pair Encoding
(BPE) approach [9] to perform subword tokenization, i.e.,
split tokens into a list of sub-tokens. BPE subword tokenization greatly reduces the vocabulary size, while enabling the model to create tokens that never appear in the training dataset. We use the BPE tokenizer that
is trained on the CodeSearchNet (CSN) corpus [48], provided
by Wang et al. [15]. Since our approach uses special tokens
(i.e., <START_MOD> and <END_MOD>) to explicitly specify
which code tokens in the minitial are changed from mcodebase,
we include the special tokens in the BPE tokenizer to ensure
that these tokens will not be split into sub-tokens. Then,
we apply the BPE tokenizer to minitial+diﬀ and mapproved
to produce subtoken-level methods (i.e., sminitial+diﬀ and
smapproved, respectively), where these subtoken-level methods
contain a sequence of subtokens sm = [st1, ..., stn]. Finally,
we feed these sequences into the CodeT5 pre-trained code
models in order to learn the relationship between the input
sequence of the initial version (sminitial+diff) and the output sequence of the approved version (smapproved) for each patch in the training dataset to build an NMT model. Below, we describe the technical details of the BPE tokenization and the architecture of CodeT5.

2 https://github.com/c2nes/javalang

Fig. 3: An overview of our approach.
1) BPE Tokenization: The BPE tokenization process con-
sists of two main steps: generating merge operations which
determine how code tokens should be split, and applying
merge operations to split code tokens into a list of sub-tokens.
To generate merge operations, BPE first splits all code tokens into lists of characters. Next, BPE generates a merge operation by identifying the symbol pair (i.e., the pair of two consecutive symbols) with the highest frequency in the corpus. Then, the co-occurrences of the characters in the identified symbol pair are replaced by the merged symbol pair, without removing a single character that appears alone. After that, the merged symbol pair is added to the vocabulary list. For example, suppose that (‘a’, ‘c’) has the highest frequency. The merge operation is (‘a’, ‘c’) → ‘ac’, which replaces all of the co-occurrences of (‘a’, ‘c’) with ‘ac’, without removing an ‘a’ or ‘c’ that appears alone. The above steps are then repeated until a given number of merge operations is reached.
After the merge operations are generated, we apply the
merge operations to split code tokens into a list of sub-tokens.
To do so, BPE ﬁrst splits all code tokens into sequences of
characters. Then, the generated merge operations are applied
to minitial+diff and mapproved in the training data, resulting in sminitial+diff and smapproved, respectively.
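The two steps above can be sketched as a minimal BPE learner and applier; the toy identifiers below are our own illustration and this is not the CodeT5 tokenizer, which is trained on the much larger CodeSearchNet corpus.

```python
from collections import Counter

def merge_word(symbols, pair):
    """Apply one merge operation to a single token's symbol list."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(words, n_merges):
    """Learn merge operations: repeatedly merge the most frequent
    adjacent symbol pair in the corpus (a minimal BPE sketch)."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [merge_word(symbols, best) for symbols in corpus]
    return merges

def apply_merges(token, merges):
    """Split a code token into sub-tokens by replaying learned merges."""
    symbols = list(token)
    for pair in merges:
        symbols = merge_word(symbols, pair)
    return symbols

merges = learn_merges(["getItems", "getItem", "setItems"], n_merges=3)
print(apply_merges("getItems", merges))
```

Because merges are learned from frequency, shared stems such as `etIt` end up as single sub-tokens, while rare identifiers are still representable as sequences of characters.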
2) The Architecture of CodeT5: The architecture of the
CodeT5 is based on the T5 architecture [41], which begins
with a word embedding layer, followed by an encoder block
and a decoder block. The architecture then ends with a linear
layer and a softmax activation function.
Word Embedding Layer. Given sminitial+diﬀ (or
smapproved) obtained from BPE tokenization, we generate
an embedding vector for each sub-token and combine them into an embedding matrix. Then, to capture the information related to the position of each token, unlike BERT [49] that
combines embedding vectors with the vectors obtained from
absolute position encoding, CodeT5 leverages relative position
encoding to encode the position of tokens during self-attention
calculation. The self-attention mechanism is described below.
The Encoder Block. The generated word embedding
matrix of sminitial+diﬀ is fed to the encoder block, which
consists of six identical encoder layers. Similar to a Vanilla
Transformer [8], each encoder layer is composed of two
sub-components: a multi-head self-attention mechanism and a
fully-connected feed-forward network (FFN). Before the input
vector is fed to each sub-component, layer normalization [50]
and residual skip connection [51] are respectively applied to
the input vector. Different from Vanilla Transformer [8], the
layer normalization here is a simpliﬁed version where additive
bias is not applied.
Multi-head self-attention mechanism [8] is employed to
allow CodeT5 to capture the contextual relationship between
all tokens in the whole sequence. In particular, the multi-head
self-attention mechanism calculates the attention score of each
token by using the dot product operation. There are three main
components for calculating the attention score: query (Q), key (K), and value (V) matrices. The attention weight is first computed by taking the scaled dot-product between Q and K. The softmax function is then applied to the result of the dot-product. Finally, the dot-product between the result obtained from the softmax function and V is calculated to update the value in V. However, different from Vanilla Transformer [8]
that encodes the absolute position of each token before com-
puting self-attention score, CodeT5 uses relative positional
encoding to efﬁciently represent the position of tokens. Thus,
an additional matrix P, which represents the relative position
of tokens, is computed and supplied during self-attention
calculation. Finally, the attention vectors obtained from the
last encoder layer will be used to help decoder layers focus
on appropriate tokens in the sminitial+diﬀ .
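The attention computation described above can be sketched in simplified form as follows. This is a single-head, pure-Python illustration of adding a relative-position bias matrix P to the attention logits, not CodeT5's actual implementation; the 2-token Q, K, V matrices are made up.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attention_with_relative_bias(Q, K, V, P):
    """Scaled dot-product attention where a relative-position bias P[i][j]
    is added to the logits before the softmax, in the spirit of T5-style
    relative position encoding (a simplified single-head sketch)."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        logits = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d) + P[i][j]
                  for j, k in enumerate(K)]
        w = softmax(logits)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, V))
                    for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]          # one query vector per token
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
# A very large negative bias is also how masked attention blocks a position.
P = [[0.0, -1e9], [0.0, 0.0]]
out = attention_with_relative_bias(Q, K, V, P)
print(out[0])  # position 0 attends only to token 0, so [1.0, 2.0]
```

The same mechanism implements the decoder's masked attention: setting the bias for future positions to a large negative value drives their softmax weights to zero.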
The Decoder Block. The generated embedding matrix of
smapproved is fed to the decoder block, which consists of six
identical decoder layers. Each decoder layer is composed of
the sub-components similar to the ones in the encoder layer.
However, different from the encoder layer, the decoder layer
uses a special kind of self-attention mechanism (i.e., masked attention) to generate attention vectors. The concept of the masked attention is that, during the training phase, the attention of tokens that appear after the current token is masked out to ensure that the model does not attend to future (unseen) tokens while generating the next token. Then, the decoder layer uses the multi-head encoder-
decoder self-attention mechanism to compute attention vec-
tors, using the attention vectors generated by the encoder
self-attention mechanism. After the decoder block computes
attention vectors at the last layer, the linear layer will convert
the attention vectors to the logits vector, having length equal
to the vocabulary size. Then, the softmax function is applied to the logits vector to generate a probability vector, which determines the next token to be generated. Finally, the loss function will
compare the generated smapproved with the actual smapproved
to estimate an error.
D. (Step 3) Inference Phase
After we ﬁne-tune the CodeT5 pre-trained model, we aim
to generate mapproved from the given minitial+diﬀ in the
testing dataset. To do so, we ﬁrst tokenize minitial+diﬀ to
sminitial+diﬀ by using BPE like in the training phase. Then,
we leverage beam search to generate the top-k best candidates of mapproved. The beam size k is a hyperparameter specifying the number of the best candidates to be generated. In particular, after a token is generated at each time step, the beam search uses a best-first search strategy to select the top-k candidates that have the highest conditional probability. The beam search will keep generating the top-k best candidates until the EOS token (i.e., “</s>”) is generated.
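The decoding loop above can be sketched as a toy beam search. Here `toy_model` and its three-token vocabulary are made-up stand-ins for the fine-tuned CodeT5 model's next-token distribution; this is an illustration of the search strategy, not the authors' implementation.

```python
import math

def beam_search(step_fn, beam_size, eos, max_len):
    """Keep the top-k partial sequences ranked by cumulative
    log-probability and expand them until the EOS token is generated."""
    beams = [([], 0.0)]          # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, p in step_fn(prefix).items():
                candidates.append((prefix + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)

# A hypothetical distribution over three candidate tokens for the first
# step; afterwards the model deterministically emits the EOS token.
def toy_model(prefix):
    if not prefix:
        return {"empty": 0.2, "EMPTY": 0.7, "Empty": 0.1}
    return {"</s>": 1.0}

ranked = beam_search(toy_model, beam_size=2, eos="</s>", max_len=5)
print(ranked[0][0])  # → ['EMPTY', '</s>']
```

With k = 2, the search returns the two most probable completions in ranked order, mirroring how D-ACT produces up to k candidate approved versions per changed method.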
IV. STUDY DESIGN
In this section, we provide the motivations of the research
questions and describe our experimental setup.
A. Research Questions
We formulate the following two research questions in our
study.
(RQ1) How do our approach and state-of-the-art code
transformation approaches for code review perform under
the time-wise evaluation scenario?
Motivation. Recently, Thongtanunam et al. [7] and Tufano et
al. [10] proposed code transformation approaches for code
review to facilitate developers during the code review process.
However, as explained in Section II, such approaches still have
limitations that need to be addressed. Therefore, we formulate
this RQ to investigate the accuracy of our proposed D-ACT
approach and the existing code transformation approaches
(i.e., AutoTransform [7] and TufanoT5 [10]) in the time-wise
evaluation scenario.

TABLE I: Overview statistics of the studied datasets.

| Project | # Total patches | # Total studied patches | # Train triplets | # Validation triplets | # Test triplets |
|---------|-----------------|-------------------------|------------------|----------------------|-----------------|
| Android | 22,746 | 4,497 (19.77%) | 14,690 | 1,836 | 1,835 |
| Google  | 14,099 | 3,486 (24.73%) | 9,899  | 1,237 | 1,235 |
| Ovirt   | 22,800 | 5,802 (25.45%) | 21,509 | 2,686 | 2,688 |
(RQ2) What are the contributions of the components (i.e.,
the token-level code difference information and the pre-
trained model) of our approach?
Motivation. Our D-ACT leverages the token-level code
difference information and the pre-trained code model (i.e.,
CodeT5 [15]). However, little is known about the contribution
of the token-level code difference and the pre-trained code
model to our approach. The effectiveness of our approach
may change when the token-level code difference is removed,
or when CodeT5 is replaced with other pre-trained
programming language models or other code transformation
approaches. Thus, we formulate this RQ to investigate
the impact of the token-level code difference and the
pre-trained programming language models on the accuracy
of our approach.
B. Experimental Setup
We now describe our data collection and data filtering
approaches, the model implementation, the hyper-parameter
settings in our experiments, and the evaluation measure.
Data Collection: In this work, we collect new datasets
from the Gerrit code review repositories of Google, Ovirt,
and Android, which are software projects in which code reviews
are actively performed. We do not use the datasets of the
prior studies [7, 14] since their datasets only have pairs of
(mcodebase, mapproved), while our approach requires triplets
of (mcodebase, minitial, mapproved). To collect the datasets,
we use the Gerrit REST APIs to retrieve three versions of the
submitted patches and their changed Java files. Specifically, we
obtain the Java files of the codebase and the initial version from
the first version of a patch that was submitted for review. We
then identify the Java files in the approved version that are changed
from the initial version.
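Retrieving the changed files for a revision can be sketched as below. The `)]}'` prefix stripping reflects Gerrit's documented JSON responses, and the file-list shape mirrors the `GET /changes/{id}/revisions/{rev}/files/` endpoint; this is an illustrative sketch, not the authors' actual collection script.

```python
import json

GERRIT_XSSI_PREFIX = ")]}'"

def parse_gerrit_json(raw: str):
    """Gerrit prefixes every JSON response with ")]}'" to defeat
    cross-site script inclusion; strip it before decoding."""
    if raw.startswith(GERRIT_XSSI_PREFIX):
        raw = raw[len(GERRIT_XSSI_PREFIX):]
    return json.loads(raw)

def changed_java_files(revision_files: dict):
    """Keep only changed .java files, skipping Gerrit's magic paths
    (e.g. /COMMIT_MSG) that also appear in a revision's file list."""
    return sorted(p for p in revision_files
                  if p.endswith(".java") and not p.startswith("/"))

# A canned response shaped like GET /changes/{id}/revisions/{rev}/files/
raw = ")]}'\n" + json.dumps({
    "/COMMIT_MSG": {"status": "A"},
    "src/Foo.java": {"lines_inserted": 3},
    "docs/readme.md": {"lines_inserted": 1},
})
print(changed_java_files(parse_gerrit_json(raw)))  # → ['src/Foo.java']
```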
Data Filtering: We first follow the steps below to filter out
method pairs ((mcodebase, minitial) and (minitial, mapproved))
before forming triplets (mcodebase, minitial, mapproved). Table I
provides summary statistics of the dataset after the data
filtering steps are applied.
1) Exclude patches that contain a single revision. It
is possible that the first version of a patch is approved
without further revision. Thus, to ensure that
our D-ACT can help developers transform minitial to
an mapproved that is different from minitial, we exclude
patches that contain a single revision.
2) Exclude the method pairs having minitial that appears
more than once in the same file of a patch. It is
possible that developers revise different mcodebase to
identical minitial, or revise identical minitial to different
mapproved. Thus, there can be different method pairs
containing the same minitial but different mcodebase or
mapproved. Consequently, when forming triplets from
such method pairs, extra invalid triplets may be created.
Thus, to ensure the correctness of the triplets in the
dataset, we exclude such method pairs.
After the method pairs are filtered, they are used to
form the triplets (mcodebase, minitial, mapproved). The triplets
are then filtered according to the steps below.
1) Exclude the triplets having minitial that is the
same as mcodebase. During the code review process,
developers tend to revise the modified or newly added
methods rather than the whole files. However, there are
cases where mcodebase is the same as minitial. Since
such minitial remains unchanged, we exclude these triplets.
2) Exclude the triplets having minitial that appears in
more than one triplet. Developers possibly revise identical
minitial to different mapproved, leading to triplets
having the same minitial but different mapproved. However,
if an NMT model learns from these triplets, it
may not learn how to correctly translate
identical minitial to different mapproved. Thus, similar to
prior work [10], the triplets with the same minitial but
different mapproved are marked as duplicated and excluded
from the dataset.
3) Exclude the triplets having minitial that is the
same as mapproved. Developers possibly do not revise
minitial until the submitted patch is approved, leading to
triplets having minitial the same as mapproved. However,
if such triplets are included in the dataset, an NMT
model may not learn how to generate mapproved that is
different from minitial. To ensure that the trained NMT
model can generate mapproved that is different from
minitial, we exclude such triplets.
4) Exclude the triplets having minitial or mapproved that
contain more than 512 tokens. The size of methods
can vary; however, an NMT model has a fixed input
size (e.g., CodeBERT [20] can accept input of at
most 512 tokens). Similar to prior work [10], we exclude
triplets having minitial or mapproved with more than
512 tokens.
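The four triplet-filtering rules can be sketched as a single pass; whitespace splitting stands in for real tokenization, and the function and variable names are ours, not the replication package's:

```python
from collections import Counter

MAX_TOKENS = 512

def filter_triplets(triplets):
    """Apply the four triplet-filtering rules to
    (m_codebase, m_initial, m_approved) string triplets."""
    # Rule 2: an m_initial appearing in more than one triplet is ambiguous.
    initial_counts = Counter(t[1] for t in triplets)
    kept = []
    for codebase, initial, approved in triplets:
        if initial == codebase:            # Rule 1: method not changed yet
            continue
        if initial_counts[initial] > 1:    # Rule 2: duplicated m_initial
            continue
        if initial == approved:            # Rule 3: nothing to learn
            continue
        if max(len(initial.split()), len(approved.split())) > MAX_TOKENS:
            continue                       # Rule 4: exceeds model input size
        kept.append((codebase, initial, approved))
    return kept

triplets = [
    ("int a;", "int a ;", "int b ;"),   # kept
    ("int a;", "int a;", "int b;"),     # rule 1: initial == codebase
    ("int c;", "int d;", "int d;"),     # rule 3: initial == approved
    ("int e;", "int f;", "int g;"),     # rule 2: duplicated initial
    ("int h;", "int f;", "int i;"),     # rule 2: duplicated initial
]
print(len(filter_triplets(triplets)))  # → 1
```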
Model Implementation: In the experiment, we implement
our D-ACT, CodeBERT, GraphCodeBERT, PLBART, and
CodeT5 by using the HuggingFace [52] library. In addition,
we use the implementation of TufanoT5³ obtained
from the t5 library, which is implemented in TensorFlow.
Similarly, we use the implementation of AutoTransform⁴
obtained from the Tensor2Tensor library, which is also
implemented in TensorFlow.
Hyper-parameter Setting: In our experiment, the hyper-
parameter settings of the pre-trained models (i.e., CodeBERT,
GraphCodeBERT, CodeT5, PLBART) are obtained from the

³https://github.com/RosaliaTufano/code_review_automation
⁴https://github.com/awsm-research/AutoTransform-Replication

base models (please refer to the Hugging Face website⁵ for more
detail). Similarly, for the baselines, we use the optimal hyper-
parameters as reported in [7] (for AutoTransform) and [10]
(for TufanoT5). During the model training phase, we use a
batch size of 6, a maximum input length of 512, a learning
rate of 5e-5, and a maximum of 300,000 training steps.
The validation loss is calculated every 2,000 training steps to
indicate when to stop training. We use the AdamW optimizer [53]
to minimize the training loss.
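The stopping schedule above (a validation pass every 2,000 steps, up to 300,000 steps) can be sketched framework-free; `train_step` and `val_loss` below are hypothetical stand-ins for the real AdamW update and validation pass:

```python
def train(train_step, val_loss, max_steps=300_000, eval_every=2_000):
    """Run up to `max_steps` optimizer steps, computing the validation
    loss every `eval_every` steps and remembering the checkpoint with
    the lowest validation loss (the one later used for evaluation)."""
    best = (float("inf"), 0)   # (lowest validation loss, step it occurred)
    for step in range(1, max_steps + 1):
        train_step(step)
        if step % eval_every == 0:
            loss = val_loss(step)
            if loss < best[0]:
                best = (loss, step)
    return best

# Synthetic loss curve: falls until step 6,000, then rises (overfitting).
history = []
best = train(lambda s: history.append(s),
             lambda s: abs(s - 6_000) / 1_000 + 1.0,
             max_steps=10_000)
print(best)  # → (1.0, 6000)
```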
Evaluation Measure: Similar to the prior work [7,
10, 14], we measure #perfect match (PM) when evaluating
D-ACT and the baselines. The PM is the number
of generated mapproved that exactly match the actual
mapproved (the mapproved in the ground truth). To compare the
generated mapproved and the actual mapproved, both are first
tokenized by the Java parser obtained from the javalang library.
Then, the token sequence of the generated mapproved and the
token sequence of the actual mapproved are compared.
In our experiment, we use beam sizes (k) of 1, 5 and
10 when generating mapproved. Thus, an NMT model
achieves a perfect match if one of the generated mapproved from
the k sequence candidates matches the actual mapproved. Similar
to the previous study [7], we do not measure the BLEU score [54],
since Ding et al. [55] argue that two similar
code sequences (i.e., two sequences having few different
code tokens) may have different semantics. We also do not
measure CodeBLEU [56] since CodeBLEU does not consider
identifier names. Thus, a generated mapproved may have a high
CodeBLEU even though its identifier names are changed
when compared to the ground truth, indicating that the mapproved is
still incorrectly generated.
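The perfect-match computation can be sketched as follows; whitespace splitting stands in for the javalang tokenizer used in the paper, and the helper names are ours:

```python
def perfect_match(candidates, actual, tokenize=str.split):
    """A generated method counts as a perfect match if ANY of the k
    beam candidates has exactly the same token sequence as the
    ground-truth method."""
    target = tokenize(actual)
    return any(tokenize(c) == target for c in candidates)

def count_pm(all_candidates, references, **kw):
    """#perfect match over a test set: one beam list per reference."""
    return sum(perfect_match(c, r, **kw)
               for c, r in zip(all_candidates, references))

refs = ["int a = 0 ;", "return x ;"]
beams = [
    ["int a=0;", "int a = 0 ;", "int b = 0 ;"],  # 2nd candidate matches
    ["return y ;", "return  x;"],                # no exact token match
]
print(count_pm(beams, refs))  # → 1
```

Note that the comparison is token-wise, so candidates that differ only in whitespace still match, while any token-level difference (including a renamed identifier) does not.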
Experimental Environment: The experiment is conducted
on a server equipped with the following hardware: an AMD
Ryzen 9 5950X @3.4 GHz 16-core CPU, a 1 TB SSD,
64 GB of RAM, and an NVIDIA GeForce
RTX 3090 GPU with 24 GB of memory.
V. RESULTS
In this section, we present the results with respect to our
two research questions.
(RQ1) How do our approach and state-of-the-art code
transformation approaches for code review perform under
the time-wise evaluation scenario?
Approach. To answer this RQ, we build and evaluate our
approach and the baselines (i.e., TufanoT5 [10] and Auto-
Transform [7]) in the time-wise scenario. To do so, we first
sort the triplets in chronological order by the commit date
of their corresponding patches. Then, we split the triplets into
train/validation/test sets in the proportion of 80%/10%/10%,
respectively. In this RQ, we also evaluate the baselines in
the time-ignore scenario, where the triplets in the dataset are
randomly shuffled.
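The chronological split described above can be sketched as follows (a minimal sketch; the triplet representation and function name are ours):

```python
def time_wise_split(triplets, ratios=(0.8, 0.1, 0.1)):
    """Sort triplets by the commit date of their patch, then cut
    train/validation/test contiguously so that no later (future) patch
    leaks into the training set. Each item is (commit_date, data)."""
    ordered = [t for _, t in sorted(triplets, key=lambda x: x[0])]
    n = len(ordered)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])

data = [(day, f"patch-{day}") for day in (3, 1, 4, 2, 5, 7, 6, 9, 8, 10)]
train, val, test = time_wise_split(data)
print(train[-1], val, test)  # → patch-8 ['patch-9'] ['patch-10']
```

A time-ignore split would shuffle `data` before cutting, which is exactly what allows future patches into the training set.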
5https://huggingface.co
TABLE II: (RQ1) #perfect match of our approach and the baselines (i.e., TufanoT5 and AutoTransform) for the time-wise evaluation scenario. The numbers in parentheses indicate the percentage improvement compared to TufanoT5 and AutoTransform, respectively.

| Approach | Google k=1 | Google k=5 | Google k=10 | Ovirt k=1 | Ovirt k=5 | Ovirt k=10 | Android k=1 | Android k=5 | Android k=10 |
|---|---|---|---|---|---|---|---|---|---|
| D-ACT | 74 (+362%, +2,367%) | 165 (+68%, +1,079%) | 202 (+62%, +1,022%) | 48 (+860%, +1,100%) | 187 (+307%, +523%) | 238 (+310%, +367%) | 12 (+1,100%, +1,200%) | 70 (+367%, +438%) | 101 (+261%, +274%) |
| TufanoT5 | 16 | 98 | 125 | 5 | 46 | 58 | 1 | 15 | 28 |
| AutoTransform | 3 | 14 | 18 | 4 | 30 | 51 | 0 | 13 | 27 |
TABLE III: (RQ1) #perfect match of the existing code transformation approaches for the time-ignore and the time-wise evaluation scenarios.

| Approach | Scenario | Google k=1 | Google k=5 | Google k=10 | Ovirt k=1 | Ovirt k=5 | Ovirt k=10 | Android k=1 | Android k=5 | Android k=10 |
|---|---|---|---|---|---|---|---|---|---|---|
| TufanoT5 | time-ignore | 152 | 263 | 293 | 442 | 638 | 718 | 316 | 462 | 501 |
| | time-wise | 16 | 98 | 125 | 5 | 46 | 58 | 1 | 15 | 28 |
| | % decrease | 89.47 | 62.74 | 57.34 | 98.87 | 92.79 | 91.92 | 99.68 | 96.75 | 94.41 |
| AutoTransform | time-ignore | 114 | 208 | 233 | 578 | 794 | 854 | 344 | 489 | 516 |
| | time-wise | 3 | 14 | 18 | 4 | 30 | 51 | 0 | 13 | 27 |
| | % decrease | 97.37 | 93.27 | 92.27 | 99.31 | 96.22 | 94.03 | 100 | 97.34 | 94.77 |
To select a model for evaluation, we choose the model that
achieves the lowest loss on the validation set. Finally, we
measure PM (as explained in Section IV) on the test set.
In particular, we compute the percentage increase of PM as

(PM_ours − PM_baseline) / PM_baseline × 100%

where PM_ours is the PM of our approach, and PM_baseline
is the PM of a baseline.
In addition, we measure the PM achieved by the baselines in
the time-ignore scenario (PM_time-ignore) and the time-wise
scenario (PM_time-wise). We compute the percentage decrease
of PM as follows:

(PM_time-ignore − PM_time-wise) / PM_time-ignore × 100%
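The two formulas can be sanity-checked against the tables; the helper names below are ours:

```python
def pct_increase(pm_ours, pm_baseline):
    """Percentage increase of PM over a baseline (RQ1)."""
    return (pm_ours - pm_baseline) / pm_baseline * 100

def pct_decrease(pm_time_ignore, pm_time_wise):
    """Percentage drop of PM under the time-wise scenario."""
    return (pm_time_ignore - pm_time_wise) / pm_time_ignore * 100

# TufanoT5 on Google at beam size 1: 152 (time-ignore) vs 16 (time-wise).
print(round(pct_decrease(152, 16), 2))  # → 89.47
# D-ACT (74) vs TufanoT5 (16) on Google at beam size 1: +362.5%.
print(pct_increase(74, 16))             # → 362.5
```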
Result. Our D-ACT achieves a PM at least 62%
higher than TufanoT5 and at least 274% higher than
AutoTransform. Table II shows the PM of our
D-ACT and the baselines across the three datasets (i.e., Google,
Ovirt and Android) for beam sizes of 1, 5 and
10. The table shows that for a beam size of 1, our D-
ACT achieves a PM of 12 (Android) to 74 (Google), which
is at least 362% higher than TufanoT5 and 1,100% higher
than AutoTransform. For a beam size
of 5, our D-ACT achieves a PM of 70 (Android) to 187
(Ovirt), which is at least 68% higher than TufanoT5 and
438% higher than AutoTransform. Likewise, we find that
for a beam size of 10, our approach achieves a PM of
101 (Android) to 238 (Ovirt), which is at least 62% higher
than TufanoT5 and 274% higher than AutoTransform. The
results indicate that the number of minitial that our D-ACT can
correctly transform to mapproved is substantially higher than
for the TufanoT5 and AutoTransform approaches.
To investigate why our approach outperforms the Tu-
fanoT5 approach, we manually analyze the mapproved generated
by our approach and by TufanoT5 (k = 1). To
do so, we analyze the mapproved that only our approach can
correctly generate while the TufanoT5 approach cannot.
We find that there are 65, 39, and 12 such mapproved for Google,
Ovirt, and Android, respectively, that our approach can correctly
generate while the TufanoT5 approach cannot. The
results suggest that the higher PM of our approach has to
do with the token-level code difference information and the
CodeT5 [15] model of our approach.
When the existing code transformation approaches are
evaluated in the time-wise scenario, their PM
decreases by at least 57% (for TufanoT5) and
92% (for AutoTransform). Table III shows the PM of
the existing code transformation approaches across the three
datasets (i.e., Google, Ovirt and Android) in the time-ignore
and the time-wise scenarios, for beam sizes of 1, 5
and 10. The table shows that when the beam size is 1, the PM
of TufanoT5 drops by at least 89.47% ((152−16)/152; for Google)
and the PM of AutoTransform drops by at least 97.37%
((114−3)/114; for Google). Even when the beam size increases
to 5 and 10, the PM of TufanoT5 still drops by at least 57.34%
((293−125)/293; for Google at a beam size of 10) to 62.74%
((263−98)/263; for Google at a beam size of 5). Likewise, the
PM of AutoTransform drops by at least 92.27% ((233−18)/233;
for Google at a beam size of 10) to 93.27% ((208−14)/208;
for Google at a beam size of 5). The results
imply that the PM of the existing code transformation approaches
dramatically decreases when evaluated in the time-wise
evaluation scenario. Thus, future work should consider the
time-wise scenario when evaluating proposed approaches.
(RQ2) What are the contributions of the components (i.e.,
the token-level code difference information and the pre-
trained model) of our approach?
Approach. To answer this RQ, we investigate the changes
in PM when the components of our D-ACT are varied.
In particular, we extend our experiment by removing the
token-level code difference information from our approach
to examine the impact of the token-level code difference
information on our approach. We also conduct an experiment
by changing CodeT5 to AutoTransform [7], TufanoT5 [10],
and the following pre-trained programming language models:
CodeBERT [20], GraphCodeBERT [21] and PLBART [22].
CodeBERT and GraphCodeBERT are pre-trained encoder
models based on the BERT [49] architecture, while PLBART is
a pre-trained encoder-decoder model based on the BART [57]
architecture.

TABLE IV: (RQ2) #perfect match of our approach and the CodeT5 model (here, the CodeT5 model is trained without the token-level code difference information) for the time-wise evaluation scenario; the numbers in parentheses indicate the percentage improvement compared to CodeT5.

| Approach | Google k=1 | Google k=5 | Google k=10 | Ovirt k=1 | Ovirt k=5 | Ovirt k=10 | Android k=1 | Android k=5 | Android k=10 |
|---|---|---|---|---|---|---|---|---|---|
| D-ACT | 74 (+1,380%) | 165 (+132%) | 202 (+82%) | 48 (+243%) | 187 (+46%) | 238 (+26%) | 12 (+1,200%) | 70 (+35%) | 101 (+17%) |
| CodeT5 | 5 | 71 | 111 | 14 | 128 | 189 | 0 | 52 | 86 |

TABLE V: (RQ2) #perfect match of our approach and the variants of our approach, where CodeT5 is changed to other pre-trained PL models or other code transformation approaches (trained with the token-level code difference information), for the time-wise evaluation scenario.

| Approach | Google k=1 | Google k=5 | Google k=10 | Ovirt k=1 | Ovirt k=5 | Ovirt k=10 | Android k=1 | Android k=5 | Android k=10 |
|---|---|---|---|---|---|---|---|---|---|
| D-ACT | 74 | 165 | 202 | 48 | 187 | 238 | 12 | 70 | 101 |
| CodeBERT | 48 | 76 | 78 | 15 | 49 | 64 | 1 | 10 | 18 |
| GraphCodeBERT | 45 | 83 | 92 | 17 | 46 | 62 | 3 | 19 | 28 |
| PLBART | 3 | 49 | 84 | 19 | 89 | 129 | 12 | 51 | 78 |
| TufanoT5 | 16 | 88 | 122 | 16 | 85 | 113 | 2 | 33 | 55 |
| AutoTransform | 13 | 38 | 50 | 26 | 64 | 92 | 3 | 33 | 46 |
Similar to RQ1, we evaluate our approach and the variants
of our approach in the time-wise scenario. We choose the
model that achieves the lowest loss on the validation set, and
measure PM on the test set. In addition, we calculate the
percentage change of PM as

(PM_ours − PM_variant) / PM_variant × 100%

where PM_variant is the PM of a variant of our approach
explained above, and PM_ours is the PM of our approach.
Result. Using the token-level code difference information
can increase PM by 17%-82% when compared to not
using the token-level code difference information. Table IV
shows the PM of our approach and the CodeT5 [15] model across
the three datasets (i.e., Google, Ovirt and Android) for
beam sizes of 1, 5 and 10. The table shows that when
the token-level code difference information is used together
with CodeT5, at a beam size of 1, PM increases by 243%
((48−14)/14; for Ovirt) to 1,380% ((74−5)/5; for Google). For
beam sizes of 5 and 10, PM increases by 35% ((70−52)/52;
for Android) to 132% ((165−71)/71; for Google), and by 17%
((101−86)/86; for Android) to 82% ((202−111)/111; for Google),
respectively.
The results highlight the performance improvement (in terms
of PM) of incorporating the token-level code difference
information when transforming minitial to mapproved.
Using CodeT5 can increase PM by at least 29% when
compared to the other pre-trained code models. Table V
shows that when CodeT5 is used instead of CodeBERT [20],
PM increases by at least 54% ((74−48)/48; for Google),
117% ((165−76)/76; for Google), and 159% ((202−78)/78; for
Google) for beam sizes of 1, 5, and 10, respectively. Simi-
larly, when CodeT5 is used instead of GraphCodeBERT [21],
PM increases by at least 64% ((74−45)/45; for Google),
99% ((165−83)/83; for Google), and 120% ((202−92)/92; for Google)
for beam sizes of 1, 5, and 10, respectively. Lastly, when
CodeT5 is used instead of PLBART [22], we find that PM
increases by at least 153% ((48−19)/19; for Ovirt), 37%
((70−51)/51; for Android), and 29% ((101−78)/78; for Android)
for beam sizes of 1, 5, and 10, respectively. The results indicate
that CodeT5 contributes to the performance improvement (in
terms of PM) over the other pre-trained code models when
transforming minitial+diff to mapproved.
Using CodeT5 can increase PM by at least 85% and
66% when compared to the existing code transformation
approaches for code review (i.e., AutoTransform [7] and
TufanoT5 [10], respectively). Table V shows that when
CodeT5 is used instead of TufanoT5, PM increases by at
least 200% ((48−16)/16; for Ovirt), 88% ((165−88)/88; for Google),
and 66% ((202−122)/122; for Google) for beam sizes of 1, 5, and
10, respectively. A similar observation can be made when
CodeT5 is used instead of AutoTransform. In particular, we
find that PM increases by at least 85% ((48−26)/26; for Ovirt),
112% ((70−33)/33; for Android), and 120% ((101−46)/46; for Android)
for beam sizes of 1, 5, and 10, respectively. The results indicate
that CodeT5 contributes to the performance improvement (in
terms of PM) over the existing code transformation approaches
when transforming minitial+diff to mapproved.
VI. DISCUSSION
A. Implications
Researchers should consider temporal information when
evaluating their approaches for code review. The reason
is that the results of RQ1 show
that the PM of the existing code transformation approaches for
code review decreases by at least 57% (for TufanoT5 [10]) and
92% (for AutoTransform [7]) under the time-wise scenario.
In reality, it is unlikely that
patches that are submitted later would be available for
training a model, since patches are chronologically submitted
to a code review platform. Thus, when conducting an experi-
ment, patches should be sorted in chronological order before
evaluating proposed approaches for code review.
B. The Performance of Our D-ACT on the Methods that Never
Exist in a Codebase
In our studied dataset, there is a total of 2,139 minitial in
the test sets that never exist in a codebase (764 for Android,
253 for Google, 1,122 for Ovirt). When we manually analyze
the minitial that are correctly transformed to mapproved, we
find that our approach can correctly transform only 2 out of 253
(for Google) and 11 out of 1,122 (for Ovirt) minitial that never
exist in a codebase. The results imply that our
approach is still far from perfect when transforming newly
added minitial to mapproved. Thus, future work should develop
an approach that can achieve higher accuracy for newly added
methods.
C. The Model Complexity between D-ACT and TufanoT5
Model complexity plays a key role for practitioners
when deciding to deploy a model in practice. Generally, highly
complex models are likely to consume a large amount of memory,
which requires premium GPU resources for deployment. Many
studies [49, 58] show that larger language models
are likely to be more accurate, whereas they are also likely to be
more complex than others. Thus, we perform an
additional analysis to investigate the model complexity of our
approach when compared to TufanoT5. To do so, we compute
the total number of model parameters obtained from the
parameters function provided by the PyTorch library. We
find that our D-ACT has approximately 222.8M parameters,
while TufanoT5 has 60.5M parameters. While our model is
more complex than TufanoT5, it is also more accurate.
Nevertheless, our model can still fit within a
commodity GPU (i.e., we used an RTX 3090 in our experiment),
which does not require premium GPU resources like
NVIDIA DGX servers. Thus, our approach is still practical.
VII. THREATS TO VALIDITY
In this section, we disclose the threats to the validity of our
study.
Threats to Construct Validity. The threats to construct
validity relate to the data filtering process. As explained in
Section IV, we perform data filtering to reduce the noise
in our dataset (e.g., removing duplicated triplets). However,
unknown noise may still exist in our filtered dataset. Thus, future
work should investigate our dataset to discover unknown noise
and explore its impact on code transformation approaches.
Threats to Internal Validity. The threats to internal
validity relate to the hyper-parameter settings that we use in our
experiment. We do not explore all possible hyper-parameter
settings when conducting the experiment. In addition, similar
to Feng et al. [20], we use a constant learning rate instead
of using warmup to adjust the learning rate during the training
phase. Thus, the obtained results may differ from the reported
results if the hyper-parameter settings are changed or warmup is
used. However, finding the best hyper-parameter setting for
our approach is very expensive, and is not the primary goal
of this paper. In contrast, the primary goal of our work is to
fairly evaluate our approach and the baselines. To mitigate this
threat, we provide details of the hyper-parameter settings in the
replication package to aid future replication studies.
Threats to External Validity. The threats to external
validity relate to the generalizability of our approach. In
our study, we evaluate our approach on three large-scale
software projects on Gerrit (i.e., Google, Ovirt and Android).
In addition, our study focuses on transforming methods in
patches written in Java. However, the results presented in
our study may not generalize to other software projects,
other programming languages, or changes that occur outside
methods. Thus, future work can explore other software projects
that are implemented in Java, other programming languages,
or changes that occur outside methods.
VIII. CONCLUSION
In this work, we highlight the importance of integrating
token-level code difference information when designing a
code transformation approach. We then proposed our D-ACT,
which leverages the token-level code difference information
and CodeT5 [15]. In addition, the evaluations of the prior
studies did not consider the chronological order of the data,
which is not realistic. Therefore, we conducted our experiment
by considering the chronological order of the data.
Through an empirical evaluation on three large-scale soft-
ware projects on Gerrit (i.e., Google, Ovirt and Android), we
found that (1) our D-ACT can achieve a PM at least 62%
higher than TufanoT5 [10] and at least 274% higher than
AutoTransform [7]; (2) the PM of the existing code transforma-
tion approaches decreases by at least 57% (for TufanoT5) and
92% (for AutoTransform) when they are evaluated in the
time-wise scenario; and (3) the token-level code difference
information of our approach can increase PM by 17% -
82% when compared to not using the token-level code
difference information. These results suggest that our approach
can improve upon the existing code transformation approaches, and
that the token-level code difference information
can help improve the performance of code transformation.
ACKNOWLEDGEMENT
Chakkrit Tantithamthavorn was supported by the Aus-
tralian Research Council's Discovery Early Career Researcher
Award (DECRA) funding scheme (DE200100941). Patanamon
Thongtanunam was supported by the Australian Research
Council's Discovery Early Career Researcher Award (DE-
CRA) funding scheme (DE210101091).
REFERENCES
[1] J. Lipcak and B. Rossi, A large-scale study on source
code reviewer recommendation, in Proceedings of
SEAA, 2018, pp. 378–387.
[2] X. Han, A. Tahir, P. Liang, S. Counsell, K. Blincoe, B. Li,
and Y. Luo, “Code smells detection via modern code
review: A study of the openstack and qt communities,
EMSE, pp. 1–42, 2022.
[3] E. A. AlOmar, M. Chouchen, M. W. Mkaouer, and
A. Ouni, “Code review practices for refactoring changes:
An empirical study on openstack,” arXiv preprint
arXiv:2203.14404, 2022.
[4] A. Bacchelli and C. Bird, “Expectations, Outcomes, and
Challenges of Modern Code Review, in Proceedings of
ICSE, 2013, pp. 712–721.
[5] L. MacLeod, M. Greiler, M.-A. Storey, C. Bird, and
J. Czerwonka, “Code reviewing in the trenches: Chal-
lenges and best practices,” IEEE Software, pp. 34–42,
2017.
[6] P. C. Rigby and C. Bird, “Convergent Contemporary
Software Peer Review Practices, in Proceedings of ES-
EC/FSE, 2013, pp. 202–212.
[7] P. Thongtanunam, C. Pornprasit, and C. Tantithamtha-
vorn, Autotransform: Automated code transformation to
support modern code review process, in Proceedings of
ICSE, 2022, pp. 237–248.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin,
Attention is All You Need,” in Proceedings of NIPS,
2017, pp. 5999–6009.
[9] R. Sennrich, B. Haddow, and A. Birch, “Neural Machine
Translation of Rare Words with Subword Units, in
Proceedings of ACL, 2016, pp. 1715–1725.
[10] R. Tufano, S. Masiero, A. Mastropaolo, L. Pascarella,
D. Poshyvanyk, and G. Bavota, “Using pre-trained mod-
els to boost code review automation, in Proceedings of
ICSE, 2022, p. 2291–2302.
[11] M. Beller, A. Bacchelli, A. Zaidman, and E. Juergens,
“Modern code reviews in open-source projects: Which
problems do they ﬁx?” in Proceedings of MSR, 2014,
pp. 202–211.
[12] G. Xie, J. Chen, and I. Neamtiu, “Towards a better
understanding of software evolution: An empirical study
on open source software, in Proceedings of ICSM, 2009,
pp. 51–60.
[13] R. Tufan, L. Pascarella, M. Tufanoy, D. Poshyvanykz,
and G. Bavota, “Towards automating code review activ-
ities,” in Proceedings of ICSE, 2021, p. 1479–1482.
[14] M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, and
D. Poshyvanyk, “On learning meaningful code changes
via neural machine translation,” in Proceedings of ICSE,
2019, pp. 25–36.
[15] Y. Wang, W. Wang, S. Joty, and S. C. Hoi,
“Codet5: Identiﬁer-aware uniﬁed pre-trained encoder-
decoder models for code understanding and generation,”
in Proceedings of EMNLP, 2021, pp. 8696–8708.
[16] M. Tan, L. Tan, S. Dara, and C. Mayeux, “Online defect
prediction for imbalanced data,” in Proceedings of ICSE,
2015, pp. 99–108.
[17] C. Pornprasit and C. K. Tantithamthavorn, “Jitline: A
simpler, better, faster, ﬁner-grained just-in-time defect
prediction,” in Proceedings of MSR, 2021, pp. 369–379.
[18] M. Jimenez, R. Rwemalika, M. Papadakis, F. Sarro,
Y. Le Traon, and M. Harman, “The importance of
accounting for real-world labelling when predicting
software vulnerabilities, in Proceedings of ESEC/FSE,
2019, pp. 695–705.
[19] Y. Liu, C. Tantithamthavorn, L. Li, and Y. Liu, “Ex-
plainable ai for android malware detection: Towards
understanding why the models perform so well?” in
Proceedings of ISSRE, 2022, pp. 169–180.
[20] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong,
L. Shou, B. Qin, T. Liu, D. Jiang et al., “Codebert: A pre-
trained model for programming and natural languages,”
arXiv preprint arXiv:2002.08155, 2020.
[21] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou,
N. Duan, A. Svyatkovskiy, S. Fu et al., “Graphcodebert:
Pre-training code representations with data ﬂow,” arXiv
preprint arXiv:2009.08366, 2020.
[22] W. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang,
“Uniﬁed pre-training for program understanding and
generation,” in Proceedings of NAACL, 2021, pp. 2655–
2668.
[23] "A Replication Package," https://github.com/awsm-research/D-ACT-Replication-Package.
[24] C. Sadowski, E. Söderberg, L. Church, M. Sipko, and
A. Bacchelli, "Modern Code Review: A Case Study at
Google," in Proceedings of ICSE (Companion), 2018, pp.
181–190.
[25] Y. Fan, X. Xia, D. Lo, and S. Li, “Early Prediction of
Merged Code Changes to Prioritize Reviewing Tasks,”
EMSE, vol. 23, no. 6, pp. 3346–3393, 2018.
[26] T. Baum, K. Schneider, and A. Bacchelli, “Associating
Working Memory Capacity and Code Change Ordering
with Code Review Performance, EMSE, vol. 24, no. 4,
pp. 1762–1798, 2019.
[27] D. Lin, C. Tantithamthavorn, and A. E. Hassan, “The
impact of data merging on the interpretation of cross-
project just-in-time defect models,” IEEE Transactions
on Software Engineering, 2021.
[28] C. Khanan, W. Luewichana, K. Pruktharathikoon,
J. Jiarpakdee, C. Tantithamthavorn, M. Choetkiertikul,
C. Ragkhitwetsagul, and T. Sunetnanta, “Jitbot: an ex-
plainable just-in-time defect prediction bot,” in Proceed-
ings of the 35th IEEE/ACM international conference on
automated software engineering, 2020, pp. 1336–1339.
[29] S. Wattanakriengkrai, P. Thongtanunam, C. Tan-
tithamthavorn, H. Hata, and K. Matsumoto, “Predicting
defective lines using a model-agnostic technique, IEEE
Transactions on Software Engineering, 2020.
[30] C. Pornprasit, C. Tantithamthavorn, J. Jiarpakdee, M. Fu,
and P. Thongtanunam, “Pyexplainer: Explaining the pre-
dictions of just-in-time defect models,” in Proceedings
of the International Conference on Automated Software
Engineering (ASE). IEEE, 2021, pp. 407–418.
[31] M. Fu and C. Tantithamthavorn, “Linevul: A transformer-
based line-level vulnerability prediction, in 2022
IEEE/ACM 19th International Conference on Mining
Software Repositories (MSR), 2022, pp. 608–620.
[32] C. Pornprasit and C. Tantithamthavorn, “Deeplinedp:
Towards a deep learning approach for line-level defect
prediction,” IEEE Transactions on Software Engineering,
2022.
[33] V. Balachandran, “Reducing human effort and improv-
ing quality in peer code reviews using automatic static
analysis and reviewer recommendation, in Proceedings
of ICSE. IEEE, 2013, pp. 931–940.
[34] C. Hannebauer, M. Patalas, S. Stünkel, and V. Gruhn,
"Automatically recommending code reviewers based on
their expertise: An empirical comparison," in Proceed-
ings of ASE, 2016, pp. 99–110.
[35] P. Thongtanunam, C. Tantithamthavorn, R. G. Kula,
N. Yoshida, H. Iida, and K.-i. Matsumoto, “Who should
review my code? a ﬁle location-based code-reviewer
recommendation approach for modern code review, in
Proceedings of SANER, 2015, pp. 141–150.
[36] M. B. Zanjani, H. Kagdi, and C. Bird, “Automatically
recommending peer reviewers in modern code review,
TSE, pp. 530–543, 2015.
[37] W. H. A. Al-Zubaidi, P. Thongtanunam, H. K. Dam,
C. Tantithamthavorn, and A. Ghose, “Workload-aware
reviewer recommendation using a multi-objective search-
based approach,” in Proceedings of the 16th ACM In-
ternational Conference on Predictive Models and Data
Analytics in Software Engineering, 2020, pp. 21–30.
[38] Y. Hong, C. Tantithamthavorn, P. Thongtanunam, and
A. Aleti, “Commentﬁnder: a simpler, faster, more ac-
curate code review comments recommendation, in Pro-
ceedings of the 30th ACM Joint European Software Engi-
neering Conference and Symposium on the Foundations
of Software Engineering, 2022, pp. 507–519.
[39] Y. Hong, C. Tantithamthavorn, and P. Thongtanunam,
“Where should i look at? recommending lines that re-
viewers should pay attention to, in Proceedings of the
International Conference on Software Analysis, Evolu-
tion and Reengineering (SANER). IEEE, 2022, pp.
1034–1045.
[40] A. Bosu and J. C. Carver, “Impact of peer code review
on peer impression formation: A survey,” in Proceedings
of ESEM, 2013, pp. 133–142.
[41] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring
the limits of transfer learning with a uniﬁed text-to-text
transformer, JMLR, pp. 1–67, 2020.
[42] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu,
H. Xiong, and Q. He, A comprehensive survey on
transfer learning,” Proceedings of the IEEE, pp. 43–76,
2020.
[43] Y. Liu, C. Tantithamthavorn, Y. Liu, P. Thongta-
nunam, and L. Li, Autoupdate: Automatically recom-
mend code updates for android apps,” arXiv preprint
arXiv:2209.07048, 2022.
[44] V. Frick, T. Grassauer, F. Beck, and M. Pinzger, “Gen-
erating accurate and compact edit scripts using tree
differencing, in Proceedings of ICSME, 2018, pp. 264–
274.
[45] Z. Liu, X. Xia, D. Lo, M. Yan, and S. Li, “Just-in-time
obsolete comment detection and update,” TSE, pp. 1–23,
2021.
[46] R.-M. Karampatsis, H. Babii, R. Robbes, C. Sutton, and
A. Janes, “Big code!= big vocabulary: Open-vocabulary
models for source code,” in Proceedings of ICSE, 2020,
pp. 1073–1085.
[47] M. Fu, C. Tantithamthavorn, T. Le, V. Nguyen, and
D. Phung, “Vulrepair: a t5-based automated software
vulnerability repair, in Proceedings of the 30th ACM
Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering
(ESEC/FSE), 2022, pp. 935–947.
[48] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and
M. Brockschmidt, “Codesearchnet challenge: Evaluat-
ing the state of semantic code search,” arXiv preprint
arXiv:1909.09436, 2019.
[49] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova,
“Bert: Pre-training of deep bidirectional transformers
for language understanding,” arXiv preprint
arXiv:1810.04805, 2018.
[50] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normal-
ization,” arXiv preprint arXiv:1607.06450, 2016.
[51] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual
learning for image recognition,” in Proceedings of CVPR,
2016, pp. 770–778.
[52] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue,
A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al.,
“Huggingface’s transformers: State-of-the-art natural lan-
guage processing,” arXiv preprint arXiv:1910.03771,
2019.
[53] I. Loshchilov and F. Hutter, “Decoupled weight decay
regularization, arXiv preprint arXiv:1711.05101, 2017.
[54] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: A
method for automatic evaluation of machine translation,
in Proceedings of ACL, 2002, pp. 311–318.
[55] Y. Ding, B. Ray, P. Devanbu, and V. J. Hellendoorn,
“Patching as Translation: The Data and the Metaphor,
in Proceedings of ASE, 2020, pp. 275–286.
[56] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sun-
daresan, M. Zhou, A. Blanco, and S. Ma, “Codebleu:
a method for automatic evaluation of code synthesis,
arXiv preprint arXiv:2009.10297, 2020.
[57] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mo-
hamed, O. Levy, V. Stoyanov, and L. Zettlemoyer,
“BART: Denoising sequence-to-sequence pre-training for
natural language generation, translation, and comprehen-
sion,” in Proceedings of ACL, 2020, pp. 7871–7880.
[58] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
I. Sutskever et al., “Language models are unsupervised
multitask learners,” OpenAI blog, p. 9, 2019.