AutoTransform: Automated Code Transformation to Support
Modern Code Review Process
Patanamon Thongtanunam
patanamon.t@unimelb.edu.au
The University of Melbourne
Australia
Chanathip Pornprasit
chanathip.pornprasit@monash.edu
Monash University
Australia
Chakkrit Tantithamthavorn
chakkrit@monash.edu
Monash University
Australia
ABSTRACT
Code review is effective, but human-intensive (e.g., developers need to manually modify source code until it is approved). Recently, prior work proposed a Neural Machine Translation (NMT) approach to automatically transform source code to the version that is reviewed and approved (i.e., the after version). Yet, its performance is still suboptimal when the after version has new identifiers or literals (e.g., renamed variables) or has many code tokens. To address these limitations, we propose AutoTransform which leverages a Byte-Pair Encoding (BPE) approach to handle new tokens and a Transformer-based NMT architecture to handle long sequences. We evaluate our approach based on 14,750 changed methods with and without new tokens for both small and medium sizes. The results show that when generating one candidate for the after version (i.e., beam width = 1), our AutoTransform can correctly transform 1,413 changed methods, which is 567% higher than the prior work, highlighting the substantial improvement of our approach for code transformation in the context of code review. This work contributes towards automated code transformation for code reviews, which could help developers reduce their effort in modifying source code during the code review process.
ACM Reference Format:
Patanamon Thongtanunam, Chanathip Pornprasit, and Chakkrit Tantithamthavorn. 2022. AutoTransform: Automated Code Transformation to Support Modern Code Review Process. In 44th International Conference on Software Engineering (ICSE '22), May 21–29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510003.3510067
1 INTRODUCTION
Code review is one of the important quality assurance practices in a software development process. One of the main goals of code review is to ensure that the quality of newly developed code meets a standard before integrating into the main software repository [11, 37]. Hence, in the code review process, new source code written by a code author has to be examined and revised until reviewers (i.e., developers other than a code author) agree that this source code is of
sucient quality to be integrated (i.e., the code is approved). Several
studies also showed that during the code review process, source
code is revised not only to remove functional defects [
15
,
17
,
40
,
47
],
but also to improve design quality [
41
], maintainability [
50
], and
readability [16, 50, 57].
Despite the benefits of code review, the code review process is still human-intensive where developers have to manually review and revise code. To support reviewing activities, prior studies proposed approaches to save the reviewers' effort (e.g., review prioritization [14, 23, 55]). However, the revising activities still require manual effort from code authors. Indeed, given a large number of reviews (e.g., 3K reviews per month at Microsoft Bing [47]), prior studies found that it is challenging for code authors to revise their code without introducing new defects, while switching contexts and keeping track of other reviews [21, 32]. Thus, automated code transformation would be beneficial to augment code authors and to save their effort by automatically applying the common revisions during code reviews in the past to the newly-written code, while allowing code authors to focus on revising more complex code.
To the best of our knowledge, the work of Tufano et al. [60] is the most recent work that proposed an approach to automatically transform code to the version that is reviewed and approved. In the prior work, they leveraged code abstraction and a Recurrent Neural Network (RNN) architecture. They have shown that their NMT approach can correctly transform code by applying a wide range of meaningful code changes including refactoring and bug-fixing. While the results of the prior work [60] highlighted the potential of using NMT to help code authors automatically modify code during the code review process, the applicability of their approach is still limited. Specifically, their code abstraction hinders the Tufano et al. approach in transforming source code when new identifiers or literals appear in the version that has been approved. In addition, due to the nature of the RNN architecture, the Tufano et al. approach may be suboptimal when the source code is longer (i.e., the source code has many tokens).
In this paper, we propose a new framework called AutoTransform, which leverages Byte-Pair Encoding (BPE) subword tokenization [52] and Transformer [62] to address the limitations of the prior approach [60]. We evaluated our AutoTransform based on two types of changed methods: (1) the changed methods without newly-introduced identifiers/literals (i.e., changed methods w/o new tokens); and (2) the changed methods with newly-introduced identifiers/literals (i.e., the changed methods w/ new tokens). Through a case study of 147,553 changed methods extracted from three Gerrit code review repositories of Android, Google, and Ovirt, we addressed the following research questions:
(RQ1) Can AutoTransform transform code better than the Tufano et al. approach?
Results. When generating one candidate of the after version for 14,750 changed methods in the testing datasets, our AutoTransform can correctly transform source code for 1,413 changed methods, which is 567% higher than the Tufano et al. approach. Specifically, for the changed methods w/ new tokens, our AutoTransform can correctly transform 1,060 methods, while the Tufano et al. approach could not correctly transform any of the changed methods w/ new tokens. For the changed methods w/o new tokens, our AutoTransform can correctly transform 353 changed methods, while the Tufano et al. approach can correctly transform only 212 changed methods.
(RQ2) What are the contributions of AutoTransform's components?
Results. Using subword tokenization by BPE enables our AutoTransform to achieve perfect predictions 284% higher than using code abstraction like the Tufano et al. approach. Furthermore, using the Transformer architecture can increase perfect predictions at least by 17%, compared to using the RNN architecture. In particular, we found that the percentage improvement in perfect predictions is much higher for the medium methods (i.e., 183% - 507%).
Significance & Contributions. The results of our work highlight the substantial improvement of our approach for code transformation in the context of code review. More specifically, our AutoTransform can transform a wider range of changed methods (i.e., methods with new tokens and methods with long sequences) than the state-of-the-art approach [60]. The proposed approach, results, and insights presented in this paper contribute towards automated code transformation for code reviews, which could help developers reduce their effort in modifying source code and expedite the code review process.
Novelty. This paper is the first to present:
• AutoTransform, i.e., a Transformer-based NMT approach to transform source code from the version before the implementation of code changes to the version that is reviewed and eventually merged.
• The use of Byte-Pair Encoding (BPE) subword tokenization to address the limitation of transforming changed methods that have newly-introduced identifiers/literals.
• An empirical evaluation of our AutoTransform and the Tufano et al. approach [60] based on a large-scale dataset with two types of changed methods.
• An ablation study to quantify the contributions of the two components (i.e., BPE and Transformer) in our proposed approach.
Open Science. To facilitate future work, the datasets of 147,553 extracted changed methods, scripts of our AutoTransform, and experimental results (e.g., raw predictions) are available online [4].
Paper Organization. Section 2 presents and discusses the limitations of the state-of-the-art approach [60]. Section 3 presents our AutoTransform. Section 4 describes our case study design. Section 5 presents the results, while Section 6 discusses the results. Section 7 discusses related work. Section 8 discusses possible threats to the validity. Section 9 draws the conclusions.
2 THE STATE-OF-THE-ART CODE
TRANSFORMATION FOR CODE REVIEWS
Recently, Tufano et al. [60] proposed a Neural Machine Translation (NMT) approach to automatically transform the source code of the before version of a method (i.e., the version before the implementation of code changes) to its after version (i.e., the version after the code changes are reviewed and merged). Broadly speaking, Tufano et al. (1) performed code abstraction to reduce vocabulary size by replacing actual identifiers/literals with reusable IDs and (2) built an NMT model based on a Recurrent Neural Network (RNN) architecture to transform the token sequence of the before version to the token sequence of the after version. Their approach was evaluated based on the changed methods that were extracted from three Gerrit code review repositories, namely Android [3], Google [6], and Ovirt [7]. We briefly describe their approach below.
(Step 1) Code Abstraction: Since NMT models are likely to be inaccurate and slow when dealing with a large vocabulary size [26], Tufano et al. proposed to abstract code tokens into reusable IDs to reduce the vocabulary size. More specifically, for both before and after versions (m_b, m_a) of each changed method, Tufano et al. replaced identifiers (i.e., type, method, and variable names) and literals (i.e., int, double, char, string values) with a reusable ID. A reusable ID means that an ID is allowed to be reused across different changed methods (e.g., the first variable appearing in a changed method will always be replaced with VAR_1). At the end of this step, from the original source code of (m_b, m_a), the abstracted code sequences (am_b, am_a) were obtained. In addition, for each changed method, a map M(am_b) was also generated, which will be used to map the IDs back to the actual identifiers/literals.
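To make the abstraction step concrete, the following is a minimal sketch of reusable-ID abstraction, assuming a single VAR_n category for all identifiers/literals (the actual src2abs tool distinguishes type, method, variable, and literal IDs); it is for illustration only, not the implementation used by Tufano et al.

```python
# A minimal sketch of code abstraction with reusable IDs (not the actual
# src2abs implementation); identifiers/literals are replaced in order of
# first appearance, and the map M allows translating IDs back.
import re

JAVA_KEYWORDS = {"public", "void", "final", "return", "if", "else", "new"}  # truncated for brevity

def abstract_method(tokens):
    mapping = {}          # actual token -> reusable ID
    abstracted = []
    for tok in tokens:
        if tok in JAVA_KEYWORDS or re.fullmatch(r"[;(){},.]", tok):
            abstracted.append(tok)              # keep keywords and punctuation
        else:
            if tok not in mapping:              # assign the next reusable ID
                mapping[tok] = f"VAR_{len(mapping) + 1}"
            abstracted.append(mapping[tok])
    # M maps IDs back to the actual identifiers/literals
    M = {v: k for k, v in mapping.items()}
    return abstracted, M

tokens = "public void onSuccess ( final result ) { createCommentEditor ( suggestRow ) ; }".split()
abs_tokens, M = abstract_method(tokens)
print(abs_tokens)   # ['public', 'void', 'VAR_1', '(', 'final', 'VAR_2', ')', ...]
print(M["VAR_1"])   # 'onSuccess'
```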
(Step 2) Build an NMT model: To build an NMT model, Tufano et al. used a Recurrent Neural Network (RNN) Encoder-Decoder architecture with the attention mechanism [12, 18, 36, 53]. Given the abstracted code sequences (am_b, am_a) from Step 1, an RNN Encoder will learn these sequences by estimating the conditional probability p(y_1, ..., y_t' | x_1, ..., x_t), where x_1, ..., x_t is the sequence of am_b and y_1, ..., y_t' is the sequence of am_a, while the sequence lengths t and t' may differ. A bi-directional RNN Encoder [18] was used to learn the abstracted code sequence of the before version am_b from left-to-right and right-to-left when creating sequence representations. An RNN Decoder was then used to estimate the probability for each token y_i in the abstracted code sequence of the after version am_a based on the recurrent state s_i, the previous tokens y_1..i, and a context vector c_i. The vector c_i is the attention vector which is computed as a weighted average of the hidden states, allowing the RNN model to pay attention to particular parts of am_b when predicting token y_i for am_a.
(Step 3) Generate predictions: Given the abstracted code sequence of the before version of an unseen method (i.e., a testing instance) am^t_b, the RNN model in the Tufano et al. approach generated the abstracted code sequence of the after version am^t_a. A beam search strategy was used to obtain k sequence candidates for am^t_a. Since the generated output sequences are the abstracted code sequences, Tufano et al. replaced the IDs found in am^t_a back to the actual identifiers/literals based on the map M(am^t_b).
Limitations. Although the study of Tufano et al. shows that their NMT approach can transform source code with meaningful
[Figure 1 shows the original and abstracted/subword-tokenized code of an example changed method in which the variable side in the before version is renamed to sidePanel in the after version. Under the Tufano et al. approach (code abstraction + RNN), sidePanel is abstracted to a new ID (VAR_4) that cannot be mapped back to an actual identifier; under our AutoTransform (subword tokenization + Transformer), the generated sequence constructs sidePanel from the subwords side@@ and Panel.]
Figure 1: A motivating example for an unknown identifier/literal for the newly-introduced abstracted token (Limitation 1).
Table 1: The number of changed methods with and without
new tokens in the after version.
Dataset Method size w/o New Tokens w/ New Tokens
Android Small 4,437 (18%) 20,640 (82%)
Medium 4,593 (16%) 24,547 (84%)
Google Small 2,289 (20%) 9,070 (80%)
Medium 2,833 (20%) 11,624 (80%)
Ovirt Small 4,734 (17%) 23,282 (83%)
Medium 6,229 (16%) 33,275 (84%)
The details of the data preparation are provided in Section 4.2.
code changes (e.g., bug-fix, refactoring) [60], their approach still has the following limitations.
Limitation 1: Unknown identifiers/literals for the new tokens appearing in the after version. We hypothesize that the Tufano et al. approach may not be able to correctly transform code when new tokens appear in the after version but did not appear in the before version. For such a case, the code abstraction approach with reusable IDs of Tufano et al. cannot map the ID of a new token back to the actual identifiers or actual literals.
Indeed, it is possible that developers introduced new tokens that did not appear in the before version. Figure 1 illustrates a motivating example for Limitation 1. As shown in the example, the side variable is changed to sidePanel. Thus, the sidePanel variable is a new token appearing in the after version. By the code abstraction with reusable IDs of Tufano et al., a new ID will be assigned to sidePanel, i.e., VAR_4 = sidePanel. Note that this limitation is different from the Out-Of-Vocabulary (OOV) problem [31] since the Tufano et al. approach may still be able to predict the correct ID (i.e., VAR_4) for the after version am_a instead of assigning a special token (e.g., <UNK>). However, this VAR_4 cannot be realistically mapped back to the actual identifier (i.e., sidePanel) since it does not appear in the before version m_b nor its mapping M(am_b).
Moreover, when we analyze the changed methods in the datasets of Tufano et al. [60], we find that as much as 80%-84% of the changed methods have new tokens appearing in the after version (see Table 1). However, Tufano et al. exclude these changed methods with new tokens, while using only the remaining 16%-20% of the changed methods without new tokens for their experiment [60]. Thus, the Tufano et al. approach may be applicable to only a small set of changed methods, limiting its applicability in real-world contexts.
Limitation 2: Suboptimal performance when the sequences become longer. Prior studies have raised a concern that the performance of the RNN architecture will be suboptimal when sequences become longer [12, 36]. Although Tufano et al. leveraged the attention mechanism to handle changed methods that have long sequences, a recent study by Ding et al. [22] noted that such an attention mechanism for the RNN architecture still has difficulties in remembering long-term dependencies between tokens in a sequence. In particular, the attention mechanism only computes attention weights based on the final hidden states of the RNN architecture, instead of using any given intermediate states from the encoder, causing the RNN model to forget tokens seen long ago. Thus, we hypothesize that the performance of the Tufano et al. approach which is based on the RNN architecture may be suboptimal for the changed methods with long sequences.
3 AUTOTRANSFORM
In this section, we present our AutoTransform, a Neural Machine Translation (NMT) approach that can transform code (1) when new tokens appear in the after version; and (2) when code sequences become longer. AutoTransform leverages a Byte-Pair-Encoding (BPE) approach [52] to handle new tokens that may appear in the after version; and leverages a novel NMT architecture called Transformer [62] to address the suboptimal performance of the RNN architecture used in the Tufano et al. approach.
Overview. Figure 2 provides an overview of our AutoTransform approach, which consists of two stages: training and inference. During the training stage, our AutoTransform performs two main steps. In Step 1, we perform subword tokenization on the original source code of the before and after versions of a changed method (m_b, m_a) using BPE, which produces subword sequences (sm_b, sm_a). In Step 2, we use these subword sequences (sm_b, sm_a) to train a Transformer model. The inference stage is for transforming the before version (m^t_b) of given methods (i.e., testing data) to the after version (m^t_a). To do so, we perform subword tokenization on the before version in Step 1b to produce a subword sequence sm^t_b. Then, in Step 3, we use the Transformer NMT model to generate a prediction for the source code of the after version m^t_a. Below, we provide details for each step in our approach.
[Figure 2 depicts the training stage (Step 1: subword tokenization by BPE, i.e., generating merge operations from the before and after versions in the training data and applying them to produce the subword sequences sm_b and sm_a; Step 2: learning code transformation with a Transformer encoder-decoder, including embedding and positional encoding, multi-head self-attention, and feed-forward layers) and the inference stage (Step 1b: applying the merge operations to the before version in the testing data; Step 3: generating the target sequence from the word probabilities and converting the subwords back to code to obtain the predicted after version m_a).]
Figure 2: An overview of our AutoTransform.
3.1 Subword Tokenization
To reduce the vocabulary size and handle new tokens that may appear in the after version, we perform subword tokenization using Byte-Pair-Encoding (BPE) [52]. BPE is a tokenization approach that splits a word (i.e., a code token) into a sequence of frequently occurring subwords. For example, a code token column in Figure 1 will be split into a sequence of col@@ and umn, where @@ is a subword separator. When using BPE, the vocabulary will mainly contain subwords, which occur more frequently than the whole code tokens. Prior studies also have shown that BPE can address the large vocabulary size more effectively than other tokenization approaches (e.g., camel-case splitting) [31]. Moreover, BPE will enable our approach to address the new token problem by allowing the NMT model to generate a new code token based on a combination of subwords existing across all methods in the training data. For example, the code token sidePanel which is introduced in the after version (see Figure 1) can be constructed based on a combination of side@@ and Panel if these two subwords exist in the training data.
The subword tokenization consists of two main steps (see Figure 2). Step 1a is for generating merge operations which are used to determine how a code token should be split. To generate merge operations, BPE will first split all code tokens into character sequences. Then, BPE generates a merge operation by identifying the most frequent symbol pair (e.g., the pair of two consecutive characters) that should be merged into a new symbol. For example, given that ('c', 'o') is the most frequent symbol pair in the corpus, the merge operation will be ('c', 'o') → 'co'. Then, BPE replaces all of the co-occurrences of ('c', 'o') with 'co' without removing 'c' or 'o' (which may still appear alone). After the previous merge operation is applied, BPE generates a new merge operation based on the frequency of current symbol pairs, e.g., ('co', 'l') → 'col'. This step is repeated until it reaches a given number of merge operations. To ensure the consistency of subword tokenization between the before and after versions of a changed method, we use joint BPE to generate merge operations based on the union of code tokens from both before and after versions. Note that we generated merge operations based on the training data.
In Step 1b, we apply merge operations to split code tokens into subwords. To do so, BPE will first split all code tokens into sequences of characters. Then, the generated merge operations are applied to the before and after versions in the training data. Note that the lower the number of merge operations applied, the smaller the size of the vocabulary is. We also apply the same list of merge operations to the before version in the testing data. Note that we did not split Java keywords since they are commonly used across changed methods.
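The following is a minimal, self-contained illustration of the two BPE steps described above (learning merge operations from a token corpus and applying them to split a token into '@@'-separated subwords); it is only a sketch and omits details of the subword-nmt implementation by Sennrich et al. that we actually use (e.g., word frequencies and end-of-word handling).

```python
# A minimal illustration of BPE (not the subword-nmt implementation used in
# the paper): learn merge operations from a token corpus, then apply them to
# split a token into subwords marked with the '@@' separator.
from collections import Counter

def merge_pair(seq, pair):
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1]); i += 2
        else:
            out.append(seq[i]); i += 1
    return tuple(out)

def learn_bpe(tokens, num_merges):
    # start from character sequences, count symbol pairs, merge the most frequent
    corpus = Counter(tuple(t) for t in tokens)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq, freq in corpus.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = Counter({merge_pair(seq, best): f for seq, f in corpus.items()})
    return merges

def apply_bpe(token, merges):
    seq = tuple(token)
    for pair in merges:                      # apply merges in the learned order
        seq = merge_pair(seq, pair)
    return [s + "@@" for s in seq[:-1]] + [seq[-1]]  # mark non-final subwords

merges = learn_bpe(["column", "colour", "side", "sidePanel", "Panel"], num_merges=10)
print(apply_bpe("column", merges))     # a list of '@@'-separated subwords (or the whole token if fully merged)
print(apply_bpe("sidePanel", merges))  # a new token can be built from subwords learned on the corpus
```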
3.2 Learning Code Transformation
To learn code transformation, we train a Transformer-based NMT model using the subword sequences (sm_b, sm_a) of the before and after versions. Unlike the RNN architecture, the Transformer architecture entirely relies on an attention mechanism without using the sequence-aligned RNNs or convolution networks [62], which allows the Transformer model to better pay attention to any set of tokens across arbitrarily long distances. Transformer uses a self-attention function to compute attention weights based on all tokens in a sequence, where attention weights indicate how each token is relevant to all other tokens in the sequence. This self-attention function enables the Transformer model to capture the contextual relationship between all tokens in the whole sequence instead of relying on the limited number of the final hidden states like the RNN architecture. In addition, instead of using a single self-attention, the Transformer architecture employs a multi-head self-attention, which calculates attention weights h times (where h is the number of heads) based on different parts of the input data.
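As a rough illustration of the mechanism described above, the sketch below computes scaled dot-product self-attention for h heads with NumPy; it is a simplification (e.g., it omits the output projection and layer normalization) and is not the Tensor2Tensor implementation used in AutoTransform.

```python
# A simplified sketch of (multi-head) scaled dot-product self-attention using
# NumPy, to illustrate how attention weights are computed over all positions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); each token attends to every token in the sequence
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) attention weights
    return softmax(scores) @ V

def multi_head_self_attention(X, heads):
    # heads: list of (Wq, Wk, Wv) triples, one per attention head (h heads in total)
    outputs = [self_attention(X, *w) for w in heads]
    return np.concatenate(outputs, axis=-1)    # concatenated head outputs

rng = np.random.default_rng(0)
d_model, d_head, h, seq_len = 512, 64, 8, 10
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(h)]
print(multi_head_self_attention(X, heads).shape)   # (10, 512)
```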
The Transformer architecture consists of two major components: an Encoder block which encodes a subword sequence into a vector representation, and a Decoder block which decodes the representation into another subword sequence. The Transformer performs the following four main steps (see Figure 2). First, in Step 2a, given a subword sequence of the before version sm_b = (s_1, ..., s_t), the Transformer embeds the tokens into vectors and uses positional encoding to add information about the token position in the sequence into the embedding vectors. Second, the embedding vectors are then fed into the Encoder block (Step 2b), which is composed of a stack of multiple Encoder layers (where N_e is the number of Encoder layers). Each layer consists of two sub-layers: a multi-head self-attention and a fully-connected feed-forward network (FFN), which compute attention weights and generate attention vectors. The attention vectors will be used to inform the Decoder block about which tokens should be paid attention to.
The subword sequence of the after version sm_a = (u_1, ..., u_t') is used as an input of the Decoder block. Similarly, in Step 2c, the subword sequence sm_a is embedded into vectors with the position information. Then, the embedding vectors are fed into the Decoder block (Step 2d) to generate an encoded representation of the target sequence. The Decoder block is also composed of a stack of multiple Decoder layers (where N_d is the number of Decoder layers). Each Decoder layer also consists of multi-head self-attention and FFN sub-layers. However, before the embedding vectors are fed into the multi-head self-attention sub-layer, the masked multi-head self-attention layer is used to ensure that only previous tokens (i.e., u_1, ..., u_{i-1}) are used in decoding for a current token u_i. After that, the multi-head self-attention with the subsequent FFN generates an encoded representation of the target sequence. The encoded representation is converted into word probabilities and the final output sequence using the linear and softmax functions. Finally, the loss function will compare this output sequence with the target sequence to estimate an error.
3.3 Generating Predictions
The inference stage aims to generate the after version for a given method based on the before version m^t_b in the testing data. Starting from the before version m^t_b, we perform subword tokenization by applying the list of the merge operations generated in the training stage to produce a subword sequence (sm^t_b) for the before version in Step 1b. Then, in Step 3, we use our Transformer model to generate the target sequence, i.e., the after version m^t_a. In particular, in Step 3a, given the subword sequence sm^t_b, the Transformer model will estimate a probability of each subword in the vocabulary at each position in the output sequence. Then, based on the generated subword probabilities, in Step 3b, we use the beam search approach [53] to generate the target sequence.
Broadly speaking, beam search generates the target sequence by selecting the best subword for each position in the output sequence based on the generated subword probabilities. In particular, beam search selects a subword for a current position based on the selected subwords in the previous positions. The beam width k is used to specify the number of previous positions that should be considered and the number of sequence candidates that will be generated. In other words, the selection of a subword for a position i is based on the conditional probabilities of selected subwords in the positions i−k to i−1. Finally, the beam search generates the best k sequence candidates for the target sequence.
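A minimal sketch of such a beam search is shown below; next_token_probs is a hypothetical stand-in for the trained Transformer that returns a probability for each subword at the next position, and the toy distribution is for illustration only.

```python
# A minimal beam search sketch: next_token_probs(prefix) is a hypothetical
# stand-in for the trained Transformer that returns a {token: probability}
# dict for the next position; the search keeps the k best partial sequences.
import math

def beam_search(next_token_probs, k, max_len, eos="</s>"):
    beams = [([], 0.0)]                        # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:         # finished sequences are kept as-is
                candidates.append((seq, score))
                continue
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        # keep the k highest-scoring candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams                               # k sequence candidates

# Toy next-token distribution (for illustration only)
toy = lambda prefix: {"side@@": 0.6, "Panel": 0.3, "</s>": 0.1}
for seq, score in beam_search(toy, k=3, max_len=4):
    print(seq, round(score, 2))
```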
The sequence candidates generated in Step 3b are subword sequences. Hence, in Step 3c, we convert these subword sequences back to code sequences. This step is simply performed by concatenating the subwords ending with '@@' with the subsequent subword. For example, the subwords ['col@@', 'umn'] are converted back to column.
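This conversion step can be sketched as follows (a simple illustration, assuming '@@' is only used as the subword separator):

```python
# A small sketch of converting a generated subword sequence back to code tokens
# by concatenating each subword ending with '@@' with the subsequent subword.
def subwords_to_tokens(subwords):
    tokens, current = [], ""
    for sw in subwords:
        if sw.endswith("@@"):
            current += sw[:-2]          # strip the separator and keep joining
        else:
            tokens.append(current + sw)
            current = ""
    return tokens

print(subwords_to_tokens(["col@@", "umn", ",", "side@@", "Panel"]))
# ['column', ',', 'sidePanel']
```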
4 CASE STUDY DESIGN
In this section, we present the motivation of our research questions,
data preparation, and experimental setup for our case study.
4.1 Research Questions
To evaluate our AutoTransform, we formulate the following two
research questions.
(RQ1) Can AutoTransform transform code better than the Tufano et al. approach?
Motivation. In this work, we proposed our AutoTransform to address the limitations of the state-of-the-art approach [60]. In particular, the goal of our AutoTransform is to allow an NMT model to better transform code (1) when new tokens appear in the after version; and (2) when the code sequence becomes longer. Hence, we set out this RQ to evaluate our AutoTransform based on the two aforementioned aspects.
(RQ2) What are the contributions of AutoTransform's components?
Motivation. To address the limitations of the state-of-the-art approach [60], we used two different techniques in our AutoTransform, i.e., (1) BPE [52] to reduce the vocabulary size and handle new tokens appearing in the after version and (2) Transformer [62] to learn code transformation. In this RQ, we set out to empirically evaluate the contribution of each technique to the performance of our AutoTransform, compared against the techniques used in the Tufano et al. approach.
4.2 Datasets
In this work, we obtain the datasets from the work of Tufano et al. [5]. The datasets consist of 630,858 changed methods which were extracted from 58,728 reviews across three Gerrit code review repositories, namely Android [3], Google [6], and Ovirt [7]. For each changed method, the datasets consist of the source code of the before and after versions, i.e., (m_b, m_a). For our experiment, we perform data preparation on these obtained datasets to classify and select the changed methods. Note that we did not use the filtered datasets that were used in the experiment of Tufano et al. [60], since their filtered datasets do not include the changed methods of which the after version contains a new token (which is considered in this work). Table 1 provides an overview of the studied datasets after the data preparation, which we describe in detail below.
Data Preparation. We first classify the changed methods into two types: (1) the changed methods of which the after version does not contain identifiers/literals beyond those in the before version (i.e., changed methods without new tokens) and (2) the changed methods of which the after version contains new identifiers/literals that are not in the before version (i.e., changed methods with new tokens). The changed methods without new tokens were used to fairly evaluate our AutoTransform against the Tufano et al. approach under the same condition as in the prior experiment [60], while the changed methods with new tokens were used to evaluate whether our AutoTransform can transform source code when a new token appears in the after version (i.e., evaluating the hypothesis discussed in Limitation 1).
For the changed methods without new tokens, we used the same selection approach as in the prior work [60]. More specifically, the changed methods without new tokens are those methods of which the after version m_a contains only (1) Java keywords; (2) top-300 frequent identifiers/literals [2]; and (3) identifiers and literals that are already available in the before version m_b.
Table 2: Vocabulary size for the original, subword tokenized, and abstracted methods in the training datasets.

Dataset (Method size) | Change Type | Original | BPE2K | BPE5K | Abs
Android (Small) | w/o new tokens | 12,052 | 2,702 | 4,230 | 356
Android (Small) | w/ new tokens | 43,795 | 7,448 | 9,247 | 408
Google (Small) | w/o new tokens | 5,012 | 1,719 | 2,751 | 333
Google (Small) | w/ new tokens | 13,737 | 3,417 | 4,884 | 383
Ovirt (Small) | w/o new tokens | 9,772 | 1,992 | 3,575 | 306
Ovirt (Small) | w/ new tokens | 30,562 | 4,243 | 6,042 | 355
Android (Medium) | w/o new tokens | 22,296 | 5,165 | 6,860 | 447
Android (Medium) | w/ new tokens | 76,264 | 15,585 | 17,874 | 496
Google (Medium) | w/o new tokens | 9,340 | 2,831 | 4,046 | 371
Google (Medium) | w/ new tokens | 22,140 | 6,334 | 8,052 | 422
Ovirt (Medium) | w/o new tokens | 17,680 | 3,231 | 4,958 | 353
Ovirt (Medium) | w/ new tokens | 44,317 | 7,528 | 9,674 | 422
*BPE2K and BPE5K are the subword-tokenized vocabularies with 2,000 and 5,000 merge operations, respectively. The w/ and w/o new tokens change types are mutually exclusive sets.
The remaining changed methods of which the after version contains tokens not listed in the aforementioned token categories are classified as the changed methods with new tokens. Note that we exclude the changed methods that were completely added or deleted (i.e., m_b = ∅ or m_a = ∅) and the changed methods of which the before and after versions appear the same (i.e., m_b = m_a) since NMT models would not be able to learn any code transformation patterns from these methods. In addition, we remove the duplicate changed methods (i.e., the methods whose (m_b, m_a) are exactly the same) to ensure that none of the duplicate methods in testing will appear in training.
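The classification and filtering rules above can be summarized by the following sketch; the JAVA_KEYWORDS and TOP_300 sets are placeholders here, not the actual lists used in the experiment.

```python
# A sketch of the classification rule described above: a changed method is
# "without new tokens" if every token of the after version is a Java keyword,
# a top-300 frequent identifier/literal, or already present in the before
# version; JAVA_KEYWORDS and TOP_300 are placeholder sets.
JAVA_KEYWORDS = {"public", "void", "final", "return", "if", "else"}   # truncated
TOP_300 = {"line", "result", "i", "size"}                             # placeholder

def has_new_tokens(before_tokens, after_tokens):
    allowed = JAVA_KEYWORDS | TOP_300 | set(before_tokens)
    return any(tok not in allowed for tok in after_tokens)

def keep_method(before_tokens, after_tokens):
    # discard fully added/deleted methods and unchanged methods
    if not before_tokens or not after_tokens:
        return False
    return before_tokens != after_tokens

m_b = "public void onSuccess ( final result ) { helper ( side ) ; }".split()
m_a = "public void onSuccess ( final result ) { helper ( sidePanel ) ; }".split()
if keep_method(m_b, m_a):
    print("w/ new tokens" if has_new_tokens(m_b, m_a) else "w/o new tokens")
```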
After selecting and classifying the changed methods, we separate the datasets into two method sizes, i.e., small and medium, to evaluate the hypothesis discussed in Limitation 2 (see Section 2). The small methods are those methods of which the before and after versions have a sequence length no longer than 50 tokens. The medium methods are those methods of which the before and after versions have a sequence length between 50-100 tokens. Similar to prior work [60], we disregard the changed methods that have a sequence longer than 100 tokens because large methods have a long tail distribution of sequence lengths with a high variance, which might be problematic when training an NMT model.
In total, we conduct an experiment based on 12 datasets, i.e., 3 repositories (Android, Google, Ovirt) × 2 method sizes (small, medium) × 2 change types (w/o and w/ new tokens). Each of the datasets is then partitioned into training (80%), validation (10%), and testing (10%).
4.3 Experimental Setup
This section provides setup details for our experiment.
Source Code Pre-processing. Before we perform subword tokenization (for our AutoTransform) and code abstraction (for the Tufano et al. approach), we perform word-level tokenization to convert the formatted code into a sequence of code tokens. To do so, for each version of each changed method, we simply separate code lines, identifiers, literals, Java keywords, and Java reserved characters (e.g., ;(){}) by a space. We do not convert code tokens to lower case because the programming language (i.e., Java) of the studied datasets is case-sensitive.
Table 3: Hyper-parameter settings for the Transformer and RNN models.

Hyper-Parameter | Transformer Model | RNN Model
#Encoder Layers (N_e) | {1, 2} | {1, 2}
#Decoder Layers (N_d) | {2, 4} | {2, 4}
Cell Types | n/a | {GRU, LSTM}
#Cells | n/a | {256, 512}
Embedding Size | n/a | {256, 512}
Attention Size | n/a | {256, 512}
#Attention Heads (h) | {8, 16} | n/a
Hidden Size (hs) | {256, 512} | n/a
Total #Settings | 8 | 10
We also do not split code tokens with compound words (e.g., camel-case) since our AutoTransform already performs subword tokenization and such compound-word splitting is not performed in the Tufano et al. approach.
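The word-level tokenization described above could be approximated with a simple regular expression, as in the sketch below; the exact pre-processing scripts may differ (e.g., in how string literals and operators are handled).

```python
# A simple regex-based sketch of the word-level tokenization described above:
# identifiers/literals, string literals, and Java reserved characters are
# separated by spaces (case is preserved; compound words are not split).
import re

TOKEN_PATTERN = re.compile(
    r'"(?:\\.|[^"\\])*"'          # string literals
    r"|[A-Za-z_$][\w$.]*"         # identifiers (possibly qualified) and keywords
    r"|\d+(?:\.\d+)?"             # numeric literals
    r"|[;(){}\[\],.=+\-*/<>!&|]"  # reserved characters and operators
)

def tokenize(code):
    return TOKEN_PATTERN.findall(code)

code = 'createCommentEditor(suggestRow, column, line, sidePanel);'
print(" ".join(tokenize(code)))
# createCommentEditor ( suggestRow , column , line , sidePanel ) ;
```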
For our AutoTransform, we use the implementation of Sennrich et al. [52] to perform subword tokenization using Byte-Pair Encoding (BPE) [1]. In this work, we experiment with two encoding sizes, i.e., the number of merge operations: 2,000 (BPE2K) and 5,000 (BPE5K), which substantially reduce the vocabulary size (at least approximately by 50%; see Table 2). For the Tufano et al. approach, we use src2abs [9] to abstract tokens with reusable IDs (Abs) and generate a map M for each changed method. Similar to prior work [60], we do not abstract Java keywords and the top-300 frequent identifiers/literals [2]. To ensure that an identifier/literal appearing in both versions has the same ID, we use the pair mode of src2abs for the training and validation data, while the single mode is used for the before version in the testing data.
NMT Models & Hyper-Parameter Settings. To build a Transformer model in our AutoTransform, we use the implementation of the Tensor2Tensor library [10]. To build an RNN model in the Tufano et al. approach, we use the implementation of the seq2seq library [8] which is also used in the prior work [60]. To ensure a fair comparison, we use similar combinations of hyper-parameters (where applicable) for both Transformer and RNN models (see Table 3). Therefore, we experiment with eight hyper-parameter settings for our Transformer model in AutoTransform. For the RNN models, we use ten hyper-parameter settings which are originally used in the experiment of the prior work [60].
When training the models for both approaches, we set the maximum number of epochs similarly to the prior work [60].¹ To avoid overfitting our models to the training data, we select the model checkpoint (i.e., the model that was trained until a particular number of epochs) that achieves the lowest loss value computed based on the validation data (not the testing data).
Evaluation Measure. To evaluate the performance of our AutoTransform and the Tufano et al. approach, we measure the number of methods for which an approach achieves a perfect prediction, i.e., the generated after version exactly matches the ground-truth (i.e., the actual after version).
¹Note that #epochs are calculated based on the size of the training data, the batch size, and #train steps, which are calculated differently for the Tensor2Tensor and seq2seq libraries. The details are provided in the Supplementary Materials [4].
Table 4: Perfect predictions (#PP) of our AutoTransform and the Tufano et al. approach for the small and medium changed methods with and without new tokens in the after version. The percentage value in the parenthesis indicates the percentage improvement of our AutoTransform. For each beam width (1, 5, 10), the two values report #PP of AutoTransform and of Tufano et al., respectively.

Dataset (Method Size) | Change Type | #Test | Beam=1: Auto / Tufano | Beam=5: Auto / Tufano | Beam=10: Auto / Tufano
Android (Small) | w/o new tokens | 443 | 84 / 53 | 125 / 83 | 130 / 107
Android (Small) | w/ new tokens | 2,064 | 108 / 0 | 206 / 0 | 233 / 0
Google (Small) | w/o new tokens | 228 | 11 / 14 | 22 / 36 | 29 / 42
Google (Small) | w/ new tokens | 907 | 40 / 0 | 81 / 0 | 97 / 0
Ovirt (Small) | w/o new tokens | 473 | 73 / 86 | 132 / 173 | 145 / 200
Ovirt (Small) | w/ new tokens | 2,328 | 352 / 0 | 618 / 0 | 715 / 0
Android (Medium) | w/o new tokens | 459 | 58 / 32 | 85 / 67 | 89 / 78
Android (Medium) | w/ new tokens | 2,454 | 124 / 0 | 247 / 0 | 289 / 0
Google (Medium) | w/o new tokens | 283 | 16 / 9 | 28 / 18 | 33 / 22
Google (Medium) | w/ new tokens | 1,162 | 18 / 0 | 46 / 0 | 63 / 0
Ovirt (Medium) | w/o new tokens | 622 | 111 / 18 | 179 / 49 | 199 / 62
Ovirt (Medium) | w/ new tokens | 3,327 | 415 / 0 | 833 / 0 | 992 / 0
Total | w/o new tokens | 2,508 | 353 / 212 | 571 / 426 | 625 / 511
Total | w/ new tokens | 12,242 | 1,060 / 0 | 2,031 / 0 | 2,389 / 0
Total | Both | 14,750 | 1,413 (+567%) / 212 | 2,602 (+511%) / 426 | 3,014 (+490%) / 511
Note that we convert the generated after version (i.e., the subword sequence of our AutoTransform, and the abstracted code sequence of the Tufano et al. approach) back to the code sequence before matching it with the ground-truth.
In our experiment, we use three different beam widths (i.e., k = {1, 5, 10}) when generating the after version. Thus, if one of the k sequence candidates exactly matches the ground-truth, we consider that the NMT approach achieves a perfect prediction, i.e., the code is correctly transformed. We do not use other metrics, e.g., BLEU which measures the overlap (or similarity) between the generated and ground-truth sequences, since similarity cannot imply that the generated sequences are viable for code implementation. Ding et al. also argue that BLEU should not be used to evaluate code transformation since sequences that are similar (i.e., few code tokens are different between the two sequences) may have largely-different intentions or semantics [22].
5 RESULTS
RQ1: Can AutoTransform transform code better than the Tufano et al. approach?
Approach. To address our RQ1, we evaluate how well our AutoTransform can transform the source code of the before version m_b to the after version m_a of given methods (in the testing data), compared against the approach of Tufano et al. [60]. Therefore, we build an NMT model for each of the 12 datasets, i.e., small and medium methods with and without new tokens across three repositories (see Table 1). For this RQ, we use the maximum number of merge operations of 2,000 (i.e., BPE2K) in our AutoTransform. In total, we train 96 Transformer models for our AutoTransform (i.e., 12 datasets × 8 hyper-parameter settings); and 120 RNN models for the Tufano et al. approach (i.e., 12 datasets × 10 hyper-parameter settings). Then, for each approach and for each dataset, we select the model with the best hyper-parameter setting that achieves the lowest loss value computed based on the validation data. Finally, we measure perfect predictions based on the testing data. To quantify the magnitude of improvement for our AutoTransform, we compute a percentage improvement of perfect predictions as (#PP_our − #PP_Tufano) / #PP_Tufano × 100%.
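For example, using the totals at beam width = 1 reported in Table 4, this yields (1,413 − 212) / 212 × 100% ≈ 567%.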
Results. Table 4 shows the results of perfect predictions of our AutoTransform and the Tufano et al. approach for 14,750 changed methods across the 12 datasets. The results are based on three beam widths, k = {1, 5, 10}, where k is the number of sequence candidates for a given method.
When considering both change types (i.e., w/o and w/ new tokens), our AutoTransform achieves a perfect prediction for 1,413 methods, which is 567% higher than the perfect predictions achieved by the Tufano et al. approach. Table 4 shows that when the beam width is 1 and both change types are considered, our AutoTransform achieves a perfect prediction for 34 - 526 methods, which accounts for 2% (34/1,445; for Google Medium) - 13% (526/3,949; for Ovirt Medium) of the methods in the testing data. On the other hand, the Tufano et al. approach achieves a perfect prediction for 9 - 86 methods, which accounts for only 0.62% (9/1,445; for Google Medium) - 3% (86/2,801; for Ovirt Small). In total across the 12 datasets, our AutoTransform can correctly transform 1,413 methods, which is 567% higher than the perfect predictions achieved by the Tufano et al. approach.
Even when we increase the beam width to 5 and 10, our AutoTransform can achieve higher perfect predictions, i.e., 5% (74/1,445 for Google Medium) - 28% (1,102/3,949 for Ovirt Medium) at beam width = 5 and 7% (96/1,445 for Google Medium) - 30% (1,191/3,949 for Ovirt Medium) at beam width = 10. The perfect predictions are improved by 511% for the beam width of 5 (and 490% for the beam width of 10) when compared to the perfect predictions achieved by the Tufano et al. approach. These results indicate that the number of methods for which our AutoTransform can achieve a perfect prediction is substantially higher than the Tufano et al. approach.
For the changed methods with new tokens appearing in the after version, our AutoTransform achieves a perfect prediction for 18 - 415 methods. At a beam width of 1, Table 4 shows
that our AutoTransform achieves a perfect prediction for 18 - 415 methods, which accounts for 2% (18/1,162) - 12% (415/3,327) of the changed methods with new tokens in the testing data. In total, our AutoTransform achieves a perfect prediction for 1,060 methods. Similarly, our AutoTransform achieves higher perfect predictions when the beam width is increased to 5 and 10, i.e., 4% (46/1,162) - 25% (833/3,327) at the beam width of 5; and 5% (63/1,162) - 30% (992/3,327) at the beam width of 10. On the other hand, Table 4 shows that the Tufano et al. approach could not achieve a perfect prediction for any of the changed methods with new tokens appearing in the after version. This is because the IDs of the new tokens cannot be mapped back to the actual identifiers/literals as they did not exist in the before version (see Limitation 1 in Section 2). These results highlight that our AutoTransform can transform source code even when new tokens appear in the after version.
For the changed methods without new tokens in the after version, our AutoTransform achieves a perfect prediction for 11 - 111 methods, while the Tufano et al. approach achieves a perfect prediction for 9 - 86 methods. At the beam width of 1, Table 4 shows that our AutoTransform achieves a perfect prediction for 11 - 111 methods, which accounts for 5% (11/228) - 18% (111/622) of the changed methods without new tokens in the testing data. On the other hand, the Tufano et al. approach achieves a perfect prediction for 9 - 86 methods, which accounts for 3% (9/283) - 18% (86/473). For the small methods, our AutoTransform achieves more perfect predictions in the Android dataset, but fewer in the Google and Ovirt datasets. We will further discuss this result in Section 6. Nevertheless, it is worth noting that for the medium methods, our AutoTransform achieves perfect predictions 78% ((16−9)/9) - 517% ((111−18)/18) higher than the perfect predictions achieved by the Tufano et al. approach. The results are similar when the beam width is 5 and 10. These results suggest that when a sequence becomes longer, our AutoTransform transforms code better than the Tufano et al. approach, highlighting that our AutoTransform can address Limitation 2 discussed in Section 2.
RQ2: What are the contributions of
AutoTransform’s components?
Approach. To address our RQ2, we examine perfect predictions when a component in our AutoTransform is varied. Specifically, we examine the percentage difference of perfect predictions when subword tokenization is changed to code abstraction (BPE → Abs); and when the NMT architecture is changed from Transformer to RNN (Transformer → RNN). We also investigate the case when the maximum number of merge operations is changed from 2,000 to 5,000 (BPE2K → BPE5K), i.e., the vocabulary size increases. Thus, we evaluate the perfect predictions of four additional combinations: BPE5K+Transformer, Abs+Transformer, BPE2K+RNN, BPE5K+RNN. We build an NMT model using each combination for each of the 12 datasets. In total, for RQ2, we further build 192 Transformer models (i.e., 2 Transformer-based combinations × 12 datasets × 8 hyper-parameter settings); and 240 RNN models (i.e., 2 RNN-based combinations × 12 datasets × 10 hyper-parameter settings). Similar to RQ1, we select the model with the best hyper-parameter setting based on the validation data; and measure perfect predictions based on the testing data.
Results. Figure 3 shows the perfect predictions of our AutoTransform (BPE2K+Transformer), the Tufano et al. approach (Abs+RNN), and the four additional combinations.
Using subword tokenization by BPE can increase perfect predictions at least by 284%, compared to the code abstraction with reusable IDs. Figure 3 shows that at a beam width of 1, the perfect predictions of our AutoTransform (BPE2K+Transformer) are 284% ((192−50)/50 for Android Small) - 2,290% ((526−22)/22 for Ovirt Medium) higher than Abs+Transformer. Considering the cases when the RNN architecture is used with BPE (i.e., BPE2K+RNN and BPE5K+RNN), the perfect predictions are also higher than the Tufano et al. approach (Abs+RNN). For example, for the small methods, Figure 3 shows that BPE2K+RNN achieves perfect predictions 29% ((18−14)/14) - 323% ((364−86)/86) higher than the Tufano et al. approach. Figure 3 also shows similar results when the beam width is increased to 5 and 10. These results indicate that regardless of the NMT architecture, subword tokenization by BPE largely contributes to the performance of our AutoTransform in transforming code.
When using a different number of merge operations (BPE2K → BPE5K), we find that perfect predictions were slightly different. For example, for the small methods (at a beam width of 1), the percentage difference of perfect predictions between our AutoTransform and BPE5K+Transformer is 7% ((55−51)/51) - 19% ((192−156)/192). Figure 3 also shows that the results are similar for the RNN-based approaches (i.e., BPE2K+RNN and BPE5K+RNN). These results suggest that the number of merge operations has an impact (but relatively small) on the performance of our AutoTransform.
Using the Transformer architecture can increase perfect predictions at least by 17%, compared to the RNN architecture. Figure 3 shows that at the beam width of 1 and for small methods, our AutoTransform achieves perfect predictions 17% ((425−364)/364 for Ovirt) - 183% ((51−18)/18 for Google) higher than BPE2K+RNN. It is also worth noting that the percentage difference is much higher for the medium methods, i.e., 183% ((34−12)/12 for Google) - 507% ((182−30)/30 for Android). Figure 3 also shows a large difference of perfect predictions between our AutoTransform and BPE2K+RNN when the beam width is increased to 5 and 10. The results are also similar when comparing the perfect predictions between BPE5K+Transformer and BPE5K+RNN, i.e., the Transformer models tend to achieve higher perfect predictions than the RNN models. These results suggest that the Transformer architecture also contributes to the performance of our AutoTransform in transforming code, especially for the methods with a relatively long sequence like medium methods.
6 DISCUSSION
In this section, we discuss our AutoTransform in several aspects
including its advantage, performance, and practicality.
Advantage: Why does our AutoTransform work for the changed methods with new tokens? Table 4 shows that, in total, our AutoTransform can correctly transform the methods that have new tokens appearing in the after version. We further analyze these methods to better understand the characteristics of the new code tokens appearing in the after version.
[Figure 3 plots the number of perfect predictions per dataset (Android, Google, and Ovirt; small (S) and medium (M) methods) at beam widths 1, 5, and 10 for six combinations: AutoTransform (BPE2K+Transformer), BPE5K+Transformer, BPE2K+RNN, BPE5K+RNN, Abs+Transformer, and Tufano et al. (Abs+RNN).]
Figure 3: The perfect predictions of our AutoTransform when a component is varied. The y-axis shows the total number of perfect predictions of changed methods with and without new tokens.
We find that 43% (960/1,689) of the new code tokens are identifiers/literals that already exist in the training data (i.e., known code tokens), suggesting that our AutoTransform can reuse the code tokens existing across all methods that AutoTransform has learned. On the other hand, the Tufano et al. approach cannot generate these new code tokens because their approach is restricted by the code abstraction with reusable IDs to use only the identifiers/literals that exist in the before version, or that are among the top-300 frequent identifiers/literals.
Furthermore, the other 57% of the new code tokens appearing in the after version are new identifiers/literals that do not exist in the training data, suggesting that our AutoTransform can generate these new code tokens based on a combination of known subwords in the training data. We observe that these new code tokens are related to changing a Java package/library (e.g., org.junit.Assert → org.hamcrest.MatcherAssert), changing identifiers (e.g., getlog_type_name → getLogTypeName, ddt.mID), or even adding new statements (e.g., instantiating a new object).
Limitation: Why does our AutoTransform achieve fewer perfect predictions than the Tufano et al. approach for small methods without new tokens in the Google and Ovirt datasets? Table 4 shows that, for the small methods without new tokens in the after version, our AutoTransform achieves fewer perfect predictions than the Tufano et al. approach in the Google and Ovirt datasets. To better understand the methods for which our AutoTransform cannot achieve a perfect prediction, we manually examine 9 of the 14 methods (Google) and 54 of the 86 methods (Ovirt) for which our AutoTransform cannot achieve a perfect prediction but the Tufano et al. approach could. We find that there are only 2 (out of 9) and 5 (out of 54) methods that are incorrectly predicted by our AutoTransform. On the other hand, for the remaining 7 (out of 9) and 49 (out of 54) methods, we find that our AutoTransform almost achieves a perfect prediction, i.e., the generated sequence is very similar to the ground-truth with minor errors. Broadly speaking, we observe that for most of these methods, our approach transforms some code tokens that should remain unchanged in the after version. One possible reason is that these code tokens are rare tokens and BPE splits them into many subwords, i.e., one rare token becomes a long subword sequence. Then, due to the large search space of subwords, our approach may unnecessarily generate a new combination of subwords. For example, we observe that FixturesTool.DATA_CENTER is split into 9 subwords (i.e., FixturesTool.@@, DA@@, ..., TE@@, R). Then, our approach inserts a subword A@@, resulting in a new token FixturesTool.ADATA_CENTER. Based on this observation, an approach to reduce the length of subword sequences (e.g., fine-tuning the number of merge operations) may improve the performance.
Hyper-Parameter Sensitivity: How sensitive is our AutoTransform to the hyper-parameter settings? Deep learning models are known to be sensitive to hyper-parameter settings. While we select the best hyper-parameter setting based on the validation data for our experiment, we are interested in examining the impact of hyper-parameter settings on the performance of our AutoTransform. Hence, we analyze the perfect predictions of our AutoTransform when each of the eight hyper-parameter settings in Table 3 is used. We find that a setting of (𝑁𝑒 = 2, 𝑁𝑑 = 4, 𝐻 = 8, ℎ𝑠 = 512) allows our AutoTransform to achieve the highest number of perfect predictions for 7 out of 12 datasets, while a setting of (𝑁𝑒 = 1, 𝑁𝑑 = 2, 𝐻 = 8, ℎ𝑠 = 512) allows our AutoTransform to achieve the highest number of perfect predictions for 4 out of 12 datasets. Nevertheless, we observe that the number of perfect predictions decreases by only 1 - 3 percentage points when the other hyper-parameter settings are used instead of the best setting.
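For reference, these settings can be read as a small configuration object. The sketch below is a generic placeholder rather than the actual Tensor2Tensor hyper-parameter object used in our experiments, and the reading of 𝐻 as the number of attention heads is an assumption.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Hyper-parameters varied in Table 3; the defaults below correspond to the
    # setting that performed best on 7 of the 12 datasets. This class is a
    # generic placeholder, not the Tensor2Tensor hparams object we used.
    num_encoder_layers: int = 2   # Ne
    num_decoder_layers: int = 4   # Nd
    num_attention_heads: int = 8  # H (assumed interpretation)
    hidden_size: int = 512        # hs

best_setting = TransformerConfig()
alternative = TransformerConfig(num_encoder_layers=1, num_decoder_layers=2)  # best on 4/12 datasets
```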
Performance:
How long does our AutoTransform take to train
and infer? Model training and inference time can be one of the
important factors when considering the adoption of our approach
in practice. Hence, we measure the training and inference time for
the Transformer models in our AutoTransform. We find that the training time for each Transformer model in our AutoTransform used in RQ1 ranges from 30 minutes to 2 hours, depending on the size of the dataset and the number of epochs. The average inference time of AutoTransform per method ranges from 15 to 60 milliseconds when generating one sequence candidate (i.e., beam width = 1). Similarly, when generating 5 and 10 sequence candidates per method (i.e., beam width = {5, 10}), the average inference time per input sequence ranges from 42 to 200 milliseconds. Note that the training and inference time is measured on a computer with an Nvidia GeForce RTX 3090 graphics card.
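As a sketch of how these per-method timings could be measured, the snippet below averages decoding time over a set of before-version methods for a given beam width. The model.translate call is a hypothetical stand-in for the decoding interface of a trained NMT model; it is not an actual Tensor2Tensor API.

```python
import time

def average_inference_time_ms(model, before_methods, beam_width):
    """Average per-method inference time (in milliseconds) for a given beam width.

    `model.translate` is a hypothetical stand-in for the decoding call of the
    trained Transformer (e.g., a decode step with the beam size set accordingly).
    """
    start = time.perf_counter()
    for method in before_methods:
        model.translate(method, beam_size=beam_width)  # returns `beam_width` candidates
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms / len(before_methods)

# Example usage (model and test_methods are placeholders):
# for beam in (1, 5, 10):
#     print(beam, average_inference_time_ms(model, test_methods, beam))
```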
Practicality: To what extent can AutoTransform support the modern code review process? Our RQ1 has shown that AutoTransform can correctly transform 1,413 methods, which is substantially better than the prior work. This result highlights a great potential of AutoTransform to augment code authors by recommending the common revisions that occurred during code reviews in the past to apply to newly-written code without waiting for reviewers' feedback. Nevertheless, at this stage of the work, AutoTransform may be suitable only for code changes of software components that tolerate false positives, as AutoTransform provides a recommendation for every changed method, which may produce many false positives compared to humans. Indeed, prior work [51] reported that developers may decide not to use a supporting tool if its false positive rate is above 25%. Furthermore, similar to the prior work [60], the applicability of AutoTransform is still limited to small and medium method sizes. Thus, to broaden the practicality of AutoTransform, future work should aim to develop an approach that is more selective to achieve higher accuracy (e.g., a false positive rate below 25%) for any method size (including large methods).
7 RELATED WORK
Automated Code Review. Code review is effective, but still human-intensive and time-consuming [11, 37, 50]. Thus, recent work leveraged machine learning techniques to support various activities throughout the code review process, for example, reviewer recommendation [13, 43, 46, 49, 56, 59], and review task prioritization based on code change characteristics [23, 38, 58] and defect-proneness [30, 39, 44, 45, 65]. Several studies also proposed approaches to support reviewers when reading and examining code [14, 27, 55, 63, 64]. Although these approaches can reduce the manual effort of reviewers, code authors still need to manually modify the source code until it is approved by reviewers. Yet, few studies focus on developing an approach to automatically transform source code to help code authors reduce their effort in the code review context.
To the best of our knowledge, only two recent approaches [60, 61] have been proposed to automatically transform the source code to the version that is approved by reviewers (i.e., the after version). However, both approaches use code abstraction (ABS+RNN [60], ABS+Transformer [61]) to reduce the vocabulary size, while our results show that code abstraction hinders the NMT approaches in correctly transforming code if the after version has a new token (e.g., renamed variables). Thus, these recent approaches still have limited applicability for automatically transforming code in the code review process. Different from prior work, we leverage BPE to address this limitation. Importantly, our RQ2 shows that using BPE (BPE+Transformer, BPE+RNN) achieves at least 284% more perfect predictions than using the code abstraction with reusable IDs (ABS+Transformer, ABS+RNN), highlighting the important contribution of this paper to automated code transformation for code review.
Neural Machine Translation (NMT) in Software Engineering. NMT approaches have been developed to support various software engineering tasks, which can be categorized into four types of transformation: (1) Text → Text (e.g., language translation of code documentation [35], query reformulation [19]); (2) Text → Code (e.g., code search [24, 42]); (3) Code → Text (e.g., code summarization [25], commit message generation [29, 34]); and (4) Code → Code (e.g., automated program repair [20, 28, 33], programming language translation [48], code completion [54]). Although automated program repair (APR) approaches and our AutoTransform share a similar concept of using NMT for a Code → Code task, APR approaches [20, 28, 33] only aim to automatically transform buggy code into clean code for bug-fixing purposes, which may not be related to other types of code changes in code review. Different from APR, our AutoTransform aims to automatically transform source code that is changed during the code review process (e.g., refactoring) to improve readability, maintainability, and design quality [16, 41, 50, 57].
8 THREATS TO VALIDITY
Construct Validity. The source code granularity used in this work is at the method level. The results may vary if the changed source code is extracted at a different granularity (e.g., changed lines). However, prior work pointed out that an NMT model requires code context, and using only changed lines may lead the NMT model to suffer from the Out-Of-Vocabulary problem even though BPE is used [22]. Hence, we believe that training an NMT model at the method level provides a more reasonable range of code context than changed lines.
We define the method size based on the number of tokens (i.e., 1 - 50 tokens for small and 51 - 100 tokens for medium). This size definition may not reflect the actual method sizes in practice, e.g., a method with 100 tokens may actually have few lines of code. The performance of AutoTransform may differ if other definitions of method size are used. Nevertheless, for the sake of a fair comparison in our experiment, we opt to use the same definition of method size as used in the prior work [60]. Moreover, from the aspect of the NMT algorithm, transforming long sequences is not trivial as it requires more memory and computation power [66].
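For clarity, the token-count-based size definition can be expressed as a small helper; the whitespace tokenization in the usage example below is only a rough stand-in for the lexer-based tokenization used in our pipeline.

```python
def method_size_category(method_tokens: list[str]) -> str:
    """Classify a changed method by its token count, following the definition of [60].

    Tokenization is assumed to be done beforehand (the actual pipeline tokenizes
    Java source code; a whitespace split is only a rough stand-in here).
    """
    n = len(method_tokens)
    if 1 <= n <= 50:
        return "small"
    if 51 <= n <= 100:
        return "medium"
    return "excluded"  # larger methods are out of scope in this study

print(method_size_category("public int getId ( ) { return id ; }".split()))  # small
```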
Internal Validity. We experiment with only two settings of merge operations (i.e., BPE2K and BPE5K) and eight hyper-parameter settings, which are based on combinations of four hyper-parameters. The results may vary if the number of merge operations and the hyper-parameter settings of both approaches are optimized. However, finding an optimal setting can be very computationally expensive given the large search space of the number of merge operations and all available hyper-parameters. In addition, the goal of our work is not to find the best setting, but to fairly compare the performance of our approach with the prior approach based on similar settings as used in the prior work [60]. Nevertheless, our analyses in Sections 5 and 6 have shown that the number of merge operations and the hyper-parameter settings have a small impact on the performance of our approach.
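To clarify what the merge-operation setting controls, the sketch below re-implements the standard BPE learning loop [52] in simplified form, where num_merges corresponds to 2,000 (BPE2K) or 5,000 (BPE5K) in our experiment. Our actual experiments use the subword-nmt implementation rather than this simplified version.

```python
from collections import Counter

def learn_bpe(corpus_tokens, num_merges):
    """Learn BPE merge operations from a list of code tokens (simplified sketch)."""
    # Represent every token as a sequence of characters with a frequency count.
    vocab = Counter(tuple(token) for token in corpus_tokens)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every token in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Tiny illustrative corpus; in our experiment num_merges is 2,000 or 5,000.
print(learn_bpe(["getId", "getName", "setId", "setName"], num_merges=5))
```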
External Validity. We evaluate our AutoTransform based on the changed methods that were extracted from three Gerrit code review repositories. In addition, we only experimented with Java programs. Our results may not generalize to other code review repositories or other programming languages. However, our approach is based on techniques (i.e., BPE and Transformer) that are language-independent. In addition, we provide a replication package to facilitate future work in replicating our approach on different repositories and programming languages.
9 CONCLUSION
Prior work [60] proposed an NMT approach to automatically transform a given method from the before version (i.e., the version before the implementation of code changes) to the after version (i.e., the version that is reviewed and merged). Yet, its performance is still suboptimal when the after version has new identifiers or literals, or when the sequence becomes longer. Hence, in this paper, we propose AutoTransform, which leverages BPE to handle new tokens and a Transformer to handle long sequences.
Through an empirical evaluation based on 14,750 changed methods that are extracted from three Gerrit code review repositories, we find that (1) our AutoTransform can correctly transform 1,060 methods for which the after version has new tokens, while the prior work cannot correctly transform any of these methods; and (2) for the changed methods for which the after version does not have new tokens, our AutoTransform can correctly transform 353 methods, which is 67% higher than the prior work. Furthermore, our ablation study also shows that BPE and the Transformer substantially contribute to the performance improvement of our AutoTransform when compared to the components used in the prior work. These results highlight that our AutoTransform effectively addresses the limitations of the prior work, allowing the NMT approach to be applied to a wider range of changed methods (i.e., methods with new tokens and methods with long sequences). The proposed approach and the results of this paper contribute toward automated code transformation for code reviews, which could help developers reduce their effort in modifying source code during the code review process.
ACKNOWLEDGEMENT
Patanamon Thongtanunam was supported by the Australian Research Council's Discovery Early Career Researcher Award (DECRA) funding scheme (DE210101091). Chakkrit Tantithamthavorn was supported by the Australian Research Council's Discovery Early Career Researcher Award (DECRA) funding scheme (DE200100941).
REFERENCES
[1]
[n.d.]. A library for subword tokenization using Byte-Pair Encoding. https://github.com/rsennrich/subword-nmt.
[2]
[n.d.]. A list of Top-300 frequent identifier/literals for each of the studied datasets.
https://sites.google.com/view/learning-codechanges/data#h.p_r- R_Z4sKJC2L.
[3]
[n.d.]. Android’s Gerrit Code Review Repositories. https://android-review.
googlesource.com/.
[4]
[n.d.]. AutoTransform’s Replication Package. https://github.com/awsm-research/
AutoTransform-Replication.
[5]
[n.d.]. Datasets of the paper titled “On Learning Meaningful Code Changes
Via Neural Machine Translation”. https://sites.google.com/view/learning-
codechanges/data#h.p__6KdV38lN05N.
[6]
[n.d.]. Google’s Gerrit Code Review Repositories. https://gerrit-review.
googlesource.com/.
[7] [n.d.]. Ovirt’s Gerrit Code Review Repositories. https://gerrit.ovirt.org/.
[8]
[n.d.]. Seq2Seq: A library for RNN-based NMT models. https://google.github.io/
seq2seq/.
[9]
[n.d.]. Src2Abs: A library for abstracting code with reusable IDs. https://github.
com/micheletufano/src2abs.
[10]
[n.d.]. Tensor2Tensor: A library for Transformer-based NMT models. https://github.com/tensorflow/tensor2tensor.
[11]
Alberto Bacchelli and Christian Bird. 2013. Expectations, Outcomes, and Chal-
lenges of Modern Code Review. In Proceedings of ICSE. 712–721.
[12]
Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural Machine
Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR.
1–15.
[13]
Vipin Balachandran. 2013. Reducing Human Effort and Improving Quality in Peer
Code Reviews using Automatic Static Analysis and Reviewer Recommendation.
In Proceedings of ICSE. 931–940.
[14]
Tobias Baum, Kurt Schneider, and Alberto Bacchelli. 2019. Associating Working
Memory Capacity and Code Change Ordering with Code Review Performance.
EMSE 24, 4 (2019), 1762–1798.
[15]
Gabriele Bavota and Barbara Russo. 2015. Four Eyes Are Better Than Two: On
The Impact of Code Reviews on Software Quality. In Proceedings of ICSME. 81–90.
[16]
Moritz Beller, Alberto Bacchelli, Andy Zaidman, and Elmar Juergens. 2014. Mod-
ern Code Reviews in Open-Source Projects: Which Problems do They Fix?. In
Proceedings of MSR. 202–211.
[17]
Amiangshu Bosu, Michaela Greiler, and Christian Bird. 2015. Characteristics of
Useful Code Reviews: An Empirical Study at Microsoft. In Proceedings of MSR.
146–156.
[18]
Denny Britz, Anna Goldie, Minh Thang Luong, and Quoc V. Le. 2017. Massive Ex-
ploration of Neural Machine Translation Architectures. In Proceedings of EMNLP.
1442–1451.
[19]
Kaibo Cao, Chunyang Chen, Sebastian Baltes, Christoph Treude, and Xiang Chen.
2021. Automated Query Reformulation for Efficient Search based on Query Logs From Stack Overflow. In Proceedings of ICSE. 1273–1285.
[20]
Zimin Chen, Steve James Kommrusch, Michele Tufano, Louis-Noël Pouchet,
Denys Poshyvanyk, and Martin Monperrus. 2019. Sequencer: Sequence-to-
Sequence Learning for End-to-End Program Repair. TSE (2019).
[21]
Jacek Czerwonka, Michaela Greiler, and Jack Tilford. 2015. Code Reviews Do Not Find Bugs: How the Current Code Review Best Practice Slows Us Down.
In Proceedings of ICSE. 27–28.
[22]
Yangruibo Ding, Baishakhi Ray, Premkumar Devanbu, and Vincent J. Hellendoorn.
2020. Patching as Translation: The Data and the Metaphor. In Proceedings of ASE.
275–286.
[23]
Yuanrui Fan, Xin Xia, David Lo, and Shanping Li. 2018. Early Prediction of Merged
Code Changes to Prioritize Reviewing Tasks. EMSE 23, 6 (2018), 3346–3393.
[24]
Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep Code Search. In
Proceedings of ICSE. 933–944.
[25]
Sakib Haque, Alexander LeClair, Lingfei Wu, and Collin McMillan. 2020. Im-
proved Automatic Summarization of Subroutines via Attention to File Context.
In Proceedings of MSR. 300–310.
[26]
Vincent J Hellendoorn and Premkumar Devanbu. 2017. Are Deep Neural Net-
works the Best Choice for Modeling Source Code?. In Proceedings of FSE. 763–773.
[27]
Yang Hong, Chakkrit Tantithamthavorn, and Patanamon Thongtanunam. 2022.
Where Should I Look at? Recommending Lines that Reviewers Should Pay At-
tention To. In Proceedings of SANER.
[28]
Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-Aware Neural
Machine Translation for Automatic Program Repair. In Proceedings of ICSE. 1161–
1173.
[29]
Siyuan Jiang, Ameer Armaly, and Collin McMillan. 2017. Automatically Generating Commit Messages from Diffs using Neural Machine Translation. In Proceedings of ASE. 135–146.
[30]
Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E. Hassan, Audris Mockus,
Anand Sinha, and Naoyasu Ubayashi. 2013. A Large-Scale Empirical Study of
Just-In-Time Quality Assurance. TSE 39, 6 (2013), 757–773.
[31]
Rafael Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, and
Andrea Janes. 2020. Big Code != Big Vocabulary: Open-Vocabulary Models for
Source Code. In Proceedings of ICSE. 1073–1085.
[32]
Oleksii Kononenko, Olga Baysal, and Michael W. Godfrey. 2016. Code Review
Quality: How Developers See It. In Proceedings of ICSE. 1028–1038.
[33]
Yi Li, Shaohua Wang, and Tien N Nguyen. 2020. DLFix: Context-based Code
Transformation Learning for Automated Program Repair. In Proceedings of ICSE.
602–614.
[34]
Zhongxin Liu, Xin Xia, Ahmed E Hassan, David Lo, Zhenchang Xing, and Xinyu
Wang. 2018. Neural-Machine-Translation-based Commit Message Generation:
How Far are We?. In Proceedings of ASE. 373–384.
[35]
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambro-
sio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al
.
2021.
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding
and Generation. arXiv preprint arXiv:2102.04664 (2021).
[36]
Minh Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective
Approaches to Attention-based Neural Machine Translation. In Proceedings of
EMNLP. 1412–1421.
[37]
Laura MacLeod, Michaela Greiler, Margaret-Anne Storey, Christian Bird, and
Jacek Czerwonka. 2017. Code Reviewing in the Trenches: Challenges and Best
Practices. IEEE Software 35, 4 (2017), 34–42.
[38]
Chandra Maddila, Chetan Bansal, and Nachiappan Nagappan. 2019. Predicting
Pull Request Completion Time: A Case Study on Large Scale Cloud Services. In
Proceedings of ESEC/FSE. 874–882.
[39]
Shane McIntosh and Yasutaka Kamei. 2017. Are Fix-Inducing Changes a Moving
Target? A Longitudinal Case Study of Just-In-Time Defect Prediction. TSE (2017),
412–428.
[40]
Shane McIntosh, Yasutaka Kamei, Bram Adams, and Ahmed E Hassan. 2016. An
Empirical Study of the Impact of Modern Code Review Practices on Software
Quality. EMSE 21, 5 (2016), 2146–2189.
[41]
Rodrigo Morales, Shane McIntosh, and Foutse Khomh. 2015. Do Code Review
Practices Impact Design Quality?: A Case Study of the Qt, VTK, and ITK Projects.
In Proceedings of SANER. 171–180.
[42]
Thanh Nguyen, Peter C Rigby, Anh Tuan Nguyen, Mark Karanfil, and Tien N
Nguyen. 2016. T2API: Synthesizing API Code Usage Templates from English
Texts with Statistical Translation. In Proceedings of FSE. 1013–1017.
[43]
Ali Ouni, Raula Gaikovina Kula, and Katsuro Inoue. 2016. Search-based Peer
Reviewers Recommendation in Modern Code Review. In Proceedings of ICSME.
367–377.
[44]
Chanathip Pornprasit and Chakkrit Tantithamthavorn. 2021. JITLine: A Simpler,
Better, Faster, Finer-grained Just-In-Time Defect Prediction. In Proceedings of
MSR. To Appear.
[45]
Chanathip Pornprasit and Chakkrit Tantithamthavorn. 2022. DeepLineDP: To-
wards a Deep Learning Approach for Line-Level Defect Prediction. IEEE Trans-
actions on Software Engineering (2022).
[46]
Mohammad Masudur Rahman, Chanchal K Roy, and Jason A Collins. 2016. COR-
RECT: Code Reviewer Recommendation in GitHub based on Cross-Project and
Technology Experience. In Proceedings of ICSE (Companion). 222–231.
[47]
Peter C Rigby and Christian Bird. 2013. Convergent Contemporary Software
Peer Review Practices. In Proceedings of FSE. 202–212.
[48]
Baptiste Roziere, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample.
2020. Unsupervised Translation of Programming Languages.. In NeurIPS.
[49]
Shade Ruangwan, Patanamon Thongtanunam, Akinori Ihara, and Kenichi Mat-
sumoto. 2018. The Impact of Human Factors on the Participation Decision of
Reviewers in Modern Code Review. EMSE (2018), In press.
[50]
Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto
Bacchelli. 2018. Modern Code Review: A Case Study at Google. In Proceedings of
ICSE (Companion). 181–190.
[51]
Caitlin Sadowski, Jerey Van Gogh, Ciera Jaspan, Emma Söderberg, and Collin
Winter. 2015. Tricorder: Building a program analysis ecosystem. In Proceeding of
the International Conference on Software Engineering (ICSE). 598–608.
[52]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine
Translation of Rare Words with Subword Units. In Proceedings of ACL, Vol. 3.
1715–1725.
[53]
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence
Learning with Neural Networks. In NIPS. 3104–3112.
[54]
Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020.
IntelliCode Compose: Code Generation Using Transformer. In Proceedings of
ESEC/FSE.
[55]
Yida Tao and Sunghun Kim. 2015. Partitioning Composite Code Changes to
Facilitate Code Review. In Proceedings of MSR. 180–190.
[56]
Patanamon Thongtanunam, Raula Gaikovina Kula, Ana Erika Camargo Cruz,
Norihiro Yoshida, and Hajimu Iida. 2014. Improving code review effectiveness
through reviewer recommendations. In Proceedings of CHASE. 119–122.
[57]
Patanamon Thongtanunam, Shane McIntosh, Ahmed E. Hassan, and Hajimu Iida.
2015. Investigating Code Review Practices in Defective Files: An Empirical Study
of the Qt System. In MSR. 168–179.
[58]
Patanamon Thongtanunam, Shane McIntosh, Ahmed E Hassan, and Hajimu Iida.
2017. Review Participation in Modern Code Review. EMSE 22, 2 (2017), 768–817.
[59]
Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula,
Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. 2015. Who Should Re-
view My Code? A File Location-based Code-reviewer Recommendation Approach
for Modern Code Review. In Proceedings of SANER. 141–150.
[60]
Michele Tufano, Jevgenija Pantiuchina, Cody Watson, Gabriele Bavota, and Denys
Poshyvanyk. 2019. On Learning Meaningful Code Changes Via Neural Machine
Translation. In Proceedings of ICSE. 25–36.
[61]
Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and
Gabriele Bavota. 2021. Towards Automating Code Review Activities. In Pro-
ceedings of ICSE. To appear.
[62]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All
You Need. In NIPS. 5999–6009.
[63]
Dong Wang, Tao Xiao, Patanamon Thongtanunam, Raula Gaikovina Kula, and
Kenichi Matsumoto. 2021. Understanding Shared Links and Their Intentions to
Meet Information Needs in Modern Code Review. In EMSE. to appear.
[64]
Min Wang, Zeqi Lin, Yanzhen Zou, and Bing Xie. 2019. CORA: Decomposing
and Describing Tangled Code Changes for Reviewer. In Proceedings of ASE. 1050–
1061.
[65]
Supatsara Wattanakriengkrai, Patanamon Thongtanunam, Chakkrit Tan-
tithamthavorn, Hideaki Hata, and Kenichi Matsumoto. 2020. Predicting defective
lines using a model-agnostic technique. IEEE Transactions on Software Engineering
(2020).
[66]
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris
Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang,
and Amr Ahmed. 2020. Big Bird: Transformers for Longer Sequences. In Advances
in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell,
M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 17283–17297.