A Neural Model for Generating Natural Language Summaries of Program Subroutines

Alexander LeClair, Siyuan Jiang, Collin McMillan
Dept. of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN, USA
Email: {aleclair, cmc}
Dept. of Computer Science
Eastern Michigan University
Ypsilanti, MI, USA
Abstract—Source code summarization – creating natural lan-
guage descriptions of source code behavior – is a rapidly-growing
research topic with applications to automatic documentation
generation, program comprehension, and software maintenance.
Traditional techniques relied on heuristics and templates built
manually by human experts. Recently, data-driven approaches
based on neural machine translation have largely overtaken
template-based systems. But nearly all of these techniques rely
almost entirely on programs having good internal documentation;
without clear identifier names, the models fail to create good
summaries. In this paper, we present a neural model that
combines words from code with code structure from an AST.
Unlike previous approaches, our model processes each data
source as a separate input, which allows the model to learn code
structure independent of the text in code. This process helps
our approach provide coherent summaries in many cases even
when zero internal documentation is provided. We evaluate our
technique with a dataset we created from 2.1m Java methods. We
find improvement over two baseline techniques from SE literature
and one from NLP literature.
Index Terms—automatic documentation generation, source
code summarization, code comment generation
A “summary” of source code is a brief natural language
description of that section of source code [1]. The subroutines in a
program are among the most common targets for summarization;
for example, the one-sentence descriptions of Java
methods widely used in automatically-formatted documentation,
e.g., JavaDocs [2]. These descriptions are useful because
they help programmers understand the role that the subroutine
plays in a program – empirical studies have repeatedly shown
that understanding the role of the subroutines in a program
is a crucial step to understanding the program’s behavior
overall [3]–[6]. Even a short summary of a subroutine e.g.
“returns the player’s hitpoint count” can tell a programmer a
lot about that subroutine and the program as a whole.
A holy grail of software engineering research has long
been to generate these summaries automatically. Forward et
al. pointed out in 2002 that “software professionals value
technologies that improve automation of the documentation
process,” and “that documentation tools should seek to better
extract knowledge from core resources”, such as source
code [7]. However, the state of the practice in tool support for
automated documentation generation has barely changed since
that time. Tools such as JavaDoc [2] and Doxygen [8]
automate the format and presentation of documentation, but
still leave programmers with the most labor-intensive effort of
writing the text and examples.
This work is supported in part by NSF CCF-1452959, CCF-1717607, and
CNS-1510329 grants.
Research into generation of natural language descriptions
of code has come to be known as “source code summarization” [9],
with significant effort focused on generation
of summaries of subroutines. For several years, significant
progress was made based on content selection and sentence
templates [1], [10]–[14] or even somewhat-idiosyncratic solutions
such as mimicking human eye movements [15]. However,
as in many research areas and as chronicled in a recent survey
by Allamanis et al. [16], these techniques have largely given
way to AI based on big data input.
The inspiration for a vast majority of efforts into AI-based
code summarization originates in neural machine translation
(NMT) from the natural language processing research com-
munity. An NMT system converts one natural language into
another. It is typically thought of in terms of sequence to
sequence (seq2seq) learning, in which a sentence in, e.g., English
is one sequence and is converted into an equivalent target
sequence representing a sentence in, e.g., French. In software
engineering research, machine translation can be considered
as a metaphor for source code summarization: the words and
tokens in the body of a subroutine are one sequence, while
the desired natural language summary is the target sequence.
This application of NMT to code summarization has shown
strong benefits in a variety of applications [17]–[22].
However, an Achilles’ heel to nearly all source code summarization
techniques is a reliance on programmers having
written high quality internal documentation in the form of
identifier names or comments. In order to generate a meaningful
summary, meaningful words must be observed in the
body of the subroutine. In traditional NMT, this is accepted
because a natural language input sentence will definitely have
words related to the output target sentence. But in software,
the words in code are not actually related to the behavior of
that code. A subroutine’s behavior is dictated by the structure
of programming language keywords and tokens that define
control flow, data flow, etc. These differences between code
and natural language are a barrier to improving performance
in several AI applications to software engineering, as Hellendoorn
et al. [23] pointed out, to some controversy, at FSE’17.
arXiv:1902.01954v1 [cs.SE] 5 Feb 2019
In this paper, we present a neural model for summarizing
subroutines. Our model combines two types of information
about source code: 1) a word representation treating code
as text, and 2) an abstract syntax tree (AST) representation.
A distinguishing factor of our model compared to earlier
approaches is that we treat both representations separately.
Previous techniques have shown promise by annotating a word
representation with AST information [22], but ultimately the
annotated representation is sent as a single sequence through
a standard seq2seq model. In contrast, our model accepts two
inputs, one for the word representation and one for the AST.
The advantage is that we are able to treat each input
differently, which increases the flexibility of our approach, as
we will show in this paper.
In essence, the neural model we propose involves two unidirectional
gated recurrent unit (GRU) layers: one to process
the words from source code, and one to process the AST. We
modify the SBT AST flattening procedure by Hu et al. [22]
to represent the AST. We then use an attention mechanism
to attend words in the output summary sentence to words
in the code word representation, and a separate attention
mechanism to attend the summary words to parts of the AST.
We concatenate the vectors from each attention mechanism
to create a context vector. Finally, we predict the summary
one word at a time from the context vector, following what is
typical in seq2seq models.
We evaluate our technique in two stages. First, we collect
over 51m Java methods from the Sourcerer repository [24], and
preprocess them to form a dataset of around 2.1m methods
with suitable JavaDoc summary comments. We divide the
dataset into training/validation/testing sets and perform a set
of tests comparing results from our model to three competitive
baselines. We call this the standard experiment, because it
conforms to common practice in both SE and NLP venues.
Second, to evaluate the limits of our model in a scenario
without words from source code, we repeat the standard
experiment using only the AST for each Java method – in this
study, we assume no code words are available, as in obfuscated
code, poorly-written code, or situations in which there is only
bytecode (from which an AST can be extracted but code words
are likely to have been removed during compilation). This “no
code words” experiment simulates a situation unique to the
SE domain and, as we will show, is far more difficult than the
standard application of NMT in which a programmer provides
useful keywords. We call this the challenge experiment.
Our results, in a nutshell, are: 1) In the standard experiment,
our model and the competitive NLP baseline provide compara-
ble performance but with orthogonal predictions, implying that
they are good candidates for ensemble decoding. An ensemble
provides state-of-the-art performance of 20.9 BLEU (an 8%
improvement over the nearest baseline). 2) In the challenge
experiment, our model achieves 9.5 BLEU, versus 0 for any
baseline. This is a significant step forward in source code
summarization, since it requires zero meaningful code words.
We release all data, code, and implementation via our online
appendix (see Section X).
We target the problem of source code summarization of
subroutines – automatic generation of natural language de-
scriptions of subroutines. Specifically, we target summariza-
tion of Java methods, with the objective of creating method
summaries like those used in JavaDocs. While we limit the
scope of the experiments in this paper to a large Java dataset, in
principle the techniques described in this paper are applicable
to any programming language that has subroutines, from which
an AST can be computed, and from which text e.g. identifier
names can be extracted. Our scoping of our target problem is
consistent with the problem definition in many papers on code
summarization [1], [12], [13], [22], [25].
A solution to this problem would have many practical
applications. The primary practical application would be in
automatic documentation generation, to help programmers
write documentation more quickly, as well as understand code
that has not been documented. Of the 51m Java methods
we found in the Sourcerer dataset, only about 10% have
any sort of method summary, and only about 4% contain
summaries that met basic quality filters we define in Section V.
In our view, it seems likely that more than 4% “should” be
documented well, and an automatic summary generator would
help improve the amount of code that could be documented.
But more generally, our goal for this paper is to also
contribute to an ongoing academic debate about how to
represent source code to solve software engineering problems
using AI. As mentioned, there is reasonable doubt [23] that
neural-based techniques are even appropriate for software
engineering data; a recent workshop at AAAI’18 [26] focused
heavily on this debate. Given the long history of AI use to
solve SE problems [27], our sincere hope for this paper is
to provide insight into ways to build neural models of SE
data, even for researchers outside of the specific task of code
summarization. We have made significant efforts to keep our
data and techniques public and reproducible (see Section X)
to help these other researchers as much as possible.
An overview of this paper is below. In the next section
we cover background and related technologies. Then, we
introduce our proposed neural model. We then describe how
we obtained and processed the Java datasets we use. We
conduct the standard and challenge experiments on the same
set of Java methods. Finally, we spend significant space on
examples and discussion. We feel an in-depth look at examples
where the model worked and did not will provide key insights
for improving or adapting the model in the future.
This section covers the supporting technologies behind our
work, plus related work in source code summarization.
A. Source Code Summarization
Related work in source code summarization can be broadly
classified as either AI/data-driven or heuristic/template-driven.
1) Data-driven: Among data-driven techniques, recent
work by Hu et al. [22] is the most-closely related to this
paper. That work proposes to use the AST to annotate the
words in the source code, and then use the annotated rep-
resentation as input to a seq2seq neural model. The model
itself is an off-the-shelf encoder-decoder; the main advance-
ment is the AST-annotated representation called Structure-
based Traversal (SBT). SBT is essentially a technique for
flattening the AST and ensuring that words in the code are
associated with their AST node type. For example, the code
request.remove(id) becomes:
( MethodInvocation
( SimpleName_request ) SimpleName_request
( SimpleName_remove ) SimpleName_remove
( SimpleName_id ) SimpleName_id
) MethodInvocation
The intent is that the words “request”, “remove”, and “id” be
associated with the context in which they appear, in this case
a MethodInvocation node. The SBT representation forms an
important baseline for comparison in our experiments in later
sections. A casual reader will note that SBT was shown in that
paper to obtain remarkable performance of 38 BLEU, but we
caution that this is not directly comparable to the results in our
experiments. The reason is that in [22], the dataset was split
by function, so the training, validation, and test sets contain
random selections of functions in the entire dataset. In contrast,
we split by project. In [22], functions from the same project
can be in both the training and test sets. In our experiments, all
methods from a project are either training, validation, or test.
In addition, we performed other preprocessing such as auto-
generated code removal (see Section V), to avoid situations
where identical methods appear in both training and test sets.
Taken together, we expect that the nominal performance scores
for all approaches will be far lower in our experiments.
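To make the flattening concrete, the following is a minimal sketch of an SBT-style traversal, written for illustration only. It is not the implementation of Hu et al. [22]; the `Node` class and the `sbt` function are hypothetical names, and each node label is assumed to already combine the AST node type with its code word.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical AST node: a label plus an ordered list of children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def sbt(node: Node) -> List[str]:
    """Flatten a tree into a bracketed token sequence, SBT-style."""
    tokens = ["(", node.label]
    for child in node.children:
        tokens.extend(sbt(child))
    # The label is repeated after the closing bracket so that the
    # end of every subtree stays unambiguous in the flat sequence.
    tokens.extend([")", node.label])
    return tokens


# The request.remove(id) example from the text.
tree = Node("MethodInvocation", [
    Node("SimpleName_request"),
    Node("SimpleName_remove"),
    Node("SimpleName_id"),
])
print(" ".join(sbt(tree)))
```

Joined with spaces, the printed tokens reproduce the five-line listing for request.remove(id) as a single sequence.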
Other related AI/data approaches to generating summaries
of subroutines include 1) work by Hu et al. [28] focusing
on creating summaries from sequences of API calls, and 2)
CODE-NN by Iyer et al. [19] which, similar to [22], creates
a custom word representation of code which it then feeds to
an off-the-shelf seq2seq model.
There is also related work outside the task of subroutine
summaries. Jiang et al. [20] and Loyola et al. [29] generate
descriptions of code changes (i.e. commit messages). Alla-
manis et al. [18] predict a name for a subroutine from the
body of a subroutine. Oda et al. [17] create pseudocode from
source code by adapting statistical machine translation. Yin et
al. [21], Movshovitz et al. [30], and Allamanis et al. [31]
target comments of short snippets of code, a task facilitated
by public datasets [32]. Gu et al. [33] have demonstrated using
a neural model for source code search, another task growing
in popularity and facilitated by public datasets [34]. Of note is
that the attentional encoder-decoder seq2seq model originally
described by Bahdanau et al. [35] is at the core of many of
these papers, as it provides strong baseline performance even
for many software engineering tasks.
2) Heuristic/Template-based: Haiduc et al. [11], [36] is
often cited as the first attempt to create text summaries of
code, and indeed is the first to introduce the term “source code
summarization.” These early approaches create extractive summaries
by calculating the top-n keywords with metrics such
as TF/IDF. Shortly thereafter, work by Sridhara et al. [12],
[37] adapted SWUM [10] (a technique for finding parts of
speech of words in code) to create short summary phrases for
source code using templates. Another template-based solution
by McBurney et al. [1] also used SWUM, but summarized
a subroutine’s context (defined as the functions that call or
are called by a method) in addition to the method context.
Rodeghero et al. [15] made further improvements to content
extraction for heuristic and template solutions by modifying
the heuristics to mimic how human programmers read code
with their eyes. As in other research areas related to natural
language generation [38], data-driven techniques have largely
supplanted template-based techniques due to a much higher
degree of flexibility and reduced human effort in template
creation. We direct readers to a comprehensive survey by
Nazar et al. [39].
B. Neural Machine Translation
The workhorse of most Neural Machine Translation (NMT)
systems is the attentional encoder-decoder architecture [40].
This architecture originated in work by Bahdanau et al. [35]
and is explained in great detail by a plethora of very highly-
regarded sources [41]–[45]. In this section, we cover only the
concepts necessary to understand our approach at a high level.
In an encoder-decoder architecture, there are a minimum of
two recurrent neural networks (RNNs). The first, called the
encoder, converts an arbitrary-length sequence into a single
vector representation of a specified length. The second, called
the decoder, converts the vector representation given by the
encoder into another arbitrary-length sequence. The sequence
inputted to the encoder is one language e.g. English, and the
sequence from the decoder is another language e.g. French.
Encoder-decoder architectures learn to predict sentences one
word at a time – the decoder generally does not try to predict
a whole sentence at once. The way this usually works is that
during training, instead of sending the network:
[ cat on the table ] => [ chat sur la table ]
The network receives 1) the whole input sequence, 2) a
sequence of output words so far, plus 3) the correct next word:
[ cat on the table ] => [ chat 0 0 0 ] + [ sur ]
[ cat on the table ] => [ chat sur 0 0 ] + [ la ]
[ cat on the table ] => [ chat sur la 0 ] + [ table ]
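The expansion of one sentence pair into next-word examples can be sketched as follows. This is an illustrative sketch rather than the preprocessing code used in our experiments; the function name `training_examples` and the constant `PAD` are assumptions. It starts from the first target word, as in the example above; a common alternative, not shown, is to prepend a start-of-sentence token so that the first word is also predicted.

```python
from typing import List, Tuple

PAD = "0"  # padding symbol, matching the 0 used in the example above

Example = Tuple[List[str], List[str], str]


def training_examples(source: List[str], target: List[str]) -> List[Example]:
    """Expand one (source, target) pair into next-word prediction examples.

    Each example holds the whole source sequence, the target words seen
    so far (padded to the full target length), and the correct next word.
    """
    examples = []
    for i in range(1, len(target)):
        so_far = target[:i] + [PAD] * (len(target) - i)
        examples.append((source, so_far, target[i]))
    return examples


source = "cat on the table".split()
target = "chat sur la table".split()
for src, so_far, next_word in training_examples(source, target):
    print(src, "=>", so_far, "+", [next_word])
```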
During inference, the trained model is given an input
sequence, which is used to predict the first word in the
output sentence. Then the input sentence is sent to the model
again, along with the first prediction. The decoder outputs a
prediction for the second word in the sentence, and so on, until
the decoder predicts an end-of-sentence token.
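The inference loop described above amounts to greedy decoding. A minimal sketch follows, assuming a hypothetical `predict_next` callable that stands in for the trained model, a hypothetical end-of-sentence token `</s>`, and a cap on the output length; none of these names come from our implementation.

```python
from typing import Callable, List


def greedy_decode(predict_next: Callable[[List[str], List[str]], str],
                  source: List[str],
                  max_len: int = 13,
                  eos: str = "</s>") -> List[str]:
    """Predict one word at a time until the end token or max_len is reached."""
    output: List[str] = []
    while len(output) < max_len:
        # The full input is sent again, together with all predictions so far.
        word = predict_next(source, output)
        if word == eos:
            break
        output.append(word)
    return output


def toy_predict_next(source: List[str], so_far: List[str]) -> str:
    """Stand-in for a trained model: replays a fixed translation."""
    reference = ["chat", "sur", "la", "table", "</s>"]
    return reference[len(so_far)]


print(greedy_decode(toy_predict_next, "cat on the table".split()))
```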
The problem with this strategy is that the encoder is
burdened with creating a vector representation suitable for
prediction at every output step. In reality, some words in
the input sentence will be more important than others for a
particular output. E.g., ‘on’ for ‘sur’. This is the motivation for
“attentional” encoder-decoder networks [35]. Essentially what
happens is that instead of a single vector representation of
the input sentence, an attention mechanism is placed between
the encoder and decoder. That attention mechanism receives
the encoder’s state at every time step – in the example above,
four vectors, one for each of the four positions in the sentence. The
attention mechanism, in essence, selects which vector from the
encoder to use, so that different decoder predictions receive
input from different positions in the input sequence. Our work
builds on the attentional encoder-decoder strategy in key ways
that we describe in the next section.
This section describes our proposed neural model. The
model assumes a typical NMT architecture in which the model
is asked to predict one word at a time, as described in the
previous section.
A. Model Overview
Our model is essentially an attentional encoder-decoder
system, except with two encoders: one for code/text data and
one for AST data. In the spirit of maintaining simplicity where
possible, we used embedding and recurrent layers of equal
size for the encoders. We concatenate the output of attention
mechanisms for each encoder as depicted here:
Precedent for combining different data sources comes heav-
ily from image captioning [46]–[49] (e.g. merging convolution
image output with a list of tags). One aim in this paper is
to demonstrate how a similar concept is beneficial for code
summarization, in contrast to the usual seq2seq application to
SE data in which all information is put into one sequence.
We also hope to sow fertile ground for several areas of future
work in creating unique processing techniques for each data
type – treating software’s text and structure differently has a
long tradition [50].
B. Model Details
To encourage reproducibility and for clarity, we
explain our model as a walkthrough of our actual
Keras implementation. The following starts at line 29
in models/; all code is available
for download from our online appendix (Section X).
txt_input = Input(shape=(self.txtlen,))
com_input = Input(shape=(self.comlen,))
ast_input = Input(shape=(self.astlen,))
First, above, are three input layers corresponding to the
code/text sequence, the comment sequence, and the flattened
AST sequence. We chose the sequence lengths as a balance
between model size and coverage of the dataset. The sequence
sizes of 100 for code/text and AST, and 13 words for comment,
each cover at least 80% of the training set. Shorter sequences
are padded with zeros, and longer sequences are truncated.
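The padding and truncation step can be sketched as below. The helper name `pad_or_truncate` is hypothetical, and the sketch assumes zeros are appended at the end of short sequences and that long sequences are cut at the end; the text does not state which side is used.

```python
from typing import List


def pad_or_truncate(ids: List[int], length: int) -> List[int]:
    """Return exactly `length` ids: cut long sequences, zero-pad short ones."""
    return ids[:length] + [0] * max(0, length - len(ids))


short_comment = [17, 4, 256]           # three word ids
long_code = list(range(1, 151))        # 150 token ids

print(pad_or_truncate(short_comment, 13))
print(len(pad_or_truncate(long_code, 100)))
```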
ee = Embedding(output_dim=self.embdims,
se = Embedding(output_dim=self.embdims,
We start with a fairly common encoding structure, includ-
ing embedding layers for each of our encoded input types
(code/text and AST). The embedding will output a shape
of (batch_size, txtlen, embdims). What this
means is that for every example in a batch, each word in the sequence
has one vector of length embdims. For example, (200, 100,
100) means that for each of 200 examples in a batch, there
are 100 words and each word is represented by a 100 length
embedding vector. We found two separate embeddings to have
better performance than a unified embedding space.
ast_enc = CuDNNGRU(self.rnndims,
    return_state=True, return_sequences=True)
astout, sa = ast_enc(se)
Next is a GRU layer with rnndims units (we found 256
to provide good results without oversizing the model) to
serve as the AST encoding. We used a CuDNNGRU to
increase training speed, not for prediction performance. The
return_state flag is necessary so that we get the final
hidden state of the AST encoder. The return_sequences
flag is necessary because we want the state at every cell instead of
just the final state. We need the state at every cell for the
attention mechanism later.
txt_enc = CuDNNGRU(self.rnndims,
    return_state=True, return_sequences=True)
txtout, st = txt_enc(ee, initial_state=sa)
The code/text encoder operates in nearly the same way as
the AST encoder, except that we start the code/text GRU with
the final state of the AST GRU. The effect is similar to if we
had simply concatenated the inputs, except that 1) we keep
separate embedding spaces, 2) we allow for attention to focus
on each input differently rather than across input types, 3) we
ensure that one input is not truncated by an excessively long
sequence of the other input type, and 4) we “keep the door
open” for further processing e.g. via convolution layers that
would benefit one input type but not the other. As we show
in our evaluation, this is an important point for future work.
Tensor txtout would normally have shape (batch_size,
rnndims), an rnndims-length vector representation
of every input in the batch. However, since we have
return_sequences enabled, txtout has the shape
(batch_size, txtlen, rnndims), which is
the rnndims-length vector at every time-step. That is, the
rnndims-length vector at every word in the sequence. So we
see the status of the output vector as it changes with each
word in the sequence. We also have return_state enabled,
which just means that we get st, the rnndims vector from
the last cell. This is a GRU, so this st is the same as the
output vector, but we get it here anyway for convenience, to
use as the initial state in the decoder.
de = Embedding(output_dim=self.embdims,
dec = CuDNNGRU(self.rnndims,
decout = dec(de, initial_state=st)
The decoder is as described in many papers on NMT: a
dedicated embedding space followed by a recurrent layer. We
start the decoder with the final state of the code/text RNN.
txt_attn = dot([decout, txtout], axes=[2, 2])
txt_attn = Activation('softmax')(txt_attn)
The next step is the code/text attention mechanism, with a
design similar to that described by Luong et al. [40]. First,
we take the dot product of the decoder and code/text encoder
output. The output shape of decout is, e.g., (batch_size,
13, 256) and txtout is (batch_size, 100, 256).
The axis 2 of decout is 256 long. The axis 2 of txtout is
also 256 long. So by computing the dot product along axis 2
in both, we get a tensor of shape (batch_size, 13, 100).
For one example in the batch, we get decout of (13, 256)
and txtout, transposed, of (256, 100).
decout (axis 2)       txtout (axis 2)       txt_attn
     1 2 .. 256            1  2  .. 100          1 2 .. 100
1    v1 →                  v3 v4               1 a b
2    v2 →                  ↓  ↓                2 c d
Where a is the dot product of vectors v1 and v3, and b is
the dot product of v1 and v4, etc.
The result is that each of the 13 positions in the decoder
sequence is now represented by a 100-length vector. Each
value in the 100-length vector reflects the similarity between
the element in the decoder sequence and the element in the
encoder sequence. I.e., b above reflects how similar element 1
in the decoder sequence is to element 2 in the code/text
encoder sequence. The 100-length vector for each of the 13
decoder positions reflects how much that decoder position
is similar to (should “pay attention to”) each position in the encoder sequence.
Then we apply a softmax to each of the 13 (100-length)
vectors. The effect is to exaggerate the “most similar” things,
so that “more attention” will be paid to the more-similar input
vectors – the network learns during training to make them
more similar. Note that the dot product here is not normalized,
so it is not necessarily equivalent to cosine similarity.
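The two attention lines can be mirrored on tiny matrices in plain Python. This is a sketch for one example in the batch, with toy sizes (2 decoder positions, 3 encoder positions, 2 dimensions) in place of 13, 100, and 256; the function names are hypothetical and the sketch is not the Keras code itself.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(decout: Matrix, txtout: Matrix) -> Matrix:
    """weights[i][j]: how much decoder position i attends to encoder position j."""
    scores = [[sum(d * e for d, e in zip(dec_vec, enc_vec))  # unnormalized dot product
               for enc_vec in txtout]
              for dec_vec in decout]
    return [softmax(row) for row in scores]


decout = [[1.0, 0.0], [0.0, 1.0]]              # 2 decoder positions
txtout = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # 3 encoder positions
txt_attn = attention_weights(decout, txtout)
for row in txt_attn:
    print([round(w, 3) for w in row])
```

Each row of the result sums to one, and the largest weight in a row falls on the encoder position whose vector has the largest dot product with that decoder position.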
txt_context = dot([txt_attn, txtout], axes=[2, 1])
Next, we make use of the attention vectors by using them
to create the context vectors for the code/text input. To do
that, we scale the encoder vectors by the attention vectors.
This is how we “pay attention” to particular areas of input
for specific outputs. The above line of code takes txt_attn,
with shape (batch_size, 13, 100), and computes the dot
product with txtout (batch_size, 100, 256). Recall that
the encoder output has txtlen = 100 elements along axis 1, since it takes a
sequence of 100 words. Axis 1 of this tensor means “for each
element of the input sequence.”
The multiplication, for each sample in the batch, is:
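txt_attn (axis 2)     txtout (axis 1)       txt_context
     1 2 .. 100            1  2  .. 256          1 2 .. 256
1    v1 →                  v3 v4               1 a b
2    v2 →                  ↓  ↓                2 c d
Here v1 and v2 are rows of attention weights, and v3 and v4 are columns
of txtout taken along axis 1; a is the dot product of v1 and v3, b is
the dot product of v1 and v4, etc.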
The result is a context matrix that has one context vector for
each element in the output sequence. This is different than the
vanilla sequence to sequence approach, which has only one
context vector used for every output. Each output sequence
location has its own context vector. This vector is created from
the most attended-to part of the encoder sequence.
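The context computation is a weighted sum of encoder vectors, one sum per decoder position. A minimal sketch with toy sizes follows; `context_vectors` is a hypothetical name and the attention weights are written by hand rather than produced by a softmax.

```python
from typing import List

Matrix = List[List[float]]


def context_vectors(attn: Matrix, txtout: Matrix) -> Matrix:
    """context[i] = sum over j of attn[i][j] * txtout[j]."""
    dims = len(txtout[0])
    return [[sum(weight * enc_vec[k] for weight, enc_vec in zip(weights, txtout))
             for k in range(dims)]
            for weights in attn]


attn = [[1.0, 0.0, 0.0],     # decoder position 1 attends only to encoder position 1
        [0.0, 0.5, 0.5]]     # decoder position 2 splits attention over positions 2 and 3
txtout = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

txt_context = context_vectors(attn, txtout)
print(txt_context)
```

With a one-hot attention row the context vector is exactly the attended encoder vector; with spread-out weights it is a blend of several encoder vectors.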
ast_attn = dot([decout, astout], axes=[2, 2])
ast_attn = Activation('softmax')(ast_attn)
ast_context = dot([ast_attn, astout], axes=[2, 1])
We perform the same attention operations to the AST
encoding as we do for the code/text encoding.
context = concatenate(
[txt_context, ast_context, decout])
But, we still need to combine the code/text and AST context
with the decoder sequence information. This is important
because we send each word one at a time, as noted in the
previous section. The model gets to look at the previous
words in the sentence in addition to the words in the encoder
sequences. It does not have the burden of predicting the entire
output sequence all at once. Technically, what we have here are
two context matrices with shape (batch_size, 13, 256)
and a decout with shape (batch_size, 13, 256). The
default axis is -1, which means the last part of the shape
(the 256 one in this case). This creates a tensor of shape
(batch_size, 13, 768): one 768-length vector for each
of the 13 input elements instead of three 256-length vectors.
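The shape arithmetic of the concatenation can be checked with a short sketch for one example in the batch. The helper `concat_last_axis` is a hypothetical stand-in for concatenation along the last axis, and the constant fill values are placeholders for real activations.

```python
from typing import List

Matrix = List[List[float]]


def concat_last_axis(*matrices: Matrix) -> Matrix:
    """Join matrices row by row along the last axis (per decoder position)."""
    return [[value for row in rows for value in row]
            for rows in zip(*matrices)]


positions, dims = 13, 256
txt_context = [[0.1] * dims for _ in range(positions)]
ast_context = [[0.2] * dims for _ in range(positions)]
decout = [[0.3] * dims for _ in range(positions)]

context = concat_last_axis(txt_context, ast_context, decout)
print(len(context), len(context[0]))
```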
out = TimeDistributed(Dense(self.rnndims,
We are nearing the point of predicting a next word. A
TimeDistributed layer applies the same dense layer, with shared
weights, to each vector in
the context matrix. The result is one rnndims-length vector
for every element in the decoder sequence. For example, one
256-length vector for each of the 13 positions in the decoder
sequence. Essentially, this applies one predictor at each of
the 13 decoder positions.
out = Flatten()(out)
out = Dense(self.comvocabsize,
However, we are trying to output a single word, the next
word in the sequence. Ultimately we need a single output
vector of length comvocabsize. So we first flatten the (13,
256) matrix into a single (3328) vector, then we use a dense
output layer of length comvocabsize, and apply softmax.
model = Model(inputs=[txt_input, com_input,
ast_input], outputs=out)
The result is a model with code/text, AST, and comment
sequence inputs, and a predicted next word in the comment
sequence as output.
C. Hardware Details
The hardware on which we implemented, trained, and tested
our model included one Xeon E5-1650v4 CPU, 64 GB RAM,
and two Quadro P5000 GPUs. It was necessary to train on
GPUs with 16 GB VRAM due to the large size of our model.
We prepared a large corpus of Java methods from the
Sourcerer repository provided by Lopes et al. [24]. The
repository contains over 51 million Java methods from over
50000 projects. We considered updating the repository with
new downloads from GitHub, but we found that the Sourcerer
dataset was quite thorough, leading to a large amount of
overlap with newer projects that could not be eliminated (due
to name changes, code cloning, etc.). This overlap could lead
to major validity problems for our experiments (e.g., if testing
samples were inadvertently placed in the training set). We
decided to use the Sourcerer projects exclusively.
Significant preparation was necessary to make the repository
a suitable dataset for applications of NMT, and we view
this preparation as an important contribution of this paper
to the research field (unlike in the NMT research area, there
are relatively few curated datasets for code summarization).
After downloading the archives, we used a toolkit provided by
McMillan et al. [51] to extract the Java methods and organize
them into a SQL database. Then we filtered for methods that
were preceded by JavaDoc comments (indicated by /**). We
used only comments intended as JavaDocs, because there is
an assumption that the first sentence in the comment will be a
summary of the method’s behavior [2]. Then we extracted the
first sentence by looking for the first period, or the first newline
if no period was present. Next we used the langdetect
library to remove comments not in English. About 4m methods
remained after these steps.
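The first-sentence rule can be sketched as follows. The function name `first_sentence` is hypothetical, and the sketch assumes the comment text has already been stripped of the comment delimiters and leading asterisks; it is not the extraction code used to build the dataset.

```python
def first_sentence(comment: str) -> str:
    """Return the text up to the first period, else up to the first newline."""
    text = comment.strip()
    period = text.find(".")
    if period != -1:
        # Collapse internal whitespace in case the sentence spans several lines.
        return " ".join(text[:period].split())
    newline = text.find("\n")
    if newline != -1:
        return text[:newline].strip()
    return text


print(first_sentence("Returns the player's hitpoint count. Never negative."))
print(first_sentence("Removes the request\n@param id the request id"))
```

A known weakness of a rule this simple is that a period inside an abbreviation or a version number ends the sentence early.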
A potential problem was auto-generated code. Auto-
generated code is a problem because both the code and
comments created by auto-generators tend to be very similar.
If nearly-identical code is in the training and testing sets,
the model will learn these cases easily, which could simul-
taneously reduce performance on the “real” examples while
falsely inflating performance metrics such as BLEU, since
the metrics would reward the model for correctly identifying
the duplicate cases. Happily, the solution is fairly simple: we
remove any methods from files that include phrases such as
“generated by”, as suggested by Shimonaka et al. [52]. This filter
is quite aggressive, as it reduced the dataset size to around 2m
methods, and on manual inspection we found no cases of auto-
generated code. In fact, a majority of the filtered methods were
exact duplicates (around 100k unique examples out of 2m
removed methods). But because comments to auto-generated
code are often still meaningful, we added one copy of each of
the 100k unique examples back into the dataset, and ensured
that they were in the training set only (so we did not attempt to
test against auto-generated comments). The result is a dataset
of around 2.1m methods.
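A minimal sketch of this filtering step, under stated assumptions: the record layout (dicts with 'file', 'code', and 'comment' keys), the function name, and the use of a single marker phrase are all hypothetical, since the paper only says that phrases "such as" the one shown were used.

```python
def filter_autogenerated(methods, file_text, marker="generated by"):
    """Split methods into kept ones and one copy of each unique removed one.

    methods:   list of dicts with 'file', 'code', 'comment' (assumed layout)
    file_text: dict mapping file name -> full file contents
    """
    kept, removed = [], []
    for m in methods:
        # A method is dropped if its whole file contains the marker phrase.
        if marker in file_text[m["file"]].lower():
            removed.append(m)
        else:
            kept.append(m)
    # Keep a single copy of each unique (code, comment) pair among the
    # removed methods; these are meant to go into the training set only.
    unique = {}
    for m in removed:
        unique.setdefault((m["code"], m["comment"]), m)
    return kept, list(unique.values())
```

The second return value corresponds to the examples that are added back for training and never used for testing.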
Our other preprocessing steps followed the practice of many
software engineering papers. We split the code and comments
on camel case and underscore, removed non-alpha characters,
and set to lower case. We did not perform stemming.
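These preprocessing steps can be sketched in a few lines. The regular expressions below are one plausible reading of "split on camel case and underscore", not the authors' exact rules, and the function name preprocess is an assumption.

```python
import re

# Zero-width boundaries: lower->Upper (tokenUrl) and the end of an
# acronym followed by a capitalized word (HTTPResponse).
_CAMEL = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def preprocess(text):
    """Split camel case, drop non-alpha characters, lower-case, tokenize."""
    text = _CAMEL.sub(" ", text)
    # Underscores, digits, and punctuation all become separators here.
    text = re.sub(r"[^A-Za-z]+", " ", text)
    return text.lower().split()
```

Applied to the method in Example 1, this yields the same token sequence as the tokenized code/text input shown there (without the <s> and </s> markers).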
We then split the dataset by project into training, validation,
and test sets. By “by project” we mean that we randomly
divided the projects into the three groups: 90% of projects
into training, 5% into validation, and 5% into testing. Then
all the methods from a project went into the group assigned
to its project. A side effect is that since projects have different
numbers of methods, 91% of methods are in training, 4.8%
in validation, and 4.2% in testing. But this slight variation
is necessary to maintain a realistic situation. As mentioned
in Section III, we respectfully believe that not splitting by
project and not removing auto-generated code are mistakes
made by a vast majority of previous NMT applications to code
summarization, and artificially inflate the reported scores (for
example, SBT is reported to have 38 BLEU, versus 14 BLEU
with the same technique in our evaluation).
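The split-by-project procedure can be sketched as below. The function name, the seed, and the input layout (a mapping from method ID to project name) are assumptions for illustration.

```python
import random


def split_by_project(method_projects, seed=0, train=0.90, val=0.05):
    """Assign every method to the group of its project.

    method_projects: dict mapping method id -> project name (assumed layout)
    Returns a dict mapping method id -> 'train' | 'val' | 'test'.
    """
    projects = sorted(set(method_projects.values()))
    random.Random(seed).shuffle(projects)
    n_train = round(len(projects) * train)
    n_val = round(len(projects) * val)
    group = {}
    for i, project in enumerate(projects):
        if i < n_train:
            group[project] = "train"
        elif i < n_train + n_val:
            group[project] = "val"
        else:
            group[project] = "test"
    # Methods inherit the group of their project, so no project ever
    # contributes methods to more than one set.
    return {mid: group[p] for mid, p in method_projects.items()}
```

Because the percentages apply to projects rather than methods, the per-method proportions drift slightly from 90/5/5, which is the side effect noted above.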
To obtain the ASTs, we first used srcml [53] to extract an
XML representation of each method. Then we built a tool
to convert the XML representation into the flattened SBT
representation, to generate SBT-formatted output described by
Hu et al. [22]. Finally, we created our own modification of
SBT in which all the code structure remained intact, but in
which we replaced all words (except official Java API class
names) in the code with a special <OTHER> token. We call
this SBT-AO for SBT AST only. We use this modification to
simulate the case when only an AST can be extracted.
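The SBT-AO substitution can be sketched as follows. This assumes a token layout like the one visible in Example 1, where a node type directly follows each parenthesis and any other token is a word from the code; the function name and the api_names allow-list are hypothetical, and multi-word node types are assumed to be single tokens.

```python
def to_sbt_ao(tokens, api_names):
    """Replace code words in an SBT token sequence with <OTHER>.

    tokens:    SBT tokens, e.g. "( name ) name tokenUrl".split()
    api_names: set of Java API class names to keep (assumed allow-list)
    """
    out = []
    prev = None
    for tok in tokens:
        if tok in ("(", ")") or prev in ("(", ")"):
            # Parentheses and the node type right after them are structure.
            out.append(tok)
        elif tok in api_names:
            out.append(tok)
        else:
            out.append("<OTHER>")
        prev = tok
    return out
```

With "String" in the allow-list, the parameter declaration from Example 1 keeps its type name while the parameter name is masked.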
From this corpus of Java methods, we create two datasets:
• The standard dataset contains three elements for each
Java method: 1) the pre-processed Java source code for
the method, 2) the pre-processed comment, and 3) the
SBT-AO representation of the Java code.
• The challenge dataset contains two elements for each
method: 1) the pre-processed comment, and 2) the
SBT-AO representation of the Java code.
Technically, we also have a third dataset containing the
default SBT representation (with code words) and the pre-
processed comment, which we use for experiments to compare
our approach to the baselines. However, the standard and
challenge datasets are our focus in this paper, intended to
compare the case when internal documentation is available,
and the much more difficult case with only an AST.
This section covers our evaluation, comparing our approach
to baselines over the standard and challenge datasets.
A. Research Questions
Our research objective is to determine the performance
difference between our approach and competitive baseline
approaches in two situations that we explore through these
Research Questions (RQs):
RQ1: What is the difference in performance between our
approach and competitive approaches in the “standard”
situation, assuming internal documentation?
RQ2: What is the performance of our approach in the
“challenge” situation, assuming an AST only?
The rationale for these RQs was largely covered in the
Introduction and Background sections. Essentially, existing
applications of NMT for the problem of code summarization
almost entirely rely on the programmer writing meaning-
ful internal documentation such as identifier names. As we
will show, this assumption makes the problem “easy” for
seq2seq NMT models, since many methods have internal
documentation that is very similar to the summary comment
(a phenomenon also observed by Tan et al. [54] and Louis et
al. [55]). We ask RQ1 in order to study the performance of
our approach under this assumption.
In contrast, we ask RQ2 because the assumption of internal
documentation is often not valid. Very often, only the bytecode
is available, or programmers neglect to write good internal
documentation, or code has even been obfuscated deliberately.
In these cases, it is usually still possible to extract an AST
for a method, even if it contains no meaningful words. In
principle, the structure of a program is all that is necessary
to understand it, since ultimately that is what defines the
behavior of the program. In practice, it is very difficult to
connect structure directly to high-level concepts described in
summaries. We seek to quantify a baseline performance level
with our approach (since, to our knowledge, no published
approach functions in this situation).
B. Baselines
To answer RQ1 (the standard experiment), we compare
our approach to three baselines. One baseline (which we call
attendgru) is a generic attentional encoder-decoder model,
to represent an application of a strong off-the-shelf approach
from the NLP research area. Note that there is a huge variety
of NMT systems described in the NLP literature, but that a
vast majority have an attentional encoder-decoder model at
their heart (see Section III). To maintain an “apples to apples”
comparison, the baseline is identical to the “code/text” encoder
in our approach (the decoder is identical as well). In essence,
the baseline is the same as our proposed approach, except
without the AST encoder and associated concatenation. While
we could have chosen any number of approaches from NLP
literature, it is very difficult to say up front which will perform
best for code summarization, and we needed to ensure minimal
differences to maximize validity of our results. If, for example,
we had used an architecture with an LSTM instead of a GRU
in the encoder, we would have no way of knowing if the
difference between our approach and the baseline were due
to the AST information we added, or due to using an LSTM
instead of a GRU.
A second baseline is the SBT approach presented by Hu et
al. [22]. This approach was presented at ICPC’18, and (at
the time of writing) represents the latest publication about
source code summarization in a software engineering venue.
That paper used an LSTM-based encoder-decoder architecture
based on a popular guide for building seq2seq NMT systems,
but used their SBT representation of code instead of the source
code only. For our baseline, we use their SBT representation,
but use the same GRU-based encoder-decoder from our NLP
baseline, also to ensure an “apples to apples” comparison.
Since the model architecture is the same, we can safely
attribute performance differences to the input format (e.g.,
SBT vs. code-only).
A third baseline is codenn, presented by Iyer et al. [19].
Given the complexity of the approach, we used their publicly-
available implementation. The original paper describes only
applications to SQL and C#, but we noticed that their C#
parser extracted common code features that are also available
in Java. We made small modifications to the C# parser so that
it would function equivalently for Java.
We call our approach ast-attendgru in our experiments.
We used a greedy search algorithm for inference for all
approaches, rather than beam search, to minimize the number
of experimental variables and computation cost.
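Greedy inference, as opposed to beam search, keeps only the single most likely word at each step. A minimal sketch, assuming a hypothetical predict_next callable that stands in for a trained model and returns a word-to-probability mapping; the default length of 13 follows the summary length mentioned in the caption of Fig. 2.

```python
def greedy_decode(predict_next, max_len=13, start="<s>", end="</s>"):
    """Build a summary one word at a time, always taking the top word.

    predict_next: callable taking the words predicted so far and returning
                  a dict word -> probability (hypothetical interface)
    """
    summary = [start]
    while len(summary) < max_len:
        probs = predict_next(summary)
        word = max(probs, key=probs.get)  # greedy choice, no beam
        summary.append(word)
        if word == end:
            break
    return summary
```

Beam search would instead keep several partial summaries alive at each step, which adds a beam-width parameter and extra computation.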
C. Methodology
Our methodology to answer both RQs is identical, and
follows best practice established throughout the literature on
NMT (see Section III): for RQ1, we train our approach and
each baseline with the training set from the standard dataset for
a total of 10 epochs. Then, for each approach, we computed
performance metrics for the model after each epoch against
the validation set. (In all cases, validation performance began
to degrade after five or six epochs.) Next we chose the model
after the epoch with the highest validation performance, and
computed performance metrics for this model against the
testing set. These testing results are the results we report in
this paper. Our methodology to answer RQ2 differs only in
that we trained and tested using the challenge dataset.
We report the performance metric BLEU [56], also in
keeping with standard practice in NMT. BLEU is a mea-
sure of the text similarity between predicted summaries
and reference summaries. We report a composite BLEU
score in addition to BLEU1 through BLEU4 (BLEUn is a
measure of the similarity of n-length subsequences, versus
entire summary sentences). Technically speaking, we used
nltk.translate.bleu_score [57] in our implementation.
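To make BLEUn concrete, the sketch below computes the clipped n-gram precision for one predicted summary against one reference. It is a standard-library illustration of the quantity BLEU is built on, not the nltk implementation used in the evaluation; a full BLEU score also applies a brevity penalty and combines the n-gram orders, and the function names here are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-length subsequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Fraction of candidate n-grams found in the reference, with counts
    clipped so a repeated n-gram cannot be credited more than it occurs."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total
```

For the attendgru prediction in Example 1 ("returns the url of the token" against "sets the token url"), three of six unigrams and one of five bigrams match, so the unigram precision is 0.5 and the bigram precision is 0.2.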
D. Threats to Validity
The primary threats to validity of this evaluation include:
1) Our dataset. We use a very large dataset with millions
of Java methods in order to maximize the generalizability of
our results, but the possibility remains that we would obtain
different results with a different dataset. And, 2) we did not
perform cross-validation. We attempt to mitigate this risk by
using random samples to split the training/validation/testing
sets, but a different split could result in different performance.
This risk is common among NMT experiments due to very
high training computation costs (4+ hours per epoch).
This section discusses our evaluation results and obser-
vations. After answering our research questions, we explore
examples to give an insight into how the network functions
and why it works. Note that we use these observations to
build an ensemble method at the end of this paper.
A. RQ1: Standard Experiment
We found in the standard experiment that ast-attendgru
and attendgru obtain roughly equal performance in terms of
BLEU score, but provide orthogonal results, as we will explain
in this section and the example in subsection C.
In terms of BLEU score, ast-attendgru and attendgru
are roughly equal in performance: 19.6 BLEU vs 19.4 BLEU.
SBT is lower, at about 14 BLEU, and codenn is about 10
BLEU. Figure 1 includes a table with the full BLEU results
for each result (and additional data in our online appendix).
model          B      B1    B2    B3    B4    dataset
ast-attendgru  19.6   39.3  22.2  14.9  11.4  standard
attendgru      19.4   39.0  22.0  14.8  11.3  standard
sbt            14.0   31.8  16.0  10.1  7.5   standard
codenn         9.95   21.2  9.7   7.6   6.3   standard
ast-attendgru  9.47   25.7  11.0  6.1   4.7   challenge
Fig. 1: The table reports BLEU1-4 scores and the composite BLEU score (B)
for each approach and dataset; the chart depicts the composite
scores only. We observe that attendgru and ast-attendgru perform
equally in terms of BLEU score on the standard set, though we
improve it with an ensemble decoder in Section VIII.
For SBT, the results conflicted with our expectations based
on the presenting paper [22], in which SBT outperformed
a standard seq2seq model like attendgru. We see two
possible explanations: First, even though our seq2seq baseline
implementation represents a standard approach, there are a
few architectural differences from the paper by Hu et al. [22],
such as different embedding vector sizes. While we did not
observe major changes in the results from these architectural
differences in our own pilot studies, it is possible that “one’s
mileage may vary” depending on the dataset. Second, as we
note in Sections III and V, the previous study did not split
by project, so methods in the same project will be in the
training and test set. The very high reported BLEU scores
in [22] could be explained by overloaded methods with very
similar structure – SBT would detect a function in the test set
with a very similar AST to an overloaded method in the same
project in the training set.
The improvement by all approaches over codenn matches
expectations from previous experiments. The codenn ap-
proach was intended as a versatile technique for both code
search and summarization, and was a relatively early attempt
at applying NMT to the code summarization problem. In
addition, it was designed for C# and SQL datasets; we adapted
it to Java as described in the previous section.
A key observation of the standard experiment is that
ast-attendgru and attendgru provide orthogonal predic-
tions – there is a set of methods in which one performs better,
and a different set in which the other has higher performance.
While ast-attendgru is slightly ahead of attendgru, we
do not view a 0.2 BLEU difference as a major improvement in
and of itself. Normally we would expect an approach to outperform
a different approach by some margin across a majority
of the examples (i.e., non-orthogonal performance), and this is
indeed what we observe when comparing ast-attendgru
to SBT, as shown on the left below (around 60k methods in
which ast-attendgru performed better, vs. 20k for SBT):
But what we observe for ast-attendgru and attendgru
is that there are two sets of roughly 33k methods in the
91k test set in which one or another approach has higher
performance (above, right). In other words, among the predictions
in which there was a difference between the approaches,
ast-attendgru gives better predictions (in terms of
BLEU score) for about half, while attendgru performs
better on about half. Orthogonal performance makes these two
approaches a good candidate for ensemble prediction, which
we further explain in subsection C and Section VIII.
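The per-method comparison behind these counts can be sketched as a simple tally. The input layout (a mapping from method ID to a per-method BLEU score for each approach) and the function name are assumptions.

```python
def count_wins(scores_a, scores_b):
    """Count methods where approach A scores higher, lower, or the same.

    scores_a, scores_b: dicts mapping method id -> per-method BLEU score
    for the same set of test methods (assumed layout).
    """
    a_better = sum(1 for m in scores_a if scores_a[m] > scores_b[m])
    b_better = sum(1 for m in scores_a if scores_a[m] < scores_b[m])
    ties = len(scores_a) - a_better - b_better
    return a_better, b_better, ties
```

Two win counts of similar size alongside near-equal aggregate scores is the pattern described here as orthogonal performance; a lopsided tally, as in the comparison with SBT, indicates that one approach is better across most of the test set.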
B. RQ2: Challenge Experiment
We obtain a BLEU score of about 9.5 for ast-attendgru
in the challenge experiment. Note that the only difference
between the standard and challenge experiments is that we
trained and tested using the AST only, in the form of the
SBT-AO representation fed to ast-attendgru. Technically,
there are other configurations that would produce the same
result, such as using SBT-AO as input to attendgru instead
of the source code. Any of these configurations would meet
our objective with this experiment of establishing performance
for the scenario when only an AST is available.
C. Explanation and Example
Merely reporting BLEU scores leaves an open question as to
what the scores mean in practice. Consider these two examples
from the standard and challenge experiments (method IDs
align with our downloadable dataset for reproducibility). We
chose the following examples for illustrative purposes, and as
an aid for explanation. While relatively short, we feel that these
methods provide a useful insight into how the models operate.
For a more in-depth analysis, a human evaluation would be
required, which is beyond the scope of this paper.
Example 1, Method ID 49111725:
    public Config tokenUrl(String tokenUrl) {
        this.tokenUrl = tokenUrl;
        return this;
    }
reference                  sets the token url
ast-attendgru (standard)   sets the token url
attendgru (standard)       returns the url of the token
sbt (standard)             sets the <UNK>
ast-attendgru (challenge)  sets the value of the <UNK> property
Tokenized code/text input: <s> public config token url string token
url this token url token url return this </s>
SBT-AO input: ( unit ( function ( specifier ) specifier OTHER ( type (
name ) name OTHER ) type ( name ) name OTHER ( parameter list
( parameter ( decl ( type ( name ) name String ) type ( name )
name OTHER ) decl ) parameter ) parameter list ( block ( expr stmt
( expr ( name ( name ) name OTHER ( operator ) operator OTHER
( name ) name OTHER ) name ( operator ) operator OTHER (
name ) name OTHER ) expr ) expr stmt ( return ( expr ( name
) name OTHER ) expr ) return ) block ) function ) unit
Fig. 2: Heatmaps of the attention layer in (a) attendgru and (b)
ast-attendgru for the code/text input for Example 1. The x-axis
is the 13 positions in the summary input. The y-axis is the 100
positions in the code input. Images are truncated to code input length.
Example 1 is one of the cases where ast-attendgru
succeeds when attendgru fails. To understand why, recall
that, in our model as with a majority of NMT systems, the
system predicts a sentence one word at a time. For each
word, the model receives information about the method (the
code/text plus the AST for models that use it), along with each
word that has been predicted so far. So to predict “token”,
ast-attendgru would receive the code/text, the AST, and
the phrase “sets the”. In contrast, attendgru only receives
the code/text and “sets the”. To predict the first word, “sets”,
attendgru only knows that it is the start of the sentence
(indicated by a start-of-sentence <s> token), and the code/text
input. To help make the prediction, attendgru is equipped
with an attention layer learned during training to attend to
certain parts of the input. That layer is depicted in Figure 2(a).
Note that there is high activation (bright yellow) in position
(14, 1), indicating significant attention paid to location 14 in
the code/text input: this is the word return. What has happened
is that, during training, the model saw many examples of getter
methods that were only a few lines and ended with a return.
In many cases, the model could rely on very explicit method
names, such as getPlayerScore (method ID 38221679).
attendgru performed remarkably well in these cases, as the
situation is quite like natural language – it learns to align words
in the input vocabulary to words in the target vocabulary, and
where they belong in a sentence. However, in cases such as
Example 1 where the method name does not clearly state what
the method should do (the name tokenUrl is not obviously a
setter), attendgru struggles to choose the right words, even
if, as in Example 1, it correctly identifies the subject of the
action (“url of the token”).
These situations are where the AST is beneficial. The
code/text activation layer for ast-attendgru attends heavily
to the start of sentence token (note column 0 in Figure 2(b)),
which, since <s> is the start of every sentence, probably acts
like a “not sure” signal. But the model also has the AST input.
Figure 3 shows the AST attention layer of ast-attendgru
when trying to predict the first word. There are four areas
of interest that help elucidate how the model processes the
structure of the method, denoted A through D in the figure,
and color-coded to the corresponding areas in the AST input.
First, area A is the portion of the method signature prior to the
parameter. Recall that our AST representation is structure only,
so almost all methods will start the same way. So as expected,
the attention in area A is largely formless. The heatmap shows
much more definition in area B. It is the parameter list, and the
model has likely learned that short methods with parameter
lists tend to be setters. The model activates very heavily at
locations C and D, which are the start and end of the expr stmt
AST node. A very common situation in the training set is that
a short method with a parameter and an assignment is a setter.
The model has learned this and chose “sets” as the first word.
Fig. 3: Heatmap of the attention layer in ast-attendgru for the AST input for Example 1. The x-axis is the summary input and the
y-axis is the AST (SBT-AO) input. High activation (more yellow) indicates more attention paid to e.g. position 48 of the AST input.
All of the models with AST input correctly chose “sets”.
SBT found that the method is a setter, but could not determine
what was being set – we attribute this behavior to the fact
that the SBT representation blends the code/text and structural
information into a single input, which creates a challenge for
the model to learn orthogonal types of information in the same
vector space (which work in other areas, e.g., image captioning,
implies is not advisable [58]). While there is not space in this
paper to explore fully, we note that even ast-attendgru
during the challenge experiment correctly characterized the
method as setting the value of a property, generating an
unknown token when it could not determine which property.
In fact, ast-attendgru correctly predicted the first word of
the summary (which is usually a verb) 33% of the time during
the challenge experiment, compared to 52% of the time in the
standard experiment. Briefly consider Example 2:
Example 2, Method ID 40490666:
    public void disconnect() {
        try {
            connected = false;
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
reference                  closes the socket for reconnection
ast-attendgru (standard)   disconnect from the server
attendgru (standard)       disconnects from the server
sbt (standard)             disconnect from the server
ast-attendgru (challenge)  closes the connection
All approaches performed well for this method, but for
different reasons. attendgru linked the method name to
the verb “disconnects”. SBT relied more on later features
such as the call to notifyDisconnect(). Most interestingly,
ast-attendgru performed best in the challenge experiment.
In exploring this result, we found a few methods with a similar
AST (IDs 146827, 22838818, 28418561, 5785101). All of
these had a few lines in a try block followed by a short
catch block, and 2-3 method calls and assignments to null or
false in the try. These methods had summaries like “close the
communication with the gps device”, “stops the timer”, and
“disconnect from the current client” – all these methods deal
with close and cleanup behavior. The model probably learned
this during training, and chose similar words for the summary.
In answering RQ1, we found that attendgru and
ast-attendgru performed better on different sets of meth-
ods. While we are hesitant to overinterpret single examples, the
examples in this section are consistent with numerous others in
the dataset (we provide a script for randomly sampling exam-
ples called rand samples in our online appendix for
interested readers). The examples are also consistent with the
interpretation that the off-the-shelf NMT system (attendgru)
performs quite well in cases where the summaries are clear
from the method signature, and in these cases the AST may
be superfluous. But, the model benefits from the AST in cases
when words in the code/text input are not sufficient or clear.
As a hint toward future work, we test a combination of
the attendgru and ast-attendgru models using ensem-
ble decoding. The combination itself is straightforward: we
compute an element-wise mean of the output vector of each
model (the same trained models used in our evaluation). The
training and test procedure does not change, except that during
prediction, we use the maximum value of the combined output
vector, rather than just one output vector from one model.
This is the same ensemble decoding procedure implemented
by OpenNMT [59], and is one of the most common of several
options described by literature on multi-source NMT [60].
Since we are combining output vectors, the models “work
together” during prediction of every word – it is not just
choosing one model or another for the whole sentence. The
idea is that one model may assign similar weights in the output
vector to two or more words, in cases where it performs less
well. And another model that performs better in that situation
may assign more weight to a single word. In our system, the
hope is that attendgru will contribute more when code/text
words are clear, but ast-attendgru will contribute more
when they are unclear.
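One prediction step of this ensemble can be sketched as below, using plain lists in place of the models' output vectors. The function name and the three-word vocabulary in the usage note are illustrative assumptions.

```python
def ensemble_step(output_a, output_b):
    """Average two models' output vectors and pick the top word index.

    output_a, output_b: lists of per-word weights over the same vocabulary,
    one from each trained model, for the word currently being predicted.
    """
    mean = [(a + b) / 2 for a, b in zip(output_a, output_b)]
    best = max(range(len(mean)), key=mean.__getitem__)
    return best, mean
```

For instance, with a vocabulary of ["sets", "returns", "gets"], a model that is undecided ([0.45, 0.45, 0.10]) combined with one that is confident ([0.70, 0.20, 0.10]) yields index 0, "sets": the confident model breaks the tie without overriding the other model outright.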
The ensemble decoding procedure improves performance
to 20.9 BLEU, from 19.6 for ast-attendgru and 19.4
for attendgru. This is more than a full BLEU point im-
provement, which is quite significant for a relatively simple
procedure. This result points us to future work including
more advanced ensemble decoding (e.g. predicting when to
use one model or another), optimizations to the network (e.g.
dropout, parameter tuning), and, critically, using different data
processing techniques on each type of input.
We have presented a neural model for generating natural
language descriptions of subroutines. We implement our model
and evaluate it over a large dataset of Java methods. We
demonstrate that our model ast-attendgru, in terms of
BLEU score, outperforms baselines from SE literature and is
slightly ahead of a strong off-the-shelf approach from NLP
literature. We also demonstrate that an ensemble of our
approach and the off-the-shelf NLP approach outperforms all
other tested configurations. We provide a walkthrough example
to provide insight into how the models work – we conclude
that the default NMT system works well in situations where
good internal documentation is provided, but less well when
it is not provided, and that ast-attendgru assists in these
cases. We demonstrate how ast-attendgru can produce
coherent predictions even with zero internal documentation.
Our dataset, code, models, and results are available via:
This work is supported in part by the NSF CCF-1452959,
CCF-1717607, and CNS-1510329 grants. Any opinions, findings, and
conclusions expressed herein are the authors’ and do not necessarily
reflect those of the sponsors.
[1] P. W. McBurney and C. McMillan, “Automatic source code summa-
rization of context for java methods,” IEEE Transactions on Software
Engineering, vol. 42, no. 2, pp. 103–119, 2016.
[2] D. Kramer, “Api documentation from source code comments: a case
study of javadoc,” in Proceedings of the 17th annual international
conference on Computer documentation. ACM, 1999, pp. 147–153.
[3] A. Von Mayrhauser and A. M. Vans, “Program comprehension during
software maintenance and evolution,” Computer, no. 8, pp. 44–55, 1995.
[4] S. Letovsky, “Cognitive processes in program comprehension,” Journal
of Systems and software, vol. 7, no. 4, pp. 325–339, 1987.
[5] B. Cornelissen, A. Zaidman, A. Van Deursen, L. Moonen, and
R. Koschke, “A systematic survey of program comprehension through
dynamic analysis,” IEEE Transactions on Software Engineering, vol. 35,
no. 5, pp. 684–702, 2009.
[6] J. I. Maletic and A. Marcus, “Supporting program comprehension
using semantic and structural information,” in Proceedings of the 23rd
International Conference on Software Engineering. IEEE Computer
Society, 2001, pp. 103–112.
[7] A. Forward and T. C. Lethbridge, “The relevance of software documen-
tation, tools and technologies: a survey,” in Proceedings of the 2002
ACM symposium on Document engineering. ACM, 2002, pp. 26–33.
[8] D. van Heesch. (2018) Doxygen website. [Online]. Available:
[9] B. P. Eddy, J. A. Robinson, N. A. Kraft, and J. C. Carver, “Evaluating
source code summarization techniques: Replication and expansion,”
in Program Comprehension (ICPC), 2013 IEEE 21st International
Conference on. IEEE, 2013, pp. 13–22.
[10] E. Hill, L. Pollock, and K. Vijay-Shanker, “Automatically capturing
source code context of nl-queries for software maintenance and reuse,”
in Proceedings of the 31st International Conference on Software Engi-
neering. IEEE Computer Society, 2009, pp. 232–242.
[11] S. Haiduc, J. Aponte, L. Moreno, and A. Marcus, “On the use of
automated text summarization techniques for summarizing source code,”
in Reverse Engineering (WCRE), 2010 17th Working Conference on.
IEEE, 2010, pp. 35–44.
[12] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker,
“Towards automatically generating summary comments for java meth-
ods,” in Proceedings of the IEEE/ACM international conference on
Automated software engineering. ACM, 2010, pp. 43–52.
[13] S. Rastkar, G. C. Murphy, and A. W. Bradley, “Generating natural
language summaries for crosscutting source code concerns,” in Software
Maintenance (ICSM), 2011 27th IEEE International Conference on.
IEEE, 2011, pp. 103–112.
[14] L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. Pollock, and K. Vijay-
Shanker, “Automatic generation of natural language summaries for
java classes,” in Program Comprehension (ICPC), 2013 IEEE 21st
International Conference on. IEEE, 2013, pp. 23–32.
[15] P. Rodeghero, C. Liu, P. W. McBurney, and C. McMillan, “An eye-
tracking study of java programmers and application to source code
summarization,” IEEE Transactions on Software Engineering, vol. 41,
no. 11, pp. 1038–1054, 2015.
[16] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton, “A survey
of machine learning for big code and naturalness,” arXiv preprint
arXiv:1709.06182, 2017.
[17] Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and
S. Nakamura, “Learning to generate pseudo-code from source code using
statistical machine translation (t),” in Automated Software Engineering
(ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015,
pp. 574–584.
[18] M. Allamanis, H. Peng, and C. Sutton, “A convolutional attention
network for extreme summarization of source code,” in International
Conference on Machine Learning, 2016, pp. 2091–2100.
[19] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer, “Summarizing source
code using a neural attention model,” in Proceedings of the 54th Annual
Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), vol. 1, 2016, pp. 2073–2083.
[20] S. Jiang, A. Armaly, and C. McMillan, “Automatically generating
commit messages from diffs using neural machine translation,” in Pro-
ceedings of the 32nd IEEE/ACM International Conference on Automated
Software Engineering. IEEE Press, 2017, pp. 135–146.
[21] P. Yin, B. Deng, E. Chen, B. Vasilescu, and G. Neubig, “Learning to
mine aligned code and natural language pairs from stack overflow,” in
International Conference on Mining Software Repositories, ser. MSR.
ACM, 2018, pp. 476–486.
[22] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, “Deep code comment
generation,” in Proceedings of the 26th Conference on Program Com-
prehension. ACM, 2018, pp. 200–210.
[23] V. J. Hellendoorn and P. Devanbu, “Are deep neural networks the best
choice for modeling source code?” in Proceedings of the 2017 11th
Joint Meeting on Foundations of Software Engineering. ACM, 2017,
pp. 763–773.
[24] C. Lopes, S. Bajracharya, J. Ossher, and P. Baldi, “UCI source
code data sets,” 2010. [Online]. Available:
[25] K. Richardson, S. Zarrieß, and J. Kuhn, “The code2text challenge: Text
generation in source code libraries,” arXiv preprint arXiv:1708.00098,
[26] W. Cohen and P. Devanbu, “Workshop on nlp for software engineering,”
2018. [Online]. Available:
[27] T. Xie, “Intelligent software engineering: Synergy between ai and soft-
ware engineering,” in Proceedings of the 11th Innovations in Software
Engineering Conference. ACM, 2018, p. 1.
[28] X. Hu, G. Li, X. Xia, D. Lo, S. Lu, and Z. Jin, “Summarizing source
code with transferred api knowledge.” in IJCAI, 2018, pp. 2269–2275.
[29] P. Loyola, E. Marrese-Taylor, and Y. Matsuo, “A neural architecture for
generating natural language descriptions from source code changes,”
arXiv preprint arXiv:1704.04856, 2017.
[30] D. Movshovitz-Attias and W. W. Cohen, “Natural language models for
predicting programming comments,” in Proceedings of the 51st Annual
Meeting of the Association for Computational Linguistics (Volume 2:
Short Papers), vol. 2, 2013, pp. 35–40.
[31] M. Allamanis, D. Tarlow, A. Gordon, and Y. Wei, “Bimodal modelling
of source code and natural language,” in International Conference on
Machine Learning, 2015, pp. 2123–2132.
[32] A. V. M. Barone and R. Sennrich, “A parallel corpus of python functions
and documentation strings for automated code documentation and code
generation,” arXiv preprint arXiv:1707.02275, 2017.
[33] X. Gu, H. Zhang, and S. Kim, “Deep code search,” in Proceedings of
the 40th International Conference on Software Engineering. ACM,
2018, pp. 933–944.
[34] Z. Yao, D. S. Weld, W.-P. Chen, and H. Sun, “Staqc: A systemati-
cally mined question-code dataset from stack overflow,” arXiv preprint
arXiv:1803.09371, 2018.
[35] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[36] S. Haiduc, J. Aponte, and A. Marcus, “Supporting program comprehension
with source code summarization,” in Proceedings of the 32nd
ACM/IEEE International Conference on Software Engineering-Volume
2. ACM, 2010, pp. 223–226.
[37] G. Sridhara, L. Pollock, and K. Vijay-Shanker, “Automatically detecting
and describing high level actions within methods,” in Proceedings of the
33rd International Conference on Software Engineering. ACM, 2011,
pp. 101–110.
[38] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with
recurrent neural networks,” in Proceedings of the 28th International
Conference on Machine Learning (ICML-11), 2011, pp. 1017–1024.
[39] N. Nazar, Y. Hu, and H. Jiang, “Summarizing software artifacts: A
literature review,” Journal of Computer Science and Technology, vol. 31,
no. 5, pp. 883–909, 2016.
[40] M.-T. Luong, H. Pham, and C. D. Manning, “Effective ap-
proaches to attention-based neural machine translation,” arXiv preprint
arXiv:1508.04025, 2015.
[41] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks,” in Advances in neural information processing
systems, 2014, pp. 3104–3112.
[42] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, p. 436, 2015.
[43] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.
MIT press Cambridge, 2016, vol. 1.
[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances
in Neural Information Processing Systems, 2017, pp. 5998–6008.
[45] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat,
F. Viégas, M. Wattenberg, G. Corrado et al., “Google’s multilingual
neural machine translation system: Enabling zero-shot translation,”
Transactions of the Association for Computational Linguistics, vol. 5,
no. 1, pp. 339–351, 2017.
[46] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia, “Abc-
cnn: An attention based convolutional neural network for visual question
answering,” arXiv preprint arXiv:1511.05960, 2015.
[47] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention
networks for image question answering,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp. 21–
[48] J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully convolutional
localization networks for dense captioning,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp.
[49] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep
captioning with multimodal recurrent neural networks (m-rnn),” arXiv
preprint arXiv:1412.6632, 2014.
[50] A. Marcus and J. I. Maletic, “Recovering documentation-to-source-code
traceability links using latent semantic indexing,” in Proceedings of the
25th international conference on software engineering. IEEE Computer
Society, 2003, pp. 125–135.
[51] C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu,
“Portfolio: finding relevant functions and their usage,” in Proceedings
of the 33rd International Conference on Software Engineering. ACM,
2011, pp. 111–120.
[52] K. Shimonaka, S. Sumi, Y. Higo, and S. Kusumoto, “Identifying auto-
generated code by using machine learning techniques,” in Empirical
Software Engineering in Practice (IWESEP), 2016 7th International
Workshop on. IEEE, 2016, pp. 18–23.
[53] M. L. Collard, M. J. Decker, and J. I. Maletic, “Lightweight trans-
formation and fact extraction with the srcml toolkit,” in Source Code
Analysis and Manipulation (SCAM), 2011 11th IEEE International
Working Conference on. IEEE, 2011, pp. 173–184.
[54] S. H. Tan, D. Marinov, L. Tan, and G. T. Leavens, “@tcomment: Testing
javadoc comments to detect comment-code inconsistencies,” in 2012
IEEE Fifth International Conference on Software Testing, Verification
and Validation, April 2012, pp. 260–269.
[55] A. Louis, S. K. Dash, E. T. Barr, and C. A. Sutton, “Deep learning to
detect redundant method comments,” CoRR, vol. abs/1806.04616, 2018.
[56] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method
for automatic evaluation of machine translation,” in Proceedings of
the 40th annual meeting on association for computational linguistics.
Association for Computational Linguistics, 2002, pp. 311–318.
[57] R. H. C. T. L. Chin Yee Lee, Hengfeng Li. (2018) Nltk
translate bleu score calculator v3.3. [Online]. Available: https:
[58] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A
neural image caption generator,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2015, pp. 3156–3164.
[59] S.-A. Grönroos. (2018) Opennmt ensemble decoding. [Online]. Available: py/pull/732
[60] E. Garmash and C. Monz, “Ensemble learning for multi-source neural
machine translation,” in Proceedings of COLING 2016, the 26th Inter-
national Conference on Computational Linguistics: Technical Papers,
2016, pp. 1409–1418.