Improved Code Summarization via a Graph Neural Network

Alexander LeClair
University of Notre Dame
South Bend, IN 46556
Sakib Haque
University of Notre Dame
South Bend, IN 46556
Lingfei Wu
IBM Research
Yorktown Heights, NY 10598
Collin McMillan
University of Notre Dame
South Bend, IN 46556
Automatic source code summarization is the task of generating
natural language descriptions for source code. Automatic code
summarization is a rapidly expanding research area, especially as
the community has taken greater advantage of advances in neural
network and AI technologies. In general, source code summarization
techniques use the source code as input and output a natural
language description. Yet a strong consensus is developing that
using structural information as input leads to improved performance.
The first approaches to use structural information flattened the
AST into a sequence. Recently, more complex approaches based
on random AST paths or graph neural networks have improved
on the models using flattened ASTs. However, the literature still
does not describe using a graph neural network together with the
source code sequence as separate inputs to a model. Therefore, in
this paper, we present an approach that uses a graph-based neural
architecture that better matches the default structure of the AST
to generate these summaries. We evaluate our technique using a
data set of 2.1 million Java method-comment pairs and show
improvement over four baseline techniques, two from the software
engineering literature, and two from machine learning literature.
Automatic documentation, neural networks, deep learning, artificial
ACM Reference format:
Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2016.
Improved Code Summarization via a Graph Neural Network. In Proceed-
ings of ACM Conference, Washington, DC, USA, July 2017 (Conference’17),
12 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
Source code summarization is the task of writing brief natural
language descriptions of code [ ]. These descriptions
have long been the backbone of developer documentation such as
JavaDocs [ ]. The idea is that a short description allows a
programmer to understand what a section of code does and that code’s
purpose in the overall program, without requiring the programmer
to read the code itself. Summaries like “uploads log files to the
backup server” or “formats decimal values as scientific notation”
can give programmers a clear picture of what code does, saving
them the time of comprehending the details of that code.
Automatic code summarization is a rapidly expanding research
area. Programmers are notorious for neglecting the manual effort
of writing summaries themselves [ ], and automation
has long been cited as a desirable alternative [ ]. The term “source
code summarization” was coined around ten years ago [ ] and
since that time the field has proliferated. At first, the dominant
strategy was based on sentence templates and heuristics derived
from empirical studies [ ]. Starting around
2016, data-driven strategies based on neural networks came to
the forefront, leveraging gains from both the AI/NLP and mining
software repositories research communities [3, 23, 25, 29].
These data-driven approaches were inspired by neural machine
translation (NMT) from natural language processing. In NMT, a
sentence in one language, e.g. English, is translated into another
language, e.g. Spanish. A dataset of millions of examples of English
sentences paired with Spanish translations is required. A neural
architecture based on the encoder-decoder model is used to learn the
mapping between words and even the correct grammatical structure
from one language to the other based on these examples. This
works well because both input and output languages are sequences
of roughly equal length, and mappings of words tend to exist across
languages. The metaphor in code summarization is to treat source
code as one language input and summaries as another. So code
would be input to the same models’ encoder, and summaries to the
decoder. Advances in repository mining made it possible to gather
large datasets of paired examples [33].
But evidence is accumulating that the metaphor to NMT has major
limits [ ]. Source code has far fewer words that map directly
to summaries than the NMT use case [ ]. Source code tends not
to be of equal length to summaries; it is much longer [ ]. And
crucially, source code is not merely a sequence of words. Code is
a complex web of interacting components, with different classes,
routines, statements, and identifiers connected via different
relationships. Software engineering researchers have long recognized
that code is much more suited to graph or tree representations
that tease out the nuances of these relationships [ ]. Yet, the
typical application of NMT to code summarization treats code as a
sequence to be fed into a recurrent neural network (RNN) or similar
structure designed for sequential information.
arXiv:2004.02843v2 [cs.SE] 7 Apr 2020
Conference’17, July 2017, Washington, DC, USA LeClair, et al.
The literature is beginning to recognize the limits to sequential
representations of code for code summarization. Hu et al. [ ]
annotate the sequence with clues from the abstract syntax tree (AST).
LeClair et al. [ ] expand on this idea by separating the code and
AST into two different inputs. Alon et al. [ ] extract paths from
the AST to aid summarization. Meanwhile, Allamanis et al. [ ]
propose using graph neural networks (GNNs) to learn representations
of code (though for the problem of code generation, not
summarization). These approaches all show how neural networks
can be effective in extracting information from source code better
in a graph or tree form than in a sequence of tokens, and using that
information for downstream tasks such as summarization.
What is missing from the literature is a thorough examination of
how graph neural networks improve representations of code based
on the AST. There is evidence that GNN-based representations
improve performance, but the degree of that improvement for code
summarization has not been explored thoroughly, and the reasons
for the improvement are not well understood.
In this paper, we present an approach for improving source code
summarization using GNNs. Specifically, we target the problem
of summarizing program subroutines. Our approach is based on
the model presented by Xu et al. [ ], though with a few
modifications to customize the model to a software engineering
context. In short, we use the GNN-based encoder of Xu et al. to
model the AST of each subroutine, combined with the RNN-based
encoder used by LeClair et al. [ ] to model the subroutine as a
sequence. We demonstrate a 4.6% BLEU score improvement for a
large, published dataset [ ] as compared to recent baselines. In an
experiment, we use techniques from the literature on explainable
AI to propose explanations for why the approach performs better
and in which cases. We seek to provide insights to guide future
researchers. We make all our data, implementations, and
experimental framework available via our online appendix (Section 9).
We target the problem of automatically generating summaries of
program subroutines. To be clear, the input is the source code of a
subroutine, and the output is a short, natural language description
of that subroutine. These summaries have several advantages when
put into documentation, such as decreased time to understand code
[ ], improved code comprehension [ ], and making code
more searchable [ ]. Programmers are notorious for consuming
high quality documentation for themselves, while neglecting to
write and update it themselves [ ]. Therefore, recent research
has focused on automating the documentation process. Current
research has had success generating summaries for a subset of
methods that are generally shorter and use simpler language in both
the code and reference comment (e.g. setters and getters), but has
had a problem with methods that have more complex structures or
language [ ]. A similar situation for other SE research problems
has been helped by various graph representations of code [ ],
but using graph representations is only starting to be accepted
for code summarization [ ]. Graph representations have the
potential to improve code summarization because, instead of using
only a sequence of code tokens as input, the model can access a
rich variety of relationships among tokens.
Automatic documentation has a large potential impact on how
software is developed and maintained. Not only would automatic
documentation reduce the time and energy programmers spend
reading and writing software, having a high level summary
available has been shown to improve results in other SE tasks such as
code categorization and code search [22, 28].
This section discusses previous work relevant to this paper and to
source code summarization.
3.1 Source Code Summarization
Source code summarization research can be broadly categorized
as either 1) heuristic/template-driven approaches or 2) more
recent AI/data-driven approaches. Heuristic-based approaches for
source code summarization started to gain popularity in 2010 with
work done by Haiduc et al. [ ]. In their work, text retrieval
techniques and latent semantic indexing (LSI) were used to pick
important keywords out of source code, and those words were then
considered the summary. Early work done by Haiduc et al. and others
has helped inspire other work using extractive summarization
techniques based on TF-IDF, LSI, and LDA to create a summary.
Heuristic-based approaches are less related to this work than
data-driven approaches, so due to space limitations we direct readers
to surveys by Song et al. [ ] and Nazar et al. [ ] for additional
background on the topic.
This paper builds on the current work done with data-driven
approaches in source code summarization, which have dominated
NLP and SE literature since around 2015. In Table 1, we divide
recent work into two groups by their use of the AST as an input to
the model. Then we further divide related work by the following
six attributes:
(1) Src Code - A model uses the source code sequence as input, not
as part of the AST.
(2) AST - A model uses the AST as input.
(3) API - A model uses API information.
(4) FlatAST - A model uses a flattened version of the AST as input.
(5) GNN - A model uses a form of graph neural network for
node/edge embedding.
(6) Paths - A model uses a path through the AST as input.
Src Code AST API FlatAST GNN Paths
2016 Iyer et al. [25] x
2017 Loyola et al. [34] x
2017 Lu et al. [35] x x
2018 Hu et al. [24] x x
2018 Liang et al. [31] x x
2018 Hu et al. [23] x x x
2018 Wan et al. [56] x x
2019 LeClair et al. [29] x x x
2019 Alon et al. [3] x x x
2019 Fernandes et al. [16] x x
Table 1: Comparison of recent data-driven source code
summarization research categorized by the data, architectures,
and approaches used. The approaches in the upper table
use only the source code sequence as input to their models,
while the bottom table approaches use the AST or a
combination of AST and source code.
A brief history of the related data-driven work starts with Iyer
et al. [ ]. In their work they used Stack Overflow questions and
responses where the title of the post was considered the high level
summary, and the source code in the top rated response was used
as the input. The model they developed was an attention-based
sequence-to-sequence model similar to those used in neural machine
translation tasks [ ]. To expand on this idea, Hu et al. [ ] added
API information as an additional input into the model. They found
that the model was able to generate better responses if it had access
to information provided by API calls in the source code.
Later, Hu et al. [ ] developed a structure-based traversal (SBT)
method for flattening the AST into a sequence that keeps words in
the code associated with their node type. The SBT sequence is a
combination of source code tokens and AST structure which was
then input into an off-the-shelf encoder/decoder model. LeClair
et al. [ ] built upon this work by creating a multi-input model
that used the SBT sequence with all identifiers removed as the first
input, and the source code tokens as the second. They found that
decoupling the structure of the code from the code itself improved
the model’s ability to learn that structure.
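
The flattening idea described above can be illustrated with a small sketch. It is not the exact token format of Hu et al.; it only assumes a bracketed pre-order traversal in which each code token stays attached to its node type. The `Node` class, the `kind_token` label format, and the toy tree are hypothetical. The `strip_identifiers` helper mirrors the structure-only input described for the multi-input model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical AST node: a node type plus an optional code token."""
    kind: str
    token: str = ""
    children: List["Node"] = field(default_factory=list)


def label(node: Node) -> str:
    # Keep the token attached to its node type, e.g. "name_getTotal".
    return f"{node.kind}_{node.token}" if node.token else node.kind


def sbt(node: Node) -> List[str]:
    """Bracketed pre-order traversal in the spirit of SBT (assumed format)."""
    seq = ["(", label(node)]
    for child in node.children:
        seq.extend(sbt(child))
    seq.extend([")", label(node)])
    return seq


def strip_identifiers(node: Node) -> Node:
    """Copy of the tree with every token removed, leaving structure only."""
    return Node(node.kind, "", [strip_identifiers(c) for c in node.children])


# Toy tree for something like: int getTotal() { return total; }
tree = Node("function", children=[
    Node("type", "int"),
    Node("name", "getTotal"),
    Node("block", children=[Node("return", children=[Node("name", "total")])]),
])

flat = sbt(tree)                                 # tokens and structure
structure_only = sbt(strip_identifiers(tree))    # structure without tokens
```

Because every node is wrapped in its own pair of brackets, the tree shape can be recovered from the flat sequence, which is what lets a sequence model see some structure.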
More recently, Alon et al. [ ] in 2018 proposed a source code
summarization technique that would encode each pairwise path
between nodes in the AST. They would then randomly select a
subset of these paths for each iteration in training. These paths
were then encoded and used as input to a standard encoder/decoder
model. They found that encoding the AST paths allowed the model
to generalize to unseen methods more easily, as well as providing a
level of regularization by randomly selecting a subset of paths each
training iteration.
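
A minimal sketch of path extraction and per-iteration sampling follows. It is an illustration under stated assumptions, not the implementation of Alon et al.: the toy tree in `PARENT` is made up, paths are taken between leaf nodes only, and a path is simply the list of node types from one leaf up to the lowest common ancestor and down to the other leaf.

```python
import random
from itertools import combinations
from typing import Dict, List, Optional

# Hypothetical toy AST stored as a child -> parent map (the root maps to None).
PARENT: Dict[str, Optional[str]] = {
    "function": None,
    "type": "function", "name": "function", "block": "function",
    "return": "block", "expr": "return",
}


def ancestors(node: str) -> List[str]:
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def ast_path(a: str, b: str) -> List[str]:
    """Path from a, up to the lowest common ancestor, then down to b."""
    up, down = ancestors(a), ancestors(b)
    common = next(n for n in up if n in down)
    return up[:up.index(common) + 1] + down[:down.index(common)][::-1]


def leaves() -> List[str]:
    parents = set(PARENT.values())
    return sorted(n for n in PARENT if n not in parents)


def sample_paths(k: int, rng: random.Random) -> List[List[str]]:
    """Draw a fresh random subset of pairwise leaf paths (one training step)."""
    all_paths = [ast_path(a, b) for a, b in combinations(leaves(), 2)]
    return rng.sample(all_paths, min(k, len(all_paths)))


rng = random.Random(0)
batch_paths = sample_paths(2, rng)  # a different subset can be drawn each step
```

Drawing a new subset on every call is what gives the regularizing effect mentioned above: the model never sees exactly the same set of paths for a method twice in a row.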
Then in 2019 Fernandes et al. [ ] developed a GNN-based model
that uses three graph representations of source code as input: 1)
next token, 2) AST, and 3) last lexical use. To represent these three
graphs they used a shared node setup where each graph represented
a different set of edges between source code tokens. Using this
approach they observed a better “global” view of the method and
had success with maintaining the central named entity from the
method. Fernandes’ observation is an important clue that there is
additional information embedded in the source code beyond the
sequence of tokens; this motivates the use of a GNN for the AST as
a separate input in our work.
3.2 Neural Machine Translation
For the last six years work in neural machine translation (NMT)
has been dominated by the encoder-decoder model architecture
developed by Bahdanau et al. [ ]. The encoder-decoder architecture
can be thought of as two separate models, one to encode the input
(e.g. English words) into a vector representation, and one to decode
that representation into the desired output (e.g. German tokens).
Commonly, encoder-decoder models use a recurrent layer (RNN,
GRU, LSTM, etc.) in both the encoder and decoder. Recurrent
layers are effective at learning sequence information because for
each token in a sequence, information is propagated through the
layer [ ]. This allows each token to affect the following tokens
in the sequence. Some of the common recurrent layers such as
the GRU and LSTM also can return state information at each time
step of the input sequence. The state information output from the
encoder is commonly used to seed the initial state of the decoder,
improving translation results [53].
Another more recent addition to many encoder-decoder models
is the attention mechanism. The intuition behind attention is that
not all tokens in a sequence are of equal importance to the final
output prediction. Attention tries to learn which words are
important and to map input tokens to output tokens. It does this
by taking the input sequence at every time step and the predicted
sequence at a given time step, and trying to determine which time
step in the input will be most useful for predicting the next token
in the output.
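
The following sketch shows one common way to realize the idea just described. It assumes dot-product scoring between a decoder state and each encoder state, followed by a softmax; the text does not specify a scoring function, so that choice and all names and numbers here are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum first for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state: Vector,
           encoder_states: List[Vector]) -> Tuple[List[float], Vector]:
    """Return (weights, context) for a single decoder time step."""
    # One score per input time step: how relevant is it to this decoder state?
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: encoder states averaged by their attention weights.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights, context = attend([0.0, 5.0], encoder_states)
```

In this toy input the decoder state lines up with the second encoder state, so the second input time step receives most of the weight.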
3.3 Graph Neural Networks
Graph Neural Networks are another key background technology
to this paper. A recent survey by Wu et al. [ ] categorizes GNNs
into four groups:
(1) Recurrent Graph Neural Networks (RecGNNs)
(2) Convolutional Graph Neural Networks (ConvGNNs)
(3) Graph Autoencoders (GAEs)
(4) Spatial-temporal Graph Neural Networks (STGNNs)
We will focus on ConvGNNs in this section because they are well
suited for this task, and they are what we use in this paper. ConvGNNs
were developed after RecGNNs and were designed with the same
idea of message passing between nodes. They have also been shown
to encode spatial information better than RecGNNs and are able to
be stacked, improving the ability to propagate information across
nodes [ ]. ConvGNNs take graph data and learn representations
of nodes based on the initial node vector and its neighbors in the
graph. The process of combining the information from neighboring
nodes is called “aggregation.” By aggregating information from
neighboring nodes a model can learn representations based on
arbitrary relationships. These relationships could be the hidden
structures of a sentence, the parts of speech [ ], dependency parsing
trees [ ], or the sequence of tokens [ ]. ConvGNNs have been
used for similar tasks before, such as in graph2seq for semantic
parsing [58] and natural question generation [8].
[Figure 1 diagram labels: Src sequence, AST nodes, AST edges, Context, Output]
Figure 1: High level diagram of model architecture for 2-hop model
ConvGNNs also allow nodes to get information from other nodes
that are further than just a single edge or “hop” away. In the figure
below we show an example partial AST and what 1, 2, and 3 hops
look like for the token ‘function’. Each time a hop is performed,
the node gets information from its neighboring nodes. So, on the
first hop the token ‘function’ aggregates information from the
nodes ‘block’, ‘specifier’, ‘name’, ‘type’, and ‘parameter list’. In
the next hop, the ‘function’ node will still only combine
information from its neighbors, but now each of those nodes will
have aggregated information from their children. For example, the
node ‘block’ will contain information from the ‘expr stmt’ node.
Then when the ‘function’ node aggregates the ‘block’ node, it has
information from both ‘block’ and ‘expr stmt’.
[Figure: example partial AST marking the nodes within 1 hop, 2 hops,
and 3 hops of the ‘function’ token]
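
The hop behavior described above can be traced with a small sketch that tracks only which nodes' information has reached a given node, not the vectors themselves. The edge list uses the node names from the text; treating edges as undirected and updating all nodes simultaneously in each round are assumptions of this illustration.

```python
from typing import Dict, List, Set, Tuple

# Edges of the partial AST named in the text, treated as undirected here.
EDGES: List[Tuple[str, str]] = [
    ("function", "block"), ("function", "specifier"), ("function", "name"),
    ("function", "type"), ("function", "parameter list"),
    ("block", "expr stmt"),
]


def neighbors(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    adj: Dict[str, Set[str]] = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def receptive_field(node: str, hops: int, adj: Dict[str, Set[str]]) -> Set[str]:
    """Nodes whose information has reached `node` after `hops` rounds."""
    # Before any hop, each node only knows about itself.
    known = {n: {n} for n in adj}
    for _ in range(hops):
        # Every node merges what its direct neighbors knew in the last round.
        known = {n: known[n].union(*(known[m] for m in adj[n])) for n in adj}
    return known[node]


adj = neighbors(EDGES)
one_hop = receptive_field("function", 1, adj)
two_hop = receptive_field("function", 2, adj)
```

After one hop, ‘function’ has seen only its direct neighbors; ‘expr stmt’ arrives on the second hop, by way of ‘block’.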
There are several aggregation strategies which have been shown
to have different performance for different tasks [ ]. A common
aggregation strategy is to sum a node vector with its neighbors
and then apply an activation on that node, but there are many
schemes that can be used to combine node information. Some
other approaches are pooling, min, max, and mean. Xu et
al. discuss different node and edge aggregation methods in their
paper on creating sequences from graphs. They found that in most
cases a mean aggregator outperformed other types of aggregators,
including one using an LSTM.
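
As a rough illustration of how these aggregation strategies differ, the sketch below applies sum, mean, max, and min element-wise over a node vector and its neighbors. Including the node's own vector in every aggregator is an assumption made for simplicity, and the function names are hypothetical; the LSTM aggregator is not shown.

```python
from typing import Callable, Dict, List

Vector = List[float]


def agg_sum(vectors: List[Vector]) -> Vector:
    return [sum(col) for col in zip(*vectors)]


def agg_mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(col) for col in zip(*vectors)]


def agg_max(vectors: List[Vector]) -> Vector:
    return [max(col) for col in zip(*vectors)]


def agg_min(vectors: List[Vector]) -> Vector:
    return [min(col) for col in zip(*vectors)]


AGGREGATORS: Dict[str, Callable[[List[Vector]], Vector]] = {
    "sum": agg_sum, "mean": agg_mean, "max": agg_max, "min": agg_min,
}


def aggregate(node_vec: Vector, neighbor_vecs: List[Vector],
              how: str = "mean") -> Vector:
    """Combine a node vector with its neighbors, element by element."""
    return AGGREGATORS[how]([node_vec] + neighbor_vecs)


node = [1.0, 2.0]
neigh = [[3.0, 4.0], [5.0, 0.0]]
summed = aggregate(node, neigh, "sum")
averaged = aggregate(node, neigh, "mean")
```

One practical difference: a sum grows with the number of neighbors, while a mean stays on the same scale regardless of how many neighbors a node has.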
This section provides the details of our approach. Our model is
based on the neural model proposed by LeClair et al. [ ] and builds
on that work by using the ConvGNNs discussed in the previous section.
In a nutshell, our approach works in 5 steps:
(1) Embed the source code sequence and the AST node tokens.
(2) Encode the embedding output with a recurrent layer for
the source code token sequence and a ConvGNN for the
AST nodes and edges.
(3) Use an attention mechanism to learn important tokens in
the source code and AST.
(4) Decode the encoder outputs.
(5) Predict the next token in the sequence.
4.1 Model Overview
An overview of our model is in Figure 1. In a nutshell, we
modified the multi-input encoder-decoder model
proposed by LeClair et al. to use a ConvGNN instead of a flattened
AST. Notice in area A of Figure 1 that our model has four inputs: 1)
the sequence of source code tokens, 2) the nodes of the AST, 3) the
edges of the AST, and 4) the predicted sequence up to this point. Next,
in area B, we embed the inputs using standard embedding layers. The
source sequence and AST nodes share an embedding due to a large
overlap in vocabulary. Then in area C of Figure 1 the AST nodes are
input into the ConvGNN layers (the number of layers depends
on the hop size of the model), whose output is then input into a GRU.
The source code sequence goes into a GRU after the embedding in area D.
For the decoder in area H, we have an embedding layer feeding into
a GRU. We then apply two attention mechanisms, seen in area E: one
between the source code and summary, and the other between the
AST and summary. Then in areas F and G we combine the outputs
of our attention mechanisms, creating a context vector which is
flattened and used to predict the next token in the sequence.
Key Novel Component. The key novel component of this paper is
in processing the AST using a ConvGNN and combining the output
of the ConvGNN encoder with the output of the source code token
encoder. In our approach, the ConvGNN allows the nodes of the
AST to learn representations based on their neighboring nodes,
teaching the model information about the structure of the code
and how it relates to the tokens found in the source code sequence.
Both the source and AST encodings are input into separate
attention mechanisms with the decoder and are then concatenated.
This creates a context vector which we then use in a dense layer to
predict the next token in the sequence.
Basically, what we do is combine the structure of the sequence
(the AST) with the sequence itself (the source code). Combining
the structure of a sequence and the sequence itself into a model
has been shown to improve the quality of generated summaries in
both SE and NLP literature [ ]. In this paper we aim to show
that by using a neural network architecture that is more suited to the
structure of the data (graph vs sequence) we can further improve
the model’s ability to learn complex relationships in the source code.
4.2 Model Details
In this section we discuss specific model implementation details
that we used for our best performing model to encourage
reproducibility. We developed our proposed model using Keras [ ] and
TensorFlow [ ]. We also provide our source code and data as an
online appendix (details can be found in Section 9).
First, as mentioned in the previous section, our model is based on
the encoder-decoder architecture and has four inputs: 1) the source
code sequence, 2/3) the AST as a collection of nodes along with
an adjacency matrix with edge information, and 4) the comment
generated up to this point, which is the input to the decoder. As
seen in Figure 1, we use two embedding layers, one for the source
code and AST and one for the decoder. We use a single embedding
layer for both the source code and AST node inputs because they
have such a large overlap in vocabulary. The shared embedding
layer has a vocabulary size of 10908 and an embedding size of 100.
The decoder embedding layer has a vocabulary size of 10000 and
an embedding size of 100. So far, this follows the model proposed
by LeClair et al. [29].
Next, the model has two encoders, one for the source code
sequence and another for the AST. The source code sequence encoder
is a single GRU layer with an output length of 256. We have the
source code GRU return its hidden states to use as the initial state
for the decoder GRU. The second encoder, the AST encoder, is a
collection of ConvGNN layers followed by a GRU of length 256.
The number of ConvGNN layers depends on the number of hops
used; for our best model this was 2 hops, as seen in Figure 1.
The ConvGNN that we use for the AST node embeddings takes
the AST embedding layer output and the AST edge data as inputs.
Then, for each node in the input, it sums the current node vector
with each of its neighbors, multiplies that sum by a set of trainable
weights, and adds a trainable bias. In our best performing
implementation we use a ConvGNN layer for each hop in the model as
seen in Figure 1. We also test our model with different numbers of
hops, which slightly changes the architecture of the AST encoder
of the model by adding additional ConvGNN layers.
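
The layer computation described above can be sketched in plain Python as follows. This is a reading of the prose, not the released implementation: it assumes a symmetric 0/1 adjacency matrix with a zero diagonal, no activation function, and one `(weights, bias)` pair per hop. The function names and the toy graph are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conv_gnn_layer(nodes: Matrix, adjacency: Matrix,
                   weights: Matrix, bias: List[float]) -> Matrix:
    """One hop: sum each node with its neighbors, then apply weights and bias."""
    n, dim = len(nodes), len(nodes[0])
    # Equivalent to (A + I) @ H: the node's own vector plus its neighbors'.
    summed = [[nodes[i][d] + sum(adjacency[i][j] * nodes[j][d] for j in range(n))
               for d in range(dim)] for i in range(n)]
    projected = matmul(summed, weights)
    return [[v + b for v, b in zip(row, bias)] for row in projected]


def encode_ast(nodes: Matrix, adjacency: Matrix,
               layers: List[Tuple[Matrix, List[float]]]) -> Matrix:
    """Stack one ConvGNN layer per hop."""
    for weights, bias in layers:
        nodes = conv_gnn_layer(nodes, adjacency, weights, bias)
    return nodes


# Toy chain graph 0 - 1 - 2 with 2-dimensional node embeddings.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
identity, zero_bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

one_hop = encode_ast(H, A, [(identity, zero_bias)])
two_hop = encode_ast(H, A, [(identity, zero_bias)] * 2)
```

With identity weights the effect of stacking is easy to follow: after two layers, node 0 carries a contribution from node 2 even though the two are not directly connected.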
Next, we have two attention mechanisms: 1) an attention between
the decoder and the source code, and 2) an attention between the
decoder and the AST. These attention mechanisms learn which tokens
from the source code/AST are important to the prediction of the next
token in the decoder sequence given the current predicted sequence
generated up to this point. The outputs of the attention mechanisms
are then concatenated together with the decoder output to create a
final context vector. Then, we apply a dense layer to each vector in
our final context vector, which we then flatten and use to predict
the next token in the sequence.
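
A shape-level sketch of this last stage is given below. It assumes that the decoder output and both attention outputs have one vector per decoder time step, that the per-step dense layer has no activation, and that the final prediction is an argmax over a softmax; the weight matrices are tiny hand-picked values, not trained parameters, and all names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def dense(vec: List[float], weights: Matrix, bias: List[float]) -> List[float]:
    # weights has shape [len(vec)][output_size]; zip(*weights) walks its columns.
    return [sum(v * w for v, w in zip(vec, col)) + b
            for col, b in zip(zip(*weights), bias)]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]


def predict_next(decoder_out: Matrix, code_ctx: Matrix, ast_ctx: Matrix,
                 w_step: Matrix, b_step: List[float],
                 w_out: Matrix, b_out: List[float]) -> int:
    # Concatenate decoder output with both attention outputs at each time step.
    context = [d + c + a for d, c, a in zip(decoder_out, code_ctx, ast_ctx)]
    # Apply the same dense layer to every time-step vector, then flatten.
    projected = [dense(row, w_step, b_step) for row in context]
    flat = [v for row in projected for v in row]
    probs = softmax(dense(flat, w_out, b_out))
    return max(range(len(probs)), key=probs.__getitem__)


# Two decoder time steps, 1-dimensional states, vocabulary of three tokens.
token_id = predict_next(
    decoder_out=[[1.0], [0.0]], code_ctx=[[0.0], [1.0]], ast_ctx=[[1.0], [1.0]],
    w_step=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], b_step=[0.0, 0.0],
    w_out=[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    b_out=[0.0, 0.0, 0.0],
)
```

The point of the sketch is the data flow (concatenate, project per step, flatten, predict), which follows the order given in the paragraph above.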
4.3 Data Preparation
The data set that we used for this project was provided by LeClair et
al. in a paper on recommendations for source code summarization
datasets [ ]. LeClair et al. describe best practices for developing a
dataset for source code summarization and also provide a dataset of
2.1 million Java method-comment pairs. They provide their dataset
in two versions: 1) a filtered version with the raw, unprocessed
version of the methods and comments and 2) the tokenized version
where text processing has already been applied. For our baseline
comparisons, we use the tokenized version of the dataset provided
by LeClair et al., allowing us to directly compare results with their
work in source code summarization. The dataset did not include
ASTs that were already parsed, so we use the SrcML library [ ] to
generate the associated ASTs from the raw source code.
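
To make the node list and adjacency matrix inputs concrete, here is a sketch that walks an XML tree and emits both. The XML string is hand-written to loosely resemble srcML-style markup and is an assumption of this example; real srcML output carries namespaces and more element types, and no srcML tool is invoked here. Using undirected parent-child edges is also an assumption.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Hand-written, simplified markup for illustration only (not real srcML output).
XML = """
<function><type><name>int</name></type> <name>getTotal</name>
<parameter_list>()</parameter_list>
<block>{ <return>return <expr><name>total</name></expr>;</return> }</block>
</function>
"""


def ast_to_graph(xml_text: str) -> Tuple[List[str], List[List[int]]]:
    """Return (node types in pre-order, symmetric adjacency matrix)."""
    root = ET.fromstring(xml_text.strip())
    nodes: List[str] = []
    edges: List[Tuple[int, int]] = []

    def visit(elem: ET.Element, parent: Optional[int]) -> None:
        idx = len(nodes)
        nodes.append(elem.tag)
        if parent is not None:
            edges.append((parent, idx))
        for child in elem:
            visit(child, idx)

    visit(root, None)
    n = len(nodes)
    adjacency = [[0] * n for _ in range(n)]
    for a, b in edges:
        # Record each parent-child edge in both directions.
        adjacency[a][b] = adjacency[b][a] = 1
    return nodes, adjacency


nodes, adjacency = ast_to_graph(XML)
```

Only element types are kept as node labels in this sketch; the code text inside the elements is ignored.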
4.4 Hardware Details
For training, validating, and testing of our models we used a workstation
with Xeon E1430v4 CPUs, 110GB RAM, a Titan RTX GPU, and a
Quadro P5000 GPU. Software used includes the following:
Ubuntu 18.04 Python 3.6 CUDA 10
TensorFlow 1.14 Keras 2.2 CuDNN 7
In this section we discuss the design of our experiments, including
the methodology, baselines, and metrics used to obtain our results.
5.1 Research Questions
Our research objective is to determine if our proposed approach of
using the source code sequence along with a graph-based AST and
ConvGNN outperforms current baselines. We also want to determine
why our proposed model may outperform current baselines based
on the use of the AST graph and ConvGNN. We ask the following
Research Questions (RQs) to explore these situations:
RQ1 What is the performance of our approach compared to the
baselines in Section 5.4 in terms of the metrics in
Section 5.3?
RQ2 What is the degree of difference in performance caused by
the number of graph hops in terms of the metrics in
Section 5.3?
RQ3 Is there evidence that the performance differences are due to
use of the ConvGNN?
The rationale for RQ1 is to compare our approach with other
approaches that use the AST as input. Previous work has already
shown that the inclusion of the AST as an input to the model
outperforms previous models where no AST information was provided
[ ]. Some previous work provides the AST as a tree or
graph [ ], but the source code sequence was not provided to the
model. Our proposed model is a logical next step in source code
summarization literature and we ask RQ1 to evaluate our model
against previous models.
The rationale for RQ2 is to determine what effect (if any) the
number of hops has on the generated summaries (a description
of ConvGNN hops can be found in Section 3.3). Xu et al. [ ]
discuss the impact of hop size on their work with generating SQL
queries. They found that for their test models, the number of hops
did not affect model convergence. To test this they generated random
directed graphs of sizes 100 and 1000 and trained a model to find
the shortest path between nodes, but did not evaluate how hop size
affects the task of source code summarization. Since ConvGNNs
create a layer for each hop, it becomes computationally expensive
to train models with an arbitrarily large number of hops. With RQ2
we hope to build an intuition into how the number of hops affects
ConvGNN learning specifically for source code summarization.
The rationale for RQ3 is that discovering why a model learned to
generate certain summaries can be just as important as evaluation
metrics [ ]. Doshie et al. [ ] discuss what
interpretability means and offer guidelines to researchers on what they
can do to make their models more explainable, while Roscher et al.
state that “…explainability is a prerequisite to ensure the scientific
value of the outcome” [ ]. As models are developed many factors
change, and it is oftentimes not an easy task to determine which
factors had the greatest impact on performance. In their work on
explainable AI, Arras et al. and Samek et al. show how
visualizations can aid in the process of explainability for text
based modeling tasks [ ]. With RQ3 we aim to explain what
impact the inclusion of the AST as a graph and ConvGNN had on
generated summaries.
5.2 Methodology
To answer RQ1, we follow established methodology and evaluation
metrics that have become standard in both source code
summarization work and neural machine translation from NLP [ ]. To
start, we use a large well-documented data set from the literature
to allow us to easily compare results and baselines. We use the data
handling guidelines outlined in LeClair et al. [ ] so that we do not
have data leakage between our training, validation, and testing sets.
Model              # of hops  BLEU-A  BLEU-1  BLEU-2  BLEU-3  BLEU-4  ROUGE-LCS F1
Baselines
ast-attendgru      -          18.69   37.13   21.11   14.27   10.90   49.75
graph2seq          -          18.61   37.56   21.27   14.13   10.63   49.69
code2seq           -          18.84   37.49   21.36   14.37   10.95   49.69
BiLSTM+GNN->LSTM   -          19.05   37.70   21.53   14.59   11.11   55.74
ConvGNN Models
code+gnn+dense     2          19.46   38.71   22.04   14.86   11.31   56.07
code+gnn+BiLSTM    2          19.93   39.14   22.49   15.31   11.70   56.08
code+gnn+GRU       1          19.70   38.15   22.12   15.22   11.73   57.15
code+gnn+GRU       2          19.89   39.01   22.42   15.28   11.70   55.78
code+gnn+GRU       3          19.58   38.48   22.09   15.01   11.52   56.14
code+gnn+GRU       5          19.68   38.89   22.30   15.09   11.46   55.81
code+gnn+GRU       10         19.34   38.68   21.94   14.73   11.20   55.10
Table 2: BLEU and ROUGE-LCS scores for the baselines and our proposed models
Next, we train our models for ten epochs and choose the model
with the highest validation accuracy score for our comparisons.
Choosing the model with the best validation performance out of
ten epochs is a training strategy that has been successfully used in
other related work [29]. For RQ1 we evaluate the best performing
model using automated evaluation techniques to compare against
our baselines and report the results in this paper.
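
The selection rule described above amounts to an argmax over per-epoch validation accuracy. The sketch below shows that rule only; the accuracy values are made up for illustration and are not results from the paper, and breaking ties in favor of the earliest epoch is an assumption.

```python
from typing import List


def best_epoch(val_accuracy: List[float]) -> int:
    """1-based epoch with the highest validation accuracy (earliest on ties)."""
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1


# Made-up validation accuracies for ten epochs (illustration only).
history = [0.51, 0.58, 0.61, 0.63, 0.64, 0.66, 0.65, 0.66, 0.64, 0.63]
chosen = best_epoch(history)
```

In practice the checkpoint saved at the chosen epoch would be the one evaluated on the test set.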
For RQ2 we train five ConvGNN models with all hyper-parameters
frozen except for hop size. We test our model using hop sizes of
1, 2, 3, 5, and 10, in line with other related work [ ]. We use the
model configuration outlined in Section 4 that takes the source code
and AST as input and places a GRU layer directly after the ConvGNN
layers. We chose this model because of its performance and its
faster training speed compared with the BiLSTM model. To report
our results we use the same “best of ten” technique that we use to
answer RQ1; that is, we train each model for ten epochs and report
the results on the model with the highest validation accuracy.
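The “best of ten” selection described above can be sketched in a few lines. This is a minimal illustration, not the training code used in the paper: the function name `select_best_epoch` and the accuracy values are hypothetical, and ties are broken in favor of the earliest epoch as an assumption.

```python
def select_best_epoch(val_accuracy_by_epoch):
    """Return (epoch, accuracy) for the epoch with the highest validation accuracy.

    Epochs are 1-indexed. On ties, the earliest epoch wins (an assumption).
    """
    if not val_accuracy_by_epoch:
        raise ValueError("need at least one epoch")
    best_index = max(range(len(val_accuracy_by_epoch)),
                     key=lambda i: val_accuracy_by_epoch[i])
    return best_index + 1, val_accuracy_by_epoch[best_index]


# Hypothetical validation accuracies for ten epochs (not results from the paper).
history = [0.41, 0.47, 0.52, 0.55, 0.57, 0.58, 0.575, 0.57, 0.56, 0.55]
best_epoch, best_accuracy = select_best_epoch(history)
```

Only the checkpoint saved at `best_epoch` would then be scored on the test set, so the test set plays no part in choosing the model.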
For RQ3 we use a combination of automated tools and metrics
such as BLEU [ ], ROUGE [ ], and visualizations from model
weights. Visualizing model weights and parameters has become a
popular way to help explain what deep learning models are doing,
and possibly give insight into why they generate the output that
they do. To help us answer RQ3 we use concepts similar to those
outlined in Samek et al. [47] for model visualizations.
5.3 Metrics
For our quantitative metrics we use both BLEU [ ] and ROUGE
[ ] to evaluate our model performance. BLEU scores are a standard
evaluation metric in the source code summarization literature [ ].
BLEU is a text similarity metric that compares overlapping
n-grams between two given texts. BLEU can be thought of
as a precision score: how much of the generated text appears in
the reference text. In contrast, ROUGE can be thought of as a recall
score: how much of the reference appears in the generated text.
ROUGE is used primarily in text summarization tasks in the NLP
literature because the score allows multiple references, since there
may be multiple correct summaries of a text [ ]. In our work we
do not have multiple reference texts per method, but ROUGE gives
us additional information about the performance of our models
that BLEU scores alone do not provide. In this paper we report a
composite BLEU score, BLEU-1 through BLEU-4 (n-grams of length
1 to length 4), and ROUGE-LCS (longest common sub-sequence) to
have a well rounded set of automated evaluation metrics.
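To make the precision and recall readings of these metrics concrete, the sketch below computes a clipped n-gram precision (the core of BLEU) and an LCS-based precision, recall, and F1 (the core of ROUGE-LCS) for one summary pair. It is a simplified illustration rather than the scoring scripts used for Table 2: it omits BLEU's brevity penalty and the averaging over n-gram orders, and it weights precision and recall equally in F1. All function names are our own.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Counts are clipped so a repeated n-gram is not rewarded more often
    than it appears in the reference.
    """
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_lcs(candidate, reference):
    """LCS-based precision, recall, and F1 (equal weighting assumed)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return precision, recall, 2 * precision * recall / (precision + recall)


# The reference and the code+gnn+GRU prediction from Example 1.
reference = "sends a guess to the server".split()
candidate = "sends a guess to the socket".split()

unigram_precision = clipped_precision(candidate, reference, 1)
bigram_precision = clipped_precision(candidate, reference, 2)
lcs_precision, lcs_recall, lcs_f1 = rouge_lcs(candidate, reference)
```

For this pair, five of the six predicted unigrams and four of the five predicted bigrams appear in the reference, and the longest common subsequence has length five.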
5.4 Baselines
We compare our model against four baselines. These baselines are
all from recent work that is directly relevant to this paper. We
chose these baselines because they provide comparison for three
categories in source code summarization using the AST: 1) flattened
AST, 2) using paths through the AST, and 3) using a graph neural
network to encode the AST.
Each of these baselines uses AST information as input to the
model with different schemes. They also cover a variety of model
architectures and configurations. Due to space limitations we do
not list all relevant details for each model, but have a more in depth
overview in Section 3.1 and in Table 1.
• ast-attendgru: In this model LeClair et al. [ ] use a standard
encoder-decoder model and add an additional encoder
for the AST. They flatten the AST using the SBT technique
outlined in Hu et al. [ ]. Then both the source code tokens
and the flattened AST are provided as input into the
model. For encoding these inputs they use recurrent layers
and then use a decoder with a recurrent layer to generate
the predictions. This approach is representative of other
approaches that flatten the AST into a sequence.
• graph2seq: Xu et al. [ ] developed a general graph to
sequence neural model that generates both node and graph
embeddings. In their work they use an SQL query and generate
a natural language query based on the SQL. Their
implementation propagates both forward and backwards
over the graph, and includes node level attention. They
achieved state of the art results on an SQL->natural language
task using BLEU-4 as a metric. They also evaluate
how the number of hops affected the performance of the
model, finding that any number of hops still converged to
similar results, but specific models could perform just as
well with fewer hops, lowering the amount of computation.
• code2seq: Alon et al. [ ] use random pairwise paths through
the AST as model input, which we discuss more in depth
in Section 3.1. They used C# code to generate summaries,
while we use Java. They test a variety of configurations;
because of this we did a good-faith re-implementation
of their base model in an attempt to capture the major
contributions of their approach.
• BiLSTM+GNN->LSTM: Fernandes et al. [ ] proposed a model using
a BiLSTM and GNN trained with a C# data set for code
summarization. We reproduced a model using the information
outlined in their paper. We trained the model using
the Java data set from LeClair et al. to create a comparison
for our work. In their paper they report results on a variety
of model architectures and setups; we include comparison
results with a model based on their best performing
configuration.
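As an illustration of what flattening an AST means for the first baseline category, the sketch below implements a structure-based traversal in the spirit of the SBT technique of Hu et al., as we understand it: each subtree is wrapped in an opening and a closing bracket token carrying the node label, so the tree shape can be recovered from the sequence. The tuple-based tree encoding and the node labels are hypothetical, and details of the original tooling (such as how srcML node types and identifiers are rendered) are not reproduced here.

```python
def sbt(node):
    """Flatten a tree into a bracketed token sequence.

    A node is a (label, children) pair. Each subtree becomes
    "( label <children...> ) label".
    """
    label, children = node
    tokens = ["(", label]
    for child in children:
        tokens.extend(sbt(child))
    tokens.extend([")", label])
    return tokens


# A small hypothetical tree: node 1 has children 2 and 3; node 2 has 4, 5, 6.
tree = ("1", [("2", [("4", []), ("5", []), ("6", [])]), ("3", [])])
flat = " ".join(sbt(tree))
```

A sequence encoder then reads `flat` token by token, which is why neighboring tokens in the sequence are not necessarily neighboring nodes in the tree.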
These baselines are not an exhaustive list of relevant work, but
they cover recent techniques used for source code summarization.
Other work that we chose not to use as baselines includes
Hu et al. [ ] and Wan et al. [ ]. We chose not to include Hu et
al. in our baselines because the work done by LeClair et al. built
upon their work, was shown to have higher performance, and
is much closer to our proposed work in this paper. In our proposed
model we use the technique outlined in LeClair et al. of separating
the source code sequence tokens from the AST.
Wan et al. [ ] is another potential baseline, but we found it
unsuitable for comparison in this paper for three reasons: 1) the
approach combines an AST+code encoding with Reinforcement
Learning (RL), and the RL component adds many experimental
variables with effects difficult to distinguish from the AST+code
component, 2) the AST+code encoding technique has now been
superseded by other techniques which we already use as baselines,
and 3) we were unable to reproduce the results in the paper. An
interesting question for future work is to study the effects of the
RL component in a separate experiment: the RL component of
Wan et al. is supportive of, rather than a competitor with, the
AST+code encoding. We also do not compare against heuristic based
approaches. Most data-driven approaches outperform heuristic
based approaches in all of the automated metrics, and previous
work has already reported the comparisons.
5.5 Threats to Validity
The primary threat to validity for this paper is that the automated
metrics we use to score and rate our models may not be representative
of human judgement. BLEU and ROUGE metrics can give us a
good indication of how our model performs compared to the reference
text and other models, but there are instances where the model may
generate a valid summary that does not align with the reference
text. On the other hand, there is no evaluation as to whether a
reference comment for a given method is a good summary. The
benefit of these automated metrics is that they are fast and have
wide use among the source code summarization community. To
mitigate the potential pitfalls that using automated metrics may
involve, we include an in depth discussion and evaluation of specific
examples from our model to help interpret what our model
has learned when compared to baselines.
The dataset we use is another potential threat to validity.
While other data sets do exist with other programming languages,
for example C# or Python, many of these data sets lack the size and
scope of the data set provided by LeClair et al. For example, the
C# dataset used by Fernandes et al. has 23 projects, with 55,635 methods
having associated documentation. Another common pitfall of
datasets described in LeClair et al. is that many datasets split data
on the function level instead of the project level. This means that
functions from the same project can appear in both the training and
testing sets, causing potential data leakage. Using Java is beneficial
due to its widespread use in many different types of programs and
its adoption in industry. Java also has a well defined commenting
standard with JavaDocs that creates easily parsable documentation.
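The difference between a function-level and a project-level split can be sketched as follows. This is a minimal illustration under our own assumptions, not the data handling code of LeClair et al.: the function name `split_by_project`, the `(project_id, method_id)` record layout, and the split fraction are all hypothetical.

```python
import random


def split_by_project(methods, test_fraction=0.1, seed=0):
    """Split (project_id, method_id) records so no project spans both sets.

    Projects, not individual methods, are shuffled and assigned, which
    prevents methods from one project leaking from training into testing.
    """
    projects = sorted({project for project, _ in methods})
    rng = random.Random(seed)
    rng.shuffle(projects)
    n_test = max(1, int(len(projects) * test_fraction))
    held_out = set(projects[:n_test])
    train = [m for m in methods if m[0] not in held_out]
    test = [m for m in methods if m[0] in held_out]
    return train, test


# Hypothetical corpus: four projects with three methods each.
methods = [(f"project_{p}", f"method_{p}_{i}") for p in range(4) for i in range(3)]
train, test = split_by_project(methods, test_fraction=0.25, seed=1)
```

A function-level split would instead shuffle `methods` directly, which lets near-duplicate code and shared identifiers from one project land on both sides of the split.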
One other threat to validity is that we were unable to perform
extensive hyper-parameter optimizations on our models due to
hardware limitations. It could be the case that some of our models
or baselines outlined in Table 2 could be heavily impacted by certain
hyper-parameters (e.g., input sequence length, learning rate, and
vocabulary size), giving different scores and rankings. This is a
common issue with deep learning projects, and it affects nearly
all similar types of experiments. We try to mitigate the impact of
this issue by being consistent with our hyper-parameter values.
We also take great care when reproducing work for our baselines,
making sure the experimental setups are reasonable and match
their descriptions as closely as we can.
This section provides the experiment results for the research questions
we ask in Section 5.1. To answer RQ1 we use a combination of
automated metrics and discuss our model's performance compared to other
models in the context of these metrics. For RQ2 we test a series of
models with different hop sizes and compare them to our model as
well as our baselines. To answer RQ3 we provide a set of examples
comparing our model using the graph AST and ConvGNN with a
flattened AST model and show how the addition of the ConvGNN
contributes to the overall summary.
6.1 RQ1: antitative Evaluation
For our experimental results we tested three model congurations.
In labeling our models we use code+gnn to represent the models
that use an encoder for the source code tokens, a ConvGNN encoder
for the AST, and then we use a +
layer name
format to show the
layer that was used on the output of the ConvGNN.
We found that model code+gnn+BiLSTM was the highest performing
approach, obtaining a BLEU-A score of 19.93 and ROUGE-LCS
score of 56.08, as seen in Table 2. The code+gnn+BiLSTM model
outperformed the nearest graph-based baseline by 4.6% BLEU-A
and 0.06% ROUGE-LCS. The code+gnn+BiLSTM model also outperformed
the flattened AST baseline by 5.7% BLEU-A and 12.72%
ROUGE-LCS. We attribute this increase in performance to the use
of the ConvGNN as an encoding for the AST. Adding the ConvGNN
allows the model to learn better AST node representations than it
can with only a sequence model. We go into more depth on how
the ConvGNN may be boosting performance in Section 6.3. We
attribute our performance improvement over other graph-based
approaches to the use of the source code token sequence as a separate
additional encoder. We found that using both the source code
sequence and the AST allows the model to learn when to copy tokens
directly from the source code, serving a purpose similar to a ‘copy
mechanism’ as described by Gu et al. [ ]. Copy mechanisms are
used to copy words directly from the input text to the output,
primarily to improve performance with rare or unknown tokens.
In this case, the model has learned to copy tokens directly from the
source code. This works well for source code summarization because
of the large overlap in source code and summary vocabulary
(over 94%). In Section 6.3 example 1 we show how models that use
both source code and AST input utilize the source code attention
like a copy mechanism.
We also see a noticeable effect on model performance based on
the recurrent layer after the ConvGNNs. LeClair et al. achieved
18.69 BLEU-A using only a recurrent layer to encode the flattened
AST sequence, and without a recurrent layer the code+gnn+dense
model achieves 19.46 BLEU-A. In an effort to see how different
recurrent layers affect the model's performance, we trained a
two hop model using a GRU and another model using a BiLSTM,
as shown in Table 2. We found that code+gnn+BiLSTM outperformed
code+gnn+GRU by 0.05 BLEU-A and 0.3 ROUGE-LCS. The
improved score of the BiLSTM layer is likely due to the increased
complexity of the layer over the GRU. We find in many cases that
the BiLSTM architecture outperforms other recurrent layers, but
at a significantly increased computational cost. For this reason,
and because the code+gnn+BiLSTM model only outperformed the
code+gnn+GRU model by 0.05 BLEU-A (0.2%), we chose to conduct
our other tests using the code+gnn+GRU architecture.
6.2 RQ2: Hop size analysis
In Table 2 we compare the number of hops in the ConvGNN layers
and how it affects the performance of the model. We found that for
the AST two hops had the best overall performance. With
two hops, a node gets information from nodes up to two edges
away. As outlined in Section 4, our model implementation creates
a separate ConvGNN layer for each hop in series. One explanation
for why two hops had the best performance is that, because we
are generating summaries at the method level, the ASTs in the
dataset are not very deep. Another possibility is that the
other nodes most important to a specific node are its neighbors, and
dealing with smaller clusters of node data is sufficient for learning.
Lastly, even though the number of hops directly influences how
far and quickly information will propagate through the ConvGNN,
after every iteration the neighboring nodes are themselves an
aggregate of the nodes within their own hops. In other words, after
one iteration with two hops, a node's neighbor is now an aggregate
of nodes three hops away, so after enough iterations each node
should be affected by every other node, with closer nodes having a
larger effect.
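The way hop count bounds how far information travels can be seen in a small sketch. This is a simplified stand-in for a ConvGNN layer, not the model from Section 4: it uses plain mean aggregation with no learned weights or non-linearity, and the five-node chain graph is hypothetical. One aggregation step is applied per hop, in series.

```python
def mean_aggregate(features, neighbors):
    """One hop: each node averages its own vector with its neighbors' vectors."""
    updated = []
    for node, own in enumerate(features):
        group = [own] + [features[n] for n in neighbors[node]]
        updated.append([sum(column) / len(group) for column in zip(*group)])
    return updated


def propagate(features, neighbors, hops):
    """Apply one aggregation layer per hop, in series."""
    for _ in range(hops):
        features = mean_aggregate(features, neighbors)
    return features


# A hypothetical five-node chain 0 - 1 - 2 - 3 - 4 standing in for an AST path.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

# One-hot features: entry j of node i's vector shows how much node j contributes.
one_hot = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]

after_two_hops = propagate(one_hot, neighbors, hops=2)
node_zero = after_two_hops[0]
```

After two hops, `node_zero` has non-zero weight on nodes 0, 1, and 2 only, and the weight on node 1 is larger than the weight on node 2, which matches the observation that closer nodes have a larger effect.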
While using two hops reported the best BLEU score for the
code+gnn+GRU models, it only performed 1.5% better than using
three hops and 2.8% better than using ten. Also notice that using
five hops outperformed three and ten hops; this could be due to the
random initialization or other minor factors. Because the difference
in overall BLEU score is relatively small between hop sizes, we
believe that the number of hops is less important than other
hyper-parameters. It could be that the number of hops will be more
important when summarizing larger selections of code where nodes
are farther apart. For example, if the task were to summarize an
entire program it may be beneficial to have more hops in the
encoder to allow information to propagate farther.
6.3 RQ3: Graph AST Contribution
We provide three in-depth examples of the GNN’s contribution to
our model. In these examples we also compare directly with the
ast-attendgru model proposed by LeClair et al. [ ]. We chose to
compare with ast-attendgru because it represents a collection of
work using flattened ASTs to summarize source code as well as
having separate encoders for the source code sequence and AST.
We feel that comparing against this model allows us to isolate the
contribution that the ConvGNN is making to the generated summaries.
It should be noted, however, that these two models process
the AST differently: they both use SrcML [ ] to generate ASTs,
but ast-attendgru additionally processes
the AST using the SBT technique developed by Hu et al. More
details about how LeClair et al. process the AST can be found in
Section 5.4.
Example 1, Method ID 20477616
reference        sends a guess to the server
code+gnn+GRU     sends a guess to the socket
ast-attendgru    attempts to initiate a <UNK> guess
source code
    public void sendGuess(String guess) {
        if( isConnected() ) {
            gui.statusBarInfo("Querying...", false);
            try {
                os.write( (guess + "\\r\\n").getBytes() );
            } catch (IOException e) {
                "Failed to send guess.IOException",true
                "IOException during send guess to server"
(b) code+gnn+GRU: Source attention
(c) code+gnn+GRU: AST attention
(d) ast-attendgru: Source attention
(e) ast-attendgru: AST attention
Example 1: Visualization of source code and AST attention
for code+gnn+GRU and ast-attendgru
Our rst example visualized in Example 1 shows an instance
where the reference summary contains tokens that also appear in
the source code. We use this example to showcase how the source
code and AST aentions work together as a copy mechanism to
generate summaries. e visualizations in Example 1 are a snapshot
of the aention outputs for both the source code and AST when
the models are predicting the rst token in the sequence, which
is ‘sends’ in the reference. We can see in Example 1 (a) that the
code+gnn+GRU model and (c) the ast-aendgru model aend to the
third token in the input source code sequence (column 2), which
in this case is the token ‘send’. Where these two models dier,
however, is what their respective ASTs are aending to (seen in (b)
and (d)). In the case of code+gnn+GRU (b), the model is aending
to the token ‘send’ in column eight and the token ‘status’ in column
thirty-eight. On the other hand, e ast-aendgru (d) model is
aending to column thirty-seven, which is the token ‘sexpr stmt’.
One explanation for this is that code+gnn+GRU is able to combine
structural and code elements better than models that don’t
utilize a ConvGNN. In the context of this example, this means
that the AST token ‘send’ in the code+gnn+GRU model is a learned
combination of the ‘send’ node and its neighbors, which in the
AST are nodes ‘name’, ‘status’, ‘bar’, and ‘info’. The ast-attendgru
model only sees the AST as a sequence of tokens, so when it attends
to the token ‘sexpr stmt’, its neighboring tokens are ‘sblock’ and
‘sexpr’. Another observation from Example 1 (d) is that, generally,
the ast-attendgru model activates more on the AST sequence than it
does on the source code sequence, while code+gnn+GRU activates
similarly on both attentions and focuses more on specific tokens.
This could be due to the ConvGNN learning more specific structure
information from the AST.
We also found that because the code+gnn+GRU AST attention is more fine
grained than the ast-attendgru attention, it learns whether to copy
words directly from the source code better than the other model.
In this case, because both the source code and AST attention focus
on the same token, ‘send’, the model determined that ‘send’ or a word very
close to it (in this case ‘sends’) should be the predicted token. If
the source code and AST attention differ, then the model will often
predict tokens that are not in the source code or AST.
If we look at later tokens in the predicted sequence, we see that
code+gnn+GRU predicts the correct tokens until the last one. For
the final token the reference token is ‘server’ and code+gnn+GRU
predicted ‘socket’. What is also interesting here is that ast-attendgru
predicted the token ‘guess’, which is in the reference summary and
source code sequence. If we look at Example 2 we see the output
of the AST attention mechanisms for both the code+gnn+GRU and
ast-attendgru models during their prediction of the final token in
the sequence. This means that code+gnn+GRU has the
input sequence [sends, a, guess, to, the] and predicts ‘socket’, and
ast-attendgru has the sequence [attempts, to, initiate, a, <UNK>] and
predicts ‘guess’. We do not include the source code visualizations
here because they were very similar to the visualizations in Example
1 (a) and (c). So, in the source code attention both models attend
to the tokens ‘send’ and ‘guess’, but as we can see in the AST
visualization, code+gnn+GRU is attending to the token in column
forty-six, ‘querying’, and ast-attendgru is attending to a large,
non-specific area in the structure of the AST. The code+gnn+GRU model
has learned that the combination of the tokens ‘send’, ‘guess’, and
the AST token ‘querying’ leads to the prediction of the token ‘socket’.
While this prediction was incorrect, the token ‘socket’ is closely
related to the term ‘server’ in this context. Notice that the token
‘querying’ is also in the source code, but neither model attends to
it. As stated above, the source code attention is acting more as a
copy mechanism and is attending to tokens that it believes should
be the next predicted token; the model then relies on the AST attention to
add additional information for the final prediction.
Example 3 is a case where the code+gnn+GRU model correctly
predicts the sixth token in the sequence, ‘first’, but ast-attendgru
predicts the token ‘specified’. For this prediction both models have
the same predicted token sequence input to the decoder. If we
look at Example 3, we can see the attention visualizations for
our models for their prediction of the sixth token in the sequence.
In Example 3 (a) in the sixth row, the code+gnn+GRU model is
attending to the fifth column, which is the token ‘o’. In this piece of
code ‘o’ is the identifier for the input parameter for the method. The
ast-attendgru model source attention (Example 3 (c)) is attending
to columns seventeen and thirty-four, which are both the token
‘game’. Looking at the code+gnn+GRU AST attention (Example
3 (b)), we see that it is attending to column sixty-three, which is
also the token ‘game’. So, in this example the ast-attendgru model’s
source attention is attending to ‘game’ and the code+gnn+GRU
model’s attention is also attending to the token ‘game’ in the AST.
This is important because it shows both models have learned that
this token is important to the prediction, but in different contexts.
Looking at the visualizations, we see again that the code+gnn+GRU
model is able to focus on specific, important tokens in both the
source code and the AST, while the ast-attendgru model attends to
larger portions of the AST.
(a) code+gnn+GRU: AST attention
(b) ast-attendgru: AST attention
Example 2: Visualization of AST attention mechanisms for
code+gnn+GRU and ast-attendgru when predicting the final
token in the sequence.
The major take-aways from the work outlined in this paper are:
• Using the AST as a graph with ConvGNN layers outperforms
a flattened version of the AST.
• Including the source code sequence as a separate encoder
allows the model to learn to use the source code and AST
as a copy mechanism.
• The improved node embeddings from the ConvGNN allow
the model to learn better token representations where
representations of tokens in the AST are a combination of
structure elements.
The three examples that we show are situations where the addition
of the ConvGNN allowed the model to learn better node representations
than using a flattened sequence for the AST. When both
the source code attention and the AST attention align on a specific
token, the model treats this like a copy mechanism, directly copying
the input source token to the output. When the source and AST
attention do not agree, we see the model relying more on the AST
to predict the next token in the sequence. When we compare this
to a model with a flattened AST input, we see a large difference
in how the AST is being attended to: generally the flattened AST
model looks at larger structure areas instead of specific tokens.
As an avenue for future work, models such as these have been
shown to improve performance when ensembled. LeClair et al.
showed that a model without AST information outperformed a
model using AST information on specific types of summaries [ ].
Ensembling could lead to interesting results, potentially showing that
bringing in different features from the source code allows the models to
learn to generate better summaries for specific types of methods.
In this work we have presented a new neural model architecture
that utilizes a sequence of source code tokens along with ConvGNNs
to encode the AST of a Java method and generate natural
language summaries. We provide background and insights into why
using a graph based neural network to encode the AST improves
performance, along with a comparison of our results
against relevant baselines. We conclude that the combination of
source code tokens along with the AST and ConvGNNs allows
the model to better learn when to directly copy tokens from the
source code, as well as create better representations of tokens in
the AST. We show that the use of the ConvGNN to encode
the AST improves aggregate BLEU scores (BLEU-A) by over 4.6%
over other graph-based approaches and by 5.7% over
flattened AST approaches. We also provide an in-depth analysis of
how the ConvGNN layers contribute to this increase in performance,
and speculate on how these insights can be used for future work.
All of our models, source code, and data used in this work can be
found in our online repository at https://
Example 3, Method ID 25584536
reference        returns the index of the first occurrence of the specified element
code+gnn+GRU     returns the index of the first occurrence of the specified element
ast-attendgru    returns the index of the specified object in the list
source code
    public int indexOf(Object o) {
        if (o == null) {
            for (int i = 0; i < size; i++) {
                if (gameObjects[i] == null) {
                    return i;
                }
            }
        } else {
            for (int i = 0; i < size; i++) {
                if (o.equals(gameObjects[i])) {
                    return i;
                }
            }
        }
        return -1;
    }
(b) code+gnn+GRU: Source attention
(c) code+gnn+GRU: AST attention
(d) ast-attendgru: Source attention
(e) ast-attendgru: AST attention
Example 3: Visualization of source code and AST attention
for code+gnn+GRU and ast-attendgru
This work is supported in part by NSF CCF-1452959 and CCF-1717607.
Any opinions, findings, and conclusions expressed herein are the authors’
and do not necessarily reflect those of the sponsors.
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin,
Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard,
Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg,
Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike
Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul
Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals,
Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.
https://www.tensor Software available from tensor
[2] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning
to represent programs with graphs. International Conference on Learning
Representations (2018).
[3] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating
sequences from structured representations of code. International Conference on
Learning Representations (2019).
[4] Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and
Wojciech Samek. 2017. What is relevant in a text document?: An interpretable
machine learning approach. PLOS ONE 12, 8 (Aug 2017), e0181142.
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine
translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473 (2014).
[6] David Binkley. 2007. Source code analysis: A road map. In 2007 Future of Software
Engineering. IEEE Computer Society, 104–119.
[7] Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017. Improved
Neural Machine Translation with a Syntax-Aware Encoder and Decoder. In
Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). Association for Computational Linguistics,
Vancouver, Canada, 1936–1945. https://
[8] Yu Chen, Lingfei Wu, and Mohammed J Zaki. 2020. Reinforcement learning
based graph-to-sequence model for natural question generation. International
Conference on Learning Representations (2020).
[9] François Chollet et al. 2015. Keras. https://
[10] Michael L Collard, Michael J Decker, and Jonathan I Maletic. 2011. Lightweight
transformation and fact extraction with the srcML toolkit. In Source Code Analysis
and Manipulation (SCAM), 2011 11th IEEE International Working Conference on.
IEEE, 173–184.
[11] Bas Cornelissen, Andy Zaidman, Arie Van Deursen, Leon Moonen, and Rainer
Koschke. 2009. A systematic survey of program comprehension through dynamic
analysis. IEEE Transactions on Software Engineering 35, 5 (2009), 684–702.
[12] Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia M. de Oliveira. 2005. A
study of the documentation essential to software maintenance. In Proceedings of
the 23rd annual international conference on Design of communication: documenting
& designing for pervasive information (SIGDOC ’05). ACM, New York, NY, USA,
68–75. https://
[13] Derek Doran, Sarah Schulz, and Tarek R. Besold. 2017. What Does Explainable
AI Really Mean? A New Conceptualization of Perspectives. CoRR abs/1710.00794
(2017). arXiv:1710.00794 http://
[14] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of
Interpretable Machine Learning. arXiv e-prints, Article arXiv:1702.08608 (Feb 2017),
arXiv:1702.08608 pages. arXiv:stat.ML/1702.08608
[15] Brian P Eddy, Jeffrey A Robinson, Nicholas A Kraft, and Jeffrey C Carver. 2013.
Evaluating source code summarization techniques: Replication and expansion.
In Program Comprehension (ICPC), 2013 IEEE 21st International Conference on.
IEEE, 13–22.
[16] Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2018. Structured
Neural Summarization. CoRR abs/1811.01824 (2018). arXiv:1811.01824 http:
[17] Andrew Forward and Timothy C. Lethbridge. 2002. The relevance of software
documentation, tools and technologies: a survey. In Proceedings of the 2002 ACM
symposium on Document engineering (DocEng ’02). ACM, New York, NY, USA,
26–33. https://
[18] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating
Copying Mechanism in Sequence-to-Sequence Learning. Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers) (2016). https://
[19] Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the
use of automated text summarization techniques for summarizing source code.
In Reverse Engineering (WCRE), 2010 17th Working Conference on. IEEE, 35–44.
[20] S. Haiduc and A. Marcus. 2008. On the Use of Domain Terms in Source Code. In
16th IEEE International Conference on Program Comprehension (ICPC’08).
Amsterdam, The Netherlands, 113–122.
[21] Vincent J Hellendoorn and Premkumar Devanbu. 2017. Are deep neural networks
the best choice for modeling source code?. In Proceedings of the 2017 11th Joint
Meeting on Foundations of Software Engineering. ACM, 763–773.
[22] M. J. Howard, S. Gupta, L. Pollock, and K. Vijay-Shanker. 2013. Automatically
mining software-based, semantically-similar words from comment-code mappings.
In 2013 10th Working Conference on Mining Software Repositories (MSR).
377–386. https://
[23] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment
generation. In Proceedings of the 26th International Conference on Program
Comprehension. ACM, 200–210.
[24] Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018. Summarizing
Source Code with Transferred API Knowledge. In IJCAI. 2269–2275.
[25] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016.
Summarizing source code using a neural attention model. In Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), Vol. 1. 2073–2083.
[26] Mira Kajko-Mattsson. 2005. A Survey of Documentation Practice within
Corrective Maintenance. Empirical Softw. Engg. 10, 1 (Jan. 2005), 31–55. https:
[27] Douglas Kramer. 1999. API documentation from source code comments: a case
study of Javadoc. In Proceedings of the 17th annual international conference on
Computer documentation. ACM, 147–153.
[28] A. LeClair, Z. Eberhart, and C. McMillan. 2018. Adapting Neural Text
Classification for Improved Software Categorization. In 2018 IEEE International
Conference on Software Maintenance and Evolution (ICSME). 461–472. https:
[29] Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A neural model for
generating natural language summaries of program subroutines. In Proceedings
of the 41st International Conference on Software Engineering. IEEE Press, 795–806.
Alexander LeClair and Collin McMillan. 2019. Recommendations for Datasets
for Source Code Summarization. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers). 3931–3937.
Yuding Liang and Kenny Q. Zhu. 2018. Automatic Generation of Text Descriptive
Comments for Code Blocks. CoRR abs/1808.06880 (2018). arXiv:1808.06880
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries.
Text Summarization Branches Out (2004).
C. Lopes, S. Bajracharya, J. Ossher, and P. Baldi. 2010. UCI Source Code Data
Sets. hp://$\sim$lopes/datasets/
Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. 2017. A Neural
Architecture for Generating Natural Language Descriptions from Source Code
Changes. In ACL.
Yangyang Lu, Zelong Zhao, Ge Li, and Zhi Jin. 2019. Learning to Generate
Comments for API-Based Code Snippets. In Software Engineering and Methodology
for Emerging Domains, Zheng Li, He Jiang, Ge Li, Minghui Zhou, and Ming Li
(Eds.). Springer Singapore, Singapore, 3–14.
Paul W McBurney, Cheng Liu, and Collin McMillan. 2016. Automated feature
discovery via sentence selection and source code summarization. Journal of
Software: Evolution and Process 28, 2 (2016), 120–145.
Paul W McBurney and Collin McMillan. 2016. Automatic source code
summarization of context for java methods. IEEE Transactions on Software Engineering
42, 2 (2016), 103–119.
Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social
sciences. Artificial Intelligence 267 (2019), 1–38. https://
Laura Moreno and Jairo Aponte. 2012. On the analysis of human and automatic
summaries of source code. CLEI Electronic Journal 15, 2 (2012), 2–2.
Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock,
and K Vijay-Shanker. 2013. Automatic generation of natural language summaries
for java classes. In Program Comprehension (ICPC), 2013 IEEE 21st International
Conference on. IEEE, 23–32.
Najam Nazar, Yan Hu, and He Jiang. 2016. Summarizing software artifacts:
A literature review. Journal of Computer Science and Technology 31, 5 (2016),
Karl J Ottenstein and Linda M Ottenstein. 1984. The program dependence graph
in a software development environment. ACM SIGSOFT Software Engineering
Notes 9, 3 (1984), 177–184.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a
method for automatic evaluation of machine translation. In Proceedings of the
40th annual meeting on association for computational linguistics. Association for
Computational Linguistics, 311–318.
Paige Rodeghero, Cheng Liu, Paul W McBurney, and Collin McMillan. 2015.
An eye-tracking study of java programmers and application to source code
summarization. IEEE Transactions on Software Engineering 41, 11 (2015), 1038–
Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do
professional developers comprehend software?. In Proceedings of the 2012
International Conference on Software Engineering (ICSE 2012). IEEE Press, Piscataway,
NJ, USA, 255–265. http://
Ribana Roscher, Bastian Bohn, Marco F. Duarte, and Jochen Garcke.
2019. Explainable Machine Learning for Scientific Insights and Discoveries.
Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. 2017. Explainable
artificial intelligence: Understanding, visualizing and interpreting deep learning
models. arXiv preprint arXiv:1708.08296 (2017).
Lin Shi, Hao Zhong, Tao Xie, and Mingshu Li. 2011. An empirical study on
evolution of API documentation. In Proceedings of the 14th international conference
on Fundamental approaches to software engineering: part of the joint European
conferences on theory and practice of software (FASE’11/ETAPS’11). Springer-Verlag,
Berlin, Heidelberg, 416–431. http://
Xiaotao Song, Hailong Sun, Xu Wang, and Jiafei Yan. 2019. A Survey of Automatic
Generation of Source Code Comments: Algorithms and Techniques. IEEE Access
Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K
Vijay-Shanker. 2010. Towards automatically generating summary comments for java
methods. In Proceedings of the IEEE/ACM international conference on Automated
software engineering. ACM, 43–52.
Giriprasad Sridhara, Lori Pollock, and K Vijay-Shanker. 2011. Automatically
detecting and describing high level actions within methods. In Proceedings of the
33rd International Conference on Software Engineering. ACM, 101–110.
Ilya Sutskever, James Martens, and Geoffrey E Hinton. 2011. Generating text with
recurrent neural networks. In Proceedings of the 28th International Conference on
Machine Learning (ICML-11). 1017–1024.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence
learning with neural networks. In Advances in neural information processing
systems. 3104–3112.
Anneliese Von Mayrhauser and A Marie Vans. 1995. Program comprehension
during software maintenance and evolution. Computer 8 (1995), 44–55.
Laura von Rueden, Sebastian Mayer, Jochen Garcke, Christian Bauckhage, and
Jannis Schuecker. 2019. Informed Machine Learning - Towards a Taxonomy of
Explicit Integration of Knowledge into Machine Learning. arXiv:stat.ML/1903.12394
Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu,
and Philip S. Yu. 2018. Improving Automatic Source Code Summarization
via Deep Reinforcement Learning. In Proceedings of the 33rd ACM/IEEE
International Conference on Automated Software Engineering (ASE 2018).
Association for Computing Machinery, New York, NY, USA, 397–407. https:
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and
Philip S. Yu. 2019. A Comprehensive Survey on Graph Neural Networks. CoRR
abs/1901.00596 (2019). arXiv:1901.00596 http://
Kun Xu, Lingfei Wu, Zhiguo Wang, Yansong Feng, Michael Witbrock, and Vadim
Sheinin. 2018. Graph2seq: Graph to sequence learning with attention-based
neural networks. arXiv preprint arXiv:1804.00823 (2018).
Kun Xu, Lingfei Wu, Zhiguo Wang, Mo Yu, Liwei Chen, and Vadim Sheinin.
2018. Exploiting rich syntactic information for semantic parsing with
graph-to-sequence model. Conference on Empirical Methods in Natural Language Processing