Conference PaperPDF Available

Recommendations for Datasets for Source Code Summarization


Abstract and Figures

Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable datasets. In addition, a lack of community standards for creating datasets leads to confusing and unreproducible research results -- we observe swings in performance of more than 33% due only to changes in dataset design. In this paper, we make recommendations for these standards from experimental results. We release a dataset based on prior work of over 2.1m pairs of Java methods and one sentence method descriptions from over 28k Java projects. We describe the dataset and point out key differences from natural language data, to guide and support future researchers. Dataset Available at
Content may be subject to copyright.
Recommendations for Datasets for Source Code Summarization
Alex LeClair, Collin McMillan
Department of Computer Science and Engineering
University of Notre Dame
{aleclair, cmc}
Source Code Summarization is the task of
writing short, natural language descriptions
of source code. The main use for these de-
scriptions is in software documentation e.g.
the one-sentence Java method descriptions in
JavaDocs. Code summarization is rapidly
becoming a popular research problem, but
progress is restrained due to a lack of suit-
able datasets. In addition, a lack of commu-
nity standards for creating datasets leads to
confusing and unreproducible research results
– we observe swings in performance of more
than 33% due only to changes in dataset de-
sign. In this paper, we make recommendations
for these standards from experimental results.
We release a dataset based on prior work of
over 2.1m pairs of Java methods and one sen-
tence method descriptions from over 28k Java
projects. We describe the dataset and point out
key differences from natural language data, to
guide and support future researchers.
1 Introduction
Source Code Summarization is the task of writ-
ing short, natural language descriptions of source
code (Eddy et al.,2013). The most common use
for these descriptions is in software documen-
tation, such as the summaries of Java methods
in JavaDocs (Kramer,1999). Automatic gener-
ation of code summaries is a rapidly-expanding
research area at the crossroads of Computational
Linguistics and Software Engineering, as a grow-
ing tally of new workshops and NSF-sponsored
meetings have recognized (Cohen and Devanbu,
2018;Quirk,2015). The reason, in a nutshell, is
that the vast majority of code summarization tech-
niques are adaptations of techniques originally de-
signed to solve NLP problems.
A major barrier to ongoing research is a lack
of standardized datasets. In many NLP tasks such
as Machine Translation there are large, curated
datasets (e.g. Europarl (Koehn,2018)) used by
several research groups. The benefit of these stan-
dardized datasets is twofold: First, scientists are
able to evaluate new techniques using the same test
conditions as older techniques. And second, the
datasets tend to conform to community customs of
best practice, which avoids errors during evalua-
tion. These benefits are generally not yet available
to code summarization researchers; while large,
public code repositories do exist, most research
projects must parse and process these repositories
on their own, leading to significant differences on
one project to another. The result is that research
progress is slowed as reproducibilty of earlier re-
sults is difficult.
Inevitably, differences in dataset creation also
occur that can mislead researchers and over or un-
derstate the performance of some techniques. For
example, a recent source code summarization pa-
per reports achieving 25 BLEU when generating
English descriptions of Java methods with an ex-
isting technique (Gu et al.,2018), which is 5 points
higher than the original paper reports (Iyer et al.,
2016). The paper also reports 35+ BLEU for a
vanilla seq2seq NMT model, which is 16 points
higher than what we are able to replicate. While it
is not our intent to single out any one paper, we do
wish to call attention to a problem in the research
area generally: a lack of standard datasets leads to
results that are difficult to interpret and replicate.
In this paper, we propose a set of guidelines
for building datasets for source code summariza-
tion techniques. We support our guidelines with
related literature or experimentation where strong
literary consensus is not available. We also com-
pute several metrics related to word usage to guide
future researchers who use the dataset. We have
made a dataset of over 2.1m Java methods and
summaries from over 28k Java projects available
via an online appendix (URL in Section 6).
arXiv:1904.02660v1 [cs.CL] 4 Apr 2019
2 Related Work
Related work to this paper consists of approaches
for source code summarization. As with many
research areas, data-driven AI-based approaches
have superseded heuristic/template-based tech-
niques, though overall the field is quite new. Work
by Haiduc et al. (Haiduc et al.,2010a,b) in 2010
coined the term “source code summarization”, and
several heuristic/template-based techniques fol-
lowed including work by Sridhara et al. (Srid-
hara et al.,2010,2011), McBurney et al. (McBur-
ney and McMillan,2016), and Rodeghero et
al. (Rodeghero et al.,2015).
More recent techniques are data-driven, though
the overall size of the field is small. Literature
includes work by Hu et al. (Hu et al.,2018a,b)
and Iyer et al. (Iyer et al.,2016). Projects tar-
geting problems similar to code summarization
have been published widely, including on com-
mit message generation (Jiang et al.,2017;Loyola
et al.,2017), method name generation (Allamanis
et al.,2016), pseudocode generation (Oda et al.,
2015), and code search (Gu et al.,2018). Nazar et
al. (Nazar et al.,2016) provide a survey.
Of note is that no standard datasets for code
summarization have yet been published. Each of
the above papers takes an ad hoc approach, in
which the authors download large repositories of
code and apply their own preprocessing. There
are few standard practices, leading to major dif-
ferences in the reported results in different papers,
as discussed in the previous section. For example,
the works by LeClair et al. (LeClair and McMil-
lan,2019) and Hu et al. (Hu et al.,2018a) both
modify the CODENN model from Iyer et al. (Iyer
et al.,2016) to work on Java methods and com-
ments. LeClair et al. and Hu et al. report very
disparate results: A BLEU-4 score of 6.3 for CO-
DENN on one dataset, and 25.3 on another, even
though both datasets were generated from Java
source code repositories.
These disparate results happen for a variety of
reasons, such as a difference in data set sizes and
tokenization schemes. LeClair et al. use a data set
of 2.1 million Java method-comment pairs while
Hu et al. use a total of 69,708. Hu et al. also re-
place out of vocabulary (OOV) tokens in the com-
ments with <UNK>in the training, validation,
and testing sets, while LeClair et al. remove OOV
tokens from the training set only.
3 Dataset Preparations
The dataset we use in this paper is based on the
dataset provided by LeClair et al. (LeClair and
McMillan,2019) in a pre-release. We used this
dataset because it is both the largest and most re-
cent in source code summarization. That dataset
has its origins in the Sourcerer project by Lopes et
al. (Lopes et al.,2010), which includes over 51
million Java methods. LeClair et al. provided
the dataset after minimal initial processing that fil-
tered for Java methods with JavaDoc comments
in English, and removed methods over 100 words
long and comments >13 and <3 words. The result
is a dataset of 2.1m Java methods and associated
comments. LeClair et al. do additional process-
ing, but do not quantify the effects of their deci-
sions – this is a problem because other researchers
would not know which of the decisions to follow.
We explore the following research questions to
help provide guidelines and justifications for our
design decisions in creating the dataset.
3.1 Research Questions
Our research objective and contribution in this pa-
per is to quantify the effect of key dataset pro-
cessing configurations, with the aim to make rec-
ommendations on which configurations should be
used. We ask the following Research Questions:
RQ1What is the effect of splitting by method ver-
sus splitting by project?
RQ2What is the effect of removing automatically
generated Java methods?
The scope of the dataset in this paper is source
code summarization of Java methods – the dataset
contains pairs of Java methods and JavaDoc de-
scriptions of those methods. However, we be-
lieve these RQs will provide guidance for similar
datasets e.g. C/C++ functions and descriptions, or
other units of granularity e.g. code snippets in-
stead of methods/functions.
The rationale behind RQ1is that many papers
split the dataset into training, validation, and test
sets at the unit of granularity under study. For
example, dividing all Java methods in the dataset
into 80% in training, 10% in validation, and 10%
in testing. However, this results in a situation
where it is possible for code from one project to
be in both the testing set and the training set. It is
possible that similar vocabulary and code patterns
Figure 1: Word count histogram for code, comment, and the
book summaries. About 22% of words occur one time across
all Java methods, versus 35% in the book summaries.
are used in methods from the same project, and
even worse, it is possible that overloaded methods
appear in both the training and test sets. How-
ever, this possibility is theoretical and a nega-
tive effect has never been shown. In contrast, we
split by project: randomly divide the Java projects
into training/validation/test groups, then place all
methods from e.g. test projects into the test set.
The rationale behind RQ2is that automati-
cally generated code is common in many Java
projects (Shimonaka et al.,2016), and that it
is possible that very similar code is generated
for projects in the training set and the test-
ing/validation sets. Shimonaka et al. (Shimonaka
et al.,2016) point out that the typical approach for
identifying auto-generated code is a simple case-
insensitive text search for the phrase “generated
by” in the comments of the Java files. LeClair et
al. (LeClair and McMillan,2019) report that this
search turns out to be quite aggressive, catching
nearly all auto-generated code in the repository.
However, as with RQ1, the effect of this filter is
theoretical and has not been measured in practice.
3.2 Methodology
Our methodology for answering RQ1is to com-
pare the results of a standard NMT algorithm with
the dataset split by project, to the results of the
same algorithm on the same dataset, except with
the dataset split by function. But because random
splits could be “lucky”, we created four random
datasets split by project, and four split by function,
seen in Table 2. We then use an off-the-shelf, stan-
dard NMT technique called attendgru provided
pre-release by LeClair et al. (LeClair and McMil-
lan,2019) and used as a baseline approach in their
recent paper. The technique is just an attentional
encoder/decoder based on single-layer GRUs, and
represents a strong NMT baseline used by many
papers. We train attendgru with each of the four
training sets, find the best-performing model using
the validation set associated with that training set
Figure 2: Histogram of word occurrences per document.
Approximately 34% of words occur in only one Java method,
20% occur in two methods, etc.
(out of 10 maximum epochs), and then obtain test
performance for that model. We report the average
of the results over the four random splits. Note that
we used the same configuration for attendgru as
LeClair et al. report, except that we reduced the
output vocabulary to 10k to reduce model size.
Our process for RQ2is similar. We created four
random split-by-project sets in which automati-
cally generated code was not removed. Then we
compared them to the four random split-by-project
sets we created for RQ1(in which auto-generated
code was removed).
3.3 Dataset Characteristics
We make three observations about the dataset that,
in our view, are likely to affect how researchers de-
sign source code summarization algorithms. First,
as depicted in Figure 1, words appear to be used
more often in code as compared to natural lan-
guage – there are fewer words used only one or
two times, and in general more used 3+ times.
At the same time (Figure 2), the pattern for word
occurrences per document appears similar, imply-
ing that even though words in code are repeated,
they are repeated often in the same method and
not across methods. Even though this may suggest
that the occurrence of unique words in source code
is isolated enough to have little affect on BLEU
score, we show in Section 4that this word over-
lap causes BLEU score inflation when you split
by function. This is important because the typi-
cal MT use case assumes that a “dictionary” can
be created (e.g., via attention) to map words in a
source to words in a target language. An algorithm
applied to code summarization needs to tolerate
multiple occurrences of the same words. To com-
pare the source code, comments, and natural lan-
guage datasets we tokenized our data by removing
all special characters, lower casing, and for source
code – splitting camel case into separate tokens.
A related observation is that Java methods tend
to be much longer than comments (Figure 3ar-
SplittingStrategy Set1 Set2 Set3 Set4
Split by project 17.81 16.73 17.11 17.99
Split by function 20.97 23.74 23.67 23.68
Auto-generated code included 19.11 19.09 18.04 15.66
Table 1: Average BLEU Scores from 15 epochs for each of the four sets.
eas (c) and (d)). Typically, code summarization
tools take inspiration from NMT algorithms de-
signed for cases of similar encoder/decoder se-
quence length. Many algorithms such as recur-
rent networks are sensitive to sequence length, and
may not be optimal off-the-shelf.
A third observation is that the words in meth-
ods and comments tend to overlap, but in fact a
vast majority of words are different (70% of words
in code summary comments do not occur in the
code method, see Figure 3area (b)). This situa-
tion makes the code summarization problem quite
difficult because the words in the comments repre-
sent high level concepts, while the words in the
source code represent low level implementation
details – a situation known as the “concept assign-
ment problem” (Biggerstaff et al.,1993). A code
summarization algorithm cannot only learn a word
dictionary as it might in a typical NMT setting,
or select summarizing words from the method for
a summary as a natural language summarization
tool might. A code summarization algorithm must
learn to identify concepts from code details, and
assign high level terms to those concepts.
4 Experimental Results & Conclusion
In this section, we answer our Research Questions
and provide supporting evidence and rational.
4.1 RQ1: Splitting Strategy
We observe a large “false” boost in BLEU score
when split by function instead of split by project
Figure 3: Overlap of words between methods and comments
(areas a and b). Over 30% of words in comments, on average
also occur in the method it describes. About 11% of words
in code, on average, also occur in the comment describing it.
Also, word length of methods and comments (areas c and d).
Methods average around 30 words, versus 10 for comments.
(see Figure 4). We consider this boost false be-
cause it involves placing functions from projects
in the test set into the training set – an unrealis-
tic scenario. An average of four runs when split
by project was 17.41 BLEU, a result relatively
consistent across the splits (maximum was 18.28
BLEU, minimum 16.10). In contrast, when split
by function, the average BLEU score was 23.02,
and increase of nearly one third as seen in Ta-
ble 1. Our conclusion is that splitting by func-
tion is to be avoided during dataset creation for
source code summarization. Beyond this narrow
answer to the RQ, in general, any leakage of infor-
mation from test set projects into the training or
validation sets ought to be strongly avoided, even
if the unit of granularity is smaller than a whole
project. We reiterate from Section 1that this is
not a theoretical problem: many papers published
using data-driven techniques for code summariza-
tion and other research problems split their data at
the level of granularity under study.
4.2 RQ2: Removing Autogen. Code
We also found a boost in BLEU score when not re-
moving automatically generated code, though the
difference was less than observed for RQ1. The
baseline performance increased to 18 BLEU when
not removing auto-generated code, and it varied
much more depending on the split (some projects
have much more auto-generated code than others).
Our recommendation is that, in general, reason-
Figure 4: Boxplots of BLEU scores from attendgru for
four runs under configurations for RQ1and RQ2.
BP Set1 BP Set2 BP Set3 BP Set4 BF All Sets
Training Set 1,935,860 1,950,026 1,942,291 1,933,677 1,943,723
Validation Set 105,693 100,920 104,837 105,997 107,984
Testing Set 107,568 98,175 101,993 109,447 107,984
Table 2: Number of method-comment pairs in the train, validation, test sets used in each random split set when split by project
(BP) and by function (BF).
able precautions should be implemented to remove
auto-generated code from the dataset because we
do find evidence that auto-generated code can af-
fect the results of experiments.
5 Discussion
This paper provides benefits to researchers in the
field of automatic source code summarization in
two areas. First, we provide insight into the effects
of splitting a Java method and comment dataset
by project or by function, and how these differ-
ent splitting methods effect the task of source code
summarization. Second, we provide a dataset
of 2.1m pairs of Java methods and one sentence
method descriptions in a cleaned and tokenized
format (discussed in 6) as well as a training, vali-
dation, testing split.
Note however that there may be cases where re-
searchers wish to adapt our recommendations for
a specific context. For example, when generating
comments in an IDE. The problem of code sum-
marization in an IDE is slightly different than what
we have presented, and would benefit from includ-
ing code-comment pairs from the same project.
IDEs have the advantage of access to a program-
mer’s source code and edit history in real time –
they do not rely on a repository collected post-hoc.
Moreno et al. (Moreno et al.,2013) take advantage
of this information to generate Java class sum-
maries in an eclipse plugin – their tool uses both
the class and project level information from com-
pleted projects to generate these summaries, while
not using any information from outside sources.
However, even in this case, care must be taken
to avoid unrealistic scenarios, such as ensuring
that the training set consists only of code older
than the code in the test set. For example, consider
a programmer at revision 75 of his or her project
who requests automatically generated comments
from the IDE, then goes on to write a total of 100
revisions for the project. An experiment simulat-
ing this situation should only use revisions 1-74
as training data – revisions 76+ are “in the future”
from the perspective of the real world situation.
6 Downloadable Dataset
In our online appendix we have made three down-
loadable sets available as seen in Table 3. The
first is our SQL database, generated using the tool
from McMillan et al. (McMillan et al.,2011),
that contains the file name, method comment, and
start/end lines for each method, we call this dataset
our “Raw Dataset”. We also provide a link to
the Sourcerer dataset (Linstead et al.,2009) which
is used as a base for the dataset in LeClair et
al. (LeClair and McMillan,2019). In addition
to the Raw Dataset, we also provide a “Filtered
Dataset” that consists of a set of 2.1m method
comment pairs. In the Filtered Dataset we re-
moved auto-generated source code files, as well
all method’s that do not have an associated com-
ment. No preprocessing was applied to the source
code and comment strings in the Filtered Dataset.
The third downloadable set we supply is the “To-
kenized Dataset”. In the Tokenized Dataset, we
processed the source code and comments from
the Filtered Dataset identically to the tokeniza-
tion scheme described in Section 5 of (LeClair and
McMillan,2019). This set also provides a training,
validation, and test set as well as a script to easily
reshuffle these sets.
The URL for download is:
Dataset Methods Comments
Raw Dataset 51,841,717 7,063,331
Filtered Dataset 2,149,121 2,149,121
Tokenized Dataset 2,149,121 2,149,121
Table 3: Datasets available for download.
This work is supported in part by the NSF
CCF-1452959, CCF-1717607, and CNS-1510329
grants. Any opinions, findings, and conclusions
expressed herein are the authors and do not neces-
sarily reflect those of the sponsors
Miltiadis Allamanis, Hao Peng, and Charles Sutton.
2016. A convolutional attention network for ex-
treme summarization of source code. In Inter-
national Conference on Machine Learning, pages
Ted J Biggerstaff, Bharat G Mitbander, and Dallas
Webster. 1993. The concept assignment problem in
program understanding. In Proceedings of the 15th
international conference on Software Engineering,
pages 482–498. IEEE Computer Society Press.
William Cohen and Prem Devanbu. 2018. Workshop
on nlp for software engineering.
Brian P Eddy, Jeffrey A Robinson, Nicholas A Kraft,
and Jeffrey C Carver. 2013. Evaluating source code
summarization techniques: Replication and expan-
sion. In Program Comprehension (ICPC), 2013
IEEE 21st International Conference on, pages 13–
22. IEEE.
Xiaodong Gu, Hongyu Zhang, and Sunghun Kim.
2018. Deep code search. In Proceedings of the 40th
International Conference on Software Engineering,
pages 933–944. ACM.
Sonia Haiduc, Jairo Aponte, and Andrian Marcus.
2010a. Supporting program comprehension with
source code summarization. In Proceedings of the
32Nd ACM/IEEE International Conference on Soft-
ware Engineering-Volume 2, pages 223–226. ACM.
Sonia Haiduc, Jairo Aponte, Laura Moreno, and An-
drian Marcus. 2010b. On the use of automated text
summarization techniques for summarizing source
code. In 2010 17th Working Conference on Reverse
Engineering, pages 35–44. IEEE.
Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018a.
Deep code comment generation. In Proceedings of
the 26th Conference on Program Comprehension,
pages 200–210. ACM.
Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi
Jin. 2018b. Summarizing source code with trans-
ferred api knowledge. In IJCAI, pages 2269–2275.
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and
Luke Zettlemoyer. 2016. Summarizing source code
using a neural attention model. In Proceedings of
the 54th Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers),
volume 1, pages 2073–2083.
Siyuan Jiang, Ameer Armaly, and Collin McMillan.
2017. Automatically generating commit messages
from diffs using neural machine translation. In Pro-
ceedings of the 32nd IEEE/ACM International Con-
ference on Automated Software Engineering, pages
135–146. IEEE Press.
Philipp Koehn. 2018. European parliament proceed-
ings parallel corpus 1996-2011. http://www. Accessed October
25, 2018.
Douglas Kramer. 1999. Api documentation from
source code comments: a case study of javadoc. In
Proceedings of the 17th annual international confer-
ence on Computer documentation, pages 147–153.
Alexander LeClair and Collin McMillan. 2019. A neu-
ral model for generating natural language summaries
of program subroutines. International Conference
on Software Engineering (ICSE) (*under review*).
Erik Linstead, Sushil Bajracharya, Trung Ngo, Paul
Rigor, Cristina Lopes, and Pierre Baldi. 2009.
Sourcerer: mining and searching internet-scale soft-
ware repositories.Data Mining and Knowledge Dis-
covery, 18:300–336. 10.1007/s10618-008-0118-x.
C. Lopes, S. Bajracharya, J. Ossher, and P. Baldi. 2010.
UCI source code data sets.
Pablo Loyola, Edison Marrese-Taylor, and Yutaka Mat-
suo. 2017. A neural architecture for generating natu-
ral language descriptions from source code changes.
arXiv preprint arXiv:1704.04856.
Paul W McBurney and Collin McMillan. 2016. Auto-
matic source code summarization of context for java
methods. IEEE Transactions on Software Engineer-
ing, 42(2):103–119.
Collin McMillan, Mark Grechanik, Denys Poshy-
vanyk, Qing Xie, and Chen Fu. 2011. Portfolio:
finding relevant functions and their usage. In Pro-
ceedings of the 33rd International Conference on
Software Engineering, pages 111–120. ACM.
L. Moreno, A. Marcus, L. Pollock, and K. Vijay-
Shanker. 2013. Jsummarizer: An automatic gener-
ator of natural language summaries for java classes.
In 2013 21st International Conference on Program
Comprehension (ICPC), pages 230–232.
Najam Nazar, Yan Hu, and He Jiang. 2016. Summariz-
ing software artifacts: A literature review. Journal
of Computer Science and Technology, 31(5):883–
Yusuke Oda, Hiroyuki Fudaba, Graham Neubig,
Hideaki Hata, Sakriani Sakti, Tomoki Toda, and
Satoshi Nakamura. 2015. Learning to generate
pseudo-code from source code using statistical ma-
chine translation (t). In Automated Software Engi-
neering (ASE), 2015 30th IEEE/ACM International
Conference on, pages 574–584. IEEE.
Chris Quirk. 2015. Nsf interdisciplinary work-
shop on statistical nlp and software engineer-
nlse2015/. Accessed October 25, 2016.
Paige Rodeghero, Cheng Liu, Paul W McBurney, and
Collin McMillan. 2015. An eye-tracking study of
java programmers and application to source code
summarization. IEEE Transactions on Software En-
gineering, 41(11):1038–1054.
Kento Shimonaka, Soichi Sumi, Yoshiki Higo, and
Shinji Kusumoto. 2016. Identifying auto-generated
code by using machine learning techniques. In Em-
pirical Software Engineering in Practice (IWESEP),
2016 7th International Workshop on, pages 18–23.
Giriprasad Sridhara, Emily Hill, Divya Muppaneni,
Lori Pollock, and K Vijay-Shanker. 2010. To-
wards automatically generating summary comments
for java methods. In Proceedings of the IEEE/ACM
international conference on Automated software en-
gineering, pages 43–52. ACM.
Giriprasad Sridhara, Lori Pollock, and K Vijay-
Shanker. 2011. Automatically detecting and de-
scribing high level actions within methods. In Pro-
ceedings of the 33rd International Conference on
Software Engineering, pages 101–110. ACM.
... Other strategies include symbolic or concolic execution [15]. But these strategies are not practical on a large dataset -training data for neural code summarization often run into hundreds of thousands or even millions of examples [16], [17]. ...
... The Java dataset we use consists of 190K Java methods that we curated from a larger dataset of 2.1m methods proposed in 2019 by LeClair el al. [16] named funcom. We selected the funcom dataset for three reasons. ...
Full-text available
Source code summarization is the task of writing natural language descriptions of source code behavior. Code summarization underpins software documentation for programmers. Short descriptions of code help programmers understand the program quickly without having to read the code itself. Lately, neural source code summarization has emerged as the frontier of research into automated code summarization techniques. By far the most popular targets for summarization are program subroutines. The idea, in a nutshell, is to train an encoder-decoder neural architecture using large sets of examples of subroutines extracted from code repositories. The encoder represents the code and the decoder represents the summary. However, most current approaches attempt to treat the subroutine as a single unit. For example, by taking the entire subroutine as input to a Transformer or RNN-based encoder. But code behavior tends to depend on the flow from statement to statement. Normally dynamic analysis may shed light on this flow, but dynamic analysis on hundreds of thousands of examples in large datasets is not practical. In this paper, we present a statement-based memory encoder that learns the important elements of flow during training, leading to a statement-based subroutine representation without the need for dynamic analysis. We implement our encoder for code summarization and demonstrate a significant improvement over the state-of-the-art.
... In this paper, our aim is to train a state-of-the-art neural network model (we use the NeuralCodeSum model (Ahmad et al., 2020)) designed for source code summarisation, using a dataset of source code -summary pairs as described in Section 2. The dataset we use is a modified version of the Funcom dataset (LeClair and McMillan, 2019). We selected this dataset to allow us to compare our model to previous research (Mahmud et al., 2021) which trains NeuralCodeSum on the same dataset and evaluates the results using traditional n-gram matching metrics. ...
... The lack of unified and standard datasets is a major obstacle to the rapid development of code summarization research [23]. The unification of test datasets plays a positive role in promoting neural code summarization research. ...
In software engineering, software personnel faced many large-scale software and complex systems, these need programmers to quickly and accurately read and understand the code, and efficiently complete the tasks of software change or maintenance tasks. Code-NN is the first model to use deep learning to accomplish the task of code summary generation, but it is not used the structural information in the code itself. In the past five years, researchers have designed different code summarization systems based on neural networks. They generally use the end-to-end neural machine translation framework, but many current research methods do not make full use of the structural information of the code. This paper raises a new model called G-DCS to automatically generate a summary of java code; the generated summary is designed to help programmers quickly comprehend the effect of java methods. G-DCS uses natural language processing technology, and training the model uses a code corpus. This model could generate code summaries directly from the code files in the coded corpus. Compared with the traditional method, it uses the information of structural on the code. Through Graph Convolutional Neural Network (GCN) extracts the structural information on the code to generate the code sequence, which makes the generated code summary more accurate. The corpus used for training was obtained from GitHub. Evaluation criteria using BLEU-n. Experimental results show that our approach outperforms models that do not utilize code structure information.
Full-text available
Source code summarization is the task of writing natural language descriptions of source code. A typical use case is generating short summaries of subroutines for use in API documentation. The heart of almost all current research into code summarization is the encoder-decoder neural architecture, and the encoder input is almost always a single subroutine or other short code snippet. The problem with this setup is that the information needed to describe the code is often not present in the code itself -- that information often resides in other nearby code. In this paper, we revisit the idea of ``file context'' for code summarization. File context is the idea of encoding select information from other subroutines in the same file. We propose a novel modification of the Transformer architecture that is purpose-built to encode file context and demonstrate its improvement over several baselines. We find that file context helps on a subset of challenging examples where traditional approaches struggle.
Full-text available
Automatic summarization is attracting increasing attention as one of the most promising research areas. This technology has been tried in various real-world applications in recent years and achieved a good response. However, the applicability of conventional evaluation metrics cannot keep up with rapidly evolving summarization task formats and ensuing indicator. After recent years of research, automatic summarization task requires not only readability and fluency, but also informativeness and consistency. Diversified application scenarios also bring new challenges both for generative language models and evaluation metrics. In this review, we analysis and specifically focus on the difference between the task format and the evaluation metrics.
Full-text available
A code summary is a brief natural language description of source code. Summaries are usually only a single sentence long, and yet form the backbone of developer documentation. A short descriptions such as "changes all visible polygons to the color blue" can give a programmer a high-level idea of what code does without the effort of reading the code itself. Recently, products based on Large Language Models such as ChatGPT have demonstrated a strong ability to write these descriptions automatically. However, to use these tools, programmers must send their code to untrusted third parties for processing (e.g., via an API call). This loss of custody is not acceptable to many organizations. In this paper, we present an alternative: we train an open source model using sample output generated by GPT-3.5 in a process related to knowledge distillation. Our model is small enough (350m parameters) to be run on a single 16gb GPU, yet we show in our evaluation that it is large enough to mimic GPT-3.5 on this task.
Full-text available
This paper presents an improved loss function for neural source code summariza-tion. Code summarization is the task of writing natural language descriptions ofsource code. Neural code summarization refers to automated techniques for gen-erating these descriptions using neural networks. Almost all current approachesinvolve neural networks as either standalone models or as part of a pretrainedlarge language models e.g., GPT, Codex, LLaMA. Yet almost all also use acategorical cross-entropy (CCE) loss function for network optimization. Twoproblems with CCE are that 1) it computes loss over each word prediction one-at-a-time, rather than evaluating a whole sentence, and 2) it requires a perfectprediction, leaving no room for partial credit for synonyms. We propose and eval-uate a loss function to alleviate this problem. In essence, we propose to use asemantic similarity metric to calculate loss over the whole output sentence pre-diction per training batch, rather than just loss for each word. We also proposeto combine our loss with traditional CCE for each word, which streamlines thetraining process compared to baselines. We evaluate our approach over severalbaselines and report an improvement in the vast majority of conditions.
Conference Paper
Full-text available
During software maintenance, code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in the software projects. Developers have to infer the functionality from the source code. This paper proposes a new approach named DeepCom to automatically generate code comments for Java methods. The generated comments aim to help developers understand the functionality of Java methods. DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. We use a deep neural network that analyzes structural information of Java methods for better comments generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects from GitHub. We evaluate the experimental results on a machine translation metric. Experimental results demonstrate that our method DeepCom outperforms the state-of-the-art by a substantial margin.
Conference Paper
Full-text available
To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language query. They lack a deep understanding of the semantics of queries and source code. In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized and irrelevant/noisy keywords in queries can be handled. As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.
Conference Paper
Source code summarization -- creating natural language descriptions of source code behavior -- is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance. Traditional techniques relied on heuristics and templates built manually by human experts. Recently, data-driven approaches based on neural machine translation have largely overtaken template-based systems. But nearly all of these techniques rely almost entirely on programs having good internal documentation; without clear identifier names, the models fail to create good summaries. In this paper, we present a neural model that combines words from code with code structure from an AST. Unlike previous approaches, our model processes each data source as a separate input, which allows the model to learn code structure independent of the text in code. This process helps our approach provide coherent summaries in many cases even when zero internal documentation is provided. We evaluate our technique with a dataset we created from 2.1m Java methods. We find improvement over two baseline techniques from SE literature and one from NLP literature.
Conference Paper
Code summarization, aiming to generate succinct natural language description of source code, is extremely useful for code search and code comprehension. It has played an important role in software maintenance and evolution. Previous approaches generate summaries by retrieving summaries from similar code snippets. However, these approaches heavily rely on whether similar code snippets can be retrieved, how similar the snippets are, and fail to capture the API knowledge in the source code, which carries vital information about the functionality of the source code. In this paper, we propose a novel approach, named TL-CodeSum, which successfully uses API knowledge learned in a different but related task to code summarization. Experiments on large-scale real-world industry Java projects indicate that our approach is effective and outperforms the state-of-the-art in code summarization.
This paper presents a literature review in the field of summarizing software artifacts, focusing on bug reports, source code, mailing lists and developer discussions artifacts. From Jan. 2010 to Apr. 2016, numerous summarization techniques, approaches, and tools have been proposed to satisfy the ongoing demand of improving software performance and quality and facilitating developers in understanding the problems at hand. Since aforementioned artifacts contain both structured and unstructured data at the same time, researchers have applied different machine learning and data mining techniques to generate summaries. Therefore, this paper first intends to provide a general perspective on the state of the art, describing the type of artifacts, approaches for summarization, as well as the common portions of experimental procedures shared among these artifacts. Moreover, we discuss the applications of summarization, i.e., what tasks at hand have been achieved through summarization. Next, this paper presents tools that are generated for summarization tasks or employed during summarization tasks. In addition, we present different summarization evaluation methods employed in selected studies as well as other important factors that are used for the evaluation of generated summaries such as adequacy and quality. Moreover, we briefly present modern communication channels and complementarities with commonalities among different software artifacts. Finally, some thoughts about the challenges applicable to the existing studies in general as well as future research directions are also discussed. The survey of existing studies will allow future researchers to have a wide and useful background knowledge on the main and important aspects of this research field.