Figure 3 - uploaded by Alexander LeClair
Overlap of words between methods and comments (areas a and b): over 30% of the words in a comment, on average, also occur in the method it describes, while about 11% of the words in a method's code, on average, also occur in the comment describing it. Also, word length of methods and comments (areas c and d): methods average around 30 words, versus 10 for comments.

Source publication
Conference Paper
Full-text available
Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation, e.g., the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable dat...

Contexts in source publication

Context 1
... related observation is that Java methods tend to be much longer than comments (Figure 3 areas (c) and (d)). Typically, code summarization tools take inspiration from NMT algorithms designed for cases of similar encoder/decoder sequence length. ...
Context 2
... third observation is that the words in methods and comments tend to overlap, but in fact the vast majority of words are different (70% of the words in code summary comments do not occur in the corresponding method; see Figure 3 area (b)). This situation makes the code summarization problem quite difficult because the words in the comments represent high-level concepts, while the words in the source code represent low-level implementation details, a situation known as the "concept assignment problem" (Biggerstaff et al., 1993). ...
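The overlap statistics cited here boil down to simple set intersections over tokenized methods and comments. The sketch below is only illustrative: it assumes whitespace tokenization and lowercasing rather than the tokenizer actually used in the paper.

```python
def word_overlap(method_text: str, comment_text: str):
    """Return (fraction of comment words found in the method,
    fraction of method words found in the comment).
    Assumes simple whitespace tokenization and lowercasing."""
    method_words = set(method_text.lower().split())
    comment_words = set(comment_text.lower().split())
    shared = method_words & comment_words
    return (len(shared) / max(len(comment_words), 1),
            len(shared) / max(len(method_words), 1))

# Toy example with camelCase identifiers already split into subtokens
# (as is common in code summarization datasets): a large share of the
# comment words appear in the method, a much smaller share the other way.
method = "public int get total count ( ) { return this . total count ; }"
comment = "returns the total count of items"
print(word_overlap(method, comment))
```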

Similar publications

Preprint
Full-text available
Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation, e.g., the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable dat...

Citations

... Other strategies include symbolic or concolic execution [15]. But these strategies are not practical on a large dataset: training sets for neural code summarization often run into hundreds of thousands or even millions of examples [16], [17]. ...
... The Java dataset we use consists of 190K Java methods that we curated from a larger dataset of 2.1m methods, named funcom, proposed in 2019 by LeClair et al. [16]. We selected the funcom dataset for three reasons. ...
Preprint
Full-text available
Source code summarization is the task of writing natural language descriptions of source code behavior. Code summarization underpins software documentation for programmers. Short descriptions of code help programmers understand the program quickly without having to read the code itself. Lately, neural source code summarization has emerged as the frontier of research into automated code summarization techniques. By far the most popular targets for summarization are program subroutines. The idea, in a nutshell, is to train an encoder-decoder neural architecture using large sets of examples of subroutines extracted from code repositories. The encoder represents the code and the decoder represents the summary. However, most current approaches attempt to treat the subroutine as a single unit. For example, by taking the entire subroutine as input to a Transformer or RNN-based encoder. But code behavior tends to depend on the flow from statement to statement. Normally dynamic analysis may shed light on this flow, but dynamic analysis on hundreds of thousands of examples in large datasets is not practical. In this paper, we present a statement-based memory encoder that learns the important elements of flow during training, leading to a statement-based subroutine representation without the need for dynamic analysis. We implement our encoder for code summarization and demonstrate a significant improvement over the state-of-the-art.
... The lack of unified and standard datasets is a major obstacle to the rapid development of code summarization research [23]. Unifying test datasets plays a positive role in promoting neural code summarization research. ...
Article
In software engineering, practitioners face large-scale software and complex systems that require programmers to read and understand code quickly and accurately in order to complete software change and maintenance tasks efficiently. Code-NN was the first model to use deep learning for code summary generation, but it does not use the structural information in the code itself. Over the past five years, researchers have designed different code summarization systems based on neural networks. They generally use the end-to-end neural machine translation framework, but many current methods do not make full use of the structural information of the code. This paper proposes a new model, called G-DCS, to automatically generate summaries of Java code; the generated summaries are designed to help programmers quickly comprehend the effect of Java methods. G-DCS uses natural language processing techniques and is trained on a code corpus, generating code summaries directly from the code files in that corpus. Compared with traditional methods, it uses structural information from the code: a Graph Convolutional Network (GCN) extracts the structural information to generate the code sequence, which makes the generated summaries more accurate. The training corpus was obtained from GitHub, and evaluation uses BLEU-n. Experimental results show that the approach outperforms models that do not utilize code structure information.
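The abstract above centres on using a Graph Convolutional Network to capture code structure. For reference, a single GCN layer propagates node features over a normalized adjacency matrix; the NumPy sketch below is a generic illustration with made-up inputs, not the G-DCS implementation.

```python
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One graph-convolution step: H' = ReLU(D^-1/2 A_hat D^-1/2 H W),
    where A_hat = A + I (self-loops) and D is the degree matrix of A_hat.
    A: adjacency matrix of the code graph (e.g., AST edges),
    H: node feature matrix, W: learned weight matrix."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))    # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy example: 3 graph nodes, 4-dim input features, 2-dim output features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.random.randn(3, 4)
W = np.random.randn(4, 2)
print(gcn_layer(A, H, W).shape)  # (3, 2)
```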
... jm52m is a dataset of 52m Java methods created from 52k Java projects. The source code originated from the Merobase [20] and Sourcerer [26] data releases, supplemented by our own prior work in LeClair et al. [24]. It contains code uploaded to code repositories between 2008 and 2018. ...
Preprint
Full-text available
This tool demonstration presents a research toolkit for a language model of Java source code. The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java. In contrast to many existing language models, we prioritize features for researchers including an open and easily-searchable training set, a held out test set with different levels of deduplication from the training set, infrastructure for deduplicating new examples, and an implementation platform suitable for execution on equipment accessible to a relatively modest budget. Our model is a GPT2-like architecture with 350m parameters. Our training set includes 52m Java methods (9b tokens) and 13m StackOverflow threads (10.5b tokens). To improve accessibility of research to more members of the community, we limit local resource requirements to GPUs with 16GB video memory. We provide a test set of held out Java methods that include descriptive comments, including the entire Java projects for those methods. We also provide deduplication tools using precomputed hash tables at various similarity thresholds to help researchers ensure that their own test examples are not in the training set. We make all our tools and data open source and available via Huggingface and Github.
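The deduplication tooling mentioned above relies on precomputed hash tables at several similarity thresholds. The sketch below is only a rough illustration of the general idea (exact-hash filtering of token sets plus a Jaccard similarity threshold); the toolkit's actual tables and thresholds are not reproduced here.

```python
from hashlib import sha256

def token_set_hash(code: str) -> str:
    """Hash of the sorted, lowercased token set; equal hashes flag near-exact duplicates."""
    tokens = sorted(set(code.lower().split()))
    return sha256(" ".join(tokens).encode("utf-8")).hexdigest()

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two code strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def filter_against_training(test_examples, training_examples, threshold=0.8):
    """Drop test examples that hash-match or exceed a Jaccard similarity
    threshold against any training example. Brute force for clarity; a real
    pipeline would precompute hash tables or use MinHash/LSH buckets."""
    train_hashes = {token_set_hash(t) for t in training_examples}
    kept = []
    for ex in test_examples:
        if token_set_hash(ex) in train_hashes:
            continue
        if any(jaccard(ex, t) >= threshold for t in training_examples):
            continue
        kept.append(ex)
    return kept
```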
... Benchmark datasets comparison: The Vault is a comprehensive collection of parallel code and docstring pairs, which is larger in scale and covers more programming languages than existing datasets. Table 4 presents a comparison between our dataset and other parallel datasets, including Funcom [25], Deepcom [18], CONCODE [21], CodeSearchNet [19], CoDesc [16] and non-public data used for pretraining [5,4,38]. Although these datasets are widely used for pretraining and fine-tuning downstream tasks, they typically contain only a single language and a small number of samples. ...
Preprint
We present The Vault, an open-source, large-scale code-text dataset designed to enhance the training of code-focused large language models (LLMs). Existing open-source datasets for training code-based LLMs often face challenges in terms of size, quality (due to noisy signals), and format (only containing code function and text explanation pairings). The Vault overcomes these limitations by providing 40 million code-text pairs across 10 popular programming languages, thorough cleaning for 10+ prevalent issues, and various levels of code-text pairings, including class, function, and line levels. Researchers and practitioners can utilize The Vault for training diverse code-focused LLMs or incorporate the provided data cleaning methods and scripts to improve their datasets. By employing The Vault as the training dataset for code-centric LLMs, we anticipate significant advancements in code understanding and generation tasks, fostering progress in both artificial intelligence research and software development practices.
... Our study builds on prior work scrutinizing the design [11,103,138,201,259], documentation [36,54,78,210,222,239], and analytical evaluation [68,166,181,227] of datasets. For example, Hanley et al. [95] identified four aspects of human-centric dataset development that result in ethical concern, namely purpose (e.g., moral legitimacy), creation (e.g., data sourcing and cleaning), composition (e.g., data instances, metadata), and distribution (e.g., terms of use). ...
... For example, Hutiri et al. [103] present guidelines for designing speaker verification evaluation datasets, addressing limitations in previous datasets, i.e., evaluation bias, unrepresentativeness, and unaccounted-for sources of error. LeClair and McMillan [138] provide a set of recommendations alongside a new dataset, motivated by conflicting results in the task of source code summarization, which were due to a lack of community consensus on how datasets should be designed and collected. Similarly, in the library domain, despite the major role that library linked data plays in retrieval, there is a lack of agreed-upon methodological guidelines for publishing library linked data, which has now been addressed by Vilazzon et al. [259]. ...
Preprint
Full-text available
Speech datasets are crucial for training Speech Language Technologies (SLT); however, the lack of diversity of the underlying training data can lead to serious limitations in building equitable and robust SLT products, especially along dimensions of language, accent, dialect, variety, and speech impairment - and the intersectionality of speech features with socioeconomic and demographic features. Furthermore, there is often a lack of oversight on the underlying training data - commonly built on massive web-crawling and/or publicly available speech - with regard to the ethics of such data collection. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets, which can be used in addition to "Datasheets for Datasets". We then exemplify the importance of each question in our augmented datasheet based on in-depth literature reviews of speech data used in domains such as machine learning, linguistics, and health. Finally, we encourage practitioners - ranging from dataset creators to researchers - to use our augmented datasheet to better define the scope, properties, and limits of speech datasets, while also encouraging consideration of data-subject protection and user community empowerment. Ethical dataset creation is not a one-size-fits-all process, but dataset creators can use our augmented datasheet to reflexively consider the social context of related SLT applications and data sources in order to foster more inclusive SLT products downstream.
... Basically the idea is that an encoder forms a representation of source code in a vector space, while the decoder forms a representation of the summary in a different vector space. With sufficient training data (usually millions of samples [10], [11]), another part of the model learns to connect features in one space to the other and can be used to predict output summaries for arbitrary input source code. Neural designs based on the encoder-decoder model have almost completely supplanted earlier template-and heuristic-based approaches [7]. ...
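As a generic reference for the encoder-decoder setup described in this passage, the PyTorch sketch below encodes code tokens into one vector space and decodes summary tokens from another, with the learned recurrent state connecting the two. It is a simplified skeleton (no attention, no training loop) and not any specific system from the cited papers.

```python
import torch
import torch.nn as nn

class CodeSummarizer(nn.Module):
    """Generic seq2seq skeleton: the encoder embeds code tokens into one
    vector space, the decoder embeds summary tokens into another, and the
    encoder's final hidden state conditions the decoder."""
    def __init__(self, code_vocab: int, text_vocab: int, dim: int = 256):
        super().__init__()
        self.code_emb = nn.Embedding(code_vocab, dim)
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, text_vocab)

    def forward(self, code_ids, summary_ids):
        _, h = self.encoder(self.code_emb(code_ids))               # encode the code
        dec_out, _ = self.decoder(self.text_emb(summary_ids), h)   # decode conditioned on it
        return self.out(dec_out)                                   # next-word logits

model = CodeSummarizer(code_vocab=50_000, text_vocab=10_000)
logits = model(torch.randint(0, 50_000, (2, 30)), torch.randint(0, 10_000, (2, 10)))
print(logits.shape)  # (2, 10, 10000)
```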
... We use two datasets in this paper: one is Java and the other is C/C++. The Java dataset was first published by LeClair et al., who introduced some of the best practices for developing a dataset for source code summarization that are now standard among the research community [11]. It consists of 2.1m methods from more than 28k projects. ...
... The C/C++ dataset was first published by Haque et al. [14] following an extraction model proposed by Eberhart et al. [55] to adhere to the idiosyncrasies of C/C++, while maintaining the same strict standards proposed by LeClair et al. [11]. It consists of 1.1m methods from more than 33k projects. ...
Preprint
Label smoothing is a regularization technique for neural networks. Normally neural models are trained to an output distribution that is a vector with a single 1 for the correct prediction, and 0 for all other elements. Label smoothing converts the correct prediction location to something slightly less than 1, then distributes the remainder to the other elements such that they are slightly greater than 0. A conceptual explanation behind label smoothing is that it helps prevent a neural model from becoming "overconfident" by forcing it to consider alternatives, even if only slightly. Label smoothing has been shown to help several areas of language generation, yet typically requires considerable tuning and testing to achieve the optimal results. This tuning and testing has not been reported for neural source code summarization - a growing research area in software engineering that seeks to generate natural language descriptions of source code behavior. In this paper, we demonstrate the effect of label smoothing on several baselines in neural code summarization, and conduct an experiment to find good parameters for label smoothing and make recommendations for its use.
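The transformation described in this abstract can be written down directly: the one-hot target keeps 1 - epsilon on the correct word and spreads epsilon evenly over the rest of the vocabulary. The NumPy sketch below uses an arbitrary epsilon = 0.1, not the parameters the paper ends up recommending.

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """Replace a one-hot target with (1 - epsilon) on the correct class
    and epsilon / (V - 1) on each of the other V - 1 classes."""
    vocab_size = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + (1.0 - one_hot) * (epsilon / (vocab_size - 1))

target = np.eye(5)[2]         # correct word is index 2 of a 5-word vocabulary
print(smooth_labels(target))  # [0.025 0.025 0.9   0.025 0.025]
```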
... After this process, we removed all instances longer than 512 tokens (i.e., the number of tokens used to represent both the method and its Javadoc was higher than 512), as also done by previous work using DL to automate code-related tasks (see e.g., [70], [73], [74]): 4,821,922 instances were left. ...
... 2) Code summarization: We use the FunCom dataset [73], [75], featuring 2,149,120 instances, each composed of a Java method and its associated Javadoc comment. FunCom has been curated to only include English comments and exclude auto-generated files. ...
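The 512-token filter described in these contexts amounts to dropping any method/Javadoc pair whose combined token count exceeds the limit. A minimal sketch follows, using whitespace splitting as a stand-in for the real (subword) tokenizer:

```python
MAX_TOKENS = 512

def within_budget(method_code: str, javadoc: str, limit: int = MAX_TOKENS) -> bool:
    """True if the method and its Javadoc together fit within `limit` tokens.
    Whitespace splitting here approximates the actual tokenizer."""
    return len(method_code.split()) + len(javadoc.split()) <= limit

corpus = [("public int f ( ) { return 1 ; }", "Returns one.")]
filtered = [(m, j) for m, j in corpus if within_budget(m, j)]
print(len(filtered))  # instances longer than the limit are discarded
```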
Preprint
Transformers have gained popularity in the software engineering (SE) literature. These deep learning models are usually pre-trained through a self-supervised objective, meant to provide the model with basic knowledge about a language of interest (e.g., Java). A classic pre-training objective is the masked language model (MLM), in which a percentage of tokens from the input (e.g., a Java method) is masked, with the model in charge of predicting them. Once pre-trained, the model is then fine-tuned to support the specific downstream task of interest (e.g., code summarization). While there is evidence suggesting the boost in performance provided by pre-training, little is known about the impact of the specific pre-training objective(s) used. Indeed, MLM is just one of the possible pre-training objectives, and recent work from the natural language processing field suggests that pre-training objectives tailored for the specific downstream task of interest may substantially boost the model's performance. In this study, we focus on the impact of pre-training objectives on the performance of transformers when automating code-related tasks. We start with a systematic literature review aimed at identifying the pre-training objectives used in SE. Then, we pre-train 32 transformers using both (i) generic pre-training objectives usually adopted in SE; and (ii) pre-training objectives tailored to the specific code-related tasks subject of our experimentation, namely bug-fixing, code summarization, and code completion. We also compare the pre-trained models with non-pre-trained ones. Our results show that: (i) pre-training helps in boosting performance only if the amount of fine-tuning data available is small; (ii) the MLM objective is usually sufficient to maximize the prediction performance of the model, even when comparing it with pre-training objectives specialized for the downstream task at hand.
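The MLM objective summarized above can be sketched in a few lines: a fixed percentage of input tokens is replaced by a mask symbol and the original tokens become prediction targets. The snippet below is a generic illustration, not the exact masking scheme used in the study.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_symbol="[MASK]", seed=None):
    """Randomly replace ~mask_rate of the tokens with mask_symbol.
    Returns the corrupted sequence and the (position, original token) targets."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted[i] = mask_symbol
            targets.append((i, tok))
    return corrupted, targets

code = "public int getTotalCount ( ) { return totalCount ; }".split()
print(mask_tokens(code, seed=0))
```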
... However, there exists no dataset of aligned binaries and source code summaries since this is a new and unexplored task. As pointed out by LeClair and McMillan, the lack of standardised datasets is a major barrier to ongoing research, which we will address for this task [19]. In this paper, we create a dataset containing pairs of decompiled and stripped-decompiled functions and summaries of these functions. ...
... In this paper, we create a dataset containing pairs of decompiled and stripped-decompiled functions and summaries of these functions. During the creation of this dataset, we conform to the current best practices for dataset construction [19,20]. ...
... Code summarisation (also referred to as source code summarisation) is the task of writing short descriptions from source code, usually a single-sentence summary of the source code. The main use is for software documentation, like the one-sentence JavaDoc description used in Java [19]. This documentation is important for program comprehension and maintenance. ...
Preprint
Full-text available
Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable source code-like representation. However, reverse engineering is time-consuming, much of which is taken up by labelling the functions with semantic information. While the automated summarisation of decompiled code can help reverse engineers understand and analyse binaries, current work mainly focuses on summarising source code, and no suitable dataset exists for this task. In this work, we extend large pre-trained language models of source code to summarise decompiled binary functions. Furthermore, we investigate the impact of input and data properties on the performance of such models. Our approach consists of two main components: the data and the model. We first build CAPYBARA, a dataset of 214K decompiled function-documentation pairs across various compiler optimisations. We extend CAPYBARA further by generating synthetic datasets and deduplicating the data. Next, we fine-tune the CodeT5 base model with CAPYBARA to create BinT5. BinT5 achieves state-of-the-art BLEU-4 scores of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively. This indicates that these models can be extended to decompiled binaries successfully. Finally, we found that the performance of BinT5 is not heavily dependent on the dataset size and compiler optimisation level. We recommend that future research further investigate transferring knowledge when working with less expressive input formats such as stripped binaries.
... Code summarization (Lu et al., 2021; LeClair and McMillan, 2019) generates explanatory natural language documentation from the given source code snippets. The output natural language description can be evaluated with BLEU and ROUGE scores. ...
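For concreteness, a sentence-level BLEU comparison of a generated summary against a reference can be computed with NLTK as shown below (smoothing is commonly applied because short summaries often lack higher-order n-gram overlap). This is a generic evaluation snippet, not the cited papers' exact setup.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "returns the total count of items".split()
candidate = "return the total number of items".split()

score = sentence_bleu(
    [reference], candidate,
    smoothing_function=SmoothingFunction().method1,  # avoid zero scores on short texts
)
print(f"BLEU: {score:.3f}")
```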
Preprint
As the complexity of modern software continues to escalate, software engineering has become an increasingly daunting and error-prone endeavor. In recent years, the field of Neural Code Intelligence (NCI) has emerged as a promising solution, leveraging the power of deep learning techniques to tackle analytical tasks on source code with the goal of improving programming efficiency and minimizing human errors within the software industry. Pretrained language models have become a dominant force in NCI research, consistently delivering state-of-the-art results across a wide range of tasks, including code summarization, generation, and translation. In this paper, we present a comprehensive survey of the NCI domain, including a thorough review of pretraining techniques, tasks, datasets, and model architectures. We hope this paper will serve as a bridge between the natural language and programming language communities, offering insights for future research in this rapidly evolving field.
... CoDesc [11] is a large parallel dataset of source code and equivalent natural language descriptions. This dataset is built from several similar but noisy datasets such as CodeSearchNet [5], FunCom [13], DeepCom [14], and CONCODE [4]. HumanEval [6] contains 164 hand-written Python programs with English docstrings and comments. ...