Conference Paper

Improved Automatic Summarization of Subroutines via Attention to File Context

... An alternative model was proposed by Haque et al (2020) using "file context." File context means the other subroutines in the same file as a subroutine under investigation. ...
... Newer models achieve higher BLEU scores by boosting performance on part of the dataset, but not on the part where file context can help. And unfortunately, there is no clear means to augment Transformer-based models with the encoder proposed by Haque et al (2020) due to the differences in how Transformer and RNN-based architectures handle attention. ...
... "File context" is a term in Software Engineering research literature that means the other information in the same file as a section of code under investigation (Holmes and Murphy, 2005;Hill et al, 2009;Guerrouj et al, 2014;Ding et al, 2022). In this paper, as in the earlier work by Haque et al (2020), the sections of code under investigation are subroutines, and the file context includes a few of the other subroutines in the same file. File context has been cited for decades as a key source of information for understanding source code, since code lives in an ecosystem of interdependent software components. ...
Preprint
Full-text available
Source code summarization is the task of writing natural language descriptions of source code. A typical use case is generating short summaries of subroutines for use in API documentation. The heart of almost all current research into code summarization is the encoder-decoder neural architecture, and the encoder input is almost always a single subroutine or other short code snippet. The problem with this setup is that the information needed to describe the code is often not present in the code itself -- that information often resides in other nearby code. In this paper, we revisit the idea of "file context" for code summarization. File context is the idea of encoding select information from other subroutines in the same file. We propose a novel modification of the Transformer architecture that is purpose-built to encode file context and demonstrate its improvement over several baselines. We find that file context helps on a subset of challenging examples where traditional approaches struggle.
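The attention-based fusion this abstract describes can be sketched at a shape level. The following is a minimal numpy sketch, not the paper's actual architecture: the function name, the dot-product scoring, and the fusion by concatenation are all illustrative assumptions.

```python
import numpy as np

def attend_file_context(target_vec, context_vecs):
    """Let the encoding of the subroutine being summarized attend over
    encodings of the other subroutines in the same file (file context).

    target_vec:   (d,)   encoding of the target subroutine
    context_vecs: (k, d) encodings of k sibling subroutines
    Returns (2*d,): target encoding concatenated with a context summary
    weighted by relevance to the target.
    """
    scores = context_vecs @ target_vec          # (k,) dot-product relevance
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over siblings
    context_summary = weights @ context_vecs    # (d,) weighted average
    return np.concatenate([target_vec, context_summary])

# Toy usage: 4 sibling subroutines with 8-dimensional encodings.
rng = np.random.default_rng(0)
print(attend_file_context(rng.normal(size=8), rng.normal(size=(4, 8))).shape)  # (16,)
```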
... Classification tasks in this field primarily focus on detecting the programming language of code snippets [40,41,42], whereas summarization tasks focus on transforming code snippets into natural language text for various purposes. The format of the generated text differs based on the purpose, such as transforming code differences into commit messages for version control [43,44], or into comments for documentation [45,46,47,48]. Our work extends these code-comprehension efforts, modifying them to finely localize privacy behaviors and predict their privacy labels. ...
... AST Paths (RQ 1.3): The number of paths used to represent a code sample varies from 100 to 300 in code summarization studies [47,58]. Therefore, to evaluate the optimal number of AST paths, especially for classification, we compare the performance of 100, 200, and 300 AST paths. ...
... Traversing from one terminal node to another is referred to as an AST path. Figure 8 (c) shows a list of AST paths traversed from the partial AST in Figure 8 (b). Since an AST contains useful syntactic information about a code snippet, recent work in code summarization [47,45,46] uses AST paths to represent code. ADPAc contains the AST paths of code samples and their labels, which we use in this work. ...
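To make the notion of terminal-to-terminal AST paths concrete, here is a small sketch using Python's standard ast module. The extraction rules in ADPAc (leaf tokenization, path-length limits) may differ; the cap of 200 paths below is just an example echoing the 100-300 range discussed above.

```python
import ast
from itertools import combinations

def leaf_paths(tree):
    """Collect root-to-leaf paths of AST node-type names."""
    paths = []
    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            paths.append(prefix)
        for child in children:
            walk(child, prefix)
    walk(tree, [])
    return paths

def ast_paths(code, limit=200):
    """Terminal-to-terminal paths: up from one leaf to the lowest common
    ancestor, then down to the other leaf."""
    result = []
    for a, b in combinations(leaf_paths(ast.parse(code)), 2):
        i = 0
        while i < min(len(a), len(b)) and a[i] == b[i]:
            i += 1                       # length of the shared prefix
        up = list(reversed(a[i:]))       # from leaf up to below the ancestor
        down = b[i - 1:]                 # from the ancestor down to the leaf
        result.append(up + down)
        if len(result) == limit:         # cap the number of paths per sample
            break
    return result

print(ast_paths("def f(x):\n    return x + 1")[0])
```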
Preprint
Full-text available
Mobile applications are required to give privacy notices to users when they collect or share personal information. Creating consistent and concise privacy notices can be a challenging task for developers. Previous work has attempted to help developers create privacy notices through a questionnaire or predefined templates. In this paper, we propose a novel approach and a framework, called PriGen, that extends this prior work. PriGen uses static analysis to identify Android applications' code segments that process sensitive information (i.e. permission-requiring code segments) and then leverages a Neural Machine Translation model to translate them into privacy captions. We present the initial evaluation of our translation task for ~300,000 code segments.
... Later, Hu et al. [9] and LeClair et al. [6] noted the importance of including structural information about the source code by using the Abstract Syntax Tree (AST). They both flattened the AST into sequential tokens, but incorporated these tokens in different ways in their model. Other research papers soon explored various ways of capturing this structural information from the AST [29], [30], [35], [32]. ...
[Table residue: a feature-comparison table (columns N, S, C) covering Iyer et al. (2016) [25] through Gong et al. (2022) [42], including Haque et al. (2020) [33]; the per-row column markers are not recoverable from the extraction.]
... At the same time, a parallel research track delved into incorporating contextual information. This context encompasses API calls to learn the mapping between API sequences and natural language description [28] as well as other functions in the file to provide supporting information for the code [33]. Bansal et al. further expanded the latter idea by including project context information [41]. ...
... The training, validation, and test sets are split by project to prevent data from the training set leaking into the test set by virtue of being in the same project. This dataset has since been used in many peer-reviewed publications [33], [35], [52], [41], [53], and new additions have since been made to it, including context tokenization. We use a filtered version of this dataset, with 1.9m functions, published by Bansal et al. that removes code clones in accordance with recommendations by Allamanis et al. [54]. ...
Preprint
Label smoothing is a regularization technique for neural networks. Normally neural models are trained to an output distribution that is a vector with a single 1 for the correct prediction, and 0 for all other elements. Label smoothing converts the correct prediction location to something slightly less than 1, then distributes the remainder to the other elements such that they are slightly greater than 0. A conceptual explanation behind label smoothing is that it helps prevent a neural model from becoming "overconfident" by forcing it to consider alternatives, even if only slightly. Label smoothing has been shown to help several areas of language generation, yet typically requires considerable tuning and testing to achieve the optimal results. This tuning and testing has not been reported for neural source code summarization - a growing research area in software engineering that seeks to generate natural language descriptions of source code behavior. In this paper, we demonstrate the effect of label smoothing on several baselines in neural code summarization, and conduct an experiment to find good parameters for label smoothing and make recommendations for its use.
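Since the abstract spells out the mechanics, a minimal sketch may help. The formulation below matches the description (the correct class gets 1 − ε and the remainder is spread uniformly over the other classes); the ε value and the per-baseline variant are exactly what the paper tunes, so treat this as illustrative.

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: the correct class gets 1 - epsilon; epsilon is
    distributed uniformly over the remaining classes."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + (1.0 - one_hot) * epsilon / (n_classes - 1)

target = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot target over 4 classes
print(smooth_labels(target))               # [0.0333 0.0333 0.9 0.0333]
```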
... At the source code level, researchers have carried out many studies on automatic comment generation methods (Alon et al. 2018; Haiduc et al. 2010b; Haque et al. 2020; Hu et al. 2018a; Moreno et al. 2013). At present, the most prevalent methods are based on deep learning (Sridhara et al. 2011). ...
... Most of these methods have limited generalization ability. In recent years, deep learning techniques have developed rapidly, and many studies have been devoted to generating code comments using deep learning methods (Alon et al. 2018; Haque et al. 2020; Hu et al. 2018a; Iyer et al. 2016). Moreover, deep learning methods have also promoted the development of traditional retrieval-based comment generation methods (Zhang et al. 2020; Wei et al. 2020). ...
... The specific approach is to design a model architecture to process word sequences and SBT/AST sequences in different recurrent networks with attention mechanisms. Haque et al. (2020) argued that using only the internal information of the code fragment would limit the performance of the model. Therefore, they proposed a method to use file context information to help generate code comments. ...
Article
Full-text available
Bytecode is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable source code, bytecode is even harder for programmers and researchers to understand. Bytecode has been widely used in various software tasks such as malware detection and clone detection. In order to understand the meaning of bytecode more quickly and accurately, and to further help programmers in more software activities, we propose a bytecode comment generation method (called BCGen) using a neural language model. Specifically, to get the structured information of the bytecode, we first generate the control flow graph (CFG) of the bytecode and serialize the CFG with bytecode semantic information. Then a transformer model combined with a gated recurrent unit is proposed to learn the features of bytecode and generate comments. We obtain the bytecode by building the Jar packages of well-known open-source projects in the Maven repository and construct a bytecode dataset to train and evaluate our model. Experimental results show that the BLEU of BCGen can reach 0.26, which outperforms several baselines and proves the effectiveness and practicability of our method. We conclude that it is possible to generate natural language comments directly from bytecode, and that it is important to take structured and semantic information into account when generating bytecode comments.
... Depending on the usage context of a repair tool, fix suggestions could also be helpful for developers (Monperrus 2014). While some repair tools could be fully automated at runtime, there are repair-suggestion tools (Hartmann et al. 2010; Jeffrey et al. 2009; Chandra et al. 2011; Kaleeswaran et al. 2014; Koyuncu et al. 2019) that are built for human consumption. In this study, we consider both scenarios, i.e., some code representations could go back to the actual source code, while others could provide developers with useful debugging hints that are adequate to fix a bug. ...
... AST Variants: The AST provides a form of representation of the program structure that can assist in reasoning about program syntax and semantics. The AST is leveraged in a variety of learning-based approaches with applications in different domains such as code summarization (Haque et al. 2020) and repair (Mesbah et al. 2019; Bader et al. 2019; Dinella et al. 2020). However, the AST can be represented in a variety of ways. ...
... Context pertaining to the buggy statement could play a role in the representation, which we have not considered in this study. Context can be extracted at various levels of granularity such as the buggy statement (Pradel and Sen 2018; Hata et al. 2018), surrounding statements, enclosing function (Lutellier et al. 2020; Watson et al. 2020), class (Tufano et al. 2018b), enclosing file (Haque et al. 2020), or encapsulating AST subtrees (Zhang et al. 2019; Tufano et al. 2018a). ...
Article
Full-text available
Training a deep learning model on source code has gained significant traction recently. Since such models reason about vectors of numbers, source code needs to be converted to a code representation before vectorization. Numerous approaches have been proposed to represent source code, from sequences of tokens to abstract syntax trees. However, there is no systematic study to understand the effect of code representation on learning performance. Through a controlled experiment, we examine the impact of various code representations on model accuracy and usefulness in deep learning-based program repair. We train 21 different generative models that suggest fixes for name-based bugs, including 14 different homogeneous code representations, four mixed representations for the buggy and fixed code, and three different embeddings. We assess if fix suggestions produced by the model in various code representations are automatically patchable, meaning they can be transformed to a valid code that is ready to be applied to the buggy code to fix it. We also conduct a developer study to qualitatively evaluate the usefulness of inferred fixes in different code representations. Our results highlight the importance of code representation and its impact on learning and usefulness. Our findings indicate that (1) while code abstractions help the learning process, they can adversely impact the usefulness of inferred fixes from a developer’s point of view; this emphasizes the need to look at the patches generated from the practitioner’s perspective, which is often neglected in the literature, (2) mixed representations can outperform homogeneous code representations, (3) bug type can affect the effectiveness of different code representations; although current techniques use a single code representation for all bug types, there is no single best code representation applicable to all bug types.
... We train all four models from scratch on the training set from each dataset. Our main interest is to test our loss function rather than other variables, so we follow the training procedure established by several recent papers (Haque et al., 2020; LeClair et al., 2020): train for ten epochs, select the epoch for which the validation accuracy was the highest, then report metric scores over the testing set for that epoch. Key hyperparameters include: we used t and w reported by Haque et al. (2020) and Bansal et al. (2021); the values for v and z are suggestions from a study of code summarization datasets (LeClair and McMillan, 2019). ...
Preprint
Full-text available
This paper presents an improved loss function for neural source code summarization. Code summarization is the task of writing natural language descriptions of source code. Neural code summarization refers to automated techniques for generating these descriptions using neural networks. Almost all current approaches involve neural networks as either standalone models or as part of a pretrained large language model, e.g., GPT, Codex, LLaMA. Yet almost all also use a categorical cross-entropy (CCE) loss function for network optimization. Two problems with CCE are that 1) it computes loss over each word prediction one-at-a-time, rather than evaluating a whole sentence, and 2) it requires a perfect prediction, leaving no room for partial credit for synonyms. We propose and evaluate a loss function to alleviate this problem. In essence, we propose to use a semantic similarity metric to calculate loss over the whole output sentence prediction per training batch, rather than just loss for each word. We also propose to combine our loss with traditional CCE for each word, which streamlines the training process compared to baselines. We evaluate our approach over several baselines and report an improvement in the vast majority of conditions.
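A shape-level sketch of the kind of combined loss this abstract describes. The sentence-embedding source, the cosine-based similarity term, and the alpha weighting are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def combined_loss(probs, target_ids, pred_emb, ref_emb, alpha=0.5):
    """Per-word categorical cross-entropy plus a whole-sentence
    semantic-similarity term (hypothetical combination).

    probs:      (T, V) predicted word distributions over T output steps
    target_ids: (T,)   reference word ids
    pred_emb:   (d,)   sentence embedding of the predicted summary
    ref_emb:    (d,)   sentence embedding of the reference summary
    """
    cce = -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids] + 1e-9))
    cos = pred_emb @ ref_emb / (np.linalg.norm(pred_emb) * np.linalg.norm(ref_emb))
    return alpha * cce + (1.0 - alpha) * (1.0 - cos)  # low when sentences agree
```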
... Humans use external context during program comprehension [42], so researchers are motivated to build models that use external context as well. In 2020, Haque et al. [34] introduced "file context," which encodes the file containing the target source code. They use an attention mechanism to learn from other methods in the same file as the subroutine being summarized. ...
... Then in 2021, Bansal et al. [36] introduced "project context," extending the concept of file context. Their model encodes files and specific methods inside those files to add to the knowledge. ...
[Fig. 1 residue: "Snapshot of the past five years in source code summarization" — a feature table listing approaches from Iyer et al. (2016) [23] through this paper, including Haque et al. (2020) [34]; the per-row markers are not recoverable from the extraction.]
Preprint
Full-text available
Source code summarization is the task of writing natural language descriptions of source code behavior. Code summarization underpins software documentation for programmers. Short descriptions of code help programmers understand the program quickly without having to read the code itself. Lately, neural source code summarization has emerged as the frontier of research into automated code summarization techniques. By far the most popular targets for summarization are program subroutines. The idea, in a nutshell, is to train an encoder-decoder neural architecture using large sets of examples of subroutines extracted from code repositories. The encoder represents the code and the decoder represents the summary. However, most current approaches attempt to treat the subroutine as a single unit. For example, by taking the entire subroutine as input to a Transformer or RNN-based encoder. But code behavior tends to depend on the flow from statement to statement. Normally dynamic analysis may shed light on this flow, but dynamic analysis on hundreds of thousands of examples in large datasets is not practical. In this paper, we present a statement-based memory encoder that learns the important elements of flow during training, leading to a statement-based subroutine representation without the need for dynamic analysis. We implement our encoder for code summarization and demonstrate a significant improvement over the state-of-the-art.
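The statement-based representation can be pictured as a memory matrix with one row per statement that the decoder attends over. The sketch below is only a shape-level toy using hashed bag-of-tokens rows; in the paper these statement encodings are learned during training.

```python
import zlib
import numpy as np

def statement_memory(code, dim=16):
    """Encode each statement as a hashed bag-of-tokens vector and stack
    the vectors into a (num_statements, dim) memory matrix."""
    memory = []
    for stmt in (s for s in code.splitlines() if s.strip()):
        vec = np.zeros(dim)
        for tok in stmt.split():
            vec[zlib.crc32(tok.encode()) % dim] += 1.0  # stable token hashing
        memory.append(vec)
    return np.stack(memory)

print(statement_memory("int x = 0;\nx += 1;\nreturn x;").shape)  # (3, 16)
```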
... This dataset has been used to train, configure (e.g., hyperparameter tuning), and perform a first assessment of the three techniques. In particular, as done in previous works related to the automation of code-related activities (Alon et al. 2019; Watson et al. 2020; Haque et al. 2020; Tufano et al. 2021), we considered a prediction generated by the models as correct if it resembles the choice made by the original developers (e.g., if the recommended variable name is the same as the one chosen by the developers). However, this validation assumes that the identifiers selected by the developers are meaningful, which is not always the case. ...
... We used srcML (srcML website 2019) to extract from each Java file contained in the 1,425 projects all methods having #tokens ≤ 512, where #tokens represents the number of tokens composing a function (excluding comments). The filter on the maximum length of the method is needed to limit the computational expense of training DL-based models; similar choices have been made in previous works (Haque et al. 2020; Tufano et al. 2021), with values ranging between 50 and 100 tokens. All duplicate methods have been removed from the dataset to avoid overlap between the training and test sets we built from them. ...
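For concreteness, a toy stand-in for the length filter described in this excerpt. The real pipeline parses Java with srcML; the sketch below merely strips C-style comments and counts word and punctuation tokens, so the counts are approximate.

```python
import re

def count_tokens(method_src):
    """Approximate token count: remove // and /* */ comments, then count
    identifier/number tokens and individual punctuation characters."""
    no_comments = re.sub(r"//.*?$|/\*.*?\*/", "", method_src,
                         flags=re.DOTALL | re.MULTILINE)
    return len(re.findall(r"\w+|[^\w\s]", no_comments))

def keep_method(method_src, max_tokens=512):
    return count_tokens(method_src) <= max_tokens

print(keep_method("int add(int a, int b) { return a + b; } // sum"))  # True
```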
Article
Full-text available
Identifiers, such as method and variable names, form a large portion of source code. Therefore, low-quality identifiers can substantially hinder code comprehension. To support developers in using meaningful identifiers, several (semi-)automatic techniques have been proposed, mostly being data-driven (e.g., statistical language models, deep learning models) or relying on static code analysis. Still, limited empirical investigations have been performed on the effectiveness of such techniques for recommending meaningful identifiers to developers, possibly resulting in rename refactoring operations. We present a large-scale study investigating the potential of data-driven approaches to support automated variable renaming. We experiment with three state-of-the-art techniques: a statistical language model and two DL-based models. The three approaches have been trained and tested on three datasets we built with the goal of evaluating their ability to recommend meaningful variable identifiers. Our quantitative and qualitative analyses show the potential of such techniques that, under specific conditions, can provide valuable recommendations and are ready to be integrated in rename refactoring tools. Nonetheless, our results also highlight limitations of the experimented approaches that call for further research in this field.
... This dataset has been used to train, configure (i.e., hyperparameter tuning), and perform a first assessment of the three techniques. In particular, as done in previous works related to the automation of code-related activities [7,22,23,24,25,26], we considered a prediction generated by the models as correct if it resembles the choice made by the original developers (i.e., if the recommended variable name is the same as the one chosen by the developers). However, this validation assumes that the identifiers selected by the developers are meaningful, which is not always the case. ...
... We used srcML [72] to extract from each Java file contained in the 1,425 projects all methods having #tokens ≤ 512, where #tokens represents the number of tokens composing a function (excluding comments). The filter on the maximum length of the method is needed to limit the computational expense of training DL-based models (similar choices have been made in previous works [22,25,26], with values ranging between 50 and 100 tokens). All duplicate methods have been removed from the dataset to avoid overlap between training and test sets we built from them. ...
Preprint
Full-text available
Identifiers, such as method and variable names, form a large portion of source code. Therefore, low-quality identifiers can substantially hinder code comprehension. To support developers in using meaningful identifiers, several (semi-)automatic techniques have been proposed, mostly being data-driven (e.g. statistical language models, deep learning models) or relying on static code analysis. Still, limited empirical investigations have been performed on the effectiveness of such techniques for recommending meaningful identifiers to developers, possibly resulting in rename refactoring operations. We present a large-scale study investigating the potential of data-driven approaches to support automated variable renaming. We experiment with three state-of-the-art techniques: a statistical language model and two DL-based models. The three approaches have been trained and tested on three datasets we built with the goal of evaluating their ability to recommend meaningful variable identifiers. Our quantitative and qualitative analyses show the potential of such techniques that, under specific conditions, can provide valuable recommendations and are ready to be integrated in rename refactoring tools. Nonetheless, our results also highlight limitations of the experimented approaches that call for further research in this field.
... Meanwhile, comments may describe not only the function of the code, but also the design intent, program logic, and functionality behind the source code. Existing code summarization models can be categorized into three different types based on the techniques used, i.e., Information Retrieval (IR) based approaches [19,27,74], Neural Machine Translation (NMT) based approaches [8,10,13,15,29,36,40,67,72,73,76], and hybrid approaches [31,32,41,77] that combine IR and NMT techniques. ...
... For example, Edmund et al. [74] generated code summaries for a given code snippet by retrieving replicated code samples from the corpus with clone detection techniques. Recently, with the boom in deep learning techniques, many NMT-based code summarization approaches have been proposed, which train neural models on a large-scale code-comment corpus to automatically generate summaries [8,10,13,15,29,31,36,40,67,72,73,76]. For example, Iyer et al. [36] treated code summarization as an end-to-end translation problem and first introduced NMT into code comment generation. ...
Preprint
Code summarization, the task of generating useful comments given the code, has long been of interest. Most existing code summarization models are trained and validated on widely-used code comment benchmark datasets. However, little is known about the quality of the benchmark datasets built from real-world projects. Are the benchmark datasets as good as expected? To bridge the gap, we conduct a systematic study to assess and improve the quality of four benchmark datasets widely used for code summarization tasks. First, we propose an automated code-comment cleaning tool that can accurately detect noisy data caused by inappropriate data preprocessing operations in existing benchmark datasets. Then, we apply the tool to further assess the data quality of the four benchmark datasets, based on the detected noises. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that these data preprocessing noises widely exist in all four benchmark datasets, and removing these noisy data leads to a significant improvement in the performance of code summarization. We believe that the findings and insights will enable a better understanding of data quality in code summarization tasks, and pave the way for relevant research and practice.
... We used the Funcom dataset proposed in [49]. Funcom has been used in recent studies [50], [24]. It comprises 2M Java methods alongside their documentation. ...
Article
Full-text available
This study presents a novel category of Transformer architectures known as comb transformers, which effectively reduce the space complexity of the self-attention layer from a quadratic to a sub-quadratic level. This is achieved by processing sequence segments independently and incorporating X-word embeddings to merge cross-segment information. The reduction in attention memory requirements enables the deployment of deeper architectures, potentially leading to more competitive outcomes. Furthermore, we design an abstract syntax tree (AST)-based code representation to effectively exploit comb transformer properties. To explore the potential of our approach, we develop nine specific instances based on three popular architectural concepts: funnel, hourglass, and encoder-decoder. These architectures are subsequently trained on three code-related tasks: method name generation, code search, and code summarization. These tasks encompass a range of capabilities: short/long sequence generation and classification. In addition to the proposed comb transformers, we also evaluate several baseline architectures for comparative analysis. Our findings demonstrate that the comb transformers match the performance of the baselines and frequently perform better.
... To create the pre-training dataset, which comprises 146,006 general-purpose YAML files, we excluded duplicated instances as well as those including non-ASCII tokens and all those having #tokens ≥ 1024. Fixing an upper bound on the number of tokens for the model's input helps in taming the computational cost of training and is a common practice in the literature exploiting DL models to automate code-related tasks [18,25,37,38,53,56]. ...
Preprint
Continuous integration and delivery (CI/CD) are nowadays at the core of software development. Their benefits come at the cost of setting up and maintaining the CI/CD pipeline, which requires knowledge and skills often orthogonal to those entailed in other software-related tasks. While several recommender systems have been proposed to support developers across a variety of tasks, little automated support is available when it comes to setting up and maintaining CI/CD pipelines. We present GH-WCOM (GitHub Workflow COMpletion), a Transformer-based approach supporting developers in writing a specific type of CI/CD pipelines, namely GitHub workflows. To deal with such a task, we designed an abstraction process to help the learning of the transformer while still making GH-WCOM able to recommend very peculiar workflow elements such as tool options and scripting elements. Our empirical study shows that GH-WCOM provides up to 34.23% correct predictions, and the model's confidence is a reliable proxy for the recommendations' correctness likelihood.
... For example, LeClair et al (2019); Alon et al (2019a,b) combined the AST with source code, and Allamanis et al (2018b) modeled the AST as a graph for neural models. Haque et al (2020) applied the attention mechanism to the file context. Bansal et al (2021b) combined information from different software projects. ...
Preprint
Full-text available
A code summary is a brief natural language description of source code. Summaries are usually only a single sentence long, and yet form the backbone of developer documentation. A short description such as "changes all visible polygons to the color blue" can give a programmer a high-level idea of what code does without the effort of reading the code itself. Recently, products based on Large Language Models such as ChatGPT have demonstrated a strong ability to write these descriptions automatically. However, to use these tools, programmers must send their code to untrusted third parties for processing (e.g., via an API call). This loss of custody is not acceptable to many organizations. In this paper, we present an alternative: we train an open source model using sample output generated by GPT-3.5 in a process related to knowledge distillation. Our model is small enough (350m parameters) to be run on a single 16gb GPU, yet we show in our evaluation that it is large enough to mimic GPT-3.5 on this task.
... Hu et al. [27] use one additional encoder to encode API sequences and improve summary generation by learning API knowledge. Subsequently, various kinds of additional information have been incorporated to further improve DL-based code summarization performance, such as abstract syntax trees [17,32,34,49,58,64], code property graphs [36], similar code snippets [33,62], file context [22], etc. Recently, with the success of the pre-training and fine-tuning paradigm in the field of NLP (e.g., BERT [12] and T5 [44]), many works have introduced this paradigm to further boost neural code summarization, such as CodeBERT [15] and CodeT5 [61]. ...
Preprint
To support software developers in understanding and maintaining programs, various automatic code summarization techniques have been proposed to generate a concise natural language comment for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of natural language processing tasks. Among them, ChatGPT is the most popular one which has attracted wide attention from the software engineering community. However, it still remains unclear how ChatGPT performs in (automatic) code summarization. Therefore, in this paper, we focus on evaluating ChatGPT on a widely-used Python dataset called CSN-Python and comparing it with several state-of-the-art (SOTA) code summarization models. Specifically, we first explore an appropriate prompt to guide ChatGPT to generate in-distribution comments. Then, we use such a prompt to ask ChatGPT to generate comments for all code snippets in the CSN-Python test set. We adopt three widely-used metrics (including BLEU, METEOR, and ROUGE-L) to measure the quality of the comments generated by ChatGPT and SOTA models (including NCS, CodeBERT, and CodeT5). The experimental results show that in terms of BLEU and ROUGE-L, ChatGPT's code summarization performance is significantly worse than all three SOTA models. We also present some cases and discuss the advantages and disadvantages of ChatGPT in code summarization. Based on the findings, we outline several open challenges and opportunities in ChatGPT-based code summarization.
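For readers unfamiliar with the metrics named above, this is how a smoothed sentence-level BLEU score can be computed with NLTK. The paper's exact tokenization, smoothing choice, and corpus-level aggregation may differ.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "returns the maximum value in the list".split()
candidate = "return the max value of a list".split()

# BLEU-4: geometric mean of 1- to 4-gram precisions, with smoothing so
# that missing higher-order n-gram matches do not zero out the score.
score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method2)
print(f"BLEU-4: {score:.3f}")
```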
... Such a DNN model is learned using existing large-scale code-comment pairwise data. CodeNN [32] is an early attempt in this direction that uses only code token sequences, followed by various approaches that utilize the AST structure [4,28,29], API knowledge [30], type information [9], global context [7,26,66], reinforcement learning [22,62,65], multi-task learning [72], dual learning [68,73], pre-trained language models [19,21,67], and hybrid approaches [69,77]. In addition, a number of works also focus on generating up-to-date and informative comments based on outdated comments (a.k.a. comment updating) [39,40]. ...
Preprint
Full-text available
Code comment generation aims at generating natural language descriptions for a code snippet to facilitate developers' program comprehension activities. Despite being studied for a long time, a bottleneck for existing approaches is that given a code snippet, they can only generate one comment while developers usually need to know information from diverse perspectives such as what is the functionality of this code snippet and how to use it. To tackle this limitation, this study empirically investigates the feasibility of utilizing large language models (LLMs) to generate comments that can fulfill developers' diverse intents. Our intuition is based on the facts that (1) the code and its pairwise comment are used during the pre-training process of LLMs to build the semantic connection between the natural language and programming language, and (2) comments in the real-world projects, which are collected for the pre-training, usually contain different developers' intents. We thus postulate that the LLMs can already understand the code from different perspectives after the pre-training. Indeed, experiments on two large-scale datasets demonstrate the rationale of our insights: by adopting the in-context learning paradigm and giving adequate prompts to the LLM (e.g., providing it with ten or more examples), the LLM can significantly outperform a state-of-the-art supervised learning approach on generating comments with multiple intents. Results also show that customized strategies for constructing the prompts and post-processing strategies for reranking the results can both boost the LLM's performances, which shed light on future research directions for using LLMs to achieve comment generation.
... Wang et al. [28] included Unified Modeling Language and enclosing class names. Haque et al. [51] used the other methods in the same file as the target method (file context) as extra inputs. Bansal et al. [52] extended the use of external knowledge by combining project context, including other files in the entire project as an additional input. ...
Preprint
Code comments are significantly helpful in comprehending software programs and also help developers save a great deal of time in software maintenance. Code comment generation aims to automatically predict comments in natural language given a code snippet. Several works investigate the effect of integrating external knowledge on the quality of generated comments. In this study, we propose a solution, namely APIContext2Com, to improve the effectiveness of generated comments by incorporating the pre-defined Application Programming Interface (API) context. The API context includes the definition and description of the pre-defined APIs that are used within the code snippets. As the detailed API information expresses the functionality of a code snippet, it can help generate better code summaries. We introduce a seq-2-seq encoder-decoder neural network model with different sets of multiple encoders to effectively transform distinct inputs into target comments. A ranking mechanism is also developed to exclude non-informative APIs, so that unrelated APIs are filtered out. We evaluate our approach using the Java dataset from CodeSearchNet. The findings reveal that the proposed model improves the best baseline by 1.88 (8.24 %), 2.16 (17.58 %), 1.38 (18.3 %), 0.73 (14.17 %), 1.58 (14.98 %) and 1.9 (6.92 %) for BLEU1, BLEU2, BLEU3, BLEU4, METEOR, and ROUGE-L, respectively. Human evaluation and ablation studies confirm the quality of the generated comments and the effect of architecture and ranking APIs.
... Allamanis et al. (2018) conducted an extensive survey of such research efforts. There has also been recent work using NLP techniques and deep learning to develop and update comments based on existing code and any changes that occur (Gros et al. 2020; Haque et al. 2020; Panthaplackel et al. 2020). Searching open-source repositories to retrieve existing code snippets for a given user query is a key task in software engineering. ...
Article
Full-text available
Despite their ability to detect critical bugs in software, static analysis tools’ high false positive rates are a key barrier to their adoption in real-world settings. To improve the usability of these tools, researchers have recently begun to apply machine learning techniques to classify and filter incorrect analysis reports. Although initial results have been promising, the long-term potential and best practices for this line of research are unclear due to the lack of detailed, large-scale empirical evaluation. To partially address this knowledge gap, we present a comparative empirical study of three machine learning techniques—traditional models, recurrent neural networks (RNNs), and graph neural networks (GNNs)—for classifying correct and incorrect results in three static analysis tools—FindSecBugs, CBMC, and JBMC—using multiple datasets. These tools represent different techniques of static analysis, namely taint analysis and model-checking. We also introduce and evaluate new data preparation routines for RNNs and node representations for GNNs. We find that overall classification accuracy reaches a high of 80%–99% for different datasets and application scenarios. We observe that data preparation routines have a positive impact on classification accuracy, with an improvement of up to 5% for RNNs and 16% for GNNs. Overall, our results suggest that neural networks (RNNs or GNNs) that learn over a program’s source code outperform traditional models, although interesting tradeoffs are present among all techniques. Our observations provide insight into the future research needed to speed the adoption of machine learning approaches for static analysis tools in practice.
... The need for context in software engineering is well established. For example, IDEs need context to understand the task they are supporting [15], developers need context to navigate technical discussions on Stack Overflow [16], and tools need context to automatically process source code [17], [18]. Context can include static artefacts such as documentation [19], historical information such as past changes [20], dynamic execution information such as traces [21], individual developer activity such as IDE interactions [22], and team and organisation activity such as communication and coordination archives [23]. ...
Preprint
Deep learning models have been successfully applied to a variety of software engineering tasks, such as code classification, summarisation, and bug and vulnerability detection. In order to apply deep learning to these tasks, source code needs to be represented in a format that is suitable for input into the deep learning model. Most approaches to representing source code, such as tokens, abstract syntax trees (ASTs), data flow graphs (DFGs), and control flow graphs (CFGs) only focus on the code itself and do not take into account additional context that could be useful for deep learning models. In this paper, we argue that it is beneficial for deep learning models to have access to additional contextual information about the code being analysed. We present preliminary evidence that encoding context from the call hierarchy along with information from the code itself can improve the performance of a state-of-the-art deep learning model for two software engineering tasks. We outline our research agenda for adding further contextual information to source code representations for deep learning.
Article
Code commenting plays an important role in program comprehension. Automatic comment generation helps improve software maintenance efficiency. The code comments to annotate a method mainly include header comments and snippet comments. The header comment aims to describe the functionality of the entire method, thereby providing a general comment at the beginning of the method. The snippet comment appears at multiple code segments in the body of a method, where a code segment is called a code snippet. Both of them help developers quickly understand code semantics, thereby improving code readability and code maintainability. However, existing automatic comment generation models mainly focus more on header comments because there are public datasets to validate the performance. By contrast, it is challenging to collect datasets for snippet comments because it is difficult to determine their scope. Even worse, code snippets are often too short to capture complete syntax and semantic information. To address this challenge, we propose a novel Snippet Comment Generation approach called SCGen. First, we utilize the context of the code snippet to expand the syntax and semantic information. Specifically, 600,243 snippet code-comment pairs are collected from 959 Java projects. Then, we capture variables from code snippets and extract variable-related statements from the context. After that, we devise an algorithm to parse and traverse abstract syntax tree (AST) information of code snippets and corresponding context. Finally, SCGen generates snippet comments after inputting the source code snippet and corresponding AST information into a sequence-to-sequence-based model. We conducted extensive experiments on the dataset we collected to evaluate SCGen. Our approach obtains 18.23 in BLEU-4 metrics, 18.83 in METEOR, and 23.65 in ROUGE-L, which outperforms state-of-the-art comment generation models.
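A toy sketch of the context-expansion step this abstract describes: gather the variable names used in a snippet, then keep surrounding statements that mention any of them. SCGen works on Java and full ASTs; the Python fragment below is only illustrative.

```python
import ast

def variable_related_statements(snippet, context):
    """Return context statements that mention a variable from the snippet."""
    names = {n.id for n in ast.walk(ast.parse(snippet)) if isinstance(n, ast.Name)}
    related = []
    for stmt in context.splitlines():
        stmt_names = {n.id for n in ast.walk(ast.parse(stmt.strip()))
                      if isinstance(n, ast.Name)}
        if names & stmt_names:
            related.append(stmt.strip())
    return related

context = "total = 0\ncount = len(items)\nlog.info('start')"
print(variable_related_statements("avg = total / count", context))
# ['total = 0', 'count = len(items)']
```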
Preprint
Full-text available
Large language models trained on source code can support a variety of software development tasks, such as code recommendation and program repair. Large amounts of data for training such models benefit the models' performance. However, the size of the data and models results in long training times and high energy consumption. While publishing source code allows for replicability, users need to repeat the expensive training process if models are not shared. The main goal of the study is to investigate if publications that trained language models for software engineering (SE) tasks share source code and trained artifacts. The second goal is to analyze the transparency on training energy usage. We perform a snowballing-based literature search to find publications on language models for source code, and analyze their reusability from a sustainability standpoint. From 494 unique publications, we identified 293 relevant publications that use language models to address code-related tasks. Among them, 27% (79 out of 293) make artifacts available for reuse. This can be in the form of tools or IDE plugins designed for specific tasks or task-agnostic models that can be fine-tuned for a variety of downstream tasks. Moreover, we collect insights on the hardware used for model training, as well as training time, which together determine the energy consumption of the development process. We find that there are deficiencies in the sharing of information and artifacts for current studies on source code models for software engineering tasks, with 40% of the surveyed papers not sharing source code or trained artifacts. We recommend the sharing of source code as well as trained artifacts, to enable sustainable reproducibility. Moreover, comprehensive information on training times and hardware configurations should be shared for transparency on a model's carbon footprint.
Article
Source code summarization is the task of writing natural language descriptions of source code. The primary use of these descriptions is in documentation for programmers. Automatic generation of these descriptions is a high-value research target due to the time cost to programmers of writing these descriptions themselves. In recent years, a confluence of software engineering and artificial intelligence research has made inroads into automatic source code summarization through applications of neural models of that source code. However, an Achilles' heel of the vast majority of approaches is that they tend to rely solely on the context provided by the source code being summarized. But empirical studies in program comprehension are quite clear that the information needed to describe code much more often resides in the context, in the form of the function call graph surrounding that code. In this paper, we present a technique for encoding this call graph context for neural models of code summarization. We implement our approach as a supplement to existing approaches, and show statistically significant improvement over existing approaches. In a human study with 20 programmers, we show that programmers perceive generated summaries to generally be as accurate, readable, and concise as human-written summaries.
Article
Full-text available
Source code comments are a cornerstone of software documentation facilitating feature development and maintenance. Well-defined documentation formats, like Javadoc, make it easy to include structural metadata used to, for example, generate documentation manuals. However, the actual usage of structural elements in source code comments has not been studied yet. We investigate to which extent these structural elements are used in practice and whether the added information can be leveraged to improve tools assisting developers when writing comments. Existing research on comment generation traditionally focuses on automatic generation of summaries. However, recent works have shown promising results when supporting comment authoring through a next-word prediction. In this paper, we present an in-depth analysis of commenting practice in more than 18K open-source projects written in Python and Java showing that many structural elements, particularly parameter and return value descriptions are indeed widely used. We discover that while a majority are rather short at about 6 to 9 words, many are several hundred words in length. We further find that Python comments tend to be significantly longer than Java comments, possibly due to the weakly-typed nature of the former. Following the empirical analysis, we extend an existing language model with support for structural information, substantially improving the Top-1 accuracy of predicted words (Python 9.6%, Java 7.8%).
Article
Neural source code summarization is the task of generating natural language descriptions of source code behavior using neural networks. A fundamental component of most neural models is an attention mechanism. The attention mechanism learns to connect features in source code to specific words to use when generating natural language descriptions. Humans also pay attention to some features in code more than others. This human attention reflects experience and high-level cognition well beyond the capability of any current neural model. In this paper, we use data from published eye-tracking experiments to create a model of this human attention. The model predicts which words in source code are the most important for code summarization. Next, we augment a baseline neural code summarization approach using our model of human attention. We observe an improvement in prediction performance of the augmented approach in line with other bio-inspired neural models.
Preprint
Full-text available
While a large number of pre-trained models of source code have been successfully developed and applied to a variety of software engineering (SE) tasks in recent years, our understanding of these pre-trained models is arguably fairly limited. With the goal of advancing our understanding of these models, we perform the first systematic empirical comparison of 19 recently-developed pre-trained models of source code on 13 SE tasks. To gain additional insights into these models, we adopt a recently-developed 4-dimensional categorization of pre-trained models, and subsequently investigate whether there are correlations between different categories of pre-trained models and their performances on different SE tasks.
Article
Contextual information plays a vital role for software developers when understanding and fixing a bug. Consequently, deep learning-based program repair techniques leverage context for bug fixes. However, existing techniques treat context in an arbitrary manner, by extracting code in close proximity of the buggy statement within the enclosing file, class, or method, without any analysis to find actual relations with the bug. To reduce noise, they use a predefined maximum limit on the number of tokens to be used as context. We present a program slicing-based approach, in which instead of arbitrarily including code as context, we analyze statements that have a control or data dependency on the buggy statement. We propose a novel concept called dual slicing , which leverages the context of both buggy and fixed versions of the code to capture relevant repair ingredients. We present our technique and tool called Katana , the first to apply slicing-based context for a program repair task. The results show Katana effectively preserves sufficient information for a model to choose contextual information while reducing noise. We compare against four recent state-of-the-art context-aware program repair techniques. Our results show Katana fixes between 1.5 to 3.7 times more bugs than existing techniques.
Chapter
At present, many successful applications use deep learning methods in the field of Visual Question Answering (VQA). With the introduction of Optical Character Recognition (OCR), Text-based Visual Question Answering (TextVQA) tasks have converged on a mature basic structure: a transformer-based iterative decoding prediction module. However, there is a problem in current models: the training process is inconsistent with the inference process. This inconsistency shows up in the different inputs and the different numbers of iterative prediction steps. We propose a scheduled mask method. Using this method, our model can gradually adapt during training to operating without the ground-truth answer as input. We have verified the effectiveness of our method on the TextVQA dataset and exceeded the performance of previously proposed models.
Preprint
Full-text available
In recent years, there has been a wide interest in designing deep neural network-based models that automate downstream software engineering tasks, such as program document generation, code search, and program repair. Although the main objective of these studies is to improve the effectiveness of the downstream task, many studies only attempt to employ the next best neural network model, without a proper in-depth analysis of why a particular solution works or does not, on particular tasks or scenarios. In this paper, using an eXplainable AI (XAI) method (attention mechanism), we study state-of-the-art Transformer-based models (CodeBERT and GraphCodeBERT) on a set of software engineering downstream tasks: code document generation (CDG), code refinement (CR), and code translation (CT). We first evaluate the validity of the attention mechanism on each particular task. Then, through quantitative and qualitative studies, we identify what CodeBERT and GraphCodeBERT learn (put the highest attention on, in terms of source code token types) on these tasks. Finally, we show some of the common patterns when the model does not work as expected (performs poorly while the problem at hand is easy) and suggest recommendations that may alleviate the observed challenges.
Article
Context: Source code summarization is a crucial yet far from settled task for describing structured code snippets in natural language. High-quality code summaries could effectively facilitate program comprehension and software maintenance. A good code summary is supposed to have the following characteristics: complete information, correct meaning, and consistent description. In recent years, numerous approaches have been proposed for code summarization, but it is still very challenging for developers to automatically learn the complex semantics from the source code and generate complete, correct and consistent code summaries. Objective: In this paper, we propose KGCodeSum, a novel keyword-guided abstractive code summarization approach that incorporates structural and contextual information. Methods: To improve summaries' quality, we leverage both the structural information embedded in code itself and the contextual information from related code snippets. Meanwhile, we make use of keywords to guide summaries' generation to guarantee the code summaries contain key information. Finally, we propose a new dynamic vocabulary strategy which can effectively resolve the UNK problems in code summaries. Results: Through our evaluation on the large-scale benchmark datasets with 2.1 million java method-comment pairs and 1.1 million C/C++ function-summary pairs, we have observed that our approach could generate better code summaries than existing state-of-the-art approaches in terms of completeness, correctness and consistency. In addition, we also find that incorporating the dynamic vocabulary strategy into our approach could significantly save time and space in the model training process. Conclusion: Our KGCodeSum approach could effectively generate code summaries.
Article
Full-text available
Deep learning (DL) is playing an increasingly important role in our lives. It has already made a huge impact in areas such as cancer diagnosis, precision medicine, self-driving cars, predictive forecasting, speech recognition, etc. The painstakingly handcrafted feature extractors used in traditional learning, classification and pattern recognition systems are not scalable for large-sized data sets. In many cases, depending on the problem complexity, deep learning can also overcome the limitations of earlier shallow networks that prevented efficient training and abstraction of hierarchical representations of multi-dimensional training data. Deep Neural Networks (DNNs) use multiple (deep) layers of units with highly optimized algorithms and architectures. The paper reviews several optimization methods to improve the accuracy of training and to reduce training time. We delve into the math behind training algorithms used in recent deep networks. We describe current shortcomings, enhancements and implementations. The review also covers different types of deep architectures such as deep convolution networks, deep residual networks, recurrent neural networks, reinforcement learning, variational autoencoders, and others.
Conference Paper
Full-text available
Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable datasets. In addition, a lack of community standards for creating datasets leads to confusing and unreproducible research results -- we observe swings in performance of more than 33% due only to changes in dataset design. In this paper, we make recommendations for these standards from experimental results. We release a dataset based on prior work of over 2.1m pairs of Java methods and one sentence method descriptions from over 28k Java projects. We describe the dataset and point out key differences from natural language data, to guide and support future researchers. Dataset Available at www.leclair.tech/data/funcom
Article
Full-text available
Generating a description of an image is called image captioning. Image captioning requires recognizing the important objects, their attributes, and their relationships in an image, and it also requires generating syntactically and semantically correct sentences. Deep learning-based techniques are capable of handling the complexities and challenges of image captioning. In this survey paper, we aim to present a comprehensive review of existing deep learning-based image captioning techniques. We discuss the foundations of the techniques to analyze their performance, strengths, and limitations. We also discuss the datasets and evaluation metrics popularly used in deep learning-based automatic image captioning.
Conference Paper
Full-text available
During software maintenance, code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in the software projects. Developers have to infer the functionality from the source code. This paper proposes a new approach named DeepCom to automatically generate code comments for Java methods. The generated comments aim to help developers understand the functionality of Java methods. DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. We use a deep neural network that analyzes structural information of Java methods for better comments generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects from GitHub. We evaluate the experimental results on a machine translation metric. Experimental results demonstrate that our method DeepCom outperforms the state-of-the-art by a substantial margin.
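DeepCom's exact encoding of structural information is not given here, but the family of techniques it belongs to can be illustrated by flattening an AST into a bracketed token sequence that a sequence model can consume. The sketch below uses Python's ast module on Python code purely as an analogy to the Java setting; it is not DeepCom itself.

# Illustrative analogue of encoding structure: flatten an AST into a
# bracketed token sequence via depth-first traversal.
import ast

def flatten(node):
    """Depth-first, bracketed traversal of an AST into a token sequence."""
    name = type(node).__name__
    seq = [f"({name}"]
    for child in ast.iter_child_nodes(node):
        seq.extend(flatten(child))
    seq.append(f"){name}")
    return seq

tree = ast.parse("def area(r):\n    return 3.14159 * r * r\n")
print(" ".join(flatten(tree)))  # e.g. "(Module (FunctionDef (arguments ..."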
Conference Paper
Full-text available
To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language query. They lack a deep understanding of the semantics of queries and source code. In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized and irrelevant/noisy keywords in queries can be handled. As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.
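The retrieval step of a joint-embedding approach reduces to nearest-neighbor search in the shared vector space. A minimal sketch follows, with random placeholder vectors standing in for the learned embeddings that CODEnn would produce.

# Sketch of retrieval in a joint embedding space: rank code snippets by
# cosine similarity to a query vector. Embeddings here are placeholders.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rank_snippets(query_vec, snippet_vecs):
    """Return snippet indices sorted by similarity to the query vector."""
    scores = [cosine(query_vec, v) for v in snippet_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

rng = np.random.default_rng(0)
snippets = [rng.standard_normal(128) for _ in range(5)]  # placeholder embeddings
query = snippets[2] + 0.1 * rng.standard_normal(128)     # query near snippet 2
print(rank_snippets(query, snippets))  # snippet 2 should rank first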
Conference Paper
Full-text available
In this work, we cast abstractive text summarization as a sequence-to-sequence problem and apply the framework of attentional encoder-decoder recurrent neural networks, outperforming the state-of-the-art model of Rush et al. (2015) on two different corpora. We also move beyond the basic architecture and propose several novel models to address important problems in summarization, including modeling keywords, capturing the hierarchy of sentence-to-word structure, and handling words that are key to a document but rare elsewhere. Our work shows that many of the proposed solutions contribute to further improvements in performance. In addition, we propose a new dataset consisting of multi-sentence summaries and establish performance benchmarks for further research.
Article
Full-text available
Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and we propose to extend it by allowing the model to automatically (soft-)search for parts of the source sentence that are relevant to predicting a target word, without having to form these parts as explicit hard segments. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
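The (soft-)search described here is the attention mechanism: at each decoding step, every encoder state is scored against the current decoder state, the scores are softmax-normalized, and the weighted sum of encoder states forms the context vector. A small numpy sketch of this additive scoring follows; the parameter shapes and dimensions are illustrative simplifications of the full model.

# Numpy sketch of additive (soft) attention: score each encoder state
# against the decoder state, softmax, and form a weighted-sum context.
import numpy as np

def additive_attention(s, H, Wa, Ua, va):
    """s: decoder state (d,); H: encoder states (T, d); returns (context, weights)."""
    scores = np.tanh(H @ Ua.T + s @ Wa.T) @ va  # one score per source position, (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over source positions
    context = weights @ H                       # weighted sum of encoder states, (d,)
    return context, weights

d, T = 8, 5
rng = np.random.default_rng(1)
s, H = rng.standard_normal(d), rng.standard_normal((T, d))
Wa, Ua, va = (rng.standard_normal((d, d)), rng.standard_normal((d, d)),
              rng.standard_normal(d))
context, w = additive_attention(s, H, Wa, Ua, va)
print(w.round(3), round(float(w.sum()), 6))  # weights form a distribution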
Conference Paper
Full-text available
As software systems continue to grow and evolve, locating code for maintenance and reuse tasks becomes increasingly difficult. Existing static code search techniques using natural language queries provide little support to help developers determine whether search results are relevant, and few recommend alternative words to help developers reformulate poor queries. In this paper, we present a novel approach that automatically extracts natural language phrases from source code identifiers and categorizes the phrases and search results in a hierarchy. Our contextual search approach allows developers to explore the word usage in a piece of software, helping them to quickly identify relevant program elements for investigation or to quickly recognize alternative words for query reformulation. An empirical evaluation with 22 developers reveals that our contextual search approach significantly outperforms the most closely related technique in terms of effort and effectiveness.
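A core ingredient of such identifier-based search is splitting compound identifiers into their constituent words. A simplified sketch of that step is below; the paper's phrase extraction and categorization go well beyond these heuristics.

# Sketch of splitting camelCase / snake_case identifiers into words,
# keeping acronyms such as "XML" together. Heuristics are simplified.
import re

def split_identifier(name):
    words = []
    for part in re.split(r"[_$]+", name):  # snake_case pieces
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words if w]

print(split_identifier("parseXMLDocument"))  # ['parse', 'xml', 'document']
print(split_identifier("get_user_name2"))    # ['get', 'user', 'name', '2']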
Conference Paper
Full-text available
This paper highlights the results of a survey of software professionals. One goal of this survey was to uncover the perceived relevance (or lack thereof) of software documentation, and the tools and technologies used to maintain, verify, and validate such documents. The survey results highlight the preferences for and aversions to software documentation tools. Participants agree that documentation tools should better extract knowledge from core resources, including the system's source code, test code, and changes to both. Resulting technologies could then help reduce the effort required for documentation maintenance, something that is shown to rarely occur. Our data provide compelling evidence that software professionals value technologies that improve the automation of the documentation process and facilitate its maintenance.
Article
Full-text available
During maintenance, developers cannot read the entire code of large systems. They need a way to get a quick understanding of source code entities (such as classes, methods, and packages) so they can efficiently identify, and then focus on, the ones related to the task at hand. Sometimes reading just a method header or a class name does not reveal enough about its purpose and meaning, while reading the entire implementation takes too long. We study a solution that bridges the two approaches: short and accurate textual descriptions that illustrate the software entities without requiring the reader to go through the implementation details. We create such descriptions using techniques from automatic text summarization. The paper presents a study that investigates the suitability of various such techniques for generating source code summaries. The results indicate that a combination of text summarization techniques is most appropriate for source code summarization and that developers generally agree with the summaries produced.
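As one concrete example of applying text summarization machinery to code, terms in a method can be scored with tf-idf against a corpus of methods and the top terms kept as a crude extractive description. The sketch below illustrates only this generic idea, not the specific techniques compared in the paper.

# Toy extractive "summary" of a method: rank its terms by tf-idf over a
# small corpus of tokenized methods and keep the top k.
import math
from collections import Counter

def top_terms(docs, doc_index, k=3):
    """docs: list of token lists; returns the k highest tf-idf terms of one doc."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    tf = Counter(docs[doc_index])                      # term frequency
    scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

methods = [
    ["open", "file", "read", "buffer", "close"],
    ["sort", "array", "swap", "compare"],
    ["open", "socket", "send", "close"],
]
print(top_terms(methods, 1))  # terms distinctive of the sorting method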
Article
The field of machine learning is witnessing its golden era as deep learning slowly becomes the leader in this domain. Deep learning uses multiple layers to represent abstractions of data and build computational models. Key enabling deep learning algorithms, such as generative adversarial networks, convolutional neural networks, and model transfer, have completely changed our perception of information processing. However, a gap in understanding lies behind this tremendously fast-paced domain, because it has never previously been surveyed from a multi-scope perspective. This lack of core understanding renders these powerful methods black-box machines and inhibits development at a fundamental level. Moreover, deep learning has repeatedly been perceived as a silver bullet for all stumbling blocks in machine learning, which is far from the truth. This article presents a comprehensive review of historical and recent state-of-the-art approaches in visual, audio, and text processing, social network analysis, and natural language processing, followed by an in-depth analysis of pivotal and groundbreaking advances in deep learning applications. It also reviews issues faced in deep learning, such as unsupervised learning, black-box models, and online learning, and illustrates how these challenges can be transformed into prolific future research avenues.
Conference Paper
Code summarization provides a high-level natural language description of the function performed by code, which can benefit software maintenance, code categorization, and retrieval. To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework that encodes the code into a hidden space and then decodes it into natural language, suffering from two major drawbacks: (a) their encoders consider only the sequential content of code, ignoring the tree structure, which is also critical for code summarization; (b) their decoders are typically trained to predict the next word by maximizing the likelihood of the next ground-truth word given the previous ground-truth words, yet at test time the model must generate the entire sequence from scratch. This discrepancy causes an exposure bias issue, making the learnt decoder suboptimal. In this paper, we incorporate an abstract syntax tree structure as well as the sequential content of code snippets into a deep reinforcement learning framework (an actor-critic network). The actor network provides the confidence of predicting the next word according to the current state, while the critic network evaluates the reward value of all possible extensions of the current state and provides global guidance for exploration. We employ an advantage reward based on the BLEU metric to train both networks. Comprehensive experiments on a real-world dataset show the effectiveness of our proposed model compared with several state-of-the-art methods.
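The advantage reward can be illustrated concretely: score a sampled summary with BLEU against the reference and subtract a baseline (here, the BLEU of a greedy decode), so that only samples beating the model's current best guess receive positive reinforcement. The sketch below requires the nltk package and stubs out decoding with fixed token lists; it is not the paper's actor-critic implementation.

# Sketch of a BLEU-based advantage reward: sampled-summary BLEU minus a
# greedy-decode baseline. Decoding is stubbed with fixed token lists.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(reference_tokens, hypothesis_tokens):
    smooth = SmoothingFunction().method1  # avoid zero scores on short texts
    return sentence_bleu([reference_tokens], hypothesis_tokens,
                         smoothing_function=smooth)

reference = "returns the maximum value in the list".split()
greedy    = "returns the value of the list".split()     # baseline decode
sampled   = "returns the maximum value of a list".split()

advantage = bleu(reference, sampled) - bleu(reference, greedy)
print(round(advantage, 4))  # positive: the sampled summary should be reinforced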
Article
Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.
Conference Paper
Current statistical language modeling techniques, including deep-learning based models, have proven to be quite effective for source code. We argue here that the special properties of source code can be exploited for further improvements. In this work, we enhance established language modeling approaches to handle the special challenges of modeling source code, such as frequent changes; larger, changing vocabularies; and deeply nested scopes. We present a fast, nested language modeling toolkit specifically designed for software, with the ability to add and remove text and to mix and swap out many models. Specifically, we improve upon prior cache-modeling work and present a model with a much more expansive, multi-level notion of locality that we show to be well-suited for modeling software. We present results on varying corpora in comparison with traditional N-gram models as well as RNN and LSTM deep-learning language models, and we release all our source code for public use. Our evaluations suggest that carefully adapting N-gram models for source code can yield performance that surpasses even RNN- and LSTM-based deep-learning models.
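The cache idea can be reduced to a toy form: interpolate a global model's estimate with a local estimate computed from recently seen tokens in the same file, boosting file-specific identifiers. The sketch below is a unigram caricature with an illustrative mixing weight; the paper's models are far richer (nested, multi-level, n-gram based).

# Toy cache-augmented language model: mix a global unigram estimate with
# a local-cache estimate so recently seen tokens gain probability mass.
from collections import Counter

def mixed_prob(token, global_counts, cache_tokens, lam=0.3):
    """Interpolate a global unigram estimate with a local cache estimate."""
    total = sum(global_counts.values())
    p_global = global_counts.get(token, 0) / total if total else 0.0
    p_cache = (cache_tokens.count(token) / len(cache_tokens)
               if cache_tokens else 0.0)
    return (1 - lam) * p_global + lam * p_cache

global_counts = Counter({"int": 50, "return": 40, "foo": 1})
recent = ["foo", "=", "foo", "+", "1"]           # tokens from the current file
print(mixed_prob("foo", global_counts, recent))  # cache boosts the rare name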
Article
This paper presents a literature review in the field of summarizing software artifacts, focusing on bug reports, source code, mailing lists, and developer discussions. From January 2010 to April 2016, numerous summarization techniques, approaches, and tools were proposed to meet the ongoing demand for improving software performance and quality and for helping developers understand the problems at hand. Since the aforementioned artifacts contain both structured and unstructured data, researchers have applied different machine learning and data mining techniques to generate summaries. This paper therefore first provides a general perspective on the state of the art, describing the types of artifacts, approaches to summarization, and the common portions of experimental procedures shared among these artifacts. Moreover, we discuss the applications of summarization, i.e., which tasks have been accomplished through summarization. Next, the paper presents tools that were built for summarization tasks or employed during them. In addition, we present the summarization evaluation methods employed in the selected studies, as well as other important factors used in evaluating generated summaries, such as adequacy and quality. We also briefly present modern communication channels and the commonalities and complementarities among different software artifacts. Finally, we discuss challenges applicable to the existing studies as well as future research directions. This survey gives future researchers a broad and useful background on the main and important aspects of this research field.
Article
Source code summarization is the task of creating readable summaries that describe the functionality of software. Source code summarization is a critical component of documentation generation, for example as Javadocs formed from short paragraphs attached to each method in a Java program. At present, a majority of source code summarization is manual, in that the paragraphs are written by human experts. However, new automated technologies are becoming feasible. These automated techniques have been shown to be effective in select situations, though a key weakness is that they do not explain the source code's context. That is, they can describe the behavior of a Java method, but not why the method exists or what role it plays in the software. In this paper, we propose a source code summarization technique that writes English descriptions of Java methods by analyzing how those methods are invoked. We then performed two user studies to evaluate our approach. First, we compared our generated summaries to summaries written manually by experts. Then, we compared our summaries to summaries written by a state-of-the-art automatic summarization tool. We found that while our approach does not reach the quality of human-written summaries, we do improve over the state-of-the-art summarization tool in several dimensions by a statistically-significant margin.
Conference Paper
One approach to easing program comprehension is to reduce the amount of code that a developer has to read. Describing the high level abstract algorithmic actions associated with code fragments using succinct natural language phrases potentially enables a newcomer to focus on fewer and more abstract concepts when trying to understand a given method. Unfortunately, such descriptions are typically missing because it is tedious to create them manually. We present an automatic technique for identifying code fragments that implement high level abstractions of actions and expressing them as a natural language description. Our studies of 1000 Java programs indicate that our heuristics for identifying code fragments implementing high level actions are widely applicable. Judgements of our generated descriptions by 15 experienced Java programmers strongly suggest that indeed they view the fragments that we identify as representing high level actions and our synthesized descriptions accurately express the abstraction.
Conference Paper
This paper describes in a general way the process we went through to determine the goals, principles, audience, content and style for writing comments in source code for the Java platform at the Java Software division of Sun Microsystems. This includes how the documentation comments evolved to become the home of the Java platform API specification, and the guidelines we developed to make it practical for this document to reside in the same files as the source code.
Conference Paper
Studies have shown that good comments can help programmers quickly understand what a method does, aiding program comprehension and software maintenance. Unfortunately, few software projects adequately comment the code. One way to overcome the lack of human-written summary comments, and guard against obsolete comments, is to automatically generate them. In this paper, we present a novel technique to automatically generate descriptive summary comments for Java methods. Given the signature and body of a method, our automatic comment generator identifies the content for the summary and generates natural language text that summarizes the method's overall actions. According to programmers who judged our generated comments, the summaries are accurate, do not miss important content, and are reasonably concise.
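Such comment generators are often template-based at their core: decompose the method name into an action verb and its object, then render a stock sentence. The sketch below is a deliberately naive illustration of that idea, not the paper's technique, whose content selection and text generation are far more sophisticated.

# Naive template-based comment from a method name: split into verb +
# object words and render a stock sentence. Heuristics are illustrative.
import re

def summarize_signature(method_name):
    words = [w.lower() for w in re.findall(r"[A-Z]?[a-z]+|\d+", method_name)]
    if not words:
        return "Performs an unnamed action."
    verb, rest = words[0], " ".join(words[1:])
    return f"{verb.capitalize()}s the {rest}." if rest else f"{verb.capitalize()}s."

print(summarize_signature("computeAverageScore"))  # "Computes the average score."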
Article
Gradient descent algorithms in recurrent neural networks can have problems when the network dynamics experience bifurcations in the course of learning. The possible hazards caused by the bifurcations of the network dynamics and the learning equations are investigated, and the roles of teacher forcing, preprogramming of network structures, and approximate learning algorithms are discussed.
C. V. Lopes, S. Bajracharya, J. Ossher, and P. Baldi. UCI Source Code Data Sets.
Xiaotao Song, Hailong Sun, Xu Wang, and Jiafei Yan. 2019. A Survey of Automatic Generation of Source Code Comments: Algorithms and Techniques.
Recent trends in deep learning based natural language processing