Conference Paper

Ensemble Models for Neural Source Code Summarization of Subroutines

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

... We combine our statement based memory with non statement encoders from recent literature. We use an ensemble technique as recommended for code summarization by LeClair et al. [49]. Ensemble of models is used in a wide range of research areas such as Gene Identification [50] and Computer vision [51] . ...
... We designed our encoder to add to, not compete with existing approaches. To that end, a vast majority of papers in the past 5 years use ensembles to evaluate orthogonal improvements by combining different approaches [49], [53], [54]. We use automated metrics to quantify the differences between our approach and the baseline, consistent with literature [55], [56]. ...
... Consequently, we keep the size of all embedding and dense layers constant. The penultimate step is to ensemble each combination of statement based and nonstatement based models using techniques from literature [49]. Lastly, we use automated metrics for evaluation. ...
Preprint
Full-text available
Source code summarization is the task of writing natural language descriptions of source code behavior. Code summarization underpins software documentation for programmers. Short descriptions of code help programmers understand the program quickly without having to read the code itself. Lately, neural source code summarization has emerged as the frontier of research into automated code summarization techniques. By far the most popular targets for summarization are program subroutines. The idea, in a nutshell, is to train an encoder-decoder neural architecture using large sets of examples of subroutines extracted from code repositories. The encoder represents the code and the decoder represents the summary. However, most current approaches attempt to treat the subroutine as a single unit. For example, by taking the entire subroutine as input to a Transformer or RNN-based encoder. But code behavior tends to depend on the flow from statement to statement. Normally dynamic analysis may shed light on this flow, but dynamic analysis on hundreds of thousands of examples in large datasets is not practical. In this paper, we present a statement-based memory encoder that learns the important elements of flow during training, leading to a statement-based subroutine representation without the need for dynamic analysis. We implement our encoder for code summarization and demonstrate a significant improvement over the state-of-the-art.
... Neural Machine Translation based models are also widely used for code summarization (Iyyer et al. 2016;Haijev 2016;Wan et al. 2018;Hu et al. 2019;LeClair et al. 2019;Ahmad et al. 2020;Yu et al. 2020;Wang et al. 2020;Shin et al. 2021;LeClair et al. 2020). CodeNN (Iyyer et al. 2016) and Attgru (LeClair et al. 2019) use the RNN-based encoder-decoder framework that model to encodes code to context vectors with an attention mechanism and then generates summaries in the decoder, Some studies leverage context Wang et al. 2020;Haque et al. 2020;Bansal et al. 2021) or apply ensemble model (LeClair et al. 2021;Shin et al. 2022) for code summarization. Recent, some models (Ahmad et al. 2020;Gao et al. 2021;Wu et al. 2021) code using Transformer to capture the long-range dependencies. ...
... Neural Machine Translation based models are also widely used for code summarization (Iyyer et al. 2016;Haijev 2016;Wan et al. 2018;Hu et al. 2019;LeClair et al. 2019;Ahmad et al. 2020;Yu et al. 2020;Wang et al. 2020;Shin et al. 2021;LeClair et al. 2020). CodeNN (Iyyer et al. 2016) and Attgru (LeClair et al. 2019) use the RNN-based encoder-decoder framework that model to encodes code to context vectors with an attention mechanism and then generates summaries in the decoder, Some studies leverage context Wang et al. 2020;Haque et al. 2020;Bansal et al. 2021) or apply ensemble model (LeClair et al. 2021;Shin et al. 2022) for code summarization. Recent, some models (Ahmad et al. 2020;Gao et al. 2021;Wu et al. 2021) code using Transformer to capture the long-range dependencies. ...
Article
Full-text available
Recently, machine learning techniques especially deep learning techniques have made substantial progress on some code intelligence tasks such as code summarization, code search, clone detection, etc. How to represent source code to effectively capture the syntactic, structural, and semantic information is a key challenge. Recent studies show that the information extracted from abstract syntax trees (ASTs) is conducive to code representation learning. However, existing approaches fail to fully capture the rich information in ASTs due to the large size/depth of ASTs. In this paper, we propose a novel model CoCoAST that hierarchically splits and reconstructs ASTs to comprehensively capture the syntactic and semantic information of code without the loss of AST structural information. First, we hierarchically split a large AST into a set of subtrees and utilize a recursive neural network to encode the subtrees. Then, we aggregate the embeddings of subtrees by reconstructing the split ASTs to get the representation of the complete AST. Finally, we combine AST representation carrying the syntactic and structural information and source code embedding representing the lexical information to obtain the final neural code representation. We have applied our source code representation to two common program comprehension tasks, code summarization and code search. Extensive experiments have demonstrated the superiority of CoCoAST. To facilitate reproducibility, our data and code are available https://github.com/s1530129650/CoCoAST.
... Straightforward combination of model output, as usual in ensembling (Rokach, 2010), is difficult for highly structured output such as summaries. LeClair et al. (2021) combine multiple source encoders with a joint decoder, which is effective but requires disassembling models. In this paper, we instead adopt a meta-learning approach (Naik and Mammone, 1992) in which we learn a summary selector. ...
... As shown in Table 2, we divide the dataset into three partitions: for summary generation, for metalearning, and for testing. The test partition corresponds to the one used in previous work LeClair et al., 2021), whereas the partition to train summarization models is smaller than in prior work, as we keep some data for the meta-model. Because for a substantial percentage of code segments, all summarization models fail to produce a summary with non-zero BLEU, we also consider a filtered dataset containing only segments where at least one summarization model achieves BLEU > 0. The filtered dataset hence are the cases where the meta-model has a chance to improve over the individual models. ...
Preprint
Full-text available
Source code summarization is the task of generating a high-level natural language description for a segment of programming language code. Current neural models for the task differ in their architecture and the aspects of code they consider. In this paper, we show that three SOTA models for code summarization work well on largely disjoint subsets of a large code-base. This complementarity motivates model combination: We propose three meta-models that select the best candidate summary for a given code segment. The two neural models improve significantly over the performance of the best individual model, obtaining an improvement of 2.1 BLEU points on a dataset of code segments where at least one of the individual models obtains a non-zero BLEU.
... In this study, we use the multi-intent comment generation datasets released by the previous study [47] as our evaluation datasets. In concrete, we use two datasets of Java programming language, i.e., the Funcom [36] and TLC [30] datasets, both of which are the most widely-used datasets for the code comment generation task [3,13,20,35,77]. Funcom contains 2.1M code-comment pairs from 29K Java projects, which were collected by Lopes et al. [1] and further cleaned by LeClair et al. [36]. ...
Preprint
Full-text available
Code comment generation aims at generating natural language descriptions for a code snippet to facilitate developers' program comprehension activities. Despite being studied for a long time, a bottleneck for existing approaches is that given a code snippet, they can only generate one comment while developers usually need to know information from diverse perspectives such as what is the functionality of this code snippet and how to use it. To tackle this limitation, this study empirically investigates the feasibility of utilizing large language models (LLMs) to generate comments that can fulfill developers' diverse intents. Our intuition is based on the facts that (1) the code and its pairwise comment are used during the pre-training process of LLMs to build the semantic connection between the natural language and programming language, and (2) comments in the real-world projects, which are collected for the pre-training, usually contain different developers' intents. We thus postulate that the LLMs can already understand the code from different perspectives after the pre-training. Indeed, experiments on two large-scale datasets demonstrate the rationale of our insights: by adopting the in-context learning paradigm and giving adequate prompts to the LLM (e.g., providing it with ten or more examples), the LLM can significantly outperform a state-of-the-art supervised learning approach on generating comments with multiple intents. Results also show that customized strategies for constructing the prompts and post-processing strategies for reranking the results can both boost the LLM's performances, which shed light on future research directions for using LLMs to achieve comment generation.
... Later, Hu et al. [9] and LeClair et al. [6] noted the importance of including structural information about the source code by using the Abstract Syntax Tree (AST). They both flattened the AST into N S C *Iyer et al. (2016) [25] x *Loyola et al. (2017) [26] x *Jiang et al. (2017) [27] x *Hu et al. (2018) [28] x *Hu et al. (2018) [9] x x *Allamanis et al. (2018) [29] x x *Alon et al. (2019) [30] x x *Gao et al. (2019) [31] x *LeClair et al. (2019) [6] x x *Fernandes et al. (2019) [32] x x *Haque et al. (2020) [33] x x x *Haldar et al. (2020) [34] x x *LeClair et al. (2020) [35] x x *Ahmad et al. (2020) [36] x x *Zügner et al. (2021) [8] x x *Liu et al. (2021) [37] x x *LeClair et al. (2021) [38] x x *Gao et al. (2021) [39] x x *Wang et al. (2021) [40] x x *Bansal et al. (2021) [41] x x x * Gong et al. (2022) [42] x x sequential tokens, but incorporated these tokens in different ways in their model. Other research papers soon explored various ways of capturing these structural information from the AST [29], [30], [35], [32]. ...
Preprint
Label smoothing is a regularization technique for neural networks. Normally neural models are trained to an output distribution that is a vector with a single 1 for the correct prediction, and 0 for all other elements. Label smoothing converts the correct prediction location to something slightly less than 1, then distributes the remainder to the other elements such that they are slightly greater than 0. A conceptual explanation behind label smoothing is that it helps prevent a neural model from becoming "overconfident" by forcing it to consider alternatives, even if only slightly. Label smoothing has been shown to help several areas of language generation, yet typically requires considerable tuning and testing to achieve the optimal results. This tuning and testing has not been reported for neural source code summarization - a growing research area in software engineering that seeks to generate natural language descriptions of source code behavior. In this paper, we demonstrate the effect of label smoothing on several baselines in neural code summarization, and conduct an experiment to find good parameters for label smoothing and make recommendations for its use.
... Zhang et al. [53] use retrieval and similar code to improve the performance of code summarization model. LeClair et al. [54] and Zhou et al. [20] study how to ensemble the method-level code summarization models. ...
Article
Full-text available
Code summaries are clear and concise natural language descriptions of program entities. Meaningful code summaries assist developers in better understanding. Code summarization refers to the task of generating a natural language summary from a code snippet. Most researches on code summarization focus on automatically generating summaries for methods or functions. However, in an object-oriented language such as Java, class is the basic programming unit rather than method. To fill this gap, in this paper, we investigate how to generate summaries for Java classes utilizing deep learning-based approaches. We propose a novel encoder–decoder model called ClassSum to generate functionality descriptions for Java classes and build a dataset containing 172,639 <class, summary> pairs from 3185 repositories hosted on Github. Since the code of class is much longer and more complicated, encoding a whole class via neural network is more challenging than encoding a method. On the other hand, the content within a class may be incomplete. To overcome this difficulty, we reduce the code of a class by only keeping its key elements, namely class signatures, method signatures and attribute names. To utilize both lexical and structural information of code, our model takes token sequence and abstract syntax tree of the reduced class content as inputs. ClassSum and five baselines (designed for method-level code summarization) are evaluated on our dataset. Experiment results show that summaries generated by ClassSum are more accurate and readable than those generated by baselines. Our dataset is available at https://github.com/classsum/ClassSum.
... We used reranking [22,31] to combine complementary models in this work. Ensembling [23] is another approach for combining complementary models for generation tasks, but requires additional training. Co-training [4] and tri-training [54] approaches, although shown to be very effective in combining complementary models, are designed for classification models rather than generation models. ...
Preprint
Full-text available
Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits. To address this, we propose a novel pretraining objective which explicitly models edits and use it to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming pure generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show how a pure generation model and our edit-based model can complement one another through simple reranking strategies, with which we achieve state-of-the-art performance for the three downstream editing tasks.
... AST (Abstract Syntax Tree), as a tree representation of source code, aims at representing the syntactic structure of source code. Many recent methods [21][22][23][24][25][26] encode ASTs by leveraging GNNs to learn the structural features of source code. They obtain the code representation by aggregating nodes in an AST together based on the parent-child connections so that the relationships between the adjacent levels can be captured. ...
Preprint
Full-text available
syntax trees (ASTs) play a crucial role in source code representation. However, due to the large number of nodes in an AST and the typically deep AST hierarchy, it is challenging to learn the hierarchical structure of an AST effectively. In this paper, we propose HELoC, a hierarchical contrastive learning model for source code representation. To effectively learn the AST hierarchy, we use contrastive learning to allow the network to predict the AST node level and learn the hierarchical relationships between nodes in a self-supervised manner, which makes the representation vectors of nodes with greater differences in AST levels farther apart in the embedding space. By using such vectors, the structural similarities between code snippets can be measured more precisely. In the learning process, a novel GNN (called Residual Self-attention Graph Neural Network, RSGNN) is designed, which enables HELoC to focus on embedding the local structure of an AST while capturing its overall structure. HELoC is self-supervised and can be applied to many source code related downstream tasks such as code classification, code clone detection, and code clustering after pre-training. Our extensive experiments demonstrate that HELoC outperforms the state-of-the-art source code representation models.
... In 2020, LeClair et al. [67] improved the method of processing source code AST information [83,84] on the basis of ast-attendgru [22], and the accuracy of source code summarization was increased. In 2021, LeClair et al. [90] explored the orthogonal nature of different neural code summarization approaches and proposed ensemble models to exploit this orthogonality for better overall performance. The combination of search and generative methods in ASCS is a promising idea. ...
Article
Full-text available
Source code summarization refers to the natural language description of the source code’s function. It can help developers easily understand the semantics of the source code. We can think of the source code and the corresponding summarization as being symmetric. However, the existing source code summarization is mismatched with the source code, missing, or out of date. Manual source code summarization is inefficient and requires a lot of human efforts. To overcome such situations, many studies have been conducted on Automatic Source Code Summarization (ASCS). Given a set of source code, the ASCS techniques can automatically generate a summary described with natural language. In this paper, we give a review of the development of ASCS technology. Almost all ASCS technology involves the following stages: source code modeling, code summarization generation, and quality evaluation. We further categorize the existing ASCS techniques based on the above stages and analyze their advantages and shortcomings. We also draw a clear map on the development of the existing algorithms.
Article
Code summarization aims to generate short functional descriptions for source code to facilitate code comprehension. While Information Retrieval (IR) approaches that leverage similar code snippets and corresponding summaries have led the early research, Deep Learning (DL) approaches that use neural models to capture statistical properties between code and summaries are now mainstream. Although some preliminary studies suggest that IR approaches are more effective in some cases, it is currently unclear how effective the existing approaches can be in general, where and why IR/DL approaches perform better, and whether the integration of IR and DL can achieve better performance. Consequently, there is an urgent need for a comprehensive study of the IR and DL code summarization approaches to provide guidance for future development in this area. This paper presents the first large-scale empirical study of 18 IR, DL, and hybrid code summarization approaches on five benchmark datasets. We extensively compare different types of approaches using automatic metrics, we conduct quantitative and qualitative analyses of where and why IR and DL approaches perform better, respectively, and we also study hybrid approaches for assessing the effectiveness of integrating IR and DL. The study shows that the performance of IR approaches should not be underestimated, that while DL models perform better in predicting tokens from method signatures and capturing structural similarities in code, simple IR approaches tend to perform better in the presence of code with high similarity or long reference summaries, and that existing hybrid approaches do not perform as well as individual approaches in their respective areas of strength. Based on our findings, we discuss future research directions for better code summarization.
Preprint
Full-text available
One of the most common solutions adopted by software researchers to address code generation is by training Large Language Models (LLMs) on massive amounts of source code. Although a number of studies have shown that LLMs have been effectively evaluated on popular accuracy metrics (e.g., BLEU, CodeBleu), previous research has largely overlooked the role of Causal Inference as a fundamental component of the interpretability of LLMs' performance. Existing benchmarks and datasets are meant to highlight the difference between the expected and the generated outcome, but do not take into account confounding variables (e.g., lines of code, prompt size) that equally influence the accuracy metrics. The fact remains that, when dealing with generative software tasks by LLMs, no benchmark is available to tell researchers how to quantify neither the causal effect of SE-based treatments nor the correlation of confounders to the model's performance. In an effort to bring statistical rigor to the evaluation of LLMs, this paper introduces a benchmarking strategy named Galeras comprised of curated testbeds for three SE tasks (i.e., code completion, code summarization, and commit generation) to help aid the interpretation of LLMs' performance. We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods. The results of the case study demonstrate the positive causal influence of prompt semantics on ChatGPT's generative performance by an average treatment effect of $\approx 3\%$. Moreover, it was found that confounders such as prompt size are highly correlated with accuracy metrics ($\approx 0.412\%$). The end result of our case study is to showcase causal inference evaluations, in practice, to reduce confounding bias. By reducing the bias, we offer an interpretable solution for the accuracy metric under analysis.
Article
Source code summarization is the task of writing natural language descriptions of source code. The primary use of these descriptions is in documentation for programmers. Automatic generation of these descriptions is a high value research target due to the time cost to programmers of writing these descriptions themselves. In recent years, a confluence of software engineering and artificial intelligence research has made inroads into automatic source code summarization through applications of neural models of that source code. However, an Achilles' heel to a vast majority of approaches is that they tend to rely solely on the context provided by the source code being summarized. But empirical studies in program comprehension are quite clear that the information needed to describe code much more often resides in the context in the form of Function Call Graph surrounding that code. In this paper, we present a technique for encoding this call graph context for neural models of code summarization. We implement our approach as a supplement to existing approaches, and show statistically significant improvement over existing approaches. In a human study with 20 programmers, we show that programmers perceive generated summaries to generally be as accurate, readable, and concise as human-written summaries.
Preprint
Deliberation is a common and natural behavior in human daily life. For example, when writing papers or articles, we usually first write drafts, and then iteratively polish them until satisfied. In light of such a human cognitive process, we propose DECOM, which is a multi-pass deliberation framework for automatic comment generation. DECOM consists of multiple Deliberation Models and one Evaluation Model. Given a code snippet, we first extract keywords from the code and retrieve a similar code fragment from a pre-defined corpus. Then, we treat the comment of the retrieved code as the initial draft and input it with the code and keywords into DECOM to start the iterative deliberation process. At each deliberation, the deliberation model polishes the draft and generates a new comment. The evaluation model measures the quality of the newly generated comment to determine whether to end the iterative process or not. When the iterative process is terminated, the best-generated comment will be selected as the target comment. Our approach is evaluated on two real-world datasets in Java (87K) and Python (108K), and experiment results show that our approach outperforms the state-of-the-art baselines. A human evaluation study also confirms the comments generated by DECOM tend to be more readable, informative, and useful.
Article
Code summarization aims to generate high-quality functional summaries of code snippets to improve the efficiency of program development and maintenance. It is a pressing challenge for code summarization models to capture more comprehensive code knowledge by integrating the feature correlations between the semantics and syntax of the code. In this paper, we propose a multi-modal similarity network based code summarization method: GT-SimNet. It proposes a novel code semantic modelling method based on a local application programming interface (API) dependency graph (Local-ADG), which exhibits an excellent ability to mask irrelevant semantics outside the current code snippet. For code feature fusion, GT-SimNet uses the SimNet network to calculate the correlation coefficients between Local-ADG and abstract syntax tree (AST) nodes and performs fusion under the influence of the correlation coefficients. Finally, it completes the prediction of the target summary by the generator. We conduct extensive experiments to evaluate the performance of GT-SimNet on two Java datasets. The results show that GT-SimNet achieved BLEU scores of 38.73% and 41.36% on the two datasets, 1.47%∼2.68% higher than the best existing baseline. Importantly, GT-SimNet reduces the BLEU scores by 7.28% after removing Local-ADG. This indicates that Local-ADG is effective for the semantic representation of the code.
Preprint
Code summarization, the task of generating useful comments given the code, has long been of interest. Most of the existing code summarization models are trained and validated on widely-used code comment benchmark datasets. However, little is known about the quality of the benchmark datasets built from real-world projects. Are the benchmark datasets as good as expected? To bridge the gap, we conduct a systematic research to assess and improve the quality of four benchmark datasets widely used for code summarization tasks. First, we propose an automated code-comment cleaning tool that can accurately detect noisy data caused by inappropriate data preprocessing operations from existing benchmark datasets. Then, we apply the tool to further assess the data quality of the four benchmark datasets, based on the detected noises. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that these data preprocessing noises widely exist in all four benchmark datasets, and removing these noisy data leads to a significant improvement on the performance of code summarization. We believe that the findings and insights will enable a better understanding of data quality in code summarization tasks, and pave the way for relevant research and practice.
Article
Full-text available
We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in recent years. MNMT has been useful in improving translation quality as a result of translation knowledge transfer (transfer learning). MNMT is more promising and interesting than its statistical machine translation counterpart, because end-to-end modeling and distributed representations open new avenues for research on machine translation. Many approaches have been proposed to exploit multilingual parallel corpora for improving translation quality. However, the lack of a comprehensive survey makes it difficult to determine which approaches are promising and, hence, deserve further exploration. In this article, we present an in-depth survey of existing literature on MNMT. We first categorize various approaches based on their central use-case and then further categorize them based on resource scenarios, underlying modeling principles, core-issues, and challenges. Wherever possible, we address the strengths and weaknesses of several techniques by comparing them with each other. We also discuss the future directions for MNMT. This article is aimed towards both beginners and experts in NMT. We hope this article will serve as a starting point as well as a source of new ideas for researchers and engineers interested in MNMT.
Article
Full-text available
As an integral part of source code files, code comments help improve program readability and comprehension. However, developers sometimes do not comment their program code adequately due to the incurred extra efforts, lack of relevant knowledge, unawareness of the importance of code commenting or some other factors. As a result, code comments can be inadequate, absent or even mismatched with source code, which affects the understanding, reusing and the maintenance of software. To solve these problems of code comments, researchers have been concerned with generating code comments automatically. In this work, we aim at conducting a survey of automatic code commenting researches. First, we generally analyze the challenges and research framework of automatic generation of program comments. Second, we present the classification of representative algorithms, the design principles, strengths and weaknesses of each category of algorithms. Meanwhile, we also provide an overview of the quality assessment of the generated comments. Finally, we summarize some future directions for advancing the techniques of automatic generation of code comments and the quality assessment of comments.
Article
Full-text available
The use of data-driven ensemble approaches for the prediction of the solar Photovoltaic (PV) power production is promising due to their capability of handling the intermittent nature of the solar energy source. In this work, a comprehensive ensemble approach composed by optimized and diversified Artificial Neural Networks (ANNs) is proposed for improving the 24h-ahead solar PV power production predictions. The ANNs are optimized in terms of number of hidden neurons and diversified in terms of the diverse training datasets used to build the ANNs, by resorting to trial-and-error procedure and BAGGING techniques, respectively. In addition, the Bootstrap technique is embedded to the ensemble for quantifying the sources of uncertainty that affect the ensemble models’ predictions in the form of Prediction Intervals (PIs). The effectiveness of the proposed ensemble approach is demonstrated by a real case study regarding a gridconnected solar PV system (231 kWac capacity) installed on the rooftop of the Faculty of Engineering at the Applied Science Private University (ASU), Amman, Jordan. Results show that the proposed approach outperforms three benchmark models, including smart persistence model and single optimized ANN model currently adopted by the PV system’s owner for the prediction task, with a performance gain reaches up to 11%, 12%, and 9%, for RMSE, MAE, and WMAE standard performance metrics, respectively. Simultaneously, the proposed approach has shown superior in quantifying the uncertainty affecting the power predictions, by establishing slightly wider PIs that achieve the highest confidence level reaches up to 84% for a predefined confidence level of 80% compared to three other approaches of literature. These enhancements would, indeed, allow balancing power supplies and demands across centralized grid networks through economic dispatch decisions between the energy sources that contribute to the energy mix.
Conference Paper
Full-text available
Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable datasets. In addition, a lack of community standards for creating datasets leads to confusing and unreproducible research results -- we observe swings in performance of more than 33% due only to changes in dataset design. In this paper, we make recommendations for these standards from experimental results. We release a dataset based on prior work of over 2.1m pairs of Java methods and one sentence method descriptions from over 28k Java projects. We describe the dataset and point out key differences from natural language data, to guide and support future researchers. Dataset Available at www.leclair.tech/data/funcom
Article
Full-text available
We propose an ensemble of Long‐Short Term Memory (LSTM) Neural Networks for intraday stock predictions, using a large variety of Technical Analysis indicators as network inputs. The proposed ensemble operates in an online way, weighting the individual models proportionally to their recent performance, which allows us to deal with possible non‐stationarities in an innovative way. The performance of the models is measured by Area Under the Curve of the Receiver Operating Characteristic. We evaluate the predictive power of our model on several US large‐cap stocks and benchmark it against Lasso and Ridge logistic classifiers. The proposed model is found to perform better than the benchmark models or equally weighted ensembles.
Article
Full-text available
Generating a description of an image is called image captioning. Image captioning requires to recognize the important objects, their attributes and their relationships in an image. It also needs to generate syntactically and semantically correct sentences. Deep learning-based techniques are capable of handling the complexities and challenges of image captioning. In this survey paper, we aim to present a comprehensive review of existing deep learning-based image captioning techniques. We discuss the foundation of the techniques to analyze their performances, strengths and limitations. We also discuss the datasets and the evaluation metrics popularly used in deep learning based automatic image captioning.
Conference Paper
Full-text available
During software maintenance, code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in the software projects. Developers have to infer the functionality from the source code. This paper proposes a new approach named DeepCom to automatically generate code comments for Java methods. The generated comments aim to help developers understand the functionality of Java methods. DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. We use a deep neural network that analyzes structural information of Java methods for better comments generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects from GitHub. We evaluate the experimental results on a machine translation metric. Experimental results demonstrate that our method DeepCom outperforms the state-of-the-art by a substantial margin.
Conference Paper
Full-text available
We present Nematus, a toolkit for Neural Machine Translation. The toolkit prioritizes high translation accuracy, usability, and extensibility. Nematus has been used to build top-performing submissions to shared translation tasks at WMT and IWSLT, and has been used to train systems for production environments.
Article
Full-text available
We propose a model to automatically describe changes introduced in the source code of a program using natural language. Our method receives as input a set of code commits, which contains both the modifications and message introduced by an user. These two modalities are used to train an encoder-decoder architecture. We evaluated our approach on twelve real world open source projects from four different programming languages. Quantitative and qualitative results showed that the proposed approach can generate feasible and semantically sound descriptions not only in standard in-project settings, but also in a cross-project setting.
Article
Full-text available
When programmers encounter an unfamiliar API library, they often need to refer to its documentations, tutorials, or discussions on development forums to learn its proper usage. These API documents contain valuable information, but may also mislead programmers as they may contain errors (e.g., broken code names and obsolete code samples). Although most API documents are actively maintained and updated, studies show that many new and latent errors do exist. It is tedious and error-prone to find such errors manually as API documents can be enormous with thousands of pages. Existing tools are ineffective in locating documentation errors because traditional natural language (NL) tools do not understand code names and code samples, and traditional code analysis tools do not understand NL sentences. In this paper, we propose the first approach, DOCREF, specifically designed and developed to detect API documentation errors. We formulate a class of inconsistencies to indicate potential documentation errors, and combine NL and code analysis techniques to detect and report such inconsistencies. We have implemented DOCREF and evaluated its effectiveness on the latest documentations of five widely-used API libraries. DOCREF has detected more than 1,000 new documentation errors, which we have reported to the authors. Many of the errors have already been confirmed and fixed, after we reported them.
Article
Full-text available
When programmers encounter an unfamiliar API library, they often need to refer to its documentations, tutorials, or discussions on development forums to learn its proper usage. These API documents contain valuable information, but may also mislead programmers as they may contain errors (e.g., broken code names and obsolete code samples). Although most API documents are actively maintained and updated, studies show that many new and latent errors do exist. It is tedious and error-prone to find such errors manually as API documents can be enormous with thousands of pages. Existing tools are ineffective in locating documentation errors because traditional natural language (NL) tools do not understand code names and code samples, and traditional code analysis tools do not understand NL sentences. In this paper, we propose the first approach, DOCREF, specifically designed and developed to detect API documentation errors. We formulate a class of inconsistencies to indicate potential documentation errors, and combine NL and code analysis techniques to detect and report such inconsistencies. We have implemented DOCREF and evaluated its effectiveness on the latest documentations of five widely-used API libraries. DOCREF has detected more than 1,000 new documentation errors, which we have reported to the authors. Many of the errors have already been confirmed and fixed, after we reported them.
Conference Paper
Full-text available
With the evolution of an API library, its documentation also evolves. The evolution of API documentation is common knowledge for programmers and library developers, but not in a quantitative form. Without such quantitative knowledge, programmers may neglect important revisions of API documentation, and library developers may not effectively improve API documentation based on its revision histories. There is a strong need to conduct a quantitative study on API documentation evolution. However, as API documentation is large in size and revisions can be complicated, it is quite challenging to conduct such a study. In this paper, we present an analysis methodology to analyze the evolution of API documentation. Based on the methodology, we conduct a quantitative study on API documentation evolution of five widely used real-world libraries. The results reveal various valuable findings, and these findings allow programmers and library developers to better understand API documentation evolution.
Conference Paper
Full-text available
This paper highlights the results of a survey of software professionals. One of the goals of this survey was to uncover the perceived relevance (or lack thereof) of software documentation, and the tools and technologies used to maintain, verify and validate such documents. The survey results highlight the preferences for and aversions against software documentation tools. Participants agree that documentation tools should seek to better extract knowledge from core resources. These resources include the system's source code, test code and changes to both. Resulting technologies could then help reduce the effort required for documentation maintenance, something that is shown to rarely occur. Our data reports compelling evidence that software professionals value technologies that improve automation of the documentation process, as well as facilitating its maintenance.
Article
Full-text available
During maintenance developers cannot read the entire code of large systems. They need a way to get a quick understanding of source code entities (such as, classes, methods, packages, etc.), so they can efficiently identify and then focus on the ones related to their task at hand. Sometimes reading just a method header or a class name does not tell enough about its purpose and meaning, while reading the entire implementation takes too long. We study a solution which mitigates the two approaches, i.e., short and accurate textual descriptions that illustrate the software entities without having to read the details of the implementation. We create such descriptions using techniques from automatic text summarization. The paper presents a study that investigates the suitability of various such techniques for generating source code summaries. The results indicate that a combination of text summarization techniques is most appropriate for source code summarization and that developers generally agree with the summaries produced.
Article
Full-text available
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused.
Article
We propose a framework to automatically generate descriptive comments for source code blocks. While this problem has been studied by many researchers previously, their methods are mostly based on fixed template and achieves poor results. Our framework does not rely on any template, but makes use of a new recursive neural network called CodeRNN to extract features from the source code and embed them into one vector. When this vector representation is input to a new recurrent neural network (Code-GRU), the overall framework generates text descriptions of the code with accuracy (Rouge-2 value) significantly higher than other learning-based approaches such as sequence-to-sequence model. The Code-RNN model can also be used in other scenario where the representation of code is required.
Conference Paper
The field of big code relies on mining large corpora of code to perform some learning task towards creating better tools for software engineers. A significant threat to this approach was recently identified by Lopes et al. (2017) who found a large amount of near-duplicate code on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this work, we explore the effects of code duplication on machine learning models showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machine learning models of code are used by software engineers. We present a duplication index for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them. Finally, we release tools to help the community avoid this problem in future research.
Chapter
Comments play an important role in software developments. They can not only improve the readability and maintainability of source code, but also provide significant resource for software reuse. However, it is common that lots of code in software projects lacks of comments. Automatic comment generation is proposed to address this issue. In this paper, we present an end-to-end approach to generate comments for API-based code snippets automatically. It takes API sequences as the core semantic representations of method-level API-based code snippets and generates comments from API sequences with sequence-to-sequence neural models. In our evaluation, we extract 217K pairs of code snippets and comments from Java projects to construct the dataset. Finally, our approach gains 36.48% BLEU-4 score and 9.90% accuracy on the test set. We also do case studies on generated comments, which presents that our approach generates reasonable and effective comments for API-based code snippets.
Article
Ensemble weather predictions require statistical postprocessing of systematic errors to obtain reliable and accurate probabilistic forecasts. Traditionally, this is accomplished with distributional regression models in which the parameters of a predictive distribution are estimated from a training period. We propose a flexible alternative based on neural networks that can incorporate nonlinear relationships between arbitrary predictor variables and forecast distribution parameters that are automatically learned in a data-driven way rather than requiring prespecified link functions. In a case study of 2-m temperature forecasts at surface stations in Germany, the neural network approach significantly outperforms benchmark postprocessing methods while being computationally more affordable. Key components to this improvement are the use of auxiliary predictor variables and station-specific information with the help of embeddings. Furthermore, the trained neural network can be used to gain insight into the importance of meteorological variables, thereby challenging the notion of neural networks as uninterpretable black boxes. Our approach can easily be extended to other statistical postprocessing and forecasting problems. We anticipate that recent advances in deep learning combined with the ever-increasing amounts of model and observation data will transform the postprocessing of numerical weather forecasts in the coming decade.
Conference Paper
Code summarization, aiming to generate succinct natural language description of source code, is extremely useful for code search and code comprehension. It has played an important role in software maintenance and evolution. Previous approaches generate summaries by retrieving summaries from similar code snippets. However, these approaches heavily rely on whether similar code snippets can be retrieved, how similar the snippets are, and fail to capture the API knowledge in the source code, which carries vital information about the functionality of the source code. In this paper, we propose a novel approach, named TL-CodeSum, which successfully uses API knowledge learned in a different but related task to code summarization. Experiments on large-scale real-world industry Java projects indicate that our approach is effective and outperforms the state-of-the-art in code summarization.
Article
Celebrated \emph{Sequence to Sequence learning (Seq2Seq)} and its fruitful variants are powerful models to achieve excellent performance on the tasks that map sequences to sequences. However, these are many machine learning tasks with inputs naturally represented in a form of graphs, which imposes significant challenges to existing Seq2Seq models for lossless conversion from its graph form to the sequence. In this work, we present a general end-to-end approach to map the input graph to a sequence of vectors, and then another attention-based LSTM to decode the target sequence from these vectors. Specifically, to address inevitable information loss for data conversion, we introduce a novel graph-to-sequence neural network model that follows the encoder-decoder architecture. Our method first uses an improved graph-based neural network to generate the node and graph embeddings by a novel aggregation strategy to incorporate the edge direction information into the node embeddings. We also propose an attention based mechanism that aligns node embeddings and decoding sequence to better cope with large graphs. Experimental results on bAbI task, Shortest Path Task, and Natural Language Generation Task demonstrate that our model achieves the state-of-the-art performance and significantly outperforms other baselines. We also show that with the proposed aggregation strategy, our proposed model is able to quickly converge to good performance.
Article
Ensemble methods are considered the state‐of‐the art solution for many machine learning challenges. Such methods improve the predictive performance of a single model by training multiple models and combining their predictions. This paper introduce the concept of ensemble learning, reviews traditional, novel and state‐of‐the‐art ensemble methods and discusses current challenges and trends in the field. This article is categorized under: • Algorithmic Development > Model Combining • Technologies > Machine Learning • Technologies > Classification
Article
Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Conference Paper
Many real-world applications have problems when learning from imbalanced data sets, such as medical diagnosis, fraud detection, and text classification. Very few minority class instances cannot provide sufficient information and result in performance degrading greatly. As a good way to improve the classification performance of weak learner, some ensemble-based algorithms have been proposed to solve class imbalance problem. However, it is still not clear that how diversity affects classification performance especially on minority classes, since diversity is one influential factor of ensemble. This paper explores the impact of diversity on each class and overall performance. As the other influential factor, accuracy is also discussed because of the trade-off between diversity and accuracy. Firstly, three popular re-sampling methods are combined into our ensemble model and evaluated for diversity analysis, which includes under-sampling, over-sampling, and SMOTE - a data generation algorithm. Secondly, we experiment not only on two-class tasks, but also those with multiple classes. Thirdly, we improve SMOTE in a novel way for solving multi-class data sets in ensemble model - SMOTEBagging.
Conference Paper
This paper describes in a general way the process we went through to determine the goals, principles, audience, content and style for writing comments in source code for the Java platform at the Java Software division of Sun Microsystems. This includes how the documentation comments evolved to become the home of the Java platform API specification, and the guidelines we developed to make it practical for this document to reside in the same files as the source code.
Article
In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in ℝn. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line.
Article
In this paper, we classify the breast cancer of medical diagnostic data. Information gain has been adapted for feature selections. Neural fuzzy (NF), k-nearest neighbor (KNN), quadratic classifier (QC), each single model scheme as well as their associated, ensemble ones have been developed for classifications. In addition, a combined ensemble model with these three schemes has been constructed for further validations. The experimental results indicate that the ensemble learning performs better than individual single ones. Moreover, the combined ensemble model illustrates the highest accuracy of classifications for the breast cancer among all models.
code2seq: Generating sequences from structured representations of code
  • U Alon
  • S Brody
  • O Levy
  • E Yahav
Ensemble learning for multi-source neural machine translation
  • garmash
Retrieval-augmented generation for code summarization via hybrid {gnn}
  • S Liu
  • Y Chen
  • X Xie
  • J K Siow
  • Y Liu
Structured neural summarization
  • P Fernandes
  • M Allamanis
  • M Brockschmidt
Language-agnostic representation learning of source code from structure and context
  • D Zügner
  • T Kirschstein
  • M Catasta
  • J Leskovec
  • S Günnemann
code2seq: Generating sequences from structured representations of code
  • alon
Retrieval-augmented generation for code summarization via hybrid {gnn}
  • liu
Language-agnostic representation learning of source code from structure and context
  • zügner