Figure 2 - uploaded by Alexander LeClair
Histogram of word occurrences per document.

Source publication
Conference Paper
Full-text available
Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable dat...

Context in source publication

Context 1
... as depicted in Figure 1, words appear to be used more often in code than in natural language: there are fewer words used only one or two times and, in general, more used 3+ times. At the same time (Figure 2), the pattern of word occurrences per document appears similar, implying that even though words in code are repeated, they are often repeated within the same method rather than across methods. Even though this may suggest that the occurrence of unique words in source code is isolated enough to have little effect on BLEU score, we show in Section 4 that this word overlap causes BLEU score inflation when the data is split by function. ...
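As a concrete illustration of the statistic behind Figure 2, the short sketch below builds a histogram of word occurrences per document. The whitespace tokenizer and the two toy "documents" are assumptions for illustration only, not the paper's actual preprocessing of the funcom data.

from collections import Counter

documents = [
    "public int getId ( ) { return id ; }",               # toy tokenized methods;
    "public void setId ( int id ) { this . id = id ; }",  # real data would come from the dataset
]

histogram = Counter()
for doc in documents:
    per_doc_counts = Counter(doc.split())       # how often each word appears in this document
    histogram.update(per_doc_counts.values())   # bucket word types by their occurrence count

for occurrences, num_word_types in sorted(histogram.items()):
    print(f"{num_word_types} word types occur {occurrences}x within a single document")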

Similar publications

Preprint
Full-text available
Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable dat...

Citations

... Even in cases where LLM-driven code translation might not be trustworthy, automatically generated documentation may accelerate code understanding to facilitate translation, refactoring, and maintenance if it can be generated consistently and accurately. While documentation generation methods have advanced greatly in recent years [15]-[17], they have been trained and evaluated primarily on mainstream languages like C, Python, and Java, and on relatively short, simple programs with limited complexity [18]. This makes it difficult to infer whether results on mainstream benchmarks will generalize in useful ways to legacy software, which is often written in niche or antiquated languages while also exhibiting extreme complexity. ...
... Evaluating generated comments is fundamentally important for industrial applications, where the quality of the output determines whether an automated tool is ready to be used on production codebases. Manual human evaluation of large numbers of comments is costly, but evaluation must be performed on each new system (even on mainstream languages, the quality and reliability of generated documentation can vary dramatically across codebases [18], [20]). There is thus a need for automated metrics of documentation quality that can be used to test and develop documentation generation tools. ...
... There is a lack of quality benchmark datasets which makes the development of advanced documentation generation techniques difficult [18]. This challenge is compounded for legacy software, which is often written in rare or domain-specific languages. ...
Preprint
Full-text available
Legacy software systems, written in outdated languages like MUMPS and mainframe assembly, pose challenges in efficiency, maintenance, staffing, and security. While LLMs offer promise for modernizing these systems, their ability to understand legacy languages is largely unknown. This paper investigates the utilization of LLMs to generate documentation for legacy code using two datasets: an electronic health records (EHR) system in MUMPS and open-source applications in IBM mainframe Assembly Language Code (ALC). We propose a prompting strategy for generating line-wise code comments and a rubric to evaluate their completeness, readability, usefulness, and hallucination. Our study assesses the correlation between human evaluations and automated metrics, such as code complexity and reference-based metrics. We find that LLM-generated comments for MUMPS and ALC are generally hallucination-free, complete, readable, and useful compared to ground-truth comments, though ALC poses challenges. However, no automated metrics strongly correlate with comment quality to predict or measure LLM performance. Our findings highlight the limitations of current automated measures and the need for better evaluation metrics for LLM-generated documentation in legacy systems.
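As a minimal example of the reference-based metrics discussed above (a smoothed sentence-level BLEU from NLTK; the comments are hypothetical and this is not the study's evaluation rubric), one can score a generated comment against a ground-truth comment as follows:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "increment the patient record counter".split()      # hypothetical ground-truth comment
candidate = "increase the counter of patient records".split()   # hypothetical generated comment

# Smoothing avoids zero scores when higher-order n-grams have no overlap,
# which is common for short comments.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")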
... RQ3: What is the effect of intra- or cross-project training? Previous research [5, 42] studied the effect of different dataset designs on the performance of code summarization models. In this RQ, we investigate the outcome of models based on choosing the training and test functions from the same or different R repositories, also known as intra- or cross-project data selection. ...
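The sketch below shows, under simplified assumptions (each example is a dict with a "project" key; the grouping logic is illustrative, not the cited studies' exact tooling), how intra-project and cross-project data selection differ:

import random

def split_examples(examples, cross_project, test_fraction=0.2, seed=0):
    """Return (train, test) splits either within projects or across whole projects."""
    rng = random.Random(seed)
    if cross_project:
        # Cross-project: hold out entire projects, so no project contributes
        # functions to both the training and the test set.
        projects = sorted({ex["project"] for ex in examples})
        rng.shuffle(projects)
        held_out = set(projects[: max(1, int(len(projects) * test_fraction))])
        train = [ex for ex in examples if ex["project"] not in held_out]
        test = [ex for ex in examples if ex["project"] in held_out]
    else:
        # Intra-project: shuffle individual functions, so train and test may share projects.
        pool = list(examples)
        rng.shuffle(pool)
        cut = int(len(pool) * (1 - test_fraction))
        train, test = pool[:cut], pool[cut:]
    return train, test

examples = [{"project": f"repo{i % 5}", "function": f"f{i}", "summary": "..."} for i in range(100)]
train, test = split_examples(examples, cross_project=True)
print(len(train), len(test))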
Preprint
Full-text available
Recent advancements in developing Pre-trained Language Models for Code (Code-PLMs) have spurred many areas of Software Engineering (SE) and brought breakthrough results for many SE tasks. Though these models have achieved state-of-the-art performance on SE tasks for many popular programming languages, such as Java and Python, scientific software and its related languages, like the R programming language, have rarely benefited from or even been evaluated with Code-PLMs. Research has shown that R differs from other programming languages in many ways and requires specific techniques. In this study, we provide the first insights into code intelligence for R. For this purpose, we collect and open-source an R dataset, and evaluate Code-PLMs on the two tasks of code summarization and method name prediction using several settings and strategies, including the differences between the two R styles, Tidyverse and Base R. Our results demonstrate that the studied models experience varying degrees of performance degradation when processing R code, which is supported by human evaluation. Additionally, not all models show performance improvement on R-specific tasks even after multi-language fine-tuning. The dual syntax paradigms in R significantly impact the models' performance, particularly in code summarization tasks. Furthermore, the project-specific context inherent in R codebases significantly impacts performance when attempting cross-project training.
... The tasks that fall under the code-output category include code generation, code translation, code completion, and program repair, encompassing 14 datasets ( [49], [50], [51], [44], [52], [24], [53], [54], [23], [46], [47]). The text-output category consists of the code summarization task, which includes 3 datasets ( [23], [24], [48]). The tasks in the class-output category are vulnerability detection and clone detection, covering 6 datasets ( [21], [22], [41], [42], [43], [25]). ...
... Datasets grouped by task type, with their Hugging Face identifiers (table excerpt):
Code Summarization: CodeXGLUE [23] - code_x_glue_ct_code_to_text (subset="java"); XLCoST [24] - codeparrot/xlcost-text-to-code (subset="C++-program-level"); Funcom [48] - apcl/funcom-java-long.
Code Generation: APPS [49] - codeparrot/apps (difficulties="all"); MBPP [50] - mbpp (subset="sanitized"); Mercury [51] - Elfsong/Mercury; InstructHumanEval [44] - codeparrot/instructhumaneval; StudentEval [52] - wellesley-easel/StudentEval; XLCoST [24] - codeparrot/xlcost-text-to-code (subset="C++-program-level"); CoNaLa [53] - neulab/conala (subset="curated"); CONCODE [54] - AhmedSSoliman/CodeXGLUE-CONCODE.
For INCTRL models, the prompt consists solely of a task definition and input data. ...
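For instance, the identifiers listed in the excerpt can be loaded with the Hugging Face datasets library. The sketch below loads the CodeXGLUE code-to-text Java subset; the identifier and subset name are taken from the excerpt, while the split name and field names are assumptions that may differ by dataset version.

from datasets import load_dataset

# Identifier and configuration as listed in the excerpt above.
ds = load_dataset("code_x_glue_ct_code_to_text", "java", split="train")

example = ds[0]
print(example.keys())   # typically includes the tokenized code and its docstring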
Preprint
Bimodal software analysis initially appeared to be within reach with the advent of large language models. Unfortunately, the complex interplay of natural language text and code in software engineering presents unique challenges that prevent pretrained models from generalizing to a variety of tasks. We postulate that in-context learning for the code-text bimodality is a promising avenue. This paper thus introduces a comprehensive study of in-context code-text learning, focusing on leveraging pretrained CodeLLAMA models. We consider a diverse dataset encompassing 23 software engineering tasks, which we transform into an in-context learning format. To effectively extract informative features, we propose a configurable prompt template. Our proposed pipeline, InCTRL, then unifies prompt learning across various software engineering tasks. Extensive evaluation on the study datasets demonstrates the superiority of INCTRL-models in few-shot performance, surpassing state-of-the-art models including the support model, CodeLLAMA. Typically, we observe that, applied to the CodeLLAMA model, INCTRL brings improvements in terms of precision (at least about 12%) and recall (up to 93.88%) on various tasks. For example, on the task of program repair, INCTRL improves the BLEU score of CodeLLAMA by 85 points, while for clone detection, INCTRL achieves an improvement of 69 percentage points. Moreover, INCTRL-models offer state-of-the-art performance when using retrieval-augmented generation on individual downstream tasks. Finally, we qualitatively analyze the benefits of INCTRL over CodeLLAMA and open-source all models for broader impact. We make our code and dataset publicly available at https://anonymous.4open.science/r/inctrl-B65B
... Public datasets for code generation in other PLs often use function-level samples [38], [40]-[42]. This is because functions encapsulate code into units of functionality. Additionally, they provide well-defined arguments that are expected to be used in the body, giving the models more information to guide code generation. ...
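A small sketch of why function-level samples are convenient, assuming Python source and the standard-library ast module (not any particular dataset's extraction pipeline): each function is a self-contained unit with a named argument list and, often, a paired docstring.

import ast

source = '''
def moving_average(values, window):
    """Return the simple moving average of values over a sliding window."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
'''

for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef):
        sample = {
            "name": node.name,
            "args": [arg.arg for arg in node.args.args],   # well-defined arguments
            "docstring": ast.get_docstring(node),          # natural-language pairing
            "code": ast.unparse(node),                     # the whole function as one unit
        }
        print(sample["name"], sample["args"])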
Preprint
Full-text available
As Moore's Law continues to increase the complexity of electronic systems, Electronic Design Automation (EDA) must advance to meet global demand. An important example of an EDA technology is SKILL, a scripting language used to customize and extend EDA software. Recently, code generation models using the transformer architecture have achieved impressive results in academic settings and have even been used in commercial developer tools to improve developer productivity. To the best of our knowledge, this study is the first to apply transformers to SKILL code autocompletion towards improving the productivity of hardware design engineers. In this study, a novel, data-efficient methodology for generating SKILL code is proposed and experimentally validated. More specifically, we propose a novel methodology for (i) creating a high-quality SKILL dataset with both unlabeled and labeled data, (ii) training T5 models pre-trained on general programming language code by fine-tuning them on our custom SKILL dataset using unsupervised and supervised learning, and (iii) evaluating synthesized SKILL code. We show that models trained using the proposed methodology outperform baselines in terms of human-judgment score and BLEU score. A major challenge faced was the extremely small amount of SKILL code data available to train a transformer model to generate SKILL code. Despite our validated improvements, the extremely small dataset available to us was still not enough to train a model that can reliably autocomplete SKILL code. We discuss this and other limitations as well as future work that could address these limitations.
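As a rough sketch of the fine-tune-then-generate setup described (the checkpoint Salesforce/codet5-small and the SKILL-like prompt are assumptions for illustration, not the study's actual model or data), a T5 model pre-trained on code can be used to complete a code prompt as follows:

from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "Salesforce/codet5-small"       # assumed code-pretrained T5 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

prompt = "procedure( drawBox(w h)"           # hypothetical SKILL-like prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))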
... Then, 266 programs that contain conditional statements are selected from the dataset in Ref. [37] to carry out the comparative experiment on time consumption between CSE and QSE. The results are shown in Fig. 17. ...
Article
Full-text available
With advances in quantum computing, researchers can now write and run many quantum programs. However, there is still a lack of effective methods for debugging quantum programs. In this paper, quantum symbolic execution (QSE) is proposed to generate test cases, which helps to find bugs in quantum programs. The main idea of quantum symbolic execution is to find suitable test cases from among all possible ones (i.e., the test case space). This differs from classical symbolic execution, which obtains test cases by calculation rather than by searching. QSE utilizes quantum superposition and parallelism to store the test case space with only a few qubits. Guided by the conditional statements in the program being debugged, the test case space is repeatedly divided into subsets, sub-subsets, and so on. Elements in the same subset are suitable test cases that exercise the corresponding branch in the code under test. QSE not only provides a possible way to debug quantum programs, but also avoids the difficult problem of solving constraints in classical symbolic execution.
... Other strategies include symbolic or concolic execution [15]. But these strategies are not practical on a large dataset: training data for neural code summarization often runs into hundreds of thousands or even millions of examples [16], [17]. ...
... The Java dataset we use consists of 190K Java methods that we curated from a larger dataset of 2.1m methods proposed in 2019 by LeClair et al. [16], named funcom. We selected the funcom dataset for three reasons. ...
Preprint
Full-text available
Source code summarization is the task of writing natural language descriptions of source code behavior. Code summarization underpins software documentation for programmers. Short descriptions of code help programmers understand the program quickly without having to read the code itself. Lately, neural source code summarization has emerged as the frontier of research into automated code summarization techniques. By far the most popular targets for summarization are program subroutines. The idea, in a nutshell, is to train an encoder-decoder neural architecture using large sets of examples of subroutines extracted from code repositories. The encoder represents the code and the decoder represents the summary. However, most current approaches attempt to treat the subroutine as a single unit. For example, by taking the entire subroutine as input to a Transformer or RNN-based encoder. But code behavior tends to depend on the flow from statement to statement. Normally dynamic analysis may shed light on this flow, but dynamic analysis on hundreds of thousands of examples in large datasets is not practical. In this paper, we present a statement-based memory encoder that learns the important elements of flow during training, leading to a statement-based subroutine representation without the need for dynamic analysis. We implement our encoder for code summarization and demonstrate a significant improvement over the state-of-the-art.
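For orientation, the sketch below shows a generic encoder-decoder baseline of the kind this line of work builds on, using PyTorch's built-in Transformer; the vocabulary sizes, dimensions, and toy batch are assumptions, and this is not the paper's statement-based memory encoder.

import torch
import torch.nn as nn

class Seq2SeqSummarizer(nn.Module):
    def __init__(self, code_vocab=50_000, text_vocab=30_000, d_model=256):
        super().__init__()
        self.src_embed = nn.Embedding(code_vocab, d_model)
        self.tgt_embed = nn.Embedding(text_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d_model, text_vocab)

    def forward(self, code_ids, summary_ids):
        # Causal mask so the decoder only attends to earlier summary tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(summary_ids.size(1))
        h = self.transformer(self.src_embed(code_ids),
                             self.tgt_embed(summary_ids),
                             tgt_mask=tgt_mask)
        return self.out(h)   # logits over the summary vocabulary

model = Seq2SeqSummarizer()
code = torch.randint(0, 50_000, (4, 120))      # a batch of tokenized methods
summary = torch.randint(0, 30_000, (4, 12))    # the corresponding tokenized summaries
logits = model(code, summary)
print(logits.shape)                            # torch.Size([4, 12, 30000])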
... The lack of unified, standard datasets is a major obstacle to the rapid development of code summarization research [23]. Unifying test datasets plays a positive role in promoting neural code summarization research. ...
Article
In software engineering, practitioners face many large-scale and complex software systems, which require programmers to read and understand code quickly and accurately and to complete software change or maintenance tasks efficiently. Code-NN was the first model to use deep learning for code summary generation, but it does not use the structural information in the code itself. In the past five years, researchers have designed different code summarization systems based on neural networks. They generally use the end-to-end neural machine translation framework, but many current methods do not make full use of the structural information of the code. This paper presents a new model called G-DCS that automatically generates summaries of Java code; the generated summaries are designed to help programmers quickly comprehend the effect of Java methods. G-DCS uses natural language processing techniques and is trained on a code corpus; the model can generate code summaries directly from the code files in the corpus. Compared with traditional methods, it uses the structural information of the code: a Graph Convolutional Network (GCN) extracts the structural information from the code to generate the code sequence, which makes the generated code summaries more accurate. The corpus used for training was obtained from GitHub, and BLEU-n was used as the evaluation criterion. Experimental results show that our approach outperforms models that do not utilize code structure information.
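For reference, a single graph-convolution step of the kind G-DCS relies on can be sketched as below (generic GCN propagation over a toy graph with NumPy; the graph, feature sizes, and activation are assumptions, not the paper's architecture): each node representation is updated from its neighbours through the normalized adjacency matrix.

import numpy as np

# Toy graph over 4 "code" nodes (e.g., AST nodes), with self-loops added.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                               # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt            # symmetric normalization

H = np.random.randn(4, 8)                           # node features (e.g., token embeddings)
W = np.random.randn(8, 8)                           # learnable layer weights

H_next = np.maximum(A_norm @ H @ W, 0)              # one GCN layer: ReLU(A_norm . H . W)
print(H_next.shape)                                 # (4, 8)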
... jm52m is a dataset of 52m Java methods created from 52k Java projects. The source code originated from the Merobase [20] and Sourcerer [26] data releases, supplemented by our own prior work in LeClair et al. [24]. It contains code uploaded to code repositories between 2008 and 2018. ...
Preprint
Full-text available
This tool demonstration presents a research toolkit for a language model of Java source code. The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java. In contrast to many existing language models, we prioritize features for researchers including an open and easily-searchable training set, a held out test set with different levels of deduplication from the training set, infrastructure for deduplicating new examples, and an implementation platform suitable for execution on equipment accessible to a relatively modest budget. Our model is a GPT2-like architecture with 350m parameters. Our training set includes 52m Java methods (9b tokens) and 13m StackOverflow threads (10.5b tokens). To improve accessibility of research to more members of the community, we limit local resource requirements to GPUs with 16GB video memory. We provide a test set of held out Java methods that include descriptive comments, including the entire Java projects for those methods. We also provide deduplication tools using precomputed hash tables at various similarity thresholds to help researchers ensure that their own test examples are not in the training set. We make all our tools and data open source and available via Huggingface and Github.
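One simple way such deduplication checks can be implemented is sketched below (token-shingle Jaccard similarity against a precomputed index; the normalization, shingle size, and threshold are assumptions, not the toolkit's actual hashing scheme):

def shingles(code, n=5):
    """Set of n-token shingles after trivial whitespace tokenization."""
    tokens = code.split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Precomputed shingle sets for the training corpus (a single toy method here).
train_index = [shingles("public int getId ( ) { return id ; }")]

def is_near_duplicate(candidate_code, threshold=0.7):
    cand = shingles(candidate_code)
    return any(jaccard(cand, s) >= threshold for s in train_index)

print(is_near_duplicate("public int getId ( ) { return this . id ; }"))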
... Benchmark datasets comparison: The Vault is a comprehensive collection of parallel code and docstring pairs, which is larger in scale and covers more programming languages than existing datasets. Table 4 presents a comparison between our dataset and other parallel datasets, including Funcom [25], Deepcom [18], CONCODE [21], CodeSearchNet [19], CoDesc [16] and non-public data used for pretraining [5,4,38]. Although these datasets are widely used for pretraining and fine-tuning downstream tasks, they typically contain only a single language and a small number of samples. ...
Preprint
Full-text available
We present The Vault, an open-source, large-scale code-text dataset designed to enhance the training of code-focused large language models (LLMs). Existing open-source datasets for training code-based LLMs often face challenges in terms of size, quality (due to noisy signals), and format (only containing code function and text explanation pairings). The Vault overcomes these limitations by providing 40 million code-text pairs across 10 popular programming languages, thorough cleaning for 10+ prevalent issues, and various levels of code-text pairings, including class, function, and line levels. Researchers and practitioners can utilize The Vault for training diverse code-focused LLMs or incorporate the provided data cleaning methods and scripts to improve their datasets. By employing The Vault as the training dataset for code-centric LLMs, we anticipate significant advancements in code understanding and generation tasks, fostering progress in both artificial intelligence research and software development practices.