Fig 3
Performance of individual Mechanical Turk workers in RQ 2 as measured by ROUGE-1 F1, compared to the total number of HITs completed by each worker. The dashed line signifies the mean F1 score.

Context in source publication

Context 1
... connotation of the word "hack" was a clear signal to the experts, but not the non-experts. Consistent with other literature using Mechanical Turk [63], we observe that many workers who complete only a few HITs exhibit poor performance, while those who complete more HITs consistently perform reasonably well, as shown in Figure 3. Given this variation, it could be misleading to report a statistical summary of all annotators as we did for RQ 1, so we summarize the results as a mean for general comparison on all metrics in Table III. ...
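To make the metric behind Figure 3 concrete, here is a minimal sketch of how a per-worker ROUGE-1 F1 could be computed from HIT results. The worker IDs, summaries, and data layout are hypothetical placeholders, not the study's actual data.

```python
from collections import Counter
from statistics import mean

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical HITs: (worker_id, worker_summary, reference_summary)
hits = [
    ("w1", "returns the customer id", "gets the id of the customer"),
    ("w1", "opens a network connection", "opens a new socket connection"),
    ("w2", "adds the item", "removes the item from the list"),
]

scores = {}
for worker, pred, ref in hits:
    scores.setdefault(worker, []).append(rouge1_f1(pred, ref))

for worker, worker_scores in scores.items():
    print(worker, "HITs:", len(worker_scores), "mean F1:", round(mean(worker_scores), 3))
```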

Citations

... The C/C++ dataset was first published by Haque et al. [14] following an extraction model proposed by Eberhart et al. [55] to adhere to the idiosyncrasies of C/C++, while maintaining the same strict standards proposed by LeClair et al. [11]. It consists of 1.1M methods from more than 33K projects. ...
Preprint
Label smoothing is a regularization technique for neural networks. Normally, neural models are trained toward a target output distribution that is a one-hot vector: a single 1 for the correct prediction and 0 for all other elements. Label smoothing changes the value at the correct prediction's position to something slightly less than 1, then distributes the remainder across the other elements so that they are slightly greater than 0. A conceptual explanation behind label smoothing is that it helps prevent a neural model from becoming "overconfident" by forcing it to consider alternatives, even if only slightly. Label smoothing has been shown to help several areas of language generation, yet it typically requires considerable tuning and testing to achieve optimal results. This tuning and testing has not been reported for neural source code summarization, a growing research area in software engineering that seeks to generate natural language descriptions of source code behavior. In this paper, we demonstrate the effect of label smoothing on several baselines in neural code summarization, conduct an experiment to find good parameters for label smoothing, and make recommendations for its use.
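As a rough illustration of the mechanics described in this abstract, the sketch below applies the common uniform form of label smoothing to a one-hot target. The epsilon value and vocabulary size are arbitrary placeholders, not the parameters the paper recommends.

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Shift `epsilon` of the probability mass off the correct class and
    spread it uniformly over all classes (the common uniform formulation)."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

# Hypothetical 5-word vocabulary with the correct token at index 2.
target = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
print(smooth_labels(target))  # [0.02 0.02 0.92 0.02 0.02]
```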
... We recruited these programmers via Upwork, offering remuneration of US$60/hr, the market rate in our location. We did not use Amazon Mechanical Turk (AMT) [38] because Eberhart et al. [39] show that AMT users demonstrate lower agreement in similarity ratings. They argue that this is due to a lack of expert domain knowledge, and we need a high degree of reliability to recommend the use of a new metric. ...
Preprint
Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have to programmers and the simultaneously high cost of writing and maintaining documentation by hand. Current work is almost all based on machine models trained via big data input. Large datasets of examples of code and summaries of that code are used to train a neural model, e.g., an encoder-decoder. Then the output predictions of the model are evaluated against a set of reference summaries. The input is code not seen by the model, and the prediction is compared to a reference. The means by which a prediction is compared to a reference is essentially word overlap, calculated via a metric such as BLEU or ROUGE. The problem with using word overlap is that not all words in a sentence have the same importance, and many words have synonyms. The result is that the calculated similarity may not match the similarity perceived by human readers. In this paper, we conduct an experiment to measure the degree to which various word overlap metrics correlate with human-rated similarity of predicted and reference summaries. We evaluate alternatives based on current work in semantic similarity metrics and propose recommendations for the evaluation of source code summarization.
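The weakness this abstract describes can be seen with a toy word-overlap score: two near-synonymous summaries can score low while two summaries that differ in a single crucial word score high. The summaries below are invented examples, not data from the paper.

```python
def unigram_overlap(prediction, reference):
    """Fraction of reference words that also appear in the prediction."""
    pred = set(prediction.lower().split())
    ref = set(reference.lower().split())
    return len(pred & ref) / len(ref)

# Two summaries a human would likely rate as equivalent score low...
print(unigram_overlap("removes the entry from the cache",
                      "deletes the item from the buffer"))  # 0.4
# ...while two summaries that differ in one crucial word score high.
print(unigram_overlap("adds the customer object",
                      "removes the customer object"))       # 0.75
```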
... Participants were prescreened for English fluency. Because this survey did not require participants to read or understand code, it fell in line with literature tasking non-programmers with reading software documentation [46]. Nevertheless, we limited the participant pool to participants reporting experience with "Computer Programming." ...
Preprint
In source code search, a common information-seeking strategy involves providing a short initial query with a broad meaning, and then iteratively refining the query using terms gleaned from the results of subsequent searches. This strategy requires programmers to spend time reading search results that are irrelevant to their development needs. In contrast, when programmers seek information from other humans, they typically refine queries by asking and answering clarifying questions. Clarifying questions have been shown to benefit general-purpose search engines, but have not been examined in the context of code search. We present a method for generating natural-sounding clarifying questions using information extracted from function names and comments. Our method outperformed a keyword-based method for single-turn refinement in synthetic studies, and was associated with shorter search duration in human studies.
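As a hedged sketch of the general idea (not the authors' actual generation method), a clarifying question could be templated from the words of a candidate function name. The splitting rule and question template here are assumptions for illustration.

```python
import re

def split_identifier(name):
    """Split a function name on underscores and camelCase boundaries."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", name)
    return [p.lower() for p in parts if p]

def clarifying_question(function_name):
    """Turn the words of a candidate function name into a yes/no question."""
    words = " ".join(split_identifier(function_name))
    return f"Is it related to '{words}'?"

print(clarifying_question("open_network_connection"))  # Is it related to 'open network connection'?
print(clarifying_question("sendDataAsync"))            # Is it related to 'send data async'?
```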
... A first interesting parallel to MT research in NLP is that code summarization models also differ substantially in their assumptions about the nature of the task. Some adopt a sequence-to-sequence mapping approach (Iyer et al., 2016; Eberhart et al., 2020), while others take into account code structure, e.g., abstract syntax trees (ASTs) (Hu et al., 2018a; Wan et al., 2018), or infer latent structure with graph neural networks or transformers (Ahmad et al., 2020). Another active direction, again similar to many NLP tasks, is the inclusion of contextual and background information, through API calls (Hu et al., 2018b), information from other methods or projects, or exploiting the symmetry between code summarization and generation (Wei et al., 2019). ...
Preprint
Full-text available
Source code summarization is the task of generating a high-level natural language description for a segment of programming language code. Current neural models for the task differ in their architecture and the aspects of code they consider. In this paper, we show that three SOTA models for code summarization work well on largely disjoint subsets of a large code-base. This complementarity motivates model combination: We propose three meta-models that select the best candidate summary for a given code segment. The two neural models improve significantly over the performance of the best individual model, obtaining an improvement of 2.1 BLEU points on a dataset of code segments where at least one of the individual models obtains a non-zero BLEU.
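A much-simplified stand-in for the meta-model idea is sketched below: a heuristic that picks the candidate summary sharing the most tokens with the code. The scoring rule and example data are assumptions for illustration, not the paper's learned meta-models.

```python
def select_summary(code_tokens, candidates):
    """Pick the candidate summary that shares the most tokens with the code;
    a crude stand-in for the learned meta-models described above."""
    code = {t.lower() for t in code_tokens}
    return max(candidates, key=lambda s: len(code & set(s.lower().split())))

candidates = [
    "returns the size of the list",     # e.g. output of model A
    "gets the number of open sockets",  # e.g. output of model B
    "closes the connection",            # e.g. output of model C
]
code_tokens = ["int", "count_open_sockets", "sockets", "open", "return"]
print(select_summary(code_tokens, candidates))  # model B's candidate wins
```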
... This is the action space within which our dialogue manager can operate. We started with the dataset provided by Eberhart et al. [44] of API help dialogues, then refined the set of possible dialogue acts based on other related literature. We show the action space in Table I. ...
... The libssh dataset comprises 264 functions, while the Allegro dataset comprises 917 functions. We chose these APIs for three reasons: 1) C APIs do not include class hierarchies, enabling an emphasis on concept-based search, 2) the size and domain differences help us perceive how well the policies can generalize to a broader range of APIs, and 3) these were the APIs used in the API search experiments by Eberhart et al. [44]. ...
... The complete lists of search tasks are available in our online appendix (see Section VII). We based these questions on search tasks used in related literature [61], [62], [44]. Each question targeted a particular function in the API. ...
Preprint
API search involves finding components in an API that are relevant to a programming task. For example, a programmer may need a function in a C library that opens a new network connection, then another function that sends data across that connection. Unfortunately, programmers often have trouble finding the API components that they need. A strong scientific consensus is emerging towards developing interactive tool support that responds to conversational feedback, emulating the experience of asking a fellow human programmer for help. A major barrier to creating these interactive tools is implementing dialogue management for API search. Dialogue management involves determining how a system should respond to user input, such as whether to ask a clarification question or to display potential results. In this paper, we present a dialogue manager for interactive API search that considers search results and dialogue history to select efficient actions. We implement two dialogue policies: a hand-crafted policy and a policy optimized via reinforcement learning. We perform a synthetic evaluation and a human evaluation comparing the policies to a generic single-turn, top-N policy used by source code search engines.
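The sketch below is a toy hand-crafted policy in the spirit described in this abstract, not the paper's implementation; the action names, thresholds, and result format are assumptions.

```python
def choose_action(results, turns_so_far, max_turns=3):
    """Toy hand-crafted policy: show results when the ranking looks confident,
    otherwise ask a clarifying question (up to a turn limit).
    `results` is a list of (function_name, relevance_score), best first."""
    if not results:
        return ("ask_clarification", None)
    top = results[0][1]
    runner_up = results[1][1] if len(results) > 1 else 0.0
    confident = top > 0.9 or (top - runner_up) > 0.2
    if confident or turns_so_far >= max_turns:
        return ("show_results", [name for name, _ in results[:5]])
    return ("ask_clarification", None)

results = [("ssh_connect", 0.62), ("ssh_options_set", 0.58)]
print(choose_action(results, turns_so_far=1))  # close scores -> ask a question
```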
... 4) Comment Usage: Researchers also pay attention to how to utilize high-quality comments. Eberhart et al. [30] designed an automated approach to extract summary descriptions of subroutines from unstructured code comments. Tan et al. [31] extracted interrupt related annotations from code and comments for detecting operating system concurrency bugs. ...
Conference Paper
Full-text available
Code comments are key to program comprehension. When they are not consistent with the code, maintenance is hindered. Yet developers often forget to update comments along with their code evolution. With recent advances in neural machine translation, the research community is contemplating novel approaches for automatically generating up-to-date comments following code changes. CUP is one such state-of-the-art approach, whose promising performance, however, remains to be comprehensively assessed. Our study contributes to the literature by performing an in-depth analysis of the effectiveness of CUP. Our analysis revealed that the overall effectiveness of CUP is largely contributed by its success in updating comments via a single token change (96.6%). Several update failures occur when CUP ignores some code change information (10.4%) or when it is otherwise misled by additional information (12.8%). To put the achievements of CUP in perspective, we implement HebCup, a straightforward heuristic-based approach for code comment update. Building on our observations of CUP's successful and failure cases, we design heuristics for focusing the update on the changed code and for performing token-level comment update. HebCup is shown to outperform CUP in terms of Accuracy by more than 60% while being over three orders of magnitude (i.e., 1700 times) faster. Further empirical analysis confirms that HebCup does not overfit to the empirical analysis set. Overall, with this study, we call for more research in deep learning based comment update towards achieving state-of-the-art performance that would be unreachable by other, less sophisticated techniques.
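In the spirit of the heuristic approach this abstract describes, but far simpler than the published HebCup, a token-level comment update could swap an identifier changed in the code into the comment. The example data and the one-to-one swap rule are invented for illustration.

```python
def update_comment(comment, old_code_tokens, new_code_tokens):
    """Simplified token-level heuristic: when the code change swaps one
    token for another, apply the same swap inside the comment."""
    removed = [t for t in old_code_tokens if t not in new_code_tokens]
    added = [t for t in new_code_tokens if t not in old_code_tokens]
    words = comment.split()
    for old, new in zip(removed, added):
        words = [new if w == old else w for w in words]
    return " ".join(words)

old_code = ["return", "max", "(", "values", ")"]
new_code = ["return", "min", "(", "values", ")"]
print(update_comment("returns the max of the values", old_code, new_code))
# -> returns the min of the values
```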
... splitting by project, removing auto-generated code, minimum/maximum summary lengths, and other quality filters). We use a model published by Eberhart et al. [67] to extract the summary from comments for C/C++ functions, since C/C++ tends not to have the same rules for writing subroutine summary documentation as Java. ...
Preprint
Source code summarization is the task of creating short, natural language descriptions of source code. Code summarization is the backbone of much software documentation such as JavaDocs, in which very brief comments such as "adds the customer object" help programmers quickly understand a snippet of code. In recent years, automatic code summarization has become a high value target of research, with approaches based on neural networks making rapid progress. However, as we will show in this paper, the production of good summaries relies on the production of the action word in those summaries: the meaning of the example above would be completely changed if "removes" were substituted for "adds." In this paper, we advocate for a special emphasis on action word prediction as an important stepping stone problem towards better code summarization -- current techniques try to predict the action word along with the whole summary, and yet action word prediction on its own is quite difficult. We show the value of the problem for code summaries, explore the performance of current baselines, and provide recommendations for future research.
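As a small illustration of why action word prediction matters for evaluation, the sketch below compares only the leading (action) word of predicted and reference summaries. The assumption that the action word is the first token, and the example pairs, are for illustration only.

```python
def action_word(summary):
    """Take the first token of a summary as its action word, e.g. 'adds'."""
    tokens = summary.lower().split()
    return tokens[0] if tokens else ""

pairs = [  # (predicted summary, reference summary)
    ("adds the customer object", "adds a customer record"),
    ("removes the customer object", "adds a customer record"),
    ("returns the file size", "gets the size of the file"),
]

correct = sum(action_word(pred) == action_word(ref) for pred, ref in pairs)
print(f"action-word accuracy: {correct}/{len(pairs)}")
# exact matching misses 'returns' vs. 'gets', even though they are near-synonyms
```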