Article (PDF available)

Measuring Program Comprehension: A Large-Scale Field Study with Professionals


Abstract

During software development and maintenance, developers spend a considerable amount of time on program comprehension. Previous studies show that program comprehension can take up as much as half of a developer's time. However, most of these studies were performed in controlled settings or with small numbers of participants, and they investigated comprehension activities only within IDEs, even though developers' program comprehension goes well beyond their IDE interactions. In this paper, we extend our ActivitySpace framework to collect and analyze Human-Computer Interaction (HCI) data across many applications, not just IDEs. Following Minelli et al.'s approach, we assign developers' activities to four categories: navigation, editing, comprehension, and other. We then measure comprehension time as the time developers spend on program comprehension activities, e.g., inspecting the console and breakpoints in the IDE, or reading and understanding tutorials in web browsers. This approach enables a more realistic investigation of program comprehension: a field study of comprehension in practice across seven real projects, involving 78 professional developers and 3,148 working hours, leveraging interaction data collected across the many applications those developers use. We find that, on average, developers spend ∼58% of their time on program comprehension activities, and that they frequently use web browsers and document editors for these activities. We also investigate the impact of programming language, developer experience, and project phase on the time spent on program comprehension, and we find that senior developers spend a significantly smaller percentage of their time on comprehension than junior developers do. Our study also highlights several research directions that could reduce program comprehension time, e.g., automatic detection and improvement of low-quality code and documentation, construction of software-engineering-specific search engines, and better IDEs that help developers navigate code and browse information more efficiently.
Published in IEEE Transactions on Software Engineering, July 2017, Volume PP, Issue 99, Pages 1–26
http://doi.org/10.1109/TSE.2017.2734091
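
To illustrate the measurement described in the abstract, here is a minimal sketch of how interaction events might be binned into Minelli et al.'s four activity categories and a comprehension-time share computed. The event names and the category mapping are hypothetical; the paper's ActivitySpace taxonomy is far richer.

```python
from datetime import datetime

# Hypothetical mapping from interaction event kinds to Minelli et al.'s
# four activity categories; the real ActivitySpace taxonomy is richer.
CATEGORY_OF = {
    "inspect_console": "comprehension",
    "inspect_breakpoint": "comprehension",
    "read_tutorial_in_browser": "comprehension",
    "edit_code": "editing",
    "scroll_file": "navigation",
    "switch_window": "other",
}

def comprehension_ratio(events):
    """events: list of (kind, start, end) tuples ordered by time."""
    totals = {"navigation": 0.0, "editing": 0.0, "comprehension": 0.0, "other": 0.0}
    for kind, start, end in events:
        category = CATEGORY_OF.get(kind, "other")
        totals[category] += (end - start).total_seconds()
    spent = sum(totals.values())
    return totals["comprehension"] / spent if spent else 0.0

events = [
    ("edit_code", datetime(2017, 1, 1, 9, 0), datetime(2017, 1, 1, 9, 20)),
    ("inspect_console", datetime(2017, 1, 1, 9, 20), datetime(2017, 1, 1, 9, 50)),
]
print(f"comprehension share: {comprehension_ratio(events):.0%}")  # 60%
```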
[Figure: Distributions of effective working hours (0–100) across the seven studied projects (A–G), broken down by programming language (Java, C#), developer experience (Low, Medium, High), project phase (Maintenance, Development), and company (Hengtian, IGS).]


... Developers spend a lot of time understanding source code. Estimates of the average working time invested in source code comprehension range from 30 to 70% [27,65]. Accordingly, researchers have strong motivation to optimize this process. ...
Preprint
Full-text available
The relevance of code comprehension in a developer's daily work was recognized more than 40 years ago. Over the years, several studies have gathered evidence that developers do indeed invest a considerable amount of their daily work in code comprehension. Consequently, many studies were conducted to find out how developers could be supported during code comprehension and which code characteristics contribute to better comprehension. Today, such experiments are more common than ever. While this is great for advancing the field, the number of publications makes it difficult to keep an overview. Additionally, designing rigorous experiments with human participants is a challenging task, and the multitude of design decisions and options can make it difficult for researchers to select a suitable design. We therefore conducted a systematic mapping study of 95 source code comprehension experiments published between 1979 and 2019. By systematically structuring the design characteristics of code comprehension studies, we provide a basis for subsequent discussion of the huge diversity of design options in the face of a lack of basic research on their consequences and comparability. We describe what topics have been studied, as well as how these studies have been designed, conducted, and reported. Frequently chosen design options and deficiencies are pointed out. We conclude with five concrete action items that we as a research community should address moving forward to improve publications of code comprehension experiments.
... And the development and maintenance of software requires the participation of a large number of software developers. However, developing software is a costly process (Allamanis et al. 2018), and previous work (Xia et al. 2017) found that program comprehension can take up as much as half of a developer's time during software development and maintenance. Unfortunately, many projects in open-source software ecosystems (such as GitHub) have mismatched, missing, or outdated code comments due to tight project schedules and other reasons (Hu et al. 2018a). ...
Article
Full-text available
In open-source software ecosystems, the scale of source code is growing larger and larger, and developers often use various means (good code comments, method names, etc.) to make code easier to read and understand. However, high-quality code comments or method names are often unavailable due to tight project schedules or other reasons in open-source software ecosystems such as GitHub. Therefore, in this work, we use deep learning models to generate appropriate code comments or method names to support software development and maintenance, which requires a non-trivial understanding of the code. We propose a Graph neural network enhanced Transformer model (GTrans for short) that learns code representations to understand code better. Specifically, GTrans learns code representations from both code sequences and code graphs. We use a Transformer encoder to capture a global representation from the code sequence and a graph neural network (GNN) encoder to focus on local details in the code graph, and then a decoder combines the global and local representations through an attention mechanism. We use three public datasets collected from GitHub to evaluate our model. In an extensive evaluation, we show that GTrans outperforms state-of-the-art models by up to a 3.8% increase in METEOR on code comment generation, and by margins of 5.8%–9.4% in ROUGE on method name generation after some adjustments to the structure. Empirically, we find that the method name generation task depends more on local information than global information, while the code comment generation task shows the opposite. Our data and code are available at https://github.com/zc-work/GTrans.
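
As a rough illustration of the fusion idea in this abstract, the following is a toy PyTorch sketch in which a decoder attends jointly over a global (sequence) encoding and a local (graph) encoding. It is not GTrans itself: the dimensions, names, and random stand-in encodings are all assumptions, and a real GNN encoder would replace the random "graph" tensor.

```python
import torch
import torch.nn as nn

class FusionDecoderLayer(nn.Module):
    """Toy stand-in for GTrans-style fusion: the decoder state attends
    jointly over global (sequence) and local (graph) encodings."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_state, global_enc, local_enc):
        # Concatenate both memories so the attention weights decide how much
        # global vs. local information flows into each decoder position.
        memory = torch.cat([global_enc, local_enc], dim=1)
        fused, _ = self.attn(decoder_state, memory, memory)
        return fused

batch, d = 2, 256
layer = FusionDecoderLayer(d_model=d)
dec = torch.randn(batch, 10, d)       # decoder positions
seq = torch.randn(batch, 50, d)       # Transformer (global) encoding
graph = torch.randn(batch, 30, d)     # GNN (local) node encoding
print(layer(dec, seq, graph).shape)   # torch.Size([2, 10, 256])
```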
... Software comprehension is an imperative and indispensable prerequisite for software development activities such as maintenance, testing, and quality management (Krüger 2019; Xia et al. 2017). As a software system grows, its functionality and component interactions increase in size and complexity. ...
Article
Full-text available
Understanding software evolution is essential for software development tasks, including debugging, maintenance, and testing. As a software system evolves, it grows in size and becomes more complex, hindering its comprehension. Researchers have proposed several approaches for software quality analysis based on software metrics. One of the primary practices is predicting defects across software components in the codebase to improve agile product quality. While several software metrics exist, graph-based metrics have rarely been utilized in software quality analysis. In this paper, we explore recent network comparison advancements to characterize software evolution, focusing on aiding software metrics analysis and defect prediction. We support our approach with an automated tool named GraphEvoDef. In particular, GraphEvoDef provides three major contributions: (1) detecting and visualizing significant events in software evolution using call graphs, (2) extracting metrics that are suitable for software comprehension, and (3) detecting and estimating the number of defects in a given code entity (e.g., a class). One of our major findings is the usefulness of the Network Portrait Divergence metric, borrowed from the information theory domain, for aiding the understanding of software evolution. To validate our approach, we examined 29 different open-source Java projects from GitHub and then demonstrated the proposed approach on 9 use cases with defect data from the PROMISE dataset. We also trained and evaluated defect prediction models for both classification and regression tasks. Our proposed technique achieves an 18% reduction in mean square error and a 48% increase in squared correlation coefficient over state-of-the-art approaches in the defect prediction domain.
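
The core idea here, quantifying structural change between call-graph snapshots, can be approximated with a much simpler divergence. The sketch below compares the degree distributions of two snapshots with the Jensen-Shannon distance; this is a crude proxy for, not an implementation of, the Network Portrait Divergence used in the paper, and the random graphs stand in for real extracted call graphs.

```python
import networkx as nx
import numpy as np
from scipy.spatial.distance import jensenshannon

def degree_distribution(graph, max_degree):
    counts = np.zeros(max_degree + 1)
    for _, deg in graph.degree():
        counts[deg] += 1
    return counts / counts.sum()

def call_graph_divergence(g_old, g_new):
    """Jensen-Shannon distance between the degree distributions of two
    call-graph snapshots -- a crude proxy for structural change."""
    max_deg = max(max(d for _, d in g_old.degree()),
                  max(d for _, d in g_new.degree()))
    p = degree_distribution(g_old, max_deg)
    q = degree_distribution(g_new, max_deg)
    return jensenshannon(p, q)

release_1 = nx.erdos_renyi_graph(100, 0.05, seed=1)  # stand-ins for real
release_2 = nx.erdos_renyi_graph(100, 0.10, seed=2)  # extracted call graphs
print(f"structural change: {call_graph_divergence(release_1, release_2):.3f}")
```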
... Code summarization is a widely studied problem in software engineering. Developers spend around 59% of their time on activities somewhat relevant to program comprehension [35], and good comments can ease the development and maintenance process by helping developers more quickly understand the meaning of code under maintenance [30]. However, misaligned and outdated comments are prevalent in SE projects. ...
Preprint
Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific: vocabulary and other phenomena vary substantially with each project. Thus, training on project-specific data and testing on the same project is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model specifically designed to be sample-efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting), and a maximalist hybrid approach that fine-tunes first on many projects in many languages and then trains on the same project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state of the art on many different projects in both Java and Python.
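
The time-series evaluation setting mentioned above can be sketched as a simple chronological split, so the test set is strictly in the future of the training set. The sample structure and field names below are hypothetical.

```python
def chronological_split(samples, train_fraction=0.8):
    """Split samples by timestamp so the test set is strictly in the
    future of the training set, preventing training-test leakage."""
    ordered = sorted(samples, key=lambda s: s["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

samples = [
    {"timestamp": "2021-03-01", "code": "...", "label": "..."},
    {"timestamp": "2021-01-15", "code": "...", "label": "..."},
    {"timestamp": "2021-06-30", "code": "...", "label": "..."},
]
train, test = chronological_split(samples, train_fraction=0.67)
print([s["timestamp"] for s in train])  # ['2021-01-15', '2021-03-01']
print([s["timestamp"] for s in test])   # ['2021-06-30']
```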
... As the complexity of software projects and the frequency of software product iterations continue to increase, program comprehension is becoming more important throughout the software development process. As recently shown by Xia et al. [47], 58% of developers' time is spent comprehending code. In addition to the code itself, code comments are considered the most important form of documentation for program comprehension [5]. ...
Preprint
When changing code, developers sometimes neglect to update the related comments, leaving inconsistent or outdated comments. Such comments increase the cost of program understanding and greatly reduce software maintainability. Researchers have put forward solutions such as CUP and HEBCUP, which update comments efficiently for simple code changes (i.e., modification of a single token) but are not good enough for complex ones. In this paper, we propose an approach named HatCUP (Hybrid Analysis and Attention based Comment UPdater) that provides a new mechanism for the comment updating task. HatCUP relies on hybrid analysis and hybrid information. First, HatCUP considers code structure change information and introduces a structure-guided attention mechanism combined with code change graph analysis and optimistic data flow dependency analysis. With a generally popular RNN-based encoder-decoder architecture, HatCUP takes the code edit actions, the syntactic, semantic, and structural code changes, and the old comments as inputs, and generates a structural representation of the changes in the current code snippet. Furthermore, instead of directly generating new comments, HatCUP proposes an edit-or-non-edit mechanism to mimic human editing behavior, generating a sequence of edit actions and constructing a modified RNN model to integrate newly developed components. Evaluation on a popular dataset demonstrates that HatCUP outperforms the state-of-the-art deep-learning-based approach (CUP) by 53.8% in accuracy, 31.3% in recall, and 14.3% in METEOR on the original metrics. Compared with the heuristic-based approach (HEBCUP), HatCUP also shows better overall performance.
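
To make the edit-action idea concrete, here is a minimal sketch that derives a token-level keep/delete/insert script between an old and an updated comment using Python's difflib. This illustrates the kind of action sequence an edit-based updater predicts; it is not HatCUP's actual model.

```python
import difflib

def edit_actions(old_comment, new_comment):
    """Derive a token-level edit script (keep/delete/insert) between an
    old and an updated comment -- the kind of action sequence an
    edit-based updater predicts instead of generating text from scratch."""
    old, new = old_comment.split(), new_comment.split()
    actions = []
    matcher = difflib.SequenceMatcher(a=old, b=new)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            actions += [("keep", tok) for tok in old[i1:i2]]
        else:  # 'replace', 'delete', or 'insert'
            actions += [("delete", tok) for tok in old[i1:i2]]
            actions += [("insert", tok) for tok in new[j1:j2]]
    return actions

print(edit_actions("returns the sum of a and b",
                   "returns the product of a and b"))
```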
Article
Context: Source code summarization is a crucial yet far-from-settled task of describing structured code snippets in natural language. High-quality code summaries can effectively facilitate program comprehension and software maintenance. A good code summary should have the following characteristics: complete information, correct meaning, and consistent description. In recent years, numerous approaches have been proposed for code summarization, but it remains very challenging to automatically learn the complex semantics of source code and generate complete, correct, and consistent code summaries. Objective: In this paper, we propose KGCodeSum, a novel keyword-guided abstractive code summarization approach that incorporates structural and contextual information. Methods: To improve summary quality, we leverage both the structural information embedded in the code itself and the contextual information from related code snippets. Meanwhile, we use keywords to guide summary generation, guaranteeing that code summaries contain the key information. Finally, we propose a new dynamic vocabulary strategy that effectively resolves the UNK problem in code summaries. Results: Through our evaluation on large-scale benchmark datasets with 2.1 million Java method-comment pairs and 1.1 million C/C++ function-summary pairs, we observed that our approach generates better code summaries than existing state-of-the-art approaches in terms of completeness, correctness, and consistency. In addition, we find that incorporating the dynamic vocabulary strategy into our approach significantly saves time and space during model training. Conclusion: Our KGCodeSum approach can effectively generate code summaries.
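
The dynamic vocabulary strategy mentioned in this abstract is in the spirit of pointer-generator copying. Below is a minimal sketch of per-example dynamic ids for out-of-vocabulary source tokens; the base vocabulary and tokens are toy assumptions, not KGCodeSum's actual implementation.

```python
BASE_VOCAB = {"<unk>": 0, "returns": 1, "the": 2, "of": 3}

def encode_with_dynamic_vocab(tokens, base_vocab):
    """Assign temporary ids to out-of-vocabulary source tokens so the
    decoder can copy them into the summary instead of emitting <unk>."""
    dynamic = {}
    ids = []
    for tok in tokens:
        if tok in base_vocab:
            ids.append(base_vocab[tok])
        else:
            if tok not in dynamic:
                dynamic[tok] = len(base_vocab) + len(dynamic)
            ids.append(dynamic[tok])
    return ids, dynamic

ids, dynamic = encode_with_dynamic_vocab(
    ["returns", "the", "checksum", "of", "payload"], BASE_VOCAB)
print(ids)      # [1, 2, 4, 3, 5]
print(dynamic)  # {'checksum': 4, 'payload': 5}
```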
Article
Software comments are sometimes not updated in sync when the associated code is changed. The inconsistency between code and comments may mislead developers and result in future bugs. Thus, studies on code-comment synchronization, which aims to automatically synchronize comments with code changes, have become highly important. Existing code-comment synchronization approaches mainly fall into two types: (1) deep-learning-based (e.g., CUP), and (2) heuristic-based (e.g., HebCUP). The former constructs a neural machine translation-structured semantic model, which generalizes better when synchronizing comments as software evolves and grows. The latter designs a series of rules for performing token-level replacements on old comments, and can generate completely correct comments for samples fully covered by its finely designed heuristic rules. In this article, we propose a composite approach named CBS (Classifying Before Synchronizing) to further improve code-comment synchronization performance; it combines the advantages of CUP and HebCUP with the assistance of inferred categories of Code-Comment Inconsistent (CCI) samples. Specifically, we first define two categories for CCI samples (heuristic-prone and non-heuristic-prone) and propose five features to assist category prediction. Samples whose comments can be correctly synchronized by HebCUP are heuristic-prone, while the others are non-heuristic-prone. CBS then employs our proposed Multi-Subsets Ensemble Learning (MSEL) classification algorithm to alleviate the class imbalance problem and construct the category prediction model. Next, CBS uses the trained MSEL to predict the category of each new sample. If the predicted category is heuristic-prone, CBS employs HebCUP to conduct the code-comment synchronization for the sample; otherwise, CBS allocates CUP to handle it. Our extensive experiments demonstrate that CBS statistically significantly outperforms CUP and HebCUP, obtaining average improvements of 23.47%, 22.84%, 3.04%, 3.04%, 1.64%, and 19.39% in Accuracy, Recall@5, Average Edit Distance (AED), Relative Edit Distance (RED), BLEU-4, and Effective Synchronized Sample (ESS) ratio, respectively, which highlights that category prediction for CCI samples can boost code-comment synchronization performance.
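
The overall control flow that CBS describes is easy to picture as a dispatcher. The sketch below uses toy stand-ins for the classifier, HebCUP, and CUP, so every callable and field name here is an assumption, not the paper's implementation.

```python
def synchronize_comment(sample, classifier, hebcup, cup):
    """Classify-before-synchronizing dispatch: route heuristic-prone
    samples to the rule-based updater, everything else to the neural one.
    `classifier`, `hebcup`, and `cup` are placeholder callables here."""
    category = classifier(sample)  # 'heuristic-prone' or 'non-heuristic-prone'
    if category == "heuristic-prone":
        return hebcup(sample)
    return cup(sample)

# Toy stand-ins for the real components:
toy_classifier = lambda s: ("heuristic-prone" if s["single_token_change"]
                            else "non-heuristic-prone")
toy_hebcup = lambda s: s["old_comment"].replace("sum", "product")
toy_cup = lambda s: "<neural model output>"

sample = {"single_token_change": True, "old_comment": "returns the sum"}
print(synchronize_comment(sample, toy_classifier, toy_hebcup, toy_cup))
```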
Article
Automatic code documentation generation has been a crucial task in the field of software engineering. It not only relieves developers from writing code documentation but also helps them to understand programs better. In particular, deep-learning-based techniques that leverage large-scale source code corpora have been widely used for code documentation generation. These works tend to use automatic metrics (such as BLEU, METEOR, ROUGE, CIDEr, and SPICE) to evaluate different models. These metrics compare generated documentation to reference texts by measuring their overlapping words. Unfortunately, there is no evidence demonstrating the correlation between these metrics and human judgment. We conduct experiments on two popular code documentation generation tasks, code comment generation and commit message generation, to investigate the presence or absence of correlations between these metrics and human judgments. For each task, we replicate three state-of-the-art approaches, and the generated documentation is evaluated automatically in terms of BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. We also ask 24 participants to rate the generated documentation on three aspects (i.e., language, content, and effectiveness). Each participant is given Java methods or commit diffs along with the target documentation to be rated. The results show that the ranking of generated documentation produced by the automatic metrics differs from that produced by the human annotators. Thus, these automatic metrics are not reliable enough to replace human evaluation for code documentation generation tasks. In addition, METEOR shows the strongest correlation to the human evaluation metrics (a moderate Pearson correlation r of about 0.7). However, this is still much lower than the correlation observed between different annotators (a high Pearson correlation r of about 0.8) and the correlations reported in the literature for other tasks (e.g., Neural Machine Translation [39]).
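
The metric-human correlation analysis described above boils down to a Pearson correlation between per-document metric scores and mean human ratings. A minimal sketch with made-up scores follows.

```python
from scipy.stats import pearsonr

# Hypothetical per-document scores: one automatic metric (e.g., METEOR)
# and the mean human rating for the same generated documentation.
meteor_scores = [0.31, 0.45, 0.12, 0.58, 0.40, 0.22]
human_ratings = [3.2, 4.1, 2.0, 4.5, 3.8, 2.9]

r, p_value = pearsonr(meteor_scores, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```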
Article
Full-text available
Descriptive method names have a great impact on improving program readability and facilitating software maintenance. Recently, due to the high similarity between the task of method naming and text summarization, a large amount of research based on natural language processing has been conducted to generate method names. However, method names are much shorter than long source code sequences: the salient information of the whole code snippet accounts for a relatively small part. Additionally, unlike natural language, source code has complicated structural information. Thus, modelling the salient information in highly structured input presents a great challenge. To tackle this problem, we propose a graph neural network (GNN)-based model with a novel salient information selection layer. Specifically, to comprehensively encode the tokens of the source code, we employ a GNN-based encoder, which can be applied directly to the code graph to ensure that the syntactic information of the code structure and the semantic information of the code sequence are modelled sufficiently. To effectively discriminate the salient information, we introduce an information selection layer with two parts: a global filter gate used to filter irrelevant information, and a semantic-aware convolutional layer used to focus on the semantic information contained in the code sequence. To improve the precision of the copy mechanism when decoding, we introduce a salient-feature-enhanced attention mechanism that facilitates accurately copying tokens from the input. Experimental results on an open-source dataset indicate that our proposed model, equipped with the salient information selection layer, effectively improves method naming performance compared to other state-of-the-art models.
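
A toy rendering of the two-part selection layer (global filter gate plus semantic-aware convolution) is sketched below in PyTorch. All shapes and hyperparameters are assumptions, and this illustrates the general gating idea rather than the paper's model.

```python
import torch
import torch.nn as nn

class SalientSelection(nn.Module):
    """Toy version of an information selection layer: a sigmoid gate
    filters irrelevant token encodings, then a 1-D convolution picks up
    local semantic patterns along the token sequence."""
    def __init__(self, d_model=128, kernel_size=3):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):                        # x: (batch, seq, d_model)
        gated = x * torch.sigmoid(self.gate(x))  # global filter gate
        y = self.conv(gated.transpose(1, 2))     # conv over the sequence
        return y.transpose(1, 2)

x = torch.randn(2, 40, 128)
print(SalientSelection()(x).shape)  # torch.Size([2, 40, 128])
```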
Article
Full-text available
Recent years have witnessed an increasing emphasis on human aspects in software engineering research and practice. Our survey of existing studies on human aspects in software engineering shows that screen-captured videos have been widely used to record developers' behavior and study software engineering practices. Screen-captured videos provide direct information about which software tools developers interact with and which content they access or generate during a task. Such Human-Computer Interaction (HCI) data can help researchers and practitioners understand and improve software engineering practices from a human perspective. However, extracting time-series HCI data from screen-captured task videos requires manual transcribing and coding of the videos, which is tedious and error-prone. In this paper, we report a formative study to understand the challenges in manually transcribing screen-captured videos into time-series HCI data. We then present a computer-vision-based video scraping technique to automatically extract time-series HCI data from screen-captured videos. We also present a case study of our scvRipper tool, which implements the video scraping technique, using 29 hours of task videos of 20 developers in two development tasks. The case study not only evaluates the runtime performance and robustness of the tool, but also quantitatively analyzes in detail the tool's ability to extract time-series HCI data from screen-captured task videos. We also study developers' micro-level behavior patterns in software development based on this quantitative analysis.
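
One building block of such video scraping is detecting whether a known UI element is visible in a frame. Here is a minimal OpenCV template-matching sketch; the file names and threshold are hypothetical, and scvRipper's actual pipeline is considerably more elaborate.

```python
import cv2

def element_visible(frame_gray, template_gray, threshold=0.9):
    """Return True if a known UI element (e.g., an IDE toolbar icon)
    appears in a screen-captured frame, via normalized template matching."""
    scores = cv2.matchTemplate(frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_score, _, _ = cv2.minMaxLoc(scores)
    return max_score >= threshold

# Hypothetical file names for a captured frame and a cropped UI template.
frame = cv2.imread("frame_00421.png", cv2.IMREAD_GRAYSCALE)
icon = cv2.imread("ide_run_button.png", cv2.IMREAD_GRAYSCALE)
print("IDE visible:", element_visible(frame, icon))
```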
Article
Full-text available
Turnover is the phenomenon of continuous influx and retreat of human resources in a team. Despite being well studied in many settings, turnover has not been characterized for open-source software projects. We study the source code repositories of five open-source projects to characterize patterns of turnover and to determine the effects of turnover on software quality. We define the base concepts of external and internal turnover: the mobility of developers in and out of a project, and the mobility of developers inside a project, respectively. We provide a qualitative analysis of turnover patterns. In a quantitative analysis, we also found that the activity of external newcomers negatively impacts software quality.
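
External turnover as defined here can be computed from the commit author sets of consecutive time windows. The sketch below is a simplified version of that bookkeeping, with made-up author names.

```python
def external_turnover(authors_before, authors_after):
    """Simplified external turnover between two consecutive time windows:
    developers who left the project and developers who joined it."""
    left = authors_before - authors_after
    joined = authors_after - authors_before
    return left, joined

q1 = {"alice", "bob", "carol"}          # commit authors in window 1
q2 = {"bob", "carol", "dave", "erin"}   # commit authors in window 2
left, joined = external_turnover(q1, q2)
print("left:", left)      # {'alice'}
print("joined:", joined)  # {'dave', 'erin'}
```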
Conference Paper
The utility of source code, as of other knowledge artifacts, is predicated on the existence of individuals skilled enough to derive value by using or improving it. Developers leaving a software project deprive the project of the knowledge of the decisions they have made. Previous research shows that the survivors and newcomers maintaining abandoned code have reduced productivity and are more likely to make mistakes. We focus on quantifying the extent of abandoned source files and adapt methods from financial risk analysis to assess the susceptibility of the project to developer turnover. In particular, we measure the historical loss distribution and find (1) that projects are susceptible to losses that are more than three times larger than the expected loss. Using historical simulations we find (2) that projects are susceptible to large losses that are over five times larger than the expected loss. We use Monte Carlo simulations of disaster loss scenarios and find (3) that simplistic estimates of the 'truck factor' exaggerate the potential for loss. To mitigate loss from developer turnover, we modify Cataldo et al.'s coordination requirements matrices. We find (4) that we can recommend the correct successor 34% to 48% of the time. We also find that having successors reduces the expected loss by as much as 15%. Our approach helps large projects assess the risk of turnover thereby making risk more transparent and manageable.
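
A Monte Carlo disaster-loss simulation of the kind described above can be sketched in a few lines: each developer independently leaves with some probability, and the loss is the share of files only they maintain. The ownership shares and departure probability below are invented for illustration; the paper's risk model is more detailed.

```python
import random

# Hypothetical fraction of files each developer exclusively maintains,
# and a uniform departure probability per period.
EXCLUSIVE_OWNERSHIP = {"alice": 0.30, "bob": 0.25, "carol": 0.10, "dave": 0.05}
DEPARTURE_PROB = 0.15

def simulate_losses(trials=100_000, seed=42):
    """Monte Carlo estimate of the distribution of code 'abandoned'
    when developers independently leave within a period."""
    rng = random.Random(seed)
    losses = []
    for _ in range(trials):
        loss = sum(share for share in EXCLUSIVE_OWNERSHIP.values()
                   if rng.random() < DEPARTURE_PROB)
        losses.append(loss)
    return losses

losses = simulate_losses()
expected = sum(losses) / len(losses)
tail = sorted(losses)[int(0.99 * len(losses))]  # 99th-percentile loss
print(f"expected loss: {expected:.3f}, 99th percentile: {tail:.3f}")
```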
Article
As code search is a frequent developer activity in software development practice, improving the performance of code search is a critical task. In the text-retrieval-based search techniques employed in code search, the term mismatch problem is a critical language issue for retrieval effectiveness. By reformulating queries, query expansion provides effective ways to solve the term mismatch problem. In this paper, we propose Query Expansion based on Crowd Knowledge (QECK), a novel technique to improve the performance of code search algorithms. QECK identifies software-specific expansion words from high-quality pseudo-relevance feedback question-and-answer pairs on Stack Overflow to automatically generate expanded queries. Furthermore, we incorporate QECK into the classic Rocchio model and propose the QECK-based code search method QECKRocchio. We conduct three experiments to evaluate our QECK technique and investigate QECKRocchio on a large-scale corpus containing real-world code snippets and a question-and-answer pair collection. The results show that QECK improves the performance of three code search algorithms by up to 64 percent in Precision and 35 percent in NDCG. Meanwhile, compared with the state-of-the-art query expansion method, the improvement of QECKRocchio is 22 percent in Precision and 16 percent in NDCG.
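
The classic Rocchio reformulation that QECK builds on has a simple closed form. A minimal sketch over toy term-frequency vectors follows; the non-relevant term is omitted, and in QECK the feedback documents would be high-quality Stack Overflow Q&A pairs rather than these toy vectors.

```python
import numpy as np

def rocchio(query_vec, relevant_docs, alpha=1.0, beta=0.75):
    """Classic Rocchio reformulation (non-relevant term omitted):
    q' = alpha * q + beta * mean(relevant document vectors)."""
    return alpha * query_vec + beta * np.mean(relevant_docs, axis=0)

# Toy term space: [search, code, query, expansion].
query = np.array([1.0, 1.0, 0.0, 0.0])
feedback = np.array([[0.0, 1.0, 1.0, 0.0],
                     [0.0, 1.0, 0.0, 1.0]])
print(rocchio(query, feedback))  # [1.    1.75  0.375 0.375]
```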
Conference Paper
The number of software engineering research papers over the last few years has grown significantly. An important question here is: how relevant is software engineering research to practitioners in the field? To address this question, we conducted a survey at Microsoft in which we invited 3,000 industry practitioners to rate the relevance of research ideas contained in 571 ICSE, ESEC/FSE, and FSE papers published over a five-year period. We received 17,913 ratings by 512 practitioners, who labelled ideas as essential, worthwhile, unimportant, or unwise. The results from the survey suggest that practitioners are positive towards studies done by the software engineering research community: 71% of all ratings were essential or worthwhile. We found no correlation between the citation counts and the relevance scores of the papers. Through a qualitative analysis of free-text responses, we identify several reasons why practitioners considered certain research ideas to be unwise. The survey approach described in this paper is lightweight: on average, a participant spent only 22.5 minutes responding to the survey. At the same time, the results can provide useful insight to conference organizers, authors, and participating practitioners.