Article

Explaining Explanations: An Empirical Study of Explanations in Code Reviews


Abstract

Code reviews are central to software quality assurance. Ideally, reviewers should explain their feedback to enable authors of code changes to understand the feedback and act accordingly. Different developers might need different explanations in different contexts. Therefore, assisting this process first requires understanding the types of explanations reviewers usually provide. The goal of this paper is to study the types of explanations used in code reviews and explore the potential of Large Language Models (LLMs), specifically ChatGPT, in generating these specific types. We extracted 793 code review comments from Gerrit and manually labeled them based on whether they contained a suggestion, an explanation, or both. Our analysis shows that 42% of comments only include suggestions without explanations. We categorized the explanations into seven distinct types including rule or principle, similar examples, and future implications. When measuring their prevalence, we observed that some explanations are used differently by novice and experienced reviewers. Our manual evaluation shows that, when the explanation type is specified, ChatGPT can correctly generate the explanation in 88 out of 90 cases. This foundational work highlights the potential for future automation in code reviews, which can assist developers in sharing and obtaining different types of explanations as needed, thereby reducing back-and-forth communication.
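The abstract notes that ChatGPT can generate an explanation of a requested type when the type is specified. As a rough illustration of what such type-conditioned prompting could look like (the prompt wording, the helper function, and any types beyond the three named in the abstract are assumptions, not the authors' protocol), consider this minimal Python sketch:

```python
# Hypothetical sketch: conditioning an LLM prompt on an explanation type.
EXPLANATION_TYPES = [
    "rule or principle",
    "similar examples",
    "future implications",
    # the paper reports seven types in total; only the three named in the
    # abstract are listed here
]

def build_prompt(review_suggestion: str, explanation_type: str) -> str:
    """Compose a prompt asking for one specific type of explanation."""
    return (
        "You are a code reviewer. The following review suggestion was given:\n"
        f"{review_suggestion}\n"
        f"Explain this suggestion using a '{explanation_type}' explanation."
    )

print(build_prompt("Rename this variable to reflect its purpose.",
                   EXPLANATION_TYPES[0]))
```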


... Identifies review comments which need further explanations to be properly understood by the contributor [83,84]. ...
... Widyasari et al. [84] exploit prompt engineering techniques in ChatGPT to improve review comments that need further explanations. Given the recent rise in the capabilities of general-purpose LLMs and their applicability to software-related tasks, we expect more and more code review automation techniques to rely on LLM prompt engineering rather than on fine-tuned models. ...
... [Table fragment: the citing survey tabulates limitations of code review automation techniques, including limited support in specific scenarios, lack of generalizability across different datasets, unsatisfactory recommendations, noise in training data, bias in recommendations, and suboptimal evaluation metrics.] There are also limitations related to the applicability of these techniques to "previously unseen data". For example, retrieving reviews from the past does not allow the approach to generate previously unseen review comments [48], something doable nowadays by training DL models. ...
Preprint
Code review consists of assessing the code written by teammates with the goal of increasing code quality. Empirical studies have documented the benefits brought by this practice, which, however, comes at a cost in terms of developers' time. For this reason, researchers have proposed techniques and tools to automate code review tasks such as reviewer selection (i.e., identifying suitable reviewers for a given code change) or the actual review of a given change (i.e., recommending improvements to the contributor as a human reviewer would do). Given the substantial number of papers recently published on the topic, it may be challenging for researchers and practitioners to get a complete overview of the state of the art. We present a systematic literature review (SLR) featuring 119 papers concerning the automation of code review tasks. We provide: (i) a categorization of the code review tasks automated in the literature; (ii) an overview of the under-the-hood techniques used for the automation, including the datasets used for training data-driven techniques; (iii) publicly available techniques and datasets used for their evaluation, with a description of the evaluation metrics usually adopted for each task. The SLR concludes with a discussion of the current limitations of the state of the art and insights for future research directions.
... According to the results, suggestions are less likely to be used when looking for a suitable reviewer, when seeking an expert's help, or when the goal is to learn how code review is conducted in a project (22/92, 20/92, and 14/92 positive answers, respectively). We conclude that suggestions are used as an information source, mainly when looking for rationale or for a code example [35], [36]. ...
... The usage of suggestions varies by stakeholder role, with members focusing more on defects, likely due to differences in project experience and better knowledge of change impact. Furthermore, we found that code review suggestions serve as valuable educational tools, guiding newcomers and supporting knowledge sharing by serving as a searchable or recommendable information source [36]. ...
... Code suggestions were also consulted when looking for code examples, which were mentioned by Sadowski et al. [35] as being in high demand among developers. Rationale and code examples both figure among the categories of explanations presented by Widyasari et al. [36] in their taxonomy of explanations in code review comments. The authors recently studied the different types of explanations commonly used in Gerrit code review comments and categorized them into seven types, each reflecting how reviewers communicate feedback and attempt to clarify their reasoning to pull request authors. ...
Preprint
Full-text available
GitHub introduced the suggestion feature to enable reviewers to explicitly suggest code modifications in pull requests. These suggestions make the reviewers' feedback more actionable for the submitters and represent valuable knowledge for newcomers. Still, little is known about how code review suggestions are used by developers, what impact they have on pull requests, and how they are influenced by social coding dynamics. To bridge this knowledge gap, we conducted an empirical study on pull requests from 46 engineered GitHub projects in which developers used code review suggestions. We applied an open coding approach to uncover the types of suggestions and their usage frequency. We also mined pull request characteristics and assessed the impact of using suggestions on merge rate, resolution time, and code complexity. Furthermore, we conducted a survey with contributors of the studied projects to gain insights into the influence of social factors on the usage and acceptance of code review suggestions. We were able to uncover four suggestion types: code style suggestions, improvements, fixes, and documentation, with improvements being the most frequent. We found that the use of suggestions positively affects the merge rate of pull requests but significantly increases resolution time without leading to a decrease in code complexity. Our survey results show that suggestions are more likely to be used by reviewers when the submitter is a newcomer. The results also show that developers mostly search for suggestions when tracking rationale or looking for code examples. Our work provides insights on the usage of code suggestions and their potential as a knowledge sharing tool.
Article
Large Language Models (LLMs) have significantly impacted numerous domains, including Software Engineering (SE). Many recent publications have explored LLMs applied to various SE tasks. Nevertheless, a comprehensive understanding of the application, effects, and possible limitations of LLMs on SE is still in its early stages. To bridge this gap, we conducted a systematic literature review (SLR) on LLM4SE, with a particular focus on understanding how LLMs can be exploited to optimize processes and outcomes. We selected and analyzed 395 research papers from January 2017 to January 2024 to answer four key research questions (RQs). In RQ1, we categorize different LLMs that have been employed in SE tasks, characterizing their distinctive features and uses. In RQ2, we analyze the methods used in data collection, preprocessing, and application, highlighting the role of well-curated datasets in successfully applying LLMs to SE. RQ3 investigates the strategies employed to optimize and evaluate the performance of LLMs in SE. Finally, RQ4 examines the specific SE tasks where LLMs have shown success to date, illustrating their practical contributions to the field. From the answers to these RQs, we discuss the current state of the art and trends, identify gaps in existing research, and highlight promising areas for future study. Our artifacts are publicly available at https://github.com/xinyi-hou/LLM4SE_SLR.
Article
Code review has been a well-established quality assurance strategy in software engineering. Nowadays, code review is a tool-based practice, a potential scenario for tools based on generative artificial intelligence (AI) models, which have recently gained worldwide popularity. Initial evidence has shown these tools may support reviewers checking a code change and overcoming understanding barriers, among other possibilities. A way to map potential future directions of tool support is to understand practitioners’ perceptions. In this study, we investigate what has been discussed about generative AI in the code review context by performing a gray literature review. We analyzed 42 documents and found insights from practice and proposals of solutions using generative AI models. Our study provides evidence of code review practitioners actively exploring this new technology, especially ChatGPT. We also discuss three potential implications: saving time, understanding changes, and reviewability of code changes.
Article
Full-text available
Since its introduction in November 2022, ChatGPT has rapidly gained popularity due to its remarkable ability in language understanding and human-like responses. ChatGPT, based on the GPT-3.5 architecture, has shown great promise for revolutionizing various research fields, including code generation. However, the reliability and quality of code generated by ChatGPT remain unexplored, raising concerns about potential risks associated with the widespread use of ChatGPT-driven code generation. In this paper, we systematically study the quality of 4,066 ChatGPT-generated programs implemented in two popular programming languages, i.e., Java and Python, for 2,033 programming tasks. The goal of this work is threefold. First, we analyze the correctness of ChatGPT on code generation tasks and uncover the factors that influence its effectiveness, including task difficulty, programming language, the time that tasks were introduced, and program size. Second, we identify and characterize potential issues with the quality of ChatGPT-generated code. Last, we provide insights into how these issues can be mitigated. Experiments highlight that out of 4,066 programs generated by ChatGPT, 2,756 programs are deemed correct, 1,082 programs provide wrong outputs, and 177 programs contain compilation or runtime errors. Additionally, we further analyze other characteristics of the generated code through static analysis tools, such as code style and maintainability, and find that 1,930 ChatGPT-generated code snippets suffer from maintainability issues. Subsequently, we investigate ChatGPT's self-repairing ability and its interaction with static analysis tools to fix the errors uncovered in the previous step. Experiments suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement. Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of AI models like ChatGPT.
Article
Full-text available
Context: Due to the significant effort involved, even a minor improvement in the effectiveness of Code Reviews (CR) can yield significant savings for a software development organization. Objective: This study aims to develop a finer-grained understanding of what makes a code review comment useful to OSS developers, to what extent a code review comment is considered useful to them, and how various contextual and participant-related factors influence its degree of usefulness. Method: To this end, we conducted a three-stage mixed-methods study. We randomly selected 2,500 CR comments from the OpenDev Nova project and manually categorized the comments. We designed a survey of OpenDev developers to better understand their perspectives on useful CRs. Combining our survey-obtained scores with our manually labeled dataset, we trained two regression models: one to identify factors that influence the usefulness of CR comments, and the other to identify factors that improve the odds of 'Functional' defect identification over the others. Results: The results of our study suggest that a CR comment's usefulness is dictated not only by its technical contributions, such as defect findings or quality improvement tips, but also by its linguistic characteristics, such as comprehensibility and politeness. While a reviewer's coding experience is positively associated with CR usefulness, the number of mutual reviews, comment volume in a file, the total number of lines added/modified, and CR interval have the opposite associations. While authorship and reviewership experience for the files under review have been the most popular attributes for reviewer recommendation systems, we do not find any significant association of those attributes with CR usefulness. Conclusion: We recommend discouraging frequent code review associations between two individuals, as such associations may decrease CR usefulness. We also recommend authoring CR comments in a constructive and empathetic tone. As several of our results deviate from prior studies, we recommend more investigations to identify context-specific attributes for building reviewer recommendation models.
Conference Paper
Full-text available
Advances in natural language processing have resulted in large language models (LLMs) that can generate code and code explanations. In this paper, we report on our experiences generating multiple code explanation types using LLMs and integrating them into an interactive e-book on web software development. Three types of explanations were created: a line-by-line explanation, a list of important concepts, and a high-level summary of the code. Students could view explanations by clicking a button next to code snippets, which showed the explanation and asked about its utility. Our results show that all explanation types were viewed by students and that the majority of students perceived the code explanations as helpful to them. However, student engagement varied by code snippet complexity, explanation type, and code snippet length. Drawing on our experiences, we discuss future directions for integrating explanations generated by LLMs into CS classrooms.
Conference Paper
Full-text available
Background: Code reviewing is an essential part of software development to ensure software quality. However, the abundance of review tasks and the intensity of the workload for reviewers negatively impact the quality of the reviews. Short review texts are often unactionable. Aims: We propose the Example Driven Review Explanation (EDRE) method to facilitate the code review process by adding explanations through examples. EDRE recommends similar code reviews as examples to further explain a review and help a developer understand the received reviews with less communication overhead. Method: Through an empirical study in an industrial setting and by analyzing 3,722 code reviews across three open-source projects, we compared five methods of data retrieval, text classification, and text recommendation. Results: EDRE using TF-IDF word embeddings along with an SVM classifier can provide practical examples for each code review with a 92% F-score and 90% accuracy. Conclusions: Example-based explanation is an established method for assisting experts in explaining decisions. EDRE can accurately provide a set of context-specific examples to facilitate the code review process in software teams.
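A minimal sketch of the TF-IDF plus SVM text-classification pipeline that the abstract reports as EDRE's best configuration; the example comments, labels, and scikit-learn implementation are illustrative assumptions, not the study's dataset or exact setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder review comments and category labels (not the study's data).
review_texts = ["Please add a null check here.",
                "Extract this block into a helper method."]
labels = ["defect", "refactoring"]

# TF-IDF features feeding a linear SVM classifier.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(review_texts, labels)
print(clf.predict(["Missing null check before the dereference."]))
```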
Conference Paper
Full-text available
Code review is an effective quality assurance practice, but it can be labor-intensive since developers have to manually review the code and provide written feedback. Recently, a Deep Learning (DL)-based approach was introduced to automatically recommend code review comments based on changed methods. While the approach showed promising results, it requires expensive computational resources and time, which limits its use in practice. To address this limitation, we propose CommentFinder, a retrieval-based approach to recommend code review comments. Through an empirical evaluation of 151,019 changed methods, we evaluate the effectiveness and efficiency of CommentFinder against the state-of-the-art approach. We find that when recommending the best-1 review comment candidate, CommentFinder is 32% better than the prior work at recommending the correct code review comment. In addition, CommentFinder is 49 times faster than the prior work. These findings highlight that CommentFinder could help reviewers reduce manual effort by recommending code review comments, while requiring less computational time.
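The abstract describes CommentFinder as a retrieval-based alternative to DL models. Below is a toy retrieval sketch in that spirit, assuming a bag-of-words cosine similarity and invented example data; the tool's actual representation and distance measure may differ:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus of past changed methods and their review comments.
past_methods = ["int sum(int a,int b){return a+b;}",
                "void log(String m){System.out.println(m);}"]
past_comments = ["Consider overflow handling.", "Use a logger instead."]

vec = CountVectorizer(token_pattern=r"\w+").fit(past_methods)
query = "int add(int x,int y){return x+y;}"  # new changed method
sims = cosine_similarity(vec.transform([query]), vec.transform(past_methods))
print(past_comments[sims.argmax()])  # best-1 candidate comment
```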
Article
Full-text available
Pull-based development has enabled numerous volunteers to contribute to open-source projects with fewer barriers. Nevertheless, a considerable number of pull requests (PRs) with valid contributions are abandoned by their contributors, wasting the effort and time put in by both the contributors and maintainers. To better understand the underlying dynamics of contributor-abandoned PRs, we conduct a mixed-methods study using both quantitative and qualitative methods. We curate a dataset consisting of 265,325 PRs, including 4,450 abandoned ones, from ten popular and mature GitHub projects and measure 16 features characterizing PRs, contributors, review processes, and projects. Using statistical and machine learning techniques, we find that complex PRs, novice contributors, and lengthy reviews have a higher probability of abandonment, and the rate of PR abandonment fluctuates alongside the projects' maturity or workload. To identify why contributors abandon their PRs, we also manually examine a random sample of 354 abandoned PRs. We observe that the most frequent abandonment reasons are related to the obstacles faced by contributors, followed by the hurdles imposed by maintainers during the review process. Finally, we survey the top core maintainers of the studied projects to understand their perspectives on dealing with PR abandonment and on our findings.
Conference Paper
Full-text available
Code review is effective, but human-intensive (e.g., developers need to manually modify source code until it is approved). Recently, prior work proposed a Neural Machine Translation (NMT) approach to automatically transform source code to the version that is reviewed and approved (i.e., the after version). Yet, its performance is still suboptimal when the after version has new identifiers or literals (e.g., renamed variables) or has many code tokens. To address these limitations, we propose AutoTransform, which leverages a Byte-Pair Encoding (BPE) approach to handle new tokens and a Transformer-based NMT architecture to handle long sequences. We evaluate our approach based on 14,750 changed methods with and without new tokens for both small and medium sizes. The results show that when generating one candidate for the after version (i.e., beam width = 1), our AutoTransform can correctly transform 1,413 changed methods, which is 567% higher than the prior work, highlighting the substantial improvement of our approach for code transformation in the context of code review. This work contributes towards automated code transformation for code reviews, which could help developers reduce their effort in modifying source code during the code review process.
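AutoTransform relies on Byte-Pair Encoding so that unseen identifiers can be composed from frequent subword pairs. The following is a textbook BPE merge loop, shown as a minimal sketch rather than the paper's tokenizer; the corpus and frequencies are placeholders:

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Identifiers split into characters, mapped to their corpus frequency.
vocab = {("r", "e", "n", "a", "m", "e"): 3,
         ("r", "e", "n", "a", "m", "e", "d"): 2}
for _ in range(3):  # three merge rounds
    pair = most_frequent_pair(vocab)
    vocab = merge(vocab, pair)
    print("merged", pair, "->", list(vocab))
```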
Article
Full-text available
Peer code review is a widely adopted software engineering practice to ensure code quality and software reliability in both commercial and open-source software projects. Due to the large effort overhead associated with practicing code reviews, project managers often wonder if their code reviews are effective and if there are improvement opportunities in that respect. Since project managers at Samsung Research Bangladesh (SRBD) were also intrigued by these questions, this research developed, deployed, and evaluated a production-ready solution using the Balanced Scorecard (BSC) strategy that SRBD managers can use in their day-to-day management to monitor an individual developer's, a particular project's, or the entire organization's code review effectiveness. Following the four-step framework of the BSC strategy, we 1) defined the operational goals of this research, 2) defined a set of metrics to measure the effectiveness of code reviews, 3) developed an automated mechanism to measure those metrics, and 4) developed and evaluated a monitoring application to inform the key stakeholders. Our automated model to identify useful code reviews achieves 7.88% and 14.39% improvements in terms of accuracy and minority-class F1 score, respectively, over the models proposed in prior studies. It also outperforms the human evaluators from SRBD, whom the model replaces, by margins of 25.32% and 23.84%, respectively, in terms of accuracy and minority-class F1 score. In our post-deployment survey, SRBD developers and managers indicated that they found our solution useful and that it provided them with important insights to support their decision making.
Article
Full-text available
The game development industry is growing, and training new developers in game development-specific abilities is essential to satisfying its need for skilled game developers. These developers require effective learning resources to acquire the information they need and improve their game development skills. Question and Answer (Q&A) websites stand out as some of the most used online learning resources in software development. Many studies have investigated how Q&A websites help software developers become more experienced. However, no studies have explored Q&A websites aimed at game development, and there is little information about how game developers use and interact with these websites. In this paper, we study four Q&A communities by analyzing game development data we collected from their websites and the 347 responses received on a survey we ran with game developers. We observe that the communities have declined over the past few years and identify factors that correlate to these changes. Using a Latent Dirichlet Allocation (LDA) model, we characterize the topics discussed in the communities. We also analyze how topics differ across communities and identify the most discussed topics. Furthermore, we find that survey respondents have a mostly negative view of the communities and tended to stop using the websites once they became more experienced. Finally, we provide recommendations on where game developers should post their questions, which can help mitigate the websites’ declines and improve their effectiveness.
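The study characterizes discussion topics with a Latent Dirichlet Allocation (LDA) model. Here is a minimal sketch of that step using scikit-learn, with placeholder posts and an arbitrary topic count rather than the paper's configuration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical Q&A post titles standing in for the mined corpus.
posts = ["how to move a sprite in unity",
         "shader compile error on android build",
         "best pathfinding for grid based games"]
X = CountVectorizer(stop_words="english").fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X).round(2))  # per-document topic mixtures
```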
Conference Paper
Full-text available
Companies have adopted modern code review as a key technique for continuously monitoring and improving the quality of software changes. One of the main motivations for this is the early detection of design-impactful changes, to prevent design-degrading changes from prevailing after each code review. Even though design degradation symptoms often lead to change rejections, practices of modern code review alone are not sufficient to avoid or mitigate design decay. Software design degrades whenever one or more symptoms of poor structural decisions, usually represented by smells, end up being introduced by a change. Design degradation may be related to both technical and social aspects in collaborative code reviews. Unfortunately, there is no study that investigates whether code review stakeholders, e.g., reviewers, could benefit from approaches to distinguish and predict design-impactful changes with technical and/or social aspects. By analyzing 57,498 reviewed code changes from seven open-source systems, we report an investigation on the prediction of design-impactful changes in modern code review. We evaluated the use of six ML algorithms to predict design-impactful changes. We also extracted and assessed 41 different features based on both social and technical aspects. Our results show that Random Forest and Gradient Boosting are the best algorithms. We also observed that the use of technical features results in more precise predictions. However, the use of social features alone, which are available even before the code review starts (e.g., for team managers or change assigners), also leads to highly accurate predictions. Therefore, social and/or technical prediction models can be used to support further design inspection of suspicious changes early in a code review process. Finally, we provide an enriched dataset that allows researchers to investigate the context behind design-impactful changes during the code review process.
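Since the study found Random Forest and Gradient Boosting to be the best of the six ML algorithms evaluated, the following scikit-learn sketch shows the general shape of such a prediction setup; the feature names and values are invented placeholders, not the paper's 41 features:

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Columns: [lines_changed, num_reviewers, author_experience] (hypothetical).
X = [[120, 2, 5], [8, 1, 1], [300, 4, 10], [15, 1, 2]]
y = [1, 0, 1, 0]  # 1 = design-impactful change

for model in (RandomForestClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X, y)
    print(type(model).__name__, model.predict([[200, 3, 4]]))
```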
Article
Full-text available
Nowadays, with the rapid growth of open source software (OSS), library reuse has become more and more popular since a large number of third-party libraries are available to download and reuse. A deeper understanding of why developers reuse a library (i.e., replace self-implemented code with an external library) or re-implement a library (i.e., replace an imported external library with self-implemented code) could help researchers better understand the factors that developers are concerned with when reusing code. This understanding can then be used to improve existing libraries and API recommendation tools for researchers and practitioners by using the developer concerns identified in this study as design criteria. In this work, we investigated the reasons behind library reuse and re-implementation. To achieve this goal, we first crawled data from two popular sources, F-Droid and GitHub. Then, potential instances of library reuse and re-implementation were found automatically based on certain heuristics. Next, for each instance, we manually identified whether it was valid or not. For library re-implementation, we obtained 82 instances distributed across 75 repositories. We then conducted two types of surveys (i.e., an individual survey sent to the corresponding developers of the validated instances and another open survey) for library reuse and re-implementation. For the library reuse individual survey, we received 36 responses out of 139 contacted developers. For the re-implementation individual survey, we received 13 responses out of 71 contacted developers. In addition, we received 56 responses from the open survey. Finally, we performed qualitative and quantitative analysis on the survey responses and commit logs of the validated instances. The results suggest that library reuse occurs mainly because developers were initially unaware of the library or the library had not yet been introduced. Re-implementation occurs mainly because the used library method is only a small part of the library, the library dependencies are too complicated, or the library method is deprecated. Finally, based on all findings obtained from analyzing the surveys and commit messages, we provide a few suggestions to improve current library recommendation systems: tailored recommendations according to users' preferences, detection of external code that is similar to a part of the users' code (to avoid duplication or re-implementation), grouping similar recommendations for developers to compare and select the one they prefer, and warning against poor-quality libraries.
Conference Paper
Full-text available
From healthcare to criminal justice, artificial intelligence (AI) is increasingly supporting high-consequence human decisions. This has spurred the field of explainable AI (XAI). This paper seeks to strengthen empirical, application-specific investigations of XAI by exploring theoretical underpinnings of human decision making, drawing from the fields of philosophy and psychology. In this paper, we propose a conceptual framework for building human-centered, decision-theory-driven XAI based on an extensive review across these fields. Drawing on this framework, we identify pathways along which human cognitive patterns drive needs for building XAI and how XAI can mitigate common cognitive biases. We then put this framework into practice by designing and implementing an explainable clinical diagnostic tool for intensive care phenotyping and conducting a co-design exercise with clinicians. Thereafter, we draw insights into how this framework bridges algorithm-generated explanations and human decision-making theories. Finally, we discuss implications for XAI design and development.
Conference Paper
Full-text available
Employing lightweight, tool-based code review of code changes (aka modern code review) has become the norm for a wide variety of open-source and industrial systems. In this paper, we make an exploratory investigation of modern code review at Google. Google introduced code review early on and evolved it over the years; our study sheds light on why Google introduced this practice and analyzes its current status, after the process has been refined through decades of code changes and millions of code reviews. By means of 12 interviews, a survey with 44 respondents, and the analysis of review logs for 9 million reviewed changes, we investigate motivations behind code review at Google, current practices, and developers' satisfaction and challenges.
Conference Paper
Full-text available
Although peer code review is widely adopted in both commercial and open source development, existing studies suggest that such code reviews often contain a significant amount of non-useful review comments. Unfortunately, to date, no tools or techniques exist that can provide automatic support in improving those non-useful comments. In this paper, we first report a comparative study between useful and non-useful review comments, contrasting them using their textual characteristics and reviewers' experience. Then, based on the findings from the study, we develop RevHelper, a prediction model that can help developers improve their code review comments through automatic prediction of their usefulness during review submission. A comparative study using 1,116 review comments suggested that useful comments share more vocabulary with the changed code, contain salient items like relevant code elements, and their reviewers are generally more experienced. Experiments using 1,482 review comments report that our model can predict comment usefulness with 66% prediction accuracy, which is promising. Comparison with three variants of a baseline model using a case study validates our empirical findings and demonstrates the potential of our model.
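One finding above is that useful comments share more vocabulary with the changed code. A minimal sketch of how such a vocabulary-overlap feature could be computed follows; the tokenization and normalization are assumptions, not RevHelper's exact feature extraction:

```python
import re

def vocab_overlap(comment: str, changed_code: str) -> float:
    """Fraction of the comment's word tokens that also appear in the code."""
    c = set(re.findall(r"\w+", comment.lower()))
    k = set(re.findall(r"\w+", changed_code.lower()))
    return len(c & k) / len(c) if c else 0.0

print(vocab_overlap("rename userCount for clarity",
                    "int userCount = users.size();"))  # 0.25
```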
Article
Full-text available
Code review is the process of having other team members examine changes to a software system in order to evaluate its technical content and quality. A lightweight variant of this practice, often referred to as Modern Code Review (MCR), is widely adopted by software organizations today. Previous studies have established a relation between the practice of code review and the occurrence of post-release bugs. While the prior work studies the impact of code review practices on software release quality, it is still unclear what impact code review practices have on software design quality. Therefore, using the occurrence of 7 different types of anti-patterns (i.e., poor solutions to design and implementation problems) as a proxy for software design quality, we set out to investigate the relationship between code review practices and software design quality. Through a case study of the Qt, VTK and ITK open source projects, we find that software components with low review coverage or low review participation are often more prone to the occurrence of anti-patterns than those components with more active code review practices.
Article
Peer code review is a widely practiced software engineering process in which software developers collaboratively evaluate and improve source code quality. Whether developers can perform good reviews depends on whether they have sufficient competence and experience. However, knowledge of what competencies developers need to execute code review is currently limited, thus hindering, for example, the creation of effective support tools and training strategies. To address this gap, we first identified 27 competencies relevant to performing code review through expert validation. Later, we conducted an online survey with 105 reviewers to rank these competencies along four dimensions: frequency of usage, importance, proficiency, and desire of reviewers to improve in that competency. The survey shows that technical competencies are considered essential to performing reviews and that respondents feel generally confident in their technical proficiency. Moreover, reviewers feel less confident in how to communicate clearly and give constructive feedback, competencies they likewise consider an essential part of reviewing. Therefore, research and education should focus in more detail on how to support and develop reviewers' potential to communicate effectively during reviews. In the paper, we also discuss further implications for training, code review performance assessment, and reviewers of different experience levels. Data and materials: https://doi.org/10.5281/zenodo.7401313
Article
Context: Code review is a valuable software process that helps software practitioners to identify a variety of defects in code. Even though many code review tools and static analysis tools exist to improve the efficiency of the process, code review is still costly. Objective: Understanding the types of defects that code reviews help to identify could reveal other means of cost improvement. Thus, our goal was to identify the defect types detected in real-world code reviews, and the extent to which code review can benefit from defect detection tools. Method: To this end, we classified 417 comments from code reviews of 7 OSS Java projects using thematic analysis. Results: We identified 116 defect types that we grouped into 15 groups to create a defect classification. Additionally, 38% of these defects could be automatically detected accurately. Conclusion: We learned that even though many capable defect detection tools are available today, a substantial number of defects that could be detected automatically still reach code review. We also identified several code review cost reduction opportunities.
Article
Code review consists of manual inspection, discussion, and judgment of source code by developers other than the code's author. Due to discussions around competing ideas and group decision-making processes, interpersonal conflicts during code reviews are to be expected. This study systematically investigates how developers perceive code review conflicts and addresses interpersonal conflicts during code reviews as a theoretical construct. Through the thematic analysis of interviews conducted with 22 developers, we confirm that conflicts during code reviews are commonplace, anticipated, and seen as normal by developers. Even though conflicts do happen and carry a negative impact for the review, conflicts, if resolved constructively, can also create value and bring improvement. Moreover, the analysis provided insights into how strongly conflicts during code review and their context (i.e., code, developer, team, organization) are intertwined. Finally, there are aspects specific to code review conflicts that call for the research and application of customized conflict resolution and management techniques, some of which are discussed in this paper. Preprint: https://arxiv.org/abs/2201.05425 Data and material: https://doi.org/10.5281/zenodo.5848794
Article
Context: Learning-based automatic program repair techniques are showing promise in providing quality fix suggestions for bugs detected in software source code. These tools mostly exploit historical data of buggy and fixed code changes and are heavily dependent on bug localizers when applied to a new piece of code. With the increasing popularity of code review, dependency on bug localizers can be reduced. Besides, code review-based bug localization is more trustworthy since reviewers' expertise and experience are reflected in these suggestions. Objective: The natural language instructions scripted in review comments are an enormous source of information about a bug's nature and expected solutions. However, to the best of our knowledge, none of the learning-based tools has utilized review comments to fix programming bugs. In this study, we investigate the performance improvement of repair techniques using code review comments. Method: We train a sequence-to-sequence model on 55,060 code reviews and associated code changes. We also introduce new tokenization and preprocessing approaches that help to achieve significant improvement over state-of-the-art learning-based repair techniques. Results: We boost the top-1 accuracy by 20.33% and the top-10 accuracy by 34.82%. We could provide suggestions for stylistic and non-code errors unaddressed by prior techniques. Conclusion: We believe that the automatic fix suggestions generated by our approach, alongside code review, would help developers address review comments quickly and correctly, thus saving their time and effort.
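The study trains a sequence-to-sequence model on code reviews and associated code changes. One plausible preprocessing step is sketched below, under the assumption that the reviewer's comment and the flagged code are concatenated into a single source sequence; the separator tokens are illustrative, not the paper's exact scheme:

```python
def build_source_sequence(review_comment: str, buggy_code: str) -> str:
    """Join comment and code into one encoder input (assumed format)."""
    return f"<comment> {review_comment} <code> {buggy_code}"

src = build_source_sequence("Use equals() instead of == for strings",
                            'if (name == "admin") { grant(); }')
print(src)  # fed to the encoder; the decoder learns to emit the fixed code
```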
Article
Contemporary code review is a widespread practice used by software engineers to maintain high software quality and share project knowledge. However, conducting proper code review takes time, and developers often have limited time for review. In this paper, we aim to investigate the information that reviewers need to conduct a proper code review, to better understand this process and how research and tool support can help developers become more effective and efficient reviewers. Previous work has provided evidence that a successful code review process is one in which reviewers and authors actively participate and collaborate. In these cases, the threads of discussions that are saved by code review tools are a precious source of information that can later be exploited for research and practice. In this paper, we focus on this source of information as a way to gather reliable data on the aforementioned reviewers' needs. We manually analyze 900 code review comments from three large open-source projects and organize them into categories by means of a card sort. Our results highlight the presence of seven high-level information needs, such as knowing the uses of methods and variables declared/modified in the code under review. Based on these results, we suggest ways in which future code review tools can better support collaboration and the reviewing task.
Conference Paper
Code review has been widely adopted by both industrial and open source software development communities. Research in code review is highly dependent on real-world data, and although existing researchers have attempted to provide code review datasets, there is still no dataset that links code reviews with complete versions of the system's code base, mainly because reviewed versions are not kept in the system's version control repository. Thus, we present CROP, the Code Review Open Platform, the first curated code review repository that links review data with isolated complete versions (snapshots) of the source code at the time of review. CROP currently provides data for 8 software systems, 48,975 reviews, and 112,617 patches, including versions of the systems that are inaccessible in the systems' original repositories. Moreover, CROP is extensible, and it will be continuously curated and extended.
Article
Context: Evolutionary algorithms typically require a large number of evaluations of solutions to reach their conclusions, and these evaluations can be very slow and expensive. Objective: To solve search-based SE problems using fewer evaluations than evolutionary methods. Method: Instead of mutating a small population, we build a very large initial population which is then culled using a recursive bi-clustering binary chop approach. We evaluate this approach on multiple software engineering (SE) models, unconstrained as well as constrained, and compare its performance with standard evolutionary algorithms. Results: Using just a few evaluations (under 100), we can obtain results comparable to standard evolutionary algorithms. Conclusion: Just because something works and is in widespread use does not necessarily mean that there is no value in seeking methods to improve it. Before undertaking an SBSE optimization task using traditional EAs, it is recommended to try other techniques, like those explored here, to obtain the same results with fewer evaluations.
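A toy version of the recursive cull described in the Method: split the candidate solutions into two clusters, evaluate a representative from each, keep the better half, and recurse. The 1-D split and objective below are simplistic stand-ins for the paper's bi-clustering, not its actual algorithm:

```python
import random

def cull(population, evaluate, min_size=4):
    """Recursively halve the population, spending few evaluations."""
    if len(population) <= min_size:
        return min(population, key=evaluate)
    population.sort(key=lambda s: s[0])  # crude 1-D stand-in for bi-clustering
    half = len(population) // 2
    west, east = population[:half], population[half:]
    # One evaluation per cluster representative instead of evaluating everyone.
    keep = west if evaluate(random.choice(west)) < evaluate(random.choice(east)) else east
    return cull(keep, evaluate, min_size)

pop = [[random.random(), random.random()] for _ in range(128)]
best = cull(pop, evaluate=lambda s: (s[0] - 0.3) ** 2 + (s[1] - 0.7) ** 2)
print(best)
```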
Conference Paper
In a large, long-lived project, an effective code review process is key to ensuring the long-term quality of the code base. In this work, we study code review practices of a large, open source project, and we investigate how the developers themselves perceive code review quality. We present a qualitative study that summarizes the results from a survey of 88 Mozilla core developers. The results provide developer insights into how they define review quality, what factors contribute to how they evaluate submitted code, and what challenges they face when performing review tasks. We found that the review quality is primarily associated with the thoroughness of the feedback, the reviewer's familiarity with the code, and the perceived quality of the code itself. Also, we found that while different factors are perceived to contribute to the review quality, reviewers often find it difficult to keep their technical skills up-to-date, manage personal priorities, and mitigate context switching.
Article
Code review is an important part of the software development process. Recently, many open source projects have begun practicing code review through 'modern' tools such as GitHub pull-requests and Gerrit. Many commercial software companies use similar tools for code review internally. These tools enable the owner of a source code change to request individuals to participate in the review, i.e., reviewers. However, this task comes with a challenge. Prior work has shown that the benefits of code review are dependent upon the expertise of the reviewers involved. Thus, a common problem faced by authors of source code changes is that of identifying the best reviewers for their source code change. To address this problem, we present an approach, namely cHRev, to automatically recommend reviewers who are best suited to participate in a given review, based on their historical contributions as demonstrated in their prior reviews. We evaluate the effectiveness of cHRev on three open source systems as well as a commercial codebase at Microsoft and compare it to the state of the art in reviewer recommendation. We show that by leveraging the specific information in previously completed reviews (i.e., the quantification of review comments and their recency), we are able to improve dramatically on the performance of prior approaches, which operate only on generic review information (i.e., reviewers of similar source code file and path names) or source code repository data. We also present insights into why our approach cHRev outperforms the existing approaches.
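cHRev scores candidate reviewers by the quantity and recency of their past review comments. Below is a minimal scoring sketch under an assumed monthly decay weighting; the reviewer names, data, and exact formula are illustrative, not the paper's:

```python
from datetime import date

# (reviewer, file, comment_count, review_date) -- hypothetical history.
past_reviews = [
    ("alice", "auth.py", 5, date(2024, 1, 10)),
    ("bob",   "auth.py", 2, date(2024, 6, 1)),
]

def score(reviewer, changed_file, today=date(2024, 7, 1)):
    """Sum comment counts, discounted by roughly one unit per month of age."""
    return sum(n / (1 + (today - d).days / 30)
               for r, f, n, d in past_reviews
               if r == reviewer and f == changed_file)

for r in ("alice", "bob"):
    print(r, round(score(r, "auth.py"), 2))  # recent reviews weigh more
```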
Chapter
When data are presented in paired fashion, the paired t-test is often used. For example, we might be interested in the heart rate of a subject before and after receiving some drug. In such cases it is the difference that we are interested in. For each subject we have two values, x_i and y_i, and the difference d_i = x_i - y_i, as outlined below.
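A worked example of the paired comparison just described, using SciPy's standard paired t-test on hypothetical before/after heart-rate measurements:

```python
from scipy.stats import ttest_rel

before = [72, 80, 68, 75, 90]
after = [70, 76, 66, 74, 85]  # same subjects after receiving the drug
t, p = ttest_rel(before, after)  # tests whether the mean difference d is zero
print(f"t = {t:.2f}, p = {p:.3f}")
```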
Article
Research in program comprehension has evolved considerably over the past decades. However, only little is known about how developers practice program comprehension in their daily work. This article reports on qualitative and quantitative research to comprehend the strategies, tools, and knowledge used for program comprehension. We observed 28 professional developers, focusing on their comprehension behavior, strategies followed, and tools used. In an online survey with 1,477 respondents, we analyzed the importance of certain types of knowledge for comprehension and where developers typically access and share this knowledge. We found that developers follow pragmatic comprehension strategies depending on context. They try to avoid comprehension whenever possible and often put themselves in the role of users by inspecting graphical interfaces. Participants confirmed that standards, experience, and personal communication facilitate comprehension. The team size, its distribution, and open-source experience influence their knowledge sharing and access behavior. While face-to-face communication is preferred for accessing knowledge, knowledge is frequently shared in informal comments. Our results reveal a gap between research and practice, as we did not observe any use of comprehension tools and developers seem to be unaware of them. Overall, our findings call for reconsidering the research agendas towards context-aware tool support.