Article
PDF available

Review participation in modern code review: An empirical study of the Android, Qt, and OpenStack projects


Abstract

Software code review is a well-established software quality practice. Recently, Modern Code Review (MCR) has been widely adopted in both open source and proprietary projects. Our prior work shows that review participation plays an important role in MCR practices, since the amount of review participation shares a relationship with software quality. However, little is known about which factors influence review participation in the MCR process. Hence, in this study, we set out to investigate the characteristics of patches that: (1) do not attract reviewers, (2) are not discussed, and (3) receive slow initial feedback. Through a case study of 196,712 reviews spread across the Android, Qt, and OpenStack open source projects, we find that the amount of review participation in the past is a significant indicator of patches that will suffer from poor review participation. Moreover, we find that the description length of a patch shares a relationship with the likelihood of receiving poor reviewer participation or discussion, while the purpose of introducing new features can increase the likelihood of receiving slow initial feedback. Our findings suggest that the patches with these characteristics should be given more attention in order to increase review participation, which will likely lead to a more responsive review process.
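To make the kind of modeling behind such findings concrete, the sketch below fits a logistic regression relating hypothetical patch attributes (past reviewer count, description length, feature-introducing purpose) to the likelihood of attracting no reviewers. The column names and input file are illustrative assumptions, not the study's actual data or pipeline.

    # Minimal sketch: model the likelihood that a patch attracts no reviewers
    # from patch metadata. Column names and the CSV file are hypothetical.
    import pandas as pd
    import statsmodels.api as sm

    reviews = pd.read_csv("reviews.csv")  # one row per patch/review

    # Hypothetical explanatory variables inspired by the abstract:
    # past participation, description length, and patch purpose.
    X = reviews[["past_reviewer_count", "description_length", "is_feature"]]
    X = sm.add_constant(X)
    y = reviews["has_no_reviewer"]  # 1 if the patch attracted no reviewers

    model = sm.Logit(y, X).fit()
    print(model.summary())  # inspect coefficients and p-values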
... Empirical studies have investigated code review efficiency and effectiveness to understand the practice, elaborate recommendations, and develop improvements. Together, these studies share a set of useful code review aspects for further investigation, such as change description [25,75], code churn [76], length of discussion [45,54,66,76], number of changed files [25,45], number of commits [54,67], number of people in the discussion [45], number of resubmissions [45,66], number of review comments [21,54,66], number of reviewers [66,72], size of change [20,45,66], and time to merge [37,41]. Therefore, the mining of raw code review data (Step 3) consisted of collecting the code reviewing-related attributes listed in Table 1, considering 8,761 PRs from Step 2 (sample 2, Figure 5). ...
Preprint
Background: Pull-based development has shaped the practice of Modern Code Review (MCR), in which reviewers can contribute code improvements, such as refactorings, through comments and commits in Pull Requests (PRs). Past MCR studies treat all PRs uniformly, regardless of whether they induce refactoring or not. We define a PR as refactoring-inducing when refactoring edits are performed after the initial commit(s), either as a result of discussion among reviewers or as spontaneous actions carried out by the PR developer. Aims: This mixed (quantitative and qualitative) study explores code reviewing-related aspects intending to characterize refactoring-inducing PRs. Method: We hypothesize that refactoring-inducing PRs have distinct characteristics from non-refactoring-inducing ones and thus deserve special attention and treatment from researchers, practitioners, and tool builders. To investigate our hypothesis, we mined a sample of 1,845 merged PRs from Apache projects on GitHub, mined refactoring edits in these PRs, and ran a comparative study between refactoring-inducing and non-refactoring-inducing PRs. We also manually examined 2,096 review comments and 1,891 detected refactorings from 228 refactoring-inducing PRs. Results: We found that 30.2% of the PRs in our sample were refactoring-inducing and that they differ significantly from non-refactoring-inducing ones in terms of number of commits, code churn, number of file changes, number of review comments, length of discussion, and time to merge. However, we found no statistical evidence that the number of reviewers is related to refactoring inducement. Our qualitative analysis revealed that at least one refactoring edit was induced by review in 133 (58.3%) of the refactoring-inducing PRs examined. Conclusions: Our findings suggest directions for researchers, practitioners, and tool builders to improve practices around pull-based code review.
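A comparison between two PR populations like the one described above is commonly carried out with a non-parametric test. The sketch below uses the Mann-Whitney U test on made-up commit counts purely for illustration; the study's own statistical procedure may differ.

    # Sketch: comparing a metric (here, number of commits) between
    # refactoring-inducing and non-refactoring-inducing PRs with a
    # non-parametric test. The numbers are made up for illustration.
    from scipy.stats import mannwhitneyu

    refactoring_inducing = [4, 6, 3, 8, 5, 7]
    non_refactoring_inducing = [1, 2, 2, 3, 1, 2]

    stat, p_value = mannwhitneyu(refactoring_inducing, non_refactoring_inducing,
                                 alternative="two-sided")
    print(f"U={stat}, p={p_value:.4f}")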
... The explanatory variables are the four diversity metrics above, i.e., entropy measurements using polarity, adjective, and verb. To fit and interpret the models, we follow standard practices in the literature [61,70]: (1) When fitting the models, we test for multicollinearity between the explanatory variables using the Variance Inflation Factor (VIF), and remove variables with VIF scores above the recommended maximum of 5 [10]. (2) When interpreting the models, we consider coefficients important only if they are statistically significant (p-value ≤ 0.05). ...
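The VIF-based multicollinearity filter described in this excerpt can be reproduced with standard tooling. The sketch below assumes a pandas DataFrame X of numeric explanatory variables and iteratively drops the variable with the highest VIF until all scores fall at or below the recommended maximum of 5.

    # Sketch of the VIF-based multicollinearity filter described above.
    # X is assumed to be a pandas DataFrame of numeric explanatory variables.
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
        X = X.copy()
        while X.shape[1] > 1:
            vifs = pd.Series(
                [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns,
            )
            worst = vifs.idxmax()
            if vifs[worst] <= threshold:
                break
            X = X.drop(columns=[worst])  # remove the most collinear variable and retry
        return X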
Preprint
Full-text available
Sentiment analysis in software engineering (SE) has shown promise to analyze and support diverse development activities. We report the results of an empirical study that we conducted to determine the feasibility of developing an ensemble engine by combining the polarity labels of stand-alone SE-specific sentiment detectors. Our study has two phases. In the first phase, we pick five SE-specific sentiment detection tools from two recently published papers by Lin et al. [31, 32], who first reported negative results with standalone sentiment detectors and then proposed an improved SE-specific sentiment detector, POME [31]. We report the study results on 17,581 units (sentences/documents) coming from six currently available sentiment benchmarks for SE. We find that the existing tools can be complementary to each other in 85-95% of the cases, i.e., one is wrong, but another is right. However, a majority voting-based ensemble of those tools fails to improve the accuracy of sentiment detection. We develop Sentisead, a supervised tool by combining the polarity labels and bag of words as features. Sentisead improves the performance (F1-score) of the individual tools by 4% (over Senti4SD [5]) - 100% (over POME [31]). In a second phase, we compare and improve Sentisead infrastructure using Pre-trained Transformer Models (PTMs). We find that a Sentisead infrastructure with RoBERTa as the ensemble of the five stand-alone rule-based and shallow learning SE-specific tools from Lin et al. [31, 32] offers the best F1-score of 0.805 across the six datasets, while a stand-alone RoBERTa shows an F1-score of 0.801.
Article
Technical debt is a sub-optimal state of development in a project. In particular, technical debt incurred by developers themselves (e.g., comments indicating that an implementation is imperfect and should be replaced) is called self-admitted technical debt (SATD). In theory, technical debt should not be left unaddressed for long, because it accumulates more cost over time, making it more difficult to pay off. Accordingly, developers have traditionally relied on code reviews to find technical debt. In fact, we observe that many SATD comments are introduced during Modern Code Review (MCR), a light-weight review process supported by web-based tools. However, little is known about the nature of SATD comments introduced during the review process: their impact, frequency, characteristics, and triggers. This study therefore empirically examines the relationship between SATD and MCR. Our case study of 156,372 review records from the Qt and OpenStack systems shows that (i) review records involving SATD are about 6%–7% less likely to be accepted than those without SATD; (ii) review records involving SATD tend to require two to three more revisions than those without SATD; (iii) 28–48% of SATD comments are introduced during code reviews; (iv) SATD introduced during reviews serves as a means of communication between authors and reviewers; and (v) 20% of the SATD comments are introduced due to reviewers' requests.
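For illustration, a simple pattern-based detector of the kind often used in the SATD literature (not necessarily the identification approach used in this study) can flag candidate SATD comments by keyword, as sketched below with a small, hypothetical pattern list.

    # Sketch of pattern-based SATD detection (a common approach in the SATD
    # literature, not necessarily the classifier used in the study above).
    import re

    # A small, illustrative subset of SATD indicator phrases.
    SATD_PATTERNS = re.compile(
        r"\b(todo|fixme|hack|workaround|temporary fix|ugly|kludge)\b",
        re.IGNORECASE,
    )

    def is_satd_comment(comment: str) -> bool:
        return bool(SATD_PATTERNS.search(comment))

    print(is_satd_comment("TODO: replace this hack with a proper parser"))  # True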
Article
Sentiment analysis in software engineering (SE) has shown promise to analyze and support diverse development activities. Recently, several tools have been proposed to detect sentiment in software artifacts. While these tools improve accuracy over off-the-shelf tools, recent research shows that their performance could still be unsatisfactory. A more accurate sentiment detector for SE can help reduce noise in analysis of software scenarios where sentiment analysis is required. Recently, combinations, i.e., hybrids, of stand-alone classifiers have been found to offer better performance than the stand-alone classifiers for fault detection. However, we are aware of no such approach for sentiment detection for software artifacts. We report the results of an empirical study that we conducted to determine the feasibility of developing an ensemble engine by combining the polarity labels of stand-alone SE-specific sentiment detectors. Our study has two phases. In the first phase, we pick five SE-specific sentiment detection tools from two recently published papers by Lin et al. [31, 32], who first reported negative results with stand-alone sentiment detectors and then proposed an improved SE-specific sentiment detector, POME [31]. We report the study results on 17,581 units (sentences/documents) coming from six currently available sentiment benchmarks for software engineering. We find that the existing tools can be complementary to each other in 85-95% of the cases, i.e., one is wrong but another is right. However, a majority voting-based ensemble of those tools fails to improve the accuracy of sentiment detection. We develop Sentisead, a supervised tool that combines the polarity labels and bag of words as features. Sentisead improves the performance (F1-score) of the individual tools by 4% (over Senti4SD [5]) – 100% (over POME [31]). The initial development of Sentisead occurred before we observed the use of deep learning models for SE-specific sentiment detection. In particular, recent papers show the superiority of advanced language-based pre-trained transformer models (PTM) over rule-based and shallow learning models. Consequently, in a second phase, we compare and improve the Sentisead infrastructure using PTMs. We find that a Sentisead infrastructure with RoBERTa as the ensemble of the five stand-alone rule-based and shallow learning SE-specific tools from Lin et al. [31, 32] offers the best F1-score of 0.805 across the six datasets, while a stand-alone RoBERTa shows an F1-score of 0.801.
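The core idea of such an ensemble, combining the polarity labels emitted by stand-alone tools with bag-of-words features and training a supervised classifier on top, can be sketched as follows. The feature layout, classifier choice, and toy data are illustrative assumptions, not the published Sentisead implementation.

    # Sketch of a Sentisead-style ensemble: combine the polarity labels emitted
    # by stand-alone sentiment tools with bag-of-words features and train a
    # supervised classifier. Feature layout and classifier are assumptions.
    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = ["this patch looks great", "ugly hack, please rewrite"]  # SE texts
    tool_labels = np.array([[1, 1, 0], [-1, 0, -1]])                 # labels from 3 tools
    gold = np.array([1, -1])                                         # ground-truth polarity

    bow = CountVectorizer().fit_transform(texts)                     # bag of words
    features = hstack([bow, csr_matrix(tool_labels)])                # combine features

    ensemble = LogisticRegression(max_iter=1000).fit(features, gold)
    print(ensemble.predict(features))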
Article
Full-text available
Peer code review is a widely adopted software engineering practice to ensure code quality and software reliability in both commercial and open-source software projects. Due to the large effort overhead associated with practicing code reviews, project managers often wonder whether their code reviews are effective and whether there are improvement opportunities in that respect. Since project managers at Samsung Research Bangladesh (SRBD) were also intrigued by these questions, this research developed, deployed, and evaluated a production-ready solution using the Balanced Scorecard (BSC) strategy that SRBD managers can use in their day-to-day management to monitor an individual developer's, a particular project's, or the entire organization's code review effectiveness. Following the four-step framework of the BSC strategy, we: 1) defined the operational goals of this research, 2) defined a set of metrics to measure the effectiveness of code reviews, 3) developed an automated mechanism to measure those metrics, and 4) developed and evaluated a monitoring application to inform the key stakeholders. Our automated model to identify useful code reviews achieves 7.88% and 14.39% improvements in accuracy and minority-class F1 score, respectively, over the models proposed in prior studies. It also outperforms the human evaluators from SRBD that the model replaces, by margins of 25.32% and 23.84% in accuracy and minority-class F1 score, respectively. In our post-deployment survey, SRBD developers and managers indicated that they found our solution useful and that it provided important insights to support their decision making.
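The accuracy and minority-class F1 figures quoted above can be computed with standard library routines; the sketch below uses placeholder labels in which the "not useful" class is the minority.

    # Sketch: computing accuracy and minority-class F1 for a binary
    # "useful vs. not useful review" classifier. Data below is placeholder.
    from sklearn.metrics import accuracy_score, f1_score

    y_true = [1, 1, 1, 1, 0, 0]   # 0 = not useful (minority class here)
    y_pred = [1, 1, 0, 1, 0, 1]

    print("accuracy:", accuracy_score(y_true, y_pred))
    print("minority-class F1:", f1_score(y_true, y_pred, pos_label=0))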
Preprint
Context: Software projects are common resources in Software Engineering experiments, although they are often selected without following a specific strategy, which reduces the representativeness and replicability of the results. An option is the use of preserved collections of software projects, but these must be current, with explicit guidelines that guarantee their updating over a long period of time. Goal: To carry out a systematic secondary study of the strategies used to select software projects in empirical studies, in order to discover the guidelines taken into account, the degree of use of project collections, the meta-data extracted, and the subsequent statistical analysis conducted. Method: A systematic mapping study to identify studies published from January 2013 to December 2020. Results: 122 studies were identified, of which 72% used their own guidelines for project selection and 27% used existing project collections. Likewise, there was no evidence of a standardized framework for the project selection process, nor of the application of statistical methods related to the sample selection strategy.
Article
Full-text available
Code reviews serve as a quality assurance activity for software teams. Especially in Modern Code Review, sharing a link during a review discussion serves as an effective awareness mechanism, in the sense that "code reviews are good FYIs [for your information]". Although prior work has explored link sharing and the information needs of a code review, the extent to which links are used to properly conduct a review is unknown. In this study, we adopted a mixed-method approach to investigate the practice of link sharing and the intentions behind it. First, through a quantitative study of the OpenStack and Qt projects, we identified 19,268 reviews containing 39,686 links to explore the extent to which links are shared, and analyzed the correlation between link sharing and review time. Then, in a qualitative study, we manually analyzed 1,378 links to understand the role and usefulness of link sharing. Results indicate that internal links are the more widely referenced kind (93% and 80% for the two projects). Importantly, although the majority of internal links refer to other reviews, bug reports and source code are also shared in review discussions. The statistical models show that the number of internal links, as an explanatory factor, has an increasing relationship with review time. Finally, we present seven intentions of link sharing, with providing context being the most common. Based on these findings and a developer survey, we encourage patch authors to provide clear context and to explore both internal and external resources, while the review team should continue link sharing activities. Future research directions include investigating causality between link sharing and the review process, as well as potential tool support.
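A minimal way to separate internal from external links in review comments is to match link hosts against a list of project-owned domains, as sketched below; the domain list and example comment are hypothetical.

    # Sketch: classify links found in review comments as internal or external.
    # The internal domain list is hypothetical, for illustration only.
    import re
    from urllib.parse import urlparse

    INTERNAL_DOMAINS = {"review.opendev.org", "bugs.launchpad.net", "codereview.qt-project.org"}
    URL_RE = re.compile(r"https?://\S+")

    def classify_links(comment: str) -> dict:
        counts = {"internal": 0, "external": 0}
        for url in URL_RE.findall(comment):
            host = urlparse(url).netloc.lower()
            key = "internal" if host in INTERNAL_DOMAINS else "external"
            counts[key] += 1
        return counts

    print(classify_links("See https://bugs.launchpad.net/nova/+bug/1 and https://example.com/doc"))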
Conference Paper
Full-text available
Software code review is a well-established software quality practice. Recently, Modern Code Review (MCR) has been widely adopted in both open source and proprietary projects. To evaluate the impact that characteristics of MCR practices have on software quality, this paper comparatively studies MCR practices in defective and clean source code files. We investigate defective files along two perspectives: 1) files that will eventually have defects (i.e., future-defective files) and 2) files that have historically been defective (i.e., risky files). Through an empirical study of 11,736 reviews of changes to 24,486 files from the Qt open source project, we find that both future-defective files and risky files tend to be reviewed less rigorously than their clean counterparts. We also find that the concerns addressed during the code reviews of both defective and clean files tend to enhance evolvability, i.e., ease future maintenance (like documentation), rather than focus on functional issues (like incorrect program logic). Our findings suggest that although functionality concerns are rarely addressed during code review, the rigor of the reviewing process that is applied to a source code file throughout a development cycle shares a link with its defect proneness.
Conference Paper
Full-text available
Code ownership establishes a chain of responsibility for modules in large software systems. Although prior work uncovers a link between code ownership heuristics and software quality, these heuristics rely solely on the authorship of code changes. In addition to authoring code changes, developers also make important contributions to a module by reviewing code changes. Indeed, recent work shows that reviewers are highly active in modern code review processes, often suggesting alternative solutions or providing updates to the code changes. In this paper, we complement traditional code ownership heuristics using code review activity. Through a case study of six releases of the large Qt and OpenStack systems, we find that: (1) 67%--86% of developers did not author any code changes for a module, but still actively contributed by reviewing 21%--39% of the code changes, (2) code ownership heuristics that are aware of reviewing activity share a relationship with software quality, and (3) the proportion of reviewers without expertise shares a strong, increasing relationship with the likelihood of having post-release defects. Our results suggest that reviewing activity captures an important aspect of code ownership, and should be included in approximations of it in future studies.
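The review-aware ownership idea can be illustrated by crediting a developer both for the changes they authored in a module and for the changes they reviewed; the sketch below uses hypothetical field names and toy data, not the exact heuristic definitions from the paper.

    # Sketch of a review-aware code ownership heuristic: for a module, credit a
    # developer both for the changes they authored and for those they reviewed.
    # Field names and the toy data are assumptions for illustration.
    from collections import Counter

    def ownership(changes):
        """changes: list of dicts with 'author' and 'reviewers' for one module."""
        total = len(changes)
        authored = Counter(c["author"] for c in changes)
        reviewed = Counter(r for c in changes for r in c["reviewers"])
        devs = set(authored) | set(reviewed)
        return {
            d: {
                "authoring": authored[d] / total,
                "reviewing": reviewed[d] / total,
            }
            for d in devs
        }

    changes = [
        {"author": "alice", "reviewers": ["bob"]},
        {"author": "bob", "reviewers": ["alice", "carol"]},
    ]
    print(ownership(changes))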
Article
Full-text available
Code review is the process of having other team members examine changes to a software system in order to evaluate its technical content and quality. A lightweight variant of this practice, often referred to as Modern Code Review (MCR), is widely adopted by software organizations today. Previous studies have established a relation between the practice of code review and the occurrence of post-release bugs. While the prior work studies the impact of code review practices on software release quality, it is still unclear what impact code review practices have on software design quality. Therefore, using the occurrence of 7 different types of anti-patterns (i.e., poor solutions to design and implementation problems) as a proxy for software design quality, we set out to investigate the relationship between code review practices and software design quality. Through a case study of the Qt, VTK and ITK open source projects, we find that software components with low review coverage or low review participation are often more prone to the occurrence of anti-patterns than those components with more active code review practices.
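Review coverage and review participation proxies of the kind related to anti-pattern occurrence above can be approximated per component as sketched below; the field names and the simple averaging are illustrative assumptions.

    # Sketch: per-component review coverage and participation proxies of the
    # kind the study relates to anti-pattern occurrence. Field names hypothetical.
    def review_metrics(changes):
        """changes: list of dicts with 'reviewed' (bool) and 'comments' (int)."""
        total = len(changes)
        reviewed = [c for c in changes if c["reviewed"]]
        coverage = len(reviewed) / total if total else 0.0
        participation = (
            sum(c["comments"] for c in reviewed) / len(reviewed) if reviewed else 0.0
        )
        return {"review_coverage": coverage, "avg_discussion": participation}

    print(review_metrics([{"reviewed": True, "comments": 3}, {"reviewed": False, "comments": 0}]))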
Conference Paper
Open source software projects often rely on code contributions from a wide variety of developers to extend the capabilities of their software. Project members evaluate these contributions and often engage in extended discussions to decide whether to integrate changes. These discussions have important implications for project management regarding new contributors and evolution of project requirements and direction. We present a study of how developers in open work environments evaluate and discuss pull requests, a primary method of contribution in GitHub, analyzing a sample of extended discussions around pull requests and interviews with GitHub developers. We found that developers raised issues around contributions over both the appropriateness of the problem that the submitter attempted to solve and the correctness of the implemented solution. Both core project members and third-party stakeholders discussed and sometimes implemented alternative solutions to address these issues. Different stakeholders also influenced the outcome of the evaluation by eliciting support from different communities such as dependent projects or even companies. We also found that evaluation outcomes may be more complex than simply acceptance or rejection. In some cases, although a submitter's contribution was rejected, the core team fulfilled the submitter's technical goals by implementing an alternative solution. We found that the level of a submitter's prior interaction on a project changed how politely developers discussed the contribution and the nature of proposed alternative solutions.
Article
Peer code review is a software quality assurance activity followed in several open-source and closed-source software projects. Rietveld and Gerrit are the most popular peer code review systems used by open-source software projects. Despite the popularity and usefulness of these systems, they do not record or maintain cost and effort information for a submitted patch review activity. Currently, there are no formal or standard metrics available to measure the effort and contribution of a patch review activity. We hypothesize that the size and complexity of modified files and patches are significant indicators of the effort and contribution of patch reviewers in a patch review process. We propose a metric for computing the effort and contribution of a patch reviewer based on modified file size, patch size, and program complexity variables. We conduct a survey of developers involved in peer code review activity to test our hypothesis of a causal relationship between the proposed indicators and effort. We employ the proposed model and conduct an empirical analysis using the proposed metrics on the open-source Google Android project.
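One way to realize a metric of the proposed form is a weighted combination of modified file size, patch size, and complexity, as sketched below; the weights are illustrative assumptions, not calibrated values from the paper.

    # Sketch of an effort metric of the form the abstract describes: a weighted
    # combination of modified file size, patch size, and complexity. The weights
    # here are illustrative assumptions, not values from the paper.
    def review_effort(file_loc, patch_loc, cyclomatic_complexity,
                      w_file=0.2, w_patch=0.5, w_cc=0.3):
        return w_file * file_loc + w_patch * patch_loc + w_cc * cyclomatic_complexity

    print(review_effort(file_loc=1200, patch_loc=85, cyclomatic_complexity=14))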
Article
Code review is an important part of the software development process. Recently, many open source projects have begun practicing code review through 'modern' tools such as GitHub pull requests and Gerrit. Many commercial software companies use similar tools for code review internally. These tools enable the owner of a source code change to request individuals to participate in the review, i.e., reviewers. However, this task comes with a challenge. Prior work has shown that the benefits of code review depend on the expertise of the reviewers involved. Thus, a common problem faced by authors of source code changes is identifying the best reviewers for their change. To address this problem, we present an approach, namely cHRev, to automatically recommend reviewers who are best suited to participate in a given review, based on their historical contributions as demonstrated in their prior reviews. We evaluate the effectiveness of cHRev on three open source systems as well as a commercial codebase at Microsoft, and compare it to the state of the art in reviewer recommendation. We show that by leveraging the specific information in previously completed reviews (i.e., the quantification of review comments and their recency), we are able to improve dramatically on the performance of prior approaches, which operate only on generic review information (i.e., reviewers of similar source code file and path names) or source code repository data. We also present insights into why cHRev outperforms the existing approaches.
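The underlying idea, scoring candidate reviewers by their past review comments on the changed files and discounting older activity, can be sketched as below; the exponential decay, half-life, and data layout are illustrative assumptions rather than the paper's exact formulation.

    # Sketch of cHRev-style reviewer scoring: weight each past review comment on
    # the changed files by recency. The exponential decay and data layout are
    # assumptions for illustration, not the paper's exact model.
    import math
    from collections import defaultdict
    from datetime import datetime

    def recommend(changed_files, history, now, half_life_days=90, top_k=3):
        """history: list of (reviewer, file, comment_date) tuples."""
        scores = defaultdict(float)
        for reviewer, path, when in history:
            if path in changed_files:
                age = (now - when).days
                scores[reviewer] += math.exp(-age * math.log(2) / half_life_days)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    history = [
        ("alice", "src/net.py", datetime(2024, 5, 1)),
        ("bob", "src/net.py", datetime(2023, 1, 10)),
    ]
    print(recommend({"src/net.py"}, history, now=datetime(2024, 6, 1)))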