Empirical Software Engineering

Published by Springer Nature
Online ISSN: 1573-7616
Recent publications
Article
Story Points (SP) are an effort unit used to represent the relative effort of a work item. In Agile software development, SP allows a development team to estimate their delivery capacity and facilitates sprint planning activities. Although Agile embraces change, SP changes after sprint planning may negatively impact the sprint plan. To minimize this impact, there is a need to better understand SP changes and an automated approach to predict them. Hence, to better understand SP changes, we examine the prevalence, accuracy, and impact of information changes on SP changes. Through analyses based on 19,349 work items spread across seven open-source projects, we find that on average, 10% of the work items have SP changes. These work items typically have their SP value increased by 58%-100% relative to the initial SP value when they were assigned to a sprint. We also find that unchanged SP reflect the development time better than changed SP. Our qualitative analysis shows that work items with changed SP often have information changes relating to updating the scope of work. Our empirical results suggest that SP and the scope of work should be reviewed prior to or during sprint planning to achieve a reliable sprint plan. Yet, it could be a tedious task to review all work items in the product (or sprint) backlog. Therefore, we develop a classifier to predict whether a work item will have SP changes after being assigned to a sprint. Our classifier achieves an AUC of 0.69-0.8, which is significantly better than the baselines. Our results suggest that to better manage and prepare for the unreliability in SP estimation, the team can leverage our insights and the classifier during sprint planning. To facilitate future studies, we provide the replication package and the datasets, which are available online.
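As a rough illustration of the final step (not the authors' implementation), the sketch below trains a classifier on hypothetical work-item features and evaluates it with the AUC metric reported above; every feature name and all data are made-up placeholders.

```python
# Hedged sketch: predict whether a work item's Story Points will change after
# sprint assignment, evaluated with AUC. Features and data are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(1, 13, n),    # initial SP value
    rng.integers(0, 500, n),   # description length in words (assumed feature)
    rng.integers(0, 10, n),    # comments before sprint assignment (assumed)
])
y = rng.integers(0, 2, n)      # 1 = SP changed after sprint planning

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```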
 
Article
Nowadays, mobile applications represent a principal means of enabling human interaction. Being so pervasive, these applications should be made usable for all users: accessibility refers to the guidelines that developers should follow to include features allowing users with disabilities (e.g., visual impairments) to better interact with an application. While research in this field is gaining interest, there is still a notable lack of knowledge on how developers practically deal with the problem: (i) whether they are aware of and take accessibility guidelines into account when developing apps, (ii) which guidelines are harder for them to implement, and (iii) which tools they use to support them in this task. To bridge this knowledge gap on the state of the practice concerning the accessibility of mobile applications, we adopt a mixed-method research approach with a twofold goal. We aim to (i) verify how accessibility guidelines are implemented in mobile applications through a coding strategy and (ii) survey mobile developers on the issues and challenges of dealing with accessibility in practice. The key results of the study show that most accessibility guidelines are ignored when developing mobile apps. This behavior is mainly due to developers' lack of awareness of accessibility concerns and the lack of tools to support them during development.
 
Article
For many people, mobile platforms are now an essential part of everyday life. A defining feature of mobile platforms is their reliance on battery performance. Due to this reliance, there is a pressing need for mobile applications that minimise their own impact on batteries. While mobile platforms are improving their capabilities in terms of policing the energy use of applications and rationing energy-hungry devices, mobile application developers still lack knowledge of how to write energy-efficient programs. Recent work in automatic program improvement using heuristic search over randomly generated program variants has shown some promise in terms of producing reductions in programs' energy use. A challenge in this work is accurately measuring the energy use of program variants. One approach to measurement is to use each platform's internal meter to assess variants on the device itself. This approach has advantages in terms of measuring actual energy use on each platform but is not ideal for the search for program variants that perform well across multiple platforms. The work in this paper addresses this problem by using an island-like evolutionary search model to simultaneously evolve variants on multiple platforms. Island models of evolutionary search conduct search on multiple platforms in parallel and share promising variants. The results show that this approach has advantages over isolated evolution in terms of speeding up evolution on each platform and improving the efficiency of search. Validation results show that the island-inspired model is able to evolve variants with good cross-platform performance. In addition, it evolves a solution that outperforms the best solutions found using a sequential evolutionary algorithm on its native platform, with an effect size greater than 90%.
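The island idea itself is compact. The toy Python sketch below (an illustration under simplifying assumptions, not the paper's system) evolves one population per "platform" and periodically migrates each island's best variant to a neighbour; a number stands in for a program variant, and its absolute value for measured energy use.

```python
import random

def mutate(v):
    return v + random.gauss(0, 1)      # stand-in for a program edit

def fitness(v):
    return abs(v)                      # stand-in for measured energy use

def evolve(islands, generations=50, migrate_every=10):
    for gen in range(1, generations + 1):
        for pop in islands:            # each island evolves independently
            pop.sort(key=fitness)      # lower "energy" is better
            survivors = pop[: len(pop) // 2]
            pop[:] = survivors + [mutate(random.choice(survivors))
                                  for _ in survivors]
        if gen % migrate_every == 0:   # share promising variants
            best = [min(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                pop[-1] = best[(i + 1) % len(islands)]
    return islands

islands = [[random.uniform(-10, 10) for _ in range(20)] for _ in range(3)]
evolve(islands)
print([round(min(pop, key=fitness), 3) for pop in islands])
```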
 
Article
Multi-Objective Search Algorithms (MOSAs) have been applied to solve diverse Search-Based Software Engineering (SBSE) problems. In most cases, SBSE users select one or more commonly used MOSAs (for instance, Nondominated Sorting Genetic Algorithm II (NSGA-II)) to solve their search problems, without any justification (i.e., not supported by any evidence) of why those particular MOSAs are selected. However, when working with a specific multi-objective SBSE problem, users typically know what kind(s) of qualities they are looking for in solutions. Such qualities are represented by one or more Quality Indicators (QIs), which are often employed to assess various MOSAs in order to select the best one. However, users usually have limited time budgets, which prevents them from executing multiple MOSAs and consequently selecting the best MOSA in the end. Therefore, for such users, it is highly preferable to select only one MOSA from the outset. To this end, in this paper, we aim to assist SBSE users in finding appropriate MOSAs for their experiments, given their choices of QIs or quality aspects (e.g., Convergence, Uniformity). To achieve this aim, we conduct an extensive empirical evaluation with 18 search problems from a set of real-world, industrial, and open-source case studies, to study preferences among commonly used QIs and MOSAs in SBSE. We observe that each QI has its own specific most-preferred MOSA and vice versa; NSGA-II and Strength Pareto Evolutionary Algorithm 2 (SPEA2) are the MOSAs most preferred by QIs; no QI is the most preferred by all the MOSAs; the preferences between QIs and MOSAs vary across the search problems; QIs covering the same quality aspect(s) do not necessarily have the same preference for MOSAs. Based on our results, we provide discussions and guidelines for SBSE users to select appropriate MOSAs based on experimental evidence.
 
Article
Cloud-native applications constitute a recent trend for designing large-scale software systems. However, even though several cloud-native tools and patterns have emerged to support scalability, there is no commonly accepted method to empirically benchmark their scalability. In this study, we present a benchmarking method, allowing researchers and practitioners to conduct empirical scalability evaluations of cloud-native applications, frameworks, and deployment options. Our benchmarking method consists of scalability metrics, measurement methods, and an architecture for a scalability benchmarking tool, particularly suited for cloud-native applications. Following fundamental scalability definitions and established benchmarking best practices, we propose to quantify scalability by performing isolated experiments for different load and resource combinations, which assess whether specified service level objectives (SLOs) are achieved. To balance usability and reproducibility, our benchmarking method provides configuration options, controlling the trade-off between overall execution time and statistical grounding. We perform an extensive experimental evaluation of our method's configuration options for the special case of event-driven microservices. For this purpose, we use benchmark implementations of the two stream processing frameworks Kafka Streams and Flink and run our experiments in two public clouds and one private cloud. We find that, independent of the cloud platform, it only takes a few repetitions (≤ 5) and short execution times (≤ 5 minutes) to assess whether SLOs are achieved. Combined with our findings from evaluating different search strategies, we conclude that our method makes it possible to benchmark scalability in a reasonable time.
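To make the core loop concrete, here is a hedged Python sketch: for each load level, the smallest resource amount is sought for which a handful of short, repeated experiments meet the SLO. The `measure_p99_latency` stub and the majority-vote aggregation are illustrative assumptions, not the actual benchmarking harness.

```python
import random

SLO_MS = 500        # service level objective: p99 latency below 500 ms
REPETITIONS = 5     # the study finds <=5 repetitions suffice

def measure_p99_latency(load, instances):
    # placeholder: a real harness would deploy `instances` replicas, generate
    # `load` messages/s for a few minutes, and report the observed p99 latency
    return random.gauss(load / instances, 20)

def slo_met(load, instances):
    runs = [measure_p99_latency(load, instances) for _ in range(REPETITIONS)]
    return sum(r <= SLO_MS for r in runs) > REPETITIONS // 2  # assumed rule

# scalability expressed as the minimal resources needed per load level
for load in [100, 200, 400, 800]:
    needed = next(i for i in range(1, 17) if slo_met(load, i))
    print(f"load={load} msgs/s -> {needed} instance(s)")
```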
 
Article
Spectrum Based Fault Localization (SBFL) is a statistical approach to identify faulty code within a program given a program spectrum (i.e., records of program elements executed by passing and failing test cases). Several SBFL techniques have been proposed over the years, but most evaluations of those techniques were done only on Java and C programs, and frequently involve artificial faults. Considering the current popularity of Python, indicated by the results of the Stack Overflow survey among developers in 2020, it becomes increasingly important to understand how SBFL techniques perform on Python projects. However, this remains an understudied topic. In this work, our objective is to analyze the effectiveness of popular SBFL techniques in real-world Python projects. We also aim to compare our observed performance on Python to previously-reported performance on Java. Using the recently-built bug benchmark BugsInPy as our fault dataset, we apply five popular SBFL techniques (Tarantula, Ochiai, OP, Barinel, and DStar) and analyze their performance. We subsequently compare our results with results from Java and C projects reported in earlier related works. We find that 1) the real faults in BugsInPy are harder to identify using SBFL techniques compared to the real faults in Defects4J, indicated by the lower performance of the evaluated SBFL techniques on BugsInPy; 2) older techniques such as Tarantula, Barinel, and Ochiai consistently outperform newer techniques (i.e., OP and DStar) in a variety of metrics and debugging scenarios; 3) claims in preceding studies done on artificial faults in C and Java (such as "OP outperforms Tarantula") do not hold on Python real faults; 4) lower-performing techniques can outperform higher-performing techniques in some cases, emphasizing the potential benefit of combining SBFL techniques. Our results yield insight into how popular SBFL techniques perform on real Python faults and emphasize the importance of conducting SBFL evaluations on real faults.
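For reference, the five techniques differ only in the formula that turns a program spectrum into a suspiciousness score. The standard definitions are sketched below, where e_f/e_p count the failing/passing tests that execute a program element and n_f/n_p those that do not; the sketch assumes at least one failing and one passing test exist, with guards added for the remaining empty denominators.

```python
import math

def tarantula(e_f, e_p, n_f, n_p):
    fail = e_f / (e_f + n_f)                # fraction of failing tests covering
    ok = e_p / (e_p + n_p)                  # fraction of passing tests covering
    return fail / (fail + ok) if fail + ok else 0.0

def ochiai(e_f, e_p, n_f, n_p):
    denom = math.sqrt((e_f + n_f) * (e_f + e_p))
    return e_f / denom if denom else 0.0

def op2(e_f, e_p, n_f, n_p):                # "OP" in the comparison
    return e_f - e_p / (e_p + n_p + 1)

def barinel(e_f, e_p, n_f, n_p):
    return 1 - e_p / (e_p + e_f) if e_p + e_f else 0.0

def dstar(e_f, e_p, n_f, n_p, star=2):
    denom = e_p + n_f
    return e_f ** star / denom if denom else float("inf")

# a line executed by 4 of 5 failing tests and 1 of 20 passing tests
print(round(ochiai(4, 1, 1, 19), 2))        # 0.8 -> highly suspicious
```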
 
Figure: Frequency of project aspects according to issue nature categories
Figure: Overview of themes identified in the interviews with design system project leaders
Article
Design systems represent a user interaction design and development approach that is currently of avid interest in the industry. However, little research work has been done to synthesize knowledge related to design systems in order to inform the design of tools to support their creation, maintenance, and usage practices. This paper represents an important step in which we explored the issues that design system projects usually deal with and the perceptions and values of design system project leaders. Through this exploration, we aim to investigate the needs for tools that support the design system approach. We found that the open source communities around design systems focused on discussing issues related to behaviors of user interface components of design systems. At the same time, leaders of design system projects faced considerable challenges when evolving their design systems to make them both capable of capturing stable design knowledge and flexible to the needs of the various concrete products. They valued a bottom-up approach for design system creation and maintenance, in which components are elevated and merged from the evolving products. Our findings synthesize the knowledge and lay foundations for designing techniques and tools aimed at supporting the design system practice and related modern user interaction design and development approaches.
 
Article
Tool support for automated fault localization in program debugging is limited because state-of-the-art algorithms often fail to provide efficient help to the user. They usually offer a ranked list of suspicious code elements, but the fault is not guaranteed to be found among the highest ranks. In Spectrum-Based Fault Localization (SBFL), which uses code coverage information of test cases and their execution outcomes to calculate the ranks, the developer has to investigate several locations before finding the faulty code element. Yet, all the knowledge the developer has a priori or acquires during this process is not reused by the SBFL tool. There are existing approaches in which the developer interacts with the SBFL algorithm by giving feedback on the elements of the prioritized list. We propose a new approach called iFL which extends interactive approaches by exploiting contextual knowledge of the user about the next item in the ranked list (e.g., a statement), with which larger code entities (e.g., a whole function) can be repositioned in their suspiciousness. We implemented a closely related algorithm proposed by Gong et al., called Talk. First, we evaluated iFL using simulated users, and compared the results to SBFL and Talk. Next, we introduced two types of imperfections in the simulation: the user's knowledge and confidence levels. On SIR and Defects4J, results showed notable improvements in fault localization efficiency, even with strong user imperfections. We then empirically evaluated the effectiveness of the approach with real users in two sets of experiments: a quantitative evaluation of the successfulness of using iFL, and a qualitative evaluation of practical uses of the approach with experienced developers in think-aloud sessions.
 
Article
Context: Applying vulnerability detection techniques is one of many tasks using the limited resources of a software project. Objective: The goal of this research is to assist managers and other decision-makers in making informed choices about the use of software vulnerability detection techniques through an empirical study of the efficiency and effectiveness of four techniques on a Java-based web application. Method: We apply four different categories of vulnerability detection techniques – systematic manual penetration testing (SMPT), exploratory manual penetration testing (EMPT), dynamic application security testing (DAST), and static application security testing (SAST) – to an open-source medical records system. Results: We found the most vulnerabilities using SAST. However, EMPT found more severe vulnerabilities. With each technique, we found unique vulnerabilities not found using the other techniques. The efficiency of manual techniques (EMPT, SMPT) was comparable to or better than the efficiency of automated techniques (DAST, SAST) in terms of Vulnerabilities per Hour (VpH). Conclusions: The vulnerability detection technique practitioners should select may vary based on the goals and available resources of the project. If the goal of an organization is to find "all" vulnerabilities in a project, they need to use as many techniques as their resources allow.
 
Article
Context: Advances in defect prediction models, aka classifiers, have been validated via accuracy metrics. Effort-aware metrics (EAMs) relate to the benefits provided by a classifier in accurately ranking defective entities such as classes or methods. PofB is an EAM that relates to a user who follows a ranking of the probability that an entity is defective, as provided by the classifier. Despite the importance of EAMs, there is no study investigating EAM trends and validity. Aim: The aim of this paper is twofold: 1) we reveal issues in EAM usage, and 2) we propose and evaluate a normalization of PofBs (aka NPofBs), which is based on ranking defective entities by predicted defect density. Method: We perform a systematic mapping study featuring 152 primary studies in major journals and an empirical study featuring 10 EAMs, 10 classifiers, two industrial projects, and 12 open-source projects. Results: Our systematic mapping study reveals that most studies using EAMs use only a single EAM (e.g., PofB20) and that some studies mismatched EAM names. The main result of our empirical study is that NPofBs are statistically significantly and by orders of magnitude higher than PofBs. Conclusions: The proposed normalization of PofBs: (i) increases the realism of results as it relates to a better use of classifiers, and (ii) promotes the practical adoption of prediction models in industry as it shows higher benefits. Finally, we provide a tool to compute EAMs to support researchers in avoiding past issues in using EAMs.
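The proposed normalization is simple to state: rank entities by predicted defect density (probability divided by size) rather than by raw probability, then simulate a user reading down the ranking until 20% of the lines of code are inspected. A minimal sketch with made-up data shows why this can raise PofB20 substantially:

```python
def pofb20(entities, key):
    """Fraction of defects found after inspecting 20% of the LOC in rank order."""
    ranked = sorted(entities, key=key, reverse=True)
    budget = 0.2 * sum(e["loc"] for e in entities)
    total_bugs = sum(e["bugs"] for e in entities)
    read = found = 0
    for e in ranked:
        if read + e["loc"] > budget:
            break
        read += e["loc"]
        found += e["bugs"]
    return found / total_bugs

entities = [  # hypothetical classes: size, predicted probability, true bugs
    {"loc": 1000, "prob": 0.9, "bugs": 1},
    {"loc": 100,  "prob": 0.8, "bugs": 1},
    {"loc": 50,   "prob": 0.7, "bugs": 1},
    {"loc": 2000, "prob": 0.3, "bugs": 0},
]
print("PofB20 :", pofb20(entities, key=lambda e: e["prob"]))             # 0.0
print("NPofB20:", pofb20(entities, key=lambda e: e["prob"] / e["loc"]))  # ~0.67
```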
 
Article
Timely patching (i.e., the act of applying code changes to a program's source code) is paramount to safeguard users and maintainers against the dire consequences of malicious attacks. In practice, patching is prioritized following the nature of the code change that is committed in the code repository. When such a change is labeled as being security-relevant, i.e., as fixing a vulnerability, maintainers rapidly spread the change, and users are notified about the need to update to a new version of the library or of the application. Unfortunately, some security-relevant changes often go unnoticed, as they represent silent fixes of vulnerabilities. In this paper, we propose SSPCatcher, a Co-Training-based approach to catch security patches (i.e., patches that address vulnerable code) as part of an automatic monitoring service of code repositories. Leveraging different classes of features, we empirically show that such automation is feasible and can yield a precision of over 80% in identifying security patches, with an unprecedented recall of over 80%. Beyond such a benchmarking with ground truth data which demonstrates an improvement over the state-of-the-art, we confirmed that SSPCatcher can help catch security patches that were not reported as such.
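Co-training itself is a generic semi-supervised scheme: two classifiers trained on different feature views of the same commits repeatedly label unlabeled commits for each other when confident. A bare-bones sketch of that loop follows (an illustration of the general technique under synthetic data, not SSPCatcher's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1, X2, y, U1, U2, rounds=5, threshold=0.9):
    """X1/X2, U1/U2: two feature views of labeled and unlabeled commits."""
    for _ in range(rounds):
        c1 = LogisticRegression(max_iter=1000).fit(X1, y)
        c2 = LogisticRegression(max_iter=1000).fit(X2, y)
        p1 = c1.predict_proba(U1).max(axis=1)
        p2 = c2.predict_proba(U2).max(axis=1)
        confident = (p1 >= threshold) | (p2 >= threshold)
        if not confident.any():
            break
        # the more confident view supplies the pseudo-label
        pseudo = np.where(p1 >= p2, c1.predict(U1), c2.predict(U2))
        X1 = np.vstack([X1, U1[confident]])
        X2 = np.vstack([X2, U2[confident]])
        y = np.concatenate([y, pseudo[confident]])
        U1, U2 = U1[~confident], U2[~confident]
    return c1, c2

rng = np.random.default_rng(0)  # synthetic stand-in data
co_train(rng.normal(size=(40, 5)), rng.normal(size=(40, 3)),
         rng.integers(0, 2, 40),
         rng.normal(size=(200, 5)), rng.normal(size=(200, 3)))
```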
 
Article
Stack Overflow provides a means for developers to exchange knowledge. While much previous research on Stack Overflow has focused on questions and answers (Q&A), recent work has shown that discussions in comments also contain rich information. On Stack Overflow, discussions through comments and chat rooms can be tied to questions or answers. In this paper, we conduct an empirical study that focuses on the nature of question discussions. We observe that: (1) Question discussions occur at all phases of the Q&A process, with most beginning before the first answer is received. (2) Both askers and answerers actively participate in question discussions; the likelihood of their participation increases as the number of comments increases. (3) There is a strong correlation between the number of question comments and the question answering time (i.e., more discussed questions receive answers more slowly). Our findings suggest that question discussions contain a rich trove of data that is integral to the Q&A processes on Stack Overflow. We further suggest how future research can leverage the information in question discussions, along with the commonly studied Q&A information.
 
Article
In real-time systems, priorities assigned to real-time tasks determine the order of task executions, by relying on an underlying task scheduling policy. Assigning optimal priority values to tasks is critical to allow the tasks to complete their executions while maximizing safety margins from their specified deadlines. This enables real-time systems to tolerate unexpected overheads in task executions and still meet their deadlines. In practice, priority assignments result from an interactive process between the development and testing teams. In this article, we propose an automated method that aims to identify the best possible priority assignments in real-time systems, accounting for multiple objectives regarding safety margins and engineering constraints. Our approach is based on a multi-objective, competitive coevolutionary algorithm mimicking the interactive priority assignment process between the development and testing teams. We evaluate our approach by applying it to six industrial systems from different domains and several synthetic systems. The results indicate that our approach significantly outperforms both our baselines, i.e., random search and sequential search, and solutions defined by practitioners. Our approach scales to complex industrial systems as an offline analysis method that attempts to find near-optimal solutions within acceptable time, i.e., less than 16 hours.
 
Article
Ensuring the consistent usage of formatting conventions is an important aspect of modern software quality assurance. To do so, the source code of a project should be checked against the formatting conventions (or rules) adopted by its development team, and then the detected violations should be repaired, if any. While the former task can be automatically done by format checkers implemented in linters, there is no satisfactory solution for the latter. Manually fixing formatting convention violations is a waste of developer time, and code formatters do not take into account the conventions adopted and configured by developers for the used linter. In this paper, we present Styler, a tool dedicated to fixing formatting rule violations raised by format checkers using a machine learning approach. For a given project, Styler first generates training data by injecting violations of the project-specific rules into violation-free source code files. Then, it learns fixes by feeding long short-term memory neural networks with the training data encoded into token sequences. Finally, it predicts fixes for real formatting violations with the trained models. Currently, Styler supports a single checker, Checkstyle, which is a highly configurable and popular format checker for Java. In an empirical evaluation, Styler repaired 41% of 26,791 Checkstyle violations mined from 104 GitHub projects. Moreover, we compared Styler with the IntelliJ plugin CheckStyle-IDEA and the machine-learning-based code formatters Naturalize and CodeBuff. We found that Styler fixes violations of a diverse set of Checkstyle rules (24/25 rules), generates smaller repairs in comparison to the other systems, and predicts repairs in seconds once trained on a project. Through a manual analysis, we identified cases in which Styler does not succeed in generating correct repairs, which can guide further improvements to Styler. Finally, the results suggest that Styler can be useful in helping developers repair Checkstyle formatting violations.
 
Figure: The overview and summary of GBGallery, including game bug database construction, game testing framework construction, and empirical studies for benchmarking game testing techniques
Figure: Box plots of the number of bugs detected by each technique on the corresponding games
Figure: The overlapping bug counts (over 5 runs) detected by different testing tools on 5 games
Article
Software bug databases and benchmarks drive advances in automated software testing. In practice, real bugs occur sparsely relative to the amount of software code, and their extraction and curation are quite labor-intensive, yet they can be essential to facilitate the innovation of testing techniques. Over the past decade, several milestones have been reached in constructing bug databases, pushing the progress of automated software testing research. However, up to the present, there is no real-bug database and benchmark for game software, leaving current game testing research mostly stagnant. The lack of such a bug database and framework greatly limits the development of automated game testing techniques. To bridge this gap, we first perform large-scale real bug collection and manual analysis from 5 large commercial games, with a total of more than 250,000 lines of code. Based on this, we propose GBGallery, a game bug database and an extensible framework, to enable automated game testing research. In its initial version, GBGallery contains 76 real bugs from 5 games and incorporates 5 state-of-the-art testing techniques for comparative study as a baseline for further research. With GBGallery, we perform large-scale empirical studies and find that current automated game testing is still at an early stage, where new testing techniques for game software should be extensively investigated. We make GBGallery publicly available, hoping to facilitate game testing research.
 
Figure: Methodology
Figure: Discussion contents in SATD comments
Figure: Negativity in SATD comments and their priority
Figure: Survey responses on whether writing negative comments to indicate higher priority is acceptable (13% agree, 16% disagree, 38% strongly disagree)
Figure: Responses to closed questions of the survey
Article
Self-Admitted Technical Debt (SATD) consists of annotations (typically, but not only, source code comments) pointing out incomplete features, maintainability problems, or, in general, portions of a program that are not ready yet. The way a SATD comment is written, and specifically its polarity, may be a proxy indicator of the severity of the problem and, to some extent, of the priority with which it should be addressed. In this paper, we study the relationship between different types of SATD comments in source code and their polarity, to understand in which circumstances (and why) developers use negative or rather neutral comments to highlight an SATD. To address this goal, we combine a manual analysis of 1038 SATD comments from a curated dataset with a survey involving 46 professional developers. First of all, we categorize SATD content into its types. Then, we study the extent to which developers express negative sentiment in different types of SATD as a proxy for priority, and whether they believe this can be considered as an acceptable practice. Finally, we look at whether such annotations contain additional details such as bug references and developers' names/initials. Results of the study indicate that SATD comments are mainly used for annotating poor implementation choices (≃41%) and partially implemented functionality (≃22%). The latter may result from "waiting" for other features to be implemented, and this makes the corresponding SATD comments more negative than in other cases. Around 30% of the survey respondents agree on using/interpreting negative sentiment as a proxy for priority, while 50% of them indicate that it would be better to discuss SATD on issue trackers and not in the source code. However, while our study indicates that open-source developers use links to external systems, such as bug identifiers, to annotate high-priority SATD, better tool support is required for SATD management.
 
Article
Background: Research software plays an important role in solving real-life problems, empowering scientific innovations, and handling emergency situations. Therefore, the correctness and trustworthiness of research software are of absolute importance. Software testing is an important activity for identifying problematic code and helping to produce high-quality software. However, testing of research software is difficult due to the complexity of the underlying science, relatively unknown results from scientific algorithms, and the culture of the research software community. Aims: The goal of this paper is to better understand current testing practices, identify challenges, and provide recommendations on how to improve the testing process for research software development. Method: We surveyed members of the research software developer community to collect information regarding their knowledge about and use of software testing in their projects. Results: We analysed 120 responses and identified that even though research software developers report they have an average level of knowledge about software testing, they still find it difficult due to the numerous challenges involved. However, there are a number of ways, such as proper training, that can improve the testing process for research software. Conclusions: Testing can be challenging for any type of software. This difficulty is especially present in the development of research software, where software engineering activities are typically given less attention. To produce trustworthy results from research software, there is a need for a culture change so that testing is valued and teams devote appropriate effort to writing and executing tests.
 
Article
This paper presents an approach for the identification of vulnerable IoT applications. The approach focuses on a category of vulnerabilities that leads to sensitive information leakage and can be identified using taint flow analysis. Tainted-flow vulnerabilities are heavily influenced by the structure of the program and the order of the statements in the code, so an approach to detect such vulnerabilities needs to take this information into consideration in order to provide precise results. In this paper, we propose and develop an approach, FlowsMiner, that mines features from the code related to program structure, such as control statements and methods, in addition to the program's statement order. FlowsMiner generates features in the form of tainted flows. We developed Flows2Vec, a tool that transforms the features recovered by FlowsMiner into vectors, which are then used to aid the process of machine learning by providing a flow-aware model building process. The resulting model is capable of accurately classifying applications as vulnerable if the vulnerability is exhibited by changes in the order of statements in source code. When compared to a base Bag of Words (BoW) approach, the experiments show that the proposed approach improved the AUC of the prediction models for all algorithms; in the best case, the AUC improved from 0.91 to 0.94 for the Corpus1 dataset and from 0.56 to 0.96 for Corpus2.
 
Article
The accuracy of the SZZ algorithm is pivotal for just-in-time defect prediction because most prior studies have used the SZZ algorithm to detect defect-inducing commits to construct and evaluate their defect prediction models. The SZZ algorithm has two phases to detect defect-inducing commits: (1) linking issue reports in an issue-tracking system to possible defect-fixing commits in a version control system by using an issue-link algorithm (ILA); and (2) tracing the modifications of defect-fixing commits back to possible defect-inducing commits. Researchers and practitioners can address the second phase by using existing solutions such as a tool called cregit. In contrast, although various ILAs have been proposed for the first phase, no large-scale studies exist in which such ILAs are evaluated under the same experimental conditions. Hence, we still have no conclusion regarding the best-performing ILA for the first phase. In this paper, we compare 10 ILAs collected from our systematic literature study with regard to the accuracy of detecting defect-fixing commits. In addition, we compare the defect prediction performance of ILAs and their combinations that can detect defect-fixing commits accurately. We conducted experiments on five open-source software projects. We found that all ILAs and their combinations prevented the defect prediction model from being affected by missing defect-fixing commits. In particular, the combination of a natural language text similarity approach, Phantom heuristics, a random forest approach, and a support vector machine approach statistically significantly reduces the absolute differences from the ground-truth defect prediction performance. We summarize guidelines for using ILAs as our recommendations.
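The simplest ILAs in this family scan commit messages for issue references. The sketch below shows such a keyword-plus-identifier heuristic (an illustrative baseline, not one of the ten evaluated ILAs verbatim):

```python
import re

# match e.g. "fixes #42", "Closed #7", "resolve #123" in commit messages
ISSUE_REF = re.compile(r"(?:fix(?:es|ed)?|close[sd]?|resolve[sd]?)\s+#(\d+)",
                       re.IGNORECASE)

def link_commits(commits, issue_ids):
    """Return {issue_id: [sha, ...]} for commits mentioning a known issue."""
    links = {}
    for sha, message in commits:
        for match in ISSUE_REF.finditer(message):
            issue = int(match.group(1))
            if issue in issue_ids:
                links.setdefault(issue, []).append(sha)
    return links

commits = [("a1b2c3", "Fix NPE in parser, closes #42"),
           ("d4e5f6", "Refactor logging")]
print(link_commits(commits, issue_ids={42}))  # {42: ['a1b2c3']}
```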
 
Article
Effort estimation models are a fundamental tool in software management, used as a forecast for the resources, constraints and costs associated with software development. For Free/Open Source Software (FOSS) projects, effort estimation is especially complex: professional developers work alongside occasional, volunteer developers, so the overall effort (in person-months) becomes non-trivial to determine. The objective of this work is to develop a simple effort estimation model for FOSS projects, based on historic data of developers' effort. The model is fed with direct developer feedback to ensure its accuracy. After extracting the personal development profiles of several thousands of developers from 6 large FOSS projects, we asked them to fill in a questionnaire to determine whether they should be considered full-time developers in the project they work on. Their feedback was used to fine-tune the value of an effort threshold, above which developers can be considered full-time. With the help of the over 1,000 questionnaires received, we were able to determine, for every project in our sample, the threshold of commits that separates full-time from non-full-time developers. We finally offer guidelines and a tool to apply our model to FOSS projects that use a version control system.
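At its core, the model reduces to a single calibrated threshold t on monthly commit counts: developers at or above t count as one person-month each. A minimal sketch follows; the fractional handling of below-threshold developers is an assumption made for illustration, not necessarily the paper's exact formula.

```python
def effort_person_months(activity, t):
    """activity: {(developer, month): commits}; t: full-time commit threshold."""
    full = sum(1 for c in activity.values() if c >= t)
    partial = sum(c / t for c in activity.values() if c < t)  # assumed rule
    return full + partial

activity = {("alice", "2020-01"): 40, ("bob", "2020-01"): 3,
            ("alice", "2020-02"): 35, ("carol", "2020-02"): 12}
print(effort_person_months(activity, t=20))  # 2 full + 0.75 partial = 2.75
```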
 
Article
A Linux distribution consists of thousands of packages that are either developed by in-house developers (in-house packages) or by external projects (upstream packages). Leveraging upstream packages speeds up development and improves productivity, yet bugs might slip through into the packaged code and end up propagating into downstream Linux distributions. Maintainers, who integrate upstream projects into their distribution, typically lack the expertise of the upstream projects. Hence, they could try either to propagate the bug report upstream and wait for a fix, or fix the bug locally and maintain the fix until it is incorporated upstream. Both of these outcomes come at a cost, yet, to the best of our knowledge, no prior work has conducted an in-depth analysis of upstream bug management in the Linux ecosystem. Hence, this paper empirically studies how high-severity bugs are fixed in upstream packages for two Linux distributions, i.e., Debian and Fedora. Our results show that 13.9% of the upstream package bugs are explicitly reported being fixed by upstream, and 13.3% being fixed by the distribution, while the vast majority of bugs do not have explicit information about this in Debian. When focusing on the 27.2% with explicit information, our results also indicate that upstream fixed bugs make users wait for a longer time to get fixes and require more additional information compared to fixing upstream bugs locally by the distribution. Finally, we observe that the number of bug comment links to reference information (e.g., design docs, bug reports) of the distribution itself and the similarity score between upstream and distribution bug reports are important factors for the likelihood of a bug being fixed upstream. Our findings strengthen the need for traceability tools on bug fixes of upstream packages between upstream and distributions in order to find upstream fixes easier and lower the cost of upstream bug management locally.
 
Article
Highly-Configurable Software (HCS) testing is usually costly, as a significant number of variants need to be tested. This becomes more problematic when Continuous Integration (CI) practices are adopted. CI leads the software to be integrated and tested multiple times a day, subject to time constraints (budgets). To address CI challenges, a learning-based test case prioritization approach named COLEMAN has been successfully applied. COLEMAN deals with test case volatility, in which some test cases can be included/removed over the CI cycles. Nevertheless, such an approach does not consider HCS particularities such as, by analogy, the volatility of variants. Given such a context, this work introduces two strategies for applying COLEMAN in the CI of HCS: the Variant Test Set Strategy (VTS), which relies on the test set specific to each variant, and the Whole Test Set Strategy (WST), which prioritizes the test set composed of the union of the test cases of all variants. Both strategies are applied to two real-world HCSs, considering three test budgets. Independently of the time budget, the proposed strategies using COLEMAN have the best performance in comparison with solutions generated randomly and by another learning approach from the literature. Moreover, COLEMAN produces, in more than 92% of the cases, reasonable solutions that are near to the optimal solutions obtained by a deterministic approach. Both strategies take less than one second to execute. WST provides better results in the less restrictive budgets, and VTS the opposite. WST seems to better mitigate the problem of beginning without knowledge, and is more suitable when a new variant to be tested is added.
 
Article
Mutation testing exploits artificial faults to measure the adequacy of test suites and guide their improvement. It has become an extremely popular testing technique, as evidenced by the vast literature, numerous tools, and research events on the topic. Previous survey papers have successfully compiled the state of research, its evolution, problems, and challenges. However, the use of mutation testing in practice is still largely unexplored. In this paper, we report the results of a thorough study on the use of mutation testing in GitHub projects. Specifically, we first performed a search for mutation testing tools, 127 in total, and automatically searched GitHub repositories for evidence of their use. Then, we focused on the ten most widely used tools, based on the previous results, and manually reviewed and classified over 3.5K active GitHub repositories importing them. Among other findings, we observed a recent upturn in interest and activity, with Infection (PHP), PIT (Java) and Humbug (PHP) being the most widely used mutation tools in recent years. The predominant use of mutation testing is development, followed by teaching and learning, and research projects, although with significant differences between the mutation tools found in the literature (less adopted, and largely used in teaching and research) and those found in GitHub only (more popular, and more widely used in development). Our work provides a new and encouraging perspective on the state of practice of mutation testing.
 
Figure: The framework of our approach
Figure: Architecture of the CNN model
Figure: Relations (the number of common keywords) between different projects
Figure: F1-score achieved by incrementally adding 100 issue sections into the training dataset
Figure: Number of common keywords between different projects
Article
Technical debt is a metaphor indicating sub-optimal solutions implemented for short-term benefits by sacrificing the long-term maintainability and evolvability of software. A special type of technical debt is explicitly admitted by software engineers (e.g., using a TODO comment); this is called Self-Admitted Technical Debt or SATD. Most work on automatically identifying SATD focuses on source code comments. In addition to source code comments, issue tracking systems have been shown to be another rich source of SATD, but there are no approaches specifically for automatically identifying SATD in issues. In this paper, we first create a training dataset by collecting and manually analyzing 4,200 issues (which break down into 23,180 sections of issues) from seven open-source projects (i.e., Camel, Chromium, Gerrit, Hadoop, HBase, Impala, and Thrift) using two popular issue tracking systems (i.e., Jira and Google Monorail). We then propose and optimize an approach for automatically identifying SATD in issue tracking systems using machine learning. Our findings indicate that: 1) our approach outperforms baseline approaches by a wide margin with regard to the F1-score; 2) transferring knowledge from suitable datasets can improve the predictive performance of our approach; 3) extracted SATD keywords are intuitive and potentially indicate types and indicators of SATD; 4) projects using different issue tracking systems have fewer common SATD keywords compared to projects using the same issue tracking system; 5) a small amount of training data is needed to achieve good accuracy.
 
Figure: Program generation with relaxed methods during and after code generation
Figure: Line coverage achieved by the four generation methods; a single Csmith program covers around 150K lines for GCC and 100K for LLVM
Figure: Venn diagrams comparing the line coverage achieved by the four systems after each generates 135K programs
Article
Compiler fuzzing techniques require a means of generating programs that are free from undefined behaviour (UB) to reliably reveal miscompilation bugs. Existing program generators such as Csmith achieve UB-freedom by heavily restricting the form of generated programs. The idiomatic nature of the resulting programs risks limiting the test coverage they can offer, and thus the compiler bugs they can discover. We investigate the idea of adapting existing fuzzers to be less restrictive concerning UB, in the practical setting of C compiler testing via a new tool, CsmithEdge, which extends Csmith. CsmithEdge probabilistically weakens the constraints used to enforce UB-freedom, thus generated programs are no longer guaranteed to be UB-free. It then employs several off-the-shelf UB detection tools and a novel dynamic analysis to (a) detect cases where the generated program exhibits UB and (b) determine where Csmith has been too conservative in its use of safe math wrappers that guarantee UB-freedom for arithmetic operations, removing the use of redundant ones. The resulting UB-free programs can be used to test for miscompilation bugs via differential testing. The non-UB-free programs can still be used to check that the compiler under test does not crash or hang. Our experiments on recent versions of GCC, LLVM and the Microsoft Visual Studio Compiler show that CsmithEdge was able to discover 7 previously unknown miscompilation bugs (5 already fixed in response to our reports) that could not be found via intensive testing using Csmith, and 2 compiler-hang bugs that were fixed independently shortly before we considered reporting them.
 
Article
Some test amplification tools extend a manually created test suite with additional test cases to increase code coverage. The technique is effective, in the sense that it suggests strong and understandable test cases, generally adopted by software engineers. Unfortunately, the current state of the art for test amplification heavily relies on program analysis techniques which benefit a lot from explicit type declarations present in statically typed languages. In dynamically typed languages, such type declarations are not available, and as a consequence test amplification has yet to find its way to programming languages like Smalltalk, Python, Ruby and Javascript. We propose to exploit profiling information, readily obtainable by executing the associated test suite, to infer the necessary type information, creating special test inputs with corresponding assertions. We evaluated this approach on 52 selected test classes from 13 mature projects in the Pharo ecosystem containing approximately 400 test methods. We show improvements in killing new mutants and in mutation coverage in at least 28 out of 52 test classes (≈53%). Moreover, these generated tests are understandable by humans: 8 out of 11 submitted pull requests were merged into the main code base (≈72%). These results are comparable to the state of the art, hence we conclude that test amplification is feasible for dynamically typed languages.
 
Article
Developers sometimes choose design and implementation shortcuts due to the pressure from tight release schedules. However, shortcuts introduce technical debt that increases as the software evolves. The debt needs to be repaid as fast as possible to minimize its impact on software development and software quality. Sometimes, technical debt is admitted by developers in comments and commit messages. Such debt is known as self-admitted technical debt (SATD). In data-intensive systems, where data manipulation is a critical functionality, the presence of SATD in the data access logic could seriously harm performance and maintainability. Understanding the composition and distribution of the SATDs across software systems and their evolution could provide insights into managing technical debt efficiently. We present a large-scale empirical study on the prevalence, composition, and evolution of SATD in data-intensive systems. We analyzed 83 open-source systems relying on relational databases as well as 19 systems relying on NoSQL databases. We detected SATD in source code comments obtained from different snapshots of the subject systems. To understand the evolution dynamics of SATDs, we conducted a survival analysis. Next, we performed a manual analysis of 361 sample data-access SATDs, investigating the composition of data-access SATDs and the reasons behind their introduction and removal. We identified 15 new SATD categories, out of which 11 are specific to database access operations. We found that most of the data-access SATDs are introduced in the later stages of change history rather than at the beginning. We also observed that bug fixing and refactoring are the main reasons behind the introduction of data-access SATDs.
 
Article
Pull request latency evaluation is an essential application of effort evaluation in the pull-based development scenario. It can help the reviewers sort the pull request queue, remind developers about the review processing time, speed up the review process and accelerate software development. There is a lack of work that systematically organizes the factors that affect pull request latency. Also, there is no related work discussing the differences and variations in characteristics in different scenarios and contexts. In this paper, we collected relevant factors through a literature review approach. Then we assessed their relative importance in five scenarios and six different contexts using the mixed-effects linear regression model. The most important factors differ in different scenarios. The length of the description is most important when pull requests are submitted. The existence of comments is most important when closing pull requests, using CI tools, and when the contributor and the integrator are different. When there exist comments, the latency of the first comment is the most important. Meanwhile, the influence of factors may change in different contexts. For example, the number of commits in a pull request has a more significant impact on pull request latency when closing than submitting due to changes in contributions brought about by the review process. Both human and bot comments are positively correlated with pull request latency. In contrast, the bot’s first comments are more strongly correlated with latency, but the number of comments is less correlated. Future research and tool implementation needs to consider the impact of different contexts. Researchers can conduct related studies based on our publicly available datasets and replication scripts.
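As an illustration of the analysis style (not the authors' exact model or factor set), a mixed-effects linear regression with the project as a random effect can be fitted in a few lines; the column names and synthetic data below are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "latency_h": rng.exponential(24, n),           # pull request latency (hours)
    "desc_len": rng.integers(0, 500, n),           # description length
    "num_commits": rng.integers(1, 10, n),         # commits in the pull request
    "has_comments": rng.integers(0, 2, n),         # whether comments exist
    "project": rng.choice(["p1", "p2", "p3"], n),  # grouping (random effect)
})
model = smf.mixedlm("np.log1p(latency_h) ~ desc_len + num_commits + has_comments",
                    df, groups=df["project"]).fit()
print(model.summary())
```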
 
Article
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.
 
Article
Code smells, also known as anti-patterns, are poor design or implementation choices that hinder program comprehensibility and maintainability. While several code smell detection methods have been proposed, Mantyla et al. identified the uncertainty issue as one of the major individual human factors that may affect developers' decisions about the smelliness of software classes: they may indeed have different opinions mainly due to their different knowledge and expertise. Unfortunately, almost all the existing approaches assume data perfection and neglect the uncertainty when identifying the labels of the software classes. Ignoring or rejecting any form of uncertainty could lead to a considerable loss of information, which could significantly deteriorate the effectiveness of the detection and identification processes. Inspired by our previous works and motivated by the interesting performance of the PDT (Possibilistic Decision Tree) in classifying uncertain data, we propose ADIPE (Anti-pattern Detection and Identification using Possibilistic decision tree Evolution), a new tool that evolves and optimizes a set of detectors (PDTs) that can effectively deal with the uncertainty of software class labels using concepts from Possibility theory. ADIPE uses a PBE (Possibilistic Base of Examples: a dataset with possibilistic labels) that is built using a set of opinion-based classifiers (i.e., a set of probabilistic classifiers) with the aim of simulating human developers' uncertainty. A set of advisors and probabilistic classifiers are employed in order to mimic the subjectivity and doubtfulness of software engineers. A detailed experimental study is conducted to show the merits and outperformance of ADIPE in dealing with uncertainty in code smell detection and identification with respect to four relevant state-of-the-art methods, including the baseline PDT. The experimental study was performed in uncertain and certain environments based on two suitable metrics: PF-measure_dist (Possibilistic F-measure_Distance) and IAC (Information Affinity Criterion), which correspond to the F-measure and Accuracy (PCC) in the certain case. The obtained results for the uncertain environment reveal that, for the detection process, the PF-measure_dist of ADIPE ranges within [0.9047, 0.9285] and its IAC lies within [0.9288, 0.9557]; for the identification process, the PF-measure_dist of ADIPE is in [0.8545, 0.9228] and its IAC lies within [0.8751, 0.933]. ADIPE is able to find 35% more code smells with uncertain data than the second-best algorithm (i.e., BLOP). In addition, ADIPE succeeds in decreasing the number of false alarms (i.e., misclassified smelly instances) at a rate of 12%. Our proposed approach is also able to identify 43% more smell types than BLOP and decreases the number of false alarms at a rate of 32%. Similar results were obtained for the certain environment, which demonstrates the ability of ADIPE to also deal with the certain environment. Graphical abstract: A Possibilistic Evolutionary Approach for Code Smells Detection.
 
Article
Understanding program code is a complicated endeavor. As a result, studying code comprehension is also hard. The prevailing approach for such studies is to use controlled experiments, where the difference between treatments sheds light on factors which affect comprehension. But it is hard to conduct controlled experiments with human developers, and we also need to find a way to operationalize what “comprehension” actually means. In addition, myriad different factors can influence the outcome, and seemingly small nuances may be detrimental to the study’s validity. In order to promote the development and use of sound experimental methodology, we discuss both considerations which need to be applied and potential problems that might occur, with regard to the experimental subjects, the code they work on, the tasks they are asked to perform, and the metrics for their performance. A common thread is that decisions that were taken in an effort to avoid one threat to validity may pose a larger threat than the one they removed.
 
Article
As a result of the COVID-19 pandemic, many agile practitioners had to transition into a remote work environment. Despite remote work not being a new concept for agile software practitioners, the forced or recommended nature of remote work is new. This study investigates how the involuntary shift to remote work and how social restrictions imposed by the COVID-19 pandemic have affected agile software development (ASD), and how agile practitioners have been affected in terms of ways of working. An explanatory sequential mixed methods study was performed. Data were collected one year into the COVID-19 pandemic through a questionnaire with 96 respondents and in-depth semi-structured interviews with seven practitioners from seven different companies. Data were analyzed through Bayesian analysis and thematic analysis. The results show, in general, that the aspects of ASD that have been the most affected is communication and social interactions, while technical work aspects have not experienced the same changes. Moreover, feeling forced to work remotely has a significant impact on different aspects of ASD, e.g., productivity and communication, and industry practitioners’ employment of agile development and ways of working have primarily been affected by the lack of social interaction and the shift to digital communication. The results also suggest that there may be a group maturing debt when teams do go back into office, as digital communication and the lack of psychological safety stand in the way for practitioners’ ability to have sensitive discussions and progress as a team in a remote setting.
 
Article
Software developed on public platforms is a source of data that can be used to make predictions about those projects. While individual development activity may be random and hard to predict, development behavior at the project level can be predicted with good accuracy when large groups of developers work together on software projects. To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes. Algorithms like k-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates. But that error rate can be greatly reduced using hyperparameter optimization. To the best of our knowledge, this is the largest study yet conducted, using recent data, for predicting multiple health indicators of open-source projects. To facilitate open science (and replications and extensions of this work), all our materials are available online at https://github.com/arennax/Health_Indicator_Prediction.
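The reported effect of hyperparameter optimization is easy to reproduce in miniature: the sketch below compares a default random forest regressor against a grid-searched one on synthetic data standing in for the project-health indicators.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))                      # synthetic project features
y = X @ rng.normal(size=6) + rng.normal(scale=0.3, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

default = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
tuned = GridSearchCV(RandomForestRegressor(random_state=0),
                     {"n_estimators": [50, 200], "max_depth": [3, None],
                      "min_samples_leaf": [1, 5]}).fit(X_tr, y_tr)

print("default MAE:", mean_absolute_error(y_te, default.predict(X_te)))
print("tuned   MAE:", mean_absolute_error(y_te, tuned.predict(X_te)))
```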
 
Article
Code review plays an important role in software quality control. A typical review process involves a careful check of a piece of code in an attempt to detect and locate defects and other quality issues/violations. One type of issue that may impact the quality of software is code smells, i.e., bad coding practices that may lead to defects or maintenance issues. Yet, little is known about the extent to which code smells are identified during modern code review. To investigate the concept behind code smells identified in modern code review and what actions reviewers suggest and developers take in response to the identified smells, we conducted an empirical study of code smells in code reviews by analyzing reviews from four large open source projects from the OpenStack (Nova and Neutron) and Qt (Qt Base and Qt Creator) communities. We manually checked a total of 25,415 code review comments obtained by keyword search and random selection; this resulted in the identification of 1,539 smell-related reviews, which then allowed the study of the causes of code smells, actions taken against identified smells, time taken to fix identified smells, and reasons why developers ignored fixing identified smells. Our analysis found that 1) code smells were not commonly identified in code reviews, 2) smells were usually caused by violation of coding conventions, 3) reviewers usually provided constructive feedback, including fixing (refactoring) recommendations to help developers remove smells, 4) developers generally followed those recommendations and actioned the changes, 5) once identified by reviewers, it usually takes developers less than one week to fix the smells, and 6) the main reason why developers chose to ignore the identified smells is that it is not worth fixing the smell. Our results suggest the following: 1) developers should closely follow coding conventions in their projects to avoid introducing code smells, 2) review-based detection of code smells is perceived to be a trustworthy approach by developers, mainly because reviews are context-sensitive (as reviewers are more aware of the context of the code given that they are part of the project's development team), and 3) program context needs to be fully considered in order to make a decision of whether to fix the identified code smell immediately.
 
Article
Logging plays a crucial role in software engineering because it is key to performing various tasks, including debugging, performance analysis, and detection of anomalies. Despite the importance of log data, the practice of logging still suffers from a lack of common guidelines and best practices. Recent studies investigated logging in C/C++ and Java open-source systems. In this paper, we complement these studies by conducting the first empirical study on logging practices in the Linux kernel, one of the most elaborate open-source development projects in the computer industry. We analyze 22 Linux releases with a focus on three main aspects: the pervasiveness of logging in Linux, the types of changes made to logging statements, and the rationale behind these changes. Our findings show that logging code accounts for 3.73% of the total source code in the Linux kernel, distributed across 72.36% of Linux files. We also found that the distribution of logging statements across Linux subsystems and their components varies significantly with no apparent reason, suggesting that developers use different criteria when logging. In addition, we observed a slow decrease in the use of logging: a reduction of 9.27% between versions v4.3 and v5.3. The majority of changes in logging code are made to fix language issues, modify log levels, and upgrade logging code to use new logging libraries, with the overall goal of improving the precision and consistency of the log output. Many recommendations are derived from our findings, such as the use of static analysis tools to detect log-related issues, the adoption of common writing styles to improve the quality of log messages, the development of conventions to guide developers when selecting log levels, and the establishment of review sessions for logging code. Our recommendations can serve as a basis for developing logging guidelines as well as better logging processes, tools, and techniques.
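To illustrate the pervasiveness measurement, a minimal sketch that estimates the fraction of lines in a C file occupied by common kernel logging calls (printk and the pr_*/dev_* helpers). A line-based grep is an assumption for brevity; a real analysis would parse the code.

```python
# Assumes kernel source files on disk; the pattern covers printk and
# common pr_*/dev_* logging helpers, a simplification of real practice.
import re
from pathlib import Path

LOG_CALL = re.compile(
    r"\b(printk|pr_(emerg|alert|crit|err|warn|notice|info|debug)"
    r"|dev_(err|warn|info|dbg))\s*\("
)

def logging_density(path):
    """Fraction of lines in a C file containing a logging call."""
    lines = Path(path).read_text(errors="ignore").splitlines()
    log_lines = sum(1 for line in lines if LOG_CALL.search(line))
    return log_lines / max(len(lines), 1)
```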
 
Article
Technical collaboration between multiple contributors is a natural phenomenon in distributed open source software development projects. Macro-collaboration, where each code commit is attributed to a single collaborator, has been extensively studied in the research literature. This is much less the case for so-called micro-collaboration practices, in which multiple authors contribute to the same commit. To support such practices, GitLab and GitHub started supporting social coding mechanisms such as the "Co-Authored-By:" trailer in commit messages, which in turn enables the empirical study of such micro-collaboration. In order to understand the mechanisms, benefits, and limitations of micro-collaboration, this article provides an exemplar case study of collaboration practices in the OpenStack ecosystem. Following a mixed-method research approach, we provide qualitative evidence through a thematic and content analysis of semi-structured interviews with 16 OpenStack contributors. We contrast their perception with quantitative evidence gained by statistical analysis of the git commit histories (∼1M commits) and Gerrit code review histories (∼631K change sets and ∼2M patch sets) of 1,804 OpenStack project repositories over a 9-year period. Our findings provide novel empirical insights for practitioners to promote micro-collaborative coding practices, and for academics to conduct further research towards understanding and automating the micro-collaboration process.
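A minimal sketch of how such trailers can be mined: extracting "Co-Authored-By:" lines from raw commit messages (e.g., the output of `git log --format=%B`). The parsing is simplified; production tooling would use `git interpret-trailers`.

```python
# Assumes raw commit message text; names and emails below are examples.
import re

TRAILER = re.compile(
    r"^Co-Authored-By:\s*(?P<name>[^<]+)<(?P<email>[^>]+)>",
    re.IGNORECASE | re.MULTILINE,
)

def co_authors(commit_message):
    """Return (name, email) pairs declared as co-authors of a commit."""
    return [(m["name"].strip(), m["email"].strip())
            for m in TRAILER.finditer(commit_message)]

msg = """Fix race in scheduler

Co-Authored-By: Alice Example <alice@example.org>
Co-Authored-By: Bob Example <bob@example.org>
"""
print(co_authors(msg))
```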
 
Article
Software systems are designed according to guidelines and constraints defined by business rules. Some of these constraints define the allowable or required values for data handled by the systems. These data constraints usually originate from the problem domain (e.g., regulations), and developers must write code that enforces them. Understanding how data constraints are implemented is essential for testing, debugging, and software change. Unfortunately, there are no widely-accepted guidelines or best practices on how to implement data constraints. This paper presents an empirical study that investigates how data constraints are implemented in Java. We study the implementation of 187 data constraints extracted from the documentation of eight real-world Java software systems. First, we perform a qualitative analysis of the textual description of data constraints and identify four data constraint types. Second, we manually identify the implementations of these data constraints and reveal that they can be grouped into 31 implementation patterns. The analysis of these implementation patterns indicates that developers prefer a handful of patterns when implementing data constraints. We also found evidence suggesting that deviations from these patterns are associated with unusual implementation decisions or code smells. Third, we develop a tool-assisted protocol that allows us to identify 256 additional trace links for the data constraints implemented using the 13 most common patterns. We find that almost half of these data constraints have multiple enforcing statements, which are code clones of different types. Finally, a study with 16 professional developers indicates that the patterns we describe can be easily and accurately recognized in Java code.
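The study above targets Java; purely for consistency with the other sketches in this listing, here is one common enforcement shape, "check and raise at the boundary", rendered in Python. The constraint itself (a percentage in [0, 100]) is an invented illustration, not one of the paper's 187 constraints.

```python
# Illustrative only: the paper's patterns are identified in Java code.
class Discount:
    def __init__(self, percentage: float):
        # Enforcing statement for the data constraint: reject values
        # outside the domain-allowed range.
        if not 0 <= percentage <= 100:
            raise ValueError(f"percentage must be in [0, 100], got {percentage}")
        self.percentage = percentage
```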
 
Article
Understanding software evolution is essential for software development tasks, including debugging, maintenance, and testing. As a software system evolves, it grows in size and becomes more complex, hindering its comprehension. Researchers have proposed several approaches for software quality analysis based on software metrics. One of the primary practices is predicting defects across software components in the codebase to improve agile product quality. While several software metrics exist, graph-based metrics have rarely been utilized in software quality. In this paper, we explore recent network comparison advancements to characterize software evolution and focus on aiding software metrics analysis and defect prediction. We support our approach with an automated tool named GraphEvoDef. Particularly, GraphEvoDef provides three major contributions: (1) detecting and visualizing significant events in software evolution using call graphs, (2) extracting metrics that are suitable for software comprehension, and (3) detecting and estimating the number of defects in a given code entity (e.g., class). One of our major findings is the usefulness of the Network Portrait Divergence metric, borrowed from the information theory domain, to aid the understanding of software evolution. To validate our approach, we examined 29 different open-source Java projects from GitHub and then demonstrated the proposed approach using 9 use cases with defect data from the PROMISE dataset. We also trained and evaluated defect prediction models for both classification and regression tasks. Our proposed technique achieves an 18% reduction in mean square error and a 48% increase in the squared correlation coefficient over state-of-the-art approaches in the defect prediction domain.
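A hedged sketch of the portrait idea behind Network Portrait Divergence: a portrait entry B[(l, k)] counts nodes that have exactly k other nodes at shortest-path distance l. For brevity this sketch flattens each portrait into a normalized distribution and compares them with Jensen-Shannon divergence; the published metric (Bagrow & Bollt) uses a specific pair-weighted distribution, so treat this as an approximation, with toy graphs standing in for two call-graph versions.

```python
from collections import Counter

import networkx as nx
import numpy as np
from scipy.spatial.distance import jensenshannon

def portrait(g, max_l):
    """B[(l, k)] = number of nodes with exactly k nodes at distance l."""
    counts = Counter()
    for src in g:
        dist = nx.single_source_shortest_path_length(g, src)
        per_l = Counter(dist.values())
        for l in range(1, max_l + 1):
            counts[(l, per_l.get(l, 0))] += 1
    return counts

def portrait_divergence(g1, g2):
    # Assumes connected graphs (nx.diameter requires connectivity).
    max_l = max(nx.diameter(g1), nx.diameter(g2))
    p1, p2 = portrait(g1, max_l), portrait(g2, max_l)
    keys = sorted(set(p1) | set(p2))
    v1 = np.array([p1.get(k, 0) for k in keys], dtype=float)
    v2 = np.array([p2.get(k, 0) for k in keys], dtype=float)
    return jensenshannon(v1 / v1.sum(), v2 / v2.sum()) ** 2

g_old, g_new = nx.path_graph(6), nx.cycle_graph(6)  # toy call-graph versions
print(portrait_divergence(g_old, g_new))
```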
 
Article
Many software developers started to work from home on short notice during the early periods of COVID-19. A number of previous papers have studied the wellbeing and productivity of software developers during COVID-19; these studies mainly use surveys based on predefined questionnaires. In this paper, we investigate the problems and joys that software developers experienced during the early months of COVID-19 by analyzing their discussions in the online forum devRant, where discussions can be open and not bound by predefined survey questionnaires. The devRant platform is designed for developers to share the joys and frustrations of their lives. We manually analyze 825 devRant posts between January and April 12, 2020 that developers created to discuss their situation during COVID-19. The WHO declared COVID-19 a pandemic on March 11, 2020; as such, our data offers insights into the early months of COVID-19. We manually label each post along two dimensions: the topics of the discussion and the expressed sentiment polarity (positive, negative, neutral). We observed 19 topics that we group into six categories: Workplace & Professional aspects, Personal & Family well-being, Technical Aspects, Lockdown preparedness, Financial concerns, and Societal and Educational concerns. Around 49% of the discussions are negative and 26% are positive. We find evidence of developers' struggles with a lack of documentation for remote work and with loneliness while working from home. We find stories of job loss with little or no savings to fall back on. The analysis of developer discussions in the early months of a pandemic will help various stakeholders (e.g., software companies) make important decisions early to alleviate developer problems if such a pandemic or similar emergency situation occurs in the near future. Software engineering research can make further efforts to develop automated tools for remote work (e.g., automated documentation).
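A minimal sketch of the post-selection step: filtering forum posts for COVID-19-related keywords within the study window before manual topic and sentiment labeling. The keyword list and post format are illustrative assumptions, not the study's instrument.

```python
# Assumes posts as dicts with a creation date and text body.
import re
from datetime import date

COVID_KEYWORDS = ["covid", "corona", "pandemic", "quarantine", "lockdown", "wfh"]
pattern = re.compile("|".join(COVID_KEYWORDS), re.IGNORECASE)

posts = [
    {"created": date(2020, 3, 15), "text": "WFH week two, still no documentation..."},
    {"created": date(2020, 2, 1), "text": "Tabs vs spaces, round 94."},
]

# Keep keyword matches created on or before the study cutoff date.
relevant = [p for p in posts
            if pattern.search(p["text"]) and p["created"] <= date(2020, 4, 12)]
print(len(relevant))  # 1
```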
 
Article
Nowadays there is increased pressure on mobile app developers to take non-functional properties into account. An app that is too slow or uses too much bandwidth will decrease user satisfaction and can thus lead to users simply abandoning the app. Although automated software improvement techniques exist for traditional software, these are not as prevalent in the mobile domain. Moreover, it is yet unknown whether the same software changes would be as effective. With that in mind, we mined 100 Android repositories to find out how developers improve the execution time, memory consumption, bandwidth usage, and frame rate of mobile apps. We categorised performance-related non-functional property (NFP) improving commits to see how existing automated software improvement techniques can be improved. Our results show that although performance-related NFP-improving commits are rare, such improvements appear throughout the development lifecycle. We found 560 NFP commits out of a total of 74,408 commits analysed. Memory consumption is sacrificed most often when improving execution time or bandwidth usage, although similar types of changes can improve multiple non-functional properties at once. Code deletion is the most frequently utilised strategy, except for frame rate, where an increase in concurrency is the dominant strategy. We find that automated software improvement techniques for the mobile domain can benefit from the addition of SQL query improvement, caching, and asset manipulation. Moreover, we provide a classifier which can drastically reduce the manual effort needed to analyse NFP-improving commits.
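A minimal sketch of keyword-based triage for NFP-improving commits, roughly in the spirit of the mining described above; the keyword map is an illustrative assumption, not the study's classifier.

```python
# Maps each non-functional property to invented trigger phrases.
NFP_KEYWORDS = {
    "execution time": ["slow", "speed up", "latency", "faster"],
    "memory": ["memory leak", "oom", "footprint", "allocation"],
    "bandwidth": ["bandwidth", "network usage", "payload", "compress"],
    "frame rate": ["fps", "frame rate", "jank", "dropped frames"],
}

def nfp_categories(commit_message):
    """Return the NFPs a commit message appears to target."""
    msg = commit_message.lower()
    return [prop for prop, words in NFP_KEYWORDS.items()
            if any(w in msg for w in words)]

print(nfp_categories("Compress payload to cut network usage"))  # ['bandwidth']
```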
 
Article
The principal focus of software product management is to ensure the economic success of the product, which means prolonging the product's life as much as possible with modest expenditures to maximize profits. Software product managers play an important role in the software development organization, being responsible for the strategy, business case, product roadmap, high-level requirements, product deployment (release management), and retirement plan. This article explores the problems that affect the software product management process, along with their perceived frequency and perceived severity. The data were collected by a systematic literature review (5 main databases were analyzed), interviews (10 software product managers from IT companies), and surveys (89 participants). In total, 95 software product management problems, assigned non-exclusively to 7 areas, were identified; these were narrowed down to the 27 problems mentioned in at least 3 interviews, which were then evaluated and prioritized by their perceived frequency and perceived severity. The problems perceived as the most frequent are: determining the true value of the product that the customer needs, frequently changing strategy and priorities, technical debt, working in silos, and balancing between reactive and proactive work. Some of the identified problems span beyond the software product management process itself, but they all affect the work of software product managers.
 
Article
Over time, software systems tend to increase in complexity and become harder to maintain. While the drawbacks of code complexity are well-known, complex code is present in most real software projects. Here, an important question arises: why, with all the advice out there against it, do we continue to end up with complex methods? Unfortunately, code complexity is typically assessed for a single programming language (often Java), reducing the generality of findings. Thus, assessing how and why complex code evolves in multiple programming languages is fundamental to address this limitation. In this paper, we provide a multi-language empirical study to assess the evolution of complex methods and a survey to understand developers’ perceptions. We analyze 1,000 complex methods (according to cyclomatic complexity) of 50 popular projects written in JavaScript, Python, Java, C++, and C# and we perform a survey with over 70 developers. We find that programming language plays an important role in the study of code complexity. For example, C++ and Python projects have more methods that increase complexity over time, whereas Java and C# present more efforts to reduce it. Moreover, the developers’ perception of complexity is subjective and varies per programming language: many analyzed methods are not considered complex by developers, while others are considered well-written or harmless. Furthermore, developers may deliberately avoid refactoring complex code due to several reasons, including code stability and lack of refactoring priority. In some cases, developers are satisfied with complexity or even want to purposely expose it. Finally, based on our findings, we discuss insights for researchers and practitioners.
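To make the complexity metric concrete, a minimal sketch that approximates McCabe's cyclomatic complexity for a Python function as 1 plus the number of branching constructs. Real studies use proper per-language parsers; this AST node count is only a rough stand-in.

```python
# Counts common branch-introducing AST nodes; an approximation.
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source):
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

src = """
def f(x):
    if x > 0:
        for i in range(x):
            if i % 2 == 0 and i > 2:
                x += i
    return x
"""
print(cyclomatic_complexity(src))  # 5: base 1 + if + for + if + bool-op
```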
 
Article
With the increasing demand for customized systems and rapidly evolving technology, software engineering faces many challenges. A particular challenge is the development and maintenance of systems that are highly variable both in space (concurrent variations of the system at one point in time) and time (sequential variations of the system, due to its evolution). Recent research aims to address this challenge by managing variability in space and time simultaneously. However, this research originates from two different areas, software product line engineering and software configuration management, resulting in non-uniform terminologies and a varying understanding of concepts. These problems hamper the communication and understanding of involved concepts, as well as the development of techniques that unify variability in space and time. To tackle these problems, we performed an iterative, expert-driven analysis of existing tools from both research areas to derive a conceptual model that integrates and unifies concepts of both dimensions of variability. In this article, we first explain the construction process and present the resulting conceptual model. We validate the model and discuss its coverage and granularity with respect to established concepts of variability in space and time. Furthermore, we perform a formal concept analysis to discuss the commonalities and differences among the tools we considered. Finally, we show illustrative applications to explain how the conceptual model can be used in practice to derive conforming tools. The conceptual model unifies concepts and relations used in software product line engineering and software configuration management, provides a unified terminology and common ground for researchers and developers for comparing their works, clarifies communication, and prevents redundant developments.
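To illustrate the formal concept analysis step on a toy tool-by-concept context (the tools and attributes below are invented), a minimal sketch: every formal concept is the closure of some attribute subset, so brute force over attribute subsets recovers all concepts for small contexts.

```python
from itertools import combinations

# Toy context: which variability concepts each tool supports (invented).
context = {
    "ToolA": {"variability in space", "variants"},
    "ToolB": {"variability in time", "revisions"},
    "ToolC": {"variability in space", "variability in time", "variants"},
}
attributes = sorted(set().union(*context.values()))

def extent(intent):
    """Objects having all attributes in the intent."""
    return frozenset(o for o, a in context.items() if intent <= a)

def intent(objects):
    """Attributes shared by all given objects."""
    if not objects:
        return frozenset(attributes)
    return frozenset.intersection(*(frozenset(context[o]) for o in objects))

concepts = set()
for r in range(len(attributes) + 1):
    for subset in combinations(attributes, r):
        e = extent(set(subset))
        concepts.add((e, intent(e)))  # closure yields a formal concept

for e, i in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(e), "<->", sorted(i))
```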
 
Article
Software companies commonly develop and maintain variants of systems, with different feature combinations for different customers. Thus, they must cope with variability in space. Software companies further must cope with variability in time, when updating system variants by revising existing software features. Inevitably, variants evolve orthogonally along these two dimensions, resulting in challenges for software maintenance. Our work addresses this challenge with ECSEST (Extraction and Composition for Systems Evolving in Space and Time), an approach for locating feature revisions and composing variants with different feature revisions. We evaluated ECSEST using feature revisions and variants from six highly configurable open source systems. To assess the correctness of our approach, we compared the artifacts of input variants with the artifacts from the corresponding composed variants based on the implementation of the extracted features. The extracted traces allowed composing variants with 99-100% precision, as well as with 97-99% average recall. Regarding the composition of variants with new configurations, our approach can combine different feature revisions with 99% precision and recall on average. Additionally, our approach retrieves hints when composing new configurations, which are useful to find artifacts that may have to be added or removed for completing a product. The hints help to understand possible feature interactions or dependencies. The average time to locate feature revisions ranged from 25 to 250 seconds, whereas the average time for composing a variant was 18 seconds. Therefore, our experiments demonstrate that ECSEST is feasible and effective.
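A minimal sketch of the correctness check described above: comparing the artifact set of a composed variant against the corresponding ground-truth input variant. The artifact identifiers are illustrative.

```python
def precision_recall(composed, ground_truth):
    """Set-based precision/recall of a composed variant's artifacts."""
    composed, ground_truth = set(composed), set(ground_truth)
    tp = len(composed & ground_truth)
    precision = tp / len(composed) if composed else 1.0
    recall = tp / len(ground_truth) if ground_truth else 1.0
    return precision, recall

composed = {"Parser.java", "Lexer.java", "Cache.java"}
truth = {"Parser.java", "Lexer.java", "Logger.java"}
print(precision_recall(composed, truth))  # (0.666..., 0.666...)
```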
 
Article
Recent machine learning studies present accurate results in generating prediction models to identify refactoring operations for a program. However, such works are limited to prediction, i.e., they learn refactoring operations strictly as applied by developers, whereas there may be possibilities that developers did not consider. On the other hand, the Search-Based Software Refactoring (SBR) field applies search algorithms to find refactoring operations in a vast space of possibilities to improve diverse quality attributes. Nevertheless, existing SBR approaches do not generate a model as machine learning studies do, and thus they need to be reapplied individually to each program needing refactoring. To mitigate this limitation, this work introduces a novel SBR learning approach that generates refactoring algorithms capable of providing refactoring operations to several programs. These algorithms are composed of procedures that use rules to determine the refactoring operations. To create the algorithms, a learning process first extracts refactoring patterns from programs by grouping their elements that were refactored in similar ways. After that, Grammatical Evolution (GE) is applied to generate the algorithms based on a grammar encompassing details of the extracted patterns. GE works to generate an algorithm that provides refactoring operations similar to those applied in practice while improving quality attributes, such as modularity. The approach is evaluated using refactoring data from 40 Java programs in GitHub repositories. The algorithms are tested against different programs, obtaining an overall average of 60% modularity improvement and 50% similarity with actual refactoring operations.
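A minimal sketch of the grammatical-evolution idea: a codon sequence (the genotype) is mapped to a sentence of a grammar by repeatedly expanding the leftmost nonterminal, choosing productions by codon modulo the number of alternatives. The toy refactoring-rule grammar below is invented for illustration, not the paper's grammar.

```python
# Toy grammar: a rule pairs a condition with a refactoring (invented).
GRAMMAR = {
    "<rule>": [["if ", "<cond>", " then ", "<refactoring>"]],
    "<cond>": [["method.loc > 50"], ["class.methods > 20"]],
    "<refactoring>": [["extract_method"], ["move_method"], ["extract_class"]],
}

def map_genotype(codons, symbol="<rule>"):
    """Expand leftmost nonterminals; codons wrap around if exhausted."""
    out, stack, i = [], [symbol], 0
    while stack:
        sym = stack.pop(0)
        if sym in GRAMMAR:
            productions = GRAMMAR[sym]
            choice = productions[codons[i % len(codons)] % len(productions)]
            i += 1
            stack = list(choice) + stack
        else:
            out.append(sym)
    return "".join(out)

print(map_genotype([7, 2, 4]))  # "if method.loc > 50 then move_method"
```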
 
Article
The increasing interest in open source software has led to the emergence of large language-specific package distributions of reusable software libraries, such as npm and RubyGems. These software packages can be subject to vulnerabilities that may expose dependent packages through explicitly declared dependencies. Using Snyk’s vulnerability database, this article empirically studies vulnerabilities affecting npm and RubyGems packages. We analyse how and when these vulnerabilities are disclosed and fixed, and how their prevalence changes over time. We also analyse how vulnerable packages expose their direct and indirect dependents to vulnerabilities. We distinguish between two types of dependents: packages distributed via the package manager, and external GitHub projects depending on npm packages. We observe that the number of vulnerabilities in npm is increasing and being disclosed faster than vulnerabilities in RubyGems. For both package distributions, the time required to disclose vulnerabilities is increasing over time. Vulnerabilities in npm packages affect a median of 30 package releases, while this is 59 releases in RubyGems packages. A large proportion of external GitHub projects is exposed to vulnerabilities coming from direct or indirect dependencies. 33% and 40% of dependency vulnerabilities to which projects and packages are exposed, respectively, have their fixes in more recent releases within the same major release range of the used dependency. Our findings reveal that more effort is needed to better secure open source package distributions.
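A minimal sketch of the "fix within the same major release range" check: given the dependency version a project uses and the first fixed release of a vulnerability, is an upgrade possible without crossing a major version boundary? Plain "x.y.z" version strings are assumed; real data needs a proper semver parser.

```python
def parse(v):
    """Parse a plain 'x.y.z' version string into a comparable tuple."""
    return tuple(int(p) for p in v.split("."))

def fix_in_same_major(used, first_fixed):
    u, f = parse(used), parse(first_fixed)
    return f[0] == u[0] and f > u

print(fix_in_same_major("4.17.11", "4.17.21"))  # True: same major range
print(fix_in_same_major("4.17.11", "5.0.0"))    # False: major bump needed
```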
 
Article
Applying mutation testing to test subtle program changes, such as program patches or other small-scale code modifications, requires using mutants that capture the delta of the altered behaviours. To address this issue, we introduce the concept of commit-relevant mutants, which are the mutants that interact with the behaviours of the system affected by a particular commit. Commit-aware mutation testing is therefore a test assessment metric tailored to a specific commit. By analysing 83 commits from 25 projects involving 2,253,610 mutants in both C and Java, we identify the commit-relevant mutants and explore their relationship with other categories of mutants. Our results show that commit-relevant mutants represent a small subset of all mutants, which differs from the other classes of mutants (subsuming and hard-to-kill), and that the commit-relevant mutation score is weakly correlated with the traditional mutation score (Kendall/Pearson 0.15-0.4). Moreover, commit-aware mutation analysis provides insights about the testing of a commit that can be more efficient than classical mutation analysis; in our experiments, for the same number of mutants analysed, commit-aware mutants have better fault-revelation potential (30% higher chances of revealing commit-introduced faults) than traditional mutants. We also illustrate a possible application of commit-aware mutation testing as a metric to evaluate test case prioritisation.
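A minimal sketch of the correlation analysis: computing a traditional and a commit-relevant mutation score per commit, then Kendall's tau between the two series. The scores below are fabricated toy numbers purely to show the computation.

```python
from scipy.stats import kendalltau

def mutation_score(killed, total):
    """Fraction of mutants killed by the test suite."""
    return killed / total if total else 0.0

traditional = [0.91, 0.75, 0.60, 0.82, 0.55]      # per-commit scores (toy)
commit_relevant = [0.40, 0.70, 0.35, 0.50, 0.45]  # per-commit scores (toy)

tau, p_value = kendalltau(traditional, commit_relevant)
print(f"tau={tau:.2f}, p={p_value:.2f}")
```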
 
Article
Software developers attempt to reproduce software bugs to understand their erroneous behaviours and to fix them. Unfortunately, they often fail to reproduce (or fix) them, which leads to faulty, unreliable software systems. To date, however, little research has been done to better understand what makes software bugs non-reproducible. In this article, we conduct a multimodal study to better understand the non-reproducibility of software bugs. First, we perform an empirical study using 576 non-reproducible bug reports from two popular software systems (Firefox, Eclipse) and identify 11 key factors that might lead a reported bug to non-reproducibility. Second, we conduct a user study involving 13 professional developers where we investigate how the developers cope with non-reproducible bugs. We found that they either close these bugs or solicit further information, which involves long deliberations and counter-productive manual searches. Third, we offer several actionable insights on how to avoid non-reproducibility (e.g., a false-positive bug report detector) and improve the reproducibility of reported bugs (e.g., a sandbox for bug reproduction) by combining our analyses from the multiple studies (e.g., the empirical study and the developer study). Fourth, we explain the differences between reproducible and non-reproducible bug reports by systematically interpreting multiple machine learning models that classify these reports with high accuracy. We found that links to existing bug reports might help improve the reproducibility of a reported bug. Finally, we automatically detect the bug reports connected to a non-reproducible bug and further demonstrate how 93 bugs connected to 71 non-reproducible bugs from our dataset can offer complementary information (e.g., attachments, screenshots, program flows).
 
Article
The ability of an Open Source Software (OSS) project to attract, onboard, and retain any newcomer is vital to its livelihood. Although evidence suggests an upsurge in novice developers joining social coding platforms (such as GitHub), the extent to which their activities result in an OSS contribution is unknown. Hence, we execute the protocols of a registered report to study the activities of a "Newcomer OSS-Candidate": a novice developer who is new to that social coding platform and intends to later onboard an OSS project. Using GitHub as a case platform, we analyze 171 identified Newcomer OSS-Candidates to characterize their contribution activities. Results show that Newcomer OSS-Candidates are likely to target software-based repositories (i.e., 66%), and their first contributions are mainly associated with development (commits) and maintenance (PRs). Newcomer OSS-Candidates are less likely to practice social coding, but eventually end up onboarding an OSS project (i.e., 30% in the quantitative analysis, 70% in the follow-up survey). Furthermore, they cite finding a way to start as the most challenging barrier to contribution. Our work reveals insights into how newcomers to social coding platforms are potential sources of OSS contributions.
 