Conference Paper

Discovering, reporting, and fixing performance bugs


Abstract

Software performance is critical for how users perceive the quality of software products. Performance bugs - programming errors that cause significant performance degradation - lead to poor user experience and low system throughput. Designing effective techniques to address performance bugs requires a deep understanding of how performance bugs are discovered, reported, and fixed. In this paper, we study how performance bugs are discovered, reported to developers, and fixed by developers, and compare the results with those for non-performance bugs. We study performance and non-performance bugs from three popular code bases: Eclipse JDT, Eclipse SWT, and Mozilla. First, we find little evidence that fixing performance bugs has a higher chance to introduce new functional bugs than fixing non-performance bugs, which implies that developers may not need to be over-concerned about fixing performance bugs. Second, although fixing performance bugs is about as error-prone as fixing non-performance bugs, fixing performance bugs is more difficult than fixing non-performance bugs, indicating that developers need better tool support for fixing performance bugs and testing performance bug patches. Third, unlike many non-performance bugs, a large percentage of performance bugs are discovered through code reasoning, not through users observing the negative effects of the bugs (e.g., performance degradation) or through profiling. The result suggests that techniques to help developers reason about performance, better test oracles, and better profiling techniques are needed for discovering performance bugs.


... It is the foundation of a system's ability to achieve various quality attributes [2]. Software performance is, for many systems, the most important quality attribute driving the design [3], [4]. Performance is the ability of a software system to perform its duties according to time constraints within its allowance of resources [3]. ...
... Poor performance can result in long execution times, unhappy users, and even system crashes [3], [4]. Performance problems may be introduced and incubate in the system in early software design and architecture decisions [5]. ...
... Zaman et al. further extend the performance concerns by considering resource utilization as another aspect of software performance [3], [23]. The resources of interest usually include: 1) hardware resources, such as CPU, disk I/O, and memory; 2) logical resources, such as buffers, locks, and semaphores; and 3) processing resources, such as threads and processes [3], [4], [24], [25]. Moreover, recent studies also address other system characteristics, such as the data transmission loss ratio [S5], [S6], energy consumption [S7], and network latency [S8]. ...
Preprint
Full-text available
Software architecture is the foundation of a system's ability to achieve various quality attributes, including software performance. However, there is a lack of comprehensive and in-depth understanding of why and how software architecture and performance analysis are integrated to guide related future research. To fill this gap, this paper presents a systematic mapping study of 109 papers that integrate software architecture and performance analysis. We focused on five research questions that provide guidance for researchers and practitioners to gain an in-depth understanding of this research area. These questions addressed: a systematic mapping of related studies based on the high-level research purposes and specific focuses (RQ1), the software development activities these studies intended to facilitate (RQ2), the typical study templates of different research purposes (RQ3), the available tools and instruments for automating the analysis (RQ4), and the evaluation methodology employed in the studies (RQ5). Through these research questions, we also identified critical research gaps and future directions, including: 1) the lack of available tools and benchmark datasets to support replication, cross-validation and comparison of studies; 2) the need for architecture and performance analysis techniques that handle the challenges in emerging software domains; 3) the lack of consideration of practical factors that impact the adoption of the architecture and performance analysis approaches; and finally 4) the need for the adoption of modern ML/AI techniques to efficiently integrate architecture and performance analysis.
... However, the current methods for rectifying performance defects still have evident limitations. While numerous performance defect rectification methods based on specific algorithms and rule sets exist, most of these methods only target certain specific problems, such as redundant calculations [8], software configuration errors [9], or inefficient loops [5,10,11]. This specificity makes the rectification of performance defects more intricate and also makes the construction and maintenance of these rule sets both time-consuming and resource-intensive [12]. ...
... However, their adverse effects on user experience, system throughput, and resource utilization cannot be underestimated. It is particularly noteworthy that, compared to functional defects, performance defects are often harder to detect [2,3,4] and rectify [5,6]. This not only increases the burden on developers but also underscores the inadequacies of current tools in pinpointing and addressing such issues. ...
... In addition, a range of specific types of performance issues has garnered widespread attention [45,46,47]. For instance, there are tools dedicated to detecting running time bloat [48,49,50], inefficient data structures [51], database performance anti-patterns [52], improper resource sharing in multi-threaded software [53], and inefficient loops [5,11]. To fix these performance issues, researchers have developed several targeted solutions, such as reducing redundant computations [8], rectifying software configuration errors [9], and enhancing loop efficiency [54]. ...
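The redundant computations and inefficient loops cited in these excerpts are concrete, mechanical patterns. As a minimal illustration (hypothetical method names, not code from any cited tool), the following Java sketch contrasts a loop that needlessly recomputes a loop-invariant value on every iteration with the hoisted fix:

```java
import java.util.List;

class LoopHoisting {
    // Inefficient: the maximum of the list is recomputed on every
    // iteration, turning an O(n) task into an O(n^2) one.
    static long sumOfGapsSlow(List<Integer> xs) {
        long sum = 0;
        for (int x : xs) {
            int max = xs.stream().mapToInt(Integer::intValue).max().orElse(0);
            sum += max - x;
        }
        return sum;
    }

    // Fixed: hoist the loop-invariant computation out of the loop (O(n)).
    static long sumOfGapsFast(List<Integer> xs) {
        int max = xs.stream().mapToInt(Integer::intValue).max().orElse(0);
        long sum = 0;
        for (int x : xs) {
            sum += max - x;
        }
        return sum;
    }
}
```

Both versions return the same result; rule-based analyzers of the kind described above encode patterns like this one as rewrite rules over ASTs or control-flow graphs.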
Preprint
Context: With the waning of Moore's Law, the software industry is placing increasing importance on finding alternative solutions for continuous performance enhancement. The significance and research results of software performance optimization have been on the rise in recent years, especially with the advancement propelled by Large Language Models (LLMs). However, traditional strategies for rectifying performance flaws have shown significant limitations at the competitive code efficiency optimization level, and research on this topic is surprisingly scarce. Objective: This study aims to address the research gap in this domain, offering practical solutions to the various challenges encountered. Specifically, we have overcome the constraints of traditional performance error rectification strategies and developed a Language Model (LM) tailored for the competitive code efficiency optimization realm. Method: We introduced E-code, an advanced program synthesis LM. Inspired by the recent success of expert LMs, we designed an innovative structure called the Expert Encoder Group. This structure employs multiple expert encoders to extract features tailored for different input types. We assessed the performance of E-code against other leading models on a competitive dataset and conducted in-depth ablation experiments. Results: Upon systematic evaluation, E-code achieved a 54.98% improvement in code efficiency, significantly outperforming other advanced models. In the ablation experiments, we further validated the significance of the expert encoder group and other components within E-code. Conclusion: The research findings indicate that the expert encoder group can effectively handle various inputs in efficiency optimization tasks, significantly enhancing the model's performance.
... As stated in the ISO/IEC 25010 software quality guidelines, the computational efficiency of the software is a critical cornerstone of system performance and user satisfaction [1], [2]. Inefficient code snippets can increase system latency, waste computational resources, and lead to poor user experience; such inefficiencies are referred to as performance bugs [3], [4]. Existing studies have demonstrated that these inefficiencies widely exist in software and are hard to detect and repair [4], [5]. ...
... Early research in code optimization primarily focuses on rule-based methods, which mainly target specific types of inefficiencies such as software misconfigurations [6] and loop inefficiencies [3]. These methods heavily rely on pre-defined rules created by experts, which are labor-intensive and suffer from the low coverage problem [2], [7]. ...
... In the first optimized version, this code replaces arrays with an unordered map to efficiently track AC and WA statuses and updates counts during input processing. Then, as shown in Fig. 11, SBLLM follows the mutation instruction in the GO-COT prompt and further optimizes the code in the next iteration by skipping the processing for problems that have already been accepted. ...
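The excerpt above describes a general optimization pattern: replace array scans with a hash map for status tracking, then skip work for problems that are already accepted. The cited paper's actual code is not available here; the following Java sketch (hypothetical input format and names) illustrates that pattern for a stream of judge verdicts:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class JudgeLog {
    // Processes submissions of the form "problemId verdict" where the verdict
    // is "AC" (accepted) or "WA" (wrong answer). Returns {solvedCount,
    // totalWrongAnswersBeforeFirstAccept}. Submissions to problems that are
    // already accepted are skipped entirely, mirroring the optimization
    // described in the excerpt.
    static int[] countAcAndPenalty(String[] submissions) {
        Map<String, Integer> pendingWa = new HashMap<>(); // WA counts per unsolved problem
        Set<String> accepted = new HashSet<>();           // problems already solved
        int solved = 0, penalty = 0;
        for (String line : submissions) {
            String[] parts = line.split(" ");
            String id = parts[0], verdict = parts[1];
            if (accepted.contains(id)) continue;          // skip already-accepted problems
            if (verdict.equals("AC")) {
                accepted.add(id);
                solved++;
                penalty += pendingWa.getOrDefault(id, 0);
            } else {
                pendingWa.merge(id, 1, Integer::sum);
            }
        }
        return new int[] { solved, penalty };
    }
}
```

Hash-map lookups make each submission O(1) on average, and the early `continue` avoids all redundant processing for solved problems.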
Preprint
The code written by developers usually suffers from efficiency problems and contains various performance bugs. These inefficiencies necessitate research on automated refactoring methods for code optimization. Early research in code optimization employed rule-based methods and focused on specific inefficiency issues, which are labor-intensive and suffer from the low-coverage issue. Recent work regards the task as a sequence generation problem and resorts to deep learning (DL) techniques such as large language models (LLMs). These methods typically prompt LLMs to directly generate optimized code. Although these methods show state-of-the-art performance, such a one-step generation paradigm struggles to achieve an optimal solution. First, complex optimization methods such as combinatorial ones are hard for LLMs to capture. Second, the one-step generation paradigm poses challenges in precisely infusing the knowledge required for effective code optimization into LLMs, resulting in under-optimized code. To address these problems, we propose to model this task from a search perspective, and propose a search-based LLM framework named SBLLM that enables iterative refinement and discovery of improved optimization methods. SBLLM synergistically integrates LLMs with evolutionary search and consists of three key components: 1) an execution-based representative sample selection part that evaluates the fitness of each existing optimized code and prioritizes promising ones to pilot the generation of improved code; 2) an adaptive optimization pattern retrieval part that infuses targeted optimization patterns into the model for guiding LLMs towards rectifying and progressively enhancing their optimization methods; and 3) a genetic operator-inspired chain-of-thought prompting part that aids LLMs in combining different optimization methods and generating improved optimization methods.
... Performance bugs are a notorious challenge that degrade software performance and waste computational resources [31,45]. These bugs are associated with reduced end-user satisfaction, increased development and maintenance costs, and diminished revenues [37,45]. Due to their pervasive nature, performance bugs will continue to emerge as software and hardware evolve and new areas of computing emerge. ...
... Software code repositories are an abundant source of performance bugs and bug-fix efforts. Prior research [25,33,45,46,60] has manually analyzed code commits in these repositories to identify various categories of performance bugs. However, curating a large dataset of performance bugs from code commits is challenging. ...
Preprint
Full-text available
Performance bugs challenge software development, degrading performance and wasting computational resources. Software developers invest substantial effort in addressing these issues. Curating these performance bugs can offer valuable insights to the software engineering research community, aiding in developing new mitigation strategies. However, there is no large-scale open-source performance bugs dataset available. To bridge this gap, we propose PerfCurator, a repository miner that collects performance bug-related commits at scale. PerfCurator employs PcBERT-KD, a 125M parameter BERT model trained to classify performance bug-related commits. Our evaluation shows PcBERT-KD achieves accuracy comparable to 7 billion parameter LLMs but with significantly lower computational overhead, enabling cost-effective deployment on CPU clusters. Utilizing PcBERT-KD as the core component, we deployed PerfCurator on a 50-node CPU cluster to mine GitHub repositories. This extensive mining operation resulted in the construction of a large-scale dataset comprising 114K performance bug-fix commits in Python, 217.9K in C++, and 76.6K in Java. Our results demonstrate that this large-scale dataset significantly enhances the effectiveness of data-driven performance bug detection systems.
... However, performance issues are largely under-tagged because the manual tagging process is voluntary and laborious. In practice, for all four issue-tracking systems in this study (i.e., Apache's Jira, Bugzilla, Redmine, and Mantis Bug Tracker), the performance-issue tagging rates are below 1%, even though empirical studies [10] find that performance issues should account for around 4% to 16% of issues. This discrepancy suggests a gap in accurately identifying and tagging performance issues. ...
... How does it compare to the baseline methods? As mentioned earlier, only 4% to 16% of the issues in issue-tracking systems are performance issues [4], [10]. It has been reported previously that data balancing can improve the accuracy of machine learning models [51], [52], [53]. ...
... A rich body of prior studies has focused on performance issue analysis from different perspectives [4], [8], [9], [10], [48], [69], [70], [71], [72], [73], [74], [75]. Recent studies that rely on real-life performance issues use keyword matching and manual verification to extract a dataset. ...
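Keyword matching, the baseline these studies start from before manual verification, can be sketched in a few lines of Java. The keyword list below is hypothetical; real studies curate and manually validate their lists, and the excerpts above explain why such a naive matcher struggles with imbalanced, highly variable issue text:

```java
import java.util.List;
import java.util.Locale;

class KeywordTagger {
    // Hypothetical performance-related keyword list for illustration only.
    static final List<String> KEYWORDS =
        List.of("slow", "latency", "throughput", "memory leak", "cpu usage");

    // Flags an issue description as performance-related if it contains
    // any keyword, case-insensitively. A crude but common baseline.
    static boolean looksLikePerformanceIssue(String issueText) {
        String t = issueText.toLowerCase(Locale.ROOT);
        return KEYWORDS.stream().anyMatch(t::contains);
    }
}
```

A matcher like this yields the under-tagging and false positives that motivate the HLP-based hybrid classification approach described below it on this page.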
Article
Full-text available
Software performance is critical for system efficiency, with performance issues potentially resulting in budget overruns, project delays, and market losses. Such problems are reported to developers through issue tracking systems, which are often under-tagged, as the manual tagging process is voluntary and time-consuming. Existing automated performance issue tagging techniques, such as keyword matching and machine/deep learning models, struggle due to imbalanced datasets and a high degree of variance. This paper presents a novel hybrid classification approach, combining Heuristic Linguistic Patterns (HLPs) with machine/deep learning models to enable practitioners to automatically identify performance-related issues. The proposed approach works across three progressive levels: HLP tagging, sentence tagging, and issue tagging, with a focus on linguistic analysis of issue descriptions. The authors evaluate the approach on three different datasets collected from different projects and issue-tracking platforms to prove that the proposed framework is accurate, project- and platform-agnostic, and robust to imbalanced datasets. Furthermore, this study also examined how the two unique techniques of the framework, the fuzzy HLP matching and the Issue HLP Matrix, contribute to the accuracy. Finally, the study explored the effectiveness and impact of two off-the-shelf feature selection techniques, Boruta and RFE, with the proposed framework. The results showed that the proposed framework has great potential for practitioners to accurately (with up to 100% precision, 66% recall, and 79% F1-score) identify performance issues, with robustness to imbalanced data and good transferability to new projects and issue tracking platforms.
... The goal of performance microbenchmarking frameworks is to detect performance bugs as early as possible by, for example, checking and testing each system build [6]- [8]. Due to the numerous challenges of performance microbenchmarking, such as unreliable results [9], [10], the need for in-depth knowledge of methodology [11], [12], and a lack of appropriate tooling [3], [13], [14], several performance microbenchmarking frameworks, including JMH in the Java ecosystem, have been proposed to automate performance microbenchmarking. ...
... JMH is a framework developed under the OpenJDK umbrella that enables users to design and run repeatable performance microbenchmarks. ...
... Moreover, since performance bugs are usually difficult to reproduce, it takes a long time to detect and fix them, such as 1,075 days on average to discover and fix 36 performance bugs in Jin et al. [29]. Nistor et al. [13] disclose that, compared to reasoning about functional bugs, developers have little support for reasoning about performance bugs. In addition, more experienced developers are required to address performance bugs [30]. ...
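Frameworks like JMH automate the measurement methodology these excerpts allude to: warm-up iterations, repeated measurement, and statistical summarization. The following plain-Java sketch is not JMH's actual API; it is only a minimal illustration of what such frameworks automate, and a real harness additionally guards against dead-code elimination, JIT deoptimization, and GC interference:

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

class MiniBench {
    // Runs warm-up iterations (so the JIT can compile the code under test),
    // then times several measured iterations and reports the median, in
    // nanoseconds. The workload returns a value so its result is not
    // trivially unused.
    static long medianNanos(LongSupplier workload, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) {
            workload.getAsLong(); // discard warm-up runs
        }
        long[] samples = new long[measured];
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            workload.getAsLong();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[measured / 2]; // median is robust to outlier runs
    }
}
```

The gap between this sketch and a trustworthy benchmark is exactly the "in-depth knowledge of methodology" the excerpt cites as a challenge.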
Article
Performance is a crucial non-functional requirement of many software systems. Despite the widespread use of performance testing, developers still struggle to construct and evaluate the quality of performance tests. To address these two major challenges, we implement a framework, dubbed ju2jmh, to automatically generate performance microbenchmarks from JUnit tests and use mutation testing to study the quality of generated microbenchmarks. Specifically, we compare our ju2jmh generated benchmarks to manually written JMH benchmarks and to automatically generated JMH benchmarks using the AutoJMH framework, as well as directly measuring system performance with JUnit tests. For this purpose, we have conducted a study on three subjects (Rxjava, Eclipse-collections, and Zipkin) with ~454K source lines of code (SLOC), 2,417 JMH benchmarks (including manually written and generated AutoJMH benchmarks) and 35,084 JUnit tests. Our results show that the ju2jmh generated JMH benchmarks consistently outperform using the execution time and throughput of JUnit tests as a proxy of performance and JMH benchmarks automatically generated using the AutoJMH framework while being comparable to JMH benchmarks manually written by developers in terms of tests' stability and ability to detect performance bugs. Nevertheless, ju2jmh benchmarks are able to cover more of the software applications than manually written JMH benchmarks during the microbenchmark execution. Furthermore, ju2jmh benchmarks are generated automatically, while manually written JMH benchmarks require many hours of hard work and attention; therefore, our study can reduce developers' effort to construct microbenchmarks. In addition, we identify three factors (too low test workload, unstable tests and limited mutant coverage) that affect a benchmark's ability to detect performance bugs.
To the best of our knowledge, this is the first study aimed at assisting developers in fully automated microbenchmark creation and assessing microbenchmark quality for performance testing.
... Performance bugs may not cause system failure and may depend on user input; therefore, detecting them can be challenging [9,16]. They also tend to be harder to fix than non-performance bugs [20,26]. As a result, better tool support is needed for fixing performance bugs. ...
... However, a majority of existing performance bug detection approaches focus on specific types of performance problems. For instance, prior work investigated the detection of inefficient loops [20,27,31], database-related performance issues, low-utility data structures [33], false sharing especially in multi-threaded code [18], etc. Approaches that fix specific performance issues due to repeated computations [10], software misconfigurations [15], loop inefficiencies [21], etc. have also been developed. Many of these approaches rely on expert-written algorithms or pre-defined sets of rules to detect and fix performance issues based on patterns in abstract syntax trees, control flow graphs, profiles, etc. Building rule-based analyzers is a non-trivial task as it requires achieving the right balance between precision and recall. ...
... Other performance detection tools focus on detecting a specific type of performance bug. For instance, a set of tools has been developed for detecting runtime bloat [12,32,34], low-utility data structures [33], database-related performance anti-patterns [6], the false sharing problem in multi-threaded software [18], and inefficient loops [20,31]. Approaches fixing specific performance issues, such as repeated computations [10], software misconfigurations [15], loop inefficiencies [21], etc., have also been developed. ...
Preprint
Full-text available
Improving software performance is an important yet challenging part of the software development cycle. Today, the majority of performance inefficiencies are identified and patched by performance experts. Recent advancements in deep learning approaches and the widespread availability of open source data create a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepPERF, a transformer-based approach to suggest performance improvements for C# applications. We pretrain DeepPERF on English and source code corpora, followed by fine-tuning for the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in ~53% of the cases, getting ~34% of them verbatim in our expert-verified dataset of performance changes made by C# developers. Additionally, we evaluate DeepPERF on 50 open source C# repositories on GitHub using both benchmark and unit tests and find that our model is able to suggest valid performance improvements that can improve both CPU usage and Memory allocations. So far we've submitted 19 pull-requests with 28 different performance optimizations and 11 of these PRs have been approved by the project owners.
... However, these studies are not specifically designed for PBs, and thus only capture some partial characteristics of PBs in DL systems. In contrast, PBs have been widely studied for traditional systems, e.g., desktop or server applications [30,48,62,79], highly configurable systems [24,25], mobile applications [38,40], databasebacked web applications [77,78], and JavaScript systems [59]. However, PBs in DL systems could be different due to the programming paradigm shift from traditional systems to DL systems. ...
... Step 2: PB Post Selection. Instead of directly using performancerelated keywords from the existing studies on PBs in traditional systems (e.g., [30,48,62,79]), we derived a keyword set in the following way to achieve a wide and comprehensive coverage of PB posts. We first randomly sampled 100 posts with a tag of "performance" from the 18,730 posts in Step 1. ...
... A lot of empirical studies have characterized performance bugs from different perspectives (e.g., root causes, discovery, diagnosis, fixing and reporting) for desktop or server applications [30,48,62,79,88], highly configurable systems [24,25], mobile applications [38,40], database-backed web applications [77,78], and JavaScript systems [59]. They shed light on potential directions on performance analysis (e.g., detection, profiling and testing). ...
Preprint
Full-text available
Deep learning (DL) has been increasingly applied to a variety of domains. The programming paradigm shift from traditional systems to DL systems poses unique challenges in engineering DL systems. Performance is one of these challenges, and performance bugs (PBs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PBs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to characterize symptoms, root causes, and introducing and exposing stages of PBs in DL systems developed in TensorFlow and Keras, with a total of 238 PBs collected from 225 StackOverflow posts. Our findings shed light on the implications for developing high-performance DL systems, and for detecting and localizing PBs in DL systems. We also build the first benchmark of 56 PBs in DL systems, and assess the capability of existing approaches in tackling them. Moreover, we develop a static checker DeepPerf to detect three types of PBs, and identify 488 new PBs in 130 GitHub projects. 62 and 18 of them have been confirmed and fixed by developers, respectively.
... Table 2 lists web client subjects used in the prior research. Nistor et al. [17] study over 600 bugs from three open-source projects. They compare and contrast how performance bugs and non-performance bugs differ in how they are discovered, reported, and fixed. ...
... Configurable software systems complicate performance testing. A prior study [17] shows that performance bugs in configurable software systems are more complex and take a longer time to fix. The sheer size of the configuration space makes software quality even harder to achieve. ...
... It is not unusual to see discussions in a bug report that a performance bug is introduced a few versions ago but only to surface in the bug report recently. Bug Detection Nistor et al. [17] report that most (up to 57%) performance bugs are discovered with code reasoning. Code reasoning involves code understanding. ...
... For example, if a code change introduces a security vulnerability, security measures to counteract this may be implemented elsewhere (Williams et al. 2018; Mahrous and Malhotra 2018; Ping et al. 2011). If a code change introduces a performance issue, this performance issue may be fixed and improved in a different part of the system (Nistor et al. 2013; Jin et al. 2012), for example by changing configuration parameters. Non-functional bugs can be harder to fix than their functional counterparts. ...
... However, those studies do not make a distinction between functional and non-functional bugs during their evaluation. Nonetheless, it has been shown that non-functional bugs present different characteristics than functional bugs (Nistor et al. 2013). In particular, non-functional requirements describe the quality attributes of a program, as opposed to its functionality (Kotonya and Sommerville 1998). ...
... In either scenario, the SZZ approach may consider the later changes as bug-inducing instead of the original changes. This phenomenon is intuitive since non-functional bugs often take a long time to be discovered and fixed (Nistor et al. 2013). Therefore, considering the most recent code change before the bug reporting date may not be a suitable heuristic for non-functional bugs. ...
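The SZZ heuristic these excerpts critique, blaming the most recent change to the fixed lines made before the bug report, can be sketched as follows. This is a simplified illustration with hypothetical names, not the full SZZ algorithm, which also traces line history via `git blame`-style annotation and filters cosmetic changes:

```java
import java.util.List;

class SzzSketch {
    // One past change to a line later touched by the bug fix.
    static class Change {
        final String commit;
        final long timestamp;
        Change(String commit, long timestamp) {
            this.commit = commit;
            this.timestamp = timestamp;
        }
    }

    // SZZ heuristic: among the commits that previously touched a fixed line,
    // blame the most recent one made before the bug was reported. As the
    // study above shows, for non-functional bugs this candidate is often not
    // the true bug-inducing change, since such bugs surface long after
    // introduction.
    static String blame(List<Change> lineHistory, long bugReportTime) {
        Change candidate = null;
        for (Change c : lineHistory) {
            if (c.timestamp < bugReportTime
                    && (candidate == null || c.timestamp > candidate.timestamp)) {
                candidate = c;
            }
        }
        return candidate == null ? null : candidate.commit;
    }
}
```

When the true inducing change predates an unrelated later edit to the same line, this heuristic blames the later edit, which is exactly the failure mode the study documents for non-functional bugs.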
Article
Full-text available
Non-functional bugs, e.g., performance bugs and security bugs, bear a heavy cost on both software developers and end-users. For example, IBM estimates the cost of a single data breach to be millions of dollars. Tools to reduce the occurrence, impact, and repair time of non-functional bugs can therefore provide key assistance for software developers racing to fix these issues. Identifying bug-inducing changes is a critical step in software quality assurance. In particular, the SZZ approach is commonly used to identify bug-inducing commits. However, the fixes to non-functional bugs may be scattered and separate from their bug-inducing locations in the source code. The nature of non-functional bugs may therefore make the SZZ approach a sub-optimal approach for identifying bug-inducing changes. Yet, prior studies that leverage or evaluate the SZZ approach do not consider non-functional bugs, leading to potential bias on the results. In this paper, we conduct an empirical study on the results of the SZZ approach when used to identify the inducing changes of the non-functional bugs in the NFBugs dataset. We eliminate a majority of the bug-inducing commits as they are not in the same method or class level. We manually examine whether each identified bug-inducing change is indeed the correct bug-inducing change. Our manual study shows that a large portion of non-functional bugs cannot be properly identified by the SZZ approach. By manually identifying the root causes of the falsely detected bug-inducing changes, we uncover root causes for false detection that have not been found by previous studies. We evaluate the identified bug-inducing changes based on three criteria from prior research, i.e., the earliest bug appearance, the future impact of changes, and the realism of bug introduction. We find that prior criteria may be irrelevant for non-functional bugs. 
Our results may be used to assist in future research on non-functional bugs, and highlight the need to complement SZZ to accommodate the unique characteristics of non-functional bugs.
... Performance failures pose an enormous challenge for developers. Compared to functional failures, they take considerably longer to be discovered [Jin et al., 2012], are harder to reproduce and debug [Zaman et al., 2012; Nistor et al., 2013a], take longer to fix [Zaman et al., 2011; Nistor et al., 2013a; Liu et al., 2014; Mazuera-Rozo et al., 2020], and require more, as well as more experienced, developers to do so [Zaman et al., 2011; Nistor et al., 2013a]. Users are similarly affected, e.g., they are more likely to leave an application if it takes longer to load [Akamai Technologies Inc., 2017; Artz, 2009]. ...
Article
Software performance faults have severe consequences for users, developers, and companies. One way to unveil performance faults before they manifest in production is performance testing, which ought to be done on every new version of the software, ideally on every commit. However, performance testing faces multiple challenges that inhibit it from being applied early in the development process, on every new commit, and in an automated fashion. In this dissertation, we investigate three challenges of software microbenchmarks, a performance testing technique on unit granularity which is predominantly used for libraries and frameworks. The studied challenges affect the quality aspects (1) runtime, (2) result variability, and (3) performance change detection of microbenchmark executions. The objective is to understand the extent of these challenges in real-world software and to find solutions to address these. To investigate the challenges’ extent, we perform a series of experiments and analyses. We execute benchmarks in bare-metal as well as multiple cloud environments and conduct a large-scale mining study on benchmark configurations. The results show that all three challenges are common: (1) benchmark suite runtimes are often longer than 3 hours; (2) result variability can be extensive, in some cases up to 100%; and (3) benchmarks often only reliably detect large performance changes of 60% or more. 
To address the challenges, we devise targeted solutions as well as adapt well-known techniques from other domains for software microbenchmarks: (1) a solution that dynamically stops benchmark executions based on statistics to reduce runtime while maintaining low result variability; (2) a solution to identify unstable benchmarks that does not require execution, based on statically-computable source code features and machine learning algorithms; (3) traditional test case prioritization (TCP) techniques to execute benchmarks earlier that detect larger performance changes; and (4) specific execution strategies to detect small performance changes reliably even when executed in unreliable cloud environments. We experimentally evaluate the solutions and techniques on real-world benchmarks and find that they effectively deal with the three challenges. (1) Dynamic reconfiguration drastically reduces runtime, by between 48.4% and 86.0%, without changing the results of 78.8% to 87.6% of the benchmarks, depending on the project and statistic used. (2) The instability prediction model effectively identifies unstable benchmarks when relying on random forest classifiers, with a prediction performance between 0.79 and 0.90 area under the receiver operating characteristic curve (AUC). (3) TCP applied to benchmarks is effective and efficient, with APFD-P values for the best technique ranging from 0.54 to 0.71 and a computational overhead of 11%. (4) Batch testing, i.e., executing the benchmarks of two versions interleaved on the same instances, with repetitions both within and across instances, reliably detects performance changes of 10% or less, even when using unreliable cloud infrastructure as the execution environment.
Overall, this dissertation shows that real-world software microbenchmarks are considerably affected by all three challenges (1) runtime, (2) result variability, and (3) performance change detection; however, deliberate planning and execution strategies effectively reduce their impact.
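The dissertation's first solution, dynamically stopping benchmark executions once the measurements stabilize, can be sketched with a simple stopping rule. This is a minimal illustration, not the dissertation's actual implementation; the statistic (coefficient of variation) and the thresholds here are assumptions.

```python
import statistics

def run_until_stable(benchmark, min_runs=5, max_runs=100, cv_threshold=0.02):
    """Repeatedly run a benchmark, stopping early once the coefficient of
    variation (stddev / mean) of the collected measurements drops below a
    threshold. Illustrative stand-in for dynamic reconfiguration."""
    samples = []
    for _ in range(max_runs):
        samples.append(benchmark())
        if len(samples) >= min_runs:
            cv = statistics.stdev(samples) / statistics.mean(samples)
            if cv < cv_threshold:
                break
    return samples

# Usage: a fake benchmark whose timings stabilize quickly.
timings = iter([10.0, 10.2, 9.9, 10.1, 10.0] + [10.0] * 95)
samples = run_until_stable(lambda: next(timings))
print(len(samples))  # stops well before max_runs
```

The point of the sketch is the trade-off the dissertation measures: stopping early cuts runtime, while the variability threshold guards the result quality.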
... Unlike functional bugs, performance bugs do not usually cause system failures. They tend to be harder to detect [3,10,11,19] and fix [28,35] compared to functional bugs and are usually fixed by expert developers [9]. As a result, better tool support is needed to fix performance bugs, especially for novice developers. ...
... For such a tool to be applicable in practice, it should support a wide range of performance bugs. However, existing fixing approaches usually target specific kinds of performance bugs, such as repeated computations [12], software misconfigurations [17], and loop inefficiencies [28]. The majority of these approaches are also rule-based analyzers, with high maintenance costs [6]. ...
Preprint
Full-text available
Performance bugs are non-functional bugs that can manifest even in well-tested commercial products. Fixing these performance bugs is an important yet challenging problem. In this work, we address this challenge and present a new approach called Retrieval-Augmented Prompt Generation (RAPGen). Given a code snippet with a performance issue, RAPGen first retrieves a prompt instruction from a pre-constructed knowledge base of previous performance bug fixes and then generates a prompt using the retrieved instruction. It then uses this prompt on a Large Language Model (such as Codex) in zero-shot fashion to generate a fix. We compare our approach with various prompt variations and state-of-the-art methods on the task of performance bug fixing. Our evaluation shows that RAPGen can generate performance improvement suggestions equivalent to or better than a developer's in ~60% of the cases, getting ~39% of them verbatim, in an expert-verified dataset of past performance changes made by C# developers.
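RAPGen's retrieval step can be illustrated with a toy sketch. All names, the token-overlap similarity, and the knowledge-base entries below are illustrative assumptions; the real system retrieves from a curated knowledge base of C# performance fixes and feeds the generated prompt to a large language model.

```python
def retrieve_instruction(snippet, knowledge_base):
    """Pick the prompt instruction whose associated buggy pattern shares
    the most tokens with the input snippet (Jaccard similarity).
    A drastically simplified stand-in for RAPGen's retrieval step."""
    def tokens(code):
        return set(code.replace("(", " ").replace(")", " ").replace(".", " ").split())
    s = tokens(snippet)
    def jaccard(entry):
        t = tokens(entry["pattern"])
        return len(s & t) / len(s | t)
    return max(knowledge_base, key=jaccard)["instruction"]

# Hypothetical knowledge base of prior performance-fix instructions.
kb = [
    {"pattern": "for x in items: result = result + str(x)",
     "instruction": "Use a join/StringBuilder instead of repeated string concatenation."},
    {"pattern": "if key in d.keys():",
     "instruction": "Test membership on the dict directly, not on .keys()."},
]
snippet = "for item in rows: out = out + str(item)"
prompt = retrieve_instruction(snippet, kb) + "\n\n" + snippet
print(prompt.splitlines()[0])  # → Use a join/StringBuilder instead of repeated string concatenation.
```

The retrieved instruction plus the buggy snippet form the prompt; the generation step (the language model call) is out of scope here.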
... Figure 5 shows the project-wise smell distribution of C++ script smells and settings smells. Based on the analysis, we identified that for each project the median number of CPP scripts is 1. 32 contain more than 11 performance smells, which shows the necessity of performance bottleneck detection tools such as UEPerf-Analyzer to improve the performance of XR applications. Among the analyzed projects, the sandisk/GabrielPaliari project has 60 settings smells, the highest among the analyzed projects. ...
... The study pointed out that performance issues are difficult to reproduce and also require more discussion to fix. Nistor et al. [32] performed a similar study on performance and non-performance bugs from three popular codebases: Eclipse JDT, Eclipse SWT, and Mozilla. The work concluded that fixing performance bugs is more challenging than fixing non-performance bugs. ...
Preprint
Extended Reality (XR) includes Virtual Reality (VR), Augmented Reality (AR) and Mixed Reality (MR). XR is an emerging technology that simulates a realistic environment for users. XR techniques have provided revolutionary user experiences in various application scenarios (e.g., training, education, product/architecture design, gaming, remote conference/tour, etc.). Due to the high computational cost of rendering real-time animation in limited-resource devices and constant interaction with user activity, XR applications often face performance bottlenecks, and these bottlenecks create a negative impact on the user experience of XR software. Thus, performance optimization plays an essential role in many industry-standard XR applications. Even though identifying performance bottlenecks in traditional software (e.g., desktop applications) is a widely explored topic, those approaches cannot be directly applied within XR software due to the different nature of XR applications. Moreover, XR applications developed in different frameworks such as Unity and Unreal Engine show different performance bottleneck patterns and thus, bottleneck patterns of Unity projects can't be applied for Unreal Engine (UE)-based XR projects. To fill the knowledge gap for XR performance optimizations of Unreal Engine-based XR projects, we present the first empirical study on performance optimizations from seven UE XR projects, 78 UE XR discussion issues and three sources of UE documentation. Our analysis identified 14 types of performance bugs, including 12 types of bugs related to UE settings issues and two types of CPP source code-related issues. To further assist developers in detecting performance bugs based on the identified bug patterns, we also developed a static analyzer, UEPerfAnalyzer, that can detect performance bugs in both configuration files and source code.
... The number of bug reports in this study (310) is reasonable, as it is large enough to make statistically significant claims and small enough to allow for a reasonably thorough analysis of each bug report. Indeed, there is also precedent from prior research, including a study conducted by the present author, in which 317 bug reports were analyzed (Ocariza et al. 2013), and those conducted by Nistor et al. (2013) and Selakovic and Pradel (2016), each of which analyzed fewer than 300 performance-related bug reports. ...
... Lastly, Nistor et al. (2013) looked at performance bugs from different code bases, and analyzed how they are detected, reported, and fixed, in comparison to non-performance bugs. For instance, the authors found that performance bugs are generally more difficult to fix compared to non-performance bugs, and concluded that better tool support is needed for the former. ...
Article
Full-text available
Performance regressions can have a drastic impact on the usability of a software application. The crucial task of localizing such regressions can be achieved using bisection, which attempts to find the bug-introducing commit using binary search. This approach is used extensively by many development teams, but it is an inherently heuristical approach when applied to performance regressions, and therefore, does not have correctness guarantees. Unfortunately, bisection is also time-consuming, which implies the need to assess its effectiveness prior to running it. To this end, the goal of this study is to analyze the effectiveness of bisection for performance regressions. This goal is achieved by first formulating a metric that quantifies the probability of a successful bisection, and extracting a list of input parameters – the contributing properties – that potentially impact its value; a sensitivity analysis is then conducted on these properties to understand the extent of their impact. Furthermore, an empirical study of 310 bug reports describing performance regressions in 17 real-world applications is conducted, to better understand what these contributing properties look like in practice. The results show that while bisection can be highly effective in localizing real-world performance regressions, this effectiveness is sensitive to the contributing properties, especially the choice of baseline and the distributions at each commit. The results also reveal that most bug reports do not provide sufficient information to help developers properly choose values and metrics that can maximize the effectiveness, which implies the need for measures to fill this information gap.
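The bisection procedure this study analyzes can be sketched as a binary search over a commit range, here with noise-free measurements. This is a simplified illustration with made-up names and data; the study's point is precisely that real measurements are noisy distributions, which is what makes bisection heuristical.

```python
def bisect_regression(measure, lo, hi, baseline, tolerance=0.10):
    """Binary-search the commit range (lo, hi] for the first commit whose
    measured response time exceeds `baseline` by more than `tolerance`.
    Commit `lo` is assumed good and `hi` bad. Deterministic toy version."""
    def is_bad(commit):
        return measure(commit) > baseline * (1 + tolerance)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid
        else:
            lo = mid
    return hi  # first bad commit

# Usage: commits 0-99, regression introduced at commit 42.
times = [100.0] * 42 + [130.0] * 58
first_bad = bisect_regression(lambda c: times[c], lo=0, hi=99, baseline=100.0)
print(first_bad)  # → 42
```

In the deterministic case the search provably finds the bug-introducing commit in O(log n) measurements; with noisy per-commit distributions each `is_bad` call can misclassify, which is the effectiveness question the metric in this study quantifies.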
... Developers often spend a substantial amount of time diagnosing a configurable software system to localize and fix a performance bug, or to determine that the system was misconfigured [8,11,26,30,32,33,55,58,59,86]. This struggle is quite common when maintaining configurable software systems. ...
... Our goal is to support developers in the process of debugging the performance of configurable software systems; in particular, when developers do not even know which options or interactions in their current configuration cause an unexpected performance behavior. When performance issues occur in software systems, developers need to identify relevant information to debug the unexpected performance behaviors [8,11,27,55]. For this task, in addition to using off-the-shelf profilers [15,53,74], some researchers suggest using more targeted profiling techniques [10,12,13,21,84] and visualizations [2,6,12,21,62,70] to identify and analyze the locations of performance bottlenecks. ...
Preprint
Full-text available
Determining whether a configurable software system has a performance bug or it was misconfigured is often challenging. While there are numerous debugging techniques that can support developers in this task, there is limited empirical evidence of how useful the techniques are to address the actual needs that developers have when debugging the performance of configurable software systems; most techniques are often evaluated in terms of technical accuracy instead of their usability. In this paper, we take a human-centered approach to identify, design, implement, and evaluate a solution to support developers in the process of debugging the performance of configurable software systems. We first conduct an exploratory study with 19 developers to identify the information needs that developers have during this process. Subsequently, we design and implement a tailored tool, adapting techniques from prior work, to support those needs. Two user studies, with a total of 20 developers, validate and confirm that the information that we provide helps developers debug the performance of configurable software systems.
... Compared to the number of functional bugs, the number of performance bugs in software projects is typically relatively small (Ding et al. 2020; Radu and Nadi 2019; Jin et al. 2012; Nistor et al. 2013). Therefore, the lack of data to build JIT bug prediction models for performance bugs may become a common challenge in practice. ...
... Our paper analyzes the performance bugs in Cassandra and Hadoop and the SZZ approach's ability to determine the bug-inducing changes, and concentrates on the impact of these changes on predictive models. Nistor et al. (2013) studied software performance, since performance is critical for how users perceive the quality of software products. Performance bugs lead to poor user experience and low system throughput (Molyneaux 2009; Bryant and O'Hallaron 2015). ...
Article
Full-text available
Performance bugs bear a heavy cost on both software developers and end-users. Tools to reduce the occurrence, impact, and repair time of performance bugs can therefore provide key assistance for software developers racing to fix these bugs. Classification models that focus on identifying defect-prone commits, referred to as Just-In-Time (JIT) Quality Assurance, are known to be useful in allowing developers to review risky commits. These commits can be reviewed while they are still fresh in developers' minds, reducing the costs of developing high-quality software. JIT models, however, leverage the SZZ approach to identify whether or not a change is bug-inducing. The fixes to performance bugs may be scattered across the source code, separated from their bug-inducing locations. The nature of performance bugs may make SZZ a sub-optimal approach for identifying their bug-inducing commits. Yet, prior studies that leverage or evaluate the SZZ approach do not distinguish performance bugs from other bugs, leading to potential bias in the results. In this paper, we conduct an empirical study on JIT defect prediction for performance bugs. We concentrate on SZZ's ability to identify the bug-inducing commits of performance bugs in two open-source projects, Cassandra and Hadoop. We verify whether the bug-inducing commits found by SZZ are truly bug-inducing commits by manually examining these identified commits. Our manual examination includes cross-referencing fix commits and JIRA bug reports. We evaluate model performance for JIT models by using them to identify bug-inducing code commits for performance-related bugs. Our findings show that JIT defect prediction classifies non-performance bug-inducing commits better than performance bug-inducing commits, i.e., the SZZ approach does introduce errors when identifying bug-inducing commits. However, we find that manually correcting these errors in the training data only slightly improves the models.
In the absence of a large number of correctly labelled performance bug-inducing commits, our findings show that combining all available training data (i.e., truly performance bug-inducing commits, non-performance bug-inducing commits, and non-bug-inducing commits) yields the best classification results.
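The core of the SZZ approach discussed above can be sketched in a few lines. This is a toy stand-in: the `blame` map plays the role of `git blame` on the pre-fix version of the file, and real SZZ implementations add filtering heuristics (e.g., ignoring cosmetic changes and commits made after the bug was reported).

```python
def szz_bug_inducing(fix_deleted_lines, blame):
    """SZZ in miniature: the commits that last touched the lines a fix
    deletes (or modifies) are flagged as bug-inducing. `blame` maps a
    line number to the last commit that modified it."""
    return {blame[line] for line in fix_deleted_lines}

# Hypothetical blame data for the pre-fix version of a file.
blame = {10: "c1", 11: "c1", 42: "c7", 99: "c3"}
print(sorted(szz_bug_inducing([10, 42], blame)))  # → ['c1', 'c7']
```

The paper's observation follows directly from this sketch: when a performance fix touches code far from where the slowdown was introduced, the blamed commits are not the true bug-inducing ones.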
... This piece of code looks innocent. However, there is an outer loop in function my_xml_parse(), which parses the input string str into XML_NODEs (lines 18-32). The outer loop keeps calling xml_parent() using the next sibling of the previous XML_NODE, which has O(N²) complexity in the number of children of a parent XML_NODE. ...
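The accidental-quadratic pattern described in the quoted snippet, re-walking a sibling chain from the head on every insertion, can be reproduced in miniature. The class and function names below are illustrative, not the code from the paper.

```python
class XMLNode:
    """A node in a singly linked sibling chain, mimicking XML_NODE."""
    def __init__(self, name):
        self.name = name
        self.next_sibling = None

def append_child_quadratic(head, node):
    """Walks the whole sibling list on every insert:
    O(N) per call, O(N^2) for N children in total."""
    if head is None:
        return node
    cur = head
    while cur.next_sibling is not None:
        cur = cur.next_sibling
    cur.next_sibling = node
    return head

def parse_children(names):
    """The linear fix: keep a tail pointer, O(1) per insert, O(N) total."""
    head = tail = None
    for name in names:
        node = XMLNode(name)
        if head is None:
            head = tail = node
        else:
            tail.next_sibling = node
            tail = node
    return head

head = parse_children(["a", "b", "c"])
print(head.name, head.next_sibling.name)  # → a b
```

Both variants build the same sibling chain; only the cost per insertion differs, which is exactly the kind of inefficiency algorithmic profiling targets.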
... Combating performance bugs depends on a good understanding of performance bugs. Many empirical studies were conducted to understand different types of performance bugs [1,2,3,20,21,22,23,24]. They provide important findings, which deepen researchers' understanding and guide the technical design to fight performance bugs from different aspects. ...
Article
Full-text available
Complexity problems are a common type of performance issue, caused by algorithmic inefficiency. Algorithmic profiling aims to automatically attribute execution complexity to an executed code construct. It can identify code constructs with superlinear complexity to facilitate performance optimizations and debugging. However, existing algorithmic profiling techniques suffer from several severe limitations, missing the opportunity to be deployed in production environments and failing to effectively pinpoint root causes of performance failures caused by complexity problems. In this paper, we design a tool, ComAir, which can effectively conduct algorithmic profiling in production environments. We propose several novel instrumentation methods to significantly lower runtime overhead and enable production-run usage. We also design an effective ranking mechanism to help developers identify root causes of performance failures due to complexity problems. Our experimental results show that ComAir can effectively identify root causes and generate accurate profiling results in production environments, while incurring negligible runtime overhead.
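The idea behind algorithmic profiling, attributing an empirical complexity to a code construct, can be illustrated by fitting a power law to measured costs. This is a toy regression under the assumption cost ≈ c·n^k; ComAir itself works via instrumentation and ranking, not curve fitting.

```python
import math

def complexity_exponent(sizes, costs):
    """Fit cost ~ c * n^k by least squares in log-log space and return k.
    An exponent near 1 suggests linear behavior; near 2, quadratic."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in costs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# A construct doing n^2 work: measured "costs" grow quadratically.
sizes = [100, 200, 400, 800]
costs = [n * n for n in sizes]
print(round(complexity_exponent(sizes, costs)))  # → 2
```

A profiler that records per-construct cost at several input sizes can apply such a fit to flag superlinear constructs for the developer.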
... One way to evaluate test suites is to seed small faults, called mutants, into source code and assess the ability of a suite to detect these faults [17,27]. Such mutants have been defined in the literature to reflect the typical errors developers make when writing source code [40,44,51,56,60,66,78]. ...
Preprint
Mutation testing has been widely used to assess the fault-detection effectiveness of a test suite, as well as to guide test case generation or prioritization. Empirical studies have shown that, while mutants are generally representative of real faults, an effective application of mutation testing requires "traditional" operators designed for programming languages to be augmented with operators specific to an application domain and/or technology. This paper proposes MDroid+, a framework for effective mutation testing of Android apps. First, we systematically devise a taxonomy of 262 types of Android faults grouped in 14 categories by manually analyzing 2,023 software artifacts from different sources (e.g., bug reports, commits). Then, we identified a set of 38 mutation operators, and implemented an infrastructure to automatically seed mutations in Android apps with 35 of the identified operators. The taxonomy and the proposed operators have been evaluated in terms of stillborn/trivial mutants generated and their capacity to represent real faults in Android apps, as compared to other well-known mutation tools.
... Essential non-committers were formally defined in [68], identifying bug resolution catalysts through a proposed approach. The process of finding, reporting, and fixing performance defects was examined in [69], while dormant and non-dormant bugs were compared in [70] based on fixing time, size, and the identity of the fixer. A new approach for automatically extracting bug-fix patterns was proposed in [71], and the application of process mining for effective process management was discussed in [72]. ...
Article
Full-text available
Introduction/Importance of Study: Bug repository mining is a crucial research area in software engineering, analyzing software change trends, defect prediction, and evolution. It involves developing methods and tools for mining repositories and providing essential data for bug management. Objective: The goal of this study is to analyze and synthesize recent trends in mining software bug repositories, providing valuable insights for future research and practical bug management. Novelty statement: Our research contributes novel insights into mining software repository techniques and approaches employed in specific tasks such as bug localization, triaging, and prediction, along with their limitations and possible future trends. Material and Method: This study presents a comprehensive survey that categorizes and synthesizes the current research within this field. This categorization is derived from an in-depth review of studies conducted over the past fifteen years, from 2010 to 2024. The survey is organized around three key dimensions: the test systems employed in bug repositories, the methodologies commonly used in this area of research, and the prevailing trends shaping the field. Results and Discussion: Our results highlight the significance of artificial intelligence and machine learning integration in bug repository mining, which has revolutionized the software development process by enhancing the classification, prediction, and vulnerability detection of bugs. Concluding Remarks: This survey aims to provide a clear and detailed understanding of the evolution of bug repository mining, offering valuable insights for the ongoing advancement of software engineering.
... The authors argued that "it is a critical part of the software development process, as it ensures that software applications are functioning correctly and efficiently" (Nathalia et al., 2023). Though studies (Dakhel et al., 2023; Nistor et al., 2013; Xuan et al., 2016) have shown that software engineers have done a lot of research in an attempt to limit the number of bugs introduced during software development and thereby improve code quality, bugs (i.e., errors) still remain an issue for software developers (DeLiema et al., 2023; Kang et al., 2023; Xin & Reiss, 2017). ...
Article
Full-text available
ChatGPT is an advanced language model that has been gaining attention in the natural language processing field. However, its functionalities go beyond language-related tasks. ChatGPT can also serve as a robust tool for debugging software code. Debugging holds crucial significance in the software development process, as bugs, or code errors, can significantly impact the functionality and security of applications. The process of identifying and rectifying bugs using traditional debugging approaches can be labour-intensive and time-consuming, typically requiring the expertise of skilled developers. With the growing complexity of software applications, there is an increasing demand for efficient and precise debugging solutions. Thus, the purpose of this study is to investigate the effectiveness of ChatGPT in identifying, predicting, explaining, and resolving programming bugs. Findings of the study revealed that ChatGPT is able to analyze and comprehend code. The review results also showed that ChatGPT has the potential to streamline the debugging process, making it more suitable for developers with varying experience levels. Lastly, the study offers insights into integrating ChatGPT into the software development workflow and proposes directions for upcoming studies.
... There are few empirical studies on performance bugs [8], [26]-[29], their root causes [8], [26], [30], fixing strategies [8], [26], [27], their impact or relevance [8], [29], and both static and dynamic analysis-based detection approaches [31]-[34]. Researchers have also suggested various ways of optimizing data access to improve the performance of database-backed web applications using caching and prefetching techniques. ...
Preprint
Data-intensive systems handle variable, high-volume, and high-velocity data generated by humans and digital devices. Like traditional software, data-intensive systems are prone to technical debts introduced to cope with the pressure of time and resource constraints on developers. Data access is a critical component of data-intensive systems, as it determines the overall performance and functionality of such systems. While data access technical debts are getting attention from the research community, technical debts affecting performance are not well investigated. Objective: Identify, categorize, and validate data access performance issues in the context of NoSQL-based and polyglot persistence data-intensive systems using a qualitative study. Method: We collect issues from NoSQL-based and polyglot persistence open-source data-intensive systems, identify data access performance issues using inductive coding, and build a taxonomy of their root causes. Then, we validate the perceived relevance of the newly identified performance issues using a developer survey.
... Misconfigurations are typically caused by interactions between software and hardware, resulting in non-functional faults: degradations in non-functional system properties such as latency and energy consumption. These non-functional faults, unlike regular software bugs, do not cause the system to crash or exhibit any obvious misbehavior [76,85,99]. Instead, misconfigured systems remain operational but degrade in performance [16,71,75,86]. ...
Preprint
Full-text available
Modern computer systems are highly configurable, with the variability space sometimes larger than the number of atoms in the universe. Understanding and reasoning about the performance behavior of highly configurable systems, due to a vast variability space, is challenging. State-of-the-art methods for performance modeling and analyses rely on predictive machine learning models, therefore, they become (i) unreliable in unseen environments (e.g., different hardware, workloads), and (ii) produce incorrect explanations. To this end, we propose a new method, called Unicorn, which (a) captures intricate interactions between configuration options across the software-hardware stack and (b) describes how such interactions impact performance variations via causal inference. We evaluated Unicorn on six highly configurable systems, including three on-device machine learning systems, a video encoder, a database management system, and a data analytics pipeline. The experimental results indicate that Unicorn outperforms state-of-the-art performance optimization and debugging methods. Furthermore, unlike the existing methods, the learned causal performance models reliably predict performance for new environments.
... Other tools perform security assessment, automated test case generation and detection of non-functional issues such as energy consumption [270], [271]. While fixing non-functional performance bugs, developers need to consider the threat of introducing functional bugs [272] and hindering code maintainability [273]. In this context, Linares et al. [274] suggested that developers rarely implement micro-optimizations (e.g., changes at statement level). ...
Article
Nowadays there is a mobile application for almost everything a user may think of, ranging from paying bills and gathering information to playing games and watching movies. In order to ensure user satisfaction and the success of applications, it is important to provide highly performant applications. This is particularly important for resource-constrained systems such as mobile devices. Thus, non-functional performance characteristics, such as energy and memory consumption, play an important role in user satisfaction. This paper provides a comprehensive survey of non-functional performance optimization for Android applications. We collected 155 unique publications, published between 2008 and 2020, that focus on the optimization of non-functional performance of mobile applications. We target our search at four performance characteristics in particular: responsiveness, launch time, memory, and energy consumption. For each performance characteristic, we categorize optimization approaches based on the method used in the corresponding publications. Furthermore, we identify research gaps in the literature for future work.
... Configuring software systems is often challenging. In practice, many users execute systems with inefficient configurations in terms of performance and, often directly correlated, energy consumption [22,23,33,54]. While users can adjust configuration options to tradeoff between performance and the system's functionality, this configuration task can be overwhelming; many systems, such as databases, Web servers, and video encoders, have numerous configuration options that may interact, possibly producing unexpected and undesired behavior. ...
Preprint
Full-text available
Performance-influence models can help stakeholders understand how and where configuration options and their interactions influence the performance of a system. With this understanding, stakeholders can debug performance behavior and make deliberate configuration decisions. Current black-box techniques to build such models combine various sampling and learning strategies, resulting in tradeoffs between measurement effort, accuracy, and interpretability. We present Comprex, a white-box approach to build performance-influence models for configurable systems, combining insights of local measurements, dynamic taint analysis to track options in the implementation, compositionality, and compression of the configuration space, without relying on machine learning to extrapolate incomplete samples. Our evaluation on 4 widely-used, open-source projects demonstrates that Comprex builds similarly accurate performance-influence models to the most accurate and expensive black-box approach, but at a reduced cost and with additional benefits from interpretable and local models.
... Performance bugs have also been studied for software systems, where bugs are detected by users or through code reasoning [43]. A machine learning approach has been developed for evaluating software performance degradation due to code changes [4]. ...
Preprint
Processor design validation and debug is a difficult and complex task, which consumes the lion's share of the design process. Design bugs that affect processor performance rather than its functionality are especially difficult to catch, particularly in new microarchitectures. This is because, unlike functional bugs, the correct processor performance of new microarchitectures on complex, long-running benchmarks is typically not deterministically known. Thus, when performance benchmarking new microarchitectures, performance teams may assume that the design is correct when the performance of the new microarchitecture exceeds that of the previous generation, despite significant performance regressions existing in the design. In this work, we present a two-stage, machine learning-based methodology that is able to detect the existence of performance bugs in microprocessors. Our results show that our best technique detects 91.5% of microprocessor core performance bugs whose average IPC impact across the studied applications is greater than 1% versus a bug-free design with zero false positives. When evaluated on memory system bugs, our technique achieves 100% detection with zero false positives. Moreover, the detection is automatic, requiring very little performance engineer time.
... Incorrect configuration (misconfiguration) elicits unexpected interactions between software and hardware, resulting in non-functional faults, i.e., faults in non-functional system properties such as latency, energy consumption, and/or heat dissipation. These non-functional faults, unlike regular software bugs, do not cause the system to crash or exhibit obvious misbehavior [70,78,88]. Instead, misconfigured systems remain operational while being compromised, resulting in severe performance degradation in latency, energy consumption, and/or heat dissipation [16,66,69,80]. ...
Preprint
Full-text available
Modern computing platforms are highly-configurable with thousands of interacting configurations. However, configuring these systems is challenging. Erroneous configurations can cause unexpected non-functional faults. This paper proposes CADET (short for Causal Debugging Toolkit), which enables users to identify, explain, and fix the root cause of non-functional faults early and in a principled fashion. CADET builds a causal model by observing the performance of the system under different configurations. Then, it uses causal path extraction followed by counterfactual reasoning over the causal model to: (a) identify the root causes of non-functional faults, (b) estimate the effects of various configurable parameters on the performance objective(s), and (c) prescribe candidate repairs to the relevant configuration options to fix the non-functional fault. We evaluated CADET on 5 highly-configurable systems deployed on 3 NVIDIA Jetson systems-on-chip. We compare CADET with state-of-the-art configuration optimization and ML-based debugging approaches. The experimental results indicate that CADET can find effective repairs for faults in multiple non-functional properties with (at most) 17% more accuracy, 28% higher gain, and 40× speed-up compared to other ML-based performance debugging methods. Compared to multi-objective optimization approaches, CADET can find fixes (at most) 9× faster with comparable or better performance gain. Our case study of non-functional faults reported in NVIDIA's forum shows that CADET can find 14% better repairs than the experts' advice in less than 30 minutes.
... Recent studies have shown that performance problems caused by misconfiguration are still prevalent [4], [13], [17]. Performance issues can cause significant performance degradation which leads to long response time and a low program throughput [7], [17], [24]. ...
Preprint
Performance is an important non-functional aspect of the software requirement. Modern software systems are highly-configurable and misconfigurations may easily cause performance issues. A software system that suffers performance issues may exhibit low program throughput and long response time. However, the sheer size of the configuration space makes it challenging for administrators to manually select and adjust the configuration options to achieve better performance. In this paper, we propose ConfRL, an approach to tune software performance automatically. The key idea of ConfRL is to use reinforcement learning to explore the configuration space by a trial-and-error approach and to use the feedback received from the environment to tune configuration option values to achieve better performance. To reduce the cost of reinforcement learning, ConfRL employs sampling, clustering, and dynamic state reduction techniques to keep states in a large configuration space manageable. Our evaluation of four real-world highly-configurable server programs shows that ConfRL can efficiently and effectively guide software systems to achieve higher long-term performance.
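The trial-and-error tuning loop described above can be sketched as a minimal hill climber over a discrete configuration space. Everything below is an illustrative assumption, not ConfRL's actual API: the option names, the synthetic throughput model, and the simple epsilon-style exploration stand in for ConfRL's reinforcement learning, sampling, and state-reduction machinery.

```python
import random

# Hypothetical performance model: throughput as a function of two
# configuration options. In a real tuner this would be a benchmark run.
def measure_throughput(config):
    # Synthetic response surface whose optimum is cache_mb=64, workers=8.
    return 100 - abs(config["cache_mb"] - 64) / 2 - abs(config["workers"] - 8) * 3

OPTIONS = {"cache_mb": [16, 32, 64, 128], "workers": [2, 4, 8, 16]}

def tune(episodes=200, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    config = {k: v[0] for k, v in OPTIONS.items()}
    best, best_score = dict(config), measure_throughput(config)
    for _ in range(episodes):
        # Trial-and-error step: perturb one option, measure the reward.
        key = rng.choice(list(OPTIONS))
        old = config[key]
        config[key] = rng.choice(OPTIONS[key])
        score = measure_throughput(config)
        if score > best_score:
            best, best_score = dict(config), score  # keep the improvement
        elif rng.random() > epsilon:
            config[key] = old  # usually revert a worsening change (exploit)
        # with probability epsilon the worse change is kept (explore)
    return best, best_score
```

The epsilon parameter trades exploration against exploitation, the same tension ConfRL manages with its learned value estimates over reduced states.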
... In addition, these approaches rely on a heuristic to pinpoint the performance regression-causes, while ZAM reasons its way through the timeline to find the cause. In this sense, ZAM's methodology is consistent with the results of an empirical study by Nistor et al. [36], which found that performance issues are fixed mainly through code reasoning. ...
Article
Full-text available
A performance regression in software is defined as an increase in an application step’s response time as a result of code changes. Detecting such regressions can be done using profiling tools; however, investigating their root cause is a mostly-manual and time-consuming task. This statement holds true especially when comparing execution timelines, which are dynamic function call trees augmented with response time data; these timelines are compared to find the performance regression-causes – the lowest-level function calls that regressed during execution. When done manually, these comparisons often require the investigator to analyze thousands of function call nodes. Further, performing these comparisons on web applications is challenging due to JavaScript’s asynchronous and event-driven model, which introduce noise in the timelines. In response, we propose a design – Zam – that automatically compares execution timelines collected from web applications, to identify performance regression-causes. Our approach uses a hybrid node matching algorithm that recursively attempts to find the longest common subsequence in each call tree level, then aggregates multiple comparisons’ results to eliminate noise. Our evaluation of Zam on 10 web applications indicates that it can identify performance regression-causes with a path recall of 100% and a path precision of 96%, while performing comparisons in under a minute on average. We also demonstrate the real-world applicability of Zam, which has been used to successfully complete performance investigations by the performance and reliability team in SAP.
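The level-wise matching idea can be illustrated with a small sketch. The node representation and function names below are our own assumptions, not Zam's implementation: siblings are matched by longest common subsequence of function names, matched pairs are compared recursively, and nodes whose self time grew beyond a threshold are reported as regression-cause candidates.

```python
from functools import lru_cache

# A call-tree node is (function_name, self_time_ms, children).
def lcs_pairs(a, b):
    # Longest common subsequence over two sibling lists, keyed on name.
    @lru_cache(maxsize=None)
    def rec(i, j):
        if i == len(a) or j == len(b):
            return ()
        if a[i][0] == b[j][0]:
            return ((i, j),) + rec(i + 1, j + 1)
        left, right = rec(i + 1, j), rec(i, j + 1)
        return left if len(left) >= len(right) else right
    return rec(0, 0)

def find_regressions(before, after, threshold_ms=5):
    # Recurse through matched node pairs; report nodes whose self time grew.
    regressions = []
    def walk(x, y):
        if y[1] - x[1] >= threshold_ms:
            regressions.append((y[0], y[1] - x[1]))
        for i, j in lcs_pairs(x[2], y[2]):
            walk(x[2][i], y[2][j])
    walk(before, after)
    return regressions
```

Unmatched siblings (inserted or removed calls) simply drop out of the LCS, which is one way the noise from asynchronous, event-driven timelines can be tolerated.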
Article
Modern multi-threaded systems are highly complex. This makes their behavior difficult to understand. Developers frequently capture behavior in the form of program traces and then manually inspect these traces. Existing tools, however, fail to scale to traces larger than a million events. In this paper we present an approach to compress multi-threaded traces in order to allow developers to visually explore these traces at scale. Our approach is able to compress traces that contain millions of events down to a few hundred events. We use this approach to design and implement a tool called NonSequitur. We present three case studies which demonstrate how we used NonSequitur to analyze real-world performance issues with Meta's storage engine RocksDB and MongoDB's storage engine WiredTiger, two complex database backends. We also evaluate NonSequitur with 42 participants on traces from RocksDB and WiredTiger. We demonstrate that, in some cases, participants on average scored 11 times higher when performing performance analysis tasks on large execution traces. Additionally, for some performance analysis tasks, the participants spent on average three times longer with other tools than with NonSequitur.
Article
Full-text available
Data-intensive systems handle variable, high-volume, and high-velocity data generated by human and digital devices. Like traditional software, data-intensive systems are prone to technical debts introduced to cope with the pressure of time and resource constraints on developers. Data access is a critical component of data-intensive systems, as it determines their overall performance and functionality. While data-access technical debts are getting attention from the research community, technical debts that affect performance are not well investigated. This study aims to identify, categorize, and validate data-access performance anti-patterns. We collected issues from NoSQL-based and polyglot persistence open-source data-intensive systems, implemented in the Java programming language, and identified 14 new data-access performance anti-patterns categorized under seven high-level categories. We conducted a developer survey to evaluate the perceived relevance and criticality of the newly identified anti-patterns and found that Improper Handling of Node Failures, Using Synchronous Connection, and Inefficient Driver API are the most critical data-access performance anti-patterns. The study findings can help improve the quality of data-intensive software systems by raising practitioners' awareness of the impact of data-access performance anti-patterns. At the same time, the findings will help quality assurance teams prioritize the correction of performance anti-patterns based on their criticality.
Chapter
Full-text available
This paper describes a formal general-purpose automated program repair (APR) framework based on the concept of program invariants. In the presented repair framework, the execution traces of a defective program are dynamically analyzed to infer specifications φ_correct and φ_violated, where φ_correct represents the set of likely invariants (good patterns) required for a run to be successful and φ_violated represents the set of likely suspicious invariants (bad patterns) that result in the bug in the defective program. These specifications are then refined using rigorous program analysis techniques, which are also used to drive the repair process towards feasible patches and assess the correctness of generated patches. We demonstrate the usefulness of leveraging invariants in APR by developing an invariant-based repair system for performance bugs. The initial analysis shows the effectiveness of invariant-based APR in handling performance bugs by producing patches that ensure a program's efficiency increases without adversely impacting its functionality.
Article
Context: Software performance is crucial for ensuring the quality of software products. As a non-functional requirement, software performance is often neglected until a later phase in the software development life cycle (SDLC). The lack of clarity about what software performance research literature is available prevents researchers from understanding what software performance research fields exist. It also makes it difficult for practitioners to adopt state-of-the-art software performance techniques. Software performance research is not as organized as other established research topics such as software testing. Thus, it is essential to conduct a systematic mapping study as a first step to provide an overview of the latest research literature available in software performance. Objective: The objective of this systematic mapping study is to survey and map software performance research literature into suitable categories and to synthesize the literature data for future access and reference. Method: This systematic mapping study conducts a manual examination by querying research literature in notable journals and proceedings in software engineering in the past decade. We examine each paper manually and identify primary studies for further analysis and synthesis according to the pre-defined inclusion criteria. Lastly, we map the primary studies based on their corresponding classification category. Results: This systematic mapping study provides a state-of-the-art literature mapping in software performance research. We have carefully examined 222 primary studies out of 2000+ research literature. We have identified six software performance research categories and 15 subcategories. We generate the primary study mapping and report five research findings. Conclusions: Unlike established research fields, it is unclear what types of software performance research categories are available to the community.
This work takes the systematic mapping study approach to survey and map the latest software performance research literature. The study results provide an overview of the paper distribution and a reference for researchers to navigate research literature on software performance.
Article
Bug reports are submitted by the software stakeholders to foster the location and elimination of bugs. However, in large-scale software systems, it may be impossible to track and solve every bug, and thus developers should pay more attention to High-Impact Bugs (HIBs). Previous studies analyzed textual descriptions to automatically identify HIBs, but they ignored the quality of code, which may also indicate the cause of HIBs. To address this issue, we integrate the features reflecting the quality of production (i.e. CK metrics) and test code (i.e. test smells) into our textual similarity based model to identify HIBs. Our model outperforms the compared baseline by up to 39% in terms of AUC-ROC and 64% in terms of F-Measure. Then, we explain the behavior of our model by using SHAP to calculate the importance of each feature, and we apply case studies to empirically demonstrate the relationship between the most important features and HIB. The results show that several test smells (e.g. Assertion Roulette, Conditional Test Logic, Duplicate Assert, Sleepy Test) and product metrics (e.g. NOC, LCC, PF, and ProF) have important contributions to HIB identification.
Article
Software performance is a critical quality attribute that determines the success of a software system. However, practitioners lack comprehensive and holistic understanding of how real-life performance issues are caused and resolved in practice from the technical, engineering, and economic perspectives. This paper presents a large-scale empirical study of 570 real-life performance issues from 13 open source projects from various problem domains, and implemented in three popular programming languages, Java (192 issues), C/C++ (162 issues), and Python (216 issues). From the technical perspective, we summarize eight general types of performance issues with corresponding root causes and resolutions that apply for all three languages. We also identify available tools for detecting and resolving different types of issues from the literature. In addition, we found that 27% of the 570 issues are resolved by design-level optimization—coordinated revision of a group of related source files and their design structure. We reveal four typical design-level optimization patterns, including classic design patterns , change propagation , optimization clone , and parallel optimization that practitioners should be aware of in resolving performance issues. From the engineering perspective, this study analyzes how test code changes in performance optimization. We found that only 15% of the 570 performance issues involve revision of test code. In most cases, the revised test cases focus on the functional logic of the performance optimization, rather than directly evaluate the performance improvement. This finding points to the potential lack of engineering standard for formally verifying performance optimization in regression testing. Finally, from the economic perspective, we analyze the "Return On Investment" of performance optimization. We found that design-level optimization usually requires more investment, but does not always yield higher performance improvement. However, developers tend to use design-level optimization when they are concerned about other quality attributes, such as maintainability and readability.
Thesis
System availability and efficiency are critical aspects in the oil and gas sector, as any fault affecting those systems may cause operations to shut down, which negatively impacts operational resources as well as costs, human resources, and time. Therefore, it became important to investigate the reasons for such errors. In this study, software errors and maintenance are studied. End-user errors are targeted after finding that the number of these errors is projected to increase. The factors that affect end-user behavior in oil and gas systems are also investigated, and the relation between system availability and end-user behavior is evaluated. An investigation was performed following the descriptive methodology in order to gain insights into the human error factor encountered by various international oil and gas companies around the Middle East and North Africa. This was conducted by distributing a questionnaire to 120 employees of the companies in this study; 81 responded. The questionnaire contained questions related to software/hardware errors and errors due to the end user. In short, the study shows that there is a relation between end-user behavior and system availability and efficiency. Factors including training, experience, education, work shifts, system interface, and I/O devices were identified in the study as factors affecting end-user behavior. Moreover, the study contributes new knowledge by identifying a new factor that leads to system unavailability, namely memory sticks. This thesis presents valuable knowledge that explains how errors occur and the reasons for their occurrence. Major limitations of this research include company policies, legal issues, and information resources.
Chapter
The technology enabled service industry is emerging as the most dynamic sectors in world's economy. Various service sector industries such as financial services, banking solutions, telecommunication, investment management, etc. completely rely on using large scale software for their smooth operations. Any malwares or bugs in these software is an issue of big concern and can have serious financial consequences. This chapter addresses the problem of bug handling in service sector software. Predictive analysis is a helpful technique for keeping software systems error free. Existing research in bug handling focus on various predictive analysis techniques such as data mining, machine learning, information retrieval, optimisation, etc. for bug resolving. This chapter provides a detailed analysis of bug handling in large service sector software. The main emphasis of this chapter is to discuss research involved in applying predictive analysis for bug handling. The chapter also presents some possible future research directions in bug resolving using mathematical optimisation techniques.
Conference Paper
Full-text available
Changes, a rather inevitable part of software development, can cause maintenance implications if they introduce bugs into the system. By isolating and characterizing these bug-introducing changes, it is possible to uncover potentially risky source code entities or issues that produce bugs. In this paper, we mine the bug-introducing changes in the Android platform by mapping bug reports to the changes that introduced the bugs. We then use the change information to look for both potentially problematic parts and dynamics in development that can cause maintenance implications. We believe that the results of our study can help better manage Android software development.
Article
Full-text available
A recent study finds that errors of omission are harder for programmers to detect than errors of commission. While several change recommendation systems already exist to prevent or reduce omission errors during software development, there have been very few studies on why errors of omission occur in practice and how such errors could be prevented. In order to understand the characteristics of omission errors, this paper investigates a group of bugs that were fixed more than once in open source projects — those bugs whose initial patches were later considered incomplete and to which programmers applied supplementary patches. Our study on Eclipse JDT core, Eclipse SWT, and Mozilla shows that a significant portion of resolved bugs (22% to 33%) involves more than one fix attempt. Our manual inspection shows that the causes of omission errors are diverse, including missed porting changes, incorrect handling of conditional statements, or incomplete refactorings, etc. While many consider that missed updates to code clones often lead to omission errors, only a very small portion of supplementary patches (12% in JDT, 25% in SWT, and 9% in Mozilla) have a content similar to their initial patches. This implies that supplementary change locations cannot be predicted by code clone analysis alone. Furthermore, 14% to 15% of files in supplementary patches are beyond the scope of immediate neighbors of their initial patch locations — they did not overlap with the initial patch locations nor had direct structural dependencies on them (e.g. calls, accesses, subtyping relations, etc.). These results call for new types of omission error prevention approaches that complement existing change recommendation systems.
Article
Full-text available
Software performance is one of the important qualities that makes software stand out in a competitive market. However, in earlier work we found that performance bugs take more time to fix, need to be fixed by more experienced developers and require changes to more code than non-performance bugs. In order to be able to improve the resolution of performance bugs, a better understanding is needed of the current practice and shortcomings of reporting, reproducing, tracking and fixing performance bugs. This paper qualitatively studies a random sample of 400 performance and non-performance bug reports of Mozilla Firefox and Google Chrome across four dimensions (Impact, Context, Fix and Fix validation). We found that developers and users face problems in reproducing performance bugs and have to spend more time discussing performance bugs than other kinds of bugs. Sometimes performance regressions are tolerated as a tradeoff to improve something else.
Article
Full-text available
In this paper we present a profiling methodology and toolkit for helping developers discover hidden asymptotic inefficiencies in the code. From one or more runs of a program, our profiler automatically measures how the performance of individual routines scales as a function of the input size, yielding clues to their growth rate. The output of the profiler is, for each executed routine of the program, a set of tuples that aggregate performance costs by input size. The collected profiles can be used to produce performance plots and derive trend functions by statistical curve fitting or bounding techniques. A key feature of our method is the ability to automatically measure the size of the input given to a generic code fragment: to this aim, we propose an effective metric for estimating the input size of a routine and show how to compute it efficiently. We discuss several case studies, showing that our approach can reveal asymptotic bottlenecks that other profilers may fail to detect and characterize the workload and behavior of individual routines in the context of real applications. To prove the feasibility of our techniques, we implemented a Valgrind tool called aprof and performed an extensive experimental evaluation on the SPEC CPU2006 benchmarks. Our experiments show that aprof delivers comparable performance to other prominent Valgrind tools, and can generate informative plots even from single runs on typical workloads for most algorithmically-critical routines.
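A toy version of this input-sensitive profiling idea can be sketched as follows. The function names are ours and aprof itself works by instrumenting binaries under Valgrind, but the core aggregation is the same: collect (input size, cost) observations per routine, then estimate the growth exponent k in cost ≈ n^k from the slope of a log-log least-squares fit.

```python
import math
from collections import defaultdict

# routine name -> list of (input_size, cost) observations
profile = defaultdict(list)

def record(routine, input_size, cost):
    profile[routine].append((input_size, cost))

def growth_exponent(routine):
    # Least-squares slope of log(cost) vs log(n): the exponent k in n^k.
    pts = [(math.log(n), math.log(c)) for n, c in profile[routine] if n > 1 and c > 0]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    num = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    den = sum((x - mean_x) ** 2 for x, _ in pts)
    return num / den
```

A routine whose fitted exponent is markedly higher than expected (say, near 2 for code assumed linear) is exactly the kind of asymptotic bottleneck a flat profiler taking a single workload would miss.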
Article
Full-text available
Given limited resource and time before software release, development-site testing and debugging become more and more insufficient to ensure satisfactory software performance. As a counterpart for debugging in the large pioneered by the Microsoft Windows Error Reporting (WER) system focusing on crashing/hanging bugs, performance debugging in the large has emerged thanks to available infrastructure support to collect execution traces with performance issues from a huge number of users at the deployment sites. However, performance debugging against these numerous and complex traces remains a significant challenge for performance analysts. In this paper, to enable performance debugging in the large in practice, we propose a novel approach, called StackMine, that mines callstack traces to help performance analysts effectively discover highly impactful performance bugs (e.g., bugs impacting many users with long response delay). As a successful technology-transfer effort, since December 2010, StackMine has been applied in performance-debugging activities at a Microsoft team for performance analysis, especially for a large number of execution traces. Based on real-adoption experiences of StackMine in practice, we conducted an evaluation of StackMine on performance debugging in the large for Microsoft Windows 7. We also conducted another evaluation on a third-party application. The results highlight substantial benefits offered by StackMine in performance debugging in the large for large-scale software systems.
Conference Paper
Full-text available
Customizable programs and program families provide user-selectable features to allow users to tailor a program to an application scenario. Knowing in advance which feature selection yields the best performance is difficult because a direct measurement of all possible feature combinations is infeasible. Our work aims at predicting program performance based on selected features. However, when features interact, accurate predictions are challenging. An interaction occurs when a particular feature combination has an unexpected influence on performance. We present a method that automatically detects performance-relevant feature interactions to improve prediction accuracy. To this end, we propose three heuristics to reduce the number of measurements required to detect interactions. Our evaluation consists of six real-world case studies from varying domains (e.g., databases, encoding libraries, and web servers) using different configuration techniques (e.g., configuration files and preprocessor flags). Results show an average prediction accuracy of 95 %.
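A minimal sketch of the underlying detection idea, under a purely additive baseline model: enabling features a and b together should cost base + delta(a) + delta(b), and a large deviation from that prediction signals an interaction. The `measure` callback and the tolerance are our illustrative assumptions; the real method adds heuristics to avoid measuring every pair.

```python
def find_interactions(features, measure, tolerance=0.1):
    # measure(frozenset_of_enabled_features) -> performance cost
    base = measure(frozenset())
    delta = {f: measure(frozenset([f])) - base for f in features}
    interactions = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            predicted = base + delta[a] + delta[b]  # additive assumption
            observed = measure(frozenset([a, b]))
            if abs(observed - predicted) > tolerance * max(abs(predicted), 1e-9):
                interactions.append((a, b, observed - predicted))
    return interactions
```

The returned deviation term is what a performance-influence model would need as an explicit interaction coefficient to stay accurate.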
Article
Full-text available
A goal of performance testing is to find situations when applications unexpectedly exhibit worsened characteristics for certain combinations of input values. A fundamental question of performance testing is how to select a manageable subset of the input data that finds performance problems in applications automatically and quickly. We offer a novel solution for finding performance problems in applications automatically using black-box software testing. Our solution is an adaptive, feedback-directed learning testing system that learns rules from execution traces of applications and then uses these rules to select test input data automatically for these applications to find more performance problems when compared with exploratory random testing. We have implemented our solution and applied it to a medium-size application at a major insurance company and to an open-source application. Performance problems were found automatically and confirmed by experienced testers and developers.
Conference Paper
Full-text available
A good understanding of the impact of different types of bugs on various project aspects is essential to improve software quality research and practice. For instance, we would expect that security bugs are fixed faster than other types of bugs due to their critical nature. However, prior research has often treated all bugs as similar when studying various aspects of software quality (e.g., predicting the time to fix a bug), or has focused on one particular type of bug (e.g., security bugs) with little comparison to other types. In this paper, we study how different types of bugs (performance and security bugs) differ from each other and from the rest of the bugs in a software project. Through a case study on the Firefox project, we find that security bugs are fixed and triaged much faster, but are reopened and tossed more frequently. Furthermore, we also find that security bugs involve more developers and impact more files in a project. Ours is the first work to empirically study performance bugs and compare them to frequently-studied security bugs. Our findings highlight the importance of considering the different types of bugs in software quality research and practice.
Conference Paper
Full-text available
The relationship between various software-related phenomena (e.g., code complexity) and post-release software defects has been thoroughly examined. However, to date these predictions have a limited adoption in practice. The most commonly cited reason is that the prediction identifies too much code to review without distinguishing the impact of these defects. Our aim is to address this drawback by focusing on high-impact defects for customers and practitioners. Customers are highly impacted by defects that break pre-existing functionality (breakage defects), whereas practitioners are caught off-guard by defects in files that had relatively few pre-release changes (surprise defects). The large commercial software system that we study already had an established concept of breakages as the highest-impact defects, however, the concept of surprises is novel and not as well established. We find that surprise defects are related to incomplete requirements and that the common assumption that a fix is caused by a previous change does not hold in this project. We then fit prediction models that are effective at identifying files containing breakages and surprises. The number of pre-release defects and file size are good indicators of breakages, whereas the number of co-changed files and the amount of time between the latest pre-release change and the release date are good indicators of surprises. Although our prediction models are effective at identifying files that have breakages and surprises, we learn that the prediction should also identify the nature or type of defects, with each type being specific enough to be easily identified and repaired.
Conference Paper
Full-text available
Software engineering researchers have long been interested in where and why bugs occur in code, and in predicting where they might turn up next. Historical bug-occurrence data has been key to this research. Bug tracking systems and code version histories record when, how, and by whom bugs were fixed; from these sources, datasets that relate file changes to bug fixes can be extracted. These historical datasets can be used to test hypotheses concerning processes of bug introduction, and also to build statistical bug prediction models. Unfortunately, processes and humans are imperfect, and only a fraction of bug fixes are actually labelled in source code version histories, and thus become available for study in the extracted datasets. The question naturally arises: are the bug fixes recorded in these historical datasets a fair representation of the full population of bug fixes? In this paper, we investigate historical data from several software projects, and find strong evidence of systematic bias. We then investigate the potential effects of "unfair, imbalanced" datasets on the performance of prediction techniques. We draw the lesson that bias is a critical problem that threatens both the effectiveness of processes that rely on biased datasets to build prediction models and the generalizability of hypotheses tested on biased data.
Conference Paper
Full-text available
Robust distributed systems commonly employ high-level recovery mechanisms enabling the system to recover from a wide variety of problematic environmental conditions such as node failures, packet drops and link disconnections. Unfortunately, these recovery mechanisms also effectively mask additional serious design and implementation errors, disguising them as latent performance bugs that severely degrade end-to-end system performance. These bugs typically go unnoticed due to the challenge of distinguishing between a bug and an intermittent environmental condition that must be tolerated by the system. We present techniques that can automatically pinpoint latent performance bugs in systems implementations, in the spirit of recent advances in model checking by systematic state space exploration. The techniques proceed by automating the process of conducting random simulations, identifying performance anomalies, and analyzing anomalous executions to pinpoint the circumstances leading to performance degradation. By focusing our implementation on the MACE toolkit, MACEPC can be used to test our implementations directly, without modification. We have applied MACEPC to five thoroughly tested and trusted distributed systems implementations. MACEPC was able to find significant, previously unknown, long-standing performance bugs in each of the systems, and led to fixes that significantly improved the end-to-end performance of the systems.
Conference Paper
Full-text available
Reproducing bug symptoms is a prerequisite for performing automatic bug diagnosis. Do bugs have characteristics that ease or hinder automatic bug diagnosis? In this paper, we conduct a thorough empirical study of several key characteristics of bugs that affect reproducibility at the production site. We examine randomly selected bug reports of six server applications and consider their implications on automatic bug diagnosis tools. Our results are promising. From the study, we find that nearly 82% of bug symptoms can be reproduced deterministically by re-running with the same set of inputs at the production site. We further find that very few input requests are needed to reproduce most failures; in fact, just one input request after session establishment suffices to reproduce the failure in nearly 77% of the cases. We describe the implications of the results on reproducing software failures and designing automated diagnosis tools for production runs.
Conference Paper
Full-text available
Program analysis and automated test generation have primarily been used to find correctness bugs. We present complexity testing, a novel automated test generation technique to find performance bugs. Our complexity testing algorithm, which we call WISE (Worst-case Inputs from Symbolic Execution), operates on a program accepting inputs of arbitrary size. For each input size, WISE attempts to construct an input which exhibits the worst-case computational complexity of the program. WISE uses exhaustive test generation for small input sizes and generalizes the result of executing the program on those inputs into an "input generator." The generator is subsequently used to efficiently generate worst-case inputs for larger input sizes. We have performed experiments to demonstrate the utility of our approach on a set of standard data structures and algorithms. Our results show that WISE can effectively generate worst-case inputs for several of these benchmarks.
Conference Paper
With the ubiquity of multi-core processors, software must make effective use of multiple cores to obtain good performance on modern hardware. One of the biggest roadblocks to this is load imbalance, or the uneven distribution of work across cores. We propose LIME, a framework for analyzing parallel programs and reporting the cause of load imbalance in application source code. This framework uses statistical techniques to pinpoint load imbalance problems stemming from both control flow issues (e.g., unequal iteration counts) and interactions between the application and hardware (e.g., unequal cache miss counts). We evaluate LIME on applications from widely used parallel benchmark suites, and show that LIME accurately reports the causes of load imbalance, their nature and origin in the code, and their relative importance.
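As a hedged sketch of the kind of statistic such a framework might compute (this is the common "percent imbalance" metric, not LIME's actual implementation), one can compare the busiest core against the mean and rank candidate causes by how imbalanced each per-core metric is:

```python
def percent_imbalance(per_core):
    """How much more work the busiest core does than the average core, in %."""
    mean = sum(per_core) / len(per_core)
    return (max(per_core) / mean - 1.0) * 100.0

def rank_causes(metrics):
    """Rank candidate causes (iteration counts, cache misses, ...) by imbalance."""
    return sorted(metrics, key=lambda m: percent_imbalance(metrics[m]), reverse=True)
```

A cause whose per-core counts are badly skewed (e.g., iteration counts) ranks above one that is nearly uniform (e.g., cache misses), mirroring the report LIME produces from source-level data.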
Conference Paper
Most Java programmers would agree that Java is a language that promotes a philosophy of “create and go forth”. By design, temporary objects are meant to be created on the heap, possibly used and then abandoned to be collected by the garbage collector. Excessive generation of temporary objects is termed “object churn” and is a form of software bloat that often leads to performance and memory problems. To mitigate this problem, many compiler optimizations aim at identifying objects that may be allocated on the stack. However, most such optimizations miss large opportunities for memory reuse when dealing with objects inside loops or when dealing with container objects. In this paper, we describe a novel algorithm that detects bloat caused by the creation of temporary container and String objects within a loop. Our analysis determines which objects created within a loop can be reused. Then we describe a source-to-source transformation that efficiently reuses such objects. Empirical evaluation indicates that our solution can reduce up to 40% of temporary object allocations in large programs, resulting in a performance improvement that can be as high as a 20% reduction in the run time, specifically when a program has a high churn rate or when the program is memory intensive and needs to run the GC often.
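The paper targets Java containers and Strings; as a rough Python analogue of the reuse transformation (illustrative only, with invented names), a per-iteration temporary is replaced by a single buffer that is cleared and refilled:

```python
def checksum_fresh(chunks):
    """Object churn: a new temporary list is allocated on every iteration."""
    total = 0
    for chunk in chunks:
        squares = [b * b for b in chunk]   # fresh temporary each pass
        total += sum(squares)
    return total

def checksum_reused(chunks):
    """Reuse: one buffer is cleared and refilled across iterations, as the
    paper's source-to-source transformation would arrange for loop temporaries."""
    total = 0
    squares = []
    for chunk in chunks:
        squares.clear()
        squares.extend(b * b for b in chunk)
        total += sum(squares)
    return total
```

Both variants compute the same result; the second allocates one container for the whole loop instead of one per iteration.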
Conference Paper
Performance analysts profile their programs to find methods that are worth optimizing: the "hot" methods. This paper shows that four commonly-used Java profilers (xprof, hprof, jprofile, and yourkit) often disagree on the identity of the hot methods. If two profilers disagree, at least one must be incorrect. Thus, there is a good chance that a profiler will mislead a performance analyst into wasting time optimizing a cold method with little or no performance improvement. This paper uses causality analysis to evaluate profilers and to gain insight into the source of their incorrectness. It shows that these profilers all violate a fundamental requirement for sampling-based profilers: to be correct, a sampling-based profiler must collect samples randomly. We show that a proof-of-concept profiler, which collects samples randomly, does not suffer from the above problems. Specifically, we show, using a number of case studies, that our profiler correctly identifies methods that are important to optimize; in some cases other profilers report that these methods are cold and thus not worth optimizing.
Conference Paper
Calling context trees (CCTs) associate performance metrics with paths through a program's call graph, providing valuable information for program understanding and performance analysis. Although CCTs are typically much smaller than call trees, in real applications they might easily consist of tens of millions of distinct calling contexts: this sheer size makes them difficult to analyze and might hurt execution times due to poor access locality. For performance analysis, accurately collecting information about hot calling contexts may be more useful than constructing an entire CCT that includes millions of uninteresting paths. As we show for a variety of prominent Linux applications, the distribution of calling context frequencies is typically very skewed. In this paper we show how to exploit this property to reduce the CCT size considerably. We introduce a novel run-time data structure, called Hot Calling Context Tree (HCCT), that offers an additional intermediate point in the spectrum of data structures for representing interprocedural control flow. The HCCT is a subtree of the CCT that includes only hot nodes and their ancestors. We show how to compute the HCCT without storing the exact frequency of all calling contexts, by using fast and space-efficient algorithms for mining frequent items in data streams. With this approach, we can distinguish between hot and cold contexts on the fly, while obtaining very accurate frequency counts. We show both theoretically and experimentally that the HCCT achieves a similar precision as the CCT in a much smaller space, roughly proportional to the number of distinct hot contexts: this is typically several orders of magnitude smaller than the total number of calling contexts encountered during a program's execution. Our space-efficient approach can be effectively combined with previous context-sensitive profiling techniques, such as sampling and bursting.
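One classic space-efficient frequent-items algorithm of the kind the paper builds on is Misra-Gries; a minimal sketch follows, with calling contexts reduced to hashable labels (an assumption for illustration, since real contexts are call-graph paths):

```python
def frequent_contexts(stream, k):
    """Misra-Gries frequent-items sketch over a stream of context labels.
    Keeps at most k-1 counters; any context occurring more than
    len(stream)/k times is guaranteed to survive as a candidate."""
    counters = {}
    for ctx in stream:
        if ctx in counters:
            counters[ctx] += 1
        elif len(counters) < k - 1:
            counters[ctx] = 1
        else:
            # Decrement all counters; evict those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The memory footprint is bounded by k, not by the number of distinct contexts, which is the property that lets the HCCT track only hot contexts and their ancestors instead of the full CCT.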
Conference Paper
Concurrency bugs are widespread in multithreaded programs. Fixing them is time-consuming and error-prone. We present CFix, a system that automates the repair of concurrency bugs. CFix works with a wide variety of concurrency-bug detectors. For each failure-inducing interleaving reported by a bug detector, CFix first determines a combination of mutual-exclusion and order relationships that, once enforced, can prevent the buggy interleaving. CFix then uses static analysis and testing to determine where to insert what synchronization operations to force the desired mutual-exclusion and order relationships, with a best effort to avoid deadlocks and excessive performance losses. CFix also simplifies its own patches by merging fixes for related bugs. Evaluation using four different types of bug detectors and thirteen real-world concurrency-bug cases shows that CFix can successfully patch these cases without causing deadlocks or excessive performance degradation. Patches automatically generated by CFix are of similar quality to those manually written by developers.
Conference Paper
Many bugs, even those that are known and documented in bug reports, remain in mature software for a long time due to the lack of the development resources to fix them. We propose a general approach, R2Fix, to automatically generate bug-fixing patches from free-form bug reports. R2Fix combines past fix patterns, machine learning techniques, and semantic patch generation techniques to fix bugs automatically. We evaluate R2Fix on three projects, i.e., the Linux kernel, Mozilla, and Apache, for three important types of bugs: buffer overflows, null pointer bugs, and memory leaks. R2Fix generates 57 patches correctly, 5 of which are new patches for bugs that have not been fixed by developers yet. We reported all 5 new patches to the developers; 4 have already been accepted and committed to the code repositories. The 57 correct patches generated by R2Fix could have shortened and saved up to an average of 63 days of bug diagnosis and patch generation time.
Article
Traditional profilers identify where a program spends most of its resources. They do not provide information about why the program spends those resources or about how resource consumption would change for different program inputs. In this paper we introduce the idea of algorithmic profiling. While a traditional profiler determines a set of measured cost values, an algorithmic profiler determines a cost function. It does that by automatically determining the "inputs" of a program, by measuring the program's "cost" for any given input, and by inferring an empirical cost function.
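A minimal sketch of the inference step (the toy cost model and function names are assumptions, not the paper's tooling): measure a program's cost at several input sizes, then fit the empirical growth exponent by least squares on a log-log scale.

```python
import math

def bubble_comparisons(n):
    """Measured 'cost' of a toy bubble sort on an input of size n (comparisons)."""
    xs = list(range(n, 0, -1))
    cost = 0
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            cost += 1
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return cost

def fit_exponent(cost_fn, sizes):
    """Least-squares slope of log(cost) vs log(n): the empirical growth exponent."""
    pts = [(math.log(n), math.log(cost_fn(n))) for n in sizes]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den
```

For the bubble sort above the fitted exponent comes out near 2, i.e., the algorithmic profiler infers a quadratic cost function rather than a single flat measurement.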
Article
There are more bugs in real-world programs than human programmers can realistically address. This paper evaluates two research questions: “What fraction of bugs can be repaired automatically?” and “How much does it cost to repair a bug automatically?” In previous work, we presented GenProg, which uses genetic programming to repair defects in off-the-shelf C programs. To answer these questions, we: (1) propose novel algorithmic improvements to GenProg that allow it to scale to large programs and find repairs 68% more often, (2) exploit GenProg's inherent parallelism using cloud computing resources to provide grounded, human-competitive cost measurements, and (3) generate a large, indicative benchmark set to use for systematic evaluations. We evaluate GenProg on 105 defects from 8 open-source programs totaling 5.1 million lines of code and involving 10,193 test cases. GenProg automatically repairs 55 of those 105 defects. To our knowledge, this evaluation is the largest available of its kind, and is often two orders of magnitude larger than previous work in terms of code or test suite size or defect count. Public cloud computing prices allow our 105 runs to be reproduced for $403; a successful repair completes in 96 minutes and costs $7.32, on average.
Article
Many applications suffer from run-time bloat: excessive memory usage and work to accomplish simple tasks. Bloat significantly affects scalability and performance, and exposing it requires good diagnostic tools. We present a novel analysis that profiles the run-time execution to help programmers uncover potential performance problems. The key idea of the proposed approach is to track object references, starting from object creation statements, through assignment statements, and eventually statements that perform useful operations. This propagation is abstracted by a representation we refer to as a reference propagation graph. This graph provides path information specific to reference producers and their run-time contexts. Several client analyses demonstrate the use of reference propagation profiling to uncover runtime inefficiencies. We also present a study of the properties of reference propagation graphs produced by profiling 36 Java programs. Several case studies discuss the inefficiencies identified in some of the analyzed programs, as well as the significant improvements obtained after code optimizations.
Article
Developers frequently use inefficient code sequences that could be fixed by simple patches. These inefficient code sequences can cause significant performance degradation and resource waste, referred to as performance bugs. Meager increases in single threaded performance in the multi-core era and increasing emphasis on energy efficiency call for more effort in tackling performance bugs. This paper conducts a comprehensive study of 110 real-world performance bugs that are randomly sampled from five representative software suites (Apache, Chrome, GCC, Mozilla, and MySQL). The findings of this study provide guidance for future work to avoid, expose, detect, and fix performance bugs. Guided by our characteristics study, efficiency rules are extracted from 25 patches and are used to detect performance bugs. 332 previously unknown performance problems are found in the latest versions of MySQL, Apache, and Mozilla applications, including 219 performance problems found by applying rules across applications.
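A toy illustration of such an inefficient code sequence and its simple patch (invented for illustration, not one of the paper's 110 studied bugs): repeated string concatenation copies the accumulated string on every iteration, and the one-line fix joins the pieces once.

```python
def build_csv_slow(rows):
    """Inefficient sequence: += on a string re-copies the whole accumulated
    result every iteration, giving quadratic work overall."""
    out = ""
    for row in rows:
        out += ",".join(row) + "\n"
    return out

def build_csv_fast(rows):
    """The simple patch: accumulate pieces and join once (linear work)."""
    return "".join(",".join(row) + "\n" for row in rows)
```

Both produce identical output; only the asymptotic cost differs, which is exactly the pattern-based "efficiency rule" shape the study extracts from patches.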
Conference Paper
Software bugs affect system reliability. When a bug is exposed in the field, developers need to fix them. Unfortunately, the bug-fixing process can also introduce errors, which leads to buggy patches that further aggravate the damage to end users and erode software vendors' reputation. This paper presents a comprehensive characteristic study on incorrect bug-fixes from large operating system code bases including Linux, OpenSolaris, FreeBSD and also a mature commercial OS developed and evolved over the last 12 years, investigating not only the mistake patterns during bug-fixing but also the possible human reasons in the development process when these incorrect bug-fixes were introduced. Our major findings include: (1) at least 14.8%--24.4% of sampled fixes for post-release bugs in these large OSes are incorrect and have made impacts to end users. (2) Among several common bug types, concurrency bugs are the most difficult to fix correctly: 39% of concurrency bug fixes are incorrect. (3) Developers and reviewers for incorrect fixes usually do not have enough knowledge about the involved code. For example, 27% of the incorrect fixes are made by developers who have never touched the source code files associated with the fix. Our results provide useful guidelines to design new tools and also to improve the development process. Based on our findings, the commercial software vendor whose OS code we evaluated is building a tool to improve the bug fixing and code reviewing process.
Conference Paper
Framework-intensive applications (e.g., Web applications) heavily use temporary data structures, often resulting in performance bottlenecks. This paper presents an optimized blended escape analysis to approximate object lifetimes and thus, to identify these temporaries and their uses. Empirical results show that this optimized analysis on average prunes 37% of the basic blocks in our benchmarks, and achieves a speedup of up to 29 times compared to the original analysis. Newly defined metrics quantify key properties of temporary data structures and their uses. A detailed empirical evaluation offers the first characterization of temporaries in framework-intensive applications. The results show that temporary data structures can include up to 12 distinct object types and can traverse through as many as 14 method invocations before being captured.
Conference Paper
Every bug has a story behind it. The people that discover and resolve it need to coordinate, to get information from documents, tools, or other people, and to navigate through issues of accountability, ownership, and organizational structure. This paper reports on a field study of coordination activities around bug fixing that used a combination of case study research and a survey of software professionals. Results show that the histories of even simple bugs are strongly dependent on social, organizational, and technical knowledge that cannot be solely extracted through automation of electronic repositories, and that such automation provides incomplete and often erroneous accounts of coordination. The paper uses rich bug histories and survey results to identify common bug fixing coordination patterns and to provide implications for tool designers and researchers of coordination in software development.
Conference Paper
Concurrent programming is increasingly important for achieving performance gains in the multi-core era, but it is also a difficult and error-prone task. Concurrency bugs are particularly difficult to avoid and diagnose, and therefore in order to improve methods for handling such bugs, we need a better understanding of their characteristics. In this paper we present a study of concurrency bugs in MySQL, a widely used database server. While previous studies of real-world concurrency bugs exist, they have centered their attention on the causes of these bugs. In this paper we provide a complementary focus on their effects, which is important for understanding how to detect or tolerate such bugs at run-time. Our study uncovered several interesting facts, such as the existence of a significant number of latent concurrency bugs, which silently corrupt data structures and are exposed to the user potentially much later. We also highlight several implications of our findings for the design of reliable concurrent systems.
Conference Paper
The reality of multi-core hardware has made concurrent programs pervasive. Unfortunately, writing correct concurrent programs is difficult. Addressing this challenge requires advances in multiple directions, including concurrency bug detection, concurrent program testing, concurrent programming model design, etc. Designing effective techniques in all these directions will significantly benefit from a deep understanding of real world concurrency bug characteristics. This paper provides the first (to the best of our knowledge) comprehensive real world concurrency bug characteristic study. Specifically, we have carefully examined concurrency bug patterns, manifestation, and fix strategies of 105 randomly selected real world concurrency bugs from 4 representative server and client open-source applications (MySQL, Apache, Mozilla and OpenOffice). Our study reveals several interesting findings and provides useful guidance for concurrency bug detection, testing, and concurrent programming language design. Some of our findings are as follows: (1) Around one third of the examined non-deadlock concurrency bugs are caused by violation to programmers' order intentions, which may not be easily expressed via synchronization primitives like locks and transactional memories; (2) Around 34% of the examined non-deadlock concurrency bugs involve multiple variables, which are not well addressed by existing bug detection tools; (3) About 92% of the examined concurrency bugs can be reliably triggered by enforcing certain orders among no more than 4 memory accesses. This indicates that testing concurrent programs can target at exploring possible orders among every small group of memory accesses, instead of among all memory accesses; (4) About 73% of the examined non-deadlock concurrency bugs were not fixed by simply adding or changing locks, and many of the fixes were not correct at the first try, indicating the difficulty of reasoning about concurrent execution by programmers.
Conference Paper
We present a study of operating system errors found by automatic, static, compiler analysis applied to the Linux and OpenBSD kernels. Our approach differs from previous studies that consider errors found by manual inspection of logs, testing, and surveys because static analysis is applied uniformly to the entire kernel source, though our approach necessarily considers a less comprehensive variety of errors than previous studies. In addition, automation allows us to track errors over multiple versions of the kernel source to estimate how long errors remain in the system before they are fixed.We found that device drivers have error rates up to three to seven times higher than the rest of the kernel. We found that the largest quartile of functions have error rates two to six times higher than the smallest quartile. We found that the newest quartile of files have error rates up to twice that of the oldest quartile, which provides evidence that code "hardens" over time. Finally, we found that bugs remain in the Linux kernel an average of 1.8 years before being fixed.
Conference Paper
Load tests aim to validate whether system performance is acceptable under peak conditions. Existing test generation techniques induce load by increasing the size or rate of the input. Ignoring the particular input values, however, may lead to test suites that grossly mischaracterize a system's performance. To address this limitation we introduce a mixed symbolic execution based approach that is unique in how it 1) favors program paths associated with a performance measure of interest, 2) operates in an iterative-deepening beam-search fashion to discard paths that are unlikely to lead to high-load tests, and 3) generates a test suite of a given size and level of diversity. An assessment of the approach shows it generates test suites that induce program response times and memory consumption several times worse than the compared alternatives, it scales to large and complex inputs, and it exposes a diversity of resource consuming program behavior.
Conference Paper
Many large-scale Java applications suffer from runtime bloat. They execute large volumes of methods, and create many temporary objects, all to execute relatively simple operations. There are large opportunities for performance optimizations in these applications, but most are being missed by existing optimization and tooling technology. While JIT optimizations struggle for a few percent, performance experts analyze deployed applications and regularly find gains of 2× or more. Finding such big gains is difficult, for both humans and compilers, because of the diffuse nature of runtime bloat. Time is spread thinly across calling contexts, making it difficult to judge how to improve performance. Bloat results from a pile-up of seemingly harmless decisions. Each adds temporary objects and method calls, and often copies values between those temporary objects. While data copies are not the entirety of bloat, we have observed that they are excellent indicators of regions of excessive activity. By optimizing copies, one is likely to remove the objects that carry copied values, and the method calls that allocate and populate them. We introduce copy profiling, a technique that summarizes runtime activity in terms of chains of data copies. A flat copy profile counts copies by method. We show how flat profiles alone can be helpful. In many cases, diagnosing a problem requires data flow context. Tracking and making sense of raw copy chains does not scale, so we introduce a summarizing abstraction called the copy graph. We implement three client analyses that, using the copy graph, expose common patterns of bloat, such as finding hot copy chains and discovering temporary data structures. We demonstrate, with examples from a large-scale commercial application and several benchmarks, that copy profiling can be used by a programmer to quickly find opportunities for large performance gains.
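A minimal sketch of a flat copy profile in Python (the instrumentation hook and method names are invented for illustration; the paper instruments a JVM rather than asking programmers to call a hook):

```python
from collections import Counter

COPY_PROFILE = Counter()

def record_copy(value, method):
    """Attribute one value copy to `method`, then pass the value through."""
    COPY_PROFILE[method] += 1
    return value

def make_dto(record):
    # Every field copied into the temporary dict is charged to this method.
    return {k: record_copy(v, "make_dto") for k, v in record.items()}

def hot_copy_methods(top=3):
    """The flat profile: methods ranked by how many copies they perform."""
    return COPY_PROFILE.most_common(top)
```

Ranking methods by copy count surfaces the "pile-up of seemingly harmless decisions" the abstract describes; the paper's copy graph adds the data-flow context a flat count lacks.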
Conference Paper
Fixing software bugs has always been an important and time-consuming process in software development. Fixing concurrency bugs has become especially critical in the multicore era. However, fixing concurrency bugs is challenging, in part due to non-deterministic failures and tricky parallel reasoning. Beyond correctly fixing the original problem in the software, a good patch should also avoid introducing new bugs, degrading performance unnecessarily, or damaging software readability. Existing tools cannot automate the whole fixing process and provide good-quality patches. We present AFix, a tool that automates the whole process of fixing one common type of concurrency bug: single-variable atomicity violations. AFix starts from the bug reports of existing bug-detection tools. It augments these with static analysis to construct a suitable patch for each bug report. It further tries to combine the patches of multiple bugs for better performance and code readability. Finally, AFix's run-time component provides testing customized for each patch. Our evaluation shows that patches automatically generated by AFix correctly eliminate six out of eight real-world bugs and significantly decrease the failure probability in the other two cases. AFix patches never introduce new bugs and usually have similar performance to manually-designed patches.
Article
This study focuses largely on two issues: (a) improved syntax for iterations and error exits, making it possible to write a larger class of programs clearly and efficiently without "go to" statements; (b) a methodology of program design, beginning with readable and correct, but possibly inefficient programs that are systematically transformed if necessary into efficient and correct, but possibly less readable code. The discussion brings out opposing points of view about whether or not "go to" statements should be abolished; some merit is found on both sides of this question. Finally, an attempt is made to define the true nature of structured programming, and to recommend fruitful directions for further study.
Article
Many popular software systems automatically report failures back to the vendors, allowing developers to focus on the most pressing problems. However, it takes a certain period of time to assess which failures occur most frequently. In an empirical investigation of the Firefox and Thunderbird crash report databases, we found that only 10 to 20 crashes account for the large majority of crash reports; predicting these “top crashes” thus could dramatically increase software quality. By training a machine learner on the features of top crashes of past releases, we can effectively predict the top crashes well before a new release. This allows for quick resolution of the most important crashes, leading to improved user experience and better allocation of maintenance efforts.
Conference Paper
We test the hypothesis that generic recovery techniques, such as process pairs, can survive most application faults without using application-specific information. We examine in detail the faults that occur in three, large, open-source applications: the Apache Web server, the GNOME desktop environment and the MySQL database. Using information contained in the bug reports and source code, we classify faults based on how they depend on the operating environment. We find that 72-87% of the faults are independent of the operating environment and are hence deterministic (non-transient). Recovering from the failures caused by these faults requires the use of application-specific knowledge. Half of the remaining faults depend on a condition in the operating environment that is likely to persist on retry, and the failures caused by these faults are also likely to require application-specific recovery. Unfortunately, only 5-14% of the faults were triggered by transient conditions, such as timing and synchronization, that naturally fix themselves during recovery. Our results indicate that classical application-generic recovery techniques, such as process pairs, will not be sufficient to enable applications to survive most failures caused by application faults
Trend Micro will pay for PC repair costs
  • P Kallender
Apache's JIRA issue tracker
  • Apache Software Foundation
Inside Windows 7-reliability, performance and PerfTrack
  • D Fields
  • B Karagounis
Lessons from the Colorado benefits management system disaster
  • G E Morris
1901 census site still down after six months
  • T Richardson