Conference Paper

Discovering, reporting, and fixing performance bugs


Abstract

Software performance is critical for how users perceive the quality of software products. Performance bugs - programming errors that cause significant performance degradation - lead to poor user experience and low system throughput. Designing effective techniques to address performance bugs requires a deep understanding of how performance bugs are discovered, reported, and fixed. In this paper, we study how performance bugs are discovered, reported to developers, and fixed by developers, and compare the results with those for non-performance bugs. We study performance and non-performance bugs from three popular code bases: Eclipse JDT, Eclipse SWT, and Mozilla. First, we find little evidence that fixing performance bugs has a higher chance to introduce new functional bugs than fixing non-performance bugs, which implies that developers may not need to be over-concerned about fixing performance bugs. Second, although fixing performance bugs is about as error-prone as fixing non-performance bugs, fixing performance bugs is more difficult than fixing non-performance bugs, indicating that developers need better tool support for fixing performance bugs and testing performance bug patches. Third, unlike many non-performance bugs, a large percentage of performance bugs are discovered through code reasoning, not through users observing the negative effects of the bugs (e.g., performance degradation) or through profiling. The result suggests that techniques to help developers reason about performance, better test oracles, and better profiling techniques are needed for discovering performance bugs.


... However, these studies are not specifically designed for PBs, and thus only capture some partial characteristics of PBs in DL systems. In contrast, PBs have been widely studied for traditional systems, e.g., desktop or server applications [30,48,62,79], highly configurable systems [24,25], mobile applications [38,40], database-backed web applications [77,78], and JavaScript systems [59]. However, PBs in DL systems could be different due to the programming paradigm shift from traditional systems to DL systems. ...
... Step 2: PB Post Selection. Instead of directly using performance-related keywords from the existing studies on PBs in traditional systems (e.g., [30,48,62,79]), we derived a keyword set in the following way to achieve a wide and comprehensive coverage of PB posts. We first randomly sampled 100 posts with a tag of "performance" from the 18,730 posts in Step 1. ...
... A lot of empirical studies have characterized performance bugs from different perspectives (e.g., root causes, discovery, diagnosis, fixing and reporting) for desktop or server applications [30,48,62,79,88], highly configurable systems [24,25], mobile applications [38,40], database-backed web applications [77,78], and JavaScript systems [59]. They shed light on potential directions on performance analysis (e.g., detection, profiling and testing). ...
Preprint
Full-text available
Deep learning (DL) has been increasingly applied to a variety of domains. The programming paradigm shift from traditional systems to DL systems poses unique challenges in engineering DL systems. Performance is one of the challenges, and performance bugs (PBs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PBs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to characterize symptoms, root causes, and introducing and exposing stages of PBs in DL systems developed in TensorFlow and Keras, with a total of 238 PBs collected from 225 StackOverflow posts. Our findings shed light on the implications for developing high-performance DL systems, and detecting and localizing PBs in DL systems. We also build the first benchmark of 56 PBs in DL systems, and assess the capability of existing approaches in tackling them. Moreover, we develop a static checker DeepPerf to detect three types of PBs, and identify 488 new PBs in 130 GitHub projects. 62 and 18 of them have been respectively confirmed and fixed by developers.
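The abstract does not spell out DeepPerf's detection rules, so the following is only an illustrative sketch of what a static check for one common DL performance-bug pattern could look like: flagging calls to a per-example API (e.g., a hypothetical model.predict) inside a Python loop instead of a batched call. The SUSPECT_CALLS set and the example code are assumptions, not the paper's checker.

```python
# Illustrative sketch only: flags suspect per-item API calls nested inside loops.
import ast

SUSPECT_CALLS = {"predict", "run"}  # hypothetical per-item APIs worth batching

def find_calls_in_loops(source: str):
    """Return (lineno, call_name) pairs for suspect calls inside for/while loops."""
    tree = ast.parse(source)
    findings = []

    def visit(node, in_loop):
        if isinstance(node, (ast.For, ast.While)):
            in_loop = True
        if in_loop and isinstance(node, ast.Call):
            name = getattr(node.func, "attr", getattr(node.func, "id", None))
            if name in SUSPECT_CALLS:
                findings.append((node.lineno, name))
        for child in ast.iter_child_nodes(node):
            visit(child, in_loop)

    visit(tree, False)
    return findings

example = """
for row in rows:
    y = model.predict(row)   # candidate performance bug: per-item prediction
"""
print(find_calls_in_loops(example))  # [(3, 'predict')]
```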
... Table 2 lists web client subjects used in the prior research. Nistor et al. [17] study over 600 bugs from three open-source projects. They compare and contrast how performance bugs and non-performance bugs are discovered, reported, and fixed. ...
... Configurable software systems complicate performance testing. A prior study [17] shows that performance bugs in configurable software systems are more complex and take a longer time to fix. The sheer size of the configuration space makes the quality of software even harder to achieve. ...
... It is not unusual to see discussions in a bug report that a performance bug was introduced a few versions earlier but only surfaced in a bug report recently. Bug Detection. Nistor et al. [17] report that most performance bugs (up to 57%) are discovered through code reasoning. Code reasoning involves code understanding. ...
... For example, if a code change introduces a security vulnerability, security measures to counteract this may be implemented elsewhere (Williams et al. 2018; Mahrous and Malhotra 2018; Ping et al. 2011). If a code change introduces a performance issue, this performance issue may be fixed and improved in a different part of the system (Nistor et al. 2013; Jin et al. 2012), for example by changing configuration parameters. Non-functional bugs can be harder to fix than their functional counterparts. ...
... However, those studies do not make a distinction between functional and non-functional bugs during their evaluation. Nonetheless, it has been shown that non-functional bugs present different characteristics than functional bugs (Nistor et al. 2013). In particular, non-functional requirements describe the quality attributes of a program, as opposed to its functionality (Kotonya and Sommerville 1998). ...
... In either scenario, the SZZ approach may consider the later changes as bug-inducing instead of the original changes. This phenomenon is intuitive since non-functional bugs often take a long time to be discovered and fixed (Nistor et al. 2013). Therefore, considering the most recent code change before the bug reporting date may not be a suitable heuristic for non-functional bugs. ...
Article
Full-text available
Non-functional bugs, e.g., performance bugs and security bugs, bear a heavy cost on both software developers and end-users. For example, IBM estimates the cost of a single data breach to be millions of dollars. Tools to reduce the occurrence, impact, and repair time of non-functional bugs can therefore provide key assistance for software developers racing to fix these issues. Identifying bug-inducing changes is a critical step in software quality assurance. In particular, the SZZ approach is commonly used to identify bug-inducing commits. However, the fixes to non-functional bugs may be scattered and separate from their bug-inducing locations in the source code. The nature of non-functional bugs may therefore make the SZZ approach a sub-optimal approach for identifying bug-inducing changes. Yet, prior studies that leverage or evaluate the SZZ approach do not consider non-functional bugs, leading to potential bias on the results. In this paper, we conduct an empirical study on the results of the SZZ approach when used to identify the inducing changes of the non-functional bugs in the NFBugs dataset. We eliminate a majority of the bug-inducing commits as they are not in the same method or class level. We manually examine whether each identified bug-inducing change is indeed the correct bug-inducing change. Our manual study shows that a large portion of non-functional bugs cannot be properly identified by the SZZ approach. By manually identifying the root causes of the falsely detected bug-inducing changes, we uncover root causes for false detection that have not been found by previous studies. We evaluate the identified bug-inducing changes based on three criteria from prior research, i.e., the earliest bug appearance, the future impact of changes, and the realism of bug introduction. We find that prior criteria may be irrelevant for non-functional bugs. Our results may be used to assist in future research on non-functional bugs, and highlight the need to complement SZZ to accommodate the unique characteristics of non-functional bugs.
... Performance failures pose an enormous challenge for developers. Compared to functional failures, they take considerably longer to be discovered [Jin et al., 2012], are harder to reproduce and debug [Zaman et al., 2012; Nistor et al., 2013a], take longer to fix [Zaman et al., 2011; Nistor et al., 2013a; Liu et al., 2014; Mazuera-Rozo et al., 2020], and require more as well as more experienced developers to do so [Zaman et al., 2011; Nistor et al., 2013a]. Users are similarly affected, e.g., they are more likely to leave an application if it takes longer to load [Akamai Technologies Inc., 2017; Artz, 2009]. ...
Article
Software performance faults have severe consequences for users, developers, and companies. One way to unveil performance faults before they manifest in production is performance testing, which ought to be done on every new version of the software, ideally on every commit. However, performance testing faces multiple challenges that inhibit it from being applied early in the development process, on every new commit, and in an automated fashion. In this dissertation, we investigate three challenges of software microbenchmarks, a performance testing technique on unit granularity which is predominantly used for libraries and frameworks. The studied challenges affect the quality aspects (1) runtime, (2) result variability, and (3) performance change detection of microbenchmark executions. The objective is to understand the extent of these challenges in real-world software and to find solutions to address them. To investigate the challenges' extent, we perform a series of experiments and analyses. We execute benchmarks in bare-metal as well as multiple cloud environments and conduct a large-scale mining study on benchmark configurations. The results show that all three challenges are common: (1) benchmark suite runtimes are often longer than 3 hours; (2) result variability can be extensive, in some cases up to 100%; and (3) benchmarks often only reliably detect large performance changes of 60% or more. To address the challenges, we devise targeted solutions as well as adapt well-known techniques from other domains for software microbenchmarks: (1) a solution that dynamically stops benchmark executions based on statistics to reduce runtime while maintaining low result variability; (2) a solution to identify unstable benchmarks that does not require execution, based on statically-computable source code features and machine learning algorithms; (3) traditional test case prioritization (TCP) techniques to execute benchmarks earlier that detect larger performance changes; and (4) specific execution strategies to detect small performance changes reliably even when executed in unreliable cloud environments. We experimentally evaluate the solutions and techniques on real-world benchmarks and find that they effectively deal with the three challenges. (1) Dynamic reconfiguration drastically reduces runtime, by between 48.4% and 86.0%, without changing the results of 78.8% to 87.6% of the benchmarks, depending on the project and statistic used. (2) The instability prediction model effectively identifies unstable benchmarks when relying on random forest classifiers, with a prediction performance between 0.79 and 0.90 area under the receiver operating characteristic curve (AUC). (3) TCP applied to benchmarks is effective and efficient, with APFD-P values for the best technique ranging from 0.54 to 0.71 and a computational overhead of 11%. (4) Batch testing, i.e., executing the benchmarks of two versions on the same instances interleaved and repeated as well as repeated across instances, enables reliable detection of performance changes of 10% or less, even when using unreliable cloud infrastructure as the execution environment. Overall, this dissertation shows that real-world software microbenchmarks are considerably affected by all three challenges (1) runtime, (2) result variability, and (3) performance change detection; however, deliberate planning and execution strategies effectively reduce their impact.
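The dissertation's exact statistics for dynamically stopping benchmark executions are not given above; the following is a hedged sketch of one plausible stopping rule: keep running iterations until the relative confidence-interval width of the mean runtime drops below a threshold. The function names and thresholds are assumptions.

```python
# Hedged sketch of a "dynamic reconfiguration"-style stopping rule for microbenchmarks.
import statistics

def run_until_stable(measure, min_iters=10, max_iters=100, rel_ci=0.02):
    """measure() returns one execution time; stop early once the mean is stable."""
    samples = []
    for i in range(max_iters):
        samples.append(measure())
        if i + 1 >= min_iters:
            mean = statistics.mean(samples)
            # ~95% normal-approximation confidence-interval half-width
            half_width = 1.96 * statistics.stdev(samples) / (len(samples) ** 0.5)
            if mean > 0 and half_width / mean < rel_ci:
                break
    return samples

# Usage with a simulated measurement function:
import random
times = run_until_stable(lambda: random.gauss(100.0, 1.0))
print(len(times), statistics.mean(times))
```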
... As a result, the number of research papers analyzing the specific characteristics of performance bugs has grown significantly in the last decade. A performance bug is defined as a programming or configuration error that causes significant performance degradation, leading to undesirable effects like low system throughput, memory bloat, Graphical User Interface (GUI) lagging, or energy drain [1], [2]. Preventing performance bugs, or implementing effective tools to detect and fix them, requires a wide understanding of the nature of these issues in real-world programs. ...
... Well-tested applications such as Microsoft SQL Server, Apache HTTPD and Mozilla Firefox, among others, are affected by hundreds of performance bugs [2], [23]. A performance bug is a programming error that causes significant performance degradation in a program, leading to slow and/or inefficient software [1], [2]. These bugs can cause GUI lagging, memory bloat or excessive energy consumption, among others, and consequently they may cause a poor user experience and a loss of customers and money to companies. ...
Article
Full-text available
The detection of performance bugs, like those causing an unexpected execution time, has gained much attention in the last years due to their potential impact in safety-critical and resource-constrained applications. Much effort has been put into trying to understand the nature of performance bugs in different domains as a starting point for the development of effective testing techniques. However, the lack of a widely accepted classification scheme of performance faults and, more importantly, the lack of well-documented and understandable datasets makes it difficult to draw rigorous and verifiable conclusions widely accepted by the community. In this paper, we present TANDEM, a dual contribution related to real-world performance bugs. Firstly, we propose a taxonomy of performance bugs based on a thorough systematic review of the related literature, divided into three main categories: effects, causes and contexts of bugs. Secondly, we provide a complete collection of fully documented real-world performance bugs. Together, these contributions pave the way for the development of stronger and reproducible research results on performance testing.
... Performance problems have been studied for several decades in the literature, and software performance engineering emerged as the discipline focused on fostering the specification of performance-related factors [95,8,94] and reporting experiences related to their management [81,41,78,3]. Performance bugs, i.e., suboptimal implementation choices that create significant performance degradation, have been demonstrated to hurt the satisfaction of end-users in the context of desktop applications [67]. These bugs, which are pervasive and difficult to understand, can cause delays, failures on deployment, redesigns, even a new implementation of the system or abandonment of projects, which lead to significant costs [93,58]. ...
... Nistor et al. [67] present an empirical study on three popular code bases (Eclipse JDT, Eclipse SWT, and Mozilla) with the goal of investigating how performance and non-performance bugs are discovered, reported and fixed by developers. Three main findings are outlined: (i) fixing performance bugs may introduce new functional bugs, similarly to fixing non-performance bugs; (ii) fixing performance bugs is more difficult than fixing non-performance bugs; (iii) unlike non-performance bugs, many performance bugs are found through code reasoning, not through direct observation of the bug's negative effects or through profiling. ...
... Olivo et al. [70] also investigate performance bugs, specifically traversal bugs that arise if a program fragment repeatedly iterates over a data structure, such as an array or list, that has not been modified between successive traversals. Such performance bugs are typically easy to fix and often only require the ...
Related work overview:
[98]: analysis of the collaboration among project members to detect and fix performance bugs; limited to browsers (i.e., Mozilla Firefox and Google Chrome); metrics: performance issues, service latency.
Nistor et al. [67]: performance bugs are demonstrated to be more difficult than functional bugs; bugs are found through code reasoning, not by the direct observation of profiling data; metric: system throughput.
Liu et al. [57]: empirical study of performance bugs from smartphone applications; limited to Android applications; metrics: service latency, system throughput, resource utilization.
Hecht et al. [27]: empirical study on the impact of code smells on performance metrics; limited to Android applications; metrics: service latency, system throughput.
Cruz et al. [9]: empirical study on the impact of performance best practices on energy consumption; limited to Android applications; metric: energy consumption.
Olivo et al. [70]: static detection of performance bugs as collections of redundant traversals; limited to data structures wrongly used; metric: compression ratio.
Jovic et al. [36]: look for causes of long-latency performance bugs; limited to Java applications; metric: service latency.
Killian et al. [39]: detection of performance bugs in distributed systems; network delays are simulated and can hide some software-specific bugs; metrics: communication and bandwidth. ...
Article
Full-text available
A recent study showed that mobile apps nowadays represent 75% of the whole usage of mobile devices. This means that the mobile user experience, while tied to many factors (e.g., hardware device, connection speed, etc.), strongly depends on the quality of the apps being used. With "quality" here we do not simply refer to the features offered by the app, but also to its non-functional characteristics, such as security, reliability, and performance. The latter is particularly important considering the limited hardware resources (e.g., memory) mobile apps can exploit. In this paper, we present the largest study to date investigating performance bugs in mobile apps. In particular, we (i) define a taxonomy of the types of performance bugs affecting Android and iOS apps; and (ii) study the survivability of performance bugs (i.e., the number of days between the bug introduction and its fixing). Our findings aim to help researchers and app developers in building performance-bug detection tools and focusing their verification and validation activities on the most frequent types of performance bugs.
... Compared to functional faults, performance bugs are significantly harder to detect and require more time and effort to be fixed [3]. This is partly due to the lack of test oracles, that is, mechanisms to decide whether the performance of the program with a given input is acceptable i.e., the oracle problem [7,8]. ...
... This is partly due to the lack of test oracles, that is, mechanisms to decide whether the performance of the program with a given input is acceptable i.e., the oracle problem [7,8]. For instance, Nistor et al. [3] analyzed 210 performance bugs from three mature open source projects and concluded that "better oracles are needed for discovering performance bugs". In contrast to functional bugs, performance bugs do not usually produce wrong results or crashes in the program under test and therefore they cannot be detected by simply inspecting the program output. ...
... Performance bugs are programming errors that can cause a significant performance degradation like excessive memory consumption [1] or energy leaks [2,3]. Performance bugs affect key non-functional properties of programs such as execution time or memory consumption. ...
Article
Performance bugs are known to be a major threat to the success of software products. Performance tests aim to detect performance bugs by executing the program through test cases and checking whether it exhibits a noticeable performance degradation. The principles of mutation testing, a well-established testing technique for the assessment of test suites through the injection of artificial faults, could be exploited to evaluate and improve the detection power of performance tests. However, the application of mutation testing to assess performance tests, henceforth called performance mutation testing (PMT), is a novel research topic with numerous open challenges. In previous papers, we identified some key challenges related to PMT. In this work, we go a step further and explore the feasibility of applying PMT at the source-code level in general-purpose languages. To do so, we revisit concepts associated with classical mutation testing, and design seven novel mutation operators to model known bug-inducing patterns. As a proof of concept, we applied traditional mutation operators as well as performance mutation operators to open-source C++ programs. The results reveal the potential of the new performance-mutants to help assess and enhance performance tests when compared to traditional mutants. A review of live mutants in these programs suggests that they can induce the design of special test inputs. In addition to these promising results, our work brings a whole new set of challenges related to PMT, which will hopefully serve as a starting point for new contributions in the area.
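The paper's seven mutation operators target C++ source and are not listed above, so the following is only a hedged Python illustration of the idea behind performance mutation testing: a hypothetical mutant that moves a loop-invariant computation into the loop, which an adequate performance test should "kill" by detecting the slowdown under a time budget. Function names and the budget are assumptions.

```python
# Hedged illustration of performance mutation testing (PMT), not the paper's operators.
import time

def original(items):
    threshold = sum(items) / len(items)          # invariant computed once
    return [x for x in items if x > threshold]

def mutant(items):
    # injected performance fault: the invariant is recomputed on every iteration
    return [x for x in items if x > sum(items) / len(items)]

def perf_test(fn, items, budget_s):
    """A simple performance test oracle: pass if the call fits inside the time budget."""
    start = time.perf_counter()
    fn(items)
    return time.perf_counter() - start <= budget_s

data = list(range(5_000))
print("original passes:", perf_test(original, data, budget_s=0.05))
print("mutant killed:  ", not perf_test(mutant, data, budget_s=0.05))
```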
... A performance bug [22] is defined as "a programming error that causes significant performance degradation." This performance degradation includes poor user experience, laggy application responsiveness, lower system throughput, and wasted computational resources [18]. ...
... This performance degradation includes poor user experience, laggy application responsiveness, lower system throughput, and wasted computational resources [18]. The study in [22] showed that a performance bug needs more time to be fixed than a non-performance bug. ...
... As we confirmed that "vulnerability" and "unauthorized access" achieved relatively high attention from the developers, security bugs are also considered high impact by researchers and have been well studied [6,12,16,17,19,29,35]. "Lower performance" in the "Effect" category is also well studied [20][21][22][35] as performance bugs. However, to the best of our knowledge, there is no study on "data loss" in the "Data" category, which is of relatively high concern to FLOSS developers (5%). ...
Chapter
Full-text available
In recent years, many researchers in the SE community have been devoting considerable efforts to provide FLOSS developers with a means to quickly find and fix various kinds of bugs in FLOSS products, such as security and performance bugs. However, it is not entirely clear which bugs FLOSS developers think should be removed preferentially. Without a full understanding of FLOSS developers' perceptions of bug finding and fixing, researchers' efforts might remain far away from FLOSS developers' needs. In this study, we interview 322 notable GitHub developers about high impact bugs to understand FLOSS developers' needs for bug finding and fixing, and we manually inspect and classify developers' answers (bugs) by symptoms and root causes of bugs. As a result, we show that security and breakage bugs are highly crucial for FLOSS developers. We also identify what kinds of high impact bugs should be newly studied by the SE community to help FLOSS developers.
... Figure 5 shows the project-wise smell distribution of C++ script smells and settings smells. Based on the analysis, we identified that for each project the median number of CPP scripts is 1. 32 contains more than 11 performance smells, which shows the necessity of performance bottleneck detection tools such as UEPerfAnalyzer to improve the performance of XR applications. Among the analyzed projects, the sandisk/GabrielPaliari project has 60 settings smells, which is the highest among the analyzed projects. ...
... The study pointed out that performance issues are difficult to reproduce and also require more discussion to fix. Nistor et al. [32] also performed a similar study on performance and non-performance bugs from three popular codebases: Eclipse JDT, Eclipse SWT, and Mozilla. The work concluded that fixing performance bugs is more challenging than fixing non-performance bugs. ...
Preprint
Extended Reality (XR) includes Virtual Reality (VR), Augmented Reality (AR) and Mixed Reality (MR). XR is an emerging technology that simulates a realistic environment for users. XR techniques have provided revolutionary user experiences in various application scenarios (e.g., training, education, product/architecture design, gaming, remote conference/tour, etc.). Due to the high computational cost of rendering real-time animation in limited-resource devices and constant interaction with user activity, XR applications often face performance bottlenecks, and these bottlenecks create a negative impact on the user experience of XR software. Thus, performance optimization plays an essential role in many industry-standard XR applications. Even though identifying performance bottlenecks in traditional software (e.g., desktop applications) is a widely explored topic, those approaches cannot be directly applied within XR software due to the different nature of XR applications. Moreover, XR applications developed in different frameworks such as Unity and Unreal Engine show different performance bottleneck patterns and thus, bottleneck patterns of Unity projects can't be applied for Unreal Engine (UE)-based XR projects. To fill the knowledge gap for XR performance optimizations of Unreal Engine-based XR projects, we present the first empirical study on performance optimizations from seven UE XR projects, 78 UE XR discussion issues and three sources of UE documentation. Our analysis identified 14 types of performance bugs, including 12 types of bugs related to UE settings issues and two types of CPP source code-related issues. To further assist developers in detecting performance bugs based on the identified bug patterns, we also developed a static analyzer, UEPerfAnalyzer, that can detect performance bugs in both configuration files and source code.
... The number of bug reports in this study (310) is reasonable, as it is large enough to make statistically significant claims, and small enough to allow for a reasonably thorough analysis of each bug report. Indeed, there is also precedent from prior research, including one study conducted by the present author, where 317 bug reports were analyzed (Ocariza et al. 2013), and those conducted by (Nistor et al. 2013) and (Selakovic and Pradel 2016), each of which analyzed fewer than 300 performance-related bug reports. ...
... Lastly, (Nistor et al. 2013) looked at performance bugs from different code bases, and analyzed how they are detected, reported, and fixed, in comparison to non-performance bugs. For instance, the authors found that performance bugs are generally more difficult to fix compared to non-performance bugs, and conclude the need for better tool support for the former. ...
Article
Full-text available
Performance regressions can have a drastic impact on the usability of a software application. The crucial task of localizing such regressions can be achieved using bisection, which attempts to find the bug-introducing commit using binary search. This approach is used extensively by many development teams, but it is an inherently heuristical approach when applied to performance regressions, and therefore, does not have correctness guarantees. Unfortunately, bisection is also time-consuming, which implies the need to assess its effectiveness prior to running it. To this end, the goal of this study is to analyze the effectiveness of bisection for performance regressions. This goal is achieved by first formulating a metric that quantifies the probability of a successful bisection, and extracting a list of input parameters – the contributing properties – that potentially impact its value; a sensitivity analysis is then conducted on these properties to understand the extent of their impact. Furthermore, an empirical study of 310 bug reports describing performance regressions in 17 real-world applications is conducted, to better understand what these contributing properties look like in practice. The results show that while bisection can be highly effective in localizing real-world performance regressions, this effectiveness is sensitive to the contributing properties, especially the choice of baseline and the distributions at each commit. The results also reveal that most bug reports do not provide sufficient information to help developers properly choose values and metrics that can maximize the effectiveness, which implies the need for measures to fill this information gap.
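The study analyzes when bisection succeeds on performance regressions; as a complement, here is a minimal hedged sketch of the heuristic itself: binary search over a commit range, re-measuring the midpoint and comparing its median runtime against the good baseline. The names (measure_commit, commits) and the simple threshold are placeholder assumptions, and the naive noise handling is exactly what makes the approach heuristic.

```python
# Minimal sketch of bisection applied to a performance regression.
import statistics

def median_runtime(measure_commit, commit, repetitions=5):
    return statistics.median(measure_commit(commit) for _ in range(repetitions))

def bisect_regression(commits, measure_commit, threshold=1.10):
    """Return the first commit whose median runtime exceeds baseline * threshold."""
    baseline = median_runtime(measure_commit, commits[0])
    lo, hi = 0, len(commits) - 1          # commits[lo] assumed good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if median_runtime(measure_commit, commits[mid]) > baseline * threshold:
            hi = mid                       # regression already present at mid
        else:
            lo = mid                       # mid still performs like the baseline
    return commits[hi]
```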
... Developers often spend a substantial amount of time diagnosing a configurable software system to localize and fix a performance bug, or to determine that the system was misconfigured [8,11,26,30,32,33,55,58,59,86]. This struggle is quite common when maintaining configurable software systems. ...
... Our goal is to support developers in the process of debugging the performance of configurable software systems; in particular, when developers do not even know which options or interactions in their current configuration cause an unexpected performance behavior. When performance issues occur in software systems, developers need to identify relevant information to debug the unexpected performance behaviors [8,11,27,55]. For this task, in addition to using off-the-shelf profilers [15,53,74], some researchers suggest using more targeted profiling techniques [10,12,13,21,84] and visualizations [2,6,12,21,62,70] to identify and analyze the locations of performance bottlenecks. ...
Preprint
Full-text available
Determining whether a configurable software system has a performance bug or it was misconfigured is often challenging. While there are numerous debugging techniques that can support developers in this task, there is limited empirical evidence of how useful the techniques are to address the actual needs that developers have when debugging the performance of configurable software systems; most techniques are often evaluated in terms of technical accuracy instead of their usability. In this paper, we take a human-centered approach to identify, design, implement, and evaluate a solution to support developers in the process of debugging the performance of configurable software systems. We first conduct an exploratory study with 19 developers to identify the information needs that developers have during this process. Subsequently, we design and implement a tailored tool, adapting techniques from prior work, to support those needs. Two user studies, with a total of 20 developers, validate and confirm that the information that we provide helps developers debug the performance of configurable software systems.
... Compared to the number of functional bugs, the number of performance bugs in software projects is typically relatively small (Ding et al. 2020; Radu and Nadi 2019; Jin et al. 2012; Nistor et al. 2013). Therefore, the lack of data to build JIT bug prediction models for performance bugs may become a common challenge in practice. ...
... Our paper analyzes the performance bugs in Cassandra and Hadoop and the SZZ approach's ability to determine the bug-inducing changes, and concentrates on the impact of these changes on predictive models. Nistor et al. (2013) studied software performance, since performance is critical for how users perceive the quality of software products. Performance bugs lead to poor user experience and low system throughput (Molyneaux 2009; Bryant and O'Hallaron 2015). ...
Article
Full-text available
Performance bugs bear a heavy cost on both software developers and end-users. Tools to reduce the occurrence, impact, and repair time of performance bugs, can therefore provide key assistance for software developers racing to fix these bugs. Classification models that focus on identifying defect-prone commits, referred to as Just-In-Time (JIT) Quality Assurance are known to be useful in allowing developers to review risky commits. These commits can be reviewed while they are still fresh in developers’ minds, reducing the costs of developing high-quality software. JIT models, however, leverage the SZZ approach to identify whether or not a change is bug-inducing. The fixes to performance bugs may be scattered across the source code, separated from their bug-inducing locations. The nature of performance bugs may make SZZ a sub-optimal approach for identifying their bug-inducing commits. Yet, prior studies that leverage or evaluate the SZZ approach do not distinguish performance bugs from other bugs, leading to potential bias in the results. In this paper, we conduct an empirical study on the JIT defect prediction for performance bugs. We concentrate on SZZ’s ability to identify the bug-inducing commits of performance bugs in two open-source projects, Cassandra, and Hadoop. We verify whether the bug-inducing commits found by SZZ are truly bug-inducing commits by manually examining these identified commits. Our manual examination includes cross referencing fix commits and JIRA bug reports. We evaluate model performance for JIT models by using them to identify bug-inducing code commits for performance related bugs. Our findings show that JIT defect prediction classifies non-performance bug-inducing commits better than performance bug-inducing commits, i.e., the SZZ approach does introduce errors when identifying bug-inducing commits. However, we find that manually correcting these errors in the training data only slightly improves the models. In the absence of a large number of correctly labelled performance bug-inducing commits, our findings show that combining all available training data (i.e., truly performance bug-inducing commits, non-performance bug-inducing commits, and non-bug-inducing commits) yields the best classification results.
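The abstract evaluates SZZ rather than defining it; for orientation, the following is a hedged sketch of the basic SZZ idea used by JIT models: for each line the fix commit deletes or modifies, blame that line in the fix's parent to recover candidate bug-inducing commits. Paths and commit ids are placeholders, real SZZ variants add filtering this sketch omits, and, as the study argues, performance fixes may not even touch the inducing location.

```python
# Hedged sketch of basic SZZ: blame the lines touched by a fix in the fix's parent commit.
import subprocess

def blame_commit(repo, fix_commit, path, line_no):
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--porcelain",
         "-L", f"{line_no},{line_no}", f"{fix_commit}^", "--", path],
        capture_output=True, text=True, check=True).stdout
    return out.split()[0]   # first token of porcelain output is the blamed commit id

def szz_candidates(repo, fix_commit, touched_lines):
    """touched_lines: iterable of (path, line_no) pairs deleted/modified by the fix."""
    return {blame_commit(repo, fix_commit, path, ln) for path, ln in touched_lines}
```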
... This piece of code looks innocent. However, there is an outer loop in function my_xml_parse(), which is to parse input string str into XML_NODEs (lines 18-32). The outer loop keeps calling xml_parent() using the next sibling of the previous XML_NODE, which has O(N^2) complexity in the number of children of a parent XML_NODE. ...
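The snippet above describes C code; the following is only a small Python analogue of the same quadratic pattern: xml_parent() scans the node array from the start to find a node's parent, and the parse loop calls it once per node, giving O(N^2) overall. Remembering the parent of the previous sibling would make the loop linear.

```python
# Python analogue (not the original C code) of the O(N^2) parse loop described above.

def xml_parent(nodes, i):
    """Parent of nodes[i] = nearest preceding node one level up (linear scan)."""
    parent = None
    for j in range(i):                         # O(i) work per call
        if nodes[j]["level"] == nodes[i]["level"] - 1:
            parent = j
    return parent

def my_xml_parse(nodes):
    parents = []
    for i in range(len(nodes)):                # outer loop over all nodes
        parents.append(xml_parent(nodes, i))   # O(N) per node -> O(N^2) total
    return parents

nodes = [{"level": 0}] + [{"level": 1} for _ in range(5)]   # one parent, five children
print(my_xml_parse(nodes))                                   # [None, 0, 0, 0, 0, 0]
```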
... Combating performance bugs depends on a good understanding of performance bugs. Many empirical studies were conducted to understand different types of performance bugs [1,2,3,20,21,22,23,24]. They provide important findings, which deepen researchers' understanding and guide the technical design to fight performance bugs from different aspects. ...
Article
Full-text available
Complexity problems are a common type of performance issues, caused by algorithmic inefficiency. Algorithmic profiling aims to automatically attribute execution complexity to an executed code construct. It can identify code constructs in superlinear complexity to facilitate performance optimizations and debugging. However, existing algorithmic profiling techniques suffer from several severe limitations, missing the opportunity to be deployed in production environment and failing to effectively pinpoint root causes for performance failures caused by complexity problems. In this paper, we design a tool, ComAir, which can effectively conduct algorithmic profiling in production environment. We propose several novel instrumentation methods to significantly lower runtime overhead and enable the production-run usage. We also design an effective ranking mechanism to help developers identify root causes of performance failures due to complexity problems. Our experimental results show that ComAir can effectively identify root causes and generate accurate profiling results in production environment, while incurring a negligible runtime overhead
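ComAir's instrumentation and ranking are not detailed above; the following is a hedged sketch of only the core idea behind algorithmic profiling: collect (input size, cost) samples per code construct and estimate the growth exponent from a log-log fit, flagging exponents well above 1 as superlinear. The sample data and thresholding are assumptions.

```python
# Hedged sketch of algorithmic profiling's core idea: estimate a construct's growth exponent.
import math

def growth_exponent(samples):
    """samples: list of (n, cost) with n > 1, cost > 0. Least-squares slope in log-log space."""
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(c) for _, c in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# A construct whose measured cost grows quadratically with input size:
samples = [(n, 3e-6 * n * n) for n in (100, 200, 400, 800, 1600)]
print(f"estimated exponent: {growth_exponent(samples):.2f}")   # ~2.00 -> flag as superlinear
```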
... We use a simple heuristic and select issues that contain the keyword "leak" in the issue title or issue description. The keyword search is a popular method used by previous empirical studies [57,92,147] to filter the issues of interest from the others. It is worth mentioning that we investigated other leak-related keywords (unreleased, out-of-memory, OOM, closed, and others). ...
... To assess the time to resolve (TTR) of an issue report, we adopted the methodology used in previous studies [57,92,110]. We collect two timestamps from each issue report: the time it was created (recorded in the issue tracker), and the time it was resolved (labeled as resolved). ...
Thesis
Modern software systems evolve steadily. Software developers change the software codebase every day to add new features, to improve the performance, or to fix bugs. Despite extensive testing and code inspection processes before releasing a new software version, the chance of introducing new bugs is still high. Code that worked yesterday may not work today, or it can show degraded performance, causing a software regression. The laws of software evolution state that the complexity increases as software evolves. Such increasing complexity makes software maintenance harder and more costly. In a typical software organization, the cost of debugging, testing, and verification can easily range from 50% to 75% of the total development costs. Given that human resources are the main cost factor in software maintenance and the software codebase evolves continuously, this dissertation tries to answer the following question: How can we help developers to localize the software defects more effectively during software development? We answer this question in three aspects. First, we propose an approach to localize failure-inducing changes for crashing bugs. Assume the source code of a buggy version, a failing test, the stack trace of the crashing site, and a previous correct version of the application. We leverage program analysis to contrast the behavior of the two software versions under the failing test. The difference set is the code statements which contribute to the failure site with a high probability. Second, we extend the version comparison technique to detect the leak-inducing defects caused by software changes. Assume two versions of a software codebase (one previous non-leaky and the current leaky version) and the existing test suite of the application. First, we compare the memory footprint of the code locations between two versions. Then, we use a confidence score to rank the suspicious code statements, i.e., those statements which can be the potential root causes of memory leaks. The higher the score, the more likely the code statement is a potential leak. Third, our observation on the related work about debugging and fault localization reveals that there is no empirical study which characterizes the properties of the leak-inducing defects and their repairs. Understanding the characteristics of the real defects caused by resource and memory leaks can help both researchers and practitioners to improve the current techniques for leak detection and repair. To fill this gap, we conduct an empirical study on 491 reported resource and memory leak defects from 15 large Java applications. We use our findings to draw implications for leak avoidance, detection, localization, and repair.
... We use a simple heuristic and select issues that contain the keyword "leak" in the issue title or issue description. The keyword search is a well-known method used by previous empirical studies (Jin et al. 2012a; Zhong and Su 2015; Nistor et al. 2013) to filter the issues of interest from the others. It is worth mentioning that we investigated other leak-related keywords (unreleased, out-of-memory, OOM, closed, etc.). ...
... The higher the entropy, the more complex the repair patch. To assess the time to resolve (TTR) of an issue report, we adopted the methodology used in previous studies (Song and Lu 2014; Nistor et al. 2013; Jin et al. 2012b). We collect two timestamps from each issue report: the time it was created (recorded in the issue tracker), and the time it was resolved (labeled as resolved). ...
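The TTR measure described above is simply the difference between the two collected timestamps; a minimal sketch of that computation follows. The dictionary field names are hypothetical; real trackers (Bugzilla, JIRA) expose these timestamps under different keys.

```python
# Small sketch of the time-to-resolve (TTR) measure: resolution time minus creation time.
from datetime import datetime

def time_to_resolve(issue):
    created = datetime.fromisoformat(issue["created"])
    resolved = datetime.fromisoformat(issue["resolved"])
    return resolved - created

issue = {"created": "2013-01-04T09:30:00", "resolved": "2013-02-18T16:05:00"}
print(time_to_resolve(issue))   # 45 days, 6:35:00
```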
Article
Full-text available
Despite huge software engineering efforts and programming language support, resource and memory leaks are still a troublesome issue, even in memory-managed languages such as Java. Understanding the properties of leak-inducing defects, how the leaks manifest, and how they are repaired is an essential prerequisite for designing better approaches for avoidance, diagnosis, and repair of leak-related bugs. We conduct a detailed empirical study on 491 issues from 15 large open-source Java projects. The study proposes taxonomies for the leak types, for the defects causing them, and for the repair actions. We investigate, under several aspects, the distributions within each taxonomy and the relationships between them. We find that manual code inspection and manual runtime detection are still the main methods for leak detection. We find that most of the errors manifest on error-free execution paths, and developers repair the leak defects in a shorter time than non-leak defects. We also identify 13 recurring code transformations in the repair patches. Based on our findings, we draw a variety of implications on how developers can avoid, detect, isolate and repair leak-related bugs.
... Many empirical studies [2], [5], [17] have been conducted on performance bugs, but little research focuses on synchronization performance bugs in cloud distributed systems. We collect 26 performance issues in distributed systems and analyze their root causes, fix strategies, and time complexity in order to understand these synchronization performance bugs better. ...
... They proposed different methods to study the reported bugs. Some [2] of them focus on the lifecycle of a performance bug, such as what the root cause is, how they are introduced, how they are exposed, and how they are fixed, and find that performance problems take a long time to get diagnosed and the help from profilers is very limited; some [16], [17] of them look at how performance problems are noticed and reported by end users; some [5] of them compare qualitative differences between performance bugs and non-performance bugs across impact, fix, and fix validation. Besides, some research has been done on specific code structures, such as loops. ...
Article
Full-text available
In such an information society, the Internet of Things (IoT) plays an increasingly important role in our daily lives. With such a huge number of deployed IoT devices, CPS calls for powerful distributed infrastructures to supply big data computing, intelligence, and storage services. With increasingly complex distributed software infrastructures, new intricate bugs continue to manifest, causing huge economic loss. Synchronization performance problems, in which improper synchronization degrades performance and can even lead to service exceptions, heavily influence the entire distributed cluster, imperiling the reliability of the system. As one kind of performance problem, synchronization performance problems are acknowledged to be difficult to diagnose and fix. We collect 26 performance issues in 3 real-world distributed systems: HDFS, Hadoop MapReduce and HBase, and analyze their root causes, fix strategies, and algorithm complexity in order to understand these synchronization performance bugs better. Then we implement a static detection tool including a critical section identifier, a loop identifier, an inner loop identifier, an expensive loop identifier, and a pruning component. After that, we evaluate our detection tool on these three distributed systems with sampled bugs. In the evaluation, our detection tool accurately finds all the target bugs. Besides, it points out more new potential performance problems than previous works. Even under strict performance overhead constraints, our detection tool proves to be highly efficient.
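The cited tool analyzes Java distributed systems; as a hedged, Python-flavoured illustration of the kind of check it describes (loops executing while a critical section is held), the sketch below flags loops nested inside a `with some_lock:` block. The lock-name heuristic and example are assumptions, and the inner/expensive-loop analysis and pruning of the real tool are omitted.

```python
# Hedged sketch: flag loops that run inside a critical section (a `with ...lock...:` block).
import ast

def loops_in_critical_sections(source: str):
    tree = ast.parse(source)
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.With):
            ctx = ast.dump(node.items[0].context_expr)
            if "lock" in ctx.lower():                       # crude critical-section test
                for inner in ast.walk(node):
                    if isinstance(inner, (ast.For, ast.While)):
                        findings.append(inner.lineno)
    return findings

example = """
with state_lock:
    for region in regions:        # potentially expensive loop while holding the lock
        flush(region)
"""
print(loops_in_critical_sections(example))   # [3]
```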
... There are a few empirical studies on performance bugs [8], [26]-[29], their root causes [8], [26], [30], fixing strategies [8], [26], [27], their impact or relevance [8], [29], and both static and dynamic analysis-based detection approaches [31]-[34]. Researchers have also suggested various ways of data-access optimization to improve the performance of database-backed web applications using caching and prefetching techniques. ...
Preprint
Data-intensive systems handle variable, high-volume, and high-velocity data generated by humans and digital devices. Like traditional software, data-intensive systems are prone to technical debts introduced to cope with the pressure of time and resource constraints on developers. Data access is a critical component of data-intensive systems, as it determines the overall performance and functionality of such systems. While data-access technical debts are getting attention from the research community, technical debts affecting performance are not well investigated. Objective: Identify, categorize, and validate data-access performance issues in the context of NoSQL-based and polyglot persistence data-intensive systems using a qualitative study. Method: We collect issues from NoSQL-based and polyglot persistence open-source data-intensive systems, identify data-access performance issues using inductive coding, and build a taxonomy of the root causes. Then, we validate the perceived relevance of the newly identified performance issues using a developer survey.
... Misconfigurations are typically caused by interactions between software and hardware, resulting in non-functional faults - degradations in non-functional system properties such as latency and energy consumption. These non-functional faults - unlike regular software bugs - do not cause the system to crash or exhibit any obvious misbehavior [76,85,99]. Instead, misconfigured systems remain operational but degrade in performance [16,71,75,86]. ...
Preprint
Full-text available
Modern computer systems are highly configurable, with the variability space sometimes larger than the number of atoms in the universe. Understanding and reasoning about the performance behavior of highly configurable systems, due to a vast variability space, is challenging. State-of-the-art methods for performance modeling and analyses rely on predictive machine learning models, therefore, they become (i) unreliable in unseen environments (e.g., different hardware, workloads), and (ii) produce incorrect explanations. To this end, we propose a new method, called Unicorn, which (a) captures intricate interactions between configuration options across the software-hardware stack and (b) describes how such interactions impact performance variations via causal inference. We evaluated Unicorn on six highly configurable systems, including three on-device machine learning systems, a video encoder, a database management system, and a data analytics pipeline. The experimental results indicate that Unicorn outperforms state-of-the-art performance optimization and debugging methods. Furthermore, unlike the existing methods, the learned causal performance models reliably predict performance for new environments.
... Other tools perform security assessment, automated test case generation and detection of non-functional issues such as energy consumption [270], [271]. While fixing non-functional performance bugs, developers need to consider the threat of introducing functional bugs [272] and hindering code maintainability [273]. In this context, Linares et al. [274] suggested that developers rarely implement micro-optimizations (e.g., changes at statement level). ...
Article
Nowadays there is a mobile application for almost everything a user may think of, ranging from paying bills and gathering information to playing games and watching movies. In order to ensure user satisfaction and the success of applications, it is important to provide highly performant applications. This is particularly important for resource-constrained systems such as mobile devices. Thereby, non-functional performance characteristics, such as energy and memory consumption, play an important role for user satisfaction. This paper provides a comprehensive survey of non-functional performance optimization for Android applications. We collected 155 unique publications, published between 2008 and 2020, that focus on the optimization of non-functional performance of mobile applications. We target our search at four performance characteristics, in particular: responsiveness, launch time, memory and energy consumption. For each performance characteristic, we categorize optimization approaches based on the method used in the corresponding publications. Furthermore, we identify research gaps in the literature for future work.
... Configuring software systems is often challenging. In practice, many users execute systems with inefficient configurations in terms of performance and, often directly correlated, energy consumption [22,23,33,54]. While users can adjust configuration options to tradeoff between performance and the system's functionality, this configuration task can be overwhelming; many systems, such as databases, Web servers, and video encoders, have numerous configuration options that may interact, possibly producing unexpected and undesired behavior. ...
Preprint
Full-text available
Performance-influence models can help stakeholders understand how and where configuration options and their interactions influence the performance of a system. With this understanding, stakeholders can debug performance behavior and make deliberate configuration decisions. Current black-box techniques to build such models combine various sampling and learning strategies, resulting in tradeoffs between measurement effort, accuracy, and interpretability. We present Comprex, a white-box approach to build performance-influence models for configurable systems, combining insights of local measurements, dynamic taint analysis to track options in the implementation, compositionality, and compression of the configuration space, without relying on machine learning to extrapolate incomplete samples. Our evaluation on 4 widely-used, open-source projects demonstrates that Comprex builds similarly accurate performance-influence models to the most accurate and expensive black-box approach, but at a reduced cost and with additional benefits from interpretable and local models.
... Performance bugs have also been studied for software systems, where bugs are detected by users or through code reasoning [43]. A machine learning approach has been developed for evaluating software performance degradation due to code changes [4]. ...
Preprint
Processor design validation and debug is a difficult and complex task, which consumes the lion's share of the design process. Design bugs that affect processor performance rather than its functionality are especially difficult to catch, particularly in new microarchitectures. This is because, unlike functional bugs, the correct processor performance of new microarchitectures on complex, long-running benchmarks is typically not deterministically known. Thus, when performance benchmarking new microarchitectures, performance teams may assume that the design is correct when the performance of the new microarchitecture exceeds that of the previous generation, despite significant performance regressions existing in the design. In this work, we present a two-stage, machine learning-based methodology that is able to detect the existence of performance bugs in microprocessors. Our results show that our best technique detects 91.5% of microprocessor core performance bugs whose average IPC impact across the studied applications is greater than 1% versus a bug-free design with zero false positives. When evaluated on memory system bugs, our technique achieves 100% detection with zero false positives. Moreover, the detection is automatic, requiring very little performance engineer time.
... Incorrect configuration (misconfiguration) elicits unexpected interactions between software and hardware, resulting in non-functional faults, i.e., faults in non-functional system properties such as latency, energy consumption, and/or heat dissipation. These non-functional faults - unlike regular software bugs - do not cause the system to crash or exhibit an obvious misbehavior [70,78,88]. Instead, misconfigured systems remain operational while being compromised, resulting in severe performance degradation in latency, energy consumption, and/or heat dissipation [16,66,69,80]. ...
Preprint
Full-text available
Modern computing platforms are highly-configurable with thousands of interacting configurations. However, configuring these systems is challenging. Erroneous configurations can cause unexpected non-functional faults. This paper proposes CADET (short for Causal Debugging Toolkit) that enables users to identify, explain, and fix the root cause of non-functional faults early and in a principled fashion. CADET builds a causal model by observing the performance of the system under different configurations. Then, it uses causal path extraction followed by counterfactual reasoning over the causal model to: (a) identify the root causes of non-functional faults, (b) estimate the effects of various configurable parameters on the performance objective(s), and (c) prescribe candidate repairs to the relevant configuration options to fix the non-functional fault. We evaluated CADET on 5 highly-configurable systems deployed on 3 NVIDIA Jetson systems-on-chip. We compare CADET with state-of-the-art configuration optimization and ML-based debugging approaches. The experimental results indicate that CADET can find effective repairs for faults in multiple non-functional properties with (at most) 17% more accuracy, 28% higher gain, and 40x speed-up than other ML-based performance debugging methods. Compared to multi-objective optimization approaches, CADET can find fixes (at most) 9x faster with comparable or better performance gain. Our case study of non-functional faults reported in NVIDIA's forum shows that CADET can find 14% better repairs than the experts' advice in less than 30 minutes.
... Recent studies have shown that performance problems caused by misconfiguration are still prevalent [4], [13], [17]. Performance issues can cause significant performance degradation which leads to long response time and a low program throughput [7], [17], [24]. ...
Preprint
Performance is an important non-functional aspect of software requirements. Modern software systems are highly-configurable, and misconfigurations may easily cause performance issues. A software system that suffers from performance issues may exhibit low program throughput and long response times. However, the sheer size of the configuration space makes it challenging for administrators to manually select and adjust the configuration options to achieve better performance. In this paper, we propose ConfRL, an approach to tune software performance automatically. The key idea of ConfRL is to use reinforcement learning to explore the configuration space by a trial-and-error approach and to use the feedback received from the environment to tune configuration option values to achieve better performance. To reduce the cost of reinforcement learning, ConfRL employs sampling, clustering, and dynamic state reduction techniques to keep states in a large configuration space manageable. Our evaluation of four real-world highly-configurable server programs shows that ConfRL can efficiently and effectively guide software systems to achieve higher long-term performance.
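ConfRL's sampling, clustering, and state-reduction machinery is not detailed above; the following is only a hedged sketch of the underlying trial-and-error idea: an epsilon-greedy loop that uses measured throughput as feedback and gravitates toward the best-performing configuration. The configuration encoding and the simulated throughput function are assumptions.

```python
# Hedged sketch of RL-style configuration tuning via an epsilon-greedy bandit.
import random

def tune(configs, measure_throughput, episodes=200, epsilon=0.2):
    totals = {c: 0.0 for c in configs}
    counts = {c: 0 for c in configs}
    for _ in range(episodes):
        if random.random() < epsilon or not any(counts.values()):
            cfg = random.choice(configs)                 # explore
        else:
            cfg = max(configs, key=lambda c: totals[c] / max(counts[c], 1))  # exploit
        reward = measure_throughput(cfg)                 # feedback from the environment
        totals[cfg] += reward
        counts[cfg] += 1
    return max(configs, key=lambda c: totals[c] / max(counts[c], 1))

# Usage with a simulated system whose throughput depends on a cache-size option:
configs = [("cache_mb", 64), ("cache_mb", 256), ("cache_mb", 1024)]
best = tune(configs, lambda cfg: cfg[1] ** 0.5 + random.gauss(0, 1))
print("best configuration:", best)
```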
... In addition, these approaches rely on a heuristic to pinpoint the performance regression-causes, while ZAM reasons its way through the timeline to find the cause. In this sense, ZAM's methodology is consistent with the results of an empirical study by Nistor et al. [36], which found that performance issues are often discovered through code reasoning. ...
Article
Full-text available
A performance regression in software is defined as an increase in an application step’s response time as a result of code changes. Detecting such regressions can be done using profiling tools; however, investigating their root cause is a mostly-manual and time-consuming task. This statement holds true especially when comparing execution timelines, which are dynamic function call trees augmented with response time data; these timelines are compared to find the performance regression-causes – the lowest-level function calls that regressed during execution. When done manually, these comparisons often require the investigator to analyze thousands of function call nodes. Further, performing these comparisons on web applications is challenging due to JavaScript’s asynchronous and event-driven model, which introduces noise in the timelines. In response, we propose a design – Zam – that automatically compares execution timelines collected from web applications, to identify performance regression-causes. Our approach uses a hybrid node matching algorithm that recursively attempts to find the longest common subsequence in each call tree level, then aggregates multiple comparisons’ results to eliminate noise. Our evaluation of Zam on 10 web applications indicates that it can identify performance regression-causes with a path recall of 100% and a path precision of 96%, while performing comparisons in under a minute on average. We also demonstrate the real-world applicability of Zam, which has been used to successfully complete performance investigations by the performance and reliability team in SAP.
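The level-wise matching idea reads roughly as follows in a toy form: match the children of two call-tree nodes by longest common subsequence over function names, recurse into the matched pairs, and report the lowest-level calls whose response time grew. This is only a sketch of the general idea, not SAP's implementation; the node layout and threshold are assumptions.

```python
# Illustrative sketch of LCS-based matching of call-tree levels to locate
# regressed calls between a "before" and an "after" execution timeline.
from dataclasses import dataclass, field

@dataclass
class Call:
    name: str
    time_ms: float
    children: list = field(default_factory=list)

def lcs_pairs(a, b):
    """Return matched node pairs from the longest common subsequence of child names."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            dp[i][j] = dp[i + 1][j + 1] + 1 if a[i].name == b[j].name \
                       else max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i].name == b[j].name:
            pairs.append((a[i], b[j])); i += 1; j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

def regressions(before, after, threshold_ms=10.0, path=()):
    """Walk matched subtrees and report the lowest-level regressed calls."""
    found = []
    for old, new in lcs_pairs(before.children, after.children):
        found += regressions(old, new, threshold_ms, path + (new.name,))
    if not found and after.time_ms - before.time_ms > threshold_ms:
        found.append((path + (after.name,), after.time_ms - before.time_ms))
    return found
```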
... They studied how users perceive the bugs, how bugs are reported, and what developers discuss about the bug causes and the bug patches. Their study is similar to that of Nistor et al. [17], but they go further by analyzing additional information from the bug reports. Nguyen et al. [16] interviewed the performance engineers responsible for an industrial software system to understand these regression-causes. ...
Article
Context: Software performance may suffer regressions caused by source code changes. Measuring performance at each new software version is useful for early detection of performance regressions. However, systematically running benchmarks is often impractical (e.g., long-running executions, prioritizing functional correctness over non-functional properties). Objective: In this article, we propose Horizontal Profiling, a sampling technique to predict when a new revision may cause a regression by analyzing the source code and using run-time information of a previous version. The goal of Horizontal Profiling is to reduce the performance testing overhead by benchmarking just the software versions that contain costly source code changes. Method: We present an evaluation in which we apply Horizontal Profiling to identify performance regressions of 17 software projects written in the Pharo programming language, totaling 1,288 software versions. Results: Horizontal Profiling detects more than 80% of the regressions by benchmarking less than 20% of the versions. In addition, our experiments show that Horizontal Profiling has better precision and executes the benchmarks on fewer versions than the state-of-the-art tools, under our benchmarks. Conclusions: We conclude that by adequately characterizing the run-time information of a previous version, it is possible to determine whether a new version is likely to introduce a performance regression. As a consequence, a significant fraction of the performance regressions are identified by benchmarking only a small fraction of the software versions.
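The core decision Horizontal Profiling makes can be sketched as: look up how hot the changed methods were in the previous version's profile and benchmark the new version only when that share crosses a threshold. The profile format and threshold below are hypothetical.

```python
# Hedged sketch of the "benchmark only costly changes" decision. The method
# names, time shares, and 10% threshold are invented for illustration.
previous_profile = {          # method -> share of total execution time (version N)
    "Parser.parse":   0.42,
    "Cache.lookup":   0.23,
    "Log.debug":      0.01,
    "Config.reload":  0.02,
}

def should_benchmark(changed_methods, profile, threshold=0.10):
    """Benchmark version N+1 only if the changed methods were hot in version N."""
    covered = sum(profile.get(m, 0.0) for m in changed_methods)
    return covered >= threshold

# Version N+1 only touches logging and configuration code: skip the benchmark.
print(should_benchmark({"Log.debug", "Config.reload"}, previous_profile))   # False
# Version N+2 touches the parser hot path: run the benchmark.
print(should_benchmark({"Parser.parse"}, previous_profile))                 # True
```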
... Nistor et al. [22] conducted a comprehensive study to compare performance and non-performance bugs regarding how they are discovered, reported, and fixed, answering questions left open by previous studies. More precisely, they manually inspected and compared 210 performance bugs and 210 non-performance bugs from three mature code bases: Eclipse Java Development Tools (JDT), Eclipse Standard Widget Toolkit (SWT), and the Mozilla project. ...
Conference Paper
Quality is a multi-faceted aspect of software. As described by international standards, the process of quality assurance is concerned not only with the functionality of the software, but also with performance, security, maintainability, reliability and others. However, during the maintenance phase of software development, developers usually focus on one aspect at a time, for example improving design or fixing bugs, either due to time constraints or because of specific priorities. In this work, we present a study to show that quality issues do not occur in isolation during development. We study the source code from 10 Android applications and we explore problems around reliability, maintainability and security. In addition, we study the impact of maintenance activities around these problems on other quality aspects like performance and energy consumption. Our first objective is to find whether quality problems of different types occur together and whether specific types are correlated. Secondly, we want to see whether fixing problems always has a monotonically positive effect on overall quality or whether special attention needs to be taken when fixing specific problems. Our long-term goal is to create tool support for the multi-dimensional analysis and assurance of software quality.
... The relationship between code smells and design problems has been largely discussed in the technical literature. Although "bug" is a buzzword used in software engineering research and practice with several meanings [26,27,39], we found that the references to bugs made by practitioners address problems in the execution of the source code and semantic issues. Therefore, we may interpret that practitioners believe that code smells would hamper maintenance activities by contributing to the incidence of bugs [10]. ...
Conference Paper
Full-text available
Context: The identification of code smells is one of the most subjective tasks in software engineering. A key reason is the influence of collective aspects of communities working on this task, such as their beliefs regarding the relevance of certain smells. However, collective aspects are often neglected in the context of smell identification. For this purpose, we can use the social representations theory. Social representations comprise the set of values, behaviors, and practices of communities associated with a social object, such as the task of identifying smells. Aim: To characterize the social representations behind smell identification. Method: We conducted an empirical study on the social representations of smell identification by two communities. One community is composed of postgraduate students from different Brazilian universities. The other community is composed of practitioners located in Brazilian companies, having different levels of experience in code reviews. We analyzed the associations made by the study participants about smell identification, i.e., what immediately comes to their minds when they think about this task. Results: One of the key findings is that the communities of students and practitioners have stronger associations with different types of code smells. Students share a strong belief that smell identification is a matter of measurement, while practitioners focus on the structure of the source code and its semantics. In addition, we found that only practitioners frequently associate the task with individual skills. This finding suggests that research directions on code smells may be revisited. Conclusion: We found evidence that social representations theory allows identifying research gaps and opportunities by looking beyond the borders of formal knowledge and individual opinions. Therefore, this theory can be considered an important resource for conducting qualitative studies in software engineering.
... Motivation: Developers cannot treat all bugs with the same priority, since some bugs can have a high impact on a variety of activities in the bug management process, on products, and on end-users. To prioritize effectively, software engineering researchers have introduced different types of High Impact Bugs (HIB) based on their impact on software processes, products, or end-users, such as security bugs [13], performance bugs [14], breakage bugs [15], surprise bugs [15], dormant bugs [16], and blocker bugs [17]. Previous work revealed that different types of bugs (e.g., performance and security bugs) differ from each other [18]. ...
Conference Paper
Full-text available
Bug reports are the primary means through which developers triage and fix bugs. To achieve this effectively, bug reports need to clearly describe the features that are important for the developers. However, previous studies have found that reporters do not always provide such features. Therefore, we first perform an exploratory study to identify the key features that reporters frequently miss in their initial bug report submissions. Then, we plan to propose an automatic approach for supporting reporters to make a good bug report. For our initial studies, we manually examine bug reports of five large-scale projects from two ecosystems, Apache (Camel, Derby, and Wicket) and Mozilla (Firefox and Thunderbird). As initial results, we identify five key features that reporters often miss in their initial bug reports and that developers require for fixing bugs. We build and evaluate classification models using four different text-classification techniques. The evaluation results show that our models can effectively predict the key features. Our ongoing research focuses on developing an automatic feature recommendation model to improve the contents of bug reports.
Article
Context: Software performance is crucial for ensuring the quality of software products. As one of the non-functional requirements, software performance has often been neglected until a later phase in the software development life cycle (SDLC). The lack of a clear overview of the available software performance research literature prevents researchers from understanding which software performance research fields exist. It also makes it difficult for practitioners to adopt state-of-the-art software performance techniques. Software performance research is not as organized as other established research topics such as software testing. Thus, it is essential to conduct a systematic mapping study as a first step to provide an overview of the latest research literature available in software performance. Objective: The objective of this systematic mapping study is to survey and map software performance research literature into suitable categories and to synthesize the literature data for future access and reference. Method: This systematic mapping study conducts a manual examination by querying research literature in reputable journals and proceedings in software engineering in the past decade. We examine each paper manually and identify primary studies for further analysis and synthesis according to the pre-defined inclusion criteria. Lastly, we map the primary studies based on their corresponding classification category. Results: This systematic mapping study provides a state-of-the-art literature mapping in software performance research. We have carefully examined 222 primary studies out of more than 2,000 papers. We have identified six software performance research categories and 15 subcategories. We generate the primary study mapping and report five research findings. Conclusions: Unlike established research fields, it is unclear what types of software performance research categories are available to the community. This work takes the systematic mapping study approach to survey and map the latest software performance research literature. The study results provide an overview of the paper distribution and a reference for researchers to navigate research literature on software performance.
Article
Bug reports are submitted by software stakeholders to help locate and eliminate bugs. However, in large-scale software systems, it may be impossible to track and solve every bug, and thus developers should pay more attention to High-Impact Bugs (HIBs). Previous studies analyzed textual descriptions to automatically identify HIBs, but they ignored the quality of code, which may also indicate the cause of HIBs. To address this issue, we integrate features reflecting the quality of production code (i.e., CK metrics) and test code (i.e., test smells) into our textual-similarity-based model to identify HIBs. Our model outperforms the compared baseline by up to 39% in terms of AUC-ROC and 64% in terms of F-Measure. Then, we explain the behavior of our model by using SHAP to calculate the importance of each feature, and we apply case studies to empirically demonstrate the relationship between the most important features and HIB. The results show that several test smells (e.g. Assertion Roulette, Conditional Test Logic, Duplicate Assert, Sleepy Test) and product metrics (e.g. NOC, LCC, PF, and ProF) have important contributions to HIB identification.
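A rough sketch of the modeling step follows: train a classifier on textual-similarity and code-quality features, then rank feature contributions. Permutation importance stands in here for the SHAP analysis used in the article, and the feature names and data are fabricated for illustration.

```python
# Illustrative sketch of combining a text-similarity feature with code-quality
# features to flag High-Impact Bugs, then ranking feature contributions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

features = ["text_similarity", "NOC", "LCC", "assertion_roulette", "sleepy_test"]
rng = np.random.default_rng(0)
X = rng.random((500, len(features)))
y = (0.6 * X[:, 0] + 0.3 * X[:, 3] + 0.1 * rng.random(500) > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by how much shuffling each one degrades the model.
imp = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
for name, score in sorted(zip(features, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name:20s} {score:.3f}")
```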
Thesis
System availability and efficiency are critical aspects in the oil and gas sector, as any fault affecting those systems may cause operations to shut down, which negatively impacts operational resources as well as costs, human resources, and time. Therefore, it became important to investigate the reasons for such errors. In this study, software errors and maintenance are studied. End-user errors are targeted because the number of these errors is projected to increase. The factors that affect end-user behavior in oil and gas systems are also investigated, and the relation between system availability and end-user behavior is evaluated. An investigation was performed following the descriptive methodology in order to gain insights into the human error factor encountered by various international oil and gas companies around the Middle East and North Africa. This was conducted by distributing a questionnaire to 120 employees of the companies in this study; 81 responded. The questionnaire contained questions related to software/hardware errors and errors due to the end user. In short, the study shows that there is a relation between end-user behavior and system availability and efficiency. Factors including training, experience, education, work shifts, system interface, and I/O devices were identified in the study as factors affecting end-user behavior. Moreover, the study contributes new knowledge by identifying a new factor that leads to system unavailability, namely memory sticks. This thesis presents valuable knowledge that explains how errors occur and the reasons for their occurrence. Major limitations of this research include company policies, legal issues, and information resources.
Chapter
The technology-enabled service industry is emerging as one of the most dynamic sectors in the world's economy. Various service-sector industries such as financial services, banking solutions, telecommunication, and investment management rely completely on large-scale software for their smooth operation. Any malware or bugs in this software are a major concern and can have serious financial consequences. This chapter addresses the problem of bug handling in service-sector software. Predictive analysis is a helpful technique for keeping software systems error-free. Existing research in bug handling focuses on various predictive analysis techniques such as data mining, machine learning, information retrieval, and optimisation for bug resolution. This chapter provides a detailed analysis of bug handling in large service-sector software. The main emphasis of this chapter is to discuss research involved in applying predictive analysis for bug handling. The chapter also presents some possible future research directions in bug resolution using mathematical optimisation techniques.
Article
Mutation testing has been widely used to assess the fault-detection effectiveness of a test suite, as well as to guide test case generation or prioritization. Empirical studies have shown that, while mutants are generally representative of real faults, an effective application of mutation testing requires “traditional” operators designed for programming languages to be augmented with operators specific to an application domain and/or technology. The case of Android apps is not an exception. Therefore, in this paper we describe the process we followed to create (i) a taxonomy of mutation operators and (ii) two tools, MDroid+ and MutAPK, for mutant generation of Android apps. To this end, we systematically devise a taxonomy of 262 types of Android faults grouped in 14 categories by manually analyzing 2,023 software artifacts from different sources ( e.g., bug reports, commits). Then, we identified a set of 38 mutation operators, and implemented them in two tools, the first enabling mutant generation at the source code level, and the second designed to perform mutations at APK level. The rationale for this dual approach is that source code is not always available when conducting mutation testing. Thus, mutation testing for APKs enables new scenarios in which researchers/practitioners only have access to APK files. The taxonomy, proposed operators, and tools have been evaluated in terms of the number of non-compilable, trivial, equivalent, and duplicate mutants generated and their capacity to represent real faults in Android apps as compared to other well-known mutation tools.
Preprint
Full-text available
Bug reports are the primary means through which developers triage and fix bugs. To achieve this effectively, bug reports need to clearly describe the features that are important for the developers. However, previous studies have found that reporters do not always provide such features. Therefore, we first perform an exploratory study to identify the key features that reporters frequently miss in their initial bug report submissions. Then, we propose an approach that predicts whether reporters should provide certain key features to ensure a good bug report. A case study of the bug reports for the Camel, Derby, Wicket, Firefox, and Thunderbird projects shows that Steps to Reproduce, Test Case, Code Example, Stack Trace, and Expected Behavior are the additional features that reporters most often omit from their initial bug report submissions. We also find that these features significantly affect the bug-fixing process. Based on our findings, we build and evaluate classification models using four different text-classification techniques to predict key features by leveraging historical bug-fixing knowledge. The evaluation results show that our models can effectively predict the key features. Our comparative study of different text-classification techniques shows that NBM outperforms other techniques. Our findings can benefit reporters to improve the contents of bug reports.
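The text-classification step can be sketched in a few lines with a TF-IDF plus multinomial naive Bayes pipeline (NBM in the abstract), predicting whether a report should be prompted for reproduction steps. The training examples are invented and far smaller than any realistic corpus.

```python
# Minimal sketch: predict from a report's text whether "Steps to Reproduce"
# should be requested from the reporter.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

reports = [
    "Crash when opening attachment, see stack trace below",
    "Button misaligned on settings page",
    "NullPointerException on startup, happens every time I click import",
    "Dark theme colors look wrong",
]
needs_steps = [0, 1, 0, 1]   # 1 = reporter should add Steps to Reproduce

model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(reports, needs_steps)

new_report = "Window renders incorrectly after resize"
print(model.predict([new_report])[0])   # 1 -> prompt the reporter for steps
```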
Article
Performance is one of the key aspects of non-functional qualities, as performance bugs can cause significant performance degradation and lead to poor user experiences. While bug reports are intended to help developers understand and fix bugs, they are also extensively used by researchers for finding benchmarks to evaluate their testing and debugging approaches. Although researchers spend a considerable amount of time and effort in finding usable performance bugs from bug repositories, they often get only a few. Reproducing performance bugs is difficult even for performance bugs that are confirmed by developers with domain knowledge. The amount of information disclosed in a bug report may not always be sufficient for researchers to reproduce the performance bug, and this hinders the usability of bug repositories as a resource for finding benchmarks. In this paper, we study the characteristics of confirmed performance bugs by reproducing them using only information available from the bug report, to examine the challenges of bug reproduction from the perspective of researchers. We spent more than 800 hours over the course of six months to study and try to reproduce 93 confirmed performance bugs, which are randomly sampled from two large-scale open-source server applications. We (1) studied the characteristics of the reproduced performance bug reports; (2) summarized, from the perspective of researchers, the causes of failed-to-reproduce performance bug reports by attempting to reproduce bugs that have been solved in bug reports; (3) shared our experience on suggesting workarounds to improve the bug reproduction success rate; (4) delivered a virtual machine image that contains a set of 17 ready-to-execute performance bug benchmarks. The findings of our study provide guidance and a set of suggestions to help researchers understand, evaluate, and successfully replicate performance bugs.
Conference Paper
Full-text available
Changes, a rather inevitable part of software development, can cause maintenance implications if they introduce bugs into the system. By isolating and characterizing these bug-introducing changes it is possible to uncover potentially risky source code entities or issues that produce bugs. In this paper, we mine the bug-introducing changes in the Android platform by mapping bug reports to the changes that introduced the bugs. We then use the change information to look for both potentially problematic parts and dynamics in development that can cause maintenance implications. We believe that the results of our study can help better manage Android software development.
Article
Full-text available
A recent study finds that errors of omission are harder for programmers to detect than errors of commission. While several change recommendation systems already exist to prevent or reduce omission errors during software development, there have been very few studies on why errors of omission occur in practice and how such errors could be prevented. In order to understand the characteristics of omission errors, this paper investigates a group of bugs that were fixed more than once in open source projects — those bugs whose initial patches were later considered incomplete and to which programmers applied supplementary patches. Our study on Eclipse JDT core, Eclipse SWT, and Mozilla shows that a significant portion of resolved bugs (22% to 33%) involves more than one fix attempt. Our manual inspection shows that the causes of omission errors are diverse, including missed porting changes, incorrect handling of conditional statements, or incomplete refactorings, etc. While many consider that missed updates to code clones often lead to omission errors, only a very small portion of supplementary patches (12% in JDT, 25% in SWT, and 9% in Mozilla) have a content similar to their initial patches. This implies that supplementary change locations cannot be predicted by code clone analysis alone. Furthermore, 14% to 15% of files in supplementary patches are beyond the scope of immediate neighbors of their initial patch locations — they did not overlap with the initial patch locations nor had direct structural dependencies on them (e.g. calls, accesses, subtyping relations, etc.). These results call for new types of omission error prevention approaches that complement existing change recommendation systems.
Article
Full-text available
Software performance is one of the important qualities that makes software stand out in a competitive market. However, in earlier work we found that performance bugs take more time to fix, need to be fixed by more experienced developers and require changes to more code than non-performance bugs. In order to be able to improve the resolution of performance bugs, a better understanding is needed of the current practice and shortcomings of reporting, reproducing, tracking and fixing performance bugs. This paper qualitatively studies a random sample of 400 performance and non-performance bug reports of Mozilla Firefox and Google Chrome across four dimensions (Impact, Context, Fix and Fix validation). We found that developers and users face problems in reproducing performance bugs and have to spend more time discussing performance bugs than other kinds of bugs. Sometimes performance regressions are tolerated as a tradeoff to improve something else.
Article
Full-text available
In this paper we present a profiling methodology and toolkit for helping developers discover hidden asymptotic inefficiencies in the code. From one or more runs of a program, our profiler automatically measures how the performance of individual routines scales as a function of the input size, yielding clues to their growth rate. The output of the profiler is, for each executed routine of the program, a set of tuples that aggregate performance costs by input size. The collected profiles can be used to produce performance plots and derive trend functions by statistical curve fitting or bounding techniques. A key feature of our method is the ability to automatically measure the size of the input given to a generic code fragment: to this aim, we propose an effective metric for estimating the input size of a routine and show how to compute it efficiently. We discuss several case studies, showing that our approach can reveal asymptotic bottlenecks that other profilers may fail to detect and characterize the workload and behavior of individual routines in the context of real applications. To prove the feasibility of our techniques, we implemented a Valgrind tool called aprof and performed an extensive experimental evaluation on the SPEC CPU2006 benchmarks. Our experiments show that aprof delivers comparable performance to other prominent Valgrind tools, and can generate informative plots even from single runs on typical workloads for most algorithmically-critical routines.
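The post-processing idea, aggregating a routine's cost by observed input size and fitting a trend, can be mimicked with a simple power-law fit; aprof itself instruments binaries through Valgrind and estimates input sizes automatically, which this sketch does not attempt.

```python
# Toy sketch of input-sensitive profiling post-processing: fit a power-law trend
# to (input size, cost) tuples collected for one routine across runs.
import numpy as np

# Synthetic samples roughly following n*log(n) growth plus noise.
samples = [(n, 0.5 * n * np.log2(n) + np.random.rand()) for n in range(2, 200, 4)]

sizes = np.array([s for s, _ in samples], dtype=float)
costs = np.array([c for _, c in samples], dtype=float)

# Fit cost ~ a * size^b on log-log axes; b estimates the growth exponent.
b, log_a = np.polyfit(np.log(sizes), np.log(costs), 1)
print(f"estimated growth: cost ~ {np.exp(log_a):.2f} * n^{b:.2f}")
```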
Article
Full-text available
Given limited resource and time before software release, development-site testing and debugging become more and more insufficient to ensure satisfactory software performance. As a counterpart for debugging in the large pioneered by the Microsoft Windows Error Reporting (WER) system focusing on crashing/hanging bugs, performance debugging in the large has emerged thanks to available infrastructure support to collect execution traces with performance issues from a huge number of users at the deployment sites. However, performance debugging against these numerous and complex traces remains a significant challenge for performance analysts. In this paper, to enable performance debugging in the large in practice, we propose a novel approach, called StackMine, that mines callstack traces to help performance analysts effectively discover highly impactful performance bugs (e.g., bugs impacting many users with long response delay). As a successful technology-transfer effort, since December 2010, StackMine has been applied in performance-debugging activities at a Microsoft team for performance analysis, especially for a large number of execution traces. Based on real-adoption experiences of StackMine in practice, we conducted an evaluation of StackMine on performance debugging in the large for Microsoft Windows 7. We also conducted another evaluation on a third-party application. The results highlight substantial benefits offered by StackMine in performance debugging in the large for large-scale software systems.
Conference Paper
Full-text available
Customizable programs and program families provide user-selectable features to allow users to tailor a program to an application scenario. Knowing in advance which feature selection yields the best performance is difficult because a direct measurement of all possible feature combinations is infeasible. Our work aims at predicting program performance based on selected features. However, when features interact, accurate predictions are challenging. An interaction occurs when a particular feature combination has an unexpected influence on performance. We present a method that automatically detects performance-relevant feature interactions to improve prediction accuracy. To this end, we propose three heuristics to reduce the number of measurements required to detect interactions. Our evaluation consists of six real-world case studies from varying domains (e.g., databases, encoding libraries, and web servers) using different configuration techniques (e.g., configuration files and preprocessor flags). Results show an average prediction accuracy of 95%.
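A minimal way to see what a performance-relevant interaction is: compare the measured effect of enabling two features together against the sum of their individual effects. The measure() function below is a hypothetical stand-in for an actual benchmark run.

```python
# Hedged sketch of detecting a pairwise performance-relevant feature interaction:
# a combination that deviates from the additive prediction interacts.
def measure(features):
    # Hypothetical system: 'cache' and 'compression' interact badly when combined.
    t = 100.0
    if "cache" in features:        t -= 30
    if "compression" in features:  t -= 10
    if {"cache", "compression"} <= features:  t += 25   # unexpected interaction
    return t

base = measure(set())
def effect(fs): return measure(set(fs)) - base

for a, b in [("cache", "compression")]:
    expected = effect([a]) + effect([b])       # additive assumption
    observed = effect([a, b])
    if abs(observed - expected) > 5:            # tolerance in ms
        print(f"{a} x {b} interact: expected {expected:+.0f} ms, observed {observed:+.0f} ms")
```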
Article
Full-text available
A goal of performance testing is to find situations in which applications unexpectedly exhibit worsened characteristics for certain combinations of input values. A fundamental question of performance testing is how to select a manageable subset of the input data so that performance problems in applications can be found automatically and quickly. We offer a novel solution for finding performance problems in applications automatically using black-box software testing. Our solution is an adaptive, feedback-directed learning testing system that learns rules from execution traces of applications and then uses these rules to select test input data automatically for these applications to find more performance problems when compared with exploratory random testing. We have implemented our solution and applied it to a medium-size application at a major insurance company and to an open-source application. Performance problems were found automatically and confirmed by experienced testers and developers.
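One hedged reading of the feedback-directed loop: run random inputs, label the slow executions, learn a rule (here a small decision tree) over input characteristics, and use it to bias further input selection toward likely-slow regions. run_app() and the input features are invented for illustration.

```python
# Rough sketch of feedback-directed test selection from execution feedback.
import random
from sklearn.tree import DecisionTreeClassifier

def run_app(order_count, discount_pct):
    # Stand-in for executing the application and timing it (ms).
    return 5 + order_count * (12 if discount_pct > 50 else 1) * 0.01

inputs, slow = [], []
for _ in range(300):                                   # exploratory random testing
    x = [random.randint(1, 1000), random.randint(0, 100)]
    inputs.append(x)
    slow.append(int(run_app(*x) > 50))                 # label slow executions

rule = DecisionTreeClassifier(max_depth=3).fit(inputs, slow)

# Generate candidates and keep those the learned rules predict to be slow.
candidates = [[random.randint(1, 1000), random.randint(0, 100)] for _ in range(1000)]
selected = [c for c in candidates if rule.predict([c])[0] == 1]
print(f"selected {len(selected)} likely-slow inputs for focused performance testing")
```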
Conference Paper
Full-text available
A good understanding of the impact of different types of bugs on various project aspects is essential to improve software quality research and practice. For instance, we would expect that security bugs are fixed faster than other types of bugs due to their critical nature. However, prior research has often treated all bugs as similar when studying various aspects of software quality (e.g., predicting the time to fix a bug), or has focused on one particular type of bug (e.g., security bugs) with little comparison to other types. In this paper, we study how different types of bugs (performance and security bugs) differ from each other and from the rest of the bugs in a software project. Through a case study on the Firefox project, we find that security bugs are fixed and triaged much faster, but are reopened and tossed more frequently. Furthermore, we also find that security bugs involve more developers and impact more files in a project. Our work is the first to empirically study performance bugs and compare them to the frequently studied security bugs. Our findings highlight the importance of considering the different types of bugs in software quality research and practice.
Conference Paper
Full-text available
The relationship between various software-related phenomena (e.g., code complexity) and post-release software defects has been thoroughly examined. However, to date these predictions have a limited adoption in practice. The most commonly cited reason is that the prediction identifies too much code to review without distinguishing the impact of these defects. Our aim is to address this drawback by focusing on high-impact defects for customers and practitioners. Customers are highly impacted by defects that break pre-existing functionality (breakage defects), whereas practitioners are caught off-guard by defects in files that had relatively few pre-release changes (surprise defects). The large commercial software system that we study already had an established concept of breakages as the highest-impact defects, however, the concept of surprises is novel and not as well established. We find that surprise defects are related to incomplete requirements and that the common assumption that a fix is caused by a previous change does not hold in this project. We then fit prediction models that are effective at identifying files containing breakages and surprises. The number of pre-release defects and file size are good indicators of breakages, whereas the number of co-changed files and the amount of time between the latest pre-release change and the release date are good indicators of surprises. Although our prediction models are effective at identifying files that have breakages and surprises, we learn that the prediction should also identify the nature or type of defects, with each type being specific enough to be easily identified and repaired.
Conference Paper
Full-text available
Software engineering researchers have long been interested in where and why bugs occur in code, and in predicting where they might turn up next. Historical bug-occurrence data has been key to this research. Bug tracking systems, and code version histories, record when, how and by whom bugs were fixed; from these sources, datasets that relate file changes to bug fixes can be extracted. These historical datasets can be used to test hypotheses concerning processes of bug introduction, and also to build statistical bug prediction models. Unfortunately, processes and humans are imperfect, and only a fraction of bug fixes are actually labelled in source code version histories, and thus become available for study in the extracted datasets. The question naturally arises: are the bug fixes recorded in these historical datasets a fair representation of the full population of bug fixes? In this paper, we investigate historical data from several software projects, and find strong evidence of systematic bias. We then investigate the potential effects of "unfair, imbalanced" datasets on the performance of prediction techniques. We draw the lesson that bias is a critical problem that threatens both the effectiveness of processes that rely on biased datasets to build prediction models and the generalizability of hypotheses tested on biased data.
Conference Paper
Full-text available
Robust distributed systems commonly employ high-level recovery mechanisms enabling the system to recover from a wide variety of problematic environmental conditions such as node failures, packet drops and link disconnections. Unfortunately, these recovery mechanisms also effectively mask additional serious design and implementation errors, disguising them as latent performance bugs that severely degrade end-to-end system performance. These bugs typically go unnoticed due to the challenge of distinguishing between a bug and an intermittent environmental condition that must be tolerated by the system. We present techniques that can automatically pinpoint latent performance bugs in systems implementations, in the spirit of recent advances in model checking by systematic state space exploration. The techniques proceed by automating the process of conducting random simulations, identifying performance anomalies, and analyzing anomalous executions to pinpoint the circumstances leading to performance degradation. By focusing our implementation on the MACE toolkit, MACEPC can be used to test our implementations directly, without modification. We have applied MACEPC to five thoroughly tested and trusted distributed systems implementations. MACEPC was able to find significant, previously unknown, long-standing performance bugs in each of the systems, and led to fixes that significantly improved the end-to-end performance of the systems.
Conference Paper
Full-text available
Reproducing bug symptoms is a prerequisite for performing automatic bug diagnosis. Do bugs have characteristics that ease or hinder automatic bug diagnosis? In this paper, we conduct a thorough empirical study of several key characteristics of bugs that affect reproducibility at the production site. We examine randomly selected bug reports of six server applications and consider their implications on automatic bug diagnosis tools. Our results are promising. From the study, we find that nearly 82% of bug symptoms can be reproduced deterministically by re-running with the same set of inputs at the production site. We further find that very few input requests are needed to reproduce most failures; in fact, just one input request after session establishment suffices to reproduce the failure in nearly 77% of the cases. We describe the implications of the results on reproducing software failures and designing automated diagnosis tools for production runs.
Conference Paper
Full-text available
Program analysis and automated test generation have primarily been used to find correctness bugs. We present complexity testing, a novel automated test generation technique to find performance bugs. Our complexity testing algorithm, which we call WISE (Worst-case Inputs from Symbolic Execution), operates on a program accepting inputs of arbitrary size. For each input size, WISE attempts to construct an input which exhibits the worst-case computational complexity of the program. WISE uses exhaustive test generation for small input sizes and generalizes the result of executing the program on those inputs into an "input generator." The generator is subsequently used to efficiently generate worst-case inputs for larger input sizes. We have performed experiments to demonstrate the utility of our approach on a set of standard data structures and algorithms. Our results show that WISE can effectively generate worst-case inputs for several of these benchmarks.
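The WISE strategy can be illustrated on insertion sort: exhaustive search over small input sizes shows the reverse-sorted permutation is always the worst case, and that observation becomes a generator for worst-case inputs at larger sizes. The sketch below replaces symbolic execution with brute force, so it only conveys the shape of the idea.

```python
# Small illustration of "generalize small worst cases into an input generator".
from itertools import permutations

def insertion_sort_cost(xs):
    xs, comparisons = list(xs), 0
    for i in range(1, len(xs)):
        j = i
        while j > 0:
            comparisons += 1
            if xs[j - 1] > xs[j]:
                xs[j - 1], xs[j] = xs[j], xs[j - 1]
                j -= 1
            else:
                break
    return comparisons

# Exhaustive worst-case search for small input sizes.
for n in range(2, 7):
    worst = max(permutations(range(n)), key=insertion_sort_cost)
    print(n, worst)            # always the reverse-sorted permutation

# Generalized "input generator" used for large sizes without exhaustive search.
def worst_case_input(n):
    return list(range(n - 1, -1, -1))

print(insertion_sort_cost(worst_case_input(500)))   # ~n*(n-1)/2 comparisons
```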
Conference Paper
Full-text available
With the ubiquity of multi-core processors, software must make effective use of multiple cores to obtain good performance on modern hardware. One of the biggest roadblocks to this is load imbalance, or the uneven distribution of work across cores. We propose LIME, a framework for analyzing parallel programs and reporting the cause of load imbalance in application source code. This framework uses statistical techniques to pinpoint load imbalance problems stemming from both control flow issues (e.g., unequal iteration counts) and interactions between the application and hardware (e.g., unequal cache miss counts). We evaluate LIME on applications from widely used parallel benchmark suites, and show that LIME accurately reports the causes of load imbalance, their nature and origin in the code, and their relative importance.
Conference Paper
Full-text available
Most Java programmers would agree that Java is a language that promotes a philosophy of “create and go forth”. By design, temporary objects are meant to be created on the heap, possibly used and then abandoned to be collected by the garbage collector. Excessive generation of temporary objects is termed “object churn” and is a form of software bloat that often leads to performance and memory problems. To mitigate this problem, many compiler optimizations aim at identifying objects that may be allocated on the stack. However, most such optimizations miss large opportunities for memory reuse when dealing with objects inside loops or when dealing with container objects. In this paper, we describe a novel algorithm that detects bloat caused by the creation of temporary container and String objects within a loop. Our analysis determines which objects created within a loop can be reused. Then we describe a source-to-source transformation that efficiently reuses such objects. Empirical evaluation indicates that our solution can reduce up to 40% of temporary object allocations in large programs, resulting in a performance improvement that can be as high as a 20% reduction in the run time, specifically when a program has a high churn rate or when the program is memory intensive and needs to run the GC often.
Conference Paper
Full-text available
Performance analysts profile their programs to find methods that are worth optimizing: the "hot" methods. This paper shows that four commonly-used Java profilers (xprof, hprof, jprofile, and yourkit) often disagree on the identity of the hot methods. If two profilers disagree, at least one must be incorrect. Thus, there is a good chance that a profiler will mislead a performance analyst into wasting time optimizing a cold method with little or no performance improvement. This paper uses causality analysis to evaluate profilers and to gain insight into the source of their incorrectness. It shows that these profilers all violate a fundamental requirement for sampling-based profilers: to be correct, a sampling-based profiler must collect samples randomly. We show that a proof-of-concept profiler, which collects samples randomly, does not suffer from the above problems. Specifically, we show, using a number of case studies, that our profiler correctly identifies methods that are important to optimize; in some cases other profilers report that these methods are cold and thus not worth optimizing.
Conference Paper
Full-text available
Calling context trees (CCTs) associate performance metrics with paths through a program's call graph, providing valuable information for program understanding and performance analysis. Although CCTs are typically much smaller than call trees, in real applications they might easily consist of tens of millions of distinct calling contexts: this sheer size makes them difficult to analyze and might hurt execution times due to poor access locality. For performance analysis, accurately collecting information about hot calling contexts may be more useful than constructing an entire CCT that includes millions of uninteresting paths. As we show for a variety of prominent Linux applications, the distribution of calling context frequencies is typically very skewed. In this paper we show how to exploit this property to reduce the CCT size considerably. We introduce a novel run-time data structure, called Hot Calling Context Tree (HCCT), that offers an additional intermediate point in the spectrum of data structures for representing interprocedural control flow. The HCCT is a subtree of the CCT that includes only hot nodes and their ancestors. We show how to compute the HCCT without storing the exact frequency of all calling contexts, by using fast and space-efficient algorithms for mining frequent items in data streams. With this approach, we can distinguish between hot and cold contexts on the fly, while obtaining very accurate frequency counts. We show both theoretically and experimentally that the HCCT achieves a similar precision as the CCT in a much smaller space, roughly proportional to the number of distinct hot contexts: this is typically several orders of magnitude smaller than the total number of calling contexts encountered during a program's execution. Our space-efficient approach can be effectively combined with previous context-sensitive profiling techniques, such as sampling and bursting.
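The stream-mining ingredient can be sketched with the Space-Saving algorithm, which keeps approximate counts for only the k most frequent calling contexts instead of materializing every context; the calling contexts and skew below are synthetic.

```python
# Compact sketch of approximate frequent-items counting for calling contexts.
def space_saving(stream, k):
    counters = {}                       # calling context -> approximate count
    for ctx in stream:
        if ctx in counters:
            counters[ctx] += 1
        elif len(counters) < k:
            counters[ctx] = 1
        else:
            victim = min(counters, key=counters.get)   # evict the smallest counter
            counters[ctx] = counters.pop(victim) + 1   # inherit its count (overestimate)
    return counters

# Calling contexts encoded as call-path tuples; the skew mirrors real programs.
stream = [("main", "parse", "read")] * 900 + \
         [("main", "render", "draw")] * 80 + \
         [("main", "log", str(i)) for i in range(200)]   # long tail of cold contexts

hot = space_saving(stream, k=8)
print(sorted(hot.items(), key=lambda kv: -kv[1])[:2])
```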
Conference Paper
Concurrency bugs are widespread in multithreaded programs. Fixing them is time-consuming and error-prone. We present CFix, a system that automates the repair of concurrency bugs. CFix works with a wide variety of concurrency-bug detectors. For each failure-inducing interleaving reported by a bug detector, CFix first determines a combination of mutual-exclusion and order relationships that, once enforced, can prevent the buggy interleaving. CFix then uses static analysis and testing to determine where to insert what synchronization operations to force the desired mutual-exclusion and order relationships, with a best effort to avoid deadlocks and excessive performance losses. CFix also simplifies its own patches by merging fixes for related bugs. Evaluation using four different types of bug detectors and thirteen real-world concurrency-bug cases shows that CFix can successfully patch these cases without causing deadlocks or excessive performance degradation. Patches automatically generated by CFix are of similar quality to those manually written by developers.
Conference Paper
Many bugs, even those that are known and documented in bug reports, remain in mature software for a long time due to the lack of the development resources to fix them. We propose a general approach, R2Fix, to automatically generate bug-fixing patches from free-form bug reports. R2Fix combines past fix patterns, machine learning techniques, and semantic patch generation techniques to fix bugs automatically. We evaluate R2Fix on three projects, i.e., the Linux kernel, Mozilla, and Apache, for three important types of bugs: buffer overflows, null pointer bugs, and memory leaks. R2Fix generates 57 patches correctly, 5 of which are new patches for bugs that have not been fixed by developers yet. We reported all 5 new patches to the developers; 4 have already been accepted and committed to the code repositories. The 57 correct patches generated by R2Fix could have shortened and saved up to an average of 63 days of bug diagnosis and patch generation time.
Article
Traditional profilers identify where a program spends most of its resources. They do not provide information about why the program spends those resources or about how resource consumption would change for different program inputs. In this paper we introduce the idea of algorithmic profiling. While a traditional profiler determines a set of measured cost values, an algorithmic profiler determines a cost function. It does that by automatically determining the "inputs" of a program, by measuring the program's "cost" for any given input, and by inferring an empirical cost function.
Article
There are more bugs in real-world programs than human programmers can realistically address. This paper evaluates two research questions: “What fraction of bugs can be repaired automatically?” and “How much does it cost to repair a bug automatically?” In previous work, we presented GenProg, which uses genetic programming to repair defects in off-the-shelf C programs. To answer these questions, we: (1) propose novel algorithmic improvements to GenProg that allow it to scale to large programs and find repairs 68% more often, (2) exploit GenProg's inherent parallelism using cloud computing resources to provide grounded, human-competitive cost measurements, and (3) generate a large, indicative benchmark set to use for systematic evaluations. We evaluate GenProg on 105 defects from 8 open-source programs totaling 5.1 million lines of code and involving 10,193 test cases. GenProg automatically repairs 55 of those 105 defects. To our knowledge, this evaluation is the largest available of its kind, and is often two orders of magnitude larger than previous work in terms of code or test suite size or defect count. Public cloud computing prices allow our 105 runs to be reproduced for $403; a successful repair completes in 96 minutes and costs $7.32, on average.
Article
Many applications suffer from run-time bloat: excessive memory usage and work to accomplish simple tasks. Bloat significantly affects scalability and performance, and exposing it requires good diagnostic tools. We present a novel analysis that profiles the run-time execution to help programmers uncover potential performance problems. The key idea of the proposed approach is to track object references, starting from object creation statements, through assignment statements, and eventually statements that perform useful operations. This propagation is abstracted by a representation we refer to as a reference propagation graph. This graph provides path information specific to reference producers and their run-time contexts. Several client analyses demonstrate the use of reference propagation profiling to uncover runtime inefficiencies. We also present a study of the properties of reference propagation graphs produced by profiling 36 Java programs. Several case studies discuss the inefficiencies identified in some of the analyzed programs, as well as the significant improvements obtained after code optimizations.
Article
Developers frequently use inefficient code sequences that could be fixed by simple patches. These inefficient code sequences can cause significant performance degradation and resource waste, referred to as performance bugs. Meager increases in single-threaded performance in the multi-core era and an increasing emphasis on energy efficiency call for more effort in tackling performance bugs. This paper conducts a comprehensive study of 110 real-world performance bugs that are randomly sampled from five representative software suites (Apache, Chrome, GCC, Mozilla, and MySQL). The findings of this study provide guidance for future work to avoid, expose, detect, and fix performance bugs. Guided by our characteristics study, efficiency rules are extracted from 25 patches and are used to detect performance bugs. 332 previously unknown performance problems are found in the latest versions of MySQL, Apache, and Mozilla applications, including 219 performance problems found by applying rules across applications.
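As a loose illustration of rule-based detection of inefficient code sequences, the sketch below encodes a single well-known rule (avoid list.insert(0, ...) inside a loop) as an AST check. The actual study derives its efficiency rules from developers' patches to the systems it examined; this rule and the sample snippet are purely illustrative.

```python
# Toy stand-in for rule-based performance-bug detection: one "efficiency rule"
# expressed as an AST check over loops.
import ast

SAMPLE = """
def build(items):
    out = []
    for it in items:
        out.insert(0, it)     # quadratic: shifts the whole list every iteration
    return out
"""

class InsertAtFrontInLoop(ast.NodeVisitor):
    def __init__(self):
        self.findings = []

    def _scan_loop(self, node):
        for sub in ast.walk(node):
            if (isinstance(sub, ast.Call)
                    and isinstance(sub.func, ast.Attribute)
                    and sub.func.attr == "insert"
                    and sub.args
                    and isinstance(sub.args[0], ast.Constant)
                    and sub.args[0].value == 0):
                self.findings.append(sub.lineno)
        self.generic_visit(node)

    visit_For = _scan_loop
    visit_While = _scan_loop

checker = InsertAtFrontInLoop()
checker.visit(ast.parse(SAMPLE))
for line in checker.findings:
    print(f"line {line}: list.insert(0, ...) in a loop; consider collections.deque.appendleft")
```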
Conference Paper
Software bugs affect system reliability. When a bug is exposed in the field, developers need to fix them. Unfortunately, the bug-fixing process can also introduce errors, which leads to buggy patches that further aggravate the damage to end users and erode software vendors' reputation. This paper presents a comprehensive characteristic study on incorrect bug-fixes from large operating system code bases including Linux, OpenSolaris, FreeBSD and also a mature commercial OS developed and evolved over the last 12 years, investigating not only the mistake patterns during bug-fixing but also the possible human reasons in the development process when these incorrect bug-fixes were introduced. Our major findings include: (1) at least 14.8%--24.4% of sampled fixes for post-release bugs in these large OSes are incorrect and have made impacts to end users. (2) Among several common bug types, concurrency bugs are the most difficult to fix correctly: 39% of concurrency bug fixes are incorrect. (3) Developers and reviewers for incorrect fixes usually do not have enough knowledge about the involved code. For example, 27% of the incorrect fixes are made by developers who have never touched the source code files associated with the fix. Our results provide useful guidelines to design new tools and also to improve the development process. Based on our findings, the commercial software vendor whose OS code we evaluated is building a tool to improve the bug fixing and code reviewing process.
Conference Paper
Framework-intensive applications (e.g., Web applications) heavily use temporary data structures, often resulting in performance bottlenecks. This paper presents an optimized blended escape analysis to approximate object lifetimes and thus, to identify these temporaries and their uses. Empirical results show that this optimized analysis on average prunes 37% of the basic blocks in our benchmarks, and achieves a speedup of up to 29 times compared to the original analysis. Newly defined metrics quantify key properties of temporary data structures and their uses. A detailed empirical evaluation offers the first characterization of temporaries in framework-intensive applications. The results show that temporary data structures can include up to 12 distinct object types and can traverse through as many as 14 method invocations before being captured.
Conference Paper
Every bug has a story behind it. The people that discover and resolve it need to coordinate, to get information from documents, tools, or other people, and to navigate through issues of accountability, ownership, and organizational structure. This paper reports on a field study of coordination activities around bug fixing that used a combination of case study research and a survey of software professionals. Results show that the histories of even simple bugs are strongly dependent on social, organizational, and technical knowledge that cannot be solely extracted through automation of electronic repositories, and that such automation provides incomplete and often erroneous accounts of coordination. The paper uses rich bug histories and survey results to identify common bug fixing coordination patterns and to provide implications for tool designers and researchers of coordination in software development.
Conference Paper
Concurrent programming is increasingly important for achieving performance gains in the multi-core era, but it is also a difficult and error-prone task. Concurrency bugs are particularly difficult to avoid and diagnose, and therefore in order to improve methods for handling such bugs, we need a better understanding of their characteristics. In this paper we present a study of concurrency bugs in MySQL, a widely used database server. While previous studies of real-world concurrency bugs exist, they have centered their attention on the causes of these bugs. In this paper we provide a complementary focus on their effects, which is important for understanding how to detect or tolerate such bugs at run-time. Our study uncovered several interesting facts, such as the existence of a significant number of latent concurrency bugs, which silently corrupt data structures and are exposed to the user potentially much later. We also highlight several implications of our findings for the design of reliable concurrent systems.
Conference Paper
The reality of multi-core hardware has made concurrent programs pervasive. Unfortunately, writing correct concurrent programs is difficult. Addressing this challenge requires advances in multiple directions, including concurrency bug detection, concurrent program testing, concurrent programming model design, etc. Designing effective techniques in all these directions will significantly benefit from a deep understanding of real world concurrency bug characteristics. This paper provides the first (to the best of our knowledge) comprehensive real world concurrency bug characteristic study. Specifically, we have carefully examined concurrency bug patterns, manifestation, and fix strategies of 105 randomly selected real world concurrency bugs from 4 representative server and client open-source applications (MySQL, Apache, Mozilla and OpenOffice). Our study reveals several interesting findings and provides useful guidance for concurrency bug detection, testing, and concurrent programming language design. Some of our findings are as follows: (1) Around one third of the examined non-deadlock concurrency bugs are caused by violation of programmers' order intentions, which may not be easily expressed via synchronization primitives like locks and transactional memories; (2) Around 34% of the examined non-deadlock concurrency bugs involve multiple variables, which are not well addressed by existing bug detection tools; (3) About 92% of the examined concurrency bugs can be reliably triggered by enforcing certain orders among no more than 4 memory accesses. This indicates that testing concurrent programs can target exploring possible orders among every small group of memory accesses, instead of among all memory accesses; (4) About 73% of the examined non-deadlock concurrency bugs were not fixed by simply adding or changing locks, and many of the fixes were not correct at the first try, indicating the difficulty of reasoning about concurrent execution by programmers.
Conference Paper
We present a study of operating system errors found by automatic, static, compiler analysis applied to the Linux and OpenBSD kernels. Our approach differs from previous studies that consider errors found by manual inspection of logs, testing, and surveys because static analysis is applied uniformly to the entire kernel source, though our approach necessarily considers a less comprehensive variety of errors than previous studies. In addition, automation allows us to track errors over multiple versions of the kernel source to estimate how long errors remain in the system before they are fixed.We found that device drivers have error rates up to three to seven times higher than the rest of the kernel. We found that the largest quartile of functions have error rates two to six times higher than the smallest quartile. We found that the newest quartile of files have error rates up to twice that of the oldest quartile, which provides evidence that code "hardens" over time. Finally, we found that bugs remain in the Linux kernel an average of 1.8 years before being fixed.
Conference Paper
Load tests aim to validate whether system performance is acceptable under peak conditions. Existing test generation techniques induce load by increasing the size or rate of the input. Ignoring the particular input values, however, may lead to test suites that grossly mischaracterize a system's performance. To address this limitation we introduce a mixed symbolic execution based approach that is unique in how it 1) favors program paths associated with a performance measure of interest, 2) operates in an iterative-deepening beam-search fashion to discard paths that are unlikely to lead to high-load tests, and 3) generates a test suite of a given size and level of diversity. An assessment of the approach shows it generates test suites that induce program response times and memory consumption several times worse than the compared alternatives, it scales to large and complex inputs, and it exposes a diversity of resource consuming program behavior.
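The core observation that input values, not just input size, determine load can be shown with a small Java sketch (illustrative only, not the paper's symbolic-execution technique; the workload is hypothetical). Two inputs of identical length exercise the same routine with drastically different cost, so a load test that only scales size would miss the expensive case.

import java.util.Arrays;

public class ValueSensitiveLoad {
    // Hypothetical workload: insertion sort is near-linear on already sorted data
    // and quadratic on reverse-sorted data of the same length.
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i], j = i - 1;
            while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
            a[j + 1] = key;
        }
    }

    public static void main(String[] args) {
        int n = 50_000;
        int[] sorted = new int[n], reversed = new int[n];
        for (int i = 0; i < n; i++) { sorted[i] = i; reversed[i] = n - i; }

        long t0 = System.nanoTime();
        insertionSort(Arrays.copyOf(sorted, n));      // same size, cheap values
        long t1 = System.nanoTime();
        insertionSort(Arrays.copyOf(reversed, n));    // same size, expensive values
        long t2 = System.nanoTime();

        System.out.printf("sorted: %d ms, reversed: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}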
Conference Paper
Many large-scale Java applications suffer from runtime bloat. They execute large volumes of methods, and create many temporary objects, all to execute relatively simple operations. There are large opportunities for performance optimizations in these applications, but most are being missed by existing optimization and tooling technology. While JIT optimizations struggle for a few percent, performance experts analyze deployed applications and regularly find gains of 2× or more. Finding such big gains is difficult, for both humans and compilers, because of the diffuse nature of runtime bloat. Time is spread thinly across calling contexts, making it difficult to judge how to improve performance. Bloat results from a pile-up of seemingly harmless decisions. Each adds temporary objects and method calls, and often copies values between those temporary objects. While data copies are not the entirety of bloat, we have observed that they are excellent indicators of regions of excessive activity. By optimizing copies, one is likely to remove the objects that carry copied values, and the method calls that allocate and populate them. We introduce copy profiling, a technique that summarizes runtime activity in terms of chains of data copies. A flat copy profile counts copies by method. We show how flat profiles alone can be helpful. In many cases, diagnosing a problem requires data flow context. Tracking and making sense of raw copy chains does not scale, so we introduce a summarizing abstraction called the copy graph. We implement three client analyses that, using the copy graph, expose common patterns of bloat, such as finding hot copy chains and discovering temporary data structures. We demonstrate, with examples from a large-scale commercial application and several benchmarks, that copy profiling can be used by a programmer to quickly find opportunities for large performance gains.
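The pattern that copy profiling is designed to expose looks roughly like the Java sketch below (illustrative only, not the authors' tool; the wrapper classes and method names are hypothetical). Each stage allocates a temporary object and copies the same value one more time before it is finally used, forming exactly the kind of copy chain a flat copy profile or copy graph would surface.

import java.util.ArrayList;
import java.util.List;

public class CopyChainBloat {
    record Row(String value) {}          // hypothetical temporary wrapper
    record Dto(String value) {}          // another temporary wrapper

    static List<Dto> fetch(List<String> raw) {
        List<Row> rows = new ArrayList<>();
        for (String s : raw) rows.add(new Row(s));        // copy 1: String into Row
        List<Dto> dtos = new ArrayList<>();
        for (Row r : rows) dtos.add(new Dto(r.value()));  // copy 2: Row into Dto
        return dtos;                                      // each value copied twice, used once
    }

    public static void main(String[] args) {
        List<Dto> out = fetch(List.of("a", "b", "c"));
        System.out.println(out.size());   // only the final values are ever consumed
    }
}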
Conference Paper
Fixing software bugs has always been an important and time-consuming process in software development. Fixing concurrency bugs has become especially critical in the multicore era. However, fixing concurrency bugs is challenging, in part due to non-deterministic failures and tricky parallel reasoning. Beyond correctly fixing the original problem in the software, a good patch should also avoid introducing new bugs, degrading performance unnecessarily, or damaging software readability. Existing tools cannot automate the whole fixing process and provide good-quality patches. We present AFix, a tool that automates the whole process of fixing one common type of concurrency bug: single-variable atomicity violations. AFix starts from the bug reports of existing bug-detection tools. It augments these with static analysis to construct a suitable patch for each bug report. It further tries to combine the patches of multiple bugs for better performance and code readability. Finally, AFix's run-time component provides testing customized for each patch. Our evaluation shows that patches automatically generated by AFix correctly eliminate six out of eight real-world bugs and significantly decrease the failure probability in the other two cases. AFix patches never introduce new bugs and usually have similar performance to manually-designed patches.
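For readers unfamiliar with the bug class AFix targets, the Java sketch below shows a single-variable atomicity violation and the shape of a lock-based patch (a minimal illustration, not AFix's actual output; names are hypothetical).

public class AtomicityViolation {
    private static int counter = 0;
    private static final Object patchLock = new Object();   // lock introduced by the fix

    // Buggy version: the read-modify-write of 'counter' is not atomic,
    // so two threads can interleave between the read and the write and lose updates.
    static void incrementBuggy() {
        counter = counter + 1;
    }

    // Patched version: the read-modify-write is wrapped in a single critical region.
    static void incrementFixed() {
        synchronized (patchLock) {
            counter = counter + 1;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> { for (int i = 0; i < 100_000; i++) incrementFixed(); };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter);   // 200000 with the patch; often less with incrementBuggy
    }
}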
Article
This study focuses largely on two issues: (a) improved syntax for iterations and error exits, making it possible to write a larger class of programs clearly and efficiently without "go to" statements; (b) a methodology of program design, beginning with readable and correct, but possibly inefficient programs that are systematically transformed, if necessary, into efficient and correct, but possibly less readable code. The discussion brings out opposing points of view about whether or not "go to" statements should be abolished; some merit is found on both sides of this question. Finally, an attempt is made to define the true nature of structured programming, and to recommend fruitful directions for further study.
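As a small illustration of the "iterations with error exits" theme, the Java sketch below (illustrative only; the article itself is language-neutral and predates Java) writes a search loop with both a normal exit and an error exit using structured control flow rather than any "go to".

public class StructuredSearch {
    // Returns the index of key in a, or -1 as the "error exit".
    static int find(int[] a, int key) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == key) {
                return i;        // normal exit from the middle of the loop
            }
        }
        return -1;               // error exit, expressed without a goto
    }

    public static void main(String[] args) {
        int[] table = {3, 1, 4, 1, 5};
        System.out.println(find(table, 4));   // prints 2
        System.out.println(find(table, 9));   // prints -1
    }
}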
Article
Many popular software systems automatically report failures back to the vendors, allowing developers to focus on the most pressing problems. However, it takes a certain period of time to assess which failures occur most frequently. In an empirical investigation of the Firefox and Thunderbird crash report databases, we found that only 10 to 20 crashes account for the large majority of crash reports; predicting these “top crashes” thus could dramatically increase software quality. By training a machine learner on the features of top crashes of past releases, we can effectively predict the top crashes well before a new release. This allows for quick resolution of the most important crashes, leading to improved user experience and better allocation of maintenance efforts.
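The baseline this work improves on, waiting for enough reports and then ranking crash signatures by frequency, can be sketched in a few lines of Java (a simplified frequency ranking, not the paper's machine-learning predictor; the signatures below are hypothetical).

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopCrashRanking {
    public static void main(String[] args) {
        List<String> reports = List.of(          // hypothetical crash signatures (top stack frame)
                "nsDocShell::Destroy", "js::GC", "nsDocShell::Destroy",
                "gfxFont::Draw", "nsDocShell::Destroy", "js::GC");

        Map<String, Long> counts = reports.stream()
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()));

        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(2)                        // the "top crashes" dominating the report volume
                .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}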
Conference Paper
We test the hypothesis that generic recovery techniques, such as process pairs, can survive most application faults without using application-specific information. We examine in detail the faults that occur in three large, open-source applications: the Apache Web server, the GNOME desktop environment, and the MySQL database. Using information contained in the bug reports and source code, we classify faults based on how they depend on the operating environment. We find that 72-87% of the faults are independent of the operating environment and are hence deterministic (non-transient). Recovering from the failures caused by these faults requires the use of application-specific knowledge. Half of the remaining faults depend on a condition in the operating environment that is likely to persist on retry, and the failures caused by these faults are also likely to require application-specific recovery. Unfortunately, only 5-14% of the faults were triggered by transient conditions, such as timing and synchronization, that naturally fix themselves during recovery. Our results indicate that classical application-generic recovery techniques, such as process pairs, will not be sufficient to enable applications to survive most failures caused by application faults.
Trend Micro will pay for PC repair costs
  • P Kallender
Apache's JIRA issue tracker
  • Apache Software Foundation
Inside Windows 7 - reliability, performance and PerfTrack
  • D Fields
  • B Karagounis
Lessons from the Colorado benefits management system disaster
  • G E Morris
1901 census site still down after six months