Conference Paper · PDF available

Trivial Compiler Equivalence: A Large Scale Empirical Study of a Simple, Fast and Effective Equivalent Mutant Detection Technique

Abstract and Figures

Identifying equivalent mutants remains the largest impediment to the widespread uptake of mutation testing. Despite more than three decades of research, the problem remains. We propose Trivial Compiler Equivalence (TCE), a technique that exploits readily available compiler technology to address this long-standing challenge. TCE is directly applicable to real-world programs and can imbue existing tools with the ability to detect equivalent mutants and a special form of useless mutants called duplicated mutants. We present a thorough empirical study using 6 large open-source programs, several orders of magnitude larger than those used in previous work, and 18 benchmark programs with hand-analyzed equivalent mutants. Our results reveal that, on large real-world programs, TCE can discard more than 7% and 21% of all the mutants as being equivalent and duplicated mutants, respectively. A human-based equivalence verification reveals that TCE can detect approximately 30% of all the existing equivalent mutants.
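The core of TCE can be sketched in a few lines: compile the original program and each mutant with the same optimizing compiler and flags, then compare the resulting binaries byte for byte. The sketch below is illustrative, not the paper's actual tooling; it assumes gcc is available and uses hypothetical file names.

```python
import filecmp
import subprocess

def compile_c(source_path: str, binary_path: str, opt: str = "-O3") -> None:
    """Compile a C source file at a fixed optimization level (assumes gcc)."""
    subprocess.run(["gcc", opt, "-o", binary_path, source_path], check=True)

def binaries_identical(binary_a: str, binary_b: str) -> bool:
    """Byte-for-byte comparison of two compiled binaries."""
    return filecmp.cmp(binary_a, binary_b, shallow=False)

def tce_equivalent(original_src: str, mutant_src: str) -> bool:
    """A mutant is 'trivially equivalent' when the optimizer emits
    exactly the same binary for it as for the original program."""
    compile_c(original_src, "original.bin")
    compile_c(mutant_src, "mutant.bin")
    return binaries_identical("original.bin", "mutant.bin")
```

Note that identical binaries imply equivalence, but the converse does not hold: many equivalent mutants still compile to different binaries, which is consistent with TCE detecting roughly 30% of equivalent mutants rather than all of them. Duplicated mutants are found the same way, by comparing mutant binaries against each other rather than against the original.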
... However, the number of studies on detection approaches has increased in recent years, such as those based on constraint solving [19] or data-flow analysis [20]. The most remarkable advance in this area is the technique called Trivial Compiler Equivalence (TCE) [21,5]. Based on the transformations performed by compiler optimizations, TCE can detect equivalent mutants that produce the same binary files as the original version of the program. ...
... In those studies, TCE showed the ability to detect around 30% and 50% of all equivalent mutants in C and Java programs, respectively. Apart from its application to traditional C [21] and Java mutants [5,22], this technique has also been assessed with memory mutants [23] and class mutants for C++ [24], where it was able to discard 5.5% and 27% of the set of equivalent mutants, respectively. ...
... Indeed, the fitness function in EMT could have unintentionally been degrading its performance: the algorithm may inadvertently breed new mutants derived from equivalent mutants instead of useful mutants. If that is the case, EMT could be combined with state-of-the-art techniques for identifying equivalent mutants, such as Trivial Compiler Equivalence (TCE) [5,21], making it possible to distinguish live mutants from equivalent ones in a real situation. ...
Article
Full-text available
Context: Mutation testing is considered a powerful approach to assess and improve the quality of test suites. However, this technique is expensive, mainly because some mutants are semantically equivalent to the original program; in general, equivalent mutants require manual revision to differentiate them from useful ones, which is known as the Equivalent Mutant Problem (EMP). Objective: In the past, several authors have proposed different techniques to individually identify certain equivalent mutants, with notable advances in recent years. In our work, by contrast, we address the EMP from a global perspective. Namely, we investigate the extent to which equivalent mutants are connected (i.e., whether they share mutation operators and code areas), as well as the extent to which knowledge of that connection can benefit the mutant selection process. Such a study could allow us to go beyond the implicit limit of the traditional individual detection of equivalent mutants. Method: We use an evolutionary algorithm to select the mutants, an approach called Evolutionary Mutation Testing (EMT). We propose a new derived version, Equivalence-Aware EMT (EA-EMT), which penalizes the fitness of known equivalent mutants so that they do not transfer their features to the next generations of mutants. Results: In our experiments applying EMT to well-known C++ programs, we found that (i) equivalent mutants often originate from other equivalent mutants (over 60% on average); (ii) EA-EMT’s approach of penalizing known equivalent mutants provides better results than the original EMT in most cases (notably, the more equivalent mutants detected, the better); and (iii) we can combine EA-EMT with Trivial Compiler Equivalence to automatically identify equivalent mutants in a real situation, reaching a more stable version of EMT. Conclusions: This novel approach opens the way for improvements in other related areas that deal with equivalent versions.
... Interestingly, this means that semantically-equivalent mutants are actually the target of PMT, rather than an obstacle, as in traditional mutation. This could lead to problems with compiling optimization techniques, for example, which could automatically undo performance mutations if they detect that the original program and the mutant are equivalent [27]. ...
... [table omitted: performance mutation operators (RCL, URV, MSL, SOC, HWO, CSO, MSR) mapped to the constructs they mutate (loop perturbation, method call, object generation, conditional statement, collections) and the performance measures they target (execution time, memory consumption)] ... a loop to stop when a certain condition is met, thus saving some unnecessary iterations that do not affect the outcome of the program. With regard to this, Nistor et al. [29] identified a family of performance bugs associated with a loop and a condition that have CondBreak fixes, i.e., bugs that can be fixed by adding a condition inside the loop. ...
... For the evaluation of the relation between PMT and compiler optimizations (RQ5), we applied the technique known as Trivial Compiler Equivalence or TCE [27]. This technique allows the automated detection of two equivalent programs when, as a result of the optimizations performed by compilers, the binary files of both programs turn out to be identical. ...
Article
Performance bugs are known to be a major threat to the success of software products. Performance tests aim to detect performance bugs by executing the program through test cases and checking whether it exhibits a noticeable performance degradation. The principles of mutation testing, a well-established testing technique for the assessment of test suites through the injection of artificial faults, could be exploited to evaluate and improve the detection power of performance tests. However, the application of mutation testing to assess performance tests, henceforth called performance mutation testing (PMT), is a novel research topic with numerous open challenges. In previous papers, we identified some key challenges related to PMT. In this work, we go a step further and explore the feasibility of applying PMT at the source-code level in general-purpose languages. To do so, we revisit concepts associated with classical mutation testing, and design seven novel mutation operators to model known bug-inducing patterns. As a proof of concept, we applied traditional mutation operators as well as performance mutation operators to open-source C++ programs. The results reveal the potential of the new performance-mutants to help assess and enhance performance tests when compared to traditional mutants. A review of live mutants in these programs suggests that they can induce the design of special test inputs. In addition to these promising results, our work brings a whole new set of challenges related to PMT, which will hopefully serve as a starting point for new contributions in the area.
... Equivalent mutants [34,35] are functionally equivalent versions of the original program, and thus cannot be killed: no test case applied to both the mutant and the original program could result in different behavior. In our study, therefore, all equivalent mutants were removed, leaving only those mutants that could be detected by at least one test case. ...
... In Table 2, the #Mutant column gives the total number of all mutants (#All), and the #Detected column gives the number of detected mutants. Although all detected mutants were considered in our study, some mutants, called duplicated mutants [35], were equivalent to other mutants (but not to the original program). Similarly, some mutants, called subsumed mutants [36,37], were subsumed by others: if a subsuming mutant [38] is killed, then its subsumed mutants are also killed. ...
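Given a kill matrix (which tests kill which mutants), both notions from the snippet above can be approximated directly: two detected mutants with identical kill sets behave as duplicates with respect to the test suite, and a mutant whose kill set is contained in another's subsumes it. A minimal sketch, assuming a simple dict-of-sets representation; the mutant and test names are made up for illustration.

```python
# Kill sets: mutant id -> set of tests that kill it (illustrative data).
kills = {
    "m1": {"t1", "t2"},
    "m2": {"t1", "t2"},        # same kill set as m1 -> duplicate w.r.t. the suite
    "m3": {"t1", "t2", "t3"},  # superset of m1's kill set -> subsumed by m1
    "m4": set(),               # never killed: equivalent or stubborn
}

def duplicated_pairs(kills):
    """Pairs of detected mutants with identical kill sets (an approximation:
    true duplication is semantic equality, which a finite suite cannot prove)."""
    ms = sorted(kills)
    return [(a, b) for i, a in enumerate(ms) for b in ms[i + 1:]
            if kills[a] and kills[a] == kills[b]]

def subsumes(a, b, kills):
    """a subsumes b: every test that kills a also kills b (and a is killable),
    so killing a guarantees killing b."""
    return bool(kills[a]) and kills[a] <= kills[b] and a != b
```

With the data above, `duplicated_pairs` reports the pair ("m1", "m2"), and `subsumes("m1", "m3", kills)` holds because any test that kills m1 necessarily kills m3.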
Article
Full-text available
Regression test case prioritization (RTCP) aims to improve the rate of fault detection by executing more important test cases as early as possible. Various RTCP techniques have been proposed based on different coverage criteria. Among them, a majority of techniques leverage code coverage information to guide the prioritization process, with code units being considered individually, and in isolation. In this paper, we propose a new coverage criterion, code combinations coverage, that combines the concepts of code coverage and combination coverage. We apply this coverage criterion to RTCP, as a new prioritization technique, code combinations coverage based prioritization (CCCP). We report on empirical studies conducted to compare the testing effectiveness and efficiency of CCCP with four popular RTCP techniques: total, additional, adaptive random, and search-based test prioritization. The experimental results show that even when the lowest combination strength is assigned, overall, the CCCP fault detection rates are greater than those of the other four prioritization techniques. The CCCP prioritization costs are also found to be comparable to the additional test prioritization technique. Moreover, our results also show that when the combination strength is increased, CCCP provides higher fault detection rates than the state-of-the-art, regardless of the levels of code coverage.
... As pointed out by Lin et al. [55], the original test suites coming with each program are not sufficient to achieve high testing coverage of their core functions. Hence, Lin et al. extend the original test suites with additional tests, using KLEE [14] to ensure a coverage higher than 60% on core functions as measured by gcov [29], and generate faults by injecting mutants and minimizing them via Trivial Compiler Equivalence (TCE) [72]. In this article, for a fair comparison, we use the same test suites and generate faults in the same way as Nemo [55]. ...
... However, the information on coverage and faults for these systems has not been published. In the future, to evaluate on more subject systems, we will endeavor to collect the coverage information via gcov and the faults via the bug-tracking systems (if available), or to generate faults using Trivial Compiler Equivalence (TCE) [72]. ...
Article
Full-text available
Test-suite minimization is one key technique for optimizing the software testing process. Due to the need to balance multiple factors, multi-criteria test-suite minimization (MCTSM) has become a popular research topic in the recent decade. The MCTSM problem is typically modeled as an integer linear programming (ILP) problem and solved with a weighted-sum single-objective approach. However, to the best of our knowledge, no existing approach can generate a sound (i.e., Pareto-optimal) and complete (i.e., covering the entire Pareto front) solution set. In this work, we first prove that the ILP formulation can accurately model the MCTSM problem and then propose multi-objective integer programming (MOIP) approaches to solve it. We apply our MOIP approaches to three specific MCTSM problems and compare the results with those of the cutting-edge methods, namely Nonlinear Formulation_Linear Solver (NF_LS) and two multi-objective evolutionary algorithms (MOEAs). The results show that our MOIP approaches can always find sound and complete solutions on five subject programs, using similar or significantly less time than NF_LS and the two MOEAs. These experimental results are quite promising, and our approaches have the potential to be applied to other similar search-based software engineering problems.
... A proportion of generated mutants might behave in the same way as the original code, in which case the calculation of mutation score can be deflated. The detection of equivalent mutants is a challenging and ongoing area of research [29]. ...
... It is possible to mitigate this issue by replacing with − . However, detecting equivalent mutants is a challenging task [29], and such detection is typically not readily available in mutation testing tools (e.g. PIT [11]). ...
... Kintis attempted to isolate first-order equivalent mutants using second-order mutation [51]. Papadakis developed an algorithm called Trivial Compiler Equivalence to find equivalent mutants [52]. The equivalent mutant problem has been discussed in many ways in the literature. ...
... In addition, we intend to evaluate the quality of the mutants generated by our operators. In particular, after applying our mutation engine, we will check the number of equivalent and duplicated mutants [19,41] and even assess the number of redundant mutants [4]. Moreover, we intend to evaluate the correlation of our mutants with real faults related to code annotations, similarly to what has been done in previous studies with similar intent [5,15,33]. ...
Article
Mutation testing injects code changes to check whether tests can detect them. Mutation testing tools use mutation operators that modify program elements such as operators, names, and entire statements. Most existing mutation operators focus on imperative and object-oriented language constructs. However, many current projects use meta-programming through code annotations. In a previous work, we have proposed nine mutation operators for code annotations focused on the Java programming language. In this article, we extend our previous work by mapping the operators to the C# language. Moreover, we enlarge the empirical evaluation. In particular, we mine Java and C# projects that make heavy use of annotations to identify annotation-related faults. We analyzed 200 faults and categorized them as “misuse,” when the developer did not appear to know how to use the code annotations properly, and “wrong annotation parsing” when the developer incorrectly parsed annotation code (by using reflection, for example). Our operators mimic 95% of the 200 mined faults. In particular, three operators can mimic 82% of the faults in Java projects and 84% of the faults in C# projects. In addition, we provide an extended and improved repository hosted on GitHub with the 200 code annotation faults we analyzed. We organize the repository according to the type of errors made by the programmers while dealing with code annotations, and to the mutation operator that can mimic the faults. Last but not least, we also provide a mutation engine, based on these operators, which is publicly available and can be incorporated into existing or new mutation tools. The engine works for Java and C#. As implications for practice, our operators can help developers to improve test suites and parsers of annotated code.
The state of the practice in software development is driven by constant change, fueled by continuous integration servers. Such constant change demands frequent and fully automated tests capable of detecting faults immediately upon project build. As the fault detection capability of the test suite becomes so important, modern software development teams continuously monitor the quality of the test suite as well. However, the state of the practice appears reluctant to adopt strong coverage metrics (namely mutation coverage), instead relying on weaker kinds of coverage (namely branch coverage). In this paper, we investigate three reasons that prohibit the adoption of mutation coverage in a continuous integration setting: (1) the difficulty of integrating it into the build system, (2) the perception that branch coverage is “good enough”, and (3) the performance overhead during the build. Our investigation is based on a case study involving four open-source systems and one industrial system. We demonstrate that mutation coverage reveals additional weaknesses in the test suite compared to branch coverage, and that it does so with an acceptable performance overhead during project build.
Article
Full-text available
Android is an open-source operating system that offers flexibility and support for most mobile applications, and easy access to social networks. It is important to understand the complexity of the design, development, implementation, and testing of Android apps. A number of challenges may be faced in testing Android applications, including the lack of testing processes and methods, the unavailability of testing experts, poor in-house testing environments, and time restrictions. Mutation testing is a fault-based testing technique, applied by generating mutants and running the application with these mutants to analyze the killed and equivalent mutants. We defined a set of mutation operators according to the features of Android applications: apps with content sharing, apps with multimedia, apps with graphics, and apps with user location and maps. In total, we identified 42 mutation operators. In addition, we implemented a new tool, “µ-Android,” which automatically generates mutants and retrieves results, to assess the effectiveness of the test cases and enable the new operators.
Article
Full-text available
Context. The equivalent mutant problem (EMP) is one of the crucial problems in mutation testing and has been widely studied over the decades. Objectives. The objectives are: to present a systematic literature review (SLR) of the field of the EMP; and to identify, classify, and improve existing methods (or implement new ones) that try to overcome the EMP, and to evaluate them. Method. We performed an SLR based on a search of digital libraries. We implemented four second-order mutation (SOM) strategies, in addition to first-order mutation (FOM), and compared them from different perspectives. Results. Our SLR identified 17 relevant techniques (in 22 articles) and three categories of techniques: detecting (DEM); suggesting (SEM); and avoiding equivalent mutant generation (AEMG). The experiment indicated that SOM in general, and the JudyDiffOp strategy in particular, provide the best results in the following areas: total number of mutants generated; the association between the type of mutation strategy and whether the generated mutants were equivalent or not; the number of not-killed mutants; mutation testing time; and the time needed for manual classification. Conclusions. The results in the DEM category are still far from perfect. Thus, the SEM and AEMG categories have been developed. The JudyDiffOp algorithm achieved good results in many areas.
Conference Paper
Full-text available
A good test suite is one that detects real faults. Because the set of faults in a program is usually unknowable, this definition is not useful to practitioners who are creating test suites, nor to researchers who are creating and evaluating tools that generate test suites. In place of real faults, testing research often uses mutants, which are artificial faults — each one a simple syntactic variation — that are systematically seeded throughout the program under test. Mutation analysis is appealing because large numbers of mutants can be automatically generated and used to compensate for low quantities or the absence of known real faults. Unfortunately, there is little experimental evidence to support the use of mutants as a replacement for real faults. This paper investigates whether mutants are indeed a valid substitute for real faults, i.e., whether a test suite's ability to detect mutants is correlated with its ability to detect real faults that developers have fixed. Unlike prior studies, these investigations also explicitly consider the conflating effects of code coverage on the mutant detection rate. Our experiments used 357 real faults in 5 open-source applications that comprise a total of 321,000 lines of code. Furthermore, our experiments used both developer-written and automatically generated test suites. The results show a statistically significant correlation between mutant detection and real fault detection, independently of code coverage. The results also give concrete suggestions on how to improve mutation analysis and reveal some inherent limitations.
Conference Paper
We study the simultaneous test effectiveness and efficiency improvement achievable by Strongly Subsuming Higher Order Mutants (SSHOMs), constructed from 15,792 first order mutants in four Java programs. Using SSHOMs in place of the first order mutants they subsume yielded a 35%-45% reduction in the number of mutants required, while simultaneously improving test efficiency by 15% and effectiveness by between 5.6% and 12%. Trivial first order faults often combine to form exceptionally non-trivial higher order faults; apparently innocuous angels can combine to breed monsters. Nevertheless, these same monsters can be recruited to improve automated test effectiveness and efficiency.
Conference Paper
Though mutation testing has been widely studied for more than thirty years, the prevalence and properties of equivalent mutants remain largely unknown. We report on the causes and prevalence of equivalent mutants and their relationship to stubborn mutants (those that remain undetected by a high quality test suite, yet are non-equivalent). Our results, based on manual analysis of 1,230 mutants from 18 programs, reveal a highly uneven distribution of equivalence and stubbornness. For example, the ABS class and half UOI class generate many equivalent and almost no stubborn mutants, while the LCR class generates many stubborn and few equivalent mutants. We conclude that previous test effectiveness studies based on fault seeding could be skewed, while developers of mutation testing tools should prioritise those operators that we found generate disproportionately many stubborn (and few equivalent) mutants.
Article
To assess the quality of test suites, mutation analysis seeds artificial defects (mutations) into programs; a non-detected mutation indicates a weakness in the test suite. We present an automated approach to generate unit tests that detect these mutations for object-oriented classes. This has two advantages: first, the resulting test suite is optimized towards finding defects rather than covering code; second, the state change caused by mutations induces oracles that precisely detect the mutants. Evaluated on two open-source libraries, our μTEST prototype generates test suites that find significantly more seeded defects than the original manually written test suites.
Chapter
The experimental data from the operation phase is input to the analysis and interpretation. After collecting experimental data in the operation phase, we want to be able to draw conclusions based on this data. To be able to draw valid conclusions, we must interpret the experimental data.
Conference Paper
Mutation analysis generates tests that distinguish variations, or mutants, of an artifact from the original. Mutation analysis is widely considered to be a powerful approach to testing, and hence is often used to evaluate other test criteria in terms of mutation score, which is the fraction of mutants that are killed by a test set. But mutation analysis is also known to produce large numbers of redundant mutants, and these mutants can inflate the mutation score. While mutation approaches broadly characterized as reduced mutation try to eliminate redundant mutants, the literature lacks a theoretical result that articulates just how many mutants are needed in any given situation. Hence, there is, at present, no way to characterize the contribution of, for example, a particular approach to reduced mutation with respect to any theoretical minimal set of mutants. This paper's contribution is to provide such a theoretical foundation for mutant set minimization. The central theoretical result of the paper shows how to efficiently minimize mutant sets with respect to a set of test cases. We evaluate our method with a widely-used benchmark.
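The inflation effect described above is easy to see numerically: keeping redundant (duplicated or subsumed) mutants raises the score of a weak test suite, because the same easy kills are counted many times. A minimal sketch with made-up numbers, not taken from any of the papers:

```python
def mutation_score(killed: int, total: int, equivalent: int = 0) -> float:
    """Mutation score = killed mutants / (total mutants - known equivalent ones)."""
    return killed / (total - equivalent)

# Illustrative numbers: a suite kills 80 of 100 mutants, but many of the
# killed ones are duplicated or subsumed copies of a few distinct behaviors.
inflated = mutation_score(killed=80, total=100)   # 0.8
# After minimizing the mutant set (100 -> 40 mutants, 30 of them killed),
# the same suite looks noticeably weaker:
minimized = mutation_score(killed=30, total=40)   # 0.75
```

The same helper also shows why known equivalent mutants must be excluded from the denominator: they can never be killed, so counting them deflates the score of even a perfect suite.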
Conference Paper
This paper introduces a set of data flow patterns that reveal code locations able to produce equivalent mutants. For each pattern, a formal definition is given and the necessary conditions implying its existence in the source code of the program under test are described. By identifying such problematic situations, the introduced patterns can provide advice on code locations that should not be mutated. Apart from dealing with equivalent mutants, the proposed patterns are able to identify specific paths for which a mutant is functionally equivalent to the original program. This knowledge can be leveraged by test case generation techniques in order not to target these paths when attempting to kill the corresponding mutants. An empirical study, conducted on a set of manually identified equivalent mutants, provides evidence regarding the detection power of the introduced patterns and unveils their existence in real world software.
Conference Paper
Some conflicting results have been reported on the comparison between t-way combinatorial testing and random testing. In this paper, we report a new study that applies t-way and random testing to the Siemens suite. In particular, we investigate the stability of the two techniques. We measure both code coverage and fault detection effectiveness. Each program in the Siemens suite has a number of faulty versions. In addition, mutation faults are used to better evaluate fault detection effectiveness in terms of both the number and diversity of faults. The experimental results show that in most cases, t-way testing performed as well as or better than random testing. There were a few cases where random testing performed better, but by a very small margin. Overall, the differences between the two techniques are not as significant as one would have probably expected. We discuss the practical implications of the results. We believe that more studies are needed to better understand the comparison of the two techniques.