Article (PDF available)

An Experimental Determination of Sufficient Mutant Operators


Abstract and Figures

This paper quantifies the expense of mutation in terms of the number of mutants that are created, then proposes and evaluates a technique that reduces the number of mutants by an order of magnitude. Selective mutation reduces the cost of mutation testing by reducing the number of mutants. This paper reports experimental results that compare selective mutation testing with standard, or non-selective, mutation testing, and results that quantify the savings achieved by selective mutation testing. The results support the hypothesis that selective mutation is almost as strong as non-selective mutation; in experimental trials selective mutation provides almost the same coverage as non-selective mutation, with a four-fold or more reduction in the number of mutants.
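The selection idea can be sketched in a few lines. The five-operator "sufficient" set below (abs, aor, lcr, ror, uoi) is the E-selective set this line of work converges on; the mutant records themselves are hypothetical, for illustration only:

```python
# Sketch of selective mutation: instead of executing every mutant, keep only
# those produced by a small "sufficient" operator set. The operator acronyms
# follow the Mothra convention; the mutant records are illustrative.

SUFFICIENT_OPERATORS = {"abs", "aor", "lcr", "ror", "uoi"}  # E-selective set

def select_mutants(mutants):
    """Keep only mutants created by one of the sufficient operators."""
    return [m for m in mutants if m["operator"] in SUFFICIENT_OPERATORS]

mutants = [
    {"id": 1, "operator": "ror"},  # relational operator replacement (kept)
    {"id": 2, "operator": "svr"},  # scalar variable replacement (dropped)
    {"id": 3, "operator": "abs"},  # absolute value insertion (kept)
    {"id": 4, "operator": "csr"},  # constant-for-scalar replacement (dropped)
]
print([m["id"] for m in select_mutants(mutants)])  # -> [1, 3]
```

The savings come entirely from the smaller mutant pool: every mutant outside the sufficient set is never compiled or executed.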
[Figure: bar chart of "Percentage" (y-axis, 0-20) for each of the 22 mutant operators: dsa, lcr, der, rsr, san, glr, sdl, cnr, ror, crp, aor, car, aar, sar, uoi, abs, src, acr, csr, scr, asr, svr]
... The fault revealing mutant selection goal is different from that of the "traditional" mutant reduction techniques, which is to reduce the number of mutants (Offutt et al. 1996a; Wong and Mathur 1995b; Ferrari et al. 2018; Papadakis et al. 2018a). Mutant reduction strategies focus on selecting a small set of mutants that is representative of the larger set. ...
... In the literature, many mutant selection methods have been proposed (Papadakis et al. 2018a; Ferrari et al. 2018) by restricting the considered mutants according to their types, i.e., applying one or more mutant operators. Empirical studies (Kurtz et al. 2016; Deng et al. 2013) have shown that the most successful strategies are statement deletion (Deng et al. 2013) and the E-Selective mutant set (Offutt et al. 1996a, 1993). We therefore compare our approach with these methods. ...
... E-Selective refers to the five-operator mutant set introduced by Offutt et al. (Offutt et al. 1996a, 1993). This set is the most popular operator set (Papadakis et al. 2018a) and is included in most modern mutation testing tools. ...
Preprint
Mutant selection refers to the problem of choosing, among a large number of mutants, the (few) ones that should be used by the testers. In view of this, we investigate the problem of selecting the fault revealing mutants, i.e., the mutants that are most likely to be killable and lead to test cases that uncover unknown program faults. We formulate two variants of this problem: the fault revealing mutant selection and the fault revealing mutant prioritization. We argue and show that these problems can be tackled through a set of 'static' program features and propose a machine learning approach, named FaRM, that learns to select and rank killable and fault revealing mutants. Experimental results involving 1,692 real faults show the practical benefits of our approach in both examined problems. Our results show that FaRM achieves a good trade-off between application cost and effectiveness (measured in terms of faults revealed). We also show that FaRM outperforms all the existing mutant selection methods, i.e., the random mutant sampling, the selective mutation and defect prediction (mutating the code areas pointed by defect prediction). In particular, our results show that with respect to mutant selection, our approach reveals 23% to 34% more faults than any of the baseline methods, while, with respect to mutant prioritization, it achieves higher average percentage of revealed faults with a median difference between 4% and 9% (from the random mutant orderings).
... In its current version, LittleDarwin supports mutation testing of Java programs with a total of 9 mutation operators. These mutation operators are an adaptation of the minimal set introduced by Offutt et al. [21]. The description of each mutation operator, along with an example, can be found in Table 1. ...
... Wong et al. [28] examined a selective set of mutation operators (2 out of 22) and concluded that the results are similar to those obtained with all mutation operators. Offutt et al. [20,21] demonstrated through empirical experiments that five mutation operators are sufficient to emulate the full set of mutation operators. Barbosa et al. use random mutant selection as a control technique to determine the sufficient set of mutation operators for C [6]. ...
Preprint
Mutation testing is a standard technique to evaluate the quality of a test suite. Due to its computationally intensive nature, many approaches have been proposed to make this technique feasible in real case scenarios. Among these approaches, uniform random mutant selection has been demonstrated to be simple and promising. However, works on this area analyze mutant samples at project level mainly on projects with adequate test suites. In this paper, we fill this lack of empirical validation by analyzing random mutant selection at class level on projects with non-adequate test suites. First, we show that uniform random mutant selection underachieves the expected results. Then, we propose a new approach named weighted random mutant selection which generates more representative mutant samples. Finally, we show that representative mutant samples are larger for projects with high test adequacy.
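The weighted variant of random selection can be sketched as follows; the inverse-class-size weighting used here is an illustrative assumption, chosen so that mutant-heavy classes do not dominate the sample:

```python
# Sketch of weighted random mutant selection: rather than sampling mutants
# uniformly, weight each mutant inversely to its class's mutant count so the
# sample is more representative across classes. Class names and sizes are
# invented for illustration. Note: random.choices samples WITH replacement,
# which is acceptable for a sketch.
import random

def weighted_sample(mutants_by_class, k, seed=0):
    """Sample k (class, mutant) pairs, weighting by 1 / class size."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    pool, weights = [], []
    for cls, mutants in mutants_by_class.items():
        for m in mutants:
            pool.append((cls, m))
            weights.append(1.0 / len(mutants))  # small classes count more
    return rng.choices(pool, weights=weights, k=k)

mutants_by_class = {
    "Parser": [f"P{i}" for i in range(100)],  # mutant-heavy class
    "Logger": [f"L{i}" for i in range(5)],    # small class
}
print(weighted_sample(mutants_by_class, k=10))
```

Under uniform sampling, a 10-mutant sample would almost never include a Logger mutant; the weighting gives both classes equal aggregate probability.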
... In addition to these targeted mutations, we draw upon established mutation techniques, such as those proposed by Offutt [33], to introduce a wider variety of mutation strategies. These strategies include random value generation, boundary testing, and invalid input creation, all of which contribute to a more comprehensive testing process. ...
Article
Full-text available
Internet of Things (IoT) devices offer convenience through web interfaces, web VPNs, and other web-based services, all relying on the HTTP protocol. However, these externally exposed HTTP services present significant security risks. Although fuzzing has shown some effectiveness in identifying vulnerabilities in IoT HTTP services, most state-of-the-art tools still rely on random mutation strategies, leading to difficulties in accurately understanding the HTTP protocol's structure and generating many invalid test cases. Furthermore, these fuzzers rely on a limited set of initial seeds for testing. While this approach initiates testing, the limited number and diversity of seeds hinder comprehensive coverage of complex scenarios in IoT HTTP services. In this paper, we investigate and find that large language models (LLMs) excel in parsing HTTP protocol data and analyzing code logic. Based on these findings, we propose a novel LLM-guided IoT HTTP fuzzing method, ChatHTTPFuzz, which automatically parses protocol fields and analyzes service code logic to generate protocol-compliant test cases. Specifically, we use LLMs to label fields in HTTP protocol data, creating seed templates. Second, the LLM analyzes service code to guide the generation of additional packets aligned with the code logic, enriching the seed templates and their field values. Finally, we design an enhanced Thompson sampling algorithm based on the exploration balance factor and mutation potential factor to schedule seed templates. We evaluate ChatHTTPFuzz on 16 different real-world IoT devices. It finds more vulnerabilities than SNIPUZZ, BOOFUZZ, and MUTINY. ChatHTTPFuzz has discovered 116 vulnerabilities, of which 70 are unique, and 23 have been assigned CVEs.
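The abstract describes its scheduler only at a high level, so the following is a minimal Thompson-sampling sketch without the paper's exploration-balance and mutation-potential factors; the seed names and feedback are invented:

```python
# Minimal Thompson-sampling seed scheduler: each seed template keeps a
# Beta(successes + 1, failures + 1) posterior over its chance of producing an
# "interesting" test case; each round, sample from every posterior and fuzz
# the seed with the highest draw. This omits the paper's extra factors.
import random

class SeedScheduler:
    def __init__(self, seed_ids, seed=0):
        self.stats = {s: [1, 1] for s in seed_ids}  # [alpha, beta] priors
        self.rng = random.Random(seed)

    def pick(self):
        """Draw from each seed's Beta posterior; fuzz the best draw."""
        draws = {s: self.rng.betavariate(a, b)
                 for s, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, seed, interesting):
        """Reward seeds whose mutations triggered new behavior."""
        a, b = self.stats[seed]
        self.stats[seed] = [a + 1, b] if interesting else [a, b + 1]

sched = SeedScheduler(["login.json", "upload.json", "status.json"])
for _ in range(50):
    s = sched.pick()
    sched.update(s, interesting=(s == "upload.json"))  # simulated feedback
print(sched.stats)  # the rewarded template accumulates successes
```

Thompson sampling naturally balances exploration and exploitation: unproductive seeds are still drawn occasionally, but effort concentrates on templates that keep paying off.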
... However, the high cost involved in mutation testing hinders its adoption in practice [1]. Various approaches have been proposed to address this issue, mainly aiming at selecting specific mutant types [23]. This typically happens at random [10] or is based on the characteristics of the mutated location using static control flow graphs [29]. ...
Preprint
In this paper we apply mutation testing in an in-time fashion, i.e., across multiple project releases. Thus, we investigate how the mutants of the current version behave in the future versions of the programs. We study the characteristics of what we call latent mutants, i.e., the mutants that are live in one version and killed in later revisions, and explore whether they are predictable with these properties. We examine 131,308 mutants generated by Pitest on 13 open-source projects. Around 11.2% of these mutants are live, and 3.5% of them are latent, manifesting in 104 days on average. Using the mutation operators and change-related features we successfully demonstrate that these latent mutants are identifiable, predicting them with an accuracy of 86% and a balanced accuracy of 67% using a simple random forest classifier.
... There are several techniques that implement this logic (e.g. selective mutation [13,14,16] and mutant sampling [20,23-25]). However, only recently have academics begun to investigate the threats to validity that redundant mutants introduce in software testing experiments [18]. ...
Preprint
Many academic studies in the field of software testing rely on mutation testing to use as their comparison criteria. However, recent studies have shown that redundant mutants have a significant effect on the accuracy of their results. One solution to this problem is to use mutant subsumption to detect redundant mutants. Therefore, in order to facilitate research in this field, a mutation testing tool that is capable of detecting redundant mutants is needed. In this paper, we describe how we improved our tool, LittleDarwin, to fulfill this requirement.
... We consider the five mutation operators presented by King and Offutt [16]. As shown by Offutt et al., these five operators are sufficient to effectively implement mutation testing [23]. These operators are listed in Table 2: the leftmost column is the three-letter acronym used by King and Offutt, the central column is the full name, and the rightmost column lists the set of operators implied in the mutation. ...
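One of those five operators, ROR (relational operator replacement), can be sketched concretely; the string-based mutation below is a simplification (real tools mutate the AST or bytecode):

```python
# Sketch of ROR (relational operator replacement), one of the five sufficient
# operators: each relational operator in a predicate is replaced, in turn, by
# every other relational operator, yielding one mutant per replacement.
# String replacement is used here only for illustration.
RELATIONAL_OPS = ["<", "<=", ">", ">=", "==", "!="]

def ror_mutants(expr, op):
    """Generate the ROR mutants of a single-operator boolean expression."""
    assert op in expr, "expression must contain the operator being mutated"
    return [expr.replace(op, alt) for alt in RELATIONAL_OPS if alt != op]

mutants = ror_mutants("a < b", "<")
print(mutants)  # -> ['a <= b', 'a > b', 'a >= b', 'a == b', 'a != b']
```

A test input kills one of these mutants whenever it makes the mutated predicate evaluate differently from the original, e.g. a == b distinguishes `a < b` from `a <= b`.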
Preprint
In software engineering, impact analysis involves predicting the software elements (e.g., modules, classes, methods) potentially impacted by a change in the source code. Impact analysis is required to optimize the testing effort. In this paper, we propose an evaluation technique to predict impact propagation. Based on 10 open-source Java projects and 5 classical mutation operators, we create 17,000 mutants and study how the error they introduce propagates. This evaluation technique enables us to analyze impact prediction based on four types of call graph. Our results show that graph sophistication increases the completeness of impact prediction. However, and surprisingly to us, the most basic call graph gives the best trade-off between precision and recall for impact prediction.
... Additionally, it combines field values extracted from backend code using LLMs to perform further mutations (detailed in Section 4.4.2). We also refer to common mutation methods proposed by Offutt (Offutt et al., 1996). The marked fields undergo additional mutations using the Radamsa tool to introduce greater randomness and unpredictability, further diversifying the generated seed data. ...
Preprint
Internet of Things (IoT) devices offer convenience through web interfaces, web VPNs, and other web-based services, all relying on the HTTP protocol. However, these externally exposed HTTP services present significant security risks. Although fuzzing has shown some effectiveness in identifying vulnerabilities in IoT HTTP services, most state-of-the-art tools still rely on random mutation strategies, leading to difficulties in accurately understanding the HTTP protocol's structure and generating many invalid test cases. Furthermore, these fuzzers rely on a limited set of initial seeds for testing. While this approach initiates testing, the limited number and diversity of seeds hinder comprehensive coverage of complex scenarios in IoT HTTP services. In this paper, we investigate and find that large language models (LLMs) excel in parsing HTTP protocol data and analyzing code logic. Based on these findings, we propose a novel LLM-guided IoT HTTP fuzzing method, ChatHTTPFuzz, which automatically parses protocol fields and analyzes service code logic to generate protocol-compliant test cases. Specifically, we use LLMs to label fields in HTTP protocol data, creating seed templates. Second, the LLM analyzes service code to guide the generation of additional packets aligned with the code logic, enriching the seed templates and their field values. Finally, we design an enhanced Thompson sampling algorithm based on the exploration balance factor and mutation potential factor to schedule seed templates. We evaluate ChatHTTPFuzz on 14 different real-world IoT devices. It finds more vulnerabilities than SNIPUZZ, BOOFUZZ, and MUTINY. ChatHTTPFuzz has discovered 103 vulnerabilities, of which 68 are unique, and 23 have been assigned CVEs.
Conference Paper
Mutation testing is a widely accepted method for assessing the effectiveness of software test suites. It focuses on evaluating how well a test suite can identify deliberately introduced faults, known as mutations, in the code, helping to reveal potential vulnerabilities. Traditional mutation testing approaches, however, often encounter significant issues such as high computational demands and limited fault detection range. Recently, there has been increasing interest in integrating artificial intelligence (AI) into mutation testing to address these challenges. AI can enhance mutation testing by incorporating intelligent algorithms and automating various tasks. It can analyze the codebase, identify key program components, and strategically select mutation operators that are more likely to detect faults. This paper explores the fundamental concepts of mutation testing, relevant research, and innovations in combining AI with mutation testing, focusing on the challenges and the most effective models currently available. By integrating AI, mutation testing has seen improvements in both fault detection and computational efficiency.
Conference Paper
Full-text available
In testing for program correctness, the standard approaches [11,13,21,22,23,24,34] have centered on finding data D, a finite subset of all possible inputs to program P, such that: (1) if for all x in D, P(x) = f(x), then P* = f, where f is a partial recursive function that specifies the intended behavior of the program and P* is the function actually computed by program P. A major stumbling block in such formalizations has been that the conclusion of (1) is so strong that, except for trivial classes of programs, (1) is bound to be formally undecidable [23]. There is an undeniable tendency among practitioners to consider program testing an ad hoc human technique: one creates test data that intuitively seems to capture some aspect of the program, observes the program in execution on it, and then draws conclusions on the program's correctness based on the observations. To augment this undisciplined strategy, techniques have been proposed that yield quantitative information on the degree to which a program has been tested. (See Goodenough [14] for a recent survey.) Thus the tester is given an inductive basis for confidence that (1) holds for the particular application. Paralleling the undecidability of deductive testing methods, the inductive methods all have had trivial examples of failure [14,18,22,23]. These deductive and inductive approaches have had a common theme: all have aimed at the strong conclusion of (1). Program mutation [1,7,9,27], on the other hand, is a testing technique that aims at drawing a weaker, yet quite realistic, conclusion of the following nature: (2) if for all x in D, P(x) = f(x), then P* = f OR P is "pathological." To paraphrase: (3) if P is not pathological and P(x) = f(x) for all x in D, then P* = f. Below we will make precise what is meant by "P is pathological"; for now it suffices to say that P not pathological means that P was written by a competent programmer who had a good understanding of the task to be performed.
Therefore, if P does not realize f, it is "close" to doing so. This underlying hypothesis of program mutation has become known as the competent programmer hypothesis: either P* = f or some program Q "close" to P has the property Q* = f. To be more specific, program mutation is a testing method that proposes the following version of correctness testing: given that P was written by a competent programmer, find test data D for which P(D) = f(D) implies P* = f. Our method of developing D, assuming either P or some program close to P is correct, is by eliminating the alternatives. Let φ be the set of programs close to P. We restate the method as follows: find test data D such that (i) for all x in D, P(x) = f(x), and (ii) for all Q in φ, either Q* = P* or for some x in D, Q(x) ≠ P(x). If test data D can be developed having properties (i) and (ii), then we say that D differentiates P from φ; alternatively, P passes the φ mutant test. The goal of this paper is to study, from both theoretical and experimental viewpoints, two basic questions. Question 1: If P is written by a competent programmer and P passes the φ mutant test with test data D, does P* = f? Note that, after formally defining φ for P in a fixed programming language L, an affirmative answer to Question 1 reduces to showing that the competent programmer hypothesis holds for this L and φ. We have observed that under many natural definitions of φ there is often a strong coupling between members of φ and a small subset µ. That is, one can often reduce the problem of finding test data that differentiates P from φ to that of finding test data that differentiates P from µ. We call this subset µ the mutants of P, and the second question we study involves the so-called coupling effect [9]. Question 2 (Coupling Effect): If P passes the µ mutant test with data D, does P pass the φ mutant test with data D?
Intuitively, one can think of µ as representing the programs that are "very close" to P. In the next section we present two types of theoretical results concerning the two questions above: general results expressed in terms of properties of the language class L, and specific results for a class of decision table programs and for a subset of LISP. Portions of the work on decision tables and LISP have appeared elsewhere [5,6], but the presentations given here are both simpler and more unified. In the final section we present a system for applying program mutation to FORTRAN and introduce a new type of software experiment, called a "beat the system" experiment, for evaluating how well our system approximates an affirmative response to the program mutation questions.
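The mutant-test condition above translates almost directly into code. The programs and domain below are invented for illustration, and program equivalence (undecidable in general) is approximated by agreement on a finite reference domain:

```python
# Direct transcription of the mutant-test condition: D differentiates P from
# a set of alternates phi iff every Q in phi either agrees with P everywhere
# (approximated here as agreement on a finite reference domain) or disagrees
# with P on some x in D.
def passes_mutant_test(P, phi, D, domain):
    for Q in phi:
        equivalent = all(Q(x) == P(x) for x in domain)
        killed = any(Q(x) != P(x) for x in D)
        if not (equivalent or killed):
            return False  # Q is a live, non-equivalent mutant
    return True

P = lambda x: max(x, 0)      # "original" program
phi = [lambda x: abs(x),     # mutant: differs from P for x < 0
       lambda x: max(x, 0)]  # equivalent mutant
domain = range(-5, 6)

print(passes_mutant_test(P, phi, D=[-3], domain=domain))  # -> True
print(passes_mutant_test(P, phi, D=[2], domain=domain))   # -> False
```

With D = [-3] the abs mutant is killed (P(-3) = 0 but abs(-3) = 3) and the equivalent mutant is excused, so the test passes; with D = [2] the abs mutant survives and the test fails.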
Article
Full-text available
A novel technique for automatically generating test data is presented. The technique is based on mutation analysis and creates test data that approximate relative adequacy. It is a fault-based technique that uses algebraic constraints to describe test cases designed to find particular types of faults. A set of tools (collectively called Godzilla) that automatically generates constraints and solves them to create test cases for unit and module testing has been implemented. Godzilla has been integrated with the Mothra testing system and has been used as an effective way to generate test data that kill program mutants. The authors present an initial list of constraints and discuss some of the problems that have been solved to develop the complete implementation of the technique.
Article
Full-text available
A new type of software test, called mutation analysis, is introduced. A method of applying mutation analysis is described, and the design of several existing automated systems for applying mutation analysis to Fortran and Cobol programs is sketched. These systems have been the means for preliminary studies of the efficiency of mutation analysis and of the relationship between mutation and other systematic testing techniques. The results of several experiments to determine the effectiveness of mutation analysis are described, and examples are presented to illustrate the way in which the technique can be used to detect a wide class of errors, including many previously defined and studied in the literature. Finally, a number of empirical studies are suggested, the results of which may add confidence to the outcome of the mutation analysis of a program.
Article
Full-text available
Constraint-based testing is a novel way of generating test data to detect specific types of common programming faults. The conditions under which faults will be detected are encoded as mathematical systems of constraints in terms of program symbols. A set of tools, collectively called Godzilla, has been implemented that automatically generates constraint systems and solves them to create test cases for use by the Mothra testing system. Experimental results from using Godzilla show that the technique can produce test data that is very close in terms of mutation adequacy to test data that is produced manually, and at substantially reduced cost. Additionally, these experiments have suggested a new procedure for unit testing, where test cases are viewed as throw-away items rather than scarce resources.
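The constraint-based idea can be illustrated with a brute-force stand-in for Godzilla's constraint solver; the predicates and domain are invented, and the real system solves algebraic constraint systems rather than enumerating inputs:

```python
# Sketch of constraint-based test generation: to kill a mutant, find an input
# satisfying (a) the reachability constraint of the mutated statement and
# (b) the necessity constraint that the original and mutated expressions
# differ. Both are searched by brute force here, purely for illustration.
def generate_killing_input(reach, orig_expr, mut_expr, domain):
    for x in domain:
        if reach(x) and orig_expr(x) != mut_expr(x):
            return x
    return None  # possibly an equivalent mutant on this domain

# Original predicate: x > 0; ROR mutant: x >= 0. They differ only at x == 0,
# so the necessity constraint forces the generator to produce x = 0.
x = generate_killing_input(
    reach=lambda x: True,  # assume the statement is always reached
    orig_expr=lambda x: x > 0,
    mut_expr=lambda x: x >= 0,
    domain=range(-10, 11),
)
print(x)  # -> 0
```

Encoding "kill this mutant" as constraints is what lets the generator target boundary cases (here x = 0) that uniform random input generation would rarely hit.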
Article
An abstract is not available.
Conference Paper
Program faults are artifacts that are widely studied, but there are many aspects of faults that we still do not understand. In addition to the simple fact that one important goal during testing is to cause failures and thereby detect faults, a full understanding of the characteristics of faults is crucial to several research areas in testing. These include fault-based testing, testability, mutation testing, and the comparative evaluation of testing strategies. In this workshop paper, we explore the fundamental nature of faults by looking at the differences between a syntactic and semantic characterization of faults. We offer definitions of these characteristics and explore the differentiation. Specifically, we discuss the concept of "size" of program faults --- the measurement of size provides interesting and useful distinctions between the syntactic and semantic characterization of faults. We use the fault size observations to make several predictions about testing and present preliminary data that supports this model. We also use the model to offer explanations about several questions that have intrigued testing researchers.
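The semantic notion of fault size can be made concrete: over a finite input domain it is the fraction of inputs on which the faulty program disagrees with the original. The programs below are invented examples, not the paper's:

```python
# Semantic size of a fault: the fraction of the input domain on which the
# mutant's behavior differs from the original's. Two mutants with the same
# tiny syntactic change can have very different semantic sizes.
def semantic_size(P, Q, domain):
    differing = sum(1 for x in domain if P(x) != Q(x))
    return differing / len(domain)

domain = list(range(-50, 51))            # 101 inputs
P  = lambda x: x * x                     # original program
Q1 = lambda x: x * abs(x)                # differs on every x < 0 (big fault)
Q2 = lambda x: 0 if x == 7 else x * x    # differs only at x = 7 (tiny fault)

print(round(semantic_size(P, Q1, domain), 3))  # -> 0.495
print(round(semantic_size(P, Q2, domain), 3))  # -> 0.01
```

Faults of small semantic size (like Q2) are the hard ones: few test inputs expose them, which is exactly why semantic size predicts testing difficulty better than syntactic size.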
Article
The Godzilla automatic test data generator is an integrated collection of tools that implements a relatively new test data generation method—constraint-based testing—that is based on mutation analysis. Constraint-based testing integrates mutation analysis with several other testing techniques, including statement coverage, branch coverage, domain perturbation, and symbolic evaluation. Because Godzilla uses a rule-based approach to generate test data, it is easily extendible to allow new testing techniques to be integrated into the current system. This article describes the system that has been built to implement constraint-based testing. Godzilla's design emphasizes orthogonality and modularity, allowing relatively easy extensions. Godzilla's internal structure and algorithms are described with emphasis on internal structures of the system and the engineering problems that were solved during the implementation.
Article
Fault-based testing attempts to show that particular faults cannot exist in software by using test sets that differentiate between the original program (hypothesized to be correct) and faulty alternate programs. The success of this approach depends on a number of assumptions, notably that programmers are competent insofar as they only commit relatively trivial faults, and that faults only couple infrequently. Fault coupling occurs when test sets are able to differentiate between the original program and faulty alternate programs when faults occur in isolation, but not when they occur in combination; it is a complicating factor in fault-based testing. Fault coupling is studied here within the context of finite bijective functions. A complete mathematical solution of the problem is possible in this simplified case; the results indicate that fault coupling does indeed occur infrequently, and are thus in agreement with the empirical results obtained by others in the field. One surprising result is that certain kinds of test set are able to avoid fault coupling altogether.