Is Mutation Testing Ready to be Adopted
Industry-Wide?
Jakub Možucha and Bruno Rossi
Faculty of Informatics
Masaryk University, Brno, Czech Republic
{jmozucha,brossi}@mail.muni.cz
Abstract. Mutation Testing has a long research history as a way to
improve the quality of software tests. However, it has not yet reached
wide consensus for industry-wide adoption, mainly due to unclear
benefits and the computational complexity of applying it to large
systems. In this paper, we investigate the current state of mutation
testing support for Java Virtual Machine (JVM) environments. By running
an experimental evaluation, we found that while default configurations
are impractical for larger projects, strategies such as selective
operators, second order mutation and multi-threading can increase the
applicability of the approach. However, there is a trade-off in terms of
the quality of the results of the mutation analysis process that needs
to be taken into account.
Keywords: Software Mutation Testing, Experimentation, Equivalent
Mutants, Selective Mutation Operators, Cost-Reduction Strategies
1 Introduction
A large amount of resources is wasted every year due to bugs introduced in
software systems, making testing one of the critical phases of software
development [2]. Recent research put the cost of software debugging at up
to $312 billion per year, with developers spending 50% of their allocated
time finding and fixing software bugs [1]. Software Engineering has long
been striving to find ways to reduce such inefficiencies, with the constant
challenge of building more robust software. Mutation Testing is one such
way, representing a powerful technique to evaluate and improve the quality
of software tests written by developers [14, 7].
The main idea behind Mutation Testing is to create many modified copies
of the original program, called mutants, each with a single variation
from the original program. All mutants are then run against the test
suites to obtain the percentage of mutants that cause the tests to fail.
It has been shown that mutation testing can bring several benefits that
complement the applied testing practices, e.g. for test case
prioritization [6].
However, mutation testing has often been reported to struggle to be
introduced into real-world industrial contexts [11, 15, 8]. So why is
mutation testing not widely adopted within industry? According to
Madeyski et al. [9], mainly due to a) performance reasons, b) the
equivalent mutants problem (mutants that differ syntactically from the
original program but are semantically equal to it) and c) missing
integration tools. In our view, the biggest drawback of mutation testing,
its great computational cost, has until recently prevented its inclusion
in the development cycle of most companies. This resulted in the
development of many techniques to reduce the costs of mutation testing.
Furthermore, another perceived drawback might be that the advantages of
running mutation testing are not fully clear compared to other, simpler
testing approaches.
Problem. The applicability of Mutation Testing to real-world projects is
far from reaching consensus [4, 15, 8, 9, 11]. While it seems that
improvements have been made in tool integration, performance and
equivalent mutants concerns still remain the most relevant issues, and
call for further analyses.
Contribution. We report on an experiment aimed at understanding the
current performance of Mutation Testing in Java Virtual Machine (JVM)
environments, based on our previous experience with empirical studies [16]
and the need for more industry-academia cooperation [3]. In collaboration
with an industrial partner, we look in particular at different strategies
that can reduce the runtime overhead of Mutation Testing. Among the
results, we provide indications about the efficiency of selective
operators for Mutation Testing and their impact on performance. By
evaluating several cost reduction strategies on a typical set of projects,
we give practitioners more insight into the performance / quality
trade-offs of running mutation testing. Such information can be relevant
for the integration in their own software development process.
Furthermore, we make the experimental package available for replications.
The paper is structured as follows. Section 2 presents the background
on mutation testing. In Section 3, we report on the experimental
evaluation, describing the experimental design, choices made, results, and
threats to validity. Section 4 reviews related works that evaluated
mutation testing in an experimental setting. Section 5 provides the
discussion and Section 6 the conclusions.
2 Mutation Testing Background
Mutation Testing has undergone several decades of extensive research. First
formulated and published by DeMillo et al. in a seminal 1971 paper [5],
Mutation Testing was introduced as a technique that can help programmers
improve test quality significantly. The core of Mutation Analysis is
creating and killing mutants. Each mutant is a copy of the original source
code modified (mutated) with a single change (mutation). These mutations
are based on a set of predefined syntactic rules called mutation operators.
Traditional mutation operators consist of statement deletions (e.g.
removing a break in a loop), statement modifications (e.g. replacing a
while with do-while), Boolean expression modifications (e.g. switching a
!= comparison operator to ==), or variable/constant replacements. These
traditional operators are mostly language independent. There are also
language-dependent mutation operators that are used to mutate
language-specific constructs, taking into account aspects such as
encapsulation, inheritance and polymorphism.
Tests are then executed on the mutants, and the tests are expected to fail.
When the tests fail, the mutant is considered killed and no further tests
need to be run against it. For example, the original Java code in
Algorithm 1 is mutated using a mutation operator that replaces == with !=,
producing the mutant in Algorithm 2.
Algorithm 1: Original Code
if (a == b) then
    // do something
else
    // do something
Algorithm 2: Mutated Code
if (a != b) then
    // do something
else
    // do something
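As a minimal sketch of how a test kills this mutant, the two algorithms can be written as small Python functions. Python is used here only for illustration; the function names, return values and test cases are hypothetical and not part of the original study. A mutant counts as killed as soon as a test that passes on the original program fails on the mutant.

```python
def original(a, b):
    # Original program: takes the first branch when the operands are equal.
    if a == b:
        return "equal"
    return "different"


def mutant(a, b):
    # Mutant: the == operator has been replaced with != (single change).
    if a != b:
        return "equal"
    return "different"


def run_tests(program):
    """Return True when every (hypothetical) test case passes."""
    test_cases = [((1, 1), "equal"), ((1, 2), "different")]
    return all(program(*args) == expected for args, expected in test_cases)


# The suite must pass on the original program before mutants are analysed.
tests_pass_on_original = run_tests(original)
# The mutant is killed when the same suite fails on it.
mutant_killed = not run_tests(mutant)
```

A test suite that never compared the two return values would leave this mutant live, which is the signal that the suite is not sensitive to the changed condition.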
If a mutant does not cause the tests to fail, it is considered live. This
can have two meanings: either the tests are not sensitive enough to catch
the modified code, or the mutant is equivalent. An equivalent mutant is
syntactically different from the original, but its semantics are the same
and therefore it is impossible for tests to detect it. The final indicator
of test quality is the mutation score, that is, the percentage of
non-equivalent mutants killed by the test data or, in other terms, the
number of killed mutants over the number of non-equivalent mutants
generated.
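The definition of mutation score can be sketched as a small function. The counts below are hypothetical and only illustrate how the denominator changes when equivalent mutants cannot be identified and therefore stay in the computation.

```python
def mutation_score(killed, generated, equivalent=0):
    """Percentage of non-equivalent mutants killed by the tests."""
    non_equivalent = generated - equivalent
    if non_equivalent <= 0:
        raise ValueError("no non-equivalent mutants to score")
    return 100.0 * killed / non_equivalent


# Hypothetical counts, not taken from the experiments.
# Score when the 10 equivalent mutants are known and excluded.
score_known = mutation_score(killed=80, generated=100, equivalent=10)
# Score when equivalent mutants are not identified (treated as live).
score_unidentified = mutation_score(killed=80, generated=100)
```

With these illustrative numbers the score is lower when equivalent mutants remain in the denominator, even though the tests are the same.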
Test data is considered mutation-adequate [12] if its mutation score is
100%. The higher the mutation score, the more faults were discovered and
therefore the better the test process. This process leads to iterative
improvement of testing; moreover, inspecting live mutants can lead to the
discovery and resolution of other source code issues. The most serious
problem with equivalent mutants is the distortion of the mutation score:
software tools include them in the computation, as their accurate
detection is undecidable and can only be performed by manual inspection
[9].
Mutation testing is a powerful technique, but it has great computational
costs. In fact, these costs have prevented mutation testing from being
used in a practical way for many years, despite the relatively long
history of mutation testing research. In general, these are the most
expensive phases:
1. Mutant Generation – aside from great computational costs, memory
consumption is also considerably high in this phase. Mutation operators
have to be applied to the original code and mutants have to be stored;
2. Mutant Compilation – the phase in which the generated mutants have to
be compiled. This phase can be very costly for larger programs;
3. Execution of tests – for every mutant, the tests have to be executed
until the mutant is killed. Live mutants are the most costly, because
every test has to be run on them.
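The cost of the test execution phase can be sketched as follows. This is an illustrative model, not the scheduling used by any particular tool: each mutant is described by the (hypothetical) set of tests that detect it, and execution stops at the first failing test.

```python
def count_test_runs(mutants, tests):
    """Run tests against each mutant, stopping at the first failing test.

    `mutants` maps a mutant name to the set of test names that detect it
    (hypothetical data); `tests` is the ordered list of test names.
    Returns (list of killed mutant names, total number of test executions).
    """
    killed, executions = [], 0
    for name, detecting_tests in mutants.items():
        for test in tests:
            executions += 1
            if test in detecting_tests:
                killed.append(name)  # mutant killed, skip remaining tests
                break
    return killed, executions


tests = ["t1", "t2", "t3", "t4"]
mutants = {"m1": {"t1"}, "m2": {"t3"}, "m3": set()}  # m3 stays live
killed, executions = count_test_runs(mutants, tests)
```

In this toy data the live mutant m3 alone accounts for four of the eight test executions, which is why live (and equivalent) mutants dominate the execution cost.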
Several approaches have been proposed to reduce the equivalent mutant
problem ([9] provides an extensive review of the area). At the same time,
the problem is very often linked to performance optimization, as
generating fewer equivalent mutants reduces the cost of the three phases
of mutant generation, mutant compilation, and test execution. For this
reason, various cost reduction strategies were developed.
In this paper we look into several of these strategies and their
potential to improve performance for industrial applicability. A first
strategy is the Selective Mutation technique: the idea is to use only the
mutation operators that produce statistically fewer equivalent mutants
than others. This approach makes it possible not only to reduce equivalent
mutants, but also to improve performance. The aim of Selective Mutation is
to achieve maximum coverage by generating the smallest number of mutants.
The complexity of mutant generation is reduced from quadratic
(O(Refs * Vars)) to linear (O(Refs)) [13], while retaining much of the
effectiveness of non-selective mutation.
Another strategy we adopt in the current paper is Higher Order Mutation
(HOM). Considering the mutants discussed so far as First Order Mutants
(FOMs), the technique creates mutants with more than a single mutation,
referred to as higher order mutants, as a combination of several FOMs [6].
We look in particular at four different algorithms (Last2First,
DifferentOperators, ClosePair and RandomMix) implemented in Judy [10, 9]
to combine FOMs into Second Order Mutants (SOMs), in which two mutants are
combined:
Last2First – the first mutant in the list of FOMs is combined with the last
mutant in the list, the second mutant with the next to last, and so on;
DifferentOperators – only FOMs generated by different mutation operators
are combined;
ClosePair – two neighboring FOMs from the list are combined;
RandomMix – any two mutants from the FOMs are combined.
All these strategies use a list of first order mutants (FOMs) and should
generate at least 50% fewer mutants, with an impact on the final mutation
score [9].
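The four pairing strategies can be sketched as list operations. This is a minimal illustration based only on the descriptions above, not Judy's actual implementation: a FOM is modelled as a hypothetical (operator, id) tuple, the operator labels are placeholders, each FOM is assumed to be used at most once, and the greedy matching in `different_operators` is an assumption.

```python
import random


def last2first(foms):
    # Pair the first FOM with the last, the second with the next to last, ...
    n = len(foms)
    return [(foms[i], foms[n - 1 - i]) for i in range(n // 2)]


def close_pair(foms):
    # Pair neighboring FOMs: (0, 1), (2, 3), ...
    return [(foms[i], foms[i + 1]) for i in range(0, len(foms) - 1, 2)]


def random_mix(foms, seed=0):
    # Pair FOMs after a seeded shuffle (seed chosen only for repeatability).
    shuffled = list(foms)
    random.Random(seed).shuffle(shuffled)
    return close_pair(shuffled)


def different_operators(foms):
    # Greedily pair each FOM with the next unused FOM from another operator;
    # FOMs without such a partner are left out (an assumption of this sketch).
    remaining = list(foms)
    pairs = []
    while remaining:
        first = remaining.pop(0)
        for j, other in enumerate(remaining):
            if other[0] != first[0]:
                pairs.append((first, remaining.pop(j)))
                break
    return pairs


# Hypothetical FOM list: (operator label, mutant id).
foms = [("AOR", 1), ("AOR", 2), ("ROR", 3), ("ROR", 4), ("LCR", 5), ("LCR", 6)]
```

Because each SOM consumes two FOMs in this sketch, every strategy yields at most half as many mutants as the FOM list, which matches the expected reduction of at least 50%.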
3 Experimental Evaluation
We designed an exploratory experiment aimed at getting insights into the
current applicability of mutation testing in an industrial context (a
summary of the overall process is given in Fig. 1). We ran in parallel a
literature review (1) and an exploratory analysis of the usage of tools
for Mutation Testing (5). The selection of the tools for the
experimentation (2), as well as of the experimental units (3), was based on
the criteria of the company. We designed the research questions (6) and
created the experimental design (7) based on the results of the
exploratory analysis, taking into account both the company’s needs and the
theoretical constraints and aspects worth investigating. Experiments were
then run (8) and the results provided to the industrial partner for
knowledge transfer and identification of future work. Based on an
exploratory pre-experiment phase, we set the following research questions:
Fig. 1. Research Process Workflow
RQ1.: What is the performance of Mutation Testing by taking into account
standard configurations (i.e. no selective operators and no mutation strat-
egy)?
RQ2.: What is the impact of Selective Operators on Mutation Testing effi-
ciency & performance?
RQ3.: What is the impact of Second Order Mutation strategies on Mutation
Testing efficiency & performance?
Given the selection of the tools for mutation testing supported by the
industrial partner, we looked specifically at three areas of experiments
according to the three research questions:
EXP1. Mutation Operators Efficiency Experiments. Looking at the selection
of the most efficient mutation operators, which can then be evaluated in
the performance experiments;
EXP2. Performance Experiments & Concurrency Experiments. Looking at the
single-thread and multi-threading performance of the tools with standard
configurations and according to different selective operator strategies;
EXP3. SOM Experiments. Evaluating the impact of different Second Order
Mutation strategies (Last2First, DifferentOperators, ClosePair,
RandomMix).
We ran an initial review of the tools available for mutation testing in
JVM environments, omitting experimental tools. Overall, we considered
seven tools (MuJava, PITest, Javalanche, Judy, Jumble, Jester and MAJOR),
which we compared according to several characteristics (Table 1). To
speed up the mutant generation process, support for byte-code mutation is
now standard: mutations are applied at the byte-code level. The industrial
partner involved in the experimentation considered PITest and Judy more
relevant for a series of reasons, in particular the availability of
plugins and general ease of integration, as well as the open source
license and support for Java 8. Both tools were used to run the
experiments.
MuJava PITest Javalanche Judy Jumble Jester MAJOR
State active active active active active
License Apache 2.0 Apache 2.0 LGPL BSD GPL open ?
Java vers. 7 5 – 8 5, 6 6 – 8 6 –8 6 7
Unit test.
framework
jUnit4 jUnit4,
TestNG6
jUnit4 jUnit4 jUnit4 jUnit3 jUnit4
Production
tools
GUI,
Eclipse
3rd part.
plugins
Eclipse plugin Cmd Cmd Cmd Ant
Automated
class/test
selection
Only
classes
Yes Yes Yes Only tests Yes Yes
Mutation Op-
erators
method
class
method method con-
curr.
meth.
class
meth. meth. method
self-def.
Byte-code mu-
tation
Yes Yes Yes Yes Yes Yes
Table 1. JVM Mutation Testing Tools
For experimentation, we used as experimental projects libraries suggested
by the industrial partner (Table 2): various Apache Commons projects and
the JodaTime library. The projects were chosen to cover the widest
possible range of sizes, test durations and test coverage. However, we
could not include larger projects due to the high complexity of running
mutation tests. From an initial list, we had to discard other projects
that had either no tests (Apache Commons Daemon) or in which test cases
were failing with either PITest or Judy (Apache Commons Compression,
BeanUtils).
We included in Table 2 Mutation Size, a metric that can give an indication
of the complexity of the mutation process. Previous research has shown
that the number of mutants is proportional to the number of variable
references times the number of data objects (O(Refs * Vars)) [12].
However, it is not easy to determine the number of variable references for
such large projects. Taking into account that modern mutation testing
tools create mutants and run tests based on code coverage, Mutation Size
(MS) is computed as coverage times the size of the project, as a measure
of the complexity of the mutation testing process: MS = coverage * KLOC.
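The Mutation Size formula can be checked against the KLOC and code coverage values reported in Table 2. The sketch below uses three of the listed projects; the function and variable names are illustrative.

```python
def mutation_size(kloc, coverage_percent):
    """Mutation Size: code coverage (as a fraction) times size in KLOC."""
    return kloc * coverage_percent / 100.0


# (KLOC, code coverage in %) as listed in Table 2.
projects = {
    "Apache Commons Chain": (9.852, 66.68),
    "Apache Commons CLI": (6.161, 96.38),
    "Apache Commons Codec": (15.869, 95.01),
}

sizes = {
    name: round(mutation_size(kloc, coverage), 2)
    for name, (kloc, coverage) in projects.items()
}
```

Rounded to two decimals, the results agree with the MS column of Table 2 for these projects (6.57, 5.94 and 15.08).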
3.1 Experimental Procedure
The first set of experiments was done with both tools using their default
settings. The only modification was setting both tools to the same number
of threads, as Judy runs in parallel by default, while PITest uses only
one thread.
After the experiments with default settings, the experiments on mutation
operator efficiency were done for each tool. The aim of these tests was to
reduce the number of active operators by selecting only the most efficient
ones, i.e. those that produce mutants that are not easy to kill.
After the selection of mutation operators, another set of performance tests
was done using only the selected operators. Concurrency tests were done
with varying numbers of threads, comparing time and memory usage (PITest).
Table 2. Projects considered for the experimental evaluation. NCL = Number of
Classes, KLOC = Lines of Code (thousands), NTCL = Number of Test Classes,
TK-LOC = Test Lines of Code (thousands), TT = Test Time, CC = Code Coverage,
MS = Mutation Size
Project                    Ver.   NCL  KLOC    NTCL  TK-LOC  TT     CC      MS
Apache Commons Chain       1.2    55   9.852   37    7.398   1.17   66.68%  6.57
Apache Commons CLI         1.3.1  23   6.161   25    5.214   1.39   96.38%  5.94
Apache Commons Codec       1.10   60   15.869  55    15.042  5.31   95.01%  15.08
Apache Commons CSV         1.2    11   3.515   15    3.821   1.76   94.00%  3.30
Apache Commons DbUtils     1.6    30   7.611   26    4.453   1.36   57.71%  4.39
Apache Commons Digester    3.2    168  23.125  101   14.220  2.31   72.49%  16.76
Apache Commons Lang        1.2    133  68.684  148   55.467  16.51  93.80%  64.43
Apache Commons Validator   3.4    62   16.516  77    14.117  3.42   77.61%  12.82
Joda Time                  1.5.0  166  70.593  158   72.423  5.02   90.18%  63.66
Second Order strategies were then evaluated in terms of time performance and
mutation score (Judy).
All tests were run remotely on 4 x Intel Xeon 3.35 GHz CPUs with 16 GB RAM.
Every test result is an average of at least 10 iterations, with code
coverage computed using the Cobertura tool. The iterations were launched
using simple bash scripts, which also automated the renaming and moving of
output files into specified folders. Every test run was launched under a
modified version of the open-source memusg.sh¹ program, which measures the
average and peak memory usage of all processes invoked by the current
terminal session and sub-sessions². The versions of the two tools used were
PITest 1.1.9 and Judy 2 (release from 13.5.2015).
Initial Performance Evaluation. The initial evaluation was run with default
settings using one thread with default operators (Table 3). By default,
Judy has all 56 mutation operators active, while PITest has 7 out of 16.
Missing values in the table indicate a failure to complete the testing
process.
Looking at the time performance in relation to mutation size (MS),
introduced in the previous section to characterize the projects, we found
a positive correlation (Spearman’s Rank-Order, 0.85, p=0.0037,
two-tailed). The experiments showed that, with one exception, PITest is
always faster at generating mutants than Judy, which generated more
mutants. Similarly, when comparing how many mutants per second were
generated, PITest generated mutants faster than Judy. In our experiments,
Judy was not able to finish the mutation analysis for larger projects (in
particular Lang and Joda Time, which have the highest Mutation Size among
the considered projects). When comparing time per number of mutants, Judy
is generally faster for all tested projects using the default settings.
When considering the metrics of the tested projects, Judy is faster for
smaller projects. However, for bigger projects, or for projects with
higher line coverage or longer test runs, its performance degrades
rapidly.
¹ https://gist.github.com/netj/526585
² The experimental package is available at https://goo.gl/5GPdQv
Comparing average memory consumption, the same pattern applies as for
comparison of tests duration. Judy consumes less memory for very small projects,
but PITest shows better results for medium and bigger projects. Similarly, the
peak of memory consumption is normally lower for Judy, but for big or better
covered projects, the memory usage peak for Judy is a lot higher than for PITest.
Table 3. Run-time performance with default settings and one thread. Values in
parentheses were obtained using selective operators; "-" marks a missing value.
                    Gen. Time (sec)             Total Time (sec)                Peak Memory (MB)
Project             PITest        Judy          PITest          Judy            PITest        Judy
Commons Chain       1.1 (1.4)     3.18 (2.0)    33.5 (30.2)     5.67 (2.6)      956 (1587)    302 (240)
Commons CLI         1.3 (1.4)     12.8 (6.3)    45.8 (42.4)     228.5 (46.9)    1764 (1743)   4505 (3677)
Commons Codec       5.7 (6.2)     - (28.3)      247.6 (278.9)   - (2225.5)      3028 (3061)   - (4055)
Commons CSV         1.9 (1.6)     2.5 (1.4)     48.1 (44.1)     10.5 (4)        1648 (1654)   900 (351)
Commons DbUtils     1.6 (1.0)     6.2 (47.7)    34.2 (12)       45.6 (78.7)     784 (399)     1441 (4653)
Commons Digester    3.9 (3.0)     13.6 (8.2)    258.8 (120.1)   38.7 (20.2)     2678 (2540)   1509 (2071)
Commons Lang        21.1 (20.3)   - (-)         943.6 (907.9)   - (-)           3825 (3600)   - (-)
Commons Validator   3.5 (3.5)     13.9 (5.1)    207.9 (148.2)   135.5 (29.7)    1309 (1365)   1638 (609)
Joda Time           28.3 (28.3)   - (-)         638.8 (546)     - (-)           3857 (4267)   - (-)
Time required for Mutation Testing is positively correlated with Mutation
Size (KLOC * Coverage), which can be used as an initial measure of
complexity. Missing tests or test failures (under the analysis tool) hinder
the possibility of applying MT.
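The rank correlation can be sketched with a hand-written Spearman computation. The text does not state which time series was correlated with Mutation Size; the sketch assumes the PITest total time with default settings from Table 3 and the MS values from Table 2, in the same project order. Under that assumption the coefficient comes out at 0.85; the p-value is not recomputed here.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation for tie-free data: 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Project order: Chain, CLI, Codec, CSV, DbUtils, Digester, Lang, Validator, Joda Time.
mutation_size = [6.57, 5.94, 15.08, 3.30, 4.39, 16.76, 64.43, 12.82, 63.66]
# Assumed series: PITest total time (sec), default settings, from Table 3.
pitest_total_time = [33.5, 45.8, 247.6, 48.1, 34.2, 258.8, 943.6, 207.9, 638.8]

rho = spearman_rho(mutation_size, pitest_total_time)
```

Only two projects (Chain and CSV) swap ranks between the two series, which is why the coefficient stays high.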
Mutation Operators Efficiency Results. The procedure for selecting the
most efficient operators needs some further clarification. Strong mutation
operators are those whose mutants are not easy to kill. It would be very
difficult to create tests that kill 100% of selective mutants. Therefore,
we adopted a different approach, using thresholds to define the selective
operators:
1. Run tests on all projects with all stable mutation operators (stable
operators are those not causing unrecoverable crashes during mutation);
2. Find the most populous mutation operators (those generating the highest
number of mutants);
3. Exclude the operator if:
   - the mutation score of the mutants created by the operator is higher
     than the average mutation score on all the tested projects;
   - the mutation operator belongs to the most populous operators and the
     score of the mutants created is higher than the average of 80% for
     all the tested projects.
Non-excluded operators were considered selective operators and were active
for the selective mutation performance tests. Tables 4 and 5 are sorted by
the most populous operators across all projects (# Mut), with an
indication of the percentage of cases in which the mutation score of the
operator is higher than the average mutation score (% > avg)³. The (D) at
the end of some operator names for PITest means that the operator is
active by default. Operators marked in red are unstable, those in yellow
are excluded, and those in green are selected for the Selective Operators
experiments. The Judy operators that generated many mutants of which none
were killed were considered unstable.
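One possible reading of the exclusion rule can be sketched on the PITest values listed in Table 4 (operators without reported values are left out). Both the interpretation and the cutoff are assumptions of this sketch: an operator is excluded when its score is above average on all projects (% > avg equal to 100), or when it is populous and above average in more than 80% of cases; "populous" is taken, hypothetically, as at least 2500 generated mutants.

```python
# (operator, mutants generated, % > avg) as listed in Table 4.
pitest_operators = [
    ("INLINE_CONSTS", 12455, 56),
    ("NEGATE_CONDITIONALS", 11087, 100),
    ("RETURN_VALS", 10457, 89),
    ("REMOVE_CONDITIONALS_EQ_IF", 8335, 100),
    ("MATH", 3457, 78),
    ("REMOVE_CONDITIONALS_ORD_ELSE", 2752, 89),
    ("CONDITIONALS_BOUNDARY", 2752, 22),
    ("REMOVE_CONDITIONALS_ORD_IF", 2752, 67),
    ("VOID_METHOD_CALLS", 2653, 33),
    ("INCREMENTS", 1128, 100),
    ("INVERT_NEGS", 71, 100),
]


def select_operators(operators, populous_cutoff, high_share=80):
    """Keep operators not excluded by the two (assumed) threshold rules."""
    selected = []
    for name, mutants, pct_above_avg in operators:
        above_on_all_projects = pct_above_avg == 100
        populous_and_easy = mutants >= populous_cutoff and pct_above_avg > high_share
        if not (above_on_all_projects or populous_and_easy):
            selected.append(name)
    return selected


# The cutoff of 2500 mutants is a hypothetical choice, not from the paper.
selected = select_operators(pitest_operators, populous_cutoff=2500)
```

Under these assumptions five operators remain, including INLINE_CONSTS, which is consistent with the count reported for PITest; the actual selection is the one colour-coded in the original table.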
Table 4. Efficiency of PITest Operators
Operator                          #Mut    %>avg
INLINE_CONSTS                     12455   56
NEGATE_CONDITIONALS (D)           11087   100
RETURN_VALS (D)                   10457   89
REMOVE_CONDITIONALS_EQ_IF         8335    100
MATH (D)                          3457    78
REMOVE_CONDITIONALS_ORD_ELSE      2752    89
CONDITIONALS_BOUNDARY (D)         2752    22
REMOVE_CONDITIONALS_ORD_IF        2752    67
VOID_METHOD_CALLS (D)             2653    33
INCREMENTS (D)                    1128    100
INVERT_NEGS                       71      100
REMOVE_CONDITIONALS_EQ_ELSE
NON_VOID_METHOD_CALLS
CONSTRUCTOR_CALLS
EXPERIMENTAL_MEMBER_VARIABLE
EXPERIMENTAL_SWITCH
Table 5. Efficiency of Judy Operators
Out of the total 16 PITest mutation operators, 5 were selected for
selective mutation, including the most populous operator INLINE_CONSTS, so
that the total number of mutants generated during selective mutation was
almost the same as with the default PITest operators. For selective
mutation using Judy, 28 out of 56 mutation operators were selected and the
number of generated mutants was reduced significantly.
³ Descriptions of the operators can be found at
http://pitest.org/quickstart/mutators/ and
http://mutationtesting.org/judy/documentation/
The selected operators can be used to evaluate the number of mutated
classes vs the mutation score (Fig. 2a,b). The % of mutated classes refers
to the number of mutated classes over the total number of classes in the
project. The comparison showed that the mutation score of PITest selective
mutation is always lower than the mutation score of the default operators.
This can mean either that the mutants of the default operators are too
easy to kill, or that the selected operators produced more equivalent
mutants. Comparing selective vs non-selective strategies for mutation
score with a Wilcoxon Signed-Rank Test showed significant differences
(p = 0.0012 < 0.05, two-tailed, N = 15).
Fig. 2. Default vs Selective Mutation per Mutated Classes and Mutation Score.
Panels: (a) PITest, (b) Judy, (c) PITest, (d) Judy.
The total time of mutation analysis showed the real advantage of selective
mutation, also for PITest (Fig. 2c,d and Table 3). Except for one tested
project, all runs completed faster using selective mutation. Comparing
selective vs non-selective strategies with a Wilcoxon Signed-Rank Test
showed significant differences (p = 0.0096 < 0.05, two-tailed, N = 16) in
duration. Note also that selective operators allowed Judy to provide
results on the Apache Commons Codec project (with a mutation size of
15.08).
EXP1. Using Selective Operators can bring benefits in terms of runtime perfor-
mance, however, at the expense of lower mutation score. Selective Operators can
also help in running Mutation Testing on some projects.
Fig. 3. Concurrency Experiments Results. Panels: (a) Mutation Testing time
vs # threads, (b) Average Memory vs # threads.
Performance and Concurrency Results. The results of the concurrency
experiments showed that using two or three threads can yield a
considerable reduction of time relative to the increase in memory
consumption (Fig. 3). The average memory consumption rises almost linearly
for all tested projects (fitted regression up to 7 threads,
avgmem = 660.42 + 148.55 * #threads, adj. R² = 0.22), while the time
reduction is less than linear in the number of threads (fitted regression
up to 7 threads, time = 262.81 - 21.71 * #threads, adj. R² = 0.20).
Looking at the combined effect of the decrease in time and the increase in
memory consumption, we considered time vs average memory (Fig. 4). In this
case, time decreases less than linearly with the increase in memory
(fitted regression up to 7 threads, time = 13.61 - 0.2656 * avgmem,
adj. R² = 0.45), so using more threads might considerably increase memory
usage without larger benefits in time reduction.
EXP2. Using 2-3 threads can bring high benefits in terms of runtime
performance. Average memory consumption grows more than linearly compared
to the reduction in run time when increasing the number of threads.
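The two fitted regressions reported for the concurrency experiments can be evaluated for 1 to 7 threads, the range over which they were fitted. The units are an assumption of this sketch (MB for average memory, seconds for time), and given the low adjusted R² values (0.22 and 0.20) the outputs are rough trend indications rather than predictions for any single project.

```python
def predicted_avg_memory(threads):
    # avgmem = 660.42 + 148.55 * #threads (assumed to be in MB)
    return 660.42 + 148.55 * threads


def predicted_time(threads):
    # time = 262.81 - 21.71 * #threads (assumed to be in seconds)
    return 262.81 - 21.71 * threads


# Evaluate only within the fitted range of 1 to 7 threads.
predictions = [
    (threads, round(predicted_avg_memory(threads), 2), round(predicted_time(threads), 2))
    for threads in range(1, 8)
]

# Relative change when going from 1 to 7 threads according to the fits.
memory_growth = predicted_avg_memory(7) / predicted_avg_memory(1)
time_ratio = predicted_time(7) / predicted_time(1)
```

According to these fits, going from one to seven threads roughly doubles the average memory while the time drops to a bit under half, which illustrates the trade-off described above.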
Fig. 4. Delta Time vs Delta Average Memory
SOM Experiments Results. We next looked at the Higher Order Mutation
Testing strategy for the Judy project, in particular the four different
algorithms (Last2First, DifferentOperators, ClosePair and RandomMix) to
combine first order mutants (FOMs) into second order mutants (SOMs)
implemented in Judy [10, 9]. In this case too, we were interested in
changes in performance and in the quality of the mutation score. In
running the experiment, we noticed that the number of generated mutants
was reduced by at least 50% for some of the strategies and projects.
Mutation score for all SOM strategies was higher than for FOM (Fig. 5a),
while total time was generally lower for the three strategies in
comparison with FOM (Fig. 5b). Running a Friedman non-parametric test for
the differences across groups yielded significant results (0.05,
two-tailed) for generated mutants (p = 0.0089), mutation score
(p = 0.0031), and total time (p = 0.0009). However, as mentioned, the
improvement in mutation score might be due to the inclusion of fewer
equivalent mutants when applying such strategies.
Fig. 5. Application of different SOM Strategies vs FOM. Panels: (a) Mutation
Score, (b) Total Time.
When comparing individual SOM strategies, the ClosePair strategy gave the
lowest mutation score, while Last2First and RandomMix produced very similar
results for most of the tested projects. The low score of ClosePair may be
due to the fact that neighboring mutants in the list of FOMs are of the
same type, so that combining them generated the highest number of
equivalent SOMs.
EXP3. SOM strategies improve the results in terms of mutation score and in
terms of generated mutants, having positive benefits on the performance. How-
ever, manual inspection is needed to understand how many equivalent mutants
are generated.
Threats to Validity. We have several threats to validity to report [17].
For internal validity, measurements were averaged over several runs to
reduce the impact of external contention for resources. One of the main
issues is the reliability of the mutation score as a quality indicator.
The score always includes equivalent mutants, as their automated detection
is undecidable [9] and the only way to discover them is by manual
inspection, which is unfeasible for larger projects. In fact, two projects
with the same mutation score might be quite different depending on the
number of equivalent mutants. We also used thresholds for the definition
of the selective operators, and a sensitivity analysis would be more
appropriate to define the best ranges.
Related to external validity, we cannot ensure that results generalize to
other projects. However, we selected 9 projects that are heterogeneous in
terms of size and code coverage. More insights will be given by testing on
even larger software projects of industrial partners. Furthermore, the
package of the current experiments will be available to increase external
validity by means of replications.
For conclusion validity, we applied several statistical tests and simple
linear regression in different parts of the experiment. We always used the
non-parametric version of the tests, which does not assume normally
distributed data, and we believe we have met the other assumptions (type
of variables, relationships among measurements) needed to apply each test.
4 Related Works
There are several related works about experimental evaluations of tools for au-
tomated mutation testing in JVM environments.
One of the first experimental evaluations [13] was done on the mutation
operators of the Mothra project, omitting two, four and six of the most
populous mutation operators. The test cases killed 100% of the mutants
generated by selective mutation. These test cases were then run on
non-selective mutants, of which they killed almost 100%. Out of the 22
mutation operators used by Mothra, 5 were key operators that provide almost
the same coverage as non-selective mutation [14].
Madeyski et al. provided an experimental evaluation comparing the
performance of generating mutants between Judy and MuJava on various
Apache Commons libraries [10]. In the experiments, Judy was able to
generate at least ten times more mutants per second than MuJava.
In 2011, the applicability of mutation testing was examined by Nica et al.
[11]. The selected tools were MuJava, Jumble and Javalanche, focusing on the
performance of generating mutants. The only tool able to generate mutants was
MuJava, generating about 123 class-level mutants and 30,947 method-level
mutants in approx. 6 hours. Jumble and Javalanche showed configuration
difficulties and low performance. The main conclusion was that mutation
testing was too slow to be used in real-world software projects.
In 2013, Delahaye et al. compared Javalanche, Judy, Jumble, MAJOR and
PITest on several sample projects [4]. The results showed that Jumble, MA-
JOR and PITest were able to finish the analysis for every project, while Judy
generated the highest number of mutants and Javalanche the lowest number of
mutants. The research indicated that mutation testing tools still need a lot of
improvements to be usable in real world projects.
In 2015, Rani et al. compared MuJava, Judy, Jumble, Jester and PITest in an
experimental evaluation [15]. The experiments were run on a set of short programs
(17-111 LOC). The research showed that all the tools produced almost the same
average mutation scores except for PITest, which produced a 25% higher score than
the rest of the tools. One of the conclusions was that a new mutation system for
Java should be created, with faster generation and execution of mutants.
In 2016, Klischies et al. ran an experiment with PITest on several
Apache Commons projects. As a metric, the authors used the inverse
of the mutation score as an indication of the goodness of the mutation operator set.
Overall, they considered Mutation Testing applicable to real world projects, with
a low number of equivalent mutants, inspected manually, on the set of projects
that were considered. However, strong concerns remained about the applicability
to larger projects and to cases in which code coverage within projects is too low, making
the whole mutation analysis less effective [8].
Our work differs from the aforementioned related works in that we
focus on the selection of the best mutation operators and mutation strategies for
improvements in performance on a set of medium sized projects. We can directly
compare the SOM experiment results with those of Madeyski et al. [9], obtaining the same
results in terms of increase of mutation score and performance improvement.
5 Discussion
There are several findings about the application of Mutation Testing that we put
forward in the current paper. The general performance of Mutation Testing is
impacted by Mutation Size, that is the size of the project and the code coverage
level (RQ1). When assessing the applicability of Mutation Testing,
it is appropriate to consider Mutation Size (LOC size and code coverage) as an
indication of the time required. This can be a good indicator to use by analogy for
the application to other projects. A good strategy for the application of Mutation
Testing is, in fact, to first increase code coverage to good levels, as mutation analysis at low
code coverage levels cannot tell much about the quality of tests. Clearly, higher
coverage affects the execution of the tests, while mutant generation and
mutant compilation stay the same. When using multiple threads, running time
decreases less than linearly, while memory consumption increases.
Mutation Testing can be optimized by looking for the point at which parallelization
no longer brings enough incremental benefit. For the set of projects considered, 2-3
threads are effective numbers for balancing performance against memory resources.
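The thread trade-off described above can be illustrated with a minimal sketch. It assumes that run times have been measured for a few thread counts and returns the smallest count after which one more thread saves less than a chosen fraction of the run time. The function name, the 15% threshold and the timings are hypothetical and are not results from the experiments.

```python
def pick_thread_count(runtimes, min_gain=0.15):
    """Return the smallest thread count after which adding one more
    thread saves less than `min_gain` (a fraction) of the run time.

    `runtimes` maps thread count -> measured run time (any unit).
    """
    counts = sorted(runtimes)
    for current, nxt in zip(counts, counts[1:]):
        # Relative time saved by moving from `current` to `nxt` threads.
        gain = (runtimes[current] - runtimes[nxt]) / runtimes[current]
        if gain < min_gain:
            return current
    # Every step still paid off: keep the largest measured count.
    return counts[-1]


# Illustrative timings in minutes (hypothetical, not measured data).
example = {1: 60.0, 2: 36.0, 3: 29.0, 4: 27.0}
print(pick_thread_count(example))
```

A fuller version would also record memory consumption per thread count and reject configurations that exceed the available memory.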
Identifying the most efficient operators and applying selective operators improves
runtime performance at the expense of a lower mutation
score and a lower number of mutated classes (RQ2). This is a strategy that
can be applied to extend the applicability of Mutation Testing to
a wider set of projects. In the selective strategy we looked at
the efficiency of the operators in terms of killed mutants, but other approaches
may look at the operators that generate more mutants. Based on the results,
we believe that this set of strategies can help to apply Mutation Testing within
industrial contexts, as default configurations can lead to a larger overhead in
running the process. However, practitioners would need to fine-tune the Mutation
Testing environments according to the needs of the specific projects. We included
a list of the efficiency of selected operators based on the overall set of projects, which
can give indications for application to other projects.
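As a minimal sketch of operator selection, the snippet below ranks operators by efficiency and keeps the top ones. It assumes that efficiency is the ratio of killed to generated mutants, which is one possible reading of "efficiency in terms of killed mutants". The operator names and counts are placeholders, not data from the experiments.

```python
from typing import Dict, List, Tuple


def rank_operators(stats: Dict[str, Tuple[int, int]]) -> List[Tuple[str, float]]:
    """Rank operators by efficiency = killed / generated, highest first.

    `stats` maps operator name -> (generated mutants, killed mutants).
    Operators that generated no mutants are skipped.
    """
    ranked = [
        (name, killed / generated)
        for name, (generated, killed) in stats.items()
        if generated > 0
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def select_operators(stats: Dict[str, Tuple[int, int]], keep: int) -> List[str]:
    """Keep the names of the `keep` most efficient operators."""
    return [name for name, _ in rank_operators(stats)[:keep]]


# Placeholder operators and counts (hypothetical).
stats = {
    "OP_A": (200, 150),  # efficiency 0.75
    "OP_B": (100, 90),   # efficiency 0.90
    "OP_C": (400, 120),  # efficiency 0.30
    "OP_D": (0, 0),      # generated nothing, ignored
}
print(select_operators(stats, keep=2))
```

Ranking by the number of generated mutants instead, the alternative mentioned above, would only require changing the sort key.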
We looked at the impact of Second Order Mutation, which recombines First Order
Mutants and in this way reduces the number of mutants (RQ3). All the sub-strategies
considered (Last2First, DifferentOperators, ClosePair, RandomMix)
reduce the time required to run mutation testing, with a higher mutation
score than when considering the initial mutants. However, while improvements in
time are due to the lower number of generated mutants, mutation score can be
influenced by equivalent mutants; manual inspection is therefore suggested
to assess the effect on each considered project.
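The way second order mutation roughly halves the number of mutants can be sketched as follows. The snippet assumes that Last2First pairs the first first-order mutant with the last, the second with the second-to-last, and so on, and that RandomMix pairs mutants at random. The handling of an odd number of mutants and the fixed seed are assumptions made for illustration, and the actual behaviour of the tools may differ.

```python
import random


def last2first(foms):
    """Pair the first mutant with the last, the second with the
    second-to-last, and so on."""
    pairs = []
    i, j = 0, len(foms) - 1
    while i < j:
        pairs.append((foms[i], foms[j]))
        i += 1
        j -= 1
    if i == j:
        # Odd count: the middle mutant is kept unpaired (assumption).
        pairs.append((foms[i],))
    return pairs


def random_mix(foms, seed=0):
    """Shuffle the mutants and pair neighbours in the shuffled order."""
    shuffled = list(foms)
    random.Random(seed).shuffle(shuffled)
    return [tuple(shuffled[k:k + 2]) for k in range(0, len(shuffled), 2)]


# Hypothetical first-order mutant identifiers.
foms = ["m1", "m2", "m3", "m4", "m5"]
print(last2first(foms))
print(len(random_mix(foms)))
```

Each resulting pair stands for one second-order mutant that applies both changes, so five first-order mutants yield three mutants to execute instead of five.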
6 Conclusion
Mutation Testing is still an evolving testing methodology that can bring great
benefits to software development. With increasing computational resources, it
can reach wider adoption within industry, helping to build more robust software.
However, there are still aspects that hinder its usage, namely the computational
complexity, equivalent mutants and possible lack of integration tools [9].
In this paper, we looked at the current support of Mutation Testing in JVM
environments, with an experimental evaluation based on an industrial partner’s
needs. We focused on various aspects of performance, evaluating different strategies
that can be applied to reduce the time needed for mutation analysis. We
evaluated how selective operators and second order mutants can be beneficial for
the mutation testing process by reducing runtime overhead. Based on the
results, we believe that Mutation Testing is mature enough to be more widely
adopted. In our case, the experimental results have been useful for knowledge
transfer in an industrial cooperation, with future work aimed at exploring the
experimented approaches on the company’s source code repositories.
Acknowledgments. We are grateful to the developers of both PITest and
Judy for the feedback provided on the usage of the tools. In the case of Judy, the SOM
experiments have been possible with a newer version provided by the developers.
References
1. CJBS Insight: Cambridge university study states software bugs cost economy $312
billion per year. http://insight.jbs.cam.ac.uk/2013/financial-content-cambridge-
university-study-states-software-bugs-cost-economy-312-billion-per-year/
2. Crispin, L., Gregory, J.: Agile testing: A practical guide for testers and agile teams.
Pearson Education (2009)
3. Dedík, V., Rossi, B.: Automated bug triaging in an industrial context. In: 42nd
EUROMICRO Conference on Software Engineering and Advanced Applications.
pp. 363–367. IEEE (2016)
4. Delahaye, M., Du Bousquet, L.: A comparison of mutation analysis tools for java.
In: Quality Software (QSIC), 13th Int. Conference on. pp. 187–195. IEEE (2013)
5. DeMillo, R.A., Lipton, R.J., Sayward, F.G.: Hints on test data selection: Help for
the practicing programmer. Computer 11(4), 34–41 (Apr 1978)
6. Jia, Y., Harman, M.: Higher order mutation testing. Information and Software
Technology 51(10), 1379–1393 (2009)
7. Jia, Y., Harman, M.: An analysis and survey of the development of mutation
testing. IEEE transactions on software engineering 37(5), 649–678 (2011)
8. Klischies, D., Fögen, K.: An analysis of current mutation testing techniques applied
to real world examples. Full-scale Software Engineering/Current Trends in Release
Engineering p. 13 (2016)
9. Madeyski, L., Orzeszyna, W., Torkar, R., Jozala, M.: Overcoming the equivalent
mutant problem: A systematic literature review and a comparative experiment of
second order mutation. IEEE Trans. Softw. Eng. 40(1), 23–42 (Jan 2014)
10. Madeyski, L., Radyk, N.: Judy-a mutation testing tool for java. Software, IET 4(1),
32–42 (2010)
11. Nica, S., Ramler, R., Wotawa, F.: Is mutation testing scalable for real-world soft-
ware projects. In: VALID Third International Conference on Advances in System
Testing and Validation Lifecycle, Barcelona, Spain (2011)
12. Offutt, A.J., Lee, A., Rothermel, G., Untch, R.H., Zapf, C.: An experimental deter-
mination of sufficient mutant operators. ACM Trans. Softw. Eng. Methodol. 5(2),
99–118 (Apr 1996)
13. Offutt, A.J., Rothermel, G., Zapf, C.: An experimental evaluation of selective mu-
tation. In: Proceedings of the 15th Int. Conference on Software Engineering. pp.
100–107. ICSE ’93, IEEE Computer Society Press, Los Alamitos, CA, USA (1993)
14. Offutt, A.J., Untch, R.H.: Mutation testing for the new century. chap. Mutation
2000: Uniting the Orthogonal, pp. 34–44. Kluwer Academic Publishers, Norwell,
MA, USA (2001)
15. Rani, S., Suri, B., Khatri, S.K.: Experimental comparison of automated muta-
tion testing tools for java. In: Reliability, Infocom Technologies and Optimization
(ICRITO), 2015 4th International Conference on. pp. 1–6. IEEE (2015)
16. Roy, N.K.S., Rossi, B.: Towards an improvement of bug severity classification. In:
40th EUROMICRO Conference on Software Engineering and Advanced Applica-
tions. pp. 269–276. IEEE (2014)
17. Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.:
Experimentation in software engineering. Springer Science & Business Media (2012)