Fault detection effectiveness of source test case generation strategies for metamorphic testing


Abstract

Metamorphic testing is a well known approach to tackle the oracle problem in software testing. This technique requires the use of source test cases that serve as seeds for the generation of follow-up test cases. Systematic design of test cases is crucial for the test quality. Thus, source test case generation strategy can make a big impact on the fault detection effectiveness of metamorphic testing. Most of the previous studies on metamorphic testing have used either random test data or existing test cases as source test cases. There has been limited research done on systematic source test case generation for metamorphic testing. This paper provides a comprehensive evaluation on the impact of source test case generation techniques on the fault finding effectiveness of metamorphic testing. We evaluated the effectiveness of line coverage, branch coverage, weak mutation and random test generation strategies for source test case generation. The experiments are conducted with 77 methods from 4 open source code repositories. Our results show that by systematically creating source test cases, we can significantly increase the fault finding effectiveness of metamorphic testing. Further, in this paper we introduce a simple metamorphic testing tool called "METtester" that we use to conduct metamorphic testing on these methods.
Prashanta Saha
School of Computing, Montana State University
Bozeman, Montana
Upulee Kanewala
School of Computing, Montana State University
Bozeman, Montana
Metamorphic testing, Random testing, Source test case generation,
Weak mutation, Branch coverage, Line coverage
there is no oracle present for the program or it is practically infeasible to develop an oracle to verify the correctness of the computed outputs. This test oracle problem is quite frequent, especially with scientific software, and is one of the most challenging problems in software testing. The metamorphic testing (MT) technique was proposed to alleviate this oracle problem [ ]. MT uses properties of the program under test to define metamorphic relations (MRs). An MR specifies how the outputs should change in response to a specific change made to the source input. Thus, from existing test cases (named source test cases), MRs are used to generate new test cases (named follow-up test cases). The source and follow-up test cases are then executed on the program under test and the outputs are checked against the corresponding MRs. The program under test can be considered faulty if an MR is violated.

The effectiveness of MT in detecting faults depends on the quality of the MRs. It should also depend on the source test cases: the effectiveness of MT can be improved by systematically generating the source test cases. Such a systematic approach can reduce the size of the test suite and could be more cost effective. Most previous studies in MT have used randomly generated test cases as source test data. In this study we investigated the effectiveness of line coverage, branch coverage, weak mutation, and random testing for creating source test cases for MT.

Our experimental results show that test cases satisfying weak mutation coverage provide the best fault finding effectiveness. We also found that combining one or more systematic source test case generation techniques may increase the fault detection ability of MT.
MT is a property-based testing approach that aims to alleviate the oracle problem. However, the effectiveness of MT depends not only on the quality of the MRs but also on the source test cases. In this section we briefly discuss MT and the source test generation techniques considered: line coverage, branch coverage, and weak mutation.
2.1 Metamorphic Testing
Source test cases are used in MT [ ] to generate follow-up test cases using a set of MRs identified for the program under test (PUT). MRs [ ] are identified based on properties of the problem domain, such as attributes of the algorithm used. Source test cases can be created using techniques such as random testing, structural testing, or search-based testing. Follow-up test cases are generated by applying the input transformation specified by the MRs. After executing the source and follow-up test cases on the PUT, we check whether the change in the output matches the MR; if it does not, the MR is considered violated. Violation of an MR during testing indicates a fault in the PUT. Since MT checks the relationship between the inputs and outputs of a program, we can use this technique when the expected result of the program is not known.
For example, in Figure 1, a Java method add_values is used to show how source and follow-up test cases work with a PUT. The add_values method sums all the elements of the array passed as its argument. A source test case is randomly generated and run on add_values; the output for this test case is 101. For this program, when a positive constant is added to each element of the input, the output should increase. This property is used as an MR to conduct MT on this PUT. A constant value of 2 is added to each element of the array to create a follow-up test case, which is then run on the PUT. The output for this follow-up test case is 109. To satisfy this Addition MR, the follow-up output should be greater than the source output. In this MT example, the considered MR is satisfied for the given source and follow-up test cases.
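The add_values example can be sketched in a few lines. This is an illustrative Python stand-in for the Java method, not the authors' code; the concrete source array is an assumption, chosen only so that the outputs match the 101 and 109 reported in the text.

```python
def add_values(values):
    """Sum all elements of the input array (stand-in for the Java add_values)."""
    total = 0
    for v in values:
        total += v
    return total


def addition_follow_up(source, constant):
    """Create the follow-up input by adding a constant to every element."""
    return [x + constant for x in source]


# Hypothetical source test case: the original array is not given in the text,
# so this one is picked so that its sum is 101.
source = [10, 20, 30, 41]
follow_up = addition_follow_up(source, 2)

source_output = add_values(source)
follow_up_output = add_values(follow_up)

# Addition MR: the follow-up output must be greater than the source output.
mr_satisfied = follow_up_output > source_output
```

Note that no expected output for either input is needed: only the relation between the two outputs is checked.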
2.2 Source Test Case Generation
To generate source test cases we used the EvoSuite [ ] tool. EvoSuite is a test generation tool that automatically produces test cases targeting high code coverage. EvoSuite uses an evolutionary search approach that evolves whole test suites with respect to an entire coverage criterion at the same time. In this paper we generated source test cases based on line coverage, branch coverage, weak mutation, and random testing. Below we briefly describe the systematic approaches used by EvoSuite to generate them.
2.2.1 Line Coverage. In line coverage [ ], to cover each line of source code, we need to make sure that each basic code block in a method is reached. In traditional search-based testing, this reachability is expressed by a combination of branch distance [ ] and approach level. The approach level measures how far an individual execution is from the target statement in terms of control dependencies. The branch distance estimates how far a predicate (a decision-making point) is from evaluating to a desired target result. For example, given a predicate x == 6 and an execution with value x = 4, the branch distance to the predicate evaluating to true is |4 - 6| = 2, whereas an execution with value x = 5 is closer to being true, with a branch distance of |5 - 6| = 1. Branch distance can be measured by applying a set of standard rules [14, 16].
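The branch distance example for the predicate x == 6 can be reproduced with a small sketch. The function names are hypothetical, and only the equality rule used in the text is shown; other predicate types have their own rules in the cited literature.

```python
def branch_distance_equals(value, target):
    """Distance to making `value == target` true: 0 when equal, else |value - target|."""
    return abs(value - target)


def normalize(distance):
    """Map a non-negative distance into [0, 1) using x / (x + 1)."""
    return distance / (distance + 1)


# Distances for executions with x = 4, x = 5 and x = 6 against the predicate x == 6.
distances = {x: branch_distance_equals(x, 6) for x in (4, 5, 6)}
```

A smaller distance means the execution is closer to taking the desired branch, which gives the search a gradient to follow.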
When a whole test suite, rather than a single test case, is optimized to execute all statements, the approach level is not needed, because all statements are to be executed by the same test suite. Hence, we only need to inspect the branch distances of all the branches that are related to the control dependencies of any of the statements in the class. Each conditional statement in the code introduces a control dependency for some statements: the branch of the conditional that leads to the dependent code must be executed. Hence, the line coverage fitness value can be calculated by executing all the tests in a test suite. The minimum branch distance $d_{min}(b, Suite)$ is calculated among all observed executions for every branch $b$ in the set of control dependent branches $B_{CD}$. Thus, the line coverage fitness function is defined as [18]:

$$f_{LC}(Suite) = \nu(|NCLs| - |CoveredLines|) + \sum_{b \in B_{CD}} \nu(d_{min}(b, Suite))$$

where NCLs is the set of all statements in the class under test (CUT), CoveredLines is the set of statements executed by the test cases in the test suite, and $\nu(x)$ is a normalizing function in [0, 1] (e.g., $\nu(x) = x/(x+1)$) [2].
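As a rough illustration of the line coverage fitness described above (the normalized number of uncovered statements plus the normalized minimal branch distances of the control dependent branches), consider the following sketch. The data structures and branch names are assumptions made for the example and do not reflect EvoSuite's internals.

```python
def normalize(x):
    """Normalizing function x / (x + 1), mapping non-negative values into [0, 1)."""
    return x / (x + 1)


def line_coverage_fitness(all_lines, covered_lines, min_branch_distances):
    """Lower is better; 0 means every line is covered and every branch distance is 0.

    all_lines: set of statements in the class under test.
    covered_lines: set of statements executed by the test suite.
    min_branch_distances: minimal distance per control dependent branch.
    """
    uncovered = len(all_lines) - len(covered_lines)
    return normalize(uncovered) + sum(
        normalize(d) for d in min_branch_distances.values()
    )


# Hypothetical class with five lines, three of them covered, and two
# control dependent branches, one of which is still two units away.
fitness = line_coverage_fitness(
    all_lines={1, 2, 3, 4, 5},
    covered_lines={1, 2, 3},
    min_branch_distances={"if_true": 0, "if_false": 2},
)
fully_covered = line_coverage_fitness({1, 2}, {1, 2}, {"if_true": 0})
```

The search minimizes this value, so a suite that covers more lines or gets closer to an unexecuted branch is preferred.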
2.2.2 Branch Coverage. The idea of covering branches is well accepted in practice and implemented in popular tools, even though the practical notion of branch coverage may not always match the more theoretical interpretation of covering all edges of a program's control flow. Branch coverage is often defined as maximizing the number of branches of conditional statements that are executed by a test suite. Thus, a unit test suite satisfies the criterion if and only if, for each branch predicate, at least one test case makes the predicate evaluate to true and at least one test case makes it evaluate to false.

The fitness value for branch coverage is based on how close a test suite is to covering all branches of the CUT. It is calculated by executing all of the test cases in the suite and keeping track of the branch distance $d(b, Suite)$ for each branch $b$ in the CUT. Writing $B$ for the set of branches in the CUT [18]:

$$f_{BC}(Suite) = \sum_{b \in B} \nu(d(b, Suite))$$

To optimize branch coverage the following distance is calculated, where $d_{min}(b, Suite)$ is the minimal branch distance of branch $b$ over all executions of the test suite [18]:

$$d(b, Suite) = \begin{cases} 0 & \text{if the branch has been covered,} \\ \nu(d_{min}(b, Suite)) & \text{if the predicate has been executed at least twice,} \\ 1 & \text{otherwise.} \end{cases}$$

Both the true and the false evaluation of a predicate need to be covered, so a predicate must be executed at least twice by a test suite. If the predicate is executed only once, then in theory the search could oscillate between true and false.
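The three-case branch distance and the resulting branch coverage fitness can be sketched as follows. This follows the formulas as stated in this section (including the outer normalization in the sum); the tuple layout used to describe a branch is an assumption made for the example.

```python
def normalize(x):
    """Normalizing function x / (x + 1)."""
    return x / (x + 1)


def branch_distance_for_suite(covered, times_executed, d_min):
    """Distance d(b, Suite) of one branch for a whole test suite."""
    if covered:
        return 0.0
    if times_executed >= 2:
        # Predicate executed at least twice: use the minimal observed distance.
        return normalize(d_min)
    return 1.0


def branch_coverage_fitness(branches):
    """Sum of normalized per-branch distances; lower is better."""
    return sum(normalize(branch_distance_for_suite(*b)) for b in branches)


# Hypothetical branches: (covered, times the predicate was executed, d_min).
branches = [
    (True, 3, 0.0),    # already covered            -> d = 0
    (False, 2, 3.0),   # executed twice, 3 away     -> d = 3/4
    (False, 0, None),  # predicate never executed   -> d = 1
]
fitness = branch_coverage_fitness(branches)
```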
Figure 1: Source and follow-up test inputs on the PUT.

2.2.3 Weak Mutation. Test case generation tools tend to generate values that merely satisfy the constraints or conditions, rather than the values developers prefer, such as boundary cases. In weak mutation, a small code modification is applied to the CUT, and the test generation tool is then forced to generate values that can distinguish between the original and the mutant. A mutant is considered to be "killed" under weak mutation if the execution of a test case on the mutant leads to a different state than the execution on the CUT. A test suite satisfies the weak mutation criterion if and only if at least one test case kills each mutant of the CUT.

The fitness value for the weak mutation criterion is guided by the infection distance, which is measured with respect to a set of mutation operators. Assuming a minimal infection distance function $d_{min}(\mu, Suite)$, the distance for a mutant $\mu$ is defined as [18]:

$$d_w(\mu, Suite) = \begin{cases} 1 & \text{if mutant } \mu \text{ was not reached,} \\ \nu(d_{min}(\mu, Suite)) & \text{if mutant } \mu \text{ was reached.} \end{cases}$$

This results in the following fitness function for weak mutation:

$$f_{WM}(Suite) = \sum_{\mu \in M_C} d_w(\mu, Suite)$$

where $M_C$ is the set of all mutants generated for the CUT.
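A minimal sketch of the weak mutation fitness, assuming each mutant is described by whether it was reached and by its minimal infection distance. The mutant names and the dictionary layout are hypothetical.

```python
def normalize(x):
    """Normalizing function x / (x + 1)."""
    return x / (x + 1)


def mutant_distance(reached, d_min):
    """1 if the mutant was never reached, else the normalized infection distance."""
    return normalize(d_min) if reached else 1.0


def weak_mutation_fitness(mutants):
    """Sum of per-mutant distances over all mutants of the class under test."""
    return sum(mutant_distance(reached, d_min) for reached, d_min in mutants.values())


# Hypothetical mutants: (reached, minimal infection distance).
mutants = {
    "m1": (False, None),  # not reached                  -> 1
    "m2": (True, 1.0),    # reached, state not infected  -> 1/2
    "m3": (True, 0.0),    # reached and infected (killed) -> 0
}
fitness = weak_mutation_fitness(mutants)
```

A fitness of 0 would mean that every mutant is reached and its state infected by at least one test, which is the weak mutation criterion.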
We conducted a set of experiments to answer the following research questions:
- Which source test case generation technique(s) is/are most effective for MT in terms of fault detection?
- Can the best performing source test case generation technique be combined with other techniques to increase the fault finding effectiveness of MT?
- Does the fault detection effectiveness of an individual MR change with the source test generation method?
- How does the source test suite size differ for each source test generation technique?
3.1 Code Corpus
We built a code corpus containing 77 functions that take numerical inputs and produce numerical outputs. We obtained these functions from the following open source projects:
- The Colt Project: a set of open source libraries written for high-performance scientific and technical computing in Java.
- Apache Mahout: a machine learning library written in Java.
- Apache Commons Mathematics Library: a library of lightweight and self-contained mathematics and statistics components written in Java.
We list these functions in Table 2. Functions in the code corpus perform various calculations using sets of numbers, such as calculating statistics (e.g., average, standard deviation, and kurtosis), calculating distances (e.g., Manhattan and Tanimoto), and searching/sorting. The lines of code of these functions varied between 4 and 52, and the number of input parameters for each function varied between 1 and 4.

Figure 2: METtester Architecture.
3.2 METtester
METtester [ ] is a simple tool that we are developing to automate the MT process on a given Java program. This tool allows users to specify MRs and source test cases through a simple XML file. METtester transforms the source test cases according to the specified MRs and conducts MT on the given program. Figure 2 shows the high-level architecture of the tool. Below we describe the important components of the tool:
- XML input file: The user provides information (Figure 3) regarding the method names to test, the source test inputs, the MRs, and the number of test cases to run.
- XML file parsing: The Xmlparser class in our tool parses the information from the .xml file and processes it. That information is then sent to the follow-up test case generation module.
- Follow-up test case generation: In this module, follow-up test cases are generated based on the provided MRs and the source test cases.
- Execute source and follow-up test cases on the PUT: After generating the follow-up test cases, METtester runs both the source and follow-up test cases individually on the PUT and returns the outputs.
- Compare source and follow-up test results: After getting the test results from the program, METtester compares those results using the MR operators specified in the XML file. If the results satisfy the MR property, the tool flags the test case as "Pass". If they fail to satisfy the MR property, the tool flags it as "Fail", which means there is a fault in the program under test.

Figure 3: An example of the XML input given to METtester.
3.3 Experimental Setup
For the 77 methods described in Section 3.1 we generated a total of 7446 mutated versions using the µJava mutation tool [ ]. We used the following six metamorphic relations, which were used in previous studies, to test these functions [ ]. Suppose our source test case is $X = \{x_1, x_2, x_3, \dots, x_n\}$ and the follow-up test case is $Y$. Let the source and follow-up outputs be $O(X)$ and $O(Y)$, respectively:
- MR - Addition: add a positive constant $C$ to each element of the source test case, so the follow-up test case is $Y = \{x_1 + C, x_2 + C, x_3 + C, \dots, x_n + C\}$. Then $O(Y) \geq O(X)$.
- MR - Multiplication: multiply each element of the source test case by a positive constant $C$, so the follow-up test case is $Y = \{x_1 C, x_2 C, x_3 C, \dots, x_n C\}$. Then $O(Y) \geq O(X)$.
- MR - Shuffle: randomly permute the elements in the source test case. The follow-up test case can be, for example, $Y = \{x_3, x_1, x_n, \dots, x_2\}$. Then $O(Y) = O(X)$.
- MR - Inclusive: include a new element $x_{n+1} \geq 0$ in the source test case, so the follow-up test case is $Y = \{x_1, x_2, x_3, \dots, x_n, x_{n+1}\}$. Then $O(Y) \geq O(X)$.
- MR - Exclusive: exclude an existing element from the source test case, so the follow-up test case is $Y = \{x_1, x_2, x_3, \dots, x_{n-1}\}$. Then $O(Y) \leq O(X)$.
- MR - Invertive: take the inverse of each element of the source test case, so the follow-up test case is $Y = \{1/x_1, 1/x_2, 1/x_3, \dots, 1/x_n\}$. Then $O(Y) \leq O(X)$.
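The six MRs can be expressed as pairs of an input transformation and an expected output relation. The sketch below is illustrative only: the constant, the new element, the source input, and the use of Python's built-in sum as the program under test are all assumptions, and it is not how METtester encodes MRs.

```python
import operator
import random

C = 2              # hypothetical positive constant
NEW_ELEMENT = 3.0  # hypothetical non-negative element for the Inclusive MR

# Each MR: (input transformation, relation expected between O(Y) and O(X)).
MRS = {
    "addition": (lambda x: [v + C for v in x], operator.ge),
    "multiplication": (lambda x: [v * C for v in x], operator.ge),
    "shuffle": (lambda x: random.sample(x, len(x)), operator.eq),
    "inclusive": (lambda x: x + [NEW_ELEMENT], operator.ge),
    "exclusive": (lambda x: x[:-1], operator.le),
    # Invertive requires non-zero elements in the source test case.
    "invertive": (lambda x: [1 / v for v in x], operator.le),
}


def check_mr(program, source, name):
    """Return True if the named MR holds for `program` on this source test case."""
    transform, relation = MRS[name]
    follow_up = transform(list(source))
    return relation(program(follow_up), program(source))


# Stand-in program under test: summation over positive inputs.
results = {name: check_mr(sum, [1.0, 2.0, 4.0], name) for name in MRS}
```

Whether a given relation is expected to hold depends on the function and on the input domain; summation over these positive values happens to satisfy all six, which is not true of every function in the corpus.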
For each of the methods, we used EvoSuite [ ], described in Section 2.2, to generate test cases targeting line, branch, and weak mutation coverage. We used the generated test cases as the source test cases to conduct MT on the methods with METtester, using the MRs described above. Further, we randomly generated 10 test cases for each method to use as source test cases, serving as the baseline.
4.1 Effectiveness of the Source Test Case Generation Techniques
Figure 4 shows the overall mutant killing rates for the four source test generation techniques. Among all test case generation techniques, weak mutation performed best, killing 68.7% of the mutants. Random tests killed 41.5% of the mutants. Table 1 lists the number of methods that reported the highest mutant kill rate for each type of test generation technique. For some methods, several source test generation techniques gave the same best performance. Therefore, Figure 5 shows a Venn diagram of all the possible logical relations between the best performing source test generation techniques for the set of methods. The weak mutation based test generation technique reported the highest kill rate in 41 (53%) methods, whereas random testing reported the highest kill rate in only 13 (17%) methods. These results suggest that weak mutation based source test case generation is more effective in detecting faults with MT.
Figure 4: Total % of mutants killed by each source test suite generation technique.

Table 1: Total number of methods having the highest mutant kill rate for each source test generation technique.
Total Methods   Weak mutation   Line   Branch   Random
77              41              26     29       13

Weak mutation based test suites have the highest fault detection rate for the majority of the methods.
4.2 Fault Finding Effectiveness of Combined Source Test Cases
To observe whether combining source test case generation techniques achieves a higher fault detection rate, we combined the best performing source test generation technique, weak mutation, with the other source test generation techniques. Table 3 shows the total percentage of mutants killed with each combined test suite. The combination of weak mutation and random test cases has a higher mutant kill rate (74.91%) than the combination of weak mutation with line (72.87%) or with branch (74.6%). Combining all three strategies slightly increases the total percentage of killed mutants (75.98%), but there are a few things to consider, such as the combined test suite size.

Table 2: Mutant kill rates and test suite sizes for all methods under each source test case generation technique
Columns after the method name: Branch kill rate (%), Branch suite size, Weak mutation kill rate (%), Weak mutation suite size, Line kill rate (%), Line suite size, Random kill rate (%), Random suite size
add_values (Add elements in an array) 63.63 1 63.63 1 54.54 1 30 10
array_calc1 33.33 1 33.33 1 46.15 1 52.10 10
array_copy (Deep copy an array) 56.00 1 64.00 1 64.00 1 0.00 10
average (Average of an array) 38.10 1 73.80 1 42.86 1 28.20 10
bubble (Implements bubble sort) 51.40 1 44.95 3 36.69 1 16.90 10
cnt_zeroes (Count zero in an array) 41.00 1 51.30 2 38.46 1 0.00 10
count_k (Occurrences of k in an array) 31.80 1 36.36 2 34.09 1 50.00 10
count_non_zeroes (Count non zero element in array) 41.00 1 48.71 2 51.28 1 22.20 10
dot_product 63.00 1 60.87 1 56.52 1 22.20 10
elementwise_max (Elementwise maximum) 46.30 2 68.51 3 83.33 2 0.00 10
elementwise_min (Elementwise minimum) 44.40 1 55.56 1 55.56 1 0.00 10
find_euc_dist (Euclidean distance between two vectors) 80.10 1 76.39 1 79.17 1 50 10
find_magnitude (Magnitude of a vector) 52.10 1 75.00 1 52.10 1 8.69 10
find_max (Find the maximum value) 70.80 1 50.00 1 50.00 1 70.90 10
find_max2 64.10 1 71.84 2 67.96 1 98.40 10
find_median (Find median value in an array) 48.70 2 98.93 3 41.71 2 53.10 10
find_min (Find minimum value in an array) 40.40 1 61.70 1 57.45 1 83.80 10
geometric_mean (Returns the geometric mean of the entries in the input array) 51.20 1 53.66 1 95.12 1 65.40 10
hamming_dist (Hamming distance between two vectors) 40.90 1 84.09 3 59.09 2 15.90 10
insertion_sort (Implements insertion sort) 43.60 1 42.55 2 37.23 1 32.65 10
manhattan_dist (Manhattan distance between two vectors) 53.30 1 61.36 2 53.30 1 0.00 10
mean_absolute_error (Measure of difference between two continuous variables) 37.50 1 41.07 2 39.29 1 0.00 10
selection_sort (Implements selection sort) 41.30 1 41.30 2 39.40 1 21.60 10
sequential_search (Finding a target value within a list) 37.20 2 25.58 3 30.23 2 37.50 10
set_min_val (Set array elements less than k equal to k) 51.20 2 58.14 2 30.23 1 100 10
shell_sort (Implements shell sort) 43.70 1 42.51 1 43.11 1 0.00 10
variance (Returns the variance from a standard deviation) 26.10 1 39.86 1 30.40 1 25.70 10
weighted_average (A mean calculated by giving values in a data set) 86.10 1 56.94 1 86.10 1 21.20 10
manhattanDistance (The distance between two points in a grid) 48.89 1 77.78 2 22.22 1 9.10 10
chebyshevDistance (Distance metric defined on a vector space) 39.08 2 43.68 5 35.63 2 2.00 10
tanimotoDistance (a proper distance metric) 30.21 2 32.97 5 44.50 2 5.60 10
errorRate 61.04 3 58.44 2 58.44 2 0.00 10
sum 50.00 1 77.78 1 50.00 1 35.30 10
distance1 (Compute the distance between the instance and another vector) 53.33 1 80.00 1 53.33 1 14.8 10
distanceInf (Compute the distance between the instance and another vector) 46.67 1 46.67 1 46.67 1 14.8 10
ebeadd (Creates an array whose contents will be the element-by-element addition of the arguments) 92.68 2 100.00 3 100.00 2 15.8 10
ebedivide (Creates an array whose contents will be the element-by-element division) 100.00 2 100.00 5 100.00 2 26.8 10
ebemultiply (Creates an array whose contents will be the element-by-element multiplication) 100.00 2 100.00 3 92.68 2 15 10
safeNorm (Returns the Cartesian norm) 14.78 1 98.63 5 97.08 4 0.8 10
scale (Create a copy of an array scaled by a value) 48.72 1 58.97 3 53.85 1 47.8 10
entropy 88.42 1 88.42 2 88.42 1 42.9 10
g 93.55 2 95.16 2 93.55 1 20.9 10
calculateAbsoluteDifferences 60.98 1 60.98 1 60.98 1 0 10
evaluateHoners 46.03 1 79.37 1 47.62 1 80.4 10
evaluateInternal 95.25 1 93.47 2 95.55 1 90.6 10
evaluateNewton 80.00 1 65.71 1 64.29 1 76.8 10
meanDifference (Returns the mean of the (signed) differences) 40.00 1 80.00 1 40.00 1 40 10
equals 22.50 3 27.50 4 21.25 3 100 10
chiSquare (Implements Chi-Square test statistics) 96.41 2 96.41 2 96.41 2 65.6 10
partition 43.26 5 95.81 5 28.84 3 88.1 10
evaluateWeightedProduct 30.61 2 40.82 2 42.86 2 2 10
autoCorrelation (Returns the auto-correlation of a data sequence) 25.20 2 93.50 2 43.09 1 79.40 10
covariance (Returns the covariance of two data sequences) 24.84 1 23.57 1 23.57 1 86.70 10
durbinWatson (Durbin-Watson computation) 0.00 0 33.77 1 0.00 0 14.10 10
harmonicMean (Returns the harmonic mean of a data sequence) 74.00 1 74.00 1 76.00 1 42.50 10
kurtosis (Returns the kurtosis (aka excess) of a data sequence) 93.84 1 93.84 1 97.16 1 34.80 10
lag1 (Returns the lag-1 autocorrelation of a dataset) 99.55 1 32.70 1 89.55 1 33.70 10
max (Returns the largest member of a data sequence) 51.72 1 56.90 1 51.72 1 96.60 10
meanDeviation (Returns the mean deviation of a dataset) 54.39 1 33.33 1 28.07 1 78.30 10
min (Returns the smallest member of a data sequence) 67.41 1 81.03 2 70.69 1 96.60 10
polevl 94.23 2 88.46 1 88.46 2 45.50 10
pooledMean (Returns the pooled mean of two data sequences) 36.43 1 34.88 1 34.88 1 19.30 10
pooledVariance (Returns the pooled variance of two data sequences) 43.08 1 47.83 1 47.83 1 31.10 10
power 53.33 1 53.33 1 53.33 1 15.80 10
product (Returns the product) 50.00 1 50.00 1 50.00 1 94.70 10
quantile (Returns the phi-quantile) 40.13 2 40.76 2 32.48 2 40.00 10
sampleKurtosis (Returns the sample kurtosis (aka excess) of a data sequence) 93.86 1 93.86 1 92.98 1 85.10 10
sampleSkew (Returns the sample skew of a data sequence) 89.47 1 89.47 1 97.37 1 89.50 10
sampleVariance (Returns the sample variance of a data sequence) 75.31 1 75.31 1 12.35 1 71.20 10
skew (Returns the skew of a data sequence) 93.88 1 93.88 1 93.88 1 48.80 10
square 47.37 1 47.37 1 57.89 1 5.30 10
standardize (Modifies a data sequence to be standardized) 89.26 1 89.26 1 91.95 1 77.60 10
sumOfLogarithms (Returns the sum of logarithms of a data sequence) 75.00 1 68.75 1 68.75 1 21.90 10
sumOfPowerOfDeviations 68.75 1 52.08 1 75.00 1 64.90 10
weightedMean (Returns the weighted mean of a data sequence) 77.46 1 77.46 1 77.46 1 65.00 10
weightedRMS (Returns the weighted RMS (Root-Mean-Square) of a data sequence) 86.96 1 86.96 1 86.96 1 43.30 10
winsorizedMean (Returns the winsorized mean of a sorted data sequence) 33.00 1 37.93 1 34.48 1 0.00 10

Figure 5: Venn diagram of all the combinations of source test suites that performed best for each individual method.
Table 3: Total % of mutants killed after combining weak mutation with line coverage, branch coverage, and random testing
Weak mutation + Line   Weak mutation + Branch   All three strategies combined   Weak mutation + Random
72.87                  74.6                     75.98                           74.91

Combining weak mutation test cases with random test cases leads to the detection of more faults.
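One way to see why combining suites can only help the kill rate: a mutant counts as killed by the combined suite if either constituent suite kills it, so the combined set of killed mutants is the union of the two. The mutant identifiers and counts below are made up for illustration and are not data from the study.

```python
def kill_rate(killed, total_mutants):
    """Percentage of mutants killed."""
    return 100.0 * len(killed) / total_mutants


TOTAL_MUTANTS = 8  # hypothetical number of mutants for one method

# Hypothetical sets of mutant ids killed by each source test suite.
killed_by_weak_mutation = {1, 2, 3, 5}
killed_by_random = {2, 4}

# Combined suite: union of the mutants killed by either suite.
killed_by_combined = killed_by_weak_mutation | killed_by_random

rates = {
    "weak mutation": kill_rate(killed_by_weak_mutation, TOTAL_MUTANTS),
    "random": kill_rate(killed_by_random, TOTAL_MUTANTS),
    "combined": kill_rate(killed_by_combined, TOTAL_MUTANTS),
}
```

The gain from combining depends on how little the two killed sets overlap, while the cost is the larger combined test suite.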
4.3 Fault Finding Effectiveness of Individual MRs
To see how each source test case generation technique performs with individual MRs, Figure 6 illustrates the percentage of mutants killed by each of the six MRs separately using weak mutation, line coverage, branch coverage, and random test suites. Weak mutation has the highest percentage of killed mutants for all six MRs. In particular, with the multiplication and invertive MRs, the weak mutation test suite surpasses the others in mutant kill rate. Line coverage based test suites, however, were similar to weak mutation in killing mutants with the addition, shuffle, inclusive, and exclusive MRs. For the exclusive MR, all the test suites performed almost the same.

Weak mutation killed the highest number of mutants for all the MRs.
4.4 Impact of Source Test Suite Size
Table 4 compares the coverage criteria in terms of the total number of tests generated and the average and median test suite sizes over the individual methods. In addition, the columns Smaller, Equal, and Larger report whether the weak mutation test suites are smaller than, equal to, or larger than those produced by the other source test case generation techniques. The p-value column shows the p-value computed using a paired t-test between weak mutation and line, and between weak mutation and branch. We do not compare random test suites here, because we intentionally generated 10 random test cases for each method. Weak mutation leads to larger test suites than branch and line coverage: on average, the number of test cases produced for weak mutation is larger than for branch and line coverage, and the total number of test cases is also larger for weak mutation.

Weak mutation generated a higher number of test cases.
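The averages reported in Table 4 follow directly from the totals and the 77 methods, and the Smaller, Equal, and Larger counts in each row add up to 77. A quick arithmetic check, using only the numbers from Table 4 (the variable names are ours):

```python
METHODS = 77

# (total number of test cases, reported average size) per technique, from Table 4.
table4 = {
    "Weak mutation": (135, 1.75),
    "Line": (97, 1.26),
    "Branch": (99, 1.29),
    "Random": (770, 10.0),
}

# (Smaller, Equal, Larger) counts from Table 4.
comparisons = {
    "Line": (1, 45, 31),
    "Branch": (2, 49, 26),
    "Random": (77, 0, 0),
}

recomputed = {name: round(total / METHODS, 2) for name, (total, _) in table4.items()}
reported = {name: avg for name, (_, avg) in table4.items()}
row_totals = {name: sum(counts) for name, counts in comparisons.items()}
```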
Threats to internal validity may result from the way the empirical study was carried out. EvoSuite and our experimental setup have been carefully tested, although testing cannot definitely prove the absence of defects.

Threats to construct validity may occur because of the third-party tools we used. The EvoSuite tool was used to generate source test cases for the line, branch, and weak mutation test generation techniques. Further, we used the µJava mutation tool to create mutants for our experiment. To minimize these threats we verified that the results produced by these tools are correct by manually inspecting randomly selected outputs produced by each tool.

Threats to external validity were minimized by using 77 methods as the case study, collected from 4 different open source projects. This provides high confidence in the possibility of generalizing our results to other open source software. We only used the EvoSuite tool to generate test cases for our major experiment, but we also used JCUTE [ ] to generate branch coverage based test suites for our initial case study and observed similar results.
Figure 6: % of mutants killed by all six MRs using 4 test suite strategies (Branch, Line Coverage, Weak Mutation and Random)

Table 4: Test suite sizes for weak mutation, line coverage, branch coverage, and random testing
Test Suites     Total Number of Test Cases   Average Size   Median Size   Std Dev   Smaller   Equal   Larger   p-value
Weak mutation   135                          1.75           1             1.13      -         -       -        -
Line            97                           1.26           1             0.67      1         45      31       3.102e-07
Branch          99                           1.29           1             0.59      2         49      26       1.375e-05
Random          770                          10             10            0         77        0       0        -

Fault Detection Effectiveness of Source Test Case Generation Strategies for Metamorphic Testing. MET'18, May 27, 2018, Gothenburg, Sweden

Most contributions on MT use either randomly generated test data or existing test suites as source test cases. Not much research has been done on the systematic generation of source test cases for MT. Gotlieb and Botella [ ] presented an approach called Automated Metamorphic Testing, in which they translated the code into an equivalent constraint logic program and tried to find test cases that violate the MRs. Chen et al. [ ] compared the effectiveness of random testing and "special values" as source test cases for MT. Special values are inputs for which the output is well known for a particular method. Wu et al. [ ] showed that random test cases are more effective than test cases derived from "special values". Segura et al. [ ] also compared the effectiveness of random testing with manually generated test suites for MT. Their results showed that randomly generated test suites are more effective in detecting faults than manually designed test suites. They also observed that combining random testing with manual tests provides better fault detection ability than random testing alone.

Batra and Sengupta [ ] proposed a genetic algorithm to generate test cases that maximize the paths traversed in the program under test for MT. Chen et al. [ ] addressed the same problem from a different perspective. They proposed partitioning the input domain of the PUT into multiple equivalence classes for MT, together with an algorithm that generates test cases covering those equivalence classes. They were able to generate test cases that provide a high fault detection rate. Symbolic execution was used by Dong and Zhang [ ] to construct MRs and their corresponding source test cases. Program paths were first analyzed to generate symbolic inputs, and these symbolic inputs were then used to construct MRs. In the final step, source test cases were generated by replacing the symbolic inputs with real values.

Barus et al. [ ] applied Adaptive Random Testing (ART) in place of random testing (RT) to study the effect of source test case generation on the effectiveness of MT. Their results showed that ART outperforms RT in enhancing the effectiveness of MT. Alatawi et al. [ ] used an automated test input generation technique called dynamic symbolic execution (DSE) to generate the source test inputs for metamorphic testing. Their results showed that DSE improves the coverage and fault detection rate of metamorphic testing compared to random testing, using significantly smaller test suites. In contrast, in this work we evaluate the effectiveness of four commonly used coverage criteria for automated source test case generation.
In this study we empirically evaluated the fault finding effectiveness of four different source test case generation strategies for MT: line coverage, branch coverage, weak mutation, and random testing.

Our results show that weak mutation coverage based test generation can be a more effective source test case generation technique for MT than the other techniques. Our results also show that the fault finding effectiveness of MT can be improved by combining source tests generated for weak mutation coverage with randomly generated source test cases.

Further, in this paper we introduced an MT tool called "METtester". We plan to incorporate the investigated automated source test generation techniques into this tool. We also plan to extend the current case study to larger code bases and to experiment with more source test generation techniques, such as adaptive random test generation and data flow based test generation. Further, we plan to analyze the impact of the coverage of follow-up test cases in our future research.

This work is supported by award number 1656877 from the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.
