Fault Detection Effectiveness of Source Test Case Generation Strategies for Metamorphic Testing

Prashanta Saha
School of Computing, Montana State University
Bozeman, Montana
p66n633@msu.montana.edu
Upulee Kanewala
School of Computing, Montana State University
Bozeman, Montana
upulee.kanewala@montana.edu
ABSTRACT
Metamorphic testing is a well known approach to tackle the oracle
problem in software testing. This technique requires the use of
source test cases that serve as seeds for the generation of follow-up
test cases. Systematic design of test cases is crucial for the test
quality. Thus, source test case generation strategy can make a big
impact on the fault detection effectiveness of metamorphic testing.
Most of the previous studies on metamorphic testing have used
either random test data or existing test cases as source test cases.
There has been limited research done on systematic source test
case generation for metamorphic testing. This paper provides a
comprehensive evaluation on the impact of source test case generation
techniques on the fault finding effectiveness of metamorphic
testing. We evaluated the effectiveness of line coverage, branch
coverage, weak mutation and random test generation strategies
for source test case generation. The experiments are conducted
with 77 methods from 4 open source code repositories. Our results
show that by systematically creating source test cases, we can
significantly increase the fault finding effectiveness of metamorphic
testing. Further, in this paper we introduce a simple metamorphic
testing tool called "METtester" that we use to conduct metamorphic
testing on these methods.
KEYWORDS
Metamorphic testing, Random testing, Source test case generation,
Weak mutation, Branch coverage, Line coverage
ACM Reference Format:
Prashanta Saha and Upulee Kanewala. 2018. Fault Detection Effectiveness
of Source Test Case Generation Strategies for Metamorphic Testing. In
Proceedings of ACM Conference (MET 2018). ACM, Gothenburg, Sweden,
8 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
A test oracle [19] is a mechanism to detect the correctness of the
outcomes of a program. The oracle problem [2] can occur when
there is no oracle present for the program or it is practically
infeasible to develop an oracle to verify the correctness of the computed
outputs. This test oracle problem is quite frequent especially with
scientific software and is one of the most challenging problems
in software testing. The metamorphic testing (MT) technique was
proposed to alleviate this oracle problem [5]. MT uses properties of
the program under test to define metamorphic relations (MRs). An
MR specifies how the outputs should change in response to a specific
change made to the source input. Thus, MRs are used to generate new
test cases (called follow-up test cases) from existing test cases
(called source test cases). The source and follow-up test cases are
then executed on the program under test, and the outputs are checked
against the corresponding MRs. The program under test is considered
faulty if an MR is violated.
The effectiveness of MT in detecting faults depends on the quality
of the MRs. It should also depend on the source test cases: the
effectiveness of metamorphic testing can be improved by generating
the source test cases systematically. Such a systematic approach can
also reduce the size of the test suite and could therefore be more
cost effective. Most previous studies on MT have used randomly
generated test cases as source test data. In this study we
investigated the effectiveness of line coverage, branch coverage,
weak mutation, and random testing for creating source test cases for MT.
Our experimental results show that test cases satisfying weak
mutation coverage provide the best fault finding effectiveness. We
also found that combining one or more systematic source
test case generation technique(s) may increase the fault detection
ability of MT.
2 BACKGROUND
MT is a property-based testing approach that aims to alleviate the
oracle problem. However, the effectiveness of MT depends not only on
the quality of the MRs but also on the source test cases. In this
section we briefly discuss MT and the source test generation
techniques considered: line coverage, branch coverage and weak mutation.
2.1 Metamorphic Testing
The oracle problem is one of the biggest challenges in software testing.
MT is an effective method for testing programs that face the oracle
problem. MT [5] creates follow-up test cases from existing test
cases, called source test cases. To generate follow-up test cases, we
first need to identify an appropriate set of MRs that the program
under test (PUT) should satisfy. MRs [7] are identified based on
properties of the problem domain, such as attributes of the algorithm
used. Source test cases can be created using techniques like random
testing, structural testing or search-based testing. Follow-up test
cases are generated by applying the input transformation specified
by the MRs. After executing the source and follow-up test cases
on the PUT, we check whether the change in the output matches the MR;
if it does not, the MR is considered violated. Violation
of an MR during testing indicates a fault in the PUT. Since MT checks
the relationship between the inputs and outputs of a test program, we
can use this technique when the expected result of a test program
is not known.
arXiv:1802.07361v1 [cs.SE] 20 Feb 2018
For example, in Figure 1, a Java method add_values is used to
show how source and follow-up test cases work with a PUT. The
add_values method sums up all the elements of the array passed as its
argument. The source test case t = {3, 43, 1, 54} is randomly generated
and executed on add_values. The output for this test case is 101. For
this program, when a positive constant c is added to each element of
the input, the output should increase. This property is used as an MR
to conduct MT on this PUT. A constant value 2 is added to each element
of the array to create the follow-up test case t' = {5, 45, 3, 56},
which is then run on the PUT. The output for this follow-up
test case is 109. To satisfy this Addition MR, the follow-up test output
should be greater than the source output. In this MT example, the
considered MR is satisfied for the given source and follow-up test
cases.
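The add_values example can be replayed in a few lines. The sketch below is a Python stand-in for the Java method in Figure 1, not the authors' code; the helper names addition_follow_up and addition_mr_holds are illustrative assumptions.

```python
def add_values(values):
    """Program under test: sum all elements of the list."""
    total = 0
    for v in values:
        total += v
    return total


def addition_follow_up(source, c):
    """Addition MR input transformation: add the constant c to every element."""
    return [v + c for v in source]


def addition_mr_holds(source_output, follow_up_output):
    """Addition MR output relation: the follow-up output must exceed the source output."""
    return follow_up_output > source_output


source = [3, 43, 1, 54]                    # source test case t
follow_up = addition_follow_up(source, 2)  # follow-up test case t'
source_output = add_values(source)
follow_up_output = add_values(follow_up)
mr_satisfied = addition_mr_holds(source_output, follow_up_output)
```

No expected value of add_values is needed to reach a verdict: only the relation between the two outputs is checked.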
2.2 Source Test Case Generation
To generate source test cases we used the EvoSuite [9] tool.
EvoSuite is a test generation tool that automatically produces test
cases targeting high code coverage. EvoSuite uses an evolutionary
search approach that evolves whole test suites with respect
to an entire coverage criterion at the same time. In this paper we
generated source test cases based on line coverage, branch coverage,
weak mutation and random testing. Below we briefly describe the
systematic approaches used by EvoSuite to generate them.
2.2.1 Line Coverage. In line coverage [16], to cover each line of
source code, we need to make sure that each basic code block in a
method is reached. In traditional search-based testing, this
reachability would be expressed by a combination of branch distance [13]
and approach level. The approach level measures how distant an
individual execution is from the target statement in terms of
control dependencies. The branch distance estimates how distant a
predicate (a decision-making point) is from evaluating to a desired
target result. For example, given a predicate x == 6 and an execution
with value x = 4, the branch distance to the predicate evaluating to
true would be |4 - 6| = 2, whereas an execution with value x = 5 is
closer to being true, with a branch distance of |5 - 6| = 1. Branch
distance can be measured by applying a set of standard rules [11, 13].
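The equality example can be written as a one-line distance function. This is a minimal sketch covering only the x == target rule used in the example; the function name is an assumption, and the full rule set of [11, 13] is not reproduced here.

```python
def branch_distance_equals(x, target):
    """Distance to making the predicate `x == target` true; 0 means it already holds."""
    return abs(x - target)


distance_from_4 = branch_distance_equals(4, 6)
distance_from_5 = branch_distance_equals(5, 6)
distance_from_6 = branch_distance_equals(6, 6)
```

A search algorithm can use this value as guidance: inputs with a smaller distance are closer to taking the true branch.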
When the goal is instead to find a test suite that executes all
statements, the approach level is not needed, because all statements
will be executed by the same test suite. Hence, we only need to
inspect the branch distances of all the branches that are control
dependencies of any of the statements in the class. For each
conditional statement in the code that is a control dependency of
some other statement, the branch leading to the dependent code must
be executed. The line coverage fitness value can therefore be
calculated by executing all the tests in a test suite: for each
executed statement, the minimum branch distance $d_{min}(b, Suite)$ is
calculated among all observed executions for every branch $b$ in the
set of control dependent branches $B_{CD}$. Thus, the line coverage
fitness function is defined as [16]:
$$f_{LC}(Suite) = v(|NCLs| - |CoveredLines|) + \sum_{b \in B_{CD}} v(d_{min}(b, Suite))$$
where $NCLs$ is the set of all statements in the class under test
(CUT), $CoveredLines$ is the set of statements executed by the test
cases in the test suite, and $v(x)$ is a normalizing function in [0,1]
(e.g. $v(x) = x/(x+1)$) [1].
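To make the line coverage fitness concrete, the following sketch evaluates it on hand-picked numbers. It is not EvoSuite's implementation; the function names, the line numbers and the branch labels are hypothetical.

```python
def normalize(x):
    """Normalizing function v(x) = x / (x + 1)."""
    return x / (x + 1)


def line_coverage_fitness(all_lines, covered_lines, min_branch_distances):
    """f_LC(Suite) = v(|NCLs| - |CoveredLines|) + sum of v(d_min(b, Suite)).

    min_branch_distances maps each control dependent branch b to d_min(b, Suite).
    """
    uncovered = len(all_lines) - len(covered_lines)
    return normalize(uncovered) + sum(
        normalize(d) for d in min_branch_distances.values()
    )


# Hypothetical class with 10 lines, 8 of them covered, and two
# control dependent branches with minimal distances 0 and 1.
all_lines = set(range(1, 11))
covered_lines = set(range(1, 9))
fitness = line_coverage_fitness(all_lines, covered_lines, {"b1": 0, "b2": 1})

# A suite that covers every line and reaches every branch has fitness 0.
perfect = line_coverage_fitness(all_lines, all_lines, {"b1": 0, "b2": 0})
```

Lower values are better, so the search minimizes this function.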
2.2.2 Branch Coverage. The idea of covering branches is well
accepted in practice and implemented in popular tools, even though
the practical rationale of branch coverage may not always match the
more theoretical interpretation of covering all edges of a program's
control flow. Branch coverage is often defined as maximizing the
number of branches of conditional statements that are executed by
a test suite. Thus, a unit test suite satisfies branch coverage if and
only if, for every branch predicate, at least one test case evaluates
the predicate to true and at least one test case evaluates it to false.
The fitness value for branch coverage is calculated based on how
close a test suite is to covering all branches of the CUT. The fitness
value of a test suite is calculated by executing all of its test
cases, keeping track of the branch distances $d(b, Suite)$ for each
branch in the CUT. Then [16]:
$$f_{BC}(Suite) = \sum_{b \in B} v(d(b, Suite))$$
To optimize the branch coverage the following distance is calculated,
where $d_{min}(b, Suite)$ is the minimal branch distance of
branch $b$ over all executions of the test suite [16]:
$$d(b, Suite) = \begin{cases} 0 & \text{if the branch has been covered,} \\ v(d_{min}(b, Suite)) & \text{if the predicate has been executed at least twice,} \\ 1 & \text{otherwise.} \end{cases}$$
Here both the true and the false evaluation of a predicate need to be
covered, so a predicate must be executed at least twice by a
test suite. If the predicate is executed only once, then in theory the
search could oscillate between true and false.
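The piecewise distance and the resulting fitness can be sketched as follows. The code follows the two formulas as they are stated in this section; it is not EvoSuite's code, and the tuple layout used to describe a branch is an assumption made for illustration.

```python
def normalize(x):
    """Normalizing function v(x) = x / (x + 1)."""
    return x / (x + 1)


def branch_distance_term(covered, predicate_executions, d_min):
    """d(b, Suite) as defined in the text."""
    if covered:
        return 0.0
    if predicate_executions >= 2:
        return normalize(d_min)
    return 1.0


def branch_coverage_fitness(branches):
    """f_BC(Suite) = sum over all branches b of v(d(b, Suite))."""
    return sum(normalize(branch_distance_term(*b)) for b in branches)


# Hypothetical suite state: (covered, times predicate executed, d_min).
branches = [
    (True, 3, 0),    # already covered: d = 0
    (False, 2, 1),   # executed twice, distance 1: d = v(1) = 0.5
    (False, 1, 5),   # executed only once: d = 1
]
fitness = branch_coverage_fitness(branches)
```

The third branch contributes the maximal term regardless of its distance, which reflects the requirement that a predicate be executed at least twice.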
2.2.3 Weak Mutation. Test case generation tools tend to generate
values that merely satisfy the constraints or conditions, rather than
the values developers prefer, such as boundary cases. In weak mutation,
a small code modification is applied to the CUT, and the test
generation tool is then forced to generate values that can distinguish
the original from the mutant. A mutant is considered "killed" under
weak mutation if the execution of a test case on the mutant leads to a
different output than the execution on the CUT. A test suite satisfies
the weak mutation criterion if and only if each mutant of the CUT is
killed by at least one test case.
The fitness value for the weak mutation criterion is guided by the
infection distance, which is measured with respect to a set of
mutation operators. Assuming a minimal infection distance
function $d_{min}(\mu, Suite)$ exists, define [16]:
$$d_w(\mu, Suite) = \begin{cases} 1 & \text{if mutant } \mu \text{ was not reached,} \\ v(d_{min}(\mu, Suite)) & \text{if mutant } \mu \text{ was reached.} \end{cases}$$
This results in the following fitness function for weak mutation
[16]:
$$f_{WM}(Suite) = \sum_{\mu \in M_c} d_w(\mu, Suite)$$
where $M_c$ is the set of all mutants generated for the CUT.
Figure 1: Source and follow-up test inputs on the PUT.
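A small numeric sketch of the weak mutation fitness follows. It evaluates the two formulas on hypothetical mutant data and is not EvoSuite's implementation; in particular, the infection distances are supplied by hand rather than computed from mutation operators.

```python
def normalize(x):
    """Normalizing function v(x) = x / (x + 1)."""
    return x / (x + 1)


def weak_mutation_distance(reached, d_min):
    """d_w(mu, Suite): 1 if the mutant was not reached, else v(d_min(mu, Suite))."""
    return normalize(d_min) if reached else 1.0


def weak_mutation_fitness(mutants):
    """f_WM(Suite) = sum of d_w(mu, Suite) over all mutants of the CUT."""
    return sum(weak_mutation_distance(reached, d_min) for reached, d_min in mutants)


# Hypothetical mutants: (reached, minimal infection distance).
mutants = [
    (True, 0),      # reached and infected: contributes 0
    (True, 3),      # reached, distance 3: contributes v(3) = 0.75
    (False, None),  # never reached: contributes 1
]
fitness = weak_mutation_fitness(mutants)
```

A fitness of 0 means every mutant was reached with infection distance 0, i.e. the suite satisfies the weak mutation criterion.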
3 EVALUATION METHOD
We conducted a set of experiments to answer the following research
questions:
RQ1: Which source test case generation technique(s) is/are
most effective for MT in terms of fault detection?
RQ2: Can the source test case generation techniques be
combined to increase the fault finding effectiveness of MT?
RQ3: Does the fault detection effectiveness of an individual
MR change with the source test generation method?
RQ4: How does the source test suite size differ for each
source test generation technique?
3.1 Code Corpus
We built a code corpus containing 77 functions that take numerical
inputs and produce numerical outputs. We obtained these functions
from the following open source projects:
The Colt Project (http://acs.lbl.gov/software/colt/): A set of open
source libraries written for high-performance scientific and
technical computing in Java.
Apache Mahout (https://mahout.apache.org/): A machine learning library
written in Java.
Apache Commons Mathematics Library
(http://commons.apache.org/proper/commons-math/): A library of
lightweight and self-contained mathematics and statistics
components written in Java.
We list these functions in Table 2. Functions in the code corpus
perform various calculations using sets of numbers, such as
calculating statistics (e.g. average, standard deviation and kurtosis),
calculating distances (e.g. Manhattan and Tanimoto) and
searching/sorting. The lines of code of these functions varied between
4 and 52, and the number of input parameters for each function varied
between 1 and 4.
Figure 2: METtester Architecture.
3.2 METtester
METtester [15] is a simple tool that we are developing to automate
the MT process on a given Java program. This tool allows
users to specify MRs and source test cases through a simple XML
file. METtester transforms the source test cases according to the
specified MRs and conducts MT on the given program. Figure 2
shows the high-level architecture of the tool. Below we describe
the important components of the tool:
XML input file: The user provides information (Figure 3)
regarding the method names to test, the source test inputs, the MRs,
and the number of test cases to run.
XML file parsing: The Xmlparser class in our tool parses the
information from the .xml file and processes it. That information
is then sent to the follow-up test case generation module.
Follow-up test case generation: In this module, follow-up
test cases are generated based on the provided MRs and
the source test cases.
Execute source & follow-up test cases on the PUT: After
generating the follow-up test cases, METtester runs
both the source and the follow-up test cases individually on
the PUT and collects the outputs.
Compare source & follow-up test results: After getting
the test results from the PUT, METtester compares
those results using the MR operators specified in the
XML file. If the results satisfy the MR property, the tool flags
the test case as "Pass". If they fail to satisfy the MR property,
the tool flags it as "Fail", which means there is a fault in the
program.
Figure 3: An example of the XML input given to METtester.
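The stages described above (specification, follow-up generation, execution, comparison) can be mimicked in a short script. This is a generic sketch and not METtester's code or input format: the dictionary stands in for the parsed XML file, and every name in it is a hypothetical choice made for illustration.

```python
import operator


def add_values(values):
    """Example program under test."""
    return sum(values)


# Hypothetical stand-in for the parsed input: method, source tests and MRs.
# Each relation is called as relation(follow_up_output, source_output).
spec = {
    "method": add_values,
    "source_tests": [[3, 43, 1, 54], [0, 0], [7]],
    "mrs": [
        {"name": "Addition",
         "transform": lambda xs: [x + 2 for x in xs],
         "relation": operator.ge},
        {"name": "Shuffle",  # a fixed permutation (reversal) keeps the run deterministic
         "transform": lambda xs: list(reversed(xs)),
         "relation": operator.eq},
    ],
}


def run_metamorphic_tests(spec):
    """Generate follow-ups, execute both tests on the PUT, and compare the outputs."""
    results = []
    method = spec["method"]
    for source in spec["source_tests"]:
        source_output = method(list(source))
        for mr in spec["mrs"]:
            follow_up = mr["transform"](list(source))
            follow_up_output = method(follow_up)
            passed = mr["relation"](follow_up_output, source_output)
            results.append((mr["name"], source, follow_up, "Pass" if passed else "Fail"))
    return results


results = run_metamorphic_tests(spec)
```

Keeping the MRs as data (a transformation plus a comparison operator) is what lets one driver loop serve every method and every relation.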
3.3 Experimental Setup
For the 77 methods described in Section 3.1 we generated a total of
7446 mutated versions using the µJava mutation tool [12]. We used
the following six metamorphic relations identified by Murphy et al.
[14] to test these functions:
MR - Addition: Add a positive constant. The expected result
should increase or remain constant.
MR - Multiplication: Multiply by a positive constant. The
expected result should increase or remain constant.
MR - Shuffle: Randomly permute the elements. The expected
result should remain constant.
MR - Inclusive: Add a new element. The expected result
should increase or remain constant.
MR - Exclusive: Remove an existing element. The expected
result should decrease or remain constant.
MR - Invertive: Take the inverse of each element. The expected
result should decrease or remain constant.
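The six relations can be expressed as input transformations paired with an expected output relation. The sketch below is one possible reading, under stated assumptions: the constants, the appended element and the choice of which element to remove are arbitrary, and "inverse" is assumed to mean the reciprocal 1/x, which the text does not specify.

```python
import operator
import random


def addition(xs, c=1.0):
    return [x + c for x in xs]           # add a positive constant to each element


def multiplication(xs, c=2.0):
    return [x * c for x in xs]           # multiply each element by a positive constant


def shuffle(xs, seed=0):
    ys = list(xs)
    random.Random(seed).shuffle(ys)      # seeded so the run is repeatable
    return ys


def inclusive(xs, new=1.0):
    return list(xs) + [new]              # add a new element


def exclusive(xs):
    return list(xs)[:-1]                 # remove an existing element (here: the last)


def invertive(xs):
    return [1.0 / x for x in xs]         # ASSUMPTION: inverse = reciprocal, inputs non-zero


# Each relation is called as relation(source_output, follow_up_output).
METAMORPHIC_RELATIONS = {
    "Addition": (addition, operator.le),
    "Multiplication": (multiplication, operator.le),
    "Shuffle": (shuffle, operator.eq),
    "Inclusive": (inclusive, operator.le),
    "Exclusive": (exclusive, operator.ge),
    "Invertive": (invertive, operator.ge),
}


def check_all(method, source):
    """Return, per MR, whether the relation holds for this source test case."""
    source_output = method(list(source))
    return {
        name: relation(source_output, method(transform(list(source))))
        for name, (transform, relation) in METAMORPHIC_RELATIONS.items()
    }


verdicts = check_all(sum, [2.0, 4.0, 8.0])
```

Whether a relation is actually valid depends on the method and the input domain; for instance, the Invertive check above holds for a sum only when the elements are larger than 1. Exact equality for Shuffle is also fragile for floating-point results in general.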
For each of the methods, we used EvoSuite [9], described in
Section 2.2, to generate test cases targeting line, branch and weak
mutation coverage. We used the generated test cases as the source
test cases to conduct MT on the methods with the MRs described
above, using METtester. Further, we randomly generated 10 test cases
for each method as source test cases, to be used as the baseline.
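The kill rates reported in the next section can be understood through the following sketch. It assumes that a mutant counts as killed when at least one MR is violated for at least one source test case; the text does not spell this rule out, and the three mutants below are hand-written Python stand-ins rather than µJava output.

```python
def mutant_killed(mutant, source_tests, mrs):
    """A mutant is killed if some MR is violated for some source test case."""
    for source in source_tests:
        source_output = mutant(list(source))
        for transform, relation in mrs:
            follow_up_output = mutant(transform(list(source)))
            if not relation(source_output, follow_up_output):
                return True
    return False


def kill_rate(mutants, source_tests, mrs):
    """Percentage of mutants killed by the given source tests and MRs."""
    killed = sum(mutant_killed(m, source_tests, mrs) for m in mutants)
    return 100.0 * killed / len(mutants)


# Two MRs for a summation method: Addition (+2) and a fixed permutation.
mrs = [
    (lambda xs: [x + 2 for x in xs], lambda s, f: f >= s),
    (lambda xs: list(reversed(xs)), lambda s, f: f == s),
]

# Hypothetical mutants of a method that sums a list.
mutants = [
    lambda xs: sum(xs[1:]),   # skips the first element: violates the permutation MR
    lambda xs: -sum(xs),      # negates the result: violates the Addition MR
    lambda xs: sum(xs) + 1,   # off by a constant: satisfies both MRs, so it survives
]

rate = kill_rate(mutants, [[3, 43, 1, 54]], mrs)
```

The surviving third mutant illustrates why the choice of MRs and source test cases both matter: a faulty program can still satisfy every relation that is checked.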
4 RESULTS AND DISCUSSION
4.1 Effectiveness of the Source Test Case
Generation Techniques
Figure 4 shows the overall mutant killing rates for the four source
test generation techniques. Among all test case generation
techniques, weak mutation performed best, killing 68.7% of the mutants.
Random tests killed 41.5% of the mutants. Table 1 lists the number
of methods for which each test generation technique reported the
highest mutant kill rate. For some methods, several source test
generation techniques gave the same best performance. Therefore,
Figure 5 shows a Venn diagram of all the possible logical relations
between the best performing source test generation techniques for
the set of methods. Weak mutation based test generation
reported the highest kill rate for 41 (53%) methods, whereas random
testing reported the highest kill rate for only 13 (17%) methods.
These results suggest that weak mutation based source
test case generation is more effective in detecting faults with MT.
Figure 4: Total % of mutants killed by each source test suite
generation technique.
Table 1: Number of methods with the highest mutant kill rate for
each source test generation technique.
Total Methods   Weak mutation   Line   Branch   Random
77              41              26     29       13
RQ1: Weak mutation based test suites have the highest
fault detection rate for the majority of the methods.
4.2 Fault Finding Effectiveness of Combined
Source Test Cases
To observe whether combining source test case generation
techniques achieves a higher fault detection rate, we combined the
best performing source test generation technique, weak mutation,
with the other source test generation techniques. Table 3 shows
the total percentage of mutants killed with each combined test
suite. The combination of weak mutation and random test cases has a
higher mutant kill rate (74.91%) than the combinations of weak
mutation with line (72.87%) or with branch (74.6%). If
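Table 2: Mutant kill rates and test suite sizes of all methods for each source test case generation technique. Each row gives the method name followed by four pairs of values, in the order Branch, Weak mutation, Line, Random; each pair is the kill rate (%) and the number of test cases.
Method name | Branch: kill rate (%), no. of test cases | Weak mutation: kill rate (%), no. of test cases | Line: kill rate (%), no. of test cases | Random: kill rate (%), no. of test cases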
add_values (Add elements in an array) 63.63 1 63.63 1 54.54 1 30 10
array_calc1 33.33 1 33.33 1 46.15 1 52.10 10
array_copy (Deep copy an array) 56.00 1 64.00 1 64.00 1 0.00 10
average ( Average of an array) 38.10 1 73.80 1 42.86 1 28.20 10
bubble (Implements bubble sort) 51.40 1 44.95 3 36.69 1 16.90 10
cnt_zeroes (Count zero in an array) 41.00 1 51.30 2 38.46 1 0.00 10
count_k (Occurrences of k in an array) 31.80 1 36.36 2 34.09 1 50.00 10
count_non_zeroes (Count non zero element in array) 41.00 1 48.71 2 51.28 1 22.20 10
dot_product 63.00 1 60.87 1 56.52 1 22.20 10
elementwise_max (Elementwise maximum) 46.30 2 68.51 3 83.33 2 0.00 10
elementwise_min (Elementwise minimum) 44.40 1 55.56 1 55.56 1 0.00 10
find_euc_dist (Euclidean distance between two vectors) 80.10 1 76.39 1 79.17 1 50 10
find_magnitude (Magnitude of a vector) 52.10 1 75.00 1 52.10 1 8.69 10
find_max (Find the maximum value) 70.80 1 50.00 1 50.00 1 70.90 10
find_max2 64.10 1 71.84 2 67.96 1 98.40 10
find_median (Find median value in an array) 48.70 2 98.93 3 41.71 2 53.10 10
find_min (Find minimum value in an array) 40.40 1 61.70 1 57.45 1 83.80 10
geometric_mean (Returns the geometric mean of the entries in the input array) 51.20 1 53.66 1 95.12 1 65.40 10
hamming_dist (Hamming distance between two vectors) 40.90 1 84.09 3 59.09 2 15.90 10
insertion_sort (Implements insertion sort) 43.60 1 42.55 2 37.23 1 32.65 10
manhattan_dist (Manhattan distance between two vectors) 53.30 1 61.36 2 53.30 1 0.00 10
mean_absolute_error (Measure of difference between two continuous variables) 37.50 1 41.07 2 39.29 1 0.00 10
selection_sort (Implements selection sort) 41.30 1 41.30 2 39.40 1 21.60 10
sequential_search (Finding a target value within a list) 37.20 2 25.58 3 30.23 2 37.50 10
set_min_val (Set array elements less than k equal to k) 51.20 2 58.14 2 30.23 1 100 10
shell_sort (Implements shell sort) 43.70 1 42.51 1 43.11 1 0.00 10
variance (Returns the variance from a standard deviation) 26.10 1 39.86 1 30.40 1 25.70 10
weighted_average (A mean calculated by giving values in a data set) 86.10 1 56.94 1 86.10 1 21.20 10
manhattanDistance (The distance between two points in a grid) 48.89 1 77.78 2 22.22 1 9.10 10
chebyshevDistance (Distance metric defined on a vector space) 39.08 2 43.68 5 35.63 2 2.00 10
tanimotoDistance (a proper distance metric) 30.21 2 32.97 5 44.50 2 5.60 10
errorRate 61.04 3 58.44 2 58.44 2 0.00 10
sum 50.00 1 77.78 1 50.00 1 35.30 10
distance1 (Compute the distance between the instance and another vector) 53.33 1 80.00 1 53.33 1 14.8 10
distanceInf (Compute the distance between the instance and another vector) 46.67 1 46.67 1 46.67 1 14.8 10
ebeadd (Creates an array whose contents will be the element-by-element addition of the arguments) 92.68 2 100.00 3 100.00 2 15.8 10
ebedivide (Creates an array whose contents will be the element-by-element division) 100.00 2 100.00 5 100.00 2 26.8 10
ebemultiply (Creates an array whose contents will be the element-by-element multiplication) 100.00 2 100.00 3 92.68 2 15 10
safeNorm (Returns the Cartesian norm ) 14.78 1 98.63 5 97.08 4 0.8 10
scale(Create a copy of an array scaled by a value) 48.72 1 58.97 3 53.85 1 47.8 10
entropy 88.42 1 88.42 2 88.42 1 42.9 10
g 93.55 2 95.16 2 93.55 1 20.9 10
calculateAbsoluteDifferences 60.98 1 60.98 1 60.98 1 0 10
evaluateHoners 46.03 1 79.37 1 47.62 1 80.4 10
evaluateInternal 95.25 1 93.47 2 95.55 1 90.6 10
evaluateNewton 80.00 1 65.71 1 64.29 1 76.8 10
meanDifference (Returns the mean of the (signed) differences) 40.00 1 80.00 1 40.00 1 40 10
equals 22.50 3 27.50 4 21.25 3 100 10
chiSquare (Implements Chi-Square test statistics) 96.41 2 96.41 2 96.41 2 65.6 10
partition 43.26 5 95.81 5 28.84 3 88.1 10
evaluateWeightedProduct 30.61 2 40.82 2 42.86 2 2 10
autoCorrelation (Returns the auto-correlation of a data sequence) 25.20 2 93.50 2 43.09 1 79.40 10
covariance (Returns the covariance of two data sequences) 24.84 1 23.57 1 23.57 1 86.70 10
durbinWatson (Durbin-Watson computation) 0.00 0 33.77 1 0.00 0 14.10 10
harmonicMean (Returns the harmonic mean of a data sequence) 74.00 1 74.00 1 76.00 1 42.50 10
kurtosis (Returns the kurtosis (aka excess) of a data sequence) 93.84 1 93.84 1 97.16 1 34.80 10
lag1 (Returns the lag-1 autocorrelation of a dataset) 99.55 1 32.70 1 89.55 1 33.70 10
max (Returns the largest member of a data sequence) 51.72 1 56.90 1 51.72 1 96.60 10
meanDeviation (Returns the mean deviation of a dataset) 54.39 1 33.33 1 28.07 1 78.30 10
min (Returns the smallest member of a data sequence) 67.41 1 81.03 2 70.69 1 96.60 10
polevl 94.23 2 88.46 1 88.46 2 45.50 10
pooledMean (Returns the pooled mean of two data sequences) 36.43 1 34.88 1 34.88 1 19.30 10
pooledVariance (Returns the pooled variance of two data sequences) 43.08 1 47.83 1 47.83 1 31.10 10
power 53.33 1 53.33 1 53.33 1 15.80 10
product (Returns the product) 50.00 1 50.00 1 50.00 1 94.70 10
quantile (Returns the phi-quantile) 40.13 2 40.76 2 32.48 2 40.00 10
sampleKurtosis ( Returns the sample kurtosis (aka excess) of a data sequence) 93.86 1 93.86 1 92.98 1 85.10 10
sampleSkew (Returns the sample skew of a data sequence) 89.47 1 89.47 1 97.37 1 89.50 10
sampleVariance (Returns the sample variance of a data sequence) 75.31 1 75.31 1 12.35 1 71.20 10
skew ( Returns the skew of a data sequence) 93.88 1 93.88 1 93.88 1 48.80 10
square 47.37 1 47.37 1 57.89 1 5.30 10
standardize (Modifies a data sequence to be standardized) 89.26 1 89.26 1 91.95 1 77.60 10
sumOfLogarithms ( Returns the sum of logarithms of a data sequence) 75.00 1 68.75 1 68.75 1 21.90 10
sumOfPowerOfDeviations 68.75 1 52.08 1 75.00 1 64.90 10
weightedMean (Returns the weighted mean of a data sequence) 77.46 1 77.46 1 77.46 1 65.00 10
weightedRMS (Returns the weighted RMS (Root-Mean-Square) of a data sequence) 86.96 1 86.96 1 86.96 1 43.30 10
winsorizedMean (Returns the winsorized mean of a sorted data sequence) 33.00 1 37.93 1 34.48 1 0.00 10
Figure 5: Venn diagram of all the combinations of source
test suites that performed best for each individual method.
we combine all three coverage-based strategies (weak mutation, line
and branch), the total percentage of killed mutants increases slightly
(75.98%), but other factors, such as the combined test suite size,
must also be considered.
Table 3: Total % of mutants killed after combining Weak Mutation
with Line Coverage, Branch Coverage, and Random Testing
Weak Mutation + Line (%)   Weak Mutation + Branch (%)   Weak Mutation + Line + Branch (%)   Weak Mutation + Random (%)
72.87                      74.6                         75.98                               74.91
RQ2: Combining weak mutation test cases with random
test cases leads to detecting more faults.
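Combining two source test suites means a mutant counts as killed if either suite kills it, so the combined kill rate is the size of the union of the killed sets. The sketch below illustrates this with hypothetical mutant identifiers; the numbers are not taken from the experiments.

```python
def combined_kill_rate(total_mutants, *killed_sets):
    """Kill rate (%) of the union of the mutants killed by each test suite."""
    killed = set().union(*killed_sets)
    return 100.0 * len(killed) / total_mutants


# Hypothetical example with 10 mutants, identified by the integers 1..10.
killed_by_weak_mutation = {1, 2, 3, 4, 5, 6}
killed_by_random = {5, 6, 7, 8}

weak_only = combined_kill_rate(10, killed_by_weak_mutation)
random_only = combined_kill_rate(10, killed_by_random)
combined = combined_kill_rate(10, killed_by_weak_mutation, killed_by_random)
```

The gain from combining depends on how many mutants are killed by only one of the suites, which is why a technique with a low individual kill rate can still be a useful complement.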
4.3 Fault Finding Effectiveness of Individual
MRs
To see how each source test case generation technique performs
with individual MRs, Figure 6 illustrates the percentage of mutants
killed by each of the six MRs separately using weak mutation, line
coverage, branch coverage and random test suites. Weak mutation has
the highest percentage of killed mutants for all six MRs. With the
multiplication and invertive MRs in particular, the weak mutation test
suite surpasses the others in mutant kill rate. Line coverage based
test suites were similar to weak mutation in killing mutants with the
addition, shuffle, inclusive and exclusive MRs. For the exclusive MR,
all the test suites performed almost similarly.
RQ3: Weak mutation killed the highest number of mutants
for all the MRs.
4.4 Impact of Source Test Suite Size
Table 4 compares the source test generation techniques in terms of the
total number of tests generated and the average and median test suite
sizes for the individual methods. In addition, the columns Smaller,
Equal, and Larger give the number of methods for which the weak
mutation test suite is smaller than, equal in size to, or larger than
the suite produced by the other source test case generation
technique. The p-value column shows the p-value computed using the
paired t-test between weak mutation and line, and between weak
mutation and branch. We do not compare random test suites
here, because we intentionally generated 10 random test cases for
each method. Weak mutation leads to larger test suites than branch
and line coverage: on average, the number of test cases produced
for weak mutation is larger than for branch and line coverage, and
the total number of test cases is also larger.
RQ4: Weak mutation generated a higher number of test
cases.
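For readers unfamiliar with the paired t-test, the sketch below computes its test statistic for two lists of per-method suite sizes. The sizes are hypothetical and only illustrate the calculation; obtaining the p-value additionally requires the t distribution with n - 1 degrees of freedom, which is left out here.

```python
import math
import statistics


def paired_t_statistic(sizes_a, sizes_b):
    """t = mean(d) / (stdev(d) / sqrt(n)) for the per-method differences d = a - b."""
    differences = [a - b for a, b in zip(sizes_a, sizes_b)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / standard_error


# Hypothetical suite sizes for six methods (not the data of Table 4).
weak_mutation_sizes = [1, 3, 2, 2, 1, 5]
line_sizes = [1, 1, 1, 2, 1, 2]

t_value = paired_t_statistic(weak_mutation_sizes, line_sizes)
```

The test is paired because both techniques are applied to the same methods, so each difference compares suite sizes for one method.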
5 THREATS TO VALIDITY
Threats to internal validity may result from the way the empirical
study was carried out. EvoSuite and our experimental setup have
been carefully tested, although testing cannot definitely prove the
absence of defects.
Threats to construct validity may occur because of the third-party
tools we used. The EvoSuite tool was used to generate
source test cases for the line, branch and weak mutation test
generation techniques. Further, we used the µJava mutation tool to
create mutants for our experiment. To minimize these threats, we
verified that the results produced by these tools are correct by
manually inspecting randomly selected outputs produced by each tool.
Threats to external validity were minimized by using 77 methods,
collected from classes in 4 different open source projects, as the
case study. This provides high confidence in the possibility of
generalizing our results to other open source software. We used only
the EvoSuite tool to generate test cases for our main experiment.
However, we also used the JCUTE [18] tool to generate branch coverage
based test suites for our initial case study and observed similar
results.
6 RELATED WORK
Most contributions on MT use either randomly generated test data or
existing test suites for the generation of source test cases. Not much
research has been done on the systematic generation of source test
cases for MT. Gotlieb and Botella [10] presented an approach called
Automated Metamorphic Testing, in which they translated the code into
an equivalent constraint logic program and tried to find test cases
that violate the MRs. Chen et al. [6] compared the effectiveness
of random testing and "special values" as source test cases for MT.
Special values are inputs for which the output is well known for a
particular method. Wu et al. [20] showed that random test cases are
more effective than test cases derived from "special
values". Segura et al. [17] also compared the effectiveness of random
testing with manually generated test suites for MT. Their results
showed that randomly generated test suites are more effective
in detecting faults than manually designed test suites. They also
observed that combining random testing with manual tests provides
better fault detection ability than random testing alone.
Batra and Sengupta [3] proposed a genetic algorithm to generate
test cases maximizing the paths traversed in the program under
Figure 6: % of Mutants killed by all six MRs using 4 test suite strategies (Branch, Line Coverage, Weak Mutation and Random)
Table 4: Test suite sizes for weak mutation, line coverage, branch coverage and random testing
Test suite      Total test cases   Average size   Median size   Std dev   Smaller   Equal   Larger   p-value
Weak mutation   135                1.75           1             1.13      -         -       -        -
Line            97                 1.26           1             0.67      1         45      31       3.102e-07
Branch          99                 1.29           1             0.59      2         49      26       1.375e-05
Random          770                10             10            0         77        0       0        -
test for MT. Chen et al. [
4
] also addressed the same problem from
a dierent prospective. They proposed partitioning the input do-
main of the PUT into multiple equivalence classes for MT. They
proposed an algorithm which will generate test cases which will
cover those equivalence classes. They were able to generate test
cases that provide high fault detection rate. Symbolic Execution
was used to construct MRs and their corresponding source test
cases by Dong and Zhang [
8
]. Program paths were rst analyzed
to generate symbolic inputs and then, these symbolic inputs were
used to construct MRs. In the nal step, source test cases were
generated by replacing the symbolic inputs with real values.
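The idea of covering equivalence classes of the input domain with source test cases can be sketched as follows. This is only an illustration under stated assumptions and not the algorithm of Chen et al. [4]: the partition in `EQUIVALENCE_CLASSES`, the helper `cover_classes`, and the candidate range are all hypothetical, and a simple greedy pass over randomly drawn candidates stands in for whatever selection procedure a real tool would use.

```python
import random

# Hypothetical partition of the input domain of a one-argument numeric PUT.
EQUIVALENCE_CLASSES = {
    "negative": lambda x: x < 0,
    "zero": lambda x: x == 0,
    "small_positive": lambda x: 0 < x <= 100,
    "large_positive": lambda x: x > 100,
}


def cover_classes(classes, candidates):
    """Greedily keep the first candidate that falls into each class."""
    chosen = {}
    for x in candidates:
        for name, belongs in classes.items():
            if name not in chosen and belongs(x):
                chosen[name] = x
    return chosen


# Candidate inputs: the boundary value 0 plus random integers (fixed seed).
rng = random.Random(1)
candidates = [0] + [rng.randint(-1000, 1000) for _ in range(500)]

# One source test case per equivalence class.
source_tests = cover_classes(EQUIVALENCE_CLASSES, candidates)
print(sorted(source_tests))
```

Each selected value would then serve as a source test case from which the MRs derive follow-up test cases.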
7 CONCLUSIONS & FUTURE WORK
In this study, we empirically evaluated the fault finding effectiveness
of four different source test case generation strategies for MT: line
coverage, branch coverage, weak mutation and random.
Our results show that weak mutation coverage based test generation
can be a more effective source test case generation technique
for MT than the other techniques. Our results also show that the
fault finding effectiveness of MT can be improved by combining
source tests generated for weak mutation coverage with randomly
generated source test cases.
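How combining two source test suites can raise the percentage of mutants killed under an MR can be sketched as below. Everything in this block is a hand-picked assumption for illustration: the three mutants of a `maximum` function, the permutation MR (reversing the input must not change the output), and the two one-test suites standing in for a coverage-derived suite and a random suite. None of it reproduces the mutants, MRs, or data used in this study.

```python
def maximum(xs):
    """Reference implementation of the hypothetical method under test."""
    return max(xs)


# Hypothetical mutants of maximum().
def mutant_first(xs):
    return xs[0]  # returns the first element instead of the maximum


def mutant_prefix(xs):
    return max(xs[:3])  # only inspects the first three elements


def mutant_abs(xs):
    return max(abs(x) for x in xs)  # still permutation-invariant


MUTANTS = {"first": mutant_first, "prefix": mutant_prefix, "abs": mutant_abs}


def violates_permutation_mr(f, source):
    """MR: reversing the input list must not change the output."""
    follow_up = list(reversed(source))
    return f(source) != f(follow_up)


def killed(mutants, suite):
    """Names of mutants for which at least one source test violates the MR."""
    return {name for name, m in mutants.items()
            if any(violates_permutation_mr(m, t) for t in suite)}


suite_a = [[3, 1, 2]]        # stand-in for a coverage-derived source suite
suite_b = [[4, 9, 1, 2, 4]]  # stand-in for a random source suite

killed_a = killed(MUTANTS, suite_a)
killed_b = killed(MUTANTS, suite_b)
killed_ab = killed(MUTANTS, suite_a + suite_b)
score_ab = 100 * len(killed_ab) / len(MUTANTS)
print(killed_a, killed_b, killed_ab, score_ab)
```

In this toy setting each suite kills a different mutant, so the combined suite kills both, while `mutant_abs` survives because the permutation MR cannot distinguish it from the original.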
Further, in this paper we introduce an MT tool called "METtester."
We plan to incorporate the investigated automated source test
generation techniques into this tool. We also plan to extend the
current case study to larger code bases and experiment with more
source test generation techniques such as adaptive random test
generation and data flow based test generation.
ACKNOWLEDGMENTS
This work is supported by award number 1656877 from the National
Science Foundation. Any opinions, findings and conclusions
or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect those of the National Science
Foundation.
REFERENCES
[1] A. Arcuri. 2010. It Does Matter How You Normalise the Branch Distance in Search Based Software Testing. In 2010 Third International Conference on Software Testing, Verification and Validation. 205–214. https://doi.org/10.1109/ICST.2010.17
[2] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo. 2015. The Oracle Problem in Software Testing: A Survey. IEEE Transactions on Software Engineering 41, 5 (May 2015), 507–525. https://doi.org/10.1109/TSE.2014.2372785
[3] Gagandeep Batra and Jyotsna Sengupta. 2011. An Efficient Metamorphic Testing Technique Using Genetic Algorithm. In Information Intelligence, Systems, Technology and Management, Sumeet Dua, Sartaj Sahni, and D. P. Goyal (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 180–188.
[4] Leilei Chen, Lizhi Cai, Jiang Liu, Zhenyu Liu, Shiyan Wei, and Pan Liu. 2012. An optimized method for generating cases of metamorphic testing. In 2012 6th International Conference on New Trends in Information Science, Service Science and Data Mining (ISSDM2012). 439–443.
[5] Tsong Yueh Chen. 2015. Metamorphic Testing: A Simple Method for Alleviating the Test Oracle Problem. In Proceedings of the 10th International Workshop on Automation of Software Test (AST ’15). IEEE Press, Piscataway, NJ, USA, 53–54. http://dl.acm.org/citation.cfm?id=2819261.2819278
[6] Tsong Yueh Chen, Fei-Ching Kuo, Ying Liu, and Antony Tang. 2004. Metamorphic Testing and Testing with Special Values. In 4th IEEE International Workshop on Source Code Analysis and Manipulation (SCAM 2004), 15-16 September 2004, Chicago, IL, USA. 128–134.
[7] T. Y. Chen, F. C. Kuo, D. Towey, and Z. Q. Zhou. 2012. Metamorphic Testing: Applications and Integration with Other Methods: Tutorial Synopsis. In 2012 12th International Conference on Quality Software. 285–288. https://doi.org/10.1109/QSIC.2012.21
[8] Guowei Dong, Tao Guo, and Puhan Zhang. 2013. Security assurance with program path analysis and metamorphic testing. In 2013 IEEE 4th International Conference on Software Engineering and Service Science. 193–197. https://doi.org/10.1109/ICSESS.2013.6615286
[9] Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic Test Suite Generation for Object-oriented Software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE ’11). ACM, New York, NY, USA, 416–419. https://doi.org/10.1145/2025113.2025179
[10] A. Gotlieb and B. Botella. 2003. Automated metamorphic testing. In Proceedings 27th Annual International Computer Software and Applications Conference. COMPAC 2003. 34–40. https://doi.org/10.1109/CMPSAC.2003.1245319
[11] B. Korel. 1990. Automated Software Test Data Generation. IEEE Trans. Softw. Eng. 16, 8 (Aug. 1990), 870–879. https://doi.org/10.1109/32.57624
[12] Yu-Seung Ma, Jeff Offutt, and Yong Rae Kwon. 2005. MuJava: An Automated Class Mutation System: Research Articles. Softw. Test. Verif. Reliab. 15, 2 (June 2005), 97–133. https://doi.org/10.1002/stvr.v15:2
[13] Phil McMinn. 2004. Search-based Software Test Data Generation: A Survey: Research Articles. Softw. Test. Verif. Reliab. 14, 2 (June 2004), 105–156. https://doi.org/10.1002/stvr.v14:2
[14] Christian Murphy, Gail Kaiser, Lifeng Hu, and Leon Wu. 2008. Properties of Machine Learning Applications for Use in Metamorphic Testing. 867–872.
[15] ps073006 and Upulee Kanewala. 2018. MSU-STLab/METtester 1.0.0. (Jan. 2018). https://doi.org/10.5281/zenodo.1157183
[16] José Miguel Rojas, José Campos, Mattia Vivanti, Gordon Fraser, and Andrea Arcuri. 2015. Combining Multiple Coverage Criteria in Search-Based Unit Test Generation. Springer International Publishing, Cham, 93–108. https://doi.org/10.1007/978-3-319-22183-0_7
[17] Sergio Segura, Robert M. Hierons, David Benavides, and Antonio Ruiz-Cortés. 2011. Automated metamorphic testing on the analyses of feature models. Information and Software Technology 53, 3 (2011), 245–258. https://doi.org/10.1016/j.infsof.2010.11.002
[18] Koushik Sen and Gul Agha. 2006. CUTE and jCUTE: Concolic Unit Testing and Explicit Path Model-Checking Tools. In CAV, Thomas Ball and Robert B. Jones (Eds.). 419–423.
[19] Elaine Weyuker. 1982. On Testing Non-Testable Programs. 25 (11 1982).
[20] Peng Wu, SHI Xiao-Chun, TANG Jiang-Jun, and LIN Hui-Min. 2005. Metamorphic Testing and Special Case Testing: A Case Study. 16 (07 2005).