Using Metamorphic Relations to Improve The
Effectiveness of Automatically Generated Test
Cases
1st Prashanta Saha
School of Computing
Montana State University
Bozeman, Montana, USA
prashantasaha@montana.edu
2nd Upulee Kanewala
School of Computing
University of North Florida
Jacksonville, Florida, USA
upulee.kanewala@unf.edu
Abstract—Automated test case generation has helped to reduce
the cost of testing. However, developing effective test oracles for
these automatically generated test cases still remains a challenge.
Metamorphic testing (MT) has become a well-known software
testing approach over the years. This technique can
effectively alleviate the oracle problem by using
metamorphic relations (MRs) to determine whether a test case
passes or fails. In this work, we conduct an empirical study
on an open source linear algebra library to evaluate whether
MRs can be utilized to improve the fault detection effectiveness
of automatically generated test cases. Our experiment suggests
that MRs can help to improve the fault detection effectiveness of
automatically generated test cases.
Index Terms—Metamorphic testing, metamorphic relation,
developer test suite, coverage based test suite, automated test
case generation, mutation analysis
I. INTRODUCTION
Software testing is an integral part of the software development
life cycle. Typically, testing is a costly activity, yet it is essential
for detecting faults. As a means of reducing this cost, a great
deal of work has been done on automated test case generation,
including the development of publicly available tools [1].
Automatically generated test suites have certain advantages
over manually written test cases, in particular, saving human
labor and time. Some work has shown that it is more effective
to use test cases that are generated based on some coverage
criteria rather than randomly generated test cases [2]. The
main focus of automated test generation work thus far has
been to develop efficient methods to generate test inputs to
achieve a certain target such as coverage. However, relatively
little attention has been paid to utilizing effective test
oracles that can improve the fault detection effectiveness of
these automatically generated test cases. A test oracle is used
to check whether the output produced for a given test case
is correct or not [3]. In fact, due to the automated nature
of generating test inputs, defining the oracles for these test
inputs is a hard problem. Thus, many of the automatically
generated test cases would contain trivial oracles, such as the
assert statements that we will discuss later. This reduces the
fault detection effectiveness of these test cases.
For example, consider the matrix Power function shown in
Listing 1, which returns a new matrix that is the nth power of
the current matrix. Figure 1 (Left) shows a test case generated
by EvoSuite [1] (a test case generation tool) for this function.
This test case covers 90% of the statements of the Power function
in the Matrix class. Lines 13 to 21 are the assertions
generated by EvoSuite that serve as the oracles for this test
case. It is easy to note that these assert statements only check
trivial properties of the output, such as the number of rows
and columns of the output matrix and the non-nullness of the
outputs. Thus, these assertions do not check the accuracy of the
underlying calculation, which is computing the nth power of a matrix.
public Matrix power(int n) {
    if (n < 0) {
        fail("The exponent should be positive: "
            + n + ".");
    }

    // Start from the identity matrix of the same shape.
    Matrix result = blankOfShape(rows, rows);
    Matrix that = this;

    for (int i = 0; i < rows; i++) {
        result.set(i, i, 1.0);
    }

    // Exponentiation by squaring.
    while (n > 0) {
        if (n % 2 == 1) {
            result = result.multiply(that);
        }

        n /= 2;
        that = that.multiply(that);
    }

    return result;
}
Listing 1. Power Function from La4j Matrix Class
Metamorphic Testing (MT) is a technique proposed to
alleviate the oracle problem of software under test (SUT) [4].
This is based on the idea that, most of the time, it is easier
to develop relations between multiple inputs and outputs of
a program than to specify the values of individual outputs.
For example, consider a program that computes the average
of a list of real numbers. It is hard to correctly predict
1  @Test(timeout = 4000)
2  public void test042() throws Throwable {
3      MockRandom mockRandom0 = new MockRandom();
4      assertNotNull(mockRandom0);
5
6      DenseMatrix denseMatrix0 =
7          DenseMatrix.randomSymmetric(0, mockRandom0);
8      assertEquals(0, denseMatrix0.columns());
9      assertEquals(0, denseMatrix0.rows());
10     assertNotNull(denseMatrix0);
11
12     Matrix matrix0 = denseMatrix0.power(1293);
13     assertNotSame(denseMatrix0, matrix0);
14     assertNotSame(matrix0, denseMatrix0);
15     assertEquals(0, denseMatrix0.columns());
16     assertEquals(0, denseMatrix0.rows());
17     assertEquals(0, matrix0.rows());
18     assertEquals(0, matrix0.columns());
19     assertTrue(matrix0.equals((Object)
20         denseMatrix0));
21     assertNotNull(matrix0);
22 }
1  @Test(timeout = 4000)
2  public void test042() throws Throwable {
3      MockRandom mockRandom0 = new MockRandom();
4      assertNotNull(mockRandom0);
5
6      DenseMatrix denseMatrix0 =
7          DenseMatrix.randomSymmetric(0, mockRandom0);
8      assertEquals(0, denseMatrix0.columns());
9      assertEquals(0, denseMatrix0.rows());
10     assertNotNull(denseMatrix0);
11
12     Matrix matrix0 = denseMatrix0.power(1293);
13
14     // Matrix Multiplication - MR
15     Matrix matrix1 =
16         denseMatrix0.multiply(denseMatrix0);
17     matrix1 = matrix1.power(1293);
18     assertTrue(matrix0.equals((Object) matrix1));
19 }
Fig. 1. (Left) EvoSuite Generated Test Case, (Right) Modified Test Case with MR in MT
the observed output when the input list has millions of real
numbers. However, we can permute the list of real numbers
and check if the returned output matches the previous output. If
the outputs do not match, then there is a fault in the program.
The property specifying that when the elements of the input
are randomly permuted the output should remain the same
is called an MR, which is a necessary property of the SUT
and specifies a relationship between multiple inputs and their
outputs [5].
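As a concrete illustration, the following minimal sketch (our own example, not code from the paper's subject programs; the average function and the input values are hypothetical) checks this permutation MR in plain Java:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AveragePermutationMR {

    // Program under test: computes the average of a list of real numbers.
    static double average(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        // Source test case: no precomputed expected output is needed.
        List<Double> source = Arrays.asList(3.0, 1.5, 7.25, 0.5);
        double sourceOutput = average(source);

        // Follow-up test case: randomly permute the same elements.
        List<Double> followUp = new ArrayList<>(source);
        Collections.shuffle(followUp);
        double followUpOutput = average(followUp);

        // MR check: permuting the input must not change the average.
        if (Math.abs(sourceOutput - followUpOutput) > 1e-9) {
            throw new AssertionError("Permutation MR violated: fault detected");
        }
    }
}

Note that the check compares two outputs of the program against each other, so no expected average value is required.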
In this work, we investigate whether we can utilize MRs
to improve the fault detection effectiveness of automatically
generated test cases. MRs provide an effective method to
overcome the oracle problem in automatically generated test
cases and verify the underlying calculations. For example,
the following is a Matrix Multiplication MR that should be
satisfied by the program in Listing 1: multiplying the input
matrix by another matrix of the same size yields a follow-up
input whose output is expected to equal the output for the
original input matrix. To incorporate the checking of this
MR, we modified the test case as shown in Figure 1 (Right):
1) We multiplied the source test case matrix (denseMatrix0)
by the same matrix (denseMatrix0) to generate the
follow-up test case matrix (matrix1). (In lines 15-16)
2) Next, we executed the follow-up test case matrix on
the subject program (the Power function). (In line 17)
3) Finally, using the assertTrue JUnit assertion, we
compared the output of the source test case (matrix0)
with the output of the follow-up test case (matrix1),
expecting the resulting matrices from the two test
cases to be equal. (In line 18)
In this paper, we present the results of an empirical study
conducted to evaluate the effectiveness of utilizing MRs with
automatically generated test inputs. To this end, we generated
coverage-based test suites (line, branch, and weak mutation
coverage) using EvoSuite for several open-source software
systems that implement matrix calculations and utilized MRs
to augment these automatically generated test cases. Our
results show that MRs can help to increase the effectiveness
of automatically generated test suites and, in some cases,
achieve fault detection effectiveness similar to that of the
developer-written test suites.
II. BACKGROUND
A. Metamorphic Testing
The following is the typical process for applying MT:
1) Identify MRs from the specification of the SUT. An
MR $R(x_1, x_2, \ldots, x_n, f(x_1), f(x_2), \ldots, f(x_n))$ is a
necessary property of the SUT and is specified over the
inputs $x_1, x_2, \ldots, x_n$ and their corresponding outputs
$f(x_1), f(x_2), \ldots, f(x_n)$.
2) Generate the source test inputs $x_1, x_2, \ldots, x_k$ and execute
them on the SUT.
3) Construct the follow-up test inputs $x_{k+1}, x_{k+2}, \ldots, x_n$
by applying the transformation specified by $R$ to
$x_1, x_2, \ldots, x_k, f(x_1), f(x_2), \ldots, f(x_k)$, and execute
them.
4) Verify whether $R$ is satisfied by the obtained
$x_1, x_2, \ldots, x_n, f(x_1), f(x_2), \ldots, f(x_n)$ from executing the
SUT. If $R$ is not satisfied, then the MR has revealed a fault
in the SUT.
Step 1 (identification of MRs) is typically done based on
knowledge of the program. Recently, there has been work
toward automating MR identification [6]–[8]. In step 2
(generation of source test cases), any test case
generation technique can be applied. Previous studies have
used special case [9] and random testing [10] techniques
to generate source test cases. Further, previous studies have
shown that using coverage-based test inputs as source inputs
improves the fault detection effectiveness of MT compared
to random test inputs [11]. As shown in the above process,
since MT checks the relationship between the inputs and
outputs of a test program, we can use this technique when the
expected results of individual test inputs are unknown.

Fig. 2. Illustration of Metamorphic Testing. [Figure: a source test case
matrix and a follow-up test case matrix, obtained via the Addition With
Identity Matrix MR, are each executed on the program under test
(Matrix.java); the MR check requires the output of the follow-up test
case to be >= the output of the source test case.]
The following example illustrates the MT process. In
Figure 2, the Java method Transpose from the Matrix.java class is
used to show how source and follow-up test cases are executed
on a program under test (PUT). The Transpose method transposes a matrix
and returns the transposed matrix. The source test case, $a =
\{(0.0, 14.2, 0.0, 4.0), (0.0, 5.0, 10.0, 0.0), (0.0, 3.0, 0.0, 2.3),
(11.0, 7.0, 0.0, 1.0)\}$, is developer generated and executed on the
Transpose method. The output for this source test case is
matrix $= \{(0.0, 0.0, 0.0, 11.0), (14.2, 5.0, 3.0, 7.0), (0.0, 10.0,
0.0, 0.0), (4.0, 0.0, 2.3, 1.0)\}$. For this program, when an
identity matrix is added to the input, the output should
not decrease. This will be used as an MR to conduct MT on this
PUT. An identity matrix of size 4 is added to the input matrix to
create a follow-up test case $a' = \{(1.0, 14.2, 0.0, 4.0),
(0.0, 6.0, 10.0, 0.0), (0.0, 3.0, 1.0, 2.3), (11.0, 7.0, 0.0, 2.0)\}$,
which is then executed on the PUT. The output for this follow-up test
case is matrix $= \{(1.0, 0.0, 0.0, 11.0), (14.2, 6.0, 3.0, 7.0),
(0.0, 10.0, 1.0, 0.0), (4.0, 0.0, 2.3, 2.0)\}$. To satisfy this MR,
the sum of the elements of the follow-up output should be greater
than or equal to the sum of the elements of the source output.
We calculate the sums of the elements of the source and follow-up
output matrices, which are 57.5 and 61.5, respectively. Since 61.5
is greater than 57.5, the considered MR is satisfied for this given
pair of source and follow-up test cases.
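A minimal, self-contained sketch of this check (using plain 2D arrays and our own transpose and sum helpers rather than the la4j API) could look as follows:

public class TransposeIdentityAdditionMR {

    // Program under test: returns the transpose of a matrix.
    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                t[j][i] = a[i][j];
        return t;
    }

    // Sums all elements of a matrix.
    static double sum(double[][] a) {
        double s = 0.0;
        for (double[] row : a)
            for (double v : row)
                s += v;
        return s;
    }

    public static void main(String[] args) {
        double[][] source = {
            {  0.0, 14.2,  0.0, 4.0 },
            {  0.0,  5.0, 10.0, 0.0 },
            {  0.0,  3.0,  0.0, 2.3 },
            { 11.0,  7.0,  0.0, 1.0 }
        };

        // Follow-up input: add the 4x4 identity matrix to the source input.
        double[][] followUp = new double[4][4];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                followUp[i][j] = source[i][j] + (i == j ? 1.0 : 0.0);

        double sourceSum = sum(transpose(source));     // 57.5
        double followUpSum = sum(transpose(followUp)); // 61.5

        // MR check: the follow-up output sum must not be smaller.
        if (followUpSum < sourceSum) {
            throw new AssertionError("Addition-with-identity MR violated");
        }
    }
}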
Previous studies [12]–[15] have consistently shown that
metamorphic testing has many advantages. First and foremost,
MT provides a test result verification mechanism in the
absence of an oracle: the test results are verified against a set
of MRs instead of an oracle. Besides, most MRs are conceptually
simple, so it is convenient to verify test results automatically
using simple scripts.
B. Automated Test Case Generation
In this work, we used EvoSuite [1] as the automated test
case generation tool. It automatically produces test cases
targeting high coverage, such as line, branch, and weak
mutation coverage. EvoSuite uses an evolutionary search
approach that evolves whole test suites with respect to an entire
coverage criterion at the same time.
III. EMPIRICAL EVALUATION
A. Research Questions
We conducted a set of experiments to answer the following
research questions:
1) RQ1: Can MRs be utilized to improve the fault detection
effectiveness of automatically generated test cases?
2) RQ2: How do the improved automatically generated
test cases compare with test suites created by developers
in terms of fault detection effectiveness?
3) RQ3: How does the effectiveness of the MRs vary
compared to automatically generated test cases?
B. Subject Programs
In this experiment, we used four classes from the la4j1 (version
0.6.0) open-source Java library. la4j is a linear algebra library
that provides matrix and vector implementations and algorithms;
it was one of the software packages used for evaluating
the performance of automated testing tools [1]. We used the
following Java classes from la4j in this study:
•Matrix.java: This class has methods to perform matrix
operations. We picked 20 methods to conduct our
experiment on. The description of these 20 methods is
available in the GitHub repository2.
•LeastSquaresSolver.java: This class implements the least
squares approximation of linear functions to data. In this
algorithm, QR decomposition is applied to approximate
the equation Ax = b.
•ForwardBackSubstitutionSolver.java: This class represents
the process of solving a system of linear algebraic
equations using the forward-back substitution method. This
algorithm solves LUx = b, where L is lower
triangular with units on the diagonal, U (= DV) is
upper triangular, and b is a given vector.3
1http://la4j.org/
2https://github.com/ps073006/ConfRepo
3https://algowiki-project.org/en/Forward substitution
•SquareRootSolver.java: This class represents the square
root method for solving linear systems.4 This algorithm
solves the matrix equation $Au = g$ for $u$, where $A$ is a
$p \times p$ symmetric matrix and $g$ is a given vector.
4http://mathworld.wolfram.com/SquareRootMethod.html
C. MR Identification
We developed the following 10 MRs for testing the func-
tions in the Matrix.java class. Not all these MRs are satisfied
by each of these functions. Total list of these functions and the
specific MRs satisfied by them can be found in this GitHub
repository2. (In all cases we assume that Matrix A comprises
only non-negative numbers).
•MR1 - Scalar Addition: Let $A$ be the initial input matrix
to a program $P$, and let $b$ be a positive scalar. Let $A'$ be the
follow-up input matrix, where $A'_{i,j} = b + A_{i,j}$ for all $i, j$.
Let the output of $P$ for $A$ be $O$ (i.e., $P(A) = O$)
and $P(A') = O'$. Then the expected output relation is
$\sum_{i,j} O'_{i,j} \geq \sum_{i,j} O_{i,j}$.
•MR2 - Addition With Identity Matrix: Let $A$ be the
initial input matrix to a program $P$, and let $I$ be an identity
matrix. Let $A'$ be the follow-up input matrix, where $A' = I + A$.
Let $P(A) = O$ and $P(A') = O'$. Then
the expected output relation is $\sum_{i,j} O'_{i,j} \geq \sum_{i,j} O_{i,j}$.
•MR3 - Scalar Multiplication: Let $A$ be the initial input
matrix to a program $P$, and let $b$ be a positive scalar. Let $A'$
be the follow-up input matrix, where $A' = b \cdot A$.
Let $P(A) = O$ and $P(A') = O'$. Then the expected
output relation is $\sum_{i,j} O'_{i,j} \geq \sum_{i,j} O_{i,j}$.
•MR4 - Multiplication With Identity Matrix: Let $A$ be
the initial input matrix to a program $P$, and let $I$ be an
identity matrix. Let $A'$ be the follow-up input matrix, where
$A' = I \cdot A$. Let $P(A) = O$ and $P(A') = O'$.
Then the expected output relation is $\sum_{i,j} O'_{i,j} = \sum_{i,j} O_{i,j}$.
•MR5 - Transpose: Let $A$ be the initial input matrix
to a program $P$. Let $A'$ be the follow-up input matrix,
where $A' = A^T$, i.e., $A'_{i,j} = A_{j,i}$. Let $P(A) = O$
and $P(A') = O'$. Then the expected output relation is
$\sum_{i,j} O'_{i,j} = \sum_{i,j} O_{i,j}$.
•MR6 - Matrix Addition: Let $A$ be the initial input
matrix to a program $P$. Let $A'$ be the follow-up input
matrix, where $A' = A + A$. Let $P(A) = O$
and $P(A') = O'$. Then the expected output relation is
$\sum_{i,j} O'_{i,j} \geq \sum_{i,j} O_{i,j}$.
•MR7 - Matrix Multiplication: Let $A$ be the initial input
matrix to a program $P$. Let $A'$ be the follow-up input
matrix, where $A' = A \cdot A$. Let $P(A) = O$
and $P(A') = O'$. Then the expected output relation is
$\sum_{i,j} O'_{i,j} \geq \sum_{i,j} O_{i,j}$.
•MR8 - Permute Column: Let $A$ be the initial input
matrix to a program $P$ with columns $j = 1, 2, 3, \ldots, n$.
Let $A'$ be the follow-up input matrix after permuting the
column positions of $A$. Let $P(A) = O$ and $P(A') = O'$.
Then the expected output relation is $\sum_{i,j} O'_{i,j} = \sum_{i,j} O_{i,j}$.
•MR9 - Permute Row: Let $A$ be the initial input matrix
to a program $P$ with rows $i = 1, 2, 3, \ldots, n$. Let $A'$ be the
follow-up input matrix after permuting the row positions
of $A$. Let $P(A) = O$ and $P(A') = O'$. Then the expected
output relation is $\sum_{i,j} O'_{i,j} = \sum_{i,j} O_{i,j}$.
•MR10 - Permute Element: Let $A$ be the initial input
matrix to a program $P$ with columns $j = 1, 2, 3, \ldots, n$
and rows $i = 1, 2, 3, \ldots, n$; the numbers of rows and columns
have to be the same. Let $A'$ be the follow-up input matrix after
swapping element $A_{i,n}$ with element $A_{n,j}$. Let $P(A) =
O$ and $P(A') = O'$. Then the expected output relation is
$\sum_{i,j} O'_{i,j} = \sum_{i,j} O_{i,j}$.
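As an illustration of how a follow-up input is constructed, the helper below (a minimal sketch of our own, not part of the subject programs) builds the MR8 follow-up matrix by reordering columns:

// Builds the MR8 follow-up input by permuting the columns of a.
// perm[j] gives the index of the source column placed at position j.
static double[][] permuteColumns(double[][] a, int[] perm) {
    int rows = a.length, cols = a[0].length;
    double[][] followUp = new double[rows][cols];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            followUp[i][j] = a[i][perm[j]];
    return followUp;
}

The source and follow-up matrices are then run through the same function, and the two output sums are compared according to the MR8 relation.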
We also identified 6 MRs for the 3 solver classes. In these 3
classes, one matrix and one vector act as the parameters
of the source test case. The particular method we tested
in these 3 classes is the Solve method. These 6 MRs are explained
below:
•MR11 - Multiplication: Let $A$ be the input matrix and $v$
be an input vector to a program $P$. Their source test
execution output is the vector $P(A, v) = O$. After multiplying
both by a positive scalar constant $b$, the follow-up matrix
is $A' = b \cdot A$ and the follow-up vector is $v' = b \cdot v$.
Their follow-up output is the vector $P(A', v') = O'$.
Then the expected output relation is $\sum_i O'_i = \sum_i O_i$.
•MR12 - Permute Row Element: Let $A$ be the input
matrix with rows $i = 1, 2, 3, \ldots, n$ and $v$ be an input vector
with elements $i = 1, 2, 3, \ldots, n$ to a program $P$. Their test
execution output is the vector $P(A, v) = O$. The follow-up
matrix $A'$ is obtained by permuting the row positions of $A$, and
the follow-up vector $v'$ by permuting the elements of $v$ in the
same way. Their executed test output is the vector $P(A', v') = O'$.
Then the expected output relation is $\sum_i O'_i = \sum_i O_i$.
•MR13 - Matrix Vector Addition: Let $A$ be an input
matrix with rows $i = 1, 2, 3, \ldots, n$ and $v$ be an input
vector with elements $i = 1, 2, 3, \ldots, n$ to a program $P$.
Their test execution output is the vector $P(A, v) = O$. The
follow-up matrix is $A' = A + A$ and the follow-up
vector is $v' = v + v$. Their executed test output is the
vector $P(A', v') = O'$. Then the expected output relation is
$\sum_i O'_i = \sum_i O_i$.
•MR14 - Multiplication With Transpose Matrix: Let $A$
be an input matrix with rows $i = 1, 2, 3, \ldots, n$ and $v$ be an
input vector with elements $i = 1, 2, 3, \ldots, n$ to a program
$P$. Their test execution output is the vector $P(A, v) = O$.
The follow-up matrix is $A' = A^T \cdot A$ and
the follow-up vector is $v' = A^T \cdot v$. Their executed test output is
the vector $P(A', v') = O'$. Then the expected output relation is
$\sum_i O'_i = \sum_i O_i$.
•MR15 - Multiplication With Identity Matrix: Let $A$ be
an input matrix with rows $i = 1, 2, 3, \ldots, n$ and $v$ be an input
vector with elements $i = 1, 2, 3, \ldots, n$ to a program $P$.
Their test execution output is the vector $P(A, v) = O$. The
follow-up matrix is $A' = I \cdot A$, with $I$ an
identity matrix, and the follow-up vector is $v' = v$. Their executed
test output is the vector $P(A', v') = O'$. Then the expected
output relation is $\sum_i O'_i = \sum_i O_i$.
•MR16 - Multiplication With Negative: Let $A$ be an input
matrix and $v$ be an input vector to a program $P$. Their
test execution output is the vector $P(A, v) = O$. For the
follow-up test input, we multiply both by a negative constant $b$,
so the follow-up matrix is $A' = b \cdot A$
and the follow-up vector is $v' = b \cdot v$. Their executed test output is
the vector $P(A', v') = O'$. Then the expected output relation is
$\sum_i O'_i = \sum_i O_i$.
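To make the solver MRs concrete, the sketch below applies MR11 to a tiny 2x2 solver of our own (Cramer's rule stands in for the la4j Solve method, whose API we do not reproduce here):

public class SolverScalarMultiplicationMR {

    // Stand-in program under test: solves Ax = v for a 2x2 system.
    static double[] solve2x2(double[][] a, double[] v) {
        double det = a[0][0] * a[1][1] - a[0][1] * a[1][0];
        return new double[] {
            (v[0] * a[1][1] - v[1] * a[0][1]) / det,
            (a[0][0] * v[1] - a[1][0] * v[0]) / det
        };
    }

    public static void main(String[] args) {
        double[][] a = { { 2.0, 1.0 }, { 1.0, 3.0 } };
        double[] v = { 5.0, 10.0 };
        double b = 4.0; // positive scalar constant

        double[] sourceOut = solve2x2(a, v);

        // Follow-up input: scale both the matrix and the vector by b.
        double[][] a2 = new double[2][2];
        double[] v2 = new double[2];
        for (int i = 0; i < 2; i++) {
            v2[i] = b * v[i];
            for (int j = 0; j < 2; j++) a2[i][j] = b * a[i][j];
        }
        double[] followUpOut = solve2x2(a2, v2);

        // MR11 check: scaling A and v by the same scalar leaves the
        // solution (and hence its element sum) unchanged.
        for (int i = 0; i < 2; i++) {
            if (Math.abs(sourceOut[i] - followUpOut[i]) > 1e-9) {
                throw new AssertionError("MR11 violated: fault detected");
            }
        }
    }
}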
D. Automated Test Case Generation
For each of the classes mentioned above, we used the
EvoSuite [1] command-line tool to generate test cases targeting
line, branch, and weak mutation coverage. In this experiment,
we used EvoSuite v1.0.6. On the command line, for the assertion
strategy parameter we used all available strategies
(e.g., Mutation, Unit). A test assertion is a predicate that
compares some aspect of the observed behavior of a function
against the expected behavior. Some types of test assertions
are related to arrays and standard container classes. We ran
each coverage criterion (i.e., line, branch, and weak mutation)
separately using the criterion parameter. Based on the coverage
criterion, EvoSuite generated separate .java files with all
the JUnit test cases. We ran the generated JUnit test cases against
the original programs, which produced a pass/fail report of the
test cases. From that report, we removed test cases that were
checking undeclared exceptions, e.g., NullPointerException and
IllegalArgumentException. It is not feasible to apply MT to
those test cases since they throw exceptions. In Table I
we list our test suite sizes separately for all classes.
The EvoSuite column gives the total number of test cases
generated across line, branch, and weak mutation coverage.
TABLE I
CLASSES WITH EVOSUITE TEST SUITE & DEVELOPER TEST SUITE

Class name                          EvoSuite  Developer
Matrix.java                            37        40
LeastSquaresSolver.java                 4        11
ForwardBackSubstitutionSolver.java      8         7
SquareRootSolver.java                   5         6
E. Utilizing MRs to Modify Automatically Generated Test
Cases
1) To generate the follow-up test cases, we modified the
automatically generated test cases based on each MR. We
modified the EvoSuite-generated .java files with follow-up
test cases (Fig. 1 (Right)). This follow-up test case
insertion process was completed manually. (lines 15-17)
2) We also added a new assert statement to compare the
source and follow-up test outputs. (line 18)
3) We executed those source and follow-up test cases on
the original programs and verified the MR properties.
If an MR property did not hold for a test input, we
excluded that MR for that particular input.
F. Evaluation Approach
We used mutation testing to measure the fault detection
effectiveness of the automatically generated test cases and the
test cases enhanced with MRs. Mutation testing [16] is a
fault-based testing technique that measures the effectiveness
of test cases, and many experiments suggest that mutants act
as a proxy for real faults when comparing testing techniques [17].
Briefly, the technique works as follows. First, mutants are
created by seeding faults in a program: by applying
syntactic changes to its source code, new faulty versions of
the original program are generated. Each syntactic change
is determined by an operator called a mutation operator. Test
cases are then executed on the faulty and original versions
of the program, and it is checked whether they produce different
responses. If the test output of the mutant differs from that of the
original program, we say the mutant is killed; otherwise,
the mutant remains alive. When a mutant is syntactically
different but semantically identical to the original program,
it is referred to as an equivalent mutant. There are four common
equivalent mutant situations: the mutant cannot be triggered,
the mutant is generated from dead code, the mutant only
alters internal states, and the mutant only improves speed.
Deciding whether the original program and a mutated program
are equivalent is undecidable [18]. The percentage of killed
mutants with respect to the total number of non-equivalent
mutants provides an adequacy measurement of the test suite,
which is called the mutation score. Another adequacy metric,
used to measure the efficiency of MRs, is the fault detection
ratio [9]. In MT, the fault detection ratio is the ratio of the
number of metamorphic tests in which an MR detects a fault to
the total number of metamorphic tests generated from the
source test cases.
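As a hypothetical illustration of a single mutant (not one actually generated by PIT in this study), a mutation operator might flip a relational operator in the Power function of Listing 1:

// Original:  while (n > 0) { ... }
// Mutant:    while (n >= 0) { ... }  // conditional-boundary change

Written out in our notation, the two adequacy metrics described above are

$$\text{mutation score} = \frac{\text{number of killed mutants}}{\text{number of non-equivalent mutants}}, \qquad \text{fault detection ratio} = \frac{\text{number of metamorphic tests that violate an MR}}{\text{number of metamorphic tests generated from the source test cases}}$$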
In our evaluation, we used the PIT5 tool to systematically
generate 1120 mutants for the programs described in Section III-B.
In Table II, we list the total number of mutants generated
by the PIT tool for each class. Due to the high number
of mutants, identifying equivalent mutants manually was not
practical. Thus, we used the following filtering approach to
identify the mutants to be used in the experiment: we executed
the EvoSuite-generated test suites and the developer test suites on
the generated mutants and filtered out the mutants that caused
compilation errors or run-time exceptions. We also filtered out
any mutants that passed both the coverage-based and the
developer test suites. The fourth column of Table II lists the number
of remaining mutants used in the experiment after the filtering
process. Thus, the mutation score is calculated as the ratio of
the killed mutants to the total number of mutants remaining after
filtering. Our full evaluation approach with the source code
repository is available on GitHub6.
IV. RESULTS AND DISCUSSIONS
Below we discuss the results of our experiments and
provide answers to our research questions:
5https://pitest.org/
6https://github.com/ps073006/matrixmt
TABLE II
CLASSES WITH MRS AND MUTANT GENERATION RESULTS

Class name                          Lines of Code  #Mutants  # of remaining Mutants  # of MRs
Matrix.java                              2210         884            363                11
LeastSquaresSolver.java                    95          89             39                 6
ForwardBackSubstitutionSolver.java         95          92             39                 6
SquareRootSolver.java                     106          55             32                 6
1. Effectiveness of MRs over the automatically generated
test suites: Figure 3 shows the mutation scores of the test suites
automatically generated by EvoSuite (columns denoted by E) and
the developer-generated test suites7 (columns denoted by D).
We also show the increase in mutation score achieved by these
test suites after augmenting the corresponding test cases with
applicable MRs, as described in Section III-E. The EvoSuite
columns show the combined mutation scores of the test suites
generated by all three strategies (line, branch, and weak mutation),
and for MT the combined mutation scores of all the MRs are used.
From the EvoSuite columns, we see a significant increase in
mutation score when augmented with MRs, but for the Matrix class
the increase in mutation score is relatively low compared to the
other three classes. Note that the MR list is different for the
Matrix class than for the other 3 classes. The fault detection
effectiveness of MT is determined by the MRs used for testing as
well as the source test cases used to execute those MRs. Thus,
further investigation is required to determine the exact reason for
the lower fault detection gain in the Matrix class.
RQ1: MRs help to increase the fault detection effectiveness
of automatically generated test suites.
2. Effectiveness of improved automatically generated test
suites compared to the developer-written test suites: From
Figure 3, no additional mutants were killed by augmenting the
developer test suites with MRs, except for the Matrix class,
where the MR-augmented test cases killed only 6.07% additional
mutants. Thus, MRs did not help to improve the fault detection
effectiveness of developer test suites as they did with
automatically generated test suites.

Developer test suites (Table I) are written based on
knowledge of the specification, and developers often try to
cover the majority of the branches and utilize boundary cases
when creating test cases. Therefore, it is not surprising that
their fault detection effectiveness did not improve with the MR
augmentation. However, since our goal is to improve the
effectiveness of automatically generated test suites using
MT, we use the developer test suites as a benchmark in our
experiments. From Figure 3, except for the SquareRootSolver
class, the mutation scores of the MR-augmented automatically
generated test cases come close to the mutation scores of the
developer test suites. This evidence suggests that MRs can be
utilized to improve the fault detection effectiveness of
automatically generated test suites up to a level that closely
matches the fault detection effectiveness of developer test suites.
7https://github.com/vkostyukov/la4j
RQ2: Improved automatically generated test suites are
comparable to the developer-generated test suites in
terms of fault detection effectiveness.
3. Fault detection effectiveness of the MRs compared to
automatically generated test suites: Figure 4 shows box
plots of the fault detection ratio of the automatically generated
test suites and the MR-augmented test suites (the specific MR
used for augmentation is listed in the label header) for the
3 solver classes. The box plots show that the majority of the
MR-augmented test suites either outperform or perform on par
with the automatically generated test suites.
For the SquareRootSolver and ForwardBackSubstitutionSolver
classes, the MR11, MR13, MR15, and MR16 test suites have a median
fault detection ratio over 0.6, which is about twice that of the
automatically generated test suites (~0.3). The box plots for the
LeastSquaresSolver class show many outliers
for both the automatically generated test suites and the MR-augmented
test suites. For this class, MR11, MR13, MR14, and
MR15 have a median fault detection ratio (0.5) similar to
the automatically generated test suites. The spread of the outliers
suggests that the performance of the MR-augmented test suites
and the automatically generated test cases is not consistent in
the LeastSquaresSolver class. For the SquareRootSolver and
ForwardBackSubstitutionSolver classes, we can see that MR12
is not supported by either class. From the above experimental
outcome, we can deduce that the majority of the MR-augmented
test suites can kill more mutants than the
automatically generated test cases.
RQ3: The majority of the MR-augmented test suites have
better fault detection effectiveness than the source
test suites.
V. THREATS TO VALIDITY
Threats to internal validity may result from the way the
empirical study was carried out. EvoSuite and our experimental
setup have been carefully tested, although testing cannot
prove the absence of defects. The construction of MRs can still
be error-prone since we manually identified and verified
the MRs against the programs.
Threats to construct validity may occur because of the third-
party tools we have used. The EvoSuite tool has been used to
generate source test cases for line, branch, and weak mutation
test generation techniques. Further, we used the PIT mutation
tool to create mutants for our experiment. To minimize these
threats, we verified that the results produced by these tools
are correct by manually inspecting randomly selected outputs
produced by each tool.

Fig. 3. Mutation Score of 4 classes for Auto generated Test Suites
(EvoSuite (E)), Developer test suites (D), and Metamorphic Testing; SRS =
SquareRootSolver, LSS = LeastSquaresSolver, FBSS = ForwardBackSubstitutionSolver.
[Bar chart of mutation scores (%) per class for the auto-generated test
suites, metamorphic testing, and the developer test suites.]
Threats to external validity were minimized by using the 4
classes, which perform different matrix operations, as case
studies. This provides high confidence in the possibility
of generalizing our results to other open-source software. We
only used the EvoSuite tool to generate test cases for our main
experiment.
VI. RELATED WORK
Most contributions on MT use either randomly generated
test data or existing developer test suites for the generation
of source test cases. Not much research has been done on
the automatic generation of source test cases for MT. Gotlieb
and Botella [19] presented an approach called Automated
Metamorphic Testing. Using this technique, they translated the
code into an equivalent constraint logic program and tried
to find test cases that violate the MRs. Chen et al. [20]
compared the fault detection effectiveness of random testing
and "special values" as source test cases for MT. Special values
are inputs for which the output of a particular method is well
known. However, Wu et al. [9] showed that randomly generated
test cases are more effective for MT than test cases derived
from "special values". Segura et al. [21] also compared the
fault detection effectiveness of random testing with manually
generated test suites for MT. Their experimental results showed
that randomly generated test suites are more effective in
detecting faults than manually designed test suites. They also
observed that combining random testing with manually written
tests provides better fault detection ability than random
testing alone.
Batra and Sengupta [22] proposed a genetic algorithm
approach to generate test cases maximizing the paths traversed
in the PUT for MT. Chen et al. [23] addressed the same
problem from a different perspective: they proposed partitioning
the input domain of the PUT into multiple equivalence
classes for MT and applied an algorithm that generates
test cases covering those equivalence classes. They were
able to generate source and follow-up test cases that provide
a high fault detection rate. Symbolic execution was used
to construct MRs and generate their corresponding source
test cases by Dong et al. [24]. First, the program
paths were analyzed to generate symbolic inputs, and then
these symbolic inputs were used to construct MRs. Finally,
source test cases were generated by replacing the symbolic
inputs with real values. Saha and Kanewala [11] applied a
coverage-based testing technique to generate source test cases
for MT; compared with randomly generated test cases, their
approach was more effective. Compared to their research, in this
work we evaluate the improvement in the fault detection
effectiveness of MRs over automatically generated test suites,
specifically 3 commonly used coverage criteria (line, branch, and weak
mutation).

Fig. 4. Fault Detection Ratio of AGTS (Automatically Generated Test Suites)
and MR Augmented Test Suites for (a) ForwardBackSubstitutionSolver, (b)
SquareRootSolver & (c) LeastSquaresSolver class. [Three box plots of the
fault detection ratio (0.0-1.0) of AGTS and the applicable MR-augmented
test suites (MR11-MR16) for each class.]
VII. CONCLUSION AND FUTURE WORK
In this study, we empirically evaluated whether the fault
detection effectiveness of automatically generated test suites
can be improved using MRs. Our results show that augmenting
automatically generated test cases with MRs can improve
their fault detection capability, since MRs provide a method to
improve upon the trivial oracles used in these test cases. Our
case study also shows that once the automatically generated
test cases are augmented with MRs, their fault detection
effectiveness is comparable to that of the developer test suites.
The results of this empirical study also suggest that identifying
strong MRs is important, as such MRs help to increase the fault
detection capability of automatically generated test suites where
they otherwise fail to perform effectively.
In the future, we will extend this experiment with other
automated test case generation techniques such as Adaptive
Random Testing. To fully automate MR augmentation of
automatically generated test cases, our plan is to integrate the MR
identification approach [8] with the source test case generation
process. Finally, we will implement this automated test case
generation process in the publicly available METtester [25] tool
(a metamorphic testing tool for testing scientific applications).
ACKNOWLEDGMENT
This work is partially supported by award number 1656877
from the National Science Foundation. Any opinions, findings,
conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect those of
the National Science Foundation.
REFERENCES
[1] Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic Test Suite
Generation for Object-oriented Software (ESEC/FSE ’11). ACM, New
York, NY, USA, 416–419. https://doi.org/10.1145/2025113.2025179
[2] Carlos Pacheco and Michael D. Ernst. 2007. Randoop: Feedback-
directed Random Testing for Java (OOPSLA ’07). ACM, New York,
NY, USA, 815–816. https://doi.org/10.1145/1297846.1297902
[3] Elaine J. Weyuker. 1982. On Testing Non-Testable Programs. Comput.
J. 25, 4 (Nov. 1982), 465–470.
[4] Tsong Yueh Chen, S. C. Cheung, and S. W. Yiu. 2020. Metamor-
phic testing: a new approach for generating next test cases. CoRR.
https://arxiv.org/abs/2002.12543
[5] Tsong Yueh Chen, Fei-Ching Kuo, Huai Liu, Pak-Lok Poon, Dave
Towey, T. H. Tse, and Zhi Quan Zhou. 2018. Metamorphic Testing:
A Review of Challenges and Opportunities. ACM Comput. Surv. 51, 1,
Article 4 (Jan. 2018), 27 pages. https://doi.org/10.1145/3143561
[6] C. Sun, A. Fu, P. Poon, X. Xie, H. Liu and T. Y. Chen, ”METRIC+:
A Metamorphic Relation Identification Technique Based on Input plus
Output Domains,” in IEEE Transactions on Software Engineering, doi:
10.1109/TSE.2019.2934848.
[7] Tsong Yueh Chen, Pak-Lok Poon, Xiaoyuan Xie, METRIC: METamor-
phic Relation Identification based on the Category-choice framework,
Journal of Systems and Software, Volume 116, 2016, Pages 177-190,
ISSN 0164-1212, https://doi.org/10.1016/j.jss.2015.07.037.
[8] U. Kanewala and J. M. Bieman. 2013. Using machine learning tech-
niques to detect metamorphic relations for programs without test oracles.
In 2013 IEEE 24th International Symposium on Software Reliability En-
gineering (ISSRE). 1–10. https://doi.org/10.1109/ISSRE.2013.6698899
[9] Peng Wu, Xiao-Chun Shi, Jiang-Jun Tang, and Hui-Min Lin. 2005.
Metamorphic Testing and Special Case Testing: A Case Study. Journal
of Software 16, 7 (2005), 1210–1220.
[10] H. Liu, X. Xie, J. Yang, Y. Lu, and T. Y. Chen. 2010.
Adaptive Random Testing by Exclusion through Test Profile. In
2010 10th International Conference on Quality Software. 92–101.
https://doi.org/10.1109/QSIC.2010.61
[11] Prashanta Saha and Upulee Kanewala. 2018. Fault Detection Ef-
fectiveness of Source Test Case Generation Strategies for Meta-
morphic Testing (MET ’18). ACM, New York, NY, USA, 2–9.
https://doi.org/10.1145/3193977.3193982
[12] J. Mayer and R. Guderlei. 2006. An Empirical Study on the Selection
of Good Metamorphic Relations. In 30th Annual International Com-
puter Software and Applications Conference (COMPSAC’06), Vol. 1.
475–484. https://doi.org/10.1109/COMPSAC.2006.24
[13] Peng Wu. 2005. Iterative Metamorphic Testing. In 29th Annual In-
ternational Computer Software and Applications Conference (COMP-
SAC’05), Vol. 1. 19–24. https://doi.org/10.1109/COMPSAC.2005.93
[14] C. Sun, G. Wang, B. Mu, H. Liu, Z. Wang, and T. Y. Chen. 2011.
Metamorphic Testing for Web Services: Framework and a Case Study.
In 2011 IEEE International Conference on Web Services. 283–290.
https://doi.org/10.1109/ICWS.2011.65
[15] X. Xie, W. E. Wong, T. Y. Chen, and B. Xu. 2011. Spectrum-
Based Fault Localization: Testing Oracles are No Longer Mandatory.
In 2011 11th International Conference on Quality Software. 1–10.
https://doi.org/10.1109/QSIC.2011.20
[16] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. 1978. Hints on Test
Data Selection: Help for the Practicing Programmer. Computer 11, 4
(April 1978), 34–41. https://doi.org/10.1109/C-M.1978.218136
[17] J. H. Andrews, L. C. Briand, and Y. Labiche. 2005. Is mutation an appro-
priate tool for testing experiments? [software testing]. In Proceedings.
27th International Conference on Software Engineering, 2005. ICSE
2005. 402–411. https://doi.org/10.1109/ICSE.2005.1553583
[18] Budd, T.A., Angluin, D. Two notions of correctness and
their relation to testing. Acta Informatica 18, 31–45 (1982).
https://doi.org/10.1007/BF00625279
[19] A. Gotlieb and B. Botella. 2003. Automated metamorphic
testing. In Proceedings 27th Annual International Computer
Software and Applications Conference. COMPSAC 2003. 34–40.
https://doi.org/10.1109/CMPSAC.2003.1245319
[20] Tsong Yueh Chen, Fei-Ching Kuo, Ying Liu, and Antony Tang. 2004.
Metamorphic Testing and Testing with Special Values. In 4th IEEE
International Workshop on Source Code Analysis and Manipulation
(SCAM 2004), 15-16 September 2004, Chicago, IL, USA. 128–134
[21] Sergio Segura, Robert M. Hierons, David Benavides, and Antonio
Ruiz-Cortés. 2011. Automated metamorphic testing on the analyses of feature
models. Information and Software Technology 53, 3 (2011), 245–258.
https://doi.org/10.1016/j.infsof.2010.11.002
[22] Gagandeep Batra and Jyotsna Sengupta. 2011. An Efficient Metamorphic
Testing Technique Using Genetic Algorithm. In Information Intelligence,
Systems, Technology and Management, Sumeet Dua, Sartaj Sahni, and
D. P. Goyal (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg,
180–188
[23] Leilei Chen, Lizhi Cai, Jiang Liu, Zhenyu Liu, Shiyan Wei, and Pan Liu.
2012. An optimized method for generating cases of metamorphic testing.
In 2012 6th International Conference on New Trends in Information
Science, Service Science and Data Mining (ISSDM2012). 439–443
[24] Guowei Dong, Tao Guo, and Puhan Zhang. 2013. Security assurance
with program path analysis and metamorphic testing. In 2013 IEEE 4th
International Conference on Software Engineering and Service Science.
193–197. https://doi.org/10.1109/ICSESS.2013.6615286
[25] Prashanta Saha and Upulee Kanewala. 2018. MSU-STLab/METtester
1.0.0. Zenodo. https://doi.org/10.5281/zenodo.1157183