Hints on Test Data Selection: Help for the Practicing Programmer

Abstract

A new, empirically observed effect is introduced. Called "the coupling effect," it may become a very important principle in practical testing activities. The idea is that programs appear to have the property (the "coupling effect") that tests designed to detect simple kinds of errors are also effective in detecting much more complicated errors. This relationship may seem counter-intuitive, but the authors give a way of analyzing it through the use of program mutations (i.e., incorrect variations from a correct program). One of the most interesting possibilities is that the mutation idea could form the basis for statistically inferring the likelihood of remaining errors in a program.
Richard A. DeMillo, Georgia Institute of Technology
Richard J. Lipton and Frederick G. Sayward, Yale University
In many cases tests of a program that uncover simple errors are also effective in uncovering much more complex errors. This so-called coupling effect can be used to save work during the testing process.
Much of the technical literature in software reliability deals with tentative methodologies and underdeveloped techniques; hence it is not surprising that the programming staff responsible for debugging a large piece of software often feels ignored. It is an economic and political requirement in most production programming shops that programmers shall spend as little time as possible in testing. The programmer must therefore be content to test cleverly but cheaply; state-of-the-art methodologies always seem to be just beyond what can be afforded. We intend to convince the reader that much can be accomplished even under these constraints.
From the point of view of management, there is some justification for opposing a long-term view of the testing phase of the development cycle. Figure 1 shows the relative effect of testing on the remaining system bugs for several medium-scale systems developed by System Development Corporation [1]. Notice that in the last half of the test cycle, the average change in the known-error status of a system is 0.4 percent per unit of testing effort, while in the first half of the cycle, 1.54 percent of the errors are discovered per unit of testing effort. Since it is enormously difficult to be convincing in stating that the testing effort is complete, the apparently rapidly decreasing return per unit of effort invested becomes a dominating concern. The standard solution, of course, is to limit the amount of testing time to the most favorable part of the cycle.
How, then, should programmers cope? Their more sophisticated general methodologies are not likely to be applicable [2]. In addition, they have the burden of convincing managers that their software is indeed reliable.
The coupling effect

Programmers, however, have one great advantage that is almost never really exploited: they create programs that are close to being correct! Programmers do not create programs at random; competent programmers, in their many iterations through the design process, are constantly whittling away the distance between what their programs look like now and what they are intended to look like. Programmers also have at their disposal

* a rough idea of the kinds of errors most likely to occur;
* the ability and opportunity to examine their programs in detail.
Error classifications. In attempting to formulate a comprehensive theory of test data selection, Susan Gerhart and John Goodenough [3] have suggested that errors be classified as follows:

(1) failure to satisfy specifications due to implementation error;
(2) failure to write specifications that correctly represent a design;
(3) failure to understand a requirement;
(4) failure to satisfy a requirement.

But these are global concerns. Errors are always reflected in programs as

* missing control paths,
* inappropriate path selection, or
* inappropriate or missing actions.
We do not explicitly address classifications (2) and (3) in this article, except to point out that even here a programmer can do much without fancy theories. If we are right in our perception of programs as being close to correct, then these errors should be detectable as small deviations from the intended program. There is an amazing lack of published data on this subject, but we do have some idea of the most common errors.

E. A. Youngs, in his PhD dissertation [4], analyzed 1258 errors in Fortran, Cobol, PL/I, and Basic programs. The errors were distributed as shown in Table 1. In addition to these errors, certain other errors were present in negligible quantities. There were, for instance, operating system interface errors, such as incorrect job identification and erroneous external I/O assignment. Also present were errors in comments, pseudo-ops, and no-ops which for various reasons created detectable error conditions.
Complex errors coupled. How, then, do the relatively simple error types discovered by Youngs connect with the Gerhart-Goodenough error classification? Well, the naive answer is that since arbitrarily pernicious errors may be responsible for a given failure, it must be that simple errors compound in more massive error conditions. For the practical treatment of test data, the Youngs error statistics, therefore, do not seem to help much at all.
Fortunately though, the observation that programs are "close to correct" leads us to an assumption which makes the high frequency of simple errors very important:

The coupling effect: Test data that distinguishes all programs differing from a correct one by only simple errors is so sensitive that it also implicitly distinguishes more complex errors.

In other words, complex errors are coupled to simple errors.
There is, of course, no hope of "proving" the coupling effect; it is an empirical principle. If the coupling effect can be observed in "real-world" programs, then it has dramatic implications for testing strategies in general and domain-specific, limited testing in particular. Rather than scamper after errors of undetermined character, the tester should attempt a systematic search for simple errors that will also uncover deeper errors via the coupling effect.
Path analysis. This point seems so obvious that it's not worth making: test to uncover errors. Yet it's a point that's often lost in the shuffle. In a common methodology known as path analysis, the point of the test data is to drive a program through all of its control paths. It is certainly hard to criticize such a goal, since a thoroughly tested program must have been exercised in this way. But unless one recognizes that the test data should also distinguish errors, he might be tempted to conclude, for example, that the program segment diagrammed in Figure 2 can be tested by exercising paths 1-2 and 1-3, even though one of the clauses P and Q may not have been affected at all! In general, the relative ordering of P and Q may be irrelevant or partially unknown and side effects may occur, so that actually the eight paths shown in Figure 3 are required to ensure that the statement has been adequately tested.
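The following small Python sketch is ours, not the article's; the predicates P and Q and the test points are invented for illustration. It shows the trap: both visible paths of a compound test are exercised, yet a simple error in the clause Q goes unnoticed until data is chosen specifically to distinguish it.

# Illustrative sketch: P is x > 0, Q is y > 0 (both assumed, not from the article).
def intended(x, y):
    if x > 0 and y > 0:       # the compound test of Figure 2
        return "path 1-2"
    return "path 1-3"

def simple_error_in_q(x, y):
    if x > 0 and y >= 0:      # Q mistyped as y >= 0
        return "path 1-2"
    return "path 1-3"

# Both paths are exercised, yet the two programs agree on this data:
path_coverage_data = [(1, 1), (-1, 1)]
print([intended(*t) == simple_error_in_q(*t) for t in path_coverage_data])  # [True, True]

# Data chosen to distinguish simple errors in Q catches the difference at once:
print(intended(1, 0), simple_error_in_q(1, 0))   # path 1-3  path 1-2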
Figure 1. More programming errors are found in the early part of the test cycle than in the final part. (The figure plots a 0-100 percent scale against percent of testing effort, measured in man-months, computer hours, etc.)
Table 1. Frequency of occurrence of 1258 errors in Fortran, Cobol, PL/I, and Basic programs.

Error Type                               Relative Frequency of Occurrence
Error in assignment or computation       .27
Allocation error                         .15
Other, unknown, or multiple errors       .11
Unsuccessful iteration                   .09
Other I/O error                          .07
I/O formatting error                     .06
Error in branching: unconditional        .01
Error in branching: conditional          .05
Parameter or subscript violation         .05
Subprogram invocation error              .05
Misplaced delimiter                      .04
Data error                               .02
Error in location or marker              .02
Nonterminating subprogram                .01
Figure 2. Sample program segment with two paths.
Two examples given below indicate that test data derived to uncover simple errors can, in fact, be vastly superior to, say, randomly chosen data or data generated for path analysis. A byproduct of the discussion will be some evidence for the coupling effect. A third example reveals another advantage of selecting test data with an eye on coupling: since it's a problem-specific activity, there are enhanced possibilities for discovering useful heuristics for test data selection. This example will lead to useful advice for generating test vectors for programs that manipulate arrays.
Our groups at Yale University and the Georgia Institute of Technology have constructed a system whereby we can determine the extent to which a given set of test data has adequately tested a Fortran program by direct measurement of the number and kinds of errors it is capable of uncovering. This method, known as program mutation, is used interactively: A programmer enters from a terminal a program, P, and a proposed test data set whose adequacy is to be determined. The mutation system first executes the program on the test data; if the program gives incorrect answers then certainly the program is in error. On the other hand, if the program gives correct answers, then it may be that the program is still in error, but the test data is not sensitive enough to distinguish that error: it is not adequate.
The mutation system then creates a number of mutations of P that differ from P only in the occurrence of simple errors (for instance, where P contains the expression "B.LE.C" a mutation will contain "B.EQ.C"). Let us call these mutations P1, P2, ..., Pk. Now, for the given set of test data there are only two possibilities: (1) on that data P gives different results from the Pi mutations, or (2) on that data P gives the same results as some Pi.

In case (1) Pi is said to be dead: the "error" that produced Pi from P was indeed distinguished by the test data. In case (2), the mutant Pi is said to be live; a mutant may be live for two reasons: (1) the test data does not contain enough sensitivity to distinguish the error that gave rise to Pi, or (2) Pi and P are actually equivalent programs and no test data will distinguish them (i.e., the "error" that gave rise to Pi was not an error at all).
Test data that leaves no live mutants, or only live mutants that are equivalent to P, is adequate in the following sense: either the program P is correct or there is an unexpected error in P, which, by the coupling effect, we expect to happen seldom if the errors used to create the mutants are carefully chosen.
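The following minimal Python sketch is ours, not the authors' Fortran-based mutation system; it only illustrates the dead/live bookkeeping just described. The "program" is a hypothetical one-line stand-in containing the test B.LE.C, and each mutant differs from it only in the relational operator used.

TEMPLATE = "def p(b, c):\n    return 1 if b {op} c else 0\n"
RELOPS = ["<=", "<", ">=", ">", "==", "!="]   # "<=" plays the role of .LE.

def make(op):
    env = {}
    exec(TEMPLATE.format(op=op), env)   # build one version of the program
    return env["p"]

def score(tests):
    """Classify each single-operator mutant as dead or live on the given tests."""
    original = make(RELOPS[0])
    dead, live = [], []
    for op in RELOPS[1:]:
        mutant = make(op)
        if any(mutant(b, c) != original(b, c) for b, c in tests):
            dead.append(op)   # some test point distinguishes the mutant
        else:
            live.append(op)   # data too weak, or the mutant is equivalent
    return dead, live

print(score([(1, 2)]))                   # only b < c: two mutants stay live
print(score([(1, 2), (2, 1), (2, 2)]))   # richer data kills all five

Adequacy in the sense above corresponds to an empty live list once any genuinely equivalent mutants have been set aside.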
Now, it is not completely apparent that this process is computationally feasible. But, as we describe in more detail elsewhere, there is a very good choice of methodology for generating mutations to bring the procedure within attractive economic bounds [5].

Apparently, the information returned by the mutation system can be effectively utilized by the programmer. The programmer looks at a negative response from the system as a "hard question" concerning his program (e.g., "The test data you've given me says it doesn't matter whether or not this test is for equality or inequality; why is that?") and is able to use his answers to the question as a guide in generating more sensitive test data.
Figure 3. Eight paths may be required for an adequate test.
A simple example

Our first example is very simple; it involves the MAX algorithm used for other purposes by Peter Naur in the early 1960's. The task is to set a variable R to the index of the first occurrence of a maximum element in the vector A(1), ..., A(N). For example, the following Fortran subroutine might be offered as an implementation of such an algorithm:
      SUBROUTINE MAX (A,N,R)
      INTEGER A(N),I,N,R
1     R=1
2     DO 3 I=2,N,1
3     IF (A(I).GT.A(R)) R=I
      RETURN
      END
We will choose for our initial set of test data three vectors (Table 2).

Table 2. Three vectors constitute the initial set of test data.

          A(1)  A(2)  A(3)
data 1     1     2     3
data 2     1     3     2
data 3     3     1     2
How sensitive is this data? By inspection, we notice that if an error had occurred in the relational operation of the IF statement, then either data 1, data 2, or data 3 would have distinguished those errors, except for one case. None of these data vectors distinguishes .GE. from .GT. in the IF statement. Similarly, these vectors distinguish all simple errors in constants except for starting the DO loop at "1" rather than "2." All simple errors in variables are likewise distinguished except for the errors in the IF statement which replace "A(I)" by "I" or by "A(R)." That is, if we run the data set above in any of the following mutants of MAX, we get the same results.
      SUBROUTINE MAX (A,N,R)
      INTEGER A(N),I,N,R
1     R=1
2     DO 3 I=1,N,1
3     IF (A(I).GT.A(R)) R=I
      RETURN
      END

      SUBROUTINE MAX (A,N,R)
      INTEGER A(N),I,N,R
1     R=1
2     DO 3 I=2,N,1
3     IF (I.GT.A(R)) R=I
      RETURN
      END

      SUBROUTINE MAX (A,N,R)
      INTEGER A(N),I,N,R
1     R=1
2     DO 3 I=2,N,1
3     IF (A(I).GE.A(R)) R=I
      RETURN
      END

      SUBROUTINE MAX (A,N,R)
      INTEGER A(N),I,N,R
1     R=1
2     DO 3 I=2,N,1
3     IF (A(R).GT.A(R)) R=I
      RETURN
      END
Let us try to kill as many of these mutants as possible. In view of the first difficulty, we might guess that our data is not yet adequate because it does not contain repeated elements. So, let us add

          A(1)  A(2)  A(3)
data 4     2     2     1
Now, replacing .GT. by .GE. and running on data 4 gives erroneous results, so that all mutants arising from simple relational errors are dead. Surprisingly, data 4 also distinguishes the two errors in A(I); so, we are left with only the last mutant arising from the "constant" error: variation in beginning the DO loop. But closer inspection of the program indicates that starting the DO loop at "1" rather than "2" has no effect on the program, other than to trivially increase its running time. So no choice of test data will distinguish this "error," since it results in a program equivalent to MAX. So we conclude that since the test data 1-4 leaves only live mutants that are equivalent to MAX, it is adequate.
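The analysis is easy to replay outside the mutation system. The Python transliteration below is ours (the function names are invented); it checks that data 1-3 leave the .GE. mutant and the loop-from-1 mutant live, and that data 4 kills the former while the latter, being equivalent to MAX, survives every test.

def max_index(a):                  # mirrors MAX: 1-based index of the first maximum
    r = 1
    for i in range(2, len(a) + 1):
        if a[i - 1] > a[r - 1]:
            r = i
    return r

def mutant_ge(a):                  # .GT. replaced by .GE.
    r = 1
    for i in range(2, len(a) + 1):
        if a[i - 1] >= a[r - 1]:
            r = i
    return r

def mutant_loop_from_1(a):         # DO loop started at 1 instead of 2
    r = 1
    for i in range(1, len(a) + 1):
        if a[i - 1] > a[r - 1]:
            r = i
    return r

data_1_3 = [[1, 2, 3], [1, 3, 2], [3, 1, 2]]
data_4 = [2, 2, 1]

# Data 1-3 cannot separate MAX from either mutant:
print(all(max_index(d) == mutant_ge(d) == mutant_loop_from_1(d) for d in data_1_3))  # True

# Data 4 (repeated elements) kills the .GE. mutant ...
print(max_index(data_4), mutant_ge(data_4))              # 1 2
# ... while the loop-from-1 mutant agrees with MAX on every input (equivalent).
print(max_index(data_4) == mutant_loop_from_1(data_4))   # True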
Comparisons with path analysis

This example illustrates hidden paths in a program which should also be exercised by the test data. To illustrate what hidden paths are, consider the Fortran program, call it P, suggested by C. V. Ramamoorthy and his colleagues [6]:
      INTEGER A,B,C,D
      READ 10,A,B,C
10    FORMAT(4I10)
5     IF((A.GE.B) .AND. (B.GE.C)) GOTO 100
      PRINT 50
50    FORMAT(1H ,*LENGTH OF TRIANGLE NOT IN
     1ORDER*)
      STOP
100   IF((A.EQ.B) .OR. (B.EQ.C)) GOTO 500
      A=A*A
      B=B*B
      C=C**2
      D=B+C
      IF (A.NE.D) GOTO 200
      PRINT 150
150   FORMAT(1H ,*RIGHT ANGLED TRIANGLE*)
      STOP
200   IF (A.LT.D) GOTO 300
      PRINT 250
250   FORMAT(1H ,*OBTUSE ANGLED TRIANGLE*)
      STOP
300   PRINT 350
350   FORMAT(1H ,*ACUTE ANGLED TRIANGLE*)
      STOP
500   IF ((A.EQ.B) .AND. (A.EQ.C)) GOTO 600
      PRINT 550
550   FORMAT(1H ,*ISOCELES TRIANGLE*)
      STOP
600   PRINT 650
650   FORMAT(1H ,*EQUILATERAL TRIANGLE*)
      STOP
      END
The intent of this program is to categorize triangles, given the lengths of their sides. A typical path analysis system will derive test data, call it T, which exercises all paths of P (Table 3).

Table 3. Test data T to exercise the Fortran program P.

TEST CASE     A     B     C    TRIANGLE TYPE
1             2    12    27    ILLEGAL
2             5     4     3    RIGHT ANGLE
3            26     7     7    ISOSCELES
4            19    19    19    EQUILATERAL
5            14     6     4    OBTUSE
6            24    23    21    ACUTE

Now consider the following mutant program P':
      INTEGER A,B,C,D
      READ 10,A,B,C
10    FORMAT(4I10)
5     IF( A.GE.B ) GOTO 100
      PRINT 50
50    FORMAT(1H ,*LENGTH OF TRIANGLE NOT IN
     1ORDER*)
      STOP
100   IF( B.EQ.C ) GOTO 500
      A=A*A
      B=B*B
      C=C**2
      D=B+C
      IF (A.NE.D) GOTO 200
      PRINT 150
150   FORMAT(1H ,*RIGHT ANGLED TRIANGLE*)
      STOP
200   IF (A.LT.D) GOTO 300
      PRINT 250
250   FORMAT(1H ,*OBTUSE ANGLED TRIANGLE*)
      STOP
300   PRINT 350
350   FORMAT(1H ,*ACUTE ANGLED TRIANGLE*)
      STOP
500   IF ((A.EQ.B) .AND. (A.EQ.C)) GOTO 600
      PRINT 550
550   FORMAT(1H ,*ISOCELES TRIANGLE*)
      STOP
600   PRINT 650
650   FORMAT(1H ,*EQUILATERAL TRIANGLE*)
      STOP
      END
P' prints the same answers as P on T, but P' is clearly incorrect since it categorizes the two test cases shown in Table 4 as acute angled triangles:

Table 4. Two test cases that P' incorrectly categorizes as acute angled triangles.

TEST CASE     A     B     C    TRIANGLE TYPE
7             7     5     6    ILLEGAL
8            26    26     7    ISOSCELES
P and P' differ only in the logical expressions found at statements 5 and 100.* The test data T does not sufficiently test the compound logical expressions of P; T only tests the single-clause logicals found in the corresponding statements of P'. Hence, T' (the data T augmented with test cases 7 and 8) is a stronger test of P than is T; i.e., for P we have more confidence in the adequacy of T' than in the adequacy of T.

* The clause A.EQ.B in statement 500 is redundant.
Note that the logical expression in statement 5 of P could be replaced by B.GE.C to yield a program P'' which produces correct answers on T'. The test case A=5, B=7, C=6 will remedy this and provide still a stronger test of P.
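A compact Python paraphrase of P and P' (ours; the article's programs are the Fortran listings above, and the returned strings are abbreviations of the printed messages) makes the comparison easy to reproduce: the path-analysis data T never separates the two programs, while test case 7 of Table 4 does.

def classify(a, b, c, full_checks=True):
    # full_checks=True follows P; full_checks=False drops one clause from
    # statements 5 and 100, following the mutant P'.
    in_order = (a >= b and b >= c) if full_checks else (a >= b)
    if not in_order:
        return "NOT IN ORDER"
    has_equal = (a == b or b == c) if full_checks else (b == c)
    if has_equal:
        return "EQUILATERAL" if (a == b and a == c) else "ISOSCELES"
    a2, d = a * a, b * b + c * c
    if a2 == d:
        return "RIGHT"
    return "ACUTE" if a2 < d else "OBTUSE"

P  = lambda a, b, c: classify(a, b, c, full_checks=True)    # original program P
Pp = lambda a, b, c: classify(a, b, c, full_checks=False)   # mutant P'

T = [(2, 12, 27), (5, 4, 3), (26, 7, 7), (19, 19, 19), (14, 6, 4), (24, 23, 21)]
print(all(P(*t) == Pp(*t) for t in T))    # True: T never separates P from P'
print(P(7, 5, 6), Pp(7, 5, 6))            # NOT IN ORDER  ACUTE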
A more substantial example

Our last example involves the FIND program of C.A.R. Hoare [7]. FIND takes, as input, an integer array A, its size N ≥ 1, and an array index F, 1 ≤ F ≤ N. After execution of FIND, all elements to the left of A(F) have values no larger than A(F) and all elements to the right are no smaller.
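Written as a small Python predicate (our own sketch, using 1-based positions to match the Fortran), the postcondition reads:

def satisfies_find_spec(a, f):
    # Everything left of position f is <= A(F); everything to the right is >= A(F).
    pivot = a[f - 1]
    return (all(x <= pivot for x in a[:f - 1]) and
            all(x >= pivot for x in a[f:]))

print(satisfies_find_spec([0, 1, 2, 3], 3))   # True
print(satisfies_find_spec([0, 2, 1, 3], 3))   # False: 2 lies to the left of A(3) = 1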
Clearly, this could be achieved by sorting A; indeed, FIND is an inner loop of a fast sorting algorithm, although FIND executes faster than any sorting program. The Fortran version of FIND, translated directly from the Algol version, is given below:
      SUBROUTINE FIND(A,N,F)
C
C     FORTRAN VERSION OF HOARE'S FIND PROGRAM
C     (DIRECT TRANSLATION OF THE ALGOL 60 PROGRAM
C     FOUND IN HOARE'S "PROOF OF FIND" ARTICLE
C     IN CACM 1971).
C
      INTEGER A(N),N,F
      INTEGER M,NS,R,I,J,W
      M=1
      NS=N
10    IF(M.GE.NS) GOTO 1000
      R=A(F)
      I=M
      J=NS
20    IF(I.GT.J) GOTO 60
30    IF(A(I).GE.R) GOTO 40
      I=I+1
      GOTO 30
40    IF(R.GE.A(J)) GOTO 50
      J=J-1
      GOTO 40
50    IF(I.GT.J) GOTO 20
C
C     COULD HAVE CODED GO TO 60 DIRECTLY - DIDN'T
C     BECAUSE THIS REDUNDANCY IS PRESENT IN
C     HOARE'S ALGOL PROGRAM DUE TO THE SEMANTICS
C     OF THE WHILE STATEMENT.
C
      W=A(I)
      A(I)=A(J)
      A(J)=W
      I=I+1
      J=J-1
      GO TO 20
60    IF(F.GT.J) GOTO 70
      NS=J
      GOTO 10
70    IF(I.GT.F) GOTO 1000
      M=I
      GOTO 10
1000  RETURN
      END
FIND is of particular interest for us because a subtle multiple-error mutant of FIND, called BUGGYFIND, has been extensively analyzed by SELECT, a system that generates test data by symbolic execution [8]. In FIND, the elements of A are interchanged depending on a conditional of the form

    X.LE.A(F) .AND. A(F).LE.Y

Since A(F) itself may be exchanged, the effect of this test is preserved by setting a temporary variable R = A(F) and using the conditional

    X.LE.R .AND. R.LE.Y

In BUGGYFIND, the temporary variable R is not used; rather, the first form of the conditional is used to determine whether the elements of A are to be exchanged.
The SELECT system derived the test data A = (3,2,0,1) and F = 3, on which BUGGYFIND fails. The authors of SELECT observed that BUGGYFIND fails on only 2 of the 24 permutations of (0,1,2,3), indicating that the error is very subtle.*

* We found that BUGGYFIND failed on only the aforementioned permutation.
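The failure is easy to reproduce with an illustrative Python transliteration of ours (1-based indices as in the Fortran; the flag use_temporary=False is our device for reproducing BUGGYFIND's error of comparing against A(F) directly instead of the saved value R):

def find(a, f, use_temporary=True):
    """Hoare's FIND; with use_temporary=False it behaves like BUGGYFIND."""
    a = list(a)                 # work on a copy; f is 1-based, as in the Fortran
    m, ns = 1, len(a)
    while m < ns:
        r = a[f - 1]            # the temporary R of the correct program
        i, j = m, ns
        while i <= j:
            while a[i - 1] < (r if use_temporary else a[f - 1]):
                i += 1
            while (r if use_temporary else a[f - 1]) < a[j - 1]:
                j -= 1
            if i <= j:
                a[i - 1], a[j - 1] = a[j - 1], a[i - 1]
                i, j = i + 1, j - 1
        if f <= j:
            ns = j
        elif i <= f:
            m = i
        else:
            break
    return a

def partitioned(a, f):          # the FIND postcondition
    pivot = a[f - 1]
    return all(x <= pivot for x in a[:f - 1]) and all(x >= pivot for x in a[f:])

A, F = [3, 2, 0, 1], 3
good = find(A, F)
bad = find(A, F, use_temporary=False)
print(good, partitioned(good, F))   # [0, 1, 2, 3] True
print(bad, partitioned(bad, F))     # [0, 2, 1, 3] False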
We will first describe a simple-error analysis of the mutants of FIND, beginning with initially naive guesses of test data and finishing with a surprisingly adequate set of 7 A vectors. This data will be called D1. The detailed analysis needed to determine how many errors are distinguished by a data set was carried out on the mutation system at Yale University.
We have asked several colleagues how they would test FIND, and they have nearly unanimously replied that they would use permutations. We first describe analysis which we have done using permutations of the array indices as data elements. In one case, we use all permutations of length 4 and in another case, we use random permutations of lengths 5 and 6. Surprisingly, the intuitively appealing choice of permutations as test data is a very poor one. We then describe analysis in which another popular intuitive method is used: random data. We show that the adequacy of random data is very dependent on the interval from which the data is drawn (i.e., problem-specific information is needed to obtain good results).

Finally, we find evidence for the coupling effect (i.e., adequate simple-error data kills multiple-error mutants) in two ways. First, the multiple-error mutant BUGGYFIND fails on the test data D1. Next, we describe the very favorable results of executing random multiple-error mutants of FIND on D1.
We begin the analysis with the 24 permutations of (0,1,2,3) with F fixed at 3. The results are surprisingly poor, as 58 live mutants are left.
That is, with these 24 vectors there are 58 possible changes that could have been made in FIND that would have yielded identical output. Eventually, by increasing the number of A vectors to 49, only 10 live mutants remain. Using a data reduction heuristic, the 49 A vectors can be reduced to a set of seven A vectors, leaving 14 live mutants. These vectors appear in Table 5.
Table 5. D1: the simple-error adequate data for FIND.

TEST CASE    A                                  F
1            (-19,34,0,-4,22,12,222,-57,17)     5
2            (7,9,7)                            3
3            (2,3,1,0)                          3
4            (-5,-5,-5,-5)                      1
5            (1,3,2,0)                          3
6            (0,2,3,1)                          3
7            (0)                                1
In constructing the initial data, after the 24 permutations, the 49 A vectors were chosen somewhat haphazardly at first. Later, A vectors were chosen specifically to eliminate a small subset of the remaining errors. There were some interesting observations concerning the 49 vectors:

(1) The average A vector kills about 550 mutants.
(2) The "best" A vector kills 703 mutants (test case 1 of Table 5).
(3) The "worst" A vector kills only 70 mutants. This was the degenerate A = (0).
The data reduction heuristic uses both the best and the worst A vectors to pare the 49 A vectors to seven.

The final step in showing that the data of Table 5 is indeed adequate is to show that the 14 remaining mutants are programs that are actually equivalent to FIND. That is, the 14 "errors" that could have been made are not really errors at all.
One might be surprised at the large number of equivalent mutants (approximately 2 percent). This we attribute to FIND's long history (it was first published in 1961). Over the years, FIND has been "honed" to a very efficient state, so efficient that many slight variations result in equivalent but slower programs. For example, the conditional I.GT.F in the statement labeled 70 in FIND can be replaced by any logically false conditional, or the IF statement can be replaced by a CONTINUE statement, to result in an equivalent but slower program.
It is not likely that this phenomenon will occur in programs which haven't been "fine-tuned." We estimate that production programs have well under 1 percent equivalent mutants.
Let us now compare D1 with exhaustive tests on permutations of (0,1,2,3) and then with tests on random permutations of (0,1,2,3,4) and (0,1,2,3,4,5). Table 6 describes the results for all permutations of (0,1,2,3).
Table 6. Results of all permutations of (0,1,2,3).

NUMBER OF TEST CASES    VALUES OF F     NUMBER OF LIVE MUTANTS
24                      1               158
24                      2               60
24                      3               58
24                      4               141
96                      1, 2, 3, & 4    38
In Table 7 the same information is provided for the case of random test data.

Table 7. Results of random permutations.

NUMBER OF TEST CASES    SIZE OF A             VALUE OF F                     NUMBER OF LIVE MUTANTS
10                      uniform from [5,6]    uniform from 1 to size of A    88
25                      uniform from [5,6]    uniform from 1 to size of A    65
50                      uniform from [5,6]    uniform from 1 to size of A    54
100                     uniform from [5,6]    uniform from 1 to size of A    54
1000                    uniform from [5,6]    uniform from 1 to size of A    53
As the data indicates, permutations give rather poor results compared with D1. Our analysis with random data can be divided into two cases: runs in which the vectors were drawn from poorly chosen intervals and runs in which the vectors were chosen from a good interval (-100,100). The results are described in Tables 8 and 9.
Table 8. Results of random data from poorly chosen intervals.

NUMBER OF RANDOM VECTORS    RANGE OVER WHICH VECTOR VALUES DRAWN    RANGE OVER WHICH SIZE OF A DRAWN    VALUE OF F                          NUMBER OF LIVE MUTANTS
10                          [100,200]                               [1,20]                              uniform from 1 to size of vector    28
10                          [-200,-100]                             [1,20]                              uniform from 1 to size of vector    28
10                          [-100,-90]                              [1,20]                              uniform from 1 to size of vector    25
Table 9. Results of random data drawn from [-100,100]; other parameters as in Table 8.

NUMBER OF RANDOM VECTORS    NUMBER OF LIVE MUTANTS
10                          22
50                          17
100                         11
1000                        10
Although the intervals in Table 8 are poor, one could conceive of worse intervals. For example, draw A from [1, size of A]. However, in view of the permutation results, such data will surely behave worse than that of Table 8.

Three points are in order. First, even with very bad data, D1 is much better than simple permutations. Second, it took 1000 very good random vectors to perform as well as D1. Third, using random vectors yields little insight. The insight gained in constructing D1 was crucial to detecting the equivalent versions of FIND.
The coupling effect shows itself in two ways. First, BUGGYFIND fails on the adequate D1; hence, we have a concrete example of the coupling effect. Although the second observation involves randomness, and thus is indirect, it is perhaps more convincing than the "one point" concrete BUGGYFIND example.
We have randomly generated a large number of k-error mutants for k > 1 (called higher order mutants) and executed them on D1. Because the number of mutants produced by complex errors can grow combinatorially, it is hopeless to try the complete mutation analysis on complex mutants, but it is possible to select mutants at random for execution on D1. Of more than 22,000 higher-order errors encountered, only 19 succeed on D1. These 19 have been shown to be equivalent to FIND. Indeed, we have yet to produce an incorrect higher-order mutant which succeeds on D1!
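The flavor of this experiment can be conveyed by a toy Python sketch of ours (a deliberately small stand-in program, not FIND itself): a routine containing two relational operators, test data adequate for its single-operator mutants, and a check that every mutant in which both operators are wrong is killed by that same data.

import itertools

TEMPLATE = ("def cmp3(x, y):\n"
            "    return 1 if x {a} y else (2 if x {b} y else 3)\n")
OPS = ["<", "<=", ">", ">=", "==", "!="]

def make(a, b):
    env = {}
    exec(TEMPLATE.format(a=a, b=b), env)
    return env["cmp3"]

original = make("<", "==")
tests = [(1, 2), (2, 1), (2, 2)]   # adequate for the single-operator mutants
# (the one live single-operator mutant, "<=" in place of "==", is equivalent:
#  when x < y fails, x <= y holds exactly when x == y)

def killed(fn):
    return any(fn(x, y) != original(x, y) for x, y in tests)

second_order = [(a, b) for a, b in itertools.product(OPS, OPS)
                if a != "<" and b != "=="]          # both operators wrong
live = [pair for pair in second_order if not killed(make(*pair))]
print(len(second_order), "second-order mutants,", len(live), "left live")   # 25 ... 0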
Conclusions

Our first conclusion is that systematically pursuing test data which distinguishes errors from a given class of errors also yields "advice" to be used in generating test data for similar programs. For instance, the examples above lead us to the following principles for creating random or nonrandom test data for Fortran-like programs which manipulate arrays (i.e., programs in which array values can also be used as array indices); an illustrative generator sketch follows the list below:
(1) Include cases in which array values are outside the size of the array.
(2) Include cases in which array values are negative.
(3) Include cases in which array values are repeated.
(4) Include such degenerate cases as D1's A = (0) and A = (-5,-5,-5,-5).
Principle (4) was also noticed by Goodenough and Gerhart [3].
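A small Python sketch of ours (the names, sizes, and value ranges are invented for illustration) shows how these principles can be folded into a routine test-vector generator for array-manipulating programs:

import random

def array_test_cases(n_random=3, size_range=(1, 8), value_range=(-100, 100), seed=0):
    rng = random.Random(seed)
    cases = [
        [0],                   # principle (4): degenerate single-element vector
        [-5, -5, -5, -5],      # principles (2) and (3): negative, repeated values
        [2, 2, 1],             # principle (3): repeated values
        [50, -7, 50, 999],     # principle (1): values outside the size of the array
    ]
    for _ in range(n_random):  # plus random vectors drawn from a wide interval
        size = rng.randint(*size_range)
        cases.append([rng.randint(*value_range) for _ in range(size)])
    return cases

for a in array_test_cases():
    print(a)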
It is important that a testing strategy be conducive to the formation of hypotheses about the way test data should be selected in future tasks. Information transferred between programming tasks provides a source of "virtual resources" to be used in subsequent work. Since the amount of available resources is limited by economic and political barriers, experience, which has the effect of expanding resources, takes on a special importance.
It is, of course, helpful to have available such mechanical aids as the mutation system, but as we have shown, even in the absence of the appropriate statistical information a programmer can be reasonably confident that he is improving his test data selection strategy.
A second conclusion is that until more general strategies for systematic testing emerge, programmers are probably better off using the tools and insights they have in great abundance. Instead of guessing at deeply rooted sources of error, they should use their specialized knowledge about the most likely sources of error in their application. We have tried to illustrate that seemingly simple tests can be quite sensitive, via the coupling effect.
The techniques we advocate here are hardly ever general techniques. In a sense, they require one to deal directly in the details of both coding and the application, a notion that is certainly contrary to currently popular methodologies for validating software. But we believe there is ample evidence in man's intellectual history that he does not solve important problems by viewing them from a distance. In fact, there is an Alice in Wonderland quality to fields which claim they can solve other people's problems without knowing anything in particular about the problems. So, there is certainly no need to apologize for applying ad hoc strategies in program testing. A programmer who considers his problems well and skillfully applies appropriate techniques to their solution, regardless of where the techniques arise, will succeed.
References

1. A. E. Tucker, "The Correlation of Computer Program Quality with Testing Effort," System Development Corporation, TM 2219/000/00, January 1965.
2. R. A. DeMillo, R. J. Lipton, and A. J. Perlis, "Social Processes and Proofs of Programs and Theorems," Proc. Fourth ACM Symposium on Principles of Programming Languages, pp. 206-214. (To appear in CACM.)
3. John B. Goodenough and Susan L. Gerhart, "Toward a Theory of Test Data Selection," Proc. International Conference on Reliable Software, SIGPLAN Notices, Vol. 10, No. 6, June 1975, pp. 493-510.
4. E. A. Youngs, Error-Proneness in Programming, PhD thesis, University of North Carolina, 1971.
5. T. A. Budd, R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "The Design of a Prototype Mutation System for Program Testing," Proc. 1978 NCC.
6. C. V. Ramamoorthy, S. F. Ho, and W. T. Chen, "On the Automated Generation of Program Test Data," IEEE Trans. on Software Engineering, Vol. SE-2, No. 4, December 1976, pp. 293-300.
7. C. A. R. Hoare, "Algorithm 65: FIND," CACM, Vol. 4, No. 1, April 1961, p. 321.
8. R. S. Boyer, B. Elspas, and K. N. Levitt, "SELECT - A System for Testing and Debugging Programs by Symbolic Execution," Proc. International Conference on Reliable Software, SIGPLAN Notices, Vol. 10, No. 6, June 1975, pp. 234-245.
Richard DeMillo has been an associate professor of computer science at the Georgia Institute of Technology since 1976. During the four years prior to that he was assistant professor of computer science at the University of Wisconsin-Milwaukee. A technical consultant to several government and research agencies and to private industry, he is interested in the theory of computing, programming languages, and programming methodology. DeMillo received the BA in mathematics from the College of St. Thomas, St. Paul, Minnesota, and the PhD in information and computer science from the Georgia Institute of Technology. He is a member of ACM, the American Mathematical Society, AAAS, and the Association for Symbolic Logic.
Richard J. Lipton is an associate professor of computer science at Yale University. A faculty member since 1973, he pursues research interests in computational complexity and in mathematical modeling of computer systems. He is also a technical consultant to several government agencies and to private industry. Lipton received the BS in mathematics from Case Western Reserve University and the PhD from Carnegie-Mellon University.
Frederick G. Sayward is an assistant professor of computer science at Yale University, where he pursues research interests in semantical methods for programming languages, the theory of parallel computation as applied to operating systems, the development of programming test methods, and techniques for fault-tolerant computation. Earlier, he worked as a scientific and systems programmer at MIT Lincoln Laboratory. A member of ACM, the American Mathematical Society, and Sigma Xi, Sayward received the BS in mathematics from Southeastern Massachusetts University, the MS in computer science from the University of Wisconsin-Madison, and the PhD in applied mathematics from Brown University.