
Hints on Test Data Selection: Help for the Practicing Programmer

Abstract

A new, empirically observed effect is introduced. Called "the coupling effect," it may become a very important principle in practical testing activities. The idea is that programs appear to have the property (the "coupling effect") that tests designed to detect simple kinds of errors are also effective in detecting much more complicated errors. This relationship may seem counter-intuitive, but the authors give a way of analyzing it through the use of program mutations (i.e., incorrect variations from a correct program). One of the most interesting possibilities is that the mutation idea could form the basis for statistically inferring the likelihood of remaining errors in a program.
Richard A. DeMillo, Georgia Institute of Technology
Richard J. Lipton and Frederick G. Sayward, Yale University
In many cases tests of a program that uncover simple errors are also effective in uncovering much more complex errors. This so-called coupling effect can be used to save work during the testing process.
Much of the technical literature in software reliability deals with tentative methodologies and underdeveloped techniques; hence it is not surprising that the programming staff responsible for debugging a large piece of software often feels ignored. It is an economic and political requirement in most production programming shops that programmers shall spend as little time as possible in testing. The programmer must therefore be content to test cleverly but cheaply; state-of-the-art methodologies always seem to be just beyond what can be afforded. We intend to convince the reader that much can be accomplished even under these constraints.
From the point of view of management, there is some justification for opposing a long-term view of the testing phase of the development cycle. Figure 1 shows the relative effect of testing on the remaining system bugs for several medium-scale systems developed by System Development Corporation.1 Notice that in the last half of the test cycle, the average change in the known-error status of a system is 0.4 percent per unit of testing effort, while in the first half of the cycle, 1.54 percent of the errors are discovered per unit of testing effort. Since it is enormously difficult to be convincing in stating that the testing effort is complete, the apparently rapidly decreasing return per unit of effort invested becomes a dominating concern. The standard solution, of course, is to limit the amount of testing time to the most favorable part of the cycle.
Programmers have one great advantage that is almost never exploited: they create programs that are close to being correct!

How, then, should programmers cope? Their more sophisticated general methodologies are not likely to be applicable.2 In addition, they have the burden of convincing managers that their software is indeed reliable.
The coupling effect
Programmers, however, have one great advantage that is almost never really exploited: they create programs that are close to being correct! Programmers do not create programs at random; competent programmers, in their many iterations through the design process, are constantly whittling away the distance between what their programs look like now and what they are intended to look like. Programmers also have at their disposal

* a rough idea of the kinds of errors most likely to occur;
* the ability and opportunity to examine their programs in detail.
Error classifications. In attempting to formulate a comprehensive theory of test data selection, Susan Gerhart and John Goodenough3 have suggested that errors be classified as follows:

(1) failure to satisfy specifications due to implementation error;
(2) failure to write specifications that correctly represent a design;
(3) failure to understand a requirement;
(4) failure to satisfy a requirement.

But these are global concerns. Errors are always reflected in programs as

* missing control paths,
* inappropriate path selection, or
* inappropriate or missing actions.
0018-9162/78/0400-0034$00.75 © 1978 IEEE, COMPUTER
We do not explicitly address classifications (2) and (3) in this article, except to point out that even here a programmer can do much without fancy theories. If we are right in our perception of programs as being close to correct, then these errors should be detectable as small deviations from the intended program. There is an amazing lack of published data on this subject, but we do have some idea of the most common errors.
E. A. Youngs, in his PhD dissertation,4 analyzed 1258 errors in Fortran, Cobol, PL/I, and Basic programs. The errors were distributed as shown in Table 1. In addition to these errors, certain other errors were present in negligible quantities. There were, for instance, operating system interface errors, such as incorrect job identification and erroneous external I/O assignment. Also present were errors in comments, pseudo-ops, and no-ops which for various reasons created detectable error conditions.
Complex errors coupled. How, then, do the relatively simple error types discovered by Youngs connect with the Gerhart-Goodenough error classification? Well, the naive answer is that since arbitrarily pernicious errors may be responsible for a given failure, it must be that simple errors compound in more massive error conditions. For the practical treatment of test data, the Youngs error statistics, therefore, do not seem to help much at all.
Fortunately though, the observation that programs are "close to correct" leads us to an assumption which makes the high frequency of simple errors very important:

    The coupling effect: Test data that distinguishes all programs differing from a correct one by only simple errors is so sensitive that it also implicitly distinguishes more complex errors.

In other words, complex errors are coupled to simple errors. There is, of course, no hope of "proving" the coupling effect; it is an empirical principle.
If the coupling effect can be observed in "real-world" programs, then it has dramatic implications for testing strategies in general and domain-specific, limited testing in particular. Rather than scamper after errors of undetermined character, the tester should attempt a systematic search for simple errors that will also uncover deeper errors via the coupling effect.
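The effect is easy to state in modern notation. The sketch below is our own invented toy example, not one from the article's experiments: a small program, three single-error mutants, and a mutant combining two of those errors. A test datum chosen because it kills the simple mutants turns out to kill the compound one as well.

```python
def sum_pos(xs):
    """The 'correct' program: sum the positive elements of xs."""
    s = 0
    for v in xs:
        if v > 0:
            s += v
    return s

# Simple (single-error) mutants.
def m1(xs):            # relational operator mutated: > becomes <
    s = 0
    for v in xs:
        if v < 0:
            s += v
    return s

def m2(xs):            # arithmetic operator mutated: += becomes -=
    s = 0
    for v in xs:
        if v > 0:
            s -= v
    return s

def m3(xs):            # constant mutated: accumulator starts at 1
    s = 1
    for v in xs:
        if v > 0:
            s += v
    return s

# A more complex mutant combining the errors of m1 and m2.
def m_complex(xs):
    s = 0
    for v in xs:
        if v < 0:
            s -= v
    return s

test = [1, -2, 3]      # chosen so each simple mutant misbehaves
simple_killed = all(m(test) != sum_pos(test) for m in (m1, m2, m3))
complex_killed = m_complex(test) != sum_pos(test)
```

One datum proves nothing, of course; the claim of the coupling effect is only that this pattern recurs in practice.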
Path analysis. This point seems so obvious that it's not worth making: test to uncover errors. Yet it's a point that's often lost in the shuffle. In a common methodology known as path analysis, the point of the test data is to drive a program through all of its control paths. It is certainly hard to criticize such a goal, since a thoroughly tested program must have been exercised in this way. But unless one recognizes that the test data should also distinguish errors, he might be tempted to conclude, for example, that the program segment diagrammed in Figure 2 can be tested by exercising paths 1-2 and 1-3, even though one of the clauses P and Q may not have been affected at all! In general, the relative ordering of P and Q may be irrelevant or partially unknown and side effects may occur, so that actually the eight paths shown in Figure 3 are required to ensure that the statement has been adequately tested.
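A sketch of the difficulty, with an invented two-path segment in modern notation (the clauses P and Q are reduced to boolean parameters): data that drives the program through both of its control paths can still say nothing about Q.

```python
def route(p, q):
    # One IF with a compound condition: only two control paths.
    if p or q:
        return "then"
    return "else"

def route_mut(p, q):
    # Mutant: the clause q has been dropped from the condition.
    if p:
        return "then"
    return "else"

path_tests = [(True, False), (False, False)]   # exercises paths 1-2 and 1-3
survives = all(route(p, q) == route_mut(p, q) for p, q in path_tests)

# A datum in which q alone decides the branch kills the mutant.
killed = route(False, True) != route_mut(False, True)
```

Here `survives` is true: full path coverage left the error in Q undetected until a datum was chosen specifically to make Q decisive.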
[Figure 1: percent of known errors found, plotted against percent of testing effort (man-months, computer hours, etc.); chart not reproduced.]

Figure 1. More programming errors are found in the early part of the test cycle than in the final part.
Table 1. Frequency of occurrence of 1258 errors in Fortran, Cobol, PL/I, and Basic programs.

Error Type                                  Relative Frequency of Occurrence
Error in assignment or computation          .27
Allocation error                            .15
Other, unknown, or multiple errors          .11
Unsuccessful iteration                      .09
Other I/O error                             .07
I/O formatting error                        .06
Error in branching (unconditional)          .01
Error in branching (conditional)            .05
Parameter or subscript violation            .05
Subprogram invocation error                 .05
Misplaced delimiter                         .04
Data error                                  .02
Error in location or marker                 .02
Nonterminating subprogram                   .01
Figure 2. Sample program segment with two paths. [diagram not reproduced]
April 1978
Two examples given below indicate that test data derived to uncover simple errors can, in fact, be vastly superior to, say, randomly chosen data or data generated for path analysis. A byproduct of the discussion will be some evidence for the coupling effect. A third example reveals another advantage of selecting test data with an eye on coupling: since it's a problem-specific activity, there are enhanced possibilities for discovering useful heuristics for test data selection. This example will lead to useful advice for generating test vectors for programs that manipulate arrays.
Our groups at Yale University and the Georgia Institute of Technology have constructed a system whereby we can determine the extent to which a given set of test data has adequately tested a Fortran program by direct measurement of the number and kinds of errors it is capable of uncovering.
This method, known as program mutation, is used interactively: a programmer enters from a terminal a program, P, and a proposed test data set whose adequacy is to be determined. The mutation system first executes the program on the test data; if the program gives incorrect answers then certainly the program is in error. On the other hand, if the program gives correct answers, then it may be that the program is still in error, but the test data is not sensitive enough to distinguish that error: it is not adequate.
The mutation system then creates a number of mutations of P that differ from P only in the occurrence of simple errors (for instance, where P contains the expression "B.LE.C" a mutation will contain "B.EQ.C"). Let us call these mutations P1, P2, ..., Pk. Now, for the given set of test data there are only two possibilities: (1) on that data P gives different results from the Pi mutations, or (2) on that data P gives the same results as some Pi.
In case (1) Pi is said to be dead: the "error" that produced Pi from P was indeed distinguished by the test data. In case (2), the mutant Pi is said to be live; a mutant may be live for two reasons: (1) the test data does not contain enough sensitivity to distinguish the error that gave rise to Pi, or (2) Pi and P are actually equivalent programs and no test data will distinguish them (i.e., the "error" that gave rise to Pi was not an error at all).
Test data that leaves no live mutants, or only live mutants that are equivalent to P, is adequate in the following sense: either the program P is correct, or there is an unexpected error in P, which, by the coupling effect, we expect to happen seldom if the errors used to create the mutants are carefully chosen.
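The adequacy procedure just described can be summarized in a few lines of modern notation. The names and the toy programs below are ours, chosen for illustration; the actual system works interactively on Fortran.

```python
def adequacy_check(program, oracle, mutants, tests):
    """Sketch of the mutation system's loop (illustrative names).

    program and the mutants are callables; oracle gives the intended
    output for each test input.  Returns the list of live mutants.
    """
    # Step 1: P itself must give correct answers on the test data;
    # if it does not, the program is certainly in error.
    for t in tests:
        if program(t) != oracle(t):
            raise AssertionError("program is in error on input %r" % (t,))
    # Step 2: a mutant Pi is dead if some test separates it from P;
    # otherwise it is live (inadequate data, or an equivalent mutant).
    return [m for m in mutants
            if all(m(t) == program(t) for t in tests)]

# Toy use: abs as P, with one dead mutant and one live mutant.
correct = abs
mut_dead = lambda x: -x    # killed by any positive test input
mut_live = lambda x: x     # survives: no test input is negative
live = adequacy_check(correct, abs, [mut_dead, mut_live], [0, 3])
```

The surviving mutant signals that the test set `[0, 3]` is not yet adequate: it never exercises a negative input.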
Now, it is not completely apparent that this process is computationally feasible. But, as we describe in more detail elsewhere, there is a very good choice of methodology for generating mutations to bring the procedure within attractive economic bounds.5
Apparently, the information returned by the mutation system can be effectively utilized by the programmer. The programmer looks at a negative response from the system as a "hard question" concerning his program (e.g., "The test data you've given me says it doesn't matter whether or not this test is for equality or inequality; why is that?") and is able to use his answers to the question as a guide in generating more sensitive test data.
Figure 3. Eight paths may be required for an adequate test.
A simple example
Our first example is very simple; it involves the MAX algorithm used for other purposes by Peter Naur in the early 1960's. The task is to set a variable R to the index of the first occurrence of a maximum element in the vector A(1), ..., A(N). For example, the following Fortran subroutine might be offered as an implementation of such an algorithm:

      SUBROUTINE MAX (A,N,R)
      INTEGER A(N),I,N,R
    1 R=1
    2 DO 3 I=2,N,1
    3 IF (A(I).GT.A(R)) R=I
      RETURN
      END
We will choose for our initial set of test data three vectors (Table 2).

Table 2. Three vectors constitute the initial set of test data.

            A(1)   A(2)   A(3)
   data 1     1      2      3
   data 2     1      3      2
   data 3     3      1      2
How sensitive is this data? By inspection, we notice that if an error had occurred in the relational operation of the IF statement, then either data 1, data 2, or data 3 would have distinguished those errors, except for one case. None of these data vectors distinguishes .GE. from .GT. in the IF statement. Similarly, these vectors distinguish all simple errors in constants except for starting the DO loop at "1" rather than "2." All simple errors in variables are likewise distinguished except for the errors in the IF statement which replace "A(I)" by "I" or by "A(R)." That is, if we run the data set above in any of the following mutants of MAX, we get the same results.

      SUBROUTINE MAX (A,N,R)
      INTEGER A(N),I,N,R
    1 R=1
    2 DO 3 I=1,N,1
    3 IF (A(I).GT.A(R)) R=I
      RETURN
      END

      SUBROUTINE MAX (A,N,R)
      INTEGER A(N),I,N,R
    1 R=1
    2 DO 3 I=2,N,1
    3 IF (I.GT.A(R)) R=I
      RETURN
      END

      SUBROUTINE MAX (A,N,R)
      INTEGER A(N),I,N,R
    1 R=1
    2 DO 3 I=2,N,1
    3 IF (A(I).GE.A(R)) R=I
      RETURN
      END

      SUBROUTINE MAX (A,N,R)
      INTEGER A(N),I,N,R
    1 R=1
    2 DO 3 I=2,N,1
    3 IF (A(R).GT.A(R)) R=I
      RETURN
      END
Let us try to kill as many of these mutants as possible. In view of the first difficulty, we might guess that our data is not yet adequate because it does not contain repeated elements. So, let us add

            A(1)   A(2)   A(3)
   data 4     2      2      1
Now, replacing .GT. by .GE. and running on data 4 gives erroneous results, so that all mutants arising from simple relational errors are dead. Surprisingly, data 4 also distinguishes the two errors in A(I); so, we are left with only the last mutant arising from the "constant" error: variation in beginning the DO loop.
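These observations can be checked mechanically. The following transcription of MAX and two of the live mutants into modern notation (our rendering, not part of the original system) confirms that data 1, 2, and 3 leave the mutants live, while data 4 kills them:

```python
def max_index(a):
    """Rendering of the Fortran MAX: 1-based index of the first maximum."""
    r = 1
    for i in range(2, len(a) + 1):
        if a[i - 1] > a[r - 1]:
            r = i
    return r

def mut_ge(a):
    """Mutant: .GT. replaced by .GE."""
    r = 1
    for i in range(2, len(a) + 1):
        if a[i - 1] >= a[r - 1]:
            r = i
    return r

def mut_i(a):
    """Mutant: "A(I)" replaced by "I" in the IF statement."""
    r = 1
    for i in range(2, len(a) + 1):
        if i > a[r - 1]:
            r = i
    return r

data123 = [[1, 2, 3], [1, 3, 2], [3, 1, 2]]
data4 = [2, 2, 1]

# Data 1-3 cannot tell the mutants from the original ...
live_on_123 = all(max_index(d) == mut_ge(d) == mut_i(d) for d in data123)
# ... but data 4, with its repeated maximum, kills both.
killed_by_4 = (mut_ge(data4) != max_index(data4)
               and mut_i(data4) != max_index(data4))
```

On data 4 the correct program answers 1 (the first of the tied maxima), while the .GE. mutant answers 2 and the "I" mutant answers 3.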
But closer inspection of the program indicates that starting the DO loop at "1"