In testing for program correctness, the standard approaches
[11,13,21,22,23,24,34] have centered on finding data D, a finite
subset of all possible inputs to program P, such that
1) if for all x in D, P(x) = f(x), then P* = f
where f is a partial recursive function that specifies the
intended behavior of the program and P* is the function actually
computed by program P. A major stumbling block in such
formalizations has been that the conclusion of (1) is so strong
that, except for trivial classes of programs, (1) is bound to be
formally undecidable [23].
There is an undeniable tendency among practitioners to consider
program testing an ad hoc human technique: one creates test data
that intuitively seems to capture some aspect of the program,
observes the program in execution on it, and then draws conclusions
on the program's correctness based on the observations. To augment
this undisciplined strategy, techniques have been proposed that
yield quantitative information on the degree to which a program has
been tested. (See Goodenough [14] for a recent survey.) Thus the
tester is given an inductive basis for confidence that (1) holds
for the particular application. Paralleling the undecidability of
deductive testing methods, the inductive methods all have had
trivial examples of failure [14,18,22,23].
These deductive and inductive approaches have had a common
theme: all have aimed at the strong conclusion of (1). Program
mutation [1,7,9,27], on the other hand, is a testing technique that
aims at drawing a weaker, yet quite realistic, conclusion of the
following nature:
(2) if for all x in D, P(x) = f(x), then P* = f OR P is
"pathological."
To paraphrase,
3) if P is not pathological and P(x) = f(x) for all x in D then
P* = f.
Below we will make precise what is meant by "P is pathological";
for now it suffices to say that P not pathological means that P was
written by a competent programmer who had a good understanding of
the task to be performed. Therefore if P does not realize f it is
"close" to doing so. This underlying hypothesis of program mutation
has become known as the competent programmer hypothesis:
either P* = f or some program Q "close" to P has the property Q* =
f.
To be more specific, program mutation is a testing method that
proposes the following version of correctness testing:
Given that P was written by a competent programmer, find test
data D for which P(D) = f(D) implies P* = f.
Our method of developing D, assuming either P or some program
close to P is correct, is by eliminating the alternatives. Let
&phis; be the set of programs close to P. We restate the method
as follows:
Find test data D such that:
i) for all x in D P(x) = f(x) and
ii) for all Q in &phis; either Q* = P* or for some x in D,
Q(x) ≠ P(x).
If test data D can be developed having properties (i) and (ii),
then we say that D differentiates P from &phis;,
alternatively P passes the &phis; mutant test.
The goal of this paper is to study, from both theoretical and
experimental viewpoints, two basic questions:
Question 1: If P is written by a competent programmer and
if P passes the &phis; mutant test with test data D, does P* =
f?
Note that, after formally defining &phis; for P in a fixed
programming language L, an affirmative answer to question 1 reduces
to showing that the competent programmer hypothesis holds for this
L and &phis;.
We have observed that under many natural definitions of
&phis; there is often a strong coupling between members of
&phis; and a small subset µ. That is, often one can
reduce the problem of finding test data that differentiates P from
&phis; to that of finding test data that differentiates P from
µ. We will call this subset µ the mutants of P
and the second question we will study involves the so-called
coupling effect [9]:
Question 2 (Coupling Effect): If P passes the µ
mutant test with data D, does P pass the &phis; mutant test
with data D?
Intuitively, one can think of µ as representing the
programs that are "very close" to P.
In the next section we will present two types of theoretical
results concerning the two questions above: general results
expressed in terms of properties of the language class L, and
specific results for a class of decision table programs and for a
subset of LISP. Portions of the work on decision tables and LISP
have appeared elsewhere [5,6], but the presentations given here are
both simpler and more unified. In the final section we present a
system for applying program mutation to FORTRAN and we introduce a
new type of software experiment, called a "beat the system"
experiment, for evaluating how well our system approximates an
affirmative response to the program mutation questions.