IEEE TRANSACTIONS ON SOFTWARE ENGINEERING
Software Measurement: A Necessary Scientific Basis
Software measurement, like measurement in any
other discipline, must adhere to the science of measurement if
it is to gain widespread acceptance and validity. The observation
of some very simple, but fundamental, principles of measurement
can have an extremely beneficial effect on the subject. Measurement
theory is used to highlight both weaknesses and strengths of
software metrics work, including work on metrics validation. We
identify a problem with the well-known Weyuker properties, but
also show that a criticism of these properties by Cherniavsky and
Smith is invalid. We show that the search for general software
complexity measures is doomed to failure. However, the theory
does help us to define and validate measures of specific complexity
attributes. Above all, we are able to view software measurement
in a very wide perspective, rationalising and relating its many
diverse activities.

Index Terms: Software measurement, empirical studies, metrics,
measurement theory, complexity, validation.
It is over eleven years since DeMillo and Lipton outlined
the relevance of measurement theory to software metrics
[10]. More recent work by the author and others
has taken the measurement theory basis for software metrics
considerably further. However, despite the important message
in this work and related material, it has been largely ignored by both practitioners and
researchers. The result is that much published work in software
metrics is theoretically flawed. This paper therefore provides a
timely summary and enhancement of measurement theory
approaches, which enables us to expose problems in software
metrics work and show how they can be avoided.
We begin with a concise summary of measurement
theory. We then use the theory to show that the
search for general-purpose, real-valued software ‘complexity’
measures is doomed to failure. The assumption that fundamentally
different views of complexity can be characterised
by a single number is counter to the fundamental concepts
of measurement theory. This leads us to re-examine critically
the much cited Weyuker properties [45]. We explain how the
most promising approach is to identify specific attributes of
complexity and measure these separately. Finally, we
use basic notions of measurement to describe a framework
in which apparently diverse software measurement activities can be viewed in a
unified way. We look at some well-known
approaches to software measurement within this framework,
exposing both the good points and bad points.
Manuscript received August. This work was supported in part by IED project SMARTIE and ESPRIT project PDCS. The author is with the Centre for Software Reliability, City University, London EC1V 0HB, UK.
In this section, we provide a summary of the key concepts
from the science of measurement which are relevant to soft-
ware metrics. First, we define the fundamental notions (which
are generally not well understood) and then we summarise the
representational theory of measurement. Finally, we explain
how this leads inevitably to a goal-oriented approach.
Measurement is defined as the process by which numbers or
symbols are assigned to attributes of entities in the real world
in such a way as to describe them according to clearly defined
rules [13]. An entity may be an object, such as a person
or a software specification, or an event, such as
the testing phase of a software project. An attribute is a feature
or property of the entity, such as the height or blood pressure
(of a person), the length or functionality (of a specification),
the cost (of a project), or the duration (of the testing phase).
Just what is meant by the numerical assignment “describing”
the attribute is made precise within the representational theory
of measurement presented below. Informally, the assignment
of numbers or symbols must preserve any intuitive and empirical
observations about the attributes and entities. Thus,
for example, when measuring the height of humans, bigger
numbers must be assigned to the taller humans, although the
numbers themselves will differ according to whether we use
metres, inches, or feet. In most situations an attribute, even one
as apparently straightforward as the height of humans, may have a different
intuitive meaning to different people. The normal way to get
round this problem is to define a model for the entities being
measured. The model reflects a specific viewpoint. Thus, for
example, a model of a human might specify a particular type
of posture and whether or not to include hair height or allow
shoes to be worn. Once the model is fixed there can be a
consensus about relations which hold for humans with respect
to height (these are the empirical relations). The need for
good models is particularly relevant in software engineering
measurement. For example, even as simple a measure as
lines of code requires a well defined
model of programs which enables us to identify unique lines
unambiguously. Similarly, to measure the effort spent on, say,
the unit testing process we would need an agreed “model”
of the process which at least makes clear when the process
begins and ends.
There are two broad types of measurement: direct and indirect.
Direct measurement of an attribute is measurement which
does not depend on the measurement of any other attribute.
Indirect measurement of an attribute is measurement which
involves the measurement of one or more other attributes. It
turns out that while some attributes can be measured directly,
we normally get more sophisticated measurement (meaning a
more sophisticated scale, see below) if we measure indirectly.
For a good discussion of these issues, see the standard measurement theory texts.
Measurement: Assessment and Prediction: There
are two broad uses of measurement: for assessment and for prediction.
Predictive measurement of an attribute will
generally depend on
a mathematical model relating the attribute to
some existing measures of other attributes. Accurate
predictive measurement is inevitably dependent on careful
(assessment type) measurement of the attributes in the model.
For example, accurate estimates of project resources are not
generally obtained by simply “applying” a cost estimation model with
fixed parameters. However, careful measurement of key
attributes of completed projects can lead to accurate resource
predictions for future projects. Similarly, it is possible
to get accurate predictions of the reliability of software in
operation, but these are dependent on careful data collection
relating to failure times during alpha-testing.
For predictive measurement the model alone is not sufficient.
Additionally, we need to define the procedures for a)
determining the model parameters and b) interpreting the results.
For example, in the case of software reliability prediction we
might use maximum likelihood estimation for a) and Bayesian
statistics for b). The model, together with procedures a) and b),
constitutes a prediction system. Using the same model will
generally yield different results if we use different prediction
procedures.

It must be stressed that, for all but the most trivial attributes,
proposed predictive measures in software engineering are
invariably stochastic rather than deterministic. The same is
true of proposed indirect measures.
Measurement Activities must have Clear Objectives:
The basic definitions of measurement suggest that any measurement
activity must proceed with very clear objectives or
goals. First you need to know whether you want to measure
for assessment or for prediction. Next, you need to know
exactly which entities are the subject of interest. Then you
need to decide which attributes of the chosen entities are the
significant ones. The definition of measurement makes clear
the need to specify both an entity and an attribute before
any measurement can be undertaken (a simple fact which
has been ignored in much software metrics activity). Clearly,
there are no definitive measures which can be prescribed for
every objective in any application area. Yet for many years
software practitioners expected precisely that: ‘what software
metric should we be using?’ was, and still is, a commonly
asked question. It says something about the previous ignorance
of scientific measurement in software engineering that the
Goal/Question/Metric paradigm of Basili and Rombach has
been seen as a revolutionary step forward. GQM spells out
the above necessary obligations for setting objectives before
embarking on any software measurement activity.
The Issues Addressed: Although there is no single universally
agreed theory of measurement, most approaches are devoted
to resolving the following issues: what is and what is not
measurement; which types of attributes can and cannot be
measured and on what kind of scales; how do we know if we
have really measured an attribute; how to define measurement
scales; when is an error margin acceptable or not; and which
statements about measurement are meaningful. The standard
texts all deal with these issues. Here we give
a brief overview of the main ideas.
Empirical Relation Systems: Direct measurement of a particular
attribute possessed by a set of entities must be preceded
by intuitive understanding of that attribute. This intuitive
understanding leads to the identification of empirical relations
between entities. The set of entities, together with the set of
empirical relations, is called an empirical relation system
for the attribute in question. Thus the attribute of
“height” of people gives rise to empirical relations like “is
tall”, “taller than”, and “much taller than”.
To measure the attribute that is characterised by an empirical
relation system, we define a mapping into a numerical relation
system: entities are mapped to numbers (or symbols), and
empirical relations are mapped to numerical relations, in such
a way that all empirical relations are preserved. This is the
so-called representation condition, which asserts that the
correspondence between empirical and numerical relations is
two way. Suppose, for example, that the binary relation “taller
than” is mapped by the measure M to the numerical relation
“>”. Then, formally, we have the following instance: the set
of entities is the set of all people, with the binary relation
“taller than”. A measure M of height would map people
into the set of real numbers, and ‘taller than’ to the relation
“>”. The representation condition asserts that person A is
taller than person B if and only if M(A) > M(B).
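The two-way nature of the representation condition is easy to check mechanically on finite data. The following sketch (not from the paper; the names and heights are hypothetical) tests whether a candidate numerical assignment preserves the “taller than” relation in both directions:

```python
def satisfies_representation(people, taller_than, M):
    """True iff for all distinct A, B: taller_than(A, B) <=> M[A] > M[B]."""
    return all(taller_than(a, b) == (M[a] > M[b])
               for a in people for b in people if a != b)

# Hypothetical empirical data: heights in centimetres.
heights_cm = {"Hermann": 190, "Peter": 95, "Anna": 170}
taller = lambda a, b: heights_cm[a] > heights_cm[b]

# Any admissible rescaling (here centimetres -> inches) still satisfies
# the representation condition, because the order is preserved both ways.
inches = {p: h / 2.54 for p, h in heights_cm.items()}
print(satisfies_representation(heights_cm, taller, inches))  # True
```

An arbitrary assignment of numbers that preserves only one direction of the correspondence fails this check, which is precisely the point of the condition.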
By having to identify empirical relations for an attribute
in advance, the representational approach to measurement
avoids the temptation to define a poorly understood, but
intuitively recognisable, attribute in terms of some numerical
assignment. This is one of the most common failings in
software metrics work. Classic examples are where attributes
like “complexity” and “quality” are equated with proposed
numbers; for example, complexity with
McCabe’s cyclomatic number or Halstead’s measures, and
“quality” with Kafura and Henry’s fan-in/fan-out equation.
Scale Types and Meaningfulness: Suppose that an attribute
of some set of entities has been characterised by an empirical
relation system. There may in general be many ways of
assigning numbers which satisfy the representation condition.
For example, if person A is taller than person B, then
M(A) > M(B) irrespective of whether the measure M is in
feet, centimetres, metres, etc. Thus, there are many different
measurement representations for the normal empirical relation
system for the attribute of height of people. However, any two
representations are related in a very specific way:
there is always some constant c such that, if M
is the representation of height in inches and M'
in centimetres, then M' = cM (here c = 2.54).
This transformation from
one valid representation into another is called an admissible
transformation. It is the class of admissible transformations which
determines the scale type for an attribute (with respect to some
fixed empirical relation system). For example, where every
admissible transformation is a scalar multiplication (as for
height) the scale type is called ratio. The ratio scale is
a sophisticated scale of measurement which reflects a very
rich empirical relation system. An attribute is never measurable
on a ratio scale from the outset; we normally start with a crude understanding
of an attribute and a means of measuring it. Accumulating
data and analysing the results leads to the clarification and
re-evaluation of the attribute. This in turn leads to refined
and new empirical relations and improvements in the accuracy
of the measurement; specifically this is an improved scale.
For many software attributes we are still at the stage of
having very crude empirical relation systems. In the case of
an attribute like “criticality” of software failures an empirical
relation system would at best only identify different classes of
failures and a binary relation “is more critical than”. In this
case, any two representations are related by a monotonically
increasing transformation. With this class of admissible transformations,
we have an ordinal scale type. In increasing order
of sophistication, the best known scale types are: nominal,
ordinal, interval, ratio, and absolute. For full details about the
defining classes of admissible transformations, see the standard
measurement theory texts.
This formal definition of scale type based on admissible
transformations enables us to determine rigorously what kind
of statements about measurement are meaningful. Formally,
a statement involving measurement is meaningful if its truth
or falsity remains unchanged under any admissible transformation
of the measures involved. Thus, for example, it is
meaningful to say that “Hermann is twice as tall as Peter”; if
the statement is true (false) when we measure height in inches,
it will remain true (false) when we measure height in any
constant multiple of inches. On the other hand the statement
“failure x is twice as critical as failure y” is not meaningful
if we only have an ordinal scale empirical relation system
for failure criticality. This is because a valid ordinal scale
measure might assign x and y the values 6 and 3, while another
valid ordinal scale measure might assign them 4 and 3.
In this case the statement is true under the first measure
but false under the second.
The notion of meaningfulness also enables us to determine
what kind of operations we can perform on different measures.
For example, it is meaningful to use means for computing the
average of a set of data measured on a ratio scale, but not on
an ordinal scale. Medians are meaningful for an ordinal scale
but not for a nominal scale. Again, these basic observations
have been ignored in many software measurement studies,
where a common mistake is to use the mean (rather than the
median) as a measure of average for data which is only ordinal.
Good examples of practical applications of meaningfulness
ideas may be found in the literature; a formal
definition of meaningfulness is given in the measurement theory texts.
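The mean/median point can be demonstrated concretely. In this sketch (the ordinal criticality codes are hypothetical), an admissible monotone rescaling reverses a comparison of means but cannot reverse a comparison of medians:

```python
failures_a = [1, 1, 3]   # ordinal criticality codes for system A
failures_b = [2, 2, 2]   # ... and for system B

def mean(xs): return sum(xs) / len(xs)
def median(xs): return sorted(xs)[len(xs) // 2]

# One admissible (monotonically increasing) rescaling of the same codes:
rescale = {1: 1, 2: 2, 3: 100}.get

print(mean(failures_a) < mean(failures_b))  # True on this scale...
print(mean([rescale(x) for x in failures_a]) <
      mean([rescale(x) for x in failures_b]))  # ...but False after rescaling

# The median comparison is invariant under the rescaling, hence meaningful:
print(median(failures_a) < median(failures_b))  # True
print(median([rescale(x) for x in failures_a]) <
      median([rescale(x) for x in failures_b]))  # True
```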
The serious mathematical aspects
of measurement theory are largely concerned with
theorems which assert conditions under which certain scales of
direct measurement are possible for certain relation systems.
A typical example of such a theorem, due to Cantor, gives
conditions for real-valued ordinal-scale measurement when
we have a countable set of entities C and a binary relation R:
the empirical relation system (C, R) has a representation in
the real numbers with “<” if and only if R is a strict weak
order. The scale type is ordinal when such a representation
exists. The relation R being a “strict weak order” means that it is:
a) asymmetric: x R y implies that it is not the case that y R x; and
b) negatively transitive: x R y implies that for every z in C, either x R z or z R y.
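For a finite relation, Cantor's condition can be checked directly. A sketch (the three "programs" are hypothetical, mirroring the incomparability argument in the text):

```python
def is_strict_weak_order(X, lt):
    """Cantor: an ordinal representation in the reals exists iff lt is
    asymmetric and negatively transitive on X."""
    asymmetric = all(not (lt(x, y) and lt(y, x)) for x in X for y in X)
    neg_transitive = all(lt(x, z) or lt(z, y)
                         for x in X for y in X if lt(x, y)
                         for z in X)
    return asymmetric and neg_transitive

# Hypothetical programs: p1 is agreed less complex than p3, but p2 is
# incomparable to both -- the situation discussed in the next section.
X = {"p1", "p2", "p3"}
lt = lambda a, b: (a, b) in {("p1", "p3")}
print(is_strict_weak_order(X, lt))  # False: take x=p1, y=p3, z=p2
```

Negative transitivity fails because p1 is below p3 yet p2 is below nothing and above nothing, so no real-valued ordinal representation of this relation exists.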
The representational theory of measurement is especially
relevant to the study of software complexity measures. In
this section we show that the search for a general-purpose
real-valued complexity measure is doomed to failure, but
that there are promising axiomatic approaches which help
to measure specific complexity attributes. However, one
well-known axiomatic approach [45] has serious weaknesses
because it attempts to characterise incompatible views of
complexity.
General Complexity Measures: The Impossible Holy Grail
For many years researchers have sought to characterise
general notions of “complexity” by a single real number.
To simplify matters, we first restrict our attention to those
measures which attempt only to characterise control-flow complexity.
If we can show that it is impossible to define a general
measure of control-flow complexity, then the impossibility of
even more general complexity measures is certain.
Zuse cites dozens of proposed control-flow complexity
measures. There seems to be a minimum assumption
that the empirical relation system for complexity of programs
leads to (at least) an ordinal scale. This is because of the
following hypotheses which are implicit in much of the work.

Hypothesis 1: Let P be the class of programs. Then the
attribute control-flow “complexity” is characterised by an
empirical relation system which includes a binary relation
“less complex than”; specifically, one program is in this relation
to another if there is a reasonable consensus that it is less
complex than the other.

Hypothesis 2: The proposed measure is an ordinal scale
representation of complexity in which the relation “less
complex than” is mapped to “<”.

Hypothesis 1 seems plausible. It does not state that the class
of programs is totally ordered with respect to the relation,
only that there is some general
view of complexity for which there would be a
reasonable consensus that certain pairs of programs are in
the relation.
For example, for certain pairs of programs in the figure it
seems plausible that one would be agreed less complex than
the other (from the measurement theory viewpoint it would
be good enough if most programmers agreed this). Some pairs
appear to be incomparable: if people were
asked to “rank” these for complexity they would inevitably
end up asking questions like “what is meant by complexity”
before attempting to answer. Since the relation is supposed to
capture a general view of complexity, this would be enough to
deduce that neither program of such a pair is less complex than
the other. The idea of the inevitable incomparability of some
programs, even for some specific views of complexity, has
also been noted elsewhere in the literature.

Fig. Complexity relation not negatively transitive?
Unfortunately, while Hypothesis 1 is plausible, Hypothesis 2
can be dismissed because of the Representation Condition.
The problem is the “incomparable” programs. While the
empirical relation is not a total order, the numerical relation
“<” is a total order, and so the numbers might force an order which has to
be reflected back in the empirical system. Thus, if for example
McCabe’s cyclomatic complexity measure were proposed as
a measure of complexity, the Representation Condition asserts
that we must also have an order between programs
for which there is no consensus.
Formally we can prove the following theorem.

Theorem: There is no general
notion of control-flow complexity of programs which can be
measured on an ordinal scale in the real numbers.

To prove this, the previous argument is made formal by
appealing to Cantor’s Theorem. It is enough to show that
the relation “less complex than” is not a strict weak order. This follows since
it is reasonably clear that no consensus exists about the relative
complexities of certain pairs of programs in the figure.
The theorem should put an end to the search for the
holy grail of a general complexity measure. However, it
does not rule out the search for measures that characterise
specific views of complexity (which is the true measurement
theory approach). For example, a specific program complexity
attribute is “the number of independent paths.” McCabe’s
cyclomatic complexity is an absolute scale measure of this
attribute. It might even be a ratio scale measure of the attribute
of ‘testability’ with respect to independent path testing. Other
specific attributes of complexity, such as the maximum depth
of nesting, distribution of primes in the decomposition tree,
and the number of paths of various types, can all be measured
rigorously and automatically.
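As a concrete illustration of measuring one such specific attribute, the cyclomatic number can be computed directly from a control-flow graph as V(G) = e - n + 2p. A sketch (the if/else flowgraph is a hypothetical example):

```python
def cyclomatic(edges, nodes, components=1):
    """McCabe's cyclomatic number V(G) = e - n + 2p for a flowgraph
    with e edges, n nodes, and p connected components."""
    return len(edges) - len(nodes) + 2 * components

# Hypothetical single-entry, single-exit flowgraph of an if/else:
nodes = {"entry", "test", "then", "else", "exit"}
edges = [("entry", "test"), ("test", "then"), ("test", "else"),
         ("then", "exit"), ("else", "exit")]
print(cyclomatic(edges, nodes))  # 5 - 5 + 2 = 2 independent paths
```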
This idea of looking for measures with respect to particular
viewpoints of complexity is taken much further by Zuse.
Zuse uses measurement theory to analyse the many complexity
measures in the literature; he shows which viewpoint and
assumptions are necessary to use the measures on different
scales. The beauty and relevance of measurement theory is
such that it clearly underlies some of the most promising
work in software measurement even where the authors have
not made the explicit link. Notable in this respect are the
innovative approaches of Melton et al. and Tian and Zelkowitz.
In both of these works, the authors seek
to characterise specific views of complexity. The
authors do this by proposing a number of axioms reflecting
viewpoints of complexity; in the context of measurement
theory, the axioms correspond to particular empirical relations.
This means that the representation condition can be used to
determine the acceptability of potential measures.
Melton et al. characterise a specific view of program
complexity by specifying precisely an order relation on
program flowgraphs; in other words they define the relation
(of Hypothesis 1) as an explicit order rather than a consensus
view. The benefit of this approach
is that the view of complexity is explicit and the search for
representations (i.e., measures of this view of complexity)
becomes purely analytical. The only weakness is
the assertion that a measure is any real-valued mapping
which preserves the order in one direction only; this drops
the sufficiency condition of the Representation Condition.
Thus, while McCabe’s cyclomatic complexity satisfies
necessity (and is therefore a “measure” according to Melton
et al.), it is not a measure in the representational sense
(since, for the incomparable programs in the figure, the
two-way correspondence fails).
Interestingly, Tian and Zelkowitz also use the same
weakened form of representation, but acknowledge that they
“would like the relationship” to be necessary and sufficient.
It follows from Cantor’s theorem that there is no representation
of Melton’s order relation in the real numbers with
“<”. However, it is still possible to get ordinal measurement
in a different number system, one whose order is partial
(and hence for which it is not required that the empirical relation
is a strict weak order), although the resulting measure is of
purely theoretical interest. It is shown in [12] that there is
a representation in the natural numbers under the divides
relation. The construction of the measure M is based on
ensuring that incomparable flowgraphs are mapped to mutually
prime numbers. For the flowgraphs of the figure, for example,
one value is a fairly large multiple of 3, and M(y) is a very
large multiple of 3.
Despite the above evidence, researchers have continued
to search for single real-valued complexity measures which
are supposed to have the magical properties of being simultaneous
indicators of such diverse attributes as comprehensibility
and reliability: a high value for a “complexity”
measure is supposed to be indicative of low comprehensibility,
low reliability, etc. Sometimes these measures are also called
“quality” measures; in this case, high values of the
measure actually indicate low values of the quality attributes.

The danger of attempting to find measures which characterise
so many different attributes is that inevitably the
measures have to satisfy conflicting aims. This is counter to
the representational theory of measurement. Nobody would
expect a single number M to characterise every notion of
“quality” of people, which might include the very different
attributes of a) physical strength and b) intelligence. If such
an M existed it would have to satisfy a) M(A) > M(B)
whenever A is stronger than B, and b) M(A) > M(B)
whenever A is more intelligent than B. The fact that some
highly intelligent people are very weak physically ensures
that no M can satisfy both these properties. Nevertheless,
while Weyuker’s list of properties [45] seems to suggest the need
for just such a universal “metric”, the converse is certainly not true. The
confusion arises from wrongly equating
these two concepts, and ignoring the theory, when searching
for analogous software “complexity” measures.
Consider two of the properties that Weyuker proposes any complexity
measure should satisfy:

Property A: for any program bodies p and q, the complexity
of p is no greater than the complexity of the concatenated
program formed from p and q.

Property B: there exist program bodies p, q, and r such that
p and q are of equal complexity, but the programs formed by
concatenating each of them with r are not.

Property A reflects the view that the amount of code in a
program is a key factor in its complexity. We can also deduce
from Property A that low comprehensibility cannot always be
reflected by an increase in complexity. This is because it is
widely believed that in certain cases we can understand a
program better as we see more of it. Thus, while a “size” type
complexity measure should satisfy Property A, a “comprehensibility”
type complexity measure cannot satisfy Property A.

Property B asserts that we can find two program bodies
of equal complexity which, when separately concatenated to the
same third program, yield programs of different complexity.
Clearly, this property has much to do with comprehensibility
and little to do with size.
Thus, Properties A and B are relevant for very different,
and incompatible, views of complexity. They cannot both be
satisfied by a single measure which captures notions of size
and low comprehensibility. Although the above argument is
not formal, Zuse has recently proved that, within the
representational theory of measurement, Weyuker’s axioms
are contradictory. Formally, he shows that while Property A
explicitly requires the ratio scale, Property B
excludes the ratio scale.
The general misunderstanding of scientific measurement in
software engineering is illustrated further in a recent paper
which was itself a critique of the Weyuker axioms. Cherniavsky
and Smith define a code based “metric” which satisfies
all of Weyuker’s axioms but which, they rightly claim, is not a
sensible measure of complexity. They conclude that axiomatic
approaches may not work. There is no justification for their
conclusion. On the one hand, as they readily accept, there was
no suggestion that Weyuker’s axioms were complete. More
importantly, what they fail to observe is that Weyuker did not
propose that the axioms were sufficient; she only proposed that
they were necessary. Since the Cherniavsky/Smith “metric”
was not a measure (in the representational sense) of any
specific attribute, then showing that it satisfies any set of
necessary axioms for any measure is of no interest at all.
These problems would have been avoided by a simple
lesson from measurement theory: the definition of a numerical
mapping does not in itself constitute measurement. It has been
common in software engineering to use the word “metric” for
any number extracted from a software entity.

In software measurement activity, there are three classes of
entities of interest: processes, which are any software related
activities which take place over time; products, which are any
artefacts, deliverables or documents which arise out of the
processes; and resources, which are the items which are inputs
to processes. We make a distinction between attributes of these
entities which are internal and those which are external.
Internal attributes of a product, process, or resource are
those which can be measured purely in terms of the product,
process, or resource itself. For example, length is an internal
attribute of any software document, while elapsed time is
an internal attribute of any software process. External attributes of a
product, process, or resource are
those which can only be measured with respect to how
the product, process, or resource relates to other entities
in its environment. For example, reliability of a program (a
product attribute) is dependent not just on the program
itself, but on the compiler, machine, and user. Productivity
is an external attribute of a resource, namely people (either as
individuals or groups); it is clearly dependent on many
aspects of the process and the quality of products delivered.
Software managers and software users would most like
to measure and predict external attributes. Unfortunately,
they are necessarily only measurable indirectly. For example,
productivity of personnel is most commonly measured as the
ratio of size of code delivered (an internal product attribute)
to effort expended (an internal process attribute). The problems with
this oversimplistic measure of productivity have been well
documented. Similarly, “quality” of a software system (a very
high level external product attribute) is often defined as the
number of faults discovered during formal testing (an internal
process attribute). While this may be reasonable for developers,
this measure of quality cannot be said to be a
valid measure from the viewpoint of the user.
Empirical studies have suggested there may be little real
correlation between faults and actual failures of the software
in operation. For example, Adams examined the faults and failures
of a number of large software systems being used on many
sites around the world; he discovered that a large proportion
of faults almost never lead to failures, while a small proportion of
the known faults caused most of the common failures.
It is rare for a genuine consensus to be reached about the
contrived definitions of external attributes. An exception is the
definition of reliability of code in terms of probability of failure-free
operation within a given usage environment.
In this case, we need to measure internal process attributes:
the processes are each of the periods of software operation
between observed failures, and the key attribute is the duration of
these periods.
Software Metrics Activities Within the Framework:
The many, apparently diverse, topics within “software metrics”
all fit easily into the conceptual framework described
above. Here, we pick out just three examples.
Resource Estimation: Resource estimation is generally concerned with
predicting the effort required for the
development (normally from detailed specification through to
implementation). Most approaches involve a prediction system
in which the underlying model has the form E = f(S), where
E is effort in person months and S is a measure of system
size. The function f may involve other product attributes (as well as
resource attributes such as programmer experience). In the case of
Boehm’s COCOMO [6], size is defined as the number
of delivered source statements, which is an attribute of the final
implemented system. Since the prediction system is used at the
specification phase, we have to predict the product attribute
size in order to plug it into the model. This means that we are
replacing one difficult prediction problem (effort prediction)
with another prediction problem which may be no easier (size
prediction). This is avoided in Albrecht’s approach, in which
system “size” is measured by the number of function points
(FP’s). This is computed directly from the specification.
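The shape of such a prediction system can be sketched using the published Basic COCOMO "organic mode" constants (a = 2.4, b = 1.05); note that at specification time the size input itself would have to be predicted:

```python
# Sketch of the effort model E = f(S) using Basic COCOMO "organic mode"
# constants: effort in person-months, size in thousands of delivered
# source statements (KDSI).

def cocomo_basic_effort(kdsi, a=2.4, b=1.05):
    return a * kdsi ** b

# The catch discussed above: KDSI is only known after implementation,
# so using the model at specification time requires predicting size.
print(round(cocomo_basic_effort(32), 1))  # effort for a 32 KDSI system
```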
Software Quality Models and Reliability Models: Most
popular quality models break down quality into “factors,”
“criteria,” and “metrics” and propose relationships between
them. Quality factors generally correspond to external product
attributes. The criteria generally correspond to internal product
or process attributes. The metrics generally correspond to
proposed measures of the internal attributes. In most cases
the proposed relationships are based on purely subjective
opinions. Reliability is one of the high-level, external product
attributes which appears in all quality models. The only type
of product for which this attribute is relevant is executable
software. Reliability modelling is concerned with predicting the
reliability of software on the basis of observing times between
failures during operation or testing. Thus internal attributes of
processes are used to predict an external product attribute.
The prediction systems used in reliability modelling
typically consist of a probability distribution model, together
with a statistical inference procedure (such as maximum
likelihood estimation) for determining the model parameters,
and a prediction procedure for combining the model and the
parameter estimates to make statements about future reliability.
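These three components can be made concrete with a deliberately simplified stand-in for the published models: an exponential inter-failure-time distribution, maximum likelihood estimation of its rate parameter, and a prediction procedure for the mean time to the next failure:

```python
# A minimal sketch of a prediction system (not one of the published
# reliability models): exponential model + MLE + prediction procedure.

def mle_rate(interfailure_times):
    # For i.i.d. exponential data the MLE of the rate is 1 / sample mean.
    return len(interfailure_times) / sum(interfailure_times)

def predict_mttf(interfailure_times):
    # Prediction procedure: expected time to next failure = 1 / rate.
    return 1.0 / mle_rate(interfailure_times)

times = [12.0, 30.0, 18.0, 40.0]   # hypothetical inter-failure data (hours)
print(predict_mttf(times))         # the sample mean, 25.0
```

As the text stresses, a different inference or prediction procedure applied to the same model would generally yield different predictions.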
Halstead’s Software Science: Halstead proposed measures
of three internal program attributes which reflect different
views of size, namely length, vocabulary, and volume. These
are defined in terms of the number of distinct operators, the
number of distinct operands, the total number of operator
occurrences, and the total number of operand occurrences.
For example, length is the total number of operator occurrences
plus the total number of operand occurrences.
Although these seem reasonable measures of the specific
attributes from the measurement theory perspective, they have
been interpreted by many as being measures of program
complexity, a totally different attribute. Other Halstead
definitions are genuinely problematic from the measurement theory
perspective. Specifically, these are the “effort” and “time”
equations: E is supposed to represent ‘the number of mental
discriminations required to understand’ a program, and T the
actual time in seconds to write it.
It should now be clear
that these are crude prediction systems. For example, E is a
predicted measure of an attribute of the process of
understanding the program.
However, as discussed earlier, a
prediction system requires a means of both determining
the model parameters and interpreting the results. Neither
of these is provided in Halstead’s work. More worryingly,
it is possible to show that using E leads to contradictions
involving meaningful statements about effort (the attribute
it is supposed to measure).
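Halstead's definitions themselves are simple to compute; the problem identified above is not the arithmetic but the missing estimation and interpretation procedures. A sketch with hypothetical token counts (mu1/mu2 are distinct operators/operands, N1/N2 total occurrences):

```python
import math

# Halstead's published definitions: length N = N1 + N2, volume
# V = N * log2(mu1 + mu2), difficulty D = (mu1 / 2) * (N2 / mu2),
# and the problematic effort E = D * V.

def halstead(mu1, mu2, N1, N2):
    N = N1 + N2
    V = N * math.log2(mu1 + mu2)
    D = (mu1 / 2) * (N2 / mu2)
    return {"length": N, "volume": V, "effort": D * V}

# Hypothetical counts for a small program fragment:
print(halstead(mu1=4, mu2=3, N1=7, N2=5))
```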
The discussion above confirms that invariably we
need to measure internal attributes to support the measurement
of external attributes. This point has also been noted elsewhere.
Even the best understood external product attribute, reliability,
requires inter-failure time data to be collected during testing
or operation. In many situations, we may need to make a
prediction about an external product attribute before the
product even exists. For example, given the detailed designs
of a set of untested software modules, which ones, when
implemented, are likely to be the most difficult to test or
maintain? This is a major motivation for studying measures
of internal attributes of products, and was the driving force
behind much work on complexity measures.
Consider, for example, the product attribute
Many modem software engineering methods and techniques
are based on the premise that a modular structure for software
is a “good thing.” What this assumption means formally is
that modularity, an intemal software product attribute, has
significant impact on external software product attributes such
as maintainability and reliability. Although
number of studies
and  have investigated this relationship there
is no strong evidence to support the widely held beliefs about
the benefits of modularity. While the study in
evidence that modularity was related to maintainability, it
relationship, whereas Pressman and others
believe that neither excessively high nor excessively low
modularity are acceptable. However, the main problem with
all the studies is the lack of a previously validated measure
Using the representational approach, we need to identify the
intuitive notions which lead to
consensus view of modularity
before we can measure
Some empirical relations were identified earlier.
Others can be picked up by reading the
general software engineering literature. For example, it is
widely believed that the average module size alone does not
determine a system's modularity. It is affected by the whole
structure of the module calling hierarchy. Thus, the number
of levels and the distribution of modules at each level have to
be considered; module calling structures with widely varying
widths are not considered to be very modular because of ideas
of chunking from cognitive psychology.
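These intuitions can be made precise enough to experiment with. A minimal sketch (the hierarchy and module names below are invented for illustration) computes the number of levels and the spread of module counts across levels — the kind of candidate empirical characterisation that would still need validating against the representation condition:

```python
import statistics

def level_widths(calls, root):
    """Breadth-first walk of the module calling hierarchy; returns the
    number of modules found at each level, starting from the root."""
    widths, frontier, seen = [], [root], {root}
    while frontier:
        widths.append(len(frontier))
        nxt = []
        for m in frontier:
            for child in calls.get(m, []):
                if child not in seen:
                    seen.add(child)
                    nxt.append(child)
        frontier = nxt
    return widths

# Hypothetical calling hierarchy: main calls three subsystems, and so on.
calls = {
    "main": ["ui", "core", "io"],
    "ui":   ["menu", "forms"],
    "core": ["parse", "eval"],
    "io":   ["read", "write"],
}

widths = level_widths(calls, "main")
levels = len(widths)
spread = statistics.pstdev(widths)   # 0 would mean perfectly even levels

print(f"levels={levels}, widths={widths}, width spread={spread:.2f}")
# prints: levels=3, widths=[1, 3, 6], width spread=2.05
```

A high width spread flags exactly the "widely varying widths" that the chunking argument counts against modularity; whether such a number preserves anyone's intuitive ordering of systems is, of course, the empirical question.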
Validating Software Measures
Validating a software measure in the assessment sense is
equivalent to demonstrating empirically that the representation
condition is satisfied for the attribute being measured. For
a measure in the predictive sense, all the components of
the prediction system must be clearly specified and a proper
hypothesis proposed, before the experimental design for validation can begin.
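Stated formally, in standard representational notation (writing $x \succeq y$ for the empirical relation "$x$ exhibits at least as much of the attribute as $y$"), the representation condition that an assessment-sense validation must demonstrate is:

```latex
% Representation condition for a measure M of a given attribute:
\forall x, y :\quad x \succeq y \;\Longleftrightarrow\; M(x) \ge M(y)
```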
Despite these simple obligations for measurement validation,
the software engineering literature abounds with so-called
validation studies which have ignored them totally. This
phenomenon has been examined thoroughly in [14],
and fortunately there is some recent work addressing the
problem. Typically a measure (in the assessment sense)
is proposed. For example, this might be a measure of an
internal structural attribute of source code. The measure is
"validated" by showing that it correlates with some other
existing measure. What this really means is that the proposed
measure is the main independent variable in a prediction
system. Unfortunately, these studies commonly fail to specify
the required prediction system and experimental hypothesis.
Worse still, they do not specify, in advance, what is the
dependent variable being predicted. The result is often an
attempt to find fortuitous correlations with any data which
happens to be available. In many cases, the only such data
happens to be some other structural measure. For example,
in one published study, structural-type measures are "validated" by showing
that they correlate with "established" measures like LOC
and McCabe's cyclomatic complexity number. In such cases,
the validation study tells us nothing of real interest. The general
dangers of the "shotgun" approach to correlations of software
measures have been highlighted in [8].
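The danger is easy to demonstrate. The following sketch (illustrative only; the metric names and data are invented and are pure noise by construction) correlates twenty meaningless "metrics" against a random dependent variable and reports the best correlation found, which will typically look respectable on a small sample:

```python
import random
import statistics

random.seed(1)

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

n_modules = 15                                          # small sample, as in many studies
defects = [random.random() for _ in range(n_modules)]   # the "dependent variable"

# Twenty candidate "structural metrics", all pure noise by construction.
metrics = {f"metric_{i}": [random.random() for _ in range(n_modules)]
           for i in range(20)}

best_name, best_r = max(
    ((name, pearson_r(vals, defects)) for name, vals in metrics.items()),
    key=lambda t: abs(t[1]))

print(f"best of 20 noise metrics: {best_name}, r = {best_r:.2f}")
# With 20 tries on 15 data points, a sizeable |r| is routinely reached by
# chance alone -- a "validation" that tells us nothing.
```

Without a prediction system and hypothesis fixed in advance, reporting only the best of many correlations is precisely the shotgun approach.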
The search for rigorous software measures has not been
helped by a commonly held viewpoint that no measure is
"valid" unless it is a good predictor of effort. An analogy
would be to reject the usefulness of measuring a person's
height on the grounds that it tells us nothing about that person's
intelligence. The result is that potentially valid measures of
important internal attributes become distorted. Consider, for
example, Albrecht's function points [2]. In this approach, the
unadjusted function count (UFC) seems to be a reasonable measure
of the important attribute of functionality of specification
documents. However, the intention was to define a single
size measure as the main independent variable in prediction
systems for effort. Because of this, a technical complexity factor
(TCF) is applied to UFC to arrive at the number of
function points FP, which is the model in the prediction system
for effort. The TCF takes account of 14 product and process
attributes in Albrecht's approach, and even more in Symons'
approach. This kind of adjustment (to a measure of
functionality) is analogous to redefining measures of height
of people in such a way that the measures correlate more
closely with intelligence. Interestingly, Jeffery [20] has shown
that the complexity adjustments do not even improve effort
predictions; there were no significant differences between UFC
and FP as effort predictors in his studies. Similar results have
been reported by Kitchenham and Kansala [24].
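For reference, Albrecht's adjustment has the standard published form, in which each of the fourteen factors $F_i$ is rated on a 0–5 scale:

```latex
FP = UFC \times TCF, \qquad
TCF = 0.65 + 0.01 \sum_{i=1}^{14} F_i, \qquad F_i \in \{0, 1, \ldots, 5\}
```

The TCF therefore lies between 0.65 and 1.35, so the adjustment can move the unadjusted count by at most 35% in either direction.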
Contrary to popular opinion, software measurement, like
measurement in any other discipline, must adhere to the
science of measurement if it is to gain widespread acceptance
and validity. The representational theory of measurement asserts that measurement
is the process of assigning numbers or symbols to attributes of
entities in such a way that all empirical relations are
preserved. The entities of interest in software can be
classified as processes, products, or resources.
Anything we may wish to measure or predict is an identifiable
attribute of these. Attributes are either internal or external.
Although external attributes like reliability of products, stability
of processes, or productivity of resources tend to be
the ones we are most interested in measuring, we cannot do so
directly. We are generally forced to measure indirectly in
terms of internal attributes. Predictive measurement requires a
prediction system. This means not just a model but also a set
of prediction procedures for determining the model parameters
and applying the results. These in turn are dependent on
accurate measurements in the assessment sense.
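To make the distinction concrete, here is a minimal sketch of a complete prediction system, assuming a hypothetical power-law effort model E = a·S^b and invented historical data: the model alone is not a prediction system until the procedures for determining its parameters and for applying it are also specified.

```python
import math

def fit_effort_model(sizes, efforts):
    """Procedure 1: determine the parameters a, b of E = a * S**b by
    least squares on the log-transformed data (log E = log a + b log S)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
         sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

def predict_effort(a, b, size):
    """Procedure 2: apply the fitted model to a new project's size
    (an internal attribute measured in the assessment sense)."""
    return a * size ** b

# Hypothetical historical data: (size in KLOC, effort in person-months).
sizes = [10, 25, 40, 80, 120]
efforts = [24, 68, 115, 250, 400]

a, b = fit_effort_model(sizes, efforts)
print(f"fitted model: E = {a:.2f} * S^{b:.2f}")
print(f"predicted effort for a 60 KLOC project: "
      f"{predict_effort(a, b, 60):.0f} person-months")
```

Note that both procedures in turn depend on accurate assessment-sense measurements of size and effort for the historical projects.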
We have used measurement theory to highlight both weaknesses
and strengths of software metrics work, including work
on metrics validation. Invariably, it seems that the most
promising theoretical work has been that using the key components
of measurement theory. We showed that the search for general
software complexity measures is doomed to failure. However,
the theory does help us to define and validate measures of
specific complexity attributes.
Acknowledgment
I would like to thank B. Littlewood and M. Neil for their comments on an
earlier draft of this paper, and Page and R. Whitty for sharing views and
information that have influenced its contents. Finally, I would
like to thank the four anonymous referees who made suggestions
that clearly improved the paper.
References
[1] E. Adams, "Optimizing preventive service of software products," IBM J. Research and Development, vol. 28, no. 1, pp. 2-14, 1984.
[2] A. J. Albrecht, "Measuring application development productivity," in Proc. Joint SHARE/GUIDE Symp., 1979.
[3] J. Aczel, F. S. Roberts, and Z. Rosenbaum, "On scientific laws without dimensional constants," J. Math. Anal. Applicat., vol. 119, pp. 389-416, 1986.
[4] A. L. Baker, J. M. Bieman, N. Fenton, D. A. Gustafson, A. Melton, and R. Whitty, "A philosophy for software measurement," J. Syst. Software, vol. 12, pp. 277-281, July 1990.
[5] S. Brocklehurst, P. Y. Chan, B. Littlewood, and J. Snell, "Recalibrating software reliability models," IEEE Trans. Software Eng., vol. 16, no. 4, pp. 458-470, Apr. 1990.
[6] B. W. Boehm, Software Engineering Economics. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[7] V. R. Basili and H. D. Rombach, "The TAME project: Towards improvement-oriented software environments," IEEE Trans. Software Eng., vol. 14, no. 6, pp. 758-773, June 1988.
[8] R. E. Courtney and D. A. Gustafson, "Shotgun correlations in software measures," Software Eng. J., 1993.
[9] J. C. Cherniavsky and C. H. Smith, "On Weyuker's axioms for software complexity measures," IEEE Trans. Software Eng., vol. 17, no. 6, pp. 636-638, June 1991.
[10] R. A. DeMillo and R. J. Lipton, "Software project forecasting," in Software Metrics, A. Perlis, F. G. Sayward, and M. Shaw, Eds. Cambridge, MA: MIT Press, 1981, pp. 77-89.
[11] N. E. Fenton, Software Metrics: A Rigorous Approach. London: Chapman & Hall, 1991.
[12] N. E. Fenton, "When a software measure is not a measure," Software Eng. J., vol. 7, pp. 357-362, May 1992.
[13] L. Finkelstein, "A review of the fundamental concepts of measurement," Measurement, vol. 2, no. 1, pp. 25-34, 1984.
[14] N. E. Fenton and B. A. Kitchenham, "Validating software measures," J. Software Testing, Verification and Reliability, vol. 1, no. 2, pp. 27-42, 1991.
[15] N. E. Fenton and A. Melton, "Deriving structurally based software measures," J. Syst. Software, vol. 12, pp. 177-187, July 1990.
[16] J.-C. Falmagne and L. Narens, "Scales and meaningfulness of quantitative laws," Synthese, vol. 55, pp. 287-325, 1983.
[17] P. J. Fleming and J. J. Wallace, "How not to lie with statistics: The correct way to summarize benchmark results," Commun. ACM, vol. 29, no. 3, pp. 218-221, 1986.
[18] M. H. Halstead, Elements of Software Science. New York: Elsevier North-Holland, 1977.
[19] J. Inglis, "Standard software quality metrics," AT&T Tech. J., no. 2, pp. 113-118, Feb. 1985.
[20] D. R. Jeffery, G. C. Low, and M. Barnes, "A comparison of function point counting techniques," IEEE Trans. Software Eng., vol. 19, pp. 529-532, 1993.
[21] Z. Jelinski and P. B. Moranda, "Software reliability research," in Statistical Computer Performance Evaluation, W. Freiberger, Ed. New York: Academic Press, 1972, pp. 465-484.
[22] B. A. Kitchenham and B. de Neumann, "Cost modelling and estimation," in Software Reliability Handbook, P. Rook, Ed. New York: Elsevier Applied Science, 1990, pp. 333-376.
[23] D. Kafura and S. Henry, "Software quality metrics based on interconnectivity," J. Syst. Software, vol. 2, pp. 121-131, 1981.
[24] B. A. Kitchenham and K. Kansala, "Inter-item correlations among function points," in Proc. IEEE Software Metrics Symp., Baltimore, MD, 1993.
[25] D. H. Krantz, R. D. Luce, P. Suppes, and A. Tversky, Foundations of Measurement, vol. 1. New York: Academic Press, 1971.
[26] B. A. Kitchenham and N. R. Taylor, "Software project development cost estimation," J. Syst. Software, vol. 5, pp. 267-278, 1985.
[27] H. E. Kyburg, Theory and Measurement. Cambridge: Cambridge Univ. Press, 1984.
[28] H. F. Li and W. K. Cheung, "An empirical study of software metrics," IEEE Trans. Software Eng., vol. 13, no. 6, pp. 697-708, June 1987.
[29] B. Littlewood, "Forecasting software reliability," in Software Reliability: Modelling and Identification (Lecture Notes in Computer Science, vol. 341). New York: Springer-Verlag, 1988, pp. 141-209.
[30] T. J. McCabe, "A complexity measure," IEEE Trans. Software Eng., vol. SE-2, no. 4, pp. 308-320, Dec. 1976.
[31] A. C. Melton, D. A. Gustafson, J. M. Bieman, and A. A. Baker, "A mathematical perspective of software measures research," Software Eng. J., vol. 5, pp. 246-254, 1990.
[32] J. C. Munson and T. M. Khoshgoftaar, "The detection of fault-prone programs," IEEE Trans. Software Eng., vol. 18, no. 5, pp. 423-433, May 1992.
[33] M. Neil, "Multivariate assessment of software products," J. Software Testing, Verification and Reliability, to appear, 1992.
[34] R. E. Prather and S. G. Giulieri, "Decomposition of flowchart schemata," Comput. J., vol. 24, no. 3, pp. 258-262, 1981.
[35] R. S. Pressman, Software Engineering: A Practitioner's Approach, 2nd ed. New York: McGraw-Hill Int., 1987.
[36] F. S. Roberts, Measurement Theory with Applications to Decisionmaking, Utility, and the Social Sciences. Reading, MA: Addison-Wesley, 1979.
[37] F. S. Roberts, "Applications of the theory of meaningfulness to psychology," J. Math. Psychol., vol. 29, pp. 311-332, 1985.
[38] N. F. Schneidewind, "Methodology for validating software metrics," IEEE Trans. Software Eng., vol. 18, no. 5, pp. 410-422, May 1992.
[39] J. E. Smith, "Characterizing computer performance with a single number," Commun. ACM, vol. 31, no. 10, pp. 1202-1206, 1988.
[40] P. H. Sydenham, Ed., Handbook of Measurement Science. New York: Wiley, 1982.
[41] C. R. Symons, "Function point analysis: Difficulties and improvements," IEEE Trans. Software Eng., vol. 14, no. 1, pp. 2-11, Jan. 1988.
[42] D. A. Troy and S. H. Zweben, "Measuring the quality of structured designs," J. Syst. Software, vol. 2, pp. 113-120, 1981.
[43] J. Tian and M. V. Zelkowitz, "A formal program complexity model and its application."
[44] S. N. Woodfield, H. E. Dunsmore, and V. Y. Shen, "The effects of modularization and comments on program comprehension," in Proc. 5th Int. Conf. Software Eng., 1981, pp. 215-223.
[45] E. J. Weyuker, "Evaluating software complexity measures," IEEE Trans. Software Eng., vol. 14, no. 9, pp. 1357-1365, Sept. 1988.
[46] H. Zuse, Software Complexity: Measures and Methods. Berlin: de Gruyter, 1990.
[47] H. Zuse, "Support of experimentation by measurement theory," in Experimental Software Engineering Issues (Lecture Notes in Computer Science, vol. 706), H. D. Rombach, V. R. Basili, and R. W. Selby, Eds. New York: Springer-Verlag, 1993, pp. 137-140.
Norman Fenton is a Professor of Computing Science
in the Centre for Software Reliability, City University, London.
He was previously the
Director of the Centre for Systems and Software
Engineering (CSSE) at South Bank University and a
Post-Doctoral Research Fellow at Oxford University.
He has consulted widely to industry about metrics
programs, and has led numerous collaborative
projects. One such current project is developing a
measurement-based framework for the assessment
of software engineering standards and methods. His research interests are in
software measurement and formal methods of software development. He has
written three books on these subjects and published many papers.
He is Editor of Chapman and Hall's Computer Science Research
and Practice Series and serves on the editorial board of a software
engineering journal. He has chaired several international conferences on software metrics.
Prof. Fenton is Secretary of the (National) Centre for
Software Reliability. He is a
Chartered Engineer (Member of the IEE), an Associate Fellow of the
Institute of Mathematics and its Applications, and a member of the IEEE.