Observations are Always Ordinal;
Measurements, however, Must be Interval
ABSTRACT. Quantitative observations are based on counting observed events or levels of
performance. Meaningful measurement is based on the arithmetical properties of interval
scales. The Rasch measurement model provides the necessary and sufficient means to
transform ordinal counts into linear measures. Imperfect unidimensionality and other threats
to linear measurement can be assessed by means of fit statistics. The Rasch model is being
successfully applied to rating scales.
Merbitz and associates[5] provide a sensitive and useful explanation of the hazards encountered
when data are treated improperly. This misconstruction of data can be understood as the result of a
confusion as to the relationship between observation and measurement - a confusion which can be
speedily resolved with a little clarification.
Data are Always Ordinal
All observations begin as ordinal, if not nominal, data. Quantitative science begins with identifying
conditions and events which, when observed, are deemed worth counting. This counting is the
beginning of quantification. Measurement is deduced from well-defined sets of counts. The most
elementary level is to count the presence, "1," or absence, "0," of the defined condition or event.
More information can be obtained when the conditions that identify countable events are ordered into
successive categories which increase (or decrease) in status along some intended underlying
variable. It then becomes possible to count, not just the presence (versus absence) of an event, but
the number of steps up the ordered set of categories which the particular category observed implies.
When, for example, a rating scale is labeled: "none," "plenty," "nearly all," "all," the inarguable order of
these labels from less to more can be used to represent them as a series of steps. The observation of
"none" can be counted as zero steps up this rating scale, "plenty" as one step up, "nearly all" as two
steps up and "all" as three. This counting has nothing to do with any numbers or weights with which
the categories might have been tagged in addition to or instead of their labels. For instance, "plenty"
might also have been labelled as "20" or "40" by the instrument designer, but the assertion of such a
numerical category label would not alter the fact that, on this scale, "plenty" is just one step up the
scale from "none."
All classifications are qualitative. Some classifications[1], like those above, can be ordered and so are
more than nominal. Other classifications, such as one based on race, can usually not be ordered,
though there may be perspectives from which an ordering becomes useful. This does not mean that
nominal data, such as race or gender, lack explanatory or diagnostic power. Nonparametric
statistical techniques[3] can be useful in such cases. But it does mean that such data are not
measures in the accepted sense of the word.
As Merbitz and colleagues[5] emphasize, this counting of steps says nothing about distances
between categories, nor does it require that all test items employ the same rating scale. Whenever
four category labels share the same ordering, however else they may differ in implied amounts, they
can only be represented by exactly the same step counts, even though, after analysis, their
calibrations may well differ. It would not make any difference to the method of step counting if the four
ordered categories were labeled quite differently for another item, say, "none," "almost none," "just a
little," "all." Even though the relative meanings and the intended amounts corresponding to the
alternative sets of category labels are conspicuously different, their order is the same and so their
step counts can only be the same (0, 1, 2, 3). This is always so no matter how the four ordered
categories might be labeled.
Measures are Always Interval/Ratio
What every scientist and layman means by a "measure" is a number with which arithmetic (and linear
statistics) can be done, a number which can be added and subtracted, even multiplied and divided,
and yet with results that maintain their numerical meaning. The original observations in any science
are not yet measures in this sense. They cannot be measures because a measure implies the
previous construction and maintenance of a calibrated measuring system with a well-defined origin
and unit which has been shown to work well enough to be useful. Merbitz and his coworkers stress
the importance of linear scales as a prerequisite to unequivocal statistical analysis. They are saying
that something must be done with counts of observed events to build them into measures. A
measurement system must be constructed from a related set of relevant counts and its coherence
and utility established.
Confusing Counts with Measures
It is true that counts of concrete events are on a kind of ratio scale. They have an obvious origin in
"none" and the counted events provide a raw unit of "one more event." The problem is that the events
that are counted are specific rather than general, concrete rather than abstract, and varying rather
than uniform in their import. Sometimes the next "one more event" implies, according to the labels
assigned, a small increment as in the step up from "none" to "almost none." Sometimes the next
event implies a big increment as in the step up from "none" to "plenty." Since, in either case, all that
we can do at this stage is to count one more step, our raw counts, as they stand, are insensitive to
any differing implications of the steps taken. To get at these implied step sizes we must construct a
measuring system based on a coordinated set of observed counts. This requires a measurement
analysis of the inevitably ordinal observations which always comprise the initial data in any science.
Even those counts which seem to be useful measures in one context may not be measures in
another[8]. For example, "seconds" would seem always to be a linear measure of time. But, surprising
as it may seem at first, counting the number of seconds it takes a patient to walk across a room does
not necessarily provide a linear measure of "patient mobility." For that, the "seconds" counted are just
the raw data from which a measuring system has still to be constructed. It is naive to believe that a
seemingly universal counter like "seconds," that is so often linear in physics and commerce, will
necessarily also be linear in the measurement of patient mobility. To construct a linear measure of
patient mobility based on elapsed time we must first count the seconds taken by a relevant sample of
patients of varying mobility to cover a variety of relevant distances of varying magnitudes. Then we
must analyze these counting data to discover whether a linear measure of "mobility" can be
constructed from them and if so what its relation to "seconds" may be.
The Step from Observation to Measurement
Realization of the necessity of a progression from counting observations to measurement is not new.
Serious recognition of the need to transform observations into measures goes back to the turn of the
century. Edward Thorndike[10] called for it 80 years ago. Louis Thurstone[11] invented techniques
which partially solved the problem in the 1920s. Finally, in 1953, Georg Rasch[7] devised a complete
solution which has since been shown to be not only sufficient but also necessary for the construction
of measures in any science. The phrase "in any science" is notable here since the Rasch relationship
has been shown to be just as fundamental to the construction of a surveyor's yardstick as it is to the
construction of other less familiar and more subtle measures.
Rasch's insight into the problem was simple and yet profound. First, he realized that, to be of any use
at all, a measure must retain its quantitative status, within reason, regardless of the context in which it
occurs. For a yardstick to be useful for measurement, it must maintain its length calibrations
irrespective of what it is measuring. So too, each test or rating scale item must maintain its level of
difficulty, regardless of who is responding to it. It also follows that the person measured must retain
the same level of competence or ability regardless of which particular test items are encountered, so
long as whatever items are used belong to the calibrated set of items which define the variable under
study. The implementation of this essential concept of invariance or objectivity has been successfully
extended in the past decade to the leniency (or severity) of raters and to the step structure of rating
scales.
Second, Rasch recognized that the outcome of an interaction between an object-to-be-measured,
such as a person, and a measuring-agent, such as a test item, cannot, in practice, be fully
predetermined but must involve an additional, unavoidably unpredictable, component. This realization
changes the way we can usefully specify what is supposed to happen when a person responds to an
item from an "absolute" outcome to a "likely" outcome. The final measuring system requirements
become: the more able the person, the more likely a success on any relevant item. The more difficult
the item, the less likely a success for any relevant person.
From just these, in retrospect rather obvious, requirements, Rasch deduced a mathematical model
which specifies exactly how to convert observed counts into linear (and ratio) measures. The model
also specifies how to find out the extent to which any particular conversion has been successful
enough to be useful. This "Rasch" model has since been demonstrated to be the one and only
possible mathematical formulation for performing this essential function.
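The dichotomous form of the model follows directly from the two requirements: the log-odds of success are specified as the difference between the person's ability B and the item's difficulty D, both expressed in logits. The short Python sketch below is illustrative only, but the formula itself is Rasch's:

import math

def rasch_probability(ability, difficulty):
    # Dichotomous Rasch model: P(success) = exp(B - D) / (1 + exp(B - D)),
    # with ability B and difficulty D expressed in logits (log-odds units).
    return 1.0 / (1.0 + math.exp(difficulty - ability))

# The two requirements hold by construction:
print(rasch_probability(2.0, 0.0) > rasch_probability(1.0, 0.0))  # more able, more likely: True
print(rasch_probability(1.0, 2.0) < rasch_probability(1.0, 1.0))  # more difficult, less likely: True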
Rasch's introduction of his discovery appears in his innovative 1960 book[7]. Detailed, elementary
explanations of why, when and how to apply Rasch's idea to dichotomous (right/wrong, yes/no,
present/absent) data are provided by Wright and Stone[14]. The extension of this to rating scales and
other observations embedded in ordered categories is developed and explained in Wright and
Masters[13].
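For reference, the rating scale form of the model developed in Wright and Masters[13] can be stated (in notation assumed here) as

\[ P_{nix} \;=\; \frac{\exp \sum_{k=0}^{x} (B_n - D_i - F_k)}{\sum_{j=0}^{m} \exp \sum_{k=0}^{j} (B_n - D_i - F_k)}, \qquad F_0 \equiv 0, \]

the probability that person n, with measure B_n, responds in category x of the m+1 ordered categories of item i, with item difficulty D_i and step calibrations F_k shared across items.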
This conversion from counts to measures is greatly facilitated by the use of a computer. Rasch
analysis computer programs have been available since 1965. The two most recent and most versatile
are BIGSCALE[12] and FACETS[4]. These programs analyze the original data for the possibility
of a single latent variable along which the intended measuring agents, the items, can be calibrated
and the intended objects of measurement, the subjects, can be measured. The programs then report:
1) the best possible unidimensional calibrations and measures which these data can support, 2) the
reliabilities of these calibrations and measures in terms of their standard errors and 3) their internal
validities in terms of detailed fit statistics.
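As a hedged illustration of the kind of conversion such programs begin from, the Python sketch below computes simple log-odds starting estimates for a complete dichotomous matrix; the refined iterative procedures of BIGSCALE and FACETS go well beyond this:

import math

# Toy complete dichotomous data: rows are persons, columns are items (1 = success).
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],   # an extreme (perfect) score: it has no finite measure
    [0, 1, 0, 0],
]
n_persons, n_items = len(data), len(data[0])

# Item calibrations from the log-odds of failure, centered so that the test
# mean defines the (arbitrary) local origin of the logit scale.
item_scores = [sum(row[i] for row in data) for i in range(n_items)]
raw = [math.log((n_persons - s) / s) for s in item_scores]
difficulties = [d - sum(raw) / n_items for d in raw]

# Person measures from the log-odds of success, finite only for non-extreme scores.
for row in data:
    r = sum(row)
    if 0 < r < n_items:
        print(f"score {r}/{n_items} -> {math.log(r / (n_items - r)):+.2f} logits")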
Choosing an Origin
The concept of measurement implies a count of some well-defined unit from a well-defined starting
point usually called "none" or "zero." This implication can be visualized as a distance between two
points on a line. To be useful, measures must be set up to begin counting their standard units from
some convenient reference point defined to be their standard origin. The location of this origin is
fundamentally arbitrary, although there are often frames of reference, or theories, for which a
particular position is especially convenient. Consider temperature. The Celsius, Fahrenheit and Kelvin
scales have different zero points. Each choice was made for good theoretical reasons. Each has
been convenient for particular applications. But no one of them is universally superior, despite the
exhortations of molecular thermodynamicists. It is the same for psychometric scales. Each origin is
chosen for the convenience of its users. Should two users choose different origins, then, as with
temperature, it must be a simple monotonic operation to transform measures relative to one origin
into measures relative to another, or they are not talking about the same variable. However intriguing
it may be theoretically, there is no measurement requirement to locate an absolute point of minimum
intensity or to extrapolate a point such as that of "zero mobility."
A ratio scale does have a clear origin. But that origin is usually of more theoretical interest than
practical utility. It is a simple arithmetical operation to convert measures from an interval scale to a
ratio scale and vice versa. When interval scales are exponentiated, their arbitrary origins become the
unit of the resulting ratio scale and their minus infinity becomes this ratio scale's origin. This
mathematical result, by the way, reminds us that the seemingly unambiguous origins of ratio scales,
however intriguing they may be theoretically, are necessarily unrealizable abstractions (see also
"What is a Ratio Scale").
The practical convenience of being able to measure length from some arbitrary origin, like the end of
a yardstick, far outweighs the abstract benefit of measurement from some theoretically interesting
"absolute" origin, such as the center of the universe. With an interval scale, once it is constructed
from relevant counts, we can always answer questions such as "Is the distance from 'wheelchair' to
'unaided' more than twice as far as the distance from 'cane' to 'unaided'?" The convenient origin for
this kind of question is the shared category 'unaided' rather than some abstract point tagged "complete
mobility" or "complete immobility."
Why Treating Raw Scores as Measures Sometimes Seems to Work
In view of the clear difference between counts and measures, why do regressions and other interval-
level statistical analyses of raw score counts and numerical category labels so often seem to work?
Examples mentioned include Miller's "100 point" scale[6], the LORS-IIB[6], the FIM[2], and the
Barthel Index[2]. This paradox is due to the monotonic relationship between scores and measures
when data are complete and unedited. This guarantees that correlation analyses of scores and the
measures they may imply will be quite similar. Further, the relationship between scores and measures
is necessarily ogival because the closed interval between the minimum possible score and the
maximum possible score must be extended to an open interval of measures from minus infinity to
plus infinity. Toward the center of this ogive the relationship between score and measure is nearly
linear. But the monotonicity between score and measure holds only when data are complete, that is,
when every subject encounters every item, and no unacceptably flawed responses have been
deleted. This kind of completeness is inconvenient and virtually impossible to maintain, since it
permits no missing data and prevents tailoring item difficulties to person abilities. It is also no more
necessary for measurement than it would be to require that all children be measured with exactly the
same particular yardstick before we could analyze their growth. Further, the approximate linearity
between central scores and their corresponding measures breaks down as scores approach their
extremes and is strongly influenced by the step structure of the rating scale.
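The ogive itself is easy to exhibit. In the Python sketch below (a hypothetical ten-item dichotomous test), the expected raw score is the sum over items of the Rasch success probabilities; the score-measure curve is nearly linear near the center and flattens toward the extremes:

import math

def expected_score(measure, difficulties):
    # Expected raw score: the sum of Rasch success probabilities over the items.
    return sum(1.0 / (1.0 + math.exp(d - measure)) for d in difficulties)

items = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0]   # hypothetical logits

for b in (-4, -2, 0, 2, 4):
    print(f"measure {b:+} logits -> expected score {expected_score(b, items):5.2f}")
# Near the center a one-logit change moves the expected score by about two
# points; toward the extremes the same change moves it hardly at all.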
Consequently, as Merbitz and associates warn, it is foolish to count on raw scores being linear. It is
always necessary to verify that any particular set of raw scores does, in fact, closely correspond to
linear measures before subjecting them to statistical analysis[2]. Whatever the outcome of such a
verification, it is clearly preferable to convert necessarily nonlinear raw scores to necessarily linear
measures and then to perform the statistical analyses on these measures.
Unidimensionality
An occasional objection to Rasch measurement is its imposition on the data of a single underlying
unidimensional variable. This objection is puzzling because unidimensionality is exactly what is
required for measurement. Unidimensionality is an essence of measurement. In fact the importance
of the Rasch model as the method for constructing measures is due, in part, to its deduction from the
requirement of unidimensionality.
In actual practice, of course, unidimensionality is a qualitative rather than quantitative concept. No
actual test can ever be perfectly unidimensional. No empirical situation can meet exactly the
requirements for measurement which generate the Rasch model. This fact of life is encountered by
every science. Even physicists make corrections for unavoidable multidimensionalities an integral
part of their experimental technique. Nevertheless, the ideal of unidimensional measures must be
approximated if generalizable results are to be obtained.
If a test comprising a mixture of medical and law items is used to make a single pass/fail decision,
then the examination board, however inadvertently, has decided to use this mixed test as though it
were unidimensional. This is regardless of any qualitative or quantitative arguments which might
"prove" multidimensionality. Further, their practical decision does not make medicine and law identical
or exchangeable anywhere but in their pass/fail actions. But their "unidimensional" behavior does
testify that they are making medicine and law exchangeable for these pass/fail decisions. Unless
each test item is to be treated as a test in itself, every test score is a compromise between the
essential ideal of unidimensionality and the unavoidable exigencies of practice. The Rasch model fit
statistics are there in order to evaluate the success of that compromise in each instance. It is the
responsibility of test developers and test managers to use these validity statistics to identify the extent
of the compromises they are making and to minimize their effects on practice.
The pursuit of approximate unidimensionality is undertaken at two levels. First, the test constructor
makes every effort to produce a useful set of observable categories (rating scales) which are
intended and expected to work together to gather unambiguous information along a single, useful
underlying dimension. Test items, tasks, observation techniques and other aspects of the testing
situation are organized to realize, as perfectly as possible, the variable which the test is intended to
measure. Second, the test analyst collects a relevant sample of these carefully defined observations
and evaluates the practical realization of that intention.
Before observations can be used to support any quantitative research or substantive decisions, the
observations must be examined to see how well they fit together to define the intended underlying
variable on a linear scale[9]. Rasch provides theory and technique. But the extent to which a
particular set of observations is in accord with this theory is, indeed, an "empirical matter"[3]. Merbitz
and coworkers[5] caution us against blindly accepting any total score without verifying that its
meaning is in accord with the meanings of the scores on its component items. Assistance in doing
this is provided by fit statistics which report the degree to which the observations match the
specifications necessary for measurement. Misfitting items can be redesigned. Misfitting populations
can be reassessed. Once the quality of the measures has been determined, the analyst, test
constructor, and examination board are then, and only then, in a position to make informed decisions
concerning the quantitative significance of their measures.
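By way of illustration, the unweighted ("outfit") and information-weighted ("infit") mean-square statistics for a single dichotomous item can be sketched as below, given already-estimated measures; this is a simplification, and BIGSCALE and FACETS report more refined versions:

import math

def item_fit(responses, abilities, difficulty):
    # Outfit and infit mean-squares for one dichotomous item.
    z2_sum = residual2_sum = info_sum = 0.0
    for x, b in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(difficulty - b))   # modeled success probability
        info = p * (1.0 - p)                         # binomial variance
        z2_sum += (x - p) ** 2 / info                # squared standardized residual
        residual2_sum += (x - p) ** 2
        info_sum += info
    outfit = z2_sum / len(responses)      # unweighted mean-square
    infit = residual2_sum / info_sum      # information-weighted mean-square
    return outfit, infit

# Values near 1.0 indicate data consistent with the model; values much
# larger flag misfit (the responses and measures below are hypothetical).
print(item_fit([1, 1, 0, 1, 0], [2.0, 1.0, 0.0, -0.5, -1.5], 0.0))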
The process of test evaluation is never finished. Every time we use our measuring agents, questions,
or items to collect new information from new persons in order to estimate new measures, we must
verify in those new data that the unidimensionality requirements of our measuring system have once
again been sufficiently well approximated to maintain the quantitative utility of the measures
produced. Whether a particular set of data can be used to initiate or to continue a unidimensional
measuring system is an empirical question. The only way it can be addressed is to 1) analyze the
relevant data according to a unidimensional measurement model, 2) find out how well and in what
parts these data do conform to our intentions to measure and, 3) study carefully those parts of the
data which do not conform, and hence cannot be used for measuring, to see if we can learn from
them how to improve our observations and so better achieve our intentions.
Once interval scale measures have been constructed, it is then reasonable to proceed with statistical
analysis in order to determine the predictive validity of the measures from a particular test. We can
also then compare the measures produced by different test instruments, such as the FAS subscales,
to see if they are measures of the same thing, like inches and centimeters, or different things, like
inches and ounces.
Rasch Analysis and the Practice of Measurement
The Rasch measurement model has been successfully applied to testing in schools since 1965, with
large scale implementations in Portland (OR), Detroit, Chicago and New York. Many medical
specialty boards, including the National Board of Medical Examiners[9], employ it in their certification
examinations. Pilot research at the Veterans Administration and Marianjoy Rehabilitation Center has
demonstrated that useful measures of the degree of impairment can be constructed from ratings of
the performance of handicapped individuals. New applications of the Rasch model are continually
emerging; the analysis of judge-awarded ratings is currently an area of active interest for the Board of Registry of the
American Society of Clinical Pathologists and for a national group of occupational therapists centered
at the University of Illinois.
We are grateful to Merbitz and colleagues[5] for raising the important topic of ordinal scales and
inference and so permitting us to discuss this often misunderstood concept of measurement.
BENJAMIN D. WRIGHT AND JOHN M. LINACRE
MESA Research Memorandum Number 44
MESA PSYCHOMETRIC LABORATORY
References
1. Gresham GE. Letter to the editor. Arch Phys Med Rehabil 1989; 70:867.
2. Hamilton BB, Granger CV. Letter to the editor. Arch Phys Med Rehabil 1989; 70:861-2.
3. Johnston MV. Letter to the editor. Arch Phys Med Rehabil 1989; 70:861.
4. Linacre JM. FACETS computer program for many-faceted Rasch analysis. Chicago: MESA Press,
1989.
5. Merbitz C, Morris J, Grip JC. Ordinal scales and foundations of misinference. Arch Phys Med
Rehabil 1989; 70:308-12.
6. Miller LS. Letter to the editor. Arch Phys Med Rehabil 1989; 70:866.
7. Rasch G. Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish
Institute for Educational Research, 1960, and Chicago: University of Chicago Press, 1980.
8. Santopoalo RD. Letter to the editor. Arch Phys Med Rehabil 1989; 70:863.
9. Silverstein B, Kilgore K, Fisher W. Letter to the editor. Arch Phys Med Rehabil 1989; 70:864-5.
10. Thorndike EL. An introduction to the theory of mental and social measurement. New York:
Teachers College, Columbia University, 1904.
11. Thurstone LL. A method of scaling psychological and educational tests. J Educ Psychol 1925;
16:433-51.
12. Wright BD, Linacre JM, Schulz M. BIGSCALE Rasch analysis computer program. Chicago: MESA
Press, 1989.
13. Wright BD, Masters GN. Rating scale analysis. Chicago: MESA Press, 1982.
14. Wright BD, Stone MH. Best test design. Chicago: MESA Press, 1979.
This appeared in
Archives of Physical Medicine and Rehabilitation
70 (12) pp. 857-860, November 1989
The URL of this page is www.rasch.org/memo44.htm