Observations are always ordinal; measurements, however, must be interval

Abstract

Quantitative observations are based on counting observed events or levels of performance. Meaningful measurement is based on the arithmetical properties of interval scales. The Rasch measurement model provides the necessary and sufficient means to transform ordinal counts into linear measures. Imperfect unidimensionality and other threats to linear measurement can be assessed by means of fit statistics. The Rasch model is being successfully applied to rating scales.
Merbitz and associates[5] provide a sensitive and useful explanation of the hazards encountered
when data are treated improperly. This misconstruction of data can be understood as the result of a
confusion as to the relationship between observation and measurement - a confusion which can be
speedily resolved with a little clarification.
Data are Always Ordinal
All observations begin as ordinal, if not nominal, data. Quantitative science begins with identifying
conditions and events which, when observed, are deemed worth counting. This counting is the
beginning of quantification. Measurement is deduced from well-defined sets of counts. The most
elementary level is to count the presence, "1," or absence, "0," of the defined condition or event.
More information can be obtained when the conditions that identify countable events are ordered into
successive categories which increase (or decrease) in status along some intended underlying
variable. It then becomes possible to count, not just the presence (versus absence) of an event, but
the number of steps up the ordered set of categories which the particular category observed implies.
When, for example, a rating scale is labeled: "none," "plenty," "nearly all," "all," the inarguable order of
these labels from less to more can be used to represent them as a series of steps. The observation of
"none" can be counted as zero steps up this rating scale, "plenty" as one step up, "nearly all" as two
steps up and "all" as three. This counting has nothing to do with any numbers or weights with which
the categories might have been tagged in addition to or instead of their labels. For instance, "plenty"
might also have been labeled as "20" or "40" by the instrument designer, but the assertion of such a
numerical category label would not alter the fact that, on this scale, "plenty" is just one step up the
scale from "none."
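The step counting described here is purely ordinal bookkeeping, and can be sketched in a few lines of Python (the function name is ours; the labels mirror the example above):

```python
def step_counts(ordered_labels):
    """Map ordered category labels to step counts 0, 1, 2, ...

    Only the order of the labels matters; any numeric tags the
    instrument designer may have attached are ignored.
    """
    return {label: step for step, label in enumerate(ordered_labels)}

scale = step_counts(["none", "plenty", "nearly all", "all"])
# "plenty" is one step up from "none", whatever its numeric tag:
print(scale["plenty"])  # → 1
print(scale["all"])     # → 3
```

Note that any other set of four ordered labels, such as "none," "almost none," "just a little," "all," passes through the same function and yields the same step counts (0, 1, 2, 3).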
All classifications are qualitative. Some classifications[1], like those above, can be ordered and so are
more than nominal. Other classifications, such as one based on race, usually cannot be ordered,
though there may be perspectives from which an ordering becomes useful. This does not mean that
nominal data, such as race or gender, lack explanatory or diagnostic power.
Nonparametric statistical techniques[3] can be useful in such cases. But it does mean that they are
not measurement in the accepted sense of the word.
As Merbitz and colleagues[5] emphasize, this counting of steps says nothing about distances
between categories, nor does it require that all test items employ the same rating scale. Whenever
four category labels share the same ordering, however else they may differ in implied amounts, they
can only be represented by exactly the same step counts, even though, after analysis, their
calibrations may well differ. It would not make any difference to the method of step counting if the four
ordered categories were labeled quite differently for another item, say, "none," "almost none," "just a
little," "all." Even though the relative meanings and the intended amounts corresponding to the
alternative sets of category labels are conspicuously different, their order is the same and so their
step counts can only be the same (0, 1, 2, 3). This is always so no matter how the four ordered
categories might be labeled.
Measures are Always Interval/Ratio
What every scientist and layman means by a "measure" is a number with which arithmetic (and linear
statistics) can be done, a number which can be added and subtracted, even multiplied and divided,
and yet with results that maintain their numerical meaning. The original observations in any science
are not yet measures in this sense. They cannot be measures because a measure implies the
previous construction and maintenance of a calibrated measuring system with a well-defined origin
and unit which has been shown to work well enough to be useful. Merbitz and his coworkers stress
the importance of linear scales as a prerequisite to unequivocal statistical analysis. They are saying
that something must be done with counts of observed events to build them into measures. A
measurement system must be constructed from a related set of relevant counts and its coherence
and utility established.
Confusing Counts with Measures
It is true that counts of concrete events are on a kind of ratio scale. They have an obvious origin in
"none" and the counted events provide a raw unit of "one more event." The problem is that the events
that are counted are specific rather than general, concrete rather than abstract, and varying rather
than uniform in their import. Sometimes the next "one more event" implies, according to the labels
assigned, a small increment as in the step up from "none" to "almost none." Sometimes the next
event implies a big increment as in the step up from "none" to "plenty." Since, in either case, all that
we can do at this stage is to count one more step, our raw counts, as they stand, are insensitive to
any differing implications of the steps taken. To get at these implied step sizes we must construct a
measuring system based on a coordinated set of observed counts. This requires a measurement
analysis of the inevitably ordinal observations which always comprise the initial data in any science.
Even those counts which seem to be useful measures in one context may not be measures in
another[8]. For example, "seconds" would seem always to be a linear measure of time. But, surprising
as it may seem at first, counting the number of seconds it takes a patient to walk across a room does
not necessarily provide a linear measure of "patient mobility." For that, the "seconds" counted are just
the raw data from which a measuring system has still to be constructed. It is naive to believe that a
seemingly universal counter like "seconds," that is so often linear in physics and commerce, will
necessarily also be linear in the measurement of patient mobility. To construct a linear measure of
patient mobility based on elapsed time we must first count the seconds taken by a relevant sample of
patients of varying mobility to cover a variety of relevant distances of varying magnitudes. Then we
must analyze these counting data to discover whether a linear measure of "mobility" can be
constructed from them and if so what its relation to "seconds" may be.
The Step from Observation to Measurement
Realization of the necessity of a progression from counting observations to measurement is not new.
Serious recognition of the need to transform observations into measures goes back to the turn of the
century. Edward Thorndike[10] called for it 80 years ago. Louis Thurstone[11] invented techniques
which partially solved the problem in the 1920s. Finally, in 1953, Georg Rasch[7] devised a complete
solution which has since been shown to be not only sufficient but also necessary for the construction
of measures in any science. The phrase "in any science" is notable here since the Rasch relationship
has been shown to be just as fundamental to the construction of a surveyor's yardstick as it is to the
construction of other less familiar and more subtle measures.
Rasch's insight into the problem was simple and yet profound. First, he realized that, to be of any use
at all, a measure must retain its quantitative status, within reason, regardless of the context in which it
occurs. For a yardstick to be useful for measurement, it must maintain its length calibrations
irrespective of what it is measuring. So too, each test or rating scale item must maintain its level of
difficulty, regardless of who is responding to it. It also follows that the person measured must retain
the same level of competence or ability regardless of which particular test items are encountered, so
long as whatever items are used belong to the calibrated set of items which define the variable under
study. The implementation of this essential concept of invariance or objectivity has been successfully
extended in the past decade to the leniency (or severity) of raters and to the step structure of rating
scales.
Second, Rasch recognized that the outcome of an interaction between an object-to-be-measured,
such as a person, and a measuring-agent, such as a test item, cannot, in practice, be fully
predetermined but must involve an additional, unavoidably unpredictable, component. This realization
changes the way we can usefully specify what is supposed to happen when a person responds to an
item from an "absolute" outcome to a "likely" outcome. The final measuring system requirements
become: the more able the person, the more likely a success on any relevant item. The more difficult
the item, the less likely a success for any relevant person.
From just these, in retrospect rather obvious, requirements, Rasch deduced a mathematical model
which specifies exactly how to convert observed counts into linear (and ratio) measures. The model
also specifies how to find out the extent to which any particular conversion has been successful
enough to be useful. This "Rasch" model has since been demonstrated to be the one and only
possible mathematical formulation for performing this essential function.
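For the dichotomous case, the model Rasch deduced specifies the probability of success as a logistic function of the difference between person ability B and item difficulty D, both expressed in logits: P(success) = exp(B - D) / (1 + exp(B - D)). A minimal sketch of this relationship:

```python
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: P(success) = exp(B - D) / (1 + exp(B - D)).

    ability (B) and difficulty (D) are linear measures in logits.
    """
    return 1.0 / (1.0 + math.exp(difficulty - ability))

# The more able the person, the more likely a success on any relevant item:
assert rasch_probability(2.0, 0.0) > rasch_probability(1.0, 0.0)
# The more difficult the item, the less likely a success for any relevant person:
assert rasch_probability(1.0, 2.0) < rasch_probability(1.0, 0.0)
# When ability equals difficulty, success and failure are equally likely:
print(rasch_probability(1.0, 1.0))  # → 0.5
```

Because only the difference B - D enters the formula, item calibrations hold regardless of who responds, and person measures hold regardless of which calibrated items are encountered: the invariance requirement described above.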
Rasch's introduction of his discovery appears in his innovative 1960 book[7]. Detailed, elementary
explanations of why, when and how to apply Rasch's idea to dichotomous (right/wrong, yes/no,
present/absent) data are provided by Wright and Stone[14]. The extension of this to rating scales and
other observations embedded in ordered categories is developed and explained in Wright and
Masters[13].
This conversion from counts to measures is greatly facilitated by the use of a computer. Rasch
analysis computer programs have been available since 1965. The two most recent and most versatile
are BIGSCALE[12] and FACETS[4]. These programs analyze the original data for the possibility
of a single latent variable along which the intended measuring agents, the items, can be calibrated
and the intended objects of measurement, the subjects, can be measured. The programs then report:
1) the best possible unidimensional calibrations and measures which these data can support, 2) the
reliabilities of these calibrations and measures in terms of their standard errors and 3) their internal
validities in terms of detailed fit statistics.
Choosing an Origin
The concept of measurement implies a count of some well-defined unit from a well-defined starting
point usually called "none" or "zero." This implication can be visualized as a distance between two
points on a line. To be useful, measures must be set up to begin counting their standard units from
some convenient reference point defined to be their standard origin. The location of this origin is
fundamentally arbitrary, although there are often frames of reference, or theories, for which a
particular position is especially convenient. Consider temperature. The Celsius, Fahrenheit and Kelvin
scales have different zero points. Each choice was made for good theoretical reasons. Each has
been convenient for particular applications. But no one of them is universally superior, despite the
exhortations of molecular thermodynamicists. It is the same for psychometric scales. Each origin is
chosen for the convenience of its users. Should two users choose different origins, then, as with
temperature, it must be a simple monotonic operation to transform measures relative to one origin
into measures relative to another, or they are not talking about the same variable. However intriguing
it may be theoretically, there is no measurement requirement to locate an absolute point of minimum
intensity or to extrapolate a point such as that of "zero mobility."
A ratio scale does have a clear origin. But that origin is usually of more theoretical interest than
practical utility. It is a simple arithmetical operation to convert measures from an interval scale to a
ratio scale and vice versa. When interval scales are exponentiated, their arbitrary origins become the
unit of the resulting ratio scale and their minus infinity becomes this ratio scale's origin. This
mathematical result, by the way, reminds us that the seemingly unambiguous origins of ratio scales,
however intriguing they may be theoretically, are necessarily unrealizable abstractions (see also
"What is a Ratio Scale").
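The arithmetic of this interval-to-ratio conversion is easy to verify with a toy example (the specific logit values are illustrative only):

```python
import math

# Three measures on an interval (logit) scale with an arbitrary origin at 0:
interval_measures = [-1.0, 0.0, 1.0]

# Exponentiation converts them to a ratio scale:
ratio_measures = [math.exp(m) for m in interval_measures]

# The interval scale's arbitrary origin (0) becomes the ratio scale's unit (1):
print(ratio_measures[1])  # → 1.0

# The ratio scale's origin (0) is approached only as the interval measure
# goes to minus infinity -- a tiny positive number, never exactly zero:
print(math.exp(-50))
```

Taking logarithms reverses the conversion, which is why the choice between the two forms is a matter of convenience rather than substance.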
The practical convenience of being able to measure length from some arbitrary origin, like the end of
a yardstick, far outweighs the abstract benefit of measurement from some theoretically interesting
"absolute" origin, such as the center of the universe. With an interval scale, once it is constructed
from relevant counts, we can always answer questions such as "Is the distance from 'wheelchair' to
'unaided' more than twice as far as the distance from 'cane' to 'unaided'?" The convenient origin for this
kind of question is the shared category 'unaided' rather than some abstract point tagged "complete"
mobility" or "complete immobility."
Why Treating Raw Scores as Measures Sometimes Seems to Work
In view of the clear difference between counts and measures, why do regressions and other interval-
level statistical analyses of raw score counts and numerical category labels so often seem to work?
Examples mentioned include Miller's "100 point" scale[6], the LORS-IIB[6], the FIM[2], and the
Barthel Index[2]. This paradox is due to the monotonic relationship between scores and measures
when data are complete and unedited. This guarantees that correlation analyses of scores and the
measures they may imply will be quite similar. Further, the relationship between scores and measures
is necessarily ogival because the closed interval between the minimum possible score and the
maximum possible score must be extended to an open interval of measures from minus infinity to
plus infinity. Toward the center of this ogive the relationship between score and measure is nearly
linear. But the monotonicity between score and measure holds only when data are complete, that is,
when every subject encounters every item, and no unacceptably flawed responses have been
deleted. This kind of completeness is inconvenient and virtually impossible to maintain, since it
permits no missing data and prevents tailoring item difficulties to person abilities. It is also no more
necessary for measurement than it would be to require that all children be measured with exactly the
same particular yardstick before we could analyze their growth. Further, the approximate linearity
between central scores and their corresponding measures breaks down as scores approach their
extremes and is strongly influenced by the step structure of the rating scale.
Consequently, as Merbitz and associates warn, it is foolish to count on raw scores being linear. It is
always necessary to verify that any particular set of raw scores does, in fact, closely correspond to
linear measures before subjecting them to statistical analysis[2]. Whatever the outcome of such a
verification it is clearly preferable to convert necessarily nonlinear raw scores to necessarily linear
measures and then to perform the statistical analyses on these measures.
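For the simplest case, a complete test of equally difficult dichotomous items, the score-to-measure ogive has a closed form: the measure for raw score r on an L-item test is the log-odds ln(r / (L - r)). The sketch below uses that deliberately simplified case (our illustration, not the general conversion the programs perform) to show the near-linearity at central scores and the stretching at the extremes:

```python
import math

def score_to_measure(raw_score, n_items):
    """Logit measure for a raw score on a complete test of equally
    difficult dichotomous items -- a deliberately simplified case.
    Extreme scores (0 or n_items) map to minus/plus infinity."""
    return math.log(raw_score / (n_items - raw_score))

# Near the center of the test the relationship is nearly linear ...
print(round(score_to_measure(10, 20), 2))                            # → 0.0
print(round(score_to_measure(11, 20) - score_to_measure(10, 20), 2)) # → 0.2
# ... but each extra score point is worth far more logits near the extremes:
print(round(score_to_measure(19, 20) - score_to_measure(18, 20), 2)) # → 0.75
```

One score point near the top of this test buys more than three times the measure gained by one score point at the center, which is why treating raw scores as linear can seem to work for central scores and fail badly toward the extremes.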
Unidimensionality
An occasional objection to Rasch measurement is its imposition on the data of a single underlying
unidimensional variable. This objection is puzzling because unidimensionality is exactly what is
required for measurement. Unidimensionality is an essence of measurement. In fact the importance
of the Rasch model as the method for constructing measures is due, in part, to its deduction from the
requirement of unidimensionality.
In actual practice, of course, unidimensionality is a qualitative rather than quantitative concept. No
actual test can ever be perfectly unidimensional. No empirical situation can meet exactly the
requirements for measurement which generate the Rasch model. This fact of life is encountered by
every science. Even physicists make corrections for unavoidable multidimensionalities an integral
part of their experimental technique. Nevertheless, the ideal of unidimensional measures must be
approximated if generalizable results are to be obtained.
If a test comprising a mixture of medical and law items is used to make a single pass/fail decision,
then the examination board, however inadvertently, has decided to use this mixed test as though it
were unidimensional. This is regardless of any qualitative or quantitative arguments which might
"prove" multidimensionality. Further, their practical decision does not make medicine and law identical
or exchangeable anywhere but in their pass/fail actions. But their "unidimensional" behavior does
testify that they are making medicine and law exchangeable for these pass/fail decisions. Unless
each test item is to be treated as a test in itself, every test score is a compromise between the
essential ideal of unidimensionality and the unavoidable exigencies of practice. The Rasch model fit
statistics are there in order to evaluate the success of that compromise in each instance. It is the
responsibility of test developers and test managers to use these validity statistics to identify the extent
of the compromises they are making and to minimize their effects on practice.
The pursuit of approximate unidimensionality is undertaken at two levels. First, the test constructor
makes every effort to produce a useful set of observable categories (rating scales) which are
intended and expected to work together to gather unambiguous information along a single, useful
underlying dimension. Test items, tasks, observation techniques and other aspects of the testing
situation are organized to realize, as perfectly as possible, the variable which the test is intended to
measure. Second, the test analyst collects a relevant sample of these carefully defined observations
and evaluates the practical realization of that intention.
Before observations can be used to support any quantitative research or substantive decisions, the
observations must be examined to see how well they fit together to define the intended underlying
variable on a linear scale[9]. Rasch provides theory and technique. But the extent to which a
particular set of observations is in accord with this theory is, indeed, an "empirical matter"[3]. Merbitz
and coworkers[5] caution us against blindly accepting any total score without verifying that its
meaning is in accord with the meanings of the scores on its component items. Assistance in doing
this is provided by fit statistics which report the degree to which the observations match the
specifications necessary for measurement. Misfitting items can be redesigned. Misfitting populations
can be reassessed. Once the quality of the measures has been determined, the analyst, test
constructor, and examination board are then, and only then, in a position to make informed decisions
concerning the quantitative significance of their measures.
The process of test evaluation is never finished. Every time we use our measuring agents, questions,
or items to collect new information from new persons in order to estimate new measures, we must
verify in those new data that the unidimensionality requirements of our measuring system have once
again been sufficiently well approximated to maintain the quantitative utility of the measures
produced. Whether a particular set of data can be used to initiate or to continue a unidimensional
measuring system is an empirical question. The only way it can be addressed is to 1) analyze the
relevant data according to a unidimensional measurement model, 2) find out how well and in what
parts these data do conform to our intentions to measure and, 3) study carefully those parts of the
data which do not conform, and hence cannot be used for measuring, to see if we can learn from
them how to improve our observations and so better achieve our intentions.
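One common form such fit statistics take, for dichotomous data, is the outfit mean-square: each response x has expectation p and variance p(1 - p) under the model, and the mean of the squared standardized residuals should be near 1 when the data conform. The following sketch is our illustration of the idea, not the output of any particular program:

```python
import math

def rasch_p(ability, difficulty):
    """Dichotomous Rasch probability of success."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def outfit_mean_square(responses, abilities, difficulty):
    """Outfit mean-square for one item: mean of squared standardized
    residuals z^2 = (x - p)^2 / (p * (1 - p)) over the persons.
    Values near 1 indicate data consistent with the model."""
    total = 0.0
    for x, b in zip(responses, abilities):
        p = rasch_p(b, difficulty)
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)

abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]   # five persons, in logits
# A perfectly ordered (Guttman) pattern is more deterministic than the
# model expects, giving a mean-square well below 1:
print(round(outfit_mean_square([0, 0, 1, 1, 1], abilities, 0.0), 2))  # → 0.4
# A lucky success by the least able person and a careless failure by the
# most able inflate the mean-square sharply:
print(round(outfit_mean_square([1, 0, 1, 1, 0], abilities, 0.0), 2))  # → 3.3
```

It is statistics of this kind that flag the misfitting items and misfitting persons referred to above, so that they can be examined rather than silently averaged into a score.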
Once interval scale measures have been constructed, it is then reasonable to proceed with statistical
analysis in order to determine the predictive validity of the measures from a particular test. We can
also then compare the measures produced by different test instruments, such as the FAS subscales,
to see if they are measures of the same thing, like inches and centimeters, or different things, like
inches and ounces.
Rasch Analysis and the Practice of Measurement
The Rasch measurement model has been successfully applied to testing in schools since 1965, with
large scale implementations in Portland (OR), Detroit, Chicago and New York. Many medical
specialty boards, including the National Board of Medical Examiners[9], employ it in their certification
examinations. Pilot research at the Veterans Administration and Marianjoy Rehabilitation Center has
demonstrated that useful measures of the degree of impairment can be constructed from ratings of
the performance of handicapped individuals. New applications of the Rasch model are continually
emerging; the analysis of judge-awarded ratings is currently an area of active interest for the Board of Registry of the
American Society of Clinical Pathologists and for a national group of occupational therapists centered
at the University of Illinois.
We are grateful to Merbitz and colleagues[5] for raising the important topic of ordinal scales and
inference and so permitting us to discuss this often misunderstood concept of measurement.
BENJAMIN D. WRIGHT AND JOHN M. LINACRE
MESA Research Memorandum Number 44
MESA PSYCHOMETRIC LABORATORY
References
1. Gresham GE. Letter to the editor. Arch Phys Med Rehabil 1989; 70:867.
2. Hamilton BB, Granger CV. Letter to the editor. Arch Phys Med Rehabil 1989; 70:861-2.
3. Johnston MV. Letter to the editor. Arch Phys Med Rehabil 1989; 70:861.
4. Linacre JM. FACETS computer program for many-faceted Rasch analysis. Chicago: MESA Press, 1989.
5. Merbitz C, Morris J, Grip JC. Ordinal scales and foundations of misinference. Arch Phys Med Rehabil 1989; 70:308-12.
6. Miller LS. Letter to the editor. Arch Phys Med Rehabil 1989; 70:866.
7. Rasch G. Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research, 1960, and Chicago: University of Chicago Press, 1980.
8. Santopoalo RD. Letter to the editor. Arch Phys Med Rehabil 1989; 70:863.
9. Silverstein B, Kilgore K, Fisher W. Letter to the editor. Arch Phys Med Rehabil 1989; 70:864-5.
10. Thorndike EL. An introduction to the theory of mental and social measurement. New York:
Teachers College, Columbia University, 1904.
11. Thurstone LL. A method of scaling psychological and educational data. J Educ Psychol 1925;
15:433-51.
12. Wright BD, Linacre JM, Schulz M. BIGSCALE Rasch analysis computer program. Chicago: MESA
Press, 1989.
13. Wright BD, Masters GN. Rating scale analysis. Chicago: MESA Press, 1982.
14. Wright BD, Stone MH. Best test design. Chicago: MESA Press, 1979.
This appeared in
Archives of Physical Medicine and Rehabilitation
70 (12) pp. 857-860, November 1989
The URL of this page is www.rasch.org/memo44.htm
DIF findings support a potential reduction in the recommended number of PASMA administrations per individual. Future research will focus on establishing rater reliability and external validity. Additional efforts will support health professionals to utilize the PASMA for baseline assessments, guiding personalized interventions, and tracking progress. Conclusion Clinical use of the PASMA could provide new opportunities to detect subtle abilities, preferences, and changes in CMC, to promote meaningful participation and improve quality of life.
... Person separation reliability indicates how reliably individuals' performances on the scale can be differentiated. For person separation reliability, as with Cronbach's α, values above 0.75 indicate good reliability, while values above 0.9 indicate very good reliability of the scale (Wright & Linacre, 1989; Linacre, 2023). ...
Article
Full-text available
Educators want to measure students and their skills and competencies using assessment processes such as exams, tests, and quizzes. The aim of this study is to present an applied example of the Rasch analysis method for increasing the measurement validity and reliability of exams in educational processes, and to demonstrate its role in determining the difficulty levels and discrimination properties of exam questions. The measurement data used in this study come from 650 first-year university students who took a foreign-language exemption exam at a university in Turkey. The properties of the test were examined using the Rasch model. The reliability of the final 20-item test was calculated as 0.74. Unidimensionality, invariance, fit to the Rasch model, and model reliability tests were applied to check the fit of the test questions to the model. The assumptions for the fit of the data to the model (unidimensionality, invariance, model fit, reliability) were tested, and 6 questions were removed from the analysis. Students' response levels and item difficulty parameters were obtained for the remaining 14 questions. The most difficult question was q12 and the easiest was q5. Of the questions, 7.14% were very easy or very difficult, 14.2% easy, 50% moderate, and 21.4% difficult. Ability values generally ranged from 0.540 to 0.903, indicating that the test items were at a reasonable level of discrimination. The Wright map showed that the test questions were mostly appropriate for the participants' ability levels, but that suitable items were lacking for individuals with extremely high or low ability. Both easier and more difficult questions should be added so that the test covers a wider range of ability.
... Data from the ICAST-C questionnaire were transformed by Rasch model analysis using statistical software called Winsteps version 3.73. Rasch model analysis assumes that behavior is determined by the difficulty of an item and the ability of a person [25]. The Rasch model analysis converts ordinal data into standardized interval data called logit values (log-odds units). ...
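The raw-score-to-logit idea mentioned in the snippet above can be sketched as follows. This is a simplified illustration only (function name and example scores are invented for this sketch); real Rasch software such as Winsteps jointly estimates person and item parameters rather than applying a single log-odds transform.

```python
import math

def raw_score_to_logit(raw_score, max_score):
    """Illustrative log-odds transform behind Rasch scaling:
    a bounded ordinal proportion p is mapped to ln(p / (1 - p)),
    an unbounded interval-like logit scale."""
    p = raw_score / max_score
    return math.log(p / (1 - p))

# A raw score at the midpoint of the scale maps to 0 logits;
# scores above the midpoint map to positive logits.
print(raw_score_to_logit(20, 40))  # 0.0
print(raw_score_to_logit(30, 40))  # ≈ 1.0986
```

Note that the transform is undefined at the extremes (0 or the maximum score), which is one reason full Rasch estimation handles extreme scores separately.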
Article
Full-text available
Maltreatment affects emotional development in adolescents and inhibits social adjustment. This study aimed to analyze the relationship between maltreatment and mental health among adolescents. A cross-sectional study was conducted on adolescents in the first and second grades of middle school (12–14 years old) and high school (15–17 years old) in eight cities and municipalities in the province, selected through several stages of simple random sampling (N = 1837). The International Society for the Prevention of Child Abuse and Neglect (ISPCAN) Child Abuse Screening Tool for Children (ICAST-C) questionnaire for detecting maltreatment was translated, simplified, and validated by an expert based on a theoretical framework that involved pediatricians, public health, and medicolegal perspectives. The Strengths and Difficulties Questionnaire (SDQ) was used to assess emotional states. ICAST-C and SDQ scores were transformed to logit values using Rasch model analysis. Distribution frequency and linear regression were used for data analysis. The results indicated that 85.6% of adolescents aged 12–14 and 83% of those aged 15–17 experienced physical maltreatment, while 89.4% of the 12–14 age group and 82.9% of the 15–17 age group experienced psychological maltreatment. The emotional states of the two groups were 52.8% and 59.2%, respectively. There was a significant correlation between the experience of physical maltreatment and emotions among 12–14 year olds (r1 = 0.148 (0.190–0.257)) and 15–17 year olds (r1 = 0.047 (0.084–0.156)). There was a significant correlation between the experience of psychological maltreatment and emotions among 12–14 year olds (r2 = 0.191 (0.270–0.350)) and 15–17 year olds (r2 = 0.097 (0.167–0.252)). In conclusion, physical and psychological maltreatment were correlated with mental health states among adolescent students in West Java, Indonesia.
... 13,16,17 The advantage of using the Rasch-transformed values is that they are based on a continuous, equal interval scale, in contrast with the ordinal raw scores. 27 Self-care items included eating, grooming, bathing, dressing upper body, dressing lower body, and toileting. Mobility items included tub transfer or shower transfer, bed-chair transfer, toilet transfer, walking or wheelchair, and climbing stairs. ...
Article
Full-text available
Objective To examine associations among the time and content of rehabilitation treatment with self-care and mobility functional gain rate for adults with acquired brain injury. Design Retrospective cohort study using electronic health record and administrative billing data. Setting Inpatient rehabilitation unit at a large, academic medical center. Participants Adults with primary diagnosis of stroke, traumatic brain injury, or nontraumatic brain injury admitted to the inpatient rehabilitation unit between 2012 and 2017 (N=799). Interventions Not applicable. Main Outcome Measures Gain rate in self-care and mobility function, using the Functional Independence Measure. Hierarchical regression models were used to identify the contributions of baseline characteristics, units, and content of occupational therapy, physical therapy, and speech-language pathology treatment to functional gain rates. Results Median length of rehabilitation stay was 10 days (interquartile range, 8-13d). Patients received a mean of 10.62 units of therapy (SD, 2.05) daily. For self-care gain rate, the best-fitting model accounted for 32% of the variance. Occupational therapy activities of daily living units were positively associated with gain rate. For mobility gain rate, the best-fitting model accounted for 37% of the variance. Higher amounts of physical therapy bed mobility training were inversely associated with mobility gain rate. Conclusions More activities of daily living in occupational therapy is associated with faster improvement in self-care function for adults with acquired brain injury, whereas more bed mobility training in physical therapy was associated with slower improvement. A potential challenge with value-based payments is the alignment between clinically appropriate therapy activities and the metrics by which patient improvement is evaluated. 
There is a risk that therapists and facilities will prioritize activities that drive improvement on metrics and deemphasize other patient-centered goals.
... In different ways, all these chapters show how a meaningful extension of the SI to cover the psychological and social domains depends on and benefits from the following conceptual distinctions and methodological demands. Readers unfamiliar with technical issues in measurement and metrology should approach the chapters with the following in mind and might be motivated to explore these ideas at greater length if practical applications are to be undertaken: ordinal scores are not interval measurements, just as numeric counts are not measured quantities (Wright, 1992b; Wright & Linacre, 1989). Everyone knows we cannot say who has more rock when I have two and you have five, yet we persist in fallaciously treating test scores as measurements in the absence of a defined unit quantity. ...
... Thus, the response received by the operator is a combination of the difficulty of the item and the ability of the person: a person with high ability is more likely to respond positively to a difficult item than a person with lower ability, and an easy item is more likely to be endorsed by more persons than a more difficult item (this is further exemplified in Section 11.5 for social sustainability metrics, as illustrated in Figure 11.2). Consequently, the response is characterized by having no numerical meaning, and it can only be used to indicate order (Turetsky & Bashkansky, 2022;Wright & Linacre, 1989). This response is remarkably similar to what is typically observed in today's evaluations of social sustainability based on sets of compiled and summarized indicators. ...
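The person-item relation described in the snippet above is the dichotomous Rasch model, whose probability of a positive response is exp(θ − b) / (1 + exp(θ − b)) for person ability θ and item difficulty b, both in logits. A minimal sketch (the function name and the ability/difficulty values are illustrative, not from the cited chapter):

```python
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: probability of a positive response
    given person ability and item difficulty (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the probability is exactly 0.5.
print(rasch_probability(1.0, 1.0))  # 0.5

# A higher-ability person is more likely to respond positively
# to a difficult item than a lower-ability person:
print(rasch_probability(2.0, 1.0))  # ≈ 0.73
print(rasch_probability(0.0, 1.0))  # ≈ 0.27
```

Because the probability depends only on the difference θ − b, person ability and item difficulty are placed on one common interval scale, which is the property the surrounding snippets rely on.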
... This conventional method of data analysis disregards the inherent subjectivity within the data, operating under unfounded presumptions regarding the scale's nature [38,42] . Consequently, such practices may culminate in invalid mathematical operations and compromise the efficacy of statistical analyses [43][44][45][46] . ...
Article
Full-text available
This study aims to investigate gender differences in the understanding and application of green energy technology in schools in Indonesia. The method used is a survey with a questionnaire that covers aspects of knowledge, attitudes, readiness, and obstacles related to green energy technology from a gender perspective. The sample of this study consisted of 829 teachers in various schools in Indonesia, with a balanced distribution between male and female teachers. The data were analyzed using the Rasch measurement model with WINSTEPS 5.7.1 software to ensure the validity and reliability of the instrument. The results show that the instrument developed has good reliability and validity, without significant item bias based on gender. The analysis shows that female teachers tend to have a higher understanding and application of green energy technology than male teachers. The ability distribution shows that most respondents are at a moderate to high level of ability in understanding and applying green energy technologies. These findings indicate the need for more inclusive and gender-sensitive education strategies to ensure all groups can contribute effectively to the implementation of green energy technologies in schools. The results of this study can be used as a basis for developing policy recommendations aimed at increasing equal involvement and understanding of green energy technologies among teachers, both men and women. It is hoped that the gender gap in the understanding and application of green energy technology can be minimized and the application of this technology in schools can be improved.
Article
Full-text available
Fundamental deficiencies in the information provided by an ordinal scale constrain the logical inferences that can be drawn; inferences about progress in treatment are particularly vulnerable. Ignoring or denying the limitations of scale information will have serious practical and economic consequences. Currently, there is a high demand for functional assessment scales within the rehabilitation community. It is hoped that such scales will satisfy the very real need for measures of function which reflect the impact of treatment on patient progress. Unfortunately, some commonly used evaluation instruments are not well suited to this task. The underlying rationale for clinical decision-making based on these scales is examined.
Article
FACETS Computer program for many-faceted Rasch analysis
  • J M Linacre
Linacre JM. FACETS Computer program for many-faceted Rasch analysis. Chicago: MESA Press, 1989.
Best test design
  • Bd Wright
  • Mh Stone
Wright BD, Stone MH. Best test design. Chicago: MESA Press, 1979.
This article appeared in Archives of Physical Medicine and Rehabilitation 70 (12), pp. 857-860, November 1989.
BIGSCALE Rasch analysis computer program
  • B D Wright
  • J M Linacre
  • M Schulz
Wright BD, Linacre JM, Schulz M. BIGSCALE Rasch analysis computer program. Chicago: MESA Press, 1989.
Letter to the editor
  • B Silverstein
  • K Kilgore
  • W Fisher
Silverstein B, Kilgore K, Fisher W. Letter to the editor. Arch Phys Med Rehabil 1989; 70:864-5.