Evaluation of a preliminary physical function item bank supported the
expected advantages of the Patient-Reported Outcomes Measurement
Information System (PROMIS)
M. Rosea,b,*, J.B. Bjornera,b, J. Beckera,b, J.F. Friesc, J.E. Warea,d
aHealth Assessment Lab and QualityMetric Incorporated, 275 Wyman Street, Suite 120, Waltham, MA 02451, USA
bDepartment of Psychosomatic Medicine and Psychotherapy, University Medical Center, Hamburg-Eppendorf, Germany
cStanford University School of Medicine, Palo Alto, CA, USA
dSchool of Medicine, Department of Medicine, Tufts University, Boston, MA, USA
Accepted 6 June 2006
Objective: The Patient-Reported Outcomes Measurement Information System (PROMIS) was initiated to improve precision, reduce
respondent burden, and enhance the comparability of health outcomes measures. We used item response theory (IRT) to construct and
evaluate a preliminary item bank for physical function assuming four subdomains.
Study Design and Setting: Data from seven samples (N = 17,726) using 136 items from nine questionnaires were evaluated. A generalized partial credit model was used to estimate item parameters, which were normed to a mean of 50 (SD = 10) in the US population.
Item bank properties were evaluated through Computerized Adaptive Test (CAT) simulations.
Results: IRT requirements were fulfilled by 70 items covering activities of daily living, lower extremity, and central body functions.
The original item context partly affected parameter stability. Items on upper body function and on the need for aids or devices did not fit the IRT model. In simulations, a 10-item CAT eliminated floor effects and decreased ceiling effects, achieving a small standard error (<2.2) across scores from 20 to 50 (reliability >0.95 for a representative US sample). This precision was not achieved over a similar range by any comparable fixed-length item set.
Conclusion: The methods of the PROMIS project are likely to substantially improve measures of physical function and to increase the
efficiency of their administration using CAT. © 2008 Elsevier Inc. All rights reserved.
Keywords: Item response theory; Computerized Adaptive Test; Physical function; Health status; Questionnaire
Over the past several decades, the use of patient-reported
outcomes (PROs) in clinical studies has steadily increased
in frequency, as has their importance in evaluating therapies
and developing treatment plans. The plethora of outcome tools available today allows increasingly specific measurement of a range of domains related to health and well-being, but with two major limitations.
First, health outcomes research has produced a number of well-validated instruments, but the most precise and comprehensive questionnaires are rather lengthy and complex, leading to a level of respondent burden that hampers recruitment, limits the representativeness of the patient population studied, and leads to substantial problems of missing data. This is particularly important if different
constructs are measured. Thus, the most popular health
profile instruments are relatively short questionnaires
(e.g., SF-36® Health Survey [2,3]), but even for the measurement of one specific domain, like physical function, brief questionnaires are mostly favored (e.g., Health Assessment Questionnaire [HAQ]). These shorter questionnaires represent a compromise in measurement precision, range, and other desirable attributes in favor of
practicality. The short forms are useful for measuring the
health status of larger groups, but the precision loss is of
greater concern when groups are rather small or scores are estimated for individual patients to guide clinical decision making.
* Corresponding author. Health Assessment Lab and QualityMetric
Incorporated, 275 Wyman Street, Suite 120, Waltham, MA 02451, USA.
E-mail address: email@example.com (M. Rose).
Journal of Clinical Epidemiology 61 (2008) 17–33
A second major limitation has been that results from dif-
ferent questionnaires are difficult to compare, even when
two similar instruments assess the same outcomes for the
same illness, such as measuring the disability of rheumatic
patients with the HAQ or the Western Ontario and McMaster Universities arthritis index (WOMAC). The situation
is as if leukocyte counts assessed in different settings were
not comparable with one another, but were dependent on
the particular laboratory used. There is a strong need to
develop a standardized, efficient approach to outcome
measurement for a variety of clinical applications including
population monitoring, clinical trials research, and individ-
ual patient monitoring, so that results can be compared
across conditions, therapies, trials, and patients.
Use of Item Response Theory (IRT) to build item banks
and Computerized Adaptive Tests (CATs) are believed to
be promising solutions to both problems. An item bank
consists of a set of items measuring the same concept and
a description of the items’ measurement properties based
on IRT models. IRT [8,9] describes the probability of choosing each response on a questionnaire item as a function of the latent trait measured by the items (referred to as the IRT score or theta [θ]) [10,11]. On the basis of the IRT models, the latent trait can be estimated from the responses to any subset of items in the bank. Accordingly, researchers or clinicians can select items that are most relevant for the given group or individual patient and score
the responses on a general ruler that is independent of the
choice of items. Further, if the item bank contains items
from established questionnaires, scores on these question-
naires can be predicted from estimates of the latent trait.
Thus, using an IRT item bank will also allow comparisons between results from different questionnaires. The item
bank is not static, but can be continuously expanded and
improved with additional items.
An IRT-based item bank also provides the foundation for
CATs [12–14]. CATs make it possible to select the most
informative items from the item bank for every individual
patient according to his or her degree of the latent trait,
and to administer only those items. Thus, the higher preci-
sion needed for individual patient measurement is achieved,
while at the same time respondent burden can be controlled
[15–19]. We demonstrated these advantages earlier for the Headache Impact Test [20,21] and the Anxiety CAT, and have developed CATs for all SF-36 domains (see the subsequent paper in this series).
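The CAT principle described above, administering at each step the unused item that is most informative at the current trait estimate, can be sketched in a few lines. This is a minimal illustration using a two-parameter logistic model with hypothetical item parameters, not the PROMIS engine itself:

```python
import numpy as np

def next_item(theta_hat, slopes, difficulties, administered):
    """Return the index of the most informative unadministered item
    at the current trait estimate (2PL Fisher information)."""
    p = 1.0 / (1.0 + np.exp(-slopes * (theta_hat - difficulties)))
    info = slopes ** 2 * p * (1.0 - p)      # I(theta) = a^2 * p * (1 - p)
    info[list(administered)] = -np.inf      # never repeat an item
    return int(np.argmax(info))

# Hypothetical 5-item bank: the item whose difficulty is closest to the
# current estimate (weighted by its slope) is selected first.
slopes = np.array([1.0, 1.5, 1.0, 2.0, 1.2])
diffs = np.array([-2.0, -1.0, 0.0, 0.5, 2.0])
first = next_item(0.4, slopes, diffs, administered=set())
```

In a full CAT, the trait estimate is updated after each response and the loop repeats until a precision or length criterion is met.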
To systematically apply IRT and CATs in its studies, the
National Institutes of Health (NIH) recently initiated the
development of the Patient-Reported Outcomes Measure-
ment Information System (PROMIS) (http://nihroadmap.
nih.gov/). This trans-NIH initiative aims ‘‘to revolutionize
the way patient-reported outcome tools are selected and
employed in clinical research and practice evaluation’’
(http://www.nihpromis.org). Five domains are being as-
sessed initially: physical function, pain, fatigue, mental
health, and role functioning, across six Primary Research
Sites and a Statistical Coordinating Center. This study de-
scribes the pilot development and analysis of a preliminary
item bank for physical function.
The physical function construct has been evaluated using IRT methods for more than a decade [23–32]. Items
covering a wide range of physical activity levels, from
self-care (e.g., bathing and dressing) to performance of vig-
orous physical activities (e.g., running, strenuous sports),
usually can be sufficiently calibrated on a common metric
to satisfy the assumptions of IRT models. These studies,
like others using IRT techniques to rescore PRO measure-
ments [33,34], generally show that the use of IRT tech-
niques is superior to the use of classical test theory. In
principle, the same measurement assumptions are made in
classical test theory as in IRT. However, the explicit formulation of requirements in IRT may force us to refine our assumptions about a construct itself and about which particular items or subdomains may or may not be included.
We are aware of just two completed projects to build
IRT-based CATs (in rehabilitation research) for physical
function [36,37]. Thus, despite the work done, we are still
at the early stages, compared to the ambitious goals of the
PROMIS initiative. However, preliminary work, including
this article, is essential to identify issues and resolve prob-
lems as a necessary precursor to reach these aims.
Within this article, we analyze steps by which an item
bank for physical function can be built, and describe how
IRT scores and traditional instruments can be compared
and their relative utility assessed.
An overview of the different steps of the analyses is shown in Fig. 1.
Cross-sectional data from seven studies, including a total of 17,726 respondents (Table 1), were used. Data from 945 osteoarthritis patients were collected within the Arthritis, Rheumatism, and Aging Medical Information System (ARAMIS). We also used baseline data from a rheumatoid arthritis clinical trial (RAC) and the
baseline sample of the Medical Outcomes Study (MOS)
[40,41]. Data from seniors came from a random sample of 5,000 cases from a large public-use file of the first cohort of the Medicare Health Outcomes Survey (HOS).
These samples are primarily chronically ill or elderly, and generally have serious limitations in their physical function, with 0–26% reporting no limitations in vigorous activities (Table 1). To obtain greater variation in physical function, we also used three samples from the general population, in which 24–69% reported no limitations. The Generic Physical Function (GPF) Item Bank Development
Fig. 1. Steps to build an item bank: data sets identification; item selection rules; confirmatory factor analysis; item response curves; stability and fit tests; new item recruitment; item inclusion (initial item bank development and continuous item bank improvement).
Study was conducted by QualityMetric Incorporated and
RoperStarch, to include physical function items that cover
a wider range of ability. Additional physical function items were obtained from the Health Insurance Experiment (HIE; baseline data from Seattle, MA, and SC), a large-scale social experiment that enrolled subjects younger than 65 years. Finally, a third general-population data set was collected by the National Research Corporation and QualityMetric from a representative US sample in 1998 and used to norm the SF-36.
Using existing databases seemed favorable as a first step in this project, as it enabled us to evaluate our methodology and will allow us to build pilot CATs and to cross-calibrate IRT scores with scores from existing measures without delay. We did not impute missing values and thus excluded cases that had missing data for any of the items surveyed (0% [HOS] to 22% [MOS]). All sample sizes mentioned in this article refer to samples without any missing values.
2.2. Item selection
The construct of physical function has been examined from different disease perspectives. Although some types of questions asked are similar, instrument developers with a background in rheumatology tend to select items reflecting musculoskeletal diseases, whereas developers with a cardiology background tend to prefer items reflecting cardiopulmonary performance [47–49]. Although some questions may be similar (e.g., whether a patient can walk a certain distance), on the one hand walking is seen as an
Table 1. Description of the seven samples (sample sizes, age, percentage without limitations, and instruments; the table layout is not recoverable from the extraction). Sources: Fries & McShane, 1986; Kosinski et al., 2000; Stewart & Ware, 1992; Ware et al., 2001; Stewart et al., 1981; Ware et al., 2000.
a Chronic conditions limited to those measured across all data sets (heart disease, diabetes, hypertension, and arthritis).
b The ARAMIS project is a longitudinal project as described by Fries and McShane; we used the data collected in 2004.
c HAQ: Health Assessment Questionnaire. Twenty items with 4 response categories and 21 dichotomous items asking about the use of devices or assistance; all 41 items were used as individual items.
d WOMAC: Western Ontario and McMaster Universities osteoarthritis index. The visual analogue scale (VAS) version was used, which was divided into five equally distributed distances for purposes of item analysis.
f MOS: Within the Medical Outcomes Study, items were used that were later included in the SF-36.
g PAQ: Patient Assessment Questionnaire: 10 items later used for the SF-36 plus 5 other PAQ items.
h Age variable only available in three categories in the HOS public-use file.
i HOS: Medicare Health Outcomes Survey. Baseline first cohort: six activities-of-daily-living and four shortness-of-breath items.
j SF-8 Health Survey.
k LSU: Louisiana State University Questionnaire.
l SIP: Sickness Impact Profile; nine items were used.
m HIE: Health Insurance Experiment items (four of these items were also used in the HIE sample [see 12]).
n Ad hoc items.
o The HIE item uses a dichotomous response option (yes/no); all other data sets use the SF-36 item.
p Non-Dayton Enrollment Forms A and B.
q Without overlapping items.
exemplary task to assess the underlying capacity of the car-
diopulmonary system, whereas on the other hand it refers to
the functioning of the lower extremities and is used to as-
sess if that particular task can be fulfilled or if the patient
might need assistance. Further, we find a different underly-
ing understanding of questionnaire items in rehabilitation
research, with more focus on the musculoskeletal, the neu-
rological, or the cardiopulmonary system [37,50–54], and
generic instruments that also try to include questions rele-
vant for healthy persons.
PROMIS aims to develop an instrument that can fulfill
all tasks at once. Thus, clearly, one of the major challenges
is to define one common construct, which is equally rele-
vant across groups or diseases and different levels of ability.
Decisions about which subdomains form the physical func-
tion construct and which of these should be included in one
score or which should be treated separately thus become
important [35,55]. Typically, the ability to fulfill self-care or instrumental daily activities is seen as one aspect of physical function, and surely mobility or gross motor activities (lower extremity) are considered a central part of it as well [35,56]. Further, back and neck (central) functions are important indicators of physical function, in particular for orthopedic patients, as is upper extremity function (grip, reach, etc.) for rheumatoid arthritis patients.
The PROMIS Domain Hierarchy subcommittee hypothe-
sized four subdomains of physical function, which we refer
to in Fig. 2. Most items assess more than one subdomain of
physical function, but generally can be assigned to one pre-
dominant category. Those daily activities that cannot be as-
signed to one part of the body will fall into the compound
daily activities category (Fig. 2). We included items from
all subdomains in the item bank development to assess
the shared variance (Fig. 2).
There has been a lengthy debate about the use of capacity items (e.g., ''are you able to,'' ''can you'') vs. actual performance items (e.g., ''did you climb up a flight of stairs today''). On the one hand, the use of capacity items has been criticized, as patients may overestimate their capabilities compared to their actual abilities. On the other hand,
the use of performance items has been criticized, because
the exemplary task referred to might not be relevant for
the respondent, for example, as when he or she does not
climb stairs because of living in a ground floor apartment.
In the data sets we used, most items assessed how much dif-
ficulty or how limited the participants are. The ‘‘difficulty’’
term was considered by Patrick et al. primarily to assess performance, whereas McDowell and Newell interpret this as an ''intermediate phrasing'' (p. 49). For comparison, we also included nine pure performance items from the Sickness Impact Profile (SIP) (e.g., ''Today, I walk ...'').
We excluded items which showed a substantial overlap
with role participation, as well as items about satisfaction
with physical abilities, to avoid confounding with coping
abilities as much as possible. Decisions as to which items were initially included were made using a qualitative approach, reviewing all original questionnaires. Items with three different recall periods (present, 1, or 4 weeks)
were considered. In the first round, the theoretical construct
was discussed. Three reviewers (one psychologist, one pub-
lic health researcher, one physician) then independently
decided if items belonged in the item pool. In the second
round, items for which no agreement was achieved were
reconsidered by all three reviewers together. An item was included in the analysis if at least two reviewers voted for its inclusion.
2.3. Data analysis
Initial item analyses were conducted independently on
each sample. We examined unidimensionality and local
independence of the items, item response curves (IRCs),
and differential item functioning, using previously reported
methodology [7,60]. After estimating item parameters for
each sample, and exploring possible differential item func-
tioning (DIF) between samples, we linked them to estimate
the item parameters for the whole item bank. Finally, the
bank was normed to have a mean of 50 and a standard
deviation of 10 in the US general population.
2.3.1. Item frequency distributions
The frequency distribution of each item in each sample was evaluated, and items with extreme skewness were excluded (more than 95% of responses in one category).
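The skewness screening rule just stated can be expressed compactly. A minimal sketch, where the 95% threshold is the one given in the text and the helper name is ours:

```python
import numpy as np

def too_skewed(responses, threshold=0.95):
    """Flag an item whose modal response category holds more than
    `threshold` of all responses (the extreme-skewness exclusion rule)."""
    _, counts = np.unique(np.asarray(responses), return_counts=True)
    return counts.max() / counts.sum() > threshold

# 97% of respondents in one category -> excluded; 60% -> retained.
drop = too_skewed([0] * 97 + [1] * 3)
keep = not too_skewed([0] * 60 + [1] * 40)
```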
2.3.2. Unidimensionality and local independence
We used different exploratory and confirmatory factor
analyses to explore the interrelationship of the proposed four physical function subdomains, the loading on one superordinate factor, and whether a sufficiently unidimensional physical function construct [11,61,62] can be assumed.
Finally, a one-factor confirmatory factor analysis (CFA) was applied, using the program Mplus® (Muthén & Muthén, Los Angeles, CA) with a weighted least squares
Fig. 2. Four subdomains of the physical function construct and overlap of example items (e.g., ''do eight hours of physical labor,'' ''open a carton,'' ''... from a chair'').
estimation with robust standard errors and mean- and variance-adjusted χ² statistics (WLSMV). Items with path weights below 0.40 were eliminated, following earlier recommendations.
To test for local independence, we analyzed residual correlations using Mplus® [65,66]. If a pair of items
had a residual correlation of 0.25 or more, we eliminated
the item that showed a higher accumulated residual corre-
lation with the remaining items, as done in earlier studies
[22,67]. Additionally, we initially excluded three items of the SF-36 PF scale (SF-36v2: PF09: walk one hundred yards, PF08: walk several hundred yards, PF05: climb one flight of stairs), as they are logically dependent on one of the remaining SF items (PF07: walk 1 mile, PF04: climb several flights of stairs) and thus local independence cannot be assumed. To estimate parameters for these three items, we fixed all item parameters and then estimated PF08 with item PF07 excluded, using the US general population normative sample. We followed the same logic to estimate item PF09, and to estimate item PF05 excluding PF04.
2.3.3. Differential item functioning
Tests of DIF [68–70] are used to identify systematic errors due to a group bias (independent variables: gender, age, race, ethnicity, sample, and household income, where available). We carried out DIF analyses using an ordinal logistic regression model. In this approach, the item response
is regressed on the total sum score of all items and on the
independent variable (e.g., age) in question. A significant
effect of the independent variable on the item response
(when controlling for the total sum score) is an indication
of uniform DIF. Further, a significant interaction effect
(between the independent variable and the sum score) on
the item response indicates nonuniform DIF. To evaluate
the magnitude of DIF, we calculated the coefficient of
determination R² as defined by Nagelkerke. The coefficient of determination R² is defined as the proportion of variation explained by the logistic regression model. We set an increase of ΔR² > 0.03 (combined uniform and nonuniform DIF) as the criterion to indicate noticeable DIF. If DIF was detected for an item in any sample, it was excluded in all samples (e.g., Table 2a: items 16 and 17, column D).
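The ΔR² criterion can be computed directly from the log-likelihoods of the two nested logistic models (with and without the group terms). The Nagelkerke formula itself is standard; the log-likelihood values below are hypothetical, purely for illustration:

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's R^2: the Cox-Snell R^2 rescaled so its maximum is 1.
    ll_null / ll_model are log-likelihoods; n = number of observations."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_r2

# Hypothetical fits: item ~ sum score, then item ~ sum score + group terms.
r2_base = nagelkerke_r2(ll_null=-650.2, ll_model=-610.4, n=1000)
r2_dif = nagelkerke_r2(ll_null=-650.2, ll_model=-588.7, n=1000)
noticeable_dif = (r2_dif - r2_base) > 0.03   # the criterion from the text
```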
2.3.4. Heuristic analysis, item parameter estimation,
and item fit
We calculated IRCs using the program TestGraf, applying a nonparametric kernel-smoothing technique. Each
response option curve should have only one clear maximum
that is well separated from the maximum of other curves.
Item parameters were estimated using a generalized partial credit model as described by Muraki. The item
fit statistics for two parameter models used in evaluation of
health status instruments are typically chi-square statistics,
which are highly sensitive to sample size. This is a long-
standing problem and further research is widely encouraged
[11,62,74,75]. As a pragmatic solution, we applied the item fit statistics described earlier to sample sizes less than 1,500 and excluded items with P-values below 0.05.
For sample sizes greater than 1,500, we relied on other
results for model fit (CFA, residual correlations, heuristic
assessment of the IRCs, analysis of the model in partial
samples, DIF, and exploring the item parameters using simulation studies).
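The generalized partial credit model used for calibration maps a slope and a set of step parameters to category response probabilities. A minimal numpy sketch with hypothetical item parameters (category scores run 0..m):

```python
import numpy as np

def gpcm_probs(theta, slope, steps):
    """Category probabilities under the generalized partial credit model:
    P(X = k) is proportional to exp(sum_{j<=k} a * (theta - b_j)),
    with an empty sum (= 0) for the lowest category."""
    z = np.concatenate(([0.0], np.cumsum(slope * (theta - np.asarray(steps)))))
    z -= z.max()                       # guard against overflow
    p = np.exp(z)
    return p / p.sum()

# A 4-category item: probabilities sum to 1, and a higher theta shifts
# probability mass toward the higher categories.
p_low = gpcm_probs(theta=-1.0, slope=1.8, steps=[-1.0, 0.2, 1.1])
p_high = gpcm_probs(theta=2.0, slope=1.8, steps=[-1.0, 0.2, 1.1])
```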
2.3.5. Linking and norming of item parameters
All of the analyses described above were carried out sep-
arately for each data set. To establish a common metric
across all samples, we combined all data sets and co-calibrated all items simultaneously using a multigroup analysis (each sample being a different group). This
approach allows the use of all information in the data in
a consistent way and takes into account that score distribu-
tions may differ between groups. The sample that answered
the most items (the GPF study) initially defined the metric,
but we subsequently normed the parameters based on the
SF-Norm sample (see below). Before the final analyses,
we tested DIF across samples in a series of pair-wise
comparisons of samples, each comparison conducted on
those items the two samples had in common (Table 1).
Finally, we evaluated the stability of item parameter
estimates for items with (1) similar content applied in the
same sample (Table 3: item 6/7, item 10/31, item 1/30, item 9/32, item 7/14, item 37/40, item 56/58, item 63/67, item 69/70 [poly-/dichotomous response options], excluding one item of each pair) and (2) items that were presented in close proximity within one questionnaire, to
evaluate the possibility that estimates were inflated due to
context effects, or minor multidimensionality. This applies
to the HAQ, WOMAC, and SF-36 items. We
reestimated the item parameters for these items, excluding
adjacent items from the questionnaire. In each step of these analyses, two-thirds of the items coming from one questionnaire were excluded and one-third were kept (three analyses were run to cover all items). Large differences between the two sets of item
parameters for the same items were taken as an indication
of context effects.
We normed the item bank based on weighted maximum likelihood estimates for each person in the normative general population sample. Although analysis of estimated scores can lead to bias, other approaches, such as setting the general population sample to a mean of 0 and standard deviation of 1 in a multigroup analysis, lead to inflated slope parameters and a strange shape of the estimated population
distribution (discussed in detail in a subsequent paper in
this series). Because inflated item parameters would lead
to overestimation of measurement precision, we chose the
norming based on estimated scores as the most cautious ap-
proach. The final IRT score was then rescaled to a mean of
50 and a standard deviation of 10 in the 1998 US general
population (SF-Norm sample).
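The final norming step is a simple linear rescaling of the estimated scores. A sketch, in which the normative thetas are simulated stand-ins for the SF-Norm sample:

```python
import numpy as np

def to_t_metric(theta, norm_sample):
    """Linearly rescale IRT scores so that the normative sample has
    mean 50 and SD 10 (the T-score metric used for the item bank)."""
    mu = np.mean(norm_sample)
    sd = np.std(norm_sample)
    return 50.0 + 10.0 * (np.asarray(theta) - mu) / sd

rng = np.random.default_rng(0)
norm = rng.normal(0.0, 1.0, size=5000)   # stand-in normative sample
scores = to_t_metric(norm, norm)         # normative sample on the T-metric
```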
2.3.6. Analysis of item information function
Item information functions (IIFs) can be calculated from the IRT model parameters. The IIF describes each item's contribution to overall test precision, and the sum of the IIFs defines the ideal precision of the test at a given theta, allowing evaluation of the expected standard error (SE). For samples with an observed variance σ² = 1, an SE < 0.23 is comparable to a reliability of > 0.95 (reliability = 1 - SE²/σ²). Similarly, for samples with σ = 10 (US-Norm sample), a reliability of r > 0.95 corresponds to an SE < 2.3.
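The reliability bound just quoted follows directly from the stated identity; a one-line check using the values from the text:

```python
def reliability_from_se(se, sigma):
    """Reliability implied by a standard error SE for a sample with
    observed standard deviation sigma: 1 - SE^2 / sigma^2."""
    return 1.0 - (se / sigma) ** 2

# On the unit metric, SE = 0.23 gives reliability ~0.947;
# on the T-metric (sigma = 10), SE = 2.2 gives ~0.952 (the >0.95 in the abstract).
r_unit = reliability_from_se(0.23, 1.0)
r_t = reliability_from_se(2.2, 10.0)
```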
2.3.7. Simulation studies
We conducted one type of simulation study to evaluate
the adequacy of the item bank in terms of measurement
precision in relation to the number and origin of items used
for scoring at a given theta. Based on the item parameters,
we simulated the answers of 1,000 simulees, having a nor-
mal distribution with a mean of 40 and a SD of 20, reflect-
ing the distribution of the samples we included.
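The response simulation for such a study can be sketched as follows. The item parameters are hypothetical; the simulee thetas are drawn from the stated N(40, 20) distribution on the T-metric and converted back to the standardized metric of the item parameters:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_item(thetas, slope, steps):
    """Draw one partial-credit-model response per simulee
    (inverse-CDF sampling from the category probabilities)."""
    steps = np.asarray(steps, dtype=float)
    z = np.cumsum(slope * (thetas[:, None] - steps), axis=1)
    z = np.hstack([np.zeros((len(thetas), 1)), z])   # lowest category: z = 0
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    u = rng.random((len(thetas), 1))
    return (p.cumsum(axis=1) < u).sum(axis=1)        # sampled category index

t_scores = rng.normal(40.0, 20.0, size=1000)  # distribution used in the text
thetas = (t_scores - 50.0) / 10.0             # back to the mean-0, SD-1 metric
responses = simulate_item(thetas, slope=1.5, steps=[-1.0, 0.5])
```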
2.3.8. Theta and SF-36 physical function scale
To illustrate the relation between IRT scores and tradi-
tional sum scores, the expected raw score on the SF-36
PF scale was calculated for various levels of theta. Such
calculations of expected scores can also be used to transform IRT scores into the metric of traditional measures. These calculations are based on the IRT model. For
each item and for each level of theta, an expected item
Table 2a. RA clinical study item overview (abbreviated item content, source, and item code)
1) Get in and out of a car
2) Run errands and shop
3) Walking several blocks
4) Stand up from a straight chair
5) Get on and off the toilet
6) Get in out of bed
7) Do chores such as vacuuming or yard work
8) Walking more than a mile
9) Climbing several flights of stairs
10) Climbing one flight of stairs
11) Climb up five steps
12) Wash and dry your body
13) Bend down to pick up clothing from the floor
14) Walk outdoors on flat ground
15) Walking one block
16) Shampoo your hair
17) Get down a 5-pound object above your head
18) Open a new milk carton
19) Bathing or dressing yourself
20) Dress yourself, including shoelaces and buttons
21) Need assistance to do hygiene
22) Take a tub bath
23) Moderate activities, such as moving a table
24) Open car doors
25) Lift a full cup or glass to your mouth
26) Cut your meat
27) Lifting or carrying groceries
28) Bending, kneeling, or stooping
29) Using a wheelchair
30) Need assistance to do errands and chores
31) Turn faucets on and off
32) Need assistance to reach something
33) Need assistance for dressing and grooming
34) Open previously opened jars
35) Need assistance for arising
36) Need assistance for eating
37) Need assistance for walking
38) Vigorous activities, such as lifting heavy objects
39–51) Other assistance or use-of-device items
Reasons to exclude items from the pool: CFA1: factor loading < 0.4; res: residual correlation r > 0.25; local dep: local dependency; IRC: item response curves did not discriminate; DIF: regression coefficient ΔR² > 0.03 (DIF 1: age, 2: gender, 3: ethnicity, 4: race, 5: samples, 6: DIF in ARAMIS sample).
Italics: items that stayed in the item bank. A–D are used in the text to refer to the columns. Tables for all other samples can be obtained from the authors.
Table 3. Physical function item bank (columns: abbreviated item text, Imax at θ, slope, steps 1–5, source, item label, RC [response category], item recall period)
1) Walk outdoors on flat ground
2) Difficulty using the toilet
3) Difficulty walking
4) Bend down to pick up clothing
5) Get on and off the toilet
6) Difficulty dressing
7) Bathing or dressing yourselfa
8) In bed or chair most or all of the day
9) Stand up from a chair
10) Get in and out of a car
11) Difficulty bathing
12) Get around in your home
13) Climb up five steps
14) Take a tub bath
15) Run errands and shop
16) Do chores such as vacuuming or yard work
17) Lying in beda
19) Walking one hundred yardsa
20) Light domestic dutiesa
21) Working around the house
22) Getting on/off toileta
23) Reach high on a shelf for something
24) Taking off socks/stockingsa
25) Rising from beda
27) Putting on socks/stockingsa
28) Going shoppinga
29) Lifting or carrying groceries
30) Walking on flat surfacea
31) Getting in/out of cara
32) Rising from sittinga
33) Getting in or out of chairs
34) Climbing a flight of stairsa
35) Bending to floora
36) Descending stairsa
37) Rearrange furniture in your home
38) Limitation of usual physical activities (such as walking or
39) Getting in/out of batha
40) Move light furniture, vacuum, and lift or push up to 25 pounds
41) Ascending stairsa
42) Dance for a half an hour
43) Today, I do not walk up or down hillsa
44) Limitation of everyday physical activities (walking or climbing
45) Heavy domestic dutiesa
46) Do harder activities at home, such as mow lawns, mop floors
47) Bending, kneeling, or stooping
48) Today, I stand only for short periodsa
49) Walking several hundred yardsa
50) Walk 2 milesa
51) How much difficulty do you have using stairs
52) Today: I go up and down stairs more slowly, one step at a time,
53) Exercise hard for half an hour
54) Climbing several flights of stairs
55) How much difficulty do you have doing your daily physical
activities, because of your health
56) Limitation of moderately strenuous activities such as taking walks, gardening, bowling, or playing golfa
57) Walking more than a mile
58) Moderate activities, such as moving a table, pushing a vacuum
cleaner, bowling, or playing golf
59) Can you take part in sports such as swimming, bowling, golf
60) Could you do hard activities at home, heavy work, like
scrubbing floors, or lifting or moving heavy furniture
61) Today, I walk shorter distances or stop and rest oftena
62) Do 8 hours physical labor
63) If you wanted, could you run a short distancea
64) Participate in active sports such as swimming, tennis,
65) Today: I walk more slowlya
66) Limitation of strenuous activities such as backpacking, skiing,
playing tennis, bicycling or jogging
67) Run a short distancea
68) How difficult would it be for you to jog or run slowly for 2
69) Vigorous activities, such as running, lifting heavy objects,
participating in strenuous sports
70) Limitation of vigorous activities, such as running, lifting heavy
objects, or participating in strenuous sportsa
Imax at θ: maximum of the item information function at a particular theta; Imax: maximum of the information function. Data source: 1: ARAMIS, 2: RAC, 3: MOS, 4: HOS, 5: GPF, 6: HIE, 7: SF-Norming Study. RC: response category: D: difficulty, L: limitation, A: ability, T: time of limitation, P: performance.
Item parameters are normed such that the general population has a mean of 0 and an SD of 1. We present this differently from Fig. 3, because it is more common to present item parameters with a mean of 0 instead of 50.
aItem will not be used in CAT pilot testing applications (see text).
bItem category collapsed to ensure item fit.
cThere are two different labels for the same item, as they were used differently in different enrollments. DEI4461: South Carolina 3-year enrollment; DEI4762: Seattle and Massachusetts enrollment, and
South Carolina 5-year enrollment.
score is calculated by multiplying the item response proba-
bilities by the score for each response option. These
products are then summed within items to form the expected
item score and summed again across items to form the
expected scale score for that level of theta.
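The expected-score calculation just described can be sketched in code; the GPCM parameterization below is standard, but the slopes and thresholds are illustrative values, not parameters estimated for the bank.

```python
import numpy as np

def gpcm_probs(theta, a, thresholds):
    """Generalized partial credit model: probability of each response category.

    `a` is the slope; `thresholds` holds step parameters b_1..b_K,
    so categories are scored 0..K.
    """
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(thresholds)))))
    e = np.exp(steps - steps.max())  # subtract max for numerical stability
    return e / e.sum()

def expected_scale_score(theta, items):
    """Sum over items of the expected item score, sum_k k * P_k(theta)."""
    total = 0.0
    for a, thresholds in items:
        p = gpcm_probs(theta, a, thresholds)
        total += float(np.dot(np.arange(len(p)), p))  # expected item score
    return total
```

For a well-behaved bank this expected scale score increases monotonically with theta, which is what makes cross-calibration against traditional sum scores possible.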
3.1. Item selection
All data sets were screened systematically for items cov-
ering the physical function construct. After review, 136
items were used for the data analysis.
From the GPF sample, we excluded four SIP items (stay-
ing in bed most of the time, do not use stairs at all, walk
only with help, do not walk at all) that were too easy to
be applied to the general population (more than 95% of
respondents chose the easiest response choice). For the same
reason, we excluded eleven items from the HIE sample
(help with eating, in bed most or all day, dress yourself,
walk to table for meals, walk around inside the house, walk
a block, light work around the house, have to stay indoors
most of the day, unable to walk unless assisted, travel
around community with assistance, eat without help), leav-
ing 121 items for further analyses.
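The >95% screening rule translates directly into code; the item names and response codings below are illustrative, not the actual data layout.

```python
def screen_easy_items(responses, easiest_code=0, threshold=0.95):
    """Return items where more than `threshold` of respondents chose the
    easiest response category, mirroring the >95% exclusion rule."""
    drop = []
    for item, answers in responses.items():
        share = sum(1 for a in answers if a == easiest_code) / len(answers)
        if share > threshold:
            drop.append(item)
    return drop
```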
3.3. Unidimensionality and local independence
Across the samples, the first unrotated factor explained
between 41.9% and 57.7% of the variance, the second
5.7% and 10.9%, and the third 4.0% and 7.5%. Confirma-
tory factor analysis including the proposed subdomains
and one superordinate factor showed the best model fit.
However, only samples where the HAQ was used had a suf-
ficient number of items about upper extremity functions to
build a latent trait for this subdomain. For these samples,
the subdomains daily activities, lower, and central body
functions loaded on one superordinate factor with stan-
dardized loadings between 0.91 and 0.99, whereas the up-
per extremity functions (HAQ Eating, Grip) had loadings
between 0.69 (osteoarthritis [OA] patients) and 0.85 (rheu-
matoid arthritis [RA] patients) (see also ).
Three sets of items raised concerns of multidimensional-
ity that had to be addressed by removing items: (1) The fit
of the CFA model improved substantially if the assistance
and device items were allowed to form their own factor
and these items showed residual correlations above the
threshold of 0.25 (Table 2b: items 12, 29, 36, 40, 41, 42,
44–58; Table 2a: items 21, 29, 30, 32, 33, 35, 36, 37,
39–51, columns A). (2) In both the ARAMIS and RAC
samples, items on upper body function had rather low fac-
tor loadings and high residual correlations in a one factor
CFA solution (Table 2b: items 28, 33, 34, 37, 38, 43; Table
2a: items 18, 24, 25, 26, 31, 34 columns A). (3) Within the
HOS sample, all four shortness of breath items showed high
residual correlations with each other. From all samples, 46
items were removed due to their residual correlations, leav-
ing 75 items in the pool. After we excluded the items
described, we conducted a second CFA in each sample
(Tables 2a, 2b, column B; results on other than shown sam-
ples available on request) and reevaluated the residual cor-
relations. We tested a total of 1,324 residual correlations;
only 19 were above 0.20, only one residual correlation re-
mained above 0.25 (PF02 and PF03 in the RAC sample).
As the correlation between these items was below 0.20 in
all other samples, we considered this result to be random
(Tables 2a, 2b).
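As a rough sketch of this local-independence check, one can remove a single common factor (approximated here by the first principal component rather than a fitted CFA model, an assumption on our part) and inspect the residual correlations; the simulated "items" are illustrative.

```python
import numpy as np

def residual_correlations(X):
    """Correlations of item residuals after removing the first principal
    component, a crude stand-in for a one-factor CFA."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    rank1 = np.outer(U[:, 0] * s[0], Vt[0])  # best rank-1 approximation
    return np.corrcoef(Xc - rank1, rowvar=False)
```

A locally dependent pair (two items sharing nuisance variance beyond the common factor) stands out with a residual correlation well above the rest.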
3.4. Differential item functioning
In the general population, we observed DIF for gender
for the item ‘‘light house work’’ (HIE item: DEI4453,
ΔR² = 0.033). Women report fewer difficulties doing house-
work, even when they experience more disability than men.
Because the ARAMIS sample was only OA patients and the
RAC sample only RA patients, we also observed a few in-
teractions with respect to diagnostic groups: Older patients
with osteoarthritis and greater disability (Table 2b: item 39,
column C) tend to admit more difficulties shampooing their
hair, compared to younger patients at the same level of
disability. This was not the case for patients with rheuma-
toid arthritis (Table 2a: item 16, column C). Women with
osteoarthritis are more likely to report problems reaching
up to get down a bag of sugar than men (Table 2b: item
35, column C). We do not see this gender influence for pa-
tients with rheumatoid arthritis (Table 2a: item 17, column
C). All items showing a ΔR² > 0.03 in any of the DIF anal-
yses were excluded from the item bank. Further analyses
were conducted on the 70 remaining items.
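The ΔR² criterion can be illustrated with a toy version of the DIF analysis for a dichotomous item: compare the McFadden pseudo-R² of a logistic model conditioning on theta alone against one that adds group membership and its interaction with theta. This is a simplified stand-in for the actual procedure, and all data below are simulated.

```python
import numpy as np

def fit_logistic(X, y, steps=3000, lr=1.0):
    """Logistic regression via plain gradient ascent; returns the log-likelihood."""
    X = np.column_stack([np.ones(len(y)), X])  # add intercept
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    p = 1.0 / (1.0 + np.exp(-X @ w))
    eps = 1e-12
    return float(np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def dif_delta_r2(theta, group, response):
    """Gain in McFadden pseudo-R2 when group (and its interaction with theta)
    is added to a model that already conditions on theta."""
    ll_null = fit_logistic(np.zeros((len(response), 0)), response)
    ll_base = fit_logistic(theta[:, None], response)
    ll_full = fit_logistic(np.column_stack([theta, group, theta * group]), response)
    r2 = lambda ll: 1.0 - ll / ll_null
    return r2(ll_full) - r2(ll_base)
```

An item whose response probability depends on group membership beyond theta yields a clearly larger ΔR² than a DIF-free item.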
3.5. IRCs and item fit
Almost all response option curves showed the desired
characteristics, with one clear maximum.
Of particular interest was the performance of the
WOMAC items. Most of the items could be modeled by
transforming the VAS into an ordinal scale with five equally
spaced categories. However, two WOMAC items (going shopping,
getting in/out bath, Table 2b: items 3, 25) did not show
the desired fit indices. For these and six other items, we
had to collapse some response categories to improve their
fit to the model (ARAMIS sample: DIERRAND: run er-
rands, DITUB: take a bath; GPF sample: LSU8: do 8 hours
of physical labor, LSU7: walk 2 miles, DEI5880: strenuous
leisure activities, DEI4461: take part in sport). For all tested
samples, no item ultimately showed a significant misfit (Tables
2a and 2b, column D).
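The simple VAS transformation and the category collapsing can be sketched as follows; the collapse mapping shown is illustrative, not one of the actual recodings.

```python
def vas_to_category(mark, n_cat=5, vas_max=100.0):
    """Map a VAS mark in [0, vas_max] to one of n_cat equal-width ordinal
    categories (0 .. n_cat-1); the top boundary falls in the last category."""
    k = int(mark / (vas_max / n_cat))
    return min(k, n_cat - 1)

def collapse_categories(category, mapping):
    """Merge sparse or disordered categories, e.g. {0: 0, 1: 1, 2: 1, 3: 2, 4: 2}."""
    return mapping[category]
```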
3.6. Item linking and item parameters
The last step of the item bank development was linking
the samples and norming the item parameters. The
ARAMIS and RAC had an overlap of nine HAQ items,
whereas the other samples were linked using seven SF-36
items. Because we had excluded many items from HIE
sample during the different steps of the analysis, we had on-
ly three items left which overlapped between the GPF and
the HIE samples. We considered this not to be sufficient to
estimate robust item parameters. Thus, the remaining five
items from the HIE sample were excluded from further
analyses.
For the remaining 70 items, we evaluated whether content
or context effects might have inflated the item parameter
estimates. Excluding one of each pair of items
with similar content used in one sample did not essentially
change the threshold parameters of the remaining items
(mean −0.02 to +0.03). The maximum change in slope
Table 2b. ARAMIS item overview
Abbreviated item content (source and item-code columns not reproduced)
1) Taking off socks/stockings
2) Putting on socks/stockings
3) Going shopping
4) Getting in/out of car
5) Getting on/off toilet
6) Rising from bed
7) Walking on flat surface
8) Light domestic duties
9) Ascending stairs
10) Heavy domestic duties
11) Rising from sitting
12) Need assistance for eating
13) Get in and out of a car
15) Run errands and shop
16) Descending stairs
17) Bending to floor
19) Do chores such as vacuuming or yard work
20) Climb up five steps
21) Walk outdoors on flat ground
22) Get in and out of bed
23) Wash and dry your body
24) Stand up from a straight chair
25) Getting in/out of bath
26) Get on and off the toilet
27) Bend down to pick up clothing from the floor
28) Open car doors
29) Need assistance to do errands and chores
30) Take a tub bath
31) Lying in bed
32) Dress yourself, incl. shoelaces and buttons
33) Lift a full cup or glass to your mouth
34) Cut your meat
35) Get down a 5 pound object above your head
36) Need help for daily hygiene
37) Turn faucets on and off
38) Open a new milk carton
39) Shampoo your hair
40) Use a wheelchair
41) Use a walker
42) Need assistance for walking
43) Open previously opened jars
44–58) Other assistance or use of device items
Reasons to exclude items from the pool: CFA1: factor loading < 0.4; res: residual correlation r > 0.25; IRC: item response curves did not discriminate;
ΔR²: DIF regression coefficient R² > 0.03 (DIF 1: Age, 2: Gender, 3: Ethnicity, 4: Race, 5: Samples). Italics: items that stayed in the item bank. A–D are used
in the text to refer to the columns.
parameters was seen after excluding the HAQ item ‘‘get in
and out of the car.’’ The slope parameter of the correspond-
ing WOMAC item (getting in/out of car) dropped by 0.21
units from 2.93. Evaluating context effects for the WO-
MAC, HAQ, and SF-36 items, we saw rather stable thresh-
old parameters as well, with a mean difference of 0.11 (SD
0.06). However, some WOMAC and HAQ items showed
a notable decline in slope parameters (e.g., rising from
bed [−0.78], getting off toilet [−0.73], getting out of the
car [−0.73], run errands [−0.73], do chores [−0.60]), when
we decreased the number of items from the same instru-
ment in the estimation. Also noteworthy, the slope in-
creased significantly for the same HAQ items (run
errands [+0.68], do chores [+0.61]), when the number of
WOMAC items was reduced. All these items had slope pa-
rameters greater than 2. SF-36 items showed a smaller
slope decline (< 0.60). Because experience with the evalu-
ation of parameter stability is still limited, we did not elim-
inate any items based on these results, but excluded the
WOMAC items from CAT simulations.
The resulting bank of items covered a wide range of the
latent trait. To estimate their difficulty levels, we ordered
the items according to the maximum of their IIF (Ta-
ble 3). Most items covering the low range of physical func-
tion (or high disability) ask about self-care functions, like
using the toilet, getting around the home, dressing, groom-
ing, or walking on flat ground. In the middle of the range,
items ask about moderately demanding activities such as
doing light domestic duties, or climbing up five steps.
The higher end of the physical function scale is defined
by more strenuous activities, such as doing up to 8 hours
of physical labor or participating in vigorous activities. Al-
most all items seem to be influenced by musculoskeletal
and cardiopulmonary functionality, but the more difficult
the item gets, the more important the cardiopulmonary
system seems to be and vice versa. (The five most difficult
items [Table 3: items 66–70] ask about running or jogging,
which is typically used to test cardiopulmonary capacity.)
The maximum of all item information curves is below the
average physical function of the general population, indi-
cating that the item bank covers well the disease-related as-
pects of physical dysfunction but is limited in its coverage
of the range that is above average for the US population.
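The ordering by maximum item information can be reproduced numerically; the sketch below computes the GPCM item information function and locates its peak on a theta grid, using illustrative (not estimated) parameters.

```python
import numpy as np

def gpcm_probs(theta, a, thresholds):
    """GPCM category probabilities for one item (slope a, step parameters)."""
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(thresholds)))))
    e = np.exp(steps - steps.max())
    return e / e.sum()

def item_information(theta, a, thresholds, h=1e-4):
    """Fisher information I(theta) = sum_k P'_k(theta)^2 / P_k(theta),
    with the derivative taken numerically."""
    p = gpcm_probs(theta, a, thresholds)
    dp = (gpcm_probs(theta + h, a, thresholds) -
          gpcm_probs(theta - h, a, thresholds)) / (2.0 * h)
    return float(np.sum(dp ** 2 / p))

def theta_at_max_info(a, thresholds):
    """Grid-search the theta at which the item information function peaks."""
    grid = np.linspace(-4.0, 4.0, 801)
    return float(grid[np.argmax([item_information(t, a, thresholds) for t in grid])])
```

An easy item (low thresholds) peaks at a lower theta than a hard one, which is exactly the ordering used for Table 3.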
3.7. Simulation studies
If we estimate a physical function score using all items
in the item bank, except the WOMAC items as we will not
use VAS scales in pilot CAT tests, the score can be esti-
mated with very high precision (SE < 2.3) over a wide
range of the latent trait (≈10–55) (Fig. 3). Best perfor-
mance is indicated by smaller SEMs and by the breadth
of the curve with low SEMs. If we only use the 10 most in-
formative items, simulating a CAT, we can cover a range of
approximately three standard deviations (20–50) with a
very high measurement precision (SE < 2.3). If we only
use 9 HAQ or 10 SF-36 items, we do not reach that level
of precision. However, an excellent measurement precision
of SE < 3.3 is achieved by the nine HAQ items in the pool
over a wide range at the lower part of the latent trait
(≈10–40), and the SF-36 IRT score measures the range
between disabled and healthy persons (≈30–50) as well.
These simulations suggest that a CAT could cover a range
between 10 and 60 (SE < 3.3) without increasing the
respondent burden, while approaching the level of precision
achieved using all items.
Figure 3 also shows the mean ± standard deviation of two
of our samples to illustrate the different ranges of measure-
ment needed for different kinds of samples.
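The SEs in these simulations follow from accumulated test information; on the normed metric (SD = 10), the conversion can be sketched as below, using the standard IRT relation SE(θ) = 1/√I(θ) rescaled to the T-score metric (an approximation on our part).

```python
import math

def score_standard_error(information, scale_sd=10.0):
    """On a normed metric with SD = scale_sd, SE = scale_sd / sqrt(I),
    where I is the test information accumulated over administered items."""
    return scale_sd / math.sqrt(information)
```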
[Figure 3 plots measurement precision (standard error) against normed theta values for the 10-item CAT and for all 53 items (without 17 WOMAC items), with reference lines at SE = 5.0, 3.3, and 2.3.]
Fig. 3. Measurement precision in relation to theta. For samples with a SD of 10 (representative sample), a SE = 5.0, 3.3, and 2.3 is comparable to a reliability
of r = 0.80, 0.90, and 0.95, respectively.
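The SE-to-reliability correspondence in the caption approximately follows the classical relation reliability = 1 − SE²/SD²; as a sketch:

```python
def reliability_from_se(se, sd=10.0):
    """Classical approximation: reliability = 1 - SE^2 / SD^2."""
    return 1.0 - (se / sd) ** 2
```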
3.8. Theta and SF-36 physical function scale
Figure 4 shows the relationship between theta and scores
on the SF-36 PF scale, which will be used to cross-calibrate
both measurements. We have described this method earlier
in detail. The figure demonstrates that the relation be-
tween theta and the expected SF-36 PF score is approxi-
mately linear in the interval 25–50 but that the curve
flattens outside this range, illustrating the ceiling and floor
problems of the SF-36 PF scale.
4. Discussion

We have demonstrated that it is possible to use a variety
of existing instruments to build a preliminary item bank
with promising properties. Apart from practicality, the use
of existing instruments will allow us to cross-calibrate
IRT scores with scores on traditional scales. The IRT
methods also showed that existing instruments like the
HAQ or SF-36 provide very good measurement precision
(SE < 3.3) only over a limited range. Our simulation studies
suggest that a 10-item CAT based on the preliminary item
bank can extend the range of measurement substantially
(plus 2–3 standard deviations), to an extent that ceiling
and floor effects are very unlikely to occur in clinical appli-
cations and that measurement precision is improved over
a wide range compared with fixed length questionnaires.
This will increase the ability to detect true change and to
fulfill power and sample size requirements. These results
are very encouraging and support the expectations created
by the PROMIS initiative. However, we also identified a
number of conceptual and empirical issues.
4.1. Conceptual issues
Combining different aspects of one construct in one
score has been shown to be a successful strategy for instruments
in very different disease areas over the last decades (Beck
Depression Inventory, State Trait Anxiety Inventory, etc.),
and a comprehensive physical function score, like that
from the HAQ, has been shown to predict work disability,
medical costs, and mortality [81–83]. Within the frame-
work of classical test theory, this concept is widely ac-
cepted, and in recent years the US Food and Drug
Administration has approved the use of the physical func-
tion scales from the HAQ or SF-36 in various clinical trials.
Being able to preserve the construct, as it is reflected in
these traditional instruments, seems important to ensure
the acceptability of the new instrument, and to allow
cross-calibration of traditional (classical test theory) and
new CAT scores. However, the explicit psychometric
requirements of an IRT framework may set limits to this
approach. In earlier studies [25,28,30], we applied one
parameter models, which provided a fairly reasonable fit
for a limited set of 10 physical function items from the
SF-36. However, as mentioned above and shown in Table
3, discrimination parameters vary substantially across the
larger set of current items. Restricting the item bank to
items with similar magnitude of slopes would have led to
various smaller banks, without obvious conceptual gain.
Thus, we decided to use a two parameter model, which al-
lowed us to build one larger item bank with content similar
to that of the SF-36 Physical Function scale.
In our empirical analysis, we had to exclude items that
mainly cover aspects of hand and upper extremity function
and that are part of the HAQ summary scale. The factor
analyses placed these items within the physical func-
tion domain, but further analysis showed that they shared
additional variance, which cannot sufficiently be
explained by one underlying latent trait. Similar findings
have been reported by Wolfe et al., who excluded further
aspects of physical function, leaving mainly lower body
items; with this approach, however, patients with predominantly
hand problems or with combined upper and lower extremity
dysfunction are not adequately assessed. Haley et al. [54,84,85]
separated instrumental activities of daily living
(IADL) items from mobility items. Van
der Heide et al. suggested that assistance and device
items form a second dimension, corresponding to our find-
ings. Thus, the most obvious next steps will be to build sep-
arate item banks for upper extremity functionality and the
use of devices, but additional broader research is needed to
establish how best to deal with dimensionality issues and
how best to cover subdomains. One solution to the issue of
multidimensionality would be to apply a multidimensional
IRT model and multidimensional CAT [87,88]. However,
such methods must be evaluated in terms of their practical
advantage over simpler solutions, for example, using
extended CAT logics to measure particular items or subdo-
mains only in specific diseases or health states.
4.2. Empirical issues
A significant number of items had to be excluded because
their range of measurement did not fit the sample to which
they were administered, for example, the general population.

Fig. 4. Relation between theta values and SF-36 PF sum scale.

The present item bank covers well the lower
range of physical function but the CAT simulations indicate
that it does not allow for measuring above average physical
abilities with sufficient precision. To cover this range, new
items targeting physically well-trained persons would have
to be developed.
All items included in the item bank fulfill the empirical
criteria. However, some items or response options may be
less suited to a CAT than others. One such problem might
occur due to the VAS response format. We used a simple
approach to transform the VAS of the WOMAC items into
five response categories, but we do not know how well this
approach performs in a real CAT application, or whether
the use of a more complex modeling of the VAS scale
[89e91] is beneficial. One concern occurred when we in-
vestigated content effects between similarly worded HAQ
and WOMAC items. We have seen greater parameter stabil-
ity than we had expected, which might express the impor-
tance of the response option, but might also reveal the
limitations of our approach to transform the VAS scale. Un-
til those issues are resolved, we will not use items with VAS
scales in a CAT. In general, we will prevent items with sim-
ilar content from being used in the same CAT assessment.
Nine items will not be used in pilot testing of the CAT
for that reason (Table 3: 7, 19, 34, 49, 50, 56, 63, 67, 70).
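The rule of keeping similar-content items out of the same CAT session can be sketched as an "enemy list" constraint on item selection; the item names and information values below are illustrative, not entries from the bank.

```python
def select_next_item(theta, bank, administered, enemies, info_fn):
    """Pick the most informative not-yet-administered item, skipping any item
    whose similar-content 'enemy' has already been asked."""
    blocked = set()
    for item in administered:
        blocked |= enemies.get(item, set())
    candidates = [i for i in bank if i not in administered and i not in blocked]
    return max(candidates, key=lambda i: info_fn(i, theta))
```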
Most items assessed limitations or difficulties in per-
forming a particular task, but we also kept some pure
performance items from the SIP to investigate their empirical
properties. All of those items provided their maximum in-
formation at an unusual range of the latent trait (Table 3:
e.g., item 48: SIP072 [Today, I stand only for short periods]
and item 65: SIP035 [Today, I walk more slowly]). This
seems to support the theoretical assumption that they mea-
sure a slightly different concept than the limitation or diffi-
culty items; but it may also indicate that these items are
more easily misunderstood. SIP items in particular have
a rather complex introduction (‘‘The following statements
should describe you TODAY and are related to your state
of health’’), and the text of SIP056 also includes a complex
double negation (Table 3: item 43, ‘‘I do not walk up or
down hills: yes/no’’). This could explain why item 43 has
the lowest maximum item information in the item bank.
However, dichotomous items in general do not provide as
much information as items with more response options
(compare, e.g., SIP054, item 52, ‘‘Today, I go up and down
stairs more slowly, yes/no’’; with item 51, ‘‘How much dif-
ficulty do you have using stairs,’’ five response options),
which is why we favor the development of items with at
least three response choices for CAT use.
In an ideal study, all participants would have answered
all items. Here, for most of the items, we only have
responses from a subsample. This prevented estimation
of item parameters in some instances, as the items' range
of measurement did not fit the distribution of the sample.
Moreover, a sample size below 1,000 combined with a small
mismatch of sample and measurement range, resulting in
ceiling or floor effects, could have compromised the estimates.
The data finally used for item parameter estimations
were sampled over a period of almost two decades. We
tested if this might have affected the estimates for the
SF-36 items, which were contained in five of the six final
data files. These items did not show DIF, but we could
not test if the difference in time might have influenced
the relationship of other items. Generally, only a few items
had to be excluded because they demonstrated DIF, as be-
tween RA and OA patients in items asking about self-care,
where a substantial amount of hand dysfunction is involved.
These items are more likely to have higher importance for
RA patients. Within samples we evaluated, we could only
investigate this problem for the HAQ items, because these
were answered by both RA and OA patients. Therefore, it
is likely that DIF for different disease groups may occur
for other items in the current item bank also. In our opinion
this needs particular attention in further item pool develop-
ment, as measurement invariance implies that IRT parame-
ters have to be essentially constant across different groups
of patients (‘‘sample-free’’) and test occasions (‘‘test-
free’’). Issues about whether DIF is possible or even might
be used for different score estimations across disease
groups will need to be explored and resolved in future
Most of the data we used were collected using standard-
ized paper and pencil questionnaires. This leads to a couple
of problems unique to the CAT process. In a CAT, each
item will be presented on its own screen on a PDA or
a PC, and the preceding and subsequent items will vary.
This is a substantial change in the context, which may in-
fluence understanding of the item. To estimate the extent
of change in context, we analyzed item parameter stability
by excluding items and reestimating item parameters. For
approximately 20% of the items tested, we observed less
parameter stability than we would like to see. This could
be a result of local item dependence leading to inflated
slope parameters, due to context effects (e.g., shared
response scale), minor multidimensionality, or specific
sample effects. If the problem is due to context effects, this
can be solved with real CAT applications. We plan to
reestimate and compare item parameters and respondent
burden, after we have real CAT data available.
Over the next years, the PROMIS initiative aims to
transform the way patient-reported outcome tools are
selected and used in clinical research and to establish a
national database for accurate and efficient measurement
of patient-reported outcomes.
In conclusion, we find that our results support the general
assumptions made by the PROMIS initiative and demon-
strate the potential of item banks and CATs to greatly im-
prove assessment of PROs, but they also point to the challenges
ahead: to better understand the domains and subdomains
we want to measure, to improve the items, and to estimate
valid and robust item parameters. Additional work on item
selection rules is also needed. There are so far only a few
reports about real CAT applications in health outcomes
research [14,22,60] or in routine care
[92,93], but the exploration of CATs in real applications
is the ultimate test of this methodology. The preliminary
item bank developed in this article was a prerequisite to
build and test a pilot CAT for physical function within
the PROMIS network, which will be the next step of this
part of the project.
Acknowledgments

This work was funded by the NIH through the NIH
Roadmap for Medical Research, Grant U01 AR052158-01
(Improved Outcome Assessment in Arthritis and Aging,
J. Fries (Principal Investigator) and J. Ware (Co-Principal
Investigator), Project Officer D. Ader) and supported by
Stanford University, QualityMetric Incorporated and Health
Assessment Lab from their own research funds. Informa-
tion on the PROMIS can be found at www.nihpromis.org.
References

McDowell I, Newell C. Measuring health: a guide to rating scales and
questionnaires. 2nd edition. Oxford: Oxford University Press; 1996.
 Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health sur-
vey (SF-36). I. Conceptual framework and item selection. Med Care
 Ware JE Jr, Kosinski M, Keller SD. A 12-item Short-Form Health
Survey. Med Care 1996;34(3):220–33.
 Fries JF, Spitz PW, Young DY. The dimensions of health outcomes:
the health assessment questionnaire, disability and pain scales.
J Rheumatol 1982;9:789–93.
 McHorney CA, Tarlov AR. Individual-patient monitoring in clinical
practice: are available health status surveys adequate? Qual Life
 Bellamy N, Buchanan WW, Goldsmith CH, Campbell J, Stitt LW.
Validation study of WOMAC: a health status instrument for measur-
ing clinically important patient relevant outcomes to antirheumatic
drug therapy in patients with osteoarthritis of the hip or knee. J Rheu-
 Bjorner JB, Kosinski M, Ware JE Jr. Computerized adaptive testing
and item banking. In: Fayers PM, Hays RD, editors. Assessing qual-
ity of life. Oxford: Oxford University Press; 2004.
 van der Linden WJ, Hambleton RK. Handbook of modern item
response theory. Berlin: Springer; 1997.
Fischer GH, Molenaar IW. Rasch models: foundations, recent devel-
opments, and applications. 1st edition. Berlin: Springer-Verlag; 1995.
 Embretson SE. The new rules of measurement. Psychol Assess
 Embretson SE, Reise SP. Item response theory for psychologists.
London: Lawrence Erlbaum Associates; 2000.
 Wainer H, Dorans NJ, Eignor D, Flaugher R, Green BF, Mislevy RJ,
et al. Computerized adaptive testing: a primer. 2nd edition. Mahwah,
NJ: Lawrence Erlbaum Associates; 2000.
 van der Linden WJ, Glas CAW. Computerized adaptive testing: the-
ory and practice. Dordrecht: Kluwer Academic Publishers; 2000.
 Ware JE Jr, Kosinski M, Bjorner JB, Bayliss MS, Batenhorst A,
Dahlof CG, et al. Applications of computerized adaptive testing
(CAT) to the assessment of headache impact. Qual Life Res 2003;
 Cella D, Chang CH. A discussion of item response theory and its ap-
plications in health status assessment. Med Care 2000;38(9 Suppl):
 Hambleton RK, Slater SC. Item response theory models and testing
practices: current international status and future directions. Eur J
Psychol Assess 1997;13(1):21–8.
 Hambleton RK, Jaeger RM, Plake BS, Mills C. Setting performance
standards on complex educational assessments. Appl Psychol Meas
 Hays RD, Morales LS, Reise SP. Item response theory and health out-
comes measurement in the 21st century. Med Care 2000;38(9 Suppl):
 McDonald RP. Future directions for item response theory. Int J Educ
 Ware JE Jr, Bjorner JB, Kosinski M. Practical implications of item
response theory and computerized adaptive testing: a brief summary
of ongoing studies of widely used headache impact scales. Med Care
 Bjorner JB, Kosinski M, Ware JE Jr. Using item response theory to
calibrate the Headache Impact Test (HIT) to the metric of traditional
headache scales. Qual Life Res 2003;12:981–1002.
 Walter O, Becker J, Fliege H, Bjorner JB, Kosinski M, Klapp BF,
et al. Developmental steps for a computer adaptive test for anxiety
(A-CAT). Diagnostica 2005;51:88–100.
 Fisher WP Jr, Eubanks RL, Marier RL. Equating the MOS SF36 and
the LSU HSI Physical Functioning Scales. J Outcome Meas
 Granger CV, Hamilton BB, Linacre JM, Heinemann AW, Wright BD.
Performance profiles of the functional independence measure. Am J
Phys Med Rehabil 1993;72(2):84–9.
 Haley SM, McHorney CA, Ware JE Jr. Evaluation of the MOS SF-36
physical functioning scale (PF-10): I. Unidimensionality and repro-
ducibility of the Rasch item scale. J Clin Epidemiol 1994;47:671–84.
 Heinemann AW, Linacre JM, Wright BD, Hamilton BB, Granger C.
Relationships between impairment and physical disability as
measured by the functional independence measure. Arch Phys Med
 Linacre JM, Heinemann AW, Wright BD, Granger CV, Hamilton BB.
The structure and stability of the Functional Independence Measure.
Arch Phys Med Rehabil 1994;75(2):127–32.
 McHorney CA, Haley SM, Ware JE Jr. Evaluation of the MOS SF-36
Physical Functioning Scale (PF-10): II. Comparison of relative preci-
sion using Likert and Rasch scoring methods. J Clin Epidemiol
 Bjorner JB, Kreiner S, Ware JE, Damsgaard MT, Bech P. Differential
item functioning in the Danish translation of the SF-36. J Clin Epide-
 Raczek AE, Ware JE, Bjorner JB, Gandek B, Haley SM,
Aaronson NK, et al. Comparison of Rasch and summated rating
scales constructed from SF-36 physical functioning items in seven
countries: results from the IQOLA Project. International Quality of
Life Assessment. J Clin Epidemiol 1998;51:1203–14.
 Tsuji T, Sonoda S, Domen K, Saitoh E, Liu M, Chino N. ADL struc-
ture for stroke patients in Japan based on the functional independence
measure. Am J Phys Med Rehabil 1995;74(6):432–8.
 Jenkinson C, Fitzpatrick R, Garratt A, Peto V, Stewart-Brown S. Can
item response theory reduce patient burden when measuring health
status in neurological disorders? Results from Rasch analysis of the
31M. Rose et al. / Journal of Clinical Epidemiology 61 (2008) 17e33
SF-36 physical functioning scale (PF-10). J Neurol Neurosurg
 Gray LB, Williams VSL, Hancock TD. An item response theory anal-
ysis of the Rosenberg Self-Esteem Scale. Pers Soc Psychol Bull
 King DW, King LA, Fairbank JA, Schlenger WE. Enhancing the pre-
cision of the Mississippi scale for combat-related posttraumatic stress
disorder: an application of item response theory. Psychol Assess
 Wolfe F, Michaud K, Pincus T. Development and validation of the
health assessment questionnaire II: a revised version of the health
assessment questionnaire. Arthritis Rheum 2004;50:3296–305.
Assessing mobility in children using a computer adaptive testing
version of the pediatric evaluation of disability inventory. Arch Phys
Med Rehabil 2005;86:932–9.
 Ware JE Jr, Gandek B, Sinclair SJ, Bjorner JB. Item response theory
and computerized adaptive testing: implications for outcomes mea-
surement in rehabilitation. Rehabil Psychol 2005;50(1):71–8.
 Fries JF, McShane DJ. ARAMIS (the American Rheumatism Associ-
ation Medical Information System). A prototypical national chronic-
disease data bank. West J Med 1986;145:798–804.
 Kosinski M, Zhao SZ, Dedhiya S, Osterhaus JT, Ware JE Jr. Deter-
mining minimally important changes in generic and disease-specific
health-related quality of life questionnaires in clinical trials of rheu-
matoid arthritis. Arthritis Rheum 2000;43:1478–87.
 Tarlov AR, Ware JE Jr, Greenfield S, Nelson EC, Perrin E,
Zubkoff M. The Medical Outcomes Study. An application of methods
for monitoring the results of medical care. J Am Med Assoc
 Stewart AL, Ware JE Jr. Measuring Functioning and Well-Being: The
Medical Outcomes Study Approach. London: Duke University Press;
National Committee for Quality Assurance. Specifications for the
Medicare Health Outcomes Survey. HEDIS, Vol. 6. Washington,
DC: National Committee for Quality Assurance; 2004.
 Ware JE Jr, Kosinski M, Dewey JE, Gandek B. How to score and
interpret single-item health status measures: a manual for users of
the SF-8 health survey (with a supplement on the SF-6 health survey).
Lincoln, RI: QualityMetric Incorporated; 2001.
 Stewart AL, Ware JE, Brook RH. Physical health in terms of func-
tioning. Publication no. R-1987/2-HEW. In: Conceptualization and
measurement of health for adults in the Health Insurance Study,
Vol. 2. Santa Monica, CA: RAND Corporation; 1981.
 Ware JE Jr, Kosinski M, Dewey J. How to score version two of the
SF-36 health survey. Lincoln, RI: QualityMetric Inc; 2000.
 Green CP, Porter CB, Bresnahan DR, Spertus JA. Development and
evaluation of the Kansas City Cardiomyopathy Questionnaire:
a new health status measure for heart failure. J Am Coll Cardiol
 Guyatt GH, Nogradi S, Halcrow S, Singer J, Sullivan MJ, Fallen EL. Development and testing of a new measure of health status for clinical trials in heart failure. J Gen Intern Med 1989;4(2):101–7.
 Rector T, Cohn J. Patients' self-assessment of their congestive heart failure. Part 2: Content, reliability and validity of a new measure, the Minnesota Living with Heart Failure questionnaire. Heart Fail 1987;3:198–209.
 Pepin V, Alexander JL, Phillips WT. Physical function assessment in cardiac rehabilitation: self-report, proxy-report and performance-based measures. J Cardiopulm Rehabil 2004;24(5):287–95.
 Bennett JA. Maintaining and improving physical function in elders. Annu Rev Nurs Res 2002;20:3–33.
 Branch LG, Meyers AR. Assessing physical function in the elderly. Clin Geriatr Med 1987;3(1):29–51.
 Coster WJ, Haley SM, Andres PL, Ludlow LH, Bond TL, Ni PS. Refining the conceptual basis for rehabilitation outcome measurement: personal care and instrumental activities domain. Med Care 2004;42(1 Suppl):I62–72.
 Haley SM, Andres PL, Coster WJ, Kosinski M, Ni P, Jette AM. Short-form activity measure for post-acute care. Arch Phys Med Rehabil 2004;85:649–60.
 Fries JF. The hierarchy of quality-of-life assessment, the Health Assessment Questionnaire (HAQ), and issues mandating development of a toxicity index. Control Clin Trials 1991;12:106S–17S.
 Fries JF. New instruments for assessing disability: not quite ready for prime time. Arthritis Rheum 2004;50:3064–7.
 Patrick DL, Darby SC, Green S, Horton G, Locker D, Wiggins RD. Screening for disability in the inner city. J Epidemiol Community Health 1981;35:65–70.
 Gilson BS, Gilson JS, Bergner M, Bobbitt RA, Kressel S, Pollard WE, et al. The sickness impact profile. Development of an outcome measure of health care. Am J Public Health 1975;65:1304–10.
 Dalkey N, Rourke D, Lewis R, Snyder D. Studies in the quality of life: Delphi and decision making. Lexington, MA: D.C. Heath; 1972.
 Bjorner JB, Kosinski M, Ware JE Jr. Calibration of an item pool for assessing the burden of headaches: an application of item response theory to the Headache Impact Test (HIT™). Qual Life Res 2003;12:913–33.
 Masters GN, Wright BD. The essential process in a family of measurement models. Psychometrika 1984;49:529–44.
 Muraki E. A generalized partial credit model. In: Linden WJ, Hambleton RK, editors. Handbook of modern item response theory. Berlin: Springer; 1997. p. 153–64.
 Muthén LK, Muthén BO. Mplus. The comprehensive modeling program for applied researchers. User's guide. Los Angeles: Muthén & Muthén; 1998.
 Nunnally J. Psychometric theory. 2nd edition. New York: McGraw-Hill; 1978.
 Drasgow F, Parsons C. Applications of unidimensional item response theory models to multidimensional data. Appl Psychol Meas 1983;7:189–99.
 Reckase M. Unifactor latent trait models applied to multifactor tests: results and implications. J Educ Stat 1979;4:207–30.
 Fliege H, Becker J, Walter OB, Bjorner J, Klapp BF, Rose M. Development of a computer-adaptive test for depression (D-CAT). Qual Life Res 2005;14:2277–91.
 Holland PW, Wainer H. Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates; 1993.
 Stout W. Psychometrics: from practice to theory and back: 15 years of nonparametric multidimensional IRT, DIF/test equity, and skills diagnostic assessment. Psychometrika 2002;67:485–518.
 Zumbo BD. A handbook on the theory and methods of differential item functioning (DIF). Ottawa: National Defense Headquarters; 1999.
 Nagelkerke NJD. Miscellanea. A note on a general definition of the coefficient of determination. Biometrika 1991;78:691–2.
 Ramsay JO. TestGraf. A program for the graphical analysis of mul-
tiple choice test and questionnaire data. Montreal: McGill University;
 Muraki E. A generalized partial credit model: application of an EM algorithm. Appl Psychol Meas 1992;16:159–76.
 Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of item
response theory. Newbury Park, CA: Sage Publications, Inc; 1991.
 Linden WJ, Hambleton RK. Handbook of modern item response the-
ory. Berlin: Springer; 1996.
 PARSCALE. IRT based test scoring and item analysis for graded open-ended exercises and performance tasks. DOS/Windows. Chicago, IL: Scientific Software Inc; 1996.
 Warm TA. Weighted likelihood estimation of ability in item response theory. Psychometrika 1989;54:427–50.
 Mislevy RJ. Estimating latent distributions. Psychometrika 1984;49:359–81.
 Muraki E. Information functions of the generalized partial credit model. Appl Psychol Meas 1993;17(4):351–63.
M. Rose et al. / Journal of Clinical Epidemiology 61 (2008) 17–33
 Cella D, Yount S, Rothrock N, Gershon R, Cook K, Reeve B, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS): progress of an NIH Roadmap cooperative group during its first two years. Med Care 2007;45:S3–S11.
 Wolfe F, Michaud K, Gefeller O, Choi HK. Predicting mortality in patients with rheumatoid arthritis. Arthritis Rheum 2003;48:1530–42.
 Wolfe F, Hawley DJ. The longterm outcomes of rheumatoid arthritis: work disability: a prospective 18 year study of 823 patients. J Rheumatol 1998;25:2108–17.
 Michaud K, Messer J, Choi HK, Wolfe F. Direct medical costs and their predictors in patients with rheumatoid arthritis: a three-year study of 7,527 patients. Arthritis Rheum 2003;48:2750–62.
 Haley SM, Coster WJ, Andres PL, Kosinski M, Ni P. Score comparability of short forms and computerized adaptive testing: simulation study with the activity measure for post-acute care. Arch Phys Med Rehabil 2004;85:661–6.
 Haley SM, Coster WJ, Andres PL, Ludlow LH, Ni P, Bond TL, et al. Activity outcome measurement for postacute care. Med Care 2004;42(1 Suppl):I49–61.
 van der Heide A, Jacobs JW, Albada-Kuipers GA, Kraaimaat FW, Geenen R, Bijlsma JW. Self report functional disability scores and the use of devices: two distinct aspects of physical function in rheumatoid arthritis. Ann Rheum Dis 1993;52(7):497–502.
 Reckase MD. The past and future of multidimensional item response theory. Appl Psychol Meas 1997;21:25–36.
 Gardner W, Kelleher KJ, Pajer KA. Multidimensional adaptive testing for mental health problems in primary care. Med Care 2002;40:812–23.
 Ferrando P. Theoretical and empirical comparisons between two models for continuous response. Multivar Behav Res 2002;37:521–42.
 Noel Y, Dauvier B. A beta logistic item response model for continuous bounded responses. Appl Psychol Meas 2007;31:47–73.
 Samejima F. Homogeneous case of the continuous response model. Psychometrika 1973;38:203–19.
 Rose M, Walter OB, Fliege H, Becker J, Hess V. 7 years of experience using Personal Digital Assistants (PDA) for psychometric diagnostics in 6000 inpatients and polyclinic patients. In: Bludau HB, Koop A, editors. Mobile computing in medicine. GI-Edition lecture notes in informatics, P-15. Köllen Verlag; 2002. p. 35–44.
 Rose M, Fliege H, Walter OB, Becker J, Bjorner J, Ravens-Sieberer U, et al. Using the item response theory to develop a computer adaptive test for depression. Qual Life Res 2002;11(7):626.