IRT health outcomes data analysis project: an overview
Karon F. Cook · Cayla R. Teal · Jakob B. Bjorner · David Cella ·
Chih-Hung Chang · Paul K. Crane · Laura E. Gibbons · Ron D. Hays ·
Colleen A. McHorney · Katja Ocepek-Welikson · Anastasia E. Raczek ·
Jeanne A. Teresi · Bryce B. Reeve
Received: 25 August 2006/Accepted: 11 January 2007
© Springer Science+Business Media B.V. 2007
In June 2004, the National Cancer Institute and the Drug Information
Association co-sponsored the conference, “Improving the Measurement of
Health Outcomes through the Applications of Item Response Theory (IRT)
Modeling: Exploration of Item Banks and Computer-Adaptive Assessment.”
A component of the conference was presentation of a psychometric and
content analysis of a secondary dataset.
Objectives A thorough psychometric and content analysis was conducted
of two primary domains within a cancer health-related quality of life
(HRQOL) dataset.
Research design HRQOL scales were evaluated using factor analysis for
categorical data, IRT modeling, and differential item functioning
analyses. In addition, computerized adaptive administration of HRQOL
item banks was simulated, and various IRT models were applied and
compared.
The original data were collected as part of the NCI-funded Quality of
Life Evaluation in Oncology (Q-Score) Project. A total of 1,714
patients with cancer or HIV/AIDS were recruited from 5 clinical sites.
R. D. Hays
Department of Medicine, and RAND Health Program,
University of California, Los Angeles, CA, USA
C. A. McHorney
Outcomes Research, Merck & Co., Inc., West Point, PA, USA
The New York Quality Improvement Organization, IPRO,
Lake Success, NY, USA
K. Ocepek-Welikson · J. A. Teresi
New York State Psychiatric Institute and Research Division,
Hebrew Home, Riverdale, NY, USA
J. A. Teresi
Faculty of Medicine, Columbia University Stroud Center,
Riverdale, NY, USA
B. B. Reeve
Outcomes Research Branch, National Cancer Institute,
Bethesda, MD, USA
K. F. Cook (✉)
Department of Rehabilitation Medicine, University of
Washington School of Medicine, Seattle, Washington, USA
C. R. Teal
Department of Medicine, Houston Center for Quality of
Care & Utilization Studies, Veterans Affairs Health
Services Research & Development Center of Excellence
and Section of Health Services Research, Baylor College of
Medicine, Houston, TX, USA
J. B. Bjorner · A. E. Raczek
QualityMetric Incorporated, Lincoln, RI and Health
Assessment Lab, Waltham, MA, USA
Center on Outcomes Research and Education, Evanston
Northwestern Healthcare, Northwestern University,
Feinberg School of Medicine, Chicago, IL, USA
Buehler Center on Aging, Northwestern University,
Feinberg School of Medicine, Chicago, IL, USA
P. K. Crane · L. E. Gibbons
Division of General Internal Medicine, University of
Washington School of Medicine, Seattle, WA, USA
Qual Life Res
Items from 4 HRQOL instruments were evaluated: Cancer Rehabilitation
Evaluation System–Short Form, European Organization for Research and
Treatment of Cancer Quality of Life Questionnaire, Functional
Assessment of Cancer Therapy and Medical Outcomes Study Short-Form
Health Survey.
Results and conclusions Four lessons learned from the project are
discussed: the importance of good developmental item banks, the
ambiguity of model fit results, the limits of our knowledge regarding
the practical implications of model misfit, and the importance in the
measurement of HRQOL of construct definition. With respect to these
lessons, areas for future research are suggested. The feasibility of
developing item banks for broad definitions of health is discussed.
Keywords Quality of Life · Health Status · Measurement · Outcomes
In June 2004, the National Cancer Institute (NCI) and the Drug
Information Association co-sponsored the conference, “Improving the
Measurement of Health Outcomes through the Applications of Item
Response Theory (IRT) Modeling: Exploration of Item Banks and
Computer-Adaptive Assessment.” A component of the conference was
presentation of an NCI-supported study, developed exclusively for this
conference, to perform a psychometric and content analysis of two
primary domains (mental and physical health) within a cancer
health-related quality of life (HRQOL) dataset. The mandate of the
funding agency was to conduct the analyses as a demonstration project.
Four specific analytic goals were addressed: (1) explore the
psychometric properties of several HRQOL scales using classical and
factor analytic methods; (2) evaluate scales’ properties through IRT
modeling; (3) assess differential item functioning; and (4) simulate
computerized adaptive administration of HRQOL item banks.
The purpose of this paper is to present an overview
of the project, emphasizing pertinent, and often
bedeviling, issues that present themselves in the
measurement of patient-reported outcomes. Because
the study originated as a demonstration project, we
employed more than one methodological approach
(sometimes several) for many of the analyses and
compared results. Doing so allowed us to evaluate and
comment on the differences in the methods and their
impact on findings. A detailed tutorial on each of the
analytic techniques employed would require a much
longer treatment and is outside the scope of this paper.
Original data were collected as part of the NCI-funded Quality of Life
Evaluation in Oncology (Q-Score) Project (R01 CA60068, 1994–1999; PI:
David Cella, PhD). The objectives of the Q-Score project included the
development of a standard metric for commonly used measures of HRQOL.
Using the data from the Q-Score project, investigators for the current
study evaluated responses to four HRQOL questionnaires: (1) the 59-item
Cancer Rehabilitation Evaluation System–Short Form (CARES–SF) [2, 3];
(2) the 30-item European Organization for Research and Treatment of
Cancer Quality of Life Questionnaire (EORTC); (3) the 33-item
Functional Assessment of Cancer Therapy (FACT) [5, 6]; and (4) the
36-item Medical Outcomes Study Short-Form Health Survey (SF-36) [7, 8].
The CARES–SF, EORTC, and FACT consisted solely of polytomous items
(items with more than two possible responses). Six of the SF-36 items
were dichotomous (yes/no), while the rest were polytomous. For these
analyses, all items were scored “positively” so that high scores
indicated better outcomes.
Participants (n = 1,714; 56% male; 81% Caucasian)
had cancer or HIV/AIDS, were at least 2 months
post diagnosis, and were able to understand English.
They were recruited from five Eastern Cooperative
Hopkins Oncology Center, Medical College of Ohio,
Fox Chase Cancer, and Robert H. Lurie Comprehen-
sive Cancer Center of Northwestern University).
The project team targeted two primary HRQOL
domains—mental and physical health. Putative mea-
sures of either or both domains were included in an
initial review of item content. The research team
judged 116 out of 154 items to be applicable to a
mental health or physical health domain (see Appendix).
Among the eliminated items were items that assessed
constructs such as communication with healthcare
providers and specific questions about
symptoms (e.g., nausea and appetite loss). For each of
the 116 items, the numbers of “missing” and “not applicable” (“n/a”)
responses were calculated (the response option “n/a” was added for some
scale items). One FACT item and 11 CARES items had large numbers of
missing or “n/a” responses. These items were dropped, leaving a pool of
104 items. The item pool was submitted to a range of analyses including
dimensionality assessment, IRT analyses, and DIF analyses.
HRQOL constructs often are defined broadly and may
fail to meet IRT’s unidimensionality assumption, i.e.,
that a single dimension drives item responses [9, 10].
The identification of unidimensional item banks for
measuring both mental and physical health proved to
be a challenging aspect of the project. The dimen-
sionality assessment combined exploratory factor
analysis (EFA), confirmatory factor analysis (CFA),
and expert review of item content. A polychoric correlation matrix was
analyzed using Mplus software. Unweighted Least Squares (ULS)
estimation was used for EFA models; Weighted Least Squares with Mean
and Variance adjustment (WLSMV) was used for CFA models. Following
recommendations to review multiple measures of model fit, fit was
evaluated based on the χ² test of model fit, the Comparative Fit Index
(CFI), the Root Mean Square Error of Approximation (RMSEA), the
Standardized Root Mean Square Residual (SRMR), evaluation of residuals
greater than 0.10 [15, 16], and percentage of item variance accounted
for by a one-factor model. The χ² statistic measures the discrepancy
between the observed covariance (or here: correlation) matrix and the
matrix that is predicted from the model. If the model is correct, this
statistic will follow a χ² distribution. However, the statistic usually
indicates significant misfit, even for models that fit well for
practical purposes. The χ² statistic is most useful in the comparison
of nested models. CFI is a measure of the amount of difference between
the examined model and the independence model (i.e., a hypothetical
model where none of the components in the model are related). CFI
values above 0.95 are interpreted as indicating good model fit. RMSEA
is complementary to CFI as it estimates the difference between the
examined model and the saturated model (i.e., a hypothetical model
where every aspect in the model is related to every other aspect in the
model), with lower scores indicating smaller differences. An RMSEA
value below 0.08 is usually interpreted as acceptable fit and a value
below 0.05 as good fit.
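As a rough illustration of how two of these indices relate to the χ² statistic, the sketch below computes CFI and RMSEA from the model and independence-model χ² values using their common textbook definitions. It is a minimal sketch, not the Mplus implementation: the function names and the input values are hypothetical, and software packages (particularly under WLSMV estimation) may use adjusted χ² values or N rather than N − 1 in the RMSEA denominator.

```python
import math

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative Fit Index from model and independence-model chi-square."""
    d_model = max(chi2_model - df_model, 0.0)      # model noncentrality
    d_base = max(chi2_base - df_base, d_model)     # baseline noncentrality
    if d_base == 0.0:
        return 1.0
    return 1.0 - d_model / d_base

def rmsea(chi2_model, df_model, n):
    """Root Mean Square Error of Approximation for sample size n."""
    return math.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n - 1)))

# Hypothetical values chosen only to illustrate the conventional cutoffs.
print(round(cfi(150.0, 100, 1100.0, 100), 3))
print(round(rmsea(150.0, 100, 501), 3))
```

With these illustrative inputs the model would sit exactly at the 0.95 CFI threshold while falling comfortably under the 0.05 RMSEA threshold, which shows how the indices can point in slightly different directions for the same χ².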
An initial two-factor CFA was conducted on the 104
items using all available data (n = 1,714). The two-
factor model did not provide adequate fit to the data
based on common fit indices.
The data were then randomly divided into halves.
One half was “saved” as a verification dataset. An EFA
was conducted on the other half of the data (n = 857). The
number of factors retained was based on examination
of the scree plot, root mean square residuals (RMSR),
the residual correlation matrix, simple structure, par-
allel analysis, and review by content experts. Factors
were rotated both orthogonally (Varimax) and ob-
liquely (Promax). Based on the results, a 9-factor
solution was judged to best represent the data. Using
this solution, a 23-item bank was identified to mea-
sure physical function (a component of the larger
physical health construct). A 17-item bank was judged
to measure a component of mental health. At this
point in the process, the research team was unable to
define the mental health construct more precisely.
Additional factors included symptom expression, so-
cial support, role function, coping, sleep disturbance,
and communication in personal relationships regard-
To evaluate fit to a unidimensional model and
identify items that, if eliminated, might improve fit,
two single-factor CFAs were performed separately on
the physical function and mental health item banks
using the reserved random half of the data. Fit was
poor both for the physical function (CFI = 0.89,
RMSEA = 0.15, SRMR = 0.08) and mental health
items (CFI = 0.75, RMSEA = 0.20, SRMR = 0.08).
Both solutions had a number of residuals greater than
0.10. For the mental health bank, we decided to
focus on the narrower “general distress” construct
rather than a general mental health construct. Fifteen
items were identified for this item pool. The fit of a
one-factor solution for these items was improved
compared to the previously identified mental health
bank but still failed to meet conventional fit stan-
dards. Reducing the pool even further to a 10-item
bank also improved fit (CFI = 0.90, RMSEA = 0.16,
SRMR = 0.06), but again, not enough to meet con-
ventional fit standards.
The findings demonstrated the difficulty of devel-
oping comprehensive yet unidimensional item banks
for measuring broadly defined domains such as
physical and mental health. Our efforts to achieve
unidimensionality made it necessary to narrow the
constructs. Physical health was redefined as physical
function, and mental health was redefined as general
distress. Even after this adjustment, fit to a
unidimensional model was not optimal. There are several
possible implications of these findings:
1. It is preferable to start with more narrowly defined concepts than
   physical and mental health. Although redefinition of the construct
   was informed by theory as well as by our exploratory results,
   starting with a more conceptually rigorous model of health outcomes
   could have improved the process.
2. Trying to build item banks by combining several short-form
   instruments is inferior to building a large developmental item bank
   using careful content specification of the domain. The different
   short forms used in this study may have been both too similar and
   too narrow. On one hand, they included items that were too alike in
   wording, causing local dependencies (as indicated by high residual
   correlations). On the other hand, they included too few items to
   assure comprehensive coverage of each domain.
3. By combining data from different disease groups, the standard
   factor-analytic assumptions of latent multivariate normality may be
   violated. If this is the case, a multigroup analysis could be more
   appropriate. The polychoric correlation factor analyses assume that
   the response categories for each item have a strict rank order. If
   that assumption is not fulfilled for some items, the correlations
   may be biased.
The central concern in IRT is the relationship between the trait being
measured and the probabilities of endorsing each of the item’s response
categories. An item characteristic curve (ICC) relates the probability
of an item response to the trait being measured (θ). The ICCs for each
item are defined by two types of item parameters. Threshold (also
called location) parameters indicate at what level of the underlying
attribute a certain response category is most likely to be endorsed.
Discrimination (also called slope) parameters reflect the item’s
ability to discriminate among respondents at different levels of the
trait.
Comparison of IRT models
The 15-item general distress pool (GD-15) and the 23-item physical
function pool (PF-23) were calibrated using three IRT models: Masters’
partial-credit model (PCM), Muraki’s generalized partial-credit model
(GPCM), and Samejima’s graded-response model (GRM). The GPCM and the
GRM include a discrimination parameter, while the PCM does not. For
GPCM and GRM calibrations, the marginal maximum likelihood estimation
method implemented in Parscale was used. Items were calibrated to the
PCM using two software programs that used different estimation
methods: Winsteps uses joint maximum likelihood, and OPLM uses
conditional maximum likelihood. Results from these two programs were
compared and found to be similar for the item pools in this study. In
the Winsteps PCM calibration, the step difficulty estimates for the
PF-23 items had a range of 7.0 logits. The mean difference between
OPLM and Winsteps step difficulty estimates was 0.083; 2 of 51
estimates were different by more than 0.2 logits. The GD-15 step
difficulty estimates had a range of 3.6 logits. The mean difference
between the OPLM and Winsteps step difficulty estimates was 0.10, with
3 of 61 estimates being different by more than 0.2 logits.
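The structural difference between the PCM and the GPCM can be made concrete with a short sketch. The code below computes category probabilities under the GPCM in its usual adjacent-category logit form and treats the PCM as the special case in which the slope is fixed at 1. It is an illustration only: the function names and the step difficulties are hypothetical and are not the calibrated values from this study.

```python
import math

def gpcm_probs(theta, a, steps):
    """Category probabilities for scores 0..len(steps) under the GPCM.

    theta: trait level; a: slope (discrimination); steps: step difficulties.
    """
    cum = [0.0]
    for b in steps:
        cum.append(cum[-1] + a * (theta - b))   # running sum of a*(theta - b_j)
    top = max(cum)                              # subtract max for numerical stability
    expo = [math.exp(c - top) for c in cum]
    total = sum(expo)
    return [e / total for e in expo]

def pcm_probs(theta, steps):
    """Partial-credit model: the GPCM with the slope fixed at 1."""
    return gpcm_probs(theta, 1.0, steps)

def expected_score(theta, a, steps):
    """Expected item score at a given trait level."""
    return sum(k * p for k, p in enumerate(gpcm_probs(theta, a, steps)))

# Hypothetical three-category item with step difficulties of -1 and +1 logits.
for slope in (1.0, 2.0):
    probs = gpcm_probs(0.5, slope, [-1.0, 1.0])
    print(slope, [round(p, 3) for p in probs])
```

Raising the slope sharpens the category probabilities around the step difficulties, which is the extra flexibility the GPCM and GRM have over the PCM.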
Comparison of item fit
Scores obtained using the different IRT models were highly correlated
for both physical function and general distress; all pair-wise
comparisons had a Pearson product-moment correlation of 0.99. However,
there were differences across models with respect to item fit. We
compared the results of several approaches to assessing item fit. A
commonly used approach for evaluating IRT fit categorizes respondents
into 10 groups based on their estimated IRT score (θ) and compares
predicted and observed item responses for each item. The differences
can be summarized in a χ² or a G² statistic. Parscale implements the
G² statistic. However, the G² statistic has problems that stem from the
assumption that the estimated IRT score is indeed the true score.
Simulation studies have found that the G² statistic yields inflated
Type I error rates; that is, it flags too many items as misfitting,
particularly when the scale is brief. Alternative procedures have been
suggested by Stone [25, 26], Glas, and Orlando and Thissen. We
implemented the G² fit test suggested by Stone (G²*) [25, 26]. Finally,
for the PCM model, we compared fit using a third approach, the
Winsteps-generated information-weighted mean
squares (infit statistic). Although the theoretical distribution of
this statistic is not well characterized, Wright and Linacre have
suggested unstandardized mean square infit values over 1.3 as a
criterion for identifying misfitting items. Of the three fit statistics
compared (G², Stone’s G²*, and infit), Stone’s G²* would be considered
the “best-practice” approach.
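The grouping-based approach described above can be sketched as follows. The code sorts respondents by estimated θ, splits them into 10 groups, and accumulates a likelihood-ratio G² from observed and model-expected category counts. This is a simplified sketch under stated assumptions: expected counts are evaluated at each group's mean θ, the function name and arguments are hypothetical, and the details of Parscale's implementation and of Stone's G²* (which accounts for uncertainty in θ) are not reproduced here.

```python
import math

def g2_item_fit(thetas, responses, prob_fn, n_groups=10):
    """Likelihood-ratio G2 for one item.

    thetas: estimated trait scores, one per respondent
    responses: observed category (0, 1, 2, ...) per respondent
    prob_fn: callable mapping theta to a list of category probabilities
    """
    order = sorted(range(len(thetas)), key=lambda i: thetas[i])
    size = len(order) / n_groups
    g2 = 0.0
    for g in range(n_groups):
        members = order[int(round(g * size)):int(round((g + 1) * size))]
        if not members:
            continue
        mean_theta = sum(thetas[i] for i in members) / len(members)
        probs = prob_fn(mean_theta)
        for k, p in enumerate(probs):
            observed = sum(1 for i in members if responses[i] == k)
            expected = len(members) * p
            if observed > 0:                      # empty cells contribute zero
                g2 += 2.0 * observed * math.log(observed / expected)
    return g2

# Hypothetical check: responses that match a flat 50/50 model give G2 = 0.
thetas = [i / 100.0 for i in range(200)]
responses = [i % 2 for i in range(200)]
print(g2_item_fit(thetas, responses, lambda t: [0.5, 0.5]))
```

Because the grouping treats the estimated θ as if it were the true score, a statistic of this form inherits the inflated Type I error problem noted in the text, especially for short scales.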
Table 1 presents the results from the fit analyses. Bolded
probabilities indicate items flagged as misfitting based on G² or G²*.
Probabilities with an asterisk indicate items from the PCM calibration
flagged as misfitting based on infit values. There was substantial
concordance between G² and G²* results, though the G²* statistic proved
somewhat more sensitive to misfit. There was great disparity, however,
between these two methods and the infit criterion. This was
particularly noteworthy for the PF-23 item pool. The G² and G²*
statistics flagged 20 and 21 items, respectively, as misfitting the
PCM. The infit statistic criterion of 1.3 identified only 3 misfitting
items. This result is consistent with the work of Smith and Suh, who
found that, in a demonstration data set, the 1.3 criterion
Table 1 Item-model fit of the general distress and physical
function items calibrated with the generalized partial credit
model (GPCM), the graded response model (GRM) and the
partial credit model (PCM) based on the Parscale-generated G²
statistic and Stone’s G²* statistic
Columns: Item content | Parscale G² statistic (GPCM, GRM, PCM) | Stone G²* statistic (GPCM, GRM, PCM)
15-Item general distress pool
I feel sad.
I feel nervous.
I worry about dying.
able to enjoy life
content with my QOL
frequently feel anxious
been a very nervous person
felt down in the dumps
felt calm and peaceful
been a happy person
23-Item physical function pool
lack of energy
able to work
difficulty bending, lifting
difficulty household chores
difficulty bathing, grooming
trouble w/strenuous activities
trouble w/long walk
trouble w/short walk
stay in bed/chair most of day
help to eat/dress/wash/toilet
limited in work/daily activities
limited in hobbies/leisure
short of breath
lifting or carrying groceries
climb several flights of stairs
climb 1 flight of stairs
bending, kneeling, stooping
walk more than a mile
walk several blocks
walk 1 block
bathing or dressing
* Indicates item from Winsteps partial credit model calibration with infit > 1.3
greatly underestimated the number of items that vio-
lated the invariance property of Rasch models.
Both the items of the PF-23 and the GD-15 pools
had substantial misfit, but the misfit was most notable
for the GD-15 pool. This is likely because general
distress is a more complex construct than physical
functioning. There was little difference between the fit
of the GRM and GPCM, though, for the PF-23 item
pool, G2* results suggested that the GRM had better
fit. There was substantially greater misfit in the PCM
calibration compared to those conducted with the
two-parameter models (GPCM and GRM).
Ideally, misfitting items would be dropped from the
item banks. However, our refined general distress and
physical function item banks had only 15 and 23 items
respectively, and elimination of even a few items would
amount to a proportionately large reduction. Some
level of misfit is expected in item banks, and there are
no generally accepted standards regarding how much
misfit is acceptable. Statistical guidelines are suspect
because of their sensitivity to sample size. These results
highlight the need for: (1) large developmental item
banks; (2) better fit statistics; and (3) research
exploring the practical impact of varying levels of item
misfit.
DIF analysis examines the relationships among item responses, levels of
the trait being measured, and subgroup membership. For a given level of
trait, the probability of endorsing a specified item response should be
independent of subgroup membership. For example, men and women who have
the same levels of physical function should be equally likely to
endorse a specified physical function item category. DIF can be
evaluated using IRT models by examining stability of item parameters
over subgroups (that have been linked to a common metric). Another
approach to DIF testing is logistic regression or, for polytomous
items, ordinal logistic regression (OLR). In the OLR framework, DIF is
identified as a significant effect of subgroup membership on item score
after controlling for the level of the trait (by entering it as a
covariate). The level of the trait can be approximated by the simple
sum of all items or by an estimated IRT trait score.
A distinction is made between uniform and non-uniform
DIF. In IRT analyses, uniform DIF can be
defined as a subgroup difference in thresholds, while
non-uniform DIF refers to a subgroup difference in
slopes. In logistic regression analyses, uniform DIF is a
main effect of subgroup membership on item score,
while non-uniform DIF is an interaction effect of
subgroup membership and trait level on item score.
Reported here are DIF tests of the PF-23 item bank
using IRT and OLR methodologies. For these analy-
ses, gender and race were examined.
IRT-LR analyses of DIF
The IRT-LR analyses were conducted using Samejima’s graded-response
model implemented in the MULTILOG computer program. Likelihood-ratio
tests for DIF were conducted using the program IRTLRDIF. An extended
search strategy was used to obtain “purified” anchor item sets that
were free of DIF and fit the IRT model. This anchor item set was used
in the final DIF analyses.
The magnitude of DIF was estimated based on ex-
pected item scores and area statistics. The impact of
DIF was examined by plotting the probability of
obtaining the range of possible scale scores against
estimates of physical function (test response functions).
Additionally, Raju’s non-compensatory DIF index was
calculated [38, 39].
After a Bonferroni adjustment for multiple comparisons (based on
α = 0.05), six of the PF-23 items evidenced uniform DIF with respect to
race: “trouble with a long walk,” “lack of energy,” “able to work,”
“vigorous activities,” and “walk more than a mile.” The analyses for
gender groups identified four items with uniform DIF: “difficulty with
personal care,” “short of breath,” “lack of energy,” and “problems
lifting or carrying groceries.” Despite the identification
of these items with DIF, the impact of gender and race
DIF on the overall scale was modest. This was evident
in comparisons of test response functions based on
separate calibrations of items in the target subpopula-
tions. Figure 1 is an example of these results. Dis-
played are the test response functions for the
Caucasian and African-American subgroups generated
from the GRM calibration of the PF-23 items. As the
figure shows, the functions are similar.
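The comparison of test response functions can be sketched numerically. Under the graded response model, the expected item score equals the sum of the cumulative probabilities of responding in or above each category, and the test response function is the sum of expected item scores across the bank. The code below is a minimal sketch under stated assumptions: the item parameters for the two groups are hypothetical (not the calibrated PF-23 values), and the function names are illustrative.

```python
import math

def grm_expected_score(theta, a, thresholds):
    """Expected item score under the graded response model.

    With categories scored 0..m, the expected score is the sum of
    P(X >= k | theta) over k = 1..m.
    """
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

def scale_response_function(theta, items):
    """Expected scale score: sum of expected item scores at one trait level."""
    return sum(grm_expected_score(theta, a, bs) for a, bs in items)

# Hypothetical (slope, thresholds) pairs from separate group calibrations.
group_a = [(1.5, [-1.0, 1.0]), (2.0, [-0.5, 0.5])]
group_b = [(1.5, [-0.8, 1.2]), (2.0, [-0.5, 0.5])]  # first item shows uniform DIF

for theta in (-2.0, 0.0, 2.0):
    gap = scale_response_function(theta, group_a) - scale_response_function(theta, group_b)
    print(f"theta={theta:+.1f}  difference in expected scale score={gap:.3f}")
```

Item-level threshold shifts of this kind can be statistically detectable while still producing only small differences in the expected scale score, which is the sense in which the impact of DIF on the overall scale can be modest.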
OLR analyses of DIF
In the OLR analyses, both statistical significance and
magnitude of DIF (uniform only) were evaluated. The
magnitude of uniform DIF was investigated by
examining the effect of including a term for subgroup
membership on the coefficient associated with the
overall ability or trait level. If including a term for
subgroup membership changed this coefficient by
more than 10%, we determined that there was uni-
form DIF. Statistical significance for uniform DIF was
determined by examining the P value of the subgroup
membership term; P values < 0.05 were considered
statistically significant. Nonuniform DIF was identified
if the P value for the interaction term was < 0.05. Two
software programs were used for the analyses: Stata
and DIFdetect.
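The coefficient-change criterion for uniform DIF can be illustrated with a small sketch. For simplicity it uses binary logistic regression on a dichotomous item rather than the ordinal model used in the study, and it fits the models with a hand-written Newton-Raphson routine instead of Stata or DIFdetect. All function names are hypothetical, and the 10% threshold is taken from the description above.

```python
import math

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_logistic(rows, y, iters=50):
    """Maximum likelihood logistic regression; an intercept is added here."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    hess[i][j] += w * xi[i] * xi[j]
        step = solve(hess, grad)                  # Newton-Raphson update
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    return beta

def uniform_dif_change(trait, group, y):
    """Relative change in the trait coefficient when a group term is added."""
    b1 = fit_logistic([[t] for t in trait], y)[1]
    b2 = fit_logistic([[t, g] for t, g in zip(trait, group)], y)[1]
    return abs(b2 - b1) / abs(b1)

def flags_uniform_dif(trait, group, y, threshold=0.10):
    """Apply the 10% coefficient-change criterion."""
    return uniform_dif_change(trait, group, y) > threshold
```

A call such as `flags_uniform_dif(trait_scores, group_codes, item_responses)` would return `True` only when adding the subgroup term shifts the trait coefficient by more than 10%, which is a magnitude criterion and therefore far less sensitive to sample size than the P value of the subgroup term.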
The statistical significance criterion defined more items with DIF than
did the parameter change criterion. Using the statistical criterion,
eleven of the PF-23 items had DIF with respect to race (8 uniform and 3
non-uniform with no adjustment for multiple testing). Tests for gender
DIF identified 17 DIF items (15 uniform and 2 non-uniform). The
criterion based on relative change in difficulty estimates, however,
identified only one DIF item (non-uniform) with respect to race and
none with respect to gender. Interestingly, the one item identified in
the evaluation of DIF by race, “walk one block,” also was identified by
the IRT-LR method before, but not after, the Bonferroni adjustment.
Simulation of computerized adaptive administration
A computerized adaptive test (CAT) is a computer-
administered test (measure) in which presentation of
items after the initial one is based on previous item
responses. We simulated a CAT administration of the
GD-15 item pool. CAT requires the selection of a
‘‘stopping’’ rule. It is common to employ a stopping
rule based on either number of items administered
(fixed length) or level of measurement precision
desired (variable length). We simulated one fixed-length condition
(number of items = 7; CAT# items = 7) and one variable-length
condition (standard error = 0.5; CATSE = 0.5). Two IRT models were compared:
the GPCM and the PCM. Scales were transformed to a
common score range to ensure the standard error
stopping rule was equivalent across models.
A limitation of the simulation program used was its
inability to accommodate missing item responses or
responses for all items of the GD-15 item bank were
collapsed into 3 response categories. Only participants
who had no missing responses to any items were
included in the simulations (n = 1,432). The computer
algorithm used a maximum information criterion.
Once an estimate of the simulee’s theta level was obtained, the item
with the most information at estimated θ was administered. If persons
scored in the highest or lowest category for all items administered in
the simulation, a θ score was assigned (+4.00 or –4.00, respectively).
On this basis, there were 147 simulees assigned scores in the PCM
calibration and 146 in the GPCM calibration.
For both models and both conditions, correlations
with full bank scores were high. For the CATSE = 0.5
condition, correlations were 0.98 and 0.96, respectively,
for the PCM and the GPCM. For the CAT# items = 7
condition, correlations were 0.94 and 0.96, respectively,
for the PCM and the GPCM. The advantage of the
PCM in the CATSE = 0.5 condition came at some cost,
however. On average, the GPCM-based CAT reached
Fig. 1 Test characteristic functions for the 23-item physical function pool comparing Caucasian and African-American subgroups (x-axis: Level of Physical Function (theta); y-axis: Expected Scale Score)
the stopping rule after 8 items (range: 5–15), while the
PCM-based CAT reached the stopping rule after an
average of 11 items (range: 8–15).
The results from the CATSE = 0.5 simulation exposed weaknesses in the
GD-15 item bank. Theoretically, the selection of a SE-based stopping
rule should result in equally precise measurement for persons at all
levels of general distress, but such a result requires sufficient items
that measure across the trait continuum. Because the GD-15 item bank
was small and few items targeted high and low levels of distress, all
15 items of the bank were administered to many of the simulees in the
CATSE = 0.5 condition, and the stopping rule was never reached. Equal
measurement precision was obtained only in the middle range of General
Distress scores. The results highlight the importance of large item
banks that cover the range of the trait being measured.
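The maximum-information selection rule and the two stopping rules described above can be sketched in a few lines. The code below is a minimal illustration, not the simulation program used in the study: item parameters are hypothetical, the GPCM is used for item information, and θ is scored by expected a posteriori (EAP) estimation on a grid with a standard normal prior, which is an assumption made here so that extreme response patterns still yield finite scores.

```python
import math

def gpcm_probs(theta, a, steps):
    """GPCM category probabilities for scores 0..len(steps)."""
    cum = [0.0]
    for b in steps:
        cum.append(cum[-1] + a * (theta - b))
    top = max(cum)
    expo = [math.exp(c - top) for c in cum]
    total = sum(expo)
    return [e / total for e in expo]

def item_information(theta, a, steps):
    """GPCM item information: squared slope times the score variance."""
    probs = gpcm_probs(theta, a, steps)
    mean = sum(k * p for k, p in enumerate(probs))
    second = sum(k * k * p for k, p in enumerate(probs))
    return a * a * (second - mean * mean)

GRID = [g / 10.0 for g in range(-40, 41)]  # theta grid from -4 to +4

def eap(bank, answered):
    """Posterior mean and SD of theta (assumed standard normal prior)."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for idx, x in answered:
            a, steps = bank[idx]
            w *= gpcm_probs(t, a, steps)[x]
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def simulate_cat(bank, responses, max_items=None, se_target=None):
    """Administer items by maximum information until a stopping rule is met.

    responses holds the simulee's recorded answer to every item in the bank.
    """
    theta, se = 0.0, float("inf")
    answered = []
    remaining = set(range(len(bank)))
    limit = len(bank) if max_items is None else min(max_items, len(bank))
    while remaining and len(answered) < limit:
        nxt = max(sorted(remaining), key=lambda i: item_information(theta, *bank[i]))
        remaining.remove(nxt)
        answered.append((nxt, responses[nxt]))
        theta, se = eap(bank, answered)
        if se_target is not None and se <= se_target:
            break
    return theta, se, [i for i, _ in answered]

# Hypothetical three-item bank: (slope, step difficulties).
bank = [(1.0, [-2.0, -1.0]), (2.0, [-0.5, 0.5]), (1.0, [1.0, 2.0])]
print(simulate_cat(bank, [2, 1, 0], max_items=2))
```

With a bank this small, a strict standard-error target cannot be met and every item is administered, which mirrors the behavior reported for many simulees in the CATSE = 0.5 condition.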
A number of lessons were learned, or, more precisely,
relearned. One was the importance of good develop-
mental item banks. After initial review, a total of 104
items were identified. However, because these were not
developed to complement each other and cover a
carefully specified domain, analyses of dimensionality
reduced our Physical Function and General Distress
banks to only 23 and 15 items, respectively. Even then,
these banks failed to exhibit optimal fit to a unidi-
mensional model. As already noted, the limits of the
item banks proved problematic in the CAT simulations.
A second lesson was that evaluations of model fit
may yield ambiguous results. Questions regarding
model fit often are phrased as if they were ‘‘yes/no’’
questions: ‘‘Do the data fit a unidimensional model?’’
‘‘Do the data fit an IRT model?’’ Of course, the
assumptions of all statistical and measurement models
are violated to some degree. Box and Draper observed
that “... all models are wrong; the practical question is
how wrong do they have to be to not be useful”. In
the application of IRT models to the assessment of
HRQOL, too little is known about how wrong models
can be before they lose their usefulness.
A third lesson concerned the imbalance between
technical knowledge and the practical implications of
the findings. This became clear during attempts to
interpret the fit statistics obtained from the factor
analyses and IRT calibrations. Needed is research
regarding how, when, and to what degree model misfit
impacts validity. Also noted is that the IRT-LR and the
OLR approaches to DIF identified different items and
different numbers of items as having DIF, and the
criterion employed greatly affected the results. These
findings highlight the need for research regarding cri-
teria and guidelines appropriate for DIF detection in
the context of health-related items.
A fourth lesson related to the importance of con-
struct definition. The most fundamental aspect of
effective measurement is a clear definition of what is to
be measured. When we attempted to build banks for
broad constructs, significant multidimensionality was
influenced which constructs we targeted. The identifi-
cation of this work as a ‘‘demonstration project’’ may
have justified this approach, but it may be common
that statistical and psychometric analyses become the
arbiters not just of how, but of what is measured.
Although some theoretical and empirical work has been
performed to better define health outcomes (e.g., the
Medical Outcomes Study), much more is needed.
Additionally, creative strategies for identifying, defin-
ing, operationalizing, and measuring constructs that
are relevant to patients’ HRQOL are needed.
In light of the difficulties of fitting unidimensional IRT models to
the data available, multidimensional IRT models could be an appealing
alternative. In these models, items are allowed to load on more than
one factor, and score estimation utilizes correlations between factors.
Multidimensional IRT models have been used successfully to model health
data, and pilot versions of multidimensional CAT programs have achieved
promising results. However, since correlations between factors
are part of the multidimensional measurement model,
it is critical to assess their stability across clinical and
demographic subgroups. Also, the interpretation of
scores should be evaluated carefully. In multidimen-
sional IRT, the score on any scale is affected by all item
responses, including responses to conceptually unre-
lated items. Thus, the relation between item score and
scale score lacks transparency, which may detract from
face validity and cause skepticism among applied
researchers. If the overall purpose of the analyses is the
calculation of a composite that weights together the
different subscores, the concerns noted above become
less of an issue. However, if the primary interest is in
subdomain scores, a carefully developed unidimen-
sional item bank may be preferable.
Study supported by NIH/NCI (Y1-PC-3028-01) and NIH R01 (CA60068).
Additional salary support provided by National Institute of Arthritis
and Musculoskeletal and Skin Diseases (1U01AR52171-01).
Qual Life Res
Appendix 2 Items included in factor analytic assessment of item bank(s) and unidimensionality
Scale Item No. Item content Physical
3 During a typical day does your health limit you in vigorous activities,
such as running, lifting heavy objects, participating in strenuous
During a typical day does your health limit you in moderate
activities, such as moving a table, pushing a vacuum cleaner,
bowling or playing golf?
During a typical day does your health limit you in lifting or carrying
During a typical day does your health limit you in climbing several
flights of stairs?
During a typical day does your health limit you in climbing one flight
During a typical day does your health limit you in bending, kneeling,
During a typical day does your health limit you in walk more than a
During a typical day does your health limit you in walking several
During a typical day does your health limit you in walking one
During a typical day does your health limit you in bathing or
During the past 4 weeks, have you decreased work or other activities
as a result of your physical health?
During the past 4 weeks, have you accomplished less than you would
like as a result of your physical health?
During the past 4 weeks, have you limited in work & other activities
as a result of your physical health?
During the past 4 weeks, have you difficulty with work or other
activities as a result of your physical health?
During the past 4 weeks, have you cut down on time at work or
other activities as a result of emotional problems (such as feeling
depressed or anxious)?
During the past 4 weeks, have you accomplished less than would like
as a result of emotional problems (such as feeling depressed or
During the past 4 weeks, did you not do work or other activities as
carefully as a result of emotional problems (such as feeling
During the past 4 weeks, to what extent has your physical health or
emotional problems interfered with normal social activities with
family, friends, neighbors, or groups?
How much bodily pain have you had during the past 4 weeks?
During the past 4 weeks, how much did pain interfere with your
work (including both work outside the home and housework)?
Have you been a very nervous person?
Have you felt so down in the dumps nothing could cheer you up?
Have you felt calm and peaceful?
Have you felt downhearted and blue?
Have you been a happy person?
During the past 4 weeks, how much of the time has your physical
health or emotional problems interfered with your social activities
(like visiting with friends, relatives, etc.)?
Core Quality of
1 Do you have trouble with strenuous activities, like carrying a heavy
shopping bag or a suitcase?
Do you have any trouble taking a long walk?
Do you have any trouble taking a short walk outside of the house?
Do you have to stay in a bed or chair for most of the day?
Do you need help with eating, dressing, washing yourself or using the
Were you limited in doing either your work or other daily activities?
Were you limited in pursuing your hobbies or other leisure time
Were you short of breath?
Have you had pain?
Did you need to rest?
Have you had trouble sleeping?
Have you felt weak?
Have you lacked appetite?
Have you felt nauseated?
Have you vomited?
Have you been constipated?
Have you had diarrhea?
Were you tired?
Did pain interfere with your daily activities?
Have you had difficulty in concentrating on things, like reading a
newspaper or watching television?
Did you feel tense?
Did you worry?
Did you feel irritable?
Did you feel depressed?
Have you had difficulty remembering things?
Has your physical condition or medical treatment interfered with your
Has your physical condition or medical treatment interfered with your
I have difficulty bending or lifting.
I do not have the energy I used to have.
I have difficulty doing household chores
I have difficulty bathing, brushing teeth, or grooming myself.
I have difficulty planning activities because of the cancer or
I cannot gain weight
I find food unappealing
I find that the cancer or its treatments keep me from working
I frequently have pain
I find that my clothes do not fit.
I am uncomfortable with the changes in my body.
I frequently feel anxious.
I have difficulty sleeping.
I have difficulty concentrating.
I have difficulty asking friends or relatives to do things for me.
I have difficulty telling my friends or relatives about the cancer.
I find that my friends or relatives tell me I’m looking well
when I am not.
I find that my friends or relatives do not visit often enough.
I find that friends or relatives have difficulty talking with me about my
I become nervous when I am waiting to see the doctor.
I become nervous when I get my blood drawn.
System Short Form
1. Chang, C.-H., & Cella, D. (1997). Equating health-related
quality of life instruments in applied oncology settings.
Physical Medicine and Rehabilitation: States of the Art
Reviews, 11, 397–406.
2. Ganz, P. A., Schag, C. A., Lee, J. J., & Sim, M. S. (1992). The
CARES: A generic measure of health-related quality of life
for patients with cancer. Quality of Life Research, 1, 19–29.
3. Schag, C. A., Ganz, P. A., & Heinrich, R. L. (1991). CAncer
Rehabilitation Evaluation System-short form (CARES-SF).
A cancer specific rehabilitation and quality of life instru-
ment. Cancer, 68, 1406–1413.
4. Aaronson, N. K., Ahmedzai, S., Bergman, B., Bullinger, M.,
Cull, A., Duez, N. J., Filiberti, A., Flechtner, H., Fleishman,
S. B., & de Haes, J. C. (1993). The European organization for
research and treatment of cancer QLQ-C30: A quality-of-life
instrument for use in international clinical trials in oncology.
Journal of the National Cancer Institute, 85, 365–376.
5. Cella, D. F., & Bonomi, A. E. (1995). Measuring quality of
life: 1995 update. Oncology (Williston Park), 9, 47–60.
6. Cella, D. F., Tulsky, D. S., Gray, G., Sarafian, B., Linn, E.,
Bonomi, A., Silberman, M., Yellen, S. B., Winicour, P.,
Brannon, J., et al. (1993). The Functional Assessment of
Cancer Therapy Scale: Development and validation of the
general measure. Journal of Clinical Oncology, 11, 570–579.
7. Hays, R. D., Sherbourne, C. D., & Mazel, R. M. (1993). The
RAND 36-Item Health Survey 1.0. Health Economics, 2,
8. Ware, J. E., Jr., & Sherbourne, C. D. (1992). The MOS 36-
item short-form health survey (SF-36). I. Conceptual
framework and item selection. Medical Care, 30, 473–483.
9. Nandakumar, R. (1991). Traditional dimensionality versus
essential dimensionality. Journal of Educational Measurement,
28, 99–117.
Appendix 2 continued
Scale Item No. Item content Physical
26 I worry about when the
cancer is progressing.
I worry about not being
able to care for myself.
I do not feel sexually
I am not interested in
Functional Assessment of
Cancer Therapy (FACT)
I have a lack of energy
I have nausea.
Because of my physical condition, I have
trouble meeting the needs of my family.
I have pain.
I am bothered by side effects of treatment.
I feel sick.
I am forced to spend time in bed.
I feel distant from my friends.
I get emotional support from my family.
I get support from my friends and neighbors.
My family has accepted my illness.
Family communication about my illness is
I feel close to my partner (or the person who is
my main support).
I feel sad.
I am losing hope in the fight against my illness.
I feel nervous.
I worry about dying.
I worry that my condition will get worse.
I am able to work.
My work (including work in home) is fulfilling.
I am able to enjoy life.
I have accepted my illness.
I am sleeping well.
I am enjoying the things I usually do for fun.
I am content with the quality of my life right
10. Smith, E. V., Jr. (2002). Detecting and evaluating the impact
of multidimensionality using item fit statistics and principal
component analysis of residuals. Journal of Applied Mea-
surement, 3, 205–231.
11. Muthen, B. O., & Muthen, L. K. (2001). Mplus User’s Guide.
Version 2. Los Angeles, CA: Muthen & Muthen.
12. Hu, L., & Bentler, P. M. (1995). Evaluating model fit. In: R.
H. Hoyle (Ed.), Structural equation modeling: concepts,
issues and applications (pp. 76–79). Thousand Oaks, CA:
13. Bentler, P. (1990). Comparative fit indices in structural
models. Psychological Bulletin, 107, 238–246.
14. Browne, M. W., & Cudeck, R. (1993). Alternative ways of
assessing model fit. In: K. A. Bollen, & J. S. Long (Eds.),
Testing structural equation models. Newbury Park, CA: Sage
15. Kline, R. B. (1998). Principles and practice of structural
equation modeling. New York, NY: The Guilford Press.
16. McDonald, R. P. (1999). Test theory: A unified treatment.
Mahwah, NJ: Lawrence Erlbaum.
17. Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance
structure modeling: Sensitivity to underparameterized model
misspecification. Psychological Methods, 3, 424–453.
18. Masters, G. N. (1982). A Rasch model for partial credit
scoring. Psychometrika, 47, 149–173.
19. Muraki, E. (1992). A generalized partial credit model:
Application of an EM-algorithm. Applied Psychological
Measurement, 16, 159.
20. Samejima, F. (1969). Estimation of latent ability using a re-
sponse pattern of graded scores. Psychometrika Monograph
Supplement, No. 17.
21. Muraki, E., & Bock, R. D. (1997). PARSCALE 3: IRT based
test scoring and item analysis for graded items and rating
scales. Chicago, IL: Scientific Software International, Inc.
22. Linacre, J. M. (2002). WINSTEPS: Rasch-model computer
program. Version 3.36. Chicago: MESA Press.
23. Verhelst, N. D., & Glas, C. A. W. (1995). The one parameter-
logistic model. New York: Springer-Verlag.
24. Stone, C. A., & Zhang, B. (2003). Assessing goodness of fit of
item response theory models: A comparison of traditional
and alternative procedures. Journal of Educational Mea-
surement, 4, 331–352.
25. Stone, C. A. (2000). Monte Carlo based null distribution for
an alternative goodness-of-fit test statistic in IRT models.
Journal of Educational Measurement, 37(1), 58–75.
26. Stone, C. A. (2003). Empirical power and type I error rates
for an IRT fit statistic that considers the precision of ability
estimates. Educational and Psychological Measurement, 63,
27. Glas, C. A. W. (1999). Modification indices for the 2-PL and
the nominal response model. Psychometrika, 64, 273–294.
28. Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit
indices for dichotomous item response theory models.
Applied Psychological Measurement, 24, 50–64.
29. Wright, B. D., & Masters, G. N. (1982). Rating scale analysis.
Chicago: Mesa Press.
30. Wright, B. D. (1994). Reasonable mean-square fit. Rasch
Measurement Transactions, 8, 370.
31. Smith, R. M., & Suh, K. K. (2003). Rasch fit statistics as a test
of the invariance of item parameter estimates. Journal of
Applied Measurement, 4, 153–163.
32. Groenvold, M., Bjorner, J. B., Klee, M. C., & Kreiner, S.
(1995). Test for item bias in a quality of life questionnaire.
Journal of Clinical Epidemiology, 48, 805–816.
33. Swaminathan, H., & Rogers, H. J. (1990). Detecting differ-
ential item functioning using logistic regression procedures.
Journal of Educational Measurement, 27, 361–370.
34. Zumbo, B. D. (1999). A handbook on the theory and
methods of differential item functioning (DIF): Logistic
regression modeling as a unitary framework for binary and
Likert-type (ordinal) item scores. Ottawa, Canada: Direc-
torate of Human Resources Research and Evaluation,
Department of National Defense.
35. Camilli, G., & Shepard, L. A. (1994). Methods for identifying
biased test items. Thousand Oaks, CA: Sage Publishers.
36. Thissen, D. (1991). MULTILOG user’s guide: Multiple,
categorical item analysis and test scoring using item response
theory. Chicago, IL: Scientific Software Inc.
37. Thissen, D. (2001). IRTLRDIF: Software for the computa-
tion of the statistics involved in item response theory likeli-
hood-ratio tests for differential item functioning. Version
38. Collins, W. C., Raju, N. S., & Edwards, J. E. (2000).
Assessing differential functioning in a satisfaction scale.
Journal of Applied Psychology, 85, 451–461.
39. Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-
based internal measures of differential functioning of items
and tests. Applied Psychological Measurement, 19, 353–368.
40. STATA. (2004). College Station, TX: StataCorp LP
41. Crane, P. K., Jolley, L., & van Belle, G. (2003). DIFdetect.
Seattle, WA: University of Washington.
42. Box, G., & Draper, N. (1987). Empirical model building and
response surfaces. New York: John Wiley and Sons.
43. Stewart, A. L., & Ware, J. E., Jr. (1992). Measuring func-
tioning and well-being: The Medical Outcomes Study Ap-
proach. London: Duke University Press.
44. Gardner, W., Kelleher, K. J., & Pajer, K. A. (2002). Multi-
dimensional adaptive testing for mental health problems in
primary care. Medical Care, 40, 812–823.
45. Petersen, M. A., Groenvold, M., Aaronson, N., Fayers, P.,
Sprangers, M., & Bjorner, J. B. (2006). Multidimensional
computerized adaptive testing of the EORTC QLQ-C30:
Basic developments and evaluations. Quality of Life
Research, 15, 315–329.