Practical Issues in the Application of Item Response Theory
A Demonstration Using Items From the Pediatric Quality of Life
Inventory (PedsQL) 4.0 Generic Core Scales
Cheryl D. Hill, PhD,* Michael C. Edwards, PhD,† David Thissen, PhD,‡ Michelle M. Langer, MA,‡
R. J. Wirth, MA,‡ Tasha M. Burwinkle, PhD,§ and James W. Varni, PhD¶∥
Background: Item response theory (IRT) is increasingly being
applied to health-related quality of life instrument development and
refinement. This article discusses results obtained using categorical
confirmatory factor analysis (CCFA) to check IRT model assumptions and the application of IRT in item analysis and scale evaluation.
Objectives: To demonstrate the value of CCFA and IRT in examining a health-related quality of life measure in children and adolescents.
Methods: This illustration uses data from 10,241 children and their
parents on items from the 4 subscales of the PedsQL 4.0 Generic
Core Scales. CCFA was applied to confirm domain dimensionality
and identify possible locally dependent items. IRT was used to
assess the strength of the relationship between the items and the
constructs of interest and the information available across the latent continuum.
Results: CCFA showed generally strong support for 1-factor models
for each domain; however, several items exhibited evidence of local
dependence. IRT revealed that the items generally exhibit favorable
characteristics and are related to the same construct within a given
domain. We discuss the lessons that can be learned by comparing
alternate forms of the same scale, and we assess the potential impact
of local dependence on the item parameter estimates.
Conclusions: This article describes CCFA methods for checking
IRT model assumptions and provides suggestions for using these
methods in practice. It offers insight into ways information gained
through IRT can be applied to evaluate items and aid in scale refinement.
Key Words: IRT, factor analysis, instrument development
(Med Care 2007;45: S39–S47)
The Patient-Reported Outcomes Measurement Information System (PROMIS) project aims to assemble health-related quality of life (HRQoL) item banks for developing both adaptive (ie, computerized adaptive testing [CAT]) and non-adaptive (ie, linear) patient-reported outcomes instruments.1
This process of item banking and test assembly relies heavily
on item response theory (IRT) to assess the properties of the
candidate items that inform the assignment of items to do-
main banks and the selection of appropriate items for instru-
ments. As with any model, the use of IRT implies a number
of assumptions about the data.2 This article discusses methods that can be used to check 2 primary assumptions of many
IRT models, unidimensionality and local independence. In
presenting these methods, we work through an example using
data on items from an existing HRQoL instrument; these
items were considered for inclusion in the PROMIS item
bank and were also used to inform the development of new
items for use with PROMIS.
IRT models describe the probability of observing a
particular pattern of responses given the respondent’s level on
the underlying construct (θ). With the 2-parameter logistic
(2PL) model, which is appropriate for items measured in 2
response categories (eg, yes/no, true/false), this probability is
modeled using a slope parameter (ai) and a location param-
eter (bi) for each item i. The slope parameter measures the
strength of the relationship between the item and the under-
lying construct; higher slopes mean that the item can discrim-
inate more sharply between respondents above and below
some level on the latent continuum. For dichotomous items,
the location parameter is the point along the latent continuum
at which the item is most discriminating or informative; a
respondent whose level on the underlying construct is at this
location has a 50% chance of endorsing the item. In fields
such as educational measurement, the location parameter is
known as the difficulty parameter, where higher values are
associated with more difficult items (ie, the respondent must
be higher on the latent trait to provide a correct response).
From the *RTI Health Solutions, Research Triangle Park, North Carolina;
†Department of Psychology, The Ohio State University, Columbus;
‡Department of Psychology, University of North Carolina, Chapel Hill;
§Department of Pediatrics, Texas A&M University College of Medicine,
Temple; ¶Department of Pediatrics, College of Medicine; and ∥Department of Landscape Architecture and Urban Planning, College of Architecture, Texas A&M University, College Station.
Supported by National Institutes of Health Grant 1U01AR052181-01.
Presented at the annual meeting of the International Society for Quality of
Life Research on October 20, 2005 in San Francisco, CA.
Reprints: Cheryl D. Hill, PhD, RTI Health Solutions, 200 Park Offices Drive,
P.O. Box 12194, Research Triangle Park, NC 27709-2194. E-mail:
Copyright © 2007 by Lippincott Williams & Wilkins
Medical Care • Volume 45, Number 5 Suppl 1, May 2007
The probability of endorsing an item is described by a
function of these item parameters called a trace line, or item
characteristic curve, which takes the form for the 2PL model
$$T(u_i = 1 \mid \theta) = \frac{1}{1 + \exp[-a_i(\theta - b_i)]}, \qquad (1)$$
where u_i = 1 refers to a positive response to item i.3
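As a concrete illustration (not part of the original article), the 2PL trace line in Equation 1 can be computed in a few lines of Python; the parameter values below are arbitrary:

```python
import numpy as np

def trace_2pl(theta, a, b):
    """2PL trace line: probability of a positive response given theta,
    with slope a (discrimination) and location b (difficulty)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# When theta equals the location parameter b, the endorsement
# probability is exactly 0.5, as described in the text.
print(trace_2pl(0.0, a=1.5, b=0.0))  # 0.5
```

Note how a larger slope makes the curve steeper around b, so the item discriminates more sharply between respondents just above and just below that point.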
An alternative model often used in health outcomes
research is Samejima’s graded response model (GRM),4,5
which generalizes the 2PL model to include multiple b_ij parameters per item (j from 1 to m − 1) to correspond to m
response categories (eg, items with the response scale
“Strongly Disagree,” “Disagree,” “Neutral,” “Agree,” and
“Strongly Agree”). The formula for a GRM trace line is:
$$T(u_i = j \mid \theta) = \frac{1}{1 + \exp[-a_i(\theta - b_{ij})]} - \frac{1}{1 + \exp[-a_i(\theta - b_{i,j+1})]}, \qquad (2)$$
which states that the probability of responding in category j is
the difference between a 2PL trace line for the probability of
responding in category j or higher and a 2PL trace line for the
probability of responding in category j + 1 or higher. In the
case of the GRM, a respondent with an underlying construct
value of b_ij has an equal probability of choosing category j or lower and category j + 1 or higher. These trace lines can be
plotted as the probability of endorsement along the contin-
uum of the latent trait to provide a visual representation of
location and discrimination. An expected score plot is an
alternative to a trace line plot that collapses the lines for each
category into 1 trajectory, showing the expected response
score across the latent trait. Trace lines can also be used to
calculate information curves that display the amount of information an item provides along the continuum of the latent trait.
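A minimal sketch of how GRM category trace lines (Equation 2) and an expected score curve can be computed; the slope and threshold values below are hypothetical, and the thresholds must be ordered:

```python
import numpy as np

def grm_trace(theta, a, b):
    """GRM category trace lines for one item.
    b: ordered thresholds b_i1 < ... < b_i,m-1 for m response categories.
    Returns an array of shape (m, len(theta))."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    # Cumulative curves P(response >= j): P(>= lowest) = 1, P(> highest) = 0
    cum = [np.ones_like(theta)]
    for bj in b:
        cum.append(1.0 / (1.0 + np.exp(-a * (theta - bj))))
    cum.append(np.zeros_like(theta))
    # Category probability = difference of adjacent cumulative curves
    return np.array([cum[j] - cum[j + 1] for j in range(len(b) + 1)])

def expected_score(theta, a, b):
    """Expected response score: category index weighted by probability.
    Collapses the category trace lines into the single trajectory
    shown in an expected score plot."""
    probs = grm_trace(theta, a, b)
    scores = np.arange(probs.shape[0])[:, None]
    return (scores * probs).sum(axis=0)
```

Plotting each row of `grm_trace` against theta gives the trace line plot; plotting `expected_score` gives the expected score plot described above.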
In health outcomes research, items are often scored so
that higher scores indicate that the respondent is higher on the
scale of the latent construct, or that the individual possesses
more of the trait that the items are designed to measure. For
example, a scale designed to assess quality of life would be
scored so that higher scores correspond with higher quality of
life. This would also mean that categories with larger b_ij parameters would be more likely to be endorsed by respondents with better quality of life than those with worse quality of life.
Because of the way IRT models combine information
across items, 2 primary data requirements must be met. First,
the scale must be unidimensional, that is, the pattern of item
responses is best described by 1 dominant construct. When
items that are related to multiple underlying constructs are
forced to provide information for 1 construct alone, it is
difficult to determine what construct is being represented in
the ensuing scale score. Second, the items must be locally
independent, which means that the probabilities of each item
response are related only through the value of the latent
variable. That is, after accounting for the respondent’s latent
variable value, there should be no relationship between the
responses to different items. Items that do have a relationship
apart from the latent variable can create their own second
dimension that explains covariance between these items that
is not shared with the other items on the scale. This becomes
a specific factor that is common to the locally dependent
items and is separate from the general factor common to all
items on the scale. When this multidimensional scale is
forced into a unidimensional model, if the locally dependent
items are strongly defined (ie, high-factor loadings) and the
remaining items are weakly defined (ie, low-factor loadings),
the strength of the relationship between the locally dependent
items can change the construct measured by a scale by
causing the 1 factor to be a measure of the specific factor
rather than the general factor of interest.6
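The article's local dependence checks rely on CCFA (and the indices of Chen and Thissen6). As a complementary illustration only, a simple residual-correlation diagnostic in the spirit of Yen's Q3 can be sketched with simulated 2PL data; every parameter value below is made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def q3_matrix(responses, theta, a, b):
    """Q3-style local dependence check: correlate the residuals
    u - P(theta) that remain after the latent variable is accounted for.
    Large off-diagonal values flag possibly locally dependent item pairs."""
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
    resid = responses - p
    return np.corrcoef(resid, rowvar=False)

# Simulate locally independent binary responses for 4 hypothetical items
n = 2000
theta = rng.standard_normal(n)
a = np.array([1.5, 1.2, 1.0, 1.3])
b = np.array([-0.5, 0.0, 0.5, 0.2])
p_true = 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
responses = (rng.random((n, 4)) < p_true).astype(float)

q3 = q3_matrix(responses, theta, a, b)
# When local independence holds, the off-diagonal residual
# correlations hover near zero.
```

A pair of items that share a specific factor beyond theta would instead show a clearly elevated residual correlation.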
The goal of this article is to outline the use of categor-
ical confirmatory factor analysis (CCFA) and IRT in item
selection and scale development as applied in the PROMIS
project, using examples from data obtained from items on the
PedsQL 4.0 Generic Core Scales.7 CCFA, a factor analytic
approach that accounts for the non-normality of categorical
data that renders traditional confirmatory factor analysis
methods inappropriate, will be used to assess domain dimen-
sionality and to identify possible locally dependent items.
Although there are other approaches for assessing local de-
pendence and dimensionality, the use of CCFA was sup-
ported by the PROMIS psychometric team.8 IRT will be used
to assess how well the items measure the construct of interest
and the appropriateness of the set of items for various ranges
on the latent construct.
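Assessing "the appropriateness of the set of items for various ranges on the latent construct" can be illustrated with item and test information for 2PL items; this sketch is not from the article, and the slopes and locations are hypothetical:

```python
import numpy as np

def item_info_2pl(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P),
    which peaks where theta equals the location parameter b."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def scale_info(theta, a_vec, b_vec):
    """Total information of an item set; the standard error of the
    latent-trait estimate is 1 / sqrt(information)."""
    return sum(item_info_2pl(theta, a, b) for a, b in zip(a_vec, b_vec))

theta = np.linspace(-3, 3, 61)
a_vec = [1.5, 1.2, 2.0]   # hypothetical slopes
b_vec = [-1.0, 0.0, 1.0]  # hypothetical locations
se = 1.0 / np.sqrt(scale_info(theta, a_vec, b_vec))
```

Plotting `se` against theta shows where the item set measures precisely (low standard error) and where additional items would be needed.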
The use of CCFA and IRT will be demonstrated using
data on items from the 4 subscales of the PedsQL 4.0 Generic
Core Scales.7 This instrument consists of 23 items designed
to measure HRQoL in children and adolescents. Four do-
mains are assessed: (1) Physical Functioning, (2) Emotional
Functioning, (3) Social Functioning, and (4) School Func-
tioning. A number of instrument versions exist for various
age ranges, different informants, and assorted languages;
however, this example will focus only on the child self-report
and parent proxy-report for children (ages 8–12 years old)
and adolescents (ages 13–18 years old) in English and Span-
ish. All analyses considered informant (self or parent), age
(child or adolescent), and language (English or Spanish)
separately, for a total of 8 replications of the analysis for each
domain. Items from the PedsQL 4.0 Generic Core Scales
were examined during the initial stages of the PROMIS
project to obtain information about the dimensionality of the
domains of interest to the project, to provide preliminary
information about some items being considered for inclusion
in the PROMIS item bank, and to familiarize the research
team with the analysis plan that will be applied to PROMIS
data when they become available. Thus, this analysis is not
intended to be an evaluation of the PedsQL 4.0 Generic Core Scales.

REFERENCES
1. Cella D, Yount S, Rothrock N, et al. The Patient-Reported Outcomes
Measurement Information System (PROMIS): progress of an NIH Road-
map Cooperative Group during its first two years. Med Care. 2007;
2. Hambleton RK. Emergence of item response modeling in instrument
development and data analysis. Med Care. 2000;38:II-60–II-65.
3. Birnbaum A. Some latent trait models and their use in inferring an
examinee’s ability. In: Lord FM, Novick MR, eds. Statistical Theories of
Mental Test Scores. Reading, MA: Addison-Wesley; 1968:395–479.
4. Samejima F. Estimation of Latent Ability Using a Response Pattern of
Graded Scores. Iowa City, IA: Psychometric Society; 1969. Psychomet-
ric Monograph No. 17.
5. Samejima F. Graded response model. In: van der Linden WJ, Hambleton
RK, eds. Handbook of Modern Item Response Theory. New York, NY:
Springer Verlag; 1997:85–100.
6. Chen W, Thissen D. Local dependence indexes for item pairs using item
response theory. J Educ Behav Stat. 1997;22:265–289.
7. Varni JW, Seid M, Kurtin PS. The PedsQL™ 4.0: reliability and validity
of the Pediatric Quality of Life Inventory™ Version 4. 0 Generic Core
Scales in healthy and patient populations.Med Care. 2001;39:800–812.
8. Reeve BB, Hays RD, Bjorner JB, et al. Psychometric evaluation and
calibration of health-related quality of life item banks: plans for
the Patient-Reported Outcomes Measurement Information System
(PROMIS). Med Care. 2007;45(Suppl 1):S22–S31.
9. Varni JW, Burwinkle TM, Seid M, et al. The PedsQL™ 4.0 as a
pediatric population health measure: feasibility, reliability, and validity.
Ambul Pediatr. 2003;3:329–341.
10. Oranje A. Comparison of estimation methods in factor analysis with
categorized variables: applications to NAEP data. Paper presented at
Annual Meeting of the American Educational Research Association,
Chicago, IL; April 2003.
11. Flora DB, Curran PJ. An empirical evaluation of alternative methods of
estimation for confirmatory factor analysis with ordinal data. Psychol
12. Jöreskog KG, Sörbom D. PRELIS 2 User’s Reference Guide: A Program
for Multivariate Data Screening and Data Summarization; A Prepro-
cessor for LISREL. Chicago, IL: Scientific Software International; 1996.
13. Jöreskog KG, Sörbom D. LISREL 8: User’s Reference Guide. Chicago,
IL: Scientific Software International; 1996.
14. Muthén LK, Muthén BO. Mplus User’s Guide. 3rd ed. Los Angeles, CA: Muthén & Muthén; 1998–2004.
15. Jöreskog KG. New developments in LISREL: analysis of ordinal variables using polychoric correlations and weighted least squares. Qual
16. Browne MW. Asymptotically distribution-free methods for the analysis
of covariance structures. Br J Math Stat Psychol. 1984;37:62–83.
17. Muthén B. A general structural equation model with dichotomous,
ordered categorical, and continuous latent variable indicators. Psychometrika.
18. Browne MW, Cudeck R. Alternative ways of assessing model fit. Sociol
Methods Res. 1992;21:230–258.
19. Tucker LR, Lewis C. A reliability coefficient for maximum likelihood
factor analysis. Psychometrika. 1973;38:1–10.
20. McDonald RP. Test Theory: A Unified Treatment. Mahwah, NJ: Law-
rence Erlbaum Associates; 1999.
21. Thissen D, Chen W-H, Bock RD. Multilog (Version 7) [Computer Software]. Lincolnwood, IL: Scientific Software International; 2003.
22. Schumaker RE, Lomax RG. A Beginner’s Guide to Structural Equation
Modeling. Mahwah, NJ: Erlbaum; 1996.
23. Langer MM, Hill CD, Thissen D, et al. Detection and evaluation of
differential item functioning using item response theory: an application
to the Pediatric Quality of Life Inventory™ (PedsQL™) 4.0 Generic
Core Scales. J Clin Epidemiol. In press.