Item response theory facilitated cocalibrating cognitive tests and reduced bias in estimated rates of decline

Department of Medicine, University of Washington, Seattle, WA, USA.
Journal of Clinical Epidemiology (Impact Factor: 3.42). 05/2008; 61(10):1018-27.e9. DOI: 10.1016/j.jclinepi.2007.11.011
Source: PubMed


OBJECTIVE: To cocalibrate the Mini-Mental State Examination, the Modified Mini-Mental State, the Cognitive Abilities Screening Instrument, and the Community Screening Instrument for Dementia using item response theory (IRT); to compare the screening cut points used to identify dementia cases in different studies; to compare the measurement properties of the tests; and to explore the implications of these measurement properties for longitudinal studies of cognitive functioning.
STUDY DESIGN AND SETTING: We used cross-sectional data from three large (n > 1,000) community-based studies of cognitive functioning in the elderly, cocalibrated the four scales with IRT, and ran simulations of longitudinal studies.
RESULTS: Screening cut points varied widely across studies. All four tests have curvilinear scaling and uneven measurement precision, with more measurement error at higher levels of cognitive functioning. In the longitudinal simulations, IRT scores consistently outperformed standard scoring, whereas a strategy to account for varying measurement precision had mixed results.
CONCLUSION: Cocalibration allows direct comparison of cognitive functioning across studies that use any of these four tests. Standard scoring appears to be a poor choice for analyzing longitudinal cognitive testing data. More research is needed into the implications of varying levels of measurement precision.
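
To make the contrast between standard (sum) scoring and IRT scoring concrete, the following minimal Python sketch scores a hypothetical five-item screener under a two-parameter logistic (2PL) model. The item parameters, the response patterns, and the expected-a-posteriori (EAP) scoring routine are illustrative assumptions, not the paper's actual calibration.

    import numpy as np

    # Hypothetical 2PL item parameters (assumed for illustration only).
    a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])    # discriminations
    b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # difficulties

    def p_correct(theta, a_i, b_i):
        """2PL probability of passing an item at trait level theta."""
        return 1.0 / (1.0 + np.exp(-a_i * (theta - b_i)))

    def eap_score(responses, a, b, grid=np.linspace(-4.0, 4.0, 161)):
        """Expected-a-posteriori trait estimate under a N(0, 1) prior."""
        posterior = np.exp(-0.5 * grid ** 2)  # unnormalized normal prior
        for r, a_i, b_i in zip(responses, a, b):
            p = p_correct(grid, a_i, b_i)
            posterior *= p if r else 1.0 - p
        posterior /= posterior.sum()
        return float((grid * posterior).sum())

    # Two response patterns with the same sum score (3/5) receive different
    # IRT scores, because which items were passed carries information.
    print(eap_score([1, 1, 1, 0, 0], a, b))  # passed the three easiest items
    print(eap_score([1, 0, 1, 0, 1], a, b))  # a mixed pass/fail pattern

Because the two response patterns pass items of different difficulty, their IRT scores differ even though their sum scores are identical; the curvilinear scaling reported above makes the same point at the level of the whole scale.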

    • "Similarly, use of serial 7s, spelling 'WORLD' backwards, or the more successfully completed of the two, is also likely to affect scores. Co-calibration across studies (Crane et al., 2008) was not attempted, however, as this was not the purpose of this article. Variability in administration was taken as representative of the likely variation across other reports in the literature. "
    ABSTRACT: Objectives: We describe and compare the expected performance trajectories of older adults on the Mini-Mental State Examination (MMSE) across six independent studies from four countries in the context of a collaborative network of longitudinal studies of aging. A coordinated analysis approach is used to compare patterns of change conditional on sample composition differences related to age, sex, and education. Such coordination accelerates evaluation of particular hypotheses. In particular, we focus on the effect of educational attainment on cognitive decline. Method: Regular and Tobit mixed models were fit to MMSE scores from each study separately. The effects of age, sex, and education were examined based on more than one centering point. Results: Findings were relatively consistent across studies. On average, MMSE scores were lower for older individuals and declined over time. Education predicted MMSE score, but, with two exceptions, was not associated with decline in MMSE over time. Conclusion: A straightforward association between educational attainment and rate of cognitive decline was not supported. Thoughtful consideration is needed when synthesizing evidence across studies, as methodologies adopted and sample characteristics, such as educational attainment, invariably differ.
    The Journals of Gerontology, Series B: Psychological Sciences and Social Sciences 10/2012; 68(3). DOI:10.1093/geronb/gbs077 · 3.21 Impact Factor (see the mixed-model sketch after this list)
    • "ary care experience . This curve shows that the distribution of items is not uniform across the range measured by the scale , as the slope of the curve is higher to the left of about 0 than to the right . This finding suggests problems with using stan - dard scores in regression models ; item response theory ( IRT ) scores should be used instead ( Crane et al . 2008a ) . ( b ) The test information curve ( black curve ) and the standard error of measurement curve ( gray curve ) at each level of overall primary care experience are shown . These curves further document the uneven distribution of items across the scale . Test informa - tion is adequate to the left of 0 but drops to the right of 0 . This"
    ABSTRACT: To evaluate psychometric properties of a widely used patient experience survey. English-language responses to the Clinician & Group Consumer Assessment of Healthcare Providers and Systems (CG-CAHPS®) survey (n = 12,244) from a 2008 quality improvement initiative involving eight southern California medical groups. We used an iterative hybrid ordinal logistic regression/item response theory differential item functioning (DIF) algorithm to identify items with DIF related to patient sociodemographic characteristics, duration of the physician-patient relationship, number of physician visits, and self-rated physical and mental health. We accounted for all sources of DIF and determined its cumulative impact. The upper end of the CG-CAHPS® performance range is measured with low precision. With sensitive settings, some items were found to have DIF. However, overall DIF impact was negligible, as 0.14 percent of participants had salient DIF impact. Latinos who spoke predominantly English at home had the highest prevalence of salient DIF impact at 0.26 percent. The CG-CAHPS® functions similarly across commercially insured respondents from diverse backgrounds. Consequently, previously documented racial and ethnic group differences likely reflect true differences rather than measurement bias. The impact of low precision at the upper end of the scale should be clarified.
    Health Services Research 12/2011; 46(6 Pt 1):1778-802. DOI:10.1111/j.1475-6773.2011.01299.x · 2.78 Impact Factor (see the test-information sketch after this list)
    • "Even if the Rasch model holds, using the sum score in a regression framework may not be ideal because the relationship between the sum score and the Rasch trait score is not linear, as evident in a test characteristic curve. In such situations, an IRT trait score is a more reasonable choice for regression modeling (such as DIF detection in the ordinal logistic regression framework) than an observed sum score (Crane et al. 2008a). "
    ABSTRACT: Logistic regression provides a flexible framework for detecting various types of differential item functioning (DIF). Previous efforts extended the framework by using item response theory (IRT) based trait scores, and by employing an iterative process using group-specific item parameters to account for DIF in the trait scores, analogous to purification approaches used in other DIF detection frameworks. The current investigation advances the technique by developing a computational platform integrating both statistical and IRT procedures into a single program. Furthermore, a Monte Carlo simulation approach was incorporated to derive empirical criteria for various DIF statistics and effect size measures. For purposes of illustration, the procedure was applied to data from a questionnaire of anxiety symptoms for detecting DIF associated with age from the Patient-Reported Outcomes Measurement Information System.
    Journal of Statistical Software 03/2011; 39(8):1-30. DOI:10.18637/jss.v039.i08 · 3.80 Impact Factor (see the test-characteristic-curve sketch below)
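
The Journals of Gerontology article above fits growth models to MMSE scores. As a rough illustration of the "regular mixed model" arm of that comparison, here is a Python sketch using statsmodels: a linear growth model with a random intercept and slope per person and an education-by-time interaction. The simulated data and variable names are assumptions, and the Tobit variant, which treats the MMSE ceiling of 30 as censoring, has no built-in statsmodels equivalent and is omitted.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulate a small longitudinal panel (all values are invented).
    rng = np.random.default_rng(0)
    n, waves = 200, 4
    ids = np.repeat(np.arange(n), waves)
    years = np.tile(np.arange(waves, dtype=float), n)
    edu = np.repeat(rng.integers(8, 19, size=n).astype(float), waves)
    u0 = np.repeat(rng.normal(0.0, 1.5, size=n), waves)  # person intercepts
    u1 = np.repeat(rng.normal(0.0, 0.3, size=n), waves)  # person slopes
    mmse = (27.0 + 0.15 * (edu - 12.0) + u0
            + (-0.4 + u1) * years
            + rng.normal(0.0, 1.0, size=n * waves))
    df = pd.DataFrame({"id": ids, "years": years, "edu": edu,
                       "mmse": np.clip(mmse, 0.0, 30.0)})  # enforce 0-30 range

    # Random intercept and slope per person; the years:edu fixed effect asks
    # whether education predicts the rate of decline.
    fit = smf.mixedlm("mmse ~ years * edu", df,
                      groups=df["id"], re_formula="~years").fit()
    print(fit.summary())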
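The Health Services Research quote above refers to test information and standard-error-of-measurement curves. For a 2PL model these follow directly from the item parameters: information at trait level theta is the sum over items of a_i^2 * P_i(theta) * (1 - P_i(theta)), and the standard error of measurement is its inverse square root. The item bank below is hypothetical, chosen so difficulties cluster below zero and precision drops at the upper end, mirroring the CG-CAHPS finding.

    import numpy as np

    a = np.array([1.4, 1.1, 0.9, 1.6, 1.2])     # discriminations (assumed)
    b = np.array([-1.5, -1.0, -0.5, 0.0, 0.5])  # difficulties (assumed)
    theta = np.linspace(-3.0, 3.0, 121)

    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))  # item response curves
    info = (a ** 2 * p * (1.0 - p)).sum(axis=1)          # 2PL test information
    sem = 1.0 / np.sqrt(info)                            # standard error of measurement

    # With difficulties clustered below 0, information peaks there and the
    # SEM grows at the upper end of the trait, as the quote describes.
    print(f"SEM at theta = 0: {sem[60]:.2f}")
    print(f"SEM at theta = 2: {sem[100]:.2f}")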
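Finally, the Journal of Statistical Software quote notes that the sum score is a nonlinear function of the Rasch trait. The sketch below, with nine hypothetical evenly spaced item difficulties, computes the test characteristic curve and shows that equal one-unit steps in the trait translate into unequal raw-score steps, which is why IRT trait scores are the more reasonable regression covariate.

    import numpy as np

    b = np.linspace(-2.0, 2.0, 9)  # Rasch item difficulties (assumed)
    theta = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])

    # Test characteristic curve: expected sum score at each trait level.
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b)))
    tcc = p.sum(axis=1)

    # Equal one-unit steps in the trait produce unequal steps in the raw
    # score: the mapping is ogive-shaped, compressed at floor and ceiling.
    print(np.round(tcc, 2))
    print(np.round(np.diff(tcc), 2))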