Evidence-based practice for equating health status items: sample size and IRT model

University of Washington.
Journal of applied measurement 02/2007; 8(2):175-89.
Source: PubMed


In the development of health outcome measures, the pool of candidate items may be divided into multiple forms, thus "spreading" response burden over two or more study samples. Item responses collected using this approach result in two or more forms whose scores are not equivalent. Therefore, the item responses must be equated (adjusted) to a common mathematical metric.
The purpose of this study was to examine the effects of sample size, test size, and choice of item response theory (IRT) model on the equating of three forms of a health status measure. Each form comprised a set of items unique to it and a set of anchor items common across forms.
The study was a secondary data analysis of patients' responses to the developmental item pool for the Health of Seniors Survey. A completely crossed design was used with 25 replications per study cell.
We found that the quality of equatings was affected greatly by sample size. Its effect was far more substantial than that of the choice of IRT model. Little or no advantage was observed for equatings based on 60 or 72 items versus those based on 48 items.
We concluded that samples smaller than 300 are clearly unacceptable for equating multiple forms. Additional sample size guidelines are offered based on our results.
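
As an illustration of what placing forms on "a common mathematical metric" through anchor items can involve, the following is a minimal sketch of mean/sigma linking. The abstract does not say which linking method the study used, so the method, the function names, and the example difficulty values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_anchor_new, b_anchor_ref):
    """Return slope A and intercept B that map the new form's metric
    onto the reference metric, using anchor-item difficulties only.

    Assumption: both lists hold difficulty estimates for the SAME anchor
    items, in the same order, from two separate calibrations.
    """
    A = pstdev(b_anchor_ref) / pstdev(b_anchor_new)
    B = mean(b_anchor_ref) - A * mean(b_anchor_new)
    return A, B


def to_reference_metric(b, a, A, B):
    """Transform one item's difficulty b and discrimination a
    from the new form's metric to the reference metric."""
    return A * b + B, a / A


# Hypothetical anchor difficulties (not taken from the study).
b_ref = [-1.0, 0.0, 1.0, 2.0]    # anchors as calibrated on the reference form
b_new = [-1.0, -0.5, 0.0, 0.5]   # same anchors as calibrated on the new form

A, B = mean_sigma_constants(b_new, b_ref)

# A unique (non-anchor) item from the new form, moved to the reference metric.
b_linked, a_linked = to_reference_metric(0.5, 1.2, A, B)
print(A, B, b_linked, a_linked)
```

Because the constants are estimated from the anchor items alone, noisy anchor estimates from a small calibration sample carry over to every transformed item, which is one way to see why sample size can matter so much for equating quality.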

  • ABSTRACT: The aim of this study was to evaluate the Computerized Adaptive Test to measure anxiety (A-CAT), a patient-reported outcome questionnaire that uses computerized adaptive testing to measure anxiety. The A-CAT builds on a bank of 50 items developed using conventional item analyses and item response theory analyses. The A-CAT was administered on Personal Digital Assistants to n=357 patients diagnosed and treated at the Department of Psychosomatic Medicine and Psychotherapy, Charité Berlin, Germany. For validation purposes, two subgroups of patients (n=110 and 125) answered the A-CAT along with established anxiety and depression questionnaires. The A-CAT was fast to complete (on average in 2 min, 38 s), and a precise item response theory based CAT score (reliability>.9) could be estimated after 4-41 items. On average, the CAT displayed 6 items (SD=4.2). Convergent validity of the A-CAT was supported by correlations with existing tools (Hospital Anxiety and Depression Scale-A, Beck Anxiety Inventory, Berliner Stimmungs-Fragebogen A/D, and State Trait Anxiety Inventory: r=.56-.66); discriminant validity between diagnostic groups was higher for the A-CAT than for other anxiety measures. The German A-CAT is an efficient, reliable, and valid tool for assessing anxiety in patients suffering from anxiety disorders and other conditions, with significant potential for initial assessment and long-term treatment monitoring. Future research directions are to explore content balancing of the item selection algorithm of the CAT, to norm the tool to a healthy sample, and to develop practical cutoff scores.
    Article · Dec 2008 · Depression and Anxiety
  • ABSTRACT: This study examined two approaches to linking items from two pain surveys to form a single item bank with a common measurement scale. This was a secondary analysis of two independent surveys: the Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials Survey, with a Main Survey (959 chronic pain patients; 42 pain items) and a Pain Module (n=148; 36 pain items), and the Center on Outcomes, Research and Education Survey (400 cancer patients; 43 pain items). Common items were included among the three data sets. In the first approach, all items were calibrated to an item response theory (IRT) model simultaneously; in the second approach, items were calibrated separately and the scales were then transformed to a common metric. The two approaches produced similar linking results across the two sets of pain interference items because there were a sufficient number of common items and a large enough sample size. For pain intensity, simultaneous calibration yielded more stable results. Separate calibration yielded an unsatisfactory linking result for pain intensity because there was only a single common item and a small sample size. The results suggested that a simultaneous IRT calibration method produces more stable item parameters across independent samples and, hence, should be recommended for developing comprehensive item banks. Patient-reported health outcome surveys are often limited in sample size and number of items owing to the difficulty of recruitment and the burden on patients. As a result, the surveys either lack statistical power or are limited in scope. Using IRT methodology, survey data can be pooled so that the surveys lend strength to each other, expanding the scope and increasing the sample sizes.
    Article · Aug 2009 · Journal of pain and symptom management
  • ABSTRACT: Objective: To use item response theory (IRT) methods to link physical functioning items in the Activity Measure for Post Acute Care (AM-PAC) and the Quality of Life Outcomes in Neurological Disorders (Neuro-QOL). Design: Secondary data analysis of the physical functioning items of AM-PAC and Neuro-QOL. We used a nonequivalent group design with 36 core items common to both instruments and a test characteristic curve transformation method for linking AM-PAC and Neuro-QOL scores. Linking was conducted so that both raw and scaled AM-PAC and Neuro-QOL scores (mean ± SD converted-logit scores, 50 ± 10) could be compared. Setting: AM-PAC items were administered to rehabilitation patients in post-acute care (PAC) settings. Neuro-QOL items were administered to a community sample of adults through the Internet. Participants: PAC patients (N=1041) for the AM-PAC sample and community-dwelling adults (N=549) for the Neuro-QOL sample. Interventions: Not applicable. Main outcome measures: Mobility (N=25) and activity of daily living (ADL) items (N=11) common to both instruments were included in the analysis. Results: Neuro-QOL items were linked to the AM-PAC scale by using the generalized partial credit model. Mobility and ADL subscale scores from the 2 instruments were calibrated to the AM-PAC metric. Conclusions: An IRT-based linking method placed AM-PAC and Neuro-QOL mobility and ADL scores on a common metric. This linking allowed estimation of AM-PAC mobility and ADL subscale scores based on Neuro-QOL mobility and ADL subscale scores and vice versa. The accuracy of these results should be validated in a future sample in which participants respond to both instruments.
    Article · Oct 2011 · Archives of physical medicine and rehabilitation
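
The last abstract above reports scores as "converted-logit scores, 50 ± 10". A minimal sketch of one such conversion follows, assuming a plain linear rescaling of logit scores against a reference mean and standard deviation. The abstract does not give the reference values or the exact formula, so the function name, its parameters, and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def to_scaled_scores(logits, ref_mean=None, ref_sd=None, center=50.0, spread=10.0):
    """Linearly rescale logit scores so the reference group has
    mean `center` and standard deviation `spread`.

    Assumption: when no reference values are supplied, the sample's own
    mean and population SD are used as the reference.
    """
    if ref_mean is None:
        ref_mean = mean(logits)
    if ref_sd is None:
        ref_sd = pstdev(logits)
    return [center + spread * (x - ref_mean) / ref_sd for x in logits]


# Hypothetical logit scores, rescaled against a reference of mean 0, SD 1.
scaled = to_scaled_scores([-1.0, 0.0, 1.0], ref_mean=0.0, ref_sd=1.0)
print(scaled)
```

A rescaling of this kind changes only the unit and origin of the scale. Putting two instruments on the same metric still requires a linking step, such as the test characteristic curve method named in the abstract.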