Performance of reclassification statistics in comparing risk prediction models

Division of Preventive Medicine, Brigham and Women's Hospital, Boston, MA, USA.
Biometrical Journal (Impact Factor: 1.24). 03/2011; 53(2):237-58. DOI: 10.1002/bimj.201000078
Source: PubMed

ABSTRACT Concerns have been raised about the use of traditional measures of model fit in evaluating risk prediction models for clinical use, and reclassification tables have been suggested as an alternative means of assessing the clinical utility of a model. Several measures based on the table have been proposed, including the reclassification calibration (RC) statistic, the net reclassification improvement (NRI), and the integrated discrimination improvement (IDI), but the performance of these in practical settings has not been fully examined. We used simulations to estimate the type I error and power for these statistics in a number of scenarios, as well as the impact of the number and type of categories, when adding a new marker to an established or reference model. The type I error was found to be reasonable in most settings, and power was highest for the IDI, which was similar to the test of association. The relative power of the RC statistic, a test of calibration, and the NRI, a test of discrimination, varied depending on the model assumptions. These tools provide unique but complementary information.
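For readers unfamiliar with the statistics named in the abstract, the following is a minimal sketch of how a category-based NRI and the IDI are commonly computed from two sets of predicted risks. It follows the standard textbook definitions and is not taken from the paper's simulation code. The function names, the risk cutoffs, and the toy data are all hypothetical.

```python
from bisect import bisect_right


def risk_category(p, cutoffs):
    """Return the index of the risk category that contains probability p.

    A probability equal to a cutoff is placed in the higher category.
    """
    return bisect_right(cutoffs, p)


def nri(old, new, events, cutoffs):
    """Category-based net reclassification improvement.

    NRI = [P(up | event) - P(down | event)]
        + [P(down | non-event) - P(up | non-event)]
    """
    up_e = down_e = up_n = down_n = 0
    n_events = sum(events)
    n_nonevents = len(events) - n_events
    for p_old, p_new, y in zip(old, new, events):
        move = risk_category(p_new, cutoffs) - risk_category(p_old, cutoffs)
        if y:
            up_e += move > 0
            down_e += move < 0
        else:
            up_n += move > 0
            down_n += move < 0
    return (up_e - down_e) / n_events + (down_n - up_n) / n_nonevents


def idi(old, new, events):
    """Integrated discrimination improvement.

    Mean change in predicted risk among events minus the mean change
    among non-events.
    """
    diff_events = [pn - po for po, pn, y in zip(old, new, events) if y]
    diff_nonevents = [pn - po for po, pn, y in zip(old, new, events) if not y]
    mean_events = sum(diff_events) / len(diff_events)
    mean_nonevents = sum(diff_nonevents) / len(diff_nonevents)
    return mean_events - mean_nonevents


# Hypothetical toy data: predicted risks from a reference model (old) and
# from a model with an added marker (new), plus observed outcomes.
old_risk = [0.04, 0.08, 0.15, 0.25, 0.03, 0.12]
new_risk = [0.06, 0.12, 0.22, 0.18, 0.02, 0.08]
outcome = [1, 1, 1, 0, 0, 0]
cutoffs = [0.05, 0.10, 0.20]  # assumed category boundaries

print(nri(old_risk, new_risk, outcome, cutoffs))
print(idi(old_risk, new_risk, outcome))
```

Because the NRI depends on the chosen cutoffs while the IDI does not, changing `cutoffs` alters only the first result, which is one reason the abstract examines the impact of the number and type of categories.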

  • ABSTRACT: The Fracture Risk Assessment Tool (FRAX) is widely used to predict the 10-year probability of fracture; however, the clinical utility of FRAX in CKD is unknown. This study assessed the predictive ability of FRAX in individuals with reduced kidney function compared with individuals with normal kidney function. The discrimination and calibration (defined as the agreement between observed and predicted values) of FRAX were examined using data from the Canadian Multicentre Osteoporosis Study (CaMos). This study included individuals aged ≥40 years with an eGFR value at year 10 of CaMos (defined as baseline). The cohort was stratified by kidney function at baseline (eGFR<60 ml/min per 1.73 m² [72.2% stage 3a, 23.8% stage 3b, and 4.0% stage 4/5] versus ≥60 ml/min per 1.73 m²), and individuals were followed for a mean of 4.8 years for an incident major osteoporotic fracture (clinical spine, hip, forearm/wrist, or humerus). There were 320 individuals with an eGFR<60 ml/min per 1.73 m² and 1787 with an eGFR≥60 ml/min per 1.73 m². The mean age was 67±10 years and 71% were women. The 5-year observed major osteoporotic fracture risk was 5.3% (95% confidence interval [95% CI], 3.3% to 8.6%) in individuals with an eGFR<60 ml/min per 1.73 m², which was comparable to the FRAX-predicted fracture risk (6.4% with bone mineral density; 8.2% without bone mineral density). A statistically significant difference was not observed in the area under the curve values for FRAX in individuals with an eGFR<60 ml/min per 1.73 m² versus ≥60 ml/min per 1.73 m² (0.69 [95% CI, 0.54 to 0.83] versus 0.76 [95% CI, 0.70 to 0.82]; P=0.38). This study showed that FRAX was able to predict major osteoporotic fractures in individuals with reduced kidney function; further study is needed before FRAX should be routinely used in individuals with reduced kidney function. Copyright © 2015 by the American Society of Nephrology.
    Clinical Journal of the American Society of Nephrology 02/2015; 10(4). DOI:10.2215/CJN.06040614 · 5.25 Impact Factor
  • ABSTRACT: To assess the calibration of a predictive model in a survival analysis setting, several authors have extended the Hosmer-Lemeshow goodness-of-fit test to survival data. Grønnesby and Borgan developed a test under the proportional hazards assumption, and Nam and D'Agostino developed a nonparametric test that is applicable in a more general survival setting for data with limited censoring. We analyze the performance of the two tests and show that the Grønnesby-Borgan test attains appropriate size in a variety of settings, whereas the Nam-D'Agostino method has a higher than nominal Type 1 error when there is more than trivial censoring. Both tests are sensitive to small cell sizes. We develop a modification of the Nam-D'Agostino test to allow for higher censoring rates. We show that this modified Nam-D'Agostino test has appropriate control of Type 1 error and comparable power to the Grønnesby-Borgan test and is applicable to settings other than proportional hazards. We also discuss the application to small cell sizes. Copyright © 2015 John Wiley & Sons, Ltd.
    Statistics in Medicine 02/2015; 34(10). DOI:10.1002/sim.6428 · 2.04 Impact Factor
  • ABSTRACT: The prognostic utility of serum C reactive protein (CRP) alone in sepsis is controversial. We used decision curve analysis (DCA) to evaluate the clinical usefulness of combining serum CRP levels with the CURB-65 score in patients with suspected sepsis. This was a retrospective cohort study in the emergency department (ED) of an urban teaching hospital in Japan. Participants were consecutive ED patients over 15 years of age who were admitted to the hospital after having a blood culture taken in the ED between 1 January 2010 and 31 December 2012. The outcome was 30-day in-hospital mortality. Data from 1262 patients were analysed for score evaluation. The 30-day in-hospital mortality was 8.4%. Multivariable analysis showed that serum CRP ≥150 mg/L was an independent predictor of death (adjusted OR 2.0; 95% CI 1.3 to 3.1). We compared the predictive performance of CURB-65 with the performance of a modified CURB-65 that included CRP (≥150 mg/L) to quantify the clinical usefulness of combining serum CRP with CURB-65. The areas under the receiver operating characteristics curves of CURB-65 and a modified CURB-65 were 0.76 (95% CI 0.72 to 0.80) and 0.77 (95% CI 0.72 to 0.81), respectively. Both models had good calibration for mortality and were useful among threshold probabilities from 0% to 30%. However, while incorporating CRP into CURB-65 yielded a significant category-free net reclassification improvement of 0.387 (95% CI 0.193 to 0.582) and integrated discrimination improvement of 0.015 (95% CI 0.004 to 0.027), DCA showed that CURB-65 and the modified CURB-65 score had comparable net benefits for prediction of mortality. Measurement of serum CRP added limited clinical usefulness to CURB-65 in predicting mortality in patients with clinically suspected sepsis, regardless of the source. Published by the BMJ Publishing Group Limited.
    BMJ Open 04/2015; 5(4):e007049. DOI:10.1136/bmjopen-2014-007049 · 2.06 Impact Factor
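
The net benefit compared in the decision curve analysis of the last abstract above can be illustrated with a short sketch. It uses the usual definition of net benefit at a threshold probability and is not the cited study's code. The function names and the toy data are hypothetical.

```python
def net_benefit(probs, events, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    NB = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    n = len(events)
    true_pos = sum(1 for p, y in zip(probs, events) if p >= threshold and y)
    false_pos = sum(1 for p, y in zip(probs, events) if p >= threshold and not y)
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


def net_benefit_treat_all(events, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(events) / len(events)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical toy data: predicted mortality risks and observed outcomes.
predicted = [0.9, 0.6, 0.3, 0.2, 0.1]
died = [1, 0, 1, 0, 0]

# A decision curve plots net benefit across a range of thresholds.
for pt in (0.05, 0.15, 0.25):
    print(pt, net_benefit(predicted, died, pt), net_benefit_treat_all(died, pt))
```

Two models with different NRI or IDI values can still have nearly identical curves over the thresholds of interest, which is the pattern that abstract reports.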

