Understanding increments in model performance metrics.
ABSTRACT The area under the receiver operating characteristic curve (AUC) is the most commonly reported measure of discrimination for prediction models with binary outcomes. However, recently it has been criticized for its inability to increase when important risk factors are added to a baseline model with good discrimination. This has led to the claim that the reliance on the AUC as a measure of discrimination may miss important improvements in clinical performance of risk prediction rules derived from a baseline model. In this paper we investigate this claim by relating the AUC to measures of clinical performance based on sensitivity and specificity under the assumption of multivariate normality. The behavior of the AUC is contrasted with that of discrimination slope. We show that unless rules with very good specificity are desired, the change in the AUC does an adequate job as a predictor of the change in measures of clinical performance. However, stronger or more numerous predictors are needed to achieve the same increment in the AUC for baseline models with good versus poor discrimination. When excellent specificity is desired, our results suggest that the discrimination slope might be a better measure of model improvement than AUC. The theoretical results are illustrated using a Framingham Heart Study example of a model for predicting the 10-year incidence of atrial fibrillation.
Article: Editorial.Lifetime Data Analysis 04/2013; DOI:10.1007/s10985-013-9253-9
- [Show abstract] [Hide abstract]
ABSTRACT: An important question in the evaluation of an additional risk prediction marker is how to interpret a small increase in the area under the receiver operating characteristic curve (AUC). Many researchers believe that a change in AUC is a poor metric because it increases only slightly with the addition of a marker with a large odds ratio. Because it is not possible on purely statistical grounds to choose between the odds ratio and AUC, we invoke decision analysis, which incorporates costs and benefits. For example, a timely estimate of the risk of later non-elective operative delivery can help a woman in labor decide if she wants an early elective cesarean section to avoid greater complications from possible later non-elective operative delivery. A basic risk prediction model for later non-elective operative delivery involves only antepartum markers. Because adding intrapartum markers to this risk prediction model increases AUC by 0.02, we questioned whether this small improvement is worthwhile. A key decision-analytic quantity is the risk threshold, here the risk of later non-elective operative delivery at which a patient would be indifferent between an early elective cesarean section and usual care. For a range of risk thresholds, we found that an increase in the net benefit of risk prediction requires collecting intrapartum marker data on 68 to 124 women for every correct prediction of later non-elective operative delivery. Because data collection is non-invasive, this test tradeoff of 68 to 124 is clinically acceptable, indicating the value of adding intrapartum markers to the risk prediction model. Copyright © 2014 John Wiley & Sons, Ltd.Statistics in Medicine 09/2014; 33(22). DOI:10.1002/sim.6195