Understanding increments in model performance metrics

Department of Biostatistics, Harvard Clinical Research Institute, Boston University, CrossTown, 801 Massachusetts Ave., Boston, MA, 02118, USA.
Lifetime Data Analysis (Impact Factor: 0.65). 12/2012; 9(2). DOI: 10.1007/s10985-012-9238-0
Source: PubMed


The area under the receiver operating characteristic curve (AUC) is the most commonly reported measure of discrimination for prediction models with binary outcomes. However, recently it has been criticized for its inability to increase when important risk factors are added to a baseline model with good discrimination. This has led to the claim that the reliance on the AUC as a measure of discrimination may miss important improvements in clinical performance of risk prediction rules derived from a baseline model. In this paper we investigate this claim by relating the AUC to measures of clinical performance based on sensitivity and specificity under the assumption of multivariate normality. The behavior of the AUC is contrasted with that of discrimination slope. We show that unless rules with very good specificity are desired, the change in the AUC does an adequate job as a predictor of the change in measures of clinical performance. However, stronger or more numerous predictors are needed to achieve the same increment in the AUC for baseline models with good versus poor discrimination. When excellent specificity is desired, our results suggest that the discrimination slope might be a better measure of model improvement than AUC. The theoretical results are illustrated using a Framingham Heart Study example of a model for predicting the 10-year incidence of atrial fibrillation.
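
The claim that the same increment in the AUC demands a stronger predictor when the baseline model already discriminates well can be sketched numerically. The block below is a minimal illustration, not the paper's own code. It assumes equal-covariance multivariate normality, under which the AUC equals Phi(M / sqrt(2)) for Mahalanobis distance M between events and non-events, and it further assumes the added predictor is conditionally independent of the baseline predictors with standardized mean difference `delta`, so that the squared distances add. The function names and `delta` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def auc_from_distance(m):
    """AUC implied by Mahalanobis distance m (equal-covariance normality)."""
    return STD_NORMAL.cdf(m / sqrt(2))


def distance_from_auc(auc):
    """Invert the relation above to recover the distance from an AUC."""
    return sqrt(2) * STD_NORMAL.inv_cdf(auc)


def auc_after_adding(baseline_auc, delta):
    """AUC after adding one predictor with standardized effect delta.

    Assumption: the new predictor is conditionally independent of the
    baseline predictors, so squared Mahalanobis distances add.
    """
    m0 = distance_from_auc(baseline_auc)
    return auc_from_distance(sqrt(m0 ** 2 + delta ** 2))


def sensitivity_at_specificity(auc, specificity):
    """Sensitivity of the rule whose cutoff yields the given specificity."""
    m = distance_from_auc(auc)
    return STD_NORMAL.cdf(m - STD_NORMAL.inv_cdf(specificity))


if __name__ == "__main__":
    delta = 0.5  # hypothetical effect size of the added predictor
    for base in (0.6, 0.8):
        new = auc_after_adding(base, delta)
        print(f"baseline AUC {base:.2f}: new AUC {new:.3f}, gain {new - base:.3f}")
        for spec in (0.80, 0.95):
            before = sensitivity_at_specificity(base, spec)
            after = sensitivity_at_specificity(new, spec)
            print(f"  specificity {spec:.2f}: sensitivity {before:.3f} -> {after:.3f}")
```

Under these assumptions the same `delta` produces a visibly smaller AUC gain from a baseline of 0.8 than from a baseline of 0.6, which is the pattern the abstract describes.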

  • Article: Editorial.

    Lifetime Data Analysis 04/2013; DOI:10.1007/s10985-013-9253-9 · 0.65 Impact Factor
  • Source
    ABSTRACT: The area under a receiver operating characteristic (ROC) curve (the AUC) is used as a measure of the performance of a screening or diagnostic test. We here assess the validity of the AUC. Assuming the test results follow Gaussian distributions in affected and unaffected individuals, standard mathematical formulae were used to describe the relationship between the detection rate (DR) (or sensitivity) and the false-positive rate (FPR) of a test with the AUC. These formulae were used to calculate the screening performance (DR for a given FPR, or FPR for a given DR) for different AUC values according to different standard deviations of the test result in affected and unaffected individuals. The DR for a given FPR is strongly dependent on relative differences in the standard deviation of the test variable in affected and unaffected individuals. Consequently, two tests with the same AUC can have a different DR for the same FPR. For example, an AUC of 0.75 has a DR of 24% for a 5% FPR if the standard deviations are the same in affected and unaffected individuals, but 39% for the same 5% FPR if the standard deviation in affected individuals is 1.5 times that in unaffected individuals. The AUC is an unreliable measure of screening performance because in practice the standard deviation of a screening or diagnostic test in affected and unaffected individuals can differ. The problem is avoided by not using AUC at all, and instead specifying DRs for given FPRs or FPRs for given DRs.
    Journal of Medical Screening 01/2014; 21(1). DOI:10.1177/0969141313517497 · 3.10 Impact Factor
  • Source
    ABSTRACT: An important question in the evaluation of an additional risk prediction marker is how to interpret a small increase in the area under the receiver operating characteristic curve (AUC). Many researchers believe that a change in AUC is a poor metric because it increases only slightly with the addition of a marker with a large odds ratio. Because it is not possible on purely statistical grounds to choose between the odds ratio and AUC, we invoke decision analysis, which incorporates costs and benefits. For example, a timely estimate of the risk of later non-elective operative delivery can help a woman in labor decide if she wants an early elective cesarean section to avoid greater complications from possible later non-elective operative delivery. A basic risk prediction model for later non-elective operative delivery involves only antepartum markers. Because adding intrapartum markers to this risk prediction model increases AUC by 0.02, we questioned whether this small improvement is worthwhile. A key decision-analytic quantity is the risk threshold, here the risk of later non-elective operative delivery at which a patient would be indifferent between an early elective cesarean section and usual care. For a range of risk thresholds, we found that an increase in the net benefit of risk prediction requires collecting intrapartum marker data on 68 to 124 women for every correct prediction of later non-elective operative delivery. Because data collection is non-invasive, this test tradeoff of 68 to 124 is clinically acceptable, indicating the value of adding intrapartum markers to the risk prediction model. Copyright © 2014 John Wiley & Sons, Ltd.
    Statistics in Medicine 09/2014; 33(22). DOI:10.1002/sim.6195 · 1.83 Impact Factor
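
The worked figures in the Journal of Medical Screening abstract above (an AUC of 0.75 giving a 24% detection rate at a 5% false-positive rate with equal standard deviations, and 39% when the standard deviation in affected individuals is 1.5 times larger) can be checked with the standard binormal formulae. The sketch below assumes test results are N(0, 1) in unaffected individuals and N(mu, sd_ratio^2) in affected individuals, so that AUC = Phi(mu / sqrt(1 + sd_ratio^2)). The function name and parameters are hypothetical and are not taken from that paper.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def detection_rate(auc, fpr, sd_ratio=1.0):
    """Detection rate (sensitivity) at a fixed false-positive rate.

    Assumption: unaffected ~ N(0, 1), affected ~ N(mu, sd_ratio**2),
    with mu chosen so that the binormal AUC equals the requested value.
    """
    mu = STD_NORMAL.inv_cdf(auc) * sqrt(1.0 + sd_ratio ** 2)
    cutoff = STD_NORMAL.inv_cdf(1.0 - fpr)  # cutoff giving the target FPR
    return STD_NORMAL.cdf((mu - cutoff) / sd_ratio)


if __name__ == "__main__":
    for ratio in (1.0, 1.5):
        dr = detection_rate(auc=0.75, fpr=0.05, sd_ratio=ratio)
        print(f"SD ratio {ratio}: detection rate {dr:.0%} at 5% FPR")
```

Holding the AUC fixed while varying `sd_ratio` shows how two tests with the same AUC can differ in detection rate at the same false-positive rate.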