
Performance of Reclassification Statistics in Comparing Risk

Prediction Models

Nancy R. Cook and Nina P. Paynter

Division of Preventive Medicine, Brigham and Women’s Hospital, Boston, MA, USA

Abstract

Concerns have been raised about the use of traditional measures of model fit in evaluating risk

prediction models for clinical use, and reclassification tables have been suggested as an alternative

means of assessing the clinical utility of a model. Several measures based on the table have been

proposed, including the reclassification calibration (RC) statistic, the net reclassification

improvement (NRI), and the integrated discrimination improvement (IDI), but the performance of

these in practical settings has not been fully examined. We used simulations to estimate the type I

error and power for these statistics in a number of scenarios, as well as the impact of the number

and type of categories, when adding a new marker to an established or reference model. The type I

error was found to be reasonable in most settings, and power was highest for the IDI, which was

similar to the test of association. The relative power of the RC statistic, a test of calibration, and

the NRI, a test of discrimination, varied depending on the model assumptions. These tools provide

unique but complementary information.

Keywords

Calibration; Discrimination; Model accuracy; Prediction; Reclassification

1 Introduction

For many years model comparison in the medical literature has used the area under the

receiver operating characteristic (ROC) curve (Hanley and McNeil, 1982), whose analog in

the survival context is the c index or c-statistic (Harrell, 2001). This measure, while valuable,

can be insensitive to changes in absolute risk estimates (Cook, 2007). It is a measure of

discrimination that is directly applicable to the setting of classification, but is less useful in

evaluating risk prediction of future events since it is a function only of ranks, not the

predicted probabilities themselves (Moons and Harrell, 2003; Janes, et al., 2008).

Within the past few years, new methods for evaluating and comparing the fit of predictive

models have been proposed, and have generated considerable interest in both the statistical

and medical literature. The new measures are based on reclassification tables, which stratify

individuals into risk categories and examine changes in categories under a new model.

Measures based on both calibration (Cook, 2008) and discrimination (Pencina, et al., 2008)

have been proposed. While methods of model assessment should depend on the model’s

Correspondence to: Nancy R. Cook.

Conflict of Interest

The authors have declared no conflict of interest.

Supporting Information for this article is available from the author or on the WWW under http://dx.doi.org/10.1002/bimj.XXXXXXX.

NIH Public Access Author Manuscript.

Biom J. Author manuscript; available in PMC 2012 July 12.

Published in final edited form as: Biom J. 2011 March; 53(2): 237–258. doi:10.1002/bimj.201000078.



intended use, when the model is intended to guide treatment decisions, evaluation of risk

strata may be more clinically useful than comparing overall ranks.

The performance of these measures has been demonstrated in applied examples, and they

have been described further elsewhere (Gu and Pepe, 2009; Steyerberg, et al., 2009; Cook,

2010; Whittemore, 2010), but a rigorous examination of their distributions and performance

characteristics has not yet been done. While not sufficient to determine clinical utility,

questions concerning type I error and power relative to commonly used measures are of

interest. In addition, there are unique questions in the setting of reclassification in categories.

These include the effects of the number of categories and of the category composition on

estimates and power. Some of these effects have been explored in simulations (Mihaescu, et

al., 2010) and applied examples (Cook and Ridker, 2009; Mealiffe, et al., 2010) and

differences due to category definition have been found in some measures.

The current paper explores these characteristics for binary outcomes in a series of

simulations using logistic models, and examines the performance of the measures in both the

null model and under varying effect sizes. The definitions of the new measures and their

variations are presented in Section 2, with a worked example. Simulation results are

presented in Section 3, and an example based on data from the Women’s Health Study is

presented in Section 4.

2 Definitions

The analyses here assume a binary outcome, such as the development of cardiovascular

disease (CVD) within a ten-year period of follow-up. Interest centers on the predicted risk,

and whether a new marker or model can improve the fit of the model. Often one model is

contained in the other, with one or more additional variables added to the usual

predictors. More generally, however, any two models could be compared even if not nested.

2.1 Standard measures

Standard analyses include the evaluation of two components of model accuracy:

discrimination and calibration. Discrimination is typically assessed using the c-statistic,

which calculates the area under the ROC curve for a continuous predictor and binary

outcome. Other properties of the ROC curve may be of interest in particular applications,

such as the partial area under the curve given a threshold of specificity (McClish, 1989).

When comparing models, changes in the c-statistic have been used to evaluate “clinical

utility,” although this use is questionable in the setting of risk prediction (Janes et al., 2008).

In these analyses the c-statistic was estimated using the Wilcoxon rank sum statistic (Hanley

and McNeil, 1982), and the difference in c-statistics was tested using two variance

estimators, one based on DeLong, DeLong and Clarke-Pearson (1988), and the other based

on Rosner and Glynn (2009).
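As an illustrative sketch (in Python; the function name is ours, not from the cited software), the Wilcoxon rank-sum identity computes the c-statistic as the proportion of case–control pairs in which the case receives the higher predicted risk, counting ties as one half:

```python
import numpy as np

def c_statistic(p, y):
    """C-statistic (area under the ROC curve) via the Wilcoxon rank-sum
    identity: the proportion of case/control pairs in which the case has
    the higher predicted risk, with ties counted as one half."""
    p, y = np.asarray(p, float), np.asarray(y, int)
    cases, controls = p[y == 1], p[y == 0]
    # Compare every case with every control.
    greater = (cases[:, None] > controls[None, :]).sum()
    ties = (cases[:, None] == controls[None, :]).sum()
    return (greater + 0.5 * ties) / (cases.size * controls.size)
```

A variance estimator such as that of DeLong, DeLong and Clarke-Pearson (1988) would be layered on top of this point estimate to test a difference in c-statistics between models.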

Calibration is often assessed using the Hosmer-Lemeshow goodness-of-fit test (Hosmer and

Lemeshow, 1980). Decile categories are typically used, and the observed risk (proportion) is

compared to the average predicted risk within each category using a chi-square statistic.

Alternative categories may be used, particularly those based on the estimated risk itself,

similar to the H statistic examined by Hosmer and Lemeshow (1980). In the Women’s

Health Study, more than 90% of the women had risk less than 5%, and interest centered on

the fit among those at higher risk. Thus categories based on values of the predicted risk in

2% increments up to 20% were used (Cook, et al., 2006). Direct comparisons of model

calibration are not typically done, but the statistic is computed for each model separately.

Alternative measures based on smoothed residuals have been proposed (Le Cessie and van

Houwelingen, 1991), but are not evaluated here.
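A minimal sketch of the Hosmer-Lemeshow computation with quantile-based categories (Python; the binning details and names are our choices for illustration):

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(p, y, n_bins=10):
    """Hosmer-Lemeshow goodness-of-fit test: compare observed events to
    the average predicted risk within quantile categories of predicted
    risk (deciles by default)."""
    p, y = np.asarray(p, float), np.asarray(y, int)
    # Inner quantile cut points define the risk categories.
    edges = np.quantile(p, np.linspace(0, 1, n_bins + 1))[1:-1]
    bins = np.digitize(p, edges)          # category index 0 .. n_bins-1
    chi2 = 0.0
    for k in range(n_bins):
        m = bins == k
        n = m.sum()
        if n == 0:
            continue
        o = y[m].sum()                    # observed events in category
        e = n * p[m].mean()               # expected events in category
        chi2 += (o - e) ** 2 / (e * (1 - e / n))
    df = n_bins - 2                       # classic g - 2 degrees of freedom
    return chi2, df, stats.chi2.sf(chi2, df)
```

Fixed cut points based on the predicted risk itself (e.g. 2% increments up to 20%) would replace the quantile edges for the H-statistic variant described above.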



2.2 Risk reclassification

Reclassification derives from placing individuals within risk strata, as is common in medical

applications. These may be pre-determined based on cost-effectiveness considerations, or on

intuitive definition of important risk levels. A reclassification table is a cross-tabulation of

risk categories for two models, and shows how individuals move between categories and fall

within the same or different risk strata (Cook, 2007). A hypothetical example is shown in

Table 1, which shows four estimated risk strata formed from two models fit in a sample of

10,000 individuals. These data were generated from a logistic model with overall probability

of disease equal to 10%. Outcomes were generated from the true model, denoted Model XY,

which included terms for normally-distributed independent variables X and Y with an odds

ratio for X (ORX) of 16.0 per 2 standard deviation units and an OR for Y (ORY) of 3.0 per

2 standard deviation units. Model X included only the term for X and would thus be

expected to provide poorer fit. The most obvious summary of the table is the percent of

individuals who change categories, here 27% overall. This is a function of the agreement

between the two models, and is an estimate of how applying a new model may alter

decisions. Clinicians may also be interested in the percent changing risk category within each initial

risk stratum, which is 9%, 54%, 52%, and 23% in the four strata in Table 1.
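A reclassification table and the percent reclassified can be tabulated as follows (an illustrative Python sketch; the default cut points mirror the four strata used in Table 1):

```python
import pandas as pd

def reclassification_table(p_old, p_new, cuts=(0.05, 0.10, 0.20)):
    """Cross-tabulate risk strata under two models and report the
    proportion of individuals whose risk category changes."""
    edges = [0.0, *cuts, 1.0]
    labels = [f"{100*lo:g}-<{100*hi:g}%" for lo, hi in zip(edges[:-1], edges[1:])]
    old = pd.cut(p_old, edges, labels=labels, right=False)
    new = pd.cut(p_new, edges, labels=labels, right=False)
    table = pd.crosstab(old, new, rownames=["Model 1"], colnames=["Model 2"])
    moved = (old.codes != new.codes).mean()   # proportion changing strata
    return table, moved
```

Row-wise proportions of off-diagonal counts give the percent reclassified within each initial stratum.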

2.3 Reclassification calibration

Whether the changes in category are appropriate, however, is more important than the

percent reclassified. Cook (2008) suggested a reclassification calibration statistic using a

variation of the Hosmer-Lemeshow chi-square goodness-of-fit test. This compares the

observed risk to the average estimated risk within each cross-classified category, and can be

considered an assessment of calibration (D’Agostino, et al., 1997). Let p1ij represent the

predicted risk in the cell in the ith row and the jth column from the first, or standard, model,

and let p2ij represent predicted risk from the new model. The reclassification calibration

(RC) statistic for Model X has the form

$$X^2_{RC1} = \sum_{i,j} \frac{(O_{ij} - n_{ij}\,\bar{p}_{1ij})^2}{n_{ij}\,\bar{p}_{1ij}\,(1-\bar{p}_{1ij})},$$

where O_ij is the observed number of events and n_ij the number of individuals in the ijth cell, p̄_1ij is the mean of p1ij over that cell, and K is the number of cells. To improve the large-sample characteristics, only cells

containing at least 20 individuals are included in the calculation. In addition, we examined

the statistic using only those cells containing an average expected number of cases of at least

5, where the mean probability for the two models was averaged in each cell. This test could

be performed for each model separately, with an analogous definition for X2RC2, the test for

Model XY. The test for Model X is of primary interest, but that for Model XY should be

conducted for comparison. In Table 1, using the estimated probabilities from each model,

the test statistic for Model X is X2RC1 = 102.3 (p<0.0001) and that for Model XY is X2RC2

= 14.9 (p=0.24) both with 12 degrees of freedom. When cells were restricted to those with

an average expected value at least five, the degrees of freedom were reduced to 10 and the

statistics became 70.7 (p<0.0001) for Model X, and 9.3 (p=0.50) for Model XY.
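A sketch of the RC computation (Python; the K − 2 degrees of freedom and the cell-size handling are our simplifications of the procedure described above, with min_cell = 20 matching the restriction in the text):

```python
import numpy as np
from scipy import stats

def rc_statistic(p_model, p_old, p_new, y, cuts=(0.05, 0.10, 0.20), min_cell=20):
    """Reclassification calibration (RC) statistic: a Hosmer-Lemeshow-type
    chi-square comparing observed events with the mean predicted risk of
    `p_model` within each cell of the cross-classification of the two
    models' risk strata; only cells with >= min_cell individuals count."""
    edges = np.asarray(cuts)
    # Encode each (row, column) cell of the reclassification table.
    ids = np.digitize(p_old, edges) * 10 + np.digitize(p_new, edges)
    p_model, y = np.asarray(p_model, float), np.asarray(y, int)
    chi2, k = 0.0, 0
    for c in np.unique(ids):
        m = ids == c
        n = m.sum()
        if n < min_cell:
            continue
        o = y[m].sum()                   # observed events in the cell
        pbar = p_model[m].mean()         # mean predicted risk in the cell
        chi2 += (o - n * pbar) ** 2 / (n * pbar * (1 - pbar))
        k += 1
    df = k - 2                           # assumed K - 2 degrees of freedom
    return chi2, df, stats.chi2.sf(chi2, df)
```

Calling the function once with the predictions from Model X and once with those from Model XY gives the two statistics compared in the text.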

D’Agostino et al. (1997) proposed an adjustment to the classic Hosmer-Lemeshow statistic that corrects for small predicted values by adding a small term to each cell. Its analog for reclassification applies the same correction within each cross-classified cell, yielding an adjusted statistic



which may be more likely to follow a chi-square distribution when there are small expected

numbers within cells. Using data from Table 1, the statistic is 86.8 for Model X (p<0.0001)

and 13.0 for Model XY (p=0.37) for cells of at least size 20, and 68.5 (p<0.0001) and 8.8

(p=0.55) for average expected numbers of at least 5 for models X and XY, respectively.

Pigeon and Heyse (1999) proposed an alternative variation of the classic Hosmer-Lemeshow test, called J2, with K-1 degrees of freedom, which accounts for the under-dispersion within each cell by adding a variance correction φij to each cell. Its analog for reclassification is

$$J^2 = \sum_{i,j} \frac{(O_{ij} - n_{ij}\,\bar{p}_{ij})^2}{\phi_{ij}\, n_{ij}\,\bar{p}_{ij}\,(1-\bar{p}_{ij})},$$

where

$$\phi_{ij} = \frac{\sum_{l \in ij} p_l\,(1-p_l)}{n_{ij}\,\bar{p}_{ij}\,(1-\bar{p}_{ij})}$$

and the summation is over all individuals l within the ijth cell. Hosmer and Lemeshow (2000)

originally examined this form of the statistic, and concluded that the correction was not

necessary for practical purposes. The setting of reclassification, however, typically uses

only 3 or 4 risk strata, and thus the variation of predicted probabilities within strata may

be greater. Using data from Table 1, J2=102.8 (p<0.0001) for Model X and 15.0 (p=0.31) for

Model XY for cell sizes of at least 20. For cells with average expectation of at least five, the

statistics were 71.1 (p<0.0001) and 9.3 (p=0.59) for Models X and XY, respectively. The

performance of all three forms of this statistic, with both types of size restrictions, was

examined in simulations.

2.4 Discrimination improvement

Pencina et al (2008) suggested an alternative approach that divides the reclassification table

into cases and controls. Ideally, predicted risk for cases would move up and that for controls

would move down. They described the net reclassification improvement (NRI) which sums

the improvement among cases and controls. It is computed as

$$NRI = [P(\text{up} \mid D=1) - P(\text{down} \mid D=1)] + [P(\text{down} \mid D=0) - P(\text{up} \mid D=0)],$$

where D = 1 stands for cases and D = 0 stands for controls, and up and down denote movement to a higher or lower risk category under the new model. This measures discrimination or

separation of predicted risk for cases and controls. The margins display the sensitivity and

specificity using the category cutoffs as risk thresholds (Janes et al., 2008). Table 1 shows

the reclassification among cases and controls necessary to compute the NRI. In these data

the NRI using these four risk categories is 8.7% (standard error = 1.8%, p<0.0001).
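The categorical NRI can be computed directly from the movement of cases and controls between strata (an illustrative Python sketch; variable names are ours):

```python
import numpy as np

def nri(p_old, p_new, y, cuts=(0.05, 0.10, 0.20)):
    """Net reclassification improvement: net proportion of cases moving
    to a higher risk stratum plus net proportion of controls moving to a
    lower one."""
    old = np.digitize(p_old, np.asarray(cuts))
    new = np.digitize(p_new, np.asarray(cuts))
    y = np.asarray(y)

    def net_up(mask):
        # P(up) - P(down) within the given group
        return (new[mask] > old[mask]).mean() - (new[mask] < old[mask]).mean()

    return net_up(y == 1) - net_up(y == 0)
```

The asymptotic standard error of Pencina et al. (2008) would be added to this point estimate for testing.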



The value of the NRI depends strongly on the categories used to form risk strata. If the two

middle categories in Table 1 are combined, the NRI is 7.1% (standard error = 1.5%,

p<0.0001). With only two risk strata defined by risk above and below 10%, the NRI is 2.5%

(standard error = 1.1%, p=0.030). In such a setting with only two risk strata, the NRI

represents the improvement in sensitivity plus the improvement in specificity. It would then

be equivalent to the difference in the Youden index, which is sensitivity plus specificity

minus 1. When there are more than 2 risk strata, the NRI adds terms for the incremental

sensitivity and specificity at the additional cut points, and would increase with more

categories. With infinite strata, this is equivalent to testing whether the predicted value is

higher in cases and lower in controls for Model XY vs. Model X. This leads to a simple two-

by-two table of case status vs. an indicator of whether the probability for Model 2 is higher

or lower than that for Model 1. In the example data, the NRI computed in this way is 35.1%

(standard error = 3.4%, p<0.0001).
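The category-free version reduces to comparing the two models' predicted probabilities for each individual (an illustrative Python sketch):

```python
import numpy as np

def continuous_nri(p_old, p_new, y):
    """Category-free ('infinite strata') NRI: any increase in predicted
    risk counts as upward movement, any decrease as downward movement."""
    p_old, p_new, y = map(np.asarray, (p_old, p_new, y))
    up, down = p_new > p_old, p_new < p_old
    case, ctrl = y == 1, y == 0
    return (up[case].mean() - down[case].mean()) + (down[ctrl].mean() - up[ctrl].mean())
```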

When risk strata are based on clinical cut points, there may be interest in sequential testing

or measurement of biomarkers. For example, physicians may be most interested in further

testing patients at intermediate risk of disease, whose clinical course is not yet determined.

In this situation, a ‘clinical’ NRI (cNRI) has been suggested by restricting the measure to the

intermediate risk strata based on Model X (Cook, 2008). However, due to the symmetric

nature of the table, this measure may be biased in its simple form. Under the null situation,

the table would be expected to be symmetric about the axis of identity, which could lead to

an expected value greater than zero. That is, under the null hypothesis of symmetry or

equivalence of the two models, the i,jth and j,ith cells would be expected to be equal, both

overall and among cases and controls, and the expectation for each cell would be (cij + cji)/2.

To correct the bias in the cNRI, one can subtract its expectation under this null hypothesis.

In Table 1 for cases, for example, the expected number for those reclassified from 5–<10%

to 0–<5% would be (25+22)/2 = 23.5 under the null hypothesis of symmetry. The expected

numbers for all cells are shown in Supplementary Table 1. The expected percent of cases in

the 5–<10% and 10–<20% strata for Model X who would move up under symmetry is 27.4%

and who would move down is 19.2%, for an expected reclassification improvement of 8.1%.

The expected reclassification improvement among controls is 11.4%, leading to an expected

cNRI of 19.6% under the null hypothesis of symmetry. Thus, the original cNRI is 32.2% and

the adjusted cNRI is 12.6%. Using the non-null variance estimate, the standard error is 3.8%

(p=0.0008).
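The symmetry correction can be applied cell by cell as described, subtracting the expected net movement computed from cell counts (c_ij + c_ji)/2 (a Python sketch; the restriction to intermediate strata and the normalization by stratum counts are our reading of the procedure):

```python
import numpy as np

def adjusted_cnri(p_old, p_new, y, cuts=(0.05, 0.10, 0.20), intermediate=(1, 2)):
    """Clinical NRI restricted to the intermediate strata of the reference
    model, minus its expectation under the null hypothesis of a symmetric
    reclassification table (expected cell counts (c_ij + c_ji)/2)."""
    old = np.digitize(p_old, np.asarray(cuts))
    new = np.digitize(p_new, np.asarray(cuts))
    y = np.asarray(y)
    K = len(cuts) + 1

    def table(mask):
        t = np.zeros((K, K))
        np.add.at(t, (old[mask], new[mask]), 1.0)
        return t

    def net_up(t):
        # Net upward movement among those starting in the intermediate strata
        rows = list(intermediate)
        n = t[rows, :].sum()
        up = sum(t[i, j] for i in rows for j in range(K) if j > i)
        down = sum(t[i, j] for i in rows for j in range(K) if j < i)
        return (up - down) / n

    t1, t0 = table(y == 1), table(y == 0)
    observed = net_up(t1) - net_up(t0)
    expected = net_up((t1 + t1.T) / 2) - net_up((t0 + t0.T) / 2)
    return observed - expected
```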

Pencina et al. (2008) also describe the integral of sensitivity and specificity over all possible

cutoff values. This leads to the Integrated Discrimination Improvement (IDI) which can be

viewed as the sum of differences in these terms between models. The measure is equal to the

difference in Yates, or discrimination, slopes between models, where the Yates slope is the

difference in average estimated risk between cases and controls in a model. The IDI can then

be written as

$$IDI = (\bar{p}_{2,D=1} - \bar{p}_{2,D=0}) - (\bar{p}_{1,D=1} - \bar{p}_{1,D=0}),$$

where the average is over all predicted probabilities for cases (D = 1) or controls (D = 0) under the new (2) and reference (1) models. Pepe, et al.

(2008a) showed that this is equivalent to change in an R2 measure representing percent of

variance explained in terms of probabilities. While this is true asymptotically (Hu, et al.,

2006), Tjur (2009) shows that the Yates slope (which he calls the coefficient of

discrimination) is exactly equal to a combination of R2 measures. In the example data, the

IDI, or change in R2, is 2.6% (standard error = 0.17%, p<0.0001).
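As the difference in discrimination (Yates) slopes, the IDI is straightforward to compute (an illustrative Python sketch):

```python
import numpy as np

def idi(p_old, p_new, y):
    """Integrated discrimination improvement: the difference between the
    two models' Yates (discrimination) slopes, each being the mean
    predicted risk in cases minus the mean predicted risk in controls."""
    p_old, p_new, y = map(np.asarray, (p_old, p_new, y))

    def slope(p):
        return p[y == 1].mean() - p[y == 0].mean()

    return slope(p_new) - slope(p_old)
```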
