
STATISTICS IN MEDICINE

Statist. Med. (in press)

Published online in Wiley InterScience

(www.interscience.wiley.com) DOI: 10.1002/sim.2929

Evaluating the added predictive ability of a new marker: From area

under the ROC curve to reclassification and beyond

Michael J. Pencina1,∗,†, Ralph B. D’Agostino Sr1, Ralph B. D’Agostino Jr2

and Ramachandran S. Vasan3

1Department of Mathematics and Statistics, Framingham Heart Study, Boston University, 111 Cummington St.,

Boston, MA 02215, U.S.A.

2Department of Biostatistical Sciences, Wake Forest University School of Medicine, Medical Center Boulevard,

Winston-Salem, NC 27157, U.S.A.

3Framingham Heart Study, Boston University School of Medicine, 73 Mount Wayte Avenue, Suite 2, Framingham,

MA 01702-5803, U.S.A.

SUMMARY

Identification of key factors associated with the risk of developing cardiovascular disease and quantification

of this risk using multivariable prediction algorithms are among the major advances made in preventive

cardiology and cardiovascular epidemiology in the 20th century. The ongoing discovery of new risk

markers by scientists presents opportunities and challenges for statisticians and clinicians to evaluate these

biomarkers and to develop new risk formulations that incorporate them. One of the key questions is how

best to assess and quantify the improvement in risk prediction offered by these new models. Demonstration

of a statistically significant association of a new biomarker with cardiovascular risk is not enough. Some

researchers have advanced the view that the improvement in the area under the receiver-operating-characteristic

curve (AUC) should be the main criterion, whereas others argue that better measures of performance of

prediction models are needed. In this paper, we address this question by introducing two new measures,

one based on integrated sensitivity and specificity and the other on reclassification tables. These new

measures offer incremental information over the AUC. We discuss the properties of these new measures

and contrast them with the AUC. We also develop simple asymptotic tests of significance. We illustrate

the use of these measures with an example from the Framingham Heart Study. We propose that scientists

consider these types of measures in addition to the AUC when assessing the performance of newer

biomarkers. Copyright © 2007 John Wiley & Sons, Ltd.

KEY WORDS: discrimination; model performance; AUC; risk prediction; biomarker

∗Correspondence to: Michael J. Pencina, Department of Mathematics and Statistics, Boston University, 111

Cummington Street, Boston, MA 02215, U.S.A.

†E-mail: mpencina@bu.edu

Contract/grant sponsor: National Heart, Lung, and Blood Institute’s Framingham Heart Study; contract/grant numbers:

N01-HC-25195, 2K24 HL 04334

Received 4 April 2007

Accepted 5 April 2007


INTRODUCTION

Over 30 years after the construction of the first multivariable risk prediction model (also called

risk profile model) predicting the probability of developing cardiovascular disease (CVD) [1],

researchers continue to seek new risk factors that can predict CVD and that can be incorporated

into risk assessment algorithms. A general consensus exists that the information regarding an

individual’s age, baseline levels of systolic and diastolic blood pressure and serum cholesterol,

smoking and diabetes status are all useful predictors of the CVD risk over a reasonable time period in

the future, typically over 1–10 years [2–5]. Quantification of vascular risk is accomplished through

risk equations or risk score sheets that have been developed on the basis of observations from

large cohort studies [2–6]. For example, the Framingham risk score has been routinely applied,

validated and calibrated for use in different countries, and for different ethnicities across countries

[7–9]. Various statistical models have been utilized over the decades to develop these equations.

Presently, Cox proportional hazards and Weibull parametric models seem to be among the most

frequently used ones [2–6]. However, CVD risk prediction is an ongoing work in progress. New

risk factors or markers are being identified and proposed constantly, and vie with each other for

consideration for incorporation into risk prediction algorithms. Plasminogen-activator inhibitor

type 1, gamma glutamyl transferase, C-reactive protein (CRP), B-type natriuretic peptide, urinary

albumin-to-creatinine ratio, left ventricular hypertrophy or fibrinogen are only a few examples

from a very long list [2,10–12].

The critical question arises as to how to evaluate the usefulness of a new marker. D’Agostino

[13] lists four initial decisions that guide the process: (1) defining the population of interest; (2)

defining the outcome of interest; (3) choosing how to incorporate the competing pre-existing set of

risk factors and (4) selecting the appropriate model and tests to evaluate the incremental yield of

a new biomarker. This paper focuses on the last issue, assuming that we have adequately defined

or answered the issues described in (1)–(3).

The most basic necessary condition required of any new marker is its statistical significance. It

is hard to imagine that one would argue for an inclusion of a new marker into a risk prediction

formulation if it is not related to the outcome of interest in a statistically significant manner.

Statistical significance, however, does not imply either clinical significance or improvement in

model performance. Indeed, many biomarkers with weak or moderate relations to the outcome of

interest can be associated in a statistically significant fashion if examined using a large enough

sample size.

Evaluation of risk prediction models and adjustments to them require model performance mea-

sures [7,14]. A key measure of the clinical utility of a survival model is its ability to discrim-

inate or separate those who will develop the event of interest from those who will not. Various

measures have been proposed to capture discrimination [14], but the area under the receiver-

operating-characteristic (ROC) curve (AUC) is the most popular metric [14–16]. Its probabilistic

interpretation conveys its discriminatory ability simply and directly: it is the probability that, given

two subjects, one who will develop an event and the other who will not, the model will assign

a higher probability of an event to the former. Its traditional application in the context of binary

outcomes has been extended to the time-to-event models, which are the standard models for CVD

risk prediction [17,18].
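The probabilistic interpretation of the AUC lends itself to a direct empirical check: the AUC equals the proportion of event/non-event pairs in which the event subject receives the higher predicted risk, with ties counted as one half. A minimal sketch in Python, using invented predicted risks rather than any real cohort data:

```python
import itertools

def auc_concordance(p_events, p_nonevents):
    """Estimate the AUC as the probability that a randomly chosen subject
    who develops an event receives a higher predicted risk than a randomly
    chosen subject who does not; ties count as one half."""
    concordant = 0.0
    for pe, pn in itertools.product(p_events, p_nonevents):
        if pe > pn:
            concordant += 1.0
        elif pe == pn:
            concordant += 0.5
    return concordant / (len(p_events) * len(p_nonevents))

# Invented predicted risks: three subjects with events, four without
print(auc_concordance([0.30, 0.22, 0.15], [0.10, 0.18, 0.05, 0.25]))  # 0.75
```

This pairwise form is equivalent to integrating sensitivity over ‘one minus specificity’ across all cut-offs, which is what makes the AUC insensitive to modest shifts in individual predicted risks.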

Researchers, extending existing methodology, began evaluating new markers based on their

ability to increase the AUC. It quickly became apparent that, for models containing standard risk

factors and possessing reasonably good discrimination, very large ‘independent’ associations of


the new marker with the outcome are required to result in a meaningfully larger AUC [19–21].

None of the numerous new markers proposed comes close in magnitude to these necessary levels

of association. In response to this, some scientists have argued that we need to wait for new and

better markers; other researchers have sought model performance measures beyond the AUC to

evaluate the usefulness of markers. Reassignment of subjects into risk categories (reclassification

tables) and predictiveness curves form opposite ends of the spectrum of new ideas [22–24]. These

efforts address Greenland and O’Malley’s suggestion that ‘statisticians should seek new ways,

beyond the ROC curve, to evaluate clinically useful new markers for their ability to improve upon

current models such as the Framingham Risk Score’ [20].

In this paper, we propose two new ways of evaluating the usefulness of a new marker. They

fall somewhere in the middle of the spectrum mentioned above. One is based on event-specific

reclassification tables, and the other on the new model’s ability to improve integrated (average)

sensitivity (IS) without sacrificing integrated (average) specificity.

We start with a careful look at reclassification tables and suggest an objective way of quantifying

improvement in categories through what we term ‘the net reclassification improvement’ or NRI.

This method requires that there exist a priori meaningful risk categories (e.g. 0–6, 6–20, >20

per cent 10-year risk of coronary heart disease [CHD] based on the Third Adult Treatment Panel

[ATP III] risk classification [5]). Then, we extend this idea to the case of no ‘cut-offs’ to define our

second measure, the integrated discrimination improvement (IDI). We propose sample estimators

for both measures and show how the IDI can be estimated by the difference in discrimination

slopes proposed by Yates [25]. We also derive simple asymptotic tests to determine whether the

improvements in our measures are significantly different from zero. As an illustration, we show

a Framingham Heart Study example, in which the NRI and IDI indicate that HDL cholesterol

offers statistically significant improvement in the performance of a CHD model even though no

meaningful or significant improvement in the AUC is observed. Formal mathematical developments

for some identities are presented in the Appendix.

NET RECLASSIFICATION IMPROVEMENT AND INTEGRATED DISCRIMINATION IMPROVEMENT

In this section, we propose two new ways of assessing improvement in model performance offered

by a new marker. The NRI focuses on reclassification tables constructed separately for participants

with and without events, and quantifies the correct movement in categories—upwards for events

and downwards for non-events. The IDI does not require categories, and focuses on differences

between ISs and ‘one minus specificities’ for models with and without the new marker. This section

introduces the concepts in general terms; a more formal discussion is presented in the next section

followed by a practical example.

From AUC to reclassification

The developments concerning the AUC come from applications to diagnostic testing in radiology

[16]. AUC can be defined as the area under the plot of sensitivity vs ‘one minus specificity’ for all

possible cut-off values. This definition has been shown to be equivalent to defining AUC as the

probability that a given diagnostic test (or predictive model in our case) assigns a higher probability

of an event to those who actually have (or develop) events [16]. The improvement in AUC for


a model containing a new marker is defined simply as the difference in AUCs calculated using

a model with and without the marker of interest. This increase, however, is often very small in

magnitude; for example, Wang et al. show that the addition of a biomarker score to a set of standard

risk factors predicting CVD increases the model AUC only from 0.76 to 0.77 [12]. Ware and Pepe

show simple examples in which enormous odds ratios are required to meaningfully increase the

AUC [19,21].

Because of the above, some researchers started looking at different methods of quantifying the

improvement. Reclassification tables have been gaining popularity in medical literature [10,22,23].

For example, Ridker et al. [10] compare a model developed for CVD risk prediction in women

using only standard risk factors (‘old’ model) with a model that also includes parental history of

myocardial infarction and CRP (‘new’ model), and observe a minimal increase in the AUC from

0.805 to 0.808. However, when they classify the predicted risks obtained using their two models

(old and new) into four categories (0–5, 5–10, 10–20, >20 per cent 10-year CVD risk) and then

cross-tabulate these two classifications, they show that about 30 per cent of individuals change their

category when comparing the new model with the old one. They further calculate the actual event

rates for those reclassified and call the reclassification successful if the actual rate corresponds to

the new model’s category.

Unfortunately, reclassification tables constructed and interpreted in this manner offer limited

means of evaluating improvement in performance. Relying solely on the number or percentage

of subjects who are reclassified can be misleading. Additionally, calculating event rates among

the reclassified individuals does not lead to an objective assessment of the true improvement in

classification. For instance, even if we reclassify 100 people from the 10–20 per cent 10-year CVD

risk category into the above 20 per cent group and the ‘actual’ event rate among these individuals

is 25 per cent, we improved the placement of 25 people, but not the remaining 75 who should

have stayed in the lower risk category.

We suggest a different way of constructing and interpreting the reclassification tables. The

reclassification of people who develop and who do not develop events should be considered

separately. Any ‘upward’ movement in categories for event subjects (i.e. those with the event) im-

plies improved classification, and any ‘downward movement’ indicates worse reclassification. The

interpretation is opposite for people who do not develop events. The improvement in reclas-

sification can be quantified as a sum of differences in proportions of individuals moving up

minus the proportion moving down for people who develop events, and the proportion of indi-

viduals moving down minus the proportion moving up for people who do not develop events.

We call this sum the NRI. Equivalently, the NRI can be calculated by computing the difference between the proportions of individuals moving up and moving down among those who develop events, computing the corresponding difference among those who do not develop events, and taking the difference of these two differences. A simple asymptotic test that can be used to determine the significance of the improvement, separately for

event and non-event individuals and combining the two groups (NRI), is presented in the next

section.
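The NRI calculation described above can be sketched in a few lines of code. This is a minimal illustration, not the paper’s implementation; the risk cut-offs and the (old, new) predicted-risk pairs below are invented for the example only:

```python
def category(p, cutoffs=(0.06, 0.20)):
    """Map a predicted probability to an ordinal risk category
    (0: low, 1: intermediate, 2: high) using hypothetical cut-offs."""
    return sum(p >= c for c in cutoffs)

def net_reclassification_improvement(event_pairs, nonevent_pairs):
    """NRI = [P(up|event) - P(down|event)] - [P(up|nonevent) - P(down|nonevent)],
    where up/down are category moves from the old model to the new model."""
    def net_up(pairs):
        up = sum(category(new) > category(old) for old, new in pairs)
        down = sum(category(new) < category(old) for old, new in pairs)
        return (up - down) / len(pairs)
    return net_up(event_pairs) - net_up(nonevent_pairs)

# Invented (old risk, new risk) pairs for subjects with and without events
events = [(0.05, 0.08), (0.18, 0.25), (0.10, 0.12), (0.22, 0.15)]
nonevents = [(0.08, 0.05), (0.15, 0.22), (0.25, 0.18), (0.04, 0.03)]
print(net_reclassification_improvement(events, nonevents))  # 0.5
```

Here two of four events move up and one moves down (net +0.25), while two of four non-events move down and one moves up (net −0.25), giving an NRI of 0.5.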

From reclassification to discrimination slopes

One potential drawback of the reclassification-based measure defined above is its dependence on

the choice of categories. This limitation can be overcome by further extending the concept of the

NRI. If we assign 1 for each upward movement, −1 for each downward movement and 0 for no


movement in categories, the NRI can be expressed as

$$
\frac{\sum_{i \in \text{events}} v(i)}{\#\,\text{events}} \;-\; \frac{\sum_{j \in \text{nonevents}} v(j)}{\#\,\text{nonevents}} \qquad (1)
$$

where v(i) is the above-defined movement indicator. Now consider the categorization so fine

that each person belongs to their own category. Then any increase in predicted probabilities for

individuals with events means upward movement (v(i)=1) and any decrease is a downward

movement (v(i)=−1). In this case, it makes sense to assign to each person the actual difference

in predicted probabilities instead of 1, −1 or 0. If we denote the new model-based predicted probabilities of an event by $\hat{p}_{\text{new}}$ and the old model-based predicted probabilities by $\hat{p}_{\text{old}}$, we have

$$
\frac{\sum_{i \in \text{events}} \left( \hat{p}_{\text{new}}(i) - \hat{p}_{\text{old}}(i) \right)}{\#\,\text{events}} \;-\; \frac{\sum_{j \in \text{nonevents}} \left( \hat{p}_{\text{new}}(j) - \hat{p}_{\text{old}}(j) \right)}{\#\,\text{nonevents}} \qquad (2)
$$

We show in the Appendix that the first term in (2) quantifies improvement in sensitivity and the negative of the second term quantifies improvement in specificity. Also, by rearranging the terms in (2), we observe that it is equivalent to the difference in discrimination slopes introduced by Yates [25, 26] (the discrimination slope can be defined as the difference between the mean predicted probability of an event for those with events and the corresponding mean for those without events).

The difference in model-based discrimination slopes is an important measure of improvement in model performance. As shown in the Appendix, it is a sample equivalent of the difference between the integrated difference in sensitivities and the integrated difference in ‘one minus specificities’ between the new and old models. This integration is over all possible cut-offs. Thus, it quantifies jointly the overall improvement in sensitivity and specificity. In simpler terms, the area under the sensitivity curve is estimated by the mean of the predicted probabilities of an event for those who experience events, and the area under the ‘one minus specificity’ curve is estimated by the mean of the predicted probabilities of an event for those who do not experience events. We suggest the integrated differences in sensitivities and ‘one minus specificities’, and their difference, as another measure of the improvement in performance offered by the new marker. We call the last difference the IDI and estimate it using the difference in discrimination slopes. A simple asymptotic test of significance is provided in the next section.
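The estimator just described can be made concrete: the IDI is the difference in Yates’ discrimination slopes between the new and old models. A minimal sketch, using invented predicted risks rather than the paper’s data:

```python
def discrimination_slope(p_events, p_nonevents):
    """Yates' discrimination slope: mean predicted risk among subjects
    with events minus mean predicted risk among subjects without events."""
    return sum(p_events) / len(p_events) - sum(p_nonevents) / len(p_nonevents)

def idi(new_events, new_nonevents, old_events, old_nonevents):
    """IDI estimated as the difference in discrimination slopes
    between the new model and the old model."""
    return (discrimination_slope(new_events, new_nonevents)
            - discrimination_slope(old_events, old_nonevents))

# Invented predicted risks from the old and new models
old_events, new_events = [0.20, 0.15, 0.30], [0.25, 0.18, 0.33]
old_nonevents, new_nonevents = [0.10, 0.12, 0.08, 0.20], [0.08, 0.10, 0.07, 0.18]
print(idi(new_events, new_nonevents, old_events, old_nonevents))  # about 0.054
```

In this toy example the new model raises the mean predicted risk among events and lowers it among non-events, so both terms of (2) contribute positively to the IDI.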

STATISTICAL PROCEDURES AND CONSIDERATIONS

Net reclassification improvement

Consider a situation in which predicted probabilities of a given event of interest are estimated using

two models that share all risk factors, except for one new marker. Let us categorize the predicted

probabilities based on these two models into a set of clinically meaningful ordinal categories of

absolute risk and then cross-tabulate these two classifications. Define upward movement (up) as a

change into a higher category based on the new model and downward movement (down) as a change

in the opposite direction. If D denotes the event indicator, we define the NRI as

$$
\text{NRI} = \left[ P(\text{up} \mid D=1) - P(\text{down} \mid D=1) \right] - \left[ P(\text{up} \mid D=0) - P(\text{down} \mid D=0) \right] \qquad (3)
$$
