
REVIEW ARTICLE

Anesthesiology 2010; 112:1023–40

Copyright © 2010, the American Society of Anesthesiologists, Inc. Lippincott Williams & Wilkins

David S. Warner, M.D., Editor

Statistical Evaluation of a Biomarker

Patrick Ray, M.D., Ph.D.,* Yannick Le Manach, M.D.,† Bruno Riou, M.D., Ph.D.,‡ Tim T. Houle, Ph.D.§

ABSTRACT

A biomarker may provide a diagnosis, assess disease severity or risk, or guide other clinical interventions such as the use of drugs. Although considerable progress has been made in standardizing the methodology and reporting of randomized trials, less has been accomplished concerning the assessment of biomarkers. Biomarker studies are often presented with poor biostatistics and methodologic flaws that preclude them from providing a reliable and reproducible scientific message. A host of issues are discussed that can improve the statistical evaluation and reporting of biomarker studies. Investigators should be aware of these issues when designing their studies, editors and reviewers when analyzing a manuscript, and readers when interpreting results.

HISTORICALLY, the term biomarker referred to analytes in biologic samples that predict a patient's disease state. However, the term biomarker has evolved over time to any biologic measurement, recently including genomic or proteomic analyses, that could also predict a response to a drug (efficacy, toxicity, or pharmacokinetics) or indicate an underlying physiologic mechanism.1 New biomarkers exploring the cardiovascular system, kidney, central nervous system, inflammation, and sepsis are under the scrutiny of bioengineering companies, and we are witnessing a biomarkers revolution similar to the imaging technique revolution.2 Remarkably, this revolution has already occurred for cancer drugs.1

Assessment of these biomarkers is complex but valuable in perioperative and critical care medicine as markers of diagnosis, disease severity, and risk. Although considerable progress has been made in standardizing the methodology and reporting of randomized trials, less has been accomplished concerning the assessment of diagnostic and prognostic biomarkers. Analysis of the literature, even in prestigious journals, has revealed that the methodologic quality of diagnostic studies is on average poor.3 Recommendations concerning the reporting of diagnostic studies, the Standards for Reporting of Diagnostic Accuracy (STARD) initiative, have been published recently,4 several years after the first recommendations concerning reporting of randomized trials.5 However, these recommendations do not encompass all issues of this rapidly evolving domain.

The purpose of this article was to provide the anesthesiologist with a comprehensive introduction to the problems, potential solutions, and limitations raised by the assessment of the diagnostic properties of modern biomarkers. It is important to appreciate the available statistical and methodologic tools to face the biomarker revolution, either as a clinical investigator or as a consumer of scientific literature. This is no easy task, for we must now look beyond the classic diagnostic indices (sensitivity, specificity, and predictive values) and even beyond the more widely used receiver operating characteristic (ROC) curves by integrating the principles of Bayesian theory. To appreciate these issues, the different roles of a biomarker must first be explored.

Role of a Biomarker

A biomarker may serve different roles (table 1) and, thus, need to accomplish several reporting goals. A biomarker may

* Staff Emergency Physician, Department of Emergency Medicine and Surgery, Groupe Hospitalier Pitié-Salpêtrière, Paris, France, Université Pierre et Marie Curie-Paris 6, Paris, France, and Institut National de la Santé et de la Recherche Médicale (INSERM) U956, Paris, France. † Assistant Professor, Department of Anesthesiology and Critical Care, Groupe Hospitalier Pitié-Salpêtrière. ‡ Professor of Anesthesiology and Critical Care, Chairman, Department of Emergency Medicine and Surgery, Université Pierre et Marie Curie-Paris 6, and INSERM U956. § Research Assistant Professor, Department of Anesthesiology, Wake Forest University School of Medicine, Winston-Salem, North Carolina.

Received from Department of Emergency Medicine and Surgery, Université Pierre et Marie Curie-Paris 6, Institut National de la Santé et de la Recherche Médicale U956, Department of Anesthesiology and Critical Care, Groupe Hospitalier Pitié-Salpêtrière, Assistance Publique-Hôpitaux de Paris, Paris, France, and Department of Anesthesiology, Wake Forest University School of Medicine, Winston-Salem, North Carolina. Submitted for publication December 2, 2009. Accepted for publication December 14, 2009. Support was provided solely from institutional and/or departmental sources. Dr. Patrick Ray received honoraria from Biomérieux SA (Marcy l'Etoile, France), BRAHMS (Clichy, France), and Roche Diagnostics France (Meylan, France). Dr. Riou received honoraria from BRAHMS. James C. Eisenach, M.D., served as Handling Editor for this article.

Address correspondence to Dr. Houle: Department of Anesthesiology, Wake Forest University School of Medicine, Medical Center Boulevard, Winston-Salem, North Carolina 27157-1009. thoule@wfubmc.edu. Information on purchasing reprints may be found at www.anesthesiology.org or on the masthead page at the beginning of this issue. ANESTHESIOLOGY's articles are made freely accessible to all readers, for personal use only, 6 months from the cover date of the issue.


provide a diagnosis or assess severity (or assess a risk). For example, cardiac troponin I is a very sensitive and specific biomarker of myocardial infarction in the postoperative period in noncardiac surgery.6 In contrast, it is considered only as a severity biomarker in pulmonary embolism,9 whereas procalcitonin is considered both as a diagnostic and severity biomarker of infection.8 Biomarkers are often used for risk stratification. For example, blood lactate levels have been proposed for risk stratification of sepsis.14 However, the purposes of the diagnostic and prognostic settings markedly differ. In the diagnostic setting, although unknown, the outcome (the disease) has occurred, whereas in the prognostic setting, the outcome remains to be determined and can only be estimated as a probability or a risk, and the uncertain nature of this outcome should be considered.

There are several important hierarchical steps in demonstrating the clinical interest of a biomarker:

1. Demonstrate that the biomarker is significantly modified in diseased patients as compared with controls.

2. Assess the diagnostic properties of the biomarker.

3. Compare the diagnostic properties of the biomarker to existing tests.

4. Demonstrate that the diagnostic properties of the biomarker increase the ability of the physician to make a decision; this might be difficult to analyze because the timing of diagnosis may be crucial and not easy to identify. For example, although the accuracy of procalcitonin to diagnose postoperative infection after cardiac surgery was lower than that of physicians, procalcitonin enabled the diagnosis to be made earlier.7

5. Assess the usefulness of the biomarker, which should be clearly distinguished from the quality of the diagnostic information provided.15 Assessment of the usefulness mainly involves both characteristics of the test itself (such as cost, invasiveness, technical difficulties, and rapidity) and characteristics of the clinical context (prevalence of the disease, consequences of outcome, cost, and consequences of therapeutic options).

6. Demonstrate that the measurement of the biomarker modifies outcome (intervention studies). For example, several studies nicely demonstrated that a diagnostic strategy based on procalcitonin level reduces antibiotic use for acute respiratory tract infections, exacerbation of chronic obstructive pulmonary disease, and ventilator-associated pneumonia.16 However, intervention studies are lacking for many novel biomarkers or give conflicting results for others.17

For all stages of this process, it is important to understand the pathophysiologic mechanisms involved in the biomarker's synthesis and production, its kinetic properties, and its physiologic effects. For example, brain natriuretic peptide (BNP) is known to be released predominantly from the left cardiac ventricle in response to increased ventricular wall stretch, volume expansion, and overload. Its physiologic role includes systemic and pulmonary arterial vasodilation, promotion of natriuresis and diuresis, and inhibition of the renin-angiotensin-aldosterone system and endothelins. In contrast, the pathophysiologic background of procalcitonin remains poorly understood.

A biomarker may also guide other clinical decisions, particularly concerning the use of drugs. This area is now widely developed in oncology, in which biomarkers are used to predict an efficacy and/or a toxicity response to a drug.1 For example, procalcitonin has been advocated to guide the clinician in deciding the duration of antibiotic therapy,13 and genetic determinants of the metabolic activation of clopidogrel have been shown to modulate the clinical outcome of patients treated with clopidogrel after an acute myocardial infarction.12 Finally, a biomarker can be used as a surrogate endpoint in a clinical trial,1,18 but this issue is beyond the scope of this review.

Table 1. The Main Roles of a Biomarker

Diagnosis of a disease: to make a diagnosis more reliably, more rapidly, or more inexpensively than available methods. Examples: troponin Ic diagnoses myocardial infarction6; procalcitonin diagnoses bacterial infection7.

Severity assessment: to identify a subgroup of patients with a severe form of a disease associated with an increased probability of death or severe outcome. Examples: procalcitonin identifies severe outcome in septic patients8; troponin Ic identifies severe outcome in patients with pulmonary embolism9.

Risk assessment: to identify a subgroup of patients who may experience a better (or worse) outcome when exposed to an intervention. Examples: brain natriuretic peptide and postoperative outcome in noncardiac surgery10; troponin and long-term outcome in cardiac surgery11.

Prediction of drug effects: to identify the pharmacological response of a patient exposed to a drug (efficacy, toxicity, and pharmacokinetics). Example: efficacy of clopidogrel15.

Monitoring: to assess the response to a therapeutic intervention. Example: procalcitonin may guide antibiotic duration13.

The Bayesian Approach

One way to conceptualize the utility of a biomarker is its value for enhancing our existing knowledge in predicting the probability of some outcome (e.g., disease state, prognosis). In this regard, Bayesian statistical methods provide a powerful system from which to update existing information about the likelihood of the occurrence of some disease or prognosis. A comprehensive introduction to these methods is far beyond the scope of this review, but the interested reader is referred to one of the several textbooks that have been written on applying Bayesian statistical methods to medical problems.19

Bayes’theoremusestwotypesofinformationtocompute

a predicted probability of the outcome. First, the prior prob-

ability (or pretest probability) of the outcome must be con-

sidered. For biomarker studies involving the diagnosis of a

disease, this can be akin to the general prevalence of the

disease in the population under study, or what we know

about the base rate of the disease for this individual without

any additional information. This information is combined

with the predictive power of the biomarker (i.e., the abil-

ity of the test to discriminate between disease states) to

adjust our prediction of the likelihood of the outcome.

Stated simply, the predicted probability of a patient hav-

ing the disease (posttest probability) can be calculated

as20: posttest probability ? (pretest probability) ? (pre-

dictive power of the evidence).

Numerical examples of this calculation are given in Likelihood Ratios, where likelihood ratios (LHRs) are discussed. However, the use of Bayes' theorem to "update" our expectation of the presence of a disease is illustrated using Fagan's nomograms (fig. 1).21 In these examples, it can be seen how disease prevalence (pretest probability) is used in conjunction with the LHR (strength of evidence) to calculate an updated (posttest) probability of the disease.
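For a dichotomized biomarker, this update can be sketched in a few lines using the odds form of Bayes' theorem, which is the calculation that Fagan's nomogram performs graphically (the function name and example values are ours, not from the article):

```python
def posttest_probability(pretest_prob: float, lhr: float) -> float:
    """Update a pretest probability with a likelihood ratio (LHR)
    using the odds form of Bayes' theorem."""
    pretest_odds = pretest_prob / (1.0 - pretest_prob)   # probability -> odds
    posttest_odds = pretest_odds * lhr                   # Bayes update
    return posttest_odds / (1.0 + posttest_odds)         # odds -> probability

# A patient with a pretest probability of 0.50 and biomarkers
# with likelihood ratios of 2, 10, and 50 (as in fig. 1C):
for lhr in (2, 10, 50):
    print(lhr, round(posttest_probability(0.50, lhr), 3))
```

With these inputs, the posttest probabilities are approximately 0.67, 0.91, and 0.98, which is the update the nomogram lines trace graphically.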

Although this is a powerful method for updating assumptions given the available information, there are many instances where an appropriate estimate of the pretest probability is not known (or agreed on). In such cases, different physicians might have different estimates of the probability of the disease for a given patient. Further, information may actually be available for one or more risk factors, but the unique combination of these factors may obscure the subjective probability for a given patient. For these applications, the sensitivity of the expectation can be checked against a range of assumptions (see Reclassification Table for more details).

Fig. 1. Fagan nomogram using the Bayesian theory showing the pretest and posttest probabilities and the likelihood ratio. (A) A straight line is applied for a low pretest probability (0.20) for a good biomarker with a positive likelihood ratio of 10, providing a posttest probability of 0.80; the important change in probability suggests a change for the physician (diagnostic or therapeutic). (B) In contrast, when the same good biomarker is applied to a patient with a high pretest probability (0.80), the posttest probability is more than 0.95, but this may not represent an important change for the physician. (C) The effects of several biomarkers with different likelihood ratios (2, 10, and 50) in a patient with a pretest probability of 0.50. The nomogram is reprinted with permission from Fagan.21

Statistical Tools

Decision Matrix

The diagnostic performance of a biomarker is often evaluated by its sensitivity and specificity. Sensitivity is the ability to detect a disease in patients in whom the disease is truly present (i.e., a true positive), and specificity is the ability to rule out the disease in patients in whom the disease is truly absent (i.e., a true negative). Calculation of these indices requires knowledge of a patient's "true" disease state and a dichotomous prediction based on the biomarker (i.e., disease is predicted to be present or absent) to construct a 2 × 2 contingency table. Table 2 displays how the frequency of predictions from a sample of patients could be used in conjunction with their known disease state to calculate sensitivity and specificity.

Although sensitivity and specificity are the most commonly provided variables in diagnostic studies, they do not directly apply to many clinical situations, because the physician would rather know the probability that the disease is truly present or absent when the diagnostic test is positive or negative than the probability of a positive test given the presence of the disease (sensitivity). These former, more clinically interesting probabilities are provided by the positive predictive value and negative predictive value. Table 2 presents the calculation of these predictive indices.

The diagnostic accuracy of a test is the proportion of correctly classified patients (i.e., the sum of true positive and true negative tests). Perhaps because it is the most intuitive index of diagnostic performance, diagnostic accuracy is sometimes reported as a global assessment of the test. However, the use of this index for this purpose is inherently flawed and produces unsatisfactory estimates under a range of situations, such as when the prevalence of the disease substantially deviates from 50%.22 It is recommended that authors report more than just a single estimate of diagnostic accuracy.

The Youden index, Y = sensitivity + (specificity - 1), represents the difference between the diagnostic performance of the test and the best possible performance23 (sometimes called the "regret," defined as the utility loss because of uncertainty about the true state).24 Interestingly, accuracy is actually a weighted average of sensitivity and specificity, using as weight the prevalence of the disease. It should be clear that these five indices (sensitivity, specificity, negative and positive predictive values, and accuracy) are partially redundant, because knowing three of them enables the calculation of the rest.

Influence of Prevalence

Although sensitivity and specificity are not markedly influenced by the prevalence of the disease, negative predictive value, positive predictive value, and accuracy are affected by prevalence. Figure 2 shows the influence of prevalence on the various diagnostic indices. This issue is of paramount importance because disease prevalence can markedly differ from one population to another. The prevalence of sepsis in an intensive care unit is high compared with that in the emergency department, and the positive predictive value and accuracy of procalcitonin for sepsis might markedly differ between these two settings. For example, Falcoz et al.25 reported that a 1 ng/ml procalcitonin had a positive predictive value of 0.63 for predicting postoperative infection after thoracic surgery. However, in that study, the prevalence of infection was 16%. If a post hoc analysis had been conducted restricting the scope of inclusion only to patients with systemic inflammatory response syndrome criteria, the prevalence would have been 63% and the positive predictive value 0.90.

Although the mathematical calculations of sensitivity and specificity are not necessarily altered by prevalence, certain clinical situations may foster higher estimates.26,27 These indices can be influenced by case mix, disease severity, or risk factors for disease.27 For example, a biomarker is likely to be more sensitive among more severe than among milder cases of the disease. The sensitivity of procalcitonin to diagnose bacterial infection is greater in patients with meningitis than in patients with pyelonephritis.28
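The dependence of the predictive values on prevalence follows directly from Bayes' theorem and can be reproduced with a short calculation. The sketch below uses the figure 2 parameters (sensitivity 0.80, specificity 0.60); the function name is ours:

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Expected positive and negative predictive values of a test with the
    given sensitivity and specificity, applied at a given disease prevalence."""
    tp = sensitivity * prevalence              # true positives (per unit population)
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    tn = specificity * (1 - prevalence)        # true negatives
    fn = (1 - sensitivity) * prevalence        # false negatives
    return tp / (tp + fp), tn / (tn + fn)      # (PPV, NPV)

for prevalence in (0.05, 0.25, 0.50, 0.75, 0.95):
    ppv, npv = predictive_values(0.80, 0.60, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

As in figure 2B, the positive predictive value climbs and the negative predictive value falls as prevalence increases, while sensitivity and specificity stay fixed.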

Likelihood Ratios

LHRs are another way of describing the prognostic or diagnostic value of a biomarker. Although we call them "diagnostic" LHRs, these ratios are LHRs in the true statistical sense and correspond to the ratios of the likelihood of the observed test result in the diseased versus nondiseased populations. Two dimensions of accuracy have to be considered: the LHR for a positive test (positive LHR) and the LHR for a negative test (negative LHR). One of the most interesting features of LHRs is that they quantify the increase in knowledge about the presence of disease that is gained through the diagnostic test. Thus, LHRs could also be referred to as Bayes factors; we could demonstrate that using the following formulas: posttest probability of disease = (positive LHR) × (pretest probability of disease), or posttest probability of nondisease = (negative LHR) × (pretest probability of nondisease).

Table 2. The Diagnostic Matrix and the Derivation of Main Diagnostic Parameters

                      Disease present      Disease absent       Total
Biomarker positive    a (true positive)    b (false positive)   a + b
Biomarker negative    c (false negative)   d (true negative)    c + d
Total                 a + c                b + d                a + b + c + d

Prevalence = (a + c)/(a + b + c + d); sensitivity = a/(a + c); specificity = d/(b + d); positive predictive value = a/(a + b); negative predictive value = d/(c + d); accuracy = (a + d)/(a + b + c + d); Youden index = sensitivity + specificity - 1; positive likelihood ratio (LHR+) = sensitivity/(1 - specificity); negative likelihood ratio (LHR-) = (1 - sensitivity)/specificity; diagnostic odds ratio = (ad)/(bc) = (LHR+)/(LHR-).

LHR = likelihood ratio.
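The derivations collected in table 2 can be packaged into one small function (a sketch; the function name, dictionary keys, and example counts are ours):

```python
def diagnostic_indices(a: int, b: int, c: int, d: int) -> dict:
    """Main diagnostic parameters from a 2 x 2 table where a = true positives,
    b = false positives, c = false negatives, d = true negatives (table 2)."""
    n = a + b + c + d
    sens = a / (a + c)
    spec = d / (b + d)
    return {
        "prevalence": (a + c) / n,
        "sensitivity": sens,
        "specificity": spec,
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "accuracy": (a + d) / n,
        "youden": sens + spec - 1,
        "lhr_pos": sens / (1 - spec),
        "lhr_neg": (1 - sens) / spec,
        "diagnostic_odds_ratio": (a * d) / (b * c),
    }

# 200 patients: 80 true positives, 40 false positives,
# 20 false negatives, 60 true negatives
indices = diagnostic_indices(80, 40, 20, 60)
```

For this hypothetical sample, the sensitivity is 0.80, the specificity 0.60, the positive LHR 2.0, and the diagnostic odds ratio 6.0, illustrating how the single table yields all of the partially redundant indices at once.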

Furthermore, LHRs are not dependent on disease prevalence and, thus, are considered a robust global measure of the diagnostic properties of a test, and they can be used with tests that have more than two possible results (see interval LHR).

More pragmatically, the positive LHR ranges from 1.0 to infinity and the negative LHR from 0 to 1.0. An uninformative test having no relation with the disease has an LHR of 1.0, whereas a perfect test would have a positive LHR equal to infinity and a negative LHR of 0 (table 3). For example, a common-sense translation of a positive LHR of 8.6 for a plasma soluble triggering receptor expressed on myeloid cells 1 value exceeding 60 ng/ml is that this value is obtained approximately nine times more often from a patient with sepsis than from a patient without sepsis.29 Experts usually consider that tests with a positive LHR greater than 10 (or a negative LHR less than 0.1) have the potential to alter clinical decisions. Although it could be tempting to follow definitive rules-of-thumb for interpretation (such as those provided in table 3), we must primarily consider the clinical setting to determine what level of increased likelihood is clinically relevant to improve the management of patients.

Revisiting the nomograms in figure 1, several important issues become clear concerning diagnostic LHRs. First, no change in prediction (expectation) is possible without a strong LHR. As might be expected, when an LHR provides no added information (e.g., LHR = 1.0), the pretest probability equals the posttest probability. Second, the pretest probability greatly influences what can be learned from using even a very predictive biomarker (e.g., LHR = 10.0). Very high (or low) pretest probabilities result in smaller adjusted expectations than those less extreme.

Fig. 2. Effects of prevalence of a disease on main diagnostic variables in a simulated population (n = 1,000) in an ideal world. A biomarker with a sensitivity of 0.80 and a specificity of 0.60 was considered, where each point on the horizontal axis corresponds to a different prevalence (from 0.05 to 0.95, step of 0.05). The effects of prevalence on sensitivity (Se) and specificity (Sp) (A), positive predictive value (PPV) and negative predictive value (NPV) (B), accuracy (C), and likelihood ratios (D) can be seen in each of the panels. LHR- = negative likelihood ratio; LHR+ = positive likelihood ratio.

Table 3. Rule of Thumb: Correspondence between Accuracy, Positive (LHR+) and Negative (LHR-) Likelihood Ratio, and Area under the Receiver Operating Characteristics Curve (AUCROC) and the Diagnostic Value of a Biomarker

                              Accuracy     LHR+    LHR-      AUCROC
Excellent diagnostic value    >0.90        >10     <0.1      >0.90
Good diagnostic value         0.75–0.90    5–10    0.1–0.2   0.75–0.90
Poor diagnostic value         0.50–0.75    1–5     0.2–1     0.50–0.75
No diagnostic value           0.50         1       1         0.50

ROC Curve

Basics of the ROC Curve. The ROC curve is formed from a series of pairs (proportion of true positive results; proportion of false positive results), that is, (sensitivity; 1 - specificity): the sequence of points obtained for different cutoff points can be joined to draw the empirical ROC curve, or a smooth curve can be obtained by appropriate fitting, usually using the binomial distribution (fig. 3).30,31 In other words, the positive LHRs calculated at various values of the diagnostic test can be plotted to produce a ROC curve, and thus, a ROC curve is a graphical way of presenting the information contained in the table of LHRs. Graphically, the positive LHR is the slope of the line through the origin (sensitivity = 0; 1 - specificity = 0) and a given point on the ROC curve, whereas the negative LHR is the slope of the line through the point opposite to the origin (sensitivity = 1; 1 - specificity = 1) and that given point on the ROC curve.

The area under the receiver operating characteristic curve (AUCROC) (also called the c statistic or c index) is equivalent to the probability that the biomarker is higher for a diseased patient than for a control and, thus, is a measure of discrimination. By convention, ROC curves should be presented above the identity line (fig. 3), which represents a test without any value and which performs like chance. It is important to note that the following points belong to the identity line: of course sensitivity = 0.50 and specificity = 0.50, but also sensitivity = 0.90 and specificity = 0.10, and sensitivity = 0.10 and specificity = 0.90. This enables us to understand that sensitivity cannot be interpreted without specificity. The AUCROC should be reported with confidence intervals (CIs) to allow statistical evaluation versus the identity line or statistical comparison versus other diagnostic tests (see Comparison of ROC Curves). Usually, biomarkers are considered as having good discriminative properties when the AUCROC is higher than 0.75 and excellent properties when it is higher than 0.90 (table 3). The ROC curve is a global assessment of the test accuracy without any a priori hypothesis concerning the cutoff chosen, is relatively independent of prevalence, and is a simple plot that is easily appreciated visually. However, the cutoff points and the number of patients are not typically presented (although a small sample size is easily detected by a jagged and bumpy ROC curve). The generation of a ROC curve is no longer cumbersome because most statistical software provides the calculation and display of the relevant parameters.
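The interpretation of the AUCROC as the probability that a diseased patient has a higher biomarker value than a control suggests a direct way to compute it. Below is a sketch of the empirical (Mann-Whitney-style) estimate; the function name and the biomarker values are ours:

```python
def empirical_auc(diseased, controls):
    """Empirical AUC: the proportion of (diseased, control) pairs in which
    the diseased patient has the higher biomarker value (ties count 1/2)."""
    wins = 0.0
    for x in diseased:
        for y in controls:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(diseased) * len(controls))

# Hypothetical biomarker values (arbitrary units)
auc = empirical_auc([310, 270, 420, 380], [120, 240, 180, 300])
```

An AUC of 1.0 indicates perfect separation of the two populations, whereas 0.5 corresponds to a test that performs like chance (the identity line).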

There are three common summary measures for the accuracy described by a ROC curve. The first is simply to report the pair of values (sensitivity and specificity) associated with a chosen cutoff point. The second is the AUCROC, and the third is the area under a portion of the curve (partial area) for a prespecified range of values. Interpreting the AUCROC is somewhat problematic because a substantial portion of the variance in this index comes from values of the biomarker of no clinical relevance. One ROC curve may have a higher proportion of false positives than another in the region of clinical interest, but the two ROC curves may cross, leading to different conclusions when curves are compared on the basis of the entire area. Therefore, it is recommended that examination of the ROC curve be conducted in the context of the partial area, or the average sensitivity over a range of clinically relevant proportions of false positives, in addition to the AUCROC.32

Fig. 3. Receiver operating characteristics (ROC) curve showing the relationship between sensitivity (true positive rate) and 1 - specificity (false positive rate) in determining the predictive value of brain natriuretic peptide (BNP) for cardiogenic pulmonary edema in elderly patients (≥65 yr) admitted to the emergency department for acute dyspnea. (A) The empirical ROC curve is shown by the line containing points that correspond to different cutoffs; the area under the empirical ROC curve was 0.87 (95% confidence interval 0.80–0.91); the ROC curve is also shown by the continuous line, which was fitted to the binomial distribution. The dotted line is the identity line. (B) The best cutoff, chosen as the one that minimizes the mathematical distance (d) to the ideal point (sensitivity = specificity = 1), corresponded to a concentration of BNP of 250 pg/ml with a sensitivity of 0.78 and a specificity of 0.90. But the best cutoff should preferably be chosen as the one that maximizes the distance (j) between the ROC curve and the identity line, for example, the one that maximizes the Youden index (sensitivity + specificity - 1), which, in the present case, provided the same cutoff value. These two options do not take into account the prevalence and the cost-benefit analysis (see text). Adapted from data from Ray et al.31

Comparison of ROC Curves. The first step for any ROC curve comparison should be a visual inspection of their graphical representation. This inspection allows the evaluation of large differences between the AUCROC values and the detection of situations where ROC curves cross. However, formal

statistical testing is required to assess differences between the

curves. Several different approaches are possible, and all must take into consideration the nature of the collected data. When the predictive value of a new biomarker is compared with an existing standard(s), two or more empirical curves are constructed based on tests performed on the same individuals. Statistical analysis of differences between these curves must take into account the fact that one individual is contributing two scores to the analysis. Most biomarker studies collect data that are paired (i.e., measurements are correlated) in nature. Parametric approaches to these comparisons assume that there is a continuous spectrum of possible values of the biomarker for both diseased and nondiseased patients (generally true with biomarkers) and that the underlying distribution is Gaussian (normal). However, this assumption is often not tenable in biomarker studies. Despite this, paired parametric methods of ROC comparison are often used to evaluate biomarkers, using an approach described by Hanley and McNeil.33 An alternative nonparametric paired method, described by DeLong et al.,34 is based on the Mann–Whitney U statistic. The two approaches yield similar estimates even in nonbinormal models.35

Two main limitations must be considered for global comparisons of ROC curves. First, this way of comparing two ROC curves is not precise, specifically when the two ROC curves cross each other. Second, many cutoffs on the ROC curves are not considered in practice because their associated specificity and sensitivity are not clinically relevant. To reduce the impact of these limitations, comparisons of the partial AUCROC within a specific range of specificity for two correlated ROC curves have been developed and might be interesting to consider for some biomarkers.36

Finally, to maximize the generalizability of the observed data, resampling methods have been proposed to compare ROC curves. This modern approach is now easy to conduct with the increase in computing power and seems to provide more accurate results for small sample sizes.
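A minimal sketch of such a resampling comparison is given below, assuming two biomarkers measured on the same patients. This paired percentile bootstrap, with helper names of our own, is only one of several possible schemes:

```python
import random

def auc_with_labels(scores, labels):
    """Empirical AUC from scores and labels (1 = diseased, 0 = control)."""
    diseased = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in diseased for y in controls)
    return wins / (len(diseased) * len(controls))

def bootstrap_auc_difference(scores_a, scores_b, labels, n_boot=2000, seed=1):
    """Paired bootstrap of the difference in AUC between two biomarkers
    measured on the same patients; returns the observed difference and
    a 95% percentile interval."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients, keeping pairing
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < n:  # the resample must contain both diseased and controls
            diffs.append(auc_with_labels([scores_a[i] for i in idx], lab)
                         - auc_with_labels([scores_b[i] for i in idx], lab))
    diffs.sort()
    observed = auc_with_labels(scores_a, labels) - auc_with_labels(scores_b, labels)
    return observed, (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)])
```

If the 95% percentile interval excludes 0, the two AUCs differ at roughly the 5% level; because patients are resampled as whole pairs, the correlation between the two markers is preserved without any binormal assumption.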

Determination of Cutoff. The ROC curve is used to determine a clinical cutoff point with which to make a clinical discrimination. The method used to choose this cutoff is crucial but unfortunately not always reported in published studies.37 In some situations, we do not wish to (or cannot) privilege either sensitivity (identifying diseased patients) or specificity (excluding control patients), and thus, the cutoff point is chosen as the one that could minimize misclassification. Two techniques are often used to choose an "optimal" cutoff. The first (I) minimizes the mathematical distance between the ROC curve and the ideal point (sensitivity = specificity = 1) and thus intuitively minimizes misclassification. The second (J) maximizes the Youden index (sensitivity + [specificity - 1]) and thus intuitively maximizes appropriate classification (fig. 3).38 Interestingly, Perkins et al.39 present a sophisticated argument that the J point should be preferred, because I does not rely solely on the rate of misclassification but also on an unperceived quadratic term that is responsible for the observed differences between I and J.39

However, the use of I or even J may not be satisfactory, for two main reasons. First, this equipoise decision, which privileges neither sensitivity nor specificity, is valid only when the prevalence is 0.50; in other situations, the prevalence should be taken into account. Second, in many clinical situations, the researcher may wish to privilege either sensitivity or specificity because the consequences of false-positive and false-negative results are not equivalent in terms of the cost–benefit relationship. For example, it is clearly more crucial to rule out bacterial meningitis than to avoid treating with antibiotics a patient who has viral meningitis, or at least meningitis due to enterovirus.40 The researcher should assign a relative cost (financial or health cost, from the patient, care provider, or society point of view) of a false-positive versus a false-negative result and consider the prevalence (P); these elements can be combined to calculate a slope m = (false-positive cost/false-negative cost) × ([1 − P]/P), the operating point on the ROC curve being the one that maximizes the function sensitivity − m × (1 − specificity).15,41 Other methods use the ratio of the net cost of treating controls to the net benefit of treating diseased individuals, together with the prevalence.42,43
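A hedged sketch of this cost-weighted choice (function and variable names are illustrative; `cuts`, `sens`, and `spec` are assumed to be the empirical ROC coordinates at each candidate cutoff):

```python
import numpy as np

def cost_weighted_cutoff(cuts, sens, spec, fp_cost, fn_cost, prevalence):
    # Slope m combines the relative misclassification costs with the
    # prevalence P: m = (C_FP / C_FN) * ((1 - P) / P).
    m = (fp_cost / fn_cost) * ((1 - prevalence) / prevalence)
    # The operating point is the cutoff maximizing
    # sensitivity - m * (1 - specificity).
    return cuts[np.argmax(sens - m * (1 - spec))]
```

With equal costs and a prevalence of 0.50, m = 1 and the rule reduces to the Youden index; making false negatives costlier lowers m and pushes the chosen cutoff toward higher sensitivity.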

In any case, the following recommendations should be followed: (1) the choice made by the researcher must be clearly explained and justified; (2) the choice (or at least its methodology) must be decided a priori; (3) the ROC curve should be provided to allow readers to form their own opinion; and (4) the cutoff that maximizes the Youden index should also be indicated. It remains clear that a data-driven choice of cutoff tends to exaggerate the diagnostic performance of the biomarker.44 This bias should be recognized and probably concerns many published studies.

Surprisingly, although the cutoff point has a crucial role in the decision process, it is provided in most (if not all) studies without any CI. This may constitute a major methodologic flaw, particularly in small-sample studies, because the cutoff point may be markedly influenced by the values of very few patients, even when the CIs of the sensitivity and specificity associated with that cutoff are reported.45,46 The absence of a CI is probably related to the fact that more sophisticated statistical methods are required. The principle of these methods is to resample the studied population repeatedly, providing a large sample of different populations and hence a large sample of cutoff points, and thus a mean (or median) with its 95% CI. Several resampling techniques can be used (bootstrap, jackknife, leave-one-out, n-fold sampling).47,48 In a recent study, Fellahi et al.49 used a bootstrap technique to provide medians and 95% CIs for cutoff points of troponin Ic in patients undergoing various types of cardiac surgery, enabling the comparison of these different cutoff points. Here again, reporting CIs enables the researcher and the reader to honestly communicate or understand the values presented, taking into account the sample size.
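A sketch of such a bootstrap CI for a Youden-based cutoff (our illustration, not the technique of Fellahi et al.; it assumes higher scores indicate disease):

```python
import numpy as np

def youden_cutoff(scores, labels):
    # Cutoff maximizing the Youden index, with "positive" meaning score >= cutoff.
    cuts = np.unique(scores)
    n_pos = (labels == 1).sum()
    n_neg = (labels == 0).sum()
    j = [((scores >= c) & (labels == 1)).sum() / n_pos
         + ((scores < c) & (labels == 0)).sum() / n_neg - 1
         for c in cuts]
    return cuts[int(np.argmax(j))]

def cutoff_ci(scores, labels, n_boot=1000, seed=0):
    # Stratified bootstrap: resample cases and controls separately,
    # recompute the cutoff on each resample, and report the median and a
    # percentile 95% CI for the cutoff itself.
    rng = np.random.default_rng(seed)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    boots = []
    for _ in range(n_boot):
        s = np.concatenate([rng.choice(pos, pos.size), rng.choice(neg, neg.size)])
        l = np.concatenate([np.ones(pos.size, dtype=int), np.zeros(neg.size, dtype=int)])
        boots.append(youden_cutoff(s, l))
    return np.median(boots), np.percentile(boots, [2.5, 97.5])
```

In a small sample, the width of this interval makes visible how strongly the chosen cutoff depends on a few patients.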

The Gray Zone. Another option for clinical discrimination is to avoid providing a single cutoff that dichotomizes the population and instead to propose two cutoffs separated by a "gray zone" (fig. 4). The first cutoff is chosen to exclude the diagnosis with near-certainty (i.e., it privileges specificity), and the second cutoff is chosen to include the diagnosis with near-certainty (i.e., it privileges sensitivity). When the value of the biomarker falls into the gray zone between the two cutoffs, uncertainty exists, and the physician should pursue the diagnosis using additional tools. This approach is probably more useful from a clinical point of view and is now more widely used in clinical research. Moreover, the two cutoffs and the gray zone define three intervals of the biomarker, each of which can be associated with its own LHR. In that case, the positive LHR of the interval above the gray zone is used to include the diagnosis, and the negative LHR of the interval below the gray zone to exclude it. This interesting option, often called the interval LHR,50 results in less loss of information and less distortion than choosing a single cutoff, providing an advantage in interpretation over a binary outcome and allowing the clinician to interpret the results more thoroughly, improving clinical decision-making.

Here again, the 95% CIs of the cutoff points may be calculated using a resampling method,47,48 and the rules for choosing the cutoffs should be determined a priori and clearly explained and justified.
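The interval LHR for the three regions defined by the two cutoffs can be sketched as follows (illustrative code, under the assumption that higher values indicate disease):

```python
import numpy as np

def interval_lhr(scores, labels, low_cut, high_cut):
    # Likelihood ratio attached to each interval defined by the two cutoffs:
    # LHR = P(result in interval | diseased) / P(result in interval | control).
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    lhrs = {}
    for name, lo, hi in (("below", -np.inf, low_cut),
                         ("gray", low_cut, high_cut),
                         ("above", high_cut, np.inf)):
        p_pos = ((pos >= lo) & (pos < hi)).mean()
        p_neg = ((neg >= lo) & (neg < hi)).mean()
        lhrs[name] = p_pos / p_neg if p_neg > 0 else float("inf")
    return lhrs
```

An LHR well above 1 for the "above" interval supports inclusion of the diagnosis, an LHR well below 1 for the "below" interval supports exclusion, and an LHR near 1 in the gray zone conveys the residual uncertainty.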

Reclassification Table

Biomarkers' abilities to predict a disease are commonly evaluated using ROC curves. The improvement in AUCROC for a model containing the new biomarker is defined simply as the difference between the AUCROC calculated using models with and without the biomarker of interest. This increase, however, is often very small in magnitude. Ware et al.51 and Pepe et al.52 describe examples in which large odds ratios are required to meaningfully increase the AUCROC. As a consequence, many risk factors that we know to be clinically important may not affect the c-statistic very much. Thus, the ROC curve approach might be considered insensitive for evaluating the gain provided by biomarkers.52 Furthermore, ROC curves are frequently not helpful for evaluating biomarkers because they do not provide information about actual risks or the proportion of participants who have high- or low-risk values. Moreover, when comparing ROC curves for two biomarkers, the models are aligned according to their false-positive rates (that is, different risk thresholds are applied to the two models to achieve the same false-positive rate), which might be considered inappropriate.53 In addition, the AUCROC or c-statistic has poor clinical relevance: clinicians are never asked to compare risks for a pair of patients of whom one will eventually have the event and one will not.

To complement the results obtained with ROC curves, new approaches to evaluating risk prediction have been proposed. One of the most interesting is the risk stratification table. This methodology focuses more directly on the key purpose of risk prediction, which is to classify individuals into clinically relevant risk categories. Pencina et al.54 have recently proposed two ways of assessing improvement in model performance using reclassification tables: the Net Reclassification Index (NRI) and the Integrated Discrimination Improvement (IDI). The NRI approach enables us to assess the ability of a biomarker to modify risk strata and alter clinical decisions. It requires a predefined risk stratification, usually expressed in several strata (≥2); the use of 3 strata (high, intermediate, and low risk) is probably the most easily handled for routine clinical management.55 The NRI is the combination of four components: the proportions of individuals with events who move up or down a category and the proportions of individuals without events who move up or down a category. Because the NRI and its four components may be affected by the choice of risk strata, a lack of clear agreement on the clinically important categories can be problematic when using the NRI to assess new biomarkers. This concern is shared with the Hosmer-Lemeshow test. Again, prevalence, predictive values, cost, and benefit should probably be considered to make clinically relevant decisions.56 In contrast, the IDI does not require predefined strata; it can be seen as a continuous version of the NRI, with differences in the predicted probability of disease used in place of predefined strata. Alternatively, it can be defined as the difference between the mean predicted probabilities of events and of nonevents.
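A compact sketch of both indices (our illustration of the general idea, not code from Pencina et al.; `p_old` and `p_new` are assumed to be predicted risks from the models without and with the biomarker, and `strata=(0.1, 0.3)` is a hypothetical three-category stratification):

```python
import numpy as np

def nri_idi(p_old, p_new, events, strata=(0.1, 0.3)):
    # Risk categories from the a priori strata boundaries (3 categories here).
    cat_old = np.digitize(p_old, strata)
    cat_new = np.digitize(p_new, strata)
    ev, ne = events == 1, events == 0
    # NRI: net proportion of events reclassified upward plus net proportion
    # of nonevents reclassified downward.
    nri = (((cat_new > cat_old)[ev].mean() - (cat_new < cat_old)[ev].mean())
           + ((cat_new < cat_old)[ne].mean() - (cat_new > cat_old)[ne].mean()))
    # IDI: improvement in mean predicted risk among events minus the
    # (unwanted) increase among nonevents; no strata needed.
    idi = ((p_new[ev].mean() - p_old[ev].mean())
           - (p_new[ne].mean() - p_old[ne].mean()))
    return nri, idi
```

The NRI depends entirely on where the strata boundaries are drawn, which is precisely the sensitivity to category choice discussed above; the IDI does not.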

NRI and Integrated Discrimination Improvement tables provide an important increase in the power to detect an improvement in risk stratification associated with the use of a new biomarker. Indeed, numerous clinical situations exist in which a small increase in AUCROC leads to a substantial improvement in reclassification by the NRI and/or the Integrated Discrimination Improvement table. This suggests that a very small increase in AUCROC may still indicate a meaningful improvement in risk prediction and that the exclusive use of the ROC curve is not sufficient to demonstrate that a biomarker is not useful. This is clearly an evolving domain of biostatistics,53 which should be given strong consideration in perioperative medicine and risk stratification.

Fig. 4. Receiver operating characteristic (ROC) curve and the "gray zone." The ROC curve is the same as in figure 2 and shows the predictive value of brain natriuretic peptide (BNP) for cardiogenic pulmonary edema in elderly (≥65 yr) patients admitted to the emergency department for acute dyspnea. Two cutoffs were chosen: one corresponding to a high value of BNP associated with near-certainty for diagnostic inclusion (BNP ≥ 360 pg/ml; sensitivity 0.66, specificity 0.93, positive predictive value 0.90) and the other with near-certainty for diagnostic exclusion (BNP < 100 pg/ml; sensitivity 0.91, specificity 0.51, negative predictive value 0.85). The square indicates the gray zone. Adapted from data from Ray et al.31

Common Pitfalls of the Evaluation of a

Biomarker

Intrinsic Properties

A biomarker presupposes a biologic assay, which is associated with measurement error. Thus, the precision (reproducibility) of the measurement of the biomarker and its limit of detection should be provided. Moreover, the measurement of the biomarker should be sensitive (i.e., able to detect very low concentrations) and specific (i.e., providing a measure of the biomarker itself without interference from other molecules, particularly those related to its metabolism). All these intrinsic properties are important to report and disclose to the reader.

Moreover, because most biomarkers are molecules produced by our cells and measured in blood, urine, or another organic fluid, the possibility that abnormal cells produce the biomarker through a pathophysiologic mechanism completely different from the known one(s) should always be considered. For example, there is some evidence that KIM-1, a biomarker of kidney injury, can be secreted by kidney cancer cells in the absence of renal injury.‖ This point can be difficult to address because the physiology of a given biomarker is often incompletely known, and abnormal cells may produce almost anything. Normality may also differ between populations (e.g., troponin in cardiac surgery vs. other types of surgery).49

The analytical characteristics of any assay should be distinguished from its diagnostic characteristics.57 The terms "limit of detection," "limit of quantitation," and "minimal detectable concentration" are synonyms for analytical sensitivity. Polymerase chain reaction is considered a very sensitive test because it can detect a very low number of copies of a gene or gene fragment. However, despite this exquisite analytical sensitivity, its diagnostic sensitivity may not be as good when the target DNA is absent from the biologic material analyzed: this can be the case for a patient with endocarditis whose withdrawn blood samples do not contain any bacteria. In the same way, polymerase chain reaction can be considered an assay with exquisite analytical specificity, but its diagnostic specificity may be imperfect simply because of contamination.57

Numerical Expression of Diagnostic Variables

Most of these variables should be considered as percentages and thus can be expressed either using the unit "percent" or using two digits: sensitivity might be presented either as 89% or 0.89. The important point is to ensure coherence throughout a given manuscript for a given variable and among all diagnostic variables. More importantly, because these variables are percentages, a 95% CI should always be associated with them.58 The lower and upper limits of the 95% CI inform the reader about the interval in which 95% of all estimates of the measure (e.g., sensitivity, area under the curve) would fall if the study were repeated over and over again. When LHRs are reported, CIs that include 1 indicate that the study has not shown convincing evidence of any diagnostic value of the investigated biomarker. Therefore, the reader does not know whether a test with a positive LHR of 20 but a 95% CI of 0.7–43 is useful. A study reporting a positive LHR of 5.1 with a 95% CI of 4.0–6.0 provides more precise evidence than a study arriving at a positive LHR of 9.7 with a 95% CI of 2.3–17. Usually the sample size in critical care medicine studies is small, leading to wide CIs. Likewise, too often, studies of diagnostic tests are underpowered to allow statistically sound inferences about differences in test accuracy.

The reporting of CIs enables the researcher and the reader to effectively communicate or understand the values presented, taking into account the uncertainty inherent in any sample size. This is particularly important because most of these variables are calculated using only a fraction of the whole population studied: for example, an interesting sensitivity of 0.90 in a large population of 500 patients (but with only 10 presenting the disease) may not seem so interesting when considering its 95% CI: 0.60–0.98. Likelihood and diagnostic ratios are ratios of probabilities but should also be reported with their CIs. Moreover, CIs enable the reader to directly interpret statistical inference.58
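For instance, the example above (a sensitivity of 0.90 estimated from 9 of 10 diseased patients) can be reproduced with a Wilson score interval, one common choice of CI for a proportion (the article does not specify which method it used):

```python
import math

def wilson_ci(successes, n, z=1.96):
    # Wilson score interval for a proportion; behaves better than the
    # simple Wald interval p +/- z*sqrt(p(1-p)/n) when n is small or p is
    # close to 0 or 1.
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Sensitivity 0.90 estimated from only 10 diseased patients (9/10):
lo, hi = wilson_ci(9, 10)   # roughly (0.60, 0.98)
```

The resulting interval of about 0.60–0.98 matches the article's example and shows how little the point estimate of 0.90 means on its own.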

Role of Time

In most clinical situations, the issue of the time of biomarker measurement is of limited interest, mainly because the time of onset of the pathologic process and/or disease is unknown. However, in other situations, the time of onset can be readily determined. This is the case for acute chest pain and the appearance in the blood of a biomarker of myocardial infarction. In that example, although troponin is recognized as an ideal biomarker (both very sensitive and very specific), it needs more time to become detectable than myoglobin, which is considered a poorer diagnostic biomarker but one that appears earliest (fig. 5).59 The importance of timing can be crucial in perioperative medicine, particularly in the postoperative period, because the timing of the insult (anesthesia/surgery) is precisely known. For example, in cardiac surgery, Fellahi et al.11 suggested that troponin should be measured 24 h after cardiopulmonary bypass to gain the maximum information. In contrast, the time profile of another biomarker such as BNP may be completely different in that surgical setting.60

‖ Morrissey J, London A, Lambert M, Luo J, Kharasch E: Specificity of the urinary biomarkers KIM-1 or NGAL to detect perioperative kidney injury (abstract A1623). Paper presented at the Annual Meeting of the American Society of Anesthesiologists, New Orleans, Louisiana, October 17–21, 2009.

The issue of the time of measurement may also be crucial when considering the pathophysiologic processes assessed by biomarkers, which presupposes a clear understanding of those processes, and this is not always the case. In sepsis, the simultaneous occurrence of proinflammatory processes (tumor necrosis factor-α and interleukin-6) and antiinflammatory processes (interleukin-10), and the complex interaction of therapeutics, may make the analysis of biomarkers difficult, particularly when the onset of the various infection processes (onset of infection vs. onset of severe infection vs. onset of shock) remains vague.

Different Populations

Diagnostic tests may vary substantially when measured in different patient populations, particularly when the studied populations are defined by characteristics such as demographic features (age and sex) and the spectrum of the disease (severity, acute vs. chronic illness, pathologic location or form).61 Moreover, a diagnostic test may work well in a global population but not in a given subgroup. For example, procalcitonin may not be a good biomarker of infection in pyelonephritis28 or intraabdominal abscess.62 Procalcitonin is not a good biomarker of infection in a population exposed to heatstroke, even though half of these patients are truly infected, simply because heatstroke itself increases procalcitonin.63 In the perioperative period, the type of surgery may be an important cause of variation. The properties of cardiac troponin I for diagnosing postoperative myocardial infarction are fundamentally different in noncardiac versus cardiac surgery, simply because cardiac surgery itself is responsible for an important postoperative release of cardiac troponin with multiple causes: surgical cardiac trauma, extracorporeal circulation, and defibrillation.60 Even within cardiac surgery, different cutoff points of cardiac troponin to predict major postoperative cardiac events are observed when comparing cardiopulmonary bypass, valve, or combined surgery.49

All these important issues are usually summarized as spectrum bias. Therefore, precise information concerning the studied population and its case mix is important to be provided by researchers and understood by readers (table 4).

The issue of different populations can be analyzed more broadly as an issue of external influences (covariates). This is the case when factors other than the disease affect the biomarker, including factors that affect the test procedure (apparatus and centers), the value of the biomarker itself (see below for the influence of kinetics), or the relation of the biomarker to the outcome. Thus, adjustment for covariates may be an important component of the evaluation of biomarkers.64 When the covariate does not modify the ROC performance, the covariate-adjusted ROC curve is an appropriate tool to assess classification accuracy and is analogous to the adjusted odds ratio in an association study. In contrast, when a covariate affects the ROC performance, ROC curves for specific covariate groups should be used. Covariate adjustment may also be important when comparing biomarkers, even under a paired design, because unadjusted comparisons can be biased.64

Importance of the Biomarker Kinetics

A biomarker has its own kinetics, implying metabolism and elimination. This important issue has been poorly recognized, at least partly because the kinetics of biomarkers are often poorly investigated. Just as renal or hepatic insufficiency may influence the pharmacokinetics of drugs, it can also influence the kinetics of a biomarker and interfere with its diagnostic properties. For example, procalcitonin has recently been shown to be increased in patients with renal dysfunction who undergo vascular surgery. This increase was observed in both infected and noninfected patients (fig. 6) and interferes with the cutoff point chosen but not with the diagnostic performance.65 This effect can be of paramount importance in the postoperative period or in the intensive care unit because these patients are more likely to present organ failure. When comparing two biologic forms derived from BNP, the active form and its prometabolite N-terminal prohormone brain natriuretic peptide (NT-proBNP), Ray et al.66 observed that the diagnostic properties of NT-proBNP were decreased compared with BNP, probably because of the differential impact of renal function on these two biomarkers in an elderly population.

Fig. 5. Effect of time. (A) Schematic evolution of blood concentrations of the main biomarkers of acute myocardial infarction (MI), myoglobin, creatine kinase MB (CKMB), and cardiac troponin I (cTnI), after the onset of chest pain. (B) Schematic evolution of their respective sensitivities. Adapted from data from De Winter et al.59

Table 4. The Standards for Reporting of Diagnostic Accuracy (STARD) Checklist for Reporting Diagnostic Studies*

Title, abstract, keywords
1. Identify the article as a study of diagnostic accuracy (recommend MeSH heading "sensitivity and specificity")

Introduction
2. State the research questions or aims, such as estimating diagnostic accuracy or comparing accuracy between tests or across participant groups

Methods, participants
3. Describe the study population: the inclusion and exclusion criteria and the settings and locations where the data were collected
4. Describe participant recruitment: was this based on presenting symptoms, results from previous tests, or the fact that the participants had received the index test or the reference standard?
5. Describe participant sampling: was this a consecutive series of participants defined by the selection criteria in items 3 and 4? If not, specify how participants were further selected
6. Describe data collection: was data collection planned before the index tests and reference standard were performed (prospective study) or after (retrospective study)?

Test methods
7. Describe the reference standard and its rationale
8. Describe technical specifications of materials and methods involved, including how and when measurements were taken, or cite references for the index test or reference standard, or both
9. Describe the definition of and rationale for the units, cutoff points, or categories of the results of the index tests and the reference standard
10. Describe the number, training, and expertise of the persons executing and reading the index tests and the reference standard
11. Were the readers of the index tests and the reference standard blind (masked) to the results of the other test? Describe any other clinical information available to the readers

Statistical methods
12. Describe the methods for calculating or comparing measures of diagnostic accuracy and the statistical methods used to quantify uncertainty (e.g., 95% confidence interval)
13. Describe methods to quantify test reproducibility

Results, participants
14. Report when the study was done, including beginning and ending dates of recruitment
15. Report clinical and demographic characteristics (e.g., age, sex, spectrum of presenting symptoms, comorbidity, current treatments, and recruitment centers)
16. Report how many participants satisfying the criteria for inclusion did or did not undergo the index tests or the reference standard, or both; describe why participants failed to receive either test (a flow diagram is strongly recommended)

Test results
17. Report the time interval from index tests to reference standard, and any treatment administered between
18. Report the distribution of severity of disease (defined criteria) in those with the target condition and other diagnoses in participants without the target condition
19. Report a cross tabulation of the results of the index tests (including indeterminate and missing results) by the results of the reference standard; for continuous results, report the distribution of the test results by the results of the reference standard
20. Report any adverse events from performing the index test or the reference standard

Estimates
21. Report estimates of diagnostic accuracy and measures of statistical uncertainty (e.g., 95% confidence interval)
22. Report how indeterminate results, missing results, and outliers of index tests were handled
23. Report estimates of variability of diagnostic accuracy between readers, centers, or subgroups of participants, if done
24. Report estimates of test reproducibility, if done

Discussion
25. Discuss the clinical applicability of the study findings

* Table available at http://www.stard-statement.org/. Accessed December 2, 2009.


One of the important variables associated with a decrease in organ function is age.10 Because we are more frequently caring for elderly patients, it is important that biomarkers be tested not only in a middle-aged population but also in an elderly population.

Other Biases

The range of values reported for the sensitivity and specificity of any biomarker in different studies is often very wide. This variability is uncovered by most meta-analyses of biomarker studies.67 Apart from differences concerning the cutoff point, which should be considered a definition issue, one of the most important reasons for this wide variation is that diagnostic studies are plagued by numerous biases68:

The main problem resides in the reference test used. In many clinical situations, the definition of cases and controls does not rely on a perfect "gold standard" reference test (see below), and patients with ambiguous classification are often ignored in the analysis. This leads to a case-control design that overestimates diagnostic accuracy.3,69 This bias (also called spectrum bias) may be associated with the largest bias effect.3

Selection bias occurs when nonconsecutive or nonrandomly selected patients are included.

The lack of blinding for the biomarker tested can also introduce bias, which usually overestimates diagnostic accuracy, although this effect seems to be relatively small.3

Verification bias is caused by selecting a population of patients who actually receive the reference test, thus ignoring unverified patients; by not subjecting all patients to the reference test; or by using different reference tests (partial verification bias).3 This bias may be particularly confounding when the decision to perform the reference test is based on the result of the studied test.

The test may produce an uninterpretable result. Although this problem is frequently not formally reported, because these observations are removed from the analysis, this practice can introduce bias. For subjectively interpreted tests (which may particularly be the case for biomarkers measured with rapid bedside techniques), interobserver variation can have a silent but important impact that may be neither estimated nor reported.

The accuracy of a biomarker may improve over time because of improvement either in the skills of the biologist or reader or in the technology. The measurement of cardiac troponin is a good illustration of progressive improvement in technology over the recent decade, leading to a marked and progressive decrease of the cutoff determining normality.70

Finally, as for clinical trials, publication bias may occur because studies showing encouraging results have a higher likelihood of being published. This bias is important to consider for meta-analyses. In the absence of registration of diagnostic studies, it is difficult to estimate the impact of this source of bias.

Statistical Power Issue

Although frequently overlooked,71 statistical power considerations are as important for studies examining the diagnostic performance of a biomarker as for other types of research (e.g., clinical trials). Thus, all studies of biomarkers should include an a priori calculation of the number of patients to be included. The exact statistical power considerations relevant for interpreting a biomarker study depend on the nature or purpose of the study but generally focus on demonstrating that the sensitivity and/or specificity of a biomarker is superior to some stated value (e.g., sensitivity ≥ 0.75). Of note, this focus on sensitivity and specificity applies even when the predictive values are of greater interest (as they often are), because predictive values are also dependent on the prevalence of the underlying disease (fig. 2).

Fig. 6. Influence of renal function on a biomarker in the postoperative period after major vascular surgery. Comparison of procalcitonin in patients without (full circles and full line, n = 201) and with (open circles and dotted line, n = 75) postoperative renal dysfunction, in the control group (A) and the infection group (B). Data are median (95% confidence interval). * P < 0.05 (between-group comparisons). Reproduced with permission from Amour et al.65

Calculating the required sample size to provide some level of desired statistical power (1 − β) is analogous to testing a traditional difference from a theoretical proportion (i.e., a one-sample difference in proportions) using a one-sided CI. For the calculation, the sensitivity or specificity value of the test is treated as a proportion and compared with a minimally acceptable value. A null hypothesis is posited that the sensitivity or specificity of the test is equal to the minimally acceptable value, with the alternative hypothesis that the test value is greater than this minimal value. To test this hypothesis, a type-I error rate must be specified (usually α = 0.05) to construct a one-sided CI (1 − α). Further, because of the nature of the inference being conducted, the desired statistical power is conservatively set to 95% (1 − β = 0.95). Finally, the expected sensitivity or specificity of the biomarker must be anticipated so that the difference between the two proportions can be used in the calculation.
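A sketch of this one-sided, one-sample calculation under the usual normal approximation (illustrative, not a prescribed formula from the article; the function name is ours and the defaults mirror the text's α = 0.05 and power = 0.95):

```python
import math
from statistics import NormalDist

def n_diseased_for_sensitivity(p0, p1, alpha=0.05, power=0.95):
    # One-sided, one-sample test that the true sensitivity exceeds a
    # minimally acceptable value p0, given an anticipated sensitivity p1 > p0.
    # Returns the number of DISEASED patients needed, so the total sample
    # must be scaled up by 1/prevalence (likewise for specificity and controls).
    z_a = NormalDist().inv_cdf(1 - alpha)   # type-I error quantile
    z_b = NormalDist().inv_cdf(power)       # power = 1 - beta quantile
    n = ((z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1)))
         / (p1 - p0)) ** 2
    return math.ceil(n)

# e.g., to show sensitivity > 0.75 when anticipating 0.90:
# n_diseased_for_sensitivity(0.75, 0.90)
```

Note how quickly the requirement grows as the anticipated value approaches the minimally acceptable one, which is the usual reason biomarker studies are underpowered.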

Although this process can seem daunting, several resources are available to assist researchers and consumers of biomarker research. First, the calculation is now routinely available in most statistical software applications; the assumptions used in the calculation must still be provided by the user, but elegant algorithms perform the calculation itself. Second, Flahault et al.72 have recently provided an extensive overview of the process, together with tables of values that are routinely encountered. Finally, a growing list of internet sites hosts statistical power calculators for a variety of applications. Although many of these sites are not formally vetted, several are hosted by universities and are thus quite useful.

Nevertheless, as we advocate going beyond sensitivity and specificity in this review, it should be emphasized that calculation of the required sample size should now be done considering either the sensitivity at a particular false-positive rate,73 the AUCROC,73–75 including the partial AUCROC,73,76 or the reclassification indices.54 Moreover, the objective of a study may also be to determine the value of a cutoff or to compare two or more biomarkers. Diagnostic assessments of biomarkers involve numerous forms of statistical analysis, but sample size calculations must be taken into consideration; this is always feasible, even if sometimes difficult. Thus, clinicians have first to define the aim of the considered research and second to evaluate their ability to conduct the corresponding power calculation. These techniques are not yet available in most of the usual statistical software applications, and the need for more advanced statistical software# might be dissuasive for occasional use by clinical researchers. Thus, the advice of a biostatistician may be very helpful.

Imperfect Reference Test

In a diagnostic study, the reference test should be a gold standard, but in many clinical situations this is not possible. A universally recognized standard may not exist (e.g., cardiac failure), may not have been performed in many patients (e.g., autopsy), or logistically cannot be performed concurrently. For example, when evaluating BNP, echocardiography for heart failure is not always performed in the emergency department but is usually performed later during hospitalization.34 Moreover, in many situations, biomarkers are compared with scores derived from several clinical metrics of unknown reliability, in place of a confirmed diagnosis. This was the case for the Framingham score for heart failure and the use of the biomarker BNP, the criteria of systemic inflammatory response syndrome and sepsis and the use of procalcitonin,77 and the Risk, Injury, Failure, Loss, and End-stage Kidney (RIFLE) score for acute renal failure.78

When an imperfect reference test must be used, it should be recognized that measures of test performance can be distorted.79 Glueck et al.80 showed that when inappropriate reference standards are used, the observed AUCROC can be greater than the true area, with the typical direction of the bias being a strong inflation in sensitivity with a slight deflation of specificity. Taken together, this information warrants the use of reliable reference standards that are not prone to such bias.

There are several options available to improve a reference standard when a gold standard does not exist or cannot be used. First, expert consensus can be used to define the diagnosis. For this task, at least three experts are needed, with a majority rule.81 These experts should have complete access to all available information, except that concerning the biomarker test, to which they should be blinded. The statistical agreement between experts should be quantified and reported. A second option is to assign a probability value (i.e., 0–1) that corresponds to a subjective probability, or one derived (e.g., by logistic regression using dedicated variables), that a patient has the disease. Third, one can use covariance information to estimate a model of the multivariate normal distributions of disease-positive and disease-negative patients when several accurate tests are being compared. Finally, one can transform the diagnostic problem into a clinical outcome problem.82
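For the first option, the agreement among three or more experts can be quantified with Fleiss' kappa. The sketch below is our own minimal implementation, assuming every subject is rated by the same number of experts; it works from a table giving how many experts assigned each subject to each diagnostic category:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for m raters.
    counts[i][j] = number of experts assigning subject i to category j;
    every row must sum to the same number of raters m."""
    n_subjects = len(counts)
    n_cats = len(counts[0])
    m = sum(counts[0])  # raters per subject (assumed constant)
    # overall proportion of ratings falling in each category
    p_j = [sum(row[j] for row in counts) / (n_subjects * m)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)          # chance agreement
    # mean observed per-subject agreement
    p_bar = sum((sum(c * c for c in row) - m) / (m * (m - 1))
                for row in counts) / n_subjects
    return (p_bar - p_e) / (1 - p_e)
```

A kappa near 1 indicates near-perfect agreement; values near 0 indicate agreement no better than chance.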

There are also some situations in which the reference test outcome is not binary (yes or no) but ordinal or continuous. Obuchowski et al.83 proposed an ROC-type nonparametric measure of diagnostic accuracy for this case: a discrimination test in which a diagnostic test is compared with a continuous reference test to determine how well it distinguishes the outcomes of the reference test.

New biomarkers not only modify our diagnostic process but may also change the definition of a disease.84 For example, cardiac troponin has progressively modified the definition of the diagnosis of myocardial infarction.70 Glasziou et al.84 proposed three main principles that may assist the replacement of a current reference test: the consequences of the new test can be understood through disagreements between the reference and the new test; resolving disagreements between the new and reference tests requires a fair, but not necessarily perfect, "umpire" test; and possible umpire tests include causal exposures, concurrent testing, prognosis, or the response to treatment. A fair umpire test does not favor either the reference or the new test and is thus considered unbiased.

# For example, R software. Available at: http://cran.r-project.org/. Accessed December 2, 2009.

STARD Statement for Diagnosis Studies

The STARD initiative was published recently to improve the quality of reporting for studies of diagnostic accuracy.4 Complete and accurate reporting of biomarker studies allows the reader to detect potential bias and to judge the clinical applicability and generalizability of results. The STARD recommendations follow the template of the Consolidated Standards of Reporting Trials statement for the reporting of randomized controlled trials (RCTs).5 The STARD guideline attempts to improve the reporting of several factors that may threaten the internal or external validity of the results of a study, including design deficiencies, selection of the patients, execution of the index test, selection of the reference standard, and analysis of the data. That these reporting improvements are needed is evidenced by a survey of diagnostic accuracy studies published in major medical journals between 1978 and 1993 that found generally poor methodologic quality and underreporting of key methodologic elements.85 Similar shortcomings were observed in most specialized journals.86

The STARD guideline (table 4) provides a checklist of 25 items to verify that relevant information is provided. Similar to the reporting of RCTs, a flow diagram is strongly recommended, and one item advocates the extensive use of CIs. Although the STARD initiative is a crucial step for improving reporting, because of the heterogeneity of available methods, only a few general recommendations concerning the statistical methods were offered.

Associated Clinical Predictors and/or Multiple Biomarkers

Pretest risk is rarely equal for all patients in a population, and clinical predictors, such as age, are most often present. The clinical question remains whether a new biomarker improves the risk stratification determined by the classic predictors (clinical and biologic). The risk prediction obtained with a new biomarker alone, even if excellent, may have no clinical application if it does not improve the risk stratification obtained with the usual predictors. The use of a risk prediction model (a statistical model that combines information from several predictors) is the most frequent approach. The purpose of a risk prediction model is to accurately stratify individuals into clinically relevant risk categories. The common types of models include logistic regression, Cox proportional hazards, and classification trees. Two nested models are then constructed and compared, one with the usual predictors and the other with the usual predictors plus the new biomarker.
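The two nested models are typically compared with a likelihood-ratio test: given the maximized log-likelihoods of the two fitted models (obtained from any regression package), the test for a single added biomarker reduces to a chi-square statistic with 1 degree of freedom. The helper below is our own sketch under that assumption:

```python
import math

def lr_test_1df(loglik_base, loglik_full):
    """Likelihood-ratio test comparing two nested logistic models:
    base = usual predictors only; full = usual predictors + new biomarker.
    Returns the chi-square statistic (1 df) and its P value."""
    stat = 2 * (loglik_full - loglik_base)
    # chi-square (1 df) survival function, expressed with erfc
    p_value = math.erfc(math.sqrt(stat / 2)) if stat > 0 else 1.0
    return stat, p_value
```

A statistic above 3.84 (P < 0.05) suggests the biomarker adds information beyond the usual predictors, although reclassification indices should also be examined.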

In most clinical situations, clinicians want to apply more than one biomarker, and a multiple biomarker approach is increasingly used in several domains of modern medicine. For example, stratification of cardiovascular risk in the general population is improved when considering several biomarkers such as C-reactive protein, troponin, and BNP.87 After cardiac surgery, a multiple biomarker approach has been shown to improve the prediction of poor long-term outcome when compared with the classic clinical Euroscore (fig. 7).60 There are two main approaches: (1) several biomarkers testing the same pathophysiologic process; or (2) different biomarkers testing different pathologic processes. For example, C-reactive protein may assess the postoperative inflammatory response, BNP the cardiac strain, and troponin any myocardial ischemic damage, all of them influencing final outcome in cardiac surgery.11

However, the results from different tests are usually not independent of each other, even if they assess different pathophysiologic mechanisms, indicating that sequential Bayesian calculations may not be appropriate. Other statistical methods, which take into account collinearity and interdependence, should be considered. In the case of a logistic regression, the regression coefficients can be used to calculate the probability of disease presence, and a simplified score can then be derived from these coefficients. Biases associated with these multivariate analyses are common and can substantially impair the replication of such scores; the best way to limit this is to use both internal and external validation of the models.88
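As a sketch of that last step (with hypothetical coefficient values and biomarker names, not taken from any cited study), the fitted coefficients yield a predicted probability directly, and an integer points score can be derived by scaling each coefficient by the smallest one and rounding:

```python
import math

def predicted_probability(intercept, betas, x):
    """Probability of disease from a fitted logistic model."""
    logit = intercept + sum(b * x[name] for name, b in betas.items())
    return 1 / (1 + math.exp(-logit))

def integer_points(betas):
    """Simplified score: coefficients scaled by the smallest one, rounded."""
    ref = min(abs(b) for b in betas.values())
    return {name: round(b / ref) for name, b in betas.items()}

# Hypothetical logistic-regression coefficients for three biomarkers
betas = {"crp": 0.5, "bnp": 1.0, "troponin": 1.5}
points = integer_points(betas)  # {'crp': 1, 'bnp': 2, 'troponin': 3}
```

Such a simplified score is easier to use at the bedside, but it approximates the full model and, like the model itself, requires internal and external validation.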

Meta-analysis of Diagnostic Studies

Fig. 7. Multiple biomarkers in cardiac surgery: cumulative postoperative survival at the mean of covariates (European System for Cardiac Operative Risk Evaluation) without major cardiac events (MACE) according to elevation of cardiac troponin I (≥3.5 ng/ml), B-type natriuretic peptide (≥880 pg/ml), and C-reactive protein (≥180 mg/l). Patients were categorized according to elevation of no biomarker (BM) (n = 58; survival rate at 1 yr, 95%), only one BM (n = 98; survival rate at 1 yr, 82%), two BMs (n = 56; survival rate at 1 yr, 63%), or three BMs (n = 12; survival rate at 1 yr, 58%). All survival curves significantly differ from each other (P < 0.05). Reproduced with permission from Fellahi et al.60

Systematic reviews are conducted to help gain insight into the available evidence for a research topic. For a meta-analysis, this is conducted in a two-stage process in which summary statistics are first computed for each individual study, and a weighted average is then computed across studies.89 In this regard, meta-analyses of biomarker diagnostic studies are similar to other types of meta-analyses.67 For that reason, the reporting of meta-analyses of biomarker diagnostic studies should generally follow existing guidelines for meta-analysis such as Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).90
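The two-stage process can be sketched for a single index, sensitivity: each study contributes its logit(sensitivity) with an approximate variance, and the inverse-variance weighted average is back-transformed. This fixed-effect sketch is our own and deliberately minimal; the bivariate random-effects models discussed below are preferable in practice because they handle between-study heterogeneity and the sensitivity-specificity correlation:

```python
import math

def pooled_sensitivity(studies):
    """Two-stage fixed-effect pooling of sensitivity across studies.
    studies: list of (true_positives, false_negatives) pairs."""
    num = den = 0.0
    for tp, fn in studies:
        tp_c, fn_c = tp + 0.5, fn + 0.5      # continuity correction
        logit = math.log(tp_c / fn_c)        # per-study summary statistic
        weight = 1 / (1 / tp_c + 1 / fn_c)   # inverse-variance weight
        num += weight * logit
        den += weight
    return 1 / (1 + math.exp(-num / den))    # back-transform to a proportion
```

The same machinery applies to specificity; handling both indices jointly, with their heterogeneity, is what motivates the bivariate and hierarchical summary ROC approaches.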

Despite general similarities, the conduct of a meta-analysis of biomarker studies differs from meta-analyses of RCTs in several important ways. First, the assessment of study quality for diagnostic studies varies considerably from that for RCTs. Individual studies on the same biomarker can vary considerably in the choice of threshold used, the population under study, and even the measurement of the biomarker or reference standard. The choice of patient recruitment strategy can impact the assessment as well, with one study finding that recruiting patients and controls separately can lead to an overestimation of the test's diagnostic accuracy.3 To assist in the evaluation of study quality, several specialized tools have been created, such as the STARD guidelines (table 4) and the Quality Assessment of Studies of Diagnostic Accuracy (QUADAS) tool, which has been included in systematic reviews.91 The accurate characterization of a biomarker's performance in a particular setting for a specific population depends on sorting through the available evidence to focus primarily on relevant studies of high quality.

The statistical techniques used to aggregate the results of biomarker studies also differ from those used in meta-analyses of RCTs. The meta-analysis of diagnostic studies requires the consideration of two index measures (e.g., sensitivity and specificity), as opposed to a single index in the meta-analysis of an RCT.92 It is also expected that heterogeneity in the indices will be observed from several different sources,3 and this heterogeneity must be considered in the statistical model used to pool the estimates.93 The choice of which type of model and estimation strategy to use is not trivial, with several novel techniques such as the hierarchical summary ROC94 and multivariate random-effects meta-analysis95 offering distinct advantages over traditional approaches. For the interested reader, Deeks et al.92 offer an informative illustration of the meta-analytical process.

Fig. 8. Reanalysis of the predictive value of brain natriuretic peptide for cardiogenic pulmonary edema in elderly patients (≥65 yr) admitted to the emergency department for acute dyspnea. (A) A bootstrap analysis (1,000 random samples) was performed to obtain a box plot in the receiver-operating characteristic (ROC) curve. (B) A cost-benefit analysis was performed to choose the best cutoff point. (C) The bootstrap analysis also allowed determination of the best cutoff using the Youden method; this could also provide another definition of the gray zone or a 95% confidence interval for the cutoff point. (D) The bootstrap analysis shows the distribution of the area under the receiver-operating characteristic curve (AUCROC). Adapted from data from Ray et al.31
