arXiv:1105.1575v1 [stat.ME] 9 May 2011
Evaluating the diagnostic powers of variables and their
linear combinations when the gold standard is
continuous
Zhanfeng Wang^{a,b}, Yuanchin Ivan Chang^{b,1}
^a Department of Statistics and Finance, University of Science and Technology of China,
Hefei, 230026, China
^b Academia Sinica, Taipei, Taiwan, 11529
Abstract
The receiver operating characteristic (ROC) curve is a very useful tool for
analyzing the diagnostic or classification power of instruments and
classification schemes, provided that a binary-scale gold standard is
available. When the gold standard is continuous and there is no confirmative
threshold, the ROC curve becomes less useful. Several extensions have
therefore been proposed for evaluating the diagnostic potential of variables
of interest. However, because of the computational difficulties of these
nonparametric extensions, they are not easy to use for finding the optimal
combination of variables that improves on the individual diagnostic power.
We therefore propose a new measure, which extends the AUC index, for
identifying variables with good potential to be used in a diagnostic scheme.
In addition, we propose a threshold gradient descent based algorithm for
finding the best linear combination of variables that maximizes this new
measure; the algorithm is applicable even when the number of variables is
huge. The estimate of the proposed index and its asymptotic properties are
studied. The performance of the proposed method is illustrated using both
synthesized and real data sets.
Keywords: ROC curve, Area under curve, Gold standard, Classification
^1 Corresponding author: ycchang@stat.sinica.edu.tw
Preprint submitted to Computational Statistical and Data Analysis May 10, 2011
1. Introduction
The ROC curve, founded on a binary gold standard, is one of the most
important tools for measuring the diagnostic power of a variable or
classifier, and it has been studied intensively by many authors; see, for
example, the textbooks of Pepe (2003) and Krzanowski and Hand (2009).
Moreover, when the number of variables is huge, many algorithms have been
proposed for finding the best combination of variables to improve on the
classification accuracy of the individual variables (Su and Liu (1993),
Pepe (2003), Ma and Huang (2005), and Wang et al. (2007a)). However, in many
classification or diagnostic problems, the professed binary gold standard is
essentially derived from a continuous-valued variable. If there is no
confirmative threshold for the continuous gold standard, then the evaluation
of variables/classifiers by ROC curve based analysis may vary as the choice
of threshold changes, and therefore becomes less informative. For example,
glycosylated hemoglobin is usually used as a primary diabetic control index,
and is originally measured as a continuous-valued variable. Health
institutes, such as the World Health Organization and the National
Institutes of Health (NIH), suggest a cutting point for it based on current
findings for diabetic diagnosis and control. Once its cutting point is
fixed, the association between the variables of interest, such as new drugs,
and this binary-scale standard can be evaluated using ROC curve related
analysis methods. However, as advances are made in science and medicine
about this disease, this criterion will be reevaluated and revised as
necessary. The performance evaluation of variables/classifiers may then vary
as the binary-recoding scheme is changed. It is clear that an unwarranted
performance measure may result in misleading conclusions and may require
reevaluation of all the available diagnostic methods every time a new
standard is proposed. Hence, a measure that connects directly to the
continuous gold standard is preferable, which motivates our study of a new
measure when the gold standard is continuous. Our goal in this paper is to
find a robust measure that is not affected by the choice of cutting point of
a gold standard or by how the binary outcome is derived from a continuous
gold standard.
Although there are many reports about the ROC curve, there is still a lack
of study when the gold standard is not binary (Krzanowski and Hand, 2009).
Henkelman et al. (1990) proposed a maximum likelihood method under an
ordinal-scale gold standard. Recently, Zhou et al. (2005), Choi et al.
(2006), and Wang et al. (2007b) considered ROC curve estimation problems
based on nonparametric and Bayesian approaches when there is no gold
standard. In addition, ROC-type analysis without a binary gold standard has
been considered in Obuchowski (2005) and Obuchowski (2006), where a
nonparametric method is used to construct a new measure, and many other
applications with a continuous gold standard are discussed. However, because
of computational issues, these approaches are not easy to apply when the
optimal combination of variables is of interest, especially when the number
of variables is large, as in modern biological/genetic studies (Waikar et
al. (2009)).
In this paper, an extension of the AUC-type measure is proposed, which is
independent of the choice of threshold of the continuous gold standard, and
algorithms for finding the best linear combination of variables that
maximizes the proposed measure are studied. Under the joint multivariate
normality assumption, the linear combination can be found using the LARS
method. When this joint normality assumption is violated, we propose a
threshold gradient descent based method (TGDM) to find the optimal linear
combination. Thus, our algorithms also inherit the nice properties of LARS
and TGDM when dealing with high-dimensional and variable selection
problems. Numerical studies are conducted to evaluate the performance of the
proposed methods with different ranges of cutting points using both
synthesized and real data sets. The estimate of this novel measure and its
asymptotic properties are also presented.
In the next section, we first present a novel measure for evaluating the
diagnostic potential of individual variables and then an estimate of this
measure. The algorithms for finding the best linear combination are
discussed in Section 3. Numerical results based on the synthesized data and
some real examples follow. A summary and conclusions are given in Section 4.
The technical details are presented in the Appendix.
2. An AUC-type Measure with a Continuous Gold Standard
Before introducing a novel AUC-type measure based on a continuous gold
standard, we first fix the notation and briefly review the definition of the
ROC curve and its related measures. Let $Z$ and $Y$ be two continuous
real-valued random variables, where $Z$ denotes the gold standard and $Y$ is
a variable of interest whose diagnostic potential is to be measured. For
example, $Z$ is a primary index for measuring a disease and $Y$ is some
other measure of subjects that is related to the disease of interest. In
some medical diagnostics, the primary index is difficult to measure, and we
usually look for variables that are strongly associated with $Z$ and easy to
measure, to be used as surrogates. That is why we need to evaluate the
"level of association" of $Y$ to $Z$. Likewise, in some bioinformatics
studies, in order to develop new treatments, we would like to identify any
strong associations between some genomic factors $Y$ and the continuous gold
standard $Z$. Suppose that there is an unambiguous threshold $c$ of $Z$ that
can be used to classify subjects into two subgroups, and assume further that
subjects with $Z > c$ are classified as diseased, and otherwise as members
of the control group.
Then the ROC curve, for such a given $c$, is defined as
$\mathrm{ROC}(t) \equiv S_D(S_C^{-1}(t))$, where
$S_D(t) = P(Y > t \mid Z > c)$ and $S_C(t) = P(Y > t \mid Z \le c)$, and the
AUC of variable $Y$ is defined as

$$\mathrm{AUC}(c) = P(Y_c^{+} > Y_c^{-}), \tag{1}$$

where the random variables $Y_c^{+}$ and $Y_c^{-}$ respectively denote the
$Y$ value of subjects of the disease and non-disease groups, with density
functions $f(y \mid Z > c)$ and $f(y \mid Z \le c)$. That is, $Y_c^{+}$ and
$Y_c^{-}$ are random variables for the subpopulations defined by
$\{Z > c\}$ and $\{Z \le c\}$, respectively. It is clear that the
$\mathrm{AUC}(c)$ defined in (1) is a function of $c$, which will change as
the threshold $c$ of $Z$ varies. Hence, when the threshold is dubious, using
$\mathrm{AUC}(c)$ as a measure may misjudge the diagnostic power of $Y$ or
the level of association between $Y$ and $Z$.
Let $f_c(t)$ be a probability density function defined on the range of
possible values of $c$; then $\mathrm{AUC}_I$ is defined as

$$\mathrm{AUC}_I \equiv \int \mathrm{AUC}(t)\, f_c(t)\, dt. \tag{2}$$
Hence, by its definition, the proposed $\mathrm{AUC}_I$ is independent of
the choice of cutting point for the continuous gold standard, and of any
monotonic transformation of $Y$ as well. This kind of threshold-independent
property is also one of the important properties of the ROC curve and AUC
when used as measures of diagnostic performance. Since $\mathrm{AUC}_I$ is
defined as an integral of $\mathrm{AUC}(c)$ over the range of possible
cutting points with respect to a weight function $f_c(t)$, the support of
$f_c(t)$ should be chosen as a subset of the support of the density of $Z$.
Moreover, we can use $f_c(t)$ to put different weights on the possible
cutting points of $Z$ if there is some information about the likely cutting
point. If $Z$ is an ordinal discrete variable, then there are only countably
many cutting points, $f_c(t)$ can be chosen as a probability mass function
on the possible cutting points, and the integral in (2) becomes

$$\mathrm{AUC}_I = \sum_{t_i \in C} \mathrm{AUC}(t_i)\, f_c(t_i), \tag{3}$$

where $C$ is the set of all possible cutting points. In particular, when $Z$
is binary, we can let $f_c(t)$ be a degenerate probability density, and then
$\mathrm{AUC}_I$ is the same as the original AUC.
2.1. Estimate of $\mathrm{AUC}_I$
Let the random variables $(Y_i, Z_i)$ denote a pair of measures from subject
$i$, for $i \ge 1$. Suppose that $\{(y_i, z_i), i = 1, \dots, n\}$ are $n$
independent observed values of the random variables $(Y_i, Z_i)$,
$i = 1, \dots, n$. For a given cutting point $c$, subject $i$ is assigned as
a "case" if $z_i > c$ and is otherwise labeled as a "control". That is, for
a given $c$, we divide the observed subjects into two groups; let $S_1(c)$
and $S_0(c)$ be the case and control groups, with sample sizes $n_1$ and
$n_0$, respectively. These assignments obviously depend on the choice of
$c$. Then, for a fixed $c$, the empirical estimate of $\mathrm{AUC}(c)$ is
defined as
$$\hat{A}(c) = \frac{1}{n_0 n_1} \sum_{i \in S_1(c);\, j \in S_0(c)} \psi(y_i - y_j), \tag{4}$$

where $\psi(u) = 1$ if $u > 0$; $= 0.5$ if $u = 0$; and $= 0$ if $u < 0$.
(It is easy to see that $\hat{A}(c)$ does not exist when either
$c > \max\{z_i, i = 1, \dots, n\}$ or $c < \min\{z_i, i = 1, \dots, n\}$,
since in these two cases we have either $n_1 = 0$ or $n_0 = 0$. Therefore,
in this paper, we assume $\hat{A}(c) = 0.5$ when either one of the cases
occurs.)
If the whole support of $Z$ is considered as a possible range of cutting
points, then a natural estimate of $\mathrm{AUC}_I$ can be defined as

$$\hat{A}_I = \int \hat{A}(t)\, d\hat{F}_c(t), \tag{5}$$
where $\hat{F}_c(t)$ is the empirical estimate of the cumulative
distribution function of $Z$ based on $\{z_1, \dots, z_n\}$. However, in
practice, it is rare to choose cutting points near the two ends of the
distribution of $Z$. Thus, instead of the whole range of $Z$, we might
explicitly define a weight function $f_c(t)$ on a particular critical range.
Below, we demonstrate three possible choices: (1) a uniform distribution
over the range $(-\hat{\sigma}, +\hat{\sigma})$, where $\hat{\sigma}$ is the
empirical standard deviation of $Z$, say $f_1(t)$; (2) a normal density with
sample mean $\hat{\mu}$ and standard deviation $\hat{\sigma}$ based on the
observed values of $Z$, say $f_2(t)$; or (3) a kernel density estimate, say
$f_3(t)$, that approximates the marginal density of $Z$. For the different
weight functions $f_j(t)$, $j = 1, 2, 3$, the estimate of $\mathrm{AUC}_I$
is denoted as

$$\hat{A}_{Ij} = \int \hat{A}(t)\, f_j(t)\, dt. \tag{6}$$

It is clear that our method can be extended to other reasonable choices of
weight functions. The theorem below states the strong consistency of
$\hat{A}_{Ij}$ for all $j$.
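The estimates in (4) to (6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the integral in (6) is approximated by a midpoint rule on a user-supplied interval `[lo, hi]`, and the weight function is renormalised on that grid.

```python
def psi(u):
    """Kernel of Eq. (4): 1 if u > 0, 0.5 if u == 0, 0 if u < 0."""
    if u > 0:
        return 1.0
    return 0.5 if u == 0 else 0.0


def empirical_auc(y, z, c):
    """Empirical AUC(c) of Eq. (4); 0.5 by convention if a group is empty."""
    cases = [yi for yi, zi in zip(y, z) if zi > c]
    controls = [yi for yi, zi in zip(y, z) if zi <= c]
    if not cases or not controls:
        return 0.5
    total = sum(psi(yi - yj) for yi in cases for yj in controls)
    return total / (len(cases) * len(controls))


def auc_i_empirical(y, z):
    """Eq. (5): integrate against the empirical CDF of z (average over z_k)."""
    return sum(empirical_auc(y, z, c) for c in z) / len(z)


def auc_i_weighted(y, z, weight, lo, hi, grid=200):
    """Eq. (6) by the midpoint rule on [lo, hi] (grid size is an assumption)."""
    step = (hi - lo) / grid
    ts = [lo + (k + 0.5) * step for k in range(grid)]
    ws = [weight(t) for t in ts]  # weights are renormalised on the grid below
    return sum(w * empirical_auc(y, z, t) for w, t in zip(ws, ts)) / sum(ws)
```

For a uniform weight, `weight` can simply be `lambda t: 1.0` with `lo` and `hi` set to the ends of the chosen critical range.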
Theorem 2.1. Let $(Y \in R^1, Z \in R^1)$ be a pair of random variables with
uniformly continuous marginal densities. Assume that
$\{(y_1, z_1), \dots, (y_n, z_n)\}$ are $n$ observations of the independent
and identically distributed random sample $(Y_i, Z_i)$, $i = 1, \dots, n$.
Assume further that $Z$ is the continuous gold standard. Then for a given
$f_c(t) = f_j(t)$, $j = 1, 2, 3$, with probability one,
$\hat{A}_{Ij} - \mathrm{AUC}_{Ij} \to 0$ as $n \to \infty$, where
$\hat{A}_{Ij}$ and $\mathrm{AUC}_{Ij}$ are defined as in (6) and (2),
respectively, with corresponding $f_c(t) = f_j(t)$.
Proof of Theorem 2.1. Since the bounded function $\hat{A}(c)$ converges
almost surely to $\mathrm{AUC}(c)$ for every given $c$, and $f_c(t)$ is a
bounded density function, the proof of Theorem 2.1 follows from the
dominated convergence theorem.
It is difficult to obtain an explicit form for the variance of
$\hat{A}_{Ij}$ because of its integral form. Thus, a bootstrap estimate of
the variance of $\hat{A}_{Ij}$ is used and denoted as $V(\hat{A}_{Ij})$. A
similar idea is employed in Obuchowski (2006).
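A bootstrap variance of this kind can be sketched as follows. The text does not specify the resampling scheme, so this sketch assumes the ordinary nonparametric bootstrap that resamples the pairs $(y_i, z_i)$ with replacement; the function names and the default of 200 resamples (the value used later in the numerical studies) are illustrative choices.

```python
import random


def empirical_auc(y, z, c):
    """Empirical AUC(c) of Eq. (4); 0.5 by convention if a group is empty."""
    cases = [yi for yi, zi in zip(y, z) if zi > c]
    controls = [yi for yi, zi in zip(y, z) if zi <= c]
    if not cases or not controls:
        return 0.5
    total = 0.0
    for yi in cases:
        for yj in controls:
            total += 1.0 if yi > yj else (0.5 if yi == yj else 0.0)
    return total / (len(cases) * len(controls))


def auc_i_empirical(y, z):
    """Eq. (5): average of the empirical AUC(c) over the observed z values."""
    return sum(empirical_auc(y, z, c) for c in z) / len(z)


def bootstrap_variance(y, z, stat, n_boot=200, seed=0):
    """Sample variance of stat over n_boot resamples of the (y, z) pairs."""
    rng = random.Random(seed)
    n = len(y)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs jointly
        values.append(stat([y[i] for i in idx], [z[i] for i in idx]))
    mean = sum(values) / n_boot
    return sum((v - mean) ** 2 for v in values) / (n_boot - 1)
```

Any other estimate, such as a weighted version of (6), can be passed as `stat` as long as it takes the two samples as arguments.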
Remark 2.2. Note that the method for calculating (6) may depend on the
choice of weight function. If the empirical distribution of the gold
standard is used, then the computation is straightforward; if a kernel
density estimate of the gold standard is used, then a numerical integration
method is required. In all cases, however, the computation is easy, since
the integral is one-dimensional.
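As an illustration of how light this one-dimensional computation is, the sketch below builds the weight functions $f_2$ and $f_3$ and integrates them with a midpoint rule. The Gaussian kernel, the user-supplied `bandwidth`, the grid size and all function names are assumptions made for the example; the text does not fix these details.

```python
import math


def normal_weight(z):
    """f_2: normal density with the sample mean and standard deviation of z."""
    n = len(z)
    mu = sum(z) / n
    sd = math.sqrt(sum((zi - mu) ** 2 for zi in z) / (n - 1))
    norm = sd * math.sqrt(2.0 * math.pi)
    return lambda t: math.exp(-0.5 * ((t - mu) / sd) ** 2) / norm


def kernel_weight(z, bandwidth):
    """f_3: Gaussian kernel density estimate of the marginal density of z."""
    const = 1.0 / (len(z) * bandwidth * math.sqrt(2.0 * math.pi))
    return lambda t: const * sum(
        math.exp(-0.5 * ((t - zi) / bandwidth) ** 2) for zi in z
    )


def integrate(g, lo, hi, grid=2000):
    """Midpoint rule for a one-dimensional integral of g over [lo, hi]."""
    step = (hi - lo) / grid
    return step * sum(g(lo + (k + 0.5) * step) for k in range(grid))
```

Replacing `g` by the product of a weight function and the empirical $\hat{A}(t)$ gives a numerical version of (6).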
3. Linear combination of variables that maximizes $\mathrm{AUC}_I$
For a classification or diagnostic problem, there are usually many variables
measured from each subject, and it is well known that a combination of
variables can usually improve on the classification performance of a single
variable. This situation motivates us to study how to find the optimal
linear combination of variables that maximizes the proposed measure
$\mathrm{AUC}_I$. For the classical AUC, Su and Liu (1993) studied the best
linear combination under a multivariate normal distribution assumption. Here
we extend their idea to $\mathrm{AUC}_I$. In addition, we also aim to
address cases with a huge number of variables, which usually involve some
computational issues and will be discussed later in this section.
3.1. Optimal Linear Combination of Variables Under Joint Normality
For clarity and convenience, we start with a bivariate normal distribution
case, since the linear combination of variables, for a given vector of
coefficients, can be treated as a single variable.
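Let $U = (Y, Z)^T$ be a random vector following a bivariate normal
distribution with mean vector $\mu = (\mu_1, \mu_2)^T$ and covariance matrix

$$\Sigma_U = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$

Suppose that $U_i = (Y_i, Z_i)^T$, $i = 1, 2$, are two independent random
vectors generated from the same distribution as $U$. Define

$$Q_i = \exp\left\{ -\frac{(U_i - \mu)^T \Sigma_U^{-1} (U_i - \mu)}{2} \right\}, \quad i = 1, 2.$$

Then for a given $c$,

$$pr(Y_1 > Y_2, Z_1 > c, Z_2 < c) = \int_{-\infty}^{\infty} \int_{-\infty}^{y_1} \int_{c}^{\infty} \int_{-\infty}^{c} \frac{Q_1 Q_2}{4\pi^2 |\Sigma_U|}\, dz_2\, dz_1\, dy_2\, dy_1, \tag{7}$$

where $|\Sigma_U|$ denotes the determinant of the matrix $\Sigma_U$. The
conditional distribution of $Y_j$ given $Z = z_j$ is a normal distribution
with mean $\tilde{\mu}_j = \mu_1 + (\sigma_1/\sigma_2)\rho(z_j - \mu_2)$,
$j = 1, 2$, and variance $\tilde{\sigma}_1^2 = (1 - \rho^2)\sigma_1^2$. Let

$$\eta(z_1, z_2) = \frac{1}{2\pi\sigma_2^2} \exp\left\{ -\frac{(z_1 - \mu_2)^2 + (z_2 - \mu_2)^2}{2\sigma_2^2} \right\}.$$

Then, (7) can be rewritten as

$$\begin{aligned}
pr(Y_1 > Y_2, Z_1 > c, Z_2 < c)
&= \int_{c}^{\infty} \int_{-\infty}^{c} \eta(z_1, z_2) \int_{-\infty}^{\infty} \int_{-\infty}^{y_1} \frac{1}{2\pi\tilde{\sigma}_1^2} \exp\left\{ -\frac{(y_1 - \tilde{\mu}_1)^2 + (y_2 - \tilde{\mu}_2)^2}{2\tilde{\sigma}_1^2} \right\} dy_2\, dy_1\, dz_2\, dz_1 \\
&= \int_{c}^{\infty} \int_{-\infty}^{c} \eta(z_1, z_2)\, E\left( \Phi\left( \frac{\tilde{\sigma}_1 V + \tilde{\mu}_1 - \tilde{\mu}_2}{\tilde{\sigma}_1} \right) \right) dz_2\, dz_1 \\
&= \int_{c}^{\infty} \int_{-\infty}^{c} \eta(z_1, z_2)\, E\left( \Phi\left( V + \frac{\rho(z_1 - z_2)}{\sigma_2 (1 - \rho^2)^{1/2}} \right) \right) dz_2\, dz_1,
\end{aligned} \tag{8}$$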
where $V$ is a standard normal random variable and $\Phi$ is the standard
normal cumulative distribution function. Note that, under the normality
assumption, $\rho = 0$ implies that $Y$ and $Z$ are independent, and it
follows from (8) that $\mathrm{AUC}_I = 0.5$ in this case.
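The expectation in (8) can be checked numerically. The closed form used below, $E\,\Phi(V + a) = \Phi(a/\sqrt{2})$ for a constant $a$, is not stated in the text; it is the standard identity obtained by writing $\Phi(V + a) = P(W \le V + a \mid V)$ for an independent standard normal $W$, so that $W - V$ is normal with variance 2. The sketch compares a trapezoidal approximation of the expectation with that closed form; the function names and grid settings are illustrative.

```python
import math


def std_normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def expected_phi_shift(a, half_width=10.0, steps=20000):
    """E[Phi(V + a)] for V ~ N(0, 1), by the trapezoidal rule."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        v = -half_width + k * h
        w = 0.5 if k in (0, steps) else 1.0  # trapezoid end-point weights
        total += w * std_normal_cdf(v + a) * std_normal_pdf(v)
    return total * h


if __name__ == "__main__":
    for a in (0.0, 0.5, 2.0):
        print(a, expected_phi_shift(a), std_normal_cdf(a / math.sqrt(2.0)))
```

With $a = 0$, which corresponds to $\rho = 0$, both sides equal 0.5, in line with the remark that $\mathrm{AUC}_I = 0.5$ under independence.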
Now, suppose that $\tilde{X} = (X_1, \dots, X_p)^T$ is a $p$-dimensional
random vector of measures of a subject, and $Z$ is the continuous gold
standard as before. Suppose $l \in R^p$ and let $Y = l^T \tilde{X}$ be a
linear combination of $\tilde{X}$. Assume further that $\tilde{X}$ follows a
multivariate normal distribution with mean vector $\mu^*$ and covariance
matrix $\Sigma$. Then $Y$ follows a normal distribution with mean
$\mu_1 = l^T \mu^*$ and variance $\sigma_1^2 = l^T \Sigma l$. The
correlation coefficient between $Y$ and $Z$ is
$\rho = l^T \mathrm{cov}(\tilde{X}, Z) / ((l^T \Sigma l)^{1/2} \sigma_2)$,
where
$\mathrm{cov}(\tilde{X}, Z) = (\mathrm{cov}(X_1, Z), \dots, \mathrm{cov}(X_p, Z))^T$.
Then, $\mathrm{AUC}_I$ for such a linear combination of the $X_i$'s,
$Y = l^T \tilde{X}$, is a function of $l$:

$$\mathrm{AUC}_I(l) = \int pr(l^T \tilde{X}_1 > l^T \tilde{X}_2 \mid Z_1 > t, Z_2 < t)\, f_c(t)\, dt, \tag{9}$$

where $(\tilde{X}_i^T, Z_i)^T$, $i = 1, 2$, are independent identically
distributed samples of $(\tilde{X}^T, Z)^T$. Our goal is to find the optimal
linear combination of $X_1, \dots, X_p$ such that $\mathrm{AUC}_I$ is
maximized, and it is known that AUC is scale invariant. In order to make the
solution identifiable, we search for an $l_{opt}$ such that
$\mathrm{AUC}_I(l_{opt}) \ge \mathrm{AUC}_I(l)$ for all possible
$l \in R^p$ with $\|l\| = 1$.
From (8),

$$\frac{\partial}{\partial l} E\left\{ \Phi\left( V + \frac{\rho(z_1 - z_2)}{\sigma_2 (1 - \rho^2)^{1/2}} \right) \right\} = \frac{1}{\sqrt{2}} \exp\left\{ -\frac{\rho^2 (z_1 - z_2)^2}{4\sigma_2^2 (1 - \rho^2)} \right\} \frac{z_1 - z_2}{\sigma_2 (1 - \rho^2)^{3/2}}\, \frac{\partial \rho}{\partial l}. \tag{10}$$
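Therefore,

$$\begin{aligned}
\frac{\partial \mathrm{AUC}_I(l)}{\partial l}
&= \frac{\partial \rho}{\partial l} \int f_c(t) \int_{t}^{\infty} \int_{-\infty}^{t} \frac{1}{2^{3/2} \pi \sigma_2^2} \exp\left\{ -\frac{(z_1 - \mu_2)^2 + (z_2 - \mu_2)^2}{2\sigma_2^2} \right\} \\
&\qquad \times \exp\left\{ -\frac{\rho^2 (z_1 - z_2)^2}{4\sigma_2^2 (1 - \rho^2)} \right\} \frac{z_1 - z_2}{\sigma_2 (1 - \rho^2)^{3/2}}\, \frac{1}{pr(Z_1 > t, Z_2 < t)}\, dz_2\, dz_1\, dt \\
&= \frac{\partial \rho}{\partial l}\, \Delta,
\end{aligned} \tag{11}$$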
where $\Delta$ denotes the integral in (11). Since $\Delta > 0$, the
equation $\partial \mathrm{AUC}_I(l)/\partial l = 0$ holds if and only if
$\partial \rho/\partial l = 0$; that is,

$$\frac{\partial}{\partial l}\, \frac{l^T \mathrm{cov}(\tilde{X}, Z)}{(l^T \Sigma l)^{1/2} \sigma_2} = 0.$$

It implies that the optimal linear combination coefficient is

$$l_{opt} = \Sigma^{-1} \mathrm{cov}(\tilde{X}, Z). \tag{12}$$

Note that, as in Su and Liu (1993), this optimal linear combination
coefficient $l_{opt}$ is independent of $c$, and depends only on the
covariance matrix of the variables and the covariance between the variables
and the gold standard.
3.2. Estimation of the Optimal Linear Combination
Assume that $\{(\tilde{x}_i, z_i), i = 1, \dots, n\}$ is a set of $n$
independent and identically distributed random samples, where $z_i$ denotes
the observed gold standard measure as before, and $\tilde{x}_i$ is the
corresponding $p$-dimensional vector of observed variable values of subject
$i$. Without loss of generality, we assume that all the components of
$\tilde{x}$ and $z$ are centralized, since we can always centralize the data
by subtracting their sample means, and define
$H = (\tilde{x}_1 - \bar{x}, \dots, \tilde{x}_n - \bar{x})^T$ as an
$n \times p$ matrix and $\tilde{z} = (z_1 - \bar{z}, \dots, z_n - \bar{z})^T$
as a vector of length $n$, where $\bar{x} = \sum_{i=1}^{n} \tilde{x}_i / n$
and $\bar{z} = \sum_{i=1}^{n} z_i / n$. Hence, the estimate of $l_{opt}$
based on a sample of size $n$, following from (12), is defined as

$$\hat{l} = (H^T H)^{-1} H^T \tilde{z}. \tag{13}$$

As in the linear regression problem, it is clear that $\hat{l}$ is a
strongly consistent estimate of $l_{opt}$ under some regularity conditions
on $\tilde{X}$ and $Z$. Define

$$\hat{A}(c, l) = \frac{1}{n_1 n_0} \sum_{i \in S_1(c);\, j \in S_0(c)} \psi(l^T \tilde{x}_i - l^T \tilde{x}_j). \tag{14}$$

Then

$$\hat{A}_I(l) = \int \hat{A}(t, l)\, f_c(t)\, dt \tag{15}$$

is an estimate of $\mathrm{AUC}_I(l)$. It is easy to see that, for given
$t$, $\hat{A}(t, l)$ converges to the AUC of $l^T \tilde{X}$ at cutting
point $t$ uniformly with respect to $l$. Hence, using the dominated
convergence theorem, it can be shown that $\hat{A}_I(\hat{l})$ is a strongly
consistent estimate of $\mathrm{AUC}_I(l_{opt})$; the details are omitted
here. This result is stated as a theorem below:
Theorem 3.1. Suppose that the joint distribution of
$(\tilde{X} \in R^p, Z \in R^1)$ follows a multivariate normal distribution,
where $Z$ is the continuous gold standard and $\tilde{X}$ denotes the
$p$-dimensional vector of variables. Let
$\{(\tilde{X}_1, Z_1), \dots, (\tilde{X}_n, Z_n)\}$ be independent and
identically distributed samples of size $n$. Then for a given density
$f_c(t)$, with probability one,

$$\hat{A}_I(\hat{l}) - \mathrm{AUC}_I(l_{opt}) \longrightarrow 0, \quad \text{as } n \to \infty,$$

where $\mathrm{AUC}_I(l_{opt})$ and $\hat{A}_I(\hat{l})$ are defined as in
(9) and (15) with $l = l_{opt}$ and $\hat{l}$, respectively.
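The estimate $\hat{l}$ of (13) is the ordinary least-squares fit of the centred gold standard on the centred variables. A minimal pure-Python sketch, suitable only for small $p$, is given below; the function names are hypothetical, and the normal equations are solved by Gaussian elimination with partial pivoting rather than by forming an explicit inverse.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("H^T H is numerically singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def optimal_combination(xs, z):
    """l_hat = (H^T H)^{-1} H^T z_tilde of Eq. (13), after centring."""
    n, p = len(xs), len(xs[0])
    x_bar = [sum(row[j] for row in xs) / n for j in range(p)]
    z_bar = sum(z) / n
    h = [[row[j] - x_bar[j] for j in range(p)] for row in xs]
    zt = [zi - z_bar for zi in z]
    hth = [[sum(h[i][r] * h[i][c] for i in range(n)) for c in range(p)]
           for r in range(p)]
    htz = [sum(h[i][r] * zt[i] for i in range(n)) for r in range(p)]
    return solve_linear_system(hth, htz)
```

The singularity check makes the difficulty with large $p$ concrete: when $p$ exceeds $n$, $H^T H$ cannot be inverted and this direct approach fails.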
Equation (13) provides a neat solution for the best linear combination of
variables under a joint multivariate normality assumption. However, it can
be seen from (13) that the calculation of $\hat{l}$ relies on the
computation of an inverse matrix. Thus, when the number of variables is
large, the direct calculation of $\hat{l}$ using (13) becomes numerically
unstable. The situation is worse when the sample size is small relative to
the number of variables. We therefore need an alternative numerical approach
that can handle problems with large $p$.
Again, from (13), we find that the estimate $\hat{l}$ can be viewed as the
least-squares estimate of $l$ in the linear regression model

$$\tilde{z} = H l + e, \tag{16}$$

where $e$ is an $n$-dimensional vector of random errors. When $p$ is small,
the solution can be obtained easily, as in regression problems. When $p$ is
large, we can apply the least angle regression (LARS) method (Efron et al.,
2004) to (16) to obtain an estimate of $l$. Since this is the same as
applying LARS in a regression setup, the properties of LARS are inherited.
With the assistance of LARS, the proposed measure can be applied to evaluate
linear combinations of a large number of variables. The variable selection
scheme follows from LARS as it is used in regression models. However, when
the normality assumption is violated, or the normal approximation to the
joint distribution is not adequate, the empirical results show that the
$l_{opt}$ defined in (12) is not a good solution. Thus, an alternative
algorithm, which does not rely on the normality assumption, is required and
is developed below.
Remark 3.2. Since the properties of applying LARS to find the linear
combination of variables are the same as those in linear regression, we omit
the details of applying LARS under the normality assumption. Instead, we
focus on the case without a normality assumption.
3.3. When the Joint Distribution is Unknown
As before, we start with a one-dimensional case; the case with a linear
combination of variables will follow easily as an extension.
Similarly to the methods used in Ma and Huang (2005) and Wang et al.
(2007a), we first use a sigmoid function $S(t) = 1/(1 + \exp(-t))$ to
approximate $\psi(\cdot)$ in equation (21). Thus, a smooth estimate of
$\mathrm{AUC}_I$ is defined as

$$\hat{A}_{Is} = \int \frac{1}{n_1 n_0} \sum_{i \in S_1(t);\, j \in S_0(t)} S\left( \frac{y_i - y_j}{h} \right) f_c(t)\, dt. \tag{17}$$

It follows from results in the density estimation literature that, for a
sufficiently small window width $h$, $S((y - x)/h) \approx \psi(y - x)$,
which implies the following asymptotic properties of $\hat{A}_{Is}$:
Theorem 3.3. Assume that $\{(y_1, z_1), \dots, (y_n, z_n)\}$ are $n$
independent and identically distributed samples of
$(Y \in R^1, Z \in R^1)$, where $Z$ denotes a continuous gold standard.
Denote the marginal densities of $Y$ and $Z$ by $f_Y$ and $f_Z$,
respectively. Let $F(z \mid y)$ be the conditional cumulative distribution
function of $Z$ given $Y = y$. Suppose that $f_Y$ and $f_Z$ are larger than
0 and bounded. Assume that both $f_Y(\cdot)$ and $F(z \mid \cdot)$ are
uniformly continuous. Then for a given probability density $f_c(t)$ with
$h = O(n^{-\alpha})$, $1/5 < \alpha < 1/2$,

$$\hat{A}_{Is} - \mathrm{AUC}_I \to 0 \quad \text{almost surely as } n \to \infty,$$

where $\mathrm{AUC}_I$ and $\hat{A}_{Is}$ are defined in (2) and (17),
respectively.
(The proof of Theorem 3.3 relies on some classical results of density
approximation theory. The details are given in Appendix A.)
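The smoothed estimate in (17) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the integral over $t$ is replaced by a weighted sum over a user-supplied list of cutting points `cuts` with weights `weights`, and the function names are hypothetical.

```python
import math


def sigmoid(t):
    """S(t) = 1 / (1 + exp(-t)), arranged to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def smoothed_auc(y, z, c, h):
    """Sigmoid-smoothed AUC(c): psi in Eq. (4) replaced by S(./h)."""
    cases = [yi for yi, zi in zip(y, z) if zi > c]
    controls = [yi for yi, zi in zip(y, z) if zi <= c]
    if not cases or not controls:
        return 0.5  # same convention as for the unsmoothed estimate
    total = sum(sigmoid((yi - yj) / h) for yi in cases for yj in controls)
    return total / (len(cases) * len(controls))


def smoothed_auc_i(y, z, cuts, weights, h):
    """Discretised Eq. (17): weighted average of smoothed AUC over cuts."""
    total_weight = sum(weights)
    return sum(w * smoothed_auc(y, z, c, h)
               for c, w in zip(cuts, weights)) / total_weight
```

As $h$ shrinks, `sigmoid((yi - yj) / h)` approaches the step function $\psi$, so the smoothed value approaches the empirical one; a very large $h$ flattens every term towards 0.5.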
As before, we replace $y$ in (17) with $l^T \tilde{x}$; then we have the
smooth estimate of $\mathrm{AUC}_I(l)$ below:

$$\hat{A}_{Is}(l) = \int \frac{1}{n_1 n_0} \sum_{i \in S_1(t);\, j \in S_0(t)} S\left( \frac{l^T \tilde{x}_i - l^T \tilde{x}_j}{h} \right) f_c(t)\, dt. \tag{18}$$

The asymptotic property of $\hat{A}_{Is}(l)$ follows easily from Theorem
3.3, and is summarized as the following theorem without proof.
Theorem 3.4. Suppose that
$\{(\tilde{x}_1, z_1), \dots, (\tilde{x}_n, z_n)\}$ are $n$ independent and
identically distributed samples of $(\tilde{X} \in R^p, Z \in R^1)$, where
$Z$ denotes the continuous gold standard and $\tilde{X}$ is a vector of
corresponding variables. Let $f_c(t)$ be a probability density. Assume that,
for a given constant vector $l \in R^p$, the conditions of Theorem 3.3 hold
for $Y = l^T \tilde{X}$ and $Z$. Then for $h = O(n^{-\alpha})$ with
$1/5 < \alpha < 1/2$,

$$\hat{A}_{Is}(l) - \mathrm{AUC}_I(l) \to 0 \quad \text{almost surely as } n \to \infty,$$

where $\mathrm{AUC}_I(l)$ and $\hat{A}_{Is}(l)$ are defined in (9) and (18),
respectively.
Remark 3.5. We only need to estimate the density function of the linear
combination $l^T \tilde{X} \in R^1$; hence the choice of $h$ does not depend
on the number of variables $p$. Thus, the density estimation part of the
proposed algorithm does not suffer from the curse of dimensionality.
Following Theorem 3.3, we apply the threshold gradient descent method (TGDM)
of Friedman and Popescu (2004) to find the best linear combination $\hat{l}$
that maximizes $\hat{A}_{Is}(l)$. That is, we find a solution

$$\hat{l} = \arg\max_{l} \hat{A}_{Is}(l). \tag{19}$$

From equation (18), we know that $\hat{A}_{Is}$ is also scale invariant, as
is AUC. That is, $\hat{A}_{Is}(l)$ with window width $h$ equals
$\hat{A}_{Is}(kl)$ with window width $kh$ for a positive constant $k$.
Hence, an anchor variable is needed so that the solution of (19) is unique.
TGDM Based Algorithm. Let
$\{(\tilde{x}_1, z_1), \dots, (\tilde{x}_n, z_n)\}$ be a set of random
samples of size $n$ that satisfies the assumptions of Theorem 3.4. Define
$s = (s_1, \dots, s_p)^T$ as a $p$-dimensional vector with $s_i = 1$ if the
empirical $\mathrm{AUC}_I$ of the $i$th variable is greater than 0.5;
otherwise set $s_i = -1$. Let $\beta_i$ be a $p$-dimensional vector whose
$i$th component equals $s_i$ and whose other components are 0. Define
$R_i = \hat{A}_{Is}(\beta_i)$, and choose the variable with the maximum
$R_i$ value as the anchor variable. In the following algorithm we assume,
without loss of generality, that $R_1 > R_i$ for $i = 2, \dots, p$. Let
$\hat{l}_i$ denote the $i$th component of $\hat{l}$; then $\hat{l}_1$ is the
coefficient of the anchor variable. In order to make the coefficients
identifiable, we set $|\hat{l}_1| = 1$. With the notation defined above, a
TGDM-based algorithm for finding the best linear combination of variables
that maximizes $\hat{A}_{Is}$ is stated below:
Algorithm:
(0) Initial stage: Let $r = 0$ and choose a threshold parameter $\tau$. Set
$l^{(0)} = (s_1, 0, \dots, 0)^T$.
(1) Given $l = l^{(r)}$, calculate the derivative of the smoothed estimate
$\hat{A}_{Is}(l)$ with respect to the linear coefficient $l$,
$d(l^{(r)}) = (d_1(l^{(r)}), \dots, d_p(l^{(r)}))^T = \partial \hat{A}_{Is}(l)/\partial l \,|_{l = l^{(r)}}$.
(2) Use the threshold gradient descent method to calculate
$l = l_0^{(r+1)}$; that is,
$l_0^{(r+1)} = l^{(r)} + \delta\, t(\tau, l^{(r)}) \cdot d(l^{(r)})$ for
some $\delta > 0$, where the product is taken componentwise and
$t(\tau, l^{(r)})$ is the indicator vector
$I\{ |d(l^{(r)})| > \tau \max\{ |d_1(l^{(r)})|, \dots, |d_p(l^{(r)})| \} \}$.
(3) Find the optimal
$\delta^* = \arg\max_{\delta > 0} \hat{A}_{Is}(l_0^{(r+1)})$ with
$l_0^{(r+1)} = l^{(r)} + \delta\, t(\tau, l^{(r)}) \cdot d(l^{(r)})$, and
update $l^{(r+1)} = l^{(r)} + \delta^*\, t(\tau, l^{(r)}) \cdot d(l^{(r)})$.
(4) Repeat steps (1)-(3) until $\hat{A}_{Is}(l^{(r+1)})$ converges.
Remark 3.6. The initial value of $l$ is chosen as $(s_1, 0, \dots, 0)^T$,
since the first component of $l$ corresponds to the selected anchor
variable. In Step (2), we update $l^{(r)}$ along the direction
$t(\tau, l^{(r)}) \cdot d(l^{(r)})$, where the number of nonzero components
is decided by the threshold parameter $\tau$ and, by the definition of
$t(\tau, l^{(r)})$, the locations of the nonzero components of
$t(\tau, l^{(r)})$ are determined by the elements of the gradient
$d(l^{(r)})$. Step (3) finds a suitable step size $\delta^*$ along the
direction of Step (2), and then updates the linear coefficients of the
variables. The convergence criterion of Step (4) has to be predetermined.
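The steps above can be sketched in Python as follows. This is a hedged illustration and not the GoldAUC software: it assumes that the anchor variable is already in the first column with $s_1 = +1$, replaces the integral in (18) by an equally weighted sum over a list `cuts`, keeps $l_1$ fixed by zeroing its gradient component, uses "greater than or equal" in the threshold so that $\tau = 1$ still selects the largest component, and searches the step size over a small fixed grid `steps`. All names and defaults are hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable S(t) = 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def split(z, c):
    """Indices of cases (z_i > c) and controls (z_i <= c)."""
    cases = [i for i, zi in enumerate(z) if zi > c]
    controls = [j for j, zj in enumerate(z) if zj <= c]
    return cases, controls


def smoothed_auc_i(l, xs, z, cuts, h):
    """Equally weighted, discretised version of Eq. (18)."""
    scores = [sum(a * b for a, b in zip(l, row)) for row in xs]
    total = 0.0
    for c in cuts:
        cases, controls = split(z, c)
        if not cases or not controls:
            total += 0.5
            continue
        s = sum(sigmoid((scores[i] - scores[j]) / h)
                for i in cases for j in controls)
        total += s / (len(cases) * len(controls))
    return total / len(cuts)


def gradient(l, xs, z, cuts, h):
    """Gradient of smoothed_auc_i in l, using S'(u) = S(u) (1 - S(u))."""
    p = len(l)
    scores = [sum(a * b for a, b in zip(l, row)) for row in xs]
    grad = [0.0] * p
    for c in cuts:
        cases, controls = split(z, c)
        if not cases or not controls:
            continue
        norm = len(cases) * len(controls) * len(cuts) * h
        for i in cases:
            for j in controls:
                s = sigmoid((scores[i] - scores[j]) / h)
                w = s * (1.0 - s) / norm
                for k in range(p):
                    grad[k] += w * (xs[i][k] - xs[j][k])
    return grad


def tgdm(xs, z, cuts, h=0.5, tau=0.5, steps=(0.01, 0.05, 0.1, 0.5, 1.0),
         max_iter=100, tol=1e-6):
    """Threshold gradient ascent on the smoothed AUC_I (Steps (0)-(4))."""
    p = len(xs[0])
    l = [1.0] + [0.0] * (p - 1)          # Step (0): anchor first, s_1 = +1
    value = smoothed_auc_i(l, xs, z, cuts, h)
    for _ in range(max_iter):
        d = gradient(l, xs, z, cuts, h)  # Step (1)
        d[0] = 0.0                       # assumed way to keep |l_1| = 1
        biggest = max(abs(dk) for dk in d)
        if biggest == 0.0:
            break
        # Step (2): keep only components passing the threshold
        direction = [dk if abs(dk) >= tau * biggest else 0.0 for dk in d]
        best_l, best_value = l, value
        for delta in steps:              # Step (3): grid search for delta
            cand = [lk + delta * gk for lk, gk in zip(l, direction)]
            v = smoothed_auc_i(cand, xs, z, cuts, h)
            if v > best_value:
                best_l, best_value = cand, v
        if best_value - value < tol:     # Step (4): convergence check
            break
        l, value = best_l, best_value
    return l, value
```

A larger `tau` keeps fewer components in each update and so yields sparser coefficient vectors, while `tau = 0` reduces the update to plain gradient ascent on the non-anchor coefficients.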
(The software used in this paper (GoldAUC) is available at
http://idv.sinica.edu.tw/ycchang/software.html).
4. Numerical studies
In the numerical studies, we calculate the proposed measures
$\hat{A}_{Ij}$, $j = 1, 2, 3$, corresponding to the three different
$f_c(t)$ defined before. Since the correlation coefficient is a basic
statistic for measuring the association between two continuous variables, we
include it in our experimental studies. We also compare the performance of
our methods with that of the method of Obuchowski (2006) (page 485,
Equation (9)), described below:
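$$\hat{\theta} = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} \psi'(y_i, z_i, y_j, z_j), \tag{20}$$

where $i \ne j$ and

$$\psi'(y_i, z_i, y_j, z_j) = \begin{cases} 1 & \text{if } y_i > y_j \text{ and } z_i > z_j, \text{ or } y_i < y_j \text{ and } z_i < z_j; \\ 0.5 & \text{if } y_i = y_j \text{ or } z_i = z_j; \\ 0 & \text{otherwise.} \end{cases}$$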
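The statistic in (20) is a direct double loop over ordered pairs of subjects. The following minimal sketch, with hypothetical function names, computes it in $O(n^2)$ time.

```python
def psi_prime(yi, zi, yj, zj):
    """Pair score of Eq. (20): 1 if concordant, 0.5 if tied, 0 otherwise."""
    if yi == yj or zi == zj:
        return 0.5
    if (yi > yj) == (zi > zj):
        return 1.0
    return 0.0


def obuchowski_theta(y, z):
    """theta_hat of Eq. (20): mean pair score over all ordered pairs i != j."""
    n = len(y)
    total = sum(psi_prime(y[i], z[i], y[j], z[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))
```

The tie check comes first only for brevity; it gives the same values as the case order in the definition, because a concordant pair cannot contain a tie.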
The sample sizes used in our numerical studies are $n = 50$ and $100$. The
window width for the kernel estimate in $\hat{A}_{I3}$ is equal to
$n^{1/5}$. The bootstrap sample size for estimating the variance in each
case is 200, and there are 100 replicates for each simulation setup. For the
first experimental study, the data are generated from bivariate normal
distributions with means $\mu_1 = \mu_2 = 1.0$, standard deviations
$(\sigma_1, \sigma_2)$ equal to $(1.0, 1.0)$, $(1.0, 2.0)$, $(2.0, 1.0)$ and
$(2.0, 2.0)$, and correlation coefficients equal to $\rho = 0.0$, $0.25$,
$0.5$, $0.75$, and $1.0$. Let $\hat{\mu}$ and $\hat{\sigma}^2$ denote the
sample mean and variance of $z$. As in classical ROC curve analysis, when a
variable has no diagnostic power, its corresponding ROC curve is the
45-degree diagonal line of the unit square. If this holds for all possible
cutting points, then it implies that $\mathrm{AUC}_I = 0.5$. So, we use 0.5
as the value under the null hypothesis in our numerical study. Table 1 shows
five statistics for different simulation setups: the correlation coefficient
of the two variables $\hat{\rho}$; $\hat{A}_{Ij}$, $j = 1, 2, 3$, with the
corresponding $f_c(t)$'s; and $\hat{\theta}$ from Obuchowski (2006). Figure
1 is a plot of the statistics $\hat{\rho}^2 / V(\hat{\rho})$,
$(\hat{A}_{Ij} - 0.5)^2 / V(\hat{A}_{Ij})$ for all $j$, and
$(\hat{\theta} - 0.5)^2 / V(\hat{\theta})$ versus $\rho$, where
$V(\hat{\rho})$ and $V(\hat{\theta})$ are the bootstrap estimates of the
variances of $\hat{\rho}$ and $\hat{\theta}$, respectively.
When the joint distribution of two variables is bivariate normal, the
correlation coefficient is a natural statistic to describe the association
between the two variables. In our study, all five measures increase as the
true correlation coefficient $\rho$ increases, which suggests that all
measures catch the linear association between variable $Y$ and the gold
standard $Z$, as expected. In fact, $\hat{A}_{Ij}$ and $\hat{\theta}$ are
very close to their true values 0.5 and 1.0 when $\rho$ is equal to 0.0 and
1.0, respectively. In addition, Figure 1 shows that the values of
$\hat{\rho}^2 / V(\hat{\rho})$ and
$(\hat{A}_{Ij} - 0.5)^2 / V(\hat{A}_{Ij})$, $j = 1, 2, 3$, are larger than
those of $(\hat{\theta} - 0.5)^2 / V(\hat{\theta})$ under the current
simulation setup.
Table 2 shows the results of the five measures when there is no monotone
association between variable $Y$ and the gold standard $Z$. That is, the
data sets used in this table are generated from the model
$y = z^2 + \epsilon$ with standard normal error $\epsilon$, where the gold
standard $z$ is generated from three different distributions: (1) a normal
distribution, (2) a $t$ distribution with 2 degrees of freedom, and (3) a
Cauchy
Table 1: Comparison of five measure indexes: $\hat{\rho}$, $\hat{A}_{Ij}$, $j = 1, 2, 3$, and $\hat{\theta}$, where the marker
and gold standard, $(y, z)$, follow a multivariate normal distribution with means $\mu_1 = \mu_2 = 1.0$,
with different standard deviations $\sigma_1$, $\sigma_2$ and distinct correlation coefficients $\rho$.

| n | $(\sigma_1, \sigma_2)$ | Method | $\rho = 0.0$ | $\rho = 0.25$ | $\rho = 0.5$ | $\rho = 0.75$ | $\rho = 1.0$ |
|---|---|---|---|---|---|---|---|
| 50 | (1.0, 1.0) | $\hat{\rho}$ | 0.105 (0.076, 0.140)* | 0.252 (0.118, 0.130) | 0.511 (0.088, 0.103) | 0.747 (0.067, 0.064) | 1.000 (0.000, 0.000) |
| | | $\hat{A}_{I1}$ | 0.505 (0.064, 0.073) | 0.621 (0.063, 0.069) | 0.746 (0.053, 0.058) | 0.866 (0.040, 0.038) | 1.000 (0.000, 0.000) |
| | | $\hat{A}_{I2}$ | 0.501 (0.065, 0.067) | 0.616 (0.062, 0.063) | 0.743 (0.046, 0.052) | 0.856 (0.035, 0.033) | 0.979 (0.010, 0.013) |
| | | $\hat{A}_{I3}$ | 0.498 (0.065, 0.066) | 0.611 (0.062, 0.062) | 0.737 (0.045, 0.051) | 0.846 (0.037, 0.034) | 0.968 (0.011, 0.015) |
| | | $\hat{\theta}$ | 0.504 (0.044, 0.049) | 0.583 (0.044, 0.048) | 0.673 (0.038, 0.044) | 0.771 (0.036, 0.035) | 1.000 (0.000, 0.004) |
| 50 | (1.0, 2.0) | $\hat{\rho}$ | 0.106 (0.073, 0.136) | 0.263 (0.118, 0.131) | 0.477 (0.099, 0.109) | 0.750 (0.058, 0.065) | 1.000 (0.000, 0.000) |
| | | $\hat{A}_{I1}$ | 0.497 (0.067, 0.073) | 0.621 (0.061, 0.070) | 0.730 (0.053, 0.061) | 0.862 (0.034, 0.040) | 1.000 (0.000, 0.000) |
| | | $\hat{A}_{I2}$ | 0.495 (0.065, 0.066) | 0.622 (0.061, 0.064) | 0.729 (0.053, 0.054) | 0.859 (0.029, 0.033) | 0.980 (0.008, 0.010) |
| | | $\hat{A}_{I3}$ | 0.496 (0.065, 0.066) | 0.622 (0.062, 0.064) | 0.729 (0.051, 0.054) | 0.858 (0.030, 0.034) | 0.983 (0.004, 0.009) |
| | | $\hat{\theta}$ | 0.498 (0.044, 0.049) | 0.583 (0.043, 0.049) | 0.660 (0.038, 0.044) | 0.769 (0.032, 0.036) | 1.000 (0.000, 0.004) |
| 100 | (1.0, 1.0) | $\hat{\rho}$ | 0.085 (0.056, 0.098) | 0.253 (0.083, 0.092) | 0.497 (0.082, 0.075) | 0.747 (0.046, 0.044) | 1.000 (0.000, 0.000) |
| | | $\hat{A}_{I1}$ | 0.490 (0.050, 0.051) | 0.620 (0.043, 0.048) | 0.739 (0.046, 0.041) | 0.865 (0.024, 0.027) | 1.000 (0.000, 0.000) |
| | | $\hat{A}_{I2}$ | 0.485 (0.053, 0.049) | 0.622 (0.041, 0.046) | 0.741 (0.042, 0.038) | 0.864 (0.023, 0.023) | 0.987 (0.007, 0.008) |
| | | $\hat{A}_{I3}$ | 0.483 (0.054, 0.049) | 0.620 (0.042, 0.045) | 0.739 (0.042, 0.037) | 0.861 (0.024, 0.023) | 0.982 (0.006, 0.009) |
| | | $\hat{\theta}$ | 0.493 (0.033, 0.034) | 0.581 (0.029, 0.033) | 0.668 (0.033, 0.030) | 0.771 (0.023, 0.024) | 1.000 (0.000, 0.001) |
| 100 | (1.0, 2.0) | $\hat{\rho}$ | 0.075 (0.057, 0.097) | 0.266 (0.100, 0.091) | 0.499 (0.081, 0.074) | 0.739 (0.045, 0.046) | 1.000 (0.000, 0.000) |
| | | $\hat{A}_{I1}$ | 0.496 (0.049, 0.051) | 0.625 (0.053, 0.048) | 0.739 (0.042, 0.041) | 0.859 (0.025, 0.027) | 1.000 (0.000, 0.000) |
| | | $\hat{A}_{I2}$ | 0.493 (0.050, 0.049) | 0.629 (0.051, 0.045) | 0.744 (0.041, 0.037) | 0.862 (0.024, 0.023) | 0.987 (0.006, 0.006) |
| | | $\hat{A}_{I3}$ | 0.494 (0.049, 0.049) | 0.630 (0.052, 0.045) | 0.745 (0.041, 0.037) | 0.862 (0.024, 0.024) | 0.99 (0.003, 0.005) |
| | | $\hat{\theta}$ | 0.498 (0.032, 0.034) | 0.586 (0.036, 0.033) | 0.667 (0.031, 0.030) | 0.765 (0.022, 0.024) | 1.000 (0.000, 0.001) |

* Empirical standard deviations and mean values of bootstrap standard deviations are in parentheses.
[Figure 1 plot: four panels, $(\sigma_1, \sigma_2) = (2.0, 1.0)$ and $(2.0, 2.0)$ for $n = 50$ and $n = 100$; horizontal axis $\rho$, vertical axis "Stat."; curves for $\hat{\rho}$, $\mathrm{AUC}_{I1}$, $\mathrm{AUC}_{I2}$, $\mathrm{AUC}_{I3}$ and $\hat{\theta}$.]
Figure 1: Comparison of five measures: $\hat{\rho}^2 / V(\hat{\rho})$, $(\hat{A}_{Ij} - 0.5)^2 / V(\hat{A}_{Ij})$, $j = 1, 2, 3$, and
$(\hat{\theta} - 0.5)^2 / V(\hat{\theta})$, where $(Y, Z)$ follow bivariate normal distributions with means $\mu_1 = \mu_2 = 1.0$,
with different standard deviations $\sigma_1$, $\sigma_2$ and correlation coefficients $\rho$.
distribution. Since z has a symmetric density function in all three cases, it is clear that there is no association between Y and Z. That is, the ideal values of the correlation coefficient estimate $\hat\rho$ and of the ROC-type index estimates $\hat{A}_{Ij} - 0.5$, $j = 1, 2, 3$, and $\hat\theta - 0.5$ should be close to 0. We calculate the 25%, 50% and 75% empirical quantiles based on 100 simulations. The p-values, with a nominal significance level equal to 0.05, for the statistics $\hat\rho^2/V(\hat\rho)$, $(\hat{A}_{Ij} - 0.5)^2/V(\hat{A}_{Ij})$, $j = 1, 2, 3$, and $(\hat\theta - 0.5)^2/V(\hat\theta)$ are also reported. It is seen from Table 2 that all three quantiles of $\hat{A}_{I3}$ and $\hat\theta$ are very close to 0, while the correlation coefficient seems to overestimate the association of Y and Z in this experiment. When the tail of the distribution of Z becomes heavier, the quantiles and p-values of $\hat\rho$ move further from 0.0 and from the nominal 0.05, respectively. In particular, when Z is from a Cauchy distribution, the 25% quantiles of $\hat\rho$ are larger than 0.5 and the corresponding p-values are greater than 0.3.
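The statistics of the form (estimate − null value)²/V(estimate) used in this comparison can be sketched as follows. This is only an illustration under stated assumptions: V(·) is replaced here by a bootstrap variance over resampled (y, z) pairs, the Pearson correlation stands in for any of the five estimators, and the function names `pearson`, `bootstrap_variance` and `wald_statistic` are hypothetical. The comparison with 3.841, the 95% quantile of a chi-square distribution with one degree of freedom, is likewise an assumption about how a 0.05-level test might be carried out.

```python
import random
import statistics


def pearson(ys, zs):
    """Sample Pearson correlation; returns 0.0 if either variable is constant."""
    my, mz = statistics.fmean(ys), statistics.fmean(zs)
    syz = sum((y - my) * (z - mz) for y, z in zip(ys, zs))
    syy = sum((y - my) ** 2 for y in ys)
    szz = sum((z - mz) ** 2 for z in zs)
    denom = (syy * szz) ** 0.5
    return syz / denom if denom > 0 else 0.0


def bootstrap_variance(estimator, ys, zs, n_boot, rng):
    """Variance of the estimator over n_boot resamples of the (y, z) pairs."""
    n = len(ys)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(estimator([ys[i] for i in idx], [zs[i] for i in idx]))
    return statistics.variance(values)


def wald_statistic(estimator, null_value, ys, zs, n_boot=200, seed=0):
    """(estimate - null_value)^2 / V, with V a bootstrap variance (an assumption)."""
    rng = random.Random(seed)
    estimate = estimator(ys, zs)
    variance = bootstrap_variance(estimator, ys, zs, n_boot, rng)
    return (estimate - null_value) ** 2 / variance


if __name__ == "__main__":
    rng = random.Random(1)
    zs = [rng.gauss(0.0, 1.0) for _ in range(50)]
    ys = [z * z + rng.gauss(0.0, 1.0) for z in zs]  # y = z^2 + eps, normal z
    stat = wald_statistic(pearson, 0.0, ys, zs)
    # Reject "no association" at the 0.05 level if stat exceeds 3.841 (assumed rule).
    print(stat, stat > 3.841)
```

For an AUC-type estimator the null value would be 0.5 instead of 0.0, matching the centering used in the statistics above.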
The performances of $\hat{A}_{I3}$ and $\hat\theta$ are better than those of $\hat{A}_{I1}$ and $\hat{A}_{I2}$ when Z is not from a normal distribution. Because $\hat{A}_{I3}$ is based on a kernel estimate of $f_c(t)$ and $\hat\theta$ is founded on a nonparametric method, they are not affected by the distribution of Z, and are therefore very stable even when Z is not normally distributed.
In summary, based on the results of Figure 1 and Tables 1 and 2, both $\hat{A}_{I3}$ and $\hat\theta$ are recommended for detecting the association between variables and the continuous gold standard. Although $\hat\theta$ is considered a natural extension of the ordinary AUC index, it is worth noting that the performance of $\hat{A}_{Ij}$ (especially $\hat{A}_{I3}$) in these cases is very competitive.
Table 2: Comparison of different methods when there is no association between variable Y and the gold standard Z. The data set (y, z) is generated from the model $y = z^2 + \epsilon$ with standard normal error $\epsilon$. Three different distributions of z are used: a normal distribution, a t distribution with 2 degrees of freedom ($t_2$) and a Cauchy distribution.

                       Normal                            t2                                Cauchy
n    Model             (25%, 50%, 75%)        p-value∗   (25%, 50%, 75%)        p-value    (25%, 50%, 75%)        p-value
50   $\hat\rho$        (0.086, 0.175, 0.287)  0.13       (0.277, 0.544, 0.802)  0.30       (0.515, 0.834, 0.948)  0.33
     $\hat{A}_{I1}$    (0.031, 0.060, 0.110)  0.09       (0.048, 0.079, 0.125)  0.12       (0.044, 0.074, 0.125)  0.13
     $\hat{A}_{I2}$    (0.025, 0.054, 0.090)  0.06       (0.041, 0.076, 0.120)  0.13       (0.040, 0.078, 0.150)  0.15
     $\hat{A}_{I3}$    (0.022, 0.050, 0.087)  0.06       (0.036, 0.060, 0.090)  0.09       (0.028, 0.055, 0.088)  0.09
     $\hat\theta$      (0.021, 0.043, 0.081)  0.08       (0.034, 0.060, 0.087)  0.09       (0.025, 0.049, 0.090)  0.08
100  $\hat\rho$        (0.055, 0.114, 0.202)  0.07       (0.305, 0.527, 0.728)  0.27       (0.603, 0.825, 0.931)  0.37
     $\hat{A}_{I1}$    (0.022, 0.044, 0.072)  0.07       (0.016, 0.035, 0.072)  0.06       (0.029, 0.052, 0.098)  0.15
     $\hat{A}_{I2}$    (0.020, 0.033, 0.059)  0.06       (0.014, 0.033, 0.064)  0.05       (0.026, 0.048, 0.096)  0.14
     $\hat{A}_{I3}$    (0.017, 0.034, 0.060)  0.06       (0.009, 0.032, 0.056)  0.04       (0.018, 0.041, 0.07)   0.08
     $\hat\theta$      (0.015, 0.032, 0.049)  0.07       (0.012, 0.031, 0.054)  0.04       (0.014, 0.036, 0.065)  0.06

∗Nominal significance level is 0.05.

4.1. Combination of Variables
Both the correlation coefficient (CC) and the TGDM algorithm are used to obtain the optimal linear combinations of variables. We then calculate $\hat{A}_{I3}$ and $\hat\theta$ of the corresponding combination of variables based on the coefficient vectors obtained from these two methods. The threshold parameter $\tau$ in the TGDM algorithm is set to 1.0 in our studies. The data sets are generated from $Z = l^T \tilde{X} + \epsilon$, where $\tilde{X}$ follows a p-dimensional multivariate normal distribution with mean vector $(0, \ldots, 0)^T$ and an identity covariance matrix, and the true $l = (1.0, 1.0, 0.0, \ldots, 0.0)^T$. The error term $\epsilon$ is generated from either the standard normal distribution or a Cauchy distribution. In this experimental study, we have tried three different dimensions of X (p = 4, 10, 20) for all cases, and only variables $x_1$ and $x_2$ have nonzero coefficients. That is, only these two variables are associated with the gold standard. Moreover, software based on the TGDM algorithm to calculate the optimal linear combination of variables is available as an R package. It is also worth noting that there is no algorithm or discussion in Obuchowski (2005) about finding the linear combination of variables based on $\hat\theta$.
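The simulation design of Section 4.1, $Z = l^T \tilde{X} + \epsilon$ with $l = (1.0, 1.0, 0.0, \ldots, 0.0)^T$, can be sketched as below. This is a minimal illustration, not the authors' R implementation: the function name `simulate` and its arguments are hypothetical, and the standard Cauchy error is drawn by the inverse-CDF transform tan(π(U − 0.5)) of a uniform variable U.

```python
import math
import random


def simulate(n, p, error="normal", seed=0):
    """Draw n observations of (x, z) with z = x1 + x2 + eps and p >= 2 predictors."""
    if p < 2:
        raise ValueError("p must be at least 2")
    if error not in ("normal", "cauchy"):
        raise ValueError("error must be 'normal' or 'cauchy'")
    rng = random.Random(seed)
    coef = [1.0, 1.0] + [0.0] * (p - 2)  # only x1 and x2 carry signal
    xs, zs = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]  # identity covariance
        if error == "normal":
            eps = rng.gauss(0.0, 1.0)
        else:
            eps = math.tan(math.pi * (rng.random() - 0.5))  # standard Cauchy
        zs.append(sum(c * v for c, v in zip(coef, x)) + eps)
        xs.append(x)
    return xs, zs


if __name__ == "__main__":
    for p in (4, 10, 20):
        xs, zs = simulate(n=50, p=p, error="cauchy", seed=p)
        print(p, len(xs), len(xs[0]), len(zs))
```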
Table 3 lists the values of $\hat{A}_{I3}$ and $\hat\theta$ for the individual variables $x_1$ and $x_2$, and for the linear combinations based on the CC and TGDM methods. From this table, we find that $\hat{A}_{I3}$ and $\hat\theta$ for linear combinations of variables are always larger than for individual variables, which confirms that linear combinations of variables can improve on the diagnostic power of individual variables. When $\epsilon$ follows the standard normal distribution, $\hat{A}_{I3}$ and $\hat\theta$ for linear combinations based on both TGDM and CC are very close. However, when $\epsilon$ follows a Cauchy distribution, the TGDM method has larger $\hat{A}_{I3}$ and $\hat\theta$ than combinations based on CC. This is because the CC method relies on the normality assumption, while TGDM does not. In addition, from Table 3, we can see that $\hat{A}_{I3}$ is larger than $\hat\theta$. In most of the cases, the standard deviations of TGDM are smaller than those of $\hat\theta$, which suggests that the linear combinations based on TGDM have greater diagnostic power, although the difference may not be statistically significant in our simulation.
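The role of the threshold parameter $\tau$ (set to 1.0 in these simulations) can be illustrated with a generic thresholded gradient step in the spirit of Friedman and Popescu (2004). The sketch below is an assumption about the general technique rather than the exact TGDM update of this paper: the name `tgd_step`, the ascent direction and the fixed step size are illustrative. Coordinates are updated only when their gradient magnitude reaches $\tau$ times the largest magnitude, so $\tau = 1.0$ moves only the steepest coordinate(s) and $\tau = 0$ reduces to an ordinary gradient step.

```python
def tgd_step(coef, grad, tau, step):
    """One thresholded gradient-ascent update of the coefficient vector.

    coef: current coefficients; grad: gradient of the objective at coef;
    tau in [0, 1]: threshold; step: step size (all names are illustrative).
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    largest = max(abs(g) for g in grad)
    updated = []
    for b, g in zip(coef, grad):
        active = abs(g) >= tau * largest  # keep only the steepest coordinates
        updated.append(b + step * g if active else b)
    return updated


if __name__ == "__main__":
    gradient = [0.5, -2.0, 1.0]
    print(tgd_step([0.0, 0.0, 0.0], gradient, tau=1.0, step=0.1))
    print(tgd_step([0.0, 0.0, 0.0], gradient, tau=0.0, step=0.1))
```

A large threshold tends to produce sparse coefficient vectors, which is one reason this family of methods is attractive when p is large.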
Table 3: Results of linear combination using the correlation coefficient (CC) and TGDM methods.

                                            Nonzero coef.+
Distribution  p∗∗  n    Method              x1              x2              CC              TGDM
Normal        4    50   $\hat{A}_{I3}$      0.773(0.054)∗   0.786(0.052)    0.900(0.024)    0.900(0.028)
                        $\hat\theta$        0.694(0.043)    0.702(0.042)    0.815(0.028)    0.815(0.031)
                   100  $\hat{A}_{I3}$      0.782(0.033)    0.785(0.035)    0.904(0.018)    0.906(0.018)
                        $\hat\theta$        0.693(0.027)    0.696(0.030)    0.807(0.021)    0.809(0.021)
              10   50   $\hat{A}_{I3}$      0.785(0.048)    0.773(0.046)    0.909(0.021)    0.900(0.031)
                        $\hat\theta$        0.703(0.037)    0.692(0.040)    0.824(0.027)    0.815(0.033)
                   100  $\hat{A}_{I3}$      0.791(0.036)    0.789(0.032)    0.913(0.015)    0.913(0.016)
                        $\hat\theta$        0.699(0.030)    0.700(0.025)    0.818(0.019)    0.817(0.020)
              20   50   $\hat{A}_{I3}$      0.767(0.051)    0.779(0.053)    0.928(0.018)    0.897(0.034)
                        $\hat\theta$        0.689(0.042)    0.698(0.042)    0.852(0.026)    0.813(0.039)
                   100  $\hat{A}_{I3}$      0.782(0.033)    0.783(0.032)    0.922(0.015)    0.915(0.016)
                        $\hat\theta$        0.693(0.028)    0.696(0.025)    0.828(0.019)    0.820(0.019)
Cauchy        4    50   $\hat{A}_{I3}$      0.659(0.067)    0.640(0.068)    0.669(0.107)    0.735(0.073)
                        $\hat\theta$        0.629(0.046)    0.614(0.046)    0.619(0.088)    0.685(0.059)
                   100  $\hat{A}_{I3}$      0.660(0.056)    0.657(0.047)    0.659(0.094)    0.724(0.077)
                        $\hat\theta$        0.629(0.036)    0.625(0.032)    0.615(0.078)    0.679(0.063)
              10   50   $\hat{A}_{I3}$      0.648(0.064)    0.645(0.072)    0.690(0.099)    0.750(0.067)
                        $\hat\theta$        0.620(0.045)    0.618(0.048)    0.628(0.079)    0.689(0.056)
                   100  $\hat{A}_{I3}$      0.648(0.083)    0.638(0.082)    0.664(0.104)    0.733(0.101)
                        $\hat\theta$        0.625(0.033)    0.618(0.035)    0.614(0.063)    0.683(0.061)
              20   50   $\hat{A}_{I3}$      0.647(0.093)    0.657(0.096)    0.740(0.123)    0.789(0.096)
                        $\hat\theta$        0.623(0.044)    0.628(0.046)    0.665(0.083)    0.719(0.052)
                   100  $\hat{A}_{I3}$      0.634(0.123)    0.638(0.120)    0.649(0.142)    0.739(0.147)
                        $\hat\theta$        0.624(0.032)    0.627(0.029)    0.604(0.068)    0.689(0.069)

+Nonzero coef. represents variables with nonzero coefficients in the true model.
∗Empirical standard deviations are in parentheses.
∗∗p denotes the total number of variables in the true model; the number of variables with nonzero coefficients is p1 = 2.
4.2. Real examples
We apply the proposed measures to three real data sets: the tumor, prostate and diabetes data sets, which are used in Obuchowski (2005), Stamey et al. (1989) and Willems et al. (1997), respectively. In the tumor data set, there are 74 patients and only two surgery variables: computed tomography (CT) and a fictitious test (Fi). The continuous gold standard of this data set is the size of the renal tumor mass. The prostate data set has 97 patients with prostate specific antigen as its gold standard, together with 6 continuous variables: cancer volume, prostate weight, age (Age), benign prostatic hyperplasia amount, capsular penetration, and percentage of Gleason scores 4 or 5 (Pgg45). Except for the variables Age and Pgg45, the others are recoded in log scale and denoted by Lcavol, Lweight, Lbph, Lcp and Lpsa, accordingly. The original diabetes data set consists of 403 subjects, but we follow Willems et al. (1997) and delete 22 subjects with missing values. Of the remaining 381 subjects from this data set used in our numerical study, 222 are females and 159 are males. The following 7 continuous variables are used in this data set: total cholesterol (Chol), stabilized glucose (Stab.glu), high density lipoprotein (Hdl), cholesterol/HDL ratio (Ratio), age (Age), body mass index (BMI) and waist/hip ratio (WHR). The gold standard for this data set is glycosylated hemoglobin (Glyhb), which is commonly used as a measure of the progress of diabetes. In addition to analyzing the entire diabetes data set, we also investigate the female and male subgroups separately.
We normalize the data before applying the proposed measures to each data set to avoid scale variations. Table 4 presents $\hat{A}_{I3}$ and $\hat\theta$ for individual variables with p-values less than $10^{-7}$. From Table 4, we find that $\hat{A}_{I3}$ selects more variables than $\hat\theta$ in some cases. Note that the $\hat{A}_{I3}$ values are much larger than the $\hat\theta$ values, with competitive standard deviations, in these cases.
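The normalization step mentioned above is not specified further in this section. One common choice, assumed here purely for illustration, is to standardize each variable to zero mean and unit sample standard deviation; the helper name `standardize` is hypothetical.

```python
import statistics


def standardize(values):
    """Return the values centered at 0 with unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / sd for v in values]


if __name__ == "__main__":
    scaled = standardize([1.0, 2.0, 3.0, 4.0])
    print(statistics.fmean(scaled), statistics.stdev(scaled))
```

A positive affine rescaling such as this does not change the ordering of a single marker, but it puts the coefficients of a linear combination on a comparable scale across variables.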
Table 4: Results of ROC measure indexes, $\hat{A}_{I3}$ and $\hat\theta$, of single markers for the tumor, prostate, diabetes, diabetes-female and diabetes-male data sets.

Tumor
Data             Method            CT              Fi
Tumor            $\hat{A}_{I3}$    0.943(0.014)∗   0.982(0.011)
                 $\hat\theta$      0.871(0.020)    0.956(0.008)

Prostate
Data             Method            Lcavol          Lweight         Lcp             Pgg45
Prostate         $\hat{A}_{I3}$    0.865(0.022)    0.722(0.034)    0.744(0.035)    0.759(0.035)
                 $\hat\theta$      0.758(0.027)    0.647(0.027)    0.676(0.028)    0.675(0.031)

Diabetes
Data             Method            Chol            Stab.glu        Ratio           Age
Diabetes         $\hat{A}_{I3}$    -               0.779(0.021)    0.662(0.022)    0.711(0.019)
                 $\hat\theta$      -               0.687(0.017)    0.600(0.015)    0.644(0.014)
Diabetes-female  $\hat{A}_{I3}$    0.667(0.029)    0.786(0.022)    -               0.732(0.025)
                 $\hat\theta$      -               0.691(0.021)    -               0.665(0.019)
Diabetes-male    $\hat{A}_{I3}$    -               0.769(0.039)    0.689(0.034)    0.681(0.030)
                 $\hat\theta$      -               0.682(0.030)    -               -

∗Bootstrap standard deviation is in parentheses.

Table 5 lists the linear coefficients obtained using the TGDM and CC methods, and their corresponding $\hat{A}_{I3}$ and $\hat\theta$ values for all data sets, including the male and female subgroups of the diabetes data set. In the tumor data set, Fi has a larger $\hat{A}_{I3}$ value than CT; that is, Fi has a greater association with the size of the renal tumor mass. In the prostate data set, Lcavol has the largest $\hat{A}_{I3}$ value; that is, Lcavol is most highly associated with prostate specific antigen among all variables considered in the prostate data set. For the diabetes data set and its male and female subgroups, the variable with the largest $\hat{A}_{I3}$ and the largest coefficient value is Stab.glu; that is, Stab.glu has the highest potential to diagnose diabetes in terms of the glycosylated hemoglobin index. As expected, from Tables 4 and 5, the linear combinations based on TGDM and CC usually have larger $\hat{A}_{I3}$ and $\hat\theta$ values than individual variables do, and similarly, the $\hat{A}_{I3}$ and $\hat\theta$ values for combinations from TGDM are slightly larger than those obtained using the CC method. In real data sets the relation is seldom linear, which is the reason why the combinations obtained using TGDM perform better than others.
Table 5: Results of optimal linear coefficients and corresponding ROC measure indexes, $\hat{A}_{I3}$ and $\hat\theta$, for the tumor, prostate, diabetes, diabetes-female and diabetes-male data sets.

Tumor
                          Coef.                ROC-type indexes∗
Data             Method   CT      Fi           $\hat{A}_{I3}$   $\hat\theta$
Tumor            CC       0.118   1.076        0.981(0.011)     0.950(0.009)
                 TGDM     0.044   1.000        0.983(0.011)     0.957(0.008)

Prostate
                          Coef.                                                   ROC-type indexes
Data             Method   Lcavol  Lweight  Age     Lbph    Lcp     Pgg45          $\hat{A}_{I3}$   $\hat\theta$
Prostate         CC       0.642   0.214    0.118   0.099   0.017   0.147          0.892(0.018)     0.791(0.024)
                 TGDM     1.000   0.264    0.108   0.135   0.013   0.189          0.891(0.017)     0.789(0.023)

Diabetes
                          Coef.                                                            ROC-type indexes
Data             Method   Chol    Stab.glu  Hdl     Ratio   Age     BMI     WHR            $\hat{A}_{I3}$   $\hat\theta$
Diabetes         CC       0.074   0.668     0.018   0.101   0.101   0.017   0.019          0.816(0.017)     0.717(0.015)
                 TGDM     0.061   1.000     0.027   0.099   0.373   0.140   0.011          0.826(0.018)     0.723(0.016)
Diabetes-female  CC       0.109   0.659     0.073   0.027   0.106   0.029   0.069          0.834(0.021)     0.737(0.019)
                 TGDM     0.253   1.000     0.164   0.007   0.389   0.133   0.199          0.842(0.019)     0.741(0.018)
Diabetes-male    CC       0.005   0.701     0.141   0.243   0.085   0.049   0.002          0.786(0.03)      0.691(0.025)
                 TGDM     0.016   1.000     0.009   0.179   0.367   0.100   0.040          0.811(0.031)     0.706(0.027)

∗ROC-type indexes used here are $\mathrm{AUC}_{I3}$ and $\hat\theta$.

5. Conclusion and Discussion
In this paper, we first propose a new measure for evaluating the potential diagnostic power of individual variables when only a continuous gold standard is available and no confirmative threshold for it is known. The proposed measure is an AUC-type index that shares the threshold-independent property of the ROC curve and AUC, and can also be used to evaluate the performance of classifiers when the gold standard variable is essentially continuous and the threshold is disputable. Numerical results show that the proposed index is very competitive with the existing method.
In addition, we propose algorithms, based on the newly defined index, for finding the best linear combination of variables. This is useful from a practical perspective: when multiple variables are considered at a time, how to evaluate or select a good combination of variables is an important issue. Here we also study numerical methods for finding the linear combination of variables that maximizes the proposed measure. When the normality assumption of variables is valid, the best linear combination can be obtained as the solution to a linear system. Thus, under an assumption of normality and when the number of variables p is large, the LARS algorithm can be applied to obtain such a linear combination. This also implies that the LARS-type variable selection scheme can be conducted even when no binary-scale gold standard is available. When the joint distribution of variables is unknown, the proposed measure is approximated using a nonparametric kernel density estimation method. In this case, we propose a TGDM-based algorithm to calculate the best linear combination of variables. Based on numerical results, we found that our method is numerically stable and computationally advantageous when a large number of variables is considered and a combination of variables is of interest. Moreover, our method can be easily extended to an ordinal-scale gold standard with a suitable choice of weight function for cutting points, which will be reported elsewhere.
Appendix
Let random variables $(Y_i, Z_i)$ denote a pair of measures from subject i, for $i \ge 1$. Suppose that $\{(y_i, z_i), i = 1, \ldots, n\}$ are n independent observed values of the random variables $(Y_i, Z_i)$, $i = 1, \ldots, n$. For a given cutting point c, subject i, $i = 1, \ldots, n$, is assigned as a "case" if $z_i > c$ and is otherwise labeled as a "control". That is, for a given c, we divide the observed subjects into two groups; let $S_1(c)$ and $S_0(c)$ be the case and control groups with sample sizes $n_1$ and $n_0$, respectively.
Then we propose a natural estimate of the AUC index, $\mathrm{AUC}_I$, with a continuous gold standard,
$$\hat{A}_I = \int \hat{A}(t)\, d\hat{F}_c(t), \tag{21}$$
where $\hat{A}(c)$ is defined as
$$\hat{A}(c) = \frac{1}{n_0 n_1} \sum_{i \in S_1(c);\, j \in S_0(c)} \psi(y_i - y_j),$$
$\psi(u) = 1$ if $u > 0$, $\psi(u) = 0.5$ if $u = 0$ and $\psi(u) = 0$ if $u < 0$, and $\hat{F}_c(t)$ is the empirical estimate of the cumulative distribution function of Z based on $\{z_1, \ldots, z_n\}$.
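A direct reading of (21) can be sketched as follows. Because $\hat{F}_c$ is the empirical distribution of the $z_i$, the integral becomes an average of $\hat{A}(c)$ over the observed cut points $c = z_k$. The function names `auc_at_cut` and `auc_index` are hypothetical, and one choice below is an assumption rather than part of the definition: cut points that leave the case or control group empty (for example $c = \max_i z_i$) are skipped and the average is taken over the remaining cut points.

```python
def psi(u):
    """Step function: 1 if u > 0, 0.5 if u == 0, 0 if u < 0."""
    if u > 0:
        return 1.0
    return 0.5 if u == 0 else 0.0


def auc_at_cut(ys, zs, c):
    """A-hat(c): proportion of (case, control) pairs ordered correctly by y."""
    cases = [y for y, z in zip(ys, zs) if z > c]
    controls = [y for y, z in zip(ys, zs) if z <= c]
    if not cases or not controls:
        return None  # A-hat(c) is undefined when one group is empty
    total = sum(psi(yi - yj) for yi in cases for yj in controls)
    return total / (len(cases) * len(controls))


def auc_index(ys, zs):
    """Average of A-hat(c) over observed cut points c = z_k with both groups non-empty."""
    values = [a for a in (auc_at_cut(ys, zs, c) for c in zs) if a is not None]
    if not values:
        raise ValueError("no cut point separates the sample into two groups")
    return sum(values) / len(values)


if __name__ == "__main__":
    print(auc_index([1, 2, 3, 4], [1, 2, 3, 4]))  # y perfectly concordant with z
    print(auc_index([4, 3, 2, 1], [1, 2, 3, 4]))  # y perfectly discordant with z
```

The double loop costs on the order of $n^2$ comparisons per cut point, which is acceptable for the sample sizes used in the simulations but would need a rank-based shortcut for large n.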
However, in practice, it is rare to choose cutting points at ranges near the two ends of the distribution of Z. Thus, instead of the whole range of Z, we might explicitly define a weight function $f_c(t)$ on a particular critical range.
Since the step function $\psi(\cdot)$ in (21) is not continuously differentiable, a smooth estimate of $\mathrm{AUC}_I$ is defined as
$$\hat{A}_{Is} = \int \frac{1}{n_1 n_0} \sum_{i \in S_1(t);\, j \in S_0(t)} S\!\left(\frac{y_i - y_j}{h}\right) f_c(t)\, dt, \tag{22}$$
where $S(t)$ is the sigmoid function $1/(1 + \exp(-t))$ and h is the window width.
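The smoothed estimate (22) can be sketched in the same way. The sketch below makes several assumptions that are not fixed by the text: the weight function $f_c$ is taken to be uniform on a user-supplied critical range [lo, hi], the integral is approximated by a midpoint rule on an equally spaced grid of cut points, and the window width is set to $h = n^{-\alpha}$ with $\alpha = 0.3$. All function names are hypothetical.

```python
import math


def sigmoid(t):
    """S(t) = 1 / (1 + exp(-t)), evaluated without overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def smoothed_auc_at_cut(ys, zs, c, h):
    """Sigmoid-smoothed version of A-hat(c) with window width h."""
    cases = [y for y, z in zip(ys, zs) if z > c]
    controls = [y for y, z in zip(ys, zs) if z <= c]
    if not cases or not controls:
        return None
    total = sum(sigmoid((yi - yj) / h) for yi in cases for yj in controls)
    return total / (len(cases) * len(controls))


def smoothed_auc_index(ys, zs, lo, hi, alpha=0.3, grid=50):
    """Midpoint-rule approximation of (22) with f_c uniform on [lo, hi] (assumed)."""
    h = len(ys) ** (-alpha)
    width = (hi - lo) / grid
    total, used = 0.0, 0
    for k in range(grid):
        c = lo + (k + 0.5) * width
        value = smoothed_auc_at_cut(ys, zs, c, h)
        if value is not None:  # skip cut points with an empty group
            total += value
            used += 1
    if used == 0:
        raise ValueError("no usable cut point in [lo, hi]")
    return total / used


if __name__ == "__main__":
    print(smoothed_auc_index([0, 10, 20, 30], [1, 2, 3, 4], lo=1.5, hi=3.5))
```

Unlike the step-function version, this quantity is differentiable in the marker values, which is what makes gradient-based search over linear combinations possible.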
Appendix A: Proof of Strong Consistency of $\hat{A}_{Is}(l)$
The proof of the strong consistency of the smoothed $\mathrm{AUC}_I(l)$ estimator $\hat{A}_{Is}(l)$ follows from the following three lemmas.
Lemma 5.1. Suppose that $X_1, \ldots, X_n$ is a sequence of independent and identically distributed random variables with values in $R^1$ and a uniformly continuous density $f(\cdot)$. Let $k(x)$ be a bounded probability density, and suppose that the Dirichlet series $\sum_{n=1}^{\infty} n \exp(-\gamma \eta_n)$, $\eta_n = n h^2$, converges for any $\gamma > 0$. Then
$$\int_{-\infty}^{\infty} |f_n(x) - f(x)|\, dx \to 0, \quad \text{almost surely as } n \to \infty,$$
where
$$f_n(x) = \frac{1}{nh} \sum_{i=1}^{n} k\!\left(\frac{x - X_i}{h}\right)$$
is a kernel density estimator of $f(x)$.
(The proof of Lemma 5.1 can be found in Nadaraya (1989), Theorem 3.1, page 55, so it is omitted here.)
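The kernel density estimator $f_n$ of Lemma 5.1 can be sketched with the kernel used later in the proof of Theorem 3, namely the derivative of the sigmoid, $k(u) = S(u)(1 - S(u))$. The function names below are hypothetical and the bandwidth in the usage example is arbitrary.

```python
import math


def logistic_kernel(u):
    """k(u) = S'(u) = exp(-|u|) / (1 + exp(-|u|))^2, a symmetric bounded density."""
    e = math.exp(-abs(u))
    return e / (1.0 + e) ** 2


def kde(x, data, h):
    """Kernel density estimate f_n(x) = (1 / (n h)) * sum_i k((x - X_i) / h)."""
    return sum(logistic_kernel((x - xi) / h) for xi in data) / (len(data) * h)


if __name__ == "__main__":
    sample = [0.0, 1.0, 2.5]
    step = 0.01
    # Riemann sum over a wide grid as a sanity check that f_n integrates to about 1.
    mass = sum(kde(-20.0 + i * step, sample, 0.5) for i in range(4200)) * step
    print(mass)
```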
Lemma 5.2. Suppose that $X_1, \ldots, X_n$ is a sequence of independent and identically distributed random variables with values in $R^1$ and a uniformly continuous density. Then, with probability one, as $n \to \infty$,
$$\sup_{x \in R^1} |F_n(x) - F(x)| \to 0,$$
where $F_n(\cdot)$ and $F(\cdot)$ are the empirical distribution function and the distribution function of X, respectively.
Proof of Lemma 5.2:
From Nadaraya (1989) (Equation (1.4), page 43), we have
$$\mathrm{pr}\Big(\sup_{x \in R^1} |F_n(x) - F(x)| > \eta n^{-1/2}\Big) \le c \exp(-2\eta^2), \tag{23}$$
which completes the proof of Lemma 5.2.
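One standard way to pass from the exponential bound (23) to almost sure convergence is through the Borel–Cantelli lemma. The derivation below is a sketch of that step, not taken from the source; it assumes the constant c in (23) does not depend on n or $\eta$, and $\varepsilon > 0$ is an arbitrary fixed tolerance.

```latex
% Choose \eta = \varepsilon n^{1/2} in (23):
\mathrm{pr}\Big(\sup_{x \in R^1} |F_n(x) - F(x)| > \varepsilon\Big)
  \le c \exp(-2 n \varepsilon^2).

% The bounds are summable in n (geometric series with ratio e^{-2\varepsilon^2} < 1):
\sum_{n=1}^{\infty} c \exp(-2 n \varepsilon^2)
  = \frac{c\, e^{-2\varepsilon^2}}{1 - e^{-2\varepsilon^2}} < \infty.

% Borel--Cantelli: for each \varepsilon > 0 the event occurs only finitely often,
\mathrm{pr}\Big(\sup_{x \in R^1} |F_n(x) - F(x)| > \varepsilon
  \ \text{infinitely often}\Big) = 0,

% and taking \varepsilon = 1/m, m = 1, 2, \ldots, gives
\sup_{x \in R^1} |F_n(x) - F(x)| \to 0 \quad \text{almost surely}.
```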
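Lemma 5.3. Assume that $\{(y_1, z_1), \ldots, (y_n, z_n)\}$ are n independent and identically distributed samples of $(Y \in R^1, Z \in R^1)$, where Z denotes a continuous gold standard. For a given c, let $\tilde{f}(y \mid Z > c)$ be the conditional density function of Y given $Z > c$. Suppose that the conditions of Theorem 3 hold. Then $\tilde{f}(\cdot \mid Z > c)$ is uniformly continuous.
Proof of Lemma 5.3:
By Bayes' theorem, we have
$$\tilde{f}(y \mid Z > c) = \frac{\int_c^{\infty} f(y, z)\, dz}{\mathrm{pr}(Z > c)}. \tag{24}$$
For any $y_i \in R^1$, $i = 1, 2$,
$$\begin{aligned}
\int_c^{\infty} f(y_1, z)\, dz - \int_c^{\infty} f(y_2, z)\, dz
&= \int_c^{\infty} [f(z \mid y_1) f_Y(y_1) - f(z \mid y_1) f_Y(y_2)]\, dz + \int_c^{\infty} [f(z \mid y_1) f_Y(y_2) - f(z \mid y_2) f_Y(y_2)]\, dz \\
&= [f_Y(y_1) - f_Y(y_2)][1 - F(c \mid y_1)] + [F(c \mid y_2) - F(c \mid y_1)] f_Y(y_2),
\end{aligned}$$
where $f(z \mid y)$ is the conditional density function of Z given $Y = y$, $F(z \mid y)$ is the corresponding conditional distribution function, and $f_Y(y)$ is the density function of the marker Y. From the conditions of Theorem 3, we have $b \equiv \mathrm{pr}(Z > c) > 0$, $f_Y(\cdot) < M$, and both $f_Y(\cdot)$ and $F(z \mid \cdot)$ are uniformly continuous. Hence, for any $\epsilon > 0$, there exists a $\delta > 0$ such that, for any $y_1$ and $y_2$ satisfying $|y_1 - y_2| < \delta$, we have
$$|f_Y(y_1) - f_Y(y_2)| < b\epsilon/2, \tag{25}$$
$$|F(c \mid y_2) - F(c \mid y_1)| < b\epsilon/(2M). \tag{26}$$
Consequently, by (24), (25) and (26) we get that, for a given c,
$$|\tilde{f}(y_1 \mid Z > c) - \tilde{f}(y_2 \mid Z > c)| \le \frac{1}{b}\Big\{ |f_Y(y_1) - f_Y(y_2)|\,(1 - F(c \mid y_1)) + |F(c \mid y_2) - F(c \mid y_1)|\, f_Y(y_2) \Big\} < \epsilon/2 + \epsilon/2 = \epsilon. \tag{27}$$
It follows that $\tilde{f}(\cdot \mid Z > c)$ is uniformly continuous.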
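Proof of Theorem 3:
By the triangle inequality, we have, for fixed l,
$$|\hat{A}_{Is} - \mathrm{AUC}_I| \le |\hat{A}_{Is} - \hat{A}_I| + |\hat{A}_I - \mathrm{AUC}_I| = (I) + (II) \quad \text{(say)}. \tag{28}$$
From Theorem 1, (II) converges to 0 almost surely as n goes to $\infty$; that is,
$$\hat{A}_I - \mathrm{AUC}_I \to 0 \quad \text{almost surely as } n \to \infty. \tag{29}$$
From (21) and (17),
$$\begin{aligned}
(I) &= \left| \int \frac{1}{n_1 n_0} \sum_{i \in S_1(t);\, j \in S_0(t)} S\!\left(\frac{y_i - y_j}{h}\right) f_c(t)\, dt - \int \frac{1}{n_1 n_0} \sum_{i \in S_1(t);\, j \in S_0(t)} \psi(y_i - y_j) f_c(t)\, dt \right| \\
&\le \int \left| \frac{1}{n_1 n_0} \sum_{i \in S_1(t);\, j \in S_0(t)} S\!\left(\frac{y_i - y_j}{h}\right) - \frac{1}{n_1 n_0} \sum_{i \in S_1(t);\, j \in S_0(t)} \psi(y_i - y_j) \right| f_c(t)\, dt.
\end{aligned}$$
Since $n_1 + n_0 = n$, at least one of $n_1 \to \infty$ and $n_0 \to \infty$ holds as n tends to $\infty$. Without loss of generality, assume that $n_1$ tends to $\infty$. Then
$$\begin{aligned}
(I) \le{} & \int \frac{1}{n_0} \sum_{j \in S_0(t)} \left| \frac{1}{n_1} \sum_{i \in S_1(t)} \psi(y_i - y_j) - \tilde{F}(y_j \mid Z > t) \right| f_c(t)\, dt \\
& + \int \frac{1}{n_0} \sum_{j \in S_0(t)} \left| \frac{1}{n_1} \sum_{i \in S_1(t)} S\!\left(\frac{y_i - y_j}{h}\right) - \tilde{F}(y_j \mid Z > t) \right| f_c(t)\, dt,
\end{aligned} \tag{30}$$
where $\tilde{F}(\cdot \mid Z > t)$ is the conditional cumulative distribution function of Y given $\{Z > t\}$. Let $\tilde{f}(\cdot \mid Z > t)$ be its conditional density function. By Lemma 5.3, $\tilde{f}(\cdot \mid Z > t)$ is uniformly continuous.
Let $h = n^{-\alpha}$, $1/5 < \alpha < 1/2$. Set $\eta_n = n h^2 = n^{1 - 2\alpha}$; then the Dirichlet series $\sum_{n=1}^{\infty} n \exp(-\gamma \eta_n)$ converges for any $\gamma > 0$. Thus, the conditions of Lemma 5.1 are satisfied. Let $k(t)$ denote the derivative of $S(t)$; then $k(t)$ is a bounded probability density. Thus, by Lemma 5.1,
$$\begin{aligned}
\sup_{y \in R^1} \left| \frac{1}{n_1} \sum_{i \in S_1(t)} S\!\left(\frac{y_i - y}{h}\right) - \tilde{F}(y \mid Z > t) \right|
&= \sup_{y \in R^1} \left| \int_{-\infty}^{y} \left[ \frac{1}{n_1 h} \sum_{i \in S_1(t)} k\!\left(\frac{y_i - u}{h}\right) - \tilde{f}(u \mid Z > t) \right] du \right| \\
&\le \int_{-\infty}^{\infty} \left| \frac{1}{n_1 h} \sum_{i \in S_1(t)} k\!\left(\frac{y_i - u}{h}\right) - \tilde{f}(u \mid Z > t) \right| du \longrightarrow 0, \quad \text{almost surely as } n \to \infty.
\end{aligned} \tag{31}$$
From Lemma 5.2, we have
$$\sup_{y \in R^1} \left| \frac{1}{n_1} \sum_{i \in S_1(t)} \psi(y_i - y) - \tilde{F}(y \mid Z > t) \right| \longrightarrow 0, \quad \text{almost surely as } n \to \infty. \tag{32}$$
From (30), (31) and (32), we prove that
$$\hat{A}_{Is} - \hat{A}_I \to 0, \quad \text{almost surely as } n \to \infty. \tag{33}$$
Put (29) and (33) together to complete the proof of Theorem 3.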
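The bandwidth condition used in the proof, $h = n^{-\alpha}$ with $1/5 < \alpha < 1/2$ and $\eta_n = n h^2 = n^{1-2\alpha}$, can be checked numerically. The sketch below evaluates partial sums of $\sum_n n \exp(-\gamma \eta_n)$, with the series taken in the form read from Lemma 5.1; the helper names are hypothetical, and a numerical check of this kind only illustrates the convergence and does not prove it.

```python
import math


def bandwidth(n, alpha):
    """Window width h = n^(-alpha) with 1/5 < alpha < 1/2."""
    if not 0.2 < alpha < 0.5:
        raise ValueError("alpha must satisfy 1/5 < alpha < 1/2")
    return n ** (-alpha)


def eta(n, alpha):
    """eta_n = n * h^2 = n^(1 - 2 * alpha)."""
    h = bandwidth(n, alpha)
    return n * h * h


def series_partial_sum(alpha, gamma, n_terms):
    """Partial sum of sum_{n >= 1} n * exp(-gamma * eta_n)."""
    return sum(n * math.exp(-gamma * eta(n, alpha)) for n in range(1, n_terms + 1))


if __name__ == "__main__":
    for n_terms in (500, 1000, 2000, 4000):
        print(n_terms, series_partial_sum(alpha=0.25, gamma=1.0, n_terms=n_terms))
```

Smaller values of $\gamma$ or values of $\alpha$ closer to 1/2 make the terms decay more slowly, so many more terms are needed before the partial sums stabilize.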
Acknowledgements
This work is partially supported via grant NSC97-2118-M-001-004-MY2, funded by the National Science Council, Taipei, Taiwan, ROC.
References
Choi, Y., Johnson, W., Collins, M., Gardner, I. (2006). Bayesian inferences for receiver operating characteristic curves in the absence of a gold standard. Journal of Agricultural, Biological, and Environmental Statistics 11, 210–229.
Efron, B., Johnstone, I., Hastie, T., Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32, 407–499.
Friedman, J. H., Popescu, B. E. (2004). Gradient directed regularization for linear regression and classification. Tech. rep., Department of Statistics, Stanford University.
Henkelman, R., Kay, I., Bronskill, M. (1990). Receiver operating characteristic analysis without truth. Medical Decision Making 10.
Krzanowski, W., Hand, D. (2009). ROC Curves for Continuous Data. CRC Press, London.
Ma, S., Huang, J. (2005). Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics 21, 4356–4362.
Nadaraya, E. A. (1989). Nonparametric Estimation of Probability Densities and Regression Curves. Kluwer Academic.
Obuchowski, N. (2005). Estimating and comparing diagnostic tests' accuracy when the gold standard is not binary. Statistics in Medicine 20, 3261–3278.
Obuchowski, N. (2006). An ROC-type measure of diagnostic accuracy when the gold standard is continuous-scale. Statistics in Medicine 25, 481–493.
Pepe, M. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, Oxford.
Pepe, M., Thompson, M. (2000). Combining diagnostic test results to increase accuracy. Biostatistics 1, 123–140.
Pfeiffer, R., Castle, P. (2005). With or without a gold standard. Epidemiology 16.
Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. Radical prostatectomy treated patients. Journal of Urology 141, 1076–1083.
Su, J., Liu, J. (1993). Linear combinations of multiple diagnostic markers. J. Am. Statist. Ass. 88, 1350–1355.
Waikar, S., Betensky, R., Bonventre, J. (2009). Creatinine as the gold standard for kidney injury biomarker studies? Nephrol Dial Transplant 24, 3263–3265.
Wang, Z., Chang, Y., Ying, Z., Zhu, L., Yang, Y. (2007a). A parsimonious threshold-independent protein feature selection method through the area under receiver operating characteristic curve. Bioinformatics 23, 2788–2794.
Wang, C., Turnbull, B., Gröhn, Y., Nielsen, S. (2007b). Nonparametric estimation of ROC curves based on Bayesian models when the true disease state is unknown. Journal of Agricultural, Biological, and Environmental Statistics 12.
Willems, J., Saunders, J., Hunt, D., Schorling, J. (1997). Prevalence of coronary heart disease risk factors among rural blacks: A community-based study. Southern Medical Journal 90, 814–820.
Zhou, X.H., Castelluccio, P., Zhou, C. (2005). Nonparametric estimation of ROC curves in the absence of a gold standard. Biometrics 61, 600–609.