
arXiv:1310.8203v2 [math.ST] 18 Dec 2013

A U-STATISTIC ESTIMATOR FOR THE VARIANCE OF

RESAMPLING-BASED ERROR ESTIMATORS

M. FUCHS, R. HORNUNG, R. DE BIN, A.-L. BOULESTEIX

ABSTRACT. We revisit resampling procedures for error estimation in binary classification in terms of U-statistics. In particular, we exploit the fact that the error rate estimator involving all learning-testing splits is a U-statistic. Therefore, several standard theorems on properties of U-statistics apply. In particular, it has minimal variance among all unbiased estimators and is asymptotically normally distributed. Moreover, there is an unbiased estimator for this minimal variance if the total sample size is at least twice the learning set size plus two. In this case, we exhibit such an estimator, which is another U-statistic. It enjoys, again, various optimality properties and yields an asymptotically exact hypothesis test of the equality of error rates when two learning algorithms are compared. Our statements apply to any deterministic learning algorithm under weak non-degeneracy assumptions. In an application to tuning parameter choice in lasso regression on a gene expression data set, the test does not reject the null hypothesis of equal rates between two different parameters.

Keywords: Unbiased Estimator; Penalized Regression Model; U-Statistic; Cross-Validation; Machine Learning.

1. INTRODUCTION

The goal of supervised statistical learning is to develop prediction rules taking the val-

ues of predictor variables as input and returning a predicted value of the response variable.

A prediction rule is typically learnt by applying a learning algorithm M to a so-called learning data set. A typical example in biomedical research is the prediction of patient outcome (e.g. recurrence/no recurrence within five years, tumor class, lymph node status, response to chemotherapy, etc.) based on bio-markers such as gene expression data. Practitioners are usually interested in the accuracy of the prediction rule learnt from their data set

to predict future patients, while methodological researchers rather want to know whether

the learning algorithm is good at learning accurate prediction rules for different data sets

drawn from a distribution of interest. The ﬁrst perspective is called “conditional” (since

referring to a speciﬁc data set) while the latter, which we take in this paper, is denoted as

“unconditional”. More precisely, this paper focuses on the parameter defined as the difference between the unconditional error rates of two learning algorithms of interest, M and M′, for binary classification.

If the data set is very large, one can observe independent realizations of estimators of

the unconditional error rates and use them for a paired t-test (see Section 2.3). In practice, however, huge data sets are rarely available. Prediction errors are thus usually estimated by

resampling procedures consisting of splitting the available data set into learning and test

sets a large number of times and averaging the estimated error over these iterations. The

well-known cross-validation procedure can be seen as a special case of resampling pro-

cedure for error estimation. A detailed overview of the vast literature on cross-validation

would go beyond the scope of this paper. The reader is referred to Arlot & Celisse (2010)

for a comprehensive survey.



Having estimated the error rate, it is typically of interest to test the null hypothesis of

equal error rates between learning algorithms. This requires insight into the estimator’s

variance. Resampling-based error estimators typically have a very large variance, in par-

ticular in the case of small samples or high-dimensional predictor space (Dougherty et al.,

2011). The estimation of this variance has been the focus of a large body of literature, especially in the machine learning context. A good estimator of the variance of resampling-based error estimators would make it possible to, e.g., derive reliable confidence intervals for the true error or to construct statistical tests to compare the performance of several learning algorithms. The latter task is of crucial importance in practice, since applied computational scientists, including biostatisticians, often have to choose among a multitude of different learning algorithms whose performance in the considered settings is poorly explored.

In van der Wiel et al. (2009), for each splitting in repeated sub-sampling the predictions of

the two classifiers are compared by a Wilcoxon test, and the resulting p-values are combined. In Jiang et al. (2008a), the authors show the asymptotic normality of the error rate

estimator in the case of a support vector machine. In Jiang et al. (2008b), a bias-corrected

bootstrap-estimator for the error rate from leave-one-out cross-validation is introduced.

Various estimators of the variance of resampling-based error estimators have been sug-

gested in the literature (Dietterich, 1998; Nadeau & Bengio, 2003), most of them based on

critical simplifying assumptions. As far as cross-validation error estimates are concerned,

Bengio & Grandvalet (2003) show that there exists no unbiased estimator of their variance.

To date, the estimation of the variance of resampling-based error rates in general remains a challenging issue with no adequate answer yet, from both a theoretical and a practical point of view. In particular, no exact or even asymptotically exact procedures are available for testing equality of error rates between learning algorithms. The present paper shows that there is an asymptotically exact test for the comparison of learning algorithms, obtained by using and extending results from U-statistics theory.

Despite the large body of literature, there seems to be no explicit treatment of the asymptotic properties of these estimators in general. Our main results are Theorem 3.9, stating that there is an unbiased estimator of the difference estimator's variance if n ≥ 2g + 2, where g is the learning set size and n is the sample size, and Theorem 4.1, providing the central limit theorem for the studentized statistic as needed for testing. The use of only half the sample size for learning has already occurred in the literature, in a roughly similar context and on grounds of intuitive reasoning (Bühlmann & van de Geer, 2011, Section 10.2.1).

Corollary 5.1 gives an explicit bound on the number of iterations necessary to approximate the leave-p-out estimator, i.e. the minimum variance estimator, to an arbitrary given precision, where p = n − g; we show that this minimal variance can be estimated by an unbiased estimator, namely that of Definition 3.7. The latter has minimal variance itself, and the ensuing studentized test in (17) is asymptotically exact. This shows that it is not necessary to determine the distribution of combinations of p-values in order to test equality of error rates, as in van der Wiel et al. (2009, Section 2.3).

Section 2 recalls important deﬁnitions pertaining to U-statistics and cross-validation

viewed as incomplete U-statistics. We show that the procedure which involves all learning-

testing splits is then a complete U-statistic. In Section 3, we show that U-statistics theory

naturally suggests an unbiased estimator of the variance of this estimated difference of

errors as soon as the sample size criterion is satisﬁed. In Section 4, we exploit this vari-

ance estimator to derive an asymptotically exact hypothesis test of equality of the true

errors of two learning algorithms M and M′. Section 5 addresses numerical computation


of approximations of the leave-p-out cross-validation estimator, while an illustration of the

variance estimation and the hypothesis test through application to the choice of the penalty

parameter in lasso regression is presented in Section 6.

2. DEFINITIONS, NOTATIONS AND PRELIMINARIES

2.1. Hoeffding's definition. Since the complete error estimator that we will consider from the next section on is a U-statistic, we start by recalling the definition of U-statistics and their basic properties. In the following, the reader who is already familiar with the machine learning context may take Ψ_0 := Φ_0 and m := g + 1 at first, where g is the learning sample size and Φ_0 is as defined in (5); however, this is not necessary and we will need other cases of the definition as well.

Definition 2.1 (U-statistic, Hoeffding (1948)). Let (Z_i), i = 1, ..., n be independent and identically distributed r-dimensional random vectors with arbitrary distribution. Let m ≤ n, and let Ψ_0 : R^{r×m} → R be an arbitrary measurable symmetric function of m vector arguments. Write Ψ_0(S) as above for the well-defined value of Ψ_0 at those Z_i with indices from S ⊂ {1, ..., n}, |S| = m. Consider the average of Ψ_0 in the maximal design 𝒮 of unordered size-m subsets

(1)  U = U(Z_1, ..., Z_n) = \binom{n}{m}^{-1} ∑_{S ∈ 𝒮} Ψ_0(S).

Any statistic of such a form is called a U-statistic.

The trailing factor is the inverse of the number of summands, the cardinality |𝒮|. So, a U-statistic is an unbiased estimator for the associated parameter

(2)  Θ(P) = ∫···∫ Ψ_0(z_1, ..., z_m) dP(z_1) ··· dP(z_m)

for a probability distribution P on R^r, where z_1, ..., z_m are r-vectors. A parameter of this form is called a regular parameter. If m is the smallest number such that there exists a symmetric function Ψ_0 representing a given parameter Θ(P) in the form (2), then m is called the degree of Ψ_0 or of Θ(P), and the function Ψ_0 is called a kernel of U. Furthermore, U has minimal variance among all unbiased estimators over any family of distributions P containing all properly discontinuous distributions. Hoeffding (1948) shows the asymptotic normality of U-statistics as n → ∞ and that a good number of well-known statistics such as the sample mean, empirical moments, the Gini coefficient, etc. are subsumed by the definition.
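As a concrete illustration (our own toy example, not from the paper), the unbiased sample variance is the U-statistic of degree m = 2 associated to the kernel Ψ_0(z_1, z_2) = (z_1 − z_2)^2/2; a minimal sketch:

```python
from itertools import combinations
from statistics import variance

def u_statistic(sample, kernel, m):
    """Complete U-statistic (1): average of the symmetric kernel over
    all unordered size-m subsets of the sample (the maximal design)."""
    subsets = list(combinations(sample, m))
    return sum(kernel(*s) for s in subsets) / len(subsets)

def var_kernel(z1, z2):
    """Kernel of degree m = 2 whose regular parameter (2) is the variance of P."""
    return (z1 - z2) ** 2 / 2

z = [1.2, 0.4, 2.5, 3.1, 0.9]
u = u_statistic(z, var_kernel, 2)
# u coincides with the usual unbiased sample variance (denominator n - 1).
```

The coincidence with the n − 1 denominator is exactly the unbiasedness property stated above.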

It is also possible to associate a U-statistic to a non-symmetric kernel Ψ, i.e. to estimate E(Ψ) by a U-statistic. The cost is to deal with n!/(n−m)! summands, many more than just the binomial coefficient n!/(m!(n−m)!). The symmetrization, indeed, consists in grouping together all m! summands involving the same unordered index set. This writes a U-statistic with a non-symmetric kernel Ψ in the form of Hoeffding (1948, Section 4.4) with a symmetric kernel Ψ_0 as in Hoeffding (1948, Section 3.3).

2.2. The difference of true error rates as the parameter of interest. The goal of this section is to formalize the true error rate and to recall its nature as an expectation. Let X = R^{r−1}, r ∈ N, be a fixed predictor space, and Y ⊂ R be a space of responses. The number r will be as in Hoeffding (1948); one would usually denote p = r − 1. Assume there is an unknown but fixed probability distribution P on X × Y ⊂ R^r defined on the σ-algebra of Lebesgue-measurable sets. We do not require P to be absolutely continuous with respect to the Lebesgue measure, in order to allow for a discrete marginal distribution in Y as it


occurs in binary classification. The distribution P can be thought of as being supported on R^r instead of only on X × Y ⊂ R^r, by identifying P with its push-forward image i_*(P) under the inclusion i : X × Y → R^r. This allows us to apply Hoeffding (1948), which only describes U-statistics on a Euclidean space R^r, to the definition and investigation of U-statistics on (products of) X × Y. Let us fix a loss function L : Y × Y → R. Typically, L is the misclassification loss L(y_1, y_2) = 1_{y_1 ≠ y_2}, but it can be an arbitrary measurable function. Since we suppose the marginal distribution of P on Y to be discrete, the loss function associated below to a learning algorithm is almost surely bounded. Therefore, all moments exist, which will be helpful throughout the paper. This is automatic for the misclassification loss. However, some of the following results also hold for an unbounded distribution on Y. In that case one would typically consider losses which are not almost surely bounded, such as, for instance, the residual sum of squares loss L(y_1, y_2) = (y_1 − y_2)^2 or the Brier score in survival analysis. We will not work out the details of unbounded losses.

Say we are interested in the difference of error rates of classifiers learnt on a sample of size g. Typical choices are g = 4n/5 (assuming that five divides n) for a learning/testing sample size ratio of 4 : 1, and g = n − 1 for leave-one-out cross-validation. We also allow g = 0 in case we are interested in the performance of classification rules that were already learnt on different and fixed data. In that case, it is important that the learning data were different, since otherwise there would be a problematic contradiction between simultaneously regarding the data as fixed and as drawn from P.

Let (z_i)_{i=1,...,n} = (x_1, y_1, ..., x_n, y_n), where x_i ∈ X and y_i ∈ Y, and denote by

f_M : (X × Y)^{×g} × X → Y

the function that maps (z_1, ..., z_g; x_{g+1}) ∈ (X × Y)^{×g} × X to the prediction, an element of Y, by the learning algorithm M, learnt on z_1, ..., z_g and applied to x_{g+1}. We are only concerned with deterministic learning algorithms M, i.e. ones which do not involve any random component for classification. We suppose that f_M is symmetric in the first g entries z_1, ..., z_g, i.e. M treats all learning observations equally, and that f_M is measurable with respect to the product σ-algebra. The inclusion X × Y → R^r defines an inclusion (X × Y)^{×g} × X → R^{rg+r−1}, and in order to be able to apply Hoeffding (1948) we view f_M as a map on R^{rg+r−1} by extending it by zero on R^{rg+r−1} \ (X × Y)^{×g} × X, which is a null set with respect to the push-forward measure i_*(P). The map f_M is then also measurable on R^{rg+r−1}.

Denote by Φ : (X × Y)^{g+1} → R the function

(3)  Φ(z_1, ..., z_g; z_{g+1}) := L(f_M(z_1, ..., z_g; x_{g+1}), y_{g+1}) − L(f_{M′}(z_1, ..., z_g; x_{g+1}), y_{g+1})

for two learning algorithms M and M′. The value Φ(z_1, ..., z_g; z_{g+1}) is the empirical difference of error rates between M and M′, learnt on the first g observations (z_1, ..., z_g) of the sample (z_i) and evaluated on the single last entry. The semicolon thus visually separates learning and test sets. The definition of Φ involves only a single test observation. Anticipating a little, the reason is that Φ had to be defined with the minimal number of arguments necessary for (6) below. In the following, we will conveniently consider larger test sample sizes by applying the mechanism of associating a U-statistic to a kernel.

As noted above, we assume that Φ is almost surely bounded. This happens, for instance, for bounded loss functions such as the misclassification loss. Also, we may view Φ as being defined on R^{r(g+1)} instead of (X × Y)^{g+1} without notational distinction, in the same way as f_M was extended to R^{rg+r−1}.


The true difference of error rates between M and M′ is the expectation of Φ, taken with respect to g + 1 independent realizations of P:

(4)  ∆ := e_M − e_{M′} = E_{P^{⊗(g+1)}}(Φ(Z_1, ..., Z_g; Z_{g+1}))
        = ∫···∫_{(X×Y)^{×(g+1)}} [L(f_M(z_1, ..., z_g; x_{g+1}), y_{g+1}) − L(f_{M′}(z_1, ..., z_g; x_{g+1}), y_{g+1})] dP(z_1) ··· dP(z_{g+1}),

where both learning and test data are random. The existence of the expectation follows from measurability and from the boundedness assumption. The quantity ∆ is the parameter of main interest. Also, we consider the symmetric function of g + 1 arguments

(5)  Φ_0(z_1, ..., z_{g+1}) := (1/(g+1)) ∑_{i=1}^{g+1} Φ(z_1, ..., z_{i−1}, z_{i+1}, ..., z_{g+1}; z_i),

satisfying

(6)  ∆ = E(Φ) = E(Φ_0).

For the particular non-symmetric kernel Ψ = Φ introduced in (3), the symmetrization Ψ_0 = Φ_0 of (5) can be written involving only m = g + 1 summands instead of m!, in other words only cyclic permutations instead of all permutations, due to the assumption that learning is symmetric. In practice, it is not advantageous to compute Φ_0 directly, because a learning procedure should be used on more than just one test observation for numerical efficiency (see Section 5); however, it is very convenient to consider Φ_0 for ease of presentation.
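To make the kernels concrete, the following sketch computes Φ of (3) and the symmetrization Φ_0 of (5) with the misclassification loss; the two deterministic learning algorithms are toy choices of ours, not those considered in the paper:

```python
def phi(learn, test, f_M, f_Mp):
    """Non-symmetric kernel (3): difference of the 0-1 losses of two
    rules, learnt on `learn` and evaluated on the single point `test`."""
    x, y = test
    return int(f_M(learn, x) != y) - int(f_Mp(learn, x) != y)

def phi0(z, f_M, f_Mp):
    """Symmetrization (5): average of phi over the g + 1 choices of the
    held-out test observation (cyclic permutations suffice, since the
    learning algorithms treat learning observations symmetrically)."""
    return sum(phi(z[:i] + z[i + 1:], z[i], f_M, f_Mp)
               for i in range(len(z))) / len(z)

# Two toy deterministic rules on a scalar predictor (illustrative only).
def majority(learn, x):
    """Predicts the majority class of the learning set; deterministic ties."""
    ys = [y for _, y in learn]
    return max(sorted(set(ys)), key=ys.count)

def one_nn(learn, x):
    """1-nearest-neighbour prediction on a scalar predictor."""
    return min(learn, key=lambda z: abs(z[0] - x))[1]

data = [(0.1, 0), (0.3, 0), (0.9, 1), (1.1, 1), (0.5, 0)]
d = phi0(data, majority, one_nn)   # empirical error rate difference
```

By construction, Φ_0 is invariant under permutations of its g + 1 arguments, which can be checked directly on this example.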

Remark 2.2. In case one is interested in estimating e_M only instead of a difference ∆ = e_M − e_{M′}, one can set the second summand of (3) identically to zero. We will not go into the details.

2.3. Tests of the true error rate. In this section, let us recall the test problem of interest. We want to test the null hypothesis H_0 : E(Φ) = 0 against the alternative H_1 : E(Φ) ≠ 0. The former is usually called the unconditional null hypothesis (Braga-Neto & Dougherty, 2004).

Remark 2.3. There is also a conditional null hypothesis where the classification rule is supposed to be given, for instance learnt on fixed independent data, and the expectation is taken only with respect to the test set. However, the learning data are usually also random and may even be modelled as drawn from P as well. In this case, the conditional null depends on random data, leading to severe difficulties in the interpretation of the type one error. For this reason, in this paper we will only consider the unconditional error rate. However, setting g = 0 and plugging a ready-made classification rule into Φ, regardless of the data it was learnt on, leads to a sort of conditional null hypothesis. In this case, the true error becomes a random variable of the learning data, and the latter must not be from the sample (z_1, ..., z_n). We will not go into details of conditional testing or of the case g = 0.

The form of H_0 suggests a t-test. However, the number of independent realizations of Φ(z_1, ..., z_{g+1}) is only ⌊n/(g+1)⌋, since Φ is to be computed with respect to P^{⊗(g+1)}. Therefore, a correct t-test would be severely underpowered, and cross-validation procedures are usually preferred.


2.4. Cross-validation. Let us now show how cross-validation procedures fit into the framework described above. In a cross-validation procedure, dependent realizations of Φ(z_1, ..., z_{g+1}) are considered. More precisely, for every ordered subset T = (i_1, ..., i_g; i_{g+1}) of {1, ..., n},

(7)  ∆̃(T) := Φ(z_{i_1}, ..., z_{i_g}; z_{i_{g+1}})

is an estimator of ∆, where we visually separate learning and test sets again. We view ∆̃(T) as an estimator of the difference of error rates of classifiers learnt on samples of size g instead of n, in contrast to differing usage in the literature. Thus, ∆̃(T) is unbiased. Of course, the term "ordered subset" refers to the order 1 < ··· < g + 1; it is not imposed that i_1 < ··· < i_{g+1}. Similarly, if S = {i_1, ..., i_{g+1}} is an unordered subset of size g + 1 of {1, ..., n}, the value Φ_0(z_{i_1}, ..., z_{i_{g+1}}), using the symmetric Φ_0 instead of Φ, does not depend on the order of S. Therefore, we can unambiguously extend definition (7) to a function ∆̃ also on the collection of unordered subsets S by setting ∆̃(S) := Φ_0(S). This is an unbiased estimator of ∆. Also, let 𝒯 be a collection of ordered subsets T as above.

Then, let

(8)  ∆̃(𝒯) := (1/|𝒯|) ∑_{T ∈ 𝒯} ∆̃(T)

be the average of all values ∆̃(T) over 𝒯, and similarly let ∆̃(𝒮), for a collection 𝒮 of unordered subsets S as above, be the average of all values ∆̃(S) involving the symmetric function Φ_0. Equation (8) may involve each learning set multiple times, because each observation of a test sample can then contribute a summand to (8). In other contexts the mean error rate over the entire test sample is counted as only one occurrence of the learning set. For any such collections 𝒯 or 𝒮, the estimators ∆̃(𝒯) and ∆̃(𝒮) are unbiased for ∆. As soon as 𝒯 contains, together with an ordered subset T = (i_1, ..., i_g; i_{g+1}), all its cyclic permutations (i_2, ..., i_{g+1}; i_1), (i_3, ..., i_{g+1}, i_1; i_2) and so on, we have ∆̃(𝒯) = ∆̃(𝒮), where the collection 𝒮 is obtained from the collection 𝒯 by forgetting the order (and the multiple entries with the same order coming from the cyclic permutation).

Now, ordinary K-fold cross-validation can be incorporated in this framework as follows. Suppose that K is such that K(n − g) = n, possibly after disregarding a few observations in order to ensure divisibility of n by n − g. Therefore, g ≥ n/2. The extreme cases are g = n/2 for K = 2 and g = n − 1 for K = n. Let 𝒯_CV be the collection of ordered subsets of the form

(9)  T = (1, ..., k(n − g), (k + 1)(n − g) + 1, ..., n; t),

where k = 0, ..., K − 1 enumerates the learning blocks, the notation is to be read in such a way that if k = 0 the first entry is n − g + 1 and if k = K − 1 the last one is n, and t ∈ {k(n − g) + 1, ..., (k + 1)(n − g)} enumerates all test observations distinct from the learning block. Thus, T consists of one or two learning strides whose indices are contiguous and whose sizes add up to g, together with a single test observation index distinct from any learning observation index. Then ∆̃(𝒯_CV) recovers the ordinary cross-validation estimator of ∆. In practice, one may also compute ∆̃(𝒯_CV) from a permutation of the data, but this does not influence the formal description because P^{⊗n} is permutation-invariant.
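As a sketch, the design 𝒯_CV of (9) can be generated as follows (the helper name `cv_design` is ours; indices are 0-based here, while the paper uses 1-based indices):

```python
def cv_design(n, K):
    """Ordered subsets of form (9) for K-fold cross-validation, assuming
    K divides n. Block k of size n - g = n // K is held out; each of its
    observations t gives one ordered subset (learning indices; t)."""
    p = n // K                               # test-block size n - g
    design = []
    for k in range(K):                       # enumerate the K learning blocks
        learn = [i for i in range(n) if not (k * p <= i < (k + 1) * p)]
        for t in range(k * p, (k + 1) * p):  # test observations of block k
            design.append((tuple(learn), t))
    return design

folds = cv_design(10, 5)   # n = 10, K = 5, hence g = 8
```

Each of the n observations appears exactly once as a test index, and each learning set has the contiguous one-or-two-stride structure described above.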

Definition 2.4. We will in general refer to estimators of the form ∆̃(𝒯) or ∆̃(𝒮) given by (8) as cross-validation-like procedures, irrespective of the structure of 𝒯 or 𝒮.


It was shown in Bengio & Grandvalet (2003) that there is no unbiased estimator of the variance V_{P^{⊗n}}(∆̃(𝒯_CV)) for any cross-validation procedure 𝒯_CV, i.e. for any divisor K of n. It seems plausible from this tedious description of cross-validation that such a particular design 𝒯_CV, consisting merely of sets of the special form (9), does not lead to a globally small variance of ∆̃(𝒯_CV) among all possible designs 𝒯 with fixed learning set size g. This variance is minimal for the cross-validation-like procedure 𝒯_max consisting of all size-(g+1) subsets. We will exhibit the cases where there is an unbiased estimator of its variance, in contrast to the cross-validation case. Let us call this 𝒯_max the maximal design. Another immediate advantage of the maximal design over an incomplete one is that the need for a balanced data set, i.e. one whose class labels are equally frequent, and/or for balanced blocks falls away. The only case of g and K such that 𝒯_CV = 𝒯_max is the leave-one-out case g = n − 1, i.e. K = n.

Similarly, one can distinguish those cases of g and K such that the associated design 𝒯 contains, along with every ordered subset T, all its cyclic permutations. In such a case, ∆̃(𝒯) = ∆̃(𝒮) for the design 𝒮 corresponding to 𝒯. Among the cross-validation procedures, only the leave-one-out case g = n − 1, i.e. K = n, produces this situation. However, among the cross-validation-like procedures, this can happen for any g. For instance, it holds for the maximal design for all 0 ≤ g ≤ n − 1. This is important to keep in mind for numerical implementation.

2.5. The full cross-validation-like estimator of ∆ is a U-statistic. In this section, we show that the cross-validation-like procedure with maximal design, where all size-g subsets of the sample are used for learning, is a U-statistic and therefore has least variance among all cross-validation-like procedures. It seems that this fact has not yet been described in the literature. Among the immediate consequences of interpreting this procedure as a U-statistic will be asymptotic normality, the first case of Theorem 4.1. The parameter of interest Θ = ∆ is a regular parameter because of (4); this equation also shows that its degree is at most g + 1.

Assumption 2.5. The degree of ∆ is exactly g + 1.

This states that the true error rate cannot be computed from learning samples of smaller size than g for all (reasonable) distributions P. While it is not automatic, it seems to be violated only in irrelevant artificial counterexamples, such as for instance one of the form Φ(z_1, z_2; z_3) = Φ′(z_2; z_3), where the learning step only makes use of a part of the learning set observations, and in similar cases. So, the assumption is natural.

Let 𝒮_max and 𝒯_max be the maximal designs of unordered and ordered subsets, respectively, as introduced above. The corresponding error rate estimator is then the U-statistic associated to the particular kernel Ψ = Φ and Ψ_0 = Φ_0, respectively. We define

(10)  ∆̂ := U(Φ_0) = ∆̃(𝒮_max) = ∆̃(𝒯_max)

as the associated U-statistic as in Definition 2.1, i.e. the one defined by the symmetric kernel Φ_0. It follows immediately from Hoeffding (1948) that it has minimal variance among all unbiased estimators of ∆. In particular, it has strictly smaller variance than all cross-validation procedures for 2 ≤ g ≤ n − 2, and is equal to the cross-validation estimator in the leave-one-out case g = n − 1. Lee (1990, Section 4.3, Theorems 1 and 4) describes quantitatively the variance decrease of ∆̂ with respect to ∆̃. These theorems treat the case of a fixed 𝒮 and of an 𝒮 consisting of random subsets, respectively. The statistic ∆̂ coincides with what is called complete cross-validation in Kohavi (1995), as well as with


complete repeated sub-sampling as considered in Boulesteix et al. (2008), or leave-p-out

cross-validation in Shao (1993) and Arlot & Celisse (2010), where p = n − g.

In practice, the definition of ∆̂ involves too many summands for computation, but it can easily be approximated to arbitrary precision using a collection 𝒮 of random subsets; see Section 5.
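A minimal sketch of this approximation (function names are ours; `phi0` stands for any symmetric kernel of g + 1 arguments, such as Φ_0):

```python
import random
from itertools import combinations

def complete_estimator(z, phi0, g):
    """Leave-p-out estimator (10): phi0 averaged over ALL unordered
    (g+1)-subsets -- exact, but with binomial(n, g+1) summands."""
    subs = list(combinations(z, g + 1))
    return sum(phi0(s) for s in subs) / len(subs)

def incomplete_estimator(z, phi0, g, B, seed=0):
    """Unbiased approximation: phi0 averaged over B random (g+1)-subsets,
    i.e. an incomplete U-statistic with a random design S."""
    rng = random.Random(seed)
    return sum(phi0(tuple(rng.sample(z, g + 1))) for _ in range(B)) / B

# Sanity check with phi0 = subset mean: the complete estimator then
# reduces to the overall sample mean, since each element appears in
# equally many subsets.
mean_kernel = lambda s: sum(s) / len(s)
z = [0.2, 1.5, -0.7, 3.0, 0.8, 1.1]
exact = complete_estimator(z, mean_kernel, g=2)
```

The incomplete version trades the binomial number of summands for B evaluations, at the price of an additional Monte Carlo variance quantified in Section 5.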

3. A U-STATISTIC ESTIMATOR OF V(∆̂)

3.1. Variances are regular parameters. The theory of U-statistics comes to full power as soon as not only the original regular parameter ∆ is estimated optimally by a U-statistic ∆̂, but also the variance V(∆̂) of this U-statistic itself is exhibited as another regular parameter, this time depending not only on Φ_0 but also on n. Therefore, we are in a position to estimate V(∆̂) by a U-statistic as well.

In the following proposition, we outline formally that variances and covariances are regular parameters in general, without determining the degree optimally. Thus, the full power of U-statistics can be used to estimate them. We then pin down the degree in Proposition 3.2.

Proposition 3.1. Let f(z_1, ..., z_k) be a function of k realizations of independent identically distributed random variables Z_i ∼ P with existing variance V_{P^{⊗k}}(f) < ∞. Then the variance V_{P^{⊗k}}(f) is a regular parameter of degree at most 2k. More generally, the covariance between two such functions f and g, as soon as it exists, is a regular parameter of degree at most 2k.

Proof. Both V(f) and cov(f, g) are, by definition, polynomials in integrals with respect to P. In order to show that they are regular parameters, we have to rewrite each one as a single integral instead. This is accomplished by

(11)  V_{P^{⊗k}}(f) = E(f^2) − E(f)^2 = ∫···∫ (1/2)(f(z_1, ..., z_k) − f(z_{k+1}, ..., z_{2k}))^2 dP(z_1) ··· dP(z_{2k})

and an almost analogous formula for the covariance cov_{P^{⊗k}}(f, g).

The integrand is not unique. It was chosen so as to resemble the symmetric kernel (z_1 − z_2)^2/2 of the variance of P itself, i.e. the case r = 1, f(z_1) = z_1. Furthermore, the degree of V(f), i.e. the minimal length of an integrand that accomplishes this, can be much smaller than 2k and depends on f. Also, the integrand of (11) is not symmetric in general and remains to be symmetrized.
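Identity (11) can be verified exactly on a toy case of our choosing (P uniform on {0, 1}, k = 2 and f(z_1, z_2) = z_1 + z_2) by enumerating all 2k-tuples with exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

def kernel_11(f, a, b):
    """Integrand of (11): half the squared difference of f evaluated on
    two independent blocks of k observations each."""
    return Fraction(1, 2) * (f(*a) - f(*b)) ** 2

f, k = (lambda z1, z2: z1 + z2), 2

# Left-hand side: V(f) under P uniform on {0, 1}, computed directly.
vals = [f(*t) for t in product((0, 1), repeat=k)]
mu = Fraction(sum(vals), len(vals))
var_f = sum((v - mu) ** 2 for v in vals) / Fraction(len(vals))

# Right-hand side: average of the kernel over all (2k)-tuples, i.e. the
# integral against dP(z_1) ... dP(z_2k).
rhs = sum(kernel_11(f, t[:k], t[k:])
          for t in product((0, 1), repeat=2 * k)) / Fraction(2 ** (2 * k))
# var_f == rhs == 1/2
```

The equality holds exactly here because the expectation over the discrete P is a finite average.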

Let us now investigate the case where f is a U-statistic associated to a symmetric kernel Φ_0. Caution has to be taken because the regular parameter now depends on n, in sharp contrast to the U-statistic ∆̂ itself. For the case f = ∆̂, we have k = n, so our knowledge attained so far on the degree of the kernel of V(∆̂) is that it is at most 2n. However, it is possible to obtain better insight into the degree of the variance. It will turn out that the variance is a linear combination of regular parameters whose degrees do not depend on n; only the coefficients of the linear combination depend on n. This is the content of the following proposition, which presents in short form results of Hoeffding (1948, Section 5) as well as immediate consequences.

In the following, we will consider a general underlying U-statistic U which estimates an unknown parameter Θ, and develop the theory of its variance as it is needed for its estimation. From Section 3.2 on, we will pay particular attention to the case where U is associated to the kernel Φ_0 defined by (5), thus Θ = ∆ and U = ∆̂.


Proposition 3.2. Let U be the U-statistic associated to a bounded symmetric kernel Φ_0 of degree m and to a total sample size n. Denote Θ = E(Φ_0). Then the variance of U is a regular parameter of degree at most 2m. Furthermore, it splits as a sum

(12)  V(U) = ∑_{c=1}^{m} α_c κ_c − (1 − α_0) Θ^2,

where α_c is the mass function at c of the hyper-geometric distribution H(n, m, m), and all κ_c are regular parameters satisfying

(13)  κ_c = ∫···∫ Φ_0(z_1, ..., z_m) Φ_0(z_{m−c+1}, ..., z_{2m−c}) dP(z_1) ··· dP(z_{2m−c}).

Thus, κ_c is a regular parameter of degree at most 2m − c. Furthermore, since

(14)  Θ^2 = ∫···∫ Φ_0(z_1, ..., z_m) Φ_0(z_{m+1}, ..., z_{2m}) dP(z_1) ··· dP(z_{2m}),

Θ^2 is a regular parameter of degree at most 2m.

Proof. Direct computation shows that the right-hand side of (13) coincides with what is called E(Φ_c^2(X_1, ..., X_c)) in Hoeffding (1948, Section 5) for all 1 ≤ c ≤ m. This step involves the symmetry of the kernel Φ_0 and careful renaming of the variables. Hoeffding already supposes a symmetric kernel, which he calls Φ. The quantities ζ_c of Hoeffding (1948) – which are called σ_c in Lee (1990) – are thus related to our κ_c by means of the equation ζ_c = κ_c − Θ^2, as follows from Hoeffding (1948, formula 5.10). From V(U) = ∑_{c=1}^{m} α_c ζ_c (Hoeffding, 1948, 5.13) we thus deduce V(U) = ∑_{c=1}^{m} α_c (κ_c − Θ^2) = ∑_{c=1}^{m} α_c κ_c − (1 − α_0) Θ^2, because ∑_{c=1}^{m} α_c = 1 − α_0. The fact that linear combinations of regular parameters are regular parameters (Hoeffding, 1948, page 295) completes the proof.

Proposition 3.2 achieves the desired simplification: the degree of V(U) for a U-statistic U of degree m is shown to be at most 2m instead of 2n, and the dependence of V(U) on n is now expressed solely by means of the hyper-geometric mass function, whereas the κ_c and E(U)^2 do not depend on n.
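Numerically, the weights α_c and decomposition (12) are easy to evaluate once values for the κ_c and Θ^2 are available; a sketch (function names are ours):

```python
from math import comb

def alpha(n, m, c):
    """Mass function of the hyper-geometric distribution H(n, m, m) at c:
    the probability that two independent uniform size-m subsets of
    {1, ..., n} share exactly c elements."""
    return comb(m, c) * comb(n - m, m - c) / comb(n, m)

def variance_of_U(n, m, kappa, theta_sq):
    """Right-hand side of (12), with kappa a dict {c: kappa_c}, c = 1..m."""
    return (sum(alpha(n, m, c) * kappa[c] for c in range(1, m + 1))
            - (1 - alpha(n, m, 0)) * theta_sq)

# The weights sum to one over c = 0..m, and in the formally degenerate
# case kappa_c = Theta^2 for all c (excluded by Assumption 3.4 below),
# the variance vanishes, as (12) predicts.
total = sum(alpha(20, 3, c) for c in range(4))
v = variance_of_U(20, 3, {1: 4.0, 2: 4.0, 3: 4.0}, 4.0)
```

This makes the n-dependence explicit: only the α_c change with n, while the κ_c and Θ^2 are fixed parameters of P.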

Remark 3.3. Direct computation yields E(U^2) = ∑_{c=1}^{m} α_c κ_c + α_0 Θ^2, making use of the fact that the kernel is symmetrized. Together with the usual decomposition V(U) = E(U^2) − E(U)^2 = E(U^2) − Θ^2, this also proves (12) and shows that the degree of E(U^2) is at most 2m. It is natural to assume that its degree is exactly 2m, in analogy to Assumptions 2.5 above and 3.4 below. In contrast, the advantage of decomposition (12) is that its first summand only involves the κ_c, which all have smaller degree than 2m, namely 2m − c. Therefore, we prefer decomposition (12) over the usual decomposition, and we work with and estimate the quantities κ_c rather than Hoeffding's ζ_c, which all have degree 2m (see also Remark 3.8 below).

3.2. Definition of the U-statistic for the variance. In order to estimate the variance of the U-statistic U by another U-statistic, we need the following.

Assumption 3.4. In the general situation of Proposition 3.2, the statistic U is non-degenerate. In the particular case U = ∆̂ where Θ = ∆, this states that κ_c ≠ ∆^2 for 1 ≤ c ≤ g + 1. Furthermore, we assume in the situation of Proposition 3.2 that all upper bounds for the degrees thus obtained are optimal. In the particular case U = ∆̂, this means that the degree of κ_c is 2m − c = 2g + 2 − c and that of Θ^2 = ∆^2 is 2m = 2g + 2.


The non-degeneracy can be numerically checked for plausibility, unlike Assumption 2.5 and the degree optimality, which both state that the regular parameters cannot be written with a smaller number of integrals. There seems to be no reason why a kernel of the form (5) with a non-trivial classifier should not satisfy them. The first part of Assumption 3.4 is needed for the central limit theorem 4.1, the second one for Theorem 3.9.

Proposition 3.2 motivates the following deﬁnition.

Definition 3.5. In the general situation of Proposition 3.2 and under Assumption 3.4, the statistics $\widehat{\kappa}_c$ for $1 \leq c \leq m$, of degree $2m-c$, and the statistic $\widehat{\Theta^2}$, of degree $2m$, are defined as the U-statistics associated to the symmetrized versions of the kernels that appear as integrands in (13) and (14), respectively.

Estimating $\Theta^2$ by $U^2$ instead would be biased and thus would not fit into our framework.

Remark 3.6. It would not be suitable simply to estimate $\Theta^2$ by zero in view of $H_0\colon \Theta = 0$. The first reason is that failure to subtract $(1-\alpha_0)\cdot\widehat{\Theta^2}$ from the variance estimator (15) below would overestimate the variance $V(U)$ under $H_1$, leading to a severe loss of power. The second is that it would conflict with Hoeffding's setup. In fact, under the null hypothesis $\Theta = 0$, the degree of $\Theta$ and that of $\Theta^2$ would be trivially zero if we were willing to restrict Hoeffding's class $\mathcal{P}$ of distributions to only those obeying the null; however, the least-variance optimality property of a U-statistic relies on $\mathcal{P}$ encompassing all purely discontinuous distribution functions, not only null ones. Likewise, the degree has to be defined for a global class of null and alternative distributions together. This is akin to a classical one-way analysis of variance statistic, where estimating the variance within and between groups separately greatly increases the power.

We can now deﬁne the variance estimator of a U-statistic as a U-statistic itself.

Definition 3.7. In the general situation and notation of Proposition 3.2 and under Assumption 3.4, we define an estimator, abbreviated $\widehat{w}$, for the variance of the U-statistic $U$ as the U-statistic associated to the linear combination, as in (12), of the kernels of the $\kappa_c$ and of $\Theta^2$ given by (13) and (14).

After a short and straightforward computation, the definition can be restated in more explicit form: the single U-statistic $\widehat{w}$ splits as a sum

(15)   $\widehat{w} := \sum_{c=1}^{m} \alpha_c \widehat{\kappa}_c - (1-\alpha_0)\,\widehat{\Theta^2}$

of U-statistics of varying degrees. In the particular case $U = \widehat{\Delta}$ where $\Theta = \Delta$, this defines an estimator for $V(\widehat{\Delta})$, which will be abbreviated by $\hat{v}$.

The estimator $\widehat{w}$ enjoys unbiasedness and optimality properties analogous to those of $\widehat{\Delta}$. In particular, this applies to $\hat{v}$.

Remark 3.8. In the latter case $U = \widehat{\Delta}$, we have $m = g+1$, so the degree of $\hat{v}$ is $2g+2$, and that of $\widehat{\kappa}_c$ was given in the degree optimality statement of Assumption 3.4. The reason for splitting $\widehat{w}$ into several U-statistics of varying degree is numerical efficiency: Hoeffding (1948) suggests estimating $\zeta_c = \kappa_c - \Theta^2$. However, all of these parameters have degree $2m$. Instead, it is advisable to estimate the $\kappa_c$, which have smaller degree $2m-c$; then $\Theta^2$ needs to be estimated only once. This remark is the empirical analogue of Remark 3.3.


3.3. Existence criterion and order of consistency. We are now in a position to investigate the estimator of $V(\widehat{\Delta})$. In principle, this section applies to the general situation of Proposition 3.2, but in order to keep the presentation clear we will focus on the interesting case $\Theta = \Delta$, $U = \widehat{\Delta}$ for the rest of the paper. Therefore, we will write $\widehat{\Delta^2}$ for the statistic $\widehat{\Theta^2}$, whereas we will not introduce a special notation for the statistics $\widehat{\kappa}_c$ in that case.

In the consistency statement of Theorem 3.9 below, the true parameter $V(\widehat{\Delta})$ depends on $n$, unlike in a typical consistency statement. In principle, the sample size used for this estimation need not be the same $n$ again, but can in fact be any number $n' \geq 2g+2$. However, in practice the same sample is used to estimate both $\Delta$ and $V(\widehat{\Delta})$, so we restrict our attention to the diagonal case $n = n'$ for simplicity. This is analogous to the ordinary one-sample t-test statistic, where both the numerator, the sample mean, and the denominator, its standard deviation, are estimated simultaneously on the same sample, hence with the same $n$. However, in our case, no factor $n^{-1/2}$ cancels out between the two.

Theorem 3.9. If $n \geq 2g+2$, the estimator $\hat{v}$ of $V(\widehat{\Delta})$ has least variance among all unbiased estimators of $V(\widehat{\Delta})$ over any family of distributions $\mathcal{P}$ containing all purely discontinuous distribution functions. Furthermore, $\hat{v}$ is strongly consistent in the sense that $n^{d/2}(\hat{v} - V(\widehat{\Delta})) \to 0$ almost surely for any $0 \leq d \leq 2$.

We do not claim to have exhibited the optimal order of consistency.

Proof (of Theorem 3.9). The unbiasedness of $\hat{v}$ as well as its least-variance optimality follow from the general properties of U-statistics. Only the consistency statement remains to be shown. For $0 \leq d < 1$, Hoeffding (1963, Equation 5.7), applied to the U-statistic $\hat{v}$, whose kernel is bounded between 0 and 1, yields the quantitative version

(16)   $P\left(\left|\hat{v} - V(\widehat{\Delta})\right| \geq \varepsilon n^{-d/2}\right) \leq 2\exp\left(-2\lfloor n/(2g+2)\rfloor\, \varepsilon^2 n^{-d}\right)$

for any $\varepsilon > 0$, which has to be applied with care because the degree of the U-statistic varies with $n$. This only applies to $0 \leq d \leq 1$ and only shows weak consistency. In the following,

we make use of the fact that U-statistics are strongly consistent, meaning that they satisfy the strong law of large numbers if the kernel is absolutely integrable, for instance bounded. This was first shown in an unpublished paper of 1961 by Hoeffding; a complete proof is given in Lee (1990, Section 3.4.2, Theorem 3). For all cases $0 \leq d \leq 2$, let us first show that $n\hat{v}$ almost surely tends to $(g+1)^2(\kappa_1 - \Delta^2)$. For $c \geq 2$, the summands $n\alpha_c\widehat{\kappa}_c$ of $n\hat{v}$ almost surely tend to zero, because $n\alpha_c$ does, and $\widehat{\kappa}_c$ is strongly consistent, so the sequence $\widehat{\kappa}_c$ for $n \to \infty$ is almost surely bounded, for every $c$. For $c = 1$, the summand $n\alpha_1\widehat{\kappa}_1$ almost surely tends to $(g+1)^2\kappa_1$, because $n\alpha_1 \to (g+1)^2$, and $\widehat{\kappa}_1$ is strongly consistent. Similarly, the summand $n(1-\alpha_0)\widehat{\Delta^2}$ almost surely tends to $(g+1)^2\Delta^2$, because $n(1-\alpha_0) \to (g+1)^2$, and $\widehat{\Delta^2}$ is strongly consistent.

The statement now follows from the fact that $\lim_{n\to\infty} nV(\widehat{\Delta}) = (g+1)^2(\kappa_1 - \Delta^2)$ (Hoeffding, 1948, 5.23).

Under $H_0$, there are unbiased estimators already for smaller $n$, since then $\Delta^2$ does not need to be estimated. However, as noted in Remark 3.6, the optimality property cannot be shown in this case.

4. TESTING

4.1. Central limit theorem. The convergence of $\widehat{\Delta}$ towards $\Delta$ as $n \to \infty$ is described by the strong law of large numbers, the law of the iterated logarithm, and the Berry-Esseen theorem (Lee, 1990, Section 3.4.2, Theorem 3; Section 3.5, Theorem 1; Section 3.3.2, Theorem 1, respectively). In order to show the existence of an asymptotically exact test, we need the following theorem, as it subsumes the unstudentized and the studentized case. It is reminiscent of, and contains as a special case, the statement that the t-distributions tend to $N(0,1)$ as the degrees of freedom tend to infinity.

Theorem 4.1. Let $u(n)$ be one of the following expressions: the asymptotic variance $u(n) := (g+1)^2(\kappa_1 - \Delta^2)/n$; the expression $u(n) := (g+1)^2(\widehat{\kappa}_1 - \widehat{\Delta^2})/n$, where $\widehat{\kappa}_1$ and $\widehat{\Delta^2}$ are defined by the case $\Theta = \Delta$, $U = \widehat{\Delta}$ in Definition 3.5; or $u(n) := \hat{v}$ as in Definition 3.7. Then $(\widehat{\Delta} - \Delta)\,u(n)^{-1/2}$ converges in distribution to $N(0,1)$ as $g$ remains fixed and $n \to \infty$.

The occurrence of the factor $(g+1)^2$ is explained by the fact that this is the decay rate of the coefficient $\alpha_1$, in the sense that $\lim_{n\to\infty} n\alpha_1 = (g+1)^2$. This also explains the asymptotic behaviour of the variance.

The first case of Theorem 4.1 shows approximate normality of the unstudentized statistic $\widehat{\Delta}$ itself. It seems that there exists no statement in the literature giving the precise reason why a cross-validation type estimator is asymptotically normally distributed. This case appears in the asymptotic variance statement of Hoeffding (1948, 5.23). The second case is included for systematic reasons; this expression is the empirical analogue of the first case, but is a biased variance estimator. Finally, the third case involves the unbiased variance estimator elaborated in the present manuscript. The fact that $V(\widehat{\Delta})/u(n)$ tends to one, shown in the following proof, is not immediate, due to the diagonality property $n = n'$ mentioned above. Likewise, it is not obvious whether this ratio almost surely tends to one.

Proof (of Theorem 4.1). In the first case, this is Hoeffding (1948, Theorem 7.1) and rests on the validity of the first part of Assumption 3.4. In the other cases, the proof proceeds simultaneously. First, the proof of Theorem 3.9 shows that the convergence of $nV(\widehat{\Delta})$ implies the almost sure convergence not only of $n\hat{v}$ but in fact of $nu(n)$ for any of the choices of $u(n)$. Thus, $(nu(n))^{-1}$ is almost surely bounded. This statement is licit because the first part of Assumption 3.4 implies $\kappa_1 \neq \Delta^2$, so $nu(n)$ converges to a non-zero value and, therefore, there are at most finitely many $n$ such that $u(n) = 0$ has positive probability. Consequently, we may multiply $n(u(n) - V(\widehat{\Delta}))$, which converges almost surely to zero, hence also in probability, with $(nu(n))^{-1}$ to show that in each case the ratio $V(\widehat{\Delta})/u(n)$ tends to one in probability by Slutsky's theorem. By the continuous mapping theorem, $(V(\widehat{\Delta})/u(n))^{1/2}$ tends to one in probability as well. Therefore, $(\widehat{\Delta} - \Delta)\,u(n)^{-1/2} = (\widehat{\Delta} - \Delta)(V(\widehat{\Delta}))^{-1/2}(V(\widehat{\Delta})/u(n))^{1/2}$ tends to $N(0,1)$ in distribution by the first case and another application of Slutsky's theorem.

4.2. Asymptotic rejection regions and confidence intervals. Thus, the two-sided test of $H_0$ with the rejection region

(17)   $\left\{ \left|\widehat{\Delta}\right| \geq u(n)^{1/2}\, \phi^{-1}(1-\alpha/2) \right\}$

has asymptotic level $\alpha$, where $\phi$ is the standard normal cumulative distribution function. While the second case of Theorem 4.1 uses a positively biased variance estimator and hence provides a conservative test which, however, is asymptotically exact, the third case provides the best approximation to exactness already in the finite-sample case. Likewise, an asymptotically exact confidence interval for $\Delta$ at level $1-\alpha$ is

(18)   $\left[\,\widehat{\Delta} - u(n)^{1/2}\phi^{-1}(1-\alpha/2),\; \widehat{\Delta} + u(n)^{1/2}\phi^{-1}(1-\alpha/2)\,\right].$
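Once $\widehat{\Delta}$ and $u(n)$ have been computed, the test (17) and the interval (18) reduce to an ordinary z-test. The following sketch (Python; the function and variable names are ours, and $\widehat{\Delta}$ and $u(n)$ are taken as given inputs) shows the computation:

```python
from math import sqrt
from statistics import NormalDist

def asymptotic_test(delta_hat, u_n, alpha=0.05):
    """Two-sided test (17), interval (18), and p-value from Theorem 4.1."""
    z = NormalDist()                           # standard normal
    crit = z.inv_cdf(1 - alpha / 2)            # phi^{-1}(1 - alpha/2)
    half = crit * sqrt(u_n)
    reject = abs(delta_hat) >= half            # rejection region (17)
    ci = (delta_hat - half, delta_hat + half)  # confidence interval (18)
    p_value = 2 * (1 - z.cdf(abs(delta_hat) / sqrt(u_n)))
    return reject, ci, p_value
```

For instance, with values of the order reported in Section 6 ($\widehat{\Delta} = -0.14$, $\hat{v} \approx 0.01$), the interval covers zero and the test does not reject at the 5% level.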


A related but different approach to a similar testing problem is provided by the so-called empirical Bernstein inequalities in Peel et al. (2010, Equations 12, 13). These are sharp empirical inequalities for general U-statistics associated to bounded kernels. However, $n$ has to be an integer multiple of the degree, and the authors do not consider cross-validation, but only partitions of the test set.

5. THE CONVERGENCE OF INCOMPLETE TO COMPLETE U-STATISTICS IN PRACTICE

In practical applications, the number of summands of (1) is too large for computation. In the particular case where one of the learning methods $M$ is a k-nearest-neighbour algorithm, it is possible to compute the corresponding summand of the complete U-statistic, the leave-p-out cross-validation estimator of the error rate, by an efficient closed-form expression (Celisse & Mary-Huard, 2012). In general, however, one can only consider a design $T$ smaller than the full $T_{\max}$, leading to incomplete U-statistics as treated, for instance, in Lee (1990). We now show that the incomplete U-statistic with random design approximates the complete one satisfactorily after a feasible number of iterations.

Let $\Phi$ be a not necessarily symmetric kernel with $-1 \leq \Phi \leq 1$, let $T^*$ be a collection of $N$ ordered size-$m$ subsets of $\{1,\ldots,n\}$ drawn at random from the equidistribution $Q$ on the collection of such subsets, and let $\Phi(T^*)$ be the associated incomplete U-statistic. Then the probability of an approximation error of at least $\delta > 0$ is bounded by

(19)   $\mathrm{pr}_Q\left(\,|\Phi(T^*) - \Phi(T)| \geq \delta\,\right) \leq 2\exp\left(-\delta^2 N/2\right).$

This follows from Hoeffding (1963, Theorem 2), because the entries of $T^*$ were drawn independently of each other. One should be aware that here we do not refer to the part of Hoeffding (1963) concerned with U-statistics, in contrast to the situation of the similar inequality (16), where we did so. Here, we formulated the version for ordered subsets because this is of immediate interest for computation.

The fast exponential decay of (19) implies that sufficiency of the approximation is assured as soon as $N$ is a small multiple of $\delta^{-2}$, where $\delta$ is the pre-specified tolerance. Precisely, the following corollary of Hoeffding's theorem can be used in practice.

Corollary 5.1. After at most $N = 10^{2d+1}$ iterations, $d$ digits after the comma are fixed with probability at least $1 - 2\exp(-5) \approx 0.99$.
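The mechanism behind (19) and Corollary 5.1 can be illustrated with a toy bounded kernel (a hypothetical stand-in; the kernel below is not the error-rate kernel of the paper): averaging the kernel over $N$ randomly drawn ordered subsets approximates the complete average over all ordered subsets, with the deviation controlled by $N$ alone.

```python
import random
from itertools import permutations

def complete_u(data, kernel):
    """Complete U-statistic of degree 2: average over all ordered pairs."""
    pairs = list(permutations(range(len(data)), 2))
    return sum(kernel(data[i], data[j]) for i, j in pairs) / len(pairs)

def incomplete_u(data, kernel, N, seed=0):
    """Incomplete U-statistic: average over N randomly drawn ordered pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(N):
        i, j = rng.sample(range(len(data)), 2)
        total += kernel(data[i], data[j])
    return total / N

def sign_agreement(x, y):
    # toy kernel bounded in [-1, 1], as required by (19)
    return 1.0 if (x > 0) == (y > 0) else -1.0
```

With $d = 2$ digits of tolerance, Corollary 5.1 suggests $N = 10^5$ draws, which is exactly the number of iterations used in Section 6.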

Such a number of repetitions is, in general, demanding but feasible, because this $N$ is merely the number of times a model has to be fitted. For instance, in the illustration in Section 6, no tuning of the hyper-parameter $\lambda$ is part of each iteration. Remarkably, this bound on $N$ holds irrespective of the sample size $n$ and of any properties of the particular U-statistic under consideration, apart from $-1 \leq \Phi \leq 1$. In practice, however, one proceeds slightly differently. For the approximation of the U-statistic $\widehat{\Delta}$, for instance, one applies the following procedure, which yields even faster convergence to the true $\widehat{\Delta}$: in the formal setting required for inequality (19), one would use only one test observation for each learning iteration, which would lead to unnecessarily high computational cost; instead, one simply uses all remaining $n-g$ observations for testing. This speeds up convergence even further. Corollary 5.1 also applies to the computation of $\hat{v}$ by the linear combination of U-statistics explained above, because the kernels appearing in (13) and (14) are bounded between $-1$ and $1$ as well.


6. THE CALCULATIONS IN A REAL DATA EXAMPLE

The estimation procedure elaborated in the preceding sections was applied to the well-investigated colon cancer data set of Alon et al. (1999), where the binary response $y \in Y$ stands for the type of tissue (either normal tissue or tumor tissue) and the 2000 continuous predictors are gene expressions. We used lasso-penalized logistic regression with the coordinate descent method for classification (Friedman et al., 2010) and the penalization parameters $\lambda = 0.08$ and $\lambda = 0.5$. These values were pre-selected so that the software-internal estimator of the difference of error rates was greater than 0.1; this pre-selection involved the whole data set, which, however, poses no problem here. The sample size was $n = 62$. Therefore, the condition $n \geq 2g+2$ constrained $g \leq 30$. Since the variance of the U-statistic $\hat{v}$ decreases to the extent to which the sample size exceeds the degree $2g+2$, the learning set size $g$ was somewhat arbitrarily chosen to be only 26, as a compromise with the effort to avoid a too small learning set. There was numerical evidence for the validity of the non-degeneracy statement of Assumption 3.4. The resulting point estimate of $\Delta$ was $-0.14$, with 95% confidence interval $[-0.35, 0.07]$ and estimated variance $\hat{v} = 0.01$. The number of iterations was $N = 10^5$ for each of the U-statistics $\widehat{\Delta}$, $\widehat{\kappa}_c$ for $1 \leq c \leq g+1$, and $\widehat{\Delta^2}$. By Corollary 5.1, two digits of each of these were therefore assured. The two-sided p-value was $p = 0.19$, given by the corresponding upper and lower normal tail probabilities. An R script that allows one to reproduce these results is available on the first author's institutional web page.

ACKNOWLEDGEMENT

MF was supported by the German Science Foundation (DFG-Einzelförderung BO3139/2-2 to ALB). RH was supported by the German Science Foundation (DFG-Einzelförderung BO3139/3-1 to ALB). RDB was supported by the German Science Foundation (DFG-Einzelförderung BO3139/4-1 to ALB).

REFERENCES

ALON, U., BARKAI, N., NOTTERMAN, D. A., GISH, K., YBARRA, S., MACK, D. & LEVINE, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745–6750.

ARLOT, S. & CELISSE, A. (2010). A survey of cross-validation procedures for model selection. Stat. Surveys 4, 40–79.

BENGIO, Y. & GRANDVALET, Y. (2003). No unbiased estimator of the variance of K-fold cross-validation. J. Mach. Learn. Res. 5, 1089–1105.

BOULESTEIX, A.-L., PORZELIUS, C. & DAUMER, M. (2008). Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value. Bioinformatics 24, 1698–1706.

BRAGA-NETO, U. M. & DOUGHERTY, E. R. (2004). Is cross-validation valid for small-sample microarray classification? Bioinformatics 20, 374–380.

BÜHLMANN, P. & VAN DE GEER, S. (2011). Statistics for High-Dimensional Data. Springer Series in Statistics.

CELISSE, A. & MARY-HUARD, T. (2012). Exact cross-validation for kNN: application to passive and active learning in classification. J. Soc. Fr. Stat. 152, 83–97.

DIETTERICH, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895–1923.

DOUGHERTY, E. R., ZOLLANVARI, A. & BRAGA-NETO, U. M. (2011). The illusion of distribution-free small-sample classification in genomics. Curr. Genomics 12, 333.

FRIEDMAN, J., HASTIE, T. & TIBSHIRANI, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22.

HOEFFDING, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19, 293–325.

HOEFFDING, W. (1963). Probability inequalities for sums of bounded random variables. J. Am. Statist. Assoc. 58, 13–30.

JIANG, B., ZHANG, X. & CAI, T. (2008). Estimating the confidence interval for prediction errors of support vector machine classifiers. J. Mach. Learn. Res. 9, 521–540.

JIANG, W., VARMA, S. & SIMON, R. (2008). Calculating confidence intervals for prediction error in microarray classification using resampling. Stat. Appl. Genet. Molec. Biol. 7.

KOHAVI, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence 14, 1137–1145.

LEE, J. (1990). U-Statistics: Theory and Practice. CRC Press.

NADEAU, C. & BENGIO, Y. (2003). Inference for the generalization error. Machine Learning 52, 239–281.

PEEL, T., ANTHOINE, S. & RALAIVOLA, L. (2010). Empirical Bernstein inequalities for U-statistics. Adv. Neural Inf. Process. Syst. 23, 1903–1911.

SHAO, J. (1993). Linear model selection by cross-validation. J. Am. Statist. Assoc. 88, 486–494.

VAN DE WIEL, M., BERKHOF, J. & VAN WIERINGEN, W. (2009). Testing the prediction error difference between two predictors. Biostatistics 10, 550–560.

INSTITUT FÜR MEDIZINISCHE INFORMATIONSVERARBEITUNG, BIOMETRIE UND EPIDEMIOLOGIE, LUDWIG-MAXIMILIANS-UNIVERSITÄT MÜNCHEN, MARCHIONINISTR. 15, 81377 MÜNCHEN, GERMANY

E-mail address: {fuchs,hornung,debin,boulesteix}@ibe.med.uni-muenchen.de