
A U-statistic estimator for the variance of resampling-based error estimators



arXiv:1310.8203v2 [math.ST] 18 Dec 2013
ABSTRACT. We revisit resampling procedures for error estimation in binary classification in terms of U-statistics. In particular, we exploit the fact that the error rate estimator involving all learning-testing splits is a U-statistic. Therefore, several standard theorems on properties of U-statistics apply. In particular, it has minimal variance among all unbiased estimators and is asymptotically normally distributed. Moreover, there is an unbiased estimator for this minimal variance if the total sample size is at least the double learning set size plus two. In this case, we exhibit such an estimator which is another U-statistic. It enjoys, again, various optimality properties and yields an asymptotically exact hypothesis test of the equality of error rates when two learning algorithms are compared. Our statements apply to any deterministic learning algorithms under weak non-degeneracy assumptions. In an application to tuning parameter choice in lasso regression on a gene expression data set, the test does not reject the null hypothesis of equal rates between two different parameters.

KEYWORDS. Unbiased estimator; penalized regression model; U-statistic; cross-validation; machine learning.
The goal of supervised statistical learning is to develop prediction rules taking the val-
ues of predictor variables as input and returning a predicted value of the response variable.
A prediction rule is typically learnt by applying a learning algorithm M to a so-called learning data set. A typical example in biomedical research is the prediction of patient outcome
(e.g. relapse/no relapse within five years, tumor class, lymph node status, response to
chemotherapy, etc.) based on bio-markers such as, e.g., gene expression data. The practitioners are usually interested in the accuracy of the prediction rule learnt from their data set
to predict future patients, while methodological researchers rather want to know whether
the learning algorithm is good at learning accurate prediction rules for different data sets
drawn from a distribution of interest. The first perspective is called “conditional” (since
referring to a specific data set) while the latter, which we take in this paper, is denoted as
“unconditional”. Precisely, this paper focuses on the parameter $\Delta$ defined as the difference
between the unconditional errors of two learning algorithms of interest, M and M′, for
binary classification.
If the data set is very large, one can observe independent realizations of estimators of
the unconditional error rates and use them for a paired t-test (see Section 2.3). In practice,
however, huge data sets are rarely available. Prediction errors are thus usually estimated by
resampling procedures consisting of splitting the available data set into learning and test
sets a large number of times and averaging the estimated error over these iterations. The
well-known cross-validation procedure can be seen as a special case of resampling pro-
cedure for error estimation. A detailed overview of the vast literature on cross-validation
would go beyond the scope of this paper. The reader is referred to Arlot & Celisse (2010)
for a comprehensive survey.
Having estimated the error rate, it is typically of interest to test the null hypothesis of
equal error rates between learning algorithms. This requires insight into the estimator’s
variance. Resampling-based error estimators typically have a very large variance, in par-
ticular in the case of small samples or high-dimensional predictor space (Dougherty et al.,
2011). The estimation of this variance has been the focus of a large body of literature, es-
pecially in the machine learning context. A good estimation of the variance of resampling-
based error estimators would, e.g., allow one to derive reliable confidence intervals for the true
error or to construct statistical tests to compare the performance of several learning algorithms. The latter task is of crucial importance in practice, since applied computational
scientists including biostatisticians often have to make a choice within a multitude of differ-
ent learning algorithms whose performance in the considered settings is poorly explored.
In van der Wiel et al. (2009), for each split in repeated sub-sampling the predictions of
the two classifiers are compared by a Wilcoxon test, and the resulting p-values are combined. In Jiang et al. (2008a), the authors show the asymptotic normality of the error rate
estimator in the case of a support vector machine. In Jiang et al. (2008b), a bias-corrected
bootstrap-estimator for the error rate from leave-one-out cross-validation is introduced.
Various estimators of the variance of resampling-based error estimators have been sug-
gested in the literature (Dietterich, 1998; Nadeau & Bengio, 2003), most of them based on
critical simplifying assumptions. As far as cross-validation error estimates are concerned,
Bengio & Grandvalet (2003) show that there exists no unbiased estimator of their variance.
To date, the estimation of the variance of resampling-based error rates in general remains a
challenging issue with no adequate answer yet, both from a theoretical and a practical point
of view. In particular, no exact or even asymptotically exact test procedures for
testing equality of error rates between learning algorithms are available. The present paper
shows that there is an asymptotically exact test for the comparison of learning algorithms
by using and extending results from U-statistics theory.
Despite the large body of literature, there seems to be no explicit treatment of the asymptotic properties of the estimators in general. Our main results are Theorem 3.9, stating
that there is an unbiased estimator of the difference estimator's variance if $n \ge 2g+2$,
where $g$ is the learning set size and $n$ is the sample size, and Theorem 4.1, providing the
central limit theorem for the studentized statistic as it is needed for testing. The use of only
half the sample size for learning has already occurred in the literature, in a roughly similar context and on grounds of intuitive reasoning (Bühlmann & van de Geer, 2011).
Corollary 5.1 gives an explicit bound on the number of iterations necessary to approximate the leave-p-out estimator, i.e. the minimum variance estimator, to an arbitrary given
precision, where $p = n - g$; we show that this minimal variance can be estimated by an
unbiased estimator, namely that from Definition 3.7. It has minimal variance itself, and the
ensuing studentized test in (17) is asymptotically exact. This shows that it is not necessary
to endeavour to determine the distribution of combinations of p-values to test equality of
error rates, as in van der Wiel et al. (2009, Section 2.3).
Section 2 recalls important definitions pertaining to U-statistics and cross-validation
viewed as incomplete U-statistics. We show that the procedure which involves all learning-
testing splits is then a complete U-statistic. In Section 3, we show that U-statistics theory
naturally suggests an unbiased estimator of the variance of this estimated difference of
errors as soon as the sample size criterion is satisfied. In Section 4, we exploit this vari-
ance estimator to derive an asymptotically exact hypothesis test of equality of the true
errors of two learning algorithms M and M′. Section 5 addresses numerical computation
of approximations of the leave-p-out cross-validation estimator, while an illustration of the
variance estimation and the hypothesis test through application to the choice of the penalty
parameter in lasso regression is presented in Section 6.
2.1. Hoeffding's definition. Since the complete error estimator that we will consider from
the next section on is a U-statistic, we start by recalling the definition of U-statistics and
their basic properties. In the following, the reader who is already familiar with the machine
learning context may take $\Psi_0 := \Phi_0$ and $m := g+1$ at first, where $g$ is the learning sample
size and $\Phi_0$ is as defined in (5); however, this is not necessary and we will need other cases
of the definition as well.
Definition 2.1 (U-statistic, Hoeffding (1948)). Let $(Z_i)$, $i = 1, \dots, n$ be independent and
identically distributed $r$-dimensional random vectors with arbitrary distribution. Let $m \le n$, and let $\Psi_0 : \mathbb{R}^{r \times m} \to \mathbb{R}$ be an arbitrary measurable symmetric function of $m$ vector
arguments. Write $\Psi_0(S)$ as above for the well-defined value of $\Psi_0$ at those $Z_i$ with indices
from $S \subset \{1, \dots, n\}$, $|S| = m$. Consider the average of $\Psi_0$ in the maximal design $\mathcal{S}$ of
unordered size-$m$ subsets
(1) $U = U(Z_1, \dots, Z_n) = \sum_{S \in \mathcal{S}} \Psi_0(S) \, \binom{n}{m}^{-1}.$
Any statistic of such a form is called a U-statistic.
The trailing factor is the inverse of the number of summands, the cardinality $|\mathcal{S}| = \binom{n}{m}$. So, a
U-statistic is an unbiased estimator for the associated parameter
(2) $\Theta(P) = \int \cdots \int \Psi_0(z_1, \dots, z_m)\, dP(z_1) \cdots dP(z_m)$
for a probability distribution $P$ on $\mathbb{R}^r$, where $z_1, \dots, z_m$ are $r$-vectors. A parameter of this
form is called a regular parameter. If $m$ is the smallest number such that there exists such
a symmetric function $\Psi_0$ that represents a given parameter $\Theta(P)$ in the form (2), then $m$
is called the degree of $\Psi_0$ or of $\Theta(P)$, and the function $\Psi_0$ is called a kernel of $U$. Furthermore, $U$ has minimal variance among all unbiased estimators over any family of distributions $P$ containing all properly discontinuous distributions. Hoeffding (1948) shows
the asymptotic normality of U-statistics as $n \to \infty$ and that a good number of well-known
statistics such as the sample mean, empirical moments, the Gini coefficient, etc. are subsumed by
the definition.
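As a concrete toy instance of Definition 2.1 (a sketch, not from the paper): with $m = 2$ and the symmetric kernel $\Psi_0(z_1, z_2) = (z_1 - z_2)^2/2$, the U-statistic (1) is the familiar unbiased sample variance:

```python
from itertools import combinations

def u_statistic(z, kernel, m):
    """Average of a symmetric kernel over all unordered size-m subsets, as in (1)."""
    subsets = list(combinations(z, m))
    return sum(kernel(*s) for s in subsets) / len(subsets)

# Kernel of degree m = 2 whose regular parameter is the variance of P.
var_kernel = lambda z1, z2: (z1 - z2) ** 2 / 2

z = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
u = u_statistic(z, var_kernel, 2)
# u equals the unbiased sample variance sum((z_i - mean)^2) / (n - 1)
```

The identity behind this example, $\sum_{i<j} (z_i - z_j)^2 / \binom{n}{2} / 2 = s^2$, also explains why the kernel $(z_1 - z_2)^2/2$ recurs below in Section 3.1.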
It is also possible to associate a U-statistic to a non-symmetric kernel $\Psi$, i.e. to estimate
$E(\Psi)$ by a U-statistic. The cost is to deal with $n!/(n-m)!$ summands, many more than
just the binomial coefficient $n!/(m!(n-m)!)$. The symmetrization, indeed, consists in
grouping all $m!$ summands involving the same unordered index set together. This writes a
U-statistic with non-symmetric kernel $\Psi$ in the form of Hoeffding (1948, Section 4.4) with
symmetric kernel $\Psi_0$ as in Hoeffding (1948, Section 3.3).
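The grouping argument can be checked numerically (a sketch with a hypothetical non-symmetric kernel): averaging $\Psi$ over all $n!/(n-m)!$ ordered $m$-tuples coincides with the U-statistic of the symmetrized kernel $\Psi_0$ over the $\binom{n}{m}$ unordered subsets:

```python
from itertools import combinations, permutations

def u_ordered(z, psi, m):
    """Average of a (possibly non-symmetric) kernel over all ordered m-tuples."""
    tuples = list(permutations(z, m))
    return sum(psi(*t) for t in tuples) / len(tuples)

def u_unordered(z, psi0, m):
    """Average of a symmetric kernel over all unordered size-m subsets."""
    subsets = list(combinations(z, m))
    return sum(psi0(*s) for s in subsets) / len(subsets)

# Hypothetical non-symmetric kernel and its symmetrization (m = 2).
psi = lambda z1, z2: z1 * (z1 - z2)                     # not symmetric
psi0 = lambda z1, z2: (psi(z1, z2) + psi(z2, z1)) / 2   # = (z1 - z2)^2 / 2

z = [1.0, 3.0, 6.0]
assert abs(u_ordered(z, psi, 2) - u_unordered(z, psi0, 2)) < 1e-12
```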
2.2. The difference of true error rates as the parameter of interest. The goal of this
section is to formalize the true error rate and to recall its nature as an expectation. Let $X = \mathbb{R}^{r-1}$, $r \in \mathbb{N}$, be a fixed predictor space, and $Y \subset \mathbb{R}$ be a space of responses. The number $r$
will be as in Hoeffding (1948) and one would usually denote $p = r - 1$. Assume there is an
unknown but fixed probability distribution $P$ on $X \times Y \subset \mathbb{R}^r$ defined on the $\sigma$-algebra of
Lebesgue-measurable sets. We do not require $P$ to be absolutely continuous with respect
to the Lebesgue measure, in order to allow for a discrete marginal distribution in $Y$ as it
occurs in binary classification. The distribution $P$ can be thought of as being supported on
$\mathbb{R}^r$ instead of $X \times Y \subset \mathbb{R}^r$ only, by identifying $P$ with its push-forward image $i(P)$ under
the inclusion $i : X \times Y \to \mathbb{R}^r$. This allows us to apply Hoeffding (1948), which only describes
U-statistics on a Euclidean space $\mathbb{R}^r$, to the definition and investigation of U-statistics on
(products of) $X \times Y$. Let us fix a loss function $L : Y \times Y \to \mathbb{R}$. Typically, $L$ is the
misclassification loss $L(y_1, y_2) = \mathbf{1}_{y_1 \neq y_2}$, but it can be an arbitrary measurable function. Since
we suppose the marginal distribution of $P$ on $Y$ to be discrete, the loss function associated
as done below to a learning algorithm is almost surely bounded. Therefore, all moments
exist, which will be helpful throughout the paper. This is automatic for the misclassification
loss. However, some of the following also works for an unbounded distribution on $Y$. In
that case one would typically consider losses which are not almost surely bounded, such
as, for instance, the residual sum of squares loss $L(y_1, y_2) = (y_1 - y_2)^2$ or the Brier score
in survival analysis. We will not work out the details of unbounded losses.
Say we are interested in the difference of error rates of classifiers learnt on a sample of
size $g$. Typical choices are $g = 4n/5$ (assuming that five divides $n$) for a learning/testing
sample size ratio of 4 : 1, and $g = n - 1$ for leave-one-out cross-validation. We also allow
for $g = 0$ in case we are interested in the performance of classification rules that were
already learnt on different and fixed data. In that case, it is important that the learning
data were different, since otherwise there would be a problematic contradiction between
simultaneously regarding the data as fixed and as being drawn from $P$.
Let $(z_i)_{i=1,\dots,n} = (x_1, y_1, \dots, x_n, y_n)$, where $x_i \in X$ and $y_i \in Y$, and denote by $f_M$
the function that maps $(z_1, \dots, z_g; x_{g+1}) \in (X \times Y)^{\times g} \times X$ to the prediction, an element
of $Y$, by the learning algorithm M, learnt on $z_1, \dots, z_g$ and applied to $x_{g+1}$. We are only
concerned with deterministic learning algorithms M, i.e. ones which do not involve any
random component for classification. We suppose that $f_M$ is symmetric in the first $g$ entries
$z_1, \dots, z_g$, i.e. M treats all learning observations equally, and that $f_M$ is measurable with
respect to the product $\sigma$-algebra. The inclusion $X \times Y \subset \mathbb{R}^r$ defines an inclusion $(X \times Y)^{\times g} \times X \subset \mathbb{R}^{rg+r-1}$ and, in order to be able to apply Hoeffding (1948), we view $f_M$ as a
map on $\mathbb{R}^{rg+r-1}$ by extending it by zero on $\mathbb{R}^{rg+r-1} \setminus (X \times Y)^{\times g} \times X$, which is a null-set
with respect to the push-forward measure $i(P)$. The map $f_M$ is then also measurable on $\mathbb{R}^{rg+r-1}$.
Denote by $\Phi : (X \times Y)^{g+1} \to \mathbb{R}$ the function
(3) $\Phi(z_1, \dots, z_g; z_{g+1}) := L(f_M(z_1, \dots, z_g; x_{g+1}), y_{g+1}) - L(f_{M'}(z_1, \dots, z_g; x_{g+1}), y_{g+1})$
for two learning algorithms M and M′. The value $\Phi(z_1, \dots, z_g; z_{g+1})$ is the empirical difference of error rates between M and M′, learnt on the first $g$ observations $(z_1, \dots, z_g)$ of the
sample $(z_i)$ and evaluated on the single last entry. The semicolon thus visually separates
learning and test sets. The definition of $\Phi$ involves only a single test observation. Anticipating a little, the reason is that $\Phi$ had to be defined with a minimal number of arguments
necessary for (6) below. In the following, we will conveniently consider larger test sample
sizes by applying the mechanism of associating a U-statistic to a kernel.
As noted above, we assume that $\Phi$ is almost surely bounded. This happens, for instance,
for bounded loss functions such as the misclassification loss. Also, we may view $\Phi$ as being
defined on $\mathbb{R}^{r(g+1)}$ instead of $(X \times Y)^{g+1}$ without notational distinction, in the same way
as $f_M$ was extended to $\mathbb{R}^{rg+r-1}$.
The true difference of error rates between M and M′ is the expectation of $\Phi$, taken with
respect to $g+1$ independent realizations of $P$:
(4) $\Delta := E_{P^{g+1}}(\Phi) = \int \cdots \int \Phi(z_1, \dots, z_g; z_{g+1})\, dP(z_1) \cdots dP(z_{g+1}),$
where both learning and test data are random. The existence of the expectation follows
from measurability and from the boundedness assumption. The quantity $\Delta$ is the parameter
of main interest. Also, we consider the symmetric function of $g+1$ arguments
(5) $\Phi_0(z_1, \dots, z_{g+1}) := \frac{1}{g+1} \sum_{j=1}^{g+1} \Phi(z_{j+1}, \dots, z_{j+g};\, z_j),$
indices understood modulo $g+1$, so that
(6) $\Delta = E(\Phi) = E(\Phi_0).$
For the particular non-symmetric kernel $\Psi = \Phi$ introduced in (3), the symmetrization $\Psi_0 = \Phi_0$ of (5) can be written involving only $m = g+1$ summands instead of $m!$, in other words
only cyclic permutations instead of all permutations, due to the assumption that learning is
symmetric. In practice, it is not advantageous to compute $\Phi_0$ directly because a learning
procedure should be used on more than just one test observation for numerical efficiency
(see Section 5); however, it is very convenient to consider $\Phi_0$ for ease of presentation.
Remark 2.2. In case one is interested in estimating $e_M$ only, instead of a difference $\Delta = e_M - e_{M'}$, one can set the second summand of (3) identically to zero. We will not go into
the details.
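As a toy illustration (not from the paper; the two learning algorithms, a majority-class rule standing in for M and a one-nearest-neighbour rule standing in for M′, are hypothetical stand-ins), the kernel $\Phi$ of (3) and its cyclic symmetrization $\Phi_0$ of (5) might be sketched as:

```python
def f_majority(learn, x):
    """Hypothetical algorithm M: predict the majority class of the learning set."""
    ys = [y for (_, y) in learn]
    return max(set(ys), key=ys.count)

def f_nearest(learn, x):
    """Hypothetical algorithm M': predict the class of the closest learning point."""
    return min(learn, key=lambda z: abs(z[0] - x))[1]

def phi(z):
    """Kernel (3): difference of misclassification losses of M and M',
    learning on the first g observations and testing on the single last one."""
    learn, (x, y) = z[:-1], z[-1]
    return int(f_majority(learn, x) != y) - int(f_nearest(learn, x) != y)

def phi0(z):
    """Symmetrization (5): average of phi over the g + 1 cyclic shifts."""
    return sum(phi(z[j:] + z[:j]) for j in range(len(z))) / len(z)

# g + 1 = 4 observations (x, y).
z = [(0.0, 0), (1.0, 0), (3.0, 1), (7.0, 1)]
```

Because both stand-in algorithms treat the learning observations symmetrically, `phi0(z)` takes the same value for every reordering of `z`, which is exactly what makes the cyclic sum a full symmetrization.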
2.3. Tests of the true error rate. In this section, let us recall the test problem of interest.
We want to test the null hypothesis $H_0 : E(\Phi) = 0$ against the alternative $H_1 : E(\Phi) \neq 0$.
The former is usually called the unconditional null hypothesis (Braga-Neto & Dougherty).
Remark 2.3. There is also a conditional null hypothesis where the classification rule is
supposed to be given, for instance learnt on fixed independent data, and the expectation is
taken only with respect to the test set. However, the learning data are usually also random
and may even be modelled to be from $P$ as well. In this case, the conditional null depends
on random data, leading to severe difficulties in the interpretation of the type one error. For
this reason, in this paper we will only consider the unconditional error rate. However,
setting $g = 0$ and plugging in a ready-made classification rule for $\Phi$, regardless of the data
it was learnt on, leads to a sort of conditional null hypothesis. In this case, the true error
becomes a random variable of the learning data, and the latter must not be from the sample
$(z_1, \dots, z_n)$. We will not go into details of conditional testing or of the case $g = 0$.
The form of $H_0$ suggests a t-test. However, the number of independent realizations of
$\Phi(z_1, \dots, z_{g+1})$ is only $n/(g+1)$, since it is to be computed with respect to $P^{g+1}$.
Therefore, a correct t-test would be severely underpowered, and cross-validation procedures are usually preferred.
2.4. Cross-validation. Let us now show how cross-validation procedures fit into the framework described above. In a cross-validation procedure, dependent realizations of $\Phi(z_1, \dots, z_{g+1})$
are considered. More precisely, for every ordered subset $T = (i_1, \dots, i_g; i_{g+1})$ of $\{1, \dots, n\}$,
(7) $\widetilde{\Delta}(T) := \Phi(z_{i_1}, \dots, z_{i_g};\, z_{i_{g+1}})$
is an estimator of $\Delta$, where we visually separate learning and test sets again. We view
$\widetilde{\Delta}(T)$ as an estimator of the difference of error rates of classifiers learnt on samples of size
$g$ instead of $n$, in contrast to differing usage in the literature. Thus, $\widetilde{\Delta}(T)$ is unbiased. Of
course, the word “ordered subset” refers to the order $1 < \cdots < g+1$; it is not imposed
that $i_1 < \cdots < i_{g+1}$. Similarly, if $S = \{i_1, \dots, i_{g+1}\}$ is an unordered subset of size $g+1$
of $\{1, \dots, n\}$, the value $\Phi_0(z_{i_1}, \dots, z_{i_{g+1}})$, using the symmetric $\Phi_0$ instead of $\Phi$, does not
depend on the order of $S$. Therefore, we can unambiguously extend definition (7) to a
function $\widetilde{\Delta}$ also on the collection of unordered subsets $S$ by setting $\widetilde{\Delta}(S) := \Phi_0(S)$. This
is an unbiased estimator of $\Delta$. Also, let $\mathcal{T}$ be a collection of ordered subsets $T$ as above.
Then, let
(8) $\widetilde{\Delta}(\mathcal{T}) := |\mathcal{T}|^{-1} \sum_{T \in \mathcal{T}} \widetilde{\Delta}(T)$
be the average of all values of $\widetilde{\Delta}(T)$ over $\mathcal{T}$, and similarly $\widetilde{\Delta}(\mathcal{S})$ for a collection $\mathcal{S}$
of unordered subsets $S$ as above the average of all values $\widetilde{\Delta}(S)$ involving the symmetric
function $\Phi_0$. Equation (8) may involve each learning set multiple times because each
observation of a test sample can then contribute a summand to (8). In other contexts the
mean error rate over the entire test sample is counted as only one occurrence of the learning
set. For any such collections $\mathcal{T}$ or $\mathcal{S}$, the estimators $\widetilde{\Delta}(\mathcal{T})$ and $\widetilde{\Delta}(\mathcal{S})$ are unbiased for $\Delta$.
As soon as $\mathcal{T}$ contains, together with an ordered subset $T = (i_1, \dots, i_g; i_{g+1})$, all its cyclic
permutations $(i_2, \dots, i_{g+1}; i_1)$, $(i_3, \dots, i_{g+1}, i_1; i_2)$ and so on, we have $\widetilde{\Delta}(\mathcal{T}) = \widetilde{\Delta}(\mathcal{S})$, where
the collection $\mathcal{S}$ is obtained from the collection $\mathcal{T}$ by forgetting the order (and the multiple
entries with the same order coming from the cyclic permutations).
Now, ordinary K-fold cross-validation can be incorporated in this framework as
follows. Suppose that $K$ is such that $K(n-g) = n$, possibly after disregarding a few observations in order to assure divisibility of $n$ by $n-g$. Therefore, $g \ge n/2$. The extreme cases
are $g = n/2$ for $K = 2$ and $g = n-1$ for $K = n$. Let $\mathcal{T}_{CV}$ be a collection of ordered subsets
of the form
(9) $T = \big(1, \dots, k(n-g),\ (k+1)(n-g)+1, \dots, n;\ t\big),$
where $k = 0, \dots, K-1$ enumerates the learning blocks, the notation is to be read in such
a way that for $k = 0$ the first stride is empty, so the first entry is $n-g+1$, and for $k = K-1$ the
second stride is empty, so the last learning entry is $k(n-g) = g$, and $t \in \{k(n-g)+1, \dots, (k+1)(n-g)\}$ enumerates all test observations distinct from the learning
block. Thus, $T$ consists of one or two learning strides whose indices are contiguous and
whose sizes add up to $g$, together with a single test observation index distinct from any
learning observation index. Then $\widetilde{\Delta}(\mathcal{T}_{CV})$ recovers the ordinary cross-validation estimator
of $\Delta$. In practice, one may also compute $\widetilde{\Delta}(\mathcal{T}_{CV})$ from a permutation of the data, but this
does not influence the formal description because $P^n$ is permutation-invariant.
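The design of (9) can be sketched as follows (a hypothetical helper, assuming $K$ divides $n$ so that the block size is $n - g = n/K$):

```python
def cv_design(n, K):
    """All ordered subsets T of the form (9) for K-fold cross-validation:
    the learning indices are the g = n - n/K indices outside test block k,
    and the single test index t runs over block k."""
    assert n % K == 0
    b = n // K                              # block size n - g
    design = []
    for k in range(K):                      # k enumerates the blocks
        block = range(k * b + 1, (k + 1) * b + 1)
        learn = tuple(i for i in range(1, n + 1) if i not in block)
        for t in block:                     # one test observation at a time
            design.append((learn, t))
    return design

T_cv = cv_design(6, 3)                      # n = 6, K = 3, hence g = 4
```

Each of the $n$ elements of `T_cv` pairs a contiguous (possibly wrapped) learning stride of size $g$ with one test index, matching the description after (9).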
Definition 2.4. We will in general refer to estimators of the form $\widetilde{\Delta}(\mathcal{T})$ or $\widetilde{\Delta}(\mathcal{S})$, as defined
by (8), as cross-validation-like procedures, irrespective of the structure of $\mathcal{T}$ or $\mathcal{S}$.
It was shown in Bengio & Grandvalet (2003) that there is no unbiased estimator of the
variance $V_{P^n}(\widetilde{\Delta}(\mathcal{T}_{CV}))$ for any cross-validation procedure $\mathcal{T}_{CV}$, i.e. any divisor $K$ of $n$.
It seems plausible from this tedious description of cross-validation that such a particular
design $\mathcal{T}_{CV}$, consisting merely of sets of the special form (9), does not lead to a globally
small variance of $\widetilde{\Delta}(\mathcal{T}_{CV})$ among all possible designs $\mathcal{T}$ with fixed learning set size $g$.
This variance is minimal for the cross-validation-like procedure $\mathcal{T}_{max}$ consisting of all size-$(g+1)$ subsets. We will expose the cases where there is an unbiased variance estimator
of it, in contrast to the cross-validation case. Let us call this $\mathcal{T}_{max}$ the maximal design.
Another immediate advantage of it over an incomplete one is the fact that the need for
a balanced data set, i.e. one in which the class labels are equally frequent, and/or for
balanced blocks falls away. The only case of $g$ and $K$ such that $\mathcal{T}_{CV} = \mathcal{T}_{max}$ is the leave-one-out case $g = n-1$, respectively $K = n$.
Similarly, one can distinguish those cases of $g$, respectively $K$, such that the associated
design $\mathcal{T}$ contains along with an ordered subset $T$ all its cyclic permutations. In such a
case, $\widetilde{\Delta}(\mathcal{T}) = \widetilde{\Delta}(\mathcal{S})$ for the design $\mathcal{S}$ corresponding to $\mathcal{T}$. Among the cross-validation
procedures, only the leave-one-out case $g = n-1$, respectively $K = n$, produces this situation. However, among the cross-validation-like procedures, this can happen for any $g$. For
instance, it holds for the maximal design for all $0 \le g \le n-1$. This is important to keep in
mind for numerical implementation.
2.5. The full cross-validation-like estimator of $\Delta$ is a U-statistic. In this section, we
show that the cross-validation-like procedure with maximal design, where all size-$g$ subsets
of the sample are used for learning, is a U-statistic and therefore has least variance among
all cross-validation-like procedures. It seems that this fact has not yet been described
in the literature. Among the immediate consequences of interpreting this procedure as a
U-statistic will be asymptotic normality, the first case of Theorem 4.1. The parameter of
interest $\Theta = \Delta$ is a regular parameter because of (4); this equation also shows that its degree
is at most $g+1$.
Assumption 2.5. The degree of $\Delta$ is exactly $g+1$.
This states that the true error rate cannot be computed from learning samples of smaller
size than $g$ for all (reasonable) distributions $P$. While it is not automatic, it seems to be
violated only in irrelevant artificial counterexamples, such as for instance one of the form
$\Phi(z_1, z_2, z_3) = \Phi(z_2, z_3)$, where the learning step only makes use of a part of the learning
set observations, and in similar cases. So, the assumption is natural.
Let $\mathcal{S}_{max}$ and $\mathcal{T}_{max}$ be the maximal designs of unordered and ordered subsets, respectively, as introduced above. The corresponding error rate estimator is then the U-statistic
associated to the particular kernel $\Psi = \Phi$ and $\Psi_0 = \Phi_0$, respectively. We define
(10) $\widehat{\Delta} := U(\Phi_0) = \widetilde{\Delta}(\mathcal{S}_{max}) = \widetilde{\Delta}(\mathcal{T}_{max})$
as the associated U-statistic as in Definition 2.1, i.e. the one defined by the symmetric kernel $\Phi_0$. It follows immediately from Hoeffding (1948) that it has minimal variance among
all unbiased estimators of $\Delta$. In particular, it has strictly smaller variance than all cross-validation procedures for $2 \le g \le n-2$, and is equal to the cross-validation estimator in
the leave-one-out case $g = n-1$. Lee (1990, Section 4.3, Theorems 1 and 4) describes
quantitatively the variance decrease of $\widehat{\Delta}$ with respect to $\widetilde{\Delta}$. These theorems treat the case
of a fixed $\mathcal{S}$ and an $\mathcal{S}$ consisting of random subsets, respectively. The statistic $\widehat{\Delta}$ coincides with what is called complete cross-validation in Kohavi (1995), as well as with
complete repeated sub-sampling as considered in Boulesteix et al. (2008), or leave-p-out
cross-validation in Shao (1993) and Arlot & Celisse (2010), where $p = n - g$.
In practice, the definition of $\widehat{\Delta}$ involves too many summands for computation, but it can
easily be approximated to arbitrary precision using an $\mathcal{S}$ of random subsets; see Section 5.
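The random-subset approximation can be sketched as follows (a minimal sketch, not the paper's implementation; a hypothetical symmetric kernel of degree 2 stands in for $\Phi_0$):

```python
import random
from itertools import combinations

def complete_u(z, kernel0, m):
    """Complete U-statistic: average over all C(n, m) unordered subsets."""
    subs = list(combinations(range(len(z)), m))
    return sum(kernel0([z[i] for i in s]) for s in subs) / len(subs)

def approx_u(z, kernel0, m, B, seed=0):
    """Approximate the complete U-statistic by averaging the kernel over
    B random size-m subsets; each summand is unbiased for the target."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(B):
        s = rng.sample(range(len(z)), m)
        total += kernel0([z[i] for i in s])
    return total / B

# Hypothetical symmetric kernel of degree m = 2 estimating the variance of P.
k0 = lambda zz: (zz[0] - zz[1]) ** 2 / 2
rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(12)]
```

With a few thousand random subsets, `approx_u` is typically within Monte Carlo error of `complete_u`; Corollary 5.1 quantifies the number of iterations needed for a prescribed precision.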
3.1. Variances are regular parameters. The theory of U-statistics comes to full power as
soon as not only the original regular parameter $\Delta$ is estimated optimally by a U-statistic $\widehat{\Delta}$,
but also the variance $V(\widehat{\Delta})$ of this U-statistic itself is exhibited as another regular parameter,
this time depending not only on $\Phi_0$ but also on $n$. Therefore, we are in a position to estimate
$V(\widehat{\Delta})$ by a U-statistic as well.
In the following proposition, we outline formally that variances and covariances are
regular parameters in general, without determining the degree optimally. Thus, the full
power of U-statistics can be used to estimate them. We then pin down the degree in Proposition 3.2.
Proposition 3.1. Let $f(z_1, \dots, z_k)$ be a function of $k$ realizations of independent identically
distributed random variables $Z_i \sim P$ with existing variance $V_{P^k}(f) < \infty$. Then the variance $V_{P^k}(f)$ is a regular parameter of degree at most $2k$. More generally, the covariance
between two such functions $f$ and $g$, as soon as it exists, is a regular parameter of degree
at most $2k$.
Proof. Both $V(f)$ and $\mathrm{cov}(f, g)$ are, by definition, polynomials of integrals with respect
to $P$. In order to show that they are regular parameters, we have to rewrite each one as a
single integral instead. This is accomplished by
(11) $V_{P^k}(f) = E(f^2) - E(f)^2 = \int \cdots \int \big( f(z_1, \dots, z_k)^2 - f(z_1, \dots, z_k)\, f(z_{k+1}, \dots, z_{2k}) \big)\, dP(z_1) \cdots dP(z_{2k})$
and an almost analogous formula for the covariance $\mathrm{cov}_{P^k}(f, g)$.
The integrand is not unique. It was chosen in such a way as to resemble the symmetric
kernel $(z_1 - z_2)^2/2$ of the variance of $P$ itself, i.e. the case $r = 1$, $f(z_1) = z_1$. Furthermore,
the degree of $V(f)$, i.e. the minimal length of an integrand that accomplishes this, can be
much smaller than $2k$ and depends on $f$. Also, the integrand of (11) is not symmetric in
general and remains to be symmetrized.
Let us now investigate the case where $f$ is a U-statistic associated to a symmetric kernel
$\Phi_0$. Caution has to be taken because the regular parameter now depends on $n$, in sharp
contrast to the U-statistic $\widehat{\Delta}$ itself. For the case $f = \widehat{\Delta}$, we have $k = n$, so our knowledge
attained so far on the degree of the kernel of $V(\widehat{\Delta})$ is that it is at most $2n$. However, it
is possible to obtain better insight into the degree of the variance. It will turn out that
the variance is a linear combination of regular parameters, each of whose degrees does not
depend on $n$; only the coefficients of the linear combination depend on $n$. This is the
content of the following proposition, which presents in short form results of Hoeffding
(1948, Section 5) as well as immediate consequences.
In the following, we will consider a general underlying U-statistic $U$ which estimates
an unknown parameter $\Theta$, and develop the theory of its variance as it is needed for its
estimation. From Section 3.2 on, we will pay particular attention to the case where $U$ is
associated to the kernel $\Phi_0$ defined by (5), thus $\Theta = \Delta$ and $U = \widehat{\Delta}$.
Proposition 3.2. Let $U$ be the U-statistic associated to a bounded symmetric kernel $\Phi_0$ of
degree $m$ and to a total sample size $n$. Denote $\Theta = E(\Phi_0)$. Then the variance of $U$ is a
regular parameter of degree at most $2m$. Furthermore, it splits as a sum
(12) $V(U) = \sum_{c=1}^{m} p_c\, \xi_c - (1 - p_0)\, \Theta^2,$
where $p_c$ is the mass function at $c$ of the hyper-geometric distribution $H(n, m, m)$, and all
$\xi_c$ are regular parameters satisfying
(13) $\xi_c = \int \cdots \int \Phi_0(z_1, \dots, z_m)\, \Phi_0(z_1, \dots, z_c, z_{m+1}, \dots, z_{2m-c})\, dP(z_1) \cdots dP(z_{2m-c}).$
In particular, $\xi_c$ is a regular parameter of degree at most $2m - c$. Furthermore, since
(14) $\Theta^2 = \int \cdots \int \Phi_0(z_1, \dots, z_m)\, \Phi_0(z_{m+1}, \dots, z_{2m})\, dP(z_1) \cdots dP(z_{2m}),$
$\Theta^2$ is a regular parameter of degree at most $2m$.
Proof. Direct computation shows that the right-hand side of (13) coincides with what is
called $E(\Phi_c^2(X_1, \dots, X_c))$ in Hoeffding (1948, Section 5) for all $1 \le c \le m$. This step involves the symmetry of the kernel $\Phi_0$ and careful renaming of the variables. Hoeffding already supposes a symmetric kernel, which he calls $\Phi$. The quantities $\zeta_c$ of Hoeffding (1948)
– which are called $\sigma_c^2$ in Lee (1990) – are thus related to our $\xi_c$ by means of the equation
$\zeta_c = \xi_c - \Theta^2$, as follows from Hoeffding (1948, formula 5.10). From $V(U) = \sum_{c=1}^{m} p_c\, \zeta_c$
(Hoeffding, 1948, 5.13) we thus deduce $V(U) = \sum_{c=1}^{m} p_c\, (\xi_c - \Theta^2) = \sum_{c=1}^{m} p_c\, \xi_c - (1 - p_0)\, \Theta^2$ because $\sum_{c=1}^{m} p_c = 1 - p_0$.
The fact that linear combinations of regular parameters are regular parameters (Hoeffding,
1948, page 295) completes the proof.
Proposition 3.2 achieves the desired simplification: the degree of $V(U)$ for a U-statistic
$U$ of degree $m$ is shown to be at most $2m$ instead of $2n$, and the dependence of $V(U)$ on $n$
is now expressed solely by means of the hyper-geometric mass function, whereas the $\xi_c$ and
$E(U)^2 = \Theta^2$ do not depend on $n$.
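The decomposition (12) can be checked exactly in a toy case (not from the paper; here $p_c$ denotes the hyper-geometric mass and $\xi_c$ the regular parameters of (13)): for the kernel $\Psi_0(z_1, z_2) = (z_1 + z_2)/2$ of degree $m = 2$, the U-statistic is the sample mean, so (12) must reproduce $V(U) = \sigma^2/n$; taking $P$ with mean $\mu = 3$ and variance $\sigma^2 = 1$, the $\xi_c$ are available in closed form:

```python
from fractions import Fraction
from math import comb

def hyper_mass(n, m, c):
    """Mass at c of the hyper-geometric distribution H(n, m, m)."""
    return Fraction(comb(m, c) * comb(n - m, m - c), comb(n, m))

m = 2
theta2 = Fraction(9)                       # Theta^2 = mu^2
xi = {1: Fraction(1, 4) + theta2,          # xi_1 = E[Phi_1(Z)^2]    = sigma^2/4 + mu^2
      2: Fraction(1, 2) + theta2}          # xi_2 = E[Psi0(Z1,Z2)^2] = sigma^2/2 + mu^2

for n in range(4, 30):
    v = sum(hyper_mass(n, m, c) * xi[c] for c in (1, 2)) \
        - (1 - hyper_mass(n, m, 0)) * theta2
    assert v == Fraction(1, n)             # decomposition reproduces sigma^2 / n
```

Exact rational arithmetic makes the check term-by-term: the $n$-dependence sits entirely in the hyper-geometric weights, as the proposition states.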
Remark 3.3. Direct computation yields $E(U^2) = \sum_{c=1}^{m} p_c\, \xi_c + p_0\, \Theta^2$, making use of the
fact that the kernel is symmetrized. This, together with the usual decomposition $V(U) = E(U^2) - E(U)^2 = E(U^2) - \Theta^2$, also proves (12) and shows that the degree of $E(U^2)$ is at
most $2m$. It is natural to assume that its degree is exactly $2m$, in analogy to Assumptions 2.5
above and 3.4 below. In contrast, the advantage of decomposition (12) is that the first
summand only involves the $\xi_c$, which all have smaller degree than $2m$, namely $2m - c$.
Therefore, we prefer decomposition (12) over the usual decomposition and work with and
estimate the quantities $\xi_c$ rather than Hoeffding's $\zeta_c$, which all have degree $2m$ (see also
Remark 3.8 below).
3.2. Definition of the U-statistic for the variance. In order to estimate the variance of
the U-statistic $U$ by another U-statistic, we need the following.
Assumption 3.4. In the general situation of Proposition 3.2, the statistic $U$ is non-degenerate.
In the particular case $U = \widehat{\Delta}$ where $\Theta = \Delta$, this states that $\xi_c \neq \Delta^2$ for $1 \le c \le g+1$.
Furthermore, we assume in the situation of Proposition 3.2 that all upper bounds for the
degrees thus obtained are optimal. In the particular case $U = \widehat{\Delta}$, this means that the degree
of $\xi_c$ is $2m - c = 2g + 2 - c$ and that of $\Theta^2 = \Delta^2$ is $2m = 2g + 2$.
The non-degeneracy can be numerically checked for plausibility, unlike Assumption 2.5
and the degree optimality, which both state that the regular parameters cannot be written with
a smaller number of integrals. There seems to be no reason why a kernel of the form (5)
with a non-trivial classifier should not satisfy them. The first part of Assumption 3.4 is
needed for the central limit theorem 4.1, the second one for Theorem 3.9.
Proposition 3.2 motivates the following definition.
Definition 3.5. In the general situation of Proposition 3.2 and under Assumption 3.4, the
statistics $\widehat{\xi}_c$ for $1 \le c \le m$ of degree $2m - c$ and the statistic $\widehat{\Theta^2}$ of degree $2m$ are defined
as the U-statistics associated to the symmetrized versions of the kernels which are the
integrands in (13) and (14), respectively.
Estimating $\Theta^2$ by $U^2$ instead would be biased and thus would not fit in our framework.
Remark 3.6. It would not be suitable to simply estimate $\Theta^2$ by zero in view of $H_0 : \Theta = 0$.
The first reason is that failure to subtract $(1 - p_0)\, \widehat{\Theta^2}$ from the variance estimator (15)
below would overestimate the variance $V(U)$ under $H_1$, leading to a severe loss of power.
The second is that it would conflict with Hoeffding's setup. In fact, under the null hypothesis $\Theta = 0$, the degree of $\Theta$ and that of $\Theta^2$ would be trivially zero if we were willing to
restrict Hoeffding's class $\mathcal{P}$ of distributions to only ones obeying the null; however, the
least-variance optimality property of a U-statistic relies on $\mathcal{P}$ encompassing all properly
discontinuous distribution functions, not only null ones. Likewise, the degree has to be
defined for a global class of null and alternative together. This is akin to a classical one-way analysis of variance statistic, where estimating the variance within and between groups
separately greatly increases the power.
We can now define the variance estimator of a U-statistic as a U-statistic itself.
Definition 3.7. In the general situation and notation of Proposition 3.2 and under Assumption 3.4, we define an estimator, abbreviated $\hat w$, for the variance of the U-statistic $U$ as the U-statistic associated to the linear combination, as in (12), of the kernels of the $\hat\Theta_c$ and of $\widehat{\Theta^2}$ given by (13) and (14).
After a short and straightforward computation, the definition can be re-stated alternatively in more explicit form: the single U-statistic $\hat w$ splits as a sum

(15) $\hat w := \sum_{c=1}^{m}\gamma_c\,\hat\Theta_c - (1-\gamma_0)\,\widehat{\Theta^2}$, where $\gamma_c := \binom{m}{c}\binom{n-m}{m-c}\big/\binom{n}{m}$,

of U-statistics of varying degrees. In the particular case $U=\hat e$ where $\Theta=e$, this defines an estimator for $V(\hat e)$, which will be abbreviated by $\hat v$.
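The coefficients appearing in the split (15) are those of Hoeffding's (1948) variance formula. A quick numerical check (the values of $n$ and $g$ are those of the illustration in Section 6) that they sum to one over $c=0,\dots,m$ by Vandermonde's identity, so that the weight of the $\widehat{\Theta^2}$ term is indeed one minus the weight at $c=0$:

```python
from math import comb

def gamma(n, m, c):
    # Weight of the c-th term in Hoeffding's (1948) variance formula
    # for a U-statistic of degree m computed from n observations.
    return comb(m, c) * comb(n - m, m - c) / comb(n, m)

n, g = 62, 26      # sample size and learning set size as in Section 6
m = g + 1          # degree of the error rate estimator

# Vandermonde's identity: the weights over c = 0..m sum to one.
total = sum(gamma(n, m, c) for c in range(m + 1))
print(total)       # 1.0 up to rounding
```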
The estimator $\hat w$ enjoys unbiasedness and optimality properties analogous to those of $\hat e$. In particular, this applies to $\hat v$.
Remark 3.8. In the latter case $U=\hat e$, we have $m=g+1$, so the degree of $\hat v$ is $2g+2$, and that of $\hat\Theta_c$ was given in the degree optimality statement of Assumption 3.4. The reason for splitting $\hat w$ into several U-statistics of varying degree is numerical efficiency: Hoeffding (1948) suggests to estimate the differences $\Theta_c-\Theta^2$. However, all of these parameters have degree $2m$. Instead, it is of course advisable to estimate the $\Theta_c$, which have smaller degree $2m-c$. Then, $\Theta^2$ needs to be estimated only once. This remark is the empirical analogue of Remark 3.3.
3.3. Existence criterion and order of consistency. We are now in a position to investigate the estimator of $V(\hat e)$. In principle, this section applies to the general situation of Proposition 3.2, but in order to keep the presentation clear we will focus on the interesting case $\Theta=e$, $U=\hat e$ for the rest of the paper. Therefore, we will write $\widehat{e^2}$ for the statistic $\widehat{\Theta^2}$, whereas we will not introduce a special notation for the statistics $\hat\Theta_c$ in that case.
In the consistency statement of Theorem 3.9 below, the true parameter $V(\hat e)$ depends on $n$, unlike in a typical consistency statement. In principle, the sample size used for this estimation does not need to be the same $n$ again, but can in fact be any number $n'\ge 2g+2$. However, in practice the same sample is used to estimate $e$ as well as $V(\hat e)$, so we restrict our attention to the diagonal case $n'=n$ for simplicity. This is analogous to the ordinary one-sample t-test statistic, where both the numerator, the sample mean, and its standard deviation, the denominator, are estimated simultaneously on the same sample, so with the same $n$. However, in our case, no factor $n^{1/2}$ cancels out between the two.
Theorem 3.9. If $n\ge 2g+2$, the estimator $\hat v$ of $V(\hat e)$ has least variance among all unbiased estimators of $V(\hat e)$ over any family of distributions $\mathcal{P}$ containing all purely discontinuous distribution functions. Furthermore, $\hat v$ is strongly consistent in the sense that $n^{d/2}(\hat v - V(\hat e))\to 0$ almost surely for any $0\le d\le 2$.
We do not claim to have exhibited the optimal order of consistency.
Proof (of Theorem 3.9). The unbiasedness of $\hat v$ as well as its least-variance optimality follow from the general properties of U-statistics. Only the consistency statement remains to be shown. For $0\le d<1$, Hoeffding (1963, Equation 5.7), applied to the U-statistic $\hat v$ whose kernel is bounded between 0 and 1, yields the quantitative version

(16) $\mathrm{pr}\{\,n^{d/2}\,|\hat v - V(\hat e)| \ge \epsilon\,\} \le 2\exp(-2\lfloor n/(2g+2)\rfloor\,\epsilon^2 n^{-d})$

for any $\epsilon>0$, which has to be applied with care because the degree of the U-statistic varies with $n$. This only applies to $0\le d<1$ and only shows weak consistency. In the following, we make use of the fact that U-statistics are strongly consistent, meaning that they satisfy the strong law of large numbers if the kernel is absolutely integrable, for instance bounded. This was first shown in an unpublished paper of 1961 by Hoeffding; a complete proof is given in Lee (1990, Section 3.4.2, Theorem 3). For all cases $0\le d\le 2$, let us first show that $n\hat v$ almost surely tends to $(g+1)^2(\Theta_1-\Theta^2)$; here $\gamma_c = \binom{m}{c}\binom{n-m}{m-c}/\binom{n}{m}$ denotes the coefficient of $\hat\Theta_c$ in the sum (15). For $c\ge 2$, the summands $n\gamma_c\hat\Theta_c$ of $n\hat v$ almost surely tend to zero, because $n\gamma_c$ does, and $\hat\Theta_c$ is strongly consistent, so the sequence $(\hat\Theta_c)_n$ is almost surely bounded, for every $c$. For $c=1$, the summand $n\gamma_1\hat\Theta_1$ almost surely tends to $(g+1)^2\Theta_1$, because $n\gamma_1\to(g+1)^2$, and $\hat\Theta_1$ is strongly consistent. Similarly, the summand $n(1-\gamma_0)\widehat{e^2}$ almost surely tends to $(g+1)^2\Theta^2$, because $n(1-\gamma_0)\to(g+1)^2$, and $\widehat{e^2}$ is strongly consistent. The statement now follows from the fact that $\lim_n nV(\hat e) = (g+1)^2(\Theta_1-\Theta^2)$ (Hoeffding, 1948, 5.23).
Under $H_0$, there are unbiased estimators already for smaller $m$, since then $\Theta^2$ does not need to be estimated. However, as noted in Remark 3.6, the optimality property cannot be shown in this case.
4.1. Central limit theorem. The convergence of $\hat e$ towards $e$ as $n\to\infty$ is described by the strong law of large numbers, the law of the iterated logarithm and the Berry-Esseen theorem (Lee, 1990, Section 3.4.2, Theorem 3; Section 3.5, Theorem 1; Section 3.3.2, Theorem 1, respectively). In order to show the existence of an asymptotically exact test, we need the following theorem, as it subsumes the unstudentized and the studentized case. It is reminiscent of, and contains as a special case, the statement that the t-distributions tend to $N(0,1)$ as the degrees of freedom tend to infinity.
Theorem 4.1. Let $u(n)$ be one of the following expressions: the asymptotic variance $u(n):=(g+1)^2(\Theta_1-\Theta^2)/n$; the expression $u(n):=(g+1)^2(\hat\Theta_1-\widehat{e^2})/n$, where $\hat\Theta_1$ and $\widehat{e^2}$ are defined by the case $\Theta=e$, $U=\hat e$ in Definition 3.5; or $u(n):=\hat v$ as of Definition 3.7. Then $(\hat e-e)\,u(n)^{-1/2}$ converges in distribution to $N(0,1)$ as $g$ remains fixed and $n\to\infty$.
The occurrence of the factor $(g+1)^2$ is explained by the fact that this is the decay rate of the coefficient $\gamma_1=\binom{m}{1}\binom{n-m}{m-1}/\binom{n}{m}$ of $\hat\Theta_1$ in the variance, in the sense that $\lim_n n\gamma_1=(g+1)^2$. This also explains the asymptotic behaviour of the variance.
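This decay rate can be verified numerically; the following sketch (with an illustrative learning set size $g=5$, an assumption for the example only) shows $n\gamma_1$ approaching $(g+1)^2$:

```python
from math import comb

def n_gamma1(n, m):
    # n times the coefficient of the c = 1 term in Hoeffding's
    # variance formula for a U-statistic of degree m.
    return n * comb(m, 1) * comb(n - m, m - 1) / comb(n, m)

g = 5                  # illustrative learning set size
m = g + 1              # degree of the error rate estimator
for n in (10**2, 10**3, 10**4, 10**6):
    print(n, n_gamma1(n, m))   # tends to (g+1)**2 = 36
```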
The first case of Theorem 4.1 shows approximate normality of the unstudentized statistic itself. There seems to be no statement in the literature giving the precise reason why a cross-validation type estimator is asymptotically normally distributed. This case appears in the asymptotic variance statement of Hoeffding (1948, 5.23). The second case is included for systematic reasons; this expression is the empirical analogue of the first case, but is a biased variance estimator. Finally, the third case covers the unbiased variance estimator elaborated in the present manuscript. The fact that $V(\hat e)/u(n)$ tends to one, shown in the following proof, is not immediate, due to the diagonality property $n'=n$ mentioned above. Likewise, it is not obvious whether this ratio almost surely tends to one.
Proof (of Theorem 4.1). In the first case, this is Hoeffding (1948, Theorem 7.1) and rests on the validity of the first part of Assumption 3.4. In the other cases, the proof proceeds simultaneously. First, the proof of Theorem 3.9 shows that the convergence of $nV(\hat e)$ entails the almost sure convergence not only of $n\hat v$ but in fact of $nu(n)$ for any of the choices of $u(n)$. Thus, $(nu(n))^{-1}$ is almost surely bounded. This statement is licit because the first part of Assumption 3.4 implies $\Theta_1\neq\Theta^2$, so $nu(n)$ converges to a non-zero value and, therefore, there are at most finitely many $n$ such that $u(n)=0$ has positive probability. Consequently, we may multiply $n(u(n)-V(\hat e))$, which converges almost surely to zero, hence also in probability, with $(nu(n))^{-1}$ to show that in each case the ratio $V(\hat e)/u(n)$ tends to one in probability by Slutsky's theorem. By the continuous mapping theorem, $(V(\hat e)/u(n))^{1/2}$ tends to one in probability as well. Therefore, $(\hat e-e)\,u(n)^{-1/2}=(\hat e-e)\,V(\hat e)^{-1/2}\,(V(\hat e)/u(n))^{1/2}$ tends to $N(0,1)$ in distribution by the first case and another application of Slutsky's theorem.
4.2. Asymptotic rejection regions and confidence intervals. So, the two-sided test of $H_0$ with the rejection region

(17) $\{\,|\hat e|\,u(n)^{-1/2}\ge\Phi^{-1}(1-\alpha/2)\,\}$

has asymptotic level $\alpha$, where $\Phi$ is the standard normal cumulative distribution function. While the second case of Theorem 4.1 uses a positively biased variance estimator and hence provides a conservative test which, however, is asymptotically exact, the third case provides the best approximation to exactness already in the finite case. Likewise, an asymptotically exact confidence interval for $e$ at level $1-\alpha$ is

(18) $[\hat e-u(n)^{1/2}\,\Phi^{-1}(1-\alpha/2),\ \hat e+u(n)^{1/2}\,\Phi^{-1}(1-\alpha/2)]$.
A related but different approach to a similar testing problem is provided by the so-called empirical Bernstein inequalities in Peel et al. (2010, Equations 12, 13). These are sharp empirical inequalities for general U-statistics associated to bounded kernels. However, $n$ has to be an integer multiple of the degree, and the authors do not consider cross-validation, but only partitions of the test set.
In practical applications, the number of summands of (1) is too large for computation. In the particular case where one of the learning methods $M$ is a $k$-nearest-neighbour algorithm, it is possible to compute the corresponding summand of the complete U-statistic, the leave-$p$-out cross-validation estimator of the error rate, by an efficient closed-form expression (Celisse & Mary-Huard, 2012). In general, however, one can only consider a design $T$ smaller than the full $T_{\max}$, leading to incomplete U-statistics as treated in Lee (1990), for instance. We now show that the incomplete U-statistic with random design approximates the complete one satisfactorily after a feasible number of iterations.
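A minimal sketch of such an incomplete estimator with random design follows; the classifier (a nearest-mean rule), the synthetic data and the number of splits are illustrative assumptions, not the lasso setting of Section 6. Each iteration draws a random learning set of size $g$ and, in the faster variant discussed further below, tests on all remaining observations:

```python
import random
import statistics

random.seed(0)

# Synthetic toy data (illustrative, not the paper's gene expression
# setting): class 0 centred at 0, class 1 centred at 1.
data = [(random.gauss(centre, 1.0), label)
        for centre, label in ((0.0, 0), (1.0, 1)) for _ in range(31)]

def nearest_mean_classifier(learn):
    # Hypothetical stand-in for a deterministic learning algorithm M.
    means = {label: statistics.fmean(x for x, y in learn if y == label)
             for label in (0, 1)}
    return lambda x: min(means, key=lambda lab: abs(x - means[lab]))

def incomplete_error_estimate(data, g, n_splits):
    # Incomplete U-statistic with random design: average test error
    # over randomly drawn learning sets of size g, testing on all
    # remaining observations.
    errors = []
    for _ in range(n_splits):
        idx = set(random.sample(range(len(data)), g))
        learn = [data[i] for i in idx]
        test = [data[i] for i in range(len(data)) if i not in idx]
        predict = nearest_mean_classifier(learn)
        errors.append(statistics.fmean(predict(x) != y for x, y in test))
    return statistics.fmean(errors)

e_hat = incomplete_error_estimate(data, g=26, n_splits=200)
print(e_hat)   # an error rate estimate between 0 and 1
```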
Let $\Phi$ be a not necessarily symmetric kernel with $-1\le\Phi\le 1$, let $T$ be a collection of $N$ randomly drawn ordered size-$m$ subsets of $\{1,\dots,n\}$ from the equidistribution $Q$ on the collection of such subsets, and let $\Phi(T)$ be the associated incomplete U-statistic. Then the probability of an approximation error of at least $\epsilon>0$ is bounded by

(19) $\mathrm{pr}_Q\{\,|\Phi(T)-\Phi(T_{\max})|\ge\epsilon\,\}\le 2\exp(-N\epsilon^2/2).$

This follows from Hoeffding (1963, Theorem 2) because the entries of $T$ were drawn independently from each other. One should be aware that here we do not refer to the part of Hoeffding (1963) concerned with U-statistics, in contrast to the situation of the similar inequality (16), where we did so. Here, we formulated the version for ordered subsets because this is of immediate interest for computation.
The fast exponential decay of (19) implies that sufficiency of the approximation is assured as soon as $N$ is a small multiple of $\epsilon^{-2}$, where $\epsilon$ is the pre-specified tolerance. Precisely, the following corollary of Hoeffding's theorem can be used in practice.

Corollary 5.1. After at most $N=10^{2d+1}$ iterations, $d$ digits after the decimal point are fixed with a probability of at least $1-2\exp(-5)\approx 0.99$.
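The arithmetic behind this corollary, under the reading $N=10^{2d+1}$ with tolerance $\epsilon=10^{-d}$ (one unit in the $d$-th decimal), inserted into a Hoeffding-type bound $2\exp(-N\epsilon^2/2)$ for kernels bounded between $-1$ and $1$:

```python
from math import exp

def hoeffding_bound(n_iter, eps):
    # pr(|incomplete - complete| >= eps) <= 2 exp(-N eps^2 / 2)
    # for N independent draws of a kernel bounded in [-1, 1].
    return 2 * exp(-n_iter * eps ** 2 / 2)

for d in (1, 2, 3):
    n_iter = 10 ** (2 * d + 1)   # N as in Corollary 5.1
    eps = 10.0 ** (-d)           # one unit in the d-th decimal digit
    print(d, hoeffding_bound(n_iter, eps))   # 2*exp(-5), roughly 0.013, each time
```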
Such a number of repetitions is, in general, hard but feasible because this $N$ is the mere number of times a model has to be fitted. For instance, in the illustration in Section 6, no tuning of the hyper-parameter $\lambda$ is part of each iteration. Remarkably, this bound on $N$ holds true irrespectively of the sample size $n$ or of any properties of the particular U-statistic under consideration, apart from $-1\le\Phi\le 1$. In practice, however, one proceeds again slightly differently. For the approximation of the U-statistic $\hat e$, for instance, one applies the following procedure, which yields even faster convergence towards the true $e$. In the formal setting required for inequality (19), one would use only one test observation for each learning iteration, which would lead to unnecessarily high computational cost. Instead, one simply uses all remaining $n-g$ observations for testing. This speeds up convergence even further. Corollary 5.1 also applies to the computation of $\hat v$ by the linear combination of U-statistics explained above, because the kernels appearing in (13) and (14) are bounded between $-1$ and $1$ as well.
The estimation procedure elaborated in the preceding sections was applied to the well-investigated colon cancer data set of Alon et al. (1999), where the binary response $y\in Y$ stands for the type of tissue (either normal tissue or tumor tissue) and the 2000 continuous predictors are gene expressions. We used lasso-penalized logistic regression with the coordinate descent method for classification (Friedman et al., 2010) and the penalization parameters $\lambda$ = 0·08, 0·5. These values were pre-selected such that the software-internal estimate of the difference of error rates was greater than 0·1. This involved the whole data set; however, this is no problem here. The sample size was $n=62$. Therefore, the condition $n\ge 2g+2$ constrained $g\le 30$. Since the variance of the U-statistic $\hat v$ decreases to the extent to which the sample size exceeds the degree $2g+2$, the learning set size $g$ was arbitrarily chosen to be only 26, as a compromise with the effort to avoid a too small learning set size. There was numerical evidence for the validity of the non-degeneracy statement of Assumption 3.4. The resulting point estimate of $e$ was $-0{\cdot}14$, with 95%-confidence interval [-0·35, 0·07] and estimated variance $\hat v=0{\cdot}01$. The number of iterations was $N=10^5$ for each of the U-statistics $\hat\Theta_c$ for $1\le c\le g+1$ and $\widehat{e^2}$. By Corollary 5.1, two digits of each of these were therefore assured. The two-sided p-value was $p=0{\cdot}19$, given by the corresponding upper and lower normal tail probabilities. An R script that allows reproduction of these results is available on the first author's institution web page.
MF was supported by the German Science Foundation (DFG-Einzelförderung BO3139/2-2 to ALB). RH was supported by the German Science Foundation (DFG-Einzelförderung BO3139/3-1 to ALB). RDB was supported by the German Science Foundation (DFG-Einzelförderung BO3139/4-1 to ALB).
ALON, U., BARKAI, N., NOTTERMAN, D. A., GISH, K., YBARRA, S., MACK, D. & LEVINE, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745–6750.
ARLOT, S. & CELISSE, A. (2010). A survey of cross-validation procedures for model
selection. Stat. Surveys 4, 40–79.
BENGIO, Y. & GRANDVALET, Y. (2003). No unbiased estimator of the variance of K-fold
cross-validation. J. Mach. Learn. Res. 5, 1089–1105.
BOULESTEIX, A.-L., PORZELIUS, C. & DAUMER, M. (2008). Microarray-based classi-
fication and clinical predictors: on combined classifiers and additional predictive value.
Bioinformatics 24, 1698–1706.
BRAGA-NETO, U. M. & DOUGHERTY, E. R. (2004). Is cross-validation valid for small-
sample microarray classification? Bioinformatics 20, 374–380.
BÜHLMANN, P. & VAN DE GEER, S. (2011). Statistics for High-Dimensional Data. Springer Series in Statistics.
CELISSE, A. & MARY-HUARD, T. (2012). Exact Cross-Validation for kNN: application
to passive and active learning in classification. J. Soc. Fr. Stat. 152, 83–97.
DIETTERICH, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895–1923.
DOUGHERTY, E. R., ZOLLANVARI, A. & BRAGA-NETO, U. M. (2011). The illusion of distribution-free small-sample classification in genomics. Curr. Genomics 12, 333.
FRIEDMAN, J., HASTIE, T. & TIBSHIRANI, R. (2010). Regularization paths for general-
ized linear models via coordinate descent. J. Stat. Softw. 33, 1–22.
HOEFFDING, W. (1948). A class of statistics with asymptotically normal distribution. Ann.
Math. Stat. 19, 293–325.
HOEFFDING, W. (1963). Probability inequalities for sums of bounded random variables.
J. Am. Statist. Assoc. 58, 13–30.
JIANG, B., ZHANG, X. & CAI, T. (2008). Estimating the confidence interval for prediction
errors of support vector machine classifiers. J. Mach. Learn. Res. 9, 521–540.
JIANG, W., VARMA, S. & SIMON, R. (2008). Calculating confidence intervals for prediction error in microarray classification using resampling. Stat. Appl. Genet. Molec. Biol.
KOHAVI, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and
model selection. International Joint Conferences on Artificial Intelligence 14, 1137–
LEE, J. (1990). U-statistics: Theory and Practice. CRC Press.
NADEAU, C. & BENGIO, Y. (2003). Inference for the generalization error. Machine
Learning 52, 239–281.
PEEL, T., ANTHOINE, S. & RALAIVOLA, L. (2010). Empirical Bernstein inequalities for
u-statistics. Adv. Neural Inf. Process. Syst. 23, 1903–1911.
SHAO, J. (1993). Linear model selection by cross-validation. J. Am. Statist. Assoc. 88.
VAN DE WIEL, M., BERKHOF, J. & VAN WIERINGEN, W. (2009). Testing the prediction
error difference between two predictors. Biostatistics 10, 550–560.
E-mail address:{fuchs,hornung,debin,boulesteix}
... where K times we train our classifier on a subset of n train data points, n train < 2n and evaluate the error on the remaining m n = 2n − n train datapoints. Though still relatively uncommon, there are quite a few papers studying cross-validation within the framework of U-Statistics, in particular, Fuchs et al. (2013) and Wang and Lindsay (2014). Contrary to typical cross-validation, one tries to approximate the full cross-validation setting (see e.g., Fuchs et al. (2013, Section 2) and the references therein), in which all possible subsets of size n train are taken as training sets from the available data. ...
... Though still relatively uncommon, there are quite a few papers studying cross-validation within the framework of U-Statistics, in particular, Fuchs et al. (2013) and Wang and Lindsay (2014). Contrary to typical cross-validation, one tries to approximate the full cross-validation setting (see e.g., Fuchs et al. (2013, Section 2) and the references therein), in which all possible subsets of size n train are taken as training sets from the available data. Thus a first idea might be to repeat the random splitting into training and test set as advocated in Fuchs et al. (2013). ...
... Contrary to typical cross-validation, one tries to approximate the full cross-validation setting (see e.g., Fuchs et al. (2013, Section 2) and the references therein), in which all possible subsets of size n train are taken as training sets from the available data. Thus a first idea might be to repeat the random splitting into training and test set as advocated in Fuchs et al. (2013). An arguably more convenient approach, inspired by the work of Mentch and Hooker (2016), is to use the OOB error as a U-statistics instead. ...
Full-text available
We follow the line of using classifiers for two-sample testing and propose several tests based on the Random Forest classifier. The developed tests are easy to use, require no tuning and are applicable for any distribution on $\mathbb{R}^p$, even in high-dimensions. We provide a comprehensive treatment for the use of classification for two-sample testing, derive the distribution of our tests under the Null and provide a power analysis, both in theory and with simulations. To simplify the use of the method, we also provide the R-package "hypoRF".
... where the sum is taken over K randomly chosen subsets of size N train -see e.g., [19], [20], [21], [22]. We assume that K goes to infinity as N goes to infinity. ...
... In the context of Random Forest, Theorem 1 essentially proves that the OOB error of a prediction function that is bounded, is asymptotically normal if the number of trees is "high" and if K forests are trained on subsamples such that (15) and (16) are true. Since the OOB error with infinite learners is essentially the leave-one-out error in the context of cross-validation, this also means that a test of the cross-validation error could be derived under much weaker assumption as for instance in [20]. The key reason for the generality of the result, as was also realized by [22], is that K should be chosen small relative to N . ...
Full-text available
We follow the line of using classifiers for two-sample testing and propose several tests based on the Random Forest classifier. The developed tests are easy to use, require no tuning and are applicable for any distribution on Rp, even in high-dimensions. We provide a comprehensive treatment for the use of classification for two-sample testing, derive the distribution of our tests under the Null and provide a power analysis, both in theory and with simulations. To simplify the use of the method, we also provide the R-package "hypoRF".
... Furthermore, Markatou et al. (2005) proposed a moment approximation-based estimator of the same cross validation estimator of the generalization error, and compared this estimator with those provided by Nadeau and Bengio. Other relevant work on variance estimation includes Wang and Lindsay (2014); Fuchs et al. (2013); Markatou et al. (2011), and a theoretical analysis of the performance of cross validation in the density estimation framework by Celisse (2014). Selecting the size of the training set and understanding the effect of this selection on the generalization error, its bias and its variance, is of interest to many areas of scientific investigation . ...
... Furthermore, Markatou et al. (2005) proposed a moment approximation-based estimator of the same cross validation estimator of the generalization error, and compared this estimator with those provided by Nadeau and Bengio. Other relevant work on variance estimation includes Wang and Lindsay (2014); Fuchs et al. (2013); Markatou et al. (2011), and a theoretical analysis of the performance of cross validation in the density estimation framework by Celisse (2014). ...
Full-text available
An important question in constructing Cross Validation (CV) estimators of the generalization error is whether rules can be established that allow " optimal " selection of the size of the training set, for fixed sample size n. We define the resampling effectiveness of random CV estimators of the generalization error as the ratio of the limiting value of the variance of the CV estimator over the estimated from the data variance. The variance and the covariance of different average test set errors are independent of their indices, thus, the resampling effectiveness depends on the correlation and the number of repetitions used in the random CV estimator. We discuss statistical rules to define optimality and obtain the " optimal " training sample size as the solution of an appropriately formulated optimization problem. We show that in a broad class of smooth loss functions, and in particular for the q-class of loss functions, when the decision rule is the sample mean the problem of obtaining the optimal training sample size has a general solution independent of the data distribution. The analysis offered when the decision rule is regression illustrates the complexity of the problem.
... In the context of testing whether two binary classifiers have different error rates, this fact has already been pointed out by Fuchs et al. (2013). ...
Full-text available
The present work aims at deriving theoretical guaranties on the behavior of some cross-validation procedures applied to the k-nearest neighbors (kNN) rule in the context of binary classification. Here we focus on the leave-p-out cross-validation (LpO) used to assess the performance of the kNN classifier. Remarkably this LpO estimator can be efficiently computed in this context using closed-form formulas derived by Celisse and Mary-Huard (2011). We describe a general strategy to derive moment and exponential concentration inequalities for the LpO estimator applied to the kNN classifier. Such results are obtained first by exploiting the connection between the LpO estimator and U-statistics, and second by making an intensive use of the generalized Efron-Stein inequality applied to the L1O estimator. One other important contribution is made by deriving new quantifications of the discrepancy between the LpO estimator and the classification error/risk of the kNN classifier. The optimality of these bounds is discussed by means of several lower bounds as well as simulation experiments.
... In the context of testing whether two binary classifiers have different error rates, this fact has already been pointed out by [23]. We now derive a general upper bound on the q-th moment (q > 1) of the LpO estimator that holds true for any classification rule. ...
Full-text available
The present work addresses binary classification by use of the k-nearest neighbors (kNN) classifier. Among several assets, it belongs to intuitive majority vote classification rules and also adapts to spatial inhomogeneity, which is particularly relevant in high dimensional settings where no a priori partitioning of the space seems realistic. However the performance of the kNN classifier crucially depends on the number k of neighbors that will be considered. To calibrate the parameter k, cross-validation procedures such as V-fold or leave-one-out are usually used. But on the one hand these procedures can become highly time-consuming. On the other hand, not that much theoretical guaranties do exist on the performance of such procedures. Recently [11] have derived closed-form formulas for the leave-pout estimator of the kNN classifier performance. Such formulas now allow to efficiently perform cross-validation. The main purpose of the present article is twofold: First, we provide a new strategy to derive bounds on moments of the leave-pout estimator used to assess the performance of the kNN classifier. This new strategy exploits the link between leave-pout and U-statistics as well as the generalized Efron-Stein inequality. Second, these moment upper bounds are used to settle a new exponential concentration inequality for
Full-text available
In biomedical research, boosting-based regression approaches have gained much attention in the last decade. Their intrinsic variable selection procedure and ability to shrink the estimates of the regression coefficients toward 0 make these techniques appropriate to fit prediction models in the case of high-dimensional data, e.g. gene expressions. Their prediction performance, however, highly depends on specific tuning parameters, in particular on the number of boosting iterations to perform. This crucial parameter is usually selected via cross-validation. The cross-validation procedure may highly depend on a completely random component, namely the considered fold partition. We empirically study how much this randomness affects the results of the boosting techniques, in terms of selected predictors and prediction ability of the related models. We use four publicly available data sets related to four different diseases. In these studies, the goal is to predict survival end-points when a large number of continuous candidate predictors are available. We focus on two well known boosting approaches implemented in the R-packages CoxBoost and mboost, assuming the validity of the proportional hazards assumption and the linearity of the effects of the predictors. We show that the variability in selected predictors and prediction ability of the model is reduced by averaging over several repetitions of cross-validation in the selection of the tuning parameters.
Conference Paper
Machine learning research in image-based computer aided diagnosis is a field characterised by rich models and relatively small datasets. In this regime, conventional statistical tests for cross validation results may no longer be optimal due to variability in training set quality. We present a principle by which existing statistical tests can be conservatively extended to make use of arbitrary numbers of repeated experiments. We apply this to the problems of interval estimation and pair wise comparison for the accuracy of classification algorithms, and test the resulting procedures on real and synthetic classification tasks. The interval coverages in the synthetic task are notably improved, and the comparison has both increased power and reduced type I error. Experiments in the ADNI dataset show that the low replicability of split-half based tests can be dramatically improved.
This paper considers the problem of variance estimation of a U-statistic. Following the proposal of a linearly extrapolated variance estimator in Wang and Chen (201514. Q. Wang and S. Chen. A general class of linearly extrapolated variance estimators. Statistics&Probability Letters, 98:29–38, 2015. View all references), we consider a second-order extrapolation technique and devise a variance estimator that is nearly second-order unbiased. Simulation studies confirm that the second-order extrapolated variance estimator has smaller bias than the linearly extrapolated variance estimator and the jackknife variance estimator across a wide selection of distributions. In addition, the proposal also yields a smaller mean squared error than its counterparts. In the end we discuss the advantages of the proposed variance estimator in regression analysis and model selection.
We consider the mean prediction error of a classification or regression procedure as well as its cross-validation estimates, and investigate the variance of this estimate as a function of an arbitrary cross-validation design. We decompose this variance into a scalar product of coefficients and certain covariance expressions, such that the coefficients depend solely on the resampling design, and the covariances depend solely on the data’s probability distribution. We rewrite this scalar product in such a form that the initially large number of summands can gradually be decreased down to three under the validity of a quadratic approximation to the core covariances. We show an analytical example in which this quadratic approximation holds true exactly. Moreover, in this example, we show that the leave-p–out estimator of the error depends on p only by means of a constant and can, therefore, be written in a much simpler form. Furthermore, there is an unbiased estimator of the variance of K–fold cross-validation, in contrast to a claim in the literature. As a consequence, we can show that Balanced Incomplete Block Designs have smaller variance than K–fold cross-validation. In a real data example from the UCI machine learning repository, this property can be confirmed. We finally show how to find Balanced Incomplete Block Designs in practice.AMS Subject Classification: primary 62G05,62G09,62G10,62G20, secondary 62J05, 62K05, 05B05.
Conference Paper
Full-text available
We review accuracy estimation methods and compare the two most common methods: crossvalidation and bootstrap. Recent experimental results on arti cial data and theoretical results in restricted settings have shown that for selecting a good classi er from a set of classiers (model selection), ten-fold cross-validation may be better than the more expensive leaveone-out cross-validation. We report on a largescale experiment|over half a million runs of C4.5 and a Naive-Bayes algorithm|to estimate the e ects of di erent parameters on these algorithms on real-world datasets. For crossvalidation, we vary the number of folds and whether the folds are strati ed or not � for bootstrap, we vary the number of bootstrap samples. Our results indicate that for real-word datasets similar to ours, the best method to use for model selection is ten-fold strati ed cross validation, even if computation power allows using more folds. 1
Full-text available
In order to compare learning algorithms, experimental results reported in the machine learning literature often use statistical tests of significance to support the claim that a new learning algorithm generalizes better. Such tests should take into account the variability due to the choice of training set and not only that due to the test examples, as is often the case. This could lead to gross underestimation of the variance of the cross-validation estimator, and to the wrong conclusion that the new algorithm is significantly better when it is not. We perform a theoretical investigation of the variance of a variant of the cross-validation estimator of the generalization error that takes into account the variability due to the randomness of the training set as well as test examples. Our analysis shows that all the variance estimators that are based only on the results of the cross-validation experiment must be biased. This analysis allows us to propose new estimators of this variance. We show, via simulations, that tests of hypothesis about the generalization error using those new variance estimators have better properties than tests involving variance estimators currently in use and listed in Dietterich (1998). In particular, the new tests have correct size and good power. That is, the new tests do not reject the null hypothesis too often when the hypothesis is true, but they tend to frequently reject the null hypothesis when the latter is false.
Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular, disease conditions, using high-throughput genomic data. While many classification rules have been posed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers to have a classification rule applied to a small labeled data set and the error of the resulting classifier be estimated on the same data set, most often via cross-validation, without any assumptions being made on the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS), the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
In the binary classification framework, a closed form expression of the cross-validation Leave-p-Out (LpO) risk estimator for the k Nearest Neighbor algorithm (kNN) is derived. It is first used to study the LpO risk minimization strategy for choosing k in the passive learning setting. The impact of p on the choice of k and the LpO estimation of the risk are inferred. In the active learning setting, a procedure is proposed that selects new examples using a LpO committee of kNN classifiers. The influence of p on the choice of new examples and the tuning of k at each step is investigated. The behavior of k chosen by LpO is shown to be different from what is observed in passive learning.
We consider the problem of selecting a model having the best predictive ability among a class of linear models. The popular leave-one-out cross-validation method, which is asymptotically equivalent to many other model selection methods such as the Akaike information criterion (AIC), the Cp, and the bootstrap, is asymptotically inconsistent in the sense that the probability of selecting the model with the best predictive ability does not converge to 1 as the total number of observations n → ∞. We show that the inconsistency of the leave-one-out cross-validation can be rectified by using a leave-nv-out cross-validation with nv, the number of observations reserved for validation, satisfying nv/n → 1 as n → ∞. This is a somewhat shocking discovery, because nv/n → 1 is totally opposite to the popular leave-one-out recipe in cross-validation. Motivations, justifications, and discussions of some practical aspects of the use of the leave-nv-out cross-validation method are provided, and results from a simulation study are presented.
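The abstract above hinges on the split sizes: consistency requires the validation fraction n_v/n to tend to 1, so the construction (training) set must grow sublinearly. The concrete rule n_c = floor(n^0.75) below is purely illustrative and not the paper's recommendation; it is just one choice satisfying n_c → ∞ and n_v/n → 1.

```python
# Illustrative leave-n_v-out split sizes. The exponent 0.75 is an
# arbitrary example choice, not taken from the paper; any rule with
# n_c -> infinity and n_v / n -> 1 fits the abstract's condition.
import math

def split_sizes(n):
    n_c = math.floor(n ** 0.75)  # construction (training) size, o(n)
    n_v = n - n_c                # validation size; n_v / n -> 1
    return n_c, n_v
```

For n = 100 this gives a 31/69 split, while for n = 1,000,000 the validation fraction already exceeds 0.96, the opposite of the leave-one-out recipe.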
Upper bounds are derived for the probability that the sum S of n independent random variables exceeds its mean ES by a positive number nt. It is assumed that the range of each summand of S is bounded or bounded above. The bounds for Pr {S – ES ≥ nt} depend only on the endpoints of the ranges of the summands and the mean, or the mean and the variance of S. These results are then used to obtain analogous inequalities for certain sums of dependent random variables such as U statistics and the sum of a random sample without replacement from a finite population.
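The bound described above, in the form depending only on the endpoints of the ranges, reads Pr{S − ES ≥ nt} ≤ exp(−2nt²/(b − a)²) for summands in [a, b]. The sketch below evaluates it and, as a sanity check only (not part of the paper), compares it against a seeded Monte Carlo tail frequency for Bernoulli sums.

```python
# Hoeffding's tail bound for sums of independent summands in [a, b]:
#   Pr{S - ES >= n*t} <= exp(-2 * n * t**2 / (b - a)**2)
# plus an illustrative Monte Carlo check for Bernoulli(p) summands.
import math
import random

def hoeffding_bound(n, t, a=0.0, b=1.0):
    return math.exp(-2 * n * t ** 2 / (b - a) ** 2)

def empirical_tail(n, t, p=0.5, trials=2000, seed=0):
    """Frequency of the event {S - ES >= n*t} over seeded simulations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() < p for _ in range(n))
        if s - n * p >= n * t:
            hits += 1
    return hits / trials
```

For n = 100 and t = 0.1 the bound is exp(−2) ≈ 0.135, comfortably above the true binomial tail probability Pr{S ≥ 60} for p = 1/2.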
Let X1, …, Xn be n independent random vectors, and Φ(x1, …, xm) a function of m (≤ n) vectors. A statistic of the form U = ((n − m)!/n!) ∑″ Φ(Xα1, …, Xαm), where the sum ∑″ is extended over all permutations (α1, …, αm) of m different integers, 1 ≤ αi ≤ n, is called a U-statistic. If X1, …, Xn have the same (cumulative) distribution function (d.f.) F(x), U is an unbiased estimate of the population characteristic θ(F) = E Φ(X1, …, Xm), which is called a regular functional of the d.f. F(x). Certain optimal properties of U-statistics as unbiased estimates of regular functionals have been established by Halmos [9] (cf. Section 4).
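A toy illustration of this definition, not from either paper: the unbiased sample variance is the U-statistic of degree m = 2 with kernel Φ(x, y) = (x − y)²/2, obtained by averaging the kernel over all ordered pairs of distinct indices.

```python
# Sample variance as a U-statistic: average the symmetric kernel
# Phi(x, y) = (x - y)**2 / 2 over all m-permutations of the sample,
# matching Hoeffding's sum over permutations (alpha_1, ..., alpha_m).
from itertools import permutations

def u_statistic(kernel, xs, m):
    """U = (n - m)!/n! * sum of kernel over all m-permutations of xs."""
    terms = [kernel(*combo) for combo in permutations(xs, m)]
    return sum(terms) / len(terms)

def var_kernel(x, y):
    return (x - y) ** 2 / 2
```

On the sample (1, 2, 3, 4) this reproduces the usual unbiased variance 5/3, as a quick check of the permutation bookkeeping.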