Mach Learn (2009) 75: 275–295
DOI 10.1007/s10994-009-5104-z
The generalization performance of ERM algorithm
with strongly mixing observations
Bin Zou · Luoqing Li · Zongben Xu
Received: 18 May 2008 / Revised: 23 December 2008 / Accepted: 8 January 2009 /
Published online: 7 February 2009
Springer Science+Business Media, LLC 2009
Abstract The generalization performance is the main concern of machine learning theoretical
research. The previous main bounds describing the generalization ability of the Empirical
Risk Minimization (ERM) algorithm are based on independent and identically distributed
(i.i.d.) samples. In order to study the generalization performance of the ERM algorithm with
dependent observations, we first establish the exponential bound on the rate of relative uni-
form convergence of the ERM algorithm with exponentially strongly mixing observations,
and then we obtain the generalization bounds and prove that the ERM algorithm with expo-
nentially strongly mixing observations is consistent. The main results obtained in this paper
not only extend the previously known results for i.i.d. observations to the case of exponentially
strongly mixing observations, but also improve the previous results for strongly mixing
samples. Because the ERM algorithm is usually very time-consuming and overfitting may
happen when the complexity of the hypothesis space is high, as an application of our main
results we also explore a new strategy to implement the ERM algorithm in high complexity
hypothesis space.
Keywords Generalization performance · ERM principle · Relative uniform convergence ·
Exponentially strongly mixing
Editor: Nicolo Cesa-Bianchi.
Supported by National 973 project (2007CB311002), NSFC key project (70501030), NSFC project
(10771053) and Foundation of Hubei Educational Committee (Q200710001).
B. Zou · Z. Xu (✉)
Institute for Information and System Science, Faculty of Science, Xi’an Jiaotong University, Xi’an,
710049, People’s Republic of China
e-mail: zbxu@mail.xjtu.edu.cn
B. Zou
e-mail: zoubin0502@hubu.edu.cn
B. Zou · L. Li
Faculty of Mathematics and Computer Science, Hubei University, Wuhan, 430062, People’s Republic
of China
L. Li
e-mail: lilq@hubu.edu.cn
1 Introduction
Recently there has been a great increase in the interest for theoretical issues in the ma-
chine learning community, which is mainly due to the fact that statistical learning the-
ory has demonstrated its usefulness by providing the ground for developing successful
and well-founded learning algorithms such as Support Vector Machines (SVMs) (Vap-
nik 1998). This renewed interest for theory naturally boosted the development of perfor-
mance bounds for learning machines (see e.g. Bartlett and Long 1998; Bousquet 2003;
Cesa-Bianchi et al. 2004; Cucker and Zhou 2007; Lugosi and Pawlak 1994; Smale and Zhou
2003, 2004; Wu and Zhou 2005; Zhou 2003 and references therein). In order to measure the
generalization ability of the empirical risk minimization algorithm with i.i.d. observations,
Vapnik (1998) first established the bound on the rate of uniform convergence and that on the
rate of relative uniform convergence for i.i.d. observations respectively. Bousquet (2003) ob-
tained a generalization of Vapnik and Chervonenkis’ bounds by using a new measure of the
size of function classes, local Rademacher average (Bartlett and Mendelson 2002). Cucker
and Smale (2002a) considered the least squares error and decomposed the error (or gener-
alization error) into two parts: the sample error and the approximation error, and then they
bounded the sample error and the approximation error based on i.i.d. observations respec-
tively for a compact hypothesis space. Chen et al. (2004) obtained the bound on the excess
expected risk for pattern recognition with i.i.d. observations by introducing a projection
operator.
However, independence is a very restrictive concept in several ways (Vidyasagar 2002).
First, it is often an assumption, rather than a deduction on the basis of observations. Second,
it is an all or nothing property, in the sense that two random variables are either indepen-
dent or they are not—the definition does not permit an intermediate notion of being nearly
independent. As a result, many of the proofs based on the assumption that the underlying
stochastic sequence is i.i.d. are rather “fragile”. The notion of mixing allows one to put the
notion of “near independence” on a firm mathematical foundation, and moreover, permits
one to derive a robust rather than a “fragile” theory. In addition, the i.i.d. assumption cannot
be strictly justified in real-world problems; for example, many machine learning applications
such as market prediction, system diagnosis, and speech recognition are inherently temporal
in nature, and consequently not i.i.d. processes (Steinwart et al. 2006). Therefore, relaxations
of such i.i.d. assumption have been considered for quite a while in both machine learning
and statistics literatures. For example, Yu (1994) established the rates on the uniform conver-
gence of the empirical means to their means for stationary mixing sequences. White (1989)
considered cross-validated regression estimators for strongly mixing processes and estab-
lished convergence, without rates, of their estimators. Modha and Masry (1996) established
the minimum complexity regression estimation with m-dependent observations and strongly
mixing observations respectively. Vidyasagar (2002) considered several notions of mixing
(e.g. α-mixing, β-mixing and ϕ-mixing) and proved that most of the desirable properties
(e.g. PAC property or UCEMUP property) of i.i.d. sequences are preserved when the underlying
sequence is a mixing sequence. Nobel and Dembo (1993) proved that, if a family
of functions has the property that the empirical means, based on i.i.d. sequences, converge
uniformly to their expected values as the number of samples approaches infinity, then the
family of functions continues to have the same property if the i.i.d. sequence is replaced
by β-mixing sequence. Karandikar and Vidyasagar (2002) extended this result to the case
where the underlying probability is itself not fixed, but varies over a family of measures.
Vidyasagar (2002) obtained the bound on the rate of uniform convergence of the empirical
means to their means for mixing sequences. Steinwart et al. (2006) proved that the SVMs
for both classification and regression are consistent if the data-generating process (e.g. mix-
ing process, Markov process) satisfies a certain type of law of large numbers (e.g. WLLNE,
SLLNE). Zou and Li (2007) established the bound on the rate of uniform convergence of
learning machines with exponentially strongly mixing observations.
To extend the previous bounds in Bousquet (2003), Cucker and Smale (2002a), Vapnik
(1998) on the rate of relative uniform convergence to the case where the i.i.d. observations
are replaced by exponentially strongly mixing observations, and to improve the results in
Vidyasagar (2002), Zou and Li (2007) based on strongly mixing sequences, in this paper
we first establish the bound on the rate of relative uniform convergence of the ERM algo-
rithm with exponentially strongly mixing samples, and then we obtain the generalization
bounds of the ERM algorithm with exponentially strongly mixing samples. Because solving
the ERM problem is usually very time-consuming and overfitting may happen when the
complexity of the given function set is high, as an application of our main results we also
explore a new method to solve the problem of ERM learning with exponentially
strongly mixing samples.
The rest of this paper is organized as follows: In Sect. 2, we introduce some notions and
notations. In Sect. 3, we present the main results of this paper. In Sect. 4 we establish the
bound on the rate of relative uniform convergence of the ERM algorithm with exponentially
strongly mixing observations. We prove the generalization bounds of the ERM algorithm
with exponentially strongly mixing sequences in Sect. 5. Finally, we conclude the paper
with some useful remarks in Sect. 6.
2 Preliminaries
In this section we introduce the definitions and notations used throughout the paper.
Let $Z = \{z_i = (x_i, y_i)\}_{i=-\infty}^{\infty}$ be a stationary real-valued sequence on a probability space
$(\Omega, \mathcal{B}, P)$. For $-\infty < i < \infty$, let $\sigma_i^{\infty}$ and $\sigma_{-\infty}^{i}$ denote the σ-algebras of events generated by
the random variables $z_j, j \ge i$ and $z_j, j \le i$ respectively. With these notations, there are
several definitions of mixing, but we shall be concerned with only one, namely, α-mixing in
this literature (see Ibragimov and Linnik 1971; Modha and Masry 1996; Rosenblatt 1956;
Vidyasagar 2002; Yu 1994).
Definition 1 (Vidyasagar 2002) The sequence Z is called α-mixing, or strongly mixing (or
strongly regular), if
$$\sup_{A \in \sigma_{-\infty}^{0},\; B \in \sigma_{k}^{\infty}} \big\{ |P(A \cap B) - P(A)P(B)| \big\} = \alpha(k) \to 0 \quad \text{as } k \to \infty.$$
Here α(k) is called the α-mixing coefficient.
Assumption 1 (Exponentially strongly mixing) (Modha and Masry 1996) Assume that the
α-mixing coefficient of the sequence Z satisfies
$$\alpha(k) \le \alpha \exp(-c k^{\beta}), \qquad k \ge 1,$$
for some α > 0, β > 0, and c > 0, where the constants β and c are assumed to be known.
Remark 1 (Modha and Masry 1996) Assumption 1 is satisfied by a large class of processes,
for example, certain linear processes (which includes certain ARMA processes) satisfy
the assumption with β = 1 (Withers 1981), and certain aperiodic, Harris-recurrent Markov
processes (which includes certain bilinear processes, nonlinear ARX processes, and ARH
processes) satisfy the assumption (Davydov 1973). As a trivial example, i.i.d. random vari-
ables satisfy the assumption with β = ∞.
Denote by z the sample set of n observations
$$z = \{z_1, z_2, \ldots, z_n\}$$
drawn from the exponentially strongly mixing sequence Z. Set
$$n^{(\alpha)} = \Big\lfloor n \big\lceil \{8n/c\}^{1/(\beta+1)} \big\rceil^{-1} \Big\rfloor,$$
where n denotes the number of observations drawn from Z and $\lfloor u \rfloor$ ($\lceil u \rceil$) denotes the greatest
(least) integer less (greater) than or equal to u.
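The definition of $n^{(\alpha)}$ can be evaluated directly. The following is a minimal sketch, not part of the original paper: the function name `effective_sample_size` is hypothetical, and the floating-point power may be off by one when $\{8n/c\}^{1/(\beta+1)}$ is an exact integer.

```python
import math


def effective_sample_size(n: int, c: float, beta: float) -> int:
    """Return n^(alpha) = floor(n / ceil((8n/c)^(1/(beta+1)))).

    n    : number of observations drawn from the mixing sequence
    c    : rate constant in alpha(k) <= alpha * exp(-c * k**beta)
    beta : exponent in the same mixing bound
    """
    if n <= 0 or c <= 0 or beta <= 0:
        raise ValueError("n, c and beta must be positive")
    # Block length k = ceil((8n/c)^(1/(beta+1))).
    k = math.ceil((8.0 * n / c) ** (1.0 / (beta + 1.0)))
    # Floor of n / k; integer division is exact for positive integers.
    return n // k
```

For example, with c = 1 and β = 1, n = 1000 observations give a block length of 90 and an effective number of observations of 11, which illustrates how much smaller $n^{(\alpha)}$ can be than n.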
The goal of machine learning from random sampling is to find a function f that as-
signs values to objects such that if new objects are given, the function f will forecast them
correctly. Let
$$\mathcal{E}(f) = E[\ell(f, z)] = \int \ell(f, z)\, dP$$
be the expected risk (or expected error) of the function f, where the function $\ell(f, z)$, which is
integrable for any f and depends on f and z, is called the loss function. In this paper, we would
like to establish a general framework which includes pattern recognition and regression
estimation, so we consider a loss function of the general form $\ell(f, z)$. The important feature
of the regression estimation problem is that the loss function $\ell(f, z)$ can take arbitrary
non-negative values, whereas in the pattern recognition problem it can take only two values.
A learning task is to find the minimizer of the expected risk $\mathcal{E}(f)$ over a given hypothesis
space H. Since one knows only the set z of random samples instead of the distribution P,
the minimizer of the expected risk $\mathcal{E}(f)$ cannot be computed directly. According to the
principle of Empirical Risk Minimization (ERM) (Vapnik 1998), we minimize, instead of the
expected risk $\mathcal{E}(f)$, the so-called empirical risk (or empirical error)
$$\mathcal{E}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f, z_i).$$
Let $f_H$ be a function minimizing the expected risk $\mathcal{E}(f)$ over $f \in H$, i.e.,
$$f_H = \arg\min_{f \in H} \mathcal{E}(f) = \arg\min_{f \in H} \int \ell(f, z)\, dP.$$
We define the empirical target function $f_z$ to be a function minimizing the empirical risk
$\mathcal{E}_n(f)$ over $f \in H$, i.e.,
$$f_z = \arg\min_{f \in H} \mathcal{E}_n(f) = \arg\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} \ell(f, z_i). \tag{1}$$
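The minimization (1) can be sketched in a few lines when the hypothesis space is replaced by a finite list of candidates. This is only an illustration under that assumption (the paper works with a ball in a Hölder space); the names `empirical_risk`, `erm`, and `squared_loss` are hypothetical, and each hypothesis is represented by a slope w of the map x ↦ w·x.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, float]  # one observation z = (x, y)


def squared_loss(w: float, z: Sample) -> float:
    """Loss l(f, z) for the linear hypothesis f(x) = w * x."""
    x, y = z
    return (w * x - y) ** 2


def empirical_risk(w: float, sample: Sequence[Sample],
                   loss: Callable[[float, Sample], float]) -> float:
    """E_n(f) = (1/n) * sum_i l(f, z_i)."""
    return sum(loss(w, z) for z in sample) / len(sample)


def erm(hypotheses: Sequence[float], sample: Sequence[Sample],
        loss: Callable[[float, Sample], float]) -> float:
    """Return the hypothesis with the smallest empirical risk, as in (1)."""
    return min(hypotheses, key=lambda w: empirical_risk(w, sample, loss))
```

Calling `erm([0.0, 1.0, 2.0, 3.0], [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)], squared_loss)` selects the slope 2.0, the candidate whose empirical risk on the three observations is smallest.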
According to the principle of ERM, we shall consider the function $f_z$ as an approximation
of the target function $f_H$. Thus a central question of ERM learning is how well
$f_z$ really approximates $f_H$. If this approximation is good, then the ERM algorithm is said
to generalize well. An ERM algorithm with generalization capability implies that although
it is found via minimizing the empirical risk $\mathcal{E}_n(f)$, it can eventually predict as well as the
optimal predictor $f_H$. To characterize the generalization capability of a learning algorithm
requires in essence to decipher how close $f_z$ is to $f_H$. This is a very difficult issue in
general (Vapnik 1998). In the framework of statistical learning, however, this is relaxed
to considering how close the expected risk $\mathcal{E}(f_z)$ is to $\mathcal{E}(f_H)$, or equivalently, how small
we can expect the difference $\mathcal{E}(f_z) - \mathcal{E}(f_H)$ to be. We call the refined upper bound estimates
on $\mathcal{E}(f_z)$ or on the deviation between $\mathcal{E}(f_z)$ and $\mathcal{E}(f_H)$ the generalization bounds of
the ERM algorithm.
Since $f_z$ is dependent on the sample set z, in other words, the minimization (1) is taken
over the discrete quantity $\mathcal{E}_n(f)$, intuitively, we have to estimate the capacity of the function
set H. It has been shown that the VC-dimension is not suitable for real-valued function
classes (Evgeniou and Pontil 1999). As for the $V_\gamma$-dimension or $P_\gamma$-dimension, though their
finiteness is sufficient and necessary for a function class to be a uniform Glivenko-Cantelli class
(Alon et al. 1997), no satisfactory relationship has been found between them and the covering
numbers in order to derive sharp estimates. So the capacity of the function set H is
measured by the covering number in this paper.
Definition 2 (Cucker and Smale 2002a) For a subset F of a metric space and ε > 0, the
covering number $\mathcal{N}(F, \varepsilon)$ of the function set F is the minimal integer $b \in \mathbb{N}$ such that there
exist b disks with radius ε covering F.
To estimate the generalization ability of the ERM algorithm (1) with exponentially
strongly mixing samples, we give some basic assumptions on the hypothesis space H and
the loss function $\ell(f, z)$:
(i) Assumption on the hypothesis space: We suppose that H is contained in a ball of a
Hölder space $C^p$ on a compact subset of a Euclidean space $\mathbb{R}^d$ for some p > 0, that is,
$$H = \{ f \in B_R(C^p) : r < \mathcal{E}(f) \le s \},$$
where R is the radius of the ball $B_R(C^p)$.
(ii) Assumption on the loss function: We define
$$M = \sup_{f \in H} \max_{z \in Z} |\ell(f, z)|$$
and
$$L = \sup_{g_1, g_2 \in H,\; g_1 \ne g_2} \max_{z \in Z} \frac{|\ell(g_1, z) - \ell(g_2, z)|}{|g_1 - g_2|}.$$
We assume that M and L are finite in this paper.
Because the function set H is assumed to be compact, the covering number $\mathcal{N}(H, \varepsilon)$ is
finite for a fixed ε > 0. Then there exists a constant $C_0 > 0$ such that (Zhou 2003)
$$\mathcal{N}(H, \varepsilon) \le \exp\{ C_0\, \varepsilon^{-2d/p} \}. \tag{2}$$
3 Main results
To measure the generalization performance of a learning machine, Bousquet (2003), Cucker
and Smale (2002a), Vapnik (1998) obtained the bound on the rate of uniform convergence of the
empirical risks to their expected risks in a given set H (or Q) based on i.i.d. sequences,
that is, for any ε > 0, they bounded the term
$$\mathrm{Prob}\Big\{ \sup_{f \in H} |\mathcal{E}(f) - \mathcal{E}_n(f)| > \varepsilon \Big\}. \tag{3}$$
Vidyasagar (2002) also established the bounds on the term (3) based on β-mixing sequences
and α-mixing sequences respectively. Yu (1994) obtained the convergence rates of the term
(3) for mixing sequences. Zou and Li (2007) established the bound on the term (3) based
on exponentially strongly mixing observations. The interested reader can consult (Zou and
Li 2007; Vidyasagar 2002; Yu 1994) for the details. For more inequalities on probabilities
of uniform deviations, see, for example, Alexander (1984), Bartlett and Lugosi (1999),
Devroye (1982), Pollard (1984), Talagrand (1994).
However, the term (3) fails to capture the phenomenon that for those functions $f \in H$
for which the expected risk $\mathcal{E}(f)$ is small, the deviation $\mathcal{E}(f) - \mathcal{E}_n(f)$ is also small with
large probability (see Bartlett and Lugosi 1999; Bousquet 2003; Vapnik 1998). In order
to extend these results in Bousquet (2003), Cucker and Smale (2002a), Vapnik (1998) to
the case where the i.i.d. sequence is replaced by an α-mixing sequence, and to improve the
estimates in Zou and Li (2007), Vidyasagar (2002), our purpose in this paper is to bound
the term (for any ε > 0)
$$\mathrm{Prob}\Big\{ \sup_{f \in H} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} > \varepsilon \Big\} \tag{4}$$
for the ERM algorithm (1) with exponentially strongly mixing samples. Our main results
are stated as follows.
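Theorem 1 Let Z be a stationary α-mixing sequence with the mixing coefficient satisfying
Assumption 1, that is
$$\alpha(k) \le \alpha \exp(-c k^{\beta}), \quad k \ge 1,\ \alpha > 0,\ \beta > 0,\ c > 0.$$
Set $n^{(\alpha)} = \lfloor n \lceil \{8n/c\}^{1/(\beta+1)} \rceil^{-1} \rfloor$, and assume that the variance $D[\ell(f, z)] \le \sigma^2$ for all $z \in Z$
and for all functions in H. Then for any ε, 0 < ε ≤ 2r, the inequality
$$\mathrm{Prob}\Big\{ \sup_{f \in H} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \varepsilon \Big\} \le C\, \mathcal{N}\Big( H, \frac{\varepsilon r \sqrt{r}}{(s + 4r)L} \Big) \exp\Big\{ \frac{-r \tau^2 n^{(\alpha)}}{2(\sigma^2 + \tau \sqrt{s} M/3)} \Big\} \tag{5}$$
holds, where $C = 1 + 4e^{-2}\alpha$, and $\tau = \frac{r\sqrt{r}}{\sqrt{s}(s + 4r)}\, \varepsilon$.
In particular, if Z is an i.i.d. sequence, according to Remark 1, we take β = ∞ in Theorem 1
and ignore the multiplicative constant $1 + 4e^{-2}\alpha$. The following bound follows from
Theorem 1 immediately.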
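Corollary 1 Let Z be an i.i.d. sequence, and assume that the variance $D[\ell(f, z)] \le \sigma^2$ for
all $z \in Z$ and for all functions in H. Then for any ε, 0 < ε ≤ 2r, the inequality
$$\mathrm{Prob}\Big\{ \sup_{f \in H} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \varepsilon \Big\} \le \mathcal{N}\Big( H, \frac{\varepsilon r \sqrt{r}}{(s + 4r)L} \Big) \exp\Big\{ \frac{-r \tau^2 n}{2(\sigma^2 + \tau \sqrt{s} M/3)} \Big\}$$
holds.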
Remark 2 (i) $n^{(\alpha)}$ arises from the Bernstein inequality (Theorem 4.3) for strongly mixing
processes in Modha and Masry (1996) and is called the “effective number of observations”
for strongly mixing processes. From Theorem 1 and Corollary 1, we can find that $n^{(\alpha)}$ plays
the same role in our analysis as that played by the number n of observations in the i.i.d. case.
(ii) Since $n^{(\alpha)} \to \infty$ as $n \to \infty$, by Theorem 1, we then have that for any 0 < ε ≤ 2r
$$\mathrm{Prob}\Big\{ \sup_{f \in H} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \varepsilon \Big\} \to 0, \quad \text{as } n \to \infty.$$
This shows that as long as the covering number of the hypothesis space H is finite, the
empirical risks $\mathcal{E}_n(f)$ can uniformly converge to their expected risks $\mathcal{E}(f)$, and the convergence
speed may be exponential. This assertion is well known for the ERM algorithm with
i.i.d. samples (see e.g. Bousquet 2003; Cucker and Smale 2002a; Vapnik 1998). We have
generalized these classical results in Bousquet (2003), Cucker and Smale (2002a), Vapnik
(1998) to the exponentially strongly mixing sequences.
(iii) Theorem 1 is on the rate of relative uniform convergence of the ERM algorithm (1)
with exponentially strongly mixing sequences. As far as we know, this is the first result on
this topic. The bound in Theorem 1 usually has a smaller confidence interval than the bound
on the rate of uniform convergence (this is the reason why Bousquet 2003; Cucker and
Smale 2002a; Vapnik 1998 bounded the term (4)).
Theorem 1 will be proven in the next section. Before going into the technical proofs, we
first deduce the generalization bounds of the ERM algorithm (1) with exponentially strongly
mixing samples.
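Proposition 1 Let Z be a stationary α-mixing sequence with the mixing coefficient satisfying
Assumption 1. Assume that the variance $D[\ell(f, z)] \le \sigma^2$ for all $z \in Z$ and for all
functions in H. Then for any η ∈ (0,1], the following inequalities hold true provided that
$$n^{(\alpha)} \ge \max\Big\{ \frac{\ln(C/\eta)}{2 C_1 r^2},\ \frac{C_0 [(s + 4r)L]^{2d/p}}{C_1\, 2^{2d/p}\, r^{(5d+2p)/p}},\ \frac{\ln(C/\eta(\sigma^2 + sM/3))}{2r^2} \Big\}.$$
(i) With probability at least 1−η,
$$\mathcal{E}(f_z) \le \mathcal{E}_n(f_z) + \frac{\varepsilon^2(n, \eta)}{2} \Big( 1 + \sqrt{1 + \frac{4\mathcal{E}_n(f_z)}{\varepsilon^2(n, \eta)}} \Big). \tag{6}$$
(ii) With probability at least 1−2η,
$$\mathcal{E}(f_z) - \mathcal{E}(f_H) \le \varepsilon'(n, \eta) + \frac{\varepsilon^2(n, \eta)}{2} \Big( 1 + \sqrt{1 + \frac{4\mathcal{E}_n(f_z)}{\varepsilon^2(n, \eta)}} \Big), \tag{7}$$
where
$$\varepsilon(n, \eta) \le \max\Big\{ \Big( \frac{2\ln(C/\eta)}{C_1 n^{(\alpha)}} \Big)^{1/2},\ \Big( \frac{2 C_0\, r^{-3d/p} [(s + 4r)L]^{2d/p}}{C_1 n^{(\alpha)}} \Big)^{\frac{p}{2p+2d}} \Big\},$$
$$\varepsilon'(n, \eta) = \frac{M \ln(C/\eta)}{3 n^{(\alpha)}} \Big( 1 + \sqrt{1 + \frac{18\, n^{(\alpha)} \sigma^2}{M^2 \ln(C/\eta)}} \Big),$$
$$C_1 = \frac{3r^4}{2s(s + 4r)\big[ 3(s + 4r)\sigma^2 + 2 r^{5/2} M \big]}.$$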
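To get a feel for how the term $\varepsilon'(n, \eta)$ of Proposition 1 shrinks as the effective number of observations grows, it can be evaluated numerically. The sketch below is illustrative only: the function name `eps_prime` is hypothetical, and the constant C must be supplied by the caller (for instance $C = 1 + 4e^{-2}\alpha$ with an assumed value of α). A short calculation shows that this closed form is the positive root in ε of $n^{(\alpha)} \varepsilon^2 / (2(\sigma^2 + \varepsilon M/3)) = \ln(C/\eta)$, which is what the test on the residual checks.

```python
import math


def eps_prime(n_alpha: int, eta: float, C: float, M: float, sigma2: float) -> float:
    """Evaluate eps'(n, eta) from Proposition 1.

    n_alpha : effective number of observations n^(alpha)
    eta     : confidence parameter in (0, 1]
    C       : multiplicative constant (caller-supplied, must exceed eta)
    M       : bound on the loss
    sigma2  : bound sigma^2 on the variance of the loss
    """
    t = math.log(C / eta)  # ln(C / eta), positive when C > eta
    return (M * t / (3.0 * n_alpha)) * (
        1.0 + math.sqrt(1.0 + 18.0 * n_alpha * sigma2 / (M * M * t))
    )
```

Increasing `n_alpha` while keeping the other arguments fixed makes the returned value decrease toward zero, in line with the consistency statement that follows.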
Remark 3 (i) Since $n^{(\alpha)} \to \infty$ when $n \to \infty$, we have
$$\varepsilon(n, \eta) \to 0, \qquad \varepsilon'(n, \eta) \to 0, \qquad \text{as } n \to \infty.$$
By inequality (7), we then have
$$\mathcal{E}(f_z) - \mathcal{E}(f_H) \to 0, \qquad \text{as } n \to \infty.$$
This shows that the ERM algorithm (1) with exponentially strongly mixing observations is
consistent whenever the covering number of the target function set H is finite.
(ii) Bounds (6) and (7) describe the generalization performance of the ERM algorithm (1)
with exponentially strongly mixing observations in the given function set H: Bound (6)
evaluates the risk for the chosen function in the target function set H, and bound (7) evaluates
how close this risk is to the smallest possible risk for the target function set H.
(iii) Strongly mixing samples usually contain less information than i.i.d. samples, and
they therefore might lead to worse learning rates. This property of dependent samples is just
what we can expect as reflected in our results.
In addition, from Proposition 1, we can find that if $p \gg d$, the learning rates of the ERM
algorithm with exponentially strongly mixing samples are close to or the same as the
learning rates with i.i.d. samples.
The ERM algorithm is known to be a classical learning algorithm in statistical learning
theory (Vapnik 1998). However, when the complexity of the given function set H is high,
the ERM algorithm (1) is usually very time-consuming and overfitting may happen (see
Wu and Zhou 2005). Thus, regularization techniques are frequently adopted. Two kinds of
regularization methods are the most interesting: the Tikhonov regularization and the Ivanov
regularization. The interested reader can consult Wu and Zhou (2005) for the details. As an
application of Proposition 1, in this paper we also explore a new method to solve the time-
consuming problem of the ERM algorithm (1) by following the enlightening idea of Giné
and Koltchinski (2006). We simply state our ideas as follows: first, we decompose the given
target function set H into different disjoint compact subsets such that the complexities of
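all subsets are small. To be more precise, for the given r, s, r < s, we take q > 1 and $a \in \mathbb{N}$
such that $s = r q^a$, and
$$a = \log_q \Big( \frac{s}{r} \Big).$$
Let $\rho_i = r q^i$, $i = 0, 1, \ldots, a$ (with $\rho_0 = r$, $\rho_a = s$). We set
$$H(\rho_{i-1}) = \{ f \in H : \mathcal{E}(f) \le \rho_{i-1} \},$$
and
$$H(\rho_{i-1}, \rho_i] = H(\rho_i) \setminus H(\rho_{i-1}).$$
Then we have
$$H = \bigcup_{i=1}^{a} H(\rho_{i-1}, \rho_i],$$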
where a is finite because H is assumed to be compact. Second, for a given function subset
$H(\rho_{i-1}, \rho_i]$, $i \in \{1, 2, \ldots, a\}$, by the ERM algorithm (1), we can obtain the corresponding
empirical target function $f_z^i$, and then we can obtain the upper bound of their risk $\mathcal{E}(f_z^i)$
by the same argument conducted in Proposition 1. Thus we choose the minimizer of these
upper bounds of the risks $\mathcal{E}(f_z^i)$, $i \in \{1, 2, \ldots, a\}$ as the risk of the chosen function in the
hypothesis space H. We can obtain the following proposition.
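The geometric grid $\rho_i = r q^i$ and the assignment of a risk value to its shell $(\rho_{i-1}, \rho_i]$ can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names `shell_edges` and `shell_index` are hypothetical, and the expected risk of a function is treated as a known number, which in practice it is not.

```python
import math
from typing import List


def shell_edges(r: float, s: float, q: float) -> List[float]:
    """Return [rho_0, ..., rho_a] with rho_i = r * q**i, rho_0 = r, rho_a = s."""
    if not (0 < r < s) or q <= 1:
        raise ValueError("need 0 < r < s and q > 1")
    a = round(math.log(s / r, q))  # a = log_q(s / r), required to be an integer
    if a < 1 or not math.isclose(r * q ** a, s, rel_tol=1e-9):
        raise ValueError("s must equal r * q**a for some integer a >= 1")
    return [r * q ** i for i in range(a + 1)]


def shell_index(risk: float, edges: List[float]) -> int:
    """Return i in {1, ..., a} such that rho_{i-1} < risk <= rho_i."""
    for i in range(1, len(edges)):
        if edges[i - 1] < risk <= edges[i]:
            return i
    raise ValueError("risk lies outside (r, s]")
```

With r = 1, s = 8 and q = 2 the grid is [1, 2, 4, 8], so a = 3 shells are produced and a function with risk 3 falls in the second shell.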
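Proposition 2 With all notations as in Proposition 1, let $f_z^i$, $i \in \{1, 2, \ldots, a\}$ be the function
minimizing the empirical risk $\mathcal{E}_n(f)$ over $f \in H(\rho_{i-1}, \rho_i]$. Then for any η ∈ (0,1], the
following inequalities hold true provided that
$$n^{(\alpha)} \ge \max\Big\{ \frac{\ln(C/\eta)}{2 C_1 r^2},\ \frac{C_0 [(s + 4r)L]^{2d/p}}{C_1\, 2^{2d/p}\, r^{(5d+2p)/p}},\ \frac{\ln(C/\eta(\sigma^2 + sM/3))}{2r^2} \Big\}.$$
(i) With probability at least 1−η,
$$\mathcal{E}(f_z) \le \min_{1 \le i \le a} \Big\{ \mathcal{E}_n(f_z^i) + \frac{\varepsilon_i^2(n, \eta)}{2} \Big( 1 + \sqrt{1 + \frac{4\mathcal{E}_n(f_z^i)}{\varepsilon_i^2(n, \eta)}} \Big) \Big\}.$$
(ii) With probability at least 1−2η,
$$\mathcal{E}(f_z) - \mathcal{E}(f_H) \le \varepsilon'(n, \eta) + \min_{1 \le i \le a} \Big\{ \frac{\varepsilon_i^2(n, \eta)}{2} \Big( 1 + \sqrt{1 + \frac{4\mathcal{E}_n(f_z^i)}{\varepsilon_i^2(n, \eta)}} \Big) \Big\},$$
where $C_1$ and $\varepsilon'(n, \eta)$ are defined as in Proposition 1, $C_i$ is defined as at the end of Sect. 5,
and
$$\varepsilon_i(n, \eta) \le \max\Big\{ \Big( \frac{2\ln(C/\eta)}{C_i n^{(\alpha)}} \Big)^{1/2},\ \Big( \frac{2 C_0 (\rho_{i-1})^{-3d/p} [(\rho_i + 4\rho_{i-1})L]^{2d/p}}{C_i n^{(\alpha)}} \Big)^{\frac{p}{2p+2d}} \Big\}.$$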
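Propositions 1 and 2 will be proven in the next section. Before going into the technical
proofs, in order to have a better understanding of the significance and value of the established
results in this paper, we compare our results with the previously known results in Vidyasagar
(2002). Therefore, we first give the equivalent form of inequality (7) in Proposition 1 as
follows.
By Theorem 1, we have that for any 0 < ε < 2r, and for the function $f_z$ that minimizes
the empirical risk $\mathcal{E}_n(f)$ over H, the inequality
$$\mathrm{Prob}\Big\{ \frac{\mathcal{E}(f_z) - \mathcal{E}_n(f_z)}{\sqrt{\mathcal{E}(f_z)}} \ge \varepsilon \Big\} \le C\, \mathcal{N}\Big( H, \frac{\varepsilon r \sqrt{r}}{(s + 4r)L} \Big) \exp\Big\{ \frac{-r \tau^2 n^{(\alpha)}}{2(\sigma^2 + \tau \sqrt{s} M/3)} \Big\}$$
is valid, where $C = 1 + 4e^{-2}\alpha$, and $\tau = \frac{r\sqrt{r}}{\sqrt{s}(s + 4r)}\, \varepsilon$.
It follows that for any 0 < ε < 2r,
$$\mathrm{Prob}\big\{ \mathcal{E}(f_z) - \mathcal{E}_n(f_z) \ge \varepsilon \sqrt{s} \big\} \le C\, \mathcal{N}\Big( H, \frac{\varepsilon r \sqrt{r}}{(s + 4r)L} \Big) \exp\Big\{ \frac{-r \tau^2 n^{(\alpha)}}{2(\sigma^2 + \tau \sqrt{s} M/3)} \Big\}. \tag{8}$$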
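By Theorem 2 in the next section, we also have that for any ε > 0, and for the function
$f_H$ that minimizes the expected risk $\mathcal{E}(f)$ over H, the inequality
$$\mathrm{Prob}\big\{ \mathcal{E}_n(f_H) - \mathcal{E}(f_H) \ge \varepsilon \sqrt{s} \big\} \le C \exp\Big\{ \frac{-r \varepsilon^2 n^{(\alpha)}}{2(\sigma^2 + \varepsilon \sqrt{s} M/3)} \Big\} \tag{9}$$
holds true. Note that, since $f_z$ minimizes the empirical risk and hence $\mathcal{E}_n(f_z) \le \mathcal{E}_n(f_H)$,
$$\mathcal{E}(f_z) - \mathcal{E}(f_H) \le \mathcal{E}(f_z) - \mathcal{E}_n(f_z) + \mathcal{E}_n(f_H) - \mathcal{E}(f_H). \tag{10}$$
Taking inequality (8) into account from (9) and (10), and replacing ε by $\frac{\varepsilon}{2\sqrt{s}}$, we conclude
that for any 0 < ε < 2r, the inequality
$$\mathrm{Prob}\big\{ \mathcal{E}(f_z) - \mathcal{E}(f_H) \ge \varepsilon \big\} \le 2C\, \mathcal{N}\Big( H, \frac{\varepsilon r \sqrt{r}}{2\sqrt{s}(s + 4r)L} \Big) \exp\Big\{ \frac{-r \tau'^2 n^{(\alpha)}}{2(\sigma^2 + \tau' \sqrt{s} M/3)} \Big\} \tag{11}$$
holds, where $C = 1 + 4e^{-2}\alpha$, and $\tau' = \frac{r\sqrt{r}}{2s(s + 4r)}\, \varepsilon$.
Thus we have the following remarks.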
Remark 4 Comparing bound (11) with the bound in Theorem 6.12 obtained by Vidyasagar
(2002), we can find that although we adopt the same measure of the complexity of the function
set, the covering number, and our proof techniques have many steps similar to those of
Theorem 3.5 in Vidyasagar (2002), the differences between bound (11) and the bound in
Theorem 6.12 are obvious: First, the key proof technique and method are different. Vidyasagar
(2002) first established the bound (Theorem 3.5) on the uniform convergence of the empirical
means to their true values, then he proved that the minimal empirical risk algorithm based on
a function family of finite metric entropy is PAC (see Theorem 6.12 in Vidyasagar 2002). In
this paper, we first adopted the quantity $n^{(\alpha)}$ introduced by Modha and Masry (1996) to establish
a new bound on the relative uniform convergence of the ERM algorithm with exponentially
strongly mixing samples, which consists of only one exponential term, and then we obtained
the generalization bounds of the ERM algorithm and proved that the ERM algorithm with
exponentially strongly mixing samples is consistent.
Second, in Theorem 6.12, Vidyasagar (2002) merely established the bound in the case
of n = kl, that is, Theorem 6.12 in Vidyasagar (2002) covers the case where the number n of
samples is exactly divisible by the integer k. In the case of n = kl, comparing the bound in
Theorem 6.12 with bound (11), we can find that if k and l satisfy
$$k \le \Big( \frac{\ln(4\alpha l)}{c} + \frac{l \varepsilon^2}{8c} + \frac{4 l \varepsilon}{c} \Big)^{1/\beta},$$
bound (11) has a better convergence rate than the bound in Theorem 6.12; otherwise bound
(11) has the same convergence rate as the bound in Theorem 6.12.
In addition, since $n^{(\alpha)} = O(n^{\beta/(\beta+1)})$, the convergence rate of bound (11) is close to or the
same as the convergence rate with i.i.d. samples in Bousquet (2003), Cucker and Smale
(2002a), Vapnik (1998).
4 Proof of relative uniform convergence bound
To prove the main results presented in the last section, we first establish a new bound on the
relative difference between the empirical risks and their expected risks by using an argument
similar to that used by Modha and Masry (1996) and by Vidyasagar (2002). Our approach
is however based on the Bernstein moment condition (see Craig 1933; Modha and Masry
1996) and the covariance inequality for α-mixing sequences in Vidyasagar (2002).
Lemma 1 (Craig 1933) Let W be a random variable such that E(W) = 0, and W satisfies
the Bernstein moment condition, that is, for some $K_1 > 0$,
$$E|W|^k \le \frac{\mathrm{Var}(W)}{2}\, k!\, K_1^{k-2}$$
for all k ≥ 2. Then, for all $0 < \xi < 1/K_1$,
$$E[\exp(\xi W)] \le \exp\Big\{ \frac{\xi^2 E|W|^2}{2(1 - \xi K_1)} \Big\}.$$
Lemma 2 (Vidyasagar 2002) Suppose Z is an α-mixing stochastic process. Suppose
$g_0, g_1, \ldots, g_l$ are essentially bounded functions, where $g_i$ depends only on $z_{ik}$. Then
$$\Big| E\Big[ \prod_{i=0}^{l} g_i \Big] - \prod_{i=0}^{l} E(g_i) \Big| \le 4 l\, \alpha(k) \prod_{i=0}^{l} \| g_i \|_{\infty}.$$
To exploit the α-mixing property, we decompose the index set I = {1,2,...,n} into
different parts as follows: Given an integer n, choose any integer $k_n \le n$ and define
$l_n = \lfloor n/k_n \rfloor$ to be the integer part of $n/k_n$. For the time being, $k_n$ and $l_n$ are denoted respectively
by k and l so as to reduce notational clutter. The dependence of k and l on n is restored near
the end of the paper.
Let p = n − kl and define the index sets $I_i$, i = 1,2,...,k as follows
$$I_i = \begin{cases} \{i, i + k, \ldots, i + lk\}, & 1 \le i \le p, \\ \{i, i + k, \ldots, i + (l - 1)k\}, & p + 1 \le i \le k. \end{cases}$$
Note that $\bigcup_i I_i$ equals the index set I = {1,2,...,n} and that within each set $I_i$ the elements
are pairwise separated by at least k. Then we have the following theorem.
Theorem 2 Let Z be a stationary α-mixing sequence with the mixing coefficient satisfying
Assumption 1. Assume that the variance $D[\ell(f, z)] \le \sigma^2$ for all $z \in Z$ and for all functions
in H. Then for all ε > 0, the inequality
$$\mathrm{Prob}\Big\{ \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} > \varepsilon \Big\} \le (1 + 4e^{-2}\alpha) \exp\Big\{ \frac{-n^{(\alpha)} r \varepsilon^2}{2(\sigma^2 + \varepsilon \sqrt{s} M/3)} \Big\} \tag{12}$$
holds.
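The decomposition of the index set I = {1,2,...,n} into the parts $I_1, \ldots, I_k$ defined before Theorem 2 can be sketched as below. The function name `index_sets` is hypothetical; the sketch only reproduces the construction and does not depend on the mixing assumption.

```python
from typing import List


def index_sets(n: int, k: int) -> List[List[int]]:
    """Split {1, ..., n} into k parts whose elements are spaced k apart.

    With l = floor(n / k) and p = n - k * l, part I_i has l + 1 elements
    for 1 <= i <= p and l elements for p + 1 <= i <= k.
    """
    if not (1 <= k <= n):
        raise ValueError("need 1 <= k <= n")
    l, p = divmod(n, k)  # n = k * l + p with 0 <= p < k
    parts = []
    for i in range(1, k + 1):
        size = l + 1 if i <= p else l
        parts.append([i + j * k for j in range(size)])
    return parts
```

For n = 10 and k = 3 this yields [1, 4, 7, 10], [2, 5, 8] and [3, 6, 9]: the parts cover every index exactly once, and neighbouring elements within a part differ by k.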
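Proof For any i, 1 ≤ i ≤ n, let
$$X_i = E[\ell(f, z_1)] - \ell(f, z_i), \qquad S_n = \sum_{i=1}^{n} X_i,$$
then we have
$$\mathcal{E}(f) - \mathcal{E}_n(f) = \frac{1}{n} S_n.$$
Let $p_i = \frac{|I_i|}{n}$ for i = 1,2,...,k, where $|I_i|$ is the number of terms in the i-th part. It then
follows that
$$\sum_{i=1}^{k} p_i = \frac{1}{n} \sum_{i=1}^{k} |I_i| = 1.$$
Then
$$S_n = \sum_{i=1}^{k} \Big( \sum_{m \in I_i} X_m \Big) = \sum_{i=1}^{k} T(i), \tag{13}$$
where $T(i) = \sum_{m \in I_i} X_m$.
Since $\frac{S_n}{n} = \sum_{i=1}^{k} p_i \frac{T(i)}{|I_i|}$ is a convex combination and the exponential function is convex,
we have for all γ > 0
$$E\Big[ \exp\Big( \frac{\gamma S_n}{n} \Big) \Big] \le \sum_{i=1}^{k} p_i\, E\Big[ \exp\Big( \frac{\gamma T(i)}{|I_i|} \Big) \Big]. \tag{14}$$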
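We now bound the second term on the right-hand side of inequality (14), which is denoted
henceforth by φ. For any i ∈ {1,2,...,k}, we have
$$\varphi = E\Big[ \exp\Big( \sum_{m \in I_i} \frac{\gamma X_m}{|I_i|} \Big) \Big] = E\Big[ \prod_{m=1}^{|I_i|} \exp\Big( \frac{\gamma X_m}{|I_i|} \Big) \Big]$$
$$\le \prod_{m=1}^{|I_i|} E\Big[ \exp\Big( \frac{\gamma X_m}{|I_i|} \Big) \Big] + \Big| E\Big[ \prod_{m=1}^{|I_i|} \exp\Big( \frac{\gamma X_m}{|I_i|} \Big) \Big] - \prod_{m=1}^{|I_i|} E\Big[ \exp\Big( \frac{\gamma X_m}{|I_i|} \Big) \Big] \Big|. \tag{15}$$
For simplicity, we denote the first term in inequality (15) by $S_1$, and denote the second term
in inequality (15) by $S_2$. Now we proceed through the following two steps.
Step 1 Estimate $S_1$. By the stationary property of the α-mixing sequence Z, we have
$$S_1 = \prod_{m=1}^{|I_i|} E\Big[ \exp\Big( \frac{\gamma X_m}{|I_i|} \Big) \Big] = \Big( E\Big[ \exp\Big( \frac{\gamma X_1}{|I_i|} \Big) \Big] \Big)^{|I_i|}.$$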
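Note that $\frac{X_1}{|I_i|}$ satisfies the Bernstein moment condition with $K_1 = \frac{M}{3|I_i|}$
in Lemma 1 (Modha and Masry 1996) and
$$E[X_1] = E\big\{ E[\ell(f, z_1)] - \ell(f, z_1) \big\} = 0.$$
Hence for all $0 < \gamma < \frac{3|I_i|}{M}$, we have
$$\prod_{m=1}^{|I_i|} E\Big[ \exp\Big( \frac{\gamma X_m}{|I_i|} \Big) \Big] \le \exp\Big\{ \frac{\gamma^2 |I_i|\, E\big| \frac{X_1}{|I_i|} \big|^2}{2(1 - \gamma M/(3|I_i|))} \Big\} \le \exp\Big\{ \frac{\gamma^2 E|X_1|^2}{2|I_i|(1 - \gamma M/(3|I_i|))} \Big\}.$$
For all i = 1,2,...,k, $|I_i| \ge l$, thus we have
$$1 - \frac{\gamma M}{3|I_i|} \ge 1 - \frac{\gamma M}{3l},$$
and furthermore
$$\exp\Big\{ \frac{\gamma^2 E|X_1|^2}{2|I_i|(1 - \gamma M/(3|I_i|))} \Big\} \le \exp\Big\{ \frac{\gamma^2 E|X_1|^2}{2l(1 - \gamma M/(3l))} \Big\}.$$
Thus we obtain
$$S_1 \le \exp\Big\{ \frac{\gamma^2 E|X_1|^2}{2l(1 - \gamma M/(3l))} \Big\}.$$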
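Step 2 Estimate $S_2$. With the same method as in Modha and Masry (1996), by Lemma 2 and
Assumption 1, we can get
$$S_2 = \Big| E\Big[ \prod_{m=1}^{|I_i|} \exp\Big( \frac{\gamma X_m}{|I_i|} \Big) \Big] - \prod_{m=1}^{|I_i|} E\Big[ \exp\Big( \frac{\gamma X_m}{|I_i|} \Big) \Big] \Big|$$
$$\le 4\alpha(k)(|I_i| - 1) \prod_{m=1}^{|I_i|} \Big\| \exp\Big( \frac{\gamma X_m}{|I_i|} \Big) \Big\|_{\infty}$$
$$\le 4(|I_i| - 1)\, \alpha(k)\, e^{\gamma M}$$
$$\le e^{|I_i|} e^{-2}\, 4\alpha \cdot e^{-c k^{\beta}} \cdot e^{\gamma M}$$
$$\le 4e^{-2}\alpha \exp\{ |I_i| + \gamma M - c k^{\beta} \}.$$
The final inequality follows from the fact that $|I_i| - 1 \le e^{|I_i| - 2}$ (this is deduced from $|I_i| \ge 2$
and Assumption 1).
Returning to inequality (15) and since $\gamma M \le 3|I_i|$, we obtain
$$E\Big[ \exp\Big( \frac{\gamma T(i)}{|I_i|} \Big) \Big] \le \exp\Big\{ \frac{\gamma^2 E|X_1|^2}{2l(1 - \gamma M/(3l))} \Big\} + 4e^{-2}\alpha \exp(4|I_i| - c k^{\beta}).$$
We require $\exp(4|I_i| - c k^{\beta}) \le 1$, which holds if $4|I_i| \le c k^{\beta}$. But $|I_i| \le (\frac{n}{k} + 1)$, thus the
bound holds if $4((n/k) + 1) \le c k^{\beta}$, or if $4(n + k) \le c k^{\beta+1}$. Since n + k ≤ 2n the bound
holds if $8n \le c k^{\beta+1}$, or if $\{8n/c\}^{\frac{1}{\beta+1}} \le k$. Let
$$k = \big\lceil \{8n/c\}^{\frac{1}{\beta+1}} \big\rceil.$$
Since $l = l_n = \lfloor n/k \rfloor$, we have
$$E\Big[ \exp\Big( \frac{\gamma T(i)}{|I_i|} \Big) \Big] \le \exp\Big\{ \frac{\gamma^2 E|X_1|^2}{2l(1 - \gamma M/(3l))} \Big\} + 4e^{-2}\alpha. \tag{16}$$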
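Since inequality (16) is true for all γ, $0 < \gamma < \frac{3|I_i|}{M}$, to make the constraint uniform over
all i, we then require that γ satisfies
$$0 < \gamma < \frac{3l}{M} \le \frac{3|I_i|}{M}.$$
Since
$$\frac{\gamma^2 E|X_1|^2}{2l(1 - \frac{\gamma M}{3l})} > 0,$$
the exponential of this quantity is at least 1, and we have
$$E\Big[ \exp\Big( \frac{\gamma T(i)}{|I_i|} \Big) \Big] \le (1 + 4e^{-2}\alpha) \exp\Big\{ \frac{\gamma^2 E|X_1|^2}{2l(1 - \gamma M/(3l))} \Big\}.$$
Returning to inequality (14), we have
$$E\Big[ \exp\Big( \frac{\gamma S_n}{n} \Big) \Big] \le (1 + 4e^{-2}\alpha) \exp\Big\{ \frac{\gamma^2 E|X_1|^2}{2l(1 - \gamma M/(3l))} \Big\}.$$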
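By Markov’s inequality, we have that for any δ > 0,
$$\mathrm{Prob}\big\{ \mathcal{E}(f) - \mathcal{E}_n(f) > \delta \mathcal{E}(f) \big\} = \mathrm{Prob}\big\{ e^{\gamma(\mathcal{E}(f) - \mathcal{E}_n(f))} > e^{\gamma \delta \mathcal{E}(f)} \big\}$$
$$\le \frac{E[e^{\gamma(\mathcal{E}(f) - \mathcal{E}_n(f))}]}{e^{\gamma \delta \mathcal{E}(f)}} \le C \exp\Big\{ -\gamma \delta \mathcal{E}(f) + \frac{\gamma^2 E|X_1|^2}{2l(1 - \gamma M/(3l))} \Big\},$$
where $C = 1 + 4e^{-2}\alpha$. Now by substituting
$$\gamma = \frac{\delta l \mu}{E|X_1|^2 + M\delta\mu/3},$$
where $\mu = \mathcal{E}(f)$, and noting that γ satisfies $\gamma < \frac{3l}{M}$, we obtain
$$\mathrm{Prob}\big\{ \mathcal{E}(f) - \mathcal{E}_n(f) > \delta \mathcal{E}(f) \big\} \le (1 + 4e^{-2}\alpha) \exp\Big\{ \frac{-\delta^2 l \mu^2}{2(E|X_1|^2 + \delta\mu M/3)} \Big\}.$$
Replacing δ by $\frac{\varepsilon}{\sqrt{\mu}}$, we then have
$$\mathrm{Prob}\Big\{ \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} > \varepsilon \Big\} \le (1 + 4e^{-2}\alpha) \exp\Big\{ \frac{-\varepsilon^2 l \mu}{2(E|X_1|^2 + \varepsilon\sqrt{\mu} M/3)} \Big\}.$$
Since $r < \mathcal{E}(f) \le s$, replacing l by $n^{(\alpha)}$ then implies that for any ε > 0, the inequality
$$\mathrm{Prob}\Big\{ \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} > \varepsilon \Big\} \le (1 + 4e^{-2}\alpha) \exp\Big\{ \frac{-n^{(\alpha)} r \varepsilon^2}{2(E|X_1|^2 + \varepsilon\sqrt{s} M/3)} \Big\} \tag{17}$$
holds. Theorem 2 thus follows from inequality (17) by replacing $E|X_1|^2$ by $\sigma^2$. This finishes
the proof of Theorem 2. □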
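The substitution of γ in the proof above can be checked numerically: plugging the chosen γ into the exponent $-\gamma\delta\mu + \gamma^2 E|X_1|^2 / (2l(1 - \gamma M/(3l)))$ should reproduce the closed form $-\delta^2 l \mu^2 / (2(E|X_1|^2 + \delta\mu M/3))$. The sketch below is a sanity check only; the function names are hypothetical and `v` stands for the second moment $E|X_1|^2$.

```python
def gamma_star(delta: float, mu: float, v: float, M: float, l: int) -> float:
    """The value gamma = delta * l * mu / (v + M * delta * mu / 3) used in the proof."""
    return delta * l * mu / (v + M * delta * mu / 3.0)


def exponent(gamma: float, delta: float, mu: float, v: float, M: float, l: int) -> float:
    """Exponent -gamma*delta*mu + gamma^2 * v / (2 l (1 - gamma M / (3 l)))."""
    return -gamma * delta * mu + gamma ** 2 * v / (2.0 * l * (1.0 - gamma * M / (3.0 * l)))


def closed_form(delta: float, mu: float, v: float, M: float, l: int) -> float:
    """Exponent after substitution: -delta^2 l mu^2 / (2 (v + delta mu M / 3))."""
    return -(delta ** 2) * l * mu ** 2 / (2.0 * (v + delta * mu * M / 3.0))
```

Evaluating `exponent(gamma_star(...), ...)` and `closed_form(...)` on the same arguments gives matching values up to rounding, and `gamma_star(...)` stays below 3l/M whenever `v` is positive, as the proof requires.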
Remark 5 Vidyasagar (2002) established a bound (Theorem 3.5) on the difference between
the empirical means and their true values for strongly mixing sequences, and
his bound consists of two terms. In this paper, by contrast, we bound the relative difference
between the empirical risks and their expected risks for exponentially strongly
mixing sequences, and our result consists of only one exponential term. Comparing Theorem 2
with Theorem 3.5 in Vidyasagar (2002), we find that the bound in Theorem 2
has a smaller confidence interval than that in Theorem 3.5. Concerning the comparison of the
convergence rates of the bound in Theorem 2 and that in Theorem 3.5 (Vidyasagar
2002), the same conclusions hold as in Remark 4.
By Theorem 2, we can now prove our main theorem on the rate at which the empirical risks
converge relatively uniformly to their expected risks for the ERM algorithm with an exponentially
strongly mixing sequence Z.
Proof of Theorem 1 We decompose the proof into three steps.
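Step 1 Let $\mathcal{H} = \mathcal{H}_1 \cup \mathcal{H}_2 \cup \cdots \cup \mathcal{H}_b$, $b \in \mathbb{N}$. Then for any $\varepsilon > 0$, whenever
\[
\sup_{f\in\mathcal{H}} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \varepsilon,
\]
there exists $j$, $1 \le j \le b$, such that
\[
\sup_{f\in\mathcal{H}_j} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \varepsilon.
\]
This implies the equivalence
\[
\sup_{f\in\mathcal{H}} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \varepsilon
\iff \exists j,\ 1 \le j \le b,\ \text{s.t.}\ \sup_{f\in\mathcal{H}_j} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \varepsilon. \tag{18}
\]
By the equivalence (18), and by the fact that the probability of a union of events is
bounded by the sum of the probabilities of these events, we have
\[
\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \varepsilon\right\}
\le \sum_{j=1}^{b} \mathrm{Prob}\left\{\sup_{f\in\mathcal{H}_j} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \varepsilon\right\}. \tag{19}
\]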
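Step 2 To estimate the term on the right-hand side of inequality (19), we define
\[
\psi(f) = (1-\delta)\mathcal{E}(f) - \mathcal{E}_n(f).
\]
Let $b = \mathcal{N}(\mathcal{H}, \varepsilon/L)$ and let the disks $D_j$, $j \in \{1,2,\ldots,b\}$, be a cover of $\mathcal{H}$ with center at $f_j$
and radius $\varepsilon/L$. For any $z \in \mathcal{Z}^n$ and all $f \in D_j$, we conclude
\[
\begin{aligned}
\psi(f) - \psi(f_j) &= (1-\delta)\mathcal{E}(f) - \mathcal{E}_n(f) - [(1-\delta)\mathcal{E}(f_j) - \mathcal{E}_n(f_j)] \\
&= [\mathcal{E}_n(f_j) - \mathcal{E}_n(f)] + (1-\delta)[\mathcal{E}(f) - \mathcal{E}(f_j)] \\
&\le L\cdot\|f - f_j\|_\infty + L(1-\delta)\cdot\|f - f_j\|_\infty \\
&\le \varepsilon(2-\delta).
\end{aligned}
\]
Since this holds for all $z \in \mathcal{Z}^n$ and all $f \in D_j$, we obtain
\[
\sup_{f\in D_j} \psi(f) \ge 2\varepsilon(2-\delta) \implies \psi(f_j) \ge \varepsilon(2-\delta).
\]
This implies that for $j = 1,2,\ldots,b$,
\[
\mathrm{Prob}\left\{\sup_{f\in D_j} \psi(f) \ge 2\varepsilon(2-\delta)\right\} \le \mathrm{Prob}\left\{\psi(f_j) \ge \varepsilon(2-\delta)\right\}. \tag{20}
\]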
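Step 3 For the sake of simplicity, we denote the term on the right-hand side of inequality (20)
by $I_1$ and denote the term on the left-hand side of inequality (20) by $I_2$. Take
$\delta = \varepsilon/\mathcal{E}(f_j)$, and suppose $0 < \varepsilon < 2r$; then we have
\[
\begin{aligned}
I_1 &= \mathrm{Prob}\left\{\psi(f_j) \ge \varepsilon(2-\delta)\right\} \\
&= \mathrm{Prob}\left\{\mathcal{E}(f_j) - \mathcal{E}_n(f_j) \ge \delta\mathcal{E}(f_j) + \varepsilon(2-\delta)\right\} \\
&= \mathrm{Prob}\left\{\mathcal{E}(f_j) - \mathcal{E}_n(f_j) \ge \varepsilon + \varepsilon\left(2 - \frac{\varepsilon}{\mathcal{E}(f_j)}\right)\right\} \\
&= \mathrm{Prob}\left\{\frac{\mathcal{E}(f_j) - \mathcal{E}_n(f_j)}{\sqrt{\mathcal{E}(f_j)}} \ge \frac{\varepsilon}{\sqrt{\mathcal{E}(f_j)}} + \frac{\varepsilon}{\sqrt{\mathcal{E}(f_j)}}\left(2 - \frac{\varepsilon}{\mathcal{E}(f_j)}\right)\right\} \\
&\le \mathrm{Prob}\left\{\frac{\mathcal{E}(f_j) - \mathcal{E}_n(f_j)}{\sqrt{\mathcal{E}(f_j)}} \ge \frac{\varepsilon}{\sqrt{s}}\left(3 - \frac{\varepsilon}{r}\right)\right\} \\
&\le \mathrm{Prob}\left\{\frac{\mathcal{E}(f_j) - \mathcal{E}_n(f_j)}{\sqrt{\mathcal{E}(f_j)}} \ge \frac{\varepsilon}{\sqrt{s}}\right\},
\end{aligned}
\]
and
\[
\begin{aligned}
I_2 &= \mathrm{Prob}\left\{\sup_{f\in D_j} \psi(f) \ge 2\varepsilon(2-\delta)\right\} \\
&= \mathrm{Prob}\left\{\sup_{f\in D_j}\left[\mathcal{E}(f) - \mathcal{E}_n(f) - \frac{\varepsilon\mathcal{E}(f)}{\mathcal{E}(f_j)}\right] \ge 2\varepsilon\left(2 - \frac{\varepsilon}{\mathcal{E}(f_j)}\right)\right\} \\
&\ge \mathrm{Prob}\left\{\sqrt{r}\sup_{f\in D_j}\left[\frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} - \frac{\varepsilon s}{\sqrt{\mathcal{E}(f)}\,\mathcal{E}(f_j)}\right] \ge 2\varepsilon\left(2 - \frac{\varepsilon}{\mathcal{E}(f_j)}\right)\right\} \\
&\ge \mathrm{Prob}\left\{\sup_{f\in D_j}\frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \frac{\varepsilon s}{\mathcal{E}(f_j)\sqrt{r}} + \frac{2\varepsilon}{\sqrt{r}}\left(2 - \frac{\varepsilon}{\mathcal{E}(f_j)}\right)\right\} \\
&\ge \mathrm{Prob}\left\{\sup_{f\in D_j}\frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \frac{\varepsilon s}{r\sqrt{r}} + \frac{2\varepsilon}{\sqrt{r}}\left(2 - \frac{\varepsilon}{s}\right)\right\} \\
&\ge \mathrm{Prob}\left\{\sup_{f\in D_j}\frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \frac{(s+4r)\varepsilon}{r\sqrt{r}}\right\}.
\end{aligned}
\]
By inequality (20), we then get
\[
\mathrm{Prob}\left\{\sup_{f\in D_j}\frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \frac{(s+4r)\varepsilon}{r\sqrt{r}}\right\}
\le \mathrm{Prob}\left\{\frac{\mathcal{E}(f_j) - \mathcal{E}_n(f_j)}{\sqrt{\mathcal{E}(f_j)}} \ge \frac{\varepsilon}{\sqrt{s}}\right\}. \tag{21}
\]
Combining inequalities (19), (21) and (12), and replacing $\varepsilon$ by $\frac{r\sqrt{r}}{(s+4r)}\varepsilon$, we then get Theorem 1.
□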
5 Proof of generalization bound
In this section, we prove the generalization bounds (Propositions 1 and 2) of the
ERM algorithm with exponentially strongly mixing samples by using the results obtained in the
last section. Our main tool is the following lemma established by Cucker and Smale (2002b).
Lemma 3 (Cucker and Smale 2002b) Let $c_1, c_2 > 0$, and $s > q > 0$. Then the equation
\[
x^s - c_1x^q - c_2 = 0
\]
has a unique positive zero $x^*$. In addition
\[
x^* \le \max\left\{(2c_1)^{1/(s-q)}, (2c_2)^{1/s}\right\}.
\]
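Lemma 3 can be illustrated numerically. The sketch below is an illustration only and not taken from Cucker and Smale (2002b): it locates the positive zero by bisection on the interval from 0 to the bound stated in the lemma, and the function names `lemma3_bound` and `positive_zero` are hypothetical.

```python
def lemma3_bound(c1: float, c2: float, s: float, q: float) -> float:
    """Upper bound max{(2 c1)^(1/(s-q)), (2 c2)^(1/s)} from Lemma 3."""
    return max((2.0 * c1) ** (1.0 / (s - q)), (2.0 * c2) ** (1.0 / s))


def positive_zero(c1: float, c2: float, s: float, q: float, iters: int = 200) -> float:
    """Approximate the unique positive zero of x^s - c1 x^q - c2 by bisection.

    The function is negative at 0 (value -c2) and non-negative at the
    Lemma 3 bound, so the zero lies in between.
    """
    if not (c1 > 0 and c2 > 0 and s > q > 0):
        raise ValueError("need c1, c2 > 0 and s > q > 0")

    def g(x: float) -> float:
        return x ** s - c1 * x ** q - c2

    lo, hi = 0.0, lemma3_bound(c1, c2, s, q)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # x^2 - x - 1 = 0 has the golden ratio as its positive zero.
    print(positive_zero(1.0, 1.0, 2.0, 1.0), lemma3_bound(1.0, 1.0, 2.0, 1.0))
```

In the example with $c_1 = c_2 = 1$, $s = 2$, $q = 1$, the bound of the lemma equals 2, while the zero itself is $(1+\sqrt{5})/2$.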
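Proof of Proposition 1 By the assumption that $0 < \varepsilon \le 2r$, the exponent in inequality (5)
in Theorem 1 satisfies
\[
\frac{-r\tau^2 n^{(\alpha)}}{2(\sigma^2 + \tau\sqrt{s}M/3)} \le -C_1 n^{(\alpha)}\varepsilon^2,
\]
where
\[
C_1 = \frac{3r^4}{2s(s+4r)\left[3(s+4r)\sigma^2 + 2r^{5/2}M\right]}.
\]
Since $\mathcal{H}$ is assumed to be compact, then by assumption (2) we have
\[
\mathcal{N}\left(\mathcal{H}, \frac{\varepsilon r\sqrt{r}}{(s+4r)L}\right) \le \exp\left\{C_0\left(\frac{\varepsilon r\sqrt{r}}{(s+4r)L}\right)^{-2d/p}\right\},
\]
where $C_0$ is a positive constant. In other words, for any $\varepsilon$, $2r \ge \varepsilon > 0$, by Theorem 1 we
have
\[
\mathrm{Prob}\left\{\sup_{f\in\mathcal{H}} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \ge \varepsilon\right\}
\le C\exp\left\{C_0\left(\frac{\varepsilon r\sqrt{r}}{(s+4r)L}\right)^{-2d/p} - C_1 n^{(\alpha)}\varepsilon^2\right\}, \tag{22}
\]
where $C = 1+4e^{-2}\alpha$.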
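Let us rewrite inequality (22) in an equivalent form. We equate the right-hand side of
inequality (22) to a positive value $\eta$ ($0 < \eta \le 1$):
\[
C\exp\left\{C_0\left(\frac{\varepsilon r\sqrt{r}}{(s+4r)L}\right)^{-2d/p} - C_1 n^{(\alpha)}\varepsilon^2\right\} = \eta.
\]
It follows that
\[
\varepsilon^{2+2d/p} - \frac{\ln(C/\eta)}{C_1 n^{(\alpha)}}\,\varepsilon^{2d/p} - \frac{C_0 r^{-3d/p}[(s+4r)L]^{2d/p}}{C_1 n^{(\alpha)}} = 0.
\]
By Lemma 3, we can solve this equation with respect to $\varepsilon$. This equation has a unique
positive zero $\varepsilon^*$, and
\[
\varepsilon^* \doteq \varepsilon(n,\eta) \le \max\left\{\left(\frac{2\ln(C/\eta)}{C_1 n^{(\alpha)}}\right)^{1/2},
\left(\frac{2C_0 r^{-3d/p}[(s+4r)L]^{2d/p}}{C_1 n^{(\alpha)}}\right)^{\frac{p}{2p+2d}}\right\}.
\]
It is used further to solve the inequality
\[
\sup_{f\in\mathcal{H}} \frac{\mathcal{E}(f) - \mathcal{E}_n(f)}{\sqrt{\mathcal{E}(f)}} \le \varepsilon(n,\eta).
\]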
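Then we deduce that with probability at least $1 - \eta$, simultaneously for all functions in
the function set $\mathcal{H}$, the inequality
\[
\mathcal{E}(f) \le \mathcal{E}_n(f) + \frac{\varepsilon^2(n,\eta)}{2}\left(1 + \sqrt{1 + \frac{4\mathcal{E}_n(f)}{\varepsilon^2(n,\eta)}}\right)
\]
is valid. Since with probability at least $1 - \eta$ this inequality holds for all functions of the
function set $\mathcal{H}$, it holds in particular for the function $f_z$ that minimizes the empirical risk
$\mathcal{E}_n(f)$ over $\mathcal{H}$. For this function, with probability at least $1 - \eta$, the inequality
\[
\mathcal{E}(f_z) \le \mathcal{E}_n(f_z) + \frac{\varepsilon^2(n,\eta)}{2}\left(1 + \sqrt{1 + \frac{4\mathcal{E}_n(f_z)}{\varepsilon^2(n,\eta)}}\right) \tag{23}
\]
then holds.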
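The passage from the relative deviation bound to inequality (23) amounts to solving a quadratic inequality in $\sqrt{\mathcal{E}(f)}$. The following Python sketch illustrates this step; the function name `risk_upper_bound` and the numerical values are hypothetical and chosen for illustration only.

```python
import math


def risk_upper_bound(emp_risk: float, eps: float) -> float:
    """Right-hand side of (23): E_n + (eps^2 / 2) * (1 + sqrt(1 + 4 E_n / eps^2)).

    This is the largest value E satisfying (E - E_n) / sqrt(E) <= eps,
    for an empirical risk E_n >= 0 and eps > 0.
    """
    if emp_risk < 0.0 or eps <= 0.0:
        raise ValueError("need emp_risk >= 0 and eps > 0")
    return emp_risk + 0.5 * eps ** 2 * (1.0 + math.sqrt(1.0 + 4.0 * emp_risk / eps ** 2))


if __name__ == "__main__":
    bound = risk_upper_bound(0.3, 0.1)
    # At the bound, the relative deviation equals eps (up to rounding).
    print(bound, (bound - 0.3) / math.sqrt(bound))
```

When the empirical risk is zero, the bound reduces to $\varepsilon^2(n,\eta)$, which reflects the faster rate available from relative deviations.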
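By Theorem 4.3 in Modha and Masry (1996), we have that for any $\varepsilon > 0$, the inequality
\[
\mathrm{Prob}\left\{|\mathcal{E}(f) - \mathcal{E}_n(f)| > \varepsilon\right\} \le 2(1+4e^{-2}\alpha)\exp\left\{\frac{-n^{(\alpha)}\varepsilon^2}{2(\sigma^2 + \varepsilon M/3)}\right\} \tag{24}
\]
is valid. Thus by inequality (24), we conclude that for the same $\eta$ as above, and for the
function $f_{\mathcal{H}}$ that minimizes the expected risk $\mathcal{E}(f)$ over $\mathcal{H}$, the inequality
\[
\mathcal{E}(f_{\mathcal{H}}) > \mathcal{E}_n(f_{\mathcal{H}}) - \varepsilon'(n,\eta) \tag{25}
\]
holds with probability at least $1 - \eta$, where
\[
\varepsilon'(n,\eta) = \frac{M\ln(C/\eta)}{3n^{(\alpha)}}\left(1 + \sqrt{1 + \frac{18n^{(\alpha)}\sigma^2}{M^2\ln(C/\eta)}}\right).
\]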
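The closed form of the deviation term in (25) can be checked against the exponential bound it inverts. The sketch below is an illustration under stated assumptions: it uses the one-sided tail $C\exp\{-n^{(\alpha)}\varepsilon^2/(2(\sigma^2+\varepsilon M/3))\}$, which is the form consistent with the $\ln(C/\eta)$ appearing in the formula, and the names `eps_prime`, `one_sided_tail` and `n_eff` (standing for $n^{(\alpha)}$) are hypothetical.

```python
import math


def eps_prime(n_eff: float, eta: float, sigma2: float, M: float, C: float) -> float:
    """Deviation term (M ln(C/eta) / (3 n)) * (1 + sqrt(1 + 18 n sigma^2 / (M^2 ln(C/eta))))."""
    log_term = math.log(C / eta)
    if log_term <= 0.0:
        raise ValueError("need C / eta > 1")
    return M * log_term / (3.0 * n_eff) * (
        1.0 + math.sqrt(1.0 + 18.0 * n_eff * sigma2 / (M ** 2 * log_term))
    )


def one_sided_tail(eps: float, n_eff: float, sigma2: float, M: float, C: float) -> float:
    """Assumed one-sided tail C * exp(-n eps^2 / (2 (sigma^2 + eps M / 3)))."""
    return C * math.exp(-n_eff * eps ** 2 / (2.0 * (sigma2 + eps * M / 3.0)))


if __name__ == "__main__":
    e = eps_prime(n_eff=200, eta=0.05, sigma2=0.25, M=1.0, C=1.5)
    # Plugging the deviation term back into the tail recovers eta.
    print(e, one_sided_tail(e, 200, 0.25, 1.0, 1.5))
```

The deviation term is the positive root of $n^{(\alpha)}\varepsilon^2 - \tfrac{2}{3}M\ln(C/\eta)\,\varepsilon - 2\sigma^2\ln(C/\eta) = 0$, which is what the check confirms.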
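Note that
\[
\mathcal{E}_n(f_{\mathcal{H}}) \ge \mathcal{E}_n(f_z). \tag{26}
\]
From inequalities (23), (25) and (26), we deduce that with probability at least $1 - 2\eta$, the
inequality
\[
\mathcal{E}(f_z) - \mathcal{E}(f_{\mathcal{H}}) \le \varepsilon'(n,\eta) + \frac{\varepsilon^2(n,\eta)}{2}\left(1 + \sqrt{1 + \frac{4\mathcal{E}_n(f_z)}{\varepsilon^2(n,\eta)}}\right)
\]
is valid. In addition, if
\[
n^{(\alpha)} \ge \max\left\{\frac{\ln(C/\eta)}{2C_1r^2},\ \frac{C_0[(s+4r)L]^{2d/p}}{C_1 2^{2d/p} r^{(5d+2p)/p}},\ \frac{\ln(C/\eta(\sigma^2 + sM/3))}{2r^2}\right\},
\]
then we have $\varepsilon \le 2r$. This leads to Proposition 1.
□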
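Proof of Proposition 2 When the complexity of the function set $\mathcal{H}$ is high, in order to solve
the time-consuming problem of the ERM algorithm (1), we can decompose the hypothesis
space $\mathcal{H}$ into many compact subsets by following the enlightening idea of Giné and
Koltchinski (2006), and denote it as follows:
\[
\mathcal{H} = \bigcup_{i=1}^{a} \mathcal{H}(\rho_{i-1}, \rho_i].
\]
For every $i$, $1 \le i \le a$, let $f_z^i$ be the function minimizing the empirical risk $\mathcal{E}_n(f)$ over $f \in \mathcal{H}(\rho_{i-1}, \rho_i]$.
By an argument similar to that for inequality (23), we have that for any $\eta \in (0,1]$,
with probability at least $1 - \eta$, the inequality
\[
\mathcal{E}(f_z^i) \le \mathcal{E}_n(f_z^i) + \frac{\varepsilon_i^2(n,\eta)}{2}\left(1 + \sqrt{1 + \frac{4\mathcal{E}_n(f_z^i)}{\varepsilon_i^2(n,\eta)}}\right), \quad 1 \le i \le a \tag{27}
\]
holds, where
\[
\varepsilon_i(n,\eta) \le \max\left\{\left(\frac{2\ln(C/\eta)}{C_i n^{(\alpha)}}\right)^{1/2},
\left(\frac{2C_0(\rho_{i-1})^{-3d/p}[(\rho_i + 4\rho_{i-1})L]^{2d/p}}{C_i n^{(\alpha)}}\right)^{\frac{p}{2p+2d}}\right\},
\]
\[
C_i = \frac{3\rho_{i-1}^4}{2\rho_i(\rho_i + 4\rho_{i-1})\left[3(\rho_i + 4\rho_{i-1})\sigma^2 + 2\rho_{i-1}^{5/2}M\right]}.
\]
Thus we have that with probability at least $1 - \eta$, the inequality
\[
\mathcal{E}(f_z) \le \min_{1\le i\le a}\left\{\mathcal{E}_n(f_z^i) + \frac{\varepsilon_i^2(n,\eta)}{2}\left(1 + \sqrt{1 + \frac{4\mathcal{E}_n(f_z^i)}{\varepsilon_i^2(n,\eta)}}\right)\right\} \tag{28}
\]
is valid.
In addition, for the same $\eta$ as above, we have that with probability at least $1 - \eta$, the inequality
\[
\mathcal{E}(f_{\mathcal{H}}) \ge \mathcal{E}_n(f_{\mathcal{H}}) - \varepsilon'(n,\eta) \tag{29}
\]
holds. Then by inequalities (28), (29) and the fact that
\[
\mathcal{E}_n(f_{\mathcal{H}}) \ge \mathcal{E}_n(f_z^i), \quad i \in \{1,2,\ldots,a\},
\]
we have that with probability at least $1 - 2\eta$, the inequality
\[
\mathcal{E}(f_z) - \mathcal{E}(f_{\mathcal{H}}) \le \varepsilon'(n,\eta) + \min_{1\le i\le a}\left\{\frac{\varepsilon_i^2(n,\eta)}{2}\left(1 + \sqrt{1 + \frac{4\mathcal{E}_n(f_z^i)}{\varepsilon_i^2(n,\eta)}}\right)\right\}
\]
is valid. This completes the proof of Proposition 2.
□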
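The selection expressed by the minimum in (28) can be sketched in code. The example below is a hypothetical illustration, not the authors' implementation: it assumes the empirical risks of the per-subset minimizers and the corresponding values $\varepsilon_i(n,\eta)$ have already been computed, and the name `best_shell_bound` is invented for this sketch.

```python
import math
from typing import Sequence, Tuple


def best_shell_bound(emp_risks: Sequence[float], eps_values: Sequence[float]) -> Tuple[int, float]:
    """Return (index, bound) minimizing E_n(f_z^i) + (eps_i^2/2)(1 + sqrt(1 + 4 E_n(f_z^i)/eps_i^2)).

    emp_risks[i] is the empirical risk of the minimizer over the i-th subset
    and eps_values[i] is the matching eps_i(n, eta); both are assumed given.
    """
    if len(emp_risks) != len(eps_values) or not emp_risks:
        raise ValueError("need two non-empty sequences of equal length")
    bounds = []
    for risk, eps in zip(emp_risks, eps_values):
        if risk < 0.0 or eps <= 0.0:
            raise ValueError("need risk >= 0 and eps > 0")
        bounds.append(risk + 0.5 * eps ** 2 * (1.0 + math.sqrt(1.0 + 4.0 * risk / eps ** 2)))
    best = min(range(len(bounds)), key=bounds.__getitem__)
    return best, bounds[best]


if __name__ == "__main__":
    # Three hypothetical subsets: lower empirical risk can come with a larger eps_i.
    print(best_shell_bound([0.30, 0.22, 0.20], [0.05, 0.08, 0.30]))
```

The design choice illustrated here is that the subset with the smallest empirical risk need not give the smallest bound, since each subset carries its own deviation term.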
6 Conclusions
In this paper we have studied the learning performance of the ERM algorithm with exponentially
strongly mixing samples. We first established a new bound on the rate of relative
uniform convergence for the ERM algorithm with exponentially strongly mixing samples.
We then derived the generalization bounds of the ERM algorithm and proved that
the ERM algorithm with exponentially strongly mixing observations is consistent. To our
knowledge, the results here are the first explicit bounds on the rate of convergence on this
topic. To clarify the significance and value of the established results, we have compared
them with previous works and concluded that they not only sharpen and improve the
previously known results in Zou and Li (2007) and Vidyasagar (2002), but also extend the
results in Bousquet (2003), Cucker and Smale (2002a) and Vapnik (1998) for i.i.d. samples
to the case of α-mixing sequences. We have also shown that the learning rates of the ERM
algorithm with exponentially strongly mixing samples are close to, or the same as, those
with i.i.d. samples.
In addition, since the ERM algorithm is usually very time-consuming and overfitting
may happen when the complexity of the given function set H is high, as an application of
our main results, we also explored a new strategy to implement the ERM algorithm in high
complexity hypothesis space.
Along the line of the present work, several open problems deserve further research. For
example, how to control the generalization ability of the ERM algorithm with exponentially
strongly mixing samples? What is the essential difference between the generalization ability
of the ERM algorithm with i.i.d. samples and dependent samples? All these problems are
under our current investigation.
Acknowledgements
The authors would like to thank Professor Nicolo Cesa-Bianchi for his careful reading and helpful comments
on the paper. The authors are grateful to the reviewers for their valuable comments and suggestions.
References
Alexander, K. (1984). Probability inequalities for empirical processes and a law of the iterated logarithm.
Annals of Probability, 4, 1041–1067.
Alon, N., Ben-David, S., Cesa-Bianchi, N., & Haussler, D. (1997). Scale-sensitive dimensions, uniform
convergence and learnability. Journal of the Association for Computing Machinery, 44, 615–631.
Bartlett, P. L., & Long, P. M. (1998). Prediction, learning, uniform convergence, and scale-sensitive dimen-
sions. Journal of Computer and System Sciences, 56(2), 174–190.
Bartlett, P. L., & Lugosi, G. (1999). An inequality for uniform deviations of sample averages from their
means. Statistics & Probability Letters, 4, 55–62.
Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: risk bounds and structural
results. Journal of Machine Learning Research, 3, 463–482.
Bousquet, O. (2003). New approaches to statistical learning theory. Annals of the Institute of Statistical Math-
ematics, 55, 371–389.
Cesa-Bianchi, N., Conconi, A., & Gentile, C. (2004). On the generalization ability of on-line learning
algorithms. IEEE Transactions on Information Theory, 50(9), 2050–2057.
Chen, D. R., Wu, Q., Ying, Y. M., & Zhou, D. X. (2004). Support vector machine soft margin classifiers:
error analysis. Journal of Machine Learning Research, 5, 1143–1175.
Craig, C. C. (1933). On the Tchebycheff inequality of Bernstein. Annals of Mathematical Statistics, 4, 94–
102.
Cucker, F., & Smale, S. (2002a). On the mathematical foundations of learning. Bulletin of the American
Mathematical Society, 39, 1–49.
Cucker, F., & Smale, S. (2002b). Best choices for regularization parameters in learning theory: on the bias-
variance problem. Foundations of Computational Mathematics, 2, 413–428.
Cucker, F., & Zhou, D. X. (2007). Learning theory: an approximation theory viewpoint. Cambridge: Cam-
bridge University Press.
Davydov, Y. A. (1973). Mixing conditions for Markov chains. Theory of Probability and its Applications,
XVIII, 312–328.
Devroye, L. (1982). Bounds for the uniform deviation of empirical measures. Journal of Multivariate Analy-
sis, 12, 72–79.
Evgeniou, T., & Pontil, M. (1999). Lecture notes in comput. sci.: Vol. 1720. On the V-gamma dimension for
regression in reproducing Kernel Hilbert spaces (pp. 106–117). Berlin: Springer.
Giné, E., & Koltchinski, V. (2006). Concentration inequality and asymptotic results for ratio type empirical
processes. The Annals of Probability, 34(3), 1143–1216.
Ibragimov, I. A., & Linnik, Y. V. (1971). Independent and stationary sequences of random variables. Groningen:
Wolters-Noordhoff.
Karandikar, R. L., & Vidyasagar, M. (2002). Rates of uniform convergence of empirical means with mixing
processes. Statistics & Probability Letters, 58, 297–307.
Lugosi, G., & Pawlak, M. (1994). On the posterior-probability estimate of the error of nonparameter classifi-
cation rules. IEEE Transactions on Information Theory, 40(5), 475–481.
Modha, S., & Masry, E. (1996). Minimum complexity regression estimation with weakly dependent obser-
vations. IEEE Transactions on Information Theory, 42, 2133–2145.
Nobel, A., & Dembo, A. (1993). A note on uniform laws of averages for dependent processes. Statistics &
Probability Letters, 17, 169–172.
Pollard, D. (1984). Convergence of stochastic processes. New York: Springer.
Rosenblatt, M. (1956). A central theorem and strong mixing conditions. Proceedings of the National Academy
of Sciences, 4, 43–47.
Smale, S., & Zhou, D. X. (2003). Estimating the approximation error in learning theory. Analysis and Its
Applications, 1, 17–41.
Smale, S., & Zhou, D. X. (2004). Shannon sampling and function reconstruction from point values. Bulletin
of the American Mathematical Society, 41, 279–305.
Steinwart, I., Hush, D., & Scovel, C. (2006). Learning from dependent observations (Technical Report LA-
UR-06-3507). Los Alamos National Laboratory. Submitted for publication.
Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22, 28–
76.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vidyasagar, M. (2002). Learning and generalization with applications to neural networks (2nd ed.). Berlin:
Springer.
Withers, C. S. (1981). Conditions for linear processes of stationary mixing sequences. Annals of Probability,
22, 94–116.
White, H. (1989). Connectionist nonparametric regression: Multilayer feedforward networks can learn arbi-
trary mappings. Neural Networks, 3, 535–549.
Wu, Q., & Zhou, D. X. (2005). SVM soft margin classifiers: linear programming versus quadratic program-
ming. Neural Computation, 17, 1160–1187.
Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. Annals of Prob-
ability, 22, 94–114.
Zhou, D. X. (2003). Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Infor-
mation Theory, 49, 1743–1752.
Zou, B., & Li, L. Q. (2007). The performance bounds of learning machines based on exponentially strongly
mixing sequence. Computer and Mathematics with Applications, 53(7), 1050–1058.