User-friendly introduction to PAC-Bayes bounds
Pierre Alquier
RIKEN AIP, Tokyo, Japan
Abstract

Aggregated predictors are obtained by making a set of basic predictors vote according to some weights, that is, to some probability distribution.

Randomized predictors are obtained by sampling in a set of basic predictors, according to some prescribed probability distribution.

Thus, aggregated and randomized predictors have in common that they are not defined by a minimization problem, but by a probability distribution on the set of predictors. In statistical learning theory, there is a set of tools designed to understand the generalization ability of such procedures: PAC-Bayesian or PAC-Bayes bounds. Since the original PAC-Bayes bounds [166, 127], these tools have been considerably improved in many directions (we will for example describe a simplified version of the localization technique of [41, 43] that was missed by the community, and later rediscovered as "mutual information bounds"). Very recently, PAC-Bayes bounds have received considerable attention: for example, there was a workshop on PAC-Bayes at NIPS 2017, (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights, organized by B. Guedj, F. Bach and P. Germain. One of the reasons for this recent success is the successful application of these bounds to neural networks [67].

An elementary introduction to PAC-Bayes theory is still missing. This is an attempt to provide such an introduction.
Contents

1 Introduction
  1.1 Machine learning and PAC bounds
    1.1.1 Machine learning: notations
    1.1.2 PAC bounds
  1.2 What are PAC-Bayes bounds?
  1.3 Why this tutorial?
  1.4 Two types of PAC bounds, organization of these notes

2 First step in the PAC-Bayes world
  2.1 A simple PAC-Bayes bound
    2.1.1 Catoni's bound [41]
    2.1.2 Exact minimization of the bound
    2.1.3 Some examples, and non-exact minimization of the bound
    2.1.4 The choice of λ
  2.2 PAC-Bayes bound on aggregation of predictors
  2.3 PAC-Bayes bound on a single draw from the posterior
  2.4 Bound in expectation
  2.5 Applications of empirical PAC-Bayes bounds

3 Tight and non-vacuous PAC-Bayes bounds
  3.1 Why is there a race to the tighter PAC-Bayes bound?
  3.2 A few PAC-Bayes bounds
    3.2.1 McAllester's bound [127] and Maurer's improved bound [126]
    3.2.2 Catoni's bound (another one) [43]
    3.2.3 Seeger's bound [161] and Maurer's bound [126]
    3.2.4 Tolstikhin and Seldin's bound [175]
    3.2.5 Thiemann, Igel, Wintenberger and Seldin's bound [174]
    3.2.6 A bound by Germain, Lacasse, Laviolette and Marchand [77]
  3.3 Tight generalization error bounds for deep learning
    3.3.1 A milestone: non-vacuous generalization error bounds for deep networks by Dziugaite and Roy [67]
    3.3.2 Bounds with data-dependent priors
    3.3.3 Comparison of the bounds and tight certificates for neural networks [147]

4 PAC-Bayes oracle inequalities and fast rates
  4.1 From empirical inequalities to oracle inequalities
    4.1.1 Bound in expectation
    4.1.2 Bound in probability
  4.2 Bernstein assumption and fast rates
  4.3 Applications of Theorem 4.3
  4.4 Dimension and rate of convergence
  4.5 Getting rid of the log terms: Catoni's localization trick

5 Beyond "bounded loss" and "i.i.d. observations"
  5.1 "Almost" bounded losses (sub-Gaussian and sub-gamma)
    5.1.1 The sub-Gaussian case
    5.1.2 The sub-gamma case
    5.1.3 Remarks on exponential moments
  5.2 Heavy-tailed losses
    5.2.1 The truncation approach
    5.2.2 Bounds based on moment inequalities
    5.2.3 Bounds based on robust losses
  5.3 Dependent observations
    5.3.1 Inequalities for dependent variables
    5.3.2 A simple example
  5.4 Other non-i.i.d. settings
    5.4.1 Non identically distributed observations
    5.4.2 Shift in the distribution
    5.4.3 Meta-learning

6 Related approaches in statistics and machine learning theory
  6.1 Bayesian inference in statistics
    6.1.1 Gibbs posteriors, generalized posteriors
    6.1.2 Contraction of the posterior in Bayesian nonparametrics
    6.1.3 Variational approximations
  6.2 Empirical risk minimization
  6.3 Online learning
    6.3.1 Sequential prediction
    6.3.2 Bandits and reinforcement learning (RL)
  6.4 Aggregation of estimators in statistics
  6.5 Information theoretic approaches
    6.5.1 Minimum description length
    6.5.2 Mutual information bounds (MI)

7 Conclusion
1 Introduction
In a supervised learning problem, such as classification or regression, we are given a data
set, and we 1) fix a set of predictors and 2) find a good predictor in this set.
For example, when doing linear regression, you 1) choose to consider only linear predictors and 2) use the least-squares method to choose your linear predictor.
PAC-Bayes bounds will allow us to define and study "randomized" or "aggregated" predictors. By this, we mean that we will replace 2) by 2') define weights on the predictors and make them vote according to these weights, or by 2'') draw a predictor according to some prescribed probability distribution.
1.1 Machine learning and PAC bounds
1.1.1 Machine learning: notations
We will assume that the reader is already familiar with the setting of supervised learning and the corresponding definitions. We briefly recall the notation used here:
- an object set $\mathcal{X}$: photos, texts, $\mathbb{R}^d$... (equipped with a $\sigma$-algebra $\mathcal{S}_x$);
- a label set $\mathcal{Y}$, usually a finite set for classification problems or the set of real numbers for regression problems (equipped with a $\sigma$-algebra $\mathcal{S}_y$);
- a probability distribution $P$ on $(\mathcal{X} \times \mathcal{Y}, \mathcal{S}_x \otimes \mathcal{S}_y)$, which is not known;
- the data, or observations: $(X_1, Y_1), \dots, (X_n, Y_n)$. From now, and until the end of Section 4, we assume that $(X_1, Y_1), \dots, (X_n, Y_n)$ are i.i.d. from $P$;
- a predictor is a measurable function $f: \mathcal{X} \to \mathcal{Y}$;
- we fix a set of predictors indexed by a parameter set $\Theta$ (equipped with a $\sigma$-algebra $\mathcal{T}$): $\{f_\theta, \theta \in \Theta\}$. In regression, the basic example is $f_\theta(x) = \theta^T x$ for $\mathcal{X} = \Theta = \mathbb{R}^d$. The analogue for classification is:
\[ f_\theta(x) = \begin{cases} 1 & \text{if } \theta^T x \geq 0, \\ 0 & \text{otherwise.} \end{cases} \]
  More sophisticated predictors: the set of all neural networks with a fixed architecture, $\theta$ being the weights of the network;
- a loss function, that is, a measurable function $\ell: \mathcal{Y}^2 \to [0, +\infty)$ with $\ell(y, y) = 0$. In a classification problem, a very common loss function is:
\[ \ell(y, y') = \begin{cases} 1 & \text{if } y \neq y', \\ 0 & \text{if } y = y'. \end{cases} \]
  We will refer to it as the 0-1 loss function, and will use the following shorter notation: $\ell(y, y') = \mathbf{1}(y \neq y')$. However, it is often more convenient to consider convex loss functions, such as $\ell(y, y') = \max(1 - y y', 0)$ (the hinge loss). In regression problems, the most popular examples are $\ell(y, y') = (y - y')^2$, the quadratic loss, or $\ell(y, y') = |y - y'|$, the absolute loss. Note that the original PAC-Bayes bounds in [127] were stated in the special case of the 0-1 loss, and this is also the case of most bounds published since then. However, PAC-Bayes bounds for regression with the quadratic loss were proven in [42], and in many works since then (they will be mentioned later). From now, and until the end of Section 4, we assume that $0 \leq \ell \leq C$. This might be either because we are using the 0-1 loss, or the quadratic loss but in a setting where $f_\theta(x)$ and $y$ are bounded;
- the generalization error of a predictor, or generalization risk, or simply risk:
\[ R(f) = \mathbb{E}_{(X,Y) \sim P}[\ell(f(X), Y)]. \]
  For short, as we will only consider predictors in $\{f_\theta, \theta \in \Theta\}$, we will write $R(\theta) := R(f_\theta)$. This function is not accessible because it depends on the unknown $P$;
- for short, we put $\ell_i(\theta) := \ell(f_\theta(X_i), Y_i) \geq 0$;
- the empirical risk:
\[ r(\theta) = \frac{1}{n} \sum_{i=1}^n \ell_i(\theta) \]
  satisfies
\[ \mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}[r(\theta)] = R(\theta). \]
  Note that the notation for the last expectation is cumbersome. From now, we will write $S = [(X_1, Y_1), \dots, (X_n, Y_n)]$ and $\mathbb{E}_S$ (for "expectation with respect to the sample") instead of $\mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}$. In the same way, we will write $\mathbb{P}_S$;
- an estimator is a function
\[ \hat{\theta}: \bigcup_{n=1}^{\infty} (\mathcal{X} \times \mathcal{Y})^n \to \Theta. \]
  That is, to each possible dataset, of any possible size, it associates a parameter. (It must be such that the restriction of $\hat{\theta}$ to each $(\mathcal{X} \times \mathcal{Y})^n$ is measurable.) For short, we write $\hat{\theta}$ instead of $\hat{\theta}((X_1, Y_1), \dots, (X_n, Y_n))$. The most famous example is the Empirical Risk Minimizer, or ERM:
\[ \hat{\theta}_{\mathrm{ERM}} = \operatorname*{argmin}_{\theta \in \Theta} r(\theta) \]
  (with a convention in case of a tie).
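To make these notations concrete, here is a minimal Python sketch (my own illustration, not from the paper; the data and the candidate set are made up): the linear classifiers $f_\theta(x) = \mathbf{1}(\theta^T x \geq 0)$, the 0-1 loss, the empirical risk $r(\theta)$, and the ERM over a finite set of candidates.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, X):
    # predictor f_theta(x) = 1 if theta^T x >= 0, else 0
    return (X @ theta >= 0).astype(int)

def empirical_risk(theta, X, Y):
    # r(theta) = (1/n) sum_i 1(f_theta(X_i) != Y_i), the 0-1 loss case
    return np.mean(f(theta, X) != Y)

# synthetic data: n = 200 points in dimension d = 2
n, d = 200, 2
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -0.5])
Y = f(theta_star, X)

# a finite set Theta = {theta_1, ..., theta_M} of M = 50 random candidates
Theta = rng.normal(size=(50, d))
risks = np.array([empirical_risk(th, X, Y) for th in Theta])
theta_erm = Theta[np.argmin(risks)]   # the ERM (ties broken by argmin)
print("r(ERM) =", risks.min())
```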
1.1.2 PAC bounds
Of course, our objective is to minimize $R$, not $r$. So the ERM strategy is motivated by the hope that these two functions are not so different, so that the minimizer of $r$ almost minimizes $R$. In what remains of this section, we will check to what extent this is true. By doing so, we will introduce some tools that will be useful when we come to PAC-Bayes bounds.

The first of these tools is a classical result that will be useful throughout this tutorial.
Lemma 1.1 (Hoeffding's inequality) Let $U_1, \dots, U_n$ be independent random variables taking values in an interval $[a, b]$. Then, for any $t > 0$,
\[ \mathbb{E}\left[ e^{t \sum_{i=1}^n [U_i - \mathbb{E}(U_i)]} \right] \leq e^{\frac{n t^2 (b-a)^2}{8}}. \]

The proof can be found for example in Chapter 2 of [37], which is a highly recommended reading, but it is so classical that you can as well find it on Wikipedia.

Fix $\theta \in \Theta$ and apply Hoeffding's inequality with $U_i = \mathbb{E}[\ell_i(\theta)] - \ell_i(\theta)$ to get:
\[ \mathbb{E}_S\left[ e^{t n [R(\theta) - r(\theta)]} \right] \leq e^{\frac{n t^2 C^2}{8}}. \tag{1.1} \]
Now, for any $s > 0$,
\begin{align*}
\mathbb{P}_S(R(\theta) - r(\theta) > s) &= \mathbb{P}_S\left( e^{n t [R(\theta) - r(\theta)]} > e^{n t s} \right) \\
&\leq \frac{\mathbb{E}_S\left[ e^{n t [R(\theta) - r(\theta)]} \right]}{e^{n t s}} \quad \text{by Markov's inequality,} \\
&\leq e^{\frac{n t^2 C^2}{8} - n t s} \quad \text{by (1.1).}
\end{align*}
In other words,
\[ \mathbb{P}_S(R(\theta) > r(\theta) + s) \leq e^{\frac{n t^2 C^2}{8} - n t s}. \]
We can make this bound as tight as possible by optimizing our choice of $t$. Indeed, note that $n t^2 C^2 / 8 - n t s$ is minimized for $t = 4 s / C^2$, which leads to
\[ \mathbb{P}_S(R(\theta) > r(\theta) + s) \leq e^{-\frac{2 n s^2}{C^2}}. \tag{1.2} \]
This means that, for a given $\theta$, the risk $R(\theta)$ cannot be much larger than the corresponding empirical risk $r(\theta)$. The order of this "much larger" can be better understood by introducing
\[ \varepsilon = e^{-\frac{2 n s^2}{C^2}} \]
and substituting $\varepsilon$ for $s$ in (1.2), which gives:
\[ \mathbb{P}_S\left( R(\theta) > r(\theta) + C \sqrt{\frac{\log \frac{1}{\varepsilon}}{2n}} \right) \leq \varepsilon. \tag{1.3} \]
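As a quick sanity check of (1.3), here is a small Monte Carlo sketch (my own illustration, with an arbitrary fixed risk $R(\theta) = 0.3$ and Bernoulli losses): the frequency of samples on which $R(\theta)$ exceeds $r(\theta) + C \sqrt{\log(1/\varepsilon)/(2n)}$ should indeed be below $\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, C = 100, 0.05, 1.0
R_theta = 0.3          # true risk of a fixed theta (losses are Bernoulli(0.3))
threshold = C * np.sqrt(np.log(1 / eps) / (2 * n))

failures = 0
n_rep = 20_000
for _ in range(n_rep):
    r_theta = rng.binomial(n, R_theta) / n   # empirical risk on a fresh sample
    failures += (R_theta > r_theta + threshold)
print(failures / n_rep, "<=", eps)   # observed failure rate vs eps
```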
We see that $R(\theta)$ will usually not exceed $r(\theta)$ by more than a term of order $1/\sqrt{n}$. This is not enough, though, to justify the use of the ERM. Indeed, (1.3) is only true for the $\theta$ that was fixed above, and we cannot apply it to $\hat{\theta}_{\mathrm{ERM}}$, which is a function of the data. In order to study $R(\hat{\theta}_{\mathrm{ERM}})$, we can use
\[ R(\hat{\theta}_{\mathrm{ERM}}) - r(\hat{\theta}_{\mathrm{ERM}}) \leq \sup_{\theta \in \Theta} [R(\theta) - r(\theta)], \tag{1.4} \]
so we need a version of (1.3) that would hold uniformly on $\Theta$.
Let us now assume, until the end of Subsection 1.1, that the set $\Theta$ is finite, that is, $\mathrm{card}(\Theta) = M < +\infty$. Then:
\begin{align*}
\mathbb{P}_S\left( \sup_{\theta \in \Theta} [R(\theta) - r(\theta)] > s \right) &= \mathbb{P}_S\left( \bigcup_{\theta \in \Theta} \left\{ R(\theta) - r(\theta) > s \right\} \right) \\
&\leq \sum_{\theta \in \Theta} \mathbb{P}_S(R(\theta) > r(\theta) + s) \\
&\leq M e^{-\frac{2 n s^2}{C^2}} \tag{1.5}
\end{align*}
thanks to (1.2). This time, put
\[ \varepsilon = M e^{-\frac{2 n s^2}{C^2}} \]
and plug it into (1.5); this gives:
\[ \mathbb{P}_S\left( \sup_{\theta \in \Theta} [R(\theta) - r(\theta)] > C \sqrt{\frac{\log \frac{M}{\varepsilon}}{2n}} \right) \leq \varepsilon. \]
Let us state this conclusion as a theorem (focusing on the complementary event).
Theorem 1.2 Assume that $\mathrm{card}(\Theta) = M < +\infty$. For any $\varepsilon \in (0, 1)$,
\[ \mathbb{P}_S\left( \forall \theta \in \Theta, \; R(\theta) \leq r(\theta) + C \sqrt{\frac{\log \frac{M}{\varepsilon}}{2n}} \right) \geq 1 - \varepsilon. \]

This result indeed motivates the introduction of $\hat{\theta}_{\mathrm{ERM}}$. Indeed, using (1.4), with probability at least $1 - \varepsilon$ we have
\[ R(\hat{\theta}_{\mathrm{ERM}}) \leq r(\hat{\theta}_{\mathrm{ERM}}) + C \sqrt{\frac{\log \frac{M}{\varepsilon}}{2n}} = \inf_{\theta \in \Theta} r(\theta) + C \sqrt{\frac{\log \frac{M}{\varepsilon}}{2n}}, \]
so the ERM satisfies:
\[ \mathbb{P}_S\left( R(\hat{\theta}_{\mathrm{ERM}}) \leq \inf_{\theta \in \Theta} r(\theta) + C \sqrt{\frac{\log \frac{M}{\varepsilon}}{2n}} \right) \geq 1 - \varepsilon. \]
Such a bound is usually called a PAC bound, that is, a Probably Approximately Correct bound. The reason for this terminology, introduced by Valiant in [178], is as follows: Valiant was considering the case where there is a $\theta_0 \in \Theta$ such that $Y_i = f_{\theta_0}(X_i)$ holds almost surely. This means that $R(\theta_0) = 0$ and $r(\theta_0) = 0$, and so
\[ \mathbb{P}_S\left( R(\hat{\theta}_{\mathrm{ERM}}) \leq C \sqrt{\frac{\log \frac{M}{\varepsilon}}{2n}} \right) \geq 1 - \varepsilon, \]
which means that with large Probability, $R(\hat{\theta}_{\mathrm{ERM}})$ is Approximately equal to the Correct value, that is, $0$. Note, however, that this is only informative if $\log(M)/n$ is small, that is, if $M$ is not larger than $\exp(n)$. This $\log(M)$ in the bound is the price to pay to learn which of the $M$ predictors is the best.
Remark 1.1 The proof of Theorem 1.2 used, in addition to Hoeffding's inequality, two tricks that we will reuse many times in this tutorial:
- given a random variable $U$ and $s \in \mathbb{R}$, for any $t > 0$,
\[ \mathbb{P}(U > s) = \mathbb{P}\left( e^{t U} > e^{t s} \right) \leq \frac{\mathbb{E}\left[ e^{t U} \right]}{e^{t s}} \]
  thanks to Markov's inequality. The combo "exponential + Markov inequality" is known as the Chernoff bound. The Chernoff bound is of course very useful together with exponential inequalities like Hoeffding's inequality;
- given a finite number of random variables $U_1, \dots, U_M$,
\[ \mathbb{P}\left( \sup_{1 \leq i \leq M} U_i > s \right) = \mathbb{P}\left( \bigcup_{1 \leq i \leq M} \{ U_i > s \} \right) \leq \sum_{i=1}^M \mathbb{P}(U_i > s). \]
  This argument is called the union-bound argument.
The next step in the study of the ERM would be to go beyond finite sets $\Theta$. The union-bound argument has to be modified in this case, and things become a little more complicated. We will therefore stop the study of the ERM here: it is not our objective anyway.

If the reader is interested in the study of the ERM in general: Vapnik and Chervonenkis developed the theoretical tools for this study in 1969/1970; this is for example presented by Vapnik in [180]. The book [64] is a beautiful and very pedagogical introduction to machine learning theory, and Chapters 11 and 12 in particular are dedicated to Vapnik-Chervonenkis theory.
1.2 What are PAC-Bayes bounds?
I am now in a better position to explain what PAC-Bayes bounds are. A simple way to phrase things: PAC-Bayes bounds are a generalization of the union-bound argument that allows us to deal with any parameter set $\Theta$: finite or infinite, continuous... However, a byproduct of this technique is that we will have to change the notion of estimator.

Definition 1.1 Let $\mathcal{P}(\Theta)$ be the set of all probability distributions on $(\Theta, \mathcal{T})$. A data-dependent probability measure is a function
\[ \hat{\rho}: \bigcup_{n=1}^{\infty} (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{P}(\Theta) \]
with a suitable measurability condition¹. We will write $\hat{\rho}$ instead of $\hat{\rho}((X_1, Y_1), \dots, (X_n, Y_n))$ for short.
In practice, when you have a data-dependent probability measure and you want to build a predictor, you can:
- draw a random parameter $\tilde{\theta} \sim \hat{\rho}$; we will call this procedure a "randomized estimator";
- use it to average predictors, that is, define a new predictor
\[ f_{\hat{\rho}}(\cdot) = \mathbb{E}_{\theta \sim \hat{\rho}}[f_\theta(\cdot)], \]
  called the aggregated predictor with weights $\hat{\rho}$.

So, with PAC-Bayes bounds, we will extend the union-bound argument² to infinite, uncountable sets $\Theta$, but we will obtain bounds on various risks related to data-dependent probability measures, that is:
- the risk of a randomized estimator, $R(\tilde{\theta})$,
- or the average risk of randomized estimators, $\mathbb{E}_{\theta \sim \hat{\rho}}[R(\theta)]$,
- or the risk of the aggregated estimator, $R(f_{\hat{\rho}})$.
You will of course ask the question: if $\Theta$ is infinite, what will become of the $\log(M)$ term in Theorem 1.2 that came from the union bound? In general, this term will be replaced by the Kullback-Leibler divergence between $\hat{\rho}$ and a fixed $\pi$ on $\Theta$.

Definition 1.2 Given two probability measures $\mu$ and $\nu$ in $\mathcal{P}(\Theta)$, the Kullback-Leibler (or simply KL) divergence between $\mu$ and $\nu$ is
\[ KL(\mu \| \nu) = \int \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(\theta) \, \mu(\mathrm{d}\theta) \in [0, +\infty] \]
if $\mu$ has a density $\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$ with respect to $\nu$, and $KL(\mu \| \nu) = +\infty$ otherwise.
Example 1.1 For example, if $\Theta$ is finite,
\[ KL(\mu \| \nu) = \sum_{\theta \in \Theta} \log \left( \frac{\mu(\theta)}{\nu(\theta)} \right) \mu(\theta). \]

¹ I don't want to scare the reader with measurability conditions, as I will not check them in this tutorial anyway. Here, the exact condition to ensure that what follows is well defined is that for any $A \in \mathcal{T}$, the function
\[ ((x_1, y_1), \dots, (x_n, y_n)) \mapsto [\hat{\rho}((x_1, y_1), \dots, (x_n, y_n))](A) \]
is measurable. That is, $\hat{\rho}$ is a regular conditional probability.

² See the title of van Erven's tutorial [179]: "PAC-Bayes mini-tutorial: a continuous union bound". Note, however, that it is argued by Catoni in [43] that PAC-Bayes bounds are actually more than that; we will come back to this in Section 4.
The following result is well known. You can prove it using Jensen's inequality, or use Wikipedia again.

Proposition 1.3 For any probability measures $\mu$ and $\nu$, $KL(\mu \| \nu) \geq 0$, with equality if and only if $\mu = \nu$.
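For readers who like to compute, here is a short sketch (an illustration, for a finite $\Theta$ only) of the KL divergence of Example 1.1, which also checks Proposition 1.3 numerically.

```python
import numpy as np

def kl(mu, nu):
    # KL(mu || nu) on a finite Theta; +inf if mu puts mass where nu does not
    mask = mu > 0
    if np.any(nu[mask] == 0):
        return np.inf
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

nu = np.array([0.25, 0.25, 0.25, 0.25])
mu = np.array([0.70, 0.10, 0.10, 0.10])
print(kl(mu, nu))        # > 0
print(kl(nu, nu))        # = 0: equality is attained only at mu = nu
```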
1.3 Why this tutorial?
Since the "PAC analysis of a Bayesian estimator" by Shawe-Taylor and Williamson [166] and the first PAC-Bayes bounds proven by McAllester [127, 128], many new PAC-Bayes bounds have appeared (we will see that many of them can be derived from Seeger's bound [161]). These bounds were used in various contexts, to solve a wide range of problems. This led to hundreds of (beautiful!) papers. The consequence is that it is quite difficult to be aware of all the existing work on PAC-Bayes bounds.

As a reviewer for ICML or NeurIPS, I very often had to reject papers because they were re-proving already known results, or because they proposed bounds that were weaker than existing ones³. In particular, it seems that many powerful techniques in Catoni's book [43] are still ignored by the community (some were already introduced in earlier works [41, 42]).

On the other hand, it is not always easy to get started with PAC-Bayes bounds. I realize that most papers already assume some basic knowledge of these bounds, and that a monograph like [43] is quite technical to begin with. When an MSc or PhD student asks me for an easy-to-follow introduction to PAC-Bayes, I am never sure what to answer, and usually end up improvising such an introduction for one or two hours, with a piece of chalk and a blackboard. So it came to me recently⁴ that it might be useful to write a beginner-friendly tutorial that I could send instead!

Note that there are already short tutorials on PAC-Bayes bounds, by McAllester and van Erven: [130, 179]. They are very good, and I recommend the reader to check them. However, they are focused on empirical bounds only. There are also surveys on PAC-Bayes bounds, such as Section 5 in [54] or [84]. These papers are very useful to navigate the ocean of publications on PAC-Bayes bounds, and they helped me a lot when I was writing this document. Finally, in order to highlight the main ideas, I will not necessarily try to present the bounds with the tightest possible constants. In particular, many oracle bounds and localized bounds in Section 4 were introduced in [41, 43] with better constants. Thus I strongly recommend reading [43] after this tutorial, as well as the more recent papers mentioned below.
³ I might have made such mistakes myself, and I apologize if it is the case.

⁴ I must confess that I started a first version of this document after two introductory talks at A. Tsybakov's statistics seminar at ENSAE in September-October 2008. Then I got other things to do and I forgot about it. I taught online learning and PAC-Bayes bounds at ENSAE between 2014 and 2019, which made me think again about it. When I joined Emti Khan's group in 2019, I started to think again about such a document, to share it with the members of the group who were willing to learn about PAC-Bayes. Of course, the contents of the document had to be different, because of the enormous amount of very exciting papers published in the meantime. I finally started again from scratch in early 2021.
1.4 Two types of PAC bounds, organization of these notes
It is important to make a distinction between two types of PAC bounds.

Theorem 1.2 is usually referred to as an empirical bound. It means that, for any $\theta$, $R(\theta)$ is upper bounded by an empirical quantity, that is, by something that we can compute when we observe the data. This allows us to define the ERM as the minimizer of this bound. It also provides a numerical certificate on the generalization error of the ERM. You will really end up with something like
\[ \mathbb{P}_S\left( R(\hat{\theta}_{\mathrm{ERM}}) \leq 0.12 \right) \geq 0.99. \]
However, a numerical certificate on the generalization error does not tell you one thing: can this $0.12$ be improved using a larger sample size? Or is it the best that can be done with our set of predictors? The right tools to answer these questions are oracle PAC bounds. In these bounds, you have a control of the form
\[ \mathbb{P}_S\left( R(\hat{\theta}_{\mathrm{ERM}}) \leq \inf_{\theta \in \Theta} R(\theta) + r_n(\varepsilon) \right) \geq 1 - \varepsilon, \]
where the remainder $r_n(\varepsilon) \to 0$ as fast as possible when $n \to \infty$. Of course, the upper bound on $R(\hat{\theta}_{\mathrm{ERM}})$ cannot be computed, because we don't know the function $R$, so it doesn't lead to a numerical certificate. Still, these bounds are very interesting, because they tell you how close you can expect $R(\hat{\theta}_{\mathrm{ERM}})$ to be to the smallest possible value of $R$.
In the same way, there are empirical PAC-Bayes bounds and oracle PAC-Bayes bounds. The very first PAC-Bayes bounds by McAllester [127, 128] were empirical bounds. Later, Catoni [41, 42, 43] proved the first oracle PAC-Bayes bounds.

In some sense, empirical PAC-Bayes bounds are more useful in practice, and oracle PAC-Bayes bounds are theoretical objects. But this might be an oversimplification. I will show that empirical bounds are tools used to prove some oracle bounds, so they are also useful in theory. On the other hand, when we design a data-dependent probability measure, we don't know if it will lead to large or small empirical bounds. A preliminary study of its theoretical properties through an oracle bound is the best way to ensure that it is efficient, and thus that it has a chance to lead to small empirical bounds.
In Section 2, we will study an example of an empirical PAC-Bayes bound, essentially taken from a preprint by Catoni [41]. We will prove it together, play with it, and modify it in many ways. In Section 3, I provide various empirical PAC-Bayes bounds, and explain the race to tighter bounds. This led to bounds that are tight enough to provide good generalization bounds for deep learning; we will discuss this based on Dziugaite and Roy's paper [67] and a more recent work by Pérez-Ortiz, Rivasplata, Shawe-Taylor, and Szepesvári [147].

In Section 4, we will turn to oracle PAC-Bayes bounds. I will explain how to derive these bounds from empirical bounds, and apply them to some classical sets of predictors. We will examine the assumptions leading to fast rates in these inequalities.

Section 5 will be devoted to the various attempts to extend PAC-Bayes bounds beyond the setting introduced in this introduction, that is: bounded loss and i.i.d. observations. Finally, in Section 6, I will briefly discuss the connection between PAC-Bayes bounds and many other approaches in machine learning and statistics, including the recent Mutual Information bounds (MI).
2 First step in the PAC-Bayes world
As mentioned above, there are many PAC-Bayes bounds. I will start in this section with a bound which is essentially due to Catoni in the preprint [41] (the same technique was used in the monograph [43], but with some modifications). Why this choice?

Well, any choice is partly arbitrary: I did my PhD thesis [1] with Olivier Catoni and thus I know his works well. So it's convenient for me. But also, at first, I don't want to provide the best bound here. I want to show how PAC-Bayes bounds work, how to use them, and explain the different variants (bounds on randomized estimators, bounds on aggregated estimators, etc.). It appears that Catoni's technique is extremely convenient to prove almost all the various types of bounds with a unique proof. Later, in Section 3, I will present alternative empirical PAC-Bayes bounds; this will allow you to compare them, and find the pros and cons of each.
2.1 A simple PAC-Bayes bound
2.1.1 Catoni’s bound [41]
From now, and until the end of these notes, let us fix a probability measure $\pi \in \mathcal{P}(\Theta)$. The measure $\pi$ will be called the prior, because of a connection with Bayesian statistics that will be discussed in Section 6.

Theorem 2.1 For any $\lambda > 0$, for any $\varepsilon \in (0, 1)$,
\[ \mathbb{P}_S\left( \forall \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] \leq \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right) \geq 1 - \varepsilon. \]
Let us prove Theorem 2.1. The proof requires a lemma that will be extremely useful in all these notes. This lemma has been known since Kullback [104] in the case of a finite $\Theta$, but the general case is due to Donsker and Varadhan [65].

Lemma 2.2 (Donsker and Varadhan's variational formula) For any measurable, bounded function $h: \Theta \to \mathbb{R}$ we have:
\[ \log \mathbb{E}_{\theta \sim \pi}\left[ e^{h(\theta)} \right] = \sup_{\rho \in \mathcal{P}(\Theta)} \left[ \mathbb{E}_{\theta \sim \rho}[h(\theta)] - KL(\rho \| \pi) \right]. \]
Moreover, the supremum with respect to $\rho$ in the right-hand side is reached for the Gibbs measure $\pi_h$ defined by its density with respect to $\pi$,
\[ \frac{\mathrm{d}\pi_h}{\mathrm{d}\pi}(\theta) = \frac{e^{h(\theta)}}{\mathbb{E}_{\vartheta \sim \pi}\left[ e^{h(\vartheta)} \right]}. \tag{2.1} \]

Proof of Lemma 2.2. Using the definition, just check that for any $\rho \in \mathcal{P}(\Theta)$,
\[ KL(\rho \| \pi_h) = -\mathbb{E}_{\theta \sim \rho}[h(\theta)] + KL(\rho \| \pi) + \log \mathbb{E}_{\theta \sim \pi}\left[ e^{h(\theta)} \right]. \]
Thanks to Proposition 1.3, the left-hand side is nonnegative, and equal to $0$ only when $\rho = \pi_h$. □
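Lemma 2.2 is easy to check numerically on a finite $\Theta$. The following sketch (an illustration of mine, not from the paper) compares $\log \mathbb{E}_{\theta \sim \pi}[e^{h(\theta)}]$ with $\mathbb{E}_{\theta \sim \rho}[h(\theta)] - KL(\rho \| \pi)$ at $\rho = \pi_h$, where the two sides coincide, and at an arbitrary $\rho$, where the left-hand side dominates.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 5
pi = np.full(M, 1 / M)               # uniform prior on a finite Theta
h = rng.normal(size=M)               # a bounded function h: Theta -> R

lhs = np.log(np.sum(pi * np.exp(h))) # log E_pi[e^h]

def kl(mu, nu):
    return float(np.sum(mu * np.log(mu / nu)))

pi_h = pi * np.exp(h); pi_h /= pi_h.sum()         # Gibbs measure (2.1)
print(lhs, "=", np.sum(pi_h * h) - kl(pi_h, pi))  # equality at rho = pi_h

rho = rng.dirichlet(np.ones(M))                   # any other rho
print(lhs, ">=", np.sum(rho * h) - kl(rho, pi))   # strict in general
```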
Proof of Theorem 2.1. The beginning of the proof follows closely the study of the ERM and the proof of Theorem 1.2. Fix $\theta \in \Theta$ and apply Hoeffding's inequality with $U_i = \mathbb{E}[\ell_i(\theta)] - \ell_i(\theta)$: for any $t > 0$,
\[ \mathbb{E}_S\left[ e^{t n [R(\theta) - r(\theta)]} \right] \leq e^{\frac{n t^2 C^2}{8}}. \]
We put $t = \lambda / n$, which gives:
\[ \mathbb{E}_S\left[ e^{\lambda [R(\theta) - r(\theta)]} \right] \leq e^{\frac{\lambda^2 C^2}{8n}}. \]
This is where the proof diverges from the proof of Theorem 1.2. We will now integrate this bound with respect to $\pi$:
\[ \mathbb{E}_{\theta \sim \pi} \mathbb{E}_S\left[ e^{\lambda [R(\theta) - r(\theta)]} \right] \leq e^{\frac{\lambda^2 C^2}{8n}}. \]
Thanks to Fubini's theorem, we can exchange the integration with respect to $\theta$ and the one with respect to the sample:
\[ \mathbb{E}_S \mathbb{E}_{\theta \sim \pi}\left[ e^{\lambda [R(\theta) - r(\theta)]} \right] \leq e^{\frac{\lambda^2 C^2}{8n}}, \tag{2.2} \]
and we apply Donsker and Varadhan's variational formula (Lemma 2.2) to get:
\[ \mathbb{E}_S\left[ e^{\sup_{\rho \in \mathcal{P}(\Theta)} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi)} \right] \leq e^{\frac{\lambda^2 C^2}{8n}}. \]
Rearranging terms:
\[ \mathbb{E}_S\left[ e^{\sup_{\rho \in \mathcal{P}(\Theta)} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n}} \right] \leq 1. \tag{2.3} \]
The end of the proof uses the Chernoff bound. Fix $s > 0$:
\begin{align*}
\mathbb{P}_S\left[ \sup_{\rho \in \mathcal{P}(\Theta)} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n} > s \right]
&\leq \mathbb{E}_S\left[ e^{\sup_{\rho \in \mathcal{P}(\Theta)} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n}} \right] e^{-s} \\
&\leq e^{-s}.
\end{align*}
Solve $e^{-s} = \varepsilon$, that is, put $s = \log(1/\varepsilon)$, to get
\[ \mathbb{P}_S\left[ \sup_{\rho \in \mathcal{P}(\Theta)} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n} > \log \frac{1}{\varepsilon} \right] \leq \varepsilon. \]
Rearranging terms gives:
\[ \mathbb{P}_S\left[ \exists \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] > \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right] \leq \varepsilon. \]
Take the complement to end the proof. □
2.1.2 Exact minimization of the bound
We recall that the bound in Theorem 1.2,
\[ \mathbb{P}_S\left( \forall \theta \in \Theta, \; R(\theta) \leq r(\theta) + C \sqrt{\frac{\log \frac{M}{\varepsilon}}{2n}} \right) \geq 1 - \varepsilon, \]
motivated the introduction of $\hat{\theta}_{\mathrm{ERM}}$, the minimizer of $r$.

Exactly in the same way, the bound in Theorem 2.1,
\[ \mathbb{P}_S\left( \forall \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] \leq \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right) \geq 1 - \varepsilon, \]
motivates the study of a data-dependent probability measure $\hat{\rho}_\lambda$ defined as:
\[ \hat{\rho}_\lambda = \operatorname*{argmin}_{\rho \in \mathcal{P}(\Theta)} \; \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{KL(\rho \| \pi)}{\lambda}. \]
But does such a minimizer exist? It turns out that the answer is yes, thanks to Donsker and Varadhan's variational formula again! Indeed, to minimize
\[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{KL(\rho \| \pi)}{\lambda} \]
is equivalent to maximizing
\[ -\lambda \mathbb{E}_{\theta \sim \rho}[r(\theta)] - KL(\rho \| \pi), \]
which is exactly what the variational formula does, with $h(\theta) = -\lambda r(\theta)$. We know that the minimum is reached for $\rho = \pi_{-\lambda r}$ as defined in (2.1). Let us summarize this in the following definition and corollary.
Definition 2.1 In the whole tutorial, we will let $\hat{\rho}_\lambda$ denote "the Gibbs posterior" given by $\hat{\rho}_\lambda = \pi_{-\lambda r}$, that is:
\[ \hat{\rho}_\lambda(\mathrm{d}\theta) = \frac{e^{-\lambda r(\theta)} \pi(\mathrm{d}\theta)}{\mathbb{E}_{\vartheta \sim \pi}\left[ e^{-\lambda r(\vartheta)} \right]}. \tag{2.4} \]

Corollary 2.3 The Gibbs posterior is the minimizer of the right-hand side of Theorem 2.1:
\[ \hat{\rho}_\lambda = \operatorname*{argmin}_{\rho \in \mathcal{P}(\Theta)} \; \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{KL(\rho \| \pi)}{\lambda}. \]
As a consequence, for any $\lambda > 0$, for any $\varepsilon \in (0, 1)$,
\[ \mathbb{P}_S\left( \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{\rho \in \mathcal{P}(\Theta)} \left[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right] \right) \geq 1 - \varepsilon. \]
14
2.1.3 Some examples, and non-exact minimization of the bound
When you see something like
\[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda}, \]
I'm not sure you immediately see the order of magnitude of the bound. I don't. In general, when you apply such a general bound to a set of predictors, I think it is quite important to make the bound more explicit. Let us work through a few examples (I advise you to do the calculations on your own, in these examples and in others).
Example 2.1 (Finite case) Let us start with the special case where $\Theta$ is a finite set, that is, $\mathrm{card}(\Theta) = M < +\infty$. We begin with the application of Corollary 2.3. In this case, the Gibbs posterior $\hat{\rho}_\lambda$ of (2.4) is a probability on the finite set $\Theta$ given by
\[ \hat{\rho}_\lambda(\theta) = \frac{e^{-\lambda r(\theta)} \pi(\theta)}{\sum_{\vartheta \in \Theta} e^{-\lambda r(\vartheta)} \pi(\vartheta)}, \]
and we have, with probability at least $1 - \varepsilon$:
\[ \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{\rho \in \mathcal{P}(\Theta)} \left[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right]. \tag{2.5} \]
As the bound holds for all $\rho \in \mathcal{P}(\Theta)$, it holds in particular for all $\rho$ in the set of Dirac masses $\{\delta_\theta, \theta \in \Theta\}$. Obviously,
\[ \mathbb{E}_{\vartheta \sim \delta_\theta}[r(\vartheta)] = r(\theta) \]
and
\[ KL(\delta_\theta \| \pi) = \sum_{\vartheta \in \Theta} \log \left( \frac{\delta_\theta(\vartheta)}{\pi(\vartheta)} \right) \delta_\theta(\vartheta) = \log \frac{1}{\pi(\theta)}, \]
so the bound becomes:
\[ \mathbb{P}_S\left( \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{\theta \in \Theta} \left[ r(\theta) + \frac{\lambda C^2}{8n} + \frac{\log \frac{1}{\pi(\theta)} + \log \frac{1}{\varepsilon}}{\lambda} \right] \right) \geq 1 - \varepsilon, \tag{2.6} \]
with the convention $\log(1/0) = +\infty$. This gives us an intuition on the role of the measure $\pi$: the bound will be tighter for $\theta$'s such that $\pi(\theta)$ is large. However, $\pi$ cannot be large everywhere: it is a probability distribution, so it must satisfy
\[ \sum_{\theta \in \Theta} \pi(\theta) = 1. \]
The larger the set $\Theta$, the more this total sum of $1$ will be spread out, which will lead to large values of $\log(1/\pi(\theta))$.

If $\pi$ is the uniform probability distribution, then $\log(1/\pi(\theta)) = \log(M)$, and the bound becomes:
\[ \mathbb{P}_S\left( \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{\theta \in \Theta} r(\theta) + \frac{\lambda C^2}{8n} + \frac{\log \frac{M}{\varepsilon}}{\lambda} \right) \geq 1 - \varepsilon. \]
The choice $\lambda = \frac{1}{C} \sqrt{8 n \log(M/\varepsilon)}$ actually minimizes the right-hand side; this gives:
\[ \mathbb{P}_S\left( \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{\theta \in \Theta} r(\theta) + C \sqrt{\frac{\log \frac{M}{\varepsilon}}{2n}} \right) \geq 1 - \varepsilon. \tag{2.7} \]
That is, the Gibbs posterior $\hat{\rho}_\lambda$ satisfies the same bound as the ERM in Theorem 1.2. Note that the optimization with respect to $\lambda$ is a little more problematic when $\pi$ is not uniform, because the optimal $\lambda$ would depend on $\theta$. We will come back to the choice of $\lambda$ in the general case soon.
Let us also consider the statement of Theorem 2.1 in this case. With probability at least $1 - \varepsilon$, we have:
\[ \forall \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] \leq \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda}. \]
Let us apply this bound to any $\rho$ in the set of Dirac masses $\{\delta_\theta, \theta \in \Theta\}$. This gives:
\[ \forall \theta \in \Theta, \; R(\theta) \leq r(\theta) + \frac{\lambda C^2}{8n} + \frac{\log \frac{1}{\pi(\theta)} + \log \frac{1}{\varepsilon}}{\lambda}, \]
and, when $\pi$ is uniform:
\[ \forall \theta \in \Theta, \; R(\theta) \leq r(\theta) + \frac{\lambda C^2}{8n} + \frac{\log \frac{M}{\varepsilon}}{\lambda}. \]
As this bound holds for any $\theta$, it holds in particular for the ERM, which gives:
\[ R(\hat{\theta}_{\mathrm{ERM}}) \leq \inf_{\theta \in \Theta} r(\theta) + \frac{\lambda C^2}{8n} + \frac{\log \frac{M}{\varepsilon}}{\lambda}, \]
and, once again with the choice $\lambda = \frac{1}{C} \sqrt{8 n \log(M/\varepsilon)}$, we recover exactly the result of Theorem 1.2:
\[ R(\hat{\theta}_{\mathrm{ERM}}) \leq \inf_{\theta \in \Theta} r(\theta) + C \sqrt{\frac{\log \frac{M}{\varepsilon}}{2n}}. \]
The previous example leads to important remarks:
- PAC-Bayes bounds can be used to prove generalization bounds for Gibbs posteriors, but sometimes they can also be used to study more classical estimators, like the ERM. Many of the recent papers by Catoni and co-authors study robust non-Bayesian estimators thanks to sophisticated PAC-Bayes bounds [45];
- the choice of $\lambda$ has a different status when you study the Gibbs posterior $\hat{\rho}_\lambda$ and the ERM. Indeed, in the bound on the ERM, $\lambda$ is chosen to minimize the bound, but the estimation procedure is not affected by $\lambda$. The bound for the Gibbs posterior is also minimized with respect to $\lambda$, but $\hat{\rho}_\lambda$ depends on $\lambda$. So, if you make a mistake when choosing $\lambda$, this will have bad consequences not only on the bound, but also on the practical performance of the method. This also means that if the bound is not tight, it is likely that the $\lambda$ obtained by minimizing the bound will not lead to good performance in practice. (As you will see soon, we present in Section 3 bounds that do not depend on a parameter like $\lambda$.)
Example 2.2 (Lipschitz loss and Gaussian priors) Let us switch to the continuous case, so that we can derive from PAC-Bayes bounds some results that we wouldn't be able to derive from a union-bound argument. We consider the case where $\Theta = \mathbb{R}^d$, the function $\theta \mapsto \ell(f_\theta(x), y)$ is $L$-Lipschitz for any $x$ and $y$, and the prior $\pi$ is a centered Gaussian: $\pi = \mathcal{N}(0, \sigma^2 I_d)$, where $I_d$ is the $d \times d$ identity matrix.

Let us, as in the previous example, first study the Gibbs posterior, by an application of Corollary 2.3. With probability at least $1 - \varepsilon$,
\[ \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{\rho \in \mathcal{P}(\Theta)} \left[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right]. \]
Once again, the right-hand side is an infimum over all possible probability distributions $\rho$, but it is easier to restrict it to Gaussian distributions here. So:
\[ \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{\substack{\rho = \mathcal{N}(m, s^2 I_d) \\ m \in \mathbb{R}^d, \, s > 0}} \left[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right]. \tag{2.8} \]
Indeed, it is well known that, for $\rho = \mathcal{N}(m, s^2 I_d)$ and $\pi = \mathcal{N}(0, \sigma^2 I_d)$,
\[ KL(\rho \| \pi) = \frac{\|m\|^2}{2 \sigma^2} + \frac{d}{2} \left[ \frac{s^2}{\sigma^2} + \log \frac{\sigma^2}{s^2} - 1 \right]. \]
Moreover, the empirical risk $r$ inherits the Lipschitz property of the loss, that is, for any $(\theta, \vartheta) \in \Theta^2$, $r(\theta) \leq r(\vartheta) + L \|\vartheta - \theta\|$. So, for $\rho = \mathcal{N}(m, s^2 I_d)$,
\begin{align*}
\mathbb{E}_{\theta \sim \rho}[r(\theta)] &\leq r(m) + L \, \mathbb{E}_{\theta \sim \rho}[\|\theta - m\|] \\
&\leq r(m) + L \sqrt{\mathbb{E}_{\theta \sim \rho}[\|\theta - m\|^2]} \quad \text{by Jensen's inequality} \\
&= r(m) + L s \sqrt{d}.
\end{align*}
Plugging this into (2.8) gives:
\[ \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{m \in \mathbb{R}^d, \, s > 0} \left[ r(m) + L s \sqrt{d} + \frac{\lambda C^2}{8n} + \frac{\frac{\|m\|^2}{2 \sigma^2} + \frac{d}{2} \left[ \frac{s^2}{\sigma^2} + \log \frac{\sigma^2}{s^2} - 1 \right] + \log \frac{1}{\varepsilon}}{\lambda} \right]. \]
It is possible to minimize the bound completely in $s$, but for now we will just consider the choice $s = \sigma / \sqrt{n}$, which gives:
\begin{align*}
\mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] &\leq \inf_{m \in \mathbb{R}^d} \left[ r(m) + L \sigma \sqrt{\frac{d}{n}} + \frac{\lambda C^2}{8n} + \frac{\frac{\|m\|^2}{2 \sigma^2} + \frac{d}{2} \left( \frac{1}{n} - 1 + \log(n) \right) + \log \frac{1}{\varepsilon}}{\lambda} \right] \\
&\leq \inf_{m \in \mathbb{R}^d} \left[ r(m) + L \sigma \sqrt{\frac{d}{n}} + \frac{\lambda C^2}{8n} + \frac{\frac{\|m\|^2}{2 \sigma^2} + \frac{d}{2} \log(n) + \log \frac{1}{\varepsilon}}{\lambda} \right].
\end{align*}
It is not possible to optimize the bound with respect to $\lambda$, as the optimal value would depend on $m$... However, a way to understand the bound (by making it worse!) is to restrict the infimum on $m$ to $\|m\| \leq B$ for some $B > 0$. Then we have:
\[ \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{m : \|m\| \leq B} r(m) + L \sigma \sqrt{\frac{d}{n}} + \frac{\lambda C^2}{8n} + \frac{\frac{B^2}{2 \sigma^2} + \frac{d}{2} \log(n) + \log \frac{1}{\varepsilon}}{\lambda}. \]
In this case, we see that the optimal $\lambda$ is
\[ \lambda = \frac{1}{C} \sqrt{8 n \left[ \frac{B^2}{2 \sigma^2} + \frac{d}{2} \log(n) + \log \frac{1}{\varepsilon} \right]}, \]
which gives:
\[ \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{m : \|m\| \leq B} r(m) + L \sigma \sqrt{\frac{d}{n}} + C \sqrt{\frac{\frac{B^2}{2 \sigma^2} + \frac{d}{2} \log(n) + \log \frac{1}{\varepsilon}}{2n}}. \]
Note that our choice of $\lambda$ might look a bit weird, as it depends on the confidence level $\varepsilon$. This can be avoided by taking
\[ \lambda = \frac{1}{C} \sqrt{8 n \left[ \frac{B^2}{2 \sigma^2} + \frac{d}{2} \log(n) \right]} \]
instead (check what bound you obtain by doing so!).

Finally, as in the previous example, we can also start from the statement of Theorem 2.1: with probability at least $1 - \varepsilon$,
\[ \forall \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] \leq \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda}, \]
and restrict here $\rho$ to the set of Gaussian distributions $\mathcal{N}(m, s^2 I_d)$. This leads to the definition of a new data-dependent probability measure, $\tilde{\rho}_\lambda = \mathcal{N}(\tilde{m}, \tilde{s}^2 I_d)$, where
\[ (\tilde{m}, \tilde{s}) = \operatorname*{argmin}_{m \in \mathbb{R}^d, \, s > 0} \; \mathbb{E}_{\theta \sim \mathcal{N}(m, s^2 I_d)}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{\frac{\|m\|^2}{2 \sigma^2} + \frac{d}{2} \left[ \frac{s^2}{\sigma^2} + \log \frac{\sigma^2}{s^2} - 1 \right] + \log \frac{1}{\varepsilon}}{\lambda}. \]
While the Gibbs posterior $\hat{\rho}_\lambda$ can be quite a complicated object, one simply has to solve this minimization problem to get $\tilde{\rho}_\lambda$. The probability measure $\tilde{\rho}_\lambda$ is actually a special case of what is called a variational approximation of $\hat{\rho}_\lambda$. Variational approximations are very popular in statistics and machine learning, and were indeed analyzed through PAC-Bayes bounds [9, 8, 190]. We will come back to this in Section 6. For now, following the same computations, and using the same choice of $\lambda$ as for $\hat{\rho}_\lambda$, we obtain the same bound:
\[ \mathbb{E}_{\theta \sim \tilde{\rho}_\lambda}[R(\theta)] \leq \inf_{m : \|m\| \leq B} r(m) + L \sigma \sqrt{\frac{d}{n}} + C \sqrt{\frac{\frac{B^2}{2 \sigma^2} + \frac{d}{2} \log(n) + \log \frac{1}{\varepsilon}}{2n}}. \]
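The Gaussian computations of this example are straightforward to implement. Here is a sketch (all numerical values, including the placeholder value $r(m) = 0.2$, are arbitrary illustrations of mine) of the closed-form KL term and of the resulting empirical part of the bound:

```python
import numpy as np

def kl_gauss(m, s, sigma):
    # KL( N(m, s^2 I_d) || N(0, sigma^2 I_d) ), closed form used in Example 2.2
    d = m.shape[0]
    return (np.dot(m, m) / (2 * sigma**2)
            + 0.5 * d * (s**2 / sigma**2 + np.log(sigma**2 / s**2) - 1))

def bound(r_m, m, s, sigma, L, lam, n, C, eps):
    # r(m) + L*s*sqrt(d) upper-bounds E_{theta ~ N(m, s^2 I)}[r(theta)]
    d = m.shape[0]
    return (r_m + L * s * np.sqrt(d) + lam * C**2 / (8 * n)
            + (kl_gauss(m, s, sigma) + np.log(1 / eps)) / lam)

d, n, C, L, sigma, eps = 50, 10_000, 1.0, 1.0, 1.0, 0.05
m = np.zeros(d)
s = sigma / np.sqrt(n)                       # the choice made in the example
lam = np.sqrt(8 * n * (kl_gauss(m, s, sigma) + np.log(1 / eps))) / C
print(bound(0.2, m, s, sigma, L, lam, n, C, eps))  # with r(m) = 0.2, say
```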
Example 2.3 (Model aggregation, model selection) In the case where we have many sets of predictors, say $\Theta_1, \dots, \Theta_M$, equipped with priors $\pi_1, \dots, \pi_M$ respectively, it is possible to define a prior on $\Theta = \bigcup_{j=1}^M \Theta_j$. For the sake of simplicity, assume that the $\Theta_j$'s are disjoint, and let $p = (p(1), \dots, p(M))$ be a probability distribution on $\{1, \dots, M\}$. We simply put:
\[ \pi = \sum_{j=1}^M p(j) \, \pi_j. \]
The minimization of the bound in Theorem 2.1 leads to the Gibbs posterior $\hat{\rho}_\lambda$, which will in general put mass on all the $\Theta_j$, so this is a model aggregation procedure in the spirit of [128]. On the other hand, we can also restrict the minimization in the PAC-Bayes bound to distributions that charge only one of the models, that is, to $\rho \in \mathcal{P}(\Theta_1) \cup \dots \cup \mathcal{P}(\Theta_M)$. Theorem 2.1 becomes:
\[ \mathbb{P}_S\left( \forall j \in \{1, \dots, M\}, \, \forall \rho \in \mathcal{P}(\Theta_j), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] \leq \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right) \geq 1 - \varepsilon, \]
that is,
\[ \mathbb{P}_S\left( \forall j \in \{1, \dots, M\}, \, \forall \rho \in \mathcal{P}(\Theta_j), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] \leq \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi_j) + \log \frac{1}{p(j)} + \log \frac{1}{\varepsilon}}{\lambda} \right) \geq 1 - \varepsilon. \]
Thus, we can propose the following procedure (a numerical sketch is given after this example):
- first, we build the Gibbs posterior for each model $j$:
\[ \hat{\rho}_\lambda^{(j)}(\mathrm{d}\theta) = \frac{e^{-\lambda r(\theta)} \pi_j(\mathrm{d}\theta)}{\int_{\Theta_j} e^{-\lambda r(\vartheta)} \pi_j(\mathrm{d}\vartheta)}; \]
- then, model selection:
\[ \hat{j} = \operatorname*{argmin}_{1 \leq j \leq M} \left\{ \mathbb{E}_{\theta \sim \hat{\rho}_\lambda^{(j)}}[r(\theta)] + \frac{KL(\hat{\rho}_\lambda^{(j)} \| \pi_j) + \log \frac{1}{p(j)}}{\lambda} \right\}. \]
The obtained $\hat{j}$ satisfies:
\[ \mathbb{P}_S\left( \mathbb{E}_{\theta \sim \hat{\rho}_\lambda^{(\hat{j})}}[R(\theta)] \leq \min_{1 \leq j \leq M} \inf_{\rho \in \mathcal{P}(\Theta_j)} \left\{ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi_j) + \log \frac{1}{p(j)} + \log \frac{1}{\varepsilon}}{\lambda} \right\} \right) \geq 1 - \varepsilon. \]
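Here is the promised numerical sketch of this two-step procedure, on finite models with uniform priors inside each model (the risks and all sizes are made up for illustration):

```python
import numpy as np

def gibbs(r, pi, lam):
    # per-model Gibbs posterior; shift by min(r) for numerical stability
    w = pi * np.exp(-lam * (r - r.min()))
    return w / w.sum()

def kl(mu, nu):
    m = mu > 0
    return float(np.sum(mu[m] * np.log(mu[m] / nu[m])))

rng = np.random.default_rng(4)
lam, M = 200.0, 3
# M models of random (finite) sizes, with made-up empirical risks
models = [rng.uniform(0.2, 0.6, size=rng.integers(10, 50)) for _ in range(M)]
p = np.full(M, 1 / M)                      # prior weights of the models

scores = []
for j, r in enumerate(models):
    pi_j = np.full(len(r), 1 / len(r))     # uniform prior inside model j
    rho_j = gibbs(r, pi_j, lam)
    scores.append(np.sum(rho_j * r)
                  + (kl(rho_j, pi_j) + np.log(1 / p[j])) / lam)
j_hat = int(np.argmin(scores))             # selected model
print(j_hat, scores)
```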
2.1.4 The choice of λ
As discussed earlier, it is in general not possible to optimize the right-hand side of the PAC-Bayes inequality with respect to $\lambda$. For example, in (2.5), the optimal value of $\lambda$ could depend on $\rho$, which is not allowed by Theorem 2.1. In the previous examples, we have seen that in some situations, if one is lucky enough, the optimal $\lambda$ actually does not depend on $\rho$, but we still need a procedure for the general case.

A natural idea is to propose a finite grid $\Lambda \subset (0, +\infty)$ and to minimize over this grid, which can be justified by a union-bound argument.

Theorem 2.4 Let $\Lambda \subset (0, +\infty)$ be a finite set. For any $\varepsilon \in (0, 1)$,
\[ \mathbb{P}_S\left( \forall \rho \in \mathcal{P}(\Theta), \, \forall \lambda \in \Lambda, \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] \leq \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda} \right) \geq 1 - \varepsilon. \]
Proof. Fix $\lambda \in \Lambda$, and then follow the proof of Theorem 2.1 until (2.3):
\[ \mathbb{E}_S\left[ e^{\sup_{\rho \in \mathcal{P}(\Theta)} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n}} \right] \leq 1. \]
Sum over $\lambda \in \Lambda$ to get:
\[ \sum_{\lambda \in \Lambda} \mathbb{E}_S\left[ e^{\sup_{\rho \in \mathcal{P}(\Theta)} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n}} \right] \leq \mathrm{card}(\Lambda), \]
and so
\[ \mathbb{E}_S\left[ e^{\sup_{\rho \in \mathcal{P}(\Theta), \, \lambda \in \Lambda} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n}} \right] \leq \mathrm{card}(\Lambda). \]
The end of the proof is as for Theorem 2.1: we start with the Chernoff bound. Fix $s > 0$,
\begin{align*}
\mathbb{P}_S\left[ \sup_{\rho \in \mathcal{P}(\Theta), \, \lambda \in \Lambda} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n} > s \right]
&\leq \mathbb{E}_S\left[ e^{\sup_{\rho \in \mathcal{P}(\Theta), \, \lambda \in \Lambda} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n}} \right] e^{-s} \\
&\leq \mathrm{card}(\Lambda) \, e^{-s}.
\end{align*}
Solve $\mathrm{card}(\Lambda) e^{-s} = \varepsilon$, that is, put $s = \log(\mathrm{card}(\Lambda)/\varepsilon)$, to get
\[ \mathbb{P}_S\left[ \sup_{\rho \in \mathcal{P}(\Theta), \, \lambda \in \Lambda} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n} > \log \frac{\mathrm{card}(\Lambda)}{\varepsilon} \right] \leq \varepsilon. \]
Rearranging terms gives:
\[ \mathbb{P}_S\left[ \exists \rho \in \mathcal{P}(\Theta), \, \exists \lambda \in \Lambda, \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] > \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda} \right] \leq \varepsilon. \]
Take the complement to get the statement of the theorem. □
This leads to the following procedure. First, we recall that, for a fixed $\lambda$, the minimizer of the bound is $\hat{\rho}_\lambda = \pi_{-\lambda r}$. Then, we put $\hat{\rho} = \hat{\rho}_{\hat{\lambda}}$, where
\[ \hat{\lambda} = \operatorname*{argmin}_{\lambda \in \Lambda} \left\{ \mathbb{E}_{\theta \sim \pi_{-\lambda r}}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\pi_{-\lambda r} \| \pi) + \log \frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda} \right\}. \tag{2.9} \]
We immediately have the following result.

Corollary 2.5 Define $\hat{\rho}$ as in (2.9); for any $\varepsilon \in (0, 1)$ we have
\[ \mathbb{P}_S\left( \mathbb{E}_{\theta \sim \hat{\rho}}[R(\theta)] \leq \inf_{\rho \in \mathcal{P}(\Theta), \, \lambda \in \Lambda} \left[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda} \right] \right) \geq 1 - \varepsilon. \]

We could for example propose an arithmetic grid $\Lambda = \{1, 2, \dots, n\}$. The bound in Corollary 2.5 becomes:
\[ \mathbb{E}_{\theta \sim \hat{\rho}}[R(\theta)] \leq \inf_{\rho \in \mathcal{P}(\Theta), \, \lambda \in \{1, \dots, n\}} \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{n}{\varepsilon}}{\lambda}. \]
It is also possible to turn the optimization on a discrete grid into an optimization on a continuous range. Indeed, for any $\lambda \in [1, n]$, we simply apply the bound to the integer part $\lfloor \lambda \rfloor$ of $\lambda$, and remark that we can use the upper bounds $\lfloor \lambda \rfloor \leq \lambda$ and $1/\lfloor \lambda \rfloor \leq 1/(\lambda - 1)$. So the bound becomes:
\[ \mathbb{E}_{\theta \sim \hat{\rho}}[R(\theta)] \leq \inf_{\rho \in \mathcal{P}(\Theta), \, \lambda \in [1, n]} \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{n}{\varepsilon}}{\lambda - 1}. \]
The arithmetic grid is not the best choice, though: the $\log(n)$ term can be improved. In order to optimize hyperparameters in PAC-Bayes bounds, Langford and Caruana [106] used a geometric grid $\Lambda = \{e^k, k \in \mathbb{N}\} \cap [1, n]$; the same choice was used later by Catoni [41, 43]. Using such a grid in Corollary 2.5, we get
\[ \mathbb{E}_{\theta \sim \hat{\rho}}[R(\theta)] \leq \inf_{\rho \in \mathcal{P}(\Theta), \, \lambda \in [1, n]} \left[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1 + \log n}{\varepsilon}}{\lambda / e} \right]. \]
We conclude this discussion on the choice of $\lambda$ by mentioning that there are other PAC-Bayes bounds, for example McAllester's bound [128], where there is no parameter $\lambda$ to optimize. We will study these bounds in Section 3.
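The grid-based procedure of Corollary 2.5 is easy to implement on a finite $\Theta$. A sketch (with illustrative values; the geometric grid is the $\{e^k\} \cap [1, n]$ of Langford and Caruana mentioned above):

```python
import numpy as np

def gibbs(r, pi, lam):
    w = pi * np.exp(-lam * (r - r.min()))   # shift by min(r) for stability
    return w / w.sum()

def kl(mu, nu):
    m = mu > 0
    return float(np.sum(mu[m] * np.log(mu[m] / nu[m])))

rng = np.random.default_rng(5)
M, n, C, eps = 100, 1000, 1.0, 0.05
r = rng.uniform(0.25, 0.6, size=M)          # made-up empirical risks
pi = np.full(M, 1 / M)

grid = [np.exp(k) for k in range(int(np.log(n)) + 1)]   # {e^k} in [1, n]

def criterion(lam):
    # the quantity minimized in (2.9)
    rho = gibbs(r, pi, lam)
    return (np.sum(rho * r) + lam * C**2 / (8 * n)
            + (kl(rho, pi) + np.log(len(grid) / eps)) / lam)

lam_hat = min(grid, key=criterion)
print(lam_hat, criterion(lam_hat))
```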
2.2 PAC-Bayes bound on aggregation of predictors
In the introduction, right after Definition 1.1, I promised that PAC-Bayes bounds would allow us to control:
- the risk of randomized predictors,
- the expected risk of randomized predictors,
- the risk of averaged predictors.

But so far, we have only focused on the expected risk of randomized predictors (the second bullet point). In this subsection, we provide some bounds on averaged predictors, and in the next one, we will focus on the risk of randomized predictors.

We start with a very simple remark. When the loss function $u \mapsto \ell(u, y)$ is convex for any $y$, then the risk $R(\theta) = R(f_\theta)$ is a convex function of $f_\theta$. Thus, Jensen's inequality ensures:
\[ \mathbb{E}_{\theta \sim \rho}[R(f_\theta)] \geq R\left( \mathbb{E}_{\theta \sim \rho}[f_\theta] \right). \]
Plugging this into Corollary 2.3 immediately gives the following result.

Corollary 2.6 Assume that for any $y \in \mathcal{Y}$, $u \mapsto \ell(u, y)$ is convex. Define
\[ \hat{f}_{\hat{\rho}_\lambda}(\cdot) = \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[f_\theta(\cdot)]. \]
For any $\lambda > 0$, for any $\varepsilon \in (0, 1)$,
\[ \mathbb{P}_S\left( R(\hat{f}_{\hat{\rho}_\lambda}) \leq \inf_{\rho \in \mathcal{P}(\Theta)} \left[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right] \right) \geq 1 - \varepsilon. \]

That is, in the case of a convex loss function, like the quadratic loss or the hinge loss, PAC-Bayes bounds also provide bounds on the risk of aggregated predictors.

This is also feasible under other assumptions. For example, we can use the Lipschitz property as in Example 2.2; it can also be done using the margin of the classifier (see some references in Subsection 6.2 below).
2.3 PAC-Bayes bound on a single draw from the posterior
Theorem 2.7 For any $\lambda > 0$, for any $\varepsilon \in (0, 1)$, for any data-dependent probability measure $\tilde{\rho}$,
\[ \mathbb{P}_{S, \, \tilde{\theta} \sim \tilde{\rho}}\left( R(\tilde{\theta}) \leq r(\tilde{\theta}) + \frac{\lambda C^2}{8n} + \frac{\log \frac{\mathrm{d}\tilde{\rho}}{\mathrm{d}\pi}(\tilde{\theta}) + \log \frac{1}{\varepsilon}}{\lambda} \right) \geq 1 - \varepsilon. \]

This bound simply says that if you draw $\tilde{\theta}$ from, for example, the Gibbs posterior $\hat{\rho}_\lambda$ (defined in (2.4)), you have a bound on $R(\tilde{\theta})$ that holds with large probability simultaneously over the drawing of the sample and of $\tilde{\theta}$.

Proof. Once again, we follow the proof of Theorem 2.1, until (2.2):
\[ \mathbb{E}_S \mathbb{E}_{\theta \sim \pi}\left[ e^{\lambda [R(\theta) - r(\theta)]} \right] \leq e^{\frac{\lambda^2 C^2}{8n}}. \]
Now, for any nonnegative function $h$,
\begin{align*}
\mathbb{E}_{\theta \sim \pi}[h(\theta)] &= \int h(\theta) \, \pi(\mathrm{d}\theta) \\
&\geq \int_{\left\{ \frac{\mathrm{d}\tilde{\rho}}{\mathrm{d}\pi}(\theta) > 0 \right\}} h(\theta) \, \pi(\mathrm{d}\theta) \\
&= \int_{\left\{ \frac{\mathrm{d}\tilde{\rho}}{\mathrm{d}\pi}(\theta) > 0 \right\}} h(\theta) \frac{\mathrm{d}\pi}{\mathrm{d}\tilde{\rho}}(\theta) \, \tilde{\rho}(\mathrm{d}\theta) \\
&= \mathbb{E}_{\theta \sim \tilde{\rho}}\left[ h(\theta) \, e^{-\log \frac{\mathrm{d}\tilde{\rho}}{\mathrm{d}\pi}(\theta)} \right],
\end{align*}
and in particular:
\[ \mathbb{E}_S \mathbb{E}_{\theta \sim \tilde{\rho}}\left[ e^{\lambda [R(\theta) - r(\theta)] - \log \frac{\mathrm{d}\tilde{\rho}}{\mathrm{d}\pi}(\theta)} \right] \leq e^{\frac{\lambda^2 C^2}{8n}}. \]
I could go through the proof until the end, but I think that you can now guess that it's essentially the Chernoff bound + a rearrangement of the terms. □
2.4 Bound in expectation
We end this section with one more variant of the initial PAC-Bayes bound of Theorem 2.1: a bound in expectation with respect to the sample.

Theorem 2.8 For any $\lambda > 0$, for any data-dependent probability measure $\tilde{\rho}$,
\[ \mathbb{E}_S \mathbb{E}_{\theta \sim \tilde{\rho}}[R(\theta)] \leq \mathbb{E}_S\left[ \mathbb{E}_{\theta \sim \tilde{\rho}}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\tilde{\rho} \| \pi)}{\lambda} \right]. \]
In particular, for $\tilde{\rho} = \hat{\rho}_\lambda$, the Gibbs posterior,
\[ \mathbb{E}_S \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \mathbb{E}_S \inf_{\rho \in \mathcal{P}(\Theta)} \left[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{\lambda C^2}{8n} + \frac{KL(\rho \| \pi)}{\lambda} \right]. \]

These bounds in expectation are very convenient tools from a pedagogical point of view. Indeed, in Section 4, we will study oracle PAC-Bayes inequalities. While it is possible to derive oracle PAC-Bayes bounds both in expectation and with large probability, the ones in expectation are much simpler and shorter to derive. Thus, I will mostly provide PAC-Bayes oracle bounds in expectation in Section 4, and refer the reader to [41, 43] for the corresponding bounds in probability.

Note that, as the bound does not hold with large probability like the previous bounds, it is no longer a PAC bound in the proper sense: Probably Approximately Correct. Once, I was attending a talk by Tsybakov where he presented some results from his paper with Dalalyan [61] that can also be interpreted as a "PAC-Bayes bound in expectation", and he suggested the more appropriate EAC-Bayes acronym: Expectedly Approximately Correct (their paper is briefly discussed in Subsection 6.4 below). I don't think this term has often been reused since then. I also found recently in [80] the acronym MAC-Bayes: Mean Approximately Correct. In order to avoid any confusion I will stick to "PAC-Bayes bound in expectation", but I like EAC and MAC!
Proof. Once again, the beginning of the proof is the same as for Theorem 2.1, until (2.3):
\[ \mathbb{E}_S\left[ e^{\sup_{\rho \in \mathcal{P}(\Theta)} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n}} \right] \leq 1. \]
This time, use Jensen's inequality to move the expectation with respect to the sample inside the exponential function:
\[ e^{\mathbb{E}_S\left[ \sup_{\rho \in \mathcal{P}(\Theta)} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n} \right]} \leq 1, \]
that is,
\[ \mathbb{E}_S\left[ \sup_{\rho \in \mathcal{P}(\Theta)} \lambda \mathbb{E}_{\theta \sim \rho}[R(\theta) - r(\theta)] - KL(\rho \| \pi) - \frac{\lambda^2 C^2}{8n} \right] \leq 0. \]
In particular,
\[ \mathbb{E}_S\left[ \lambda \mathbb{E}_{\theta \sim \tilde{\rho}}[R(\theta) - r(\theta)] - KL(\tilde{\rho} \| \pi) - \frac{\lambda^2 C^2}{8n} \right] \leq 0. \]
Rearrange terms. □
2.5 Applications of empirical PAC-Bayes bounds
The original PAC-Bayes bounds were stated for classification [127], and it soon became clear that many results could be extended to any bounded loss, thus covering for example bounded regression (we discuss in Section 5 how to get rid of the boundedness assumption). Thus, some papers are written in no specific setting, with a generic loss, which can cover classification, regression, or density estimation (this is the case, among others, of Chapter 1 of my PhD thesis [1] and the corresponding paper [2], where I studied a generalization of Catoni's results of [43] to unbounded losses).

However, some empirical PAC-Bayes bounds were also developed for, or applied to, specific models, sometimes taking advantage of some specificities of the model. We mention for example:
- ranking/scoring [150];
- density estimation [91];
- multiple testing [35], which is tackled with related techniques;
- deep learning: even though deep networks are trained for classification or regression, the application of PAC-Bayes bounds to deep learning is not straightforward; we discuss this in Section 3 based on [67] and more recent references;
- unsupervised learning, including clustering [164, 13] and representation learning [140, 139].

Note that this list is non-exhaustive, and that many more applications are presented in Section 4 (more precisely, in Subsection 4.3).
3 Tight and non-vacuous PAC-Bayes bounds
3.1 Why is there a race to the tighter PAC-Bayes bound?
Let us start with a numerical application of the PAC-Bayes bounds we met in Section 2. First, assume we are in the classification setting with the 0-1 loss, so that $C = 1$. We are given a small set of classifiers, say $M = 100$, and on a dataset of size $n = 1000$, the best of these classifiers has an empirical risk of $0.26$. Let us use the bound in (2.7), which I recall here:
\[ \mathbb{P}_S\left( \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \inf_{\theta \in \Theta} r(\theta) + C \sqrt{\frac{\log \frac{M}{\varepsilon}}{2n}} \right) \geq 1 - \varepsilon. \]
With $\varepsilon = 0.05$, this bound is:
\[ \mathbb{P}_S\left( \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq 0.26 + \underbrace{1 \cdot \sqrt{\frac{\log \frac{100}{0.05}}{2 \times 1000}}}_{\approx 0.062} \right) \geq 0.95. \]
So the classification risk using the Gibbs posterior is smaller than $0.322$ with probability at least $95\%$.
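Just to double-check the arithmetic above, a short sketch in Python:

```python
import numpy as np
M, n, eps, C, r_best = 100, 1000, 0.05, 1.0, 0.26
# bound (2.7): r_best + C * sqrt(log(M/eps) / (2n)) ~= 0.26 + 0.062 = 0.322
print(r_best + C * np.sqrt(np.log(M / eps) / (2 * n)))
```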
Let us now switch to a more problematic example. We consider a very simple binary neural network, given by the following formula, for $x \in \mathbb{R}^d$, and where $\varphi$ is a nonlinear activation function (e.g., $\varphi(x) = \max(x, 0)$):
\[ f_w(x) = \mathbf{1}\left[ \sum_{i=1}^M w_i^{(2)} \, \varphi\left( \sum_{j=1}^d w_{j,i}^{(1)} x_j \right) \geq 0 \right], \]
where the weights $w_{j,i}^{(1)}$ and $w_i^{(2)}$ are all in $\{-1, +1\}$, for $1 \leq j \leq d$ and $1 \leq i \leq M$. Define $\theta = (w_{1,1}^{(1)}, w_{1,2}^{(1)}, \dots, w_{d,M}^{(1)}, w_1^{(2)}, \dots, w_M^{(2)})$. Note that the set of all possible such networks has cardinality $2^{M(d+1)}$. Consider inputs that are $100 \times 100$ greyscale images, that is, $x \in [0,1]^d$ with $d = 10{,}000$, and a sample size $n = 10{,}000$. With neural networks, it is often the case that a perfect classification of the training sample is possible, that is, there is a $\theta$ such that $r(\theta) = 0$.

Even for a moderate number of units such as $M = 100$, this leads to the PAC-Bayes bound (with $\varepsilon = 0.05$):
\[ \mathbb{P}_S\left( \mathbb{E}_{\theta \sim \hat{\rho}_\lambda}[R(\theta)] \leq \underbrace{1 \cdot \sqrt{\frac{\log \frac{2^{1{,}000{,}100}}{0.05}}{2 \times 10{,}000}}}_{\approx 13.58} \right) \geq 0.95. \]
So the classification risk using the Gibbs posterior is smaller than $13.58$ with probability at least $95\%$. This is not informative at all, because we already know that the classification risk is smaller than $1$. Such a bound is usually referred to as a vacuous bound, because it does not bring any information at all. You can try to improve the bound by increasing the sample size, but you can check that even $n = 1{,}000{,}000$ still leads to a vacuous bound with this network.
Various opinions on these vacuous bounds are possible:
- "Theory is useless. I don't know why I would care about generalization guarantees, neural networks work in practice." This opinion is lazy: it's just a good excuse not to have to think about generalization guarantees. I will assume that since you are reading this tutorial, this is not your opinion.
- "Vacuous bounds are certainly better than no bounds at all!" This opinion is cynical; it can be rephrased as "better to have a theory that doesn't work than no theory at all: at least we can claim we have a theory, and some people might even believe us". But the theory just says nothing.
- "Let's get back to work, and improve the bounds." Since the publication of the first PAC-Bayes bounds already mentioned [166, 127, 128], many variants have been proven. One can try to test which one is the best in a given setting, try to improve the priors, try to refine the bound in many ways... In 2017, Dziugaite and Roy [67] obtained non-vacuous (even though not really tight yet) PAC-Bayes bounds for practical neural networks (since then, tighter bounds were obtained by these authors and by others). This is a remarkable achievement, and it also made PAC-Bayes theory immediately more popular than it had ever been before.

Let's begin this section with a review of some popular PAC-Bayes bounds, in Subsection 3.2. We will then explain, in Subsection 3.3, which bounds and which improvements led to tight generalization bounds for deep learning. In particular, we will focus on a very important approach to improve the bounds: data-dependent priors.
3.2 A few PAC-Bayes bounds
Note that the original works on PAC-Bayes focused only on classification with the 0-1 loss. So, for the whole Subsection 3.2, we assume that $\ell$ is the 0-1 loss function. Remember that this means that $R$ and $r$ take values in $[0, 1]$ (so $C = 1$ in this subsection).

3.2.1 McAllester's bound [127] and Maurer's improved bound [126]

As the original paper by McAllester [127] focused on finite or denumerable sets $\Theta$, let us start with the first bound for a general $\Theta$, in [128].

Theorem 3.1 (Theorem 1 in [128]) For any $\varepsilon > 0$,
\[ \mathbb{P}_S\left[ \exists \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] > \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \sqrt{\frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon} + \frac{5}{2} \log(n) + 8}{2n - 1}} \right] \leq \varepsilon. \]
Compared to Theorem 2.1, note that there is no parameter $\lambda$ here to optimize. On the other hand, one can no longer use Lemma 2.2 to minimize the right-hand side. A way to solve this problem is to make the parameter $\lambda$ appear artificially, using the inequality $\sqrt{ab} \leq a\lambda/2 + b/(2\lambda)$ for any $\lambda > 0$:
\[ \mathbb{P}_S\left[ \exists \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] > \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \inf_{\lambda > 0} \left( \frac{\lambda}{4n} + \frac{KL(\rho \| \pi) + \log \frac{2}{\varepsilon} + \frac{1}{2} \log(n)}{2\lambda} \right) \right] \leq \varepsilon. \tag{3.1} \]
On the other hand, the price to pay for an optimization with respect to $\lambda$ in Theorem 2.4 was a $\log(n)$ term, which is already in Maurer's bound, for an arithmetic grid, and a $\log \log(n)$ term when using a geometric grid. So, asymptotically in $n$, Theorem 2.4 with a geometric grid will always lead to better results than Theorem 3.1. On the other hand, the constants in Theorem 3.1 are smaller, so the bound can be better for small sample sizes (a point that should not be neglected for tight certificates in practice!).

It is possible to minimize the right-hand side of (3.1) with respect to $\rho$, and this will lead to a Gibbs posterior: $\hat{\rho} = \pi_{-2\lambda r}$. It is also possible to minimize it with respect to $\lambda$, but the minimization in $\lambda$ when $\rho$ itself depends on $\lambda$ is a bit more tricky. On this problem, we want to mention the more recent paper [174]: the authors proved a bound that is easier to minimize simultaneously in $\lambda$ and $\rho$.
3.2.2 Catoni’s bound (another one) [43]
Theorem 2.1 was based on Catoni's preprint [41]. Catoni's monograph [43] provides many other bounds.

Theorem 3.2 (Theorem 1.2.6 of [43]) Define, for $a > 0$, the function of $p \in (0, 1)$:
\[ \Phi_a(p) = \frac{-\log\left\{ 1 - p \left[ 1 - e^{-a} \right] \right\}}{a}. \]
Then, for any $\lambda > 0$, for any $\varepsilon > 0$,
\[ \mathbb{P}_S\left( \forall \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] \leq \Phi_{\frac{\lambda}{n}}^{-1}\left( \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right) \right) \geq 1 - \varepsilon. \tag{3.2} \]

Actually, we have
\[ \Phi_a^{-1}(q) = \frac{1 - e^{-a q}}{1 - e^{-a}}, \]
and inequalities on the exponential function lead to the following consequence of Theorem 3.2:
\[ \mathbb{P}_S\left( \forall \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] \leq \frac{\frac{\lambda}{n}}{1 - e^{-\frac{\lambda}{n}}} \left[ \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \frac{KL(\rho \| \pi) + \log \frac{1}{\varepsilon}}{\lambda} \right] \right) \geq 1 - \varepsilon. \tag{3.3} \]
3.2.3 Seeger’s bound [161] and Maurer’s bound [126]
Let us now present a completely different bound. This bound is very central in PAC-Bayesian theory: we will see that many other bounds can be derived from this one. A first version was proven by Seeger and Langford [107, 161] and is often referred to as Seeger's bound. The bound was slightly improved by Maurer [126], so we will here provide Maurer's version of Seeger's bound.

Let $\mathrm{Be}(p)$ denote the probability distribution of a Bernoulli random variable $V$ with parameter $p$, that is, $\mathbb{P}(V = 1) = p = 1 - \mathbb{P}(V = 0)$. Then we have:
\[ KL(\mathrm{Be}(p) \| \mathrm{Be}(q)) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} =: \mathrm{kl}(p | q), \]
which can be used as a metric on $[0, 1]$.

Theorem 3.3 (Theorem 5 in [126]) For any $\varepsilon > 0$,
\[ \mathbb{P}_S\left[ \exists \rho \in \mathcal{P}(\Theta), \; \mathrm{kl}\left( \mathbb{E}_{\theta \sim \rho}[r(\theta)] \,\middle|\, \mathbb{E}_{\theta \sim \rho}[R(\theta)] \right) > \frac{KL(\rho \| \pi) + \log \frac{2\sqrt{n}}{\varepsilon}}{n} \right] \leq \varepsilon. \]

Under this form, the bound is not very explicit, so we will derive a few of its consequences. Following Seeger [161], we define:
\[ \mathrm{kl}^{-1}(q | b) = \sup\{ p \in [0, 1] : \mathrm{kl}(q | p) \leq b \}. \]
Then the bound becomes:
\[ \mathbb{P}_S\left[ \exists \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] > \mathrm{kl}^{-1}\left( \mathbb{E}_{\theta \sim \rho}[r(\theta)] \,\middle|\, \frac{KL(\rho \| \pi) + \log \frac{2\sqrt{n}}{\varepsilon}}{n} \right) \right] \leq \varepsilon. \]
So we can deduce more explicit bounds from Seeger's bound simply by providing explicit upper bounds on the function $\mathrm{kl}^{-1}$. In Section 3 of [147], it is mentioned that Theorem 3.1 can be recovered in such a way, with improved constants, using Pinsker's inequality $\mathrm{kl}(p | q) \geq 2 (p - q)^2$. We will now see other consequences of Seeger's bound.
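In practice, bounds based on $\mathrm{kl}^{-1}$ are evaluated by inverting $\mathrm{kl}$ numerically, since there is no closed form. Here is a sketch (my own implementation by bisection, using the fact that $p \mapsto \mathrm{kl}(q | p)$ is increasing on $[q, 1]$):

```python
import numpy as np

def kl_bern(q, p):
    # kl(q|p) for Bernoulli parameters, with the convention 0 log 0 = 0
    tiny = 1e-12
    p = min(max(p, tiny), 1 - tiny)
    out = 0.0
    if q > 0:
        out += q * np.log(q / p)
    if q < 1:
        out += (1 - q) * np.log((1 - q) / (1 - p))
    return out

def kl_inv(q, b, tol=1e-9):
    # kl^{-1}(q|b) = sup{ p in [0,1] : kl(q|p) <= b }, by bisection on [q, 1]
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bern(q, mid) <= b:
            lo = mid
        else:
            hi = mid
    return lo

# e.g. aggregated empirical risk 0.05 and complexity term b = 0.02:
print(kl_inv(0.05, 0.02))
```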
3.2.4 Tolstikhin and Seldin’s bound [175]
A better upper bound on $\mathrm{kl}^{-1}$ is used by Tolstikhin and Seldin [175]:
\[ \mathrm{kl}^{-1}(q | b) \leq q + \sqrt{2 q b} + 2 b. \]
Plugging this into Seeger's inequality leads to Tolstikhin and Seldin's bound [175].

Theorem 3.4 ((3) in [175]) For any $\varepsilon > 0$,
\[ \mathbb{P}_S\left[ \exists \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] > \mathbb{E}_{\theta \sim \rho}[r(\theta)] + \sqrt{2 \, \mathbb{E}_{\theta \sim \rho}[r(\theta)] \, \frac{KL(\rho \| \pi) + \log \frac{2\sqrt{n}}{\varepsilon}}{n}} + 2 \, \frac{KL(\rho \| \pi) + \log \frac{2\sqrt{n}}{\varepsilon}}{n} \right] \leq \varepsilon. \]

Note the amazing thing about this bound: while its dependence with respect to $n$ is in general in $1/\sqrt{n}$, as for all the PAC-Bayes bounds seen so far, the dependence drops to $1/n$ if $\mathbb{E}_{\theta \sim \rho}[r(\theta)] = 0$. This is actually not a surprise, because a similar phenomenon is known for the ERM [180].

More generally, we will see in Section 4 a general assumption that characterizes the best possible learning rate in classification problems. And as a special case, the noiseless case indeed leads to rates in $1/n$. One more word about Theorem 3.4: the authors acknowledge that the upper bound on $\mathrm{kl}^{-1}$ they use was actually suggested by McAllester [129]. They also prove a completely new bound, the so-called PAC-Bayes-Empirical-Bernstein inequality, which even improves on Theorem 3.4, but we will not state it here. Let us summarize the important take-home message from Theorem 3.4:
- in general, empirical PAC-Bayes bounds are in $1/\sqrt{n}$;
- in the noiseless case $\mathbb{E}_{\theta \sim \rho}[r(\theta)] = 0$, it is possible to have a bound in $1/n$, on the condition that one uses the right PAC-Bayes inequality, for example Theorem 3.4, Theorem 3.3 or the PAC-Bayes-Empirical-Bernstein inequality.

This is very important for the application of these bounds to neural networks, as deep networks usually allow one to classify the training data perfectly.
3.2.5 Thiemann, Igel, Wintenberger and Seldin's bound [174]

According to [147], this bound can also be recovered as a consequence of Seeger's bound, using $\sqrt{ab} \leq \frac{\lambda a}{2} + \frac{b}{2\lambda}$. It appears to be extremely tight and convenient in practice (see Subsection 3.3 below).

Theorem 3.5 For any $\varepsilon > 0$, for any $\lambda \in (0, 2)$,
\[ \mathbb{P}_S\left[ \exists \rho \in \mathcal{P}(\Theta), \; \mathbb{E}_{\theta \sim \rho}[R(\theta)] > \frac{\mathbb{E}_{\theta \sim \rho}[r(\theta)]}{1 - \frac{\lambda}{2}} + \frac{KL(\rho \| \pi) + \log \frac{2\sqrt{n}}{\varepsilon}}{n \lambda \left( 1 - \frac{\lambda}{2} \right)} \right] \leq \varepsilon. \]

Here again, we observe the $1/n$ regime when $\mathbb{E}_{\theta \sim \rho}[r(\theta)] = 0$ (for example for $\lambda = 1$).
3.2.6 A bound by Germain, Lacasse, Laviolette and Marchand [77]
Let us conclude this discussion with a nice generalization of Theorem 3.3 by Germain, Lacasse, Laviolette and Marchand [77].
Theorem 3.6 (Theorem 2.1 in [77]) Let D : [0, 1]² → R be any convex function. For any ε > 0,
P_S[ ∀ρ ∈ P(Θ), D( E_{θ∼ρ}[r(θ)], E_{θ∼ρ}[R(θ)] ) ≤ ( KL(ρ‖π) + log( E_S E_{θ∼π} e^{nD(r(θ),R(θ))} / ε ) ) / n ] ≥ 1 − ε.
As discussed by the authors, D(p, q) = kl(p|q) leads to Theorem 3.3, and D(p, q) = −log[1 − q(1 − e^{−C})] − Cp leads to Catoni's bound given in Theorem 3.2 above. This raises a natural question: is there a function D that leads to a strict improvement of Theorem 3.3? The question is investigated in [70]. Overall, it seems that no function D will lead to a bound that is smaller, in expectation, than Theorem 3.3, up to the log(2√n) term.
Theorem 3.6 has another important advantage, which will be discussed in Section 5.
More bounds are known, but it is not possible to mention them all, so I apologize if I did not cite a bound you like, or your bound. Some other variants will be discussed later, in Section 6, in particular: bounds for unbounded losses, bounds for non-i.i.d. data, and also some bounds where the KL divergence KL(ρ‖π) is replaced by another divergence.
3.3 Tight generalization error bounds for deep learning
3.3.1 A milestone: non-vacuous generalization error bounds for deep networks by Dziugaite and Roy [67]
PAC-Bayes bounds were applied to (shallow) neural networks as early as 2002 by Langford and Caruana [106]. We also applied them with O. Wintenberger to prove that shallow networks can consistently predict time series [10]. McAllester proposed an application of PAC-Bayes bounds to dropout, a tool used for training neural networks, in his tutorial [130]. But none of these techniques seemed to lead to tight bounds for deep networks... until 2017, when Dziugaite and Roy [67] obtained the first non-vacuous generalization error bounds for deep networks on the MNIST dataset, based on Theorem 3.3 (Seeger's bound). Since then, there has been renewed interest in PAC-Bayes bounds to obtain the tightest possible certificates.
At first sight, [67] is an application of Seeger's bound to a deep neural network, but many important ideas and refinements were needed to reach a non-vacuous bound (according to the authors, some of them original, some of them based on ideas from earlier works like [106]). Let us briefly describe these ingredients here (a toy numerical sketch follows the list); the reader should of course read the paper for more details and insightful explanations:
- the posterior is constrained to be Gaussian (similar to the above “non-exact minimization of the bound” in Subsection 2.1.3): ρ_{w,s²} = N(w, s²I_d). Thus, the PAC-Bayes bound only has to be minimized with respect to the parameters (w, s²), which allows the use of an optimization algorithm to minimize the bound (the authors mention that fitting Gaussian distributions to neural networks was already proposed in [92] based on the MDL principle, which will be discussed in Section 6).
- the choice of an adequate upper bound on kl⁻¹ in Seeger's bound, in order to make the bound easier to minimize.
- Seeger's bound holds for the 0−1 loss, but Dziugaite and Roy upper bounded the empirical risk by a convex, Lipschitz upper bound in order to make the bound easier to minimize (note that this is a standard approach in classification):
E_{θ∼ρ}[r(θ)] = E_{θ∼ρ}[ (1/n) Σ_{i=1}^n 1(f_θ(X_i) ≠ Y_i) ] ≤ E_{θ∼ρ}[ (1/n) Σ_{i=1}^n log(1 + e^{−Y_i f_θ(X_i)}) / log 2 ].
- the use of stochastic gradient descent (SGD) to minimize the bound (up to our knowledge, the first use of SGD to minimize a PAC-Bayes bound was [77], for linear classification). Of course, this is standard in deep learning, but there is a crucial observation that SGD tends to converge to flat minima. This is very important, because around a flat minimum w, we have r(w′) ≃ r(w) for w′ close to w, and thus E_{θ∼ρ_{w,s²}}[r(θ)] ≃ r(w) even for quite large values of s². On the other hand, for a sharp minimum w, E_{θ∼ρ_{w,s²}}[r(θ)] ≃ r(w) only for very small values of s², which tends to make the PAC-Bayes bound larger (see Example 2.2 above).
- finally, the authors used a data-dependent prior: N(w₀, σ²I), where σ² is chosen to minimize the bound (this is justified in theory thanks to a union bound argument as in Subsection 2.1.4 above). The mean w₀ is not optimized, but the authors point out that the choice w₀ = 0 is not good, and they actually draw w₀ randomly, as is usually done to initialize SGD in non-Bayesian approaches.
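The following sketch (my own toy rendition, not the authors' code) assembles these ingredients for a linear classifier f_w(x) = ⟨w, x⟩ standing in for the network: a Gaussian posterior ρ_{w,s²}, the surrogate loss above, the closed-form Gaussian KL, and the kl⁻¹ certificate (kl_inv from the first sketch above). In a rigorous certificate, the Monte Carlo estimate of the randomized risk would itself have to be corrected by a concentration bound, as done in [67]:

    import numpy as np

    rng = np.random.default_rng(0)

    def surrogate_risk(w, X, Y):
        # convex, Lipschitz upper bound on the 0-1 loss: log(1 + e^{-y <w,x>}) / log 2
        return np.mean(np.logaddexp(0.0, -Y * (X @ w))) / np.log(2.0)

    def gaussian_kl(w, s2, w0, sigma2):
        # KL( N(w, s2 I) || N(w0, sigma2 I) ), in closed form
        d = w.shape[0]
        return 0.5 * (d * s2 / sigma2 + np.sum((w - w0) ** 2) / sigma2
                      - d + d * np.log(sigma2 / s2))

    def certificate(w, s2, w0, sigma2, X, Y, eps=0.05, n_mc=200):
        n = X.shape[0]
        # Monte Carlo estimate of E_{theta ~ rho_{w, s2}}[r(theta)]
        r = np.mean([surrogate_risk(w + np.sqrt(s2) * rng.standard_normal(w.shape), X, Y)
                     for _ in range(n_mc)])
        b = (gaussian_kl(w, s2, w0, sigma2) + np.log(2 * np.sqrt(n) / eps)) / n
        return kl_inv(min(r, 1.0), b)  # clip: the surrogate may slightly exceed 1

    # hypothetical usage: certificate(w_trained, 0.01, w_init, 0.1, X_train, Y_train)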
On the MNIST data set, the authors obtain empirical bounds between 0.16 and 0.22, thus non-vacuous. The classification error of their posterior is actually around 0.03, so they conclude that there is still room for improvement. Indeed, since then, a wide range of papers, from very theoretical to very computational, have studied PAC-Bayes bounds for deep networks [118, 138, 193, 111, 181, 105, 176, 31, 71, 147, 171]. We discuss recent results from [147] below, but first, we want to discuss in detail one of the most important ingredients above: the data-dependent prior.
3.3.2 Bounds with data-dependent priors
Using data to improve the prior is actually an old idea: such approaches can be found in [161, 41, 42, 192, 43, 145, 113, 67, 68, 66]. Note that the original PAC-Bayes bounds do not allow for a data-dependent prior. Thus, some additional work is required to make this possible (e.g. the union bound on σ² in [67] discussed above). Note that the very first occurrence of this idea is due to Seeger in [161]. Seeger proposed to split the sample in two parts. The first part is used to define Θ and the prior π, and the PAC-Bayes bound is applied on the second part of the sample (that is, conditionally on the first part). Seeger used this technique to study the generalization of Gaussian processes. Later, Catoni used it [41] to prove generalization error bounds for compression schemes. We will here describe in detail two other approaches: first, Catoni's "localization" technique, because it will also be important in Sections 4 and 6, and then a recent bound from [68].
First, let us discuss the intuition leading to Catoni's method. We discussed in Section 2, just after (2.6), that the bound is tighter for parameters θ for which π(θ) is large, and less tight for parameters θ for which π(θ) is small: in the finite case, we remind that the Kullback-Leibler divergence led to a term in log(1/π(θ)) in the bound. Based on this idea, we might want to construct a prior π that gives a large weight to the relevant parameters, that is, to parameters such that R(θ) is small. This exactly corresponds to π_{−βR} for some β > 0, where π_{−βR} is given by
(dπ_{−βR}/dπ)(θ) = e^{−βR(θ)} / E_{ϑ∼π}[e^{−βR(ϑ)}].
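As a toy illustration (with made-up risk values), here is how π_{−βR} reweights a uniform prior on a finite Θ: as β grows, the mass concentrates on the low-risk parameters:

    import numpy as np

    def localized_prior(R, beta):
        # pi_{-beta R}: reweight a uniform prior by exp(-beta * R(theta)), stably
        w = np.exp(-beta * (R - R.min()))
        return w / w.sum()

    R = np.array([0.10, 0.12, 0.30, 0.45])  # hypothetical risks of 4 predictors
    for beta in [0.0, 5.0, 50.0]:
        print(beta, localized_prior(R, beta))  # mass drifts to low-risk parameters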
This choice of prior is not data-dependent, and is thus allowed by the theory. But in practice, it cannot be used, because R(θ) = E_{(X,Y)∼P}[ℓ(f_θ(X), Y)] is of course unknown (still, we use the prior π_{−βR} in Section 4 below, in theoretical bounds that are not meant to be evaluated on the data). For empirical bounds, Catoni proved that KL(ρ‖π_{−βR}) can be upper bounded, with large probability, by KL(ρ‖π_{−ξr}) for ξ = β/(λ + g(λ/n)λ²/n), plus some additional terms (g being Bernstein's function, which will be defined in Section 4). Plugging this result into Theorem 2.1, he obtains the following "localized bound" (Lemma 6.2 in [42]):
P_S( ∀ρ, E_{θ∼ρ}[R(θ)] ≤ [ (1−ξ)λ E_{θ∼ρ}[r(θ)] + KL(ρ‖π_{−ξr}) + (1+ξ) log(2/ε) ] / [ (1−ξ)λ − (1+ξ) g(λ/n) λ²/n ] ) ≥ 1 − ε,
which means that we are allowed to use π_{−ξr}, which is data-dependent, as a prior! This bound is a little scary at first, because it depends on many parameters. We will provide simpler localized bounds in Section 4 in order to explain their benefits (in particular, localization allows one to remove some log(n) terms in the rates of convergence). For now, simply accept that the bound is usually tighter than Theorem 2.1, but that in practice we have to calibrate both λ and β, which makes it a little more difficult to use. Thus, I am not aware of any application of this technique to neural networks, but we will show in Section 4 that, used on PAC-Bayes oracle inequalities, it leads to an improvement of the order of the bound. I would advise the reader to read [41] or [43] to learn many consequences of this localization technique; see also the paper by Tong Zhang [192].
Dziugaite and Roy proved in [68] that any data-dependent prior can actually be used in
Seeger’s bound, under a differential privacy condition, at the cost of a small modification of
the bound.
Theorem 3.7 (Theorem 4.2 in [68]) Assume we have a function Π that maps any sample s = ((x₁, y₁), ..., (xₙ, yₙ)) to a prior π = Π(s). Remind that the data is S = ((X₁, Y₁), ..., (Xₙ, Yₙ)) and define, for any i ∈ {1, ..., n}, S′_i as a copy of S where (X_i, Y_i) is replaced by (X′_i, Y′_i) ∼ P, independent from S. Assume that Π is such that, for any i ∈ {1, ..., n}, for any B,
P_S(Π(S) ∈ B) ≤ e^η P_{S′_i}(Π(S′_i) ∈ B)
(we say that Π is η-differentially private). Then, for any ε > 0,
P_S( ∀ρ, kl( E_{θ∼ρ}[R(θ)] | E_{θ∼ρ}[r(θ)] ) ≤ (KL(ρ‖Π(S)) + log(4√n/ε)) / n + η²/2 + η √( log(4/ε) / (2n) ) ) ≥ 1 − ε.
For more on PAC-Bayes and differential privacy, see [143].
3.3.3 Comparison of the bounds and tight certificates for neural networks [147]
Recently, Pérez-Ortiz, Rivasplata, Shawe-Taylor and Szepesvári [147] trained neural networks in the spirit of [67] on the MNIST and CIFAR-10 datasets. They use the PAC-Bayes with backprop algorithm from [155]. They obtain state-of-the-art test errors (0.02 on MNIST), and improve the generalization bounds of [67] (0.0279 on MNIST, a very tight bound!). The paper is very interesting even beyond neural networks, as the authors compare numerically many of the PAC-Bayes bounds listed above. Note that, consistently across the experiments, the bound from [174] is the tightest (Theorem 3.5 above). We do not list here all the nice tricks used by the authors to obtain tighter bounds, but we strongly recommend that the reader who wants to work on this topic read this paper in detail. An important point to note is that, in order to avoid checking that a given data-dependent prior is η-differentially private, they tune the priors through simple sample-splitting, very much in the spirit of [161]: the prior is built on a first part of the sample, and the PAC-Bayes bound is evaluated (and minimized) on the second part of the sample.
For another comparison of the bounds, in the small-data regime, see [70].
4 PAC-Bayes oracle inequalities and fast rates
As explained in Subsection 1.4 above, empirical PAC-Bayes bounds are very useful as they provide a numerical certificate for randomized estimators or aggregated predictors. But we also mentioned another type of PAC-Bayes bounds: oracle PAC-Bayes bounds. In this section, we provide examples of PAC-Bayes oracle bounds. Interestingly, the first PAC-Bayes oracle inequality we state below is actually derived from an empirical PAC-Bayes inequality.
4.1 From empirical inequalities to oracle inequalities
As for empirical bounds, we can prove oracle bounds in expectation and in probability. We will first present a simple version of each. Later, we will focus on bounds in expectation for the sake of simplicity: these bounds are much shorter to prove. But all the results we will prove in expectation have counterparts in probability, which the reader can find in [41, 43] for example.
4.1.1 Bound in expectation
We start by a reminder of (the second claim of) Theorem 2.8: for any λ > 0,
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ E_S inf_{ρ∈P(Θ)} { E_{θ∼ρ}[r(θ)] + λC²/(8n) + KL(ρ‖π)/λ },
where we remind that ρ̂_λ is the Gibbs posterior (defined in (2.4)). From there, we have the following:
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ E_S inf_{ρ∈P(Θ)} { E_{θ∼ρ}[r(θ)] + λC²/(8n) + KL(ρ‖π)/λ }
  ≤ inf_{ρ∈P(Θ)} E_S { E_{θ∼ρ}[r(θ)] + λC²/(8n) + KL(ρ‖π)/λ }
  = inf_{ρ∈P(Θ)} { E_S E_{θ∼ρ}[r(θ)] + λC²/(8n) + KL(ρ‖π)/λ }
  = inf_{ρ∈P(Θ)} { E_{θ∼ρ} E_S[r(θ)] + λC²/(8n) + KL(ρ‖π)/λ },
where we used Fubini's theorem in the last equality. But, by definition, E_S[r(θ)] = R(θ).
Thus, we obtain the following theorem.
Theorem 4.1 For any λ > 0,
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ inf_{ρ∈P(Θ)} { E_{θ∼ρ}[R(θ)] + λC²/(8n) + KL(ρ‖π)/λ }.
Example 4.1 (Finite case, continued) In the context of Example 2.1, that is, card(Θ) = M < +∞, with λ = √(8n log(M))/C and π uniform on Θ, we obtain the bound:
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ inf_{θ∈Θ} R(θ) + C √( log(M) / (2n) ).
Note that this time, we don't have a numerical certificate on E_S E_{θ∼ρ̂_λ}[R(θ)]. But on the other hand, we know that our predictions are the best theoretically possible, up to at most C√(log(M)/(2n)) (such information is not provided by an empirical PAC-Bayes inequality).
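A toy sketch of this example (with fake empirical risks, since the point is only the mechanics): the Gibbs posterior over a finite Θ with uniform prior, the choice of λ from Example 4.1, and the excess-risk term of the bound:

    import math
    import numpy as np

    def gibbs_posterior(r_emp, lam):
        # Gibbs posterior over a finite Theta with uniform prior:
        # rho_lambda(theta) proportional to exp(-lambda * r(theta))
        w = np.exp(-lam * (r_emp - r_emp.min()))
        return w / w.sum()

    n, C, M = 1000, 1.0, 50
    r_emp = np.random.default_rng(1).uniform(0.1, 0.5, size=M)  # fake empirical risks
    lam = math.sqrt(8 * n * math.log(M)) / C  # the lambda of Example 4.1
    rho = gibbs_posterior(r_emp, lam)
    print(rho.max(), C * math.sqrt(math.log(M) / (2 * n)))  # weights and excess-risk term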
A natural question after Example 4.1 is: is it possible to improve the rate 1/√n? Is it possible to ensure that our predictions are the best possible up to a smaller term? The answer is "no" in the worst case, but "yes" quite often. These faster rates will be the object of the following subsections. But first, as promised, we provide an oracle PAC-Bayes bound in probability.
4.1.2 Bound in probability
As we derived the oracle inequality in expectation of Theorem 4.1 from the empirical inequal-
ity in expectation of Theorem 2.8, we will now use the empirical inequality in probability
from Theorem 2.1 to prove the following oracle inequality in probability. Note, however, that
the proof is slightly more complicated, and that this leads to different (and worse) constants
within the bound.
Theorem 4.2 For any λ > 0, for any ε ∈ (0, 1),
P_S( E_{θ∼ρ̂_λ}[R(θ)] ≤ inf_{ρ∈P(Θ)} { E_{θ∼ρ}[R(θ)] + λC²/(4n) + 2(KL(ρ‖π) + log(2/ε))/λ } ) ≥ 1 − ε.
Proof: first, apply Theorem 2.1 to ρ = ρ̂_λ, as was done to obtain Corollary 2.3. This gives:
P_S( E_{θ∼ρ̂_λ}[R(θ)] ≤ inf_{ρ∈P(Θ)} { E_{θ∼ρ}[r(θ)] + λC²/(8n) + (KL(ρ‖π) + log(1/ε))/λ } ) ≥ 1 − ε.   (4.1)
We will now prove the reverse inequality, that is:
P_S( ∀ρ ∈ P(Θ), E_{θ∼ρ}[r(θ)] ≤ E_{θ∼ρ}[R(θ)] + λC²/(8n) + (KL(ρ‖π) + log(1/ε))/λ ) ≥ 1 − ε.   (4.2)
Note that the proof of (4.2) exactly parallels the proof of Theorem 2.1, except that we replace U_i by −U_i. So, the reader who is comfortable enough with this kind of proof can skip this part, or prove (4.2) as an exercise. Still, we provide a complete proof for the sake of completeness. Fix θ ∈ Θ and apply Hoeffding's inequality with U_i = ℓ_i(θ) − E[ℓ_i(θ)] and t = λ/n:
E_S e^{λ[r(θ)−R(θ)]} ≤ e^{λ²C²/(8n)}.
Integrate this bound with respect to π:
E_{θ∼π} E_S e^{λ[r(θ)−R(θ)]} ≤ e^{λ²C²/(8n)}.
Apply Fubini:
E_S E_{θ∼π} e^{λ[r(θ)−R(θ)]} ≤ e^{λ²C²/(8n)}
and then Donsker and Varadhan's variational formula (Lemma 2.2):
E_S e^{sup_{ρ∈P(Θ)} { λ E_{θ∼ρ}[r(θ)−R(θ)] − KL(ρ‖π) }} ≤ e^{λ²C²/(8n)}.
Rearranging terms:
E_S[ e^{sup_{ρ∈P(Θ)} { λ E_{θ∼ρ}[r(θ)−R(θ)] − KL(ρ‖π) } − λ²C²/(8n)} ] ≤ 1.
Chernoff's bound gives:
P_S[ sup_{ρ∈P(Θ)} { λ E_{θ∼ρ}[r(θ)−R(θ)] − KL(ρ‖π) } − λ²C²/(8n) > log(1/ε) ] ≤ ε.
Rearranging terms:
P_S( ∃ρ ∈ P(Θ), E_{θ∼ρ}[r(θ)] > E_{θ∼ρ}[R(θ)] + λC²/(8n) + (KL(ρ‖π) + log(1/ε))/λ ) ≤ ε.
Take the complement to get (4.2).
Consider now (4.1) and (4.2). A union bound gives:
P_S( E_{θ∼ρ̂_λ}[R(θ)] ≤ inf_{ρ∈P(Θ)} { E_{θ∼ρ}[r(θ)] + λC²/(8n) + (KL(ρ‖π) + log(1/ε))/λ }
  and simultaneously
  ∀ρ ∈ P(Θ), E_{θ∼ρ}[r(θ)] ≤ E_{θ∼ρ}[R(θ)] + λC²/(8n) + (KL(ρ‖π) + log(1/ε))/λ ) ≥ 1 − 2ε.   (4.3)
Plug the upper bound on E_{θ∼ρ}[r(θ)] from the second event into the first one to get:
P_S( E_{θ∼ρ̂_λ}[R(θ)] ≤ inf_{ρ∈P(Θ)} { E_{θ∼ρ}[R(θ)] + 2λC²/(8n) + 2(KL(ρ‖π) + log(1/ε))/λ } ) ≥ 1 − 2ε.   (4.4)
Just replace ε by ε/2 to get the statement of the theorem. □
4.2 Bernstein assumption and fast rates
As mentioned above, the rate 1/√n that we have obtained in almost all the PAC-Bayes bounds seen so far is not always the tightest possible. Actually, this can be seen in Tolstikhin and Seldin's bound (Theorem 3.4): in this bound, it is clear that if there is a ρ such that E_{θ∼ρ}[r(θ)] = 0, then the bound becomes in 1/n.
It appears that rates in 1/n are possible in a more general setting, under an assumption often referred to as the Bernstein assumption. This is well known for ("non-Bayesian") PAC bounds [25], but it seems to me that this fact is ignored by some papers on PAC-Bayes.
Definition 4.1 From now on, we will let θ* denote a minimizer of R when it exists:
R(θ*) = min_{θ∈Θ} R(θ).
When such a θ* exists, and when there is a constant K such that, for any θ ∈ Θ,
E_S[ℓ_i(θ) − ℓ_i(θ*)]² ≤ K[R(θ) − R(θ*)],
we say that the Bernstein assumption is satisfied with constant K.
PAC-Bayes oracle bounds explicitly based on Bernstein's assumption can be found in [41, 192, 43]. Before we state such a bound, let us explore situations where this assumption is satisfied.
Example 4.2 (Classification without noise) Consider classification with the 0−1 loss: ℓ_i(θ) = 1(Y_i ≠ f_θ(X_i)). If the optimal classifier does not make any mistake, that is, if R(θ*) = 0, we necessarily have ℓ_i(θ*) = 0 almost surely. We refer to this situation as "classification without noise". In this case, we obviously have:
E_S[ℓ_i(θ) − ℓ_i(θ*)]² = E_S[1(Y_i ≠ f_θ(X_i)) − 0]²
  = E_S[1(Y_i ≠ f_θ(X_i))]
  = R(θ)
  = 1 · [R(θ) − R(θ*)],
so the Bernstein assumption is satisfied with constant K = 1. Actually, this can be extended beyond the 0−1 loss: for any loss with values in [0, C], if R(θ*) = 0, then the Bernstein assumption is satisfied with constant K = C.
Example 4.3 (Mammen and Tsybakov margin assumption) More generally, still in classification with the 0−1 loss, consider the function
η(x) = E_S(Y_i | X_i = x).
Mammen and Tsybakov [124] proved that, if |η(X_i) − 1/2| ≥ τ almost surely for some τ > 0, then the Bernstein assumption holds for some K that depends on τ. The case τ = 1/2 leads back to the previous example (noiseless classification), but 0 < τ < 1/2 is a more general assumption.
Example 4.4 (Lipschitz and strongly convex loss function) Assume that Θ is convex. Let ρ_i be a function Θ² → R₊ and assume that ℓ_i satisfies:
∀θ ∈ Θ, (ℓ_i(θ) + ℓ_i(θ*))/2 − ℓ_i((θ + θ*)/2) ≥ (α/8) ρ_i²(θ, θ*)   (4.5)
and
∀θ ∈ Θ, |ℓ_i(θ) − ℓ_i(θ*)| ≤ L ρ_i(θ, θ*).   (4.6)
In the special case where ρ_i(θ, θ*) is a metric on Θ, (4.5) will be satisfied if the loss is α-strongly convex in θ, and (4.6) will be satisfied if the loss is L-Lipschitz in θ with respect to ρ_i. Note that ρ_i may depend implicitly on (X_i, Y_i) (as does ℓ_i(θ)).
Bartlett, Jordan and McAuliffe [25] proved that, under these assumptions, the Bernstein assumption is satisfied with constant K = 4L²/α. The proof is so luminous that I cannot resist giving it:
E_S[ℓ_i(θ) − ℓ_i(θ*)]² ≤ L² E_S[ρ_i(θ, θ*)²]   by (4.6)
  ≤ (8L²/α) E_S[ (ℓ_i(θ) + ℓ_i(θ*))/2 − ℓ_i((θ + θ*)/2) ]   by (4.5)
  = (8L²/α) [ (R(θ) + R(θ*))/2 − R((θ + θ*)/2) ]
  ≤ (8L²/α) [ (R(θ) + R(θ*))/2 − R(θ*) ],
where in the last inequality we used R(θ*) ≤ R((θ + θ*)/2). Thus:
E_S[ℓ_i(θ) − ℓ_i(θ*)]² ≤ (4L²/α)[R(θ) − R(θ*)],
and thus the Bernstein assumption is satisfied with constant K = 4L²/α.
Theorem 4.3 Assume the Bernstein assumption is satisfied with some constant K > 0. Taking λ = n/max(2K, C), we have:
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ 2 inf_{ρ∈P(Θ)} { E_{θ∼ρ}[R(θ)] − R(θ*) + max(2K, C) KL(ρ‖π)/n }.
We postpone the applications to Subsection 4.3, but we just want to explain now how the bound is used in general: we only have to find a ρ such that E_{θ∼ρ}[R(θ)] ≃ R(θ*) to obtain:
E_S E_{θ∼ρ̂_λ}[R(θ)] ≲ R(θ*) + 2 max(2K, C) KL(ρ‖π)/n,
hence the rate in 1/n. We will provide more accurate statements in Subsection 4.3.
Remark 4.1 There is a more general version of the Bernstein condition: when there are constants K > 0 and κ ∈ [1, +∞) such that, for any θ ∈ Θ,
E_S[ℓ_i(θ) − ℓ_i(θ*)]² ≤ K[R(θ) − R(θ*)]^{1/κ},
we say that the Bernstein assumption is satisfied with constants (K, κ). We will not study the general case here, but we mention that, in the case of classification, this can also be interpreted in terms of margin [124]. Under such an assumption, some oracle PAC-Bayes inequalities for classification are proven in [43] that lead to rates in 1/n^{κ/(2κ−1)}. These results were extended to general losses in [2]. These rates are known to be optimal in the minimax sense [109]. Finally, for recent results and a comparison of all the types of conditions leading to fast rates in learning theory (including situations with unbounded losses), see [83].
Remark 4.2 All the PAC-Bayes inequalities seen before Section 4 were empirical. They lead to rates in 1/√n, except in the noiseless case R(θ*) = 0, where we obtained the rate 1/n. Then, in this section, we built:
- an oracle inequality with rate 1/√n. Note that an empirical inequality was part of the proof.
- an oracle inequality with rate 1/n. It is important to note that the proof we will propose does not involve any empirical inequality. Similarly, the proofs in [43, 2] for the rates in 1/n^{κ/(2κ−1)} do not involve empirical inequalities. (The reader might remark that (4.8) in the proof below is almost an empirical inequality, but the term r(θ*) is not empirical, as it depends on the unknown θ*.)
It is thus natural to ask: are there empirical inequalities leading to rates in 1/n or 1/n^{κ/(2κ−1)} beyond the noiseless case? The answer is "yes" for "non-Bayesian" PAC bounds [26], based on Rademacher complexity: there are empirical bounds on R(θ̂_ERM) − R(θ*). In the PAC-Bayesian case, it is a little more complicated (unless one uses the bound to control the risk of a non-Bayesian estimator such as the ERM). This is discussed in [80].
Before stating the proof of Theorem 4.3, we remind the reader of a very classical result.
Lemma 4.4 (Bernstein's inequality) Let g denote the Bernstein function defined by g(0) = 1/2 and, for x ≠ 0,
g(x) = (eˣ − 1 − x)/x².
Let U₁, ..., Uₙ be i.i.d. random variables taking values in an interval of length C. Then
E e^{t Σ_{i=1}^n [U_i − E(U_i)]} ≤ e^{g(Ct) n t² Var(U_i)}.
Proof of Theorem 4.3: We follow the general proof scheme for PAC-Bayes bounds, with some important differences. First, Hoeffding's inequality will be replaced by Bernstein's inequality. But another very important point is to use the inequality on the "relative losses" ℓ_i(θ) − ℓ_i(θ*) instead of the losses ℓ_i(θ) (for this reason, these bounds are sometimes called "relative bounds"). This is to ensure that we can use the Bernstein condition. So, let us fix θ ∈ Θ and apply Lemma 4.4 to U_i = ℓ_i(θ) − ℓ_i(θ*). Note that E[U_i] = R(θ) − R(θ*), thus we obtain:
E_S e^{tn[R(θ)−R(θ*)−r(θ)+r(θ*)]} ≤ e^{g(Ct) n t² Var_S(U_i)}.
Put λ = tn and note that
Var_S(U_i) ≤ E_S(U_i²) = E_S[ℓ_i(θ) − ℓ_i(θ*)]² ≤ K[R(θ) − R(θ*)]
thanks to the Bernstein condition. Thus:
E_S e^{λ[R(θ)−R(θ*)−r(θ)+r(θ*)]} ≤ e^{g(λC/n)(λ²/n)K[R(θ)−R(θ*)]}.
Rearranging terms:
E_S e^{λ{[1−Kg(λC/n)λ/n][R(θ)−R(θ*)]−r(θ)+r(θ*)}} ≤ 1.   (4.7)
The next steps are now routine: we integrate in θ with respect to π and apply Fubini and Donsker and Varadhan's variational formula to get:
E_S e^{sup_{ρ∈P(Θ)} ( λ E_{θ∼ρ}{[1−Kg(λC/n)λ/n][R(θ)−R(θ*)]−r(θ)+r(θ*)} − KL(ρ‖π) )} ≤ 1.
In particular, for ρ = ρ̂_λ the Gibbs posterior of (2.4), we have, using Jensen's inequality and rearranging terms:
[1 − Kg(λC/n)λ/n] { E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) } ≤ E_S { E_{θ∼ρ̂_λ}[r(θ)] − r(θ*) + KL(ρ̂_λ‖π)/λ }.
From now on, we assume that λ is such that 1 − Kg(λC/n)λ/n > 0, thus
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ E_S { E_{θ∼ρ̂_λ}[r(θ)] − r(θ*) + KL(ρ̂_λ‖π)/λ } / [1 − Kg(λC/n)λ/n].
In particular, take λ = n/max(2K, C). We can check that: λ ≤ n/(2K) ⇒ Kλ/n ≤ 1/2, and λ ≤ n/C ⇒ g(λC/n) ≤ g(1) ≤ 1, so
K g(λC/n) λ/n ≤ 1/2
and thus
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ 2 E_S { E_{θ∼ρ̂_λ}[r(θ)] − r(θ*) + KL(ρ̂_λ‖π)/λ }.   (4.8)
Finally, note that ρ̂_λ minimizes the quantity inside the expectation in the right-hand side, so this can be rewritten:
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ 2 E_S inf_{ρ∈P(Θ)} { E_{θ∼ρ}[r(θ)] − r(θ*) + max(2K, C) KL(ρ‖π)/n }
  ≤ 2 inf_{ρ∈P(Θ)} E_S { E_{θ∼ρ}[r(θ)] − r(θ*) + max(2K, C) KL(ρ‖π)/n }
  = 2 inf_{ρ∈P(Θ)} { E_{θ∼ρ}[R(θ)] − R(θ*) + max(2K, C) KL(ρ‖π)/n }. □
4.3 Applications of Theorem 4.3
Example 4.5 (Finite set of predictors) We come back to the setting of Example 2.1: card(Θ) = M and π is the uniform distribution over Θ. Assuming the Bernstein condition holds with constant K, we apply Theorem 4.3 and, as was done in Example 2.1, we restrict the infimum to ρ ∈ {δ_θ, θ ∈ Θ}. This gives, for λ = n/max(2K, C),
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ 2 inf_{θ∈Θ} { R(θ) − R(θ*) + max(2K, C) log(M)/n }.
In particular, for θ = θ*, this becomes:
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + 2 max(2K, C) log(M)/n.
Note that the rate √(log(M)/n) from Example 2.1 becomes log(M)/n under the Bernstein assumption.
Example 4.6 (Lipschitz loss and Gaussian priors) We now tackle the setting of Example 2.2 under the Bernstein assumption with constant K. Let us remind that Θ = R^d, θ ↦ ℓ(f_θ(x), y) is L-Lipschitz for any (x, y), and π = N(0, σ²I_d). We apply Theorem 4.3, for λ = n/max(2K, C):
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + 2 inf_{ρ=N(m,s²I_d), m∈R^d, s>0} { E_{θ∼ρ}[R(θ)] − R(θ*) + max(2K, C) KL(ρ‖π)/n }.
Following the same derivations as in Example 2.2, with m = θ*,
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + 2 inf_{s>0} { Ls√d + max(2K, C) [ ‖θ*‖²/(2σ²) + (d/2)( s²/σ² + log(σ²/s²) − 1 ) ] / n }.
Here again, we could seek the exact optimizer in s, but taking for example s = √d/n we obtain:
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + 2 { Ld/n + max(2K, C) [ ‖θ*‖²/(2σ²) + (d/2)( d/(n²σ²) + log(σ²n²/d) − 1 ) ] / n },
that is,
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + (2d/n) [ L + (max(2K, C)/2) log(σ²n²/d) + max(2K, C)( ‖θ*‖²/(2σ²d) + d/(2n²σ²) − 1/2 ) ]
  = R(θ*) + O( d log(n) / n ).
Example 4.7 (Lipschitz loss and uniform priors) We propose a variant of the previous example, with a different prior. We still assume the Bernstein assumption with constant K, and that θ ↦ ℓ(f_θ(x), y) is L-Lipschitz for any (x, y); this time Θ = {θ ∈ R^d : ‖θ‖ ≤ B} and π is uniform on Θ. We apply Theorem 4.3 with λ = n/max(2K, C) and restrict the infimum to ρ = U(θ₀, s), the uniform distribution on {θ : ‖θ − θ₀‖ ≤ s} = B(θ₀, s). We obtain:
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + 2 inf_{ρ=U(θ₀,s), θ₀∈R^d, s>0} { E_{θ∼ρ}[R(θ)] − R(θ*) + max(2K, C) KL(ρ‖π)/n }.
For any s > 0, there exists θ₀ ∈ Θ such that θ* ∈ B(θ₀, s) ⊂ Θ, and we have:
E_{θ∼U(θ₀,s)}[R(θ) − R(θ*)] ≤ Ls,
so
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + 2 inf_{s>0} [ Ls + max(2K, C) d log(B/s) / n ].
The minimum of the right-hand side is reached exactly at s = max(2K, C)d/(Ln), and we obtain, for n large enough (in order to ensure that s ≤ B):
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + 2 max(2K, C) (d/n) log( eBLn / (d max(2K, C)) ).
We end this subsection with a (non-exhaustive!) list of more sophisticated models where a rate of convergence was derived thanks to an oracle PAC-Bayes inequality:
- model selection: for classification, see Chapter 5 in [42], and [43]. The minimization procedure in Example 2.3 above is not always optimal. A selection based on Lepski's procedure [110] and PAC-Bayes bounds is used in [43] for classification, and in [2] for general losses.
- density estimation: Chapter 4 in [42].
- scoring/ranking: [151].
- least-squares regression: Chapter 5 in [42], and [18, 45] for robust versions.
- sparse linear regression: see [61], where the authors prove a rate of convergence similar to that of the LASSO for the Gibbs posterior, under more general assumptions. See also [62, 7, 59] for many variants and improvements, and [119] for group-sparsity.
- single-index regression, in small dimension [73], and in high dimension (with sparsity) [4].
- additive non-parametric regression [85].
- matrix regression [169, 3, 59, 58].
- matrix completion: continuous case [122, 120, 121] and binary case [55]; more generally, tensor completion is tackled with related techniques in [170].
- quantum tomography [123].
- deep learning [51].
- unsupervised learning: estimation of the Gram matrix for PCA [45] and kernel-PCA [79, 86],
- ...
4.4 Dimension and rate of convergence
Let us recap the examples of Sections 2 and 4 seen so far. In each case, we were able to prove a result of the form:
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + rate_n(π),  where rate_n(π) → 0 as n → ∞,
for an adequate choice of λ > 0. The way rate_n(π) depends on Θ characterizes the difficulty of learning predictors in Θ when using the prior π: it is similar to other approaches in learning theory, where the learning rate depends on the "complexity of Θ". More precisely (we remind that all the results seen so far are for a bounded loss function):
- when Θ is finite and π is uniform, rate_n(π) is in √(log(M)/n) in general, and in log(M)/n under the Bernstein condition.
- when Θ = R^d and π = N(0, σ²I_d), rate_n(π) is in √([‖θ*‖² + d log(n)]/n) in general, and in [‖θ*‖² + d log(n)]/n under the Bernstein condition.
- left as an exercise (idea from [61]): when Θ = R^d and π is a multivariate Student distribution, rate_n(π) is in √([log‖θ*‖ + d log(n)]/n) in general, and in [log‖θ*‖ + d log(n)]/n under the Bernstein condition.
- when Θ is a compact subset of R^d and π is uniform, rate_n(π) is in √(d log(n)/n) in general, and in d log(n)/n under the Bernstein condition.
Based on these examples, Catoni [43] proposed the following definition. We remind that π_{−βR}, for β > 0, is given by:
(dπ_{−βR}/dπ)(θ) = e^{−βR(θ)} / E_{ϑ∼π}[e^{−βR(ϑ)}] = e^{−β[R(θ)−R(θ*)]} / E_{ϑ∼π}[e^{−β[R(ϑ)−R(θ*)]}].
Definition 4.2 Let
sup_{β≥0} β E_{θ∼π_{−βR}}[R(θ) − R(θ*)] = d_π.
We call d_π the π-dimension of Θ.
We can add the following bullet point to the list:
- when d_π < ∞, rate_n(π) is in √(d_π log(n)/n) in general, and in d_π log(n)/n under the Bernstein condition.
We will now prove it under the Bernstein condition. The general case uses the same arguments, and is thus left as an exercise to the reader (the only difference in the proof is that one must start from Theorem 4.1 instead of Theorem 4.3).
Theorem 4.5 Assume the Bernstein assumption is satisfied with some constant K > 0. Assume d_π < +∞. Taking λ = n/max(2K, C), we have:
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + 2 d_π max(2K, C) log( ne²C / (d_π max(2K, C)) ) / n.
The proof requires the following cute lemma.
Lemma 4.6 For any β ≥ 0,
−log E_{θ∼π} e^{−β[R(θ)−R(θ*)]} ≤ d_π log( eCβ / d_π ).
Proof of Lemma 4.6: Define
f(ξ) = −log E_{θ∼π} e^{−ξ[R(θ)−R(θ*)]}
for any ξ ≥ 0. First, note that
f(0) = −log E_{θ∼π} e⁰ = −log(1) = 0.
Moreover, we can check that f is differentiable and that
f′(ξ) = E_{θ∼π}[ (R(θ) − R(θ*)) e^{−ξ[R(θ)−R(θ*)]} ] / E_{θ∼π}[ e^{−ξ[R(θ)−R(θ*)]} ] = E_{θ∼π_{−ξR}}[R(θ) − R(θ*)] ≤ d_π/ξ,
where we used Definition 4.2 for the last inequality. But we also have the (simpler) inequality:
f′(ξ) = E_{θ∼π}[ (R(θ) − R(θ*)) e^{−ξ[R(θ)−R(θ*)]} ] / E_{θ∼π}[ e^{−ξ[R(θ)−R(θ*)]} ]
  ≤ E_{θ∼π}[ C e^{−ξ[R(θ)−R(θ*)]} ] / E_{θ∼π}[ e^{−ξ[R(θ)−R(θ*)]} ] = C.
Combining both bounds, f′(ξ) ≤ min(C, d_π/ξ). Integrating for 0 ≤ ξ ≤ β gives:
f(β) = f(β) − f(0) = ∫₀^β f′(ξ)dξ
  ≤ ∫₀^{d_π/C} C dξ + ∫_{d_π/C}^β (d_π/ξ) dξ
  = C (d_π/C) + d_π log(β) − d_π log(d_π/C)
  = d_π log( eCβ / d_π ). □
Proof of Theorem 4.5: We apply Theorem 4.3, for λ = n/max(2K, C):
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ 2 inf_{ρ∈P(Θ)} { E_{θ∼ρ}[R(θ)] − R(θ*) + max(2K, C) KL(ρ‖π)/n }.
Taking ρ = π_{−βR} for some β ≥ 0, this becomes:
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ 2 inf_{β≥0} { E_{θ∼π_{−βR}}[R(θ)] − R(θ*) + max(2K, C) KL(π_{−βR}‖π)/n }
  ≤ 2 inf_{β≥0} { d_π/β + max(2K, C) KL(π_{−βR}‖π)/n },   (4.9)
where we used Definition 4.2. Note that
KL(π_{−βR}‖π) = E_{θ∼π_{−βR}} log( e^{−β[R(θ)−R(θ*)]} / E_{ϑ∼π}[e^{−β[R(ϑ)−R(θ*)]}] )
  = −β E_{θ∼π_{−βR}}[R(θ) − R(θ*)] − log E_{θ∼π} e^{−β[R(θ)−R(θ*)]}
  ≤ −log E_{θ∼π} e^{−β[R(θ)−R(θ*)]}   because R(θ) ≥ R(θ*)
  ≤ d_π log( eCβ / d_π )   by Lemma 4.6.
Plugging this into (4.9) gives:
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ 2 inf_{β≥0} { d_π/β + max(2K, C) d_π log( eCβ / d_π ) / n }.
The exact minimization in β gives β = n/max(2K, C) and
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ 2 d_π max(2K, C) log( ne²C / (d_π max(2K, C)) ) / n. □
4.5 Getting rid of the log terms: Catoni's localization trick
We have seen in Subsection 3.3 Catoni's idea of replacing the prior by π_{−βR} for some β > 0, where π_{−βR} is defined as in Subsection 4.4. This technique is called "localization" of the bound by Catoni. Used in empirical bounds, this trick can lead to tighter bounds. We will study its effect on oracle bounds. Let us start by providing a counterpart of Theorem 4.3 using this trick (with β = λ/4).
Theorem 4.7 Assume that the Bernstein condition holds for some K > 0, and take λ = n/max(2K, C). Then
E_S { E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) } ≤ inf_{ρ∈P(Θ)} { 3 E_{θ∼ρ}[R(θ) − R(θ*)] + 4 max(2K, C) KL(ρ‖π_{−(λ/4)R}) / n }.
Before we give the proof, we will show a striking consequence: the log(n) terms in the last bullet point in the list of rates of convergence can be removed:
- when d_π < ∞, rate_n(π) is in √(d_π/n) in general, and in d_π/n under the Bernstein condition, if we use a localized bound.
Indeed, take ρ = π_{−(λ/4)R} = π_{−{n/[4 max(2K,C)]}R} in the right-hand side of Theorem 4.7; the KL term vanishes and we get:
E_S { E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) } ≤ 3 E_{θ∼π_{−(λ/4)R}}[R(θ) − R(θ*)] + 0.
Using Definition 4.2 with β = λ/4 = n/(4 max(2K, C)), so that E_{θ∼π_{−βR}}[R(θ) − R(θ*)] ≤ d_π/β = 4 d_π max(2K, C)/n, we obtain the following corollary.
Corollary 4.8 Assume that the Bernstein condition holds for some K > 0, and take λ = n/max(2K, C). Then:
E_S E_{θ∼ρ̂_λ}[R(θ)] ≤ R(θ*) + 12 d_π max(2K, C) / n.
We can also briefly detail the consequences of the bound in the finite case.
Example 4.8 (The finite case) When card(Θ) = M is finite and π is uniform on Θ, Theorem 4.7 applied to ρ = δ_{θ*} gives:
E_S { E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) } ≤ 4 max(2K, C) KL( δ_{θ*} ‖ π_{−(λ/4)R} ) / n
  = 4 max(2K, C) log( Σ_{θ∈Θ} e^{−(λ/4)[R(θ)−R(θ*)]} ) / n
  = 4 max(2K, C) log( Σ_{θ∈Θ} e^{−n[R(θ)−R(θ*)]/(4 max(2K,C))} ) / n.   (4.10)
Of course, we have:
Σ_{θ∈Θ} e^{−n[R(θ)−R(θ*)]/(4 max(2K,C))} ≤ M
and thus we recover the rate in log(M)/n:
E_S { E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) } ≤ 4 max(2K, C) log(M) / n.
On the other hand, in some situations, we can do better from (4.10). Fix a threshold τ > 0 and define m_τ = card({θ ∈ Θ : R(θ) − R(θ*) ≤ τ}) ∈ {1, ..., M}. Then we obtain the bound:
E_S { E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) } ≤ 4 max(2K, C) log( m_τ + e^{−nτ/(4 max(2K,C))}(M − m_τ) ) / n,
which will be much smaller for large n.
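A quick numerical sketch (with made-up excess risks) of how much (4.10) can gain over the crude log(M)/n rate when only few predictors are near-optimal:

    import math

    def localized_rate(gaps, n, K, C):
        # right-hand side of (4.10): 4 max(2K,C) log(sum_theta e^{-n gap/(4 max(2K,C))}) / n
        a = max(2 * K, C)
        s = sum(math.exp(-n * g / (4 * a)) for g in gaps)
        return 4 * a * math.log(s) / n

    gaps = [0.0] + [0.3] * 999          # one optimal predictor among M = 1000
    for n in [100, 1000, 10000]:
        crude = 4 * max(2 * 1.0, 1.0) * math.log(1000) / n   # the log(M)/n rate
        print(n, localized_rate(gaps, n, K=1.0, C=1.0), crude)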
Proof of Theorem 4.7: we follow the proof of Theorem 4.3 until (4.7), which we remind here:
E_S e^{λ{[1−Kg(λC/n)λ/n][R(θ)−R(θ*)]−r(θ)+r(θ*)}} ≤ 1.
Now, we integrate this with respect to π_{−βR} for some β > 0 and use Fubini:
E_S E_{θ∼π_{−βR}} e^{λ{[1−Kg(λC/n)λ/n][R(θ)−R(θ*)]−r(θ)+r(θ*)}} ≤ 1,
and Donsker and Varadhan's formula:
E_S e^{sup_{ρ∈P(Θ)} ( λ E_{θ∼ρ}{[1−Kg(λC/n)λ/n][R(θ)−R(θ*)]−r(θ)+r(θ*)} − KL(ρ‖π_{−βR}) )} ≤ 1.
At this point, we write explicitly
KL(ρ‖π_{−βR}) = E_{θ∼ρ} log( (dρ/dπ_{−βR})(θ) )
  = E_{θ∼ρ} log( (dρ/dπ)(θ) · E_{ϑ∼π}[e^{−β[R(ϑ)−R(θ*)]}] / e^{−β[R(θ)−R(θ*)]} )
  = KL(ρ‖π) + β E_{θ∼ρ}[R(θ) − R(θ*)] + log E_{ϑ∼π}[e^{−β[R(ϑ)−R(θ*)]}],
which, plugged into the last formula, gives:
E_S e^{sup_{ρ∈P(Θ)} ( λ E_{θ∼ρ}{[1−Kg(λC/n)λ/n−β/λ][R(θ)−R(θ*)]−r(θ)+r(θ*)} − KL(ρ‖π) − log E_{ϑ∼π}[e^{−β[R(ϑ)−R(θ*)]}] )} ≤ 1.
We apply Jensen's inequality and rearrange terms to obtain, for any randomized estimator ρ̂,
[1 − Kg(λC/n)λ/n − β/λ] E_S { E_{θ∼ρ̂}[R(θ)] − R(θ*) } ≤ E_S { E_{θ∼ρ̂}[r(θ)] − r(θ*) + KL(ρ̂‖π)/λ } + log E_{ϑ∼π}[e^{−β[R(ϑ)−R(θ*)]}]/λ.
Here again, the right-hand side is minimized for ρ̂ = ρ̂_λ, the Gibbs posterior, and we obtain:
[1 − Kg(λC/n)λ/n − β/λ] E_S { E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) }
  ≤ E_S inf_{ρ∈P(Θ)} { E_{θ∼ρ}[r(θ)] − r(θ*) + KL(ρ‖π)/λ + log E_{ϑ∼π}[e^{−β[R(ϑ)−R(θ*)]}]/λ }
  ≤ inf_{ρ∈P(Θ)} E_S { E_{θ∼ρ}[r(θ)] − r(θ*) + KL(ρ‖π)/λ + log E_{ϑ∼π}[e^{−β[R(ϑ)−R(θ*)]}]/λ }
  = inf_{ρ∈P(Θ)} { E_{θ∼ρ}[R(θ)] − R(θ*) + KL(ρ‖π)/λ + log E_{ϑ∼π}[e^{−β[R(ϑ)−R(θ*)]}]/λ }
  = inf_{ρ∈P(Θ)} { (1 − β/λ) E_{θ∼ρ}[R(θ) − R(θ*)] + KL(ρ‖π_{−βR})/λ },
where we used again the formula for KL(ρ‖π_{−βR}) in the last step. So, for β and λ such that
Kg(λC/n)λ/n + β/λ < 1,
we have:
E_S { E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) } ≤ inf_{ρ∈P(Θ)} { (1 − β/λ) E_{θ∼ρ}[R(θ) − R(θ*)] + KL(ρ‖π_{−βR})/λ } / [1 − Kg(λC/n)λ/n − β/λ].
For example, for λ = n/max(2K, C), we have already seen that
Kg(λC/n)λ/n ≤ 1/2,
and taking β = λ/4 leads to
E_S { E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) } ≤ inf_{ρ∈P(Θ)} { (3/4) E_{θ∼ρ}[R(θ) − R(θ*)] + max(2K, C) KL(ρ‖π_{−(λ/4)R})/n } / (1 − 3/4),
that is,
E_S { E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) } ≤ inf_{ρ∈P(Θ)} { 3 E_{θ∼ρ}[R(θ) − R(θ*)] + 4 max(2K, C) KL(ρ‖π_{−(λ/4)R})/n },
which ends the proof. □
We end this section with the following comment from page 15 of Catoni's book [43]: “some
of the detractors of the PAC-Bayesian approach (which, as a newcomer, has sometimes
received a suspicious greeting among statisticians) have argued that it cannot bring anything
that elementary union bound arguments could not essentially provide. We do not share of
course this derogatory opinion, and while we think that allowing for non atomic priors and
posteriors is worthwhile, we also would like to stress that the upcoming local and relative
bounds could hardly be obtained with the only help of union bounds”.
5 Beyond “bounded loss” and “i.i.d. observations”
If you follow the proofs of the PAC-Bayesian inequalities seen so far, you will see that the "bounded loss" and "i.i.d. observations" assumptions are used only to apply Lemma 1.1 (Hoeffding's inequality) or Lemma 4.4 (Bernstein's inequality). In other words, in order to prove PAC-Bayes inequalities for unbounded losses or dependent observations, all we need is a result similar to Hoeffding's or Bernstein's inequality (also called exponential moment inequalities) in this context.
In the past 15 years, many variants of PAC-Bayes bounds were developed for various applications based on this remark. In this section, we provide some pointers. In the end, some authors now prefer to assume directly that the data satisfies a given exponential inequality. One of the merits of Theorem 3.6 above (that is, Germain, Lacasse, Laviolette and Marchand's bound [77]) is to make this very explicit: the exponential moment appears in the bound. Since [77], we used in our paper with James Ridgway and Nicolas Chopin [9] a similar approach: we defined a "Hoeffding assumption" and a "Bernstein assumption" that correspond to data satisfying a Hoeffding-type inequality, or a Bernstein-type inequality together with the usual Bernstein condition (Definition 4.1). A similar point of view is used in [154].
Remark 5.1 Note that it is possible to prove a PAC-Bayes inequality like Theorem 2.1 starting directly from (2.2), that is, assuming that an exponential moment inequality is satisfied on average under the prior π, which does not necessarily mean it has to hold for each θ. We will not develop this approach here; examples are detailed in [9, 87, 154].
5.1 “Almost” bounded losses (Sub-Gaussian and sub-gamma)
5.1.1 The sub-Gaussian case
Hoeffding's inequality for n = 1 variable U₁ taking values in [a, b] simply states that
E e^{t[U₁−E(U₁)]} ≤ e^{t²(b−a)²/8}.
The general case is then obtained by:
E e^{t Σ_{i=1}^n [U_i−E(U_i)]} = E( Π_{i=1}^n e^{t[U_i−E(U_i)]} )
  = Π_{i=1}^n E e^{t[U_i−E(U_i)]}   (by independence)
  ≤ Π_{i=1}^n e^{t²(b−a)²/8}
  = e^{nt²(b−a)²/8}.
Alternatively, if we simply assume that, for some C > 0,
E e^{t[U₁−E(U₁)]} ≤ e^{Ct²}   (5.1)
for some constant C, similar derivations lead to:
E e^{t Σ_{i=1}^n [U_i−E(U_i)]} ≤ e^{nCt²},
on which we can build PAC-Bayes bounds. We can actually rephrase Hoeffding's inequality by: "if U₁ takes values in [a, b], then (5.1) is satisfied with C = (b−a)²/8".
It appears that (5.1) is satisfied by some unbounded variables. For example, it is well known that, if U₁ ∼ N(m, σ²), then
E e^{t[U₁−E(U₁)]} = e^{σ²t²/2},
that is, (5.1) with C = σ²/2. Actually, it can be proven that a variable U₁ satisfies (5.1) if and only if its tails P(|U₁| ≥ t) converge to zero (when t → ∞) as fast as those of a Gaussian variable, that is, P(|U₁| ≥ t) ≤ exp(−t²/C′) for some C′ > 0 (see e.g. Chapter 1 in [48]). This is the reason behind the following terminology.
Definition 5.1 A random variable U such that
E e^{t[U−E(U)]} ≤ e^{Ct²}
for some finite C is called a sub-Gaussian random variable (with constant C).
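A quick Monte Carlo sanity check of this definition for a Gaussian variable (a sketch, with arbitrary values of σ and t):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma, t = 2.0, 0.7
    u = rng.normal(0.0, sigma, size=1_000_000)
    print(np.mean(np.exp(t * (u - u.mean()))),  # empirical MGF
          np.exp(sigma ** 2 * t ** 2 / 2))      # e^{C t^2} with C = sigma^2 / 2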
Based on this definition, we can state for example the following variant of Theorem 2.1, which is valid for (some!) unbounded losses.
Theorem 5.1 Assume that for any θ, the ℓ_i(θ) are independent and sub-Gaussian random variables with constant C. Then for any ε > 0, for any λ > 0,
P_S( ∀ρ ∈ P(Θ), E_{θ∼ρ}[R(θ)] ≤ E_{θ∼ρ}[r(θ)] + λC/n + (KL(ρ‖π) + log(1/ε))/λ ) ≥ 1 − ε.
(We don't provide the proof, as all the ingredients were already explained to the reader.)
In the literature, PAC-Bayes bounds explicitly stated for sub-Gaussian losses can be found in [9].
5.1.2 The sub-gamma case
We will not provide details, but variables satisfying inequalities similar to Bernstein's inequality are called sub-gamma random variables, or sometimes sub-exponential random variables. A possible characterization is: P(|U₁| ≥ t) ≤ exp(−t/C′) for some C′ > 0. Such variables include gamma (and exponential) random variables, Gaussian variables and bounded variables.
Chapter 2 of [37] provides a very detailed and pedagogical overview of exponential moment inequalities for independent random variables, and in particular we refer the reader to their Section 2.4 for more details on sub-gamma variables (but I have to warn you, this book is so cool that you will find it difficult to stop at the end of Chapter 2 and will end up reading everything).
In the literature, PAC-Bayes bounds for sub-gamma random variables can be found as early as 2001: [42] (Chapter 5). These are the bounds that are used to prove minimax rates in various parametric and non-parametric problems in the aforementioned [4, 85, 122].
5.1.3 Remarks on exponential moments
Finally, exponential moment inequalities for random variables such that P(|U₁| ≥ t) ≤ exp(−t^α/C′), where α ≥ 1, are studied in Chapter 1 in [48] (the set of such variables is called an Orlicz space).
Still, note that all these random variables are defined so that they satisfy more or less the same exponential inequalities as bounded variables. And indeed, for these variables, P(|U₁| ≥ t) is very small when t is large, hence the title of this section: almost bounded variables. We will now discuss briefly how to go beyond this case.
5.2 Heavy-tailed losses
By heavy-tailed variables, we mean typically random variables U₁ such that P(|U₁| ≥ t) is, for example, of order t^{−α} for some α > 0.
5.2.1 The truncation approach
In my PhD thesis [1], I studied a truncation technique for general losses ℓ_i(θ). That is, write:
ℓ_i(θ) = ℓ_i(θ)1(ℓ_i(θ) ≤ s) + ℓ_i(θ)1(ℓ_i(θ) > s)
for some s > 0. The first term is bounded by s, so we can use exponential moment inequalities on it, while I used inequalities on the tails P(|ℓ_i(θ)| ≥ s) to control the second term. For the sake of completeness, I state one of the bounds obtained by this technique (with s = n/λ).
Theorem 5.2 (Corollary 2.5 in [1]) Define
Δ_{n,λ}(θ) = E_{(X,Y)∼P}[ max( ℓ(f_θ(X), Y) − n/λ, 0 ) ]
and
r̃_{λ,n}(θ) = (1/n) Σ_{i=1}^n Ψ_{λ/n}[ min( ℓ(f_θ(X_i), Y_i), n/λ ) ],
where
Ψ_α(u) := log(1 + αu)/α  and thus  Ψ_α^{-1}(v) = (e^{αv} − 1)/α.
Then, for any ε > 0, for any λ > 0,
P_S( ∀ρ ∈ P(Θ), E_{θ∼ρ}[R(θ)] ≤ Ψ_{λ/n}^{-1}( E_{θ∼ρ}[r̃_{λ,n}(θ)] + (KL(ρ‖π) + log(1/ε))/λ + E_{θ∼ρ}[Δ_{n,λ}(θ)] ) ) ≥ 1 − ε.
Note that r̃_{λ,n}(θ) is an approximation of r(θ) when λ/n is small enough (usually λ ≃ √n in this bound). The function Ψ_α plays a role similar to the function Φ_α in Catoni's bound (Theorem 3.2), and more explicit inequalities can be derived by upper bounding Ψ_α^{-1}. Finally, Δ_{n,λ}(θ) corresponds to the tails of the loss function. Actually, for a bounded loss, we will have Δ_{n,λ}(θ) = 0 for n/λ large enough. In the sub-exponential setting, Δ_{n,λ}(θ) > 0, but it will usually not be the dominant term in the right-hand side. However, in [1], I provide upper bounds on Δ_{n,λ}(θ) in O((λ/n)^{s−1}), where s is such that E(ℓ_i^s) < +∞; this term is dominant in this case (and thus it slows down the rate of convergence). This truncation argument is reused in [2], but only the oracle bounds are provided there.
5.2.2 Bounds based on moment inequalities
Based on techniques developed in [95, 29], [5] proved inequalities similar to PAC-Bayes bounds that hold for heavy-tailed losses (they can also hold for non-i.i.d. losses; we will discuss this point later). Curiously, these bounds no longer depend on the Kullback-Leibler divergence, but on other divergences. An example of such an inequality is provided here; it relies only on the assumption that the losses have a variance.
Theorem 5.3 (Corollary 1 in [5]) Assume that the ℓ_i(θ) are independent and such that Var(ℓ_i(θ)) ≤ κ < ∞. Then, for any ε > 0,
P_S( ∀ρ ∈ P(Θ), E_{θ∼ρ}[R(θ)] ≤ E_{θ∼ρ}[r(θ)] + √( κ(1 + χ²(ρ‖π)) / (nε) ) ) ≥ 1 − ε,
where χ²(ρ‖π) is the chi-square divergence:
χ²(ρ‖π) = ∫ [ (dρ/dπ(θ))² − 1 ] π(dθ)  if ρ ≪ π,  and χ²(ρ‖π) = +∞ otherwise.
(Interestingly, the minimization of the bound with respect to ρ leads to an explicit solution, see [5].) Note that the dependence of the rate on ε is much worse than in Theorem 2.1. This was later dramatically improved by [142]. Still, as for the truncation approach described earlier, this approach leads to slow rates of convergence for heavy-tailed variables.
5.2.3 Bounds based on robust losses
In [44], Olivier Catoni proposed a robust loss function ψ used to estimate the mean of heavy-tailed random variables (note that this is also based on ideas from an earlier paper [18]). As a result, [44] obtains, for the mean of heavy-tailed variables, confidence intervals very similar to those of estimators of the mean of a Gaussian. This loss function was used in conjunction with PAC-Bayes bounds in [45, 79] to study non-Bayesian estimators.
More recently, Holland [94] derived a full PAC-Bayesian theory for possibly heavy-tailed losses based on Catoni's technique. The idea is as follows. Put
ψ(u) = −2√2/3 if u < −√2,  u − u³/6 if −√2 ≤ u ≤ √2,  2√2/3 otherwise,
and, for any s > 0,
r_{ψ,s}(θ) = (s/n) Σ_{i=1}^n ψ( ℓ_i(θ)/s ).
The idea is that, even when ℓ_i(θ) is unbounded, the summands in the new version of the risk, r_{ψ,s}(θ), are bounded, so the study of its deviations can be done through classical means. There is some additional work to connect E_S[r_{ψ,s}(θ)] to R(θ) for a well-chosen s, and [94] obtains the following result.
Theorem 5.4 (Theorem 9 in [94]) Let ε > 0. Assume that the ℓ_i(θ) are independent and that
E(ℓ_i(θ)²) ≤ M₂ < +∞ and E(ℓ_i(θ)³) ≤ M₃ < +∞,
for any θ ∈ Θ, R(θ) ≤ √( nM₂ / (4 log(1/ε)) ),
ε ≤ e^{−1/9} ≈ 0.89,
and put
π*_n(Θ) = E_{θ∼π} e^{n[R(θ)−r_{ψ,s}(θ)]} / E_{θ∼π} e^{R(θ)−r_{ψ,s}(θ)}.
Then, for s := √( nM₂ / (2 log(1/ε)) ),
P_S( ∀ρ ∈ P(Θ), E_{θ∼ρ}[R(θ)] ≤ E_{θ∼ρ}[r_{ψ,s}(θ)] + [ KL(ρ‖π) + M₂ + log(8πM₂/ε²)/2 + π*_n(Θ) ] / n ) ≥ 1 − ε.
Note that, contrary to Theorem 5.3 above, the bound is very similar to the one in the bounded case (heavy-tailed variables do not lead to slower rates). In particular, we have a good dependence of the bound on ε, and the presence of KL(ρ‖π), which is much smaller than χ²(ρ‖π). The only notable differences are the restriction on the range of ε (which is of no consequence in practice), and the term π*_n(Θ). Unfortunately, as discussed in Remark 10 of [94], this term will deteriorate the rate of convergence when the ℓ_i(θ) are not sub-Gaussian (to my knowledge, it is not known whether it leads to better or worse rates than the ones obtained through truncation).
5.3 Dependent observations
5.3.1 Inequalities for dependent variables
There are versions of Hoeffding's and Bernstein's inequalities for dependent random variables, under various assumptions on this dependence. This can be used in the case where the observations are actually a time series, or a random field.
For example, in our paper with O. Wintenberger [10], we learn auto-regressive predictors of the form X̂_t = f_θ(X_{t−1}, ..., X_{t−k}) for weakly dependent time series with a PAC-Bayes bound. The proof relies on Rio's version of Hoeffding's inequality [152]. In that paper, only slow rates in 1/√n are provided. Later, fast rates in 1/n were proven in another paper with O. Wintenberger and X. Li [6] for (less general) mixing time series, thanks to Samson's version of Bernstein's inequality [160].
More exponential moment inequalities (and moment inequalities) for dependent variables can be found in the paper [185] and in the book dedicated to weak dependence [63]. Other time series models where PAC-Bayes bounds were used include martingales [164], Markov chains [21], continuous dynamical systems [88], LTI systems [69]...
5.3.2 A simple example
The weak-dependence conditions are quite general, but they are also quite difficult to understand and the definitions are sometimes cumbersome. We only provide a much simpler example based on a more restrictive assumption, α-mixing. This result comes from [5] and extends Theorem 5.3 to time series.
Definition 5.2 Given two σ-algebras F and G, we define
α(F, G) = sup{ Cov(U, V) : 0 ≤ U ≤ 1, U is F-measurable, 0 ≤ V ≤ 1, V is G-measurable }.
Note that if F and G are independent, α(F, G) = 0.
Definition 5.3 Given a time series U = (U_t)_{t∈Z}, we define its α-mixing coefficients by
∀h ∈ Z, α_h(U) = sup_{t∈N} α( σ(U_t), σ(U_{t+h}) ).
Theorem 5.5 (Corollary 2 in [5]) Let X = (X_t)_{t∈Z} be a real-valued stationary time series. Define, for θ = (θ₁, θ₂) ∈ R², ℓ_t(θ) = (X_t − θ₁ − θ₂X_{t−1})² and R(θ) = E_X[ℓ_t(θ)] (it does not depend on t, as the series is stationary). Define
r(θ) = (1/n) Σ_{t=1}^n ℓ_t(θ),
the empirical risk based on the observation of (X₀, ..., Xₙ). Assume the prior π is chosen such that
∫ ‖θ‖⁶ π(dθ) ≤ M₆ < +∞
(for example, a Gaussian prior). Assume that the α-mixing coefficients of X satisfy:
Σ_{t∈Z} [α_t(X)]^{1/3} ≤ A < +∞.
Assume that E(X_t⁶) ≤ C < +∞. Define ν = 32 C^{2/3} A (1 + 4M₆). Then, for any ε > 0,
P_S( ∀ρ ∈ P(Θ), E_{θ∼ρ}[R(θ)] ≤ E_{θ∼ρ}[r(θ)] + √( ν(1 + χ²(ρ‖π)) / (nε) ) ) ≥ 1 − ε.
5.4 Other non-i.i.d. settings
5.4.1 Non identically distributed observations
When the data is independent, but non-identically distributed, that is, (X_i, Y_i) ∼ P_i, we can still introduce
r(θ) = (1/n) Σ_{i=1}^n ℓ_i(θ) = (1/n) Σ_{i=1}^n ℓ(f_θ(X_i), Y_i)
and
R(θ) = E[r(θ)] = (1/n) Σ_{i=1}^n E_{(X_i,Y_i)∼P_i}[ℓ(f_θ(X_i), Y_i)].
The proofs of most exponential inequalities still hold in this setting (for example, Hoeffding's inequality when the losses are bounded). Based on this remark, the full book [43] is written for independent, but not necessarily identically distributed, observations. Of course, if we actually have P_i = P for all i, we recover the usual case R(θ) = E_{(X,Y)∼P}[ℓ(f_θ(X), Y)].
5.4.2 Shift in the distribution
A common problem in machine learning practice is the shift in distribution: one learns a classifier based on i.i.d. observations (X_i, Y_i) ∼ P, but in practice, the data to be predicted are drawn from another distribution Q, that is: R(θ) = E_{(X,Y)∼Q}[ℓ(f_θ(X), Y)] ≠ E_{(X,Y)∼P}[ℓ(f_θ(X), Y)]. There is still a lot of work to do to address this practical problem, but an interesting approach is proposed in [76]: the authors use a technique called domain adaptation to allow the use of PAC-Bayes bounds in this context.
5.4.3 Meta-learning
Meta-learning is a scenario in which one solves many machine learning tasks simultaneously, the objective being to improve the learning process for yet-to-come tasks. A popular formalization (but not the only one possible) is:
- each task t ∈ {1, ..., T} corresponds to a probability distribution P_t. The P_t's are i.i.d. from some P.
- for each task t, an i.i.d. sample (X₁ᵗ, Y₁ᵗ), ..., (Xₙᵗ, Yₙᵗ) is drawn from P_t. Thus, we observe the empirical risk of task t:
r_t(θ) = (1/n) Σ_{i=1}^n ℓ(f_θ(X_iᵗ), Y_iᵗ)
and use PAC-Bayes bounds to learn a good θ_t for this task.
- based on the data of tasks {1, ..., T}, we want to improve the learning process for a yet unobserved task P_{T+1} ∼ P.
This improvement differs from one application to another, for example: learn a better parameter space Θ_{T+1} ⊂ Θ, learn a better prior, learn better hyperparameters like λ... PAC-Bayes bounds for meta-learning were studied in [146, 12, 99, 157, 115, 132, 116, 70]. I believe PAC-Bayes bounds are particularly convenient for meta-learning problems, and thus that this direction of research is very promising.
6 Related approaches in statistics and machine learning theory
In this section, we list some connections between PAC-Bayes theory and other approaches in statistics and machine learning. We will mostly provide references, and will use mathematics more heuristically than in the previous sections. Note that these connections are well known and were discussed in the literature, see for example [20].
6.1 Bayesian inference in statistics
In Bayesian statistics, we are given a sample X₁, ..., Xₙ assumed to be i.i.d. from some P_θ in a model {P_θ, θ ∈ Θ}. A prior π is given on the parameter set Θ. When each P_θ has a density p_θ with respect to a given measure, the likelihood function is defined by
L(θ; X₁, ..., Xₙ) := Π_{i=1}^n p_θ(X_i).
According to the Bayesian principles, all the information on the parameter that can be inferred from the sample is in the posterior distribution
π(dθ|X₁, ..., Xₙ) = L(θ; X₁, ..., Xₙ) π(dθ) / ∫ L(ϑ; X₁, ..., Xₙ) π(dϑ).
A direct remark is that π(dθ|X₁, ..., Xₙ) can be seen as a Gibbs posterior. Indeed, as
[ Π_{i=1}^n p_θ(X_i) ] π(dθ) = e^{Σ_{i=1}^n log p_θ(X_i)} π(dθ),
we can define the loss ℓ_i(θ) = −log p_θ(X_i); the corresponding empirical risk is the negative log-likelihood:
r(θ) = (1/n) Σ_{i=1}^n [−log p_θ(X_i)],
and we have
π(·|X₁, ..., Xₙ) = π_{−nr} = argmin_{ρ∈P(Θ)} { E_{θ∼ρ}[ (1/n) Σ_{i=1}^n (−log p_θ(X_i)) ] + KL(ρ‖π)/n }.   (6.1)
This connection is for example discussed in [15, 75]. Note, however, that log-likelihoods are rarely bounded, which prevents using the simplest PAC-Bayes bounds to study the consistency of π(dθ|X₁, ..., Xₙ).
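This identity is easy to check numerically. In the sketch below (a toy Bernoulli model on a grid, not from the paper), the Bayesian posterior and the Gibbs posterior π_{−nr} built from the loss ℓ_i(θ) = −log p_θ(X_i) coincide up to numerical precision:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.binomial(1, 0.3, size=50)                 # Bernoulli(0.3) sample
    grid = np.linspace(0.01, 0.99, 99)                # a grid prior on theta

    log_lik = np.array([np.sum(X * np.log(t) + (1 - X) * np.log(1 - t)) for t in grid])
    bayes = np.exp(log_lik - log_lik.max())
    bayes /= bayes.sum()                              # the Bayesian posterior

    n = len(X)
    r = -log_lik / n                                  # empirical risk = neg. log-lik / n
    gibbs = np.exp(-n * (r - r.min()))
    gibbs /= gibbs.sum()                              # the Gibbs posterior pi_{-nr}

    print(np.allclose(bayes, gibbs))                  # True: the two coincide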
6.1.1 Gibbs posteriors, generalized posteriors
Independently from the PAC-Bayes community, the Bayesian statistics community proposed to generalize the notion of posterior π(dθ|X₁, ..., Xₙ) by using a general risk function r(θ) instead of the negative log-likelihood (very often, these are also called Gibbs posteriors). This is done for example in [98, 172], in order to estimate some parameters of the distribution of the data without having to model the whole distribution. Another instance of such generalized posteriors is given by fractional, or tempered, posteriors π_{−αnr}, where r is the negative log-likelihood and α < 1. Grünwald proved that in some contexts where the posterior π(dθ|X₁, ..., Xₙ) is not consistent, a tempered posterior for α small enough will be [81]. Gibbs posteriors are discussed in general in [34], but the paper does not bring new theoretical arguments. An asymptotic study of Gibbs posteriors, using arguments different from PAC-Bayes bounds (but related), can be found in [173] and some of the references therein.
6.1.2 Contraction of the posterior in Bayesian nonparametrics
A very active field of research is the study of the contraction of the posterior in Bayesian statistics: the objective is to prove that π(dθ|X₁, ..., Xₙ) concentrates around θ* when n → ∞. We refer the reader to [158, 78] on this topic, see also [23] on high-dimensional models specifically. Usually, such results require two assumptions:
- a technical condition, the existence of tests to discriminate between members of the model {P_θ, θ ∈ Θ},
- and the prior mass condition, which states that enough mass is given by the prior to a neighborhood of θ*. In other words, π({θ : d(θ, θ*) ≤ δ}) does not converge too fast to 0 when δ → 0, for some distance or risk measure d. For example, we can assume that there is a sequence r_n → 0 when n → ∞ such that
π({θ : d(θ, θ*) ≤ r_n}) ≥ e^{−nr_n}.   (6.2)
Note that the prior mass condition can also be used in conjunction with PAC-Bayes bounds, to show that the bound is small. For example, consider the PAC-Bayes inequality of Theorem 4.3: under the Bernstein condition with constant K and for a well-chosen λ,
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ 2 inf_{ρ∈P(Θ)} { E_{θ∼ρ}[R(θ)] − R(θ*) + max(2K, C) KL(ρ‖π)/n }.
Assume that the prior mass condition holds with d(θ, θ*) = R(θ) − R(θ*) and take
ρ_n(dθ) = π(dθ) 1(d(θ, θ*) ≤ r_n) / π({θ : d(θ, θ*) ≤ r_n}).
We obviously have:
E_{θ∼ρ_n}[R(θ)] − R(θ*) ≤ r_n,
and a direct calculation gives
KL(ρ_n‖π) = −log π({θ : d(θ, θ*) ≤ r_n}) ≤ nr_n,
so the bound becomes:
E_S E_{θ∼ρ̂_λ}[R(θ)] − R(θ*) ≤ 2[1 + max(2K, C)] r_n → 0 as n → ∞.
Thanks to this connection, [30] proved the contraction of tempered posteriors under essentially only a prior mass condition, using a new PAC-Bayes bound.
6.1.3 Variational approximations
In many applications where the dimension of Θ is large, sampling from π(dθ|X₁, ..., Xₙ) becomes a very difficult task. In order to overcome this difficulty, a recent trend is to approximate this probability distribution by a tractable one. Formally, we would choose a set F of probability distributions (for example, Gaussian distributions with diagonal covariance matrices) and define the following approximation of the posterior:
ρ̂ = argmin_{ρ∈F} KL( ρ ‖ π(·|X₁, ..., Xₙ) ).
This is called a variational approximation in Bayesian statistics; we refer the reader to [36] for a recent survey on the topic. Note that, by definition of π(dθ|X₁, ..., Xₙ), we also have
ρ̂ = argmin_{ρ∈F} { E_{θ∼ρ}[ (1/n) Σ_{i=1}^n (−log p_θ(X_i)) ] + KL(ρ‖π)/n },
that is, a restricted version of (6.1).
This leads to two remarks:
- non-exact minimization of PAC-Bayes bounds, as in Subsubsection 2.1.3, can be interpreted as variational approximations of Gibbs posteriors. Note that this is also the case of the Gaussian approximation that was used for neural networks in [67]. This led to our paper [9], dedicated to the consistency of variational approximations of Gibbs posteriors proven via PAC-Bayes bounds; see also the results in [167].
- on the other hand, little was known at that time on the theoretical properties of variational approximations of the posterior in statistics. Using the fact that variational approximations of tempered posteriors are constrained minimizers of the PAC-Bayes bound of [30], we studied in [8] the consistency of such approximations. As a byproduct, we obtained a generalization of the prior mass condition for variational inference (see (2.1) and (2.2) in [8]). These results were extended to the standard posterior π(dθ|X₁, ..., Xₙ) in [190, 191].
More theoretical studies on variational inference (using PAC-Bayes, or not) appeared at the same time or since: [52, 96, 125, 50, 148, 53, 21, 19, 141, 72].
The most general version of (6.1) we are aware of is:
ρ̂ = argmin_{ρ∈F} { E_{θ∼ρ}[r(θ)] + D(ρ‖π)/n },   (6.3)
where D is any distance or divergence between probability distributions. Here, the Bayesian point of view is generalized in three directions:
1. the negative log-likelihood is replaced by a more general notion of risk r(θ), as in PAC-Bayes bounds and in Gibbs posteriors,
2. the minimization over P(Θ) is replaced by the minimization over F, in order to keep things tractable, as in variational inference,
3. finally, the KL divergence is replaced by a general D. Note that this already happened in Theorem 5.3 above.
This triple generalization, and the optimization point of view on Bayesian statistics, is strongly advocated in [103] (in particular, reasons to replace KL by D are given in this paper that seem to me more relevant in practice than Theorem 5.3).
In this spirit, [156, 49] provided PAC-Bayes-type bounds where D is the Wasserstein distance.
Remark 6.1 Note that, when D ≠ KL, (6.3) is no longer equivalent to
ρ̂ = argmin_{ρ∈F} D( ρ ‖ π(·|X₁, ..., Xₙ) ).   (6.4)
The paper [103] discusses why (6.3) is a more natural generalization, and [74] shows that (6.4) leads to difficult minimization problems. Note, however, that there are also some theoretical results on (6.4) in [97].
6.2 Empirical risk minimization
We already pointed out in the introduction the link between empirical risk minimization (based on PAC bounds) and PAC-Bayes.
When the parameter space Θ is not finite as in Theorem 1.2 above, the log(M) term is replaced by a measure of the complexity of Θ called the Vapnik-Chervonenkis dimension (VC-dim). We simply mention that in Section 2 of [41], Catoni builds a well-chosen data-dependent prior such that the VC-dim of Θ appears explicitly in the PAC-Bayes bound. There is a similar construction in Chapter 3 of [43]. However, in the paper [117], Livni and Moran provide an example of a situation where the VC dimension is finite, and still the PAC-Bayes approach fails. (Note, however, that this problem was solved recently by [80] thanks to "conditional PAC-Bayes bounds"; this is discussed below together with mutual information bounds.)
Similarly, generalization bounds for Support Vector Machines are based on a quantity called the margin. This quantity can also appear in PAC-Bayes bounds [108, 90, 43, 32].
Audibert and Bousquet studied a PAC-Bayes version of the chaining argument [17]. See also a version based on mutual information bounds [14].
Finally, [187] proved PAC-Bayes bounds using Rademacher complexity.
6.3 Online learning
6.3.1 Sequential prediction
Sequential classification focuses on the following problem. At each time step t,
- a new object x_t is revealed,
- the forecaster must propose a prediction ŷ_t of the label y_t,
- the true label y_t ∈ {0, 1} is revealed, and the forecaster incurs the loss ℓ(ŷ_t, y_t) and updates his/her knowledge.
Similarly, online regression and other online prediction problems are studied.
Prediction strategies are often evaluated through upper bounds on the regret Reg(T), given by:
Reg(T) := Σ_{t=1}^T ℓ(ŷ_t, y_t) − inf_θ Σ_{t=1}^T ℓ(f_θ(x_t), y_t),
where {f_θ, θ ∈ Θ} is a family of predictors as in Section 1 above. However, a striking point is that most regret bounds hold without any stochastic assumption on the data (x_t, y_t)_{t=1,...,T}: they are not assumed to be independent, nor to have any link whatsoever with any statistical model. On the other hand, assumptions on the function θ ↦ ℓ(f_θ(x_t), y_t) are unavoidable (depending on the strategies: boundedness, Lipschitz condition, convexity, strong convexity, etc.).
A popular strategy, strongly related to the PAC-Bayesian approach, is the exponentially weighted average (EWA) forecaster, also known as the weighted majority algorithm or multiplicative weights update rule [114, 182, 46, 102] (we also refer to [27] on the halving algorithm, which can be seen as an ancestor of this method). This strategy is defined as follows. First, let $\rho_1 = \pi$ be a prior distribution on $\Theta$. Then, at each time step $t$:
• the prediction is given by
$$\hat{y}_t = \mathbb{E}_{\theta \sim \rho_t}[f_\theta(x_t)],$$
• when $y_t$ is revealed, we update
$$\rho_{t+1}(\mathrm{d}\theta) = \frac{e^{-\eta \ell(f_\theta(x_t), y_t)} \rho_t(\mathrm{d}\theta)}{\int_\Theta e^{-\eta \ell(f_\vartheta(x_t), y_t)} \rho_t(\mathrm{d}\vartheta)}.$$
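Here is a minimal sketch of this strategy for a finite $\Theta = \{\theta_1, \dots, \theta_M\}$ with a uniform prior; the function name, the array layout, and the requirement that `loss` accept NumPy arrays elementwise are my own illustrative choices, not notation from the text.

```python
import numpy as np

# Minimal sketch of the EWA forecaster for a finite Theta = {theta_1, ..., theta_M}
# with a uniform prior rho_1. predictions[t, j] plays the role of f_{theta_j}(x_t).

def ewa(predictions, labels, eta, loss):
    """Run EWA; return the cumulative loss sum_t loss(y_hat_t, y_t)."""
    T, M = predictions.shape
    log_w = np.zeros(M)                    # log-weights of rho_t; rho_1 = uniform prior
    cumulative_loss = 0.0
    for t in range(T):
        rho = np.exp(log_w - log_w.max())
        rho /= rho.sum()                   # current distribution rho_t on the experts
        y_hat = rho @ predictions[t]       # prediction: E_{theta ~ rho_t}[f_theta(x_t)]
        cumulative_loss += loss(y_hat, labels[t])
        log_w -= eta * loss(predictions[t], labels[t])   # exponential weight update
    return cumulative_loss
```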
We provide here a simple regret bound that can be found in [47] (stated for a finite Θ
but the extension is direct). Note the formal analogy with PAC-Bayes bounds.
Theorem 6.1 Assume that, for any $t$, $0 \leq \ell(f_\theta(x_t), y_t) \leq C$ (bounded loss assumption) and that $\theta \mapsto \ell(f_\theta(x_t), y_t)$ is a convex function of $\theta$. Then, for any $T > 0$,
$$\sum_{t=1}^{T} \ell(\hat{y}_t, y_t) \leq \inf_{\rho \in \mathcal{P}(\Theta)} \left\{ \sum_{t=1}^{T} \mathbb{E}_{\theta \sim \rho}[\ell(f_\theta(x_t), y_t)] + \frac{\eta C^2 T}{8} + \frac{\mathrm{KL}(\rho \| \pi)}{\eta} \right\}.$$
In particular, when $\Theta$ is finite with $\mathrm{card}(\Theta) = M$ and $\pi$ is uniform, the restriction of the infimum to Dirac masses leads to
$$\sum_{t=1}^{T} \ell(\hat{y}_t, y_t) \leq \inf_{\theta \in \Theta} \sum_{t=1}^{T} \ell(f_\theta(x_t), y_t) + \frac{\eta C^2 T}{8} + \frac{\log(M)}{\eta}$$
and thus, with $\eta = \frac{2}{C}\sqrt{\frac{2\log(M)}{T}}$,
$$\mathrm{Reg}(T) \leq C \sqrt{\frac{T \log(M)}{2}}.$$
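Continuing the sketch above, one can pick $\eta$ as in Theorem 6.1 and check on synthetic data that the realized regret stays below $C\sqrt{T\log(M)/2}$; the data, the choice of squared loss (bounded by $C = 1$ on $[0,1]$ and convex in the prediction) and all constants are illustrative.

```python
rng = np.random.default_rng(1)
T, M = 2000, 10
preds = rng.uniform(0.0, 1.0, size=(T, M))     # expert predictions in [0, 1]
labels = np.clip(preds[:, 3] + 0.1 * rng.normal(size=T), 0.0, 1.0)
sq_loss = lambda p, y: (p - y) ** 2            # bounded by C = 1 on [0, 1]

eta = 2.0 * np.sqrt(2.0 * np.log(M) / T)       # eta from Theorem 6.1 with C = 1
best = ((preds - labels[:, None]) ** 2).sum(axis=0).min()
regret = ewa(preds, labels, eta, sq_loss) - best
print("realized regret:", regret)
print("regret bound   :", np.sqrt(T * np.log(M) / 2.0))   # C * sqrt(T log(M) / 2)
```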
Choices of $\eta$ that do not depend on the time horizon $T$ are possible [47]. Smaller regret bounds, of order $\log(T)$ or even constant, are known under stronger assumptions [47, 16]. We refer the reader to [165, 144] for more up-to-date introductions to online learning.
While there is no stochastic assumption on the data in Theorem 6.1, it is possible to deduce inequalities in probability or in expectation from it under additional assumptions (for example, the assumption that the data is i.i.d.). This is described for example in Chapter 5 of [165], but does not always lead to optimal rates. A more up-to-date discussion of this topic, with more general results, can be found in [33].
Finally, note that many other strategies than EWA are studied: online gradient algorithm, follow-the-regularized-leader (FTRL), online mirror descent (OMD)... EWA is actually derived as a special case of FTRL and OMD in many references (e.g. [165]) but, conversely, [93, 101] derive OMD and the online gradient algorithm as EWA applied with various approximations (compare to Section 2 and the remark that one can use PAC-Bayes inequalities to provide generalization error bounds on non-Bayesian methods like the ERM).
6.3.2 Bandits and reinforcement learning (RL)
Other online problems have received considerable attention in the past few years. In bandits, the forecaster only receives feedback on the loss of his/her prediction, but not on the losses he/she would have incurred under other predictions. We refer the reader to [39] for an introduction to bandits. Note that some strategies used for bandits are derived from EWA. Some authors derived strategies or regret bounds directly from PAC-Bayes bounds [162, 163]. Bandits themselves are a subclass of a larger family of learning problems: reinforcement learning (RL). Some generalization bounds similar to PAC-Bayes bounds were derived for RL in [183].
6.4 Aggregation of estimators in statistics
Given a set $E$ of statistical estimators, the aggregation problem consists in finding a new estimator, called the aggregate, that performs as well as the best estimator in $E$; see [136] for a formal definition and variants of the problem. The optimal rates are derived in [177]. Many aggregates share a formal similarity with the EWA of online learning and with the Gibbs posterior of the PAC-Bayes approach; we refer the reader to [136, 100, 188, 131, 189, 42, 192, 112, 109, 61, 40, 169, 62, 57, 60, 56, 59, 119, 58]. In some of these papers, the connection to PAC-Bayes bounds is explicit: Theorem 1 in [61] is referred to as a PAC-Bayes bound in that paper. It is actually an oracle PAC-Bayes bound in expectation. It leads to fast rates in the spirit of Theorem 4.3, but under different assumptions (in particular, the $X_i$'s are not random there).
6.5 Information theoretic approaches
A note on the terminology: a huge number of the statistical and machine learning results mentioned above rely on tools from information theory (Tong Zhang's beautiful paper [192] actually proves PAC-Bayesian bounds under the name "information theoretic bounds"; I hope this is not the reason why it is often not cited by the PAC-Bayes community). My goal here is not to classify what is an information theoretic approach and what is not; I'm certainly not qualified for that. I simply want to point out the connection to two families of methods inspired directly by information theory.
6.5.1 Minimum description length
In Rissanen's Minimum Description Length (MDL) principle, the idea is to penalize the empirical error of a classifier by its shortest description [153]. We refer the reader to [24, 82] for more recent presentations of this very fruitful approach. Note that, given a prefix code on a finite alphabet $\Theta$, it is possible to build a probability distribution $\pi(\theta) \propto 2^{-L(\theta)}$ where $L(\theta)$ is the length of the code of $\theta$, so MDL provides a way to define priors in PAC-Bayes bounds; see Chapter 1 of Catoni's lecture notes [42].
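As a toy illustration of this construction (the code lengths below are made up, but satisfy Kraft's inequality, as a valid prefix code must):

```python
# Toy illustration of the MDL-to-prior construction: given prefix code lengths
# L(theta) (illustrative values), define pi(theta) proportional to 2^{-L(theta)}.
code_lengths = {"theta_1": 1, "theta_2": 2, "theta_3": 3, "theta_4": 3}
assert sum(2.0 ** (-L) for L in code_lengths.values()) <= 1.0   # Kraft's inequality

weights = {t: 2.0 ** (-L) for t, L in code_lengths.items()}
Z = sum(weights.values())
prior = {t: w / Z for t, w in weights.items()}
print(prior)   # {'theta_1': 0.5, 'theta_2': 0.25, 'theta_3': 0.125, 'theta_4': 0.125}
# In a PAC-Bayes bound with a Dirac mass on theta, the KL term becomes
# log(1/pi(theta)), i.e. (up to constants) the description length L(theta).
```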
6.5.2 Mutual information bounds (MI)
Recently, some generalization error bounds appeared where the complexity is measured in
terms of the mutual information between the sample and the estimator.
Definition 6.1 Let $U$ and $V$ be two random variables with joint probability distribution $P_{U,V}$. Let $P_U$ and $P_V$ denote the marginal distributions of $U$ and $V$ respectively. The mutual information (MI) between $U$ and $V$ is defined as:
$$I(U, V) := \mathrm{KL}(P_{U,V} \| P_U \otimes P_V).$$
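For discrete random variables this definition can be evaluated directly; the joint distribution below is purely illustrative.

```python
import numpy as np

# Mutual information computed directly from Definition 6.1 for a discrete pair (U, V).
P_UV = np.array([[0.3, 0.1],
                 [0.1, 0.5]])               # P_{U,V}
P_U = P_UV.sum(axis=1, keepdims=True)       # marginal of U
P_V = P_UV.sum(axis=0, keepdims=True)       # marginal of V
I_UV = np.sum(P_UV * np.log(P_UV / (P_U * P_V)))   # KL(P_{U,V} || P_U x P_V)
print(I_UV)   # 0 if and only if U and V are independent
```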
In [159], Russo and Zou introduced generalization error bounds (in expectation) that depend on the mutual information between the predictors and the labels. In particular, when $\hat{f}$ is obtained by empirical risk minimization, they recover bounds depending on the VC-dimension of $\Theta$.
Russo and Zou's results were improved and extended by Raginsky, Rakhlin, Tsao, Wu and Xu [149, 186]. In particular, Subsection 4.3 of [186] proves powerful MI inequalities, and then recovers from them a bound in expectation that is almost exactly Theorem 2.8 above. However, in the same way that the PAC-Bayes community was not aware of Zhang's "information theoretic bounds" [192], it seems that the authors of [186] were not aware of PAC-Bayes bounds. Recently, the connection between MI bounds and PAC-Bayes was pointed out in [22]. There, the authors provide a unified approach, and derive various bounds under different assumptions on the loss.
In their Theorem 2.3 in [135], Negrea, Haghifam, Dziugaite, Khisti and Roy write explicitly the connection between MI bounds and PAC-Bayes bounds. So we state here their result, or rather, a simplified version (obtained by setting their parameter $m$ to 0).

Theorem 6.2 (Theorem 2.3 in [135] with $m = 0$) Using the notations and assumptions of Section 1, assume that the losses $\ell_i(\theta)$ are sub-Gaussian with parameter $C$. Then, for any data-dependent $\tilde{\rho}$,
$$\mathbb{E}_S \Big\{ \mathbb{E}_{\theta \sim \tilde{\rho}}[R(\theta)] - \mathbb{E}_{\theta \sim \tilde{\rho}}[r(\theta)] \Big\} \leq \sqrt{\frac{2 C\, I(\theta, S)}{n}} \leq \sqrt{\frac{2 C\, \mathbb{E}_S[\mathrm{KL}(\tilde{\rho} \| \pi)]}{n}}.$$
A few comments on this result:
• the first inequality is actually Theorem 1 of [186]. Note however that Theorem 2.3 in [135] contains more information, as setting their parameter $m \neq 0$ allows one to get a data-dependent prior. The paper contains more new results, and a beautiful application to derive empirical bounds on the performance of stochastic gradient descent (SGD). On this topic, see also the recent [38, 184, 137].
• we can see here that the MI bound
$$\mathbb{E}_S \Big\{ \mathbb{E}_{\theta \sim \tilde{\rho}}[R(\theta)] - \mathbb{E}_{\theta \sim \tilde{\rho}}[r(\theta)] \Big\} \leq \sqrt{\frac{2 C\, I(\theta, S)}{n}} \qquad (6.5)$$
is tighter than the PAC-Bayes bound in expectation
$$\mathbb{E}_S \Big\{ \mathbb{E}_{\theta \sim \tilde{\rho}}[R(\theta)] - \mathbb{E}_{\theta \sim \tilde{\rho}}[r(\theta)] \Big\} \leq \sqrt{\frac{2 C\, \mathbb{E}_S[\mathrm{KL}(\tilde{\rho} \| \pi)]}{n}}. \qquad (6.6)$$
However, the MI bound cannot be used as is in practice. Indeed, $I(\theta, S)$ depends on the distribution of the sample $S$, which is unknown in practice. In order to use (6.5), one must upper bound $I(\theta, S)$ by a quantity that does not depend on this unknown distribution.
• in two discussions, page 14 and page 51 of [43], Catoni discusses the optimization of PAC-Bayes bounds with respect to the prior. In order to explain the discussion done there, apply $\sqrt{ab} \leq \lambda a / 2 + b / (2\lambda)$, with $a = C/n$ and $b = 2\,\mathbb{E}_S[\mathrm{KL}(\tilde{\rho} \| \pi)]$, to (6.6) to get a "Catoni style" bound:
$$\mathbb{E}_S \Big\{ \mathbb{E}_{\theta \sim \tilde{\rho}}[R(\theta)] - \mathbb{E}_{\theta \sim \tilde{\rho}}[r(\theta)] \Big\} \leq \frac{C \lambda}{2 n} + \frac{\mathbb{E}_S[\mathrm{KL}(\tilde{\rho} \| \pi)]}{\lambda}$$
and thus
$$\mathbb{E}_S \mathbb{E}_{\theta \sim \tilde{\rho}}[R(\theta)] \leq \mathbb{E}_S \Big\{ \mathbb{E}_{\theta \sim \tilde{\rho}}[r(\theta)] + \frac{C \lambda}{2 n} + \frac{\mathrm{KL}(\tilde{\rho} \| \pi)}{\lambda} \Big\}$$
(compare to Theorem 2.8 above). Let $\tilde{\rho}$ be any data-dependent measure that is absolutely continuous with respect to $\pi$ almost surely, so that
$$\frac{\mathrm{d}\tilde{\rho}}{\mathrm{d}\pi}(\theta)$$
is well-defined. Catoni defines $\mathbb{E}_S(\tilde{\rho})$ as the probability measure given by
$$\frac{\mathrm{d}\mathbb{E}_S(\tilde{\rho})}{\mathrm{d}\pi}(\theta) = \mathbb{E}_S \left[ \frac{\mathrm{d}\tilde{\rho}}{\mathrm{d}\pi}(\theta) \right].$$
Direct calculations (checked numerically in the sketch after this list) show that $\mathbb{E}_S[\mathrm{KL}(\tilde{\rho} \| \pi)] = \mathbb{E}_S[\mathrm{KL}(\tilde{\rho} \| \mathbb{E}_S(\tilde{\rho}))] + \mathrm{KL}(\mathbb{E}_S(\tilde{\rho}) \| \pi) = I(\theta, S) + \mathrm{KL}(\mathbb{E}_S(\tilde{\rho}) \| \pi)$ and thus:
$$\mathbb{E}_S \mathbb{E}_{\theta \sim \tilde{\rho}}[R(\theta)] \leq \mathbb{E}_S \Big\{ \mathbb{E}_{\theta \sim \tilde{\rho}}[r(\theta)] \Big\} + \frac{C \lambda}{2 n} + \frac{I(\theta, S) + \mathrm{KL}(\mathbb{E}_S(\tilde{\rho}) \| \pi)}{\lambda}.$$
So the choice to replace $\pi$ by $\mathbb{E}_S(\tilde{\rho})$ gives the MI bound:
$$\mathbb{E}_S \mathbb{E}_{\theta \sim \tilde{\rho}}[R(\theta)] \leq \mathbb{E}_S \Big\{ \mathbb{E}_{\theta \sim \tilde{\rho}}[r(\theta)] \Big\} + \frac{C \lambda}{2 n} + \frac{I(\theta, S)}{\lambda}.$$
The choice $\lambda = \sqrt{2 n I(\theta, S)/C}$ leads to
$$\mathbb{E}_S \mathbb{E}_{\theta \sim \tilde{\rho}}[R(\theta)] \leq \mathbb{E}_S \Big\{ \mathbb{E}_{\theta \sim \tilde{\rho}}[r(\theta)] \Big\} + \sqrt{\frac{2 C\, I(\theta, S)}{n}}.$$
In other words, MI bounds can be seen as KL bounds optimized with respect to the prior. Of course, as we said above, the MI bound cannot be computed in practice. Catoni proposes an interpretation of his localization technique as taking the prior $\pi_{-\beta R}$ to approximate $\mathbb{E}_S(\pi_{-\lambda r})$, and then upper bounding $\mathrm{KL}(\rho \| \pi_{-\beta R})$ via empirical bounds. As we have seen in Section 3, this leads to empirical bounds with data-dependent priors, and in Section 4, this leads to improved PAC-Bayes oracle bounds. All this is pointed out by Grünwald, Steinke and Zakynthinou [80]: "Catoni already mentions that the prior that minimizes a MAC-Bayesian bound is the prior that turns the KL term into the mutual information".
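As announced, here is a numerical check of the decomposition $\mathbb{E}_S[\mathrm{KL}(\tilde{\rho}\|\pi)] = I(\theta, S) + \mathrm{KL}(\mathbb{E}_S(\tilde{\rho})\|\pi)$ in a discrete toy case where everything is computable (all distributions are randomly generated, for illustration only):

```python
import numpy as np

# Numerical check, in a discrete toy case, of the decomposition
#   E_S[KL(rho~ || pi)] = I(theta, S) + KL(E_S(rho~) || pi),
# where rho~ is the conditional law of theta given S.
rng = np.random.default_rng(0)
n_s, n_theta = 4, 3
p_s = rng.dirichlet(np.ones(n_s))                   # law of the sample S
rho = rng.dirichlet(np.ones(n_theta), size=n_s)     # rho~ for each value of S
pi = rng.dirichlet(np.ones(n_theta))                # prior

kl = lambda p, q: np.sum(p * np.log(p / q))
marginal = p_s @ rho                                # E_S(rho~): marginal law of theta
lhs = sum(p_s[s] * kl(rho[s], pi) for s in range(n_s))
mi = sum(p_s[s] * kl(rho[s], marginal) for s in range(n_s))   # I(theta, S)
print(np.isclose(lhs, mi + kl(marginal, pi)))       # True
```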
Thanks to MI bounds, it is also possible to provide an exact formula (not an upper bound) for the generalization error of the Gibbs posterior, in terms of the symmetrized KL information [11].
From [28, 134], it is known that MI bounds can fail in some situations where the VC dimension is finite: thus, they suffer from the same limitation as the PAC-Bayes bounds, proven in [117]. Recently, using ideas from the PAC-Bayes literature [15, 43, 133] and from the MI literature [168, 89], Grünwald, Steinke and Zakynthinou [80] unified MI bounds and PAC-Bayes bounds, and they developed "conditional" MI and PAC-Bayes bounds. These bounds are proven to be small for any set of classifiers with finite VC dimension. Thus, they do not suffer from the limitations of PAC-Bayes and MI bounds exhibited in [28, 134, 117]. To cite the results of [80] would go beyond the framework of this "easy introduction", but one of the main points of this tutorial is to prepare the reader not familiar with PAC-Bayes bounds, the Bernstein assumption, etc., to read this paper.
7 Conclusion
We hope that the reader
• has a better view of what a PAC-Bayes bound is, and what can be done with such a bound,
• is at least a little convinced that these bounds are quite flexible, that they can be used in a wide range of contexts, and for different objectives in ML,
• wants to read many of the references listed above, that provide tighter bounds and clever applications.
I believe that PAC-Bayes bounds (and all the related approaches, including mutual infor-
mation bounds, etc.) will play an important role in the study of deep learning (in the wake
of [67]), in RL [183] and in meta-learning [146, 12, 157].
Acknowledgements
I learnt so much about PAC-Bayes bounds and related topics from my PhD advisor, all
my co-authors, friends, students and twitter pals... that I will not even try to make a list.
Thanks to all of you, you know who you are!
Still, I would like to thank specifically, for motivating, providing valuable feedback and helping to improve this document (since the very first draft of Section 2): Mathieu Alain, Pradeep Banerjee, Wessel Bruinsma, David Burt, Badr-Eddine Chérief-Abdellatif, Andrew Foong, Emtiyaz Khan, Aryeh Kontorovich, The Tien Mai, Thomas Möllenhoff, Peter Nickl, Donlapark Ponnoprat, Charles Riou and all the members of the Approximate Bayesian Inference team at RIKEN AIP.
References
[1] P. Alquier. Transductive and inductive adaptative inference for regression and density
estimation. PhD thesis, University Paris 6, 2006.
[2] P. Alquier. PAC-Bayesian bounds for randomized empirical risk minimizers. Mathe-
matical Methods of Statistics, 17(4):279–304, 2008.
[3] P. Alquier. Bayesian methods for low-rank matrix estimation: short survey and the-
oretical study. In International Conference on Algorithmic Learning Theory, pages
309–323. Springer, 2013.
[4] P. Alquier and G. Biau. Sparse single-index model. Journal of Machine Learning
Research, 14(1), 2013.
[5] P. Alquier and B. Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine
Learning, 107(5):887–902, 2018.
[6] P. Alquier, X. Li, and O. Wintenberger. Prediction of time series by statistical learning:
general losses and fast rates. Dependence Modeling, 1(2013):65–93, 2013.
[7] P. Alquier and K. Lounici. PAC-Bayesian bounds for sparse regression estimation with
exponential weights. Electronic Journal of Statistics, 5:127–145, 2011.
[8] P. Alquier and J. Ridgway. Concentration of tempered posteriors and of their varia-
tional approximations. Annals of Statistics, 48(3):1475–1497, 2020.
[9] P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations
of Gibbs posteriors. Journal of Machine Learning Research, 17(239):1–41, 2016.
[10] P. Alquier and O. Wintenberger. Model selection for weakly dependent time series
forecasting. Bernoulli, 18(3):883–913, 2012.
[11] G. Aminian, Y. Bu, L. Toni, M. R. D. Rodrigues, and G. Wornell. Characterizing
the generalization error of Gibbs algorithm with symmetrized KL information. arXiv
preprint arXiv:2107.13656, 2021.
[12] R. Amit and R. Meir. Meta-learning by adjusting priors based on extended PAC-Bayes
theory. In International Conference on Machine Learning, pages 205–214. PMLR, 2018.
[13] G. Appert and O. Catoni. New bounds for k-means and information k-means. arXiv
preprint arXiv:2101.05728, 2021.
[14] A. Asadi, E. Abbe, and S. Verdu. Chaining mutual information and tightening gen-
eralization bounds. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 31. Curran Associates, Inc., 2018.
[15] J.-Y. Audibert. PAC-Bayesian statistical learning theory. PhD thesis, Université Paris VI, 2004.
[16] J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. The
Annals of Statistics, 37(4):1591–1646, 2009.
[17] J.-Y. Audibert and O. Bousquet. Combining pac-bayesian and generic chaining bounds.
Journal of Machine Learning Research, 8(4), 2007.
[18] J.-Y. Audibert and O. Catoni. Robust linear least squares regression. The Annals of
Statistics, 39(5):2766–2794, 2011.
[19] M. Avella Medina, J. L. Montiel Olea, C. Rush, and A. Velez. On the robustness to
misspecification of α-posteriors and their variational approximations. arXiv preprint
arXiv:2104.08324, 2021.
[20] A. Banerjee. On Bayesian bounds. In Proceedings of ICML, pages 81–88. ACM, 2006.
[21] I. Banerjee, V. A. Rao, and H. Honnappa. PAC-Bayes bounds on variational tempered
posteriors for Markov models. Entropy, 23(3):313, 2021.
[22] P. K. Banerjee and G. Montúfar. Information complexity and generalization bounds.
arXiv preprint arXiv:2105.01747, 2021.
[23] S. Banerjee, I. Castillo, and S. Ghosal. Bayesian inference in high-dimensional models.
arXiv preprint arXiv:2101.04491, 2021.
[24] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in
coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760,
1998.
[25] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk
bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[26] P. L. Bartlett and S. Mendelson. Empirical minimization. Probability theory and related
fields, 135(3):311–334, 2006.
[27] J. Bārzdiņš and R. Freivalds. Prediction and limiting synthesis of recursively enumerable classes of functions. Latvijas Valsts Univ. Zinātn. Raksti, 210:101–111, 1974.
[28] R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff. Learners that use
little information. In F. Janoos, M. Mohri, and K. Sridharan, editors, Proceedings of
Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research,
pages 25–55. PMLR, 07–09 Apr 2018.
[29] L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy. PAC-Bayesian bounds based on
the Rényi divergence. In Artificial Intelligence and Statistics, pages 435–444. PMLR,
2016.
[30] A. Bhattacharya, D. Pati, and Y. Yang. Bayesian fractional posteriors. The Annals of
Statistics, 47(1):39–66, 2019.
[31] F. Biggs and B. Guedj. Differentiable PAC–Bayes objectives with partially aggregated
neural networks. Entropy, 23(10), 2021.
[32] F. Biggs and B. Guedj. On Margins and Derandomisation in PAC-Bayes. Preprint
arXiv:2107.03955, 2021.
[33] B. Bilodeau, J. Negrea, and D. M. Roy. Relaxing the IID Assumption: Adap-
tive Minimax Optimal Sequential Prediction with Expert Advice. arXiv preprint
arXiv:2007.06552, 2020.
[34] P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief
distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodol-
ogy), 78(5):1103–1130, 2016.
[35] G. Blanchard and F. Fleuret. Occam’s hammer. In International Conference on Com-
putational Learning Theory, pages 112–126. Springer, 2007.
[36] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for
statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
[37] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities. Oxford Univer-
sity Press, 2013.
[38] Y. Bu, S. Zou, and V. V. Veeravalli. Tightening mutual information-based bounds on
generalization error. IEEE Journal on Selected Areas in Information Theory, 1(1):121–
130, 2020.
[39] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-
armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122,
2012.
[40] F. Bunea and A. Nobel. Sequential procedures for aggregating arbitrary estimators
of a conditional mean. IEEE Transactions on Information Theory, 54(4):1725–1735,
2008.
[41] O. Catoni. A PAC-Bayesian approach to adaptive classification. preprint LPMA 840,
2003.
[42] O. Catoni. Statistical Learning Theory and Stochastic Optimization. Saint-Flour Sum-
mer School on Probability Theory 2001 (Jean Picard ed.), Lecture Notes in Mathe-
matics. Springer, 2004.
[43] O. Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical
learning. Institute of Mathematical Statistics Lecture Notes – Monograph Series, 56.
Institute of Mathematical Statistics, Beachwood, OH, 2007.
[44] O. Catoni. Challenging the empirical mean and empirical variance: a deviation study.
In Annales de l'IHP Probabilités et statistiques, volume 48, pages 1148–1185, 2012.
[45] O. Catoni and I. Giulini. Dimension free PAC-Bayesian bounds for the estimation of
the mean of a random vector. In NIPS-2017 Workshop (Almost) 50 Shades of Bayesian
Learning: PAC-Bayesian trends and insights, 2017.
[46] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K.
Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
[47] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university
press, 2006.
[48] D. Chafaï, O. Guédon, G. Lecué, and A. Pajor. Interactions between compressed
sensing random matrices and high dimensional geometry. Société Mathématique de
France (SMF), 2012.
[49] A. Chee and S. Loustau. Learning with BOT-Bregman and optimal transport diver-
gences. Preprint hal-03262687, 2021.
[50] B.-E. Chérief-Abdellatif. Consistency of ELBO maximization for model selection. In
F. Ruiz, C. Zhang, D. Liang, and T. Bui, editors, Proceedings of The 1st Symposium
on Advances in Approximate Bayesian Inference, volume 96 of Proceedings of Machine
Learning Research, pages 11–31. PMLR, 02 Dec 2019.
[51] B.-E. Chérief-Abdellatif. Convergence rates of variational inference in sparse deep
learning. In International Conference on Machine Learning, pages 1831–1842. PMLR,
2020.
[52] B.-E. Chérief-Abdellatif and P. Alquier. Consistency of variational Bayes inference
for estimation and model selection in mixtures. Electronic Journal of Statistics,
12(2):2995–3035, 2018.
[53] B.-E. Chérief-Abdellatif, P. Alquier, and M. E. Khan. A generalization bound for
online variational inference. In W. S. Lee and T. Suzuki, editors, Proceedings of The
Eleventh Asian Conference on Machine Learning, volume 101 of Proceedings of Ma-
chine Learning Research, pages 662–677, Nagoya, Japan, 2019.
[54] N. Chopin, S. Gadat, B. Guedj, A. Guyader, and E. Vernet. On some recent advances
on high dimensional Bayesian statistics. ESAIM: Proceedings and Surveys, 51:293–319,
2015.
[55] V. Cottet and P. Alquier. 1-Bit matrix completion: PAC-Bayesian analysis of a vari-
ational approximation. Machine Learning, 107(3):579–603, 2018.
[56] D. Dai, P. Rigollet, L. Xia, and T. Zhang. Aggregation of affine estimators. Electronic
Journal of Statistics, 8(1):302–327, 2014.
[57] D. Dai, P. Rigollet, and T. Zhang. Deviation optimal learning using greedy q-
aggregation. The Annals of Statistics, 40(3):1878–1905, 2012.
[58] A. S. Dalalyan. Exponential weights in multivariate regression and a low-rankness
favoring prior. Ann. Inst. H. Poincaré Probab. Statist., 56(2):1465–1483, 2020.
[59] A. S. Dalalyan, E. Grappin, and Q. Paris. On the exponentially weighted aggregate
with the Laplace prior. Annals of Statistics, 46(5):2452–2478, 2018.
[60] A. S. Dalalyan and J. Salmon. Sharp oracle inequalities for aggregation of affine
estimators. The Annals of Statistics, 40(4):2327–2355, 2012.
[61] A. S. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp
PAC-Bayesian bounds and sparsity. Machine Learning, 72(1-2):39–61, 2008.
[62] A. S. Dalalyan and A. B. Tsybakov. Sparse regression learning by aggregation and
Langevin Monte-Carlo. Journal of Computer and System Sciences, 78(5):1423–1443,
2012.
[63] J. Dedecker, P. Doukhan, G. Lang, L. R. J. Rafael, S. Louhichi, and C. Prieur. Weak de-
pendence. In Weak dependence: With examples and applications, pages 9–20. Springer,
2007.
[64] L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern recognition.
Springer Science & Business Media, 1996.
[65] M. D. Donsker and S. S. Varadhan. Asymptotic evaluation of certain Markov process
expectations for large time. iii. Communications on Pure and Applied Mathematics,
28:389–461, 1976.
[66] G. K. Dziugaite, K. Hsu, W. Gharbieh, G. Arpino, and D. Roy. On the role of
data in PAC-Bayes. In A. Banerjee and K. Fukumizu, editors, Proceedings of The
24th International Conference on Artificial Intelligence and Statistics, volume 130 of
Proceedings of Machine Learning Research, pages 604–612. PMLR, 13–15 Apr 2021.
[67] G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for
deep (stochastic) neural networks with many more parameters than training data. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2017.
[68] G. K. Dziugaite and D. M. Roy. Data-dependent PAC-Bayes priors via differential
privacy. In Advances in Neural Information Processing Systems, pages 8430–8441,
2018.
[69] D. Eringis, J. Leth, Z.-H. Tan, R. Wisniewski, and M. Petreczky. PAC-Bayesian theory
for stochastic LTI systems. arXiv preprint arXiv:2103.12866, 2021.
[70] A. Y. K. Foong, W. P. Bruinsma, D. R. Burt, and R. E. Turner. How tight can
pac-bayes be in the small data regime? arXiv preprint arXiv:2106.03542, 2021.
[71] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. Sharpness-aware minimization for
efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
[72] D. T. Frazier, R. Loaiza-Maya, G. M. Martin, and B. Koo. Loss-based Variational
Bayes prediction. arXiv preprint arXiv:2104.14054, 2021.
[73] S. Gaïffas and G. Lecué. Optimal rates and adaptation in the single-index model using aggregation. Electronic Journal of Statistics, 1:538–573, 2007.
[74] T. Geffner and J. Domke. On the difficulty of unbiased alpha divergence minimization.
arXiv preprint arXiv:2010.09541, 2020.
[75] P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien. PAC-Bayesian theory meets
Bayesian inference. In Proceedings of the 30th International Conference on Neural
Information Processing Systems, pages 1884–1892, 2016.
[76] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A new PAC-Bayesian per-
spective on domain adaptation. In International conference on machine learning, pages
859–868. PMLR, 2016.
[77] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning
of linear classifiers. In Proceedings of the 26th Annual International Conference on
Machine Learning, pages 353–360, 2009.
[78] S. Ghosal and A. Van der Vaart. Fundamentals of nonparametric Bayesian inference,
volume 44. Cambridge University Press, 2017.
[79] I. Giulini. Robust dimension-free Gram operator estimates. Bernoulli, 24(4B):3864–
3923, 2018.
[80] P. Grünwald, T. Steinke, and L. Zakynthinou. PAC-Bayes, MAC-Bayes and Condi-
tional Mutual Information: Fast rate bounds that handle general VC classes. arXiv
preprint arXiv:2106.09683, 2021.
[81] P. Grünwald and T. Van Ommen. Inconsistency of Bayesian inference for misspecified
linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103,
2017.
[82] P. D. Grünwald. The minimum description length principle. MIT Press, 2007.
[83] P. D. Grünwald and N. A. Mehta. Fast rates for general unbounded loss functions:
From ERM to Generalized Bayes. Journal of Machine Learning Research, 21(56):1–80,
2020.
[84] B. Guedj. A primer on PAC-Bayesian learning. In Proceedings of the second congress
of the French Mathematical Society, 2019.
[85] B. Guedj and P. Alquier. PAC-Bayesian estimation and prediction in sparse additive
models. Electronic Journal of Statistics, 7:264–291, 2013.
[86] M. Haddouche, B. Guedj, O. Rivasplata, and J. Shawe-Taylor. Upper and lower bounds
on the performance of kernel PCA. arXiv preprint arXiv:2012.10369, 2020.
[87] M. Haddouche, B. Guedj, O. Rivasplata, and J. Shawe-Taylor. PAC-Bayes unleashed:
Generalisation bounds with unbounded losses. Entropy, 23(10), 2021.
[88] M. Haußmann, S. Gerwinn, A. Look, B. Rakitsch, and M. Kandemir. Learning partially
known stochastic dynamics with empirical PAC-Bayes. In International Conference on
Artificial Intelligence and Statistics, pages 478–486. PMLR, 2021.
[89] F. Hellström and G. Durisi. Generalization bounds via information density and con-
ditional information density. IEEE Journal on Selected Areas in Information Theory,
1(3):824–839, 2020.
[90] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers. IEEE
Transactions on Information Theory, 48(12):3140–3150, 2002.
[91] M. Higgs and J. Shawe-Taylor. A PAC-Bayes bound for tailored density estimation. In
International Conference on Algorithmic Learning Theory, pages 148–162. Springer, 2010.
[92] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing
the description length of the weights. In Proceedings of the sixth annual conference on
Computational learning theory, pages 5–13, 1993.
[93] D. van der Hoeven, T. van Erven, and W. Kotłowski. The many faces of exponential weights in
online learning. In Conference On Learning Theory, pages 2067–2092. PMLR, 2018.
[94] M. Holland. PAC-Bayes under potentially heavy tails. Advances in Neural Information
Processing Systems, 32:2715–2724, 2019.
[95] J. Honorio and T. Jaakkola. Tight bounds for the expected risk of linear classifiers
and PAC-Bayes finite-sample guarantees. In Proceedings of the 17th International
Conference on Artificial Intelligence and Statistics, pages 384–392, 2014.
[96] J. H. Huggins, T. Campbell, M. Kasprzak, and T. Broderick. Practical bounds on
the error of Bayesian posterior approximations: A nonasymptotic approach. arXiv
preprint arXiv:1809.09505, 2018.
[97] P. Jaiswal, V. Rao, and H. Honnappa. Asymptotic consistency of α-R´enyi-approximate
posteriors. Journal of Machine Learning Research, 21(156):1–42, 2020.
[98] W. Jiang and M. A. Tanner. Gibbs posterior for variable selection in high-dimensional
classification and data mining. The Annals of Statistics, pages 2207–2231, 2008.
[99] S. T. Jose and O. Simeone. Transfer meta-learning: Information-theoretic bounds and
information meta-risk minimization, 2020. arXiv preprint arXiv:2011.02872.
[100] A. Juditsky and A. Nemirovski. Functional aggregation for nonparametric regression.
Annals of Statistics, pages 681–712, 2000.
[101] M. E. Khan and H. Rue. The Bayesian learning rule. arXiv preprint arXiv:2107.04562,
2021.
[102] J. Kivinen and M. K. Warmuth. Averaging expert predictions. In European Conference
on Computational Learning Theory, pages 153–167. Springer, Berlin, 1999.
[103] J. Knoblauch, J. Jewson, and T. Damoulas. Generalized variational inference. arXiv
preprint arXiv:1904.02063, 2019.
[104] S. Kullback. Information theory and statistics. John Wiley & Sons, 1959.
[105] X. Lan, X. Guo, and K. E. Barner. PAC-Bayesian Generalization Bounds for Multi-
Layer Perceptrons. Preprint arXiv:2006.08888, 2020.
[106] J. Langford and R. Caruana. (not) bounding the true error. Advances in Neural
Information Processing Systems, 2:809–816, 2002.
[107] J. Langford and M. Seeger. Bounds for averaging classifiers. Technical Report CMU-
CS-01-102, Carnegie Mellon University, 2001.
[108] J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. In Proceedings of the 15th
International Conference on Neural Information Processing Systems, pages 439–446.
MIT Press, 2002.
[109] G. Lecué. Aggregation procedures: optimality and fast rates. PhD thesis, Université
Pierre et Marie Curie-Paris VI, 2007.
[110] O. Lepski. Asymptotically minimax adaptive estimation I: Upper bounds. Theory of
Probability and its Applications, 36(4):682–697.
[111] G. Letarte, P. Germain, B. Guedj, and F. Laviolette. Dichotomize and generalize: PAC-
Bayesian binary activated deep neural networks. In Advances in Neural Information
Processing Systems, pages 6872–6882, 2019.
[112] G. Leung and A. R. Barron. Information theory and mixing least-squares regressions.
IEEE Trans. Inform. Theory, 52(8):3396–3410, 2006.
[113] G. Lever, F. Laviolette, and J. Shawe-Taylor. Tighter PAC-Bayes bounds through
distribution-dependent priors. Theoretical Computer Science, 473:4–28, 2013.
[114] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. In Proceedings of
the 30th Annual Symposium on the Foundations of Computer Science, pages 256–261.
IEEE, 1989.
[115] T. Liu, J. Lu, Z. Yan, and G. Zhang. PAC-Bayes bounds for meta-learning with
data-dependent prior. arXiv preprint arXiv:2102.03748, 2021.
[116] T. Liu, J. Lu, Z. Yan, and G. Zhang. Statistical generalization performance guarantee
for meta-learning with data dependent prior. Neurocomputing, 465:391–405, 2021.
[117] R. Livni and S. Moran. A limitation of the PAC-Bayes framework. In H. Larochelle,
M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Infor-
mation Processing Systems, volume 33, pages 20543–20553. Curran Associates, Inc.,
2020.
[118] B. London. A PAC-Bayesian analysis of randomized learning with application to
stochastic gradient descent. In Advances in Neural Information Processing Systems,
pages 2931–2940, 2017.
[119] T. D. Luu, J. Fadili, and C. Chesneau. PAC-Bayesian risk bounds for group-analysis
sparse regression by exponential weighting. Journal of Multivariate Analysis, 171:209–
233, 2019.
[120] T. T. Mai. PAC-Bayesian estimation of low-rank matrices. PhD thesis, Université
Paris Saclay, 2017.
[121] T. T. Mai. Bayesian matrix completion with a spectral scaled student prior: theoretical
guarantee and efficient sampling. arXiv preprint arXiv:2104.08191, 2021.
[122] T. T. Mai and P. Alquier. A Bayesian approach for noisy matrix completion: Optimal
rate under general sampling distribution. Electronic Journal of Statistics, 9(1):823–841,
2015.
[123] T. T. Mai and P. Alquier. Pseudo-Bayesian quantum tomography with rank-
adaptation. Journal of Statistical Planning and Inference, 184:62–76, 2017.
[124] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. The Annals of
Statistics, 27(6):1808–1829, 1999.
[125] A. R. Masegosa. Learning under model misspecification: Applications to variational
and ensemble methods. arXiv preprint arXiv:1912.08335, 2019.
[126] A. Maurer. A note on the PAC Bayesian theorem. arXiv preprint cs/0411099, 2004.
[127] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual
Conference on Computational Learning Theory, pages 230–234, New York, 1998. ACM.
[128] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the twelfth annual
conference on Computational learning theory, pages 164–170, 1999.
[129] D. A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning,
51(1):5–21, 2003.
[130] D. A. McAllester. A PAC-Bayesian tutorial with a dropout bound. arXiv preprint
arXiv:1307.2118, 2013.
[131] R. Meir and T. Zhang. Generalization error bounds for Bayesian mixture algorithms.
Journal of Machine Learning Research, 4(Oct):839–860, 2003.
[132] D. Meunier and P. Alquier. Meta-strategy for learning tuning parameters with guar-
antees. Entropy, 23(10), 2021.
[133] Z. Mhammedi, P. D. Grünwald, and B. Guedj. PAC-Bayes un-expected Bernstein
inequality. arXiv preprint arXiv:1905.13367, 2019.
[134] I. Nachum, J. Shafer, and A. Yehudayoff. A direct sum result for the information
complexity of learning. In S. Bubeck, V. Perchet, and P. Rigollet, editors, Proceedings
of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine
Learning Research, pages 1547–1568. PMLR, 06–09 Jul 2018.
[135] J. Negrea, M. Haghifam, G. K. Dziugaite, A. Khisti, and D. M. Roy. Information-
theoretic generalization bounds for sgld via data-dependent estimates. Advances in
Neural Information Processing Systems, 32:11015–11025, 2019.
[136] A. Nemirovski. Topics in non-parametric statistics. École d'Été de Probabilités de
Saint-Flour, 28:85, 2000.
[137] G. Neu, G. K. Dziugaite, M. Haghifam, and D. M. Roy. Information-theoretic gen-
eralization bounds for stochastic gradient descent. arXiv preprint arXiv:2102.00931,
2021.
[138] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. A PAC-Bayesian ap-
proach to spectrally-normalized margin bounds for neural networks. NIPS 2017 Work-
shop: (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights,
2017.
[139] K. Nozawa, P. Germain, and B. Guedj. PAC-Bayesian Contrastive Unsupervised Rep-
resentation Learning. In J. Peters and D. Sontag, editors, Proceedings of the 36th
Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings
of Machine Learning Research, pages 21–30. PMLR, 03–06 Aug 2020.
[140] K. Nozawa and I. Sato. PAC-Bayes Analysis of Sentence Representation. arXiv preprint
arXiv:1902.04247, 2019.
[141] I. Ohn and L. Lin. Adaptive variational Bayes: Optimality, computation and applica-
tions. arXiv preprint arXiv:2109.03204, 2021.
[142] Y. Ohnishi and J. Honorio. Novel Change of Measure Inequalities with Applications
to PAC-Bayesian Bounds and Monte Carlo Estimation. In International Conference
on Artificial Intelligence and Statistics, pages 1711–1719. PMLR, 2021.
[143] L. Oneto, M. Donini, M. Pontil, and J. Shawe-Taylor. Randomized learning and gen-
eralization of fair and private classifiers: From PAC-Bayes to stability and differential
privacy. Neurocomputing, 416:231–243, 2020.
[144] F. Orabona. A modern introduction to online learning. arXiv preprint
arXiv:1912.13213, 2019.
[145] E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun. PAC-Bayes
bounds with data dependent priors. The Journal of Machine Learning Research,
13(1):3507–3531, 2012.
[146] A. Pentina and C. Lampert. A PAC-Bayesian bound for lifelong learning. In Interna-
tional Conference on Machine Learning, pages 991–999. PMLR, 2014.
[147] M. Pérez-Ortiz, O. Rivasplata, J. Shawe-Taylor, and C. Szepesvári. Tighter risk certificates for neural networks. arXiv preprint arXiv:2007.12911, 2020.
[148] S. Plummer, D. Pati, and A. Bhattacharya. Dynamics of coordinate ascent variational
inference: A case study in 2d ising models. Entropy, 22(11), 2020.
[149] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu. Information-theoretic analysis of
stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop
(ITW), pages 26–30. IEEE, 2016.
[150] L. Ralaivola, M. Szafranski, and G. Stempfel. Chromatic PAC-Bayes bounds for non-iid
data: Applications to ranking and stationary β-mixing processes. Journal of Machine
Learning Research, 11(Jul):1927–1956, 2010.
[151] J. Ridgway, P. Alquier, N. Chopin, and F. Liang. PAC-Bayesian AUC classification
and scoring. Advances in Neural Information Processing Systems, 1(January):658–666,
2014.
[152] E. Rio. Inégalités de Hoeffding pour les fonctions lipschitziennes de suites dépendantes.
Comptes Rendus de l'Académie des Sciences - Series I - Mathematics, 330(10):905–908,
2000.
[153] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[154] O. Rivasplata, I. Kuzborskij, C. Szepesvári, and J. Shawe-Taylor. PAC-Bayes analysis
beyond the usual bounds. In Advances in Neural Information Processing Systems,
2020.
[155] O. Rivasplata, V. M. Tankasali, and C. Szepesvári. PAC-Bayes with backprop. arXiv
preprint arXiv:1908.07380, 2019.
[156] B. Rodríguez-Gálvez, G. Bassi, R. Thobaben, and M. Skoglund. Tighter expected
generalization error bounds via Wasserstein distance. arXiv preprint arXiv:2101.09315,
2021.
[157] J. Rothfuss, V. Fortuin, M. Josifoski, and A. Krause. PACOH: Bayes-optimal meta-
learning with PAC-guarantees. arXiv preprint arXiv:2002.05551, 2020.
[158] J. Rousseau. On the frequentist properties of Bayesian nonparametric methods. Annual
Review of Statistics and Its Application, 3:211–231, 2016.
[159] D. Russo and J. Zou. How much does your data exploration overfit? controlling bias via
information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
[160] P.-M. Samson. Concentration of measure inequalities for Markov chains and φ-mixing
processes. The Annals of Probability, 28(1):416–461, 2000.
[161] M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classifica-
tion. Journal of machine learning research, 3(Oct):233–269, 2002.
[162] Y. Seldin, P. Auer, J. Shawe-Taylor, R. Ortner, and F. Laviolette. PAC-Bayesian
analysis of contextual bandits. In Advances in Neural Information Processing Systems,
pages 1683–1691, 2011.
[163] Y. Seldin, N. Cesa-Bianchi, P. Auer, F. Laviolette, and J. Shawe-Taylor. PAC-Bayes-
Bernstein inequality for martingales and its application to multiarmed bandits. In
D. Glowacka, L. Dorard, and J. Shawe-Taylor, editors, Proceedings of the Workshop
on On-line Trading of Exploration and Exploitation 2, volume 26 of Proceedings of
Machine Learning Research, pages 98–111, Bellevue, Washington, USA, 02 Jul 2012.
PMLR.
[164] Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian
inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–
7093, 2012.
[165] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and
Trends® in Machine Learning, 4(2):107–194, 2011.
[166] J. Shawe-Taylor and R. Williamson. A PAC analysis of a Bayes estimator. In Proceed-
ings of the Tenth Annual Conference on Computational Learning Theory, pages 2–9,
New York, 1997. ACM.
[167] R. Sheth and R. Khardon. Excess risk bounds for the Bayes risk using variational
inference in latent Gaussian models. In Advances in Neural Information Processing
Systems, pages 5151–5161, 2017.
[168] T. Steinke and L. Zakynthinou. Reasoning about generalization via conditional mutual
information. In Conference on Learning Theory, pages 3437–3452. PMLR, 2020.
[169] T. Suzuki. PAC-Bayesian bound for gaussian process regression and multiple kernel
additive model. In Conference on Learning Theory, pages 8–1. JMLR Workshop and
Conference Proceedings, 2012.
[170] T. Suzuki. Convergence rate of Bayesian tensor estimator and its minimax optimality.
In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference
on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages
1273–1282, Lille, France, 07–09 Jul 2015. PMLR.
[171] T. Suzuki. Generalization bound of globally optimal non-convex neural network train-
ing: Transportation map estimation by infinite dimensional Langevin dynamics. arXiv
preprint arXiv:2007.05824, 2020.
[172] N. Syring and R. Martin. Calibrating general posterior credible regions. Biometrika,
106(2):479–486, 2019.
[173] N. Syring and R. Martin. Gibbs posterior concentration rates under sub-exponential
type losses. arXiv preprint arXiv:2012.04505, 2020.
[174] N. Thiemann, C. Igel, O. Wintenberger, and Y. Seldin. A strongly quasiconvex PAC-Bayesian
bound. In International Conference on Algorithmic Learning Theory, pages 466–492,
2017.
[175] I. Tolstikhin and Y. Seldin. PAC-Bayes-empirical-Bernstein inequality. Advances in
Neural Information Processing Systems 26 (NIPS 2013), pages 1–9, 2013.
[176] Y. Tsuzuku, I. Sato, and M. Sugiyama. Normalized flat minima: Exploring scale
invariant definition of flat minima for neural networks using PAC-Bayesian analysis.
In International Conference on Machine Learning, pages 9636–9647. PMLR, 2020.
[177] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals
of Statistics, 32(1):135–166, 2004.
[178] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142,
1984.
[179] T. van Erven. PAC-Bayes mini-tutorial: a continuous union bound. arXiv preprint
arXiv:1405.1580, 2014.
[180] V. Vapnik. Statistical learning theory. Wiley–Blackwell, 1998.
[181] P. Viallard, R. Emonet, P. Germain, A. Habrard, and E. Morvant. Interpreting
Neural Networks as Majority Votes through the PAC-Bayesian Theory. NeurIPS 2019
Workshop on Machine Learning with Guarantees, 2019.
[182] V. G. Vovk. Aggregating strategies. In Proceedings of Computational Learning Theory, 1990.
[183] H. Wang, S. Zheng, C. Xiong, and R. Socher. On the generalization gap in repa-
rameterizable reinforcement learning. In K. Chaudhuri and R. Salakhutdinov, editors,
Proceedings of the 36th International Conference on Machine Learning, volume 97 of
Proceedings of Machine Learning Research, pages 6648–6658. PMLR, 09–15 Jun 2019.
[184] Z. Wang, S.-L. Huang, E. E. Kuruoglu, J. Sun, X. Chen, and Y. Zheng. PAC-Bayes
information bottleneck. arXiv preprint arXiv:2109.14509, 2021.
[185] O. Wintenberger. Deviation inequalities for sums of weakly dependent time series.
Electronic Communications in Probability, (15):489–503, 2010.
[186] A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability
of learning algorithms. In Proceedings of the 31st International Conference on Neural
Information Processing Systems, NIPS’17, page 2521–2530, Red Hook, NY, USA, 2017.
Curran Associates Inc.
[187] J. Yang, S. Sun, and D. M. Roy. Fast-rate PAC-Bayes generalization bounds via
shifted Rademacher processes. Advances in Neural Information Processing Systems,
32:10803–10813, 2019.
[188] Y. Yang. Adaptive regression by mixing. Journal of the American Statistical Associ-
ation, 96(454):574–588, 2001.
[189] Y. Yang. Aggregating regression procedures to improve performance. Bernoulli,
10(1):25–47, 2004.
[190] Y. Yang, D. Pati, and A. Bhattacharya. α-variational inference with statistical guar-
antees. Annals of Statistics, 48(2):886–905, 2020.
[191] F. Zhang and C. Gao. Convergence rates of variational posterior distributions. Annals
of Statistics, 48(4):2180–2207, 2020.
[192] T. Zhang. Information-theoretic upper and lower bounds for statistical estimation.
IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.
[193] W. Zhou, V. Veitch, M. Austern, R. P. Adams, and P. Orbanz. Non-vacuous general-
ization bounds at the imagenet scale: a PAC-Bayesian compression approach. arXiv
preprint arXiv:1804.05862, 2018.