
arXiv:2110.11216v4 [stat.ML] 9 Nov 2021

User-friendly introduction to PAC-Bayes bounds

Pierre Alquier

RIKEN AIP, Tokyo, Japan

Abstract

Aggregated predictors are obtained by making a set of basic predictors vote according to some weights, that is, to some probability distribution.

Randomized predictors are obtained by sampling in a set of basic predictors, according to some prescribed probability distribution.

Thus, aggregated and randomized predictors have in common that they are not defined by a minimization problem, but by a probability distribution on the set of predictors. In statistical learning theory, there is a set of tools designed to understand the generalization ability of such procedures: PAC-Bayesian or PAC-Bayes bounds.

Since the original PAC-Bayes bounds [166, 127], these tools have been considerably improved in many directions (we will, for example, describe a simplified version of the localization technique of [41, 43] that was missed by the community, and later rediscovered as "mutual information bounds"). Very recently, PAC-Bayes bounds have received considerable attention: for example, there was a workshop on PAC-Bayes at NIPS 2017, (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights, organized by B. Guedj, F. Bach and P. Germain. One of the reasons for this recent success is the successful application of these bounds to neural networks [67].

An elementary introduction to PAC-Bayes theory is still missing. This is an attempt to provide such an introduction.

Contents

1 Introduction
  1.1 Machine learning and PAC bounds
    1.1.1 Machine learning: notations
    1.1.2 PAC bounds
  1.2 What are PAC-Bayes bounds?
  1.3 Why this tutorial?
  1.4 Two types of PAC bounds, organization of these notes

2 First step in the PAC-Bayes world
  2.1 A simple PAC-Bayes bound
    2.1.1 Catoni's bound [41]
    2.1.2 Exact minimization of the bound
    2.1.3 Some examples, and non-exact minimization of the bound
    2.1.4 The choice of λ
  2.2 PAC-Bayes bound on aggregation of predictors
  2.3 PAC-Bayes bound on a single draw from the posterior
  2.4 Bound in expectation
  2.5 Applications of empirical PAC-Bayes bounds

3 Tight and non-vacuous PAC-Bayes bounds
  3.1 Why is there a race to the tighter PAC-Bayes bound?
  3.2 A few PAC-Bayes bounds
    3.2.1 McAllester's bound [127] and Maurer's improved bound [126]
    3.2.2 Catoni's bound (another one) [43]
    3.2.3 Seeger's bound [161] and Maurer's bound [126]
    3.2.4 Tolstikhin and Seldin's bound [175]
    3.2.5 Thiemann, Igel, Wintenberger and Seldin's bound [174]
    3.2.6 A bound by Germain, Lacasse, Laviolette and Marchand [77]
  3.3 Tight generalization error bounds for deep learning
    3.3.1 A milestone: non-vacuous generalization error bounds for deep networks by Dziugaite and Roy [67]
    3.3.2 Bounds with data-dependent priors
    3.3.3 Comparison of the bounds and tight certificates for neural networks [147]

4 PAC-Bayes oracle inequalities and fast rates
  4.1 From empirical inequalities to oracle inequalities
    4.1.1 Bound in expectation
    4.1.2 Bound in probability
  4.2 Bernstein assumption and fast rates
  4.3 Applications of Theorem 4.3
  4.4 Dimension and rate of convergence
  4.5 Getting rid of the log terms: Catoni's localization trick

5 Beyond "bounded loss" and "i.i.d. observations"
  5.1 "Almost" bounded losses (sub-Gaussian and sub-gamma)
    5.1.1 The sub-Gaussian case
    5.1.2 The sub-gamma case
    5.1.3 Remarks on exponential moments
  5.2 Heavy-tailed losses
    5.2.1 The truncation approach
    5.2.2 Bounds based on moment inequalities
    5.2.3 Bounds based on robust losses
  5.3 Dependent observations
    5.3.1 Inequalities for dependent variables
    5.3.2 A simple example
  5.4 Other non-i.i.d. settings
    5.4.1 Non identically distributed observations
    5.4.2 Shift in the distribution
    5.4.3 Meta-learning

6 Related approaches in statistics and machine learning theory
  6.1 Bayesian inference in statistics
    6.1.1 Gibbs posteriors, generalized posteriors
    6.1.2 Contraction of the posterior in Bayesian nonparametrics
    6.1.3 Variational approximations
  6.2 Empirical risk minimization
  6.3 Online learning
    6.3.1 Sequential prediction
    6.3.2 Bandits and reinforcement learning (RL)
  6.4 Aggregation of estimators in statistics
  6.5 Information theoretic approaches
    6.5.1 Minimum description length
    6.5.2 Mutual information bounds (MI)

7 Conclusion

1 Introduction

In a supervised learning problem, such as classification or regression, we are given a data set, and we 1) fix a set of predictors and 2) find a good predictor in this set. For example, when doing linear regression, you 1) choose to consider only linear predictors and 2) use the least-squares method to choose your linear predictor.

PAC-Bayes bounds will allow us to define and study "randomized" or "aggregated" predictors. By this, we mean that we will replace 2) by 2') define weights on the predictors and make them vote according to these weights, or by 2'') draw a predictor according to some prescribed probability distribution.

1.1 Machine learning and PAC bounds

1.1.1 Machine learning: notations

We will assume that the reader is already familiar with the setting of supervised learning and the corresponding definitions. We briefly recall the notation involved here:


• an object set $\mathcal{X}$: photos, texts, $\mathbb{R}^d$... (equipped with a $\sigma$-algebra $\mathcal{S}_x$).

• a label set $\mathcal{Y}$, usually a finite set for classification problems or the set of real numbers for regression problems (equipped with a $\sigma$-algebra $\mathcal{S}_y$).

• a probability distribution $P$ on $(\mathcal{X}\times\mathcal{Y},\mathcal{S}_x\otimes\mathcal{S}_y)$, which is not known.

• the data, or observations: $(X_1,Y_1),\dots,(X_n,Y_n)$. From now on, and until the end of Section 4, we assume that $(X_1,Y_1),\dots,(X_n,Y_n)$ are i.i.d. from $P$.

• a predictor is a measurable function $f:\mathcal{X}\to\mathcal{Y}$.

• we fix a set of predictors indexed by a parameter set $\Theta$ (equipped with a $\sigma$-algebra $\mathcal{T}$): $\{f_\theta,\ \theta\in\Theta\}$. In regression, the basic example is $f_\theta(x)=\theta^T x$ for $\mathcal{X}=\Theta=\mathbb{R}^d$. The analogue for classification is:
$$f_\theta(x)=\begin{cases}1 & \text{if } \theta^T x\ge 0,\\ 0 & \text{otherwise.}\end{cases}$$
A more sophisticated example: the set of all neural networks with a fixed architecture, $\theta$ being the weights of the network.

• a loss function, that is, a measurable function $\ell:\mathcal{Y}^2\to[0,+\infty)$ with $\ell(y,y)=0$. In a classification problem, a very common loss function is:
$$\ell(y',y)=\begin{cases}1 & \text{if } y'\neq y,\\ 0 & \text{if } y'=y.\end{cases}$$
We will refer to it as the 0-1 loss function, and will use the following shorter notation: $\ell(y',y)=\mathbf{1}(y\neq y')$. However, it is often more convenient to consider convex loss functions, such as $\ell(y',y)=\max(1-yy',0)$ (the hinge loss). In regression problems, the most popular examples are the quadratic loss $\ell(y',y)=(y'-y)^2$ and the absolute loss $\ell(y',y)=|y'-y|$. Note that the original PAC-Bayes bounds in [127] were stated in the special case of the 0-1 loss, and this is also the case of most bounds published since then. However, PAC-Bayes bounds for regression with the quadratic loss were proven in [42], and in many works since then (they will be mentioned later). From now on, and until the end of Section 4, we assume that $0\le\ell\le C$. This might be either because we are using the 0-1 loss, or the quadratic loss in a setting where $f_\theta(x)$ and $y$ are bounded.

• the generalization error of a predictor, or generalization risk, or simply risk:
$$R(f)=\mathbb{E}_{(X,Y)\sim P}[\ell(f(X),Y)].$$
For short, as we will only consider predictors in $\{f_\theta,\ \theta\in\Theta\}$, we will write $R(\theta):=R(f_\theta)$. This function is not accessible, because it depends on the unknown $P$.

• for short, we put $\ell_i(\theta):=\ell(f_\theta(X_i),Y_i)\ge 0$.

• the empirical risk
$$r(\theta)=\frac{1}{n}\sum_{i=1}^n \ell_i(\theta)$$
satisfies $\mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}[r(\theta)]=R(\theta)$. Note that the notation for the last expectation is cumbersome. From now on, we will write $S=[(X_1,Y_1),\dots,(X_n,Y_n)]$ and $\mathbb{E}_S$ (for "expectation with respect to the sample") instead of $\mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}$. In the same way, we will write $\mathbb{P}_S$.

• an estimator is a function
$$\hat{\theta}:\bigcup_{n=1}^{\infty}(\mathcal{X}\times\mathcal{Y})^n\to\Theta.$$
That is, to each possible dataset, of any possible size, it associates a parameter. (It must be such that the restriction of $\hat{\theta}$ to each $(\mathcal{X}\times\mathcal{Y})^n$ is measurable.) For short, we write $\hat{\theta}$ instead of $\hat{\theta}((X_1,Y_1),\dots,(X_n,Y_n))$. The most famous example is the Empirical Risk Minimizer, or ERM:
$$\hat{\theta}_{\mathrm{ERM}}=\operatorname*{argmin}_{\theta\in\Theta}\, r(\theta)$$
(with a convention in case of a tie).
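To make this notation concrete, here is a minimal sketch in Python (my own illustration, not part of the paper: the data, the grid of candidate parameters and all values are made up) of the empirical risk and the ERM with the 0-1 loss, for the linear classifiers $f_\theta(x)=\mathbf{1}(\theta^Tx\ge 0)$ over a finite set of candidates:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))            # observations X_i in R^2
theta_star = np.array([1.0, -0.5])
y = (X @ theta_star >= 0).astype(int)         # labels Y_i = f_{theta_star}(X_i)

def emp_risk(theta):
    """r(theta): average 0-1 loss of f_theta on the sample."""
    preds = (X @ theta >= 0).astype(int)
    return np.mean(preds != y)

# finite Theta: candidate directions on the unit circle
thetas = [np.array([np.cos(t), np.sin(t)]) for t in np.linspace(0, 2 * np.pi, 100)]
risks = [emp_risk(th) for th in thetas]
theta_erm = thetas[int(np.argmin(risks))]     # the ERM (ties: first argmin)
print(theta_erm, min(risks))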

1.1.2 PAC bounds

Of course, our objective is to minimize $R$, not $r$. So the ERM strategy is motivated by the hope that these two functions are not so different, so that the minimizer of $r$ almost minimizes $R$. In what remains of this section, we will check to what extent this is true. By doing so, we will introduce some tools that will be useful when we come to PAC-Bayes bounds.

The first of these tools is a classical result that will be useful throughout this tutorial.

Lemma 1.1 (Hoeffding's inequality) Let $U_1,\dots,U_n$ be independent random variables taking values in an interval $[a,b]$. Then, for any $t>0$,
$$\mathbb{E}\left[e^{t\sum_{i=1}^n[U_i-\mathbb{E}(U_i)]}\right]\le e^{\frac{nt^2(b-a)^2}{8}}.$$

The proof can be found for example in Chapter 2 of [37], which is a highly recommended read, but it is so classical that you can just as well find it on Wikipedia.

Fix $\theta\in\Theta$ and apply Hoeffding's inequality with $U_i=\mathbb{E}[\ell_i(\theta)]-\ell_i(\theta)$ to get:
$$\mathbb{E}_S\left[e^{tn[R(\theta)-r(\theta)]}\right]\le e^{\frac{nt^2C^2}{8}}.\tag{1.1}$$
Now, for any $s>0$,
$$\mathbb{P}_S\left(R(\theta)-r(\theta)>s\right)=\mathbb{P}_S\left(e^{nt[R(\theta)-r(\theta)]}>e^{nts}\right)\le\frac{\mathbb{E}_S\left[e^{nt[R(\theta)-r(\theta)]}\right]}{e^{nts}}\le e^{\frac{nt^2C^2}{8}-nts},$$
where the first inequality is Markov's inequality and the second follows from (1.1). In other words,
$$\mathbb{P}_S\left(R(\theta)>r(\theta)+s\right)\le e^{\frac{nt^2C^2}{8}-nts}.$$
We can make this bound as tight as possible by optimizing our choice of $t$. Indeed, note that $nt^2C^2/8-nts$ is minimized for $t=4s/C^2$, which leads to
$$\mathbb{P}_S\left(R(\theta)>r(\theta)+s\right)\le e^{-\frac{2ns^2}{C^2}}.\tag{1.2}$$

This means that, for a given $\theta$, the risk $R(\theta)$ cannot be much larger than the corresponding empirical risk $r(\theta)$. The order of this "much larger" can be better understood by introducing
$$\varepsilon=e^{-\frac{2ns^2}{C^2}}$$
and substituting $\varepsilon$ for $s$ in (1.2), which gives:
$$\mathbb{P}_S\left(R(\theta)>r(\theta)+C\sqrt{\frac{\log\frac{1}{\varepsilon}}{2n}}\right)\le\varepsilon.\tag{1.3}$$

We see that $R(\theta)$ will usually not exceed $r(\theta)$ by more than a term of order $1/\sqrt{n}$. This is not enough, though, to justify the use of the ERM. Indeed, (1.3) is only true for the $\theta$ that was fixed above, and we cannot apply it to $\hat{\theta}_{\mathrm{ERM}}$, which is a function of the data. In order to study $R(\hat{\theta}_{\mathrm{ERM}})$, we can use
$$R(\hat{\theta}_{\mathrm{ERM}})-r(\hat{\theta}_{\mathrm{ERM}})\le\sup_{\theta\in\Theta}[R(\theta)-r(\theta)],\tag{1.4}$$
so we need a version of (1.3) that holds uniformly over $\Theta$.

Let us now assume, until the end of Subsection 1.1, that the set $\Theta$ is finite, that is, $\mathrm{card}(\Theta)=M<+\infty$. Then:
$$\mathbb{P}_S\left(\sup_{\theta\in\Theta}[R(\theta)-r(\theta)]>s\right)=\mathbb{P}_S\left(\bigcup_{\theta\in\Theta}\left\{R(\theta)-r(\theta)>s\right\}\right)\le\sum_{\theta\in\Theta}\mathbb{P}_S\left(R(\theta)>r(\theta)+s\right)\le Me^{-\frac{2ns^2}{C^2}}\tag{1.5}$$
thanks to (1.2). This time, put
$$\varepsilon=Me^{-\frac{2ns^2}{C^2}}$$
and plug it into (1.5); this gives:
$$\mathbb{P}_S\left(\sup_{\theta\in\Theta}[R(\theta)-r(\theta)]>C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\le\varepsilon.$$
Let us state this conclusion as a theorem (focusing on the complementary event).

Theorem 1.2 Assume that $\mathrm{card}(\Theta)=M<+\infty$. For any $\varepsilon\in(0,1)$,
$$\mathbb{P}_S\left(\forall\theta\in\Theta,\ R(\theta)\le r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon.$$

This result indeed motivates the introduction of $\hat{\theta}_{\mathrm{ERM}}$. Indeed, using (1.4), with probability at least $1-\varepsilon$ we have
$$R(\hat{\theta}_{\mathrm{ERM}})\le r(\hat{\theta}_{\mathrm{ERM}})+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}=\inf_{\theta\in\Theta}r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}},$$
so the ERM satisfies:
$$\mathbb{P}_S\left(R(\hat{\theta}_{\mathrm{ERM}})\le\inf_{\theta\in\Theta}r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon.$$
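As a quick numerical sanity check (my own, with made-up values of $M$, $n$ and $\varepsilon$), the excess term $C\sqrt{\log(M/\varepsilon)/(2n)}$ of Theorem 1.2 is straightforward to evaluate:

import math

def pac_excess(M, n, C=1.0, eps=0.05):
    """Excess term C * sqrt(log(M / eps) / (2 n)) in Theorem 1.2."""
    return C * math.sqrt(math.log(M / eps) / (2 * n))

# e.g. 100 predictors, 1000 observations, 95% confidence:
print(pac_excess(M=100, n=1000))  # about 0.062

With $M=100$, $n=1000$ and $\varepsilon=0.05$, the excess term is about 0.062; we will meet this example again in Section 3.1.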

Such a bound is usually called a PAC bound, that is, a Probably Approximately Correct bound. The reason for this terminology, introduced by Valiant in [178], is as follows: Valiant was considering the case where there is a $\theta_0\in\Theta$ such that $Y_i=f_{\theta_0}(X_i)$ holds almost surely. This means that $R(\theta_0)=0$ and $r(\theta_0)=0$, and so
$$\mathbb{P}_S\left(R(\hat{\theta}_{\mathrm{ERM}})\le C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon,$$
which means that with large Probability, $R(\hat{\theta}_{\mathrm{ERM}})$ is Approximately equal to the Correct value, that is, 0. Note, however, that this is only meaningful if $\log(M)/n$ is small, that is, if $M$ is not larger than $\exp(n)$. This $\log(M)$ in the bound is the price to pay to learn which of the $M$ predictors is the best.

Remark 1.1 The proof of Theorem 1.2 used, in addition to Hoeffding's inequality, two tricks that we will reuse many times in this tutorial:

• given a random variable $U$ and $s\in\mathbb{R}$, for any $t>0$,
$$\mathbb{P}(U>s)=\mathbb{P}\left(e^{tU}>e^{ts}\right)\le\frac{\mathbb{E}\left[e^{tU}\right]}{e^{ts}}$$
thanks to Markov's inequality. The combo "exponential + Markov's inequality" is known as the Chernoff bound. The Chernoff bound is of course very useful together with exponential inequalities like Hoeffding's inequality.

• given a finite number of random variables $U_1,\dots,U_M$,
$$\mathbb{P}\left(\sup_{1\le i\le M}U_i>s\right)=\mathbb{P}\left(\bigcup_{1\le i\le M}\left\{U_i>s\right\}\right)\le\sum_{i=1}^M\mathbb{P}(U_i>s).$$
This argument is called the union bound argument.

The next step in the study of the ERM would be to go beyond finite sets $\Theta$. The union bound argument has to be modified in this case, and things become a little more complicated. We will therefore stop the study of the ERM here: it is not our objective anyway.

If the reader is interested in the study of the ERM in general: Vapnik and Chervonenkis developed the theoretical tools for this study in 1969/1970; this is for example developed by Vapnik in [180]. The book [64] is a beautiful and very pedagogical introduction to machine learning theory, and Chapters 11 and 12 in particular are dedicated to Vapnik and Chervonenkis theory.

1.2 What are PAC-Bayes bounds?

I am now in a better position to explain what PAC-Bayes bounds are. A simple way to phrase things: PAC-Bayes bounds are a generalization of the union bound argument that allows one to deal with any parameter set $\Theta$: finite or infinite, continuous... However, a byproduct of this technique is that we will have to change the notion of estimator.

Definition 1.1 Let $\mathcal{P}(\Theta)$ be the set of all probability distributions on $(\Theta,\mathcal{T})$. A data-dependent probability measure is a function
$$\hat{\rho}:\bigcup_{n=1}^{\infty}(\mathcal{X}\times\mathcal{Y})^n\to\mathcal{P}(\Theta)$$
with a suitable measurability condition¹. We will write $\hat{\rho}$ instead of $\hat{\rho}((X_1,Y_1),\dots,(X_n,Y_n))$ for short.

In practice, when you have a data-dependent probability measure and you want to build a predictor, you can:

• draw a random parameter $\tilde{\theta}\sim\hat{\rho}$; we will call this procedure a "randomized estimator".

• use it to average predictors, that is, define a new predictor
$$f_{\hat{\rho}}(\cdot)=\mathbb{E}_{\theta\sim\hat{\rho}}[f_\theta(\cdot)]$$
called the aggregated predictor with weights $\hat{\rho}$.

So, with PAC-Bayes bounds, we will extend the union bound argument² to infinite, uncountable sets $\Theta$, but we will obtain bounds on various risks related to data-dependent probability measures, that is:

• the risk of a randomized estimator, $R(\tilde{\theta})$,

• or the average risk of randomized estimators, $\mathbb{E}_{\theta\sim\hat{\rho}}[R(\theta)]$,

• or the risk of the aggregated estimator, $R(f_{\hat{\rho}})$.

You will of course ask the question: if $\Theta$ is infinite, what becomes of the $\log(M)$ term in Theorem 1.2, which came from the union bound? In general, this term will be replaced by the Kullback-Leibler divergence between $\hat{\rho}$ and a fixed $\pi$ on $\Theta$.

Definition 1.2 Given two probability measures $\mu$ and $\nu$ in $\mathcal{P}(\Theta)$, the Kullback-Leibler (or simply KL) divergence between $\mu$ and $\nu$ is
$$KL(\mu\|\nu)=\int\log\frac{d\mu}{d\nu}(\theta)\,\mu(d\theta)\in[0,+\infty]$$
if $\mu$ has a density $\frac{d\mu}{d\nu}$ with respect to $\nu$, and $KL(\mu\|\nu)=+\infty$ otherwise.

Example 1.1 For example, if $\Theta$ is finite,
$$KL(\mu\|\nu)=\sum_{\theta\in\Theta}\log\frac{\mu(\theta)}{\nu(\theta)}\,\mu(\theta).$$

¹ I don't want to scare the reader with measurability conditions, as I will not check them in this tutorial anyway. Here, the exact condition to ensure that what follows is well defined is that, for any $A\in\mathcal{T}$, the function
$$((x_1,y_1),\dots,(x_n,y_n))\mapsto[\hat{\rho}((x_1,y_1),\dots,(x_n,y_n))](A)$$
is measurable. That is, $\hat{\rho}$ is a regular conditional probability.

² See the title of van Erven's tutorial [179]: "PAC-Bayes mini-tutorial: a continuous union bound". Note, however, that it is argued by Catoni in [43] that PAC-Bayes bounds are actually more than that; we will come back to this in Section 4.

The following result is well known. You can prove it using Jensen's inequality, or use Wikipedia again.

Proposition 1.3 For any probability measures $\mu$ and $\nu$, $KL(\mu\|\nu)\ge 0$, with equality if and only if $\mu=\nu$.
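Here is a minimal numerical sketch (mine, not from the paper) of the KL divergence on a finite $\Theta$ as in Example 1.1, with the usual convention $0\log(0/q)=0$; it also illustrates Proposition 1.3:

import math

def kl(mu, nu):
    """KL(mu || nu) for two probability vectors on a finite set."""
    out = 0.0
    for m, v in zip(mu, nu):
        if m > 0.0:
            if v == 0.0:
                return math.inf  # mu not absolutely continuous w.r.t. nu
            out += m * math.log(m / v)
    return out

print(kl([0.5, 0.5], [0.5, 0.5]))  # 0.0: equality iff mu == nu
print(kl([0.9, 0.1], [0.5, 0.5]))  # > 0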

1.3 Why this tutorial?

Since the "PAC analysis of a Bayesian estimator" by Shawe-Taylor and Williamson [166] and the first PAC-Bayes bounds proven by McAllester [127, 128], many new PAC-Bayes bounds have appeared (we will see that many of them can be derived from Seeger's bound [161]). These bounds were used in various contexts, to solve a wide range of problems. This led to hundreds of (beautiful!) papers. The consequence is that it is quite difficult to be aware of all the existing work on PAC-Bayes bounds.

As a reviewer for ICML or NeurIPS, I very often had to reject papers because these papers were re-proving already known results, or because they proposed bounds that were weaker than existing ones³. In particular, it seems that many powerful techniques in Catoni's book [43] are still ignored by the community (some were already introduced in earlier works [41, 42]).

On the other hand, it's not always easy to get started with PAC-Bayes bounds. I realize that most papers already assume some basic knowledge of these bounds, and that a monograph like [43] is quite technical to begin with. When an MSc or PhD student asks me for an easy-to-follow introduction to PAC-Bayes, I am never sure what to answer, and usually end up improvising such an introduction for one or two hours, with a piece of chalk and a blackboard. So it came to me recently⁴ that it might be useful to write a beginner-friendly tutorial that I could send instead!

Note that there are already short tutorials on PAC-Bayes bounds, by McAllester and van Erven: [130, 179]. They are very good, and I recommend the reader check them out. However, they focus on empirical bounds only. There are also surveys on PAC-Bayes bounds, such as Section 5 in [54] or [84]. These papers are very useful for navigating the ocean of publications on PAC-Bayes bounds, and they helped me a lot when I was writing this document. Finally, in order to highlight the main ideas, I will not necessarily try to present the bounds with the tightest possible constants. In particular, many oracle bounds and localized bounds in Section 4 were introduced in [41, 43] with better constants. Thus I strongly recommend reading [43] after this tutorial, as well as the more recent papers mentioned below.

³ I might have made such mistakes myself, and I apologize if that is the case.

⁴ I must confess that I started a first version of this document after two introductory talks at A. Tsybakov's statistics seminar at ENSAE in September-October 2008. Then I got other things to do and I forgot about it. I taught online learning and PAC-Bayes bounds at ENSAE between 2014 and 2019, which made me think about it again. When I joined Emti Khan's group in 2019, I started to think again about such a document, to share it with the members of the group who were willing to learn about PAC-Bayes. Of course, the contents of the document had to be different, because of the enormous number of very exciting papers published in the meantime. I finally started again from scratch in early 2021.

1.4 Two types of PAC bounds, organization of these notes

It is important to make a distinction between two types of PAC bounds.

Theorem 1.2 is usually referred to as an empirical bound. It means that, for any $\theta$, $R(\theta)$ is upper bounded by an empirical quantity, that is, by something that we can compute when we observe the data. This allows one to define the ERM as the minimizer of this bound. It also provides a numerical certificate on the generalization error of the ERM. You really end up with something like
$$\mathbb{P}_S\left(R(\hat{\theta}_{\mathrm{ERM}})\le 0.12\right)\ge 0.99.$$

However, a numerical certificate on the generalization error does not tell you one thing: can this 0.12 be improved using a larger sample size? Or is it the best that can be done with our set of predictors? The right tools to answer these questions are oracle PAC bounds. In these bounds, you have a control of the form
$$\mathbb{P}_S\left(R(\hat{\theta}_{\mathrm{ERM}})\le\inf_{\theta\in\Theta}R(\theta)+r_n(\varepsilon)\right)\ge 1-\varepsilon,$$
where the remainder $r_n(\varepsilon)\to 0$ as fast as possible when $n\to\infty$. Of course, the upper bound on $R(\hat{\theta}_{\mathrm{ERM}})$ cannot be computed, because we don't know the function $R$, so it doesn't lead to a numerical certificate. Still, these bounds are very interesting, because they tell you how close you can expect $R(\hat{\theta}_{\mathrm{ERM}})$ to be to the smallest possible value of $R$.

In the same way, there are empirical PAC-Bayes bounds and oracle PAC-Bayes bounds. The very first PAC-Bayes bounds, by McAllester [127, 128], were empirical bounds. Later, Catoni [41, 42, 43] proved the first oracle PAC-Bayes bounds.

In some sense, empirical PAC-Bayes bounds are more useful in practice, and oracle PAC-Bayes bounds are theoretical objects. But this might be an oversimplification. I will show that empirical bounds are tools used to prove some oracle bounds, so they are also useful in theory. On the other hand, when we design a data-dependent probability measure, we don't know whether it will lead to large or small empirical bounds. A preliminary study of its theoretical properties through an oracle bound is the best way to ensure that it is efficient, and thus that it has a chance to lead to small empirical bounds.

In Section 2, we will study an example of an empirical PAC-Bayes bound, essentially taken from a preprint by Catoni [41]. We will prove it together, play with it, and modify it in many ways. In Section 3, I provide various empirical PAC-Bayes bounds, and explain the race to tighter bounds. This race led to bounds that are tight enough to provide good generalization bounds for deep learning; we will discuss this based on Dziugaite and Roy's paper [67] and a more recent work by Pérez-Ortiz, Rivasplata, Shawe-Taylor and Szepesvári [147].

In Section 4, we will turn to oracle PAC-Bayes bounds. I will explain how to derive these bounds from empirical bounds, and apply them to some classical sets of predictors. We will examine the assumptions leading to fast rates in these inequalities.

Section 5 will be devoted to the various attempts to extend PAC-Bayes bounds beyond the setting introduced in this introduction, that is: bounded loss and i.i.d. observations. Finally, in Section 6, I will briefly discuss the connection between PAC-Bayes bounds and many other approaches in machine learning and statistics, including the recent Mutual Information (MI) bounds.

2 First step in the PAC-Bayes world

As mentioned above, there are many PAC-Bayes bounds. In this section, I will start with a bound which is essentially due to Catoni in the preprint [41] (the same technique was used in the monograph [43], but with some modifications). Why this choice?

Well, any choice is partly arbitrary: I did my PhD thesis [1] with Olivier Catoni and thus I know his works well. So it's convenient for me. But also, at first, I don't want to provide the best bound here. I want to show how PAC-Bayes bounds work, how to use them, and explain the different variants (bounds on randomized estimators, bounds on aggregated estimators, etc.). It turns out that Catoni's technique is extremely convenient for proving almost all the various types of bounds with a single proof. Later, in Section 3, I will present alternative empirical PAC-Bayes bounds; this will allow you to compare them, and find the pros and cons of each.

2.1 A simple PAC-Bayes bound

2.1.1 Catoni’s bound [41]

From now on, and until the end of these notes, let us fix a probability measure $\pi\in\mathcal{P}(\Theta)$. The measure $\pi$ will be called the prior, because of a connection with Bayesian statistics that will be discussed in Section 6.

Theorem 2.1 For any $\lambda>0$, for any $\varepsilon\in(0,1)$,
$$\mathbb{P}_S\left(\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon.$$

Let us prove Theorem 2.1. The proof requires a lemma that will be extremely useful throughout these notes. This lemma has been known since Kullback [104] in the case of a finite $\Theta$, but the general case is due to Donsker and Varadhan [65].

Lemma 2.2 (Donsker and Varadhan's variational formula) For any measurable, bounded function $h:\Theta\to\mathbb{R}$, we have:
$$\log\mathbb{E}_{\theta\sim\pi}\left[e^{h(\theta)}\right]=\sup_{\rho\in\mathcal{P}(\Theta)}\left[\mathbb{E}_{\theta\sim\rho}[h(\theta)]-KL(\rho\|\pi)\right].$$
Moreover, the supremum with respect to $\rho$ in the right-hand side is reached for the Gibbs measure $\pi_h$, defined by its density with respect to $\pi$:
$$\frac{d\pi_h}{d\pi}(\theta)=\frac{e^{h(\theta)}}{\mathbb{E}_{\vartheta\sim\pi}\left[e^{h(\vartheta)}\right]}.\tag{2.1}$$

Proof of Lemma 2.2. Using the definition, just check that, for any $\rho\in\mathcal{P}(\Theta)$,
$$KL(\rho\|\pi_h)=-\mathbb{E}_{\theta\sim\rho}[h(\theta)]+KL(\rho\|\pi)+\log\mathbb{E}_{\theta\sim\pi}\left[e^{h(\theta)}\right].$$
Thanks to Proposition 1.3, the left-hand side is nonnegative, and equal to 0 only when $\rho=\pi_h$. □

Proof of Theorem 2.1. The beginning of the proof follows closely the study of the ERM and the proof of Theorem 1.2. Fix $\theta\in\Theta$ and apply Hoeffding's inequality with $U_i=\mathbb{E}[\ell_i(\theta)]-\ell_i(\theta)$: for any $t>0$,
$$\mathbb{E}_S\left[e^{tn[R(\theta)-r(\theta)]}\right]\le e^{\frac{nt^2C^2}{8}}.$$
We put $t=\lambda/n$, which gives:
$$\mathbb{E}_S\left[e^{\lambda[R(\theta)-r(\theta)]}\right]\le e^{\frac{\lambda^2C^2}{8n}}.$$
This is where the proof diverges from the proof of Theorem 1.2. We will now integrate this bound with respect to $\pi$:
$$\mathbb{E}_{\theta\sim\pi}\,\mathbb{E}_S\left[e^{\lambda[R(\theta)-r(\theta)]}\right]\le e^{\frac{\lambda^2C^2}{8n}}.$$
Thanks to Fubini's theorem, we can exchange the integration with respect to $\theta$ and the one with respect to the sample:
$$\mathbb{E}_S\,\mathbb{E}_{\theta\sim\pi}\left[e^{\lambda[R(\theta)-r(\theta)]}\right]\le e^{\frac{\lambda^2C^2}{8n}}\tag{2.2}$$
and we apply Donsker and Varadhan's variational formula (Lemma 2.2) to get:
$$\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)}\right]\le e^{\frac{\lambda^2C^2}{8n}}.$$
Rearranging terms:
$$\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]\le 1.\tag{2.3}$$
The end of the proof uses the Chernoff bound. Fix $s>0$:
$$\mathbb{P}_S\left[\sup_{\rho\in\mathcal{P}(\Theta)}\lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}>s\right]\le\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]e^{-s}\le e^{-s}.$$
Solve $e^{-s}=\varepsilon$, that is, put $s=\log(1/\varepsilon)$, to get
$$\mathbb{P}_S\left[\sup_{\rho\in\mathcal{P}(\Theta)}\lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}>\log\frac{1}{\varepsilon}\right]\le\varepsilon.$$
Rearranging terms gives:
$$\mathbb{P}_S\left[\exists\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]>\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right]\le\varepsilon.$$
Take the complement to end the proof. □

2.1.2 Exact minimization of the bound

We recall that the bound in Theorem 1.2,
$$\mathbb{P}_S\left(\forall\theta\in\Theta,\ R(\theta)\le r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon,$$
motivated the introduction of $\hat{\theta}_{\mathrm{ERM}}$, the minimizer of $r$.

Exactly in the same way, the bound in Theorem 2.1,
$$\mathbb{P}_S\left(\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon,$$
motivates the study of a data-dependent probability measure $\hat{\rho}_\lambda$ defined as:
$$\hat{\rho}_\lambda=\operatorname*{argmin}_{\rho\in\mathcal{P}(\Theta)}\ \mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{KL(\rho\|\pi)}{\lambda}.$$

But does such a minimizer exist? It turns out that the answer is yes, thanks to Donsker and Varadhan's variational formula again! Indeed, minimizing
$$\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{KL(\rho\|\pi)}{\lambda}$$
is equivalent to maximizing
$$-\lambda\mathbb{E}_{\theta\sim\rho}[r(\theta)]-KL(\rho\|\pi),$$
which is exactly what the variational formula handles, with $h(\theta)=-\lambda r(\theta)$. We know that the minimum is reached for $\rho=\pi_{-\lambda r}$ as defined in (2.1). Let us summarize this in the following definition and corollary.

Definition 2.1 In the whole tutorial, we will let $\hat{\rho}_\lambda$ denote "the Gibbs posterior" given by $\hat{\rho}_\lambda=\pi_{-\lambda r}$, that is:
$$\hat{\rho}_\lambda(d\theta)=\frac{e^{-\lambda r(\theta)}\pi(d\theta)}{\mathbb{E}_{\vartheta\sim\pi}\left[e^{-\lambda r(\vartheta)}\right]}.\tag{2.4}$$

Corollary 2.3 The Gibbs posterior is the minimizer of the right-hand side of Theorem 2.1:
$$\hat{\rho}_\lambda=\operatorname*{argmin}_{\rho\in\mathcal{P}(\Theta)}\ \mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{KL(\rho\|\pi)}{\lambda}.$$
As a consequence, for any $\lambda>0$, for any $\varepsilon\in(0,1)$,
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right\}\right)\ge 1-\varepsilon.$$

2.1.3 Some examples, and non-exact minimization of the bound

When you see something like
$$\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda},$$
I'm not sure you immediately see what the order of magnitude of the bound is. I don't. In general, when you apply such a general bound to a set of predictors, I think it is quite important to make the bound more explicit. Let us work through a few examples (I advise you to do the calculations on your own, in these examples and in others).

Example 2.1 (Finite case) Let us start with the special case where $\Theta$ is a finite set, that is, $\mathrm{card}(\Theta)=M<+\infty$. We begin with the application of Corollary 2.3. In this case, the Gibbs posterior $\hat{\rho}_\lambda$ of (2.4) is a probability distribution on the finite set $\Theta$, given by
$$\hat{\rho}_\lambda(\theta)=\frac{e^{-\lambda r(\theta)}\pi(\theta)}{\sum_{\vartheta\in\Theta}e^{-\lambda r(\vartheta)}\pi(\vartheta)},$$
and we have, with probability at least $1-\varepsilon$:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right\}.\tag{2.5}$$
As the bound holds for all $\rho\in\mathcal{P}(\Theta)$, it holds in particular for all $\rho$ in the set of Dirac masses $\{\delta_\theta,\ \theta\in\Theta\}$. Obviously,
$$\mathbb{E}_{\vartheta\sim\delta_\theta}[r(\vartheta)]=r(\theta)$$
and
$$KL(\delta_\theta\|\pi)=\sum_{\vartheta\in\Theta}\log\frac{\delta_\theta(\vartheta)}{\pi(\vartheta)}\,\delta_\theta(\vartheta)=\log\frac{1}{\pi(\theta)},$$
so the bound becomes:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\theta\in\Theta}\left[r(\theta)+\frac{\lambda C^2}{8n}+\frac{\log\frac{1}{\pi(\theta)}+\log\frac{1}{\varepsilon}}{\lambda}\right]\right)\ge 1-\varepsilon,\tag{2.6}$$

with the convention $\log(1/0)=+\infty$. This gives us an intuition on the role of the measure $\pi$: the bound will be tighter for $\theta$'s such that $\pi(\theta)$ is large. However, $\pi$ cannot be large everywhere: it is a probability distribution, so it must satisfy
$$\sum_{\theta\in\Theta}\pi(\theta)=1.$$
The larger the set $\Theta$, the more this total mass of 1 will be spread out, which will lead to large values of $\log(1/\pi(\theta))$.

If $\pi$ is the uniform probability distribution, then $\log(1/\pi(\theta))=\log(M)$, and the bound becomes:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\theta\in\Theta}r(\theta)+\frac{\lambda C^2}{8n}+\frac{\log\frac{M}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon.$$
The choice $\lambda=\sqrt{8n\log(M/\varepsilon)/C^2}$ actually minimizes the right-hand side; this gives:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\theta\in\Theta}r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon.\tag{2.7}$$
That is, the Gibbs posterior $\hat{\rho}_\lambda$ satisfies the same bound as the ERM in Theorem 1.2. Note that the optimization with respect to $\lambda$ is a little more problematic when $\pi$ is not uniform, because the optimal $\lambda$ would depend on $\theta$. We will come back to the choice of $\lambda$ in the general case soon.

Let us also consider the statement of Theorem 2.1 in this case. With probability at least $1-\varepsilon$, we have:
$$\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}.$$
Let us apply this bound to any $\rho$ in the set of Dirac masses $\{\delta_\theta,\ \theta\in\Theta\}$. This gives:
$$\forall\theta\in\Theta,\ R(\theta)\le r(\theta)+\frac{\lambda C^2}{8n}+\frac{\log\frac{1}{\pi(\theta)}+\log\frac{1}{\varepsilon}}{\lambda}$$
and, when $\pi$ is uniform:
$$\forall\theta\in\Theta,\ R(\theta)\le r(\theta)+\frac{\lambda C^2}{8n}+\frac{\log\frac{M}{\varepsilon}}{\lambda}.$$

As this bound holds for any $\theta$, it holds in particular for the ERM, which gives:
$$R(\hat{\theta}_{\mathrm{ERM}})\le\inf_{\theta\in\Theta}r(\theta)+\frac{\lambda C^2}{8n}+\frac{\log\frac{M}{\varepsilon}}{\lambda}$$
and, once again with the choice $\lambda=\sqrt{8n\log(M/\varepsilon)/C^2}$, we recover exactly the result of Theorem 1.2:
$$R(\hat{\theta}_{\mathrm{ERM}})\le\inf_{\theta\in\Theta}r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}.$$
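For intuition, here is a small sketch (my own illustration, with made-up risks) of the Gibbs posterior of Example 2.1 on a finite $\Theta$, computed with a log-sum-exp stabilization. As $\lambda\to 0$ it returns the prior, and as $\lambda$ grows it concentrates on the empirical risk minimizers:

import numpy as np

def gibbs_posterior(emp_risks, prior, lam):
    """rho_lambda(theta) proportional to exp(-lam * r(theta)) * pi(theta)."""
    logw = -lam * np.asarray(emp_risks) + np.log(np.asarray(prior))
    logw -= logw.max()                      # log-sum-exp stabilization
    w = np.exp(logw)
    return w / w.sum()

r = np.array([0.30, 0.25, 0.40])            # empirical risks of 3 predictors
pi = np.ones(3) / 3                         # uniform prior
print(gibbs_posterior(r, pi, lam=0.01))     # close to the prior
print(gibbs_posterior(r, pi, lam=200.0))    # concentrates on the second predictor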

The previous example leads to important remarks:

• PAC-Bayes bounds can be used to prove generalization bounds for Gibbs posteriors, but sometimes they can also be used to study more classical estimators, like the ERM. Many of the recent papers by Catoni and co-authors study robust non-Bayesian estimators thanks to sophisticated PAC-Bayes bounds [45].

• the choice of $\lambda$ has a different status when you study the Gibbs posterior $\hat{\rho}_\lambda$ and the ERM. Indeed, in the bound on the ERM, $\lambda$ is chosen to minimize the bound, but the estimation procedure is not affected by $\lambda$. The bound for the Gibbs posterior is also minimized with respect to $\lambda$, but $\hat{\rho}_\lambda$ depends on $\lambda$. So, if you make a mistake when choosing $\lambda$, this will have bad consequences not only on the bound, but also on the practical performance of the method. This also means that if the bound is not tight, it is likely that the $\lambda$ obtained by minimizing the bound will not lead to good performance in practice. (As you will see soon, we present in Section 3 bounds that do not depend on a parameter like $\lambda$.)

Example 2.2 (Lipschitz loss and Gaussian priors) Let us switch to the continuous case, so that we can derive from PAC-Bayes bounds some results that we wouldn't be able to derive from a union bound argument. We consider the case where $\Theta=\mathbb{R}^d$, the function $\theta\mapsto\ell(f_\theta(x),y)$ is $L$-Lipschitz for any $x$ and $y$, and the prior $\pi$ is a centered Gaussian: $\pi=\mathcal{N}(0,\sigma^2I_d)$, where $I_d$ is the $d\times d$ identity matrix.

Let us, as in the previous example, first study the Gibbs posterior, by an application of Corollary 2.3. With probability at least $1-\varepsilon$,
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right\}.$$
Once again, the right-hand side is an infimum over all possible probability distributions $\rho$, but it is easier to restrict it to Gaussian distributions here. So:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\substack{\rho=\mathcal{N}(m,s^2I_d)\\ m\in\mathbb{R}^d,\ s>0}}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right\}.\tag{2.8}$$
Indeed, it is well known that, for $\rho=\mathcal{N}(m,s^2I_d)$ and $\pi=\mathcal{N}(0,\sigma^2I_d)$,
$$KL(\rho\|\pi)=\frac{\|m\|^2}{2\sigma^2}+\frac{d}{2}\left[\frac{s^2}{\sigma^2}+\log\frac{\sigma^2}{s^2}-1\right].$$
Moreover, the empirical risk $r$ inherits the Lipschitz property of the loss, that is, for any $(\theta,\vartheta)\in\Theta^2$, $r(\theta)\le r(\vartheta)+L\|\vartheta-\theta\|$. So, for $\rho=\mathcal{N}(m,s^2I_d)$,
$$\mathbb{E}_{\theta\sim\rho}[r(\theta)]\le r(m)+L\,\mathbb{E}_{\theta\sim\rho}[\|\theta-m\|]\le r(m)+L\sqrt{\mathbb{E}_{\theta\sim\rho}[\|\theta-m\|^2]}=r(m)+Ls\sqrt{d},$$
where the middle inequality is Jensen's inequality.

Plugging this into (2.8) gives:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{m\in\mathbb{R}^d,\ s>0}\left\{r(m)+Ls\sqrt{d}+\frac{\lambda C^2}{8n}+\frac{\frac{\|m\|^2}{2\sigma^2}+\frac{d}{2}\left[\frac{s^2}{\sigma^2}+\log\frac{\sigma^2}{s^2}-1\right]+\log\frac{1}{\varepsilon}}{\lambda}\right\}.$$
It is possible to minimize the bound completely in $s$, but for now we will just consider the choice $s=\sigma/\sqrt{n}$, which gives:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{m\in\mathbb{R}^d}\left[r(m)+L\sigma\sqrt{\frac{d}{n}}+\frac{\lambda C^2}{8n}+\frac{\frac{\|m\|^2}{2\sigma^2}+\frac{d}{2}\left[\frac{1}{n}-1+\log(n)\right]+\log\frac{1}{\varepsilon}}{\lambda}\right]\le\inf_{m\in\mathbb{R}^d}\left[r(m)+L\sigma\sqrt{\frac{d}{n}}+\frac{\lambda C^2}{8n}+\frac{\frac{\|m\|^2}{2\sigma^2}+\frac{d}{2}\log(n)+\log\frac{1}{\varepsilon}}{\lambda}\right].$$

It is not possible to optimize the bound with respect to $\lambda$, as the optimal value would depend on $m$... However, a way to understand the bound (by making it worse!) is to restrict the infimum over $m$ to $\|m\|\le B$ for some $B>0$. Then we have:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{m:\|m\|\le B}r(m)+L\sigma\sqrt{\frac{d}{n}}+\frac{\lambda C^2}{8n}+\frac{\frac{B^2}{2\sigma^2}+\frac{d}{2}\log(n)+\log\frac{1}{\varepsilon}}{\lambda}.$$
In this case, we see that the optimal $\lambda$ is
$$\lambda=\frac{1}{C}\sqrt{8n\left[\frac{B^2}{2\sigma^2}+\frac{d}{2}\log(n)+\log\frac{1}{\varepsilon}\right]},$$
which gives:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{m:\|m\|\le B}r(m)+L\sigma\sqrt{\frac{d}{n}}+C\sqrt{\frac{\frac{B^2}{2\sigma^2}+\frac{d}{2}\log(n)+\log\frac{1}{\varepsilon}}{2n}}.$$

Note that our choice of $\lambda$ might look a bit weird, as it depends on the confidence level $\varepsilon$. This can be avoided by taking
$$\lambda=\frac{1}{C}\sqrt{8n\left[\frac{B^2}{2\sigma^2}+\frac{d}{2}\log(n)\right]}$$
instead (check what bound you obtain by doing so!).

Finally, as in the previous example, we can also start from the statement of Theorem 2.1: with probability at least $1-\varepsilon$,
$$\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda},$$
and restrict here $\rho$ to the set of Gaussian distributions $\mathcal{N}(m,s^2I_d)$. This leads to the definition of a new data-dependent probability measure, $\tilde{\rho}_\lambda=\mathcal{N}(\tilde{m},\tilde{s}^2I_d)$, where
$$(\tilde{m},\tilde{s})=\operatorname*{argmin}_{m\in\mathbb{R}^d,\ s>0}\ \mathbb{E}_{\theta\sim\mathcal{N}(m,s^2I_d)}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{\frac{\|m\|^2}{2\sigma^2}+\frac{d}{2}\left[\frac{s^2}{\sigma^2}+\log\frac{\sigma^2}{s^2}-1\right]+\log\frac{1}{\varepsilon}}{\lambda}.$$
While the Gibbs posterior $\hat{\rho}_\lambda$ can be quite a complicated object, one simply has to solve this minimization problem to get $\tilde{\rho}_\lambda$. The probability measure $\tilde{\rho}_\lambda$ is actually a special case of what is called a variational approximation of $\hat{\rho}_\lambda$. Variational approximations are very popular in statistics and machine learning, and were indeed analyzed through PAC-Bayes bounds [9, 8, 190]. We will come back to this in Section 6. For now, following the same computations, and using the same choice of $\lambda$ as for $\hat{\rho}_\lambda$, we obtain the same bound:
$$\mathbb{E}_{\theta\sim\tilde{\rho}_\lambda}[R(\theta)]\le\inf_{m:\|m\|\le B}r(m)+L\sigma\sqrt{\frac{d}{n}}+C\sqrt{\frac{\frac{B^2}{2\sigma^2}+\frac{d}{2}\log(n)+\log\frac{1}{\varepsilon}}{2n}}.$$
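The minimization problem defining $\tilde{\rho}_\lambda$ is easy to set up numerically. Below is a minimal sketch (assumptions mine: a generic emp_risk function supplied by the user, and a Monte Carlo estimate of $\mathbb{E}_{\theta\sim\rho}[r(\theta)]$) of the objective to be minimized over $(m,s)$; any optimizer, for example gradient descent with the reparametrization $\theta=m+s\xi$, $\xi\sim\mathcal{N}(0,I_d)$, can then be used:

import numpy as np

def gaussian_pac_bayes_objective(emp_risk, m, s, sigma, lam, n, C=1.0,
                                 eps=0.05, n_mc=100, rng=None):
    """Objective of Theorem 2.1 for rho = N(m, s^2 I_d), pi = N(0, sigma^2 I_d).
    m is a numpy array of shape (d,); emp_risk(theta) returns r(theta)."""
    rng = rng or np.random.default_rng(0)
    d = m.shape[0]
    # closed-form KL( N(m, s^2 I) || N(0, sigma^2 I) )
    kl = (m @ m) / (2 * sigma**2) \
        + 0.5 * d * (s**2 / sigma**2 + np.log(sigma**2 / s**2) - 1)
    # Monte Carlo estimate of the expected empirical risk under rho
    thetas = m + s * rng.standard_normal((n_mc, d))
    exp_risk = np.mean([emp_risk(t) for t in thetas])
    return exp_risk + lam * C**2 / (8 * n) + (kl + np.log(1 / eps)) / lam

Minimizing this objective over $(m,s)$ returns (an approximation of) the variational posterior $\tilde{\rho}_\lambda$.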

Example 2.3 (Model aggregation, model selection) In the case where we have several sets of predictors, say $\Theta_1,\dots,\Theta_M$, equipped with priors $\pi_1,\dots,\pi_M$ respectively, it is possible to define a prior on $\Theta=\bigcup_{j=1}^M\Theta_j$. For the sake of simplicity, assume that the $\Theta_j$'s are disjoint, and let $p=(p(1),\dots,p(M))$ be a probability distribution on $\{1,\dots,M\}$. We simply put:
$$\pi=\sum_{j=1}^M p(j)\,\pi_j.$$

The minimization of the bound in Theorem 2.1 leads to the Gibbs posterior $\hat{\rho}_\lambda$, which will in general put mass on all the $\Theta_j$, so this is a model aggregation procedure in the spirit of [128]. On the other hand, we can also restrict the minimization in the PAC-Bayes bound to distributions that charge only one of the models, that is, to $\rho\in\mathcal{P}(\Theta_1)\cup\dots\cup\mathcal{P}(\Theta_M)$. Theorem 2.1 becomes:
$$\mathbb{P}_S\left(\forall j\in\{1,\dots,M\},\forall\rho\in\mathcal{P}(\Theta_j),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon,$$
that is,
$$\mathbb{P}_S\left(\forall j\in\{1,\dots,M\},\forall\rho\in\mathcal{P}(\Theta_j),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi_j)+\log\frac{1}{p(j)}+\log\frac{1}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon.$$

Thus, we can propose the following procedure:

• first, we build the Gibbs posterior for each model $j$:
$$\hat{\rho}^{(j)}_\lambda(d\theta)=\frac{e^{-\lambda r(\theta)}\pi_j(d\theta)}{\int_{\Theta_j}e^{-\lambda r(\vartheta)}\pi_j(d\vartheta)},$$

• then, we perform model selection:
$$\hat{j}=\operatorname*{argmin}_{1\le j\le M}\left\{\mathbb{E}_{\theta\sim\hat{\rho}^{(j)}_\lambda}[r(\theta)]+\frac{KL(\hat{\rho}^{(j)}_\lambda\|\pi_j)+\log\frac{1}{p(j)}}{\lambda}\right\}.$$

The obtained $\hat{j}$ satisfies:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}^{(\hat{j})}_\lambda}[R(\theta)]\le\min_{1\le j\le M}\inf_{\rho\in\mathcal{P}(\Theta_j)}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi_j)+\log\frac{1}{p(j)}+\log\frac{1}{\varepsilon}}{\lambda}\right\}\right)\ge 1-\varepsilon.$$
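Here is a sketch of this two-step procedure in the simplest situation where each model $\Theta_j$ is finite (all names and numerical values below are mine, for illustration only):

import numpy as np

def gibbs(r, pi, lam):
    w = np.exp(-lam * np.asarray(r)) * np.asarray(pi)
    return w / w.sum()

def select_model(risks, priors, p, lam):
    """Step 1: Gibbs posterior in each model; step 2: pick the j minimizing the bound."""
    scores = []
    for r_j, pi_j, p_j in zip(risks, priors, p):
        r_j, pi_j = np.asarray(r_j), np.asarray(pi_j)
        rho = gibbs(r_j, pi_j, lam)
        kl = float(np.sum(rho * np.log(rho / pi_j)))   # finite-case KL, rho > 0 here
        scores.append(rho @ r_j + (kl + np.log(1.0 / p_j)) / lam)
    return int(np.argmin(scores))                      # j_hat (0-indexed)

# two models with 2 and 3 predictors, p uniform on {1, 2}:
print(select_model([[0.30, 0.28], [0.25, 0.35, 0.40]],
                   [[0.5, 0.5], [1/3, 1/3, 1/3]],
                   [0.5, 0.5], lam=50.0))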

2.1.4 The choice of λ

As discussed earlier, it is in general not possible to optimize the right-hand side of the PAC-Bayes inequality with respect to $\lambda$. For example, in (2.5), the optimal value of $\lambda$ could depend on $\rho$, which is not allowed by Theorem 2.1. In the previous examples, we have seen that in some situations, if one is lucky enough, the optimal $\lambda$ actually does not depend on $\rho$, but we still need a procedure for the general case.

A natural idea is to propose a finite grid $\Lambda\subset(0,+\infty)$ and to minimize over this grid, which can be justified by a union bound argument.

Theorem 2.4 Let $\Lambda\subset(0,+\infty)$ be a finite set. For any $\varepsilon\in(0,1)$,
$$\mathbb{P}_S\left(\forall\rho\in\mathcal{P}(\Theta),\forall\lambda\in\Lambda,\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon.$$

Proof. Fix $\lambda\in\Lambda$, and then follow the proof of Theorem 2.1 until (2.3):
$$\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]\le 1.$$
Sum over $\lambda\in\Lambda$ to get:
$$\sum_{\lambda\in\Lambda}\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]\le\mathrm{card}(\Lambda)$$
and so
$$\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta),\lambda\in\Lambda}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]\le\mathrm{card}(\Lambda).$$
The end of the proof is as for Theorem 2.1; we start with the Chernoff bound. Fix $s>0$:
$$\mathbb{P}_S\left[\sup_{\rho\in\mathcal{P}(\Theta),\lambda\in\Lambda}\lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}>s\right]\le\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta),\lambda\in\Lambda}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]e^{-s}\le\mathrm{card}(\Lambda)e^{-s}.$$
Solve $\mathrm{card}(\Lambda)e^{-s}=\varepsilon$, that is, put $s=\log(\mathrm{card}(\Lambda)/\varepsilon)$, to get
$$\mathbb{P}_S\left[\sup_{\rho\in\mathcal{P}(\Theta),\lambda\in\Lambda}\lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}>\log\frac{\mathrm{card}(\Lambda)}{\varepsilon}\right]\le\varepsilon.$$
Rearranging terms gives:
$$\mathbb{P}_S\left[\exists\rho\in\mathcal{P}(\Theta),\exists\lambda\in\Lambda,\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]>\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda}\right]\le\varepsilon.$$
Take the complement to get the statement of the theorem. □

This leads to the following procedure. First, we recall that, for a fixed $\lambda$, the minimizer of the bound is $\hat{\rho}_\lambda=\pi_{-\lambda r}$. Then, we put $\hat{\rho}=\hat{\rho}_{\hat{\lambda}}$, where
$$\hat{\lambda}=\operatorname*{argmin}_{\lambda\in\Lambda}\left\{\mathbb{E}_{\theta\sim\pi_{-\lambda r}}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\pi_{-\lambda r}\|\pi)+\log\frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda}\right\}.\tag{2.9}$$

We immediately have the following result.

Corollary 2.5 Define $\hat{\rho}$ as in (2.9). For any $\varepsilon\in(0,1)$, we have
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}}[R(\theta)]\le\inf_{\substack{\rho\in\mathcal{P}(\Theta)\\ \lambda\in\Lambda}}\left[\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda}\right]\right)\ge 1-\varepsilon.$$

We could, for example, propose an arithmetic grid $\Lambda=\{1,2,\dots,n\}$. The bound in Corollary 2.5 becomes:
$$\mathbb{E}_{\theta\sim\hat{\rho}}[R(\theta)]\le\inf_{\substack{\rho\in\mathcal{P}(\Theta)\\ \lambda\in\{1,\dots,n\}}}\left[\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{n}{\varepsilon}}{\lambda}\right].$$
It is also possible to replace the optimization over this discrete grid by an optimization over a continuous range. Indeed, for any $\lambda\in[1,n]$, we simply apply the bound to the integer part $\lfloor\lambda\rfloor$ of $\lambda$, and remark that we can upper bound $\lfloor\lambda\rfloor\le\lambda$ and $1/\lfloor\lambda\rfloor\le 1/(\lambda-1)$. So the bound becomes:
$$\mathbb{E}_{\theta\sim\hat{\rho}}[R(\theta)]\le\inf_{\substack{\rho\in\mathcal{P}(\Theta)\\ \lambda\in[1,n]}}\left[\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{n}{\varepsilon}}{\lambda-1}\right].$$


The arithmetic grid is not the best choice, though: the $\log(n)$ term can be improved. In order to optimize hyperparameters in PAC-Bayes bounds, Langford and Caruana [106] used a geometric grid $\Lambda=\{e^k,\ k\in\mathbb{N}\}\cap[1,n]$; the same choice was used later by Catoni [41, 43]. Using such a grid in Corollary 2.5, we get
$$\mathbb{E}_{\theta\sim\hat{\rho}}[R(\theta)]\le\inf_{\substack{\rho\in\mathcal{P}(\Theta)\\ \lambda\in[1,n]}}\left[\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1+\log n}{\varepsilon}}{\lambda/e}\right].$$
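In the finite-$\Theta$ case, the whole procedure (2.9) with the geometric grid takes a few lines; here is a minimal sketch (my own illustration, with made-up risks):

import numpy as np

def select_lambda(r, pi, n, C=1.0, eps=0.05):
    """Minimize the empirical bound of Corollary 2.5 over the grid {e^k} in [1, n]."""
    r, pi = np.asarray(r), np.asarray(pi)
    grid = np.exp(np.arange(int(np.log(n)) + 1))     # geometric grid in [1, n]
    best = None
    for lam in grid:
        rho = np.exp(-lam * r) * pi
        rho /= rho.sum()
        kl = float(np.sum(rho * np.log(rho / pi)))
        bound = rho @ r + lam * C**2 / (8 * n) + (kl + np.log(len(grid) / eps)) / lam
        if best is None or bound < best[0]:
            best = (bound, lam, rho)
    return best  # (bound value, lambda_hat, rho_hat)

print(select_lambda(r=[0.30, 0.25, 0.40], pi=[1/3, 1/3, 1/3], n=1000))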

We conclude this discussion on the choice of $\lambda$ by mentioning that there are other PAC-Bayes bounds, for example McAllester's bound [128], where there is no parameter $\lambda$ to optimize. We will study these bounds in Section 3.

2.2 PAC-Bayes bound on aggregation of predictors

In the introduction, right after Definition 1.1, I promised that PAC-Bayes bounds would allow us to control:

• the risk of randomized predictors,

• the expected risk of randomized predictors,

• the risk of averaged predictors.

But so far, we have only focused on the expected risk of randomized predictors (the second bullet point). In this subsection, we provide some bounds on averaged predictors, and in the next one, we will focus on the risk of randomized predictors.

We start with a very simple remark. When the loss function $u\mapsto\ell(u,y)$ is convex for any $y$, the risk $R(\theta)=R(f_\theta)$ is a convex function of $f_\theta$. Thus, Jensen's inequality ensures:
$$\mathbb{E}_{\theta\sim\rho}[R(f_\theta)]\ge R\left(\mathbb{E}_{\theta\sim\rho}[f_\theta]\right).$$

Plugging this into Corollary 2.3 immediately gives the following result.

Corollary 2.6 Assume that, for all $y\in\mathcal{Y}$, $u\mapsto\ell(u,y)$ is convex. Define
$$\hat{f}_{\hat{\rho}_\lambda}(\cdot)=\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[f_\theta(\cdot)].$$
For any $\lambda>0$, for any $\varepsilon\in(0,1)$,
$$\mathbb{P}_S\left(R(\hat{f}_{\hat{\rho}_\lambda})\le\inf_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right\}\right)\ge 1-\varepsilon.$$

That is, in the case of a convex loss function, like the quadratic loss or the hinge loss, PAC-Bayes bounds also provide bounds on the risk of aggregated predictors.

This is also feasible under other assumptions. For example, we can use the Lipschitz property as in Example 2.2; it can also be done using the margin of the classifier (see some references in Subsection 6.2 below).
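As an illustration of Corollary 2.6, here is a sketch of my own (for regression, where the quadratic loss is indeed convex in its first argument): aggregation is simply a $\rho$-weighted average of the basic predictors:

import numpy as np

def aggregate(predictors, rho):
    """Return f_rho with f_rho(x) = sum_k rho_k * f_k(x)."""
    return lambda x: sum(w * f(x) for w, f in zip(rho, predictors))

fs = [lambda x, a=a: a * x for a in (0.5, 1.0, 2.0)]   # three linear predictors
f_agg = aggregate(fs, np.array([0.2, 0.5, 0.3]))
print(f_agg(1.5))  # 0.2*0.75 + 0.5*1.5 + 0.3*3.0 = 1.8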


2.3 PAC-Bayes bound on a single draw from the posterior

Theorem 2.7 For any $\lambda>0$, for any $\varepsilon\in(0,1)$, for any data-dependent probability measure $\tilde{\rho}$,
$$\mathbb{P}_{S,\,\tilde{\theta}\sim\tilde{\rho}}\left(R(\tilde{\theta})\le r(\tilde{\theta})+\frac{\lambda C^2}{8n}+\frac{\log\frac{d\tilde{\rho}}{d\pi}(\tilde{\theta})+\log\frac{1}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon.$$

This bound simply says that if you draw $\tilde{\theta}$ from, for example, the Gibbs posterior $\hat{\rho}_\lambda$ (defined in (2.4)), you have a bound on $R(\tilde{\theta})$ that holds with large probability simultaneously over the drawing of the sample and of $\tilde{\theta}$.

Proof. Once again, we follow the proof of Theorem 2.1, until (2.2):
$$\mathbb{E}_S\,\mathbb{E}_{\theta\sim\pi}\left[e^{\lambda[R(\theta)-r(\theta)]}\right]\le e^{\frac{\lambda^2C^2}{8n}}.$$
Now, for any function $h\ge 0$,
$$\mathbb{E}_{\theta\sim\pi}[h(\theta)]=\int h(\theta)\pi(d\theta)\ge\int_{\left\{\frac{d\tilde{\rho}}{d\pi}(\theta)>0\right\}}h(\theta)\pi(d\theta)=\int_{\left\{\frac{d\tilde{\rho}}{d\pi}(\theta)>0\right\}}h(\theta)\frac{d\pi}{d\tilde{\rho}}(\theta)\,\tilde{\rho}(d\theta)=\mathbb{E}_{\theta\sim\tilde{\rho}}\left[h(\theta)e^{-\log\frac{d\tilde{\rho}}{d\pi}(\theta)}\right]$$
and in particular:
$$\mathbb{E}_S\,\mathbb{E}_{\theta\sim\tilde{\rho}}\left[e^{\lambda[R(\theta)-r(\theta)]-\log\frac{d\tilde{\rho}}{d\pi}(\theta)}\right]\le e^{\frac{\lambda^2C^2}{8n}}.$$
I could go through the proof until the end, but I think that you can now guess that it is essentially the Chernoff bound + a rearrangement of the terms. □
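Here is a sketch (finite $\Theta$, my own toy numbers) of the single-draw procedure: draw $\tilde{\theta}$ from the Gibbs posterior and evaluate the right-hand side of Theorem 2.7, where the KL term of the previous bounds is replaced by the log-density ratio at the drawn point:

import numpy as np

rng = np.random.default_rng(0)
r = np.array([0.30, 0.25, 0.40])
pi = np.ones(3) / 3
lam, n, C, eps = 50.0, 1000, 1.0, 0.05
rho = np.exp(-lam * r) * pi
rho /= rho.sum()                              # Gibbs posterior rho_hat
k = rng.choice(len(r), p=rho)                 # tilde_theta drawn from rho_hat
bound = r[k] + lam * C**2 / (8 * n) + (np.log(rho[k] / pi[k]) + np.log(1 / eps)) / lam
print(k, bound)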

2.4 Bound in expectation

We end this section with one more variant of the initial PAC-Bayes bound of Theorem 2.1: a bound in expectation with respect to the sample.

Theorem 2.8 For any $\lambda>0$, for any data-dependent probability measure $\tilde{\rho}$,
$$\mathbb{E}_S\,\mathbb{E}_{\theta\sim\tilde{\rho}}[R(\theta)]\le\mathbb{E}_S\left[\mathbb{E}_{\theta\sim\tilde{\rho}}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\tilde{\rho}\|\pi)}{\lambda}\right].$$
In particular, for $\tilde{\rho}=\hat{\rho}_\lambda$ the Gibbs posterior,
$$\mathbb{E}_S\,\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\mathbb{E}_S\inf_{\rho\in\mathcal{P}(\Theta)}\left[\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)}{\lambda}\right].$$

These bounds in expectation are very convenient tools from a pedagogical point of view. Indeed, in Section 4, we will study oracle PAC-Bayes inequalities. While it is possible to derive oracle PAC-Bayes bounds both in expectation and with large probability, the ones in expectation are much simpler to derive, and much shorter. Thus, I will mostly provide PAC-Bayes oracle bounds in expectation in Section 4, and refer the reader to [41, 43] for the corresponding bounds in probability.

Note that, as the bound does not hold with large probability like the previous bounds do, it is no longer a PAC bound in the proper sense: Probably Approximately Correct. Once, I was attending a talk by Tsybakov where he presented some results from his paper with Dalalyan [61] that can also be interpreted as a "PAC-Bayes bound in expectation", and he suggested the more appropriate acronym EAC-Bayes: Expectedly Approximately Correct (their paper is briefly discussed in Subsection 6.4 below). I don't think this term has often been reused since then. I also found recently in [80] the acronym MAC-Bayes: Mean Approximately Correct. In order to avoid any confusion, I will stick to "PAC-Bayes bound in expectation", but I like EAC and MAC!

Proof. Once again, the beginning of the proof is the same as for Theorem 2.1, until (2.3):
$$\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]\le 1.$$
This time, use Jensen's inequality to move the expectation with respect to the sample inside the exponential function:
$$e^{\mathbb{E}_S\left[\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}\right]}\le 1,$$
that is,
$$\mathbb{E}_S\left[\sup_{\rho\in\mathcal{P}(\Theta)}\lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}\right]\le 0.$$
In particular,
$$\mathbb{E}_S\left[\lambda\mathbb{E}_{\theta\sim\tilde{\rho}}[R(\theta)-r(\theta)]-KL(\tilde{\rho}\|\pi)-\frac{\lambda^2C^2}{8n}\right]\le 0.$$
Rearrange terms. □

2.5 Applications of empirical PAC-Bayes bounds

The original PAC-Bayes bounds were stated for classification [127], and it soon became clear that many results could be extended to any bounded loss, thus covering, for example, bounded regression (we discuss in Section 5 how to get rid of the boundedness assumption). Thus, some papers are written in no specific setting, with a generic loss that can cover classification, regression, or density estimation (this is the case, among others, of Chapter 1 of my PhD thesis [1] and the corresponding paper [2], where I studied a generalization of Catoni's results of [43] to unbounded losses).

However, some empirical PAC-Bayes bounds were also developed for, or applied to, specific models, sometimes taking advantage of some specificities of the model. We mention for example:

• ranking/scoring [150],

• density estimation [91],

• multiple testing [35], which is tackled with related techniques,

• deep learning: even though deep networks are trained for classification or regression, the application of PAC-Bayes bounds to deep learning is not straightforward. We discuss this in Section 3, based on [67] and more recent references.

• unsupervised learning, including clustering [164, 13] and representation learning [140, 139].

Note that this list is non-exhaustive, and that many more applications are presented in Section 4 (more precisely, in Subsection 4.3).

3 Tight and non-vacuous PAC-Bayes bounds

3.1 Why is there a race to the tighter PAC-Bayes bound?

Let us start with a numerical application of the PAC-Bayes bounds we met in Section 2. First, assume we are in the classification setting with the 0-1 loss, so that $C=1$. We are given a small set of classifiers, say $M=100$, and on a sample of size $n=1000$, the best of these classifiers has an empirical risk $r_n=0.26$. Let us use the bound in (2.7), which I recall here:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\theta\in\Theta}r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon.$$
With $\varepsilon=0.05$, this bound is:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le 0.26+\underbrace{1\cdot\sqrt{\frac{\log\frac{100}{0.05}}{2\times 1000}}}_{\le 0.062}\right)\ge 0.95.$$
So the classification risk using the Gibbs posterior is smaller than 0.322 with probability at least 95%.

Let us now switch to a more problematic example. We consider a very simple binary neural network, given by the following formula for $x\in\mathbb{R}^d$, where $\varphi$ is a nonlinear activation function (e.g. $\varphi(x)=\max(x,0)$):
$$f_w(x)=\mathbf{1}\left[\sum_{i=1}^M w^{(2)}_i\,\varphi\left(\sum_{j=1}^d w^{(1)}_{j,i}x_j\right)\ge 0\right],$$
where the weights $w^{(1)}_{j,i}$ and $w^{(2)}_i$ are all in $\{-1,+1\}$ for $1\le j\le d$ and $1\le i\le M$. Define $\theta=(w^{(1)}_{1,1},w^{(1)}_{1,2},\dots,w^{(1)}_{d,M},w^{(2)}_1,\dots,w^{(2)}_M)$. Note that the set of all possible such networks has cardinality $2^{M(d+1)}$. Consider inputs that are $100\times 100$ greyscale images, that is, $x\in[0,1]^d$ with $d=10{,}000$, and a sample size $n=10{,}000$. With neural networks, it is often the case that a perfect classification of the training sample is possible, that is, there is a $\theta$ such that $r(\theta)=0$.

Even for a moderate number of units such as $M=100$, this leads to the PAC-Bayes bound (with $\varepsilon=0.05$):
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\underbrace{1\cdot\sqrt{\frac{\log\frac{2^{1{,}000{,}100}}{0.05}}{2\times 10{,}000}}}_{\simeq 13.58}\right)\ge 0.95.$$
So the classification risk using the Gibbs posterior is smaller than 13.58 with probability at least 95%. This is not informative at all, because we already know that the classification risk is smaller than 1. Such a bound is usually referred to as a vacuous bound, because it does not bring any information at all. You can try to improve the bound by increasing the dataset size, but you can check that even $n=1{,}000{,}000$ still leads to a vacuous bound with this network.

Various opinions on these vacuous bounds are possible:

• "theory is useless. I don't know why I would care about generalization guarantees; neural networks work in practice." This opinion is lazy: it's just a good excuse not to have to think about generalization guarantees. I will assume that since you are reading this tutorial, this is not your opinion.

• "vacuous bounds are certainly better than no bounds at all!" This opinion is cynical; it can be rephrased as "better to have a theory that doesn't work than no theory at all: at least we can claim we have a theory, and some people might even believe us". But the theory just says nothing.

• "let's get back to work, and improve the bounds". Since the publication of the first PAC-Bayes bounds already mentioned [166, 127, 128], many variants have been proven. One can try to test which one is the best in a given setting, try to improve the priors, try to refine the bound in many ways... In 2017, Dziugaite and Roy [67] obtained non-vacuous (even though not really tight yet) PAC-Bayes bounds for practical neural networks (since then, tighter bounds have been obtained by these authors and by others). This is a remarkable achievement, and it also immediately made PAC-Bayes theory more popular than it had ever been before.

Let's begin this section with a review of some popular PAC-Bayes bounds (Subsection 3.2). We will then explain which bounds, and which improvements, led to tight generalization bounds for deep learning (Subsection 3.3). In particular, we will focus on a very important approach to improving the bounds: data-dependent priors.

3.2 A few PAC-Bayes bounds

Note that the original works on PAC-Bayes focused only on classification with the 0-1 loss. So, for the whole of Subsection 3.2, we assume that $\ell$ is the 0-1 loss function. Remember this means that $R$ and $r$ take values in $[0,1]$ (so $C=1$ in this subsection).

3.2.1 McAllester’s bound [127] and Maurer’s improved bound [126]

As the original paper by McAllester [127] focused on finite or denumerable sets $\Theta$, let us start with the first bound for a general $\Theta$, from [128].

Theorem 3.1 (Theorem 1 in [128]) For any $\varepsilon>0$,
$$\mathbb{P}_S\left[\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\sqrt{\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}+\frac{5}{2}\log(n)+8}{2n-1}}\right]\ge 1-\varepsilon.$$

Compared to Theorem 2.1, note that there is no parameter $\lambda$ here to optimize. On the other hand, one can no longer use Lemma 2.2 to minimize the right-hand side. A way to solve this problem is to make the parameter $\lambda$ appear artificially, using the inequality $\sqrt{ab}\le a\lambda/2+b/(2\lambda)$ for any $\lambda>0$ (here applied to Maurer's improved version of the bound [126], stated in Subsection 3.2.3):
$$\mathbb{P}_S\left[\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\inf_{\lambda>0}\left(\frac{\lambda}{4n}+\frac{KL(\rho\|\pi)+\log\frac{2}{\varepsilon}+\frac{1}{2}\log(n)}{2\lambda}\right)\right]\ge 1-\varepsilon.\tag{3.1}$$

On the other hand, the price to pay for the optimization with respect to $\lambda$ in Theorem 2.4 was a $\log(n)$ term (already present in Maurer's bound) for an arithmetic grid, and a $\log\log(n)$ term when using a geometric grid. So, asymptotically in $n$, Theorem 2.4 with a geometric grid will always lead to better results than Theorem 3.1. On the other hand, the constants in Theorem 3.1 are smaller, so the bound can be better for small sample sizes (a point that should not be neglected for tight certificates in practice!).
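To get a feeling for this comparison, here is a small numerical sketch (mine; the KL value and sample size are made up) of the complexity term of Theorem 3.1 against that of Theorem 2.1 with $\lambda$ optimized over a grid (ignoring, for simplicity, the $\log\mathrm{card}(\Lambda)$ union-bound cost of Theorem 2.4):

import math

def mcallester_term(kl, n, eps=0.05):
    """Complexity term of Theorem 3.1."""
    return math.sqrt((kl + math.log(1 / eps) + 2.5 * math.log(n) + 8) / (2 * n - 1))

def catoni_term(kl, n, lam, eps=0.05, C=1.0):
    """Complexity term of Theorem 2.1 for a given lambda."""
    return lam * C**2 / (8 * n) + (kl + math.log(1 / eps)) / lam

kl, n = 5.0, 10_000
print(mcallester_term(kl, n))
print(min(catoni_term(kl, n, lam) for lam in range(1, n)))  # lambda on a grid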

It is possible to minimize the right-hand side of (3.1) with respect to $\rho$, and this leads to a Gibbs posterior: $\hat{\rho}=\pi_{-2\lambda r}$. It is also possible to minimize it with respect to $\lambda$, but the minimization in $\lambda$ when $\rho$ itself depends on $\lambda$ is a bit more tricky. On this problem, we want to mention the more recent paper [174]: the authors proved a bound that is easier to minimize simultaneously in $\lambda$ and $\rho$.

3.2.2 Catoni’s bound (another one) [43]

Theorem 2.1 was based on Catoni's preprint [41]. Catoni's monograph [43] provides many other bounds.
