
arXiv:2110.11216v4 [stat.ML] 9 Nov 2021

User-friendly introduction to PAC-Bayes bounds

Pierre Alquier

RIKEN AIP, Tokyo, Japan

Abstract

Aggregated predictors are obtained by making a set of basic predictors vote according to some weights, that is, to some probability distribution.

Randomized predictors are obtained by sampling in a set of basic predictors, according to some prescribed probability distribution.

Thus, aggregated and randomized predictors have in common that they are not defined by a minimization problem, but by a probability distribution on the set of predictors. In statistical learning theory, there is a set of tools designed to understand the generalization ability of such procedures: PAC-Bayesian or PAC-Bayes bounds.

Since the original PAC-Bayes bounds [166, 127], these tools have been considerably improved in many directions (we will, for example, describe a simplified version of the localization technique of [41, 43] that was missed by the community, and later rediscovered as "mutual information bounds"). Very recently, PAC-Bayes bounds have received considerable attention: for example, there was a workshop on PAC-Bayes at NIPS 2017, (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights, organized by B. Guedj, F. Bach and P. Germain. One of the reasons for this recent success is the successful application of these bounds to neural networks [67].

An elementary introduction to PAC-Bayes theory is still missing. This is an attempt to provide such an introduction.

Contents

1 Introduction
  1.1 Machine learning and PAC bounds
    1.1.1 Machine learning: notations
    1.1.2 PAC bounds
  1.2 What are PAC-Bayes bounds?
  1.3 Why this tutorial?
  1.4 Two types of PAC bounds, organization of these notes

2 First step in the PAC-Bayes world
  2.1 A simple PAC-Bayes bound
    2.1.1 Catoni's bound [41]
    2.1.2 Exact minimization of the bound
    2.1.3 Some examples, and non-exact minimization of the bound
    2.1.4 The choice of λ
  2.2 PAC-Bayes bound on aggregation of predictors
  2.3 PAC-Bayes bound on a single draw from the posterior
  2.4 Bound in expectation
  2.5 Applications of empirical PAC-Bayes bounds

3 Tight and non-vacuous PAC-Bayes bounds
  3.1 Why is there a race to the tighter PAC-Bayes bound?
  3.2 A few PAC-Bayes bounds
    3.2.1 McAllester's bound [127] and Maurer's improved bound [126]
    3.2.2 Catoni's bound (another one) [43]
    3.2.3 Seeger's bound [161] and Maurer's bound [126]
    3.2.4 Tolstikhin and Seldin's bound [175]
    3.2.5 Thiemann, Igel, Wintenberger and Seldin's bound [174]
    3.2.6 A bound by Germain, Lacasse, Laviolette and Marchand [77]
  3.3 Tight generalization error bounds for deep learning
    3.3.1 A milestone: non-vacuous generalization error bounds for deep networks by Dziugaite and Roy [67]
    3.3.2 Bounds with data-dependent priors
    3.3.3 Comparison of the bounds and tight certificates for neural networks [147]

4 PAC-Bayes oracle inequalities and fast rates
  4.1 From empirical inequalities to oracle inequalities
    4.1.1 Bound in expectation
    4.1.2 Bound in probability
  4.2 Bernstein assumption and fast rates
  4.3 Applications of Theorem 4.3
  4.4 Dimension and rate of convergence
  4.5 Getting rid of the log terms: Catoni's localization trick

5 Beyond "bounded loss" and "i.i.d. observations"
  5.1 "Almost" bounded losses (sub-Gaussian and sub-gamma)
    5.1.1 The sub-Gaussian case
    5.1.2 The sub-gamma case
    5.1.3 Remarks on exponential moments
  5.2 Heavy-tailed losses
    5.2.1 The truncation approach
    5.2.2 Bounds based on moment inequalities
    5.2.3 Bounds based on robust losses
  5.3 Dependent observations
    5.3.1 Inequalities for dependent variables
    5.3.2 A simple example
  5.4 Other non-i.i.d. settings
    5.4.1 Non identically distributed observations
    5.4.2 Shift in the distribution
    5.4.3 Meta-learning

6 Related approaches in statistics and machine learning theory
  6.1 Bayesian inference in statistics
    6.1.1 Gibbs posteriors, generalized posteriors
    6.1.2 Contraction of the posterior in Bayesian nonparametrics
    6.1.3 Variational approximations
  6.2 Empirical risk minimization
  6.3 Online learning
    6.3.1 Sequential prediction
    6.3.2 Bandits and reinforcement learning (RL)
  6.4 Aggregation of estimators in statistics
  6.5 Information theoretic approaches
    6.5.1 Minimum description length
    6.5.2 Mutual information bounds (MI)

7 Conclusion

1 Introduction

In a supervised learning problem, such as classification or regression, we are given a data set, and we 1) fix a set of predictors and 2) find a good predictor in this set. For example, when doing linear regression, you 1) choose to consider only linear predictors and 2) use the least-squares method to choose your linear predictor.

PAC-Bayes bounds will allow us to define and study "randomized" or "aggregated" predictors. By this, we mean that we will replace 2) by 2') define weights on the predictors and make them vote according to these weights, or by 2'') draw a predictor according to some prescribed probability distribution.

1.1 Machine learning and PAC bounds

1.1.1 Machine learning: notations

We will assume that the reader is already familiar with the setting of supervised learning and the corresponding definitions. We briefly recall the notation involved here:


• an object set $\mathcal{X}$: photos, texts, $\mathbb{R}^d$... (equipped with a $\sigma$-algebra $\mathcal{S}_x$).

• a label set $\mathcal{Y}$, usually a finite set for classification problems or the set of real numbers for regression problems (equipped with a $\sigma$-algebra $\mathcal{S}_y$).

• a probability distribution $P$ on $(\mathcal{X}\times\mathcal{Y},\mathcal{S}_x\otimes\mathcal{S}_y)$, which is not known.

• the data, or observations: $(X_1,Y_1),\dots,(X_n,Y_n)$. From now on, and until the end of Section 4, we assume that $(X_1,Y_1),\dots,(X_n,Y_n)$ are i.i.d. from $P$.

• a predictor is a measurable function $f:\mathcal{X}\to\mathcal{Y}$.

• we fix a set of predictors indexed by a parameter set $\Theta$ (equipped with a $\sigma$-algebra $\mathcal{T}$): $\{f_\theta,\ \theta\in\Theta\}$. In regression, the basic example is $f_\theta(x)=\theta^T x$ for $\mathcal{X}=\Theta=\mathbb{R}^d$. The analogue for classification is:
$$f_\theta(x)=\begin{cases}1 & \text{if } \theta^T x\ge 0,\\ 0 & \text{otherwise.}\end{cases}$$
A more sophisticated example: the set of all neural networks with a fixed architecture, $\theta$ being the weights of the network.

• a loss function, that is, a measurable function $\ell:\mathcal{Y}^2\to[0,+\infty)$ with $\ell(y,y)=0$. In a classification problem, a very common loss function is:
$$\ell(y',y)=\begin{cases}1 & \text{if } y'\neq y,\\ 0 & \text{if } y'=y.\end{cases}$$
We will refer to it as the 0-1 loss function, and will use the following shorter notation: $\ell(y',y)=\mathbf{1}(y\neq y')$. However, it is often more convenient to consider convex loss functions, such as $\ell(y',y)=\max(1-yy',0)$ (the hinge loss). In regression problems, the most popular examples are the quadratic loss $\ell(y',y)=(y'-y)^2$ and the absolute loss $\ell(y',y)=|y'-y|$. Note that the original PAC-Bayes bounds in [127] were stated in the special case of the 0-1 loss, and this is also the case of most bounds published since then. However, PAC-Bayes bounds for regression with the quadratic loss were proven in [42], and in many works since then (they will be mentioned later). From now on, and until the end of Section 4, we assume that $0\le\ell\le C$. This might be either because we are using the 0-1 loss, or the quadratic loss in a setting where $f_\theta(x)$ and $y$ are bounded.

• the generalization error of a predictor, or generalization risk, or simply risk:
$$R(f)=\mathbb{E}_{(X,Y)\sim P}[\ell(f(X),Y)].$$
For short, as we will only consider predictors in $\{f_\theta,\ \theta\in\Theta\}$, we will write $R(\theta):=R(f_\theta)$. This function is not accessible, because it depends on the unknown $P$.

• for short, we put $\ell_i(\theta):=\ell(f_\theta(X_i),Y_i)\ge 0$.

• the empirical risk
$$r(\theta)=\frac{1}{n}\sum_{i=1}^n \ell_i(\theta)$$
satisfies $\mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}[r(\theta)]=R(\theta)$. Note that the notation for the last expectation is cumbersome. From now on, we will write $S=[(X_1,Y_1),\dots,(X_n,Y_n)]$ and $\mathbb{E}_S$ (for "expectation with respect to the sample") instead of $\mathbb{E}_{(X_1,Y_1),\dots,(X_n,Y_n)}$. In the same way, we will write $\mathbb{P}_S$.

• an estimator is a function
$$\hat{\theta}:\bigcup_{n=1}^{\infty}(\mathcal{X}\times\mathcal{Y})^n\to\Theta.$$
That is, to each possible dataset, of any possible size, it associates a parameter. (It must be such that the restriction of $\hat{\theta}$ to each $(\mathcal{X}\times\mathcal{Y})^n$ is measurable.) For short, we write $\hat{\theta}$ instead of $\hat{\theta}((X_1,Y_1),\dots,(X_n,Y_n))$. The most famous example is the Empirical Risk Minimizer, or ERM:
$$\hat{\theta}_{\mathrm{ERM}}=\operatorname*{argmin}_{\theta\in\Theta}\, r(\theta)$$
(with a convention in case of a tie).
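To make this notation concrete, here is a minimal sketch in Python (my own illustration, not part of the paper: the data, the grid of candidate parameters and all values are made up) of the empirical risk and the ERM with the 0-1 loss, for the linear classifiers $f_\theta(x)=\mathbf{1}(\theta^Tx\ge 0)$ over a finite set of candidates:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))            # observations X_i in R^2
theta_star = np.array([1.0, -0.5])
y = (X @ theta_star >= 0).astype(int)         # labels Y_i = f_{theta_star}(X_i)

def emp_risk(theta):
    """r(theta): average 0-1 loss of f_theta on the sample."""
    preds = (X @ theta >= 0).astype(int)
    return np.mean(preds != y)

# finite Theta: candidate directions on the unit circle
thetas = [np.array([np.cos(t), np.sin(t)]) for t in np.linspace(0, 2 * np.pi, 100)]
risks = [emp_risk(th) for th in thetas]
theta_erm = thetas[int(np.argmin(risks))]     # the ERM (ties: first argmin)
print(theta_erm, min(risks))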

1.1.2 PAC bounds

Of course, our objective is to minimize $R$, not $r$. So the ERM strategy is motivated by the hope that these two functions are not so different, so that the minimizer of $r$ almost minimizes $R$. In what remains of this section, we will check to what extent this is true. By doing so, we will introduce some tools that will be useful when we come to PAC-Bayes bounds.

The first of these tools is a classical result that will be useful throughout this tutorial.

Lemma 1.1 (Hoeffding's inequality) Let $U_1,\dots,U_n$ be independent random variables taking values in an interval $[a,b]$. Then, for any $t>0$,
$$\mathbb{E}\left[e^{t\sum_{i=1}^n[U_i-\mathbb{E}(U_i)]}\right]\le e^{\frac{nt^2(b-a)^2}{8}}.$$

The proof can be found for example in Chapter 2 of [37], which is a highly recommended read, but it is so classical that you can just as well find it on Wikipedia.

Fix $\theta\in\Theta$ and apply Hoeffding's inequality with $U_i=\mathbb{E}[\ell_i(\theta)]-\ell_i(\theta)$ to get:
$$\mathbb{E}_S\left[e^{tn[R(\theta)-r(\theta)]}\right]\le e^{\frac{nt^2C^2}{8}}.\tag{1.1}$$
Now, for any $s>0$,
$$\mathbb{P}_S\left(R(\theta)-r(\theta)>s\right)=\mathbb{P}_S\left(e^{nt[R(\theta)-r(\theta)]}>e^{nts}\right)\le\frac{\mathbb{E}_S\left[e^{nt[R(\theta)-r(\theta)]}\right]}{e^{nts}}\le e^{\frac{nt^2C^2}{8}-nts},$$
where the first inequality is Markov's inequality and the second follows from (1.1). In other words,
$$\mathbb{P}_S\left(R(\theta)>r(\theta)+s\right)\le e^{\frac{nt^2C^2}{8}-nts}.$$
We can make this bound as tight as possible by optimizing our choice of $t$. Indeed, note that $nt^2C^2/8-nts$ is minimized for $t=4s/C^2$, which leads to
$$\mathbb{P}_S\left(R(\theta)>r(\theta)+s\right)\le e^{-\frac{2ns^2}{C^2}}.\tag{1.2}$$

This means that, for a given $\theta$, the risk $R(\theta)$ cannot be much larger than the corresponding empirical risk $r(\theta)$. The order of this "much larger" can be better understood by introducing
$$\varepsilon=e^{-\frac{2ns^2}{C^2}}$$
and substituting $\varepsilon$ for $s$ in (1.2), which gives:
$$\mathbb{P}_S\left(R(\theta)>r(\theta)+C\sqrt{\frac{\log\frac{1}{\varepsilon}}{2n}}\right)\le\varepsilon.\tag{1.3}$$

We see that $R(\theta)$ will usually not exceed $r(\theta)$ by more than a term of order $1/\sqrt{n}$. This is not enough, though, to justify the use of the ERM. Indeed, (1.3) is only true for the $\theta$ that was fixed above, and we cannot apply it to $\hat{\theta}_{\mathrm{ERM}}$, which is a function of the data. In order to study $R(\hat{\theta}_{\mathrm{ERM}})$, we can use
$$R(\hat{\theta}_{\mathrm{ERM}})-r(\hat{\theta}_{\mathrm{ERM}})\le\sup_{\theta\in\Theta}[R(\theta)-r(\theta)],\tag{1.4}$$
so we need a version of (1.3) that holds uniformly over $\Theta$.

Let us now assume, until the end of Subsection 1.1, that the set $\Theta$ is finite, that is, $\mathrm{card}(\Theta)=M<+\infty$. Then:
$$\mathbb{P}_S\left(\sup_{\theta\in\Theta}[R(\theta)-r(\theta)]>s\right)=\mathbb{P}_S\left(\bigcup_{\theta\in\Theta}\left\{R(\theta)-r(\theta)>s\right\}\right)\le\sum_{\theta\in\Theta}\mathbb{P}_S\left(R(\theta)>r(\theta)+s\right)\le Me^{-\frac{2ns^2}{C^2}}\tag{1.5}$$
thanks to (1.2). This time, put
$$\varepsilon=Me^{-\frac{2ns^2}{C^2}}$$
and plug it into (1.5); this gives:
$$\mathbb{P}_S\left(\sup_{\theta\in\Theta}[R(\theta)-r(\theta)]>C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\le\varepsilon.$$
Let us state this conclusion as a theorem (focusing on the complementary event).

Theorem 1.2 Assume that $\mathrm{card}(\Theta)=M<+\infty$. For any $\varepsilon\in(0,1)$,
$$\mathbb{P}_S\left(\forall\theta\in\Theta,\ R(\theta)\le r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon.$$

This result indeed motivates the introduction of $\hat{\theta}_{\mathrm{ERM}}$. Indeed, using (1.4), with probability at least $1-\varepsilon$ we have
$$R(\hat{\theta}_{\mathrm{ERM}})\le r(\hat{\theta}_{\mathrm{ERM}})+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}=\inf_{\theta\in\Theta}r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}},$$
so the ERM satisfies:
$$\mathbb{P}_S\left(R(\hat{\theta}_{\mathrm{ERM}})\le\inf_{\theta\in\Theta}r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon.$$
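As a quick numerical sanity check (my own, with made-up values of $M$, $n$ and $\varepsilon$), the excess term $C\sqrt{\log(M/\varepsilon)/(2n)}$ of Theorem 1.2 is straightforward to evaluate:

import math

def pac_excess(M, n, C=1.0, eps=0.05):
    """Excess term C * sqrt(log(M / eps) / (2 n)) in Theorem 1.2."""
    return C * math.sqrt(math.log(M / eps) / (2 * n))

# e.g. 100 predictors, 1000 observations, 95% confidence:
print(pac_excess(M=100, n=1000))  # about 0.062

With $M=100$, $n=1000$ and $\varepsilon=0.05$, the excess term is about 0.062; we will meet this example again in Section 3.1.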

Such a bound is usually called a PAC bound, that is, a Probably Approximately Correct bound. The reason for this terminology, introduced by Valiant in [178], is as follows: Valiant was considering the case where there is a $\theta_0\in\Theta$ such that $Y_i=f_{\theta_0}(X_i)$ holds almost surely. This means that $R(\theta_0)=0$ and $r(\theta_0)=0$, and so
$$\mathbb{P}_S\left(R(\hat{\theta}_{\mathrm{ERM}})\le C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon,$$
which means that with large Probability, $R(\hat{\theta}_{\mathrm{ERM}})$ is Approximately equal to the Correct value, that is, 0. Note, however, that this is only meaningful if $\log(M)/n$ is small, that is, if $M$ is not larger than $\exp(n)$. This $\log(M)$ in the bound is the price to pay to learn which of the $M$ predictors is the best.

Remark 1.1 The proof of Theorem 1.2 used, in addition to Hoeffding's inequality, two tricks that we will reuse many times in this tutorial:

• given a random variable $U$ and $s\in\mathbb{R}$, for any $t>0$,
$$\mathbb{P}(U>s)=\mathbb{P}\left(e^{tU}>e^{ts}\right)\le\frac{\mathbb{E}\left[e^{tU}\right]}{e^{ts}}$$
thanks to Markov's inequality. The combo "exponential + Markov's inequality" is known as the Chernoff bound. The Chernoff bound is of course very useful together with exponential inequalities like Hoeffding's inequality.

• given a finite number of random variables $U_1,\dots,U_M$,
$$\mathbb{P}\left(\sup_{1\le i\le M}U_i>s\right)=\mathbb{P}\left(\bigcup_{1\le i\le M}\left\{U_i>s\right\}\right)\le\sum_{i=1}^M\mathbb{P}(U_i>s).$$
This argument is called the union bound argument.

The next step in the study of the ERM would be to go beyond finite sets $\Theta$. The union bound argument has to be modified in this case, and things become a little more complicated. We will therefore stop the study of the ERM here: it is not our objective anyway.

If the reader is interested in the study of the ERM in general: Vapnik and Chervonenkis developed the theoretical tools for this study in 1969/1970; this is for example developed by Vapnik in [180]. The book [64] is a beautiful and very pedagogical introduction to machine learning theory, and Chapters 11 and 12 in particular are dedicated to Vapnik and Chervonenkis theory.

1.2 What are PAC-Bayes bounds?

I am now in a better position to explain what PAC-Bayes bounds are. A simple way to phrase things: PAC-Bayes bounds are a generalization of the union bound argument that allows one to deal with any parameter set $\Theta$: finite or infinite, continuous... However, a byproduct of this technique is that we will have to change the notion of estimator.

Definition 1.1 Let $\mathcal{P}(\Theta)$ be the set of all probability distributions on $(\Theta,\mathcal{T})$. A data-dependent probability measure is a function
$$\hat{\rho}:\bigcup_{n=1}^{\infty}(\mathcal{X}\times\mathcal{Y})^n\to\mathcal{P}(\Theta)$$
with a suitable measurability condition¹. We will write $\hat{\rho}$ instead of $\hat{\rho}((X_1,Y_1),\dots,(X_n,Y_n))$ for short.

In practice, when you have a data-dependent probability measure and you want to build a predictor, you can:

• draw a random parameter $\tilde{\theta}\sim\hat{\rho}$; we will call this procedure a "randomized estimator".

• use it to average predictors, that is, define a new predictor
$$f_{\hat{\rho}}(\cdot)=\mathbb{E}_{\theta\sim\hat{\rho}}[f_\theta(\cdot)]$$
called the aggregated predictor with weights $\hat{\rho}$.

So, with PAC-Bayes bounds, we will extend the union bound argument² to infinite, uncountable sets $\Theta$, but we will obtain bounds on various risks related to data-dependent probability measures, that is:

• the risk of a randomized estimator, $R(\tilde{\theta})$,

• or the average risk of randomized estimators, $\mathbb{E}_{\theta\sim\hat{\rho}}[R(\theta)]$,

• or the risk of the aggregated estimator, $R(f_{\hat{\rho}})$.

You will of course ask the question: if $\Theta$ is infinite, what becomes of the $\log(M)$ term in Theorem 1.2, which came from the union bound? In general, this term will be replaced by the Kullback-Leibler divergence between $\hat{\rho}$ and a fixed $\pi$ on $\Theta$.

Definition 1.2 Given two probability measures $\mu$ and $\nu$ in $\mathcal{P}(\Theta)$, the Kullback-Leibler (or simply KL) divergence between $\mu$ and $\nu$ is
$$KL(\mu\|\nu)=\int\log\frac{d\mu}{d\nu}(\theta)\,\mu(d\theta)\in[0,+\infty]$$
if $\mu$ has a density $\frac{d\mu}{d\nu}$ with respect to $\nu$, and $KL(\mu\|\nu)=+\infty$ otherwise.

Example 1.1 For example, if $\Theta$ is finite,
$$KL(\mu\|\nu)=\sum_{\theta\in\Theta}\log\frac{\mu(\theta)}{\nu(\theta)}\,\mu(\theta).$$

¹ I don't want to scare the reader with measurability conditions, as I will not check them in this tutorial anyway. Here, the exact condition to ensure that what follows is well defined is that, for any $A\in\mathcal{T}$, the function
$$((x_1,y_1),\dots,(x_n,y_n))\mapsto[\hat{\rho}((x_1,y_1),\dots,(x_n,y_n))](A)$$
is measurable. That is, $\hat{\rho}$ is a regular conditional probability.

² See the title of van Erven's tutorial [179]: "PAC-Bayes mini-tutorial: a continuous union bound". Note, however, that it is argued by Catoni in [43] that PAC-Bayes bounds are actually more than that; we will come back to this in Section 4.

The following result is well known. You can prove it using Jensen's inequality, or use Wikipedia again.

Proposition 1.3 For any probability measures $\mu$ and $\nu$, $KL(\mu\|\nu)\ge 0$, with equality if and only if $\mu=\nu$.
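Here is a minimal numerical sketch (mine, not from the paper) of the KL divergence on a finite $\Theta$ as in Example 1.1, with the usual convention $0\log(0/q)=0$; it also illustrates Proposition 1.3:

import math

def kl(mu, nu):
    """KL(mu || nu) for two probability vectors on a finite set."""
    out = 0.0
    for m, v in zip(mu, nu):
        if m > 0.0:
            if v == 0.0:
                return math.inf  # mu not absolutely continuous w.r.t. nu
            out += m * math.log(m / v)
    return out

print(kl([0.5, 0.5], [0.5, 0.5]))  # 0.0: equality iff mu == nu
print(kl([0.9, 0.1], [0.5, 0.5]))  # > 0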

1.3 Why this tutorial?

Since the "PAC analysis of a Bayesian estimator" by Shawe-Taylor and Williamson [166] and the first PAC-Bayes bounds proven by McAllester [127, 128], many new PAC-Bayes bounds have appeared (we will see that many of them can be derived from Seeger's bound [161]). These bounds were used in various contexts, to solve a wide range of problems. This led to hundreds of (beautiful!) papers. The consequence is that it is quite difficult to be aware of all the existing work on PAC-Bayes bounds.

As a reviewer for ICML or NeurIPS, I very often had to reject papers because these papers were re-proving already known results, or because they proposed bounds that were weaker than existing ones³. In particular, it seems that many powerful techniques in Catoni's book [43] are still ignored by the community (some were already introduced in earlier works [41, 42]).

On the other hand, it's not always easy to get started with PAC-Bayes bounds. I realize that most papers already assume some basic knowledge of these bounds, and that a monograph like [43] is quite technical to begin with. When an MSc or PhD student asks me for an easy-to-follow introduction to PAC-Bayes, I am never sure what to answer, and usually end up improvising such an introduction for one or two hours, with a piece of chalk and a blackboard. So it came to me recently⁴ that it might be useful to write a beginner-friendly tutorial that I could send instead!

Note that there are already short tutorials on PAC-Bayes bounds, by McAllester and van Erven: [130, 179]. They are very good, and I recommend the reader check them out. However, they focus on empirical bounds only. There are also surveys on PAC-Bayes bounds, such as Section 5 in [54] or [84]. These papers are very useful for navigating the ocean of publications on PAC-Bayes bounds, and they helped me a lot when I was writing this document. Finally, in order to highlight the main ideas, I will not necessarily try to present the bounds with the tightest possible constants. In particular, many oracle bounds and localized bounds in Section 4 were introduced in [41, 43] with better constants. Thus I strongly recommend reading [43] after this tutorial, as well as the more recent papers mentioned below.

³ I might have made such mistakes myself, and I apologize if that is the case.

⁴ I must confess that I started a first version of this document after two introductory talks at A. Tsybakov's statistics seminar at ENSAE in September-October 2008. Then I got other things to do and I forgot about it. I taught online learning and PAC-Bayes bounds at ENSAE between 2014 and 2019, which made me think about it again. When I joined Emti Khan's group in 2019, I started to think again about such a document, to share it with the members of the group who were willing to learn about PAC-Bayes. Of course, the contents of the document had to be different, because of the enormous number of very exciting papers published in the meantime. I finally started again from scratch in early 2021.

1.4 Two types of PAC bounds, organization of these notes

It is important to make a distinction between two types of PAC bounds.

Theorem 1.2 is usually referred to as an empirical bound. It means that, for any $\theta$, $R(\theta)$ is upper bounded by an empirical quantity, that is, by something that we can compute when we observe the data. This allows one to define the ERM as the minimizer of this bound. It also provides a numerical certificate on the generalization error of the ERM. You really end up with something like
$$\mathbb{P}_S\left(R(\hat{\theta}_{\mathrm{ERM}})\le 0.12\right)\ge 0.99.$$

However, a numerical certificate on the generalization error does not tell you one thing: can this 0.12 be improved using a larger sample size? Or is it the best that can be done with our set of predictors? The right tools to answer these questions are oracle PAC bounds. In these bounds, you have a control of the form
$$\mathbb{P}_S\left(R(\hat{\theta}_{\mathrm{ERM}})\le\inf_{\theta\in\Theta}R(\theta)+r_n(\varepsilon)\right)\ge 1-\varepsilon,$$
where the remainder $r_n(\varepsilon)\to 0$ as fast as possible when $n\to\infty$. Of course, the upper bound on $R(\hat{\theta}_{\mathrm{ERM}})$ cannot be computed, because we don't know the function $R$, so it doesn't lead to a numerical certificate. Still, these bounds are very interesting, because they tell you how close you can expect $R(\hat{\theta}_{\mathrm{ERM}})$ to be to the smallest possible value of $R$.

In the same way, there are empirical PAC-Bayes bounds and oracle PAC-Bayes bounds. The very first PAC-Bayes bounds, by McAllester [127, 128], were empirical bounds. Later, Catoni [41, 42, 43] proved the first oracle PAC-Bayes bounds.

In some sense, empirical PAC-Bayes bounds are more useful in practice, and oracle PAC-Bayes bounds are theoretical objects. But this might be an oversimplification. I will show that empirical bounds are tools used to prove some oracle bounds, so they are also useful in theory. On the other hand, when we design a data-dependent probability measure, we don't know whether it will lead to large or small empirical bounds. A preliminary study of its theoretical properties through an oracle bound is the best way to ensure that it is efficient, and thus that it has a chance to lead to small empirical bounds.

In Section 2, we will study an example of an empirical PAC-Bayes bound, essentially taken from a preprint by Catoni [41]. We will prove it together, play with it, and modify it in many ways. In Section 3, I provide various empirical PAC-Bayes bounds, and explain the race to tighter bounds. This race led to bounds that are tight enough to provide good generalization bounds for deep learning; we will discuss this based on Dziugaite and Roy's paper [67] and a more recent work by Pérez-Ortiz, Rivasplata, Shawe-Taylor and Szepesvári [147].

In Section 4, we will turn to oracle PAC-Bayes bounds. I will explain how to derive these bounds from empirical bounds, and apply them to some classical sets of predictors. We will examine the assumptions leading to fast rates in these inequalities.

Section 5 will be devoted to the various attempts to extend PAC-Bayes bounds beyond the setting introduced in this introduction, that is: bounded loss and i.i.d. observations. Finally, in Section 6, I will briefly discuss the connection between PAC-Bayes bounds and many other approaches in machine learning and statistics, including the recent Mutual Information (MI) bounds.

2 First step in the PAC-Bayes world

As mentioned above, there are many PAC-Bayes bounds. In this section, I will start with a bound which is essentially due to Catoni in the preprint [41] (the same technique was used in the monograph [43], but with some modifications). Why this choice?

Well, any choice is partly arbitrary: I did my PhD thesis [1] with Olivier Catoni and thus I know his works well. So it's convenient for me. But also, at first, I don't want to provide the best bound here. I want to show how PAC-Bayes bounds work, how to use them, and explain the different variants (bounds on randomized estimators, bounds on aggregated estimators, etc.). It turns out that Catoni's technique is extremely convenient for proving almost all the various types of bounds with a single proof. Later, in Section 3, I will present alternative empirical PAC-Bayes bounds; this will allow you to compare them, and find the pros and cons of each.

2.1 A simple PAC-Bayes bound

2.1.1 Catoni’s bound [41]

From now on, and until the end of these notes, let us fix a probability measure $\pi\in\mathcal{P}(\Theta)$. The measure $\pi$ will be called the prior, because of a connection with Bayesian statistics that will be discussed in Section 6.

Theorem 2.1 For any $\lambda>0$, for any $\varepsilon\in(0,1)$,
$$\mathbb{P}_S\left(\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon.$$

Let us prove Theorem 2.1. The proof requires a lemma that will be extremely useful throughout these notes. This lemma has been known since Kullback [104] in the case of a finite $\Theta$, but the general case is due to Donsker and Varadhan [65].

Lemma 2.2 (Donsker and Varadhan's variational formula) For any measurable, bounded function $h:\Theta\to\mathbb{R}$, we have:
$$\log\mathbb{E}_{\theta\sim\pi}\left[e^{h(\theta)}\right]=\sup_{\rho\in\mathcal{P}(\Theta)}\left[\mathbb{E}_{\theta\sim\rho}[h(\theta)]-KL(\rho\|\pi)\right].$$
Moreover, the supremum with respect to $\rho$ in the right-hand side is reached for the Gibbs measure $\pi_h$, defined by its density with respect to $\pi$:
$$\frac{d\pi_h}{d\pi}(\theta)=\frac{e^{h(\theta)}}{\mathbb{E}_{\vartheta\sim\pi}\left[e^{h(\vartheta)}\right]}.\tag{2.1}$$

Proof of Lemma 2.2. Using the definition, just check that, for any $\rho\in\mathcal{P}(\Theta)$,
$$KL(\rho\|\pi_h)=-\mathbb{E}_{\theta\sim\rho}[h(\theta)]+KL(\rho\|\pi)+\log\mathbb{E}_{\theta\sim\pi}\left[e^{h(\theta)}\right].$$
Thanks to Proposition 1.3, the left-hand side is nonnegative, and equal to 0 only when $\rho=\pi_h$. □

Proof of Theorem 2.1. The beginning of the proof follows closely the study of the ERM and the proof of Theorem 1.2. Fix $\theta\in\Theta$ and apply Hoeffding's inequality with $U_i=\mathbb{E}[\ell_i(\theta)]-\ell_i(\theta)$: for any $t>0$,
$$\mathbb{E}_S\left[e^{tn[R(\theta)-r(\theta)]}\right]\le e^{\frac{nt^2C^2}{8}}.$$
We put $t=\lambda/n$, which gives:
$$\mathbb{E}_S\left[e^{\lambda[R(\theta)-r(\theta)]}\right]\le e^{\frac{\lambda^2C^2}{8n}}.$$
This is where the proof diverges from the proof of Theorem 1.2. We will now integrate this bound with respect to $\pi$:
$$\mathbb{E}_{\theta\sim\pi}\,\mathbb{E}_S\left[e^{\lambda[R(\theta)-r(\theta)]}\right]\le e^{\frac{\lambda^2C^2}{8n}}.$$
Thanks to Fubini's theorem, we can exchange the integration with respect to $\theta$ and the one with respect to the sample:
$$\mathbb{E}_S\,\mathbb{E}_{\theta\sim\pi}\left[e^{\lambda[R(\theta)-r(\theta)]}\right]\le e^{\frac{\lambda^2C^2}{8n}}\tag{2.2}$$
and we apply Donsker and Varadhan's variational formula (Lemma 2.2) to get:
$$\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)}\right]\le e^{\frac{\lambda^2C^2}{8n}}.$$
Rearranging terms:
$$\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]\le 1.\tag{2.3}$$
The end of the proof uses the Chernoff bound. Fix $s>0$:
$$\mathbb{P}_S\left[\sup_{\rho\in\mathcal{P}(\Theta)}\lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}>s\right]\le\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]e^{-s}\le e^{-s}.$$
Solve $e^{-s}=\varepsilon$, that is, put $s=\log(1/\varepsilon)$, to get
$$\mathbb{P}_S\left[\sup_{\rho\in\mathcal{P}(\Theta)}\lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}>\log\frac{1}{\varepsilon}\right]\le\varepsilon.$$
Rearranging terms gives:
$$\mathbb{P}_S\left[\exists\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]>\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right]\le\varepsilon.$$
Take the complement to end the proof. □

2.1.2 Exact minimization of the bound

We recall that the bound in Theorem 1.2,
$$\mathbb{P}_S\left(\forall\theta\in\Theta,\ R(\theta)\le r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon,$$
motivated the introduction of $\hat{\theta}_{\mathrm{ERM}}$, the minimizer of $r$.

Exactly in the same way, the bound in Theorem 2.1,
$$\mathbb{P}_S\left(\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon,$$
motivates the study of a data-dependent probability measure $\hat{\rho}_\lambda$ defined as:
$$\hat{\rho}_\lambda=\operatorname*{argmin}_{\rho\in\mathcal{P}(\Theta)}\ \mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{KL(\rho\|\pi)}{\lambda}.$$

But does such a minimizer exist? It turns out that the answer is yes, thanks to Donsker and Varadhan's variational formula again! Indeed, minimizing
$$\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{KL(\rho\|\pi)}{\lambda}$$
is equivalent to maximizing
$$-\lambda\mathbb{E}_{\theta\sim\rho}[r(\theta)]-KL(\rho\|\pi),$$
which is exactly what the variational formula handles, with $h(\theta)=-\lambda r(\theta)$. We know that the minimum is reached for $\rho=\pi_{-\lambda r}$ as defined in (2.1). Let us summarize this in the following definition and corollary.

Definition 2.1 In the whole tutorial, we will let $\hat{\rho}_\lambda$ denote "the Gibbs posterior" given by $\hat{\rho}_\lambda=\pi_{-\lambda r}$, that is:
$$\hat{\rho}_\lambda(d\theta)=\frac{e^{-\lambda r(\theta)}\pi(d\theta)}{\mathbb{E}_{\vartheta\sim\pi}\left[e^{-\lambda r(\vartheta)}\right]}.\tag{2.4}$$

Corollary 2.3 The Gibbs posterior is the minimizer of the right-hand side of Theorem 2.1:
$$\hat{\rho}_\lambda=\operatorname*{argmin}_{\rho\in\mathcal{P}(\Theta)}\ \mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{KL(\rho\|\pi)}{\lambda}.$$
As a consequence, for any $\lambda>0$, for any $\varepsilon\in(0,1)$,
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right\}\right)\ge 1-\varepsilon.$$

2.1.3 Some examples, and non-exact minimization of the bound

When you see something like
$$\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda},$$
I'm not sure you immediately see what the order of magnitude of the bound is. I don't. In general, when you apply such a general bound to a set of predictors, I think it is quite important to make the bound more explicit. Let us work through a few examples (I advise you to do the calculations on your own, in these examples and in others).

Example 2.1 (Finite case) Let us start with the special case where $\Theta$ is a finite set, that is, $\mathrm{card}(\Theta)=M<+\infty$. We begin with the application of Corollary 2.3. In this case, the Gibbs posterior $\hat{\rho}_\lambda$ of (2.4) is a probability distribution on the finite set $\Theta$, given by
$$\hat{\rho}_\lambda(\theta)=\frac{e^{-\lambda r(\theta)}\pi(\theta)}{\sum_{\vartheta\in\Theta}e^{-\lambda r(\vartheta)}\pi(\vartheta)},$$
and we have, with probability at least $1-\varepsilon$:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right\}.\tag{2.5}$$
As the bound holds for all $\rho\in\mathcal{P}(\Theta)$, it holds in particular for all $\rho$ in the set of Dirac masses $\{\delta_\theta,\ \theta\in\Theta\}$. Obviously,
$$\mathbb{E}_{\vartheta\sim\delta_\theta}[r(\vartheta)]=r(\theta)$$
and
$$KL(\delta_\theta\|\pi)=\sum_{\vartheta\in\Theta}\log\frac{\delta_\theta(\vartheta)}{\pi(\vartheta)}\,\delta_\theta(\vartheta)=\log\frac{1}{\pi(\theta)},$$
so the bound becomes:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\theta\in\Theta}\left[r(\theta)+\frac{\lambda C^2}{8n}+\frac{\log\frac{1}{\pi(\theta)}+\log\frac{1}{\varepsilon}}{\lambda}\right]\right)\ge 1-\varepsilon,\tag{2.6}$$

with the convention $\log(1/0)=+\infty$. This gives us an intuition on the role of the measure $\pi$: the bound will be tighter for $\theta$'s such that $\pi(\theta)$ is large. However, $\pi$ cannot be large everywhere: it is a probability distribution, so it must satisfy
$$\sum_{\theta\in\Theta}\pi(\theta)=1.$$
The larger the set $\Theta$, the more this total mass of 1 will be spread out, which will lead to large values of $\log(1/\pi(\theta))$.

If $\pi$ is the uniform probability distribution, then $\log(1/\pi(\theta))=\log(M)$, and the bound becomes:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\theta\in\Theta}r(\theta)+\frac{\lambda C^2}{8n}+\frac{\log\frac{M}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon.$$
The choice $\lambda=\sqrt{8n\log(M/\varepsilon)/C^2}$ actually minimizes the right-hand side; this gives:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\theta\in\Theta}r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon.\tag{2.7}$$
That is, the Gibbs posterior $\hat{\rho}_\lambda$ satisfies the same bound as the ERM in Theorem 1.2. Note that the optimization with respect to $\lambda$ is a little more problematic when $\pi$ is not uniform, because the optimal $\lambda$ would depend on $\theta$. We will come back to the choice of $\lambda$ in the general case soon.

Let us also consider the statement of Theorem 2.1 in this case. With probability at least $1-\varepsilon$, we have:
$$\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}.$$
Let us apply this bound to any $\rho$ in the set of Dirac masses $\{\delta_\theta,\ \theta\in\Theta\}$. This gives:
$$\forall\theta\in\Theta,\ R(\theta)\le r(\theta)+\frac{\lambda C^2}{8n}+\frac{\log\frac{1}{\pi(\theta)}+\log\frac{1}{\varepsilon}}{\lambda}$$
and, when $\pi$ is uniform:
$$\forall\theta\in\Theta,\ R(\theta)\le r(\theta)+\frac{\lambda C^2}{8n}+\frac{\log\frac{M}{\varepsilon}}{\lambda}.$$

As this bound holds for any $\theta$, it holds in particular for the ERM, which gives:
$$R(\hat{\theta}_{\mathrm{ERM}})\le\inf_{\theta\in\Theta}r(\theta)+\frac{\lambda C^2}{8n}+\frac{\log\frac{M}{\varepsilon}}{\lambda}$$
and, once again with the choice $\lambda=\sqrt{8n\log(M/\varepsilon)/C^2}$, we recover exactly the result of Theorem 1.2:
$$R(\hat{\theta}_{\mathrm{ERM}})\le\inf_{\theta\in\Theta}r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}.$$
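For intuition, here is a small sketch (my own illustration, with made-up risks) of the Gibbs posterior of Example 2.1 on a finite $\Theta$, computed with a log-sum-exp stabilization. As $\lambda\to 0$ it returns the prior, and as $\lambda$ grows it concentrates on the empirical risk minimizers:

import numpy as np

def gibbs_posterior(emp_risks, prior, lam):
    """rho_lambda(theta) proportional to exp(-lam * r(theta)) * pi(theta)."""
    logw = -lam * np.asarray(emp_risks) + np.log(np.asarray(prior))
    logw -= logw.max()                      # log-sum-exp stabilization
    w = np.exp(logw)
    return w / w.sum()

r = np.array([0.30, 0.25, 0.40])            # empirical risks of 3 predictors
pi = np.ones(3) / 3                         # uniform prior
print(gibbs_posterior(r, pi, lam=0.01))     # close to the prior
print(gibbs_posterior(r, pi, lam=200.0))    # concentrates on the second predictor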

The previous example leads to important remarks:

• PAC-Bayes bounds can be used to prove generalization bounds for Gibbs posteriors, but sometimes they can also be used to study more classical estimators, like the ERM. Many of the recent papers by Catoni and co-authors study robust non-Bayesian estimators thanks to sophisticated PAC-Bayes bounds [45].

• the choice of $\lambda$ has a different status when you study the Gibbs posterior $\hat{\rho}_\lambda$ and the ERM. Indeed, in the bound on the ERM, $\lambda$ is chosen to minimize the bound, but the estimation procedure is not affected by $\lambda$. The bound for the Gibbs posterior is also minimized with respect to $\lambda$, but $\hat{\rho}_\lambda$ depends on $\lambda$. So, if you make a mistake when choosing $\lambda$, this will have bad consequences not only on the bound, but also on the practical performance of the method. This also means that if the bound is not tight, it is likely that the $\lambda$ obtained by minimizing the bound will not lead to good performance in practice. (As you will see soon, we present in Section 3 bounds that do not depend on a parameter like $\lambda$.)

Example 2.2 (Lipschitz loss and Gaussian priors) Let us switch to the continuous case, so that we can derive from PAC-Bayes bounds some results that we wouldn't be able to derive from a union bound argument. We consider the case where $\Theta=\mathbb{R}^d$, the function $\theta\mapsto\ell(f_\theta(x),y)$ is $L$-Lipschitz for any $x$ and $y$, and the prior $\pi$ is a centered Gaussian: $\pi=\mathcal{N}(0,\sigma^2I_d)$, where $I_d$ is the $d\times d$ identity matrix.

Let us, as in the previous example, first study the Gibbs posterior, by an application of Corollary 2.3. With probability at least $1-\varepsilon$,
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right\}.$$
Once again, the right-hand side is an infimum over all possible probability distributions $\rho$, but it is easier to restrict it to Gaussian distributions here. So:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\substack{\rho=\mathcal{N}(m,s^2I_d)\\ m\in\mathbb{R}^d,\ s>0}}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right\}.\tag{2.8}$$
Indeed, it is well known that, for $\rho=\mathcal{N}(m,s^2I_d)$ and $\pi=\mathcal{N}(0,\sigma^2I_d)$,
$$KL(\rho\|\pi)=\frac{\|m\|^2}{2\sigma^2}+\frac{d}{2}\left[\frac{s^2}{\sigma^2}+\log\frac{\sigma^2}{s^2}-1\right].$$
Moreover, the empirical risk $r$ inherits the Lipschitz property of the loss, that is, for any $(\theta,\vartheta)\in\Theta^2$, $r(\theta)\le r(\vartheta)+L\|\vartheta-\theta\|$. So, for $\rho=\mathcal{N}(m,s^2I_d)$,
$$\mathbb{E}_{\theta\sim\rho}[r(\theta)]\le r(m)+L\,\mathbb{E}_{\theta\sim\rho}[\|\theta-m\|]\le r(m)+L\sqrt{\mathbb{E}_{\theta\sim\rho}[\|\theta-m\|^2]}=r(m)+Ls\sqrt{d},$$
where the middle inequality is Jensen's inequality.

Plugging this into (2.8) gives:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{m\in\mathbb{R}^d,\ s>0}\left\{r(m)+Ls\sqrt{d}+\frac{\lambda C^2}{8n}+\frac{\frac{\|m\|^2}{2\sigma^2}+\frac{d}{2}\left[\frac{s^2}{\sigma^2}+\log\frac{\sigma^2}{s^2}-1\right]+\log\frac{1}{\varepsilon}}{\lambda}\right\}.$$
It is possible to minimize the bound completely in $s$, but for now we will just consider the choice $s=\sigma/\sqrt{n}$, which gives:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{m\in\mathbb{R}^d}\left[r(m)+L\sigma\sqrt{\frac{d}{n}}+\frac{\lambda C^2}{8n}+\frac{\frac{\|m\|^2}{2\sigma^2}+\frac{d}{2}\left[\frac{1}{n}-1+\log(n)\right]+\log\frac{1}{\varepsilon}}{\lambda}\right]\le\inf_{m\in\mathbb{R}^d}\left[r(m)+L\sigma\sqrt{\frac{d}{n}}+\frac{\lambda C^2}{8n}+\frac{\frac{\|m\|^2}{2\sigma^2}+\frac{d}{2}\log(n)+\log\frac{1}{\varepsilon}}{\lambda}\right].$$

It is not possible to optimize the bound with respect to $\lambda$, as the optimal value would depend on $m$... However, a way to understand the bound (by making it worse!) is to restrict the infimum over $m$ to $\|m\|\le B$ for some $B>0$. Then we have:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{m:\|m\|\le B}r(m)+L\sigma\sqrt{\frac{d}{n}}+\frac{\lambda C^2}{8n}+\frac{\frac{B^2}{2\sigma^2}+\frac{d}{2}\log(n)+\log\frac{1}{\varepsilon}}{\lambda}.$$
In this case, we see that the optimal $\lambda$ is
$$\lambda=\frac{1}{C}\sqrt{8n\left[\frac{B^2}{2\sigma^2}+\frac{d}{2}\log(n)+\log\frac{1}{\varepsilon}\right]},$$
which gives:
$$\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{m:\|m\|\le B}r(m)+L\sigma\sqrt{\frac{d}{n}}+C\sqrt{\frac{\frac{B^2}{2\sigma^2}+\frac{d}{2}\log(n)+\log\frac{1}{\varepsilon}}{2n}}.$$

Note that our choice of $\lambda$ might look a bit weird, as it depends on the confidence level $\varepsilon$. This can be avoided by taking
$$\lambda=\frac{1}{C}\sqrt{8n\left[\frac{B^2}{2\sigma^2}+\frac{d}{2}\log(n)\right]}$$
instead (check what bound you obtain by doing so!).

Finally, as in the previous example, we can also start from the statement of Theorem 2.1: with probability at least $1-\varepsilon$,
$$\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda},$$
and restrict here $\rho$ to the set of Gaussian distributions $\mathcal{N}(m,s^2I_d)$. This leads to the definition of a new data-dependent probability measure, $\tilde{\rho}_\lambda=\mathcal{N}(\tilde{m},\tilde{s}^2I_d)$, where
$$(\tilde{m},\tilde{s})=\operatorname*{argmin}_{m\in\mathbb{R}^d,\ s>0}\ \mathbb{E}_{\theta\sim\mathcal{N}(m,s^2I_d)}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{\frac{\|m\|^2}{2\sigma^2}+\frac{d}{2}\left[\frac{s^2}{\sigma^2}+\log\frac{\sigma^2}{s^2}-1\right]+\log\frac{1}{\varepsilon}}{\lambda}.$$
While the Gibbs posterior $\hat{\rho}_\lambda$ can be quite a complicated object, one simply has to solve this minimization problem to get $\tilde{\rho}_\lambda$. The probability measure $\tilde{\rho}_\lambda$ is actually a special case of what is called a variational approximation of $\hat{\rho}_\lambda$. Variational approximations are very popular in statistics and machine learning, and were indeed analyzed through PAC-Bayes bounds [9, 8, 190]. We will come back to this in Section 6. For now, following the same computations, and using the same choice of $\lambda$ as for $\hat{\rho}_\lambda$, we obtain the same bound:
$$\mathbb{E}_{\theta\sim\tilde{\rho}_\lambda}[R(\theta)]\le\inf_{m:\|m\|\le B}r(m)+L\sigma\sqrt{\frac{d}{n}}+C\sqrt{\frac{\frac{B^2}{2\sigma^2}+\frac{d}{2}\log(n)+\log\frac{1}{\varepsilon}}{2n}}.$$
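The minimization problem defining $\tilde{\rho}_\lambda$ is easy to set up numerically. Below is a minimal sketch (assumptions mine: a generic emp_risk function supplied by the user, and a Monte Carlo estimate of $\mathbb{E}_{\theta\sim\rho}[r(\theta)]$) of the objective to be minimized over $(m,s)$; any optimizer, for example gradient descent with the reparametrization $\theta=m+s\xi$, $\xi\sim\mathcal{N}(0,I_d)$, can then be used:

import numpy as np

def gaussian_pac_bayes_objective(emp_risk, m, s, sigma, lam, n, C=1.0,
                                 eps=0.05, n_mc=100, rng=None):
    """Objective of Theorem 2.1 for rho = N(m, s^2 I_d), pi = N(0, sigma^2 I_d).
    m is a numpy array of shape (d,); emp_risk(theta) returns r(theta)."""
    rng = rng or np.random.default_rng(0)
    d = m.shape[0]
    # closed-form KL( N(m, s^2 I) || N(0, sigma^2 I) )
    kl = (m @ m) / (2 * sigma**2) \
        + 0.5 * d * (s**2 / sigma**2 + np.log(sigma**2 / s**2) - 1)
    # Monte Carlo estimate of the expected empirical risk under rho
    thetas = m + s * rng.standard_normal((n_mc, d))
    exp_risk = np.mean([emp_risk(t) for t in thetas])
    return exp_risk + lam * C**2 / (8 * n) + (kl + np.log(1 / eps)) / lam

Minimizing this objective over $(m,s)$ returns (an approximation of) the variational posterior $\tilde{\rho}_\lambda$.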

Example 2.3 (Model aggregation, model selection) In the case where we have several sets of predictors, say $\Theta_1,\dots,\Theta_M$, equipped with priors $\pi_1,\dots,\pi_M$ respectively, it is possible to define a prior on $\Theta=\bigcup_{j=1}^M\Theta_j$. For the sake of simplicity, assume that the $\Theta_j$'s are disjoint, and let $p=(p(1),\dots,p(M))$ be a probability distribution on $\{1,\dots,M\}$. We simply put:
$$\pi=\sum_{j=1}^M p(j)\,\pi_j.$$

The minimization of the bound in Theorem 2.1 leads to the Gibbs posterior $\hat{\rho}_\lambda$, which will in general put mass on all the $\Theta_j$, so this is a model aggregation procedure in the spirit of [128]. On the other hand, we can also restrict the minimization in the PAC-Bayes bound to distributions that charge only one of the models, that is, to $\rho\in\mathcal{P}(\Theta_1)\cup\dots\cup\mathcal{P}(\Theta_M)$. Theorem 2.1 becomes:
$$\mathbb{P}_S\left(\forall j\in\{1,\dots,M\},\forall\rho\in\mathcal{P}(\Theta_j),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon,$$
that is,
$$\mathbb{P}_S\left(\forall j\in\{1,\dots,M\},\forall\rho\in\mathcal{P}(\Theta_j),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi_j)+\log\frac{1}{p(j)}+\log\frac{1}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon.$$

Thus, we can propose the following procedure:

• first, we build the Gibbs posterior for each model $j$:
$$\hat{\rho}^{(j)}_\lambda(d\theta)=\frac{e^{-\lambda r(\theta)}\pi_j(d\theta)}{\int_{\Theta_j}e^{-\lambda r(\vartheta)}\pi_j(d\vartheta)},$$

• then, we perform model selection:
$$\hat{j}=\operatorname*{argmin}_{1\le j\le M}\left\{\mathbb{E}_{\theta\sim\hat{\rho}^{(j)}_\lambda}[r(\theta)]+\frac{KL(\hat{\rho}^{(j)}_\lambda\|\pi_j)+\log\frac{1}{p(j)}}{\lambda}\right\}.$$

The obtained $\hat{j}$ satisfies:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}^{(\hat{j})}_\lambda}[R(\theta)]\le\min_{1\le j\le M}\inf_{\rho\in\mathcal{P}(\Theta_j)}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi_j)+\log\frac{1}{p(j)}+\log\frac{1}{\varepsilon}}{\lambda}\right\}\right)\ge 1-\varepsilon.$$
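Here is a sketch of this two-step procedure in the simplest situation where each model $\Theta_j$ is finite (all names and numerical values below are mine, for illustration only):

import numpy as np

def gibbs(r, pi, lam):
    w = np.exp(-lam * np.asarray(r)) * np.asarray(pi)
    return w / w.sum()

def select_model(risks, priors, p, lam):
    """Step 1: Gibbs posterior in each model; step 2: pick the j minimizing the bound."""
    scores = []
    for r_j, pi_j, p_j in zip(risks, priors, p):
        r_j, pi_j = np.asarray(r_j), np.asarray(pi_j)
        rho = gibbs(r_j, pi_j, lam)
        kl = float(np.sum(rho * np.log(rho / pi_j)))   # finite-case KL, rho > 0 here
        scores.append(rho @ r_j + (kl + np.log(1.0 / p_j)) / lam)
    return int(np.argmin(scores))                      # j_hat (0-indexed)

# two models with 2 and 3 predictors, p uniform on {1, 2}:
print(select_model([[0.30, 0.28], [0.25, 0.35, 0.40]],
                   [[0.5, 0.5], [1/3, 1/3, 1/3]],
                   [0.5, 0.5], lam=50.0))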

2.1.4 The choice of λ

As discussed earlier, it is in general not possible to optimize the right-hand side of the PAC-Bayes inequality with respect to $\lambda$. For example, in (2.5), the optimal value of $\lambda$ could depend on $\rho$, which is not allowed by Theorem 2.1. In the previous examples, we have seen that in some situations, if one is lucky enough, the optimal $\lambda$ actually does not depend on $\rho$, but we still need a procedure for the general case.

A natural idea is to propose a finite grid $\Lambda\subset(0,+\infty)$ and to minimize over this grid, which can be justified by a union bound argument.

Theorem 2.4 Let $\Lambda\subset(0,+\infty)$ be a finite set. For any $\varepsilon\in(0,1)$,
$$\mathbb{P}_S\left(\forall\rho\in\mathcal{P}(\Theta),\forall\lambda\in\Lambda,\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon.$$

Proof. Fix $\lambda\in\Lambda$, and then follow the proof of Theorem 2.1 until (2.3):
$$\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]\le 1.$$
Sum over $\lambda\in\Lambda$ to get:
$$\sum_{\lambda\in\Lambda}\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]\le\mathrm{card}(\Lambda)$$
and so
$$\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta),\lambda\in\Lambda}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]\le\mathrm{card}(\Lambda).$$
The end of the proof is as for Theorem 2.1; we start with the Chernoff bound. Fix $s>0$:
$$\mathbb{P}_S\left[\sup_{\rho\in\mathcal{P}(\Theta),\lambda\in\Lambda}\lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}>s\right]\le\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta),\lambda\in\Lambda}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]e^{-s}\le\mathrm{card}(\Lambda)e^{-s}.$$
Solve $\mathrm{card}(\Lambda)e^{-s}=\varepsilon$, that is, put $s=\log(\mathrm{card}(\Lambda)/\varepsilon)$, to get
$$\mathbb{P}_S\left[\sup_{\rho\in\mathcal{P}(\Theta),\lambda\in\Lambda}\lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}>\log\frac{\mathrm{card}(\Lambda)}{\varepsilon}\right]\le\varepsilon.$$
Rearranging terms gives:
$$\mathbb{P}_S\left[\exists\rho\in\mathcal{P}(\Theta),\exists\lambda\in\Lambda,\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]>\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda}\right]\le\varepsilon.$$
Take the complement to get the statement of the theorem. □

This leads to the following procedure. First, we recall that, for a fixed $\lambda$, the minimizer of the bound is $\hat{\rho}_\lambda=\pi_{-\lambda r}$. Then, we put $\hat{\rho}=\hat{\rho}_{\hat{\lambda}}$, where
$$\hat{\lambda}=\operatorname*{argmin}_{\lambda\in\Lambda}\left\{\mathbb{E}_{\theta\sim\pi_{-\lambda r}}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\pi_{-\lambda r}\|\pi)+\log\frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda}\right\}.\tag{2.9}$$

We immediately have the following result.

Corollary 2.5 Define $\hat{\rho}$ as in (2.9). For any $\varepsilon\in(0,1)$, we have
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}}[R(\theta)]\le\inf_{\substack{\rho\in\mathcal{P}(\Theta)\\ \lambda\in\Lambda}}\left[\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{\mathrm{card}(\Lambda)}{\varepsilon}}{\lambda}\right]\right)\ge 1-\varepsilon.$$

We could, for example, propose an arithmetic grid $\Lambda=\{1,2,\dots,n\}$. The bound in Corollary 2.5 becomes:
$$\mathbb{E}_{\theta\sim\hat{\rho}}[R(\theta)]\le\inf_{\substack{\rho\in\mathcal{P}(\Theta)\\ \lambda\in\{1,\dots,n\}}}\left[\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{n}{\varepsilon}}{\lambda}\right].$$
It is also possible to replace the optimization over this discrete grid by an optimization over a continuous range. Indeed, for any $\lambda\in[1,n]$, we simply apply the bound to the integer part $\lfloor\lambda\rfloor$ of $\lambda$, and remark that we can upper bound $\lfloor\lambda\rfloor\le\lambda$ and $1/\lfloor\lambda\rfloor\le 1/(\lambda-1)$. So the bound becomes:
$$\mathbb{E}_{\theta\sim\hat{\rho}}[R(\theta)]\le\inf_{\substack{\rho\in\mathcal{P}(\Theta)\\ \lambda\in[1,n]}}\left[\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{n}{\varepsilon}}{\lambda-1}\right].$$


The arithmetic grid is not the best choice, though: the $\log(n)$ term can be improved. In order to optimize hyperparameters in PAC-Bayes bounds, Langford and Caruana [106] used a geometric grid $\Lambda=\{e^k,\ k\in\mathbb{N}\}\cap[1,n]$; the same choice was used later by Catoni [41, 43]. Using such a grid in Corollary 2.5, we get
$$\mathbb{E}_{\theta\sim\hat{\rho}}[R(\theta)]\le\inf_{\substack{\rho\in\mathcal{P}(\Theta)\\ \lambda\in[1,n]}}\left[\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1+\log n}{\varepsilon}}{\lambda/e}\right].$$
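In the finite-$\Theta$ case, the whole procedure (2.9) with the geometric grid takes a few lines; here is a minimal sketch (my own illustration, with made-up risks):

import numpy as np

def select_lambda(r, pi, n, C=1.0, eps=0.05):
    """Minimize the empirical bound of Corollary 2.5 over the grid {e^k} in [1, n]."""
    r, pi = np.asarray(r), np.asarray(pi)
    grid = np.exp(np.arange(int(np.log(n)) + 1))     # geometric grid in [1, n]
    best = None
    for lam in grid:
        rho = np.exp(-lam * r) * pi
        rho /= rho.sum()
        kl = float(np.sum(rho * np.log(rho / pi)))
        bound = rho @ r + lam * C**2 / (8 * n) + (kl + np.log(len(grid) / eps)) / lam
        if best is None or bound < best[0]:
            best = (bound, lam, rho)
    return best  # (bound value, lambda_hat, rho_hat)

print(select_lambda(r=[0.30, 0.25, 0.40], pi=[1/3, 1/3, 1/3], n=1000))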

We conclude this discussion on the choice of $\lambda$ by mentioning that there are other PAC-Bayes bounds, for example McAllester's bound [128], where there is no parameter $\lambda$ to optimize. We will study these bounds in Section 3.

2.2 PAC-Bayes bound on aggregation of predictors

In the introduction, right after Definition 1.1, I promised that PAC-Bayes bounds would allow us to control:

• the risk of randomized predictors,

• the expected risk of randomized predictors,

• the risk of averaged predictors.

But so far, we have only focused on the expected risk of randomized predictors (the second bullet point). In this subsection, we provide some bounds on averaged predictors, and in the next one, we will focus on the risk of randomized predictors.

We start with a very simple remark. When the loss function $u\mapsto\ell(u,y)$ is convex for any $y$, the risk $R(\theta)=R(f_\theta)$ is a convex function of $f_\theta$. Thus, Jensen's inequality ensures:
$$\mathbb{E}_{\theta\sim\rho}[R(f_\theta)]\ge R\left(\mathbb{E}_{\theta\sim\rho}[f_\theta]\right).$$

Plugging this into Corollary 2.3 immediately gives the following result.

Corollary 2.6 Assume that, for all $y\in\mathcal{Y}$, $u\mapsto\ell(u,y)$ is convex. Define
$$\hat{f}_{\hat{\rho}_\lambda}(\cdot)=\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[f_\theta(\cdot)].$$
For any $\lambda>0$, for any $\varepsilon\in(0,1)$,
$$\mathbb{P}_S\left(R(\hat{f}_{\hat{\rho}_\lambda})\le\inf_{\rho\in\mathcal{P}(\Theta)}\left\{\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}}{\lambda}\right\}\right)\ge 1-\varepsilon.$$

That is, in the case of a convex loss function, like the quadratic loss or the hinge loss, PAC-Bayes bounds also provide bounds on the risk of aggregated predictors.

This is also feasible under other assumptions. For example, we can use the Lipschitz property as in Example 2.2; it can also be done using the margin of the classifier (see some references in Subsection 6.2 below).
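As an illustration of Corollary 2.6, here is a sketch of my own (for regression, where the quadratic loss is indeed convex in its first argument): aggregation is simply a $\rho$-weighted average of the basic predictors:

import numpy as np

def aggregate(predictors, rho):
    """Return f_rho with f_rho(x) = sum_k rho_k * f_k(x)."""
    return lambda x: sum(w * f(x) for w, f in zip(rho, predictors))

fs = [lambda x, a=a: a * x for a in (0.5, 1.0, 2.0)]   # three linear predictors
f_agg = aggregate(fs, np.array([0.2, 0.5, 0.3]))
print(f_agg(1.5))  # 0.2*0.75 + 0.5*1.5 + 0.3*3.0 = 1.8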


2.3 PAC-Bayes bound on a single draw from the posterior

Theorem 2.7 For any $\lambda>0$, for any $\varepsilon\in(0,1)$, for any data-dependent probability measure $\tilde{\rho}$,
$$\mathbb{P}_{S,\,\tilde{\theta}\sim\tilde{\rho}}\left(R(\tilde{\theta})\le r(\tilde{\theta})+\frac{\lambda C^2}{8n}+\frac{\log\frac{d\tilde{\rho}}{d\pi}(\tilde{\theta})+\log\frac{1}{\varepsilon}}{\lambda}\right)\ge 1-\varepsilon.$$

This bound simply says that if you draw $\tilde{\theta}$ from, for example, the Gibbs posterior $\hat{\rho}_\lambda$ (defined in (2.4)), you have a bound on $R(\tilde{\theta})$ that holds with large probability simultaneously over the drawing of the sample and of $\tilde{\theta}$.

Proof. Once again, we follow the proof of Theorem 2.1, until (2.2):
$$\mathbb{E}_S\,\mathbb{E}_{\theta\sim\pi}\left[e^{\lambda[R(\theta)-r(\theta)]}\right]\le e^{\frac{\lambda^2C^2}{8n}}.$$
Now, for any function $h\ge 0$,
$$\mathbb{E}_{\theta\sim\pi}[h(\theta)]=\int h(\theta)\pi(d\theta)\ge\int_{\left\{\frac{d\tilde{\rho}}{d\pi}(\theta)>0\right\}}h(\theta)\pi(d\theta)=\int_{\left\{\frac{d\tilde{\rho}}{d\pi}(\theta)>0\right\}}h(\theta)\frac{d\pi}{d\tilde{\rho}}(\theta)\,\tilde{\rho}(d\theta)=\mathbb{E}_{\theta\sim\tilde{\rho}}\left[h(\theta)e^{-\log\frac{d\tilde{\rho}}{d\pi}(\theta)}\right]$$
and in particular:
$$\mathbb{E}_S\,\mathbb{E}_{\theta\sim\tilde{\rho}}\left[e^{\lambda[R(\theta)-r(\theta)]-\log\frac{d\tilde{\rho}}{d\pi}(\theta)}\right]\le e^{\frac{\lambda^2C^2}{8n}}.$$
I could go through the proof until the end, but I think that you can now guess that it is essentially the Chernoff bound + a rearrangement of the terms. □
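Here is a sketch (finite $\Theta$, my own toy numbers) of the single-draw procedure: draw $\tilde{\theta}$ from the Gibbs posterior and evaluate the right-hand side of Theorem 2.7, where the KL term of the previous bounds is replaced by the log-density ratio at the drawn point:

import numpy as np

rng = np.random.default_rng(0)
r = np.array([0.30, 0.25, 0.40])
pi = np.ones(3) / 3
lam, n, C, eps = 50.0, 1000, 1.0, 0.05
rho = np.exp(-lam * r) * pi
rho /= rho.sum()                              # Gibbs posterior rho_hat
k = rng.choice(len(r), p=rho)                 # tilde_theta drawn from rho_hat
bound = r[k] + lam * C**2 / (8 * n) + (np.log(rho[k] / pi[k]) + np.log(1 / eps)) / lam
print(k, bound)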

2.4 Bound in expectation

We end this section with one more variant of the initial PAC-Bayes bound of Theorem 2.1: a bound in expectation with respect to the sample.

Theorem 2.8 For any $\lambda>0$, for any data-dependent probability measure $\tilde{\rho}$,
$$\mathbb{E}_S\,\mathbb{E}_{\theta\sim\tilde{\rho}}[R(\theta)]\le\mathbb{E}_S\left[\mathbb{E}_{\theta\sim\tilde{\rho}}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\tilde{\rho}\|\pi)}{\lambda}\right].$$
In particular, for $\tilde{\rho}=\hat{\rho}_\lambda$ the Gibbs posterior,
$$\mathbb{E}_S\,\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\mathbb{E}_S\inf_{\rho\in\mathcal{P}(\Theta)}\left[\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\frac{\lambda C^2}{8n}+\frac{KL(\rho\|\pi)}{\lambda}\right].$$

These bounds in expectation are very convenient tools from a pedagogical point of view. Indeed, in Section 4, we will study oracle PAC-Bayes inequalities. While it is possible to derive oracle PAC-Bayes bounds both in expectation and with large probability, the ones in expectation are much simpler to derive, and much shorter. Thus, I will mostly provide PAC-Bayes oracle bounds in expectation in Section 4, and refer the reader to [41, 43] for the corresponding bounds in probability.

Note that, as the bound does not hold with large probability like the previous bounds do, it is no longer a PAC bound in the proper sense: Probably Approximately Correct. Once, I was attending a talk by Tsybakov where he presented some results from his paper with Dalalyan [61] that can also be interpreted as a "PAC-Bayes bound in expectation", and he suggested the more appropriate acronym EAC-Bayes: Expectedly Approximately Correct (their paper is briefly discussed in Subsection 6.4 below). I don't think this term has often been reused since then. I also found recently in [80] the acronym MAC-Bayes: Mean Approximately Correct. In order to avoid any confusion, I will stick to "PAC-Bayes bound in expectation", but I like EAC and MAC!

Proof. Once again, the beginning of the proof is the same as for Theorem 2.1, until (2.3):
$$\mathbb{E}_S\left[e^{\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}}\right]\le 1.$$
This time, use Jensen's inequality to move the expectation with respect to the sample inside the exponential function:
$$e^{\mathbb{E}_S\left[\sup_{\rho\in\mathcal{P}(\Theta)}\ \lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}\right]}\le 1,$$
that is,
$$\mathbb{E}_S\left[\sup_{\rho\in\mathcal{P}(\Theta)}\lambda\mathbb{E}_{\theta\sim\rho}[R(\theta)-r(\theta)]-KL(\rho\|\pi)-\frac{\lambda^2C^2}{8n}\right]\le 0.$$
In particular,
$$\mathbb{E}_S\left[\lambda\mathbb{E}_{\theta\sim\tilde{\rho}}[R(\theta)-r(\theta)]-KL(\tilde{\rho}\|\pi)-\frac{\lambda^2C^2}{8n}\right]\le 0.$$
Rearrange terms. □

2.5 Applications of empirical PAC-Bayes bounds

The original PAC-Bayes bounds were stated for classification [127], and it soon became clear that many results could be extended to any bounded loss, thus covering, for example, bounded regression (we discuss in Section 5 how to get rid of the boundedness assumption). Thus, some papers are written in no specific setting, with a generic loss that can cover classification, regression, or density estimation (this is the case, among others, of Chapter 1 of my PhD thesis [1] and the corresponding paper [2], where I studied a generalization of Catoni's results of [43] to unbounded losses).

However, some empirical PAC-Bayes bounds were also developed for, or applied to, specific models, sometimes taking advantage of some specificities of the model. We mention for example:

• ranking/scoring [150],

• density estimation [91],

• multiple testing [35], which is tackled with related techniques,

• deep learning: even though deep networks are trained for classification or regression, the application of PAC-Bayes bounds to deep learning is not straightforward. We discuss this in Section 3, based on [67] and more recent references.

• unsupervised learning, including clustering [164, 13] and representation learning [140, 139].

Note that this list is non-exhaustive, and that many more applications are presented in Section 4 (more precisely, in Subsection 4.3).

3 Tight and non-vacuous PAC-Bayes bounds

3.1 Why is there a race to the tighter PAC-Bayes bound?

Let us start with a numerical application of the PAC-Bayes bounds we met in Section 2. First, assume we are in the classification setting with the 0-1 loss, so that $C=1$. We are given a small set of classifiers, say $M=100$, and on a sample of size $n=1000$, the best of these classifiers has an empirical risk $r_n=0.26$. Let us use the bound in (2.7), which I recall here:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\inf_{\theta\in\Theta}r(\theta)+C\sqrt{\frac{\log\frac{M}{\varepsilon}}{2n}}\right)\ge 1-\varepsilon.$$
With $\varepsilon=0.05$, this bound is:
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le 0.26+\underbrace{1\cdot\sqrt{\frac{\log\frac{100}{0.05}}{2\times 1000}}}_{\le 0.062}\right)\ge 0.95.$$
So the classification risk using the Gibbs posterior is smaller than 0.322 with probability at least 95%.

Let us now switch to a more problematic example. We consider a very simple binary neural network, given by the following formula for $x\in\mathbb{R}^d$, where $\varphi$ is a nonlinear activation function (e.g. $\varphi(x)=\max(x,0)$):
$$f_w(x)=\mathbf{1}\left[\sum_{i=1}^M w^{(2)}_i\,\varphi\left(\sum_{j=1}^d w^{(1)}_{j,i}x_j\right)\ge 0\right],$$
where the weights $w^{(1)}_{j,i}$ and $w^{(2)}_i$ are all in $\{-1,+1\}$ for $1\le j\le d$ and $1\le i\le M$. Define $\theta=(w^{(1)}_{1,1},w^{(1)}_{1,2},\dots,w^{(1)}_{d,M},w^{(2)}_1,\dots,w^{(2)}_M)$. Note that the set of all possible such networks has cardinality $2^{M(d+1)}$. Consider inputs that are $100\times 100$ greyscale images, that is, $x\in[0,1]^d$ with $d=10{,}000$, and a sample size $n=10{,}000$. With neural networks, it is often the case that a perfect classification of the training sample is possible, that is, there is a $\theta$ such that $r(\theta)=0$.

Even for a moderate number of units such as $M=100$, this leads to the PAC-Bayes bound (with $\varepsilon=0.05$):
$$\mathbb{P}_S\left(\mathbb{E}_{\theta\sim\hat{\rho}_\lambda}[R(\theta)]\le\underbrace{1\cdot\sqrt{\frac{\log\frac{2^{1{,}000{,}100}}{0.05}}{2\times 10{,}000}}}_{\simeq 13.58}\right)\ge 0.95.$$
So the classification risk using the Gibbs posterior is smaller than 13.58 with probability at least 95%. This is not informative at all, because we already know that the classification risk is smaller than 1. Such a bound is usually referred to as a vacuous bound, because it does not bring any information at all. You can try to improve the bound by increasing the dataset size, but you can check that even $n=1{,}000{,}000$ still leads to a vacuous bound with this network.

Various opinions on these vacuous bounds are possible:

• "theory is useless. I don't know why I would care about generalization guarantees; neural networks work in practice." This opinion is lazy: it's just a good excuse not to have to think about generalization guarantees. I will assume that since you are reading this tutorial, this is not your opinion.

• "vacuous bounds are certainly better than no bounds at all!" This opinion is cynical; it can be rephrased as "better to have a theory that doesn't work than no theory at all: at least we can claim we have a theory, and some people might even believe us". But the theory just says nothing.

• "let's get back to work, and improve the bounds". Since the publication of the first PAC-Bayes bounds already mentioned [166, 127, 128], many variants have been proven. One can try to test which one is the best in a given setting, try to improve the priors, try to refine the bound in many ways... In 2017, Dziugaite and Roy [67] obtained non-vacuous (even though not really tight yet) PAC-Bayes bounds for practical neural networks (since then, tighter bounds have been obtained by these authors and by others). This is a remarkable achievement, and it also immediately made PAC-Bayes theory more popular than it had ever been before.

Let's begin this section with a review of some popular PAC-Bayes bounds (Subsection 3.2). We will then explain which bounds, and which improvements, led to tight generalization bounds for deep learning (Subsection 3.3). In particular, we will focus on a very important approach to improving the bounds: data-dependent priors.

3.2 A few PAC-Bayes bounds

Note that the original works on PAC-Bayes focused only on classification with the 0-1 loss. So, for the whole of Subsection 3.2, we assume that $\ell$ is the 0-1 loss function. Remember this means that $R$ and $r$ take values in $[0,1]$ (so $C=1$ in this subsection).

3.2.1 McAllester’s bound [127] and Maurer’s improved bound [126]

As the original paper by McAllester [127] focused on finite or denumerable sets $\Theta$, let us start with the first bound for a general $\Theta$, from [128].

Theorem 3.1 (Theorem 1 in [128]) For any $\varepsilon>0$,
$$\mathbb{P}_S\left[\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\sqrt{\frac{KL(\rho\|\pi)+\log\frac{1}{\varepsilon}+\frac{5}{2}\log(n)+8}{2n-1}}\right]\ge 1-\varepsilon.$$

Compared to Theorem 2.1, note that there is no parameter $\lambda$ here to optimize. On the other hand, one can no longer use Lemma 2.2 to minimize the right-hand side. A way to solve this problem is to make the parameter $\lambda$ appear artificially, using the inequality $\sqrt{ab}\le a\lambda/2+b/(2\lambda)$ for any $\lambda>0$ (here applied to Maurer's improved version of the bound [126], stated in Subsection 3.2.3):
$$\mathbb{P}_S\left[\forall\rho\in\mathcal{P}(\Theta),\ \mathbb{E}_{\theta\sim\rho}[R(\theta)]\le\mathbb{E}_{\theta\sim\rho}[r(\theta)]+\inf_{\lambda>0}\left(\frac{\lambda}{4n}+\frac{KL(\rho\|\pi)+\log\frac{2}{\varepsilon}+\frac{1}{2}\log(n)}{2\lambda}\right)\right]\ge 1-\varepsilon.\tag{3.1}$$

On the other hand, the price to pay for the optimization with respect to $\lambda$ in Theorem 2.4 was a $\log(n)$ term (already present in Maurer's bound) for an arithmetic grid, and a $\log\log(n)$ term when using a geometric grid. So, asymptotically in $n$, Theorem 2.4 with a geometric grid will always lead to better results than Theorem 3.1. On the other hand, the constants in Theorem 3.1 are smaller, so the bound can be better for small sample sizes (a point that should not be neglected for tight certificates in practice!).
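To get a feeling for this comparison, here is a small numerical sketch (mine; the KL value and sample size are made up) of the complexity term of Theorem 3.1 against that of Theorem 2.1 with $\lambda$ optimized over a grid (ignoring, for simplicity, the $\log\mathrm{card}(\Lambda)$ union-bound cost of Theorem 2.4):

import math

def mcallester_term(kl, n, eps=0.05):
    """Complexity term of Theorem 3.1."""
    return math.sqrt((kl + math.log(1 / eps) + 2.5 * math.log(n) + 8) / (2 * n - 1))

def catoni_term(kl, n, lam, eps=0.05, C=1.0):
    """Complexity term of Theorem 2.1 for a given lambda."""
    return lam * C**2 / (8 * n) + (kl + math.log(1 / eps)) / lam

kl, n = 5.0, 10_000
print(mcallester_term(kl, n))
print(min(catoni_term(kl, n, lam) for lam in range(1, n)))  # lambda on a grid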

It is possible to minimize the right-hand side of (3.1) with respect to $\rho$, and this leads to a Gibbs posterior: $\hat{\rho}=\pi_{-2\lambda r}$. It is also possible to minimize it with respect to $\lambda$, but the minimization in $\lambda$ when $\rho$ itself depends on $\lambda$ is a bit more tricky. On this problem, we want to mention the more recent paper [174]: the authors proved a bound that is easier to minimize simultaneously in $\lambda$ and $\rho$.

3.2.2 Catoni’s bound (another one) [43]

Theorem 2.1 was based on Catoni's preprint [41]. Catoni's monograph [43] provides many other bounds.
