Proceedings of Machine Learning Research vol 195:1–23, 2023
Asymptotically Optimal Generalization Error Bounds for Noisy, Iterative Algorithms
Ibrahim Issa IBRAHIM.ISSA@AUB.EDU.LB
American University of Beirut, Lebanon, and École Polytechnique Fédérale de Lausanne, Switzerland
Amedeo Roberto Esposito AMEDEOROBERTO.ESPOSITO@IST.AC.AT
Institute of Science and Technology Austria
Michael Gastpar MICHAEL.GASTPAR@EPFL.CH
École Polytechnique Fédérale de Lausanne, Switzerland
Abstract
We adopt an information-theoretic framework to analyze the generalization behavior of the class of iterative, noisy learning algorithms. This class is particularly suitable for study under information-theoretic metrics as the algorithms are inherently randomized, and it includes commonly used algorithms such as Stochastic Gradient Langevin Dynamics (SGLD). Herein, we use the maximal leakage (equivalently, the Sibson mutual information of order infinity) metric, as it is simple to analyze, and it implies both bounds on the probability of having a large generalization error and on its expected value. We show that, if the update function (e.g., gradient) is bounded in $L_2$-norm, then adding isotropic Gaussian noise leads to optimal generalization bounds: indeed, the input and output of the learning algorithm in this case are asymptotically statistically independent. Furthermore, we demonstrate how the assumptions on the update function affect the optimal (in the sense of minimizing the induced maximal leakage) choice of the noise. Finally, we compute explicit tight upper bounds on the induced maximal leakage for several scenarios of interest.
Keywords: Noisy iterative algorithms, generalization error, maximal leakage, Gaussian noise
1. Introduction
One of the key challenges in machine learning research concerns the “generalization” behavior
of learning algorithms. That is: if a learning algorithm performs well on the training set, what
guarantees can one provide on its performance on new samples?
While the question of generalization is understood in many settings (Bousquet et al., 2003; Shalev-Shwartz and Ben-David, 2014), existing bounds and techniques provide vacuous expressions when employed to show the generalization capabilities of deep neural networks (DNNs) (Bartlett et al., 2017, 2019; Jiang et al., 2020; Zhang et al., 2021). In general, classical measures of model expressivity (such as the Vapnik-Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 1991), Rademacher complexity (Bartlett and Mendelson, 2003), etc.) fail to explain the generalization abilities of DNNs, which are typically over-parameterized models with fewer training samples than model parameters. A novel approach was introduced by (Russo and Zou, 2016) and (Xu and Raginsky, 2017) (further developed by (Steinke and Zakynthinou, 2020; Bu et al., 2020; Esposito et al., 2021; Esposito and Gastpar, 2022) and many others), where information-theoretic techniques are used to link the generalization capabilities of a learning algorithm to information measures. These quantities are algorithm-dependent and can be used to analyze the generalization capabilities of general classes of updates and models, e.g., noisy iterative algorithms like Stochastic Gradient Langevin Dynamics (SGLD) (Pensia et al., 2018; Wang et al., 2021), which can thus be applied to deep learning. Moreover, it has been shown that information-theoretic bounds can be non-vacuous and reflect the real generalization behavior even in deep learning settings (Dziugaite and Roy, 2017; Zhou et al., 2018; Negrea et al., 2019; Haghifam et al., 2020).
In this work we adopt and expand the framework introduced by (Pensia et al., 2018), but instead of focusing on the mutual information between the input and output of an iterative algorithm, we compute the maximal leakage (Issa et al., 2020). Maximal leakage, together with other information measures of the Sibson/Rényi family (maximal leakage can be shown to be the Sibson mutual information of order infinity (Issa et al., 2020)), has been linked to high-probability bounds on the generalization error (Esposito et al., 2021). In particular, given a learning algorithm $A$ trained on a dataset $S$ (made of $n$ samples), one can provide the following guarantee in the case of the $0$-$1$ loss:
$$P\left(|\text{gen-err}(A,S)| \geq \eta\right) \leq 2\exp\left(-2n\eta^2 + \mathcal{L}(S \to A(S))\right). \tag{1}$$
This deviates from much of the literature, in which the focus is on bounding the expected generalization error instead (Xu and Raginsky, 2017; Steinke and Zakynthinou, 2020). Consequently, if one can guarantee that, for a class of algorithms, the maximal leakage between the input and the output is bounded, then one can provide an exponentially decaying (in the number of samples $n$) bound on the probability of having a large generalization error. This is in general not true for mutual information, which can typically only guarantee a linearly decaying bound on the probability of the same event (Bassily et al., 2018). Moreover, a bound on maximal leakage implies a bound on mutual information (cf. Equation (6)) and, consequently, a bound on the expected generalization error of $A$. The main advantage of maximal leakage lies in the fact that it depends on the distribution of the samples only through its support. It is thus naturally independent from the distribution over the samples and particularly amenable to analysis, especially in additive noise settings.
The contributions of this work can be summarized as follows:
- we derive novel bounds on $\mathcal{L}(S \to A(S))$ whenever $A$ is a noisy, iterative algorithm (SGLD-like), which then implies generalization with high probability;
- we show that the bounds provided on maximal leakage strictly improve the bounds provided by (Pensia et al., 2018), and we thus provide a tighter bound on the expected generalization error of said algorithms as well;
- we show that, under certain assumptions, adding Gaussian noise is asymptotically optimal in the number of dimensions $d$. In particular, we prove that the maximal leakage (and, consequently, the mutual information) between the input and output of this family of algorithms goes to $0$ with the number of dimensions. This implies that the input and output are asymptotically independent, which is consistent with the practical observation that larger neural networks often generalize better;
- we leverage the analysis to extrapolate the optimal type of noise to be added (in the sense that it minimizes the induced maximal leakage), based on the assumptions imposed on the algorithm. In particular,
  - if one assumes the $L_p$-norm of the gradient to be bounded, with $p \leq 2$, our analysis shows that adding Gaussian noise is asymptotically optimal;
  - if one assumes the $L_\infty$-norm of the gradient to be bounded, then adding uniform noise is optimal.
Hence, the analysis and computation of maximal leakage can be used to inform the design of
novel noisy, iterative algorithms.
1.1. Related Work
The line of work exploiting information measures to bound the expected generalization error started in (Russo and Zou, 2016; Xu and Raginsky, 2017) and was then refined with a variety of approaches considering the Conditional Mutual Information (Steinke and Zakynthinou, 2020; Haghifam et al., 2020), the Mutual Information between individual samples and the hypothesis (Bu et al., 2019), or improved versions of the original bounds (Issa et al., 2019; Hafez-Kolahi et al., 2020). Other approaches employed the Kullback-Leibler divergence with a PAC-Bayesian approach (McAllester, 2013; Zhou et al., 2018). Moreover, said bounds were then characterized for specific SGLD-like algorithms, denoted as "noisy, iterative algorithms", and used to provide novel, non-vacuous bounds for neural networks (Pensia et al., 2018; Negrea et al., 2019; Haghifam et al., 2020; Wang et al., 2023) as well as for SGD algorithms (Neu et al., 2021). Recent efforts tried to provide the optimal type of noise to add in said algorithms and reduce the (empirical) gap in performance between SGLD and SGD (Wang et al., 2021). All of these approaches considered the KL-divergence or (variants of) Shannon's Mutual Information. General bounds on the expected generalization error leveraging arbitrary divergences were given in (Esposito and Gastpar, 2022; Lugosi and Neu, 2022). Another line of work considered instead bounds on the probability of having a large generalization error (Bassily et al., 2018; Esposito et al., 2021; Hellström and Durisi, 2020) and focused on large families of divergences and generalizations of the Mutual Information (in particular of the Sibson/Rényi family, including conditional versions).
2. Preliminaries, Setup, and a General Bound
2.1. Preliminaries
2.1.1. Information Measures
The main building block of the information measures considered in this work is Rényi's $\alpha$-divergence between two measures $P$ and $Q$, $D_\alpha(P\|Q)$ (which can be seen as a parametrized generalization of the Kullback-Leibler divergence) (van Erven and Harremoës, 2014, Definition 2). Starting from Rényi's divergence and the geometric averaging that it involves, Sibson built the notion of Information Radius (Sibson, 1969), which can be seen as a special case of the following quantity (Verdú, 2015): $I_\alpha(X,Y) = \min_{Q_Y} D_\alpha(P_{XY}\|P_X Q_Y)$. Sibson's $I_\alpha(X,Y)$ represents a generalization of Shannon's mutual information; indeed, one has that $\lim_{\alpha\to 1} I_\alpha(X,Y) = I(X;Y) = \mathbb{E}_{P_{XY}}\left[\log \frac{dP_{XY}}{dP_X dP_Y}\right]$. Differently, when $\alpha \to \infty$, one gets:
$$I_\infty(X,Y) = \log \mathbb{E}_{P_Y}\left[\operatorname{ess\,sup}_{P_X} \frac{dP_{XY}}{dP_X dP_Y}\right] = \mathcal{L}(X \to Y), \tag{2}$$
where $\mathcal{L}(X \to Y)$ denotes the maximal leakage from $X$ to $Y$, a recently defined information measure with an operational meaning in the context of privacy and security (Issa et al., 2020). Maximal leakage represents the main quantity of interest for the scope of this paper, as it is amenable to analysis and has been used to bound the generalization error (Esposito et al., 2021). As such, we will bound the maximal leakage between the input and output of generic noisy iterative algorithms.
To that end, we mention a few useful properties of $\mathcal{L}(X \to Y)$. If $X$ and $Y$ are jointly continuous random variables, then (Issa et al., 2020, Corollary 4)
$$\mathcal{L}(X \to Y) = \log \int \operatorname{ess\,sup}_{P_X} f_{Y|X}(y|x)\, dy, \tag{3}$$
where $f_{Y|X}$ is the conditional pdf of $Y$ given $X$. Moreover, maximal leakage satisfies the following chain rule (the proof of which is given in Appendix A):

Lemma 1 Given a triple of random variables $(X, Y_1, Y_2)$, then
$$\mathcal{L}(X \to Y_1, Y_2) \leq \mathcal{L}(X \to Y_1) + \mathcal{L}(X \to Y_2|Y_1), \tag{4}$$
where the conditional maximal leakage is $\mathcal{L}(X \to Y_2|Y_1) = \operatorname{ess\,sup}_{P_{Y_1}} \mathcal{L}(X \to Y_2|Y_1 = y_1)$, and the latter term is interpreted as the maximal leakage from $X$ to $Y_2$ with respect to the distribution $P_{XY_2|Y_1=y_1}$. Consequently, for random variables $(X, (Y_i)_{i=1}^n)$,
$$\mathcal{L}(X \to Y^n) \leq \sum_{i=1}^{n} \mathcal{L}\left(X \to Y_i \mid Y^{i-1}\right). \tag{5}$$

Moreover, one can relate $\mathcal{L}(X \to Y)$ to $I(X;Y)$ through $I_\alpha$. Indeed, an important property of $I_\alpha$ is that it is non-decreasing in $\alpha$; hence, for every $\infty > \alpha > 1$:
$$I(X;Y) = I_1(X,Y) \leq I_\alpha(X,Y) \leq I_\infty(X,Y) = \mathcal{L}(X \to Y). \tag{6}$$
For more details on Sibson's $\alpha$-mutual information we refer the reader to (Verdú, 2015); as for maximal leakage, the reader is referred to (Issa et al., 2020).
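As a concrete one-dimensional illustration of (3) (added here for intuition; the setup is ours, not taken from the paper), consider $Y = X + N$ with $N \sim \mathcal{N}(0, \sigma^2)$ independent of $X$, where $X$ is supported on $[-a, a]$. The essential supremum over $P_X$ of the shifted Gaussian density equals its peak value $\frac{1}{\sqrt{2\pi\sigma^2}}$ for $|y| \leq a$, and equals the density evaluated at the nearest endpoint for $|y| > a$, so equation (3) gives
$$\mathcal{L}(X \to Y) = \log \int_{\mathbb{R}} \sup_{x \in [-a,a]} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-x)^2}{2\sigma^2}}\, dy = \log\left(\frac{2a}{\sqrt{2\pi\sigma^2}} + 1\right),$$
which depends on the distribution of $X$ only through its support, and which is exactly the $d = 1$, Gaussian instance of the general bound in Proposition 3 below (with $V_p(1, a) = 2a$).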
2.1.2. Learning Setting
Let $\mathcal{Z}$ be the sample space, $\mathcal{W}$ be the hypothesis space, and $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}_+$ be a loss function. Say $\mathcal{W} \subseteq \mathbb{R}^d$. Let $S = (Z_1, Z_2, \ldots, Z_n)$ consist of $n$ i.i.d. samples, where $Z_i \sim P$, with $P$ unknown. A learning algorithm $A$ is a mapping $A: \mathcal{Z}^n \to \mathcal{W}$ that, given a sample $S$, provides a hypothesis $W = A(S)$. $A$ can be either a deterministic or a randomized mapping; undertaking a probabilistic (and information-theoretic) approach, one can then equivalently consider $A$ as a family of conditional probability distributions $P_{W|S=s}$ for $s \in \mathcal{Z}^n$, i.e., an information channel. Given a hypothesis $w \in \mathcal{W}$, the true risk of $w$ is denoted as follows:
$$L_P(w) = \mathbb{E}_P[\ell(w, Z)], \tag{7}$$
while the empirical risk of $w$ on $S$ is denoted as follows:
$$L_S(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w, Z_i). \tag{8}$$
Given a learning algorithm $A$, one can then define its generalization error as follows:
$$\text{gen-err}_P(A, S) = L_P(A(S)) - L_S(A(S)). \tag{9}$$
Since both $S$ and $A$ can be random, $\text{gen-err}_P(A,S)$ is a random variable, and one can then study its expected value or its behavior in probability. Bounds on the expected value of the generalization error in terms of information measures are given in (Xu and Raginsky, 2017; Issa et al., 2019; Bu et al., 2019; Steinke and Zakynthinou, 2020), stating different variants of the following bound (Xu and Raginsky, 2017, Theorem 1): if $\ell(w, Z)$ is $\sigma^2$-sub-Gaussian,¹ then
$$\left|\mathbb{E}[\text{gen-err}_P(A,S)]\right| \leq \sqrt{\frac{2\sigma^2 I(S; A(S))}{n}}. \tag{10}$$
Thus, if one can prove that the mutual information between the input and output of a learning algorithm $A$ trained on $S$ is bounded (ideally, growing less than linearly in $n$), then the expected generalization error of $A$ will vanish with the number of samples. Alternatively, Esposito et al. (2021) demonstrate high-probability bounds involving different families of information measures. One such bound, which is relevant to the scope of this paper, is the following (Esposito et al., 2021, Corollary 2): assume $\ell(w,Z)$ is $\sigma^2$-sub-Gaussian and let $\alpha > 1$; then
$$P\left(|\text{gen-err}_P(A,S)| \geq t\right) \leq 2\exp\left(-\frac{\alpha-1}{\alpha}\left(\frac{nt^2}{2\sigma^2} - I_\alpha(S, A(S))\right)\right); \tag{11}$$
taking the limit of $\alpha \to \infty$ in (11) leads to the following (Esposito et al., 2021, Corollary 4):
$$P\left(|\text{gen-err}_P(A,S)| \geq t\right) \leq 2\exp\left(-\frac{nt^2}{2\sigma^2} + \mathcal{L}(S \to A(S))\right). \tag{12}$$
Thus, in this case, if one can prove that the maximal leakage between the input and output of a learning algorithm $A$ trained on $S$ is bounded, then the probability of the generalization error of $A$ being larger than any constant $t$ will decay exponentially fast in the number of samples $n$.
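For intuition on how a leakage estimate translates into a concrete tail probability, the following short Python sketch simply evaluates the right-hand side of (12); the numerical values of $n$, $t$, $\sigma^2$, and the leakage are illustrative placeholders, not quantities taken from the paper.

```python
import math

def leakage_tail_bound(n, t, sigma2, leakage):
    """Right-hand side of (12): 2 * exp(-n t^2 / (2 sigma^2) + leakage)."""
    return 2.0 * math.exp(-n * t**2 / (2.0 * sigma2) + leakage)

# Example: 0-1 loss (sigma^2 = 1/4), n = 10_000 samples, deviation t = 0.05,
# and a hypothetical maximal-leakage value of 2 nats.
print(leakage_tail_bound(n=10_000, t=0.05, sigma2=0.25, leakage=2.0))
```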
2.2. Problem Setup
We consider iterative algorithms, where each update is of the following form:
$$W_t = g(W_{t-1}) - \eta_t F(W_{t-1}, Z_t) + \xi_t, \qquad t \geq 1, \tag{13}$$
where $Z_t$ is drawn from $S$ (sampled according to some distribution), $g: \mathbb{R}^d \to \mathbb{R}^d$ is a deterministic function, $F(W_{t-1}, Z_t)$ computes a direction (e.g., gradient), $\eta_t$ is a constant step-size, and $\xi_t = (\xi_{t1}, \ldots, \xi_{td})$ is noise. We will assume for the remainder of this paper that $\xi_t$ has an absolutely continuous distribution. Let $T$ denote the total number of iterations, $W^t = (W_1, W_2, \ldots, W_t)$, and $Z^t = (Z_1, Z_2, \ldots, Z_t)$. The algorithms under consideration further satisfy the following two assumptions.

Assumption 1 (Sampling): The sampling strategy is agnostic to parameter vectors:
$$P(Z_{t+1}|Z^t, W^t, S) = P(Z_{t+1}|Z^t, S). \tag{14}$$

Assumption 2 ($L_p$-Boundedness): For some $p > 0$ and $L > 0$, $\sup_{w,z} \|F(w,z)\|_p \leq L$.

1. A $0$-mean random variable $X$ is said to be $\sigma^2$-sub-Gaussian if $\log \mathbb{E}[\exp(\lambda X)] \leq \sigma^2\lambda^2/2$ for every $\lambda \in \mathbb{R}$.

As a consequence of the first assumption and the structure of the iterates, we get:
$$P(W_{t+1}|W^t, Z^T, S) = P(W_{t+1}|W_t, Z_{t+1}). \tag{15}$$
The above setup was proposed by Pensia et al. (Pensia et al., 2018), who specifically studied the case $p = 2$. Denoting by $W$ the final output of the algorithm (some function of $W^T$), they show that

Theorem 2 ((Pensia et al., 2018, Theorem 1)) If the boundedness assumption holds for $p = 2$ and $\xi_t \sim \mathcal{N}(0, \sigma_t^2 I_d)$, then
$$I(S; W) \leq \frac{d}{2}\sum_{t=1}^{T} \log\left(1 + \frac{\eta_t^2 L^2}{d\sigma_t^2}\right). \tag{16}$$
By virtue of inequality (10), this yields a bound on the expected generalization error.
In this work, we derive bounds on the maximal leakage $\mathcal{L}(S \to W)$ for iterative noisy algorithms, which leads to high-probability bounds on the generalization error (cf. equation (12)). We consider different scenarios in which $F$ is bounded in $L_1$, $L_2$, or $L_\infty$ norm, and the added noise is Laplace, Gaussian, or uniform.
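To make the update (13) and Assumption 2 concrete, the following Python sketch performs one SGLD-like step with $g(w) = w$, isotropic Gaussian noise, and $L_2$-clipping of the update direction to enforce the boundedness assumption; the loss, clipping threshold, and step size are illustrative assumptions, not choices prescribed by the paper.

```python
import numpy as np

def clip_l2(v, L):
    """Rescale v so that ||v||_2 <= L (one common way to enforce Assumption 2)."""
    norm = np.linalg.norm(v)
    return v if norm <= L else v * (L / norm)

def noisy_step(w, z, grad_fn, eta, sigma, L, rng):
    """One iterate of (13) with g(w) = w: W_t = w - eta * F(w, z) + xi_t."""
    direction = clip_l2(grad_fn(w, z), L)          # F(W_{t-1}, Z_t), bounded in L2-norm
    noise = rng.normal(0.0, sigma, size=w.shape)   # xi_t ~ N(0, sigma^2 I_d)
    return w - eta * direction + noise

# Toy example: least-squares gradient on a single sample z = (x, y).
def lsq_grad(w, z):
    x, y = z
    return (x @ w - y) * x

rng = np.random.default_rng(0)
d = 10
w = np.zeros(d)
sample = (rng.normal(size=d), 1.0)
w = noisy_step(w, sample, lsq_grad, eta=0.1, sigma=1.0, L=1.0, rng=rng)
```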
2.3. Notation
Given $d \in \mathbb{N}$, $w \in \mathbb{R}^d$, and $r > 0$, let $\mathcal{B}^d_p(w, r) = \{x \in \mathbb{R}^d : \|x - w\|_p \leq r\}$ denote the $L_p$-ball of radius $r$ and center $w$, and let $V_p(d, r)$ denote its corresponding volume. When the dimension $d$ is clear from the context, we may drop the superscript and write $\mathcal{B}_p(w, r)$. Given a set $S$, we denote its complement by $S^c$. The $i$-th component of $w_t$ will be denoted by $w_{ti}$.
We denote the pdf of the noise $\xi_t$ by $f_t: \mathbb{R}^d \to \mathbb{R}$. The following functional will be useful for our study: given $d \in \mathbb{N}$, $p > 0$, a pdf $f: \mathbb{R}^d \to \mathbb{R}$, and $r \geq 0$, define
$$h(d, p, f, r) := \int_{\mathcal{B}^{d}_p(0,r)^c} \sup_{x \in \mathcal{B}^d_p(0,r)} f(w - x)\, dw. \tag{17}$$
We denote the "positive octant" by $A_d$, i.e.,
$$A_d := \{w \in \mathbb{R}^d : w_i \geq 0 \text{ for all } i \in \{1, 2, \ldots, d\}\}. \tag{18}$$
Since we will mainly consider pdfs that are symmetric (Gaussian, Laplace, uniform), the $h$ functional "restricted" to $A_d$ will be useful:
$$h^+(d, p, f, r) := \int_{\mathcal{B}^{d}_p(0,r)^c \cap A_d} \sup_{x \in \mathcal{B}^d_p(0,r)} f(w - x)\, dw. \tag{19}$$
2.4. General Bound
Proposition 3 Suppose $f_t: \mathbb{R}^d \to \mathbb{R}$ is maximized at $x = 0$. If Assumptions 1 and 2 hold for some $p > 0$, then
$$\mathcal{L}(S \to W) \leq \sum_{t=1}^{T} \log\left(f_t(0) V_p(d, \eta_t L) + h(d, p, f_t, \eta_t L)\right), \tag{20}$$
where $h$ is defined in equation (17).
The above bound is appealing as it implicitly poses an optimization problem: given a constraint on the noise pdf $f_t$ (say, a bounded variance), one may choose $f_t$ so as to minimize the upper bound in equation (20). Moreover, despite its generality, we show that it is tight in several interesting cases, including when $p = 2$ and $f_t$ is the Gaussian pdf. The series of steps leading to the upper bound includes only one inequality (the source of the "looseness" of the bound), which could be viewed as due to replacing Assumption 2 by the following statement: for all $w$, $\{F(w, z) : z \in \mathcal{Z}\} = \mathcal{B}^d_p(0, L)$; i.e., in addition to assuming $F(w, z) \in \mathcal{B}^d_p(0, L)$, we assume that, for every $w$, every point in the ball $\mathcal{B}^d_p(0, L)$ is attained for some $z$.
In the next section, we consider several scenarios for different values of $p$ and different noise distributions. As a testament to the tractability of maximal leakage, we derive exact semi-closed form expressions for the bound of Proposition 3. Finally, it is worth noting that the form of the bound allows one to choose different noise distributions at different time steps, but such examples are outside the scope of this paper.
Proof We proceed as in the work of (Pensia et al., 2018):
$$\mathcal{L}(S \to W) \leq \mathcal{L}\left(Z^T \to W^T\right) \leq \sum_{t=1}^{T} \mathcal{L}\left(Z^T \to W_t \mid W^{t-1}\right) = \sum_{t=1}^{T} \mathcal{L}\left(Z_t \to W_t \mid W^{t-1}\right), \tag{21}$$
where the first inequality follows from Lemma 2 of (Pensia et al., 2018) and the data processing inequality for maximal leakage (Issa et al., 2020, Lemma 1), the second inequality follows from Lemma 1, and the equality follows from (15). Now,
$$\exp\left\{\mathcal{L}(Z_t \to W_t \mid W^{t-1} = w^{t-1})\right\} = \int_{\mathbb{R}^d} \operatorname{ess\,sup}_{P_{Z_t}} p(w_t \mid Z_t)\, dw_t \tag{22}$$
$$= \int_{\mathbb{R}^d} \operatorname{ess\,sup}_{P_{Z_t}} f_t\left(w_t - g(w_{t-1}) + \eta_t F(w_{t-1}, Z_t)\right) dw_t \tag{23}$$
$$= \int_{\mathbb{R}^d} \operatorname{ess\,sup}_{P_{Z_t}} f_t\left(w_t + \eta_t F(w_{t-1}, Z_t)\right) dw_t, \tag{24}$$
where the last equality follows from the change of variable $\tilde{w}_t = w_t - g(w_{t-1})$. Finally, since $\eta_t F(w_{t-1}, z_t) \in \mathcal{B}_p(0, \eta_t L)$ by assumption, we can further upper-bound the above by:
$$\exp\left\{\mathcal{L}(Z_t \to W_t \mid W^{t-1} = w^{t-1})\right\} \tag{25}$$
$$\leq \int_{\mathbb{R}^d} \sup_{x_t \in \mathcal{B}_p(0, \eta_t L)} f_t(w_t + x_t)\, dw_t \tag{26}$$
$$= \int_{\mathcal{B}_p(0,\eta_t L)} \sup_{x_t \in \mathcal{B}_p(0, \eta_t L)} f_t(w_t + x_t)\, dw_t + \int_{\mathcal{B}^c_p(0,\eta_t L)} \sup_{x_t \in \mathcal{B}_p(0, \eta_t L)} f_t(w_t + x_t)\, dw_t \tag{27}$$
$$= f_t(0) V_p(d, \eta_t L) + \int_{\mathcal{B}^c_p(0,\eta_t L)} \sup_{x_t \in \mathcal{B}_p(0, \eta_t L)} f_t(w_t - x_t)\, dw_t, \tag{28}$$
where the last equality follows from the assumptions on $f_t$.
3. Boundedness in $L_2$-Norm

When $F$ computes a gradient, boundedness in $L_2$-norm is a common assumption. It is commonly enforced, for instance, using gradient clipping (Abadi et al., 2016a,b; Chen et al., 2020).
The case in which $p = 2$ and the noise is Gaussian leads to the strongest result in this paper:
Theorem 4 If the boundedness assumption holds for $p \leq 2$ and $\xi_t \sim \mathcal{N}(0, \sigma_t^2 I_d)$, then
$$\mathcal{L}(S \to W) \leq \sum_{t=1}^{T} \log\left(\frac{V_2(d, \eta_t L)}{(2\pi\sigma_t^2)^{d/2}} + \frac{1}{\Gamma\left(\frac{d}{2}\right)} \sum_{i=0}^{d-1} \Gamma\left(\frac{i+1}{2}\right)\left(\frac{\eta_t L}{\sigma_t\sqrt{2}}\right)^{d-1-i}\right), \tag{29}$$
where $V_2(d, r) = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)}\, r^d$. Consequently, for fixed $T$,
$$\lim_{d\to\infty} \mathcal{L}(S \to W) = 0. \tag{30}$$
Remarkably, equation (30) states that, as $d$ grows, $S$ and $W$ are asymptotically independent. The bound is asymptotically optimal for $\mathcal{L}(S \to W)$ (indeed, it yields an equality in the limit). More importantly, the high-probability bound induced via equation (12) is also optimal. Indeed, in the limit where $S$ and $W$ are independent, the bound (12) recovers the (order-optimal) McDiarmid inequality, i.e., under the assumptions of Theorem 4 and considering the $0$-$1$ loss:
$$P\left(|\text{gen-err}(A,S)| \geq t\right) \leq 2\exp\left(-2nt^2\right). \tag{31}$$
This can be seen as an explanation of the (arguably unintuitive) phenomenon that deeper networks often generalize better (also analyzed by (Wang et al., 2023)).
By contrast, this is not captured in the bound by (Pensia et al., 2018) given in equation (16). Indeed, that bound is growing as a function of $d$, and tends to a non-zero value:
$$\lim_{d\to\infty} \frac{d}{2}\sum_{t=1}^{T} \log\left(1 + \frac{\eta_t^2 L^2}{d\sigma_t^2}\right) = \sum_{t=1}^{T} \frac{\eta_t^2 L^2}{2\sigma_t^2}, \tag{32}$$
where the equality follows from the fact that $\lim_{n\to\infty}(1 + c/n)^n = e^c$ for all $c \in \mathbb{R}$. Notably, however, $I(S;W) \leq \mathcal{L}(S \to W)$ (cf. equation (6)), so that for large $d$ we have
$$I(S;W) \leq \mathcal{L}(S \to W) \leq \text{right-hand side of (29)} \leq \text{right-hand side of (16)}. \tag{33}$$
As such, the bound in Theorem 4 is also a tighter upper bound on $I(S;W)$.
Moreover, note that even if the parameter $L$ is large (e.g., the Lipschitz constant of a neural network (Negrea et al., 2019)), it appears in (29) normalized by $\Gamma(d/2)$, so its effect is significantly dampened (as $d$ is also typically very large).
Finally, note that the bound in Proposition 3 is increasing in $p$: this can be seen from line (26), where the supremum over $\mathcal{B}_p$ can be further upper-bounded by a supremum over $\mathcal{B}_{p'}$ for $p' > p$. Therefore, for $q \leq p$, the bound induced by Proposition 3 is smaller. The bound in Theorem 4 corresponds to $p = 2$ and goes to $0$ (as $d$ grows); hence, the bound induced by Proposition 3 goes to $0$ for all $q \leq p$. In particular, adding Gaussian noise is asymptotically optimal (in the sense discussed above) when Assumption 2 holds for any $p \leq 2$.
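As a quick numerical illustration of the comparison in (32)–(33), the following Python sketch evaluates the right-hand sides of (29) and (16) for a single step ($T = 1$) as functions of $d$, using log-domain Gamma functions for stability; the values of $\eta$, $L$, and $\sigma$ are arbitrary illustrative choices, not quantities prescribed by the paper.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def leakage_bound_rhs(d, eta, L, sigma):
    """Single-step right-hand side of (29), evaluated in the log domain."""
    r = eta * L
    # log of V_2(d, r) / (2 pi sigma^2)^{d/2}
    log_first = (d / 2) * np.log(np.pi) + d * np.log(r) \
        - gammaln(d / 2 + 1) - (d / 2) * np.log(2 * np.pi * sigma**2)
    # log of (1 / Gamma(d/2)) * sum_i Gamma((i+1)/2) * (r / (sigma sqrt(2)))^{d-1-i}
    i = np.arange(d)
    log_terms = gammaln((i + 1) / 2) + (d - 1 - i) * np.log(r / (sigma * np.sqrt(2)))
    log_second = logsumexp(log_terms) - gammaln(d / 2)
    return np.logaddexp(log_first, log_second)  # log(first term + second term)

def pensia_bound_rhs(d, eta, L, sigma):
    """Single-step right-hand side of (16)."""
    return (d / 2) * np.log(1 + (eta * L)**2 / (d * sigma**2))

for d in (10, 100, 1000, 10000):
    print(d, leakage_bound_rhs(d, eta=0.1, L=1.0, sigma=1.0),
          pensia_bound_rhs(d, eta=0.1, L=1.0, sigma=1.0))
```

With these (arbitrary) values, the evaluated term of (29) decreases toward $0$ as $d$ grows, while the term of (16) approaches $\eta^2 L^2/(2\sigma^2)$, consistent with (32)–(33) for large $d$.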
Proof To show that the right-hand side of equation (29) goes to zero as $d \to \infty$, we use Stirling's approximation of the Gamma function: for all $x > 0$,
$$\sqrt{2\pi}\, x^{x-\frac{1}{2}} e^{-x} \leq \Gamma(x) \leq \sqrt{2\pi}\, x^{x-\frac{1}{2}} e^{-x} e^{\frac{1}{12x}}. \tag{34}$$
The details of the computation can be found in Appendix B. We now turn to the proof of inequality (29). The conditions of Proposition 3 are satisfied; thus it is sufficient to prove the bound for $p = 2$:
$$\mathcal{L}(S \to W) \leq \sum_{t=1}^{T} \log\left(f_t(0) V_2(d, \eta_t L) + \int_{\mathcal{B}^c_2(0,\eta_t L)} \sup_{x_t \in \mathcal{B}_2(0,\eta_t L)} f_t(w_t - x_t)\, dw_t\right) \tag{35}$$
$$= \sum_{t=1}^{T} \log\left(\frac{V_2(d, \eta_t L)}{(2\pi\sigma_t^2)^{d/2}} + \int_{\mathcal{B}^c_2(0,\eta_t L)} \sup_{x_t \in \mathcal{B}_2(0,\eta_t L)} \frac{1}{(2\pi\sigma_t^2)^{d/2}} \exp\left(-\frac{\|w_t - x_t\|_2^2}{2\sigma_t^2}\right) dw_t\right). \tag{36}$$
Hence, it remains to show that the second term inside the log matches that of equation (29). To that end, note that the point in $\mathcal{B}_2(0, \eta_t L)$ that minimizes the distance to $w_t$ is given by $\frac{\eta_t L}{\|w_t\|} w_t$. So we get
$$\|w_t - x_t\| \geq \left\|w_t - \frac{\eta_t L}{\|w_t\|} w_t\right\| = \|w_t\| - \eta_t L. \tag{37}$$
Then,
$$h(d, 2, f_t, \eta_t L) = \int_{\mathcal{B}^c_2(0,\eta_t L)} \sup_{x_t \in \mathcal{B}_2(0,\eta_t L)} \frac{1}{(2\pi\sigma_t^2)^{d/2}} \exp\left(-\frac{\|w_t - x_t\|_2^2}{2\sigma_t^2}\right) dw_t \tag{38}$$
$$= \int_{\mathcal{B}^c_2(0,\eta_t L)} \frac{1}{(2\pi\sigma_t^2)^{d/2}} \exp\left(-\frac{(\|w_t\|_2 - \eta_t L)^2}{2\sigma_t^2}\right) dw_t. \tag{39}$$
To evaluate this integral, we use spherical coordinates (details in Appendix C). Then,
$$h(d, 2, f_t, \eta_t L) = \left(\frac{\eta_t L}{\sigma_t\sqrt{2}}\right)^{d-1} \frac{1}{\Gamma\left(\frac{d}{2}\right)} \sum_{i=0}^{d-1} \left(\frac{\sigma_t\sqrt{2}}{\eta_t L}\right)^i \Gamma\left(\frac{i+1}{2}\right). \tag{40}$$
Combining equations (36) and (40) yields (29).
Remark 5 One could also derive a semi-closed form bound for the case in which the added noise is uniform. However, in that case $\mathcal{L}(S \to W)$ goes to infinity as $d$ goes to infinity. The same behavior holds if the added noise is Laplace. Since the Gaussian noise leads to an asymptotically optimal bound, we skip the analysis of uniform and Laplace noise.
4. Boundedness in $L_\infty$-Norm
The bound in Proposition 3 makes minimal assumptions about the pdf $f_t$. In many practical scenarios we have more structure we could leverage. In particular, we make the following standard assumptions in this section:
- $\xi_t$ is composed of i.i.d. components. Letting $f_{t0}$ be the pdf of a component, $f_t(x_t) = \prod_{i=1}^{d} f_{t0}(x_{ti})$.
- $f_{t0}$ is symmetric around $0$ and non-increasing over $[0, \infty)$.
In this setting, Proposition 3 reduces to a very simple form for $p = \infty$:

Theorem 6 Suppose $f_t$ satisfies the above assumptions. If Assumptions 1 and 2 hold for $p = \infty$, then
$$\mathcal{L}(S \to W) \leq \sum_{t=1}^{T} d \log\left(1 + 2\eta_t L f_{t0}(0)\right). \tag{41}$$
Unlike the bound of Theorem 4, the limit as $d$ goes to infinity here is infinite. However, the bounded-$L_\infty$ assumption is weaker than assuming a bounded $L_2$-norm. Moreover, the assumption of a bounded $L_\infty$-norm is satisfied in (Pichapati et al., 2019), where the authors clip the gradient in terms of the $L_\infty$-norm, thus "enforcing" the assumption. On the other hand, the theorem has an intriguing form: under the standard assumptions above, the bound depends on $f_{t0}$ only through $f_{t0}(0)$. This naturally leads to an optimization problem: given a certain constraint on the noise, which distribution $f$ minimizes $f(0)$? The following theorem shows that, if the noise is required to have a bounded variance, then $f$ corresponds to the uniform distribution:
Theorem 7 Let $\mathcal{F}$ be the family of probability densities (over $\mathbb{R}$) satisfying, for each $f \in \mathcal{F}$:
1. $f$ is symmetric around $0$.
2. $f$ is non-increasing over $[0, \infty)$.
3. $\mathbb{E}_f[X^2] \leq \sigma^2$.
Then, the distribution minimizing $f(0)$ over $\mathcal{F}$ is the uniform distribution $\mathcal{U}(-\sigma\sqrt{3}, \sigma\sqrt{3})$.

That is, uniform noise is optimal in the sense that it minimizes the upper bound in Theorem 6 under bounded variance constraints. The proof of Theorem 7 is deferred to Appendix E.
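As a quick sanity check (not part of the original argument), the uniform density $\mathcal{U}(-\sigma\sqrt{3}, \sigma\sqrt{3})$ satisfies the three constraints of Theorem 7 with the variance constraint met with equality, and attains
$$f(x) = \frac{1}{2\sqrt{3}\,\sigma}\,\mathbf{1}\{|x| \leq \sigma\sqrt{3}\}, \qquad \mathbb{E}_f[X^2] = \frac{(2\sigma\sqrt{3})^2}{12} = \sigma^2, \qquad f(0) = \frac{1}{2\sqrt{3}\,\sigma},$$
which is exactly the lower bound on $f(0)$ established in Appendix E (cf. equation (92)). Plugged into Theorem 6, variance-$\sigma^2$ uniform noise therefore yields the bound $\sum_{t=1}^{T} d\log\left(1 + \frac{\eta_t L}{\sqrt{3}\,\sigma}\right)$.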
4.1. Proof of Theorem 6
Since the assumptions of Proposition 3 hold,
$$\mathcal{L}(S \to W) \leq \sum_{t=1}^{T} \log\left(f_t(0) V_\infty(d, \eta_t L) + \int_{\mathcal{B}^c_\infty(0,\eta_t L)} \sup_{x_t \in \mathcal{B}_\infty(0,\eta_t L)} f_t(w_t - x_t)\, dw_t\right) \tag{42}$$
$$= \sum_{t=1}^{T} \log\left(\left(2\eta_t L f_{t0}(0)\right)^d + \int_{\mathcal{B}^c_\infty(0,\eta_t L)} \prod_{i=1}^{d} \sup_{x_{ti}: |x_{ti}| \leq \eta_t L} f_{t0}(w_{ti} - x_{ti})\, dw_t\right). \tag{43}$$
It remains to show that $h(d, \infty, f_t, \eta_t L)$ (i.e., the second term inside the log in Equation (43), cf. Equation (17)) is equal to $(1 + 2\eta_t L f_{t0}(0))^d - (2\eta_t L f_{t0}(0))^d$. We will derive a recurrence relation for $h$ in terms of $d$. To simplify the notation, we drop the subscript $t$ and suppress the dependence of $h$ on $p = \infty$, $f_t$, and $\eta_t L$, so that we simply write $h(d)$ (and correspondingly, $h^+(d)$, cf. Equation (19)).
By symmetry, $h(d) = 2^d h^+(d)$. Letting $w^{d-1} := (w_1, \ldots, w_{d-1})$, we decompose the integral over $\mathcal{B}^{d}_\infty(0, \eta_t L)^c \cap A_d$ into two disjoint subsets: 1) $w^{d-1} \notin \mathcal{B}^{d-1}_\infty(0, \eta_t L)$, in which case $w_d$ can take any value in $[0, \infty)$, and 2) $w^{d-1} \in \mathcal{B}^{d-1}_\infty(0, \eta_t L)$, in which case $w_d$ must satisfy $w_d > \eta_t L$:
$$h^+(d) = \int_{\mathcal{B}^{d-1}_\infty(0,\eta_t L)^c \cap A_{d-1}} \prod_{i=1}^{d-1} \sup_{x_i: |x_i| \leq \eta_t L} f(w_i - x_i) \left(\int_{0}^{\infty} \sup_{x_d: |x_d| \leq \eta_t L} f(w_d - x_d)\, dw_d\right) dw^{d-1} \tag{44}$$
$$+ \int_{\mathcal{B}^{d-1}_\infty(0,\eta_t L) \cap A_{d-1}} \prod_{i=1}^{d-1} \sup_{x_i: |x_i| \leq \eta_t L} f(w_i - x_i) \left(\int_{\eta_t L}^{\infty} \sup_{x_d: |x_d| \leq \eta_t L} f(w_d - x_d)\, dw_d\right) dw^{d-1}. \tag{45}$$
The innermost integral of line (44) is independent of $w^{d-1}$, so that the corresponding outer integral is equal to $h^+(d-1)$. Similarly, the innermost integral of line (45) is independent of $w^{d-1}$, and the supremum in its outer integral yields $f(0)$ for every $i$. Hence, we get
$$h(d) = (1 + 2\eta_t L f(0))\, h(d-1) + (2\eta_t L f(0))^{d-1}, \tag{46}$$
the detailed proof of which is deferred to Appendix D. Finally, it is straightforward to check that $h(1) = 1$, and hence $h(d) = (1 + 2\eta_t L f(0))^d - (2\eta_t L f(0))^d$.
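For completeness (a short check not spelled out in the original text), writing $a := 2\eta_t L f(0)$, the closed form $h(d) = (1+a)^d - a^d$ indeed satisfies the recurrence (46) and the base case $h(1) = 1$:
$$(1+a)\left((1+a)^{d-1} - a^{d-1}\right) + a^{d-1} = (1+a)^d - (1+a)a^{d-1} + a^{d-1} = (1+a)^d - a^d, \qquad h(1) = (1+a) - a = 1.$$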
5. Boundedness in $L_1$-Norm

In this section, we consider the setting where Assumption 2 holds for $p = 1$. By Proposition 3, any bound derived for $p = 2$ holds for $p = 1$ as well. In particular, Theorem 4 applies, so that $\mathcal{L}(S \to W)$ goes to zero when the noise is Gaussian. Nevertheless, it is possible to compute a semi-closed form directly for $p = 1$ (cf. Theorem 9 below), which is inherently tighter.
Considering the optimality of Gaussian noise for the $p = 2$ case, and the optimality of uniform noise (in the sense discussed above) for the $p = \infty$ case, one might wonder whether Laplace noise is optimal for the $p = 1$ case. We answer this question in the negative, as the limit of the leakage bound in this case is a non-zero constant (cf. Theorem 8), as opposed to the zero limit when the noise is Gaussian.
5.1. Bound for Laplace noise

We say $X$ has a Laplace distribution, denoted by $X \sim \text{Lap}(\mu, 1/\lambda)$, if its pdf is given by $f(x) = \frac{\lambda}{2} e^{-\lambda|x-\mu|}$ for $x \in \mathbb{R}$, for some $\mu \in \mathbb{R}$ and $\lambda > 0$. The corresponding variance is given by $2/\lambda^2$.

Theorem 8 If the boundedness assumption holds for $p = 1$ and $\xi_t$ is composed of i.i.d. components, each of which is $\text{Lap}(0, \sigma_t/\sqrt{2})$, then
$$\mathcal{L}(S \to W) \leq \sum_{t=1}^{T} \log\left(\frac{V_1(d, \eta_t L)}{(\sigma_t\sqrt{2})^d} + \sum_{i=0}^{d-1} \frac{\left(\sqrt{2}\,\eta_t L/\sigma_t\right)^i}{i!}\right), \tag{47}$$
where $V_1(d, r) = \frac{(2r)^d}{d!}$. Consequently, for fixed $T$,
$$\lim_{d\to\infty} \mathcal{L}(S \to W) \leq \sum_{t=1}^{T} \frac{\sqrt{2}\,\eta_t L}{\sigma_t}. \tag{48}$$
Proof We give a high-level description of the proof (as similar techniques have been used in the proofs of earlier theorems) and defer the details to Appendix F. Since the multivariate Laplace distribution (for i.i.d. variables) depends on the $L_1$-norm of the corresponding vector of variables, we need to solve the following problem: given $R > 0$ and $w \notin \mathcal{B}_1(0, R)$, compute
$$\inf_{x \in \mathcal{B}_1(0,R)} \|w - x\|_1. \tag{49}$$
The closest element in $\mathcal{B}_1(0, R)$ will lie on the hyperplane defining $\mathcal{B}_1$ that is in the same octant as $w$, so the problem reduces to projecting a point onto a hyperplane in $L_1$-distance (the proof in the appendix does not follow this argument but arrives at the same conclusion). Then, we need to compute $h(d, 1, f_t, \eta_t L)$. We use a similar approach as in the proof of Theorem 6; that is, we split the integral and derive a recurrence relation.
5.2. Bound for Gaussian noise

Finally, we derive a bound on the induced leakage when the added noise is Gaussian:

Theorem 9 If the boundedness assumption holds for $p = 1$ and $\xi_t \sim \mathcal{N}(0, \sigma_t^2 I_d)$, then
$$\mathcal{L}(S \to W) \leq \sum_{t=1}^{T} \log\left(\frac{V_1(d, \eta_t L)}{(2\pi\sigma_t^2)^{d/2}} + \frac{(2\eta_t L)^{d-1}\left(\sigma_t\sqrt{2d}\right)}{(2\pi\sigma_t^2)^{d/2}\,(d-1)!} \sum_{i=0}^{d-1} \left(\frac{\sigma_t\sqrt{2d}}{\eta_t L}\right)^i \Gamma\left(\frac{i+1}{2}\right)\right). \tag{50}$$

Theorem 9 is tighter than Theorem 4 for any given $d$; moreover, one has, again, that
$$\lim_{d\to\infty} \mathcal{L}(S \to W) = 0. \tag{51}$$
Equation (51) follows from Theorem 4 and the fact that the bound in Proposition 3 is increasing in $p$ (cf. discussion below Theorem 4).
In order to prove Theorem 9, one has to solve a problem similar to the one introduced in Theorem 8 (cf. equation (49)). However, in this case a different norm is involved: i.e., given $R > 0$ and $w \notin \mathcal{B}_1(0, R)$, one has to compute
$$\inf_{x \in \mathcal{B}_1(0,R)} \|w - x\|_2. \tag{52}$$
Again, one can argue that the point achieving the infimum lies on the hyperplane defining $\mathcal{B}_1$ that is in the same octant as $w$. In other words, the minimizer $x^*$ is such that the sign of each component is the same as the sign of the corresponding component of $w$ (and it lies on the boundary of $\mathcal{B}_1$). Thus, we are simply projecting a point onto a hyperplane. The resulting integral is solved by a suitable change of variables. The details of the proof are given in Appendix G.
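For clarity (this derivation is added and is not part of the original text), the distance appearing in the projection step of Appendix G (equation (110)) is the standard $L_2$ point-to-hyperplane distance: for $w \in A_d$ with $\sum_{i=1}^{d} w_i > R$ and the hyperplane $H = \{x \in \mathbb{R}^d : \mathbf{1}^\top x = R\}$,
$$\operatorname{dist}_2(w, H) = \frac{|\mathbf{1}^\top w - R|}{\|\mathbf{1}\|_2} = \frac{\sum_{i=1}^{d} w_i - R}{\sqrt{d}}.$$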
Appendix A. Proof of Lemma 1
Recall the definition of maximal leakage and conditional maximal leakage:

Definition 10 (Maximal Leakage (Issa et al., 2020, Definition 1)) Given two random variables $(X, Y)$ with joint distribution $P_{XY}$,
$$\mathcal{L}(X \to Y) = \log \sup_{U: U - X - Y} \frac{\Pr(\hat{U}(Y) = U)}{\max_u P_U(u)}, \tag{53}$$
where $U$ takes values in a finite, but arbitrary, alphabet, and $\hat{U}(Y)$ is the optimal estimator (i.e., MAP) of $U$ given $Y$.

Similarly,

Definition 11 (Conditional Maximal Leakage (Issa et al., 2020, Definition 6)) Given three random variables $(X, Y, Z)$ with joint distribution $P_{XYZ}$,
$$\mathcal{L}(X \to Y|Z) = \log \sup_{U: U - X - Y|Z} \frac{\Pr(\hat{U}(Y, Z) = U)}{\Pr(\hat{U}(Z) = U)}, \tag{54}$$
where $U$ takes values in a finite, but arbitrary, alphabet, and $\hat{U}(Y, Z)$ and $\hat{U}(Z)$ are the optimal estimators (i.e., MAP) of $U$ given $(Y, Z)$ and of $U$ given $Z$, respectively.

It then follows that
$$\mathcal{L}(X \to Y_1, Y_2) = \log \sup_{U: U - X - (Y_1, Y_2)} \frac{\Pr(\hat{U}(Y_1, Y_2) = U)}{\max_u P_U(u)} \tag{55}$$
$$= \log \sup_{U: U - X - (Y_1, Y_2)} \frac{\Pr(\hat{U}(Y_1, Y_2) = U)}{\Pr(\hat{U}(Y_1) = U)} \cdot \frac{\Pr(\hat{U}(Y_1) = U)}{\max_u P_U(u)} \tag{56}$$
$$\leq \log \left(\sup_{U: U - X - (Y_1, Y_2)} \frac{\Pr(\hat{U}(Y_1, Y_2) = U)}{\Pr(\hat{U}(Y_1) = U)} \cdot \sup_{U: U - X - (Y_1, Y_2)} \frac{\Pr(\hat{U}(Y_1) = U)}{\max_u P_U(u)}\right) \tag{57}$$
$$\leq \log \left(\sup_{U: U - X - Y_2|Y_1} \frac{\Pr(\hat{U}(Y_1, Y_2) = U)}{\Pr(\hat{U}(Y_1) = U)} \cdot \sup_{U: U - X - Y_1} \frac{\Pr(\hat{U}(Y_1) = U)}{\max_u P_U(u)}\right) \tag{58}$$
$$= \mathcal{L}(X \to Y_2|Y_1) + \mathcal{L}(X \to Y_1), \tag{59}$$
where the last inequality follows from the fact that $U - X - (Y_1, Y_2)$ implies $U - X - Y_2|Y_1$ (and, similarly, $U - X - Y_1$).
The fact that
$$\mathcal{L}(X \to Y_2|Y_1) = \operatorname{ess\,sup}_{P_{Y_1}} \mathcal{L}(X \to Y_2|Y_1 = y_1) \tag{60}$$
has been shown for discrete alphabets in Theorem 6 of (Issa et al., 2020). The extension to continuous alphabets is similar (with integrals replacing sums, and pdfs replacing pmfs, where appropriate).
Finally, it remains to show equation (5). We proceed by induction. The case $n = 2$ has already been shown above. Assume the inequality is true up to $n - 1$ variables; then
$$\mathcal{L}(X \to Y^n) \leq \mathcal{L}(X \to Y_1) + \operatorname{ess\,sup}_{P_{Y_1}} \mathcal{L}(X \to Y_2^n|Y_1 = y_1) \tag{61}$$
$$\leq \mathcal{L}(X \to Y_1) + \operatorname{ess\,sup}_{P_{Y_1}} \sum_{i=2}^{n} \mathcal{L}\left(X \to Y_i \mid Y_2^{i-1}, Y_1 = y_1\right) \tag{62}$$
$$\leq \sum_{i=1}^{n} \mathcal{L}\left(X \to Y_i \mid Y^{i-1}\right), \tag{63}$$
where the second inequality follows from the induction hypothesis.
Appendix B. Proof of equation (30)
For notational convenience, let $c_1 = \frac{\sigma_t\sqrt{2}}{\eta_t L}$ and $c_2 = \frac{2e}{c_1^2}$. Then,
$$\sum_{i=0}^{d-1} \frac{\Gamma\left(\frac{i+1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)} c_1^{i-(d-1)} = 1 + \sum_{i=0}^{d-2} \frac{\Gamma\left(\frac{i+1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)} c_1^{i-(d-1)} \leq 1 + \sum_{i=0}^{d-2} c_1^{i-(d-1)} \frac{\left(\frac{i+1}{2}\right)^{\frac{i}{2}} e^{-\frac{i+1}{2}} e^{\frac{1}{6(i+1)}}}{\left(\frac{d}{2}\right)^{\frac{d-1}{2}} e^{-\frac{d}{2}}} \tag{64}$$
$$\leq 1 + e^{\frac{1}{6}} \left(\frac{2e}{c_1^2 d}\right)^{\frac{d-1}{2}} \sum_{i=0}^{d-2} \left(\frac{(i+1)c_1^2}{2e}\right)^{\frac{i}{2}} \tag{65}$$
$$\leq 1 + e^{\frac{1}{6}} \left(\frac{c_2}{d}\right)^{\frac{d-1}{2}} \sum_{i=0}^{d-2} \left(\frac{d}{c_2}\right)^{\frac{i}{2}} \tag{66}$$
$$= 1 + e^{\frac{1}{6}} \left(\frac{c_2}{d}\right)^{\frac{d-1}{2}} \frac{\left(\frac{d}{c_2}\right)^{\frac{d-1}{2}} - 1}{\sqrt{\frac{d}{c_2}} - 1} \tag{67}$$
$$= 1 + e^{\frac{1}{6}} \frac{1 - \left(\frac{c_2}{d}\right)^{\frac{d-1}{2}}}{\sqrt{\frac{d}{c_2}} - 1} \xrightarrow{d\to\infty} 1, \tag{68}$$
where (64) follows from applying Stirling's approximation (34) to the numerator and the denominator (the $\sqrt{2\pi}$ factors cancel), (65) follows by rearranging and using $e^{\frac{1}{6(i+1)}} \leq e^{\frac{1}{6}}$, (66) follows since $i + 1 \leq d$, and (67) sums the geometric series (for $d > c_2$). Moreover,
$$\frac{V_2(d, \eta_t L)}{(2\pi\sigma_t^2)^{d/2}} = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)}\left(\frac{\eta_t L}{\sqrt{2\pi\sigma_t^2}}\right)^d = V_2\!\left(d, \frac{\eta_t L}{\sqrt{2\pi\sigma_t^2}}\right) \xrightarrow{d\to\infty} 0. \tag{69}$$
Combining equations (68) and (69) yields the desired limit.
Appendix C. Proof of equation (40)
To evaluate the integral in line (39), we write it in spherical coordinates:
$$h(d, 2, f_t, \eta_t L) = \int_{\mathcal{B}^c_2(0,\eta_t L)} \frac{1}{(2\pi\sigma_t^2)^{d/2}} \exp\left(-\frac{(\|w_t\|_2 - \eta_t L)^2}{2\sigma_t^2}\right) dw_t$$
$$= \frac{1}{(2\pi\sigma_t^2)^{d/2}} \int_{0}^{2\pi}\int_{0}^{\pi}\cdots\int_{0}^{\pi}\int_{\eta_t L}^{\infty} e^{-\frac{(\rho - \eta_t L)^2}{2\sigma_t^2}} \rho^{d-1} \sin^{d-2}(\phi_1)\sin^{d-3}(\phi_2)\cdots\sin(\phi_{d-2})\, d\rho\, d\phi_{d-1}\cdots d\phi_1$$
$$= \frac{2\pi}{(2\pi\sigma_t^2)^{d/2}} \left(\int_{0}^{\pi}\sin^{d-2}(\phi_1)\, d\phi_1\right)\cdots\left(\int_{0}^{\pi}\sin(\phi_{d-2})\, d\phi_{d-2}\right)\left(\int_{\eta_t L}^{\infty} e^{-\frac{(\rho - \eta_t L)^2}{2\sigma_t^2}} \rho^{d-1}\, d\rho\right). \tag{70}$$
Now, note that for any $n \in \mathbb{N}$, $\int_{0}^{\pi}\sin^n(x)\, dx = 2\int_{0}^{\pi/2}\sin^n(x)\, dx$, and
$$\int_{0}^{\pi/2}\sin^n(x)\, dx \stackrel{(a)}{=} \int_{0}^{1} \frac{u^n}{\sqrt{1-u^2}}\, du \stackrel{(b)}{=} \frac{1}{2}\int_{0}^{1} t^{\frac{n-1}{2}}(1-t)^{-\frac{1}{2}}\, dt \stackrel{(c)}{=} \frac{1}{2}\,\mathrm{Beta}\left(\frac{n+1}{2}, \frac{1}{2}\right) = \frac{\sqrt{\pi}\,\Gamma\left(\frac{n+1}{2}\right)}{2\,\Gamma\left(\frac{n}{2}+1\right)}, \tag{71}$$
where (a) follows from the change of variable $u = \sin x$, (b) follows from the change of variable $t = u^2$, (c) follows from the definition of the Beta function, $\mathrm{Beta}(s_1, s_2) = \int_{0}^{1} t^{s_1 - 1}(1-t)^{s_2 - 1}\, dt$, and the last equality is a known property of the Beta function ($\Gamma(1/2) = \sqrt{\pi}$). Consequently,
$$2\pi \int_{0}^{\pi}\sin^{d-2}(\phi_1)\, d\phi_1 \cdots \int_{0}^{\pi}\sin(\phi_{d-2})\, d\phi_{d-2} = 2\pi \prod_{i=1}^{d-2} \frac{\sqrt{\pi}\,\Gamma\left(\frac{i+1}{2}\right)}{\Gamma\left(\frac{i}{2}+1\right)} = 2\pi\, \pi^{\frac{d-2}{2}}\, \frac{\Gamma(1)}{\Gamma(d/2)} = \frac{2\pi^{d/2}}{\Gamma(d/2)}. \tag{72}$$
To evaluate the innermost integral, the following identity will be useful:
$$\int_{0}^{\infty} x^n e^{-x^2}\, dx = \frac{1}{2}\int_{0}^{\infty} t^{\frac{n+1}{2}-1} e^{-t}\, dt = \frac{\Gamma\left(\frac{n+1}{2}\right)}{2}, \tag{73}$$
where the first equality follows from the change of variable $t = x^2$. Then,
$$\int_{\eta_t L}^{\infty} e^{-\frac{(\rho - \eta_t L)^2}{2\sigma_t^2}} \rho^{d-1}\, d\rho = \int_{0}^{\infty} e^{-\frac{\rho^2}{2\sigma_t^2}} (\rho + \eta_t L)^{d-1}\, d\rho \tag{74}$$
$$= \int_{0}^{\infty} \sum_{i=0}^{d-1} \binom{d-1}{i} (\eta_t L)^{d-1-i} \rho^i e^{-\frac{\rho^2}{2\sigma_t^2}}\, d\rho \tag{75}$$
$$\stackrel{(a)}{=} \sum_{i=0}^{d-1} \binom{d-1}{i} (\eta_t L)^{d-1-i} \int_{0}^{\infty} \left(\sigma_t\sqrt{2}\right)^{i+1} t^i e^{-t^2}\, dt \tag{76}$$
$$\stackrel{(b)}{=} (\eta_t L)^{d-1}\left(\sigma_t\sqrt{2}\right) \sum_{i=0}^{d-1} \left(\frac{\sigma_t\sqrt{2}}{\eta_t L}\right)^i \frac{\Gamma((i+1)/2)}{2}, \tag{77}$$
where (a) follows from the change of variable $t = \rho/(\sigma_t\sqrt{2})$, and (b) follows from (73).
Finally, combining equations (70), (72), and (77), we get
$$h(d, 2, f_t, \eta_t L) = \frac{2\pi^{d/2}}{(2\pi\sigma_t^2)^{d/2}\,\Gamma(d/2)}\, (\eta_t L)^{d-1}\left(\sigma_t\sqrt{2}\right) \sum_{i=0}^{d-1} \left(\frac{\sigma_t\sqrt{2}}{\eta_t L}\right)^i \frac{\Gamma((i+1)/2)}{2} \tag{78}$$
$$= \left(\frac{\eta_t L}{\sigma_t\sqrt{2}}\right)^{d-1} \frac{1}{\Gamma(d/2)} \sum_{i=0}^{d-1} \left(\frac{\sigma_t\sqrt{2}}{\eta_t L}\right)^i \Gamma((i+1)/2). \tag{79}$$
Appendix D. Proof of equation (46)
The innermost integral of line (45) evaluates to
$$\int_{\eta_t L}^{\infty} \sup_{x_d: |x_d| \leq \eta_t L} f(w_d - x_d)\, dw_d = \int_{\eta_t L}^{\infty} f(w_d - \eta_t L)\, dw_d = \int_{0}^{\infty} f(w_d)\, dw_d = \frac{1}{2}, \tag{80}$$
where the first equality follows from the monotonicity assumptions, the second from a change of variable, and the third from the symmetry assumption. Similarly, the innermost integral of line (44) evaluates to
$$\int_{0}^{\infty} \sup_{x_d: |x_d| \leq \eta_t L} f(w_d - x_d)\, dw_d \tag{81}$$
$$= \int_{0}^{\eta_t L} \sup_{x_d: |x_d| \leq \eta_t L} f(w_d - x_d)\, dw_d + \int_{\eta_t L}^{\infty} \sup_{x_d: |x_d| \leq \eta_t L} f(w_d - x_d)\, dw_d \tag{82}$$
$$= \eta_t L f(0) + \frac{1}{2}. \tag{83}$$
Combining equations (44), (45), (80), and (83), we get
$$h^+(d) = \left(\eta_t L f(0) + \frac{1}{2}\right) \int_{\mathcal{B}^{d-1}_\infty(0,\eta_t L)^c \cap A_{d-1}} \prod_{i=1}^{d-1} \sup_{x_i: |x_i| \leq \eta_t L} f(w_i - x_i)\, dw^{d-1} \tag{84}$$
$$+ \frac{1}{2} \int_{\mathcal{B}^{d-1}_\infty(0,\eta_t L) \cap A_{d-1}} \prod_{i=1}^{d-1} \sup_{x_i: |x_i| \leq \eta_t L} f(w_i - x_i)\, dw^{d-1} \tag{85}$$
$$= \left(\eta_t L f(0) + \frac{1}{2}\right) h^+(d-1) + \frac{1}{2}\left(\eta_t L f(0)\right)^{d-1}, \tag{86}$$
where the second equality follows from the fact that $f$ is maximized at $0$, and $\mathcal{B}^{d-1}_\infty(0, \eta_t L) \cap A_{d-1}$ is a $(d-1)$-dimensional hypercube of side $\eta_t L$ (with volume $(\eta_t L)^{d-1}$). Now,
$$h(d) = 2^d h^+(d) = (1 + 2\eta_t L f(0))\, h(d-1) + (2\eta_t L f(0))^{d-1}. \tag{87}$$
Appendix E. Proof of Theorem 7
Consider any $f \in \mathcal{F}$, and let
$$f_+(x) = \begin{cases} f(x), & x \geq 0,\\ 0, & x < 0,\end{cases} \qquad\text{and}\qquad f_-(x) = \begin{cases} 0, & x \geq 0,\\ f(x), & x < 0.\end{cases} \tag{88}$$
Then
$$\mathbb{E}_f[X^2] = \int_{-\infty}^{+\infty} \left(f_-(x) + f_+(x)\right) x^2\, dx = \int_{0}^{\infty} 2 f_+(x)\, x^2\, dx, \tag{89}$$
where the second equality follows from the symmetry assumption. Note that $2f_+$ is a valid probability density over $[0, \infty)$, and let $X_+ \sim 2f_+$. Then, by the previous equation,
$$\mathbb{E}_f[X^2] = \mathbb{E}_{(2f_+)}\left[X_+^2\right] = \int_{0}^{\infty} 2x\left(1 - \Pr(X_+ \leq x)\right) dx \tag{90}$$
$$\geq \int_{0}^{1/(2f(0))} 2x\left(1 - 2x f(0)\right) dx = \frac{1}{12 f^2(0)}, \tag{91}$$
where the inequality uses $\Pr(X_+ \leq x) \leq 2x f(0)$ (since $f$ is non-increasing on $[0, \infty)$). Hence,
$$f(0) \geq \frac{1}{2\sqrt{3}\sqrt{\mathbb{E}_f[X^2]}} \geq \frac{1}{2\sqrt{3}\,\sigma}, \tag{92}$$
which is achieved by the uniform distribution $\mathcal{U}(-\sigma\sqrt{3}, \sigma\sqrt{3})$.
Appendix F. Proof of Theorem 8
First, we show that the limit of the right-hand side of equation (47) is given by the right-hand side of equation (48). Note that
$$\frac{V_1(d, \eta_t L)}{(\sigma_t\sqrt{2})^d} = V_1\!\left(d, \frac{\eta_t L}{\sigma_t\sqrt{2}}\right) \xrightarrow{d\to\infty} 0. \tag{93}$$
On the other hand,
$$\lim_{d\to\infty} \sum_{i=0}^{d-1} \frac{\left(\sqrt{2}\,\eta_t L/\sigma_t\right)^i}{i!} = \sum_{i=0}^{\infty} \frac{\left(\sqrt{2}\,\eta_t L/\sigma_t\right)^i}{i!} = e^{\sqrt{2}\,\eta_t L/\sigma_t}. \tag{94}$$
Since $T$ is finite, the limit and the sum are interchangeable, so that the above two equations yield the desired limit.
We now turn to the proof of inequality (47). For notational convenience, set $\lambda_t = \frac{\sqrt{2}}{\sigma_t}$ (so that $f_{t0}(x) = \frac{\lambda_t}{2} e^{-\lambda_t|x|}$ for all $x \in \mathbb{R}$) and $R_t = \eta_t L$. Since the noise satisfies the assumptions of Proposition 3, we get
$$\mathcal{L}(S \to W) \leq \sum_{t=1}^{T} \log\left(f_t(0) V_1(d, R_t) + \int_{\mathcal{B}^c_1(0,R_t)} \sup_{x_t \in \mathcal{B}_1(0,R_t)} f_t(w_t - x_t)\, dw_t\right) \tag{95}$$
$$= \sum_{t=1}^{T} \log\left(\frac{V_1(d, R_t)}{(2/\lambda_t)^d} + \int_{\mathcal{B}^c_1(0,R_t)} \sup_{x_t \in \mathcal{B}_1(0,R_t)} \left(\frac{\lambda_t}{2}\right)^d \exp\left\{-\lambda_t \|w_t - x_t\|_1\right\} dw_t\right). \tag{96}$$
Recall that $h(d, p, f_t, R_t)$ (cf. equation (17)) is defined to be the second term inside the log. Similarly to the strategy adopted in the proof of Theorem 6, we will derive a recurrence relation for $h$ in terms of $d$; as such, we will again suppress the dependence on $p$, $f_t$, and $R_t$ in the notation, and write $h(d)$ only (and correspondingly $h^+(d)$).

Lemma 12 Given $w \in \mathcal{B}^d_1(0, R)^c \cap A_d$ ($A_d$ defined in equation (18)),
$$\inf_{x \in \mathcal{B}^d_1(0,R)} \|w - x\|_1 = \sum_{i=1}^{d} w_i - R. \tag{97}$$
Proof Since we are minimizing a continuous function over a compact set, the infimum can be replaced with a minimum.
Claim: There exists a minimizer $x^*$ such that, for all $i$, $x^*_i \leq w_i$.
Proof of Claim: Consider any $x \in \mathcal{B}_1(0, R)$ such that there exists $j$ satisfying $x_j > w_j$. Note that $w_j \geq 0$ by assumption. Now define $x' = (x_1, \ldots, x_{j-1}, w_j, x_{j+1}, \ldots, x_d)$. Then $\|x'\|_1 < \|x\|_1$, so that $x' \in \mathcal{B}_1(0, R)$. Moreover, $\|w - x'\|_1 \leq \|w - x\|_1$, as desired.
Now,
$$\inf_{x \in \mathcal{B}^d_1(0,R)} \|w - x\|_1 = \inf_{\substack{x \in \mathcal{B}^d_1(0,R):\\ x_i \leq w_i\, \forall i}} \|w - x\|_1 = \inf_{\substack{x \in \mathcal{B}^d_1(0,R):\\ x_i \leq w_i\, \forall i}} \sum_{i=1}^{d} (w_i - x_i) = \sum_{i=1}^{d} w_i - R. \tag{98}$$

Given the above lemma, we will derive the recurrence relation by decomposing the integral over $\mathcal{B}^d_1(0, R_t)^c \cap A_d$ into two disjoint subsets: 1) $w^{d-1} \notin \mathcal{B}^{d-1}_1(0, R_t)$, in which case $w_d$ can take any value in $[0, \infty)$, and 2) $w^{d-1} \in \mathcal{B}^{d-1}_1(0, R_t)$, in which case $w_d$ must satisfy $w_d > R_t - \|w^{d-1}\|_1$:
$$h^+(d) = \int_{\mathcal{B}^d_1(0,R_t)^c \cap A_d} \sup_{x_t \in \mathcal{B}_1(0,R_t)} \left(\frac{\lambda_t}{2}\right)^d e^{-\lambda_t\left(\sum_{i=1}^{d} w_{ti} - R_t\right)} dw_t \tag{99}$$
$$= \int_{\mathcal{B}^{d-1}_1(0,R_t)^c \cap A_{d-1}} \left(\frac{\lambda_t}{2}\right)^{d-1} e^{-\lambda_t\left(\sum_{i=1}^{d-1} w_{ti} - R_t\right)} \left(\int_{0}^{\infty} \frac{\lambda_t}{2} e^{-\lambda_t w_d}\, dw_d\right) dw^{d-1} \tag{100}$$
$$+ \int_{\mathcal{B}^{d-1}_1(0,R_t) \cap A_{d-1}} \left(\frac{\lambda_t}{2}\right)^{d-1} e^{-\lambda_t\left(\sum_{i=1}^{d-1} w_{ti} - R_t\right)} \left(\int_{R_t - \sum_{i=1}^{d-1} w_i}^{\infty} \frac{\lambda_t}{2} e^{-\lambda_t w_d}\, dw_d\right) dw^{d-1} \tag{101}$$
$$= \frac{1}{2} h^+(d-1) + \int_{\mathcal{B}^{d-1}_1(0,R_t) \cap A_{d-1}} \left(\frac{\lambda_t}{2}\right)^{d-1} e^{-\lambda_t\left(\sum_{i=1}^{d-1} w_{ti} - R_t\right)} \cdot \frac{1}{2}\, e^{-\lambda_t\left(R_t - \sum_{i=1}^{d-1} w_i\right)} dw^{d-1} \tag{102}$$
$$= \frac{1}{2} h^+(d-1) + \frac{1}{2}\left(\frac{\lambda_t}{2}\right)^{d-1} \frac{V_1(d-1, R_t)}{2^{d-1}} \tag{103}$$
$$= \frac{1}{2} h^+(d-1) + \frac{1}{2}\left(\frac{\lambda_t R_t}{2}\right)^{d-1} \frac{1}{(d-1)!}. \tag{104}$$
Hence,
$$h(d) = 2^d h^+(d) = h(d-1) + \frac{(\lambda_t R_t)^{d-1}}{(d-1)!}. \tag{105}$$
It is easy to check that $h(1) = 1$, and hence
$$h(d) = \sum_{i=0}^{d-1} \frac{(\lambda_t R_t)^i}{i!} \tag{106}$$
satisfies the base case and the recurrence relation. Re-substituting $\eta_t L$ and $\sqrt{2}/\sigma_t$ for $R_t$ and $\lambda_t$, respectively, yields the desired result in equation (47).
Appendix G. Proof of Theorem 9
Let $R_t = \eta_t L$. Since the noise satisfies the assumptions of Proposition 3, we get
$$\mathcal{L}(S \to W) \leq \sum_{t=1}^{T} \log\left(f_t(0) V_1(d, R_t) + \int_{\mathcal{B}^c_1(0,R_t)} \sup_{x_t \in \mathcal{B}_1(0,R_t)} f_t(w_t - x_t)\, dw_t\right) \tag{107}$$
$$= \sum_{t=1}^{T} \log\left(\frac{V_1(d, R_t)}{(2\pi\sigma_t^2)^{d/2}} + \int_{\mathcal{B}^c_1(0,R_t)} \sup_{x_t \in \mathcal{B}_1(0,R_t)} \frac{1}{(2\pi\sigma_t^2)^{d/2}} \exp\left(-\frac{\|w_t - x_t\|_2^2}{2\sigma_t^2}\right) dw_t\right). \tag{108}$$
Consider
$$h^+(d) = \int_{\mathcal{B}_1(0,R_t)^c \cap A_d} \sup_{x_t \in \mathcal{B}_1(0,R_t)} \frac{1}{(2\pi\sigma_t^2)^{d/2}} \exp\left(-\frac{\|w_t - x_t\|_2^2}{2\sigma_t^2}\right) dw_t. \tag{109}$$
First we solve $\inf_{x_t \in \mathcal{B}_1(0,R_t)} \|w_t - x_t\|_2$. If $w_t \in A_d$, then the infimum is achieved for $x^*_t \in A_d$ as well (one can simply flip the sign of any negative component, which cannot increase the distance). In the subspace $A_d$, the boundary of the $L_1$ ball is defined by the hyperplane $\sum_{i=1}^{d} x_{ti} = R_t$. As such, finding the minimum distance corresponds to projecting the point $w_t$ onto the given hyperplane:
$$\inf_{x_t \in \mathcal{B}_1(0,R_t)} \|w_t - x_t\|_2 = \min_{\substack{x_t \in \mathcal{B}_1(0,R_t) \cap A_d:\\ \sum_{i=1}^{d} x_{ti} = R_t}} \|w_t - x_t\|_2 = \frac{\sum_{i=1}^{d} w_{ti} - R_t}{\sqrt{d}}. \tag{110}$$
Now,
$$h^+(d) = \int_{\mathcal{B}_1(0,R_t)^c \cap A_d} \frac{1}{(2\pi\sigma_t^2)^{d/2}} \exp\left(-\frac{\left(\sum_{i=1}^{d} w_{ti} - R_t\right)^2}{2 d \sigma_t^2}\right) dw_t. \tag{111}$$
For notational convenience, we drop the $t$ subscript in the following. We perform a change of variable as follows: $\tilde{w}_d = \sum_{i=1}^{d} w_i$. Hence, for $w \notin \mathcal{B}_1(0, R)$, $\tilde{w}_d \geq R$. Since $w_d \geq 0$, we have $\sum_{i=1}^{d-1} w_i \leq \tilde{w}_d$. For $x \in \mathbb{R}$, define $S(x) := \{w^{d-1} \in \mathbb{R}_+^{d-1} : \sum_{i=1}^{d-1} w_i \leq x\}$. Then,
$$h^+(d) = \int_{R}^{\infty} \int_{S(\tilde{w}_d)} \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\frac{(\tilde{w}_d - R)^2}{2d\sigma^2}}\, dw^{d-1}\, d\tilde{w}_d \tag{112}$$
$$= \frac{1}{(2\pi\sigma^2)^{d/2}} \int_{R}^{\infty} e^{-\frac{(\tilde{w}_d - R)^2}{2d\sigma^2}} \left(\int_{S(\tilde{w}_d)} dw^{d-1}\right) d\tilde{w}_d \tag{113}$$
$$\stackrel{(a)}{=} \frac{1}{(2\pi\sigma^2)^{d/2}\,(d-1)!} \int_{R}^{\infty} \tilde{w}_d^{d-1}\, e^{-\frac{(\tilde{w}_d - R)^2}{2d\sigma^2}}\, d\tilde{w}_d \tag{114}$$
$$\stackrel{(b)}{=} \frac{R^{d-1}\left(\sigma\sqrt{2d}\right)}{(2\pi\sigma^2)^{d/2}\,(d-1)!} \sum_{i=0}^{d-1} \left(\frac{\sigma\sqrt{2d}}{R}\right)^i \frac{\Gamma((i+1)/2)}{2}, \tag{115}$$
where (a) follows from the fact that the innermost integral corresponds to the volume of a scaled probability simplex (scaled by $\tilde{w}_d$), and (b) follows from the same computations as in Equations (74) to (77) (with $\tilde{\sigma} = \sigma\sqrt{d}$). Noting that $h(d) = 2^d h^+(d)$ yields the desired term in equation (50).
References
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016a.

Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016b.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, March 2003. ISSN 1532-4435.

Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/b22b257ad0519d4500539da3c8bcf4dd-Paper.pdf.

Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019. URL http://jmlr.org/papers/v20/17-612.html.

R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff. Learners that use little information. Volume 83 of Proceedings of Machine Learning Research, pages 25–55. PMLR, 07–09 Apr 2018.

Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer, 2003. ISBN 3-540-23122-6.

Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening mutual information based bounds on generalization error. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 587–591, 2019. doi: 10.1109/ISIT.2019.8849590.

Yuheng Bu, Shaofeng Zou, and Venugopal V. Veeravalli. Tightening mutual information-based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 1(1):121–130, 2020. doi: 10.1109/JSAIT.2020.2991139.

Xiangyi Chen, Steven Z. Wu, and Mingyi Hong. Understanding gradient clipping in private SGD: A geometric perspective. Advances in Neural Information Processing Systems, 33:13773–13782, 2020.

Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. CoRR, abs/1703.11008, 2017. URL https://arxiv.org/abs/1703.11008.

Amedeo Roberto Esposito and Michael Gastpar. From generalisation error to transportation-cost inequalities and back. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 294–299, 2022. doi: 10.1109/ISIT50566.2022.9834354.

Amedeo Roberto Esposito, Michael Gastpar, and Ibrahim Issa. Generalization error bounds via Rényi-, f-divergences and maximal leakage. IEEE Transactions on Information Theory, 67(8):4986–5004, 2021. doi: 10.1109/TIT.2021.3085190.

Hassan Hafez-Kolahi, Zeinab Golgooni, Shohreh Kasaei, and Mahdieh Soleymani. Conditioning and processing: Techniques to improve information-theoretic generalization bounds. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 16457–16467. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/befe5b0172188ad14d48c3ebe9cf76bf-Paper.pdf.

Mahdi Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M. Roy, and Gintare Karolina Dziugaite. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms, 2020. URL https://arxiv.org/abs/2004.12983.

Fredrik Hellström and Giuseppe Durisi. Generalization bounds via information density and conditional information density. IEEE Journal on Selected Areas in Information Theory, 1(3):824–839, 2020. doi: 10.1109/JSAIT.2020.3040992.

I. Issa, A. B. Wagner, and S. Kamath. An operational approach to information leakage. IEEE Transactions on Information Theory, 66(3):1625–1657, 2020. doi: 10.1109/TIT.2019.2962804.

Ibrahim Issa, Amedeo Roberto Esposito, and Michael Gastpar. Strengthened information-theoretic bounds on the generalization error. In 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, July 7–12, 2019.

Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SJgIPJBFvH.

Gábor Lugosi and Gergely Neu. Generalization bounds via convex analysis. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 3524–3546. PMLR, 02–05 Jul 2022. URL https://proceedings.mlr.press/v178/lugosi22a.html.

David A. McAllester. A PAC-Bayesian tutorial with a dropout bound. CoRR, abs/1307.2118, 2013. URL http://arxiv.org/abs/1307.2118.

Jeffrey Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M. Roy. Information-theoretic generalization bounds for SGLD via data-dependent estimates. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/05ae14d7ae387b93370d142d82220f1b-Paper.pdf.

Gergely Neu, Gintare Karolina Dziugaite, Mahdi Haghifam, and Daniel M. Roy. Information-theoretic generalization bounds for stochastic gradient descent. In Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3526–3545. PMLR, 15–19 Aug 2021. URL https://proceedings.mlr.press/v134/neu21a.html.

Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. 2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550, 2018.

Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, and Sanjiv Kumar. AdaClip: Adaptive clipping for private SGD. arXiv preprint arXiv:1908.07643, 2019.

Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 1232–1240. PMLR, 09–11 May 2016.

S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

R. Sibson. Information radius. Z. Wahrscheinlichkeitstheorie und verwandte Gebiete, 14:149–160, 1969.

Thomas Steinke and Lydia Zakynthinou. Reasoning about generalization via conditional mutual information. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3437–3452. PMLR, 09–12 Jul 2020. URL https://proceedings.mlr.press/v125/steinke20a.html.

T. van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, July 2014.

V. N. Vapnik and A. Y. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3), 1991.

Sergio Verdú. α-mutual information. In 2015 Information Theory and Applications Workshop, ITA 2015, San Diego, CA, USA, February 1–6, 2015, pages 1–6, 2015.

Bohan Wang, Huishuai Zhang, Jieyu Zhang, Qi Meng, Wei Chen, and Tie-Yan Liu. Optimizing information-theoretical generalization bound via anisotropic noise of SGLD. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6–14, 2021, virtual, pages 26080–26090, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/db2b4182156b2f1f817860ac9f409ad7-Abstract.html.

Hao Wang, Rui Gao, and Flavio P. Calmon. Generalization bounds for noisy iterative algorithms using properties of additive noise channels. Journal of Machine Learning Research, 24(26):1–43, 2023.

A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pages 2521–2530, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, February 2021. ISSN 0001-0782. doi: 10.1145/3446776. URL https://doi.org/10.1145/3446776.

Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. In International Conference on Learning Representations, 2018.