
Practical and Matching Gradient Variance Bounds for

Black-Box Variational Bayesian Inference

Kyurae Kim 1, Kaiwen Wu 1, Jisu Oh 2, Jacob R. Gardner 1

Abstract

Understanding the gradient variance of black-box variational inference (BBVI) is a crucial step for establishing its convergence and developing algorithmic improvements. However, existing studies have yet to show that the gradient variance of BBVI satisfies the conditions used to study the convergence of stochastic gradient descent (SGD), the workhorse of BBVI. In this work, we show that BBVI satisfies a matching bound corresponding to the ABC condition used in the SGD literature when applied to smooth and quadratically-growing log-likelihoods. Our results generalize to nonlinear covariance parameterizations widely used in the practice of BBVI. Furthermore, we show that the variance of the mean-field parameterization has provably superior dimensional dependence.

1. Introduction

Variational inference (VI; Jordan et al. 1999; Blei et al. 2017; Zhang et al. 2019) algorithms are fast and scalable Bayesian inference methods widely applied in the fields of statistics and machine learning. In particular, black-box VI (BBVI; Ranganath et al. 2014; Titsias & Lázaro-Gredilla 2014) leverages stochastic gradient descent (SGD; Robbins & Monro 1951; Bottou 1999) for inference of non-conjugate probabilistic models. With the development of bijectors (Kucukelbir et al., 2017; Dillon et al., 2017; Fjelde et al., 2020), most of the methodological advances in BBVI have now been abstracted out through various probabilistic programming frameworks (Carpenter et al., 2017; Ge et al., 2018; Dillon et al., 2017; Bingham et al., 2019; Salvatier et al., 2016).

1 Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia, Pennsylvania, United States. 2 Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States. Correspondence to: Kyurae Kim <kyrkim@seas.upenn.edu>, Jacob R. Gardner <jrgardner@seas.upenn.edu>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Despite the advances of BBVI, little is known about its the-

oretical properties. Even when restricted to the location-

scale family (Deﬁnition 2), it is unknown whether BBVI is

guaranteed to converge without having to modify the algo-

rithms used in practice, for example, by enforcing bounded

domains, bounded support, bounded gradients, and such.

This theoretical insight is necessary since BBVI meth-

ods are known to be less robust (Yao et al.,2018;Dhaka

et al.,2020;Welandawe et al.,2022;Dhaka et al.,2021;

Domke,2020) compared to other inference methods such

as Markov chain Monte Carlo. Although progress has been

made to formalize the theory of BBVI with some gener-

ality, the gap between our understanding of BBVI and the

convergence guarantees of SGD remains open. For exam-

ple, Domke (2019;2020) provided smoothness and gradi-

ent variance guarantees. Still, these results do not yet yield

a full convergence guarantee and do not extend to nonlin-

ear covariance parameterizations used in practice.

In this work, we investigate whether recent progress

in relaxing the gradient variance assumptions used in

SGD (Tseng,1998;Vaswani et al.,2019;Schmidt & Roux,

2013;Bottou et al.,2018;Gower et al.,2019;2021b;

Nguyen et al.,2018) apply to BBVI. These extensions

have led to new insights that the structure of the gradi-

ent bounds can have non-trivial interactions with gradient-

adaptive SGD algorithms (Zhang et al.,2022). For ex-

ample, when the “interpolation assumption” (the gradient

noise converges to 0; Schmidt & Roux 2013;Ma et al.

2018;Vaswani et al. 2019) does not hold, ADAM (Kingma

& Ba,2015) provably diverges with certain stepsize com-

binations (Zhang et al.,2022). Until BBVI can be shown

to conform to the assumptions used by these recent works,

it is unclear how these results relate to BBVI.

While the variance of BBVI gradient estimators has been

studied before (Xu et al.,2019;Domke,2019;Mohamed

et al.,2020a;Fujisawa & Sato,2021), the connection with

the conditions used in SGD has yet to be established. As

such, we answer the following question:

Does the gradient variance of BBVI conform to

the conditions assumed in convergence guaran-


tees of SGD without modifying the implementa-

tions used in practice?

The answer is yes! Assuming the target log joint distribution is smooth and quadratically growing, we show that the gradient variance of BBVI satisfies the ABC condition (Assumption 2) used by Polyak & Tsypkin (1973); Khaled & Richtárik (2023); Gower et al. (2021b). Our analysis extends the previous result of Domke (2019) to covariance parameterizations involving nonlinear functions for conditioning the diagonal (see Section 2.5), as commonly done in practice. Furthermore, we prove that the gradient variance of the mean-field parameterization (Peterson & Anderson, 1987; Peterson & Hartman, 1989; Hinton & van Camp, 1993) results in better dimensional dependence compared to full-rank ones.

Overall, our results should act as a key ingredient for obtaining a full convergence guarantee for BBVI, as recently done by Kim et al. (2023).

Our contributions are summarized as follows:

❶ We provide upper bounds on the gradient variance of BBVI that match the ABC condition (Assumption 2) used for analyzing SGD.

➤ Theorems 1 and 2 do not require any modification of the algorithms used in practice.

➤ Theorem 3 achieves better constants under the stronger bounded entropy assumption.

❷ Our analysis applies to BBVI parameterizations (Section 2.5) widely used in practice (Table 1).

➤ Lemma 1 enables the bounds to cover nonlinear covariance parameterizations.

➤ Lemma 3 and Remark 4 show that the gradient variance of the mean-field parameterization has superior dimensional scaling.

❸ We provide a matching lower bound (Theorem 4) on the gradient variance, showing that, under the stated assumptions, the ABC condition is the weakest assumption applicable to BBVI.

2. Preliminaries

Notation   Random variables are denoted in serif, while their realizations are in regular font (i.e., $x$ is a realization of the random variable $\mathsf{x}$, and $\boldsymbol{x}$ is a realization of the vector-valued $\boldsymbol{\mathsf{x}}$). $\|\boldsymbol{x}\|_2 = \sqrt{\langle \boldsymbol{x}, \boldsymbol{x} \rangle} = \sqrt{\boldsymbol{x}^{\top}\boldsymbol{x}}$ denotes the Euclidean norm, while $\|\boldsymbol{A}\|_{\mathrm{F}} = \sqrt{\mathrm{tr}(\boldsymbol{A}^{\top}\boldsymbol{A})}$ is the Frobenius norm, where $\mathrm{tr}(\boldsymbol{A}) = \sum_{i=1}^{d} A_{ii}$ is the matrix trace.

2.1. Variational Inference

Variational inference (Peterson & Anderson, 1987; Hinton & van Camp, 1993) is a family of inference algorithms devised to solve the problem

$$\mathrm{minimize}_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} \; D_{\mathrm{KL}}(q_{b,\boldsymbol{\lambda}}, \pi), \tag{1}$$

where $q_{b,\boldsymbol{\lambda}}$ is called the "variational approximation", $\pi$ is the distribution of interest, and $D_{\mathrm{KL}}$ is the (exclusive) Kullback-Leibler (KL) divergence.

For Bayesian inference, $\pi$ is the posterior distribution

$$\pi(\boldsymbol{z}) \propto \ell(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z}) = \ell(\boldsymbol{x}, \boldsymbol{z}),$$

where $\ell(\boldsymbol{x} \mid \boldsymbol{z})$ is the likelihood and $p(\boldsymbol{z})$ is the prior. In practice, one only has access to the likelihood and the prior. Thus, Equation (1) cannot be directly solved. Instead, we can minimize the negative evidence lower bound (ELBO; Jordan et al. 1999) function $F(\boldsymbol{\lambda})$.

Evidence Lower Bound   More formally, we solve

$$\mathrm{minimize}_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} \; F(\boldsymbol{\lambda}),$$

where $F$ is defined as

$$F(\boldsymbol{\lambda}) \triangleq -\mathbb{E}_{\boldsymbol{\mathsf{z}} \sim q_{b,\boldsymbol{\lambda}}}\left[\log \ell(\boldsymbol{x}, \boldsymbol{\mathsf{z}})\right] - H(q_{b,\boldsymbol{\lambda}}) \tag{2}$$
$$= -\mathbb{E}_{\boldsymbol{\mathsf{z}} \sim q_{b,\boldsymbol{\lambda}}}\left[\log \ell(\boldsymbol{x} \mid \boldsymbol{\mathsf{z}})\right] + D_{\mathrm{KL}}(q_{b,\boldsymbol{\lambda}}, p), \tag{3}$$

$\boldsymbol{\mathsf{z}}$ is the latent (random) variable, $q_{b,\boldsymbol{\lambda}}$ is the variational distribution, $b$ is a bijector (support transformation), and $H$ is the differential entropy.

The bijector $b$ (Dillon et al., 2017; Fjelde et al., 2020; Leger, 2023) is a differentiable bijective map that is used to de-constrain the support of constrained random variables. For example, when $\mathsf{z}$ is expected to follow a gamma distribution, using $\zeta = b(z)$ with $b(z) = \log z$ lets us work with $\zeta$, which can be any real number, unlike $z$. The use of $b^{-1}$ corresponds to the automatic differentiation VI formulation (ADVI; Kucukelbir et al. 2017), which is now widespread.
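As a minimal sketch (our own NumPy code, not the API of any probabilistic programming framework), a log bijector de-constraining a positive latent variable looks as follows; the log-absolute-determinant Jacobian term is the correction that later enters the ELBO definitions in Section 2.3.

```python
import numpy as np

# Hypothetical log bijector b(z) = log(z) for a positive latent z
# (e.g., one expected to follow a gamma distribution).
def bijector(z):
    return np.log(z)                  # constrained -> unconstrained

def bijector_inv(zeta):
    return np.exp(zeta)               # unconstrained -> constrained

def log_abs_det_jac_inv(zeta):
    # d/d zeta exp(zeta) = exp(zeta), so log|J_{b^{-1}}(zeta)| = zeta.
    return zeta

# Density of z = b^{-1}(zeta) pushed forward to the unconstrained space:
# log p_zeta(zeta) = log p_z(b^{-1}(zeta)) + log|J_{b^{-1}}(zeta)|.
def unconstrained_logpdf(zeta, logpdf_z):
    return logpdf_z(bijector_inv(zeta)) + log_abs_det_jac_inv(zeta)
```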

2.2. Variational Family

In this work, we speciﬁcally consider the location-scale

variational family with a standardized base distribution.

Definition 1 (Reparameterization Function). An affine mapping $\boldsymbol{t}_{\boldsymbol{\lambda}} : \mathbb{R}^{d} \to \mathbb{R}^{d}$ defined as

$$\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}) \triangleq \boldsymbol{C}\boldsymbol{u} + \boldsymbol{m},$$

with $\boldsymbol{\lambda}$ containing the parameters for forming the location $\boldsymbol{m} \in \mathbb{R}^{d}$ and scale $\boldsymbol{C} = \boldsymbol{C}(\boldsymbol{\lambda}) \in \mathbb{R}^{d \times d}$, is called the (location-scale) reparameterization function.

Definition 2 (Location-Scale Family). Let $\varphi$ be some $d$-dimensional distribution. Then, $q_{\boldsymbol{\lambda}}$ such that

$$\boldsymbol{\mathsf{\zeta}} \sim q_{\boldsymbol{\lambda}} \;\Leftrightarrow\; \boldsymbol{\mathsf{\zeta}} \stackrel{d}{=} \boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}); \quad \boldsymbol{\mathsf{u}} \sim \varphi$$

is said to be a member of the location-scale family indexed by the base distribution $\varphi$ and parameter $\boldsymbol{\lambda}$.
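A minimal sketch (our own NumPy code, assuming a standard Gaussian base distribution and a dense scale matrix passed in directly) of the reparameterization function and of sampling from the resulting location-scale family:

```python
import numpy as np

def t(m, C, u):
    """Location-scale reparameterization t_lambda(u) = C u + m (Definition 1)."""
    return C @ u + m

def sample_location_scale(m, C, n_samples, rng):
    """Draw zeta ~ q_lambda by pushing standardized base samples u ~ phi
    through t_lambda; a standard Gaussian base satisfies Assumption 1."""
    d = m.shape[0]
    u = rng.standard_normal((n_samples, d))   # iid standardized components
    return u @ C.T + m                        # each row is t_lambda(u)

rng = np.random.default_rng(0)
m = np.zeros(3)
C = np.diag([1.0, 0.5, 2.0])
zeta = sample_location_scale(m, C, n_samples=5, rng=rng)
```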


This family includes commonly used variational fami-

lies, such as the mean-ﬁeld Gaussian, full-rank Gaussian,

Student-T, and other elliptical distributions.

Remark 1 (Entropy of Location-Scale Distributions). The differential entropy of a location-scale family distribution (Definition 2) is

$$H(q_{\boldsymbol{\lambda}}) = H(\varphi) + \log\left|\det \boldsymbol{C}\right|.$$
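For a standard Gaussian base distribution, $H(\varphi) = (d/2)\log(2\pi e)$, so the entropy in Remark 1 can be evaluated directly from the scale matrix; a small sketch (ours), assuming $\boldsymbol{C}$ is diagonal or triangular with a positive diagonal:

```python
import numpy as np

def entropy_location_scale_gaussian(C):
    """H(q_lambda) = H(phi) + log|det C| for a standard Gaussian base (Remark 1).
    Assumes C is diagonal or triangular with positive diagonal, so the
    log-determinant is the sum of the log-diagonals."""
    d = C.shape[0]
    base_entropy = 0.5 * d * np.log(2.0 * np.pi * np.e)
    return base_entropy + np.sum(np.log(np.diag(C)))
```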

Definition 3 (ADVI Family; Kucukelbir et al. 2017). Let $q_{\boldsymbol{\lambda}}$ be some $d$-dimensional distribution. Then, $q_{b,\boldsymbol{\lambda}}$ such that

$$\boldsymbol{\mathsf{z}} \sim q_{b,\boldsymbol{\lambda}} \;\Leftrightarrow\; \boldsymbol{\mathsf{z}} \stackrel{d}{=} b^{-1}(\boldsymbol{\mathsf{\zeta}}); \quad \boldsymbol{\mathsf{\zeta}} \sim q_{\boldsymbol{\lambda}}$$

is said to be a member of the ADVI family with the base distribution $q_{\boldsymbol{\lambda}}$ parameterized by $\boldsymbol{\lambda}$.

We impose the following assumptions on the base distribution $\varphi$.

Assumption 1 (Base Distribution). $\varphi$ is a $d$-dimensional distribution such that $\boldsymbol{\mathsf{u}} \sim \varphi$ and $\boldsymbol{\mathsf{u}} = (\mathsf{u}_1, \ldots, \mathsf{u}_d)$ with independently and identically distributed components. Furthermore, $\varphi$ is (i) symmetric and standardized such that $\mathbb{E}\,\mathsf{u}_i = 0$, $\mathbb{E}\,\mathsf{u}_i^2 = 1$, $\mathbb{E}\,\mathsf{u}_i^3 = 0$, and (ii) has finite kurtosis $\mathbb{E}\,\mathsf{u}_i^4 = \kappa_{\varphi} < \infty$.

These assumptions are already satisfied in practice by, for example, generating $\mathsf{u}_i$ from a univariate normal or a Student-t distribution with $\nu > 4$ degrees of freedom.

2.3. Reparameterization Trick

When restricted to location-scale families (Definitions 2 and 3), we can invoke a change of variables, more commonly known as the "reparameterization trick," such that

$$\mathbb{E}_{\boldsymbol{\mathsf{z}} \sim q_{b,\boldsymbol{\lambda}}} \log \ell(\boldsymbol{x}, \boldsymbol{\mathsf{z}}) = \mathbb{E}_{\boldsymbol{\mathsf{\zeta}} \sim q_{\boldsymbol{\lambda}}} \log \ell\left(\boldsymbol{x}, b^{-1}(\boldsymbol{\mathsf{\zeta}})\right) = \mathbb{E}_{\boldsymbol{\mathsf{u}} \sim \varphi} \log \ell\left(\boldsymbol{x}, b^{-1}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\right)$$

through the law of the unconscious statistician. Differentiating this results in the reparameterization or path gradient, which often achieves lower variance than alternatives (Xu et al., 2019; Mohamed et al., 2020b).

Objective Function For generality, we represent our ob-

jective as a composite inﬁnite sum problem:

Definition 4 (Composite Infinite Sum).

$$F(\boldsymbol{\lambda}) = \mathbb{E}_{\boldsymbol{\mathsf{u}} \sim \varphi}\, f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}})) + h(\boldsymbol{\lambda}),$$

where $(\boldsymbol{\lambda}, \boldsymbol{\mathsf{u}}) \mapsto f \circ \boldsymbol{t}_{\boldsymbol{\lambda}} : \mathbb{R}^{p} \times \mathbb{R}^{d} \to \mathbb{R}$ is some bivariate stochastic function of $\boldsymbol{\lambda}$ and the "noise source" $\boldsymbol{\mathsf{u}}$, while $h$ is a deterministic regularization term.

By appropriately defining $f$ and $h$, we retrieve the two most common formulations of the ELBO in Equation (2) and Equation (3), respectively:

Definition 5 (ELBO Entropy-Regularized Form).

$$f_{\mathrm{H}}(\boldsymbol{\zeta}) = \underbrace{-\log \ell\left(\boldsymbol{x}, b^{-1}(\boldsymbol{\zeta})\right)}_{\text{joint likelihood}} - \log\left|\boldsymbol{J}_{b^{-1}}(\boldsymbol{\zeta})\right|, \qquad h_{\mathrm{H}}(\boldsymbol{\lambda}) = -H(q_{\boldsymbol{\lambda}}).$$

Definition 6 (ELBO KL-Regularized Form).

$$f_{\mathrm{KL}}(\boldsymbol{\zeta}) = \underbrace{-\log \ell\left(\boldsymbol{x} \mid b^{-1}(\boldsymbol{\zeta})\right)}_{\text{likelihood}} - \log\left|\boldsymbol{J}_{b^{-1}}(\boldsymbol{\zeta})\right|, \qquad h_{\mathrm{KL}}(\boldsymbol{\lambda}) = D_{\mathrm{KL}}(q_{\boldsymbol{\lambda}}, p).$$

Here, $\boldsymbol{J}_{b^{-1}}$ is the Jacobian of the bijector. Since $D_{\mathrm{KL}}(q_{\boldsymbol{\lambda}}, p)$ is seldom available in tractable form, the entropy-regularized form is the most widely used, while the KL-regularized form is common for Gaussian processes and variational autoencoders.

Gradient Estimator   We denote the $M$-sample estimator of the gradient of $F$ as

$$\boldsymbol{\mathsf{g}}(\boldsymbol{\lambda}) \triangleq \frac{1}{M} \sum_{m=1}^{M} \boldsymbol{\mathsf{g}}_m(\boldsymbol{\lambda}), \quad \text{where} \tag{4}$$
$$\boldsymbol{\mathsf{g}}_m(\boldsymbol{\lambda}) \triangleq \nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}_m)) + \nabla h(\boldsymbol{\lambda}); \quad \boldsymbol{\mathsf{u}}_m \sim \varphi. \tag{5}$$

We will occasionally drop $\boldsymbol{\lambda}$ for clarity.
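A sketch (ours) of the estimator in Equations (4) and (5) for the entropy-regularized form, specialized to a mean-field Gaussian with a softplus diagonal conditioner (Section 2.5) and an identity bijector; `grad_neg_logjoint` is a hypothetical callable returning $\nabla f_{\mathrm{H}}$, and the analytic gradients follow from the chain rule for $\boldsymbol{t}_{\boldsymbol{\lambda}}$:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def softplus_d(x):
    return 1.0 / (1.0 + np.exp(-x))     # derivative of softplus

def elbo_grad_estimate(m, s, grad_neg_logjoint, M, rng):
    """M-sample reparameterization gradient (Equations 4 and 5) of the
    entropy-regularized ELBO (Definition 5), mean-field Gaussian with a
    softplus conditioner and identity bijector (a sketch, not any
    framework's estimator)."""
    d = m.shape[0]
    scale = softplus(s)                  # psi(s), the diagonal of C
    g_m = np.zeros(d)
    g_s = np.zeros(d)
    for _ in range(M):
        u = rng.standard_normal(d)       # u ~ phi (standard Gaussian base)
        z = scale * u + m                # z = t_lambda(u)
        gf = grad_neg_logjoint(z)        # gradient of f_H at t_lambda(u)
        g_m += gf / M                    # chain rule w.r.t. the location m
        g_s += gf * softplus_d(s) * u / M   # chain rule w.r.t. the scale s
    # gradient of the deterministic regularizer h_H(lambda) = -H(q_lambda)
    g_s += -softplus_d(s) / scale
    return g_m, g_s
```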

2.4. Gradient Variance Assumptions in

Stochastic Gradient Descent

Gradient Variance Assumptions in SGD   For a while, most convergence proofs for SGD have relied on the "bounded variance" assumption: for a gradient estimator $\boldsymbol{\mathsf{g}}$, $\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \le \sigma^2$ for some finite constant $\sigma^2$. This assumption is problematic because ❶ these types of global constants result in loose bounds, and ❷ it directly contradicts the strong-convexity assumption (Nguyen et al., 2018). Thus, retrieving previously known SGD convergence rates under weaker assumptions has been an important research direction (Tseng, 1998; Vaswani et al., 2019; Schmidt & Roux, 2013; Bottou et al., 2018; Gower et al., 2019; 2021b; Nguyen et al., 2018).

ABC Condition In this work, we focus on the re-

cently rediscovered expected smoothness, or ABC, condi-

tion (Polyak & Tsypkin,1973;Gower et al.,2021b).

Assumption 2 (Expected Smoothness; ABC). $\boldsymbol{\mathsf{g}}$ is said to satisfy the expected smoothness condition if

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}(\boldsymbol{\lambda})\|_2^2 \le 2A\left(F(\boldsymbol{\lambda}) - F^*\right) + B \|\nabla F(\boldsymbol{\lambda})\|_2^2 + C$$

for some finite $A, B, C \ge 0$, where $F^* = \inf_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} F(\boldsymbol{\lambda})$.

As shown by Khaled & Richtárik (2023), this condition is not only strictly weaker than many of the previously used assumptions but also generalizes them by retrieving known convergence rates when tweaking the constants.


Table 1: Survey of Parameterizations Used in Black-Box Variational Inference

Framework                        Version   Parameterization       Conditioner   Code
TURING (Ge et al., 2018)         v0.23.2   Nonlinear Mean-field   softplus      link
STAN (Carpenter et al., 2017)    v2.31.0   Nonlinear Mean-field   exp           link
                                           Linear Cholesky                      link
PYRO (Bingham et al., 2019)      v0.10.1   Nonlinear Mean-field   softplus      link
                                           Linear Cholesky 1                    link
PYMC3 (Salvatier et al., 2016)   v5.0.1    Nonlinear Mean-field   softplus      link
                                           Nonlinear Cholesky     softplus      link
GPYTORCH (Gardner et al., 2018)  v1.9.0    Linear Cholesky                      link
                                           Linear Mean-field                    link

1 NumPyro also provides a low-rank Cholesky parameterization, which is non-linearly conditioned, but the full-rank Cholesky is linear.
* TensorFlow Probability (Dillon et al., 2017) was not included as it does not provide a fully pre-configured variational family (although tfp.experimental.vi.build*posterior exists, the parameterization is user-supplied).

With the ABC condition, for non-convex $L$-smooth functions, under an appropriately chosen stepsize $\gamma \le 1/(LB)$ (otherwise the bound may blow up, as explained by Khaled & Richtárik), SGD converges to a neighborhood of a stationary point whose size scales with $\gamma C$, at an $\mathcal{O}(1/(\gamma K))$ rate over $K$ iterations. Minor variants of the ABC condition have also been used to prove convergence of SGD for quasar convex functions (Gower et al., 2021a), stochastic heavy-ball/momentum methods (Liu & Yuan, 2022), and stochastic proximal methods (Li & Milzarek, 2022). Given the influx of results based on the ABC condition, connecting with it would significantly broaden our theoretical understanding of BBVI.
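As a toy illustration (ours, separate from the BBVI setting above), consider $F(\lambda) = \mathbb{E}_{u \sim \mathcal{N}(0,1)}[(\lambda + u)^2/2]$ with the single-sample gradient $g(\lambda) = \lambda + u$; then $\mathbb{E}[g(\lambda)^2] = \lambda^2 + 1$, so Assumption 2 holds with equality for $A = 1$, $B = 0$, $C = 1$. A quick Monte Carlo check:

```python
import numpy as np

# Toy objective F(lam) = E_u[(lam + u)^2 / 2], u ~ N(0, 1), so that
# F(lam) = lam^2/2 + 1/2, F* = 1/2, and grad F(lam) = lam.
rng = np.random.default_rng(0)
A, B, C, F_star = 1.0, 0.0, 1.0, 0.5

for lam in (-3.0, 0.0, 2.0):
    u = rng.standard_normal(200_000)
    second_moment = np.mean((lam + u) ** 2)        # Monte Carlo E[g(lam)^2]
    bound = 2 * A * (0.5 * lam**2 + 0.5 - F_star) + B * lam**2 + C
    print(f"lam={lam:+.1f}  E[g^2] ~ {second_moment:.3f}  ABC bound = {bound:.3f}")
```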

2.5. Covariance Parameterizations

When using the location-scale family (Definition 2), the scale matrix $\boldsymbol{C}$ can be parameterized in different ways. Any parameterization that results in a positive definite covariance $\boldsymbol{C}\boldsymbol{C}^{\top} \in \mathbb{S}^{d}_{++}$ is valid. We consider multiple parameterizations as the choice can result in different theoretical properties. A brief survey of the use of different parameterizations is shown in Table 1.

Linear Parameterization The previous results

by Domke (2019) considered the matrix square root

parameterization, which is linear with respect to the

variational parameters.

Definition 7 (Matrix Square Root).

$$\boldsymbol{C}(\boldsymbol{\lambda}) = \boldsymbol{C},$$

where $\boldsymbol{C} \in \mathbb{R}^{d \times d}$ is a matrix and $\boldsymbol{\lambda}_{\boldsymbol{C}} = \mathrm{vec}(\boldsymbol{C}) \in \mathbb{R}^{d^2}$ such that $\boldsymbol{\lambda} = (\boldsymbol{m}, \boldsymbol{\lambda}_{\boldsymbol{C}})$.

Note that $\boldsymbol{C}$ is not constrained to be symmetric, so this is not a matrix square root in a narrow sense. Also, this parameterization does not guarantee $\boldsymbol{C}\boldsymbol{C}^{\top}$ to be positive definite (only positive semidefinite), which occasionally results in the entropy term $H(q_{\boldsymbol{\lambda}})$ blowing up (Domke, 2020). Domke proposed to fix this by using proximal operators.

Nonlinear Parameterizations   In practice, optimization is preferably done in an unconstrained space $\mathbb{R}^{p}$, in which case positive definiteness can be ensured by explicitly mapping the diagonal elements to positive numbers. We denote this map by the diagonal conditioner $\psi$. (See Table 1 for a brief survey of their use.) The following two parameterizations are commonly used, where $\boldsymbol{D} = \mathrm{diag}(\psi(\boldsymbol{s})) \in \mathbb{R}^{d \times d}$ denotes a diagonal matrix such that $D_{ii} = \psi(s_i) > 0$.

Definition 8 (Mean-Field).

$$\boldsymbol{C}(\boldsymbol{\lambda}, \psi) = \mathrm{diag}(\psi(\boldsymbol{s})),$$

where $\boldsymbol{s} \in \mathbb{R}^{d}$ and $\boldsymbol{\lambda} = (\boldsymbol{m}, \boldsymbol{s})$.

Definition 9 (Cholesky).

$$\boldsymbol{C}(\boldsymbol{\lambda}, \psi) = \mathrm{diag}(\psi(\boldsymbol{s})) + \boldsymbol{L},$$

where $\boldsymbol{s} \in \mathbb{R}^{d}$, $\boldsymbol{L} \in \mathbb{R}^{d \times d}$ is a strictly lower triangular matrix, and $\boldsymbol{\lambda}_{\boldsymbol{L}} = \mathrm{vec}(\boldsymbol{L}) \in \mathbb{R}^{d(d-1)/2}$ such that $\boldsymbol{\lambda} = (\boldsymbol{m}, \boldsymbol{s}, \boldsymbol{\lambda}_{\boldsymbol{L}})$. The special case of $\psi(x) = x$ is called the "linear Cholesky" parameterization.

Diagonal conditioner   For the diagonal conditioner, the softplus function $\psi(x) = \mathrm{softplus}(x) \triangleq \log(1 + e^{x})$ (Dugas et al., 2000) or the exponential function $\psi(x) = e^{x}$ is commonly used. While using these nonlinear functions significantly complicates the analysis, assuming $\psi$ to be 1-Lipschitz retrieves practical guarantees.

Assumption 3 (Lipschitz Diagonal Conditioner). The diagonal conditioner $\psi$ is 1-Lipschitz continuous.

Remark 2. The softplus function is 1-Lipschitz.
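The three scale parameterizations of Definitions 7 to 9 can be summarized in code as follows (a sketch with our own names; the conditioner is applied only to the diagonal):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def scale_matrix_square_root(lambda_C, d):
    """Definition 7: C is an unconstrained d x d matrix (linear in lambda)."""
    return lambda_C.reshape(d, d)

def scale_mean_field(s, conditioner=softplus):
    """Definition 8: C = diag(psi(s)), so C C^T is diagonal and positive definite."""
    return np.diag(conditioner(s))

def scale_cholesky(s, L_strict, conditioner=softplus):
    """Definition 9: C = diag(psi(s)) + L with L strictly lower triangular;
    psi(s) > 0 keeps the diagonal positive, hence C C^T is positive definite."""
    assert np.allclose(L_strict, np.tril(L_strict, k=-1)), "L must be strictly lower triangular"
    return np.diag(conditioner(s)) + L_strict
```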

3. Main Results

3.1. Key Lemmas

The main challenge in studying BBVI is that the gradient of the composed function $\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))$ is different from $\nabla f$. For the matrix square root parameterization, Domke (2019) established the connection through Lemma 1 (restated as Lemma 6 in Appendix C.1). We generalize this result to nonlinear parameterizations:


Lemma 1. Let $\boldsymbol{t}_{\boldsymbol{\lambda}}: \mathbb{R}^{d} \to \mathbb{R}^{d}$ be a location-scale reparameterization function (Definition 1) and $f: \mathbb{R}^{d} \to \mathbb{R}$ be some differentiable function. Then, for $\boldsymbol{g} \triangleq \nabla f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))$,

(i) Mean-Field

$$\|\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2 = \|\boldsymbol{g}\|_2^2 + \boldsymbol{g}^{\top} \boldsymbol{U} \boldsymbol{\Phi} \boldsymbol{g},$$

(ii) Cholesky

$$\|\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2 = \|\boldsymbol{g}\|_2^2 + \boldsymbol{g}^{\top} \boldsymbol{\Sigma} \boldsymbol{g} + \boldsymbol{g}^{\top} \boldsymbol{U} (\boldsymbol{\Phi} - \mathbf{I}) \boldsymbol{g},$$

where $\boldsymbol{U}, \boldsymbol{\Phi}, \boldsymbol{\Sigma}$ are diagonal matrices whose diagonals are defined as

$$U_{ii} = u_i^2, \qquad \Phi_{ii} = \psi'(s_i)^2, \qquad \Sigma_{ii} = \sum_{j=1}^{i} u_j^2,$$

and $\psi$ is the diagonal conditioner for the scale matrix.

Proof. See the full proof in Appendix C.2.1.
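The mean-field identity of Lemma 1 (i) can be checked numerically: for the mean-field parameterization, $\nabla_{\boldsymbol{m}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u})) = \boldsymbol{g}$ and $\nabla_{s_i} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u})) = g_i\, \psi'(s_i)\, u_i$, so the squared norm decomposes exactly as stated. A short check (ours) using a quadratic $f$ with an analytic gradient:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def softplus_d(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
A = A @ A.T                                  # f(z) = 0.5 * z^T A z
grad_f = lambda z: A @ z

m, s, u = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
z = softplus(s) * u + m                      # t_lambda(u), mean-field
g = grad_f(z)

# Left-hand side: ||nabla_lambda f(t_lambda(u))||^2 with lambda = (m, s).
grad_m = g
grad_s = g * softplus_d(s) * u
lhs = np.sum(grad_m**2) + np.sum(grad_s**2)

# Right-hand side of Lemma 1 (i): ||g||^2 + g^T U Phi g, with U = diag(u_i^2)
# and Phi = diag(psi'(s_i)^2).
rhs = np.sum(g**2) + np.sum(g**2 * u**2 * softplus_d(s)**2)

assert np.isclose(lhs, rhs)
```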

Note that the relationships in this lemma are all equalities, which can be bounded with known quantities, as done in the next lemma. We note here that if any of our analyses were to be improved, this should be done by obtaining tighter bounds on the equalities in Lemma 1.

Lemma 2. Let $\boldsymbol{t}_{\boldsymbol{\lambda}}: \mathbb{R}^{d} \to \mathbb{R}^{d}$ be a location-scale reparameterization function (Definition 1), $f: \mathbb{R}^{d} \to \mathbb{R}$ be a differentiable function, and let $\psi$ satisfy Assumption 3.

(i) Mean-Field

$$\|\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2 \le \left(1 + \|\boldsymbol{U}\|_{\mathrm{F}}\right) \|\nabla f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2,$$

where $\boldsymbol{U}$ is a diagonal matrix such that $U_{ii} = u_i^2$.

(ii) Cholesky

$$\|\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2 \le \left(1 + \|\boldsymbol{u}\|_2^2\right) \|\nabla f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2,$$

where the equality holds for the matrix square root parameterization.

Proof. See the full proof in Appendix C.2.2.

Lemma 1 acts as the interface between the properties of the parameterization and the likelihood $f$.

Remark 3 (Variance Reduction Through $\psi$). A nonlinear Cholesky parameterization with a 1-Lipschitz $\psi$ achieves lower or equal variance compared to the matrix square root and linear Cholesky parameterizations, where the equality is achieved by the matrix square root parameterization.

Dimension Dependence of Mean-Field The superior di-

mensional dependence of the mean-ﬁeld parameterization

is given by the following lemma:

Lemma 3. Let the assumptions of Lemma 2 hold and $\boldsymbol{\mathsf{u}} \sim \varphi$ satisfy Assumption 1. Then, for the mean-field parameterization,

$$\mathbb{E}\left[\left\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{z}}\right\|_2^2 \left(1 + \|\boldsymbol{\mathsf{U}}\|_{\mathrm{F}}\right)\right] \le \left(\sqrt{\kappa_{\varphi}} + \sqrt{d} + 1\right) \|\boldsymbol{m} - \bar{\boldsymbol{z}}\|_2^2 + \left(2\sqrt{\kappa_{\varphi} d} + 1\right) \|\boldsymbol{C}\|_{\mathrm{F}}^2.$$

Proof. See the full proof in Appendix C.2.3.

Remark 4 (Superior Variance of Mean-Field). The mean-field parameterization has $\mathcal{O}(\sqrt{d})$ dimensional dependence compared to the $\mathcal{O}(d)$ dimensional dependence of the full-rank parameterizations in Lemma 7.

Lastly, the following lemma is the basic building block for

all of our upper bounds:

Lemma 4. Let $\boldsymbol{\mathsf{g}}$ be the $M$-sample gradient estimator of $F$ (Definition 4) for some function $f$, and let $\boldsymbol{\mathsf{u}}$ be some random variable. Then,

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \le \frac{1}{M}\, \mathbb{E}\|\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\|_2^2 + \|\nabla F(\boldsymbol{\lambda})\|_2^2.$$

Proof. See the full proof in Appendix C.2.4.

3.2. Upper Bounds

We restrict our analysis to the class of log-likelihoods that

satisfy the following conditions:

Definition 10 ($L$-smoothness). A function $f: \mathbb{R}^{d} \to \mathbb{R}$ is $L$-smooth if it satisfies, for all $\boldsymbol{\zeta}, \boldsymbol{\zeta}' \in \mathbb{R}^{d}$,

$$\|\nabla f(\boldsymbol{\zeta}) - \nabla f(\boldsymbol{\zeta}')\|_2 \le L \|\boldsymbol{\zeta} - \boldsymbol{\zeta}'\|_2.$$

Definition 11 (Quadratic Functional Growth). A function $f: \mathbb{R}^{d} \to \mathbb{R}$ is $\mu$-quadratically growing if

$$\frac{\mu}{2}\, \|\boldsymbol{\zeta} - \bar{\boldsymbol{\zeta}}\|_2^2 \le f(\boldsymbol{\zeta}) - f^*$$

for all $\boldsymbol{\zeta} \in \mathbb{R}^{d}$, where $\bar{\boldsymbol{\zeta}} = \Pi_{\mathcal{X}^*}(\boldsymbol{\zeta})$ is the projection of $\boldsymbol{\zeta}$ onto the set of minimizers $\mathcal{X}^*$ of $f$ and $f^* = \inf_{\boldsymbol{\zeta} \in \mathbb{R}^{d}} f(\boldsymbol{\zeta})$.

The quadratic growth condition was first used by Anitescu (2000) and is strictly weaker than the Polyak-Łojasiewicz inequality (see Karimi et al. 2016, Appendix A for the proof). Furthermore, $\mu$-strongly (quasar) convex functions (Hinder et al., 2020; Jin, 2020) automatically satisfy quadratic growth, but our analysis does not require (quasar) convexity.

Both assumptions are commonly used in SGD. For study-

ing the gradient variance of BBVI, assuming both smooth-

ness and quadratic growth is weaker than the assumptions

of Xu et al. (2019) but stronger than those of Domke

(2019), who assumed only smoothness. The additional as-

sumption on growth is necessary to extend his results to

establish the ABC condition.

For the variational family, we assume the following:

Assumption 4. $q_{b,\boldsymbol{\lambda}}$ is a member of the ADVI family (Definition 3), where the underlying $q_{\boldsymbol{\lambda}}$ is a member of the location-scale family (Definition 2) with its base distribution $\varphi$ satisfying Assumption 1.

Entropy-Regularized Form First, we provide the upper

bound for the ELBO in entropy-regularized form. This re-

sult does not require any modiﬁcations to vanilla SGD.


Theorem 1. Let $\boldsymbol{\mathsf{g}}$ be an $M$-sample estimate of the gradient of the ELBO in entropy-regularized form (Definition 5). Also, assume that Assumptions 3 and 4 hold,

• $f_{\mathrm{H}}$ is $L_{\mathrm{H}}$-smooth, and
• $f_{\mathrm{KL}}$ is $\mu_{\mathrm{KL}}$-quadratically growing.

Then,

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \le \frac{4 L_{\mathrm{H}}^2}{\mu_{\mathrm{KL}} M} C(\kappa_{\varphi}, d) \left(F(\boldsymbol{\lambda}) - F^*\right) + \|\nabla F(\boldsymbol{\lambda})\|_2^2 + \frac{2 L_{\mathrm{H}}^2}{M} C(\kappa_{\varphi}, d) \left\|\bar{\boldsymbol{\zeta}}_{\mathrm{KL}} - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\right\|_2^2 + \frac{4 L_{\mathrm{H}}^2}{\mu_{\mathrm{KL}} M} C(\kappa_{\varphi}, d) \left(F^* - f^*_{\mathrm{KL}}\right),$$

where $C(\kappa_{\varphi}, d) = 2\sqrt{\kappa_{\varphi} d} + 1$ for the mean-field parameterization, $C(\kappa_{\varphi}, d) = \kappa_{\varphi} + d$ for the Cholesky and matrix square root parameterizations, $\bar{\boldsymbol{\zeta}}_{\mathrm{KL}}, \bar{\boldsymbol{\zeta}}_{\mathrm{H}}$ are the stationary points of $f_{\mathrm{KL}}, f_{\mathrm{H}}$, respectively, $F^* = \inf_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} F(\boldsymbol{\lambda})$, and $f^*_{\mathrm{KL}} = \inf_{\boldsymbol{\zeta} \in \mathbb{R}^{d}} f_{\mathrm{KL}}(\boldsymbol{\zeta})$.

Proof Sketch. From Lemma 4, we can see that the key quantity for upper bounding the gradient variance is $\mathbb{E}\|\nabla_{\boldsymbol{\lambda}} f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\|_2^2$. The bird's-eye view of the proof is as follows:

❶ The relationship between $\|\nabla_{\boldsymbol{\lambda}} f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\|_2^2$ and $\|\nabla f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\|_2^2$ is established through Lemma 2.

❷ Then, the $L_{\mathrm{H}}$-smoothness of $f_{\mathrm{H}}$ relates $\|\nabla f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\|_2^2$ with $\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2$, the average squared distance from the stationary point of $f_{\mathrm{H}}$.

❸ The average squared distance enables the simplification of stochastic terms through Lemmas 3 and 7. This step also introduces the dimension dependence.

From here, we are left with the $\mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2$ term. One might be tempted to assume quadratic growth of $f_{\mathrm{H}}$ and proceed as

$$\mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2 \le \frac{2}{\mu_{\mathrm{H}}}\, \mathbb{E}\left[f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}})) - f^*_{\mathrm{H}}\right].$$

However, for the entropy-regularized form, this soon runs into a dead end since, in

$$\mathbb{E}\left[f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\right] - f^*_{\mathrm{H}} = F(\boldsymbol{\lambda}) - h_{\mathrm{H}}(\boldsymbol{\lambda}) - f^*_{\mathrm{H}} = \left(F(\boldsymbol{\lambda}) - F^*\right) + \left(F^* - f^*_{\mathrm{H}}\right) - h_{\mathrm{H}}(\boldsymbol{\lambda}),$$

the negative entropy term $h_{\mathrm{H}}$ is not bounded unless we rely on assumptions that require modifications to the BBVI algorithms (e.g., bounded support, bounded domain). Fortunately, the following inequality cleverly side-steps this problem:

$$\mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2 \le 2\, \mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{KL}}\|_2^2 + 2\, \|\bar{\boldsymbol{\zeta}}_{\mathrm{KL}} - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2, \tag{6}$$

albeit at the cost of some looseness. By converting the entropy-regularized form into the KL-regularized form, the regularizer term becomes $h_{\mathrm{KL}} = D_{\mathrm{KL}}(q_{\boldsymbol{\lambda}}, p) \ge 0$, which is bounded below by definition, unlike the entropic regularizer $h_{\mathrm{H}}$. The proof is completed by

❹ applying the quadratic growth assumption to relate the parameter distance to the function suboptimality gap, and

❺ upper bounding the KL regularizer term.

Proof. See the full proof in Appendix C.3.1.

Remark 5. If the bijector $b$ is an identity function, $\bar{\boldsymbol{\zeta}}_{\mathrm{KL}}$ and $\bar{\boldsymbol{\zeta}}_{\mathrm{H}}$ are the maximum likelihood (ML) and maximum a-posteriori (MAP) estimates, respectively. Thus, with enough datapoints, the term $\|\bar{\boldsymbol{\zeta}}_{\mathrm{KL}} - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2$ will be negligible since the ML and MAP estimates will be close.

Remark 6. It is also possible to tighten the constants by a factor of two. Instead of applying Equation (6), we can use the inequality

$$(a + b)^2 \le (1 + \epsilon^2)\, a^2 + (1 + \epsilon^{-2})\, b^2$$

for some $\epsilon > 0$. By setting $\epsilon^2 = \delta = \|\bar{\boldsymbol{\zeta}}_{\mathrm{KL}} - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2$,

$$\mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2 \le (1 + \delta)\, \mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{KL}}\|_2^2 + \delta^2 + \delta.$$

Since $\delta \approx 0$, as explained in Remark 5, the constant in front of the first term is tightened almost by a factor of 2. However, the stated form is more convenient for theory since the first term does not depend on $\|\bar{\boldsymbol{\zeta}}_{\mathrm{KL}} - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2$.

Remark 7. Let $\kappa_{\mathrm{cond.}} = L_{\mathrm{H}} / \mu_{\mathrm{KL}}$ be the condition number of the problem. For the full-rank parameterizations, the variance is bounded as $\mathcal{O}\left(L_{\mathrm{H}}\, \kappa_{\mathrm{cond.}}\, (d + \kappa_{\varphi}) / M\right)$. The variance depends linearly on ❶ the scaling of the problem $L_{\mathrm{H}}$, ❷ the conditioning of the problem $\kappa_{\mathrm{cond.}}$, ❸ the dimensionality of the problem $d$, and ❹ the tail properties of the variational family $\kappa_{\varphi}$, while the number of Monte Carlo samples $M$ linearly reduces the variance.

KL-Regularized Form   We now prove an equivalent result for the KL-regularized form. Here, we do not have to rely on Equation (6) since we already start from $f_{\mathrm{KL}}$, which results in better constants.

Theorem 2. Let $\boldsymbol{\mathsf{g}}$ be an $M$-sample estimator of the gradient of the ELBO in KL-regularized form (Definition 6). Also, assume that

• $f_{\mathrm{KL}}$ is $L_{\mathrm{KL}}$-smooth,
• $f_{\mathrm{KL}}$ is $\mu_{\mathrm{KL}}$-quadratically growing,

and Assumptions 3 and 4 hold. Then, the gradient variance is bounded above as

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \le \frac{2 L_{\mathrm{KL}}^2}{\mu_{\mathrm{KL}} M} C(\kappa_{\varphi}, d) \left(F(\boldsymbol{\lambda}) - F^*\right) + \|\nabla F(\boldsymbol{\lambda})\|_2^2 + \frac{2 L_{\mathrm{KL}}^2}{\mu_{\mathrm{KL}} M} C(\kappa_{\varphi}, d) \left(F^* - f^*_{\mathrm{KL}}\right),$$

where $C(\kappa_{\varphi}, d) = 2\sqrt{\kappa_{\varphi} d} + 1$ for the mean-field parameterization, $C(\kappa_{\varphi}, d) = \kappa_{\varphi} + d$ for the Cholesky and matrix square root parameterizations, $F^* = \inf_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} F(\boldsymbol{\lambda})$, and $f^*_{\mathrm{KL}} = \inf_{\boldsymbol{\zeta} \in \mathbb{R}^{d}} f_{\mathrm{KL}}(\boldsymbol{\zeta})$.

Proof. See the full proof in Appendix C.3.2.

3.3. Upper Bound Under Bounded Entropy

The bound in Theorem 1 is slightly loose due to the use of Equation (6) and Equation (29). An alternative bound with slightly tighter constants, although the gains are marginal compared to Remark 6, can be obtained by assuming the following:

Assumption 5 (Bounded Entropy). The regularization term is bounded below as $h_{\mathrm{H}}(\boldsymbol{\lambda}) \ge h^*_{\mathrm{H}}$.

For the entropy-regularized form, this corresponds to the entropy being bounded above by some constant since $h_{\mathrm{H}}(\boldsymbol{\lambda}) = -H(q_{\boldsymbol{\lambda}})$. When using the nonlinear parameterizations (Definitions 8 and 9), this assumption can be practically enforced by bounding the output of $\psi$ by some large constant $B$.

Proposition 1. Let the diagonal conditioner $\psi$ be bounded as $\psi(x) \le B$. Then, for any $d$-dimensional distribution $q_{\boldsymbol{\lambda}}$ in the location-scale family with the mean-field (Definition 8) or Cholesky (Definition 9) parameterizations,

$$h_{\mathrm{H}}(\boldsymbol{\lambda}) = -H(q_{\boldsymbol{\lambda}}) \ge -H(\varphi) - d \log B.$$

Proof. From Remark 1, $H(q_{\boldsymbol{\lambda}}) = H(\varphi) + \log|\det \boldsymbol{C}|$. Since $\boldsymbol{C}$ under Definitions 8 and 9 is a diagonal or triangular matrix, the log absolute determinant is the sum of the logs of the diagonals. The conclusion follows from the fact that the diagonals $C_{ii} = \psi(s_i)$ are bounded by $B$.
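A minimal way to enforce Assumption 5 in code is to clip the conditioner output from above (a sketch; $B$ plays the role of the bound in Proposition 1, and the clipped map remains positive and 1-Lipschitz):

```python
import numpy as np

def bounded_softplus(x, B):
    """Softplus conditioner clipped from above so that psi(x) <= B, which keeps
    the negative entropy bounded below as in Proposition 1 (a sketch; B is the
    bound used for Assumption 5)."""
    return np.minimum(np.log1p(np.exp(x)), B)
```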

This is essentially a weaker version of the bounded domain assumption, though only the diagonal elements of $\boldsymbol{C}$, namely $C_{11}, \ldots, C_{dd}$, are bounded. While this assumption results in an admittedly less realistic algorithm, it enables a tighter bound for the entropy-regularized form of the ELBO.

Theorem 3. Let $\boldsymbol{\mathsf{g}}$ be an $M$-sample estimator of the gradient of the ELBO in entropy-regularized form (Definition 5). Also, assume that

• $f_{\mathrm{H}}$ is $L_{\mathrm{H}}$-smooth,
• $f_{\mathrm{H}}$ is $\mu_{\mathrm{H}}$-quadratically growing,
• $h_{\mathrm{H}}$ is bounded below as $h_{\mathrm{H}}(\boldsymbol{\lambda}) \ge h^*_{\mathrm{H}}$ (Assumption 5),

and Assumptions 3 and 4 hold. Then, the gradient variance of $\boldsymbol{\mathsf{g}}$ is bounded above as

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \le \frac{2 L_{\mathrm{H}}^2}{\mu_{\mathrm{H}} M} C(\kappa_{\varphi}, d) \left(F(\boldsymbol{\lambda}) - F^*\right) + \|\nabla F(\boldsymbol{\lambda})\|_2^2 + \frac{2 L_{\mathrm{H}}^2}{\mu_{\mathrm{H}} M} C(\kappa_{\varphi}, d) \left(F^* - f^*_{\mathrm{H}} - h^*_{\mathrm{H}}\right),$$

where $C(\kappa_{\varphi}, d) = 2\sqrt{\kappa_{\varphi} d} + 1$ for the mean-field parameterization, $C(\kappa_{\varphi}, d) = \kappa_{\varphi} + d$ for the Cholesky parameterization, $F^* = \inf_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} F(\boldsymbol{\lambda})$, and $f^*_{\mathrm{H}} = \inf_{\boldsymbol{\zeta} \in \mathbb{R}^{d}} f_{\mathrm{H}}(\boldsymbol{\zeta})$.

Proof Sketch. Instead of using Equation (6), we apply the quadratic growth assumption directly to $f_{\mathrm{H}}$. The remaining entropic-regularizer term can now be bounded through the bounded entropy assumption.

Proof. See the full proof in Appendix C.3.3.

3.4. Matching Lower Bound

Finally, we present a matching lower bound on the gra-

dient variance of BBVI. Our lower bound holds broadly

for smooth and strongly convex problem instances that are

well-conditioned and high-dimensional.

Theorem 4. Let $\boldsymbol{\mathsf{g}}$ be an $M$-sample estimator of the gradient of the ELBO in either the entropy- or KL-regularized form. Also, let Assumption 4 hold, where the matrix square root parameterization is used. Then, for all $L$-smooth and $\mu$-strongly convex functions $f$ such that $L/\mu < \sqrt{d + 1}$, the variance of $\boldsymbol{\mathsf{g}}$ is bounded below by some strictly positive constant as

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \ge \frac{2\mu^2 (d + 1) - 2L^2}{L M} \left(F(\boldsymbol{\lambda}) - F^*\right) + \|\nabla F(\boldsymbol{\lambda})\|_2^2 + \frac{2\mu^2 (d + 1) - 2L^2}{L M} \left(\mathbb{E}\, f(\boldsymbol{t}_{\boldsymbol{\lambda}^*}(\boldsymbol{u})) - f^*\right),$$

as long as $\boldsymbol{\lambda}$ is in a local neighborhood around the unique global optimum $\boldsymbol{\lambda}^* = \operatorname{argmin}_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} F(\boldsymbol{\lambda})$, where $F^* = F(\boldsymbol{\lambda}^*)$ and $f^* = f(\boldsymbol{\zeta}^*)$ with $\boldsymbol{\zeta}^* = \operatorname{argmin}_{\boldsymbol{\zeta} \in \mathbb{R}^{d}} f(\boldsymbol{\zeta})$.

Proof Sketch. We use the fact that, with the matrix square root parameterization, if $f$ is $L$-smooth, then $\mathbb{E}\, f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))$ is also $L$-smooth (Domke, 2020). From this, the parameter suboptimality can be related to the function suboptimality as

$$\|\boldsymbol{\lambda} - \bar{\boldsymbol{\lambda}}\|_2^2 \ge \frac{2}{L} \left(\mathbb{E}\, f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}})) - f^*\right),$$

where $\bar{\boldsymbol{\lambda}} = (\bar{\boldsymbol{\zeta}}, \mathbf{O})$. For the entropy term, we circumvent the need to directly bound its value by restricting our interest to a neighborhood of the minimizer $\boldsymbol{\lambda}^*$, where the contribution of $h(\boldsymbol{\lambda}^*) - h(\boldsymbol{\lambda})$ will be marginal enough for the lower bound to hold.

Proof. See the full proof in Appendix C.3.4.


[Figure 1: four panels plotting the gradient variance $\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2$ and the upper bound $2A(F(\boldsymbol{\lambda}) - F^*) + B\|\nabla F\|_2^2 + C$ against the iteration count (panel titles: $D_{\mathrm{KL}}(q_{\boldsymbol{\lambda}}, \pi)$ and $F(\boldsymbol{\lambda}) - F^*$), for the Cholesky and mean-field parameterizations with $\psi(x) = \mathrm{softplus}(x)$, under Theorem 1 and Theorem 3.]

Figure 1: Evaluation of the bounds for a perfectly conditioned quadratic target function. The blue regions are the loosenesses resulting from either using (Theorem 1) or not using (Theorem 3) the bounded entropy assumption (Assumption 5), while the red regions are the remaining "technical loosenesses." The gradient variance was estimated from $10^3$ samples.

Remark 8 (Matching Dimensional Dependence). For well-conditioned problems such that $L/\mu < \sqrt{d + 1}$, a lower bound with the same dimensional dependence as our upper bounds holds near the optimum.

Remark 9 (Unimprovability of the ABC Condition). The lower bound suggests that the ABC gradient variance condition is unimprovable within the class of smooth, quadratically growing functions.

4. Simulations

We now evaluate our bounds and the insights gathered during the analysis through simulations. We implemented a bare-bones version of BBVI in Julia (Bezanson et al., 2017) with plain SGD. The stepsizes were manually tuned so that all problems converge at similar speeds. For all problems, we use a unit Gaussian base distribution such that $\varphi(u) = \mathcal{N}(u; 0, 1)$, resulting in a kurtosis of $\kappa_{\varphi} = 3$, and use $M = 10$ Monte Carlo samples.

4.1. Synthetic Problem

To test the ideal tightness of the bounds, we consider quadratics achieving the tightest bound for the constants $L_{\mathrm{H}}, L_{\mathrm{KL}}, \mu_{\mathrm{H}}, \mu_{\mathrm{KL}}$, given as

$$\log \ell(\boldsymbol{x} \mid \boldsymbol{z}) = -\frac{N \beta}{2}\, \|\boldsymbol{z} - \boldsymbol{z}^*\|_2^2; \qquad \log p(\boldsymbol{z}) = -\frac{1}{2\alpha}\, \|\boldsymbol{z}\|_2^2,$$

where $N$ simulates the effect of the number of datapoints. We set the constants as $\beta = 0.3$, $\alpha = 8.0$, and $N = 100$, the mode $\boldsymbol{z}^*$ is randomly sampled from a Gaussian, and the dimension of the problem is $d = 20$. For the bounded entropy case, we set $B = 2.0$ (the true standard deviation is on the order of 1e-3).
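A short sketch (ours, using the constants as assigned above) of this synthetic target and its gradient, which can be plugged into the reparameterization gradient estimator of Section 2.3:

```python
import numpy as np

beta, alpha, N, d = 0.3, 8.0, 100, 20    # constants as assigned above
rng = np.random.default_rng(0)
z_star = rng.standard_normal(d)          # mode of the likelihood, sampled from a Gaussian

def neg_logjoint(z):
    """f_H(z) = -log l(x | z) - log p(z) for the synthetic quadratic target."""
    return 0.5 * N * beta * np.sum((z - z_star) ** 2) + 0.5 / alpha * np.sum(z ** 2)

def grad_neg_logjoint(z):
    """Gradient of f_H, usable with the estimator sketched in Section 2.3."""
    return N * beta * (z - z_star) + z / alpha
```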

[Figure 2: two panels plotting the gradient variance $\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2$ against the iteration count; the left panel also shows the upper bound, while the right panel compares the matrix square root, Cholesky with $\psi(x) = x$, and Cholesky with $\psi(x) = \mathrm{softplus}(x)$ parameterizations.]

Figure 2: Linear regression on the AIRFOIL dataset. (left) Evaluation of the upper bound (Theorem 1). (right) Comparison of the variance of different parameterizations resulting in the same $\boldsymbol{m}, \boldsymbol{C}$.

Quality of Upper Bound   The results for the Cholesky and mean-field parameterizations with a softplus conditioner are shown in Figure 1. For the Cholesky parameterization, the bulk of the looseness comes from the treatment of the regularization term (blue region). The remaining "technical looseness" (red region) is relatively tight and can be shown to be tighter when using the linear parameterizations ($\psi(x) = x$) and the matrix square root parameterization, which is the tightest. However, for the mean-field parameterization, despite the superior constants (Remark 4), there is still room for improvement. Additional results for other parameterizations can be found in Appendix B.1.

4.2. Real Dataset

Model   We now evaluate the theoretical results on real datasets. Given a regression dataset $(\boldsymbol{X}, \boldsymbol{y})$, we use the linear Gaussian model

$$\boldsymbol{\mathsf{y}} \sim \mathcal{N}\left(\boldsymbol{X}\boldsymbol{w}, \sigma^2 \mathbf{I}\right); \qquad \boldsymbol{\mathsf{w}} \sim \mathcal{N}(\boldsymbol{0}, \alpha \mathbf{I}),$$

where $\sigma$ and $\alpha$ are hyperparameters. The smoothness and quadratic growth constants for this model are given by the maximum and minimum eigenvalues of $\sigma^{-2}\boldsymbol{X}^{\top}\boldsymbol{X} + \alpha^{-1}\mathbf{I}$ (for $f_{\mathrm{H}}$) and $\sigma^{-2}\boldsymbol{X}^{\top}\boldsymbol{X}$ (for $f_{\mathrm{KL}}$). $f^*_{\mathrm{KL}}$ and $f^*_{\mathrm{H}}$ are obtained from the modes of the likelihood and the posterior, respectively, while $F^*$ is the negative marginal log-likelihood.
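A sketch (ours) of how these constants are computed for a given design matrix, under the model stated above:

```python
import numpy as np

def smoothness_and_growth_constants(X, sigma, alpha):
    """Smoothness and quadratic-growth constants for the linear Gaussian model
    y ~ N(X w, sigma^2 I), w ~ N(0, alpha I): the extreme eigenvalues of
    sigma^-2 X^T X + alpha^-1 I (entropy-regularized form) and of
    sigma^-2 X^T X (KL-regularized form)."""
    H_posterior = X.T @ X / sigma**2 + np.eye(X.shape[1]) / alpha
    H_likelihood = X.T @ X / sigma**2
    eig_post = np.linalg.eigvalsh(H_posterior)
    eig_lik = np.linalg.eigvalsh(H_likelihood)
    return (eig_post.max(), eig_post.min()), (eig_lik.max(), eig_lik.min())
```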

Quality of Upper Bound   Figure 2 (left) shows the result on the AIRFOIL dataset (Dua & Graff, 2017). The constants are $L_{\mathrm{H}} = 3.520 \times 10^4$ and $L_{\mathrm{KL}} = 2.909 \times 10^3$. Due to poor conditioning, the bound is much looser compared to the quadratic case. We note that generalizing our bounds to utilize matrix smoothness and matrix quadratic growth, as done by Domke (2019), would tighten the bounds, but the theoretical gains would be marginal. Detailed information about the datasets and additional results for other parameterizations can be found in Appendix B.2.

Comparison of Parameterizations   Figure 2 (right) compares the gradient variance resulting from the different parameterizations. For a fair comparison, the gradient is estimated at a $\boldsymbol{\lambda}$ that results in the same $\boldsymbol{m}, \boldsymbol{C}$ for all three parameterizations. This shows the gradual increase in variance from (i) not using a nonlinear conditioner (linear Cholesky) and (ii) increasing the number of variational parameters (matrix square root).

5. Related Works

Controlling Gradient Variance The main algorithmic

challenge in BBVI is to control the gradient noise (Ran-

ganath et al.,2014). This has led to various methods

for reducing the variance of VI gradient estimators us-

ing control variates (Ranganath et al.,2014;Miller et al.,

2017;Geffner & Domke,2018), ensembling of estima-

tors (Geffner & Domke,2020), modifying the differ-

entiation procedure (Roeder et al.,2017), quasi-Monte

Carlo (Buchholz et al.,2018;Liu & Owen,2021), and mul-

tilevel Monte Carlo (Fujisawa & Sato,2021). Cultivating a

deeper understanding of the properties of gradient variance

could further extend this list.

Convergence Guarantees   Obtaining full convergence guarantees has been an important task for understanding BBVI algorithms. However, most guarantees so far have relied on strong assumptions, such as the log-likelihood being Lipschitz (Chérief-Abdellatif et al., 2019; Alquier, 2021), the gradient variance being bounded by a constant (Liu & Owen, 2021; Buchholz et al., 2018; Domke, 2020; Hoffman & Ma, 2020), or the support of $q_{\boldsymbol{\lambda}}$ being bounded (Fujisawa & Sato, 2021). Our result shows that similar results can be obtained under relaxed assumptions. Meanwhile, Bhatia et al. (2022) have recently proven a full complexity guarantee for a variant of BBVI. But similarly to Hoffman & Ma (2020), they only optimize the scale matrix $\boldsymbol{C}$, and the specifics of the algorithm diverge from the usual BBVI implementations as it uses stochastic power iterations instead of SGD.

Gradient Variance Guarantees Studying the actual gra-

dient variance properties of BBVI has only started to make

progress recently. Fan et al. (2015) ﬁrst provided bounds

by assuming the log-likelihood to be Lipschitz. Under

more general conditions, Domke (2019) provided tight

bounds for smooth log-likelihoods, which our work builds

upon. Domke’s result can also be seen as a direct gen-

eralization of the results of Xu et al. (2019), which are

restricted to quadratic log-likelihoods and the mean-ﬁeld

family. Lastly, Mohamed et al. (2020a) provides a concep-

tual evaluation of gradient estimators used in BBVI.

6. Discussions

Conclusions   In this work, we have proven upper bounds on the gradient variance of BBVI with the location-scale family for smooth, quadratically-growing log-likelihoods. Specifically, we have provided bounds for the ELBO in both the entropy-regularized and KL-regularized forms. Our guarantees hold without a single modification to the algorithms used in practice, although stronger assumptions establish a tighter bound for the entropy-regularized form of the ELBO. Also, our bounds correspond to the ABC condition (Section 2.4) and the expected residual (ER) condition, where the latter is a special case of the former with $B = 1$. The ER condition has been used by Gower et al. (2021a) for proving convergence of SGD on quasar convex functions, which generalize convex functions. The results of this paper are used by Kim et al. (2023) to establish convergence of BBVI through the results of Khaled & Richtárik (2023).

Limitations   Our results have the following limitations: ❶ our results only apply to smooth and quadratically-growing log-likelihoods and ❷ the location-scale ADVI family. Also, ❸ our bounds cannot distinguish the variance of the Cholesky and matrix square root parameterizations, ❹ and empirically, the bounds for the mean-field parameterization appear loose. Furthermore, ❺ our results only work with 1-Lipschitz diagonal conditioners such as the softplus function. Unfortunately, assuming both smoothness and quadratic growth is quite restrictive, as it leaves only a small number of known distributions. Also, in practice, non-Lipschitz conditioners such as the exponential function are widely used. While obtaining similar bounds with such conditioners would be challenging, constructing a theoretical framework that extends to them would be an important future research direction.

Acknowledgements

This work was supported by NSF award IIS-2145644.


References

Alquier, P. Non-Exponentially Weighted Aggregation: Re-

gret Bounds for Unbounded Loss Functions. In Proceed-

ings of the International Conference on Machine Learn-

ing, volume 193 of PMLR, pp. 207–218. ML Research

Press, July 2021. (page 9)

Anitescu, M. Degenerate nonlinear programming with a

quadratic growth condition. SIAM Journal on Optimiza-

tion, 10(4):1116–1135, January 2000. (page 5)

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B.

Julia: A fresh approach to numerical computing. SIAM

review, 59(1):65–98, 2017. (page 8)

Bhatia, K., Kuang, N. L., Ma, Y.-A., and Wang, Y. Sta-

tistical and computational trade-offs in variational in-

ference: A case study in inferential model selection.

(arXiv:2207.11208), July 2022. (page 9)

Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F.,

Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Hors-

fall, P., and Goodman, N. D. Pyro: Deep universal prob-

abilistic programming. Journal of Machine Learning Re-

search, 20(28):1–6, 2019. (pages 1,4)

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Vari-

ational inference: A review for statisticians. Journal

of the American Statistical Association, 112(518):859–

877, April 2017. (page 1)

Bottou, L. On-line learning and stochastic approxima-

tions. In On-Line Learning in Neural Networks, pp.

9–42. Cambridge University Press, ﬁrst edition, January

1999. (page 1)

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization

methods for large-scale machine learning. SIAM Review,

60(2):223–311, January 2018. (pages 1,3)

Buchholz, A., Wenzel, F., and Mandt, S. Quasi-Monte

Carlo variational inference. In Proceedings of the Inter-

national Conference on Machine Learning, volume 80

of PMLR, pp. 668–677. ML Research Press, July 2018.

(page 9)

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D.,

Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li,

P., and Riddell, A. Stan: A probabilistic programming

language. Journal of Statistical Software, 76(1), 2017.

(pages 1,4)

Ch´

erief-Abdellatif, B.-E., Alquier, P., and Khan, M. E. A

generalization bound for online variational inference. In

Proceedings of the Asian Conference on Machine Learn-

ing, volume 101 of PMLR, pp. 662–677. ML Research

Press, October 2019. (page 9)

Dhaka, A. K., Catalina, A., Andersen, M. R., ns Mag-

nusson, M., Huggins, J., and Vehtari, A. Robust, ac-

curate stochastic optimization for variational inference.

In Advances in Neural Information Processing Systems,

volume 33, pp. 10961–10973. Curran Associates, Inc.,

2020. (page 1)

Dhaka, A. K., Catalina, A., Welandawe, M., Andersen,

M. R., Huggins, J., and Vehtari, A. Challenges and

opportunities in high dimensional variational inference.

In Advances in Neural Information Processing Systems,

volume 34, pp. 7787–7798. Curran Associates, Inc.,

2021. (page 1)

Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasude-

van, S., Moore, D., Patton, B., Alemi, A., Hoffman,

M., and Saurous, R. A. TensorFlow distributions.

(arXiv:1711.10604), November 2017. (pages 1,2,4)

Domke, J. Provable gradient variance guarantees for black-

box variational inference. In Advances in Neural In-

formation Processing Systems, volume 32. Curran As-

sociates, Inc., 2019. (pages 1,2,4,5,9,13,16,17,18,

21)

Domke, J. Provable smoothness guarantees for black-box

variational inference. In Proceedings of the Interna-

tional Conference on Machine Learning, volume 119 of

PMLR, pp. 2587–2596. ML Research Press, July 2020.

(pages 1,4,7,9,17,23)

Dua, D. and Graff, C. UCI machine learning repository.

2017. (page 9)

Dugas, C., Bengio, Y., B´

elisle, F., Nadeau, C., and Garcia,

R. Incorporating second-order functional knowledge for

better option pricing. In Advances in Neural Informa-

tion Processing Systems, volume 13. MIT Press, 2000.

(page 4)

Fan, K., Wang, Z., Beck, J., Kwok, J., and Heller, K. A.

Fast second order stochastic backpropagation for vari-

ational inference. In Advances in Neural Information

Processing Systems, volume 28. Curran Associates, Inc.,

2015. (page 9)

Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., and Ge, H.

Bijectors.jl: Flexible transformations for probability dis-

tributions. In Proceedings of The Symposium on Ad-

vances in Approximate Bayesian Inference, volume 118

of PMLR, pp. 1–17. ML Research Press, February 2020.

(pages 1,2)

Fujisawa, M. and Sato, I. Multilevel Monte Carlo varia-

tional inference. Journal of Machine Learning Research,

22(278):1–44, 2021. (pages 1,9)


Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and

Wilson, A. G. GPyTorch: Blackbox matrix-matrix Gaus-

sian process inference with GPU acceleration. In Ad-

vances in Neural Information Processing Systems, vol-

ume 31. Curran Associates, Inc., 2018. (page 4)

Ge, H., Xu, K., and Ghahramani, Z. Turing: A language

for ﬂexible probabilistic inference. In Proceedings of

the International Conference on Machine Learning, vol-

ume 84 of PMLR, pp. 1682–1690. ML Research Press,

2018. (pages 1,4)

Geffner, T. and Domke, J. Using large ensembles of control

variates for variational inference. In Advances in Neural

Information Processing Systems, volume 31. Curran As-

sociates, Inc., 2018. (page 9)

Geffner, T. and Domke, J. A rule for gradient estima-

tor selection, with an application to variational infer-

ence. In Proceedings of the International Conference

on Artiﬁcial Intelligence and Statistics, volume 108 of

PMLR, pp. 1803–1812. ML Research Press, August

2020. (page 9)

Gower, R., Sebbouh, O., and Loizou, N. SGD for struc-

tured nonconvex functions: Learning rates, minibatching

and interpolation. In Proceedings of The International

Conference on Artiﬁcial Intelligence and Statistics, vol-

ume 130 of PMLR, pp. 1315–1323. ML Research Press,

March 2021a. (pages 4,9)

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A.,

Shulgin, E., and Richt´

arik, P. SGD: General analysis

and improved rates. In Proceedings of the International

Conference on Machine Learning, volume 97 of PMLR,

pp. 5200–5209. ML Research Press, June 2019. (pages

1,3)

Gower, R. M., Richt´

arik, P., and Bach, F. Stochastic

quasi-gradient methods: Variance reduction via Jaco-

bian sketching. Mathematical Programming, 188(1):

135–192, July 2021b. (pages 1,2,3)

Hinder, O., Sidford, A., and Sohoni, N. Near-optimal meth-

ods for minimizing star-convex functions and beyond. In

Proceedings of Conference on Learning Theory, volume

125 of PMLR, pp. 1894–1938. ML Research Press, July

2020. (page 5)

Hinton, G. E. and van Camp, D. Keeping the neural net-

works simple by minimizing the description length of the

weights. In Proceedings of the Annual Conference on

Computational Learning Theory, pp. 5–13, Santa Cruz,

California, United States, 1993. ACM Press. (page 2)

Hoffman, M. and Ma, Y. Black-box variational inference

as a parametric approximation to Langevin dynamics. In

Proceedings of the International Conference on Machine

Learning, PMLR, pp. 4324–4341. ML Research Press,

November 2020. (page 9)

Jin, J. On the convergence of ﬁrst order methods for

quasar-convex optimization. (arXiv:2010.04937), Octo-

ber 2020. (page 5)

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul,

L. K. An introduction to variational methods for graph-

ical models. Machine Learning, 37(2):183–233, 1999.

(pages 1,2)

Karimi, H., Nutini, J., and Schmidt, M. Linear conver-

gence of gradient and proximal-gradient methods under

the Polyak-Łojasiewicz condition. In Machine Learn-

ing and Knowledge Discovery in Databases, Lecture

Notes in Computer Science, pp. 795–811, Cham, 2016.

Springer International Publishing. (page 5)

Khaled, A. and Richt´

arik, P. Better theory for SGD in the

nonconvex world. Transactions of Machine Learning

Research, 2023. (pages 2,3,4,9)

Kim, K., Wu, K., Oh, J., Ma, Y., and Gardner,

J. R. Black-box variational inference converges.

(arXiv:2305.15349), May 2023. (pages 2,9)

Kingma, D. P. and Ba, J. Adam: A Method for Stochastic

Optimization. In Proceedings of the International Con-

ference on Learning Representations, San Diego, Cali-

fornia, USA, 2015. (page 1)

Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and

Blei, D. M. Automatic differentiation variational infer-

ence. Journal of Machine Learning Research, 18(14):

1–45, 2017. (pages 1,2,3)

Leger, J.-B. Parametrization cookbook: A set of bijective

parametrizations for using machine learning methods in

statistical inference. (arXiv:2301.08297), January 2023.

(page 2)

Li, X. and Milzarek, A. A uniﬁed convergence theorem for

stochastic optimization methods. In Advances in Neural

Information Processing Systems, October 2022. (page 4)

Liu, J. and Yuan, Y. On almost sure convergence rates of

stochastic gradient methods. In Proceedings of the Con-

ference on Learning Theory, volume 178 of PMLR, pp.

2963–2983. ML Research Press, June 2022. (page 4)

Liu, S. and Owen, A. B. Quasi-Monte Carlo quasi-Newton

in Variational Bayes. Journal of Machine Learning Re-

search, 22(243):1–23, 2021. (page 9)

Ma, S., Bassily, R., and Belkin, M. The power of interpola-

tion: Understanding the effectiveness of SGD in modern


over-parametrized learning. In Proceedings of the Inter-

national Conference on Machine Learning, volume 80 of

PMLR, pp. 3325–3334. ML Research Press, July 2018.

(page 1)

Miller, A., Foti, N., D’ Amour, A., and Adams, R. P.

Reducing reparameterization gradient variance. In Ad-

vances in Neural Information Processing Systems, vol-

ume 30. Curran Associates, Inc., 2017. (page 9)

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A.

Monte Carlo gradient estimation in machine learning.

Journal of Machine Learning Research, 21(132):1–62,

2020a. (pages 1,9)

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A.

Monte Carlo gradient estimation in machine learning.

Journal of Machine Learning Research, 21(132):1–62,

2020b. (page 3)

Nguyen, L., Nguyen, P. H., van Dijk, M., Richtarik, P.,

Scheinberg, K., and Takac, M. SGD and Hogwild! Con-

vergence without the bounded gradients assumption. In

Proceedings of the International Conference on Machine

Learning, volume 80 of PMLR, pp. 3750–3758. ML Re-

search Press, July 2018. (pages 1,3)

Peterson, C. and Anderson, J. R. A mean ﬁeld theory learn-

ing algorithm for Neural Networks. Complex Systems, 1

(5):995–1019, 1987. (page 2)

Peterson, C. and Hartman, E. Explorations of the mean

ﬁeld theory learning algorithm. Neural Networks, 2(6):

475–494, January 1989. (page 2)

Polyak, B. T. and Tsypkin, Y. Z. Pseudogradient adaptation

and training algorithms. Automatic Remote Control, 34

(3):45–68, 1973. (pages 2,3)

Ranganath, R., Gerrish, S., and Blei, D. Black box vari-

ational inference. In Proceedings of the International

Conference on Artiﬁcial Intelligence and Statistics, vol-

ume 33 of PMLR, pp. 814–822. ML Research Press,

April 2014. (pages 1,9)

Robbins, H. and Monro, S. A stochastic approximation

method. The Annals of Mathematical Statistics, 22(3):

400–407, September 1951. (page 1)

Roeder, G., Wu, Y., and Duvenaud, D. K. Sticking the

landing: Simple, lower-variance gradient estimators for

variational inference. In Advances in Neural Information

Processing Systems, volume 30. Curran Associates, Inc.,

2017. (page 9)

Salvatier, J., Wiecki, T. V., and Fonnesbeck, C. Probabilis-

tic programming in Python using PyMC3. PeerJ Com-

puter Science, 2:e55, April 2016. (pages 1,4)

Schmidt, M. and Roux, N. L. Fast convergence of stochas-

tic gradient descent under a strong growth condition.

(arXiv:1308.6370), August 2013. (pages 1,3)

Titsias, M. and L´

azaro-Gredilla, M. Doubly stochastic vari-

ational Bayes for non-conjugate inference. In Proceed-

ings of the International Conference on Machine Learn-

ing, volume 32 of PMLR, pp. 1971–1979. ML Research

Press, June 2014. (page 1)

Tseng, P. An incremental gradient(-projection) method

with momentum term and adaptive stepsize rule. SIAM

Journal on Optimization, 8(2):506–531, May 1998.

(pages 1,3)

Vaswani, S., Bach, F., and Schmidt, M. Fast and faster

convergence of SGD for over-parameterized models and

an accelerated perceptron. In Proceedings of the Inter-

national Conference on Artiﬁcial Intelligence and Statis-

tics, volume 89 of PMLR, pp. 1195–1204. ML Research

Press, April 2019. (pages 1,3)

Welandawe, M., Andersen, M. R., Vehtari, A., and Hug-

gins, J. H. Robust, automated, and accurate black-box

variational inference. (arXiv:2203.15945), March 2022.

(page 1)

Xu, M., Quiroz, M., Kohn, R., and Sisson, S. A. Variance

reduction properties of the reparameterization trick. In

Proceedings of the International Conference on Artiﬁ-

cial Intelligence and Statistics, volume 89 of PMLR, pp.

2711–2720. ML Research Press, April 2019. (pages 1,3,

5,9)

Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. Yes,

but did it work?: Evaluating variational inference. In

Proceedings of the International Conference on Machine

Learning, PMLR, pp. 5581–5590. ML Research Press,

July 2018. (page 1)

Zhang, C., Butepage, J., Kjellstrom, H., and Mandt, S. Ad-

vances in variational inference. IEEE Transactions on

Pattern Analysis and Machine Intelligence, 41(8):2008–

2026, August 2019. (page 1)

Zhang, Y., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q. Adam

can converge without any modiﬁcation on update rules.

In Advances in Neural Information Processing Systems,

2022. (page 1)


TABLE OF CONTENTS

1 Introduction 1

2 Preliminaries 2

2.1 Variational Inference . . . . . . . . . 2

2.2 Variational Family . . . . . . . . . . 2

2.3 Reparameterization Trick . . . . . . . 3

2.4 Gradient Variance Assumptions in

Stochastic Gradient Descent . . . . . 3

2.5 Covariance Parameterizations . . . . . 4

3 Main Results 4

3.1 Key Lemmas . . . . . . . . . . . . . 4

3.2 Upper Bounds . . . . . . . . . . . . . 5

3.3 Upper Bound Under Bounded Entropy 7

3.4 Matching Lower Bound . . . . . . . . 7

4 Simulations 8

4.1 Synthetic Problem . . . . .