
Practical and Matching Gradient Variance Bounds for

Black-Box Variational Bayesian Inference

Kyurae Kim 1, Kaiwen Wu 1, Jisu Oh 2, Jacob R. Gardner 1

Abstract

Understanding the gradient variance of black-box variational inference (BBVI) is a crucial step for establishing its convergence and developing algorithmic improvements. However, existing studies have yet to show that the gradient variance of BBVI satisfies the conditions used to study the convergence of stochastic gradient descent (SGD), the workhorse of BBVI. In this work, we show that BBVI satisfies a matching bound corresponding to the ABC condition used in the SGD literature when applied to smooth and quadratically-growing log-likelihoods. Our results generalize to nonlinear covariance parameterizations widely used in the practice of BBVI. Furthermore, we show that the variance of the mean-field parameterization has provably superior dimensional dependence.

1. Introduction

Variational inference (VI; Jordan et al. 1999; Blei et al. 2017; Zhang et al. 2019) algorithms are fast and scalable Bayesian inference methods widely applied in the fields of statistics and machine learning. In particular, black-box VI (BBVI; Ranganath et al. 2014; Titsias & Lázaro-Gredilla 2014) leverages stochastic gradient descent (SGD; Robbins & Monro 1951; Bottou 1999) for inference of non-conjugate probabilistic models. With the development of bijectors (Kucukelbir et al., 2017; Dillon et al., 2017; Fjelde et al., 2020), most of the methodological advances in BBVI have now been abstracted out through various probabilistic programming frameworks (Carpenter et al., 2017; Ge et al., 2018; Dillon et al., 2017; Bingham et al., 2019; Salvatier et al., 2016).

1 Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia, Pennsylvania, United States. 2 Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States. Correspondence to: Kyurae Kim <kyrkim@seas.upenn.edu>, Jacob R. Gardner <jrgardner@seas.upenn.edu>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Despite the advances of BBVI, little is known about its the-

oretical properties. Even when restricted to the location-

scale family (Deﬁnition 2), it is unknown whether BBVI is

guaranteed to converge without having to modify the algo-

rithms used in practice, for example, by enforcing bounded

domains, bounded support, bounded gradients, and such.

This theoretical insight is necessary since BBVI meth-

ods are known to be less robust (Yao et al.,2018;Dhaka

et al.,2020;Welandawe et al.,2022;Dhaka et al.,2021;

Domke,2020) compared to other inference methods such

as Markov chain Monte Carlo. Although progress has been

made to formalize the theory of BBVI with some gener-

ality, the gap between our understanding of BBVI and the

convergence guarantees of SGD remains open. For exam-

ple, Domke (2019;2020) provided smoothness and gradi-

ent variance guarantees. Still, these results do not yet yield

a full convergence guarantee and do not extend to nonlin-

ear covariance parameterizations used in practice.

In this work, we investigate whether recent progress

in relaxing the gradient variance assumptions used in

SGD (Tseng,1998;Vaswani et al.,2019;Schmidt & Roux,

2013;Bottou et al.,2018;Gower et al.,2019;2021b;

Nguyen et al.,2018) apply to BBVI. These extensions

have led to new insights that the structure of the gradi-

ent bounds can have non-trivial interactions with gradient-

adaptive SGD algorithms (Zhang et al.,2022). For ex-

ample, when the “interpolation assumption” (the gradient

noise converges to 0; Schmidt & Roux 2013;Ma et al.

2018;Vaswani et al. 2019) does not hold, ADAM (Kingma

& Ba,2015) provably diverges with certain stepsize com-

binations (Zhang et al.,2022). Until BBVI can be shown

to conform to the assumptions used by these recent works,

it is unclear how these results relate to BBVI.

While the variance of BBVI gradient estimators has been

studied before (Xu et al.,2019;Domke,2019;Mohamed

et al.,2020a;Fujisawa & Sato,2021), the connection with

the conditions used in SGD has yet to be established. As

such, we answer the following question:

Does the gradient variance of BBVI conform to

the conditions assumed in convergence guaran-


tees of SGD without modifying the implementa-

tions used in practice?

The answer is yes! Assuming the target log joint distribution is smooth and quadratically growing, we show that the gradient variance of BBVI satisfies the ABC condition (Assumption 2) used by Polyak & Tsypkin (1973); Khaled & Richtárik (2023); Gower et al. (2021b). Our analysis extends the previous result of Domke (2019) to covariance parameterizations involving nonlinear functions for conditioning the diagonal (see Section 2.5), as commonly done in practice. Furthermore, we prove that the gradient variance of the mean-field parameterization (Peterson & Anderson, 1987; Peterson & Hartman, 1989; Hinton & van Camp, 1993) results in better dimensional dependence compared to full-rank ones.

Overall, our results should act as a key ingredient for obtaining a full convergence guarantee for BBVI, as recently done by Kim et al. (2023).

Our contributions are summarized as follows:

❶ We provide upper bounds on the gradient variance of BBVI that match the ABC condition (Assumption 2) used for analyzing SGD.

➤ Theorems 1 and 2 do not require any modification of the algorithms used in practice.

➤ Theorem 3 achieves better constants under the stronger bounded entropy assumption.

❷ Our analysis applies to BBVI parameterizations (Section 2.5) widely used in practice (Table 1).

➤ Lemma 1 enables the bounds to cover nonlinear covariance parameterizations.

➤ Lemma 3 and Remark 4 show that the gradient variance of the mean-field parameterization has superior dimensional scaling.

❸ We provide a matching lower bound (Theorem 4) on the gradient variance, showing that, under the stated assumptions, the ABC condition is the weakest assumption applicable to BBVI.

2. Preliminaries

Notation   Random variables are denoted in serif, while their realizations are in regular font (i.e., $x$ is a realization of the random variable $\mathsf{x}$, and $\boldsymbol{x}$ is a realization of the vector-valued $\boldsymbol{\mathsf{x}}$). $\|\boldsymbol{x}\|_2 = \sqrt{\langle \boldsymbol{x}, \boldsymbol{x} \rangle} = \sqrt{\boldsymbol{x}^{\top}\boldsymbol{x}}$ denotes the Euclidean norm, while $\|\boldsymbol{A}\|_{\mathrm{F}} = \sqrt{\mathrm{tr}(\boldsymbol{A}^{\top}\boldsymbol{A})}$ is the Frobenius norm, where $\mathrm{tr}(\boldsymbol{A}) = \sum_{i=1}^{d} A_{ii}$ is the matrix trace.

2.1. Variational Inference

Variational inference (Peterson & Anderson, 1987; Hinton & van Camp, 1993) is a family of inference algorithms devised to solve the problem

$$\mathrm{minimize}_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} \; D_{\mathrm{KL}}(q_{b,\boldsymbol{\lambda}}, \pi), \tag{1}$$

where $q_{b,\boldsymbol{\lambda}}$ is called the "variational approximation", $\pi$ is the distribution of interest, and $D_{\mathrm{KL}}$ is the (exclusive) Kullback-Leibler (KL) divergence.

For Bayesian inference, $\pi$ is the posterior distribution

$$\pi(\boldsymbol{z}) \propto \ell(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z}) = \ell(\boldsymbol{x}, \boldsymbol{z}),$$

where $\ell(\boldsymbol{x} \mid \boldsymbol{z})$ is the likelihood and $p(\boldsymbol{z})$ is the prior. In practice, one only has access to the likelihood and the prior. Thus, Equation (1) cannot be directly solved. Instead, we can minimize the negative evidence lower bound (ELBO; Jordan et al. 1999) function $F(\boldsymbol{\lambda})$.

Evidence Lower Bound   More formally, we solve

$$\mathrm{minimize}_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} \; F(\boldsymbol{\lambda}),$$

where $F$ is defined as

$$F(\boldsymbol{\lambda}) \triangleq -\mathbb{E}_{\boldsymbol{\mathsf{z}} \sim q_{b,\boldsymbol{\lambda}}}\left[\log \ell(\boldsymbol{x}, \boldsymbol{\mathsf{z}})\right] - H(q_{b,\boldsymbol{\lambda}}) \tag{2}$$
$$= -\mathbb{E}_{\boldsymbol{\mathsf{z}} \sim q_{b,\boldsymbol{\lambda}}}\left[\log \ell(\boldsymbol{x} \mid \boldsymbol{\mathsf{z}})\right] + D_{\mathrm{KL}}(q_{b,\boldsymbol{\lambda}}, p), \tag{3}$$

$\boldsymbol{\mathsf{z}}$ is the latent (random) variable, $q_{b,\boldsymbol{\lambda}}$ is the variational distribution, $b$ is a bijector (support transformation), and $H$ is the differential entropy.

The bijector $b$ (Dillon et al., 2017; Fjelde et al., 2020; Leger, 2023) is a differentiable bijective map that is used to de-constrain the support of constrained random variables. For example, when $\mathsf{z}$ is expected to follow a gamma distribution, using $\zeta = b(z)$ with $b(z) = \log z$ lets us work with $\zeta$, which can be any real number, unlike $z$. The use of $b^{-1}$ corresponds to the automatic differentiation VI formulation (ADVI; Kucukelbir et al. 2017), which is now widespread.
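As a minimal sketch (our own NumPy code, not the API of any probabilistic programming framework), a log bijector de-constraining a positive latent variable looks as follows; the log-absolute-determinant Jacobian term is the correction that later enters the ELBO definitions in Section 2.3.

```python
import numpy as np

# Hypothetical log bijector b(z) = log(z) for a positive latent z
# (e.g., one expected to follow a gamma distribution).
def bijector(z):
    return np.log(z)                  # constrained -> unconstrained

def bijector_inv(zeta):
    return np.exp(zeta)               # unconstrained -> constrained

def log_abs_det_jac_inv(zeta):
    # d/d zeta exp(zeta) = exp(zeta), so log|J_{b^{-1}}(zeta)| = zeta.
    return zeta

# Density of z = b^{-1}(zeta) pushed forward to the unconstrained space:
# log p_zeta(zeta) = log p_z(b^{-1}(zeta)) + log|J_{b^{-1}}(zeta)|.
def unconstrained_logpdf(zeta, logpdf_z):
    return logpdf_z(bijector_inv(zeta)) + log_abs_det_jac_inv(zeta)
```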

2.2. Variational Family

In this work, we speciﬁcally consider the location-scale

variational family with a standardized base distribution.

Definition 1 (Reparameterization Function). An affine mapping $\boldsymbol{t}_{\boldsymbol{\lambda}} : \mathbb{R}^{d} \to \mathbb{R}^{d}$ defined as

$$\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}) \triangleq \boldsymbol{C}\boldsymbol{u} + \boldsymbol{m},$$

with $\boldsymbol{\lambda}$ containing the parameters for forming the location $\boldsymbol{m} \in \mathbb{R}^{d}$ and scale $\boldsymbol{C} = \boldsymbol{C}(\boldsymbol{\lambda}) \in \mathbb{R}^{d \times d}$, is called the (location-scale) reparameterization function.

Definition 2 (Location-Scale Family). Let $\varphi$ be some $d$-dimensional distribution. Then, $q_{\boldsymbol{\lambda}}$ such that

$$\boldsymbol{\mathsf{\zeta}} \sim q_{\boldsymbol{\lambda}} \;\Leftrightarrow\; \boldsymbol{\mathsf{\zeta}} \stackrel{d}{=} \boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}); \quad \boldsymbol{\mathsf{u}} \sim \varphi$$

is said to be a member of the location-scale family indexed by the base distribution $\varphi$ and parameter $\boldsymbol{\lambda}$.
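A minimal sketch (our own NumPy code, assuming a standard Gaussian base distribution and a dense scale matrix passed in directly) of the reparameterization function and of sampling from the resulting location-scale family:

```python
import numpy as np

def t(m, C, u):
    """Location-scale reparameterization t_lambda(u) = C u + m (Definition 1)."""
    return C @ u + m

def sample_location_scale(m, C, n_samples, rng):
    """Draw zeta ~ q_lambda by pushing standardized base samples u ~ phi
    through t_lambda; a standard Gaussian base satisfies Assumption 1."""
    d = m.shape[0]
    u = rng.standard_normal((n_samples, d))   # iid standardized components
    return u @ C.T + m                        # each row is t_lambda(u)

rng = np.random.default_rng(0)
m = np.zeros(3)
C = np.diag([1.0, 0.5, 2.0])
zeta = sample_location_scale(m, C, n_samples=5, rng=rng)
```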


This family includes commonly used variational fami-

lies, such as the mean-ﬁeld Gaussian, full-rank Gaussian,

Student-T, and other elliptical distributions.

Remark 1 (Entropy of Location-Scale Distributions). The differential entropy of a location-scale family distribution (Definition 2) is

$$H(q_{\boldsymbol{\lambda}}) = H(\varphi) + \log\left|\det \boldsymbol{C}\right|.$$
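For a standard Gaussian base distribution, $H(\varphi) = (d/2)\log(2\pi e)$, so the entropy in Remark 1 can be evaluated directly from the scale matrix; a small sketch (ours), assuming $\boldsymbol{C}$ is diagonal or triangular with a positive diagonal:

```python
import numpy as np

def entropy_location_scale_gaussian(C):
    """H(q_lambda) = H(phi) + log|det C| for a standard Gaussian base (Remark 1).
    Assumes C is diagonal or triangular with positive diagonal, so the
    log-determinant is the sum of the log-diagonals."""
    d = C.shape[0]
    base_entropy = 0.5 * d * np.log(2.0 * np.pi * np.e)
    return base_entropy + np.sum(np.log(np.diag(C)))
```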

Definition 3 (ADVI Family; Kucukelbir et al. 2017). Let $q_{\boldsymbol{\lambda}}$ be some $d$-dimensional distribution. Then, $q_{b,\boldsymbol{\lambda}}$ such that

$$\boldsymbol{\mathsf{z}} \sim q_{b,\boldsymbol{\lambda}} \;\Leftrightarrow\; \boldsymbol{\mathsf{z}} \stackrel{d}{=} b^{-1}(\boldsymbol{\mathsf{\zeta}}); \quad \boldsymbol{\mathsf{\zeta}} \sim q_{\boldsymbol{\lambda}}$$

is said to be a member of the ADVI family with the base distribution $q_{\boldsymbol{\lambda}}$ parameterized by $\boldsymbol{\lambda}$.

We impose the following assumptions on the base distribution $\varphi$.

Assumption 1 (Base Distribution). $\varphi$ is a $d$-dimensional distribution such that $\boldsymbol{\mathsf{u}} \sim \varphi$ and $\boldsymbol{\mathsf{u}} = (\mathsf{u}_1, \ldots, \mathsf{u}_d)$ with independently and identically distributed components. Furthermore, $\varphi$ is (i) symmetric and standardized such that $\mathbb{E}\,\mathsf{u}_i = 0$, $\mathbb{E}\,\mathsf{u}_i^2 = 1$, $\mathbb{E}\,\mathsf{u}_i^3 = 0$, and (ii) has finite kurtosis $\mathbb{E}\,\mathsf{u}_i^4 = \kappa_{\varphi} < \infty$.

These assumptions are already satisfied in practice by, for example, generating $\mathsf{u}_i$ from a univariate normal or a Student-t distribution with $\nu > 4$ degrees of freedom.

2.3. Reparameterization Trick

When restricted to location-scale families (Definitions 2 and 3), we can invoke a change of variables, more commonly known as the "reparameterization trick," such that

$$\mathbb{E}_{\boldsymbol{\mathsf{z}} \sim q_{b,\boldsymbol{\lambda}}} \log \ell(\boldsymbol{x}, \boldsymbol{\mathsf{z}}) = \mathbb{E}_{\boldsymbol{\mathsf{\zeta}} \sim q_{\boldsymbol{\lambda}}} \log \ell\left(\boldsymbol{x}, b^{-1}(\boldsymbol{\mathsf{\zeta}})\right) = \mathbb{E}_{\boldsymbol{\mathsf{u}} \sim \varphi} \log \ell\left(\boldsymbol{x}, b^{-1}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\right)$$

through the law of the unconscious statistician. Differentiating this results in the reparameterization or path gradient, which often achieves lower variance than alternatives (Xu et al., 2019; Mohamed et al., 2020b).

Objective Function For generality, we represent our ob-

jective as a composite inﬁnite sum problem:

Definition 4 (Composite Infinite Sum).

$$F(\boldsymbol{\lambda}) = \mathbb{E}_{\boldsymbol{\mathsf{u}} \sim \varphi}\, f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}})) + h(\boldsymbol{\lambda}),$$

where $(\boldsymbol{\lambda}, \boldsymbol{\mathsf{u}}) \mapsto f \circ \boldsymbol{t}_{\boldsymbol{\lambda}} : \mathbb{R}^{p} \times \mathbb{R}^{d} \to \mathbb{R}$ is some bivariate stochastic function of $\boldsymbol{\lambda}$ and the "noise source" $\boldsymbol{\mathsf{u}}$, while $h$ is a deterministic regularization term.

By appropriately defining $f$ and $h$, we retrieve the two most common formulations of the ELBO in Equation (2) and Equation (3), respectively:

Definition 5 (ELBO Entropy-Regularized Form).

$$f_{\mathrm{H}}(\boldsymbol{\zeta}) = \underbrace{-\log \ell\left(\boldsymbol{x}, b^{-1}(\boldsymbol{\zeta})\right)}_{\text{joint likelihood}} - \log\left|\boldsymbol{J}_{b^{-1}}(\boldsymbol{\zeta})\right|, \qquad h_{\mathrm{H}}(\boldsymbol{\lambda}) = -H(q_{\boldsymbol{\lambda}}).$$

Definition 6 (ELBO KL-Regularized Form).

$$f_{\mathrm{KL}}(\boldsymbol{\zeta}) = \underbrace{-\log \ell\left(\boldsymbol{x} \mid b^{-1}(\boldsymbol{\zeta})\right)}_{\text{likelihood}} - \log\left|\boldsymbol{J}_{b^{-1}}(\boldsymbol{\zeta})\right|, \qquad h_{\mathrm{KL}}(\boldsymbol{\lambda}) = D_{\mathrm{KL}}(q_{\boldsymbol{\lambda}}, p).$$

Here, $\boldsymbol{J}_{b^{-1}}$ is the Jacobian of the bijector. Since $D_{\mathrm{KL}}(q_{\boldsymbol{\lambda}}, p)$ is seldom available in tractable form, the entropy-regularized form is the most widely used, while the KL-regularized form is common for Gaussian processes and variational autoencoders.

Gradient Estimator   We denote the $M$-sample estimator of the gradient of $F$ as

$$\boldsymbol{\mathsf{g}}(\boldsymbol{\lambda}) \triangleq \frac{1}{M} \sum_{m=1}^{M} \boldsymbol{\mathsf{g}}_m(\boldsymbol{\lambda}), \quad \text{where} \tag{4}$$
$$\boldsymbol{\mathsf{g}}_m(\boldsymbol{\lambda}) \triangleq \nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}_m)) + \nabla h(\boldsymbol{\lambda}); \quad \boldsymbol{\mathsf{u}}_m \sim \varphi. \tag{5}$$

We will occasionally drop $\boldsymbol{\lambda}$ for clarity.
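A sketch (ours) of the estimator in Equations (4) and (5) for the entropy-regularized form, specialized to a mean-field Gaussian with a softplus diagonal conditioner (Section 2.5) and an identity bijector; `grad_neg_logjoint` is a hypothetical callable returning $\nabla f_{\mathrm{H}}$, and the analytic gradients follow from the chain rule for $\boldsymbol{t}_{\boldsymbol{\lambda}}$:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def softplus_d(x):
    return 1.0 / (1.0 + np.exp(-x))     # derivative of softplus

def elbo_grad_estimate(m, s, grad_neg_logjoint, M, rng):
    """M-sample reparameterization gradient (Equations 4 and 5) of the
    entropy-regularized ELBO (Definition 5), mean-field Gaussian with a
    softplus conditioner and identity bijector (a sketch, not any
    framework's estimator)."""
    d = m.shape[0]
    scale = softplus(s)                  # psi(s), the diagonal of C
    g_m = np.zeros(d)
    g_s = np.zeros(d)
    for _ in range(M):
        u = rng.standard_normal(d)       # u ~ phi (standard Gaussian base)
        z = scale * u + m                # z = t_lambda(u)
        gf = grad_neg_logjoint(z)        # gradient of f_H at t_lambda(u)
        g_m += gf / M                    # chain rule w.r.t. the location m
        g_s += gf * softplus_d(s) * u / M   # chain rule w.r.t. the scale s
    # gradient of the deterministic regularizer h_H(lambda) = -H(q_lambda)
    g_s += -softplus_d(s) / scale
    return g_m, g_s
```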

2.4. Gradient Variance Assumptions in

Stochastic Gradient Descent

Gradient Variance Assumptions in SGD   For a while, most convergence proofs for SGD have relied on the "bounded variance" assumption: for a gradient estimator $\boldsymbol{\mathsf{g}}$, $\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \le \sigma^2$ for some finite constant $\sigma^2$. This assumption is problematic because ❶ these types of global constants result in loose bounds, and ❷ it directly contradicts the strong-convexity assumption (Nguyen et al., 2018). Thus, retrieving previously known SGD convergence rates under weaker assumptions has been an important research direction (Tseng, 1998; Vaswani et al., 2019; Schmidt & Roux, 2013; Bottou et al., 2018; Gower et al., 2019; 2021b; Nguyen et al., 2018).

ABC Condition In this work, we focus on the re-

cently rediscovered expected smoothness, or ABC, condi-

tion (Polyak & Tsypkin,1973;Gower et al.,2021b).

Assumption 2 (Expected Smoothness; ABC). $\boldsymbol{\mathsf{g}}$ is said to satisfy the expected smoothness condition if

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}(\boldsymbol{\lambda})\|_2^2 \le 2A\left(F(\boldsymbol{\lambda}) - F^*\right) + B \|\nabla F(\boldsymbol{\lambda})\|_2^2 + C$$

for some finite $A, B, C \ge 0$, where $F^* = \inf_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} F(\boldsymbol{\lambda})$.

As shown by Khaled & Richtárik (2023), this condition is not only strictly weaker than many of the previously used assumptions but also generalizes them by retrieving known convergence rates when tweaking the constants.


Table 1: Survey of Parameterizations Used in Black-Box Variational Inference

Framework                        Version   Parameterization       Conditioner   Code
TURING (Ge et al., 2018)         v0.23.2   Nonlinear Mean-field   softplus      link
STAN (Carpenter et al., 2017)    v2.31.0   Nonlinear Mean-field   exp           link
                                           Linear Cholesky                      link
PYRO (Bingham et al., 2019)      v0.10.1   Nonlinear Mean-field   softplus      link
                                           Linear Cholesky 1                    link
PYMC3 (Salvatier et al., 2016)   v5.0.1    Nonlinear Mean-field   softplus      link
                                           Nonlinear Cholesky     softplus      link
GPYTORCH (Gardner et al., 2018)  v1.9.0    Linear Cholesky                      link
                                           Linear Mean-field                    link

1 NumPyro also provides a low-rank Cholesky parameterization, which is non-linearly conditioned, but the full-rank Cholesky is linear.
* TensorFlow Probability (Dillon et al., 2017) was not included as it does not provide a fully pre-configured variational family (although tfp.experimental.vi.build*posterior exists, the parameterization is user-supplied).

With the ABC condition, for non-convex $L$-smooth functions, under an appropriately chosen stepsize $\gamma \le 1/(LB)$ (otherwise the bound may blow up, as explained by Khaled & Richtárik), SGD converges to a neighborhood of a stationary point whose size scales with $\gamma C$, at an $\mathcal{O}(1/(\gamma K))$ rate over $K$ iterations. Minor variants of the ABC condition have also been used to prove convergence of SGD for quasar convex functions (Gower et al., 2021a), stochastic heavy-ball/momentum methods (Liu & Yuan, 2022), and stochastic proximal methods (Li & Milzarek, 2022). Given the influx of results based on the ABC condition, connecting with it would significantly broaden our theoretical understanding of BBVI.
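As a toy illustration (ours, separate from the BBVI setting above), consider $F(\lambda) = \mathbb{E}_{u \sim \mathcal{N}(0,1)}[(\lambda + u)^2/2]$ with the single-sample gradient $g(\lambda) = \lambda + u$; then $\mathbb{E}[g(\lambda)^2] = \lambda^2 + 1$, so Assumption 2 holds with equality for $A = 1$, $B = 0$, $C = 1$. A quick Monte Carlo check:

```python
import numpy as np

# Toy objective F(lam) = E_u[(lam + u)^2 / 2], u ~ N(0, 1), so that
# F(lam) = lam^2/2 + 1/2, F* = 1/2, and grad F(lam) = lam.
rng = np.random.default_rng(0)
A, B, C, F_star = 1.0, 0.0, 1.0, 0.5

for lam in (-3.0, 0.0, 2.0):
    u = rng.standard_normal(200_000)
    second_moment = np.mean((lam + u) ** 2)        # Monte Carlo E[g(lam)^2]
    bound = 2 * A * (0.5 * lam**2 + 0.5 - F_star) + B * lam**2 + C
    print(f"lam={lam:+.1f}  E[g^2] ~ {second_moment:.3f}  ABC bound = {bound:.3f}")
```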

2.5. Covariance Parameterizations

When using the location-scale family (Definition 2), the scale matrix $\boldsymbol{C}$ can be parameterized in different ways. Any parameterization that results in a positive definite covariance $\boldsymbol{C}\boldsymbol{C}^{\top} \in \mathbb{S}^{d}_{++}$ is valid. We consider multiple parameterizations as the choice can result in different theoretical properties. A brief survey of the use of different parameterizations is shown in Table 1.

Linear Parameterization The previous results

by Domke (2019) considered the matrix square root

parameterization, which is linear with respect to the

variational parameters.

Definition 7 (Matrix Square Root).

$$\boldsymbol{C}(\boldsymbol{\lambda}) = \boldsymbol{C},$$

where $\boldsymbol{C} \in \mathbb{R}^{d \times d}$ is a matrix and $\boldsymbol{\lambda}_{\boldsymbol{C}} = \mathrm{vec}(\boldsymbol{C}) \in \mathbb{R}^{d^2}$ such that $\boldsymbol{\lambda} = (\boldsymbol{m}, \boldsymbol{\lambda}_{\boldsymbol{C}})$.

Note that $\boldsymbol{C}$ is not constrained to be symmetric, so this is not a matrix square root in a narrow sense. Also, this parameterization does not guarantee $\boldsymbol{C}\boldsymbol{C}^{\top}$ to be positive definite (only positive semidefinite), which occasionally results in the entropy term $H(q_{\boldsymbol{\lambda}})$ blowing up (Domke, 2020). Domke proposed to fix this by using proximal operators.

Nonlinear Parameterizations   In practice, optimization is preferably done in an unconstrained space $\mathbb{R}^{p}$, in which case positive definiteness can be ensured by explicitly mapping the diagonal elements to positive numbers. We denote this map by the diagonal conditioner $\psi$. (See Table 1 for a brief survey of their use.) The following two parameterizations are commonly used, where $\boldsymbol{D} = \mathrm{diag}(\psi(\boldsymbol{s})) \in \mathbb{R}^{d \times d}$ denotes a diagonal matrix such that $D_{ii} = \psi(s_i) > 0$.

Definition 8 (Mean-Field).

$$\boldsymbol{C}(\boldsymbol{\lambda}, \psi) = \mathrm{diag}(\psi(\boldsymbol{s})),$$

where $\boldsymbol{s} \in \mathbb{R}^{d}$ and $\boldsymbol{\lambda} = (\boldsymbol{m}, \boldsymbol{s})$.

Definition 9 (Cholesky).

$$\boldsymbol{C}(\boldsymbol{\lambda}, \psi) = \mathrm{diag}(\psi(\boldsymbol{s})) + \boldsymbol{L},$$

where $\boldsymbol{s} \in \mathbb{R}^{d}$, $\boldsymbol{L} \in \mathbb{R}^{d \times d}$ is a strictly lower triangular matrix, and $\boldsymbol{\lambda}_{\boldsymbol{L}} = \mathrm{vec}(\boldsymbol{L}) \in \mathbb{R}^{d(d-1)/2}$ such that $\boldsymbol{\lambda} = (\boldsymbol{m}, \boldsymbol{s}, \boldsymbol{\lambda}_{\boldsymbol{L}})$. The special case of $\psi(x) = x$ is called the "linear Cholesky" parameterization.

Diagonal conditioner   For the diagonal conditioner, the softplus function $\psi(x) = \mathrm{softplus}(x) \triangleq \log(1 + e^{x})$ (Dugas et al., 2000) or the exponential function $\psi(x) = e^{x}$ is commonly used. While using these nonlinear functions significantly complicates the analysis, assuming $\psi$ to be 1-Lipschitz retrieves practical guarantees.

Assumption 3 (Lipschitz Diagonal Conditioner). The diagonal conditioner $\psi$ is 1-Lipschitz continuous.

Remark 2. The softplus function is 1-Lipschitz.
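The three scale parameterizations of Definitions 7 to 9 can be summarized in code as follows (a sketch with our own names; the conditioner is applied only to the diagonal):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def scale_matrix_square_root(lambda_C, d):
    """Definition 7: C is an unconstrained d x d matrix (linear in lambda)."""
    return lambda_C.reshape(d, d)

def scale_mean_field(s, conditioner=softplus):
    """Definition 8: C = diag(psi(s)), so C C^T is diagonal and positive definite."""
    return np.diag(conditioner(s))

def scale_cholesky(s, L_strict, conditioner=softplus):
    """Definition 9: C = diag(psi(s)) + L with L strictly lower triangular;
    psi(s) > 0 keeps the diagonal positive, hence C C^T is positive definite."""
    assert np.allclose(L_strict, np.tril(L_strict, k=-1)), "L must be strictly lower triangular"
    return np.diag(conditioner(s)) + L_strict
```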

3. Main Results

3.1. Key Lemmas

The main challenge in studying BBVI is that the gradient of the composed function $\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))$ is different from $\nabla f$. For the matrix square root parameterization, Domke (2019) established the connection through Lemma 1 (restated as Lemma 6 in Appendix C.1). We generalize this result to nonlinear parameterizations:


Lemma 1. Let $\boldsymbol{t}_{\boldsymbol{\lambda}}: \mathbb{R}^{d} \to \mathbb{R}^{d}$ be a location-scale reparameterization function (Definition 1) and $f: \mathbb{R}^{d} \to \mathbb{R}$ be some differentiable function. Then, for $\boldsymbol{g} \triangleq \nabla f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))$,

(i) Mean-Field

$$\|\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2 = \|\boldsymbol{g}\|_2^2 + \boldsymbol{g}^{\top} \boldsymbol{U} \boldsymbol{\Phi} \boldsymbol{g},$$

(ii) Cholesky

$$\|\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2 = \|\boldsymbol{g}\|_2^2 + \boldsymbol{g}^{\top} \boldsymbol{\Sigma} \boldsymbol{g} + \boldsymbol{g}^{\top} \boldsymbol{U} (\boldsymbol{\Phi} - \mathbf{I}) \boldsymbol{g},$$

where $\boldsymbol{U}, \boldsymbol{\Phi}, \boldsymbol{\Sigma}$ are diagonal matrices whose diagonals are defined as

$$U_{ii} = u_i^2, \qquad \Phi_{ii} = \psi'(s_i)^2, \qquad \Sigma_{ii} = \sum_{j=1}^{i} u_j^2,$$

and $\psi$ is the diagonal conditioner for the scale matrix.

Proof. See the full proof in Appendix C.2.1.
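The mean-field identity of Lemma 1 (i) can be checked numerically: for the mean-field parameterization, $\nabla_{\boldsymbol{m}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u})) = \boldsymbol{g}$ and $\nabla_{s_i} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u})) = g_i\, \psi'(s_i)\, u_i$, so the squared norm decomposes exactly as stated. A short check (ours) using a quadratic $f$ with an analytic gradient:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def softplus_d(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
A = A @ A.T                                  # f(z) = 0.5 * z^T A z
grad_f = lambda z: A @ z

m, s, u = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
z = softplus(s) * u + m                      # t_lambda(u), mean-field
g = grad_f(z)

# Left-hand side: ||nabla_lambda f(t_lambda(u))||^2 with lambda = (m, s).
grad_m = g
grad_s = g * softplus_d(s) * u
lhs = np.sum(grad_m**2) + np.sum(grad_s**2)

# Right-hand side of Lemma 1 (i): ||g||^2 + g^T U Phi g, with U = diag(u_i^2)
# and Phi = diag(psi'(s_i)^2).
rhs = np.sum(g**2) + np.sum(g**2 * u**2 * softplus_d(s)**2)

assert np.isclose(lhs, rhs)
```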

Note that the relationships in this lemma are all equalities, which can be bounded with known quantities, as done in the next lemma. We note here that if any of our analyses were to be improved, this should be done by obtaining tighter bounds on the equalities in Lemma 1.

Lemma 2. Let $\boldsymbol{t}_{\boldsymbol{\lambda}}: \mathbb{R}^{d} \to \mathbb{R}^{d}$ be a location-scale reparameterization function (Definition 1), $f: \mathbb{R}^{d} \to \mathbb{R}$ be a differentiable function, and let $\psi$ satisfy Assumption 3.

(i) Mean-Field

$$\|\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2 \le \left(1 + \|\boldsymbol{U}\|_{\mathrm{F}}\right) \|\nabla f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2,$$

where $\boldsymbol{U}$ is a diagonal matrix such that $U_{ii} = u_i^2$.

(ii) Cholesky

$$\|\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2 \le \left(1 + \|\boldsymbol{u}\|_2^2\right) \|\nabla f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{u}))\|_2^2,$$

where the equality holds for the matrix square root parameterization.

Proof. See the full proof in Appendix C.2.2.

Lemma 1 acts as the interface between the properties of the parameterization and the likelihood $f$.

Remark 3 (Variance Reduction Through $\psi$). A nonlinear Cholesky parameterization with a 1-Lipschitz $\psi$ achieves lower or equal variance compared to the matrix square root and linear Cholesky parameterizations, where the equality is achieved by the matrix square root parameterization.

Dimension Dependence of Mean-Field The superior di-

mensional dependence of the mean-ﬁeld parameterization

is given by the following lemma:

Lemma 3. Let the assumptions of Lemma 2 hold and $\boldsymbol{\mathsf{u}} \sim \varphi$ satisfy Assumption 1. Then, for the mean-field parameterization,

$$\mathbb{E}\left[\left\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{z}}\right\|_2^2 \left(1 + \|\boldsymbol{\mathsf{U}}\|_{\mathrm{F}}\right)\right] \le \left(\sqrt{\kappa_{\varphi}} + \sqrt{d} + 1\right) \|\boldsymbol{m} - \bar{\boldsymbol{z}}\|_2^2 + \left(2\sqrt{\kappa_{\varphi} d} + 1\right) \|\boldsymbol{C}\|_{\mathrm{F}}^2.$$

Proof. See the full proof in Appendix C.2.3.

Remark 4 (Superior Variance of Mean-Field). The mean-field parameterization has $\mathcal{O}(\sqrt{d})$ dimensional dependence compared to the $\mathcal{O}(d)$ dimensional dependence of the full-rank parameterizations in Lemma 7.

Lastly, the following lemma is the basic building block for

all of our upper bounds:

Lemma 4. Let $\boldsymbol{\mathsf{g}}$ be the $M$-sample gradient estimator of $F$ (Definition 4) for some function $f$, and let $\boldsymbol{\mathsf{u}}$ be some random variable. Then,

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \le \frac{1}{M}\, \mathbb{E}\|\nabla_{\boldsymbol{\lambda}} f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\|_2^2 + \|\nabla F(\boldsymbol{\lambda})\|_2^2.$$

Proof. See the full proof in Appendix C.2.4.

3.2. Upper Bounds

We restrict our analysis to the class of log-likelihoods that

satisfy the following conditions:

Definition 10 ($L$-smoothness). A function $f: \mathbb{R}^{d} \to \mathbb{R}$ is $L$-smooth if it satisfies, for all $\boldsymbol{\zeta}, \boldsymbol{\zeta}' \in \mathbb{R}^{d}$,

$$\|\nabla f(\boldsymbol{\zeta}) - \nabla f(\boldsymbol{\zeta}')\|_2 \le L \|\boldsymbol{\zeta} - \boldsymbol{\zeta}'\|_2.$$

Definition 11 (Quadratic Functional Growth). A function $f: \mathbb{R}^{d} \to \mathbb{R}$ is $\mu$-quadratically growing if

$$\frac{\mu}{2}\, \|\boldsymbol{\zeta} - \bar{\boldsymbol{\zeta}}\|_2^2 \le f(\boldsymbol{\zeta}) - f^*$$

for all $\boldsymbol{\zeta} \in \mathbb{R}^{d}$, where $\bar{\boldsymbol{\zeta}} = \Pi_{\mathcal{X}^*}(\boldsymbol{\zeta})$ is the projection of $\boldsymbol{\zeta}$ onto the set of minimizers $\mathcal{X}^*$ of $f$ and $f^* = \inf_{\boldsymbol{\zeta} \in \mathbb{R}^{d}} f(\boldsymbol{\zeta})$.

The quadratic growth condition was first used by Anitescu (2000) and is strictly weaker than the Polyak-Łojasiewicz inequality (see Karimi et al. 2016, Appendix A for the proof). Furthermore, $\mu$-strongly (quasar) convex functions (Hinder et al., 2020; Jin, 2020) automatically satisfy quadratic growth, but our analysis does not require (quasar) convexity.

Both assumptions are commonly used in SGD. For study-

ing the gradient variance of BBVI, assuming both smooth-

ness and quadratic growth is weaker than the assumptions

of Xu et al. (2019) but stronger than those of Domke

(2019), who assumed only smoothness. The additional as-

sumption on growth is necessary to extend his results to

establish the ABC condition.

For the variational family, we assume the following:

Assumption 4. $q_{b,\boldsymbol{\lambda}}$ is a member of the ADVI family (Definition 3), where the underlying $q_{\boldsymbol{\lambda}}$ is a member of the location-scale family (Definition 2) with its base distribution $\varphi$ satisfying Assumption 1.

Entropy-Regularized Form First, we provide the upper

bound for the ELBO in entropy-regularized form. This re-

sult does not require any modiﬁcations to vanilla SGD.


Theorem 1. Let $\boldsymbol{\mathsf{g}}$ be an $M$-sample estimate of the gradient of the ELBO in entropy-regularized form (Definition 5). Also, assume that Assumptions 3 and 4 hold,

• $f_{\mathrm{H}}$ is $L_{\mathrm{H}}$-smooth, and
• $f_{\mathrm{KL}}$ is $\mu_{\mathrm{KL}}$-quadratically growing.

Then,

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \le \frac{4 L_{\mathrm{H}}^2}{\mu_{\mathrm{KL}} M} C(\kappa_{\varphi}, d) \left(F(\boldsymbol{\lambda}) - F^*\right) + \|\nabla F(\boldsymbol{\lambda})\|_2^2 + \frac{2 L_{\mathrm{H}}^2}{M} C(\kappa_{\varphi}, d) \left\|\bar{\boldsymbol{\zeta}}_{\mathrm{KL}} - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\right\|_2^2 + \frac{4 L_{\mathrm{H}}^2}{\mu_{\mathrm{KL}} M} C(\kappa_{\varphi}, d) \left(F^* - f^*_{\mathrm{KL}}\right),$$

where $C(\kappa_{\varphi}, d) = 2\sqrt{\kappa_{\varphi} d} + 1$ for the mean-field parameterization, $C(\kappa_{\varphi}, d) = \kappa_{\varphi} + d$ for the Cholesky and matrix square root parameterizations, $\bar{\boldsymbol{\zeta}}_{\mathrm{KL}}, \bar{\boldsymbol{\zeta}}_{\mathrm{H}}$ are the stationary points of $f_{\mathrm{KL}}, f_{\mathrm{H}}$, respectively, $F^* = \inf_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} F(\boldsymbol{\lambda})$, and $f^*_{\mathrm{KL}} = \inf_{\boldsymbol{\zeta} \in \mathbb{R}^{d}} f_{\mathrm{KL}}(\boldsymbol{\zeta})$.

Proof Sketch. From Lemma 4, we can see that the key quantity for upper bounding the gradient variance is $\mathbb{E}\|\nabla_{\boldsymbol{\lambda}} f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\|_2^2$. The bird's-eye view of the proof is as follows:

❶ The relationship between $\|\nabla_{\boldsymbol{\lambda}} f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\|_2^2$ and $\|\nabla f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\|_2^2$ is established through Lemma 2.

❷ Then, the $L_{\mathrm{H}}$-smoothness of $f_{\mathrm{H}}$ relates $\|\nabla f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\|_2^2$ with $\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2$, the average squared distance from the stationary point of $f_{\mathrm{H}}$.

❸ The average squared distance enables the simplification of stochastic terms through Lemmas 3 and 7. This step also introduces the dimension dependence.

From here, we are left with the $\mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2$ term. One might be tempted to assume quadratic growth of $f_{\mathrm{H}}$ and proceed as

$$\mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2 \le \frac{2}{\mu_{\mathrm{H}}}\, \mathbb{E}\left[f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}})) - f^*_{\mathrm{H}}\right].$$

However, for the entropy-regularized form, this soon runs into a dead end since, in

$$\mathbb{E}\left[f_{\mathrm{H}}(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))\right] - f^*_{\mathrm{H}} = F(\boldsymbol{\lambda}) - h_{\mathrm{H}}(\boldsymbol{\lambda}) - f^*_{\mathrm{H}} = \left(F(\boldsymbol{\lambda}) - F^*\right) + \left(F^* - f^*_{\mathrm{H}}\right) - h_{\mathrm{H}}(\boldsymbol{\lambda}),$$

the negative entropy term $h_{\mathrm{H}}$ is not bounded unless we rely on assumptions that require modifications to the BBVI algorithms (e.g., bounded support, bounded domain). Fortunately, the following inequality cleverly side-steps this problem:

$$\mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2 \le 2\, \mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{KL}}\|_2^2 + 2\, \|\bar{\boldsymbol{\zeta}}_{\mathrm{KL}} - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2, \tag{6}$$

albeit at the cost of some looseness. By converting the entropy-regularized form into the KL-regularized form, the regularizer term becomes $h_{\mathrm{KL}} = D_{\mathrm{KL}}(q_{\boldsymbol{\lambda}}, p) \ge 0$, which is bounded below by definition, unlike the entropic regularizer $h_{\mathrm{H}}$. The proof is completed by

❹ applying the quadratic growth assumption to relate the parameter distance to the function suboptimality gap, and

❺ upper bounding the KL regularizer term.

Proof. See the full proof in Appendix C.3.1.

Remark 5. If the bijector $b$ is an identity function, $\bar{\boldsymbol{\zeta}}_{\mathrm{KL}}$ and $\bar{\boldsymbol{\zeta}}_{\mathrm{H}}$ are the maximum likelihood (ML) and maximum a-posteriori (MAP) estimates, respectively. Thus, with enough datapoints, the term $\|\bar{\boldsymbol{\zeta}}_{\mathrm{KL}} - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2$ will be negligible since the ML and MAP estimates will be close.

Remark 6. It is also possible to tighten the constants by a factor of two. Instead of applying Equation (6), we can use the inequality

$$(a + b)^2 \le (1 + \epsilon^2)\, a^2 + (1 + \epsilon^{-2})\, b^2$$

for some $\epsilon > 0$. By setting $\epsilon^2 = \delta = \|\bar{\boldsymbol{\zeta}}_{\mathrm{KL}} - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2$,

$$\mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2^2 \le (1 + \delta)\, \mathbb{E}\|\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}) - \bar{\boldsymbol{\zeta}}_{\mathrm{KL}}\|_2^2 + \delta^2 + \delta.$$

Since $\delta \approx 0$, as explained in Remark 5, the constant in front of the first term is tightened almost by a factor of 2. However, the stated form is more convenient for theory since the first term does not depend on $\|\bar{\boldsymbol{\zeta}}_{\mathrm{KL}} - \bar{\boldsymbol{\zeta}}_{\mathrm{H}}\|_2$.

Remark 7. Let $\kappa_{\mathrm{cond.}} = L_{\mathrm{H}} / \mu_{\mathrm{KL}}$ be the condition number of the problem. For the full-rank parameterizations, the variance is bounded as $\mathcal{O}\left(L_{\mathrm{H}}\, \kappa_{\mathrm{cond.}}\, (d + \kappa_{\varphi}) / M\right)$. The variance depends linearly on ❶ the scaling of the problem $L_{\mathrm{H}}$, ❷ the conditioning of the problem $\kappa_{\mathrm{cond.}}$, ❸ the dimensionality of the problem $d$, and ❹ the tail properties of the variational family $\kappa_{\varphi}$, while the number of Monte Carlo samples $M$ linearly reduces the variance.

KL-Regularized Form   We now prove an equivalent result for the KL-regularized form. Here, we do not have to rely on Equation (6) since we already start from $f_{\mathrm{KL}}$, which results in better constants.

Theorem 2. Let $\boldsymbol{\mathsf{g}}$ be an $M$-sample estimator of the gradient of the ELBO in KL-regularized form (Definition 6). Also, assume that

• $f_{\mathrm{KL}}$ is $L_{\mathrm{KL}}$-smooth,
• $f_{\mathrm{KL}}$ is $\mu_{\mathrm{KL}}$-quadratically growing,

and Assumptions 3 and 4 hold. Then, the gradient variance is bounded above as

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \le \frac{2 L_{\mathrm{KL}}^2}{\mu_{\mathrm{KL}} M} C(\kappa_{\varphi}, d) \left(F(\boldsymbol{\lambda}) - F^*\right) + \|\nabla F(\boldsymbol{\lambda})\|_2^2 + \frac{2 L_{\mathrm{KL}}^2}{\mu_{\mathrm{KL}} M} C(\kappa_{\varphi}, d) \left(F^* - f^*_{\mathrm{KL}}\right),$$

where $C(\kappa_{\varphi}, d) = 2\sqrt{\kappa_{\varphi} d} + 1$ for the mean-field parameterization, $C(\kappa_{\varphi}, d) = \kappa_{\varphi} + d$ for the Cholesky and matrix square root parameterizations, $F^* = \inf_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} F(\boldsymbol{\lambda})$, and $f^*_{\mathrm{KL}} = \inf_{\boldsymbol{\zeta} \in \mathbb{R}^{d}} f_{\mathrm{KL}}(\boldsymbol{\zeta})$.

Proof. See the full proof in Appendix C.3.2.

3.3. Upper Bound Under Bounded Entropy

The bound in Theorem 1 is slightly loose due to the use of Equation (6) and Equation (29). An alternative bound with slightly tighter constants, although the gains are marginal compared to Remark 6, can be obtained by assuming the following:

Assumption 5 (Bounded Entropy). The regularization term is bounded below as $h_{\mathrm{H}}(\boldsymbol{\lambda}) \ge h^*_{\mathrm{H}}$.

For the entropy-regularized form, this corresponds to the entropy being bounded above by some constant since $h_{\mathrm{H}}(\boldsymbol{\lambda}) = -H(q_{\boldsymbol{\lambda}})$. When using the nonlinear parameterizations (Definitions 8 and 9), this assumption can be practically enforced by bounding the output of $\psi$ by some large constant $B$.

Proposition 1. Let the diagonal conditioner $\psi$ be bounded as $\psi(x) \le B$. Then, for any $d$-dimensional distribution $q_{\boldsymbol{\lambda}}$ in the location-scale family with the mean-field (Definition 8) or Cholesky (Definition 9) parameterizations,

$$h_{\mathrm{H}}(\boldsymbol{\lambda}) = -H(q_{\boldsymbol{\lambda}}) \ge -H(\varphi) - d \log B.$$

Proof. From Remark 1, $H(q_{\boldsymbol{\lambda}}) = H(\varphi) + \log|\det \boldsymbol{C}|$. Since $\boldsymbol{C}$ under Definitions 8 and 9 is a diagonal or triangular matrix, the log absolute determinant is the sum of the logs of the diagonals. The conclusion follows from the fact that the diagonals $C_{ii} = \psi(s_i)$ are bounded by $B$.
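A minimal way to enforce Assumption 5 in code is to clip the conditioner output from above (a sketch; $B$ plays the role of the bound in Proposition 1, and the clipped map remains positive and 1-Lipschitz):

```python
import numpy as np

def bounded_softplus(x, B):
    """Softplus conditioner clipped from above so that psi(x) <= B, which keeps
    the negative entropy bounded below as in Proposition 1 (a sketch; B is the
    bound used for Assumption 5)."""
    return np.minimum(np.log1p(np.exp(x)), B)
```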

This is essentially a weaker version of the bounded domain assumption, though only the diagonal elements of $\boldsymbol{C}$, namely $C_{11}, \ldots, C_{dd}$, are bounded. While this assumption results in an admittedly less realistic algorithm, it enables a tighter bound for the entropy-regularized form of the ELBO.

Theorem 3. Let $\boldsymbol{\mathsf{g}}$ be an $M$-sample estimator of the gradient of the ELBO in entropy-regularized form (Definition 5). Also, assume that

• $f_{\mathrm{H}}$ is $L_{\mathrm{H}}$-smooth,
• $f_{\mathrm{H}}$ is $\mu_{\mathrm{H}}$-quadratically growing,
• $h_{\mathrm{H}}$ is bounded below as $h_{\mathrm{H}}(\boldsymbol{\lambda}) \ge h^*_{\mathrm{H}}$ (Assumption 5),

and Assumptions 3 and 4 hold. Then, the gradient variance of $\boldsymbol{\mathsf{g}}$ is bounded above as

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \le \frac{2 L_{\mathrm{H}}^2}{\mu_{\mathrm{H}} M} C(\kappa_{\varphi}, d) \left(F(\boldsymbol{\lambda}) - F^*\right) + \|\nabla F(\boldsymbol{\lambda})\|_2^2 + \frac{2 L_{\mathrm{H}}^2}{\mu_{\mathrm{H}} M} C(\kappa_{\varphi}, d) \left(F^* - f^*_{\mathrm{H}} - h^*_{\mathrm{H}}\right),$$

where $C(\kappa_{\varphi}, d) = 2\sqrt{\kappa_{\varphi} d} + 1$ for the mean-field parameterization, $C(\kappa_{\varphi}, d) = \kappa_{\varphi} + d$ for the Cholesky parameterization, $F^* = \inf_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} F(\boldsymbol{\lambda})$, and $f^*_{\mathrm{H}} = \inf_{\boldsymbol{\zeta} \in \mathbb{R}^{d}} f_{\mathrm{H}}(\boldsymbol{\zeta})$.

Proof Sketch. Instead of using Equation (6), we apply the quadratic growth assumption directly to $f_{\mathrm{H}}$. The remaining entropic-regularizer term can now be bounded through the bounded entropy assumption.

Proof. See the full proof in Appendix C.3.3.

3.4. Matching Lower Bound

Finally, we present a matching lower bound on the gra-

dient variance of BBVI. Our lower bound holds broadly

for smooth and strongly convex problem instances that are

well-conditioned and high-dimensional.

Theorem 4. Let $\boldsymbol{\mathsf{g}}$ be an $M$-sample estimator of the gradient of the ELBO in either the entropy- or KL-regularized form. Also, let Assumption 4 hold, where the matrix square root parameterization is used. Then, for all $L$-smooth and $\mu$-strongly convex functions $f$ such that $L/\mu < \sqrt{d + 1}$, the variance of $\boldsymbol{\mathsf{g}}$ is bounded below by some strictly positive constant as

$$\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2 \ge \frac{2\mu^2 (d + 1) - 2L^2}{L M} \left(F(\boldsymbol{\lambda}) - F^*\right) + \|\nabla F(\boldsymbol{\lambda})\|_2^2 + \frac{2\mu^2 (d + 1) - 2L^2}{L M} \left(\mathbb{E}\, f(\boldsymbol{t}_{\boldsymbol{\lambda}^*}(\boldsymbol{u})) - f^*\right),$$

as long as $\boldsymbol{\lambda}$ is in a local neighborhood around the unique global optimum $\boldsymbol{\lambda}^* = \operatorname{argmin}_{\boldsymbol{\lambda} \in \mathbb{R}^{p}} F(\boldsymbol{\lambda})$, where $F^* = F(\boldsymbol{\lambda}^*)$ and $f^* = f(\boldsymbol{\zeta}^*)$ with $\boldsymbol{\zeta}^* = \operatorname{argmin}_{\boldsymbol{\zeta} \in \mathbb{R}^{d}} f(\boldsymbol{\zeta})$.

Proof Sketch. We use the fact that, with the matrix square root parameterization, if $f$ is $L$-smooth, then $\mathbb{E}\, f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}}))$ is also $L$-smooth (Domke, 2020). From this, the parameter suboptimality can be related to the function suboptimality as

$$\|\boldsymbol{\lambda} - \bar{\boldsymbol{\lambda}}\|_2^2 \ge \frac{2}{L} \left(\mathbb{E}\, f(\boldsymbol{t}_{\boldsymbol{\lambda}}(\boldsymbol{\mathsf{u}})) - f^*\right),$$

where $\bar{\boldsymbol{\lambda}} = (\bar{\boldsymbol{\zeta}}, \mathbf{O})$. For the entropy term, we circumvent the need to directly bound its value by restricting our interest to a neighborhood of the minimizer $\boldsymbol{\lambda}^*$, where the contribution of $h(\boldsymbol{\lambda}^*) - h(\boldsymbol{\lambda})$ will be marginal enough for the lower bound to hold.

Proof. See the full proof in Appendix C.3.4.


[Figure 1: four panels plotting the gradient variance $\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2$ and the upper bound $2A(F(\boldsymbol{\lambda}) - F^*) + B\|\nabla F\|_2^2 + C$ against the iteration count (panel titles: $D_{\mathrm{KL}}(q_{\boldsymbol{\lambda}}, \pi)$ and $F(\boldsymbol{\lambda}) - F^*$), for the Cholesky and mean-field parameterizations with $\psi(x) = \mathrm{softplus}(x)$, under Theorem 1 and Theorem 3.]

Figure 1: Evaluation of the bounds for a perfectly conditioned quadratic target function. The blue regions are the loosenesses resulting from either using (Theorem 1) or not using (Theorem 3) the bounded entropy assumption (Assumption 5), while the red regions are the remaining "technical loosenesses." The gradient variance was estimated from $10^3$ samples.

Remark 8 (Matching Dimensional Dependence). For well-conditioned problems such that $L/\mu < \sqrt{d + 1}$, a lower bound with the same dimensional dependence as our upper bounds holds near the optimum.

Remark 9 (Unimprovability of the ABC Condition). The lower bound suggests that the ABC gradient variance condition is unimprovable within the class of smooth, quadratically growing functions.

4. Simulations

We now evaluate our bounds and the insights gathered during the analysis through simulations. We implemented a bare-bones version of BBVI in Julia (Bezanson et al., 2017) with plain SGD. The stepsizes were manually tuned so that all problems converge at similar speeds. For all problems, we use a unit Gaussian base distribution such that $\varphi(u) = \mathcal{N}(u; 0, 1)$, resulting in a kurtosis of $\kappa_{\varphi} = 3$, and use $M = 10$ Monte Carlo samples.

4.1. Synthetic Problem

To test the ideal tightness of the bounds, we consider quadratics achieving the tightest bound for the constants $L_{\mathrm{H}}, L_{\mathrm{KL}}, \mu_{\mathrm{H}}, \mu_{\mathrm{KL}}$, given as

$$\log \ell(\boldsymbol{x} \mid \boldsymbol{z}) = -\frac{N \beta}{2}\, \|\boldsymbol{z} - \boldsymbol{z}^*\|_2^2; \qquad \log p(\boldsymbol{z}) = -\frac{1}{2\alpha}\, \|\boldsymbol{z}\|_2^2,$$

where $N$ simulates the effect of the number of datapoints. We set the constants as $\beta = 0.3$, $\alpha = 8.0$, and $N = 100$, the mode $\boldsymbol{z}^*$ is randomly sampled from a Gaussian, and the dimension of the problem is $d = 20$. For the bounded entropy case, we set $B = 2.0$ (the true standard deviation is on the order of 1e-3).
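A short sketch (ours, using the constants as assigned above) of this synthetic target and its gradient, which can be plugged into the reparameterization gradient estimator of Section 2.3:

```python
import numpy as np

beta, alpha, N, d = 0.3, 8.0, 100, 20    # constants as assigned above
rng = np.random.default_rng(0)
z_star = rng.standard_normal(d)          # mode of the likelihood, sampled from a Gaussian

def neg_logjoint(z):
    """f_H(z) = -log l(x | z) - log p(z) for the synthetic quadratic target."""
    return 0.5 * N * beta * np.sum((z - z_star) ** 2) + 0.5 / alpha * np.sum(z ** 2)

def grad_neg_logjoint(z):
    """Gradient of f_H, usable with the estimator sketched in Section 2.3."""
    return N * beta * (z - z_star) + z / alpha
```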

[Figure 2: two panels plotting the gradient variance $\mathbb{E}\|\boldsymbol{\mathsf{g}}\|_2^2$ against the iteration count; the left panel also shows the upper bound, while the right panel compares the matrix square root, Cholesky with $\psi(x) = x$, and Cholesky with $\psi(x) = \mathrm{softplus}(x)$ parameterizations.]

Figure 2: Linear regression on the AIRFOIL dataset. (left) Evaluation of the upper bound (Theorem 1). (right) Comparison of the variance of different parameterizations resulting in the same $\boldsymbol{m}, \boldsymbol{C}$.

Quality of Upper Bound   The results for the Cholesky and mean-field parameterizations with a softplus conditioner are shown in Figure 1. For the Cholesky parameterization, the bulk of the looseness comes from the treatment of the regularization term (blue region). The remaining "technical looseness" (red region) is relatively tight and can be shown to be tighter when using the linear parameterizations ($\psi(x) = x$) and the matrix square root parameterization, which is the tightest. However, for the mean-field parameterization, despite the superior constants (Remark 4), there is still room for improvement. Additional results for other parameterizations can be found in Appendix B.1.

4.2. Real Dataset

Model   We now evaluate the theoretical results on real datasets. Given a regression dataset $(\boldsymbol{X}, \boldsymbol{y})$, we use the linear Gaussian model

$$\boldsymbol{\mathsf{y}} \sim \mathcal{N}\left(\boldsymbol{X}\boldsymbol{w}, \sigma^2 \mathbf{I}\right); \qquad \boldsymbol{\mathsf{w}} \sim \mathcal{N}(\boldsymbol{0}, \alpha \mathbf{I}),$$

where $\sigma$ and $\alpha$ are hyperparameters. The smoothness and quadratic growth constants for this model are given by the maximum and minimum eigenvalues of $\sigma^{-2}\boldsymbol{X}^{\top}\boldsymbol{X} + \alpha^{-1}\mathbf{I}$ (for $f_{\mathrm{H}}$) and $\sigma^{-2}\boldsymbol{X}^{\top}\boldsymbol{X}$ (for $f_{\mathrm{KL}}$). $f^*_{\mathrm{KL}}$ and $f^*_{\mathrm{H}}$ are obtained from the modes of the likelihood and the posterior, respectively, while $F^*$ is the negative marginal log-likelihood.
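A sketch (ours) of how these constants are computed for a given design matrix, under the model stated above:

```python
import numpy as np

def smoothness_and_growth_constants(X, sigma, alpha):
    """Smoothness and quadratic-growth constants for the linear Gaussian model
    y ~ N(X w, sigma^2 I), w ~ N(0, alpha I): the extreme eigenvalues of
    sigma^-2 X^T X + alpha^-1 I (entropy-regularized form) and of
    sigma^-2 X^T X (KL-regularized form)."""
    H_posterior = X.T @ X / sigma**2 + np.eye(X.shape[1]) / alpha
    H_likelihood = X.T @ X / sigma**2
    eig_post = np.linalg.eigvalsh(H_posterior)
    eig_lik = np.linalg.eigvalsh(H_likelihood)
    return (eig_post.max(), eig_post.min()), (eig_lik.max(), eig_lik.min())
```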

Quality of Upper Bound   Figure 2 (left) shows the result on the AIRFOIL dataset (Dua & Graff, 2017). The constants are $L_{\mathrm{H}} = 3.520 \times 10^4$ and $L_{\mathrm{KL}} = 2.909 \times 10^3$. Due to poor conditioning, the bound is much looser compared to the quadratic case. We note that generalizing our bounds to utilize matrix smoothness and matrix quadratic growth, as done by Domke (2019), would tighten the bounds, but the theoretical gains would be marginal. Detailed information about the datasets and additional results for other parameterizations can be found in Appendix B.2.

Comparison of Parameterizations   Figure 2 (right) compares the gradient variance resulting from the different parameterizations. For a fair comparison, the gradient is estimated at a $\boldsymbol{\lambda}$ that results in the same $\boldsymbol{m}, \boldsymbol{C}$ for all three parameterizations. This shows the gradual increase in variance from (i) not using a nonlinear conditioner (linear Cholesky) and (ii) increasing the number of variational parameters (matrix square root).

5. Related Works

Controlling Gradient Variance The main algorithmic

challenge in BBVI is to control the gradient noise (Ran-

ganath et al.,2014). This has led to various methods

for reducing the variance of VI gradient estimators us-

ing control variates (Ranganath et al.,2014;Miller et al.,

2017;Geffner & Domke,2018), ensembling of estima-

tors (Geffner & Domke,2020), modifying the differ-

entiation procedure (Roeder et al.,2017), quasi-Monte

Carlo (Buchholz et al.,2018;Liu & Owen,2021), and mul-

tilevel Monte Carlo (Fujisawa & Sato,2021). Cultivating a

deeper understanding of the properties of gradient variance

could further extend this list.

Convergence Guarantees   Obtaining full convergence guarantees has been an important task for understanding BBVI algorithms. However, most guarantees so far have relied on strong assumptions, such as the log-likelihood being Lipschitz (Chérief-Abdellatif et al., 2019; Alquier, 2021), the gradient variance being bounded by a constant (Liu & Owen, 2021; Buchholz et al., 2018; Domke, 2020; Hoffman & Ma, 2020), or the support of $q_{\boldsymbol{\lambda}}$ being bounded (Fujisawa & Sato, 2021). Our result shows that similar results can be obtained under relaxed assumptions. Meanwhile, Bhatia et al. (2022) have recently proven a full complexity guarantee for a variant of BBVI. But similarly to Hoffman & Ma (2020), they only optimize the scale matrix $\boldsymbol{C}$, and the specifics of the algorithm diverge from the usual BBVI implementations as it uses stochastic power iterations instead of SGD.

Gradient Variance Guarantees Studying the actual gra-

dient variance properties of BBVI has only started to make

progress recently. Fan et al. (2015) ﬁrst provided bounds

by assuming the log-likelihood to be Lipschitz. Under

more general conditions, Domke (2019) provided tight

bounds for smooth log-likelihoods, which our work builds

upon. Domke’s result can also be seen as a direct gen-

eralization of the results of Xu et al. (2019), which are

restricted to quadratic log-likelihoods and the mean-ﬁeld

family. Lastly, Mohamed et al. (2020a) provides a concep-

tual evaluation of gradient estimators used in BBVI.

6. Discussions

Conclusions   In this work, we have proven upper bounds on the gradient variance of BBVI with the location-scale family for smooth, quadratically-growing log-likelihoods. Specifically, we have provided bounds for the ELBO in both the entropy-regularized and KL-regularized forms. Our guarantees hold without a single modification to the algorithms used in practice, although stronger assumptions establish a tighter bound for the entropy-regularized form of the ELBO. Also, our bounds correspond to the ABC condition (Section 2.4) and the expected residual (ER) condition, where the latter is a special case of the former with $B = 1$. The ER condition has been used by Gower et al. (2021a) for proving convergence of SGD on quasar convex functions, which generalize convex functions. The results of this paper are used by Kim et al. (2023) to establish convergence of BBVI through the results of Khaled & Richtárik (2023).

Limitations   Our results have the following limitations: ❶ our results only apply to smooth and quadratically-growing log-likelihoods and ❷ the location-scale ADVI family. Also, ❸ our bounds cannot distinguish the variance of the Cholesky and matrix square root parameterizations, ❹ and empirically, the bounds for the mean-field parameterization appear loose. Furthermore, ❺ our results only work with 1-Lipschitz diagonal conditioners such as the softplus function. Unfortunately, assuming both smoothness and quadratic growth is quite restrictive, as it leaves only a small number of known distributions. Also, in practice, non-Lipschitz conditioners such as the exponential function are widely used. While obtaining similar bounds with such conditioners would be challenging, constructing a theoretical framework that extends to them would be an important future research direction.

Acknowledgements

This work was supported by NSF award IIS-2145644.


References

Alquier, P. Non-Exponentially Weighted Aggregation: Re-

gret Bounds for Unbounded Loss Functions. In Proceed-

ings of the International Conference on Machine Learn-

ing, volume 193 of PMLR, pp. 207–218. ML Research

Press, July 2021. (page 9)

Anitescu, M. Degenerate nonlinear programming with a

quadratic growth condition. SIAM Journal on Optimiza-

tion, 10(4):1116–1135, January 2000. (page 5)

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B.

Julia: A fresh approach to numerical computing. SIAM

review, 59(1):65–98, 2017. (page 8)

Bhatia, K., Kuang, N. L., Ma, Y.-A., and Wang, Y. Sta-

tistical and computational trade-offs in variational in-

ference: A case study in inferential model selection.

(arXiv:2207.11208), July 2022. (page 9)

Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F.,

Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Hors-

fall, P., and Goodman, N. D. Pyro: Deep universal prob-

abilistic programming. Journal of Machine Learning Re-

search, 20(28):1–6, 2019. (pages 1,4)

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Vari-

ational inference: A review for statisticians. Journal

of the American Statistical Association, 112(518):859–

877, April 2017. (page 1)

Bottou, L. On-line learning and stochastic approxima-

tions. In On-Line Learning in Neural Networks, pp.

9–42. Cambridge University Press, ﬁrst edition, January

1999. (page 1)

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization

methods for large-scale machine learning. SIAM Review,

60(2):223–311, January 2018. (pages 1,3)

Buchholz, A., Wenzel, F., and Mandt, S. Quasi-Monte

Carlo variational inference. In Proceedings of the Inter-

national Conference on Machine Learning, volume 80

of PMLR, pp. 668–677. ML Research Press, July 2018.

(page 9)

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D.,

Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li,

P., and Riddell, A. Stan: A probabilistic programming

language. Journal of Statistical Software, 76(1), 2017.

(pages 1,4)

Ch´

erief-Abdellatif, B.-E., Alquier, P., and Khan, M. E. A

generalization bound for online variational inference. In

Proceedings of the Asian Conference on Machine Learn-

ing, volume 101 of PMLR, pp. 662–677. ML Research

Press, October 2019. (page 9)

Dhaka, A. K., Catalina, A., Andersen, M. R., ns Mag-

nusson, M., Huggins, J., and Vehtari, A. Robust, ac-

curate stochastic optimization for variational inference.

In Advances in Neural Information Processing Systems,

volume 33, pp. 10961–10973. Curran Associates, Inc.,

2020. (page 1)

Dhaka, A. K., Catalina, A., Welandawe, M., Andersen,

M. R., Huggins, J., and Vehtari, A. Challenges and

opportunities in high dimensional variational inference.

In Advances in Neural Information Processing Systems,

volume 34, pp. 7787–7798. Curran Associates, Inc.,

2021. (page 1)

Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasude-

van, S., Moore, D., Patton, B., Alemi, A., Hoffman,

M., and Saurous, R. A. TensorFlow distributions.

(arXiv:1711.10604), November 2017. (pages 1,2,4)

Domke, J. Provable gradient variance guarantees for black-

box variational inference. In Advances in Neural In-

formation Processing Systems, volume 32. Curran As-

sociates, Inc., 2019. (pages 1,2,4,5,9,13,16,17,18,

21)

Domke, J. Provable smoothness guarantees for black-box

variational inference. In Proceedings of the Interna-

tional Conference on Machine Learning, volume 119 of

PMLR, pp. 2587–2596. ML Research Press, July 2020.

(pages 1,4,7,9,17,23)

Dua, D. and Graff, C. UCI machine learning repository.

2017. (page 9)

Dugas, C., Bengio, Y., B´

elisle, F., Nadeau, C., and Garcia,

R. Incorporating second-order functional knowledge for

better option pricing. In Advances in Neural Informa-

tion Processing Systems, volume 13. MIT Press, 2000.

(page 4)

Fan, K., Wang, Z., Beck, J., Kwok, J., and Heller, K. A.

Fast second order stochastic backpropagation for vari-

ational inference. In Advances in Neural Information

Processing Systems, volume 28. Curran Associates, Inc.,

2015. (page 9)

Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., and Ge, H.

Bijectors.jl: Flexible transformations for probability dis-

tributions. In Proceedings of The Symposium on Ad-

vances in Approximate Bayesian Inference, volume 118

of PMLR, pp. 1–17. ML Research Press, February 2020.

(pages 1,2)

Fujisawa, M. and Sato, I. Multilevel Monte Carlo varia-

tional inference. Journal of Machine Learning Research,

22(278):1–44, 2021. (pages 1,9)


Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and

Wilson, A. G. GPyTorch: Blackbox matrix-matrix Gaus-

sian process inference with GPU acceleration. In Ad-

vances in Neural Information Processing Systems, vol-

ume 31. Curran Associates, Inc., 2018. (page 4)

Ge, H., Xu, K., and Ghahramani, Z. Turing: A language

for ﬂexible probabilistic inference. In Proceedings of

the International Conference on Machine Learning, vol-

ume 84 of PMLR, pp. 1682–1690. ML Research Press,

2018. (pages 1,4)

Geffner, T. and Domke, J. Using large ensembles of control

variates for variational inference. In Advances in Neural

Information Processing Systems, volume 31. Curran As-

sociates, Inc., 2018. (page 9)

Geffner, T. and Domke, J. A rule for gradient estima-

tor selection, with an application to variational infer-

ence. In Proceedings of the International Conference

on Artiﬁcial Intelligence and Statistics, volume 108 of

PMLR, pp. 1803–1812. ML Research Press, August

2020. (page 9)

Gower, R., Sebbouh, O., and Loizou, N. SGD for struc-

tured nonconvex functions: Learning rates, minibatching

and interpolation. In Proceedings of The International

Conference on Artiﬁcial Intelligence and Statistics, vol-

ume 130 of PMLR, pp. 1315–1323. ML Research Press,

March 2021a. (pages 4,9)

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A.,

Shulgin, E., and Richt´

arik, P. SGD: General analysis

and improved rates. In Proceedings of the International

Conference on Machine Learning, volume 97 of PMLR,

pp. 5200–5209. ML Research Press, June 2019. (pages

1,3)

Gower, R. M., Richt´

arik, P., and Bach, F. Stochastic

quasi-gradient methods: Variance reduction via Jaco-

bian sketching. Mathematical Programming, 188(1):

135–192, July 2021b. (pages 1,2,3)

Hinder, O., Sidford, A., and Sohoni, N. Near-optimal meth-

ods for minimizing star-convex functions and beyond. In

Proceedings of Conference on Learning Theory, volume

125 of PMLR, pp. 1894–1938. ML Research Press, July

2020. (page 5)

Hinton, G. E. and van Camp, D. Keeping the neural net-

works simple by minimizing the description length of the

weights. In Proceedings of the Annual Conference on

Computational Learning Theory, pp. 5–13, Santa Cruz,

California, United States, 1993. ACM Press. (page 2)

Hoffman, M. and Ma, Y. Black-box variational inference

as a parametric approximation to Langevin dynamics. In

Proceedings of the International Conference on Machine

Learning, PMLR, pp. 4324–4341. ML Research Press,

November 2020. (page 9)

Jin, J. On the convergence of ﬁrst order methods for

quasar-convex optimization. (arXiv:2010.04937), Octo-

ber 2020. (page 5)

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul,

L. K. An introduction to variational methods for graph-

ical models. Machine Learning, 37(2):183–233, 1999.

(pages 1,2)

Karimi, H., Nutini, J., and Schmidt, M. Linear conver-

gence of gradient and proximal-gradient methods under

the Polyak-Łojasiewicz condition. In Machine Learn-

ing and Knowledge Discovery in Databases, Lecture

Notes in Computer Science, pp. 795–811, Cham, 2016.

Springer International Publishing. (page 5)

Khaled, A. and Richt´

arik, P. Better theory for SGD in the

nonconvex world. Transactions of Machine Learning

Research, 2023. (pages 2,3,4,9)

Kim, K., Wu, K., Oh, J., Ma, Y., and Gardner,

J. R. Black-box variational inference converges.

(arXiv:2305.15349), May 2023. (pages 2,9)

Kingma, D. P. and Ba, J. Adam: A Method for Stochastic

Optimization. In Proceedings of the International Con-

ference on Learning Representations, San Diego, Cali-

fornia, USA, 2015. (page 1)

Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and

Blei, D. M. Automatic differentiation variational infer-

ence. Journal of Machine Learning Research, 18(14):

1–45, 2017. (pages 1,2,3)

Leger, J.-B. Parametrization cookbook: A set of bijective

parametrizations for using machine learning methods in

statistical inference. (arXiv:2301.08297), January 2023.

(page 2)

Li, X. and Milzarek, A. A uniﬁed convergence theorem for

stochastic optimization methods. In Advances in Neural

Information Processing Systems, October 2022. (page 4)

Liu, J. and Yuan, Y. On almost sure convergence rates of

stochastic gradient methods. In Proceedings of the Con-

ference on Learning Theory, volume 178 of PMLR, pp.

2963–2983. ML Research Press, June 2022. (page 4)

Liu, S. and Owen, A. B. Quasi-Monte Carlo quasi-Newton

in Variational Bayes. Journal of Machine Learning Re-

search, 22(243):1–23, 2021. (page 9)

Ma, S., Bassily, R., and Belkin, M. The power of interpola-

tion: Understanding the effectiveness of SGD in modern


over-parametrized learning. In Proceedings of the Inter-

national Conference on Machine Learning, volume 80 of

PMLR, pp. 3325–3334. ML Research Press, July 2018.

(page 1)

Miller, A., Foti, N., D’ Amour, A., and Adams, R. P.

Reducing reparameterization gradient variance. In Ad-

vances in Neural Information Processing Systems, vol-

ume 30. Curran Associates, Inc., 2017. (page 9)

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A.

Monte Carlo gradient estimation in machine learning.

Journal of Machine Learning Research, 21(132):1–62,

2020a. (pages 1,9)

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A.

Monte Carlo gradient estimation in machine learning.

Journal of Machine Learning Research, 21(132):1–62,

2020b. (page 3)

Nguyen, L., Nguyen, P. H., van Dijk, M., Richtarik, P.,

Scheinberg, K., and Takac, M. SGD and Hogwild! Con-

vergence without the bounded gradients assumption. In

Proceedings of the International Conference on Machine

Learning, volume 80 of PMLR, pp. 3750–3758. ML Re-

search Press, July 2018. (pages 1,3)

Peterson, C. and Anderson, J. R. A mean ﬁeld theory learn-

ing algorithm for Neural Networks. Complex Systems, 1

(5):995–1019, 1987. (page 2)

Peterson, C. and Hartman, E. Explorations of the mean

ﬁeld theory learning algorithm. Neural Networks, 2(6):

475–494, January 1989. (page 2)

Polyak, B. T. and Tsypkin, Y. Z. Pseudogradient adaptation

and training algorithms. Automatic Remote Control, 34

(3):45–68, 1973. (pages 2,3)

Ranganath, R., Gerrish, S., and Blei, D. Black box vari-

ational inference. In Proceedings of the International

Conference on Artiﬁcial Intelligence and Statistics, vol-

ume 33 of PMLR, pp. 814–822. ML Research Press,

April 2014. (pages 1,9)

Robbins, H. and Monro, S. A stochastic approximation

method. The Annals of Mathematical Statistics, 22(3):

400–407, September 1951. (page 1)

Roeder, G., Wu, Y., and Duvenaud, D. K. Sticking the

landing: Simple, lower-variance gradient estimators for

variational inference. In Advances in Neural Information

Processing Systems, volume 30. Curran Associates, Inc.,

2017. (page 9)

Salvatier, J., Wiecki, T. V., and Fonnesbeck, C. Probabilis-

tic programming in Python using PyMC3. PeerJ Com-

puter Science, 2:e55, April 2016. (pages 1,4)

Schmidt, M. and Roux, N. L. Fast convergence of stochas-

tic gradient descent under a strong growth condition.

(arXiv:1308.6370), August 2013. (pages 1,3)

Titsias, M. and L´

azaro-Gredilla, M. Doubly stochastic vari-

ational Bayes for non-conjugate inference. In Proceed-

ings of the International Conference on Machine Learn-

ing, volume 32 of PMLR, pp. 1971–1979. ML Research

Press, June 2014. (page 1)

Tseng, P. An incremental gradient(-projection) method

with momentum term and adaptive stepsize rule. SIAM

Journal on Optimization, 8(2):506–531, May 1998.

(pages 1,3)

Vaswani, S., Bach, F., and Schmidt, M. Fast and faster

convergence of SGD for over-parameterized models and

an accelerated perceptron. In Proceedings of the Inter-

national Conference on Artiﬁcial Intelligence and Statis-

tics, volume 89 of PMLR, pp. 1195–1204. ML Research

Press, April 2019. (pages 1,3)

Welandawe, M., Andersen, M. R., Vehtari, A., and Hug-

gins, J. H. Robust, automated, and accurate black-box

variational inference. (arXiv:2203.15945), March 2022.

(page 1)

Xu, M., Quiroz, M., Kohn, R., and Sisson, S. A. Variance

reduction properties of the reparameterization trick. In

Proceedings of the International Conference on Artiﬁ-

cial Intelligence and Statistics, volume 89 of PMLR, pp.

2711–2720. ML Research Press, April 2019. (pages 1,3,

5,9)

Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. Yes,

but did it work?: Evaluating variational inference. In

Proceedings of the International Conference on Machine

Learning, PMLR, pp. 5581–5590. ML Research Press,

July 2018. (page 1)

Zhang, C., Butepage, J., Kjellstrom, H., and Mandt, S. Ad-

vances in variational inference. IEEE Transactions on

Pattern Analysis and Machine Intelligence, 41(8):2008–

2026, August 2019. (page 1)

Zhang, Y., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q. Adam

can converge without any modiﬁcation on update rules.

In Advances in Neural Information Processing Systems,

2022. (page 1)


TABLE OF CONTENTS

1 Introduction 1

2 Preliminaries 2

2.1 Variational Inference . . . . . . . . . 2

2.2 Variational Family . . . . . . . . . . 2

2.3 Reparameterization Trick . . . . . . . 3

2.4 Gradient Variance Assumptions in

Stochastic Gradient Descent . . . . . 3

2.5 Covariance Parameterizations . . . . . 4

3 Main Results 4

3.1 Key Lemmas . . . . . . . . . . . . . 4

3.2 Upper Bounds . . . . . . . . . . . . . 5

3.3 Upper Bound Under Bounded Entropy 7

3.4 Matching Lower Bound . . . . . . . . 7

4 Simulations 8

4.1 Synthetic Problem . . . . .