arXiv:2008.05793v1 [math.ST] 13 Aug 2020
Maximum likelihood estimation of regularisation
parameters in high-dimensional inverse problems: an
empirical Bayesian approach
Part II: Theoretical Analysis
Valentin De Bortoli∗1, Alain Durmus†1, Marcelo Pereyra‡2, and Ana F. Vidal§2
Part of this work has been presented at the 25th IEEE International Conference on Image Processing (ICIP) [50].
1CMLA - École normale supérieure Paris-Saclay, CNRS, Université Paris-Saclay, 94235 Cachan, France.
2Maxwell Institute for Mathematical Sciences & School of Mathematical and Computer Sciences,
Heriot-Watt University, Edinburgh, EH14 4AS, United Kingdom.
August 14, 2020
Abstract
This paper presents a detailed theoretical analysis of the three stochastic approximation
proximal gradient algorithms proposed in our companion paper [49] to set regularization pa-
rameters by marginal maximum likelihood estimation. We prove the convergence of a more
general stochastic approximation scheme that includes the three algorithms of [49] as special
cases. This includes asymptotic and non-asymptotic convergence results with natural and
easily verifiable conditions, as well as explicit bounds on the convergence rates. Importantly,
the theory is also general in that it can be applied to other intractable optimisation prob-
lems. A main novelty of the work is that the stochastic gradient estimates of our scheme are
constructed from inexact proximal Markov chain Monte Carlo samplers. This allows the use
of samplers that scale efficiently to large problems and for which we have precise theoretical
guarantees.
1 Introduction
Numerous imaging problems require performing inferences on an unknown image of interest x ∈ R^d from some observed data y. Canonical examples include image denoising [12, 28], compressive sensing [18, 40], super-resolution [35, 51], tomographic reconstruction [13], image inpainting [24, 44], source separation [9, 8], fusion [46, 31], and phase retrieval [10, 26]. Such imaging problems can be formulated in a Bayesian statistical framework, where inferences are derived from the so-called posterior distribution of x given y, which for the purpose of this paper we specify as follows
\[
p(x|y,\theta) = p(y|x)\, p(x|\theta)\,/\, p(y|\theta)\,,
\]
where p(y|x) = exp{−f_y(x)} with f_y ∈ C^1(R^d, R) is the likelihood function, and the prior distribution is p(x|θ) = exp{−θ^⊤ g(x)} with g : R^d → R^{d_Θ} and θ ∈ Θ ⊂ R^{d_Θ}. The function f_y acts as a data-fidelity term, g as a regulariser that promotes desired structural or regularity properties (e.g., smoothness, piecewise-regularity, or sparsity [11]), and θ is a regularisation parameter that controls the amount of regularity enforced. Most Bayesian methods in the imaging literature consider models for which f_y and g are convex functions and report as solution the maximum-a-posteriori (MAP) Bayesian estimator
\[
\operatorname*{arg\,min}\, f_{y,\theta}\,, \qquad \text{where } f_{y,\theta}(x) = f_y(x) + \theta^\top g(x) \text{ for any } x \in \mathbb{R}^d\,. \tag{1}
\]
∗Email: debortoli@cmla.ens-cachan.fr
†Email: durmus@cmla.ens-cachan.fr
‡Email: m.pereyra@hw.ac.uk
§Email: af69@hw.ac.uk
For example, many imaging works consider a linear observation model of the form y = Ax + w, where A ∈ R^{d×d} is some problem-specific linear operator and the noise w has distribution N(0, σ²I_d) with variance σ² > 0. Then, for any x ∈ R^d, f_y(x) = (2σ²)^{-1}‖Ax − y‖². With regards to the prior, a common choice in imaging is to set Θ = R₊ and g(x) = ‖Bx‖₁ for some suitable basis or dictionary B ∈ R^{d′×d}, or g(x) = TV(x), where TV(x) is the isotropic total variation pseudo-norm given by
\[
\mathrm{TV}(x) = \sum_i \sqrt{(\Delta_i^h x)^2 + (\Delta_i^v x)^2}\,,
\]
where Δ_i^h and Δ_i^v denote horizontal and vertical first-order local (pixel-wise) difference operators.
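To make the notation concrete, here is a minimal numerical sketch (our illustration, not part of the original paper) of the isotropic TV pseudo-norm above, using forward pixel differences; the replicate boundary handling is our assumption since the paper does not fix a boundary convention.

```python
import numpy as np

def tv_isotropic(x: np.ndarray) -> float:
    """Isotropic total variation TV(x) = sum_i sqrt((dh_i x)^2 + (dv_i x)^2) for a 2D image x."""
    dh = np.diff(x, axis=1, append=x[:, -1:])  # horizontal pixel-wise differences
    dv = np.diff(x, axis=0, append=x[-1:, :])  # vertical pixel-wise differences
    return float(np.sum(np.sqrt(dh ** 2 + dv ** 2)))

# Example: TV of a piecewise-constant image equals the jump size times the edge length.
img = np.zeros((8, 8)); img[:, 4:] = 1.0
print(tv_isotropic(img))  # 8 unit jumps along one vertical edge -> 8.0
```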
Importantly, when fyand gare convex, problem (1) is also convex and can usually be efficiently
solved by using modern proximal convex optimisation techniques [11], with remarkable guarantees
on the solutions delivered.
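For readers who want to see what such a proximal convex optimisation scheme amounts to in the simplest case, the following sketch (our illustration, not an algorithm from the paper) solves (1) for the quadratic data fidelity f_y(x) = (2σ²)^{-1}‖Ax − y‖² and the prior g(x) = ‖x‖₁ (i.e. B = Id) with the proximal gradient (ISTA) iteration x_{k+1} = prox_{τθ‖·‖₁}(x_k − τ∇f_y(x_k)); the problem sizes and parameter values are placeholders.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (component-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def map_ista(A, y, theta, sigma2=1.0, n_iter=500):
    """Proximal gradient (ISTA) for argmin_x (2*sigma2)^{-1}||Ax - y||^2 + theta*||x||_1."""
    lip = np.linalg.norm(A, 2) ** 2 / sigma2      # Lipschitz constant of the data-fidelity gradient
    tau = 1.0 / lip                                # step size
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y) / sigma2          # gradient of f_y
        x = soft_threshold(x - tau * grad, tau * theta)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60)); x_true = np.zeros(60); x_true[:5] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(30)
print(map_ista(A, y, theta=0.5)[:8])               # approximately recovers the sparse support
```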
Setting the value of θ can be notoriously difficult, especially in problems that are ill-posed or ill-conditioned, where the regularisation has a dramatic impact on the recovered estimates. We refer to [27] and [49, Section 1] for illustrations and a detailed review of the existing methods for setting θ.
In our companion paper [49], we present a new method to set regularisation parameters. More
precisely, in [49], we adopt an empirical Bayesian approach and set θby maximum marginal
likelihood estimation, i.e.
\[
\theta^\star \in \operatorname*{arg\,max}_{\theta \in \Theta}\, \log p(y|\theta)\,, \qquad \text{where } p(y|\theta) = \int_{\mathbb{R}^d} p(y,x|\theta)\,\mathrm{d}x\,, \quad p(y,x|\theta) \propto \exp[-f_{y,\theta}(x)]\,. \tag{2}
\]
To solve (2), we aim at using gradient-based optimisation methods. The gradient of θ ↦ log p(y|θ) can be computed using Fisher's identity, see [49, Proposition A.1], which implies, under mild integrability conditions on f_y and g, that for any θ ∈ Θ,
\[
\nabla_\theta \log p(y|\theta) = -\int_{\mathbb{R}^d} g(\tilde{x})\, p(\tilde{x}|y,\theta)\,\mathrm{d}\tilde{x} + \int_{\mathbb{R}^d} g(\tilde{x})\, p(\tilde{x}|\theta)\,\mathrm{d}\tilde{x}\,.
\]
It follows that θ ↦ ∇_θ log p(y|θ) can be written as a sum of two parametric integrals which are intractable in most cases. Therefore, we propose to use a stochastic approximation (SA) scheme and, in particular, we define three different algorithms to solve (2) [49, Algorithm 3.1, Algorithm 3.2, Algorithm 3.3]. These algorithms are extensively demonstrated in [49] through a range of applications and comparisons with alternative approaches from the state-of-the-art.
In the present paper we theoretically analyse these three SA schemes and establish natural and easily verifiable conditions for convergence. For generality, rather than presenting algorithm-specific analyses, we establish detailed convergence results for a more general SA scheme that covers the three algorithms of [49] as specific cases. Indeed, all these methods boil down to defining a sequence (θ_n)_{n∈N} satisfying a recursion of the form: for any n ∈ N,
\[
\theta_{n+1} = \Pi_\Theta\Big[\theta_n - \frac{\delta_{n+1}}{m_n} \sum_{k=1}^{m_n} \big\{ g(X_k^n) - g(\bar{X}_k^n) \big\} \Big]\,, \tag{3}
\]
where Π_Θ is the projection onto a convex closed set Θ, (X_k^n)_{k∈{1,...,m_n}} and (X̄_k^n)_{k∈{1,...,m_n}} are two independent stochastic processes targeting x ↦ p(x|y,θ) and x ↦ p(x|θ) respectively, (m_n)_{n∈N} is a sequence of batch sizes and (δ_n)_{n∈N*} is a sequence of stepsizes. In this paper, we are interested in establishing the convergence of the averaging of (θ_n)_{n∈N} to a solution of (2) in this setting. SA has been extensively studied during the past decades [41, 29, 38, 47, 33, 34, 7, 6, 48]. Recently, quantitative results have been obtained in [45, 2, 39, 1, 43]. In contrast to [1], here we consider the case where (X_k^n)_{k∈{1,...,m_n}} and (X̄_k^n)_{k∈{1,...,m_n}} are inexact Markov chains which target x ↦ p(x|y,θ) and x ↦ p(x|θ) respectively, and which are based on generalisations of the Unadjusted Langevin Algorithm (ULA) [42]. In recent years, ULA has attracted a lot of attention since this algorithm exhibits favourable high-dimensional convergence properties in the case where the target distribution admits a differentiable density, see [20, 22, 14, 15]. However, in most imaging models the penalty function g is not differentiable, and therefore x ↦ p(x|y,θ) and x ↦ p(x|θ) are not differentiable either. Therefore, we consider proximal Langevin samplers which are specifically designed to overcome this issue: the Moreau-Yosida Unadjusted Langevin Algorithm (MYULA), see [23], and the Proximal Unadjusted Langevin Algorithm (PULA), see [21].
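The projected update (3) is simple to state in code. The following sketch (ours, with placeholder arguments) performs one SA iteration for a scalar θ: it averages g over the two mini-batches of posterior and prior samples and projects the result onto Θ = [θ_min, θ_max]; the sampler producing the mini-batches is a stand-in for the MCMC kernels introduced below.

```python
import numpy as np

def sa_update(theta, delta, posterior_samples, prior_samples, g,
              theta_min=1e-3, theta_max=1e3):
    """One projected stochastic-approximation step of recursion (3), with dim(theta) = 1.

    posterior_samples: X_1^n, ..., X_{m_n}^n   (chain targeting p(x | y, theta))
    prior_samples:     Xbar_1^n, ..., Xbar_{m_n}^n  (chain targeting p(x | theta))
    """
    grad_est = np.mean([g(x) for x in posterior_samples]) \
             - np.mean([g(x) for x in prior_samples])      # noisy estimate of the gradient of -log p(y|theta)
    return float(np.clip(theta - delta * grad_est, theta_min, theta_max))  # projection onto Theta
```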
A similar approximation scheme to (3) is studied in [1]. More precisely, [1, Theorem 3, Theorem 4] are similar to Theorem 6 and Theorem 7. Contrary to that work, here we do not require the Markov kernels we use to exactly target x ↦ p(x|θ) and x ↦ p(x|y,θ), but allow some bias in the estimation which is accounted for in our convergence rates. This relaxation to biased estimates plays a central role in the capacity of the method to scale efficiently to large problems. Moreover, the present paper is also a complement to [17], which establishes general conditions for the convergence of inexact Markovian SA but only applies these results to ULA. In this study, we do not consider a general Markov kernel but rather specialise the results of [17] to the MYULA and PULA Markov kernels. However, to apply the results of [17], new quantitative geometric convergence properties for MYULA and PULA have to be established.
The remainder of the paper is organized as follows. In Section 2, we recall our notations and
conventions. In Section 3, we define the class of optimisation problems considered and the SA
scheme (3). This setting includes the optimization problem presented in (2) and the three specific
algorithms introduced in [49]. Then, in Section 4, we present a detailed analysis of the theoretical
properties of the proposed methodology. First, we show new ergodicity results for the MYULA
and PULA samplers. In a second part, we provide easily verifiable conditions for convergence and
quantitative convergence rates for the averaging sequences designed from (3). The proofs of these
results are gathered in Section 5.
2 Notations and conventions
We denote by B(0, R) and B̄(0, R) the open ball, respectively the closed ball, with radius R in R^d. Denote by B(R^d) the Borel σ-field of R^d, F(R^d) the set of all Borel measurable functions on R^d and, for f ∈ F(R^d), ‖f‖_∞ = sup_{x∈R^d} |f(x)|. For µ a probability measure on (R^d, B(R^d)) and f ∈ F(R^d) a µ-integrable function, denote by µ(f) the integral of f w.r.t. µ. For f ∈ F(R^d), the V-norm of f is given by ‖f‖_V = sup_{x∈R^d} |f(x)|/V(x). Let ξ be a finite signed measure on (R^d, B(R^d)). The V-total variation norm of ξ is defined as
\[
\|\xi\|_V = \sup_{f \in \mathbb{F}(\mathbb{R}^d),\ \|f\|_V \le 1} \left| \int_{\mathbb{R}^d} f(x)\,\mathrm{d}\xi(x) \right|\,.
\]
If V ≡ 1, then ‖·‖_V is the total variation norm on measures, denoted by ‖·‖_TV.
Let U be an open set of R^d. We denote by C^k(U, R^{d_Θ}), respectively C_c^k(U, R^{d_Θ}), the set of R^{d_Θ}-valued k-differentiable functions, respectively the set of compactly supported R^{d_Θ}-valued k-differentiable functions. C^k(U) stands for C^k(U, R). Let f : U → R; we denote by ∇f the gradient of f if it exists. f is said to be m-convex with m ≥ 0 if for all x, y ∈ R^d and t ∈ [0, 1],
\[
f(t x + (1-t) y) \le t f(x) + (1-t) f(y) - (m/2)\, t (1-t)\, \|x - y\|^2\,.
\]
Let (Ω, F, P) be a probability space. We write µ ≪ ν if µ is absolutely continuous w.r.t. ν, and dµ/dν denotes an associated density. Let µ, ν be two probability measures on (R^d, B(R^d)). Define the Kullback-Leibler divergence of µ from ν by
\[
\mathrm{KL}(\mu|\nu) =
\begin{cases}
\displaystyle\int_{\mathbb{R}^d} \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(x)\, \log\!\Big(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}(x)\Big)\, \mathrm{d}\nu(x)\,, & \text{if } \mu \ll \nu\,,\\[2mm]
+\infty\,, & \text{otherwise}\,.
\end{cases}
\]
3 Proposed stochastic approximation proximal gradient optimisation methodology
3.1 Problem statement
Let Θ ⊂ R^{d_Θ} and f : Θ → R. We consider the optimisation problem
\[
\theta^\star \in \operatorname*{arg\,min}_{\theta \in \Theta} f(\theta)\,, \tag{4}
\]
in scenarios where it is not possible to evaluate f nor ∇f because they are computationally intractable. Problem (4) includes the marginal likelihood estimation problem (2) of our companion paper [49] as the special case f = −log p(y|·). We make the following general assumptions on f and Θ, which are in particular verified by the imaging models considered in [49].

A1. Θ is a convex compact set and Θ ⊂ B̄(0, R_Θ) with R_Θ ≥ 0.

A2. There exist an open set U ⊂ R^{d_Θ} and L_f ≥ 0 such that Θ ⊂ U, f ∈ C^1(U, R) and for any θ_1, θ_2 ∈ Θ,
\[
\|\nabla_\theta f(\theta_1) - \nabla_\theta f(\theta_2)\| \le L_f \|\theta_1 - \theta_2\|\,.
\]
A3. For any θ ∈ Θ, there exist H_θ, H̄_θ : R^d → R^{d_Θ} and two probability distributions π_θ, π̄_θ on (R^d, B(R^d)) satisfying
\[
\nabla_\theta f(\theta) = \int_{\mathbb{R}^d} H_\theta(x)\,\mathrm{d}\pi_\theta(x) + \int_{\mathbb{R}^d} \bar{H}_\theta(x)\,\mathrm{d}\bar{\pi}_\theta(x)\,.
\]
In addition, (θ, x) ↦ H_θ(x) and (θ, x) ↦ H̄_θ(x) are measurable.

Remark 1. Note that if f ∈ C^2(Θ) then A2 is automatically satisfied under A1, since Θ is compact. In every model considered in our companion paper [49], θ ↦ −log p(y|θ) is twice continuously differentiable on each compact set (using the dominated convergence theorem), and therefore A2 holds under A1.
Remark 2. Assumption A3 is verified in the three cases considered in our companion paper [49, Algorithm 3.1, Algorithm 3.2, Algorithm 3.3]:
(a) if the regulariser g is α-positively homogeneous with α > 0 and d_Θ = 1, corresponding to [49, Algorithm 3.1], then for any θ ∈ Θ, H_θ = g, H̄_θ = −d/(αθ), π_θ is the probability measure with density w.r.t. the Lebesgue measure x ↦ p(x|y, θ) and π̄_θ is any probability measure;
(b) if the regulariser g is separably positively homogeneous as in [49, Algorithm 3.2], then for any θ ∈ Θ, H_θ = g, H̄_θ = (−|A_i|/(α_i θ_i))_{i∈{1,...,d_Θ}}, π_θ is the probability measure with density w.r.t. the Lebesgue measure x ↦ p(x|y, θ) and π̄_θ is any probability measure;
(c) if the regulariser g is inhomogeneous, corresponding to [49, Algorithm 3.3], then for any θ ∈ Θ, H_θ = g, H̄_θ = −g, and π_θ and π̄_θ are the probability measures associated with the posterior and the prior, with densities w.r.t. the Lebesgue measure x ↦ p(x|y, θ) and x ↦ p(x|θ) respectively.
We now present in Algorithm 1 the stochastic algorithm we consider in order to solve (4). This method encompasses the schemes introduced in the companion paper [49, Algorithm 3.1, Algorithm 3.2, Algorithm 3.3]. Starting from (X_0^0, X̄_0^0) ∈ R^d × R^d and θ_0 ∈ Θ, we define on a probability space (Ω, F, P) the sequence ({(X_k^n, X̄_k^n) : k ∈ {0, ..., m_n}}, θ_n)_{n∈N} by the following recursion for n ∈ N and k ∈ {0, ..., m_n − 1}:
\[
\begin{cases}
(X_k^n)_{k\in\{0,\dots,m_n\}} \text{ is a Markov chain with kernel } K_{\gamma_n,\theta_n} \text{ and } X_0^n = X_{m_{n-1}}^{n-1} \text{ given } \mathcal{F}_{n-1}\,,\\[1mm]
(\bar{X}_k^n)_{k\in\{0,\dots,m_n\}} \text{ is a Markov chain with kernel } \bar{K}_{\gamma_n',\theta_n} \text{ and } \bar{X}_0^n = \bar{X}_{m_{n-1}}^{n-1} \text{ given } \mathcal{F}_{n-1}\,,\\[1mm]
\theta_{n+1} = \Pi_\Theta\Big[\theta_n - \dfrac{\delta_{n+1}}{m_n}\displaystyle\sum_{k=1}^{m_n}\big\{H_{\theta_n}(X_k^n) + \bar{H}_{\theta_n}(\bar{X}_k^n)\big\}\Big]\,,
\end{cases} \tag{5}
\]
where (X_{m_{-1}}^{-1}, X̄_{m_{-1}}^{-1}) = (X_0^0, X̄_0^0), {(K_{γ,θ}, K̄_{γ,θ}) : γ > 0, θ ∈ Θ} is a family of Markov kernels on R^d × B(R^d), (m_n)_{n∈N} ∈ (N^*)^N, δ_n, γ_n, γ_n' > 0 for any n ∈ N, Π_Θ is the projection onto Θ, and F_n is defined as follows for all n ∈ N ∪ {−1}:
\[
\mathcal{F}_n = \sigma\big(\theta_0, \{(X_k^\ell, \bar{X}_k^\ell)_{k\in\{0,\dots,m_\ell\}} : \ell \in \{0,\dots,n\}\}\big)\,, \qquad \mathcal{F}_{-1} = \sigma(\theta_0, X_0^0, \bar{X}_0^0)\,.
\]
Define for any N ∈ N,
\[
\bar{\theta}_N = \sum_{n=0}^{N-1} \delta_n \theta_n \Big/ \sum_{n=0}^{N-1} \delta_n\,.
\]
In the sequel, we are interested in the convergence of (f(θ̄_N))_{N∈N} to a minimum of f in the case where the Markov kernels {(K_{γ,θ}, K̄_{γ,θ}) : γ > 0, θ ∈ Θ} used in Algorithm 1 are either the ones associated with MYULA or PULA. We now present these two MCMC methods, for which some analysis is required in our study of (f(θ̄_N))_{N∈N}.
3.2 Choice of MCMC kernels
Given the high dimensionality involved, it is fundamental to carefully choose the families of Markov kernels {K_{γ,θ}, K̄_{γ,θ} : γ > 0, θ ∈ Θ} driving Algorithm 1. In the experimental part of this work, see [49, Section 4], we use the MYULA Markov kernel recently proposed in [23], which is a state-of-the-art proximal Markov chain Monte Carlo (MCMC) method specifically designed for high-dimensional models that are log-concave but not smooth. The method is derived from the
Algorithm 1 General algorithm
1: Input: initial {θ_0, X_0^0, X̄_0^0}, (δ_n, γ_n, γ_n', m_n)_{n∈N}, number of iterations N.
2: for n = 0 to N − 1 do
3:   if n > 0 then
4:     Set X_0^n = X_{m_{n−1}}^{n−1},
5:     Set X̄_0^n = X̄_{m_{n−1}}^{n−1},
6:   end if
7:   for k = 0 to m_n − 1 do
8:     Sample X_{k+1}^n ∼ K_{γ_n,θ_n}(X_k^n, ·),
9:     Sample X̄_{k+1}^n ∼ K̄_{γ_n',θ_n}(X̄_k^n, ·),
10:  end for
11:  Set θ_{n+1} = Π_Θ[θ_n − (δ_{n+1}/m_n) Σ_{k=1}^{m_n} {H_{θ_n}(X_k^n) + H̄_{θ_n}(X̄_k^n)}].
12: end for
13: Output: θ̄_N = {Σ_{n=0}^{N−1} δ_n}^{−1} Σ_{n=0}^{N−1} δ_n θ_n.
discretisation of an over-damped Langevin diffusion (X_t)_{t≥0} satisfying the following stochastic differential equation
\[
\mathrm{d}X_t = -\nabla_x F(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\,, \tag{6}
\]
where F : R^d → R is a continuously differentiable potential and (B_t)_{t≥0} is a standard d-dimensional Brownian motion. Under mild assumptions, this equation has a unique strong solution [25, Chapter 4, Theorem 2.3]. Accordingly, the law of (X_t)_{t≥0} converges as t → ∞ to the diffusion's unique invariant distribution, with probability density given by π(x) ∝ e^{−F(x)} for all x ∈ R^d [42, Theorem 2.2]. Hence, to use (6) as a Monte Carlo method to sample from the posterior p(x|y, θ), we set F(x) = −log p(x|y, θ) and thus specify the desired target density. Similarly, to sample from the prior we set F(x) = −log p(x|θ).
However, sampling directly from (6) is usually not computationally feasible. Instead, we usually resort to a discrete-time Euler-Maruyama approximation of (6) that leads to the following Markov chain (X_k)_{k∈N} with X_0 ∈ R^d, given for any k ∈ N by
\[
\text{ULA}: \quad X_{k+1} = X_k - \gamma \nabla_x F(X_k) + \sqrt{2\gamma}\, Z_{k+1}\,,
\]
where γ > 0 is a discretisation step-size and (Z_k)_{k∈N^*} is a sequence of i.i.d. d-dimensional zero-mean Gaussian random variables with identity covariance matrix. This Markov chain is commonly known as the Unadjusted Langevin Algorithm (ULA) [42]. Under some additional assumptions on F, namely Lipschitz continuity of ∇_x F, the ULA chain inherits the convergence properties of (6) and converges to a stationary distribution that is close to the target π, with γ controlling a trade-off between accuracy and convergence speed [23].
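As an illustration (ours, not from the paper), the ULA recursion above is a one-line update; the sketch below samples from a smooth standard-Gaussian target, for which ∇_x F(x) = x, and shows the small step-size-dependent bias in the stationary variance.

```python
import numpy as np

def ula_chain(grad_F, x0, gamma, n_steps, rng=np.random.default_rng(0)):
    """Unadjusted Langevin Algorithm: X_{k+1} = X_k - gamma * grad_F(X_k) + sqrt(2*gamma) * Z_{k+1}."""
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x = x - gamma * grad_F(x) + np.sqrt(2 * gamma) * rng.standard_normal(x.size)
        samples[k] = x
    return samples

# Smooth example: F(x) = ||x||^2 / 2, so the target pi is N(0, Id).
samples = ula_chain(grad_F=lambda x: x, x0=np.zeros(2), gamma=0.1, n_steps=5000)
print(samples[1000:].mean(axis=0), samples[1000:].var(axis=0))  # mean near 0, variance slightly above 1
```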
Remark 3. In this form, the ULA algorithm is limited to distributions where Fis a Lipschitz
continuously differentiable function. However, in the imaging problems of interest this is usually
not the case [49]. For example, to implement any of the algorithms presented in [49] it is necessary
to sample from the posterior distribution p(x|y, θ)(corresponding to πθin Section 3.1), which
would require setting for any x∈Rd,F(x) = fy(x) + θ⊤g(x). Similarly, one of the algorithms
also requires sampling from the prior distribution x7→ p(x|θ)(corresponding to ¯πθin Section 3.1),
which requires setting for any x∈Rd,F(x) = θ⊤g(x). In both cases, if gis not smooth then ULA
cannot be directly applied. The MYULA kernel was designed precisely to overcome this limitation.
3.2.1 Moreau-Yosida Unadjusted Langevin Algorithm
Suppose that the target potential admits a decomposition F = V + U, where V is Lipschitz differentiable and U is not smooth but convex over R^d. In MYULA, the differentiable part is handled via the gradient ∇_x V in a manner akin to ULA, whereas the non-differentiable convex part is replaced by a smooth approximation U^λ(x) given by the Moreau-Yosida envelope of U, see [5, Definition 12.20], defined for any x ∈ R^d and λ > 0 by
\[
U^\lambda(x) = \min_{\tilde{x}\in\mathbb{R}^d} \Big\{ U(\tilde{x}) + (1/2\lambda)\, \|x - \tilde{x}\|_2^2 \Big\}\,. \tag{7}
\]
Similarly, we define the proximal operator for any x ∈ R^d and λ > 0 by
\[
\operatorname{prox}_U^\lambda(x) = \operatorname*{arg\,min}_{\tilde{x}\in\mathbb{R}^d} \Big\{ U(\tilde{x}) + (1/2\lambda)\, \|x - \tilde{x}\|_2^2 \Big\}\,. \tag{8}
\]
For any λ > 0, the Moreau-Yosida envelope U^λ is continuously differentiable with gradient given for any x ∈ R^d by
\[
\nabla U^\lambda(x) = \big(x - \operatorname{prox}_U^\lambda(x)\big)/\lambda\,, \tag{9}
\]
(see, e.g., [5, Proposition 16.44]). Using this approximation we obtain the MYULA kernel associated with (X_k)_{k∈N}, given by X_0 ∈ R^d and the following recursion for any k ∈ N:
\[
\text{MYULA}: \quad X_{k+1} = X_k - \gamma \nabla_x V(X_k) - \gamma \nabla_x U^\lambda(X_k) + \sqrt{2\gamma}\, Z_{k+1}\,. \tag{10}
\]
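To fix ideas, here is a small numerical illustration (ours) of (7)-(9) for the non-smooth choice U(x) = θ‖x‖₁: the proximal operator is component-wise soft-thresholding, and the identity ∇U^λ(x) = (x − prox_U^λ(x))/λ can be checked against a finite-difference derivative of the envelope. The parameter values are arbitrary.

```python
import numpy as np

theta, lam = 2.0, 0.5                     # regularisation parameter and smoothing parameter

def prox_l1(x, t):
    """prox of t*||.||_1: component-wise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def envelope_l1(x, t, lam):
    """Moreau-Yosida envelope (7) of U = t*||.||_1, evaluated at its minimiser prox."""
    p = prox_l1(x, t * lam)
    return t * np.sum(np.abs(p)) + np.sum((x - p) ** 2) / (2 * lam)

x = np.array([1.3, -0.2, 0.7])
grad_env = (x - prox_l1(x, theta * lam)) / lam           # identity (9)
eps = 1e-6
fd = np.array([(envelope_l1(x + eps * e, theta, lam) - envelope_l1(x - eps * e, theta, lam)) / (2 * eps)
               for e in np.eye(3)])                      # finite-difference check of the gradient
print(grad_env, fd)                                      # the two gradients agree
```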
Returning to the imaging problems of interest, we define the MYULA families of Markov kernels {R_{γ,θ}, R̄_{γ,θ} : γ > 0, θ ∈ Θ} that we use in Algorithm 1 to target π_θ and π̄_θ for θ ∈ Θ as follows. By Remark 3, we set V = f_y and U = θ^⊤g, V̄ = 0 and Ū = θ^⊤g. Then, for any θ ∈ Θ and γ > 0, R_{γ,θ} is associated with (X_k)_{k∈N} given by X_0 ∈ R^d and the following recursion for any k ∈ N:
\[
X_{k+1} = X_k - \gamma \nabla_x f_y(X_k) - \gamma\big\{X_k - \operatorname{prox}^{\lambda}_{\theta^\top g}(X_k)\big\}/\lambda + \sqrt{2\gamma}\, Z_{k+1}\,. \tag{11}
\]
Similarly, for any θ ∈ Θ and γ' > 0, R̄_{γ',θ} is associated with (X̄_k)_{k∈N} given by X̄_0 ∈ R^d and the following recursion for any k ∈ N:
\[
\bar{X}_{k+1} = \bar{X}_k - \gamma'\big\{\bar{X}_k - \operatorname{prox}^{\lambda'}_{\theta^\top g}(\bar{X}_k)\big\}/\lambda' + \sqrt{2\gamma'}\, Z_{k+1}\,, \tag{12}
\]
where λ, λ' > 0 are the smoothing parameters associated with θ^⊤g, γ, γ' > 0 are the discretisation steps and (Z_k)_{k∈N^*} is a sequence of i.i.d. d-dimensional zero-mean Gaussian random variables with identity covariance matrix.
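Putting (11) together with the soft-thresholding prox above gives a concrete MYULA kernel. The sketch below (ours) targets a smoothed version of the posterior for the linear-Gaussian likelihood f_y(x) = (2σ²)^{-1}‖Ax − y‖² and the prior term θ‖x‖₁, which is one of the simplest instances covered by the paper; the specific problem sizes and parameter values are our placeholder choices.

```python
import numpy as np

def myula_step(x, gamma, lam, theta, A, y, sigma2, rng):
    """One MYULA step (11) for V = f_y = ||Ax - y||^2 / (2*sigma2) and U = theta*||x||_1."""
    grad_V = A.T @ (A @ x - y) / sigma2
    prox_U = np.sign(x) * np.maximum(np.abs(x) - theta * lam, 0.0)   # prox^lambda of theta*||.||_1
    grad_U_lam = (x - prox_U) / lam                                   # gradient of the Moreau-Yosida envelope
    return x - gamma * (grad_V + grad_U_lam) + np.sqrt(2 * gamma) * rng.standard_normal(x.size)

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 10)); x_true = np.zeros(10); x_true[:3] = 2.0
y = A @ x_true + 0.1 * rng.standard_normal(20); sigma2 = 0.01
L = np.linalg.norm(A, 2) ** 2 / sigma2             # Lipschitz constant of grad f_y
gamma, lam, theta = 0.5 / L, 1.0 / L, 5.0           # step and smoothing chosen below the stability limits
x = np.zeros(10)
for _ in range(5000):
    x = myula_step(x, gamma, lam, theta, A, y, sigma2, rng)
print(x)                                            # a (correlated) approximate sample from p(x | y, theta)
```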
Notice that other ways of splitting the target potential F can be straightforwardly implemented. For example, instead of a single non-smooth convex term U, one might choose a splitting involving several non-smooth terms to simplify the computation of the proximal operators (each term would then be replaced by its Moreau-Yosida envelope in (6)). Similarly, although we usually associate V, V̄ and U, Ū with the negative log-likelihood and the negative log-prior, some cases might benefit from a different splitting. Moreover, as illustrated in Section 3.2.2 below, other discrete approximations of the Langevin diffusion could be considered too.
3.2.2 Proximal Unadjusted Langevin Algorithm
As an alternative to MYULA, one could also consider using the Proximal Unadjusted Langevin Algorithm (PULA) introduced in [21], which replaces the (forward) gradient step of MYULA by a composition of a backward and a forward step. More precisely, PULA defines the Markov chain (X_k)_{k∈N} starting from X_0 ∈ R^d by the following recursion: for any k ∈ N,
\[
\text{PULA}: \quad X_{k+1} = \operatorname{prox}_U^\lambda(X_k) - \gamma \nabla_x V\big(\operatorname{prox}_U^\lambda(X_k)\big) + \sqrt{2\gamma}\, Z_{k+1}\,. \tag{13}
\]
To highlight the connection with MYULA, we note that for any x ∈ R^d and λ > 0, ∇U^λ(x) = (x − prox_U^λ(x))/λ by [5, Proposition 12.30]. Therefore, if we set λ = γ we obtain that (13) can be rewritten for any k ∈ N as
\[
X_{k+1} = X_k - \gamma \nabla_x U^\gamma(X_k) - \gamma \nabla_x V\big(\operatorname{prox}_U^\gamma(X_k)\big) + \sqrt{2\gamma}\, Z_{k+1}\,,
\]
which corresponds to (10) with λ = γ, except that the term ∇_x V(X_k) in (10) is replaced by ∇_x V(prox_U^γ(X_k)).
Going back to the imaging problems of interest, to define the PULA families of Markov kernels {S_{γ,θ}, S̄_{γ,θ} : γ > 0, θ ∈ Θ} that we use in Algorithm 1 to target π_θ and π̄_θ for θ ∈ Θ, we proceed as follows. We set V = f_y and U = θ^⊤g, V̄ = 0 and Ū = θ^⊤g. Then, by Remark 3, for any θ ∈ Θ and γ > 0, S_{γ,θ} is associated with (X_k)_{k∈N} given by X_0 ∈ R^d and the following recursion for any k ∈ N:
\[
X_{k+1} = \operatorname{prox}^{\lambda}_{\theta^\top g}(X_k) - \gamma \nabla_x f_y\big(\operatorname{prox}^{\lambda}_{\theta^\top g}(X_k)\big) + \sqrt{2\gamma}\, Z_{k+1}\,. \tag{14}
\]
Similarly, for any θ ∈ Θ and γ' > 0, S̄_{γ',θ} is associated with (X̄_k)_{k∈N} given by X̄_0 ∈ R^d and the following recursion for any k ∈ N:
\[
\bar{X}_{k+1} = \operatorname{prox}^{\lambda'}_{\theta^\top g}(\bar{X}_k) + \sqrt{2\gamma'}\, Z_{k+1}\,. \tag{15}
\]
Recall that λ, λ' > 0 are the smoothing parameters associated with θ^⊤g, γ, γ' > 0 are the discretisation steps and (Z_k)_{k∈N^*} is a sequence of i.i.d. d-dimensional zero-mean Gaussian random variables with identity covariance matrix. Again, one could use PULA with a different splitting of F.
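For comparison with the MYULA sketch above, the following minimal PULA step (ours) implements (14) for the same linear-Gaussian likelihood and ℓ1 prior term: the non-smooth term is handled by a backward (proximal) step and the smooth term by a forward gradient step evaluated at the proximal point.

```python
import numpy as np

def pula_step(x, gamma, lam, theta, A, y, sigma2, rng):
    """One PULA step (14): backward step on U = theta*||.||_1, forward step on V = f_y at the prox point."""
    p = np.sign(x) * np.maximum(np.abs(x) - theta * lam, 0.0)   # prox^lambda_{theta*||.||_1}(x)
    grad_V_at_p = A.T @ (A @ p - y) / sigma2                     # gradient of f_y evaluated at the prox point
    return p - gamma * grad_V_at_p + np.sqrt(2 * gamma) * rng.standard_normal(x.size)
```

With λ = γ, this differs from the MYULA step only through the point at which ∇_x f_y is evaluated, which is precisely the rewriting of (13) given above.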
Finally, we note at this point that the MYULA and PULA kernels (11), (12), (14) and (15),
do not target the posterior or prior distributions exactly but rather an approximation of these
distributions. This is mainly due to two facts: 1) we are not able to use the exact Langevin diffusion
(6), so we resort to a discrete approximation instead; and 2) we replace the non-differentiable terms
with their Moreau-Yosida envelopes. As a result of these approximation errors, Algorithm 1will
exhibit some asymptotic estimation bias. This error is controlled by λ, λ′, γ, γ ′, and δ, and can be
made arbitrarily small at the expense of additional computing time, see Theorem 7in Section 4.
4 Analysis of the convergence properties
4.1 Ergodicity properties of MYULA and PULA
Before establishing our main convergence results about Algorithm 1, see Section 4.2, we derive ergodicity properties for the Markov chains given by (10) and (13). We consider the following assumptions on π_θ and π̄_θ. These assumptions are satisfied for a large class of models in Bayesian imaging sciences, and in particular by the models considered in our companion paper [49].
H1. For any θ ∈ Θ, there exist convex functions V_θ, V̄_θ, U_θ, Ū_θ : R^d → [0, +∞) satisfying the following conditions.
(a) For any θ ∈ Θ and x ∈ R^d,
\[
\pi_\theta(x) \propto \exp\left[-V_\theta(x) - U_\theta(x)\right]\,, \qquad \bar{\pi}_\theta(x) \propto \exp\left[-\bar{V}_\theta(x) - \bar{U}_\theta(x)\right]\,,
\]
and
\[
\min\Big( \inf_{\theta\in\Theta} \int_{\mathbb{R}^d} \exp[-V_\theta(\tilde{x}) - U_\theta(\tilde{x})]\,\mathrm{d}\tilde{x}\,,\ \inf_{\theta\in\Theta} \int_{\mathbb{R}^d} \exp[-\bar{V}_\theta(\tilde{x}) - \bar{U}_\theta(\tilde{x})]\,\mathrm{d}\tilde{x} \Big) > 0\,. \tag{16}
\]
(b) For any θ ∈ Θ, V_θ and V̄_θ are continuously differentiable and there exists L ≥ 0 such that for any θ ∈ Θ and x, y ∈ R^d,
\[
\max\big( \|\nabla_x V_\theta(x) - \nabla_x V_\theta(y)\|\,,\ \|\nabla_x \bar{V}_\theta(x) - \nabla_x \bar{V}_\theta(y)\| \big) \le L\, \|x - y\|\,.
\]
In addition, there exist R_{V,1}, R_{V,2} ≥ 0 such that for any θ ∈ Θ, there exist x_θ^⋆, x̄_θ^⋆ ∈ R^d with x_θ^⋆ ∈ arg min_{R^d} V_θ, x̄_θ^⋆ ∈ arg min_{R^d} V̄_θ, x_θ^⋆, x̄_θ^⋆ ∈ B̄(0, R_{V,1}) and V_θ(x_θ^⋆), V̄_θ(x̄_θ^⋆) ∈ B̄(0, R_{V,2}).
(c) There exists M ≥ 0 such that for any θ ∈ Θ and x, y ∈ R^d,
\[
\max\big( \|U_\theta(x) - U_\theta(y)\|\,,\ \|\bar{U}_\theta(x) - \bar{U}_\theta(y)\| \big) \le M\, \|x - y\|\,.
\]
In addition, there exist R_{U,1}, R_{U,2} ≥ 0 such that for any θ ∈ Θ, there exist x_θ^♯, x̄_θ^♯ ∈ R^d with x_θ^♯, x̄_θ^♯ ∈ B̄(0, R_{U,1}) and U_θ(x_θ^♯), Ū_θ(x̄_θ^♯) ∈ B̄(0, R_{U,2}).
Note that (16) in H1-(a) is satisfied if Θ is compact and the functions θ ↦ ∫_{R^d} exp[−V_θ(x̃) − U_θ(x̃)] dx̃ and θ ↦ ∫_{R^d} exp[−V̄_θ(x̃) − Ū_θ(x̃)] dx̃ are continuous. This latter condition can then be easily verified using the Lebesgue dominated convergence theorem and some assumptions on {V_θ, V̄_θ, U_θ, Ū_θ : θ ∈ Θ}. Note that if there exists V : R^d → [0, +∞) such that for any θ ∈ Θ, V_θ = V and there exists x^⋆ ∈ R^d with x^⋆ ∈ arg min_{R^d} V, then one can choose x_θ^⋆ = x^⋆ for any θ ∈ Θ in H1-(b). In this case, R_{V,2} = 0. Similarly, if for any θ ∈ Θ, U_θ(0) = 0, then one can choose x_θ^♯ = 0 in H1-(c) and in this case R_{U,1} = R_{U,2} = 0. These conditions are satisfied by all the models studied in [49].
As emphasized in Section 3.1, we use a stochastic approximation proximal gradient approach to minimize f and therefore we need to consider Monte Carlo estimators for ∇_θ f(θ), θ ∈ Θ. These estimators are derived from Markov chains targeting π_θ and π̄_θ respectively. We consider two MCMC methodologies to construct the Markov chains. A first option, as proposed in Section 3.2.1, is to use MYULA to sample from π_θ and π̄_θ. Let κ > 0 and {R_{γ,θ} : γ > 0, θ ∈ Θ} be the family of kernels defined for any x ∈ R^d, γ > 0, θ ∈ Θ and A ∈ B(R^d) by
\[
R_{\gamma,\theta}(x,\mathsf{A}) = (4\pi\gamma)^{-d/2} \int_{\mathsf{A}} \exp\Big[ -\big\| y - x + \gamma \nabla_x V_\theta(x) + \kappa^{-1}\big\{x - \operatorname{prox}^{\gamma\kappa}_{U_\theta}(x)\big\} \big\|^2 \big/ (4\gamma) \Big]\, \mathrm{d}y\,. \tag{17}
\]
Note that (17) is the Markov kernel associated with the recursion (10) with U ← U_θ, V ← V_θ and λ ← κγ. For any γ, κ > 0 and θ ∈ Θ, R_{γ,θ} corresponds to R_{γ,κγ,θ} in [49]. Consider also the family of Markov kernels {R̄_{γ,θ} : γ > 0, θ ∈ Θ} such that for any γ > 0 and θ ∈ Θ, R̄_{γ,θ} is the Markov kernel defined by (17) but with Ū_θ and V̄_θ in place of U_θ and V_θ respectively. The coefficient κ is related to λ in (11) by κ = λ/γ.
Moreover, although our companion paper [49] only considers the MYULA kernel, the theoretical results we present in this paper also hold if the algorithms are implemented using PULA [21]. Define the family {S_{γ,θ} : γ > 0, θ ∈ Θ} for any x ∈ R^d, γ > 0, θ ∈ Θ and A ∈ B(R^d) by
\[
S_{\gamma,\theta}(x,\mathsf{A}) = (4\pi\gamma)^{-d/2} \int_{\mathsf{A}} \exp\Big[ -\big\| y - \operatorname{prox}^{\gamma\kappa}_{U_\theta}(x) + \gamma \nabla_x V_\theta\big(\operatorname{prox}^{\gamma\kappa}_{U_\theta}(x)\big) \big\|^2 \big/ (4\gamma) \Big]\, \mathrm{d}y\,. \tag{18}
\]
Note that (18) is the Markov kernel associated with the recursion (13) with U ← U_θ, V ← V_θ and λ ← κγ. Consider also the family of Markov kernels {S̄_{γ,θ} : γ > 0, θ ∈ Θ} such that for any γ > 0 and θ ∈ Θ, S̄_{γ,θ} is the Markov kernel defined by the recursion (18) but with Ū_θ and V̄_θ in place of U_θ and V_θ respectively. We use the results derived in [17] to analyse the sequence given by (5) with {(K_{γ,θ}, K̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ} = {(R_{γ,θ}, R̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ} or {(S_{γ,θ}, S̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ}. To this end, we impose that for any γ ∈ (0, γ̄] and θ ∈ Θ, the kernels K_{γ,θ} and K̄_{γ,θ} admit invariant probability distributions, denoted by π_{γ,θ} and π̄_{γ,θ} respectively, which are approximations of π_θ and π̄_θ defined in A3, and which converge geometrically towards them. More precisely, we show in Theorem 4 and Theorem 5 below that MYULA and PULA satisfy these conditions if at least one of the following assumptions is verified:
H2. There exists m > 0 such that for any θ ∈ Θ, V_θ and V̄_θ are m-convex.

H3. There exist η > 0 and c ≥ 0 such that for any θ ∈ Θ and x ∈ R^d, min(U_θ(x), Ū_θ(x)) ≥ η‖x‖ − c.

Note that if for any θ ∈ Θ, U_θ is convex on R^d and sup_{θ∈Θ}(∫_{R^d} exp[−U_θ(x̃)] dx̃) < +∞, then H3 is automatically satisfied, as an immediate extension of [4, Lemma 2.2 (b)]. In [49], H3 is satisfied as soon as the prior distribution x ↦ p(x|θ) is log-concave and proper for any θ ∈ Θ. In [49], if the prior x ↦ p(x|θ) is improper for some θ ∈ Θ, then we require H2 to be satisfied, i.e. for any y ∈ C^{d_y}, there exists m > 0 such that for any θ ∈ Θ, x ↦ p(x|y, θ) is m-log-concave. Finally, we believe that H3 could be relaxed to the following condition: there exist η > 0 and c ≥ 0 such that for any θ ∈ Θ and x ∈ R^d, min(U_θ(x) + V_θ(x), Ū_θ(x) + V̄_θ(x)) ≥ η‖x‖ − c. In particular, this latter condition holds in the case where x ↦ p(x|θ) = exp[−θ^⊤ TV(x)] and sup_{θ∈Θ}(∫_{R^d} exp[−U_θ(x̃) + V_θ(x̃)] dx̃) < +∞.

Consider, for any m ∈ N^* and α > 0, the two functions W_m and W_α given for any x ∈ R^d by
\[
W_m(x) = 1 + \|x\|^{2m}\,, \qquad W_\alpha(x) = \exp\Big[\alpha \sqrt{1 + \|x\|^2}\Big]\,. \tag{19}
\]
Theorem 4. Assume H1 and H2 or H3. Let κ̄ ≥ 1 ≥ κ̲ > 1/2, γ̄ < min{(2 − 1/κ̲)/L, 2/(m + L)} if H2 holds and γ̄ < min{(2 − 1/κ̲)/L, η/(2ML)} if H3 holds. Then for any a ∈ (0, 1], there exist Ā_{2,a} ≥ 0 and ρ̄_a ∈ (0, 1) such that for any θ ∈ Θ, κ ∈ [κ̲, κ̄], γ ∈ (0, γ̄], R_{γ,θ} and R̄_{γ,θ} admit invariant probability measures π_{γ,θ}, respectively π̄_{γ,θ}. In addition, for any x, y ∈ R^d and n ∈ N we have
\[
\max\big( \|\delta_x R_{\gamma,\theta}^n - \pi_{\gamma,\theta}\|_{W^a}\,,\ \|\delta_x \bar{R}_{\gamma,\theta}^n - \bar{\pi}_{\gamma,\theta}\|_{W^a} \big) \le \bar{A}_{2,a}\, \bar{\rho}_a^{\gamma n}\, W^a(x)\,,
\]
\[
\max\big( \|\delta_x R_{\gamma,\theta}^n - \delta_y R_{\gamma,\theta}^n\|_{W^a}\,,\ \|\delta_x \bar{R}_{\gamma,\theta}^n - \delta_y \bar{R}_{\gamma,\theta}^n\|_{W^a} \big) \le \bar{A}_{2,a}\, \bar{\rho}_a^{\gamma n}\, \{W^a(x) + W^a(y)\}\,,
\]
with W = W_m and m ∈ N^* if H2 holds, and W = W_α with α < min(κ̲η/4, η/8) if H3 holds.

Proof. The proof is postponed to Section 5.2.
Theorem 5. Assume H1 and H2 or H3. Let κ̄ ≥ 1 ≥ κ̲ > 1/2, γ̄ < 2/(m + L) if H2 holds and γ̄ < 2/L if H3 holds. Then for any a ∈ (0, 1], there exist A_{2,a} ≥ 0 and ρ_a ∈ (0, 1) such that for any θ ∈ Θ, κ ∈ [κ̲, κ̄], γ ∈ (0, γ̄], S_{γ,θ} and S̄_{γ,θ} admit invariant probability measures π_{γ,θ} and π̄_{γ,θ} respectively. In addition, for any x, y ∈ R^d and n ∈ N we have
\[
\max\big( \|\delta_x S_{\gamma,\theta}^n - \pi_{\gamma,\theta}\|_{W^a}\,,\ \|\delta_x \bar{S}_{\gamma,\theta}^n - \bar{\pi}_{\gamma,\theta}\|_{W^a} \big) \le A_{2,a}\, \rho_a^{\gamma n}\, W^a(x)\,,
\]
\[
\max\big( \|\delta_x S_{\gamma,\theta}^n - \delta_y S_{\gamma,\theta}^n\|_{W^a}\,,\ \|\delta_x \bar{S}_{\gamma,\theta}^n - \delta_y \bar{S}_{\gamma,\theta}^n\|_{W^a} \big) \le A_{2,a}\, \rho_a^{\gamma n}\, \{W^a(x) + W^a(y)\}\,,
\]
with W = W_m and m ∈ N^* if H2 holds, and W = W_α with α < κ̲η/4 if H3 holds.

Proof. The proof is postponed to Section 5.3.
4.2 Main results
We now state our main results regarding the convergence of the sequence defined by (5) under the
following additional regularity assumption.
H4. There exist M_Θ ≥ 0 and f_Θ ∈ C(R₊, R₊) such that for any θ_1, θ_2 ∈ Θ and x ∈ R^d,
\[
\max\big( \|\nabla_x V_{\theta_1}(x) - \nabla_x V_{\theta_2}(x)\|\,,\ \|\nabla_x \bar{V}_{\theta_1}(x) - \nabla_x \bar{V}_{\theta_2}(x)\| \big) \le M_\Theta \|\theta_1 - \theta_2\|\,(1 + \|x\|)\,,
\]
\[
\max\big( \|\nabla_x U^{\kappa}_{\theta_1}(x) - \nabla_x U^{\kappa}_{\theta_2}(x)\|\,,\ \|\nabla_x \bar{U}^{\kappa}_{\theta_1}(x) - \nabla_x \bar{U}^{\kappa}_{\theta_2}(x)\| \big) \le f_\Theta(\kappa) \|\theta_1 - \theta_2\|\,(1 + \|x\|)\,.
\]
In Theorem 6, we give sufficient conditions on the parameters of the algorithm under which the sequence (θ_n)_{n∈N} converges a.s., and we give explicit convergence rates in Theorem 7.

Theorem 6. Assume A1, A2, A3 and that f is convex. Let κ ∈ [κ̲, κ̄] with κ̄ ≥ 1 ≥ κ̲ > 1/2. Assume H1 and one of the following conditions:
(a) H2 holds, γ̄ < min(2/(m + L), (2 − 1/κ̲)/L, L^{-1}) and there exist m ∈ N^* and C_m ≥ 0 such that for any θ ∈ Θ and x ∈ R^d, ‖H_θ(x)‖ ≤ C_m W_m^{1/4}(x) and ‖H̄_θ(x)‖ ≤ C_m W_m^{1/4}(x).
(b) H3 holds, γ̄ < min((2 − 1/κ̲)/L, η/(2ML), L^{-1}) and there exist 0 < α < η/4 and C_α ≥ 0 such that for any θ ∈ Θ and x ∈ R^d, ‖H_θ(x)‖ ≤ C_α W_α^{1/4}(x) and ‖H̄_θ(x)‖ ≤ C_α W_α^{1/4}(x).
Let (γ_n)_{n∈N}, (δ_n)_{n∈N} be sequences of non-increasing positive real numbers and (m_n)_{n∈N} be a sequence of non-decreasing positive integers satisfying δ_0 < 1/L_f and γ_0 < γ̄. Let ({(X_k^n, X̄_k^n) : k ∈ {0, ..., m_n}}, θ_n)_{n∈N} be given by (5). In addition, assume that Σ_{n=0}^{+∞} δ_{n+1} = +∞, Σ_{n=0}^{+∞} δ_{n+1} γ_n^{1/2} < +∞ and that one of the following conditions holds:
(1) Σ_{n=0}^{+∞} δ_{n+1}/(m_n γ_n) < +∞;
(2) m_n = m_0 ∈ N^* for all n ∈ N, sup_{n∈N} |δ_{n+1} − δ_n| δ_n^{-2} < +∞, H4 holds and we have Σ_{n=0}^{+∞} δ_{n+1}^2 γ_n^{-2} < +∞, Σ_{n=0}^{+∞} δ_{n+1} γ_{n+1}^{-3}(γ_n − γ_{n+1}) < +∞.
Then (θ_n)_{n∈N} converges a.s. to some θ^⋆ ∈ arg min_Θ f. Furthermore, a.s. there exists C ≥ 0 such that for any n ∈ N^*,
\[
\frac{\sum_{k=1}^{n} \delta_k f(\theta_k)}{\sum_{k=1}^{n} \delta_k} - \min_\Theta f \le \frac{C}{\sum_{k=1}^{n} \delta_k}\,.
\]
Proof. The proof is postponed to Section 5.6.
These results are similar to the ones established in [17, Theorem 1, Theorem 5, Theorem 6] for the Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm. Note that in SOUL the potential is assumed to be differentiable and the sampler is given by ULA, whereas in Theorem 6 the results are stated for the PULA and MYULA samplers.
Although rigorously establishing convexity of f is usually not possible for imaging models, we expect that in many cases, for any of its minimisers θ^⋆, f is convex in some neighborhood of θ^⋆. For example, this is the case if its Hessian is positive definite around this point.
Assume that δ_n ∼ n^{-a}, γ_n ∼ n^{-b} and m_n ∼ n^{c} with a, b, c ≥ 0. We now distinguish two cases depending on whether m_n = m_0 ∈ N^* for all n ∈ N (fixed batch size) or not (increasing batch size).
1) In the increasing batch size case, Theorem 6 ensures that (θ_n)_{n∈N} converges if the following inequalities are satisfied:
\[
a + b/2 > 1\,, \qquad a - b + c > 1\,, \qquad a \le 1\,. \tag{20}
\]
Note in particular that c > 0, i.e. the number of Markov chain iterates required to compute the estimator of the gradient increases at each step. However, for any a ∈ [0, 1] there exist b, c > 0 such that (20) is satisfied. In the special setting where a = 0, for any ε_2 > ε_1 > 0 the choices b = 2 + ε_1 and c = 3 + ε_2 satisfy (20).
2) In the fixed batch size case, which implies that c = 0, Theorem 6 ensures that (θ_n)_{n∈N} converges if the following inequalities are satisfied:
\[
a + b/2 > 1\,, \qquad 2(a - b) > 1\,, \qquad a + b + 1 - 3b > 1\,, \qquad a \le 1\,,
\]
which can be rewritten as
\[
b \in \big(2(1 - a)\,,\ \min(a - 1/2,\ a/2)\big)\,, \qquad a \in [0, 1]\,.
\]
The interval (2(1 − a), min(a − 1/2, a/2)) is then non-empty if and only if a ∈ (5/6, 1].
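As a practical illustration (ours, not prescribed by the paper), the following helper checks whether candidate exponents (a, b, c) satisfy the convergence conditions above, in both the increasing and the fixed batch-size regimes.

```python
def valid_exponents(a: float, b: float, c: float, fixed_batch: bool) -> bool:
    """Check the polynomial-decay conditions for delta_n ~ n^-a, gamma_n ~ n^-b, m_n ~ n^c."""
    if fixed_batch:
        # fixed batch size: c = 0 and b in (2(1-a), min(a - 1/2, a/2)), with a in (5/6, 1]
        return c == 0 and a <= 1 and 2 * (1 - a) < b < min(a - 0.5, a / 2)
    # increasing batch size: conditions (20)
    return a + b / 2 > 1 and a - b + c > 1 and a <= 1

print(valid_exponents(a=0.0, b=2.1, c=3.2, fixed_batch=False))  # True: the example given above (eps1=0.1, eps2=0.2)
print(valid_exponents(a=0.9, b=0.3, c=0.0, fixed_batch=True))   # True: b lies in (0.2, 0.4)
```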
Theorem 7. Assume A1, A2, A3 and that f is convex. Let κ ∈ [κ̲, κ̄] with κ̄ ≥ 1 ≥ κ̲ > 1/2. Assume H1 and that condition (a) or (b) in Theorem 6 is satisfied. Let (γ_n)_{n∈N}, (δ_n)_{n∈N} be sequences of non-increasing positive real numbers and (m_n)_{n∈N} be a sequence of non-decreasing positive integers satisfying δ_0 < 1/L_f and γ_0 < γ̄. Let ({(X_k^n, X̄_k^n) : k ∈ {0, ..., m_n}}, θ_n)_{n∈N} be given by (5). Then
\[
\mathbb{E}\left[ \frac{\sum_{k=1}^{n} \delta_k f(\theta_k)}{\sum_{k=1}^{n} \delta_k} - \min_\Theta f \right] \le \frac{E_n}{\sum_{k=1}^{n} \delta_k}\,,
\]
where
(a)
\[
E_n = C_1\Big(1 + \sum_{k=0}^{n-1} \delta_{k+1} \gamma_k^{1/2} + \sum_{k=0}^{n-1} \delta_{k+1}/(m_k \gamma_k) + \sum_{k=0}^{n-1} \delta_{k+1}^2/(m_k \gamma_k)^2 \Big)\,; \tag{21}
\]
(b) or, if m_n = m_0 for all n ∈ N, sup_{n∈N} |δ_{n+1} − δ_n| δ_n^{-2} < +∞ and H4 holds,
\[
E_n = C_2\Big(1 + \sum_{k=0}^{n-1} \delta_{k+1} \gamma_k^{1/2} + \sum_{k=0}^{n-1} \delta_{k+1}^2/\gamma_k + \sum_{k=0}^{n-1} \delta_{k+1} \gamma_{k+1}^{-3}(\gamma_k - \gamma_{k+1}) \Big)\,. \tag{22}
\]
Proof. The proof is postponed to Section 5.7.
First, note that if the stepsize is fixed, and recalling that κ = λ/γ, the condition γ < (2 − 1/κ)/L can be rewritten as γ < 2/(L + λ^{-1}). Assume that (δ_n)_{n∈N} is non-increasing, lim_{n→+∞} δ_n = 0, lim_{n→+∞} m_n = +∞ and γ_n = γ_0 > 0 for all n ∈ N. In addition, assume that Σ_{n∈N^*} δ_n = +∞; then, by [37, Problem 80, Part I], it holds that
\[
\lim_{n\to+\infty} \Big(\sum_{k=1}^{n} \delta_k/m_k\Big) \Big/ \Big(\sum_{k=1}^{n} \delta_k\Big) = \lim_{n\to+\infty} 1/m_n = 0\,; \qquad
\lim_{n\to+\infty} \Big(\sum_{k=1}^{n} \delta_k^2\Big) \Big/ \Big(\sum_{k=1}^{n} \delta_k\Big) = \lim_{n\to+\infty} \delta_n = 0\,. \tag{23}
\]
Therefore, using (21) we obtain that
\[
\limsup_{n\to+\infty}\ \mathbb{E}\left[ \frac{\sum_{k=1}^{n} \delta_k f(\theta_k)}{\sum_{k=1}^{n} \delta_k} - \min f \right] \le C_1 \sqrt{\gamma_0}\,.
\]
Similarly, if the stepsize is fixed and the number of Markov chain iterates is fixed, i.e. for all n ∈ N, γ_n = γ_0 and m_n = m_0 with γ_0 > 0 and m_0 ∈ N^*, combining (22) and (23) we obtain that
\[
\limsup_{n\to+\infty}\ \mathbb{E}\left[ \frac{\sum_{k=1}^{n} \delta_k f(\theta_k)}{\sum_{k=1}^{n} \delta_k} - \min f \right] \le C_2 \sqrt{\gamma_0}\,.
\]
5 Proof of the main results
In this section, we gather the proofs of Section 4. First, in Section 5.1, we derive some useful technical lemmas. In Section 5.2, we prove Theorem 4 using minorisation and Foster-Lyapunov drift conditions. Similarly, we prove Theorem 5 in Section 5.3. Next, we show Theorem 6 by applying [17, Theorem 1, Theorem 3] and Theorem 7 by applying [17, Theorem 2, Theorem 4], which boils down to verifying that [17, H1, H2] are satisfied. In Section 5.4, we show that [17, H1, H2] hold if the sequence is given by (5) with {(K_{γ,θ}, K̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ} = {(S_{γ,θ}, S̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ} defined in (18), i.e. we consider PULA as the sampling scheme in the optimisation algorithm. In Section 5.5, we check that [17, H1, H2] are satisfied when {(K_{γ,θ}, K̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ} = {(R_{γ,θ}, R̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ} defined in (17), i.e. when considering MYULA as the sampling scheme. Finally, we prove Theorem 6 in Section 5.6 and Theorem 7 in Section 5.7.
5.1 Technical lemmas
We say that a Markov kernel R on R^d × B(R^d) satisfies a discrete Foster-Lyapunov drift condition D_d(W, λ, b) if there exist λ ∈ (0, 1), b ≥ 0 and a measurable function W : R^d → [1, +∞) such that for all x ∈ R^d,
\[
R W(x) \le \lambda W(x) + b\,.
\]
We will use the following result.

Lemma 8. Let R be a Markov kernel on R^d × B(R^d) which satisfies D_d(W, λ^γ, bγ) with λ ∈ (0, 1), b ≥ 0, γ > 0 and a measurable function W : R^d → [1, +∞). Then, we have for any x ∈ R^d,
\[
R^{\lceil 1/\gamma \rceil} W(x) \le \big(1 + b \log^{-1}(1/\lambda)\, \lambda^{-\bar{\gamma}}\big)\, W(x)\,.
\]
Proof. Using [17, Lemma 9], we have for any x ∈ R^d,
\[
R^{\lceil 1/\gamma \rceil} W(x) \le \Big( \lambda^{\gamma \lceil 1/\gamma \rceil} + b\gamma \sum_{k=0}^{\lceil 1/\gamma \rceil - 1} \lambda^{\gamma k} \Big)\, W(x) \le \big(1 + b \log^{-1}(1/\lambda)\, \lambda^{-\bar{\gamma}}\big)\, W(x)\,.
\]
We continue this section by giving some results on proximal operators. Some of them are
well-known but their proof is given for completeness.
Lemma 9. Let κ>0and U:Rd→Rconvex. Assume that Uis M-Lipschitz with M>0, then
Uκis M-Lipschitz and for any x∈Rd,kx−proxκ
U(x)k6κM.
Proof. Let κ>0. We have for any x, y ∈Rdby (7) and (8)
Uκ(x)−Uκ(y)
=kx−proxκ
U(x)k2/(2κ) + U(proxκ
U(x)) − ky−proxκ
U(y)k2/(2κ)−U(proxκ
U(y))
6ky−proxκ
U(y)k2/(2κ) + U(x−y+ proxκ
U(y)) − ky−proxκ
U(y)k2/(2κ)−U(proxκ
U(y))
6Mkx−yk.
Hence, Uκis M-Lipschitz. Since by [5, Proposition 12.30], Uκis continuously differentiable we
have for any x∈Rd,k∇Uκ(x)k6M. Combining this result with the fact that for any x∈Rd,
∇Uκ(x) = (x−proxκ
U(x))/κby [5, Proposition 12.30] concludes the proof.
Lemma 10. Let U:Rd→[0,+∞)be a convex and M-Lipschitz function with M>0. Then for
any κ>0and z, z ′∈Rd,
hproxκ
U(z)−z, z i6−κU(z) + κ2M2+κ{U(z′) + Mkz′k} .
Proof. κ>0and z, z ′∈Rd. Since (z−proxκ
U(z))/κ∈∂U (proxκ
U(z)) [5, Proposition 16.44], we
have
κ{U(z′)−U(proxκ
U(z))}>hz−proxκ
U(z), z′−proxκ
U(z)i
>hz−proxκ
U(z), z′−zi+kz−proxκ
U(z)k2
>hz−proxκ
U(z), z′−zi.
Combining this result, the fact that Uis M-Lipschitz and Lemma 9we get that
hproxκ
U(z)−z, z i6κU(z′)−κU(z) + κMkz−proxκ
U(z)k+kz′kkz−proxκ
U(z)k
6−κU(z) + κ2M2+κ{U(z′) + Mkz′k} ,
which concludes the proof
Lemma 11. Let κ1,κ2>0and U:Rd→Rconvex and lower semi-continuous. For any x∈Rd
we have
kproxκ1
U(x)−proxκ2
U(x)k262(κ1−κ2)(U(proxκ2
U(x)) −U(proxκ1
U(x))) .
If in addition, Uis M-Lipschitz with M>0then
kproxκ1
U(x)−proxκ2
U(x)k62M|κ1−κ2|.
Proof. Let x∈Rd. By definition of proxκ1
U(x)we have
2κ1U(proxκ1
U(x)) + kx−proxκ1
U(x)k262κ1U(proxκ2
U(x)) + kx−proxκ2
U(x)k2.
Combining this result and the fact that (x−proxκ2
U(x))/κ2∈∂U (proxκ2
U(x)) we have
kproxκ1
U(x)−proxκ1
U(x)k2
62κ1{U(proxκ2
U(x)) −U(proxκ1
U(x))}+ 2hx−proxκ2
U(x),proxκ1
U(x)−proxκ2
U(x)i
62κ1{U(proxκ2
U(x)) −U(proxκ1
U(x))}+ 2κ2{U(proxκ1
U(x)) −U(proxκ2
U(x))}
62(κ1−κ2)(U(proxκ2
U(x)) −U(proxκ1
U(x))) ,
which concludes the proof.
Lemma 12. Let V:Rd→Rm-convex and continuously differentiable with m>0. Assume that
there exists M > 0such that for any x, y ∈Rd
k∇V(x)− ∇V(y)k6Mkx−yk.
Assume that there exists x⋆∈arg minRdV, then for any γ∈(0,¯γ]with ¯γ < 2/(M+m)and x∈Rd
kx−γ∇V(x)k26(1 −γ)kxk2+γ{(2/(m+M)−¯γ)−1+ 4}kx⋆k2,
with =mM/(m+M).
Proof. Let x∈Rd,γ∈(0,¯γ]and ¯γ < 2/(m+M). Using [36, Theorem 2.1.11] and the fact that for
any a, b, ε > 0,εa2+b2/ε >2ab we have
kx−γ∇V(x)k2
6kxk2−2γh∇V(x)− ∇V(x⋆), x −x⋆i+γ¯γk∇V(x)− ∇V(x⋆)k2
+ 2γkx⋆kk∇V(x)− ∇V(x⋆)k
6kxk2−2γ kx−x⋆k2−γ(2/(m+M)−¯γ)k∇V(x)− ∇V(x⋆)k2
+ 2γkx⋆kk∇V(x)− ∇V(x⋆)k
6kxk2−2γ kx−x⋆k2−γ(2/(m+M)−¯γ)k∇V(x)− ∇V(x⋆)k2
+γ(2/(m+M)−¯γ)k∇V(x)− ∇V(x⋆)k2+γ/(2/(m+M)−¯γ)kx⋆k2
6(1 −2γ)kxk2+ 4γ kx⋆k kxk+γ/(2/(m+M)−¯γ)kx⋆k2
6(1 −γ)kxk2+γ(2/(m+M)−¯γ)−1+ 4kx⋆k2.
Lemma 13. Assume H1and H2. Then for any κ > 0,θ∈Θ,γ∈(0,¯γ]with ¯γ < 2/(m+L)and
x∈Rd, we have
proxγκ
Uθ(x)−γ∇xVθ(proxγκ
Uθ(x))
2
6(1 −γ/2) kxk2+γ¯γκ2M2+(2/(m+L)−¯γ)−1+ 4R2
V,1+2κ2M2−1,
with =mL/(m+L).
Proof. Let κ > 0,θ∈Θ,γ∈(0,¯γ]and x∈Rd. Using H1,H2, Lemma 9, Lemma 12, the
Cauchy-Schwarz inequality and that for any α, β >0,maxt∈R(−αt2+ 2βt) = β2/α, we have
proxγκ
Uθ(x)−γ∇xVθ(proxγκ
Uθ(x))
2
6(1 −γ)
proxγκ
Uθ(x)
2+γ(2/(m+L)−¯γ)−1+ 4kx⋆
θk2
6(1 −γ)
x−proxγκ
Uθ(x)−x
2+γ(2/(m+L)−¯γ)−1+ 4R2
V,1
6(1 −γ)kxk2+γ2κ2M2+ 2γκMkxk+γ(2/(m+L)−¯γ)−1+ 4R2
V,1
6(1 −γ/2) kxk2+γ2κ2M2+γ(2/(m+L)−¯γ)−1+ 4R2
V,1+ 2γκMkxk − γ kxk2/2
6(1 −γ/2) kxk2+γ¯γκ2M2+γ(2/(m+L)−¯γ)−1+ 4R2
V,1+ 2γκ2M2−1.
Lemma 14. Assume H1and H3. Then for any κ > 0,θ∈Θ,γ∈(0,¯γ]with ¯γ < 2/Land
x∈Rd, we have
proxγκ
Uθ(x)−γ∇xVθ(proxγκ
Uθ(x))
26kxk2+γ3¯γκ2M2+ 2κc+ 2κ(RU,2+MRU,1)
+(2/L −¯γ)−1R2
V,1−2κη kxk.
Proof. Let κ > 0,θ∈Θ,γ∈(0,¯γ]and x∈Rd. Using H1,H3, Lemma 9and Lemma 10 and
Lemma 12 we have
proxγκ
Uθ(x)−γ∇xVθ(proxγκ
Uθ(x))
26kproxγκ
Uθ(x)k2+γ/(2/L−¯γ)R2
V,1
6kxk2+γ2κ2M2+ 2hproxγκ
Uθ(x)−x, xi+γ/(2/L−¯γ)R2
V,1
6kxk2+ 3γ2κ2M2−2γκU (x) + 2γκ(U(x♯
θ) + Mkx♯
θk) + γ/(2/L−¯γ)R2
V,1
6kxk2+ 3γ2κ2M2−2γκη kxk+ 2γ κc
+ 2γκ(U(x♯
θ) + Mkx♯
θk) + γ/(2/L−¯γ)R2
V,1
6kxk2+γ3¯γκ2M2+ 2κc+ 2κ(RU,2+MRU,1) + (2/L −¯γ)−1R2
V,1−2κη kxk.
Lemma 15. Assume H1and H2. Then for any κ > 0,θ∈Θ,γ∈(0,¯γ]with ¯γ < 2/(m+L)and
x∈Rd, we have
kx−γ∇xVθ(x)−γ∇xUγκ
θ(x)k26(1 −γ/2) kxk2
+γ(2/(m+L)−¯γ)−1+ 4R2
V,1+ 2γ2MLRV ,1+γ2M2+ 2γM2(1 + ¯γL)2−1,
with =mL/(2m+ 2L).
Proof. Let κ > 0,θ∈Θ,γ∈(0,¯γ]and x∈Rd. Using H1,H2, Lemma 9, Lemma 12 and that for
any α, β >0,max(−αt2+ 2βt) = β2/α we have
kx−γ∇xVθ(x)−γ∇xUγκ
θ(x)k2
6kx−γ∇xVθ(x)k2+ 2γMkx−γ{∇xVθ(x)− ∇xVθ(x⋆
θ)}k +γ2M2
6(1 −γ)kxk2+γ(2/(m+L)−¯γ)−1+ 4kx⋆
θk2
+ 2γMkxk+ 2γ2Mk∇xVθ(x)− ∇xVθ(x⋆
θ)k+γ2M2
6(1 −γ)kxk2+γ(2/(m+L)−¯γ)−1+ 4kx⋆
θk2
+ 2γMkxk+ 2γ2ML kxk+ 2γ2ML kx⋆
θk+γ2M2
6(1 −γ/2) kxk2+γ(2/(m+L)−¯γ)−1+ 4R2
V,1
+ 2γ2MLRV,1+γ2M2+ 2γM(1 + ¯γL)kxk − γ kxk2/2
6(1 −γ/2) kxk2+γ(2/(m+L)−¯γ)−1+ 4R2
V,1
+ 2γ2MLRV,1+γ2M2+ 2γM2(1 + ¯γL)2−1.
Lemma 16. Assume H1and H3. Then for any κ > 0,θ∈Θ,x∈Rdand γ∈(0,¯γ]with
¯γ < min(2/L, η/(2ML)), we have
kx−γ∇xVθ(x)−γ∇xUγκ
θ(x)k2
6kxk2+γ(2/L −¯γ)−1R2
V,1+ 3¯γM2+ 2c+ 2(MRU,1+RU,2) + 2¯γMLRV,2−ηkxk.
Proof. Let κ > 0,θ∈Θ,γ∈(0,¯γ]and x∈Rd. Using H1,H3, (7), Lemma 9and Lemma 10 we
have
kx−γ∇xVθ(x)−γ∇xUγκ
θ(x)k2
6kx−γ∇xVθ(x)k2−2γhx−γ∇xVθ(x),∇xUγκ
θ(x)i+γ2M2
6kx−γ∇xVθ(x)k2−2κ−1hx−γ∇xVθ(x), x −proxγκ
Uθ(x)i+γ2M2
6kx−γ∇xVθ(x)k2−2κ−1hx, x −proxγκ
Uθ(x)i+ 2κ−1γk∇xVθ(x)kkx−proxγκ
Uθ(x)k+γ2M2
6kx−γ∇xVθ(x)k2+ 3γ2M2−2γη kxk+ 2γc+ 2γ(Mkx♯
θk+U(x♯
θ)) + 2γ¯γMk∇xVθ(x)k
6kx−γ∇xVθ(x)k2+ 3γ¯γM2−2γη kxk
+ 2γc+ 2γ(MRU,1+RU,2) + 2γ¯γML kxk+ 2γ¯γML kx⋆
θk
6kx−γ∇xVθ(x)k2+ 3γ¯γM2−γη kxk+ 2γc+ 2γ(MRU,1+RU,2) + 2γ¯γML kx⋆
θk,
where we have used for the last inequality that ¯γ < η/(2ML). Then, we can conclude using H1and
Lemma 12 that
kx−γ∇xVθ(x)−γ∇xUγκ
θ(x)k2
6kxk2+γ/(2/L −¯γ)R2
V,1+ 3γ¯γM2−γη kxk+ 2γc+ 2γ(MRU,1+RU,2) + 2γ¯γMLRV ,1
6kxk2+γ(2/L −¯γ)−1R2
V,1+ 3¯γM2+ 2c+ 2(MRU,1+RU,2) + 2¯γMLRV,2−ηkxk.
For υ ∈ R^d and σ ≥ 0, denote by Υ_{υ,σ} the d-dimensional Gaussian distribution with mean υ and covariance matrix σ²Id.

Lemma 17. For any σ_1, σ_2 > 0 and υ_1, υ_2 ∈ R^d, we have
\[
\mathrm{KL}\big(\Upsilon_{\upsilon_1,\sigma_1}\,\big|\,\Upsilon_{\upsilon_2,\sigma_2}\big) = \|\upsilon_1 - \upsilon_2\|^2/(2\sigma_2^2) + (d/2)\big( -\log(\sigma_1^2/\sigma_2^2) - 1 + \sigma_1^2/\sigma_2^2 \big)\,.
\]
In addition, if σ_1 ≥ σ_2,
\[
\mathrm{KL}\big(\Upsilon_{\upsilon_1,\sigma_1}\,\big|\,\Upsilon_{\upsilon_2,\sigma_2}\big) \le \|\upsilon_1 - \upsilon_2\|^2/(2\sigma_2^2) + (d/2)\big(1 - \sigma_1^2/\sigma_2^2\big)^2\,.
\]
Proof. Let Xbe a d-dimensional Gaussian random variable with mean υ1and covariance matrix
σ2
1Id. We have that
KL (Υυ1,σ1Id|Υυ2,σ2Id ) = Ehlog n(σ2
2/σ2
1)d/2exp h−kX−υ1k2/(2σ2
1) + kX−υ2k2/(2σ2
2)ioi
=−(d/2) log(σ2
1/σ2
2) + Eh−kX−υ1k2/(2σ2
1) + kX−υ2k2/(2σ2
2)i
=−(d/2) log(σ2
1/σ2
2) + (1/2)(σ−2
2−σ−2
1)Eh−kX−υ1k2i+
υ2
1−υ2
2
/(2σ2
2)
=−(d/2) log(σ2
1/σ2
2) + (d/2)(σ2
1/σ2
2−1) +
υ2
1−υ2
2
/(2σ2
2)
=kυ1−υ2k2/(2σ2
2) + (d/2) −log(σ2
1/σ2
2)−1 + σ2
1/σ2
2.
In the case where σ1>σ2, let s=σ2
1/σ2
2−1. Since s>0we have log(1 + s)>s−s2. Therefore,
we get that
−log(σ2
1/σ2
2)−1 + σ2
1/σ2
2=−log(1 + s) + s6s2,
which concludes the proof.
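As a quick numerical sanity check of Lemma 17 (our addition, not part of the proof), the closed-form expression can be compared against a Monte Carlo estimate of the KL divergence between two isotropic Gaussians.

```python
import numpy as np

def kl_isotropic_gaussians(u1, u2, s1, s2):
    """Closed form of Lemma 17 for N(u1, s1^2 Id) and N(u2, s2^2 Id)."""
    d = u1.size
    return (np.sum((u1 - u2) ** 2) / (2 * s2 ** 2)
            + 0.5 * d * (-np.log(s1 ** 2 / s2 ** 2) - 1 + s1 ** 2 / s2 ** 2))

rng = np.random.default_rng(0)
d, s1, s2 = 3, 1.3, 0.9
u1, u2 = np.zeros(d), 0.5 * np.ones(d)
x = u1 + s1 * rng.standard_normal((200000, d))           # samples from the first Gaussian
log_ratio = (np.sum((x - u2) ** 2, axis=1) / (2 * s2 ** 2)
             - np.sum((x - u1) ** 2, axis=1) / (2 * s1 ** 2)
             + d * np.log(s2 / s1))                       # log density ratio of the two Gaussians
print(kl_isotropic_gaussians(u1, u2, s1, s2), log_ratio.mean())  # the two values agree closely
```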
5.2 Proof of Theorem 4
We show that under H2or H3, Foster-Lyapunov drifts hold for MYULA in Lemma 18 and
Lemma 19. Combining these Foster-Lyapunov drifts with an appropriate minorisation condition
Lemma 20, we obtain the geometric ergodicity of the underlying Markov chain in Theorem 21.
Lemma 18. Assume H1and H2. Then for any θ∈Θ,κ∈[κ, ¯κ]and γ∈(0,¯γ]with ¯κ>1>
κ > 1/2,¯γ < 2/(m+L),Rγ ,θ and ¯
Rγ,θ satisfy Dd(W1, λγ
2, b2γ)with
λ2= exp [−/2] ,
b2=(2/(m+L)−¯γ)−1+ 4R2
V,1+ 2¯γMLRV ,1+ ¯γM2+ 2d+ 2M2(1 + ¯γL)2−1+/2,
=mL/(m+L),
where for any x∈Rd,W2(x) = 1 + kxk2. In addition, for any m∈N∗, there exist λm∈(0,1),
bm>0such that for any θ∈Θ,κ∈[κ, ¯κ],γ∈(0,¯γ]with ¯κ>1>κ > 1/2,¯γ < 2/(m+L),Rγ,θ
and ¯
Rγ,θ satisfy Dd(Wm, λγ
m, bmγ), where Wmis given in (19).
Proof. We show the property for Rγ ,θ only as the proof for ¯
Rγ,θ is identical. Let θ∈Θ,κ∈[κ,¯κ],
γ∈(0,¯γ]and x∈Rd. Let Zbe a d-dimensional Gaussian random variable with zero mean and
identity covariance matrix. Using Lemma 15 we have
ZRdkyk2Rγ,θ (x, dy) = E
x−γ∇xVθ(x)−γ∇xUγκ
θ(x) + p2γZ
2
=kx−γ∇xVθ(x)−γ∇xUγκ
θ(x)k+ 2γd
6(1 −γ/2) kxk2+γ(2/(m+L)−¯γ)−1+ 4R2
V,1
+2¯γMLRV,1+ ¯γM2+ 2d+ 2M2(1 + ¯γL)2−1.
Therefore, we get
ZRd
(1 + kyk2)Rγ,θ (x, dy)6(1 −γ/2)(1 + kxk2) + γ(2/(m+L)−¯γ)−1+ 4R2
V,1
+2¯γMLRV,1+ ¯γM2+ 2d+ 2M2(1 + ¯γL)2−1+/2,
which concludes the first part of the proof. Let Tγ ,θ(x) = x−γ∇xVθ(x)−γ∇xUγκ
θ(x). In the
sequel, for any k∈ {1,...,m},b, ˜
bk>0and λ, ˜
λk∈[0,1) are constants independent of γwhich
may take different values at each appearance. Note that using Lemma 15, for any k∈ {1,...,2m}
there exist ˜
λk∈(0,1) and ˜
bk>0such that
kTγ,θ (x)kk6{˜
λγ
kkxk+γ˜
bk}k(24)
6˜
λγk
kkxkk+γ2kmax(˜
bk,1)kmax(¯γ, 1)2k−1n1 + kxkk−1o
6˜
λγ
kkxkk+˜
bkγn1 + kxkk−1o6(1 + kxkk)(1 + ˜
bkγ).
Therefore, combining (24) and the Cauchy-Schwarz inequality we obtain
ZRd
(1 + kyk2)Rγ,θ (x, dy) = 1 + Eh(kTγ,θ (x)k2+ 2p2γhTγ,θ(x), Zi+ 2γkZk2)mi
= 1 +
m
X
k=0
k
X
ℓ=0 m
kk
ℓkTγ,θ (x)k2(m−k)2(3k−ℓ)/2γ(k+ℓ)/2EhhTγ,θ (x), Z ik−ℓkZk2ℓi
61 + kTγ,θ (x)k2m
+ 23m/2
m
X
k=1
k
X
ℓ=0 m
kk
ℓkTγ,θ (x)k2(m−k)γ(k+ℓ)/2EhhTγ,θ(x), Zik−ℓkZk2ℓi
1
{(1,0)}c(k, ℓ)
61 + kTγ,θ (x)k2m
+γ23m/2
m
X
k=1
k
X
ℓ=0 m
kk
ℓkTγ,θ (x)k2m−k−ℓ¯γ(k+ℓ)/2−1EhkZkk+ℓi
1
{(1,0)}c(k, ℓ)
61 + λγ
2mkxk2m+b2mγn1 + kxk2m−1o
+γ23m/222mmax(¯γ, 1)2msup
k∈{1,...,m}n(1 + ˜
bk¯γ)EhkZkkio(1 + kxk2m−1)
61 + λγkxk2m+γb(1 + kxk2m−1)
6λγ/2(1 + kxk2m) + γb(1 + kxk2m−1) + λγ(1 + kxk2m)−λγ/2(1 + kxk2m).
Using that λγ−λγ/26−log(1/λ)γλγ/2/2, concludes the proof.
Lemma 19. Assume H1and H3. Then for any θ∈Θ,κ∈[κ, ¯κ]and γ∈(0,¯γ]with ¯κ>1>
κ > 1/2,¯γ < min(2/L, η/(2ML)),Rγ ,θ and ¯
Rγ,θ satisfy Dd(W, λγ, bγ)with
λ= e−α2,
be= (4/L −2¯γ)−1R2
V,1+ (3/2)¯γM2+c+MRU,1+RU,2+ ¯γMLRV,2+d+ 2α ,
b=αbeeα¯γ beW(R),
W=Wα, α < η/8,
Rη= max (2be/(η−8α),1) ,
(25)
where Wαis given in (19).
Proof. We show the property for Rγ,θ only as the proof for ¯
Rγ,θ is identical. Let θ∈Θ,κ∈[κ,¯κ]
γ∈(0,¯γ],x∈Rdand Zbe a d-dimensional Gaussian random variable with zero mean and identity
covariance matrix. Using Lemma 16 we have
ZRdkyk2Rγ,θ (x, dy) = kx−γ∇xVθ(x)−γ∇xUγκ
θk2+ 2γd
6kxk2+γ(2/L −¯γ)−1R2
V,1+ 3¯γM2+ 2c+ 2(MRU,1+RU,2) + 2¯γMLRV,2+ 2d−ηkxk.
Using the log-Sobolev inequality [3, Proposition 5.4.1] and Jensen’s inequality we get that
Rγ,θ W(x)6exp αRγ,θφ(x) + α2γ(26)
6exp "α1 + ZRdkyk2Rγ,θ (x, dy)1/2
+α2γ#.
We now distinguish two cases:
(a) If kxk>Rη, recalling that Rηis given in (25), then
(2/L −¯γ)−1R2
V,1+ 3¯γM2+ 2c+ 2(MRU,1+RU,2) + 2¯γMLRV,2+ 2d−ηkxk6−8αkxk.
In this case using that φ−1(x)kxk>1/2and that for any t>0,√1 + t61 + t/2we have
1 + ZRdkyk2Rγ,θ (x, dy)1/2
−φ(x)6
6γφ−1(x)(2/L −¯γ)−1R2
V,1+ 3¯γM2+ 2c+ 2(MRU,1+RU,2) + 2¯γMLRV,2+ 2d−ηkxk2
6−4αγφ−1(x)kxk6−2αγ .
Hence,
Rγ,θ W(x)6"α1 + ZRdkyk2Rγ,θ (x, dy)1/2
+α2γ#6e−α2γW(x).
(b) If kxk6Rηthen using that for any t>0,√1 + t61 + t/2we have
1 + ZRdkyk2Rγ,θ (x, dy)1/2
−φ(x)
6γ((4/L −2¯γ)−1R2
V,1+ (3/2)¯γM2+c+MRU,1+RU,2+ ¯γMLRV,2+d).
Therefore, using (26), we get
Rγ,θ W(x)
6exp αγ (4/L −2¯γ)−1R2
V,1+ (3/2)¯γM2+c+MRU,1+RU,2+ ¯γMLRV,2+d+αW(x).
Since for all a>b,ea−eb6(a−b)eawe obtain that
Rγ,θ W(x)6λγW(x) + γαbeeα¯γ beW(Rη),
which concludes the proof.
Lemma 20. Assume H1. For any κ∈[κ, ¯κ],θ∈Θ,γ∈(0,¯γ]with ¯κ>1>κ > 1/2,
¯γ < (2 −1/κ)/Land x, y ∈Rd
max kδxR⌈1/γ⌉
γ,θ −δyR⌈1/γ ⌉
γ,θ kTV ,kδx¯
R⌈1/γ⌉
γ,θ −δy¯
R⌈1/γ⌉
γ,θ kTV 61−2Φn−kx−yk/(2√2)o,
where Φis the cumulative distribution function of the standard normal distribution on R.
Proof. We only show that for any θ∈Θ,κ∈[κ,¯κ],γ∈(0,¯γ]with ¯κ>1>κ > 1/2,¯γ < (2−1/κ)/L
and x, y ∈Rd, we have kδxR⌈1/γ⌉
γ,θ −δyR⌈1/γ ⌉
γ,θ kTV 61−2Φ− kx−yk/(2√2)as the proof of for
¯
Rγ,θ is similar. Let κ∈[κ,¯κ],θ∈Θ,γ∈(0,¯γ]. We have that x7→ Vθ(x) + Uγκ
θ(x)is convex,
continuously differentiable and satisfies for any x, y ∈Rd
k∇xVθ(x) + ∇xUγκ
θ(x)− ∇xVθ(y)− ∇xUγκ
θ(y)k6{L+ 1/(γκ)} kx−yk,
Combining this result with [36, Theorem 2.1.5, Equation (2.1.8)] and the fact that γ62/{L+
1/(γκ)}since ¯γ6(2 −1/κ)/L, we have for any x, y ∈Rd
kx−γ∇xVθ(x)−γ∇xUγκ
θ(x)−y+γ∇xVθ(y) + γ∇xUγκ
θ(y)k6kx−yk.
The proof is then an application of [16, Proposition 3b] with ℓ←1, for any x∈Rd,Tγ,θ(x)←
x−γ∇xVθ(x)−γ∇x∇Uγκ
θ(x)and Π←Id.
Theorem 21. Assume H1and H2or H3. Let ¯κ>1>κ > 1/2,¯γ < min{(2 −1/κ)/L,2/(m+L)}
if H2holds and ¯γ < min{(2 −1/κ)/L, η/(2ML)}if H3holds. Then for any a∈(0,1], there exist
A2,a >0and ρa∈(0,1) such that for any θ∈Θ,κ∈[κ,¯κ],γ∈(0,¯γ],Rγ ,θ and ¯
Rγ,θ admit
invariant probability measures πγ,θ, respectively ¯πγ,θ , and for any x, y ∈Rdand n∈Nwe have
max kδxRn
γ,θ −πγ ,θ kWa,kδx¯
Rn
γ,θ −¯πγ,θ kWa6A2,a ργn
aWa(x),
max kδxRn
γ,θ −δyRn
γ,θ kWa,kδx¯
Rn
γ,θ −δy¯
Rn
γ,θ kWa6A2,aργn
a{Wa(x) + Wa(y)},
with W=Wmand m∈N∗if H2holds and W=Wαwith α < min(κη/4, η/8) if H3holds, see
(19).
Proof. We only show that for any a∈(0,1], there exist A2,a >0and ρa∈(0,1) such that for
any θ∈Θ,κ∈[κ,¯κ]and γ∈(0,¯γ]we have kδxRn
γ,θ −πγ ,θ kWa6A2,aργ n
aWa(x)and kδxRn
γ,θ −
δyRn
γ,θ kWa6A2,aργn
a{Wa(x) + Wa(y)}, since the proof for ¯
Rγ,θ is similar . Let a∈[0,1]. First,
using Jensen’s inequality and Lemma 18 if H2holds or Lemma 19 if H3holds, we get that there
exist λaand basuch that for any θ∈Θ,κ∈[κ, ¯κ],γ∈(0,¯γ],Rγ,θ and ¯
Rγ,θ satisfy Dd(Wa, λγ
a, baγ).
Combining [16, Theorem 6], Lemma 20 and Dd(Wa, λγ
a, baγ), we get that there exist ¯
A2,a >0and
ρa∈(0,1) such that for any θ∈Θ,κ∈[κ, ¯κ],γ∈(0,¯γ],x, y ∈Rdand n∈N,Rγ,θ and ¯
Rγ,θ admit
invariant probability measures πγ,θ and ¯πγ ,θ respectively and
max kδxRn
γ,θ −δyRn
γ,θ kWa,kδx¯
Rn
γ,θ −δy¯
Rn
γ,θ kWa6¯
A2,aργ n
a{Wa(x) + Wa(y)}.(27)
Using that for any θ∈Θ,κ∈[κ,¯κ]and γ∈(0,¯γ],Rγ,θ and ¯
Rγ,θ satisfy Dd(Wa, λγ
a, baγ)and [17,
Lemma S2] we have
πγ,θ (Wa)6baγ/(1 −λγ
a)6baλ−¯γ
a/log(1/λa).(28)
Hence, combining (27) and (28), we have for any θ∈Θ,κ∈[κ, ¯κ],γ∈(0,¯γ]and n∈N
max kδxRn
γ,θ −πγ ,θ kW,kδx¯
Rn
γ,θ −¯πγ,θ kW6¯
A2,aργ n
a(1 + baλ−¯γ
a/log(1/λa))Wa(x).
We conclude upon letting A2,a =¯
A2,a(1 + baλ−¯γ
a/log(1/λa)).
5.3 Proof of Theorem 5
We show that under H2or H3, Foster-Lyapunov drifts hold for PULA in Lemma 22 and Lemma 23.
Combining these Foster-Lyapunov drifts with an appropriate minorisation condition Lemma 24,
we obtain the geometric ergodicity of the underlying Markov chain in Theorem 25.
Lemma 22. Assume H1and H2. Then for any θ∈Θ,κ∈[κ, ¯κ]and γ∈(0,¯γ]with ¯κ>1>
κ > 1/2and ¯γ < 2/(m+L),Sγ ,θ and ¯
Sγ,θ satisfy Dd(W1, λγ
2, b2γ)with
λ2= exp [−/2] ,
b2= ¯γ¯κ2M2+(2/(m+L)−¯γ)−1+ 4R2
V,2+ 2d+ 2¯κ2M2−1+/2,
=mL/(m+L),
where for any x∈Rd,W1(x) = 1 + kxk2. In addition, for any m∈N∗, there exist λm∈(0,1),
bm>0such that for any θ∈Θ,κ∈[κ,¯κ]and γ∈(0,¯γ]with ¯κ>1>κ > 1/2and ¯γ < 2/(m+L),
Sγ,θ and ¯
Sγ,θ satisfy Dd(Wm, λγ
m, bmγ), where Wmis given in (19).
Proof. We show the property for Sγ,θ only as the proof for ¯
Sγ,θ is identical. Let θ∈Θ,κ∈[κ,¯κ],
γ∈(0,¯γ]and x∈Rd. Let Zbe a d-dimensional Gaussian random variable with zero mean and
identity covariance matrix. Using Lemma 13 we have
ZRdkyk2Sγ,θ (x, dy) = E
proxγκ
Uθ(x)−γ∇xVθ(proxγκ
Uθ(x)) + p2γZ
2
6(1 −γ/2) kxk2+γ¯γκ2M2+(2/(m+L)−¯γ)−1+ 4R2
V,1+2κ2M2−1+ 2γ d .
Therefore, we get
ZRd
(1 + kyk2)Sγ,θ (x, dy)6(1 −γ/2)(1 + kxk2) + γ¯γκ2M2
+(2/(m+L)−¯γ)−1+ 4R2
V,1+ 2d+ 2κ2M2−1+/2,
which concludes the first part of the proof using that for any t>0,1−t6e−t. The proof of the
result for W=Wmwith m∈N∗is a straightforward adaptation of the one of Lemma 18 and is
left to the reader.
Lemma 23. Assume H1and H3. Then for any θ∈Θ,κ∈[κ, ¯κ]and γ∈(0,¯γ]with ¯κ>1>
κ > 1/2and ¯γ < 2/L,Sγ ,θ and ¯
Sγ,θ satisfy Dd(W, λγ, bγ)with
λ= e−α2,
be= (3/2)¯γ¯κ2M2+ ¯κc+ ¯κ(RU,2+MRU,1) + (4/L −2¯γ)−1R2
V,1+d+ 2α
b=αbeeα¯γ beW(R),
W=Wα,0< α < κη/4,
Rη= max (be/(κη−4α),1) ,
and where Wαis given in (19).
Proof. We show the property for Sγ,θ only as the proof for ¯
Sγ,θ is identical. Let θ∈Θ,κ∈[κ,¯κ],
γ∈(0,¯γ],x∈Rd, and Zbe a d-dimensional Gaussian random variable with zero mean and
identity covariance matrix. Using Lemma 14 we have
ZRdkyk2Sγ,θ (x, dy)6
proxγκ
Uθ(x)−γ∇xVθ(proxγκ
Uθ(x))
2+ 2γd
6kxk2+γ3¯γκ2M2+ 2κc+ 2κ(RU,2+MRU,1) + (2/L −¯γ)−1R2
V,1+ 2d−2κη kxk.
Using the log-Sobolev inequality [3, Proposition 5.4.1] and Jensen’s inequality we get that
Sγ,θ W(x)6exp αSγ,θφ(x) + α2γ(29)
6exp "α1 + ZRdkyk2Sγ,θ (x, dy)1/2
+α2γ#.
We now distinguish two cases.
(a) If kxk>Rηthen φ−1(x)kxk>1/2and 3¯γκ2M2+ 2κc+ 2κ(RU,2+MRU,1) + (2/L −¯γ)−1R2
V,1+
2d−2κη kxk6−8αkxk. In this case using that for any t>0,√1 + t−16t/2we get
1 + ZRdkyk2Sγ,θ (x, dy)1/2
−φ(x)
6γφ−1(x)3¯γκ2M2+ 2κc+ 2κ(RU,2+MRU,1) + (2/L −¯γ)−1R2
V,1+ 2d−2κη kxk/2
6−4αγφ−1(x)kxk6−2αγ .
Hence,
Sγ,θ W(x)6exp "α1 + ZRdkyk2Sγ,θ (x, dy)1/2
+α2γ#6e−α2γW(x).
(b) If kxk6Rηthen using that for any t>0,√1 + t−16t/2
1 + ZRdkyk2Sγ,θ (x, dy)1/2
−φ(x)
6γ(3/2)¯γκ2M2+κc+κ(RU,2+MRU,1) + (4/L −2¯γ)−1R2
V,1+d.
Therefore we get using (29)
Sγ,θ W(x)/W (x)
6exp αγ (3/2)¯γκ2M2+κc+κ(RU,2+MRU,1) + (4/L −2¯γ)−1R2
V,1+d+α6eαbeγ.
Since for all a>b,ea−eb6(a−b)eawe obtain that
Sγ,θ W(x)6λγW(x) + γαbeeα¯γbeW(Rη),
which concludes the proof.
Lemma 24. Assume H1. For any θ∈Θ,κ∈[κ, ¯κ]and γ∈(0,¯γ]with ¯κ>1>κ > 1/2,¯γ < 2/L
and x, y ∈Rd
max kδxS⌈1/γ⌉
γ,θ −δyS⌈1/γ ⌉
γ,θ kTV ,kδx¯
S⌈1/γ⌉
γ,θ −δy¯
S⌈1/γ⌉
γ,θ kTV 61−2Φn−kx−yk/(2√2)o,
where Φis the cumulative distribution function of the standard normal distribution on R.
Proof. We only show that for any θ∈Θ,κ∈[κ,¯κ],γ∈(0,¯γ]with ¯γ < 2/L, and x, y ∈Rd,
kδxS⌈1/γ⌉
γ,θ −δyS⌈1/γ ⌉
γ,θ kTV 61−2Φ− kx−yk/(2√2)since the proof for ¯
Sγ,θ is similar. Let
θ∈Θ,κ∈[κ, ¯κ],γ∈(0,¯γ]. Using [36, Theorem 2.1.5, Equation (2.1.8)] and that the proximal
operator is non-expansive [5, Proposition 12.28], we have for any x, y ∈Rd
proxγκ
Uθ(x)−proxγκ
Uθ(y)−γ(∇xVθ(proxγκ
Uθ(x)) − ∇xVθ(proxγκ
Uθ(y)))
6
proxγκ
Uθ(x)−proxγκ
Uθ(y)
6kx−yk.
The proof is then an application of [16, Proposition 3b] with ℓ←1, for any x∈Rd,Tγ,θ(x)←
proxγκ
Uθ(x)−γ∇xVθ(proxγκ
Uθ(x)) and Π←Id.
Theorem 25. Assume H1and H2or H3. Let ¯κ>1>κ > 1/2. Let ¯γ < 2/(m+L)if H2holds
and ¯γ < 2/Lif H3holds. Then for any a∈(0,1], there exist A2,a >0and ρa∈(0,1) such that
for any θ∈Θ,κ∈[κ, ¯κ],γ∈(0,¯γ],Sγ,θ and ¯
Sγ,θ admit an invariant probability measure πγ,θ and
¯πγ,θ respectively, and for any x, y ∈Rdand n∈Nwe have
max kδxSn
γ,θ −πγ ,θ kWa,kδx¯
Sn
γ,θ −¯πγ,θ kWa6A2,aργn
aWa(x),
max kδxSn
γ,θ −δySn
γ,θ kWa,kδx¯
Sn
γ,θ −δy¯
Sn
γ,θ kWa6A2,aργ n
a{Wa(x) + Wa(y)},
with W=Wmand m∈N∗if H2holds and W=Wαwith α < κη/4if H3holds, see (19).
Proof. The proof is similar to the one of Theorem 21.
5.4 Checking [17, H1, H2] for PULA
Lemma 26 implies that [17, H1a] holds. The geometric ergodicity proved in Theorem 25 implies
[17, H1b]. Then, we show that the distance between the invariant probability distribution of the
Markov chain and the target distribution is controlled in Corollary 31 and therefore [17, H1c] is
satisfied. Finally, we show that [17, H2] is satisfied in Proposition 32.
Lemma 26. Assume H1,H2or H3, and let (Xn
k,¯
Xn
k)n∈N,k∈{0,...,mn}be given by (5)with
{(Kγ,θ ,¯
Kγ,θ ) : γ∈(0,¯γ], θ ∈Θ}={(Sγ,θ ,¯
Sγ,θ ) : γ∈(0,¯γ], θ ∈Θ}and κ∈[κ, ¯κ]with
¯κ>1>κ > 1/2. Then there exists A1>1such that for any n, p ∈Nand k∈ {0,...,mn}
EhSp
γn,θnW(Xn
k)X0
0i6A1W(X0
0),
Eh¯
Sp
γn,θnW(¯
Xn
k)¯
X0
0i6A1W(¯
X0
0),
EW(X0
0)<+∞,EW(¯
X0
0)<+∞,
with W=Wmwith m∈N∗and ¯γ < 2/(m+L)if H2holds and W=Wαwith α < κη/4and
¯γ < 2/Lif H3holds, see (19).
Proof. Combining [17, Lemma S15] and Lemma 22 if H2holds or Lemma 23 if H3holds conclude
the proof.
Lemma 27. Assume H1and H2or H3. We have supθ∈Θ{πθ(W)+ ¯πθ(W)}<+∞, with W=Wm
with m∈N∗if H2holds and W=Wαwith α < η if H3holds, see (19).
Proof. We only show that supθπθ(W)<+∞since the proof for ¯πθis similar. Let m∈N∗,α < η
and θ∈ΘThe proof is divided into two parts.
(a) If H2holds then using H1-(b) we have
ZRd
(1 + kxk2m) exp [−Uθ(x)−Vθ(x)] dx6ZRd
(1 + kxk2m) exp [−Vθ(x)] dx
6ZRd
(1 + kxk2m) exp h−Vθ(x⋆
θ)−mkx−x⋆
θk2/2idx
6exp RV,3+mR2
V,1/2ZRd
(1 + kxk2m) exp hmRV,1kxk − mkxk2/2idx .
Hence using H1-(a) we have
sup
θ∈Θ
πθ(W)6exp RV,3+mR2
V,1/2ZRd
(1 + kxk2m) exp hmRV,1kxk − mkxk2/2idx
inf
θ∈ΘZRd
exp [−Uθ(x)−Vθ(x)] dx<+∞.
(b) if H3holds then we have
ZRd
exp [αφ(x)] exp [−Uθ(x)−Vθ(x)] dx6ZRd
exp [αφ(x)] exp [−Uθ(x)] dx
6ecZRd
exp [α(1 + kxk)] exp [−ηkxk] dx .
Since α < η we have using H1-(a)
sup
θ∈Θ
πθ(W)6ecZRd
exp [α(1 + kxk)] exp [−ηkxk] dx
inf
θ∈ΘZRd
exp [−Uθ(x)−Vθ(x)] dx<+∞,
which concludes the proof.
Theorem 28. Assume H1and H2or H3. Let ¯κ>1>κ > 1/2. Let ¯γ < 2/(m+L)if H2holds
and ¯γ < 2/Lif H3holds. Then for any θ∈Θ,κ∈[κ, ¯κ]and γ∈(0,¯γ]we have
max kπ♯
γ,θ −πθkW1/2,k¯π♯
γ,θ −¯πθkW1/26˜
Ψ(γ),
where for any θ∈Θand γ∈(0,¯γ],π♯
γ,θ , respectively ¯π♯
γ,θ , is the invariant probability measure of
Sγ,θ , respectively ¯
Sγ,θ , given by (18)and associated with κ= 1. In addition, for any γ∈(0,¯γ]
˜
Ψ(γ) = √2{bλ−¯γ/log(1/λ) + sup
θ∈Θ
πθ(W) + sup
θ∈Θ
¯πθ(W)}1/2(Ld+M2)1/2√γ ,
and where W=Wmwith m∈N∗and ¯γ, λ, b are given in Lemma 22 if H2holds and W=Wα
with α < min(κη/4, η)and ¯γ , λ, b are given in Lemma 23 if H3holds, see (19).
Proof. We only show that for any θ∈Θ,κ∈[κ, ¯κ]and γ∈(0,¯γ],kπ♯
γ,θ −πθkW1/26˜
Ψ(γ), since
the proof of k˜π♯
γ,θ −˜πθkW1/26˜
Ψ(γ)is similar. Let θ∈Θ,κ∈[κ,¯κ],γ∈(0,¯γ]and x∈RdUsing
Theorem 25 we obtain that (δxSn
γ,θ )n∈N, with κ= 1, is weakly convergent towards π♯
γ,θ . Using that
µ7→ KL (µ|πθ)is lower semi-continuous for any θ∈Θ, see [19, Lemma 1.4.3b], and [21, Corollary
18] we get that
KL π♯
γ,θ |πθ6lim inf
n→+∞KL n−1
n
X
k=1
δxSk
γ,θ
πθ!6γ(Ld+M2).
Using a generalized Pinsker inequality, see [22, Lemma 24], Lemma 27 and Lemma 22 if H2holds
or Lemma 23 if H3holds, we get that
kπ♯
γ,θ −πθkW1/26√2(π♯
γ,θ (W) + πθ(W))1/2KL π♯
γ,θ |πθ1/2
6√2{bλ−¯γ/log(1/λ) + sup
θ∈Θ
πθ(W)}1/2(Ld+M2)1/2γ1/2,
which concludes the proof.
Lemma 29. Assume H1and H2or H3. Let ¯κ>1>κ > 1/2. Let ¯γ < 2/(m+L)if H2holds
and ¯γ < 2/Lif H3holds. Then there exists ¯
B3>0such that for any θ∈Θ,γ∈(0,¯γ],x∈Rd
and κi∈[κ, ¯κ]with i∈ {1,2}we have
max kδxS⌈1/γ⌉
1,γ,θ −δxS⌈1/γ ⌉
2,γ,θ kW1/2,kδx¯
S⌈1/γ⌉
1,γ,θ −δx¯
S⌈1/γ⌉
2,γ,θ kW1/26¯
B3γ|κ1−κ2|W1/2(x).
where for any i∈ {1,2},θ∈Θand γ∈(0,¯γ],Si,γ,θ is given by (18)and associated with κ←κi,
and W=Wmwith m∈N∗if H2holds. In addition, W=Wαwith α < min(κη/4, η)if H3
holds, see (19).
Proof. We only show that for any θ ∈ Θ, γ ∈ (0, γ̄], x ∈ R^d and κ_i ∈ [κ, κ̄] with i ∈ {1,2} we have ‖δ_x S^{⌈1/γ⌉}_{1,γ,θ} − δ_x S^{⌈1/γ⌉}_{2,γ,θ}‖_{W^{1/2}} ≤ B̄_3 γ |κ_1 − κ_2| W^{1/2}(x), since the proof for S̄_{1,γ,θ} and S̄_{2,γ,θ} is similar. Let θ ∈ Θ, γ ∈ (0, γ̄], x ∈ R^d and κ_i ∈ [κ, κ̄] with i ∈ {1,2}. Using a generalized Pinsker inequality, see [22, Lemma 24], we have
‖δ_x S^{⌈1/γ⌉}_{1,γ,θ} − δ_x S^{⌈1/γ⌉}_{2,γ,θ}‖_{W^{1/2}} ≤ √2 ( S^{⌈1/γ⌉}_{1,γ,θ} W(x) + S^{⌈1/γ⌉}_{2,γ,θ} W(x) )^{1/2} KL( δ_x S^{⌈1/γ⌉}_{1,γ,θ} | δ_x S^{⌈1/γ⌉}_{2,γ,θ} )^{1/2} .   (30)
Using [30, Lemma 4.1] we get that KL( δ_x S^{⌈1/γ⌉}_{1,γ,θ} | δ_x S^{⌈1/γ⌉}_{2,γ,θ} ) ≤ KL( μ̃_1 | μ̃_2 ), where, setting T = γ⌈1/γ⌉, μ̃_i, i ∈ {1,2}, is the probability measure over B(C([0,T],R^d)) which is defined for any A ∈ B(C([0,T],R^d)) by μ̃_i(A) = P((X^i_t)_{t∈[0,T]} ∈ A), i ∈ {1,2}, and for any t ∈ [0,T]
dX^i_t = b_i(t, (X^i_s)_{s∈[0,T]}) dt + √2 dB_t ,   X^i_0 = x ,
with for any (ω_s)_{s∈[0,T]} ∈ C([0,T],R^d) and t ∈ [0,T]
b_i(t, (ω_s)_{s∈[0,T]}) = Σ_{p∈N} 1_{[pγ,(p+1)γ)}(t) T_{γ,θ}( prox^{γκ_i}_{U_θ}(ω_{pγ}) ) ,
where for any y ∈ R^d, T_{γ,θ}(y) = y − γ∇_x V_θ(y). Since (X^i_t)_{t∈[0,T]} ∈ C([0,T],R^d), and b_1 and b_2 are continuous, [32, Theorem 7.19] applies and we obtain that μ̃_1 ≪ μ̃_2 and
(dμ̃_1/dμ̃_2)((X^1_t)_{t∈[0,T]}) = exp( (1/4) ∫_0^T ‖ b_1(t,(X^1_s)_{s∈[0,T]}) − b_2(t,(X^1_s)_{s∈[0,T]}) ‖² dt + (1/2) ∫_0^T ⟨ b_1(t,(X^1_s)_{s∈[0,T]}) − b_2(t,(X^1_s)_{s∈[0,T]}) , dX^1_t ⟩ ) ,
where the equality holds almost surely. As a consequence we obtain that
KL( μ̃_1 | μ̃_2 ) = (1/4) E[ ∫_0^T ‖ b_1(t,(X^1_s)_{s∈[0,T]}) − b_2(t,(X^1_s)_{s∈[0,T]}) ‖² ds ] .   (31)
In addition, using Lemma 11, we have for any (ω_s)_{s∈[0,T]} ∈ C([0,T],R^d) and t ∈ [0,T]
‖ b_1(t,(ω_s)_{s∈[0,T]}) − b_2(t,(ω_s)_{s∈[0,T]}) ‖² = ‖ T_{γ,θ}( prox^{γκ_1}_{U_θ}(ω_{γ⌊t/γ⌋}) ) − T_{γ,θ}( prox^{γκ_2}_{U_θ}(ω_{γ⌊t/γ⌋}) ) ‖²
 ≤ ‖ prox^{γκ_1}_{U_θ}(ω_{γ⌊t/γ⌋}) − prox^{γκ_2}_{U_θ}(ω_{γ⌊t/γ⌋}) ‖² ≤ 4γ²(κ_1 − κ_2)²M² .   (32)
Combining this result and (31) we get that
KL( δ_x S^{⌈1/γ⌉}_{1,γ,θ} | δ_x S^{⌈1/γ⌉}_{2,γ,θ} ) ≤ (1 + γ̄) M² γ² |κ_1 − κ_2|² .   (33)
Combining (33) and (30) we get that
‖δ_x S^{⌈1/γ⌉}_{1,γ,θ} − δ_x S^{⌈1/γ⌉}_{2,γ,θ}‖_{W^{1/2}} ≤ 2^{1/2} (1 + γ̄)^{1/2} M ( S^{⌈1/γ⌉}_{1,γ,θ} W(x) + S^{⌈1/γ⌉}_{2,γ,θ} W(x) )^{1/2} γ |κ_1 − κ_2| .
We conclude the proof upon using Lemma 8, and Lemma 22 if H2 holds, or Lemma 23 if H3 holds.
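The quantitative ingredient in (32) is the Lipschitz dependence of the proximal map on its step parameter γκ provided by Lemma 11. Below is a minimal numerical sketch of this effect, assuming the toy potential U_θ(x) = θ‖x‖_1 (so that prox^λ_{U_θ} is coordinatewise soft-thresholding at level λθ) and taking M = θ√d as a crude bound on the subgradients of U_θ; these choices are illustrative only and are not the setting of the paper.

import numpy as np

rng = np.random.default_rng(0)
d, theta, gamma = 50, 2.0, 1e-2
kappa1, kappa2 = 0.8, 1.0

def prox_l1(x, lam):
    # proximal map of lam * ||.||_1, i.e. soft-thresholding at level lam
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = rng.normal(size=d)
p1 = prox_l1(x, gamma * kappa1 * theta)   # prox^{gamma kappa1}_{U_theta}(x)
p2 = prox_l1(x, gamma * kappa2 * theta)   # prox^{gamma kappa2}_{U_theta}(x)

M = theta * np.sqrt(d)                    # crude bound on the subgradients of U_theta
lhs = np.linalg.norm(p1 - p2)
rhs = 2.0 * M * gamma * abs(kappa1 - kappa2)
print(f"||prox_1 - prox_2|| = {lhs:.3e} <= 2 M gamma |k1 - k2| = {rhs:.3e}: {lhs <= rhs}")

Since T_{γ,θ} is non-expansive for the step sizes considered, the same O(γ|κ_1 − κ_2|) gap carries over to the drifts b_1 and b_2, which is exactly what (32) records.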
Proposition 30. Assume H1 and H2 or H3. Let κ̄ ≥ 1 ≥ κ > 1/2. Let γ̄ < 2/(m+L) if H2 holds and γ̄ < 2/L if H3 holds. Then there exists B_3 > 0 such that for any θ ∈ Θ, γ ∈ (0, γ̄] and κ_i ∈ [κ, κ̄] with i ∈ {1,2} we have
max( ‖π^1_{γ,θ} − π^2_{γ,θ}‖_{W^{1/2}} , ‖π̄^1_{γ,θ} − π̄^2_{γ,θ}‖_{W^{1/2}} ) ≤ B_3 γ |κ_1 − κ_2| ,
where for any i ∈ {1,2}, θ ∈ Θ and γ ∈ (0, γ̄], π^i_{γ,θ}, respectively π̄^i_{γ,θ}, is the invariant probability measure of S_{i,γ,θ}, respectively S̄_{i,γ,θ}, given by (18) and associated with κ ← κ_i. In addition, W = W_m with m ∈ N* if H2 holds and W = W_α with α < min(κη/4, η) if H3 holds, see (19).
Proof. We only show that for any θ ∈ Θ, γ ∈ (0, γ̄] and κ_i ∈ [κ, κ̄] with i ∈ {1,2}, ‖π^1_{γ,θ} − π^2_{γ,θ}‖_{W^{1/2}} ≤ B_3 γ |κ_2 − κ_1|, since the proof for π̄^1_{γ,θ} and π̄^2_{γ,θ} is similar. Let θ ∈ Θ, γ ∈ (0, γ̄], x ∈ R^d and κ_i ∈ [κ, κ̄] with i ∈ {1,2}. Using Theorem 25 we have
lim_{n→+∞} ‖δ_x S^n_{1,γ,θ} − δ_x S^n_{2,γ,θ}‖_{W^{1/2}} = ‖π^1_{γ,θ} − π^2_{γ,θ}‖_{W^{1/2}} .
Let n = q⌈1/γ⌉. Using Theorem 25 with a = 1/2, that W^{1/2}(x) ≤ W(x) for any x ∈ R^d, Lemma 29, Lemma 8 and Lemma 22 if H2 holds or Lemma 23 if H3 holds, we have
‖δ_x S^n_{1,γ,θ} − δ_x S^n_{2,γ,θ}‖_{W^{1/2}} ≤ Σ_{k=0}^{q−1} ‖ δ_x S^{(k+1)⌈1/γ⌉}_{1,γ,θ} S^{(q−k−1)⌈1/γ⌉}_{2,γ,θ} − δ_x S^{k⌈1/γ⌉}_{1,γ,θ} S^{(q−k)⌈1/γ⌉}_{2,γ,θ} ‖_{W^{1/2}}
 ≤ Σ_{k=0}^{q−1} A_{2,1/2} ρ^{q−k−1}_{1/2} ‖ δ_x S^{k⌈1/γ⌉}_{1,γ,θ} { S^{⌈1/γ⌉}_{1,γ,θ} − S^{⌈1/γ⌉}_{2,γ,θ} } ‖_{W^{1/2}}
 ≤ A_{2,1/2} Σ_{k=0}^{q−1} ρ^{q−k−1}_{1/2} B̄_3 γ |κ_1 − κ_2| δ_x S^{k⌈1/γ⌉}_{1,γ,θ} W(x)
 ≤ A_{2,1/2} Σ_{k=0}^{q−1} ρ^{q−k−1}_{1/2} B̄_3 γ |κ_1 − κ_2| ( 1 + b λ^{−γ̄}/log(1/λ) ) W(x)
 ≤ A_{2,1/2} B̄_3 ( 1 + b λ^{−γ̄}/log(1/λ) ) / (1 − ρ_{1/2}) |κ_1 − κ_2| γ W(x) ,
which concludes the proof with B_3 = 2 A_{2,1/2} B̄_3 ( 1 + b λ^{−γ̄}/log(1/λ) ) / (1 − ρ_{1/2}) κ, upon setting x = 0 and letting n → +∞.
Corollary 31. Assume H1 and H2 or H3. Let κ̄ ≥ 1 ≥ κ > 1/2. Let γ̄ < 2/(m+L) if H2 holds and γ̄ < 2/L if H3 holds. Then for any κ ∈ [κ, κ̄], θ ∈ Θ and γ ∈ (0, γ̄], we have
max( ‖π_{γ,θ} − π_θ‖_{W^{1/2}} , ‖π̄_{γ,θ} − π̄_θ‖_{W^{1/2}} ) ≤ Ψ(γ) ,
where for any γ ∈ (0, γ̄], π_{γ,θ} is the invariant probability measure of S_{γ,θ} given by (18). In addition, Ψ(γ) = Ψ̃(γ) + B_3 γ |κ − 1|, where Ψ̃ is given in Theorem 28 and B_3 in Proposition 30, and W = W_m with m ∈ N* if H2 holds and W = W_α with α < min(κη/4, η) if H3 holds, see (19).
Proof. We only show that for any θ ∈ Θ and γ ∈ (0, γ̄] we have ‖π_{γ,θ} − π_θ‖_{W^{1/2}} ≤ Ψ(γ), since the proof for π̄_{γ,θ} and π̄_θ is similar. Let κ ∈ [κ, κ̄], θ ∈ Θ, γ ∈ (0, γ̄]. The proof is a direct application of Theorem 28 and Proposition 30 upon noticing that
‖π_{γ,θ} − π_θ‖_{W^{1/2}} ≤ ‖π_{γ,θ} − π^♯_{γ,θ}‖_{W^{1/2}} + ‖π^♯_{γ,θ} − π_θ‖_{W^{1/2}} ,
where π^♯_{γ,θ} is the invariant probability measure of S_{γ,θ} given by (18) and associated with κ = 1.
Proposition 32. Assume H1 and H2 or H3. Let κ̄ ≥ 1 ≥ κ > 1/2. Let γ̄ < 2/(m+L) if H2 holds and γ̄ < 2/L if H3 holds. Then there exists A_4 > 0 such that for any κ ∈ [κ, κ̄], θ_1, θ_2 ∈ Θ, γ_1, γ_2 ∈ (0, γ̄] with γ_2 < γ_1, a ∈ [1/4, 1/2] and x ∈ R^d,
max( ‖δ_x S_{γ_1,θ_1} − δ_x S_{γ_2,θ_2}‖_{W^a} , ‖δ_x S̄_{γ_1,θ_1} − δ_x S̄_{γ_2,θ_2}‖_{W^a} ) ≤ ( Λ_1(γ_1, γ_2) + Λ_2(γ_1, γ_2) ‖θ_1 − θ_2‖ ) W^{2a}(x) ,
with
Λ_1(γ_1, γ_2) = A_4 (γ_1/γ_2 − 1) ,   Λ_2(γ_1, γ_2) = A_4 γ_2^{1/2} ,
and where W = W_m with m ∈ N and m ≥ 2 if H2 is satisfied and W = W_α with α < min(κη/4, η) if H3 is satisfied, see (19).
Proof. We only show that for any κ ∈ [κ, κ̄], θ_1, θ_2 ∈ Θ, γ_1, γ_2 ∈ (0, γ̄] with γ_2 < γ_1, a ∈ [1/4, 1/2] and x ∈ R^d we have ‖δ_x S_{γ_1,θ_1} − δ_x S_{γ_2,θ_2}‖_{W^a} ≤ ( Λ_1(γ_1, γ_2) + Λ_2(γ_1, γ_2) ‖θ_1 − θ_2‖ ) W^{2a}(x), since the proof for S̄_{γ_1,θ_1} and S̄_{γ_2,θ_2} is similar. Let a ∈ [1/4, 1/2], κ ∈ [κ, κ̄], θ_1, θ_2 ∈ Θ, γ_1, γ_2 ∈ (0, γ̄] with γ_2 < γ_1. Using a generalized Pinsker inequality, see [22, Lemma 24], we have
‖δ_x S_{γ_1,θ_1} − δ_x S_{γ_2,θ_2}‖_{W^a} ≤ √2 ( δ_x S_{γ_1,θ_1} W^{2a}(x) + δ_x S_{γ_2,θ_2} W^{2a}(x) )^{1/2} KL( δ_x S_{γ_1,θ_1} | δ_x S_{γ_2,θ_2} )^{1/2} .
Combining this result, Jensen's inequality and Lemma 22 if H2 holds and Lemma 23 if H3 holds, we obtain that
‖δ_x S_{γ_1,θ_1} − δ_x S_{γ_2,θ_2}‖_{W^a} ≤ 2 (1 + bγ̄)^{1/2} { KL( δ_x S_{γ_1,θ_1} | δ_x S_{γ_2,θ_2} ) }^{1/2} W^a(x) .
Denote for υ ∈ R^d and σ > 0 by Υ_{υ,σ} the d-dimensional Gaussian distribution with mean υ and covariance matrix σ²Id. Using Lemma 17 and the fact that γ_1 ≥ γ_2 we have
KL( δ_x S_{γ_1,θ_1} | δ_x S_{γ_2,θ_2} ) ≤ d (γ_1/γ_2 − 1)²/2 + ‖ T_{γ_1,θ_1}( prox^{γ_1κ}_{U_{θ_1}}(x) ) − T_{γ_2,θ_2}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖² / (4γ_2) ,   (34)
with T_{γ,θ}(z) = z − γ∇_x V_θ(z) for any θ ∈ Θ, γ ∈ (0, γ̄] and z ∈ R^d. We have
(1/4) ‖ T_{γ_1,θ_1}( prox^{γ_1κ}_{U_{θ_1}}(x) ) − T_{γ_2,θ_2}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖²   (35)
 ≤ ‖ T_{γ_1,θ_1}( prox^{γ_1κ}_{U_{θ_1}}(x) ) − T_{γ_1,θ_1}( prox^{γ_2κ}_{U_{θ_1}}(x) ) ‖² + ‖ T_{γ_1,θ_1}( prox^{γ_2κ}_{U_{θ_1}}(x) ) − T_{γ_1,θ_1}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖²
 + ‖ T_{γ_1,θ_1}( prox^{γ_2κ}_{U_{θ_2}}(x) ) − T_{γ_2,θ_1}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖² + ‖ T_{γ_2,θ_1}( prox^{γ_2κ}_{U_{θ_2}}(x) ) − T_{γ_2,θ_2}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖² .
First, using H1, [36, Theorem 2.1.5, Equation (2.1.8)] and Lemma 11 we have
‖ T_{γ_1,θ_1}( prox^{γ_1κ}_{U_{θ_1}}(x) ) − T_{γ_1,θ_1}( prox^{γ_2κ}_{U_{θ_1}}(x) ) ‖ ≤ ‖ prox^{γ_1κ}_{U_{θ_1}}(x) − prox^{γ_2κ}_{U_{θ_1}}(x) ‖ ≤ 2M |γ_1κ − γ_2κ| .   (36)
Second, using (9), H1, [36, Theorem 2.1.5, Equation (2.1.8)] and H4 we have
‖ T_{γ_1,θ_1}( prox^{γ_2κ}_{U_{θ_1}}(x) ) − T_{γ_1,θ_1}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖ ≤ γ_2κ ‖ ∇_x U^{γ_2κ}_{θ_1}(x) − ∇_x U^{γ_2κ}_{θ_2}(x) ‖ ≤ sup_{t∈[0,γ̄κ]} {f_θ(t)} γ_2κ ‖θ_1 − θ_2‖ (1 + ‖x‖) .   (37)
Third, using H1 and Lemma 9 we have that
‖ T_{γ_1,θ_1}( prox^{γ_2κ}_{U_{θ_2}}(x) ) − T_{γ_2,θ_1}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖ ≤ (γ_1 − γ_2) ‖ ∇_x V_{θ_1}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖   (38)
 ≤ (γ_1 − γ_2) L ‖ prox^{γ_2κ}_{U_{θ_2}}(x) − x*_{θ_1} ‖ ≤ (γ_1 − γ_2) L ( R_{V,1} + γ̄κM + ‖x‖ ) .
Finally, using H1, H4 and Lemma 9 we have that
‖ T_{γ_2,θ_1}( prox^{γ_2κ}_{U_{θ_2}}(x) ) − T_{γ_2,θ_2}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖ ≤ γ_2 ‖ ∇_x V_{θ_1}( prox^{γ_2κ}_{U_{θ_2}}(x) ) − ∇_x V_{θ_2}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖   (39)
 ≤ γ_2 M_Θ ‖θ_1 − θ_2‖ ( 1 + ‖ prox^{γ_2κ}_{U_{θ_2}}(x) ‖ ) ≤ γ_2 M_Θ ‖θ_1 − θ_2‖ ( 1 + γ̄κM + ‖x‖ ) .
Therefore, combining (36), (37), (38) and (39) in (35), there exists A_{4,1} > 0 such that for any γ_1, γ_2 > 0 with γ_2 < γ_1 and θ_1, θ_2 ∈ Θ,
‖ T_{γ_1,θ_1}( prox^{γ_1κ}_{U_{θ_1}}(x) ) − T_{γ_2,θ_2}( prox^{γ_2κ}_{U_{θ_2}}(x) ) ‖² ≤ A_{4,1} [ (γ_1 − γ_2)² + γ_2² ‖θ_1 − θ_2‖² ] W^{2a}(x) .
Using this result in (34), there exists A_{4,2} > 0 such that
KL( δ_x S_{γ_1,θ_1} | δ_x S_{γ_2,θ_2} ) ≤ A_{4,2} [ (γ_1/γ_2 − 1)² + γ_2 ‖θ_1 − θ_2‖² ] W^{2a}(x) ,
which implies the announced result upon setting A_4 = 2 A_{4,2}^{1/2} (1 + bγ̄)^{1/2} and using that for any u, v ≥ 0, √(u+v) ≤ √u + √v .
5.5 Checking [17, H1, H2] for MYULA
In this section, similarly to the previous section for PULA, we show that [17, H1, H2] hold for MYULA.
Lemma 33. Assume H1, H2 or H3, and let (X^n_k, X̄^n_k)_{n∈N, k∈{0,...,m_n}} be given by (5) with {(K_{γ,θ}, K̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ} = {(R_{γ,θ}, R̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ} and κ ∈ [κ, κ̄] with κ̄ ≥ 1 ≥ κ > 1/2. Then there exists Ā_1 ≥ 1 such that for any n, p ∈ N and k ∈ {0,...,m_n}
E[ R^p_{γ_n,θ_n} W(X^n_k) | X^0_0 ] ≤ Ā_1 W(X^0_0) ,    E[ R̄^p_{γ_n,θ_n} W(X̄^n_k) | X̄^0_0 ] ≤ Ā_1 W(X̄^0_0) ,
E[ W(X^0_0) ] < +∞ ,    E[ W(X̄^0_0) ] < +∞ ,
with W = W_m with m ∈ N* and γ̄ < 2/(m+L) if H2 holds, and W = W_α with α < min(κη/4, η/8) and γ̄ < min{2/L, η/(2ML)} if H3 holds, see (19).
Proposition 34. Assume H1 and H2 or H3. Let κ̄ ≥ 1 ≥ κ > 1/2. Let γ̄ < min{(2 − 1/κ)/L, 2/(m+L)} if H2 holds and γ̄ < min{(2 − 1/κ)/L, η/(2ML)} if H3 holds. Then there exists B̄_{3,1} > 0 such that for any θ ∈ Θ, κ_i ∈ [κ, κ̄] and γ ∈ (0, γ̄],
max( ‖π^1_{γ,θ} − π^2_{γ,θ}‖_{W^{1/2}} , ‖π̄^1_{γ,θ} − π̄^2_{γ,θ}‖_{W^{1/2}} ) ≤ B̄_{3,1} γ ,
where for any i ∈ {1,2}, θ ∈ Θ and γ ∈ (0, γ̄], π^i_{γ,θ}, respectively π̄^i_{γ,θ}, is the invariant probability measure of R_{i,γ,θ}, respectively R̄_{i,γ,θ}, given by (17) and associated with κ ← κ_i. In addition, W = W_m with m ∈ N* if H2 holds and W = W_α with α < min(κη/4, η/8) if H3 holds, see (19).
Proof. The proof is similar to the one of Proposition 30 upon setting, for any i ∈ {1,2} and (ω_s)_{s∈[0,T]} ∈ C([0,T],R^d) with T = γ⌈1/γ⌉,
b_i(t, (ω_s)_{s∈[0,T]}) = ω_{⌊t/γ⌋γ} − γ∇_x V_θ(ω_{⌊t/γ⌋γ}) − γ∇_x U^{γκ_i}_θ(ω_{⌊t/γ⌋γ}) ,
and replacing (32) in Lemma 29 by
‖ b_1(t, (ω_s)_{s∈[0,T]}) − b_2(t, (ω_s)_{s∈[0,T]}) ‖² = ‖ −γ∇_x U^{γκ_1}_θ(ω_{⌊t/γ⌋γ}) + γ∇_x U^{γκ_2}_θ(ω_{⌊t/γ⌋γ}) ‖² ≤ 4γ²M² .
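The drift b_i above has the form of the MYULA update: a gradient step on the smooth term V_θ plus a gradient step on the Moreau–Yosida envelope U^{γκ}_θ, whose gradient is ∇_x U^λ_θ(x) = (x − prox^λ_{U_θ}(x))/λ. A minimal sketch of the corresponding Markov kernel follows, under the hypothetical toy choices U_θ(x) = θ‖x‖_1 and V_θ(x) = ‖x − y‖²/2 (the precise kernel R_{γ,θ} is the one defined in (17) earlier in the paper).

import numpy as np

rng = np.random.default_rng(1)
d, theta, gamma, kappa = 20, 1.5, 1e-2, 1.0
y = rng.normal(size=d)

def prox_l1(x, lam):
    # soft-thresholding: proximal map of lam * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def grad_V(x):
    return x - y                      # V_theta(x) = ||x - y||^2 / 2 (toy likelihood)

def grad_moreau_U(x, lam):
    # gradient of the Moreau-Yosida envelope of U_theta with parameter lam
    return (x - prox_l1(x, lam * theta)) / lam

def myula_step(x):
    lam = gamma * kappa
    drift = -gamma * grad_V(x) - gamma * grad_moreau_U(x, lam)
    return x + drift + np.sqrt(2.0 * gamma) * rng.normal(size=d)

x = np.zeros(d)
for _ in range(1000):
    x = myula_step(x)
print("sample after 1000 MYULA steps, first 3 coordinates:", np.round(x[:3], 3))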
Proposition 35. Assume H1 and H2 or H3. Let κ̄ ≥ 1 ≥ κ > 1/2. Let γ̄ < min{(2 − 1/κ)/L, 2/(m+L), L^{-1}} if H2 holds and γ̄ < min{(2 − 1/κ)/L, η/(2ML), L^{-1}} if H3 holds. Then there exists B̄_{3,2} > 0 such that for any θ ∈ Θ, γ ∈ (0, γ̄] and κ_i ∈ [κ, κ̄] with i ∈ {1,2} we have
max( ‖π^♭_{γ,θ} − π^♯_{γ,θ}‖_{W^{1/2}} , ‖π̄^♭_{γ,θ} − π̄^♯_{γ,θ}‖_{W^{1/2}} ) ≤ B̄_{3,2} γ² ,
where for any θ ∈ Θ and γ ∈ (0, γ̄], π^♭_{γ,θ}, respectively π̄^♭_{γ,θ}, is the invariant probability measure of R_{γ,θ}, respectively R̄_{γ,θ}, given by (17) and associated with κ = 1, and π^♯_{γ,θ}, respectively π̄^♯_{γ,θ}, is the invariant probability measure of S_{γ,θ}, respectively S̄_{γ,θ}, given by (18) and associated with κ = 1. In addition, W = W_m with m ∈ N* if H2 holds and W = W_α with α < min(κη/4, η/8) if H3 holds, see (19).
Proof. The proof is similar to the one of Proposition 30 upon setting, for any (ω_s)_{s∈[0,T]} ∈ C([0,T],R^d) with T = γ⌈1/γ⌉,
b_1(t, (ω_s)_{s∈[0,T]}) = prox^γ_{U_θ}(ω_{⌊t/γ⌋γ}) − γ∇_x V_θ( prox^γ_{U_θ}(ω_{⌊t/γ⌋γ}) ) ,
b_2(t, (ω_s)_{s∈[0,T]}) = ω_{⌊t/γ⌋γ} − γ∇_x V_θ(ω_{⌊t/γ⌋γ}) − γ∇_x U^γ_θ(ω_{⌊t/γ⌋γ}) ,
and replacing (32) in Lemma 29 by the following bound, obtained using (9) and Lemma 9:
‖ b_1(t, (ω_s)_{s∈[0,T]}) − b_2(t, (ω_s)_{s∈[0,T]}) ‖²
 = ‖ prox^γ_{U_θ}(ω_{⌊t/γ⌋γ}) − γ∇_x V_θ( prox^γ_{U_θ}(ω_{⌊t/γ⌋γ}) ) − ω_{⌊t/γ⌋γ} + γ∇_x V_θ(ω_{⌊t/γ⌋γ}) + γ ( ω_{⌊t/γ⌋γ} − prox^γ_{U_θ}(ω_{⌊t/γ⌋γ}) )/γ ‖²
 = γ² ‖ ∇_x V_θ( prox^γ_{U_θ}(ω_{⌊t/γ⌋γ}) ) − ∇_x V_θ(ω_{⌊t/γ⌋γ}) ‖² ≤ L²M²γ⁴ .
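The computation above shows that the MYULA drift (with κ = 1) and the proximal drift differ only through ∇_x V_θ evaluated at the current point versus at its proximal image, a discrepancy of order γ²; this is what yields the O(γ²) bound of Proposition 35. Below is a small numerical check of this scaling, under the same hypothetical toy model as in the previous sketch (U_θ(x) = θ‖x‖_1 and V_θ(x) = ‖x − y‖²/2, so L = 1 and M = θ√d is a crude subgradient bound); the choices are illustrative only.

import numpy as np

rng = np.random.default_rng(2)
d, theta = 30, 1.5
y = rng.normal(size=d)
x = rng.normal(size=d)
L = 1.0                               # Lipschitz constant of grad V for V(x) = ||x - y||^2 / 2
M = theta * np.sqrt(d)                # crude bound on the subgradients of U_theta

def prox_l1(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def grad_V(z):
    return z - y

for gamma in (1e-1, 1e-2, 1e-3):
    p = prox_l1(x, gamma * theta)
    b1 = p - gamma * grad_V(p)            # proximal drift
    b2 = x - gamma * grad_V(x) - (x - p)  # MYULA drift with kappa = 1
    gap = np.linalg.norm(b1 - b2)
    print(f"gamma = {gamma:.0e}: ||b1 - b2|| = {gap:.2e} <= L*M*gamma^2 = {L*M*gamma**2:.2e}")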
Proposition 36. Assume H1 and H2 or H3. Let κ̄ ≥ 1 ≥ κ > 1/2. Let γ̄ < min{(2 − 1/κ)/L, 2/(m+L), L^{-1}} if H2 holds and γ̄ < min{(2 − 1/κ)/L, η/(2ML), L^{-1}} if H3 holds. Then for any θ ∈ Θ, κ ∈ [κ, κ̄] and γ ∈ (0, γ̄], we have
max( ‖π_{γ,θ} − π_θ‖_{W^{1/2}} , ‖π̄_{γ,θ} − π̄_θ‖_{W^{1/2}} ) ≤ Ψ̄(γ) ,
where for any θ ∈ Θ and γ ∈ (0, γ̄], π_{γ,θ}, respectively π̄_{γ,θ}, is the invariant probability measure of R_{γ,θ}, respectively R̄_{γ,θ}, given by (17). In addition, Ψ̄(γ) = Ψ̃(γ) + B̄_{3,1} γ + B̄_{3,2} γ², where Ψ̃ is given in Theorem 28, B̄_{3,1} in Proposition 34 and B̄_{3,2} in Proposition 35, and W = W_m with m ∈ N* if H2 holds and W = W_α with α < min(κη/4, η/8) if H3 holds, see (19).
Proof. We only show that for any θ ∈ Θ and γ ∈ (0, γ̄], ‖π_{γ,θ} − π_θ‖_{W^{1/2}} ≤ Ψ̄(γ), as the proof for π̄_{γ,θ} and π̄_θ is similar. First note that for any θ ∈ Θ, κ ∈ [κ, κ̄] and γ ∈ (0, γ̄] we have
‖π_{γ,θ} − π_θ‖_{W^{1/2}} ≤ ‖π_{γ,θ} − π^♭_{γ,θ}‖_{W^{1/2}} + ‖π^♭_{γ,θ} − π^♯_{γ,θ}‖_{W^{1/2}} + ‖π^♯_{γ,θ} − π_θ‖_{W^{1/2}} ,
where for any θ ∈ Θ and γ ∈ (0, γ̄], π^♭_{γ,θ} is the invariant probability measure of R_{γ,θ} given by (17) and associated with κ = 1, and π^♯_{γ,θ} is the invariant probability measure of S_{γ,θ} and associated with κ = 1. We conclude the proof upon combining Proposition 34, Proposition 35 and Theorem 28.
Proposition 37. Assume H1 and H2 or H3. Let κ̄ ≥ 1 ≥ κ > 1/2. Let γ̄ < min{(2 − 1/κ)/L, 2/(m+L)} if H2 holds and γ̄ < min{(2 − 1/κ)/L, η/(2ML)} if H3 holds. Then there exists Ā_4 > 0 such that for any θ_1, θ_2 ∈ Θ, κ ∈ [κ, κ̄], γ_1, γ_2 ∈ (0, γ̄] with γ_2 < γ_1, a ∈ [1/4, 1/2] and x ∈ R^d,
max( ‖δ_x R_{γ_1,θ_1} − δ_x R_{γ_2,θ_2}‖_{W^a} , ‖δ_x R̄_{γ_1,θ_1} − δ_x R̄_{γ_2,θ_2}‖_{W^a} ) ≤ ( Λ̄_1(γ_1, γ_2) + Λ̄_2(γ_1, γ_2) ‖θ_1 − θ_2‖ ) W^{2a}(x) ,
with
Λ̄_1(γ_1, γ_2) = Ā_4 (γ_1/γ_2 − 1) ,   Λ̄_2(γ_1, γ_2) = Ā_4 γ_2^{1/2} ,
and where W = W_m with m ∈ N and m ≥ 2 if H2 is satisfied and W = W_α with α < min(κη/4, η/8) if H3 is satisfied, see (19).
Proof. First, note that we only show that for any θ_1, θ_2 ∈ Θ, κ ∈ [κ, κ̄], γ_1, γ_2 ∈ (0, γ̄] with γ_2 < γ_1, a ∈ [1/4, 1/2] and x ∈ R^d, we have ‖δ_x R_{γ_1,θ_1} − δ_x R_{γ_2,θ_2}‖_{W^a} ≤ ( Λ̄_1(γ_1, γ_2) + Λ̄_2(γ_1, γ_2) ‖θ_1 − θ_2‖ ) W^{2a}(x), since the proof for R̄_{γ_1,θ_1} and R̄_{γ_2,θ_2} is similar. Let a ∈ [1/4, 1/2], θ_1, θ_2 ∈ Θ, κ ∈ [κ, κ̄], γ_1, γ_2 ∈ (0, γ̄] with γ_2 < γ_1. Using a generalized Pinsker inequality [22, Lemma 24] we have
‖δ_x R_{γ_1,θ_1} − δ_x R_{γ_2,θ_2}‖_{W^a} ≤ √2 ( δ_x R_{γ_1,θ_1} W^{2a}(x) + δ_x R_{γ_2,θ_2} W^{2a}(x) )^{1/2} KL( δ_x R_{γ_1,θ_1} | δ_x R_{γ_2,θ_2} )^{1/2} .
Combining this result, Jensen's inequality and Lemma 22 if H2 holds and Lemma 23 if H3 holds, we obtain that
‖δ_x R_{γ_1,θ_1} − δ_x R_{γ_2,θ_2}‖_{W^a} ≤ 2 (1 + bγ̄)^{1/2} KL( δ_x R_{γ_1,θ_1} | δ_x R_{γ_2,θ_2} )^{1/2} W^a(x) .
Using Lemma 17 and the fact that γ_1 ≥ γ_2 we have
KL( δ_x R_{γ_1,θ_1} | δ_x R_{γ_2,θ_2} ) ≤ d (γ_1/γ_2 − 1)²/2 + ‖ γ_2∇_x V_{θ_2}(x) − γ_1∇_x V_{θ_1}(x) + γ_2∇_x U^{γ_2κ}_{θ_2}(x) − γ_1∇_x U^{γ_1κ}_{θ_1}(x) ‖² / (4γ_2) .   (40)
We have
‖ γ_2∇_x V_{θ_2}(x) − γ_1∇_x V_{θ_1}(x) + γ_2∇_x U^{γ_2κ}_{θ_2}(x) − γ_1∇_x U^{γ_1κ}_{θ_1}(x) ‖²   (41)
 ≤ 4 ‖ γ_2∇_x V_{θ_2}(x) − γ_2∇_x V_{θ_1}(x) ‖² + 4 ‖ γ_2∇_x V_{θ_1}(x) − γ_1∇_x V_{θ_1}(x) ‖²
 + 4 ‖ γ_1∇_x U^{γ_1κ}_{θ_1}(x) − γ_2∇_x U^{γ_2κ}_{θ_1}(x) ‖² + 4 ‖ γ_2∇_x U^{γ_2κ}_{θ_1}(x) − γ_2∇_x U^{γ_2κ}_{θ_2}(x) ‖² .
First, using H4 we have
‖ γ_2∇_x V_{θ_2}(x) − γ_2∇_x V_{θ_1}(x) ‖ ≤ γ_2 M_Θ ‖θ_1 − θ_2‖ (1 + ‖x‖) .   (42)
Second, using H1 we have
‖ γ_2∇_x V_{θ_1}(x) − γ_1∇_x V_{θ_1}(x) ‖ ≤ (γ_1 − γ_2) ‖∇_x V_{θ_1}(x)‖ ≤ (γ_1 − γ_2) L ‖ x − x*_{θ_1} ‖ ≤ (γ_1 − γ_2) L ( R_{V,1} + ‖x‖ ) .   (43)
Third, using H1, H4, Lemma 9 and Lemma 11 we have
‖ γ_1∇_x U^{γ_1κ}_{θ_1}(x) − γ_2∇_x U^{γ_2κ}_{θ_1}(x) ‖ ≤ ‖ ( x − prox^{γ_1κ}_{U_{θ_1}}(x) )/κ − ( x − prox^{γ_2κ}_{U_{θ_1}}(x) )/κ ‖   (44)
 ≤ ‖ prox^{γ_2κ}_{U_{θ_1}}(x) − prox^{γ_1κ}_{U_{θ_1}}(x) ‖ / κ ≤ 2M (γ_1 − γ_2) .
Finally, using H4 we have
‖ γ_2∇_x U^{γ_2κ}_{θ_1}(x) − γ_2∇_x U^{γ_2κ}_{θ_2}(x) ‖ ≤ γ_2 ( sup_{t∈[0,γ̄κ]} f_θ(t) ) ‖θ_1 − θ_2‖ .   (45)
Combining (42), (43), (44) and (45) in (41), we get that there exists Ā_{4,1} > 0 such that
‖ γ_2∇_x V_{θ_2}(x) − γ_1∇_x V_{θ_1}(x) + γ_2∇_x U^{γ_2κ}_{θ_2}(x) − γ_1∇_x U^{γ_1κ}_{θ_1}(x) ‖² ≤ Ā_{4,1} [ (γ_1 − γ_2)² + γ_2² ‖θ_1 − θ_2‖² ] W^{2a}(x) .
Using this result in (40) we obtain that there exists Ā_{4,2} > 0 such that
KL( δ_x R_{γ_1,θ_1} | δ_x R_{γ_2,θ_2} ) ≤ Ā_{4,2} [ (γ_1/γ_2 − 1)² + γ_2 ‖θ_1 − θ_2‖² ] W^{2a}(x) ,
which implies the announced result upon setting Ā_4 = 2 Ā_{4,2}^{1/2} (1 + bγ̄)^{1/2} and using that for any u, v ≥ 0, √(u+v) ≤ √u + √v .
5.6 Proof of Theorem 6
We divide the proof into two parts.
(a) First, assume that (X^n_k)_{n∈N, k∈{0,...,m_n}} and (X̄^n_k)_{n∈N, k∈{0,...,m_n}} are given by (5) and that {(K_{γ,θ}, K̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ} = {(S_{γ,θ}, S̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ}. Then Lemma 26 implies that [17, H1a] is satisfied with A1 ← A_1, Theorem 25 implies that [17, H1b] holds with A2 ← A_2 and ρ ← ρ. Finally, using Corollary 31 we get that [17, H1c] holds with Ψ ← Ψ. Therefore, we can apply [17, Theorem 1] and we obtain that the sequence (θ_n)_{n∈N} converges a.s. if
Σ_{n=0}^{+∞} δ_n = +∞ ,   Σ_{n=0}^{+∞} δ_{n+1} Ψ(γ_n) < +∞ ,   Σ_{n=0}^{+∞} δ_{n+1}/(m_n γ_n) < +∞ .
Since Ψ(γ_n) = O(γ_n^{1/2}) by Corollary 31, these summability conditions are satisfied under the summability assumptions of Theorem 6-(1). Proposition 32 implies that [17, H2] holds with Λ1 ← Λ_1 and Λ2 ← Λ_2. Therefore, if m_n = m_0 for all n ∈ N, we can apply [17, Theorem 3] and we obtain that the sequence (θ_n)_{n∈N} converges a.s. if
Σ_{n=0}^{+∞} δ_n = +∞ ,   Σ_{n=0}^{+∞} δ_{n+1} Ψ(γ_n) < +∞ ,   Σ_{n=0}^{+∞} δ_{n+1}² γ_n^{−2} < +∞ ,   Σ_{n=0}^{+∞} δ_{n+1} γ_n^{−2} ( Λ_1(γ_n, γ_{n+1}) + δ_{n+1} Λ_2(γ_n, γ_{n+1}) ) < +∞ .
These summability conditions are satisfied under the summability assumptions of Theorem 6-(2).
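For concreteness, here is a small sketch of how the first set of conditions can be checked for polynomial step sizes; the exponent choices below are hypothetical and not taken from the paper. Writing δ_n ∝ n^{−a}, γ_n ∝ n^{−b} and m_n ∝ n^{c}, and using Ψ(γ_n) = O(γ_n^{1/2}) together with the elementary fact that Σ n^{−p} diverges if and only if p ≤ 1, the three conditions reduce to a ≤ 1, a + b/2 > 1 and a + c − b > 1.

def check_theorem6_part1(a, b, c):
    """Exponent form of the Theorem 6-(1)-type conditions for
    delta_n ~ n^{-a}, gamma_n ~ n^{-b}, m_n ~ n^{c} (hypothetical choices),
    using sum n^{-p} = +inf iff p <= 1 and Psi(gamma) = O(sqrt(gamma))."""
    diverges  = a <= 1            # sum delta_n = +inf
    cond_bias = a + b / 2 > 1     # sum delta_{n+1} Psi(gamma_n) < +inf
    cond_mc   = a + c - b > 1     # sum delta_{n+1} / (m_n gamma_n) < +inf
    return diverges and cond_bias and cond_mc

# e.g. delta_n ~ 1/n, gamma_n ~ n^{-1/2}, m_n ~ n
print(check_theorem6_part1(1.0, 0.5, 1.0))   # True
print(check_theorem6_part1(1.0, 0.5, 0.0))   # False: sum delta/(m gamma) diverges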
(b) Second, assume that (X^n_k)_{n∈N, k∈{0,...,m_n}} and (X̄^n_k)_{n∈N, k∈{0,...,m_n}} are given by (5) with {(K_{γ,θ}, K̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ} = {(R_{γ,θ}, R̄_{γ,θ}) : γ ∈ (0, γ̄], θ ∈ Θ}. Then Lemma 33 implies that [17, H1a] is satisfied with A1 ← Ā_1, Theorem 21 implies that [17, H1b] holds with A2 ← Ā_2 and ρ ← ρ̄. Finally, using Proposition 36 we get that [17, H1c] holds with Ψ ← Ψ̄. Therefore, we can apply [17, Theorem 1] and we obtain that the sequence (θ_n)_{n∈N} converges a.s. if
Σ_{n=0}^{+∞} δ_n = +∞ ,   Σ_{n=0}^{+∞} δ_{n+1} Ψ̄(γ_n) < +∞ ,   Σ_{n=0}^{+∞} δ_{n+1}/(m_n γ_n) < +∞ .
Since Ψ̄(γ_n) = O(γ_n^{1/2}) by Proposition 36, these summability conditions are satisfied under the summability assumptions of Theorem 6-(1). Proposition 37 implies that [17, H2] holds with Λ1 ← Λ̄_1 and Λ2 ← Λ̄_2. Therefore, if m_n = m_0 for all n ∈ N, we can apply [17, Theorem 3] and we obtain that the sequence (θ_n)_{n∈N} converges a.s. if
Σ_{n=0}^{+∞} δ_n = +∞ ,   Σ_{n=0}^{+∞} δ_{n+1} Ψ̄(γ_n) < +∞ ,   Σ_{n=0}^{+∞} δ_{n+1}² γ_n^{−2} < +∞ ,   Σ_{n=0}^{+∞} δ_{n+1} γ_n^{−2} ( Λ̄_1(γ_n, γ_{n+1}) + δ_{n+1} Λ̄_2(γ_n, γ_{n+1}) ) < +∞ .
These summability conditions are satisfied under the summability assumptions of Theorem 6-(2).
5.7 Proof of Theorem 7
The proof is similar to the one of Theorem 6, using [16, Theorem 2, Theorem 4] instead of [16, Theorem 1, Theorem 3].
6 Acknowledgements
AD acknowledges financial support from Polish National Science Center grant: NCN UMO-
2018/31/B/ST1/00253. MP acknowledges financial support from EPSRC under grant EP/T007346/1.
References
[1] Yves F Atchadé, Gersende Fort, and Eric Moulines. On perturbed proximal gradient algo-
rithms. J. Mach. Learn. Res, 18(1):310–342, 2017.
[2] Francis R. Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems 24:
25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a
meeting held 12-14 December 2011, Granada, Spain, pages 451–459, 2011.
[3] D. Bakry, I. Gentil, and M. Ledoux. Analysis and geometry of Markov diffusion operators,
volume 348 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of
Mathematical Sciences]. Springer, Cham, 2014.
[4] Dominique Bakry, Franck Barthe, Patrick Cattiaux, and Arnaud Guillin. A simple proof of
the Poincaré inequality for a large class of probability measures including the log-concave case.
Electron. Commun. Probab., 13:60–66, 2008.
[5] Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator the-
ory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC.
Springer, Cham, second edition, 2017. With a foreword by Hédy Attouch.
[6] M. Benaim. A dynamical system approach to stochastic approximations. SIAM J. Control
Optim., 34(2):437–472, 1996.
[7] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approxima-
tions, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990.
Translated from the French by Stephen S. Wilson.
[8] Sebastian Berisha, James G Nagy, and Robert J Plemmons. Deblurring and sparse unmixing
of hyperspectral images using multiple point spread functions. SIAM Journal on Scientific
Computing, 37(5):S389–S406, 2015.
[9] José M Bioucas-Dias, Antonio Plaza, Nicolas Dobigeon, Mario Parente, Qian Du, Paul Gader,
and Jocelyn Chanussot. Hyperspectral unmixing overview: Geometrical, statistical, and sparse
regression-based approaches. IEEE journal of selected topics in applied earth observations and
remote sensing, 5(2):354–379, 2012.
[10] Emmanuel J Candes, Yonina C Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase
retrieval via matrix completion. SIAM review, 57(2):225–251, 2015.
[11] Antonin Chambolle and Thomas Pock. An introduction to continuous optimization for imag-
ing. Acta Numerica, 25:161–319, 2016.
[12] Emilie Chouzenoux, Anna Jezierska, Jean-Christophe Pesquet, and Hugues Talbot. A Convex
Approach for Image Restoration with Exact Poisson–Gaussian Likelihood. SIAM Journal on
Imaging Sciences, 8(4):2662–2682, 2015.
[13] Julianne Chung and Linh Nguyen. Motion estimation and correction in photoacoustic tomo-
graphic reconstruction. SIAM Journal on Imaging Sciences, 10(1):216–242, 2017.
[14] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-
concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
79(3):651–676, 2017.
[15] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
[16] V. De Bortoli and A. Durmus. Convergence of diffusions and their discretizations: from
continuous to discrete processes and back, 2019.
[17] V. De Bortoli, A. Durmus, M. Pereyra, and A. F. Vidal. Efficient stochastic optimisation by unadjusted Langevin Monte Carlo. Application to maximum marginal likelihood and empirical Bayesian estimation. 2019.
[18] David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–
1306, 2006.
[19] Paul Dupuis and Richard S. Ellis. A weak convergence approach to the theory of large devi-
ations. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley &
Sons, Inc., New York, 1997. A Wiley-Interscience Publication.
[20] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the Unadjusted
Langevin Algorithm. ArXiv e-prints, May 2016.
[21] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
[22] Alain Durmus, Eric Moulines, et al. Nonasymptotic convergence analysis for the unadjusted
Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
[23] Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by prox-
imal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging
Sciences, 11(1):473–506, 2018.
[24] Bruno Galerne and Arthur Leclaire. Texture inpainting using efficient Gaussian conditional
simulation. SIAM Journal on Imaging Sciences, 10(3):1446–1474, 2017.
[25] Nobuyuki Ikeda and Shinzo Watanabe. Stochastic differential equations and diffusion pro-
cesses, volume 24 of North-Holland Mathematical Library. North-Holland Publishing Co.,
Amsterdam; Kodansha, Ltd., Tokyo, second edition, 1989.
[26] Mark A Iwen, Aditya Viswanathan, and Yang Wang. Fast phase retrieval from local correlation
measurements. SIAM Journal on Imaging Sciences, 9(4):1655–1688, 2016.
[27] Jari Kaipio and Erkki Somersalo. Statistical and computational inverse problems, volume 160.
Springer Science & Business Media, 2006.
[28] Michael Kech and Felix Krahmer. Optimal injectivity conditions for bilinear inverse problems
with applications to identifiability of deconvolution problems. SIAM Journal on Applied
Algebra and Geometry, 1(1):20–37, 2017.
[29] Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression
function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[30] Solomon Kullback. Information theory and statistics. John Wiley and Sons, Inc., New York;
Chapman and Hall, Ltd., London, 1959.
[31] Shutao Li, Xudong Kang, Leyuan Fang, Jianwen Hu, and Haitao Yin. Pixel-level image fusion:
A survey of the state of the art. Information Fusion, 33:100–112, 2017.
[32] Robert S. Liptser and Albert N. Shiryaev. Statistics of random processes. II, volume 6 of
Applications of Mathematics (New York). Springer-Verlag, Berlin, expanded edition, 2001.
Applications, Translated from the 1974 Russian original by A. B. Aries, Stochastic Modelling
and Applied Probability.
[33] M. Métivier and P. Priouret. Applications of a Kushner and Clark lemma to general classes
of stochastic algorithms. IEEE Trans. Inform. Theory, 30(2, part 1):140–151, 1984.
[34] M. Métivier and P. Priouret. Théorèmes de convergence presque sure pour une classe
d’algorithmes stochastiques à pas décroissant. Probab. Theory Related Fields, 74(3):403–428,
1987.
[35] Veniamin I Morgenshtern and Emmanuel J Candes. Super-resolution of positive sources: The
discrete setup. SIAM Journal on Imaging Sciences, 9(1):412–444, 2016.
[36] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87.
Springer Science & Business Media, 2013.
[37] George Pólya and Gabor Szegő. Problems and theorems in analysis. I. Classics in Mathematics.
Springer-Verlag, Berlin, 1998. Series, integral calculus, theory of functions, Translated from
the German by Dorothee Aeppli, Reprint of the 1978 English translation.
[38] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[39] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal
for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[40] Saiprasad Ravishankar and Yoram Bresler. Efficient blind compressed sensing using sparsifying
transforms with convergence guarantees and application to magnetic resonance imaging. SIAM
Journal on Imaging Sciences, 8(4):2519–2557, 2015.
[41] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of
mathematical statistics, pages 400–407, 1951.
[42] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and
their discrete approximations. Bernoulli, 2(4):341–363, 1996.
[43] Lorenzo Rosasco, Silvia Villa, and Bang Công Vũ. Convergence of stochastic proximal gradient algorithm. Applied Mathematics & Optimization, pages 1–27, 2019.
[44] Carola-Bibiane Schönlieb. Partial Differential Equation Methods for Image Inpainting, vol-
ume 29. Cambridge University Press, 2015.
[45] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization:
Convergence results and optimal averaging schemes. In International Conference on Machine
Learning, pages 71–79, 2013.
[46] Miguel Simões, José Bioucas-Dias, Luis B Almeida, and Jocelyn Chanussot. A convex for-
mulation for hyperspectral image superresolution via subspace-based regularization. IEEE
Transactions on Geoscience and Remote Sensing, 53(6):3373–3388, 2015.
[47] Weijie Su, Stephen P. Boyd, and Emmanuel J. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. J. Mach. Learn. Res., 17:153:1–153:43, 2016.
[48] V. B. Tadić and A. Doucet. Asymptotic bias of stochastic gradient search. Ann. Appl. Probab.,
27(6):3255–3304, 2017.
[49] Ana F. Vidal, Valentin De Bortoli, Marcelo Pereyra, and Alain Durmus. Maximum likelihood estimation of regularisation parameters in high-dimensional inverse problems: an empirical Bayesian approach. Part I: Methodology and experiments, 2019.
[50] Ana Fernandez Vidal and Marcelo Pereyra. Maximum likelihood estimation of regularisation
parameters. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages
1742–1746. IEEE, 2018.
[51] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network
for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 2472–2481, 2018.