Maximum likelihood estimation of regularisation parameters in high-dimensional inverse problems: an empirical Bayesian approach. Part II: Theoretical Analysis

Abstract

This paper presents a detailed theoretical analysis of the three stochastic approximation proximal gradient algorithms proposed in our companion paper [49] to set regularization parameters by marginal maximum likelihood estimation. We prove the convergence of a more general stochastic approximation scheme that includes the three algorithms of [49] as special cases. This includes asymptotic and non-asymptotic convergence results with natural and easily verifiable conditions, as well as explicit bounds on the convergence rates. Importantly, the theory is also general in that it can be applied to other intractable optimisation problems. A main novelty of the work is that the stochastic gradient estimates of our scheme are constructed from inexact proximal Markov chain Monte Carlo samplers. This allows the use of samplers that scale efficiently to large problems and for which we have precise theoretical guarantees.
arXiv:2008.05793v1 [math.ST] 13 Aug 2020
Valentin De Bortoli 1, Alain Durmus 1, Marcelo Pereyra 2, and Ana F. Vidal 2
Part of this work has been presented at the 25th IEEE International Conference on Image Processing (ICIP) [50].
1 CMLA - École normale supérieure Paris-Saclay, CNRS, Université Paris-Saclay, 94235 Cachan, France.
2 Maxwell Institute for Mathematical Sciences & School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, EH14 4AS, United Kingdom.
August 14, 2020
1 Introduction
Numerous imaging problems require performing inferences on an unknown image of interest $x \in \mathbb{R}^d$ from some observed data $y$. Canonical examples include image denoising [12,28], compressive sensing [18,40], super-resolution [35,51], tomographic reconstruction [13], image inpainting [24,44], source separation [9,8], fusion [46,31], and phase retrieval [10,26]. Such imaging problems can be formulated in a Bayesian statistical framework, where inferences are derived from the so-called posterior distribution of $x$ given $y$, which for the purpose of this paper we specify as follows
\[ p(x|y,\theta) = p(y|x)\,p(x|\theta)/p(y|\theta) , \]
where $p(y|x) = \exp\{-f_y(x)\}$ with $f_y \in \mathrm{C}^1(\mathbb{R}^d, \mathbb{R})$ is the likelihood function, and the prior distribution is $p(x|\theta) = \exp\{-\theta^\top g(x)\}$ with $g : \mathbb{R}^d \to \mathbb{R}^{d_\Theta}$ and $\theta \in \Theta \subset \mathbb{R}^{d_\Theta}$. The function $f_y$ acts as a data-fidelity term, $g$ as a regulariser that promotes desired structural or regularity properties (e.g., smoothness, piecewise-regularity, or sparsity [11]), and $\theta$ is a regularisation parameter that controls the amount of regularity enforced. Most Bayesian methods in the imaging literature consider models for which $f_y$ and $g$ are convex functions and report as solution the maximum-a-posteriori (MAP) Bayesian estimator
\[ \arg\min f_{y,\theta} , \quad \text{where } f_{y,\theta}(x) = f_y(x) + \theta^\top g(x) \text{ for any } x \in \mathbb{R}^d . \tag{1} \]
Email: debortoli@cmla.ens-cachan.fr
Email: durmus@cmla.ens-cachan.fr
Email: m.pereyra@hw.ac.uk
§Email: af69@hw.ac.uk
For example, many imaging works consider a linear observation model of the form $y = Ax + w$, where $A \in \mathbb{R}^{d \times d}$ is some problem-specific linear operator and the noise $w$ has distribution $\mathrm{N}(0, \sigma^2 \mathrm{I}_d)$ with variance $\sigma^2 > 0$. Then, for any $x \in \mathbb{R}^d$, $f_y(x) = (2\sigma^2)^{-1} \|Ax - y\|^2$. With regards to the prior, a common choice in imaging is to set $\Theta = \mathbb{R}_+$ and $g(x) = \|Bx\|_1$ for some suitable basis or dictionary $B \in \mathbb{R}^{d \times d}$, or $g(x) = \mathrm{TV}(x)$, where $\mathrm{TV}(x)$ is the isotropic total variation pseudo-norm given by $\mathrm{TV}(x) = \sum_i \sqrt{(\Delta_i^h x)^2 + (\Delta_i^v x)^2}$, where $\Delta_i^h$ and $\Delta_i^v$ denote horizontal and vertical first-order local (pixel-wise) difference operators.
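As a concrete illustration of the two model ingredients above, the following Python sketch evaluates the Gaussian data-fidelity term and the isotropic total variation of a small image. The function names, the list-of-rows layout, and the boundary convention (differences set to zero at the last row and column) are assumptions made for illustration; the text does not fix them.

```python
import math


def data_fidelity(x, y, A, sigma2):
    """f_y(x) = ||A x - y||^2 / (2 sigma^2), with A given as a list of rows."""
    residual = [sum(a * xj for a, xj in zip(row, x)) - yi
                for row, yi in zip(A, y)]
    return sum(r * r for r in residual) / (2.0 * sigma2)


def isotropic_tv(image):
    """Isotropic total variation of a 2-D image given as a list of rows.

    Forward differences are used and set to zero at the last row/column
    (an assumed boundary convention).
    """
    rows, cols = len(image), len(image[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            dh = image[i][j + 1] - image[i][j] if j + 1 < cols else 0.0  # horizontal
            dv = image[i + 1][j] - image[i][j] if i + 1 < rows else 0.0  # vertical
            total += math.sqrt(dh * dh + dv * dv)
    return total
```

A constant image has zero total variation, while a single vertical edge of height one contributes one unit per row it crosses.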
Importantly, when $f_y$ and $g$ are convex, problem (1) is also convex and can usually be efficiently solved by using modern proximal convex optimisation techniques [11], with remarkable guarantees on the solutions delivered.
Setting the value of $\theta$ can be notoriously difficult, especially in problems that are ill-posed or ill-conditioned where the regularisation has a dramatic impact on the recovered estimates. We refer to [27] and [49, Section 1] for illustrations and a detailed review of the existing methods for setting $\theta$.
In our companion paper [49], we present a new method to set regularisation parameters. More precisely, in [49], we adopt an empirical Bayesian approach and set $\theta$ by maximum marginal likelihood estimation, i.e.
\[ \theta^\star \in \arg\max_{\theta \in \Theta} \log p(y|\theta) , \quad \text{where } p(y|\theta) = \int_{\mathbb{R}^d} p(y,x|\theta)\,\mathrm{d}x , \quad p(y,x|\theta) \propto \exp[-f_{y,\theta}(x)] . \tag{2} \]
To solve (2), we aim at using gradient-based optimisation methods. The gradient of $\theta \mapsto \log p(y|\theta)$ can be computed using Fisher's identity, see [49, Proposition A.1], which implies under mild integrability conditions on $f_y$ and $g$, for any $\theta \in \Theta$,
\[ \nabla_\theta \log p(y|\theta) = -\int_{\mathbb{R}^d} g(\tilde{x})\,p(\tilde{x}|y,\theta)\,\mathrm{d}\tilde{x} + \int_{\mathbb{R}^d} g(\tilde{x})\,p(\tilde{x}|\theta)\,\mathrm{d}\tilde{x} . \]
It follows that $\theta \mapsto \nabla_\theta \log p(y|\theta)$ can be written as a sum of two parametric integrals which are intractable in most cases. Therefore, we propose to use a stochastic approximation (SA) scheme and, in particular, we define three different algorithms to solve (2) [49, Algorithm 3.1, Algorithm 3.2, Algorithm 3.3]. These algorithms are extensively demonstrated in [49] through a range of applications and comparisons with alternative approaches from the state-of-the-art.
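The gradient identity above can be checked by a short formal computation. The sketch below assumes that the prior is normalised as $p(x|\theta) = \exp\{-\theta^\top g(x)\}/Z(\theta)$ with $Z(\theta) = \int_{\mathbb{R}^d} \exp\{-\theta^\top g(x)\}\,\mathrm{d}x$ (the symbol $Z$ is notation introduced here for illustration) and that differentiation under the integral sign is justified; it is not a substitute for [49, Proposition A.1].

```latex
\begin{align*}
\log p(y|\theta)
  &= \log \int_{\mathbb{R}^d} e^{-f_y(x) - \theta^\top g(x)}\,\mathrm{d}x \;-\; \log Z(\theta) , \\
\nabla_\theta \log \int_{\mathbb{R}^d} e^{-f_y(x) - \theta^\top g(x)}\,\mathrm{d}x
  &= -\frac{\int_{\mathbb{R}^d} g(\tilde{x})\, e^{-f_y(\tilde{x}) - \theta^\top g(\tilde{x})}\,\mathrm{d}\tilde{x}}
           {\int_{\mathbb{R}^d} e^{-f_y(x) - \theta^\top g(x)}\,\mathrm{d}x}
   = -\int_{\mathbb{R}^d} g(\tilde{x})\, p(\tilde{x}|y,\theta)\,\mathrm{d}\tilde{x} , \\
\nabla_\theta \log Z(\theta)
  &= -\frac{\int_{\mathbb{R}^d} g(\tilde{x})\, e^{-\theta^\top g(\tilde{x})}\,\mathrm{d}\tilde{x}}{Z(\theta)}
   = -\int_{\mathbb{R}^d} g(\tilde{x})\, p(\tilde{x}|\theta)\,\mathrm{d}\tilde{x} .
\end{align*}
```

Subtracting the third line from the second gives the posterior expectation of $g$ with a minus sign plus the prior expectation of $g$, which is the stated formula.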
In the present paper we theoretically analyse these three SA schemes and establish natural and easily verifiable conditions for convergence. For generality, rather than presenting algorithm-specific analyses, we establish detailed convergence results for a more general SA scheme that covers the three algorithms of [49] as specific cases. Indeed, all these methods boil down to defining a sequence $(\theta_n)_{n \in \mathbb{N}}$ satisfying a recursion of the form: for any $n \in \mathbb{N}$,
\[ \theta_{n+1} = \Pi_\Theta\Big[ \theta_n - \frac{\delta_{n+1}}{m_n} \sum_{k=1}^{m_n} \big\{ g(X_k^n) - g(\bar{X}_k^n) \big\} \Big] , \tag{3} \]
where $\Pi_\Theta$ is the projection onto a convex closed set $\Theta$, $(X_k^n)_{k \in \{1,\dots,m_n\}}$ and $(\bar{X}_k^n)_{k \in \{1,\dots,m_n\}}$ are two independent stochastic processes targeting $x \mapsto p(x|y,\theta)$ and $x \mapsto p(x|\theta)$ respectively, $(m_n)_{n \in \mathbb{N}}$ is a sequence of batch-sizes and $(\delta_n)_{n \in \mathbb{N}}$ is a sequence of stepsizes. In this paper, we are interested in establishing the convergence of the averaging of $(\theta_n)_{n \in \mathbb{N}}$ to a solution of (2) in this setting. SA has been extensively studied during the past decades [41,29,38,47,33,34,7,6,48]. Recently, quantitative results have been obtained in [45,2,39,1,43]. In contrast to [1], here we consider the case where $(X_k^n)_{k \in \{1,\dots,m_n\}}$ and $(\bar{X}_k^n)_{k \in \{1,\dots,m_n\}}$ are inexact Markov chains which target $x \mapsto p(x|y,\theta)$ and $x \mapsto p(x|\theta)$ respectively and are based on some generalizations of the Unadjusted Langevin Algorithm (ULA) [42]. In recent years, ULA has attracted a lot of attention since this algorithm exhibits favorable high-dimensional convergence properties in the case where the target distribution admits a differentiable density, see [20,22,14,15]. However, in most imaging models, the penalty function $g$ is not differentiable and therefore $x \mapsto p(x|y,\theta)$ and $x \mapsto p(x|\theta)$ are not differentiable either. Therefore, we consider proximal Langevin samplers which are specifically designed to overcome this issue: the Moreau-Yosida Unadjusted Langevin Algorithm (MYULA), see [23], and the Proximal Unadjusted Langevin Algorithm (PULA), see [21].
A similar approximation scheme to (3) is studied in [1]. More precisely, [1, Theorem 3, Theorem 4] are similar to Theorem 6 and Theorem 7. In contrast to that work, here we do not require the Markov kernels we use to exactly target $x \mapsto p(x|\theta)$ and $x \mapsto p(x|y,\theta)$ but allow some bias in the estimation, which is accounted for in our convergence rates. This relaxation to biased estimates plays a central role in the capacity of the method to scale efficiently to large problems. Moreover, the present paper also complements [17], which establishes general conditions for the convergence of inexact Markovian SA but only applies these results to ULA. In this study, we do not consider a general Markov kernel but rather specialize the results of [17] to MYULA and PULA Markov kernels. However, to apply the results of [17], new quantitative geometric convergence properties on MYULA and PULA have to be established.
The remainder of the paper is organized as follows. In Section 2, we recall our notations and
conventions. In Section 3, we define the class of optimisation problems considered and the SA
scheme (3). This setting includes the optimization problem presented in (2) and the three specific
algorithms introduced in [49]. Then, in Section 4, we present a detailed analysis of the theoretical
properties of the proposed methodology. First, we show new ergodicity results for the MYULA
and PULA samplers. In a second part, we provide easily verifiable conditions for convergence and
quantitative convergence rates for the averaging sequences designed from (3). The proofs of these
results are gathered in Section 5.
2 Notations and conventions
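We denote by $\mathrm{B}(0,R)$ and $\bar{\mathrm{B}}(0,R)$ the open ball, respectively the closed ball, with radius $R$ in $\mathbb{R}^d$. Denote by $\mathcal{B}(\mathbb{R}^d)$ the Borel $\sigma$-field of $\mathbb{R}^d$, $\mathbb{F}(\mathbb{R}^d)$ the set of all Borel measurable functions on $\mathbb{R}^d$ and for $f \in \mathbb{F}(\mathbb{R}^d)$, $\|f\|_\infty = \sup_{x \in \mathbb{R}^d} |f(x)|$. For $\mu$ a probability measure on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ and $f \in \mathbb{F}(\mathbb{R}^d)$ a $\mu$-integrable function, denote by $\mu(f)$ the integral of $f$ w.r.t. $\mu$. For $f \in \mathbb{F}(\mathbb{R}^d)$, the $V$-norm of $f$ is given by $\|f\|_V = \sup_{x \in \mathbb{R}^d} |f(x)|/V(x)$. Let $\xi$ be a finite signed measure on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. The $V$-total variation norm of $\xi$ is defined as
\[ \|\xi\|_V = \sup_{f \in \mathbb{F}(\mathbb{R}^d),\, \|f\|_V \leq 1} \left| \int_{\mathbb{R}^d} f(x)\,\mathrm{d}\xi(x) \right| . \]
If $V \equiv 1$, then $\|\cdot\|_V$ is the total variation norm on measures denoted by $\|\cdot\|_{\mathrm{TV}}$.
Let $\mathsf{U}$ be an open set of $\mathbb{R}^d$. We denote by $\mathrm{C}^k(\mathsf{U}, \mathbb{R}^{d_\Theta})$ the set of $\mathbb{R}^{d_\Theta}$-valued $k$-differentiable functions, respectively by $\mathrm{C}_c^k(\mathsf{U}, \mathbb{R}^{d_\Theta})$ the set of compactly supported $\mathbb{R}^{d_\Theta}$-valued $k$-differentiable functions. $\mathrm{C}^k(\mathsf{U})$ stands for $\mathrm{C}^k(\mathsf{U}, \mathbb{R})$. Let $f : \mathsf{U} \to \mathbb{R}$, we denote by $\nabla f$ the gradient of $f$ if it exists. $f$ is said to be $\mathtt{m}$-convex with $\mathtt{m} > 0$ if for all $x, y \in \mathbb{R}^d$ and $t \in [0,1]$,
\[ f(tx + (1-t)y) \leq t f(x) + (1-t) f(y) - (\mathtt{m}/2)\, t(1-t) \|x - y\|^2 . \]
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Denote by $\mu \ll \nu$ if $\mu$ is absolutely continuous w.r.t. $\nu$ and $\mathrm{d}\mu/\mathrm{d}\nu$ an associated density. Let $\mu, \nu$ be two probability measures on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. Define the Kullback-Leibler divergence of $\mu$ from $\nu$ by
\[ \mathrm{KL}(\mu|\nu) = \begin{cases} \int_{\mathbb{R}^d} \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(x) \log\left( \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(x) \right) \mathrm{d}\nu(x) , & \text{if } \mu \ll \nu , \\ +\infty & \text{otherwise.} \end{cases} \]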
3 Proposed stochastic approximation proximal gradient optimisation methodology
3.1 Problem statement
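Let $\Theta \subset \mathbb{R}^{d_\Theta}$ and $f : \Theta \to \mathbb{R}$. We consider the optimisation problem
\[ \theta^\star \in \arg\min_{\theta \in \Theta} f(\theta) , \tag{4} \]
in scenarios where it is not possible to evaluate $f$ nor $\nabla f$ because they are computationally intractable. Problem (4) includes the marginal likelihood estimation problem (2) of our companion paper [49] as the special case $f = -\log p(y|\cdot)$. We make the following general assumptions on $f$ and $\Theta$, which are in particular verified by the imaging models considered in [49].
A1. $\Theta$ is a convex compact set and $\Theta \subset \bar{\mathrm{B}}(0, R_\Theta)$ with $R_\Theta > 0$.
A2. There exist an open set $\mathsf{U} \subset \mathbb{R}^{d_\Theta}$ and $L_f > 0$ such that $\Theta \subset \mathsf{U}$, $f \in \mathrm{C}^1(\mathsf{U}, \mathbb{R})$ and for any $\theta_1, \theta_2 \in \Theta$
\[ \|\nabla_\theta f(\theta_1) - \nabla_\theta f(\theta_2)\| \leq L_f \|\theta_1 - \theta_2\| . \]
A3. For any $\theta \in \Theta$, there exist $H_\theta, \bar{H}_\theta : \mathbb{R}^d \to \mathbb{R}^{d_\Theta}$ and two probability distributions $\pi_\theta, \bar{\pi}_\theta$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ satisfying for any $\theta \in \Theta$
\[ \nabla_\theta f(\theta) = \int_{\mathbb{R}^d} H_\theta(x)\,\mathrm{d}\pi_\theta(x) + \int_{\mathbb{R}^d} \bar{H}_\theta(x)\,\mathrm{d}\bar{\pi}_\theta(x) . \]
In addition, $(\theta, x) \mapsto H_\theta(x)$ and $(\theta, x) \mapsto \bar{H}_\theta(x)$ are measurable.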
Remark 1. Note that if $f \in \mathrm{C}^2(\Theta)$ then A2 is automatically satisfied under A1, since $\Theta$ is compact. In every model considered in our companion paper [49], $\theta \mapsto -\log p(y|\theta)$ is twice continuously differentiable on each compact set, by the dominated convergence theorem, and therefore A2 holds under A1.
Remark 2. Assumption A3 is verified in the three cases considered in our companion paper [49, Algorithm 3.1, Algorithm 3.2, Algorithm 3.3]:
(a) if the regulariser $g$ is $\alpha$ positively homogeneous with $\alpha > 0$ and $d_\Theta = 1$, corresponding to [49, Algorithm 3.1], then for any $\theta \in \Theta$, $H_\theta = g$, $\bar{H}_\theta = -d/(\alpha\theta)$, $\pi_\theta$ is the probability measure with density w.r.t. the Lebesgue measure $x \mapsto p(x|y,\theta)$ and $\bar{\pi}_\theta$ is any probability measure;
(b) if the regulariser $g$ is separably positively homogeneous as in [49, Algorithm 3.2], then for any $\theta \in \Theta$, $H_\theta = g$, $\bar{H}_\theta = -(|A_i|/(\alpha_i\theta_i))_{i \in \{1,\dots,d_\Theta\}}$, $\pi_\theta$ is the probability measure with density w.r.t. the Lebesgue measure $x \mapsto p(x|y,\theta)$ and $\bar{\pi}_\theta$ is any probability measure;
(c) if the regulariser $g$ is inhomogeneous, corresponding to [49, Algorithm 3.3], then for any $\theta \in \Theta$, $H_\theta = g$, $\bar{H}_\theta = -g$, $\pi_\theta$ and $\bar{\pi}_\theta$ are the probability measures associated with the posterior and the prior, with density w.r.t. the Lebesgue measure $x \mapsto p(x|y,\theta)$ and $x \mapsto p(x|\theta)$ respectively.
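We now present in Algorithm 1 the stochastic algorithm we consider in order to solve (4). This method encompasses the schemes introduced in the companion paper [49, Algorithm 3.1, Algorithm 3.2, Algorithm 3.3]. Starting from $(X_0^0, \bar{X}_0^0) \in \mathbb{R}^d \times \mathbb{R}^d$ and $\theta_0 \in \Theta$, we define on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ the sequence $(\{(X_k^n, \bar{X}_k^n) : k \in \{0,\dots,m_n\}\}, \theta_n)_{n \in \mathbb{N}}$ by the following recursion for $n \in \mathbb{N}$ and $k \in \{0,\dots,m_n - 1\}$:
\[ \begin{aligned}
&(X_k^n)_{k \in \{0,\dots,m_n\}} \text{ is a Markov chain with kernel } K_{\gamma_n,\theta_n} \text{ and } X_0^n = X_{m_{n-1}}^{n-1} \text{ given } \mathcal{F}_{n-1} , \\
&(\bar{X}_k^n)_{k \in \{0,\dots,m_n\}} \text{ is a Markov chain with kernel } \bar{K}_{\gamma'_n,\theta_n} \text{ and } \bar{X}_0^n = \bar{X}_{m_{n-1}}^{n-1} \text{ given } \mathcal{F}_{n-1} , \\
&\theta_{n+1} = \Pi_\Theta\Big[ \theta_n - \frac{\delta_{n+1}}{m_n} \sum_{k=1}^{m_n} \big\{ H_{\theta_n}(X_k^n) + \bar{H}_{\theta_n}(\bar{X}_k^n) \big\} \Big] ,
\end{aligned} \tag{5} \]
where $(X_{m_{-1}}^{-1}, \bar{X}_{m_{-1}}^{-1}) = (X_0^0, \bar{X}_0^0)$, $\{(K_{\gamma,\theta}, \bar{K}_{\gamma,\theta}) : \gamma > 0, \theta \in \Theta\}$ is a family of Markov kernels on $\mathbb{R}^d \times \mathcal{B}(\mathbb{R}^d)$, $(m_n)_{n \in \mathbb{N}} \in (\mathbb{N}^*)^{\mathbb{N}}$, $\delta_n, \gamma_n, \gamma'_n > 0$ for any $n \in \mathbb{N}$, $\Pi_\Theta$ is the projection onto $\Theta$ and $\mathcal{F}_n$ is defined as follows for all $n \in \mathbb{N} \cup \{-1\}$
\[ \mathcal{F}_n = \sigma\big( \theta_0, \{ (X_k^\ell, \bar{X}_k^\ell)_{k \in \{0,\dots,m_\ell\}} : \ell \in \{0,\dots,n\} \} \big) , \quad \mathcal{F}_{-1} = \sigma(\theta_0, X_0^0, \bar{X}_0^0) . \]
Define for any $N \in \mathbb{N}^*$,
\[ \bar{\theta}_N = \sum_{n=0}^{N-1} \delta_n \theta_n \Big/ \sum_{n=0}^{N-1} \delta_n . \]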
In the sequel, we are interested in the convergence of $(f(\bar{\theta}_N))_{N \in \mathbb{N}}$ to a minimum of $f$ in the case where the Markov kernels $\{(K_{\gamma,\theta}, \bar{K}_{\gamma,\theta}) : \gamma > 0, \theta \in \Theta\}$ used in Algorithm 1 are either the ones associated with MYULA or PULA. We now present these two MCMC methods, for which some analysis is required in our study of $(f(\bar{\theta}_N))_{N \in \mathbb{N}}$.
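The recursion (5) and the weighted average $\bar{\theta}_N$ can be sketched in Python as follows. This is a minimal illustration for a scalar parameter, not the authors' implementation: the function `sapg`, its argument names, and the toy model in `gaussian_toy_demo` are all assumptions introduced here. The toy takes $y = x + w$ with unit-variance Gaussian noise and $g(x) = \|x\|^2/2$, chosen because the proximal operator of $\theta g$ is then $x/(1 + \lambda\theta)$ and the exact maximiser of the marginal likelihood is $\theta = 1$ for the data used; the kernels are Moreau-Yosida smoothed Langevin steps of the kind discussed in Section 3.2.

```python
import math
import random


def sapg(theta0, x0, xbar0, kernel, kernel_bar, H, H_bar, project,
         delta, gamma, gamma_bar, m, n_iter, rng):
    """Sketch of recursion (5) for a scalar theta.

    kernel(x, gamma, theta, rng) and kernel_bar(...) perform one Markov step;
    delta, gamma, gamma_bar and m are callables of the iteration index n.
    Returns the delta-weighted average of theta_0..theta_{N-1} and all iterates.
    """
    theta, x, xbar = theta0, list(x0), list(xbar0)
    thetas, num, den = [theta0], 0.0, 0.0
    for n in range(n_iter):
        num += delta(n) * theta            # accumulate sum_n delta_n * theta_n
        den += delta(n)
        grad = 0.0
        for _ in range(m(n)):              # chains are warm-started across iterations
            x = kernel(x, gamma(n), theta, rng)
            xbar = kernel_bar(xbar, gamma_bar(n), theta, rng)
            grad += H(theta, x) + H_bar(theta, xbar)
        theta = project(theta - delta(n + 1) * grad / m(n))
        thetas.append(theta)
    return num / den, thetas


def gaussian_toy_demo(seed=0, d=10, n_iter=2000, m=20, step=0.1, lam=0.1):
    """Hypothetical toy model: y = x + w, w ~ N(0, I), g(x) = ||x||^2 / 2."""
    rng = random.Random(seed)
    y = [math.sqrt(2.0) * (-1) ** i for i in range(d)]    # ||y||^2 / d = 2

    def post_step(x, gam, theta, rng):
        c = theta / (1.0 + lam * theta)    # (x - prox(x)) / lam equals c * x here
        return [xi - gam * (xi - yi) - gam * c * xi
                + math.sqrt(2.0 * gam) * rng.gauss(0.0, 1.0)
                for xi, yi in zip(x, y)]

    def prior_step(x, gam, theta, rng):
        c = theta / (1.0 + lam * theta)
        return [xi - gam * c * xi + math.sqrt(2.0 * gam) * rng.gauss(0.0, 1.0)
                for xi in x]

    def g(x):
        return 0.5 * sum(xi * xi for xi in x)

    theta_bar, _ = sapg(
        theta0=0.5, x0=y, xbar0=y, kernel=post_step, kernel_bar=prior_step,
        H=lambda th, x: g(x), H_bar=lambda th, x: -g(x),   # case (c) of Remark 2
        project=lambda th: min(max(th, 0.1), 10.0),        # Theta = [0.1, 10]
        delta=lambda n: 0.1 / math.sqrt(n + 1),
        gamma=lambda n: step, gamma_bar=lambda n: step,
        m=lambda n: m, n_iter=n_iter, rng=rng)
    return theta_bar


if __name__ == "__main__":
    print(gaussian_toy_demo())
```

In this toy the returned average should land in the vicinity of the exact maximiser $\theta = 1$, up to Monte Carlo noise and the bias introduced by the discretisation and smoothing; the step-size schedule and batch size are illustrative choices, not tuned recommendations.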
3.2 Choice of MCMC kernels
Given the high dimensionality involved, it is fundamental to carefully choose the families of Markov kernels $\{K_{\gamma,\theta}, \bar{K}_{\gamma,\theta} : \gamma > 0, \theta \in \Theta\}$ driving Algorithm 1. In the experimental part of this work, see [49, Section 4], we use the MYULA Markov kernel recently proposed in [23], which is a state-of-the-art proximal Markov chain Monte Carlo (MCMC) method specifically designed for high-dimensional models that are log-concave but not smooth. The method is derived from the
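Algorithm 1 General algorithm
1: Input: initial $\{\theta_0, X_0^0, \bar{X}_0^0\}$, $(\delta_n, \gamma_n, \gamma'_n, m_n)_{n \in \mathbb{N}}$, number of iterations $N$.
2: for $n = 0$ to $N - 1$ do
3:   if $n > 0$ then
4:     Set $X_0^n = X_{m_{n-1}}^{n-1}$,
5:     Set $\bar{X}_0^n = \bar{X}_{m_{n-1}}^{n-1}$,
6:   end if
7:   for $k = 0$ to $m_n - 1$ do
8:     Sample $X_{k+1}^n \sim K_{\gamma_n,\theta_n}(X_k^n, \cdot)$,
9:     Sample $\bar{X}_{k+1}^n \sim \bar{K}_{\gamma'_n,\theta_n}(\bar{X}_k^n, \cdot)$,
10:  end for
11:  Set $\theta_{n+1} = \Pi_\Theta\big[ \theta_n - \frac{\delta_{n+1}}{m_n} \sum_{k=1}^{m_n} \{ H_{\theta_n}(X_k^n) + \bar{H}_{\theta_n}(\bar{X}_k^n) \} \big]$.
12: end for
13: Output: $\bar{\theta}_N = \{\sum_{n=0}^{N-1} \delta_n\}^{-1} \sum_{n=0}^{N-1} \delta_n \theta_n$.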
discretisation of an over-damped Langevin diffusion, $(X_t)_{t \geq 0}$, satisfying the following stochastic differential equation
\[ \mathrm{d}X_t = -\nabla_x F(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t , \tag{6} \]
where $F : \mathbb{R}^d \to \mathbb{R}$ is a continuously differentiable potential and $(B_t)_{t \geq 0}$ is a standard $d$-dimensional Brownian motion. Under mild assumptions, this equation has a unique strong solution [25, Chapter 4, Theorem 2.3]. Accordingly, the law of $(X_t)_{t \geq 0}$ converges as $t \to \infty$ to the diffusion's unique invariant distribution, with probability density given by $\pi(x) \propto e^{-F(x)}$ for all $x \in \mathbb{R}^d$ [42, Theorem 2.2]. Hence, to use (6) as a Monte Carlo method to sample from the posterior $p(x|y,\theta)$, we set $F(x) = -\log p(x|y,\theta)$ and thus specify the desired target density. Similarly, to sample from the prior we set $F(x) = -\log p(x|\theta)$.
However, sampling directly from (6) is usually not computationally feasible. Instead, we usually resort to a discrete-time Euler-Maruyama approximation of (6) that leads to the following Markov chain $(X_k)_{k \in \mathbb{N}}$ with $X_0 \in \mathbb{R}^d$, given for any $k \in \mathbb{N}$ by
\[ \text{ULA}: \quad X_{k+1} = X_k - \gamma \nabla_x F(X_k) + \sqrt{2\gamma}\, Z_{k+1} , \]
where $\gamma > 0$ is a discretisation step-size and $(Z_k)_{k \in \mathbb{N}}$ is a sequence of i.i.d. $d$-dimensional zero-mean Gaussian random variables with an identity covariance matrix. This Markov chain is commonly known as the Unadjusted Langevin Algorithm (ULA) [42]. Under some additional assumptions on $F$, namely Lipschitz continuity of $\nabla_x F$, the ULA chain inherits the convergence properties of (6) and converges to a stationary distribution that is close to the target $\pi$, with $\gamma$ controlling a trade-off between accuracy and convergence speed [23].
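The ULA recursion can be sketched in a few lines of Python. The example below is illustrative only: the function name `ula` and the standard Gaussian target $F(x) = \|x\|^2/2$ are assumptions made here because the target is easy to check. For this particular target the chain is linear and its stationary variance is $1/(1 - \gamma/2)$ per coordinate rather than $1$, which gives a concrete instance of the discretisation bias mentioned above.

```python
import math
import random


def ula(grad_F, x0, gamma, n_steps, rng):
    """Run ULA: x_{k+1} = x_k - gamma * grad_F(x_k) + sqrt(2 * gamma) * z_{k+1}."""
    x = list(x0)
    chain = []
    for _ in range(n_steps):
        grad = grad_F(x)
        x = [xi - gamma * gi + math.sqrt(2.0 * gamma) * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, grad)]
        chain.append(x)
    return chain


# Illustrative target: standard Gaussian, F(x) = ||x||^2 / 2, so grad F(x) = x.
rng = random.Random(0)
chain = ula(lambda x: x, x0=[5.0, -5.0], gamma=0.05, n_steps=50000, rng=rng)
samples = [x[0] for x in chain[2000:]]              # discard a burn-in period
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance of the first coordinate should be close to $0$ and to $1/(1 - \gamma/2) \approx 1.026$ respectively, up to Monte Carlo error.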
Remark 3. In this form, the ULA algorithm is limited to distributions where $F$ is a Lipschitz continuously differentiable function. However, in the imaging problems of interest this is usually not the case [49]. For example, to implement any of the algorithms presented in [49] it is necessary to sample from the posterior distribution $p(x|y,\theta)$ (corresponding to $\pi_\theta$ in Section 3.1), which would require setting for any $x \in \mathbb{R}^d$, $F(x) = f_y(x) + \theta^\top g(x)$. Similarly, one of the algorithms also requires sampling from the prior distribution $x \mapsto p(x|\theta)$ (corresponding to $\bar{\pi}_\theta$ in Section 3.1), which requires setting for any $x \in \mathbb{R}^d$, $F(x) = \theta^\top g(x)$. In both cases, if $g$ is not smooth then ULA cannot be directly applied. The MYULA kernel was designed precisely to overcome this limitation.
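3.2.1 Moreau-Yosida Unadjusted Langevin Algorithm
Suppose that the target potential admits a decomposition $F = V + U$ where $V$ is Lipschitz differentiable and $U$ is not smooth but convex over $\mathbb{R}^d$. In MYULA, the differentiable part is handled via the gradient $\nabla_x V$ in a manner akin to ULA, whereas the non-differentiable convex part is replaced by a smooth approximation $U^\lambda(x)$ given by the Moreau-Yosida envelope of $U$, see [5, Definition 12.20], defined for any $x \in \mathbb{R}^d$ and $\lambda > 0$ by
\[ U^\lambda(x) = \min_{\tilde{x} \in \mathbb{R}^d} \big\{ U(\tilde{x}) + (1/2\lambda) \|x - \tilde{x}\|_2^2 \big\} . \tag{7} \]
Similarly, we define the proximal operator for any $x \in \mathbb{R}^d$ and $\lambda > 0$ by
\[ \operatorname{prox}_U^\lambda(x) = \arg\min_{\tilde{x} \in \mathbb{R}^d} \big\{ U(\tilde{x}) + (1/2\lambda) \|x - \tilde{x}\|_2^2 \big\} . \tag{8} \]
For any $\lambda > 0$, the Moreau-Yosida envelope $U^\lambda$ is continuously differentiable with gradient given for any $x \in \mathbb{R}^d$ by
\[ \nabla U^\lambda(x) = (x - \operatorname{prox}_U^\lambda(x))/\lambda , \tag{9} \]
(see, e.g., [5, Proposition 16.44]). Using this approximation we obtain the MYULA kernel associated with $(X_k)_{k \in \mathbb{N}}$ given by $X_0 \in \mathbb{R}^d$ and the following recursion for any $k \in \mathbb{N}$
\[ \text{MYULA}: \quad X_{k+1} = X_k - \gamma \nabla_x V(X_k) - \gamma \nabla_x U^\lambda(X_k) + \sqrt{2\gamma}\, Z_{k+1} . \tag{10} \]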
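Returning to the imaging problems of interest, we define the MYULA families of Markov kernels $\{R_{\gamma,\theta}, \bar{R}_{\gamma,\theta} : \gamma > 0, \theta \in \Theta\}$ that we use in Algorithm 1 to target $\pi_\theta$ and $\bar{\pi}_\theta$ for $\theta \in \Theta$ as follows. By Remark 3, we set $V = f_y$ and $U = \theta^\top g$, $\bar{V} = 0$ and $\bar{U} = \theta^\top g$. Then, for any $\theta \in \Theta$ and $\gamma > 0$, $R_{\gamma,\theta}$ associated with $(X_k)_{k \in \mathbb{N}}$ is given by $X_0 \in \mathbb{R}^d$ and the following recursion for any $k \in \mathbb{N}$
\[ X_{k+1} = X_k - \gamma \nabla_x f_y(X_k) - \gamma \big\{ X_k - \operatorname{prox}^{\lambda}_{\theta^\top g}(X_k) \big\}/\lambda + \sqrt{2\gamma}\, Z_{k+1} . \tag{11} \]
Similarly, for any $\theta \in \Theta$ and $\gamma' > 0$, $\bar{R}_{\gamma',\theta}$ associated with $(\bar{X}_k)_{k \in \mathbb{N}}$ is given by $\bar{X}_0 \in \mathbb{R}^d$ and the following recursion for any $k \in \mathbb{N}$
\[ \bar{X}_{k+1} = \bar{X}_k - \gamma' \big\{ \bar{X}_k - \operatorname{prox}^{\lambda'}_{\theta^\top g}(\bar{X}_k) \big\}/\lambda' + \sqrt{2\gamma'}\, Z_{k+1} , \tag{12} \]
where we recall that $\lambda, \lambda' > 0$ are the smoothing parameters associated with the envelopes $(\theta^\top g)^{\lambda}$ and $(\theta^\top g)^{\lambda'}$, $\gamma, \gamma' > 0$ are the discretisation steps and $(Z_k)_{k \in \mathbb{N}}$ is a sequence of i.i.d. $d$-dimensional zero-mean Gaussian random variables with an identity covariance matrix.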
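To make the posterior recursion concrete, here is a minimal Python sketch of one MYULA step for the common choice $g = \|\cdot\|_1$ with a scalar $\theta > 0$, for which the proximal operator is entrywise soft-thresholding. The names `prox_l1` and `myula_step`, the denoising likelihood, and all numerical values are illustrative assumptions, not taken from the paper or from [49].

```python
import math
import random


def prox_l1(x, t):
    """Soft-thresholding: proximal operator of t * ||.||_1, applied entrywise."""
    return [math.copysign(max(abs(xi) - t, 0.0), xi) for xi in x]


def myula_step(x, grad_f, theta, gamma, lam, rng):
    """One MYULA step targeting the posterior, for g = ||.||_1 and scalar theta.

    x_{k+1} = x_k - gamma * grad_f(x_k) - (gamma / lam) * (x_k - prox(x_k)) + noise,
    where prox is the proximal operator of theta * g with parameter lam.
    """
    p = prox_l1(x, lam * theta)
    gf = grad_f(x)
    return [xi - gamma * gfi - (gamma / lam) * (xi - pi)
            + math.sqrt(2.0 * gamma) * rng.gauss(0.0, 1.0)
            for xi, gfi, pi in zip(x, gf, p)]


# Hypothetical denoising model: f_y(x) = ||x - y||^2 / (2 * sigma2), sigma2 = 0.25.
rng = random.Random(1)
y = [2.0, 0.0, -1.0]
grad_f = lambda x: [(xi - yi) / 0.25 for xi, yi in zip(x, y)]
x = list(y)
for _ in range(1000):
    x = myula_step(x, grad_f, theta=1.0, gamma=0.01, lam=0.02, rng=rng)
```

Setting `grad_f` to return zeros gives the corresponding step for the prior chain. The step size in the example is chosen small relative to the gradient Lipschitz constant and $1/\lambda$; this is a conservative illustrative choice rather than a recommended tuning.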
Notice that other ways of splitting the target potential $F$ can be straightforwardly implemented. For example, instead of a single non-smooth convex term $U$, one might choose a splitting involving several non-smooth terms to simplify the computation of the proximal operators (each term would be replaced by its Moreau-Yosida envelope in (6)). Similarly, although we usually associate $V, \bar{V}$ and $U, \bar{U}$ to the log-likelihood and the log-prior, some cases might benefit from a different splitting. Moreover, as illustrated in Section 3.2.2 below, other discrete approximations of the Langevin diffusion could be considered too.
3.2.2 Proximal Unadjusted Langevin Algorithm
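As an alternative to MYULA, one could also consider using the Proximal Unadjusted Langevin Algorithm (PULA) introduced in [21], which replaces the (forward) gradient step of MYULA by a composition of a backward and forward step. More precisely, PULA defines the Markov chain $(X_k)_{k \in \mathbb{N}}$ starting from $X_0 \in \mathbb{R}^d$ by the following recursion: for any $k \in \mathbb{N}$
\[ \text{PULA}: \quad X_{k+1} = \operatorname{prox}_U^\lambda(X_k) - \gamma \nabla_x V(\operatorname{prox}_U^\lambda(X_k)) + \sqrt{2\gamma}\, Z_{k+1} . \tag{13} \]
To highlight the connection with MYULA we note that for any $x \in \mathbb{R}^d$ and $\lambda > 0$, $\nabla U^\lambda(x) = (x - \operatorname{prox}_U^\lambda(x))/\lambda$ by [5, Proposition 12.30]. Therefore, if we set $\lambda = \gamma$ we obtain that (13) can be rewritten for any $k \in \mathbb{N}$ as
\[ X_{k+1} = X_k - \gamma \nabla_x U^\lambda(X_k) - \gamma \nabla_x V(\operatorname{prox}_U^\lambda(X_k)) + \sqrt{2\gamma}\, Z_{k+1} , \]
which corresponds to (10) with $\lambda = \gamma$, except that the term $\nabla_x V(X_k)$ in (10) is replaced by $\nabla_x V(\operatorname{prox}_U^\lambda(X_k))$.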
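Going back to the imaging problems of interest, to define the PULA families of Markov kernels $\{S_{\gamma,\theta}, \bar{S}_{\gamma,\theta} : \gamma > 0, \theta \in \Theta\}$ that we use in Algorithm 1 to target $\pi_\theta$ and $\bar{\pi}_\theta$ for $\theta \in \Theta$ we proceed as follows. We set $V = f_y$ and $U = \theta^\top g$, $\bar{V} = 0$ and $\bar{U} = \theta^\top g$. Then, by Remark 3, for any $\theta \in \Theta$ and $\gamma > 0$, $S_{\gamma,\theta}$ associated with $(X_k)_{k \in \mathbb{N}}$ is given by $X_0 \in \mathbb{R}^d$ and the following recursion for any $k \in \mathbb{N}$
\[ X_{k+1} = \operatorname{prox}^{\lambda}_{\theta^\top g}(X_k) - \gamma \nabla_x f_y(\operatorname{prox}^{\lambda}_{\theta^\top g}(X_k)) + \sqrt{2\gamma}\, Z_{k+1} . \tag{14} \]
Similarly, for any $\theta \in \Theta$ and $\gamma' > 0$, $\bar{S}_{\gamma',\theta}$ associated with $(\bar{X}_k)_{k \in \mathbb{N}}$ is given by $\bar{X}_0 \in \mathbb{R}^d$ and the following recursion for any $k \in \mathbb{N}$
\[ \bar{X}_{k+1} = \operatorname{prox}^{\lambda'}_{\theta^\top g}(\bar{X}_k) + \sqrt{2\gamma'}\, Z_{k+1} . \tag{15} \]
Recall that $\lambda, \lambda' > 0$ are the smoothing parameters associated with the envelopes $(\theta^\top g)^{\lambda}$ and $(\theta^\top g)^{\lambda'}$, $\gamma, \gamma' > 0$ are the discretisation steps and $(Z_k)_{k \in \mathbb{N}}$ is a sequence of i.i.d. $d$-dimensional zero-mean Gaussian random variables with an identity covariance matrix. Again, one could use PULA with a different splitting of $F$.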
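For comparison with MYULA, a PULA step for the posterior can be sketched as follows, again for the illustrative choice $g = \|\cdot\|_1$ with scalar $\theta > 0$. The function names and the quadratic likelihood used in the usage lines are assumptions made for this example.

```python
import math
import random


def prox_l1(x, t):
    """Soft-thresholding: proximal operator of t * ||.||_1, applied entrywise."""
    return [math.copysign(max(abs(xi) - t, 0.0), xi) for xi in x]


def pula_step(x, grad_f, theta, gamma, lam, rng):
    """One PULA step targeting the posterior, for g = ||.||_1 and scalar theta.

    Backward (proximal) step first, then a forward gradient step on f_y
    evaluated at the proximal point, then Gaussian noise.
    """
    p = prox_l1(x, lam * theta)
    gf = grad_f(p)
    return [pi - gamma * gfi + math.sqrt(2.0 * gamma) * rng.gauss(0.0, 1.0)
            for pi, gfi in zip(p, gf)]


# Hypothetical usage with f_y(x) = ||x - y||^2 / 2 and lam = gamma.
rng = random.Random(2)
y = [2.0, 0.0, -1.0]
grad_f = lambda x: [xi - yi for xi, yi in zip(x, y)]
x = list(y)
for _ in range(1000):
    x = pula_step(x, grad_f, theta=1.0, gamma=0.05, lam=0.05, rng=rng)
```

With `grad_f` returning zeros the step reduces to a proximal step plus noise, which is the form of the prior recursion.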
Finally, we note at this point that the MYULA and PULA kernels (11), (12), (14) and (15) do not target the posterior or prior distributions exactly but rather an approximation of these distributions. This is mainly due to two facts: 1) we are not able to use the exact Langevin diffusion (6), so we resort to a discrete approximation instead; and 2) we replace the non-differentiable terms with their Moreau-Yosida envelopes. As a result of these approximation errors, Algorithm 1 will exhibit some asymptotic estimation bias. This error is controlled by $\lambda, \lambda', \gamma, \gamma'$, and $\delta$, and can be made arbitrarily small at the expense of additional computing time, see Theorem 7 in Section 4.
4 Analysis of the convergence properties
4.1 Ergodicity properties of MYULA and PULA
Before establishing our main convergence results about Algorithm 1, see Section 4.2, we derive ergodicity properties on the Markov chains given by (10) and (13). We consider the following assumptions on $\pi_\theta$ and $\bar{\pi}_\theta$. These assumptions are satisfied for a large class of models in Bayesian imaging sciences, and in particular by the models considered in our companion paper [49].
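H1. For any $\theta \in \Theta$, there exist convex functions $V_\theta, \bar{V}_\theta, U_\theta, \bar{U}_\theta : \mathbb{R}^d \to [0,+\infty)$ satisfying the following conditions.
(a) For any $\theta \in \Theta$ and $x \in \mathbb{R}^d$,
\[ \pi_\theta(x) \propto \exp[-V_\theta(x) - U_\theta(x)] , \quad \bar{\pi}_\theta(x) \propto \exp[-\bar{V}_\theta(x) - \bar{U}_\theta(x)] , \]
and
\[ \min\Big\{ \inf_{\theta \in \Theta} \int_{\mathbb{R}^d} \exp[-V_\theta(\tilde{x}) - U_\theta(\tilde{x})]\,\mathrm{d}\tilde{x} ,\ \inf_{\theta \in \Theta} \int_{\mathbb{R}^d} \exp[-\bar{V}_\theta(\tilde{x}) - \bar{U}_\theta(\tilde{x})]\,\mathrm{d}\tilde{x} \Big\} > 0 . \tag{16} \]
(b) For any $\theta \in \Theta$, $V_\theta$ and $\bar{V}_\theta$ are continuously differentiable and there exists $\mathtt{L} > 0$ such that for any $\theta \in \Theta$ and $x, y \in \mathbb{R}^d$
\[ \max\big( \|\nabla_x V_\theta(x) - \nabla_x V_\theta(y)\| ,\ \|\nabla_x \bar{V}_\theta(x) - \nabla_x \bar{V}_\theta(y)\| \big) \leq \mathtt{L} \|x - y\| . \]
In addition, there exist $R_{V,1}, R_{V,2} \geq 0$ such that for any $\theta \in \Theta$, there exist $x_\theta^\star, \bar{x}_\theta^\star \in \mathbb{R}^d$ with $x_\theta^\star \in \arg\min_{\mathbb{R}^d} V_\theta$, $\bar{x}_\theta^\star \in \arg\min_{\mathbb{R}^d} \bar{V}_\theta$, $x_\theta^\star, \bar{x}_\theta^\star \in \bar{\mathrm{B}}(0, R_{V,1})$ and $V_\theta(x_\theta^\star), \bar{V}_\theta(\bar{x}_\theta^\star) \in \bar{\mathrm{B}}(0, R_{V,2})$.
(c) There exists $\mathtt{M} > 0$ such that for any $\theta \in \Theta$ and $x, y \in \mathbb{R}^d$
\[ \max\big( \|U_\theta(x) - U_\theta(y)\| ,\ \|\bar{U}_\theta(x) - \bar{U}_\theta(y)\| \big) \leq \mathtt{M} \|x - y\| . \]
In addition, there exist $R_{U,1}, R_{U,2} \geq 0$ such that for any $\theta \in \Theta$, there exist $x_\theta^\sharp, \bar{x}_\theta^\sharp \in \mathbb{R}^d$ with $x_\theta^\sharp, \bar{x}_\theta^\sharp \in \bar{\mathrm{B}}(0, R_{U,1})$ and $U_\theta(x_\theta^\sharp), \bar{U}_\theta(\bar{x}_\theta^\sharp) \in \bar{\mathrm{B}}(0, R_{U,2})$.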
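Note that (16) in H1-(a) is satisfied if $\Theta$ is compact and the functions $\theta \mapsto \int_{\mathbb{R}^d} \exp[-V_\theta(\tilde{x}) - U_\theta(\tilde{x})]\,\mathrm{d}\tilde{x}$ and $\theta \mapsto \int_{\mathbb{R}^d} \exp[-\bar{V}_\theta(\tilde{x}) - \bar{U}_\theta(\tilde{x})]\,\mathrm{d}\tilde{x}$ are continuous. This latter condition can then be easily verified using the Lebesgue dominated convergence theorem and some assumptions on $\{V_\theta, \bar{V}_\theta, U_\theta, \bar{U}_\theta : \theta \in \Theta\}$. Note that if there exists $V : \mathbb{R}^d \to [0,+\infty)$ such that for any $\theta \in \Theta$, $V_\theta = V$ and there exists $x^\star \in \mathbb{R}^d$ with $x^\star \in \arg\min_{\mathbb{R}^d} V$ then one can choose $x_\theta^\star = x^\star$ for any $\theta \in \Theta$ in H1-(b). In this case, $R_{V,2} = 0$. Similarly if for any $\theta \in \Theta$, $U_\theta(0) = 0$ then one can choose $x_\theta^\sharp = 0$ in H1-(c) and in this case $R_{U,1} = R_{U,2} = 0$. These conditions are satisfied by all the models studied in [49].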
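As emphasized in Section 3.1, we use a stochastic approximation proximal gradient approach to minimize $f$ and therefore we need to consider Monte Carlo estimators for $\nabla_\theta f(\theta)$, $\theta \in \Theta$. These estimators are derived from Markov chains targeting $\pi_\theta$ and $\bar{\pi}_\theta$ respectively. We consider two MCMC methodologies to construct the Markov chains. A first option, as proposed in Section 3.2.1, is to use MYULA to sample from $\pi_\theta$ and $\bar{\pi}_\theta$. Let $\kappa > 0$ and $\{R_{\gamma,\theta} : \gamma > 0, \theta \in \Theta\}$ be the family of kernels defined for any $x \in \mathbb{R}^d$, $\gamma > 0$, $\theta \in \Theta$ and $A \in \mathcal{B}(\mathbb{R}^d)$ by
\[ R_{\gamma,\theta}(x, A) = (4\pi\gamma)^{-d/2} \int_A \exp\Big( -\big\| y - x + \gamma \nabla_x V_\theta(x) + \kappa^{-1} \{ x - \operatorname{prox}^{\gamma\kappa}_{U_\theta}(x) \} \big\|^2 \big/ (4\gamma) \Big)\,\mathrm{d}y . \tag{17} \]
Note that (17) is the Markov kernel associated with the recursion (10) with $U \leftarrow U_\theta$, $V \leftarrow V_\theta$ and $\lambda \leftarrow \kappa\gamma$. For any $\gamma, \kappa > 0$ and $\theta \in \Theta$, $R_{\gamma,\theta}$ corresponds to $R_{\gamma,\kappa\gamma,\theta}$ in [49]. Consider also the family of Markov kernels $\{\bar{R}_{\gamma,\theta} : \gamma > 0, \theta \in \Theta\}$ such that for any $\gamma > 0$ and $\theta \in \Theta$, $\bar{R}_{\gamma,\theta}$ is the Markov kernel defined by (17) but with $\bar{U}_\theta$ and $\bar{V}_\theta$ in place of $U_\theta$ and $V_\theta$ respectively. The coefficient $\kappa$ is related to $\lambda$ in (11) by $\kappa = \lambda/\gamma$.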
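Moreover, although our companion paper [49] only considers the MYULA kernel, the theoretical results we present in this paper also hold if the algorithms are implemented using PULA [21]. Define the family $\{S_{\gamma,\theta} : \gamma > 0, \theta \in \Theta\}$, for any $x \in \mathbb{R}^d$, $\gamma > 0$, $\theta \in \Theta$ and $A \in \mathcal{B}(\mathbb{R}^d)$ by
\[ S_{\gamma,\theta}(x, A) = (4\pi\gamma)^{-d/2} \int_A \exp\Big( -\big\| y - \operatorname{prox}^{\gamma\kappa}_{U_\theta}(x) + \gamma \nabla_x V_\theta(\operatorname{prox}^{\gamma\kappa}_{U_\theta}(x)) \big\|^2 \big/ (4\gamma) \Big)\,\mathrm{d}y . \tag{18} \]
Note that (18) is the Markov kernel associated with the recursion (13) with $U \leftarrow U_\theta$, $V \leftarrow V_\theta$ and $\lambda \leftarrow \kappa\gamma$. Consider also the family of Markov kernels $\{\bar{S}_{\gamma,\theta} : \gamma > 0, \theta \in \Theta\}$ such that for any $\gamma > 0$ and $\theta \in \Theta$, $\bar{S}_{\gamma,\theta}$ is the Markov kernel defined by (18) but with $\bar{U}_\theta$ and $\bar{V}_\theta$ in place of $U_\theta$ and $V_\theta$ respectively. We use the results derived in [17] to analyse the sequence given by (5) with $\{(K_{\gamma,\theta}, \bar{K}_{\gamma,\theta}) : \gamma \in (0,\bar{\gamma}], \theta \in \Theta\} = \{(R_{\gamma,\theta}, \bar{R}_{\gamma,\theta}) : \gamma \in (0,\bar{\gamma}], \theta \in \Theta\}$ or $\{(S_{\gamma,\theta}, \bar{S}_{\gamma,\theta}) : \gamma \in (0,\bar{\gamma}], \theta \in \Theta\}$. To this end, we impose that for any $\gamma \in (0,\bar{\gamma}]$ and $\theta \in \Theta$, the kernels $K_{\gamma,\theta}$ and $\bar{K}_{\gamma,\theta}$ admit an invariant probability distribution, denoted by $\pi_{\gamma,\theta}$ and $\bar{\pi}_{\gamma,\theta}$ respectively, which are approximations of $\pi_\theta$ and $\bar{\pi}_\theta$ defined in A3, and geometrically converge towards them. More precisely, we show in Theorem 4 and Theorem 5 below that MYULA and PULA satisfy these conditions if at least one of the following assumptions is verified: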
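H2. There exists $\mathtt{m} > 0$ such that for any $\theta \in \Theta$, $V_\theta$ and $\bar{V}_\theta$ are $\mathtt{m}$-convex.
H3. There exist $\eta > 0$ and $c > 0$ such that for any $\theta \in \Theta$ and $x \in \mathbb{R}^d$, $\min(U_\theta(x), \bar{U}_\theta(x)) \geq \eta\|x\| - c$.
Note that if for any $\theta \in \Theta$, $U_\theta$ is convex on $\mathbb{R}^d$ and $\sup_{\theta \in \Theta} \big( \int_{\mathbb{R}^d} \exp[-U_\theta(\tilde{x})]\,\mathrm{d}\tilde{x} \big) < +\infty$, then H3 is automatically satisfied, as an immediate extension of [4, Lemma 2.2 (b)]. In [49], H3 is satisfied as soon as the prior distribution $x \mapsto p(x|\theta)$ is log-concave and proper for any $\theta \in \Theta$. In [49], if the prior $x \mapsto p(x|\theta)$ is improper for some $\theta \in \Theta$ then we require H2 to be satisfied, i.e. for any $y \in \mathbb{C}^{d_y}$, there exists $\mathtt{m} > 0$ such that for any $\theta \in \Theta$, $x \mapsto p(x|y,\theta)$ is $\mathtt{m}$-log-concave. Finally, we believe that H3 could be relaxed to the following condition: there exist $\eta > 0$ and $c > 0$ such that for any $\theta \in \Theta$ and $x \in \mathbb{R}^d$, $\min(U_\theta(x) + V_\theta(x), \bar{U}_\theta(x) + \bar{V}_\theta(x)) \geq \eta\|x\| - c$. In particular, this latter condition holds in the case where $x \mapsto p(x|\theta) = \exp[-\theta\,\mathrm{TV}(x)]$ and $\sup_{\theta \in \Theta} \big( \int_{\mathbb{R}^d} \exp[-U_\theta(\tilde{x}) - V_\theta(\tilde{x})]\,\mathrm{d}\tilde{x} \big) < +\infty$.
Consider for any $m \in \mathbb{N}^*$ and $\alpha > 0$ the two functions $W_m$ and $W_\alpha$ given for any $x \in \mathbb{R}^d$ by
\[ W_m(x) = 1 + \|x\|^{2m} , \quad W_\alpha(x) = \exp\Big( \alpha \sqrt{1 + \|x\|^2} \Big) . \tag{19} \]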
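Theorem 4. Assume H1 and H2 or H3. Let $\bar{\kappa} > 1 \geq \underline{\kappa} > 1/2$, $\bar{\gamma} < \min\{(2 - 1/\underline{\kappa})/\mathtt{L},\ 2/(\mathtt{m} + \mathtt{L})\}$ if H2 holds and $\bar{\gamma} < \min\{(2 - 1/\underline{\kappa})/\mathtt{L},\ \eta/(2\mathtt{M}\mathtt{L})\}$ if H3 holds. Then for any $a \in (0,1]$, there exist $\bar{A}_{2,a} > 0$ and $\bar{\rho}_a \in (0,1)$ such that for any $\theta \in \Theta$, $\kappa \in [\underline{\kappa}, \bar{\kappa}]$, $\gamma \in (0, \bar{\gamma}]$, $R_{\gamma,\theta}$ and $\bar{R}_{\gamma,\theta}$ admit invariant probability measures $\pi_{\gamma,\theta}$, respectively $\bar{\pi}_{\gamma,\theta}$. In addition, for any $x, y \in \mathbb{R}^d$ and $n \in \mathbb{N}$ we have
\[ \max\big( \|\delta_x R_{\gamma,\theta}^n - \pi_{\gamma,\theta}\|_{W^a} ,\ \|\delta_x \bar{R}_{\gamma,\theta}^n - \bar{\pi}_{\gamma,\theta}\|_{W^a} \big) \leq \bar{A}_{2,a}\, \bar{\rho}_a^{\gamma n}\, W^a(x) , \]
\[ \max\big( \|\delta_x R_{\gamma,\theta}^n - \delta_y R_{\gamma,\theta}^n\|_{W^a} ,\ \|\delta_x \bar{R}_{\gamma,\theta}^n - \delta_y \bar{R}_{\gamma,\theta}^n\|_{W^a} \big) \leq \bar{A}_{2,a}\, \bar{\rho}_a^{\gamma n}\, \{ W^a(x) + W^a(y) \} , \]
with $W = W_m$ and $m \in \mathbb{N}^*$ if H2 holds and $W = W_\alpha$ with $\alpha < \min(\underline{\kappa}\eta/4, \eta/8)$ if H3 holds.
Proof. The proof is postponed to Section 5.2.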
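Theorem 5. Assume H1 and H2 or H3. Let $\bar{\kappa} > 1 \geq \underline{\kappa} > 1/2$, $\bar{\gamma} < 2/(\mathtt{m} + \mathtt{L})$ if H2 holds and $\bar{\gamma} < 2/\mathtt{L}$ if H3 holds. Then for any $a \in (0,1]$, there exist $A_{2,a} > 0$ and $\rho_a \in (0,1)$ such that for any $\theta \in \Theta$, $\kappa \in [\underline{\kappa}, \bar{\kappa}]$, $\gamma \in (0, \bar{\gamma}]$, $S_{\gamma,\theta}$ and $\bar{S}_{\gamma,\theta}$ admit an invariant probability measure $\pi_{\gamma,\theta}$ and $\bar{\pi}_{\gamma,\theta}$ respectively. In addition, for any $x, y \in \mathbb{R}^d$ and $n \in \mathbb{N}$ we have
\[ \max\big( \|\delta_x S_{\gamma,\theta}^n - \pi_{\gamma,\theta}\|_{W^a} ,\ \|\delta_x \bar{S}_{\gamma,\theta}^n - \bar{\pi}_{\gamma,\theta}\|_{W^a} \big) \leq A_{2,a}\, \rho_a^{\gamma n}\, W^a(x) , \]
\[ \max\big( \|\delta_x S_{\gamma,\theta}^n - \delta_y S_{\gamma,\theta}^n\|_{W^a} ,\ \|\delta_x \bar{S}_{\gamma,\theta}^n - \delta_y \bar{S}_{\gamma,\theta}^n\|_{W^a} \big) \leq A_{2,a}\, \rho_a^{\gamma n}\, \{ W^a(x) + W^a(y) \} , \]
with $W = W_m$ and $m \in \mathbb{N}^*$ if H2 holds and $W = W_\alpha$ with $\alpha < \underline{\kappa}\eta/4$ if H3 holds.
Proof. The proof is postponed to Section 5.3.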
4.2 Main results
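We now state our main results regarding the convergence of the sequence defined by (5) under the following additional regularity assumption.
H4. There exist $M_\Theta > 0$ and $f_\Theta \in \mathrm{C}(\mathbb{R}_+, \mathbb{R}_+)$ such that for any $\theta_1, \theta_2 \in \Theta$, $x \in \mathbb{R}^d$,
\[ \max\big( \|\nabla_x V_{\theta_1}(x) - \nabla_x V_{\theta_2}(x)\| ,\ \|\nabla_x \bar{V}_{\theta_1}(x) - \nabla_x \bar{V}_{\theta_2}(x)\| \big) \leq M_\Theta \|\theta_1 - \theta_2\| (1 + \|x\|) , \]
\[ \max\big( \|\nabla_x U_{\theta_1}^\kappa(x) - \nabla_x U_{\theta_2}^\kappa(x)\| ,\ \|\nabla_x \bar{U}_{\theta_1}^\kappa(x) - \nabla_x \bar{U}_{\theta_2}^\kappa(x)\| \big) \leq f_\Theta(\kappa) \|\theta_1 - \theta_2\| (1 + \|x\|) . \]
In Theorem 6, we give sufficient conditions on the parameters of the algorithm under which the sequence $(\theta_n)_{n \in \mathbb{N}}$ converges a.s., and we give explicit convergence rates in Theorem 7.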
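Theorem 6. Assume A1, A2, A3 and that $f$ is convex. Let $\kappa \in [\underline{\kappa}, \bar{\kappa}]$ with $\bar{\kappa} \geq 1 \geq \underline{\kappa} > 1/2$. Assume H1 and one of the following conditions:
(a) H2 holds, $\bar{\gamma} < \min(2/(\mathtt{m} + \mathtt{L}),\ (2 - 1/\kappa)/\mathtt{L},\ \mathtt{L}^{-1})$ and there exists $m \in \mathbb{N}^*$ and $C_m > 0$ such that for any $\theta \in \Theta$ and $x \in \mathbb{R}^d$, $\|H_\theta(x)\| \leq C_m W_m^{1/4}(x)$ and $\|\bar{H}_\theta(x)\| \leq C_m W_m^{1/4}(x)$.
(b) H3 holds, $\bar{\gamma} < \min((2 - 1/\kappa)/\mathtt{L},\ \eta/(2\mathtt{M}\mathtt{L}),\ \mathtt{L}^{-1})$ and there exists $0 < \alpha < \eta/4$, $C_\alpha > 0$ such that for any $\theta \in \Theta$ and $x \in \mathbb{R}^d$, $\|H_\theta(x)\| \leq C_\alpha W_\alpha^{1/4}(x)$ and $\|\bar{H}_\theta(x)\| \leq C_\alpha W_\alpha^{1/4}(x)$.
Let $(\gamma_n)_{n \in \mathbb{N}}$, $(\delta_n)_{n \in \mathbb{N}}$ be sequences of non-increasing positive real numbers and $(m_n)_{n \in \mathbb{N}}$ be a sequence of non-decreasing positive integers satisfying $\delta_0 < 1/L_f$ and $\gamma_0 < \bar{\gamma}$. Let $(\{(X_k^n, \bar{X}_k^n) : k \in \{0,\dots,m_n\}\}, \theta_n)_{n \in \mathbb{N}}$ be given by (5). In addition, assume that $\sum_{n=0}^{+\infty} \delta_{n+1} = +\infty$, $\sum_{n=0}^{+\infty} \delta_{n+1} \gamma_n^{1/2} < +\infty$ and that one of the following conditions holds:
(1) $\sum_{n=0}^{+\infty} \delta_{n+1}/(m_n \gamma_n) < +\infty$;
(2) $m_n = m_0 \in \mathbb{N}^*$ for all $n \in \mathbb{N}$, $\sup_{n \in \mathbb{N}} |\delta_{n+1} - \delta_n| \delta_n^{-2} < +\infty$, H4 holds and we have $\sum_{n=0}^{+\infty} \delta_{n+1}^2 \gamma_n^{-2} < +\infty$, $\sum_{n=0}^{+\infty} \delta_{n+1} \gamma_{n+1}^{-3} (\gamma_n - \gamma_{n+1}) < +\infty$.
Then $(\theta_n)_{n \in \mathbb{N}}$ converges a.s. to some $\theta^\star \in \arg\min_\Theta f$. Furthermore, a.s. there exists $C > 0$ such that for any $n \in \mathbb{N}^*$
\[ \Big\{ \sum_{k=1}^n \delta_k f(\theta_k) \Big/ \sum_{k=1}^n \delta_k \Big\} - \min_\Theta f \leq C \Big/ \Big( \sum_{k=1}^n \delta_k \Big) . \]
Proof. The proof is postponed to Section 5.6.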
These results are similar to the ones identified in [17, Theorem 1, Theorem 5, Theorem 6] for the Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm. Note that in SOUL the potential is assumed to be differentiable and the sampler is given by ULA, whereas in Theorem 6, the results are stated for PULA and MYULA samplers.
Although rigorously establishing convexity of $f$ is usually not possible for imaging models, we expect that in many cases, for any of its minimizers $\theta^\star$, $f$ is convex in some neighborhood of $\theta^\star$. For example, this is the case if its Hessian is positive definite around this point.
Assume that $\delta_n \sim n^{-a}$, $\gamma_n \sim n^{-b}$ and $m_n \sim n^{c}$ with $a, b, c \geq 0$. We now distinguish two cases depending on whether for all $n \in \mathbb{N}$, $m_n = m_0 \in \mathbb{N}^*$ (fixed batch size) or not (increasing batch size).
1) In the increasing batch size case, Theorem 6 ensures that $(\theta_n)_{n \in \mathbb{N}}$ converges if the following inequalities are satisfied
\[ a + b/2 > 1 , \quad a - b + c > 1 , \quad a \leq 1 . \tag{20} \]
Note in particular that $c > 0$, i.e. the number of Markov chain iterates required to compute the estimator of the gradient increases at each step. However, for any $a \in [0,1]$ there exist $b, c > 0$ such that (20) is satisfied. In the special setting where $a = 0$, (20) holds with $b = 2 + \varepsilon_1$ and $c = 3 + \varepsilon_2$ for any $\varepsilon_2 > \varepsilon_1 > 0$.
2) In the fixed batch size case, which implies that $c = 0$, Theorem 6 ensures that $(\theta_n)_{n \in \mathbb{N}}$ converges if the following inequalities are satisfied
\[ a + b/2 > 1 , \quad 2(a - b) > 1 , \quad a + b + 1 - 3b > 1 , \quad a \leq 1 , \]
which can be rewritten as
\[ b \in \big( 2(1 - a),\ \min(a - 1/2,\ a/2) \big) , \quad a \in [0,1] . \]
The interval $(2(1 - a), \min(a - 1/2, a/2))$ is then non-empty if and only if $a \in (5/6, 1]$.
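The exponent conditions for polynomial schedules are easy to check numerically. The following Python sketch encodes (20) and the interval form of the fixed batch size conditions, $b \in (2(1-a), \min(a - 1/2, a/2))$ with $a \in [0,1]$; the function names are hypothetical and the helper is only a convenience for choosing schedules.

```python
def increasing_batch_ok(a, b, c):
    """Conditions (20): a + b/2 > 1, a - b + c > 1 and a <= 1."""
    return a + b / 2 > 1 and a - b + c > 1 and a <= 1


def fixed_batch_ok(a, b):
    """Fixed batch size: b in (2(1 - a), min(a - 1/2, a/2)) with a in [0, 1]."""
    return 0 <= a <= 1 and 2 * (1 - a) < b < min(a - 0.5, a / 2)


# Example: a = 0 with b = 2 + eps1, c = 3 + eps2 and eps2 > eps1 > 0.
example_increasing = increasing_batch_ok(0.0, 2.5, 4.0)
# Example: a = 0.9 admits b = 0.3 with a fixed batch size.
example_fixed = fixed_batch_ok(0.9, 0.3)
```

Scanning `fixed_batch_ok(a, b)` over a grid of `b` for a given `a` shows whether any admissible decay rate for the discretisation step exists; for instance no `b` works for `a = 0.8`, in line with the threshold $a > 5/6$.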
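Theorem 7. Assume A1, A2, A3 and that $f$ is convex. Let $\kappa \in [\underline{\kappa}, \bar{\kappa}]$ with $\bar{\kappa} \geq 1 \geq \underline{\kappa} > 1/2$. Assume H1 and that the condition (a) or (b) in Theorem 6 is satisfied. Let $(\gamma_n)_{n \in \mathbb{N}}$, $(\delta_n)_{n \in \mathbb{N}}$ be sequences of non-increasing positive real numbers and $(m_n)_{n \in \mathbb{N}}$ be a sequence of non-decreasing positive integers satisfying $\delta_0 < 1/L_f$ and $\gamma_0 < \bar{\gamma}$. Let $(\{(X_k^n, \bar{X}_k^n) : k \in \{0,\dots,m_n\}\}, \theta_n)_{n \in \mathbb{N}}$ be given by (5). Then
\[ \mathbb{E}\Big[ \Big\{ \sum_{k=1}^n \delta_k f(\theta_k) \Big/ \sum_{k=1}^n \delta_k \Big\} - \min_\Theta f \Big] \leq E_n \Big/ \Big( \sum_{k=1}^n \delta_k \Big) , \]
where
(a)
\[ E_n = C_1 \Big\{ 1 + \sum_{k=0}^{n-1} \delta_{k+1} \gamma_k^{1/2} + \sum_{k=0}^{n-1} \delta_{k+1}/(m_k \gamma_k) + \sum_{k=0}^{n-1} \delta_{k+1}^2/(m_k \gamma_k)^2 \Big\} . \tag{21} \]
(b) or if $m_n = m_0$ for all $n \in \mathbb{N}$, $\sup_{n \in \mathbb{N}} |\delta_{n+1} - \delta_n| \delta_n^{-2} < +\infty$ and H4 holds
\[ E_n = C_2 \Big\{ 1 + \sum_{k=0}^{n-1} \delta_{k+1} \gamma_k^{1/2} + \sum_{k=0}^{n-1} \delta_{k+1}^2 \gamma_k^{-2} + \sum_{k=0}^{n-1} \delta_{k+1} \gamma_{k+1}^{-3} (\gamma_k - \gamma_{k+1}) \Big\} . \tag{22} \]
Proof. The proof is postponed to Section 5.7.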
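First, note that if the stepsize is fixed and recalling that $\kappa = \lambda/\gamma$ then the condition $\gamma < (2 - 1/\kappa)/L$
can be rewritten as $\gamma < 2/(L + \lambda^{-1})$. Assume that $(\delta_n)_{n \in \mathbb{N}}$ is non-increasing, $\lim_{n \to +\infty} \delta_n = 0$,
$\lim_{n \to +\infty} m_n = +\infty$ and $\gamma_n = \gamma_0 > 0$ for all $n \in \mathbb{N}$. In addition, assume that $\sum_{n \in \mathbb{N}} \delta_n = +\infty$;
then, by [37, Problem 80, Part I], it holds that
$$\begin{cases} \lim_{n \to +\infty} \left[ \left( \sum_{k=1}^n \delta_k/m_k \right) \Big/ \left( \sum_{k=1}^n \delta_k \right) \right] = \lim_{n \to +\infty} 1/m_n = 0 ; \\ \lim_{n \to +\infty} \left[ \left( \sum_{k=1}^n \delta_k^2 \right) \Big/ \left( \sum_{k=1}^n \delta_k \right) \right] = \lim_{n \to +\infty} \delta_n = 0 . \end{cases} \tag{23}$$
Therefore, using (21) we obtain that
$$\limsup_{n \to +\infty} \mathbb{E}\left[ \left\{ \sum_{k=1}^n \delta_k f(\theta_k) \Big/ \sum_{k=1}^n \delta_k \right\} - \min_\Theta f \right] \leq C_1 \sqrt{\gamma_0} .$$
Similarly, if the stepsize is fixed and the number of Markov chain iterates is fixed, i.e. for all $n \in \mathbb{N}$,
$\gamma_n = \gamma_0$ and $m_n = m_0$ with $\gamma_0 > 0$ and $m_0 \in \mathbb{N}^*$, combining (22) and (23) we obtain that
$$\limsup_{n \to +\infty} \mathbb{E}\left[ \left\{ \sum_{k=1}^n \delta_k f(\theta_k) \Big/ \sum_{k=1}^n \delta_k \right\} - \min_\Theta f \right] \leq C_2 \sqrt{\gamma_0} .$$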
5 Proof of the main results
In this section, we gather the proofs of Section 4. First, in Section 5.1 we derive some useful
technical lemmas. In Section 5.2, we prove Theorem 4, using minorisation and Foster-Lyapunov
drift conditions. Similarly, we prove Theorem 5 in Section 5.3. Next, we show Theorem 6 by
applying [17, Theorem 1, Theorem 3] and Theorem 7 by applying [17, Theorem 2, Theorem
4], which boils down to verifying that [17, H1, H2] are satisfied. In Section 5.4, we show that
[17, H1, H2] hold if the sequence is given by (5) where $\{(K_{\gamma,\theta}, \bar{K}_{\gamma,\theta}) : \gamma \in (0, \bar{\gamma}], \theta \in \Theta\} =
\{(S_{\gamma,\theta}, \bar{S}_{\gamma,\theta}) : \gamma \in (0, \bar{\gamma}], \theta \in \Theta\}$ defined in (18), i.e. we consider PULA as a sampling scheme
in the optimization algorithm. In Section 5.5 we check that [17, H1, H2] are satisfied when
$\{(K_{\gamma,\theta}, \bar{K}_{\gamma,\theta}) : \gamma \in (0, \bar{\gamma}], \theta \in \Theta\} = \{(R_{\gamma,\theta}, \bar{R}_{\gamma,\theta}) : \gamma \in (0, \bar{\gamma}], \theta \in \Theta\}$ defined in (17), i.e. when
considering MYULA as a sampling scheme. Finally, we prove Theorem 6 in Section 5.6 and
Theorem 7 in Section 5.7.
5.1 Technical lemmas
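We say that a Markov kernel $R$ on $\mathbb{R}^d \times \mathcal{B}(\mathbb{R}^d)$ satisfies a discrete Foster-Lyapunov drift condition
$D_d(W, \lambda, b)$ if there exist $\lambda \in (0,1)$, $b \geq 0$ and a measurable function $W : \mathbb{R}^d \to [1, +\infty)$ such that
for all $x \in \mathbb{R}^d$
$$R W(x) \leq \lambda W(x) + b .$$
We will use the following result.
Lemma 8. Let $R$ be a Markov kernel on $\mathbb{R}^d \times \mathcal{B}(\mathbb{R}^d)$ which satisfies $D_d(W, \lambda^\gamma, b\gamma)$ with $\lambda \in (0,1)$,
$b \geq 0$, $\gamma > 0$ and a measurable function $W : \mathbb{R}^d \to [1, +\infty)$. Then, we have for any $x \in \mathbb{R}^d$
$$R^{\lceil 1/\gamma \rceil} W(x) \leq (1 + b \log^{-1}(1/\lambda) \lambda^{-\bar{\gamma}}) W(x) .$$
Proof. Using [17, Lemma 9] we have for any $x \in \mathbb{R}^d$
$$R^{\lceil 1/\gamma \rceil} W(x) \leq \left\{ \lambda^{\gamma \lceil 1/\gamma \rceil} + b\gamma \sum_{k=0}^{\lceil 1/\gamma \rceil - 1} \lambda^{\gamma k} \right\} W(x) \leq (1 + b \log^{-1}(1/\lambda) \lambda^{-\bar{\gamma}}) W(x) .$$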
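As an illustration of Lemma 8, the scalar recursion behind the drift condition can be iterated directly. This is a minimal sketch under the assumption that $\gamma \in (0, \bar{\gamma}]$ and $w_0 \geq 1$ (mirroring $W \geq 1$); the function names are hypothetical and the recursion is the worst case allowed by $D_d(W, \lambda^\gamma, b\gamma)$, not an actual Markov kernel.

```python
import math


def iterate_drift(w0, lam, b, gamma):
    """Apply w <- lam**gamma * w + b*gamma for ceil(1/gamma) steps."""
    w = w0
    for _ in range(math.ceil(1 / gamma)):
        w = lam ** gamma * w + b * gamma
    return w


def lemma8_bound(w0, lam, b, gamma_bar):
    """Right-hand side of Lemma 8: (1 + b * lam**(-gamma_bar) / log(1/lam)) * w0."""
    return (1 + b * lam ** (-gamma_bar) / math.log(1 / lam)) * w0
```

The bound does not depend on $\gamma$, which is what makes it useful when the stepsize varies along the iterations.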
We continue this section by giving some results on proximal operators. Some of them are
well-known but their proof is given for completeness.
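Lemma 9. Let $\kappa > 0$ and $U : \mathbb{R}^d \to \mathbb{R}$ convex. Assume that $U$ is $M$-Lipschitz with $M \geq 0$, then
$U^\kappa$ is $M$-Lipschitz and for any $x \in \mathbb{R}^d$, $\|x - \mathrm{prox}^\kappa_U(x)\| \leq \kappa M$.
Proof. Let $\kappa > 0$. We have for any $x, y \in \mathbb{R}^d$ by (7) and (8)
$$\begin{aligned} U^\kappa(x) - U^\kappa(y) &= \|x - \mathrm{prox}^\kappa_U(x)\|^2/(2\kappa) + U(\mathrm{prox}^\kappa_U(x)) - \|y - \mathrm{prox}^\kappa_U(y)\|^2/(2\kappa) - U(\mathrm{prox}^\kappa_U(y)) \\ &\leq \|y - \mathrm{prox}^\kappa_U(y)\|^2/(2\kappa) + U(x - y + \mathrm{prox}^\kappa_U(y)) - \|y - \mathrm{prox}^\kappa_U(y)\|^2/(2\kappa) - U(\mathrm{prox}^\kappa_U(y)) \\ &\leq M \|x - y\| . \end{aligned}$$
Hence, $U^\kappa$ is $M$-Lipschitz. Since by [5, Proposition 12.30], $U^\kappa$ is continuously differentiable we
have for any $x \in \mathbb{R}^d$, $\|\nabla U^\kappa(x)\| \leq M$. Combining this result with the fact that for any $x \in \mathbb{R}^d$,
$\nabla U^\kappa(x) = (x - \mathrm{prox}^\kappa_U(x))/\kappa$ by [5, Proposition 12.30] concludes the proof.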
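Lemma 10. Let $U : \mathbb{R}^d \to [0, +\infty)$ be a convex and $M$-Lipschitz function with $M \geq 0$. Then for
any $\kappa > 0$ and $z, z' \in \mathbb{R}^d$,
$$\langle \mathrm{prox}^\kappa_U(z) - z, z \rangle \leq -\kappa U(z) + \kappa^2 M^2 + \kappa \{U(z') + M \|z'\|\} .$$
Proof. Let $\kappa > 0$ and $z, z' \in \mathbb{R}^d$. Since $(z - \mathrm{prox}^\kappa_U(z))/\kappa \in \partial U(\mathrm{prox}^\kappa_U(z))$ [5, Proposition 16.44], we
have
$$\begin{aligned} \kappa \{U(z') - U(\mathrm{prox}^\kappa_U(z))\} &\geq \langle z - \mathrm{prox}^\kappa_U(z), z' - \mathrm{prox}^\kappa_U(z) \rangle \\ &\geq \langle z - \mathrm{prox}^\kappa_U(z), z' - z \rangle + \|z - \mathrm{prox}^\kappa_U(z)\|^2 \\ &\geq \langle z - \mathrm{prox}^\kappa_U(z), z' - z \rangle . \end{aligned}$$
Combining this result, the fact that $U$ is $M$-Lipschitz and Lemma 9 we get that
$$\begin{aligned} \langle \mathrm{prox}^\kappa_U(z) - z, z \rangle &\leq \kappa U(z') - \kappa U(z) + \kappa M \|z - \mathrm{prox}^\kappa_U(z)\| + \|z'\| \|z - \mathrm{prox}^\kappa_U(z)\| \\ &\leq -\kappa U(z) + \kappa^2 M^2 + \kappa \{U(z') + M \|z'\|\} , \end{aligned}$$
which concludes the proof.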
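Lemma 11. Let $\kappa_1, \kappa_2 > 0$ and $U : \mathbb{R}^d \to \mathbb{R}$ convex and lower semi-continuous. For any $x \in \mathbb{R}^d$
we have
$$\|\mathrm{prox}^{\kappa_1}_U(x) - \mathrm{prox}^{\kappa_2}_U(x)\|^2 \leq 2(\kappa_1 - \kappa_2)(U(\mathrm{prox}^{\kappa_2}_U(x)) - U(\mathrm{prox}^{\kappa_1}_U(x))) .$$
If in addition, $U$ is $M$-Lipschitz with $M \geq 0$ then
$$\|\mathrm{prox}^{\kappa_1}_U(x) - \mathrm{prox}^{\kappa_2}_U(x)\| \leq 2M |\kappa_1 - \kappa_2| .$$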
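Proof. Let $x \in \mathbb{R}^d$. By definition of $\mathrm{prox}^{\kappa_1}_U(x)$ we have
$$2\kappa_1 U(\mathrm{prox}^{\kappa_1}_U(x)) + \|x - \mathrm{prox}^{\kappa_1}_U(x)\|^2 \leq 2\kappa_1 U(\mathrm{prox}^{\kappa_2}_U(x)) + \|x - \mathrm{prox}^{\kappa_2}_U(x)\|^2 .$$
Combining this result and the fact that $(x - \mathrm{prox}^{\kappa_2}_U(x))/\kappa_2 \in \partial U(\mathrm{prox}^{\kappa_2}_U(x))$ we have
$$\begin{aligned} \|\mathrm{prox}^{\kappa_1}_U(x) - \mathrm{prox}^{\kappa_2}_U(x)\|^2 &\leq 2\kappa_1 \{U(\mathrm{prox}^{\kappa_2}_U(x)) - U(\mathrm{prox}^{\kappa_1}_U(x))\} + 2 \langle x - \mathrm{prox}^{\kappa_2}_U(x), \mathrm{prox}^{\kappa_1}_U(x) - \mathrm{prox}^{\kappa_2}_U(x) \rangle \\ &\leq 2\kappa_1 \{U(\mathrm{prox}^{\kappa_2}_U(x)) - U(\mathrm{prox}^{\kappa_1}_U(x))\} + 2\kappa_2 \{U(\mathrm{prox}^{\kappa_1}_U(x)) - U(\mathrm{prox}^{\kappa_2}_U(x))\} \\ &\leq 2(\kappa_1 - \kappa_2)(U(\mathrm{prox}^{\kappa_2}_U(x)) - U(\mathrm{prox}^{\kappa_1}_U(x))) , \end{aligned}$$
which concludes the proof.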
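The bounds of Lemma 9 and Lemma 11 can be checked on a simple example. The sketch below assumes the scalar potential $U(t) = M|t|$, which is convex and $M$-Lipschitz and whose proximal operator is soft-thresholding; this choice and the function names are illustrative assumptions, not taken from the paper.

```python
def prox_abs(x, kappa, M=1.0):
    """Proximal operator of U(t) = M*|t| with parameter kappa (soft-thresholding)."""
    shrunk = max(abs(x) - kappa * M, 0.0)
    return shrunk if x >= 0 else -shrunk


def moreau_grad(x, kappa, M=1.0):
    """Gradient of the Moreau-Yosida envelope: (x - prox(x)) / kappa."""
    return (x - prox_abs(x, kappa, M)) / kappa


def lemma11_gap(x, k1, k2, M=1.0):
    """Return (lhs, rhs) of the first inequality of Lemma 11 for U(t) = M*|t|."""
    p1, p2 = prox_abs(x, k1, M), prox_abs(x, k2, M)
    return (p1 - p2) ** 2, 2 * (k1 - k2) * (M * abs(p2) - M * abs(p1))
```

For this potential, $|x - \mathrm{prox}^\kappa_U(x)| = \min(|x|, \kappa M)$, so the bound of Lemma 9 is attained as soon as $|x| \geq \kappa M$.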
Lemma 12. Let V:RdRm-convex and continuously differentiable with m>0. Assume that
there exists M > 0such that for any x, y Rd
k∇V(x) V(y)k6Mkxyk.
Assume that there exists xarg minRdV, then for any γ(0,¯γ]with ¯γ < 2/(M+m)and xRd
kxγV(x)k26(1 γ)kxk2+γ{(2/(m+M)¯γ)1+ 4}kxk2,
with =mM/(m+M).
Proof. Let xRd,γ(0,¯γ]and ¯γ < 2/(m+M). Using [36, Theorem 2.1.11] and the fact that for
any a, b, ε > 0,εa2+b2 >2ab we have
kxγV(x)k2
6kxk22γh∇V(x) V(x), x xi+γ¯γkV(x) V(x)k2
+ 2γkxkk∇V(x) V(x)k
6kxk22γ kxxk2γ(2/(m+M)¯γ)k∇V(x) V(x)k2
+ 2γkxkk∇V(x) V(x)k
6kxk22γ kxxk2γ(2/(m+M)¯γ)k∇V(x) V(x)k2
+γ(2/(m+M)¯γ)k∇V(x) V(x)k2+γ/(2/(m+M)¯γ)kxk2
6(1 2γ)kxk2+ 4γ kxk kxk+γ/(2/(m+M)¯γ)kxk2
6(1 γ)kxk2+γ(2/(m+M)¯γ)1+ 4kxk2.
Lemma 13. Assume H1and H2. Then for any κ > 0,θΘ,γ(0,¯γ]with ¯γ < 2/(m+L)and
xRd, we have
proxγκ
Uθ(x)γxVθ(proxγκ
Uθ(x))
2
6(1 γ/2) kxk2+γ¯γκ2M2+(2/(m+L)¯γ)1+ 4R2
V,1+2κ2M21,
with =mL/(m+L).
Proof. Let κ > 0,θΘ,γ(0,¯γ]and xRd. Using H1,H2, Lemma 9, Lemma 12, the
Cauchy-Schwarz inequality and that for any α, β >0,maxtR(αt2+ 2βt) = β2, we have
proxγκ
Uθ(x)γxVθ(proxγκ
Uθ(x))
2
6(1 γ)
proxγκ
Uθ(x)
2+γ(2/(m+L)¯γ)1+ 4kx
θk2
6(1 γ)
xproxγκ
Uθ(x)x
2+γ(2/(m+L)¯γ)1+ 4R2
V,1
6(1 γ)kxk2+γ2κ2M2+ 2γκMkxk+γ(2/(m+L)¯γ)1+ 4R2
V,1
6(1 γ/2) kxk2+γ2κ2M2+γ(2/(m+L)¯γ)1+ 4R2
V,1+ 2γκMkxk γ kxk2/2
6(1 γ/2) kxk2+γ¯γκ2M2+γ(2/(m+L)¯γ)1+ 4R2
V,1+ 2γκ2M21.
Lemma 14. Assume H1and H3. Then for any κ > 0,θΘ,γ(0,¯γ]with ¯γ < 2/Land
xRd, we have
proxγκ
Uθ(x)γxVθ(proxγκ
Uθ(x))
26kxk2+γγκ2M2+ 2κc+ 2κ(RU,2+MRU,1)
+(2/L ¯γ)1R2
V,12κη kxk.
Proof. Let κ > 0,θΘ,γ(0,¯γ]and xRd. Using H1,H3, Lemma 9and Lemma 10 and
Lemma 12 we have
proxγκ
Uθ(x)γxVθ(proxγκ
Uθ(x))
26kproxγκ
Uθ(x)k2+γ/(2/L¯γ)R2
V,1
6kxk2+γ2κ2M2+ 2hproxγκ
Uθ(x)x, xi+γ/(2/L¯γ)R2
V,1
6kxk2+ 3γ2κ2M22γκU (x) + 2γκ(U(x
θ) + Mkx
θk) + γ/(2/L¯γ)R2
V,1
6kxk2+ 3γ2κ2M22γκη kxk+ 2γ κc
+ 2γκ(U(x
θ) + Mkx
θk) + γ/(2/L¯γ)R2
V,1
6kxk2+γγκ2M2+ 2κc+ 2κ(RU,2+MRU,1) + (2/L ¯γ)1R2
V,12κη kxk.
Lemma 15. Assume H1and H2. Then for any κ > 0,θΘ,γ(0,¯γ]with ¯γ < 2/(m+L)and
xRd, we have
kxγxVθ(x)γxUγκ
θ(x)k26(1 γ/2) kxk2
+γ(2/(m+L)¯γ)1+ 4R2
V,1+ 2γ2MLRV ,1+γ2M2+ 2γM2(1 + ¯γL)21,
with =mL/(2m+ 2L).
Proof. Let κ > 0,θΘ,γ(0,¯γ]and xRd. Using H1,H2, Lemma 9, Lemma 12 and that for
any α, β >0,max(αt2+ 2βt) = β2 we have
kxγxVθ(x)γxUγκ
θ(x)k2
6kxγxVθ(x)k2+ 2γMkxγ{∇xVθ(x) xVθ(x
θ)}k +γ2M2
6(1 γ)kxk2+γ(2/(m+L)¯γ)1+ 4kx
θk2
+ 2γMkxk+ 2γ2Mk∇xVθ(x) xVθ(x
θ)k+γ2M2
6(1 γ)kxk2+γ(2/(m+L)¯γ)1+ 4kx
θk2
+ 2γMkxk+ 2γ2ML kxk+ 2γ2ML kx
θk+γ2M2
6(1 γ/2) kxk2+γ(2/(m+L)¯γ)1+ 4R2
V,1
+ 2γ2MLRV,1+γ2M2+ 2γM(1 + ¯γL)kxk γ kxk2/2
6(1 γ/2) kxk2+γ(2/(m+L)¯γ)1+ 4R2
V,1
+ 2γ2MLRV,1+γ2M2+ 2γM2(1 + ¯γL)21.
Lemma 16. Assume H1and H3. Then for any κ > 0,θΘ,xRdand γ(0,¯γ]with
¯γ < min(2/L, η/(2ML)), we have
kxγxVθ(x)γxUγκ
θ(x)k2
6kxk2+γ(2/L ¯γ)1R2
V,1+ 3¯γM2+ 2c+ 2(MRU,1+RU,2) + 2¯γMLRV,2ηkxk.
Proof. Let κ > 0,θΘ,γ(0,¯γ]and xRd. Using H1,H3, (7), Lemma 9and Lemma 10 we
have
kxγxVθ(x)γxUγκ
θ(x)k2
6kxγxVθ(x)k22γhxγxVθ(x),xUγκ
θ(x)i+γ2M2
6kxγxVθ(x)k22κ1hxγxVθ(x), x proxγκ
Uθ(x)i+γ2M2
6kxγxVθ(x)k22κ1hx, x proxγκ
Uθ(x)i+ 2κ1γk∇xVθ(x)kkxproxγκ
Uθ(x)k+γ2M2
6kxγxVθ(x)k2+ 3γ2M22γη kxk+ 2γc+ 2γ(Mkx
θk+U(x
θ)) + 2γ¯γMk∇xVθ(x)k
6kxγxVθ(x)k2+ 3γ¯γM22γη kxk
+ 2γc+ 2γ(MRU,1+RU,2) + 2γ¯γML kxk+ 2γ¯γML kx
θk
6kxγxVθ(x)k2+ 3γ¯γM2γη kxk+ 2γc+ 2γ(MRU,1+RU,2) + 2γ¯γML kx
θk,
where we have used for the last inequality that ¯γ < η/(2ML). Then, we can conclude using H1and
Lemma 12 that
kxγxVθ(x)γxUγκ
θ(x)k2
6kxk2+γ/(2/L ¯γ)R2
V,1+ 3γ¯γM2γη kxk+ 2γc+ 2γ(MRU,1+RU,2) + 2γ¯γMLRV ,1
6kxk2+γ(2/L ¯γ)1R2
V,1+ 3¯γM2+ 2c+ 2(MRU,1+RU,2) + 2¯γMLRV,2ηkxk.
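For $\upsilon \in \mathbb{R}^d$ and $\sigma > 0$, denote $\Upsilon_{\upsilon,\sigma}$ the $d$-dimensional Gaussian distribution with mean $\upsilon$ and
covariance matrix $\sigma^2 \mathrm{Id}$.
Lemma 17. For any $\sigma_1, \sigma_2 > 0$ and $\upsilon_1, \upsilon_2 \in \mathbb{R}^d$, we have
$$\mathrm{KL}(\Upsilon_{\upsilon_1,\sigma_1 \mathrm{Id}} | \Upsilon_{\upsilon_2,\sigma_2 \mathrm{Id}}) = \|\upsilon_1 - \upsilon_2\|^2/(2\sigma_2^2) + (d/2) \left\{ -\log(\sigma_1^2/\sigma_2^2) - 1 + \sigma_1^2/\sigma_2^2 \right\} .$$
In addition, if $\sigma_1 \geq \sigma_2$
$$\mathrm{KL}(\Upsilon_{\upsilon_1,\sigma_1 \mathrm{Id}} | \Upsilon_{\upsilon_2,\sigma_2 \mathrm{Id}}) \leq \|\upsilon_1 - \upsilon_2\|^2/(2\sigma_2^2) + (d/2)(1 - \sigma_1^2/\sigma_2^2)^2 .$$
Proof. Let $X$ be a $d$-dimensional Gaussian random variable with mean $\upsilon_1$ and covariance matrix
$\sigma_1^2 \mathrm{Id}$. We have that
$$\begin{aligned} \mathrm{KL}(\Upsilon_{\upsilon_1,\sigma_1 \mathrm{Id}} | \Upsilon_{\upsilon_2,\sigma_2 \mathrm{Id}}) &= \mathbb{E}\left[ \log \left\{ (\sigma_2^2/\sigma_1^2)^{d/2} \exp\left[ -\|X - \upsilon_1\|^2/(2\sigma_1^2) + \|X - \upsilon_2\|^2/(2\sigma_2^2) \right] \right\} \right] \\ &= -(d/2) \log(\sigma_1^2/\sigma_2^2) + \mathbb{E}\left[ -\|X - \upsilon_1\|^2/(2\sigma_1^2) + \|X - \upsilon_2\|^2/(2\sigma_2^2) \right] \\ &= -(d/2) \log(\sigma_1^2/\sigma_2^2) + (1/2)(\sigma_2^{-2} - \sigma_1^{-2}) \mathbb{E}\left[ \|X - \upsilon_1\|^2 \right] + \|\upsilon_1 - \upsilon_2\|^2/(2\sigma_2^2) \\ &= -(d/2) \log(\sigma_1^2/\sigma_2^2) + (d/2)(\sigma_1^2/\sigma_2^2 - 1) + \|\upsilon_1 - \upsilon_2\|^2/(2\sigma_2^2) \\ &= \|\upsilon_1 - \upsilon_2\|^2/(2\sigma_2^2) + (d/2) \left\{ -\log(\sigma_1^2/\sigma_2^2) - 1 + \sigma_1^2/\sigma_2^2 \right\} . \end{aligned}$$
In the case where $\sigma_1 \geq \sigma_2$, let $s = \sigma_1^2/\sigma_2^2 - 1$. Since $s \geq 0$ we have $\log(1 + s) \geq s - s^2$. Therefore,
we get that
$$-\log(\sigma_1^2/\sigma_2^2) - 1 + \sigma_1^2/\sigma_2^2 = -\log(1 + s) + s \leq s^2 ,$$
which concludes the proof.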
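The closed form of Lemma 17 and its upper bound translate directly into code. This is a minimal sketch with hypothetical function names; means are passed as plain lists and `s1`, `s2` denote the standard deviations $\sigma_1$, $\sigma_2$.

```python
import math


def kl_isotropic(v1, s1, v2, s2):
    """KL between N(v1, s1^2 Id) and N(v2, s2^2 Id), following Lemma 17."""
    d = len(v1)
    sq_dist = sum((a - b) ** 2 for a, b in zip(v1, v2))
    ratio = s1 ** 2 / s2 ** 2
    return sq_dist / (2 * s2 ** 2) + (d / 2) * (-math.log(ratio) - 1 + ratio)


def kl_isotropic_bound(v1, s1, v2, s2):
    """Upper bound of Lemma 17, valid only when s1 >= s2."""
    if s1 < s2:
        raise ValueError("the bound requires s1 >= s2")
    d = len(v1)
    sq_dist = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return sq_dist / (2 * s2 ** 2) + (d / 2) * (1 - s1 ** 2 / s2 ** 2) ** 2
```

When $\sigma_1 = \sigma_2$ the variance term vanishes and only the mean term $\|\upsilon_1 - \upsilon_2\|^2/(2\sigma_2^2)$ remains.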
5.2 Proof of Theorem 4
We show that under H2or H3, Foster-Lyapunov drifts hold for MYULA in Lemma 18 and
Lemma 19. Combining these Foster-Lyapunov drifts with an appropriate minorisation condition
Lemma 20, we obtain the geometric ergodicity of the underlying Markov chain in Theorem 21.
Lemma 18. Assume H1and H2. Then for any θΘ,κ[κ, ¯κ]and γ(0,¯γ]with ¯κ>1>
κ > 1/2,¯γ < 2/(m+L),Rγ and ¯
Rγ,θ satisfy Dd(W1, λγ
2, b2γ)with
λ2= exp [/2] ,
b2=(2/(m+L)¯γ)1+ 4R2
V,1+ 2¯γMLRV ,1+ ¯γM2+ 2d+ 2M2(1 + ¯γL)21+/2,
=mL/(m+L),
where for any xRd,W2(x) = 1 + kxk2. In addition, for any mN, there exist λm(0,1),
bm>0such that for any θΘ,κ[κ, ¯κ],γ(0,¯γ]with ¯κ>1>κ > 1/2,¯γ < 2/(m+L),Rγ,θ
and ¯
Rγ,θ satisfy Dd(Wm, λγ
m, bmγ), where Wmis given in (19).
Proof. We show the property for Rγ only as the proof for ¯
Rγ,θ is identical. Let θΘ,κ[κ,¯κ],
γ(0,¯γ]and xRd. Let Zbe a d-dimensional Gaussian random variable with zero mean and
identity covariance matrix. Using Lemma 15 we have
ZRdkyk2Rγ,θ (x, dy) = E
xγxVθ(x)γxUγκ
θ(x) + p2γZ
2
=kxγxVθ(x)γxUγκ
θ(x)k+ 2γd
6(1 γ/2) kxk2+γ(2/(m+L)¯γ)1+ 4R2
V,1
+2¯γMLRV,1+ ¯γM2+ 2d+ 2M2(1 + ¯γL)21.
Therefore, we get
ZRd
(1 + kyk2)Rγ,θ (x, dy)6(1 γ/2)(1 + kxk2) + γ(2/(m+L)¯γ)1+ 4R2
V,1
+2¯γMLRV,1+ ¯γM2+ 2d+ 2M2(1 + ¯γL)21+/2,
which concludes the first part of the proof. Let Tγ (x) = xγxVθ(x)γxUγκ
θ(x). In the
sequel, for any k {1,...,m},b, ˜
bk>0and λ, ˜
λk[0,1) are constants independent of γwhich
may take different values at each appearance. Note that using Lemma 15, for any k {1,...,2m}
there exist ˜
λk(0,1) and ˜
bk>0such that
kTγ,θ (x)kk6{˜
λγ
kkxk+γ˜
bk}k(24)
6˜
λγk
kkxkk+γ2kmax(˜
bk,1)kmax(¯γ, 1)2k1n1 + kxkk1o
6˜
λγ
kkxkk+˜
bkγn1 + kxkk1o6(1 + kxkk)(1 + ˜
bkγ).
Therefore, combining (24) and the Cauchy-Schwarz inequality we obtain
ZRd
(1 + kyk2)Rγ,θ (x, dy) = 1 + Eh(kTγ,θ (x)k2+ 2p2γhTγ(x), Zi+ 2γkZk2)mi
= 1 +
m
X
k=0
k
X
=0 m
kk
kTγ,θ (x)k2(mk)2(3k)/2γ(k+)/2EhhTγ,θ (x), Z ikkZk2i
61 + kTγ,θ (x)k2m
+ 23m/2
m
X
k=1
k
X
=0 m
kk
kTγ,θ (x)k2(mk)γ(k+)/2EhhTγ(x), ZikkZk2i
1
{(1,0)}c(k, )
61 + kTγ,θ (x)k2m
+γ23m/2
m
X
k=1
k
X
=0 m
kk
kTγ,θ (x)k2mk¯γ(k+)/21EhkZkk+i
1
{(1,0)}c(k, )
61 + λγ
2mkxk2m+b2mγn1 + kxk2m1o
+γ23m/222mmax(¯γ, 1)2msup
k∈{1,...,m}n(1 + ˜
bk¯γ)EhkZkkio(1 + kxk2m1)
61 + λγkxk2m+γb(1 + kxk2m1)
6λγ/2(1 + kxk2m) + γb(1 + kxk2m1) + λγ(1 + kxk2m)λγ/2(1 + kxk2m).
Using that λγλγ/26log(1)γλγ/2/2, concludes the proof.
Lemma 19. Assume H1and H3. Then for any θΘ,κ[κ, ¯κ]and γ(0,¯γ]with ¯κ>1>
κ > 1/2,¯γ < min(2/L, η/(2ML)),Rγ and ¯
Rγ,θ satisfy Dd(W, λγ, )with
λ= eα2,
be= (4/L γ)1R2
V,1+ (3/2)¯γM2+c+MRU,1+RU,2+ ¯γMLRV,2+d+ 2α ,
b=αbeeα¯γ beW(R),
W=Wα, α < η/8,
Rη= max (2be/(η8α),1) ,
(25)
where Wαis given in (19).
Proof. We show the property for Rγ,θ only as the proof for ¯
Rγ,θ is identical. Let θΘ,κ[κ,¯κ]
γ(0,¯γ],xRdand Zbe a d-dimensional Gaussian random variable with zero mean and identity
covariance matrix. Using Lemma 16 we have
ZRdkyk2Rγ,θ (x, dy) = kxγxVθ(x)γxUγκ
θk2+ 2γd
6kxk2+γ(2/L ¯γ)1R2
V,1+ 3¯γM2+ 2c+ 2(MRU,1+RU,2) + 2¯γMLRV,2+ 2dηkxk.
Using the log-Sobolev inequality [3, Proposition 5.4.1] and Jensen’s inequality we get that
Rγ,θ W(x)6exp αRγφ(x) + α2γ(26)
6exp "α1 + ZRdkyk2Rγ,θ (x, dy)1/2
+α2γ#.
We now distinguish two cases:
(a) If kxk>Rη, recalling that Rηis given in (25), then
(2/L ¯γ)1R2
V,1+ 3¯γM2+ 2c+ 2(MRU,1+RU,2) + 2¯γMLRV,2+ 2dηkxk68αkxk.
In this case using that φ1(x)kxk>1/2and that for any t>0,1 + t61 + t/2we have
1 + ZRdkyk2Rγ,θ (x, dy)1/2
φ(x)6
6γφ1(x)(2/L ¯γ)1R2
V,1+ 3¯γM2+ 2c+ 2(MRU,1+RU,2) + 2¯γMLRV,2+ 2dηkxk2
64αγφ1(x)kxk62αγ .
Hence,
Rγ,θ W(x)6"α1 + ZRdkyk2Rγ,θ (x, dy)1/2
+α2γ#6eα2γW(x).
(b) If kxk6Rηthen using that for any t>0,1 + t61 + t/2we have
1 + ZRdkyk2Rγ,θ (x, dy)1/2
φ(x)
6γ((4/L γ)1R2
V,1+ (3/2)¯γM2+c+MRU,1+RU,2+ ¯γMLRV,2+d).
Therefore, using (26), we get
Rγ,θ W(x)
6exp αγ (4/L γ)1R2
V,1+ (3/2)¯γM2+c+MRU,1+RU,2+ ¯γMLRV,2+d+αW(x).
Since for all a>b,eaeb6(ab)eawe obtain that
Rγ,θ W(x)6λγW(x) + γαbeeα¯γ beW(Rη),
which concludes the proof.
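Lemma 20. Assume H1. For any $\kappa \in [\underline{\kappa}, \bar{\kappa}]$, $\theta \in \Theta$, $\gamma \in (0, \bar{\gamma}]$ with $\bar{\kappa} \geq 1 \geq \underline{\kappa} > 1/2$,
$\bar{\gamma} < (2 - 1/\underline{\kappa})/L$ and $x, y \in \mathbb{R}^d$
$$\max\left( \|\delta_x R^{\lceil 1/\gamma \rceil}_{\gamma,\theta} - \delta_y R^{\lceil 1/\gamma \rceil}_{\gamma,\theta}\|_{\mathrm{TV}}, \|\delta_x \bar{R}^{\lceil 1/\gamma \rceil}_{\gamma,\theta} - \delta_y \bar{R}^{\lceil 1/\gamma \rceil}_{\gamma,\theta}\|_{\mathrm{TV}} \right) \leq 1 - 2\Phi\left\{ -\|x - y\|/(2\sqrt{2}) \right\} ,$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution on $\mathbb{R}$.
Proof. We only show that for any $\theta \in \Theta$, $\kappa \in [\underline{\kappa}, \bar{\kappa}]$, $\gamma \in (0, \bar{\gamma}]$ with $\bar{\kappa} \geq 1 \geq \underline{\kappa} > 1/2$, $\bar{\gamma} < (2 - 1/\underline{\kappa})/L$
and $x, y \in \mathbb{R}^d$, we have $\|\delta_x R^{\lceil 1/\gamma \rceil}_{\gamma,\theta} - \delta_y R^{\lceil 1/\gamma \rceil}_{\gamma,\theta}\|_{\mathrm{TV}} \leq 1 - 2\Phi\{-\|x - y\|/(2\sqrt{2})\}$ as the proof for
$\bar{R}_{\gamma,\theta}$ is similar. Let $\kappa \in [\underline{\kappa}, \bar{\kappa}]$, $\theta \in \Theta$, $\gamma \in (0, \bar{\gamma}]$. We have that $x \mapsto V_\theta(x) + U^{\gamma\kappa}_\theta(x)$ is convex,
continuously differentiable and satisfies for any $x, y \in \mathbb{R}^d$
$$\|\nabla_x V_\theta(x) + \nabla_x U^{\gamma\kappa}_\theta(x) - \nabla_x V_\theta(y) - \nabla_x U^{\gamma\kappa}_\theta(y)\| \leq \{L + 1/(\gamma\kappa)\} \|x - y\| .$$
Combining this result with [36, Theorem 2.1.5, Equation (2.1.8)] and the fact that $\gamma \leq 2/\{L + 1/(\gamma\kappa)\}$
since $\bar{\gamma} \leq (2 - 1/\underline{\kappa})/L$, we have for any $x, y \in \mathbb{R}^d$
$$\|x - \gamma \nabla_x V_\theta(x) - \gamma \nabla_x U^{\gamma\kappa}_\theta(x) - y + \gamma \nabla_x V_\theta(y) + \gamma \nabla_x U^{\gamma\kappa}_\theta(y)\| \leq \|x - y\| .$$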
The proof is then an application of [16, Proposition 3b] with 1, for any xRd,Tγ(x)
xγxVθ(x)γxUγκ
θ(x)and ΠId.
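Theorem 21. Assume H1 and H2 or H3. Let $\bar{\kappa} \geq 1 \geq \underline{\kappa} > 1/2$, $\bar{\gamma} < \min\{(2 - 1/\underline{\kappa})/L, 2/(m + L)\}$
if H2 holds and $\bar{\gamma} < \min\{(2 - 1/\underline{\kappa})/L, \eta/(2ML)\}$ if H3 holds. Then for any $a \in (0,1]$, there exist
$A_{2,a} \geq 0$ and $\rho_a \in (0,1)$ such that for any $\theta \in \Theta$, $\kappa \in [\underline{\kappa}, \bar{\kappa}]$, $\gamma \in (0, \bar{\gamma}]$, $R_{\gamma,\theta}$ and $\bar{R}_{\gamma,\theta}$ admit
invariant probability measures $\pi_{\gamma,\theta}$, respectively $\bar{\pi}_{\gamma,\theta}$, and for any $x, y \in \mathbb{R}^d$ and $n \in \mathbb{N}$ we have
$$\max\left( \|\delta_x R^n_{\gamma,\theta} - \pi_{\gamma,\theta}\|_{W^a}, \|\delta_x \bar{R}^n_{\gamma,\theta} - \bar{\pi}_{\gamma,\theta}\|_{W^a} \right) \leq A_{2,a} \rho_a^{\gamma n} W^a(x) ,$$
$$\max\left( \|\delta_x R^n_{\gamma,\theta} - \delta_y R^n_{\gamma,\theta}\|_{W^a}, \|\delta_x \bar{R}^n_{\gamma,\theta} - \delta_y \bar{R}^n_{\gamma,\theta}\|_{W^a} \right) \leq A_{2,a} \rho_a^{\gamma n} \{W^a(x) + W^a(y)\} ,$$
with $W = W_m$ and $m \in \mathbb{N}^*$ if H2 holds and $W = W_\alpha$ with $\alpha < \min(\underline{\kappa}\eta/4, \eta/8)$ if H3 holds, see
(19).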
Proof. We only show that for any a(0,1], there exist A2,a >0and ρa(0,1) such that for
any θΘ,κ[κ,¯κ]and γ(0,¯γ]we have kδxRn
γ,θ πγ kWa6A2,aργ n
aWa(x)and kδxRn
γ,θ
δyRn
γ,θ kWa6A2,aργn
a{Wa(x) + Wa(y)}, since the proof for ¯
Rγ,θ is similar . Let a[0,1]. First,
using Jensen’s inequality and Lemma 18 if H2holds or Lemma 19 if H3holds, we get that there
exist λaand basuch that for any θΘ,κ[κ, ¯κ],γ(0,¯γ],Rγ,θ and ¯
Rγ,θ satisfy Dd(Wa, λγ
a, baγ).
Combining [16, Theorem 6], Lemma 20 and Dd(Wa, λγ
a, baγ), we get that there exist ¯
A2,a >0and
ρa(0,1) such that for any θΘ,κ[κ, ¯κ],γ(0,¯γ],x, y Rdand nN,Rγ,θ and ¯
Rγ,θ admit
invariant probability measures πγ and ¯πγ respectively and
max kδxRn
γ,θ δyRn
γ,θ kWa,kδx¯
Rn
γ,θ δy¯
Rn
γ,θ kWa6¯
A2,aργ n
a{Wa(x) + Wa(y)}.(27)
Using that for any θΘ,κ[κ,¯κ]and γ(0,¯γ],Rγ,θ and ¯
Rγ,θ satisfy Dd(Wa, λγ
a, baγ)and [17,
Lemma S2] we have
πγ,θ (Wa)6baγ/(1 λγ
a)6baλ¯γ
a/log(1a).(28)
Hence, combining (27) and (28), we have for any θΘ,κ[κ, ¯κ],γ(0,¯γ]and nN
max kδxRn
γ,θ πγ kW,kδx¯
Rn
γ,θ ¯πγ,θ kW6¯
A2,aργ n
a(1 + baλ¯γ
a/log(1a))Wa(x).
We conclude upon letting A2,a =¯
A2,a(1 + baλ¯γ
a/log(1a)).
5.3 Proof of Theorem 5
We show that under H2or H3, Foster-Lyapunov drifts hold for PULA in Lemma 22 and Lemma 23.
Combining these Foster-Lyapunov drifts with an appropriate minorisation condition Lemma 24,
we obtain the geometric ergodicity of the underlying Markov chain in Theorem 25.
Lemma 22. Assume H1and H2. Then for any θΘ,κ[κ, ¯κ]and γ(0,¯γ]with ¯κ>1>
κ > 1/2and ¯γ < 2/(m+L),Sγ and ¯
Sγ,θ satisfy Dd(W1, λγ
2, b2γ)with
λ2= exp [/2] ,
b2= ¯γ¯κ2M2+(2/(m+L)¯γ)1+ 4R2
V,2+ 2d+ 2¯κ2M21+/2,
=mL/(m+L),
where for any xRd,W1(x) = 1 + kxk2. In addition, for any mN, there exist λm(0,1),
bm>0such that for any θΘ,κ[κ,¯κ]and γ(0,¯γ]with ¯κ>1>κ > 1/2and ¯γ < 2/(m+L),
Sγ,θ and ¯
Sγ,θ satisfy Dd(Wm, λγ
m, bmγ), where Wmis given in (19).
Proof. We show the property for Sγ only as the proof for ¯
Sγ,θ is identical. Let θΘ,κ[κ,¯κ],
γ(0,¯γ]and xRd. Let Zbe a d-dimensional Gaussian random variable with zero mean and
identity covariance matrix. Using Lemma 13 we have
ZRdkyk2Sγ,θ (x, dy) = E
proxγκ
Uθ(x)γxVθ(proxγκ
Uθ(x)) + p2γZ
2
6(1 γ/2) kxk2+γ¯γκ2M2+(2/(m+L)¯γ)1+ 4R2
V,1+2κ2M21+ 2γ d .
Therefore, we get
ZRd
(1 + kyk2)Sγ,θ (x, dy)6(1 γ/2)(1 + kxk2) + γ¯γκ2M2
+(2/(m+L)¯γ)1+ 4R2
V,1+ 2d+ 2κ2M21+/2,
which concludes the first part of the proof using that for any t>0,1t6et. The proof of the
result for W=Wmwith mNis a straightforward adaptation of the one of Lemma 18 and is
left to the reader.
Lemma 23. Assume H1and H3. Then for any θΘ,κ[κ, ¯κ]and γ(0,¯γ]with ¯κ>1>
κ > 1/2and ¯γ < 2/L,Sγ and ¯
Sγ,θ satisfy Dd(W, λγ, )with
λ= eα2,
be= (3/2)¯γ¯κ2M2+ ¯κc+ ¯κ(RU,2+MRU,1) + (4/L γ)1R2
V,1+d+ 2α
b=αbeeα¯γ beW(R),
W=Wα,0< α < κη/4,
Rη= max (be/(κη4α),1) ,
and where Wαis given in (19).
Proof. We show the property for Sγ only as the proof for ¯
Sγ,θ is identical. Let θΘ,κ[κ,¯κ],
γ(0,¯γ],xRd, and Zbe a d-dimensional Gaussian random variable with zero mean and
identity covariance matrix. Using Lemma 14 we have
ZRdkyk2Sγ,θ (x, dy)6
proxγκ
Uθ(x)γxVθ(proxγκ
Uθ(x))
2+ 2γd
6kxk2+γγκ2M2+ 2κc+ 2κ(RU,2+MRU,1) + (2/L ¯γ)1R2
V,1+ 2d2κη kxk.
Using the log-Sobolev inequality [3, Proposition 5.4.1] and Jensen’s inequality we get that
Sγ,θ W(x)6exp αSγφ(x) + α2γ(29)
6exp "α1 + ZRdkyk2Sγ,θ (x, dy)1/2
+α2γ#.
We now distinguish two cases.
(a) If kxk>Rηthen φ1(x)kxk>1/2and γκ2M2+ 2κc+ 2κ(RU,2+MRU,1) + (2/L ¯γ)1R2
V,1+
2d2κη kxk68αkxk. In this case using that for any t>0,1 + t16t/2we get
1 + ZRdkyk2Sγ,θ (x, dy)1/2
φ(x)
6γφ1(x)3¯γκ2M2+ 2κc+ 2κ(RU,2+MRU,1) + (2/L ¯γ)1R2
V,1+ 2d2κη kxk/2
64αγφ1(x)kxk62αγ .
Hence,
Sγ,θ W(x)6exp "α1 + ZRdkyk2Sγ,θ (x, dy)1/2
+α2γ#6eα2γW(x).
(b) If kxk6Rηthen using that for any t>0,1 + t16t/2
1 + ZRdkyk2Sγ,θ (x, dy)1/2
φ(x)
6γ(3/2)¯γκ2M2+κc+κ(RU,2+MRU,1) + (4/L 2¯γ)1R2
V,1+d.
Therefore we get using (29)
Sγ,θ W(x)/W (x)
6exp αγ (3/2)¯γκ2M2+κc+κ(RU,2+MRU,1) + (4/L 2¯γ)1R2
V,1+d+α6eαbeγ.
Since for all a>b,eaeb6(ab)eawe obtain that
Sγ,θ W(x)6λγW(x) + γαbeeα¯γbeW(Rη),
which concludes the proof.
Lemma 24. Assume H1. For any θΘ,κ[κ, ¯κ]and γ(0,¯γ]with ¯κ>1>κ > 1/2,¯γ < 2/L
and x, y Rd
max kδxS1
γ,θ δyS1
γ,θ kTV ,kδx¯
S1
γ,θ δy¯
S1
γ,θ kTV 612Φnkxyk/(22)o,
where Φis the cumulative distribution function of the standard normal distribution on R.
Proof. We only show that for any θΘ,κ[κ,¯κ],γ(0,¯γ]with ¯γ < 2/L, and x, y Rd,
kδxS1
γ,θ δyS1
γ,θ kTV 612Φ kxyk/(22)since the proof for ¯
Sγ,θ is similar. Let
θΘ,κ[κ, ¯κ],γ(0,¯γ]. Using [36, Theorem 2.1.5, Equation (2.1.8)] and that the proximal
operator is non-expansive [5, Proposition 12.28], we have for any x, y Rd
proxγκ
Uθ(x)proxγκ
Uθ(y)γ(xVθ(proxγκ
Uθ(x)) xVθ(proxγκ
Uθ(y)))
6
proxγκ
Uθ(x)proxγκ
Uθ(y)
6kxyk.
The proof is then an application of [16, Proposition 3b] with 1, for any xRd,Tγ(x)
proxγκ
Uθ(x)γxVθ(proxγκ
Uθ(x)) and ΠId.
Theorem 25. Assume H1and H2or H3. Let ¯κ>1>κ > 1/2. Let ¯γ < 2/(m+L)if H2holds
and ¯γ < 2/Lif H3holds. Then for any a(0,1], there exist A2,a >0and ρa(0,1) such that
for any θΘ,κ[κ, ¯κ],γ(0,¯γ],Sγ,θ and ¯
Sγ,θ admit an invariant probability measure πγ,θ and
¯πγ respectively, and for any x, y Rdand nNwe have
max kδxSn
γ,θ πγ kWa,kδx¯
Sn
γ,θ ¯πγ,θ kWa6A2,aργn
aWa(x),
max kδxSn
γ,θ δySn
γ,θ kWa,kδx¯
Sn
γ,θ δy¯
Sn
γ,θ kWa6A2,aργ n
a{Wa(x) + Wa(y)},
with W=Wmand mNif H2holds and W=Wαwith α < κη/4if H3holds, see (19).
Proof. The proof is similar to the one of Theorem 21.
5.4 Checking [17, H1, H2] for PULA
Lemma 26 implies that [17, H1a] holds. The geometric ergodicity proved in Theorem 25 implies
[17, H1b]. Then, we show that the distance between the invariant probability distribution of the
Markov chain and the target distribution is controlled in Corollary 31 and therefore [17, H1c] is
satisfied. Finally, we show that [17, H2] is satisfied in Proposition 32.
Lemma 26. Assume H1,H2or H3, and let (Xn
k,¯
Xn
k)nN,k∈{0,...,mn}be given by (5)with
{(Kγ,θ ,¯
Kγ,θ ) : γ(0,¯γ], θ Θ}={(Sγ,θ ,¯
Sγ,θ ) : γ(0,¯γ], θ Θ}and κ[κ, ¯κ]with
¯κ>1>κ > 1/2. Then there exists A1>1such that for any n, p Nand k {0,...,mn}
EhSp
γnnW(Xn
k)X0
0i6A1W(X0
0),
Eh¯
Sp
γnnW(¯
Xn
k)¯
X0
0i6A1W(¯
X0
0),
EW(X0
0)<+,EW(¯
X0
0)<+,
with W=Wmwith mNand ¯γ < 2/(m+L)if H2holds and W=Wαwith α < κη/4and
¯γ < 2/Lif H3holds, see (19).
Proof. Combining [17, Lemma S15] and Lemma 22 if H2 holds or Lemma 23 if H3 holds concludes
the proof.
Lemma 27. Assume H1and H2or H3. We have supθΘ{πθ(W)+ ¯πθ(W)}<+, with W=Wm
with mNif H2holds and W=Wαwith α < η if H3holds, see (19).
Proof. We only show that supθπθ(W)<+since the proof for ¯πθis similar. Let mN,α < η
and θΘThe proof is divided into two parts.
(a) If H2holds then using H1-(b) we have
ZRd
(1 + kxk2m) exp [Uθ(x)Vθ(x)] dx6ZRd
(1 + kxk2m) exp [Vθ(x)] dx
6ZRd
(1 + kxk2m) exp hVθ(x
θ)mkxx
θk2/2idx
6exp RV,3+mR2
V,1/2ZRd
(1 + kxk2m) exp hmRV,1kxk mkxk2/2idx .
Hence using H1-(a) we have
sup
θΘ
πθ(W)6exp RV,3+mR2
V,1/2ZRd
(1 + kxk2m) exp hmRV,1kxk mkxk2/2idx
inf
θΘZRd
exp [Uθ(x)Vθ(x)] dx<+.
(b) if H3holds then we have
ZRd
exp [αφ(x)] exp [Uθ(x)Vθ(x)] dx6ZRd
exp [αφ(x)] exp [Uθ(x)] dx
6ecZRd
exp [α(1 + kxk)] exp [ηkxk] dx .
Since α < η we have using H1-(a)
sup
θΘ
πθ(W)6ecZRd
exp [α(1 + kxk)] exp [ηkxk] dx
inf
θΘZRd
exp [Uθ(x)Vθ(x)] dx<+,
which concludes the proof.
Theorem 28. Assume H1and H2or H3. Let ¯κ>1>κ > 1/2. Let ¯γ < 2/(m+L)if H2holds
and ¯γ < 2/Lif H3holds. Then for any θΘ,κ[κ, ¯κ]and γ(0,¯γ]we have
max kπ
γ,θ πθkW1/2,k¯π
γ,θ ¯πθkW1/26˜
Ψ(γ),
where for any θΘand γ(0,¯γ],π
γ,θ , respectively ¯π
γ,θ , is the invariant probability measure of
Sγ,θ , respectively ¯
Sγ,θ , given by (18)and associated with κ= 1. In addition, for any γ(0,¯γ]
˜
Ψ(γ) = 2{¯γ/log(1/λ) + sup
θΘ
πθ(W) + sup
θΘ
¯πθ(W)}1/2(Ld+M2)1/2γ ,
and where W=Wmwith mNand ¯γ, λ, b are given in Lemma 22 if H2holds and W=Wα
with α < min(κη/4, η)and ¯γ , λ, b are given in Lemma 23 if H3holds, see (19).
Proof. We only show that for any θΘ,κ[κ, ¯κ]and γ(0,¯γ],kπ
γ,θ πθkW1/26˜
Ψ(γ), since
the proof of k˜π
γ,θ ˜πθkW1/26˜
Ψ(γ)is similar. Let θΘ,κ[κ,¯κ],γ(0,¯γ]and xRdUsing
Theorem 25 we obtain that (δxSn
γ,θ )nN, with κ= 1, is weakly convergent towards π
γ,θ . Using that
µ7→ KL (µ|πθ)is lower semi-continuous for any θΘ, see [19, Lemma 1.4.3b], and [21, Corollary
18] we get that
KL π
γ,θ |πθ6lim inf
n+KL n1
n
X
k=1
δxSk
γ,θ
πθ!6γ(Ld+M2).
Using a generalized Pinsker inequality, see [22, Lemma 24], Lemma 27 and Lemma 22 if H2holds
or Lemma 23 if H3holds, we get that
kπ
γ,θ πθkW1/262(π
γ,θ (W) + πθ(W))1/2KL π
γ,θ |πθ1/2
62{¯γ/log(1/λ) + sup
θΘ
πθ(W)}1/2(Ld+M2)1/2γ1/2,
which concludes the proof.
Lemma 29. Assume H1and H2or H3. Let ¯κ>1>κ > 1/2. Let ¯γ < 2/(m+L)if H2holds
and ¯γ < 2/Lif H3holds. Then there exists ¯
B3>0such that for any θΘ,γ(0,¯γ],xRd
and κi[κ, ¯κ]with i {1,2}we have
max kδxS1
1,γ,θ δxS1
2,γ,θ kW1/2,kδx¯
S1
1,γ,θ δx¯
S1
2,γ,θ kW1/26¯
B3γ|κ1κ2|W1/2(x).
where for any i {1,2},θΘand γ(0,¯γ],Si,γ,θ is given by (18)and associated with κκi,
and W=Wmwith mNif H2holds. In addition, W=Wαwith α < min(κη/4, η)if H3
holds, see (19).
Proof. We only show that for any θΘ,γ(0,¯γ],xRdand κi[κ,¯κ]with i {1,2}we have
kδxS1
1,γ,θ δxS1
2,γ,θ kW1/26¯
B3γ|κ1κ2|W1/2(x)since the proof for ¯
S1,γ,θ and ¯
S2,γ,θ is similar.
Let θΘ,γ(0,¯γ],xRdand κi[κ,¯κ]with i {1,2}. Using a generalized Pinsker inequality,
see [22, Lemma 24], we have
kδxS1
1,γ,θ δxS1
2,γ,θ kW1/2
62(S1
1,γ,θ W(x) + S1
2,γ,θ W(x))1/2KL δxS1
1,γ,θ |δxS1
2,γ,θ 1/2.(30)
Using [30, Lemma 4.1] we get that KL δxS1
1,γ,θ |δxS1
2,γ,θ 6KL ( ˜µ1|˜µ2)where setting T=
γ1,˜µi,i {1,2}, is the probability measure over B(C([0, T ],Rd)) which is defined for any
A B(C([0, T ],Rd)) by ˜µi(A) = P((Xi
t)t[0,T ]A),i {1,2}and for any t[0, T ]
dXi
t=bi(t, (Xi
s)s[0,T ])dt+2dBt, Xi
0=x ,
with for any (ωs)s[0,T ]C([0, T ],Rd)and t[0, T ]
bi(t, (ωs)s[0,T ]) = X
pN
1
[pγ,(p+1)γ)(t)T(proxγκi
Uθ(ω )) ,
where for any yRd,Tγ,θ (y) = yγxVθ(y). Since (Xi
t)t[0,T ]C([0, T ],Rd),biand bare
continuous for any i {1,2}, [32, Theorem 7.19] applies and we obtain that ˜µ1˜µ2and
d˜µ1
d˜µ2
((X1
t)t[0,T ]) = exp ((1/4) ZT
0
b1(t, (X1
s)s[0,T ])b2(t, (X1
s)s[0,T ])
2dt
+(1/2) ZT
0hb1(t, (X1
s)s[0,T ])b2(t, (X1
s)s[0,T ]),dX1
ti),
where the equality holds almost surely. As a consequence we obtain that
KL (˜µ1|˜µ2) = (1/4)E"ZT
0
b1(t, (X1
s)s[0,T ])b2(t, (X1
s)s[0,T ])
2ds#.(31)
In addition, using Lemma 11, we have for any (ωs)s[0,T]C([0, T ],Rd)and t[0, T ]
b1(t, (ωs)s[0,T ])b2(t, (ωs)s[0,T ])
2=
Tγ,θ (proxγκ1
Uθ(ωγt/γ)) Tγ(proxγκ2
Uθ(ωγt/γ))
2
6
proxγκ1
Uθ(ωγt/γ)proxγκ2
Uθ(ωγt/γ)
264γ2(κ1κ2)2M2.(32)
Combining this result and (31) we get that
KL δxS1
1,γ,θ |δxS1
2,γ,θ 6(1 + ¯γ)M2γ2|κ1κ2|2.(33)
Combining (33) and (30) we get that
kδxS1
1,γ,θ δxS1
2,γ,θ kW1/2
621/2(1 + ¯γ)1/2M(S1
1,γ,θ W(x) + S1
2,γ,θ W(x))1/2γ|κ1κ2|.
We conclude the proof upon using Lemma 8, and Lemma 22 if H2holds, or Lemma 23 if H3
holds.
Proposition 30. Assume H1and H2or H3. Let ¯κ>1>κ > 1/2. Let ¯γ < 2/(m+L)if H2
holds and ¯γ < 2/Lif H3holds. Then there exists B3>0such that for any θΘ,γ(0,¯γ]and
κi[κ,¯κ]with i {1,2}we have
max kπ1
γ,θ π2
γ,θ kW1/2,k¯π1
γ,θ ¯π2
γ,θ kW1/26B3γ|κ1κ2|,
where for any i {1,2},θΘand γ(0,¯γ],πi
γ,θ , respectively ¯πi
γ,θ , is the invariant probability
measure of Si,γ,θ , respectively ¯
Si,γ,θ , given by (18)and associated with κκi. In addition,
W=Wmwith mNif H2holds and W=Wαwith α < min(κη/4, η)if H3holds, see (19).
Proof. We only show that for any θΘ,γ(0,¯γ]and κi[κ, ¯κ]with i {1,2},kπ1
γ,θ
π2
γ,θ kW1/26B3γ|κ2κ1|since the proof for ¯π1
γ,θ and ¯π2
γ,θ are similar. Let θΘ,γ(0,¯γ],
xRdand κi>1/2. Using Theorem 25 we have
lim
n+kδxSn
1,γ,θ δxSn
2,γ,θ kW1/2=kπ1 π2,γ,θ kW1/2.
Let n=q1. Using Theorem 25 with a= 1/2, that W1/2(x)6W(x)for any xRd,
Lemma 29, Lemma 8and Lemma 22 if H2holds or Lemma 23 if H3holds, we have
kδxSn
1,γ,θ δxSn
2,γ,θ kW1/26
q1
X
k=0 kδxS(k+1)1
1,γ,θ S(qk1)1
2,γ,θ δxSk1
1,γ,θ S(qk)1
2,γ,θ kW1/2
6
q1
X
k=0
A2,1/2ρqk1
1/2
δxSk1
1,γ,θ nS1
1,γ,θ S1
2,γ,θ o
W1/2
6A2,1/2
q1
X
k=0
ρqk1
1/2¯
B3γ|κ1κ2|δxSk1
1,γ,θ W(x)
6A2,1/2
q1
X
k=0
ρqk1
1/2¯
B3γ|κ1κ2|(1 + ¯γ/log(1))W(x)
6A2,1/2¯
B3(1 + ¯γ/log(1))/(1 ρ1/2)|κ1κ2|γW (x),
which concludes the proof with B3= 2A2,1/2¯
B3(1 + ¯γ/log(1))/(1 ρ1/2)κupon setting
x= 0.
Corollary 31. Assume H1and H2or H3. Let ¯κ>1>κ > 1/2. Let ¯γ < 2/(m+L)if H2holds
and ¯γ < 2/Lif H3holds. Then for any κ[κ, ¯κ],θΘand γ(0,¯γ], we have
max (kπγ,θ πθkW1/2,k¯πγ,θ ¯πθkW1/2)6Ψ(γ),
where for any γ(0,¯γ],πγ,θ is the invariant probability measure of Sγ,θ given by (18). In addition,
Ψ(γ) = ˜
Ψ(γ)+B3γ|κ1|, where ˜
Ψis given in Theorem 28 and B3in Proposition 30, and W=Wm
with mNif H2holds and W=Wαwith α < min(κη/4, η)if H3holds, see (19).
Proof. We only show that for any θΘand γ(0,¯γ]we have kπγ,θ πθkW1/26Ψ(γ)since the
proof for ¯πγ,θ and ¯πθare similar. Let κ[κ,¯κ],θΘ,γ(0,¯γ]. The proof is a direct application
of Theorem 28 and Proposition 30 upon noticing that
kπγ,θ πθkW1/26kπγ π
γ,θ kW1/2+kπ
γ,θ πθkW1/2,
where π
γ,θ is the invariant probability measure of Sγ given by (18) and associated with κ= 1.
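Proposition 32. Assume H1 and H2 or H3. Let $\bar{\kappa} \geq 1 \geq \underline{\kappa} > 1/2$. Let $\bar{\gamma} < 2/(m + L)$ if H2
holds and $\bar{\gamma} < 2/L$ if H3 holds. Then there exists $A_4 \geq 0$ such that for any $\kappa \in [\underline{\kappa}, \bar{\kappa}]$, $\theta_1, \theta_2 \in \Theta$,
$\gamma_1, \gamma_2 \in (0, \bar{\gamma}]$ with $\gamma_2 < \gamma_1$, $a \in [1/4, 1/2]$ and $x \in \mathbb{R}^d$
$$\max\left( \|\delta_x S_{\gamma_1,\theta_1} - \delta_x S_{\gamma_2,\theta_2}\|_{W^a}, \|\delta_x \bar{S}_{\gamma_1,\theta_1} - \delta_x \bar{S}_{\gamma_2,\theta_2}\|_{W^a} \right) \leq (\Lambda_1(\gamma_1, \gamma_2) + \Lambda_2(\gamma_1, \gamma_2) \|\theta_1 - \theta_2\|) W^{2a}(x) ,$$
with
$$\Lambda_1(\gamma_1, \gamma_2) = A_4 (\gamma_1/\gamma_2 - 1) , \qquad \Lambda_2(\gamma_1, \gamma_2) = A_4 \gamma_2^{1/2} ,$$
and where $W = W_m$ with $m \in \mathbb{N}^*$ and $m \geq 2$ if H2 is satisfied and $W = W_\alpha$ with $\alpha < \min(\underline{\kappa}\eta/4, \eta)$
if H3 is satisfied, see (19).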
Proof. We only show that for any κ[κ,¯κ],θ1, θ2Θ,γ1, γ2(0,¯γ]with γ2< γ1,a[1/4,1/2]
and xRdwe have kδxSγ11δxSγ22kWa6(Λ(γ1, γ2) + Λ(γ1, γ2)kθ1θ2k)W2a(x)since the
proof for ¯
Sγ11and ¯
Sγ22is similar. Let a[1/4,1/2],κ[κ,¯κ],θ1, θ2Θ,γ1, γ2(0,¯γ]with
γ2< γ1. Using a generalized Pinsker inequality, see [22, Lemma 24], we have
kδxSγ11δxSγ22kWa
62(δxSγ11W2a(x) + δxSγ22W2a(x))1/2KL (δxSγ11|δxSγ22)1/2.
Combining this result, Jensen’s inequality and Lemma 22 if H2holds and Lemma 23 if H3holds,
we obtain that
kSγ11Sγ22kWa62(1 + b¯γ)1/2{KL (δxSγ1,θ1|δxSγ22)}1/2Wa(x).
Denote for υRdand σ>0,Υυ,σthe d-dimensional Gaussian distribution with mean υand
covariance matrix σ2Id. Using Lemma 17 and the fact that γ1>γ2we have
KL (δxSγ11|δxSγ22)(34)
6d(γ121)2/2 +
Tγ11(proxγ1κ
Uθ1(x)) Tγ22(proxγ2κ
Uθ1(x))
2(4γ2),
with Tγ,θ (z) = zγxVθ(z)for any θΘ,γ(0,¯γ]and xRd. We have
(1/4)
Tγ11(proxγ1κ
Uθ1(x)) Tγ22(proxγ2κ
Uθ2(x))
2(35)
6
Tγ11(proxγ1κ
Uθ1(x)) Tγ11(proxγ2κ
Uθ1(x))
2+
Tγ11(proxγ2κ
Uθ1(x)) Tγ11(proxγ2κ
Uθ2(x))
2
+
Tγ11(proxγ2κ
Uθ2(x)) Tγ21(proxγ2κ
Uθ2(x))
2+
Tγ21(proxγ2κ
Uθ2(x)) Tγ22(proxγ2κ
Uθ2(x))
2.
First using H1, [36, Theorem 2.1.5, Equation (2.1.8)] and Lemma 11 we have
Tγ11(proxγ1κ
Uθ1(x)) Tγ11(proxγ2κ
Uθ1(x))
(36)
6
proxγ1κ
Uθ1(x)proxγ2κ
Uθ1(x)
62M|γ1κγ2κ|.
Second, we have using (9), H1, [36, Theorem 2.1.5, Equation (2.1.8)] and H4
Tγ11(proxγ2κ
Uθ1(x)) Tγ11(proxγ2κ
Uθ2(x))
(37)
6γ2κ
xUγ2κ
θ1(x) xUγ2κ
θ2(x)
6sup
t[0,¯γ κ]{fθ(t)}γ2κkθ1θ2k(1 + kxk).
Third using H1and Lemma 9we have that
Tγ11(proxγ2κ
Uθ2(x)) Tγ21(proxγ2κ
Uθ2(x))
6(γ1γ2)
xVθ1(proxγ2κ
Uθ2(x))
(38)
6(γ1γ2)L
proxγ2κ
Uθ2(x)x
θ1
6(γ1γ2)L(RV,1+ ¯γκM+kxk).
Finally using H1,H4and Lemma 9we have that
Tγ21(proxγ2κ
Uθ2(x)) Tγ22(proxγ2κ
Uθ2(x))
(39)
6γ2
xVθ1(proxγ2κ
Uθ2(x)) xVθ2(proxγ2κ
Uθ2(x))
6γ2MΘkθ1θ2k(1 + kproxγ2κ
Uθ2(x)k)6γ2MΘkθ1θ2k(1 + ¯γκM+kxk).
Therefore, combining (36), (37), (38) and (39) in (35), there exists A4,1>0such that for any
γ1, γ2>0with γ2< γ1and θ1, θ2Θ
Tγ11(proxγ1κ
Uθ1(x)) Tγ22(proxγ2κ
Uθ2(x))
2
6A4,1h(γ1γ2)2+γ2
2kθ1θ2k2iW2a(x).
Using this result in (34), there exists A4,2>0such that
KL (δxSγ11|δxSγ22)6A4,2h(γ121)2+γ2kθ1θ2k2iW2a(x),
which implies the announced result upon setting A4= 2pA4,2(1 + b¯γ)1/2and using that for any
u, v >0,u+v6u+v.
5.5 Checking [17, H1, H2] for MYULA
In this section, similarly to Section 5.4 for PULA, we show that [17, H1, H2] hold for MYULA.
Lemma 33. Assume H1,H2or H3, and let (Xn
k,¯
Xn
k)nN,k∈{0,...,mn}be given by (5)with
{(Kγ,θ ,¯
Kγ,θ ) : γ(0,¯γ], θ Θ}={(Rγ,θ,¯
Rγ,θ ) : γ(0,¯γ], θ Θ}and κ[κ, ¯κ]with
¯κ>1>κ>1/2. Then there exists ¯
A1>1such that for any n, p Nand k {0,...,mn}
EhRp
γnnW(Xn
k)X0
0i6¯
A1W(X0
0),
Eh¯
Rp
γnnW(¯
Xn
k)¯
X0
0i6¯
A1W(¯
X0
0),
EW(X0
0)<+,EW(¯
X0
0)<+.
with W=Wmwith mNand ¯γ < 2/(m+L)if H2holds and W=Wαwith α < min(κη/4, η/8)
and ¯γ < min{2/L, η /(2ML)}if H3holds, see (19).
Proposition 34. Assume H1and H2or H3. Let ¯κ>1>κ > 1/2. Let ¯γ < min{(2
1)/L,2/(m+L)}if H2holds and ¯γ < min{(2 1)/L, η/(2ML)}if H3holds. Then there exists
¯
B3,1>0such that for any θΘ,κi[κ,¯κ],γ(0,¯γ]
max kπ1
γ,θ π2
γ,θ kW1/2,k¯π1
γ,θ ¯π2
γ,θ kW1/26¯
B3,1γ ,
where for any i {1,2},θΘand γ(0,¯γ],πi
γ,θ , respectively ¯πi
γ,θ , is the invariant probability
measure of Ri,γ,θ , respectively ¯
Ri,γ,θ , given by (17)and associated with κκi. In addition,
W=Wmwith mNif H2holds and W=Wαwith α < min(κη/4, η/8) if H3holds, see (19).
Proof. The proof is similar to the one of Proposition 30 upon setting for any i {1,2}and
(ωs)s[0,T ]C([0, T ],Rd)with T=γ1
bi(t, (ωs)s[0,T ]) = ωt/γγγxVθ(ωt/γγ)γxUγκi(γ)
θ(ωt/γγ),
and replacing (32) in Lemma 29 by
b1(t, (ωs)s[0,T ])b2(t, (ωs)s[0,T ])
2
=
γxUγκ1
θ(ωt/γγ) + γxUγ κ2
θ(ωt/γγ)
264γ2M2.
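Proposition 35. Assume H1 and H2 or H3. Let $\bar\kappa \geq 1 \geq \underline{\kappa} > 1/2$. Let $\bar\gamma < \min\{(2 - 1/\underline{\kappa})/L,\ 2/(m+L),\ L^{-1}\}$ if H2 holds and $\bar\gamma < \min\{(2 - 1/\underline{\kappa})/L,\ \eta/(2ML),\ L^{-1}\}$ if H3 holds. Then there exists $\bar{B}_{3,2} \geq 0$ such that for any $\theta \in \Theta$, $\gamma \in (0,\bar\gamma]$ and $\kappa_i \in [\underline{\kappa}, \bar\kappa]$ with $i \in \{1,2\}$ we have
\[
\max\left( \|\pi^{\sharp}_{\gamma,\theta} - \pi^{\flat}_{\gamma,\theta}\|_{W^{1/2}} ,\ \|\bar\pi^{\sharp}_{\gamma,\theta} - \bar\pi^{\flat}_{\gamma,\theta}\|_{W^{1/2}} \right) \leq \bar{B}_{3,2}\, \gamma^2 ,
\]
where for any $\theta \in \Theta$ and $\gamma \in (0,\bar\gamma]$, $\pi^{\sharp}_{\gamma,\theta}$, respectively $\bar\pi^{\sharp}_{\gamma,\theta}$, is the invariant probability measure of $R_{\gamma,\theta}$, respectively $\bar{R}_{\gamma,\theta}$, given by (17) and associated with $\kappa = 1$, and $\pi^{\flat}_{\gamma,\theta}$, respectively $\bar\pi^{\flat}_{\gamma,\theta}$, is the invariant probability measure of $S_{\gamma,\theta}$, respectively $\bar{S}_{\gamma,\theta}$, given by (18) and associated with $\kappa = 1$. In addition, $W = W_m$ with $m \in \mathbb{N}$ if H2 holds and $W = W_\alpha$ with $\alpha < \min(\kappa\eta/4, \eta/8)$ if H3 holds, see (19).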
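Proof. The proof is similar to the one of Proposition 30 upon setting for any $(\omega_s)_{s \in [0,T]} \in C([0,T], \mathbb{R}^d)$ with $T = \gamma \lceil 1/\gamma \rceil$
\[
b_1\big(t, (\omega_s)_{s \in [0,T]}\big) = \operatorname{prox}^{\gamma}_{U_\theta}(\omega_{\lfloor t/\gamma \rfloor \gamma}) - \gamma \nabla_x V_\theta\big(\operatorname{prox}^{\gamma}_{U_\theta}(\omega_{\lfloor t/\gamma \rfloor \gamma})\big) ,
\]
\[
b_2\big(t, (\omega_s)_{s \in [0,T]}\big) = \omega_{\lfloor t/\gamma \rfloor \gamma} - \gamma \nabla_x V_\theta(\omega_{\lfloor t/\gamma \rfloor \gamma}) - \gamma \nabla_x U^{\gamma}_\theta(\omega_{\lfloor t/\gamma \rfloor \gamma}) ,
\]
and replacing (32) in Lemma 29 by the following bound, obtained using (9) and Lemma 9:
\[
\begin{aligned}
\left\| b_1\big(t, (\omega_s)_{s \in [0,T]}\big) - b_2\big(t, (\omega_s)_{s \in [0,T]}\big) \right\|^2
&= \Big\| \operatorname{prox}^{\gamma}_{U_\theta}(\omega_{\lfloor t/\gamma \rfloor \gamma}) - \gamma \nabla_x V_\theta\big(\operatorname{prox}^{\gamma}_{U_\theta}(\omega_{\lfloor t/\gamma \rfloor \gamma})\big) - \omega_{\lfloor t/\gamma \rfloor \gamma} \\
&\qquad + \gamma \nabla_x V_\theta(\omega_{\lfloor t/\gamma \rfloor \gamma}) + \big( \omega_{\lfloor t/\gamma \rfloor \gamma} - \operatorname{prox}^{\gamma}_{U_\theta}(\omega_{\lfloor t/\gamma \rfloor \gamma}) \big) \Big\|^2 \\
&= \gamma^2 \left\| \nabla_x V_\theta\big(\operatorname{prox}^{\gamma}_{U_\theta}(\omega_{\lfloor t/\gamma \rfloor \gamma})\big) - \nabla_x V_\theta(\omega_{\lfloor t/\gamma \rfloor \gamma}) \right\|^2
\leq L^2 M^2 \gamma^4 .
\end{aligned}
\]
Here the last term in the first equality is $\gamma \nabla_x U^{\gamma}_\theta(\omega_{\lfloor t/\gamma \rfloor \gamma})$ rewritten with (9), and the final inequality uses that $\nabla_x V_\theta$ is $L$-Lipschitz together with $\|\operatorname{prox}^{\gamma}_{U_\theta}(y) - y\| \leq \gamma M$.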
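As a numerical sanity check of the bound $\|b_1 - b_2\|^2 \leq L^2 M^2 \gamma^4$, the sketch below instantiates the two one-step drift maps (proximal step followed by a gradient step for $b_1$, gradient step with the Moreau-Yosida envelope for $b_2$, both with $\kappa = 1$) on a toy model. The choices $V(x) = (L/2)\|x\|^2$ and $U(x) = \tau\|x\|_1$, for which $M = \tau\sqrt{d}$ and the proximal operator is soft-thresholding, as well as all function names, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def prox_l1(x, lam, tau):
    """Proximal operator of lam * tau * ||.||_1, i.e. soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam * tau, 0.0)

def drift_prox_then_grad(x, gamma, L, tau):
    """b_1: proximal step on U, then gradient step on V(x) = (L/2)||x||^2."""
    p = prox_l1(x, gamma, tau)
    return p - gamma * L * p

def drift_moreau_yosida(x, gamma, L, tau):
    """b_2: gradient step on V plus the Moreau-Yosida envelope of U."""
    grad_envelope = (x - prox_l1(x, gamma, tau)) / gamma
    return x - gamma * L * x - gamma * grad_envelope

def drift_gap_and_bound(x, gamma, L, tau):
    """Return (||b_1 - b_2||^2, L^2 M^2 gamma^4) with M = tau * sqrt(d)."""
    diff = drift_prox_then_grad(x, gamma, L, tau) - drift_moreau_yosida(x, gamma, L, tau)
    M = tau * np.sqrt(x.size)
    return float(np.sum(diff ** 2)), (L * M) ** 2 * gamma ** 4

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal(50)
gap, bound = drift_gap_and_bound(x, gamma=0.1, L=2.0, tau=0.5)
# Small relative tolerance: the bound is attained when every |x_i| >= gamma * tau.
print(gap <= bound * (1 + 1e-9))
```

For this toy model the difference reduces to $\gamma L (x - \operatorname{prox}^{\gamma}_{U}(x))$, so the check mirrors the two steps of the proof: the Lipschitz property of $\nabla V$ and the bound $\|x - \operatorname{prox}^{\gamma}_{U}(x)\| \leq \gamma M$.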
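Proposition 36. Assume H1 and H2 or H3. Let $\bar\kappa \geq 1 \geq \underline{\kappa} > 1/2$. Let $\bar\gamma < \min\{(2 - 1/\underline{\kappa})/L,\ 2/(m+L),\ L^{-1}\}$ if H2 holds and $\bar\gamma < \min\{(2 - 1/\underline{\kappa})/L,\ \eta/(2ML),\ L^{-1}\}$ if H3 holds. Then for any $\theta \in \Theta$, $\kappa \in [\underline{\kappa}, \bar\kappa]$ and $\gamma \in (0,\bar\gamma]$, we have
\[
\max\left( \|\pi_{\gamma,\theta} - \pi_\theta\|_{W^{1/2}} ,\ \|\bar\pi_{\gamma,\theta} - \bar\pi_\theta\|_{W^{1/2}} \right) \leq \bar\Psi(\gamma) ,
\]
where for any $\theta \in \Theta$ and $\gamma \in (0,\bar\gamma]$, $\pi_{\gamma,\theta}$, respectively $\bar\pi_{\gamma,\theta}$, is the invariant probability measure of $R_{\gamma,\theta}$, respectively $\bar{R}_{\gamma,\theta}$, given by (17) and associated with $\kappa$. In addition, $\bar\Psi(\gamma) = \tilde\Psi(\gamma) + \bar{B}_{3,1}\gamma + \bar{B}_{3,2}\gamma^2$, where $\tilde\Psi$ is given in Theorem 28, $\bar{B}_{3,1}$ in Proposition 34 and $\bar{B}_{3,2}$ in Proposition 35, and $W = W_m$ with $m \in \mathbb{N}$ if H2 holds and $W = W_\alpha$ with $\alpha < \min(\kappa\eta/4, \eta/8)$ if H3 holds, see (19).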
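Proof. We only show that for any $\theta \in \Theta$ and $\gamma \in (0,\bar\gamma]$, $\|\pi_{\gamma,\theta} - \pi_\theta\|_{W^{1/2}} \leq \bar\Psi(\gamma)$, as the proof for $\bar\pi_{\gamma,\theta}$ and $\bar\pi_\theta$ is similar. First note that for any $\theta \in \Theta$, $\kappa \in [\underline{\kappa}, \bar\kappa]$ and $\gamma \in (0,\bar\gamma]$ we have
\[
\|\pi_{\gamma,\theta} - \pi_\theta\|_{W^{1/2}} \leq \|\pi_{\gamma,\theta} - \pi^{\sharp}_{\gamma,\theta}\|_{W^{1/2}} + \|\pi^{\sharp}_{\gamma,\theta} - \pi^{\flat}_{\gamma,\theta}\|_{W^{1/2}} + \|\pi^{\flat}_{\gamma,\theta} - \pi_\theta\|_{W^{1/2}} ,
\]
where for any $\theta \in \Theta$ and $\gamma \in (0,\bar\gamma]$, $\pi^{\sharp}_{\gamma,\theta}$ is the invariant probability measure of $R_{\gamma,\theta}$ given by (17) and associated with $\kappa = 1$, and $\pi^{\flat}_{\gamma,\theta}$ is the invariant probability measure of $S_{\gamma,\theta}$ associated with $\kappa = 1$. We conclude the proof upon combining Proposition 34, Proposition 35 and Theorem 28.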
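The bound of Proposition 36 is used later through the rate $\bar\Psi(\gamma) = O(\gamma^{1/2})$. A short justification is sketched below, under the assumption that $\tilde\Psi(\gamma) \leq C\gamma^{1/2}$ on $(0,\bar\gamma]$ for some constant $C \geq 0$; the constant $C$ is notation introduced only for this sketch.

```latex
% Sketch. Assumption: \tilde\Psi(\gamma) \le C \gamma^{1/2} for all \gamma \in (0, \bar\gamma].
% Uses \gamma = \gamma^{1/2}\gamma^{1/2} \le \bar\gamma^{1/2}\gamma^{1/2}
% and  \gamma^2 = \gamma^{3/2}\gamma^{1/2} \le \bar\gamma^{3/2}\gamma^{1/2}.
\begin{align*}
\bar\Psi(\gamma)
  &= \tilde\Psi(\gamma) + \bar{B}_{3,1}\gamma + \bar{B}_{3,2}\gamma^2 \\
  &\le \left( C + \bar{B}_{3,1}\bar\gamma^{1/2} + \bar{B}_{3,2}\bar\gamma^{3/2} \right) \gamma^{1/2} ,
  \qquad \gamma \in (0, \bar\gamma] .
\end{align*}
```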
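Proposition 37. Assume H1 and H2 or H3. Let $\bar\kappa \geq 1 \geq \underline{\kappa} > 1/2$. Let $\bar\gamma < \min\{(2 - 1/\underline{\kappa})/L,\ 2/(m+L)\}$ if H2 holds and $\bar\gamma < \min\{(2 - 1/\underline{\kappa})/L,\ \eta/(2ML)\}$ if H3 holds. Then there exists $\bar{A}_4 \geq 0$ such that for any $\theta_1, \theta_2 \in \Theta$, $\kappa \in [\underline{\kappa}, \bar\kappa]$, $\gamma_1, \gamma_2 \in (0,\bar\gamma]$ with $\gamma_2 < \gamma_1$, $a \in [1/4, 1/2]$ and $x \in \mathbb{R}^d$
\[
\max\left( \|\delta_x R_{\gamma_1,\theta_1} - \delta_x R_{\gamma_2,\theta_2}\|_{W^a} ,\ \|\delta_x \bar{R}_{\gamma_1,\theta_1} - \delta_x \bar{R}_{\gamma_2,\theta_2}\|_{W^a} \right)
\leq \left( \bar\Lambda_1(\gamma_1,\gamma_2) + \bar\Lambda_2(\gamma_1,\gamma_2) \|\theta_1 - \theta_2\| \right) W^{2a}(x) ,
\]
with
\[
\bar\Lambda_1(\gamma_1,\gamma_2) = \bar{A}_4 (\gamma_1/\gamma_2 - 1) , \qquad \bar\Lambda_2(\gamma_1,\gamma_2) = \bar{A}_4 \gamma_2^{1/2} ,
\]
and where $W = W_m$ with $m \in \mathbb{N}$ and $m \geq 2$ if H2 is satisfied and $W = W_\alpha$ with $\alpha < \min(\kappa\eta/4, \eta/8)$ if H3 is satisfied, see (19).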
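Proof. First, note that we only show that for any $\theta_1, \theta_2 \in \Theta$, $\kappa \in [\underline{\kappa}, \bar\kappa]$, $\gamma_1, \gamma_2 \in (0,\bar\gamma]$ with $\gamma_2 < \gamma_1$, $a \in [1/4, 1/2]$ and $x \in \mathbb{R}^d$, we have
\[
\|\delta_x R_{\gamma_1,\theta_1} - \delta_x R_{\gamma_2,\theta_2}\|_{W^a} \leq \left( \bar\Lambda_1(\gamma_1,\gamma_2) + \bar\Lambda_2(\gamma_1,\gamma_2) \|\theta_1 - \theta_2\| \right) W^{2a}(x) ,
\]
since the proof for $\bar{R}_{\gamma_1,\theta_1}$ and $\bar{R}_{\gamma_2,\theta_2}$ is similar. Let $a \in [1/4, 1/2]$, $\theta_1, \theta_2 \in \Theta$, $\kappa \in [\underline{\kappa}, \bar\kappa]$, $\gamma_1, \gamma_2 \in (0,\bar\gamma]$ with $\gamma_2 < \gamma_1$. Using a generalized Pinsker inequality [22, Lemma 24] we have
\[
\|\delta_x R_{\gamma_1,\theta_1} - \delta_x R_{\gamma_2,\theta_2}\|_{W^a}
\leq \sqrt{2} \left( \delta_x R_{\gamma_1,\theta_1} W^{2a}(x) + \delta_x R_{\gamma_2,\theta_2} W^{2a}(x) \right)^{1/2} \mathrm{KL}\left( \delta_x R_{\gamma_1,\theta_1} \,|\, \delta_x R_{\gamma_2,\theta_2} \right)^{1/2} .
\]
Combining this result, Jensen's inequality and Lemma 22 if H2 holds and Lemma 23 if H3 holds, we obtain that
\[
\|\delta_x R_{\gamma_1,\theta_1} - \delta_x R_{\gamma_2,\theta_2}\|_{W^a} \leq 2 (1 + b\bar\gamma)^{1/2}\, \mathrm{KL}\left( \delta_x R_{\gamma_1,\theta_1} \,|\, \delta_x R_{\gamma_2,\theta_2} \right)^{1/2} W^a(x) .
\]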
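Using Lemma 17 and the fact that $\gamma_1 \geq \gamma_2$ we have
\[
\mathrm{KL}\left( \delta_x R_{\gamma_1,\theta_1} \,|\, \delta_x R_{\gamma_2,\theta_2} \right)
\leq d(\gamma_1/\gamma_2 - 1)^2/2 + \left\| \gamma_2 \nabla_x V_{\theta_2}(x) - \gamma_1 \nabla_x V_{\theta_1}(x) + \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_2}(x) - \gamma_1 \nabla_x U^{\gamma_1\kappa}_{\theta_1}(x) \right\|^2 / (4\gamma_2) . \tag{40}
\]
We have
\[
\begin{aligned}
&\left\| \gamma_2 \nabla_x V_{\theta_2}(x) - \gamma_1 \nabla_x V_{\theta_1}(x) + \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_2}(x) - \gamma_1 \nabla_x U^{\gamma_1\kappa}_{\theta_1}(x) \right\|^2 \\
&\quad \leq 4 \left\| \gamma_2 \nabla_x V_{\theta_2}(x) - \gamma_2 \nabla_x V_{\theta_1}(x) \right\|^2 + 4 \left\| \gamma_2 \nabla_x V_{\theta_1}(x) - \gamma_1 \nabla_x V_{\theta_1}(x) \right\|^2 \\
&\qquad + 4 \left\| \gamma_1 \nabla_x U^{\gamma_1\kappa}_{\theta_1}(x) - \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_1}(x) \right\|^2 + 4 \left\| \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_1}(x) - \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_2}(x) \right\|^2 .
\end{aligned} \tag{41}
\]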
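First using H4 we have
\[
\left\| \gamma_2 \nabla_x V_{\theta_2}(x) - \gamma_2 \nabla_x V_{\theta_1}(x) \right\| \leq \gamma_2 M_\Theta \|\theta_1 - \theta_2\| (1 + \|x\|) . \tag{42}
\]
Second using H1 we have
\[
\left\| \gamma_2 \nabla_x V_{\theta_1}(x) - \gamma_1 \nabla_x V_{\theta_1}(x) \right\| \leq (\gamma_1 - \gamma_2) \|\nabla_x V_{\theta_1}(x)\|
\leq (\gamma_1 - \gamma_2) L \|x - x^{\star}_{\theta_1}\|
\leq (\gamma_1 - \gamma_2) L (R_{V,1} + \|x\|) . \tag{43}
\]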
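Third using H1, H4, Lemma 9 and Lemma 11 we have
\[
\begin{aligned}
\left\| \gamma_1 \nabla_x U^{\gamma_1\kappa}_{\theta_1}(x) - \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_1}(x) \right\|
&\leq \left\| \big(x - \operatorname{prox}^{\gamma_1\kappa}_{U_{\theta_1}}(x)\big) - \big(x - \operatorname{prox}^{\gamma_2\kappa}_{U_{\theta_1}}(x)\big) \right\| / \kappa \\
&\leq \left\| \operatorname{prox}^{\gamma_2\kappa}_{U_{\theta_1}}(x) - \operatorname{prox}^{\gamma_1\kappa}_{U_{\theta_1}}(x) \right\| / \kappa \\
&\leq 2M(\gamma_1 - \gamma_2) .
\end{aligned} \tag{44}
\]
Finally using H4 we have
\[
\left\| \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_1}(x) - \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_2}(x) \right\| \leq \gamma_2 \Big( \sup_{t \in [0, \bar\gamma\bar\kappa]} f_\theta(t) \Big) \|\theta_1 - \theta_2\| . \tag{45}
\]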
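The first step of (44) rests on the usual expression of the gradient of a Moreau-Yosida envelope in terms of the proximal operator. The sketch below spells it out. It assumes the standard identity $\nabla U^{\lambda}(x) = (x - \operatorname{prox}^{\lambda}_{U}(x))/\lambda$ and a bound of the form $\|\operatorname{prox}^{\lambda_1}_{U}(x) - \operatorname{prox}^{\lambda_2}_{U}(x)\| \leq 2M|\lambda_1 - \lambda_2|$, which we take to be what the cited lemmas provide.

```latex
% Sketch. Assumptions:
%   (i)  \nabla_x U^{\lambda}_{\theta}(x) = (x - \operatorname{prox}^{\lambda}_{U_\theta}(x)) / \lambda,
%   (ii) \|\operatorname{prox}^{\lambda_1}_{U_\theta}(x) - \operatorname{prox}^{\lambda_2}_{U_\theta}(x)\|
%        \le 2M |\lambda_1 - \lambda_2|.
\begin{align*}
\gamma \nabla_x U^{\gamma\kappa}_{\theta}(x)
  &= \gamma \, \frac{x - \operatorname{prox}^{\gamma\kappa}_{U_\theta}(x)}{\gamma\kappa}
   = \frac{x - \operatorname{prox}^{\gamma\kappa}_{U_\theta}(x)}{\kappa} , \\
\gamma_1 \nabla_x U^{\gamma_1\kappa}_{\theta_1}(x) - \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_1}(x)
  &= \frac{\operatorname{prox}^{\gamma_2\kappa}_{U_{\theta_1}}(x) - \operatorname{prox}^{\gamma_1\kappa}_{U_{\theta_1}}(x)}{\kappa} , \\
\left\| \gamma_1 \nabla_x U^{\gamma_1\kappa}_{\theta_1}(x) - \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_1}(x) \right\|
  &\le \frac{2M \, |\gamma_1\kappa - \gamma_2\kappa|}{\kappa} = 2M(\gamma_1 - \gamma_2) .
\end{align*}
```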
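Combining (42), (43), (44) and (45) in (41) we get that there exists $\bar{A}_{4,1} \geq 0$ such that
\[
\left\| \gamma_2 \nabla_x V_{\theta_2}(x) - \gamma_1 \nabla_x V_{\theta_1}(x) + \gamma_2 \nabla_x U^{\gamma_2\kappa}_{\theta_2}(x) - \gamma_1 \nabla_x U^{\gamma_1\kappa}_{\theta_1}(x) \right\|^2
\leq \bar{A}_{4,1} \left[ (\gamma_1 - \gamma_2)^2 + \gamma_2^2 \|\theta_1 - \theta_2\|^2 \right] W^{2a}(x) .
\]
Using this result in (40) we obtain that there exists $\bar{A}_{4,2} \geq 0$ such that
\[
\mathrm{KL}\left( \delta_x R_{\gamma_1,\theta_1} \,|\, \delta_x R_{\gamma_2,\theta_2} \right) \leq \bar{A}_{4,2} \left[ (\gamma_1/\gamma_2 - 1)^2 + \gamma_2 \|\theta_1 - \theta_2\|^2 \right] W^{2a}(x) ,
\]
which implies the announced result upon setting $\bar{A}_4 = 2\sqrt{\bar{A}_{4,2}}\,(1 + b\bar\gamma)^{1/2}$ and using that for any $u, v \geq 0$, $\sqrt{u+v} \leq \sqrt{u} + \sqrt{v}$.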
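For completeness, the last step of the proof of Proposition 37 can be written out as follows. This is a sketch that only combines the weighted Pinsker bound and the Kullback-Leibler bound of the proof with $\sqrt{u+v} \leq \sqrt{u} + \sqrt{v}$; it uses $\gamma_1/\gamma_2 - 1 \geq 0$, which holds because $\gamma_2 < \gamma_1$.

```latex
% Sketch: from the KL bound to the constants \bar\Lambda_1, \bar\Lambda_2.
% Note W^a(x) (W^{2a}(x))^{1/2} = W^{2a}(x).
\begin{align*}
\| \delta_x R_{\gamma_1,\theta_1} - \delta_x R_{\gamma_2,\theta_2} \|_{W^a}
  &\le 2 (1 + b\bar\gamma)^{1/2} \,
       \mathrm{KL}\left( \delta_x R_{\gamma_1,\theta_1} \,|\, \delta_x R_{\gamma_2,\theta_2} \right)^{1/2} W^a(x) \\
  &\le 2 (1 + b\bar\gamma)^{1/2} \sqrt{\bar{A}_{4,2}}
       \left[ (\gamma_1/\gamma_2 - 1)^2 + \gamma_2 \|\theta_1 - \theta_2\|^2 \right]^{1/2} W^{2a}(x) \\
  &\le \bar{A}_4 \left[ (\gamma_1/\gamma_2 - 1) + \gamma_2^{1/2} \|\theta_1 - \theta_2\| \right] W^{2a}(x) \\
  &= \left( \bar\Lambda_1(\gamma_1,\gamma_2) + \bar\Lambda_2(\gamma_1,\gamma_2) \|\theta_1 - \theta_2\| \right) W^{2a}(x) .
\end{align*}
```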
5.6 Proof of Theorem 6
We divide the proof in two parts.
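(a) First assume that $(X^n_k)_{n \in \mathbb{N},\, k \in \{0,\dots,m_n\}}$ and $(\bar{X}^n_k)_{n \in \mathbb{N},\, k \in \{0,\dots,m_n\}}$ are given by (5) with $\{(K_{\gamma,\theta}, \bar{K}_{\gamma,\theta}) : \gamma \in (0,\bar\gamma],\ \theta \in \Theta\} = \{(S_{\gamma,\theta}, \bar{S}_{\gamma,\theta}) : \gamma \in (0,\bar\gamma],\ \theta \in \Theta\}$. Then Lemma 26 implies that [17, H1a] is satisfied with $A_1 \leftarrow A_1$, and Theorem 25 implies that [17, H1b] holds with $A_2 \leftarrow A_2$ and $\rho \leftarrow \rho$. Finally, using Corollary 31 we get that [17, H1c] holds with $\Psi \leftarrow \Psi$. Therefore, we can apply [17, Theorem 1] and we obtain that the sequence $(\theta_n)_{n \in \mathbb{N}}$ converges a.s. if
\[
\sum_{n=0}^{+\infty} \delta_n = +\infty , \qquad \sum_{n=0}^{+\infty} \delta_{n+1} \Psi(\gamma_n) < +\infty , \qquad \sum_{n=0}^{+\infty} \delta_{n+1}/(m_n\gamma_n) < +\infty .
\]
Since $\Psi(\gamma_n) = O(\gamma_n^{1/2})$ by Corollary 31, these summability conditions are satisfied under the summability assumptions of Theorem 6-(1). Proposition 32 implies that [17, H2] holds with $\Lambda_1 \leftarrow \Lambda_1$ and $\Lambda_2 \leftarrow \Lambda_2$. Therefore if $m_n = m_0$ for all $n \in \mathbb{N}$, we can apply [17, Theorem 3] and we obtain that the sequence $(\theta_n)_{n \in \mathbb{N}}$ converges a.s. if
\[
\sum_{n=0}^{+\infty} \delta_n = +\infty , \qquad \sum_{n=0}^{+\infty} \delta_{n+1} \Psi(\gamma_n) < +\infty , \qquad \sum_{n=0}^{+\infty} \delta_{n+1}^2 \gamma_n^{-2} < +\infty ,
\]
\[
\sum_{n=0}^{+\infty} \delta_{n+1} \gamma_n^{-2} \left( \Lambda_1(\gamma_n,\gamma_{n+1}) + \delta_{n+1} \Lambda_2(\gamma_n,\gamma_{n+1}) \right) < +\infty .
\]
These summability conditions are satisfied under the summability assumptions of Theorem 6-(2).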
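(b) Second assume that $(X^n_k)_{n \in \mathbb{N},\, k \in \{0,\dots,m_n\}}$ and $(\bar{X}^n_k)_{n \in \mathbb{N},\, k \in \{0,\dots,m_n\}}$ are given by (5) with $\{(K_{\gamma,\theta}, \bar{K}_{\gamma,\theta}) : \gamma \in (0,\bar\gamma],\ \theta \in \Theta\} = \{(R_{\gamma,\theta}, \bar{R}_{\gamma,\theta}) : \gamma \in (0,\bar\gamma],\ \theta \in \Theta\}$. Then Lemma 33 implies that [17, H1a] is satisfied with $A_1 \leftarrow \bar{A}_1$, and Theorem 21 implies that [17, H1b] holds with $A_2 \leftarrow \bar{A}_2$ and $\rho \leftarrow \bar\rho$. Finally, using Proposition 36 we get that [17, H1c] holds with $\Psi \leftarrow \bar\Psi$. Therefore, we can apply [17, Theorem 1] and we obtain that the sequence $(\theta_n)_{n \in \mathbb{N}}$ converges a.s. if
\[
\sum_{n=0}^{+\infty} \delta_n = +\infty , \qquad \sum_{n=0}^{+\infty} \delta_{n+1} \bar\Psi(\gamma_n) < +\infty , \qquad \sum_{n=0}^{+\infty} \delta_{n+1}/(m_n\gamma_n) < +\infty .
\]
Since $\bar\Psi(\gamma_n) = O(\gamma_n^{1/2})$ by Proposition 36, these summability conditions are satisfied under the summability assumptions of Theorem 6-(1). Proposition 37 implies that [17, H2] holds with $\Lambda_1 \leftarrow \bar\Lambda_1$ and $\Lambda_2 \leftarrow \bar\Lambda_2$. Therefore if $m_n = m_0$ for all $n \in \mathbb{N}$, we can apply [17, Theorem 3] and we obtain that the sequence $(\theta_n)_{n \in \mathbb{N}}$ converges a.s. if
\[
\sum_{n=0}^{+\infty} \delta_n = +\infty , \qquad \sum_{n=0}^{+\infty} \delta_{n+1} \bar\Psi(\gamma_n) < +\infty , \qquad \sum_{n=0}^{+\infty} \delta_{n+1}^2 \gamma_n^{-2} < +\infty ,
\]
\[
\sum_{n=0}^{+\infty} \delta_{n+1} \gamma_n^{-2} \left( \bar\Lambda_1(\gamma_n,\gamma_{n+1}) + \delta_{n+1} \bar\Lambda_2(\gamma_n,\gamma_{n+1}) \right) < +\infty .
\]
These summability conditions are satisfied under the summability assumptions of Theorem 6-(2).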
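To make the first set of summability conditions (the one used with [17, Theorem 1]) concrete, the sketch below checks them for power-law schedules $\delta_n = n^{-a}$, $\gamma_n = n^{-b}$ and $m_n = n^{c}$, using the rate $O(\gamma_n^{1/2})$ for the bias term and the fact that $\sum_n n^{-s}$ converges if and only if $s > 1$. The power-law form, the exponent names `a`, `b`, `c` and the function name are assumptions made for illustration; they are not prescribed by the theorem.

```python
def check_power_law_schedules(a: float, b: float, c: float) -> dict:
    """Check three summability conditions for the hypothetical schedules
    delta_n = n**(-a), gamma_n = n**(-b), m_n = n**c (with a, b > 0, c >= 0).

    Each series behaves like a p-series sum_n n**(-s), which converges
    if and only if s > 1. The bias term is replaced by its assumed
    upper bound gamma_n**0.5, so the second check is a sufficient condition.
    """
    return {
        # sum_n delta_n = +inf                    <=>  a <= 1
        "sum_delta_diverges": a <= 1.0,
        # sum_n delta_{n+1} * gamma_n**0.5 < +inf <=>  a + b/2 > 1
        "sum_delta_bias_converges": a + b / 2.0 > 1.0,
        # sum_n delta_{n+1} / (m_n * gamma_n) < +inf  <=>  a - b + c > 1
        "sum_delta_over_m_gamma_converges": a - b + c > 1.0,
    }

conditions = check_power_law_schedules(a=0.8, b=0.6, c=1.0)
print(conditions, all(conditions.values()))
```

Shifting the index from $\delta_{n+1}$ to $\delta_n$ does not change the exponents, which is why the sketch treats both alike. The fixed-batch-size conditions involve $\Lambda_1$ and $\Lambda_2$ and are deliberately left out of this illustration.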
5.7 Proof of Theorem 7
The proof is similar to the one of Theorem 6, using [17, Theorem 2, Theorem 4] instead of [17, Theorem 1, Theorem 3].
6 Acknowledgements
AD acknowledges financial support from Polish National Science Center grant: NCN UMO-2018/31/B/ST1/00253. MP acknowledges financial support from EPSRC under grant EP/T007346/1.
References
[1] Yves F. Atchadé, Gersende Fort, and Eric Moulines. On perturbed proximal gradient algorithms. J. Mach. Learn. Res., 18(1):310–342, 2017.
[2] Francis R. Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems 24:
25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a
meeting held 12-14 December 2011, Granada, Spain, pages 451–459, 2011.
[3] D. Bakry, I. Gentil, and M. Ledoux. Analysis and geometry of Markov diffusion operators,
volume 348 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of
Mathematical Sciences]. Springer, Cham, 2014.
[4] Dominique Bakry, Franck Barthe, Patrick Cattiaux, and Arnaud Guillin. A simple proof of
the Poincaré inequality for a large class of probability measures including the log-concave case.
Electron. Commun. Probab., 13:60–66, 2008.
[5] Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC.
Springer, Cham, second edition, 2017. With a foreword by Hédy Attouch.
[6] M. Benaim. A dynamical system approach to stochastic approximations. SIAM J. Control
Optim., 34(2):437–472, 1996.
[7] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990.
Translated from the French by Stephen S. Wilson.
[8] Sebastian Berisha, James G Nagy, and Robert J Plemmons. Deblurring and sparse unmixing
of hyperspectral images using multiple point spread functions. SIAM Journal on Scientific
Computing, 37(5):S389–S406, 2015.
[9] José M. Bioucas-Dias, Antonio Plaza, Nicolas Dobigeon, Mario Parente, Qian Du, Paul Gader, and Jocelyn Chanussot. Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2):354–379, 2012.
[10] Emmanuel J. Candès, Yonina C. Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM Review, 57(2):225–251, 2015.
[11] Antonin Chambolle and Thomas Pock. An introduction to continuous optimization for imaging. Acta Numerica, 25:161–319, 2016.
[12] Emilie Chouzenoux, Anna Jezierska, Jean-Christophe Pesquet, and Hugues Talbot. A Convex
Approach for Image Restoration with Exact Poisson–Gaussian Likelihood. SIAM Journal on
Imaging Sciences, 8(4):2662–2682, 2015.
[13] Julianne Chung and Linh Nguyen. Motion estimation and correction in photoacoustic tomographic reconstruction. SIAM Journal on Imaging Sciences, 10(1):216–242, 2017.
[14] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
[15] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
[16] V. De Bortoli and A. Durmus. Convergence of diffusions and their discretizations: from
continuous to discrete processes and back, 2019.
[17] V. De Bortoli, A. Durmus, M. Pereyra, and A. F. Vidal. Efficient stochastic optimisation by unadjusted Langevin Monte Carlo. Application to maximum marginal likelihood and empirical Bayesian estimation, 2019.
[18] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[19] Paul Dupuis and Richard S. Ellis. A weak convergence approach to the theory of large deviations. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons, Inc., New York, 1997. A Wiley-Interscience Publication.
[20] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the Unadjusted
Langevin Algorithm. ArXiv e-prints, May 2016.
[21] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
[22] Alain Durmus and Eric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
[23] Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
[24] Bruno Galerne and Arthur Leclaire. Texture inpainting using efficient Gaussian conditional
simulation. SIAM Journal on Imaging Sciences, 10(3):1446–1474, 2017.
[25] Nobuyuki Ikeda and Shinzo Watanabe. Stochastic differential equations and diffusion processes, volume 24 of North-Holland Mathematical Library. North-Holland Publishing Co.,
Amsterdam; Kodansha, Ltd., Tokyo, second edition, 1989.
[26] Mark A Iwen, Aditya Viswanathan, and Yang Wang. Fast phase retrieval from local correlation
measurements. SIAM Journal on Imaging Sciences, 9(4):1655–1688, 2016.
[27] Jari Kaipio and Erkki Somersalo. Statistical and computational inverse problems, volume 160.
Springer Science & Business Media, 2006.
[28] Michael Kech and Felix Krahmer. Optimal injectivity conditions for bilinear inverse problems
with applications to identifiability of deconvolution problems. SIAM Journal on Applied
Algebra and Geometry, 1(1):20–37, 2017.
[29] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[30] Solomon Kullback. Information theory and statistics. John Wiley and Sons, Inc., New York;
Chapman and Hall, Ltd., London, 1959.
[31] Shutao Li, Xudong Kang, Leyuan Fang, Jianwen Hu, and Haitao Yin. Pixel-level image fusion:
A survey of the state of the art. Information Fusion, 33:100–112, 2017.
[32] Robert S. Liptser and Albert N. Shiryaev. Statistics of random processes. II, volume 6 of
Applications of Mathematics (New York). Springer-Verlag, Berlin, expanded edition, 2001.
Applications, Translated from the 1974 Russian original by A. B. Aries, Stochastic Modelling
and Applied Probability.
[33] M. Métivier and P. Priouret. Applications of a Kushner and Clark lemma to general classes
of stochastic algorithms. IEEE Trans. Inform. Theory, 30(2, part 1):140–151, 1984.
[34] M. Métivier and P. Priouret. Théorèmes de convergence presque sure pour une classe
d’algorithmes stochastiques à pas décroissant. Probab. Theory Related Fields, 74(3):403–428,
1987.
[35] Veniamin I Morgenshtern and Emmanuel J Candes. Super-resolution of positive sources: The
discrete setup. SIAM Journal on Imaging Sciences, 9(1):412–444, 2016.
[36] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87.
Springer Science & Business Media, 2013.
[37] George Pólya and Gabor Szegő. Problems and theorems in analysis. I. Classics in Mathematics.
Springer-Verlag, Berlin, 1998. Series, integral calculus, theory of functions, Translated from
the German by Dorothee Aeppli, Reprint of the 1978 English translation.
[38] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[39] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal
for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[40] Saiprasad Ravishankar and Yoram Bresler. Efficient blind compressed sensing using sparsifying
transforms with convergence guarantees and application to magnetic resonance imaging. SIAM
Journal on Imaging Sciences, 8(4):2519–2557, 2015.
[41] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[42] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and
their discrete approximations. Bernoulli, 2(4):341–363, 1996.
[43] Lorenzo Rosasco, Silvia Villa, and Bang Công Vũ. Convergence of stochastic proximal gradient algorithm. Applied Mathematics & Optimization, pages 1–27, 2019.
[44] Carola-Bibiane Schönlieb. Partial Differential Equation Methods for Image Inpainting, volume 29. Cambridge University Press, 2015.
[45] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization:
Convergence results and optimal averaging schemes. In International Conference on Machine
Learning, pages 71–79, 2013.
[46] Miguel Simões, José Bioucas-Dias, Luis B. Almeida, and Jocelyn Chanussot. A convex formulation for hyperspectral image superresolution via subspace-based regularization. IEEE Transactions on Geoscience and Remote Sensing, 53(6):3373–3388, 2015.
[47] Weijie Su, Stephen P. Boyd, and Emmanuel J. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. J. Mach. Learn. Res., 17:153:1–153:43, 2016.
[48] V. B. Tadić and A. Doucet. Asymptotic bias of stochastic gradient search. Ann. Appl. Probab.,
27(6):3255–3304, 2017.
[49] Ana F. Vidal, Valentin De Bortoli, Marcelo Pereyra, and Alain Durmus. Maximum likelihood estimation of regularisation parameters in high-dimensional inverse problems: an empirical Bayesian approach. Part I: Methodology and experiments, 2019.
[50] Ana Fernandez Vidal and Marcelo Pereyra. Maximum likelihood estimation of regularisation
parameters. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages
1742–1746. IEEE, 2018.
[51] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network
for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 2472–2481, 2018.