arXiv:2502.16773v1 [stat.CO] 24 Feb 2025
SPLITTING REGULARIZED WASSERSTEIN PROXIMAL ALGORITHMS
FOR NONSMOOTH SAMPLING PROBLEMS
FUQUN HAN, STANLEY OSHER, AND WUCHEN LI
Abstract. Sampling from nonsmooth target probability distributions is essential in various applications, including the Bayesian Lasso. We propose a splitting-based sampling algorithm for the time-implicit discretization of the probability flow for the Fokker–Planck equation, where the score function, defined as the gradient of the logarithm of the current probability density function, is approximated by the regularized Wasserstein proximal. When the prior distribution is the Laplace prior, our algorithm is explicitly formulated as a deterministic interacting particle system, incorporating softmax operators and shrinkage operations to efficiently compute the gradient drift vector field and the score function. The proposed formulation introduces a particular class of attention layers in transformer structures, which can sample sparse target distributions. We verify the convergence towards target distributions in terms of Rényi divergences under suitable conditions. Numerical experiments in high-dimensional nonsmooth sampling problems, such as sampling from mixed Gaussian and Laplace distributions, logistic regression, image restoration with L1-TV regularization, and Bayesian neural networks, demonstrate the efficiency and robust performance of the proposed method.
1. Introduction
Solving the Bayesian Lasso problem [28] involves sampling from the target distribution
\[
\rho^*(x) = \frac{1}{Z}\exp\big(-\beta(f(x)+g(x))\big),
\]
where $x\in\mathbb{R}^d$, $f\colon\mathbb{R}^d\to\mathbb{R}$ is the negative log-likelihood, $g(x)=\lambda\|x\|_1$ is the negative log-density of the Laplace prior with $\lambda>0$, $\beta>0$ is a known parameter, and $Z$ is an unknown normalization constant. The Bayesian Lasso is widely used because it simultaneously performs parameter estimation and variable selection. It has broad applications in high-dimensional real-world data analysis, including cancer prediction [10], depression symptom diagnosis [27], and Bayesian neural networks [34].
Most algorithms for sampling from $\rho^*$ rely on discretizing the overdamped Langevin dynamics. In each iteration, these algorithms evaluate the gradient of the logarithm of the target distribution once and add a Brownian motion perturbation to generate diffusion. However, the time-discretized overdamped Langevin dynamics presents several challenges. First, the gradient of $g$ may not be well-defined, as in the case of $g$ being an L1 norm. Second, overdamped Langevin dynamics often performs inefficiently in high-dimensional sampling problems because the variance of the Brownian motion depends linearly on the dimension.
To address the first challenge, many proximal sampling algorithms, often with splitting tech-
niques, have been extensively studied. [29, 13, 31] use proximal operators to approximate the gra-
dient of nonsmooth log-density. Extended works include methods leveraging a restricted Gaussian
oracle (RGO) [22, 8, 24], incorporating both sub-gradient and proximal operators [16], and solving
an inexact proximal map at each iteration [2]. For a recent review, see [21]. In these works, the
proximal map is often interpreted as a semi-implicit discretization of the Langevin dynamics with
respect to the drift term. The present study also employs the proximal operator to approximate the gradient of nonsmooth terms; however, the proposed algorithm is fully deterministic, as described below.
Furthermore, to handle the second challenge, instead of considering the time discretization of the Langevin dynamics, we analyze a deterministic interacting particle system obtained from the time-discretized probability flow ODE. Here, the ODE involves the drift function and the gradient of the logarithm of the current probability density function, known as the score function, which induces the diffusion. Since this approach avoids simulating Brownian motion, it is independent of the sample space dimension. However, accurately approximating the score function presents a challenge of its own.
To approximate the evolution of the score function, [32] derived a closed-form formula using the regularized Wasserstein proximal operator (RWPO). The RWPO is defined as the Wasserstein proximal operator with a Laplacian regularization term (see Section 2 for details). By applying Hopf–Cole transformations, the operator admits a closed-form kernel formula. It has been shown that the RWPO provides a first-order approximation to the evolution of the Fokker–Planck equation [17], leading to an effective score function approximation. The sampling algorithm based on the RWPO, named the backward regularized Wasserstein proximal (BRWP) method, has been implemented in several studies [32, 18] with different computational strategies. Its backward nature comes from the implicit time discretization of the probability flow ODE for the score function term. However, a key challenge in implementing the BRWP kernel lies in approximating an integral over $\mathbb{R}^d$ to compute the denominator term.
In this work, we derive a computationally efficient closed-form update for BRWP that avoids evaluating a high-dimensional integral for special nonsmooth functions, such as the L1 norm. Following the restricted Gaussian oracle of BRWP with the L1 function, we derive an explicit formula for the sampling algorithm, in which samples interact with each other through an interacting kernel function. In particular, this kernel function is constructed from shrinkage operators and softmax functions. Moreover, we also apply the splitting method and proximal updates to sampling problems with nonsmooth target densities.
We sketch the algorithm below. For particles $\{x^k_i\}_{i=1}^N$ at the $k$-th iteration, when $g(x)=\lambda\|x\|_1$, the proposed iterative sampling scheme is
\[
x^{k+1/2}_i = x^k_i - h\nabla f(x^k_i), \qquad
x^{k+1}_i = x^{k+1/2}_i + \frac{1}{2}\Big( S_{\lambda h}(x^{k+1/2}_i) - \sum_{j=1}^N \mathrm{softmax}\big(U(i,j)\big)_j\, x^{k+1/2}_j \Big),
\]
where $h>0$ is the time step size. The interacting kernel is defined as
\[
U(i,j) := -\frac{\beta}{2}\left( \frac{\|x^{k+1/2}_i - x^{k+1/2}_j\|_2^2 - \|S_{\lambda h}(x^{k+1/2}_j) - x^{k+1/2}_j\|_2^2}{2h} - \lambda\|S_{\lambda h}(x^{k+1/2}_j)\|_1 \right),
\]
with
\[
\mathrm{softmax}(\omega) = \left( \frac{\exp(\omega_j)}{\sum_{\ell=1}^N \exp(\omega_\ell)} \right)_{1\le j\le N}, \qquad \omega\in\mathbb{R}^N.
\]
The shrinkage operator $S_{\lambda h}$ takes the form
\[
S_{\lambda h}(x) := \mathrm{sign}(x)\,\mathrm{ReLU}(|x|-\lambda h),
\]
with the rectified linear unit (ReLU) function $\mathrm{ReLU}(z)=\max\{z,0\}$ for $z\in\mathbb{R}$, applied componentwise. We remark that the shrinkage operator is the proximal map of the L1 norm, i.e., $S_{\lambda h}(x) = \mathrm{prox}^h_{\lambda\|x\|_1}(x)$.
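As a concrete illustration, the following is a minimal numpy sketch of the two ingredients above, the shrinkage operator and a row-wise softmax over the interaction kernel; the function names are our own and not part of the paper's released code.

```python
import numpy as np

def shrink(x, tau):
    # Soft-thresholding: S_tau(x) = sign(x) * max(|x| - tau, 0),
    # the proximal map of tau * ||x||_1, applied componentwise.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def row_softmax(U):
    # Softmax over the particle index j for each row i (numerically stabilized;
    # subtracting the row maximum does not change the softmax values).
    U = U - U.max(axis=1, keepdims=True)
    W = np.exp(U)
    return W / W.sum(axis=1, keepdims=True)
```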
The iterative scheme exhibits an intriguing connection to recent AI methods, particularly transformer architectures, as explored in [6, 15]. The proposed sampling algorithms can be viewed as analogs of multi-attention transformers, incorporating generalized attention layers and the ReLU function. In this framework, each sample $x^k_i$ acts as a token, while the matrix operator $U$ defines the attention mechanism. A more detailed discussion of the connection between the proposed scheme and attention mechanisms in transformer architectures is provided in Section 2.4.
Compared to algorithms based on splitting the overdamped Langevin dynamics with Brownian
motion, as studied in [29, 13, 31, 22, 8, 24], the proposed deterministic approach generally provides
a better approximation to the target density empirically, particularly with a small number of par-
ticles. It also demonstrates faster convergence in high-dimensional sample spaces, benefiting from
adapting the deterministic score function, as established in [17]. Several other works have investi-
gated deterministic interacting particle systems for sampling, including Stein variational gradient
descent methods [25] and blob methods [11]. The proposed approach, however, leverages a kernel
formulation derived directly from the solution of the Fokker–Planck equation, naturally incorpo-
rating information about the underlying dynamics, as reflected in the definition of U(i, j) above.
Furthermore, the proposed kernel is closely related to the restricted Gaussian oracle [22] due to
the definition of the kernel formula for RWPO and our computational implementation provides an
approximation to the restricted Gaussian oracle.
The structure of this paper is as follows. Section 2 presents the derivation of the BRWP-splitting
sampling scheme with a detailed algorithm description. In particular, we introduce several kernels,
each corresponding to a different particle-based approximation of the initial density. Section 3
demonstrates the convergence of the BRWP-splitting algorithm towards the target density for the
Rényi divergence under the Poincaré inequality and suitable conditions. This analysis is based on
an interpolation argument and provides a term-by-term bound on the discretization error. Section
4 extends our approach to other regularization terms, specifically L1-TV regularization, which
integrates primal-dual hybrid gradient descent with the BRWP-splitting algorithm. Finally, Section
5 presents numerical experiments on mixture distributions, Bayesian logistic regression, several
imaging applications, and Bayesian neural network training. Proofs and detailed derivations are
included in the supplementary material.
2. Regularized Wasserstein Proximal and Splitting Methods for Sampling
We aim to draw samples from probability distributions of the form
\[
\rho^*(x) = \frac{1}{Z}\exp\big(-\beta(f(x)+\lambda\|x\|_1)\big), \qquad (1)
\]
where $x\in\mathbb{R}^d$, $f\colon\mathbb{R}^d\to\mathbb{R}$ is $L$-smooth, $\beta=(k_B T)^{-1}$ with a temperature constant $T>0$ and the Boltzmann constant $k_B$, $\lambda$ is a regularization parameter, and $Z=\int_{\mathbb{R}^d}\exp(-\beta(f(y)+\lambda\|y\|_1))\,dy<+\infty$ is an unknown normalization constant.
Sampling from such a distribution is widely used in parameter estimation, particularly under
the framework of the Bayesian Lasso problem [28], which simultaneously performs estimation and
variable selection. However, the nonsmoothness of the L1norm poses significant challenges in
developing theoretically sound and numerically efficient sampling algorithms. Beyond the Bayesian
Lasso setting, we are also interested in more general cases where g(x) is a nonsmooth function whose
proximal operator is easy to compute. In this case, we consider sampling from the distribution
\[
\rho^*(x) = \frac{1}{Z}\exp\big(-\beta(f(x)+g(x))\big). \qquad (2)
\]
2.1. Langevin dynamics and regularized Wasserstein proximal operator. In this section, we review the time discretization of the overdamped Langevin dynamics and the regularized Wasserstein proximal operator to motivate the proposed algorithm.
Denote $V=f+g$ for simplicity. To sample from $\rho^*$ in (2), a classical approach involves the overdamped Langevin dynamics at time $t$:
\[
dX_t = -\nabla V(X_t)\,dt + \sqrt{2\beta^{-1}}\,dB_t, \qquad (3)
\]
where $X_t\in\mathbb{R}^d$ is a stochastic process and $B_t$ is the standard Brownian motion in $\mathbb{R}^d$. Denote $\rho_t$ as the probability density function of $X_t$. It is well known that the Kolmogorov forward equation of the stochastic process $X_t$ is the following Fokker–Planck equation:
\[
\frac{\partial\rho_t}{\partial t} = \nabla\cdot(\rho_t\nabla V) + \beta^{-1}\Delta\rho_t = \beta^{-1}\nabla\cdot\Big(\rho_t\nabla\log\frac{\rho_t}{\rho^*}\Big), \qquad (4)
\]
where we use the facts that $\rho_t\nabla\log\rho_t = \nabla\rho_t$ and $\nabla\log\rho^* = \nabla\log e^{-\beta V} = -\beta\nabla V$.
From the stationary solution of the Fokker–Planck equation, we observe that the invariant distribution of the Langevin dynamics coincides with the target distribution $\rho^*$. However, directly applying the overdamped Langevin dynamics (3) to sample from (1) presents several challenges. First, the function $V$ is nonsmooth, which creates difficulties in the gradient computation. Second, the variance of the Brownian motion depends linearly on the sample space dimension, which slows down convergence and poses challenges for high-dimensional sampling tasks.
To address the first issue, for a small step size $h>0$, one often utilizes the Moreau envelope
\[
g_h(x) = \inf_{y\in\mathbb{R}^d}\, g(y) + \frac{1}{2h}\|x-y\|_2^2, \qquad (5)
\]
which provides a smooth approximation to the nonsmooth function $g$, and the proximal operator
\[
\mathrm{prox}^h_g(x) = \operatorname*{arg\,min}_{y\in\mathbb{R}^d}\, g(y) + \frac{1}{2h}\|x-y\|_2^2, \qquad (6)
\]
which provides a smooth approximation to the gradient of $g$ based on the relation
\[
\nabla g_h(x) = \frac{x-\mathrm{prox}^h_g(x)}{h}, \qquad \text{for a convex function } g. \qquad (7)
\]
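As a quick sanity check of relation (7) for $g(x)=\lambda\|x\|_1$, whose Moreau envelope is the Huber function, one can compare the prox-based gradient with a finite-difference gradient of the envelope. This is our own illustrative snippet under those assumptions, not part of the paper's code.

```python
import numpy as np

lam, h = 0.5, 0.1

def prox_l1(x, tau):
    # Shrinkage operator: proximal map of tau * ||x||_1.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def moreau_env_l1(x, lam, h):
    # Closed form of the Moreau envelope of lam*|x| (Huber function), summed over components.
    return np.where(np.abs(x) <= lam * h,
                    x**2 / (2 * h),
                    lam * np.abs(x) - lam**2 * h / 2).sum()

x = np.array([0.03, -0.2, 1.5])
grad_via_prox = (x - prox_l1(x, lam * h)) / h            # relation (7)
eps = 1e-6
grad_fd = np.array([(moreau_env_l1(x + eps * e, lam, h) - moreau_env_l1(x - eps * e, lam, h)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(grad_via_prox, grad_fd, atol=1e-4))     # True away from the kink |x| = lam*h
```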
These tools have been widely applied in nonsmooth sampling problems [24, 36, 13]. In this work,
we also employ the proximal operator to approximate the gradient of nonsmooth functions.
Furthermore, to tackle the second challenge, which arises from the linear dependence of the variance of the Brownian motion on the dimension, we aim to avoid simulating Brownian motion in the sampling algorithm. Instead, we consider the evolution of particles $x_t\in\mathbb{R}^d$ governed by the probability flow ODE:
\[
dx_t = -\nabla V(x_t)\,dt - \beta^{-1}\nabla\log\rho_t(x_t)\,dt. \qquad (8)
\]
Here, the diffusion is induced by the score function $\nabla\log\rho_t$. While the individual particle trajectories of equation (8) differ from those of the stochastic dynamics (3), the Liouville equation of (8) is still the Fokker–Planck equation (4).
The primary challenge in discretizing the probability flow ODE (8) in time is the accurate ap-
proximation of the score function. For each discretized time point, since we can only access N
particles obtained from the previous iteration, kernel density estimation-based particle methods
can be unstable and sensitive to the choice of bandwidth. To mitigate this, we consider a semi-
implicit discretization of (8), where the score function at the next time step is utilized. This results
in the following iterative sampling scheme.
Denote the time steps as $t_k$ for $k=1,2,\ldots$, with step size $h=t_{k+1}-t_k>0$. Let $x^k$ represent a particle at time $t_k$, distributed according to the density $\rho_k$, i.e., $x^k\sim\rho_k$. Similarly, let $x^{k+1}\sim\rho_{k+1}$, where $\rho_{k+1}$ is the density at the next time step $t_{k+1}=t_k+h$. Then the semi-implicit discretization of the probability flow ODE in time is
\[
x^{k+1} = x^k - h\nabla V(x^k) - h\beta^{-1}\nabla\log\rho_{k+1}(x^k). \qquad (9)
\]
To compute $\rho_{k+1}$, one must approximate the evolution of the density function following the Fokker–Planck equation (4). A classical approach is the JKO scheme [19]:
\[
\rho_{k+1} = \operatorname*{arg\,min}_{\rho\in\mathcal{P}_2(\mathbb{R}^d)}\; \beta^{-1} D_{\mathrm{KL}}(\rho\|\rho^*) + \frac{1}{2h} W_2(\rho,\rho_k)^2, \qquad (10)
\]
where $\mathcal{P}_2(\mathbb{R}^d)$ is the set of probability measures on $\mathbb{R}^d$ with finite second-order moment and $D_{\mathrm{KL}}(\rho\|\rho^*)$ denotes the Kullback–Leibler (KL) divergence defined as
\[
D_{\mathrm{KL}}(\rho\|\rho^*) := \int_{\mathbb{R}^d} \rho\log\frac{\rho}{\rho^*}\,dx.
\]
Moreover, $W_2(\rho,\rho_k)^2$ is the squared Wasserstein-2 distance, which can be defined via the Benamou–Brenier formula [1]:
\[
\frac{W_2(\rho_0,\rho_h)^2}{2h} := \inf_v \int_0^h\!\!\int_{\mathbb{R}^d} \frac{1}{2}\|v(t,x)\|^2\rho(t,x)\,dx\,dt,
\]
where the minimization is taken over vector fields $v\colon[0,h]\times\mathbb{R}^d\to\mathbb{R}^d$ subject to the continuity equation with fixed initial and terminal conditions:
\[
\frac{\partial\rho}{\partial t} + \nabla\cdot(\rho v) = 0, \qquad \rho(0,x)=\rho_0(x), \quad \rho(h,x)=\rho_h(x).
\]
However, solving the JKO-type implicit scheme often requires high-dimensional optimization, which
can be computationally expensive. We remark that many existing sampling algorithms exploit
certain splitting of the JKO scheme [24, 31, 3] and employ the implicit gradient descent for the drift
vector fields. This work considers the implicit update regarding both drift and the score functions
simultaneously.
To derive a closed-form update for the evolution of the Fokker–Planck equation, we start with the Wasserstein proximal operator with linear energy, as introduced in [23]. By incorporating a Laplacian regularization term into the Wasserstein proximal operator and applying the Benamou–Brenier formula, we obtain the following regularized Wasserstein proximal operator (RWPO):
\[
\mathrm{WProx}^{h,\beta}_V(\rho_k) := \operatorname*{arg\,min}_{q\in\mathcal{P}_2(\mathbb{R}^d)} \inf_v \int_0^h\!\!\int_{\mathbb{R}^d} \frac{1}{2}\|v(t,x)\|_2^2\,\rho(t,x)\,dx\,dt + \int_{\mathbb{R}^d} V(x)q(x)\,dx, \qquad (11)
\]
where the minimization is taken over all vector fields $v$ and terminal densities $q$, subject to the continuity equation with an additional Laplacian term and the initial condition
\[
\frac{\partial\rho}{\partial t} + \nabla\cdot(\rho v) = \beta^{-1}\Delta\rho, \qquad \rho(0,x)=\rho_k(x), \quad \rho(h,x)=q(x). \qquad (12)
\]
After introducing a Lagrange multiplier function $\Phi\colon[0,h]\times\mathbb{R}^d\to\mathbb{R}$, the RWPO is equivalent to the following system of coupled PDEs, consisting of a forward Fokker–Planck equation and a backward Hamilton–Jacobi equation:
\[
\begin{cases}
\partial_t\rho + \nabla\cdot(\rho\nabla\Phi) = \beta^{-1}\Delta\rho, \\
\partial_t\Phi + \frac{1}{2}\|\nabla\Phi\|_2^2 = -\beta^{-1}\Delta\Phi, \\
\rho(0,x) = \rho_k(x), \quad \Phi(h,x) = -V(x).
\end{cases} \qquad (13)
\]
By applying the Hopf–Cole transformation and using the heat kernel, one can derive a closed-form solution for the RWPO:
\[
\mathrm{WProx}^{h,\beta}_V(\rho_k) = \int_{\mathbb{R}^d} \frac{\exp\Big(-\frac{\beta}{2}\big(V(x)+\frac{\|x-y\|_2^2}{2h}\big)\Big)}{\int_{\mathbb{R}^d}\exp\Big(-\frac{\beta}{2}\big(V(z)+\frac{\|z-y\|_2^2}{2h}\big)\Big)dz}\,\rho_k(y)\,dy = K^h_V\rho_k(x), \qquad (14)
\]
where the kernel $K^h_V$ applied to the initial density $\rho_k$ depends on $V$ and the step size $h$. A more detailed derivation of (14) can be found in [23].
From (13), we observe that since $\Phi(h,x)=-V(x)$ and $\rho$ satisfies a Fokker–Planck equation with drift vector field $\nabla\Phi$, the solution of the RWPO approximates the evolution of the Fokker–Planck equation (4) when $h$ is small. Furthermore, [17] rigorously justifies that $K^h_V\rho_k$ approximates $\rho_{k+1}$ with an error of order $O(h^2)$ when $V$ is smooth. In summary, we use the kernel formula (14) to approximate the evolution of the Fokker–Planck equation (4) with $V=f+g$, which further approximates the score function in (9).
2.2. Splitting with regularized Wasserstein proximal algorithms. We now return to the composite sampling problem and examine the JKO scheme (10) again to derive the splitting scheme. For the case where $\rho^* = \frac{1}{Z}\exp(-\beta(f+g))$, we observe that
\[
D_{\mathrm{KL}}(\rho\|\rho^*) = \beta\int_{\mathbb{R}^d} f\rho\,dx + \int_{\mathbb{R}^d} \rho\log\frac{\rho}{\exp(-\beta g)}\,dx + \log Z.
\]
Thus, the JKO scheme (10) can be written as
\[
\rho_{k+1} = \operatorname*{arg\,min}_{\rho\in\mathcal{P}_2(\mathbb{R}^d)} \int_{\mathbb{R}^d} f\rho\,dx + \beta^{-1}\int_{\mathbb{R}^d} \rho\log\frac{\rho}{\exp(-\beta g)}\,dx + \frac{1}{2h}W_2(\rho,\rho_k)^2,
\]
where we omit the normalization constant $\log Z$ in the minimization step.
The idea of the splitting JKO scheme is to introduce an intermediate density $\rho_{k+1/2}$ and consider a two-step squared Wasserstein distance. Then $\rho_{k+1}$ is given by the following optimization problem:
\[
\rho_{k+1} = \operatorname*{arg\,min}_{\rho\in\mathcal{P}_2(\mathbb{R}^d)}\;\min_{\rho_{k+1/2}\in\mathcal{P}_2(\mathbb{R}^d)} \int_{\mathbb{R}^d} f\rho_{k+1/2}\,dx + \int_{\mathbb{R}^d} g\rho\,dx + \beta^{-1}\int_{\mathbb{R}^d} \rho\log\rho\,dx + \frac{1}{2h}W_2(\rho_{k+1/2},\rho_k)^2 + \frac{1}{2h}W_2(\rho,\rho_{k+1/2})^2. \qquad (15)
\]
Next, we decompose the optimization problem into two steps:
\[
\begin{cases}
\rho_{k+1/2} = \operatorname*{arg\,min}_{\rho\in\mathcal{P}_2(\mathbb{R}^d)} \int_{\mathbb{R}^d} f\rho\,dx + \frac{1}{2h}W_2(\rho,\rho_k)^2, \\[4pt]
\rho_{k+1} = \operatorname*{arg\,min}_{\rho\in\mathcal{P}_2(\mathbb{R}^d)} \int_{\mathbb{R}^d} g\rho\,dx + \beta^{-1}\int_{\mathbb{R}^d}\rho\log\rho\,dx + \frac{1}{2h}W_2(\rho,\rho_{k+1/2})^2.
\end{cases} \qquad (16)
\]
When $\rho_k(x) = \frac{1}{N}\sum_{j=1}^N \delta_{x^k_j}(x)$ and $\rho$ is also approximated by a sum of delta measures, the two-step Wasserstein proximal operators yield the following particle update scheme:
\[
\begin{cases}
x^{k+1/2} = \operatorname*{arg\,min}_{x\in\mathbb{R}^d}\big\{ f(x) + \frac{1}{2h}\|x-x^k\|_2^2 \big\}, \\[4pt]
x^{k+1} = \operatorname*{arg\,min}_{x\in\mathbb{R}^d}\big\{ g(x) + \beta^{-1}\log\rho(x) + \frac{1}{2h}\|x-x^{k+1/2}\|_2^2 \big\},
\end{cases} \qquad (17)
\]
where the subindex $j$ is omitted for simplicity of notation.
For the first step, when $h$ is small, we approximate the implicit proximal step for $f$ by an explicit gradient descent
\[
x^{k+1/2} = x^k - h\nabla f(x^k).
\]
For the second step in (16), we note that it corresponds to a single-step JKO scheme for the Fokker–Planck equation with drift term $\nabla g$. Thus, we approximate $\rho_{k+1}$ using the regularized Wasserstein proximal operator $\mathrm{WProx}^{h,\beta}_g$ in (14). Moreover, when $K^h_g\rho_{k+1/2}$ is convex, the second step in (17) is equivalent to the implicit update
\[
x^{k+1} = x^{k+1/2} - h\nabla g(x^{k+1}) - h\beta^{-1}\nabla\log K^h_g\rho_{k+1/2}(x^{k+1}). \qquad (18)
\]
Finally, we replace the first two terms in (18) with the proximal operator of $g$ to circumvent the need to compute the gradient of a nonsmooth function. We also approximate the implicit update of the score function with an explicit step by using $K^h_g\rho_{k+1/2}(x^k)$, which retains a semi-implicit nature since $K^h_g\rho_{k+1/2}\approx\rho_{k+1}$. This results in the following iterative formula:
\[
x^{k+1} = \mathrm{prox}^h_g(x^{k+1/2}) - h\beta^{-1}\nabla\log K^h_g\rho_{k+1/2}(x^{k+1/2}). \qquad (19)
\]
We remark that the convergence of the above splitting scheme under smoothness assumptions will be demonstrated in Section 3.
2.3. Algorithm. To summarize the derivation in the previous section, the iterative formula for particles $\{x^{k+1}\}$ at the $(k+1)$-th iteration is expressed as
\[
\begin{cases}
x^{k+1/2} = x^k - h\nabla f(x^k), \\[2pt]
x^{k+1} = \mathrm{prox}^h_g(x^{k+1/2}) - h\beta^{-1}\nabla\log K^h_g\rho_{k+1/2}(x^{k+1/2}).
\end{cases} \qquad (20)
\]
Next, we shall derive an explicit and computationally efficient formula for the second step in (20). We first replace $x^{k+1/2}_i$ by $x^k_i$ for notational simplicity. Recall that when $g(x)=\lambda\|x\|_1$, the proximal operator is given by the shrinkage operator
\[
S_{\lambda h}(x) := \mathrm{prox}^h_{\lambda\|x\|_1}(x) = \mathrm{sign}(x)\max\{|x|-\lambda h, 0\}.
\]
Then, we simplify the expression for $K^h_{\lambda\|\cdot\|_1}\rho_k$ defined in (14). We recall the Laplace method: for any smooth function $\varphi\in C^\infty(\mathbb{R}^d;\mathbb{R})$ and a domain $A\subset\mathbb{R}^d$,
\[
\lim_{h\to 0}\int_A \exp\Big(-\frac{\varphi(x)}{h}\Big)dx = \tilde{C}\exp\Big(-\min_{x\in A}\frac{\varphi(x)}{h}\Big), \qquad (21)
\]
where $\tilde{C}$ is a constant depending on $h$, $d$, and the Hessian of $\varphi$. The domain $A$ can be extended to $\mathbb{R}^d$ if the integral is well-defined over the entire space. Applying this to the normalization term inside the integral (14), recalling the definition of the proximal operator, and noting that the Hessian of the exponent is $1$ almost everywhere, we obtain the following approximation for sufficiently small $h$:
\[
\int_{\mathbb{R}^d}\exp\Big(-\frac{\beta}{2}\Big(\lambda\|z\|_1+\frac{\|z-y\|_2^2}{2h}\Big)\Big)dz \approx C\exp\Big(-\frac{\beta}{2}\Big(\lambda\|S_{\lambda h}(y)\|_1+\frac{\|S_{\lambda h}(y)-y\|_2^2}{2h}\Big)\Big), \qquad (22)
\]
where $C$ is a constant depending on $h$ and $d$ almost everywhere, except at points where the exponent is nonsmooth.
For the numerator of $K^h_{\lambda\|\cdot\|_1}\rho_k$, we approximate $\rho_k(x)$ by kernel density estimation with a sum of delta measures:
\[
\rho_k(x) \approx \frac{1}{N}\sum_{j=1}^N \delta_{x^k_j}(x).
\]
In this case, the approximated density function at time $t_{k+1}$ in (14) becomes
\[
K^h_g\rho_k(x) \approx \frac{\exp\big(-\frac{\beta}{2}\lambda\|x\|_1\big)}{CN}\sum_{j=1}^N \exp\left[-\frac{\beta}{2}\left(\frac{\|x-x^k_j\|_2^2 - \|S_{\lambda h}(x^k_j)-x^k_j\|_2^2}{2h} - \lambda\|S_{\lambda h}(x^k_j)\|_1\right)\right]. \qquad (23)
\]
Using $\nabla\log K^h_g\rho_k = \nabla K^h_g\rho_k / K^h_g\rho_k$, the normalization constant $CN$ cancels out, and we arrive at
\[
\nabla\log K^h_g\rho_k(x) \approx -\frac{\beta}{2}\left(\frac{x-S_{\lambda h}(x)}{h} + \frac{\sum_{j=1}^N (x-x^k_j)\exp(U(x,x^k_j))}{h\sum_{j=1}^N \exp(U(x,x^k_j))}\right), \qquad (24)
\]
where $U$ is given by
\[
U(x,x^k_j) := -\frac{\beta}{2}\left(\frac{\|x-x^k_j\|_2^2 - \|S_{\lambda h}(x^k_j)-x^k_j\|_2^2}{2h} - \lambda\|S_{\lambda h}(x^k_j)\|_1\right).
\]
We then define the matrix operator $A_{i,j}$ and its normalized version $M_{i,j}$ as
\[
A_{i,j} = \exp\big(U(x^k_i, x^k_j)\big), \qquad M_{i,j} = \frac{A_{i,j}}{\sum_{j=1}^N A_{i,j}}. \qquad (25)
\]
With this notation, the second step of the iterative scheme (20) can be rewritten as
\[
x^{k+1}_i = x^k_i + \frac{1}{2}\Big(S_{\lambda h}(x^k_i) - \sum_{j=1}^N M_{i,j}\,x^k_j\Big). \qquad (26)
\]
The above derivation leads to a deterministic sampling algorithm for the composite density function
\[
\rho^*(x) = \frac{1}{Z}\exp\big(-\beta(f(x)+\lambda\|x\|_1)\big),
\]
which is described below.
Algorithm 1 Splitting Regularized Wasserstein Proximal Algorithm (BRWP-splitting)
Require: Initial particles $\{x^0_i\}_{i=1}^N$, step size $h$.
1: for iteration $k=1,2,\ldots$ and each particle $i=1,\ldots,N$ do
2:   Step 1: Compute the gradient descent step with respect to the smooth function $f$:
     $x^{k+1/2}_i = x^k_i - h\nabla f(x^k_i)$.
3:   Step 2: Perform the proximal update on $g$ with the score function:
     $x^{k+1}_i = x^{k+1/2}_i + \frac{1}{2}\Big(S_{\lambda h}(x^{k+1/2}_i) - \sum_{j=1}^N M_{i,j}\,x^{k+1/2}_j\Big)$.
     Here, $M_{i,j}$ is defined as in (25), with $x^k$ replaced by $x^{k+1/2}$.
4: end for
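As an illustration, the following Python sketch implements one iteration of Algorithm 1 for the Laplace prior, following (20), (25), and (26). The helper names, the toy target, and the numerical stabilization of the softmax are our own choices and are not taken from the paper's released code.

```python
import numpy as np

def brwp_splitting_step(X, grad_f, lam, beta, h):
    """One iteration of Algorithm 1 for g(x) = lam * ||x||_1 (a minimal sketch).

    X: (N, d) array of particles; grad_f: callable returning (N, d) gradients of f.
    """
    # Step 1: explicit gradient descent on the smooth part f.
    Xh = X - h * grad_f(X)                                    # x^{k+1/2}

    # Step 2: score-based proximal update with the softmax kernel (25)-(26).
    S = np.sign(Xh) * np.maximum(np.abs(Xh) - lam * h, 0.0)   # shrinkage S_{lam h}
    sq_dist = ((Xh[:, None, :] - Xh[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2
    shrink_gap = ((S - Xh) ** 2).sum(-1)                          # ||S(x_j) - x_j||^2
    U = -0.5 * beta * ((sq_dist - shrink_gap[None, :]) / (2 * h)
                       - lam * np.abs(S).sum(-1)[None, :])        # U(i, j)
    U -= U.max(axis=1, keepdims=True)                          # stabilize the softmax
    M = np.exp(U)
    M /= M.sum(axis=1, keepdims=True)
    return Xh + 0.5 * (S - M @ Xh)

# Toy usage: sample exp(-(||x||^2/2 + 0.5*||x||_1)) in d = 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) * 3.0
for _ in range(300):
    X = brwp_splitting_step(X, grad_f=lambda x: x, lam=0.5, beta=1.0, h=0.05)
```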
For a more general target density function $\rho^*$ as in (2) containing a nonsmooth function $g$, Step 2 in Algorithm 1 is replaced by
\[
x^{k+1}_i = x^{k+1/2}_i + \frac{1}{2}\Big(\mathrm{prox}^h_g(x^{k+1/2}_i) - \sum_{j=1}^N M_{i,j}\,x^{k+1/2}_j\Big), \qquad (27)
\]
where
\[
A_{i,j} = \exp\left[-\frac{\beta}{2}\left(\frac{\|x^k_i-x^k_j\|_2^2 - \|\mathrm{prox}^h_g(x^k_j)-x^k_j\|_2^2}{2h} - g\big(\mathrm{prox}^h_g(x^k_j)\big)\right)\right],
\]
and $M_{i,j}$ is defined as in (25). Intuitively, the proximal term in (27) corresponds to a half-step of gradient descent depending on $x^k_i$. The first exponent $\|x^k_i-x^k_j\|_2^2$ in $A_{i,j}$ induces diffusion as a heat kernel, while the last exponent involving $g$ performs the second half-step of gradient descent via a weighted average of the $x^k_j$, similar to the idea used in consensus-based optimization [5]. This mechanism ensures that the set of points concentrates in a high-probability region of the target density and does not collapse to a local minimum of $f+g$.
2.4. Connections with attention functions in transformers. We now recall the interacting
particle system formulation for transformers, as discussed in [6, 15]. In a transformer, each data
point, represented as a vector, namely a token, is processed iteratively through a series of layers
with attention functions. A key component of each layer is the self-attention mechanism, which
enables interactions among all tokens.
More specifically, in the simplified single-headed softmax self-attention mechanism, define $V\in\mathbb{R}^{d\times d}$ (value), $Q\in\mathbb{R}^{m\times d}$ (query), and $K\in\mathbb{R}^{m\times d}$ (key) as learnable matrices, and define the softmax function for $\omega\in\mathbb{R}^N$ as
\[
\mathrm{softmax}(\omega) = \left(\frac{\exp(\omega_j)}{\sum_{\ell=1}^N \exp(\omega_\ell)}\right)_{1\le j\le N}.
\]
The tokens are updated as
\[
x^{k+1}_i = x^k_i + h\sum_{j=1}^N \mathrm{softmax}\big((Qx^k_i\cdot Kx^k_j)_j\big)\,V x^k_j,
\]
where the softmax function is evaluated at index $j$.
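For comparison with the sampling scheme below, a minimal numpy sketch of this token update, with randomly initialized matrices and our own function name, is:

```python
import numpy as np

def attention_step(X, Q, K, V, h):
    # One layer of single-headed softmax self-attention viewed as an interacting
    # particle system: x_i <- x_i + h * sum_j softmax((Q x_i . K x_j))_j V x_j.
    logits = (X @ Q.T) @ (X @ K.T).T            # logits[i, j] = (Q x_i) . (K x_j)
    logits -= logits.max(axis=1, keepdims=True)
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)
    return X + h * W @ (X @ V.T)

rng = np.random.default_rng(1)
d, m, N = 4, 3, 10
X = rng.standard_normal((N, d))
Q, K, V = rng.standard_normal((m, d)), rng.standard_normal((m, d)), rng.standard_normal((d, d))
X = attention_step(X, Q, K, V, h=0.1)
```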
This formulation naturally represents the transformer as an interacting particle system, where the interaction kernel is given by $Qx^k_i\cdot Kx^k_j$. Various types of interaction kernels have been studied and applied in different contexts; see [6] for a more detailed discussion. Leveraging this perspective, we rewrite the proposed iterative sampling scheme in (27) as
\[
x^{k+1}_i = x^{k+1/2}_i + \frac{1}{2}\Big(\mathrm{prox}^h_g(x^{k+1/2}_i) - \sum_{j=1}^N \mathrm{softmax}\big(U(i,j)\big)\,x^{k+1/2}_j\Big), \qquad (28)
\]
\[
U(i,j) = -\frac{\beta}{2}\left(\frac{\|x^{k+1/2}_i - x^{k+1/2}_j\|_2^2 - \|\mathrm{prox}^h_g(x^{k+1/2}_j) - x^{k+1/2}_j\|_2^2}{2h} - g\big(\mathrm{prox}^h_g(x^{k+1/2}_j)\big)\right),
\]
where $x^{k+1/2}_i = x^k_i - h\nabla f(x^k_i)$.
Here, the interaction kernel is modified by the new matrix operator $U$, while the value matrix is replaced by gradient descent updates with respect to $f$. Additionally, the proximal term integrates target distribution information into the dynamics, allowing convergence to the target distribution. In particular, when $g(x)=\lambda\|x\|_1$, the shrinkage operator automatically promotes sparsity of the variables. Since particle interactions are computed via the softmax function, the system (28) can be efficiently implemented on modern GPUs, making it well-suited for high-dimensional sampling applications.
2.5. Different choices of kernels for particle interaction. In this section, we explore alternative formulations of the matrix operator $M_{i,j}$, previously defined in (25), based on different density approximations of $\rho_k$ from particles. These alternatives may lead to improved numerical performance in high-dimensional sampling problems. As in the previous section, we replace $x^{k+1/2}_i$ with $x^k_i$ to simplify notation.
Proposition 1. Suppose the density function at the $k$-th iteration is approximated using Gaussian kernels as
\[
\rho_k(x) = \frac{1}{N(2\pi\sigma^2)^{d/2}}\sum_{j=1}^N \exp\Big(-\frac{\|x-x^k_j\|_2^2}{2\sigma^2}\Big),
\]
with bandwidth $\sigma>0$. Then, for the particle update scheme given by (26), let $c=2h/(\sigma^2\beta)$ and let $x^k_{i,\ell}$ be the $\ell$-th component of the particle $x^k_i$; the matrix operators $A_{i,j}$ and $M_{i,j}$ are
\[
A_{i,j} = \exp\Big(-\frac{\|x^k_j\|_2^2}{2\sigma^2}\Big)\prod_{\ell=1}^d \Big[S_1(x^k_{i,\ell},x^k_{j,\ell}) + S_2(x^k_{i,\ell},x^k_{j,\ell}) + S_3(x^k_{i,\ell},x^k_{j,\ell})\Big], \qquad (29)
\]
\[
M_{i,j} = \frac{A_{i,j}}{\sum_{j=1}^N \exp\Big(-\frac{\|x^k_j\|_2^2}{2\sigma^2}\Big)\prod_{\ell=1}^d \Big[T_1(x^k_{i,\ell},x^k_{j,\ell}) + T_2(x^k_{i,\ell},x^k_{j,\ell}) + T_3(x^k_{i,\ell},x^k_{j,\ell})\Big]}, \qquad (30)
\]
where the terms $T_1,T_2,T_3$ and $S_1,S_2,S_3$ are given by
\[
T_1(x,z) = \sqrt{\tfrac{4h}{\beta(1+c)}}\int_{\sqrt{\tfrac{\beta(1+c)}{4h}}\big[\lambda h-\tfrac{x+cz+\lambda h}{1+c}\big]}^{\infty} e^{-y^2}dy\; \exp\Big(-\tfrac{\beta}{4h}\Big(\lambda^2h^2-\tfrac{(x+cz+\lambda h)^2}{1+c}\Big)\Big),
\]
\[
T_2(x,z) = \sqrt{\tfrac{4h}{\beta(1+c)}}\int_{-\infty}^{\sqrt{\tfrac{\beta(1+c)}{4h}}\big[-\lambda h-\tfrac{x+cz-\lambda h}{1+c}\big]} e^{-y^2}dy\; \exp\Big(-\tfrac{\beta}{4h}\Big(\lambda^2h^2-\tfrac{(x+cz-\lambda h)^2}{1+c}\Big)\Big),
\]
\[
T_3(x,z) = \sqrt{\tfrac{4h}{c\beta}}\int_{\sqrt{\tfrac{c\beta}{4h}}\big[-\lambda h-\tfrac{x+cz}{c}\big]}^{\sqrt{\tfrac{c\beta}{4h}}\big[\lambda h-\tfrac{x+cz}{c}\big]} e^{-y^2}dy\; \exp\Big(\tfrac{\beta}{4h}\tfrac{(x+cz)^2}{c}\Big),
\]
\[
S_1(x,z) = \tfrac{\beta}{2}\tfrac{x+cz+\lambda h}{h(1+c)}T_1(x,z) + \tfrac{1}{1+c}\exp\Big(-\tfrac{\beta(1+c)}{4h}\Big(\lambda h-\tfrac{x+cz+\lambda h}{1+c}\Big)^2\Big)\exp\Big(-\tfrac{\beta}{4h}\Big(\lambda^2h^2-\tfrac{(x+cz+\lambda h)^2}{1+c}\Big)\Big),
\]
\[
S_2(x,z) = \tfrac{\beta}{2}\tfrac{x+cz-\lambda h}{h(1+c)}T_2(x,z) - \tfrac{1}{1+c}\exp\Big(-\tfrac{\beta(1+c)}{4h}\Big(-\lambda h-\tfrac{x+cz-\lambda h}{1+c}\Big)^2\Big)\exp\Big(-\tfrac{\beta}{4h}\Big(\lambda^2h^2-\tfrac{(x+cz-\lambda h)^2}{1+c}\Big)\Big),
\]
\[
S_3(x,z) = \tfrac{\beta}{2}\tfrac{x+cz}{hc}T_3(x,z) - \tfrac{1}{c}\Big[\exp\Big(-\tfrac{c\beta}{4h}\Big(\lambda h-\tfrac{x+cz}{c}\Big)^2\Big)-\exp\Big(-\tfrac{c\beta}{4h}\Big(-\lambda h-\tfrac{x+cz}{c}\Big)^2\Big)\Big]\exp\Big(\tfrac{\beta(x+cz)^2}{4hc}\Big),
\]
after replacing all $x^k_i$ with $x^{k+1/2}_i$. Here, the integrals of $\exp(-y^2)$ can be evaluated via the error function
\[
\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-y^2}dy.
\]
The derivation of Proposition 1 can be found in supplementary material A. The Gaussian kernel
used in [18] has been applied to eliminate asymptotic bias in the discretization of the probability
flow ODE when the target distribution is Gaussian. Moreover, Gaussian kernels with adaptively
computed bandwidths based on particle variance are also helpful for approximating density functions
in high dimensions. For further discussion in this direction, see [35].
Next, by comparing the results in Proposition 1 with the expression in (25), we observe that the matrix operator $M_{i,j}$ simplifies significantly when the kernel is approximated by delta measures. However, in high-dimensional settings, kernel density estimation with delta measures suffers from the curse of dimensionality, as the number of particles required to maintain a given level of accuracy grows exponentially [14]. To address this issue, we propose an alternative and heuristic method for efficiently approximating the score function while maintaining its representation as a sum of delta measures. Specifically, we approximate the density function with an auxiliary set of points:
\[
\rho_k(x) \approx \frac{1}{N^d}\sum_{j_1,\ldots,j_d=1}^N \delta_{\tilde{x}^k_{j_1,\ldots,j_d}}(x), \qquad \tilde{x}^k_j = \tilde{x}^k_{j_1,\ldots,j_d} = [x^k_{j_1}(1),\ldots,x^k_{j_d}(d)]^T. \qquad (31)
\]
Here, $\rho_k$ can be regarded as approximated by a separable density function. Due to the separability of the L1 norm and the shrinkage operator, the proposed matrix operator takes the following form, with its derivation provided in supplementary material A.
Proposition 2. If the density function at the $k$-th iteration is approximated by (31), then for the particle update scheme given by (26), the operator $M_{i,j}$ is
\[
(M_{i,j})_\ell = \frac{\exp\left(-\frac{\beta}{2}\Big(\frac{(x^k_{i,\ell}-x^k_{j,\ell})^2 - (S_{\lambda h}(x^k_{j,\ell})-x^k_{j,\ell})^2}{2h} - \lambda|S_{\lambda h}(x^k_{j,\ell})|\Big)\right)}{\sum_{j}\exp\left(-\frac{\beta}{2}\Big(\frac{(x^k_{i,\ell}-x^k_{j,\ell})^2 - (S_{\lambda h}(x^k_{j,\ell})-x^k_{j,\ell})^2}{2h} - \lambda|S_{\lambda h}(x^k_{j,\ell})|\Big)\right)}, \qquad (32)
\]
for $\ell=1,\ldots,d$, where $x^k_{i,\ell}$ denotes the $\ell$-th component of the particle $x^k_i$. Step 2 in Algorithm 1 now becomes
\[
x^{k+1}_i = x^k_i + \frac{1}{2}\Big(S_{\lambda h}(x^k_i) - \sum_{j=1}^N M_{i,j}\cdot x^k_j\Big), \qquad (33)
\]
after replacing all $x^k_i$ with $x^{k+1/2}_i$.
Our numerical experiments in Section 5 show that the kernel in Proposition 2 usually yields faster convergence and a more accurate estimate of the model variance than the kernel in (25) in high-dimensional sampling problems.
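For concreteness, a numpy sketch of the componentwise kernel of Proposition 2 might look as follows; thanks to the separability in (31)–(32), the $N^d$ auxiliary particles never need to be formed explicitly. The function name and vectorization are our own.

```python
import numpy as np

def prop2_update(Xh, lam, beta, h):
    """Componentwise (separable) kernel update of Proposition 2, eqs. (32)-(33).

    Xh: (N, d) array of half-step particles x^{k+1/2}. A minimal sketch only.
    """
    S = np.sign(Xh) * np.maximum(np.abs(Xh) - lam * h, 0.0)    # S_{lam h}, componentwise
    # For every coordinate l, build an N x N weight matrix over the scalar samples x_{j,l}.
    diff2 = (Xh[:, None, :] - Xh[None, :, :]) ** 2              # (x_{i,l} - x_{j,l})^2
    gap2 = (S - Xh) ** 2                                        # (S(x_{j,l}) - x_{j,l})^2
    U = -0.5 * beta * ((diff2 - gap2[None, :, :]) / (2 * h) - lam * np.abs(S)[None, :, :])
    U -= U.max(axis=1, keepdims=True)                           # stabilize the softmax
    M = np.exp(U)
    M /= M.sum(axis=1, keepdims=True)                           # normalize over j, for each (i, l)
    weighted = np.einsum('ijl,jl->il', M, Xh)                   # sum_j (M_{i,j})_l x_{j,l}
    return Xh + 0.5 * (S - weighted)
```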
Moreover, we remark that for more general log-density functions $g$ and other choices of kernels used to estimate $\rho_k$ that are not separable, tensor-train approaches can be employed. Once the density at time $t_k$ and the target density $\rho^*$ are approximated in tensor-train format, an analog of Algorithm 1 remains computationally efficient. For further details, see [18].
3. Convergence Analysis
In this section, we analyze the convergence of the proposed Algorithm 1 for sampling from the target distribution. For notational convenience, we write $\rho^*_h$ for the regularized density function defined as
\[
\rho^*_h(x) = \frac{1}{Z_h}\exp\big(-\beta(f(x)+g_h(x))\big), \qquad (34)
\]
where $g_h$ is the Moreau envelope of $g$ and $Z_h=\int_{\mathbb{R}^d}\exp(-\beta(f(y)+g_h(y)))\,dy$.
We assume that the following conditions hold:
• The function $f$ is convex and $L_f$-smooth, meaning its gradient $\nabla f$ is $L_f$-Lipschitz continuous.
• The function $g$ is convex and $L_g$-Lipschitz. Also, $g_h$ is $L_{g_h}$-smooth.
• $\rho^*_h$ satisfies the Poincaré inequality with constant $\alpha_d>0$, i.e., for any bounded smooth function $\psi$,
\[
\int_{\mathbb{R}^d}\psi^2\rho^*_h\,dx - \Big(\int_{\mathbb{R}^d}\psi\rho^*_h\,dx\Big)^2 \le \alpha_d\int_{\mathbb{R}^d}\|\nabla\psi\|^2\rho^*_h\,dx.
\]
• The score function at time $t$, i.e., $\nabla\log\rho_t$ where $\rho_t$ satisfies the Fokker–Planck equation at time $t$, is convex and $\beta L_\rho$-Lipschitz continuous.
We remark that the second condition ensures that the proximal operator of $g$ is single-valued, and the smoothness assumption ensures that the Hessian of $g_h$ is bounded, which is needed to derive the asymptotic expression of the kernel formula (14). Regarding the Poincaré inequality in the third assumption, we note that it follows from both the log-Sobolev inequality and the Talagrand inequality. Furthermore, it remains valid even in cases where the log-Sobolev inequality does not apply, such as when $g$ has a tail of the form $\|x\|_1$. Moreover, both the log-Sobolev and Poincaré inequalities are special cases of the Latała–Oleszkiewicz inequality, for $\alpha=2$ and $\alpha=1$, respectively. These inequalities characterize concentration properties for densities of the form $\exp(-\|x\|^\alpha)$, as discussed in [20]. Finally, the last condition is an assumption that appears frequently in analyses of the probability flow ODE [9, 8].
Recalling the definition of the Moreau envelope of $g$ in (5), we first state two key properties of the Moreau envelope.
Lemma 3 ([13]). If $g$ is convex and $L_g$-Lipschitz continuous, then the following properties hold:
(1) For any $x\in\mathbb{R}^d$,
\[
0 \le g(x) - g_h(x) \le L_g^2 h.
\]
(2) $g_h$ is convex, and the function $\frac{1}{Z_{g_h}}\exp(-\beta g_h)$ defines a valid probability density function, where
\[
Z_{g_h} = \int_{\mathbb{R}^d}\exp(-\beta g_h(y))\,dy.
\]
Next, we show that the kernel formula for the regularized Wasserstein proximal operator used in Section 2 approximates the evolution of the Fokker–Planck equation. We denote by $\rho_{k+1/2}$ the density function of $x^{k+1/2}=x^k-h\nabla f(x^k)$, which can be obtained via kernel density estimation provided a sufficiently large number of particles. Then the following can be proved.
Lemma 4. For the approximation to the score function based on the kernel formula (14), when $h<1/(L_{g_h}d^2)$, we have
\[
\nabla\log K^h_g\rho_{k+1/2}(x) = -\frac{\beta}{2}\left(\frac{x-\mathrm{prox}^h_g(x)}{h} + \frac{\int_{\mathbb{R}^d}\frac{x-y}{h}\exp\Big[-\frac{\beta}{2}\Big(\frac{\|x-y\|_2^2-\|y-\mathrm{prox}^h_g(y)\|_2^2}{2h}-g(\mathrm{prox}^h_g(y))\Big)\Big]\rho_{k+1/2}(y)\,dy}{\int_{\mathbb{R}^d}\exp\Big[-\frac{\beta}{2}\Big(\frac{\|x-y\|_2^2-\|y-\mathrm{prox}^h_g(y)\|_2^2}{2h}-g(\mathrm{prox}^h_g(y))\Big)\Big]\rho_{k+1/2}(y)\,dy}\right), \qquad (35)
\]
which provides an approximation to the score function as follows:
\[
\nabla\log K^h_g\rho_{k+1/2}(x) = \nabla\log\rho(x,t_k+h) + O(h^2)
\]
almost everywhere, where $\rho(x,t)$ satisfies the Fokker–Planck equation with initial condition at $t_k$:
\[
\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla g_h) + \beta^{-1}\Delta\rho, \qquad \rho(x,t_k)=\rho_{k+1/2}(x).
\]
The proof is provided in supplementary material B. Next, recalling the proposed particle evolution scheme in (20), the first step consists of a gradient descent step with respect to $f$. By applying the change of variables formula for the probability density function, we obtain
\[
\rho_{k+1/2} = \rho_k + h\nabla\cdot(\rho_k\nabla f) + O(h^2).
\]
Consequently, after applying the kernel $K^h_g$ and using the result in Lemma 4, we have the approximation formula
\[
K^h_g\rho_{k+1/2} = \rho_k + h\nabla\cdot\big(\rho_k\nabla(f+g_h)\big) + h\beta^{-1}\Delta\rho_k + O(h^2). \qquad (36)
\]
Thus, the density function $K^h_g\rho_{k+1/2}$ obtained from the kernel formula (14) provides a first-order approximation to the evolution of the Fokker–Planck equation with drift term $\nabla(f+g_h)$.
Next, the iterative sampling scheme in (20) can be rewritten more compactly as
\[
x^{k+1} = x^k - h\nabla f(x^k) - h\nabla g_h\big(x^k-h\nabla f(x^k)\big) - h\beta^{-1}\nabla\log K^h_g\rho_{k+1/2}\big(x^k-h\nabla f(x^k)\big). \qquad (37)
\]
Our convergence analysis examines the convergence of the density $\rho_k$ to $\rho^*$ in terms of the Rényi divergence $R_q$ for $q\in[2,\infty)$. The Rényi divergence is defined as
\[
R_q(\mu\|\nu) := \frac{1}{q-1}\log\big(F_q(\mu\|\nu)\big), \qquad \text{where } F_q(\mu\|\nu) = \int_{\mathbb{R}^d}\frac{\mu^q}{\nu^{q-1}}\,dx.
\]
Next, we define the Rényi information $G_q(\mu\|\nu)$ as the time derivative of $F_q(\mu\|\nu)$:
\[
G_q(\mu\|\nu) = \int_{\mathbb{R}^d}\Big(\frac{\mu}{\nu}\Big)^q\Big\|\nabla\log\frac{\mu}{\nu}\Big\|_2^2\,\nu\,dx. \qquad (38)
\]
A key consequence of the Poincaré inequality is the following relationship regarding the time derivative of the Rényi divergence along the Langevin dynamics.
Lemma 5 ([33]). Suppose $\rho^*_h$ satisfies the Poincaré inequality with constant $\alpha_d>0$. Then, for any $q\ge 2$, we have
\[
\frac{G_q(\rho\|\rho^*_h)}{F_q(\rho\|\rho^*_h)} \ge \frac{4\alpha_d}{q^2}\big[1-\exp(-R_q(\rho\|\rho^*_h))\big]. \qquad (39)
\]
By employing the interpolation argument and establishing bounds for the discretization error, we can prove the convergence of the proposed sampling scheme to the target density as follows.
Theorem 6. Let $x^0\sim\rho_0$ be the initial particles and $L=L_f+L_{g_h}+L_\rho$. When $h\le\min\{(\sqrt{2}-1)/L,\ 1/(L_{g_h}d^2)\}$, we have the following convergence of Algorithm 1 with respect to the Rényi divergence.
(1) For the convergence towards the regularized target density $\rho^*_h$:
\[
R_q(\rho_k\|\rho^*_h) \le
\begin{cases}
R_q(\rho_0\|\rho^*_h) - kh\,\dfrac{\alpha_d}{q}\Big(1-\dfrac{2L^2h^2}{(1-hL)^2}\Big) - qL^2(L+L_f)^2h^2d + O(h^3), & R_q(\rho_0\|\rho^*_h)\ge 1; \\[6pt]
R_q(\rho_0\|\rho^*_h)\exp\Big[-kh\,\dfrac{\alpha_d}{q}\Big(1-\dfrac{2L^2h^2}{(1-hL)^2}\Big)\Big] + \dfrac{q^2L^2(L+L_f)^2h^2d}{\alpha_d} + O(h^3), & R_q(\rho_0\|\rho^*_h)<1.
\end{cases} \qquad (40)
\]
(2) For the convergence towards the target density $\rho^*$:
\[
R_q(\rho_k\|\rho^*) \le
\begin{cases}
R_{2q-1}(\rho_0\|\rho^*_h) - t_k\,\dfrac{\alpha_d}{2q-1}\Big(1-\dfrac{2L^2h^2}{(1-hL)^2}-(2q-1)L^4h^2d\Big) + c(q)L_g^2h + O(h^3), & R_q(\rho_0\|\rho^*_h)\ge 1; \\[6pt]
R_{2q-1}(\rho_0\|\rho^*_h)\exp\Big[-t_k\,\dfrac{\alpha_d}{2q-1}\Big(1-\dfrac{2L^2h^2}{(1-hL)^2}\Big)\Big] + \dfrac{(2q-1)^2L^4h^2d}{\alpha_d} + c(q)L_g^2h + O(h^3), & R_q(\rho_0\|\rho^*_h)<1;
\end{cases} \qquad (41)
\]
where $c(q)=\dfrac{q(2q-1)}{(2q-1)^2}$.
The proof is provided in supplementary material B. We remark that for the convergence to $\rho^*_h$, when $R_q(\rho_0\|\rho^*_h)<1$, the asymptotic bias induced by the discretization is of order $O(h^2)$, which is smaller than that of sampling methods with Brownian motion, where the bias is of order $O(h)$.
We note that the condition $hL_{g_h}<1/d^2$ in Lemma 4 and the second assumption in this section are quite strong, restricting many nonsmooth cases. When $g$ is merely Lipschitz continuous, one can still establish that $\nabla\log K^h_g\rho_{k+1/2}$ approximates the score function up to $O(h)$, but the $O(h^2)$ term is lost due to the lack of smoothness. If a rigorous approximation result for the kernel formula can be obtained, one could follow the analysis in [3] to study the convergence of the gradient flow in the $W_2$ metric, which remains valid for general nonsmooth functions and does not require the Poincaré inequality. Another approach to achieving exponential-type convergence is to use a strategy similar to that in [13], where the proximal operator $\mathrm{prox}^\gamma_g$ is applied with $\gamma\ne h$. This ensures that the Lipschitz constant of $g_\gamma$ remains independent of $h$, allowing for a rigorous convergence result toward the regularized density. However, our numerical experiments suggest that the proposed algorithm performs better than using an alternative regularization parameter $\gamma$. Given the challenges in rigorously verifying the kernel formula, we present our analysis in a smooth setting to illustrate the effectiveness of the proposed approach, while leaving a broader discussion of nonsmooth cases for future work.
4. Generalization to Sampling with TV Regularization
An important practical application of L1-norm regularization is its combination with total variation (TV) regularization for image denoising and restoration [30]. In this context, we consider sampling from the distribution
\[
\rho^*(u) = \frac{1}{Z}\exp(-V(u)), \qquad V(u) = \|\phi - Fu\|_2^2 + \lambda\|Du\|_1, \qquad (42)
\]
where $u\in\mathbb{R}^d$ represents the image or signal, $\phi\in\mathbb{R}^m$ is the noisy observation, and $F\in\mathbb{R}^{m\times d}$ is a known forward operator with $m\le d$. The matrix $D\in\mathbb{R}^{2d\times d}$ denotes the discretized gradient operator for two-dimensional images. This formulation extends naturally to the more general setting where $V(u)=f(u)+\|Ku\|_1$ for an arbitrary function $f$ and a linear operator $K$. For clarity, we focus on sampling from (42). Compared to direct optimization of $V(u)$, sampling-based algorithms provide a means to quantify uncertainty in the recovered image and facilitate Bayesian inference, as demonstrated in Section 5.
A common approach to sampling from $\rho^*$ in (42) is to compute the proximal operator of the TV norm using Chambolle's algorithm [7], as in [13]. However, this requires solving an optimization problem at each iteration. Instead, we seek a more deterministic method by combining the BRWP-splitting scheme with the primal-dual hybrid gradient (PDHG) method.
Since the proximal operator of the TV norm lacks a closed-form expression, we introduce an auxiliary variable $p=Du\in\mathbb{R}^{2d}$ and reformulate the log-density as
\[
V(u,p) = \|\phi - Fu\|_2^2 + \lambda\|p\|_1 + \gamma\|p-Du\|_1, \qquad (43)
\]
where $\gamma>0$ is a large regularization parameter enforcing $p\approx Du$. This transforms the sampling problem in $u$ into a sampling task over $u$ and $p$ simultaneously. The last term in $V(u,p)$ still involves the L1 norm of $p-Du$, whose proximal operator is not explicit. To address this, we use the dual formulation of the L1 norm:
\[
V(u,p) = \|\phi - Fu\|_2^2 + \lambda\|p\|_1 + \max_{y\in\mathbb{R}^{2d}}\ \gamma y\cdot(p-Du) - \delta_{\|y\|_\infty\le 1}(y), \qquad (44)
\]
where $y$ is the dual variable and the last term is the convex conjugate of the L1 norm.
Writing $x=[u,p]^T$, $G(x)=\|\phi-Fu\|_2^2+\lambda\|p\|_1$, and $L=[I,-D]^T$ to simplify notation, we recall the generalized PDHG scheme for sampling proposed in [16]:
\[
\begin{cases}
X^{k+1} = \mathrm{prox}^h_G\{X^k - h\gamma L^T Y^k\} + \sqrt{2\beta^{-1}}\,\zeta_k, \\[2pt]
Y^{k+1} = \mathrm{prox}^\tau_{\delta_{\|\cdot\|_\infty\le 1}}\{Y^k + \tau\gamma L X^{k+1}\},
\end{cases} \qquad (45)
\]
where $\zeta_k$ is a $3d$-dimensional Brownian motion added to the primal update, and $\tau,h>0$ are step sizes for the primal and dual updates. It is shown in [4] that this scheme has a unique invariant distribution in continuous time. Moreover, coupling $h$ and $\tau$ such that $\tau/h\to\infty$ as $h,\tau\to 0$ ensures convergence to the target distribution $\frac{1}{Z}\exp(-\beta V(u,p))$.
Next, we consider the discretization of the probability flow ODE for the primal variable, replacing the Brownian motion by the score function:
\[
\begin{cases}
x^{k+1} = x^k - h\gamma L^T y^k - h\nabla G_h(x^k-\gamma L^T y^k) - h\beta^{-1}\nabla\log\rho_{k+1}(x^k), \\[2pt]
y^{k+1} = \mathrm{prox}^\tau_{\delta_{\|\cdot\|_\infty\le 1}}\{y^k + \gamma L x^{k+1}\}.
\end{cases} \qquad (46)
\]
For the gradient of the Moreau envelope of $G$, we approximate it using an explicit gradient descent step for the smooth term of $G_h$ and a proximal step for the L1-norm term, as
\[
\nabla G_h(x^k-\gamma L^T y^k) \approx \left[\nabla\|\phi - F(u^k-h\gamma y^k)\|_2^2,\ \frac{p^k-\gamma D^T y^k - S_{\lambda h}(p^k-\gamma D^T y^k)}{h}\right]^T, \qquad (47)
\]
which holds as $h\to 0$.
Finally, as in Section 2, we apply the two-step splitting strategy to update the primal variables:
\[
\begin{cases}
u^{k+1/2} = u^k - h\gamma y^k, \\[2pt]
u^{k+1} = u^{k+1/2} - h\nabla\|\phi - Fu^{k+1/2}\|_2^2 - h\beta^{-1}\nabla\log K^h_{\|\phi-F\cdot\|_2^2}\rho^u_{k+1/2}(u^{k+1/2}),
\end{cases} \qquad (48)
\]
\[
\begin{cases}
p^{k+1/2} = p^k - h\gamma(-Dy^k), \\[2pt]
p^{k+1} = S_{\lambda h}(p^{k+1/2}) - h\beta^{-1}\nabla\log K^h_{\lambda\|\cdot\|_1}\rho^p_{k+1/2}(p^{k+1/2}).
\end{cases} \qquad (49)
\]
Here, $u^{k+1/2}\sim\rho^u_{k+1/2}$ and $p^{k+1/2}\sim\rho^p_{k+1/2}$.
Moreover, the score functions $\nabla\log K^h_{\|\phi-F\cdot\|_2^2}$ and $\nabla\log K^h_{\lambda\|\cdot\|_1}$ are defined analogously to (24). The proximal operator $\mathrm{prox}^h_{\|\phi-F\cdot\|_2^2}$ can be computed explicitly as
\[
\mathrm{prox}^h_{\|\phi-F\cdot\|_2^2}(v) = (I+hF^TF)^{-1}(v+hF^T\phi) \approx (I-hF^TF)(v+hF^T\phi) + O(h^2). \qquad (50)
\]
This splitting scheme decomposes the primal update into two sequential steps: (i) a gradient descent step involving the inner product with $y$, and (ii) a gradient descent step for the smooth part and a proximal step for the nonsmooth part of $G_h$, with explicit score functions.
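As an illustration, a small Python sketch of the proximal map in (50) via a direct linear solve (our own helper, assuming a dense forward operator $F$) is:

```python
import numpy as np

def prox_data_fitting(v, F, phi, h):
    # Proximal map of the data-fitting term as given in (50):
    #   prox(v) = (I + h F^T F)^{-1} (v + h F^T phi),
    # which is approximately (I - h F^T F)(v + h F^T phi) for small h.
    d = v.shape[0]
    return np.linalg.solve(np.eye(d) + h * F.T @ F, v + h * F.T @ phi)
```

For large images, one would avoid the dense solve, e.g. by using the small-$h$ expansion in (50) or an FFT-based inverse when $F$ is circulant.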
The full algorithm, incorporating the dual update and primal splitting, is summarized in Algorithm 2. We remark that the last step in the algorithm is a common step used in the PDHG scheme, applying an over-relaxation to the primal variable. Numerical experiments are presented in Section 5.
5. Numerical Experiments
In this section, we numerically verify the performance of the proposed sampling algorithm based on the splitting of the regularized Wasserstein proximal operator (BRWP-splitting, or BRWP for short). Specifically, we use the matrix operator constructed in Proposition 2 for the first four examples, and the one defined in (25) for the last example, to achieve better numerical performance. The numerical experiments include sampling from a mixture distribution, Bayesian logistic regression, image restoration with L1−2 TV regularization, uncertainty quantification with Bayesian inference, and Bayesian neural network training. In particular, the performance of the proposed algorithm is compared with the Moreau–Yosida Unadjusted Langevin Algorithm (MYULA) [13] and the Metropolis-adjusted Proximal Algorithm (PRGO) [26], where the restricted Gaussian oracle that appears is sampled using the accelerated gradient method employed in [24]. The code is available on GitHub at https://github.com/fq-han/BRWP-splitting.
Algorithm 2 Sampling Algorithm for Posterior Distribution with TV Regularization
Require: Initial particles $\{u^0_i, p^0_i, y^0_i\}_{i=1}^N$, step sizes $h$, $\tau$, parameters $\gamma$, $\lambda$.
1: for iteration $k=1,2,\ldots$ and each particle $i=1,\ldots,N$ do
2:   Gradient descent for the inner product term:
     $u^{k+1/2}_i = u^k_i + h\gamma D^T y^k_i, \qquad p^{k+1/2}_i = p^k_i - h\gamma y^k_i.$
3:   Semi-implicit discretization of the probability flow ODE for the data-fitting term:
     $u^{k+1}_i = u^{k+1/2}_i + \frac{1}{2}\Big(u^{k+1/2}_i - hF^T(Fu^{k+1/2}_i - \phi) - \sum_{j=1}^N u^{k+1/2}_j M^u_{i,j}\Big),$
     where $M^u_{i,j}$ is defined in (27) with $g(v)=\|\phi - Fv\|_2^2$, $\mathrm{prox}^h_g$ given in (50), and $x^{k+1/2}$ replaced by $u^{k+1/2}$.
4:   Semi-implicit discretization of the probability flow ODE for the L1 norm:
     $p^{k+1}_i = p^{k+1/2}_i + \frac{1}{2}\Big(S_{h\lambda}(p^{k+1/2}_i) - \sum_{j=1}^N p^{k+1/2}_j M^p_{i,j}\Big),$
     where $M^p_{i,j}$ is defined in (25) with $x^{k+1/2}$ replaced by $p^{k+1/2}$.
5:   Gradient ascent for the dual variable:
     $y^{k+1}_i = P_{\|\cdot\|_\infty\le 1}\Big(y^k_i + \tau\gamma\,[I,\,-D]\begin{bmatrix} 2p^{k+1}_i - p^k_i \\ 2u^{k+1}_i - u^k_i \end{bmatrix}\Big),$
     where $P_{\|\cdot\|_\infty\le 1}$ is the projection onto the $L^\infty$ ball, defined componentwise as
     $P_{\|\cdot\|_\infty\le 1}(x_j) = \dfrac{x_j}{\max\{|x_j|, 1\}}.$
6: end for
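For concreteness, a minimal Python sketch of the dual update in Step 5 (projection onto the $L^\infty$ ball followed by gradient ascent) might read as follows; the function names and the ordering of the stacked primal variables are our own assumptions.

```python
import numpy as np

def project_linf_ball(y):
    # Componentwise projection onto the unit L-infinity ball: y_j / max(|y_j|, 1).
    return y / np.maximum(np.abs(y), 1.0)

def dual_ascent_step(y, u_new, u_old, p_new, p_old, D, tau, gamma):
    """Step 5 of Algorithm 2 (a sketch): gradient ascent on the dual variable with
    the over-relaxed primal iterates 2*(new) - (old), followed by projection."""
    p_relaxed = 2 * p_new - p_old
    u_relaxed = 2 * u_new - u_old
    # [I, -D] acting on the stacked relaxed primal variables gives p - D u.
    val = p_relaxed - D @ u_relaxed
    return project_linf_ball(y + tau * gamma * val)
```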
5.1. Example 1. We consider sampling from a mixture of Gaussian and Laplace distributions, where
\[
\rho^*(x) = \frac{1}{Z}\exp\big(-(f(x)+\lambda\|x\|_1)\big), \qquad \exp(-f(x)) = \sum_{n=1}^M \exp\Big(-\frac{(x-y_n)^2}{2\sigma^2}\Big),
\]
with $\sigma=4$ and centers $y_n$ randomly distributed in $[-10,10]^d$. To quantify the performance of the sampling algorithms, we consider the decay of the KL divergence of the one-dimensional marginal distribution, i.e., we plot $D_{\mathrm{KL}}(\rho_j\|\rho^*_j)$, where
\[
\rho_j(x_j) = \int_{\mathbb{R}^{d-1}}\rho(x)\,dx_1\cdots dx_{j-1}\,dx_{j+1}\cdots dx_d.
\]
The explicit marginal distribution is detailed in the supplementary material.
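A simple way to produce such curves from particles is to estimate the marginal by a Gaussian kernel density estimate and integrate on a grid; the sketch below uses scipy's gaussian_kde, with our own function name, and assumes the exact marginal density is available as a callable.

```python
import numpy as np
from scipy.stats import gaussian_kde

def marginal_kl(samples_j, target_marginal_pdf, grid):
    # Estimate D_KL(rho_j || rho_j^*) for one coordinate: fit a Gaussian KDE to the
    # particle coordinate, then integrate rho_j * log(rho_j / rho_j^*) on a grid.
    kde = gaussian_kde(samples_j)
    p = np.maximum(kde(grid), 1e-300)
    q = np.maximum(target_marginal_pdf(grid), 1e-300)
    return np.trapz(p * np.log(p / q), grid)
```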
We conduct numerical experiments for sampling from the mixture distribution in d= 20 and 50,
λ= 0.1, and M= 4. Results of the BRWP-splitting are compared with MYULA and PRGO. In
Fig. 1 and Fig. 2, the decay of KL divergence of the marginal distribution when j= 1 and d, and
the kernel density estimation using Gaussian kernel from generated samples are plotted.
Both experiments, shown in Fig. 1 and Fig. 2, demonstrate that the proposed BRWP-splitting scheme provides a more accurate approximation to the target distribution in terms of the KL divergence and the density obtained from kernel density estimation.
Figure 1. Example 1: Results in $d=20$, step size $h=0.02$, and 50 particles. From left to right: the decay of the KL divergence in the first and the last dimension, and the density approximated by particles generated by BRWP-splitting and MYULA in the first spatial variable.
Figure 2. Example 1: Results in $d=50$, step size $h=0.02$, and 100 particles. From left to right: the decay of the KL divergence in the first and the last dimension, and the density approximated by particles generated by BRWP-splitting and MYULA in the first two spatial variables.
5.2. Example 2. The next experiment concerns Bayesian logistic regression, motivated by [12]. The task is to estimate an unknown parameter $\theta\in\mathbb{R}^d$. Given a binary label $y\in\{0,1\}$ and features (covariates) $x\in\mathbb{R}^d$, the logistic model for $y$ given $x$ is
\[
p(y=1\mid\theta,x) = \frac{\exp(\theta^T x)}{1+\exp(\theta^T x)}, \qquad (51)
\]
for some parameter $\theta$ that we try to estimate.
Suppose now we obtain a set of data pairs $\{(x_i,y_i)\}_{i=1}^n$, where each $y_i$ conditioned on $x_i$ is drawn from the logistic model with parameter $\theta^*$. Then, using Bayes' rule, we can construct the posterior distribution of the parameter $\theta$ given the data $\{y_i\}$. Denoting $Y=[y_1,\ldots,y_n]$ and $X=[x_1,\ldots,x_n]$, and writing $\pi_0(x)=\exp(-\lambda\|x\|_1)$ for the prior distribution, the posterior distribution of $\theta$ is
\[
p(\theta\mid y) = p(y\mid\theta,x)\,\pi_0(\theta) = \frac{1}{Z}\exp\left(Y^T X\theta - \sum_{i=1}^n\log\big(1+\exp(\theta^T x_i)\big) - \lambda\|\theta\|_1\right).
\]
For our numerical experiments, each $x_i$ is normalized, with each component sampled from the Rademacher distribution, i.e., taking the values $\pm 1$ with probability $\tfrac{1}{2}$. Given $x_i$, we then draw $y_i$ from the logistic model (51) with $\theta=\theta^*$. The parameter $\theta^*\in\mathbb{R}^d$ contains only $d/4$ nonzero components with value $1$. We examine the performance of the algorithms by computing the L1 distance between the sample mean $\theta$ and the true parameter $\theta^*$,
\[
\frac{1}{d}\|\theta-\theta^*\|_1.
\]
The regularization parameter is chosen as $\lambda=3d/(2\pi^2)$, and the results are presented in Fig. 3.
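A minimal Python sketch of this experimental setup is given below; this is our own illustrative code, and the covariate normalization is one plausible choice, since the text only states that the $x_i$ are normalized. The L1 prior is handled by the sampler's proximal/shrinkage step, so only the smooth negative log-likelihood gradient appears here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 200
theta_true = np.zeros(d)
theta_true[: d // 4] = 1.0                                     # d/4 nonzero components
X = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)          # normalized Rademacher covariates
p = 1.0 / (1.0 + np.exp(-X @ theta_true))
y = rng.binomial(1, p)                                         # labels from the logistic model (51)

def grad_neg_loglik(theta):
    # Gradient of f(theta) = -Y^T X theta + sum_i log(1 + exp(theta^T x_i)).
    s = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (s - y)
```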
Figure 3. Example 2: Logarithm of the relative L1 error, $\log(\|\theta-\theta^*\|_1/d)$, in Bayesian logistic regression for 100 particles and $h=0.05$, with $d=20$ (left) and $d=50$ (right).
Figure 4. Example 3: Left to right: exact image, noisy image, mean of all samples
after 100 iterations by BRWP-splitting and MYULA.
From Fig. 3, it is clear that the proposed BRWP-splitting method provides a more accurate
estimate of the mean parameter in this Bayesian logistic regression.
5.3. Example 3. In this example, we apply the proposed sampling algorithm to image denoising with L1−2 regularization, as proposed in [37].
The posterior distribution under consideration is
\[
\rho^*(u) = \frac{1}{Z}\exp\Big(-\frac{1}{2}\|Au-y\|_2^2 - \lambda\big(\|Du\|_1 - \|Du\|_2\big)\Big), \qquad (52)
\]
where the first term in the exponent is a data-fitting term and the second term is the difference between the L1 and L2 norms of the discrete gradient operator defined in Section 4, which promotes sparsity of the image variation. Here, each $u$ corresponds to one single image. To tackle this, the log-density is split as
\[
f = \|Au-y\|_2^2 - \lambda\|Du\|_2, \qquad g = \lambda\|Du\|_1. \qquad (53)
\]
To handle the second term with the L1-TV norm, we apply Algorithm 2.
We consider the case where $A$ is a noisy measurement operator such that
\[
A = I + \epsilon,
\]
where $\epsilon$ is a sparse Gaussian noise with mean $0$ and variance $0.1$ that has $3d$ nonzero entries. For the exact image $z_{\mathrm{ex}}$, the noisy image $z$ is taken as $Az_{\mathrm{ex}}+\eta$, where $\eta$ is a Gaussian noise with mean $0$ and variance $0.2$. The results obtained with 20 samples and $h=0.1$ are plotted in Fig. 4 and Fig. 5.
Figure 5. Example 3: Left to right: exact image, noisy image, mean of all samples
after 100 iterations by BRWP-splitting and MYULA.
From both Fig. 4 and Fig. 5, the proposed sampling method properly recovers the original image from the noisy data with L1−2 TV regularization.
5.4. Example 4. In the next example, we examine the application of the proposed sampling algorithm to a compressive sensing problem with L1 regularization. The target density for this problem is defined as
\[
\rho^*(x) = \frac{1}{Z}\exp\big(-\|Ax-z\|_2^2 - \lambda\|x\|_1\big), \qquad (54)
\]
where $x\in\mathbb{R}^d$ and $A$ is an $m\times d$ circulant blurring matrix with $m=d/4$.
To quantify the uncertainty in the measurement data, we consider the concept of the highest posterior density (HPD) region. For a given confidence level $\alpha\in[0,1]$, the HPD region $C_\alpha$ is defined by
\[
\int_{C_\alpha}\rho(x)\,dx = 1-\alpha, \qquad C_\alpha := \{x\in\mathbb{R}^d : V(x)\le\eta_\alpha\},
\]
where $\eta_\alpha$ is a threshold corresponding to the confidence level. The integral can be numerically approximated using samples obtained from the BRWP-splitting algorithm. For an arbitrary test image $\tilde{x}$, by comparing $V(\tilde{x})$ with $\eta_\alpha$ for various $\alpha$, we can assess the confidence that $\tilde{x}$ belongs to the high-probability region of the posterior distribution. In particular, with the set of particles generated from the BRWP-splitting scheme, the integral is computed numerically as
\[
\int_{C_\alpha}\rho(x)\,dx \approx \frac{\sum_j \chi_{V(x_j)<\eta_\alpha}}{N},
\]
where $N$ is the total number of samples and $\chi_{V(x_j)<\eta_\alpha}$ is the indicator function, equal to $1$ if $V(x_j)<\eta_\alpha$ and $0$ otherwise.
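In practice, the threshold $\eta_\alpha$ can be read off as an empirical quantile of the potential values $V(x_j)$ over the generated samples, as in the following sketch (our own helper functions):

```python
import numpy as np

def hpd_threshold(V_values, alpha):
    # Empirical HPD threshold eta_alpha from potential values V(x_j) of the samples:
    # choose eta so that a fraction (1 - alpha) of samples satisfies V(x_j) <= eta.
    return np.quantile(V_values, 1.0 - alpha)

def in_hpd_region(V_test, V_values, alpha):
    # Check whether a test image lies in the estimated (1 - alpha) HPD region.
    return V_test <= hpd_threshold(V_values, alpha)
```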
We test the algorithm on a brain MRI image of size $d=128^2$. The measurement model is assumed to be $Ax+\epsilon$, where $\epsilon$ represents Gaussian noise with mean 0 and variance 0.2. The reconstruction is estimated using a step size $h=0.02$, with 100 samples and 100 iterations. Additionally, we compute the HPD region threshold and plot the graph of $\eta_\alpha$ versus $\alpha$, estimated using 1000 samples.
From Fig. 6, we observe that the proposed algorithm yields a better reconstruction than MYULA. Furthermore, the sampling approach allows us to compute the HPD region threshold, facilitating practical Bayesian inference analysis.
5.5. Example 5. In this example, we apply the proposed method to Bayesian neural network training. Specifically, the likelihood function is modeled as an isotropic Gaussian, and the prior distribution is a Laplace prior. We consider a two-layer neural network, where each layer consists of 50 hidden units with a ReLU activation function. For each dataset, 90% of the data is used for training, while the remaining 10% is reserved for testing. Each algorithm is simulated using 200 particles over 500 iterations.
Figure 6. Example 4: From left to right: exact MRI image, reconstructed MRI image with BRWP-splitting, reconstructed MRI image with MYULA, and the HPD region thresholds $\eta_\alpha$ (plotted as $1-\alpha$ against $\eta_\alpha$).
Dataset   | BRWP-splitting      | BRWP                | MYULA               | SVGD
Boston    | 3.78 ± 1.93×10^-1   | 4.27 ± 2.09×10^-2   | 6.29 ± 6.00×10^-3   | 4.05 ± 6.93×10^-2
Wine      | 0.53 ± 2.54×10^-1   | 0.61 ± 2.47×10^-1   | 0.72 ± 1.13×10^-1   | 0.54 ± 3.64×10^-1
Concrete  | 3.25 ± 1.37×10^-1   | 4.11 ± 1.02×10^-1   | 4.71 ± 3.14×10^-1   | 3.32 ± 1.47×10^-1
Kin8nm    | 0.093 ± 7.99×10^-4  | 0.135 ± 2.15×10^-3  | 0.294 ± 1.56×10^-3  | 0.092 ± 7.93×10^-4
Power     | 4.13 ± 3.21×10^-2   | 5.25 ± 8.42×10^-2   | 8.49 ± 2.87×10^-1   | 4.15 ± 1.63×10^-2
Protein   | 4.23 ± 2.17×10^-2   | 4.74 ± 4.32×10^-2   | 5.12 ± 7.32×10^-2   | 4.61 ± 1.93×10^-2
Energy    | 1.54 ± 2.37×10^-2   | 3.06 ± 6.06×10^-2   | 4.52 ± 2.42         | 2.00 ± 4.13×10^-2
Table 1. Example 5: Root-mean-square error for different datasets in Bayesian neural network training with $\lambda=1/d$.
We compare the BRWP-splitting against MYULA, the original BRWP (non-splitting, without
proximal computation), and SVGD (Stein variational gradient descent). The step size for each
method is selected via grid search to achieve the best performance, and it remains consistent across
all experiments.
From Table 1, we observe that, for most datasets tested, the proposed BRWP-splitting approach
achieves a lower root-mean-square error compared to the other methods.
6. Discussions
In this work, we propose a sampling algorithm based on splitting methods and regularized Wasser-
stein proximal operators for sampling from nonsmooth distributions. When the log-density of the
prior distribution is the L1norm, the scheme is formulated as an interacting particle system incorpo-
rating shrinkage operators and the softmax function. The resulting iterative sampling scheme is sim-
ple to implement and naturally promotes sparsity. Theoretical convergence of the proposed scheme
is established under suitable conditions and the algorithm’s efficiency is demonstrated through ex-
tensive numerical experiments.
For future directions, we aim to extend our theoretical analysis to investigate the algorithm’s
convergence in the finite-particle approximation and explore its applicability beyond log-concave
sampling. On the computational side, we seek to enhance efficiency through GPU-based parallel
implementations and examine the impact of different kernel choices on the performance. Addition-
ally, as discussed in Section 2.4, regularized Wasserstein proximal operators share a close structural
connection with transformer architectures, motivating our interest in analyzing the self-attention
mechanism through the lens of interacting particle systems. More importantly, building on the proposed algorithm, we plan to develop tailored transformer models for learning sparse data distributions that are known only through samples.
Acknowledgement: F. Han's work is partially supported by AFOSR YIP award No. FA9550-23-10087. F. Han and S. Osher's work is partially supported by ONR N00014-20-1-2787, NSF-2208272, STROBE NSF-1554564, and NSF 2345256. W. Li's work is supported by AFOSR YIP award No. FA9550-23-10087, NSF RTG: 2038080, and NSF DMS-2245097.
Appendix A. Derivation in Section 2

Proof of Proposition 1. For the case where $\rho_k$ is approximated with Gaussian kernels, writing $x_\ell$ for the $\ell$-th component of $x\in\mathbb{R}^d$, we note that (14) becomes
\[
K^h_g\rho_k(x) = \frac{1}{N(2\pi\sigma^2)^{d/2}}\exp\Big(-\frac{\beta}{2}\lambda\|x\|_1\Big)\cdot\sum_{j=1}^N\prod_{\ell=1}^d\int_{\mathbb{R}}\exp\left(-\frac{\beta}{2}\Big(\frac{(x_\ell-y_\ell)^2-(y_\ell-S_{\lambda h}(y_\ell))^2}{2h}-\lambda|S_{\lambda h}(y_\ell)|\Big)-\frac{(y_\ell-x_{j,\ell})^2}{2\sigma^2}\right)dy_\ell.
\]
Hence, obtaining the closed-form formula reduces to evaluating a one-dimensional exponential integral. This integral can be decomposed into three parts, over $[\lambda h,\infty)$, $(-\lambda h,\lambda h)$, and $(-\infty,-\lambda h]$, following the definition of the shrinkage operator $S_{\lambda h}(y)$. Defining $c=2h/(\sigma^2\beta)$, denoting $\psi(x,x_j)=\exp\big(-\frac{\beta}{2}\frac{x^2+cx_j^2}{2h}\big)$ to simplify notation, and omitting the index $\ell$ for simplicity, the integral over $[\lambda h,\infty)$ is given by
\[
\psi(x,x_j)\int_{\lambda h}^{\infty}\exp\Big(-\frac{\beta}{4h}\big((1+c)y^2-2y(x+cx_j+\lambda h)\big)\Big)dy\;\exp\Big(-\frac{\beta\lambda^2h}{4}\Big)
= \psi(x,x_j)\sqrt{\tfrac{4h}{\beta(1+c)}}\int_{\sqrt{\tfrac{\beta(1+c)}{4h}}\big[\lambda h-\tfrac{x+cx_j+\lambda h}{1+c}\big]}^{\infty}e^{-y^2}dy\;\exp\Big(-\frac{\beta}{4h}\Big(\lambda^2h^2-\frac{(x+cx_j+\lambda h)^2}{1+c}\Big)\Big).
\]
Similarly, the integral over $(-\infty,-\lambda h]$ is
\[
\psi(x,x_j)\int_{-\infty}^{-\lambda h}\exp\Big(-\frac{\beta}{4h}\big((1+c)y^2-2y(x+cx_j-\lambda h)\big)\Big)dy\;\exp\Big(-\frac{\beta\lambda^2h}{4}\Big)
= \psi(x,x_j)\sqrt{\tfrac{4h}{\beta(1+c)}}\int_{-\infty}^{\sqrt{\tfrac{\beta(1+c)}{4h}}\big[-\lambda h-\tfrac{x+cx_j-\lambda h}{1+c}\big]}e^{-y^2}dy\;\exp\Big(-\frac{\beta}{4h}\Big(\lambda^2h^2-\frac{(x+cx_j-\lambda h)^2}{1+c}\Big)\Big).
\]
Finally, the integral over $[-\lambda h,\lambda h]$ can be computed as
\[
\psi(x,x_j)\int_{-\lambda h}^{\lambda h}\exp\Big(-\frac{\beta}{4h}\big(cy^2-2y(x+cx_j)\big)\Big)dy
= \psi(x,x_j)\int_{-\lambda h}^{\lambda h}\exp\Big(-\frac{c\beta}{4h}\Big(y-\frac{x+cx_j}{c}\Big)^2\Big)dy\;\exp\Big(\frac{\beta}{4h}\frac{(x+cx_j)^2}{c}\Big)
= \psi(x,x_j)\sqrt{\tfrac{4h}{c\beta}}\int_{\sqrt{\tfrac{c\beta}{4h}}\big[-\lambda h-\tfrac{x+cx_j}{c}\big]}^{\sqrt{\tfrac{c\beta}{4h}}\big[\lambda h-\tfrac{x+cx_j}{c}\big]}e^{-y^2}dy\;\exp\Big(\frac{\beta}{4h}\frac{(x+cx_j)^2}{c}\Big).
\]
Next, to compute the score function, we need to evaluate $\nabla K^h_g\rho_k$ on each sub-integral. For the integral over $[\lambda h,\infty)$, omitting the $\psi$ term, a direct computation gives
\[
\nabla\left(\sqrt{\tfrac{4h}{\beta(1+c)}}\int_{\sqrt{\tfrac{\beta(1+c)}{4h}}\big[\lambda h-\tfrac{x+cx_j+\lambda h}{1+c}\big]}^{\infty}e^{-y^2}dy\;\exp\Big(-\frac{\beta}{4h}\Big(\lambda^2h^2-\frac{(x+cx_j+\lambda h)^2}{1+c}\Big)\Big)\right)
= \left\{\sqrt{\tfrac{\beta}{h(1+c)}}(x+cx_j+\lambda h)\int_{\sqrt{\tfrac{\beta(1+c)}{4h}}\big[\lambda h-\tfrac{x+cx_j+\lambda h}{1+c}\big]}^{\infty}e^{-y^2}dy + \exp\Big[-\frac{\beta(1+c)}{4h}\Big(\lambda h-\frac{x+cx_j+\lambda h}{1+c}\Big)^2\Big]\right\}\cdot\frac{\exp\Big(-\frac{\beta}{4h}\Big(\lambda^2h^2-\frac{(x+cx_j+\lambda h)^2}{1+c}\Big)\Big)}{1+c}.
\]
The gradient of the integral over $(-\infty,-\lambda h]$ can be evaluated similarly by replacing $x+cx_j+\lambda h$ with $x+cx_j-\lambda h$ and changing signs. Finally, the gradient of the integral over $(-\lambda h,\lambda h)$ can be evaluated as
\[
\nabla\left(\sqrt{\tfrac{4h}{c\beta}}\int_{\sqrt{\tfrac{c\beta}{4h}}\big[-\lambda h-\tfrac{x+cx_j}{c}\big]}^{\sqrt{\tfrac{c\beta}{4h}}\big[\lambda h-\tfrac{x+cx_j}{c}\big]}e^{-y^2}dy\;\exp\Big(\frac{\beta}{4h}\frac{(x+cx_j)^2}{c}\Big)\right)
= \exp\Big(\frac{\beta}{4h}\frac{(x+cx_j)^2}{c}\Big)\cdot\left\{\sqrt{\tfrac{4h}{c\beta}}\,\frac{\beta}{2h}\,\frac{x+cx_j}{c}\int_{\sqrt{\tfrac{c\beta}{4h}}\big[-\lambda h-\tfrac{x+cx_j}{c}\big]}^{\sqrt{\tfrac{c\beta}{4h}}\big[\lambda h-\tfrac{x+cx_j}{c}\big]}e^{-y^2}dy - \frac{1}{c}\left[\exp\Big(-\frac{c\beta}{4h}\Big(\lambda h-\frac{x+cx_j}{c}\Big)^2\Big)-\exp\Big(-\frac{c\beta}{4h}\Big(-\lambda h-\frac{x+cx_j}{c}\Big)^2\Big)\right]\right\}.
\]
Combining the above gives the desired result.
Proof of Proposition 2. For the sum of $A_{i,j}$ defined in (25), writing the multi-index $j = (j_1, \dots, j_d)$ so that the $\ell$-th component of the grid point $\tilde{x}^k_j$ is $x^k_{j_\ell,\ell}$, we have
\[
\begin{aligned}
\sum_{j=1}^{N^d} A_{i,j} &= \sum_{j=1}^{N^d} \exp\left(-\frac{\beta}{2}\,\frac{\|x^k_i - \tilde{x}^k_j\|_2^2}{2h}\right) \exp\left(\frac{\beta}{2}\left[\frac{\|S_{\lambda h}(\tilde{x}^k_j) - \tilde{x}^k_j\|_2^2}{2h} + \lambda\|S_{\lambda h}(\tilde{x}^k_j)\|_1\right]\right) \\
&= \sum_{j=1}^{N^d} \prod_{\ell=1}^{d} \exp\left(-\frac{\beta}{2}\,\frac{(x^k_{i,\ell} - \tilde{x}^k_{j,\ell})^2}{2h}\right) \exp\left(\frac{\beta}{2}\left[\frac{(S_{\lambda h}(\tilde{x}^k_{j,\ell}) - \tilde{x}^k_{j,\ell})^2}{2h} + \lambda|S_{\lambda h}(\tilde{x}^k_{j,\ell})|\right]\right) \\
&= \sum_{j_1=1}^{N} \cdots \sum_{j_d=1}^{N} \prod_{\ell=1}^{d} \exp\left(-\frac{\beta}{2}\,\frac{(x^k_{i,\ell} - x^k_{j_\ell,\ell})^2}{2h}\right) \exp\left(\frac{\beta}{2}\left[\frac{(S_{\lambda h}(x^k_{j_\ell,\ell}) - x^k_{j_\ell,\ell})^2}{2h} + \lambda|S_{\lambda h}(x^k_{j_\ell,\ell})|\right]\right).
\end{aligned}
\]
Then, we can rearrange the terms to get
\[
\sum_{j=1}^{N^d} A_{i,j} = \prod_{\ell=1}^{d} \sum_{j=1}^{N} \exp\left(-\frac{\beta}{2}\,\frac{(x^k_{i,\ell} - x^k_{j,\ell})^2}{2h}\right) \exp\left(\frac{\beta}{2}\left[\frac{(S_{\lambda h}(x^k_{j,\ell}) - x^k_{j,\ell})^2}{2h} + \lambda|S_{\lambda h}(x^k_{j,\ell})|\right]\right),
\]
which is the desired formula in the proposition.
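The factorization above can also be verified numerically for small $N$ and $d$. The following Python sketch, which is not from the paper and uses arbitrary illustrative names and parameter values, compares the brute-force sum over all $N^d$ grid points with the product of $d$ one-dimensional sums.

# Check that the N^d-term normalization in Proposition 2 factorizes into d
# one-dimensional sums; all parameters below are arbitrary choices.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, d, beta, h, lam = 4, 3, 1.0, 0.1, 0.5
xk = rng.normal(size=(N, d))            # current particles x^k_j
xi = rng.normal(size=d)                 # a fixed particle x^k_i

def shrink(y):                          # soft-thresholding operator S_{lam*h}
    return np.sign(y) * np.maximum(np.abs(y) - lam * h, 0.0)

def weight(a, b):                       # per-coordinate factor of A_{i,j}
    return (np.exp(-beta / 2 * (a - b)**2 / (2 * h))
            * np.exp(beta / 2 * ((shrink(b) - b)**2 / (2 * h) + lam * np.abs(shrink(b)))))

# brute force: sum over all N^d grid points (x^k_{j_1,1}, ..., x^k_{j_d,d})
brute = sum(np.prod([weight(xi[l], xk[j[l], l]) for l in range(d)])
            for j in itertools.product(range(N), repeat=d))

# factorized form: product over coordinates of one-dimensional sums
factored = np.prod([np.sum(weight(xi[l], xk[:, l])) for l in range(d)])

print(brute, factored)                  # agree up to floating-point rounding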
Appendix B. Postponed Proofs in Section 3
Proof of Lemma 4. The proximal term in (35) can be rewritten using the property of the proximal operator as
\[
\frac{x - \mathrm{prox}^h_g(x)}{h} = \nabla g_h(x).
\]
Thus, the formula (35) is equivalent to
\[
\nabla \log K^h_g \rho_{k+\frac{1}{2}}(x) = \frac{\nabla\left(\exp\left(-\frac{\beta}{2}g_h(x)\right)\displaystyle\int \exp\left(-\frac{\beta}{4h}\left[\|x-y\|_2^2 - \|y - \mathrm{prox}^h_g(y)\|_2^2 - h\, g(\mathrm{prox}^h_g(y))\right]\right)\rho_{k+\frac{1}{2}}(y)\, dy\right)}{\exp\left(-\frac{\beta}{2}g_h(x)\right)\displaystyle\int \exp\left(-\frac{\beta}{4h}\left[\|x-y\|_2^2 - \|y - \mathrm{prox}^h_g(y)\|_2^2 - h\, g(\mathrm{prox}^h_g(y))\right]\right)\rho_{k+\frac{1}{2}}(y)\, dy}. \tag{55}
\]
Since $g_h$ is gradient-Lipschitz, its Hessian exists and is bounded almost everywhere. Additionally, as $g_h(x) = g(x) + O(h)$ by Lemma 3, we can substitute $g_h$ for $g$ in (55), introducing an additional error term of $O(h)$ in the exponent. Moreover, since the dominating term in the exponent is of order $1/h$, the error resulting from replacing $g$ with $g_h$ will be of order $O(h^2)$ after taking the quotient.
Applying the Laplace method (see [17] for details), we obtain
\[
\int_{\mathbb{R}^d} \exp\left(-\frac{\beta}{2}\left[g_h(z) + \frac{\|y-z\|_2^2}{2h}\right]\right) dz = C\,\frac{\exp\left(-\frac{\beta}{2}\left[g_h(\mathrm{prox}^h_{g_h}(y)) + \frac{\|\mathrm{prox}^h_{g_h}(y) - y\|_2^2}{2h}\right]\right)}{1 + \frac{h}{2}\Delta g_h(\mathrm{prox}^h_{g_h}(y))} + O(h^2), \tag{56}
\]
for the constant $C = (2\pi h)^{d/2}$, almost everywhere. We note that the Laplacian term in the denominator cancels after taking the quotient in (55).
Substituting (56) into (55) leads to
\[
K^h_g \rho_{k+\frac{1}{2}}(x) = \int_{\mathbb{R}^d} \frac{\exp\left(-\frac{\beta}{2}\left[g_h(x) + \frac{\|x-y\|_2^2}{2h}\right]\right)}{\displaystyle\int_{\mathbb{R}^d}\exp\left(-\frac{\beta}{2}\left[g_h(z) + \frac{\|z-y\|_2^2}{2h}\right]\right) dz}\,\rho_{k+\frac{1}{2}}(y)\, dy + O(h^2). \tag{57}
\]
Thus, it remains to verify that $K^h_g \rho_{k+\frac{1}{2}}$ approximates the evolution of the Fokker–Planck equation with drift term $\nabla g_h$ from $t_k$ to $t_k + h$. This follows from the assumptions on $\rho_{k+\frac{1}{2}}$ and $g_h$, together with Theorem 4 in [17].
Our proof of the convergence in Rényi divergence relies on an interpolation argument, considering the continuity equation associated with (37) for $t \in [kh, (k+1)h]$. The particle at time $t$ is written as
\[
\begin{aligned}
x_t - x_{kh} &= -(t - kh)\left[\nabla f(x_{kh}) + \nabla g_h\big(x_{kh} - h\nabla f(x_{kh})\big) + \beta^{-1}\nabla \log K^{(t-kh)}_g \rho_{k+\frac{1}{2}}\big(x_{kh} - h\nabla f(x_{kh})\big)\right] \\
&= -(t - kh)\left[\nabla f(x_t) + \nabla g_h(x_t) + \beta^{-1}\nabla \log \rho_t(x_t) + \Lambda(x_t, x_{kh})\right],
\end{aligned} \tag{58}
\]
where
\[
\begin{aligned}
\Lambda(x_t, x_{kh}) := &-\beta^{-1}\nabla \log \frac{\rho_t}{\rho^*_h}(x_t) + \beta^{-1}\nabla \log \frac{\rho_t}{\rho^*_h}\big(x_{kh} - h\nabla f(x_{kh})\big) \\
&+ \nabla f(x_{kh}) - \nabla f\big(x_{kh} - h\nabla f(x_{kh})\big) \\
&- \beta^{-1}\nabla \log \rho_t\big(x_{kh} - h\nabla f(x_{kh})\big) + \beta^{-1}\nabla \log K^{t-kh}_g \rho_{k+\frac{1}{2}}\big(x_{kh} - h\nabla f(x_{kh})\big).
\end{aligned}
\]
We note that when $t = (k+1)h$, we have $x_t = x_{k+1}$, i.e., the location of the particle at the next time step. The Fokker–Planck equation corresponding to (58) for $t \in [kh, (k+1)h]$ is then
\[
\frac{\partial \rho_t}{\partial t}(x_t) = \beta^{-1}\nabla \cdot \left(\rho_t(x_t)\nabla \log \frac{\rho_t}{\rho^*_h}(x_t)\right) + \nabla \cdot \big(\rho_t(x_t)\Lambda(x_t, x_{kh})\big). \tag{59}
\]
We now state the following lemma on the time derivative of the Rényi divergence along (59).

Lemma 7. For $t \in [kh, (k+1)h]$, the time derivative of the Rényi divergence between $\rho_t$, evolving along (59), and $\rho^*_h$ satisfies
\[
\frac{\partial}{\partial t} R_q(\rho_t \| \rho^*_h) \leq -\frac{q}{2}\,\frac{G_q(\rho_t \| \rho^*_h)}{F_q(\rho_t \| \rho^*_h)} + \frac{q}{2 F_q(\rho_t \| \rho^*_h)}\int_{\mathbb{R}^d} \|\Lambda(x_t, x_{kh})\|_2^2 \left(\frac{\rho_t}{\rho^*_h}\right)^q \rho^*_h\, dx_t. \tag{60}
\]
Proof. By the definition of the Rényi divergence, we have
\[
\begin{aligned}
\frac{\partial}{\partial t} R_q(\rho_t \| \rho^*_h) &= \frac{q}{q-1}\,\frac{\int_{\mathbb{R}^d}\left(\frac{\rho_t}{\rho^*_h}\right)^{q-1}\partial_t \rho_t\, dx_t}{F_q(\rho_t \| \rho^*_h)} \\
&= \frac{q}{(q-1)F_q(\rho_t \| \rho^*_h)}\int_{\mathbb{R}^d}\left(\frac{\rho_t}{\rho^*_h}\right)^{q-1}\nabla \cdot \left(\rho_t\left[\nabla \log \frac{\rho_t}{\rho^*_h} + \Lambda(x_t, x_{kh})\right]\right) dx_t \\
&= -\frac{q}{F_q(\rho_t \| \rho^*_h)}\int_{\mathbb{R}^d}\left(\frac{\rho_t}{\rho^*_h}\right)^{q-2}\nabla \frac{\rho_t}{\rho^*_h} \cdot \left(\rho_t\nabla \log \frac{\rho_t}{\rho^*_h} + \Lambda(x_t, x_{kh})\rho_t\right) dx_t \\
&= -\frac{q}{F_q(\rho_t \| \rho^*_h)}\left[\int_{\mathbb{R}^d}\left\|\nabla \frac{\rho_t}{\rho^*_h}\right\|_2^2\left(\frac{\rho_t}{\rho^*_h}\right)^{q-2}\rho^*_h\, dx_t + \int_{\mathbb{R}^d}\left(\frac{\rho_t}{\rho^*_h}\right)^{q-1}\nabla \frac{\rho_t}{\rho^*_h} \cdot \Lambda(x_t, x_{kh})\,\rho^*_h\, dx_t\right].
\end{aligned}
\]
The first term is precisely the Rényi information term defined in (38), and the second term represents the discretization error, which we need to bound. The second term can be further simplified as follows:
\[
\begin{aligned}
\int_{\mathbb{R}^d}\left(\frac{\rho_t}{\rho^*_h}\right)^{q-1}\nabla \frac{\rho_t}{\rho^*_h} \cdot \Lambda(x_t, x_{kh})\,\rho^*_h\, dx_t &= \int_{\mathbb{R}^d}\nabla \frac{\rho_t}{\rho^*_h} \cdot \Lambda(x_t, x_{kh})\,\frac{\rho_t}{\rho^*_h}\left(\frac{\rho_t}{\rho^*_h}\right)^{q-2}\rho^*_h\, dx_t \\
&\geq -\frac{1}{2}\int_{\mathbb{R}^d}\left\|\nabla \frac{\rho_t}{\rho^*_h}\right\|_2^2\left(\frac{\rho_t}{\rho^*_h}\right)^{q-2}\rho^*_h\, dx_t - \frac{1}{2}\int_{\mathbb{R}^d}\|\Lambda(x_t, x_{kh})\|_2^2\left(\frac{\rho_t}{\rho^*_h}\right)^q \rho^*_h\, dx_t.
\end{aligned}
\]
The final result is obtained by combining the above relations.
Next, we bound the discretization error using the Lipschitz continuity of the score function and of the gradient of $f + g_h$.
Lemma 8. The discretization error term $\Lambda(x_t, x_{kh})$ satisfies
\[
\int_{\mathbb{R}^d}\|\Lambda(x_t, x_{kh})\|_2^2\left(\frac{\rho_t}{\rho^*_h}\right)^q \rho^*_h\, dx_t \leq \frac{2L^2 h^2}{(1-hL)^2}\,G_q(\rho_t \| \rho^*_h) + 2L_f^2(L+L_f)^2 h^2 d\, F_q(\rho_t \| \rho^*_h) + O(h^3), \tag{61}
\]
where $L = L_f + L_{g_h} + L_\rho$.
Proof. Firstly, using the gradient Lipschitz condition on $f$, $g_h$, and $\log \rho_t$, and also the approximation result in Lemma 4, we can bound the discretization error as
\[
\begin{aligned}
\|\Lambda(x_t, x_{kh})\|_2 &\leq L\|x_t - x_{kh} + h\nabla f(x_{kh})\|_2 + L_f h\|\nabla f(x_{kh})\|_2 + O(h^2) \\
&\leq L\|x_t - x_{kh}\|_2 + (L+L_f)h\|\nabla f(x_{kh})\|_2 + O(h^2) \\
&\leq L\|x_t - x_{kh}\|_2 + h(L+L_f)L_f\sqrt{d} + O(h^2).
\end{aligned} \tag{62}
\]
For the first term of (62), by the formula for $x_t$ in (58), we obtain
\[
\begin{aligned}
\|x_t - x_{kh}\|_2 &\leq h\left\|\nabla \log \frac{\rho_t}{\rho^*_h}(x_{kh})\right\|_2 + O(h^2) \\
&\leq h\left\|\nabla \log \frac{\rho_t}{\rho^*_h}(x_t)\right\|_2 + h\left\|\nabla \log \frac{\rho_t}{\rho^*_h}(x_t) - \nabla \log \frac{\rho_t}{\rho^*_h}(x_{kh})\right\|_2 + O(h^2) \\
&\leq h\left\|\nabla \log \frac{\rho_t}{\rho^*_h}(x_t)\right\|_2 + Lh\|x_t - x_{kh}\|_2 + O(h^2).
\end{aligned}
\]
The above leads to
\[
\|x_t - x_{kh}\|_2 \leq \frac{h}{1-hL}\left\|\nabla \log \frac{\rho_t}{\rho^*_h}(x_t)\right\|_2 + O(h^2).
\]
Substituting this back into $\Lambda(x_t, x_{kh})$, we get
\[
\begin{aligned}
\int_{\mathbb{R}^d}\|\Lambda(x_t, x_{kh})\|_2^2\left(\frac{\rho_t}{\rho^*_h}\right)^q \rho^*_h\, dx_t
&\leq \frac{2L^2 h^2}{(1-hL)^2}\int_{\mathbb{R}^d}\left\|\nabla \log \frac{\rho_t}{\rho^*_h}\right\|_2^2\left(\frac{\rho_t}{\rho^*_h}\right)^q \rho^*_h\, dx_t + 2L_f^2(L+L_f)^2 h^2 d\int_{\mathbb{R}^d}\left(\frac{\rho_t}{\rho^*_h}\right)^q \rho^*_h\, dx_t + O(h^3) \\
&= \frac{2L^2 h^2}{(1-hL)^2}\,G_q(\rho_t \| \rho^*_h) + 2L_f^2(L+L_f)^2 h^2 d\, F_q(\rho_t \| \rho^*_h) + O(h^3).
\end{aligned}
\]
Next, we are ready to prove Theorem 6.
Proof of Theorem 6 part (1). Combining Lemmas 7 and 8, we have
\[
\begin{aligned}
\frac{\partial}{\partial t} R_q(\rho_t \| \rho^*_h) &\leq -\frac{q}{2}\,\frac{G_q(\rho_t \| \rho^*_h)}{F_q(\rho_t \| \rho^*_h)} + \frac{q}{2 F_q(\rho_t \| \rho^*_h)}\int_{\mathbb{R}^d}\|\Lambda(x_t, x_{kh})\|_2^2\left(\frac{\rho_t}{\rho^*_h}\right)^q \rho^*_h\, dx_t \\
&\leq \frac{q}{2}\left(-1 + \frac{2L^2 h^2}{(1-hL)^2}\right)\frac{G_q(\rho_t \| \rho^*_h)}{F_q(\rho_t \| \rho^*_h)} + q L_f^2(L+L_f)^2 h^2 d + O(h^3).
\end{aligned}
\]
Using the result in Lemma 5, i.e., when $\rho^*_h$ satisfies the Poincaré inequality with constant $\alpha_d$, we have
\[
\frac{G_q(\rho \| \rho^*_h)}{F_q(\rho \| \rho^*_h)} \geq \frac{4\alpha_d}{q^2}\left(1 - \exp(-R_q(\rho \| \rho^*_h))\right).
\]
Hence, we arrive at
\[
\frac{\partial}{\partial t} R_q(\rho_t \| \rho^*_h) \leq \frac{2\alpha_d}{q}\left(1 - \exp(-R_q(\rho_t \| \rho^*_h))\right)\left(-1 + \frac{2L^2 h^2}{(1-hL)^2}\right) + q L_f^2(L+L_f)^2 h^2 d + O(h^3),
\]
when $h \leq (\sqrt{2}-1)/L$.
Write $\rho_k = \rho_{kh}$. When $R_q(\rho_0 \| \rho^*_h) \geq 1$, it follows that $1 - \exp(-R_q(\rho_k \| \rho^*_h)) \geq \frac{1}{2}$. In this case, we can derive the linear convergence
\[
R_q(\rho_k \| \rho^*_h) \leq R_q(\rho_0 \| \rho^*_h) - kh\left[\frac{\alpha_d}{q}\left(1 - \frac{2L^2 h^2}{(1-hL)^2}\right) - q L_f^2(L+L_f)^2 h^2 d\right] + O(h^3).
\]
For the case $R_q(\rho_0 \| \rho^*_h) < 1$, we note that
\[
1 - \exp(-R_q(\rho_0 \| \rho^*_h)) \geq R_q(\rho_0 \| \rho^*_h) - \frac{R_q(\rho_0 \| \rho^*_h)^2}{2} \geq \frac{1}{2}R_q(\rho_0 \| \rho^*_h).
\]
In this scenario, by integrating with respect to $t$ from $0$ to $kh$, we have
\[
R_q(\rho_k \| \rho^*_h) \leq R_q(\rho_0 \| \rho^*_h)\exp\left(-kh\,\frac{\alpha_d}{q}\left(1 - \frac{2L^2 h^2}{(1-hL)^2}\right)\right) + \frac{q^2 L_f^2(L+L_f)^2 h^2 d}{\alpha_d} + O(h^3).
\]
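The integration step above is a standard comparison (Grönwall) argument. The display below is a restatement added here for readability rather than a formula from the original text; $a$ and $b$ abbreviate the rate and forcing constants of the differential inequality derived above.
\[
\frac{d}{dt} R_q(\rho_t \| \rho^*_h) \leq -a\, R_q(\rho_t \| \rho^*_h) + b
\;\Longrightarrow\;
R_q(\rho_{kh} \| \rho^*_h) \leq R_q(\rho_0 \| \rho^*_h)\, e^{-akh} + \frac{b}{a}\left(1 - e^{-akh}\right) \leq R_q(\rho_0 \| \rho^*_h)\, e^{-akh} + \frac{b}{a},
\]
with $a = \frac{\alpha_d}{q}\left(1 - \frac{2L^2h^2}{(1-hL)^2}\right)$ and $b = q L_f^2(L+L_f)^2 h^2 d$, so that $b/a = \frac{q^2 L_f^2(L+L_f)^2 h^2 d}{\alpha_d}\left(1 + O(h^2)\right)$, which matches the bound stated above.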
Proof of Theorem 6 part (2). The bound on the Rényi divergence between $\rho^*$ and $\rho^*_h$ can be derived using approximation result (b) in Lemma 3 and a Taylor expansion of the logarithm, which lead to
\[
R_q(\rho^* \| \rho^*_h) = \frac{1}{q-1}\log \int_{\mathbb{R}^d}\left(\frac{\rho^*}{\rho^*_h}\right)^q \rho^*_h\, dx \leq \frac{q L_g^2 h}{q-1} + O(h^2).
\]
Additionally, we recall the following decomposition theorem for the Rényi divergence:
\[
R_q(\rho_k \| \rho^*) \leq \left(\frac{q-\frac{1}{2}}{q-1}\right) R_{2q}(\rho^* \| \rho^*_h) + R_{2q-1}(\rho_k \| \rho^*_h).
\]
Plugging the above two relations into part (1) of Theorem 6 proves the desired result.
Appendix C. Details about Numerical Experiments
Evaluation of the marginal distribution in Example 1. We can integrate the mixture of Gaussian and Laplace models exactly. If $\Sigma_i^{-1} = \frac{1}{2\sigma^2}I_d$, the integral is given by
\[
\begin{aligned}
\int_{\mathbb{R}^d}\rho^*(x)\, dx
&= \sum_{n=1}^{N}\prod_{j=1}^{d}\int_{\mathbb{R}}\exp\left(-\frac{(x_j - y_{n,j})^2}{2\sigma_n^2} - \lambda|x_j|\right) dx_j \\
&= \sum_{n=1}^{N}\prod_{j=1}^{d}\left[\int_{-(y_{n,j}-\lambda\sigma_n^2)}^{\infty}\exp\left(-\frac{z_j^2}{2\sigma_n^2}\right) dz_j\,\exp\left(-\frac{y_{n,j}^2 - (y_{n,j}-\lambda\sigma_n^2)^2}{2\sigma_n^2}\right)\right. \\
&\qquad\qquad\left. +\int_{-\infty}^{-(y_{n,j}+\lambda\sigma_n^2)}\exp\left(-\frac{z_j^2}{2\sigma_n^2}\right) dz_j\,\exp\left(-\frac{y_{n,j}^2 - (y_{n,j}+\lambda\sigma_n^2)^2}{2\sigma_n^2}\right)\right] \\
&= \sum_{n=1}^{N}\frac{1}{(\sqrt{2}\sigma_n)^d}\prod_{j=1}^{d}\left[\int_{\frac{-(y_{n,j}-\lambda\sigma_n^2)}{\sqrt{2}\sigma_n}}^{\infty}\exp\left(-z_j^2\right) dz_j\,\exp\left(-\frac{y_{n,j}^2 - (y_{n,j}-\lambda\sigma_n^2)^2}{2\sigma_n^2}\right)\right. \\
&\qquad\qquad\left. +\int_{-\infty}^{\frac{-(y_{n,j}+\lambda\sigma_n^2)}{\sqrt{2}\sigma_n}}\exp\left(-z_j^2\right) dz_j\,\exp\left(-\frac{y_{n,j}^2 - (y_{n,j}+\lambda\sigma_n^2)^2}{2\sigma_n^2}\right)\right].
\end{aligned}
\]
The above computation provides the normalization constant $Z$. By replacing the integration over $\mathbb{R}^d$ with an integration over $\mathbb{R}^{d-1}$, we obtain the formula for the marginal distribution $\rho^*_1$.
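As with Appendix A, the one-dimensional closed form above can be checked numerically. The Python sketch below is not part of the paper; the parameter values are arbitrary, and the half-line Gaussian integrals are written through the complementary error function.

# Check of the one-dimensional Gaussian-Laplace integral used in Appendix C;
# sigma, lam, and y are arbitrary test values.
import numpy as np
from scipy.integrate import quad
from scipy.special import erfc

sigma, lam, y = 0.8, 1.2, 0.3

numerical, _ = quad(lambda x: np.exp(-(x - y)**2 / (2 * sigma**2) - lam * abs(x)),
                    -np.inf, np.inf)

def tail(A):        # integral of exp(-z^2 / (2 sigma^2)) over [A, inf)
    return sigma * np.sqrt(np.pi / 2) * erfc(A / (np.sqrt(2) * sigma))

closed = (tail(-(y - lam * sigma**2)) * np.exp(-(y**2 - (y - lam * sigma**2)**2) / (2 * sigma**2))
          + tail(y + lam * sigma**2) * np.exp(-(y**2 - (y + lam * sigma**2)**2) / (2 * sigma**2)))

print(numerical, closed)   # the two values agree to quadrature accuracy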
References
[1] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows: in metric spaces and in the space of probability measures.
Springer Science & Business Media, 2008.
[2] M. Benko, I. Chlebicka, J. Endal, and B. Miasojedow. Langevin Monte Carlo beyond Lipschitz gradient continuity.
arXiv preprint arXiv:2412.09698, 2024.
[3] E. Bernton. Langevin Monte Carlo and JKO splitting. In Conference on Learning Theory, pages 1777–1798.
PMLR, 2018.
[4] M. Burger, M. J. Ehrhardt, L. Kuger, and L. Weigand. Analysis of primal-dual Langevin algorithms, 2024.
[5] J. A. Carrillo, F. Hoffmann, A. M. Stuart, and U. Vaes. Consensus-based sampling. Studies in Applied Mathe-
matics, 148(3):1069–1140, 2022.
[6] V. Castin, P. Ablin, J. A. Carrillo, and G. Peyré. A unified perspective on the dynamics of deep transformers.
arXiv preprint arXiv:2501.18322, 2025.
[7] A. Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical Imaging
and Vision, 20:89–97, 2004.
[8] H. Chen, H. Lee, and J. Lu. Improved analysis of score-based generative modeling: User-friendly bounds under
minimal smoothness assumptions. In International Conference on Machine Learning, pages 4735–4763. PMLR,
2023.
[9] S. Chen, S. Chewi, H. Lee, Y. Li, J. Lu, and A. Salim. The probability flow ODE is provably fast. Advances in
Neural Information Processing Systems, 36, 2024.
[10] J. Chu, N. A. Sun, W. Hu, X. Chen, N. Yi, and Y. Shen. The application of Bayesian methods in cancer prognosis
and prediction. Cancer Genomics & Proteomics, 19(1):1–11, 2022.
[11] K. Craig, K. Elamvazhuthi, M. Haberland, and O. Turanova. A blob method for inhomogeneous diffusion with
applications to multi-agent control and sampling. Mathematics of Computation, 92(344):2575–2654, 2023.
[12] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal
of the Royal Statistical Society Series B: Statistical Methodology, 79(3):651–676, 2017.
[13] A. Durmus, E. Moulines, and M. Pereyra. Efficient Bayesian computation by proximal Markov chain Monte
Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
[14] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Proba-
bility Theory and Related Fields, 162(3):707–738, 2015.
[15] B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet. A mathematical perspective on transformers. arXiv
preprint arXiv:2312.10794, 2023.
[16] A. Habring, M. Holler, and T. Pock. Subgradient Langevin methods for sampling from nonsmooth potentials.
SIAM Journal on Mathematics of Data Science, 6(4):897–925, 2024.
[17] F. Han, S. Osher, and W. Li. Convergence of noise-free sampling algorithms with regularized Wasserstein proxi-
mals. arXiv preprint arXiv:2409.01567, 2024.
[18] F. Han, S. Osher, and W. Li. Tensor train based sampling algorithms for approximating regularized Wasserstein
proximal operators. arXiv preprint arXiv:2401.13125, 2024.
[19] R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of the Fokker-Planck equation. SIAM
Journal on Mathematical Analysis, 29(1):1–17, 1998.
[20] R. Latała and K. Oleszkiewicz. Between Sobolev and Poincaré. In Geometric Aspects of Functional Analysis:
Israel Seminar 1996–2000, pages 147–168. Springer, 2000.
[21] T. T.-K. Lau, H. Liu, and T. Pock. Non-log-concave and nonsmooth sampling via Langevin Monte Carlo
algorithms. In INdAM Workshop: Advanced Techniques in Optimization for Machine Learning and Imaging,
pages 83–149. Springer, 2022.
[22] Y. T. Lee, R. Shen, and K. Tian. Structured logconcave sampling with a restricted Gaussian oracle. In Proceedings
of Thirty Fourth Conference on Learning Theory, pages 2993–3050. PMLR, 2021.
[23] W. Li, S. Liu, and S. Osher. A kernel formula for regularized Wasserstein proximal operators. Research in the
Mathematical Sciences, 10(4):43, 2023.
[24] J. Liang and Y. Chen. A proximal algorithm for sampling from non-smooth potentials. In 2022 Winter Simulation
Conference (WSC), pages 3229–3240. IEEE, December 2022.
[25] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances
in Neural Information Processing Systems, 29, 2016.
[26] W. Mou, N. Flammarion, M. J. Wainwright, and P. L. Bartlett. An efficient sampling algorithm for non-smooth
composite potentials. Journal of Machine Learning Research, 23(233):1–50, 2022.
[27] J. Pan, E. H. Ip, and L. Dubé. An alternative to post hoc model modification in confirmatory factor analysis:
The Bayesian Lasso. Psychological Methods, 22(4):687, 2017.
[28] T. Park and G. Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686,
2008.
[29] M. Pereyra. Proximal Markov chain Monte Carlo algorithms. Statistics and Computing, 26:745–760, 2016.
[30] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D:
Nonlinear Phenomena, 60(1-4):259–268, 1992.
[31] A. Salim, D. Kovalev, and P. Richtárik. Stochastic proximal Langevin algorithm: Potential splitting and
nonasymptotic rates. Advances in Neural Information Processing Systems, 32, 2019.
[32] H. Y. Tan, S. Osher, and W. Li. Noise-free sampling algorithms via regularized Wasserstein proximals. Research
in the Mathematical Sciences, 11(4):65, 2024.
[33] S. Vempala and A. Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices.
Advances in Neural Information Processing Systems, 32, 2019.
[34] M. Vladimirova, J. Verbeek, P. Mesejo, and J. Arbel. Understanding priors in Bayesian neural networks at the
unit level. In International Conference on Machine Learning, pages 6458–6467. PMLR, 2019.
[35] Y. Wang and W. Li. Accelerated information gradient flow. Journal of Scientific Computing, 90:1–47, 2022.
[36] A. Wibisono. Proximal Langevin algorithm: Rapid convergence under isoperimetry. arXiv preprint
arXiv:1911.01469, 2019.
[37] P. Yin, Y. Lou, Q. He, and J. Xin. Minimization of ℓ1−2 for compressed sensing. SIAM Journal on Scientific
Computing, 37(1):A536–A563, 2015.
Department of Mathematics, University of California, Los Angeles, Los Angeles, CA, USA
Email address: fqhan@math.ucla.edu
Department of Mathematics, University of California, Los Angeles, Los Angeles, CA, USA
Email address: sjo@math.ucla.edu
Department of Mathematics, University of South Carolina, Columbia, SC, USA
Email address: wuchen@mailbox.sc.edu