Unbiased Hamiltonian Monte Carlo with couplings


Jeremy Heng and Pierre E. Jacob
September 4, 2017
We propose a coupling approach to parallelize Hamiltonian Monte Carlo estimators, following Jacob, O’Leary
& Atchadé (2017). A simple coupling, obtained by using common initial velocities and common uniform variables
for the acceptance steps, leads to pairs of Markov chains that contract, in the sense that the distance between
them can become arbitrarily small. We show how this strategy can be combined with coupled random walk
Metropolis–Hastings steps to enable exact meetings of the two chains, and in turn, unbiased estimators that can
be computed in parallel and averaged. The resulting estimator is valid in the limit of the number of independent
replicates, instead of the usual limit of the number of Markov chain iterations. We investigate the effect of tuning
parameters, such as the number of leap-frog steps and the step size, on the estimator’s efficiency. The proposed
methodology is demonstrated on a 250-dimensional Normal distribution, on a bivariate Normal truncated by
linear and quadratic inequalities, and on a logistic regression with 300 covariates.
1 Introduction
1.1 Goal: parallel computation with Hamiltonian Monte Carlo
Hamiltonian Monte Carlo, also called Hybrid Monte Carlo (HMC), is a Markov chain Monte Carlo (MCMC) method
to approximate integrals with respect to a target probability distribution $\pi$ on $\mathbb{R}^d$. Originally proposed by Duane
et al. [1987] in the physics literature, it was later introduced in statistics by Neal [1993] and is now part of the
standard toolbox [Brooks et al., 2011, Lelièvre et al., 2010], in part due to favorable scaling properties with respect
to the dimension $d$ [Beskos et al., 2010, 2013], compared to e.g. random walk Metropolis–Hastings. Hamiltonian
Monte Carlo is at the core of the No-U-Turn sampler (NUTS, Hoffman and Gelman [2014]) used in the software
Stan [Carpenter et al., 2016]. As with any other MCMC method, HMC estimators are justified in the limit of the
number of iterations. Algorithms which rely on such asymptotics face the risk of becoming obsolete if computational
power keeps increasing through the number of available processors and not through clock speed. To address this
issue, we propose to run pairs of HMC chains, for a random but finite number of iterations, and combine them in
such a way that the resulting estimators are unbiased. One can then produce independent copies in parallel and
average them to obtain estimators that are valid in the limit of the number of copies.
If the chains could be initialized from the target distribution, MCMC estimators would be unbiased, and one
could simply average independent chains [Rosenthal, 2000]. Perfect samplers can be used for this purpose [Casella
et al., 2001, Huber, 2016, Glynn, 2016]; more widely applicable approaches to unbiased estimation from MCMC
samplers are proposed in e.g. Mykland et al. [1995], Neal [2002]. More recently, Jacob et al. [2017b] present an
approach based on coupled Markov chains. The method builds upon Glynn and Rhee [2014], Jacob et al. [2017a] and
other “debiasing” techniques [Jacob and Thiery, 2015, Vihola, 2015, Glynn, 2016], and leverages maximal couplings
[Thorisson, 2000] of proposal and conditional distributions to remove the “burn-in bias” of Metropolis–Hastings
and Gibbs chains respectively. The use of maximal couplings allows two chains initialized at different positions to
coincide exactly after a random number of steps, referred to as the meeting time. Importantly, these constructions
are applicable to continuous state spaces.
The present article proposes a combination of couplings to enable parallel computation for the Hamiltonian
Monte Carlo sampler. We start by briefly recalling the unbiased estimators of Jacob et al. [2017b] in Section
1.2 and introducing some preliminary notation in Section 1.3. The R code producing the figures of this article is
available on the GitHub account of the second author.
Department of Statistics, Harvard University, USA.
arXiv:1709.00404v1 [stat.CO] 1 Sep 2017
1.2 Context: unbiased estimation with couplings
Consider the task of approximating the integral $\pi(h) = \int h(x)\,\pi(dx) < \infty$, for a test function $h$ of interest. Let
$X = (X_n)_{n \geq 0}$ denote a $\pi$-invariant MCMC chain associated with an initial distribution $\pi_0$ and transition kernel $P$,
i.e. $X_0 \sim \pi_0$ and $X_n \sim P(X_{n-1}, \cdot)$ for all $n \geq 1$. Introduce another Markov chain $Y = (Y_n)_{n \geq 0}$ which has the same
law as $X = (X_n)_{n \geq 0}$, so that $X_n$ and $Y_n$ have the same marginal distribution for all $n \geq 0$. We will write $\mathbb{P}$ to
denote the law of the coupled chain $(X_n, Y_n)_{n \geq 0}$ and $\mathbb{E}$ to denote expectation with respect to $\mathbb{P}$. We now assume
the following.
[A1] As $n \to \infty$, $\mathbb{E}[h(X_n)] \to \pi(h)$, and there exist $\iota > 0$ and $D < \infty$ such that for all $n \geq 0$, $\mathbb{E}[h(X_n)^{2+\iota}] < D$.
[A2] The meeting time $\tau = \inf\{n \geq 1 : X_n = Y_{n-1}\}$ is finite almost surely, and satisfies a geometric tail condition
of the form $\mathbb{P}(\tau > n) \leq C\gamma^n$ for all $n \geq 0$ and some constants $C < \infty$ and $\gamma \in (0,1)$.
[A3] The coupled chains are faithful [Rosenthal, 1997]: $X_n = Y_{n-1}$ for all $n \geq \tau$.
Under these assumptions, the random variable defined as
$$H_k(X, Y) = h(X_k) + \sum_{n=k}^{\max(k,\tau-1)} \{h(X_{n+1}) - h(Y_n)\} \tag{1}$$
is an unbiased estimator of $\pi(h)$, for any choice of initial distribution $\pi_0$ and any $k \geq 0$. The first term above,
$h(X_k)$, is in general biased since the chain $(X_n)_{n \geq 0}$ might not have reached stationarity by step $k$. As the second
term is precisely such that $\mathbb{E}[H_k(X, Y)] = \pi(h)$, it is referred to as a correction term. If $k \geq \tau$, the correction term
is zero. The estimator can be computed in $\max(\tau, k)$ steps, which has a finite expectation under A2.
We introduce another unbiased estimator, denoted by $\bar{H}_{k:m}$, defined for some integer $m > k$, resulting from
averaging $H_\ell(X, Y)$ over $\ell \in \{k, \ldots, m\}$. By rearranging terms, we define
$$\bar{H}_{k:m}(X, Y) = \frac{1}{m-k+1} \sum_{n=k}^{m} h(X_n) + \frac{1}{m-k+1} \sum_{n=k}^{\max(m,\tau-1)} \min(n-k+1, m-k+1)\,\{h(X_{n+1}) - h(Y_n)\}, \tag{2}$$
which is unbiased and computable in $\max(\tau, m)$ steps. The first average above can be recognized as the usual
MCMC estimator, obtained after $m$ iterations and discarding the first $k-1$ states. As before, the second term
can be seen as a correction to remove the bias of $\bar{H}_{k:m}(X, Y)$. On the event $\{k \geq \tau\}$, the correction term is
equal to zero. We refer to Jacob et al. [2017b] for a more detailed discussion of (1)-(2), and guidelines for the
choice of $k$ and $m$. Importantly, unbiased estimators can be produced independently in parallel and averaged, with
direct computational gains on parallel computing architectures. Explicit constructions of pairs of Markov chains
satisfying A1-A3 based on Metropolis–Hastings and Gibbs samplers are given in Jacob et al. [2017b]. Here we
propose coupling strategies for HMC chains, so as to enable the unbiased estimators of (1)-(2). The main challenge
lies in A2, for which two coupled chains have to meet exactly after a “Geometric” number of steps.
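To make (1)-(2) concrete, the following minimal Python sketch computes both estimators from stored trajectories. The function names `H` and `H_bar`, the list-based storage (`X[n]` holds $X_n$, `Y[n]` holds $Y_n$) and the toy numbers are assumptions made for illustration; they are not taken from the authors' R code. The sums are truncated at $n = \tau - 2$, which is equivalent for faithful chains because the remaining terms vanish.

```python
def H(h, X, Y, k, tau):
    """Estimator (1): h(X_k) plus the bias correction.

    For faithful chains X_{n+1} = Y_n whenever n + 1 >= tau, so the
    correction only needs the terms n = k, ..., tau - 2.
    """
    correction = sum(h(X[n + 1]) - h(Y[n]) for n in range(k, tau - 1))
    return h(X[k]) + correction


def H_bar(h, X, Y, k, m, tau):
    """Estimator (2): average of H_l over l = k, ..., m, in rearranged form."""
    width = m - k + 1
    mcmc_average = sum(h(X[n]) for n in range(k, m + 1)) / width
    correction = sum(
        min(n - k + 1, width) * (h(X[n + 1]) - h(Y[n]))
        for n in range(k, tau - 1)
    ) / width
    return mcmc_average + correction


# Made-up trajectories that meet at tau = 4: X_4 = Y_3, X_5 = Y_4, X_6 = Y_5.
X = [3.0, 2.0, 1.5, 1.2, 0.7, 0.4, 0.9]
Y = [-2.0, -1.0, 0.1, 0.7, 0.4, 0.9]
tau, k, m = 4, 1, 5
identity = lambda x: x

time_averaged = H_bar(identity, X, Y, k, m, tau)
average_of_H = sum(H(identity, X, Y, l, tau) for l in range(k, m + 1)) / (m - k + 1)
print(time_averaged, average_of_H)  # the two values agree up to rounding
```

The agreement of the two printed values illustrates that (2) is only a rearrangement of the average of the estimators in (1).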
1.3 Notation and plan
The set of integers $\{a, \ldots, b\}$ for $a \leq b$ is written as $[a:b]$. The set of non-negative real numbers is denoted by $\mathbb{R}_+$.
The vectors $0_d$ and $1_d$ refer to $d$-dimensional vectors of zeros and ones respectively. The matrix $I_d$ is the identity
matrix of size $d \times d$. The norm of a vector $x \in \mathbb{R}^d$ is written as $|x| = (\sum_{i=1}^d x_i^2)^{1/2}$. The transpose of a vector
$x \in \mathbb{R}^d$ and matrix $A \in \mathbb{R}^{d \times p}$ are denoted by $x^T$ and $A^T$ respectively. The gradient of a function $(x, y) \mapsto f(x, y)$
with respect to $x$ (resp. $y$) is denoted by $\nabla_x f$ (resp. $\nabla_y f$). The Hessian of a real-valued function $f$ is denoted
by $\nabla^2 f$. The Borel σ-algebra of $\mathbb{R}^d$ is denoted by $\mathcal{B}(\mathbb{R}^d)$ and the Lebesgue measure on $\mathbb{R}^d$ by $\mathrm{Leb}_d$. The Normal
distribution with mean $\mu$ and covariance matrix $\Sigma$ is denoted by $\mathcal{N}(\mu, \Sigma)$ and its density by $x \mapsto \mathcal{N}(x; \mu, \Sigma)$. The
Uniform distribution on the interval $[0,1]$ is $\mathcal{U}[0,1]$. The total variation distance $d_{TV}$ between two distributions,
with densities $p$ and $q$, is defined as $d_{TV}(p, q) = \frac{1}{2} \int |p(x) - q(x)|\,dx$.
The rest of the article is structured as follows. Section 2 describes Hamiltonian dynamics for coupled trajectories.
Section 3 introduces a simple coupling of Hamiltonian Monte Carlo chains, which satisfies a relaxed meeting time
assumption similar to A2. Section 4 then combines HMC kernels with random walk Metropolis–Hastings kernels, to
ensure that chains meet exactly and satisfy A2. Section 5 contains simulation results on a 250-dimensional Normal
target, a truncated Normal distribution and a logistic regression with 300 covariates, and Section 6 concludes.
2 Hamiltonian dynamics for pairs of particles
2.1 Hamiltonian flows and extended target
We now suppose that the target distribution has the form
$$\pi(dq) \propto \exp(-U(q))\,dq,$$
where the potential $U : \mathbb{R}^d \to \mathbb{R}_+$ is twice continuously differentiable and its gradient $\nabla U$ is globally $\beta$-Lipschitz,
i.e. there exists $\beta > 0$ such that
$$|\nabla U(q) - \nabla U(q')| \leq \beta |q - q'|,$$
for all $q, q' \in \mathbb{R}^d$. We now introduce Hamiltonian flows on a phase space $\mathbb{R}^{2d}$, which consists of position variables
$q \in \mathbb{R}^d$ and velocity variables $p \in \mathbb{R}^d$. We will be concerned with a Hamiltonian function $E : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ of the form
$$E(q, p) = U(q) + \frac{1}{2}|p|^2.$$
We note the use of an identity mass matrix here and defer to preconditioning as a means to incorporate any
knowledge of the curvature properties of $\pi$. The time evolution of a particle $(q(t), p(t))_{t \in \mathbb{R}_+}$ under Hamiltonian
dynamics is described by the autonomous system of ordinary differential equations
$$\frac{d}{dt} q(t) = \nabla_p E(q(t), p(t)) = p(t), \qquad \frac{d}{dt} p(t) = -\nabla_q E(q(t), p(t)) = -\nabla U(q(t)). \tag{3}$$
Under the above assumptions on $U$, (3) with an initial condition $(q(0), p(0)) = (q_0, p_0) \in \mathbb{R}^d \times \mathbb{R}^d$ admits a unique
solution globally on $\mathbb{R}_+$ [Lelièvre et al., 2010, p. 14]. We will write the flow map as $\Phi_t(q_0, p_0) = (q(t), p(t))$ for
any $t \in \mathbb{R}_+$, and $\Phi^{\circ}_t(q_0, p_0) = q(t)$ and $\Phi^{*}_t(q_0, p_0) = p(t)$ as its projection onto the position and velocity coordinates
respectively. It is worth recalling that Hamiltonian flows have the following properties.
[P1] (Reversibility) $\Phi_t^{-1} = M \circ \Phi_t \circ M$ where $M(q, p) := (q, -p)$ denotes velocity reversal;
[P2] (Energy conservation) $E \circ \Phi_t = E$ for any $t \in \mathbb{R}_+$;
[P3] (Volume preservation) $\mathrm{Leb}_{2d}(\Phi_t(A)) = \mathrm{Leb}_{2d}(A)$ for any $A \in \mathcal{B}(\mathbb{R}^d \times \mathbb{R}^d)$.
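It follows from P2 and P3 that the extended target distribution on phase space,
$$\tilde{\pi}(dq, dp) \propto \exp(-E(q, p))\,dq\,dp,$$
is invariant under the Markov semi-group induced by the flow, i.e. the pushforward measure $\Phi_t \sharp \tilde{\pi}$ defined by
$\Phi_t \sharp \tilde{\pi}(A) = \tilde{\pi}(\Phi_t^{-1}(A))$ for $A \in \mathcal{B}(\mathbb{R}^{2d})$ is equal to $\tilde{\pi}$ for any $t \in \mathbb{R}_+$.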
2.2 Coupled Hamiltonian dynamics
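Following Section 1.2, we now consider the coupling of two particles $(q^i(t), p^i(t))_{t \in \mathbb{R}_+}$, $i = 1, 2$, evolving under (3)
with initial conditions $(q^i(0), p^i(0)) = (q^i_0, p^i_0)$, $i = 1, 2$. We first draw some insights from a Gaussian example.
Example 1. Let $\pi$ be a Gaussian distribution on $\mathbb{R}$ with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 \in \mathbb{R}_+$, in which case
$U(q) = (q - \mu)^2/(2\sigma^2)$ and $\nabla U(q) = (q - \mu)/\sigma^2$. Then the solution of (3) is given by
$$\Phi_t(q_0, p_0) = \begin{pmatrix} \mu + (q_0 - \mu)\cos(t/\sigma) + \sigma p_0 \sin(t/\sigma) \\ p_0 \cos(t/\sigma) - \sigma^{-1}(q_0 - \mu)\sin(t/\sigma) \end{pmatrix},$$
see e.g. Neal [2011]. Hence the difference between the positions is given by
$$q^1(t) - q^2(t) = (q^1_0 - q^2_0)\cos(t/\sigma) + \sigma(p^1_0 - p^2_0)\sin(t/\sigma).$$
Observe that if we set $p^1_0 = p^2_0$, then
$$|q^1(t) - q^2(t)| = |q^1_0 - q^2_0|\,|\cos(t/\sigma)|,$$
so the particles meet exactly whenever $t = (2a+1)\pi\sigma/2$, and contraction occurs for any $t \neq \pi a \sigma$, for any non-negative
integer $a$.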
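As a quick numerical check of Example 1, the short Python sketch below evaluates the exact Gaussian flow for two particles sharing the same initial velocity. The function name `gaussian_flow` and the numerical values of $\mu$, $\sigma$, the positions and the velocity are arbitrary choices made for illustration.

```python
import math


def gaussian_flow(q0, p0, t, mu, sigma):
    """Exact solution of (3) for U(q) = (q - mu)^2 / (2 sigma^2)."""
    c, s = math.cos(t / sigma), math.sin(t / sigma)
    q = mu + (q0 - mu) * c + sigma * p0 * s
    p = p0 * c - (q0 - mu) * s / sigma
    return q, p


mu, sigma = 1.0, 2.0
q1, q2, p_common = -3.0, 4.0, 0.7   # different positions, common velocity


def distance_at(t):
    """Distance between the two positions after flowing for time t."""
    return abs(gaussian_flow(q1, p_common, t, mu, sigma)[0]
               - gaussian_flow(q2, p_common, t, mu, sigma)[0])


for t in (0.5, math.pi * sigma / 2, math.pi * sigma):
    predicted = abs(q1 - q2) * abs(math.cos(t / sigma))
    print(f"t = {t:.3f}: distance = {distance_at(t):.3e}, predicted = {predicted:.3e}")
```

At $t = \pi\sigma/2$ the two positions coincide up to rounding, while at $t = \pi\sigma$ the distance returns to its initial value, in line with the formula above.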
This example motivates a coupling that simply assigns particles the same initial velocity. Moreover, it also
reveals that certain trajectory lengths will result in stronger contractions than others. We now examine the utility
of this approach more generally. Define $\Delta(t) = q^1(t) - q^2(t)$ as the difference between the particle locations and
note that
$$\frac{1}{2}\frac{d}{dt}|\Delta(t)|^2 = \Delta(t)^T (p^1(t) - p^2(t)).$$
Therefore by imposing that $p^1(0) = p^2(0)$, the function $t \mapsto |\Delta(t)|$ admits a stationary point at time $t = 0$. This
is geometrically intuitive as the trajectories at time zero are parallel to one another for an infinitesimally small
amount of time. To characterize this stationary point, we compute
$$\frac{1}{2}\frac{d^2}{dt^2}|\Delta(t)|^2 = -\Delta(t)^T \{\nabla U(q^1(t)) - \nabla U(q^2(t))\} + |p^1(t) - p^2(t)|^2.$$
If we assume that the potential $U$ is $\alpha$-strongly convex in an open set $S \in \mathcal{B}(\mathbb{R}^d)$, i.e. there exists $\alpha > 0$ such that
$$(\nabla U(q) - \nabla U(q'))^T (q - q') \geq \alpha |q - q'|^2,$$
for all $q, q' \in S$, then
$$\frac{1}{2}\frac{d^2}{dt^2}|\Delta(t)|^2 \leq -\alpha |q^1(t) - q^2(t)|^2 + |p^1(t) - p^2(t)|^2. \tag{4}$$
Therefore by the second derivative test, $t = 0$ is a strict local maximum point if $q^1_0, q^2_0 \in S$. Using continuity of
$t \mapsto |\Delta(t)|^2$, it follows that there exist $\tilde{t} > 0$ and $\rho < 1$ such that
$$|\Phi^{\circ}_t(q^1_0, p_0) - \Phi^{\circ}_t(q^2_0, p_0)| < \rho\,|q^1_0 - q^2_0|$$
for $t \in (0, \tilde{t})$. We note the dependence of $\tilde{t}$ and $\rho$ on the initial positions $(q^1_0, q^2_0)$ and velocity $p_0$. We now strengthen
the above claim.
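Lemma 1. Suppose that the potential $U$ is twice continuously differentiable, $\alpha$-strongly convex on $S \in \mathcal{B}(\mathbb{R}^d)$ and
its gradient $\nabla U$ is globally $\beta$-Lipschitz. For any compact set $C \subset S \times S \times \mathbb{R}^d$, there exist $\tilde{t} > 0$ and $\rho < 1$ such that
$$|\Phi^{\circ}_t(q^1_0, p_0) - \Phi^{\circ}_t(q^2_0, p_0)| \leq \rho\,|q^1_0 - q^2_0| \tag{5}$$
for all $(q^1_0, q^2_0, p_0) \in C$ and $t \in (0, \tilde{t})$.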
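Proof. Take $(q^1_0, q^2_0, p_0) \in C$. Applying Taylor’s theorem on $\Delta(t)$ around $t = 0$ gives
$$\Delta(t) = \Delta(0) - \frac{1}{2} t^2 G_0 - \frac{1}{6} t^3 G_*,$$
for some $t_* \in (0, t)$, where $G_0 := \nabla U(q^1_0) - \nabla U(q^2_0)$ and $G_* := \nabla^2 U(q^1(t_*)) p^1(t_*) - \nabla^2 U(q^2(t_*)) p^2(t_*)$. We will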
control each term of the expansion
Using strong convexity, the Lipschitz assumption and Young’s inequality
Note that by Young’s inequality and the Lipschitz assumption
2|p1(t)|2+ 2k∇2U(q2(t)k2
0, p0)|2+|Φ
0, p0)|2)
0, p0)|2+|Φ
0, p0)|2),
where $\|\cdot\|_2$ denotes the spectral norm. The above supremum is attained by continuity of the mapping $(q, p) \mapsto \Phi^{*}_t(q, p)$.
The claim (5) follows by combining both inequalities and taking $t$ sufficiently small.
3 Hamiltonian Monte Carlo
3.1 Leap frog integrator
As the flow defined by (3) is typically intractable, one has to resort to time discretization. The leap-frog symplectic
integrator is a standard choice as it preserves P1 and P3. Given a step size $\varepsilon > 0$ and a number of leap-frog steps
$L \in \mathbb{N}$, this scheme initializes at $(q_0, p_0) \in \mathbb{R}^d \times \mathbb{R}^d$ and iterates
$$p_{\ell+1/2} = p_\ell - \frac{\varepsilon}{2} \nabla U(q_\ell), \qquad q_{\ell+1} = q_\ell + \varepsilon\, p_{\ell+1/2}, \qquad p_{\ell+1} = p_{\ell+1/2} - \frac{\varepsilon}{2} \nabla U(q_{\ell+1}),$$
for $\ell \in [0 : L-1]$. We write the leap-frog iteration as $\hat{\Phi}_\varepsilon(q_\ell, p_\ell) = (q_{\ell+1}, p_{\ell+1})$ and the corresponding approximation
of the flow as $\hat{\Phi}_{\varepsilon,\ell}(q_0, p_0) = (q_\ell, p_\ell)$ for $\ell \in [1 : L]$. As before, we denote by $\hat{\Phi}^{\circ}_{\varepsilon,\ell}(q_0, p_0) = q_\ell$ and $\hat{\Phi}^{*}_{\varepsilon,\ell}(q_0, p_0) = p_\ell$
the projections onto the position and velocity coordinates respectively. The leap-frog scheme is of order two [Hairer
et al., 2005, Theorem 3.4]: for sufficiently small $\varepsilon$, we have both
$$|\hat{\Phi}_{\varepsilon,L}(q_0, p_0) - \Phi_{\varepsilon L}(q_0, p_0)| \leq C_1 \varepsilon^2, \tag{6}$$
$$|E(\hat{\Phi}_{\varepsilon,L}(q_0, p_0)) - E(q_0, p_0)| \leq C_2 \varepsilon^2, \tag{7}$$
for some constants $C_1, C_2 > 0$. Given the nature of Hamiltonian dynamics, the constant $C_1$ will typically grow
exponentially with the number of leap-frog iterations $L$ [Leimkuhler and Matthews, 2015, Section 2.2.3]. Under
appropriate assumptions, the constant $C_2$ on the other hand can be shown to be stable over exponentially long time
intervals [Hairer et al., 2005, Theorem 8.1]. The Hamiltonian is not exactly conserved under time discretization,
but one can employ a Metropolis–Hastings correction as described in the following section.
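The leap-frog scheme can be sketched in a few lines of Python. The code below is an illustration under stated assumptions rather than the authors' implementation: positions and velocities are plain lists, the names `leapfrog` and `energy` are hypothetical, and the standard Normal potential $U(q) = |q|^2/2$ is chosen only because its gradient is trivial. It checks the second-order energy error of (7) empirically and the reversibility property P1.

```python
def leapfrog(q, p, grad_U, eps, L):
    """Run L leap-frog steps of size eps from (q, p); q and p are lists of floats."""
    q, p = list(q), list(p)
    g = grad_U(q)
    for _ in range(L):
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]   # half step in velocity
        q = [qi + eps * pi for qi, pi in zip(q, p)]         # full step in position
        g = grad_U(q)
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]   # half step in velocity
    return q, p


def energy(q, p, U):
    """Hamiltonian E(q, p) = U(q) + |p|^2 / 2."""
    return U(q) + 0.5 * sum(pi * pi for pi in p)


# Illustrative target: standard Normal, U(q) = |q|^2 / 2, grad U(q) = q.
U = lambda q: 0.5 * sum(qi * qi for qi in q)
grad_U = lambda q: list(q)

q0, p0, trajectory_length = [1.0, -0.5], [0.3, 0.8], 1.0
errors = []
for L in (10, 20, 40):
    eps = trajectory_length / L
    qL, pL = leapfrog(q0, p0, grad_U, eps, L)
    errors.append(abs(energy(qL, pL, U) - energy(q0, p0, U)))
    print(f"eps = {eps:.3f}: energy error = {errors[-1]:.2e}")

# Reversibility (P1): flip the velocity, integrate again, flip the velocity back.
qL, pL = leapfrog(q0, p0, grad_U, 0.05, 20)
q_back, p_back = leapfrog(qL, [-v for v in pL], grad_U, 0.05, 20)
p_back = [-v for v in p_back]
```

Halving the step size at fixed trajectory length should divide the energy error by roughly four, and `(q_back, p_back)` should recover `(q0, p0)` up to rounding.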
3.2 Hamiltonian Monte Carlo kernel
Hamiltonian Monte Carlo [HMC, Neal, 1993, Duane et al., 1987] is a Metropolis–Hastings (MH) algorithm on phase
space that targets $\tilde{\pi}$ with the time discretized Hamiltonian dynamics $\hat{\Phi}_{\varepsilon,L}(q_0, p_0) = (q_L, p_L)$ as a proposal. From a
state $(Q_n, P_n) \in \mathbb{R}^d \times \mathbb{R}^d$, at iteration $n \geq 0$,
1. sample a velocity $P^*_n \sim \mathcal{N}(0_d, I_d)$, independently of other variables, and set $(q_0, p_0) = (Q_n, P^*_n)$;
2. perform leap-frog integration to obtain $(q_L, p_L) = \hat{\Phi}_{\varepsilon,L}(q_0, p_0)$;
3. with probability $\alpha((q_0, p_0), (q_L, p_L))$, set $(Q_{n+1}, P_{n+1}) = (q_L, p_L)$, otherwise set $(Q_{n+1}, P_{n+1}) = (Q_n, P_n)$.
Since the leap-frog integrator preserves P1 and P3, the MH acceptance probability is given by
$$\alpha((q, p), (q', p')) = \min\{1, \exp(E(q, p) - E(q', p'))\}, \tag{8}$$
for $(q, p), (q', p') \in \mathbb{R}^d \times \mathbb{R}^d$. As this constructs a $\tilde{\pi}$-invariant Markov chain $(Q_n, P_n)_{n \geq 0}$ on phase space, the marginal
chain $(Q_n)_{n \geq 0}$ is a $\pi$-invariant Markov chain. We can write the Markov transition kernel of the marginal chain as
$$K_{\varepsilon,L}(q, A) = \int_{\mathbb{R}^d} \mathbb{I}_A\big(\hat{\Phi}^{\circ}_{\varepsilon,L}(q, p)\big)\, \alpha\big((q, p), \hat{\Phi}_{\varepsilon,L}(q, p)\big)\, \mathcal{N}(p; 0_d, I_d)\,dp + \delta_q(A) \int_{\mathbb{R}^d} \big\{1 - \alpha\big((q, p), \hat{\Phi}_{\varepsilon,L}(q, p)\big)\big\}\, \mathcal{N}(p; 0_d, I_d)\,dp, \tag{9}$$
for $q \in \mathbb{R}^d$, $A \in \mathcal{B}(\mathbb{R}^d)$. Irreducibility and geometric ergodicity of $K_{\varepsilon,L}$ have recently been established rigorously in
Durmus et al. [2017]; see also Cances et al. [2007], Livingstone et al. [2016] for previous works. These results can
be used to verify A1 in Section 1.2.
3.3 Coupled Hamiltonian Monte Carlo kernel
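Similarly to Section 2.2, we now consider coupling two HMC chains $(Q^i_n, P^i_n)_{n \geq 0}$, $i = 1, 2$, using the following
procedure. From two states $(Q^i_n, P^i_n)$, $i = 1, 2$, at iteration $n \geq 0$,
1. sample a velocity $P^*_n \sim \mathcal{N}(0_d, I_d)$, independently of other variables, and for $i = 1, 2$, set $(q^i_0, p^i_0) = (Q^i_n, P^*_n)$;
2. for $i = 1, 2$, perform leap-frog integration to obtain $(q^i_L, p^i_L) = \hat{\Phi}_{\varepsilon,L}(q^i_0, p^i_0)$;
3. sample $U \sim \mathcal{U}[0,1]$;
4. for $i = 1, 2$, if $U \leq \alpha((q^i_0, p^i_0), (q^i_L, p^i_L))$, set $(Q^i_{n+1}, P^i_{n+1}) = (q^i_L, p^i_L)$, otherwise set $(Q^i_{n+1}, P^i_{n+1}) = (Q^i_n, P^i_n)$.
The above procedure amounts to running two HMC chains with common random numbers. We denote the associated
coupled transition kernel on the position coordinates as $\bar{K}_{\varepsilon,L}((q^1, q^2), A^1 \times A^2)$ for $q^1, q^2 \in \mathbb{R}^d$ and $A^1, A^2 \in \mathcal{B}(\mathbb{R}^d)$.
Marginally we have $\bar{K}_{\varepsilon,L}((q^1, q^2), A^1 \times \mathbb{R}^d) = K_{\varepsilon,L}(q^1, A^1)$ and $\bar{K}_{\varepsilon,L}((q^1, q^2), \mathbb{R}^d \times A^2) = K_{\varepsilon,L}(q^2, A^2)$.
We suppose that $(Q^1_0, Q^2_0)$ are initialized according to $\pi_0$ independently, and $(P^1_0, P^2_0)$ with an arbitrary distribution
on $\mathbb{R}^{2d}$. We will write $\mathbb{P}_{\varepsilon,L}$ as the law of the coupled HMC chains $(Q^i_n, P^i_n)_{n \geq 0}$, $i = 1, 2$, and $\mathbb{E}_{\varepsilon,L}$ to denote
expectation with respect to $\mathbb{P}_{\varepsilon,L}$.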
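The four steps of the coupled kernel translate directly into code. The following Python sketch is a hedged illustration, not the authors' R implementation: the names `coupled_hmc_step`, `leapfrog`, `energy` and `distance` are hypothetical, only the position coordinates are tracked (velocities are refreshed at every iteration), and the standard Normal target in dimension 5 with $\varepsilon = 0.05$, $L = 20$ is an arbitrary choice.

```python
import math
import random


def leapfrog(q, p, grad_U, eps, L):
    """Run L leap-frog steps of size eps from (q, p); q and p are lists of floats."""
    q, p = list(q), list(p)
    g = grad_U(q)
    for _ in range(L):
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]
        q = [qi + eps * pi for qi, pi in zip(q, p)]
        g = grad_U(q)
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]
    return q, p


def energy(q, p, U):
    return U(q) + 0.5 * sum(v * v for v in p)


def coupled_hmc_step(q1, q2, U, grad_U, eps, L, rng):
    """One coupled HMC transition on the position coordinates."""
    p = [rng.gauss(0.0, 1.0) for _ in q1]   # step 1: common initial velocity
    u = rng.random()                         # step 3: common uniform variable
    new_states = []
    for q in (q1, q2):
        qL, pL = leapfrog(q, p, grad_U, eps, L)                        # step 2
        log_alpha = min(0.0, energy(q, p, U) - energy(qL, pL, U))     # log of (8)
        new_states.append(qL if u <= math.exp(log_alpha) else list(q))  # step 4
    return new_states[0], new_states[1]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative target: standard Normal in dimension 5.
U = lambda q: 0.5 * sum(v * v for v in q)
grad_U = lambda q: list(q)

rng = random.Random(2017)
q1, q2 = [2.0] * 5, [-2.0] * 5
initial_distance = distance(q1, q2)
for _ in range(30):
    q1, q2 = coupled_hmc_step(q1, q2, U, grad_U, eps=0.05, L=20, rng=rng)
final_distance = distance(q1, q2)
print(initial_distance, final_distance)
```

Because both chains use the same velocity and the same uniform variable, two chains started at the same position remain identical, and chains started apart are expected to move closer when both proposals are accepted.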
We now establish that the relaxed meeting time $\tau_\delta = \inf\{n \geq 0 : |Q^1_n - Q^2_n| \leq \delta\}$, for any $\delta > 0$, has geometric
tails. The following result can be used to establish A2 for the algorithm that will be introduced in the next section.
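Theorem 1. Suppose that the potential $U$ is twice continuously differentiable, the gradient of $U$ is globally $\beta$-Lipschitz
and there exists a compact set $S \in \mathcal{B}(\mathbb{R}^d)$ with $\mathrm{Leb}_d(S) > 0$ such that the restriction of $U$ to $S$, denoted by
$U|_S : S \to \mathbb{R}$, is $\alpha$-strongly convex. Then there exist $\tilde{\varepsilon} > 0$, $\tilde{L} \in \mathbb{N}$, $C \in \mathbb{R}_+$ and $\gamma \in (0,1)$ such that
$$\mathbb{P}_{\varepsilon,L}(\tau_\delta > n) \leq C\gamma^n, \quad n \in \mathbb{N}, \tag{10}$$
for any $\varepsilon < \tilde{\varepsilon}$ and $L > \tilde{L}$ satisfying $\varepsilon L < \tilde{\varepsilon}\tilde{L}$.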
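Proof. We first establish that the coupled HMC kernel is $\mathrm{Leb}_{2d}$-irreducible by adapting the arguments in Durmus
et al. [2017, proof of Theorem 2] to our coupling. Under the Lipschitz assumption on $\nabla U$, the arguments in Durmus
et al. [2017, proof of Theorem 2] imply that for any $L \in \mathbb{N}$, there exists $\tilde{\varepsilon}_L > 0$ such that the mapping $p \mapsto \hat{\Phi}^{\circ}_{\varepsilon,L}(q, p)$
is a continuously differentiable diffeomorphism from $\mathbb{R}^d$ to $\mathbb{R}^d$ for $q \in \mathbb{R}^d$ and $\varepsilon < \tilde{\varepsilon}_L$. Hence the mapping
$$p \mapsto \bar{\Phi}_{\varepsilon,L}(q, q', p) := \big(\hat{\Phi}^{\circ}_{\varepsilon,L}(q, p), \hat{\Phi}^{\circ}_{\varepsilon,L}(q', p)\big)$$
from $\mathbb{R}^d$ to $\mathbb{R}^{2d}$ is also a continuously differentiable diffeomorphism for $(q, q') \in \mathbb{R}^{2d}$ and $\varepsilon < \tilde{\varepsilon}_L$. Writing
$\bar{\Phi}^{-1}_{\varepsilon,L} : \mathbb{R}^{2d} \to \mathbb{R}^d$ as the inverse function, by a change of variables,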
Kε,L (q1, q2), AZRdZ1
ε,L(q1, p),ˆ
ε,L(q1, p)2
Iuα(qi, p),ˆ
Φε,L(qi, p)N(p; 0d, Id)du dp
ε,L(¯q); 0d, Id
det J¯
ε,L q)
du d¯q
Leb2d(A) inf
ε,L( ¯q)),ˆ
ε,L( ¯q))N¯
ε,L(¯q); 0d, Id
det J¯
ε,L q)
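for all $A \in \mathcal{B}(\mathbb{R}^{2d})$, where $J_{\bar{\Phi}^{-1}_{\varepsilon,L}}$ denotes the Jacobian matrix of $\bar{\Phi}^{-1}_{\varepsilon,L}$ (with the convention $0 \times +\infty = 0$). It follows
that $\bar{K}_{\varepsilon,L}$ is aperiodic and irreducible with respect to the Lebesgue measure on $\mathbb{R}^{2d}$.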
For any real-valued measurable function f:R, we write its level sets as Lf(`) = {x:f(x)`}for
`R. Define the kinetic energy function K(p) = |p|2/2, the levels U > infqSU(q)and ¯
U < supqSU(q)such that
U < ¯
U, and the sets C`=LU|S(`)×LK(¯
U)and ˜
U`)for `(U, ¯
Since Lebd(LU|S(`)) >0for `(U, ¯
U)under the assumptions on U,Leb2d-irreducibility of ¯
Kε,L implies for any
LNand ε < ˜εL, there exists NNsuch that
Pε,L Q1
NLU|S(`), Q2
When both chains enter the set LU|S(`), it follows from Lemma 1that there exist ˜
T > 0and ρ0<1such that
N, P
N, P
N)| ≤ ρ0|Q1
for all (Q1
N, Q2
N, P
C`and T < ˜
T. Hence we have
Pε,L |Φ
N, P
N, P
N)| ≤ ρ0|Q1
N| | Q1
NLU|S(`), Q2
By triangle inequality, consistency of the leap-frog integrator (6) and compactness of ˜
C`, there exists ε0˜εL,
L0Nand ρ1<1such that
Pε,L |ˆ
N, P
N, P
N)| ≤ ρ1|Q1
N| | Q1
NLU|S(`), Q2
for ε<ε0and L>L0satisfying εL =T. Again by consistency of the leap-frog integrator (7) and compactness of
C`, it follows from (8) that there exist ε1ε0,L1L0and η0<1/2such that
Pε,L Qi
N+1 =ˆ
N, P
N, P
for i= 1,2and ε<ε1,L>L1satisfying εL =T. By Fréchet’s inequality, the probability of accepting both
proposals satisfies
Pε,L Q1
N+1 =ˆ
N, P
N), Q2
N+1 =ˆ
N, P
N, Q2
N, P
Pε,L |Q1
N+1 Q2
N+1| ≤ ρ1|Q1
N| | Q1
NLU|S(`), Q2
To iterate this argument, note first that if (q, p)C`then continuity of Uand the mapping t7→ Φ
t(q, p)
implies Φ
t(q, p)LU|S(¯
U)for any tR+. Owing to time discretization, we only have ˆ
t(q, p)LU|S(¯
(q, p)C`and some η1>0, by another application of (7). It follows that there exists a number of iterations IN
that depends on ρ1, and an initial level `0(U,¯
U)depending on Iand η1such that
Pε,L |Q1
N+I| ≤ δ|Q1
NLU|S(`0), Q2
Therefore we can conclude (10) by applying Williams [1991, Exercise E.10.5].
Under similar conditions, Durmus et al. [2017] provide a convergence result for the marginal HMC chains, which
can be used to check A1; see also Cances et al. [2007], Livingstone et al. [2016], Mangoubi and Smith [2017] and
Tweedie [1983] for the finiteness of moments.
It is worth noting that the distance between chains might exceed $\delta$ at some future iterations $n > \tau_\delta$, and that
the event $\{|Q^1_n - Q^2_n| \leq \delta\}$ is not an exact meeting event; thus Theorem 1 does not establish A2. In the next
section, we combine coupled HMC kernels with another kernel designed to prompt exact meetings, which would
occur with large probability when the two chains are close.
4 Unbiased Hamiltonian Monte Carlo estimators
The construction of Jacob et al. [2017b] requires two chains that meet exactly. One possibility here is the approach
of Glynn and Rhee [2014], which involves the introduction of a truncation variable. Instead we propose to use
coupled Metropolis–Hastings steps to trigger exact meetings. These coupled MH steps are described in Section
4.1, and a summary of the proposed methodology combining the two coupled kernels is in Section 4.2. Section 4.3
briefly describes a further variance reduction technique.
4.1 Coupled Metropolis–Hastings steps
As in Section 1, let us denote the two chains by $(X_n)_{n \geq 0}$ and $(Y_n)_{n \geq 0}$; these correspond to the position coordinates
in Section 3, propagated with a time shift, e.g. $(X_{n+1}, Y_n) \sim \bar{K}_{\varepsilon,L}((X_n, Y_{n-1}), \cdot)$. According to Theorem 1, coupled
HMC chains are close to one another after some iterations. Denote the distance between the chains at step $n$ by
$\delta_n = |X_n - Y_{n-1}|$. In a coupled MH step with Normal random walk, a pair of proposals $(X^\star, Y^\star)$ is sampled from the maximal
coupling of $\mathcal{N}(X_n, \Sigma)$ and $\mathcal{N}(Y_{n-1}, \Sigma)$ [Jacob et al., 2017b]. Let us consider the case where $\Sigma = \sigma^2 I_d$ for some $\sigma > 0$.
Algorithm 1 Unbiased HMC estimator $\bar{H}_{k:m}(X, Y)$ of $\pi(h)$, with tuning parameters $\omega, \sigma, \varepsilon, L, k, m$.
The kernel $\bar{P}_\sigma$ refers to a coupled random walk MH kernel with proposal standard deviation $\sigma$, and maximally
coupled proposals. The kernel $\bar{K}_{\varepsilon,L}$ refers to a coupled HMC kernel with step size $\varepsilon$, $L$ leap-frog steps, and common
initial velocity at each step. The marginal kernels are denoted by $P_\sigma$ and $K_{\varepsilon,L}$ respectively.
1. Draw $X_0$ and $Y_0$ from an initial distribution $\pi_0$, and
(a) with probability $\omega$, sample $X_1 \sim P_\sigma(X_0, \cdot)$;
(b) otherwise sample $X_1 \sim K_{\varepsilon,L}(X_0, \cdot)$;
(c) set $n = 1$.
2. While $X_n \neq Y_{n-1}$ or $n < m$,
(a) with probability $\omega$, sample $(X_{n+1}, Y_n) \sim \bar{P}_\sigma((X_n, Y_{n-1}), \cdot)$;
(b) otherwise, sample $(X_{n+1}, Y_n) \sim \bar{K}_{\varepsilon,L}((X_n, Y_{n-1}), \cdot)$;
(c) if $X_{n+1} = Y_n$ and no meeting has occurred yet, set $\tau = n + 1$;
(d) increment $n \leftarrow n + 1$.
3. Compute $H_\ell(X, Y) = h(X_\ell) + \sum_{n=\ell}^{\max(m,\tau-1)} \{h(X_{n+1}) - h(Y_n)\}$ for $\ell \in [k:m]$,
and $\bar{H}_{k:m}(X, Y) = (m-k+1)^{-1} \sum_{\ell=k}^{m} H_\ell(X, Y)$; or compute $\bar{H}_{k:m}(X, Y)$ as in (2).
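Under the maximal coupling, we have $\mathbb{P}(X^\star = Y^\star) = 1 - d_{TV}(\mathcal{N}(X_n, \sigma^2 I_d), \mathcal{N}(Y_{n-1}, \sigma^2 I_d))$. The total variation
can be approximated as in Pollard [2005]. First, we have $d_{TV}(\mathcal{N}(X_n, \sigma^2 I_d), \mathcal{N}(Y_{n-1}, \sigma^2 I_d)) = \mathbb{P}(2\sigma|Z| \leq \delta_n)$,
where $Z$ is a univariate standard Normal variable and $\delta_n$ is considered fixed. Approximations of the folded Normal
cumulative distribution function then lead to
$$\mathbb{P}(X^\star = Y^\star) = 1 - \mathbb{P}(2\sigma|Z| \leq \delta_n) = 1 - \frac{1}{\sqrt{2\pi}} \frac{\delta_n}{\sigma} + O\!\left(\frac{\delta_n^2}{\sigma^2}\right), \quad \text{as } \delta_n \to 0.$$
To achieve $\mathbb{P}(X^\star = Y^\star) = s$ for some desired probability $s$, we can choose $\sigma$ as approximately $\delta_n/(\sqrt{2\pi}(1 - s))$.
The proposed values $(X^\star, Y^\star)$ are then accepted as the next states according to MH acceptance ratios, i.e. if
$U \leq \min(1, \pi(X^\star)/\pi(X_n))$ and $U \leq \min(1, \pi(Y^\star)/\pi(Y_{n-1}))$ respectively, where a single uniform variable $U \sim \mathcal{U}[0,1]$
is used for both chains.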
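The maximal coupling of the two Normal proposals and the rule of thumb for $\sigma$ can be illustrated as follows. This Python sketch uses the standard rejection-based construction of a maximal coupling; the function names, the seed and the numerical values are assumptions made for illustration and are not taken from the authors' code. The exact meeting probability is computed through the error function, using $\mathbb{P}(|Z| \leq x) = \mathrm{erf}(x/\sqrt{2})$.

```python
import math
import random


def log_density(z, mean, sigma):
    """Log-density of N(mean, sigma^2 I_d) at z, up to an additive constant."""
    return -0.5 * sum((zi - mi) ** 2 for zi, mi in zip(z, mean)) / sigma ** 2


def draw(mean, sigma, rng):
    return [rng.gauss(mi, sigma) for mi in mean]


def maximal_coupling(x, y, sigma, rng):
    """Sample (X*, Y*) with X* ~ N(x, sigma^2 I) and Y* ~ N(y, sigma^2 I),
    such that X* = Y* with the largest possible probability."""
    x_star = draw(x, sigma, rng)
    log_w = math.log(1.0 - rng.random())            # log of a Uniform(0, 1] draw
    if log_w + log_density(x_star, x, sigma) <= log_density(x_star, y, sigma):
        return x_star, x_star                       # the two proposals coincide
    while True:                                     # otherwise sample Y* by rejection
        y_star = draw(y, sigma, rng)
        log_w = math.log(1.0 - rng.random())
        if log_w + log_density(y_star, y, sigma) > log_density(y_star, x, sigma):
            return x_star, y_star


def meeting_probability(delta, sigma):
    """Exact value of 1 - P(2 sigma |Z| <= delta) for a standard Normal Z."""
    return 1.0 - math.erf(delta / (2.0 * math.sqrt(2.0) * sigma))


def sigma_for_target(delta, s):
    """Approximate rule: sigma = delta / (sqrt(2 pi) (1 - s))."""
    return delta / (math.sqrt(2.0 * math.pi) * (1.0 - s))


rng = random.Random(1)
x, y = [0.0, 0.0, 0.0], [0.3, -0.4, 0.0]            # delta = |x - y| = 0.5
delta, sigma = 0.5, 1.0
draws = [maximal_coupling(x, y, sigma, rng) for _ in range(20000)]
frequency = sum(xs == ys for xs, ys in draws) / len(draws)
print(frequency, meeting_probability(delta, sigma))
print(meeting_probability(delta, sigma_for_target(delta, 0.99)))
```

The empirical frequency of identical proposals should be close to the exact meeting probability, and plugging the approximate $\sigma$ back into the exact formula should return a value close to the requested $s$.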
If $\sigma$ is small compared to the spread of the target density function, the probability of jointly accepting the
proposals is high. On the other hand, $\sigma$ needs to be large compared to $\delta_n = |X_n - Y_{n-1}|$ for the event $\{X^\star = Y^\star\}$
to frequently occur. This leads to a trade-off; in numerical experiments, for pairs of chains propagated using the
coupled HMC kernel $\bar{K}_{\varepsilon,L}$, we can monitor both the distance $\delta_n$ and the target density values to guide the choice of
$\sigma$. We will choose a fixed value of $\sigma$ for all coupled MH steps, and leave adaptive strategies, where $\sigma$ would be e.g.
chosen according to $\delta_n$, for future research. Hereafter we denote by $P_\sigma$ and $\bar{P}_\sigma$ the marginal and coupled kernels
associated with the MH steps.
4.2 Combining kernels
We propose to use both coupled HMC and MH kernels through a mixture. The coupled HMC kernel is expected
to bring the two chains close to one another, while the coupled MH kernel enables exact meetings when the chains
are already close. In a mixture of kernels, at each step, the MH kernel is chosen with probability ω, otherwise the
HMC kernel is chosen. The procedure is described in Algorithm 1. Note that A3 is satisfied by design for coupled
chains generated by this algorithm. As the resulting coupled mixture kernel inherits properties of the coupled MH
kernel, A2 can in principle be verified by simply relying on the properties of coupled MH kernels established in
Jacob et al. [2017b]. However, we stress here that Theorem 1 provides some insight on the role of coupled HMC
steps on the efficiency of the proposed estimator.
We now comment on the computational cost of Algorithm 1. Assume for simplicity that the cost of evaluating
the target density is approximately equal to that of evaluating its gradient. Each HMC step is then $L + 1$ times
more expensive than a MH step. If we choose a small value for $\omega$, such as 0.1 or 0.05, the cost of the MH steps
becomes negligible. Secondly, the cost of running two chains is approximately twice the cost of running each chain
until meeting occurs. Thereafter, only one chain needs to be propagated up to step $m$. If we choose $m$ to be much
larger than $\tau$ with high probability, the cost of Algorithm 1 is therefore comparable to the cost of $m$ HMC iterations.
The efficiency of the unbiased HMC estimator depends on the mixing properties of the underlying HMC kernel,
and on the contraction achieved by the coupling. Importantly, the tuning parameters $\varepsilon$ and $L$ that would be
optimal for the marginal HMC kernel are not necessarily adequate for the coupled kernel, as illustrated in Section
5. The other tuning parameters include $\sigma$ for the coupled MH step discussed above, and Jacob et al. [2017b] give
recommendations for $k$ and $m$: namely $k$ can be chosen as a large quantile of the meeting times, and $m$ such that
$(m - k)/m \approx 1$, for instance $m = 10k$.
Finally, in Section 5.2 we will encounter a situation where the coupled HMC kernel contracts so quickly that the
distance $|X_n - Y_{n-1}|$ becomes smaller than machine precision after a small number of iterations. The two chains
can then be considered exactly identical, for all practical purposes, and the coupled MH steps become unnecessary.
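To illustrate how the pieces of Algorithm 1 fit together, here is a minimal Python sketch for a univariate standard Normal target with $h(x) = x$. It is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the tuning values ($\omega = 0.1$, $\sigma = 10^{-3}$, $\varepsilon = 0.1$, $L = 10$, $k = 5$, $m = 50$) and the initial distribution $\mathcal{N}(3, 1)$ are arbitrary, the marginal first step is obtained by applying the coupled kernel to two copies of the same state, and for simplicity both chains keep being propagated after meeting even though only one would be needed.

```python
import math
import random


def leapfrog(q, p, eps, L):
    """Leap-frog for the potential U(q) = q^2 / 2, whose gradient is q."""
    p -= 0.5 * eps * q
    for step in range(L):
        q += eps * p
        p -= (eps if step < L - 1 else 0.5 * eps) * q
    return q, p


def coupled_hmc(x, y, eps, L, rng):
    p, u = rng.gauss(0.0, 1.0), rng.random()        # common velocity and uniform
    out = []
    for q in (x, y):
        qL, pL = leapfrog(q, p, eps, L)
        energy_drop = 0.5 * (q * q + p * p) - 0.5 * (qL * qL + pL * pL)
        out.append(qL if u <= math.exp(min(0.0, energy_drop)) else q)
    return out[0], out[1]


def maximal_coupling(x, y, sigma, rng):
    logp = lambda z: -0.5 * ((z - x) / sigma) ** 2
    logq = lambda z: -0.5 * ((z - y) / sigma) ** 2
    xs = rng.gauss(x, sigma)
    if math.log(1.0 - rng.random()) + logp(xs) <= logq(xs):
        return xs, xs
    while True:
        ys = rng.gauss(y, sigma)
        if math.log(1.0 - rng.random()) + logq(ys) > logp(ys):
            return xs, ys


def coupled_mh(x, y, sigma, rng):
    xs, ys = maximal_coupling(x, y, sigma, rng)
    log_u = math.log(1.0 - rng.random())            # common uniform, on the log scale
    x_new = xs if log_u <= 0.5 * (x * x - xs * xs) else x   # log pi(xs) - log pi(x)
    y_new = ys if log_u <= 0.5 * (y * y - ys * ys) else y
    return x_new, y_new


def coupled_step(x, y, omega, sigma, eps, L, rng):
    if rng.random() < omega:
        return coupled_mh(x, y, sigma, rng)
    return coupled_hmc(x, y, eps, L, rng)


def unbiased_estimate(h, k, m, omega, sigma, eps, L, rng, init):
    X, Y = [init(rng)], [init(rng)]
    # Marginal step for X_1: the coupled kernel applied to (X_0, X_0).
    X.append(coupled_step(X[0], X[0], omega, sigma, eps, L, rng)[0])
    n, tau = 1, None
    while tau is None or n < m:                     # i.e. until n = max(m, tau)
        x_next, y_next = coupled_step(X[n], Y[n - 1], omega, sigma, eps, L, rng)
        X.append(x_next)
        Y.append(y_next)
        n += 1
        if tau is None and X[n] == Y[n - 1]:
            tau = n
    width = m - k + 1
    mcmc = sum(h(X[j]) for j in range(k, m + 1)) / width
    corr = sum(min(j - k + 1, width) * (h(X[j + 1]) - h(Y[j]))
               for j in range(k, tau - 1)) / width
    return mcmc + corr, tau


rng = random.Random(42)
init = lambda r: r.gauss(3.0, 1.0)                  # deliberately far from the target
results = [unbiased_estimate(lambda x: x, 5, 50, 0.1, 1e-3, 0.1, 10, rng, init)
           for _ in range(300)]
mean_estimate = sum(est for est, _ in results) / len(results)
print(mean_estimate, max(tau for _, tau in results))
```

Averaging the independent replicates should give a value close to the target mean, which is zero here; in practice the replicates would be dispatched to parallel processors.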
4.3 Choice of weights and variance reduction
As suggested in Jacob et al. [2017a,b], the estimators $H_\ell(X, Y)$ for $\ell \in [k:m]$ given in (1) can be averaged with
any weights $(w_\ell)_{\ell=k}^{m}$ such that $\sum_{\ell=k}^{m} w_\ell = 1$. The estimator $\bar{H}_{k:m}(X, Y)$ in (2) corresponds to weights equal to
$(m-k+1)^{-1}$. For an arbitrary choice $(w_\ell)_{\ell=k}^{m}$, the estimator $\sum_{\ell=k}^{m} w_\ell H_\ell(X, Y)$ is unbiased and its variance is given
by $w^T \Sigma_H w$, where $\Sigma_H$ denotes the $(m-k+1) \times (m-k+1)$ covariance matrix of the estimators $(H_k, \ldots, H_m)$.
To minimize such a variance without violating the sum constraint, we solve the system
$$\begin{pmatrix} \Sigma_H & 1_{m-k+1} \\ 1_{m-k+1}^T & 0 \end{pmatrix} \begin{pmatrix} w \\ \lambda \end{pmatrix} = \begin{pmatrix} 0_{m-k+1} \\ 1 \end{pmatrix},$$
where $\lambda$ is a Lagrange multiplier, for a computational cost of order $(m-k+1)^3$. The matrix $\Sigma_H$ can be approximated
from i.i.d. realizations of $H_\ell$ for $\ell \in [k:m]$. The resulting weights can then be used to reduce the variance of
$\bar{H}_{k:m}(X, Y)$, especially if the original MCMC chain exhibits strong autocorrelations.
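The constrained minimization can be sketched as a small linear solve. In the Python example below, the helper names `solve` and `optimal_weights` and the covariance matrices are hypothetical and chosen only for illustration; in practice $\Sigma_H$ would be estimated from independent realizations. The code builds the linear system obtained by setting the gradient of the Lagrangian of $w^T \Sigma_H w$ under the sum constraint to zero, and solves it by Gaussian elimination.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def optimal_weights(Sigma):
    """Minimise w^T Sigma w subject to sum(w) = 1 via the Lagrangian system."""
    n = len(Sigma)
    A = [Sigma[i][:] + [1.0] for i in range(n)] + [[1.0] * n + [0.0]]
    b = [0.0] * n + [1.0]
    return solve(A, b)[:n]                      # drop the Lagrange multiplier


def quadratic_form(w, Sigma):
    n = len(w)
    return sum(w[i] * Sigma[i][j] * w[j] for i in range(n) for j in range(n))


# Hypothetical covariance matrix of three estimators.
Sigma = [[4.0, 1.0, 0.5],
         [1.0, 2.0, 0.3],
         [0.5, 0.3, 1.0]]
w = optimal_weights(Sigma)
uniform = [1.0 / 3.0] * 3
print(w, quadratic_form(w, Sigma), quadratic_form(uniform, Sigma))

# With uncorrelated estimators the weights are proportional to inverse variances.
w_diag = optimal_weights([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]])
```

The optimal weights sum to one and give a variance no larger than that of uniform weights; the cubic cost of the solve matches the order stated above.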
5 Numerical illustrations
We investigate some key aspects of the proposed unbiased HMC estimator, such as its efficiency compared to
standard HMC estimators. As in the rest of the article, we choose a Normal distribution for the initial velocities
at each HMC step, and a unit mass matrix; other choices are possible [Girolami and Calderhead,2011,Livingstone
et al.,2017].
In all experiments, whenever the test function $h$ is not specified, it is chosen as $h : x \mapsto x_1$, so that $\pi(h)$ is simply
the mean of the first target marginal distribution. The asymptotic variance of an MCMC estimator refers to the
variance appearing in the central limit theorem satisfied by $N^{-1} \sum_{n=0}^{N} h(X_n)$ as $N \to \infty$, where $(X_n)_{n \geq 0}$ is the
chain generated by the algorithm. Here, these asymptotic variances are approximated with the spectrum0 function
of the coda package [Plummer et al., 2006]. For unbiased estimators, we define the asymptotic efficiency as variance
multiplied by expected cost [Glynn and Whitt, 1992]. This accounts for the fact that, in a given computing budget,
more estimators can be averaged over if each one can be produced faster. For the estimator $\bar{H}_{k:m}(X, Y)$ in (2), the
expected computing time $\mathbb{E}[\max(\tau, m)]$ and the variance $\mathbb{V}[\bar{H}_{k:m}(X, Y)]$ are approximated by empirical averages of
independent realizations.
5.1 Multivariate Normal distribution
Let the target $\pi$ be a multivariate Normal $\mathcal{N}(0_d, \Sigma_\pi)$ with $d = 250$ and with the $(i, j)$-entry of $\Sigma_\pi$ equal to
$\exp(-|i - j|)$. In this example we discuss the choice of trajectory length, defined as the product $\varepsilon L$, and the use of
coupled MH kernels to trigger exact meetings.
We fix the number of leap-frog steps to $L = 20$ and vary the step size $\varepsilon$ so that the trajectory length $\varepsilon L$ spans
between $0$ and $3\pi/2$, where $\pi$ here denotes the mathematical constant. The initial distribution $\pi_0$ is chosen as the
target. For each trajectory length, the asymptotic variance of HMC computed from 5,000 iterations is shown in
Figure 1a. The optimal trajectory length is close to the value $\pi$, which is consistent with the analytical solution
in Section 2.2. For such a trajectory length, the asymptotic variance is smaller than the variance obtained with
perfect samples from the target, thanks to negative auto-correlations.
[Figure 1, panel (a): HMC asymptotic variance against trajectory length. Panel (b): distance after 100 coupled HMC iterations against trajectory length $\varepsilon L$. Horizontal axes show trajectory lengths from $\pi/4$ to $6\pi/4$.]
Figure 1: In the multivariate Normal example of Section 5.1, asymptotic variance for the estimation of $\int x_1 \pi(dx)$
using HMC, computed using chains of length 5,000 started at stationarity (left). Euclidean distance between the
100-th iterate of coupled HMC chains (right). The number of leap-frog steps is set to $L = 20$, which implicitly
determines the step size $\varepsilon$ for each trajectory length $\varepsilon L$. Each dot corresponds to one of 5 independent runs.
[Figure 2, panel (a): log-distance between coupled HMC chains against iterations. Panel (b): log-distance between coupled chains propagated with a mixture of HMC and MH kernels, against iterations.]
Figure 2: In the multivariate Normal example of Section 5.1, distance between coupled HMC chains against number
of iterations (left), and between chains propagated with the mixture of HMC and MH kernels, with $\sigma = 10^{-5}$
and $\omega = 0.1$ (right). Each line corresponds to one of 100 independent runs.
We then run 100 iterations of coupled HMC and compute the Euclidean distance between the two final states. The resulting distances are shown in Figure 1b. Lengths around the value $\pi/2$ lead to the smallest distances, consistent with the analytical reasoning of Section 2.2. Moreover, there is a range of lengths that lead to contraction. On the other hand, the optimal length for the HMC estimator, which was the value $\pi$, does not lead to visible contraction after 100 iterations. Therefore, the proposed coupling contracts most with tuning parameters that are not optimal for the underlying HMC algorithm, which results in a loss of efficiency.
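The contraction described above can be reproduced with a short sketch. The Python code below is an illustration under stated assumptions, not the authors' implementation: it couples two HMC chains on the Normal target of this section by sharing the initial velocity and the acceptance uniform at every iteration, with $L = 20$ and $\varepsilon L = \pi/2$. The function names (`leapfrog`, `coupled_hmc_step`), the seed, and the explicit precision matrix are choices made for the example.

```python
import numpy as np

def leapfrog(x, v, grad_U, eps, n_steps):
    """Leap-frog integration of Hamiltonian dynamics with unit mass matrix."""
    v = v - 0.5 * eps * grad_U(x)          # initial half step on the velocity
    for step in range(n_steps):
        x = x + eps * v                    # full step on the position
        # full velocity step, except for a half step at the very end
        scale = eps if step < n_steps - 1 else 0.5 * eps
        v = v - scale * grad_U(x)
    return x, v

def coupled_hmc_step(x, y, U, grad_U, eps, n_steps, rng):
    """One coupled HMC iteration: common velocity and common uniform."""
    v = rng.standard_normal(x.shape)       # shared initial velocity
    log_u = np.log(rng.uniform())          # shared acceptance uniform
    new_states = []
    for z in (x, y):
        z_prop, v_prop = leapfrog(z, v, grad_U, eps, n_steps)
        log_ratio = (U(z) + 0.5 * v @ v) - (U(z_prop) + 0.5 * v_prop @ v_prop)
        new_states.append(z_prop if log_u < log_ratio else z)
    return new_states[0], new_states[1]

# Target of Section 5.1: N(0, Sigma) with Sigma[i, j] = exp(-|i - j|), d = 250.
d = 250
idx = np.arange(d)
Sigma = np.exp(-np.abs(idx[:, None] - idx[None, :]))
precision = np.linalg.inv(Sigma)
U = lambda x: 0.5 * x @ precision @ x      # potential energy, -log density
grad_U = lambda x: precision @ x

rng = np.random.default_rng(1)             # arbitrary seed for the example
chol = np.linalg.cholesky(Sigma)
x = chol @ rng.standard_normal(d)          # both chains start from the target
y = chol @ rng.standard_normal(d)

n_steps = 20
eps = (np.pi / 2) / n_steps                # trajectory length eps * L = pi / 2
initial_distance = np.linalg.norm(x - y)
for _ in range(100):
    x, y = coupled_hmc_step(x, y, U, grad_U, eps, n_steps, rng)
final_distance = np.linalg.norm(x - y)
print(initial_distance, final_distance)
```

Changing `eps` so that the trajectory length is close to $\pi$ is a simple way to observe the lack of visible contraction reported for that setting.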
Based on Figure 1b, we set $\varepsilon L = \pi/2$, $L = 20$ and run coupled chains, 100 times independently, until the distance between them is less than machine precision. In Figure 2a these distances are plotted on a logarithmic scale against iterations; the lines drop when the distances fall below machine precision, which occurs between iterations 127 and 312. The distances are already very small after a few dozen iterations. We implement the proposed algorithm with a mixture of kernels described in Section 4.2, with $\sigma = 10^{-5}$ and $\omega = 0.1$, and plot the resulting distances in Figure 2b. All meeting times then occur between iterations 36 and 97. The MH steps thus successfully manage to trigger exact meetings.
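To illustrate how coupled random walk MH steps can produce exact meetings once the chains are close, the following Python sketch uses the standard rejection-based maximal coupling of two Normal proposals with common standard deviation $\sigma$. Section 4.2 is not reproduced here, so the details are assumptions: the function names, the standard Normal stand-in target, the dimension, and the starting distance of $10^{-6}$ are all hypothetical choices for the example.

```python
import numpy as np

def log_normal_kernel(z, mean, sigma):
    """Log-density of N(mean, sigma^2 I) at z, up to a constant shared by both proposals."""
    return -0.5 * np.sum((z - mean) ** 2) / sigma ** 2

def maximal_coupling(x, y, sigma, rng):
    """Draw (X*, Y*) with X* ~ N(x, sigma^2 I), Y* ~ N(y, sigma^2 I), maximizing P(X* = Y*)."""
    x_prop = x + sigma * rng.standard_normal(x.shape)
    log_w = np.log(rng.uniform())
    if log_w + log_normal_kernel(x_prop, x, sigma) <= log_normal_kernel(x_prop, y, sigma):
        return x_prop, x_prop.copy()       # identical proposals
    while True:                            # otherwise sample Y* from the residual
        y_prop = y + sigma * rng.standard_normal(y.shape)
        log_w = np.log(rng.uniform())
        if log_w + log_normal_kernel(y_prop, y, sigma) > log_normal_kernel(y_prop, x, sigma):
            return x_prop, y_prop

def coupled_mh_step(x, y, log_target, sigma, rng):
    """Coupled random walk MH: maximally coupled proposals, common acceptance uniform."""
    x_prop, y_prop = maximal_coupling(x, y, sigma, rng)
    log_u = np.log(rng.uniform())
    x_new = x_prop if log_u < log_target(x_prop) - log_target(x) else x
    y_new = y_prop if log_u < log_target(y_prop) - log_target(y) else y
    return x_new, y_new

# Stand-in target (assumption): standard Normal in dimension 10.
log_target = lambda z: -0.5 * z @ z
rng = np.random.default_rng(0)
sigma = 1e-5
x0 = rng.standard_normal(10)
direction = rng.standard_normal(10)
y0 = x0 + 1e-6 * direction / np.linalg.norm(direction)   # chains already very close

n_trials = 200
n_meetings = 0
for _ in range(n_trials):
    x1, y1 = coupled_mh_step(x0, y0, log_target, sigma, rng)
    n_meetings += int(np.array_equal(x1, y1))
meeting_fraction = n_meetings / n_trials
print(meeting_fraction)
```

When the two states are far apart relative to $\sigma$, the proposals almost never coincide, which is why such steps are only useful after the HMC kernel has brought the chains close together.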
We set $k = 50$ and $m = 500$ to produce $R = 100$ unbiased estimators of $\int x_1 \pi(dx)$ as in (2). The asymptotic efficiency is approximately equal to 1.96. The asymptotic variance of HMC obtained with $\varepsilon L = \pi$ was found to be approximately 0.16, averaging the 5 runs shown in Figure 1a. Therefore, the proposed estimator is about 12 times less efficient ($1.96 / 0.16 \approx 12$) than the original HMC algorithm when optimally tuned. Depending on hardware, this can be considered an acceptable loss in exchange for complete parallelism, among other advantages of unbiased estimators discussed, e.g., in Rhee [2013], Jacob et al. [2017b]. Unbiased estimators could also be obtained from variants of HMC where the number of leap-frog steps $L$ is random, and possibly adaptive, which might reduce the efficiency loss.
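As a reminder of how such estimators and their efficiency can be computed from coupled chains, here is a hedged Python sketch. Equation (2) is not reproduced in this section, so the code assumes the time-averaged form of Jacob et al. [2017b], namely $\bar{H}_{k:m}(X, Y) = \frac{1}{m-k+1} \sum_{t=k}^{m} h(X_t) + \sum_{t=k+1}^{\tau-1} \min\left(1, \frac{t-k}{m-k+1}\right) \{h(X_t) - h(Y_{t-1})\}$, and it assumes that the reported efficiency figure is the product of the average of $\max(\tau, m)$ and the empirical variance of the estimators. The function names and the toy inputs are hypothetical.

```python
import numpy as np

def time_averaged_estimator(h_x, h_y, tau, k, m):
    """Unbiased estimator from one pair of coupled chains (assumed form, see lead-in).

    h_x[t] holds h(X_t) for t = 0, ..., max(m, tau - 1);
    h_y[t] holds h(Y_t) for t = 0, ..., tau - 2 at least.
    """
    mcmc_average = np.mean(h_x[k:m + 1])
    correction = 0.0
    for t in range(k + 1, tau):            # empty when tau <= k + 1
        weight = min(1.0, (t - k) / (m - k + 1))
        correction += weight * (h_x[t] - h_y[t - 1])
    return mcmc_average + correction

def asymptotic_inefficiency(estimates, taus, m):
    """Average cost max(tau, m) times the empirical variance of the estimators."""
    cost = np.mean(np.maximum(taus, m))
    return cost * np.var(estimates, ddof=1)

# Toy inputs, chosen only so the arithmetic can be checked by hand.
h_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
h_y = np.array([0.0, 0.0, 0.0, 0.0])
estimate = time_averaged_estimator(h_x, h_y, tau=4, k=1, m=3)
print(estimate)                            # 2 + (1/3) * 2 + (2/3) * 3

# Efficiency loss quoted in the text for the Normal example.
loss = 1.96 / 0.16
print(round(loss, 2))
```

Because each estimator depends only on its own pair of chains, the $R$ calls to `time_averaged_estimator` can run on separate processors before the results are averaged.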
5.2 Truncated Normal distribution
We consider Hamiltonian Monte Carlo on truncated Normal distributions, with truncations defined by linear and
quadratic inequalities. In this setting Pakman and Paninski [2014] show that Hamiltonian dynamics can be solved,
resulting in trajectories that bounce off the constraints. An R package implementing the method of Pakman and
Paninski [2014] is available online [Pakman, 2012]. Using this package, the implementation of the proposed method
only involved simple modifications.
We consider two of the examples in Pakman and Paninski [2014], where a bivariate Normal distribution is truncated by two linear and two quadratic constraints respectively. A thousand HMC samples are shown in Figure 3 (top row). The first distribution is a bivariate Normal, with unit covariance matrix and mean $(4, 4)$, restricted to the set $\{x_1 \le x_2 \le 1.1\, x_1\} \subset \mathbb{R}^2$ (Figure 3a). The second distribution is a bivariate standard Normal restricted to the set $\{(x_1 - 4)^2/32 + (x_2 - 1)^2/8 \le 1\} \cap \{4 x_1^2 + 8 x_2^2 - 2 x_1 x_2 + 5 x_2 \ge 1\} \subset \mathbb{R}^2$ (Figure 3b). We use the value $\pi/2$ as a trajectory length, as advocated in Pakman and Paninski [2014]. As for the initial distribution $\pi_0$, we use a point mass at $(2, 2.1)$ for the first target, and at $(2, 0)$ for the second one.
In this example, the proposed coupling induces a contraction that leads to distances between trajectories becoming smaller than machine precision after a few iterations. Therefore, we do not need to resort to coupled MH steps: we can define the meeting times directly as the first times for which distances are less than machine precision. Histograms of such meeting times are shown in Figure 3 for both targets (bottom row). They indicate that small values of $k$ and $m$ could be chosen, effectively leading to the possibility of running very short HMC chains in parallel in a principled way.
5.3 Logistic regression
We consider a Bayesian logistic regression as in Hoffman and Gelman [2014], on the classic German credit data set. Including pairwise interactions, the covariates are in a matrix $X$ with $N = 1000$ rows and $p = 300$ columns, which we standardize by column. The parameters are the intercept $\alpha \in \mathbb{R}$, coefficients $\beta \in \mathbb{R}^p$, and a prior variance $\sigma^2 \in \mathbb{R}_+$ on intercept and coefficients. The likelihood specifies that the binary outcome $Y_i$ satisfies $\mathbb{P}(Y_i = 1 \mid X_i, \alpha, \beta) = (1 + \exp(-\alpha - X_i^T \beta))^{-1}$ for all $i \in [1 : N]$. The prior specifies $\alpha \mid \sigma^2 \sim \mathcal{N}(0, \sigma^2)$ and $\beta_j \mid \sigma^2 \sim \mathcal{N}(0, \sigma^2)$, for all $j \in [1 : p]$, and an Exponential distribution with rate $\lambda = 0.01$ for $\sigma^2$. We transform $\sigma^2$ into $\log \sigma^2$, so that each parameter lies in $\mathbb{R}$. The target $\pi$ is the posterior distribution of $(\alpha, \beta, \log \sigma^2)$, of dimension $d = p + 2 = 302$. We use an independent standard Normal for each parameter to initialize the chains, which defines $\pi_0$.
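For concreteness, the log-density of this target and its gradient, which is what the leap-frog integrator needs, can be written as in the Python sketch below. It is a minimal illustration under assumptions: constants are dropped, the term $+\log \sigma^2$ is the Jacobian of the transformation to $\log \sigma^2$, the data are synthetic because the German credit data are not bundled here, and the function name `log_posterior_and_grad` is hypothetical.

```python
import numpy as np

def log_posterior_and_grad(theta, X, y, lam=0.01):
    """Unnormalized log posterior of (alpha, beta, log sigma^2) and its gradient."""
    n_cov = X.shape[1]
    alpha, beta, s = theta[0], theta[1:-1], theta[-1]
    sigma2 = np.exp(s)                                  # s = log sigma^2
    eta = alpha + X @ beta                              # linear predictor
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))   # Bernoulli log-likelihood
    sq_norm = alpha ** 2 + beta @ beta
    # N(0, sigma^2) priors on alpha and beta, Exponential(lam) prior on sigma^2,
    # plus the Jacobian term s from the change of variable sigma^2 -> s.
    logprior = -0.5 * (n_cov + 1) * s - 0.5 * sq_norm / sigma2 - lam * sigma2 + s
    resid = y - np.exp(-np.logaddexp(0.0, -eta))        # y - P(Y = 1 | X, alpha, beta)
    grad = np.empty_like(theta)
    grad[0] = resid.sum() - alpha / sigma2
    grad[1:-1] = X.T @ resid - beta / sigma2
    grad[-1] = -0.5 * (n_cov + 1) + 0.5 * sq_norm / sigma2 - lam * sigma2 + 1.0
    return loglik + logprior, grad

# Synthetic stand-in data (assumption), standardized by column as in the text.
rng = np.random.default_rng(0)
N, p = 50, 5
X = rng.standard_normal((N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = rng.integers(0, 2, size=N).astype(float)

theta = rng.standard_normal(p + 2)                      # a draw from the initial distribution
value, grad = log_posterior_and_grad(theta, X, y)

# Central finite differences as a sanity check of the analytical gradient.
h = 1e-6
fd = np.array([
    (log_posterior_and_grad(theta + h * e, X, y)[0]
     - log_posterior_and_grad(theta - h * e, X, y)[0]) / (2 * h)
    for e in np.eye(p + 2)
])
print(np.max(np.abs(grad - fd)))
```

The potential energy used by HMC is the negative of `value`, and its gradient is the negative of `grad`.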
We set $L = 20$ and vary $\varepsilon$ so that the trajectory length $\varepsilon L$ is in the range $[0.1, 0.5]$. For each length, we run 10,000 HMC iterations, discard the first 5,000 as burn-in, and use the remaining 5,000 samples to approximate the asymptotic variance of HMC for the estimation of $\int x_1 \pi(dx)$, which here is the posterior expectation of the intercept. The results of independent runs are shown in Figure 4a. Coupled HMC chains are then run for 1,000 iterations, and the distances between the final states are shown in Figure 4b. Again, the optimal choice of $\varepsilon L$ for the asymptotic variance of HMC is not optimal in terms of contraction. However, in contrast to the example of Section 5.1, here each of the considered trajectory lengths yields some contraction.
Using the length $\varepsilon L = 0.1$, we then proceed with Algorithm 1 of Section 4.2, using $\sigma = 10^{-5}$ and $\omega = 0.05$. Over 100 independent experiments, we compute the distance between the coupled chains, using two different initializations. The first is the standard Normal distribution on each parameter as above, leading to the distances plotted in Figure 5a. The observed meeting times occur between iterations 256 and 535. Using $k = 100$ and $m = 1{,}000$, we produce 100 independent estimators $\bar{H}_{k:m}(X, Y)$ from these coupled chains, in order to approximate the marginal means and variances of the target. With these values, we construct a Normal approximation of the target, with a diagonal covariance matrix, and use this Normal as a new initial distribution $\pi_0$. For this better initialization, the distance traces are shown in Figure 5b. The observed meeting times occur between iterations 192 and 422, and the plot shows that the distances decrease faster than with the previous initialization. The vertical upward jumps in Figure 5 correspond to events where one chain accepts its HMC proposal while the other chain does not.
With this better initialization, again using $k = 100$ and $m = 1{,}000$, we produce $R = 1{,}000$ independent estimators of $\int x_1 \pi(dx)$. The asymptotic efficiency is found to be approximately 0.40. The asymptotic variance of HMC obtained with $\varepsilon L = 0.3$ was found to be approximately 0.09, and with $\varepsilon L = 0.1$ approximately 0.33; these were obtained from $10^5$ HMC iterations after discarding 5,000 iterations as burn-in. Therefore, the proposed estimator is about 4 times less efficient than the original HMC estimator when optimally tuned, or more precisely, for the optimal value of $\varepsilon$ given a fixed value $L = 20$. We could also use $\varepsilon L = 0.3$ for the unbiased HMC estimator, according to Figure 4b, but the meeting times would then be longer, and the potential for parallelization would thus be reduced.
Figure 3: In the truncated Normal example of Section 5.2, scatter plots of 1,000 HMC samples approximating a bivariate Normal truncated by two linear constraints (a, top left) and by two quadratic constraints (b, top right). Histograms of 1,000 meeting times, defined as the first times for which the distance is smaller than machine precision, for coupled HMC chains targeting the bivariate Normal with linear constraints (c, bottom left) and with quadratic constraints (d, bottom right).
Figure 4: In the logistic regression example of Section 5.3, (a) asymptotic variance of HMC for the estimation of $\int x_1 \pi(dx)$ against trajectory length $\varepsilon L$, computed using chains of length 10,000 started from an independent standard Normal distribution for each parameter, and discarding a burn-in of 5,000 steps (left); (b) Euclidean distance between the 1,000-th iterates of coupled HMC chains against trajectory length $\varepsilon L$ (right). In both panels the horizontal axis shows trajectory lengths from 0.1 to 0.5. The number of leap-frog steps is set to $L = 20$, which implicitly determines the step size $\varepsilon$ for each trajectory length $\varepsilon L$. Each dot corresponds to one of 5 independent runs.
Figure 5: In the logistic regression example of Section 5.3, log-distance between coupled chains against number of iterations, for chains initialized from independent standard Normal distributions for each parameter (a, left), and for chains initialized from a crude Normal approximation of the target (b, right). The Normal approximation is obtained by estimating the 302 marginal means and variances of the target distribution. In both cases the chains are propagated using a mixture of HMC and MH kernels, with $\sigma = 10^{-5}$ and $\omega = 0.05$, and the HMC kernel uses $L = 20$ and $\varepsilon L = 0.1$. Each line corresponds to one of 100 independent runs.
From the coupled chains, histograms can be produced by binning a dimension of the space and estimating posterior masses of these bins, which are integrals of indicator functions [Jacob et al., 2017b]. Histograms of $\alpha$ and $\beta_1$ under the posterior distribution are shown in Figure 6. The vertical bars indicate the point estimates of posterior masses, and gray rectangles represent 95% confidence intervals based on the central limit theorem. The overlaid red curves show kernel density estimates obtained from $10^5$ HMC samples, after discarding a burn-in of 5,000 steps, and using $L = 20$ and $\varepsilon L = 0.3$. Taking these kernel density estimates as ground truth, the narrowness of confidence intervals reflects the accuracy of the proposed estimators. We stress that these confidence intervals are based on the central limit theorem for averages of independent variables, and are therefore justified in the limit of the number of independent estimators, all of which can be computed in parallel.
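The construction of such histograms with confidence intervals can be sketched in a few lines of Python. This is an illustration under assumptions, not the code behind Figure 6: each row of `bin_estimates` stands for one independent unbiased estimator of the bin masses, and here these rows are simulated from exact standard Normal draws as a stand-in for estimators computed from coupled chains. The function name, bin edges, and sample sizes are hypothetical.

```python
import numpy as np

def histogram_with_intervals(bin_estimates, z=1.96):
    """Point estimates and 95% CLT intervals from R independent unbiased estimators.

    bin_estimates has shape (R, n_bins); row r holds the r-th estimator of each bin mass.
    """
    n_estimators = bin_estimates.shape[0]
    point = bin_estimates.mean(axis=0)
    half_width = z * bin_estimates.std(axis=0, ddof=1) / np.sqrt(n_estimators)
    return point, point - half_width, point + half_width

# Stand-in estimators (assumption): bin frequencies of a few exact Normal draws.
rng = np.random.default_rng(0)
edges = np.linspace(-4.0, 4.0, 17)          # 16 bins on [-4, 4]
R, draws_per_estimator = 1000, 20
bin_estimates = np.stack([
    np.histogram(rng.standard_normal(draws_per_estimator), bins=edges)[0]
    / draws_per_estimator
    for _ in range(R)
])

point, lower, upper = histogram_with_intervals(bin_estimates)
for left, right, est, lo, hi in zip(edges[:-1], edges[1:], point, lower, upper):
    print(f"[{left:+.1f}, {right:+.1f}): {est:.4f}  ({lo:.4f}, {hi:.4f})")
```

The interval widths shrink at rate $1/\sqrt{R}$, so accuracy is improved by adding independent estimators and not by lengthening any single chain.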
6 Discussion
Coupled Hamiltonian Monte Carlo chains can be combined to generate unbiased estimators of integrals with respect to target distributions. With adequate couplings, such chains become exactly equal after a random number of steps. The proposed approach involves a simple coupling of Hamiltonian Monte Carlo kernels, based on common random numbers, that generates chains converging to one another. Combined with coupled random walk Metropolis–Hastings steps, the approach leads to estimators that can be produced independently in parallel and averaged. The method is demonstrated on three examples, and a contraction property of coupled HMC kernels is formally established under strong log-concavity of the target on parts of the state space. Recently, Mangoubi and Smith [2017] have proposed a much deeper study of the same coupling, and have adroitly exploited it to obtain novel quantitative bounds on mixing properties of HMC. The same coupling was already discussed in Neal [2002], for the purpose of removing the burn-in bias. The exploration of further links between our proposed estimators and the circular coupling of Neal [2002] is an exciting avenue of research. The proposed couplings also enable other unbiased estimators, such as those of Glynn and Rhee [2014], which do not require exact meetings.
As seen in numerical experiments, optimal trajectory lengths for standard HMC estimators are not optimal in
the coupled construction. This leads to a loss of efficiency of the proposed estimators compared to standard HMC
estimators. Whether this loss is acceptable or not will likely depend on the target distribution and the available
hardware. Other considerations include the construction of confidence intervals, which is arguably simpler with
i.i.d. variables than with Markov chains, and the unbiased property itself, which could be appealing in various
Figure 6: In the logistic regression example of Section 5.3, histograms of the posterior distributions of the intercept $\alpha$ (a, left) and of the first coefficient $\beta_1$ (b, right). Vertical bars indicate point estimates of posterior mass in each bin, obtained with 1,000 unbiased HMC estimators, and 95% confidence intervals are represented by gray rectangles. Red curves represent kernel density estimates computed from $10^5$ HMC iterations, considered as the ground truth.
To improve asymptotic efficiencies, random numbers of leap-frog steps, and adaptive selection of that number
based on the distance between the chains, would be interesting topics of research. A related question would be
the construction of unbiased estimators from the No-U-Turn sampler of Hoffman and Gelman [2014]. Finally, the
optimal weights described in Section 4.3 could potentially bring significant variance reduction in situations where
HMC chains exhibit significant autocorrelations.
Acknowledgements
Pierre E. Jacob gratefully acknowledges support by the National Science Foundation through grant DMS-1712872.
References
Beskos A., Pillai N., Roberts G., Sanz-Serna J.-M., and Stuart A. The acceptance probability of the Hybrid Monte Carlo method in high-dimensional problems. In AIP Conference Proceedings, volume 1281, pages 23–26. AIP, 2010.
Beskos A., Pillai N., Roberts G., Sanz-Serna J.-M., and Stuart A., 2013. Optimal tuning of the Hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534.
Brooks S. P., Gelman A., Jones G., and Meng X.-L., 2011. Handbook of Markov chain Monte Carlo. CRC press.
Cances E., Legoll F., and Stoltz G., 2007. Theoretical and numerical comparison of some sampling methods for molecular dynamics. ESAIM: Mathematical Modelling and Numerical Analysis, 41(2):351–389.
Carpenter B., Gelman A., Hoffman M. D., Lee D., Goodrich B., Betancourt M., Brubaker M. A., Guo J., Li P., and Riddell A., 2016. Stan: a probabilistic programming language. Journal of Statistical Software, 20:1–37.
Casella G., Lavine M., and Robert C. P., 2001. Explaining the perfect sampler. The American Statistician, 55(4):299–305.
Duane S., Kennedy A. D., Pendleton B. J., and Roweth D., 1987. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.
Durmus A., Moulines E., and Saksman E., 2017. On the convergence of Hamiltonian Monte Carlo. arXiv preprint
Girolami M. and Calderhead B., 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.
Glynn P. W. Exact simulation versus exact estimation. In Winter Simulation Conference (WSC), 2016, pages 193–205. IEEE, 2016.
Glynn P. W. and Rhee C.-H., 2014. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389.
Glynn P. W. and Whitt W., 1992. The asymptotic efficiency of simulation estimators. Operations Research, 40(3):505–520.
Hairer E., Wanner G., and Lubich C., 2005. Geometric numerical integration: structure-preserving algorithms for ordinary differential equations. Springer-Verlag, New York.
Hoffman M. D. and Gelman A., 2014. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
Huber M., 2016. Perfect simulation, volume 148. CRC Press.
Jacob P. E. and Thiery A. H., 2015. On non-negative unbiased estimators. The Annals of Statistics, 43(2):769–784.
Jacob P. E., Lindsten F., and Schön T. B., 2017a. Smoothing with couplings of conditional particle filters. arXiv preprint arXiv:1701.02002.
Jacob P. E., O’Leary J., and Atchadé Y. F., 2017b. Unbiased Markov chain Monte Carlo with couplings. arXiv preprint arXiv:1708.03625.
Leimkuhler B. and Matthews C., 2015. Molecular Dynamics. Springer-Verlag, New York.
Lelièvre T., Rousset M., and Stoltz G., 2010. Free Energy Computations: A Mathematical Perspective. Imperial College Press. ISBN 978-1-84816-248-8.
Livingstone S., Betancourt M., Byrne S., and Girolami M., 2016. On the geometric ergodicity of Hamiltonian Monte Carlo. arXiv preprint arXiv:1601.08057.
Livingstone S., Faulkner M. F., and Roberts G. O., 2017. Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. arXiv preprint arXiv:1706.02649.
Mangoubi O. and Smith A., 2017. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114.
Mykland P., Tierney L., and Yu B., 1995. Regeneration in Markov chain samplers. Journal of the American Statistical Association, 90(429):233–241.
Neal R. M., 1993. Bayesian learning via stochastic dynamics. Advances in neural information processing systems, pages 475–475.
Neal R. M. Circularly-coupled Markov chain sampling. Technical report, 9910 (revised), Department of Statistics, University of Toronto, 2002.
Neal R. M., 2011. MCMC using Hamiltonian dynamics. Handbook of Markov chain Monte Carlo, 2(11).
Pakman A., 2012. tmg: truncated multivariate Gaussian sampling. CRAN. URL
Pakman A. and Paninski L., 2014. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542.
Plummer M., Best N., Cowles K., and Vines K., 2006. CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11. URL
Pollard D., 2005. Chapter 3: Total variation distance between measures. Asymptopia. URL http://www.stat.
Rhee C.-H. Unbiased estimation with biased samplers. PhD thesis, Stanford University, 2013. URL http://purl.
Rosenthal J. S., 1997. Faithful couplings of Markov chains: now equals forever. Advances in Applied Mathematics, 18(3):372–381. ISSN 0196-8858.
Rosenthal J. S., 2000. Parallel computing and Monte Carlo algorithms. Far east journal of theoretical statistics, 4(2):207–236.
Thorisson H., 2000. Coupling, stationarity, and regeneration, volume 14. Springer New York.
Tweedie R., 1983. The existence of moments for stationary Markov chains. Journal of Applied Probability, 20(1):191–196.
Vihola M., 2015. Unbiased estimators and multilevel Monte Carlo. arXiv preprint arXiv:1512.01022.
Williams D., 1991. Probability with martingales. Cambridge university press.
... One successful strategy for MLMC methods is to construct instead an approximate coupling Π α such that π α i (α) /Π α is bounded for all i = 1, . . . , 2 D , then simulate from this and construct self-normalized importance sampling estimators of the type (28) for each of the individual summands of ∆ 1 Zα f α (ϕ α ) appearing in (33). This strategy was introduced for MLMCMC in [38] and has subsequently been applied to MIMC in the contexts of MCMC [39] and SMC [44,19]. ...
... In their limiting forms in (33) and (34), the expressions are equivalent, however from an approximation perspective they are fundamentally different. In the context of SMC, there are advantages to the latter. ...
... For this particular example, the increment rates associated to MLSMC are s = 0.8 and β = 1.6, while the mixed rates associated to MISMC are s i = 0.8 and β i = 1.6 for i = 1, 2. The rates for s and β can be observed from the Figure 13 and mixed rates for s i and β i for i = 1, 2 can be observed from the Figures 10, 11 and 12. This forward simulation method has a cost rate of γ = 2 + ω, for any ω > 0, while the traditional full factorization method used in [33] (and references therein) has γ = 6. However, one has γ i = 1 + ω < β i < γ. ...
We consider the problem of estimating expectations with respect to a target distribution with an unknown normalizing constant, and where even the unnormalized target needs to be approximated at finite resolution. This setting is ubiquitous across science and engineering applications, for example in the context of Bayesian inference where a physics-based model governed by an intractable partial differential equation (PDE) appears in the likelihood. A multi-index Sequential Monte Carlo (MISMC) method is used to construct ratio estimators which provably enjoy the complexity improvements of multi-index Monte Carlo (MIMC) as well as the efficiency of Sequential Monte Carlo (SMC) for inference. In particular, the proposed method provably achieves the canonical complexity of MSE$^{-1}$, while single level methods require MSE$^{-\xi}$ for $\xi>1$. This is illustrated on examples of Bayesian inverse problems with an elliptic PDE forward model in $1$ and $2$ spatial dimensions, where $\xi=5/4$ and $\xi=3/2$, respectively. It is also illustrated on a more challenging log Gaussian process models, where single level complexity is approximately $\xi=9/4$ and multilevel Monte Carlo (or MIMC with an inappropriate index set) gives $\xi = 5/4 + \omega$, for any $\omega > 0$, whereas our method is again canonical.
... Coupling of Markov chains has been relied upon to prove the convergence of Markov Chain Monte Carlo algorithms (MCMC), as well as for providing lower bounds for the effective sample sizes generated by MCMC methods [Rosenthal, 1997, Johnson, 1998, 1996, Jacob et al., 2017, Bou-Rabee et al., 2020. Theoretical results in recent times show that two Hamiltonian Monte Carlo chains can be coupled by giving them the same set of random numbers [Heng and Jacob, 2019, Bou-Rabee et al., 2020, Piponi et al., 2020. In other words, even though the chains might be initialised with different states, their dynamics will at some point become indistinguishable [Bou-Rabee et al., 2020, Piponi et al., 2020. ...
... In other words, even though the chains might be initialised with different states, their dynamics will at some point become indistinguishable [Bou-Rabee et al., 2020, Piponi et al., 2020. Markov chain coupling theory has also been used to provide unbiased Hamiltonian Monte Carlo estimators [Glynn and Rhee, 2014, Jacob et al., 2017, Heng and Jacob, 2019]. An approach of constructing a pair of HMC chains, where the momentum variable is shared, such that they create unbiased chains is presented by Heng and Jacob [2019]. ...
... Markov chain coupling theory has also been used to provide unbiased Hamiltonian Monte Carlo estimators [Glynn and Rhee, 2014, Jacob et al., 2017, Heng and Jacob, 2019]. An approach of constructing a pair of HMC chains, where the momentum variable is shared, such that they create unbiased chains is presented by Heng and Jacob [2019]. On the other hand, Bou-Rabee et al. [2020] propose a new approach called contractive sampling where the momentum variable is not shared between the coupled chains. ...
Full-text available
Markov Chain Monte Carlo inference of target posterior distributions in machine learning is predominately conducted via Hamiltonian Monte Carlo and its variants. This is due to Hamiltonian Monte Carlo based samplers ability to suppress random-walk behaviour. As with other Markov Chain Monte Carlo methods, Hamiltonian Monte Carlo produces auto-correlated samples which results in high variance in the estimators, and low effective sample size rates in the generated samples. Adding antithetic sampling to Hamiltonian Monte Carlo has been previously shown to produce higher effective sample rates compared to vanilla Hamiltonian Monte Carlo. In this paper, we present new algorithms which are antithetic versions of Riemannian Manifold Hamiltonian Monte Carlo and Quantum-Inspired Hamiltonian Monte Carlo. The Riemannian Manifold Hamiltonian Monte Carlo algorithm improves on Hamiltonian Monte Carlo by taking into account the local geometry of the target, which is beneficial for target densities that may exhibit strong correlations in the parameters. Quantum-Inspired Hamiltonian Monte Carlo is based on quantum particles that can have random mass. Quantum-Inspired Hamiltonian Monte Carlo uses a random mass matrix which results in better sampling than Hamiltonian Monte Carlo on spiky and multi-modal distributions such as jump diffusion processes. The analysis is performed on jump diffusion process using real world financial market data, as well as on real world benchmark classification tasks using Bayesian logistic regression.
... Therefore MCMC with couplings has attracted research attention recently thanks to its ability to debias Monte Carlo estimators (Jacob et al., 2020). In particular, Heng and Jacob (2019) focused on the Metropolis-Hastings (MH) adjusted HMC variant, which proposes the end-point of a simulated Hamiltonian trajectory as the new state, followed by an MH correction step. We refer to this HMC variant as coupled Metropolis HMC. ...
... We refer to this HMC variant as coupled Metropolis HMC. Heng and Jacob (2019) noticed that coupled Metropolis HMC is sensitive to the choice of HMC parameters such as integrator step sizes and Hamiltonian trajectory lengths. More specifically, parameters (e.g. ...
... trajectory lengths) optimal for sampling efficiency (e.g. effective sample size) can require a large number of HMC iterations to achieve meeting; on the other hand, optimal parameters for coupling can lead to poor mixing (Heng and Jacob, 2019). ...
Hamiltonian Monte Carlo (HMC) is a popular sampling method in Bayesian inference. Recently, Heng & Jacob (2019) studied Metropolis HMC with couplings for unbiased Monte Carlo estimation, establishing a generic parallelizable scheme for HMC. However, in practice a different HMC method, multinomial HMC, is considered as the go-to method, e.g. as part of the no-U-turn sampler. In multinomial HMC, proposed states are not limited to end-points as in Metropolis HMC; instead points along the entire trajectory can be proposed. In this paper, we establish couplings for multinomial HMC, based on optimal transport for multinomial sampling in its transition. We prove an upper bound for the meeting time - the time it takes for the coupled chains to meet - based on the notion of local contractivity. We evaluate our methods using three targets: 1,000 dimensional Gaussians, logistic regression and log-Gaussian Cox point processes. Compared to Heng & Jacob (2019), coupled multinomial HMC generally attains a smaller meeting time, and is more robust to choices of step sizes and trajectory lengths, which allows re-use of existing adaptation methods for HMC. These improvements together paves the way for a wider and more practical use of coupled HMC methods.
... This results in the high variance of HMC based estimators. One approach of tackling the high variance of MCMC estimators is by using results from MCMC coupling theory [12], [13]. ...
... More recently, MCMC couplings have been studied for HMC with good results [18], [19]. Markov chain coupling has also been used to provide unbiased HMC estimators [12], [13], [20]. Heng and Jacob [13] propose an approach that constructs a pair of HMC chains that are coupled in such a way that they meet after some random number of iterations. ...
... Markov chain coupling has also been used to provide unbiased HMC estimators [12], [13], [20]. Heng and Jacob [13] propose an approach that constructs a pair of HMC chains that are coupled in such a way that they meet after some random number of iterations. These chains can then be combined to create unbiased chains. ...
Full-text available
Hamiltonian Monte Carlo is a Markov Chain Monte Carlo method that has been widely applied to numerous posterior inference problems within the machine learning literature. Markov Chain Monte Carlo estimators have higher variance than classical Monte Carlo estimators due to autocorrelations present between the generated samples. In this work we present three new methods for tackling the high variance problem in Hamiltonian Monte Carlo based estimators: 1) We combine antithetic and importance sampling techniques where the importance sampler is based on sampling from a modified or shadow Hamiltonian using Separable Shadow Hamiltonian Hybrid Monte Carlo, 2) We present the antithetic Magnetic Hamiltonian Monte Carlo algorithm that is based on performing antithetic sampling on the Magnetic Hamiltonian Monte Carlo algorithm and 3) We propose the antithetic Magnetic Momentum Hamiltonian Monte Carlo algorithm based on performing antithetic sampling on the Magnetic Momentum Hamiltonian Monte Carlo method.We find that the antithetic Separable Shadow Hamiltonian Hybrid Monte Carlo and antithetic Magnetic Momentum Hamiltonian Monte Carlo algorithms produce effective sample sizes that are higher than antithetic Hamiltonian Monte Carlo on all the benchmark datasets.We further find that antithetic Separable Shadow Hamiltonian Hybrid Monte Carlo and antithetic Magnetic Hamiltonian Monte Carlo produce higher effective sample sizes normalised by execution time in higher dimensions than antithetic Hamiltonian Monte Carlo. In addition, the antithetic versions of all the algorithms have higher effective sample sizes than their non-antithetic counterparts, indicating the usefulness of adding antithetic sampling to Markov Chain Monte Carlo algorithms. The methods are assessed on benchmark datasets using Bayesian logistic regression and Bayesian neural network models.
... In particular, the seminal paper of Jacob et al. (2020b) showed how to compute unbiased estimates of expectations using coupled Markov chains, allowing to then compute these to an arbitrary precision using distributed hardware. Following this work, several extensions have been developed, both in the classical Markov chain Monte Carlo (MCMC) methods (Heng and Jacob, 2019;Xu et al., 2021;Wang et al., 2021), as well as in pseudo-marginal and particle Markov chain Monte Carlo (PMCMC) methods (Jacob et al., 2020a;Middleton et al., 2019). ...
Full-text available
We propose a novel coupled rejection-sampling method for sampling from couplings of arbitrary distributions. The method relies on accepting or rejecting coupled samples coming from dominating marginals. Contrary to existing acceptance-rejection methods, the variance of the execution time of the proposed method is limited and stays finite as the two target marginals approach each other in the sense of the total variation norm. In the important special case of coupling multivariate Gaussians with different means and covariances, we derive positive lower bounds for the resulting coupling probability of our algorithm, and we then show how the coupling method can be optimised using convex optimisation. Finally, we show how we can modify the coupled-rejection method to propose from coupled ensemble of proposals, so as to asymptotically recover a maximal coupling. We then apply the method to derive a novel parallel coupled particle filter resampling algorithm, and show how it can be used to speed up unbiased MCMC methods based on couplings.
... Our focus was on learning a mass matrix so that samples from the Markov chain can be used for estimators that are consistent for increasing iterations. However, unbiased estimators might also be constructed using coupled HMC chains [30] and one might ask if the adapted mass matrix leads to shorter meeting times in such a setting. ...
Hamiltonian Monte Carlo (HMC) is a popular Markov Chain Monte Carlo (MCMC) algorithm to sample from an unnormalized probability distribution. A leapfrog integrator is commonly used to implement HMC in practice, but its performance can be sensitive to the choice of mass matrix used therein. We develop a gradient-based algorithm that allows for the adaptation of the mass matrix by encouraging the leapfrog integrator to have high acceptance rates while also exploring all dimensions jointly. In contrast to previous work that adapts the hyperparameters of HMC using some form of expected squared jumping distance, the adaptation strategy suggested here aims to increase sampling efficiency by maximizing an approximation of the proposal entropy. We illustrate that using multiple gradients in the HMC proposal can be beneficial compared to a single gradient-step in Metropolis-adjusted Langevin proposals. Empirical evidence suggests that the adaptation method can outperform different versions of HMC schemes by adjusting the mass matrix to the geometry of the target distribution and by providing some control on the integration time.
... To address the issue of bias, the design of unbiased Monte Carlo estimators has recently attracted much attention [20,12,22,4,24,5,3,14,16,21] in the operations research, statistics, and machine learning communities. Many existing debiasing techniques are closely related to the randomized Multilevel Monte Carlo estimator described in [4,5]. ...
We propose a new unbiased estimator for estimating the utility of the optimal stopping problem. The MUSE, short for 'Multilevel Unbiased Stopping Estimator', constructs the unbiased Multilevel Monte Carlo (MLMC) estimator at every stage of the optimal stopping problem in a backward recursive way. In contrast to traditional sequential methods, the MUSE can be implemented in parallel when multiple processors are available. We prove the MUSE has finite variance, finite computational complexity, and achieves $\varepsilon$-accuracy with $O(1/\varepsilon^2)$ computational cost under mild conditions. We demonstrate MUSE empirically in several numerical examples, including an option pricing problem with high-dimensional inputs, which illustrates the use of the MUSE on computer clusters.
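The debiasing principle that randomized multilevel estimators rely on can be shown on a deterministic series. The sketch below is not the MUSE; it is a minimal randomized-truncation estimator, assuming a geometric truncation level with stopping probability `p` and the toy target $\sum_{k \ge 0} 1/k! = e$. The function names and the choice `p = 0.5` are illustrative assumptions.

```python
import math
import random


def unbiased_series_estimate(term, rng, p=0.5):
    """Unbiased estimate of sum_{k>=0} term(k) by random truncation.

    The truncation level N satisfies P(N >= k) = (1 - p)**k. Each retained
    term is divided by P(N >= k), so the expectation is the full sum.
    """
    total, k, survival = 0.0, 0, 1.0
    while True:
        total += term(k) / survival
        if rng.random() < p:  # stop after level k with probability p
            return total
        k += 1
        survival *= 1.0 - p


def average_of_replicates(n=20000, seed=0):
    """Average n independent replicates targeting e = sum_k 1/k!."""
    rng = random.Random(seed)
    term = lambda k: 1.0 / math.factorial(k)
    return sum(unbiased_series_estimate(term, rng) for _ in range(n)) / n


if __name__ == "__main__":
    print("estimate:", average_of_replicates(), "target:", math.e)
```

Each replicate touches only finitely many terms yet has the infinite sum as its expectation, so replicates can be computed independently in parallel and averaged. The variance is finite here because the reweighted terms $2^k / k!$ are summable.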
... We conclude this introduction by remarking that mixing time bounds based on coupling methods might be relevant to recently developed unbiased estimators based on couplings [21]. Both in theory and in practice, the usefulness of these unbiased estimators requires a successful coupling that realizes these bounds. ...
We provide quantitative upper bounds on the total variation mixing time of the Markov chain corresponding to the unadjusted Hamiltonian Monte Carlo (uHMC) algorithm. For two general classes of models and fixed time discretization step size $h$, the mixing time is shown to depend only logarithmically on the dimension. Moreover, we provide quantitative upper bounds on the total variation distance between the invariant measure of the uHMC chain and the true target measure. As a consequence, we show that an $\varepsilon$-accurate approximation of the target distribution $\mu$ in total variation distance can be achieved by uHMC for a broad class of models with $O\left(d^{3/4}\varepsilon^{-1/2}\log (d/\varepsilon )\right)$ gradient evaluations, and for mean field models with weak interactions with $O\left(d^{1/2}\varepsilon^{-1/2}\log (d/\varepsilon )\right)$ gradient evaluations. The proofs are based on the construction of successful couplings for uHMC that realize the upper bounds.
Amongst Markov chain Monte Carlo algorithms, Hamiltonian Monte Carlo (HMC) is often the algorithm of choice for complex, high-dimensional target distributions; however, its efficiency is notoriously sensitive to the choice of the integration-time tuning parameter, $T$. When integrating both forward and backward in time using the same leapfrog integration step as HMC, the set of local maxima in the potential along a path, or apogees, is the same whatever point (position and momentum) along the path is chosen to initialise the integration. We present the Apogee to Apogee Path Sampler (AAPS), which utilises this invariance to create a simple yet generic methodology for constructing a path, proposing a point from it and accepting or rejecting that proposal so as to target the intended distribution. We demonstrate empirically that AAPS has a similar efficiency to HMC but is much more robust to the setting of its equivalent tuning parameter, a non-negative integer, $K$, the number of apogees that the path crosses.
Phylogenetic inference is an intractable statistical problem on a complex sample space. Markov chain Monte Carlo methods are the primary tool for Bayesian phylogenetic inference, but it is challenging to construct efficient schemes to explore the associated posterior distribution and to then assess their convergence. Building on recent work developing couplings of Monte Carlo algorithms, we describe a procedure to couple Markov Chains targeting a posterior distribution over a space of phylogenetic trees with ages, scalar parameters and latent variables. We demonstrate how to use these couplings to check convergence and mixing time of the chains.
Markov chain Monte Carlo (MCMC) methods provide consistent approximations of integrals as the number of iterations goes to infinity. However, MCMC estimators are generally biased after any fixed number of iterations, which complicates both parallel computation and the construction of confidence intervals. We propose to remove this bias by using couplings of Markov chains and a telescopic sum argument, inspired by Glynn & Rhee (2014). The resulting unbiased estimators can be computed independently in parallel, and confidence intervals can be directly constructed from the Central Limit Theorem for i.i.d. variables. We provide practical couplings for important algorithms such as the Metropolis-Hastings and Gibbs samplers. We establish the theoretical validity of the proposed estimators, and study their variances and computational costs. In numerical experiments, including inference in hierarchical models, bimodal or high-dimensional target distributions, logistic regressions with the Pólya-Gamma Gibbs sampler and the Bayesian Lasso, we demonstrate the wide applicability of the proposed methodology as well as its limitations. Finally, we illustrate how the proposed estimators can approximate the "cut" distribution that arises in Bayesian inference for misspecified models.
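A basic building block for making two chains meet exactly is a maximal coupling of two proposal distributions. The sketch below uses a standard rejection-based construction for two univariate Normals with a common standard deviation; it is an illustration under those assumptions, not necessarily the coupling used in the paper, and the function names are hypothetical. The pair (X, Y) has the correct marginals, and X equals Y with probability one minus the total variation distance between the two distributions.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def maximal_coupling(rng, mean_p, mean_q, sd=1.0):
    """Draw (X, Y) with X ~ N(mean_p, sd^2) and Y ~ N(mean_q, sd^2),
    such that the event X == Y has the largest possible probability."""
    x = rng.gauss(mean_p, sd)
    # Accept X as a draw from q as well when a uniform point under p's
    # density at X also lies under q's density.
    if rng.uniform(0.0, normal_pdf(x, mean_p, sd)) <= normal_pdf(x, mean_q, sd):
        return x, x
    # Otherwise draw Y from the part of q that is not covered by p.
    while True:
        y = rng.gauss(mean_q, sd)
        if rng.uniform(0.0, normal_pdf(y, mean_q, sd)) > normal_pdf(y, mean_p, sd):
            return x, y


def meeting_frequency(n=20000, seed=0):
    """Fraction of coupled draws from N(0, 1) and N(1, 1) with X == Y."""
    rng = random.Random(seed)
    draws = (maximal_coupling(rng, 0.0, 1.0) for _ in range(n))
    return sum(x == y for x, y in draws) / n


if __name__ == "__main__":
    print("meeting frequency:", meeting_frequency())
```

For N(0, 1) and N(1, 1) the meeting probability is $2(1 - \Phi(1/2)) \approx 0.617$, so the printed frequency should be close to that value. When the two means coincide, the first branch always accepts and the draws are identical.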
We consider how different choices of kinetic energy in Hamiltonian Monte Carlo affect algorithm performance. To this end, we introduce two quantities which can be easily evaluated, the composite gradient and the implicit noise. Results are established on integrator stability and geometric convergence, and we show that choices of kinetic energy that result in heavy-tailed momentum distributions can exhibit an undesirable negligible moves property, which we define. A general efficiency-robustness trade off is outlined, and implementations which rely on approximate gradients are also discussed. Two numerical studies illustrate our theoretical findings, showing that the standard choice which results in a Gaussian momentum distribution is not always optimal in terms of either robustness or efficiency.
Stan is a probabilistic programming language for specifying statistical models. A Stan program imperatively defines a log probability function over parameters conditioned on specified data and constants. As of version 2.14.0, Stan provides full Bayesian inference for continuous-variable models through Markov chain Monte Carlo methods such as the No-U-Turn sampler, an adaptive form of Hamiltonian Monte Carlo sampling. Penalized maximum likelihood estimates are calculated using optimization methods such as the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm. Stan is also a platform for computing log densities and their gradients and Hessians, which can be used in alternative algorithms such as variational Bayes, expectation propagation, and marginal inference using approximate integration. To this end, Stan is set up so that the densities, gradients, and Hessians, along with intermediate quantities of the algorithm such as acceptance probabilities, are easily accessible. Stan can be called from the command line using the cmdstan package, through R using the rstan package, and through Python using the pystan package. All three interfaces support sampling and optimization-based inference with diagnostics and posterior analysis. rstan and pystan also provide access to log probabilities, gradients, Hessians, parameter transforms, and specialized plotting.
In state space models, smoothing refers to the task of estimating a latent stochastic process given noisy measurements related to the process. We propose the first unbiased estimator of smoothing expectations. The lack-of-bias property has methodological benefits, as it allows for a complete parallelization of the algorithm and for computing accurate confidence intervals. The method combines two recent breakthroughs: the first is a generic debiasing technique for Markov chains due to Rhee and Glynn, and the second is the introduction of a uniformly ergodic Markov chain for smoothing, the conditional particle filter of Andrieu, Doucet and Holenstein. We show how a combination of the two methods delivers practical estimators, upon the introduction of couplings between conditional particle filters. The algorithm is widely applicable, has minimal tuning parameters and is amenable to modern computing hardware. We establish the validity of the proposed estimator under mild assumptions. Numerical experiments illustrate its performance in a toy model and in a Lotka-Volterra model with an intractable transition density.
We investigate the properties of the hybrid Monte Carlo algorithm (HMC) in high dimensions. HMC develops a Markov chain reversible with respect to a given target distribution $\Pi$ using separable Hamiltonian dynamics with potential $-\log \Pi$. The additional momentum variables are chosen at random from the Boltzmann distribution, and the continuous-time Hamiltonian dynamics are then discretised using the leapfrog scheme. The induced bias is removed via a Metropolis–Hastings accept/reject rule. In the simplified scenario of independent, identically distributed components, we prove that, to obtain an $O(1)$ acceptance probability as the dimension $d$ of the state space tends to $\infty$, the leapfrog step size $h$ should be scaled as $h = l \times d^{-1/4}$. Therefore, in high dimensions, HMC requires $O(d^{1/4})$ steps to traverse the state space. We also identify analytically the asymptotically optimal acceptance probability, which turns out to be 0.651 (to three decimal places). This value optimally balances the cost of generating a proposal, which decreases as $l$ increases (because fewer steps are required to reach the desired final integration time), against the cost related to the average number of proposals required to obtain acceptance, which increases as $l$ increases.
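The ingredients named in this abstract (Gaussian momentum refreshment, leapfrog discretisation, Metropolis–Hastings correction, and a step size of order $d^{-1/4}$) can be put together in a few lines. The sketch below is a minimal illustration, assuming a standard Normal target with potential $U(x) = \|x\|^2 / 2$ and unit mass; the function names, the constant `l = 1.0` and the integration time `total_time = 1.0` are hypothetical choices, not values from the paper.

```python
import math
import random


def potential(x):
    """U(x) = |x|^2 / 2, i.e. minus the log-density of a standard Normal."""
    return 0.5 * sum(xi * xi for xi in x)


def grad_u(x):
    """Gradient of the potential above."""
    return list(x)


def hamiltonian(x, v):
    """Total energy: potential plus kinetic energy with unit mass."""
    return potential(x) + 0.5 * sum(vi * vi for vi in v)


def leapfrog(x, v, grad, h, n_steps):
    """Integrate Hamiltonian dynamics with n_steps leapfrog steps of size h."""
    x, v = list(x), list(v)
    g = grad(x)
    for _ in range(n_steps):
        v = [vi - 0.5 * h * gi for vi, gi in zip(v, g)]  # half step, momentum
        x = [xi + h * vi for xi, vi in zip(x, v)]        # full step, position
        g = grad(x)
        v = [vi - 0.5 * h * gi for vi, gi in zip(v, g)]  # half step, momentum
    return x, v


def hmc_step(x, rng, l=1.0, total_time=1.0):
    """One HMC transition with step size h = l * d**(-1/4)."""
    d = len(x)
    h = l * d ** -0.25
    n_steps = max(1, round(total_time / h))
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    x_new, v_new = leapfrog(x, v, grad_u, h, n_steps)
    log_ratio = hamiltonian(x, v) - hamiltonian(x_new, v_new)
    # 1 - random() lies in (0, 1], which avoids taking log(0).
    if math.log(1.0 - rng.random()) < log_ratio:
        return x_new, True
    return list(x), False


if __name__ == "__main__":
    rng = random.Random(0)
    state, accepted = [0.0] * 16, 0
    for _ in range(500):
        state, ok = hmc_step(state, rng)
        accepted += ok
    print("acceptance rate:", accepted / 500)
```

With the integration time held fixed, the number of leapfrog steps per transition grows like $d^{1/4}$, which is the cost scaling stated in the abstract.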
We obtain several quantitative bounds on the mixing properties of the Hamiltonian Monte Carlo (HMC) algorithm for a strongly log-concave target distribution $\pi$ on $\mathbb{R}^{d}$, showing that HMC mixes quickly in this setting. One of our main results is a dimension-free bound on the mixing of an "ideal" HMC chain, which is used to show that the usual leapfrog implementation of HMC can sample from $\pi$ using only $\mathcal{O}(d^{\frac{1}{4}})$ gradient evaluations. This dependence on dimension is sharp, and our results significantly extend and improve previous quantitative bounds on the mixing of HMC.
This paper discusses the stability properties of the Hamiltonian Monte Carlo (HMC) algorithm used to sample from a positive target density $\pi$ on $\mathbb{R}^d$, with either a fixed or a random number of integration steps. Under mild conditions on the potential $U$ associated with $\pi$, we show that the Markov kernel associated to the HMC algorithm is irreducible and recurrent. Under some additional conditions, the Markov kernel may be shown to be Harris recurrent. Besides, verifiable conditions on $U$ are derived which imply geometric convergence.
We give conditions under which the stationary distribution $\pi$ of a Markov chain admits moments of the general form $\int f(x)\,\pi(dx)$, where $f$ is a general function; specific examples include $f(x) = x^r$ and $f(x) = e^{sx}$. In general the time-dependent moments of the chain then converge to the stationary moments. We show that in special cases this convergence of moments occurs at a geometric rate. The results are applied to random walk on $[0, \infty)$.