Unbiased Hamiltonian Monte Carlo with couplings
Jeremy Heng∗and Pierre E. Jacob∗
September 4, 2017
Abstract
We propose a coupling approach to parallelize Hamiltonian Monte Carlo estimators, following Jacob, O’Leary
& Atchadé (2017). A simple coupling, obtained by using common initial velocities and common uniform variables
for the acceptance steps, leads to pairs of Markov chains that contract, in the sense that the distance between
them can become arbitrarily small. We show how this strategy can be combined with coupled random walk
Metropolis–Hastings steps to enable exact meetings of the two chains, and in turn, unbiased estimators that can
be computed in parallel and averaged. The resulting estimator is valid in the limit of the number of independent
replicates, instead of the usual limit of the number of Markov chain iterations. We investigate the effect of tuning
parameters, such as the number of leap-frog steps and the step size, on the estimator’s efficiency. The proposed
methodology is demonstrated on a 250-dimensional Normal distribution, on a bivariate Normal truncated by
linear and quadratic inequalities, and on a logistic regression with 300 covariates.
1 Introduction
1.1 Goal: parallel computation with Hamiltonian Monte Carlo
Hamiltonian Monte Carlo, also called Hybrid Monte Carlo (HMC), is a Markov chain Monte Carlo (MCMC) method
to approximate integrals with respect to a target probability distribution π on R^d. Originally proposed by Duane
et al. [1987] in the physics literature, it was later introduced in statistics by Neal [1993] and is now part of the
standard toolbox [Brooks et al., 2011, Lelièvre et al., 2010], in part due to favorable scaling properties with respect
to the dimension d [Beskos et al., 2010, 2013], compared to e.g. random walk Metropolis–Hastings. Hamiltonian
Monte Carlo is at the core of the No-U-Turn sampler (NUTS, Hoffman and Gelman [2014]) used in the software
Stan [Carpenter et al.,2016]. As with any other MCMC method, HMC estimators are justified in the limit of the
number of iterations. Algorithms which rely on such asymptotics face the risk of becoming obsolete if computational
power keeps increasing through the number of available processors and not through clock speed. To address this
issue, we propose to run pairs of HMC chains, for a random but finite number of iterations, and combine them in
such a way that the resulting estimators are unbiased. One can then produce independent copies in parallel and
average them to obtain estimators that are valid in the limit of the number of copies.
If the chains could be initialized from the target distribution, MCMC estimators would be unbiased, and one
could simply average independent chains [Rosenthal,2000]. Perfect samplers can be used for this purpose [Casella
et al.,2001,Huber,2016,Glynn,2016]; more widely applicable approaches to unbiased estimation from MCMC
samplers are proposed in e.g. Mykland et al. [1995], Neal [2002]. More recently, Jacob et al. [2017b] present an
approach based on coupled Markov chains. The method builds upon Glynn and Rhee [2014], Jacob et al. [2017a] and
other “debiasing” techniques [Jacob and Thiery,2015,Vihola,2015,Glynn,2016], and leverages maximal couplings
[Thorisson,2000] of proposal and conditional distributions to remove the “burn-in bias” of Metropolis–Hastings
and Gibbs chains respectively. The use of maximal couplings allows two chains initialized at different positions to
coincide exactly after a random number of steps, referred to as the meeting time. Importantly, these constructions
are applicable to continuous state spaces.
The present article proposes a combination of couplings to enable parallel computation for the Hamiltonian
Monte Carlo sampler. We start by briefly recalling the unbiased estimators of Jacob et al. [2017b] in Section
1.2 and introducing some preliminary notation in Section 1.3. The R code producing the figures of this article is
available on the GitHub account of the second author1.
∗Department of Statistics, Harvard University, USA. Emails: jjmheng@fas.harvard.edu & pjacob@fas.harvard.edu.
1Link: github.com/pierrejacob/debiasedhmc.
1.2 Context: unbiased estimation with couplings
Consider the task of approximating the integral π(h) = ∫ h(x) π(dx) < ∞, for a test function h of interest. Let
X = (X_n)_{n≥0} denote a π-invariant MCMC chain associated with an initial distribution π_0 and transition kernel P,
i.e. X_0 ∼ π_0 and X_n ∼ P(X_{n−1}, ·) for all n ≥ 1. Introduce another Markov chain Y = (Y_n)_{n≥0} which has the same
law as X = (X_n)_{n≥0}, so that X_n and Y_n have the same marginal distribution for all n ≥ 0. We will write P to
denote the law of the coupled chain (X_n, Y_n)_{n≥0} and E to denote expectation with respect to P. We now assume
the following.

[A1] As n → ∞, E[h(X_n)] → π(h), and there exist ι > 0 and D < ∞ such that for all n ≥ 0, E[h(X_n)^{2+ι}] < D.

[A2] The meeting time τ = inf{n ≥ 1 : X_n = Y_{n−1}} is finite almost surely, and satisfies a geometric tail condition
of the form P(τ > n) ≤ C γ^n for all n ≥ 0 and some constants C < ∞ and γ ∈ (0, 1).

[A3] The coupled chains are faithful [Rosenthal, 1997]: X_n = Y_{n−1} for all n ≥ τ.
Under these assumptions, the random variable defined as
Hk(X, Y ) = h(Xk) +
max(k,τ −1)
X
n=k{h(Xn+1)−h(Yn)},(1)
is an unbiased estimator of π(h), for any choice of initial distribution π0and any k≥0. The first term above,
h(Xk), is in general biased since the chain (Xn)n≥0might not have reached stationarity by step k. As the second
term is precisely such that E[Hk(X, Y )] = π(h), it is referred to as a correction term. If k≥τ, the correction term
is zero. The estimator can be computed in max(τ , k)steps, which has a finite expectation under A2.
We introduce another unbiased estimator, denoted by H̄_{k:m}, defined for some integer m > k, resulting from
averaging H_ℓ(X, Y) over ℓ ∈ {k, . . . , m}. By rearranging terms, we define

    H̄_{k:m}(X, Y) = (m − k + 1)^{−1} ∑_{n=k}^{m} h(X_n) + (m − k + 1)^{−1} ∑_{n=k}^{max(m, τ−1)} min(n − k + 1, m − k + 1) {h(X_{n+1}) − h(Y_n)},    (2)

which is unbiased and computable in max(τ, m) steps. The first average above can be recognized as the usual
MCMC estimator, obtained after m iterations and discarding the first k − 1 states. As before, the second term
can be seen as a correction that removes the bias of the standard MCMC estimator. On the event {k ≥ τ}, the correction term is
equal to zero. We refer to Jacob et al. [2017b] for a more detailed discussion of (1)-(2), and guidelines for the
choice of k and m. Importantly, unbiased estimators can be produced independently in parallel and averaged, with
direct computational gains on parallel computing architectures. Explicit constructions of pairs of Markov chains
satisfying A1-A3 based on Metropolis–Hastings and Gibbs samplers are given in Jacob et al. [2017b]. Here we
propose coupling strategies for HMC chains, so as to enable the unbiased estimators of (1)-(2). The main challenge
lies in A2, for which two coupled chains have to meet exactly after a "Geometric" number of steps.
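To fix ideas, the following R sketch computes (1) and (2) from stored coupled trajectories; the function names, and the convention that row n + 1 of the matrices x and y holds X_n and Y_n respectively, are ours and purely illustrative.

```r
# Sketch: unbiased estimators (1) and (2), assuming row n+1 of x holds X_n and
# row n+1 of y holds Y_n, with X_n = Y_{n-1} for all n >= tau (faithfulness, A3).
H_k <- function(h, x, y, k, tau) {
  estimate <- h(x[k + 1, ])                     # h(X_k)
  if (tau - 2 >= k) {
    for (n in k:(tau - 2)) {                    # correction term of (1); terms with n >= tau - 1 vanish
      estimate <- estimate + h(x[n + 2, ]) - h(y[n + 1, ])
    }
  }
  estimate
}

H_bar_km <- function(h, x, y, k, m, tau) {
  mcmc_avg <- mean(sapply(k:m, function(n) h(x[n + 1, ])))   # first average in (2)
  correction <- 0
  if (tau - 2 >= k) {
    for (n in k:(tau - 2)) {                    # weighted correction term in (2)
      correction <- correction + min(n - k + 1, m - k + 1) * (h(x[n + 2, ]) - h(y[n + 1, ]))
    }
  }
  mcmc_avg + correction / (m - k + 1)
}
```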
1.3 Notation and plan
The set of integers {a, . . . , b} for a ≤ b is written as [a : b]. The set of non-negative real numbers is denoted by R₊.
The vectors 0_d and 1_d refer to d-dimensional vectors of zeros and ones respectively. The matrix I_d is the identity
matrix of size d × d. The norm of a vector x ∈ R^d is written as |x| = (∑_{i=1}^d x_i^2)^{1/2}. The transposes of a vector
x ∈ R^d and of a matrix A ∈ R^{d×p} are denoted by x^T and A^T respectively. The gradient of a function (x, y) ↦ f(x, y)
with respect to x (resp. y) is denoted by ∇_x f (resp. ∇_y f). The Hessian of a real-valued function f is denoted
by ∇²f. The Borel σ-algebra of R^d is denoted by B(R^d) and the Lebesgue measure on R^d by Leb_d. The Normal
distribution with mean µ and covariance matrix Σ is denoted by N(µ, Σ) and its density by x ↦ N(x; µ, Σ). The
Uniform distribution on the interval [0, 1] is U[0, 1]. The total variation distance d_TV between two distributions,
with densities p and q, is defined as d_TV(p, q) = (1/2) ∫ |p(x) − q(x)| dx.

The rest of the article is structured as follows. Section 2 describes Hamiltonian dynamics for coupled trajectories.
Section 3 introduces a simple coupling of Hamiltonian Monte Carlo chains, which satisfies a relaxed meeting time
assumption similar to A2. Section 4 then combines HMC kernels with random walk Metropolis–Hastings kernels, to
ensure that chains meet exactly and satisfy A2. Section 5 contains simulation results on a 250-dimensional Normal
target, a truncated Normal distribution and a logistic regression with 300 covariates, and Section 6 concludes.
2 Hamiltonian dynamics for pairs of particles
2.1 Hamiltonian flows and extended target
We now suppose that the target distribution has the form

    π(dq) ∝ exp(−U(q)) dq,

where the potential U : R^d → R₊ is twice continuously differentiable and its gradient ∇U is globally β-Lipschitz,
i.e. there exists β > 0 such that

    |∇U(q) − ∇U(q′)| ≤ β |q − q′|,

for all q, q′ ∈ R^d. We now introduce Hamiltonian flows on a phase space R^{2d}, which consists of position variables
q ∈ R^d and velocity variables p ∈ R^d. We will be concerned with a Hamiltonian function E : R^d × R^d → R₊ of the
form

    E(q, p) = U(q) + |p|²/2.

We note the use of an identity mass matrix here and defer to preconditioning as a means to incorporate any
knowledge of the curvature properties of π. The time evolution of a particle (q(t), p(t))_{t∈R₊} under Hamiltonian
dynamics is described by the autonomous system of ordinary differential equations

    (d/dt) q(t) = ∇_p E(q(t), p(t)) = p(t),
    (d/dt) p(t) = −∇_q E(q(t), p(t)) = −∇U(q(t)).    (3)

Under the above assumptions on U, (3) with an initial condition (q(0), p(0)) = (q_0, p_0) ∈ R^d × R^d admits a unique
solution globally on R₊ [Lelièvre et al., 2010, p. 14]. We will write the flow map as Φ_t(q_0, p_0) = (q(t), p(t)) for
any t ∈ R₊, and Φ∘_t(q_0, p_0) = q(t) and Φ∗_t(q_0, p_0) = p(t) as its projections onto the position and velocity coordinates
respectively. It is worth recalling that Hamiltonian flows have the following properties.

[P1] (Reversibility) Φ_t^{−1} = M ∘ Φ_t ∘ M, where M(q, p) := (q, −p) denotes velocity reversal;
[P2] (Energy conservation) E ∘ Φ_t = E for any t ∈ R₊;
[P3] (Volume preservation) Leb_{2d}(Φ_t(A)) = Leb_{2d}(A) for any A ∈ B(R^d × R^d).

It follows from P1 and P2 that the extended target distribution on phase space,

    π̃(dq, dp) ∝ exp(−E(q, p)) dq dp,

is invariant under the Markov semi-group induced by the flow, i.e. the pushforward measure Φ_t♯π̃ defined by
Φ_t♯π̃(A) = π̃(Φ_t^{−1}(A)) for A ∈ B(R^{2d}) is equal to π̃ for any t ∈ R₊.
2.2 Coupled Hamiltonian dynamics
Following Section 1.2, we now consider the coupling of two particles (q^i(t), p^i(t))_{t∈R₊}, i = 1, 2, evolving under (3)
with initial conditions (q^i(0), p^i(0)) = (q^i_0, p^i_0), i = 1, 2. We first draw some insights from a Gaussian example.
Example 1. Let π be a Gaussian distribution on R with mean µ ∈ R and variance σ² ∈ R₊, in which case
U(q) = (q − µ)²/(2σ²) and ∇U(q) = (q − µ)/σ². Then the solution of (3) is given by

    Φ_t(q_0, p_0) = ( µ + (q_0 − µ) cos(t/σ) + σ p_0 sin(t/σ),  p_0 cos(t/σ) − (1/σ)(q_0 − µ) sin(t/σ) ),

see e.g. Neal [2011]. Hence the difference between the positions is given by

    q^1(t) − q^2(t) = (q^1_0 − q^2_0) cos(t/σ) + σ (p^1_0 − p^2_0) sin(t/σ).

Observe that if we set p^1_0 = p^2_0, then

    |q^1(t) − q^2(t)| = |q^1_0 − q^2_0| |cos(t/σ)|,

so the particles meet exactly whenever t = (2a + 1)πσ/2, and contraction occurs for any t ≠ aπσ, for any non-negative integer a.
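As a quick numerical check of Example 1 (our own illustration, not part of the paper's code), the following R snippet evaluates the exact flow for two particles sharing the initial velocity and compares the contraction of their distance to the predicted factor |cos(t/σ)|.

```r
# Exact position flow of Example 1 for a univariate Gaussian target N(mu, sigma^2)
flow_position <- function(t, q0, p0, mu = 0, sigma = 1) {
  mu + (q0 - mu) * cos(t / sigma) + sigma * p0 * sin(t / sigma)
}
q1_0 <- 2; q2_0 <- -3; p0 <- 0.7                  # common initial velocity
t <- 1.2
d_t <- abs(flow_position(t, q1_0, p0) - flow_position(t, q2_0, p0))
d_0 <- abs(q1_0 - q2_0)
c(d_t / d_0, abs(cos(t)))                         # the two values should agree
```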
This example motivates a coupling that simply assigns particles the same initial velocity. Moreover, it also
reveals that certain trajectory lengths will result in stronger contractions than others. We now examine the utility
of this approach more generally. Define ∆(t) = q^1(t) − q^2(t) as the difference between the particle locations and
note that

    (1/2) (d/dt) |∆(t)|² = ∆(t)^T (p^1(t) − p^2(t)).

Therefore by imposing that p^1(0) = p^2(0), the function t ↦ |∆(t)| admits a stationary point at time t = 0. This
is geometrically intuitive as the trajectories at time zero are parallel to one another for an infinitesimally small
amount of time. To characterize this stationary point, we compute

    (1/2) (d²/dt²) |∆(t)|² = −∆(t)^T (∇U(q^1(t)) − ∇U(q^2(t))) + |p^1(t) − p^2(t)|².

If we assume that the potential U is α-strongly convex in an open set S ∈ B(R^d), i.e. there exists α > 0 such that

    (∇U(q) − ∇U(q′))^T (q − q′) ≥ α |q − q′|²,

for all q, q′ ∈ S, then

    (1/2) (d²/dt²) |∆(t)|² ≤ −α |q^1(t) − q^2(t)|² + |p^1(t) − p^2(t)|².    (4)

Therefore by the second derivative test, t = 0 is a strict local maximum point if q^1_0, q^2_0 ∈ S. Using continuity of
t ↦ |∆(t)|², it follows that there exist t̃ > 0 and ρ < 1 such that

    |Φ∘_t(q^1_0, p_0) − Φ∘_t(q^2_0, p_0)| < ρ |q^1_0 − q^2_0|,

for t ∈ (0, t̃). We note the dependence of t̃ and ρ on the initial positions (q^1_0, q^2_0) and velocity p_0. We now strengthen
the above claim.
Lemma 1. Suppose that the potential U is twice continuously differentiable, α-strongly convex on S ∈ B(R^d) and
its gradient ∇U is globally β-Lipschitz. For any compact set C ⊂ S × S × R^d, there exist t̃ > 0 and ρ < 1 such that

    |Φ∘_t(q^1_0, p_0) − Φ∘_t(q^2_0, p_0)| ≤ ρ |q^1_0 − q^2_0|,    (5)

for all (q^1_0, q^2_0, p_0) ∈ C and t ∈ (0, t̃).
Proof. Take (q^1_0, q^2_0, p_0) ∈ C. Applying Taylor's theorem to ∆(t) around t = 0 gives

    ∆(t) = ∆(0) − (1/2) t² G_0 − (1/6) t³ G_*,

for some t_* ∈ (0, t), where G_0 := ∇U(q^1_0) − ∇U(q^2_0) and G_* := ∇²U(q^1(t_*)) p^1(t_*) − ∇²U(q^2(t_*)) p^2(t_*). We will
control each term of the expansion

    |∆(t)|² = |∆(0)|² − t² ∆(0)^T G_0 − (1/3) t³ ∆(0)^T G_* + (1/4) t⁴ |G_0|² + (1/6) t⁵ G_0^T G_* + (1/36) t⁶ |G_*|².

Using strong convexity, the Lipschitz assumption and Young's inequality,

    |∆(t)|² ≤ (1 − α t² + (1/6) t³ + (1/4) β² t⁴ + (1/12) β² t⁵) |∆(0)|² + ((1/6) t³ + (1/12) t⁵ + (1/36) t⁶) |G_*|².

Note that by Young's inequality and the Lipschitz assumption

    |G_*|² ≤ 2 ‖∇²U(q^1(t_*))‖_2² |p^1(t_*)|² + 2 ‖∇²U(q^2(t_*))‖_2² |p^2(t_*)|²
           ≤ 2 β² (|Φ∗_{t_*}(q^1_0, p_0)|² + |Φ∗_{t_*}(q^2_0, p_0)|²)
           ≤ 2 β² sup_{(q^1_0, q^2_0, p_0) ∈ C} (|Φ∗_{t_*}(q^1_0, p_0)|² + |Φ∗_{t_*}(q^2_0, p_0)|²),

where ‖·‖_2 denotes the spectral norm. The above supremum is attained by continuity of the mapping (q, p) ↦
Φ∗_{t_*}(q, p). The claim (5) follows by combining both inequalities and taking t sufficiently small.
3 Hamiltonian Monte Carlo
3.1 Leap frog integrator
As the flow defined by (3) is typically intractable, one has to resort to time discretization. The leap-frog symplectic
integrator is a standard choice as it preserves P1 and P3. Given a step size ε > 0 and a number of leap-frog steps
L ∈ N, this scheme initializes at (q_0, p_0) ∈ R^d × R^d and iterates

    p_{ℓ+1/2} = p_ℓ − (ε/2) ∇U(q_ℓ),
    q_{ℓ+1} = q_ℓ + ε p_{ℓ+1/2},
    p_{ℓ+1} = p_{ℓ+1/2} − (ε/2) ∇U(q_{ℓ+1}),

for ℓ ∈ [0 : L − 1]. We write the leap-frog iteration as Φ̂_ε(q_ℓ, p_ℓ) = (q_{ℓ+1}, p_{ℓ+1}) and the corresponding approximation
of the flow as Φ̂_{ε,ℓ}(q_0, p_0) = (q_ℓ, p_ℓ) for ℓ ∈ [1 : L]. As before, we denote by Φ̂∘_{ε,ℓ}(q_0, p_0) = q_ℓ and Φ̂∗_{ε,ℓ}(q_0, p_0) = p_ℓ
the projections onto the position and velocity coordinates respectively. The leap-frog scheme is of order two [Hairer
et al., 2005, Theorem 3.4]: for sufficiently small ε, we have both

    |Φ̂_{ε,L}(q_0, p_0) − Φ_{εL}(q_0, p_0)| ≤ C_1 ε²,    (6)

and

    |E(Φ̂_{ε,L}(q_0, p_0)) − E(q_0, p_0)| ≤ C_2 ε²,    (7)

for some constants C_1, C_2 > 0. Given the nature of Hamiltonian dynamics, the constant C_1 will typically grow
exponentially with the number of leap-frog iterations L [Leimkuhler and Matthews, 2015, Section 2.2.3]. Under
appropriate assumptions, the constant C_2 on the other hand can be shown to be stable over exponentially long time
intervals [Hairer et al., 2005, Theorem 8.1]. The Hamiltonian is not exactly conserved under time discretization,
but one can employ a Metropolis–Hastings correction as described in the following section.
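For concreteness, a minimal R implementation of the leap-frog scheme above could read as follows; the function name and the assumption that grad_U returns ∇U are ours.

```r
# Leap-frog integrator: L steps of size eps from (q0, p0), for a user-supplied gradient grad_U
leapfrog <- function(grad_U, q0, p0, eps, L) {
  q <- q0; p <- p0
  for (l in 1:L) {
    p_half <- p - 0.5 * eps * grad_U(q)    # half step for the velocity
    q <- q + eps * p_half                  # full step for the position
    p <- p_half - 0.5 * eps * grad_U(q)    # half step for the velocity
  }
  list(q = q, p = p)                       # approximates the flow at time eps * L
}
```

(In practice the consecutive half steps can be merged to save one gradient evaluation per iteration.)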
3.2 Hamiltonian Monte Carlo kernel
Hamiltonian Monte Carlo [HMC, Neal, 1993, Duane et al., 1987] is a Metropolis–Hastings (MH) algorithm on phase
space that targets π̃, with the time-discretized Hamiltonian dynamics Φ̂_{ε,L}(q_0, p_0) = (q_L, p_L) as a proposal. From a
state (Q_n, P_n) ∈ R^d × R^d, at iteration n ≥ 0,

1. sample a velocity P∗_n ∼ N(0_d, I_d), independently of other variables, and set (q_0, p_0) = (Q_n, P∗_n);
2. perform leap-frog integration to obtain (q_L, p_L) = Φ̂_{ε,L}(q_0, p_0);
3. with probability α((q_0, p_0), (q_L, p_L)), set (Q_{n+1}, P_{n+1}) = (q_L, −p_L), otherwise set (Q_{n+1}, P_{n+1}) = (Q_n, P_n).

Since the leap-frog integrator preserves P1 and P3, the MH acceptance probability is given by

    α((q, p), (q′, p′)) = min(1, exp(E(q, p) − E(q′, p′))),    (8)

for (q, p), (q′, p′) ∈ R^d × R^d. As this constructs a π̃-invariant Markov chain (Q_n, P_n)_{n≥0} on phase space, the marginal
chain (Q_n)_{n≥0} is a π-invariant Markov chain. We can write the Markov transition kernel of the marginal chain as

    K_{ε,L}(q, A) = ∫_{R^d} I_A(Φ̂∘_{ε,L}(q, p)) α((q, p), Φ̂_{ε,L}(q, p)) N(p; 0_d, I_d) dp
                   + δ_q(A) ∫_{R^d} {1 − α((q, p), Φ̂_{ε,L}(q, p))} N(p; 0_d, I_d) dp,    (9)

for q ∈ R^d, A ∈ B(R^d). Irreducibility and geometric ergodicity of K_{ε,L} have recently been established rigorously in
Durmus et al. [2017]; see also Cances et al. [2007], Livingstone et al. [2016] for previous works. These results can
be used to verify A1 in Section 1.2.
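A single transition of the marginal kernel K_{ε,L} can then be sketched in R as follows, reusing the leapfrog function above; U and grad_U denote the potential and its gradient, and all names are illustrative.

```r
# One HMC transition from position q, targeting pi(dq) proportional to exp(-U(q))
hmc_kernel <- function(U, grad_U, q, eps, L) {
  p <- rnorm(length(q))                                   # velocity refreshment, N(0_d, I_d)
  prop <- leapfrog(grad_U, q, p, eps, L)
  # acceptance probability (8), computed on the log scale for numerical stability
  log_alpha <- (U(q) + 0.5 * sum(p^2)) - (U(prop$q) + 0.5 * sum(prop$p^2))
  if (log(runif(1)) < log_alpha) prop$q else q            # the velocity flip has no effect on the marginal chain
}
```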
5
3.3 Coupled Hamiltonian Monte Carlo kernel
Similarly to Section 2.2, we now consider coupling two HMC chains (Q^i_n, P^i_n)_{n≥0}, i = 1, 2, using the following
procedure. From two states (Q^i_n, P^i_n), i = 1, 2, at iteration n ≥ 0,

1. sample a velocity P∗_n ∼ N(0_d, I_d), independently of other variables, and for i = 1, 2, set (q^i_0, p^i_0) = (Q^i_n, P∗_n);
2. for i = 1, 2, perform leap-frog integration to obtain (q^i_L, p^i_L) = Φ̂_{ε,L}(q^i_0, p^i_0);
3. sample U ∼ U[0, 1];
4. for i = 1, 2, if U ≤ α((q^i_0, p^i_0), (q^i_L, p^i_L)), set (Q^i_{n+1}, P^i_{n+1}) = (q^i_L, −p^i_L), otherwise set (Q^i_{n+1}, P^i_{n+1}) = (Q^i_n, P^i_n).

The above procedure amounts to running two HMC chains with common random numbers. We denote the associated
coupled transition kernel on the position coordinates as K̄_{ε,L}((q^1, q^2), A_1 × A_2) for q^1, q^2 ∈ R^d and A_1, A_2 ∈
B(R^d). Marginally we have K̄_{ε,L}((q^1, q^2), A_1 × R^d) = K_{ε,L}(q^1, A_1) and K̄_{ε,L}((q^1, q^2), R^d × A_2) = K_{ε,L}(q^2, A_2).
We suppose that (Q^1_0, Q^2_0) are initialized according to π_0 independently, and (P^1_0, P^2_0) with an arbitrary distribution
on R^{2d}. We will write P_{ε,L} for the law of the coupled HMC chains (Q^i_n, P^i_n)_{n≥0}, i = 1, 2, and E_{ε,L} to denote
expectation with respect to P_{ε,L}.
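The following R sketch performs one transition of the coupled kernel K̄_{ε,L}, sharing the velocity and the acceptance uniform between the two chains; as before, the names are our own.

```r
# One coupled HMC transition from positions q1 and q2 (common velocity, common uniform)
coupled_hmc_kernel <- function(U, grad_U, q1, q2, eps, L) {
  p <- rnorm(length(q1))                                  # common velocity for both chains
  u <- runif(1)                                           # common uniform for both acceptance steps
  prop1 <- leapfrog(grad_U, q1, p, eps, L)
  prop2 <- leapfrog(grad_U, q2, p, eps, L)
  log_alpha1 <- (U(q1) + 0.5 * sum(p^2)) - (U(prop1$q) + 0.5 * sum(prop1$p^2))
  log_alpha2 <- (U(q2) + 0.5 * sum(p^2)) - (U(prop2$q) + 0.5 * sum(prop2$p^2))
  list(q1 = if (log(u) < log_alpha1) prop1$q else q1,
       q2 = if (log(u) < log_alpha2) prop2$q else q2)
}
```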
We now establish that the relaxed meeting time τ_δ = inf{n ≥ 0 : |Q^1_n − Q^2_n| ≤ δ}, for any δ > 0, has geometric
tails. The following result can be used to establish A2 for the algorithm that will be introduced in the next section.

Theorem 1. Suppose that the potential U is twice continuously differentiable, the gradient of U is globally β-Lipschitz
and there exists a compact set S ∈ B(R^d) with Leb_d(S) > 0 such that the restriction of U to S, denoted by
U|_S : S → R, is α-strongly convex. Then there exist ε̃ > 0, L̃ ∈ N, C ∈ R₊ and γ ∈ (0, 1) such that

    P_{ε,L}(τ_δ > n) ≤ C γ^n,  n ∈ N,    (10)

for any ε < ε̃ and L > L̃ satisfying εL < ε̃L̃.
Proof. We first establish that the coupled HMC kernel is Leb_{2d}-irreducible by adapting the arguments in Durmus
et al. [2017, proof of Theorem 2] to our coupling. Under the Lipschitz assumption on ∇U, the arguments in Durmus
et al. [2017, proof of Theorem 2] imply that for any L ∈ N, there exists ε̃_L > 0 such that the mapping p ↦ Φ̂∘_{ε,L}(q, p)
is a continuously differentiable diffeomorphism from R^d to R^d for q ∈ R^d and ε < ε̃_L. Hence the mapping

    p ↦ Φ̄_{ε,L}(q, q′, p) := (Φ̂∘_{ε,L}(q, p), Φ̂∘_{ε,L}(q′, p))

from R^d to R^{2d} is also a continuously differentiable diffeomorphism for (q, q′) ∈ R^{2d} and ε < ε̃_L. Writing Φ̄_{ε,L}^{−1} :
R^{2d} → R^d as the inverse function, by a change of variables,

    K̄_{ε,L}((q^1, q^2), A) ≥ ∫_{R^d} ∫_0^1 I_A(Φ̂∘_{ε,L}(q^1, p), Φ̂∘_{ε,L}(q^2, p)) ∏_{i=1,2} I{u ≤ α((q^i, p), Φ̂_{ε,L}(q^i, p))} N(p; 0_d, I_d) du dp
        = ∫_{R^{2d}} ∫_0^1 I_A(q̄) ∏_{i=1,2} I{u ≤ α((q^i, Φ̄_{ε,L}^{−1}(q̄)), Φ̂_{ε,L}(q^i, Φ̄_{ε,L}^{−1}(q̄)))} N(Φ̄_{ε,L}^{−1}(q̄); 0_d, I_d) |det J_{Φ̄_{ε,L}^{−1}}(q̄)| du dq̄
        ≥ Leb_{2d}(A) inf_{q̄ ∈ A} min_{i=1,2} α((q^i, Φ̄_{ε,L}^{−1}(q̄)), Φ̂_{ε,L}(q^i, Φ̄_{ε,L}^{−1}(q̄))) N(Φ̄_{ε,L}^{−1}(q̄); 0_d, I_d) |det J_{Φ̄_{ε,L}^{−1}}(q̄)|,

for all A ∈ B(R^{2d}), where J_{Φ̄_{ε,L}^{−1}} denotes the Jacobian matrix of Φ̄_{ε,L}^{−1} (with the convention 0 × +∞ = 0). It follows
that K̄_{ε,L} is aperiodic and irreducible with respect to the Lebesgue measure on R^{2d}.
For any real-valued measurable function f : Ω → R, we write its level sets as L_f(ℓ) = {x ∈ Ω : f(x) ≤ ℓ} for
ℓ ∈ R. Define the kinetic energy function K(p) = |p|²/2, the levels U̲ > inf_{q∈S} U(q) and Ū < sup_{q∈S} U(q) such that
U̲ < Ū, and the sets C_ℓ = L_{U|S}(ℓ) × L_K(Ū − ℓ) ⊂ L_E(Ū) and C̃_ℓ = L_{U|S}(ℓ) × L_{U|S}(ℓ) × L_K(Ū − ℓ) for ℓ ∈ (U̲, Ū).
Since Leb_d(L_{U|S}(ℓ)) > 0 for ℓ ∈ (U̲, Ū) under the assumptions on U, the Leb_{2d}-irreducibility of K̄_{ε,L} implies that for any
L ∈ N and ε < ε̃_L, there exists N ∈ N such that

    P_{ε,L}(Q^1_N ∈ L_{U|S}(ℓ), Q^2_N ∈ L_{U|S}(ℓ)) > 0.
When both chains enter the set L_{U|S}(ℓ), it follows from Lemma 1 that there exist T̃ > 0 and ρ_0 < 1 such that

    |Φ∘_T(Q^1_N, P∗_N) − Φ∘_T(Q^2_N, P∗_N)| ≤ ρ_0 |Q^1_N − Q^2_N|,

for all (Q^1_N, Q^2_N, P∗_N) ∈ C̃_ℓ and T < T̃. Hence we have

    P_{ε,L}(|Φ∘_T(Q^1_N, P∗_N) − Φ∘_T(Q^2_N, P∗_N)| ≤ ρ_0 |Q^1_N − Q^2_N| | Q^1_N ∈ L_{U|S}(ℓ), Q^2_N ∈ L_{U|S}(ℓ)) > 0.

By the triangle inequality, consistency of the leap-frog integrator (6) and compactness of C̃_ℓ, there exist ε_0 ≤ ε̃_L,
L_0 ∈ N and ρ_1 < 1 such that

    P_{ε,L}(|Φ̂∘_{ε,L}(Q^1_N, P∗_N) − Φ̂∘_{ε,L}(Q^2_N, P∗_N)| ≤ ρ_1 |Q^1_N − Q^2_N| | Q^1_N ∈ L_{U|S}(ℓ), Q^2_N ∈ L_{U|S}(ℓ)) > 0,

for ε < ε_0 and L > L_0 satisfying εL = T. Again by consistency of the leap-frog integrator (7) and compactness of
C_ℓ, it follows from (8) that there exist ε_1 ≤ ε_0, L_1 ≥ L_0 and η_0 < 1/2 such that

    P_{ε,L}(Q^i_{N+1} = Φ̂∘_{ε,L}(Q^i_N, P∗_N) | (Q^i_N, P∗_N) ∈ C_ℓ) ≥ 1 − η_0,

for i = 1, 2 and ε < ε_1, L > L_1 satisfying εL = T. By Fréchet's inequality, the probability of accepting both
proposals satisfies

    P_{ε,L}(Q^1_{N+1} = Φ̂∘_{ε,L}(Q^1_N, P∗_N), Q^2_{N+1} = Φ̂∘_{ε,L}(Q^2_N, P∗_N) | (Q^1_N, Q^2_N, P∗_N) ∈ C̃_ℓ) > 0,

therefore

    P_{ε,L}(|Q^1_{N+1} − Q^2_{N+1}| ≤ ρ_1 |Q^1_N − Q^2_N| | Q^1_N ∈ L_{U|S}(ℓ), Q^2_N ∈ L_{U|S}(ℓ)) > 0.

To iterate this argument, note first that if (q, p) ∈ C_ℓ then continuity of U and of the mapping t ↦ Φ∘_t(q, p)
implies Φ∘_t(q, p) ∈ L_{U|S}(Ū) for any t ∈ R₊. Owing to time discretization, we only have Φ̂∘_t(q, p) ∈ L_{U|S}(Ū + η_1) for
(q, p) ∈ C_ℓ and some η_1 > 0, by another application of (7). It follows that there exist a number of iterations I ∈ N
that depends on ρ_1, and an initial level ℓ_0 ∈ (U̲, Ū) depending on I and η_1, such that

    P_{ε,L}(|Q^1_{N+I} − Q^2_{N+I}| ≤ δ | Q^1_N ∈ L_{U|S}(ℓ_0), Q^2_N ∈ L_{U|S}(ℓ_0)) > 0.

Therefore we can conclude (10) by applying Williams [1991, Exercise E.10.5].
Under similar conditions, Durmus et al. [2017] provide a convergence result for the marginal HMC chains, which
can be used to check A1; see also Cances et al. [2007], Livingstone et al. [2016], Mangoubi and Smith [2017] and
Tweedie [1983] for the finiteness of moments.

It is worth noting that the distance between chains might exceed δ at some future iterations n > τ_δ, and that
the event {|Q^1_n − Q^2_n| ≤ δ} is not an exact meeting event; thus Theorem 1 does not establish A2. In the next
section, we combine coupled HMC kernels with another kernel designed to prompt exact meetings, which would
occur with large probability when the two chains are close.
4 Unbiased Hamiltonian Monte Carlo estimators
The construction of Jacob et al. [2017b] requires two chains that meet exactly. One possibility here is the approach
of Glynn and Rhee [2014], which involves the introduction of a truncation variable. Instead we propose to use
coupled Metropolis–Hastings steps to trigger exact meetings. These coupled MH steps are described in Section
4.1, and a summary of the proposed methodology combining the two coupled kernels is in Section 4.2. Section 4.3
briefly describes a further variance reduction technique.
4.1 Coupled Metropolis–Hastings steps
As in Section 1, let us denote the two chains by (X_n)_{n≥0} and (Y_n)_{n≥0}; these correspond to the position coordinates
in Section 3, propagated with a time shift, e.g. (X_{n+1}, Y_n) ∼ K̄_{ε,L}((X_n, Y_{n−1}), ·). According to Theorem 1, coupled
HMC chains are close to one another after some iterations. Denote the distance between the chains at step n by
δ_n = |X_n − Y_{n−1}|.

In a coupled MH step with Normal random walk, a pair of proposals (X⋆, Y⋆) is sampled from the maximal
coupling of N(X_n, Σ) and N(Y_{n−1}, Σ) [Jacob et al., 2017b]. Let us consider the case where Σ = σ²I_d for some σ > 0.
Algorithm 1 Unbiased HMC estimator H̄_{k:m}(X, Y) of π(h), with tuning parameters ω, σ, ε, L, k, m.
The kernel P̄_σ refers to a coupled random walk MH kernel with proposal standard deviation σ and maximally
coupled proposals. The kernel K̄_{ε,L} refers to a coupled HMC kernel with step size ε, L leap-frog steps, and common
initial velocity at each step. The marginal kernels are denoted by P_σ and K_{ε,L} respectively.

1. Draw X_0 and Y_0 from an initial distribution π_0, and
   (a) with probability ω, sample X_1 ∼ P_σ(X_0, ·);
   (b) otherwise sample X_1 ∼ K_{ε,L}(X_0, ·);
   (c) set n = 1.
2. While X_n ≠ Y_{n−1} or n < m,
   (a) with probability ω, sample (X_{n+1}, Y_n) ∼ P̄_σ((X_n, Y_{n−1}), ·);
   (b) otherwise, sample (X_{n+1}, Y_n) ∼ K̄_{ε,L}((X_n, Y_{n−1}), ·);
   (c) if X_{n+1} = Y_n and the chains had not met yet, set τ = n + 1;
   (d) increment n ← n + 1.
3. Compute H_ℓ(X, Y) = h(X_ℓ) + ∑_{n=ℓ}^{max(m, τ−1)} {h(X_{n+1}) − h(Y_n)} for ℓ ∈ [k : m],
   and H̄_{k:m}(X, Y) = (m − k + 1)^{−1} ∑_{ℓ=k}^{m} H_ℓ(X, Y); or compute H̄_{k:m}(X, Y) as in (2).
Under the maximal coupling, we have P(X⋆ = Y⋆) = 1 − d_TV(N(X_n, σ²I_d), N(Y_{n−1}, σ²I_d)). The total variation
can be approximated as in Pollard [2005]. First, we have d_TV(N(X_n, σ²I_d), N(Y_{n−1}, σ²I_d)) = P(2σ|Z| ≤ δ_n),
where Z is a univariate standard Normal variable and δ_n is considered fixed. Approximations of the folded Normal
cumulative distribution function then lead to

    P(X⋆ = Y⋆) = 1 − P(2σ|Z| ≤ δ_n) = 1 − (2π)^{−1/2} (δ_n/σ) + O(δ_n²/σ²),  as δ_n/σ → 0.

To achieve P(X⋆ = Y⋆) = s for some desired probability s, we can choose σ as approximately δ_n/(√(2π)(1 − s)).
The proposed values (X⋆, Y⋆) are then accepted as the next states according to MH acceptance ratios, i.e. if
U ≤ min(1, π(X⋆)/π(X_n)) and U ≤ min(1, π(Y⋆)/π(Y_{n−1})) respectively, where a single uniform variable U ∼ U[0, 1]
is used for both chains.

If σ is small compared to the spread of the target density function, the probability of jointly accepting the
proposals is high. On the other hand, σ needs to be large compared to δ_n = |X_n − Y_{n−1}| for the event {X⋆ = Y⋆}
to occur frequently. This leads to a trade-off; in numerical experiments, for pairs of chains propagated using the
coupled HMC kernel K̄_{ε,L}, we can monitor both the distance δ_n and the target density values to guide the choice of
σ. We will choose a fixed value of σ for all coupled MH steps, and leave adaptive strategies, where σ would be e.g.
chosen according to δ_n, for future research. Hereafter we denote by P_σ and P̄_σ the marginal and coupled kernels
associated with the MH steps.
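A minimal R sketch of one such coupled random walk MH step, using the standard rejection-based sampler for the maximal coupling of two Normal proposals [Jacob et al., 2017b], could read as follows; log_pi denotes the log target density and all names are illustrative.

```r
# Maximal coupling of N(mu1, sigma^2 I_d) and N(mu2, sigma^2 I_d)
max_coupling_normal <- function(mu1, mu2, sigma) {
  d <- length(mu1)
  x <- mu1 + sigma * rnorm(d)
  logp1 <- sum(dnorm(x, mu1, sigma, log = TRUE))
  logp2 <- sum(dnorm(x, mu2, sigma, log = TRUE))
  if (log(runif(1)) < logp2 - logp1) {
    return(list(x = x, y = x))                           # proposals are identical
  }
  repeat {                                               # otherwise sample y from the residual
    y <- mu2 + sigma * rnorm(d)
    logq1 <- sum(dnorm(y, mu1, sigma, log = TRUE))
    logq2 <- sum(dnorm(y, mu2, sigma, log = TRUE))
    if (log(runif(1)) >= logq1 - logq2) return(list(x = x, y = y))
  }
}

# One coupled random walk MH step from (xn, yn), with a common acceptance uniform
coupled_rwmh_kernel <- function(log_pi, xn, yn, sigma) {
  prop <- max_coupling_normal(xn, yn, sigma)
  u <- runif(1)
  list(x = if (log(u) < log_pi(prop$x) - log_pi(xn)) prop$x else xn,
       y = if (log(u) < log_pi(prop$y) - log_pi(yn)) prop$y else yn)
}
```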
4.2 Combining kernels
We propose to use both coupled HMC and MH kernels through a mixture. The coupled HMC kernel is expected
to bring the two chains close to one another, while the coupled MH kernel enables exact meetings when the chains
are already close. In a mixture of kernels, at each step, the MH kernel is chosen with probability ω, otherwise the
HMC kernel is chosen. The procedure is described in Algorithm 1. Note that A3 is satisfied by design for coupled
chains generated by this algorithm. As the resulting coupled mixture kernel inherits properties of the coupled MH
kernel, A2 can in principle be verified by simply relying on the properties of coupled MH kernels established in
Jacob et al. [2017b]. However, we stress here that Theorem 1 provides some insight on the role of coupled HMC
steps on the efficiency of the proposed estimator.

We now comment on the computational cost of Algorithm 1. Assume for simplicity that the cost of evaluating
the target density is approximately equal to that of evaluating its gradient. Each HMC step is then L + 1 times
more expensive than an MH step. If we choose a small value for ω, such as 0.1 or 0.05, the cost of the MH steps
becomes negligible. Secondly, the cost of running two chains is approximately twice the cost of running each chain
until meeting occurs. Thereafter, only one chain needs to be propagated up to step m. If we choose m to be much
larger than τ with high probability, the cost of Algorithm 1 is therefore comparable to the cost of m HMC iterations.

The efficiency of the unbiased HMC estimator depends on the mixing properties of the underlying HMC kernel,
and on the contraction achieved by the coupling. Importantly, the tuning parameters ε and L that would be
optimal for the marginal HMC kernel are not necessarily adequate for the coupled kernel, as illustrated in Section
5. The other tuning parameters include σ for the coupled MH step discussed above, and Jacob et al. [2017b] give
recommendations for k and m: namely k can be chosen as a large quantile of the meeting times, and m such that
(m − k)/m ≈ 1, for instance m = 10k.

Finally, in Section 5.2 we will encounter a situation where the coupled HMC kernel contracts so quickly that the
distance |X_n − Y_{n−1}| becomes smaller than machine precision after a small number of iterations. The two chains
can then be considered exactly identical, for all practical purposes, and the coupled MH steps become unnecessary.
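Putting the pieces together, a compact R sketch of Algorithm 1, reusing the coupled_hmc_kernel and coupled_rwmh_kernel sketches of the previous sections and assuming a user-supplied rinit() drawing from π_0, might read:

```r
# Sketch of Algorithm 1: run coupled chains until they have met and n >= m
coupled_chains <- function(U, grad_U, rinit, omega, sigma, eps, L, m) {
  log_pi <- function(q) -U(q)                      # log target density, up to a constant
  xs <- list(rinit()); ys <- list(rinit())         # X_0 and Y_0, drawn from pi_0
  x <- if (runif(1) < omega) {
    coupled_rwmh_kernel(log_pi, xs[[1]], xs[[1]], sigma)$x        # marginal MH step for X_1
  } else {
    coupled_hmc_kernel(U, grad_U, xs[[1]], xs[[1]], eps, L)$q1    # marginal HMC step for X_1
  }
  xs[[2]] <- x
  n <- 1; tau <- Inf
  while (!identical(x, ys[[n]]) || n < m) {        # while X_n != Y_{n-1} or n < m
    if (runif(1) < omega) {
      out <- coupled_rwmh_kernel(log_pi, x, ys[[n]], sigma); x <- out$x; y_new <- out$y
    } else {
      out <- coupled_hmc_kernel(U, grad_U, x, ys[[n]], eps, L); x <- out$q1; y_new <- out$q2
    }
    if (is.infinite(tau) && identical(x, y_new)) tau <- n + 1
    xs[[n + 2]] <- x; ys[[n + 1]] <- y_new
    n <- n + 1
  }
  list(x = do.call(rbind, xs), y = do.call(rbind, ys), tau = tau)
}
```

The returned matrices follow the storage convention of the sketch in Section 1.2, so that a call such as H_bar_km(h, out$x, out$y, k, m, out$tau) would produce one unbiased estimator; this only illustrates the control flow and is not the implementation used for the experiments below.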
4.3 Choice of weights and variance reduction
As suggested in Jacob et al. [2017a,b], the estimators H_ℓ(X, Y) for ℓ ∈ [k : m] given in (1) can be averaged with
any weights (w_ℓ)_{ℓ=k}^m such that ∑_{ℓ=k}^m w_ℓ = 1. The estimator H̄_{k:m}(X, Y) in (2) corresponds to weights equal to
(m − k + 1)^{−1}. For an arbitrary choice (w_ℓ)_{ℓ=k}^m, the estimator ∑_{ℓ=k}^m w_ℓ H_ℓ(X, Y) is unbiased and its variance is given
by w^T Σ_H w, where Σ_H denotes the (m − k + 1) × (m − k + 1) covariance matrix of the estimators (H_k, . . . , H_m).
To minimize such a variance without violating the sum constraint, we solve the linear system

    Σ_H w + λ 1_{m−k+1} = 0_{m−k+1},    1_{m−k+1}^T w = 1,

in the unknowns (w, λ), where λ is a Lagrange multiplier, for a computational cost of order (m − k + 1)³. The matrix Σ_H can be approximated
from i.i.d. realizations of H_ℓ for ℓ ∈ [k : m]. The resulting weights can then be used to reduce the variance of
H̄_{k:m}(X, Y), especially if the original MCMC chain exhibits strong autocorrelations.
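A short R sketch of this weight computation, assuming H is an R × (m − k + 1) matrix whose rows are independent realizations of (H_k, . . . , H_m), could be:

```r
# Optimal weights minimizing w' Sigma_H w subject to sum(w) = 1
optimal_weights <- function(H) {
  Sigma_H <- cov(H)                                # empirical covariance of (H_k, ..., H_m)
  p <- ncol(H)
  A <- rbind(cbind(Sigma_H, rep(1, p)), c(rep(1, p), 0))
  sol <- solve(A, c(rep(0, p), 1))                 # block system with the Lagrange multiplier
  sol[1:p]                                         # drop the multiplier, keep the weights
}
```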
5 Numerical illustrations
We investigate some key aspects of the proposed unbiased HMC estimator, such as its efficiency compared to
standard HMC estimators. As in the rest of the article, we choose a Normal distribution for the initial velocities
at each HMC step, and a unit mass matrix; other choices are possible [Girolami and Calderhead,2011,Livingstone
et al.,2017].
In all experiments, whenever the test function h is not specified, it is chosen as h : x ↦ x_1, so that π(h) is simply
the mean of the first target marginal distribution. The asymptotic variance of an MCMC estimator refers to the
variance appearing in the central limit theorem satisfied by N^{−1} ∑_{n=0}^N h(X_n) as N → ∞, where (X_n)_{n≥0} is the
chain generated by the algorithm. Here, these asymptotic variances are approximated with the spectrum0 function
of the coda package [Plummer et al., 2006]. For unbiased estimators, we define the asymptotic efficiency as variance
multiplied by expected cost [Glynn and Whitt, 1992]. This accounts for the fact that, in a given computing budget,
more estimators can be averaged over if each one can be produced faster. For the estimator H̄_{k:m}(X, Y) in (2), the
expected computing time E[max(τ, m)] and the variance V[H̄_{k:m}(X, Y)] are approximated by empirical averages of
independent realizations.
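In R, this efficiency measure can be approximated from independent realizations as follows (a two-line illustration under our own naming):

```r
# estimates: vector of independent unbiased estimates; costs: corresponding values of max(tau, m)
asymptotic_efficiency <- function(estimates, costs) {
  var(estimates) * mean(costs)     # variance multiplied by expected cost, as in Glynn and Whitt (1992)
}
```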
5.1 Multivariate Normal distribution
Let the target π be a multivariate Normal N(0_d, Σ_π) with d = 250 and with the (i, j)-entry of Σ_π equal to
exp(−|i − j|). In this example we discuss the choice of trajectory length, defined as the product εL, and the use of
coupled MH kernels to trigger exact meetings.

We fix the number of leap-frog steps to L = 20 and vary the step size ε so that the trajectory length εL spans
between 0 and 3π/2, where π here denotes the mathematical constant. The initial distribution π_0 is chosen as the
target. For each trajectory length, the asymptotic variance of HMC computed from 5,000 iterations is shown in
Figure 1a. The optimal trajectory length is close to the value π, which is consistent with the analytical solution
in Section 2.2. For such a trajectory length, the asymptotic variance is smaller than the variance obtained with
perfect samples from the target, thanks to negative auto-correlations.
(a) HMC asymptotic variance against trajectory length εL.
(b) Distance after 100 coupled HMC iterations against trajectory length εL.
Figure 1: In the multivariate Normal example of Section 5.1, asymptotic variance for the estimation of ∫ x_1 π(dx)
using HMC, computed using chains of length 5,000 started at stationarity (left). Euclidean distance between the
100-th iterates of coupled HMC chains (right). The number of leap-frog steps is set to L = 20, which implicitly
determines the step size ε for each trajectory length εL. Each dot corresponds to one of 5 independent runs.
(a) Log-distance between coupled HMC chains against iterations.
(b) Log-distance between coupled chains propagated with a mixture of HMC and MH kernels, against iterations.
Figure 2: In the multivariate Normal example of Section 5.1, distance between coupled HMC chains against number
of iterations (left), and between chains propagated with the mixture of HMC and MH kernels, with σ = 10^{−5}
and ω = 0.1 (right). Each line corresponds to one of 100 independent runs.
We then run 100 iterations of coupled HMC and compute the Euclidean distance between the two final states.
The resulting distances are shown in Figure 1b. Lengths around the value π/2 lead to the smallest distances, consistently
with the analytical reasoning of Section 2.2. Moreover, there is a range of lengths that lead to contraction.
On the other hand, the optimal length for the HMC estimator, which was the value π, does not lead to visible
contraction after 100 iterations. Therefore, the proposed coupling contracts most with tuning parameters that are
not optimal for the underlying HMC algorithm, which results in a loss of efficiency.

Based on Figure 1b, we set εL = π/2, L = 20 and run coupled chains, 100 times independently, until their
distance is less than machine precision. In Figure 2a these distances are plotted on a logarithmic scale against
iterations; the lines drop when the distances fall below machine precision, which occurs between iterations 127 and
312. The distances are already very small after a few dozen iterations. We implement the proposed algorithm with
a mixture of kernels described in Section 4.2, with σ = 10^{−5} and ω = 0.1, and plot the resulting distances in Figure
2b. All meeting times then occur between iterations 36 and 97. The MH steps thus successfully manage to trigger
exact meetings.

We set k = 50 and m = 500 to produce R = 100 unbiased estimators of ∫ x_1 π(dx) as in (2). The asymptotic
efficiency is approximately equal to 1.96. The asymptotic variance of HMC obtained with εL = π was found to
be approximately 0.16, averaging the 5 runs shown in Figure 1a. Therefore, the proposed estimator is about 12
times less efficient than the original HMC algorithm when optimally tuned. Depending on hardware, this can be
considered an acceptable loss in exchange for complete parallelism, among other advantages of unbiased estimators
argued e.g. in Rhee [2013], Jacob et al. [2017b]. Unbiased estimators could also be obtained from variants of HMC
where the number of leap-frog steps L is random, and possibly adaptive, which might reduce the efficiency loss.
5.2 Truncated Normal distribution
We consider Hamiltonian Monte Carlo on truncated Normal distributions, with truncations defined by linear and
quadratic inequalities. In this setting Pakman and Paninski [2014] show that Hamiltonian dynamics can be solved,
resulting in trajectories that bounce off the constraints. An R package implementing the method of Pakman and
Paninski [2014] is available online [Pakman,2012]. Using this package, the implementation of the proposed method
only involved simple modifications.
We consider two of the examples in Pakman and Paninski [2014], where a bivariate Normal distribution is
truncated by two linear and two quadratic constraints respectively. A thousand HMC samples are shown in Figure
3 (top row). The first distribution is a bivariate Normal, with unit covariance matrix and mean (4, 4), restricted
to the set {x_1 ≤ x_2 ≤ 1.1 x_1} ⊂ R² (Figure 3a). The second distribution is a bivariate standard Normal restricted
to the set {(x_1 − 4)²/32 + (x_2 − 1)²/8 ≤ 1} ∩ {4x_1² + 8x_2² − 2x_1x_2 + 5x_2 ≥ 1} ⊂ R² (Figure 3b). We use the value
π/2 as a trajectory length, as advocated in Pakman and Paninski [2014]. As for the initial distribution π_0, we use
a point mass at (2, 2.1) for the first target, and at (2, 0) for the second one.

In this example, the proposed coupling induces a contraction that leads to distances between trajectories becoming
smaller than machine precision after a few iterations. Therefore, we do not need to resort to coupled MH
steps: we can define the meeting times directly as the first times for which distances are less than machine precision.
Histograms of such meeting times are shown in Figure 3 for both targets (bottom row). They indicate that small
values of k and m could be chosen, effectively leading to the possibility of running very short HMC chains in parallel
in a principled way.
5.3 Logistic regression
We consider a Bayesian logistic regression as in Hoffman and Gelman [2014], on the classic German credit data
set. Including pairwise interactions, the covariates are in a matrix X with N = 1000 rows and p = 300 columns,
which we standardize by column. The parameters are the intercept α ∈ R, coefficients β ∈ R^p, and a prior
variance σ² ∈ R₊ on intercept and coefficients. The likelihood specifies that the binary outcome Y_i satisfies
P(Y_i = 1 | X_i, α, β) = (1 + exp(−α − X_i^T β))^{−1} for all i ∈ [1 : N]. The prior specifies α | σ² ∼ N(0, σ²) and
β_j | σ² ∼ N(0, σ²), for all j ∈ [1 : p], and an Exponential distribution with rate λ = 0.01 for σ². We transform σ²
into log σ², so that each parameter lies in R. The target π is the posterior distribution of (α, β, log σ²), of dimension
d = p + 2 = 302. We use an independent standard Normal for each parameter to initialize the chains, which defines
π_0.
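For illustration, the potential U and its gradient for this posterior could be coded in R as below; this is a sketch under our reading of the model (names are ours, the Jacobian of the σ² → log σ² transform is included, and numerical safeguards such as a stable evaluation of log(1 + e^η) are omitted).

```r
# Potential U(theta) = -log posterior density, theta = (alpha, beta, log sigma^2),
# for y_i | x_i ~ Bernoulli(plogis(alpha + x_i' beta)), alpha, beta_j ~ N(0, sigma^2),
# sigma^2 ~ Exponential(rate = 0.01); X is the N x p covariate matrix, y the binary response.
make_logistic_potential <- function(X, y, lambda = 0.01) {
  p <- ncol(X)
  U <- function(theta) {
    alpha <- theta[1]; beta <- theta[2:(p + 1)]; ell <- theta[p + 2]   # ell = log sigma^2
    eta <- alpha + as.vector(X %*% beta)
    loglik <- sum(y * eta - log1p(exp(eta)))
    logprior <- -0.5 * (p + 1) * (log(2 * pi) + ell) -
      0.5 * (alpha^2 + sum(beta^2)) * exp(-ell) +
      log(lambda) - lambda * exp(ell) + ell            # last term: Jacobian of sigma^2 -> ell
    -(loglik + logprior)
  }
  grad_U <- function(theta) {
    alpha <- theta[1]; beta <- theta[2:(p + 1)]; ell <- theta[p + 2]
    mu <- plogis(alpha + as.vector(X %*% beta))
    g_alpha <- -(sum(y - mu) - alpha * exp(-ell))
    g_beta <- -(as.vector(t(X) %*% (y - mu)) - beta * exp(-ell))
    g_ell <- -(-0.5 * (p + 1) + 0.5 * (alpha^2 + sum(beta^2)) * exp(-ell) - lambda * exp(ell) + 1)
    c(g_alpha, g_beta, g_ell)
  }
  list(U = U, grad_U = grad_U)
}
```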
We set L = 20 and vary ε so that the trajectory length εL is in the range [0.1, 0.5]. For each length, we run
10,000 HMC iterations, discard the first 5,000 as burn-in, and use the remaining 5,000 samples to approximate
the asymptotic variance of HMC for the estimation of ∫ x_1 π(dx), which here is the posterior expectation of the
intercept. The results of independent runs are shown in Figure 4a. Coupled HMC chains are then run for 1,000
iterations, and the distances between the final states are shown in Figure 4b. Again, the optimal choice of εL for
the asymptotic variance of HMC is not optimal in terms of contraction. However, contrary to the example of
Section 5.1, here each of the considered trajectory lengths yields some contraction.
Using the length εL = 0.1, we then proceed with Algorithm 1 of Section 4.2, using σ = 10^{−5} and ω = 0.05.
Over 100 independent experiments, we compute the distance between the coupled chains, using two different
initializations. The first is the standard Normal distribution on each parameter as above, leading to the distances
plotted in Figure 5a. The observed meeting times occur between iterations 256 and 535. Using k = 100 and
m = 1,000, we produce 100 independent estimators H̄_{k:m}(X, Y) from these coupled chains, in order to approximate
the marginal means and variances of the target. With these values, we construct a Normal approximation of the
target, with a diagonal covariance matrix, and use this Normal as a new initial distribution π_0. For this better
initialization, the distance traces are shown in Figure 5b. The observed meeting times occur between iterations 192
and 422, and the plot shows that the distances decrease faster than with the previous initialization. The vertical
upward jumps in Figure 5 correspond to events where one chain accepts its HMC proposal while the other chain
does not.
With this better initialization, again using k = 100 and m = 1,000, we produce R = 1,000 independent
estimators of ∫ x_1 π(dx). The asymptotic efficiency is found to be approximately 0.40. The asymptotic variance
of HMC obtained with εL = 0.3 was found to be approximately 0.09, and with εL = 0.1 approximately 0.33;
(a) HMC samples approximating a bivariate Normal truncated by two linear constraints.
(b) HMC samples approximating a bivariate Normal truncated by two quadratic constraints.
(c) Meeting times for the bivariate Normal with linear constraints.
(d) Meeting times for the bivariate Normal with quadratic constraints.
Figure 3: In the truncated Normal example of Section 5.2, scatter plot of 1,000 HMC samples for a bivariate
Normal truncated by two linear constraints (top left), and two quadratic constraints (top right). Histogram of
1,000 meeting times, defined as first times for which the distance is smaller than machine precision, for coupled
HMC chains targeting the bivariate Normal with linear constraints (bottom left), and with quadratic constraints
(bottom right).
(a) HMC asymptotic variance against trajectory length εL.
(b) Distance after 1,000 coupled HMC steps against trajectory length εL.
Figure 4: In the logistic regression example of Section 5.3, asymptotic variance for the estimation of ∫ x_1 π(dx)
using HMC, computed using chains of length 10,000 started from an independent standard Normal distribution for
each parameter, and discarding a burn-in of 5,000 steps (left). Euclidean distance between the 1,000-th iterates of
coupled HMC chains (right). The number of leap-frog steps is set to L = 20, which implicitly determines the step
size ε for each trajectory length εL. Each dot corresponds to one of 5 independent runs.
(a) Log-distance between coupled chains initialized from independent standard Normal distributions.
(b) Log-distance between coupled chains initialized from a crude Normal approximation of the target.
Figure 5: In the logistic regression example of Section 5.3, distance between coupled chains initialized from
independent standard Normal distributions for each parameter against number of iterations (left), and initialized
from a Normal approximation of the target (right). The Normal approximation is obtained by estimating the 302
marginal means and variances of the target distribution. In both cases the chains are propagated using a mixture
of HMC and MH kernels, with σ = 10^{−5} and ω = 0.05, and the HMC kernel uses L = 20 and εL = 0.1. Each line
corresponds to one of 100 independent runs.
these were obtained from 10^5 HMC iterations after discarding 5,000 iterations as burn-in. Therefore, the proposed
estimator is about 4 times less efficient than the original HMC estimator when optimally tuned, or more precisely,
for the optimal value of ε given a fixed value L = 20. We could also use εL = 0.3 for the unbiased HMC estimator,
according to Figure 4b, but the meeting times would then be longer, and the potential for parallelization would
thus be reduced.

From the coupled chains, histograms can be produced by binning a dimension of the space and estimating
posterior masses of these bins, which are integrals of indicator functions [Jacob et al., 2017b]. Histograms of α
and β_1 under the posterior distribution are shown in Figure 6. The vertical bars indicate the point estimates of
posterior masses, and gray rectangles represent 95% confidence intervals based on the central limit theorem. The
overlaid red curves show kernel density estimates obtained from 10^5 HMC samples, after discarding a burn-in of
5,000 steps, and using L = 20 and εL = 0.3. Taking these kernel density estimates as ground truth, the narrowness
of confidence intervals reflects the accuracy of the proposed estimators. We stress that these confidence intervals
are based on the central limit theorem for averages of independent variables, and are therefore justified in the limit
of the number of independent estimators, all of which can be computed in parallel.
6 Discussion
Coupled Hamiltonian Monte Carlo chains can be combined to generate unbiased estimators of integrals with
respect to target distributions. With adequate couplings, such chains become exactly equal after a random number
of steps. The proposed approach involves a simple coupling of Hamiltonian Monte Carlo kernels, based on common
random numbers, that generates chains converging to one another. Combined with coupled random walk
Metropolis–Hastings steps, the approach leads to estimators that can be produced independently in parallel and averaged.
The method is demonstrated on three examples, and a contraction property of coupled HMC kernels is formally
established under strong log-concavity of the target on parts of the state space. Recently, Mangoubi and Smith
[2017] have proposed a much deeper study of the same coupling, and have adroitly exploited it to obtain novel
quantitative bounds on mixing properties of HMC. The same coupling was already discussed in Neal [2002], for
the purpose of removing the burn-in bias. The exploration of further links between our proposed estimators and
the circular coupling of Neal [2002] is an exciting avenue of research. The proposed couplings also enable other
unbiased estimators, such as those of Glynn and Rhee [2014] which do not require exact meetings.
As seen in numerical experiments, optimal trajectory lengths for standard HMC estimators are not optimal in
the coupled construction. This leads to a loss of efficiency of the proposed estimators compared to standard HMC
estimators. Whether this loss is acceptable or not will likely depend on the target distribution and the available
hardware. Other considerations include the construction of confidence intervals, which is arguably simpler with
i.i.d. variables than with Markov chains, and the unbiased property itself, which could be appealing in various
(a) Estimated posterior of the intercept α.
(b) Estimated posterior of the coefficient β_1.
Figure 6: In the logistic regression example of Section 5.3, histograms of the posterior distributions of the intercept
α (left) and of the first coefficient β_1 (right). Vertical bars indicate point estimates of posterior mass in each bin,
obtained with 1,000 unbiased HMC estimators, and 95% confidence intervals are represented by gray rectangles.
Red curves represent kernel density estimates computed from 10^5 HMC iterations, considered as the ground truth.
contexts.
To improve asymptotic efficiencies, random numbers of leap-frog steps, and adaptive selection of that number
based on the distance between the chains, would be interesting topics of research. A related question would be
the construction of unbiased estimators from the No-U-Turn sampler of Hoffman and Gelman [2014]. Finally, the
optimal weights described in Section 4.3 could potentially bring significant variance reduction in situations where
HMC chains exhibit significant autocorrelations.
Acknowledgement
Pierre E. Jacob gratefully acknowledges support by the National Science Foundation through grant DMS-1712872.
References
Beskos A., Pillai N., Roberts G., Sanz-Serna J.-M., and Stuart A. The acceptance probability of the Hybrid Monte Carlo method in high-dimensional problems. In AIP Conference Proceedings, volume 1281, pages 23–26. AIP, 2010.
Beskos A., Pillai N., Roberts G., Sanz-Serna J.-M., and Stuart A., 2013. Optimal tuning of the Hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534.
Brooks S. P., Gelman A., Jones G., and Meng X.-L., 2011. Handbook of Markov chain Monte Carlo. CRC Press.
Cances E., Legoll F., and Stoltz G., 2007. Theoretical and numerical comparison of some sampling methods for molecular dynamics. ESAIM: Mathematical Modelling and Numerical Analysis, 41(2):351–389.
Carpenter B., Gelman A., Hoffman M. D., Lee D., Goodrich B., Betancourt M., Brubaker M. A., Guo J., Li P., and Riddell A., 2016. Stan: a probabilistic programming language. Journal of Statistical Software, 20:1–37.
Casella G., Lavine M., and Robert C. P., 2001. Explaining the perfect sampler. The American Statistician, 55(4):299–305.
Duane S., Kennedy A. D., Pendleton B. J., and Roweth D., 1987. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.
Durmus A., Moulines E., and Saksman E., 2017. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166.
Girolami M. and Calderhead B., 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.
Glynn P. W. Exact simulation versus exact estimation. In Winter Simulation Conference (WSC), 2016, pages 193–205. IEEE, 2016.
Glynn P. W. and Rhee C.-H., 2014. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389.
Glynn P. W. and Whitt W., 1992. The asymptotic efficiency of simulation estimators. Operations Research, 40(3):505–520.
Hairer E., Wanner G., and Lubich C., 2005. Geometric numerical integration: structure-preserving algorithms for ordinary differential equations. Springer-Verlag, New York.
Hoffman M. D. and Gelman A., 2014. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
Huber M., 2016. Perfect simulation, volume 148. CRC Press.
Jacob P. E. and Thiery A. H., 2015. On non-negative unbiased estimators. The Annals of Statistics, 43(2):769–784.
Jacob P. E., Lindsten F., and Schön T. B., 2017a. Smoothing with couplings of conditional particle filters. arXiv preprint arXiv:1701.02002.
Jacob P. E., O'Leary J., and Atchadé Y. F., 2017b. Unbiased Markov chain Monte Carlo with couplings. arXiv preprint arXiv:1708.03625.
Leimkuhler B. and Matthews C., 2015. Molecular Dynamics. Springer-Verlag, New York.
Lelièvre T., Rousset M., and Stoltz G., 2010. Free Energy Computations: A Mathematical Perspective. Imperial College Press. ISBN 978-1-84816-248-8.
Livingstone S., Betancourt M., Byrne S., and Girolami M., 2016. On the geometric ergodicity of Hamiltonian Monte Carlo. arXiv preprint arXiv:1601.08057.
Livingstone S., Faulkner M. F., and Roberts G. O., 2017. Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. arXiv preprint arXiv:1706.02649.
Mangoubi O. and Smith A., 2017. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114.
Mykland P., Tierney L., and Yu B., 1995. Regeneration in Markov chain samplers. Journal of the American Statistical Association, 90(429):233–241.
Neal R. M., 1993. Bayesian learning via stochastic dynamics. Advances in Neural Information Processing Systems, pages 475–475.
Neal R. M. Circularly-coupled Markov chain sampling. Technical report 9910 (revised), Department of Statistics, University of Toronto, 2002.
Neal R. M., 2011. MCMC using Hamiltonian dynamics. Handbook of Markov chain Monte Carlo, 2(11).
Pakman A., 2012. tmg: truncated multivariate Gaussian sampling. CRAN. URL https://cran.r-project.org/package=tmg.
Pakman A. and Paninski L., 2014. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542.
Plummer M., Best N., Cowles K., and Vines K., 2006. CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11. URL https://journal.r-project.org/archive/.
Pollard D., 2005. Chapter 3: Total variation distance between measures. Asymptopia. URL http://www.stat.yale.edu/~pollard/Courses/607.spring05/handouts/Totalvar.pdf.
Rhee C.-H. Unbiased estimation with biased samplers. PhD thesis, Stanford University, 2013. URL http://purl.stanford.edu/nf154yt1415.
Rosenthal J. S., 1997. Faithful couplings of Markov chains: now equals forever. Advances in Applied Mathematics, 18(3):372–381. ISSN 0196-8858.
Rosenthal J. S., 2000. Parallel computing and Monte Carlo algorithms. Far East Journal of Theoretical Statistics, 4(2):207–236.
Thorisson H., 2000. Coupling, stationarity, and regeneration, volume 14. Springer, New York.
Tweedie R., 1983. The existence of moments for stationary Markov chains. Journal of Applied Probability, 20(1):191–196.
Vihola M., 2015. Unbiased estimators and multilevel Monte Carlo. arXiv preprint arXiv:1512.01022.
Williams D., 1991. Probability with martingales. Cambridge University Press.