Unbiased Hamiltonian Monte Carlo with couplings

Jeremy Heng∗ and Pierre E. Jacob∗

September 4, 2017

Abstract

We propose a coupling approach to parallelize Hamiltonian Monte Carlo estimators, following Jacob, O’Leary & Atchadé (2017). A simple coupling, obtained by using common initial velocities and common uniform variables for the acceptance steps, leads to pairs of Markov chains that contract, in the sense that the distance between them can become arbitrarily small. We show how this strategy can be combined with coupled random walk Metropolis–Hastings steps to enable exact meetings of the two chains, and in turn, unbiased estimators that can be computed in parallel and averaged. The resulting estimator is valid in the limit of the number of independent replicates, instead of the usual limit of the number of Markov chain iterations. We investigate the effect of tuning parameters, such as the number of leap-frog steps and the step size, on the estimator’s efficiency. The proposed methodology is demonstrated on a 250-dimensional Normal distribution, on a bivariate Normal truncated by linear and quadratic inequalities, and on a logistic regression with 300 covariates.

1 Introduction

1.1 Goal: parallel computation with Hamiltonian Monte Carlo

Hamiltonian Monte Carlo, also called Hybrid Monte Carlo (HMC), is a Markov chain Monte Carlo (MCMC) method to approximate integrals with respect to a target probability distribution $\pi$ on $\mathbb{R}^d$. Originally proposed by Duane et al. [1987] in the physics literature, it was later introduced in statistics by Neal [1993] and is now part of the standard toolbox [Brooks et al., 2011, Lelièvre et al., 2010], in part due to favorable scaling properties with respect to the dimension $d$ [Beskos et al., 2010, 2013], compared to e.g. random walk Metropolis–Hastings. Hamiltonian Monte Carlo is at the core of the No-U-Turn sampler (NUTS, Hoffman and Gelman [2014]) used in the software Stan [Carpenter et al., 2016]. As with any other MCMC method, HMC estimators are justified in the limit of the number of iterations. Algorithms which rely on such asymptotics face the risk of becoming obsolete if computational power keeps increasing through the number of available processors and not through clock speed. To address this issue, we propose to run pairs of HMC chains, for a random but finite number of iterations, and combine them in such a way that the resulting estimators are unbiased. One can then produce independent copies in parallel and average them to obtain estimators that are valid in the limit of the number of copies.

If the chains could be initialized from the target distribution, MCMC estimators would be unbiased, and one could simply average independent chains [Rosenthal, 2000]. Perfect samplers can be used for this purpose [Casella et al., 2001, Huber, 2016, Glynn, 2016]; more widely applicable approaches to unbiased estimation from MCMC samplers are proposed in e.g. Mykland et al. [1995], Neal [2002]. More recently, Jacob et al. [2017b] present an approach based on coupled Markov chains. The method builds upon Glynn and Rhee [2014], Jacob et al. [2017a] and other “debiasing” techniques [Jacob and Thiery, 2015, Vihola, 2015, Glynn, 2016], and leverages maximal couplings [Thorisson, 2000] of proposal and conditional distributions to remove the “burn-in bias” of Metropolis–Hastings and Gibbs chains respectively. The use of maximal couplings allows two chains initialized at different positions to coincide exactly after a random number of steps, referred to as the meeting time. Importantly, these constructions are applicable to continuous state spaces.

The present article proposes a combination of couplings to enable parallel computation for the Hamiltonian Monte Carlo sampler. We start by briefly recalling the unbiased estimators of Jacob et al. [2017b] in Section 1.2 and introducing some preliminary notation in Section 1.3. The R code producing the figures of this article is available on the GitHub account of the second author¹.

∗ Department of Statistics, Harvard University, USA. Emails: jjmheng@fas.harvard.edu & pjacob@fas.harvard.edu.

¹ Link: github.com/pierrejacob/debiasedhmc.


arXiv:1709.00404v1 [stat.CO] 1 Sep 2017

1.2 Context: unbiased estimation with couplings

Consider the task of approximating the integral $\pi(h) = \int h(x)\,\pi(dx) < \infty$, for a test function $h$ of interest. Let $X = (X_n)_{n \ge 0}$ denote a $\pi$-invariant MCMC chain associated with an initial distribution $\pi_0$ and transition kernel $P$, i.e. $X_0 \sim \pi_0$ and $X_n \sim P(X_{n-1}, \cdot)$ for all $n \ge 1$. Introduce another Markov chain $Y = (Y_n)_{n \ge 0}$ which has the same law as $X = (X_n)_{n \ge 0}$, so that $X_n$ and $Y_n$ have the same marginal distribution for all $n \ge 0$. We will write $\mathbb{P}$ to denote the law of the coupled chain $(X_n, Y_n)_{n \ge 0}$ and $\mathbb{E}$ to denote expectation with respect to $\mathbb{P}$. We now assume the following.

[A1] As $n \to \infty$, $\mathbb{E}[h(X_n)] \to \pi(h)$, and there exist $\iota > 0$ and $D < \infty$ such that for all $n \ge 0$, $\mathbb{E}[h(X_n)^{2+\iota}] < D$.

[A2] The meeting time $\tau = \inf\{n \ge 1 : X_n = Y_{n-1}\}$ is finite almost surely, and satisfies a geometric tail condition of the form $\mathbb{P}(\tau > n) \le C\gamma^n$ for all $n \ge 0$ and some constants $C < \infty$ and $\gamma \in (0,1)$.

[A3] The coupled chains are faithful [Rosenthal, 1997]: $X_n = Y_{n-1}$ for all $n \ge \tau$.

Under these assumptions, the random variable defined as

$$H_k(X, Y) = h(X_k) + \sum_{n=k}^{\max(k,\tau-1)} \{h(X_{n+1}) - h(Y_n)\}, \tag{1}$$

is an unbiased estimator of $\pi(h)$, for any choice of initial distribution $\pi_0$ and any $k \ge 0$. The first term above, $h(X_k)$, is in general biased since the chain $(X_n)_{n \ge 0}$ might not have reached stationarity by step $k$. As the second term is precisely such that $\mathbb{E}[H_k(X, Y)] = \pi(h)$, it is referred to as a correction term. If $k \ge \tau$, the correction term is zero. The estimator can be computed in $\max(\tau, k)$ steps, which has a finite expectation under A2.

We introduce another unbiased estimator, denoted by $\bar{H}_{k:m}$, defined for some integer $m > k$, resulting from averaging $H_\ell(X, Y)$ over $\ell \in \{k, \ldots, m\}$. By rearranging terms, we define

$$\bar{H}_{k:m}(X, Y) = \frac{1}{m-k+1} \sum_{n=k}^{m} h(X_n) + \frac{1}{m-k+1} \sum_{n=k}^{\max(m,\tau-1)} \min(n-k+1,\, m-k+1)\, \{h(X_{n+1}) - h(Y_n)\}, \tag{2}$$

which is unbiased and computable in $\max(\tau, m)$ steps. The first average above can be recognized as the usual MCMC estimator, obtained after $m$ iterations and discarding the first $k-1$ states. As before, the second term can be seen as a correction that removes the bias of the first term. On the event $\{k \ge \tau\}$, the correction term is equal to zero. We refer to Jacob et al. [2017b] for a more detailed discussion of (1)-(2), and guidelines for the choice of $k$ and $m$. Importantly, unbiased estimators can be produced independently in parallel and averaged, with direct computational gains on parallel computing architectures. Explicit constructions of pairs of Markov chains satisfying A1-A3 based on Metropolis–Hastings and Gibbs samplers are given in Jacob et al. [2017b]. Here we propose coupling strategies for HMC chains, so as to enable the unbiased estimators of (1)-(2). The main challenge lies in A2, for which two coupled chains have to meet exactly after a “Geometric” number of steps.
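As an illustration, the two estimators can be computed from stored trajectories as in the following minimal sketch. The function names `H_k` and `H_bar` and the toy trajectories are assumptions made for this example and are not taken from the authors' R code. Since $X_{n+1} = Y_n$ for all $n \ge \tau - 1$, the correction sums are truncated at $n = \tau - 2$, which leaves their values unchanged.

```python
def H_k(h, X, Y, k, tau):
    """Estimator (1), with X[n] = X_n and Y[n] = Y_n stored as lists."""
    # Terms with n >= tau - 1 vanish because X_{n+1} = Y_n from the meeting
    # time onwards, so the sum only runs over n = k, ..., tau - 2.
    correction = sum(h(X[n + 1]) - h(Y[n]) for n in range(k, tau - 1))
    return h(X[k]) + correction


def H_bar(h, X, Y, k, m, tau):
    """Time-averaged estimator (2); needs X_0, ..., X_max(m, tau - 1)."""
    width = m - k + 1
    mcmc_average = sum(h(X[n]) for n in range(k, m + 1)) / width
    correction = sum(
        min(n - k + 1, width) * (h(X[n + 1]) - h(Y[n]))
        for n in range(k, tau - 1)
    ) / width
    return mcmc_average + correction


# Toy trajectories (made-up numbers) with X_n = Y_{n-1} for all n >= tau = 4.
X = [5.0, 3.0, 2.5, 2.0, 1.0, 1.5, 0.5, 0.7]
Y = [4.0, 2.0, 2.2, 1.0, 1.5, 0.5, 0.7]
tau = 4
h = lambda x: x
k, m = 1, 6

# (2) should coincide with the plain average of H_k, ..., H_m.
average_of_H = sum(H_k(h, X, Y, l, tau) for l in range(k, m + 1)) / (m - k + 1)
print(H_bar(h, X, Y, k, m, tau), average_of_H)
```

The final line prints the same value twice up to rounding, which checks the rearrangement of terms leading from (1) to (2).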

1.3 Notation and plan

The set of integers $\{a, \ldots, b\}$ for $a \le b$ is written as $[a:b]$. The set of non-negative real numbers is denoted by $\mathbb{R}_+$. The vectors $0_d$ and $1_d$ refer to $d$-dimensional vectors of zeros and ones respectively. The matrix $I_d$ is the identity matrix of size $d \times d$. The norm of a vector $x \in \mathbb{R}^d$ is written as $|x| = (\sum_{i=1}^d x_i^2)^{1/2}$. The transpose of a vector $x \in \mathbb{R}^d$ and matrix $A \in \mathbb{R}^{d \times p}$ are denoted by $x^T$ and $A^T$ respectively. The gradient of a function $(x, y) \mapsto f(x, y)$ with respect to $x$ (resp. $y$) is denoted by $\nabla_x f$ (resp. $\nabla_y f$). The Hessian of a real-valued function $f$ is denoted by $\nabla^2 f$. The Borel $\sigma$-algebra of $\mathbb{R}^d$ is denoted by $\mathcal{B}(\mathbb{R}^d)$ and the Lebesgue measure on $\mathbb{R}^d$ by $\mathrm{Leb}_d$. The Normal distribution with mean $\mu$ and covariance matrix $\Sigma$ is denoted by $\mathcal{N}(\mu, \Sigma)$ and its density by $x \mapsto \mathcal{N}(x; \mu, \Sigma)$. The Uniform distribution on the interval $[0,1]$ is $\mathcal{U}[0,1]$. The total variation distance $d_{TV}$ between two distributions, with densities $p$ and $q$, is defined as $d_{TV}(p, q) = \frac{1}{2} \int |p(x) - q(x)|\,dx$.

The rest of the article is structured as follows. Section 2 describes Hamiltonian dynamics for coupled trajectories. Section 3 introduces a simple coupling of Hamiltonian Monte Carlo chains, which satisfies a relaxed meeting time assumption similar to A2. Section 4 then combines HMC kernels with random walk Metropolis–Hastings kernels, to ensure that chains meet exactly and satisfy A2. Section 5 contains simulation results on a 250-dimensional Normal target, a truncated Normal distribution and a logistic regression with 300 covariates, and Section 6 concludes.


2 Hamiltonian dynamics for pairs of particles

2.1 Hamiltonian ﬂows and extended target

We now suppose that the target distribution has the form

$$\pi(dq) \propto \exp(-U(q))\,dq,$$

where the potential $U : \mathbb{R}^d \to \mathbb{R}_+$ is twice continuously differentiable and its gradient $\nabla U$ is globally $\beta$-Lipschitz, i.e. there exists $\beta > 0$ such that

$$|\nabla U(q) - \nabla U(q')| \le \beta\,|q - q'|,$$

for all $q, q' \in \mathbb{R}^d$. We now introduce Hamiltonian flows on a phase space $\mathbb{R}^{2d}$, which consists of position variables $q \in \mathbb{R}^d$ and velocity variables $p \in \mathbb{R}^d$. We will be concerned with a Hamiltonian function $E : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ of the form

$$E(q, p) = U(q) + \frac{1}{2}|p|^2.$$

We note the use of an identity mass matrix here and defer to preconditioning as a means to incorporate any knowledge of the curvature properties of $\pi$. The time evolution of a particle $(q(t), p(t))_{t \in \mathbb{R}_+}$ under Hamiltonian dynamics is described by the autonomous system of ordinary differential equations

$$\begin{aligned}
\frac{d}{dt} q(t) &= \nabla_p E(q(t), p(t)) = p(t), \\
\frac{d}{dt} p(t) &= -\nabla_q E(q(t), p(t)) = -\nabla U(q(t)).
\end{aligned} \tag{3}$$

Under the above assumptions on $U$, (3) with an initial condition $(q(0), p(0)) = (q_0, p_0) \in \mathbb{R}^d \times \mathbb{R}^d$ admits a unique solution globally on $\mathbb{R}_+$ [Lelièvre et al., 2010, p. 14]. We will write the flow map as $\Phi_t(q_0, p_0) = (q(t), p(t))$ for any $t \in \mathbb{R}_+$, and $\Phi_t^\circ(q_0, p_0) = q(t)$ and $\Phi_t^*(q_0, p_0) = p(t)$ as its projection onto the position and velocity coordinates respectively. It is worth recalling that Hamiltonian flows have the following properties.

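[P1] (Reversibility) $\Phi_t^{-1} = M \circ \Phi_t \circ M$ where $M(q, p) := (q, -p)$ denotes velocity reversal;

[P2] (Energy conservation) $E \circ \Phi_t = E$ for any $t \in \mathbb{R}_+$;

[P3] (Volume preservation) $\mathrm{Leb}_{2d}(\Phi_t(A)) = \mathrm{Leb}_{2d}(A)$ for any $A \in \mathcal{B}(\mathbb{R}^d \times \mathbb{R}^d)$.

It follows from P2 and P3 that the extended target distribution on phase space,

$$\tilde{\pi}(dq, dp) \propto \exp(-E(q, p))\,dq\,dp,$$

is invariant under the Markov semi-group induced by the flow, i.e. the pushforward measure $\Phi_t \sharp \tilde{\pi}$ defined by $\Phi_t \sharp \tilde{\pi}(A) = \tilde{\pi}(\Phi_t^{-1}(A))$ for $A \in \mathcal{B}(\mathbb{R}^{2d})$ is equal to $\tilde{\pi}$ for any $t \in \mathbb{R}_+$.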

2.2 Coupled Hamiltonian dynamics

Following Section 1.2, we now consider the coupling of two particles $(q^i(t), p^i(t))_{t \in \mathbb{R}_+}$, $i = 1, 2$, evolving under (3) with initial conditions $(q^i(0), p^i(0)) = (q_0^i, p_0^i)$, $i = 1, 2$. We first draw some insights from a Gaussian example.

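Example 1. Let $\pi$ be a Gaussian distribution on $\mathbb{R}$ with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 \in \mathbb{R}_+$, in which case $U(q) = (q - \mu)^2 / (2\sigma^2)$ and $\nabla U(q) = (q - \mu)/\sigma^2$. Then the solution of (3) is given by

$$\Phi_t(q_0, p_0) = \begin{pmatrix} \mu + (q_0 - \mu)\cos(t/\sigma) + \sigma p_0 \sin(t/\sigma) \\ p_0 \cos(t/\sigma) - \frac{1}{\sigma}(q_0 - \mu)\sin(t/\sigma) \end{pmatrix},$$

see e.g. Neal [2011]. Hence the difference between the positions is given by

$$q^1(t) - q^2(t) = (q_0^1 - q_0^2)\cos(t/\sigma) + \sigma(p_0^1 - p_0^2)\sin(t/\sigma).$$

Observe that if we set $p_0^1 = p_0^2$, then

$$|q^1(t) - q^2(t)| = |q_0^1 - q_0^2|\,|\cos(t/\sigma)|,$$

so the particles meet exactly whenever $t = (2a+1)\pi\sigma/2$, and contraction occurs for any $t \ne \pi a \sigma$, for any non-negative integer $a$.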
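The closed-form flow of Example 1 can be checked numerically. The sketch below is an illustration only: the function name `gaussian_flow` and the numerical values are arbitrary choices made here. It evaluates the exact flow for two particles sharing the same initial velocity and compares the distance between their positions with $|q_0^1 - q_0^2|\,|\cos(t/\sigma)|$.

```python
import math


def gaussian_flow(q0, p0, t, mu=0.0, sigma=1.0):
    """Exact Hamiltonian flow for the potential U(q) = (q - mu)^2 / (2 sigma^2)."""
    c, s = math.cos(t / sigma), math.sin(t / sigma)
    q = mu + (q0 - mu) * c + sigma * p0 * s
    p = p0 * c - (q0 - mu) * s / sigma
    return q, p


mu, sigma = 1.0, 2.0
q1_0, q2_0, p0 = 3.0, -1.5, 0.4  # two initial positions, one common velocity

# The last trajectory length, pi * sigma / 2, is one at which the particles meet.
for t in (0.5, 1.0, math.pi * sigma / 2):
    q1, _ = gaussian_flow(q1_0, p0, t, mu, sigma)
    q2, _ = gaussian_flow(q2_0, p0, t, mu, sigma)
    predicted = abs(q1_0 - q2_0) * abs(math.cos(t / sigma))
    print(t, abs(q1 - q2), predicted)
```

Each printed line shows the trajectory length, the observed distance and the predicted distance, which agree up to floating-point rounding.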


This example motivates a coupling that simply assigns particles the same initial velocity. Moreover, it also reveals that certain trajectory lengths will result in stronger contractions than others. We now examine the utility of this approach more generally. Define $\Delta(t) = q^1(t) - q^2(t)$ as the difference between the particle locations and note that

$$\frac{1}{2}\frac{d}{dt}|\Delta(t)|^2 = \Delta(t)^T \left(p^1(t) - p^2(t)\right).$$

Therefore by imposing that $p^1(0) = p^2(0)$, the function $t \mapsto |\Delta(t)|$ admits a stationary point at time $t = 0$. This is geometrically intuitive as the trajectories at time zero are parallel to one another for an infinitesimally small amount of time. To characterize this stationary point, we compute

$$\frac{1}{2}\frac{d^2}{dt^2}|\Delta(t)|^2 = -\Delta(t)^T \left(\nabla U(q^1(t)) - \nabla U(q^2(t))\right) + |p^1(t) - p^2(t)|^2.$$

If we assume that the potential $U$ is $\alpha$-strongly convex in an open set $S \in \mathcal{B}(\mathbb{R}^d)$, i.e. there exists $\alpha > 0$ such that

$$(\nabla U(q) - \nabla U(q'))^T (q - q') \ge \alpha\,|q - q'|^2,$$

for all $q, q' \in S$, then

$$\frac{1}{2}\frac{d^2}{dt^2}|\Delta(t)|^2 \le -\alpha\,|q^1(t) - q^2(t)|^2 + |p^1(t) - p^2(t)|^2. \tag{4}$$

Therefore by the second derivative test, $t = 0$ is a strict local maximum point if $q_0^1, q_0^2 \in S$. Using continuity of $t \mapsto |\Delta(t)|^2$, it follows that there exist $\tilde{t} > 0$ and $\rho < 1$ such that

$$|\Phi_t^\circ(q_0^1, p_0) - \Phi_t^\circ(q_0^2, p_0)| < \rho\,|q_0^1 - q_0^2|,$$

for $t \in (0, \tilde{t})$. We note the dependence of $\tilde{t}$ and $\rho$ on the initial positions $(q_0^1, q_0^2)$ and velocity $p_0$. We now strengthen the above claim.

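Lemma 1. Suppose that the potential $U$ is twice continuously differentiable, $\alpha$-strongly convex on $S \in \mathcal{B}(\mathbb{R}^d)$ and its gradient $\nabla U$ is globally $\beta$-Lipschitz. For any compact set $C \subset S \times S \times \mathbb{R}^d$, there exist $\tilde{t} > 0$ and $\rho < 1$ such that

$$|\Phi_t^\circ(q_0^1, p_0) - \Phi_t^\circ(q_0^2, p_0)| \le \rho\,|q_0^1 - q_0^2|, \tag{5}$$

for all $(q_0^1, q_0^2, p_0) \in C$ and $t \in (0, \tilde{t})$.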

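Proof. Take $(q_0^1, q_0^2, p_0) \in C$. Applying Taylor’s theorem on $\Delta(t)$ around $t = 0$ gives

$$\Delta(t) = \Delta(0) - \frac{1}{2}t^2 G_0 - \frac{1}{6}t^3 G_*,$$

for some $t_* \in (0, t)$, where $G_0 := \nabla U(q_0^1) - \nabla U(q_0^2)$ and $G_* := \nabla^2 U(q^1(t_*))\,p^1(t_*) - \nabla^2 U(q^2(t_*))\,p^2(t_*)$. We will control each term of the expansion

$$|\Delta(t)|^2 = |\Delta(0)|^2 - t^2 \Delta(0)^T G_0 - \frac{1}{3}t^3 \Delta(0)^T G_* + \frac{1}{4}t^4 |G_0|^2 + \frac{1}{6}t^5 G_0^T G_* + \frac{1}{36}t^6 |G_*|^2.$$

Using strong convexity, the Lipschitz assumption and Young’s inequality,

$$|\Delta(t)|^2 \le \left(1 - \alpha t^2 + \frac{1}{6}t^3 + \frac{1}{4}\beta^2 t^4 + \frac{1}{12}\beta^2 t^5\right)|\Delta(0)|^2 + \left(\frac{1}{6}t^3 + \frac{1}{12}t^5 + \frac{1}{36}t^6\right)|G_*|^2.$$

Note that by Young’s inequality and the Lipschitz assumption,

$$\begin{aligned}
|G_*|^2 &\le 2\,\|\nabla^2 U(q^1(t_*))\|_2^2\,|p^1(t_*)|^2 + 2\,\|\nabla^2 U(q^2(t_*))\|_2^2\,|p^2(t_*)|^2 \\
&\le 2\beta^2 \left(|\Phi_{t_*}^*(q_0^1, p_0)|^2 + |\Phi_{t_*}^*(q_0^2, p_0)|^2\right) \\
&\le 2\beta^2 \sup_{(q_0^1, q_0^2, p_0) \in C} \left(|\Phi_{t_*}^*(q_0^1, p_0)|^2 + |\Phi_{t_*}^*(q_0^2, p_0)|^2\right),
\end{aligned}$$

where $\|\cdot\|_2$ denotes the spectral norm. The above supremum is attained by continuity of the mapping $(q, p) \mapsto \Phi_{t_*}^*(q, p)$. The claim (5) follows by combining both inequalities and taking $t$ sufficiently small.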


3 Hamiltonian Monte Carlo

3.1 Leap frog integrator

As the flow defined by (3) is typically intractable, one has to resort to time discretization. The leap-frog symplectic integrator is a standard choice as it preserves P1 and P3. Given a step size $\varepsilon > 0$ and a number of leap-frog steps $L \in \mathbb{N}$, this scheme initializes at $(q_0, p_0) \in \mathbb{R}^d \times \mathbb{R}^d$ and iterates

$$\begin{aligned}
p_{\ell+1/2} &= p_\ell - \frac{\varepsilon}{2}\nabla U(q_\ell), \\
q_{\ell+1} &= q_\ell + \varepsilon\,p_{\ell+1/2}, \\
p_{\ell+1} &= p_{\ell+1/2} - \frac{\varepsilon}{2}\nabla U(q_{\ell+1}),
\end{aligned}$$

for $\ell \in [0 : L-1]$. We write the leap-frog iteration as $\hat{\Phi}_\varepsilon(q_\ell, p_\ell) = (q_{\ell+1}, p_{\ell+1})$ and the corresponding approximation of the flow as $\hat{\Phi}_{\varepsilon,\ell}(q_0, p_0) = (q_\ell, p_\ell)$ for $\ell \in [1 : L]$. As before, we denote by $\hat{\Phi}_{\varepsilon,\ell}^\circ(q_0, p_0) = q_\ell$ and $\hat{\Phi}_{\varepsilon,\ell}^*(q_0, p_0) = p_\ell$ the projections onto the position and velocity coordinates respectively. The leap-frog scheme is of order two [Hairer et al., 2005, Theorem 3.4]: for sufficiently small $\varepsilon$, we have both

$$|\hat{\Phi}_{\varepsilon,L}(q_0, p_0) - \Phi_{\varepsilon L}(q_0, p_0)| \le C_1 \varepsilon^2, \tag{6}$$

and

$$|E(\hat{\Phi}_{\varepsilon,L}(q_0, p_0)) - E(q_0, p_0)| \le C_2 \varepsilon^2, \tag{7}$$

for some constants $C_1, C_2 > 0$. Given the nature of Hamiltonian dynamics, the constant $C_1$ will typically grow exponentially with the number of leap-frog iterations $L$ [Leimkuhler and Matthews, 2015, Section 2.2.3]. Under appropriate assumptions, the constant $C_2$ on the other hand can be shown to be stable over exponentially long time intervals [Hairer et al., 2005, Theorem 8.1]. The Hamiltonian is not exactly conserved under time discretization, but one can employ a Metropolis–Hastings correction as described in the following section.
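The leap-frog scheme and the bounds (6)-(7) can be illustrated on a univariate standard Normal target, for which the exact flow is known from Example 1. This is a minimal sketch under that assumption: the potential $U(q) = q^2/2$, the function names and the numerical values are choices made here for illustration only.

```python
import math


def leapfrog(q, p, eps, L, grad_U):
    """Apply L leap-frog steps of size eps, starting from the scalar state (q, p)."""
    for _ in range(L):
        p_half = p - 0.5 * eps * grad_U(q)   # half step on the velocity
        q = q + eps * p_half                 # full step on the position
        p = p_half - 0.5 * eps * grad_U(q)   # half step on the velocity
    return q, p


grad_U = lambda q: q                             # gradient of U(q) = q^2 / 2
energy = lambda q, p: 0.5 * q * q + 0.5 * p * p  # E(q, p) = U(q) + |p|^2 / 2


def position_error(eps, L, q0=1.0, p0=0.5):
    """Distance between the leap-frog position and the exact flow at time eps * L."""
    qL, _ = leapfrog(q0, p0, eps, L, grad_U)
    T = eps * L
    exact = q0 * math.cos(T) + p0 * math.sin(T)
    return abs(qL - exact)


q0, p0 = 1.0, 0.5
qL, pL = leapfrog(q0, p0, 0.1, 10, grad_U)

# Reversibility (P1): flipping the velocity and integrating again leads back.
q_back, p_back = leapfrog(qL, -pL, 0.1, 10, grad_U)
print("reversibility error:", abs(q_back - q0), abs(-p_back - p0))

# Energy error as in (7), and order-two behaviour as in (6): halving eps at a
# fixed trajectory length should divide the position error by roughly four.
print("energy error:", abs(energy(qL, pL) - energy(q0, p0)))
print("error ratio:", position_error(0.1, 10) / position_error(0.05, 20))
```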

3.2 Hamiltonian Monte Carlo kernel

Hamiltonian Monte Carlo [HMC, Neal, 1993, Duane et al., 1987] is a Metropolis–Hastings (MH) algorithm on phase space that targets $\tilde{\pi}$ with the time discretized Hamiltonian dynamics $\hat{\Phi}_{\varepsilon,L}(q_0, p_0) = (q_L, p_L)$ as a proposal. From a state $(Q_n, P_n) \in \mathbb{R}^d \times \mathbb{R}^d$, at iteration $n \ge 0$,

1. sample a velocity $P_n^* \sim \mathcal{N}(0_d, I_d)$, independently of other variables, and set $(q_0, p_0) = (Q_n, P_n^*)$;

2. perform leap-frog integration to obtain $(q_L, p_L) = \hat{\Phi}_{\varepsilon,L}(q_0, p_0)$;

3. with probability $\alpha((q_0, p_0), (q_L, p_L))$, set $(Q_{n+1}, P_{n+1}) = (q_L, -p_L)$, otherwise set $(Q_{n+1}, P_{n+1}) = (Q_n, P_n)$.

Since the leap-frog integrator preserves P1 and P3, the MH acceptance probability is given by

$$\alpha((q, p), (q', p')) = \min\left(1, \exp\left(E(q, p) - E(q', p')\right)\right), \tag{8}$$

for $(q, p), (q', p') \in \mathbb{R}^d \times \mathbb{R}^d$. As this constructs a $\tilde{\pi}$-invariant Markov chain $(Q_n, P_n)_{n \ge 0}$ on phase space, the marginal chain $(Q_n)_{n \ge 0}$ is a $\pi$-invariant Markov chain. We can write the Markov transition kernel of the marginal chain as

$$\begin{aligned}
K_{\varepsilon,L}(q, A) = {} & \int_{\mathbb{R}^d} \mathbb{I}_A\left(\hat{\Phi}_{\varepsilon,L}^\circ(q, p)\right) \alpha\left((q, p), \hat{\Phi}_{\varepsilon,L}(q, p)\right) \mathcal{N}(p; 0_d, I_d)\,dp \\
& + \delta_q(A) \int_{\mathbb{R}^d} \left\{1 - \alpha\left((q, p), \hat{\Phi}_{\varepsilon,L}(q, p)\right)\right\} \mathcal{N}(p; 0_d, I_d)\,dp,
\end{aligned} \tag{9}$$

for $q \in \mathbb{R}^d$, $A \in \mathcal{B}(\mathbb{R}^d)$. Irreducibility and geometric ergodicity of $K_{\varepsilon,L}$ have recently been established rigorously in Durmus et al. [2017]; see also Cances et al. [2007], Livingstone et al. [2016] for previous works. These results can be used to verify A1 in Section 1.2.


3.3 Coupled Hamiltonian Monte Carlo kernel

Similarly to Section 2.2, we now consider coupling two HMC chains $(Q_n^i, P_n^i)_{n \ge 0}$, $i = 1, 2$, using the following procedure. From two states $(Q_n^i, P_n^i)$, $i = 1, 2$, at iteration $n \ge 0$,

1. sample a velocity $P_n^* \sim \mathcal{N}(0_d, I_d)$, independently of other variables, and for $i = 1, 2$, set $(q_0^i, p_0^i) = (Q_n^i, P_n^*)$;

2. for $i = 1, 2$, perform leap-frog integration to obtain $(q_L^i, p_L^i) = \hat{\Phi}_{\varepsilon,L}(q_0^i, p_0^i)$;

3. sample $U \sim \mathcal{U}[0,1]$;

4. for $i = 1, 2$, if $U \le \alpha((q_0^i, p_0^i), (q_L^i, p_L^i))$, set $(Q_{n+1}^i, P_{n+1}^i) = (q_L^i, -p_L^i)$, otherwise set $(Q_{n+1}^i, P_{n+1}^i) = (Q_n^i, P_n^i)$.

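The above procedure amounts to running two HMC chains with common random numbers. We denote the associated coupled transition kernel on the position coordinates as $\bar{K}_{\varepsilon,L}((q^1, q^2), A_1 \times A_2)$ for $q^1, q^2 \in \mathbb{R}^d$ and $A_1, A_2 \in \mathcal{B}(\mathbb{R}^d)$. Marginally we have $\bar{K}_{\varepsilon,L}((q^1, q^2), A_1 \times \mathbb{R}^d) = K_{\varepsilon,L}(q^1, A_1)$ and $\bar{K}_{\varepsilon,L}((q^1, q^2), \mathbb{R}^d \times A_2) = K_{\varepsilon,L}(q^2, A_2)$. We suppose that $(Q_0^1, Q_0^2)$ are initialized according to $\pi_0$ independently, and $(P_0^1, P_0^2)$ with an arbitrary distribution on $\mathbb{R}^{2d}$. We will write $\mathbb{P}_{\varepsilon,L}$ as the law of the coupled HMC chains $(Q_n^i, P_n^i)_{n \ge 0}$, $i = 1, 2$, and $\mathbb{E}_{\varepsilon,L}$ to denote expectation with respect to $\mathbb{P}_{\varepsilon,L}$.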
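The four steps of the coupled procedure can be sketched as follows for a scalar position. The standard Normal target, the function names and the optional arguments `p` and `u` (which allow the common velocity and the common uniform to be fixed by hand) are assumptions made for this illustration; velocities are not returned because they are refreshed at the start of every iteration.

```python
import math
import random


def U(q):
    """Potential of a hypothetical target: standard Normal on the real line."""
    return 0.5 * q * q


def grad_U(q):
    return q


def leapfrog(q, p, eps, L):
    for _ in range(L):
        p -= 0.5 * eps * grad_U(q)
        q += eps * p
        p -= 0.5 * eps * grad_U(q)
    return q, p


def coupled_hmc_step(q1, q2, eps, L, rng, p=None, u=None):
    """One coupled HMC transition on the positions, using common random numbers."""
    p = rng.gauss(0.0, 1.0) if p is None else p   # step 1: common velocity
    u = rng.random() if u is None else u          # step 3: common uniform
    new_positions = []
    for q in (q1, q2):
        qL, pL = leapfrog(q, p, eps, L)           # step 2: leap-frog proposal
        energy_drop = (U(q) + 0.5 * p * p) - (U(qL) + 0.5 * pL * pL)
        accept_prob = math.exp(min(0.0, energy_drop))          # as in (8)
        new_positions.append(qL if u <= accept_prob else q)    # step 4
    return tuple(new_positions)


rng = random.Random(1)
q1, q2 = 3.0, -3.0
for n in range(30):
    q1, q2 = coupled_hmc_step(q1, q2, eps=0.1, L=10, rng=rng)
print("distance after 30 coupled iterations:", abs(q1 - q2))
```

For this quadratic potential the leap-frog map is linear in the position, so when both proposals are accepted the distance is multiplied by a factor close to $|\cos(\varepsilon L)|$, in line with Example 1.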

We now establish that the relaxed meeting time $\tau_\delta = \inf\{n \ge 0 : |Q_n^1 - Q_n^2| \le \delta\}$ for any $\delta > 0$ has a geometric tail. The following result can be used to establish A2 for the algorithm that will be introduced in the next section.

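Theorem 1. Suppose that the potential $U$ is twice continuously differentiable, the gradient of $U$ is globally $\beta$-Lipschitz and there exists a compact set $S \in \mathcal{B}(\mathbb{R}^d)$ with $\mathrm{Leb}_d(S) > 0$ such that the restriction of $U$ to $S$, denoted by $U|_S : S \to \mathbb{R}$, is $\alpha$-strongly convex. Then there exist $\tilde{\varepsilon} > 0$, $\tilde{L} \in \mathbb{N}$, $C \in \mathbb{R}_+$ and $\gamma \in (0, 1)$ such that

$$\mathbb{P}_{\varepsilon,L}(\tau_\delta > n) \le C\gamma^n, \quad n \in \mathbb{N}, \tag{10}$$

for any $\varepsilon < \tilde{\varepsilon}$ and $L > \tilde{L}$ satisfying $\varepsilon L < \tilde{\varepsilon}\tilde{L}$.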

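Proof. We first establish that the coupled HMC kernel is $\mathrm{Leb}_{2d}$-irreducible by adapting the arguments in Durmus et al. [2017, proof of Theorem 2] to our coupling. Under the Lipschitz assumption on $\nabla U$, the arguments in Durmus et al. [2017, proof of Theorem 2] imply that for any $L \in \mathbb{N}$, there exists $\tilde{\varepsilon}_L > 0$ such that the mapping $p \mapsto \hat{\Phi}_{\varepsilon,L}^\circ(q, p)$ is a continuously differentiable diffeomorphism from $\mathbb{R}^d$ to $\mathbb{R}^d$ for $q \in \mathbb{R}^d$ and $\varepsilon < \tilde{\varepsilon}_L$. Hence the mapping

$$p \mapsto \bar{\Phi}_{\varepsilon,L}(q, q', p) := \left(\hat{\Phi}_{\varepsilon,L}^\circ(q, p),\, \hat{\Phi}_{\varepsilon,L}^\circ(q', p)\right)$$

from $\mathbb{R}^d$ to $\mathbb{R}^{2d}$ is also a continuously differentiable diffeomorphism for $(q, q') \in \mathbb{R}^{2d}$ and $\varepsilon < \tilde{\varepsilon}_L$. Writing $\bar{\Phi}_{\varepsilon,L}^{-1} : \mathbb{R}^{2d} \to \mathbb{R}^d$ as the inverse function, by a change of variables,

$$\begin{aligned}
\bar{K}_{\varepsilon,L}\left((q^1, q^2), A\right) &\ge \int_{\mathbb{R}^d} \int_0^1 \mathbb{I}_A\left(\hat{\Phi}_{\varepsilon,L}^\circ(q^1, p),\, \hat{\Phi}_{\varepsilon,L}^\circ(q^2, p)\right) \prod_{i=1}^{2} \mathbb{I}\left\{u \le \alpha\left((q^i, p), \hat{\Phi}_{\varepsilon,L}(q^i, p)\right)\right\} \mathcal{N}(p; 0_d, I_d)\,du\,dp \\
&= \int_{\mathbb{R}^d} \int_0^1 \mathbb{I}_A(\bar{q}) \prod_{i=1}^{2} \mathbb{I}\left\{u \le \alpha\left((q^i, \bar{\Phi}_{\varepsilon,L}^{-1}(\bar{q})), \hat{\Phi}_{\varepsilon,L}(q^i, \bar{\Phi}_{\varepsilon,L}^{-1}(\bar{q}))\right)\right\} \mathcal{N}\left(\bar{\Phi}_{\varepsilon,L}^{-1}(\bar{q}); 0_d, I_d\right) \left|\det J_{\bar{\Phi}_{\varepsilon,L}^{-1}}(\bar{q})\right| du\,d\bar{q} \\
&\ge \mathrm{Leb}_{2d}(A) \inf_{\bar{q} \in A} \min_{i=1,2} \alpha\left((q^i, \bar{\Phi}_{\varepsilon,L}^{-1}(\bar{q})), \hat{\Phi}_{\varepsilon,L}(q^i, \bar{\Phi}_{\varepsilon,L}^{-1}(\bar{q}))\right) \mathcal{N}\left(\bar{\Phi}_{\varepsilon,L}^{-1}(\bar{q}); 0_d, I_d\right) \left|\det J_{\bar{\Phi}_{\varepsilon,L}^{-1}}(\bar{q})\right|,
\end{aligned}$$

for all $A \in \mathcal{B}(\mathbb{R}^{2d})$, where $J_{\bar{\Phi}_{\varepsilon,L}^{-1}}$ denotes the Jacobian matrix of $\bar{\Phi}_{\varepsilon,L}^{-1}$ (with the convention $0 \times +\infty = 0$). It follows that $\bar{K}_{\varepsilon,L}$ is aperiodic and irreducible with respect to the Lebesgue measure on $\mathbb{R}^{2d}$.

For any real-valued measurable function $f : \Omega \to \mathbb{R}$, we write its level sets as $L_f(\ell) = \{x \in \Omega : f(x) \le \ell\}$ for $\ell \in \mathbb{R}$. Define the kinetic energy function $K(p) = |p|^2/2$, the levels $\underline{U} > \inf_{q \in S} U(q)$ and $\bar{U} < \sup_{q \in S} U(q)$ such that $\underline{U} < \bar{U}$, and the sets $C_\ell = L_{U|_S}(\ell) \times L_K(\bar{U} - \ell) \subset L_E(\bar{U})$ and $\tilde{C}_\ell = L_{U|_S}(\ell) \times L_{U|_S}(\ell) \times L_K(\bar{U} - \ell)$ for $\ell \in (\underline{U}, \bar{U})$. Since $\mathrm{Leb}_d(L_{U|_S}(\ell)) > 0$ for $\ell \in (\underline{U}, \bar{U})$ under the assumptions on $U$, $\mathrm{Leb}_{2d}$-irreducibility of $\bar{K}_{\varepsilon,L}$ implies that for any $L \in \mathbb{N}$ and $\varepsilon < \tilde{\varepsilon}_L$, there exists $N \in \mathbb{N}$ such that

$$\mathbb{P}_{\varepsilon,L}\left(Q_N^1 \in L_{U|_S}(\ell),\, Q_N^2 \in L_{U|_S}(\ell)\right) > 0.$$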


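When both chains enter the set $L_{U|_S}(\ell)$, it follows from Lemma 1 that there exist $\tilde{T} > 0$ and $\rho_0 < 1$ such that

$$|\Phi_T^\circ(Q_N^1, P_N^*) - \Phi_T^\circ(Q_N^2, P_N^*)| \le \rho_0\,|Q_N^1 - Q_N^2|,$$

for all $(Q_N^1, Q_N^2, P_N^*) \in \tilde{C}_\ell$ and $T < \tilde{T}$. Hence we have

$$\mathbb{P}_{\varepsilon,L}\left(|\Phi_T^\circ(Q_N^1, P_N^*) - \Phi_T^\circ(Q_N^2, P_N^*)| \le \rho_0\,|Q_N^1 - Q_N^2| \;\middle|\; Q_N^1 \in L_{U|_S}(\ell),\, Q_N^2 \in L_{U|_S}(\ell)\right) > 0.$$

By triangle inequality, consistency of the leap-frog integrator (6) and compactness of $\tilde{C}_\ell$, there exist $\varepsilon_0 \le \tilde{\varepsilon}_L$, $L_0 \in \mathbb{N}$ and $\rho_1 < 1$ such that

$$\mathbb{P}_{\varepsilon,L}\left(|\hat{\Phi}_{\varepsilon,L}^\circ(Q_N^1, P_N^*) - \hat{\Phi}_{\varepsilon,L}^\circ(Q_N^2, P_N^*)| \le \rho_1\,|Q_N^1 - Q_N^2| \;\middle|\; Q_N^1 \in L_{U|_S}(\ell),\, Q_N^2 \in L_{U|_S}(\ell)\right) > 0,$$

for $\varepsilon < \varepsilon_0$ and $L > L_0$ satisfying $\varepsilon L = T$. Again by consistency of the leap-frog integrator (7) and compactness of $C_\ell$, it follows from (8) that there exist $\varepsilon_1 \le \varepsilon_0$, $L_1 \ge L_0$ and $\eta_0 < 1/2$ such that

$$\mathbb{P}_{\varepsilon,L}\left(Q_{N+1}^i = \hat{\Phi}_{\varepsilon,L}^\circ(Q_N^i, P_N^*) \;\middle|\; (Q_N^i, P_N^*) \in C_\ell\right) \ge 1 - \eta_0,$$

for $i = 1, 2$ and $\varepsilon < \varepsilon_1$, $L > L_1$ satisfying $\varepsilon L = T$. By Fréchet’s inequality, the probability of accepting both proposals satisfies

$$\mathbb{P}_{\varepsilon,L}\left(Q_{N+1}^1 = \hat{\Phi}_{\varepsilon,L}^\circ(Q_N^1, P_N^*),\, Q_{N+1}^2 = \hat{\Phi}_{\varepsilon,L}^\circ(Q_N^2, P_N^*) \;\middle|\; (Q_N^1, Q_N^2, P_N^*) \in \tilde{C}_\ell\right) > 0,$$

therefore

$$\mathbb{P}_{\varepsilon,L}\left(|Q_{N+1}^1 - Q_{N+1}^2| \le \rho_1\,|Q_N^1 - Q_N^2| \;\middle|\; Q_N^1 \in L_{U|_S}(\ell),\, Q_N^2 \in L_{U|_S}(\ell)\right) > 0.$$

To iterate this argument, note first that if $(q, p) \in C_\ell$ then continuity of $U$ and the mapping $t \mapsto \Phi_t^\circ(q, p)$ implies $\Phi_t^\circ(q, p) \in L_{U|_S}(\bar{U})$ for any $t \in \mathbb{R}_+$. Owing to time discretization, we only have $\hat{\Phi}_t^\circ(q, p) \in L_{U|_S}(\bar{U} + \eta_1)$ for $(q, p) \in C_\ell$ and some $\eta_1 > 0$, by another application of (7). It follows that there exists a number of iterations $I \in \mathbb{N}$ that depends on $\rho_1$, and an initial level $\ell_0 \in (\underline{U}, \bar{U})$ depending on $I$ and $\eta_1$ such that

$$\mathbb{P}_{\varepsilon,L}\left(|Q_{N+I}^1 - Q_{N+I}^2| \le \delta \;\middle|\; Q_N^1 \in L_{U|_S}(\ell_0),\, Q_N^2 \in L_{U|_S}(\ell_0)\right) > 0.$$

Therefore we can conclude (10) by applying Williams [1991, Exercise E.10.5].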

Under similar conditions, Durmus et al. [2017] provide a convergence result for the marginal HMC chains, which can be used to check A1; see also Cances et al. [2007], Livingstone et al. [2016], Mangoubi and Smith [2017] and Tweedie [1983] for the finiteness of moments.

It is worth noting that the distance between chains might exceed $\delta$ at some future iterations $n > \tau_\delta$, and that the event $\{|Q_n^1 - Q_n^2| \le \delta\}$ is not an exact meeting event; thus Theorem 1 does not establish A2. In the next section, we combine coupled HMC kernels with another kernel designed to prompt exact meetings, which would occur with large probability when the two chains are close.

4 Unbiased Hamiltonian Monte Carlo estimators

The construction of Jacob et al. [2017b] requires two chains that meet exactly. One possibility here is the approach of Glynn and Rhee [2014], which involves the introduction of a truncation variable. Instead we propose to use coupled Metropolis–Hastings steps to trigger exact meetings. These coupled MH steps are described in Section 4.1, and a summary of the proposed methodology combining the two coupled kernels is in Section 4.2. Section 4.3 briefly describes a further variance reduction technique.

4.1 Coupled Metropolis–Hastings steps

As in Section 1, let us denote the two chains by $(X_n)_{n \ge 0}$ and $(Y_n)_{n \ge 0}$; these correspond to the position coordinates in Section 3, propagated with a time shift, e.g. $(X_{n+1}, Y_n) \sim \bar{K}_{\varepsilon,L}((X_n, Y_{n-1}), \cdot)$. According to Theorem 1, coupled HMC chains are close to one another after some iterations. Denote the distance between the chains at step $n$ by $\delta_n = |X_n - Y_{n-1}|$.

In a coupled MH step with Normal random walk, a pair of proposals $(X^\star, Y^\star)$ is sampled from the maximal coupling of $\mathcal{N}(X_n, \Sigma)$ and $\mathcal{N}(Y_{n-1}, \Sigma)$ [Jacob et al., 2017b]. Let us consider the case where $\Sigma = \sigma^2 I_d$ for some $\sigma > 0$.


Algorithm 1 Unbiased HMC estimator $\bar{H}_{k:m}(X, Y)$ of $\pi(h)$, with tuning parameters $\omega, \sigma, \varepsilon, L, k, m$.

The kernel $\bar{P}_\sigma$ refers to a coupled random walk MH kernel with proposal standard deviation $\sigma$, and maximally coupled proposals. The kernel $\bar{K}_{\varepsilon,L}$ refers to a coupled HMC kernel with step size $\varepsilon$, $L$ leap-frog steps, and common initial velocity at each step. The marginal kernels are denoted by $P_\sigma$ and $K_{\varepsilon,L}$ respectively.

1. Draw $X_0$ and $Y_0$ from an initial distribution $\pi_0$, and

(a) with probability $\omega$, sample $X_1 \sim P_\sigma(X_0, \cdot)$;

(b) otherwise sample $X_1 \sim K_{\varepsilon,L}(X_0, \cdot)$;

(c) set $n = 1$.

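2. While $X_n \ne Y_{n-1}$ or $n < m$,

(a) with probability $\omega$, sample $(X_{n+1}, Y_n) \sim \bar{P}_\sigma((X_n, Y_{n-1}), \cdot)$;

(b) otherwise, sample $(X_{n+1}, Y_n) \sim \bar{K}_{\varepsilon,L}((X_n, Y_{n-1}), \cdot)$;

(c) if $X_{n+1} = Y_n$ and $\tau$ has not yet been set, set $\tau = n + 1$;

(d) increment $n \leftarrow n + 1$.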

3. Compute $H_\ell(X, Y) = h(X_\ell) + \sum_{n=\ell}^{\max(m,\tau-1)} \{h(X_{n+1}) - h(Y_n)\}$ for $\ell \in [k:m]$, and $\bar{H}_{k:m}(X, Y) = (m-k+1)^{-1} \sum_{\ell=k}^{m} H_\ell(X, Y)$; or compute $\bar{H}_{k:m}(X, Y)$ as in (2).
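The time-shifted loop of Algorithm 1 can be sketched as a generic driver that receives the kernels as functions. This is an illustration under stated assumptions rather than the authors' implementation: the name `run_coupled_chains`, the argument `max_iter` and the toy halving kernels are hypothetical, and the caller is expected to pass kernels that already implement the mixture with probability $\omega$. The loop runs until step $\max(m, \tau)$, which is the number of steps needed to compute (2), and it propagates a single chain once the two chains have met.

```python
import math
import random


def run_coupled_chains(x0, y0, marginal_kernel, coupled_kernel, m, rng,
                       max_iter=10**6):
    """Return (X, Y, tau) with X[n] = X_n for n <= max(m, tau) and Y[n] = Y_n."""
    X = [x0, marginal_kernel(x0, rng)]   # step 1: X_1 is drawn from the marginal kernel
    Y = [y0]
    tau = 1 if X[1] == Y[0] else math.inf
    n = 1
    while n < max(m, tau):               # step 2
        if n >= max_iter:
            raise RuntimeError("chains did not meet within max_iter steps")
        if tau < math.inf:
            # After the meeting, faithfulness gives Y_n = X_{n+1}, so only
            # one chain needs to be propagated.
            x_next = marginal_kernel(X[n], rng)
            y_next = x_next
        else:
            x_next, y_next = coupled_kernel(X[n], Y[n - 1], rng)
            if x_next == y_next:
                tau = n + 1
        X.append(x_next)
        Y.append(y_next)
        n += 1
    return X, Y, tau


# Toy deterministic kernels on the integers (hypothetical, for illustration):
# each chain halves its state, so the chains meet once the states coincide.
halve = lambda x, rng: x // 2
halve_both = lambda x, y, rng: (x // 2, y // 2)

X, Y, tau = run_coupled_chains(8, 5, halve, halve_both, m=5, rng=random.Random(0))
print(X, Y, tau)
```

For vector-valued states stored as tuples the equality test works unchanged; with array types the comparison `x_next == y_next` would have to be replaced by an exact element-wise check.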

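Under the maximal coupling, we have $\mathbb{P}(X^\star = Y^\star) = 1 - d_{TV}(\mathcal{N}(X_n, \sigma^2 I_d), \mathcal{N}(Y_{n-1}, \sigma^2 I_d))$. The total variation can be approximated as in Pollard [2005]. First, we have $d_{TV}(\mathcal{N}(X_n, \sigma^2 I_d), \mathcal{N}(Y_{n-1}, \sigma^2 I_d)) = \mathbb{P}(2\sigma|Z| \le \delta_n)$, where $Z$ is a univariate standard Normal variable and $\delta_n$ is considered fixed. Approximations of the folded Normal cumulative distribution function then lead to

$$\mathbb{P}(X^\star = Y^\star) = 1 - \mathbb{P}(2\sigma|Z| \le \delta_n) = 1 - \frac{1}{\sqrt{2\pi}}\frac{\delta_n}{\sigma} + O\left(\frac{\delta_n^2}{\sigma^2}\right), \quad \text{as } \frac{\delta_n}{\sigma} \to 0.$$

To achieve $\mathbb{P}(X^\star = Y^\star) = s$ for some desired probability $s$, we can choose $\sigma$ as approximately $\delta_n / (\sqrt{2\pi}(1 - s))$. The proposed values $(X^\star, Y^\star)$ are then accepted as the next states according to MH acceptance ratios, i.e. if $U \le \min(1, \pi(X^\star)/\pi(X_n))$ and $U \le \min(1, \pi(Y^\star)/\pi(Y_{n-1}))$ respectively, where a single uniform variable $U \sim \mathcal{U}[0,1]$ is used for both chains.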
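A coupled MH step of this kind can be sketched in one dimension as follows. The sampler below uses the standard rejection-based construction of a maximal coupling; whether the authors' code uses this construction or another one is not stated in the text, so it should be read as an assumption, as should the function names and the standard Normal target in the usage lines.

```python
import math
import random


def normal_logpdf(x, mean, sd):
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)


def maximal_coupling_normals(mean1, mean2, sd, rng):
    """Sample (X, Y) with X ~ N(mean1, sd^2) and Y ~ N(mean2, sd^2), with
    P(X = Y) as large as possible (rejection-based construction)."""
    x = rng.gauss(mean1, sd)
    # 1.0 - rng.random() lies in (0, 1], which keeps the logarithm finite.
    if math.log(1.0 - rng.random()) + normal_logpdf(x, mean1, sd) <= normal_logpdf(x, mean2, sd):
        return x, x
    while True:
        y = rng.gauss(mean2, sd)
        if math.log(1.0 - rng.random()) + normal_logpdf(y, mean2, sd) > normal_logpdf(y, mean1, sd):
            return x, y


def coupled_mh_step(x, y, sd, log_target, rng):
    """Coupled random walk MH step: maximally coupled proposals, common uniform."""
    x_prop, y_prop = maximal_coupling_normals(x, y, sd, rng)
    log_u = math.log(1.0 - rng.random())
    x_new = x_prop if log_u <= log_target(x_prop) - log_target(x) else x
    y_new = y_prop if log_u <= log_target(y_prop) - log_target(y) else y
    return x_new, y_new


rng = random.Random(42)
log_target = lambda z: -0.5 * z * z          # hypothetical target: standard Normal
n_draws = 20000
meetings = 0
for _ in range(n_draws):
    x_prop, y_prop = maximal_coupling_normals(0.0, 1.0, 1.0, rng)
    meetings += (x_prop == y_prop)
# Compare with 1 - P(2 sigma |Z| <= delta) for delta = 1 and sigma = 1.
print(meetings / n_draws, 1.0 - math.erf(0.5 / math.sqrt(2.0)))
print(coupled_mh_step(0.3, 0.3001, 0.01, log_target, rng))
```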

If $\sigma$ is small compared to the spread of the target density function, the probability of jointly accepting the proposals is high. On the other hand, $\sigma$ needs to be large compared to $\delta_n = |X_n - Y_{n-1}|$ for the event $\{X^\star = Y^\star\}$ to frequently occur. This leads to a trade-off; in numerical experiments, for pairs of chains propagated using the coupled HMC kernel $\bar{K}_{\varepsilon,L}$, we can monitor both the distance $\delta_n$ and the target density values to guide the choice of $\sigma$. We will choose a fixed value of $\sigma$ for all coupled MH steps, and leave adaptive strategies, where $\sigma$ would be e.g. chosen according to $\delta_n$, for future research. Hereafter we denote by $P_\sigma$ and $\bar{P}_\sigma$ the marginal and coupled kernels associated with the MH steps.
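To get a feel for the dependence of the meeting probability on $\sigma$, the exact expression $1 - \mathbb{P}(2\sigma|Z| \le \delta_n)$ can be compared with its first-order approximation. The sketch below uses the identity $\mathbb{P}(|Z| \le x) = \mathrm{erf}(x/\sqrt{2})$; the function names and the value of $\delta_n$ are illustrative assumptions.

```python
import math


def meeting_probability(delta, sigma):
    """Exact P(X* = Y*) = 1 - P(2 sigma |Z| <= delta) for a standard Normal Z."""
    return 1.0 - math.erf(delta / (2.0 * sigma * math.sqrt(2.0)))


def first_order_approximation(delta, sigma):
    """Approximation 1 - delta / (sigma sqrt(2 pi)), valid as delta / sigma -> 0."""
    return 1.0 - delta / (sigma * math.sqrt(2.0 * math.pi))


def sigma_for_target(delta, s):
    """Proposal standard deviation for which P(X* = Y*) is approximately s."""
    return delta / (math.sqrt(2.0 * math.pi) * (1.0 - s))


delta = 1e-3  # hypothetical distance between the chains
for s in (0.5, 0.8, 0.95):
    sigma = sigma_for_target(delta, s)
    print(s, sigma, meeting_probability(delta, sigma))
```

The printed exact probabilities get closer to the requested values of $s$ as $s$ approaches one, that is as $\delta_n/\sigma$ becomes small.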

4.2 Combining kernels

We propose to use both coupled HMC and MH kernels through a mixture. The coupled HMC kernel is expected to bring the two chains close to one another, while the coupled MH kernel enables exact meetings when the chains are already close. In a mixture of kernels, at each step, the MH kernel is chosen with probability $\omega$, otherwise the HMC kernel is chosen. The procedure is described in Algorithm 1. Note that A3 is satisfied by design for coupled chains generated by this algorithm. As the resulting coupled mixture kernel inherits properties of the coupled MH kernel, A2 can in principle be verified by simply relying on the properties of coupled MH kernels established in Jacob et al. [2017b]. However, we stress here that Theorem 1 provides some insight on the role of coupled HMC steps on the efficiency of the proposed estimator.

We now comment on the computational cost of Algorithm 1. Assume for simplicity that the cost of evaluating the target density is approximately equal to that of evaluating its gradient. Each HMC step is then $L + 1$ times more expensive than a MH step. If we choose a small value for $\omega$, such as $0.1$ or $0.05$, the cost of the MH steps becomes negligible. Secondly, the cost of running two chains is approximately twice the cost of running each chain until meeting occurs. Thereafter, only one chain needs to be propagated up to step $m$. If we choose $m$ to be much larger than $\tau$ with high probability, the cost of Algorithm 1 is therefore comparable to the cost of $m$ HMC iterations.

The efficiency of the unbiased HMC estimator depends on the mixing properties of the underlying HMC kernel, and on the contraction achieved by the coupling. Importantly, the tuning parameters $\varepsilon$ and $L$ that would be optimal for the marginal HMC kernel are not necessarily adequate for the coupled kernel, as illustrated in Section 5. The other tuning parameters include $\sigma$ for the coupled MH step discussed above, and Jacob et al. [2017b] give recommendations for $k$ and $m$: namely $k$ can be chosen as a large quantile of the meeting times, and $m$ such that $(m - k)/m \approx 1$, for instance $m = 10k$.

Finally, in Section 5.2 we will encounter a situation where the coupled HMC kernel contracts so quickly that the distance $|X_n - Y_{n-1}|$ becomes smaller than machine precision after a small number of iterations. The two chains can then be considered exactly identical, for all practical purposes, and the coupled MH steps become unnecessary.

4.3 Choice of weights and variance reduction

As suggested in Jacob et al. [2017a,b], the estimators $H_\ell(X, Y)$ for $\ell \in [k:m]$ given in (1) can be averaged with any weights $(w_\ell)_{\ell=k}^{m}$ such that $\sum_{\ell=k}^{m} w_\ell = 1$. The estimator $\bar{H}_{k:m}(X, Y)$ in (2) corresponds to weights equal to $(m-k+1)^{-1}$. For an arbitrary choice $(w_\ell)_{\ell=k}^{m}$, the estimator $\sum_{\ell=k}^{m} w_\ell H_\ell(X, Y)$ is unbiased and its variance is given by $w^T \Sigma_H w$, where $\Sigma_H$ denotes the $(m-k+1) \times (m-k+1)$ covariance matrix of the estimators $(H_k, \ldots, H_m)$. To minimize such a variance without violating the sum constraint, we solve the system

$$\begin{pmatrix} \Sigma_H & 1_{m-k+1} \\ 1_{m-k+1}^T & 0 \end{pmatrix} \begin{pmatrix} w_k \\ \vdots \\ w_m \\ \lambda \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix},$$

where $\lambda$ is a Lagrange multiplier, for a computational cost of order $(m-k+1)^3$. The matrix $\Sigma_H$ can be approximated from i.i.d. realizations of $H_\ell$ for $\ell \in [k:m]$. The resulting weights can then be used to reduce the variance of $\bar{H}_{k:m}(X, Y)$, especially if the original MCMC chain exhibits strong autocorrelations.
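The bordered linear system can be solved with any dense solver. The sketch below uses a small Gaussian elimination routine so that it has no dependencies; the function names and the $2 \times 2$ covariance matrix in the usage line are assumptions chosen for illustration.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def optimal_weights(Sigma_H):
    """Weights minimising w^T Sigma_H w subject to sum(w) = 1."""
    size = len(Sigma_H)
    # Border Sigma_H with a column and a row of ones, and a zero in the corner.
    A = [list(row) + [1.0] for row in Sigma_H] + [[1.0] * size + [0.0]]
    b = [0.0] * size + [1.0]
    solution = solve_linear_system(A, b)
    return solution[:size]  # the last entry is the Lagrange multiplier


# With uncorrelated estimators of variances 1 and 4, the weights are
# proportional to the inverse variances.
print(optimal_weights([[1.0, 0.0], [0.0, 4.0]]))
```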

5 Numerical illustrations

We investigate some key aspects of the proposed unbiased HMC estimator, such as its efficiency compared to standard HMC estimators. As in the rest of the article, we choose a Normal distribution for the initial velocities at each HMC step, and a unit mass matrix; other choices are possible [Girolami and Calderhead, 2011, Livingstone et al., 2017].

In all experiments, whenever the test function $h$ is not specified, it is chosen as $h : x \mapsto x_1$, so that $\pi(h)$ is simply the mean of the first target marginal distribution. The asymptotic variance of an MCMC estimator refers to the variance appearing in the central limit theorem satisfied by $N^{-1} \sum_{n=0}^{N} h(X_n)$ as $N \to \infty$, where $(X_n)_{n \ge 0}$ is the chain generated by the algorithm. Here, these asymptotic variances are approximated with the spectrum0 function of the coda package [Plummer et al., 2006]. For unbiased estimators, we define the asymptotic efficiency as variance multiplied by expected cost [Glynn and Whitt, 1992]. This accounts for the fact that, in a given computing budget, more estimators can be averaged over if each one can be produced faster. For the estimator $\bar{H}_{k:m}(X, Y)$ in (2), the expected computing time $\mathbb{E}[\max(\tau, m)]$ and the variance $\mathbb{V}[\bar{H}_{k:m}(X, Y)]$ are approximated by empirical averages of independent realizations.
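The product of variance and expected cost can be estimated from independent replicates as in the following short sketch, where the cost of each replicate is taken to be $\max(\tau, m)$ as in the text. The function name and the numbers are made up for illustration.

```python
import statistics


def variance_times_cost(estimates, meeting_times, m):
    """Empirical variance of the replicates times their average cost max(tau, m)."""
    costs = [max(tau, m) for tau in meeting_times]
    return statistics.variance(estimates) * statistics.mean(costs)


# Three hypothetical replicates of the estimator with their meeting times.
print(variance_times_cost([1.0, 2.0, 3.0], [12, 25, 8], m=20))
```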

5.1 Multivariate Normal distribution

Let the target πbe a multivariate Normal N(0d,Σπ)with d= 250 and with the (i, j)-entry of Σπequal to

exp(−|i−j|). In this example we discuss the choice of trajectory length, deﬁned as the product εL, and the use of

coupled MH kernels to trigger exact meetings.

We fix the number of leap-frog steps to L = 20 and vary the step size ε so that the trajectory length εL ranges between 0 and 3π/2, where π here denotes the mathematical constant. The initial distribution $\pi_0$ is chosen as the target. For each trajectory length, the asymptotic variance of HMC computed from 5,000 iterations is shown in Figure 1a. The optimal trajectory length is close to the value π, which is consistent with the analytical solution in Section 2.2. For such a trajectory length, the asymptotic variance is smaller than the variance obtained with perfect samples from the target, thanks to negative auto-correlations.
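To illustrate how negative auto-correlations can push the asymptotic variance below the marginal variance, here is a minimal sketch using a batch-means estimator. This is a stand-in chosen for simplicity and is not the spectrum0 estimator used in the experiments; the function name and the toy chain are hypothetical.

```python
from statistics import mean, variance


def batch_means_variance(chain, batch_size):
    """Batch-means estimate of the asymptotic variance of the chain average."""
    n_batches = len(chain) // batch_size
    if n_batches < 2:
        raise ValueError("need at least two full batches")
    batch_averages = [
        mean(chain[b * batch_size:(b + 1) * batch_size])
        for b in range(n_batches)
    ]
    # The variance of a batch average is roughly (asymptotic variance) / batch_size.
    return batch_size * variance(batch_averages)


# Toy chain with perfectly negative lag-one correlation.
antithetic = [1.0, -1.0] * 50
print(variance(antithetic))                  # marginal variance, close to 1
print(batch_means_variance(antithetic, 10))  # 0.0: every batch average is 0
```

The toy chain is an extreme case; for an actual HMC chain the reduction is partial and depends on the trajectory length.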

(a) HMC asymptotic variance against trajectory length εL (horizontal axis: trajectory length, π/4 to 6π/4; vertical axis: HMC variance, 0.1 to 10).

(b) Distance after 100 coupled HMC iterations against trajectory length εL (horizontal axis: trajectory length, π/4 to 6π/4; vertical axis: distance after 100 iterations, 1e−20 to 1e+00).

Figure 1: In the multivariate Normal example of Section 5.1, asymptotic variance for the estimation of $\int x_1 \pi(dx)$ using HMC, computed using chains of length 5,000 started at stationarity (left). Euclidean distance between the 100-th iterates of coupled HMC chains (right). The number of leap-frog steps is set to L = 20, which implicitly determines the step size ε for each trajectory length εL. Each dot corresponds to one of 5 independent runs.

(a) Log-distance between coupled HMC chains against iterations.

(b) Log-distance between coupled chains propagated with a mixture of HMC and MH kernels, against iterations.

Figure 2: In the multivariate Normal example of Section 5.1, distance between coupled HMC chains against number of iterations (left), and between chains propagated with the mixture of HMC and MH kernels, with $\sigma = 10^{-5}$ and ω = 0.1 (right). Each line corresponds to one of 100 independent runs.

We then run 100 iterations of coupled HMC and compute the Euclidean distance between the two final states. The resulting distances are shown in Figure 1b. Lengths around the value π/2 lead to the smallest distances, consistent with the analytical reasoning of Section 2.2. Moreover, there is a range of lengths that lead to contraction. On the other hand, the optimal length for the HMC estimator, which was the value π, does not lead to visible contraction after 100 iterations. Therefore, the proposed coupling contracts most with tuning parameters that are not optimal for the underlying HMC algorithm, which results in a loss of efficiency.
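The contrast between trajectory lengths π/2 and π can be reproduced on a toy problem. The sketch below assumes a bivariate standard Normal target (not the 250-dimensional target of this section) and a unit mass matrix; all function names are hypothetical. For this toy target the exact Hamiltonian flow is x(t) = x cos(t) + v sin(t), so two chains sharing the velocity v have their difference multiplied by cos(t): it vanishes at t = π/2 and keeps its magnitude at t = π.

```python
import math
import random


def potential(x):
    """Negative log-density (up to a constant) of a standard Normal target."""
    return 0.5 * sum(xi * xi for xi in x)


def grad_potential(x):
    return list(x)


def leapfrog(x, v, eps, n_steps):
    """Leap-frog integration of Hamiltonian dynamics with unit mass matrix."""
    x, v = list(x), list(v)
    g = grad_potential(x)
    v = [vi - 0.5 * eps * gi for vi, gi in zip(v, g)]
    for step in range(n_steps):
        x = [xi + eps * vi for xi, vi in zip(x, v)]
        g = grad_potential(x)
        # Full velocity update, except a half update at the very last step.
        scale = eps if step < n_steps - 1 else 0.5 * eps
        v = [vi - scale * gi for vi, gi in zip(v, g)]
    return x, v


def coupled_hmc_step(x, y, eps, n_steps, rng):
    """One HMC step per chain, sharing the initial velocity and the uniform."""
    v0 = [rng.gauss(0.0, 1.0) for _ in x]
    log_u = math.log(1.0 - rng.random())
    kinetic0 = 0.5 * sum(vi * vi for vi in v0)
    new_states = []
    for state in (x, y):
        proposal, v1 = leapfrog(state, v0, eps, n_steps)
        h0 = potential(state) + kinetic0
        h1 = potential(proposal) + 0.5 * sum(vi * vi for vi in v1)
        new_states.append(proposal if log_u < h0 - h1 else list(state))
    return new_states[0], new_states[1]


def distance_after(n_iter, traj_length, n_steps=20, seed=1):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(2)]
    y = [rng.gauss(0.0, 1.0) for _ in range(2)]
    eps = traj_length / n_steps
    for _ in range(n_iter):
        x, y = coupled_hmc_step(x, y, eps, n_steps, rng)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


d_half_pi = distance_after(100, math.pi / 2)
d_pi = distance_after(100, math.pi)
print(d_half_pi, d_pi)
```

On a target with unequal variances, such as the one in this section, each eigen-direction oscillates at its own frequency, so the contraction at π/2 is only partial per step.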

Based on Figure 1b, we set εL = π/2 and L = 20, and run coupled chains, 100 times independently, until the distance between them is less than machine precision. In Figure 2a these distances are plotted on a logarithmic scale against iterations; the lines drop when the distances fall below machine precision, which occurs between iterations 127 and 312. The distances are already very small after a few dozen iterations. We then implement the proposed algorithm with the mixture of kernels described in Section 4.2, with $\sigma = 10^{-5}$ and ω = 0.1, and plot the resulting distances in Figure 2b. All meeting times then occur between iterations 36 and 97. The MH steps thus successfully trigger exact meetings.
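To see how coupled random walk proposals can produce exact meetings once the chains are close, here is a minimal one-dimensional sketch. It assumes that the coupled MH step relies on a maximal coupling of the two Normal proposals, which is one standard construction; the details of Section 4.2 are not reproduced here, and the function names are hypothetical.

```python
import math
import random


def log_normal_pdf(z, mu, sigma):
    return (-0.5 * ((z - mu) / sigma) ** 2
            - math.log(sigma * math.sqrt(2.0 * math.pi)))


def max_coupling_normal(mu1, mu2, sigma, rng):
    """Sample (a, b) with a ~ N(mu1, sigma^2) and b ~ N(mu2, sigma^2),
    such that a == b with the largest possible probability."""
    a = rng.gauss(mu1, sigma)
    log_w = math.log(1.0 - rng.random()) + log_normal_pdf(a, mu1, sigma)
    if log_w <= log_normal_pdf(a, mu2, sigma):
        return a, a  # both proposals coincide
    while True:
        b = rng.gauss(mu2, sigma)
        log_w = math.log(1.0 - rng.random()) + log_normal_pdf(b, mu2, sigma)
        if log_w > log_normal_pdf(b, mu1, sigma):
            return a, b


rng = random.Random(0)
sigma = 1e-5
# Chains at distance 1e-6, i.e. much closer than the proposal scale sigma.
n_equal = sum(
    a == b
    for a, b in (max_coupling_normal(0.0, 1e-6, sigma, rng) for _ in range(1000))
)
print(n_equal)
```

When the two current states are far apart relative to σ the proposals essentially never coincide, which is why the HMC steps are needed first to bring the chains close together.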

We set k = 50 and m = 500 to produce R = 100 unbiased estimators of $\int x_1 \pi(dx)$ as in (2). The asymptotic efficiency is approximately equal to 1.96. The asymptotic variance of HMC obtained with εL = π was found to be approximately 0.16, averaging the 5 runs shown in Figure 1a. Therefore, the proposed estimator is about 12 times less efficient than the original HMC algorithm when optimally tuned. Depending on hardware, this can be considered an acceptable loss in exchange for complete parallelism, among other advantages of unbiased estimators argued e.g. in Rhee [2013], Jacob et al. [2017b]. Unbiased estimators could also be obtained from variants of HMC where the number of leap-frog steps L is random, and possibly adaptive, which might reduce the efficiency loss.
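For readers who want to see how k, m and the meeting time enter the computation, the following sketch evaluates a time-averaged unbiased estimator from two stored trajectories. It assumes that equation (2), which is not reproduced in this section, has the form given in Jacob et al. [2017b]: an average of h over iterations k to m, plus a weighted sum of differences h(X_l) − h(Y_{l−1}) up to the meeting time. The function name and the toy inputs are hypothetical.

```python
def unbiased_estimate(hx, hy, k, m, tau):
    """Time-averaged estimator with bias correction (assumed form of (2)).

    hx[t] = h(X_t) for t = 0, ..., max(m, tau - 1)
    hy[t] = h(Y_t) for t = 0, ..., tau - 2
    tau   = first t such that X_t == Y_{t-1}
    """
    mcmc_average = sum(hx[k:m + 1]) / (m - k + 1)
    correction = 0.0
    for l in range(k + 1, tau):
        weight = min(1.0, (l - k) / (m - k + 1))
        correction += weight * (hx[l] - hy[l - 1])
    return mcmc_average + correction


# Hypothetical toy trajectories meeting at tau = 4.
hx = [0.0, 1.0, 2.0, 3.0]
hy = [0.5, 1.5, 2.5]
print(unbiased_estimate(hx, hy, k=1, m=2, tau=4))
```

If the chains have already met by iteration k + 1, the correction sum is empty and the estimator reduces to the usual MCMC average over iterations k to m.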

5.2 Truncated Normal distribution

We consider Hamiltonian Monte Carlo on truncated Normal distributions, with truncations defined by linear and quadratic inequalities. In this setting Pakman and Paninski [2014] show that Hamiltonian dynamics can be solved, resulting in trajectories that bounce off the constraints. An R package implementing the method of Pakman and Paninski [2014] is available online [Pakman, 2012]. Using this package, the implementation of the proposed method only involved simple modifications.

We consider two of the examples in Pakman and Paninski [2014], where a bivariate Normal distribution is truncated by two linear and two quadratic constraints respectively. A thousand HMC samples are shown in Figure 3 (top row). The first distribution is a bivariate Normal, with unit covariance matrix and mean (4, 4), restricted to the set $\{x_1 \leq x_2 \leq 1.1 x_1\} \subset \mathbb{R}^2$ (Figure 3a). The second distribution is a bivariate standard Normal restricted to the set $\{(x_1 - 4)^2/32 + (x_2 - 1)^2/8 \leq 1\} \cap \{4x_1^2 + 8x_2^2 - 2x_1 x_2 + 5x_2 \geq 1\} \subset \mathbb{R}^2$ (Figure 3b). We use the value π/2 as a trajectory length, as advocated in Pakman and Paninski [2014]. As for the initial distribution $\pi_0$, we use a point mass at (2, 2.1) for the first target, and at (2, 0) for the second one.
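As a quick sanity check, the two constraint sets can be written as indicator functions and evaluated at the initial points. This is only an illustrative sketch with hypothetical function names; it is not part of the tmg package.

```python
def in_linear_region(x1, x2):
    """Set {x1 <= x2 <= 1.1 * x1} used for the first target."""
    return x1 <= x2 <= 1.1 * x1


def in_quadratic_region(x1, x2):
    """Intersection of the two quadratic constraints of the second target."""
    first = (x1 - 4) ** 2 / 32 + (x2 - 1) ** 2 / 8 <= 1
    second = 4 * x1 ** 2 + 8 * x2 ** 2 - 2 * x1 * x2 + 5 * x2 >= 1
    return first and second


# Both initial points lie inside their respective supports.
print(in_linear_region(2, 2.1), in_quadratic_region(2, 0))
```

Note that the origin, which is the mode of the untruncated standard Normal, violates the second quadratic constraint and is therefore excluded from the second target.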

In this example, the proposed coupling induces a contraction that leads to distances between trajectories becoming smaller than machine precision, after a few iterations. Therefore, we do not need to resort to coupled MH steps: we can define the meeting times directly as the first times for which distances are less than machine precision. Histograms of such meeting times are shown in Figure 3 for both targets (bottom row). They indicate that small values of k and m could be chosen, effectively leading to the possibility of running very short HMC chains in parallel in a principled way.

5.3 Logistic regression

We consider a Bayesian logistic regression as in Hoffman and Gelman [2014], on the classic German credit data set. Including pairwise interactions, the covariates are in a matrix X with N = 1000 rows and p = 300 columns, which we standardize by column. The parameters are the intercept $\alpha \in \mathbb{R}$, coefficients $\beta \in \mathbb{R}^p$, and a prior variance $\sigma^2 \in \mathbb{R}_+$ on intercept and coefficients. The likelihood specifies that the binary outcome $Y_i$ satisfies $P(Y_i = 1 \mid X_i, \alpha, \beta) = (1 + \exp(-\alpha - X_i^T \beta))^{-1}$ for all $i \in [1 : N]$. The prior specifies $\alpha \mid \sigma^2 \sim \mathcal{N}(0, \sigma^2)$ and $\beta_j \mid \sigma^2 \sim \mathcal{N}(0, \sigma^2)$ for all $j \in [1 : p]$, and an Exponential distribution with rate λ = 0.01 for $\sigma^2$. We transform $\sigma^2$ into $\log \sigma^2$, so that each parameter lies in $\mathbb{R}$. The target π is the posterior distribution of $(\alpha, \beta, \log \sigma^2)$, of dimension d = p + 2 = 302. We use an independent standard Normal for each parameter to initialize the chains, which defines $\pi_0$.
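The target density described above can be written down directly. The sketch below is a plain-Python illustration with hypothetical names, not the code used for the experiments; it evaluates the log-posterior up to an additive constant and includes the Jacobian term that arises from working with $\log \sigma^2$ instead of $\sigma^2$.

```python
import math


def log_posterior(alpha, beta, log_sigma2, X, y, rate=0.01):
    """Unnormalized log-posterior of (alpha, beta, log sigma^2)."""
    sigma2 = math.exp(log_sigma2)
    log_lik = 0.0
    for xi, yi in zip(X, y):
        eta = alpha + sum(b * v for b, v in zip(beta, xi))
        # log P(Y_i = y_i) = y_i * eta - log(1 + exp(eta)), computed stably.
        log_lik += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    n_coef = len(beta) + 1  # intercept plus coefficients
    log_prior = (-0.5 * n_coef * math.log(2.0 * math.pi * sigma2)
                 - 0.5 * (alpha ** 2 + sum(b * b for b in beta)) / sigma2)
    log_hyperprior = math.log(rate) - rate * sigma2  # Exponential(rate) on sigma^2
    log_jacobian = log_sigma2  # d sigma^2 / d log sigma^2 = sigma^2
    return log_lik + log_prior + log_hyperprior + log_jacobian


# Tiny hypothetical data set: two observations, one covariate.
X = [[0.5], [-1.0]]
y = [1, 0]
print(log_posterior(0.1, [0.2], 0.0, X, y))
```

An HMC implementation would also need the gradient of this function with respect to all d = p + 2 parameters.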

We set L = 20 and vary ε so that the trajectory length εL is in the range [0.1, 0.5]. For each length, we run 10,000 HMC iterations, discard the first 5,000 as burn-in, and use the remaining 5,000 samples to approximate the asymptotic variance of HMC for the estimation of $\int x_1 \pi(dx)$, which here is the posterior expectation of the intercept. The results of independent runs are shown in Figure 4a. Coupled HMC chains are then run for 1,000 iterations, and the distances between the final states are shown in Figure 4b. Again, the optimal choice of εL for the asymptotic variance of HMC is not optimal in terms of contraction. However, contrary to the example of Section 5.1, here each of the considered trajectory lengths yields some contraction.

Using the length εL = 0.1, we then proceed with Algorithm 1 of Section 4.2, using $\sigma = 10^{-5}$ and ω = 0.05. Over 100 independent experiments, we compute the distance between the coupled chains, using two different initializations. The first is the standard Normal distribution on each parameter as above, leading to the distances plotted in Figure 5a. The observed meeting times occur between iterations 256 and 535. Using k = 100 and m = 1,000, we produce 100 independent estimators $\bar{H}_{k:m}(X, Y)$ from these coupled chains, in order to approximate the marginal means and variances of the target. With these values, we construct a Normal approximation of the target, with a diagonal covariance matrix, and use this Normal as a new initial distribution $\pi_0$. For this better initialization, the distance traces are shown in Figure 5b. The observed meeting times occur between iterations 192 and 422, and the plot shows that the distances decrease faster than with the previous initialization. The vertical upward jumps in Figure 5 correspond to events where one chain accepts its HMC proposal while the other chain does not.

With this better initialization, again using k = 100 and m = 1,000, we produce R = 1,000 independent estimators of $\int x_1 \pi(dx)$. The asymptotic efficiency is found to be approximately 0.40. The asymptotic variance of HMC obtained with εL = 0.3 was found to be approximately 0.09, and with εL = 0.1 approximately 0.33;

(a) HMC samples approximating a bivariate Normal truncated by two linear constraints.

(b) HMC samples approximating a bivariate Normal truncated by two quadratic constraints.

(c) Meeting times for the bivariate Normal with linear constraints (horizontal axis: meeting times, 0 to 30; vertical axis: density).

(d) Meeting times for the bivariate Normal with quadratic constraints (horizontal axis: meeting times, 0 to 15; vertical axis: density).

Figure 3: In the truncated Normal example of Section 5.2, scatter plot of 1,000 HMC samples for a bivariate Normal truncated by two linear constraints (top left), and two quadratic constraints (top right). Histogram of 1,000 meeting times, defined as first times for which the distance is smaller than machine precision, for coupled HMC chains targeting the bivariate Normal with linear constraints (bottom left), and with quadratic constraints (bottom right).

(a) HMC asymptotic variance against trajectory length εL (horizontal axis: trajectory length, 0.1 to 0.5; vertical axis: HMC variance, 0.0 to 0.5).

(b) Distance after 1,000 coupled HMC steps against trajectory length εL (horizontal axis: trajectory length, 0.1 to 0.5; vertical axis: distance after 1000 iterations, 1e−20 to 1e+00).

Figure 4: In the logistic regression example of Section 5.3, asymptotic variance for the estimation of $\int x_1 \pi(dx)$ using HMC, computed using chains of length 10,000 started from an independent standard Normal distribution for each parameter, and discarding a burn-in of 5,000 steps (left). Euclidean distance between the 1,000-th iterates of coupled HMC chains (right). The number of leap-frog steps is set to L = 20, which implicitly determines the step size ε for each trajectory length εL. Each dot corresponds to one of 5 independent runs.

(a) Log-distance between coupled chains initialized from independent standard Normal distributions.

(b) Log-distance between coupled chains initialized from a crude Normal approximation of the target.

Figure 5: In the logistic regression example of Section 5.3, distance between coupled chains initialized from independent standard Normal distributions for each parameter against number of iterations (left), and initialized from a Normal approximation of the target (right). The Normal approximation is obtained by estimating the 302 marginal means and variances of the target distribution. In both cases the chains are propagated using a mixture of HMC and MH kernels, with $\sigma = 10^{-5}$ and ω = 0.05, and the HMC kernel uses L = 20 and εL = 0.1. Each line corresponds to one of 100 independent runs.

these were obtained from $10^5$ HMC iterations after discarding 5,000 iterations as burn-in. Therefore, the proposed estimator is about 4 times less efficient than the original HMC estimator when optimally tuned, or more precisely, for the optimal value of ε given a fixed value L = 20. Since Figure 4b shows contraction for εL = 0.3 as well, we could also use that value for the unbiased HMC estimator, but the meeting times would then be longer, and the potential for parallelization would thus be reduced.
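The efficiency ratios quoted in Sections 5.1 and 5.3 follow from dividing the asymptotic efficiency of the unbiased estimator by the asymptotic variance of HMC. The short calculation below uses the rounded values reported in the text, so the results are themselves approximate, and it assumes that both quantities are expressed per HMC iteration:

```latex
\[
\text{Section 5.1: } \frac{1.96}{0.16} \approx 12.25,
\qquad
\text{Section 5.3: } \frac{0.40}{0.09} \approx 4.4
\ \ (\varepsilon L = 0.3),
\qquad
\frac{0.40}{0.33} \approx 1.2
\ \ (\varepsilon L = 0.1).
\]
```

The last ratio suggests that, compared with HMC run at the same trajectory length εL = 0.1, the loss incurred by the unbiased estimator is modest; most of the gap comes from using a trajectory length that is not optimal for HMC itself.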

From the coupled chains, histograms can be produced by binning a dimension of the space and estimating posterior masses of these bins, which are integrals of indicator functions [Jacob et al., 2017b]. Histograms of α and $\beta_1$ under the posterior distribution are shown in Figure 6. The vertical bars indicate the point estimates of posterior masses, and gray rectangles represent 95% confidence intervals based on the central limit theorem. The overlaid red curves show kernel density estimates obtained from $10^5$ HMC samples, after discarding a burn-in of 5,000 steps, and using L = 20 and εL = 0.3. Taking these kernel density estimates as ground truth, the narrowness of confidence intervals reflects the accuracy of the proposed estimators. We stress that these confidence intervals are based on the central limit theorem for averages of independent variables, and are therefore justified in the limit of the number of independent estimators, all of which can be computed in parallel.
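The confidence intervals described above only require the R independent estimates of each bin mass. A minimal sketch, with a hypothetical function name and made-up input values, could look as follows.

```python
import math
from statistics import mean, stdev


def clt_interval(estimates, z=1.96):
    """Point estimate and approximate 95% interval from i.i.d. estimators."""
    r = len(estimates)
    center = mean(estimates)
    half_width = z * stdev(estimates) / math.sqrt(r)
    return center - half_width, center, center + half_width


# Hypothetical unbiased estimates of the posterior mass of one histogram bin,
# one value per independent pair of coupled chains.
bin_mass_estimates = [0.1, 0.2, 0.3]
low, point, high = clt_interval(bin_mass_estimates)
print(low, point, high)
```

Because each estimate is unbiased but not constrained to [0, 1], individual values and interval endpoints can fall slightly outside that range; the average still converges to the bin mass as R grows.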

6 Discussion

Coupled Hamiltonian Monte Carlo chains can be combined to generate unbiased estimators of integrals with respect to target distributions. With adequate couplings, such chains become exactly equal after a random number of steps. The proposed approach involves a simple coupling of Hamiltonian Monte Carlo kernels, based on common random numbers, that generates chains converging to one another. Combined with coupled random walk Metropolis–Hastings steps, the approach leads to estimators that can be produced independently in parallel and averaged. The method is demonstrated on three examples, and a contraction property of coupled HMC kernels is formally established under strong log-concavity of the target on parts of the state space. Recently, Mangoubi and Smith [2017] have proposed a much deeper study of the same coupling, and have adroitly exploited it to obtain novel quantitative bounds on mixing properties of HMC. The same coupling was already discussed in Neal [2002], for the purpose of removing the burn-in bias. The exploration of further links between our proposed estimators and the circular coupling of Neal [2002] is an exciting avenue of research. The proposed couplings also enable other unbiased estimators, such as those of Glynn and Rhee [2014] which do not require exact meetings.

As seen in numerical experiments, optimal trajectory lengths for standard HMC estimators are not optimal in the coupled construction. This leads to a loss of efficiency of the proposed estimators compared to standard HMC estimators. Whether this loss is acceptable or not will likely depend on the target distribution and the available hardware. Other considerations include the construction of confidence intervals, which is arguably simpler with i.i.d. variables than with Markov chains, and the unbiased property itself, which could be appealing in various

(a) Estimated posterior of the intercept α (horizontal axis: α, −1.75 to −0.50; vertical axis: density).

(b) Estimated posterior of the coefficient $\beta_1$ (horizontal axis: $\beta_1$, −0.9 to −0.3; vertical axis: density).

Figure 6: In the logistic regression example of Section 5.3, histograms of the posterior distributions of the intercept α (left) and of the first coefficient $\beta_1$ (right). Vertical bars indicate point estimates of posterior mass in each bin, obtained with 1,000 unbiased HMC estimators, and 95% confidence intervals are represented by gray rectangles. Red curves represent kernel density estimates computed from $10^5$ HMC iterations, considered as the ground truth.

contexts.

To improve asymptotic efficiencies, random numbers of leap-frog steps, and adaptive selection of that number based on the distance between the chains, would be interesting topics of research. A related question would be the construction of unbiased estimators from the No-U-Turn sampler of Hoffman and Gelman [2014]. Finally, the optimal weights described in Section 4.3 could potentially bring significant variance reduction in situations where HMC chains exhibit significant autocorrelations.

Acknowledgement

Pierre E. Jacob gratefully acknowledges support by the National Science Foundation through grant DMS-1712872.

References

Beskos A., Pillai N., Roberts G., Sanz-Serna J.-M., and Stuart A., 2010. The acceptance probability of the Hybrid Monte Carlo method in high-dimensional problems. In AIP Conference Proceedings, volume 1281, pages 23–26. AIP.

Beskos A., Pillai N., Roberts G., Sanz-Serna J.-M., and Stuart A., 2013. Optimal tuning of the Hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534.

Brooks S. P., Gelman A., Jones G., and Meng X.-L., 2011. Handbook of Markov chain Monte Carlo. CRC Press.

Cances E., Legoll F., and Stoltz G., 2007. Theoretical and numerical comparison of some sampling methods for molecular dynamics. ESAIM: Mathematical Modelling and Numerical Analysis, 41(2):351–389.

Carpenter B., Gelman A., Hoffman M. D., Lee D., Goodrich B., Betancourt M., Brubaker M. A., Guo J., Li P., and Riddell A., 2016. Stan: a probabilistic programming language. Journal of Statistical Software, 20:1–37.

Casella G., Lavine M., and Robert C. P., 2001. Explaining the perfect sampler. The American Statistician, 55(4):299–305.

Duane S., Kennedy A. D., Pendleton B. J., and Roweth D., 1987. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.

Durmus A., Moulines E., and Saksman E., 2017. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166.

Girolami M. and Calderhead B., 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.

Glynn P. W., 2016. Exact simulation versus exact estimation. In Winter Simulation Conference (WSC), pages 193–205. IEEE.

Glynn P. W. and Rhee C.-H., 2014. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389.

Glynn P. W. and Whitt W., 1992. The asymptotic efficiency of simulation estimators. Operations Research, 40(3):505–520.

Hairer E., Wanner G., and Lubich C., 2005. Geometric numerical integration: structure-preserving algorithms for ordinary differential equations. Springer-Verlag, New York.

Hoffman M. D. and Gelman A., 2014. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.

Huber M., 2016. Perfect simulation, volume 148. CRC Press.

Jacob P. E. and Thiery A. H., 2015. On non-negative unbiased estimators. The Annals of Statistics, 43(2):769–784.

Jacob P. E., Lindsten F., and Schön T. B., 2017a. Smoothing with couplings of conditional particle filters. arXiv preprint arXiv:1701.02002.

Jacob P. E., O’Leary J., and Atchadé Y. F., 2017b. Unbiased Markov chain Monte Carlo with couplings. arXiv preprint arXiv:1708.03625.

Leimkuhler B. and Matthews C., 2015. Molecular Dynamics. Springer-Verlag, New York.

Lelièvre T., Rousset M., and Stoltz G., 2010. Free Energy Computations: A Mathematical Perspective. Imperial College Press. ISBN 978-1-84816-248-8.

Livingstone S., Betancourt M., Byrne S., and Girolami M., 2016. On the geometric ergodicity of Hamiltonian Monte Carlo. arXiv preprint arXiv:1601.08057.

Livingstone S., Faulkner M. F., and Roberts G. O., 2017. Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. arXiv preprint arXiv:1706.02649.

Mangoubi O. and Smith A., 2017. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114.

Mykland P., Tierney L., and Yu B., 1995. Regeneration in Markov chain samplers. Journal of the American Statistical Association, 90(429):233–241.

Neal R. M., 1993. Bayesian learning via stochastic dynamics. Advances in neural information processing systems, pages 475–475.

Neal R. M., 2002. Circularly-coupled Markov chain sampling. Technical report, 9910 (revised), Department of Statistics, University of Toronto.

Neal R. M., 2011. MCMC using Hamiltonian dynamics. Handbook of Markov chain Monte Carlo, 2(11).

Pakman A., 2012. tmg: truncated multivariate Gaussian sampling. CRAN. URL https://cran.r-project.org/package=tmg

Pakman A. and Paninski L., 2014. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542.

Plummer M., Best N., Cowles K., and Vines K., 2006. CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11. URL https://journal.r-project.org/archive/

Pollard D., 2005. Chapter 3: Total variation distance between measures. Asymptopia. URL http://www.stat.yale.edu/~pollard/Courses/607.spring05/handouts/Totalvar.pdf

Rhee C.-H., 2013. Unbiased estimation with biased samplers. PhD thesis, Stanford University. URL http://purl.stanford.edu/nf154yt1415

Rosenthal J. S., 1997. Faithful couplings of Markov chains: now equals forever. Advances in Applied Mathematics, 18(3):372–381. ISSN 0196-8858.

Rosenthal J. S., 2000. Parallel computing and Monte Carlo algorithms. Far East Journal of Theoretical Statistics, 4(2):207–236.

Thorisson H., 2000. Coupling, stationarity, and regeneration, volume 14. Springer New York.

Tweedie R., 1983. The existence of moments for stationary Markov chains. Journal of Applied Probability, 20(1):191–196.

Vihola M., 2015. Unbiased estimators and multilevel Monte Carlo. arXiv preprint arXiv:1512.01022.

Williams D., 1991. Probability with martingales. Cambridge University Press.
