ANALYSIS OF CONVERGENCE RATES OF SOME GIBBS

SAMPLERS ON CONTINUOUS STATE SPACES

AARON SMITH

1. Abstract

We use a non-Markovian coupling and small modifications of techniques from the

theory of finite Markov chains to analyze some Markov chains on continuous state

spaces. The first is a Gibbs sampler on narrow contingency tables, the second a generalization of a sampler introduced by Randall and Winkler.

2. Introduction

The problem of sampling from a given distribution on high-dimensional continuous

spaces arises in the computational sciences and Bayesian statistics, and a frequently used solution is Markov chain Monte Carlo (MCMC); see [13] for many examples.

Because MCMC methods produce good samples only after a lengthy mixing period,

a long-standing mathematical question is to analyze the mixing times of the MCMC algorithms that are in common use. Although there are many notions of mixing, the most commonly used is the mixing time, which is based on the total variation distance.

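For measures $\nu$, $\mu$ with common measurable $\sigma$-algebra $\mathcal{A}$, the total variation distance between $\mu$ and $\nu$ is

$$\|\mu - \nu\|_{TV} = \sup_{A \in \mathcal{A}} \big( \mu(A) - \nu(A) \big).$$

For an ergodic Markov chain $X_t$ with unique stationary distribution $\pi$, the mixing time is

$$\tau(\epsilon) = \inf\{ t : \|\mathcal{L}(X_t) - \pi\|_{TV} < \epsilon \}.$$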
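As a small illustration of these two definitions, the following Python sketch works on a finite state space, where the supremum over events is attained at the set on which the first distribution exceeds the second. The function names (`tv_distance`, `tv_distance_brute_force`, `mixing_time`), the list-of-lists kernel, and the choice to start the chain from a single fixed state are assumptions made for this example; they do not come from the paper, whose chains live on continuous spaces.

```python
from itertools import combinations


def tv_distance(mu, nu):
    """Total variation distance between two distributions on a finite set.

    The supremum of mu(A) - nu(A) over events A is attained at
    A = {x : mu(x) > nu(x)}, so it equals the sum of the positive parts.
    """
    return sum(max(m - v, 0.0) for m, v in zip(mu, nu))


def tv_distance_brute_force(mu, nu):
    """Same quantity, computed directly as a maximum over all events A."""
    n = len(mu)
    best = 0.0
    for size in range(n + 1):
        for event in combinations(range(n), size):
            best = max(best, sum(mu[x] - nu[x] for x in event))
    return best


def mixing_time(kernel, start, pi, eps, t_max=10_000):
    """Smallest t with ||L(X_t) - pi||_TV < eps, for X_0 = start (an assumption)."""
    n = len(pi)
    law = [1.0 if x == start else 0.0 for x in range(n)]
    for t in range(t_max + 1):
        if tv_distance(law, pi) < eps:
            return t
        # Push the law of X_t forward one step through the kernel.
        law = [sum(law[x] * kernel[x][y] for x in range(n)) for y in range(n)]
    raise ValueError("chain did not mix within t_max steps")
```

The brute-force version is exponential in the number of states and is included only to make the supremum in the definition concrete.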

Although most scientific and statistical uses of MCMC methods occur in continuous state space, much of the mathematical mixing analysis has been in the discrete

setting. The methods that have been developed for discrete chains often break down

when used to analyze continuous chains, though there have been efforts, such as [21],

Date: August 30, 2011.


arXiv:1108.5415v1 [math.PR] 27 Aug 2011


to fix this. This paper extends the author’s previous work in [20] and work of Randall

and Winkler [17], and attempts to provide some more examples of relatively sharp

analyses of continuous chains similar to those used to develop the discrete theory.

The first case consists of narrow contingency tables. Beginning with the work of

Diaconis and Efron [6], there has been interest in finding efficient ways to sample

uniformly from the collection of integer-valued matrices with given row and column

sums. A great deal of this effort has been based on Markov chain Monte Carlo

methods. While some of the efforts have dealt directly with Markov chains on these

integer-valued matrices, much recent success, including [10] [16], has involved using

knowledge of Gibbs samplers on convex sets in $\mathbb{R}^n$ and clever ways to project from

the continuous chain to the desired matrices [15].

Unfortunately, while the general bounds are polynomial in the number of entries

in the desired matrix, they are often not of a small order; see [14]. In this note, we

find some better bounds for very specific cases. Like the note [20], this is part of an

attempt to make further use of non-Markovian coupling techniques [11] [3] [5] and also

to expand the small set of carefully analyzed Gibbs samplers [17] [18] [7] [8]. In this

case, the new techniques are two slight modifications of the path-coupling method

introduced in [4]. In many path-coupling arguments, a burn-in argument is used to

show that for most pairs of points in a metric space, there is a path along which the

Markov transition kernel is contractive acting on any pair of points along the path.

In this argument, we show that for all paths, the kernel is contractive acting on most

pairs of points along the path. This type of modification seems likely to be useful

only on continuous spaces.

We consider the following Gibbs sampler $X_t[i,j]$ on 2 by $n$ matrices satisfying the row sums $\sum_{i=1}^{n} X[i,j] = n$ for $1 \leq j \leq 2$ and column sums $\sum_{j=1}^{2} X[i,j] = 2$ for all $1 \leq i \leq n$. To make a step of the Gibbs sampler, choose two distinct integers $1 \leq i < j \leq n$ and update the four entries $X_{t+1}[i,1]$, $X_{t+1}[i,2]$, $X_{t+1}[j,1]$ and $X_{t+1}[j,2]$ to be uniform conditional on all other entries of $X_t$. We find the following reasonable bound on the mixing time of this sampler:

Theorem 1 (Convergence Rate for Narrow Matrices). For $T > (31r + 81)\, n \log(n)$,

$$\|\mathcal{L}(X_T) - U\|_{TV} \leq 13 n^{-r},$$

while for $T < n(\log(n) - r)$, and $n$ sufficiently large,

$$\|\mathcal{L}(X_T) - U\|_{TV} \geq 1 - 2e^{-r}.$$
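One step of the narrow-matrix sampler can be sketched in Python as follows. With all other entries fixed, the constraints leave a single free parameter $a = X[i,1]$: the pair sums force $X[i,2] = 2 - a$ and $X[j,2] = 2 - X[j,1]$, and $c = X[i,1] + X[j,1]$ is preserved, so the conditional distribution is uniform on the interval $\max(0, c-2) \leq a \leq \min(2, c)$. The function name `gibbs_step_narrow`, the 0-based list-of-pairs representation, and the uniform choice of the pair $(i, j)$ are assumptions made for this sketch.

```python
import random


def gibbs_step_narrow(x, rng):
    """One step of the sampler; x is a list of n pairs [X[i,1], X[i,2]].

    Each pair sums to 2 and each of the two coordinates sums to n over all i.
    The update is performed in place; the chosen indices are returned.
    """
    n = len(x)
    i, j = sorted(rng.sample(range(n), 2))  # two distinct indices, i < j
    c = x[i][0] + x[j][0]                   # preserved by the move
    # Feasible values of the new X[i,1]: all four entries must stay in [0, 2].
    lo, hi = max(0.0, c - 2.0), min(2.0, c)
    a = rng.uniform(lo, hi)
    x[i] = [a, 2.0 - a]
    x[j] = [c - a, 2.0 - (c - a)]
    return i, j
```

A typical use would start from the matrix with every entry equal to 1 and call `gibbs_step_narrow` repeatedly with a seeded `random.Random` instance.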

The next process is a Gibbs sampler on the simplex, with a very restricted set of

allowed moves. Fix a group G with symmetric generating set S. We consider the

process $X_t[g]$ on the simplex $\Delta_G = \{X : \sum_{g \in G} X[g] = 1;\ X[g] \geq 0\}$. At each step,


choose $g \in G$, $s \in S$ and $\lambda \in [0,1]$ uniformly, and set $X_{t+1}[g] = \lambda(X_t[g] + X_t[gs])$ and $X_{t+1}[gs] = (1 - \lambda)(X_t[g] + X_t[gs])$; for all other $h \in G$ set $X_{t+1}[h] = X_t[h]$. Also consider a simple random walk $S_t$ on $G$, where in each stage we choose $g \in G$ and $s \in S$ uniformly at random and set $S_{t+1} = gs$ if $S_t = g$, and $S_{t+1} = S_t$ otherwise. Then, if $\hat{\gamma}$ is the spectral gap of the walk $S_t$, we have the following:

Theorem 2 (Convergence Rate for Gibbs Sampler with Geometry). For $T > \frac{4r + 15}{\hat{\gamma}} \log(n)$,

$$\|\mathcal{L}(X_T) - U\|_{TV} \leq 9 n^{-r},$$

and conversely, for $T < \frac{r}{\hat{\gamma}}$,

$$\|\mathcal{L}(X_T) - U\|_{TV} \geq \frac{1}{2} e^{-r} - 3 n^{-1/3}.$$

This substantially generalizes [17] and [20], from samplers on the cycle or complete

graph respectively to general Cayley graphs. In addition to being of mathematical

interest, this process is an example of a gossip process with some geometry, studied by

electrical engineers and sociologists interested in how information propagates through

networks; see [19] for a survey.
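To make the update rule of the simplex sampler concrete, here is a minimal Python sketch for one illustrative choice of group: the cyclic group $\mathbb{Z}_n$ written additively, with generating set $\{+1, -1\}$. That choice, the function name `simplex_step`, and the 0-based indexing are assumptions made for the example; the theorem itself is stated for general groups.

```python
import random


def simplex_step(x, generators, rng):
    """One step of the sampler on Z_n; x is a list of n nonnegative weights summing to 1.

    `generators` is a symmetric list of nonzero residues, for example [1, -1].
    The update is performed in place; (g, gs, lambda) is returned for inspection.
    """
    n = len(x)
    g = rng.randrange(n)             # uniform group element
    s = rng.choice(generators)       # uniform generator
    h = (g + s) % n                  # the element gs, in additive notation
    lam = rng.random()               # uniform on [0, 1)
    total = x[g] + x[h]
    x[g] = lam * total               # X_{t+1}[g]  = lambda (X_t[g] + X_t[gs])
    x[h] = (1.0 - lam) * total       # X_{t+1}[gs] = (1 - lambda)(X_t[g] + X_t[gs])
    return g, h, lam
```

Each step redistributes the mass on one edge of the Cayley graph and leaves every other coordinate unchanged, so the total mass stays equal to 1.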

The proof of the upper bound will use an auxiliary chain similar to that found

in [17], a coupling argument improved from [20], and some elementary comparison

theory of Markov chains. The lower bound is somewhat simpler than that in [17].

3. General Strategy and the Partition Process

Both of our bounds will be obtained using a similar strategy, ultimately built on the

classical coupling lemma. We recall that a coupling of Markov chains with transition

kernel $K$ is a process $(X_t, Y_t)$ so that marginally both $X_t$ and $Y_t$ are Markov chains

with transition kernel K. Then we have the following lemma (see [12]):

Lemma 3 (Fundamental Coupling Lemma). If $(X_t, Y_t)$ is a coupling of Markov chains, $Y_0$ is distributed according to the stationary distribution of $K$, and $\tau$ is the first time at which $X_t = Y_t$, then

$$\|\mathcal{L}(X_t) - \mathcal{L}(Y_t)\|_{TV} \leq P[\tau > t].$$

In each chain, then, we begin with $X_t$ started at a distribution of our choice, and $Y_t$ started at stationarity. For any fixed (large) $T$, we will then couple $X_t$ and $Y_t$ so that they will have coupled by time $T$ with high probability. Each coupling will have two phases: an initial phase in which $X_t$ and $Y_t$ get close with high probability, and

a non-Markovian coupling phase in which we actually force them to collide. Unlike

many coupling proofs, the time of interest T must be specified before constructing

the coupling.

While the initial contraction phases are quite different for both chains, the final


coupling phase can be described in a unified way. The unifying device is the partition

process $P_t$ on set partitions of $[n]$, introduced in [20] for a special case of the second sampler treated here (see that note for details). This partition process contains some information about the entire process $(Y_t)_{0 \leq t \leq T}$, and is the only source of information

from the future that is used to construct the non-Markovian coupling. Critically, we

don’t use any information about the random variables on [0,1] used at each step.

Our process $P_t$ will consist of a set of nested partitions of $[n]$, $P_0 \leq P_1 \leq \ldots \leq P_T$, where I say partition $A$ is less than partition $B$ if every element of partition $B$ is a subset of an element of partition $A$. To construct $P_t$, begin by running $Y_t$ from time 0 to time $T$, and setting $P_T = \{\{1\}, \ldots, \{n\}\}$, $n$ singletons. While running the chain, we choose two privileged coordinates $i(t)$, $j(t)$ at each time $t$. To construct $P_{t-1}$ from $P_t$, we merge distinct sets $A$, $B$ in $P_t$ to a single set $A \cup B$ in $P_{t-1}$ if and only if one of $i(t)$, $j(t)$ is in $A$ and the other is in $B$. This defines the entire process. While constructing $P_t$, we will also record a series of ‘marked times’ $t_1 > t_2 > \ldots > t_m$ and associated special subsets $S(t_j,1)$ and $S(t_j,2)$ of $[n]$. We will set $t_j = \sup\{t : t < t_{j-1},\ P_{t-1} \neq P_t\}$ and $S(t_j,1)$ to be the smaller of the two elements of $P_t$ merged at time $t_j$, $S(t_j,2)$ the larger (with ties settled arbitrarily).
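The backward construction just described can be sketched in Python. The sketch takes the sequence of privileged pairs $(i(t), j(t))$ as given, starts from singletons at time $T$, and merges blocks while stepping back to time 0, recording each marked time together with the smaller and larger merged block. The function name `partition_process`, the use of 0-based coordinates, and the `frozenset` representation of blocks are assumptions made for this example.

```python
def partition_process(pairs, n):
    """Build P_0, ..., P_T and the marked times from the privileged pairs.

    pairs[t - 1] = (i(t), j(t)) for t = 1, ..., T, with coordinates in range(n).
    Returns (partitions, marked): partitions[t] is P_t as a set of frozensets,
    and marked lists (t_j, S(t_j, 1), S(t_j, 2)) in decreasing order of t_j.
    """
    def block_of(partition, k):
        return next(block for block in partition if k in block)

    T = len(pairs)
    current = {frozenset([k]) for k in range(n)}   # P_T: n singletons
    partitions = [None] * (T + 1)
    partitions[T] = set(current)
    marked = []
    for t in range(T, 0, -1):
        i, j = pairs[t - 1]
        a, b = block_of(current, i), block_of(current, j)
        if a != b:
            # Marked time: P_{t-1} differs from P_t. Ties in size are broken
            # arbitrarily (here, by keeping the order of i and j).
            small, large = sorted((a, b), key=len)
            marked.append((t, small, large))
            current = (current - {a, b}) | {a | b}
        partitions[t - 1] = set(current)
    return partitions, marked
```

Only the coordinate choices enter the construction; the uniform variables used in the updates are never inspected, which matches the remark above that the coupling uses no future information about them.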

We will be interested in the smallest time $\tau$ such that $P_{T-\tau} = \{[n]\}$, a single block.

From classical arguments (see e.g. chapter 7 of [2]), it is easy to check that

Lemma 4 (Connectedness). For the Gibbs sampler on narrow matrices,

$$P\left[\tau > \left(\tfrac{1}{2} + \epsilon\right) n \log(n)\right] \leq 2 n^{-\epsilon}.$$

The analogous lemma for the other example will be proved in section 8.

For both of our walks, we will use two types of coupling, the ‘proportional’ coupling

and the ‘subset’ coupling. In both cases, the choices of i, j will be the same for both

$X_t$ and $Y_t$ at all time steps. In the proportional coupling, we choose the uniform variable at time $t$ to minimize $\|X_{t+1} - Y_{t+1}\|_2$ given $X_t$, $Y_t$. In the simplex walk, this

involves choosing the same variable λ in both cases; for the other walk, the coupling

will be almost as simple.
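For the simplex walk, the proportional coupling can be sketched in Python as below, again using the cyclic group $\mathbb{Z}_n$ with generators $\pm 1$ as an illustrative assumption. Both chains receive the same $g$, $s$ and $\lambda$. The sketch tracks the $\ell^1$ distance rather than the $\ell^2$ distance mentioned in the text, because with shared randomness the $\ell^1$ distance can be checked never to increase: the two updated coordinates of the difference have the same sign, and their absolute values sum to $|d_g + d_{gs}| \leq |d_g| + |d_{gs}|$. The names `coupled_simplex_step` and `l1_distance` are hypothetical.

```python
import random


def coupled_simplex_step(x, y, generators, rng):
    """Apply one simplex-walk step to both chains with the same g, s and lambda."""
    n = len(x)
    g = rng.randrange(n)
    h = (g + rng.choice(generators)) % n   # the element gs, in additive notation
    lam = rng.random()
    for z in (x, y):                       # identical randomness for both chains
        total = z[g] + z[h]
        z[g] = lam * total
        z[h] = (1.0 - lam) * total


def l1_distance(x, y):
    """Sum of absolute coordinate differences between two states."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

Running the two chains from different starting points and printing `l1_distance(x, y)` every few steps gives a quick empirical view of the initial contraction phase.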

To discuss the subset coupling, we must define the weight of $X_t$ on a subset $S \subset [n]$, which we call $w(X_t,S)$. For the simplex walk, we define $w(X_t,S) = \sum_{s \in S} X_t[s]$. For narrow matrices, we define $w(X_t,S) = \sum_{s \in S} X_t[s,1]$. A subset coupling of $X_t$ and $Y_t$ with respect to set $S$ is one that attempts to force $w(X_{t+1},S) = w(Y_{t+1},S)$ with maximal probability. We say that a subset coupling succeeds if that equality holds; otherwise, we say it fails.

In each case, the coupling of $X_t$ and $Y_t$ during the non-Markovian coupling phase will be as follows. At marked times $t_j$, we will perform a subset coupling of $X_{t_j}$, $Y_{t_j}$ with respect to $S(t_j,1)$. At all other times, we will perform a proportional coupling. This leads to:


Lemma 5 (Final Coupling). Assume the non-Markovian coupling phase lasts from time $T_1$ to $T$, that $P_{T_1} = \{[n]\}$, and that all subset couplings succeed. Then $X_T = Y_T$.

Proof: At time $T_1$, we have $w(X_{T_1},[n]) = w(Y_{T_1},[n])$. I claim, inductively, that $w(X_t,S) = w(Y_t,S)$ for all $S \in P_t$ and all $T_1 \leq t \leq T$. Note that it is true at time $T_1$. By definition of the partition process, it cannot become untrue except at marked times, and at marked time $t_j$ it can only become untrue at one of the marked sets $S(t_j,1)$ or $S(t_j,2)$. By assumption, all subset couplings have succeeded, so $w(X_{t+1},S(t_j,1)) = w(Y_{t+1},S(t_j,1))$. By construction, $w(X_{t+1},S(t_j,2)) = w(X_{t+1},S(t_j,1) \cup S(t_j,2)) - w(Y_{t+1},S(t_j,1))$ and similarly for $Y_{t+1}$, so $w(X_{t+1},S(t_j,2)) = w(Y_{t+1},S(t_j,2))$. Thus, the inductive claim has been proved.

Finally, we note that if $w(X_t,\{i\}) = w(Y_t,\{i\})$ for any singleton $\{i\}$, then $X_t[i] = Y_t[i]$. Since $P_T$ consists of singletons, this gives $X_T = Y_T$. □

So, in both cases, showing that all subset couplings succeed with high probability

is sufficient to show that coupling has succeeded.

4. Contraction and Narrow Matrices

We begin with some quick observations about the geometry of our space. It is the

part of an $(n-1)$-dimensional affine subspace of $\mathbb{R}^{2n}$ that lies in the upper orthant. Our

updates are in fact moves along 1-dimensional pieces of this subspace, even though we

are updating four entries. While the original motivation for this sampler comes from

statistics, it is being treated here primarily as an example of a chain that is somewhere between the standard Gibbs sampler on the simplex and the standard Gibbs

sampler on doubly-stochastic matrices or Kac’s famous walk on the orthogonal group.

In this section, we will prove the following contractivity estimate for the Gibbs

sampler on narrow matrices:

Lemma 6 (Weak Convergence on Narrow Matrices). If $X_t$ and $Y_t$ are coupled under the proportional coupling until time $T = (10A + 10.5)\, n \log(n)$, then

$$P\left[\|X_T - Y_T\|_1 \geq \epsilon\right] \leq 3 \epsilon^{-1} n^{-A}.$$

The above lemma is a contraction result, and it will be proved using a variant of the path-coupling argument introduced in [4]. In path-coupling arguments, the goal is to couple $X_t$ and $Y_t$ by constructing an interpolating chain, $X_t = Z_t^{(0)}, Z_t^{(1)}, \ldots, Z_t^{(m)} = Y_t$ so that $d(X_0,Y_0) \sim \sum_{j=1}^{m} d(Z_0^{(j-1)}, Z_0^{(j)})$ for some metric $d$. We would then show that, in general, $E[d(Z_t^{(j-1)}, Z_t^{(j)})] \leq \alpha^t d(Z_0^{(j-1)}, Z_0^{(j)})$ for some $0 < \alpha < 1$. In most coupling arguments, we find such an $\alpha$ that holds for all pairs $Z_t^{(j-1)}$, $Z_t^{(j)}$ given a typical $X_t$ and $Y_t$; this immediately gives an estimate of $E[d(X_t,Y_t)] \leq \alpha^t \sum_{j=1}^{m} d(Z_0^{(j-1)}, Z_0^{(j)}) \sim \alpha^t d(X_0,Y_0)$. In our argument, we show this only for