
Bayesian Analysis (2013) 8, Number 1, pp. 23–26

Comment on Article by Schmidl et al.

Dawn B. Woodard*

The authors develop a novel Markov chain method (ACIMH) that is designed to sample efficiently by adapting to the features of the target distribution, learning from the samples obtained previously. Their approach is based on independence Metropolis-Hastings (IMH), and takes the proposal distribution to be an estimate of the target distribution. Its appeal is that the efficiency of IMH is controlled by how close the proposal density is to the target density. If a very accurate estimate of the target density can be obtained, then the samples obtained by the Markov chain are nearly independent, leading to near-optimal accuracy of Monte Carlo approximations.

The authors' estimate of the target density is based on D-vine copulas, an extremely flexible class of models that has not been used previously for this purpose. Other authors have proposed similar "adaptive" IMH methods based on alternative estimates of the target density, in particular mixture distributions such as normal or $t$ mixtures (Andrieu and Moulines 2006; Andrieu and Thoms 2008). Continued development of efficient general-purpose sampling algorithms like these is critical as we create robust software packages for Bayesian statistics that will encourage its widespread use.

D-vine copulas use a factorization approach that may scale more effectively with dimension than the mixture model approach. ACIMH factorizes the target density as
\[
f(x) \;=\; \Bigl[\,\prod_{j=1}^{d-1}\prod_{i=1}^{d-j} c_{i,(i+j)\,|\,(i+1):(i+j-1)}\Bigr]\cdot\Bigl[\,\prod_{k=1}^{d} f_k(x_k)\Bigr], \qquad x \in \mathcal{X}, \tag{1}
\]
and then drops the dependence of $c_{i,(i+j)|(i+1):(i+j-1)}$ on $x_{(i+1):(i+j-1)}$, so that only univariate and bivariate densities need to be estimated. However, I show that as ACIMH is currently defined its efficiency still degrades exponentially in the dimension $d$. I do this for several simple but representative target densities. So ACIMH is expected to be ineffective for high-dimensional problems; this may explain why the examples used by Schmidl et al. are low-dimensional (having $d = 2$, 3, and 7). I will argue that this issue is inherent to any IMH method that takes the proposal density to be an estimate of the joint target density based on past samples. I then suggest that this problem may be mitigated by applying ACIMH or a related method to blocks of parameters separately rather than to the entire parameter vector simultaneously. ACIMH is a promising addition to the Markov chain toolbox, due to its ability to flexibly estimate aspects of the target density; however, it needs to be used in a blocked fashion in order to scale well with dimension.

*School of Operations Research & Information Engineering and Department of Statistical Science, Cornell University, Ithaca, NY. http://people.orie.cornell.edu/woodard

© 2013 International Society for Bayesian Analysis    DOI:10.1214/13-BA801A


1 Efficiency of ACIMH

Here I analyze the efficiency of ACIMH as a function of the dimension $d$. ACIMH is not an adaptive MCMC method in the purest sense, namely that the chain continues to adapt forever. ACIMH stops updating the copula after a fixed number of iterations, so one can view it instead as a standard Markov chain with a tuning period. Viewing it in this way, I will give results on the convergence rate of the Markov chain after this period. Take the number of iterations $n$ in the tuning period to be fixed (not dependent on $d$), and let $\hat f(\cdot)$ be the estimator of $f(\cdot)$ obtained at the end of this period.

I argue that: (a) the accuracy of $\hat f$ degrades exponentially in $d$, in the sense that $\inf_{x \in \mathcal{X}} \hat f(x)/f(x)$ decays exponentially; and (b) this implies that the convergence rate of IMH with proposal density $\hat f$ decays exponentially in $d$. I will show that (a) holds for several simple but non-pathological target densities. Step (b) is proven by Liu (1996), although the continuous-state-space case is handled incompletely. Specifically, Liu (1996) shows that for a discrete state space the spectral gap (convergence rate) of IMH is equal to $\inf_{x \in \mathcal{X}} q(x)/f(x)$, where $q(\cdot)$ is the proposal distribution. In the continuous-state-space case, which is more technically challenging, he gives evidence strongly suggesting that this result still holds. Combined with (a), this suggests that the spectral gap of ACIMH decays exponentially in $d$. Informally, this means that the number of iterations of ACIMH required to attain a fixed accuracy increases exponentially in $d$. More formally, the number of iterations required to decrease the ($\chi^2$) distance to the stationary distribution by a fixed factor grows exponentially in $d$, in the worst case over starting distributions (cf. Woodard, Schmidler, and Huber 2009). Similar implications hold regarding the accuracy of Monte Carlo estimators.
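Liu's discrete-state identity can be checked directly. The sketch below (a hypothetical two-state example of my own, not from the article) builds the IMH transition probabilities on $\mathcal{X} = \{0,1\}$ from the acceptance rule and confirms that the spectral gap $1 - \lambda_2 = P(0\to1) + P(1\to0)$ equals $\inf_x q(x)/f(x)$:

```python
def imh_gap(f, q):
    """Spectral gap of independence Metropolis-Hastings on the two-state
    space {0, 1}: for a 2x2 transition matrix the second eigenvalue is
    1 - P(0->1) - P(1->0), so the gap is P(0->1) + P(1->0)."""
    def accept(x, y):
        # IMH acceptance probability min{1, f(y) q(x) / (f(x) q(y))}
        return min(1.0, f[y] * q[x] / (f[x] * q[y]))

    p01 = q[1] * accept(0, 1)  # propose state 1, maybe accept
    p10 = q[0] * accept(1, 0)  # propose state 0, maybe accept
    return p01 + p10

# The gap matches Liu's formula inf_x q(x)/f(x) in both examples.
print(imh_gap((0.9, 0.1), (0.5, 0.5)), 0.5 / 0.9)
print(imh_gap((0.3, 0.7), (0.6, 0.4)), 0.4 / 0.7)
```

As the proposal $q$ drifts away from the target $f$, $\inf_x q(x)/f(x)$, and hence the gap, shrinks; this is exactly the mechanism behind the exponential degradation discussed above.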

To show (a) for several examples, I rely solely on the error introduced by estimation of the term $\prod_{k=1}^{d} f_k(x_k)$ in (1). I assume that the bivariate densities $c_{i,(i+j)|(i+1):(i+j-1)}$ satisfy the modeling assumption (do not depend on $x_{(i+1):(i+j-1)}$) and are estimated with perfect accuracy; taking into account this source of error would only increase the overall error. Specifically, I will take target densities that have the product form $f(x) \triangleq \prod_{k=1}^{d} f_k(x_k)$ for $x \in \mathcal{X}$, and assume that the bivariate densities $c_{i,(i+j)|(i+1):(i+j-1)}$ are correctly estimated to be equal to one on the support, so that the goal is to estimate $f(x) = \prod_{k=1}^{d} f_k(x_k)$ by $\hat f(x) = \prod_{k=1}^{d} \hat f_k(x_k)$. In this simplified context ACIMH can be defined on any state space, not just the Euclidean spaces which are needed for the full copula representation. All that is needed is a parametric form for $\hat f_k(x_k)$, the parameters of which are estimated by maximum likelihood, as recommended by Schmidl et al. For simplicity I will also assume that the samples $x_i = (x_{i1}, \dots, x_{id})$ for $i \in \{1,\dots,n\}$ from the tuning (adaptation) period are i.i.d. according to $f(\cdot)$; taking into account their autocorrelation would only inflate the error.

First, consider the discrete state space $\mathcal{X} = \{0,1\}^d$ and the target density $f(x) \triangleq \prod_{k=1}^{d} f_k(x_k)$, where $f_k(x_k) \triangleq p_k^{x_k}(1-p_k)^{1-x_k}$ and $p_k \in (0,1)$. The maximum likelihood estimator of $p_k$ is $\hat p_k = \frac{1}{n}\sum_{i=1}^{n} x_{ik}$, for $k \in \{1,\dots,d\}$. To avoid a degenerate estimate, replace $\hat p_k$ with $\frac{1}{n}$ if $\sum_i x_{ik} = 0$ and with $\frac{n-1}{n}$ if $\sum_i x_{ik} = n$.


Say that the true values of $p_k$ are all $p_k = \frac{1}{2}$, and define
\[
W_d \;\triangleq\; \ln \frac{\hat f(1,\dots,1)}{f(1,\dots,1)} \;=\; \sum_{k=1}^{d} \bigl[\ln \hat f_k(1) - \ln f_k(1)\bigr] \;=\; \sum_{k=1}^{d} \Bigl[\ln \hat p_k - \ln \tfrac{1}{2}\Bigr].
\]
The quantity $E(\ln \hat p_k)$ does not depend on $k$ or $d$, and by Jensen's inequality $E(\ln \hat p_k) < \ln E\hat p_k = \ln \frac{1}{2}$, so $EW_d = cd$ for some $c \in (-\infty, 0)$. Also, $\operatorname{Var}(W_d) = d\,\operatorname{Var}(\ln \hat p_1)$, where $\operatorname{Var}(\ln \hat p_1)$ does not depend on $d$. By Chebyshev's inequality,
\[
\Pr\bigl(W_d \ge c d^{1/4}\bigr) \;\le\; \Pr\bigl(|W_d - EW_d| \ge -c d^{3/4}\bigr) \;\le\; \frac{d\,\operatorname{Var}(\ln \hat p_1)}{c^2 d^{3/2}} \;\xrightarrow{\,d\to\infty\,}\; 0.
\]
So $\Pr\bigl(\hat f(1,\dots,1)/f(1,\dots,1) < \exp\{c d^{1/4}\}\bigr) \xrightarrow{\,d\to\infty\,} 1$, meaning that $\inf_{x\in\mathcal{X}} \hat f(x)/f(x)$ decays exponentially in $d$.
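The linear downward drift $EW_d = cd$ can be seen in a quick simulation. The sketch below is my own illustration (the tuning-sample size $n = 20$, the dimensions, and the replication count are arbitrary choices): it draws i.i.d. tuning samples from the $p_k = \frac{1}{2}$ target, fits each $\hat p_k$ by maximum likelihood with the degenerate-estimate correction above, and averages $W_d$ over replications.

```python
import math
import random

def draw_wd(d, n, rng):
    """One replication of W_d = ln f_hat(1,...,1) - ln f(1,...,1):
    n i.i.d. tuning samples per coordinate from Bernoulli(1/2), with
    the maximum-likelihood estimate p_hat_k replaced by 1/n or (n-1)/n
    at the degenerate boundaries."""
    wd = 0.0
    for _ in range(d):
        s = sum(rng.random() < 0.5 for _ in range(n))
        if s == 0:
            p_hat = 1.0 / n
        elif s == n:
            p_hat = (n - 1.0) / n
        else:
            p_hat = s / n
        wd += math.log(p_hat) - math.log(0.5)
    return wd

rng = random.Random(0)
n, reps = 20, 100
mean_wd = {d: sum(draw_wd(d, n, rng) for _ in range(reps)) / reps
           for d in (50, 500)}
# E[W_d] = c*d with c < 0 at fixed n, so the average log-ratio of
# f_hat/f at (1,...,1) drifts downward roughly linearly in d.
print(mean_wd)
```

With $n$ fixed, the average log-ratio scales with $d$, i.e. $\hat f/f$ at the point $(1,\dots,1)$ shrinks exponentially in the dimension.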

For a continuous-state-space example, take $\mathcal{X} = \mathbb{R}^d$, $f(x) = \prod_{k=1}^{d} f_k(x_k)$, and $f_k(x_k) = N(x_k; \mu_k, 1)$. The maximum likelihood estimator of $\mu_k$ is $\hat\mu_k = \frac{1}{n}\sum_{i=1}^{n} x_{ik}$. Say that the true values are $\mu_k = 0$, and define
\[
W_d \;\triangleq\; \ln \frac{\hat f(0,\dots,0)}{f(0,\dots,0)} \;=\; \sum_{k=1}^{d} \Bigl[-\tfrac{1}{2}\hat\mu_k^2\Bigr].
\]
We have $E\hat\mu_k^2 > (E\hat\mu_k)^2 = 0$, and the rest of the argument is analogous to the first example.
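In this Gaussian case the constant can be made explicit (a supplementary check of mine, under the i.i.d. tuning-sample assumption above): $\hat\mu_k \sim N(0, 1/n)$, so
\[
E W_d \;=\; -\frac{1}{2}\sum_{k=1}^{d} E\hat\mu_k^2 \;=\; -\frac{d}{2n},
\]
and at fixed tuning-sample size $n$ the expected log-ratio decreases linearly in $d$, so that $\hat f(0,\dots,0)/f(0,\dots,0)$ again decays exponentially.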

Here I relied only on the error associated with estimating $f_k(x_k)$, and the fact that it combines multiplicatively when estimating $\prod_{k=1}^{d} f_k(x_k)$. This is not specific to ACIMH; attempting to estimate the joint density $f(x)$ directly without any factorization assumptions would presumably lead to even higher error. So the inefficiency we have identified is inherent to any IMH method that takes the proposal density to be an estimate $\hat f(\cdot)$ of the joint density based on previous samples.

2 Conclusion

Although the authors focus on updating the entire parameter vector at once, when $d$ is large it may be more efficient to apply ACIMH to blocks of parameters. The blocks would be chosen to contain highly dependent sets of parameters, and could overlap. Specifically, one would select subsets $A_j \subset \{1,\dots,d\}$ of the parameter index set, for $j \in \{1,\dots,J\}$, where $J$ is the desired number of blocks and $\cup_{j=1}^{J} A_j = \{1,\dots,d\}$. After an initial sampling period, the samples would be used to estimate the marginal density $f_j(x_{A_j})$ of each subvector of the parameters, for instance with vine copulas. Then one would simulate a Metropolis-within-Gibbs chain, updating the subvectors $x_{A_j}$ in turn according to IMH moves with proposal density $q_j(x_{A_j}) = \hat f_j(x_{A_j})$ and acceptance rate
\[
\min\left\{1,\; \frac{f(x^{\mathrm{new}})\, q_j\bigl(x^{\mathrm{old}}_{A_j}\bigr)}{f(x^{\mathrm{old}})\, q_j\bigl(x^{\mathrm{new}}_{A_j}\bigr)}\right\}.
\]
This chain would be more efficient if each proposal density $q_j$ were equal to the conditional density $f(x_{A_j} \mid x_{\{1,\dots,d\}\setminus A_j})$ of $x_{A_j}$ given the remainder of the parameter vector, since then the acceptance rate would always be equal to one. However, estimating the conditional density $f(x_{A_j} \mid x_{\{1,\dots,d\}\setminus A_j})$ would suffer from the same difficulties as estimating the joint density $f(x)$, so I am instead suggesting the use of the estimated marginal density $\hat f_j(x_{A_j})$. Although suboptimal, this substitution may still yield an efficient algorithm.
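As a concrete illustration of this blocked scheme, here is a minimal sketch of my own (not the authors' implementation): a hypothetical four-dimensional product-of-standard-normals target, with maximum-likelihood Gaussian marginal fits standing in for the vine-copula estimates, and two blocks $A_1$, $A_2$ of two coordinates each.

```python
import math
import random

def log_target(x):
    # log f(x), up to a constant, for a product of standard normals
    return -0.5 * sum(v * v for v in x)

def fit_marginals(pilot):
    """Maximum-likelihood N(mu, sigma^2) fit to each coordinate of the
    pilot (tuning) samples; stands in for the vine-copula estimate."""
    n = len(pilot)
    fits = []
    for k in range(len(pilot[0])):
        col = [row[k] for row in pilot]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        fits.append((mu, math.sqrt(var)))
    return fits

def log_q(x, fits, block):
    # log proposal density of one block: product of fitted marginals
    out = 0.0
    for k in block:
        mu, sd = fits[k]
        z = (x[k] - mu) / sd
        out += -math.log(sd) - 0.5 * z * z
    return out

def blocked_imh(fits, blocks, iters, rng):
    """Metropolis-within-Gibbs: update each block x_{A_j} in turn by an
    IMH move with acceptance rate
    min{1, f(new) q_j(old block) / (f(old) q_j(new block))}."""
    x = [0.0] * len(fits)
    draws = []
    for _ in range(iters):
        for block in blocks:
            prop = list(x)
            for k in block:
                mu, sd = fits[k]
                prop[k] = rng.gauss(mu, sd)
            log_a = (log_target(prop) - log_target(x)
                     + log_q(x, fits, block) - log_q(prop, fits, block))
            if math.log(rng.random()) < log_a:
                x = prop
        draws.append(list(x))
    return draws

rng = random.Random(1)
pilot = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(500)]
fits = fit_marginals(pilot)
draws = blocked_imh(fits, [[0, 1], [2, 3]], 2000, rng)
mean0 = sum(row[0] for row in draws) / len(draws)
var0 = sum(row[0] ** 2 for row in draws) / len(draws)
```

Because this toy target is a product, each block's marginal coincides with its conditional and the chain mixes well; with dependence across blocks the marginal proposal is suboptimal in the way described above, but the per-block estimation problem stays low-dimensional.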


References

Andrieu, C. and Moulines, E. (2006). "On the ergodicity properties of some adaptive MCMC algorithms." Annals of Applied Probability, 16: 1462–1505.

Andrieu, C. and Thoms, J. (2008). "A tutorial on adaptive MCMC." Statistics and Computing, 18: 343–373.

Liu, J. S. (1996). "Metropolized independent sampling with comparisons to rejection sampling and importance sampling." Statistics and Computing, 6: 113–119.

Woodard, D. B., Schmidler, S. C., and Huber, M. (2009). "Sufficient conditions for torpid mixing of parallel and simulated tempering." Electronic Journal of Probability, 14: 780–804.