Bayesian Analysis (2013) 8, Number 1, pp. 23–26
Comment on Article by Schmidl et al.
Dawn B. Woodard*
The authors develop a novel Markov chain method (ACIMH) that is designed to
sample efficiently by adapting to the features of the target distribution, learning from
the samples obtained previously. Their approach is based on independence Metropolis-
Hastings (IMH), and takes the proposal distribution to be an estimate of the target
distribution. Its appeal is that the efficiency of IMH is controlled by how close the
proposal density is to the target density. If a very accurate estimate of the target
density can be obtained, then the samples obtained by the Markov chain are nearly
independent, leading to near-optimal accuracy of Monte Carlo approximations.
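The IMH mechanics that ACIMH builds on can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' implementation; the toy standard-normal target and slightly-too-narrow proposal are my own choices:

```python
import numpy as np

def imh(log_f, prop_sample, prop_logpdf, n_iter, x0, rng):
    """Independence Metropolis-Hastings: the proposal ignores the current
    state, so acceptance depends only on the importance ratio w = f/q
    at the current and proposed points."""
    x, lw_x = x0, log_f(x0) - prop_logpdf(x0)
    out = np.empty(n_iter)
    for t in range(n_iter):
        y = prop_sample(rng)
        lw_y = log_f(y) - prop_logpdf(y)
        # Accept with probability min{1, w(y)/w(x)}.
        if np.log(rng.uniform()) < lw_y - lw_x:
            x, lw_x = y, lw_y
        out[t] = x
    return out

# Toy run: standard-normal target, N(0, 0.9^2) proposal.
rng = np.random.default_rng(0)
log_f = lambda x: -0.5 * x**2
prop_sample = lambda rng: 0.9 * rng.standard_normal()
prop_logpdf = lambda x: -0.5 * (x / 0.9) ** 2 - np.log(0.9)
draws = imh(log_f, prop_sample, prop_logpdf, 20000, 0.0, rng)
print(round(draws.mean(), 2), round(draws.std(), 2))
```

Even with an imperfect proposal the chain targets the correct distribution; what degrades as the proposal worsens is the rate at which it does so.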
The authors’ estimate of the target density is based on D-vine copulas, an extremely
flexible class of models that has not been used previously for this purpose. Other
authors have proposed similar “adaptive” IMH methods based on alternative estimates
of the target density, in particular mixture distributions such as normal or t mixtures
(Andrieu and Moulines 2006; Andrieu and Thoms 2008). Continued development of
efficient general-purpose sampling algorithms like these is critical as we create robust
software packages for Bayesian statistics that will encourage its widespread use.
D-vine copulas use a factorization approach that may scale more effectively with
dimension than the mixture model approach. ACIMH factorizes the target density as
$$f(x) = \Bigg[\prod_{j=1}^{d-1}\prod_{i=1}^{d-j} c_{i,(i+j)\mid(i+1):(i+j-1)}\Bigg] \cdot \Bigg[\prod_{k=1}^{d} f_k(x_k)\Bigg], \qquad x \in \mathcal{X}, \tag{1}$$
then drops the dependence of $c_{i,(i+j)\mid(i+1):(i+j-1)}$ on $x_{(i+1):(i+j-1)}$, so that only univariate and bivariate densities need to be estimated. However, I show that as ACIMH
is currently defined, its efficiency still degrades exponentially in the dimension $d$. I do
this for several simple but representative target densities. So ACIMH is expected to be
ineffective for high-dimensional problems; this may explain why the examples used by
Schmidl et al. are low-dimensional (having $d = 2$, 3, and 7). I will argue that this issue
is inherent to any IMH method that takes the proposal density to be an estimate of the
joint target density based on past samples. I then suggest that this problem may be
mitigated by applying ACIMH or a related method to blocks of parameters separately
rather than to the entire parameter vector simultaneously. ACIMH is a promising ad-
dition to the Markov chain toolbox, due to its ability to flexibly estimate aspects of the
target density; however, it needs to be used in a blocked fashion in order to scale well
with dimension.
*School of Operations Research & Information Engineering and Department of Statistical Science,
Cornell University, Ithaca, NY. http://people.orie.cornell.edu/woodard
2013 International Society for Bayesian Analysis DOI:10.1214/13-BA801A
1 Efficiency of ACIMH
Here I analyze the efficiency of ACIMH as a function of the dimension d. ACIMH is
not an adaptive MCMC method in the purest sense, namely that the chain continues
to adapt forever. ACIMH stops updating the copula after a fixed number of iterations,
so one can view it instead as a standard Markov chain with a tuning period. Viewing
it in this way, I will give results on the convergence rate of the Markov chain after this
period. Take the number of iterations $n$ in the tuning period to be fixed (not dependent on $d$), and let $\hat f(\cdot)$ be the estimator of $f(\cdot)$ obtained at the end of this period.
I argue that: (a) the accuracy of $\hat f$ degrades exponentially in $d$, in the sense that $\inf_{x \in \mathcal{X}} \hat f(x)/f(x)$ decays exponentially; and (b) this implies that the convergence rate of IMH with proposal density $\hat f$ decays exponentially in $d$. I will show that (a) holds for several simple but non-pathological target densities. The step (b) is proven by Liu (1996), although the continuous-state-space case is handled incompletely. Specifically, Liu (1996) shows that for a discrete state space the spectral gap (convergence rate) of IMH is equal to $\inf_{x \in \mathcal{X}} q(x)/f(x)$, where $q(\cdot)$ is the proposal distribution. In the continuous-state-space case, which is more technically challenging, he gives evidence strongly suggesting that this result still holds. Combined with (a), this suggests that the spectral gap of ACIMH decays exponentially in $d$. Informally, this means that the number of iterations of ACIMH required to attain a fixed accuracy increases exponentially in $d$. More formally, the number of iterations required to decrease the ($\chi^2$) distance to the stationary distribution by a fixed factor grows exponentially in $d$, in the worst case over starting distributions (cf. Woodard, Schmidler, and Huber 2009). Similar implications hold regarding the accuracy of Monte Carlo estimators.
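Liu's discrete-state-space identity is easy to check numerically. The sketch below, with an arbitrary three-point target and proposal of my own choosing, builds the IMH transition matrix and compares its spectral gap to $\inf_x q(x)/f(x)$:

```python
import numpy as np

# Discrete IMH transition matrix: for y != x,
#   P(x, y) = q(y) * min{1, f(y)q(x) / (f(x)q(y))},
# with the rejected mass placed on the diagonal.
f = np.array([0.5, 0.3, 0.2])   # target distribution
q = np.array([0.2, 0.3, 0.5])   # proposal distribution
w = f / q                       # importance weights
P = np.array([[q[y] * min(1.0, w[y] / w[x]) for y in range(3)]
              for x in range(3)])
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))

eig = np.sort(np.linalg.eigvals(P).real)[::-1]
gap = 1.0 - eig[1]              # spectral gap = 1 - second-largest eigenvalue
print(round(gap, 6), round((q / f).min(), 6))
```

For this example the gap equals $\min_x q(x)/f(x) = 0.4$, matching Liu's formula exactly.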
To show (a) for several examples, I rely solely on the error introduced by estimation of the term $\prod_{k=1}^{d} f_k(x_k)$ in (1). I assume that the bivariate densities $c_{i,(i+j)\mid(i+1):(i+j-1)}$ satisfy the modeling assumption (do not depend on $x_{(i+1):(i+j-1)}$) and are estimated with perfect accuracy; taking into account this source of error would only increase the overall error. Specifically, I will take target densities that have the product form $f(x) = \prod_{k=1}^{d} f_k(x_k)$ for $x \in \mathcal{X}$, and assume that the bivariate densities $c_{i,(i+j)\mid(i+1):(i+j-1)}$ are correctly estimated to be equal to one on the support, so that the goal is to estimate $f(x) = \prod_{k=1}^{d} f_k(x_k)$ by $\hat f(x) = \prod_{k=1}^{d} \hat f_k(x_k)$. In this simplified context ACIMH can be defined on any state space, not just the Euclidean spaces which are needed for the full copula representation. All that is needed is a parametric form for $\hat f_k(x_k)$, the parameters of which are estimated by maximum likelihood as recommended by Schmidl et al. For simplicity I will also assume that the samples $x_i = (x_{i1}, \ldots, x_{id})$ for $i \in \{1, \ldots, n\}$ from the tuning (adaptation) period are i.i.d. according to $f(\cdot)$; taking into account their autocorrelation would only inflate the error.
First, consider the discrete state space $\mathcal{X} = \{0,1\}^d$ and the target density $f(x) = \prod_{k=1}^{d} f_k(x_k)$ where $f_k(x_k) = p_k^{x_k}(1-p_k)^{1-x_k}$ and $p_k \in (0,1)$. The maximum likelihood estimator of $p_k$ is $\hat p_k = \frac{1}{n}\sum_{i=1}^{n} x_{ik}$, for $k \in \{1, \ldots, d\}$. To avoid a degenerate estimate, replace this with $\frac{1}{n}$ if $\sum_i x_{ik} = 0$ and with $\frac{n-1}{n}$ if $\sum_i x_{ik} = n$.
Say that the true values of $p_k$ are all $p_k = \frac{1}{2}$, and define
$$W_d = \ln \frac{\hat f(1,\ldots,1)}{f(1,\ldots,1)} = \sum_{k=1}^{d} \big[\ln \hat f_k(1) - \ln f_k(1)\big] = \sum_{k=1}^{d} \Big[\ln \hat p_k - \ln \tfrac{1}{2}\Big].$$
The quantity $E(\ln \hat p_k)$ does not depend on $k$ or $d$, and by Jensen's inequality $E(\ln \hat p_k) < \ln E \hat p_k = \ln \frac{1}{2}$, so $E W_d = cd$ for some $c \in (-\infty, 0)$. Also, $\mathrm{Var}(W_d) = d\,\mathrm{Var}(\ln \hat p_1)$, where $\mathrm{Var}(\ln \hat p_1)$ does not depend on $d$. By Chebyshev's inequality,
$$\Pr\big(W_d \ge c d^{1/4}\big) \le \Pr\big(|W_d - E W_d| \ge -c d^{3/4}\big) \le \frac{d\,\mathrm{Var}(\ln \hat p_1)}{c^2 d^{3/2}} \xrightarrow{\;d \to \infty\;} 0.$$
So $\Pr\big(\hat f(1,\ldots,1)/f(1,\ldots,1) < \exp\{c d^{1/4}\}\big) \xrightarrow{\;d \to \infty\;} 1$, meaning that $\inf_{x \in \mathcal{X}} \hat f(x)/f(x)$ decays exponentially in $d$.
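This collapse is visible in a small simulation. The code below is illustrative only; the tuning-sample size $n = 10$ and the replication count are arbitrary choices of mine:

```python
import numpy as np

def mean_log_ratio(d, n=10, reps=200, seed=1):
    """Average of W_d = ln[f_hat(1,...,1)/f(1,...,1)] over `reps`
    independent tuning runs of size n, for the Bernoulli(1/2)
    product target with the truncated MLE described above."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(reps):
        x = rng.integers(0, 2, size=(n, d))                  # n i.i.d. draws from f
        p_hat = np.clip(x.mean(axis=0), 1 / n, (n - 1) / n)  # truncated MLE
        vals.append(np.sum(np.log(p_hat) - np.log(0.5)))
    return float(np.mean(vals))

for d in (5, 50, 500):
    print(d, round(mean_log_ratio(d), 2))
```

The average of $W_d$ is negative and grows roughly linearly in magnitude with $d$, consistent with $E W_d = cd$ for $c < 0$.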
For a continuous-state-space example, take $\mathcal{X} = \mathbb{R}^d$, $f(x) = \prod_{k=1}^{d} f_k(x_k)$, and $f_k(x_k) = N(x_k; \mu_k, 1)$. The maximum likelihood estimator of $\mu_k$ is $\hat\mu_k = \frac{1}{n}\sum_{i=1}^{n} x_{ik}$. Say that the true values are $\mu_k = 0$, and define $W_d = \ln \frac{\hat f(0,\ldots,0)}{f(0,\ldots,0)} = \sum_{k=1}^{d} \big[{-\tfrac{1}{2}\hat\mu_k^2}\big]$. We have $E \hat\mu_k^2 > (E \hat\mu_k)^2 = 0$, and the rest of the argument is analogous to the first example.
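The practical consequence for the sampler itself can also be simulated. In the sketch below (illustrative; the Gaussian target, tuning-sample size, and iteration counts are my own choices), IMH uses the proposal $N(\hat\mu, I_d)$ fitted to $n = 10$ tuning draws from the $N(0, I_d)$ target, and its acceptance rate falls as $d$ grows:

```python
import numpy as np

def imh_accept_rate(d, n_tune=10, n_iter=4000, seed=2):
    """Acceptance rate of IMH targeting N(0, I_d) with proposal
    N(mu_hat, I_d), where mu_hat is the MLE from n_tune i.i.d.
    tuning draws from the target."""
    rng = np.random.default_rng(seed)
    mu_hat = rng.standard_normal((n_tune, d)).mean(axis=0)
    # Log importance weight log(f/q), up to a constant that cancels.
    log_w = lambda x: -0.5 * x @ x + 0.5 * (x - mu_hat) @ (x - mu_hat)
    x = mu_hat + rng.standard_normal(d)
    acc = 0
    for _ in range(n_iter):
        y = mu_hat + rng.standard_normal(d)
        if np.log(rng.uniform()) < log_w(y) - log_w(x):
            x, acc = y, acc + 1
    return acc / n_iter

for d in (2, 20, 200):
    print(d, imh_accept_rate(d))
```

The per-coordinate estimation error is small and does not shrink with $d$, so the mismatch between $\hat f$ and $f$ accumulates across coordinates and drives the acceptance rate down.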
Here I relied only on the error associated with estimating $f_k(x_k)$, and the fact that it combines multiplicatively when estimating $\prod_{k=1}^{d} f_k(x_k)$. This is not specific to ACIMH; attempting to estimate the joint density $f(x)$ directly, without any factorization assumptions, would presumably lead to even higher error. So the inefficiency we have identified is inherent to any IMH method that takes the proposal density to be an estimate $\hat f(\cdot)$ of the joint density based on previous samples.
2 Conclusion
Although the authors focus on updating the entire parameter vector at once, when d
is large it may be more efficient to apply ACIMH to blocks of parameters. The blocks
would be chosen to contain highly dependent sets of parameters, and could overlap.
Specifically, one would select subsets $A_j \subset \{1, \ldots, d\}$ of the parameter index set, for $j \in \{1, \ldots, J\}$, where $J$ is the desired number of blocks and $\cup_{j=1}^{J} A_j = \{1, \ldots, d\}$. After an initial sampling period, the samples would be used to estimate the marginal density $f_j(x_{A_j})$ of each subvector of the parameters, for instance with vine copulas. Then one would simulate a Metropolis-within-Gibbs chain, updating the subvectors $x_{A_j}$ in turn according to IMH moves with proposal density $q_j(x_{A_j}) = \hat f_j(x_{A_j})$ and acceptance rate
$$\min\left\{1, \frac{f(x^{\text{new}})\, q_j(x^{\text{old}}_{A_j})}{f(x^{\text{old}})\, q_j(x^{\text{new}}_{A_j})}\right\}.$$
This chain would be more efficient if each proposal density $q_j$ were equal to the conditional density $f(x_{A_j} \mid x_{\{1,\ldots,d\}\setminus A_j})$ of $x_{A_j}$ given the remainder of the parameter vector, since then the acceptance rate would always be equal to one. However, estimating the conditional density $f(x_{A_j} \mid x_{\{1,\ldots,d\}\setminus A_j})$ would suffer from the same difficulties as estimating the joint density $f(x)$, so I am instead suggesting the use of the estimated marginal density $\hat f_j(x_{A_j})$. Although suboptimal, this substitution may still yield an efficient algorithm.
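As a deliberately simplified sketch of this suggestion, the code below runs Metropolis-within-Gibbs with per-block IMH moves. A multivariate normal fitted to each block's tuning samples stands in for the vine-copula estimate $\hat f_j(x_{A_j})$; all names, the toy target, and the settings are illustrative:

```python
import numpy as np

def blocked_imh(log_f, blocks, tune, n_iter, rng):
    """Metropolis-within-Gibbs: update each block A_j by an IMH move
    whose proposal is a normal fit to that block's tuning samples
    (a crude stand-in for the estimated marginal f_hat_j)."""
    d = tune.shape[1]
    fits = []
    for A in blocks:
        m = tune[:, A].mean(axis=0)
        C = np.cov(tune[:, A], rowvar=False) + 1e-6 * np.eye(len(A))
        fits.append((m, C, np.linalg.inv(C)))

    def q_logpdf(j, z):  # block-j proposal log-density, up to a constant
        m, _, Cinv = fits[j]
        r = z - m
        return -0.5 * r @ Cinv @ r

    x = tune[-1].copy()
    out = np.empty((n_iter, d))
    for t in range(n_iter):
        for j, A in enumerate(blocks):
            y = x.copy()
            y[A] = rng.multivariate_normal(fits[j][0], fits[j][1])
            # IMH acceptance: min{1, f(y) q_j(x_Aj) / (f(x) q_j(y_Aj))}.
            if np.log(rng.uniform()) < (log_f(y) - log_f(x)
                                        + q_logpdf(j, x[A]) - q_logpdf(j, y[A])):
                x = y
        out[t] = x
    return out

# Toy run: standard-normal target in d = 4, two blocks of two parameters.
rng = np.random.default_rng(3)
log_f = lambda x: -0.5 * x @ x
tune = rng.standard_normal((200, 4))            # stand-in tuning samples
blocks = [np.array([0, 1]), np.array([2, 3])]
draws = blocked_imh(log_f, blocks, tune, 2000, rng)
print(round(draws.mean(), 2), round(draws.std(), 2))
```

Because each proposal only needs to approximate a low-dimensional block marginal, its accuracy, and hence the per-block acceptance rate, need not degrade as the total dimension $d$ grows.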
References
Andrieu, C. and Moulines, E. (2006). "On the ergodicity properties of some adaptive MCMC algorithms." Annals of Applied Probability, 16: 1462–1505.

Andrieu, C. and Thoms, J. (2008). "A tutorial on adaptive MCMC." Statistics and Computing, 18: 343–373.

Liu, J. S. (1996). "Metropolized independent sampling with comparisons to rejection sampling and importance sampling." Statistics and Computing, 6: 113–119.

Woodard, D. B., Schmidler, S. C., and Huber, M. (2009). "Sufficient conditions for torpid mixing of parallel and simulated tempering." Electronic Journal of Probability, 14: 780–804.