Article
PDF available

# Convergence Control Methods for Markov Chain Monte Carlo Algorithms


## Abstract

Markov chain Monte Carlo methods have been increasingly popular since their introduction by Gelfand and Smith. However, while the breadth and variety of Markov chain Monte Carlo applications are properly astounding, progress in the control of convergence for these algorithms has been slow, despite its relevance in practical implementations. We present here different approaches toward this goal based on functional and mixing theories, while paying particular attention to the central limit theorem and to the approximation of the limiting variance. Renewal theory in the spirit of Mykland, Tierney and Yu is presented as the most promising technique in this regard, and we illustrate its potential in several examples. In addition, we stress that many strong convergence properties can be derived from the study of simple sub-chains which are produced by Markov chain Monte Carlo algorithms, due to a duality principle obtained in Diebolt and Robert for mixture estimation. We show here the generality of this principle which applies, for instance, to most missing data models. A more empirical stopping rule for Markov chain Monte Carlo algorithms is related to the simultaneous convergence of different estimators of the quantity of interest. Besides the regular ergodic average, we propose the Rao-Blackwellized version as well as estimates based on importance sampling and trapezoidal approximations of the integrals.
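The stopping rule sketched above, comparing the ergodic average with a Rao-Blackwellized version of the same estimator, can be illustrated on a toy two-stage Gibbs sampler. The bivariate normal target and the tuning constants below are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 20000            # illustrative bivariate normal correlation
sd = np.sqrt(1 - rho**2)       # conditional standard deviation

erg, rb = np.empty(n), np.empty(n)
x = y = 0.0
for t in range(n):             # two-stage Gibbs sampler for the bivariate normal
    x = rng.normal(rho * y, sd)
    y = rng.normal(rho * x, sd)
    erg[t] = x                 # term of the ergodic average of h(x) = x
    rb[t] = rho * y            # Rao-Blackwellized term E[X | Y = y]

est_erg = float(erg.mean())    # both estimate E[X] = 0
est_rb = float(rb.mean())
```

When the two estimates agree closely the agreement is (heuristic) evidence that the chain has converged; the Rao-Blackwellized terms also typically have smaller variance than the raw draws.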
Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve, and extend access to Statistical Science (www.jstor.org).
... Thus, a chain whose distribution changes drastically during iterations is still in the exploration phase and is therefore not stationary (see plot (b) in Figure 1). Mixing refers in practice to the exploration of the support of F: slow mixing chains only explore a subset of the parameter space, which can lead to strong bias [see Robert, 1995, for a more rigorous definition]. A common way to limit mixing issues is to run several chains in parallel with different starting points, which also allows the chains to be compared with one another. ...
... Several limitations of this two-state Markov chain approximation are raised by Brooks and Roberts [1999] and Doss et al. [2014], for example. A more general way to estimate ESS(x) is to apply the same idea as in the definition of the local $\hat{R}(x)$: use any estimator of ESS [Robert and Casella, 2004, Gelman et al., 2013] on indicator variables I ...
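A minimal version of that idea, using a simple autocorrelation-sum ESS estimator (one choice among the many estimators the snippet alludes to) applied to the indicators I(X_t ≤ x), might look like:

```python
import numpy as np

rng = np.random.default_rng(9)

def ess(z):
    """Effective sample size from empirical autocorrelations, truncated at
    the first negative lag (one simple choice among many ESS estimators)."""
    z = np.asarray(z, float) - np.mean(z)
    n = len(z)
    var = float(z @ z) / n
    if var == 0.0:
        return float(n)
    s = 0.0
    for k in range(1, n):
        rho = float(z[:-k] @ z[k:]) / (n * var)
        if rho < 0.0:
            break
        s += rho
    return n / (1.0 + 2.0 * s)

# slowly mixing AR(1) chain standing in for MCMC output
phi, n = 0.9, 20000
x = 0.0
chain = np.empty(n)
for t in range(n):
    x = phi * x + rng.normal()
    chain[t] = x

ess_raw = ess(chain)                          # ESS of the chain itself
ess_median = ess(chain <= np.median(chain))   # localized: ESS(x) at the median
```

Both values come out far below the nominal sample size, reflecting the chain's autocorrelation; evaluating `ess` at different thresholds x localizes the diagnostic across quantiles.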
Preprint
Full-text available
Diagnosing convergence of Markov chain Monte Carlo is crucial and remains an essentially unsolved problem. Among the most popular methods, the potential scale reduction factor, commonly named $\hat{R}$, is an indicator that monitors the convergence of output chains to a target distribution, based on a comparison of the between- and within-variances. Several improvements have been suggested since its introduction in the 90s. Here, we aim at better understanding the $\hat{R}$ behavior by proposing a localized version that focuses on quantiles of the target distribution. This new version relies on key theoretical properties of the associated population value. It naturally leads to proposing a new indicator $\hat{R}_\infty$, which is shown to allow both for localizing the Markov chain Monte Carlo convergence in different quantiles of the target distribution, and at the same time for handling some convergence issues not detected by other $\hat{R}$ versions.
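A sketch of the indicator-based localization described in this abstract, built on the classic between/within construction of $\hat{R}$ (the paper's exact estimator and thresholds may differ):

```python
import numpy as np

rng = np.random.default_rng(10)

def rhat(chains):
    """Classic potential scale reduction factor from m parallel chains."""
    chains = np.asarray(chains, float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    return float(np.sqrt(((n - 1) / n * W + B / n) / W))

def local_rhat(chains, x):
    """Localized version: R-hat applied to the indicators I(X <= x)."""
    return rhat((np.asarray(chains) <= x).astype(float))

good = rng.normal(size=(4, 2000))            # four idealized well-mixed chains
bad = good + np.arange(4)[:, None]           # the same chains shifted apart
r_good = local_rhat(good, 0.0)               # close to 1
r_bad = local_rhat(bad, 0.0)                 # clearly above 1
```

Scanning `x` over a grid of quantiles and taking the supremum gives one plausible reading of the $\hat{R}_\infty$ idea.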
... We consider the test function h : x → x. The first algorithm considered here is the Gibbs sampler described in Robert (1995), alternating between Exponential and Normal draws: ...
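The exact model of Robert (1995) is not reproduced in the snippet; a minimal sketch with the same alternating Exponential/Normal structure, built on an illustrative joint density, is:

```python
import numpy as np

# Illustrative joint (not the exact model of Robert, 1995):
#   f(x, y) ∝ exp(-y (1 + x²/2)),  y > 0,
# whose full conditionals are
#   Y | X = x ~ Exponential(rate 1 + x²/2),
#   X | Y = y ~ Normal(0, 1/y).
rng = np.random.default_rng(1)
n = 50000
x = 0.0
xs = np.empty(n)
for t in range(n):
    y = rng.exponential(1.0 / (1.0 + 0.5 * x * x))   # numpy takes scale = 1/rate
    x = rng.normal(0.0, 1.0 / np.sqrt(y))
    xs[t] = x

# The X-marginal here is heavy tailed (Cauchy-like), so the ergodic average
# of h(x) = x does not converge; the sample median does.
med = float(np.median(xs))
```

The heavy-tailed marginal makes this a useful stress test: an ergodic average of h(x) = x would wander, which is exactly the kind of behaviour convergence diagnostics must detect.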
Preprint
Full-text available
This article draws connections between unbiased estimators constructed from coupled Markov chains that meet exactly after a random number of iterations, and solutions of the Poisson equation. We first show how such pairs of chains can be employed to obtain unbiased estimators of pointwise evaluations of solutions of the Poisson equation. We then propose new estimators of the asymptotic variance of Markov chain ergodic averages. We formally study the proposed estimators under realistic assumptions on the meeting times of the coupled chains and on the existence of moments of test functions under the target distribution. We illustrate their behaviour in toy examples and in a more challenging setting of high-dimensional Bayesian regression.
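A toy version of the coupled-chain unbiased estimators this abstract refers to, in the style of Jacob, O'Leary and Atchadé (2020); the three-state kernel and the Doeblin-type coupling below are illustrative constructions, not the paper's examples:

```python
import numpy as np

rng = np.random.default_rng(2)
EPS = 0.3      # Doeblin minorization weight: with prob EPS the kernel
STATES = 3     # draws uniformly on {0,1,2}; otherwise it shifts cyclically.
               # The stationary law is uniform, so E_pi[h] = 1 for h(x) = x.

def coupled_step(x, y):
    """One coupled move; each marginal follows the kernel above, and the
    two chains coalesce on the shared 'uniform jump' event."""
    if rng.random() < EPS:
        z = int(rng.integers(STATES))
        return z, z
    return (x + 1) % STATES, (y + 1) % STATES

def unbiased_estimate():
    """Unbiased estimator of E_pi[h] from a pair of lag-1 coupled chains:
    H = h(X_0) + sum over t < tau of (h(X_t) - h(Y_{t-1}))."""
    x0 = int(rng.integers(STATES))      # X_0 ~ nu_0
    y = int(rng.integers(STATES))       # Y_0 ~ nu_0, independent of X_0
    H = float(x0)                       # h(X_0), taking k = 0
    x, _ = coupled_step(x0, x0)         # advance X one step ahead of Y
    while x != y:                       # tau = first t with X_t == Y_{t-1}
        H += float(x - y)               # correction term h(X_t) - h(Y_{t-1})
        x, y = coupled_step(x, y)       # -> (X_{t+1}, Y_t)
    return H

est = float(np.mean([unbiased_estimate() for _ in range(20000)]))
```

Averaging many independent replicates recovers the stationary expectation without any burn-in bias, which is the property the article exploits.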
... convergence diagnostics (see, for example, Gelman and Rubin (1992), Geweke (1992), Raftery and Lewis (1992), Robert (1995), Gilks and Roberts (1996), Cowles and Carlin (1996), Brooks and Gelman (1998), Brooks and Roberts (1998), Brooks et al. (2011), Robert and Casella (2004), Roy (2019)), almost all of them are heuristic in nature, often involving subjective judgment, while the theoretical establishment of rates of MCMC convergence is too difficult in general situations. Needless to say, the popular MCMC convergence diagnostics can often mislead the practitioner about the actual convergence scenario. ...
Thesis
Full-text available
This thesis aims to solve important problems in topics as varied as deterministic and random infinite series, stochastic processes and function optimization, by embedding the objects in appropriate Bayesian characterization frameworks and then providing the equivalent Bayesian solution. The key philosophy is to view even deterministic objects, such as the elements of deterministic infinite series, as realizations of stochastic processes, which facilitates the Bayesian treatment. Our Bayesian embedding perspective led to Bayesian characterizations of convergence, divergence and oscillations of deterministic and random infinite series; stationarity, nonstationarity and oscillations of general stochastic processes; and also a novel function optimization theory driven by a posterior Gaussian derivative process. Advantages of our Bayesian characterization approach include equivalent Bayesian solutions to questions of convergence, divergence and oscillations of infinite series where all existing methods fail to provide conclusive answers; equivalent Bayesian assessment of strong and weak stationarity and nonstationarity in time series, spatial and spatio-temporal processes; and equivalent Bayesian appraisals of complete spatial randomness, strong and weak stationarity and the Poisson assumption in point process analysis. Furthermore, such Bayesian characterization led to a method for Bayesian frequency determination in oscillating time series and a reliable method for convergence diagnostics of Markov Chain Monte Carlo algorithms, apart from the novel and accurate function optimization method. Special mention must be reserved for the Bayesian characterization of infinite series, as this attempted to provide solutions to two problems of great importance. One such problem is the celebrated Riemann Hypothesis, the most elusive problem of classical mathematics, whose solution is the most sought after.
The other is related to the global climate change debate, the specific question being the validity of the portentous future global warming projections. The respective results of our Bayesian characterizations of deterministic and random infinite series support neither the Riemann Hypothesis nor the projected future global warming.
... Simulation results show good performance of the proposed algorithm. In future work we plan to provide a theoretical analysis regarding the convergence of the proposed MCMC scheme, which will require extending previous studies on Gibbs sampling such as those reported in [33], [34], [38]. ...
Preprint
We consider the identification of large-scale linear and stable dynamic systems whose outputs may be the result of many correlated inputs. Hence, severe ill-conditioning may affect the estimation problem. This is a scenario often arising when modeling complex physical systems given by the interconnection of many sub-units where feedback and algebraic loops can be encountered. We develop a strategy based on Bayesian regularization where any impulse response is modeled as the realization of a zero-mean Gaussian process. The stable spline covariance is used to include information on smooth exponential decay of the impulse responses. We then design a new Markov chain Monte Carlo scheme that deals with collinearity and is able to efficiently reconstruct the posterior of the impulse responses. It is based on a variation of Gibbs sampling which updates possibly overlapping blocks of the parameter space on the basis of the level of collinearity affecting the different inputs. Numerical experiments are included to test the goodness of the approach where hundreds of impulse responses form the system and inputs correlation may be very high.
... In Sections 4 and 5 we will establish geometric ergodicity of $K_C$ and study its asymptotic stability, respectively. Our approach, which is motivated by the following lemma, will be to analyze $K_A$ in place of $K_C$; the lemma says we can analyze any of $K_G$, $K_A$ or $K_\Sigma$ in place of $K_C$ (see also [36]). The proof of the lemma uses only well known results about de-initializing Markov chains [37] and can be found in Appendix B. ...
... Moreover, it is well known that the convergence properties of the marginal chains, $P_{XDG}$ and $P_{YDG}$, are essentially those of the original DG chain (Robert, 1995; Roberts and Rosenthal, 2001). ...
Preprint
Component-wise MCMC algorithms, including Gibbs and conditional Metropolis-Hastings samplers, are commonly used for sampling from multivariate probability distributions. A long-standing question regarding Gibbs algorithms is whether a deterministic-scan (systematic-scan) sampler converges faster than its random-scan counterpart. We answer this question when the samplers involve two components by establishing an exact quantitative relationship between the $L^2$ convergence rates of the two samplers. The relationship shows that the deterministic-scan sampler converges faster. We also establish qualitative relations among the convergence rates of two-component Gibbs samplers and some conditional Metropolis-Hastings variants. For instance, it is shown that if a two-component conditional Metropolis-Hastings sampler is geometrically ergodic, then so are the associated Gibbs samplers.
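The two scan orders compared in this abstract can be written down for a two-component Gibbs example; the bivariate normal target is an illustrative choice, and this sketch only checks that both samplers reach the right target (it does not measure the $L^2$ rates the paper analyzes):

```python
import numpy as np

rng = np.random.default_rng(3)
RHO = 0.9
SD = np.sqrt(1 - RHO**2)      # conditional std dev of the bivariate normal

def deterministic_scan(n):
    """Systematic scan: update x then y on every sweep."""
    x = y = 0.0
    out = np.empty(n)
    for t in range(n):
        x = rng.normal(RHO * y, SD)
        y = rng.normal(RHO * x, SD)
        out[t] = x
    return out

def random_scan(n):
    """Random scan: update one coordinate chosen uniformly per step."""
    x = y = 0.0
    out = np.empty(n)
    for t in range(n):
        if rng.random() < 0.5:
            x = rng.normal(RHO * y, SD)
        else:
            y = rng.normal(RHO * x, SD)
        out[t] = x
    return out

d = deterministic_scan(40000)   # both target the standard normal x-marginal
r = random_scan(40000)
```

Both recover the N(0, 1) marginal of x; the paper's result says the deterministic scan does so faster in the $L^2$ sense.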
... Moreover, in reality, the target posteriors can often be multimodal, and in such cases, the performances of such diagnostic tools can be even poorer. For more about MCMC convergence diagnostics, see, for example, Gelman and Rubin (1992), Geweke (1992), Raftery and Lewis (1992), Robert (1995), Gilks and Roberts (1996), Cowles and Carlin (1996), Brooks and Gelman (1998), Brooks and Roberts (1998), Brooks et al. (2011), Robert and Casella (2004), Roy (2019). ...
Preprint
Full-text available
In this article, we primarily propose a novel Bayesian characterization of stationary and nonstationary stochastic processes. In practice, this theory aims to distinguish between global stationarity and nonstationarity for both parametric and nonparametric stochastic processes. Interestingly, our theory builds on our previous work on Bayesian characterization of infinite series, which was applied to verification of the (in)famous Riemann Hypothesis. Thus, there seems to be interesting and important connections between pure mathematics and Bayesian statistics, with respect to our proposed ideas. We validate our proposed method with simulation and real data experiments associated with different setups. In particular, applications of our method include stationarity and nonstationarity determination in various time series models, spatial and spatio-temporal setups, and convergence diagnostics of Markov Chain Monte Carlo. Our results demonstrate very encouraging performance, even in very subtle situations. Using similar principles, we also provide a novel Bayesian characterization of mutual independence among any number of random variables, using which we characterize the properties of point processes, including characterizations of Poisson point processes, complete spatial randomness, stationarity and nonstationarity. Applications to simulation experiments with ample Poisson and non-Poisson point process models again indicate quite encouraging performance of our proposed ideas. We further propose a novel recursive Bayesian method for determination of frequencies of oscillatory stochastic processes, based on our general principle. Simulation studies and real data experiments with varieties of time series models consisting of single and multiple frequencies bring out the worth of our method.
... The complexity of the Gibbs algorithm is studied through a simulation study. Furthermore, the theoretical complexity of Gibbs sampling is discussed by many researchers; interested readers are referred to Belloni and Chernozhukov (2009), Frigessi (1993), Kaican and Zhi (2005), Mengersen and Tweedie (1996), Roberts and Smith (1994), Robert (1995), Roberts and Tweedie (1996), and Rosenthal (1995). ...
Article
Transmuted distributions are flexible skewed families constructed by the induction of one or more additional parameters to a parent distribution. This paper investigates the potential usefulness of a two-component mixture of Transmuted Pareto Distribution (TPaD) under a Bayesian framework assuming type-I right censored sampling. For Bayesian analysis, non-informative as well as informative priors are assumed, while three loss functions, namely the squared error loss function (SELF), precautionary loss function (PLF), and quadratic loss function (QLF), are considered to estimate the unknown parameters. Furthermore, Bayesian credible intervals (BCIs) are also discussed in this study. Since the posterior distributions do not have explicit forms, posterior summaries are computed using a Markov Chain Monte Carlo (MCMC) technique. The performance of the Bayes estimators is assessed by their posterior risks assuming different sample sizes and censoring rates. To highlight the practical significance of a two-component mixture of transmuted Pareto distribution (TPaD), a medical data set on renal calculi is analyzed in this study. Furthermore, annual flood rate data collected at the Floyd River are also discussed.
Article
Although Bayesian variable selection methods have been intensively studied, their routine use in practice has not caught up with their non-Bayesian counterparts such as Lasso, likely due to difficulties in both computations and flexibilities of prior choices. To ease these challenges, we propose the neuronized priors to unify and extend some popular shrinkage priors, such as Laplace, Cauchy, horseshoe, and spike-and-slab priors. A neuronized prior can be written as the product of a Gaussian weight variable and a scale variable transformed from Gaussian via an activation function. Compared with classic spike-and-slab priors, the neuronized priors achieve the same explicit variable selection without employing any latent indicator variables, which results in both more efficient and flexible posterior sampling and more effective posterior modal estimation. Theoretically, we provide specific conditions on the neuronized formulation to achieve the optimal posterior contraction rate, and show that a broadly applicable MCMC algorithm achieves an exponentially fast convergence rate under the neuronized formulation. We also examine various simulated and real data examples and demonstrate that using the neuronization representation is computationally more efficient than, or comparable to, its standard counterpart in all well-known cases. An R package NPrior is provided for using neuronized priors in Bayesian linear regression.
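The product form described in the abstract is easy to sample from directly. A minimal sketch, where the ReLU activation, the N(0, 1) law for alpha and the hyperparameters `alpha0` and `tau` are illustrative choices rather than the paper's recommended settings:

```python
import numpy as np

rng = np.random.default_rng(4)

def neuronized_sample(activation, alpha0=0.0, tau=1.0, size=100000):
    """Draws from a neuronized prior theta = T(alpha) * w, with
    alpha ~ N(alpha0, 1) and w ~ N(0, tau^2); alpha0 and tau are
    illustrative hyperparameters."""
    alpha = rng.normal(alpha0, 1.0, size)
    w = rng.normal(0.0, tau, size)
    return activation(alpha) * w

relu = lambda a: np.maximum(a, 0.0)
theta = neuronized_sample(relu)
sparsity = float(np.mean(theta == 0.0))   # ReLU recreates a spike at zero
```

With a ReLU activation roughly half the draws are exactly zero, mimicking the spike of a spike-and-slab prior without any latent indicator variables; other activations recover continuous shrinkage priors.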
Chapter
Starting from approaches in Bioinformatics, we will investigate aspects of Bayesian robustness ideas and compare them to methods from classical robust statistics. Bayesian robustness branches into three aspects: robustifying the prior, the likelihood or the loss function. Our focus will be the likelihood itself. For computational convenience, normal likelihoods are the standard for many basic analyses ranging from simple mean estimation to regression or discriminatory models. However, similar to classical analyses, non-normal data cause problems in the estimation process and are often covered with complex models for the overestimated variance or shrinkage. Most prominently, Bayesian non-parametrics approach this challenge with infinite mixtures of distributions. However, infinite mixture models do not allow an identification of outlying values in “near-Gaussian” scenarios, being almost too flexible for such a purpose. The goal of our work is to allow for a robust estimation of parameters of the “main part of the data”, while being able to identify the outlying part of the data and providing a posterior probability for not fitting the main likelihood model. For this purpose, we propose to mix a Gaussian likelihood with heavy-tailed or skewed distributions of a similar structure which can hierarchically be related to the normal distribution, in order to allow consistent estimation of parameters and efficient simulation. We present an application of this approach in Bioinformatics for the robust estimation of genetic array data by mixing Gaussian and Student's t distributions with various degrees of freedom. To this effect, we employ microarray data as a case study, as they are well known for their complicated, over-dispersed noise behaviour. Our secondary goal is to present a methodology which helps not only to identify noisy genes but also to recognise whether single arrays are responsible for this behaviour.
Although bioinformatics research has largely moved from array technology to sequencing, medical diagnostics has adopted the methodology and thus requires appropriate error estimators.
Article
Full-text available
In this paper we study open generalized Jackson networks with general arrival streams and general service time distributions. Assuming that the arrival rate does not exceed the network capacity and that the service times possess conditionally bounded second moments, we deduce stability of the network by bounding the expected waiting time for a customer entering the network. For Markovian networks we obtain convergence of the total work in the system, as well as the mean queue size and mean customer delay, to a unique finite steady state value.
Article
Markov chain Monte Carlo (MCMC) methods have been used extensively in statistical physics over the last 40 years, in spatial statistics for the past 20 and in Bayesian image analysis over the last decade. In the last five years, MCMC has been introduced into significance testing, general Bayesian inference and maximum likelihood estimation. This paper presents basic methodology of MCMC, emphasizing the Bayesian paradigm, conditional probability and the intimate relationship with Markov random fields in spatial statistics. Hastings algorithms are discussed, including Gibbs, Metropolis and some other variations. Pairwise difference priors are described and are used subsequently in three Bayesian applications, in each of which there is a pronounced spatial or temporal aspect to the modeling. The examples involve logistic regression in the presence of unobserved covariates and ordinal factors; the analysis of agricultural field experiments, with adjustment for fertility gradients; and processing of low-resolution medical images obtained by a gamma camera. Additional methodological issues arise in each of these applications and in the Appendices. The paper lays particular emphasis on the calculation of posterior probabilities and concurs with others in its view that MCMC facilitates a fundamental breakthrough in applied Bayesian modeling. Comments: Arnoldo Frigessi (41–43), Alan E. Gelfand, Bradley P. Carlin (43–46), Charles J. Geyer (46–48), G. O. Roberts, S. K. Sahu, W. R. Gilks (49–51), Wing Hung Wong (52–53), Bin Yu (54–58), Julian Besag, Peter Green, David Higdon, Kerrie Mengersen (58–66).
Article
This paper, even before its appearance, has done a valuable service in clarifying both theory and practice in this important area. For example, the discussion of combining strategies in Section 2.4 helped researchers break away from pure Gibbs sampling in 1991; it was, for example, part of the reasoning that led to the "Metropolis-coupled" scheme of Geyer (1991) mentioned at the end of Section 2.3.3.
Article
Capture-recapture models are widely used in the estimation of population sizes. Based on data augmentation considerations, we show how Gibbs sampling can be applied to calculate Bayes estimates in this setting. As a result, formulations which were previously avoided because of analytical and numerical intractability can now be easily considered for practical application. We illustrate this potential by using Gibbs sampling to calculate Bayes estimates for a hierarchical capture-recapture model in a real example.
Article
The Arnason–Schwarz model is usually used for estimating survival and movement probabilities of animal populations from capture-recapture data. The missing data structure of this capture-recapture model is exhibited and summarised via a directed graph representation. Taking advantage of this structure we implement a Gibbs sampling algorithm from which Bayesian estimates and credible intervals for survival and movement probabilities are derived. Convergence of the algorithm is proved using a duality principle. We illustrate our approach through a real example.
Article
This work considers Monte Carlo methods for approximating the integral of any twice-differentiable function f over a hypercube. Whereas earlier Monte Carlo schemes have yielded an $O(1/n)$ convergence rate for the expected square error, we show that by allowing nonlinear operations on the random samples $\{(U_i, f(U_i))\}_{i=1}^n$, much more rapid convergence can be achieved. Specifically, we give a rule which attains rates of $O(1/n^4)$ and $O(1/n^2)$ in one and two dimensions respectively. Analysis shows that our algorithms become worse than the usual Monte Carlo method as the dimension of the domain increases, and these findings point to the possibility that “crude” Monte Carlo has an asymptotic optimality property among all Monte Carlo rules, linear and otherwise.
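One plausible reading of the one-dimensional nonlinear rule (not necessarily the authors' exact construction) is to sort the uniform sample, append the endpoints, and apply the trapezoid rule on the resulting random grid; the test integrand below is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(5)
f = np.exp
TRUE = np.e - 1.0          # exact value of the integral of e^x over [0, 1]

def crude_mc(n):
    """Plain Monte Carlo average: O(1/n) expected square error."""
    return float(f(rng.random(n)).mean())

def sorted_trapezoid(n):
    """Nonlinear rule: sort the uniform sample, append the endpoints, and
    apply the trapezoid rule on the resulting random grid."""
    x = np.sort(np.concatenate(([0.0], rng.random(n), [1.0])))
    fx = f(x)
    return float(np.sum(0.5 * (fx[1:] + fx[:-1]) * np.diff(x)))

n = 1000
err_crude = abs(crude_mc(n) - TRUE)
err_trap = abs(sorted_trapezoid(n) - TRUE)
```

The sorted-sample rule exploits smoothness of f, which is why its advantage evaporates in high dimension, as the abstract notes.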
Article
A general method, suitable for fast computing machines, for investigating such properties as equations of state for substances consisting of interacting individual molecules is described. The method consists of a modified Monte Carlo integration over configuration space. Results for the two-dimensional rigid-sphere system have been obtained on the Los Alamos MANIAC and are presented here. These results are compared to the free volume equation of state and to a four-term virial coefficient expansion. The Journal of Chemical Physics is copyrighted by The American Institute of Physics.
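The accept/reject recipe of this method carries over unchanged to modern MCMC. A minimal sketch on a one-dimensional "energy" U(x) = x²/2 (i.e. a standard normal target), standing in for the rigid-sphere configuration energy of the original paper:

```python
import numpy as np

rng = np.random.default_rng(6)

def metropolis(logpi, x0, step, n):
    """Symmetric random-walk Metropolis: propose x' = x + step*eps and
    accept with probability min(1, pi(x')/pi(x))."""
    x = x0
    out = np.empty(n)
    for t in range(n):
        prop = x + step * rng.normal()
        if rng.random() < np.exp(min(0.0, logpi(prop) - logpi(x))):
            x = prop
        out[t] = x
    return out

# log-density of the illustrative target, known up to a constant
s = metropolis(lambda x: -0.5 * x * x, 0.0, 2.4, 50000)
```

Only density ratios appear in the acceptance probability, so the normalizing constant is never needed, which is the feature that made the method so broadly applicable.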
Article
Markov chain Monte Carlo using the Metropolis-Hastings algorithm is a general method for the simulation of stochastic processes having probability densities known up to a constant of proportionality. Despite recent advances in its theory, the practice has remained controversial. This article makes the case for basing all inference on one long run of the Markov chain and estimating the Monte Carlo error by standard nonparametric methods well-known in the time-series and operations research literature. In passing it touches on the Kipnis-Varadhan central limit theorem for reversible Markov chains, on some new variance estimators, on judging the relative efficiency of competing Monte Carlo schemes, on methods for constructing more rapidly mixing Markov chains and on diagnostics for Markov chain Monte Carlo.
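The "one long run" advice above pairs naturally with a nonparametric Monte Carlo error estimate; the batch-means estimator below is one of the standard time-series methods alluded to, demonstrated on an illustrative AR(1) surrogate chain:

```python
import numpy as np

rng = np.random.default_rng(7)

def batch_means_se(chain, n_batches=50):
    """Monte Carlo standard error of the ergodic average from
    non-overlapping batch means (a standard time-series estimator)."""
    n = len(chain) // n_batches * n_batches
    batches = np.asarray(chain[:n], float).reshape(n_batches, -1).mean(axis=1)
    return float(batches.std(ddof=1) / np.sqrt(n_batches))

# one long run of an AR(1) chain standing in for MCMC output
phi, n = 0.7, 100000
x = 0.0
chain = np.empty(n)
for t in range(n):
    x = phi * x + rng.normal()
    chain[t] = x

se = batch_means_se(chain)                       # accounts for autocorrelation
naive = float(chain.std(ddof=1) / np.sqrt(n))    # iid formula, too optimistic
```

For this chain the batch-means standard error is roughly twice the naive iid value, quantifying the efficiency loss due to autocorrelation; the batch length must be long relative to the chain's correlation time for the estimate to be reliable.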
Article
The Gibbs sampler, the algorithm of Metropolis and similar iterative simulation methods are potentially very helpful for summarizing multivariate distributions. Used naively, however, iterative simulation can give misleading answers. Our methods are simple and generally applicable to the output of any iterative simulation; they are designed for researchers primarily interested in the science underlying the data and models they are analyzing, rather than for researchers interested in the probability theory underlying the iterative simulations themselves. Our recommended strategy is to use several independent sequences, with starting points sampled from an overdispersed distribution. At each step of the iterative simulation, we obtain, for each univariate estimand of interest, a distributional estimate and an estimate of how much sharper the distributional estimate might become if the simulations were continued indefinitely. Because our focus is on applied inference for Bayesian posterior distributions in real problems, which often tend toward normality after transformations and marginalization, we derive our results as normal-theory approximations to exact Bayesian inference, conditional on the observed simulations. The methods are illustrated on a random-effects mixture model applied to experimental measurements of reaction times of normal and schizophrenic patients.
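The recommended strategy of several independent sequences with overdispersed starting points leads to the potential scale reduction factor; a sketch on an illustrative slowly mixing AR(1) chain (the surrogate chain and cutoffs are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(8)

def psrf(chains):
    """Gelman-Rubin potential scale reduction factor on m parallel chains."""
    chains = np.asarray(chains, float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    return float(np.sqrt(((n - 1) / n * W + B / n) / W))

# five sequences of a slowly mixing AR(1) chain, overdispersed starts
m, n, phi = 5, 5000, 0.99
chains = np.empty((m, n))
for i in range(m):
    x = rng.normal(0.0, 100.0)                   # overdispersed starting point
    for t in range(n):
        x = phi * x + rng.normal()
        chains[i, t] = x

early = psrf(chains[:, :50])      # still dominated by the starting points
late = psrf(chains[:, 2000:])     # after the transient has died out
```

Early on, the between-chain spread inherited from the overdispersed starts drives the factor far above 1; once the sequences forget their starting points it shrinks toward 1, which is the stopping signal the paper proposes.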