Evaluating the Strong Scalability of
Parallel Markov-Chain Monte Carlo Algorithms
Khu-rai Kim
Electronics Engineering Dept.
Sogang University
Seoul, Republic of Korea
msca8h@sogang.ac.kr
Simon Maskell
Electrical Engineering and Electronics Dept.
University of Liverpool
Liverpool, UK
S.Maskell@liverpool.ac.uk
Sungyong Park
Computer Science and Engineering Dept.
Sogang University
Seoul, Republic of Korea
parksy@sogang.ac.kr
Abstract—Markov-chain Monte Carlo (MCMC) is a popular
method for performing asymptotically exact Bayesian infer-
ence. However, the acceleration of MCMC by parallelizing its
computation is a significant challenge. Despite the numerous
algorithms developed for parallelizing MCMC, a fundamental
question remains unanswered. If parallel MCMC is the answer,
what is the question? To answer this fundamental question,
we first characterize the scaling of parallel MCMC algorithms
into weak and strong scaling. Then, we discuss the outcomes
they deliver. Under these terminologies, most previous works
comparing different parallel MCMC methods fall into the weak
scaling territory. For this reason, we focus on assessing the
strong scalability of previously proposed parallelization schemes
of MCMC, both conceptually and empirically. Also, considering
the popularity and importance of probabilistic programming
languages, we focus on algorithms that are applicable to a
wide variety of statistical models. First, we evaluate the strong
scalability of methods based on parallelizing a single MCMC
chain. Second, for methods based on multiple MCMC chains,
we develop a simple expression for estimating the speedup and
empirically evaluate their strong scalability. Our results show that
previously proposed methods for parallelizing MCMC algorithms
achieve limited strong scalability. We conclude by providing
future directions for parallelizing Bayesian inference.
Index Terms—Bayesian Inference, Parallel Computing,
Markov-chain Monte Carlo
I. INTRODUCTION
Bayesian statistics has recently seen surging popularity.
Until recently, the high computational cost of Bayesian in-
ference methods has prohibited wide adoption of Bayesian
statistics. Fortunately, with the advances of our computational
capabilities, many of the Bayesian statistical models of our
interest are now within arm's reach. Also, abstractions such as
probabilistic programming languages (PPL, [1], [2], [3], [4],
[5]) have enabled statisticians to enjoy modern computational
resources with reduced complexity and enhanced portability.
Meanwhile, with the ever-increasing amount of data and com-
plexity of our statistical models, the need for high-performance
parallel Bayesian inference is gaining importance.
Currently, Markov-chain Monte Carlo (MCMC) is the de-
facto standard method for performing asymptotically exact
Bayesian inference. MCMC algorithms are generally appli-
cable as they assume very little about the target statistical
model. Notably, variants of the Hamiltonian Monte Carlo
sampler (HMC, [6]) have been widely adopted as the default
sampling strategy in PPLs, such as Stan [4], PyMC3 [1],
[3], and Turing.jl [5]. For PPLs where the user’s model can
be complex, HMC has been shown, both theoretically and empirically, to be efficient and robust. However, MCMC algorithms
are computationally expensive, requiring many thousands of
evaluations of the likelihood (and, in the case of HMC, its
gradient).
Because of the high cost of executing MCMC algorithms,
it is natural to seek acceleration by utilizing modern high-
performance computing resources. Yet, on most modern com-
putational hardware, this can only be achieved by paralleliza-
tion. Since the clock speed of processors started stagnating,
the necessity for parallelizing MCMC has kept increasing
over time [7], [8]. Also, the emergence of massively par-
allel, programmable, specialized hardware such as graphical
processing units (GPU) and field-programmable gate arrays
(FPGA) has opened immense benefits for parallelization [9],
[10], [11], [12]. Unfortunately, the fundamentally sequential nature of MCMC algorithms makes them challenging to scale [13], [14]. Moreover, being generally applicable enough to support PPLs adds multiple design
constraints. Methods such as in [15], [9], [10], [12], [16] that
exploit the parallelism inherent in the statistical model are not
generically applicable. For these reasons, developing parallel
MCMC algorithms robust and automatic enough to be applied
to PPLs is a vital but challenging objective.
If parallelizing MCMC is the answer, what is the question?
Until now, various strategies for accelerating MCMC by
parallelization have been proposed. However, a fundamental
discussion on what the goal of parallelization is (from the perspective of parallel computing) has yet to be presented. For example, are we trying to achieve weak scaling [17] or strong scaling [18]? Or, more fundamentally, what do strong scaling and weak scaling even mean in the context of Bayesian inference?
What are the previously proposed methods actually achieving?
This study is our attempt to address these questions with a
focus on fulfilling the constraints of PPLs.
In this paper, we first characterize the acceleration of
Bayesian inference into strong and weak scaling. The weak
scaling [17] approach reduces estimation error as we increase
the amount of computational resources. In contrast, the strong
scaling [18] approach tries to reduce the time for reaching a
specific target estimation error. As weak scaling and strong
scaling deliver sharply different outcomes, we start with a
broader discussion by comparing the outcomes of the two
approaches. Since the estimation error of MCMC can only be reduced at a rate of $\mathcal{O}(1/\sqrt{N})$ (where $N$ is proportional to the amount of computation), the weak scaling approach is susceptible to diminishing returns [19]. Also, because of the recent trend of using the effective sample size per unit time metric (ESS/second), performance comparisons have been biased
towards weak scaling. As a result, proper comparisons of
the strong scalability of different parallelization methods have
been left behind. Thus, we dedicate the remainder of our paper
to evaluating the strong scalability of MCMC parallelization
schemes.
Previously proposed strategies for parallelizing MCMC (and achieving strong scalability) can be classified into three types: intrachain parallelism, interchain parallelism, and data parallelism.
Intrachain [20], [15], [21], [22], [23], [24], [11], [16] paral-
lelism seeks speedup by parallelizing the computation of a
single MCMC chain. Interchain parallelism [25], [26], [27],
[28], [29], [8] computes multiple MCMC chains and utilizes
the embarrassingly parallel nature of the multiple chains.
On the other hand, data parallelism (mostly consisting of
consensus MCMC methods, [30], [31], [32]) can be seen as
combining both; splitting the data into multiple subsets and
operating MCMC chains in each subset. Since the applicability
of data parallelism is limited to certain types of statistical
models (models that are inherently data parallel), we will
restrict our discussion to intra- and interchain parallelism.
We empirically investigate the potential strong scalability
of intrachain parallelism by computing the Amdahl speedup
limit [18] on multiple realistic Bayesian models. For interchain
parallelism, we develop a simple formula for estimating the
speedup given the autocorrelation and the amount of burn-
in samples. Then, we empirically evaluate the strong scaling
of previously proposed interchain parallelization methods.
Our results suggest that strategies for parallelizing MCMC,
whether interchain or intrachain, struggle to achieve strong
scaling close to linear. Lastly, we conclude our work by pro-
viding future directions for overcoming the limits of MCMC.
To summarize, the key insights of this paper are as follows:
• We characterize the acceleration of MCMC algorithms into strong and weak scaling (Section III-A).
• We argue that the outcomes of strong scalability are more desirable than those of weak scalability. However, the analyses of previous works are largely focused on weak scalability (Section III-A).
• We experimentally assess the strong scalability of parallelizing a single MCMC chain (intrachain parallelization) (Section III-B).
• We develop a simple expression for evaluating the strong scalability of multiple MCMC chain parallelization (interchain parallelization), and experimentally evaluate the strong scalability of previously proposed interchain parallelization strategies (Section III-C).
II. PRELIMINARIES
A. Bayesian Inference
The goal of Bayesian inference is to obtain the distribution of a parameter of interest. First, an observable quantity, or dataset (denoted as $\mathcal{D}$), is given. By assuming that the data is generated from an unobserved variable $\theta$ according to a data generation process, the likelihood of observing $\mathcal{D}$ can be computed as $p(\mathcal{D} \mid \theta)$. Bayesian inference focuses on inferring the inverse probability $p(\theta \mid \mathcal{D})$, called the posterior distribution, by setting a prior distribution $p(\theta)$ on $\theta$ and invoking Bayes' rule such that

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta}. \quad (1)$$

Using the posterior distribution, it is possible to marginalize the parameters of an arbitrary function $f$ such that

$$\mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}[f(\theta)] = \int f(\theta)\, p(\theta \mid \mathcal{D})\, d\theta. \quad (2)$$

Typical use-cases of marginalization include obtaining summary statistics of $\theta$, or performing predictions based on data.

Unfortunately, the evidence $p(\mathcal{D})$ in Equation (1) is often intractable unless we restrict ourselves to off-the-shelf conjugate probability distributions. To perform Bayesian analysis on more interesting and complex models, we instead obtain a finite set of samples describing $p(\theta \mid \mathcal{D})$. This is done by sampling from the joint distribution $\pi(\theta) = p(\mathcal{D} \mid \theta)\, p(\theta)$, which is proportional to the posterior distribution up to a normalizing constant (the evidence) such that $p(\theta \mid \mathcal{D}) \propto \pi(\theta)$. Using these $N$ samples, we can approximate the expectation in Equation (2) using the Monte Carlo approximation

$$\int f(\theta)\, p(\theta \mid \mathcal{D})\, d\theta \approx \frac{1}{N} \sum_{\theta_i \sim p(\theta \mid \mathcal{D})} f(\theta_i) = \bar{f}. \quad (3)$$
B. Sampling from the Posterior
Sampling proportionally to the posterior is a significant
challenge by itself. Currently adopted methods can be classi-
fied into two types: asymptotically exact methods and approx-
imate methods. For asymptotically exact sampling, Markov-
chain Monte Carlo (MCMC) is the most widely used.
1) Markov-chain Monte Carlo: The basic idea of MCMC is to construct a Markov-chain along the samples with a Markov-chain transition operator, or kernel, $K(\theta, \theta')$. By conditioning the current sample $\theta_i$ on the previous state of the Markov-chain $\theta_{i-1}$, it is possible to cancel out the normalization constant and sample proportionally to the posterior. If $K(\theta, \theta')$ admits the posterior distribution $p(\theta \mid \mathcal{D})$ as its invariant measure such that

$$\pi(\theta') = \int_\Theta K(\theta, \theta')\, \pi(d\theta) \quad (4)$$

and some additional assumptions hold, the states of the Markov-chain $\theta_i$ form an asymptotically unbiased estimator.

Theorem 1 (Thm 4.7.7, [33]): If the Markov-chain $\theta_1, \ldots, \theta_N$ is aperiodic, irreducible, and reversible with invariant distribution
Algorithm 1: Markov-Chain Monte Carlo
for $t \in [1, N_{\text{burn}} + N]$ do
    $\theta' \sim q(\theta' \mid \theta_{t-1})$  (propose sample)
    $\alpha = \min\!\left( \frac{\pi(\theta')\, q(\theta_{t-1} \mid \theta')}{\pi(\theta_{t-1})\, q(\theta' \mid \theta_{t-1})},\; 1 \right)$  (acceptance prob.)
    $u \sim \text{Uniform}(0, 1)$
    $\theta_t = \theta'$ if $u < \alpha$  (accept proposal)
    $\theta_t = \theta_{t-1}$ otherwise  (reject proposal)
end
return $\theta_{N_{\text{burn}}+1}, \ldots, \theta_{N + N_{\text{burn}}}$
$\pi$, the Central Limit Theorem applies when $0 < \sigma^2_{\text{CLT}} < +\infty$ such that

$$\sqrt{N}\left(\bar{f} - \mathbb{E}_\pi[f(\theta)]\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2_{\text{CLT}}) \quad (5)$$

where $N$ is the number of Markov-chain states,

$$\bar{f} = \frac{1}{N}\sum_{i=1}^{N} f(\theta_i), \quad (6)$$

$$\sigma^2_{\text{CLT}} = \mathbb{V}_\pi[f(\theta_1)] + 2\sum_{i=2}^{\infty} \mathrm{cov}_\pi\!\left(f(\theta_1), f(\theta_i)\right), \quad (7)$$

$\mathbb{E}[\,\cdot\,]$ denotes the expectation, $\mathbb{V}[\,\cdot\,]$ denotes the variance, and $\mathrm{cov}(\cdot, \cdot)$ denotes the covariance.
2) MCMC Algorithms: The most basic form of a Markov-
chain operator satisfying the conditions of Theorem 1 is the
Metropolis-Hastings method [34] described in Algorithm 1.
First, a proposal is generated from an arbitrary distribution $q(\theta' \mid \theta_{t-1})$. Then, the proposal is either accepted or rejected
based on its acceptance probability $\alpha$. Each accept-reject decision forms a state of the Markov chain. This is repeated $N + N_{\text{burn}}$ times, where only the last $N$ samples are used for estimation. For a detailed introduction to MCMC algorithms, see [35] and [36].
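To make Algorithm 1 concrete, the following is a minimal sketch of a random-walk Metropolis-Hastings sampler in Julia (the language used for our implementations). The one-dimensional standard-normal target and the step size of 0.5 are illustrative assumptions, not settings used in our experiments; with a symmetric proposal, the $q$ terms cancel in $\alpha$.

```julia
# A sketch of Algorithm 1: random-walk Metropolis-Hastings on an
# unnormalized log-target (here a standard normal, an illustrative
# assumption). The proposal is symmetric, so q cancels in α.
logπ(θ) = -0.5 * θ^2                         # log π(θ) up to a constant

function metropolis_hastings(logπ, θ0; N = 4000, Nburn = 4000, step = 0.5)
    θ, samples = θ0, Float64[]
    for t in 1:(Nburn + N)
        θ′ = θ + step * randn()              # propose sample
        if log(rand()) < logπ(θ′) - logπ(θ)  # accept with prob. min(1, α)
            θ = θ′                           # accept proposal
        end                                  # otherwise reject proposal
        t > Nburn && push!(samples, θ)       # discard the burn-in period
    end
    return samples
end

samples = metropolis_hastings(logπ, 0.0)
fbar = sum(samples) / length(samples)        # Monte Carlo estimate, Eq. (3)
```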
Currently, variants of Hamiltonian Monte Carlo (HMC) have been empirically shown to be the most efficient, robust, and generically applicable [3], [37]. Specifically, the No-U-Turn Sampler (NUTS) [38], an automatic tuning variant of HMC, has been widely adopted. In HMC samplers, a proposal is generated by simulating Hamiltonian dynamics with the potential energy given by $-\log \pi(\theta)$. For a detailed introduction to HMC samplers, see [37].
3) Burn-in and adaptation: As depicted in Algorithm 1, only the last $N$ samples are used for inference. The first $N_{\text{burn}}$ samples, often called burn-in or warmup samples, are discarded. Similarly, the interval $i \in [0, N_{\text{burn}}]$ is called the burn-in period. The reason for discarding burn-in samples is to ensure that the Markov-chain reaches the stationary region of the posterior (which some have previously called the typical set). While the necessity of burn-in has been questioned before, such as in [35], it is a standard practice. Also, recent MCMC algorithms such as NUTS utilize the burn-in samples for adapting the hyperparameters of $K(\cdot, \cdot)$ (for example, by using the Nesterov dual-averaging procedure [38]).
III. EVALUATING THE STRONG SCALABILITY OF PARALLEL MARKOV-CHAIN MONTE CARLO
A. Defining Acceleration of Bayesian Inference
Accelerating Bayesian inference has been a central goal
for advancing the ideals of Bayesian methodologies. However,
what does accelerating Bayesian inference truly mean? First,
we focus on the goal of Bayesian inference, which is to
estimate the variables in question with the least amount of
error.
1) Estimation error: The estimation error is defined by the asymptotic root mean squared error (RMSE), which can be derived from (7) as

$$\text{RMSE} = \sqrt{\mathbb{E}\!\left[(\bar{f} - \mathbb{E}[f])^2\right]} = \frac{\sigma}{\sqrt{N_{\text{eff}}}} \quad (8)$$

where $\sigma^2 = \mathbb{V}_\pi[f(\theta_1)]$ and $N_{\text{eff}}$ is the effective sample size (ESS). The ESS is defined as

$$N_{\text{eff}} = \frac{N}{\tau} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k} \quad (9)$$

where $\rho_k$ is the autocorrelation of the Markov-chain with lag $k$, and $N$ is the number of samples [34], [39]. Here, $\tau$ quantifies the statistical performance of a Markov-chain transition kernel $K(\theta, \theta')$. The goal of MCMC sampling is to estimate the quantity $\mathbb{E}[f]$ with a low RMSE by either increasing $N$ or decreasing $\tau$. Note that for some MCMC-based methods, $\tau$ is not necessarily computed using the autocorrelation (for example, as in [29]). For this reason we will call $\tau$ the variance inflation factor.
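As an illustration of Equations (8) and (9), the following sketch estimates $N_{\text{eff}}$, $\tau$, and the asymptotic RMSE from a vector of draws (such as those produced by the earlier Metropolis-Hastings sketch). Truncating the autocorrelation sum at the first non-positive lag is a common heuristic in the spirit of [39]; it is not the exact estimator used in our experiments.

```julia
# Estimating N_eff, τ, and the asymptotic RMSE (Equations (8) and (9))
# from a vector of draws. Truncation at the first non-positive lag is a
# common heuristic, not the exact estimator used in the experiments.
using Statistics

function ess(x)
    N, μ, σ² = length(x), mean(x), var(x)
    τ = 1.0
    for k in 1:(N - 1)
        ρ = sum((x[1:N-k] .- μ) .* (x[1+k:N] .- μ)) / ((N - k) * σ²)
        ρ <= 0 && break                  # truncate the sum in Eq. (9)
        τ += 2ρ
    end
    return N / τ, τ                      # (N_eff, variance inflation factor)
end

Neff, τ = ess(samples)
rmse = sqrt(var(samples) / Neff)         # asymptotic RMSE, Equation (8)
```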
2) The objective of parallelization: In general, the goal of
parallelization is either to increase the amount of computation
done in a fixed interval or to decrease the amount of time
of performing a fixed amount of computation. The former
is known as weak scalability, while the latter is known as
strong scalability. In MCMC, the amount of time spent on
computation is roughly proportional to the number of samples $N$. While this is not perfectly true because of algorithmic variation in samplers such as NUTS, we will stick with this assumption throughout this paper for simplicity. With that
said, for MCMC, weak scaling means to reduce the RMSE by
increasing resources, while strong scaling means to decrease
the time until achieving a fixed target RMSE by increasing
resources. Now, acceleration of Bayesian inference can be
characterized into two categories: weak scaling and strong
scaling.
3) What we were achieving so far: Until now, weak
scaling has been the unspoken goal of parallelizing MCMC.
While some discussions regarding strong scalability exist [40],
[41], [42], the ESS per unit time (ESS/second) metric has
recently seen dominant use for comparing different algorithms.
For instance, [4, p. 10] states that “. . . effective sample size
per second (or its inverse) is the most relevant statistic for
comparing the efficiency of sampler implementations”. Since
a higher ESS/second value signifies that lower error can be
achieved during the same unit time, it is a typical weak
scalability metric. For this reason, we first question whether
the path of weak scalability is leading us to where we truly want
to be. To answer this, we first look into the outcomes of weak
and strong scaling.
4) What weak and strong scaling deliver: With weak
scalability, we can achieve a lower error by increasing our
computational resources. Since the ultimate goal of inference
is to achieve lower RMSE of the target estimator, weak scaling
might sound attractive at first. However, it is questionable
whether we genuinely need arbitrarily low error. In practice,
achieving an “acceptable level” of RMSE faster is more
desirable than merely achieving arbitrarily low RMSE. Also,
according to Equation (8), the estimation error can only be reduced at a rate of $\mathcal{O}(1/\sqrt{N})$. This slow rate of reduction causes an issue of diminishing returns [19, pp. 112-113]:
Unfortunately, even if this lofty parallel speedup
goal is achieved, the asymptotic picture for MCMC
is dim: in the asymptotic regime, doubling the
number of samples collected can only reduce the Monte Carlo standard error by a factor of $\sqrt{2}$. This
scaling means that there are diminishing returns to
purchasing additional computational resources, even
if those resources provide linear speedup in terms of
accelerating the execution of the MCMC algorithm.
Again, if achieving an "acceptable level" of RMSE is the ultimate goal, it is not only the amount of error we can reduce that diminishes, but also the utility of such reduction that vanishes. In sharp contrast, strong scaling actually reduces the
time spent for inference. Hence, Bayesian inference is more
often a strong scaling problem rather than a weak scaling
problem. What we want is to achieve an acceptable amount
of error faster, rather than arbitrarily low error.
5) A case for strong scalability: Weak scalability does not reduce the time for Bayesian inference. Instead, it enables us to perform inference more accurately, which definitely has a place. However, as discussed, weak scaling has a problem of diminishing returns. Moreover, the popularity of the ESS/second metric has resulted in a bias towards weak scalability. For these reasons, the strong scalability of currently
known parallelization strategies is not well understood. Thus,
we dedicate the remainder of this paper to evaluating the strong
scalability of parallel MCMC methods.
B. Evaluating the Strong Scalability of Intrachain Parallelism
From now on, we will evaluate the strong scalability of
intrachain and interchain approaches to parallelizing MCMC.
Intrachain parallelism accelerates inference by accelerating the execution of a single MCMC chain. Intrachain parallelization approaches include parallelizing the evaluation of the likelihood, the parallel delayed rejection algorithm, and prefetching methods.
1) Parallel delayed rejection and prefetching: Parallel delayed rejection is a parallel realization of the delayed rejection
algorithm [21], [22]. In parallel delayed rejection, multiple
proposals are generated in parallel, while accept-reject de-
cisions are made sequentially for each proposal until one is
accepted. Once a proposal is accepted, all other proposals are
discarded. For this reason, parallel delayed rejection wastes
a lot of computation, bounding the speedup sublinearly [21],
[22]. Meanwhile, prefetching methods [20], [23], [24] extract
parallelism by simulating multiple iterations of the Markov-
chain asynchronously. Similarly to delayed rejection, these
approaches achieve only logarithmic speedup [20] as they
waste a lot of computation as soon as an accept-reject decision is sealed.
2) Likelihood parallelization: On the other hand, likelihood parallelization [15], [11], [16] aims to accelerate MCMC
by parallelizing the computation of the likelihood. Since
the likelihood is the most computationally expensive part in
Bayesian inference, likelihood parallelization looks promis-
ing [51]. Indeed, Stan [4] explicitly supports this model of
parallelization through features such as map_rect (from ver-
sion 2.18.0) and reduce_sum (from version 2.23). However,
how much does likelihood parallelization deliver in practice?
From now on, we will empirically evaluate the gains of
parallelizing the likelihood by computing the Amdahl speedup
limit [18] on a diverse set of Bayesian models. Precisely, we
measure the time spent for computing the likelihood using
Bayesian models based on the Stan PPL [4].
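The computation behind this limit reduces to Amdahl's law: if the likelihood accounts for a proportion $p$ of the execution time, parallelizing only the likelihood over $P$ units bounds the attainable speedup, and letting $P \to \infty$ gives the limit we report. A minimal sketch (the values of $p$ are taken from Table I below):

```julia
# Amdahl's law [18] for a likelihood occupying a proportion p of the
# execution time, and its limit as the number of units P → ∞.
amdahl(p, P)    = 1 / ((1 - p) + p / P)
amdahl_limit(p) = 1 / (1 - p)

amdahl_limit(0.994)   # ad:    ≈ 167× at best
amdahl_limit(0.853)   # stock: ≈ 6.8× at best
```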
3) Experimental settings: We first forked¹ and modified
the Stan runtime in order to measure the execution time of
the likelihood and its gradient. We sample 4000 samples using
Stan’s implementation of the NUTS MCMC algorithm [38]
after discarding 4000 samples for burn-in. We use the default
configuration of CmdStan v2.23.0. The server we use for the
experiment runs Linux 4.15 on Intel Xeon E7–8870 v2 pro-
cessors and has 770GB of RAM. The processor frequency was
fixed to 2.7GHz for all the experiments using the cpupower
frequency-set command. All of the experiments are
repeated 32 times each.
4) Analysis methodology: Following the Bayesian theme of this paper, we employ a Bayesian analysis methodology for the results. First, for each of the 32 repetitions, we estimate the execution time proportion $p = T_{\text{likelihood}} / T_{\text{total}}$ of the likelihood execution time ($T_{\text{likelihood}}$) relative to the total execution time ($T_{\text{total}}$). We assume the data is generated from the following beta-regression model:

$$\mu \sim \text{Uniform}(0, 1)$$
$$\phi \sim \text{Inv-Gamma}(1, 1)$$
$$p \sim \text{Beta}(\alpha, \beta) \quad \text{where } \alpha = \mu\phi,\; \beta = (1 - \mu)\phi,$$

where $\mu$ and $\phi$ are the mean and precision of $p$. We set an uninformative uniform prior on $\mu$ and an inverse-gamma prior on $\phi$. We use the Turing.jl [5] PPL for describing the aforementioned model, and draw 4000 samples using NUTS after discarding 4000 samples for burn-in.
¹Forked from the repository https://github.com/stan-dev/stan on July 13, 2020.
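For illustration, a minimal sketch of this analysis model in Turing.jl follows. The vector p_hat is an assumption here (a placeholder standing in for the 32 measured proportions $T_{\text{likelihood}}/T_{\text{total}}$), and the NUTS settings are only indicative of the ones described above.

```julia
# A sketch of the beta-regression analysis model in Turing.jl. p_hat
# stands in for the 32 measured proportions (placeholder data below).
using Turing

@model function timeproportion(p)
    μ ~ Uniform(0, 1)                    # uninformative prior on the mean
    ϕ ~ InverseGamma(1, 1)               # prior on the precision
    α, β = μ * ϕ, (1 - μ) * ϕ
    for i in eachindex(p)
        p[i] ~ Beta(α, β)                # likelihood of each proportion
    end
end

p_hat = rand(Beta(50, 3), 32)            # placeholder for measured data
chain = sample(timeproportion(p_hat), NUTS(4_000, 0.65), 4_000)
```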
TABLE I: Execution Time Proportion ($p$) of the Likelihood

Name      | References | Characterization²               | #Datapoints | #Parameters | Exec. Time (seconds) | Estimated $p$¹
----------|------------|---------------------------------|-------------|-------------|----------------------|----------------------
ad        | [43]       | Logistic Regression             | 354         | 7000        | 14167                | 0.994 (0.994, 0.995)
butterfly | [43]       | Hierarchical Bayesian           | 3           | 28          | 212                  | 0.974 (0.973, 0.974)
cancer    | [44], [45] | Sparse Logistic Regression      | 1434        | 102         | 17902                | 0.947 (0.947, 0.947)
covid     | [46]       | Hierarchical Bayesian           | 87          | 75          | 5808                 | 0.990 (0.990, 0.990)
disease   | [43]       | Logistic Regression             | 345         | 10          | 6115                 | 0.986 (0.986, 0.986)
lda       | [47], [48] | Mixed-membership Model          | 1224        | 7737        | 5900                 | 0.990 (0.990, 0.990)
racial    | [43]       | Hierarchical Bayesian           | 5           | 300         | 2212                 | 0.911 (0.909, 0.912)
soccer    | [49]       | Hierarchical Bayesian           | 331         | 3040        | 33148                | 0.958 (0.957, 0.958)
stock     | [47], [50] | Stochastic Volatility           | 1006        | 5030        | 5033                 | 0.853 (0.852, 0.854)
votes     | [43]       | Hierarchical Gaussian Processes | 11          | 550         | 3279                 | 0.852 (0.851, 0.854)

¹We report the posterior median along with the 80% credible interval in the parentheses.
²The characterization is based on [43].
Fig. 1: The Amdahl limit estimated from the posterior samples of $p$. The error bars are the 80% credible intervals.
5) Benchmark models: We chose models from the
BayesSuite benchmark [43] (ad, butterfly, disease, racial,
votes), the Stan User Guide [47] (lda, stock), and other
independent sources (soccer,covid,cancer). The considered
workloads are organized in Table I with their description and
average execution time. For the stock benchmark, we use the
S&P 500 daily closing price data from the year 2000 to 2020.
Also, for the cancer benchmark, we use the prostate cancer
(Prostate_GE) dataset from [52]. Lastly, for the covid, lda,
and soccer benchmarks, we use the datasets used in their
original works.
6) Results: The summaries of the estimated posterior of $p$ are shown on the right of Table I. Within the considered models, the proportion of the likelihood computation time ($p$) exceeds 99% only for ad, covid, and lda. Also, for stock and votes, the proportion is only around 85%. The estimated Amdahl limits, which give the maximum theoretical speedup we can achieve, are shown in Fig. 1. The results show that the likelihood of the considered statistical models is generally not dominant enough to achieve significant speedup.
We performed various analyses on the results to see if 𝑝
is related to any of the problem’s characteristics, such as the
number of datapoints or the number of parameters. However,
we failed to find any significant correlation. This suggests
that the efficiency of likelihood parallelization results from
complex interactions between multiple factors spread across
the PPL ecosystem.
7) Discussion: It is important to note that the estimates in
Fig. 1 are optimistic estimates assuming infinite computational
resources. In practice, the speedup gains would be much less.
Also, in general, the amount of parallelism in the statistical
models varies greatly. While simple logistic regression models
with very tall datasets (datasets with lots of data points) have
significant parallelism, more complex hierarchical Bayesian
models do not have as much parallelism. Moreover, paral-
lelizing the likelihood is a burden that must be carried by
the user. For example, in Stan, utilizing constructs such as
map_rect and reduce_sum requires a large number of
user code changes. In the end, parallelizing the likelihood is
by no means an automatic approach, and its performance gain
is highly dependent on the model used. Since other methods
such as parallel delayed rejection and prefetching also have
fundamental limits, it is unclear whether we could achieve
pain-free linear speedup with intrachain parallelization.
C. Evaluating the Strong Scalability of Interchain Parallelism
The interchain approach achieves speedup by executing
multiple independent chains in parallel. Because of the seem-
ing independence of the chains, this approach is often de-
scribed as being embarrassingly parallel. However, in this section, we show that this appearance is deceptive with respect to the proper assessment of strong scalability. To properly assess the strong
scalability of interchain parallelism, we will first theoretically
quantify the speedup. Then, we will empirically evaluate
the scalability of previously proposed interchain parallelism
schemes.
1) Theoretical analysis: Our discussion starts with the asymptotic error rate of MCMC given by Equations (8) and (9). By rearranging the equations, we obtain the number of samples needed to achieve an error rate $\epsilon$ as $N = \tau\sigma^2/\epsilon^2$, given the variance of the estimated statistic ($\sigma^2$) and the variance inflation factor ($\tau$).

Assuming that generating a sample costs a constant execution time $T$, the amount of work $W$ needed to achieve an error $\epsilon$ is

$$W_{\text{seq}} = T(N + N_{\text{burn}}) = T(1 + b_{\text{seq}})N \quad (10)$$
$$= T(1 + b_{\text{seq}})\,\tau_{\text{seq}}\,\frac{\sigma^2}{\epsilon^2} \quad (11)$$

where $N_{\text{burn}}$ is the number of burn-in samples and $b$ is the proportion of burn-in samples such that $N_{\text{burn}} = Nb$. Normally, the burn-in ratio $b$ is chosen to be quite high. For example, [53] heuristically suggests using $b = 1.0$, where half of the samples are used for inference, and half are used for burn-in. This ratio has also been supported by [54] according to non-asymptotic error analysis results.
It is also important to note that, in practical conditions, reducing the absolute amount of burn-in results in an inferior $K(\cdot, \cdot)$ with a relatively higher autocorrelation. This is because recent MCMC algorithms use the burn-in samples for adapting the hyperparameters of $K(\cdot, \cdot)$. Moreover, since earlier samples of the Markov-chain often contain a lot of bias, shortening the burn-in period forces the adaptation period to use only low-quality samples. For these reasons, we will roughly assume that reducing $N_{\text{burn}}$ increases the variance inflation factor $\tau$.
By executing multiple MCMC chains in parallel and combining the results, we can reduce the execution time by the number of computing units $P$. In this case, the amount of work done by each computing unit in parallel is

$$W_{\text{par}} = \frac{T(N + N_{\text{burn}})}{P} = T(1 + b_{\text{par}})N\,\frac{1}{P} \quad (12)$$
$$= T(1 + b_{\text{par}})\,\tau_{\text{par}}\,\frac{\sigma^2}{\epsilon^2}\,\frac{1}{P}. \quad (13)$$

Equations (11) and (13) give us the execution time required for achieving an error $\epsilon$ by running MCMC sequentially and in parallel.

The speedup $S$ of running multiple chains in parallel is now given as

$$S = \frac{W_{\text{seq}}}{W_{\text{par}}} = \frac{T(1 + b_{\text{seq}})\,\tau_{\text{seq}}\,\sigma^2/\epsilon^2}{T(1 + b_{\text{par}})\,\tau_{\text{par}}\,(\sigma^2/\epsilon^2)(1/P)} \quad (14)$$
$$= P\,\frac{\tau_{\text{seq}}(1 + b_{\text{seq}})}{\tau_{\text{par}}(1 + b_{\text{par}})}. \quad (15)$$
Remark 1 (Amdahl's Law): If we hold $N_{\text{burn}}$ constant, we retrieve Amdahl's law. This can be done by setting $b_{\text{par}} = P\,b_{\text{seq}}$, since the length of the individual chains is shortened by a factor of $P$. Assuming the performance of the samplers is equal so that $\tau_{\text{seq}} = \tau_{\text{par}}$, and writing $b = b_{\text{seq}}$, the speedup is given as

$$S = P\,\frac{1 + b}{1 + Pb} = \frac{1 + b}{(1/P) + b}, \quad (16)$$

which is structurally equivalent to Amdahl's law. Thus, if $b_{\text{seq}} > 0$, then $S < P$.
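For concreteness, Equations (15) and (16) can be evaluated directly; the sketch below uses illustrative values (a burn-in ratio of $b = 1.0$ and $P = 64$) and is not tied to any particular experiment. With a fixed $N_{\text{burn}}$, the speedup saturates near $(1 + b)/b = 2$ regardless of $P$.

```julia
# Equation (15): interchain speedup under a fixed error target ε.
speedup(P; τseq = 1.0, τpar = 1.0, bseq = 1.0, bpar = 1.0) =
    P * τseq * (1 + bseq) / (τpar * (1 + bpar))

# Remark 1: fixing N_burn means b_par = P * b_seq, recovering Amdahl's
# law (Equation (16)).
amdahl_form(P; b = 1.0) = (1 + b) / (1 / P + b)

amdahl_form(64)            # ≈ 1.97: with b = 1, S can never exceed 2
speedup(64, τpar = 1.5)    # ≈ 42.7: parallel chains with 1.5× higher τ
```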
2) Limitation due to burn-in ($N_{\text{burn}}$): Since in practice it is almost always true that $b_{\text{seq}} > 0$, Remark 1 shows that fixing the number of samples used for burn-in is always inefficient [40], [42], unless $N \gg N_{\text{burn}}$ (similar to the asymptotic results of [55, Thm 6.4]). Because of the diminishing returns discussed in Section III-A, it would be inefficient to maintain $N \gg N_{\text{burn}}$. Thus, in realistic settings, the effect of $N_{\text{burn}}$ will not be negligible in terms of scalability.
3) Can we avoid burn-in?: Since holding the amount of burn-in constant fundamentally restricts strong scalability, methods for completely avoiding the need for burn-in have been proposed. For example, [13] suggests using perfect simulation [56] in order to obtain initial points for the Markov-chain that are within the stationary region.
Fig. 2: RMSE of estimating $\mu_{10}$ from samples of 64 parallel Markov-chains where $N_{\text{burn}} = 64$. The red lines show the RMSE decreasing with the number of samples per chain; the target line is the target RMSE achieved by the single, longer MCMC chain ($N = 4096$, $N_{\text{burn}} = 4096$); the optimal line denotes the iteration where the theoretically optimal speedup is achieved. We estimated the RMSE by repeating the sampling process $2^{12}$ times.
Similarly, DynamicHMC.jl [57], an implementation of the NUTS algorithm supported by
Turing.jl, performs maximum a-posteriori estimation to obtain
an initial point close to the mode of the posterior. However,
these approaches have a catch. In the name of reducing burn-
in, these methods introduce additional computation, which
decreases efficiency just as burn-in does. Also, recent MCMC
algorithms need the burn-in period to perform hyperparameter
adaptation.
Then, what about reducing $N_{\text{burn}}$ according to the number of chains?
Remark 2 (Reducing the amount of burn-in): By holding the burn-in ratio constant and assuming that the autocorrelation changes, the number of samples generated by each chain is $N/P$ while the number of burn-in samples is $N_{\text{burn}}/P$. Then, the speedup is given as

$$S = P\,(\tau_{\text{seq}}/\tau_{\text{par}}). \quad (17)$$

Consequently, the performance is directly dependent on the performance, or inflation factor, of the Markov-chains.

The results of Remark 2 are actually quite promising. If we can keep the variance inflation factor $\tau_{\text{seq}} \approx \tau_{\text{par}}$, then we can achieve a linear speedup such that $S \approx P$. From now on, we will empirically evaluate the possibility of this direction.
4) Experimental settings: Neal's 10-D Gaussian: First, we perform experiments on a simple synthetic problem to evaluate the effect of shortening the chains and starting the chains within the stationary region. We run the NUTS sampler of AdvancedHMC.jl [5], [58] on a 10-dimensional "Neal's Gaussian", which is a multivariate Gaussian distribution $\mathcal{N}(0, \Sigma)$ where $\Sigma = \text{Diagonal}(0.01, 0.02, \ldots, 1.0)$ [59]. Because of the varying covariance scale, Neal's Gaussian is appropriate for evaluating the effect of adapting $K(\cdot, \cdot)$. We first sample $N = 4096$ samples after $N_{\text{burn}} = 4096$ burn-in samples with a single, long Markov-chain. Then, we execute 64 parallel chains, where each chain spends $N_{\text{burn}}/P = 4096/64 = 64$ samples for burn-in, and compare the RMSE against the single long chain. The RMSE is estimated by repeating the sampling process $2^{12}$ times.

For the longer chain, we use the Stan adaptation rule [47], which alternates between adapting the preconditioner and the step-size of NUTS. For $N_{\text{burn}} < 150$, the burn-in phase is too short for using the Stan adaptation rule. In this case, we use the naive adaptation rule [5], which simply adapts the covariance and the step-size at the same time. We start each of the 64 chains from a random initial point sampled from $\mathcal{N}(1.0, 3.0^2)$ or $\mathcal{N}(0.0, 0.1^2)$. Since the former is much wider than the stationary region, it simulates the effect of sampling the initial point from the prior distribution, which is an effective heuristic used in practice. Since strong contraction from the prior $p(\theta)$ to the posterior $p(\theta \mid \mathcal{D})$ is often expected, the initial distribution will often be much wider than the stationary region.
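The following sketch illustrates the evaluation protocol only; it reuses the simple Metropolis-Hastings kernel and target from the sketch in Section II-B as a one-dimensional stand-in for NUTS on Neal's Gaussian, so adaptation effects, which turn out to be decisive, are not reproduced.

```julia
# Protocol sketch: P shortened chains (each with N_burn/P burn-in draws)
# pooled and compared against one long chain. Reuses metropolis_hastings
# and logπ from the earlier sketch; adaptation effects are not modeled.
using Statistics

P, N, Nburn = 64, 4096, 4096
init() = 1.0 + 3.0 * randn()         # θ₁ ∼ N(1.0, 3.0²): away from the mode

pooled = reduce(vcat, [metropolis_hastings(logπ, init();
                                           N = N ÷ P, Nburn = Nburn ÷ P)
                       for _ in 1:P])
long = metropolis_hastings(logπ, init(); N = N, Nburn = Nburn)

# One squared error per configuration (the true mean is 0 here); the
# RMSE in Fig. 2 averages such errors over 2^12 repetitions.
abs2(mean(pooled)), abs2(mean(long))
```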
5) Results: Neal's 10-D Gaussian: The results for estimating the mean of the 10th parameter ($\mu_{10}$) can be seen in Fig. 2. The $x$-axis is the number of samples drawn from each parallel MCMC chain, which is roughly proportional to the execution time. The red lines show the RMSE decreasing as we draw more samples. The target line is the RMSE achieved by the single, longer MCMC chain ($N = 4096$, $N_{\text{burn}} = 4096$). The RMSE was estimated by repeating the sampling process $2^{12}$ times. The optimal line denotes the number of samples per chain where perfect linear speedup would be achieved. Unfortunately, the parallel chains required many more samples to achieve the target RMSE. In all cases, the chains appeared to have properly converged ($\hat{R} < 1.01$). We can see that the strong scaling speedup achieved by $\theta_1 \sim \mathcal{N}(1.0, 3.0^2)$ is about $S = 45$, while for $\theta_1 \sim \mathcal{N}(0.0, 0.1^2)$ it is about $S = 60$. Note that this is not a perfectly fair comparison, as the cost of starting from the stationary region needs to be considered.
6) Does starting from the stationary region help?: Yes. Starting within the stationary region ($\theta_1 \sim \mathcal{N}(0.0, 0.1^2)$) achieved much better efficiency compared to starting away from it ($\theta_1 \sim \mathcal{N}(1.0, 3.0^2)$). Since the chains converged regardless of the initial point, it is the quality of adaptation that made the difference. By starting away from the stationary region, the samples used for adapting $K(\cdot, \cdot)$ contain a lot of bias, worsening the performance. These results suggest that it is not only the convergence but also the adaptation of the Markov-chains that is critical to performance. This observation is especially crucial for HMC-based methods, as their performance is highly dependent on appropriate tuning of the kernel [59].
From now on, we compare the performance of previously
proposed approaches for interchain parallelization.
7) Considered parallel MCMC algorithms: We consider interchain adaptation (INCA) by [26], and the generalized Metropolis-Hastings algorithm (GMH) [28], [29]. We implemented these algorithms on top of AdvancedHMC.jl using the Julia language. In INCA, the samples generated by all the parallel chains are used during adaptation. Then, instead of using only $N_{\text{burn}}/P$ samples, $N_{\text{burn}}$ samples can be used for adaptation. While the original INCA was applied to the adaptive Metropolis algorithm (AM, [60]), we apply it to NUTS with the windowed acceptance [61] scheme. With AM, INCA can be carried out not only during burn-in, but from beginning to end. However, for NUTS, we can only apply INCA during the burn-in period to preserve ergodicity. We use the Stan adaptation rule for $N_{\text{burn}}/P \geq 150$, and the naive adaptation rule for $N_{\text{burn}}/P < 150$. We also include the non-INCA version of NUTS as a baseline.
The GMH algorithm, on the other hand, is very different in that it can be regarded as both an intrachain and an interchain approach. In each iteration, from a single state, GMH proposes $N_{\text{prop}}$ proposals in parallel. Then, it accepts $N_{\text{accept}}$ samples by resampling from the proposals. Finally, it selects a single sample from the proposals and uses it as the next state. Overall, GMH can be thought of as operating a single "guiding chain" from which $N_{\text{prop}}$ short parallel chains are initiated in every iteration. Although [28] showed that the "guiding chain" achieves superior ESS, setting $N_{\text{prop}} > N_{\text{accept}}$ increases the total amount of work (therefore decreasing efficiency). We will also show that the performance of the "guiding chain" is not representative of the overall samples. Since maximum parallel efficiency can be achieved by setting $N_{\text{prop}} = P$, we set $N_{\text{prop}} = N_{\text{accept}} = P$. Additionally, we use the waste-recycling extension of [29], as it is provably more efficient than the original GMH. Lastly, for the underlying MCMC algorithm of GMH, we use HMC with 32 leapfrog steps, jitter the step size by 10% (as recommended by [62]), and use an adaptation scheme identical to NUTS-INCA.
8) Metropolis-coupled MCMC (parallel tempering): Another popular type of interchain parallelization that we do not include in our experiment is Metropolis-coupled MCMC (MC³, [25], also known as parallel tempering). MC³ improves mixing of the Markov-chain by operating multiple Markov-chains with different targets such as $\pi_i(\theta) = p(\mathcal{D} \mid \theta)^{T_i}\, p(\theta)$. The parallel chains periodically exchange their current states. The exponent $T_i \in (0, 1]$ is known as the temperature parameter, where $T_i < 1$ eases exploration of the posterior distribution, as the prior is often simpler than the posterior. By exchanging the states of the parallel chains, the improved exploration is communicated across the chains. However, only the samples from the chain with the temperature $T = 1$ are used for inference. Thus, using MC³ instead of non-MC³ MCMC increases the total amount of work for acquiring the same number of samples. As a result, achieving good efficiency with MC³ is very difficult, as the ESS of the acquired samples must be $P$ times larger to accommodate the increased work. For this reason, we do not consider it in our experiment.
Fig. 3: Variance inflation factor ($\tau$) of the considered algorithms (lower is better). The solid lines are the mean, while the shaded regions demarcate the 50% and 90% bootstrap confidence intervals estimated from $2^{10}$ repetitions.

Fig. 4: Strong scaling speedup of the considered algorithms estimated from $\tau$ (higher is better). The diagonal black line shows the theoretically optimal speedup. The solid lines are the mean, while the shaded regions demarcate the 50% and 95% bootstrap confidence intervals estimated from $2^{10}$ repetitions.

9) Experimental settings: Eight schools: For comparing the parallel MCMC algorithms, we choose the eight schools problem [63], which is a hierarchical Bayesian model:

$$\mu \sim \mathcal{N}(0, 10)$$
$$\tau \sim \text{Half-Cauchy}(0, 5)$$
$$\theta_i \sim \mathcal{N}(\mu, \tau^2)$$
$$y_i \sim \mathcal{N}(\theta_i, \sigma_i^2)$$

where $i \in [1, 8]$. This version of the eight schools model is called the centered parameterization, and it is known to be difficult to sample from because of its funnel-shaped posterior. While a more efficient parameterization exists, we keep the centered parameterization as it represents challenges commonly encountered in hierarchical Bayesian models. There are 10 variables in this model in total. All the results we show are for estimating $\mu$.
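A minimal sketch of the centered parameterization in Turing.jl follows, with the eight observed effects and standard errors from [63]; the half-Cauchy prior is expressed as a truncated Cauchy distribution. This is for illustration only; our experiments use our own implementations on top of AdvancedHMC.jl.

```julia
# The centered eight-schools model in Turing.jl; y and σ are the eight
# observed effects and standard errors from [63]. Illustrative only.
using Turing

@model function eight_schools_centered(y, σ)
    μ ~ Normal(0, 10)
    τ ~ truncated(Cauchy(0, 5), 0, Inf)    # Half-Cauchy(0, 5)
    θ ~ filldist(Normal(μ, τ), length(y))  # θ_i ∼ N(μ, τ²), centered
    for i in eachindex(y)
        y[i] ~ Normal(θ[i], σ[i])          # y_i ∼ N(θ_i, σ_i²)
    end
end

y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
σ = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]
chain = sample(eight_schools_centered(y, σ), NUTS(4_096, 0.8), 4_096)
```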
10) Analysis methodology: To estimate the strong scalability of the considered algorithms, we measure the RMSE resulting from shortening the chains as suggested by Remark 2. Specifically, we increase the number of chains and shorten them such that each chain draws $N/P$ samples and spends $N_{\text{burn}}/P$ samples for burn-in. Ideally, the estimated speedup according to Remark 2 should achieve a linear speedup close to $P$. We set the base settings as $N = 4096$ and $N_{\text{burn}} = 4096$. The true mean ($\mathbb{E}_\pi[f(\theta)]$) used for estimating the RMSE is estimated from a single long reference MCMC chain where $N = 2^{20}$, $N_{\text{burn}} = 2^{12}$. Then, the variance inflation factor $\tau$ is estimated using Equations (8) and (9). The variance $\mathbb{V}_\pi[f(\theta_1)]$ is also acquired from the reference chain. From the estimated $\tau$, we compute the speedup according to Equation (17).
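Concretely, this estimation pipeline amounts to inverting Equation (8), giving $\tau = N \cdot \text{RMSE}^2 / \sigma^2$, after which Equation (17) yields the speedup. A minimal sketch, assuming $\sigma^2$ comes from the reference chain:

```julia
# τ recovered from a measured RMSE via Equations (8) and (9); σ² is the
# variance V_π[f(θ₁)] estimated from the long reference chain.
τ_from_rmse(rmse, N, σ²) = N * rmse^2 / σ²

# Speedup according to Equation (17). rmse_par is measured from the P
# pooled chains (N total draws), rmse_seq from a single chain of length N.
function estimated_speedup(rmse_par, rmse_seq; N, P, σ²)
    P * τ_from_rmse(rmse_seq, N, σ²) / τ_from_rmse(rmse_par, N, σ²)
end
```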
11) Results: Eight schools: The estimated variance inflation factor ($\tau$) is shown in Fig. 3. For all three methods, $\tau$ grows significantly as the number of chains increases while the chains' length decreases. However, for NUTS and NUTS-INCA, the inflation factors decrease slightly until $P = 16$. This is because, with more chains, it is more likely that some of the chains start near the stationary region. Such a benefit of using multiple independent chains has been discussed before in [64]. Meanwhile, the INCA scheme did not improve the performance. Since the quality of the initial samples is poor, sharing the samples during adaptation does not seem to help except for $P = 4$, where the chains are not yet too short.

The speedup estimated from the measured $\tau$ is shown in Fig. 4. We can see that the strong scaling efficiency quickly falls off with more than 32 chains for all methods except NUTS-INCA. In particular, GMH failed to achieve significant scaling with more than 4 chains. This shows that the performance increase of the "guiding chain" (which benefits from the increasing number of parallel proposals) does not reflect the overall performance. Note that we did confirm that the performance of the guiding chain increases, as originally reported in [28].
12) Discussion: In a series of experiments, we evaluated the scalability of interchain parallelism. Despite the apparent parallelism, our results suggest that interchain parallelism does not achieve near-linear scaling when the statistical performance is also considered. Mainly, shortening the chains introduces issues related to convergence and adaptation. Meanwhile, methods for improving the quality of adaptation, such as INCA, did not improve the results because of the poor statistical quality of the initial samples.
IV. RELATED WORK
Starting from [13], numerous approaches have been pro-
posed to parallelize MCMC. For a general review on acceler-
ating MCMC, we point to [19] and [14].
1) Evaluation of MCMC workloads: Until now, only a few analyses of Bayesian inference workloads have been presented. In [43], Wang et al. performed an extensive analysis of Stan inference workloads. They showed that most Stan programs were compute-bound in terms of instructions per cycle. However, they noted that interchain parallelization is trivial, as different parallel chains are independent. In Section III-C, we showed that, from a strong scaling perspective, interchain parallelism does not provide impressive scalability.
2) Quantifying scalability of MCMC: Formulas for quantifying the scalability of interchain parallelism similar to (16) appeared in [40], [42], [41]. These formulas, however, only consider the amount of burn-in. In contrast, our formula in Equation (15) also considers the effect of the variance inflation factor. Consequently, it enables the comparison of the strong scalability of MCMC algorithms in a much broader context. Meanwhile, while some works such as [25] have presented analyses of strong scalability, the speedup is computed relative to the increased amount of work, which is misleading. To properly assess strong scalability, the speedup must be computed against the original amount of work.
3) Limitations of parallelizing MCMC: As we have discussed in Sections III-B and III-C, parallelizing MCMC has multiple fundamental challenges. Some of these challenges were recognized early on. For example, the issues with burn-in were quickly pointed out by [13], [65], [20], [42], and in several other papers. However, most previous works, except [42] and [26], mainly focused on the convergence aspect of burn-in. Hence, they suggested removing burn-in by starting from the stationary region. We showed in Section III-C that this solution is only partially effective because of the adaptation of the chains.
V. DISCUSSION
In this paper, we have discussed the fundamental goals of
parallelizing MCMC computation. We evaluated the strong
scalability of previously proposed approaches for parallelizing
MCMC. Various issues, such as the tuning of kernel hyperparameters, convergence, burn-in, and applicability, complicate strong scaling.
Before concluding our work, we would like to point out that all of the aforementioned issues are fundamental limitations of MCMC-based approaches. These limitations are artifacts of the theory of Markov-chains. Instead, we propose investigating the feasibility of alternative algorithms. For example, the performance of sequential Monte Carlo (SMC) [66], [67], an algorithm that combines the benefits of importance sampling and MCMC, does not rely on the theory of Markov-chains. In sharp contrast, SMC is built on the theory of interacting particles [68], which is radically different from the theory of Markov-chains. The convergence of SMC has been established for $P \to \infty$, where $P$ is the number of particles (or parallel MCMC chains in our context). Also, designing efficient MCMC kernels internally used by SMC is much easier, since necessary conditions such as ergodicity do not have to be fulfilled [69]. While SMC has been shown to be a valid candidate for parallelization [10], [70], [71], it has yet to be thoroughly compared against parallel MCMC approaches in a setting general enough to apply to PPLs. To conclude, an important future research direction would be to compare the scalability of MCMC against alternative approaches.
ACKNOWLEDGMENT
The authors would like to express their deepest appreciation to Aki Vehtari for his valuable advice on estimating $\tau$.
The authors also thank Jisu Oh for his constructive comments
about the theory of Markov-chains.
REFERENCES
[1] A. Patil, D. Huard, and C. Fonnesbeck, “PyMC: Bayesian Stochastic
Modelling in Python,” J. Stat. Soft., vol. 35, no. 4, 2010.
[2] F. Wood, J. W. van de Meent, and V. Mansinghka, “A new approach
to probabilistic programming inference,” in Proc. 17th Int. Conf. Mach.
Learn., ser. ICML’14, 2014, pp. 1024–1032.
[3] J. Salvatier, T. V. Wiecki, and C. Fonnesbeck, “Probabilistic program-
ming in Python using PyMC3,” PeerJ Comput. Sci., vol. 2, p. e55, Apr.
2016.
[4] B. Carpenter et al., “Stan: A Probabilistic Programming Language,” J.
Stat. Soft., vol. 76, no. 1, 2017.
[5] H. Ge, K. Xu, and Z. Ghahramani, “Turing: A language for flexible prob-
abilistic inference,” in Int. Conf. Artif. Intell. Statist., ser. AISTATS’18,
2018, pp. 1682–1690.
[6] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid
Monte Carlo,” Phys. Lett. B, vol. 195, no. 2, pp. 216–222, 1987.
[7] P. Jacob, C. P. Robert, and M. H. Smith, “Using Parallel Computation
to Improve Independent Metropolis–Hastings Based Estimation," J. Comput. Graphical Statist., vol. 20, no. 3, pp. 616–635, Jan. 2011.
[8] P. E. Jacob, J. O'Leary, and Y. F. Atchadé, "Unbiased Markov chain
Monte Carlo methods with couplings,” J. Roy. Stat. Soc. B, vol. 82,
no. 3, pp. 543–600, Jul. 2020.
[9] M. A. Suchard, Q. Wang, C. Chan, J. Frelinger, A. Cron, and M. West,
“Understanding GPU Programming for Statistical Computation: Studies
in Massively Parallel Massive Mixtures,” J. Comput. Graphical Statist.,
vol. 19, no. 2, pp. 419–438, Jan. 2010.
[10] A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes, “On the
Utility of Graphics Cards to Perform Massively Parallel Simulation of
Advanced Monte Carlo Methods," J. Comput. Graphical Statist., vol. 19,
no. 4, pp. 769–789, Jan. 2010.
[11] S. Zierke and J. D. Bakos, “FPGA acceleration of the phylogenetic
likelihood function for Bayesian MCMC inference methods,” BMC
Bioinformatics, vol. 11, no. 1, p. 184, Dec. 2010.
[12] G. Mingas and C.-S. Bouganis, “Population-Based MCMC on Multi-
Core CPUs, GPUs and FPGAs,” IEEE Trans. Comput., vol. 65, no. 4,
pp. 1283–1296, Apr. 2016.
[13] J. S. Rosenthal, “Parallel computing and monte carlo algorithms,” Far
East J. Theor. Stat., vol. 4, pp. 207–236, 1999.
[14] C. P. Robert, V. Elvira, N. Tawn, and C. Wu, “Accelerating MCMC
algorithms,” Wiley Interdisciplinary Rev: Comput. Statist., vol. 10, no. 5,
p. e1435, Sep. 2018.
[15] X. Feng, K. W. Cameron, C. P. Sosa, and B. Smith, “Building the Tree
of Life on Terascale Systems,” in Proc. Int. Parallel Distrib. Process.
Symp., ser. IPDPS’07. Long Beach, CA, USA: IEEE, 2007, pp. 1–10.
[16] B. Nemeth, T. Haber, J. Liesenborgs, and W. Lamotte, “Automatic
Parallelization of Probabilistic Models with Varying Load Imbalance,”
in 20th IEEE/ACM Int. Symp. Cluster, Cloud and Internet Comput., ser.
CCGrid’20. Melbourne, Australia: IEEE, May 2020, pp. 752–759.
[17] J. L. Gustafson, “Reevaluating Amdahl’s law,” Commun. ACM, vol. 31,
no. 5, pp. 532–533, May 1988.
[18] G. M. Amdahl, “Validity of the single processor approach to achieving
large scale computing capabilities,” in Proc. April 18-20, 1967, Spring
Joint Comput. Conf. - AFIPS ’67 (Spring). Atlantic City, New Jersey:
ACM Press, 1967, p. 483.
[19] E. Angelino, M. J. Johnson, and R. P. Adams, “Patterns of Scalable
Bayesian Inference,” Found. Trends. Mach. Learn., vol. 9, no. 2-3, pp.
119–247, 2016.
[20] A. E. Brockwell, “Parallel Markov chain Monte Carlo Simulation by
Pre-Fetching,” J. Comput. Graphical Statist., vol. 15, no. 1, pp. 246–
261, Mar. 2006.
[21] J. M. R. Byrd, S. A. Jarvis, and A. H. Bhalerao, “Reducing the run-time
of MCMC programs by multithreading on SMP architectures,” in Proc.
Int. Parallel Distrib. Process. Symp., ser. IPDPS’08. Miami, FL, USA:
IEEE, Apr. 2008, pp. 1–8.
[22] ——, "On the parallelisation of MCMC by speculative chain execution,"
in Proc. Int. Parallel Distrib. Process. Symp. Workshop Phd Forum, ser.
IPDPSW’10. Atlanta, GA: IEEE, Apr. 2010, pp. 1–8.
[23] E. Angelino, E. Kohler, A. Waterland, M. Seltzer, and R. P. Adams,
“Accelerating MCMC via parallel predictive prefetching,” in Proc. 30th
Conf. Uncertainty Artif. Intell., ser. UAI’14. Arlington, Virginia, USA:
AUAI Press, 2014, pp. 22–31.
[24] I. Strid, “Efficient parallelisation of Metropolis–Hastings algorithms
using a prefetching approach,” Comput. Statist. Data Anal., vol. 54,
no. 11, pp. 2814–2835, Nov. 2010.
[25] G. Altekar, S. Dwarkadas, J. P. Huelsenbeck, and F. Ronquist, “Parallel
Metropolis coupled Markov chain Monte Carlo for Bayesian phyloge-
netic inference,” Bioinformatics, vol. 20, no. 3, pp. 407–415, Feb. 2004.
[26] R. V. Craiu, J. Rosenthal, and C. Yang, “Learn From Thy Neighbor:
Parallel-Chain and Regional Adaptive MCMC," J. Amer. Statistical
Assoc., vol. 104, no. 488, pp. 1454–1466, Dec. 2009.
[27] A. Solonen, P. Ollinaho, M. Laine, H. Haario, J. Tamminen, and
H. Järvinen, "Efficient MCMC for Climate Model Parameter Estimation: Parallel Adaptive Chains and Early Rejection," Bayesian Anal., vol. 7,
no. 3, pp. 715–736, Sep. 2012.
[28] B. Calderhead, “A general construction for parallelizing Metropolis-
Hastings algorithms,” Proc. Nat. Acad. Sci., vol. 111, no. 49, pp. 17 408–
17 413, Dec. 2014.
[29] S. Yang, Y. Chen, E. Bernton, and J. S. Liu, “On parallelizable Markov
chain Monte Carlo algorithms with waste-recycling,” Stat Comput,
vol. 28, no. 5, pp. 1073–1081, Sep. 2018.
[30] W. Neiswanger, C. Wang, and E. P. Xing, “Asymptotically exact,
embarrassingly parallel MCMC,” in Proc. 30th Conf. Uncertainty Artif.
Intell., ser. UAI’14. Arlington, Virginia, USA: AUAI Press, 2014, pp.
623–632.
[31] S. L. Scott, A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George,
and R. E. McCulloch, “Bayes and big data: The consensus monte carlo
algorithm,” Int. J. Manag. Sci. Eng. Manag., vol. 11, pp. 78–88, 2016.
[32] S. Srivastava, C. Li, and D. B. Dunson, “Scalable bayes via barycenter
in wasserstein space,” J. Mach. Learn. Res., vol. 19, no. 8, pp. 1–35,
2018.
[33] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, ser.
Springer Texts in Statistics. New York, NY: Springer New York, 2004.
[34] W. K. Hastings, “Monte Carlo sampling methods using Markov chains
and their applications,” Biometrika, vol. 57, no. 1, pp. 97–109, Apr.
1970.
[35] C. J. Geyer, “Introduction to markov chain monte carlo,” in Handbook
of Markov Chain Monte Carlo. CRC Press, 2011, pp. 3–48.
[36] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, “An Introduction
to MCMC for Machine Learning,” Mach. Learn., vol. 50, no. 1/2, pp.
5–43, 2003.
[37] M. Betancourt, “A Conceptual Introduction to Hamiltonian Monte
Carlo,” arXiv:1701.02434 [stat], Jan. 2017.
[38] M. D. Hoffman and A. Gelman, “The no-u-turn sampler: Adaptively
setting path lengths in Hamiltonian Monte Carlo,” J. Mach. Learn. Res.,
vol. 15, no. 47, pp. 1593–1623, 2014.
[39] C. J. Geyer, "Practical Markov Chain Monte Carlo," Statist. Sci., vol. 7,
no. 4, pp. 473–483, Nov. 1992.
[40] D. Wilkinson, “Parallel Bayesian Computation,” in Handbook of Par-
allel Computing and Statistics (Statistics, Textbooks and Monographs).
Chapman & Hall/CRC, 2005.
[41] V. Gopal and G. Casella, “Running regenerative markov chains in
parallel,” unpublished, 2011.
[42] L. Murray, “Distributed markov chain monte carlo,” in Proc. Neural Inf.
Process. Syst. Workshop Learn. Cores, Clusters Clouds, vol. 11, 2010.
[43] Y. Emma Wang, Y. Zhu, G. G. Ko, B. Reagen, G.-Y. Wei, and D. Brooks,
“Demystifying Bayesian Inference Workloads,” in IEEE Int. Symp.
Perform. Anal. Syst. Softw., ser. ISPASS’19. Madison, WI, USA: IEEE,
Mar. 2019, pp. 177–189.
[44] J. Piironen and A. Vehtari, “Sparsity information and regularization in
the horseshoe and other shrinkage priors,” Electron. J. Statist., vol. 11,
no. 2, pp. 5018–5051, 2017.
[45] M. Betancourt, “Bayes Sparse Regression,” Mar. 2018.
[46] Imperial College COVID-19 Response Team et al., “Estimating the
effects of non-pharmaceutical interventions on COVID-19 in Europe," Nature, Jun. 2020.
[47] Stan Development Team, “Stan modeling language users guide and
reference manual, version 2.23.0,” 2020.
[48] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[49] L. Egidi, F. Pauli, and N. Torelli, “Are Shots Predictive Of Soccer
Results?” in StanCon 2018. Zenodo, Aug. 2018.
[50] S. Kim, N. Shepherd, and S. Chib, “Stochastic Volatility: Likelihood
Inference and Comparison with ARCH Models,” Rev. Econ. Stud.,
vol. 65, no. 3, pp. 361–393, Jul. 1998.
[51] R. M. Neal, “Probabilistic inference using markov chain monte carlo
methods,” University of Toronto, Tech. Rep. CRG-TR-93-1, Sep. 1993.
[52] J. Li et al., “Feature selection: A data perspective,” ACM Comput.
Surveys, vol. 50, no. 6, p. 94, 2018.
[53] A. Gelman and K. Shirley, “Inference from simulations and monitoring
convergence,” in Handbook of Markov Chain Monte Carlo. CRC Press,
2011, pp. 163–174.
[54] D. Rudolf, “Error bounds for computing the expectation by Markov
chain Monte Carlo,” Monte Carlo Methods Appl., vol. 16, no. 3-4, Jan.
2010.
[55] G. S. Fishman, Discrete-Event Simulation. New York, NY: Springer
New York, 2001.
[56] J. G. Propp and D. B. Wilson, “Exact sampling with coupled markov
chains and applications to statistical mechanics,” Random Struct Algo-
rithms, vol. 9, no. 1–2, pp. 223–252, Aug. 1996.
[57] T. K. Papp, JackRab, D. Aluthge, J. TagBot, and M. Piibeleht, "Tpapp/DynamicHMC.jl: V2.1.6," Zenodo, Aug. 2020.
[58] K. Xu, H. Ge, W. Tebbutt, M. Tarek, M. Trapp, and Z. Ghahramani,
“AdvancedHMC.jl: A robust, modular and efficient implementation of
advanced HMC algorithms,” in Proc. 2nd Symp. Adv. Approx. Bayesian
Inference, ser. AABI’19, vol. 118. PMLR, Dec. 2020, pp. 1–10.
[59] R. M. Neal et al., “MCMC using Hamiltonian dynamics,” Handb.
Markov Chain Monte Carlo, vol. 2, no. 11, p. 2, 2011.
[60] H. Haario, E. Saksman, and J. Tamminen, “An adaptive metropolis
algorithm,” Bernoulli, vol. 7, no. 2, pp. 223–242, Apr. 2001.
[61] R. M. Neal, “An Improved Acceptance Procedure for the Hybrid Monte
Carlo Algorithm,” J. Comput. Phys., vol. 111, no. 1, pp. 194–203, Mar.
1994.
[62] R. M. Neal et al., “MCMC using Hamiltonian dynamics,” Handb.
Markov Chain Monte Carlo, vol. 2, no. 11, p. 2, 2011.
[63] A. Gelman, J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin,
Bayesian Data Analysis, 3rd ed., ser. Chapman & Hall/CRC Texts in
Statistical Science. Boca Raton: CRC Press, 2014.
[64] A. Gelman and D. B. Rubin, “Inference from Iterative Simulation Using
Multiple Sequences,” Statist. Sci., vol. 7, no. 4, pp. 457–472, Nov. 1992.
[65] X. Feng, D. A. Buell, J. R. Rose, and P. J. Waddell, “Parallel algorithms
for Bayesian phylogenetic inference,” J. Parallel Distrib. Comput.,
vol. 63, no. 7-8, pp. 707–718, Jul. 2003.
[66] P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo
samplers,” J. Roy. Statist. Soc.: B, vol. 68, no. 3, pp. 411–436, Jun.
2006.
[67] N. Chopin, "A sequential particle filter method for static models,"
Biometrika, vol. 89, no. 3, pp. 539–552, Aug. 2002.
[68] P. Del Moral, Feynman-Kac Formulae, ser. Probability and Its Applica-
tions. New York, NY: Springer New York, 2004.
[69] A. Beskos, A. Jasra, N. Kantas, and A. Thiery, “On the convergence of
adaptive sequential Monte Carlo methods," Ann. Appl. Probab., vol. 26,
no. 2, pp. 1111–1146, Apr. 2016.
[70] A. Varsi, L. Kekempanos, J. Thiyagalingam, and S. Maskell, “A Single
SMC Sampler on MPI that Outperforms a Single MCMC Sampler,”
arXiv:1905.10252 [cs, stat], May 2019.
[71] B. Nemeth, T. Haber, J. Liesenborgs, and W. Lamotte, “Relaxing
Scalability Limits with Speculative Parallelism in Sequential Monte
Carlo,” in IEEE Int. Conf. Cluster Comput., ser. CLUSTER’18. Belfast:
IEEE, Sep. 2018, pp. 494–503.