
Evaluating the Strong Scalability of Parallel Markov-Chain Monte Carlo Algorithms

Khu-rai Kim, Electronics Engineering Dept., Sogang University, Seoul, Republic of Korea (msca8h@sogang.ac.kr)
Simon Maskell, Electrical Engineering and Electronics Dept., University of Liverpool, Liverpool, UK (S.Maskell@liverpool.ac.uk)
Sungyong Park, Computer Science and Engineering Dept., Sogang University, Seoul, Republic of Korea (parksy@sogang.ac.kr)

Abstract—Markov-chain Monte Carlo (MCMC) is a popular method for performing asymptotically exact Bayesian inference. However, the acceleration of MCMC by parallelizing its computation is a significant challenge. Despite the numerous algorithms developed for parallelizing MCMC, a fundamental question remains unanswered. If parallel MCMC is the answer, what is the question? To answer this fundamental question, we first characterize the scaling of parallel MCMC algorithms into weak and strong scaling. Then, we discuss the outcomes they deliver. Under these terminologies, most previous works comparing different parallel MCMC methods fall into the weak scaling territory. For this reason, we focus on assessing the strong scalability of previously proposed parallelization schemes of MCMC, both conceptually and empirically. Also, considering the popularity and importance of probabilistic programming languages, we focus on algorithms that are applicable to a wide variety of statistical models. First, we evaluate the strong scalability of methods based on parallelizing a single MCMC chain. Second, for methods based on multiple MCMC chains, we develop a simple expression for estimating the speedup and empirically evaluate their strong scalability. Our results show that previously proposed methods for parallelizing MCMC algorithms achieve limited strong scalability. We conclude by providing future directions for parallelizing Bayesian inference.

Index Terms—Bayesian Inference, Parallel Computing, Markov-chain Monte Carlo

I. INTRODUCTION

Bayesian statistics has recently seen surging popularity. Until recently, the high computational cost of Bayesian inference methods has prohibited the wide adoption of Bayesian statistics. Fortunately, with the advances in our computational capabilities, many of the Bayesian statistical models of interest are now within arm's reach. Also, abstractions such as probabilistic programming languages (PPL, [1], [2], [3], [4], [5]) have enabled statisticians to enjoy modern computational resources with reduced complexity and enhanced portability. Meanwhile, with the ever-increasing amount of data and complexity of our statistical models, the need for high-performance parallel Bayesian inference is gaining importance.

Currently, Markov-chain Monte Carlo (MCMC) is the de-facto standard method for performing asymptotically exact Bayesian inference. MCMC algorithms are generally applicable as they assume very little about the target statistical model. Notably, variants of the Hamiltonian Monte Carlo sampler (HMC, [6]) have been widely adopted as the default sampling strategy in PPLs, such as Stan [4], PyMC3 [1], [3], and Turing.jl [5]. For PPLs, where the user's model can be complex, HMC has been shown, both theoretically and empirically, to be efficient and robust. However, MCMC algorithms are computationally expensive, requiring many thousands of evaluations of the likelihood (and, in the case of HMC, its gradient).

Because of the high cost of executing MCMC algorithms, it is natural to seek acceleration by utilizing modern high-performance computing resources. Yet, on most modern computational hardware, this can only be achieved by parallelization. Since the clock speed of processors started stagnating, the necessity of parallelizing MCMC has kept increasing over time [7], [8]. Also, the emergence of massively parallel, programmable, specialized hardware such as graphics processing units (GPU) and field-programmable gate arrays (FPGA) has opened immense benefits for parallelization [9], [10], [11], [12]. Unfortunately, the fundamentally sequential nature of MCMC algorithms makes it challenging to scale them [13], [14]. Moreover, being generally applicable enough to support PPLs adds multiple design constraints. Methods such as those in [15], [9], [10], [12], [16] that exploit the parallelism inherent in the statistical model are not generically applicable. For these reasons, developing parallel MCMC algorithms robust and automatic enough to be applied to PPLs is a vital but challenging objective.

If parallelizing MCMC is the answer, what is the question? Until now, various strategies for accelerating MCMC by parallelization have been proposed. However, a fundamental discussion of what the goal of parallelization is (from the perspective of parallel computing) has yet to be presented. For example, are we trying to achieve weak scaling [17] or strong scaling [18]? Or, more fundamentally, what do strong scaling and weak scaling even mean in the context of Bayesian inference? What are the previously proposed methods actually achieving? This study is our attempt to address these questions with a focus on fulfilling the constraints of PPLs.

In this paper, we first characterize the acceleration of Bayesian inference into strong and weak scaling. The weak scaling [17] approach reduces the estimation error as we increase the amount of computational resources. In contrast, the strong scaling [18] approach tries to reduce the time for reaching a specific target estimation error. As weak scaling and strong scaling deliver sharply different outcomes, we start with a broader discussion by comparing the outcomes of the two approaches. Since the estimation error of MCMC can only be reduced at a rate of 𝒪(1/√N) (where N is proportional to the amount of computation), the weak scaling approach is susceptible to diminishing returns [19]. Also, because of the recent trend of using the effective sample size per unit time metric (ESS/second), performance comparisons have been biased towards weak scaling. As a result, proper comparisons of the strong scalability of different parallelization methods have been left behind. Thus, we dedicate the remainder of our paper to evaluating the strong scalability of MCMC parallelization schemes.

Previously proposed strategies for parallelizing MCMC (and achieving strong scalability) can be classified into three types: intrachain parallelism, interchain parallelism, and data parallelism. Intrachain parallelism [20], [15], [21], [22], [23], [24], [11], [16] seeks speedup by parallelizing the computation of a single MCMC chain. Interchain parallelism [25], [26], [27], [28], [29], [8] computes multiple MCMC chains and utilizes the embarrassingly parallel nature of the multiple chains. On the other hand, data parallelism (mostly consisting of consensus MCMC methods, [30], [31], [32]) can be seen as combining both: splitting the data into multiple subsets and operating MCMC chains on each subset. Since the applicability of data parallelism is limited to certain types of statistical models (models that are inherently data parallel), we will restrict our discussion to intrachain and interchain parallelism.

We empirically investigate the potential strong scalability of intrachain parallelism by computing the Amdahl speedup limit [18] on multiple realistic Bayesian models. For interchain parallelism, we develop a simple formula for estimating the speedup given the autocorrelation and the amount of burn-in samples. Then, we empirically evaluate the strong scaling of previously proposed interchain parallelization methods. Our results suggest that strategies for parallelizing MCMC, whether interchain or intrachain, struggle to achieve strong scaling close to linear. Lastly, we conclude our work by providing future directions for overcoming the limits of MCMC.

To summarize, the key insights of this paper are as follows:

∙ We characterize the acceleration of MCMC algorithms into strong and weak scaling (Section III-A).
∙ We argue that the outcomes of strong scalability are more desirable than those of weak scalability. However, the analyses of previous works are largely focused on weak scalability (Section III-A).
∙ We experimentally assess the strong scalability of parallelizing a single MCMC chain (intrachain parallelization) (Section III-B).
∙ We develop a simple expression for evaluating the strong scalability of multiple-MCMC-chain parallelization (interchain parallelization), and experimentally evaluate the strong scalability of previously proposed interchain parallelization strategies (Section III-C).

II. PRELIMINARIES

A. Bayesian Inference

The goal of Bayesian inference is to obtain the distribution of a parameter of interest. First, an observable quantity, or dataset (denoted as 𝒟), is given. By assuming that the data is generated from an unobserved variable θ according to a data generation process, the likelihood of observing 𝒟 can be computed as p(𝒟 | θ). Bayesian inference focuses on inferring the inverse probability p(θ | 𝒟), called the posterior distribution, by setting a prior distribution p(θ) on θ and invoking Bayes' rule such as

  p(θ | 𝒟) = p(𝒟 | θ) p(θ) / p(𝒟) = p(𝒟 | θ) p(θ) / ∫ p(𝒟 | θ) p(θ) dθ.   (1)

Using the posterior distribution, it is possible to marginalize the parameters of an arbitrary function f such as

  𝔼_{θ∼p(θ|𝒟)}[ f(θ) ] = ∫ f(θ) p(θ | 𝒟) dθ.   (2)

Typical use-cases of marginalization include obtaining summary statistics of θ, or performing predictions based on data. Unfortunately, p(𝒟) in Equation (2), called the evidence, is often intractable unless we restrict ourselves to off-the-shelf conjugate probability distributions. To perform Bayesian analysis on more interesting and complex models, we instead obtain a finite set of samples describing p(θ | 𝒟). This is done by sampling from the joint distribution π(θ) = p(𝒟 | θ) p(θ), which is proportional to the posterior distribution up to a normalizing constant (the evidence) such that p(θ | 𝒟) ∝ π(θ). Using these N samples, we can approximate the marginalization using the Monte Carlo approximation

  ∫ f(θ) p(θ | 𝒟) dθ ≈ (1/N) Σ_{θᵢ∼p(θ|𝒟)} f(θᵢ) = f̄.   (3)
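As a concrete illustration of Equation (3), the following minimal Python sketch (the function names and the toy target are our own, not from the paper) averages f over draws from a sampler:

```python
import random
import statistics

def monte_carlo_estimate(f, sampler, n):
    """Approximate E[f(theta)] by averaging f over n posterior draws, Eq. (3)."""
    return statistics.fmean(f(sampler()) for _ in range(n))

# Toy check: if the "posterior" is N(2, 1), then E[theta^2] = mu^2 + sigma^2 = 5.
random.seed(0)
estimate = monte_carlo_estimate(lambda t: t * t,
                                lambda: random.gauss(2.0, 1.0),
                                100_000)
```

With 100,000 i.i.d. draws, the Monte Carlo standard error here is about 0.013, so the estimate lands close to the true value 5.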

B. Sampling from the Posterior

Sampling proportionally to the posterior is a significant challenge by itself. Currently adopted methods can be classified into two types: asymptotically exact methods and approximate methods. For asymptotically exact sampling, Markov-chain Monte Carlo (MCMC) is the most widely used.

1) Markov-chain Monte Carlo: The basic idea of MCMC is to construct a Markov-chain along the samples with a Markov-chain transition operator, or kernel, K(θ, θ′). By conditioning the current sample θᵢ on the previous state of the Markov-chain θᵢ₋₁, it is possible to cancel out the normalization constant and sample proportionally to the posterior. If K(θ, θ′) admits the posterior distribution p(θ | 𝒟) as its invariant measure such that

  π(θ) = ∫_Θ K(θ′, θ) π(dθ′)   (4)

and some additional assumptions hold, the states of the Markov-chain θᵢ form an asymptotically unbiased estimator.

Algorithm 1: Markov-Chain Monte Carlo

  for t ∈ [1, N_burn + N] do
    θ* ∼ q(⋅ | θ_{t−1})  (propose sample)
    α = min{ π(θ*) q(θ_{t−1} | θ*) / [ π(θ_{t−1}) q(θ* | θ_{t−1}) ], 1 }  (acceptance prob.)
    u ∼ Uniform(0, 1)
    θ_t = θ*  if u < α  (accept proposal)
    θ_t = θ_{t−1}  otherwise  (reject proposal)
  end
  return θ_{N_burn}, …, θ_{N+N_burn}

Theorem 1 (Thm 4.7.7, [33]): If the Markov-chain θ₁, …, θ_N is aperiodic, irreducible, and reversible with invariant distribution π, the Central Limit Theorem applies when 0 < σ²_CLT < +∞ such that

  √N ( f̄ − 𝔼_π[f(θ)] ) →_d 𝒩(0, σ²_CLT)   (5)

where N is the number of Markov-chain states,

  f̄ = (1/N) Σ_{i=1}^{N} f(θᵢ),   (6)

  σ²_CLT = 𝕍_π[f(θ₁)] + 2 Σ_{i=2}^{∞} cov_π( f(θ₁), f(θᵢ) ),   (7)

𝔼[⋅] denotes the expectation, 𝕍[⋅] denotes the variance, and cov(⋅, ⋅) denotes the covariance.

2) MCMC Algorithms: The most basic form of a Markov-chain operator satisfying the conditions of Theorem 1 is the Metropolis-Hastings method [34], described in Algorithm 1. First, a proposal is generated from an arbitrary distribution q(⋅ | θ_{t−1}). Then, the proposal is either accepted or rejected based on its acceptance probability α. Each accept-reject decision forms a state of the Markov chain. This is repeated N + N_burn times, where only the last N samples are used for estimation. For a detailed introduction to MCMC algorithms, see [35] and [36].
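For concreteness, Algorithm 1 with a symmetric Gaussian random-walk proposal (so the q-ratio in the acceptance probability cancels) reduces to the following self-contained Python sketch; the target density and tuning constants are illustrative choices of ours, not from the paper:

```python
import math
import random

def metropolis_hastings(log_pi, theta0, n_samples, n_burn, step=0.5, seed=1):
    """Random-walk Metropolis-Hastings as in Algorithm 1.

    Works in log space for numerical stability; the first n_burn states
    are discarded as burn-in."""
    rng = random.Random(seed)
    theta = theta0
    chain = []
    for t in range(n_burn + n_samples):
        proposal = theta + rng.gauss(0.0, step)           # theta* ~ q(.|theta_{t-1})
        log_alpha = log_pi(proposal) - log_pi(theta)      # log acceptance probability
        if math.log(rng.random()) < min(log_alpha, 0.0):  # u ~ Uniform(0,1), u < alpha
            theta = proposal                              # accept proposal
        # else: reject, chain stays at the previous state
        if t >= n_burn:
            chain.append(theta)
    return chain

# Target: standard normal, pi(theta) proportional to exp(-theta^2 / 2).
chain = metropolis_hastings(lambda th: -0.5 * th * th, 3.0, 20_000, 2_000)
posterior_mean = sum(chain) / len(chain)
```

Starting far from the mode (θ₀ = 3.0) illustrates why the burn-in states are discarded: the retained samples concentrate around the target's mean 0.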

Currently, variants of Hamiltonian Monte Carlo (HMC) have been empirically shown to be the most efficient, robust, and generically applicable [3], [37]. Specifically, the No-U-Turn sampler (NUTS) [38], an automatic-tuning variant of HMC, has been widely adopted. In HMC samplers, a proposal is generated by simulating Hamiltonian dynamics with the potential energy given by π(θ). For a detailed introduction to HMC samplers, see [37].

3) Burn-in and adaptation: As depicted in Algorithm 1, only the last N samples are used for inference. The first N_burn samples, often called burn-in or warmup, are discarded. Similarly, the interval i ∈ [0, N_burn] is called the burn-in period. The reason for discarding burn-in samples is to ensure that the Markov-chain reaches the stationary region of the posterior (which some have previously called the typical set). While the necessity of burn-in has been questioned before, such as in [35], it is a standard practice. Also, recent MCMC algorithms such as NUTS utilize the burn-in samples for adapting the hyperparameters of K(⋅, ⋅) (for example, by using the Nesterov dual-averaging procedure [38]).

III. EVALUATING THE STRONG SCALABILITY OF PARALLEL MARKOV-CHAIN MONTE CARLO

A. Defining Acceleration of Bayesian Inference

Accelerating Bayesian inference has been a central goal for advancing the ideals of Bayesian methodologies. However, what does accelerating Bayesian inference truly mean? First, we focus on the goal of Bayesian inference, which is to estimate the variables in question with the least amount of error.

1) Estimation error: The estimation error is defined by the asymptotic root mean squared error (RMSE), which can be derived from (7) as

  RMSE = √( 𝔼[ ( f̄ − 𝔼[f] )² ] ) = σ / √N_eff   (8)

where σ² = 𝕍_π[f(θ₁)] and N_eff is the effective sample size (ESS). The ESS is defined as

  N_eff = N / τ = N / ( 1 + 2 Σ_{k=1}^{∞} ρ_k )   (9)

where ρ_k is the autocorrelation of the Markov-chain at lag k, and N is the number of samples [34], [39]. Here, τ captures the statistical performance of a Markov-chain transition kernel K(θ, θ′). The goal of MCMC sampling is to estimate the quantity 𝔼[f] with a low RMSE by either increasing N or decreasing τ. Note that for some MCMC-based methods, τ is not necessarily computed using the autocorrelation (for example, as in [29]). For this reason, we will call τ the variance inflation factor.
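Equation (9) can be turned into a simple empirical estimator. The sketch below is our own illustration, truncating the autocorrelation sum at the first non-positive lag; production ESS estimators use more careful truncation rules (e.g., Geyer's initial sequence method):

```python
import random

def effective_sample_size(chain):
    """Estimate N_eff = N / (1 + 2 * sum_k rho_k), Equation (9).

    The infinite sum is truncated at the first non-positive empirical
    autocorrelation, a common simple heuristic."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    tau = 1.0  # variance inflation factor
    for k in range(1, n):
        rho = sum((chain[i] - mean) * (chain[i + k] - mean)
                  for i in range(n - k)) / (n * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return n / tau

# For i.i.d. draws the variance inflation factor tau is close to 1,
# so N_eff should be close to N.
rng = random.Random(0)
iid = [rng.gauss(0.0, 1.0) for _ in range(2_000)]
ess = effective_sample_size(iid)
```

For a positively autocorrelated chain the same estimator returns τ > 1, i.e., N_eff noticeably smaller than N.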

2) The objective of parallelization: In general, the goal of parallelization is either to increase the amount of computation done in a fixed interval or to decrease the time spent performing a fixed amount of computation. The former is known as weak scalability, while the latter is known as strong scalability. In MCMC, the amount of time spent on computation is roughly proportional to the number of samples N. While this is not perfectly true because of algorithmic variation in algorithms such as NUTS, we will stick with this assumption throughout this paper for simplicity. With that said, for MCMC, weak scaling means reducing the RMSE by increasing resources, while strong scaling means decreasing the time until achieving a fixed target RMSE by increasing resources. Thus, the acceleration of Bayesian inference can be characterized into two categories: weak scaling and strong scaling.

3) What we were achieving so far: Until now, weak scaling has been the unspoken goal of parallelizing MCMC. While some discussions regarding strong scalability exist [40], [41], [42], the ESS per unit time (ESS/second) metric has recently seen dominant use for comparing different algorithms. For instance, [4, p. 10] states that ". . . effective sample size per second (or its inverse) is the most relevant statistic for comparing the efficiency of sampler implementations". Since a higher ESS/second value signifies that a lower error can be achieved during the same unit time, it is a typical weak scalability metric. For this reason, we first question whether the path of weak scalability is leading us to where we truly want to be. To answer this, we first look into the outcomes of weak and strong scaling.

4) What weak and strong scaling deliver: With weak scalability, we can achieve a lower error by increasing our computational resources. Since the ultimate goal of inference is to achieve a lower RMSE of the target estimator, weak scaling might sound attractive at first. However, it is questionable whether we genuinely need arbitrarily low error. In practice, achieving an "acceptable level" of RMSE faster is more desirable than merely achieving arbitrarily low RMSE. Also, according to Equation (8), the estimation error can only be reduced in the order of 𝒪(1/√N). This slow rate of reduction causes an issue of diminishing returns [19, pp. 112-113]:

  "Unfortunately, even if this lofty parallel speedup goal is achieved, the asymptotic picture for MCMC is dim: in the asymptotic regime, doubling the number of samples collected can only reduce the Monte Carlo standard error by a factor of √2. This scaling means that there are diminishing returns to purchasing additional computational resources, even if those resources provide linear speedup in terms of accelerating the execution of the MCMC algorithm."

Again, if achieving an "acceptable level" of RMSE is the ultimate goal, it is not only the amount of error we can reduce that diminishes, but the utility of such a reduction also vanishes. In sharp contrast, strong scaling actually reduces the time spent for inference. Hence, Bayesian inference is more often a strong scaling problem rather than a weak scaling problem. What we want is to achieve an acceptable amount of error faster, rather than arbitrarily low error.

5) A case for strong scalability: Weak scalability does not reduce the time for Bayesian inference. Instead, it enables us to perform inference more accurately, which definitely has its place. However, as discussed, weak scaling has a problem of diminishing returns. Moreover, the popularity of the ESS/second metric has resulted in a bias towards weak scalability. For these reasons, the strong scalability of currently known parallelization strategies is not well understood. Thus, we dedicate the remainder of this paper to evaluating the strong scalability of parallel MCMC methods.

B. Evaluating the Strong Scalability of Intrachain Parallelism

From now on, we will evaluate the strong scalability of intrachain and interchain approaches to parallelizing MCMC. Intrachain parallelism accelerates inference by accelerating the execution of a single MCMC chain. Intrachain parallelization approaches span parallelizing the evaluation of the likelihood, using the parallel delayed rejection algorithm, or using the prefetching method.

1) Parallel delayed rejection and prefetching: Parallel delayed rejection is a parallel realization of the delayed rejection algorithm [21], [22]. In parallel delayed rejection, multiple proposals are generated in parallel, while accept-reject decisions are made sequentially for each proposal until one is accepted. Once a proposal is accepted, all other proposals are discarded. For this reason, parallel delayed rejection wastes a lot of computation, bounding the speedup sublinearly [21], [22]. Meanwhile, prefetching methods [20], [23], [24] extract parallelism by simulating multiple iterations of the Markov-chain asynchronously. Similarly to delayed rejection, these approaches achieve only logarithmic speedup [20], as they waste a lot of computation as soon as an accept-reject decision is sealed.

2) Likelihood parallelization: On the other hand, likelihood parallelization [15], [11], [16] aims to accelerate MCMC by parallelizing the computation of the likelihood. Since the likelihood is the most computationally expensive part of Bayesian inference, likelihood parallelization looks promising [51]. Indeed, Stan [4] explicitly supports this model of parallelization through features such as map_rect (from version 2.18.0) and reduce_sum (from version 2.23). However, how much does likelihood parallelization deliver in practice? From now on, we will empirically evaluate the gains of parallelizing the likelihood by computing the Amdahl speedup limit [18] on a diverse set of Bayesian models. Precisely, we measure the time spent computing the likelihood using Bayesian models based on the Stan PPL [4].
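The Amdahl limit follows directly from the measured likelihood fraction p. As a hypothetical illustration (the function names and the p = 0.95 value are ours, chosen close to the 'cancer' row of Table I):

```python
def amdahl_speedup(p, n_units):
    """Amdahl's law: speedup when a fraction p of the work is parallelized
    over n_units processors while the remaining (1 - p) stays sequential."""
    return 1.0 / ((1.0 - p) + p / n_units)

def amdahl_limit(p):
    """Maximum achievable speedup as n_units grows to infinity."""
    return 1.0 / (1.0 - p)

# With p = 0.95, the speedup ceiling is 20x, and even 64 processors
# deliver noticeably less than that.
limit = amdahl_limit(0.95)      # 20.0
s64 = amdahl_speedup(0.95, 64)  # about 15.4
```

Even a likelihood fraction that sounds high (95%) caps the achievable speedup at 20x, which is the point the following measurements make empirically.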

3) Experimental settings: We first forked¹ and modified the Stan runtime in order to measure the execution time of the likelihood and its gradient. We draw 4000 samples using Stan's implementation of the NUTS MCMC algorithm [38] after discarding 4000 samples for burn-in. We use the default configuration of CmdStan v2.23.0. The server we use for the experiment runs Linux 4.15 on Intel Xeon E7-8870 v2 processors and has 770GB of RAM. The processor frequency was fixed to 2.7GHz for all the experiments using the cpupower frequency-set command. All of the experiments are repeated 32 times each.

4) Analysis methodology: Following the Bayesian theme of this paper, we employ a Bayesian analysis methodology for the results. First, for each of the 32 repetitions, we estimate the execution time proportion p = T_likelihood / T_total of the likelihood (T_likelihood) relative to the total execution time (T_total). We assume the data is generated from the following beta-regression model:

  μ ∼ Uniform(0, 1)
  φ ∼ Inv-Gamma(1, 1)
  p ∼ Beta(α, β), where α = μφ, β = (1 − μ)φ,

and μ and φ are the mean and precision of p. We set an uninformative uniform prior on μ and an inverse-gamma prior on φ. We use the Turing.jl [5] PPL to describe the aforementioned model, and draw 4000 samples using NUTS after discarding 4000 samples for burn-in.
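To make the methodology concrete, here is a minimal pure-Python sketch of the same beta-regression posterior, fit by random-walk Metropolis-Hastings on transformed parameters instead of Turing.jl's NUTS; the synthetic timing fractions stand in for the 32 measured values of p and are not the paper's data:

```python
import math
import random

def log_posterior(x, y, data):
    """Log joint of mu ~ Uniform(0,1), phi ~ Inv-Gamma(1,1), p_i ~ Beta(mu*phi,
    (1-mu)*phi), on x = logit(mu), y = log(phi), Jacobian terms included."""
    mu = 1.0 / (1.0 + math.exp(-x))
    phi = math.exp(y)
    a, b = mu * phi, (1.0 - mu) * phi
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    loglik = sum((a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p) - log_beta
                 for p in data)
    # Inv-Gamma(1,1) prior on phi plus the Jacobians of both transformations.
    return loglik - y - math.exp(-y) + math.log(mu) + math.log(1.0 - mu)

def sample_mu(data, n_samples=4000, n_burn=4000, step=0.2, seed=0):
    """Random-walk Metropolis-Hastings over (x, y); returns posterior draws of mu."""
    rng = random.Random(seed)
    mean_p = sum(data) / len(data)
    x, y = math.log(mean_p / (1.0 - mean_p)), math.log(10.0)  # rough start
    lp = log_posterior(x, y, data)
    draws = []
    for t in range(n_burn + n_samples):
        x_new, y_new = x + rng.gauss(0.0, step), y + rng.gauss(0.0, step)
        lp_new = log_posterior(x_new, y_new, data)
        if math.log(rng.random()) < lp_new - lp:  # Metropolis accept-reject
            x, y, lp = x_new, y_new, lp_new
        if t >= n_burn:
            draws.append(1.0 / (1.0 + math.exp(-x)))
    return draws

# Synthetic stand-in for the 32 measured execution-time proportions.
data_rng = random.Random(1)
fake_p = [min(max(data_rng.gauss(0.9, 0.04), 0.01), 0.99) for _ in range(32)]
mu_draws = sorted(sample_mu(fake_p))
mu_median = mu_draws[len(mu_draws) // 2]  # posterior median of mu
```

The posterior median of μ recovers the center of the synthetic fractions (about 0.9); sorting the draws also gives the 80% credible interval reported in Table I via the 10th and 90th percentiles.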

¹ Forked from the repository https://github.com/stan-dev/stan on July 13, 2020.

TABLE I: Execution Time Proportion (p) of the Likelihood

  Name       References  Characterization²                #Datapoints  #Parameters  Exec. Time (seconds)  Estimated p¹
  ad         [43]        Logistic Regression              354          7000         14167                 0.994 (0.994, 0.995)
  butterfly  [43]        Hierarchical Bayesian            3            28           212                   0.974 (0.973, 0.974)
  cancer     [44], [45]  Sparse Logistic Regression       1434         102          17902                 0.947 (0.947, 0.947)
  covid      [46]        Hierarchical Bayesian            87           75           5808                  0.990 (0.990, 0.990)
  disease    [43]        Logistic Regression              345          10           6115                  0.986 (0.986, 0.986)
  lda        [47], [48]  Mixed-membership Model           1224         7737         5900                  0.990 (0.990, 0.990)
  racial     [43]        Hierarchical Bayesian            5            300          2212                  0.911 (0.909, 0.912)
  soccer     [49]        Hierarchical Bayesian            331          3040         33148                 0.958 (0.957, 0.958)
  stock      [47], [50]  Stochastic Volatility            1006         5030         5033                  0.853 (0.852, 0.854)
  votes      [43]        Hierarchical Gaussian Processes  11           550          3279                  0.852 (0.851, 0.854)

  ¹ We report the posterior median along with the 80% credible interval in the parentheses.
  ² The characterization is based on [43].

Fig. 1: The Amdahl limit estimated from the posterior samples of p. The error bars are the 80% credible intervals.

5) Benchmark models: We chose models from the BayesSuite benchmark [43] (ad, butterfly, disease, racial, votes), the Stan User Guide [47] (lda, stock), and other independent sources (soccer, covid, cancer). The considered workloads are organized in Table I with their description and average execution time. For the stock benchmark, we use the S&P 500 daily closing price data from the year 2000 to 2020. Also, for the cancer benchmark, we use the prostate cancer (Prostate_GE) dataset from [52]. Lastly, for the covid, lda, and soccer benchmarks, we use the datasets used in their original works.

6) Results: The summaries of the estimated posterior of p are shown on the right of Table I. Within the considered models, the proportion of the likelihood computation time (p) is larger than 99% only for ad, covid, and lda. Also, for stock and votes, the proportion is only around 85%. The estimated Amdahl limits, which are the maximum theoretical speedups we can achieve, are shown in Fig. 1. The results show that the likelihoods of the considered statistical models are generally not dominant enough to achieve significant speedup.

We performed various analyses on the results to see if p is related to any of the problem's characteristics, such as the number of datapoints or the number of parameters. However, we failed to find any significant correlation. This suggests that the efficiency of likelihood parallelization results from complex interactions between multiple factors spread across the PPL ecosystem.

7) Discussion: It is important to note that the estimates in Fig. 1 are optimistic estimates assuming infinite computational resources. In practice, the speedup gains would be much smaller. Also, in general, the amount of parallelism in statistical models varies greatly. While simple logistic regression models with very tall datasets (datasets with many data points) have significant parallelism, more complex hierarchical Bayesian models do not have as much parallelism. Moreover, parallelizing the likelihood is a burden that must be carried by the user. For example, in Stan, utilizing constructs such as map_rect and reduce_sum requires a large number of user code changes. In the end, parallelizing the likelihood is by no means an automatic approach, and its performance gain is highly dependent on the model used. Since other methods such as parallel delayed rejection and prefetching also have fundamental limits, it is unclear whether we can achieve pain-free linear speedup with intrachain parallelization.

C. Evaluating the Strong Scalability of Interchain Parallelism

The interchain approach achieves speedup by executing multiple independent chains in parallel. Because of the seeming independence of the chains, this approach is often described as being embarrassingly parallel. However, in this section, we show that this appearance is deceptive when it comes to properly assessing strong scalability. To properly assess the strong scalability of interchain parallelism, we will first theoretically quantify the speedup. Then, we will empirically evaluate the scalability of previously proposed interchain parallelism schemes.

1) Theoretical analysis: Our discussion starts with the asymptotic error rate of MCMC given by Equations (8) and (9). By rearranging the equations, we obtain the number of samples needed to achieve an error ε, N = τσ²/ε², given the variance of the estimated statistic (σ²) and the variance inflation factor (τ).

Assuming that generating a sample costs a constant execution time T, the amount of work W needed to achieve an error ε is

  W_seq = T (N + N_burn) = T (1 + b_seq) N   (10)
        = T (1 + b_seq) τ_seq σ² / ε²   (11)

where N_burn is the number of burn-in samples and b is the proportion of burn-in samples such that N_burn = Nb. Normally, the burn-in ratio b is chosen to be quite high. For example, [53] heuristically suggests using b = 1.0, where half of the samples are used for inference, and half are used for burn-in. This ratio has also been supported by [54] according to non-asymptotic error analysis results.
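Equations (8) to (11) combine into a small work calculator. The sketch below (with illustrative constants of our own choosing) also makes the 𝒪(1/√N) cost visible: halving the target error ε quadruples the required work:

```python
def samples_needed(tau, sigma, eps):
    """N = tau * sigma^2 / eps^2: samples required to reach RMSE eps (Eqs. 8-9)."""
    return tau * sigma ** 2 / eps ** 2

def sequential_work(T, b, tau, sigma, eps):
    """W_seq = T * (1 + b) * tau * sigma^2 / eps^2, Equation (11)."""
    return T * (1.0 + b) * samples_needed(tau, sigma, eps)

# Illustrative constants: unit cost per sample, b = 1.0 burn-in ratio,
# variance inflation tau = 5, sigma = 1.
w1 = sequential_work(T=1.0, b=1.0, tau=5.0, sigma=1.0, eps=0.01)   # 100000.0
w2 = sequential_work(T=1.0, b=1.0, tau=5.0, sigma=1.0, eps=0.005)  # 400000.0
```

The factor-of-four jump between w1 and w2 is exactly the diminishing-returns behavior discussed in Section III-A.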

It is also important to note that, in practical conditions, reducing the absolute amount of burn-in results in an inferior K(⋅, ⋅) with a relatively higher autocorrelation. This is because recent MCMC algorithms use the burn-in samples for adapting the hyperparameters of K(⋅, ⋅). Moreover, since earlier samples of the Markov-chain often contain a lot of bias, shortening the burn-in period forces the adaptation procedure to use only low-quality samples. For these reasons, we will roughly assume that reducing N_burn increases the variance inflation factor τ.

By executing multiple MCMC chains in parallel and combining the results, we can reduce the execution time by the number of computing units P. In this case, the amount of work done by each computing unit in parallel is

  W_par = T (N + N_burn) / P = T (1 + b_par) N (1/P)   (12)
        = T (1 + b_par) τ_par (σ² / ε²) (1/P).   (13)

Equations (11) and (13) give us the execution time required for achieving an error ε by running MCMC sequentially and in parallel. The speedup S of running multiple chains in parallel is now given as

  S = W_seq / W_par = [ T (1 + b_seq) τ_seq σ² / ε² ] / [ T (1 + b_par) τ_par (σ² / ε²) (1/P) ]   (14)
    = P τ_seq (1 + b_seq) / [ τ_par (1 + b_par) ].   (15)

Remark 1 (Amdahl's Law): If we hold N_burn constant, we retrieve Amdahl's law. This can be done by setting b = b_seq and b_par = P b_seq, since the length of each individual chain is shortened by a factor of P. Assuming the performance of the samplers is equal so that τ_seq = τ_par, the speedup is given as

  S = P (1 + b) / (1 + Pb) = (1 + b) / ( (1/P) + b ),   (16)

which is structurally equivalent to Amdahl's law. Thus, if b_seq > 0, then S < P.

2) Limitation due to burn-in (N_burn): Since in practice it is almost always true that b_seq > 0, Remark 1 shows that fixing the number of samples used for burn-in is always inefficient [40], [42], unless N ≫ N_burn (similar to the asymptotic results of [55, Thm 6.4]). Because of the diminishing returns discussed in Section III-A, it would be inefficient to maintain N ≫ N_burn. Thus, in realistic settings, the effect of N_burn will not be negligible in terms of scalability.

3) Can we avoid burn-in?: Since holding the amount of burn-in constant fundamentally restricts strong scalability, methods for completely avoiding the need for burn-in have been proposed. For example, [13] suggests using perfect simulation [56] in order to obtain initial points for the Markov-chain that are within the stationary region.

Fig. 2: RMSE of estimating μ₁₀ from samples of 64 parallel Markov-chains where N_burn = 64. The dark red line shows the RMSE decreasing with the number of samples; the target line is the target RMSE acquired by the single, longer MCMC chain (N = 2048, N_burn = 2048). The optimal line denotes the iteration where the theoretically optimal speedup is achieved. We estimated the RMSE by repeating the sampling process 2¹² times.

Similarly, DynamicHMC.jl [57],

an implementation of the NUTS algorithm supported by Turing.jl, performs maximum a-posteriori estimation to obtain an initial point close to the mode of the posterior. However, these approaches have a catch. In the name of reducing burn-in, these methods introduce additional computation, which decreases efficiency just as burn-in does. Also, recent MCMC algorithms need the burn-in period to perform hyperparameter adaptation.

Then, what about reducing N_burn according to the number of chains?

Remark 2 (Reducing the amount of burn-in): By holding the burn-in ratio constant, and allowing the autocorrelation to change, the number of samples generated by each chain is N/P while the number of burn-in samples is N_burn/P. Then, the speedup is given as

  S = P (τ_seq / τ_par).   (17)

Consequently, the speedup is directly dependent on the statistical performance, or inflation factor, of the Markov-chains.

The results of Remark 2 are actually quite promising. If we can keep the variance inflation factors τ_seq ≈ τ_par, then we can achieve linear speedup, S ≈ P. From now on, we will empirically evaluate the possibility of this direction.
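The contrast between the two burn-in policies is easy to see numerically. The following sketch (function and variable names are ours) evaluates Equation (15) under both policies, assuming equal inflation factors τ_seq = τ_par:

```python
def interchain_speedup(P, b_seq, b_par, tau_seq=1.0, tau_par=1.0):
    """Equation (15): speedup of P parallel chains over one sequential chain."""
    return P * tau_seq * (1.0 + b_seq) / (tau_par * (1.0 + b_par))

# Remark 1: absolute burn-in held constant per chain, so b_par = P * b_seq.
# The speedup saturates at (1 + b) / b (here 2x), an Amdahl-style ceiling.
fixed_burnin = [interchain_speedup(P, b_seq=1.0, b_par=P * 1.0)
                for P in (1, 4, 16, 64)]

# Remark 2: burn-in shrunk proportionally, b_par = b_seq. Speedup is linear
# in P, provided the shorter adaptation does not inflate tau_par.
shrunk_burnin = [interchain_speedup(P, b_seq=1.0, b_par=1.0)
                 for P in (1, 4, 16, 64)]
```

With b = 1.0, the fixed-burn-in policy never exceeds a 2x speedup regardless of P, while the proportional policy scales as S = P as long as the τ ratio holds, which is exactly the question the experiments below probe.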

4) Experimental settings: Neal's 10-D Gaussian: First, we perform experiments on a simple synthetic problem to evaluate the effect of shortening the chains and starting the chains within the stationary region. We run the NUTS sampler of AdvancedHMC.jl [5], [58] on a 10-dimensional “Neal's Gaussian”, a multivariate Gaussian distribution 𝒩(0, Σ) where Σ = Diagonal(0.01, 0.02, …, 1.0) [59]. Because of the varying covariance scale, Neal's Gaussian is appropriate for evaluating the effect of adapting 𝐾(⋅, ⋅). We first sample with 𝑁 = 4096, 𝑁burn = 4096 using a single, long Markov-chain. Then, we execute 64 parallel chains, where each chain spends 𝑁burn/𝑃 = 4096/64 = 64 samples on burn-in, and compare the RMSE against the single long chain. The RMSE is estimated by repeating the sampling process 2¹² times.

For the longer chain, we use the Stan adaptation rule [47], which alternates between adapting the preconditioner and the step-size of NUTS. For 𝑁burn < 150, the burn-in phase is too short for the Stan adaptation rule. In this case, we use the naive adaptation rule [5], which simply adapts the covariance and the step-size at the same time. We start each of the 64 chains from a random initial point sampled from 𝒩(1.0, 3.0²) or 𝒩(0.0, 0.1²). Since the former is much wider than the stationary region, it simulates the effect of sampling the initial point from the prior distribution, which is an effective heuristic in practice. Since strong contraction from the prior 𝑝(𝜃) to the posterior 𝜋(𝜃) is often expected, the initial distribution will often be much wider than the stationary region.
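The repeat-and-average structure of this RMSE estimation can be sketched as follows. Reproducing NUTS is out of scope here, so we use independent draws from the stationary distribution as a stand-in "chain"; the scales, names, and repetition count are our own illustrative choices, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmas = np.linspace(0.1, 1.0, 10)  # stand-in for Neal's varying scales

def run_chain(n):
    # Stand-in "chain": i.i.d. draws from N(0, diag(sigmas^2)).
    return rng.normal(0.0, sigmas, size=(n, 10))

def rmse_of_mean(n_samples, n_repeats=512, dim=9):
    # RMSE of estimating mu_10 = 0 from the chain mean, over many repetitions.
    errs = [run_chain(n_samples)[:, dim].mean() for _ in range(n_repeats)]
    return float(np.sqrt(np.mean(np.square(errs))))

# The RMSE shrinks roughly as 1/sqrt(n), matching the decreasing line in Fig. 2.
print(rmse_of_mean(64), rmse_of_mean(1024))
```

With a real MCMC chain the decay is slower, since autocorrelation and adaptation quality inflate the estimator variance; that gap is exactly what the experiment measures.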

5) Results: Neal's 10-D Gaussian: The results for estimating the mean of the 10th parameter (𝜇10) can be seen in Fig. 2. The 𝑥-axis is the number of samples drawn from each parallel MCMC chain, which is roughly proportional to the execution time. The red lines show the RMSE decreasing as we draw more samples. The target line is the RMSE achieved by the single, longer MCMC chain (𝑁 = 4096, 𝑁burn = 4096). The RMSE was estimated by repeating the sampling process 2¹² times. The optimal line denotes the number of samples per chain where perfect linear speedup can be achieved. Unfortunately, the parallel chains required many more samples to achieve the target RMSE. In all cases, the chains appeared to have properly converged (𝑅̂ < 1.01). We can see that the strong scaling speedup achieved by 𝜃1 ∼ 𝒩(1.0, 3.0²) is about 𝑆 = 45, while for 𝜃1 ∼ 𝒩(0.0, 0.1²) it is about 𝑆 = 60. Note that this is not a perfectly fair comparison, as the cost of starting from the stationary region needs to be considered.
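The convergence check reported here (𝑅̂ < 1.01) is the potential scale reduction factor of Gelman and Rubin [64]. A minimal sketch of the classic between/within-chain version (without the split-chain and rank-normalization refinements that modern implementations use) looks like this; the synthetic data below is ours:

```python
import numpy as np

def r_hat(chains):
    """Gelman-Rubin potential scale reduction for an (m_chains, n_samples) array."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 2000))                       # 4 chains, same target
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])   # one chain shifted away
print(r_hat(mixed))  # close to 1: the chains agree
print(r_hat(stuck))  # well above 1.01: disagreement is flagged
```

As the experiment shows, a passing 𝑅̂ only certifies convergence; it says nothing about whether the kernel hyperparameters were adapted well.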

6) Does starting from the stationary region help?: Yes. Starting within the stationary region (𝜃1 ∼ 𝒩(0.0, 0.1²)) achieved much better efficiency compared to starting away from it (𝜃1 ∼ 𝒩(1.0, 3.0²)). Since the chains converged regardless of the initial point, it is the quality of adaptation that made the difference. By starting away from the stationary region, the samples used for adapting 𝐾(⋅, ⋅) contain a lot of bias, worsening the performance. These results suggest that it is not only the convergence but also the adaptation of the Markov-chains that is critical to performance. This observation is especially crucial for HMC-based methods, as their performance is highly dependent on appropriate tuning of the kernel [59].

We now compare the performance of previously proposed approaches to interchain parallelization.

7) Considered parallel MCMC algorithms: We consider interchain adaptation (INCA) [26] and the generalized Metropolis-Hastings algorithm (GMH) [28], [29]. We implemented these algorithms on top of AdvancedHMC.jl using the Julia language. In INCA, the samples generated by all the parallel chains are used during adaptation. Then, instead of using only 𝑁burn/𝑃 samples, 𝑁burn samples can be used for adaptation. While the original INCA was applied to the adaptive Metropolis algorithm (AM) [60], we apply it to NUTS with the windowed acceptance scheme [61]. With AM, INCA can be carried out not only during burn-in, but from beginning to end. However, for NUTS, we can only apply INCA during the burn-in period to preserve ergodicity. We use the Stan adaptation rule for 𝑁burn/𝑃 ≥ 150, and the naive adaptation rule for 𝑁burn/𝑃 < 150. We also include the non-INCA version of NUTS as a baseline.

The GMH algorithm, on the other hand, is very different in that it can be regarded as both an intrachain and an interchain approach. In each iteration, from a single state, GMH generates 𝑁prop proposals in parallel. Then, it accepts 𝑁accept samples by resampling from the proposals. Finally, it selects a single sample from the proposals and uses it as the next state. Overall, GMH can be thought of as operating a single “guiding chain” from which 𝑁prop short parallel chains are initiated in every iteration. Although [28] showed that the “guiding chain” achieves superior ESS, setting 𝑁prop > 𝑁accept increases the total amount of work (therefore decreasing efficiency). We will also show that the performance of the “guiding chain” is not representative of the overall samples. Since maximum parallel efficiency can be achieved by setting 𝑁prop = 𝑃, we set 𝑁prop = 𝑁accept = 𝑃. Additionally, we use the waste-recycling extension of [29], as it is provably more efficient than the original GMH. Lastly, for the underlying MCMC algorithm of GMH, we use HMC with 32 leapfrog steps, jitter the step-size by 10% (as recommended by [62]), and use an adaptation scheme identical to NUTS-INCA.
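To make the propose-resample-select structure of a GMH transition concrete, here is a minimal sketch of a Calderhead-style step. It is our own illustration under simplifying assumptions: a 1-D Gaussian target, i.i.d. symmetric Gaussian random-walk proposals in place of HMC, and 𝑁accept = 𝑁prop. The index weights 𝜋(𝑥ᵢ) ∏ⱼ≠ᵢ 𝑘(𝑥ᵢ, 𝑥ⱼ) follow the construction in [28] for conditionally independent proposals.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(x):
    return -0.5 * x ** 2  # standard Gaussian target, up to a constant

def gmh_step(x_current, n_prop=8, scale=1.0):
    """One generalized MH transition: propose n_prop candidates in parallel,
    resample n_prop retained samples, and pick the next "guiding" state."""
    cand = np.append(rng.normal(x_current, scale, size=n_prop), x_current)
    # Stationary weights of the index chain: pi(x_i) * prod_{j != i} k(x_i, x_j).
    diffs = cand[:, None] - cand[None, :]
    log_k = -0.5 * (diffs / scale) ** 2      # symmetric Gaussian kernel, up to a const
    log_w = log_target(cand) + log_k.sum(axis=1)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(len(cand), size=n_prop, p=w)  # resample N_accept = N_prop indices
    return cand[idx], cand[idx[-1]]

x, samples = 3.0, []
for _ in range(2000):
    batch, x = gmh_step(x)      # batch: retained samples; x: next guiding state
    samples.extend(batch)
samples = np.asarray(samples)
print(samples.mean(), samples.var())  # roughly 0 and 1 for the N(0, 1) target
```

Note how the retained samples, not just the guiding chain, are what an end user averages over; this is the distinction the experiments below probe.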

8) Metropolis-coupled MCMC (parallel tempering): Another popular type of interchain parallelization that we do not include in our experiment is Metropolis-coupled MCMC (MC³) [25], also known as parallel tempering. MC³ improves the mixing of the Markov-chain by operating multiple Markov-chains with different targets such as 𝜋𝑖(𝜃) ∝ 𝑝(𝑦 ∣ 𝜃)^𝑇𝑖 𝑝(𝜃). The parallel chains periodically exchange their current states. The exponent 𝑇𝑖 ∈ (0, 1] is known as the temperature parameter, where 𝑇𝑖 < 1 eases exploration of the posterior distribution, as the prior is often simpler than the posterior. By exchanging the states of the parallel chains, the improved exploration is communicated across the chains. However, only the samples from the chain with temperature 𝑇 = 1 are used for inference. Thus, using MC³ instead of non-MC³ MCMC increases the total amount of work for acquiring the same number of samples. As a result, achieving good efficiency with MC³ is very difficult, as the ESS of the acquired samples must be 𝑃 times larger to accommodate the increased work. For this reason, we do not consider it in our experiment.
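For concreteness, a hedged sketch of the MC³ sweep follows: likelihood-tempered Gaussian targets, within-chain random-walk Metropolis updates, and a swap move between adjacent temperatures. All densities, temperatures, and names here are illustrative assumptions, not the setup of [25].

```python
import numpy as np

rng = np.random.default_rng(3)

def log_post(x, temp):
    """Tempered log-density: temp * log-likelihood + log-prior (both Gaussian here)."""
    log_lik = -0.5 * (x - 2.0) ** 2
    log_prior = -0.5 * (x / 5.0) ** 2
    return temp * log_lik + log_prior

temps = [1.0, 0.5, 0.25, 0.1]   # T = 1 is the only chain used for inference
states = np.zeros(len(temps))

def mcmc_sweep():
    # Within-chain random-walk Metropolis update for every temperature.
    for i, t in enumerate(temps):
        prop = states[i] + rng.normal(0.0, 1.0)
        if np.log(rng.uniform()) < log_post(prop, t) - log_post(states[i], t):
            states[i] = prop
    # Swap move between a random adjacent pair of temperatures.
    i = rng.integers(len(temps) - 1)
    log_alpha = (log_post(states[i + 1], temps[i]) + log_post(states[i], temps[i + 1])
                 - log_post(states[i], temps[i]) - log_post(states[i + 1], temps[i + 1]))
    if np.log(rng.uniform()) < log_alpha:
        states[i], states[i + 1] = states[i + 1], states[i]

cold = []
for _ in range(5000):
    mcmc_sweep()
    cold.append(states[0])   # only the T = 1 samples are kept
print(np.mean(cold))         # near the posterior mean of the T = 1 target
```

The sketch makes the efficiency argument visible: four chains of work produce one chain's worth of usable samples.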

9) Experimental settings: Eight schools: For comparing the parallel MCMC algorithms, we choose the eight schools problem [63], which is a hierarchical Bayesian model,

Fig. 3: Variance inflation factor (𝜏) of the considered algorithms (lower is better). The solid lines are the mean, while the shaded regions demarcate the 50% and 90% bootstrap confidence intervals estimated from 2¹⁰ repetitions.

Fig. 4: Strong scaling speedup of the considered algorithms estimated from 𝜏 (higher is better). The diagonal black line shows the theoretically optimal speedup. The solid lines are the mean, while the shaded regions demarcate the 50% and 95% bootstrap confidence intervals estimated from 2¹⁰ repetitions.

𝜇 ∼ 𝒩(0, 10)

𝜏 ∼ Half-Cauchy(0, 5)

𝜃𝑖 ∼ 𝒩(𝜇, 𝜏²)

𝑦𝑖 ∼ 𝒩(𝜃𝑖, 𝜎𝑖²)

where 𝑖 ∈ [1, 8]. This version of the eight schools model is called the centered parameterization, and is known to be difficult to sample from because of its funnel-shaped posterior. While a more efficient parameterization exists, we keep the centered parameterization as it represents challenges commonly encountered in hierarchical Bayesian models. There are 10 variables in this model in total. All the results we show are for estimating 𝜇.
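The centered model's log posterior can be written out directly. The sketch below is ours: it assumes the 𝒩(0, 10) prior on 𝜇 uses a standard deviation of 10, and plugs in the classic eight-schools data of [63] (assumed here; check against the original). It also exposes the funnel: as 𝜏 → 0, any 𝜃𝑖 away from 𝜇 is heavily penalized.

```python
import numpy as np

y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])        # observed effects
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])  # known std. errors

def log_posterior(mu, tau, theta):
    """Centered eight-schools log posterior, up to an additive constant."""
    if tau <= 0:
        return float('-inf')                       # Half-Cauchy support is tau > 0
    lp = -0.5 * (mu / 10.0) ** 2                   # mu ~ N(0, 10^2), sd assumed
    lp += -np.log(1.0 + (tau / 5.0) ** 2)          # tau ~ Half-Cauchy(0, 5)
    lp += np.sum(-np.log(tau) - 0.5 * ((theta - mu) / tau) ** 2)  # theta_i ~ N(mu, tau^2)
    lp += np.sum(-0.5 * ((y - theta) / sigma) ** 2)               # y_i ~ N(theta_i, sigma_i^2)
    return float(lp)

# The funnel: with theta spread away from mu, small tau is far less probable.
spread = np.full(8, 5.0)
print(log_posterior(0.0, 10.0, spread) > log_posterior(0.0, 0.1, spread))  # True
```

The non-centered reparameterization mentioned in the text removes this 𝜏-𝜃 coupling, which is why it samples more easily.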

10) Analysis methodology: To estimate the strong scalability of the considered algorithms, we measure the RMSE resulting from shortening the chains as suggested by Remark 2. Specifically, we increase the number of chains and shorten them such that each chain draws 𝑁/𝑃 samples and spends 𝑁burn/𝑃 samples on burn-in. Ideally, the estimated speedup according to Remark 2 should achieve a linear speedup close to 𝑃. We set the base settings as 𝑁 = 4096 and 𝑁burn = 4096. The true mean (𝔼𝜋[𝑓(𝜃)]) used for estimating the RMSE is estimated from a single long reference MCMC chain where 𝑁 = 2²⁰, 𝑁burn = 2¹². Then, the variance inflation factor 𝜏 is estimated using Equations (8) and (9). The variance 𝕍𝜋[𝑓(𝜃1)] is also acquired from the reference chain. From the estimated 𝜏, we compute the speedup according to Equation (17).
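This pipeline can be sketched end to end. Equations (8) and (9) are defined earlier in the paper; here we substitute one common reading of the variance inflation factor, 𝜏 ≈ 𝑁 · MSE / 𝕍𝜋[𝑓(𝜃)] (the inflation relative to i.i.d. sampling). Treat that, and all names below, as our assumption rather than the paper's exact estimator.

```python
import numpy as np

def inflation_factor(estimates, true_mean, true_var, n_samples):
    """tau ~= N * MSE / Var_pi[f(theta)]: the factor by which the sampler
    inflates the estimator variance relative to i.i.d. sampling."""
    mse = np.mean((np.asarray(estimates) - true_mean) ** 2)
    return n_samples * mse / true_var

def estimated_speedup(P, tau_seq, tau_par):
    return P * tau_seq / tau_par  # Equation (17)

# Sanity check: i.i.d. sampling gives tau ~= 1 ...
rng = np.random.default_rng(4)
reps = [rng.normal(0.0, 1.0, 256).mean() for _ in range(4096)]
tau_iid = inflation_factor(reps, true_mean=0.0, true_var=1.0, n_samples=256)
# ... so parallel chains matching the sequential tau recover linear speedup.
print(tau_iid, estimated_speedup(64, 1.0, tau_iid))
```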

11) Results: Eight schools: The estimated variance inflation factor (𝜏) is shown in Fig. 3. For all three methods, 𝜏 grows significantly as the number of chains increases while the chains' length decreases. However, for NUTS and NUTS-INCA, the inflation factors decrease slightly until 𝑃 = 16. This is because, with more chains, it is more likely that some of the chains start near the stationary region. Such a benefit of using multiple independent chains has been discussed before in [64]. Meanwhile, the INCA scheme did not improve the performance. Since the quality of the initial samples is poor, sharing the samples during adaptation does not seem to help, except for 𝑃 = 4, where the chains are not yet too short.

The speedup estimated from the measured 𝜏 is shown in Fig. 4. We can see that the strong scaling efficiency quickly deteriorates with more than 32 chains for all methods except NUTS-INCA. In particular, GMH failed to achieve significant scaling with more than 4 chains. This shows that the performance increase of the “guiding chain” (which benefits from the increasing number of parallel proposals) does not reflect the overall performance. Note that we did confirm that the performance of the guiding chain increases, as originally reported in [28].

12) Discussion: In a series of experiments, we evaluated the scalability of interchain parallelism. Despite the apparent parallelism, our results suggest that interchain parallelism does not achieve near-linear scaling when the statistical performance is also considered. Mainly, shortening the chains introduces issues related to convergence and adaptation. Meanwhile, methods for improving the quality of adaptation, such as INCA, did not improve the results because of the poor statistical quality of the initial samples.

IV. RELATED WORKS

Starting from [13], numerous approaches have been proposed to parallelize MCMC. For a general review on accelerating MCMC, we point to [19] and [14].

1) Evaluation of MCMC workloads: Until now, only a few workload analyses of Bayesian inference workloads have been presented. In [43], Wang et al. performed an extensive analysis of Stan inference workloads. They showed that most Stan programs were compute-bound in terms of instructions per cycle. However, they noted that interchain parallelization is trivial, as different parallel chains are independent. In Section III-C, we showed that, from a strong scaling perspective, interchain parallelism does not provide impressive scalability.

2) Quantifying scalability of MCMC: Formulas for quantifying the scalability of interchain parallelism similar to (16) appeared in [40], [41], [42]. These formulas, however, only consider the amount of burn-in. In contrast, our formula in Equation (15) also considers the effect of the variance inflation factor. Consequently, it enables the comparison of the strong scalability of MCMC algorithms in a much broader context. Meanwhile, while some works such as [25] have presented analyses of strong scalability, the speedup is computed relative to the increased amount of work, which is misleading. To properly assess strong scalability, the speedup must be computed against the original amount of work.

3) Limitations of parallelizing MCMC: As we have discussed in Sections III-B and III-C, parallelizing MCMC has multiple fundamental challenges. Some of these challenges were recognized early on. For example, the issues with burn-in were quickly pointed out by [13], [65], [20], [42], and several other papers. However, most previous works, except [42] and [26], mainly focused on the convergence aspect of burn-in. Hence, they suggested removing burn-in by starting from the stationary region. We showed in Section III-C that this solution is only partially effective because of the adaptation of the chains.

V. DISCUSSION

In this paper, we have discussed the fundamental goals of parallelizing MCMC computation. We evaluated the strong scalability of previously proposed approaches for parallelizing MCMC. Various issues, such as the tuning of kernel hyperparameters, convergence, burn-in, and applicability, complicate strong scaling.

Before concluding our work, we would like to point out that all of the aforementioned issues are fundamental limitations of MCMC-based approaches. These limitations are artifacts of the theory of Markov-chains. Instead, we propose investigating the feasibility of alternative algorithms. For example, the performance of sequential Monte Carlo (SMC) [66], [67], an algorithm that combines the benefits of importance sampling and MCMC, does not rely on the theory of Markov-chains. In sharp contrast, SMC is built on the theory of interacting particles [68], which is radically different from the theory of Markov-chains. The convergence of SMC has been established for 𝑃 → ∞, where 𝑃 is the number of particles (or parallel MCMC chains in our context). Also, designing efficient MCMC kernels internally used by SMC is much easier, since necessary conditions such as ergodicity do not have to be fulfilled [69]. While SMC has been shown to be a valid candidate for parallelization [10], [70], [71], it has yet to be thoroughly compared against parallel MCMC approaches in a setting general enough to apply to PPLs. To conclude, an important future research direction would be to compare the scalability of MCMC against alternative approaches.
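To illustrate why SMC parallelizes over particles rather than over a chain, here is a minimal tempered SMC sketch in the spirit of [66]: reweight particles through a prior-to-posterior tempering schedule, resample, and apply one MH move per particle. The Gaussian model, schedule, and particle count are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def log_prior(x):   # N(0, 3^2) prior
    return -0.5 * (x / 3.0) ** 2

def log_lik(x):     # N(2, 0.5^2) likelihood
    return -0.5 * ((x - 2.0) / 0.5) ** 2

P = 2000                            # particles (cf. parallel chains in our context)
betas = np.linspace(0.0, 1.0, 21)   # tempering schedule from prior to posterior
x = rng.normal(0.0, 3.0, size=P)    # initialize by sampling the prior

for b_prev, b in zip(betas[:-1], betas[1:]):
    # Importance reweighting by the incremental likelihood, then resampling.
    log_w = (b - b_prev) * log_lik(x)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    x = x[rng.choice(P, size=P, p=w)]
    # One MH move per particle, targeting the current tempered density;
    # every particle moves independently, so this step is trivially parallel.
    prop = x + rng.normal(0.0, 0.5, size=P)
    log_alpha = (log_prior(prop) + b * log_lik(prop)
                 - log_prior(x) - b * log_lik(x))
    accept = np.log(rng.uniform(size=P)) < log_alpha
    x = np.where(accept, prop, x)

print(x.mean())  # near the posterior mean of this conjugate Gaussian model
```

Every step operates on all 𝑃 particles at once, with no burn-in phase; this is the structural contrast with interchain MCMC parallelism.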

ACKNOWLEDGMENT

The authors would like to express their deepest appreciation to Aki Vehtari for his valuable advice on estimating 𝜏. The authors also thank Jisu Oh for his constructive comments about the theory of Markov-chains.

REFERENCES

[1] A. Patil, D. Huard, and C. Fonnesbeck, “PyMC: Bayesian Stochastic

Modelling in Python,” J. Stat. Soft., vol. 35, no. 4, 2010.

[2] F. Wood, J. W. van de Meent, and V. Mansinghka, “A new approach

to probabilistic programming inference,” in Proc. 17th Int. Conf. Mach.

Learn., ser. ICML’14, 2014, pp. 1024–1032.

[3] J. Salvatier, T. V. Wiecki, and C. Fonnesbeck, “Probabilistic program-

ming in Python using PyMC3,” PeerJ Comput. Sci., vol. 2, p. e55, Apr.

2016.

[4] B. Carpenter et al., “Stan: A Probabilistic Programming Language,” J.

Stat. Soft., vol. 76, no. 1, 2017.

[5] H. Ge, K. Xu, and Z. Ghahramani, “Turing: A language for ﬂexible prob-

abilistic inference,” in Int. Conf. Artif. Intell. Statist., ser. AISTATS’18,

2018, pp. 1682–1690.

[6] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid

Monte Carlo,” Phys. Lett. B, vol. 195, no. 2, pp. 216–222, 1987.

[7] P. Jacob, C. P. Robert, and M. H. Smith, “Using Parallel Computation

to Improve Independent Metropolis–Hastings Based Estimation,” J.

Comput. Graphical Statist., vol. 20, no. 3, pp. 616–635, Jan. 2011.

[8] P. E. Jacob, J. O'Leary, and Y. F. Atchadé, “Unbiased Markov chain Monte Carlo methods with couplings,” J. Roy. Stat. Soc. B, vol. 82, no. 3, pp. 543–600, Jul. 2020.

[9] M. A. Suchard, Q. Wang, C. Chan, J. Frelinger, A. Cron, and M. West,

“Understanding GPU Programming for Statistical Computation: Studies

in Massively Parallel Massive Mixtures,” J. Comput. Graphical Statist.,

vol. 19, no. 2, pp. 419–438, Jan. 2010.

[10] A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes, “On the

Utility of Graphics Cards to Perform Massively Parallel Simulation of

Advanced Monte Carlo Methods,” J. Comput. Graphical Statist., vol. 19,

no. 4, pp. 769–789, Jan. 2010.

[11] S. Zierke and J. D. Bakos, “FPGA acceleration of the phylogenetic

likelihood function for Bayesian MCMC inference methods,” BMC

Bioinformatics, vol. 11, no. 1, p. 184, Dec. 2010.

[12] G. Mingas and C.-S. Bouganis, “Population-Based MCMC on Multi-

Core CPUs, GPUs and FPGAs,” IEEE Trans. Comput., vol. 65, no. 4,

pp. 1283–1296, Apr. 2016.

[13] J. S. Rosenthal, “Parallel computing and monte carlo algorithms,” Far

East J. Theor. Stat., vol. 4, pp. 207–236, 1999.

[14] C. P. Robert, V. Elvira, N. Tawn, and C. Wu, “Accelerating MCMC

algorithms,” Wiley Interdisciplinary Rev: Comput. Statist., vol. 10, no. 5,

p. e1435, Sep. 2018.

[15] X. Feng, K. W. Cameron, C. P. Sosa, and B. Smith, “Building the Tree

of Life on Terascale Systems,” in Proc. Int. Parallel Distrib. Process.

Symp., ser. IPDPS’07. Long Beach, CA, USA: IEEE, 2007, pp. 1–10.

[16] B. Nemeth, T. Haber, J. Liesenborgs, and W. Lamotte, “Automatic

Parallelization of Probabilistic Models with Varying Load Imbalance,”

in 20th IEEE/ACM Int. Symp. Cluster, Cloud and Internet Comput., ser.

CCGrid’20. Melbourne, Australia: IEEE, May 2020, pp. 752–759.

[17] J. L. Gustafson, “Reevaluating Amdahl’s law,” Commun. ACM, vol. 31,

no. 5, pp. 532–533, May 1988.

[18] G. M. Amdahl, “Validity of the single processor approach to achieving

large scale computing capabilities,” in Proc. April 18-20, 1967, Spring

Joint Comput. Conf. - AFIPS ’67 (Spring). Atlantic City, New Jersey:

ACM Press, 1967, p. 483.

[19] E. Angelino, M. J. Johnson, and R. P. Adams, “Patterns of Scalable

Bayesian Inference,” Found. Trends. Mach. Learn., vol. 9, no. 2-3, pp.

119–247, 2016.

[20] A. E. Brockwell, “Parallel Markov chain Monte Carlo Simulation by

Pre-Fetching,” J. Comput. Graphical Statist., vol. 15, no. 1, pp. 246–

261, Mar. 2006.

[21] J. M. R. Byrd, S. A. Jarvis, and A. H. Bhalerao, “Reducing the run-time

of MCMC programs by multithreading on SMP architectures,” in Proc.

Int. Parallel Distrib. Process. Symp., ser. IPDPS’08. Miami, FL, USA:

IEEE, Apr. 2008, pp. 1–8.

[22] ——, “On the parallelisation of MCMC by speculative chain execution,”

in Proc. Int. Parallel Distrib. Process. Symp. Workshop Phd Forum, ser.

IPDPSW’10. Atlanta, GA: IEEE, Apr. 2010, pp. 1–8.

[23] E. Angelino, E. Kohler, A. Waterland, M. Seltzer, and R. P. Adams,

“Accelerating MCMC via parallel predictive prefetching,” in Proc. 30th

Conf. Uncertainty Artif. Intell., ser. UAI’14. Arlington, Virginia, USA:

AUAI Press, 2014, pp. 22–31.

[24] I. Strid, “Efﬁcient parallelisation of Metropolis–Hastings algorithms

using a prefetching approach,” Comput. Statist. Data Anal., vol. 54,

no. 11, pp. 2814–2835, Nov. 2010.

[25] G. Altekar, S. Dwarkadas, J. P. Huelsenbeck, and F. Ronquist, “Parallel

Metropolis coupled Markov chain Monte Carlo for Bayesian phyloge-

netic inference,” Bioinformatics, vol. 20, no. 3, pp. 407–415, Feb. 2004.

[26] R. V. Craiu, J. Rosenthal, and C. Yang, “Learn From Thy Neighbor:

Parallel-Chain and Regional Adaptive MCMC,” J. Amer. Statistical

Assoc., vol. 104, no. 488, pp. 1454–1466, Dec. 2009.

[27] A. Solonen, P. Ollinaho, M. Laine, H. Haario, J. Tamminen, and H. Järvinen, “Efficient MCMC for Climate Model Parameter Estimation: Parallel Adaptive Chains and Early Rejection,” Bayesian Anal., vol. 7, no. 3, pp. 715–736, Sep. 2012.

[28] B. Calderhead, “A general construction for parallelizing Metropolis-

Hastings algorithms,” Proc. Nat. Acad. Sci., vol. 111, no. 49, pp. 17 408–

17 413, Dec. 2014.

[29] S. Yang, Y. Chen, E. Bernton, and J. S. Liu, “On parallelizable Markov

chain Monte Carlo algorithms with waste-recycling,” Stat Comput,

vol. 28, no. 5, pp. 1073–1081, Sep. 2018.

[30] W. Neiswanger, C. Wang, and E. P. Xing, “Asymptotically exact,

embarrassingly parallel MCMC,” in Proc. 30th Conf. Uncertainty Artif.

Intell., ser. UAI’14. Arlington, Virginia, USA: AUAI Press, 2014, pp.

623–632.

[31] S. L. Scott, A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George,

and R. E. McCulloch, “Bayes and big data: The consensus monte carlo

algorithm,” Int. J. Manag. Sci. Eng. Manag., vol. 11, pp. 78–88, 2016.

[32] S. Srivastava, C. Li, and D. B. Dunson, “Scalable bayes via barycenter

in wasserstein space,” J. Mach. Learn. Res., vol. 19, no. 8, pp. 1–35,

2018.

[33] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, ser.

Springer Texts in Statistics. New York, NY: Springer New York, 2004.

[34] W. K. Hastings, “Monte Carlo sampling methods using Markov chains

and their applications,” Biometrika, vol. 57, no. 1, pp. 97–109, Apr.

1970.

[35] C. J. Geyer, “Introduction to markov chain monte carlo,” in Handbook

of Markov Chain Monte Carlo. CRC Press, 2011, pp. 3–48.

[36] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, “An Introduction

to MCMC for Machine Learning,” Mach. Learn., vol. 50, no. 1/2, pp.

5–43, 2003.

[37] M. Betancourt, “A Conceptual Introduction to Hamiltonian Monte

Carlo,” arXiv:1701.02434 [stat], Jan. 2017.

[38] M. D. Hoffman and A. Gelman, “The no-u-turn sampler: Adaptively

setting path lengths in Hamiltonian Monte Carlo,” J. Mach. Learn. Res.,

vol. 15, no. 47, pp. 1593–1623, 2014.

[39] C. J. Geyer, “Practical Markov Chain Monte Carlo,” Statist. Sci., vol. 7,

no. 4, pp. 473–483, Nov. 1992.

[40] D. Wilkinson, “Parallel Bayesian Computation,” in Handbook of Par-

allel Computing and Statistics (Statistics, Textbooks and Monographs).

Chapman & Hall/CRC, 2005.

[41] V. Gopal and G. Casella, “Running regenerative markov chains in

parallel,” unpublished, 2011.

[42] L. Murray, “Distributed markov chain monte carlo,” in Proc. Neural Inf.

Process. Syst. Workshop Learn. Cores, Clusters Clouds, vol. 11, 2010.

[43] Y. Emma Wang, Y. Zhu, G. G. Ko, B. Reagen, G.-Y. Wei, and D. Brooks,

“Demystifying Bayesian Inference Workloads,” in IEEE Int. Symp.

Perform. Anal. Syst. Softw., ser. ISPASS’19. Madison, WI, USA: IEEE,

Mar. 2019, pp. 177–189.

[44] J. Piironen and A. Vehtari, “Sparsity information and regularization in

the horseshoe and other shrinkage priors,” Electron. J. Statist., vol. 11,

no. 2, pp. 5018–5051, 2017.

[45] M. Betancourt, “Bayes Sparse Regression,” Mar. 2018.

[46] Imperial College COVID-19 Response Team et al., “Estimating the

effects of non-pharmaceutical interventions on COVID-19 in Europe,”

Nature, Jun. 2020.

[47] Stan Development Team, “Stan modeling language users guide and

reference manual, version 2.23.0,” 2020.

[48] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J.

Mach. Learn. Res., vol. 3, no. null, pp. 993–1022, Mar. 2003.

[49] L. Egidi, F. Pauli, and N. Torelli, “Are Shots Predictive Of Soccer

Results?” in StanCon 2018. Zenodo, Aug. 2018.

[50] S. Kim, N. Shepherd, and S. Chib, “Stochastic Volatility: Likelihood

Inference and Comparison with ARCH Models,” Rev. Econ. Stud.,

vol. 65, no. 3, pp. 361–393, Jul. 1998.

[51] R. M. Neal, “Probabilistic inference using markov chain monte carlo

methods,” University of Toronto, Tech. Rep. CRG-TR-93-1, Sep. 1993.

[52] J. Li et al., “Feature selection: A data perspective,” ACM Comput.

Surveys, vol. 50, no. 6, p. 94, 2018.

[53] A. Gelman and K. Shirley, “Inference from simulations and monitoring

convergence,” in Handbook of Markov Chain Monte Carlo. CRC Press,

2011, pp. 163–174.

[54] D. Rudolf, “Error bounds for computing the expectation by Markov

chain Monte Carlo,” Monte Carlo Methods Appl., vol. 16, no. 3-4, Jan.

2010.

[55] G. S. Fishman, Discrete-Event Simulation. New York, NY: Springer

New York, 2001.

[56] J. G. Propp and D. B. Wilson, “Exact sampling with coupled markov

chains and applications to statistical mechanics,” Random Struct Algo-

rithms, vol. 9, no. 1–2, pp. 223–252, Aug. 1996.

[57] T. K. Papp, JackRab, D. Aluthge, J. TagBot, and M. Piibeleht, “Tpapp/DynamicHMC.jl: V2.1.6,” Zenodo, Aug. 2020.

[58] K. Xu, H. Ge, W. Tebbutt, M. Tarek, M. Trapp, and Z. Ghahramani,

“AdvancedHMC.jl: A robust, modular and efﬁcient implementation of

advanced HMC algorithms,” in Proc. 2nd Symp. Adv. Approx. Bayesian

Inference, ser. AABI’19, vol. 118. PMLR, Dec. 2020, pp. 1–10.

[59] R. M. Neal et al., “MCMC using Hamiltonian dynamics,” Handb.

Markov Chain Monte Carlo, vol. 2, no. 11, p. 2, 2011.

[60] H. Haario, E. Saksman, and J. Tamminen, “An adaptive metropolis

algorithm,” Bernoulli, vol. 7, no. 2, pp. 223–242, Apr. 2001.

[61] R. M. Neal, “An Improved Acceptance Procedure for the Hybrid Monte

Carlo Algorithm,” J. Comput. Phys., vol. 111, no. 1, pp. 194–203, Mar.

1994.

[62] R. M. Neal et al., “MCMC using Hamiltonian dynamics,” Handb.

Markov Chain Monte Carlo, vol. 2, no. 11, p. 2, 2011.

[63] A. Gelman, J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin,

Bayesian Data Analysis, 3rd ed., ser. Chapman & Hall/CRC Texts in

Statistical Science. Boca Raton: CRC Press, 2014.

[64] A. Gelman and D. B. Rubin, “Inference from Iterative Simulation Using

Multiple Sequences,” Statist. Sci., vol. 7, no. 4, pp. 457–472, Nov. 1992.

[65] X. Feng, D. A. Buell, J. R. Rose, and P. J. Waddell, “Parallel algorithms

for Bayesian phylogenetic inference,” J. Parallel Distrib. Comput.,

vol. 63, no. 7-8, pp. 707–718, Jul. 2003.

[66] P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo

samplers,” J. Roy. Statist. Soc.: B, vol. 68, no. 3, pp. 411–436, Jun.

2006.

[67] N. Chopin, “A sequential particle ﬁlter method for static models,”

Biometrika, vol. 89, no. 3, pp. 539–552, Aug. 2002.

[68] P. Del Moral, Feynman-Kac Formulae, ser. Probability and Its Applica-

tions. New York, NY: Springer New York, 2004.

[69] A. Beskos, A. Jasra, N. Kantas, and A. Thiery, “On the convergence of

adaptive sequential Monte Carlo methods,” Ann. Appl. Probab., vol. 26,

no. 2, pp. 1111–1146, Apr. 2016.

[70] A. Varsi, L. Kekempanos, J. Thiyagalingam, and S. Maskell, “A Single

SMC Sampler on MPI that Outperforms a Single MCMC Sampler,”

arXiv:1905.10252 [cs, stat], May 2019.

[71] B. Nemeth, T. Haber, J. Liesenborgs, and W. Lamotte, “Relaxing

Scalability Limits with Speculative Parallelism in Sequential Monte

Carlo,” in IEEE Int. Conf. Cluster Comput., ser. CLUSTER’18. Belfast:

IEEE, Sep. 2018, pp. 494–503.