Evaluating the Strong Scalability of
Parallel Markov-Chain Monte Carlo Algorithms
Electronics Engineering Dept., Seoul, Republic of Korea
Electrical Engineering and Electronics Dept., University of Liverpool
Computer Science and Engineering Dept., Seoul, Republic of Korea
Abstract—Markov-chain Monte Carlo (MCMC) is a popular
method for performing asymptotically exact Bayesian infer-
ence. However, the acceleration of MCMC by parallelizing its
computation is a significant challenge. Despite the numerous
algorithms developed for parallelizing MCMC, a fundamental
question remains unanswered. If parallel MCMC is the answer,
what is the question? To answer this fundamental question,
we first characterize the scaling of parallel MCMC algorithms
into weak and strong scaling. Then, we discuss the outcomes
they deliver. Under these terminologies, most previous works
comparing different parallel MCMC methods fall into the weak
scaling territory. For this reason, we focus on assessing the
strong scalability of previously proposed parallelization schemes
of MCMC, both conceptually and empirically. Also, considering
the popularity and importance of probabilistic programming
languages, we focus on algorithms that are applicable to a
wide variety of statistical models. First, we evaluate the strong
scalability of methods based on parallelizing a single MCMC
chain. Second, for methods based on multiple MCMC chains,
we develop a simple expression for estimating the speedup and
empirically evaluate their strong scalability. Our results show that
previously proposed methods for parallelizing MCMC algorithms
achieve limited strong scalability. We conclude by providing
future directions for parallelizing Bayesian inference.
Index Terms—Bayesian Inference, Parallel Computing,
Markov-chain Monte Carlo
I. INTRODUCTION

Bayesian statistics has recently seen surging popularity. Until recently, the high computational cost of Bayesian inference methods prohibited the wide adoption of Bayesian statistics. Fortunately, with the advances in our computational capabilities, many of the Bayesian statistical models of our interest are now within arm's reach. Also, abstractions such as probabilistic programming languages (PPLs) have enabled statisticians to enjoy modern computational resources with reduced complexity and enhanced portability.
Meanwhile, with the ever-increasing amount of data and com-
plexity of our statistical models, the need for high-performance
parallel Bayesian inference is gaining importance.
Currently, Markov-chain Monte Carlo (MCMC) is the de-
facto standard method for performing asymptotically exact
Bayesian inference. MCMC algorithms are generally appli-
cable as they assume very little about the target statistical
model. Notably, variants of the Hamiltonian Monte Carlo sampler (HMC) have been widely adopted as the default sampling strategy in PPLs such as Stan, PyMC3, and Turing.jl. For PPLs, where the user's model can be complex, HMC has been shown, both theoretically and empirically, to be efficient and robust. However, MCMC algorithms are computationally expensive, requiring many thousands of evaluations of the likelihood (and, in the case of HMC, its gradient).

Because of the high cost of executing MCMC algorithms, it is natural to seek acceleration by utilizing modern high-performance computing resources. Yet, on most modern computational hardware, this can only be achieved by parallelization. Since the clock speed of processors started stagnating, the necessity of parallelizing MCMC has kept increasing over time. Also, the emergence of massively parallel, programmable, specialized hardware such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) has opened immense benefits for parallelization. Unfortunately, the fundamentally sequential nature of MCMC algorithms makes them challenging to scale. Moreover, being generally applicable enough to support PPLs adds multiple design constraints: methods that exploit the parallelism inherent in the statistical model are not generically applicable. For these reasons, developing parallel MCMC algorithms robust and automatic enough to be applied to PPLs is a vital but challenging objective.
If parallelizing MCMC is the answer, what is the question?
Until now, various strategies for accelerating MCMC by
parallelization have been proposed. However, a fundamental
discussion on what the goal of parallelization is (in the
perspective of parallel computing) has yet to be presented. For example, are we trying to achieve weak scaling or strong scaling? Or, more fundamentally, what do strong and weak scaling even mean in the context of Bayesian inference? What are the previously proposed methods actually achieving?
This study is our attempt to address these questions with a focus on fulfilling the constraints of PPLs.
In this paper, we first characterize the acceleration of Bayesian inference into strong and weak scaling. The weak scaling approach reduces the estimation error as we increase the amount of computational resources. In contrast, the strong scaling approach tries to reduce the time for reaching a specific target estimation error. As weak scaling and strong scaling deliver sharply different outcomes, we start with a broader discussion comparing the outcomes of the two approaches. Since the estimation error of MCMC can only be reduced at a rate of O(1/√N) (where N is proportional to the amount of computation), the weak scaling approach is susceptible to diminishing returns. Also, because of the recent trend of using the effective sample size per unit time metric (ESS/second), performance comparisons have been biased towards weak scaling. As a result, proper comparisons of the strong scalability of different parallelization methods have been left behind. Thus, we dedicate the remainder of our paper to evaluating the strong scalability of MCMC parallelization strategies.
Previously proposed strategies for parallelizing MCMC (and achieving strong scalability) can be classified into three types: intrachain parallelism, interchain parallelism, and data parallelism. Intrachain parallelism seeks speedup by parallelizing the computation of a single MCMC chain. Interchain parallelism computes multiple MCMC chains and utilizes the embarrassingly parallel nature of the multiple chains. On the other hand, data parallelism (mostly consisting of consensus MCMC methods) can be seen as combining both: splitting the data into multiple subsets and operating MCMC chains on each subset. Since the applicability of data parallelism is limited to certain types of statistical models (models that are inherently data parallel), we will restrict our discussion to intra- and interchain parallelism.
We empirically investigate the potential strong scalability of intrachain parallelism by computing the Amdahl speedup limit on multiple realistic Bayesian models. For interchain parallelism, we develop a simple formula for estimating the speedup given the autocorrelation and the amount of burn-in samples. Then, we empirically evaluate the strong scaling of previously proposed interchain parallelization methods. Our results suggest that strategies for parallelizing MCMC, whether interchain or intrachain, struggle to achieve strong scaling close to linear. Lastly, we conclude our work by providing future directions for overcoming the limits of MCMC.
To summarize, the key insights of this paper are as follows:
∙ We characterize the acceleration of MCMC algorithms into strong and weak scaling (Section III-A).
∙ We argue that the outcomes of strong scalability are more desirable than those of weak scalability, even though the analyses of previous works have largely focused on weak scalability (Section III-A).
∙ We experimentally assess the strong scalability of parallelizing a single MCMC chain (intrachain parallelization) (Section III-B).
∙ We develop a simple expression for evaluating the strong scalability of multiple-MCMC-chain parallelization (interchain parallelization), and experimentally evaluate the strong scalability of previously proposed interchain parallelization strategies (Section III-C).
II. BACKGROUND

A. Bayesian Inference
The goal of Bayesian inference is to obtain the distribution
of a parameter of interest. First, an observable quantity, or dataset (denoted as 𝒟), is given. By assuming that the data is generated from an unobserved variable θ according to a data generation process, the likelihood of observing 𝒟 can be computed as p(𝒟 | θ). Bayesian inference focuses on inferring the inverse probability p(θ | 𝒟), called the posterior distribution, by setting a prior distribution p(θ) on θ and invoking Bayes' rule such as

    p(θ | 𝒟) = p(𝒟 | θ) p(θ) / p(𝒟).   (1)

Using the posterior distribution, it is possible to marginalize the parameters of an arbitrary function f such as

    F = ∫ f(θ) p(θ | 𝒟) dθ = (1/p(𝒟)) ∫ f(θ) p(𝒟 | θ) p(θ) dθ.   (2)

Typical use-cases of marginalization include obtaining summary statistics of θ, or performing predictions based on data. Unfortunately, p(𝒟) in Equation (2), called the evidence, is often intractable unless we restrict ourselves to off-the-shelf conjugate probability distributions. To perform Bayesian analysis on more interesting and complex models, we instead obtain a finite set of samples describing p(θ | 𝒟). This is done by sampling from the joint distribution π(θ) = p(𝒟 | θ) p(θ), which is proportional to the posterior distribution up to a normalizing constant (the evidence) such that p(θ | 𝒟) ∝ π(θ). Using these N samples, we can approximate the marginal F using the Monte Carlo approximation

    F̄_N = (1/N) Σ_{i=1}^{N} f(θ_i) ≈ F.   (3)
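The Monte Carlo approximation above can be sketched in a few lines of Python; here `posterior_samples` is a hypothetical stand-in for draws that would, in practice, come from an MCMC sampler targeting the posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for posterior draws of theta; a real application
# would obtain these from an MCMC sampler targeting p(theta | D).
posterior_samples = rng.normal(loc=2.0, scale=0.5, size=4000)

def monte_carlo_marginal(f, samples):
    """Approximate F = E[f(theta)] by the sample average of f over the draws."""
    return float(np.mean([f(t) for t in samples]))

# Typical marginal quantities: a posterior summary statistic and a moment.
post_mean = monte_carlo_marginal(lambda t: t, posterior_samples)
post_second_moment = monte_carlo_marginal(lambda t: t ** 2, posterior_samples)
```

Any function f of the parameter can be marginalized this way, which is what makes a bag of posterior samples such a flexible representation.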
B. Sampling from the Posterior
Sampling proportionally to the posterior is a significant challenge by itself. Currently adopted methods can be classified into two types: asymptotically exact methods and approximate methods. For asymptotically exact sampling, Markov-chain Monte Carlo (MCMC) is the most widely used.
1) Markov-chain Monte Carlo: The basic idea of MCMC is to construct a Markov-chain along the samples with a Markov-chain transition operator, or kernel, K(θ, θ′). By conditioning the current sample θ_i on the previous state of the Markov-chain θ_{i−1}, it is possible to cancel out the normalization constant and sample proportionally to the posterior. If K(θ, θ′) admits the posterior distribution p(θ | 𝒟) as its invariant measure such as

    p(θ′ | 𝒟) = ∫ K(θ, θ′) p(θ | 𝒟) dθ,   (4)

and some additional assumptions hold, the states of the Markov-chain θ_i form an asymptotically unbiased estimator.

Theorem 1 (Thm 4.7.7): If the Markov-chain θ_1, …, θ_i is aperiodic, irreducible, and reversible with invariant distribution
Algorithm 1: Markov-Chain Monte Carlo
for t ∈ [1, N_burn + N] do
    θ* ∼ q(· | θ_{t−1})   (generate proposal)
    α = min{1, [π(θ*) q(θ_{t−1} | θ*)] / [π(θ_{t−1}) q(θ* | θ_{t−1})]}
    u ∼ Uniform(0, 1)
    θ_t = θ*,     if u < α   (accept proposal)
    θ_t = θ_{t−1}, otherwise (reject proposal)
return θ_{N_burn}, …, θ_{N+N_burn}
π, the Central Limit Theorem applies when 0 < σ² < +∞, such that

    √N (F̄_N − E_π[f(θ)]) ⇝ 𝒩(0, σ²),
    σ² = V_π[f(θ_1)] + 2 Σ_{k=2}^{∞} cov_π(f(θ_1), f(θ_k)),   (7)

where N is the number of Markov-chain states, E[·] denotes the expectation, V[·] denotes the variance, and cov(·, ·) denotes the covariance.
2) MCMC Algorithms: The most basic form of a Markov-chain operator satisfying the conditions of Theorem 1 is the Metropolis-Hastings method described in Algorithm 1. First, a proposal is generated from an arbitrary distribution q(· | θ_{t−1}). Then, the proposal is either accepted or rejected based on its acceptance probability α. Each accept-reject decision forms a state of the Markov chain. This is repeated N + N_burn times, where only the last N samples are used for estimation. For a detailed introduction to MCMC algorithms, see the standard references.
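As a concrete, runnable illustration of Algorithm 1, the following Python sketch implements random-walk Metropolis-Hastings on a toy one-dimensional target; the Gaussian proposal, step size, and target are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_samples, n_burn, step=2.4, seed=1):
    """Minimal random-walk Metropolis-Hastings in the spirit of Algorithm 1."""
    rng = np.random.default_rng(seed)
    theta = theta0
    chain = []
    for t in range(n_samples + n_burn):
        proposal = theta + step * rng.normal()            # theta* ~ q(. | theta_{t-1})
        log_alpha = log_target(proposal) - log_target(theta)  # symmetric q cancels
        if np.log(rng.uniform()) < log_alpha:             # u < alpha: accept proposal
            theta = proposal                              # otherwise keep theta_{t-1}
        chain.append(theta)
    return np.array(chain[n_burn:])                       # discard the burn-in samples

# Toy target: a standard normal, passed as an *unnormalized* log-density so
# that the cancellation of the normalizing constant is visible.
samples = metropolis_hastings(lambda x: -0.5 * x ** 2, theta0=0.0,
                              n_samples=20000, n_burn=2000)
```

Because the proposal here is symmetric, the q terms cancel in the acceptance ratio, and only the unnormalized log-density π(θ) of the target is needed.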
Currently, variants of Hamiltonian Monte Carlo (HMC) have been empirically shown to be the most efficient, robust, and generically applicable. Specifically, the No-U-Turn Sampler (NUTS), an automatic-tuning variant of HMC, has been widely adopted. In HMC samplers, a proposal is generated by simulating Hamiltonian dynamics with the potential energy given by −log π(θ). For a detailed introduction to HMC samplers, see the standard references.
3) Burn-in and adaptation: As depicted in Algorithm 1, only the last N samples are used for inference. The first N_burn samples, often called burn-in or warmup samples, are discarded. Similarly, the interval i ∈ [0, N_burn] is called the burn-in period. The reason for discarding burn-in samples is to ensure that the Markov-chain reaches the stationary region of the posterior (which some have previously called the typical set). While the necessity of burn-in has been questioned before, it is a standard practice. Also, recent MCMC algorithms such as NUTS utilize the burn-in samples for adapting the hyperparameters of K(·, ·) (for example, by using the Nesterov dual-averaging procedure).
III. EVALUATING THE STRONG SCALABILITY OF PARALLEL MARKOV-CHAIN MONTE CARLO
A. Defining Acceleration of Bayesian Inference

Accelerating Bayesian inference has been a central goal for advancing the ideals of Bayesian methodologies. However, what does accelerating Bayesian inference truly mean? First, we focus on the goal of Bayesian inference, which is to estimate the variables in question with the least amount of error.

1) Estimation error: The estimation error is defined by the asymptotic root mean squared error (RMSE), which can be derived from (7) as

    RMSE(F̄_N) = σ √(τ / N) = σ / √N_eff,   (8)

where σ² = V_π[f(θ_1)] and N_eff is the effective sample size (ESS). The ESS is defined as

    N_eff = N / (1 + 2 Σ_{k=1}^{∞} ρ_k) = N / τ,   (9)

where ρ_k is the autocorrelation of the Markov-chain at lag k, and N is the number of samples. Here, τ quantifies the statistical performance of a Markov-chain transition kernel K(θ, θ′). The goal of MCMC sampling is to estimate the quantity E[f] with a low RMSE by either increasing N or decreasing τ. Note that for some MCMC-based methods, τ is not necessarily computed using the autocorrelation. For this reason, we will call τ the variance inflation factor.
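A crude empirical estimate of the variance inflation factor τ = 1 + 2 Σ_k ρ_k can be sketched in Python as follows; the truncate-at-first-negative-autocorrelation rule used here is one common heuristic, not the estimator used in the paper:

```python
import numpy as np

def variance_inflation_factor(chain, max_lag=200):
    """Estimate tau = 1 + 2 * sum_k rho_k from empirical autocorrelations.

    Lags are summed until the autocorrelation first dips below zero --
    a common truncation heuristic for this otherwise noisy sum.
    """
    x = np.asarray(chain) - np.mean(chain)
    var = np.mean(x * x)
    tau = 1.0
    for k in range(1, min(max_lag, len(x) - 1)):
        rho_k = np.mean(x[:-k] * x[k:]) / var
        if rho_k < 0.0:
            break
        tau += 2.0 * rho_k
    return tau

# An AR(1) chain with coefficient phi has rho_k = phi**k, so the true
# inflation factor is tau = (1 + phi) / (1 - phi); for phi = 0.5 that is 3.
rng = np.random.default_rng(0)
phi, n = 0.5, 200_000
chain = np.empty(n)
chain[0] = 0.0
noise = rng.normal(size=n)
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + noise[t]

tau_hat = variance_inflation_factor(chain)
n_eff = n / tau_hat  # effective sample size N_eff = N / tau
```

The estimated τ̂ should land near the analytic value of 3, shrinking the 200,000 raw samples to roughly a third as many effective samples.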
2) The objective of parallelization: In general, the goal of parallelization is either to increase the amount of computation done in a fixed interval or to decrease the time needed to perform a fixed amount of computation. The former is known as weak scalability, while the latter is known as strong scalability. In MCMC, the amount of time spent on computation is roughly proportional to the number of samples N. While this is not perfectly true because of algorithmic variation in samplers such as NUTS, we will stick with this assumption throughout this paper for simplicity. With that said, for MCMC, weak scaling means reducing the RMSE by increasing resources, while strong scaling means decreasing the time until achieving a fixed target RMSE by increasing resources. Now, the acceleration of Bayesian inference can be characterized into two categories: weak scaling and strong scaling.
3) What we were achieving so far: Until now, weak scaling has been the unspoken goal of parallelizing MCMC. While some discussions regarding strong scalability exist, the ESS per unit time (ESS/second) metric has recently seen dominant use for comparing different algorithms. For instance, [4, p. 10] states that “. . . effective sample size per second (or its inverse) is the most relevant statistic for comparing the efficiency of sampler implementations”. Since a higher ESS/second value signifies that a lower error can be achieved during the same unit time, it is a typical weak scalability metric. For this reason, we first question whether the path of weak scalability is leading us to where we truly want to be. To answer this, we first look into the outcomes of weak and strong scaling.
4) What weak and strong scaling deliver: With weak scalability, we can achieve a lower error by increasing our computational resources. Since the ultimate goal of inference is to achieve a low RMSE of the target estimator, weak scaling might sound attractive at first. However, it is questionable whether we genuinely need arbitrarily low error. In practice, achieving an “acceptable level” of RMSE faster is more desirable than merely achieving arbitrarily low RMSE. Also, according to Equation (8), the estimation error can only be reduced in the order of O(1/√N). This slow rate of reduction causes an issue of diminishing returns [19, pp. 112-113]:

    Unfortunately, even if this lofty parallel speedup goal is achieved, the asymptotic picture for MCMC is dim: in the asymptotic regime, doubling the number of samples collected can only reduce the Monte Carlo standard error by a factor of √2. This scaling means that there are diminishing returns to purchasing additional computational resources, even if those resources provide linear speedup in terms of accelerating the execution of the MCMC algorithm.
Again, if achieving an “acceptable level” of RMSE is the
ultimate goal, it is not only the amount of error we can reduce
that diminishes but the utility of such reduction that also
vanishes. In sharp contrast, strong scaling actually reduces the
time spent for inference. Hence, Bayesian inference is more
often a strong scaling problem rather than a weak scaling
problem. What we want is to achieve an acceptable amount
of error faster, rather than arbitrarily low error.
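The diminishing returns are easy to see numerically. The sketch below plugs hypothetical values of σ and τ into the RMSE ≈ σ√(τ/N) relation: doubling N always buys only a √2 reduction in error, no matter how large N already is:

```python
import math

sigma, tau = 1.0, 4.0  # hypothetical values for illustration

def rmse(n):
    """Asymptotic RMSE of an MCMC estimator computed from n samples."""
    return sigma * math.sqrt(tau / n)

# The error ratio between N and 2N samples is always sqrt(2) ~ 1.41:
# the 8000-sample run improves on the 4000-sample run no more (in relative
# terms) than the 2000-sample run improved on the 1000-sample run.
ratios = [rmse(n) / rmse(2 * n) for n in (1000, 2000, 4000, 8000)]
```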
5) A case for strong scalability: Weak scalability does not reduce the time needed for Bayesian inference. Instead, it enables us to perform inference more accurately, which definitely has its place. However, as discussed, weak scaling has a problem of diminishing returns. Moreover, the popularity of the ESS/second metric has resulted in a bias towards weak scalability. For these reasons, the strong scalability of currently known parallelization strategies is not well understood. Thus, we dedicate the remainder of this paper to evaluating the strong scalability of parallel MCMC methods.
B. Evaluating the Strong Scalability of Intrachain Parallelism
From now on, we will evaluate the strong scalability of
intrachain and interchain approaches to parallelizing MCMC.
Intrachain parallelism accelerates inference by accelerating the execution of a single MCMC chain. Intrachain parallelization approaches span parallelizing the evaluation of the likelihood, using the parallel delayed rejection algorithm, or using prefetching.
1) Parallel delayed rejection and prefetching: Parallel delayed rejection is a parallel realization of the delayed rejection algorithm. In parallel delayed rejection, multiple proposals are generated in parallel, while accept-reject decisions are made sequentially for each proposal until one is accepted. Once a proposal is accepted, all other proposals are discarded. For this reason, parallel delayed rejection wastes a lot of computation, bounding the speedup sublinearly. Meanwhile, prefetching methods extract parallelism by speculatively simulating multiple iterations of the Markov-chain asynchronously. Similarly to delayed rejection, these approaches achieve only logarithmic speedup, as they waste a lot of computation as soon as an accept-reject decision is sealed.
2) Likelihood parallelization: On the other hand, likelihood parallelization aims to accelerate MCMC by parallelizing the computation of the likelihood. Since the likelihood is the most computationally expensive part of Bayesian inference, likelihood parallelization looks promising. Indeed, Stan explicitly supports this model of parallelization through features such as map_rect (from version 2.18.0) and reduce_sum (from version 2.23). However, how much does likelihood parallelization deliver in practice? From now on, we will empirically evaluate the gains of parallelizing the likelihood by computing the Amdahl speedup limit on a diverse set of Bayesian models. Precisely, we measure the time spent computing the likelihood in Bayesian models based on the Stan PPL.
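The Amdahl limit used in this evaluation follows directly from the measured proportion p. A small Python sketch, with illustrative values of p spanning the range later reported for the benchmarks:

```python
# Amdahl's law: with parallelizable fraction p (here, the likelihood's share
# of execution time), the speedup on n_procs processors is
#   S(P) = 1 / ((1 - p) + p / P),
# which approaches the limit 1 / (1 - p) as P grows without bound.
def amdahl_speedup(p, n_procs):
    return 1.0 / ((1.0 - p) + p / n_procs)

fractions = (0.85, 0.95, 0.99)                      # illustrative values of p
limits = {p: 1.0 / (1.0 - p) for p in fractions}    # infinite-resource ceiling
speedup_64 = {p: amdahl_speedup(p, 64) for p in fractions}
```

Even p = 0.99 caps the achievable speedup at 100×, and on 64 processors yields well under 64×; at p = 0.85 the ceiling is below 7× regardless of the hardware.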
3) Experimental settings: We first forked¹ and modified the Stan runtime in order to measure the execution time of the likelihood and its gradient. We draw 4000 samples using Stan's implementation of the NUTS MCMC algorithm after discarding 4000 samples for burn-in. We use the default configuration of CmdStan v2.23.0. The server we use for the experiments runs Linux 4.15 on Intel Xeon E7-8870 v2 processors and has 770 GB of RAM. The processor frequency was fixed to 2.7 GHz for all experiments using the cpupower frequency-set command. All of the experiments are repeated 32 times.
4) Analysis methodology: Following the Bayesian theme of this paper, we employ a Bayesian analysis methodology for the results. First, for each of the 32 repetitions, we estimate the execution time proportion p = T_likelihood / T_total from the execution time of the likelihood (T_likelihood) and the total execution time (T_total). We assume the data is generated from the following beta-regression model:

    μ ∼ Uniform(0, 1)
    φ ∼ Inverse-Gamma(·, ·)
    p_i ∼ Beta(α, β), where α = μφ, β = (1 − μ)φ,

and μ and φ are the mean and precision of p. We set an uninformative uniform prior on μ and an inverse-gamma prior on φ. We use the Turing.jl PPL to describe the aforementioned model, and draw 4000 samples using NUTS after discarding 4000 samples for burn-in.
¹Forked from the repository https://github.com/stan-dev/stan on July 13, 2020.
TABLE I: Execution Time Proportion (p) of the Likelihood

Name       Characterization²                #Datapoints  #Parameters  Exec. Time  Estimated p¹
ad         Logistic Regression              354          7000         14167       0.994 (0.994, 0.995)
butterfly  Hierarchical Bayesian            3            28           212         0.974 (0.973, 0.974)
cancer     Sparse Logistic Regression       1434         102          17902       0.947 (0.947, 0.947)
covid      Hierarchical Bayesian            87           75           5808        0.990 (0.990, 0.990)
disease    Logistic Regression              345          10           6115        0.986 (0.986, 0.986)
lda        Mixed-membership Model           1224         7737         5900        0.990 (0.990, 0.990)
racial     Hierarchical Bayesian            5            300          2212        0.911 (0.909, 0.912)
soccer     Hierarchical Bayesian            331          3040         33148       0.958 (0.957, 0.958)
stock      Stochastic Volatility            1006         5030         5033        0.853 (0.852, 0.854)
votes      Hierarchical Gaussian Processes  11           550          3279        0.852 (0.851, 0.854)

¹ We report the posterior median along with the 80% credible interval in parentheses.
² The characterization is based on the corresponding references.
Fig. 1: The Amdahl limit estimated from the posterior samples of p. The error bars are the 80% credible intervals.
5) Benchmark models: We chose models from the BayesSuite benchmark (ad, butterfly, disease, racial, votes), the Stan User's Guide (lda, stock), and other independent sources (soccer, covid, cancer). The considered workloads are organized in Table I with their descriptions and average execution times. For the stock benchmark, we use the S&P 500 daily closing price data from the year 2000 to 2020. Also, for the cancer benchmark, we use the prostate cancer (Prostate_GE) dataset. Lastly, for the covid, lda, and soccer benchmarks, we use the datasets from their respective original sources.
6) Results: The summaries of the estimated posterior of p are shown on the right of Table I. Within the considered models, the proportion of the likelihood computation time (p) is larger than 99% only for ad, covid, and lda. Also, for stock and votes, the proportion is only around 85%. The estimated Amdahl limits, which are the maximum theoretical speedups we can achieve, are shown in Fig. 1. The results show that the likelihoods of the considered statistical models are generally not dominant enough to achieve significant speedup.
We performed various analyses on the results to see if 𝑝
is related to any of the problem’s characteristics, such as the
number of datapoints or the number of parameters. However,
we failed to find any significant correlation. This suggests that the efficiency of likelihood parallelization results from complex interactions between multiple factors spread across the PPL ecosystem.
7) Discussion: It is important to note that the estimates in Fig. 1 are optimistic estimates assuming infinite computational resources. In practice, the speedup gains would be much smaller. Also, in general, the amount of parallelism in statistical models varies greatly. While simple logistic regression models with very tall datasets (datasets with many data points) have significant parallelism, more complex hierarchical Bayesian models do not have as much. Moreover, parallelizing the likelihood is a burden that must be carried by the user. For example, in Stan, utilizing constructs such as map_rect and reduce_sum requires a large number of user code changes. In the end, parallelizing the likelihood is by no means an automatic approach, and its performance gain is highly dependent on the model used. Since other methods such as parallel delayed rejection and prefetching also have fundamental limits, it is unclear whether we can achieve pain-free linear speedup with intrachain parallelization.
C. Evaluating the Strong Scalability of Interchain Parallelism
The interchain approach achieves speedup by executing multiple independent chains in parallel. Because of the seeming independence of the chains, this approach is often described as being embarrassingly parallel. However, in this section, we show that this appearance is deceptive with respect to the proper assessment of strong scalability. To properly assess the strong scalability of interchain parallelism, we will first theoretically quantify the speedup. Then, we will empirically evaluate the scalability of previously proposed interchain parallelism strategies.
1) Theoretical analysis: Our discussion starts with the asymptotic error rate of MCMC given by Equations (8) and (9). By rearranging the equations, we obtain the number of samples needed to achieve an error rate ε, N = τσ²/ε², given the variance of the estimated statistic (σ²) and the variance inflation factor (τ).

Assuming that generating a sample costs a constant execution time T, the amount of work W needed to achieve an error ε is

    W_seq = T (N + N_burn) = T (1 + b_seq) N   (10)
          = T (1 + b_seq) τ_seq σ² / ε²,   (11)

where N_burn is the number of burn-in samples and b is the relative proportion of burn-in samples such that N_burn = Nb. Normally, the burn-in ratio b is chosen to be quite high. For example, a common heuristic suggests using b = 1.0, where half of the samples are used for inference and half are used for burn-in. This ratio has also been supported by non-asymptotic error analysis results.
It is also important to note that, in practical conditions, reducing the absolute amount of burn-in results in an inferior K(·, ·) with a relatively higher autocorrelation. This is because recent MCMC algorithms use the burn-in samples for adapting the hyperparameters of K(·, ·). Moreover, since earlier samples of the Markov-chain often contain a lot of bias, shortening the burn-in period causes the adaptation to use only low-quality samples. For these reasons, we will roughly assume that reducing N_burn increases the variance inflation factor τ.
By executing multiple MCMC chains in parallel and combining the results, we can reduce the execution time by the number of computing units P. In this case, the amount of work done by each computing unit in parallel is

    W_par = T (N/P + N_burn,par) = T (1 + b_par) N / P   (12)
          = T (1 + b_par) τ_par σ² / (P ε²),   (13)

where b_par is the per-chain burn-in ratio. Equations (11) and (13) give us the execution times required for achieving an error ε by running MCMC sequentially and in parallel, respectively. The speedup S of running multiple chains in parallel is now

    S = W_seq / W_par = P τ_seq (1 + b_seq) / (τ_par (1 + b_par)).   (14)
Remark 1 (Amdahl's Law): If we hold N_burn constant, we retrieve Amdahl's law. This can be done by setting b_par = P b_seq, since the length of the individual chains is shortened by a factor of P. Assuming the statistical performance of the samplers is equal so that τ_seq = τ_par, the speedup is given as

    S = P (1 + b_seq) / (1 + P b_seq),   (15)

which is structurally equivalent to Amdahl's law. Thus, if b_seq > 0, then S < P.
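Remark 1 can be checked numerically. The sketch below evaluates the fixed-burn-in speedup S = P(1 + b_seq)/(1 + P·b_seq) for the common b = 1.0 heuristic; the saturation at (1 + b)/b = 2 is immediate:

```python
# Speedup when the absolute burn-in length is held fixed (Remark 1), assuming
# equal variance inflation factors (tau_seq = tau_par).
def speedup_fixed_burnin(n_procs, b_seq):
    return n_procs * (1.0 + b_seq) / (1.0 + n_procs * b_seq)

b = 1.0  # the "half the samples for burn-in" heuristic
speedups = {P: speedup_fixed_burnin(P, b) for P in (1, 2, 8, 64, 1024)}
# As P grows, the speedup saturates at (1 + b) / b = 2: even 1024 processors
# cannot make the run more than twice as fast.
```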
2) Limitation due to burn-in (N_burn): Since in practice it is almost always true that b_seq > 0, Remark 1 shows that fixing the number of samples used for burn-in is always inefficient unless N ≫ N_burn (similar to the asymptotic results of [55, Thm 6.4]). Because of the diminishing returns discussed in Section III-A, it would be inefficient to maintain N ≫ N_burn. Thus, in realistic settings, the effect of N_burn will not be negligible in terms of scalability.
3) Can we avoid burn-in?: Since holding the amount of burn-in constant fundamentally restricts strong scalability, methods for completely avoiding the need for burn-in have been proposed. For example, one proposal is to use perfect simulation in order to obtain initial points for the Markov-chain that are
Fig. 2: RMSE of estimating μ_10 from samples of 64 parallel Markov-chains where N_burn = 64. The dark red line shows the RMSE decreasing with the number of samples per chain; the target line is the target RMSE acquired by the single, longer MCMC chain (N = 2048, N_burn = 2048). The optimal line denotes the iteration where the theoretically optimal speedup is achieved. We estimated the RMSE by repeating the sampling process 2¹² times.
within the stationary region. Similarly, DynamicHMC.jl, an implementation of the NUTS algorithm supported by Turing.jl, performs maximum a-posteriori estimation to obtain an initial point close to the mode of the posterior. However, these approaches have a catch. In the name of reducing burn-in, these methods introduce additional computation, which decreases efficiency just as burn-in does. Also, recent MCMC algorithms need the burn-in period to perform hyperparameter adaptation.

Then, what about reducing N_burn according to the number of parallel computing units P?
Remark 2 (Reducing the amount of burn-in): By holding the burn-in ratio constant and allowing the autocorrelation to change, the number of samples generated by each chain is N/P, while the number of burn-in samples is N_burn/P. Then, the speedup is given as

    S = P τ_seq / τ_par.   (16)

Consequently, the performance directly depends on the statistical performance, or inflation factor, of the parallel Markov-chains.
The results of Remark 2 are actually quite promising. If we can keep the variance inflation factors τ_seq ≈ τ_par, then we can achieve near-linear speedup, S ≈ P. From now on, we will empirically evaluate the possibility of this direction.
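Under Remark 2, everything hinges on how much the shorter per-chain adaptation inflates τ_par relative to τ_seq. A small sketch with hypothetical inflation factors:

```python
# Speedup when the burn-in *ratio* is held fixed (Remark 2):
#   S = P * tau_seq / tau_par.
def speedup_fixed_ratio(n_procs, tau_seq, tau_par):
    return n_procs * tau_seq / tau_par

P = 64
# Hypothetical inflation factors chosen for illustration only.
ideal = speedup_fixed_ratio(P, tau_seq=4.0, tau_par=4.0)     # linear speedup
degraded = speedup_fixed_ratio(P, tau_seq=4.0, tau_par=5.7)  # worse adaptation
```

A roughly 40% inflation of τ_par already erodes about 30% of the ideal speedup, which is why the quality of adaptation dominates the empirical results that follow.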
4) Experimental settings: Neal's 10-D Gaussian: First, we perform experiments on a simple synthetic problem to evaluate the effect of shortening the chains and starting the chains within the stationary region. We run the NUTS sampler of AdvancedHMC.jl on a 10-dimensional “Neal's Gaussian”, which is a multivariate Gaussian distribution 𝒩(0, Σ) where Σ = Diagonal(0.01, 0.02, …, 1.0). Because of the varying covariance scale, Neal's Gaussian is appropriate for evaluating the effect of adapting K(·, ·). We first sample N = 4096, N_burn = 4096 samples with a single, long Markov-chain. Then, we execute 64 parallel chains, where each chain spends N_burn/P = 4096/64 = 64 samples for burn-in, and compare the RMSE against that of the single long chain. The RMSE is estimated by repeating the sampling process 2¹² times.
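The RMSE-by-repetition protocol can be sketched as follows. For clarity, the MCMC chains are replaced here by idealized i.i.d. draws from the target, and the sizes are illustrative; the real experiment uses NUTS chains:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma = 0.0, 1.0
n_chains, n_per_chain, n_reps = 64, 64, 1024  # illustrative sizes

# Each repetition pools 64 "chains" of 64 draws into a single mean estimate;
# the RMSE is the root of the average squared error across repetitions.
draws = rng.normal(true_mean, sigma, size=(n_reps, n_chains * n_per_chain))
estimates = draws.mean(axis=1)
rmse = float(np.sqrt(np.mean((estimates - true_mean) ** 2)))
# For i.i.d. draws the RMSE approaches sigma / sqrt(64 * 64) ~ 0.0156; real
# MCMC chains do worse by roughly a factor of sqrt(tau) due to autocorrelation.
```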
For the longer chain, we use the Stan adaptation rule, which alternates between adapting the preconditioner and the step-size of NUTS. For N_burn < 150, the burn-in phase is too short for using the Stan adaptation rule. In this case, we use the naive adaptation rule, which simply adapts the covariance and the step-size at the same time. We start each of the 64 chains from a random initial point sampled from 𝒩(1.0, 3.0²) or 𝒩(0.0, 0.1²). Since the former is much wider than the stationary region, it simulates the effect of sampling the initial point from the prior distribution, which is an effective heuristic in practice. Since strong contraction from the prior p(θ) to the posterior p(θ | 𝒟) is often expected, the initial distribution will often be much wider than the stationary region.
5) Results: Neal's 10-D Gaussian: The results for estimating the mean of the 10th parameter (μ_10) can be seen in Fig. 2. The x-axis is the number of samples drawn from each parallel MCMC chain, which is roughly proportional to the execution time. The red lines show the RMSE decreasing as we draw more samples. The target line is the RMSE achieved by the single, longer MCMC chain (N = 4096, N_burn = 4096). The RMSE was estimated by repeating the sampling process 2¹² times. The optimal line denotes the number of samples per chain at which perfect linear speedup would be achieved. Unfortunately, the parallel chains required many more samples to achieve the target RMSE. In all cases, the chains appeared to have properly converged (R̂ < 1.01). We can see that the strong scaling speedup achieved with θ_1 ∼ 𝒩(1.0, 3.0²) is about S = 45, while for θ_1 ∼ 𝒩(0.0, 0.1²) it is about S = 60. Note that this is not a perfectly fair comparison, as the cost of starting from the stationary region needs to be considered.
6) Does starting from the stationary region help?: Yes. Starting within the stationary region (𝜃₁ ∼ 𝒩(0.0, 0.1²)) achieved much better efficiency compared to starting away from it (𝜃₁ ∼ 𝒩(1.0, 3.0²)). Since the chains have converged
regardless of the initial point, it is the quality of adaptation that made the difference. By starting away from the stationary region, the samples used for adapting 𝐾(⋅,⋅) contain substantial bias, worsening the performance. These results suggest that not only the convergence but also the adaptation of the Markov-chains is critical to performance. This observation is especially crucial for HMC-based methods, as their performance is highly dependent on appropriate tuning of the kernel.
In the remainder of this section, we compare the performance of previously proposed approaches to interchain parallelization.
7) Considered parallel MCMC algorithms: We consider interchain adaptation (INCA) and the generalized Metropolis–Hastings algorithm (GMH). We implemented these algorithms on top of AdvancedHMC.jl using the Julia language. In INCA, the samples generated by all the parallel chains are used during adaptation. Then, instead of using only 𝑁burn∕𝑃 samples, 𝑁burn samples can be used for adaptation. While the original INCA was applied to the adaptive Metropolis algorithm (AM), we apply it to NUTS with the windowed acceptance scheme. With AM, INCA can be carried out not only during burn-in, but from beginning to end. However, for NUTS, we can only apply INCA during the burn-in period to preserve ergodicity. We use the Stan adaptation rule for 𝑁burn∕𝑃 ≥ 150, and the naive adaptation rule for 𝑁burn∕𝑃 < 150. We also include the non-INCA version of NUTS as a baseline.
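The key idea of INCA, sharing warm-up samples across chains when estimating the preconditioner, can be sketched as follows; the Gaussian stand-in draws and the plain covariance estimate are simplifications of the windowed NUTS adaptation used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
P, n_burn_per_chain, d = 64, 64, 10
# Toy burn-in draws from each chain (stand-ins for NUTS warm-up samples).
warmup = [rng.normal(size=(n_burn_per_chain, d)) for _ in range(P)]

# Without INCA: each chain estimates its preconditioner from its own
# N_burn/P samples only.
local_precond = [np.cov(w, rowvar=False) for w in warmup]

# With INCA: all P chains share their warm-up samples, so the dense
# mass matrix is estimated from the pooled N_burn samples.
pooled = np.vstack(warmup)            # shape (P * N_burn/P, d)
shared_precond = np.cov(pooled, rowvar=False)
print(shared_precond.shape)  # (10, 10)
```

The pooled estimate uses 𝑃 times more samples, which is exactly the advantage INCA is designed to provide when each chain's burn-in is short.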
The GMH algorithm, on the other hand, is very different in that it can be regarded as both an intrachain and an interchain approach. In each iteration, from a single state, GMH proposes 𝑁prop proposals in parallel. Then, it accepts 𝑁accept samples by resampling from the proposals. Finally, it selects a single sample from the proposals and uses it as the next state. Overall, GMH can be thought of as operating a single “guiding chain” from which 𝑁prop short parallel chains are initiated in every iteration. Although the “guiding chain” has been shown to achieve superior ESS, setting 𝑁prop > 𝑁accept increases the total amount of work (therefore decreasing efficiency). We will also show that the performance of the “guiding chain” is not representative of the overall samples. Since maximum parallel efficiency can be achieved by setting 𝑁prop = 𝑃, we set 𝑁prop = 𝑁accept = 𝑃. Additionally, we use the waste-recycling extension of GMH, as it is provably more efficient than the original GMH. Lastly, for the underlying MCMC algorithm of GMH, we use HMC with 32 leapfrog steps, jitter the step-size by 10% (as recommended), and use an adaptation scheme identical to NUTS-INCA.
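One GMH iteration can be sketched as follows, using Calderhead's construction with a 1-D Gaussian random-walk kernel as the underlying proposal (the experiments instead use HMC proposals with jittered step-sizes); the toy target and all tuning constants are illustrative choices:

```python
import numpy as np

def log_target(x):                     # toy target: standard normal
    return -0.5 * x * x

def gmh_step(x, n_prop, rng, step=1.0):
    """One generalized MH iteration: draw n_prop proposals in parallel
    from the current state, then resample among the pool
    {current, proposals} with stationary weights
    p(i) ∝ pi(z_i) * prod_{j != i} K(z_i, z_j)."""
    pool = np.concatenate([[x], x + step * rng.normal(size=n_prop)])
    d = pool[:, None] - pool[None, :]
    logk = -0.5 * (d / step) ** 2      # log K(z_i, z_j) up to a constant
    logw = log_target(pool) + logk.sum(axis=1)  # diagonal term is 0
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # n_accept = n_prop samples are kept for inference; the last one
    # becomes the next state of the "guiding chain".
    kept = rng.choice(pool, size=n_prop, p=w)
    return kept[-1], kept

rng = np.random.default_rng(2)
x, samples = 3.0, []
for _ in range(2000):
    x, kept = gmh_step(x, n_prop=8, rng=rng)
    samples.extend(kept)
print(np.mean(samples), np.var(samples))
```

The 𝑁prop density evaluations inside `gmh_step` are independent and can run in parallel, which is the source of GMH's parallelism.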
8) Metropolis-coupled MCMC (parallel tempering): Another popular type of interchain parallelization that we do not include in our experiment is Metropolis-coupled MCMC (MC³, also known as parallel tempering). MC³ improves mixing of the Markov-chain by operating multiple Markov-chains with different targets such as 𝜋ᵢ(𝜃) = 𝑝(𝑦 ∣ 𝜃)^𝑇ᵢ 𝑝(𝜃). The parallel chains periodically exchange their current states. The exponent 𝑇ᵢ ∈ (0, 1] is known as the temperature parameter, where 𝑇ᵢ < 1 eases exploration of the posterior distribution, as the prior is often simpler than the posterior. By exchanging the states of the parallel chains, the improved exploration is communicated across the chains. However, only the samples from the chain with temperature 𝑇 = 1 are used for inference. Thus, using MC³ instead of non-MC³ MCMC increases the total amount of work for acquiring the same number of samples. As a result, achieving good efficiency with MC³ is very difficult, as the ESS of the acquired samples must be 𝑃 times larger to accommodate the increased work. For this reason, we do not consider it in our experiment.
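A minimal MC³ sketch on a 1-D toy target; for simplicity the entire target is tempered here (whereas the scheme above tempers only the likelihood), and the temperature ladder is an arbitrary illustrative choice:

```python
import numpy as np

def log_post(x):                      # toy log-posterior
    return -0.5 * x * x

rng = np.random.default_rng(3)
temps = np.array([1.0, 0.5, 0.25, 0.125])   # T_i in (0, 1]
states = rng.normal(size=temps.size)

def mh_sweep(states):
    """One Metropolis step per chain, each targeting the tempered
    density exp(T_i * log_post); the chains can run in parallel."""
    for i, t in enumerate(temps):
        prop = states[i] + rng.normal()
        if np.log(rng.random()) < t * (log_post(prop) - log_post(states[i])):
            states[i] = prop

def swap(states):
    """Propose exchanging the states of adjacent temperatures; accept
    with the usual MC^3 ratio so the joint chain stays invariant."""
    for i in range(temps.size - 1):
        a = (temps[i] - temps[i + 1]) * (log_post(states[i + 1]) - log_post(states[i]))
        if np.log(rng.random()) < a:
            states[i], states[i + 1] = states[i + 1], states[i]

cold = []
for _ in range(4000):
    mh_sweep(states)
    swap(states)
    cold.append(states[0])            # only the T = 1 chain is used
print(np.mean(cold))
```

Note that only `cold` contributes to inference: the three hotter chains quadruple the work per kept sample, which is exactly the efficiency penalty discussed above.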
9) Experimental settings: Eight schools: For comparing the parallel MCMC algorithms, we choose the eight schools problem, which is a hierarchical Bayesian model,
Fig. 3: Variance inflation factor (𝜏) of the considered algorithms (lower is better). The solid lines are the mean, while the shaded regions demarcate the 50% and 90% bootstrap confidence intervals estimated from 2¹⁰ repetitions.
Fig. 4: Strong scaling speedup of the considered algorithms estimated from 𝜏 (higher is better). The diagonal black line shows the theoretically optimal speedup. The solid lines are the mean, while the shaded regions demarcate the 50% and 95% bootstrap confidence intervals estimated from 2¹⁰ repetitions.
where 𝑖 ∈ [1, 8]. This version of the eight schools model is called the centered parameterization, and is known to be difficult to sample from because of its funnel-shaped posterior. While a more efficient parameterization exists, we keep the centered parameterization as it represents challenges commonly encountered in hierarchical Bayesian models. There are 10 variables in this model in total. All the results we show are for estimating 𝜇.
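For reference, the centered parameterization of the eight schools model is commonly written as follows (the priors on 𝜇 and 𝜏 shown here are the usual illustrative choices, as the exact ones are not reproduced in this section; this 𝜏 is the group-level scale, not the variance inflation factor):

```latex
\begin{align*}
  \mu &\sim \mathcal{N}(0, 5^2), \qquad \tau \sim \mathrm{Cauchy}^{+}(0, 5), \\
  \theta_i &\sim \mathcal{N}(\mu, \tau^2), \\
  y_i &\sim \mathcal{N}(\theta_i, \sigma_i^2), \qquad i \in [1, 8],
\end{align*}
```

where $y_i$ and $\sigma_i$ are the observed treatment effects and their standard errors, and the 10 sampled variables are $\mu$, $\tau$, and $\theta_1, \dots, \theta_8$.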
10) Analysis methodology: To estimate the strong scalability of the considered algorithms, we measure the RMSE resulting from shortening the chains, as suggested by Remark 2. Specifically, we increase the number of chains and shorten them such that each chain draws 𝑁∕𝑃 samples and spends 𝑁burn∕𝑃 samples on burn-in. Ideally, the speedup estimated according to Remark 2 should achieve a linear speedup close to 𝑃. We set the base settings as 𝑁 = 4096 and 𝑁burn = 4096. The true mean (𝔼𝜋[𝑓(𝜃)]) used for estimating the RMSE is estimated from a single long reference MCMC chain with 𝑁 = 2²⁰, 𝑁burn = 2¹². Then, the variance inflation factor 𝜏 is estimated using Equations (8) and (9). The variance 𝕍𝜋[𝑓(𝜃₁)] is also acquired from the reference chain. From the estimated 𝜏, we compute the speedup according to Equation (17).
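Since Equations (8), (9), and (17) are not reproduced in this section, the following sketch assumes one natural reading of them: 𝜏 is the measured MSE relative to the ideal MSE 𝕍𝜋[𝑓(𝜃)]∕𝑁 of 𝑁 independent draws, and the strong-scaling speedup is then 𝑆 = 𝑃∕𝜏. Both formulas, the function name, and the toy numbers are assumptions for illustration:

```python
import numpy as np

def inflation_and_speedup(errs, var_ref, n_total, n_chains):
    """Hedged reading of the paper's Eqs. (8), (9), and (17): tau is the
    measured MSE divided by the ideal MSE var_ref / n_total, and the
    strong-scaling speedup is S = P / tau."""
    mse = np.mean(np.square(errs))          # RMSE^2 over repetitions
    tau = mse * n_total / var_ref           # tau = 1 under ideal scaling
    return tau, n_chains / tau

# Toy usage: errors of the posterior-mean estimate over 2^10 repetitions.
rng = np.random.default_rng(4)
N, P, V = 4096, 64, 1.0
errs = rng.normal(scale=np.sqrt(2.5 * V / N), size=1024)  # tau ~ 2.5
tau, S = inflation_and_speedup(errs, V, N, P)
print(round(tau, 1), round(S, 1))
```

Under this reading, 𝜏 = 1 recovers the diagonal optimal line in Fig. 4, and any inflation 𝜏 > 1 divides the achievable speedup accordingly.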
11) Results: Eight schools: The estimated variance inflation factor (𝜏) is shown in Fig. 3. For all three methods, 𝜏 grows significantly as the number of chains increases and the chains' length decreases. However, for NUTS and NUTS-INCA, the inflation factors decrease slightly until 𝑃 = 16. This is because, with more chains, it is more likely that some of the chains start near the stationary region. Such a benefit of using multiple independent chains has been discussed before. Meanwhile, the INCA scheme did not improve the performance. Since the quality of the initial samples is poor, sharing the samples during adaptation does not seem to help, except for 𝑃 = 4, where the chains are not yet too short.
The speedup estimated from the measured 𝜏 is shown in Fig. 4. We can see that, for all three methods, the strong scaling efficiency quickly falls off with more than 32 chains, except for NUTS-INCA. In particular, GMH failed to achieve significant scaling with more than 4 chains. This shows that the performance increase of the “guiding chain” (which benefits from the increasing number of parallel proposals) does not reflect the overall performance. Note that we did confirm that the performance of the guiding chain increases, as originally reported.
12) Discussion: In a series of experiments, we evaluated the scalability of interchain parallelism. Despite the apparent parallelism, our results suggest that interchain parallelism does not achieve near-linear scaling when the statistical performance is also considered. Mainly, shortening the chains introduces issues related to convergence and adaptation. Meanwhile, methods for improving the quality of adaptation, such as INCA, did not improve the results because of the poor statistical quality of the initial samples.
IV. RELATED WORKS
Numerous approaches have been proposed to parallelize MCMC. For a general review on accelerating MCMC, we refer the reader to existing surveys.
1) Evaluation of MCMC workloads: Until now, only a few workload analyses of Bayesian inference workloads have been presented. Wang et al. performed an extensive analysis of Stan inference workloads. They showed that most Stan programs were compute-bound in terms of instructions per cycle. However, they noted that interchain parallelization is trivial, as different parallel chains are independent. In Section III-C, we showed that, from a strong scaling perspective, interchain parallelism does not provide impressive scalability.
2) Quantifying scalability of MCMC: Formulas for quantifying the scalability of interchain parallelism similar to (16) have appeared before. These formulas, however, only consider the amount of burn-in. In contrast, our formula in Equation (15) also considers the effect of the variance inflation factor. Consequently, it enables the comparison of the strong scalability of MCMC algorithms in a much broader context. Meanwhile, while some works have presented analyses of strong scalability, the speedup is computed relative to the increased amount of work, which is misleading. To properly assess strong scalability, the speedup must be computed against the original amount of work.
3) Limitations of parallelizing MCMC: As we have discussed in Sections III-B and III-C, parallelizing MCMC faces multiple fundamental challenges. Some of these challenges were recognized early on. For example, the issues with burn-in were quickly pointed out in several early works. However, most previous works, with a few exceptions, mainly focused on the convergence aspect of burn-in. Hence, they suggested removing burn-in by starting from the stationary region. We showed in Section III-C that this solution is only partially effective because of the adaptation of the chains.
In this paper, we have discussed the fundamental goals of parallelizing MCMC computation. We evaluated the strong scalability of previously proposed approaches for parallelizing MCMC. Various issues, such as the tuning of kernel hyperparameters, convergence, burn-in, and applicability, complicate the parallelization of MCMC.
Before concluding our work, we would like to point out that all of the aforementioned issues are fundamental limitations of MCMC-based approaches. These limitations are artifacts of the theory of Markov-chains. Instead, we propose investigating the feasibility of alternative algorithms. For example, sequential Monte Carlo (SMC), an algorithm that combines the benefits of importance sampling and MCMC, does not rely on the theory of Markov-chains. In sharp contrast, SMC is built on the theory of interacting particles, which is radically different from the theory of Markov-chains. The convergence of SMC has been established for 𝑃 → ∞, where 𝑃 is the number of particles (or parallel MCMC chains in our context). Also, designing efficient MCMC kernels internally used by SMC is much easier, since necessary conditions such as ergodicity do not have to be fulfilled. While SMC has been shown to be a valid candidate for parallelization, it has yet to be thoroughly compared against parallel MCMC approaches in a setting general enough to apply to PPLs. To conclude, an important future research direction would be to compare the scalability of MCMC against alternative approaches.
The authors would like to express their deepest appreciation to Aki Vehtari for his valuable advice on estimating 𝜏.
The authors also thank Jisu Oh for his constructive comments
about the theory of Markov-chains.
 A. Patil, D. Huard, and C. Fonnesbeck, “PyMC: Bayesian Stochastic
Modelling in Python,” J. Stat. Soft., vol. 35, no. 4, 2010.
F. Wood, J. W. van de Meent, and V. Mansinghka, “A new approach to probabilistic programming inference,” in Proc. 17th Int. Conf. Artif. Intell. Statist., ser. AISTATS’14, 2014, pp. 1024–1032.
 J. Salvatier, T. V. Wiecki, and C. Fonnesbeck, “Probabilistic program-
ming in Python using PyMC3,” PeerJ Comput. Sci., vol. 2, p. e55, Apr.
 B. Carpenter et al., “Stan: A Probabilistic Programming Language,” J.
Stat. Soft., vol. 76, no. 1, 2017.
 H. Ge, K. Xu, and Z. Ghahramani, “Turing: A language for ﬂexible prob-
abilistic inference,” in Int. Conf. Artif. Intell. Statist., ser. AISTATS’18,
2018, pp. 1682–1690.
 S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid
Monte Carlo,” Phys. Lett. B, vol. 195, no. 2, pp. 216–222, 1987.
 P. Jacob, C. P. Robert, and M. H. Smith, “Using Parallel Computation
to Improve Independent Metropolis–Hastings Based Estimation,” J.
Comput. Graphical Statist., vol. 20, no. 3, pp. 616–635, Jan. 2011.
P. E. Jacob, J. O’Leary, and Y. F. Atchadé, “Unbiased Markov chain Monte Carlo methods with couplings,” J. Roy. Stat. Soc. B, vol. 82, no. 3, pp. 543–600, Jul. 2020.
 M. A. Suchard, Q. Wang, C. Chan, J. Frelinger, A. Cron, and M. West,
“Understanding GPU Programming for Statistical Computation: Studies
in Massively Parallel Massive Mixtures,” J. Comput. Graphical Statist.,
vol. 19, no. 2, pp. 419–438, Jan. 2010.
 A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes, “On the
Utility of Graphics Cards to Perform Massively Parallel Simulation of
Advanced Monte Carlo Methods,” J. Comput. Graphical Statist., vol. 19,
no. 4, pp. 769–789, Jan. 2010.
 S. Zierke and J. D. Bakos, “FPGA acceleration of the phylogenetic
likelihood function for Bayesian MCMC inference methods,” BMC
Bioinformatics, vol. 11, no. 1, p. 184, Dec. 2010.
 G. Mingas and C.-S. Bouganis, “Population-Based MCMC on Multi-
Core CPUs, GPUs and FPGAs,” IEEE Trans. Comput., vol. 65, no. 4,
pp. 1283–1296, Apr. 2016.
 J. S. Rosenthal, “Parallel computing and monte carlo algorithms,” Far
East J. Theor. Stat., vol. 4, pp. 207–236, 1999.
 C. P. Robert, V. Elvira, N. Tawn, and C. Wu, “Accelerating MCMC
algorithms,” Wiley Interdisciplinary Rev: Comput. Statist., vol. 10, no. 5,
p. e1435, Sep. 2018.
 X. Feng, K. W. Cameron, C. P. Sosa, and B. Smith, “Building the Tree
of Life on Terascale Systems,” in Proc. Int. Parallel Distrib. Process.
Symp., ser. IPDPS’07. Long Beach, CA, USA: IEEE, 2007, pp. 1–10.
 B. Nemeth, T. Haber, J. Liesenborgs, and W. Lamotte, “Automatic
Parallelization of Probabilistic Models with Varying Load Imbalance,”
in 20th IEEE/ACM Int. Symp. Cluster, Cloud and Internet Comput., ser.
CCGrid’20. Melbourne, Australia: IEEE, May 2020, pp. 752–759.
 J. L. Gustafson, “Reevaluating Amdahl’s law,” Commun. ACM, vol. 31,
no. 5, pp. 532–533, May 1988.
 G. M. Amdahl, “Validity of the single processor approach to achieving
large scale computing capabilities,” in Proc. April 18-20, 1967, Spring
Joint Comput. Conf. - AFIPS ’67 (Spring). Atlantic City, New Jersey:
ACM Press, 1967, p. 483.
 E. Angelino, M. J. Johnson, and R. P. Adams, “Patterns of Scalable
Bayesian Inference,” Found. Trends. Mach. Learn., vol. 9, no. 2-3, pp.
 A. E. Brockwell, “Parallel Markov chain Monte Carlo Simulation by
Pre-Fetching,” J. Comput. Graphical Statist., vol. 15, no. 1, pp. 246–
261, Mar. 2006.
 J. M. R. Byrd, S. A. Jarvis, and A. H. Bhalerao, “Reducing the run-time
of MCMC programs by multithreading on SMP architectures,” in Proc.
Int. Parallel Distrib. Process. Symp., ser. IPDPS’08. Miami, FL, USA:
IEEE, Apr. 2008, pp. 1–8.
 ——, “On the parallelisation of MCMC by speculative chain execution,”
in Proc. Int. Parallel Distrib. Process. Symp. Workshop Phd Forum, ser.
IPDPSW’10. Atlanta, GA: IEEE, Apr. 2010, pp. 1–8.
 E. Angelino, E. Kohler, A. Waterland, M. Seltzer, and R. P. Adams,
“Accelerating MCMC via parallel predictive prefetching,” in Proc. 30th
Conf. Uncertainty Artif. Intell., ser. UAI’14. Arlington, Virginia, USA:
AUAI Press, 2014, pp. 22–31.
 I. Strid, “Efﬁcient parallelisation of Metropolis–Hastings algorithms
using a prefetching approach,” Comput. Statist. Data Anal., vol. 54,
no. 11, pp. 2814–2835, Nov. 2010.
 G. Altekar, S. Dwarkadas, J. P. Huelsenbeck, and F. Ronquist, “Parallel
Metropolis coupled Markov chain Monte Carlo for Bayesian phyloge-
netic inference,” Bioinformatics, vol. 20, no. 3, pp. 407–415, Feb. 2004.
 R. V. Craiu, J. Rosenthal, and C. Yang, “Learn From Thy Neighbor:
Parallel-Chain and Regional Adaptive MCMC,” J. Amer. Statistical
Assoc., vol. 104, no. 488, pp. 1454–1466, Dec. 2009.
A. Solonen, P. Ollinaho, M. Laine, H. Haario, J. Tamminen, and H. Järvinen, “Efficient MCMC for Climate Model Parameter Estimation: Parallel Adaptive Chains and Early Rejection,” Bayesian Anal., vol. 7, no. 3, pp. 715–736, Sep. 2012.
B. Calderhead, “A general construction for parallelizing Metropolis-Hastings algorithms,” Proc. Nat. Acad. Sci., vol. 111, no. 49, pp. 17408–17413, Dec. 2014.
 S. Yang, Y. Chen, E. Bernton, and J. S. Liu, “On parallelizable Markov
chain Monte Carlo algorithms with waste-recycling,” Stat Comput,
vol. 28, no. 5, pp. 1073–1081, Sep. 2018.
 W. Neiswanger, C. Wang, and E. P. Xing, “Asymptotically exact,
embarrassingly parallel MCMC,” in Proc. 30th Conf. Uncertainty Artif.
Intell., ser. UAI’14. Arlington, Virginia, USA: AUAI Press, 2014, pp.
 S. L. Scott, A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George,
and R. E. McCulloch, “Bayes and big data: The consensus monte carlo
algorithm,” Int. J. Manag. Sci. Eng. Manag., vol. 11, pp. 78–88, 2016.
 S. Srivastava, C. Li, and D. B. Dunson, “Scalable bayes via barycenter
in wasserstein space,” J. Mach. Learn. Res., vol. 19, no. 8, pp. 1–35,
 C. P. Robert and G. Casella, Monte Carlo Statistical Methods, ser.
Springer Texts in Statistics. New York, NY: Springer New York, 2004.
 W. K. Hastings, “Monte Carlo sampling methods using Markov chains
and their applications,” Biometrika, vol. 57, no. 1, pp. 97–109, Apr.
 C. J. Geyer, “Introduction to markov chain monte carlo,” in Handbook
of Markov Chain Monte Carlo. CRC Press, 2011, pp. 3–48.
 C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, “An Introduction
to MCMC for Machine Learning,” Mach. Learn., vol. 50, no. 1/2, pp.
 M. Betancourt, “A Conceptual Introduction to Hamiltonian Monte
Carlo,” arXiv:1701.02434 [stat], Jan. 2017.
 M. D. Hoffman and A. Gelman, “The no-u-turn sampler: Adaptively
setting path lengths in Hamiltonian Monte Carlo,” J. Mach. Learn. Res.,
vol. 15, no. 47, pp. 1593–1623, 2014.
 C. J. Geyer, “Practical Markov Chain Monte Carlo,” Statist. Sci., vol. 7,
no. 4, pp. 473–483, Nov. 1992.
 D. Wilkinson, “Parallel Bayesian Computation,” in Handbook of Par-
allel Computing and Statistics (Statistics, Textbooks and Monographs).
Chapman & Hall/CRC, 2005.
 V. Gopal and G. Casella, “Running regenerative markov chains in
parallel,” unpublished, 2011.
 L. Murray, “Distributed markov chain monte carlo,” in Proc. Neural Inf.
Process. Syst. Workshop Learn. Cores, Clusters Clouds, vol. 11, 2010.
 Y. Emma Wang, Y. Zhu, G. G. Ko, B. Reagen, G.-Y. Wei, and D. Brooks,
“Demystifying Bayesian Inference Workloads,” in IEEE Int. Symp.
Perform. Anal. Syst. Softw., ser. ISPASS’19. Madison, WI, USA: IEEE,
Mar. 2019, pp. 177–189.
 J. Piironen and A. Vehtari, “Sparsity information and regularization in
the horseshoe and other shrinkage priors,” Electron. J. Statist., vol. 11,
no. 2, pp. 5018–5051, 2017.
 M. Betancourt, “Bayes Sparse Regression,” Mar. 2018.
 Imperial College COVID-19 Response Team et al., “Estimating the
effects of non-pharmaceutical interventions on COVID-19 in Europe,”
Nature, Jun. 2020.
 Stan Development Team, “Stan modeling language users guide and
reference manual, version 2.23.0,” 2020.
D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
 L. Egidi, F. Pauli, and N. Torelli, “Are Shots Predictive Of Soccer
Results?” in StanCon 2018. Zenodo, Aug. 2018.
S. Kim, N. Shephard, and S. Chib, “Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models,” Rev. Econ. Stud., vol. 65, no. 3, pp. 361–393, Jul. 1998.
 R. M. Neal, “Probabilistic inference using markov chain monte carlo
methods,” University of Toronto, Tech. Rep. CRG-TR-93-1, Sep. 1993.
 J. Li et al., “Feature selection: A data perspective,” ACM Comput.
Surveys, vol. 50, no. 6, p. 94, 2018.
 A. Gelman and K. Shirley, “Inference from simulations and monitoring
convergence,” in Handbook of Markov Chain Monte Carlo. CRC Press,
2011, pp. 163–174.
 D. Rudolf, “Error bounds for computing the expectation by Markov
chain Monte Carlo,” Monte Carlo Methods Appl., vol. 16, no. 3-4, Jan.
 G. S. Fishman, Discrete-Event Simulation. New York, NY: Springer
New York, 2001.
 J. G. Propp and D. B. Wilson, “Exact sampling with coupled markov
chains and applications to statistical mechanics,” Random Struct Algo-
rithms, vol. 9, no. 1–2, pp. 223–252, Aug. 1996.
T. K. Papp, JackRab, D. Aluthge, J. TagBot, and M. Piibeleht, “Tpapp/DynamicHMC.jl: V2.1.6,” Zenodo, Aug. 2020.
 K. Xu, H. Ge, W. Tebbutt, M. Tarek, M. Trapp, and Z. Ghahramani,
“AdvancedHMC.jl: A robust, modular and efﬁcient implementation of
advanced HMC algorithms,” in Proc. 2nd Symp. Adv. Approx. Bayesian
Inference, ser. AABI’19, vol. 118. PMLR, Dec. 2020, pp. 1–10.
 R. M. Neal et al., “MCMC using Hamiltonian dynamics,” Handb.
Markov Chain Monte Carlo, vol. 2, no. 11, p. 2, 2011.
 H. Haario, E. Saksman, and J. Tamminen, “An adaptive metropolis
algorithm,” Bernoulli, vol. 7, no. 2, pp. 223–242, Apr. 2001.
 R. M. Neal, “An Improved Acceptance Procedure for the Hybrid Monte
Carlo Algorithm,” J. Comput. Phys., vol. 111, no. 1, pp. 194–203, Mar.
 A. Gelman, J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin,
Bayesian Data Analysis, 3rd ed., ser. Chapman & Hall/CRC Texts in
Statistical Science. Boca Raton: CRC Press, 2014.
 A. Gelman and D. B. Rubin, “Inference from Iterative Simulation Using
Multiple Sequences,” Statist. Sci., vol. 7, no. 4, pp. 457–472, Nov. 1992.
 X. Feng, D. A. Buell, J. R. Rose, and P. J. Waddell, “Parallel algorithms
for Bayesian phylogenetic inference,” J. Parallel Distrib. Comput.,
vol. 63, no. 7-8, pp. 707–718, Jul. 2003.
 P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo
samplers,” J. Roy. Statist. Soc.: B, vol. 68, no. 3, pp. 411–436, Jun.
 N. Chopin, “A sequential particle ﬁlter method for static models,”
Biometrika, vol. 89, no. 3, pp. 539–552, Aug. 2002.
 P. Del Moral, Feynman-Kac Formulae, ser. Probability and Its Applica-
tions. New York, NY: Springer New York, 2004.
 A. Beskos, A. Jasra, N. Kantas, and A. Thiery, “On the convergence of
adaptive sequential Monte Carlo methods,” Ann. Appl. Probab., vol. 26,
no. 2, pp. 1111–1146, Apr. 2016.
 A. Varsi, L. Kekempanos, J. Thiyagalingam, and S. Maskell, “A Single
SMC Sampler on MPI that Outperforms a Single MCMC Sampler,”
arXiv:1905.10252 [cs, stat], May 2019.
 B. Nemeth, T. Haber, J. Liesenborgs, and W. Lamotte, “Relaxing
Scalability Limits with Speculative Parallelism in Sequential Monte
Carlo,” in IEEE Int. Conf. Cluster Comput., ser. CLUSTER’18. Belfast:
IEEE, Sep. 2018, pp. 494–503.