arXiv:0903.0837v1 [astro-ph.CO] 4 Mar 2009
Estimation of cosmological parameters using adaptive importance sampling
Darren Wraith,1,2 Martin Kilbinger,2 Karim Benabed,2 Olivier Cappé,3
Jean-François Cardoso,3,2 Gersende Fort,3 Simon Prunet,2 and Christian P. Robert1
1 CEREMADE, Université Paris Dauphine, 75775 Paris cedex 16, France
2 Institut d’Astrophysique de Paris, CNRS UMR 7095 & UPMC, 98 bis, boulevard Arago, 75014 Paris, France
3 LTCI, TELECOM ParisTech and CNRS, 46, rue Barrault, 75013 Paris, France
(Dated: March 4, 2009)
We present a Bayesian sampling algorithm called adaptive importance sampling or Population Monte Carlo (PMC), whose computational workload is easily parallelizable and thus has the potential to considerably reduce the wall-clock time required for sampling, along with providing other benefits. To assess the performance of the approach for cosmological problems, we use simulated and actual data consisting of CMB anisotropies, supernovae of type Ia, and weak cosmological lensing, and provide a comparison of results to those obtained using state-of-the-art Markov Chain Monte Carlo (MCMC). For both types of data sets, we find comparable parameter estimates for PMC and MCMC, with the advantage of a significantly lower computational time for PMC. In the case of WMAP5 data, for example, the wall-clock time reduces from several days for MCMC to a few hours using PMC on a cluster of processors. Other benefits of the PMC approach, along with potential difficulties in using the approach, are analysed and discussed.
PACS numbers: 98.80.Es, 02.50.-r, 02.50.Sk
I. INTRODUCTION
In recent years we have seen spectacular advances in observational cosmology, with the availability of more and more high-quality data allowing for the testing of models with higher complexity. Some of these tests have been made possible thanks to the use of Bayesian sampling techniques, and in particular Markov Chain Monte Carlo (MCMC), an iterative algorithm that produces a Markov chain whose distribution converges to the target posterior π. After a “burn-in” period, samples from such a chain can be regarded as samples approximately from π. Proposed values for the chain or the updating scheme of MCMC can be designed to ensure that moves towards regions of higher mass under π are favored, and regions with null probability (under π) are never visited. This way, most of the computational effort can be spent in the region of importance to the posterior distribution, and an MCMC approach is usually much more efficient than traditional grid sampling of the model parameter space.
The MCMC technique is now well known in cosmology, in particular in its simplest form, the Metropolis-Hastings algorithm, thanks to the user-friendly and freely available package COSMOMC [1]. Other forms of the MCMC algorithm, like Gibbs sampling and Hybrid Monte Carlo (better known in cosmology as Hamiltonian sampling), have also been proposed and have found some interesting usage in the estimation of the posterior distribution for the Cosmic Microwave Background anisotropy power spectrum at low resolution (see [2] and references therein, also [3] and [4]).
For all its advantages over grid sampling, the MCMC approach also suffers from problems. One difficulty is to assess the correct convergence of the chain. Another lies in the presence of correlations within the chain, which can greatly reduce the efficiency of the sample [5]. A third issue, which is particularly relevant for the usage of MCMC in cosmology, is the computational time involved. Indeed, whatever the sampling technique, we often need to compute at least one estimate of the posterior for each sampled point. This computation can be slow in cosmology. With the current processing speed of computers, evaluating one point of the posterior of, for example, the WMAP5 data set, using CAMB [6] (footnote 1) and the public WMAP5 likelihood code [4] (footnote 2), both with their default precision settings, takes on the order of several seconds, and can be much slower when exploring non-flat models. Of course, as stated above, for most problems MCMC will require orders of magnitude fewer samples than a grid for a given target precision, thus providing an important efficiency improvement. However, apart from improving the likelihood codes or waiting for the availability of faster computers, there is not much speed improvement to expect from an MCMC approach, although such an improvement is probably needed if one wants to explore yet bigger and more complex models.
On the algorithmic side of the problem, some effort has been devoted recently to the improvement of the likelihood codes, mainly by using clever interpolation tricks (segmentation [7], neural networks [8]), and to improvements in the MCMC algorithm [9, 10, 11]. The former [7, 8] indeed provide some gain in efficiency, but at the cost of a long precomputation step for each model. The latter improve on the natural inefficiency of the Metropolis algorithm but impose some other requirements, like the availability of cheap computation of the derivatives of the likelihood [9], or the knowledge of conditional probabilities of some of the parameters [10, 11]. Other (non-Markovian) Monte Carlo methods, such as nested sampling, have also been proposed and applied recently to cosmological problems with some success, while presenting their own problems [12, 13, 14].
1 http://camb.info
2 http://lambda.gsfc.nasa.gov
On the hardware side, however, there is a route to speed improvement that does not lie in quicker CPUs, but in the availability of cheap multi-CPU computers and the standardization of clusters of computers. This opportunity, however, is only partly open to MCMC. Indeed, there are two ways of parallelizing the parameter exploration. First, by distributing the computation of the likelihood, which is not always possible and does not always lead to speed improvement. Second, by running multiple chains in parallel. This last option is the simplest, but is ‘forbidden’ by the iterative nature of the MCMC algorithm. More precisely, running parallel chains and mixing them in the end to build a bigger chain sample is of course possible (and can be advantageous in fully exploring the support of π), but only on the condition that each of the individual chains has converged. If this condition is not met, significant biases can be introduced in the sample. Determining convergence for each chain is inherently difficult in practice and has largely prevented more widespread use of the approach [15]. Thus, for MCMC any speed improvements through parallelization are difficult to achieve.
In this paper, we propose another sampling algorithm suitable for cosmological applications, which is not based upon MCMC and can be parallelised. This novel algorithm, called Population Monte Carlo (PMC), is an adaptive importance sampling technique that has been studied recently in the statistics literature [16]. While this algorithm solves some of the issues of MCMC in cosmology, the approach of course has a different set of potential problems that we will analyse and discuss, along with its advantages.
The paper is organised as follows. In the next section, we provide a brief introduction to the Bayesian approach, which we hope will give the casual reader some important keys for further reading, and we also discuss the challenges and issues involved with using either an MCMC or an importance sampling algorithm for estimation. We then describe details of the PMC approach. In Sect. III, we assess the performance of the PMC approach using a simulated target density with features similar to cosmological parameter posteriors, and provide a comparison to results obtained using an MCMC approach. In Sect. IV, we illustrate the results from the PMC approach using actual data, consisting of CMB anisotropies, supernovae of type Ia and weak cosmological lensing. We conclude in Sect. V with a discussion and an outlook for further work.
II. METHODS
A. Bayesian inference via simulation
A key feature of Bayesian inference is to provide a probabilistic expression for the uncertainty regarding a parameter of interest x by combining prior information with information brought by the data. Prior information, for example, could take the form of information obtained from previous experiments which cannot readily be incorporated into the current experiment, or simply consist of a feasible range. The absence of prior information, however, is not a restriction for the use of Bayesian inference, and estimation can still be regarded as valid [17]. Information brought by the data and prior information are entirely subsumed in the posterior probability density function obtained, up to a normalization constant, by
$$ \pi(x) \propto \mathrm{likelihood}(\mathrm{data} \mid x) \times \mathrm{prior}(x). \tag{1} $$
It is however generally difficult to handle the posterior distribution, due to (a) the dimension of the parameter vector x, and (b) the use of non-analytical likelihood functions. For both of these reasons, the normalizing constant missing from the right-hand side of (1) is usually not explicitly available. A practical solution to this difficulty is to replace the analytical study of the posterior distribution with a simulation from this distribution, since producing a sample from π allows for a straightforward approximation of all integrals related to π, due to the Monte Carlo principle [5]. In short, if $x_1, \ldots, x_N$ is a sample drawn from the distribution π and f denotes a function (with finite expectation under π), the empirical average
$$ \frac{1}{N} \sum_{n=1}^{N} f(x_n) \tag{2} $$
is a convergent estimator of the integral
$$ \pi(f) = \int f(x)\, \pi(x)\, dx, \tag{3} $$
in the sense that the empirical mean (2) converges to π(f) as N grows to infinity. Quantities of interest in a Bayesian analysis typically include the posterior mean, for which $f(x) = x$; the posterior covariance matrix, corresponding to $f(x) = x x^{T}$; and probability intervals, with $f(x) = \mathbf{1}_S(x)$, where S is a domain of interest and $\mathbf{1}_S(x)$ denotes the indicator function which is equal to one if $x \in S$ and zero otherwise.
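As a minimal sketch of the Monte Carlo principle of eqs. (2) and (3), the snippet below estimates a posterior mean, a posterior variance and the probability of a region S from a sample. The "posterior" is a stand-in chosen purely for illustration (a one-dimensional Gaussian with mean 1 and standard deviation 2), and the variance is computed as a central moment rather than the raw second moment $x x^{T}$; the function name `monte_carlo_mean` is hypothetical.

```python
import random

def monte_carlo_mean(f, samples):
    """Empirical average (1/N) * sum_n f(x_n), the estimator of eq. (2)."""
    return sum(f(x) for x in samples) / len(samples)

# Stand-in posterior (an assumption for illustration): 1-D Gaussian, mean 1, std 2
rng = random.Random(0)
samples = [rng.gauss(1.0, 2.0) for _ in range(50_000)]

# Posterior mean: f(x) = x
post_mean = monte_carlo_mean(lambda x: x, samples)
# Posterior variance: central second moment around the estimated mean
post_var = monte_carlo_mean(lambda x: (x - post_mean) ** 2, samples)
# Probability of the region S = [-1, 3]: f is the indicator function of S
prob_in_S = monte_carlo_mean(lambda x: 1.0 if -1.0 <= x <= 3.0 else 0.0, samples)
```

All three quantities are obtained from the same sample by changing only the function f, which is the practical appeal of simulation-based inference.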
B. Markov chain Monte Carlo methods
For most problems in practice, direct simulation from π is not an option and more sophisticated approximation techniques are necessary. One of the standard approaches [5] to the simulation of complex distributions is the class of Markov chain Monte Carlo (MCMC) methods that rely on the production of a Markov chain $\{x_n\}$ having the target posterior distribution π as limiting distribution.
MCMC can be implemented with many Markovian proposal distributions but the standard approach is the random walk Metropolis-Hastings algorithm: given the current value $x_n$ of the chain, a new value $x^\star$ is drawn from $\psi(x - x_n)$, where the so-called proposal ψ denotes a symmetric probability density function. The point $x^\star$ is then accepted as $x_{n+1}$ with probability (also called acceptance rate in this context)
$$ \min\left\{ 1, \frac{\pi(x^\star)}{\pi(x_n)} \right\}, \tag{4} $$
and otherwise, $x_{n+1} = x_n$. The algorithm is implemented as follows:
Random walk Metropolis-Hastings algorithm
Do: Choose an arbitrary value of $x_1$.
For $n \geq 1$:
  Generate $x^\star \sim \psi(x - x_n)$ and $u \sim \mathrm{Uniform}(0,1)$.
  Take $x_{n+1} = x^\star$ if $u \leq \pi(x^\star)/\pi(x_n)$, and $x_{n+1} = x_n$ otherwise.
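The algorithm above can be sketched in a few lines. This is an illustrative one-dimensional version under stated assumptions, not the implementation used in this paper: the proposal ψ is taken to be Gaussian with a user-chosen scale `step`, the target is an unnormalised standard Gaussian, and the ratio test is carried out on the log scale for numerical stability. The names `random_walk_metropolis` and `log_target` are hypothetical.

```python
import math
import random

def random_walk_metropolis(log_target, x0, step, n_steps, seed=0):
    """Random-walk Metropolis chain for a 1-D target known up to a constant."""
    rng = random.Random(seed)
    x, log_px = x0, log_target(x0)
    chain, n_accepted = [], 0
    for _ in range(n_steps):
        x_star = x + rng.gauss(0.0, step)      # symmetric proposal psi(x* - x_n)
        log_p_star = log_target(x_star)
        # Accept when u <= pi(x*) / pi(x_n); 1 - random() lies in (0, 1]
        if math.log(1.0 - rng.random()) <= log_p_star - log_px:
            x, log_px = x_star, log_p_star
            n_accepted += 1
        chain.append(x)                        # on rejection the old value is repeated
    return chain, n_accepted / n_steps

def log_target(x):
    """Unnormalised standard Gaussian (an assumption for illustration)."""
    return -0.5 * x * x

chain, acceptance_rate = random_walk_metropolis(log_target, x0=5.0, step=2.4, n_steps=20_000)
kept = chain[2_000:]                           # discard a burn-in period
chain_mean = sum(kept) / len(kept)
chain_second_moment = sum(x * x for x in kept) / len(kept)
```

Only the ratio π(x⋆)/π(x_n) enters the acceptance test, so the unknown normalizing constant of the posterior never needs to be computed.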
While this algorithm is universal in that it applies to any choice of posterior distribution π and proposal ψ, its performance highly depends on the choice of the proposal ψ, which has to be properly tuned to match some characteristics of π. If the scale of the proposal ψ is too small, that is, if it takes many steps of the random walk to explore the support of π, the algorithm will require many iterations to converge and, in the most extreme cases, will fail to converge in the sense that it will miss some relevant part of the support of π [18]. If, on the other hand, the scale of ψ is too large, the algorithm may also fail to adequately sample from π. This time, the chain may exhibit low acceptance rates and fail to generate a sufficiently diverse sample, even with longer runs. There exist monitors that assess the convergence of such algorithms but they usually are conservative (i.e., require a multiple of the number of necessary iterations) and partial (i.e., only focus on a particular aspect of convergence or on a special class of targets) [5]. MCMC algorithms are also notoriously delicate to calibrate online, both from a theoretical point of view and from a practical perspective [19]. For these approaches, often called adaptive MCMC, some recommendations for the optimal scaling and calibration schedule for various proposals in high dimensions have been proposed [20], but this is still at an experimental stage.
C. Population Monte Carlo
Population Monte Carlo (PMC) [16, 21] is an adaptive
version of importance sampling [22, 23] that produces
a sequence of samples (or populations) that are used in
a sequential manner to construct improved importance
functions and improved estimations of the quantities of
interest.
We recall that importance sampling is based on the fundamental identity [5]
$$ \pi(f) = \int f(x)\, \pi(x)\, dx = \int f(x)\, \frac{\pi(x)}{q(x)}\, q(x)\, dx, \tag{5} $$
which holds for any probability density function q with support including the support of π and any function f for which the expectation π(f) is finite. Hence, this approach to approximating integrals linked with complex distributions is also universal in that the above identity always holds. If $x_1, \ldots, x_N$ are drawn independently from q,
$$ \hat{\pi}(f) = \frac{1}{N} \sum_{n=1}^{N} f(x_n)\, w_n; \qquad w_n = \pi(x_n)/q(x_n), \tag{6} $$
provides a converging approximation to π(f). In this context, q is called the importance function and the $w_n$ are commonly referred to as importance weights. For Bayesian inference, one cannot directly use (6) as only the unnormalised version of π (i.e., the right-hand side of eq. 1) is available. Conveniently, the self-normalised importance ratio
$$ \hat{\pi}_N(f) = \sum_{n=1}^{N} f(x_n)\, \bar{w}_n, \tag{7} $$
where the normalised importance weights are defined as
$$ \bar{w}_n = \frac{w_n}{\sum_{m=1}^{N} w_m}, \tag{8} $$
is also a converging approximation to π(f), independent of the normalization of π. For an importance function that is closely matched to the target density, significant reductions in the variance of the Monte Carlo estimates are possible in comparison to estimates obtained using MCMC [5]. However, the importance sampling approach is as prone to poor performance as MCMC, in that the resulting converging approximation may suffer from a large or even infinite variance if q is not selected in accordance with π. There is no universal importance function, and most of the research in this field aims at fitting the most efficient importance functions for the problem at hand.
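The self-normalised estimator of eqs. (7) and (8) can be sketched as follows. The target and importance function are assumptions chosen for illustration: an unnormalised Gaussian target with mean 1 and unit variance, and a wider Gaussian importance function with mean 0 and standard deviation 3. Normalizing constants are dropped from both densities, since they cancel in the normalised weights; the function name `self_normalised_is` is hypothetical.

```python
import math
import random

def self_normalised_is(f, log_target, sample_q, log_q, n, seed=0):
    """Self-normalised importance sampling estimate of pi(f), eqs. (7)-(8)."""
    rng = random.Random(seed)
    xs = [sample_q(rng) for _ in range(n)]
    log_w = [log_target(x) - log_q(x) for x in xs]   # log of w_n = pi(x_n)/q(x_n)
    shift = max(log_w)                               # guard against overflow in exp
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    w_bar = [wi / total for wi in w]                 # normalised weights, eq. (8)
    estimate = sum(wb * f(x) for wb, x in zip(w_bar, xs))
    return estimate, w_bar

def log_target(x):
    """Unnormalised Gaussian target, mean 1, unit variance (illustrative)."""
    return -0.5 * (x - 1.0) ** 2

def log_q(x):
    """Gaussian importance function, mean 0, std 3, constant dropped."""
    return -0.5 * (x / 3.0) ** 2

def sample_q(rng):
    return rng.gauss(0.0, 3.0)

posterior_mean, w_bar = self_normalised_is(lambda x: x, log_target, sample_q, log_q, n=20_000)
```

Because the importance function here has heavier tails than the target, the weights stay bounded; reversing the two scales would give the large-variance behaviour described above.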
Population Monte Carlo offers a possible solution to this difficulty through adaptivity: given the target posterior density π up to a constant, PMC produces a sequence $q^t$ of importance functions ($t = 1, \ldots, T$) aimed at approximating this very target. The first sample is produced by a regular importance sampling scheme, $x_1^1, \ldots, x_N^1 \sim q^1$, associated with importance weights
$$ w_n^1 = \frac{\pi(x_n^1)}{q^1(x_n^1)}; \qquad n = 1, \ldots, N, \tag{9} $$
and their normalised counterparts $\bar{w}_n^1$ (eq. 8), providing a first approximation to a sample from π. Moments of π can then be approximated to construct an updated importance function $q^2$, etc.
The approximation can be measured in terms of the Kullback divergence (also called Kullback-Leibler divergence or relative entropy) from the target,
$$ K(\pi \,\|\, q^t) = \int \log\left( \frac{\pi(x)}{q^t(x)} \right) \pi(x)\, dx, \tag{10} $$
and the density $q^t$ can be adjusted incrementally such that $K(\pi \,\|\, q^t)$ is smaller than $K(\pi \,\|\, q^{t-1})$. The importance function should be selected from a family of functions which is sufficiently large to allow for a close match with π but for which the minimization of (10) is computationally feasible. In [16] the authors propose to use mixture densities of the form
$$ q^t(x) = q(x; \alpha^t, \theta^t) = \sum_{d=1}^{D} \alpha_d^t\, \varphi(x; \theta_d^t) \tag{11} $$
where $\alpha^t = (\alpha_1^t, \ldots, \alpha_D^t)$ is a vector of adaptable weights for the D mixture components (with $\alpha_d^t > 0$ and $\sum_{d=1}^{D} \alpha_d^t = 1$), and $\theta^t = (\theta_1^t, \ldots, \theta_D^t)$ is a vector of parameters which specify the components; ϕ is a parameterised probability density function, usually taken to be multivariate Gaussian or Student-t (where the latter is to be preferred in cases where it is suspected that the tails of the posterior π are indeed heavier than Gaussian tails). Given the vast array of densities that can be approximated by mixtures, such an importance function provides considerable flexibility to efficiently estimate a wide range of posteriors, including in this case those found in cosmological settings.
The generic PMC algorithm then consists of the following:
Population Monte Carlo algorithm
Do: Choose an importance function $q^1$.
  Generate an independent sample $x_1^1, \ldots, x_N^1 \sim q^1$.
  Compute the importance weights $w_1^1, \ldots, w_N^1$.
For $t \geq 1$:
  Update the importance function to $q^{t+1}$, based on the previous weighted sample $(x_1^t, w_1^t), \ldots, (x_N^t, w_N^t)$.
  Generate independently $x_1^{t+1}, \ldots, x_N^{t+1} \sim q^{t+1}$.
  Compute the importance weights $w_1^{t+1}, \ldots, w_N^{t+1}$.
Unlike for MCMC, in a PMC approach the process can be interrupted at any time, as the sample produced at each iteration can be validly used to approximate expectations under π using self-normalised importance sampling following (7). Further, sampling outputs from previous iterations can be combined [24, 25], and the sample size at each iteration does not necessarily need to be fixed. Both of these properties of PMC can be exploited to improve parameter estimates, either by increasing the coverage of the importance function to the target density or by increasing the precision of the approximation for the integral of interest.
Also note that an approximate sample from the target density can be obtained by sampling $(x_1^t, \ldots, x_N^t)$ with replacement, using the normalised importance weights $\bar{w}_n^t$. Although this process induces extra Monte Carlo variation, there are a number of methods available which considerably reduce the variation involved (e.g. residual sampling [26] or systematic sampling [27]).
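As an illustration of resampling with reduced Monte Carlo variation, the sketch below implements the standard textbook form of systematic resampling, which uses a single uniform draw and an evenly spaced grid. It is offered as an assumption about what such a scheme looks like in practice, not as the specific variant of [27]; the function name `systematic_resample` is hypothetical.

```python
import random

def systematic_resample(w_bar, rng):
    """Draw len(w_bar) indices with probabilities w_bar using one uniform offset."""
    n = len(w_bar)
    u0 = rng.random() / n                 # single offset in [0, 1/n)
    indices, cumulative, i = [], w_bar[0], 0
    for k in range(n):
        u = u0 + k / n                    # evenly spaced points on [0, 1)
        # Advance to the first index whose cumulative weight reaches u
        while u > cumulative and i < n - 1:
            i += 1
            cumulative += w_bar[i]
        indices.append(i)
    return indices

rng = random.Random(1)
w_bar = [0.5, 0.25, 0.125, 0.125]         # normalised importance weights (illustrative)
idx = systematic_resample(w_bar, rng)     # indices of the points kept in the new sample
```

With this scheme, a point with normalised weight $\bar{w}_n$ is copied either the integer part of $N \bar{w}_n$ times or one more, which is the source of the variance reduction relative to plain multinomial resampling.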
1. Updating the importance function in the Gaussian case
In this section, we particularise the generic PMC algorithm to the case where the importance function consists of a mixture of p-dimensional Gaussian densities with mean $\mu_d$ and covariance $\Sigma_d$,
$$ \varphi(x; \mu_d, \Sigma_d) = (2\pi)^{-p/2}\, |\Sigma_d|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu_d)^{T} \Sigma_d^{-1} (x - \mu_d) \right). \tag{12} $$
Using this importance function for the mixture model (11), we start the PMC algorithm by arbitrarily fixing the mixture parameters $(\alpha^1, \mu^1, \Sigma^1)$, and then sample independently from the resulting importance function $q^1$ to obtain our initial sample $(x_1^1, \ldots, x_N^1)$. From this stage, updates of the parameters proceed recursively.
At iteration t, the importance weights associated with the sample $(x_1^t, \ldots, x_N^t)$ are given by
$$ w_n^t = \frac{\pi(x_n^t)}{\sum_{d=1}^{D} \alpha_d^t\, \varphi(x_n^t; \mu_d^t, \Sigma_d^t)} \tag{13} $$
with normalised counterparts $\bar{w}_n^t$ given by eq. (8). The parameters ($\alpha^t$, $\mu^t$ and $\Sigma^t$) of the importance function are then updated according to
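$$ \alpha_d^{t+1} = \sum_{n=1}^{N} \bar{w}_n^t\, \rho_d(x_n^t; \alpha^t, \mu^t, \Sigma^t); \tag{14} $$
$$ \mu_d^{t+1} = \frac{\sum_{n=1}^{N} \bar{w}_n^t\, x_n^t\, \rho_d(x_n^t; \alpha^t, \mu^t, \Sigma^t)}{\alpha_d^{t+1}}; \tag{15} $$
$$ \Sigma_d^{t+1} = \frac{\sum_{n=1}^{N} \bar{w}_n^t\, (x_n^t - \mu_d^{t+1})(x_n^t - \mu_d^{t+1})^{T}\, \rho_d(x_n^t; \alpha^t, \mu^t, \Sigma^t)}{\alpha_d^{t+1}}; \tag{16} $$
where
$$ \rho_d(x; \alpha, \mu, \Sigma) = \frac{\alpha_d\, \varphi(x; \mu_d, \Sigma_d)}{\sum_{d'=1}^{D} \alpha_{d'}\, \varphi(x; \mu_{d'}, \Sigma_{d'})}. \tag{17} $$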
Appendix A provides derivations of these expressions and further details on the general approach, as well as equations pertaining to the (more involved) case of mixtures of multivariate Student-t distributions, which are used in the simulations presented in Sect. III.
As discussed in Appendix A, the main theoretical appeal of this particular update rule is that, as N tends to infinity, the corresponding Kullback divergence $K(\pi \,\|\, q^{t+1})$ is guaranteed to be less than $K(\pi \,\|\, q^t)$.
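To make the update rules (13) to (17) concrete, here is a minimal one-dimensional sketch in which covariances reduce to scalar variances. The bimodal target, the starting values and the sample size are assumptions chosen for illustration, and the sketch has no safeguard for components whose weight collapses to zero. The names `normal_pdf`, `pmc_iteration` and `target` are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, var):
    """1-D Gaussian density, the scalar case of eq. (12)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def pmc_iteration(target, alpha, mu, var, n, rng):
    """Sample from the current mixture, weight by eq. (13), update by eqs. (14)-(16)."""
    D = len(alpha)
    comps = rng.choices(range(D), weights=alpha, k=n)           # pick mixture components
    xs = [rng.gauss(mu[d], math.sqrt(var[d])) for d in comps]   # x_n ~ q^t
    dens = [[alpha[d] * normal_pdf(x, mu[d], var[d]) for d in range(D)] for x in xs]
    q = [sum(row) for row in dens]                               # mixture density q^t(x_n)
    w = [target(x) / qx for x, qx in zip(xs, q)]                 # eq. (13)
    total = sum(w)
    w_bar = [wi / total for wi in w]                             # eq. (8)
    new_alpha, new_mu, new_var = [], [], []
    for d in range(D):
        rho = [row[d] / qx for row, qx in zip(dens, q)]          # eq. (17)
        a = sum(wb * r for wb, r in zip(w_bar, rho))             # eq. (14)
        m = sum(wb * r * x for wb, r, x in zip(w_bar, rho, xs)) / a             # eq. (15)
        v = sum(wb * r * (x - m) ** 2 for wb, r, x in zip(w_bar, rho, xs)) / a  # eq. (16)
        new_alpha.append(a)
        new_mu.append(m)
        new_var.append(v)
    return new_alpha, new_mu, new_var, xs, w_bar

def target(x):
    """Unnormalised bimodal density with modes at -3 and +3 (illustrative)."""
    return math.exp(-0.5 * (x + 3.0) ** 2) + math.exp(-0.5 * (x - 3.0) ** 2)

rng = random.Random(0)
alpha, mu, var = [0.5, 0.5], [-4.0, 4.0], [4.0, 4.0]   # arbitrary, overdispersed start
for t in range(5):
    alpha, mu, var, xs, w_bar = pmc_iteration(target, alpha, mu, var, 5000, rng)
```

In this toy case the target is itself a two-component Gaussian mixture, so after a few iterations the component means, variances and weights should settle near (-3, 3), (1, 1) and (0.5, 0.5), and the normalised weights become nearly uniform.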
2. Monitoring convergence
The above update process can be repeated a number of times, and although there is no need for a formal stopping rule, some measures of performance against the target density can be used as a guide. As the objective of importance function adaptations is to minimise the Kullback divergence between the target density and the importance function, we can stop the process when further adaptations do not result in significant improvements in $K(\pi \,\|\, q^t)$. To this end, it can be shown that $\exp[-K(\pi \,\|\, q^t)]$ may be estimated by the normalised perplexity
$$ p = \exp(H_N^t)/N, \tag{18} $$
where
$$ H_N^t = -\sum_{n=1}^{N} \bar{w}_n^t \log \bar{w}_n^t \tag{19} $$
is the Shannon entropy of the normalised weights, a frequently used measure of the quality of an importance sample. Thus, minimization of the Kullback divergence can be approximately connected with the maximization of the perplexity (18). Values of this criterion close to 1 will therefore indicate good agreement between the importance function and the target density.
Another frequently used criterion for importance sampling is the so-called effective sample size (ESS),
$$ \mathrm{ESS}_N^t = \left( \sum_{n=1}^{N} \left\{ \bar{w}_n^t \right\}^2 \right)^{-1}, \tag{20} $$
which lies in the interval [1, N] and can be interpreted as the number of sample points with nonzero weight [28]. Both measures (18, 20) are interconnected, as an importance function which is close to the target density will have both a high normalised perplexity and a relatively large number of points with nonzero weight, compared to an ill-fitting importance function. Given a real-valued function f of interest, one can also estimate the asymptotic variance of the self-normalised importance sampling estimator $\hat{\pi}_N^t(f) = \sum_{n=1}^{N} \bar{w}_n^t f(x_n^t)$ (cf. eq. 7) using the importance sample itself, as
$$ N \sum_{n=1}^{N} \left\{ \bar{w}_n^t \left( f(x_n^t) - \hat{\pi}_N^t(f) \right) \right\}^2. \tag{21} $$
Beware that this formula (which is derived from theorem 2 of [29]) is only valid with normalised weights, and that it is a variance conditional on the current importance proposal $q^t$, i.e. it does not take into account the adaptation. This measure can be related to the so-called integrated autocorrelation time (IAT) used for Markov chain Monte Carlo simulations, which, in this case, takes into consideration the level of autocorrelation present in the chain [5, 20, 30].
FIG. 1: Test target density on the ($x_1$, $x_2$) plane. Contours represent the 68.3% (blue), 95% (green) and 99% (black) confidence regions.
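The diagnostics of eqs. (18) to (21) are cheap to evaluate once the normalised weights are available. The sketch below is one possible transcription; the function names are hypothetical, and zero weights are skipped in the entropy sum using the usual convention that 0 log 0 = 0.

```python
import math

def normalised_perplexity(w_bar):
    """exp(H)/N, with H the Shannon entropy of the normalised weights, eqs. (18)-(19)."""
    entropy = -sum(w * math.log(w) for w in w_bar if w > 0.0)
    return math.exp(entropy) / len(w_bar)

def effective_sample_size(w_bar):
    """ESS = 1 / sum_n (w_bar_n)^2, eq. (20)."""
    return 1.0 / sum(w * w for w in w_bar)

def is_variance_estimate(w_bar, f_values):
    """N * sum_n {w_bar_n (f(x_n) - pi_hat)}^2, eq. (21); requires normalised weights."""
    pi_hat = sum(w * f for w, f in zip(w_bar, f_values))
    return len(w_bar) * sum((w * (f - pi_hat)) ** 2 for w, f in zip(w_bar, f_values))

uniform = [0.25, 0.25, 0.25, 0.25]      # ideal case: importance function equals target
degenerate = [1.0, 0.0, 0.0, 0.0]       # worst case: a single point carries all the weight
perp_uniform = normalised_perplexity(uniform)
perp_degenerate = normalised_perplexity(degenerate)
ess_uniform = effective_sample_size(uniform)
ess_degenerate = effective_sample_size(degenerate)
var_uniform = is_variance_estimate(uniform, [0.0, 1.0, 0.0, 1.0])
```

The two extreme cases bracket the range of both criteria: uniform weights give a normalised perplexity of 1 and ESS = N, while a single dominant weight gives a perplexity of 1/N and ESS = 1.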
3. A first illustration
To illustrate the PMC approach, we explore a banana-like target density presented in Fig. 1. The same target distribution will be studied in greater depth in the next section. The results of the first eleven iterations of the PMC algorithm using a mixture of Student-t densities are shown in Fig. 2 (see also Appendix A for details of the update procedure).
While this target density shows slightly more pronounced curvature than is typical of a posterior density in practice, it serves to illustrate the process of adaptation of the importance function. The initial importance function $q^1$ is a mixture of multivariate Student-t densities, consisting of nine components placed around the centre of the range for the first two variables, each with a relatively large variance (200 for the first dimension, 50 for the second) and degrees of freedom ν = 9. The different coloured circles in Fig. 2 indicate the location of the component means, and the circle size is proportional to the weight $\alpha_d$ associated with the component. At the fourth iteration (t = 4), we see that the importance function starts to resemble the shape of the target density, with components becoming more separated and moving into the tails of the target. By the sixth iteration (t = 6) the importance function has further adapted to the shape of the target banana density. For this target density and importance function, Fig. 3 presents estimates of the normalised perplexity and normalised effective sample size (ESS/N) for the first 10 iterations over 500 simulation runs. As shown, the estimates of the normalised perplexity improve rapidly from approximately 0.14 for the second importance function (Iteration 2) to approximately 0.81 for the last importance function (Iteration 10), with a similar increase in estimates of the normalised effective sample size (ESS/N, increasing from 0.10 to 0.60). For this importance function and target density, the normalised perplexity starts to level off after the 10th iteration (around 0.82), indicating that there is no need for further adaptation of the proposal density. As mentioned previously however, in general, one does not need to observe the convergence of the proposal (as for MCMC) in order to stop the sampling process.
An important consideration, and a choice that needs to be made at the start of the algorithm, is the set of parameter values $\alpha^1$, $\mu^1$ and $\Sigma^1$ for the initial importance function $q^1$, including the degrees of freedom ν in the case of the Student-t mixture, and the sample size N. A poor initial importance function, such as one that is tightly centred around only one mode in the case of a multimodal posterior, or a narrow importance function with light tails, may take a long time to adapt or may miss important parts of the posterior. For importance sampling the choice of q requires both fat tails and a reasonable match between q and the target π in regions of high density. Such an importance function can be more easily constructed in the presence of a well-informed guess about the parameters and possibly the shape of the posterior density. Sample size considerations also play an important role: smaller samples can adapt quite quickly with less computational time but may provide less reliable information about the posterior density relative to larger samples. Such considerations are important as we look at posterior densities of increasingly high dimensions, and thus we can expect to take a larger sample size as the dimension of the problem increases. We will discuss these issues further in the context of simulated and actual data, and also in Sect. V.
FIG. 2: Evolution of the importance function for the target density (see Fig. 1) over 11 iterations of 10k points for $x_1$ (horizontal axis) and $x_2$ (vertical axis), except for the last iteration (11) which is a sample of 100k points. Iterations 1 (top left) to 11 (bottom right) from left to right with every second iteration shown (i.e. 1, 3, 5, 7, 9, 11). Colours indicate the mixture components, with the mean of each component indicated by coloured dots and approximate 95% confidence regions for the sample of points from each component by coloured ellipses. Every 3rd sample point from the importance function is plotted.
III. SIMULATIONS
In this section, we test the performance of PMC using
simulated data, and compare the results to an adaptive
MCMC procedure.
1. Target density
In order to provide a good test for both approaches we use the target density considered in [19], which is difficult to explore but which also provides a realistic scenario for many problems encountered in cosmology. The target density is based on a centred p-multivariate normal, $x \sim N_p(0, \Sigma)$ with covariance $\Sigma = \mathrm{diag}(\sigma_1^2, 1, \ldots, 1)$, which is slightly twisted by changing the second coordinate $x_2$ to $x_2 + b(x_1^2 - \sigma_1^2)$. Other coordinates are unchanged. We obtain a twisted density which is centred with uncorrelated components. Since the Jacobian of the twist is equal to 1, the target density is:
$$ (x_1,\; x_2 + b(x_1^2 - \sigma_1^2),\; x_3, \ldots, x_p) \sim N_p(0, \Sigma). \tag{22} $$
For the target density that we will consider, we set p = 10, $\sigma_1^2 = 100$, b = 0.03, which results in a banana-shaped density in the first two dimensions (see Fig. 1).
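The twisted Gaussian of eq. (22) is simple to code, which makes it a convenient test case. The sketch below gives an unnormalised log-density and, for checking purposes, an exact sampler obtained by drawing the untwisted Gaussian and undoing the twist on the second coordinate. The function names `log_banana` and `sample_banana` are hypothetical; the default parameter values are those quoted in the text.

```python
import random

def log_banana(x, sigma1_sq=100.0, b=0.03):
    """Unnormalised log-density of the twisted Gaussian of eq. (22)."""
    y2 = x[1] + b * (x[0] ** 2 - sigma1_sq)         # twisted second coordinate
    quad = x[0] ** 2 / sigma1_sq + y2 ** 2 + sum(xi ** 2 for xi in x[2:])
    return -0.5 * quad

def sample_banana(rng, p=10, sigma1_sq=100.0, b=0.03):
    """Exact draw: y ~ N_p(0, Sigma), then x2 = y2 - b (y1^2 - sigma1^2)."""
    y = [rng.gauss(0.0, sigma1_sq ** 0.5)] + [rng.gauss(0.0, 1.0) for _ in range(p - 1)]
    y[1] -= b * (y[0] ** 2 - sigma1_sq)
    return y

rng = random.Random(0)
draws = [sample_banana(rng) for _ in range(20_000)]
mean_x2 = sum(d[1] for d in draws) / len(draws)     # close to 0: the density is centred
var_x1 = sum(d[0] ** 2 for d in draws) / len(draws) # close to sigma_1^2 = 100
log_at_mode = log_banana([0.0, 3.0] + [0.0] * 8)    # (0, 3, 0, ..., 0) maps to the origin
```

Having exact draws available is useful when judging how well a sampler covers the tails, since reference values for tail probabilities can be computed directly.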
FIG. 3: Normalised perplexity (top panel, axis label NPERPL) and normalised effective sample size (ESS/N) (bottom panel, axis label NESS) estimates for the first 10 iterations of PMC (represented in Fig. 2) over 500 simulation runs. The distributions are shown as whisker plots: the thick horizontal line represents the median; the box shows the interquartile range (IQR), containing 50% of the points; the whiskers indicate the interval 1.5×IQR from either Q1 (lower) or Q3 (upper); points outside the interval [Q1,Q3] (outliers) are represented as circles.
For the target density considered, interest is in how well PMC and MCMC are able to approximate the tails of the target. Whilst the curvature present in the first two dimensions of this target density is slightly more pronounced than what is typically seen in practice for cosmology, it serves to highlight the difficulties faced by both PMC and MCMC in covering the parameter space. In particular, little accurate information is available in order to guide the choice of importance function (for PMC) and proposal distribution (for MCMC), and so both approaches are forced to learn about the parameter space.
2. Test run proposal for PMC
For PMC, and in the absence of any detailed a priori information about the target density, except the possible range for each variable, we have chosen the first importance function to be a mixture of multivariate Student-t distributions with components displaced randomly in different directions slightly away from the centre of the range for each variable: the mean of the components is drawn from a p-multivariate Gaussian with mean 0 and covariance equal to $\Sigma_0/5$, where $\Sigma_0$ is some positive definite matrix; the variance for components was chosen to be $\Sigma_0$. We choose a mixture of 9 components of Student-t distributions with ν = 9 degrees of freedom; and $\Sigma_0$ is a diagonal matrix with diagonal entries (200, 50, 4, ..., 4). This choice of (ν, $\Sigma_0$) ensures adequate coverage, albeit somewhat overdispersed, of the feasible parameter region. In this simulated example, Student-t distributions are preferable to Gaussian distributions because the range of the variables is unbounded (in contrast to the cosmology examples to be discussed in Sect. IV).
A representation of the first importance function for the first two dimensions is shown in the top left-hand box in Fig. 2, with a typical evolution over the next few iterations in the other panels. In pilot runs of various importance functions against the target density, the best-fitting importance function required at least seven components in order to adequately represent the coverage of the entire density.
For PMC, an important issue is the sample size for each iteration. A poor initial importance function with a relatively small sample size will take a long time to adapt or it may even be unable to recover sufficiently to provide reasonable parameter estimates. Such problems are more likely to occur as the dimension of the parameter space increases, the so-called curse of dimensionality. For the simulation exercise each iteration is based on a sample of 10k points. To prevent numerical instabilities in the updating of the parameters, components with a very small weight (< 0.002) or containing fewer than 20 sample points are discarded.
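The set-up described in this subsection can be sketched as follows: component means drawn from a Gaussian with covariance $\Sigma_0/5$, and a pruning rule for components with very small weight or too few points. Two details are assumptions not stated in the text: the initial component weights are taken to be equal, and the surviving weights are renormalised after pruning. The names `initial_means` and `prune_components` are hypothetical.

```python
import random

def initial_means(n_components, sigma0_diag, rng, shrink=5.0):
    """Draw component means from N_p(0, Sigma_0 / shrink) for a diagonal Sigma_0."""
    return [[rng.gauss(0.0, (s / shrink) ** 0.5) for s in sigma0_diag]
            for _ in range(n_components)]

def prune_components(alpha, counts, min_weight=0.002, min_points=20):
    """Discard components with weight < min_weight or fewer than min_points points.

    Renormalising the surviving weights is an assumption of this sketch.
    """
    keep = [d for d in range(len(alpha))
            if alpha[d] >= min_weight and counts[d] >= min_points]
    total = sum(alpha[d] for d in keep)
    return keep, [alpha[d] / total for d in keep]

p = 10
sigma0_diag = [200.0, 50.0] + [4.0] * (p - 2)   # diagonal entries of Sigma_0
rng = random.Random(0)
means = initial_means(9, sigma0_diag, rng)      # nine components, as in the test run
alpha0 = [1.0 / 9] * 9                          # equal starting weights (assumption)

# Example of pruning: component 1 has too little weight, component 3 too few points
keep, alpha_pruned = prune_components([0.5, 0.001, 0.3, 0.199], [5000, 10, 3000, 15])
```

Pruning before the update avoids dividing by a near-zero component weight when the means and covariances are re-estimated.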
3. Test run proposal for MCMC
As little information is available for the target density, an adaptive MCMC approach is used, which can allow for faster learning of the target density than using either independent or non-adaptive random walk proposals [31]. For MCMC, the proposal distribution is a centred Gaussian whose covariance matrix is updated along the iterations. An important choice for adaptive MCMC concerns the scaling of the proposal and the rate of adaptation. There has been much research on this [31, 32], and a common choice for the covariance of the Gaussian proposal is $c\Sigma_n$, where $\Sigma_n$ is an estimate of the covariance of the target density at update $n$. The choice $c = 2.38^2/p$ is considered to be optimal when the chain is in its stationary regime, and for target densities that have a product form [31]. For the target density we consider, however, this does not hold: the first two components are not independent despite being uncorrelated, and their dependence is not linear but quadratic. With no other theoretical results to follow, we start with a scaling factor of that form and, for the simulation results to follow, assess the effect on convergence and results of using alternative values. We update the covariance matrix by the recursive formula
\[
\Sigma_n = (1 - a_n)\,\Sigma_{n-1} + a_n S_n \tag{23}
\]
where $\Sigma_{n-1}$ is the sample covariance of the previous update, and $S_n$ is the covariance of the sampled estimates
from the previous update to the current iteration. The value of $a_n$ is $1/n^k$, with $k$ chosen suitably to allow for a cooling of the update, which is a necessary condition to ensure convergence of this adaptive MCMC to the target density as well as convergence of the empirical averages [32, 33].
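A minimal Python sketch of one step of the recursion (23) is given below, as an illustration rather than the implementation used for the results. The function names, the use of `numpy.cov` (unbiased normalisation) for $S_n$, and the convention that `n` counts updates starting from 1 are assumptions.

```python
import numpy as np

def update_proposal_cov(sigma_prev, batch, n, k=0.5):
    """One update Sigma_n = (1 - a_n) Sigma_{n-1} + a_n S_n with a_n = 1 / n**k.

    sigma_prev : covariance estimate from the previous update, shape (p, p)
    batch      : chain points sampled since the previous update, shape (m, p)
    n          : index of the current update (assumed to start at 1)
    """
    a_n = 1.0 / n**k
    s_n = np.cov(np.asarray(batch, dtype=float), rowvar=False)
    return (1.0 - a_n) * np.asarray(sigma_prev, dtype=float) + a_n * s_n

def proposal_scale(p, factor=2.38):
    """Scaling c = factor**2 / p applied to the covariance estimate."""
    return factor**2 / p
```

The Gaussian proposal covariance at update $n$ would then be `proposal_scale(p) * update_proposal_cov(...)`. Since $a_1 = 1$, the first update replaces the starting covariance entirely, and later updates are progressively damped.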
In pilot runs, we explored the effect of this schedule for various values of $(k,c)$ in $(0,1) \times (0, 2.38^2/p)$, and we observed that the choice of $(k,c)$ affects both the time to convergence (for the estimation of the quantities of interest, see below) and the acceptance rate of the chain.
To ensure a fair comparison with PMC, we start the chain at a random point drawn from the same Gaussian distribution as for PMC (i.e. $\mathcal{N}_d(0, \Sigma_0/5)$, using the same values for $\Sigma_0$ as used for PMC). We also explored in pilot runs the role of the initial value of the chain: although MCMC is known to be sensitive to the choice of the initial position of the chain (which has no real counterpart in PMC), this was not found to have a major impact on performance in this particular study, at least for a reasonable choice of the initial value. We also fixed the update schedule to be every 10k points, and we assessed the effect on the results of using fewer or more points before updating. For the simulation results to follow, $(k,c)$ has been set to $(0.5, 2.38^2/p)$, which ensures convergence after the burn-in period (see Sect. IIIA) and a mean acceptance rate at convergence of about 10%. The proposal distribution is updated for the entire length of the chain; the adaptation is not stopped after the burn-in period.
A. Test runs
For PMC and the proposal outlined, the perplexity appeared to level off at around the 10th iteration, so for the results to follow we ran the PMC algorithm for 10 iterations (10k points per iteration) and used a final draw of 100k points. To assess MCMC for the same number of points we used a chain length of 200k points with a burn-in of 100k points. Results for both approaches at successive intervals before 200k points are also provided. To assess the performance of the approaches, each simulation was replicated 500 times.
B. Results of the simulations
For the results of the simulations, we are interested in both the mean estimates of the parameters (in particular for $x_1$ and $x_2$) and estimates of the confidence region coverage (68.3% and 95%), which provide an indication of how well both approaches cover the tails of the target density. For each run $r = 1,\dots,500$, we provide the results for various functions $f$ of interest:
\begin{align*}
f_a(x) &= x_1, & f_b(x) &= x_2, \\
f_c(x) &= \mathbf{1}_{68.3}(x), & f_d(x) &= \mathbf{1}_{95}(x), \\
f_e(x) &= \mathbf{1}_{68.3}(x_1, x_2), & f_g(x) &= \mathbf{1}_{95}(x_1, x_2), \\
f_h(x) &= \mathbf{1}_{68.3}(x_1), & f_i(x) &= \mathbf{1}_{95}(x_1).
\end{align*}
Here $\mathbf{1}_q$ denotes the indicator function of the $q\%$ region. $f_h$ and $f_i$ are indicators for the first dimension only, while $f_e$ and $f_g$ involve the first two dimensions. In all cases, the remaining dimensions are marginalised over.
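As a sketch of how such quantities can be estimated from a sample, the following Python function computes a self-normalised weighted average of $f$, which reduces to the plain sample mean for an unweighted MCMC chain. The function name and interface are assumptions; for a coverage function, `f_values` would hold 0 or 1 according to whether each point lies inside the region.

```python
import numpy as np

def estimate_pi_f(f_values, weights=None):
    """Estimate pi(f) from sample values f(x_n).

    weights : importance weights (need not be normalised) for a PMC sample;
              None for an MCMC chain, in which case all points count equally.
    """
    f_values = np.asarray(f_values, dtype=float)
    if weights is None:
        return f_values.mean()
    w = np.asarray(weights, dtype=float)
    return np.sum(w * f_values) / np.sum(w)
```

For example, with indicator values (1, 0, 1, 1) the unweighted estimate of the coverage is 0.75, while weights (1, 3, 0, 0) give 0.25.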
Table I shows the results for the estimation of $\pi(f)$ for the functions $f_a$ and $f_b$ ($\bar{x}_1$ and $\bar{x}_2$, respectively). The results show the mean and standard deviation of the estimates calculated over 500 runs. Although the performance is quite similar for both methods, PMC does display a twofold reduction in standard deviation compared to MCMC for both functions. A closer look at the results reveals that for $\pi(f_a)$ the empirical distributions of the estimates (see Figure 4) are quite similar for both methods, except for the variance, which is much reduced for PMC. For $\pi(f_b)$, on the other hand, the empirical distribution of the estimates for PMC is quite skewed, resulting in a slight positive bias for the majority of the runs (second panel of Figure 4). The difference between $\pi(f_a)$ and $\pi(f_b)$ can be explained from Figure 1, which shows that failure to sufficiently visit the downward low-probability tails does indeed imply positively biased estimates for the mean of the second component. PMC does appear to be more sensitive to this issue than MCMC, despite the fact that the estimates for MCMC display a larger overall variability.
Figure 5 provides the results for the confidence region coverage. To depict the variability of the data, the results are displayed using whisker plots. The results from both PMC and MCMC against all of the performance measures are similar, with both showing good coverage of the target density. The distribution of this estimator is again more skewed for PMC than it is for MCMC, particularly for the 95% regions in the bottom panel of Figure 5. Nevertheless, the variability of the estimates
TABLE I: Results of the simulations for the 10-dimensional banana-shaped target density over 500 runs for both PMC and MCMC.

                      PMC      MCMC
π(f_a)     mean      0.097    0.028
           std       0.218    0.536
π(f_b)     mean      0.013    0.002
           std       0.163    0.315
acceptance             -      0.11
perplexity           0.80       -
FIG. 4: Evolution of $\pi(f_a)$ (top panels) and $\pi(f_b)$ (bottom panels) from 10k points to 100k points for both PMC (left panels) and MCMC (right panels). See Fig. 3 for details about the whisker plot representation.
[Figure 5: whisker plots for PMC and MCMC, labelled d10, d2 and d1; axis label: proportion of points inside.]
FIG. 5: Results showing the distributions of the PMC and the MCMC estimates $\pi(f)$ for (top) $f = f_c, f_e, f_h$ and (bottom) $f = f_d, f_g, f_i$ (in this order, left to right). All estimates are based on 500 simulation runs. See Fig. 3 for details about the whisker plot representation.
obtained with PMC is also significantly reduced compared to MCMC.
Figure 4 shows the evolution of the results for $\pi(f_a)$ and $\pi(f_b)$ from 10k points to 100k points for both PMC (left panels) and MCMC (right panels). In general, the results in Fig. 4 highlight the reduction in variance of the Monte Carlo estimates for PMC in comparison to MCMC. In particular, it is interesting to note that the variance of the estimates of either $\pi(f_a)$ or $\pi(f_b)$ after 100k posterior evaluations under MCMC is comparable to that of the estimates obtained using PMC at the second iteration (20k points).
Simulating from this target distribution is a challenging problem for both methods. In particular, the use of a vague initial importance function in a multi-dimensional space represents a challenge to PMC, and it has been observed that the importance function takes some time to properly adapt to the target density (about 10 iterations). The choice of the initial importance function in PMC is more crucial than the choice of the initial proposal distribution in adaptive MCMC. Although different variations for updating the covariance matrix in the MCMC approach are possible, we did not see a significant improvement in the results presented from using alternative covariance structures. For most of the simulation results, the proposal covariance matrix was observed to adapt relatively quickly to the true covariance matrix. The changes to the covariance structure considered included changes to the update frequency, the starting proposal $\Sigma_0$, the scaling of the proposal (value of $c$) and the adaptation of the covariance (value of $k$). Hence, the PMC approach may require more precise a priori knowledge of the target density than MCMC.
In the next section, we apply the PMC approach to typical cosmological examples, and provide results in comparison to MCMC.
IV. APPLICATION TO COSMOLOGY
We apply our new adaptive importance sampling method to the posterior of cosmological parameters. Flat CDM models with either a cosmological constant (ΛCDM) or a constant dark-energy equation-of-state parameter (wCDM) are explored and tested with recent observational data of CMB anisotropies, supernovae of type Ia and cosmic shear, as described in the next section.
The three data sets and likelihood functions used here are the same as in [34]: the CMB measurements and likelihood are based on the five-year WMAP data release [35], the SNIa data set is the first-year SNLS survey [36], and the cosmic shear data are from the CFHTLS-Wide third release [T0003, 37]. The results presented in the following sections can be compared to the MCMC analysis in [34].
A. Data sets
1. CMB
To obtain theoretical predictions of the CMB temperature and polarization power and cross-spectra we use
the publicly available package CAMB [6]. The likelihood is calculated using the public WMAP5 code [4].
The WMAP5 likelihood takes as input the TT, TE, EE and BB theoretical power spectra calculated by CAMB, and returns a likelihood computed from a sum of different parts. At large angular scales (ℓ ≤ 32 for TT, ℓ ≤ 23 for TE, EE and BB), it computes a pixel-based Gaussian likelihood based on template-cleaned maps [38] and their associated inverse covariance matrices (see Page et al. [39] for details). At small angular scales, it computes an approximate likelihood based on pseudo-spectra and their associated covariance for TT and TE [40], using the (V,W) channel pairs for TT and the (Q,V) channel pairs for TE.
In addition, the likelihood computation takes into account analytic marginalisations over nuisance parameters such as the beam transfer function and point-source uncertainties [40, 41]. We ignore corrections due to SZ and impose a larger (flat) prior on the Hubble constant. Indeed, CMB data alone exhibit a degeneracy between the Hubble constant and e.g. the cosmological constant [42], which is removed by adding other cosmological probes.
The acoustic oscillation peaks in the CMB anisotropy spectrum are a standard ruler at a redshift of about $z = 1100$. The CMB therefore measures the angular diameter distance to that redshift, which depends mainly on the total matter-energy density ($\Omega_m + \Omega_{de}$) and weakly on the Hubble constant $h$. The overall anisotropy amplitude is determined by the large-scale normalization $\Delta^2_R$. The relative height of the peaks is sensitive to the baryonic and dark matter densities. On large scales, secondary anisotropies are generated at late times ($z \lesssim 20$) due to reionisation, which is parametrised by the optical depth $\tau$, and the integrated Sachs-Wolfe (ISW) effect, which is a probe of $\Omega_{de}$.
2. SNIa
The SNLS data set is described in detail in [36]. We use their results from the SNIa light-curve fits which, for each supernova, provide the rest-frame B-band magnitude $m^*_B$, the shape or stretch parameter $s$ and the colour $c$. We use the standard likelihood analysis described in [34], adopted from [36].
Under the assumption that supernovae of type Ia are standard candles we can fit the luminosity distance to the SNIa data. The luminosity distance is a function of $\Omega_m$, $\Omega_{de}$ and $w$. Three additional parameters are the universal SNIa magnitude $M$ and the linear response parameters to stretch and colour, $\alpha$ and $\beta$, respectively. Those three parameters are specific to our choice of distance estimator, and can be regarded as nuisance parameters. The Hubble constant $h$ is integrated into the parameter $M$, so there is no explicit dependence on $h$ in the SNIa posterior.
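To make the dependence of the luminosity distance on the cosmological parameters concrete, the following Python sketch evaluates it by numerical integration for a flat universe with constant $w$, so that $\Omega_{de} = 1 - \Omega_m$. This is an illustration under stated assumptions, not the likelihood code of [34]: the function names, the trapezoidal rule and the explicit value of $h$ are choices made here. In the analysis above $h$ is absorbed into $M$; in this sketch it only rescales the distance.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light in km/s

def luminosity_distance(z, omega_m, w=-1.0, h=0.7, n_steps=2000):
    """Luminosity distance in Mpc for a flat universe with constant w.

    d_L = (1 + z) * (c / H0) * integral_0^z dz' / E(z'), where
    E(z)^2 = omega_m (1+z)^3 + (1 - omega_m) (1+z)^(3 (1 + w)).
    """
    zs = np.linspace(0.0, z, n_steps + 1)
    e_z = np.sqrt(omega_m * (1 + zs) ** 3
                  + (1 - omega_m) * (1 + zs) ** (3 * (1 + w)))
    integrand = 1.0 / e_z
    dz = zs[1] - zs[0]
    # composite trapezoidal rule
    integral = dz * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return (1 + z) * C_KM_S / (100.0 * h) * integral

def distance_modulus(d_l_mpc):
    """Distance modulus 5 log10(d_L / 10 pc) for d_L given in Mpc."""
    return 5.0 * np.log10(d_l_mpc) + 25.0
```

For $\Omega_m = 1$ the integral has the closed form $d_L = (1+z)\,(2c/H_0)\,[1 - (1+z)^{-1/2}]$, which provides a simple check of the numerical result.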
3. Cosmic shear
The CFHTLS-Wide 3rd-year data release (T0003), the data and weak lensing analysis, as well as the cosmological results, are described in [37]. As in [37] we use the aperture-mass dispersion between 2 and 230 arc minutes as the second-order lensing observable [43]. We assume a multivariate Gaussian likelihood function and take into account the correlation between angular scales. The theoretical aperture-mass dispersion is obtained from non-linear models of the large-scale structure [44].
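A multivariate Gaussian log-likelihood with correlated data points can be sketched in Python as follows. The function name and the Cholesky-based evaluation are assumptions made for illustration; `data`, `model` and `cov` stand for the measured values at each angular scale, the corresponding theoretical prediction and the data covariance matrix.

```python
import numpy as np

def gaussian_log_likelihood(data, model, cov):
    """log N(data; model, cov), including the normalisation term.

    The covariance is factorised as cov = L L^T, which yields both the
    quadratic form r^T cov^{-1} r and log det(cov) without an explicit inverse.
    """
    r = np.asarray(data, dtype=float) - np.asarray(model, dtype=float)
    chol = np.linalg.cholesky(np.asarray(cov, dtype=float))
    y = np.linalg.solve(chol, r)          # y = L^{-1} r, so y.y = r^T cov^{-1} r
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (y @ y + log_det + r.size * np.log(2.0 * np.pi))
```

Off-diagonal entries of `cov` carry the correlation between angular scales; setting them to zero recovers a sum of independent Gaussian terms.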
The galaxy redshift distribution is obtained by using the CFHTLS-Deep redshift distribution [45] and rescaling it according to the $i_{AB}$ magnitude distribution of CFHTLS-Wide galaxies. We fit the resulting histogram with eq. (14) from [37], introducing the three fit parameters $a$, $b$, $c$. The histogram data are modelled as a multivariate, uncorrelated Gaussian; the corresponding likelihood is included in the analysis, independently of the lensing likelihood.
Weak gravitational lensing by the large-scale structure is sensitive to the angular diameter distance and the amount of structure in the Universe. It is an important probe of the normalisation $\sigma_8$ on small scales. With the current data, however, this parameter is largely degenerate with $\Omega_m$. This degeneracy is likely to be lifted by future surveys, which will include the measurement of higher-order statistics [46, 47] and shear tomography [48]. From the latter in particular, a great improvement in the determination of $w$ is to be expected, a parameter which has so far been only weakly constrained by lensing [49, 50].
B. Cosmological parameters and priors
We sample a hypercube in parameter space which corresponds to flat priors for all parameters; see Table II for more details. Additional priors exist, both in explicit and implicit form, which exclude regions of parameter space that are unphysical or where numerical fitting formulae break down. For example, we exclude extremely high baryon fractions ($\Omega_b > 0.75\,\Omega_m$) because of numerical problems in the computation of the transfer function. Further, for very low values of both $\Omega_m$ and $\sigma_8$ the pivot scale for the non-linear power spectrum is outside the allowed range. Very rarely, the calculation of the likelihood for individual points in parameter space is unsuccessful because of numerical errors or limitations of the likelihood code. Since these points cannot be taken into account, a pragmatic solution is to formally modify the prior to exclude those points. Note that these rare cases occur mainly in regions of very low likelihood.
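The prior structure described above can be sketched in Python as follows. The function names, the dictionary-based parameter interface and the set of exceptions treated as a failed likelihood evaluation are assumptions; the box limits are supplied by the caller, and only the baryon-fraction cut is taken from the text.

```python
import math

def log_prior(params, bounds):
    """Flat (unnormalised) log-prior on a hypercube with one explicit extra cut.

    params : dict mapping parameter name -> value
    bounds : dict mapping parameter name -> (minimum, maximum)
    """
    for name, (lo, hi) in bounds.items():
        if not lo <= params[name] <= hi:
            return -math.inf
    # Explicit additional prior: exclude extremely high baryon fractions.
    if params["omega_b"] > 0.75 * params["omega_m"]:
        return -math.inf
    return 0.0

def log_posterior(params, bounds, log_likelihood):
    """Log-posterior; points where the likelihood code fails are excluded."""
    lp = log_prior(params, bounds)
    if lp == -math.inf:
        return lp
    try:
        return lp + log_likelihood(params)
    except (ArithmeticError, ValueError):
        # Pragmatic solution: formally modify the prior to exclude this point.
        return -math.inf
```

A point with weight $-\infty$ in log space simply receives zero importance weight, so no special handling is needed downstream.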
C. Initial choice of the importance function
As described earlier in Sect. IIC3, it is important to have a good guess for the initial importance function. In
TABLE II: Parameters for the cosmology likelihood. C = CMB, S = SNIa, L = lensing.

Symbol   Description                       Minimum   Maximum   Experiment
Ωb       Baryon density                    0.01      0.1       C L
Ωm       Total matter density              0.01      1.2       C S L
w        Dark-energy eq. of state          −3.0      0.5       C S L
ns       Primordial spectral index         0.7       1.4       C L
∆²_R     Normalization (large scales)                          C
σ8       Normalization (small scales)^a                        C L
h        Hubble constant                                       C L
τ        Optical depth                                         C
M        Absolute SNIa magnitude                               S
α        Colour response                                       S
β        Stretch response                                      S
a        galaxy z-distribution fit                             L
b        galaxy z-distribution fit                             L
c        galaxy z-distribution fit                             L

^a For WMAP5, σ8 is a deduced quantity that depends on the other parameters.
all cases considered here, we rely on an estimate of the maximum-likelihood point and the Hessian at that point (Fisher matrix) to build our initial proposals. We use the conjugate-gradient approach [51] to find the maximum-likelihood point at which to calculate the Fisher matrix $F$ using the theoretical model. We construct a mixture model consisting of $D$ Gaussian components. Student-t mixtures with small degrees of freedom were tested and turned out to be a bad approximation to the posterior under study, resulting in low perplexities. Each mixture component is shifted randomly from the maximum by a small amount. A random scaling is applied to the covariance of each component, i.e. the eigenvectors and the ratios between the eigenvalues of the covariance are the same as those of the inverse of the Fisher matrix.
We obtain good results for shifts of about 0.5% to 2% of the box size. Here, a trade-off has to be found between shifts that are too large (resulting in low importance weights) and shifts that are too small (components stay near the maximum and the posterior tails do not get sampled). The stretch factor is chosen randomly between typical values of 1 and 2. In some cases, in particular with high dimensionality, the derivation of the Fisher matrix is not stable and the matrix is numerically singular. In such cases we set the off-diagonal elements of $F$ to zero.
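The construction of the initial mixture can be sketched in Python as below. This is one possible reading of the recipe, not the code used for the results: the function name, the equal initial component weights, the independent random sign and size of the shift in each dimension, and the application of the stretch factor to the covariance (rather than to the standard deviations) are all assumptions.

```python
import numpy as np

def initial_mixture(ml_point, fisher, box_min, box_max, n_components=5,
                    shift_frac=(0.005, 0.02), stretch=(1.0, 2.0), rng=None):
    """Build a Gaussian mixture around the maximum-likelihood point.

    Returns (weights, means, covariances) for n_components components.
    """
    rng = np.random.default_rng() if rng is None else rng
    ml_point = np.asarray(ml_point, dtype=float)
    fisher = np.asarray(fisher, dtype=float)
    box = np.asarray(box_max, dtype=float) - np.asarray(box_min, dtype=float)
    try:
        base_cov = np.linalg.inv(fisher)
        np.linalg.cholesky(base_cov)      # raises if not positive definite
    except np.linalg.LinAlgError:
        # Singular Fisher matrix: set its off-diagonal elements to zero.
        base_cov = np.diag(1.0 / np.diag(fisher))
    means, covs = [], []
    for _ in range(n_components):
        frac = rng.uniform(*shift_frac, size=ml_point.size)
        sign = rng.choice([-1.0, 1.0], size=ml_point.size)
        means.append(ml_point + sign * frac * box)      # shift of 0.5%-2% of the box
        covs.append(rng.uniform(*stretch) * base_cov)   # same eigenvectors and ratios
    weights = np.full(n_components, 1.0 / n_components)
    return weights, np.array(means), np.array(covs)
```

Because every component covariance is a scalar multiple of the same matrix, the eigenvectors and eigenvalue ratios are preserved, as required by the description above.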
We found a sample size between 7500 and 10000 points to be adequate for most cases. The number of components $D$ of the initial importance function was chosen between 5 and 10. For the final iteration we used a sample size five times that of the initial sample size.
D. Results
1. General performance
The PMC algorithm is reliable and very efficient in sampling and exploring the parameter space. Both the perplexity and the effective sample size increase quickly with each iteration (Fig. 6). The perplexity reaches values of 0.95 or more in many cases, although the final values are lower for higher-dimensional posteriors in particular. Satisfactory results (i.e. yielding means and marginals consistent with MCMC, see below) are obtained for perplexities larger than about 0.6.
The distribution of importance weights gets narrower from iteration to iteration (Fig. 7). Initially, many sampled points exhibit very low weights. After a few iterations, the importance function has moved towards the posterior, increasing the efficiency of the sampling.
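Both diagnostics can be computed directly from the importance weights. The Python sketch below assumes the usual definitions, namely the normalised perplexity $\exp(H)/N$ with $H = -\sum_n \bar{w}_n \log \bar{w}_n$ the Shannon entropy of the normalised weights, and $\mathrm{ESS} = 1/\sum_n \bar{w}_n^2$. The function names and the log-weight input are choices made here for numerical stability.

```python
import numpy as np

def normalised_weights(log_w):
    """Normalise importance weights given in log space so that they sum to one."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())   # subtract the maximum to avoid overflow
    return w / w.sum()

def perplexity(w_bar):
    """Normalised perplexity exp(H) / N, a number between 1/N and 1."""
    nz = w_bar[w_bar > 0]             # terms with zero weight contribute nothing
    entropy = -np.sum(nz * np.log(nz))
    return np.exp(entropy) / w_bar.size

def normalised_ess(w_bar):
    """Normalised effective sample size ESS / N."""
    return 1.0 / (w_bar.size * np.sum(w_bar ** 2))
```

Both quantities equal 1 when all weights are equal, and drop to $1/N$ when a single point carries all the weight.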
[Figure 6: perplexity (left) and ESS/N (right), each on a scale from 0 to 1, plotted against N/10000.]
FIG. 6: Perplexity (left panel) and normalised effective sample size ESS/N (right panel), as a function of the cumulative sample size N. The likelihood is WMAP5 for a flat ΛCDM model with six parameters.
Our initial mixture model starts with all mixture components close to the maximum-likelihood point. With consecutive iterations the components spread out to better cover the region where the posterior is significant. This can be seen in Fig. 8.
Compared to traditional MCMC, our new PMC method is faster by orders of magnitude. The time-consuming calculation of the posterior can be performed in parallel, and therefore a speed-up by a factor of the
[Figure 7: frequency (logarithmic scale, 0.001 to 1) against log(importance weight), with one curve each for iterations 0, 3, 6 and 9.]
FIG. 7: Histogram of the normalised importance weights $\bar{w}_t$ for four iterations $t = 0, 3, 6, 9$. The posterior is WMAP5, flat ΛCDM model with six parameters.
FIG. 8: Lower left panel: Overlaid on the SNIa confidence contours (68%, 95%, 99.7%) is the movement of the importance function. For each iteration a circle is plotted at the position of the mean of each component, where different colours indicate different components. The circle size indicates the component weight. The starting point (first iteration, at (0.3, −1.3)) is marked by a thick circle. The other two panels show the mean positions in projection, fanned out as a function of the iteration.
number of CPUs is obtained. At a time when clusters of multi-core processors are readily available, this speed-up is easily of the order of 100. In addition, MCMC has a low efficiency, with typical acceptance rates of 0.25 to 0.3. The PMC normalised effective sample size in the WMAP5 case is 0.7, which results in a much larger final sample for the same number of posterior calculations of
[Figure 9: sampled points in the ($\Omega_m$, $w$) plane, with $\Omega_m$ from 0.0 to 0.8 and $w$ from −3.0 to 0.0.]
FIG. 9: The sampled points from the final iteration are plotted; the colours indicate the components of the importance function from which the points are drawn (the colours are the same as in Fig. 8). Only one point in five is shown. Note that the density of points does not correspond to the posterior density, since the former has to be weighted by the importance weights.
around 150000.
We emphasise again that with MCMC one can make
only limited use of parallel computing since one has to
wait for each Markov chain to converge, and because it
is not straightforward to combine chains, as mentioned
earlier.
2. Comparison with MCMC
The MCMC results we present here are obtained using either the adaptive MCMC algorithm or a classical one. Indeed, as we show in the following, adaptive MCMC can have some issues that a less efficient classical MCMC algorithm can avoid. Apart from those special cases, classical and adaptive MCMC gave very similar results, the latter usually reaching a better acceptance rate, and thus a better efficiency.
We find excellent agreement between our respective implementations of MCMC (adaptive or not) and PMC. Means, confidence intervals and 2d-marginals are very similar using both methods. The performance of PMC is superior to MCMC in some cases, as illustrated by the following examples.
An inherent problem of MCMC is that even for a long run there can be regions in parameter space that are not sampled in an unbiased way. This is illustrated in Fig. 10. The feature at the 99.7% level of MCMC (left panel, for large values of $-M$ and $\alpha$) is due to an “excursion” of the chain into a low-likelihood region at step 130500, lasting for 300 steps. We ran the chain for 300000 steps and the feature was still visible. A second run of the chain did not
[Figure 10: two panels of marginalised contours, one in the ($\Omega_m$, $w_0$) plane and one in the ($-M$, $\alpha$) plane.]
FIG. 10: Examples of marginalised likelihoods (68%, 95% and 99.7% contours are shown) for PMC (solid blue) and MCMC (dashed green) from the SNIa data.
exhibit this anomaly. This kind of sample “noise” can be prevented by running a chain for a very long time or by combining several (converged) chains. Such features are much less likely to occur in an importance sample, which consists of uncorrelated points.
A second issue concerns parameters which are nearly unconstrained by the data, with the result that the marginalised posterior in that dimension is flat. To illustrate this we choose weak lensing alone, which cannot constrain $\Omega_b$ (Fig. 11). Using the Fisher matrix as the initial Gaussian proposal for adaptive MCMC, the chain stays in a small region in the $\Omega_b$ direction; the covariance being very flat, most jumps end up outside the support of the prior. This results in an updated variance for this parameter which is much too small, and in a poor exploration of the posterior in this flat direction, as shown in Fig. 11. The classical MCMC algorithm with the same proposal yields better results, but with a very low acceptance rate and needing 500000 steps to reach the result presented in Fig. 11. Alternatively, modifying the initial proposal to be smaller and better adapted to the prior, or increasing the covariance stretch factor from the optimal value of $c = 2.38^2/p$ (see Sect. III 3) to $c = 3.2^2/p$, helps the chain to explore more of the parameter space in the later steps of the adaptation. These modifications to the algorithm also result in a very low acceptance rate, and go against the very idea of an adaptive algorithm, since they require very fine tuning of the initial proposal.
With PMC we obtain a much better performance and recover the flat posterior very well.
In Tables III and IV we show the mean and 68% confidence intervals for CMB alone for the ΛCDM model and for lensing+SNIa+CMB using wCDM, respectively. The differences in the means and 68% confidence intervals are less than a few percent in most cases. Figure 10 shows that the lower-confidence regions and the correlation between parameters agree very well between (non-adaptive) MCMC and PMC.
[Figure 11: posterior (0 to 25) against $\Omega_b$ (0.02 to 0.10); curves: MCMC no update, adaptive MCMC, PMC.]
FIG. 11: Normalised 1d marginals for $\Omega_b$ from weak lensing alone for PMC (solid blue lines) and MCMC (dashed green: adaptive, dotted red: non-adaptive).
TABLE III: Parameter means and 68% confidence intervals for PMC and (non-adaptive) MCMC from the WMAP5 data.

Parameter    PMC                          MCMC
Ωb           0.04424 +0.00321/−0.00290    0.04418 +0.00321/−0.00294
Ωm           0.2633 +0.0340/−0.0282       0.2626 +0.0359/−0.0280
τ            0.0878 +0.0181/−0.0160       0.0885 +0.0181/−0.0160
ns           0.9622 +0.0145/−0.0143       0.9628 +0.0139/−0.0145
10^9 ∆²_R    2.431 +0.118/−0.113          2.429 +0.123/−0.108
h            0.7116 +0.0271/−0.0261       0.7125 +0.0274/−0.0268
V. DISCUSSION
In this paper, we have introduced and assessed an adaptive importance sampling approach, called Population Monte Carlo (PMC), which aims to overcome the main difficulty in using importance sampling, namely the reliance on a single efficient importance function. PMC achieves this goal by iteratively adapting the importance function towards the target density of interest. A significant appeal of the approach, when compared to alternatives such as MCMC, lies in the possibility of using (massively) parallel sampling, which considerably reduces the computational time involved in the estimation of parameters for many astrophysical and cosmological problems. Simulated and actual data have been used in this work to assess the performance of PMC for the estimation of parameters in Bayesian inference problems with features approaching those of classical cosmological parameter posteriors.
The PMC approach is, in essence, an iterated importance sampling scheme that simultaneously produces, at each iteration, a sample approximately simulated from a
TABLE IV: Parameter means and 68% confidence intervals for PMC using lensing, SNIa and CMB in combination. The (non-adaptive) MCMC results correspond to the values given in Table 5 from [34].

Parameter    PMC                      MCMC
Ωb           0.0432 +0.0027/−0.0024   0.0432 +0.0026/−0.0023
Ωm           0.254 +0.018/−0.017      0.253 +0.018/−0.016
τ            0.088 +0.018/−0.016      0.088 +0.019/−0.015
w            −1.011 ± 0.060           −1.010 +0.059/−0.060
ns           0.963 +0.015/−0.014      0.963 +0.015/−0.014
10^9 ∆²_R    2.413 +0.098/−0.093      2.414 +0.098/−0.092
h            0.720 +0.022/−0.021      0.720 +0.023/−0.021
a            0.648 +0.040/−0.041      0.649 +0.043/−0.042
b            9.3 +1.4/−0.9            9.3 +1.7/−0.9
c            0.639 +0.084/−0.070      0.639 +0.082/−0.070
−M           19.331 ± 0.030           19.332 +0.029/−0.031
α            1.61 +0.15/−0.14         1.62 +0.16/−0.14
−β           −1.82 +0.17/−0.16        −1.82 ± 0.16
σ8           0.795 +0.028/−0.030      0.795 +0.030/−0.027
target distribution $\pi$ and an approximation of $\pi$ in the form of the current proposal distribution. As such, the samples produced by the PMC approach can be exploited as regular importance sampling outputs at any iteration $t$. Samples from previous iterations can be combined [25], and approximations like $\hat{\pi}(f)$ can be updated dynamically, without necessarily requiring the storage of samples.
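A dynamic update of $\hat{\pi}(f)$ without storing samples can be sketched in Python with two running sums. The class below pools the unnormalised importance weights of all iterations on an equal footing, which is only one simple way of combining iterations and is not necessarily the combination scheme of [25]; the class name and interface are assumptions. The weights of different iterations must be on a common scale, i.e. computed as $\pi(x)/q^t(x)$ with the same unnormalised $\pi$.

```python
class RunningISEstimate:
    """Self-normalised importance sampling estimate updated batch by batch."""

    def __init__(self):
        self.sum_w = 0.0    # running sum of unnormalised weights
        self.sum_wf = 0.0   # running sum of weight * f(x)

    def update(self, weights, f_values):
        """Fold in one batch of unnormalised weights and the values f(x_n)."""
        for w, f in zip(weights, f_values):
            self.sum_w += w
            self.sum_wf += w * f

    @property
    def value(self):
        """Current estimate of pi(f)."""
        return self.sum_wf / self.sum_w
```

After each PMC iteration, a call to `update` with the new weights and function values refreshes the estimate, and the sample itself can then be discarded.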
Although adaptation of the importance function has the explicit aim of improving the coverage of the posterior density, there are instances where this objective may not be met. In some cases, successive updates of the importance function may result in: (a) an importance function which is too peaked and which has light tails (invalid importance function); (b) an importance function which fits only one mode (in the case of a multimodal posterior); (c) numerical problems due to the adaptation procedure (usually involving poor conditioning of some of the covariance matrices). Such cases are likely to produce a poor approximation to the integral of interest, or alternatively lead to highly variable parameter estimates over iterations. These problems can be quickly discovered: they are signalled by a poor ESS, and by parameter estimates or a normalised perplexity which do not stabilise after a few iterations.
The occurrence of such cases of poor performance can be significantly reduced by choosing a reasonably well informed initial importance function with a large enough sample size at each iteration, especially at the initial iteration, which requires many points to counterbalance a potentially poor importance function. In general, the initial importance function should be chosen to cover a region of the parameter space that has support larger than the posterior. In the absence of reliable prior information, finding such an importance function may be difficult. One approach may be to locate the components in the centre of the feasible range (if available) for each variable, with reasonably large variances to ensure some coverage of the parameter space. We found this approach to be reasonably successful for the simulated data case discussed in Sect. III. In the presence of some prior information, for example an estimate of the maximum-likelihood point and an approximation of the covariance matrix (via the Hessian), components can be placed around these points with variance comparable to the approximation. Another approach may be to perform a singular value decomposition of the covariance matrix, and make use of the eigenvectors and eigenvalues to place components along the most likely directions of interest. Alternatively and in the same spirit, components can be placed according to the principal points of the resulting sample, using a k-means clustering approach [52]. Both approaches have been reasonably successful for a range of posterior densities examined and, by placing components in regions of high posterior support in addition to the mode, have the potential to further significantly reduce the number of iterations for difficult posterior densities.
The main appeals or advantages of the PMC method
are worth reemphasising at this point:
1. Parallelisation of the posterior calculations
2. Low variance of Monte Carlo estimates
3. Simple diagnostics of ‘convergence’ (perplexity)
We address these three points in more detail now.
(1.) The first advantage, namely the ability to parallelise the computational task, is becoming increasingly useful through the availability of cheap multi-CPU computers and the standardization of clusters of computers. Software to implement the parallelisation task, such as the Message Passing Interface (MPI)$^3$, is publicly available and relatively straightforward to use. For the cosmological examples presented (Sect. IV), we used up to 100 CPUs on a computer cluster to explore the cosmology posteriors. In the case of WMAP5, this reduced the computational time from several days for MCMC to a few hours using PMC.
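The structure of the parallel step in point (1.) can be sketched in Python as follows. This is not the MPI implementation used for the results: the sketch uses the standard-library `concurrent.futures` module as a stand-in, and the function name and default worker count are assumptions. The key property it illustrates is that all points of one PMC iteration are known in advance, so their posterior values can be computed independently.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_log_posterior(points, log_posterior, n_workers=4,
                           executor_cls=ThreadPoolExecutor):
    """Evaluate log_posterior at every sample point, distributing the calls
    over n_workers workers. Results are returned in the order of the points.

    A thread pool is used by default so that the sketch runs anywhere; for a
    CPU-bound likelihood written in pure Python, a process pool (or MPI, as in
    the text) would be needed to obtain an actual speed-up.
    """
    with executor_cls(max_workers=n_workers) as pool:
        return list(pool.map(log_posterior, points))
```

By contrast, each step of a Markov chain depends on the previous one, so the same pattern cannot be applied within a single chain.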
(2.) In general, for PMC and an importance function
that is closely matched to the target density, significant
reductions in the variance of the Monte Carlo estimates
are possible in comparison to estimates obtained using
$^3$ http://www-unix.mcs.anl.gov/mpi/
MCMC [5]. For example, for the posterior estimates for the WMAP5 data we observed a 10-fold reduction in variance for the same number of sample points as used for MCMC. Such reductions suggest that the computational time savings stem not only from the number of CPUs available but also from the smaller total sample size required by PMC to achieve a variability of the estimates similar to that of MCMC. For cosmological applications, this observation is valuable since we observed, e.g. in Fig. 6 for CMB data, that the fit between the adapted importance function and the target posterior is sometimes quite good. By combining samples across iterations, further computational savings are also possible. The absence of a Markov chain construction in PMC can also have the desirable attribute of reducing sample noise, as observed for the SNIa data in Sect. IVD2.
(3.) As shown in Sect. IIC2, the perplexity (eq. 18) is a relatively simple measure of the adequacy of the sampling with respect to the target density of interest. For MCMC and other approaches which rely on formal measures of convergence, the assessment of convergence can be very difficult, with users facing a potentially large array of associated diagnostic tests.
In addition to the above points, a further appeal of
PMC is the ability to provide a very good approximation
to the marginal posterior or evidence, which naturally
follows as a byproduct of the approach. To demonstrate
this appeal, further research is underway to explore the
use of PMC in the context of model selection problems
in cosmology.
Acknowledgments
We acknowledge the use of the Legacy Archive for Mi
crowave Background Data Analysis (LAMBDA). Sup
port for LAMBDA is provided by the NASA Office of
Space Science. We thank the Planck group at IAP
and the Terapix group for support and computational
facilities. DW and MK are supported by the CNRS
ANR “ECOSSTAT”, contract number ANR05BLAN
028304 ANR ECOSSTAT. The authors would like to
thank F. Bouchet, S. Bridle, J.M. Marin, Y. Mellier and
I. Tereno for helpful discussions.
APPENDIX A: DETAILS OF THE IMPORTANCE
FUNCTION UPDATES FOR PMC
The method proposed in [16] to adaptively update the parameters of the importance function $q^t$ is based on a variant of the EM (Expectation-Maximization) algorithm [53], which is the standard tool for the estimation of the parameters of mixture densities. We describe below the principle underlying the algorithm of [16], showing in particular that each iteration decreases, up to the importance sampling approximation errors, the Kullback divergence between the target $\pi$ and the importance function $q^t$.
Remember that our goal is to minimise (10), as t in
creases, by iteratively tuning the parameters αt,θtof the
mixture importance function defined in (11). Developing
the logarithm in (10), this objective can be equivalently
formulated in terms of the maximization of the following
quantity
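\begin{equation}
\ell(\alpha,\theta) = \int \log\left( \sum_{d=1}^{D} \alpha_d \varphi(x;\theta_d) \right) \pi(x)\,\mathrm{d}x \tag{A1}
\end{equation}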
with respect to α,θ. Using Bayes’ rule, we denote by
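\begin{equation}
\rho_d(x;\alpha,\theta) = \frac{\alpha_d \varphi(x;\theta_d)}{\sum_{d'=1}^{D} \alpha_{d'} \varphi(x;\theta_{d'})} \tag{A2}
\end{equation}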
the posterior probability that x belongs to the dth
component of the mixture (for the mixture parameters α,θ).
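As a concrete illustration of (A2), the following minimal sketch evaluates the component posterior probabilities for a one-dimensional Gaussian mixture. The one-dimensional restriction and the function names are assumptions made for brevity; the paper works with multivariate components.

```python
import math

def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def responsibilities(x, alphas, means, sds):
    """Posterior probabilities rho_d(x; alpha, theta) of (A2).

    Each entry is alpha_d * phi(x; theta_d) divided by the mixture
    density at x, so the entries sum to one.
    """
    terms = [a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, means, sds)]
    total = sum(terms)
    return [t / total for t in terms]

# Two equally weighted unit-variance components centred at -1 and +1.
rho_mid = responsibilities(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
rho_right = responsibilities(3.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

At the midpoint x = 0 the two components are equally probable, while at x = 3 nearly all of the probability goes to the component centred at +1.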
The EM principle consists of evaluating, at iteration t,
the following intermediate quantity
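\begin{equation}
L_t(\alpha,\theta) = \int \sum_{d=1}^{D} \rho_d(x;\alpha^t,\theta^t) \log\left( \alpha_d \varphi(x;\theta_d) \right) \pi(x)\,\mathrm{d}x. \tag{A3}
\end{equation}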
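Using the concavity of the log as well as the expression
of ρ_d in (A2), it is easily checked that
\begin{equation}
\sum_{d=1}^{D} \rho_d(x;\alpha^t,\theta^t) \log\left( \frac{\alpha_d \varphi(x;\theta_d)}{\alpha^t_d \varphi(x;\theta^t_d)} \right) \le \log\left( \frac{\sum_{d=1}^{D} \alpha_d \varphi(x;\theta_d)}{\sum_{d'=1}^{D} \alpha^t_{d'} \varphi(x;\theta^t_{d'})} \right) \tag{A4}
\end{equation}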
and hence, integrating both sides against π(x), that
L_t(α,θ) − L_t(α^t,θ^t) ≤ ℓ(α,θ) − ℓ(α^t,θ^t).
Thus, any value of α,θ which increases the intermediate
quantity L_t above the level L_t(α^t,θ^t) also results in at
least an equal increase of the actual objective
function ℓ. In the EM algorithm, one sets α^{t+1} and θ^{t+1}
to the values where the intermediate quantity L_t(α,θ) is
maximal, thus satisfying the previous requirement.
Furthermore, the maximization of L_t(α,θ) leads to a
closed-form solution whenever ϕ belongs to a so-called
exponential family of probability densities.
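The monotonicity property can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it uses a one-dimensional Gaussian mixture, a hypothetical target N(1, 1), and the standard closed-form weighted EM updates for Gaussian components, with the integrals against π replaced by sums over normalized importance weights. For a fixed weighted sample, the estimate of ℓ cannot decrease after the update.

```python
import math
import random

def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, alphas, means, sds):
    return sum(a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, means, sds))

def objective_estimate(xs, wbar, alphas, means, sds):
    """Importance sampling estimate of l(alpha, theta): sum_n wbar_n log q(x_n)."""
    return sum(w * math.log(mixture_pdf(x, alphas, means, sds))
               for x, w in zip(xs, wbar))

def em_step(xs, wbar, alphas, means, sds):
    """One weighted EM update (assumed closed form for 1D Gaussian components)."""
    rho = []
    for x in xs:
        terms = [a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, means, sds)]
        total = sum(terms)
        rho.append([t / total for t in terms])
    new_alphas, new_means, new_sds = [], [], []
    for d in range(len(alphas)):
        a_d = sum(w * r[d] for w, r in zip(wbar, rho))
        m_d = sum(w * r[d] * x for w, r, x in zip(wbar, rho, xs)) / a_d
        v_d = sum(w * r[d] * (x - m_d) ** 2 for w, r, x in zip(wbar, rho, xs)) / a_d
        new_alphas.append(a_d)
        new_means.append(m_d)
        new_sds.append(math.sqrt(v_d))
    return new_alphas, new_means, new_sds

# Hypothetical starting importance function: a two-component mixture.
rng = random.Random(1)
alphas_t, means_t, sds_t = [0.5, 0.5], [-2.0, 3.0], [2.0, 2.0]
components = rng.choices(range(2), weights=alphas_t, k=5000)
xs = [rng.gauss(means_t[d], sds_t[d]) for d in components]

# Normalized importance weights for the hypothetical target N(1, 1).
raw = [gaussian_pdf(x, 1.0, 1.0) / mixture_pdf(x, alphas_t, means_t, sds_t)
       for x in xs]
total_w = sum(raw)
wbar = [w / total_w for w in raw]

before = objective_estimate(xs, wbar, alphas_t, means_t, sds_t)
alphas_next, means_next, sds_next = em_step(xs, wbar, alphas_t, means_t, sds_t)
after = objective_estimate(xs, wbar, alphas_next, means_next, sds_next)
```

Here `after` is never smaller than `before`, and the updated mixture weights still sum to one. In an actual PMC run a fresh sample would be drawn from the updated importance function at the next iteration.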
In the example of the multivariate Gaussian density
recalled in (12), the parameter θ_d consists of the mean
µ_d and the covariance matrix Σ_d and the intermediate
quantity may be written as
\begin{equation}
L_t(\alpha,\mu,\Sigma) = \int \sum_{d=1}^{D} \rho_d(x;\alpha^t,\mu^t,\Sigma^t) \left[ \log(\alpha_d) - \frac{1}{2}\left( \log|\Sigma_d| + (x-\mu_d)^T \Sigma_d^{-1} (x-\mu_d) \right) \right] \pi(x)\,\mathrm{d}x, \tag{A5}
\end{equation}
up to terms that do not depend on α, µ or Σ. Routine
calculations show that the maximum of (A5) is achieved