arXiv:1106.2525v1 [math.ST] 13 Jun 2011
Uniform Stability of a Particle Approximation of the Optimal
Filter Derivative∗
Pierre Del Moral†, Arnaud Doucet‡, Sumeetpal S. Singh§
June 14, 2011
Abstract
Sequential Monte Carlo methods, also known as particle methods, are a widely used set
of computational tools for inference in non-linear non-Gaussian state-space models. In many
applications it may be necessary to compute the sensitivity, or derivative, of the optimal filter
with respect to the static parameters of the state-space model; for instance, in order to obtain
maximum likelihood model parameters of interest, or to compute the optimal controller in an
optimal control problem. In Poyiadjis et al. [2011] an original particle algorithm to compute
the filter derivative was proposed and it was shown using numerical examples that the particle
estimate was numerically stable in the sense that it did not deteriorate over time. In this paper
we substantiate this claim with a detailed theoretical study. Lp bounds and a central limit
theorem for this particle approximation of the filter derivative are presented. It is further shown
that under mixing conditions these Lp bounds and the asymptotic variance characterized by
the central limit theorem are uniformly bounded with respect to the time index. We demon-
strate the performance predicted by theory with several numerical examples. We also use the
particle approximation of the filter derivative to perform online maximum likelihood parameter
estimation for a stochastic volatility model.
Some key words: Hidden Markov Models, State-Space Models, Sequential Monte Carlo,
Smoothing, Filter derivative, Recursive Maximum Likelihood.
1 Introduction
State-space models are a very popular class of non-linear and non-Gaussian time series models in
statistics, econometrics and information engineering; see for example Cappé et al. [2005], Doucet et al.
[2001], Durbin and Koopman [2001]. A state-space model comprises a pair of discrete-time
stochastic processes, $\{X_n\}_{n\ge 0}$ and $\{Y_n\}_{n\ge 0}$, where the former is an $\mathcal{X}$-valued unobserved process
and the latter is a $\mathcal{Y}$-valued process which is observed. The hidden process $\{X_n\}_{n\ge 0}$ is a Markov
process with initial law $dx\,\pi_\theta(x)$ and time homogeneous transition law $dx'\,f_\theta(x'|x)$, i.e.
\[
X_0 \sim dx_0\,\pi_\theta(x_0) \quad\text{and}\quad X_n \mid (X_{n-1} = x_{n-1}) \sim dx_n\, f_\theta(x_n|x_{n-1}), \quad n \ge 1. \tag{1.1}
\]
∗First version: January 2011. Cambridge University Engineering Department Technical report number CUED/F-INFENG/TR.668
†Centre INRIA Bordeaux et Sud-Ouest & Institut de Mathématiques de Bordeaux, Université de Bordeaux I, 351
cours de la Libération 33405 Talence cedex, France (Pierre.Del-Moral@inria.fr)
‡Department of Statistics, University of British Columbia, V6T 1Z4 Vancouver, BC, Canada (arnaud@stat.ubc.ca)
§Department of Engineering, University of Cambridge, Trumpington Street, CB2 1PZ, United Kingdom
(sss40@cam.ac.uk)
It is assumed that the observations $\{Y_n\}_{n\ge 0}$ conditioned upon $\{X_n\}_{n\ge 0}$ are statistically independent
and have marginal laws
\[
Y_n \mid \left(\{X_k\}_{k\ge 0} = \{x_k\}_{k\ge 0}\right) \sim dy_n\, g_\theta(y_n|x_n). \tag{1.2}
\]
Here $\pi_\theta(x)$, $f_\theta(x|x')$ and $g_\theta(y|x)$ are densities with respect to (w.r.t.) suitable dominating measures
denoted generically as $dx$ and $dy$. For example, if $\mathcal{X} \subseteq \mathbb{R}^p$ and $\mathcal{Y} \subseteq \mathbb{R}^q$ then the dominating measures
could be the Lebesgue measures. The variable $\theta$ in the densities denotes the parameters of
the model. The set of possible values for $\theta$, denoted $\Theta$, is assumed to be an open subset of $\mathbb{R}^d$. The
model (1.1)-(1.2) is also often referred to as a hidden Markov model in the literature [Cappé et al.,
2005].
For a sequence $\{z_n\}_{n\ge 0}$ and integers $i, j$, let $z_{i:j}$ denote the set $\{z_i, z_{i+1}, \ldots, z_j\}$, which is empty
if $j < i$. Equations (1.1) and (1.2) define the law of $(X_{0:n}, Y_{0:n-1})$, which is given by the measure
\[
dx_0\,\pi_\theta(x_0) \prod_{k=1}^{n} dx_k\, f_\theta(x_k|x_{k-1}) \prod_{k=0}^{n-1} dy_k\, g_\theta(y_k|x_k), \tag{1.3}
\]
from which the probability density of the observed process, or likelihood, is obtained
\[
p_\theta(y_{0:n-1}) = \int dx_0\,\pi_\theta(x_0) \prod_{k=1}^{n} dx_k\, f_\theta(x_k|x_{k-1}) \prod_{k=0}^{n-1} g_\theta(y_k|x_k). \tag{1.4}
\]
For a realization of observations $Y_{0:n-1} = y_{0:n-1}$, let $Q_{\theta,n}$ denote the law of $X_{0:n}$ conditioned on this
sequence of observed variables, i.e.
\[
Q_{\theta,n}(dx_{0:n}) = \frac{1}{p_\theta(y_{0:n-1})}\, dx_0\,\pi_\theta(x_0)\, g_\theta(y_0|x_0) \left( \prod_{k=1}^{n-1} dx_k\, f_\theta(x_k|x_{k-1})\, g_\theta(y_k|x_k) \right) dx_n\, f_\theta(x_n|x_{n-1}).
\]
Let $\eta_{\theta,n}$ denote the time $n$ marginal of $Q_{\theta,n}$. This marginal, which we call the filter, may be computed
recursively using Bayes' formula:
\[
\eta_{\theta,n+1}(dx_{n+1}) = Q_{\theta,n+1}(dx_{n+1}) = dx_{n+1}\, \frac{\int \eta_{\theta,n}(dx_n)\, g_\theta(y_n|x_n)\, f_\theta(x_{n+1}|x_n)}{\int \eta_{\theta,n}(dx'_n)\, g_\theta(y_n|x'_n)}, \quad n \ge 0,
\]
and $\eta_{\theta,0} = \pi_\theta$ by convention. Except for simple models such as the linear Gaussian state-space model
or when $\mathcal{X}$ is a finite set, it is impossible to compute $p_\theta(y_{0:n})$, $Q_{\theta,n}$ or $\eta_{\theta,n}$ exactly. Particle methods
have been applied extensively to approximate these quantities for general state-space models of the
form (1.1)–(1.2); see Cappé et al. [2005], Doucet et al. [2001].
The particle approximation of $Q_{\theta,n}$ is the empirical measure corresponding to a set of $N \ge 1$
random samples termed particles, that is
\[
Q^{p,N}_{\theta,n}(dx_{0:n}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{0:n}}(dx_{0:n}) \tag{1.5}
\]
where $\delta_z(dz)$ denotes the Dirac delta mass located at $z$. This approximation is referred to as the
path space approximation [Del Moral, 2004] and it is denoted by the superscript 'p'. The particle
approximation of $\eta_{\theta,n}$ is obtained from $Q^{p,N}_{\theta,n}$ by marginalization
\[
\eta^{N}_{\theta,n}(dx_n) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{n}}(dx_n).
\]
These particles are propagated in time using importance sampling and resampling steps; see Doucet et al.
[2001] and Cappé et al. [2005] for a review of the literature. Specifically, $Q^{p,N}_{\theta,n+1}$ is the empirical measure
constructed from $N$ independent samples from
\[
\frac{Q^{p,N}_{\theta,n}(dx_{0:n})\, dx_{n+1}\, f_\theta(x_{n+1}|x_n)\, g_\theta(y_n|x_n)}{\int Q^{p,N}_{\theta,n}(dx_{0:n})\, g_\theta(y_n|x_n)}. \tag{1.6}
\]
It is a well known fact that the particle approximation of $Q_{\theta,n}$ becomes progressively impoverished
as $n$ increases because of the successive resampling steps [Del Moral and Doucet, 2003, Olsson et al.,
2008]. That is, the number of distinct particles representing the marginal $Q^{p,N}_{\theta,n}(dx_{0:k})$ for any fixed
$k < n$ diminishes as $n$ increases until it collapses to a single particle; this is known as the particle
path degeneracy problem.
The focus of this paper is on the convergence properties of particle methods which have been recently
proposed to approximate the derivative of the measures $\{\eta_{\theta,n}(dx_n)\}_{n\ge 0}$ w.r.t. $\theta = [\theta_1, \ldots, \theta_d]^{\mathsf{T}} \in \mathbb{R}^d$:
\[
\zeta_{\theta,n} = \nabla \eta_{\theta,n} = \left[ \frac{\partial \eta_{\theta,n}}{\partial \theta_1}, \ldots, \frac{\partial \eta_{\theta,n}}{\partial \theta_d} \right]^{\mathsf{T}}.
\]
(See Section 2 for a definition.) References Cérou et al. [2001] and Doucet and Tadić [2003] present
particle methods which have a computational complexity that scales linearly with the number $N$
of particles. It was shown in Poyiadjis et al. [2011] (see also Poyiadjis et al. [2009] for a more detailed
numerical study) that the performance of these $O(N)$ methods, which inherently rely on the
particle approximations of $\{Q_{\theta,n}\}_{n\ge 0}$ constructed as in (1.6) above, degraded over time and it was
conjectured that this may be attributed to the particle path degeneracy problem. In contrast, the
alternative method of Poyiadjis et al. [2005] was shown in numerical examples to be stable. The
method of Poyiadjis et al. [2005] is a non-standard particle implementation that avoids the particle
path degeneracy problem at the expense of a computational complexity per time step which is
quadratic in the number of particles, i.e. $O(N^2)$; see Section 2 for more details. Supported by
numerical examples, it was conjectured in Poyiadjis et al. [2011] that even under strong mixing assumptions,
the variance of the estimate of the filter derivative computed with the $O(N)$ methods
increases at least linearly in time while that of the $O(N^2)$ method is uniformly bounded w.r.t. the time index.
This conjecture is confirmed in this paper. Specifically, we analyze the $O(N^2)$ implementation of
Poyiadjis et al. [2005] in Section 3 and obtain results on the errors of the approximation; in particular,
$L_p$ bounds and a Central Limit Theorem (CLT) are presented. We show that these $L_p$ bounds
and asymptotic variances appearing in the CLT are uniformly bounded w.r.t. the time index when
the state-space model satisfies certain mixing assumptions. In contrast, the asymptotic variance of
the $O(N)$ implementations, which is also captured through the CLT, is shown to increase linearly.
To the best of our knowledge, these are the first results of this kind.
An important application of our results, which is discussed in detail in Section 4, is to the
problem of estimating the parameters of the model (1.1)–(1.2) from observed data. The estimates
of the model parameters are found by maximizing the likelihood function $p_\theta(y_{0:n})$ with respect to $\theta$
using a gradient ascent algorithm which relies on the particle approximation of the filter derivative.
The results we present in Section 3 have bearing on the performance of the parameter estimation
algorithm, which we illustrate with numerical examples in Section 4. The Appendix contains the
proofs of the main results as well as that of some supporting auxiliary results. As a final remark,
although the algorithms and theoretical results are presented for a state-space model, they may be
reinterpreted for Feynman-Kac models as well.
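The path degeneracy problem described in the Introduction can be illustrated with a minimal sketch. The code below is not the particle filter of (1.6): it assumes uniform weights and multinomial resampling at every step (a deliberate simplification), and the function name `surviving_ancestors` is hypothetical. It only tracks how many distinct time-0 ancestors remain in the genealogy after repeated resampling.

```python
import random

def surviving_ancestors(num_particles, num_steps, seed=0):
    """Count distinct time-0 ancestors after each resampling step.

    Assumption: uniform weights and multinomial resampling, a simplified
    stand-in for the resampling implicit in (1.6).
    """
    rng = random.Random(seed)
    # roots[i] is the index of the time-0 ancestor of particle i.
    roots = list(range(num_particles))
    counts = [len(set(roots))]
    for _ in range(num_steps):
        # Each new particle picks its parent uniformly at random.
        parents = [rng.randrange(num_particles) for _ in range(num_particles)]
        roots = [roots[p] for p in parents]
        counts.append(len(set(roots)))
    return counts

counts = surviving_ancestors(num_particles=20, num_steps=1000, seed=1)
print(counts[0], counts[-1])
```

The count can never increase from one step to the next, which is why any estimate that depends on the whole path $X_{0:n}$ ends up relying on very few distinct early states.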
1.1 Notation and definitions
We give some basic definitions from probability and operator semigroup theory. For a measurable
space $(E, \mathcal{E})$ let $\mathcal{M}(E)$ denote the set of all finite signed measures and $\mathcal{P}(E)$ the set of all probability
measures on $E$. The $n$-fold product space $E \times \cdots \times E$ is denoted by $E^n$. Let $\mathcal{B}(E)$ denote the Banach
space of all bounded real-valued and measurable functions $\varphi : E \to \mathbb{R}$ equipped with the uniform
norm $\|\varphi\| = \sup_{x\in E} |\varphi(x)|$. For $\nu \in \mathcal{M}(E)$ and $\varphi \in \mathcal{B}(E)$, let $\nu(\varphi) = \int \nu(dx)\, \varphi(x)$ be the Lebesgue
integral of $\varphi$ w.r.t. $\nu$. If $\nu$ is a density w.r.t. some dominating measure $dx$ on $E$ then $\nu(\varphi) = \int dx\, \nu(x)\, \varphi(x)$.
We recall that a bounded integral kernel $M(x, dx')$ from a measurable space $(E, \mathcal{E})$ into
an auxiliary measurable space $(E', \mathcal{E}')$ is an operator $\varphi \mapsto M(\varphi)$ from $\mathcal{B}(E')$ into $\mathcal{B}(E)$ such that the
functions
\[
x \mapsto M(\varphi)(x) := \int_{E'} M(x, dx')\, \varphi(x')
\]
are $\mathcal{E}$-measurable and bounded for any $\varphi \in \mathcal{B}(E')$. The kernel $M$ also generates a dual operator
$\nu \mapsto \nu M$ from $\mathcal{M}(E)$ into $\mathcal{M}(E')$ defined by
\[
(\nu M)(\varphi) := \nu(M(\varphi)).
\]
Given a pair of bounded integral operators $(M_1, M_2)$, we let $(M_1 M_2)$ denote the composition operator
defined by $(M_1 M_2)(\varphi) = M_1(M_2(\varphi))$.
A Markov kernel is a positive and bounded integral operator M such that M(1)(x) = 1 for any
x ∈ E. For ϕ ∈ B(E), let
\[
\mathrm{osc}(\varphi) = \sup_{x, x' \in E} |\varphi(x) - \varphi(x')|
\]
and let
Osc1(E) = {ϕ ∈ B(E) : osc(ϕ) ≤ 1}.
Let β(M) ∈ [0,1] denote the Dobrushin coefficient of the Markov kernel M which is defined by the
formula [Del Moral, 2004, Prop. 4.2.1]:
β(M) := sup {osc(M(ϕ)) ; ϕ ∈ Osc1(E′)}.
If there exists a positive constant ρ such that the Markov kernel M satisfies
M(x,dz) ≥ ρM(x′,dz) for all x,x′∈ E then β (M) ≤ 1 − ρ.
For two Markov kernels M1,M2, β(M1M2) ≤ β(M1)β(M2).
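For a finite state space these properties are easy to check numerically. The sketch below assumes the standard finite-state identity that β(M) equals the largest total variation distance between two rows of M; the helper names `dobrushin` and `compose` and the two example matrices are illustrative choices, not taken from the paper.

```python
def dobrushin(kernel):
    """Dobrushin coefficient of a finite Markov kernel given as a list of rows.

    Assumes the finite-state identity beta(M) = max over row pairs of
    0.5 * sum_z |M(x, z) - M(x', z)| (total variation distance).
    """
    beta = 0.0
    for row_a in kernel:
        for row_b in kernel:
            tv = 0.5 * sum(abs(a - b) for a, b in zip(row_a, row_b))
            beta = max(beta, tv)
    return beta

def compose(m1, m2):
    """Kernel composition (M1 M2)(x, z) = sum_u M1(x, u) M2(u, z)."""
    return [[sum(m1[i][u] * m2[u][j] for u in range(len(m2)))
             for j in range(len(m2[0]))] for i in range(len(m1))]

M1 = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
M2 = [[0.5, 0.25, 0.25], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
b1, b2, b12 = dobrushin(M1), dobrushin(M2), dobrushin(compose(M1, M2))

# Largest rho with M1(x, z) >= rho * M1(x', z) for all x, x', z.
rho = min(M1[x][z] / M1[xp][z] for x in range(3) for xp in range(3) for z in range(3))
print(b1, b2, b12, rho)
```

Here `b12 <= b1 * b2` illustrates the contraction of composed kernels, and `b1 <= 1 - rho` illustrates the minorization bound.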
Given a positive function $G$ on $E$, let $\Psi_G : \nu \in \mathcal{P}(E) \mapsto \Psi_G(\nu) \in \mathcal{P}(E)$ be the probability
distribution defined by
\[
\Psi_G(\nu)(dx) := \frac{\nu(dx)\, G(x)}{\nu(G)}
\]
provided $\infty > \nu(G) > 0$. The definitions above also apply if $\nu$ is a density and $M$ is a transition density.
In this case all instances of $\nu(dx)$ should be replaced with $dx\,\nu(x)$ and $M(x, dx')$ by $dx'\,M(x, x')$,
where $dx$ and $dx'$ are generic notation for the dominating measures.
It is convenient to introduce the following transition kernels:
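\[
Q_{\theta,n}(x_{n-1}, dx_n) = g_\theta(y_{n-1}|x_{n-1})\, dx_n\, f_\theta(x_n|x_{n-1}) = dx_n\, q_\theta(x_n|x_{n-1}), \quad n > 0,
\]
\[
Q_{\theta,k,n}(x_k, dx_n) = (Q_{\theta,k+1} Q_{\theta,k+2} \cdots Q_{\theta,n})(x_k, dx_n), \quad 0 \le k \le n,
\]
with the convention that $Q_{\theta,n,n} = \mathrm{Id}$, the identity operator. Note that $Q_{\theta,k,n}(1)(x_k)$ is the density
of the law of $Y_{k:n-1}$ given $X_k = x_k$. For $0 \le p \le n$, define the potential function $G_{\theta,p,n}$ on $\mathcal{X}$ to be
\[
G_{\theta,p,n}(x_p) = \frac{Q_{\theta,p,n}(1)(x_p)}{\eta_{\theta,p} Q_{\theta,p,n}(1)}. \tag{1.7}
\]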
Let the mapping Φθ,k,n: P(X) → P(X), 0 ≤ k ≤ n, be defined as follows
\[
\Phi_{\theta,k,n}(\nu)(dx_n) = \frac{\nu Q_{\theta,k,n}(dx_n)}{\nu Q_{\theta,k,n}(1)}.
\]
It follows that ηθ,n= Φθ,k,n(ηθ,k). For conciseness, we also write Φθ,n−1,nas Φθ,n.
A key quantity that facilitates the recursive computation of the derivative of ηθ,nis the following
collection of backward Markov transition kernels:
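\[
M_{\theta,n}(x_n, dx_{n-1}) = \frac{\eta_{\theta,n-1}(dx_{n-1})\, q_\theta(x_n|x_{n-1})}{\eta_{\theta,n-1}(q_\theta(x_n|\cdot))}, \quad n > 0. \tag{1.8}
\]
Their particle approximations are
\[
M^{N}_{\theta,n}(x_n, dx_{n-1}) = \frac{\eta^{N}_{\theta,n-1}(dx_{n-1})\, q_\theta(x_n|x_{n-1})}{\eta^{N}_{\theta,n-1}(q_\theta(x_n|\cdot))}. \tag{1.9}
\]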
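These backward Markov kernels are convenient for computing certain conditional expectations and
probability measures. In particular, for $\varphi \in \mathcal{B}(\mathcal{X}^2)$, we have
\[
E_\theta[\varphi(X_{n-1}, X_n) \mid y_{0:n-1}, x_n] = \int M_{\theta,n}(x_n, dx_{n-1})\, \varphi(x_{n-1}, x_n),
\]
and the law of $X_{0:n-1}$ given $X_n = x_n$ and $Y_{0:n-1} = y_{0:n-1}$ is $M_{\theta,n}(x_n, dx_{n-1}) \cdots M_{\theta,1}(x_1, dx_0)$.
Finally, the following two definitions are needed for the CLT of the particle approximation of
the derivative of $\eta_{\theta,n}$. The bounded integral operator $D_{\theta,k,n}$ from $\mathcal{X}$ into $\mathcal{X}^{n+1}$ is defined for any
$F_n \in \mathcal{B}(\mathcal{X}^{n+1})$ by
\[
D_{\theta,k,n}(F_n)(x_k) := \int \left( \prod_{j=k}^{1} M_{\theta,j}(x_j, dx_{j-1}) \right) \left( \prod_{j=k}^{n-1} Q_{\theta,j+1}(x_j, dx_{j+1}) \right) F_n(x_{0:n}), \quad 0 \le k \le n, \tag{1.10}
\]
with the convention that $\prod_{\emptyset} = 1$. The particle approximation, $D^{N}_{\theta,k,n}$, is defined to be
\[
D^{N}_{\theta,k,n}(F_n)(x_k) := \int \left( \prod_{j=k}^{1} M^{N}_{\theta,j}(x_j, dx_{j-1}) \right) \left( \prod_{j=k}^{n-1} Q_{\theta,j+1}(x_j, dx_{j+1}) \right) F_n(x_{0:n}). \tag{1.11}
\]
To be concise we write
\[
\eta_{\theta,k}(dx_k)\, D_{\theta,k,n}(x_k, dx_{0:k-1}, dx_{k+1:n}) \quad\text{as}\quad \eta_{\theta,k} D_{\theta,k,n}(dx_{0:n}).
\]
(And similarly for the particle versions.) Although convention dictates that $\eta_{\theta,k} D_{\theta,k,n}$ should be
understood as the measure $(\eta_{\theta,k} D_{\theta,k,n})(dx_{0:k-1}, dx_{k+1:n})$, when we mean otherwise it should be
clear from the infinitesimal neighborhood.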
2 Computing the filter derivative
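For any $F_n \in \mathcal{B}(\mathcal{X}^{n+1})$, we have
\[
\begin{aligned}
\nabla Q_{\theta,n}(F_n)
&= \frac{1}{p_\theta(y_{0:n-1})} \int dx_{0:n}\, \nabla\!\left( \pi_\theta(x_0) \prod_{k=1}^{n} f_\theta(x_k|x_{k-1}) \prod_{k=0}^{n-1} g_\theta(y_k|x_k) \right) F_n(x_{0:n}) \\
&\quad - \frac{1}{p_\theta(y_{0:n-1})}\, E_\theta\{F_n(X_{0:n}) \mid y_{0:n-1}\} \int dx_{0:n}\, \nabla\!\left( \pi_\theta(x_0) \prod_{k=1}^{n} f_\theta(x_k|x_{k-1}) \prod_{k=0}^{n-1} g_\theta(y_k|x_k) \right) \\
&= E_\theta\{F_n(X_{0:n})\, T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}\} - E_\theta\{F_n(X_{0:n}) \mid y_{0:n-1}\}\, E_\theta\{T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}\}
\end{aligned} \tag{2.1}
\]
where
\[
T_{\theta,n}(x_{0:n}) = \sum_{k=0}^{n} t_{\theta,k}(x_{k-1}, x_k), \tag{2.2}
\]
\[
t_{\theta,k}(x_{k-1}, x_k) = \nabla \log\left( g_\theta(y_{k-1}|x_{k-1})\, f_\theta(x_k|x_{k-1}) \right), \quad k > 0, \tag{2.3}
\]
\[
t_{\theta,0}(x_{-1}, x_0) = t_{\theta,0}(x_0) = \nabla \log \pi_\theta(x_0). \tag{2.4}
\]
The first equality in (2.1) follows from the definition of $Q_{\theta,n}$ and interchanging the order of differentiation
and integration. The interchange is permissible under certain regularity conditions [Pflug,
1996]; e.g. a sufficient condition would be the main assumption in Section 3 under which the uniform
stability results are proved. The second equality follows from a change of measure, which
then permits an importance sampling based estimator for the derivative of $Q_{\theta,n}$; this is the well
known score method, e.g. see Pflug [1996, Section 4.2.1]. For any $\varphi_n \in \mathcal{B}(\mathcal{X})$, it follows by setting
$F_n(x_{0:n}) = \varphi_n(x_n)$ in (2.1) that
\[
\begin{aligned}
\nabla \int \eta_{\theta,n}(dx_n)\, \varphi_n(x_n)
&= E_\theta\{\varphi_n(X_n)\, T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}\} - E_\theta\{\varphi_n(X_n) \mid y_{0:n-1}\}\, E_\theta\{T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}\} \\
&= \int \zeta_{\theta,n}(dx_n)\, \varphi_n(x_n)
\end{aligned}
\]
where
\[
\zeta_{\theta,n}(dx_n) = \eta_{\theta,n}(dx_n) \left( E_\theta[T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}, x_n] - E_\theta[T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}] \right). \tag{2.5}
\]
We call $\zeta_{\theta,n}$ the derivative of $\eta_{\theta,n}$.
Given the particle approximation (1.5) of $Q_{\theta,n}$, it is straightforward to construct a particle approximation
of $\zeta_{\theta,n}$: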
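\[
\zeta^{p,N}_{\theta,n}(dx_n) = \frac{1}{N} \sum_{i=1}^{N} \left( T_{\theta,n}(X^{(i)}_{0:n}) - \frac{1}{N} \sum_{j=1}^{N} T_{\theta,n}(X^{(j)}_{0:n}) \right) \delta_{X^{(i)}_{n}}(dx_n). \tag{2.6}
\]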
This approximation is also referred to as the path space method. Such approximations were implicitly
proposed in Cérou et al. [2001] and Doucet and Tadić [2003] and there are several reasons why this
estimate appears attractive. Firstly, even with the resampling steps in the construction of $Q^{p,N}_{\theta,n}$,
$\zeta^{p,N}_{\theta,n}$ can be computed recursively. Secondly, there is no need to store the entire ancestry of each
particle, i.e. $\{X^{(i)}_{0:n}\}_{1\le i\le N}$, and thus the memory requirement to construct $\zeta^{p,N}_{\theta,n}$ is constant over
time. Thirdly, the computational cost per time step is $O(N)$. However, as $Q^{p,N}_{\theta,n}$ suffers from the particle
path degeneracy problem, we expect the approximation $\zeta^{p,N}_{\theta,n}$ to worsen over time. This was indeed
observed in numerical examples in Poyiadjis et al. [2011] and it was conjectured that the asymptotic
variance (i.e. as $N \to \infty$) of $\zeta^{p,N}_{\theta,n}$ for bounded integrands would increase linearly with $n$ even under
strong mixing assumptions. This is now proven in this article.
An alternative particle method to approximate $\{\zeta_{\theta,n}\}_{n\ge 0}$ has been proposed in Poyiadjis et al.
[2005, 2011]. We now reinterpret this method using the representation in (2.5) and a different particle
approximation of $Q_{\theta,n}$ that avoids the path degeneracy problem.
The measure $Q_{\theta,n}$ admits the following backward representation
\[
Q_{\theta,n}(dx_{0:n}) = \eta_{\theta,n}(dx_n) \prod_{k=n}^{1} M_{\theta,k}(x_k, dx_{k-1})
\]
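and the corresponding particle approximation of $Q_{\theta,n}$ is given by
\[
Q^{N}_{\theta,n}(dx_{0:n}) = \eta^{N}_{\theta,n}(dx_n) \prod_{k=n}^{1} M^{N}_{\theta,k}(x_k, dx_{k-1})
\]
where $M^{N}_{\theta,k}$ was defined in (1.9). This now gives rise to the following particle approximation of $\zeta_{\theta,n}$
[Poyiadjis et al., 2005, 2011]:
\[
\zeta^{N}_{\theta,n}(\varphi_n) = \int Q^{N}_{\theta,n}(dx_{0:n})\, T_{\theta,n}(x_{0:n}) \left( \varphi_n(x_n) - \eta^{N}_{\theta,n}(\varphi_n) \right)
\]
and indeed $\eta^{N}_{\theta,n}(\varphi_n) = \int Q^{N}_{\theta,n}(dx_{0:n})\, \varphi_n(x_n)$. It is apparent that $Q^{N}_{\theta,n}$ constructed using this backward
method avoids the degeneracy in paths. It is even possible to compute $\zeta^{N}_{\theta,n}$ recursively as detailed
in Algorithm 1; since a recursion for $\eta_{\theta,n}$ is already available, it is apparent from (2.5) that what
remains is to specify a recursion for $E_\theta[T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}, x_n]$. Let $\overline{T}_{\theta,n}(x_n)$ denote this term, then
for $n \ge 1$,
\[
\begin{aligned}
\overline{T}_{\theta,n}(x_n) &= E_\theta[T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}, x_n] \\
&= E_\theta[T_{\theta,n-1}(X_{0:n-1}) \mid y_{0:n-1}, x_n] + E_\theta[t_{\theta,n}(X_{n-1}, X_n) \mid y_{0:n-1}, x_n] \\
&= \int M_{\theta,n}(x_n, dx_{n-1}) \left( E_\theta[T_{\theta,n-1}(X_{0:n-1}) \mid y_{0:n-2}, x_{n-1}] + t_{\theta,n}(x_{n-1}, x_n) \right) \\
&= \int M_{\theta,n}(x_n, dx_{n-1}) \left( \overline{T}_{\theta,n-1}(x_{n-1}) + t_{\theta,n}(x_{n-1}, x_n) \right)
\end{aligned}
\]
where $\overline{T}_{\theta,0}(x_0) = t_{\theta,0}(x_0)$. Algorithm 1 computes $\zeta^{N}_{\theta,n}$ recursively in time by computing $\{\overline{T}_{\theta,n}, \eta_{\theta,n}\}$
and is initialized with $\overline{T}^{(i)}_{\theta,0} = t_{\theta,0}(X^{(i)}_{0})$ (see (2.2)) where $\{X^{(i)}_{0}\}_{1\le i\le N}$ are samples from $\pi_\theta(x_0)$.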
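On a finite state space the recursion above for the conditional expectation $E_\theta[T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}, x_n]$ can be evaluated exactly, which gives a simple sanity check of (2.5) against a finite-difference derivative of the filter. The two-state model below (a chain that stays put with probability θ, a θ-independent Gaussian-shaped likelihood, and a uniform initial law) is a hypothetical toy chosen for illustration, as are all function names.

```python
import math

def stay_prob_model(theta):
    """Hypothetical two-state chain that stays in place with probability theta."""
    f = [[theta, 1.0 - theta], [1.0 - theta, theta]]          # f[x_prev][x_new]
    dlogf = [[1.0 / theta, -1.0 / (1.0 - theta)],
             [-1.0 / (1.0 - theta), 1.0 / theta]]             # d/dtheta log f
    return f, dlogf

def likelihood(y, x):
    """theta-independent likelihood g(y | x), so grad log g = 0."""
    return math.exp(-0.5 * (y - x) ** 2)

def filter_and_derivative(theta, ys):
    """Return (eta_n, zeta_n) after processing ys = y_0, ..., y_{n-1}."""
    f, dlogf = stay_prob_model(theta)
    eta = [0.5, 0.5]     # eta_0 = pi, chosen theta-independent
    tbar = [0.0, 0.0]    # conditional expectation at time 0: grad log pi = 0
    for y in ys:
        # q[xp][x] = g(y | xp) f(x | xp); its log-derivative equals dlogf[xp][x].
        q = [[likelihood(y, xp) * f[xp][x] for x in range(2)] for xp in range(2)]
        new_eta, new_tbar = [], []
        for x in range(2):
            # Unnormalized backward kernel (1.8): eta_{n-1}(xp) q(x | xp).
            w = [eta[xp] * q[xp][x] for xp in range(2)]
            norm = sum(w)
            new_tbar.append(sum(w[xp] * (tbar[xp] + dlogf[xp][x]) for xp in range(2)) / norm)
            new_eta.append(norm)
        total = sum(new_eta)
        eta = [v / total for v in new_eta]
        tbar = new_tbar
    mean_t = sum(e * t for e, t in zip(eta, tbar))
    zeta = [e * (t - mean_t) for e, t in zip(eta, tbar)]      # formula (2.5)
    return eta, zeta

ys = [0.3, 1.2, -0.4, 0.9, 0.1]
theta, h = 0.7, 1e-5
_, zeta = filter_and_derivative(theta, ys)
eta_plus, _ = filter_and_derivative(theta + h, ys)
eta_minus, _ = filter_and_derivative(theta - h, ys)
finite_diff = [(p - m) / (2 * h) for p, m in zip(eta_plus, eta_minus)]
print(zeta, finite_diff)
```

The two printed vectors should agree up to the finite-difference error, and the entries of `zeta` sum to zero because $\zeta_{\theta,n}$ is the derivative of a probability measure.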
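Algorithm 1: A Particle Method to Compute the Filter Derivative
• Assume at time $n-1$ that approximate samples $\{X^{(i)}_{n-1}\}_{1\le i\le N}$ from $\eta_{\theta,n-1}$ and approximations
$\{\overline{T}^{(i)}_{\theta,n-1}\}_{1\le i\le N}$ of $\{\overline{T}_{\theta,n-1}(X^{(i)}_{n-1})\}_{1\le i\le N}$ are available.
• At time $n$, sample $\{X^{(i)}_{n}\}_{1\le i\le N}$ independently from the mixture
\[
\frac{\sum_{j=1}^{N} f_\theta\!\left(x_n \mid X^{(j)}_{n-1}\right) g_\theta\!\left(y_{n-1} \mid X^{(j)}_{n-1}\right)}{\sum_{j=1}^{N} g_\theta\!\left(y_{n-1} \mid X^{(j)}_{n-1}\right)} \tag{2.7}
\]
and then compute $\{\overline{T}^{(i)}_{\theta,n}\}_{1\le i\le N}$ and $\zeta^{N}_{\theta,n}$ as follows:
\[
\overline{T}^{(i)}_{\theta,n} = \frac{\sum_{j=1}^{N} \left( \overline{T}^{(j)}_{\theta,n-1} + t_{\theta,n}\!\left(X^{(j)}_{n-1}, X^{(i)}_{n}\right) \right) f_\theta\!\left(X^{(i)}_{n} \mid X^{(j)}_{n-1}\right) g_\theta\!\left(y_{n-1} \mid X^{(j)}_{n-1}\right)}{\sum_{j=1}^{N} f_\theta\!\left(X^{(i)}_{n} \mid X^{(j)}_{n-1}\right) g_\theta\!\left(y_{n-1} \mid X^{(j)}_{n-1}\right)}, \tag{2.8}
\]
\[
\zeta^{N}_{\theta,n}(dx_n) = \frac{1}{N} \sum_{i=1}^{N} \left( \overline{T}^{(i)}_{\theta,n} - \frac{1}{N} \sum_{j=1}^{N} \overline{T}^{(j)}_{\theta,n} \right) \delta_{X^{(i)}_{n}}(dx_n). \tag{2.9}
\]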
Algorithm 1 uses the bootstrap particle filter of Gordon et al. [1993]. Note that any SMC implementation
of $\{\eta_{\theta,n}\}_{n\ge 0}$ may be used, e.g. the auxiliary SMC method of Pitt and Shephard [1999] or
sequential importance resampling with a tailored proposal distribution [Doucet et al., 2001]. It was
conjectured in Poyiadjis et al. [2011] that the asymptotic variance of $\zeta^{N}_{\theta,n}(\varphi)$ for bounded integrands
$\varphi$ is uniformly bounded w.r.t. $n$ under mixing assumptions. This is established in this article.
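One time step of Algorithm 1 can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes a toy scalar model $X_n = \theta X_{n-1} + V_n$, $Y_n = X_n + W_n$ with standard Gaussian noise (so $t_{\theta,n}(x_{n-1}, x_n) = (x_n - \theta x_{n-1})\,x_{n-1}$), an initial law $\pi = N(0,1)$ that does not depend on θ, and hypothetical function names. The double loop over particles is where the $O(N^2)$ cost arises.

```python
import math
import random

def normal_pdf(z, mean, sd=1.0):
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def algorithm1_step(particles, tbar, y_prev, theta, rng):
    """One O(N^2) update following (2.7)-(2.9) for the toy AR(1) model.

    Assumed model: X_n = theta * X_{n-1} + N(0,1), Y_n = X_n + N(0,1),
    so t_n(x_prev, x) = (x - theta * x_prev) * x_prev and grad log g = 0.
    """
    n = len(particles)
    g = [normal_pdf(y_prev, xp) for xp in particles]   # g(y_{n-1} | X_{n-1}^{(j)})
    # (2.7): choose a mixture component with probability proportional to g, then move.
    ancestors = rng.choices(range(n), weights=g, k=n)
    new_particles = [rng.gauss(theta * particles[j], 1.0) for j in ancestors]
    new_tbar = []
    for x in new_particles:
        # (2.8): backward weights f(x | X^{(j)}) g(y_{n-1} | X^{(j)}).
        w = [normal_pdf(x, theta * xp) * gj for xp, gj in zip(particles, g)]
        num = sum(wj * (tj + (x - theta * xp) * xp)
                  for wj, tj, xp in zip(w, tbar, particles))
        new_tbar.append(num / sum(w))
    # (2.9): signed weights carried by the new particles.
    mean_t = sum(new_tbar) / n
    zeta_weights = [(t - mean_t) / n for t in new_tbar]
    return new_particles, new_tbar, zeta_weights

rng = random.Random(0)
N, theta = 200, 0.5
particles = [rng.gauss(0.0, 1.0) for _ in range(N)]   # samples from pi = N(0,1)
tbar = [0.0] * N                                       # grad log pi = 0 here
for y in [0.2, -0.1, 0.4]:                             # y_0, y_1, y_2
    particles, tbar, zeta_weights = algorithm1_step(particles, tbar, y, theta, rng)
```

An estimate of $\zeta_{\theta,n}(\varphi)$ is then `sum(w * phi(x) for w, x in zip(zeta_weights, particles))` for any bounded test function `phi`.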
3 Stability of the particle estimates
The convergence analysis of $\zeta^{N}_{\theta,n}$ (and $\zeta^{p,N}_{\theta,n}$ for performance comparison) will largely focus on the
convergence analysis of the $N$-particle measures $Q^{N}_{\theta,n}$ (and correspondingly $Q^{p,N}_{\theta,n}$) towards their
limiting values $Q_{\theta,n}$, as $N \to \infty$, which is in turn intimately related to the convergence of the flow of
particle measures $\{\eta^{N}_{\theta,n}\}_{n\ge 0}$ towards their limiting measures $\{\eta_{\theta,n}\}_{n\ge 0}$. The $L_r$ error bounds and the
central limit theorem presented here have been derived using the techniques developed in Del Moral
[2004] for the convergence analysis of the particle occupation measures $\eta^{N}_{\theta,n}$. One of the central
objects in this analysis is the local sampling errors defined as
\[
V^{N}_{\theta,n} = \sqrt{N} \left( \eta^{N}_{\theta,n} - \Phi_{\theta,n}(\eta^{N}_{\theta,n-1}) \right). \tag{3.1}
\]
The fluctuation and the deviations of these centered random measures can be estimated using non-asymptotic
Khintchine-type $L_r$-inequalities, as well as Hoeffding- or Bernstein-type exponential deviations
[Del Moral, 2004, Del Moral and Rio, 2009]. In Del Moral and Miclo [2000] it is proved that
these random perturbations behave asymptotically as Gaussian random perturbations; see Lemma
7.10 in the Appendix for more details. In the proof of Theorem 7.11 (a supporting theorem) in
the Appendix we provide some key decompositions expressing the deviation of the particle measures
$Q^{N}_{\theta,n}$ around its limiting value $Q_{\theta,n}$ in terms of the local sampling errors $(V^{N}_{\theta,0}, \ldots, V^{N}_{\theta,n})$. These decompositions
are key to deriving the $L_r$-mean error bounds and central limit theorems for the filter
derivative.
The following regularity conditions are assumed.
(A) The dominating measures $dx$ on $\mathcal{X}$ and $dy$ on $\mathcal{Y}$ are finite, and there exist constants $0 < \rho, \delta, c < \infty$
such that for all $(x, x', y, \theta) \in \mathcal{X}^2 \times \mathcal{Y} \times \Theta$, the derivatives of $\pi_\theta(x)$, $f_\theta(x'|x)$ and $g_\theta(y|x)$
with respect to $\theta$ exist and
\[
\rho^{-1} \le f_\theta(x'|x) \le \rho, \qquad \delta^{-1} \le g_\theta(y|x) \le \delta, \tag{3.2}
\]
\[
|\nabla \log \pi_\theta(x)| \vee |\nabla \log f_\theta(x'|x)| \vee |\nabla \log g_\theta(y|x)| \le c. \tag{3.3}
\]
Admittedly, these conditions are restrictive and fail to hold for many models in practice. (Exceptions
would include applications with a compact state-space.) However, they are typically made to establish
the time uniform stability of particle approximations of the filter [Del Moral, 2004, Cappé et al.,
2005] as they lead to simpler and more transparent proofs. Also, we observe that the behaviors predicted
by the Theorems below seem to hold in practice even in cases where the state-space models
do not satisfy these assumptions; see Section 4. Thus the results in this paper can be seen to provide
a qualitative guide to the behavior of the particle approximation even in the more general setting.
For each parameter vector $\theta \in \Theta$, realization of observations $y = \{y_n\}_{n\ge 0}$ and particle number
$N$, let $(\Omega, \mathcal{F}, P^{y}_{\theta})$ be the underlying probability space of the random process $\{(X^{(1)}_{n}, \ldots, X^{(N)}_{n})\}_{n\ge 0}$
comprised of the particle system only. Let $E^{y}_{\theta}$ be the corresponding expectation operator computed
with respect to $P^{y}_{\theta}$. The first of the two main results in this section is a time uniform non-asymptotic
error bound.
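Theorem 3.1 Assume (A). For any $r \ge 1$, there exists a constant $C_r$ such that for all $\theta \in \Theta$,
$y = \{y_n\}_{n\ge 0}$, $n \ge 0$, $N \ge 1$, and $\varphi_n \in \mathrm{Osc}_1(\mathcal{X})$,
\[
\sqrt{N}\, E^{y}_{\theta}\left\{ \left| \zeta^{N}_{\theta,n}(\varphi_n) - \zeta_{\theta,n}(\varphi_n) \right|^{r} \right\}^{1/r} \le C_r.
\]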
Let $\{V_{\theta,n}\}_{n\ge 0}$ be a sequence of independent centered Gaussian random fields defined as follows.
For any sequence $\{\varphi_n\}_{n\ge 0}$ in $\mathcal{B}(\mathcal{X})$ and any $p \ge 0$, $\{V_{\theta,n}(\varphi_n)\}_{n=0}^{p}$ is a collection of independent
zero-mean Gaussian random variables with variances given by
\[
\eta_{\theta,n}(\varphi_n^2) - \eta_{\theta,n}(\varphi_n)^2. \tag{3.4}
\]
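Theorem 3.2 Assume (A). There exists a constant $C < \infty$ such that for any $\theta \in \Theta$, $y = \{y_n\}_{n\ge 0}$,
$n \ge 0$ and $\varphi_n \in \mathrm{Osc}_1(\mathcal{X})$, $\sqrt{N}\left( \zeta^{N}_{\theta,n} - \zeta_{\theta,n} \right)(\varphi_n)$ converges in law, as $N \to \infty$, to the centered
Gaussian random variable
\[
\sum_{p=0}^{n} V_{\theta,p}\!\left( G_{\theta,p,n}\, \frac{D_{\theta,p,n}(F_{\theta,n} - Q_{\theta,n}(F_{\theta,n}))}{D_{\theta,p,n}(1)} \right) \tag{3.5}
\]
whose variance is uniformly bounded above by $C$, where
\[
F_{\theta,n} = (\varphi_n - Q_{\theta,n}(\varphi_n))(T_{\theta,n} - Q_{\theta,n}(T_{\theta,n})).
\]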
The proofs of both these results are in the Appendix.
As a comparison, we quantify the variance of the particle estimate of the filter derivative computed
using the path-based method (see (2.6)). Consider the following simplified example that serves to
illustrate the point. Let $g_\theta(y|x) = g(y|x)$ (that is, $\theta$-independent) and $f_\theta(x_n|x_{n-1}) = \pi_\theta(x_n)$, where
$\pi_\theta$ is the initial distribution. (Note that $f_\theta$ in this case satisfies a rephrased version of (3.2) under
which the conclusion of Theorem 3.2 also holds.) Also, consider the sequence of repeated observations
$y_0 = y_1 = \cdots$ where $y_0$ is arbitrary. Applying Lemma 7.12 (in the Appendix) that characterizes the
limiting distribution of $\sqrt{N}(Q^{p,N}_{\theta,n} - Q_{\theta,n})$ to this special case results in $\sqrt{N}(\zeta^{p,N}_{\theta,n} - \zeta_{\theta,n})(\varphi)$ (see
(2.6)) having an asymptotic distribution which is Gaussian with mean zero and variance
\[
n \times \pi_\theta(\overline{\varphi}^2)\, \pi'_\theta\!\left( (\nabla \log \pi_\theta)^2 \right) + \pi_\theta\!\left( \overline{\varphi}^2 (\nabla \log \pi_\theta)^2 \right) - \nabla \pi_\theta(\varphi)^2
\]
where $\overline{\varphi} = \varphi - \pi_\theta(\varphi)$ and $\pi'_\theta(x) = \pi_\theta(x)\, g(y_0|x) / \pi_\theta(g(y_0|\cdot))$. This variance increases linearly with time
in contrast to the time bounded variance of Theorem 3.2.
4 Application to recursive parameter estimation
Being able to compute {ζθ,n}n≥0is particularly useful when performing online static parameter esti-
mation for state-space models using Recursive Maximum Likelihood (RML) techniques [Le Gland and Mevel,
1997, Poyiadjis et al., 2005, 2011]; see also Kantas et al. [2009] for a general review of available
particle methods based solutions, including Bayesian ones, for this problem. The computed filter
derivative may also be useful in other areas; e.g. see Coquelin et al. [2008] for an application in
control.
4.1 Recursive Maximum Likelihood
Let θ∗be the true static parameter generating the observed data {yn}n≥0. Given a finite record of
observations y0:T, the log-likelihood may be maximized with the following steepest ascent algorithm:
\[
\theta_k = \theta_{k-1} + \gamma_k \nabla \log p_\theta(y_{0:T})\big|_{\theta=\theta_{k-1}}, \quad k \ge 1, \tag{4.1}
\]
where θ0 is some arbitrary initial guess of θ∗, ∇logpθ(y0:T)|θ=θk−1denotes the gradient of the
log-likelihood evaluated at the current parameter estimate and {γk}k≥1 is a decreasing positive
real-valued step-size sequence, which should satisfy the following constraints:
\[
\sum_{k=1}^{\infty} \gamma_k = \infty, \qquad \sum_{k=1}^{\infty} \gamma_k^2 < \infty.
\]
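A common family satisfying both step-size constraints is $\gamma_k = a\,k^{-\alpha}$ with $1/2 < \alpha \le 1$. The sketch below (the function name `step_size` and the default values are illustrative assumptions) only evaluates partial sums to show the contrast between the two series; it is not a proof of divergence or convergence.

```python
def step_size(k, alpha=0.6, scale=1.0):
    """gamma_k = scale * k**(-alpha); both constraints hold for 0.5 < alpha <= 1."""
    if not 0.5 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0.5, 1]")
    return scale * k ** (-alpha)

K = 100000
partial_sum = sum(step_size(k) for k in range(1, K + 1))            # keeps growing with K
partial_sum_sq = sum(step_size(k) ** 2 for k in range(1, K + 1))    # stays bounded in K
print(partial_sum, partial_sum_sq)
```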
Although $\nabla \log p_\theta(y_{0:T})$ can be computed using (4.3), the computation cost can be prohibitive for
a long data record since each iteration of (4.1) would require a complete pass through the $T + 1$
data points. A more attractive alternative would be a recursive procedure in which the data is run
through once only sequentially. For example, consider the following update scheme:
\[
\theta_n = \theta_{n-1} + \gamma_n \nabla \log p_\theta(y_n|y_{0:n-1})\big|_{\theta=\theta_{n-1}} \tag{4.2}
\]
where ∇logpθ(yn|y0:n−1)|θ=θn−1denotes the gradient of logpθ(yn|y0:n−1) evaluated at the current
parameter estimate; that is upon receiving yn, θn−1 is updated in the direction of ascent of the
conditional density of this new observation. Since we have
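\[
\nabla \log p_\theta(y_n|y_{0:n-1})\big|_{\theta=\theta_{n-1}} = \frac{\int dx_n\, \eta_{\theta_{n-1},n}(x_n)\, \nabla g_\theta(y_n|x_n)\big|_{\theta_{n-1}} + \int dx_n\, \zeta_{\theta_{n-1},n}(x_n)\, g_{\theta_{n-1}}(y_n|x_n)}{\int dx_n\, \eta_{\theta_{n-1},n}(x_n)\, g_{\theta_{n-1}}(y_n|x_n)}, \tag{4.3}
\]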
this clearly requires the filter derivative ζθ,n. The algorithm in the present form is not suitable
for online implementation as it requires re-computing the filter and its derivative at the value θ =
θn−1 from time zero. The RML procedure uses an approximation of (4.3) which is obtained by
updating the filter and its derivative using the parameter value θn−1at time n; we refer the reader
to Le Gland and Mevel [1997] for details. The asymptotic properties of the RML algorithm, i.e.
the behavior of $\theta_n$ in the limit as $n$ goes to infinity, have been studied in the case of an i.i.d. hidden
process by Titterington [1984] and by Le Gland and Mevel [1997] for a finite state-space hidden Markov
model. It is shown in Le Gland and Mevel [1997] that under regularity conditions this algorithm
converges towards a local maximum of the average log-likelihood and that this average log-likelihood
is maximized at θ∗. A particle version of the RML algorithm of Le Gland and Mevel [1997] that uses
Algorithm 1’s estimate of ηθ,nis presented as Algorithm 2.
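Algorithm 2: Particle Recursive Maximum Likelihood
• At time $n-1$ we are given $y_{0:n-1}$, the previous estimate $\theta_{n-1}$ of $\theta^*$ and $\{(X^{(i)}_{n-1}, \overline{T}^{(i)}_{n-1})\}_{i=1}^{N}$.
• At time $n$, upon receiving $y_n$, sample $\{X^{(i)}_{n}\}_{1\le i\le N}$ independently from (2.7) using parameter
$\theta = \theta_{n-1}$ to obtain
\[
\eta^{N}_{n}(dx_n) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{n}}(dx_n)
\]
and then compute
\[
\overline{T}^{(i)}_{n} = \frac{\sum_{j=1}^{N} \left( \overline{T}^{(j)}_{n-1} + t_{\theta_{n-1},n}\!\left(X^{(j)}_{n-1}, X^{(i)}_{n}\right) \right) f_{\theta_{n-1}}\!\left(X^{(i)}_{n} \mid X^{(j)}_{n-1}\right) g_{\theta_{n-1}}\!\left(y_{n-1} \mid X^{(j)}_{n-1}\right)}{\sum_{j=1}^{N} f_{\theta_{n-1}}\!\left(X^{(i)}_{n} \mid X^{(j)}_{n-1}\right) g_{\theta_{n-1}}\!\left(y_{n-1} \mid X^{(j)}_{n-1}\right)}, \tag{4.4}
\]
\[
\zeta^{N}_{n}(dx_n) = \frac{1}{N} \sum_{i=1}^{N} \left( \overline{T}^{(i)}_{n} - \frac{1}{N} \sum_{j=1}^{N} \overline{T}^{(j)}_{n} \right) \delta_{X^{(i)}_{n}}(dx_n), \tag{4.5}
\]
and
\[
\widehat{\nabla \log p}(y_n|y_{0:n-1}) = \frac{\int \eta^{N}_{n}(dx_n)\, \nabla g_\theta(y_n|x_n)\big|_{\theta_{n-1}} + \int \zeta^{N}_{n}(dx_n)\, g_{\theta_{n-1}}(y_n|x_n)}{\int \eta^{N}_{n}(dx_n)\, g_{\theta_{n-1}}(y_n|x_n)}.
\]
Finally update the parameter:
\[
\theta_n = \theta_{n-1} + \gamma_n\, \widehat{\nabla \log p}(y_n|y_{0:n-1}). \tag{4.6}
\]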
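The last two steps of Algorithm 2 reduce to a few weighted averages over the particles. The sketch below assumes a scalar parameter and takes the per-particle quantities as precomputed lists; the function names `rml_gradient` and `rml_update` and the numbers in the example are hypothetical.

```python
def rml_gradient(tbar, g, grad_g):
    """Particle estimate of grad log p(y_n | y_{0:n-1}) used in (4.6).

    tbar[i]   : the Tbar value attached to particle i, as in (4.4)
    g[i]      : g_theta(y_n | X_n^{(i)}) evaluated at theta_{n-1}
    grad_g[i] : theta-derivative of g_theta(y_n | X_n^{(i)}) at theta_{n-1}
    """
    n = len(tbar)
    mean_t = sum(tbar) / n
    eta_grad_g = sum(grad_g) / n                                     # eta^N(grad g)
    zeta_g = sum((t - mean_t) * gi for t, gi in zip(tbar, g)) / n    # zeta^N(g), see (4.5)
    eta_g = sum(g) / n                                               # eta^N(g)
    return (eta_grad_g + zeta_g) / eta_g

def rml_update(theta_prev, gamma, gradient):
    """Parameter update (4.6)."""
    return theta_prev + gamma * gradient

grad = rml_gradient(tbar=[0.0, 2.0], g=[1.0, 3.0], grad_g=[0.0, 0.0])
theta_new = rml_update(0.8, 0.01, grad)
print(grad, theta_new)
```

When all `tbar` values coincide and `grad_g` vanishes the estimate is zero, reflecting the fact that $\zeta^{N}_{n}$ integrates constants to zero.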
Under Assumption A, the particle approximation of the filter is stable [Del Moral, 2004]; see also
Lemma 7.4 in the Appendix. This combined with the proven stability of the particle approximation
of the filter derivative implies that the particle estimate of the derivative of logp(yn|y0:n−1) is also
stable.
4.2 Simulations
The RML algorithm is applied to the following stochastic volatility model [Pitt and Shephard, 1999]:
\[
X_0 \sim \mathcal{N}\!\left( 0, \frac{\sigma^2}{1 - \phi^2} \right), \quad X_{n+1} = \phi X_n + \sigma V_{n+1}, \quad Y_n = \beta \exp(X_n/2)\, W_n,
\]
where $\mathcal{N}(m, s)$ denotes a Gaussian random variable with mean $m$ and variance $s$, $V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$
and $W_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$ are two mutually independent sequences, both independent of the initial state
$X_0$. The model parameters, $\theta = (\phi, \sigma, \beta)$, are to be estimated.
Our first example demonstrates the theoretical results in Section 3. The estimate of $\partial/\partial\sigma\, \log p(y_{n:n+L-1}|y_{0:n-1})$
at $\theta^* = (0.8, \sqrt{0.1}, 1)$ was computed using Algorithm 1 with 500 particles
and using the path-space method (see (2.6)) with $2.5 \times 10^5$ particles for the stochastic volatility
model. The block size $L$ was 500. Shown in Figure 1 is the variance of these particle estimates
for various values of $n$ derived from many independent random replications of the simulation. The
linear increase of the variance of the path-space method as predicted by theory is evident although
Assumption A is not satisfied.
Figure 1: Variance of the particle estimates of $\partial/\partial\sigma\, \log p(y_{n:n+500-1}|y_{0:n-1})$ for various values of $n$
for the stochastic volatility model. Circles are variance of Algorithm 1's estimate with 500 particles.
Stars indicate the variance of the estimate of the path-space method with $2.5 \times 10^5$ particles. Dotted
line is best fitting straight line to path-space method's variance to indicate trend.
For the path-space method, because the variance of the estimate of the filter derivative grows
linearly in time, the eventual high variance in the gradient estimate can result in the divergence of the
parameter estimates. To illustrate this point, (4.6) was implemented with the path-space estimate of
the filter derivative (2.6) computed with 10000 particles and constant step-size sequence, $\gamma_n = 10^{-4}$
for all $n$. $\theta_0$ was initialized at the true parameter value. A sequence of two million observations was
simulated with $\theta^* = (0.8, \sqrt{0.1}, 1)$. The results are shown in Figure 3.
Figure 3: RML for stochastic volatility with path-space gradient estimate with 10,000 particles,
constant step-size and initialized at the true parameter values which are indicated by the dashed
lines. From top to bottom, $\phi$, $\beta$ and $\sigma$.
For the same value of $\theta^*$ and sequence of observations used in the previous example, Algorithm
2 was executed with 500 particles and $\gamma_n = 0.01$ for $n \le 10^5$, $\gamma_n = (n - 5 \times 10^4)^{-0.6}$ for $n > 10^5$. As
can be seen from the results in Figure 2, the estimate converges to a value in the neighborhood of
the true parameter.
Figure 2: Sequence of recursive parameter estimates, $\theta_n = (\sigma_n, \phi_n, \beta_n)$, computed using (4.6) with
$N = 500$. From top to bottom: $\beta_n$, $\phi_n$ and $\sigma_n$ and marked on the right are the "converged values"
which were taken to be the empirical average of the last 1000 values. (Values marked in the plot: $\beta^* = 1.006$, $\phi^* = 0.802$, $(\sigma^2)^* = 0.097$.)
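For readers who wish to reproduce experiments of this kind, the stochastic volatility model of this section and the score terms $t_{\theta,n}$ of (2.3) it induces can be sketched as follows. The function names, the seed and the sample size are illustrative assumptions; the derivatives are written out by hand from the Gaussian densities of the model, with $\theta = (\phi, \sigma, \beta)$.

```python
import math
import random

def simulate_sv(num_obs, phi, sigma, beta, seed=0):
    """Simulate (X_n, Y_n), n = 0..num_obs-1, from the stochastic volatility model."""
    rng = random.Random(seed)
    # N(m, s) has variance s, so the stationary standard deviation is sigma / sqrt(1 - phi^2).
    x = rng.gauss(0.0, sigma / math.sqrt(1.0 - phi ** 2))
    xs, ys = [], []
    for _ in range(num_obs):
        xs.append(x)
        ys.append(beta * math.exp(x / 2.0) * rng.gauss(0.0, 1.0))
        x = phi * x + sigma * rng.gauss(0.0, 1.0)
    return xs, ys

def sv_log_q(x_prev, x, y_prev, phi, sigma, beta):
    """log of g(y_prev | x_prev) f(x | x_prev), dropping the 2*pi constants."""
    log_f = -math.log(sigma) - 0.5 * ((x - phi * x_prev) / sigma) ** 2
    log_g = -math.log(beta) - 0.5 * x_prev - 0.5 * y_prev ** 2 / (beta ** 2 * math.exp(x_prev))
    return log_f + log_g

def sv_score(x_prev, x, y_prev, phi, sigma, beta):
    """t(x_prev, x): gradient of sv_log_q with respect to (phi, sigma, beta)."""
    resid = x - phi * x_prev
    d_phi = resid * x_prev / sigma ** 2
    d_sigma = -1.0 / sigma + resid ** 2 / sigma ** 3
    d_beta = -1.0 / beta + y_prev ** 2 / (beta ** 3 * math.exp(x_prev))
    return d_phi, d_sigma, d_beta

xs, ys = simulate_sv(2000, phi=0.8, sigma=math.sqrt(0.1), beta=1.0, seed=42)
score = sv_score(xs[0], xs[1], ys[0], 0.8, math.sqrt(0.1), 1.0)
```

Plugging `sv_score` into the update (2.8) in place of a generic $t_{\theta,n}$ gives one component of the filter derivative per parameter.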
5 Conclusion
We have presented theoretical results establishing the uniform stability of the particle approximation
of the optimal filter derivative proposed in Poyiadjis et al. [2005, 2009]. While these results have
been presented in the context of state-space models, they can also be applied to Feynman-Kac
models [Del Moral, 2004] which could potentially enlarge the range of applications. For example, if
$dx'\, f_\theta(x'|x)$ is reversible w.r.t. some probability measure $\mu_\theta$ and if we replace $g_\theta(y_n|x_n)$ with
a time-homogeneous potential function $g_\theta(x_n)$ then $\eta_{\theta,n}$ converges, as $n \to \infty$, to the probability
measure $\mu_{\theta,h}$ defined as
\[
\mu_{\theta,h}(dx) := \frac{1}{\mu_\theta\!\left( h_\theta \int dx'\, f_\theta(x'|\cdot)\, h_\theta(x') \right)}\, \mu_\theta(dx)\, h_\theta(x) \int dx'\, f_\theta(x'|x)\, h_\theta(x')
\]
where $h_\theta$ is a positive eigenmeasure associated with the top eigenvalue of the integral operator
$Q_\theta(x, dx') = g_\theta(x)\, dx'\, f_\theta(x'|x)$ (see section 12.4 of Del Moral [2004]). The measure $\mu_{\theta,h}$ is the
invariant measure of the h-process defined as the Markov chain with transition kernel $M_\theta(x, dx') \propto
dx'\, f_\theta(x'|x)\, h_\theta(x')$. The particle algorithm described here can be directly used to approximate the
derivative of this invariant measure w.r.t. $\theta$. It would also be of interest to weaken Assumption A
and there are several ways this might be approached, for example for non-ergodic signals using ideas
in Oudjane and Rubenthaler [2005], Heine and Crisan [2008] or via Foster-Lyapunov conditions as
in Beskos et al. [2011], Whiteley [2011].
6 Acknowledgement
We are grateful to Sinan Yildirim for carefully reading this report.
7 Appendix
The statements of the results in this section hold for any θ and any sequence of observations y =
{yn}n≥0. All mathematical expectations are taken with respect to the law of the particle system only
for the specific θ and y under consideration. While θ is retained in the statement of the results, it is
omitted in the proofs. The superscript y of the expectation operator is also omitted in the proofs.
This section commences with some essential definitions in addition to those in Section 1.1. Let
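\[
P_{\theta,k,n}(x_k, dx_n) = \frac{Q_{\theta,k,n}(x_k, dx_n)}{Q_{\theta,k,n}(1)(x_k)},
\]
and
\[
\mathcal{M}_{\theta,p}(x_p, dx_{0:p-1}) = \prod_{k=p}^{1} M_{\theta,k}(x_k, dx_{k-1}), \quad p > 0,
\]
and its corresponding particle approximation is
\[
\mathcal{M}^{N}_{\theta,p}(x_p, dx_{0:p-1}) = \prod_{k=p}^{1} M^{N}_{\theta,k}(x_k, dx_{k-1}).
\]
To make the subsequent expressions more terse, let
\[
\widetilde{\eta}^{N}_{\theta,n} = \Phi_{\theta,n}(\eta^{N}_{\theta,n-1}), \quad n \ge 0, \tag{7.1}
\]
where $\widetilde{\eta}^{N}_{\theta,0} = \Phi_{\theta,0}(\eta^{N}_{-1}) = \eta_{\theta,0} = \pi_\theta$ by convention. (Recall $\Phi_{\theta,n} = \Phi_{\theta,n-1,n}$.) Let
\[
\mathcal{F}^{N}_{n} = \sigma\!\left( \left\{ X^{(i)}_{k};\ 0 \le k \le n,\ 1 \le i \le N \right\} \right), \quad n \ge 0,
\]
be the natural filtration associated with the $N$-particle approximation model and let $\mathcal{F}^{N}_{-1}$ be the
trivial sigma field.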
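The following estimates are a straightforward consequence of Assumption (A). For all θ and time
indices $0 \le k < q \le n$,
\[
b_{\theta,k,n} = \sup_{x_k, x'_k} \frac{Q_{\theta,k,n}(1)(x_k)}{Q_{\theta,k,n}(1)(x'_k)} \le \rho^2 \delta^2, \qquad
\beta\!\left( \frac{Q_{\theta,k,q}(x_k, dx_q)\, Q_{\theta,q,n}(1)(x_q)}{Q_{\theta,k,q}(Q_{\theta,q,n}(1))(x_k)} \right) \le \left( 1 - \rho^{-4} \right)^{(q-k)} = \overline{\rho}^{\,q-k}, \tag{7.2}
\]
and for θ, $0 < k \le q$,
\[
M^{N}_{\theta,k}(x, dz) \le \rho^4 M^{N}_{\theta,k}(x', dz) \implies \beta\!\left( M^{N}_{\theta,q} \cdots M^{N}_{\theta,k} \right) \le \left( 1 - \rho^{-4} \right)^{q-k+1}. \tag{7.3}
\]
Note that setting $q = n$ in (7.2) yields an estimate for $\beta(P_{\theta,k,n})$.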
Several auxiliary results are now presented, all of which hinge on the following Khintchine-type
moment bound proved in Del Moral [2004, Lem. 7.3.3].
Lemma 7.1 (Del Moral [2004, Lemma 7.3.3]) Let $\mu$ be a probability measure on the measurable space
$(E, \mathcal{E})$. Let $G$ and $h$ be $\mathcal{E}$-measurable functions satisfying $G(x) \ge c\,G(x') > 0$ for all $x, x' \in E$ where $c$
is some finite positive constant. Let $\{X^{(i)}\}_{1\le i\le N}$ be a collection of independent random samples from
$\mu$. If $h$ has finite oscillation then for any integer $r \ge 1$ there exists a finite constant $a_r$, independent
of $N$, $G$ and $h$, such that
\[
\sqrt{N}\, E\left\{ \left| \frac{\sum_{i=1}^{N} G(X^{(i)})\, h(X^{(i)})}{\sum_{i=1}^{N} G(X^{(i)})} - \frac{\mu(Gh)}{\mu(G)} \right|^{r} \right\}^{1/r} \le c^{-1}\, \mathrm{osc}(h)\, a_r.
\]
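Proof:
The result for $G = 1$ and $c = 1$ is proved in Del Moral [2004]. The case stated here can be established
using the representation
\[
\frac{\mu^{N}(Gh)}{\mu^{N}(G)} - \frac{\mu(Gh)}{\mu(G)} = \frac{\mu(G)}{\mu^{N}(G)} \left( \mu^{N} - \mu \right)\!\left( \frac{G}{\mu(G)} \left( h - \frac{\mu(Gh)}{\mu(G)} \right) \right)
\]
where $\mu^{N}(dx) = N^{-1} \sum_{i=1}^{N} \delta_{X^{(i)}}(dx)$.
Remark 7.2 For $k \ge 0$, let $h^{N}_{k-1}$ be a $\mathcal{F}^{N}_{k-1}$ measurable function satisfying $h^{N}_{k-1} \in \mathrm{Osc}_1(\mathcal{X})$ almost
surely. Then Lemma 7.1 can be invoked to establish
\[
\sqrt{N}\, E^{y}_{\theta}\left\{ \left| \frac{\eta^{N}_{\theta,k}(G h^{N}_{k-1})}{\eta^{N}_{\theta,k}(G)} - \frac{\Phi_{\theta,k}(\eta^{N}_{\theta,k-1})(G h^{N}_{k-1})}{\Phi_{\theta,k}(\eta^{N}_{\theta,k-1})(G)} \right|^{r} \right\}^{1/r} \le c^{-1} a_r
\]
where $G$ is defined as in Lemma 7.1.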
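Lemma 7.3 to Lemma 7.6 are a consequence of Lemma 7.1 and the estimates in (7.2).
Lemma 7.3 For any $r \ge 1$ there exists a finite constant $a_r$ such that the following inequality holds
for all θ, $y$, $0 \le k \le n$ and $\mathcal{F}^{N}_{k-1}$ measurable function $\varphi^{N}_{n}$ satisfying $\varphi^{N}_{n} \in \mathrm{Osc}_1(\mathcal{X})$
almost surely,
\[
\sqrt{N}\, E^{y}_{\theta}\left\{ \left| \Phi_{\theta,k,n}(\eta^{N}_{\theta,k})(\varphi^{N}_{n}) - \Phi_{\theta,k-1,n}(\eta^{N}_{\theta,k-1})(\varphi^{N}_{n}) \right|^{r} \right\}^{1/r} \le a_r\, b_{\theta,k,n}\, \beta(P_{\theta,k,n}),
\]
where, by convention, $\Phi_{\theta,-1,n}(\eta^{N}_{\theta,-1}) = \eta_{\theta,n}$, and the constants $b_{\theta,k,n}$ and $\beta(P_{\theta,k,n})$ were defined in
(7.2).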
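Proof:
\[
\Phi_{k,n}(\eta^{N}_{k})(\varphi^{N}_{n}) - \Phi_{k-1,n}(\eta^{N}_{k-1})(\varphi^{N}_{n})
= \int \left( \frac{\eta^{N}_{k}(dx_k)\, Q_{k,n}(1)(x_k)}{\eta^{N}_{k} Q_{k,n}(1)} - \frac{\Phi_{k}(\eta^{N}_{k-1})(dx_k)\, Q_{k,n}(1)(x_k)}{\Phi_{k}(\eta^{N}_{k-1}) Q_{k,n}(1)} \right) P_{k,n}(\varphi^{N}_{n})(x_k)
\]
where $\Phi_{0}(\eta^{N}_{-1}) = \eta_0$ by convention. Applying Lemma 7.1 with the estimates in (7.2) we have
\[
\sqrt{N}\, E\left\{ \left| \Phi_{k,n}(\eta^{N}_{k})(\varphi^{N}_{n}) - \Phi_{k-1,n}(\eta^{N}_{k-1})(\varphi^{N}_{n}) \right|^{r} \,\Big|\, \mathcal{F}^{N}_{k-1} \right\}^{1/r} \le a_r\, b_{k,n}\, \beta(P_{k,n})
\]
almost surely.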
Lemma 7.3 may be used to derive the following error estimate [Del Moral, 2004, Theorem 7.4.4].
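Lemma 7.4 For any $r \ge 1$, there exists a constant $c_r$ such that the following inequality holds for
all θ, $y$, $n \ge 0$ and $\varphi \in \mathrm{Osc}_1(\mathcal{X})$,
\[
\sqrt{N}\, E^{y}_{\theta}\left\{ \left| [\eta^{N}_{\theta,n} - \eta_{\theta,n}](\varphi) \right|^{r} \right\}^{1/r} \le c_r \sum_{k=0}^{n} b_{\theta,k,n}\, \beta(P_{\theta,k,n}). \tag{7.4}
\]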