# Uniform Stability of a Particle Approximation of the Optimal Filter Derivative

**ABSTRACT** Sequential Monte Carlo methods, also known as particle methods, are a widely used set of computational tools for inference in non-linear non-Gaussian state-space models. In many applications it may be necessary to compute the sensitivity, or derivative, of the optimal filter with respect to the static parameters of the state-space model; for instance, in order to obtain maximum likelihood model parameters of interest, or to compute the optimal controller in an optimal control problem. In Poyiadjis et al. [2011] an original particle algorithm to compute the filter derivative was proposed and it was shown using numerical examples that the particle estimate was numerically stable in the sense that it did not deteriorate over time. In this paper we substantiate this claim with a detailed theoretical study. $L_p$ bounds and a central limit theorem for this particle approximation of the filter derivative are presented. It is further shown that under mixing conditions these $L_p$ bounds and the asymptotic variance characterized by the central limit theorem are uniformly bounded with respect to the time index. We demonstrate the performance predicted by theory with several numerical examples. We also use the particle approximation of the filter derivative to perform online maximum likelihood parameter estimation for a stochastic volatility model.


arXiv:1106.2525v1 [math.ST] 13 Jun 2011

Uniform Stability of a Particle Approximation of the Optimal

Filter Derivative∗

Pierre Del Moral†, Arnaud Doucet‡, Sumeetpal S. Singh§

June 14, 2011


Some key words: Hidden Markov Models, State-Space Models, Sequential Monte Carlo, Smoothing, Filter derivative, Recursive Maximum Likelihood.

1 Introduction

State-space models are a very popular class of non-linear and non-Gaussian time series models in statistics, econometrics and information engineering; see for example Cappé et al. [2005], Doucet et al. [2001], Durbin and Koopman [2001]. A state-space model is comprised of a pair of discrete-time stochastic processes, $\{X_n\}_{n\geq 0}$ and $\{Y_n\}_{n\geq 0}$, where the former is an $\mathcal{X}$-valued unobserved process and the latter is a $\mathcal{Y}$-valued process which is observed. The hidden process $\{X_n\}_{n\geq 0}$ is a Markov process with initial law $dx\,\pi_\theta(x)$ and time homogeneous transition law $dx'\,f_\theta(x'|x)$, i.e.

$$X_0 \sim dx_0\,\pi_\theta(x_0) \quad\text{and}\quad X_n \mid (X_{n-1} = x_{n-1}) \sim dx_n\,f_\theta(x_n|x_{n-1}), \quad n \geq 1. \tag{1.1}$$

∗ First version: January 2011. Cambridge University Engineering Department Technical report number CUED/F-INFENG/TR.668

† Centre INRIA Bordeaux et Sud-Ouest & Institut de Mathématiques de Bordeaux, Université de Bordeaux I, 351 cours de la Libération 33405 Talence cedex, France (Pierre.Del-Moral@inria.fr)

‡ Department of Statistics, University of British Columbia, V6T 1Z4 Vancouver, BC, Canada (arnaud@stat.ubc.ca)

§ Department of Engineering, University of Cambridge, Trumpington Street, CB2 1PZ, United Kingdom (sss40@cam.ac.uk)

It is assumed that the observations $\{Y_n\}_{n\geq 0}$ conditioned upon $\{X_n\}_{n\geq 0}$ are statistically independent and have marginal laws

$$Y_n \mid \left(\{X_k\}_{k\geq 0} = \{x_k\}_{k\geq 0}\right) \sim dy_n\,g_\theta(y_n|x_n). \tag{1.2}$$

Here $\pi_\theta(x)$, $f_\theta(x|x')$ and $g_\theta(y|x)$ are densities with respect to (w.r.t.) suitable dominating measures denoted generically as $dx$ and $dy$. For example, if $\mathcal{X} \subseteq \mathbb{R}^p$ and $\mathcal{Y} \subseteq \mathbb{R}^q$ then the dominating measures could be the Lebesgue measures. The variable $\theta$ in the densities is the parameter vector of the model. The set of possible values for $\theta$, denoted $\Theta$, is assumed to be an open subset of $\mathbb{R}^d$. The model (1.1)-(1.2) is also often referred to as a hidden Markov model in the literature [Cappé et al., 2005].

For a sequence $\{z_n\}_{n\geq 0}$ and integers $i, j$, let $z_{i:j}$ denote the set $\{z_i, z_{i+1}, \ldots, z_j\}$, which is empty if $j < i$. Equations (1.1) and (1.2) define the law of $(X_{0:n}, Y_{0:n-1})$ which is given by the measure

$$dx_0\,\pi_\theta(x_0) \prod_{k=1}^{n} dx_k\,f_\theta(x_k|x_{k-1}) \prod_{k=0}^{n-1} dy_k\,g_\theta(y_k|x_k), \tag{1.3}$$

from which the probability density of the observed process, or likelihood, is obtained:

$$p_\theta(y_{0:n-1}) = \int dx_0\,\pi_\theta(x_0) \prod_{k=1}^{n} dx_k\,f_\theta(x_k|x_{k-1}) \prod_{k=0}^{n-1} g_\theta(y_k|x_k). \tag{1.4}$$

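For a realization of observations $Y_{0:n-1} = y_{0:n-1}$, let $Q_{\theta,n}$ denote the law of $X_{0:n}$ conditioned on this sequence of observed variables, i.e.

$$Q_{\theta,n}(dx_{0:n}) = \frac{1}{p_\theta(y_{0:n-1})}\,dx_0\,\pi_\theta(x_0)\,g_\theta(y_0|x_0) \left\{\prod_{k=1}^{n-1} dx_k\,f_\theta(x_k|x_{k-1})\,g_\theta(y_k|x_k)\right\} dx_n\,f_\theta(x_n|x_{n-1}).$$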

Let $\eta_{\theta,n}$ denote the time $n$ marginal of $Q_{\theta,n}$. This marginal, which we call the filter, may be computed recursively using Bayes' formula:

$$\eta_{\theta,n+1}(dx_{n+1}) = Q_{\theta,n+1}(dx_{n+1}) = dx_{n+1}\,\frac{\int \eta_{\theta,n}(dx_n)\,g_\theta(y_n|x_n)\,f_\theta(x_{n+1}|x_n)}{\int \eta_{\theta,n}(dx'_n)\,g_\theta(y_n|x'_n)}, \quad n \geq 0,$$

and $\eta_{\theta,0} = \pi_\theta$ by convention. Except for simple models such as the linear Gaussian state-space model or when $\mathcal{X}$ is a finite set, it is impossible to compute $p_\theta(y_{0:n})$, $Q_{\theta,n}$ or $\eta_{\theta,n}$ exactly. Particle methods have been applied extensively to approximate these quantities for general state-space models of the form (1.1)-(1.2); see Cappé et al. [2005], Doucet et al. [2001].

The particle approximation of $Q_{\theta,n}$ is the empirical measure corresponding to a set of $N \geq 1$ random samples termed particles, that is

$$Q^{p,N}_{\theta,n}(dx_{0:n}) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{0:n}}(dx_{0:n}) \tag{1.5}$$

where $\delta_z(dz)$ denotes the Dirac delta mass located at $z$. This approximation is referred to as the path space approximation [Del Moral, 2004] and it is denoted by the superscript 'p'. The particle approximation of $\eta_{\theta,n}$ is obtained from $Q^{p,N}_{\theta,n}$ by marginalization:

$$\eta^{N}_{\theta,n}(dx_n) = \frac{1}{N} \sum_{i=1}^{N} \delta_{X^{(i)}_{n}}(dx_n).$$

These particles are propagated in time using importance sampling and resampling steps; see Doucet et al. [2001] and Cappé et al. [2005] for a review of the literature. Specifically, $Q^{p,N}_{\theta,n+1}$ is the empirical measure constructed from $N$ independent samples from

$$\frac{Q^{p,N}_{\theta,n}(dx_{0:n})\,dx_{n+1}\,f_\theta(x_{n+1}|x_n)\,g_\theta(y_n|x_n)}{\int Q^{p,N}_{\theta,n}(dx_{0:n})\,g_\theta(y_n|x_n)}. \tag{1.6}$$
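The propagation step (1.6) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy model (`g_pdf`, `f_sample`, the coefficient 0.8, unit noise variances) and the placeholder observations are hypothetical choices made for the example and do not come from the paper. Each particle carries its whole ancestry, so counting distinct values at time 0 shows how the resampling steps thin out the early part of the paths.

```python
import numpy as np

def bootstrap_step(paths, y, g_pdf, f_sample, rng):
    """One step of (1.6): resample whole paths with weights g(y | x_n), then extend by f."""
    n_particles = paths.shape[0]
    weights = g_pdf(y, paths[:, -1])
    weights = weights / weights.sum()
    ancestors = rng.choice(n_particles, size=n_particles, p=weights)
    resampled = paths[ancestors]                   # whole ancestries are copied
    new_states = f_sample(resampled[:, -1], rng)   # X_{n+1} ~ f(. | X_n)
    return np.column_stack([resampled, new_states])

def distinct_at(paths, k):
    """Number of distinct particle values representing the time-k marginal."""
    return np.unique(paths[:, k]).size

# Hypothetical toy model: X_{n+1} = 0.8 X_n + V, Y_n = X_n + W, with V, W ~ N(0, 1).
g_pdf = lambda y, x: np.exp(-0.5 * (y - x) ** 2)
f_sample = lambda x, rng: 0.8 * x + rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
paths = rng.standard_normal((100, 1))      # N = 100 draws of X_0
for y in rng.standard_normal(50):          # placeholder observations y_0, ..., y_49
    paths = bootstrap_step(paths, y, g_pdf, f_sample, rng)
print(distinct_at(paths, 0), distinct_at(paths, 50))
```

The first printed count refers to time 0 and the second to the current time; the gap between them is the path degeneracy discussed in the text.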

It is a well known fact that the particle approximation of $Q_{\theta,n}$ becomes progressively impoverished as $n$ increases because of the successive resampling steps [Del Moral and Doucet, 2003, Olsson et al., 2008]. That is, the number of distinct particles representing the marginal $Q^{p,N}_{\theta,n}(dx_{0:k})$ for any fixed $k < n$ diminishes as $n$ increases until it collapses to a single particle; this is known as the particle path degeneracy problem.

The focus of this paper is on the convergence properties of particle methods which have been recently proposed to approximate the derivative of the measures $\{\eta_{\theta,n}(dx_n)\}_{n\geq 0}$ w.r.t. $\theta = [\theta_1, \ldots, \theta_d]^T \in \mathbb{R}^d$:

$$\zeta_{\theta,n} = \nabla \eta_{\theta,n} = \left[\frac{\partial \eta_{\theta,n}}{\partial \theta_1}, \ldots, \frac{\partial \eta_{\theta,n}}{\partial \theta_d}\right]^T.$$

(See Section 2 for a definition.) References Cérou et al. [2001] and Doucet and Tadić [2003] present particle methods which have a computational complexity that scales linearly with the number $N$ of particles. It was shown in Poyiadjis et al. [2011] (see also Poyiadjis et al. [2009] for a more detailed numerical study) that the performance of these $O(N)$ methods, which inherently rely on the particle approximations of $\{Q_{\theta,n}\}_{n\geq 0}$ constructed as in (1.6) above, degraded over time and it was conjectured that this may be attributed to the particle path degeneracy problem. In contrast, the alternative method of Poyiadjis et al. [2005] was shown in numerical examples to be stable. The method of Poyiadjis et al. [2005] is a non-standard particle implementation that avoids the particle path degeneracy problem at the expense of a computational complexity per time step which is quadratic in the number of particles, i.e. $O(N^2)$; see Section 2 for more details. Supported by numerical examples, it was conjectured in Poyiadjis et al. [2011] that even under strong mixing assumptions, the variance of the estimate of the filter derivative computed with the $O(N)$ methods increases at least linearly in time while that of the $O(N^2)$ method is uniformly bounded w.r.t. the time index. This conjecture is confirmed in this paper. Specifically, we analyze the $O(N^2)$ implementation of Poyiadjis et al. [2005] in Section 3 and obtain results on the errors of the approximation; in particular, $L_p$ bounds and a Central Limit Theorem (CLT) are presented. We show that these $L_p$ bounds and asymptotic variances appearing in the CLT are uniformly bounded w.r.t. the time index when the state-space model satisfies certain mixing assumptions. In contrast, the asymptotic variance of the $O(N)$ implementations, which is also captured through the CLT, is shown to increase linearly. To the best of our knowledge, these are the first results of this kind.

An important application of our results, which is discussed in detail in Section 4, is to the problem of estimating the parameters of the model (1.1)-(1.2) from observed data. The estimates of the model parameters are found by maximizing the likelihood function $p_\theta(y_{0:n})$ with respect to $\theta$ using a gradient ascent algorithm which relies on the particle approximation of the filter derivative. The results we present in Section 3 have bearing on the performance of the parameter estimation algorithm, which we illustrate with numerical examples in Section 4. The Appendix contains the proofs of the main results as well as that of some supporting auxiliary results. As a final remark, although the algorithms and theoretical results are presented for a state-space model, they may be reinterpreted for Feynman-Kac models as well.

1.1 Notation and definitions

We give some basic definitions from probability and operator semigroup theory. For a measurable space $(E, \mathcal{E})$ let $\mathcal{M}(E)$ denote the set of all finite signed measures and $\mathcal{P}(E)$ the set of all probability measures on $E$. The $n$-fold product space $E \times \cdots \times E$ is denoted by $E^n$. Let $\mathcal{B}(E)$ denote the Banach space of all bounded real-valued and measurable functions $\varphi : E \to \mathbb{R}$ equipped with the uniform norm $\|\varphi\| = \sup_{x \in E} |\varphi(x)|$. For $\nu \in \mathcal{M}(E)$ and $\varphi \in \mathcal{B}(E)$, let $\nu(\varphi) = \int \nu(dx)\,\varphi(x)$ be the Lebesgue integral of $\varphi$ w.r.t. $\nu$. If $\nu$ is a density w.r.t. some dominating measure $dx$ on $E$ then $\nu(\varphi) = \int dx\,\nu(x)\,\varphi(x)$. We recall that a bounded integral kernel $M(x, dx')$ from a measurable space $(E, \mathcal{E})$ into an auxiliary measurable space $(E', \mathcal{E}')$ is an operator $\varphi \mapsto M(\varphi)$ from $\mathcal{B}(E')$ into $\mathcal{B}(E)$ such that the functions

$$x \mapsto M(\varphi)(x) := \int_{E'} M(x, dx')\,\varphi(x')$$

are $\mathcal{E}$-measurable and bounded for any $\varphi \in \mathcal{B}(E')$. The kernel $M$ also generates a dual operator $\nu \mapsto \nu M$ from $\mathcal{M}(E)$ into $\mathcal{M}(E')$ defined by

$$(\nu M)(\varphi) := \nu(M(\varphi)).$$

Given a pair of bounded integral operators $(M_1, M_2)$, we let $(M_1 M_2)$ denote the composition operator defined by $(M_1 M_2)(\varphi) = M_1(M_2(\varphi))$.

A Markov kernel is a positive and bounded integral operator $M$ such that $M(1)(x) = 1$ for any $x \in E$. For $\varphi \in \mathcal{B}(E)$, let

$$\mathrm{osc}(\varphi) = \sup_{x, x' \in E} |\varphi(x) - \varphi(x')|$$

and let

$$\mathrm{Osc}_1(E) = \{\varphi \in \mathcal{B}(E) : \mathrm{osc}(\varphi) \leq 1\}.$$

Let $\beta(M) \in [0,1]$ denote the Dobrushin coefficient of the Markov kernel $M$ which is defined by the formula [Del Moral, 2004, Prop. 4.2.1]:

$$\beta(M) := \sup\{\mathrm{osc}(M(\varphi)) \,;\ \varphi \in \mathrm{Osc}_1(E')\}.$$

If there exists a positive constant $\rho$ such that the Markov kernel $M$ satisfies

$$M(x, dz) \geq \rho\,M(x', dz) \quad \text{for all } x, x' \in E$$

then $\beta(M) \leq 1 - \rho$.

For two Markov kernels $M_1, M_2$, $\beta(M_1 M_2) \leq \beta(M_1)\,\beta(M_2)$.
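For a finite state space these quantities can be checked numerically. The sketch below assumes a Markov kernel given as a row-stochastic matrix and uses the standard fact that, in that case, the Dobrushin coefficient equals the largest total variation distance between two rows. The function names and the matrices `M1`, `M2` are hypothetical and chosen only for illustration.

```python
import numpy as np

def dobrushin(M):
    """Dobrushin coefficient of a row-stochastic matrix (max total variation between rows)."""
    row_diffs = np.abs(M[:, None, :] - M[None, :, :]).sum(axis=2)
    return 0.5 * row_diffs.max()

def minorization_constant(M):
    """Largest rho with M[x, z] >= rho * M[x', z] for all x, x', z (entries must be positive)."""
    return (M[:, None, :] / M[None, :, :]).min()

M1 = np.array([[0.9, 0.1], [0.2, 0.8]])
M2 = np.array([[0.5, 0.5], [0.3, 0.7]])

# Contraction of the product versus the product of the coefficients.
print(dobrushin(M1 @ M2), dobrushin(M1) * dobrushin(M2))
# Coefficient versus the bound 1 - rho implied by the minorization condition.
print(dobrushin(M1), 1.0 - minorization_constant(M1))
```

In each printed pair the first number should not exceed the second, which mirrors the two inequalities stated above.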

Given a positive function $G$ on $E$, let $\Psi_G : \nu \in \mathcal{P}(E) \mapsto \Psi_G(\nu) \in \mathcal{P}(E)$ be the probability distribution defined by

$$\Psi_G(\nu)(dx) := \frac{\nu(dx)\,G(x)}{\nu(G)}$$

provided $\infty > \nu(G) > 0$. The definitions above also apply if $\nu$ is a density and $M$ is a transition density. In this case all instances of $\nu(dx)$ should be replaced with $dx\,\nu(x)$ and $M(x, dx')$ by $dx'\,M(x, x')$ where $dx$ and $dx'$ are generic notation for the dominating measures.

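It is convenient to introduce the following transition kernels:

$$Q_{\theta,n}(x_{n-1}, dx_n) = g_\theta(y_{n-1}|x_{n-1})\,dx_n\,f_\theta(x_n|x_{n-1}) = dx_n\,q_\theta(x_n|x_{n-1}), \quad n > 0,$$

$$Q_{\theta,k,n}(x_k, dx_n) = (Q_{\theta,k+1} Q_{\theta,k+2} \cdots Q_{\theta,n})(x_k, dx_n), \quad 0 \leq k \leq n,$$

with the convention that $Q_{\theta,n,n} = \mathrm{Id}$, the identity operator. Note that $Q_{\theta,k,n}(1)(x_k)$ is the density of the law of $Y_{k:n-1}$ given $X_k = x_k$. For $0 \leq p \leq n$, define the potential function $G_{\theta,p,n}$ on $\mathcal{X}$ to be

$$G_{\theta,p,n}(x_p) = Q_{\theta,p,n}(1)(x_p) \,/\, \eta_{\theta,p} Q_{\theta,p,n}(1). \tag{1.7}$$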

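Let the mapping $\Phi_{\theta,k,n} : \mathcal{P}(\mathcal{X}) \to \mathcal{P}(\mathcal{X})$, $0 \leq k \leq n$, be defined as follows:

$$\Phi_{\theta,k,n}(\nu)(dx_n) = \frac{\nu Q_{\theta,k,n}(dx_n)}{\nu Q_{\theta,k,n}(1)}.$$

It follows that $\eta_{\theta,n} = \Phi_{\theta,k,n}(\eta_{\theta,k})$. For conciseness, we also write $\Phi_{\theta,n-1,n}$ as $\Phi_{\theta,n}$.

A key quantity that facilitates the recursive computation of the derivative of $\eta_{\theta,n}$ is the following collection of backward Markov transition kernels:

$$M_{\theta,n}(x_n, dx_{n-1}) = \frac{\eta_{\theta,n-1}(dx_{n-1})\,q_\theta(x_n|x_{n-1})}{\eta_{\theta,n-1}(q_\theta(x_n|\cdot))}, \quad n > 0. \tag{1.8}$$

Their particle approximations are

$$M^{N}_{\theta,n}(x_n, dx_{n-1}) = \frac{\eta^{N}_{\theta,n-1}(dx_{n-1})\,q_\theta(x_n|x_{n-1})}{\eta^{N}_{\theta,n-1}(q_\theta(x_n|\cdot))}. \tag{1.9}$$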
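Because the particle filter is an empirical measure, the kernel in (1.9) reduces to a discrete distribution over the time $n-1$ particles, with weights proportional to $q_\theta(x_n | X^{(j)}_{n-1})$. A minimal sketch follows, assuming a hypothetical AR(1) transition density and Gaussian observation density (neither taken from the paper) and working with log-densities for numerical stability.

```python
import numpy as np

def backward_weights(x_new, x_prev, y_prev, log_f, log_g):
    """Row i holds the weights of M^N(x_new[i], .) on the particles x_prev, as in (1.9)."""
    # log q(x_new[i] | x_prev[j]) = log g(y_prev | x_prev[j]) + log f(x_new[i] | x_prev[j])
    log_q = log_g(y_prev, x_prev)[None, :] + log_f(x_new[:, None], x_prev[None, :])
    log_q = log_q - log_q.max(axis=1, keepdims=True)   # stabilise the exponentials
    w = np.exp(log_q)
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical densities, up to additive constants that cancel in the normalisation.
log_f = lambda x_new, x_prev: -0.5 * (x_new - 0.8 * x_prev) ** 2
log_g = lambda y, x: -0.5 * (y - x) ** 2

rng = np.random.default_rng(0)
x_prev = rng.standard_normal(5)                       # particles at time n-1
x_new = 0.8 * x_prev[:3] + rng.standard_normal(3)     # three evaluation points at time n
W = backward_weights(x_new, x_prev, 0.4, log_f, log_g)

# Conditional expectation of phi(x_{n-1}, x_n) = x_{n-1} * x_n under the backward kernel.
cond_exp = (W * (x_prev[None, :] * x_new[:, None])).sum(axis=1)
print(W.sum(axis=1), cond_exp)
```

Evaluating all $N$ rows costs $O(N^2)$ operations per time step, which is where the quadratic cost of the method discussed in Section 2 originates.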

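These backward Markov kernels are convenient for computing certain conditional expectations and probability measures. In particular, for $\varphi \in \mathcal{B}(\mathcal{X}^2)$, we have

$$E_\theta[\varphi(X_{n-1}, X_n) \mid y_{0:n-1}, x_n] = \int M_{\theta,n}(x_n, dx_{n-1})\,\varphi(x_{n-1}, x_n),$$

and the law of $X_{0:n-1}$ given $X_n = x_n$ and $Y_{0:n-1} = y_{0:n-1}$ is $M_{\theta,n}(x_n, dx_{n-1}) \cdots M_{\theta,1}(x_1, dx_0)$.

Finally, the following two definitions are needed for the CLT of the particle approximation of the derivative of $\eta_{\theta,n}$. The bounded integral operator $D_{\theta,k,n}$ from $\mathcal{X}$ into $\mathcal{X}^{n+1}$ is defined for any $F_n \in \mathcal{B}(\mathcal{X}^{n+1})$ by

$$D_{\theta,k,n}(F_n)(x_k) := \int \left\{\prod_{j=k}^{1} M_{\theta,j}(x_j, dx_{j-1})\right\} \left\{\prod_{j=k}^{n-1} Q_{\theta,j+1}(x_j, dx_{j+1})\right\} F_n(x_{0:n}), \quad 0 \leq k \leq n, \tag{1.10}$$

with the convention that $\prod_{\emptyset} = 1$. The particle approximation, $D^{N}_{\theta,k,n}$, is defined to be

$$D^{N}_{\theta,k,n}(F_n)(x_k) := \int \left\{\prod_{j=k}^{1} M^{N}_{\theta,j}(x_j, dx_{j-1})\right\} \left\{\prod_{j=k}^{n-1} Q_{\theta,j+1}(x_j, dx_{j+1})\right\} F_n(x_{0:n}). \tag{1.11}$$

To be concise we write

$$\eta_{\theta,k}(dx_k)\,D_{\theta,k,n}(x_k, dx_{0:k-1}, dx_{k+1:n}) \quad \text{as} \quad \eta_{\theta,k} D_{\theta,k,n}(dx_{0:n}).$$

(And similarly for the particle versions.) Although convention dictates that $\eta_{\theta,k} D_{\theta,k,n}$ should be understood as the measure $(\eta_{\theta,k} D_{\theta,k,n})(dx_{0:k-1}, dx_{k+1:n})$, when we mean otherwise it should be clear from the infinitesimal neighborhood.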

2 Computing the filter derivative

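For any $F_n \in \mathcal{B}(\mathcal{X}^{n+1})$, we have

$$\begin{aligned}
\nabla Q_{\theta,n}(F_n) &= \frac{1}{p_\theta(y_{0:n-1})} \int dx_{0:n}\,\nabla\left\{\pi_\theta(x_0) \prod_{k=1}^{n} f_\theta(x_k|x_{k-1}) \prod_{k=0}^{n-1} g_\theta(y_k|x_k)\right\} F_n(x_{0:n}) \\
&\quad - \frac{1}{p_\theta(y_{0:n-1})}\,E_\theta\{F_n(X_{0:n}) \mid y_{0:n-1}\} \int dx_{0:n}\,\nabla\left\{\pi_\theta(x_0) \prod_{k=1}^{n} f_\theta(x_k|x_{k-1}) \prod_{k=0}^{n-1} g_\theta(y_k|x_k)\right\} \\
&= E_\theta\{F_n(X_{0:n})\,T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}\} - E_\theta\{F_n(X_{0:n}) \mid y_{0:n-1}\}\,E_\theta\{T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}\}
\end{aligned} \tag{2.1}$$

where

$$T_{\theta,n}(x_{0:n}) = \sum_{k=0}^{n} t_{\theta,k}(x_{k-1}, x_k), \tag{2.2}$$

$$t_{\theta,k}(x_{k-1}, x_k) = \nabla \log\left(g_\theta(y_{k-1}|x_{k-1})\,f_\theta(x_k|x_{k-1})\right), \quad k > 0, \tag{2.3}$$

$$t_{\theta,0}(x_{-1}, x_0) = t_{\theta,0}(x_0) = \nabla \log \pi_\theta(x_0). \tag{2.4}$$

The first equality in (2.1) follows from the definition of $Q_{\theta,n}$ and interchanging the order of differentiation and integration. The interchange is permissible under certain regularity conditions [Pflug, 1996]; e.g. a sufficient condition would be the main assumption in Section 3 under which the uniform stability results are proved. The second equality follows from a change of measure, which then permits an importance sampling based estimator for the derivative of $Q_{\theta,n}$; this is the well known score method, e.g. see Pflug [1996, Section 4.2.1]. For any $\varphi_n \in \mathcal{B}(\mathcal{X})$, it follows by setting $F_n(x_{0:n}) = \varphi_n(x_n)$ in (2.1) that

$$\begin{aligned}
\nabla \int \eta_{\theta,n}(dx_n)\,\varphi_n(x_n) &= E_\theta\{\varphi_n(X_n)\,T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}\} - E_\theta\{\varphi_n(X_n) \mid y_{0:n-1}\}\,E_\theta\{T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}\} \\
&= \int \zeta_{\theta,n}(dx_n)\,\varphi_n(x_n)
\end{aligned}$$

where

$$\zeta_{\theta,n}(dx_n) = \eta_{\theta,n}(dx_n)\left(E_\theta[T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}, x_n] - E_\theta[T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}]\right). \tag{2.5}$$

We call $\zeta_{\theta,n}$ the derivative of $\eta_{\theta,n}$.

Given the particle approximation (1.5) of $Q_{\theta,n}$, it is straightforward to construct a particle approximation of $\zeta_{\theta,n}$:

$$\zeta^{p,N}_{\theta,n}(dx_n) = \sum_{i=1}^{N} \frac{1}{N}\left(T_{\theta,n}(X^{(i)}_{0:n}) - \frac{1}{N} \sum_{j=1}^{N} T_{\theta,n}(X^{(j)}_{0:n})\right) \delta_{X^{(i)}_{n}}(dx_n). \tag{2.6}$$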
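The estimate (2.6) can be computed with $O(N)$ work per step by letting each particle carry the running sum $T_{\theta,n}(X^{(i)}_{0:n})$ through the resampling step. The sketch below assumes a hypothetical scalar model that is not taken from the paper: $X_0 \sim N(0,1)$ (so $t_{\theta,0} = 0$), $X_n = \phi X_{n-1} + V_n$, $Y_n = X_n + W_n$ with unit-variance Gaussian noises, and the single parameter $\theta = \phi$. In that model $t_{\theta,k}(x_{k-1}, x_k) = (x_k - \phi x_{k-1})\,x_{k-1}$ because $g$ does not depend on $\phi$.

```python
import numpy as np

def path_space_derivative(ys, phi, n_particles, test_fn, rng):
    """Path-space estimate (2.6) of zeta_{theta,n}(test_fn) with n = len(ys)."""
    x = rng.standard_normal(n_particles)   # X_0 ~ pi, which does not depend on phi here
    T = np.zeros(n_particles)              # T_{theta,0} = t_{theta,0} = 0 in this toy model
    for y in ys:                           # y_{n-1} drives the move from time n-1 to n
        w = np.exp(-0.5 * (y - x) ** 2)
        w = w / w.sum()
        ancestors = rng.choice(n_particles, size=n_particles, p=w)
        x_prev = x[ancestors]
        x = phi * x_prev + rng.standard_normal(n_particles)
        # Each particle inherits its ancestor's sum and adds the new increment t_{theta,n}.
        T = T[ancestors] + (x - phi * x_prev) * x_prev
    return np.mean((T - T.mean()) * test_fn(x))

rng = np.random.default_rng(0)
estimate = path_space_derivative(np.array([2.0]), 0.8, 20000, lambda x: x, rng)
print(estimate)
```

For this toy model and a single observation $y_0$, a direct Gaussian calculation gives $\partial_\phi E[X_1 \mid y_0] = y_0 / 2$, which offers a quick sanity check of the printed value.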

This approximation is also referred to as the path space method. Such approximations were implicitly proposed in Cérou et al. [2001] and Doucet and Tadić [2003] and there are several reasons why this estimate appears attractive. Firstly, even with the resampling steps in the construction of $Q^{p,N}_{\theta,n}$, $\zeta^{p,N}_{\theta,n}$ can be computed recursively. Secondly, there is no need to store the entire ancestry of each particle, i.e. $\{X^{(i)}_{0:n}\}_{1 \leq i \leq N}$, and thus the memory requirement to construct $\zeta^{p,N}_{\theta,n}$ is constant over time. Thirdly, the computational cost per time step is $O(N)$. However, as $Q^{p,N}_{\theta,n}$ suffers from the particle path degeneracy problem, we expect the approximation $\zeta^{p,N}_{\theta,n}$ to worsen over time. This was indeed observed in numerical examples in Poyiadjis et al. [2011] and it was conjectured that the asymptotic variance (i.e. as $N \to \infty$) of $\zeta^{p,N}_{\theta,n}$ for bounded integrands would increase linearly with $n$ even under strong mixing assumptions. This is now proven in this article.

An alternative particle method to approximate $\{\zeta_{\theta,n}\}_{n\geq 0}$ has been proposed in Poyiadjis et al. [2005, 2011]. We now reinterpret this method using the representation in (2.5) and a different particle approximation of $Q_{\theta,n}$ that avoids the path degeneracy problem.

The measure $Q_{\theta,n}$ admits the following backward representation

$$Q_{\theta,n}(dx_{0:n}) = \eta_{\theta,n}(dx_n) \prod_{k=n}^{1} M_{\theta,k}(x_k, dx_{k-1})$$

and the corresponding particle approximation of $Q_{\theta,n}$ is given by

$$Q^{N}_{\theta,n}(dx_{0:n}) = \eta^{N}_{\theta,n}(dx_n) \prod_{k=n}^{1} M^{N}_{\theta,k}(x_k, dx_{k-1})$$

where $M^{N}_{\theta,k}$ was defined in (1.9). This now gives rise to the following particle approximation of $\zeta_{\theta,n}$ [Poyiadjis et al., 2005, 2011]:

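$$\zeta^{N}_{\theta,n}(\varphi_n) = \int Q^{N}_{\theta,n}(dx_{0:n})\,T_{\theta,n}(x_{0:n})\left(\varphi_n(x_n) - \eta^{N}_{\theta,n}(\varphi_n)\right)$$

and indeed $\eta^{N}_{\theta,n}(\varphi_n) = \int Q^{N}_{\theta,n}(dx_{0:n})\,\varphi_n(x_n)$. It is apparent that $Q^{N}_{\theta,n}$ constructed using this backward method avoids the degeneracy in paths. It is even possible to compute $\zeta^{N}_{\theta,n}$ recursively as detailed in Algorithm 1; since a recursion for $\eta_{\theta,n}$ is already available, it is apparent from (2.5) that what remains is to specify a recursion for $E_\theta[T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}, x_n]$. Let $\overline{T}_{\theta,n}(x_n)$ denote this term, then for $n \geq 1$,

$$\begin{aligned}
\overline{T}_{\theta,n}(x_n) &= E_\theta[T_{\theta,n}(X_{0:n}) \mid y_{0:n-1}, x_n] \\
&= E_\theta[T_{\theta,n-1}(X_{0:n-1}) \mid y_{0:n-1}, x_n] + E_\theta[t_{\theta,n}(X_{n-1}, X_n) \mid y_{0:n-1}, x_n] \\
&= \int M_{\theta,n}(x_n, dx_{n-1})\left(E_\theta[T_{\theta,n-1}(X_{0:n-1}) \mid y_{0:n-2}, x_{n-1}] + t_{\theta,n}(x_{n-1}, x_n)\right) \\
&= \int M_{\theta,n}(x_n, dx_{n-1})\left(\overline{T}_{\theta,n-1}(x_{n-1}) + t_{\theta,n}(x_{n-1}, x_n)\right)
\end{aligned}$$

where $\overline{T}_{\theta,0}(x_0) = t_{\theta,0}(x_0)$. Algorithm 1 computes $\zeta^{N}_{\theta,n}$ recursively in time by computing $\{\overline{T}_{\theta,n}, \eta_{\theta,n}\}$ and is initialized with $\overline{T}^{(i)}_{\theta,0} = t_{\theta,0}(X^{(i)}_0)$ (see (2.2)) where $\{X^{(i)}_0\}_{1 \leq i \leq N}$ are samples from $\pi_\theta(x_0)$.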

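Algorithm 1: A Particle Method to Compute the Filter Derivative

• Assume at time $n-1$ that approximate samples $\{X^{(i)}_{n-1}\}_{1 \leq i \leq N}$ from $\eta_{\theta,n-1}$ and approximations $\{\overline{T}^{(i)}_{\theta,n-1}\}_{1 \leq i \leq N}$ of $\{\overline{T}_{\theta,n-1}(X^{(i)}_{n-1})\}_{1 \leq i \leq N}$ are available.

• At time $n$, sample $\{X^{(i)}_{n}\}_{1 \leq i \leq N}$ independently from the mixture

$$\frac{\sum_{j=1}^{N} f_\theta(x_n | X^{(j)}_{n-1})\,g_\theta(y_{n-1} | X^{(j)}_{n-1})}{\sum_{j=1}^{N} g_\theta(y_{n-1} | X^{(j)}_{n-1})} \tag{2.7}$$

and then compute $\{\overline{T}^{(i)}_{\theta,n}\}_{1 \leq i \leq N}$ and $\zeta^{N}_{\theta,n}$ as follows:

$$\overline{T}^{(i)}_{\theta,n} = \frac{\sum_{j=1}^{N} \left(\overline{T}^{(j)}_{\theta,n-1} + t_{\theta,n}(X^{(j)}_{n-1}, X^{(i)}_{n})\right) f_\theta(X^{(i)}_{n} | X^{(j)}_{n-1})\,g_\theta(y_{n-1} | X^{(j)}_{n-1})}{\sum_{j=1}^{N} f_\theta(X^{(i)}_{n} | X^{(j)}_{n-1})\,g_\theta(y_{n-1} | X^{(j)}_{n-1})}, \tag{2.8}$$

$$\zeta^{N}_{\theta,n}(dx_n) = \frac{1}{N} \sum_{i=1}^{N} \left(\overline{T}^{(i)}_{\theta,n} - \frac{1}{N} \sum_{j=1}^{N} \overline{T}^{(j)}_{\theta,n}\right) \delta_{X^{(i)}_{n}}(dx_n). \tag{2.9}$$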
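The steps (2.7)-(2.9) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation, and it assumes a hypothetical scalar model that is not taken from the paper: $X_0 \sim N(0,1)$, $X_n = \phi X_{n-1} + V_n$, $Y_n = X_n + W_n$ with unit-variance Gaussian noises and $\theta = \phi$, so that $t_{\theta,n}(x_{n-1}, x_n) = (x_n - \phi x_{n-1})\,x_{n-1}$ and $t_{\theta,0} = 0$.

```python
import numpy as np

def filter_derivative_step(x_prev, T_prev, y_prev, phi, rng):
    """One pass of (2.7)-(2.8): returns the new particles and their T-bar values."""
    n_particles = x_prev.size
    log_g = -0.5 * (y_prev - x_prev) ** 2
    # (2.7): pick a mixture component j with probability prop. to g, then draw from f(. | X_j).
    w = np.exp(log_g - log_g.max())
    w = w / w.sum()
    ancestors = rng.choice(n_particles, size=n_particles, p=w)
    x_new = phi * x_prev[ancestors] + rng.standard_normal(n_particles)
    # (2.8): backward weights W[i, j] prop. to f(X_n^i | X_{n-1}^j) g(y_{n-1} | X_{n-1}^j).
    resid = x_new[:, None] - phi * x_prev[None, :]
    log_q = log_g[None, :] - 0.5 * resid ** 2
    log_q = log_q - log_q.max(axis=1, keepdims=True)
    W = np.exp(log_q)
    W = W / W.sum(axis=1, keepdims=True)
    t_inc = resid * x_prev[None, :]        # gradient of log f w.r.t. phi; g has no phi term
    T_new = np.sum(W * (T_prev[None, :] + t_inc), axis=1)
    return x_new, T_new

def filter_derivative(ys, phi, n_particles, test_fn, rng):
    """Estimate zeta^N_{theta,n}(test_fn) via (2.9), with n = len(ys)."""
    x = rng.standard_normal(n_particles)   # samples from pi
    T = np.zeros(n_particles)              # T-bar_0 = t_{theta,0} = 0 in this toy model
    for y in ys:
        x, T = filter_derivative_step(x, T, y, phi, rng)
    return np.mean((T - T.mean()) * test_fn(x))

rng = np.random.default_rng(0)
print(filter_derivative(np.array([2.0]), 0.8, 1500, lambda x: x, rng))
```

The $N \times N$ arrays make the quadratic cost per time step explicit. For this toy model with a single observation $y_0$, the exact value of $\zeta_{\theta,1}(x \mapsto x)$ is $y_0/2$, which gives a simple reference point for the printed estimate.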

Algorithm 1 uses the bootstrap particle filter of Gordon et al. [1993]. Note that any SMC implementation of $\{\eta_{\theta,n}\}_{n\geq 0}$ may be used, e.g. the auxiliary SMC method of Pitt and Shephard [1999] or sequential importance resampling with a tailored proposal distribution [Doucet et al., 2001]. It was conjectured in Poyiadjis et al. [2011] that the asymptotic variance of $\zeta^{N}_{\theta,n}(\varphi)$ for bounded integrands $\varphi$ is uniformly bounded w.r.t. $n$ under mixing assumptions. This is established in this article.

3 Stability of the particle estimates

The convergence analysis of $\zeta^{N}_{\theta,n}$ (and $\zeta^{p,N}_{\theta,n}$ for performance comparison) will largely focus on the convergence analysis of the $N$-particle measures $Q^{N}_{\theta,n}$ (and correspondingly $Q^{p,N}_{\theta,n}$) towards their limiting values $Q_{\theta,n}$, as $N \to \infty$, which is in turn intimately related to the convergence of the flow of particle measures $\{\eta^{N}_{\theta,n}\}_{n\geq 0}$ towards their limiting measures $\{\eta_{\theta,n}\}_{n\geq 0}$. The $L_r$ error bounds and the central limit theorem presented here have been derived using the techniques developed in Del Moral [2004] for the convergence analysis of the particle occupation measures $\eta^{N}_{\theta,n}$. One of the central objects in this analysis is the local sampling errors defined as

$$V^{N}_{\theta,n} = \sqrt{N}\left(\eta^{N}_{\theta,n} - \Phi_{\theta,n}(\eta^{N}_{\theta,n-1})\right). \tag{3.1}$$

The fluctuation and the deviations of these centered random measures can be estimated using non-asymptotic Khintchine-type $L_r$-inequalities, as well as Hoeffding- or Bernstein-type exponential deviations [Del Moral, 2004, Del Moral and Rio, 2009]. In Del Moral and Miclo [2000] it is proved that these random perturbations behave asymptotically as Gaussian random perturbations; see Lemma 7.10 in the Appendix for more details. In the proof of Theorem 7.11 (a supporting theorem) in the Appendix we provide some key decompositions expressing the deviation of the particle measures $Q^{N}_{\theta,n}$ around its limiting value $Q_{\theta,n}$ in terms of the local sampling errors $(V^{N}_{\theta,0}, \ldots, V^{N}_{\theta,n})$. These decompositions are key to deriving the $L_r$-mean error bounds and central limit theorems for the filter derivative.

The following regularity conditions are assumed.

(A) The dominating measures $dx$ on $\mathcal{X}$ and $dy$ on $\mathcal{Y}$ are finite, and there exist constants $0 < \rho, \delta, c < \infty$ such that for all $(x, x', y, \theta) \in \mathcal{X}^2 \times \mathcal{Y} \times \Theta$, the derivatives of $\pi_\theta(x)$, $f_\theta(x'|x)$ and $g_\theta(y|x)$ with respect to $\theta$ exist and

$$\rho^{-1} \leq f_\theta(x'|x) \leq \rho, \qquad \delta^{-1} \leq g_\theta(y|x) \leq \delta, \tag{3.2}$$

$$|\nabla \log \pi_\theta(x)| \vee |\nabla \log f_\theta(x'|x)| \vee |\nabla \log g_\theta(y|x)| \leq c. \tag{3.3}$$

Admittedly, these conditions are restrictive and fail to hold for many models in practice. (Exceptions would include applications with a compact state-space.) However, they are typically made to establish the time uniform stability of particle approximations of the filter [Del Moral, 2004, Cappé et al., 2005] as they lead to simpler and more transparent proofs. Also, we observe that the behaviors predicted by the Theorems below seem to hold in practice even in cases where the state-space models do not satisfy these assumptions; see Section 4. Thus the results in this paper can be seen to provide a qualitative guide to the behavior of the particle approximation even in the more general setting.

For each parameter vector $\theta \in \Theta$, realization of observations $y = \{y_n\}_{n\geq 0}$ and particle number $N$, let $(\Omega, \mathcal{F}, P^{y}_{\theta})$ be the underlying probability space of the random process $\{(X^{(1)}_n, \ldots, X^{(N)}_n)\}_{n\geq 0}$ comprised of the particle system only. Let $E^{y}_{\theta}$ be the corresponding expectation operator computed with respect to $P^{y}_{\theta}$. The first of the two main results in this section is a time uniform non-asymptotic error bound.

Theorem 3.1 Assume (A). For any $r \geq 1$, there exists a constant $C_r$ such that for all $\theta \in \Theta$, $y = \{y_n\}_{n\geq 0}$, $n \geq 0$, $N \geq 1$, and $\varphi_n \in \mathrm{Osc}_1(\mathcal{X})$,

$$\sqrt{N}\,E^{y}_{\theta}\left\{\left|\zeta^{N}_{\theta,n}(\varphi_n) - \zeta_{\theta,n}(\varphi_n)\right|^{r}\right\}^{\frac{1}{r}} \leq C_r.$$

Let $\{V_{\theta,n}\}_{n\geq 0}$ be a sequence of independent centered Gaussian random fields defined as follows. For any sequence $\{\varphi_n\}_{n\geq 0}$ in $\mathcal{B}(\mathcal{X})$ and any $p \geq 0$, $\{V_{\theta,n}(\varphi_n)\}_{n=0}^{p}$ is a collection of independent zero-mean Gaussian random variables with variances given by

$$\eta_{\theta,n}(\varphi_n^2) - \eta_{\theta,n}(\varphi_n)^2. \tag{3.4}$$

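Theorem 3.2 Assume (A). There exists a constant $C < \infty$ such that for any $\theta \in \Theta$, $y = \{y_n\}_{n\geq 0}$, $n \geq 0$ and $\varphi_n \in \mathrm{Osc}_1(\mathcal{X})$, $\sqrt{N}\left(\zeta^{N}_{\theta,n} - \zeta_{\theta,n}\right)(\varphi_n)$ converges in law, as $N \to \infty$, to the centered Gaussian random variable

$$\sum_{p=0}^{n} V_{\theta,p}\left(G_{\theta,p,n}\,\frac{D_{\theta,p,n}(F_{\theta,n} - Q_{\theta,n}(F_{\theta,n}))}{D_{\theta,p,n}(1)}\right) \tag{3.5}$$

whose variance is uniformly bounded above by $C$, where

$$F_{\theta,n} = (\varphi_n - Q_{\theta,n}(\varphi_n))\,(T_{\theta,n} - Q_{\theta,n}(T_{\theta,n})).$$

The proofs of both these results are in the Appendix.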

As a comparison, we quantify the variance of the particle estimate of the filter derivative computed using the path-based method (see (2.6)). Consider the following simplified example that serves to illustrate the point. Let $g_\theta(y|x) = g(y|x)$ (that is, $\theta$-independent) and $f_\theta(x_n|x_{n-1}) = \pi_\theta(x_n)$, where $\pi_\theta$ is the initial distribution. (Note that $f_\theta$ in this case satisfies a rephrased version of (3.2) under which the conclusion of Theorem 3.2 also holds.) Also, consider the sequence of repeated observations $y_0 = y_1 = \cdots$ where $y_0$ is arbitrary. Applying Lemma 7.12 (in the Appendix) that characterizes the limiting distribution of $\sqrt{N}(Q^{p,N}_{\theta,n} - Q_{\theta,n})$ to this special case results in $\sqrt{N}(\zeta^{p,N}_{\theta,n} - \zeta_{\theta,n})(\varphi)$ (see (2.6)) having an asymptotic distribution which is Gaussian with mean zero and variance

$$n \times \pi_\theta(\overline{\varphi}^2)\,\pi'_\theta\left((\nabla \log \pi_\theta)^2\right) + \pi_\theta\left(\overline{\varphi}^2 (\nabla \log \pi_\theta)^2\right) - \nabla \pi_\theta(\varphi)^2$$

where $\overline{\varphi} = \varphi - \pi_\theta(\varphi)$ and $\pi'_\theta(x) = \pi_\theta(x)\,g(y_0|x) / \pi_\theta(g(y_0|\cdot))$. This variance increases linearly with time in contrast to the time bounded variance of Theorem 3.2.

4 Application to recursive parameter estimation

Being able to compute $\{\zeta_{\theta,n}\}_{n\geq 0}$ is particularly useful when performing online static parameter estimation for state-space models using Recursive Maximum Likelihood (RML) techniques [Le Gland and Mevel, 1997, Poyiadjis et al., 2005, 2011]; see also Kantas et al. [2009] for a general review of available particle-method-based solutions, including Bayesian ones, for this problem. The computed filter derivative may also be useful in other areas; e.g. see Coquelin et al. [2008] for an application in control.

4.1 Recursive Maximum Likelihood

Let $\theta^*$ be the true static parameter generating the observed data $\{y_n\}_{n\geq 0}$. Given a finite record of observations $y_{0:T}$, the log-likelihood may be maximized with the following steepest ascent algorithm:

$$\theta_k = \theta_{k-1} + \gamma_k \left.\nabla \log p_\theta(y_{0:T})\right|_{\theta = \theta_{k-1}}, \quad k \geq 1, \tag{4.1}$$

where $\theta_0$ is some arbitrary initial guess of $\theta^*$, $\left.\nabla \log p_\theta(y_{0:T})\right|_{\theta = \theta_{k-1}}$ denotes the gradient of the log-likelihood evaluated at the current parameter estimate and $\{\gamma_k\}_{k\geq 1}$ is a decreasing positive real-valued step-size sequence, which should satisfy the following constraints:

$$\sum_{k=1}^{\infty} \gamma_k = \infty, \qquad \sum_{k=1}^{\infty} \gamma_k^2 < \infty.$$
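As an illustration of a sequence meeting both step-size constraints (the source does not prescribe a particular choice, so the constants $\gamma_0$ and $\alpha$ below are assumptions made for the example), polynomially decaying steps work because the first series is a divergent $p$-series and the second a convergent one:

```latex
\gamma_k = \gamma_0\, k^{-\alpha}, \qquad \gamma_0 > 0, \quad \tfrac{1}{2} < \alpha \le 1,
\qquad \text{so that} \qquad
\sum_{k=1}^{\infty} k^{-\alpha} = \infty \ \ (\alpha \le 1),
\qquad
\sum_{k=1}^{\infty} k^{-2\alpha} < \infty \ \ (2\alpha > 1).
```

Values of $\alpha$ closer to $1/2$ give larger steps and faster initial movement at the price of noisier iterates, while $\alpha = 1$ gives the fastest decay that still satisfies the first constraint.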

Although $\nabla \log p_\theta(y_{0:T})$ can be computed using (4.3), the computation cost can be prohibitive for a long data record since each iteration of (4.1) would require a complete pass through the $T + 1$ data points. A more attractive alternative would be a recursive procedure in which the data is run through once only, sequentially. For example, consider the following update scheme:

$$\theta_n = \theta_{n-1} + \gamma_n \left.\nabla \log p_\theta(y_n | y_{0:n-1})\right|_{\theta = \theta_{n-1}} \tag{4.2}$$

where $\left.\nabla \log p_\theta(y_n | y_{0:n-1})\right|_{\theta = \theta_{n-1}}$ denotes the gradient of $\log p_\theta(y_n | y_{0:n-1})$ evaluated at the current parameter estimate; that is, upon receiving $y_n$, $\theta_{n-1}$ is updated in the direction of ascent of the conditional density of this new observation. Since we have

$$\left.\nabla \log p_\theta(y_n | y_{0:n-1})\right|_{\theta = \theta_{n-1}} = \frac{\int dx_n\,\eta_{\theta_{n-1},n}(x_n) \left.\nabla g_\theta(y_n|x_n)\right|_{\theta_{n-1}} + \int dx_n\,\zeta_{\theta_{n-1},n}(x_n)\,g_{\theta_{n-1}}(y_n|x_n)}{\int dx_n\,\eta_{\theta_{n-1},n}(x_n)\,g_{\theta_{n-1}}(y_n|x_n)}, \tag{4.3}$$

this clearly requires the filter derivative $\zeta_{\theta,n}$. The algorithm in the present form is not suitable for online implementation as it requires re-computing the filter and its derivative at the value $\theta = \theta_{n-1}$ from time zero. The RML procedure uses an approximation of (4.3) which is obtained by updating the filter and its derivative using the parameter value $\theta_{n-1}$ at time $n$; we refer the reader to Le Gland and Mevel [1997] for details. The asymptotic properties of the RML algorithm, i.e. the behavior of $\theta_n$ in the limit as $n$ goes to infinity, have been studied by Titterington [1984] in the case of an i.i.d. hidden process and by Le Gland and Mevel [1997] for a finite state-space hidden Markov model. It is shown in Le Gland and Mevel [1997] that under regularity conditions this algorithm converges towards a local maximum of the average log-likelihood and that this average log-likelihood is maximized at $\theta^*$. A particle version of the RML algorithm of Le Gland and Mevel [1997] that uses Algorithm 1's estimate of $\eta_{\theta,n}$ is presented as Algorithm 2.

Algorithm 2: Particle Recursive Maximum Likelihood

• At time $n-1$ we are given $y_{0:n-1}$, the previous estimate $\theta_{n-1}$ of $\theta^*$ and $\{(X_{n-1}^{(i)}, T_{n-1}^{(i)})\}_{i=1}^{N}$.

• At time $n$, upon receiving $y_n$, sample $\{X_n^{(i)}\}_{1\le i\le N}$ independently from (2.7) using parameter $\theta = \theta_{n-1}$ to obtain

$$\eta_n^N(dx_n) = \frac{1}{N}\sum_{i=1}^{N} \delta_{X_n^{(i)}}(dx_n)$$


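and then compute

$$T_n^{(i)} = \frac{\sum_{j=1}^{N} \left(T_{n-1}^{(j)} + t_{\theta_{n-1},n}\big(X_{n-1}^{(j)}, X_n^{(i)}\big)\right) f_{\theta_{n-1}}\big(X_n^{(i)}|X_{n-1}^{(j)}\big)\, g_{\theta_{n-1}}\big(y_{n-1}|X_{n-1}^{(j)}\big)}{\sum_{j=1}^{N} f_{\theta_{n-1}}\big(X_n^{(i)}|X_{n-1}^{(j)}\big)\, g_{\theta_{n-1}}\big(y_{n-1}|X_{n-1}^{(j)}\big)}, \tag{4.4}$$

$$\zeta_n^N(dx_n) = \frac{1}{N}\sum_{i=1}^{N} \left(T_n^{(i)} - \frac{1}{N}\sum_{j=1}^{N} T_n^{(j)}\right) \delta_{X_n^{(i)}}(dx_n), \tag{4.5}$$

and

$$\widehat{\nabla \log p}(y_n|y_{0:n-1}) = \frac{\int \eta_n^N(dx_n)\, \nabla g_\theta(y_n|x_n)|_{\theta_{n-1}} + \int \zeta_n^N(dx_n)\, g_{\theta_{n-1}}(y_n|x_n)}{\int \eta_n^N(dx_n)\, g_{\theta_{n-1}}(y_n|x_n)}.$$

Finally update the parameter:

$$\theta_n = \theta_{n-1} + \gamma_n \widehat{\nabla \log p}(y_n|y_{0:n-1}). \tag{4.6}$$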
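One step of Algorithm 2 can be sketched in Python for a Gaussian AR(1) state with observation $Y_n = \beta \exp(X_n/2) W_n$. This is a minimal sketch under stated assumptions, not the authors' implementation. The proposal (2.7) and the increment $t_{\theta,n}$ are defined outside this section, so the code assumes that (2.7) selects an ancestor with probability proportional to $g_\theta(y_{n-1}|X_{n-1}^{(j)})$ and then moves it with $f_\theta$, and that $t_{\theta,n}(x_{n-1},x_n) = \nabla \log g_\theta(y_{n-1}|x_{n-1}) + \nabla \log f_\theta(x_n|x_{n-1})$. All function names (`log_f`, `log_g`, `grad_log_f`, `grad_log_g`, `rml_step`) are hypothetical, and the gradient formulas were derived here from the Gaussian densities.

```python
import numpy as np

def log_f(x_new, x_old, phi, sigma):
    """log f_theta(x_new | x_old) up to a constant: Gaussian AR(1) transition."""
    return -np.log(sigma) - (x_new - phi * x_old) ** 2 / (2.0 * sigma**2)

def log_g(y, x, beta):
    """log g_theta(y | x) up to a constant, for Y = beta * exp(X/2) * W."""
    return -np.log(beta) - x / 2.0 - y**2 / (2.0 * beta**2 * np.exp(x))

def grad_log_f(x_new, x_old, phi, sigma):
    """Gradient of log f w.r.t. (phi, sigma, beta), stacked on the last axis."""
    r = x_new - phi * x_old
    return np.stack([r * x_old / sigma**2,
                     -1.0 / sigma + r**2 / sigma**3,
                     np.zeros_like(r)], axis=-1)

def grad_log_g(y, x, beta):
    """Gradient of log g w.r.t. (phi, sigma, beta); only the beta entry is non-zero."""
    d_beta = -1.0 / beta + y**2 / (beta**3 * np.exp(x))
    zero = np.zeros_like(d_beta)
    return np.stack([zero, zero, d_beta], axis=-1)

def rml_step(theta, x_prev, T_prev, y_prev, y_new, gamma, rng):
    """One O(N^2) update of (particles, T, theta); no parameter constraints enforced."""
    phi, sigma, beta = theta
    N = x_prev.size
    # ASSUMED form of (2.7): ancestor drawn with weight g(y_{n-1} | X_{n-1}^{(j)}),
    # then propagated through the transition f.
    lw = log_g(y_prev, x_prev, beta)
    w = np.exp(lw - lw.max())
    w /= w.sum()
    anc = rng.choice(N, size=N, p=w)
    x_new = phi * x_prev[anc] + sigma * rng.normal(size=N)
    # (4.4): row i holds weights over j, proportional to f(X_n^i | X_{n-1}^j) g(y_{n-1} | X_{n-1}^j).
    logW = log_f(x_new[:, None], x_prev[None, :], phi, sigma) + lw[None, :]
    W = np.exp(logW - logW.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)
    # ASSUMED increment t = grad log g(y_{n-1} | x_{n-1}) + grad log f(x_n | x_{n-1}).
    t = (grad_log_g(y_prev, x_prev, beta)[None, :, :]
         + grad_log_f(x_new[:, None], x_prev[None, :], phi, sigma))
    T_new = np.einsum("ij,ijk->ik", W, T_prev[None, :, :] + t)
    # (4.5) and the score estimate; a common scaling of g cancels in the ratio.
    lg = log_g(y_new, x_new, beta)
    g = np.exp(lg - lg.max())
    centred = T_new - T_new.mean(axis=0)
    grad_g_term = grad_log_g(y_new, x_new, beta)  # grad g = g * grad log g
    score = (g[:, None] * (grad_g_term + centred)).mean(axis=0) / g.mean()
    # (4.6): stochastic gradient ascent step.
    return theta + gamma * score, x_new, T_new
```

The pairwise weight matrix `W` is what makes the cost quadratic in the number of particles. A practical implementation would also keep $\sigma$ and $\beta$ positive and $|\phi| < 1$, for instance through a reparameterization, which this sketch omits.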

Under Assumption A, the particle approximation of the filter is stable [Del Moral, 2004]; see also Lemma 7.4 in the Appendix. This, combined with the proven stability of the particle approximation of the filter derivative, implies that the particle estimate of the derivative of $\log p(y_n|y_{0:n-1})$ is also stable.

4.2 Simulations

The RML algorithm is applied to the following stochastic volatility model [Pitt and Shephard, 1999]:

$$X_0 \sim \mathcal{N}\left(0, \frac{\sigma^2}{1-\phi^2}\right), \qquad X_{n+1} = \phi X_n + \sigma V_{n+1}, \qquad Y_n = \beta \exp(X_n/2)\, W_n,$$

where $\mathcal{N}(m,s)$ denotes a Gaussian random variable with mean $m$ and variance $s$, $V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$ and $W_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$ are two mutually independent sequences, both independent of the initial state $X_0$. The model parameters, $\theta = (\phi,\sigma,\beta)$, are to be estimated.
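The stochastic volatility model above can be simulated directly. The following is a minimal sketch; the function name `simulate_sv` and its argument names are hypothetical and not part of the paper.

```python
import numpy as np

def simulate_sv(T, phi, sigma, beta, rng):
    """Simulate states x_0..x_{T-1} and observations y_0..y_{T-1}."""
    x = np.empty(T)
    # Stationary initial law N(0, sigma^2 / (1 - phi^2)).
    # rng.normal takes a standard deviation, hence the square root.
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - phi**2))
    for n in range(T - 1):
        x[n + 1] = phi * x[n] + sigma * rng.normal()
    # Observation noise W_n is drawn independently of the state noise V_n.
    y = beta * np.exp(x / 2.0) * rng.normal(size=T)
    return x, y

rng = np.random.default_rng(0)
x, y = simulate_sv(1000, phi=0.8, sigma=np.sqrt(0.1), beta=1.0, rng=rng)
```

Because $X_0$ is drawn from the stationary law, the state sequence is stationary from time zero, which requires $|\phi| < 1$.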

Our first example demonstrates the theoretical results in Section 3. The estimate of $\partial/\partial\sigma \log p(y_{n:n+L-1}|y_{0:n-1})$ at $\theta^* = (0.8, \sqrt{0.1}, 1)$ was computed using Algorithm 1 with 500 particles and using the path-space method (see (2.6)) with $2.5 \times 10^5$ particles for the stochastic volatility model. The block size $L$ was 500. Shown in Figure 1 is the variance of these particle estimates for various values of $n$ derived from many independent random replications of the simulation. The linear increase of the variance of the path-space method as predicted by theory is evident although Assumption A is not satisfied.

For the path-space method, because the variance of the estimate of the filter derivative grows linearly in time, the eventual high variance in the gradient estimate can result in the divergence of the parameter estimates. To illustrate this point, (4.6) was implemented with the path-space estimate of the filter derivative (2.6) computed with 10000 particles and constant step-size sequence, $\gamma_n = 10^{-4}$ for all $n$. $\theta_0$ was initialized at the true parameter value. A sequence of two million observations was simulated with $\theta^* = (0.8, \sqrt{0.1}, 1)$. The results are shown in Figure 3.

For the same value of $\theta^*$ and sequence of observations used in the previous example, Algorithm 2 was executed with 500 particles and $\gamma_n = 0.01$ for $n \le 10^5$, $\gamma_n = (n - 5 \times 10^4)^{-0.6}$ for $n > 10^5$. As can be seen from the results in Figure 2, the estimate converges to a value in the neighborhood of the true parameter.

Figure 1: Variance of the particle estimates of $\partial/\partial\sigma \log p(y_{n:n+500-1}|y_{0:n-1})$ for various values of $n$ for the stochastic volatility model. Circles are variance of Algorithm 1’s estimate with 500 particles. Stars indicate the variance of the estimate of the path-space method with $2.5 \times 10^5$ particles. Dotted line is best fitting straight line to path-space method’s variance to indicate trend.

Figure 2: Sequence of recursive parameter estimates, $\theta_n = (\sigma_n, \phi_n, \beta_n)$, computed using (4.6) with $N = 500$. From top to bottom: $\beta_n$, $\phi_n$ and $\sigma_n$, and marked on the right are the “converged values” which were taken to be the empirical average of the last 1000 values. Values marked on the plot: $\beta^* = 1.006$, $\phi^* = 0.802$, $(\sigma^2)^* = 0.097$.

Figure 3: RML for stochastic volatility with path-space gradient estimate with 10,000 particles, constant step-size and initialized at the true parameter values which are indicated by the dashed lines. From top to bottom, $\phi$, $\beta$ and $\sigma$.
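The piecewise step-size sequence used with Algorithm 2 in this experiment can be written as a small helper. This is a sketch only; the name `step_size` is hypothetical.

```python
import numpy as np

def step_size(n):
    """gamma_n = 0.01 for n <= 1e5 and (n - 5e4)^(-0.6) for n > 1e5."""
    n = np.asarray(n, dtype=float)
    # np.where evaluates both branches, so clamp the base to keep the
    # unused branch free of invalid powers when n <= 5e4.
    tail = np.maximum(n - 5e4, 1.0) ** (-0.6)
    return np.where(n <= 1e5, 0.01, tail)

gammas = step_size(np.array([1, 10**5, 10**5 + 1, 2 * 10**6]))
```

The schedule is constant during the first $10^5$ observations and then decays polynomially. Note that it drops discontinuously at $n = 10^5$, since $(5 \times 10^4)^{-0.6}$ is well below 0.01.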

5 Conclusion

We have presented theoretical results establishing the uniform stability of the particle approximation

of the optimal filter derivative proposed in Poyiadjis et al. [2005, 2009]. While these results have

been presented in the context of state-space models, they can also be applied to Feynman-Kac

models [Del Moral, 2004] which could potentially enlarge the range of applications. For example, if

$dx' f_\theta(x'|x)$ is reversible w.r.t. some probability measure $\mu_\theta$ and if we replace $g_\theta(y_n|x_n)$ with a time-homogeneous potential function $g_\theta(x_n)$ then $\eta_{\theta,n}$ converges, as $n \to \infty$, to the probability measure $\mu_{\theta,h}$ defined as

$$\mu_{\theta,h}(dx) := \frac{1}{\mu_\theta\left(h_\theta \int dx' f_\theta(x'|\cdot)\, h_\theta(x')\right)}\, \mu_\theta(dx)\, h_\theta(x) \int dx' f_\theta(x'|x)\, h_\theta(x')$$

where $h_\theta$ is a positive eigenmeasure associated with the top eigenvalue of the integral operator $Q_\theta(x,dx') = g_\theta(x)\, dx' f_\theta(x'|x)$ (see section 12.4 of Del Moral [2004]). The measure $\mu_{\theta,h}$ is the invariant measure of the h-process defined as the Markov chain with transition kernel $M_\theta(x,dx') \propto dx' f_\theta(x'|x)\, h_\theta(x')$. The particle algorithm described here can be directly used to approximate the derivative of this invariant measure w.r.t. $\theta$. It would also be of interest to weaken Assumption A and there are several ways this might be approached. For example, for non-ergodic signals using ideas in Oudjane and Rubenthaler [2005], Heine and Crisan [2008] or via Foster-Lyapunov conditions as in Beskos et al. [2011], Whiteley [2011].

6 Acknowledgement

We are grateful to Sinan Yildirim for carefully reading this report.


7 Appendix

The statements of the results in this section hold for any θ and any sequence of observations y =

{yn}n≥0. All mathematical expectations are taken with respect to the law of the particle system only

for the specific θ and y under consideration. While θ is retained in the statement of the results, it is

omitted in the proofs. The superscript y of the expectation operator is also omitted in the proofs.

This section commences with some essential definitions in addition to those in Section 1.1. Let

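$$P_{\theta,k,n}(x_k,dx_n) = \frac{Q_{\theta,k,n}(x_k,dx_n)}{Q_{\theta,k,n}(1)(x_k)},$$

and

$$M_{\theta,p}(x_p,dx_{0:p-1}) = \prod_{k=p}^{1} M_{\theta,k}(x_k,dx_{k-1}), \qquad p > 0,$$

and its corresponding particle approximation is

$$M^N_{\theta,p}(x_p,dx_{0:p-1}) = \prod_{k=p}^{1} M^N_{\theta,k}(x_k,dx_{k-1}).$$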

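To make the subsequent expressions more terse, let

$$\widetilde{\eta}^N_{\theta,n} = \Phi_{\theta,n}(\eta^N_{\theta,n-1}), \qquad n \ge 0, \tag{7.1}$$

where $\widetilde{\eta}^N_{\theta,0} = \Phi_{\theta,0}(\eta^N_{-1}) = \eta_{\theta,0} = \pi_\theta$ by convention. (Recall $\Phi_{\theta,n} = \Phi_{\theta,n-1,n}$.) Let

$$\mathcal{F}^N_n = \sigma\left(\left\{X_k^{(i)};\ 0 \le k \le n,\ 1 \le i \le N\right\}\right), \qquad n \ge 0,$$

be the natural filtration associated with the N-particle approximation model and let $\mathcal{F}^N_{-1}$ be the trivial sigma field.

The following estimates are a straightforward consequence of Assumption (A). For all θ and time indices $0 \le k < q \le n$,

$$b_{\theta,k,n} = \sup_{x_k,x_k'} \frac{Q_{\theta,k,n}(1)(x_k)}{Q_{\theta,k,n}(1)(x_k')} \le \rho^2\delta^2, \qquad \beta\left(\frac{Q_{\theta,k,q}(x_k,dx_q)\, Q_{\theta,q,n}(1)(x_q)}{Q_{\theta,k,q}(Q_{\theta,q,n}(1))(x_k)}\right) \le \left(1 - \rho^{-4}\right)^{(q-k)} = \rho^{q-k}, \tag{7.2}$$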

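and for θ, $0 < k \le q$,

$$M^N_{\theta,k}(x,dz) \le \rho^4 M^N_{\theta,k}(x',dz) \implies \beta\left(M^N_{\theta,q} \cdots M^N_{\theta,k}\right) \le \left(1 - \rho^{-4}\right)^{q-k+1}. \tag{7.3}$$

Note that setting $q = n$ in (7.2) yields an estimate for $\beta(P_{\theta,k,n})$.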

Several auxiliary results are now presented, all of which hinge on the following Khintchine-type moment bound proved in Del Moral [2004, Lem. 7.3.3].

Lemma 7.1 Del Moral [2004, Lemma 7.3.3] Let µ be a probability measure on the measurable space $(E,\mathcal{E})$. Let $G$ and $h$ be $\mathcal{E}$-measurable functions satisfying $G(x) \ge cG(x') > 0$ for all $x,x' \in E$ where $c$ is some finite positive constant. Let $\{X^{(i)}\}_{1\le i\le N}$ be a collection of independent random samples from µ. If $h$ has finite oscillation then for any integer $r \ge 1$ there exists a finite constant $a_r$, independent of $N$, $G$ and $h$, such that

$$\sqrt{N}\, E\left(\left|\frac{\sum_{i=1}^{N} G(X^{(i)})\, h(X^{(i)})}{\sum_{i=1}^{N} G(X^{(i)})} - \frac{\mu(Gh)}{\mu(G)}\right|^r\right)^{\frac{1}{r}} \le c^{-1}\, \mathrm{osc}(h)\, a_r.$$

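Proof:

The result for $G = 1$ and $c = 1$ is proved in Del Moral [2004]. The case stated here can be established using the representation

$$\frac{\mu^N(Gh)}{\mu^N(G)} - \frac{\mu(Gh)}{\mu(G)} = \frac{\mu(G)}{\mu^N(G)} \left(\mu^N - \mu\right)\left(\frac{G}{\mu(G)}\left(h - \frac{\mu(Gh)}{\mu(G)}\right)\right)$$

where $\mu^N(dx) = N^{-1}\sum_{i=1}^{N} \delta_{X^{(i)}}(dx)$.

Remark 7.2 For $k \ge 0$, let $h^N_{k-1}$ be an $\mathcal{F}^N_{k-1}$ measurable function satisfying $h^N_{k-1} \in \mathrm{Osc}_1(X)$ almost surely. Then Lemma 7.1 can be invoked to establish

$$\sqrt{N}\, E^y_\theta\left(\left|\frac{\eta^N_{\theta,k}(G h^N_{k-1})}{\eta^N_{\theta,k}(G)} - \frac{\Phi_{\theta,k}(\eta^N_{\theta,k-1})(G h^N_{k-1})}{\Phi_{\theta,k}(\eta^N_{\theta,k-1})(G)}\right|^r\right)^{\frac{1}{r}} \le c^{-1} a_r$$

where $G$ is defined as in Lemma 7.1.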
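The representation used in the proof of Lemma 7.1 can be checked in two lines. The shorthand $c_0 := \mu(Gh)/\mu(G)$ is introduced here only for this check and does not appear in the paper.

```latex
% Step 1: the mu-integral of the centred test function vanishes.
(\mu^N - \mu)\!\left(\frac{G}{\mu(G)}\,(h - c_0)\right)
  = \frac{\mu^N(Gh) - c_0\,\mu^N(G)}{\mu(G)}
    - \underbrace{\frac{\mu(Gh) - c_0\,\mu(G)}{\mu(G)}}_{=\,0}
  = \frac{\mu^N(Gh) - c_0\,\mu^N(G)}{\mu(G)}.

% Step 2: multiply by mu(G) / mu^N(G).
\frac{\mu(G)}{\mu^N(G)} \cdot \frac{\mu^N(Gh) - c_0\,\mu^N(G)}{\mu(G)}
  = \frac{\mu^N(Gh)}{\mu^N(G)} - c_0
  = \frac{\mu^N(Gh)}{\mu^N(G)} - \frac{\mu(Gh)}{\mu(G)}.
```

The error of the self-normalized estimate is thus the error of a plain empirical average of a $\mu$-centred function, multiplied by the random factor $\mu(G)/\mu^N(G)$.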

Lemma 7.3 to Lemma 7.6 are a consequence of Lemma 7.1 and the estimates in (7.2).

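Lemma 7.3 For any $r \ge 1$ there exists a finite constant $a_r$ such that the following inequality holds for all θ, y, $0 \le k \le n$ and any $\mathcal{F}^N_{k-1}$ measurable function $\varphi^N_n$ satisfying $\varphi^N_n \in \mathrm{Osc}_1(X)$ almost surely,

$$\sqrt{N}\, E^y_\theta\left(\left|\Phi_{\theta,k,n}(\eta^N_{\theta,k})(\varphi^N_n) - \Phi_{\theta,k-1,n}(\eta^N_{\theta,k-1})(\varphi^N_n)\right|^r\right)^{\frac{1}{r}} \le a_r\, b_{\theta,k,n}\, \beta(P_{\theta,k,n}),$$

where, by convention, $\Phi_{\theta,-1,n}(\eta^N_{\theta,-1}) = \eta_{\theta,n}$, and the constants $b_{\theta,k,n}$ and $\beta(P_{\theta,k,n})$ were defined in (7.2).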

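Proof:

$$\Phi_{k,n}(\eta^N_k)(\varphi^N_n) - \Phi_{k-1,n}(\eta^N_{k-1})(\varphi^N_n) = \int \left(\frac{\eta^N_k(dx_k)\, Q_{k,n}(1)(x_k)}{\eta^N_k Q_{k,n}(1)} - \frac{\Phi_k(\eta^N_{k-1})(dx_k)\, Q_{k,n}(1)(x_k)}{\Phi_k(\eta^N_{k-1}) Q_{k,n}(1)}\right) P_{k,n}(\varphi^N_n)(x_k)$$

where $\Phi_0(\eta^N_{-1}) = \eta_0$ by convention. Applying Lemma 7.1 with the estimates in (7.2) we have

$$\sqrt{N}\, E\left(\left|\Phi_{k,n}(\eta^N_k)(\varphi^N_n) - \Phi_{k-1,n}(\eta^N_{k-1})(\varphi^N_n)\right|^r \,\Big|\, \mathcal{F}^N_{k-1}\right)^{\frac{1}{r}} \le a_r\, b_{k,n}\, \beta(P_{k,n})$$

almost surely.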

Lemma 7.3 may be used to derive the following error estimate [Del Moral, 2004, Theorem 7.4.4].

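Lemma 7.4 For any $r \ge 1$, there exists a constant $c_r$ such that the following inequality holds for all θ, y, $n \ge 0$ and $\varphi \in \mathrm{Osc}_1(X)$,

$$\sqrt{N}\, E^y_\theta\left(\left|[\eta^N_{\theta,n} - \eta_{\theta,n}](\varphi)\right|^r\right)^{\frac{1}{r}} \le c_r \sum_{k=0}^{n} b_{\theta,k,n}\, \beta(P_{\theta,k,n}). \tag{7.4}$$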


- Available from Pierre Del Moral · May 23, 2014
- Available from ArXiv