Automatica 43 (2007) 1418–1425

www.elsevier.com/locate/automatica

Brief paper

Parameter identification for nonlinear systems: Guaranteed confidence regions through LSCR ☆

Marco Dalai (a), Erik Weyer (b), Marco C. Campi (a,∗)

(a) Department of Electrical Engineering and Automation, University of Brescia, Via Branze 38, 25123 Brescia, Italy

(b) Department of Electrical and Electronic Engineering, The University of Melbourne, Parkville VIC 3010, Australia

Received 6 June 2006; received in revised form 27 October 2006; accepted 19 January 2007

Available online 19 June 2007

Abstract

In this paper we consider the problem of constructing confidence regions for the parameters of nonlinear dynamical systems. The proposed

method uses higher order statistics and extends the LSCR (leave-out sign-dominant correlation regions) algorithm for linear systems introduced

in Campi and Weyer [2005, Guaranteed non-asymptotic confidence regions in system identification. Automatica 41(10), 1751–1764. Extended

version available at <http://www.ing.unibs.it/~campi>]. The confidence regions contain the true parameter value with a guaranteed probability

for any finite number of data points. Moreover, the confidence regions shrink around the true parameter value as the number of data points

increases. The usefulness of the proposed approach is illustrated on some simple examples.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Confidence sets; Finite sample results; Nonlinear system identification

1. Introduction

It is well known that a model of a dynamical system is of

limited use if no quality tag which describes the accuracy of

the model is attached. Confidence regions for the system pa-

rameters are commonly used as quality tags, and asymptotic

theory is widely used for the construction of such regions. However, in practice one always has a finite number of samples,

and—even though the asymptotic theory delivers sensible re-

sults in many cases—there are also examples (Garatti, Campi,

& Bittanti, 2004) where it fails when applied to a finite num-

ber of data points. Thus, there is a need for techniques which

deliver confidence regions with guaranteed probabilities when

only a finite number of data points are available.

☆ This paper was not presented at any IFAC meeting. This paper was

recommended for publication in revised form by Associate Editor Antonio

Vicino under the direction of Editor Torsten Söderström.

∗Corresponding author. Tel.: +390303715458; fax: +39030380014.

E-mail addresses: marco.dalai@ing.unibs.it (M. Dalai),

e.weyer@ee.unimelb.edu.au (E. Weyer), marco.campi@ing.unibs.it

(M.C. Campi).

0005-1098/$-see front matter © 2007 Elsevier Ltd. All rights reserved.

doi:10.1016/j.automatica.2007.01.016

In Campi and Weyer (2005) a method called LSCR (leave-out sign-dominant correlation regions) was proposed for finding

confidence regions to which the parameters of a linear system

belong with guaranteed probability. See also Campi and Weyer

(2006) for a comprehensive presentation of LSCR. LSCR ex-

tends earlier work by Hartigan (1969, 1970) to a dynamical

system setting, and it has two important features: first, the prob-

ability that the confidence region contains the true parameters

is guaranteed for any finite amount of data samples; second, the

confidence region concentrates around the true parameter value

when the number of samples increases. In Campi and Weyer

(2005), second order statistics were explored for the construc-

tion of the confidence regions. In the present paper, we consider

nonlinear systems. It is well known (see for example, Ljung,

2001 for a general discussion, or Subba Rao, 1981 for the par-

ticular case of bilinear systems) that second order statistics are

insufficient for the identification of nonlinear systems. Here we

show that it is possible to extend the framework of LSCR to

higher order statistics, and hence to consider the problem of

nonlinear system identification within this setting.

The focus of this paper is on time series, that is the system

to be identified has no exogenous inputs which are measured.

The outline of the paper is as follows. In the next section, we

motivate the use of higher order statistics for nonlinear systems.

Section 3 contains the procedure for the construction of the

confidence region, and the properties of this procedure are also

studied. In Section 4 a simulation example using a bilinear

system is presented before conclusions are given in Section 5.

2. A simple nonlinear example: from second to higher

order statistics

This section illustrates the problems encountered when the

standard LSCR procedure of Campi and Weyer (2005) using

second order statistics is applied to a nonlinear system.

Consider the system

$$y_t = \theta^0 (y_{t-1}^2 - 1) + w_t, \tag{1}$$

where $\theta^0$ is the parameter value to be identified and $w_t$ is an independent sequence of Gaussian variables with zero mean and unit variance. We use the standard LSCR algorithm for construction of a confidence region for $\theta^0$. To this end, we first rewrite the system with a generic parameter $\theta$, $y_t = \theta (y_{t-1}^2 - 1) + w_t$, and then compute the associated optimal predictor: $\hat{y}_t(\theta) = \theta (y_{t-1}^2 - 1)$, and the prediction error: $\epsilon_t(\theta) = y_t - \hat{y}_t(\theta)$. LSCR constructs a confidence region based on an empirical evaluation of the correlations $E[\epsilon_t(\theta)\epsilon_{t+r}(\theta)]$, $r \ge 1$. In Campi and Weyer (2005) it is shown that $\theta^0$ is the only value of $\theta$ for which these correlations are zero in the case of linear ARMA systems and, consequently, the obtained confidence region shrinks around the true parameter value $\theta = \theta^0$ as the number of data points grows. Here we show that $E[\epsilon_t(\theta)\epsilon_{t+r}(\theta)] = 0$ does not imply $\theta = \theta^0$ for the system in (1), i.e. second order statistics do not suffice.

Suppose that the true parameter value is $\theta^0 = 0$. Then $y_t = w_t$, and we have

$$\epsilon_t(\theta) = y_t - \hat{y}_t(\theta) = w_t - \theta (w_{t-1}^2 - 1).$$

Thus,

$$E[\epsilon_t(\theta)\epsilon_{t+r}(\theta)] = E[(w_t - \theta (w_{t-1}^2 - 1))(w_{t+r} - \theta (w_{t+r-1}^2 - 1))]. \tag{2}$$

For $r \ge 2$, $E[\epsilon_t(\theta)\epsilon_{t+r}(\theta)] = 0$ for any value of $\theta$ since $w_t$ and $(w_{t-1}^2 - 1)$ are zero mean random variables, and the products in (2) only contain terms with different time indices. For $r = 1$ we have: $E[\epsilon_t(\theta)\epsilon_{t+1}(\theta)] = -\theta E[w_t (w_t^2 - 1)] = -\theta (E[w_t^3] - E[w_t]) = 0$. So, $E[\epsilon_t(\theta)\epsilon_{t+r}(\theta)] = 0$ for any $r \ge 1$, and any value of $\theta$. This implies that it is not possible to establish the true value of $\theta$ from the conditions $E[\epsilon_t(\theta)\epsilon_{t+r}(\theta)] = 0$. In turn, following the analysis in Campi and Weyer (2005), we see that the confidence region obtained by using the standard LSCR algorithm does not shrink around $\theta^0$ when the number of samples increases.

We complete this example by showing that the true value $\theta^0$ can indeed be determined by using higher order statistics. Take for example the condition $E[\epsilon_t^2(\theta)\epsilon_{t+1}(\theta)] = 0$. We have

$$E[\epsilon_t^2(\theta)\epsilon_{t+1}(\theta)] = \theta (E[w_t^2] - E[w_t^4]) = \theta (1 - 3) = -2\theta. \tag{3}$$

Thus, $E[\epsilon_t^2(\theta^0)\epsilon_{t+1}(\theta^0)] = 0$ since $\theta^0 = 0$, while $E[\epsilon_t^2(\theta)\epsilon_{t+1}(\theta)] \ne 0$ for any $\theta \ne \theta^0$.

So, in order to construct confidence regions that shrink around $\theta^0$, higher order statistics must be utilized. In the next section we generalize the LSCR method to this case.
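The two moment calculations of this section can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function names, the sample size and the random seed are hypothetical choices, and NumPy is assumed to be available. It draws $y_t = w_t$ (the case $\theta^0 = 0$), forms the prediction errors for a few values of $\theta$, and compares the sample versions of the second order statistic and of the third order statistic in (3).

```python
import numpy as np

def prediction_errors(y, theta):
    """eps_t(theta) = y_t - theta * (y_{t-1}^2 - 1) for system (1)."""
    return y[1:] - theta * (y[:-1] ** 2 - 1.0)

def sample_stats(theta, n=400_000, seed=0):
    """Sample means of eps_t*eps_{t+1} and eps_t^2*eps_{t+1} when theta0 = 0."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n)  # theta0 = 0 implies y_t = w_t
    eps = prediction_errors(y, theta)
    second = np.mean(eps[:-1] * eps[1:])       # estimates E[eps_t eps_{t+1}]
    third = np.mean(eps[:-1] ** 2 * eps[1:])   # estimates E[eps_t^2 eps_{t+1}]
    return second, third

for theta in (-0.3, 0.0, 0.3):
    second, third = sample_stats(theta)
    print(f"theta={theta:+.1f}  second={second:+.3f}  "
          f"third={third:+.3f}  -2*theta={-2 * theta:+.3f}")
```

Up to sampling error, the second order statistic should stay near zero for every $\theta$, while the third order statistic should follow $-2\theta$, which is what makes it informative about $\theta^0$.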

3. Extension of LSCR to higher order statistics

Consider a nonlinear system $S^0$ which maps a non-measured noise process $w_t$ into a measured signal $y_t$. Furthermore, assume that $S^0$ belongs to a parameterized system class $\{S_\theta\}$, that is $S^0 = S_{\theta^0}$ for some $\theta^0$. $w_t$ is an independent sequence of random variables, whose distribution is symmetric around zero. Apart from this, we make no other assumptions on $w_t$. The distribution of $w_t$ can as well be time-varying. We aim at finding a confidence region for the parameter vector $\theta^0$ by observing the output $y_t$.

The LSCR method in Campi and Weyer (2005) constructs, for every value of $\theta$, a sequence $w_t(\theta)$ such that for the true parameter $\theta^0$ we have that $w_t(\theta^0) = w_t$. Then, roughly speaking, the confidence region for $\theta^0$ is obtained by choosing the values of $\theta$ for which $w_t(\theta)$ resembles an independent process. For linear systems, one can take $w_t(\theta) = \epsilon_t(\theta)$, the prediction error, since $\epsilon_t(\theta^0) = w_t$, see Campi and Weyer (2005).

The case of nonlinear systems requires some extra care because $\epsilon_t(\theta^0) \ne w_t$ and $\epsilon_t(\theta^0)$ is not even an independent process in general. To see this, consider, e.g. the system class $y_t = \theta y_{t-1} + y_{t-1} w_t$. The optimal predictor is $\hat{y}_t(\theta) = \theta y_{t-1}$; but $y_t - \hat{y}_t(\theta^0) = y_{t-1} w_t$ is not an independent sequence!

In order to obtain a sequence $w_t(\theta)$ such that $w_t(\theta^0) = w_t$, we can proceed in a different way by resorting to system inversion instead of constructing the prediction error, see Fig. 1. For linear systems these two approaches coincide since constructing the prediction error is the same as inverting the system. In the example above we let $w_t(\theta) = (y_t - \theta y_{t-1})/y_{t-1}$, so that $w_t(\theta^0) = w_t$ as long as $y_{t-1} \ne 0$. System inversion is used as a basic building block in the algorithm presented below.
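The difference between the prediction error and the inverted system can be made concrete for the class $y_t = \theta y_{t-1} + y_{t-1} w_t$. The Python sketch below is a minimal illustration, not part of the original procedure: the uniform noise distribution, the initial condition $y_0 = 1$, the value $\theta^0 = 0.5$, the short record length and all function names are assumptions made for the example.

```python
import numpy as np

def simulate(theta0, n, seed=0, y_init=1.0):
    """Simulate y_t = theta0*y_{t-1} + y_{t-1}*w_t.
    Assumption: w_t is uniform on (-1, 1), which is symmetric around zero."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, size=n)
    y = np.empty(n + 1)
    y[0] = y_init
    for t in range(1, n + 1):
        y[t] = theta0 * y[t - 1] + y[t - 1] * w[t - 1]  # w[t-1] stores w_t
    return y, w

def invert(y, theta):
    """System inversion: w_t(theta) = (y_t - theta*y_{t-1}) / y_{t-1}."""
    return (y[1:] - theta * y[:-1]) / y[:-1]

def prediction_error(y, theta):
    """Prediction error: eps_t(theta) = y_t - theta*y_{t-1}."""
    return y[1:] - theta * y[:-1]

theta0 = 0.5
y, w = simulate(theta0, n=30)
w_hat = invert(y, theta0)
eps = prediction_error(y, theta0)
print("max |w_t(theta0) - w_t|   =", np.max(np.abs(w_hat - w)))
print("max |eps_t(theta0) - w_t| =", np.max(np.abs(eps - w)))
```

At $\theta = \theta^0$ the inverted sequence reproduces the noise up to rounding, whereas the prediction error equals $y_{t-1} w_t$ and therefore does not.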

Before proceeding we formally introduce our working as-

sumptions.

Assumptions.

(i) The observed data $y_t$ are obtained as output of a causal system $S^0$ whose input is an independent noise sequence $w_t$ symmetrically distributed around zero, i.e. $y_t = S^0(w_\tau, \tau \le t)$.

(ii) The system $S^0$ belongs to a system model class $\{S_\theta\}$, i.e. there exists a value $\theta^0$ of the parameter such that $S_{\theta^0} = S^0$.

(iii) The systems in $\{S_\theta\}$ are invertible with a causal inverse, i.e. for every $\theta$ there exists an inverse system $S_\theta^{-1}$ such that $S_\theta^{-1}(y_\tau(\theta), \tau \le t) = w_t$, where $y_t(\theta) = S_\theta(w_\tau, \tau \le t)$.

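[Figure 1: block diagram in which $w_t$ enters $S^0$ to produce $y_t$, and $y_t$ enters $S_\theta^{-1}$ to produce $w_t(\theta)$.]

Fig. 1. Scheme for the extraction of $w_t(\theta)$.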

The assumptions state that the model class consists of causal

systems which are also causally invertible and that the true data

generating system belongs to the model class.

3.1. Construction of the confidence region

We next describe the algorithm for the construction of the

confidence region.

Algorithm.

(A.1). Compute $w_t(\theta) = S_\theta^{-1}(y_\tau, \tau \le t)$ for $t = 1, 2, \ldots, K$.

(A.2). Choose an integer $s \ge 0$ and let $e = (e_0, e_1, \ldots, e_s)$ be a vector of nonnegative integers such that at least one of the $e_j$, $0 \le j \le s$, is odd (the way $e$ should be chosen is discussed later). For every $t = 1, 2, \ldots, K - s = N$, compute

$$f_{t,e}(\theta) = \prod_{j=0}^{s} w_{t+j}(\theta)^{e_j}.$$

(A.3). Let $I^N = \{1, \ldots, N\}$ and consider a collection $G^N$ of different subsets $I_i^N \subseteq I^N$, $i = 1, \ldots, M$, forming a group under the symmetric difference operation (i.e. $(I_i^N \cup I_j^N) - (I_i^N \cap I_j^N) \in G^N$ if $I_i^N, I_j^N \in G^N$). Suppose, without loss of generality, that $I_M^N$ is the zero element of the group $G^N$: $I_M^N = \emptyset$, the empty set. Compute

$$g_{i,e}^N(\theta) = \frac{1}{\# I_i^N} \sum_{k \in I_i^N} f_{k,e}(\theta), \quad i = 1, \ldots, M - 1$$

(# stands for "number of elements in the set").

(A.4). Select an integer $q$ in the interval $[1, (M+1)/2)$ and find the confidence region $\Theta_e^N$ where at least $q$ of the $g_{i,e}^N(\theta)$ functions are bigger than zero and at least $q$ are smaller than zero.
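Steps A.1 to A.4 can be sketched in Python for the scalar system (1) of Section 2 with $\theta^0 = 0$. This is a minimal illustration under stated assumptions, not the authors' implementation: the group of subsets is built here as all XOR combinations of eight random membership vectors, which is an assumed construction and not necessarily the one of Appendix A.3; the function names, the seed and the grid over $\theta$ are likewise hypothetical.

```python
import numpy as np

def statistic(w, e):
    """Step A.2: f_t = prod_j w_{t+j}^{e_j} for t = 0, ..., N-1, N = len(w) - s."""
    s = len(e) - 1
    n = len(w) - s
    f = np.ones(n)
    for j, ej in enumerate(e):
        f *= w[j:j + n] ** ej
    return f

def random_group(n, n_gen, rng):
    """Step A.3 under an ASSUMED construction (not necessarily Appendix A.3):
    all XOR combinations of n_gen random 0/1 membership vectors.
    Row i is the subset indexed by the bits of i; row 0 is the empty set."""
    gens = rng.integers(0, 2, size=(n_gen, n))
    bits = (np.arange(2 ** n_gen)[:, None] >> np.arange(n_gen)) & 1
    masks = (bits @ gens) % 2
    # The subsets must be distinct so that the group has M = 2**n_gen elements.
    assert len(np.unique(masks, axis=0)) == 2 ** n_gen
    return masks

def in_region(w_theta, e, masks, q):
    """Steps A.3-A.4: keep theta if at least q of the g_i are > 0
    and at least q of them are < 0."""
    f = statistic(w_theta, e)
    nonempty = masks[1:]                        # drop the empty set
    g = (nonempty @ f) / nonempty.sum(axis=1)   # subset averages g_i
    return bool((g > 0).sum() >= q and (g < 0).sum() >= q)

def w_of_theta(y, theta):
    """Step A.1 for system (1): w_t(theta) = y_t - theta*(y_{t-1}^2 - 1)."""
    return y[1:] - theta * (y[:-1] ** 2 - 1.0)

rng = np.random.default_rng(1)
y = rng.standard_normal(1002)         # theta0 = 0, so y_t = w_t
masks = random_group(1000, 8, rng)    # M = 256 subsets of {1, ..., 1000}
e, q = (2, 1), 12                     # confidence level 1 - 2*12/256

grid = np.linspace(-0.5, 0.5, 201)
accepted = [th for th in grid if in_region(w_of_theta(y, th), e, masks, q)]
if accepted:
    print(f"accepted theta values lie in [{min(accepted):.3f}, {max(accepted):.3f}]")
```

Only the signs of the $g_{i,e}^N(\theta)$ matter in A.4, so the normalization by $\# I_i^N$ could be dropped; it is kept here to mirror the formula in A.3.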

The intuitive idea behind the algorithm is as follows. For the true parameter vector $\theta^0$, $w_t(\theta^0) = w_t$ is an independent sequence symmetrically distributed around zero. Since at least one $e_j$ is odd, $f_{t,e}(\theta^0)$ is a zero mean random variable. Moreover, when $\theta = \theta^0$, the functions $g_{i,e}^N(\theta)$, $i = 1, \ldots, M - 1$, are sums of zero mean random variables. It is therefore unlikely that nearly all of them are positive or that nearly all of them are negative. Based on this observation we exclude the regions in parameter space where the $g_{i,e}^N(\theta)$ functions take on positive or negative values too many times.

Note that the construction of $\Theta_e^N$ does not require any knowledge of the characteristics of the noise $w_t$. The Algorithm lets the data speak for themselves and constructs the region $\Theta_e^N$ correspondingly: $\Theta_e^N$ does depend on the noise level, but this is through data only, not through a priori assumptions.

The next theorem says that the Algorithm always produces a region that contains $\theta^0$ with a probability chosen by the user.

Theorem 1. The region $\Theta_e^N$ constructed above has the property that

$$P[\theta^0 \in \Theta_e^N] = 1 - 2q/M.$$

[Figure 2: plot of $g_{i,e}^N(\theta)$ (vertical axis) against $\theta$ (horizontal axis, from −1 to 1).]

Fig. 2. Some of the $g_{i,e}^N(\theta)$ functions obtained with $e = (2,1)$, $N = 1000$.

Proof. See Appendix A.1. □

Thus, the user controls the probability that $\theta^0 \in \Theta_e^N$ via the choice of $q$.

In general, in order to determine a confidence region of suitable shape we may want to intersect several regions $\Theta_e^N$ obtained with different $e$ vectors. If we have $h$ vectors $e_1, e_2, \ldots, e_h$, then the confidence region is given by

$$\Theta^N = \bigcap_{l=1}^{h} \Theta_{e_l}^N. \tag{4}$$

Theorem 2. The region $\Theta^N$ constructed above has the property that

$$P[\theta^0 \in \Theta^N] \ge 1 - 2hq/M. \tag{5}$$

Proof. The proof follows from Theorem 1. The inequality in (5) is due to possible overlaps between the events $\theta^0 \notin \Theta_{e_l}^N$, $l = 1, \ldots, h$. □
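The probabilities in Theorems 1 and 2 are simple to evaluate when choosing $q$. The short Python sketch below uses exact rational arithmetic; the helper names and the rule "take the largest admissible $q$ that still meets a target level" are illustrative assumptions, not prescriptions from the text.

```python
import math
from fractions import Fraction

def coverage_single(q, M):
    """Theorem 1: P[theta0 in Theta_e^N] = 1 - 2q/M."""
    return 1 - Fraction(2 * q, M)

def coverage_bound(q, M, h):
    """Theorem 2: P[theta0 in Theta^N] >= 1 - 2hq/M for h intersected regions."""
    return 1 - Fraction(2 * h * q, M)

def largest_q(level, M, h=1):
    """Largest integer q in [1, (M+1)/2) with 1 - 2hq/M >= level, or None."""
    q = math.floor(Fraction(M) * (1 - Fraction(level)) / (2 * h))
    q = min(q, M // 2)  # integer q < (M+1)/2 is equivalent to q <= M // 2
    return q if q >= 1 else None

M = 256
q = largest_q(Fraction(9, 10), M)
print(q, coverage_single(q, M), float(coverage_single(q, M)))
q2 = largest_q(Fraction(9, 10), M, h=2)
print(q2, coverage_bound(q2, M, h=2))
```

With $M = 256$ and a 90% target, this selects $q = 12$ for a single region, giving probability $1 - 24/256 = 29/32$; when two regions are intersected, $q = 6$ gives the same lower bound in (5).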

To make the procedure more concrete, we next apply it to

the example in Section 2.

Example 3. Suppose we want to find a 90% confidence region for $\theta^0$. Since $y_t = \theta (y_{t-1}^2 - 1) + w_t$, let $w_t(\theta) = y_t - \theta (y_{t-1}^2 - 1)$. Note that, in this example, $w_t(\theta) = \epsilon_t(\theta)$, the prediction error. In Section 2, we established that $E[\epsilon_t(\theta)^2 \epsilon_{t+1}(\theta)] = 0$ only for $\theta = \theta^0$. Motivated by this observation we take $e = (2,1)$.

We simulated the system with $N = 1000$ and constructed the group $G^N$ as explained in Appendix A.3 with $M = 256$. We discarded the parameter values where fewer than $q = 12$ functions out of the $M = 256$ functions were positive or fewer than 12 functions were negative. Fig. 2 shows some of the obtained $g_{i,e}^N(\theta)$ functions. The confidence interval for $\theta^0$ turned out to be $[-0.05, 0.03]$.

The $g_{i,e}^N(\theta)$ functions are estimates of the third order statistic $E[w_t^2(\theta) w_{t+1}(\theta)]$. From Eq. (3), $E[w_t^2(\theta) w_{t+1}(\theta)] = -2\theta$, so the $g_{i,e}^N(\theta)$ functions cut the $\theta$ axis near $\theta = 0$ and the confidence region is a neighborhood of $0 = \theta^0$.

As $N$ increases one expects that the $g_{i,e}^N(\theta)$ functions become better and better approximations of the function $-2\theta$. Correspondingly, the confidence interval is expected to shrink around $\theta = 0$. In Fig. 3 the confidence intervals obtained for increasing values of $N$ are shown, and the trend is that the length of the intervals decreases as $N$ increases. Section 3.2 provides a general study of the convergence properties of the algorithm.

[Figure 3: confidence intervals $\Theta^N$ (vertical axis, from −0.25 to 0.25) against $N$ (horizontal axis, from 0 to 11000).]

Fig. 3. Ninety percent confidence regions with $e = (2,1)$ for increasing $N$.

As a comparison, Fig. 4 shows some of the $g_{i,e}^N(\theta)$ functions obtained using the second order statistic $E[w_t(\theta) w_{t+1}(\theta)]$, i.e. by choosing $e = (1,1)$. As we noticed in Section 2, $E[w_t(\theta) w_{t+1}(\theta)] = 0$ for all values of $\theta$; hence the $g_{i,e}^N(\theta)$ functions are all flat along the $\theta$ axis, and the confidence interval does not shrink around $\theta = 0$. □

[Figure 4: plot of $g_{i,e}^N(\theta)$ (vertical axis, from −1 to 1) against $\theta$ (horizontal axis, from −1 to 1).]

Fig. 4. Some of the $g_{i,e}^N(\theta)$ functions obtained with $e = (1,1)$, $N = 1000$.

Theorems 1 and 2 are very general and apply to any statistic

as described in point A.2 of the algorithm. Consequently, the

probability of the obtained region is always guaranteed. On the

other hand, the effect of the used statistics shows up in the

shape of the obtained region. Determining suitable statistics is

a problem for which no general guidelines can be given, and

the user should choose the statistics based on an analysis of the

system class at hand. See also Section 4 for an example.

3.2. Asymptotic behavior

When $N \to \infty$, we would like the confidence region $\Theta^N$ given by (4) to shrink around the true value $\theta^0$. In this section, we discuss general conditions for this to happen.

We need the following additional assumptions (while some of these assumptions can be relaxed, we have preferred to maintain them to avoid very technical mathematical derivations).

Assumptions.

(iv) The input noise $w_t$ is independent and identically distributed (i.i.d.).

(v) For every $\theta$ the considered statistics are in $L^1$, i.e. $E[|f_{t,e_l}(\theta)|] < \infty$, $l = 1, 2, \ldots, h$, and $\theta^0$ is the only solution to the set of conditions $E[f_{t,e_l}(\theta)] = 0$, $l = 1, 2, \ldots, h$.

(vi) The groups $G^N$ are constructed as explained in Appendix A.3, the value of $M$ is fixed and the value of $N$ is increasing.

Theorem 4. Under the hypotheses above, for every fixed $\theta \ne \theta^0$,

$$P[\exists \bar{N} \mid \theta \notin \Theta^N, \ \forall N > \bar{N}] = 1.$$

Proof. See Appendix A.2.

In other words, Theorem 4 says that any $\theta \ne \theta^0$ is eliminated from $\Theta^N$ starting at some $\bar{N}$ with probability 1.

Remark 5. The Algorithm in Section 3.1 can be generalized so that the assumption of Theorem 4 that $E[|f_{t,e_l}(\theta)|] < \infty$, $l = 1, 2, \ldots, h$, is certainly satisfied. In points A.2 and A.3 of the Algorithm, the $f_{t,e}(\theta)$ functions can be replaced by more general expressions. By inspection of the proof of Theorem 1, we see that the only property of $f_{t,e}(\theta)$ used is that $f_{t,e}(\theta)$ is a function of $w_t(\theta), w_{t+1}(\theta), \ldots, w_{t+s}(\theta)$ which is even or odd in all arguments and odd in at least one argument. For example, suppose $s = 2$ and $e = (2,1,2)$, then $f_{t,e}(\theta) = w_t(\theta)^2 w_{t+1}(\theta) w_{t+2}(\theta)^2$. This function is even in $w_t(\theta)$ and $w_{t+2}(\theta)$ and odd in $w_{t+1}(\theta)$. However, other functions than monomials exhibit the same odd-even structure. For example, the function $\tanh^2(w_t(\theta)) \tanh(w_{t+1}(\theta)) \tanh^2(w_{t+2}(\theta))$, where tanh is the hyperbolic tangent, can be used and Theorem 1 still holds. This observation makes it easier to satisfy the first part of Assumption (v) where it is required that $E[|f_{t,e}(\theta)|] < \infty$ since such a condition is automatically satisfied by considering bounded functions such as $\tanh^2(w_t(\theta)) \tanh(w_{t+1}(\theta)) \tanh^2(w_{t+2}(\theta))$.
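The bounded statistic of Remark 5 can be illustrated with a short Python sketch. The choice of Cauchy noise is an assumption made only for this example: it is symmetric around zero but has no finite mean, so the monomial $w_t^2 w_{t+1} w_{t+2}^2$ would not be in $L^1$, whereas the tanh version is bounded by one in absolute value. The function name and seed are hypothetical.

```python
import numpy as np

def bounded_statistic(w):
    """f_t = tanh^2(w_t) * tanh(w_{t+1}) * tanh^2(w_{t+2}) for t = 0, ..., len(w)-3."""
    a = np.tanh(w)
    return a[:-2] ** 2 * a[1:-1] * a[2:] ** 2

rng = np.random.default_rng(0)
w = rng.standard_cauchy(10_000)   # heavy-tailed, symmetric around zero (assumed)
f = bounded_statistic(w)

print("max |f_t| =", np.abs(f).max())
# Even in w_t and w_{t+2}, odd in w_{t+1}: flipping every sign flips f_t.
print("sign flip gives -f:", np.allclose(bounded_statistic(-w), -f))
```

Because the function is even in two arguments and odd in the third, it keeps the symmetry property used in the proof of Theorem 1 while remaining integrable for any noise distribution.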

4. Application example: a simple bilinear system

Here we illustrate the proposed approach on a bilinear sys-

tem, see Bruni, Di Pillo, and Koch (1974), Fnaiech and Ljung

(1987), Mohler and Kolodziej (1980), Priestley (1991), and

Subba Rao (1981).

Consider the system

$$y_t = \theta^0 y_{t-2} w_{t-1} + w_t, \tag{6}$$

where $w_t$ is i.i.d. with symmetric distribution around zero and with unit variance. This system has been studied in detail in Terdik and Máth (1998). By iterating (6), it is easy to see that the output $y_t$ can, for any $q \ge 1$, be written as

$$y_t = \sum_{k=0}^{q-1} (\theta^0)^k w_{t-2k} \prod_{j=1}^{k} w_{t-2j+1} + (\theta^0)^q y_{t-2q} \prod_{j=1}^{q} w_{t-2j+1}. \tag{7}$$

Note that the product $\prod_{j=1}^{q} w_{t-2j+1}$ has second order moment equal to 1 for any $q$:

$$E\left[\left(\prod_{j=1}^{q} w_{t-2j+1}\right)^2\right] = \prod_{j=1}^{q} E[w_{t-2j+1}^2] = 1.$$

Thus, if $|\theta^0| < 1$, by letting $q \to \infty$ in (7) we can take

$$y_t = \sum_{k=0}^{\infty} (\theta^0)^k w_{t-2k} \prod_{j=1}^{k} w_{t-2j+1} \tag{8}$$

as a candidate stationary solution. A calculation omitted here shows that the series on the right-hand side of (8) is indeed convergent in the $L^2$-sense as well as almost surely, the limit is stationary and it satisfies the system Eq. (6). We will refer to this stationary solution in what follows.

A simulation with $\theta^0 = 0.2$ and $w_t$ normally distributed with zero mean and unit variance was carried out. A confidence region was then constructed as explained next.

Following the procedure in the previous section, $w_t(\theta)$ was obtained by applying the inverse system $S_\theta^{-1}$ ($|\theta| < 1$) to the output $y_t$, which can be done by solving the recursive relation

$$w_t(\theta) = y_t - \theta y_{t-2} w_{t-1}(\theta).$$
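The simulation of (6) and the inversion recursion can be sketched in Python as follows. This is an illustration under stated assumptions rather than the setup used for the reported results: zero initial conditions are assumed for both the simulation and the inverse (so the simulated output is not the stationary solution (8)), and the function names, record length and seed are hypothetical.

```python
import numpy as np

def simulate_bilinear(theta0, n, seed=0):
    """Simulate y_t = theta0*y_{t-2}*w_{t-1} + w_t with Gaussian w_t.
    Assumption: all signals are zero before t = 0."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(n):
        y_tm2 = y[t - 2] if t >= 2 else 0.0
        w_tm1 = w[t - 1] if t >= 1 else 0.0
        y[t] = theta0 * y_tm2 * w_tm1 + w[t]
    return y, w

def invert_bilinear(y, theta):
    """Inverse system: w_t(theta) = y_t - theta*y_{t-2}*w_{t-1}(theta),
    started from the same zero initial conditions."""
    w_hat = np.zeros(len(y))
    for t in range(len(y)):
        y_tm2 = y[t - 2] if t >= 2 else 0.0
        w_tm1 = w_hat[t - 1] if t >= 1 else 0.0
        w_hat[t] = y[t] - theta * y_tm2 * w_tm1
    return w_hat

theta0 = 0.2
y, w = simulate_bilinear(theta0, n=1000)
print("recovers w_t at theta0:", np.allclose(invert_bilinear(y, theta0), w))
print("w_t(0) equals y_t:", np.allclose(invert_bilinear(y, 0.0), y))
```

The recursion is causal and uses only past outputs and past reconstructed noise values, so it fits Assumption (iii); with matching initial conditions it returns $w_t$ at $\theta = \theta^0$ and returns $y_t$ itself at $\theta = 0$.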

Note that for $\theta = 0$ we have $w_t(0) = y_t$. This has important consequences: after some cumbersome calculations it is possible to show that $y_t$ satisfies $E[y_t y_{t+r}] = 0$ for every $r > 0$ and $E[y_t y_{t+r} y_{t+l}] = 0$ for every $l \ge r \ge 0$ except for $(r,l) = (1,2)$. Since $w_t(0) = y_t$, this implies that, independently of the true parameter $\theta^0$, the value $\theta = 0$ is a solution of the equations $E[w_t(\theta) w_{t+r}(\theta)] = 0$ for every $r > 0$ and $E[w_t(\theta) w_{t+r}(\theta) w_{t+l}(\theta)] = 0$ for every $l \ge r \ge 0$ with $(r,l) \ne (1,2)$. So it is clear that the only possible statistic (up to third order) is $E[w_t(\theta) w_{t+1}(\theta) w_{t+2}(\theta)]$. Indeed, this choice turns out to be an effective one since it can be shown that the only solution to $E[w_t(\theta) w_{t+1}(\theta) w_{t+2}(\theta)] = 0$ is the true parameter $\theta = \theta^0$.

[Figure 5: plot of $g_{i,e}^N(\theta)$ (vertical axis, from −4 to 12) against $\theta$ (horizontal axis, from −0.8 to 0.8).]

Fig. 5. Some of the $g_{i,e}^N(\theta)$ functions obtained with $e = (1,1,1)$, $N = 1000$.

[Figure 6: confidence regions $\Theta^N$ (vertical axis, from 0 to 0.4) against $N$ (horizontal axis, from 0 to 11000).]

Fig. 6. Ninety percent confidence regions with $e = (1,1,1)$ for increasing $N$.

Following the above reasoning, we selected $e = (1,1,1)$ in point A.2 in Section 3.1. The group $G^N$ was constructed as in Appendix A.3 with $M = 256$, and the functions $g_{i,e}^N(\theta)$, for $i = 1, 2, \ldots, M - 1$, are given by

$$g_{i,e}^N(\theta) = \frac{1}{\# I_i^N} \sum_{k \in I_i^N} w_k(\theta) w_{k+1}(\theta) w_{k+2}(\theta).$$

Some of the $g_{i,e}^N(\theta)$ functions obtained with $N = 1000$ are shown in Fig. 5. The corresponding 90% confidence region for $\theta^0$ turned out to be $[0.11, 0.21]$. In Fig. 6 the confidence regions for different values of $N$ are plotted.