Parameter identification for nonlinear systems: Guaranteed confidence regions through LSCR.

Automatica 43 (2007) 1418–1425
www.elsevier.com/locate/automatica
Brief paper
Marco Dalai (a), Erik Weyer (b), Marco C. Campi (a,*)
(a) Department of Electrical Engineering and Automation, University of Brescia, Via Branze 38, 25123 Brescia, Italy
(b) Department of Electrical and Electronic Engineering, The University of Melbourne, Parkville VIC 3010, Australia
Received 6 June 2006; received in revised form 27 October 2006; accepted 19 January 2007
Available online 19 June 2007
Abstract
In this paper we consider the problem of constructing confidence regions for the parameters of nonlinear dynamical systems. The proposed method uses higher order statistics and extends the LSCR (leave-out sign-dominant correlation regions) algorithm for linear systems introduced in Campi and Weyer [2005, Guaranteed non-asymptotic confidence regions in system identification. Automatica 41(10), 1751–1764. Extended version available at http://www.ing.unibs.it/~campi]. The confidence regions contain the true parameter value with a guaranteed probability for any finite number of data points. Moreover, the confidence regions shrink around the true parameter value as the number of data points increases. The usefulness of the proposed approach is illustrated on some simple examples.
© 2007 Elsevier Ltd. All rights reserved.
Keywords: Confidence sets; Finite sample results; Nonlinear system identification
1. Introduction
It is well known that a model of a dynamical system is of limited use if no quality tag which describes the accuracy of the model is attached. Confidence regions for the system parameters are commonly used as quality tags, and asymptotic theory is widely used for the construction of such regions. However, in practice one always has a finite number of samples, and, even though the asymptotic theory delivers sensible results in many cases, there are also examples (Garatti, Campi, & Bittanti, 2004) where it fails when applied to a finite number of data points. Thus, there is a need for techniques which deliver confidence regions with guaranteed probabilities when only a finite number of data points are available.
This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Antonio Vicino under the direction of Editor Torsten Söderström.
* Corresponding author. Tel.: +390303715458; fax: +39030380014.
E-mail addresses: marco.dalai@ing.unibs.it (M. Dalai), e.weyer@ee.unimelb.edu.au (E. Weyer), marco.campi@ing.unibs.it (M.C. Campi).
0005-1098/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.automatica.2007.01.016
In Campi and Weyer (2005) a method called LSCR (leave-out sign-dominant correlation regions) was proposed for finding confidence regions to which the parameters of a linear system belong with guaranteed probability. See also Campi and Weyer (2006) for a comprehensive presentation of LSCR. LSCR extends earlier work by Hartigan (1969, 1970) to a dynamical system setting, and it has two important features: first, the probability that the confidence region contains the true parameters is guaranteed for any finite number of data samples; second, the confidence region concentrates around the true parameter value when the number of samples increases. In Campi and Weyer (2005), second order statistics were explored for the construction of the confidence regions. In the present paper, we consider nonlinear systems. It is well known (see, for example, Ljung, 2001 for a general discussion, or Subba Rao, 1981 for the particular case of bilinear systems) that second order statistics are insufficient for the identification of nonlinear systems. Here we show that it is possible to extend the framework of LSCR to higher order statistics, and hence to consider the problem of nonlinear system identification within this setting.
The focus of this paper is on time series, that is, the system to be identified has no measured exogenous inputs. The outline of the paper is as follows. In the next section, we motivate the use of higher order statistics for nonlinear systems. Section 3 contains the procedure for the construction of the confidence region, and the properties of this procedure are also studied. In Section 4 a simulation example using a bilinear system is presented before conclusions are given in Section 5.
2. A simple nonlinear example: from second to higher
order statistics
This section illustrates the problems encountered when the
standard LSCR procedure of Campi and Weyer (2005) using
second order statistics is applied to a nonlinear system.
Consider the system

$$y_t = \theta_0 (y_{t-1}^2 - 1) + w_t, \tag{1}$$

where $\theta_0$ is the parameter value to be identified and $w_t$ is an independent sequence of Gaussian variables with zero mean and unit variance. We use the standard LSCR algorithm for construction of a confidence region for $\theta_0$. To this end, we first rewrite the system with a generic parameter $\theta$, $y_t = \theta (y_{t-1}^2 - 1) + w_t$, and then compute the associated optimal predictor: $\hat{y}_t(\theta) = \theta (y_{t-1}^2 - 1)$, and the prediction error: $\varepsilon_t(\theta) = y_t - \hat{y}_t(\theta)$. LSCR constructs a confidence region based on an empirical evaluation of the correlations $E[\varepsilon_t(\theta)\varepsilon_{t+r}(\theta)]$, $r \ge 1$. In Campi and Weyer (2005) it is shown that $\theta_0$ is the only value of $\theta$ for which these correlations are zero in the case of linear ARMA systems and, consequently, the obtained confidence region shrinks around the true parameter value $\theta = \theta_0$ as the number of data points grows. Here we show that $E[\varepsilon_t(\theta)\varepsilon_{t+r}(\theta)] = 0$ does not imply $\theta = \theta_0$ for the system in (1), i.e. second order statistics do not suffice.

Suppose that the true parameter value is $\theta_0 = 0$. Then $y_t = w_t$, and we have

$$\varepsilon_t(\theta) = y_t - \hat{y}_t(\theta) = w_t - \theta (w_{t-1}^2 - 1).$$

Thus,

$$E[\varepsilon_t(\theta)\varepsilon_{t+r}(\theta)] = E[(w_t - \theta (w_{t-1}^2 - 1))(w_{t+r} - \theta (w_{t+r-1}^2 - 1))]. \tag{2}$$

For $r \ge 2$, $E[\varepsilon_t(\theta)\varepsilon_{t+r}(\theta)] = 0$ for any value of $\theta$ since $w_t$ and $(w_{t-1}^2 - 1)$ are zero mean random variables, and the products in (2) only contain terms with different time indices. For $r = 1$ we have: $E[\varepsilon_t(\theta)\varepsilon_{t+1}(\theta)] = -\theta E[w_t (w_t^2 - 1)] = -\theta (E[w_t^3] - E[w_t]) = 0$. So, $E[\varepsilon_t(\theta)\varepsilon_{t+r}(\theta)] = 0$ for any $r \ge 1$, and any value of $\theta$. This implies that it is not possible to establish the true value of $\theta$ from the conditions $E[\varepsilon_t(\theta)\varepsilon_{t+r}(\theta)] = 0$. In turn, following the analysis in Campi and Weyer (2005), we see that the confidence region obtained by using the standard LSCR algorithm does not shrink around $\theta_0$ when the number of samples increases.

We complete this example by showing that the true value $\theta_0$ can indeed be determined by using higher order statistics. Take for example the condition $E[\varepsilon_t^2(\theta)\varepsilon_{t+1}(\theta)] = 0$. We have

$$E[\varepsilon_t^2(\theta)\varepsilon_{t+1}(\theta)] = \theta (E[w_t^2] - E[w_t^4]) = \theta (1 - 3) = -2\theta. \tag{3}$$

Thus, $E[\varepsilon_t^2(\theta_0)\varepsilon_{t+1}(\theta_0)] = 0$ since $\theta_0 = 0$, while $E[\varepsilon_t^2(\theta)\varepsilon_{t+1}(\theta)] \ne 0$ for any $\theta \ne \theta_0$.

So, in order to construct confidence regions that shrink around $\theta_0$, higher order statistics must be utilized. In the next section we generalize the LSCR method to this case.
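The two moment calculations of this section can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis; the function names, the sample size and the seed are illustrative assumptions. It simulates the case θ0 = 0 (so that y_t = w_t) and estimates both the second order correlation and the third order statistic for a few values of θ.

```python
import random


def simulate_errors(theta, n, seed=0):
    """Prediction errors eps_t(theta) = w_t - theta*(w_{t-1}^2 - 1) for theta0 = 0 (y_t = w_t)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [w[t] - theta * (w[t - 1] ** 2 - 1.0) for t in range(1, n + 1)]


def second_order(eps):
    """Sample estimate of E[eps_t * eps_{t+1}]."""
    return sum(a * b for a, b in zip(eps, eps[1:])) / (len(eps) - 1)


def third_order(eps):
    """Sample estimate of E[eps_t^2 * eps_{t+1}]."""
    return sum(a * a * b for a, b in zip(eps, eps[1:])) / (len(eps) - 1)


if __name__ == "__main__":
    for theta in (-0.5, 0.0, 0.5):
        eps = simulate_errors(theta, 200_000)
        # The first estimate should stay near 0 for every theta, the second near -2*theta.
        print(theta, second_order(eps), third_order(eps))
```

With a large sample the second order estimate carries no information on θ, whereas the third order estimate varies with θ, in line with Eqs. (2) and (3).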
3. Extension of LSCR to higher order statistics
Consider a nonlinear system $S_0$ which maps a non-measured noise process $w_t$ into a measured signal $y_t$. Furthermore, assume that $S_0$ belongs to a parameterized system class $\{S_\theta\}$, that is, $S_0 = S_{\theta_0}$ for some $\theta_0$. Here $w_t$ is an independent sequence of random variables whose distribution is symmetric around zero. Apart from this, we make no other assumptions on $w_t$. The distribution of $w_t$ may also be time-varying. We aim at finding a confidence region for the parameter vector $\theta_0$ by observing the output $y_t$.
The LSCR method in Campi and Weyer (2005) constructs, for every value of $\theta$, a sequence $w_t(\theta)$ such that for the true parameter $\theta_0$ we have that $w_t(\theta_0) = w_t$. Then, roughly speaking, the confidence region for $\theta_0$ is obtained by choosing the values of $\theta$ for which $w_t(\theta)$ resembles an independent process. For linear systems, one can take $w_t(\theta) = \varepsilon_t(\theta)$, the prediction error, since $\varepsilon_t(\theta_0) = w_t$, see Campi and Weyer (2005).
The case of nonlinear systems requires some extra care because $\varepsilon_t(\theta_0) \ne w_t$ and $\varepsilon_t(\theta_0)$ is not even an independent process in general. To see this, consider, e.g., the system class $y_t = \theta y_{t-1} + y_{t-1} w_t$. The optimal predictor is $\hat{y}_t(\theta) = \theta y_{t-1}$; but $y_t - \hat{y}_t(\theta_0) = y_{t-1} w_t$ is not an independent sequence!
In order to obtain a sequence $w_t(\theta)$ such that $w_t(\theta_0) = w_t$, we can proceed in a different way by resorting to system inversion instead of constructing the prediction error, see Fig. 1. For linear systems these two approaches coincide since constructing the prediction error is the same as inverting the system. In the example above we let $w_t(\theta) = (y_t - \theta y_{t-1})/y_{t-1}$, so that $w_t(\theta_0) = w_t$ as long as $y_{t-1} \ne 0$. System inversion is used as a basic building block in the algorithm presented below.
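To make the difference between the prediction error and system inversion concrete, the following minimal Python sketch simulates the system class y_t = θ y_{t−1} + y_{t−1} w_t and recovers the noise by inversion. The uniform noise distribution, the initial value y_0 = 1 and the function names are assumptions made only for this illustration; the bounded noise keeps y_{t−1} away from zero.

```python
import random


def simulate(theta0, n, y0=1.0, seed=0):
    """Simulate y_t = theta0*y_{t-1} + y_{t-1}*w_t; returns (y, w) with y[0] = y0."""
    rng = random.Random(seed)
    # Assumption of this sketch: noise uniform on [-0.5, 0.5], symmetric around zero.
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    y = [y0]
    for wt in w:
        y.append(y[-1] * (theta0 + wt))
    return y, w


def invert(y, theta):
    """System inversion: w_t(theta) = (y_t - theta*y_{t-1}) / y_{t-1}, valid while y_{t-1} != 0."""
    return [(y[t] - theta * y[t - 1]) / y[t - 1] for t in range(1, len(y))]


def prediction_error(y, theta):
    """Prediction error y_t - theta*y_{t-1}; at theta0 this equals y_{t-1}*w_t, not w_t."""
    return [y[t] - theta * y[t - 1] for t in range(1, len(y))]


if __name__ == "__main__":
    y, w = simulate(1.0, 50)
    w_rec = invert(y, 1.0)
    print(max(abs(a - b) for a, b in zip(w_rec, w)))  # rounding-level difference
```

At θ = θ0 the inverted sequence coincides with the noise up to rounding, while the prediction error is the noise scaled by the past output.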
Before proceeding we formally introduce our working assumptions.
Assumptions.
(i) The observed data $y_t$ are obtained as output of a causal system $S_0$ whose input is an independent noise sequence $w_t$ symmetrically distributed around zero, i.e. $y_t = S_0(w_\tau, \tau \le t)$.
(ii) The system $S_0$ belongs to a system model class $\{S_\theta\}$, i.e. there exists a value $\theta_0$ of the parameter such that $S_{\theta_0} = S_0$.
(iii) The systems in $\{S_\theta\}$ are invertible with a causal inverse, i.e. for every $\theta$ there exists an inverse system $S_\theta^{-1}$ such that $S_\theta^{-1}(y_\tau(\theta), \tau \le t) = w_t$, where $y_t(\theta) = S_\theta(w_\tau, \tau \le t)$.
Fig. 1. Scheme for the extraction of $w_t(\theta)$: the noise $w_t$ drives $S_0$ to produce $y_t$, and $y_t$ is passed through $S_\theta^{-1}$ to produce $w_t(\theta)$.
The assumptions state that the model class consists of causal
systems which are also causally invertible and that the true data
generating system belongs to the model class.
3.1. Construction of the confidence region
We next describe the algorithm for the construction of the
confidence region.
Algorithm.
(A.1). Compute $w_t(\theta) = S_\theta^{-1}(y_\tau, \tau \le t)$ for $t = 1, 2, \ldots, K$.
(A.2). Choose an integer $s \ge 0$ and let $e = (e_0, e_1, \ldots, e_s)$ be a vector of non-negative integers such that at least one of the $e_j$, $0 \le j \le s$, is odd (the way $e$ should be chosen is discussed later). For every $t = 1, 2, \ldots, K - s = N$, compute

$$f_{t,e}(\theta) = \prod_{j=0}^{s} w_{t+j}(\theta)^{e_j}.$$

(A.3). Let $I^N = \{1, \ldots, N\}$ and consider a collection $G_N$ of different subsets $I_i^N \subseteq I^N$, $i = 1, \ldots, M$, forming a group under the symmetric difference operation (i.e. $(I_i^N \cup I_j^N) - (I_i^N \cap I_j^N) \in G_N$ if $I_i^N, I_j^N \in G_N$). Suppose, without loss of generality, that $I_M^N$ is the zero element of the group $G_N$: $I_M^N = \emptyset$, the empty set. Compute

$$g_{i,e}^N(\theta) = \frac{1}{\# I_i^N} \sum_{k \in I_i^N} f_{k,e}(\theta), \quad i = 1, \ldots, M - 1$$

(# stands for "number of elements in the set").
(A.4). Select an integer $q$ in the interval $[1, (M + 1)/2)$ and find the confidence region $\Theta_e^N$ where at least $q$ of the $g_{i,e}^N(\theta)$ functions are bigger than zero and at least $q$ are smaller than zero.
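Steps A.1 to A.4 can be sketched in a few lines of Python for a scalar parameter evaluated on a grid. This is only an illustrative sketch under stated assumptions: the group construction in `build_group` (random binary labels, with membership decided by a parity check) is a stand-in chosen here for simplicity and is not the construction of Appendix A.3; the function names, the grid, the seed and the values K = 401, M = 64, q = 3 are likewise hypothetical. The demonstration uses the system of Section 2 with θ0 = 0.

```python
import random


def build_group(n, m_bits, rng):
    """Return M = 2**m_bits distinct subsets of {0, ..., n-1}, closed under symmetric difference.

    Illustrative construction (an assumption of this sketch): index k gets a nonzero
    m_bits-bit label c_k, and subset i contains k when i & c_k has an odd number of
    set bits. Then subset(i) XOR subset(j) = subset(i ^ j), and subset(0) is empty.
    """
    if n < m_bits:
        raise ValueError("need n >= m_bits so that the subsets are distinct")
    M = 1 << m_bits
    # The first m_bits labels are unit vectors, which guarantees distinct subsets.
    labels = [1 << k if k < m_bits else rng.randrange(1, M) for k in range(n)]
    return [[k for k, c in enumerate(labels) if bin(i & c).count("1") % 2 == 1]
            for i in range(M)]


def f_stat(w, t, e):
    """Step A.2: f_{t,e} = prod_j w_{t+j}^{e_j}."""
    prod = 1.0
    for j, ej in enumerate(e):
        prod *= w[t + j] ** ej
    return prod


def in_region(w_theta, e, group, q):
    """Steps A.2-A.4 for one candidate theta, given w_theta = [w_1(theta), ..., w_K(theta)].

    The group is expected to be built on N = K - s indices.
    """
    n = len(w_theta) - (len(e) - 1)              # N = K - s
    f = [f_stat(w_theta, t, e) for t in range(n)]
    pos = neg = 0
    for subset in group:
        if subset:                               # the empty set (zero element) is skipped
            g = sum(f[k] for k in subset) / len(subset)
            pos += g > 0
            neg += g < 0
    return pos >= q and neg >= q


if __name__ == "__main__":
    rng = random.Random(1)
    K, e, q, m_bits = 401, (2, 1), 3, 6          # M = 64, so 1 - 2q/M = 0.90625
    y = [rng.gauss(0.0, 1.0) for _ in range(K + 1)]   # theta0 = 0, hence y_t = w_t
    group = build_group(K - (len(e) - 1), m_bits, rng)
    kept = []
    for i in range(-10, 11):
        theta = i / 10
        # A.1 for the Section 2 example: w_t(theta) = y_t - theta*(y_{t-1}^2 - 1)
        w_theta = [y[t] - theta * (y[t - 1] ** 2 - 1.0) for t in range(1, K + 1)]
        if in_region(w_theta, e, group, q):
            kept.append(theta)
    print("grid points kept:", kept)
```

The grid points that survive the sign count form a discretized version of the confidence region; a finer grid or a root search on the g functions would sharpen its boundary.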
The intuitive idea behind the algorithm is as follows. For the true parameter vector $\theta_0$, $w_t(\theta_0) = w_t$ is an independent sequence symmetrically distributed around zero. Since at least one $e_j$ is odd, $f_{t,e}(\theta_0)$ is a zero mean random variable. Moreover, when $\theta = \theta_0$, the functions $g_{i,e}^N(\theta)$, $i = 1, \ldots, M - 1$, are sums of zero mean random variables. It is therefore unlikely that nearly all of them are positive or that nearly all of them are negative. Based on this observation we exclude the regions in parameter space where the $g_{i,e}^N(\theta)$ functions take on positive or negative values too many times.

Note that the construction of $\Theta_e^N$ does not require any knowledge of the characteristics of the noise $w_t$. The Algorithm lets the data speak for themselves and constructs the region $\Theta_e^N$ correspondingly: $\Theta_e^N$ does depend on the noise level, but this is through data only, not through a priori assumptions.

The next theorem says that the Algorithm always produces a region that contains $\theta_0$ with a probability chosen by the user.

Theorem 1. The region $\Theta_e^N$ constructed above has the property that

$$P[\theta_0 \in \Theta_e^N] = 1 - 2q/M.$$
Fig. 2. Some of the $g_{i,e}^N(\theta)$ functions obtained with $e = (2,1)$, $N = 1000$ (plotted against $\theta$).
Proof. See Appendix A.1. □

Thus, the user controls the probability that $\theta_0 \in \Theta_e^N$ via the choice of $q$.

In general, in order to determine a confidence region of suitable shape we may want to intersect several regions $\Theta_e^N$ obtained with different $e$ vectors. If we have $h$ vectors $e_1, e_2, \ldots, e_h$, then the confidence region is given by

$$\Theta^N = \bigcap_{l=1}^{h} \Theta_{e_l}^N. \tag{4}$$
Theorem 2. The region $\Theta^N$ constructed above has the property that

$$P[\theta_0 \in \Theta^N] \ge 1 - 2hq/M. \tag{5}$$

Proof. The proof follows from Theorem 1. The inequality in (5) is due to possible overlaps between the events $\theta_0 \notin \Theta_{e_l}^N$, $l = 1, \ldots, h$. □
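The probabilities in Theorems 1 and 2 are simple functions of q, M and h, so choosing q for a desired confidence level is elementary arithmetic. The helper functions below are a hypothetical convenience sketch (their names are not from the paper); they return the guaranteed level for a given q and the largest q that still meets a target level.

```python
def confidence_level(q, M, h=1):
    """Guaranteed probability: exactly 1 - 2q/M for h = 1 (Theorem 1),
    and the lower bound 1 - 2hq/M when h regions are intersected (Theorem 2)."""
    if not 1 <= q < (M + 1) / 2:
        raise ValueError("q must lie in [1, (M + 1)/2)")
    return 1.0 - 2.0 * h * q / M


def largest_q(target, M, h=1):
    """Largest admissible q whose guaranteed level is at least `target`, or None if none exists."""
    best = None
    q = 1
    while q < (M + 1) / 2 and confidence_level(q, M, h) >= target:
        best = q
        q += 1
    return best


if __name__ == "__main__":
    print(confidence_level(12, 256))     # level for M = 256, q = 12
    print(largest_q(0.90, 256))          # single statistic
    print(largest_q(0.90, 256, h=2))     # intersection of two regions
```

A larger q discards more parameter values and therefore gives a smaller region at a lower guaranteed level; intersecting h regions requires a proportionally smaller q to keep the same bound.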
To make the procedure more concrete, we next apply it to
the example in Section 2.
Example 3. Suppose we want to find a 90% confidence region for $\theta_0$. Since $y_t = \theta (y_{t-1}^2 - 1) + w_t$, let $w_t(\theta) = y_t - \theta (y_{t-1}^2 - 1)$. Note that, in this example, $w_t(\theta) = \varepsilon_t(\theta)$, the prediction error. In Section 2, we established that $E[\varepsilon_t(\theta)^2 \varepsilon_{t+1}(\theta)] = 0$ only for $\theta = \theta_0$. Motivated by this observation we take $e = (2,1)$.

We simulated the system with $N = 1000$ and constructed the group $G_N$ as explained in Appendix A.3 with $M = 256$. We discarded the parameter values where fewer than $q = 12$ functions out of the $M = 256$ functions were positive or fewer than 12 functions were negative. Fig. 2 shows some of the obtained $g_{i,e}^N(\theta)$ functions. The confidence interval for $\theta_0$ turned out to be $[-0.05, 0.03]$.
Fig. 3. Ninety percent confidence regions with $e = (2,1)$ for increasing $N$ (regions $\Theta^N$ plotted against $N$).

Fig. 4. Some of the $g_{i,e}^N(\theta)$ functions obtained with $e = (1,1)$, $N = 1000$ (plotted against $\theta$).
The $g_{i,e}^N(\theta)$ functions are estimates of the third order statistic $E[w_t^2(\theta) w_{t+1}(\theta)]$. From Eq. (3), $E[w_t^2(\theta) w_{t+1}(\theta)] = -2\theta$, so the $g_{i,e}^N(\theta)$ functions cut the $\theta$ axis near $\theta = 0$ and the confidence region is a neighborhood of $0 = \theta_0$.

As $N$ increases one expects that the $g_{i,e}^N(\theta)$ functions become better and better approximations of the function $-2\theta$. Correspondingly, the confidence interval is expected to shrink around $\theta = 0$. In Fig. 3 the confidence intervals obtained for increasing values of $N$ are shown, and the trend is that the length of the intervals decreases as $N$ increases. Section 3.2 provides a general study of the convergence properties of the algorithm.

As a comparison, Fig. 4 shows some of the $g_{i,e}^N(\theta)$ functions obtained using the second order statistic $E[w_t(\theta) w_{t+1}(\theta)]$, i.e. by choosing $e = (1,1)$. As we noticed in Section 2, $E[w_t(\theta) w_{t+1}(\theta)] = 0$ for all values of $\theta$; hence the $g_{i,e}^N(\theta)$ functions are all flat along the $\theta$ axis, and the confidence interval does not shrink around $\theta = 0$. □
Theorems 1 and 2 are very general and apply to any statistic
as described in point A.2 of the algorithm. Consequently, the
probability of the obtained region is always guaranteed. On the
other hand, the effect of the used statistics shows up in the
shape of the obtained region. Determining suitable statistics is
a problem for which no general guidelines can be given, and
the user should choose the statistics based on an analysis of the
system class at hand. See also Section 4 for an example.
3.2. Asymptotic behavior
When $N \to \infty$, we would like the confidence region $\Theta^N$ given by (4) to shrink around the true value $\theta_0$. In this section, we discuss general conditions for this to happen.

We need the following additional assumptions (while some of these assumptions can be relaxed, we have preferred to maintain them to avoid very technical mathematical derivations).
Assumptions.
(iv) The input noise $w_t$ is independent and identically distributed (i.i.d.).
(v) For every $\theta$ the considered statistics are in $L_1$, i.e. $E[f_{t,e_l}(\theta)] < \infty$, $l = 1, 2, \ldots, h$, and $\theta_0$ is the only solution to the set of conditions $E[f_{t,e_l}(\theta)] = 0$, $l = 1, 2, \ldots, h$.
(vi) The groups $G_N$ are constructed as explained in Appendix A.3, the value of $M$ is fixed and the value of $N$ is increasing.
Theorem 4. Under the hypotheses above, for every fixed $\theta \ne \theta_0$,

$$P[\exists \bar{N} : \theta \notin \Theta^N, \ \forall N > \bar{N}] = 1.$$

Proof. See Appendix A.2.

In other words, Theorem 4 says that any $\theta \ne \theta_0$ is eliminated from $\Theta^N$ starting at some $\bar{N}$ with probability 1.
Remark 5. The Algorithm in Section 3.1 can be generalized so that the assumption of Theorem 4 that $E[f_{t,e_l}(\theta)] < \infty$, $l = 1, 2, \ldots, h$, is certainly satisfied. In points A.2 and A.3 of the Algorithm, the $f_{t,e}(\theta)$ functions can be replaced by more general expressions. By inspection of the proof of Theorem 1, we see that the only property of $f_{t,e}(\theta)$ used is that $f_{t,e}(\theta)$ is a function of $w_t(\theta), w_{t+1}(\theta), \ldots, w_{t+s}(\theta)$ which is even or odd in all arguments and odd in at least one argument. For example, suppose $s = 2$ and $e = (2,1,2)$, then $f_{t,e}(\theta) = w_t(\theta)^2 w_{t+1}(\theta) w_{t+2}(\theta)^2$. This function is even in $w_t(\theta)$ and $w_{t+2}(\theta)$ and odd in $w_{t+1}(\theta)$. However, functions other than monomials exhibit the same odd–even structure. For example, the function $\tanh^2(w_t(\theta)) \tanh(w_{t+1}(\theta)) \tanh^2(w_{t+2}(\theta))$, where $\tanh$ is the hyperbolic tangent, can be used and Theorem 1 still holds. This observation makes it easier to satisfy the first part of Assumption (v), where it is required that $E[f_{t,e}(\theta)] < \infty$, since such a condition is automatically satisfied by considering bounded functions such as $\tanh^2(w_t(\theta)) \tanh(w_{t+1}(\theta)) \tanh^2(w_{t+2}(\theta))$.
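The odd–even structure required in Remark 5 is easy to check numerically. The short Python sketch below (function names are illustrative, not from the paper) implements the monomial with e = (2,1,2) and its bounded tanh counterpart, and shows that both flip sign with the middle argument and are unchanged when the sign of an outer argument flips, while only the tanh version stays bounded.

```python
import math


def f_monomial(w0, w1, w2):
    """Monomial statistic with e = (2, 1, 2): w0^2 * w1 * w2^2."""
    return w0 ** 2 * w1 * w2 ** 2


def f_tanh(w0, w1, w2):
    """Bounded statistic tanh^2(w0) * tanh(w1) * tanh^2(w2): even in w0 and w2, odd in w1."""
    return math.tanh(w0) ** 2 * math.tanh(w1) * math.tanh(w2) ** 2


if __name__ == "__main__":
    args = (0.3, -1.2, 2.5)
    for f in (f_monomial, f_tanh):
        base = f(*args)
        print(f.__name__, base, f(-args[0], args[1], args[2]), f(args[0], -args[1], args[2]))
    # Large inputs: the monomial grows without bound, the tanh statistic stays in [-1, 1].
    print(f_monomial(1e6, -1e6, 1e6), f_tanh(1e6, -1e6, 1e6))
```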
4. Application example: a simple bilinear system
Here we illustrate the proposed approach on a bilinear system, see Bruni, Di Pillo, and Koch (1974), Fnaiech and Ljung (1987), Mohler and Kolodziej (1980), Priestley (1991), and Subba Rao (1981).
Consider the system

$$y_t = \theta_0 y_{t-2} w_{t-1} + w_t, \tag{6}$$

where $w_t$ is i.i.d. with symmetric distribution around zero and with unit variance. This system has been studied in detail in Terdik and Máth (1998). By iterating (6), it is easy to see that the output $y_t$ can, for any $q \ge 1$, be written as

$$y_t = \sum_{k=0}^{q-1} \theta_0^k w_{t-2k} \prod_{j=1}^{k} w_{t-2j+1} + \theta_0^q y_{t-2q} \prod_{j=1}^{q} w_{t-2j+1}. \tag{7}$$
Note that the product $\prod_{j=1}^{q} w_{t-2j+1}$ has second order moment equal to 1 for any $q$:

$$E\left[\left(\prod_{j=1}^{q} w_{t-2j+1}\right)^2\right] = \prod_{j=1}^{q} E[w_{t-2j+1}^2] = 1.$$

Thus, if $|\theta_0| < 1$, by letting $q \to \infty$ in (7) we can take

$$y_t = \sum_{k=0}^{\infty} \theta_0^k w_{t-2k} \prod_{j=1}^{k} w_{t-2j+1} \tag{8}$$

as a candidate stationary solution. A calculation omitted here shows that the series on the right-hand side of (8) is indeed convergent in the $L_2$ sense as well as almost surely, the limit is stationary and it satisfies the system Eq. (6). We will refer to this stationary solution in what follows.

A simulation with $\theta_0 = 0.2$ and $w_t$ normally distributed with zero mean and unit variance was carried out. A confidence region was then constructed as explained next.

Following the procedure in the previous section, $w_t(\theta)$ was obtained by applying the inverse system $S_\theta^{-1}$ ($|\theta| < 1$) to the output $y_t$, which can be done by solving the recursive relation

$$w_t(\theta) = y_t - \theta y_{t-2} w_{t-1}(\theta).$$
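The simulation and the inverse recursion can be sketched as follows. This is a minimal illustration, not the code used for the reported results: the zero initial conditions (which only approximate the stationary solution after a short transient), the seed and the function names are assumptions of the sketch.

```python
import random


def simulate_bilinear(theta0, n, seed=0):
    """Simulate y_t = theta0*y_{t-2}*w_{t-1} + w_t with Gaussian noise.

    Assumption of this sketch: zero initial conditions, i.e. terms with
    negative time index are taken as 0.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = []
    for t in range(n):
        y_lag2 = y[t - 2] if t >= 2 else 0.0
        w_lag1 = w[t - 1] if t >= 1 else 0.0
        y.append(theta0 * y_lag2 * w_lag1 + w[t])
    return y, w


def invert_bilinear(y, theta):
    """Solve w_t(theta) = y_t - theta*y_{t-2}*w_{t-1}(theta) forward in time (zero initial conditions)."""
    w_hat = []
    for t in range(len(y)):
        y_lag2 = y[t - 2] if t >= 2 else 0.0
        w_lag1 = w_hat[t - 1] if t >= 1 else 0.0
        w_hat.append(y[t] - theta * y_lag2 * w_lag1)
    return w_hat


if __name__ == "__main__":
    y, w = simulate_bilinear(0.2, 500)
    w_hat = invert_bilinear(y, 0.2)
    print(max(abs(a - b) for a, b in zip(w_hat, w)))  # rounding-level difference at theta = theta0
```

With matching initial conditions the recursion returns the driving noise at θ = θ0, and it returns the output itself at θ = 0.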
Note that for $\theta = 0$ we have $w_t(0) = y_t$. This has important consequences: after some cumbersome calculations it is possible to show that $y_t$ satisfies $E[y_t y_{t+r}] = 0$ for every $r > 0$ and $E[y_t y_{t+r} y_{t+l}] = 0$ for every $l \ge r \ge 0$ except for $(r,l) = (1,2)$. Since $w_t(0) = y_t$, this implies that, independently of the true parameter $\theta_0$, the value $\theta = 0$ is a solution of the equations $E[w_t(\theta) w_{t+r}(\theta)] = 0$ for every $r > 0$ and $E[w_t(\theta) w_{t+r}(\theta) w_{t+l}(\theta)] = 0$ for every $l \ge r \ge 0$ with $(r,l) \ne (1,2)$. So it is clear that the only possible statistic (up to third order) is $E[w_t(\theta) w_{t+1}(\theta) w_{t+2}(\theta)]$. Indeed, this choice turns out to be an effective one since it can be shown that the only solution to $E[w_t(\theta) w_{t+1}(\theta) w_{t+2}(\theta)] = 0$ is the true parameter $\theta = \theta_0$.

Fig. 5. Some of the $g_{i,e}^N(\theta)$ functions obtained with $e = (1,1,1)$, $N = 1000$ (plotted against $\theta$).

Fig. 6. Ninety percent confidence regions with $e = (1,1,1)$ for increasing $N$ (regions $\Theta^N$ plotted against $N$).
Following the above reasoning, we selected $e = (1,1,1)$ in point A.2 in Section 3.1. The group $G_N$ was constructed as in Appendix A.3 with $M = 256$, and the functions $g_{i,e}^N(\theta)$, for $i = 1, 2, \ldots, M - 1$, are given by

$$g_{i,e}^N(\theta) = \frac{1}{\# I_i^N} \sum_{k \in I_i^N} w_k(\theta) w_{k+1}(\theta) w_{k+2}(\theta).$$

Some of the $g_{i,e}^N(\theta)$ functions obtained with $N = 1000$ are shown in Fig. 5. The corresponding 90% confidence region for $\theta_0$ turned out to be $[0.11, 0.21]$. In Fig. 6 the confidence regions for different values of $N$ are plotted.
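The behavior of the third order statistic can also be explored with a plain sample average over all time indices, which corresponds to a single g function taken over the full index set. The sketch below is illustrative only: zero initial conditions, the sample size, the seed and the function names are assumptions, and it does not reproduce the group-based construction behind the reported intervals.

```python
import random


def simulate_bilinear(theta0, n, seed=0):
    """Simulate y_t = theta0*y_{t-2}*w_{t-1} + w_t (zero initial conditions, Gaussian noise)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = []
    for t in range(n):
        y_lag2 = y[t - 2] if t >= 2 else 0.0
        w_lag1 = w[t - 1] if t >= 1 else 0.0
        y.append(theta0 * y_lag2 * w_lag1 + w[t])
    return y


def invert_bilinear(y, theta):
    """w_t(theta) = y_t - theta*y_{t-2}*w_{t-1}(theta), solved forward in time."""
    w_hat = []
    for t in range(len(y)):
        y_lag2 = y[t - 2] if t >= 2 else 0.0
        w_lag1 = w_hat[t - 1] if t >= 1 else 0.0
        w_hat.append(y[t] - theta * y_lag2 * w_lag1)
    return w_hat


def third_order(w):
    """Sample estimate of E[w_t * w_{t+1} * w_{t+2}], i.e. the statistic with e = (1, 1, 1)."""
    n = len(w) - 2
    return sum(w[t] * w[t + 1] * w[t + 2] for t in range(n)) / n


if __name__ == "__main__":
    y = simulate_bilinear(0.2, 100_000)
    for theta in (0.0, 0.1, 0.2, 0.3):
        print(theta, third_order(invert_bilinear(y, theta)))
```

In such a run the estimate is expected to be close to zero at θ = θ0 = 0.2 and clearly away from zero at θ = 0, which is the property that makes e = (1,1,1) informative for this system.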