
Supplementary Material of "Path Patterns: Analyzing and Comparing Real and Simulated Crowds"

Terms                  | Notation                  | Meaning
Agent State            | w                         | w = {p, v}, where p and v are the position and orientation of an agent
Data Segment           | d                         | Trajectory and velocity data of the crowd; d = {w_i}
Path Pattern           | β                         | A mixture of paths
DP Atoms               | h_li, c_dj                |
DP Weight Parameters   | v_k, g_l, ℓ_li, π_dj      |
Dirichlet Parameter    | η                         |
Beta Parameters        | a, b, ω, α                |
Component Indices      | j, i, l, k                | Their totals: J, I, L, K

Table 1: Terminology and Parameters
1 SV-DHDP Model
We first briefly review Dirichlet Processes (DPs) and Dependent Dirichlet Processes (DDPs). A DP can be seen as a probability distribution over distributions, which means that any draw from a DP is itself a probability distribution. In the stick-breaking representation [Sethuraman 1994] of a DP:

$$G = \sum_{k=1}^{\infty}\sigma_k(v)\,\beta_k, \quad \text{where } \sigma_k(v) = v_k\prod_{j=1}^{k-1}(1 - v_j), \quad \sum_{k=1}^{\infty}\sigma_k(v) = 1 \quad \text{and} \quad \beta_k \sim H(\eta).$$

The β_k are DP atoms drawn from some base distribution H; in our problem, they are Multinomials drawn from a Dirichlet(η). The σ_k(v) are called stick proportions or DP weights because they mimic breaking a stick iteratively in the following way: assuming the stick has length 1, in each iteration a proportion v_k of what is left of the stick, ∏_{j=1}^{k-1}(1 − v_j), is broken away. v is Beta-distributed.
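As an illustration of the stick-breaking construction, here is a minimal truncated numpy sketch; the truncation level K, vocabulary size V and hyper-parameter values are our own illustrative choices, not settings from the paper.

import numpy as np

def truncated_dp_draw(K=50, V=100, eta=0.5, omega=1.0, seed=0):
    # Draw a truncated approximation to G = sum_k sigma_k(v) * beta_k.
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, omega, size=K)                      # v_k ~ Beta(1, omega)
    # sigma_k(v) = v_k * prod_{j<k} (1 - v_j): iteratively break what is left of the stick
    sigma = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta = rng.dirichlet(np.full(V, eta), size=K)         # beta_k ~ Dirichlet(eta)
    return sigma, beta

sigma, beta = truncated_dp_draw()
print(sigma.sum())  # approaches 1 as K grows; the remainder is the truncated tail mass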
A DDP [MacEachern 1999] generalizes the concept of a DP by replacing its weights and atoms with stochastic processes. In our context, a DDP can be represented as:

$$G_l = \sum_{i=1}^{\infty}\sigma_i(v)\,G_{li},$$

where everything is the same as in the DP representation except that the atoms are now {G_li}. Each G_li is itself a DP, and all {G_li} are draws from the same base DP. Both DPs and DDPs are ideal priors for modeling an infinite number of clusters.
With the terminology defined in Table 1, and equipped with DPs and DDPs, we can now fully define our model. In a standard hierarchical Bayesian setting, a tree is constructed in an attempt to explain the observations through a hierarchy of factors. In our problem, the observations are agent states. We segment them into equal-length data segments along the time domain. Our goal is to find a set of path patterns {β_k} that, when combined with their respective weights, best describe all the segments in terms of their likelihoods. As shown in the toy example in the paper, a subset of {β_k} is needed to describe a data segment; {β_k} is shared across all segments. A two-layer tree is used to model this phenomenon. The root node is {β_k}, governed by a global DP prior. Each leaf node represents a data segment, with a DP drawn from the global DP prior to model its own pattern set {β_k}_d ⊂ {β_k}. This is a standard two-layer HDP.
Further, imagine that some data segments share a bigger subset of {β_k}, namely {β_k}_c, so that {β_k}_d ⊂ {β_k}_c ⊂ {β_k}; such segments form a segment cluster, and there are potentially infinitely many such clusters. We need a middle layer to capture this effect. At this layer there is a nested clustering: first, each {β_k}_c can contain infinitely many elements; second, the number of clusters can be infinitely large. This effect can be captured by adding a DDP layer immediately below {β_k} but above the leaf nodes. After constructing such a tree structure, we can compute {β_k} by clustering the agent states layer by layer up to the top.
Such a tree structure is shown in Figure 1. Each sharp-cornered rectangle is a DP. G on the right is the global DP over {β_k}. The bottom-left segment-level distribution, G_d, is the local DP over {β_k}_d. G_l is the DDP; the number of atoms in G_l is the number of segment clusters, and each atom G_li is a DP governing {β_k}_c. All stick proportions sum to 1, s.t. Σv = 1, Σg = 1, Σℓ = 1 and Σπ = 1. β_k, h_li and c_dj are DP atoms.
Figure 1: SV-DHDP model. Sharp-cornered rectangles (G, G_d, G_li and G_l) are DPs, in which squares are weights and circles are atoms. Rounded rectangles are data samples or data segments. Hexagons are pattern assignments. N is the number of agent states in a data segment.
This model explains how the observations w (shaded) are generated from β_k through a hierarchy of factors between w_n and β_k. Algorithm 1 gives this dependency between the observed agent states and the latent path patterns we are solving for.
2 Variational Inference for SV-DHDP

In this section we give the details of the variational inference for our SV-DHDP model. Variational Inference (VI) [Bishop 2007] approximates a target distribution by solving an optimization problem. When the target distribution is intractable, VI approximates it with a family of tractable distributions (the variational distributions); by optimizing the parameters of the variational distributions, the target distribution is approximated. The optimization minimizes the Kullback-Leibler (KL) divergence between the posterior distribution and the variational distribution q(β, Ω), which amounts to maximizing the evidence lower bound (ELBO), a lower bound on the logarithm of the marginal probability of the observations, log p(w):

$$\mathcal{L}(q) = \mathbb{E}_q[\log p(w, \beta, \Omega)] - \mathbb{E}_q[\log q(\beta, \Omega)] \qquad (1)$$

For the global DP, build G by:
    Drawing an infinite number of patterns, β_k ~ Dirichlet(η) for k ∈ {1, 2, 3, ...}
    Drawing the stick proportions σ_k(v), where v_k ~ Beta(1, ω) for k ∈ {1, 2, 3, ...}
For each segment cluster l, build G_l by:
    1. Drawing the stick proportions g_l ~ Beta(1, b) for l ∈ {1, 2, 3, ...}
    2. Since the atoms of G_l are DPs, build each atom G_li by:
       (a) Drawing an infinite number of pattern indices, h_li ~ Multinomial(σ(v)) for i ∈ {1, 2, 3, ...}
       (b) Drawing the stick proportions σ_i(ℓ), where ℓ_li ~ Beta(1, a) for i ∈ {1, 2, 3, ...}
       (c) For each data segment d, build G_d by:
           i.   Drawing an infinite number of data-segment cluster indices, e^l_d ~ Multinomial(σ(g)) for l ∈ {1, 2, 3, ...}
           ii.  Drawing an infinite number of group pattern indices, c_dj ~ Multinomial(σ(ℓ)) for j ∈ {1, 2, 3, ...}
           iii. Drawing the stick proportions σ_j(π), where π_dj ~ Beta(1, α) for j ∈ {1, 2, 3, ...}
           iv.  For each data sample w:
                A. Draw a pattern assignment, z_dn ~ Multinomial(σ(π)).
                B. Generate a data sample, w_n ~ Multinomial(β_{h_u}), where u = e_x, x = c_y and y = z_dn.
Algorithm 1: Data sample generation in SV-DHDP. Dirichlet, Beta and Multinomial stand for their eponymous distributions.
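To make Algorithm 1 concrete, here is a minimal truncated numpy sketch of the same generative process; all truncation levels, the vocabulary size and the hyper-parameter values are illustrative assumptions, and the helper names are ours rather than the paper's.

import numpy as np

rng = np.random.default_rng(0)
K, L, I, J = 20, 5, 8, 4      # truncation levels: patterns, clusters, components, groups
V, N = 50, 30                 # vocabulary of discretized agent states, samples per segment
eta, omega, b, a, alpha = 0.5, 1.0, 1.0, 1.0, 1.0

def stick_weights(frac):
    # sigma_k = frac_k * prod_{j<k}(1 - frac_j), renormalized over the truncation
    w = frac * np.concatenate(([1.0], np.cumprod(1.0 - frac)[:-1]))
    return w / w.sum()

# Global DP G: patterns beta_k ~ Dirichlet(eta), weights from v_k ~ Beta(1, omega)
beta = rng.dirichlet(np.full(V, eta), size=K)
sigma_v = stick_weights(rng.beta(1.0, omega, size=K))

# Cluster level: weights from g_l ~ Beta(1, b); each atom G_li points to a global pattern
sigma_g = stick_weights(rng.beta(1.0, b, size=L))
h = np.array([rng.choice(K, size=I, p=sigma_v) for _ in range(L)])            # h_li
sigma_ell = np.array([stick_weights(rng.beta(1.0, a, size=I)) for _ in range(L)])

def sample_segment():
    e_d = rng.choice(L, p=sigma_g)                     # segment-cluster index e_d
    c_d = rng.choice(I, size=J, p=sigma_ell[e_d])      # component index c_dj for each group j
    sigma_pi = stick_weights(rng.beta(1.0, alpha, size=J))
    z = rng.choice(J, size=N, p=sigma_pi)              # pattern assignment z_dn (a group id)
    k = h[e_d, c_d[z]]                                 # resolve group -> component -> pattern
    return np.array([rng.choice(V, p=beta[ki]) for ki in k])   # w_n ~ Multinomial(beta_k)

segment = sample_segment()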
The mean-field family is the simplest family for approximating the posterior. It assumes that each model parameter is conditioned only on its own hyper-parameters:

$$q(\beta, \Omega) = \prod_{k=1}^{K} q(\beta_k|\lambda_k)\Big(\prod_{m=1}^{M}\prod_{r=1}^{R} q(\Omega_{mr}|\xi_{mr})\Big) \qquad (2)$$

where λ is the parameter governing the distribution of the global parameter, and ξ_mr governs the distribution of the local parameter Ω_mr in the mth context (e.g., the mth data segment or the mth cluster). Here, M and R do not have specific meanings and are used only for illustration purposes.
We then optimize Equation 1 for λ and ξ. Since all the distributions in SV-DHDP are in the exponential family, we assume that q(Ω|ξ) and q(β|λ) are also in the exponential family, which has the general form:

$$p(\beta\,|\,w, \Omega, \eta) = h(\beta)\exp\{\rho_g(w, \Omega, \eta)^{T} t(\beta) - a_g(\rho_g(w, \Omega, \eta))\} \qquad (3)$$

where the scalar functions h(·) and a(·) are the base measure and the log-normalizer, and the vector functions ρ(·) and t(·) are the natural parameter and the sufficient statistics. To optimize Equation 1 with respect to λ, we take the gradient:

$$\nabla_\lambda \mathcal{L} = \nabla^{2}_{\lambda} a_g(\lambda)\big(\mathbb{E}_q[\rho_g(w, \Omega, \eta)] - \lambda\big) \qquad (4)$$

which we can set to zero by setting:

$$\lambda = \mathbb{E}_q[\rho_g(w, \Omega, \eta)] \qquad (5)$$

The optimization for ξ is similar to Equation 5.
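As a concrete instance of Equation 5 (a worked example we add here, derived from the Dirichlet complete conditional of Equation 20 below rather than quoted from the paper): for a path pattern β_k, Equation 5 reduces to the familiar "prior plus expected counts" update,

$$\lambda_{kv} = \eta + \mathbb{E}_q\Big[\sum_{d=1}^{D}\sum_{l=1}^{L} e^{l}_{d}\sum_{i=1}^{I} h^{k}_{li}\sum_{j=1}^{J} c^{li}_{dj}\sum_{n=1}^{N} z^{j}_{dn}\, w^{v}_{dn}\Big] = \eta + \sum_{d=1}^{D}\sum_{l=1}^{L} \mu^{l}_{d}\sum_{i=1}^{I} \varepsilon^{k}_{li}\sum_{j=1}^{J} \zeta^{li}_{dj}\sum_{n=1}^{N} \phi^{j}_{dn}\, w^{v}_{dn},$$

which is the batch counterpart of the λ̂ update in Algorithm 6 (there, the sum over d is replaced by D times the statistics of a single sampled segment).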
2.1 Natural Gradient
Since we are optimizing the parameters to minimize the KL divergence, it is more appropriate to compute the natural gradient of the ELBO instead of the Euclidean gradient. The natural gradient of a function accounts for the information geometry of its parameter space, using a Riemannian metric to correct the traditional gradient. According to [Amari 1998], the natural gradient can be computed by pre-multiplying the gradient by the inverse of the Riemannian metric G(λ):

$$\hat{\nabla}_\lambda f(\lambda) \triangleq G(\lambda)^{-1}\nabla_\lambda f(\lambda) \qquad (6)$$

where G(λ) is the Fisher information matrix of q(λ). When q(β|λ) is in the exponential family, G(λ) = ∇²_λ a_g(λ) and the natural gradient simplifies to ∇̂_λ L = E_q[ρ_g(w, Ω, η)] − λ. The natural gradient of L with respect to ξ takes a similar form, but depends only on its local context.
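One consequence worth spelling out (a short derivation we add for completeness): taking a natural-gradient ascent step of size ρ_t on λ gives

$$\lambda^{(t)} = \lambda^{(t-1)} + \rho_t\big(\mathbb{E}_q[\rho_g(w, \Omega, \eta)] - \lambda^{(t-1)}\big) = (1-\rho_t)\,\lambda^{(t-1)} + \rho_t\,\mathbb{E}_q[\rho_g(w, \Omega, \eta)],$$

which is exactly the convex-combination form of Equation 7 below, with the exact expectation replaced by a noisy estimate b_t.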
2.2 Stochastic Optimization
Optimizing Equation 1 for λ and ξ with a traditional coordinate-ascent algorithm involves nested iteration loops. The inner loop iterates over all data segments to update ξ until it converges, then jumps out to the outer loop to make one update of λ, and the iteration starts over until λ also converges. This is very slow, especially when the number of data segments is large, because before updating λ by one step, the inner loop has to compute the gradient at every data segment in the dataset.
To further speed up the training, we employ stochastic optimization. Stochastic optimization uses noisy gradient estimates with a decreasing step size to discover good local optima. Noisy gradients are usually cheaper to compute and help avoid low-quality local optima. Under certain conditions on the step size, it provably converges to an optimum [Robbins and Monro 1951]. Stochastic optimization uses a noisy gradient distribution B(λ) such that E_q[B(λ)] = ∇_λ f(λ). It allows us to update λ as:

$$\lambda^{(t)} = (1-\rho_t)\,\lambda^{(t-1)} + \rho_t\, b_t(\lambda^{(t-1)}) \qquad (7)$$

where b_t is an independent draw from the noisy gradient B, t is the time step, and the step size ρ_t satisfies:

$$\sum_t \rho_t = \infty; \qquad \sum_t \rho_t^2 < \infty \qquad (8)$$

Specifically, we use:

$$\rho_t = (t + \tau)^{-\kappa} \qquad (9)$$

where τ down-weights the early iterations and κ, the forgetting rate, controls how much new information is valued in each iteration. With Equation 7, we can sample the gradient at one data segment instead of all of them.
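For reference, the step-size schedule of Equation 9 in code; the values of τ and κ below are placeholders, not the settings used in the paper.

def step_size(t, tau=1.0, kappa=0.7):
    # kappa in (0.5, 1] keeps sum(rho_t) divergent while sum(rho_t^2) stays finite (Equation 8)
    return (t + tau) ** (-kappa)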
We further extend Equation 7 to a mini-batch version. In each iteration, we sample D data segments, compute Equation 7 for each of them, and average the results as the final update:

$$\lambda^{(t)} = (1-\rho_t)\,\lambda^{(t-1)} + \rho_t\,\frac{1}{D}\sum_{d} b^{t}_{d}(\lambda^{(t-1)}) \qquad (10)$$

where b^t_d(·) is the stochastic gradient computed from sample d and D is the number of sampled segments. Since the mini-batch version is highly parallelizable and gives better estimates of the gradient, it further speeds up the computation and improves the results.
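A minimal numpy sketch of the mini-batch update of Equation 10; the helper segment_estimate is a hypothetical stand-in for the per-segment noisy estimate b^t_d (computed as in Algorithm 6), and the batch size and step-size constants are illustrative.

import numpy as np

def minibatch_update(lam, segments, segment_estimate, t, batch_size=16,
                     tau=1.0, kappa=0.7, rng=np.random.default_rng(0)):
    rho = (t + tau) ** (-kappa)                                  # step size, Equation 9
    batch = rng.choice(len(segments), size=batch_size, replace=False)
    estimates = [segment_estimate(segments[d], lam) for d in batch]
    # Convex combination of the old value and the averaged noisy estimate (Equation 10)
    return (1.0 - rho) * lam + rho * np.mean(estimates, axis=0)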
In practice, we cannot perform computations for an infinite number of path patterns, so a truncation number is given at each level. This number is the maximum number of clusters modeled at that level. It is set larger than needed, so that only a subset of the clusters is actually used in the clustering. The truncation number for each layer is much smaller than the one above it, because we expect a much smaller number of path patterns in a child node than in its parent. We emphasize that this is fundamentally different from giving a pre-defined cluster number: the model still automatically computes the desirable number of clusters.
Given D data segments, each containing N agent states, we assume that the whole dataset contains k path patterns, where k < K. Data segments can be clustered into l clusters, where l < L, each of which contains i path-pattern indices, where i < I. Finally, in each data segment d, the agent states can be clustered into j groups, where j < J. We give the overall algorithm in Algorithm 2; the function subroutines and the mathematical derivations are given below.
Algorithm 2: VI Optimization
1  Initialize λ_0; set o^1 = 1 and o^2 = ω, p^1 = 1, p^2 = b, q^1 = 1, q^2 = a;
   set up the step size ρ_t; set t = 0;
2  while not converged do
3      sample a data segment w_d;
4      [ε_0, µ_d, ζ_d, φ_d] = initLocal(w_d, λ) (Algorithm 3);
5      [µ_d, ζ_d, φ_d] = opLocal(w_d, ε, ζ_d, φ_d, β, g) (Algorithm 4);
6      [ε, ζ_d, φ_d] = updateCluster(w_d, ε, ζ_d, φ_d, v, β) (Algorithm 5);
7      [λ^(t), o^{1,(t)}, o^{2,(t)}] = updateGlobal(w_d, η, ε, ζ_d, φ_d, ρ_t, λ^(t−1), o^{1,(t−1)}, o^{2,(t−1)}) (Algorithm 6);
8      t = t + 1;
9      update ρ_t with t (Equation 9);
10 end
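A structural sketch of the outer loop of Algorithm 2; the four subroutines are hypothetical placeholders standing in for Algorithms 3–6, and the convergence test is replaced by a fixed iteration budget.

import numpy as np

def svi_optimize(segments, model, init_local, op_local, update_cluster, update_global,
                 tau=1.0, kappa=0.7, max_iters=1000, rng=np.random.default_rng(0)):
    t = 0
    while t < max_iters:                                   # "while not converged" in Algorithm 2
        rho = (t + tau) ** (-kappa)                        # step size, Equation 9
        w_d = segments[rng.integers(len(segments))]        # sample a data segment
        eps, mu_d, zeta_d, phi_d = init_local(w_d, model)                        # Algorithm 3
        mu_d, zeta_d, phi_d = op_local(w_d, eps, zeta_d, phi_d, model)           # Algorithm 4
        eps, zeta_d, phi_d = update_cluster(w_d, eps, zeta_d, phi_d, model)      # Algorithm 5
        model = update_global(w_d, eps, mu_d, zeta_d, phi_d, rho, model)         # Algorithm 6
        t += 1
    return model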
2.3 Computational Details
Our derivation is based on Figure 1 and the complete conditionals. Equation 2 was given for the purpose of explaining Variational Inference in the paper and does not contain all the details. To do variational inference, we condition our model parameters on their own hyper-parameters. Here we expand it into:
$$q(\beta, \Omega) = \Big(\prod_{k=1}^{K} q(\beta_k|\lambda_k)\, q(v_k|\omega_k)\Big)\Big(\prod_{l=1}^{L} q(g_l|p_l)\Big(\prod_{i=1}^{I} q(\ell_{li}|q_{li})\, q(h_{li}|\varepsilon_{li})\Big)\Big)\Big(\prod_{d=1}^{D} q(e_d|\mu_d)\prod_{j=1}^{J} q(c_{dj}|\zeta_{dj})\, q(\pi_{dj}|\alpha_{dj})\prod_{n=1}^{N} q(z_{dn}|\phi_{dn})\Big) \qquad (11)$$
This is the complete variational distribution. From it, we can deduce the complete conditional for every parameter; a complete conditional is the distribution of a parameter given all the other parameters. We also assume that the conditional distributions of the parameters given their hyper-parameters are from the same exponential families. So q(z|φ), q(c|ζ) and q(h|ε) are Multinomial distributions; q(π|α1, α2), q(ℓ|q1, q2), q(g|p1, p2) and q(v|ω1, ω2) are Beta distributions; finally, q(β|λ) is a Dirichlet distribution.
We slightly abuse notation here and convert our variables into indicator vectors. For instance, we treat w as a vector of size S, so if the nth agent state in the dth data segment is v, it is represented by w^v_dn = 1. z^j_dn = 1 means that the nth agent state in the dth data segment is classified into the jth group of this segment. Similarly, c^{li}_dj = 1 means that the jth group in the dth data segment is assigned to the ith component of cluster l. Finally, h^k_li = 1 means that the ith component in the lth cluster is assigned to the kth pattern. The complete conditionals for the Multinomial nodes are:
$$P(z^{j}_{dn}=1\,|\,\pi_d, w_{dn}, c_d, e_d, h, \beta) \propto \exp\Big\{\log\sigma_j(\pi_d) + \sum_{l=1}^{L} e^{l}_{d}\sum_{i=1}^{I} c^{li}_{dj}\sum_{k=1}^{K} h^{k}_{li}\log\beta_{k,w_{dn}}\Big\} \qquad (12)$$

$$P(c^{li}_{dj}=1\,|\,w_d, z_d, e_d, h, \ell, \beta) \propto \exp\Big\{\log\sigma_i(\ell_l) + e^{l}_{d}\sum_{n=1}^{N} z^{j}_{dn}\sum_{k=1}^{K} h^{k}_{li}\log\beta_{k,w_{dn}}\Big\} \qquad (13)$$

$$P(e^{l}_{d}=1\,|\,g, w_d, z_d, c_d, h, \beta) \propto \exp\Big\{\log\sigma_l(g) + \sum_{i=1}^{I}\sum_{j=1}^{J} c^{li}_{dj}\sum_{n=1}^{N} z^{j}_{dn}\sum_{k=1}^{K} h^{k}_{li}\log\beta_{k,w_{dn}}\Big\} \qquad (14)$$

$$P(h^{k}_{li}=1\,|\,v, w, z, e, c, \beta) \propto \exp\Big\{\log\sigma_k(v) + \sum_{d=1}^{D} e^{l}_{d}\sum_{j=1}^{J} c^{li}_{dj}\sum_{n=1}^{N} z^{j}_{dn}\log\beta_{k,w_{dn}}\Big\} \qquad (15)$$
Aside from the Multinomial nodes, we also have Beta nodes:
$$P(v_k\,|\,h, \omega) = \mathrm{Beta}\Big(1 + \sum_{l=1}^{L}\sum_{i=1}^{I} h^{k}_{li},\;\; \omega + \sum_{l=1}^{L}\sum_{i=1}^{I}\sum_{m>k} h^{m}_{li}\Big) \qquad (16)$$

$$P(g_l\,|\,b, e) = \mathrm{Beta}\Big(1 + \sum_{d=1}^{D} e^{l}_{d},\;\; b + \sum_{d=1}^{D}\sum_{m>l} e^{m}_{d}\Big) \qquad (17)$$

$$P(\ell_{li}\,|\,a, c) = \mathrm{Beta}\Big(1 + \sum_{d=1}^{D}\sum_{j=1}^{J} c^{li}_{dj},\;\; a + \sum_{d=1}^{D}\sum_{j=1}^{J}\sum_{m>i} c^{lm}_{dj}\Big) \qquad (18)$$

$$P(\pi_{dj}\,|\,\alpha, z_d) = \mathrm{Beta}\Big(1 + \sum_{n=1}^{N} z^{j}_{dn},\;\; \alpha + \sum_{n=1}^{N}\sum_{m>j} z^{m}_{dn}\Big) \qquad (19)$$
Finally, the path patterns are Dirichlet distributions:
$$P(\beta_k\,|\,w, z, c, e, h, \eta) = \mathrm{Dirichlet}\Big(\eta + \sum_{d=1}^{D}\sum_{l=1}^{L} e^{l}_{d}\sum_{i=1}^{I} h^{k}_{li}\sum_{j=1}^{J} c^{li}_{dj}\sum_{n=1}^{N} z^{j}_{dn}\, w_{dn}\Big) \qquad (20)$$
Given the complete conditionals, now we can compute the hyper-parameters. We first give the distributions of the hyper-parameters of the Multinomial distributions:

$$\phi^{j}_{dn} = \mathbb{E}[z^{j}_{dn}] \propto \exp\Big\{\log\sigma_j(\pi_d) + \sum_{l=1}^{L}\mu^{l}_{d}\sum_{i=1}^{I}\zeta^{li}_{dj}\sum_{k=1}^{K}\varepsilon^{k}_{li}\,\mathbb{E}[\log\beta_{k,w_{dn}}]\Big\} \qquad (21)$$

$$\zeta^{li}_{dj} = \mathbb{E}[c^{li}_{dj}] \propto \exp\Big\{\log\sigma_i(\ell_l) + \mu^{l}_{d}\sum_{n=1}^{N}\phi^{j}_{dn}\sum_{k=1}^{K}\varepsilon^{k}_{li}\,\mathbb{E}[\log\beta_{k,w_{dn}}]\Big\} \qquad (22)$$

$$\mu^{l}_{d} = \mathbb{E}[e^{l}_{d}] \propto \exp\Big\{\log\sigma_l(g) + \sum_{i=1}^{I}\sum_{j=1}^{J}\zeta^{li}_{dj}\sum_{n=1}^{N}\phi^{j}_{dn}\sum_{k=1}^{K}\varepsilon^{k}_{li}\,\mathbb{E}[\log\beta_{k,w_{dn}}]\Big\} \qquad (23)$$

$$\varepsilon^{k}_{li} = \mathbb{E}[h^{k}_{li}] \propto \exp\Big\{\log\sigma_k(v) + \sum_{d=1}^{D}\mu^{l}_{d}\sum_{j=1}^{J}\zeta^{li}_{dj}\sum_{n=1}^{N}\phi^{j}_{dn}\,\mathbb{E}[\log\beta_{k,w_{dn}}]\Big\} \qquad (24)$$
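As an implementation note (a numpy sketch with assumed array shapes, not code from the authors): the φ update of Equation 21 for one segment can be done in log space and normalized over the groups j, with eps of shape (L, I, K), zeta of shape (J, L, I), mu of shape (L,), e_log_sigma_pi of shape (J,) holding E[log σ_j(π_d)], and e_log_beta_w of shape (N, K) holding E[log β_{k,w_dn}].

import numpy as np

def update_phi(e_log_sigma_pi, mu, zeta, eps, e_log_beta_w):
    # mix[j, k] = sum_l mu_l * sum_i zeta_{j,l,i} * eps_{l,i,k}
    mix = np.einsum('l,jli,lik->jk', mu, zeta, eps)
    log_phi = e_log_sigma_pi[None, :] + e_log_beta_w @ mix.T     # shape (N, J)
    log_phi -= log_phi.max(axis=1, keepdims=True)                # stabilize before exponentiating
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)                  # normalize over groups j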
We also give the distributions of the hyper-parameters of the Beta distributions:

$$o_k = \Big(1 + \sum_{l=1}^{L}\sum_{i=1}^{I}\varepsilon^{k}_{li},\;\; \omega + \sum_{l=1}^{L}\sum_{i=1}^{I}\sum_{m>k}\varepsilon^{m}_{li}\Big) \qquad (26)$$

$$p_l = \Big(1 + \sum_{d=1}^{D}\mu^{l}_{d},\;\; b + \sum_{d=1}^{D}\sum_{m>l}\mu^{m}_{d}\Big) \qquad (27)$$

$$q_{li} = \Big(1 + \sum_{d=1}^{D}\sum_{j=1}^{J}\zeta^{li}_{dj},\;\; a + \sum_{d=1}^{D}\sum_{j=1}^{J}\sum_{m>i}\zeta^{lm}_{dj}\Big) \qquad (28)$$

$$\gamma_{dj} = \Big(1 + \sum_{n=1}^{N}\phi^{j}_{dn},\;\; \alpha + \sum_{n=1}^{N}\sum_{m>j}\phi^{m}_{dn}\Big) \qquad (29)$$
Finally, for the sake of completeness, we give the equations to calculate E[log σ_k(v)] and E[log β]:

$$\mathbb{E}[\log v_k] = \Psi(o^{1}_{k}) - \Psi(o^{1}_{k} + o^{2}_{k})$$
$$\mathbb{E}[\log(1 - v_k)] = \Psi(o^{2}_{k}) - \Psi(o^{1}_{k} + o^{2}_{k})$$
$$\mathbb{E}[\log\sigma_k(v)] = \mathbb{E}[\log v_k] + \sum_{l=1}^{k-1}\mathbb{E}[\log(1 - v_l)] \qquad (30)$$

$$\mathbb{E}[\log\beta_{kv}] = \Psi(\lambda_{kv}) - \Psi\Big(\sum_{v'}\lambda_{kv'}\Big) \qquad (31)$$

where Ψ is the digamma function.
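These expectations translate directly into code (a sketch; o1 and o2 are the length-K Beta parameters and lam is the K×V Dirichlet parameter matrix):

import numpy as np
from scipy.special import digamma

def expected_log_sticks(o1, o2):
    e_log_v = digamma(o1) - digamma(o1 + o2)
    e_log_1mv = digamma(o2) - digamma(o1 + o2)
    # E[log sigma_k(v)] = E[log v_k] + sum_{l<k} E[log(1 - v_l)]   (Equation 30)
    return e_log_v + np.concatenate(([0.0], np.cumsum(e_log_1mv)[:-1]))

def expected_log_beta(lam):
    # E[log beta_kv] = Psi(lambda_kv) - Psi(sum_v' lambda_kv')     (Equation 31)
    return digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))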
Algorithm 3: initLocal
Data: w_d
Result: ε, µ_d, ζ_d, φ_d
1  for l ∈ {1,...,L} do
2      for i ∈ {1,...,I} do
3          ε^k_li ∝ exp{ Σ_{n=1}^{N} E[log β_{k,w_dn}] }, k ∈ {1,...,K};
4      end
5  end
6  for l ∈ {1,...,L} do
7      µ^l_d ∝ exp{ Σ_{i=1}^{I} Σ_{k=1}^{K} ε^k_li Σ_{n=1}^{N} E[log β_{k,w_dn}] };
8  end
9  for j ∈ {1,...,J} do
10     for l ∈ {1,...,L} do
11         ζ^{li}_dj ∝ exp{ µ^l_d Σ_{k=1}^{K} ε^k_li Σ_{n=1}^{N} E[log β_{k,w_dn}] }, i ∈ {1,...,I};
12     end
13 end
14 for n ∈ {1,...,N} do
15     φ^j_dn ∝ exp{ Σ_{l=1}^{L} µ^l_d Σ_{i=1}^{I} ζ^{li}_dj Σ_{k=1}^{K} ε^k_li E[log β_{k,w_dn}] }, j ∈ {1,...,J};
16 end
3 Additional Patterns
3.1 Bi-directional Flows
Here we show, in Figure 2, some data segments of the simulations done for our bi-directional flow example.
3.2 Park
Trajectories and data segments of the park dataset are shown in Figure 3.
3.3 Train Station
3.3.1 Patterns learned by SVDHDP
Figure 4 shows some snapshots of the data segments of the train station dataset. Additional patterns for the train station dataset are shown in Figure 5.
3.3.2 Patterns learned by Gibbs Sampling
The top 32 patterns learned by Gibbs Sampling for the train station dataset are shown in Figure 6 and Figure 7.
Algorithm 4: opLocal
Data: w_d, ε, ζ_d, φ_d, β, g
Result: ζ_d, φ_d, µ_d
1  while µ_d not converged do
2      while γ_d, ζ_d, φ_d not converged do
3          for j ∈ {1,...,J} do
4              γ^1_dj = 1 + Σ_{n=1}^{N} φ^j_dn;
5              γ^2_dj = α + Σ_{n=1}^{N} Σ_{m>j} φ^m_dn;
6              ζ^{li}_dj ∝ exp{ E[log σ_l(g)] + E[log σ_i(ℓ_l)] + Σ_{n=1}^{N} φ^j_dn Σ_{k=1}^{K} ε^k_li E[log β_{k,w_dn}] }, l ∈ {1,...,L}, i ∈ {1,...,I};
7          end
8          for n ∈ {1,...,N} do
9              φ^j_dn ∝ exp{ E[log σ_j(π_d)] + Σ_{l=1}^{L} Σ_{i=1}^{I} ζ^{li}_dj Σ_{k=1}^{K} ε^k_li E[log β_{k,w_dn}] }, j ∈ {1,...,J};
10         end
11     end
12     µ^l_d ∝ exp{ E[log σ_l(g)] + Σ_{i=1}^{I} Σ_{j=1}^{J} ζ^{li}_dj Σ_{n=1}^{N} φ^j_dn Σ_{k=1}^{K} ε^k_li E[log β_{k,w_dn}] }, l ∈ {1,...,L};
13 end
4 Similarity
4.1 Park Simulation
Here we show, in Figure 8, some data segments of the four simulations used in the similarity computation for the park example.
4.2 Train Station Simulation
Here we show, in Figure 9, some data segments of the four simulations used in the similarity computation for the train station example.
Learned patterns can be found in the main paper.
References
AMARI, S.-I. 1998. Natural Gradient Works Efficiently in Learning. Neural Comp. 10, 2, 251–276.
BISHOP, C. 2007. Pattern Recognition and Machine Learning. Springer, New York.
MACEACHERN, S. 1999. Dependent Nonparametric Processes. In ASA Bayesian Stat. Sci.
ROBBINS, H., AND MONRO, S. 1951. A Stochastic Approximation Method. Ann. Math. Statist. 22, 3, 400–407.
SETHURAMAN, J. 1994. A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650.
Algorithm 5: updateCluster
Data: w_d, ε, ζ_d, φ_d, v, β
Result: ε, p, q
1  Set the initial step size ρ_{t'}, set t' = 0;
2  while p not converged do
3      Set the initial step size ρ_{t'_o}, set t'_o = 0;
4      while q, ε not converged do
5          [ζ_d, φ_d] = opLocal(w_d, ε, ζ_d, φ_d, p, q) (Algorithm 4);
6          for l ∈ {1,...,L} do
7              for i ∈ {1,...,I} do
8                  q̂^1_li = 1 + D Σ_{j=1}^{J} ζ^{li}_dj;
9                  q̂^2_li = a + D Σ_{j=1}^{J} Σ_{m>i} ζ^{lm}_dj;
10                 ε̂^k_li ∝ exp{ E[log σ_k(v)] + D µ^l_d Σ_{j=1}^{J} ζ^{li}_dj Σ_{n=1}^{N} φ^j_dn E[log β_{k,w_dn}] }, k ∈ {1,...,K};
11             end
12         end
13         t'_o = t'_o + 1;
14         update ρ_{t'_o} with t'_o (Equation 9);
           for l ∈ {1,...,L} do
15             for i ∈ {1,...,I} do
16                 q^{(1,t'_o)}_li = (1 − ρ_{t'_o}) q^{(1,t'_o−1)}_li + ρ_{t'_o} q̂^1_li;
17                 q^{(2,t'_o)}_li = (1 − ρ_{t'_o}) q^{(2,t'_o−1)}_li + ρ_{t'_o} q̂^2_li;
18                 ε^{(k,t'_o)}_li = (1 − ρ_{t'_o}) ε^{(k,t'_o−1)}_li + ρ_{t'_o} ε̂^k_li, k ∈ {1,...,K};
19             end
20         end
21     end
22     t' = t' + 1;
23     update ρ_{t'} with t' (Equation 9);
       for l ∈ {1,...,L} do
24         p̂^1_l = 1 + D Σ_{i=1}^{I} E[ℓ_li] Σ_{j=1}^{J} ζ^{li}_dj;
25         p̂^2_l = b + D Σ_{i=1}^{I} E[ℓ_li] Σ_{j=1}^{J} Σ_{m>l} ζ^{mi}_dj;
26     end
27     for l ∈ {1,...,L} do
28         p^{(1,t')}_l = (1 − ρ_{t'}) p^{(1,t'−1)}_l + ρ_{t'} p̂^1_l;
29         p^{(2,t')}_l = (1 − ρ_{t'}) p^{(2,t'−1)}_l + ρ_{t'} p̂^2_l;
30     end
31 end
Algorithm 6: updateGlobal
Data: w_d, η, ε, µ_d, ζ_d, φ_d, ρ_t, λ^(t−1), o^{1,(t−1)}, o^{2,(t−1)}
Result: λ^(t), o^{1,(t)}, o^{2,(t)}
1  for k ∈ {1,...,K} do
2      λ̂_kv = η + D Σ_{l=1}^{L} µ^l_d Σ_{i=1}^{I} ε^k_li Σ_{j=1}^{J} ζ^{li}_dj Σ_{n=1}^{N} φ^j_dn w^v_dn;
3      ô^1_k = 1 + Σ_{l=1}^{L} Σ_{i=1}^{I} ε^k_li;
4      ô^2_k = ω + Σ_{l=1}^{L} Σ_{i=1}^{I} Σ_{m>k} ε^m_li;
5  end
6  λ^(t) = (1 − ρ_t) λ^(t−1) + ρ_t λ̂;
7  o^{1,(t)} = (1 − ρ_t) o^{1,(t−1)} + ρ_t ô^1;
8  o^{2,(t)} = (1 − ρ_t) o^{2,(t−1)} + ρ_t ô^2;
9  return λ^(t), o^{1,(t)}, o^{2,(t)}
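A numpy sketch of Algorithm 6 for one sampled segment (array shapes are assumptions: w_onehot is N×V, phi is N×J, zeta is J×L×I, eps is L×I×K, mu has length L, lam is K×V and o1, o2 have length K; D is the total number of data segments and rho the current step size).

import numpy as np

def update_global(w_onehot, phi, zeta, eps, mu, lam, o1, o2, eta, omega, D, rho):
    # lam_hat[k, v] = eta + D * sum_l mu_l sum_i eps_li^k sum_j zeta_dj^li sum_n phi_dn^j w_dn^v
    counts = np.einsum('l,lik,jli,nj,nv->kv', mu, eps, zeta, phi, w_onehot)
    lam_hat = eta + D * counts
    s = eps.sum(axis=(0, 1))                               # sum over l, i of eps_li^k, per k
    o1_hat = 1.0 + s
    tail = np.cumsum(s[::-1])[::-1] - s                    # sum over m > k of eps_li^m
    o2_hat = omega + tail
    # Blend with the previous values using the step size (lines 6-8 of Algorithm 6)
    return ((1 - rho) * lam + rho * lam_hat,
            (1 - rho) * o1 + rho * o1_hat,
            (1 - rho) * o2 + rho * o2_hat)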
Figure 2: Data segment samples from PARIS07, ONDREJ10,
PETT09 and MOU09
Figure 3: a: All trajectories. The red dots are cameras. The blue circles are exits/entrances. b–d: data segments. All data segments span 5 seconds.
Figure 4: Two data segments in the train station dataset.
Figure 5: Additional patterns learned by SV-DHDP from the train station dataset.
Figure 6: Patterns learned by Gibbs Sampling from the train station dataset.
Figure 7: Patterns learned by Gibbs Sampling from the train station dataset.
Figure 8: Data segment samples for park simulation from
PARIS07, ONDREJ10, PETT09 and MOU09
Figure 9: Data segment samples for train station simulation from
PARIS07, ONDREJ10, PETT09 and MOU09
ResearchGate has not been able to resolve any citations for this publication.
Article
Let $M(x)$ denote the expected value at level $x$ of the response to a certain experiment. $M(x)$ is assumed to be a monotone function of $x$ but is unknown to the experimenter, and it is desired to find the solution $x = \theta$ of the equation $M(x) = \alpha$, where $\alpha$ is a given constant. We give a method for making successive experiments at levels $x_1,x_2,\cdots$ in such a way that $x_n$ will tend to $\theta$ in probability.
Article
When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction, but the natural graadient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for blind source separation), and the space of linear dynamical systems (for blind source deconvolution). The dynamical behaviour of natural gradient online learning is analyzed and is proved to be Fischer efficient, implying that it has assymptotically the same performance as the optimal batch estimation of parameters. This suggests that the plateau phenomenon, which appears in the backpropagation learning algorithm of multilayer perceptrons, might disappear or might not be so serious when the natural gradient is used. An adaptive method of updating the learning rate is proposed and analyzed.