arXiv:1012.4173v1 [cs.NE] 19 Dec 2010
A Self-Organising Neural Network for Processing Data from Multiple Sensors∗
S P Luttrell
December 21, 2010
Abstract
This paper shows how a folded Markov chain network can be applied to the problem of processing data from multiple sensors, with an emphasis on the special case of 2 sensors. It is necessary to design the network so that it can transform a high dimensional input vector into a posterior probability, for which purpose the partitioned mixture distribution network is ideally suited. The underlying theory is presented in detail, and a simple numerical simulation is given that shows the emergence of ocular dominance stripes.
1 Theory
1.1 Neural Network Model
In order to fix ideas, it is useful to give an explicit “neural network” interpretation to the theory that will be developed. The model consists of 2 layers of nodes. The input layer has a “pattern of activity” that represents the components of the input vector x, and the output layer has a pattern of activity that is the collection of activities of the output nodes. The activities in the output layer depend only on the activities in the input layer. If an input vector x is presented to this network, then each output node “fires” discretely at a rate that corresponds to its activity. After n nodes have fired, the probabilistic description of the relationship between the input and output of the network is given by Pr(y_1, y_2, \cdots, y_n | x), where y_i is the location in the output layer (assumed to be a rectangular lattice of size m) of the ith node that fires. In this paper it will be assumed that the order in which the n nodes fire is not observed, in which case Pr(y_1, y_2, \cdots, y_n | x) is a sum of probabilities over all n! permutations of (y_1, y_2, \cdots, y_n), and is therefore a symmetric function of the y_i by construction.

The theory that is introduced in section 1.2 concerns the special case n = 1, where the probabilistic description Pr(y|x) is proportional to the firing rate of node y in response to input x. When n > 1 there is an indirect relationship between the probabilistic description Pr(y_1, y_2, \cdots, y_n | x) and the firing rate of node y, which is given by the marginal probability

    Pr(y|x) = \sum_{y_2, \cdots, y_n = 1}^{m} Pr(y, y_2, \cdots, y_n | x)    (1)

It is important to maintain this distinction between the events that are observed (i.e. (y_1, y_2, \cdots, y_n) given x) and the probabilistic description of the events that are observed (i.e. Pr(y_1, y_2, \cdots, y_n | x)). The only possible exception is in the n → ∞ limit, where Pr(y_1, y_2, \cdots, y_n | x) has all of its probability concentrated in the vicinity of those (y_1, y_2, \cdots, y_n) that are consistent with the observed long-term average firing rate of each node. It is essential to consider the n > 1 case to obtain the results that are described in this paper.

∗This unpublished draft paper accompanied a talk that was given at the Conference on Neural Networks for Computing, 4-7 April 1995, Snowbird.
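As a numerical check of the marginal firing rate in equation 1, node firings can be simulated under the independent-firing assumption that is introduced later (section 1.4). This is a minimal sketch, not the paper's own simulation; the network size, the random weights, and the sigmoid activities (anticipating equations 6 and 7) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior(x, W, b):
    """Pr(y|x): sigmoid node activities Q(x|y), normalised over nodes."""
    q = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return q / q.sum()

# hypothetical toy network: m = 5 output nodes, 3-dimensional input
W = rng.normal(size=(5, 3))
b = rng.normal(size=5)
x = rng.normal(size=3)

p = posterior(x, W, b)

# n nodes fire independently; the unordered firing counts estimate Pr(y|x)
n = 100_000
fires = rng.choice(5, size=n, p=p)
empirical = np.bincount(fires, minlength=5) / n
print(np.abs(empirical - p).max())   # small for large n
```

The empirical firing frequencies recover the probabilistic description only in the large-n limit, which mirrors the distinction drawn above between observed events and their probabilistic description.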
1.2 Probabilistic Encoder/Decoder

A theory of self-organising networks based on an analysis of a probabilistic encoder/decoder was presented in [1]. It deals with the n = 1 case referred to in section 1.1. The objective function that needs to be minimised in order to optimise a network in this theory is the Euclidean distortion D defined as

    D \equiv \sum_{y=1}^{m} \int dx \, dx' \, Pr(x) \, Pr(y|x) \, Pr(x'|y) \, \|x - x'\|^2    (2)

where x is an input vector, y is a coded version of x (a vector index on a d-dimensional rectangular lattice of size m), x' is a reconstructed version of x from y, Pr(x) is the probability density of input vectors, Pr(y|x) is a probabilistic encoder, and Pr(x'|y) is a probabilistic decoder which is specified by Bayes' theorem as

    Pr(x|y) = \frac{Pr(y|x) \, Pr(x)}{\int dx' \, Pr(y|x') \, Pr(x')}    (3)

D can be rearranged into the form [1]

    D = 2 \sum_{y=1}^{m} \int dx \, Pr(x) \, Pr(y|x) \, \|x - x'(y)\|^2    (4)

where the reference vectors x'(y) are defined as

    x'(y) \equiv \int dx \, Pr(x|y) \, x    (5)
Although equation 2 is symmetric with respect to interchanging the encoder and decoder, equation 4 is not. This is because Bayes' theorem has made explicit the dependence of Pr(x|y) on Pr(y|x). From a neural network viewpoint Pr(y|x) describes the feed-forward transformation from the input layer to the output layer, and x'(y) describes the implied feed-back transformation from the output layer to the input layer. The feed-back transformation is necessary to implement the objective function that has been chosen here.

Minimisation of D with respect to all free parameters leads to an optimal encoder/decoder. In equation 4 the Pr(y|x) are the only free parameters, because x'(y) is fixed by equation 5. However, in practice, both Pr(y|x) and x'(y) may be treated as free parameters [1], because x'(y) satisfies equation 5 at stationary points of D with respect to variation of x'(y).
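The interplay between equations 4 and 5 can be checked numerically on a discretised toy problem: when the reference vectors x'(y) are set to the conditional means of equation 5, the distortion of equation 4 is stationary with respect to them, so any common perturbation of the x'(y) increases D. The sample set, code-book size, and randomly chosen encoder Pr(y|x) below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# discrete toy version of equations 4 and 5: K input samples, m codes
K, m, d = 50, 4, 2
X = rng.normal(size=(K, d))                            # samples from Pr(x)
logits = rng.normal(size=(K, m))
P = np.exp(logits); P /= P.sum(axis=1, keepdims=True)  # rows are Pr(y|x)

# equation 5: x'(y) is the conditional mean of x given y
w = P.sum(axis=0)                      # proportional to Pr(y) (uniform Pr(x))
Xp = (P.T @ X) / w[:, None]            # reference vectors x'(y)

def distortion(Xp):
    # equation 4: D = 2 E_x sum_y Pr(y|x) ||x - x'(y)||^2
    diff = X[:, None, :] - Xp[None, :, :]
    return 2.0 * np.mean(np.sum(P * np.sum(diff ** 2, axis=2), axis=1))

D0 = distortion(Xp)
D1 = distortion(Xp + 0.1)   # shift every x'(y) away from its conditional mean
print(D0 < D1)   # True
```

Because D is quadratic in the x'(y), the conditional means are the exact minimisers, so the comparison above holds for any perturbation, not just this one.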
1.3 Posterior Probability Model

The probabilistic encoder/decoder requires an explicit functional form for the posterior probability Pr(y|x). A convenient expression is

    Pr(y|x) = \frac{Q(x|y)}{\sum_{y'=1}^{m} Q(x|y')}    (6)

where Q(x|y) > 0 can be regarded as a node “activity”, and \sum_{y=1}^{m} Pr(y|x) = 1. Any non-negative function can be used for Q(x|y), such as a sigmoid (which satisfies 0 ≤ Q(x|y) ≤ 1)

    Q(x|y) = \frac{1}{1 + \exp(-w(y) \cdot x - b(y))}    (7)

where w(y) and b(y) are a weight vector and bias, respectively.
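Equation 6 turns any positive activity into a posterior probability. A quick check, using both the sigmoid of equation 7 and an alternative Gaussian activity (the Gaussian, and all sizes and parameters, are assumed here for illustration; the paper commits only to the sigmoid):

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.normal(size=(6, 3))     # hypothetical node centres for the Gaussian
W = rng.normal(size=(6, 3))
b = rng.normal(size=6)
x = rng.normal(size=3)

Q_sigmoid = 1.0 / (1.0 + np.exp(-(W @ x + b)))          # equation 7
Q_gauss = np.exp(-0.5 * np.sum((C - x) ** 2, axis=1))   # an assumed alternative

# equation 6: normalise either activity into a posterior over the 6 nodes
posteriors = [Q / Q.sum() for Q in (Q_sigmoid, Q_gauss)]
for p in posteriors:
    print(round(p.sum(), 12), bool((p > 0).all()))   # 1.0 True
```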
A drawback of equation 6 is that it does not scale well to input vectors that have a large dimensionality. This problem arises from the restricted functional form allowed for Q(x|y). A solution was presented in [2]

    Pr(y|x) = \frac{1}{M} \sum_{y' \in \tilde{N}(y)} \frac{Q(x|y)}{\sum_{y'' \in N(y')} Q(x|y'')}    (8)

where M \equiv m_1 m_2 \cdots m_d, N(y) is a set of lattice points that are deemed to be “in the neighbourhood of” the lattice point y, and \tilde{N}(y) is the inverse neighbourhood, defined as the set of lattice points that have lattice point y in their neighbourhood. This expression for Pr(y|x) satisfies \sum_{y=1}^{m} Pr(y|x) = 1 (see appendix A). It is convenient to define

    Pr(y|x; y') \equiv \frac{Q(x|y)}{\sum_{y'' \in N(y')} Q(x|y'')}    (9)

which is another posterior probability, by construction. It includes the effect of only those output nodes that are in the neighbourhood of node y'. Pr(y|x; y') is thus a localised posterior probability derived from a localised subset of the node activities. This allows equation 8 to be written as Pr(y|x) = \frac{1}{M} \sum_{y' \in \tilde{N}(y)} Pr(y|x; y'), so Pr(y|x) is the average of the posterior probabilities at node y arising from each of the localised subsets that happens to include node y.
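The normalisation claimed for equation 8 (appendix A) is easy to verify numerically: summing the localised posteriors of equation 9 over the inverse neighbourhoods, and dividing by the number of nodes, gives a distribution that sums to unity. The 1-dimensional periodic lattice, the neighbourhood half-width, and the random activities below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 10                                  # nodes on a 1-D periodic lattice
r = 1                                   # assumed neighbourhood half-width
Q = rng.uniform(0.1, 1.0, size=M)       # activities Q(x|y) for one fixed x

def N(y):
    # neighbourhood of y on the periodic lattice
    return [(y + k) % M for k in range(-r, r + 1)]

def N_inv(y):
    # inverse neighbourhood: all y' whose neighbourhood contains y
    return [yp for yp in range(M) if y in N(yp)]

# equation 8 built from the localised posteriors of equation 9
S = {yp: sum(Q[y2] for y2 in N(yp)) for yp in range(M)}
P = np.array([sum(Q[y] / S[yp] for yp in N_inv(y)) for y in range(M)]) / M

print(P.sum())   # 1.0, as claimed in appendix A
```

The cancellation is structural: summing over y and then over y' in the inverse neighbourhood is the same as summing over y' and then over y in the ordinary neighbourhood, and each inner sum is a normalised posterior.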
1.4 Multiple Firing Model

The model may be extended to the case where n output nodes fire. Pr(y|x) is then replaced by Pr(y_1, y_2, \cdots, y_n | x), which is the probability that (y_1, y_2, \cdots, y_n) are the first n nodes to fire (in that order). With this modification, D becomes

    D = 2 \sum_{y_1, y_2, \cdots, y_n = 1}^{m} \int dx \, Pr(x) \, Pr(y_1, y_2, \cdots, y_n | x) \, \|x - x'(y_1, y_2, \cdots, y_n)\|^2    (10)

where the reference vectors x'(y_1, y_2, \cdots, y_n) are defined as

    x'(y_1, y_2, \cdots, y_n) \equiv \int dx \, Pr(x | y_1, y_2, \cdots, y_n) \, x    (11)

The dependence of Pr(y_1, y_2, \cdots, y_n | x) and x'(y_1, y_2, \cdots, y_n) on n output node locations complicates this result. Assume that Pr(y_1, y_2, \cdots, y_n | x) is a symmetric function of its (y_1, y_2, \cdots, y_n) arguments, which corresponds to ignoring the order in which the first n nodes choose to fire (i.e. Pr(y_1, y_2, \cdots, y_n | x) is a sum over all permutations of (y_1, y_2, \cdots, y_n)). For simplicity, assume that the nodes fire independently, so that Pr(y_1, y_2 | x) = Pr(y_1 | x) Pr(y_2 | x) (see appendix B for the general case where Pr(y_1, y_2 | x) does not factorise). D may then be shown to satisfy the inequality D ≤ D_1 + D_2 (see appendix B), where

    D_1 \equiv \frac{2}{n} \sum_{y=1}^{m} \int dx \, Pr(x) \, Pr(y|x) \, \|x - x'(y)\|^2

    D_2 \equiv \frac{2(n-1)}{n} \int dx \, Pr(x) \left\| \sum_{y=1}^{m} Pr(y|x) \, (x - x'(y)) \right\|^2    (12)

D_1 and D_2 are both non-negative. D_1 → 0 as n → ∞, and D_2 = 0 when n = 1, so the D_1 term is the sole contribution to the upper bound when n = 1, and the D_2 term provides the dominant contribution as n → ∞. The difference between the D_1 and the D_2 terms is the location of the \sum_{y=1}^{m} Pr(y|x) (\cdots) average: in the D_2 term it averages a vector quantity, whereas in the D_1 term it averages a Euclidean distance. The D_2 term will therefore exhibit interference effects, whereas the D_1 term will not.
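The two limits just quoted can be checked directly from equation 12: the D_2 term carries a factor (n-1)/n and so vanishes at n = 1, while the D_1 term decays as 1/n. A sketch with an arbitrary random encoder and sample set (illustrative assumptions throughout):

```python
import numpy as np

rng = np.random.default_rng(4)
K, m, d = 200, 5, 2
X = rng.normal(size=(K, d))                             # samples from Pr(x)
logits = rng.normal(size=(K, m))
P = np.exp(logits); P /= P.sum(axis=1, keepdims=True)   # Pr(y|x)
Xp = (P.T @ X) / P.sum(axis=0)[:, None]                 # x'(y), equation 5

diff = X[:, None, :] - Xp[None, :, :]                   # x - x'(y)
per_y = np.sum(P * np.sum(diff ** 2, axis=2), axis=1)   # incoherent sum
coher = np.sum(np.sum(P[:, :, None] * diff, axis=1) ** 2, axis=1)  # coherent

def bound(n):
    # the two terms of equation 12, as functions of n
    D1 = (2.0 / n) * per_y.mean()
    D2 = (2.0 * (n - 1) / n) * coher.mean()
    return D1, D2

print(bound(1)[1])        # D2 = 0 when n = 1
print(bound(10 ** 6)[0])  # D1 -> 0 as n -> infinity
```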
1.5 Probability Leakage

The model may be further extended to the case where the probability that a node fires is a weighted average of the underlying probabilities that the nodes in its vicinity fire. Thus Pr(y|x) becomes

    Pr(y|x) \to \sum_{y'=1}^{m} Pr(y|y') \, Pr(y'|x)    (13)

where Pr(y|y') is the conditional probability that node y fires given that node y' would have liked to fire. In a sense, Pr(y|y') describes a “leakage” of probability from node y' onto node y. Pr(y|y') then plays the role of a soft “neighbourhood function” for node y'. This expression for Pr(y|x) can be used wherever a plain Pr(y|x) has been used before. The main purpose of introducing leakage is to encourage neighbouring nodes to perform a similar function. This occurs because the effect of leakage is to soften the posterior probability Pr(y|x), which reduces the ability to reconstruct x accurately from knowledge of y, and thus increases the average Euclidean distortion D. To reduce the damage that leakage causes, the optimisation must ensure that nodes that leak probability onto each other have similar properties, so that it does not matter much that they leak.
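The softening effect of leakage (equation 13) can be sketched with a Gaussian leakage matrix on a periodic 1-dimensional lattice. The Gaussian form, the lattice size, and the leakage width are assumptions made for illustration; the paper requires only that Pr(y|y') be a conditional probability. On this symmetric lattice the leakage matrix is doubly stochastic, so the leaked posterior is provably no sharper than the original (its entropy cannot decrease).

```python
import numpy as np

rng = np.random.default_rng(5)
M = 12
p = rng.dirichlet(np.ones(M) * 0.2)      # a sharp posterior Pr(y|x)

# leakage matrix Pr(y|y'): Gaussian neighbourhood on a periodic lattice
gap = np.abs(np.arange(M)[:, None] - np.arange(M)[None, :])
dist = np.minimum(gap, M - gap)          # circular distance between nodes
L = np.exp(-0.5 * dist ** 2)
L /= L.sum(axis=0, keepdims=True)        # columns: sum_y Pr(y|y') = 1

p_leaked = L @ p                          # equation 13

def entropy(q):
    return -np.sum(q * np.log(q + 1e-300))

print(p_leaked.sum())                     # still a probability distribution
print(entropy(p_leaked) > entropy(p))     # leakage softens the posterior
```

The entropy increase is exactly the "softening" that reduces reconstruction accuracy and so raises D, as argued above.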
1.6 The Model

The focus of this paper is on minimisation of the upper bound D_1 + D_2 (see equation 12) to D in the multiple firing model, using a scalable posterior probability Pr(y|x) (see equation 8), with the effect of activity leakage Pr(y|y') taken into account (see equation 13). Gathering all of these pieces together yields

    D_1 = \frac{2}{nM} \int dx \, Pr(x) \sum_{y=1}^{m} \sum_{y'=1}^{m} Pr(y|y') \sum_{y'' \in \tilde{N}(y')} Pr(y'|x; y'') \, \|x - x'(y)\|^2

    D_2 = \frac{2(n-1)}{nM^2} \int dx \, Pr(x) \left\| \sum_{y=1}^{m} \sum_{y'=1}^{m} Pr(y|y') \sum_{y'' \in \tilde{N}(y')} Pr(y'|x; y'') \, (x - x'(y)) \right\|^2    (14)

where Pr(y|x; y') \equiv Q(x|y) / \sum_{y'' \in N(y')} Q(x|y'').

In order to ensure that the model is truly scalable, it is necessary to restrict the dimensionality of the reference vectors. In equation 14, dim x'(y) = dim x, which is not acceptable in a scalable network. In practice, it will be assumed that any properties of node y that are vectors in input space are limited to occupy an “input window” of restricted size that is centred on node y. This restriction applies to the node reference vector x'(y), which prevents D_1 + D_2 from being fully minimised, because x'(y) is allowed to move only in a subspace of the full-dimensional input space. However, useful results can nevertheless be obtained, so this restriction is acceptable.
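The input-window restriction can be sketched as a mask applied to any input-space vector owned by a node (x'(y) or w(y)); the vector can then move only inside its window. The 1-dimensional lattice, the input dimensionality, and the window width below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
M, d = 8, 32                      # 8 nodes on a 1-D lattice, 32-dim input
width = 8                         # assumed window width per node

def window_mask(y):
    # input window of restricted size centred on node y's lattice position
    centre = int((y + 0.5) * d / M)
    idx = np.arange(d)
    return (np.abs(idx - centre) <= width // 2).astype(float)

# any input-space vector owned by node y (x'(y) or w(y)) is masked so it
# can move only inside its window; gradients are masked the same way
Xp = rng.normal(size=(M, d))
Xp_windowed = Xp * np.stack([window_mask(y) for y in range(M)])

print(int((Xp_windowed[0] != 0).sum()))   # at most width + 1 components
```

Applying the same mask to the gradients (section 1.7) keeps the optimisation consistent with the restriction.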
1.7 Optimisation

Optimisation is achieved by minimising D_1 + D_2 with respect to its free parameters. The derivatives with respect to x'(y) are given by

    \frac{\partial D_1}{\partial x'(y)} = -\frac{4}{nM} \int dx \, Pr(x) \, f_1(x, y)

    \frac{\partial D_2}{\partial x'(y)} = -\frac{4(n-1)}{nM^2} \int dx \, Pr(x) \, f_2(x, y)    (15)

and the variations with respect to Q(x|y) are given by

    \delta D_1 = \frac{2}{nM} \sum_{y=1}^{m} \int dx \, Pr(x) \, g_1(x, y) \, \delta \log Q(x|y)

    \delta D_2 = \frac{4(n-1)}{nM^2} \sum_{y=1}^{m} \int dx \, Pr(x) \, g_2(x, y) \, \delta \log Q(x|y)    (16)

The functions f_1(x, y), f_2(x, y), g_1(x, y), and g_2(x, y) are derived in appendix C. Inserting a sigmoidal function Q(x|y) = \frac{1}{1 + \exp(-w(y) \cdot x - b(y))} then yields the derivatives with respect to b(y) and w(y) as

    \frac{\partial D_1}{\partial (b(y), w(y))} = \frac{2}{nM} \int dx \, Pr(x) \, g_1(x, y) \, (1 - Q(x|y)) \begin{pmatrix} 1 \\ x \end{pmatrix}

    \frac{\partial D_2}{\partial (b(y), w(y))} = \frac{4(n-1)}{nM^2} \int dx \, Pr(x) \, g_2(x, y) \, (1 - Q(x|y)) \begin{pmatrix} 1 \\ x \end{pmatrix}    (17)

Because all of the properties of node y that are vectors in input space (i.e. x'(y) and w(y)) are assumed to be restricted to an input window centred on node y, the result of evaluating the right hand sides of the above equations must be similarly restricted to the same input window.
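Since the functions f_1, f_2, g_1, g_2 are derived in appendix C (not reproduced above), a faithful implementation of equations 15-17 is not possible from this section alone. The following sketch instead minimises the n = 1 objective of equation 4 directly, with the sigmoid posterior of equations 6 and 7, and with finite-difference gradients standing in for the analytic derivatives; all sizes and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
K, m, d = 100, 4, 2
X = rng.normal(size=(K, d))                     # samples from Pr(x)
W = 0.1 * rng.normal(size=(m, d))               # weight vectors w(y)
b = np.zeros(m)                                 # biases b(y)
Xp = 0.1 * rng.normal(size=(m, d))              # reference vectors x'(y)

def D(W, b, Xp):
    Q = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))    # equation 7
    P = Q / Q.sum(axis=1, keepdims=True)        # equation 6
    diff = X[:, None, :] - Xp[None, :, :]
    return 2.0 * np.mean(np.sum(P * np.sum(diff ** 2, axis=2), axis=1))

def num_grad(f, theta, eps=1e-5):
    # central finite differences, in place of the appendix C derivatives
    g = np.zeros_like(theta)
    it = np.nditer(theta, flags=["multi_index"])
    while not it.finished:
        i = it.multi_index
        t = theta.copy(); t[i] += eps; hi = f(t)
        t[i] -= 2 * eps; lo = f(t)
        g[i] = (hi - lo) / (2 * eps)
        it.iternext()
    return g

D_start = D(W, b, Xp)
for _ in range(40):                             # plain gradient descent
    W = W - 0.05 * num_grad(lambda t: D(t, b, Xp), W)
    b = b - 0.05 * num_grad(lambda t: D(W, t, Xp), b)
    Xp = Xp - 0.05 * num_grad(lambda t: D(W, b, t), Xp)
print(D(W, b, Xp) < D_start)   # distortion decreases
```

In the full model the same descent would be applied to the upper bound D_1 + D_2 of equation 14, with each gradient masked to the node's input window.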
1.8 The Effect of the Euclidean Norm on Minimising D_1 + D_2

The expressions for D_1 and D_2, and especially their derivatives, are fairly complicated, so an intuitive interpretation will now be presented. When D_1 + D_2 is stationary with respect to variations of x'(y), it may be written as (see appendix D)

    D_1 + D_2 = -\frac{2}{n} \int dx \, Pr(x) \sum_{y=1}^{m} Pr(y|x) \, \|x'(y)\|^2 - \frac{2(n-1)}{n} \int dx \, Pr(x) \left\| \sum_{y=1}^{m} Pr(y|x) \, x'(y) \right\|^2 + \text{constant}    (18)

The M and M^2 factors do not appear in this expression because Pr(y|x) is normalised to sum to unity. The first term (which derives from D_1) is an incoherent sum (i.e. a sum of Euclidean distances), whereas the second term (which derives from D_2) is a coherent sum (i.e. a sum of vectors). The first term contributes for all values of n, whereas the second term contributes only for n ≥ 2, and dominates for n ≫ 1. In order to minimise the first term, \|x'(y)\|^2 likes to be as large as possible for those nodes that have a large Pr(y|x). Since x'(y) is the centroid of the probability density Pr(x|y), this implies that node y prefers to encode a region of input space that is as far as possible from the origin. This is a consequence of using a Euclidean distortion measure \|x - x'\|^2, which has the dimensions of \|x\|^2, in the original definition of the distortion in equation 2. In order to minimise the second term, the superposition of the x'(y) weighted by the Pr(y|x) likes to have as large a Euclidean norm as possible. Thus the nodes cooperate amongst themselves to ensure that the nodes that have a large Pr(y|x) also have a large \left\| \sum_{y=1}^{m} Pr(y|x) \, x'(y) \right\|^2.

2 Solvable Analytic Model
The purpose of this section is to work through a case study in order to demonstrate the various properties that emerge when D_1 + D_2 is minimised.
2.1 The Model

It is convenient to begin by ignoring the effects of leakage Pr(y|y'), and to concentrate on a simple (non-scaling) version of the posterior probability model (as in equation 6) Pr(y|x) = Q(x|y) / \sum_{y'=1}^{m} Q(x|y'), where the Q(x|y) are threshold functions of x

    Q(x|y) = \begin{cases} 1 & \text{above threshold} \\ 0 & \text{below threshold} \end{cases}    (19)

It is also convenient to imagine that a hypothetical infinite-sized training set is available, so that it may be described by a probability density Pr(x). This is a “frequentist”, rather than a “Bayesian”, use of the Pr(x) notation, but the distinction is not important in the context of this paper. Assume that x = (x_1, x_2) is drawn from a training set that has 2 statistically independent subspaces, so that

    Pr(x_1, x_2) = Pr(x_1) \, Pr(x_2)    (20)
Furthermore, assume that Pr(x_1) and Pr(x_2) each have the form

    Pr(x_i) = \frac{1}{2\pi} \int_0^{2\pi} d\theta_i \, \delta(x_i - x_i(\theta_i))    (21)

i.e. Pr(x_i) is a loop (parameterised by a phase angle \theta_i) of probability density that sits in x_i space. In order to make it easy to deduce the optimum reference vectors, choose x_i(\theta_i) so that the following 2 conditions are satisfied for i = 1, 2

    \|x_i(\theta_i)\|^2 = \text{constant}, \qquad \left\| \frac{\partial x_i(\theta_i)}{\partial \theta_i} \right\|^2 = \text{constant}    (22)
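A concrete instance of equations 20-22 takes each x_i(\theta_i) to be a point on a unit circle, which satisfies both constancy conditions; sampling \theta_1 and \theta_2 independently and uniformly then generates training vectors on the 2-torus. A minimal sketch (the 2-dimensional embedding of each circle is an assumed choice):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample(K):
    # equations 20-22: x = (x1, x2), each subvector uniform on a unit circle
    t1 = rng.uniform(0.0, 2.0 * np.pi, K)
    t2 = rng.uniform(0.0, 2.0 * np.pi, K)
    x1 = np.stack([np.cos(t1), np.sin(t1)], axis=1)   # ||x1(theta1)||^2 = 1
    x2 = np.stack([np.cos(t2), np.sin(t2)], axis=1)   # ||x2(theta2)||^2 = 1
    return np.hstack([x1, x2])                        # a point on S^1 x S^1

X = sample(1000)
print(X.shape)   # (1000, 4)
```

Each subvector has constant squared norm, and the speed along each loop is constant, so the density on the 2-torus is uniform, as required below.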
This type of training set can be visualised topologically. Each training vector (x_1, x_2) consists of 2 subvectors, each of which is parameterised by a phase angle, and which therefore lives in a subspace that has the topology of a circle, which is denoted as S^1. Because of the independence assumption in equation 20, the pair (x_1, x_2) lives on the surface of a 2-torus, which is denoted as S^1 × S^1. The minimisation of D_1 + D_2 thus reduces to finding the optimum way of designing an encoder/decoder for input vectors that live on a 2-torus, with the proviso that their probability density is uniform (this follows from equation 21 and equation 22).

Figure 1: Representation of S^1 × S^1 topology with a threshold Q(x|y) superimposed.

In order to derive the reference vectors x'(y), the solution(s) of the stationarity condition \partial(D_1 + D_2)/\partial x'(y) = 0 must be computed. The stationarity condition reduces to (see appendix D)

    n \int dx_1 \, dx_2 \, Pr(x_1, x_2 | y) \, (x_1, x_2) = (n-1) \sum_{y'=1}^{m} \left( \int dx_1 \, dx_2 \, Pr(x_1, x_2 | y) \, Pr(y' | x_1, x_2) \right) (x'_1(y'), x'_2(y')) + (x'_1(y), x'_2(y))    (23)
It is useful to use the simple diagrammatic notation shown in figure 1. Each circle in figure 1 represents one of the S^1 subspaces, so the two circles together represent the product S^1 × S^1. The constraints in equation 22 are represented by each circle being centred on the origin of its subspace (\|x_i(\theta_i)\|^2 is constant), and the probability density around each circle being constant (\|\partial x_i(\theta_i)/\partial \theta_i\|^2 is constant). A single threshold function Q(x|y) is represented by a chord cutting through each circle (with 0 and 1 indicating on which side of the chord the threshold is triggered). The x_i that lie above threshold in each subspace are highlighted. Both x_1 and x_2 must lie above threshold in order to ensure Q(x|y) = 1, i.e. they must both lie within regions that are highlighted in figure 1. In this case node y will be said to be “attached” to both subspace 1 and subspace 2. A special case arises when the chord in one of the subspaces (say it is x_2) does not intersect the circle at all, and the circle lies on the side of the chord where the threshold is triggered. In this case Q(x|y) does not depend on x_2, so that Pr(y|x_1, x_2) = Pr(y|x_1), in which case node y will be said to be “attached” to subspace 1 but “detached” from subspace 2. The typical ways in which a node becomes attached to the 2-torus are shown in figure 2. In figure 2(a) the node is attached to one of the S^1 subspaces and detached from the other. In figure 2(b) the attached and detached subspaces are interchanged with respect to figure 2(a). In figure 2(c) the node is attached to both subspaces.

Figure 2: Explicit representation of S^1 × S^1 topology as a torus with the effect of 3 different types of threshold Q(x|y) shown.
2.2 All Nodes Attached to One Subspace

Consider the configuration of threshold functions shown in figure 3. This is equivalent to all of the nodes being attached to loops that cover the 2-torus, with a typical node being as shown in figure 2(a) (or, equivalently, figure 2(b)). When D_1 + D_2 is minimised, it is assumed that the 4 nodes are symmetrically disposed in subspace 1, as shown. Each is triggered if and only if x_1 lies within its quadrant, and one such quadrant is highlighted in figure 3. This implies that only 1 node is triggered at a time. The assumed form of the threshold functions implies Pr(y|x_1, x_2) = Pr(y|x_1), so equation 23 reduces to

    n \int dx_1 \, dx_2 \, Pr(x_1|y) \, Pr(x_2) \, (x_1, x_2) = \int dx_1 \, dx_2 \, Pr(x_1|y) \, Pr(x_2) \left[ (n-1) \sum_{y'=1}^{M} Pr(y'|x_1) \, (x'_1(y'), x'_2(y')) + (x'_1(y), x'_2(y)) \right]    (24)

whence

    x'_1(y) = \int dx_1 \, Pr(x_1|y) \, x_1, \qquad x'_2(y) = 0    (25)

Figure 3: 16 nodes are shown, which are all attached to subspace 1, and all detached from subspace 2.
2.3 All Nodes Attached to Both Subspaces

Consider the configuration of threshold functions shown in figure 4. This is equivalent to all of the nodes being attached to patches that cover the 2-torus, with a typical node being as shown in figure 2(c). In this case, when D_1 + D_2 is minimised, it is assumed that each subspace is split into 2 halves. This requires a total of 4 nodes, each of which is triggered if, and only if, both x_1 and x_2 lie on the corresponding half-circles. This implies that only 1 node is triggered at a time. The assumed form of the threshold functions implies that the stationarity condition becomes

    n \, Pr(y) \int dx_1 \, dx_2 \, Pr(x_1, x_2 | y) \, (x_1, x_2) = Pr(y) \int dx_1 \, dx_2 \, Pr(x_1, x_2 | y) \left[ (n-1) \sum_{y'=1}^{M} Pr(y' | x_1, x_2) \, (x'_1(y'), x'_2(y')) + (x'_1(y), x'_2(y)) \right]    (26)

whence

    x'_1(y) = \int dx_1 \, Pr(x_1|y) \, x_1, \qquad x'_2(y) = \int dx_2 \, Pr(x_2|y) \, x_2    (27)
Figure 4: 16 nodes are shown, which are all attached to both subspace 1 and subspace 2.

2.4 Half the Nodes Attached to One Subspace, and Half to the Other Subspace

Consider the configuration of threshold functions shown in figure 5. This is equivalent to half of the nodes being attached to loops that cover the 2-torus, with a typical node being as shown in figure 2(a). The other half of the nodes would then be attached in an analogous way, but as shown in figure 2(b). Thus the 2-torus is covered twice over. In this case, when D_1 + D_2 is minimised, it is assumed that each subspace is split into 2 halves. This requires a total of 4 nodes, each of which is triggered if x_1 (or x_2) lies on the half-circle in the subspace to which the node is attached. Thus exactly 2 nodes y_1(x_1) and y_2(x_2) are triggered at a time, so that

    Pr(y | x_1, x_2) = \frac{1}{2} \left( \delta_{y, y_1(x_1)} + \delta_{y, y_2(x_2)} \right) = \frac{1}{2} \left( Pr(y|x_1) + Pr(y|x_2) \right)    (28)
For simplicity, assume that node y is attached to subspace 1; then Pr(x_1, x_2 | y) = Pr(x_1|y) Pr(x_2) and the stationarity condition becomes

    n \, Pr(y) \int dx_1 \, dx_2 \, Pr(x_1|y) \, Pr(x_2) \, (x_1, x_2) = Pr(y) \int dx_1 \, dx_2 \, Pr(x_1|y) \, Pr(x_2) \left[ \frac{n-1}{2} \sum_{y'=1}^{M} \left( Pr(y'|x_1) + Pr(y'|x_2) \right) (x'_1(y'), x'_2(y')) + (x'_1(y), x'_2(y)) \right]    (29)

Figure 5: 16 nodes are shown, 8 of which are attached to subspace 1 and detached from subspace 2 (top row), and 8 of which are attached to subspace 2 and detached from subspace 1 (bottom row).

This may be simplified to yield

    n \int dx_1 \, Pr(x_1|y) \, (x_1, 0)
        = \frac{n+1}{2} (x'_1(y), x'_2(y)) + \frac{n-1}{2} \int dx_2 \, Pr(x_2) \sum_{y'=1}^{M} Pr(y'|x_2) \, (x'_1(y'), x'_2(y'))
        = \frac{n+1}{2} (x'_1(y), x'_2(y)) + \frac{n-1}{2} \left\langle (x'_1(y), x'_2(y)) \right\rangle_2    (30)

where \langle \cdots \rangle_2 is shorthand for the Pr(x_2)-averaged, Pr(y'|x_2)-weighted sum over nodes y' that appears in the middle line.
Writing the 2 subspaces separately (remember that node y is assumed to be attached to subspace 1),

    x'_1(y) = \frac{2n}{n+1} \int dx_1 \, Pr(x_1|y) \, x_1 - \frac{n-1}{n+1} \left\langle x'_1(y) \right\rangle_2

    x'_2(y) = - \frac{n-1}{n+1} \left\langle x'_2(y) \right\rangle_2    (31)

If this result is simultaneously solved with the analogous result for node y attached to subspace 2, then the \langle \cdots \rangle terms vanish, to yield

    x'_1(y) = \begin{cases} \frac{2n}{n+1} \int dx_1 \, Pr(x_1|y) \, x_1 & y \text{ attached to subspace 1} \\ 0 & y \text{ attached to subspace 2} \end{cases}

    x'_2(y) = \begin{cases} 0 & y \text{ attached to subspace 1} \\ \frac{2n}{n+1} \int dx_2 \, Pr(x_2|y) \, x_2 & y \text{ attached to subspace 2} \end{cases}    (32)
2.5 Compare D_1 + D_2 for the 3 Different Types of Solution

Consider the left hand side of figure 3 for the case of M nodes, when the M threshold functions form a regular M-gon. Pr(x|y) then denotes the part of the circle that is associated with node y, whose radius of gyration squared is given by (assuming that the circle has unit radius)

    R_M \equiv \left\| \int dx \, Pr(x|y) \, x \right\|^2 = \left( \frac{M}{2\pi} \int_0^{2\pi/M} d\theta \, \cos\theta \right)^2 = \left( \frac{M}{2\pi} \sin\frac{2\pi}{M} \right)^2    (33)
Gathering the results for (x'_1(y), x'_2(y)) in equations 25 (referred to as type 1), 27 (referred to as type 2), and 32 (referred to as type 3), and inserting them into D_1 + D_2 in equation 18, gives (see appendix E)

    D_1 + D_2 = \begin{cases} \text{constant} - 2 R_M & \text{type 1} \\ \text{constant} - 4 R_{\sqrt{M}} & \text{type 2} \\ \text{constant} - \frac{4n}{n+1} R_{M/2} & \text{type 3} \end{cases}    (34)
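Equations 33 and 34 can be compared numerically. Treating the constant as common to the 3 types, the best solution for given (M, n) is the one with the largest subtracted term; the sketch below reproduces the n = 1 crossover between type 1 and type 2 quoted in the discussion that follows (the non-integer arguments \sqrt{M} and M/2 are used exactly as they appear in equation 34).

```python
import numpy as np

def R(M):
    # equation 33; M need not be an integer when used in equation 34
    return (M / (2.0 * np.pi) * np.sin(2.0 * np.pi / M)) ** 2

def gains(M, n):
    # the terms subtracted from the constant in equation 34, per type
    return {1: 2.0 * R(M),
            2: 4.0 * R(np.sqrt(M)),
            3: 4.0 * n / (n + 1.0) * R(M / 2.0)}

def best(M, n):
    g = gains(M, n)
    return max(g, key=g.get)   # largest gain means smallest D1 + D2

# n = 1: type 1 is optimal for small M, and type 2 takes over at M = 20
print([best(M, 1) for M in (4, 19, 20, 40)])   # [1, 1, 2, 2]
```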
Figure 6: Plots of −D_1 − D_2 for n = 1 for each of the 3 types of optimum.

In figure 6 the 3 solutions are plotted for the case n = 1. For n = 1 the type 3 solution is never optimal, the type 1 solution is optimal for M ≤ 19, and the type 2 solution is optimal for M ≥ 20. This behaviour is intuitively sensible, because a larger number of nodes is required to cover a 2-torus as shown in figure 2(c) than as shown in figure 2(a) (or figure 2(b)).

In figure 7 the 3 solutions are plotted for the case n = 2. For n = 2 the type 1 solution is optimal for M ≤ 12, and the type 2 solution is optimal for M ≥ 30, but there is now an intermediate region 12 ≤ M ≤ 29 (type 1 and type 3 have an equal D_1 + D_2 at M = 12) where the n-dependence of the type 3 solution has now made it optimal. Again, this behaviour is intuitively reasonable, because the type 3 solution requires at least 2 observations in order to be able to yield a small Euclidean reconstruction error in each of the 2 subspaces, i.e. for n = 2 the 2 nodes that fire must be attached to different subspaces. Note that in the type 3 solution the nodes that fire are not guaranteed to be attached to different subspaces: there is a probability \frac{1}{2^n} \frac{n!}{n_1! \, n_2!} that n_i (where n = n_1 + n_2) nodes are attached to subspace i, so the trend is for the type 3 solution to become more favoured as n is increased.

In figure 8 the 3 solutions are plotted for the case n → ∞. For n → ∞ the type 2 solution is never optimal, the type 1 solution is optimal for M ≤ 8, and the type 3 solution is optimal for M ≥ 8. The type 2 solution approaches the type 3 solution from below asymptotically as M → ∞. A phase diagram shows how the relative stability of the 3 types of solution varies with M and n; the type 3 solution is seen to be optimal over a large part of the (M, n) plane. Thus the most interesting, and commonly occurring, solution is the one in which half the nodes are attached to one subspace and half to the other subspace.

Figure 7: Plots of −D_1 − D_2 for n = 2 for each of the 3 types of optimum.

Figure 8: Plots of −D_1 − D_2 for n → ∞ for each of the 3 types of optimum.