
Network: Comput. Neural Syst. 9 (1998) 279–302. Printed in the UK PII: S0954-898X(98)91912-1

Generalization and exclusive allocation of credit in

unsupervised category learning

Jonathan A Marshall† and Vinay S Gupta

Department of Computer Science, CB 3175, Sitterson Hall, University of North Carolina, Chapel

Hill, NC 27599–3175, USA

Received 19 August 1997

Abstract. A new way of measuring generalization in unsupervised learning is presented. The

measure is based on an exclusive allocation, or credit assignment, criterion. In a classiﬁer that

satisﬁes the criterion, input patterns are parsed so that the credit for each input feature is assigned

exclusively to one of multiple, possibly overlapping, output categories. Such a classiﬁer achieves

context-sensitive, global representations of pattern data. Two additional constraints, sequence

masking and uncertainty multiplexing, are described; these can be used to reﬁne the measure of

generalization. The generalization performance of EXIN networks, winner-take-all competitive

learning networks, linear decorrelator networks, and Nigrin’s SONNET-2 network are compared.

1. Generalization in unsupervised learning

The concept of generalization in pattern classiﬁcation has been extensively treated in the

literature on supervised learning, but rather little has been written on generalization with

regard to unsupervised learning. Indeed, it has been unclear what generalization even means

in unsupervised learning. This paper provides an appropriate deﬁnition for generalization

in unsupervised learning, a metric for generalization quality, and a qualitative evaluation

(using the metric) of generalization in several simple neural network classiﬁers.

The essence of generalization is the ability to appropriately categorize unfamiliar

patterns, based on the categorization of familiar patterns. In supervised learning, the output

categorizations for a training set of input patterns are given explicitly by an external teacher,

or supervisor. Various techniques have been used to ensure that test patterns outside this

set are correctly categorized, according to an external standard of correctness.

After supervised learning, a system’s ability to generalize can be measured in terms

of task performance. For instance, a face-recognition system can be tested using different

image viewpoints or illumination conditions, and performance can be evaluated in terms of

how accurately the system’s outputs match the actual facial identities in images. However,

in some situations, it may not be appropriate to measure the ability to generalize in terms

of performance on a speciﬁc task. For example, on which task would one measure the

generalization quality of the human visual system? Human vision is capable of so many

tasks that no one task is appropriate as a benchmark. In fact, much of the power of the

human visual system is its usefulness in completely novel tasks.

For general-purpose systems, like the human visual system, it would be useful to deﬁne

generalization performance in a task-independent way, rather than in terms of a speciﬁc task.

† E-mail: marshall@cs.unc.edu

0954-898X/98/020279+24$19.50 © 1998 IOP Publishing Ltd


It would be desirable to have a ‘general-purpose’ deﬁnition of generalization quality, such

that if a system satisﬁes the deﬁnition, it is likely to perform well on many different tasks.

This paper proposes such a general-purpose deﬁnition, based on unsupervised learning.

The deﬁnition measures how well a system’s internal representations correspond to the

underlying structure of its input environment, under manipulations of context, uncertainty,

multiplicity, and scale (Marshall 1995). For this deﬁnition, a good internal representation

is the goal, rather than good performance on a particular task.

In unsupervised learning, input patterns are assigned to output categories based on some

internal standard, such as similarity to other classiﬁed patterns. Patterns are drawn from a

training environment, or probability distribution, in which some patterns may be more likely

to occur than others. The classiﬁcations are typically determined by this input probability

distribution, with frequently-occurring patterns receiving more processing and hence ﬁner

categories. The categorization of patterns with a low or zero training probability, i.e. the

generalization performance, is determined partly by the higher-probability patterns. There

may be classiﬁer systems that categorize patterns in the training environment similarly, but

which respond to unfamiliar patterns in different ways. In other words, the generalizations

produced by different classiﬁers may differ.

2. A criterion for evaluating generalization

How can one judge whether the classiﬁcations and parsings that an unsupervised classiﬁer

generates are good ones? Several criteria (e.g., stability, dispersion, selectivity, convergence,

and capacity) for benchmarking unsupervised neural network classiﬁer performance have

been proposed in the literature. This paper describes an additional criterion: an exclusive

allocation (or credit assignment) measure (Bregman 1990, Marshall 1995). Exclusive

allocation as a criterion for evaluating classiﬁcations was ﬁrst discussed by Marshall (1995).

This paper reﬁnes and formalizes the intuitive concept of exclusive allocation, and it

describes in detail how exclusive allocation can serve as a measure for generalization in

unsupervised classiﬁers.

This paper also describes two regularization constraints, sequence masking and

uncertainty multiplexing, which can be used to evaluate further the generalization

performance of unsupervised classiﬁers. In cases where there exist multiple possible

classiﬁcations that would satisfy the exclusive allocation criterion, these regularizers allow

a secondary measurement and ranking of the quality of the classiﬁcations.

The principle of credit assignment states that the ‘credit’ for a given input feature should

be assigned, or allocated, exclusively to a single classiﬁcation. In other words, any given

piece of data should count as evidence for one pattern at a time and should be prevented

from counting as evidence for multiple patterns simultaneously. This intuitively simple

concept has not been stated in a mathematically precise way; such a precise statement is

given in this paper.

There are many examples (e.g., from visual perception of orientation, stereo depth,

and motion grouping, from visual segmentation, from other perceptual modalities, and from

‘blind source separation’ tasks) where a given datum should be allowed to count as evidence

for only one pattern at a time (Bell and Sejnowski 1995, Comon et al 1991, Hubbard and

Marshall 1994, Jutten and Herault 1991, Marshall 1990a, c, Marshall et al 1996, 1997, 1998,

Morse 1994, Schmitt and Marshall 1998). A good example comes from visual stereopsis,

where a visual feature seen by one eye can be potentially matched with many visual features

seen by the other eye (the ‘correspondence’ problem). Human visual systems assign the

credit for each such monocular visual feature to at most one unique binocular match; this


property is known as the uniqueness constraint (Marr 1982, Marr and Poggio 1976). In

stereo transparency (Prazdny 1985), individual visual features should be assigned to the

representation of only one of multiple superimposed surfaces (Marshall et al 1996).

2.1. Neural network classiﬁers

A neural network categorizes an input pattern by activating some classiﬁer output neurons.

These activations constitute a representation of the input pattern, and the input features

of that pattern are said to be assigned to that output representation. An input pattern

that is not part of the training set, but which contains features present in two or more

training patterns, can exist. Such an input is termed a superimposition of input patterns.

Presentation of superimposed input patterns can lead to simultaneous activation of multiple

representations (neurons).

2.2. An exclusive allocation measure

One way to deﬁne an exclusive allocation measure for a neural network classiﬁer is to

specify how input patterns (both familiar and unfamiliar) should ideally be parsed, in terms

of a given training environment (the familiar patterns), and then to measure how well the

network’s actual parsings compare with the ideal. Consider, for instance, the network shown

in ﬁgure 1, which has been trained to recognize patterns ab and bc (Marshall 1995). Each

output neuron is given a ‘label’ (ab, bc) that reﬂects the familiar patterns to which the

neuron responds. The parsings that the network generates are evaluated in terms of those

labels. When ab or bc is presented, then the ‘best’ parsing is for the correspondingly labelled

output neuron to become fully active and for the other output neuron to become inactive

(ﬁgure 1(A)). In a linear network, when half a pattern is missing (say the input pattern is a),

and the other half does not overlap with other familiar patterns, the corresponding output

neuron should become half-active (ﬁgure 1(B)).

However, when the missing half renders the pattern’s classiﬁcation ambiguous (say

the input pattern is b), the partially matching alternatives (ab and bc) should not both

be half-active. Instead, the activation should be distributed among the partially matching

alternatives. One such parsing, in which the activation from b is distributed equally between

ab and bc, results in two activations at 25% of the maximum level (ﬁgure 1(C)). This parsing

would represent the network’s uncertainty about the classiﬁcation of input pattern b.

Another such parsing, in which the activation from b is distributed unequally, to ab

and not to bc, results in 50% activation of neuron ab (ﬁgure 1(D)). This parsing would

represent a ‘guess’ by the network that the ambiguous input pattern b should be classiﬁed

as ab. Although the distribution of credit from b to ab and bc is different in the two

parsings of ﬁgure 1(C) and 1(D), both parsings allocate the same total amount of credit.

(An additional criterion, ‘uncertainty multiplexing,’ which distinguishes between parsings

like the ones in ﬁgures 1(C) and 1(D), will be presented in subsection 4.7.)
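One way to read these activation levels (our reading, using the 'size' normalization introduced later in this section, under which each of neurons ab and bc has size 2) is that a neuron's activation equals the credit allocated to it divided by its size:

```python
# Activation = allocated credit / neuron 'size' (our reading of the text;
# both neurons ab and bc have size 2, the sum of their input weights).

def activation(credit, size):
    return credit / size

# Figure 1(C): b's single unit of credit is split equally between ab and bc.
assert activation(0.5, 2) == 0.25    # each neuron at 25%

# Figure 1(D): all of b's credit is 'guessed' onto neuron ab.
assert activation(1.0, 2) == 0.5     # neuron ab at 50%
```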

By the same reasoning, it would be incorrect to parse pattern abc as ab (ﬁgure 1(E)),

because then the contribution from c is ignored. (That can happen if the inhibition between

ab and bc is too strong.) It would also be incorrect to parse abc as ab + bc (ﬁgure 1(F))

because b would be represented twice.

A correct parsing in this case would be to equally activate neurons ab and bc at 75%

of the maximum level (ﬁgure 1(G)). That this is correct can be veriﬁed by comparing the

sum of the input signals, 1 + 1 + 1 = 3, with the sum of the ‘size-normalized’ output

signals. Each output neuron encodes a pattern of a certain preferred ‘size’ (or ‘scale’)


Figure 1. Parsings for exclusive allocation. (A) Normal parsing; the familiar pattern ab activates

the correspondingly labelled output neuron. (B) The unfamiliar pattern a half-activates the best-

matching output neuron, ab. (C) The unfamiliar input pattern b matches ab and bc equally

well, and its excitation credit is divided equally between the corresponding two output neurons,

resulting in a 25% activation for each of the two neurons. (D) The excitation credit from b is

allocated entirely to neuron ab (producing a 50% activation), and not to neuron bc. (E) Incorrect

parsing in response to unfamiliar pattern abc: neuron ab is fully active, but the credit from input

unit c is lost. (F) Another incorrect parsing of abc: the credit from unit b is counted twice,

contributing to the full activation of both neurons ab and bc. (G) Correct parsing of abc: the

credit from b is divided equally between the best matches ab and bc, resulting in a 75% activation

of both neurons ab and bc. (Redrawn with permission, from Marshall (1995), copyright Elsevier

Science.)

(Marshall 1990b, 1995), which in ﬁgure 1 is the sum of the weights of the neuron’s input

connections. The sum of the input weights to neuron ab is 1 + 1 + 0 = 2, and the sum

of the input weights to neuron bc is 0 + 1 + 1 = 2. Thus, both of these output neurons

are said to have a size of 2. The size-normalized output signal for each output neuron is

computed by multiplying its activation by its size. The sum of the size-normalized output

signals in ﬁgure 1(G) is (0.75 × 2) + (0.75 × 2) = 3. Because this equals the sum of the

input signals, the exclusively-allocated parsing in ﬁgure 1(G) is valid (unlike the parsings

in ﬁgure 1(E) and 1(F)).
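This accounting is easy to verify mechanically. A minimal sketch (the helper name is ours; the sizes and activation levels come from figure 1):

```python
# Verify the size-normalized accounting for the parsings of pattern abc in
# figure 1. Sizes and activation levels are taken from the text.

def size_normalized_output(activations, sizes):
    """Total output credit: each neuron's activation times its 'size'."""
    return sum(a * s for a, s in zip(activations, sizes))

sizes = [2, 2]       # neurons ab and bc each have input weights summing to 2
input_sum = 3        # pattern abc contributes 1 + 1 + 1

# Figure 1(G): both neurons at 75% -- a valid, exclusively allocated parsing.
assert size_normalized_output([0.75, 0.75], sizes) == input_sum

# Figure 1(E): abc parsed as ab alone -- the credit from c is lost.
assert size_normalized_output([1.0, 0.0], sizes) == 2    # falls short of 3

# Figure 1(F): abc parsed as ab + bc -- the credit from b is counted twice.
assert size_normalized_output([1.0, 1.0], sizes) == 4    # exceeds 3
```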

Given the examples above, exclusive allocation can be informally deﬁned as the

conjunction of the following pair of conditions. Exclusive allocation is said to be achieved

when:

• Condition 1. The activation of every output neuron is accounted for exactly once by

the input activations.

• Condition 2. The total input equals the total size-normalized output, as closely as

possible.

These two informal exclusive allocation conditions are made more precise in subsequent

sections. They are used below to evaluate the generalization performance of several neural

network classiﬁers.

3. Generalization performance of several networks

3.1. Response to familiar and unfamiliar patterns

This section compares the generalization performance of three neural network classiﬁers: a

winner-take-all network, an EXIN network, and a linear decorrelator network. First, each

network will be described brieﬂy.

3.1.1. Winner-take-all competitive learning. Among the simplest unsupervised learning

procedures is the winner-take-all (WTA) competitive learning rule, which divides the space


of input patterns into hyper-polyhedral decision regions, each centered around a ‘prototype’

pattern. The ART-1 network (Carpenter and Grossberg 1987) and the Kohonen network

(Kohonen 1982) are examples of essentially WTA neural networks. When an input pattern

ﬁrst arrives, it is assigned to the one category whose prototype pattern best matches it.

The activation of neurons encoding other categories is suppressed (e.g., through strong

inhibition). The prototype of the winner category is then modiﬁed to make it slightly closer

to the input pattern. This is done by strengthening the winner’s input connection weights

from features in the input pattern and/or weakening the winner’s input connection weights

from features not in the input pattern. In these networks, generalization is based purely on

similarity of patterns to individual prototypes.
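A rough sketch of the WTA update just described (our simplification, not the exact ART-1 or Kohonen equations):

```python
import numpy as np

def wta_step(prototypes, x, lr=0.1):
    """Activate only the best-matching prototype and move it toward x."""
    winner = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    prototypes[winner] += lr * (x - prototypes[winner])
    return winner

prototypes = np.array([[1.0, 1.0, 0.0],    # prototype for category 'ab'
                       [0.0, 1.0, 1.0]])   # prototype for category 'bc'
x = np.array([1.0, 0.8, 0.0])              # an input resembling ab

winner = wta_step(prototypes, x)
assert winner == 0                           # 'ab' wins...
assert abs(prototypes[0][1] - 0.98) < 1e-9  # ...and is nudged toward x
```

The losing prototype is left untouched, which is exactly why a WTA network cannot share credit for a superimposed pattern among several categories.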

There can exist input patterns (e.g., abc or b in ﬁgure 1) that are not part of the training

set but which contain features present in two or more training patterns. Such inputs may

bear similarities to multiple individual prototypes and may be quite different from any

one individual prototype. However, a WTA network cannot activate multiple categories

simultaneously. Hence, the network cannot parse the input in a way that satisﬁes the

second exclusive allocation condition, so generalization performance suffers.

3.1.2. EXIN networks. In the EXIN (EXcitatory + INhibitory learning) neural network

model (Marshall 1990b, 1995), this problem is overcome by using an anti-Hebbian inhibitory

learning rule in addition to a Hebbian excitatory learning rule. If two output neurons are

frequently coactive, which would happen if the categories that they encode overlap or

have common features, the lateral inhibitory weights between them become stronger. On

the other hand, if the activations of the two output neurons are independent, which would

happen if the neurons encode dissimilar categories, then the inhibitory weights between them

become weaker. This results in category scission between independent category groupings

and allows the EXIN network to generate near-optimal parsings of multiple superimposed

patterns, in terms of multiple simultaneous activations (Marshall 1995).
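A toy version of this inhibitory rule (ours; Marshall's actual EXIN equations are more elaborate) makes the qualitative behaviour concrete — inhibition tracks coactivation:

```python
# Toy EXIN-style inhibitory learning (our simplification): the inhibitory
# weight between two output neurons grows when they are coactive and decays
# when only one of them is active.

def update_inhibition(w, y_i, y_j, lr=0.5):
    return w + lr * (y_i * y_j - w * (y_i + y_j) / 2)

w_overlap = 0.1     # e.g. neurons ab and abc, which are often coactive
for _ in range(50):
    w_overlap = update_inhibition(w_overlap, 1.0, 1.0)

w_disjoint = 0.9    # e.g. neurons ab and cd, which fire independently
for _ in range(50):
    w_disjoint = update_inhibition(w_disjoint, 1.0, 0.0)

assert w_overlap > 0.9    # strong inhibition between overlapping categories
assert w_disjoint < 0.1   # inhibition between disjoint categories fades
```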

3.1.3. Linear decorrelator networks. Linear decorrelator networks (Oja 1982, Földiák

1989) also use an anti-Hebbian inhibitory learning rule that can cause the lateral inhibitory

connections to vanish during learning. This allows simultaneous neural activations.

However, the linear decorrelator network responds essentially to differences, or distinctive

features (Anderson et al 1977, Sattath and Tversky 1987) among the patterns, rather than

to the patterns themselves (Marshall 1995).
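A minimal anti-Hebbian decorrelation sketch in the spirit of Földiák (1989) (our toy version, not the cited networks) shows a lateral weight absorbing the correlation between two channels:

```python
import numpy as np

# Toy anti-Hebbian decorrelation (our simplification): a lateral weight w
# lets unit 0's output inhibit unit 1, and w grows with their coactivation,
# driving the output correlation toward zero.

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]        # strongly correlated inputs
assert np.corrcoef(X.T)[0, 1] > 0.9

w, lr = 0.0, 0.02
for x0, x1 in X:
    y1 = x1 - w * x0                           # unit 1 inhibited by unit 0
    w += lr * x0 * y1                          # anti-Hebbian update

Y1 = X[:, 1] - w * X[:, 0]
assert abs(np.corrcoef(X[:, 0], Y1)[0, 1]) < 0.3   # largely decorrelated
```

Note what the network learns here is the *difference* between the channels, illustrating why a linear decorrelator responds to distinctive features rather than to the patterns themselves.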

3.1.4. Example. Figure 2 (Marshall 1995) compares the exclusive allocation performance

of winner-take-all competitive learning networks, linear decorrelator networks, and EXIN

networks and illustrates the intuitions on which the rest of this paper is based. The initial

connectivity pattern in the three networks is identical (ﬁgure 2(A)). The networks are trained

on patterns ab, abc, and cd, which occur with equal probability. Within each of the three

networks, a single neuron learns to respond selectively to each of the familiar patterns. In

the WTA and EXIN networks, the neuron labelled ab develops strong input connections

from a and b and weak connections from c and d. Similarly, the neurons labelled abc

and cd develop appropriate selective excitatory input connections. In the WTA network,

the weights on the lateral inhibitory connections among the output neurons remain uniform,

ﬁxed, and strong enough to ensure WTA behaviour. In the EXIN network, the inhibition

between neurons ab and abc and between neurons abc and cd becomes strong because of

the overlap in the category exemplars; the inhibition between neurons ab and cd becomes


Figure 2. Comparison of WTA, linear decorrelator, and EXIN networks. (A) Initially, neurons

in the input layer project excitatory connections non-speciﬁcally to neurons in the output layer.

Also, each neuron in the output layer projects lateral inhibitory connections non-speciﬁcally to

all its neighbours (shaded arrows). (B), (C), (D) The excitatory learning rule causes each type

of neural network to become selective for patterns ab, abc, and cd after a period of exposure to

those patterns; a different neuron becomes wired to respond to each of the familiar patterns. Each

network’s response to pattern abc is shown. (E) In WTA, the compound pattern abcd (ﬁlled

lower circles) causes the single ‘nearest’ neuron (abc) (ﬁlled upper circle) to become active

and suppress the activation of the other output neurons. (G) In EXIN, the inhibitory learning

rule weakens the strengths of inhibitory connections between neurons that code non-overlapping

patterns, such as between neurons ab and cd. Then when abcd is presented, both neurons ab

and cd become active (ﬁlled upper circles), representing the simultaneous presence of the

familiar patterns ab and cd. (F) The linear decorrelator responds similarly to EXIN for input

pattern abcd. However, in response to the unfamiliar pattern c, both WTA (H) and EXIN (J)

moderately activate (partially ﬁlled circles) the neuron whose code most closely matches the

pattern (cd), whereas the linear decorrelator (I) activates a more distant match (abc). (Reprinted,

with permission, from Marshall (1995), copyright Elsevier Science.)


weak because the category exemplars have no common features. The linear decorrelator

network learns to respond to the differences among the patterns, rather than to the patterns

themselves. For example, the neuron labelled abc really becomes wired to respond optimally

to pattern c-and-not-d. In the linear decorrelator, the weights on the lateral connections

vanish, when the responses of the three neurons become fully decorrelated.

Each of the three networks responds correctly to the patterns in the training set, by

activating the appropriate output neuron (ﬁgure 2(B), 2(C) and 2(D)). Now consider the

response of these trained networks to the unfamiliar pattern abcd. The WTA network

responds by activating neuron abc (ﬁgure 2(E)) because the input pattern is closest to this

prototype. However, the response of the WTA network to pattern abcd is the same as the

response to pattern abc. The activation of neuron abc can be credited to the input features a,

b, and c. The input feature d is not accounted for in the output: it is not accounted for by

the activation of neuron cd because the activation of neuron cd is zero. Thus, condition 1

from the pair of exclusive allocation conditions is not fully satisﬁed. Also, the total input

(1 + 1 + 1 + 1 = 4) does not equal the total size-normalized output (0 + (1 × 3) + 0 = 3),

so the WTA network does not satisfy condition 2 for input pattern abcd.
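The same bookkeeping, written out for figure 2 (the sizes and activations come from the text; the helper is ours):

```python
# Condition 2 check for pattern abcd in figure 2. The output neuron sizes
# (2, 3, 2) and the activation levels are taken from the text.

sizes = {'ab': 2, 'abc': 3, 'cd': 2}

def total_output(activations):
    return sum(activations[n] * sizes[n] for n in sizes)

input_sum = 4    # pattern abcd contributes 1 + 1 + 1 + 1

# WTA: only neuron abc fires, so one unit of input credit goes missing.
assert total_output({'ab': 0, 'abc': 1, 'cd': 0}) == 3   # short of input_sum

# EXIN / linear decorrelator: neurons ab and cd both fully active.
assert total_output({'ab': 1, 'abc': 0, 'cd': 1}) == input_sum
```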

On the other hand, in the linear decorrelator and the EXIN networks, neurons ab

and cd are simultaneously activated (ﬁgure 2(F) and 2(G)), because during training, the

inhibition between neurons ab and cd became weak or vanished. Thus, all input features

are fully represented in the output, and the exclusive allocation conditions are met for input

pattern abcd, in the linear decorrelator and EXIN networks. These two networks exhibit

a global context-sensitive constraint satisfaction property (Marshall 1995) in their parsing

of abcd: the contextual presence or absence of small distinguishing features, or nuances,

(like d) dramatically alters the parsing. When abc is presented, the network groups a, b,

and c together as a unit, but when d is added, the network breaks c away from a and b

and binds it with d instead, forming two separate groupings, ab and cd.

It is evident from ﬁgure 2(C), 2(F) and 2(I) that the size-normalization value for each

neuron must be computed not by examining the neuron’s input weight values per se (which

would give the wrong values for the linear decorrelator), but rather by examining the size of

the training patterns to which the neuron responds. Thus, the output neuron sizes are (2, 3, 2)

for all the networks shown in ﬁgure 2.

Now consider the response of the three networks to the unfamiliar pattern c. As shown

in ﬁgure 2(H), 2(I) and 2(J), the WTA and the EXIN networks respond by partially activating

neuron cd. However, in the linear decorrelator network, neuron abc is fully activated. Since,

during training, this neuron was fully activated when the pattern abc was presented, its full

activation is not accounted for by the presence of feature c alone in the input pattern. Thus,

condition 1 is not satisﬁed by the linear decorrelator network for input pattern c. Note that

abc (ﬁgure 2(I)) also does not satisfy condition 2 for pattern c, because 1 ≠ 1 × 3.

The example of ﬁgure 2 thus illustrates that allowing multiple simultaneous neural

activations and learning common features, rather than distinctive features, among the input

patterns enables an EXIN network to satisfy exclusive allocation constraints and to exhibit

good generalization performance when presented with multiple superimposed patterns. In

contrast, WTA networks (by deﬁnition) cannot represent multiple patterns simultaneously.

Although linear decorrelator networks can represent multiple patterns simultaneously, they

are not guaranteed to satisfy the exclusive allocation constraints.

A basic idea of this paper is that exclusive allocation provides a meaningful,

self-consistent way of specifying how a network should respond to unfamiliar patterns

and is therefore a valuable criterion for generalization.


3.2. Equality of total input and total output

Condition 2 will be used to compare a linear decorrelator network, an EXIN network,

and a SONNET-2 (Self-Organizing Neural NETwork-2) network (Nigrin 1993). (Since a

WTA network does not allow simultaneous activation of multiple category winners, it is

not considered in this example.)

3.2.1. SONNET-2. SONNET-2 is a fairly complex network, involving the use of inhibition

between connections (Desimone 1992, Reggia et al 1992, Yuille and Grzywacz 1989),

rather than between neurons, to implement exclusive allocation. The discussion in this

paper will focus on the differences in how EXIN and SONNET-2 networks implement the

inhibitory competition among perceptual categories. Because the inhibition in SONNET-2

acts between connections, rather than between neurons, it is more selective. Connections

from one input neuron to different output neurons compete for the ‘right’ to transmit signals;

this competition is implemented through an inhibitory signal that is a combination of the

excitatory signal on the connection and the activation of the corresponding output neuron.

For example, ﬁgures 3(C), 3(F) and 3(I) show that connections from input feature b to the

two output neurons compete with each other; other connections in ﬁgures 3(C), 3(F) and 3(I)

do not. As in EXIN networks, the excitatory learning rule involves prototype modiﬁcation of

output layer competition winners, and the inhibitory learning rule is based on coactivation of

the competing neurons; hence SONNET-2 displays the global context-sensitive constraint

satisfaction property (abc versus abcd) and the sequence masking property (Cohen and

Grossberg 1986, 1987) (abc versus c) (Nigrin 1993) displayed by EXIN networks.
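The key structural idea — competition between connections that leave the same input unit, rather than between output neurons — can be caricatured as follows (our abstraction, not Nigrin's actual equations):

```python
# Connection-level competition (a caricature of SONNET-2's scheme, not
# Nigrin's equations): connections from the SAME input unit split that
# unit's credit; connections from different input units do not interact.

def transmit(signals):
    """signals: {input_unit: {output_neuron: raw excitatory signal}}."""
    credit = {}
    for inp, conns in signals.items():
        total = sum(conns.values())
        for neuron, s in conns.items():
            credit[neuron] = credit.get(neuron, 0.0) + (s / total if total else 0.0)
    return credit

# Pattern ac (figure 3): a projects only to ab, c only to bc, so there is
# no competition and each output neuron receives a full unit of credit.
assert transmit({'a': {'ab': 1.0}, 'c': {'bc': 1.0}}) == {'ab': 1.0, 'bc': 1.0}

# Pattern b: b projects to both ab and bc, so its credit is split.
assert transmit({'b': {'ab': 1.0, 'bc': 1.0}}) == {'ab': 0.5, 'bc': 0.5}
```

Inhibition between neurons, by contrast, would make ab and bc suppress each other even for pattern ac, where their evidence comes from disjoint input units.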

3.2.2. Example. Figure 3 shows three networks (linear decorrelator, EXIN, SONNET-2)

trained on patterns ab and bc, which are assumed to occur with equal training probability.

The linear decorrelator network can end up in one of many possible ﬁnal conﬁgurations,

subject to the constraint that the output neurons are maximally decorrelated. A problem

with linear decorrelators is that they are not guaranteed to come up with a conﬁguration that

responds well to unfamiliar patterns. To illustrate this point, a conﬁguration that responds

correctly to familiar patterns but does not generalize well to unfamiliar patterns has been

chosen for ﬁgures 3(A), 3(D) and 3(G).

When the unfamiliar and ambiguous pattern b is presented, the EXIN and SONNET-2

networks respond correctly by activating both neuron ab and neuron bc to about 25% of their

maximum level (ﬁgures 3(E) and 3(F)), thus representing the uncertainty in the classiﬁcation.

This parsing is considered a good one because there are two alternatives for matching input

pattern b (ab and bc) and the input pattern comprises half of both alternatives. Condition 2 is

thus satisﬁed for this input pattern. The linear decorrelator network activates both neurons ab

and bc to 50% of their maximum level because the neurons receive no inhibitory input

(ﬁgure 3(D)); condition 2 is not satisﬁed because 0 + 1 + 0 ≠ (2 × 0.5) + (2 × 0.5).

When the unfamiliar (but not ambiguous) pattern ac is presented, the two neurons in

the linear decorrelator network receive a net input of zero and hence do not become active.

This linear decorrelator network thus does not satisfy condition 1 for this input pattern. In

the EXIN network, ab and bc are active to about 25% of their maximum activation. This

behaviour arises because neurons ab and bc still exert an inhibitory inﬂuence on each other

because of the overlap in their category prototypes, even though the subpatterns a and c

in the pattern ac have nothing in common. The EXIN network thus does not fully satisfy

condition 1 for this input pattern. On the other hand, in the SONNET-2 network, both

neurons ab and bc are correctly active to 50% of their maximum level. This parsing is


Figure 3. Comparison of linear decorrelator, EXIN, and SONNET-2 networks. Three different

networks trained on patterns ab and bc. (A), (B), (C) A different neuron becomes wired to

respond to each of the familiar patterns. The response of each network to pattern ab is shown.

(D) When an ambiguous pattern b is presented, in the linear decorrelator network, both neurons

ab and bc are active at 50% of their maximum level; the input does not fully account for

these activations (see text). (E), (F) The EXIN and SONNET-2 networks respond correctly by

partially activating both neurons ab and bc, to about 25% of the maximum activation. (G) When

pattern ac is presented, the two neurons in the linear decorrelator network receive a net input

of zero and hence do not become active. (H) In the EXIN network, neurons ab and bc still

compete with each other, even though the subpatterns a and c are disjoint. This results in

incomplete representation of the input features at the output. (I) In SONNET-2, links from a

to ab and from c to bc do not inhibit each other; this ensures that neurons ab and bc are active

sufﬁciently (at about 50% of their maximum level) to fully account for the input features.

considered to be correct because the subpatterns a and c within the input pattern comprise

half of the prototypes ab and bc respectively, and there is only one (partially) matching

alternative for each subpattern. The SONNET-2 network responds in this manner because

the link from input feature a to neuron ab does not compete with the link from input

feature c to neuron bc.

The SONNET-2 network satisﬁes condition 2 for all three input patterns shown in

ﬁgure 3, the EXIN network satisﬁes condition 2 on two of the input patterns, and the

linear decorrelator network satisﬁes condition 2 on one of the input patterns. The greater

selectivity of inhibition in SONNET-2 leads to better satisfaction of the exclusive allocation

constraints and thus better generalization. The example of ﬁgure 3 thus elaborates the

concept of exclusive allocation and is incorporated in the formalization below.


3.3. Summary of generalization behaviour examples

Figure 4 summarizes the comparison between the networks that have been considered.

A ‘+’ in the table indicates that the given network possesses the corresponding property

to a satisfactory degree; a ‘−’ indicates that it does not. The general complexity of the

neural dynamics and the architecture of the networks have also been compared; the ‘<’

signs indicate that complexity increases from left to right in the table. WTA networks

have ﬁxed, uniform inhibitory connections and are considered to be the simplest of all

the networks discussed. Linear decorrelators use a single learning rule for both excitatory

and inhibitory connections; further, the inhibitory connection weights can all vanish under

certain conditions. EXIN networks use slightly different learning rules for feedforward and

lateral connections. SONNET-2 implements inhibition between input layer → output layer

connections, rather than between neurons. Thus, sophisticated generalization performance

is obtained at the cost of increased complexity.

Example                      WTA   LD   EXIN   SONNET-2
Figure 2: abc versus abcd     –     +     +       +
Figure 2: cd versus c         +     –     +       +
Figure 3: b versus ac         –     –     –       +
Complexity of network            <     <     <

Figure 4. Summary of exclusive allocation examples. The table indicates how well
different networks behave on the representative examples of exclusive allocation discussed in
subsections 3.1 and 3.2. Key: +, network behaved properly; −, network did not behave properly;
<, network complexity increases from left to right.

4. Formal expression of generalization conditions

Section 3 compared the generalization performance of several networks qualitatively. The

exclusive allocation conditions will now be framed in formal terms, so that a quantitative

computation of how well a network adheres to the exclusive allocation constraints, and a

quantitative measure of generalization performance, will be theoretically possible.

As mentioned earlier, classiﬁcations done by an unsupervised classiﬁer are determined

by the patterns present in the training environment. Hence, to formalize the two exclusive

allocation conditions, a precise way to describe the concepts or category prototypes learned

by the network must be provided. Deriving such a description is analogous to the

rule-extraction task (Craven and Shavlik 1994): ‘Given a trained neural network and

the examples used to train it, produce a concise and accurate symbolic description of the

network’ (p 38). What does a neuron’s activation mean? A possible description of the

patterns encoded by the neuron can be obtained from the connection weights. However,

because of the possible presence of lateral interactions, feedback, etc, connection weights

may not provide an accurate picture of the patterns learned by the neuron.

Another approach would be to use symbolic if–then rules (Craven and Shavlik 1994).

Such a description can be quite comprehensive and elaborate; however, the number of rules

required to describe a network can grow exponentially with the number of input features.


Moreover, the method described by Craven and Shavlik (1994) for obtaining the rules uses

examples not contained in the training set; the case of real-valued input features is also not

considered.

4.1. Label vectors to describe network behaviour

The approach taken in this paper is to derive a label for each output neuron, based on the

neuron’s activations in response to the familiar input patterns. The label is symbolized as a

label vector and quantitatively summarizes the features to which the neuron responds. An

advantage of this method is that the label can be computed by using examples drawn only

from the training set.

An input pattern is represented by neural activations in the input layer. Each neuron

in the output layer responds to one or more patterns; the label deﬁnes a prototype for this

group of patterns. The label for each output neuron is expressed in terms of the input units

that feed it; multilayered networks would be analysed by considering successive layers in

sequence.

Consider an input pattern X = (x_1, x_2, ..., x_I), drawn from the network's input space.
Let the network's training set be defined by probability distribution S on the input space,
and let p_S(X) be the probability of X being the input pattern on any training presentation.
When X is presented to a network, the activation of the ith input neuron is x_i, and the
activation of the jth output neuron is y_j(X). Y(X) = (y_1(X), y_2(X), ..., y_J(X)) is the
vector of output activation values in response to input X. Abbreviate y_j ≡ y_j(X), and
assume 0 ≤ x_i, y_j ≤ 1. Define

    L'_{ij} = \int_X p_S(X) \cdot x_i \cdot y_j \, dX.    (1)

If S consists of a finite number of patterns instead of a continuum, then the definition
becomes

    L'_{ij} = \sum_X p_S(X) \cdot x_i \cdot y_j.    (2)

The L'_{ij} values are normalized to obtain the label values that will be used in expressing the
exclusive allocation conditions:

    L_{ij} = L'_{ij} / \max_k L'_{kj}.    (3)

The label L_j of the jth output neuron is the vector (L_{1j}, L_{2j}, ..., L_{Ij}), where I is the
number of input units.

The label L_j is a summary that characterizes the set of patterns to which neuron j
responds. The use of labels for this characterization is reasonable for most unsupervised
networks, where learning is based on pattern similarity, and where the decision regions
thus tend to be convex. However, labels would not be appropriate for characterizing
networks with substantially non-convex decision regions, e.g., the type of network produced
by many supervised learning procedures. The process of computing labels is essentially a
rule extraction process, to infer the structure of a network, given knowledge only of the
training input probabilities and the network's 'black box' input–output behaviour on the
training data. Each component L_{ij} of a label is analogous to a weight in an inferred model
of the black box network. One benefit of this approach is that it facilitates comparing the
generalization behaviour of different networks, without regard to differences in their internal
structure or operation.
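For a finite training set, equations (2) and (3) reduce to a direct computation over the training patterns. The sketch below is a minimal illustration, not code from the paper: the function and variable names are hypothetical, and the network is treated as a black-box function from input vectors to output vectors, as in the text. The toy black box reproduces the figure 5 labels.

```python
import numpy as np

def compute_labels(patterns, probs, network):
    """Label vectors via equations (2)-(3): L'_ij = sum_X p_S(X) x_i y_j,
    then normalize each column j by its largest component."""
    X0 = np.asarray(patterns[0], dtype=float)
    J = len(network(X0))
    L_raw = np.zeros((len(X0), J))
    for X, p in zip(patterns, probs):
        X = np.asarray(X, dtype=float)
        Y = np.asarray(network(X), dtype=float)   # black-box response y_j(X)
        L_raw += p * np.outer(X, Y)               # accumulate p_S(X) * x_i * y_j
    col_max = L_raw.max(axis=0)                   # max_k L'_kj (equation (3))
    col_max[col_max == 0] = 1.0                   # guard against all-zero columns
    return L_raw / col_max

# Toy black box mimicking figure 5: neuron p responds to pattern (1,0),
# neuron q responds to pattern (eps,1); both patterns equally probable
eps = 0.1
def net(X):
    return [1.0, 0.0] if X[0] == 1.0 else [0.0, 1.0]

L = compute_labels([[1.0, 0.0], [eps, 1.0]], [0.5, 0.5], net)
# column for p -> (1, 0); column for q -> (eps, 1), as in the figure 5 example
```

Note that only the training patterns and the black-box responses are needed, consistent with the claim that labels can be computed without access to the network's internal weights.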


4.2. Exclusive allocation conditions

When an input pattern is presented to a network, the network parses that pattern and

represents the parsing via activations in the output layer. The activation of each input

neuron can be decomposed into parts, each part being accounted for by (assigned to) a

different output neuron. Thus, for condition 1 to be satisﬁed, the sum of these parts should

equal the activation of the input neuron, and together they should be able to account for the

activation of all output neurons.

One can describe the decomposition by using parsing coefficients. The parsing
coefficient C_{ij}(X̂, Ŷ) describes how much of the 'credit' for the activation of input neuron i
is assigned to output neuron j, given an input pattern vector X̂ and an output vector Ŷ.
Abbreviate C_{ij} ≡ C_{ij}(X, Y(X)). If the exclusive allocation constraints are fully satisfied,
then for each input pattern X (and its corresponding output vector Y(X)) in the full pattern
space there should exist parsing coefficients C_{ij} ≥ 0 such that for all output neurons j,

    \sum_{i : L_{ij} \neq 0} x_i \frac{C_{ij}}{\sum_k C_{ik}} \frac{L_{ij}}{\sum_k L_{kj}} = y_j.    (4)

In equation (4), the normalized label values L_{ij} / \sum_k L_{kj} are analogous to the weights of a
neural network. The parsing coefficients C_{ij} / \sum_k C_{ik} describe how the credit for each input
activation is allocated to output neurons, so that the L-weighted, C-allocated inputs exactly
produce the outputs. It is assumed that \sum_j C_{ij} > 0 for all i. The sum is taken only over
the non-zero L_{ij} values; otherwise the C_{ij} coefficients would be underconstrained.

The idea of using parsing coefﬁcients to express exclusive allocation constraints is

similar to the idea of using dynamic gating weights for credit assignment (Rumelhart and

McClelland 1986, Morse 1994). The dynamic gating weights are computed using the actual

or static weights on the connections in the network. In contrast, parsing coefﬁcients are

computed using the label vector for the output neurons and are independent of the actual

connection weights. As seen in the examples in section 3, networks that respond identically

to patterns in the training set can have very different connection weights (the weights may

even have different signs). Hence it is difﬁcult to compare the generalization properties

of these networks using dynamic gating weights. On the other hand, label vectors are

computed from the response of a network to patterns in the training set; the label vectors in

these different networks (ﬁgure 2 or ﬁgure 3) are identical. The label vector method treats

each network as a black box (independent of the network’s connectivity and weights, which

are internal to the box), examining just the networks’ inputs and outputs. This facilitates a

comparison of the generalization behaviour of the networks.

4.3. Minimization form of conditions

It is possible (e.g., in the presence of noise) that a network does not satisfy equation (4)

exactly. Yet it would still be desirable to measure how close the network comes to

satisfying the exclusive allocation conditions expressed by this equation. Hence the

exclusive allocation requirement should instead be framed as a minimization condition.

By squaring the difference between the left-hand and right-hand sides in equation (4) and
summing over all output neurons, one obtains

    E_1(X̂, Ŷ) = \frac{1}{J} \sum_j \bigg[ ŷ_j - \sum_{i : L_{ij} \neq 0} x̂_i \frac{C_{ij}(X̂, Ŷ)}{\sum_k C_{ik}(X̂, Ŷ)} \frac{L_{ij}}{\sum_k L_{kj}} \bigg]^2    (5)


where X̂ = (x̂_1, x̂_2, ..., x̂_I) and Ŷ = (ŷ_1, ŷ_2, ..., ŷ_J) are placeholder variables. The
normalization 1/J adjusts for the number of output units J.

By integrating over the network's parsings of all possible input patterns, one obtains
the quantity

    E_1 = \int_X E_1(X, Y(X)) \, dX.    (6)

Thus, for each input pattern X, the objective is to find a set of parsing
coefficients C_{ij}(X, Y(X)) that minimizes the measure E_1 of the network's exclusive
allocation 'deficiency', in a least-squares sense. The measure E_1 is computed across all
patterns in the full pattern space, whereas (as shown in equation (1)) the labels are computed
across only the training set S. How the parsing coefficients can be obtained, for the purpose
of measuring network behaviour, is a separate question, not treated in detail in this paper.
This analysis is concerned non-constructively with the existence of parsing coefficients
that satisfy or minimize the equations. In practice, the minimization can be performed in a
number of ways, e.g., using an iterative procedure (Morse 1994) to find the coefficients.
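Given label vectors and a candidate set of parsing coefficients, equation (5) can be evaluated directly. The sketch below uses illustrative names and checks the figure 5(A) parsing; one assumption is flagged in the comments: C_bq is set to a nonzero value so that every row sum Σ_k C_ik is positive, as the text requires — since x_b = 0, this allocates no actual credit.

```python
import numpy as np

def E1(x, y, C, L):
    """Exclusive-allocation deficiency of one parsing (equation (5)).
    x: input vector (I,); y: output vector (J,);
    C: parsing coefficients (I, J); L: label matrix (I, J)."""
    I, J = L.shape
    C_row = C.sum(axis=1)            # sum_k C_ik, assumed > 0 for every i
    L_col = L.sum(axis=0)            # sum_k L_kj, the 'size' of neuron j
    err = 0.0
    for j in range(J):
        s = 0.0
        for i in range(I):
            if L[i, j] != 0:         # sum only over nonzero label values
                s += x[i] * (C[i, j] / C_row[i]) * (L[i, j] / L_col[j])
        err += (y[j] - s) ** 2
    return err / J                   # 1/J normalization

# Figure 5(A): input a = (1,0); labels p = (1,0), q = (eps,1); output (1,0)
eps = 0.1
L = np.array([[1.0, eps], [0.0, 1.0]])
# C_bq = 1 (not 0 as in the text) so row b has positive sum; x_b = 0 anyway
C = np.array([[1.0, 0.0], [0.0, 1.0]])
print(E1(np.array([1.0, 0.0]), np.array([1.0, 0.0]), C, L))  # -> 0.0
```

A deficiency of exactly zero confirms that the figure 5(A) parsing fully satisfies equation (4).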

4.4. Condition 2 is necessary but not sufﬁcient

The E_1 scores produced by equation (5) can be used as a criterion to grade a network's
generalization behaviour on particular input pattern parsings. For instance, in figure 2(I), it
is easily seen that equation (5) will produce poor scores for the linear decorrelator's parsing
of pattern c, with any set of parsing coefficients.

However, for certain input patterns there can be more than one parsing that would yield
good E_1 scores; some of these parsings may reflect better generalization behaviour than
others. An extreme example is illustrated in figure 5, which shows a network with two
input neurons, marked a and b, and two output neurons, marked p and q. The network is
trained on two patterns, (1, 0) and (ε, 1), which occur with equal probability during training,
with 0 < ε ≪ 1. By equations (1)–(3), the neuron labels in this network are

    L_{ap} = 1    L_{bp} = 0    L_{aq} = ε    L_{bq} = 1.

Suppose that, after the network has been trained, the pattern X = (1, 0) is presented.
As shown in figure 5(A), the network could respond by activating output neuron p fully;
Y(X) = (1, 0). Using the parsing coefficients

    C_{ap} = 1    C_{bp} = 0    C_{aq} = 0    C_{bq} = 0,

this response satisfies equation (4) for all input and output neurons. However, a network
could instead respond as in figure 5(B), where the output is Y(X) = (0, ε/(1+ε)). In this
case, one set of valid parsing coefficients would be

    C_{ap} = 0    C_{bp} = 0    C_{aq} = 1    C_{bq} = 0.

Even for this parsing, equation (4) is fully satisfied. If this same relationship holds when ε
is made vanishingly small, then equation (4) will always be satisfied for neuron q with
the given set of parsing coefficients. This example shows that in the limit as ε → 0, the
activation of an input neuron i could be assigned to an inactive output neuron j if the
corresponding label L_{ij} were zero, and condition 1 would be satisfied. For this reason,
equation (4) excludes labels of value zero.

Condition 2 is imposed to repair this anomaly further; the parsing in figure 5(B) does
not satisfy the equations listed below. Define

    Y^*_1(X̂) = \{ Ŷ : E_1(X̂, Ŷ) = \min_{Ŷ′} E_1(X̂, Ŷ′) \}.    (7)


Figure 5. Credit assignment example. The label vector of each of the two output neurons
(marked p and q) has two elements, corresponding to the two input neurons (marked a and b).
The label of neuron p is (1, 0); the label of neuron q is (ε, 1). These values are indicated by
the numbers and arrows. Thin arrows denote a weak connection, with weight ε. Dashed arrows
denote a connection with weight 0. (A) When pattern a is presented, the depicted network
fully activates the output neuron marked p; this parsing satisfies conditions 1 and 2. (B) When
pattern a is presented, the depicted network activates the output neuron marked q to a small
value, ε/(1+ε); this parsing satisfies exclusive allocation condition 1, but not condition 2. (C) In
response to input b, neuron p becomes active in the depicted network. However, if C_{bp} = 1,
this parsing would satisfy condition 2 but not condition 1. (D) Here the label of neuron p
is (ε, 1); the label of neuron q is (0, 1). When input pattern a is presented, the network's best
response (according to the neuron labels) is to activate neuron p to the level ε/(1 + ε). This
parsing satisfies condition 1, and condition 2 should be designed so that this parsing is judged
to satisfy it.

Y^*_1(X̂) is the set of all output vectors that would best satisfy condition 1, given input X̂.

Next, define the function

    M_2(X̂, Ŷ) = \bigg[ \sum_i x̂_i - \sum_j ŷ_j \sum_k L_{kj} \bigg]^2.    (8)

This function measures the difference between the total input and the total size-normalized
output. The factor \sum_k L_{kj} represents the size of output neuron j.

Next, define

    E_2(X̂, Ŷ) = \bigg[ M_2(X̂, Ŷ) - \min_{Ŷ′ \in Y^*_1(X̂)} M_2(X̂, Ŷ′) \bigg]^2.    (9)

The 'min M_2' term represents the best condition 2 score for any parsing of any output
vector that best satisfies condition 1. The equation thus measures the difference between
the condition 2 score for the given parsing and the condition 2 score for the best parsing.

Finally, define

    E_2 = \int_X E_2(X, Y(X)) \, dX.    (10)

This equation computes an overall score for how well all the network's parsings satisfy
condition 2.

For input pattern X = (1, 0) in figures 5(A) and 5(B), both the output vectors
Y(X) = (1, 0) and Y(X) = (0, ε/(1+ε)) are included in the set Y^*_1(X). However, the value
of M_2(X̂, Ŷ) equals zero when Ŷ = (1, 0) and exceeds zero when Ŷ = (0, ε/(1+ε)). The
minimum value of the M_2 function in (9) is zero. Thus, E_2(X̂, Ŷ) = 0 when Ŷ = (1, 0),
and E_2(X̂, Ŷ) > 0 when Ŷ = (0, ε/(1+ε)). The measure E_2 is minimized in networks
that behave like the network of figure 5(A), rather than like the network of figure 5(B).
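The preference that E_2 expresses can be checked numerically on the figure 5 example. Below is a minimal sketch of equation (8) with illustrative names; the value ε = 0.1 is an arbitrary choice for the demonstration.

```python
import numpy as np

def M2(x, y, L):
    """Squared difference between total input and total
    size-normalized output (equation (8))."""
    sizes = L.sum(axis=0)                 # sum_k L_kj: 'size' of each output neuron
    return (x.sum() - (y * sizes).sum()) ** 2

eps = 0.1
L = np.array([[1.0, eps], [0.0, 1.0]])   # labels: p = (1,0), q = (eps,1)
x = np.array([1.0, 0.0])                 # input pattern a

y_A = np.array([1.0, 0.0])               # figure 5(A): p fully active
y_B = np.array([0.0, eps / (1 + eps)])   # figure 5(B): q weakly active

print(M2(x, y_A, L))   # -> 0.0
print(M2(x, y_B, L))   # > 0: the 5(B) parsing is penalized by E_2
```

Since the minimum M_2 over Y^*_1(X) is zero here, E_2 vanishes for the figure 5(A) response and is positive for the figure 5(B) response, matching the argument in the text.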


As figures 5(A) and 5(B) illustrate, condition 2 is a necessary part of the definition of
exclusive allocation. Figure 5(C) shows that condition 2 alone is not sufficient to define
exclusive allocation; the parsing

    C_{ap} = 0    C_{bp} = 1    C_{aq} = 0    C_{bq} = 0

satisfies condition 2 but not condition 1. Therefore, both conditions are necessary in the
definition of exclusive allocation.

The equations above for condition 2 were designed to ensure that the parsing shown in
figure 5(D) is considered valid. In this case, the total input (= 1) does not equal the total
size-normalized output (= ε/(1 + ε)). Nevertheless, the parsing is the best one possible,
given the labels shown. In equation (9), the quality of a parsing is measured relative to the
quality of the best parsing, rather than to an arbitrary external standard. For this reason, the
clause 'as closely as possible' is included in condition 2.

4.5. Inexactness tolerance

Consider equations (7)–(9). In a realistic environment with noise, there might exist an
output vector Ŷ that is spuriously excluded from the set Y^*_1(X̂), yet whose M_2 value is
significantly smaller than that of any output vector in Y^*_1(X̂). This situation can occur
because equation (7) requires that the E_1(X̂, Ŷ) value exactly equal the smallest such value
for any Ŷ′; but noise can preclude exact equality.

Hence, near-equality, rather than exact equality, should be required; some degree of
inexactness tolerance is necessary. Equation (7) must therefore be revised. One way to
remedy the problem is to replace equation (7) with

    Y^*_1(X̂) = \{ Ŷ : E_1(X̂, Ŷ) ≤ T_1 \min_{Ŷ′} E_1(X̂, Ŷ′) \}    (11)

where T_1 ≥ 1 is an inexactness tolerance parameter. Using this new equation, E_2 measures
the degree to which condition 2 is satisfied, relative to the best M_2 value chosen from
among the parsings that satisfy condition 1 tolerably well. T_1 thus becomes an additional
free parameter of the evaluation process.

The two exclusive allocation conditions will be discussed further in section 5. But ﬁrst,

two additional constraints that reﬁne the measure of generalization will be introduced. The

exclusive allocation conditions leave ambiguous the choice between certain parsings (for

example, between ﬁgures 1(C) and 1(D)). The two additional constraints are useful because

they further limit the allowable choices, thereby regularizing or disambiguating the parsings.

The additional constraints are optional: there may be some instances for which the added

regularization is not needed.

4.6. Sequence masking constraint

The ‘sequence masking’ property (Cohen and Grossberg 1986, 1987, Marshall 1990b, 1995,

Nigrin 1993) concerns the responses of a system to patterns of different sizes (or scales). It

holds that large, complete output representations are better than small ones or incomplete

ones. For example, it is better to parse input pattern ab as a single output category ab

(ﬁgure 6(A)) than as two smaller output categories a + b (ﬁgure 6(B)). It is also better to

parse input ab as the complete output category ab (ﬁgure 6(A)) than as an incomplete part

of a larger output category abcd (ﬁgure 6(C)).


Figure 6. Sequence masking. Input pattern ab is presented. Possible output responses satisfying
conditions 1 and 2 are shown. (A) Output is ab. (B) Output is a + b. (C) Output is 50% activation
of abcd.

A new sequence masking constraint can optionally be imposed, to augment the exclusive

allocation criterion, as part of the deﬁnition of generalization. The sequence masking

constraint biases the network evaluation measure toward preferring parsings that exhibit the

sequence masking property. The sequence masking constraint can be stated as

• Condition 3. Large, complete output representations are better than small ones or

incomplete ones.

One way to implement condition 3 is as follows. Let

    Y^*_2(X̂) = \{ Ŷ : E_2(X̂, Ŷ) ≤ T_2 \min_{Ŷ′ \in Y^*_1(X̂)} E_2(X̂, Ŷ′) \}    (12)

    M_3(Ŷ) = \bigg[ \sum_j \frac{(ŷ_j \sum_k L_{kj})^2}{1 + \sum_k L_{kj}} \bigg]^{-1}    (13)

    E_3(X̂, Ŷ) = \bigg[ M_3(Ŷ) - \min_{Ŷ′ \in Y^*_2(X̂)} M_3(Ŷ′) \bigg]^2.    (14)

Here Y^*_2(X̂) is the set of output vectors satisfying condition 1 that also best satisfy
condition 2, in response to a given input pattern X̂. The parameter T_2 ≥ 1 specifies the
inexactness tolerance of the evaluation process with regard to satisfaction of condition 2.

The function M_3 computes a bias in favour of larger, complete output representations.
Using this function, the network of figure 6(A) would have an M_3 score of 3/4, the network
of figure 6(B) would have an M_3 score of 2/2, and the network of figure 6(C) would have an
M_3 score of 5/4; smaller values are considered to be better.

In equation (14), the 'min M_3' term represents the best condition 3 score for any parsing
of any output vector that best satisfies conditions 1 and 2. The equation thus measures the
difference between the condition 3 score for the given parsing and the condition 3 score for
the best parsing.

To compute an overall score for how well the network's parsings satisfy condition 3,
define

    E_3 = \int_X E_3(X, Y(X)) \, dX.    (15)

This equation integrates the E_3 scores across all possible input patterns.


The sequence masking constraint should be imposed in measurements of generalization

when larger, complete representations are more desirable than small ones or incomplete

ones.
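The three M_3 scores quoted for figure 6 can be reproduced from equation (13). A minimal sketch with illustrative names, assuming label vectors for output neurons a, b, ab, and abcd over the four inputs a–d (a plausible reading of the figure, not taken verbatim from it):

```python
import numpy as np

def M3(y, L):
    """Bias toward large, complete representations (equation (13));
    smaller values are better."""
    sizes = L.sum(axis=0)                               # sum_k L_kj
    return 1.0 / (((y * sizes) ** 2) / (1.0 + sizes)).sum()

# Columns: output neurons a, b, ab, abcd; rows: inputs a, b, c, d
L = np.array([[1, 0, 1, 1],
              [0, 1, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)

print(M3(np.array([0, 0, 1.0, 0]), L))    # figure 6(A), output ab:  3/4
print(M3(np.array([1.0, 1.0, 0, 0]), L))  # figure 6(B), output a+b: 2/2 = 1
print(M3(np.array([0, 0, 0, 0.5]), L))    # figure 6(C), 50% abcd:   5/4
```

The single fully active, exactly matching category ab scores best (3/4), in line with the sequence masking preference stated in condition 3.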

4.7. Uncertainty multiplexing constraint

If an input pattern is ambiguous (i.e., there exists more than one valid parsing), then

conditions 1 and 2 do not indicate whether a particular parsing should be selected or whether

the representations for the multiple parsings should be simultaneously active. For instance,

in subsection 3.2.2 (ﬁgures 1(C) or 3(E) and 3(F)), when pattern b is presented, conditions 1

and 2 can be satisﬁed if neuron ab is half-active and neuron bc is inactive, or if ab is inactive

and bc is half-active, or by an inﬁnite number of combinations of activations between these

two extreme cases.

Marshall (1990b, 1995) discussed the desirability of representing the ambiguity in such

cases by partially activating the alternative representations, to equal activation values. The

output in which neurons ab and bc are equally active at the 25% level (ﬁgures 1(C), 3(E) and

3(F)) would be preferred to one in which they were unequally active: for example, when

ab is 50% active and bc is inactive (ﬁgure 1(D)). This type of representation expresses

the network’s uncertainty about the true classiﬁcation of the input pattern, by multiplexing

(simultaneously activating) partial activations of the best classiﬁcation alternatives.

A new ‘uncertainty multiplexing’ constraint can optionally be imposed, to augment

the exclusive allocation criterion. The uncertainty multiplexing constraint regularizes the

classiﬁcation ambiguities by limiting the allowable relative activations of the representations

for the multiple alternative parsings for ambiguous input patterns. The uncertainty

multiplexing constraint can be stated as

• Condition 4. When there is more than one best match for an input pattern, the

best-matching representations divide the input signals equally.

The notion of best match is speciﬁed by conditions 1 and 2, and (optionally) 3. (Other

deﬁnitions for best match can be used instead.)

One way to implement the uncertainty multiplexing constraint is as follows. Let

    Y^*_3(X̂) = \{ Ŷ : E_3(X̂, Ŷ) ≤ T_3 \min_{Ŷ′ \in Y^*_2(X̂)} E_3(X̂, Ŷ′) \}    (16)

    Y^*_4(X̂) = mean( Y^*_3(X̂) )    (17)

    E_4(X̂, Ŷ) = ( Ŷ - Y^*_4(X̂) )^2    (18)

where mean(α) ≡ \int α \, dα / ‖α‖ is the element-by-element mean of the set of vectors α,
where ‖α‖ describes the size or 'measure' of the region α, and where α^2 refers to the dot
product of vector α with itself. Function Y^*_2 selects the set of output vectors satisfying
condition 1 that also best satisfy condition 2. Of this set of output vectors, function Y^*_3
selects the subset that also best satisfies condition 3. Function Y^*_4 averages all the output
vectors in this subset together. Finally, E_4 treats this average as the 'ideal' output vector
and measures the deviation of a given output vector from the ideal. Parameter T_3 ≥ 1
specifies the inexactness tolerance of the evaluation process with regard to satisfaction of
condition 3.

By these equations, the parsing of input pattern b in figure 1(C) in which neurons ab
and bc are equally active at the 25% level would be preferred to a parsing in which they
were unequally active: for example, when ab is 50% active and bc is inactive.


If enforcement of uncertainty multiplexing is desired but enforcement of sequence
masking is not desired, then equation (16) can be replaced by Y^*_3(X̂) = Y^*_2(X̂).

To compute an overall score for how well the network's parsings satisfy condition 4,
define

    E_4 = \int_X E_4(X, Y(X)) \, dX.    (19)

The uncertainty multiplexing constraint should be imposed in measurements of
generalization when balancing ambiguity among likely alternatives is more desirable than
making a definite guess.
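Equations (17) and (18) can be illustrated on the ambiguous pattern b discussed above. The sketch below uses the two extreme valid parsings as a finite stand-in for the continuum of best-matching output vectors (their mean equals the mean of the continuum here); names are illustrative, with output components ordered (ab, bc).

```python
import numpy as np

def E4(y, candidates):
    """Deviation from the 'ideal' multiplexed output (equations (17)-(18)):
    the ideal is the element-wise mean of the best-matching output vectors."""
    ideal = np.mean(candidates, axis=0)     # Y*_4: mean of the Y*_3 set
    d = np.asarray(y, dtype=float) - ideal
    return float(d @ d)                     # squared deviation from the ideal

# Ambiguous pattern b: two extreme valid parsings,
# ab half-active or bc half-active
candidates = [np.array([0.5, 0.0]), np.array([0.0, 0.5])]

print(E4([0.25, 0.25], candidates))   # balanced multiplexing -> 0.0
print(E4([0.5, 0.0], candidates))     # one-sided parsing is penalized
```

The balanced 25%/25% response scores zero, while the one-sided 50%/0% response is penalized, matching the preference stated in condition 4.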

4.8. Scoring network performance

To compare objectively the generalization performance of specific networks, given a
particular training environment, one can formulate a numerical score that incorporates the
four criteria E_1, E_2, E_3, and E_4. One can assign each of these factors a weighting to yield
an overall network performance score. For instance, the score can be defined as

    E_{T_1, T_2, T_3} = a_1 E_1 + a_2 E_2 + a_3 E_3 + a_4 E_4    (20)

where a_1, a_2, a_3, and a_4 are weightings that reflect the relative importance of the four
generalization conditions, and T_1, T_2, and T_3 are parameters of the evaluation process,
specifying the degrees to which various types of inexactness are tolerated (equations (11),
(12) and (16)). The score of each network can then be computed numerically (and
laboriously). A full demonstration of such a computation would be an interesting next
step for this research. The choice of weightings for the various factors would affect the
final rankings.
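Equation (20) is a plain weighted sum; a minimal sketch with hypothetical score and weighting values (none of these numbers come from the paper):

```python
def overall_score(E, a):
    """Overall generalization score (equation (20)): weighted sum of the
    four condition scores E_1..E_4; lower is better."""
    return sum(weight * score for weight, score in zip(a, E))

# Hypothetical condition scores and equal weightings
E = [0.02, 0.0, 0.1, 0.05]
a = [1.0, 1.0, 1.0, 1.0]
print(overall_score(E, a))  # weighted sum of the four scores
```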

It is theoretically possible to eliminate the free parameters for inexactness tolerance

by replacing them with a ﬁxed calculation, such as a standard deviation from the mean.

However, such expedients have not been explored in this research. In any case, a single set

of parameter values or calculations should be chosen for any comparison across different

network types.

To compare fully the generalization performance of the classifiers themselves (e.g., all
linear decorrelators versus all EXIN networks) might require evaluating \int E(S) \, dS across all
possible training environments S. Such a calculation is obviously infeasible, except perhaps
by stochastic analysis. Nevertheless, the comparisons can be understood qualitatively on
the basis of key examples, like the ones presented in this paper.

5. Assessing generalization in EXIN network simulations

This section discusses how the generalization criteria can be applied to measure the

performance of an EXIN network. The EXIN network was chosen as an example because

it yields good but not perfect generalization; thus, it shows effectively how the criteria

operate.

Figure 7(A) shows an EXIN network that has been trained with six input patterns

(Marshall 1995). Figure 7(B) shows the multiplexed, context-sensitive response of the

network to a variety of familiar and unfamiliar input combinations. All 64 possible binary

input patterns were tested, and reasonable results were produced in each case (Marshall

1995); ﬁgure 7(B) shows 16 of the 64 tested parsings.

Given the four generalization conditions described in the preceding sections, the

performance of this EXIN network (Marshall 1995) will be summarized below. A sample



Figure 7. Simulation results. (A) The state of an EXIN network after 3000 training presentations

of input patterns drawn from the set a, ab, abc, cd, de, def . The input pattern coded by each

output neuron (the label) is listed above the neuron body. The approximate ‘size’ normalization

factor of each output neuron is shown inside the neuron. Strong excitatory connections (weights

between 0.992 and 0.999) are indicated by lines from input (lower) neurons to output (upper)

neurons. All other excitatory connections (weights between 0 and 0.046) are omitted from the

ﬁgure. Strong inhibitory connections (weights between 0.0100 and 0.0330), indicated by thick

lateral arrows, remain between neurons coding patterns that overlap. The thickness of the lines

is proportional to the inhibitory connection weights. All other inhibitory connections (weights

between 0 and 0.0006) are omitted from the ﬁgure. (B) 16 copies of the network. Each copy

illustrates the network’s response to a different input pattern. Network responses are indicated

by ﬁlling of active output neurons; fractional height of ﬁlling within each circle is proportional to

neuron activation value. Rectangles are drawn around the networks that indicate the responses

to one of the training patterns. (Redrawn with permission, from Marshall (1995), copyright

Elsevier Science.)

of the most illustrative parsings, and the degree to which they satisfy the conditions, will be

discussed. The most complex example below, pattern abcdf, is examined in greater detail,

using the equations described in the preceding section to evaluate the parsing.

5.1. EXIN network response to training patterns

Consider the response of the network to a pattern in the training set, such as a (ﬁgure 7(B)).

The active output neuron has the label a. It is fully active, so it fully accounts for the input

pattern a. No other output neuron is active, so the activations across the output layer are


fully accounted for by the input pattern. Thus, condition 1 is satisﬁed on the patterns from

the training set. The total input almost exactly equals the total size-normalized output, so

condition 2 is well satisﬁed on these patterns. Condition 3 is also satisﬁed for the training

patterns: for example, pattern ab activates output neuron ab, not a or abc. Since the

training patterns are unambiguous, condition 4 is satisﬁed by default on these patterns. As

seen in ﬁgure 7(B), the generalization conditions are satisﬁed for all patterns in the training

set (marked by rectangles).

5.2. EXIN network response to ambiguous patterns

Consider the response of the network to an ambiguous pattern such as d. Pattern d is part of

familiar patterns cd, de, and def, and it matches cd and de most closely. The corresponding

two output neurons are active, both between the 25% and 50% levels. Conditions 1 and 2

appear to be approximately satisﬁed on this pattern: the activation of d is accounted for

by split activation across cd and de, and the activations of cd and de are accounted for by

disjoint fractions of the activation of d. Condition 3 is well satisﬁed because neuron def

is inactive. Condition 4 is also approximately (but not perfectly) satisﬁed, since the two

neuron activations are nearly equal.

Now consider pattern b, which is part of patterns ab and abc. The network activates

neuron ab at about the 50% level. Since b constitutes 50% of the pattern ab, the activation

of neuron ab fully accounts for the input pattern. Likewise, the activation of ab is fully

accounted for by the input pattern. Pattern b is more similar to ab than to abc, so it is

correct for neuron abc to be inactive in this case, by condition 3.

Similarly, pattern c is part of abc and cd. However, neuron abc is slightly active, and

neuron cd is active at a level slightly less than 50%. Condition 1 is satisﬁed on pattern c:

the sum of the output activations attributable to c is still the same as that of the activations

attributable to b in the previous example, and the activations of neurons abc and cd are

attributable to disjoint fractions (approximately 25% and 75%) of the activation of c; thus,

condition 2 is well satisﬁed. Condition 3 is not as well satisﬁed here as in the previous

example. The difference can be explained by the weaker inhibition between abc and cd

than between ab and abc; more coactivation is thus allowed.
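The size-normalized arithmetic behind these two examples can be sketched as follows (a minimal sketch; the 25%/75% split for pattern c is the approximate figure quoted above, and the simulated activations match these values only approximately):

```python
# Size-normalized credit arithmetic for patterns b and c (illustrative
# coefficients taken from the approximate splits described in the text).
size = {'ab': 2, 'abc': 3, 'cd': 2}   # label sizes of the output neurons

# Pattern b: all of b's credit goes to ab, its closest match.
y_ab = 1.00 / size['ab']              # 0.5 -> neuron ab at the 50% level

# Pattern c: credit splits roughly 25%/75% between abc and cd.
y_abc = 0.25 / size['abc']            # ~0.083 -> abc slightly active
y_cd = 0.75 / size['cd']              # 0.375 -> cd active below the 50% level

# Condition 2 for pattern c: total input credit (1) equals the total
# size-normalized output attributable to c.
assert abs(y_abc * size['abc'] + y_cd * size['cd'] - 1.0) < 1e-12
```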

Input pattern c is unambiguous, by condition 4. To satisfy condition 4 on an input

pattern, a network must determine which output neurons represent the best matches for the

input pattern. The simultaneous partial activation of abc and cd is a manifestation of some

tolerance of inexactness in the EXIN network's determination of the best matches. Alternatively,

as described in subsection 3.2.2, greater selectivity in the interneuron inhibition (as in

SONNET-2) can be used to satisfy condition 2 more exactly.

The results in ﬁgure 7 show that when presented with ambiguous patterns, the EXIN

network activates the best match, and when there is more than one best match, it permits

simultaneous activation of the best matches. Thus, the generalization behaviour on

ambiguous patterns meets the exclusive allocation conditions satisfactorily.

5.3. EXIN network response to multiple superimposed patterns

Consider the response of the network to pattern abcd. Pattern abcd can be compared with

patterns ab and cd; the response to abcd is the superposition of the separate responses

to ab and cd. Conditions 1, 2 and 4 are clearly met here. As discussed in subsection 3.1.4,

this is in contrast to the response of a WTA network, where the output neuron abc would


become active. Condition 3 is also met here; there is no output neuron labelled abcd, and

if neuron abc were fully active, then the input from neuron d could be accounted for only

by partial activation of another output neuron, such as de.

When f is added to abcd to yield the pattern abcdf, a chain reaction alters the network’s

response, from def down to a in the output layer. The presence of d and f causes the def

neuron to become approximately 50% active. In turn, this inhibits the cd neuron more,

which then becomes less active. As a result, the abc neuron receives less inhibition and

becomes more active. This in turn inhibits the activation of neuron ab. Because neuron ab

is less active, neuron a then becomes more active. These increases and decreases tend to

balance one another, thereby keeping conditions 1 and 2 satisﬁed on pattern abcdf. The

dominant parsing appears to be ab + cd + def, but the overlap between cd and def prevents

those two neurons from becoming fully coactive. As a result, the alternative parsings

involving abc or a can become partially active. No strong violations of conditions 3 and 4

are apparent. The responses to patterns cdf, abcf, and bcdf are also shown for comparison.

The patterns listed above were selected for discussion on the basis of their interesting

properties. The network’s response to all the other patterns can also be evaluated using

the exclusive allocation criterion. In each case, the EXIN network adheres well to the

four generalization conditions. Thus, the simulation indicates the high degree to which

EXIN networks show exclusive allocation, sequence masking, and uncertainty multiplexing

behaviours.

5.4. An example credit assignment computation

The generalization conditions can be formalized in a number of ways; the equations given

above represent one such formalization. For example, a different computation could be

used to express exclusive allocation deﬁciency, instead of the least-squares method of

equations (5), (8), (9), (14) and (18). Nonlinearities could be introduced in the credit

assignment scheme (equation (4)). The formalization given here expresses the generalization

conditions in a relatively simple manner that is suitable from a computational viewpoint.
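As an illustration, the least-squares search for parsing coefficients can be sketched as follows (a minimal sketch: the projected-gradient optimizer, the constraint handling, and all names are our assumptions, not prescribed by the paper's equations):

```python
import numpy as np

# A sketch of a least-squares credit-assignment computation (one possible
# formalization of condition 1).
def parsing_deficiency(x, y, size, mask, steps=3000, lr=0.02):
    """Search for parsing coefficients C[i, j] (fraction of input i's credit
    assigned to output j) minimizing sum_j (y_j - sum_i C_ij x_i / size_j)^2.

    x: input activations; y: actual output activations;
    size[j]: label size of output neuron j;
    mask[i, j] = 1 iff input feature i belongs to output j's label."""
    rng = np.random.default_rng(0)
    # Start from random feasible credit; inactive inputs carry no credit.
    C = mask * (x > 0)[:, None] * rng.uniform(0.2, 0.8, mask.shape)
    for _ in range(steps):
        attributed = (C * x[:, None]).sum(axis=0) / size       # parse energy / size
        grad = 2.0 * (attributed - y)[None, :] * x[:, None] / size[None, :]
        C = np.clip(C - lr * grad, 0.0, mask)        # each coefficient in [0, 1]
        row = np.maximum(C.sum(axis=1, keepdims=True), 1.0)
        C = C / row                                  # each input gives <= 100% credit
    attributed = (C * x[:, None]).sum(axis=0) / size
    return C, float(np.sum((y - attributed) ** 2))   # coefficients, deficiency
```

Applied to the abcdf example of figure 8 (with the label mask of the six output neurons a, ab, abc, cd, de and def), this search drives the deficiency close to zero, consistent with the manually estimated coefficients.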

Figure 8 describes a computation of the extent to which the network in ﬁgure 7 satisﬁes

the generalization conditions for a particular input pattern. The table in the rectangle

describes approximate parsing coefﬁcients for pattern abcdf. The coefﬁcients shown in

the table were estimated manually, to two decimal places. These coefﬁcients represent the

portion of the credit that is assigned between each input neuron activation and each output

neuron activation. For example, the activation of input neuron a is 1; 21% of its credit

is allocated to output neuron a, and 79% is allocated to ab. The input to neuron ab

is 0.79 + 0.38, the sum of the contributions it receives from different input neurons

weighted by the activation of the input neurons. This input is divided by the neuron’s

normalization factor (‘size’), 2. This normalization factor is derived from the neuron’s

label, which is determined by the training (familiar) patterns to which the neuron responds

(equations (2) and (3)). The resulting attributed activation value, 0.59, is very close to the

actual activation, 0.58, of neuron ab in the simulation. The existence of parsing coefﬁcients

(e.g., those in ﬁgure 8) that produce attributed activations that are all close to the actual

activations shows that condition 1 (equation (4)) is well satisfied for the input pattern abcdf.

Condition 2 is well satisfied because $\sum_i x_i$ (which equals 5) is very close to $\sum_j y_j \sum_k L_{kj}$ (which equals $(0.19 \times 1) + (0.58 \times 2) + (0.24 \times 3) + (0.58 \times 2) + (0.00 \times 2) + (0.56 \times 3) = 4.91$). Numerical values for conditions 3 and 4 can also be calculated, but the calculations would be much more computationally intensive, as they call for evaluation of all possible parsings of an input pattern, within a given training environment.
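The condition 2 arithmetic above can be checked directly (a minimal sketch using only the values reported in figure 8):

```python
# Values read from figure 8 for pattern abcdf (output neurons a, ab, abc,
# cd, de, def).
x = {'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 0, 'f': 1}   # input activations
y = [0.19, 0.58, 0.24, 0.58, 0.00, 0.56]               # actual activations
size = [1, 2, 3, 2, 2, 3]                              # label sizes sum_k L_kj

# Condition 2: total input vs total size-normalized output.
total_input = sum(x.values())                           # sum_i x_i = 5
total_output = sum(yj * sj for yj, sj in zip(y, size))  # sum_j y_j sum_k L_kj
print(total_input, round(total_output, 2))              # prints: 5 4.91
```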


[Figure 8, reconstructed as text. Parsing coefficients $C_{ij}$ for input pattern abcdf, with input activations $x_i = (1, 1, 1, 1, 0, 1)$:]

To output  |        From input neuron            | Total parse | Neuron | Attributed | Actual
neuron     |   a     b     c     d     e     f   | energy      | size   | activation | activation
-----------|-------------------------------------|-------------|--------|------------|-----------
a          |  .21                                |   .21       |  /1    |   .21      |   .19
ab         |  .79   .38                          |  1.17       |  /2    |   .59      |   .58
abc        |  .00   .62   .12                    |   .74       |  /3    |   .25      |   .24
cd         |              .88   .30              |  1.18       |  /2    |   .59      |   .58
de         |                    .00   .00        |   .00       |  /2    |   .00      |   .00
def        |                    .70   .00  1.00  |  1.70       |  /3    |   .57      |   .56

[Each attributed activation is the total parse energy $\sum_k C_{kj} x_k$ divided by the neuron size $\sum_k L_{kj}$.]

Figure 8. Parsing coefficients and attributed activations. The table shows parsing coefficients: the inferred decomposition of the credit from each input neuron into each output neuron to produce the activations shown in figure 7. Because the results of this computation are very close to the EXIN network's simulated results (compare the two rightmost columns), it can be concluded that the EXIN network satisfies condition 1 for exclusive allocation very well, on pattern abcdf. (Redrawn with permission, from Marshall (1995), copyright Elsevier Science.)

6. Discussion

The exclusive allocation criterion was used to compare qualitatively the generalization

performance of four unsupervised classiﬁers: WTA competitive learning networks, linear

decorrelator networks, EXIN networks, and SONNET-2 networks. The comparisons suggest

that more sophisticated generalization performance is obtained at the cost of increased

complexity. The exclusive allocation behaviour of an EXIN network was examined in more

detail, and one parsing was analysed quantitatively. The concept of exclusive allocation,

or credit assignment, is a conceptually useful way of deﬁning generalization because it

lends itself very well to the natural problem of decomposing and identifying independent

sources underlying superimposed or ambiguous signals (blind source separation) (Bell and

Sejnowski 1995, Comon et al 1991, Jutten and Herault 1991).

This paper has described formal criteria for evaluating the generalization properties

of unsupervised neural networks, based on the principles of exclusive allocation, sequence

masking, and uncertainty multiplexing. The examples and simulations show that satisfaction

of the generalization conditions can enable a network to do context-sensitive parsing, in

response to multiple superimposed patterns as well as ambiguous patterns. The method

describes a network in terms of its response to patterns in the training set and then places

constraints on the response of the network to all patterns, both familiar and unfamiliar.

The concepts of exclusive allocation, sequence masking, and uncertainty multiplexing

thus provide a principled basis for evaluating the generalization capability of unsupervised

classiﬁers.

The criteria in this paper deﬁne success for a system in terms of the quality of the

system’s internal representations of its input environment, rather than in terms of a particular

external task. The internal representations are inferred, without actually examining the


system’s internal processing, weights, etc, through a black-box approach of ‘labelling’:

observing the system’s responses to its training (‘familiar’) inputs. Then the system’s

generalization performance is evaluated by examining its responses to both familiar and

unfamiliar inputs. This deﬁnition is useful when a system’s generalization cannot be

measured in terms of performance on a speciﬁc external task, either when objective

classiﬁcations (‘supervision’) of input patterns are unavailable or when the system is general

purpose.

Acknowledgments

This research was supported in part by the Ofﬁce of Naval Research (Cognitive and Neural

Sciences, N00014-93-1-0208) and by the Whitaker Foundation (Special Opportunity Grant).

We thank George Kalarickal, Charles Schmitt, William Ross, and Douglas Kelly for valuable

discussions.

References

Anderson J A, Silverstein J W, Ritz S A and Jones R S 1977 Distinctive features, categorical perception, and

probability learning: Some applications of a neural model Psychol. Rev. 84 413–51

Bell A J and Sejnowski T J 1995 An information-maximization approach to blind separation and blind

deconvolution Neural Comput. 7 1129–59

Bregman A S 1990 Auditory Scene Analysis: The Perceptual Organization of Sound (Cambridge, MA: MIT Press)

Carpenter G A and Grossberg S 1987 A massively parallel architecture for a self-organizing neural pattern

recognition machine Comput. Vision, Graphics Image Process. 37 54–115

Cohen M A and Grossberg S 1986 Neural dynamics of speech and language coding: developmental programs,

perceptual grouping, and competition for short term memory Human Neurobiol. 5 1–22

——1987 Masking ﬁelds: A massively parallel neural architecture for learning, recognizing and predicting multiple

groupings of patterned data Appl. Opt. 26 1866–91

Comon P, Jutten C and Herault J 1991 Blind separation of sources, part II: problems statement Signal Process. 24

11–21

Craven M W and Shavlik J W 1994 Using sampling and queries to extract rules from trained neural networks

Machine Learning: Proc. 11th Int. Conf. (San Francisco, CA: Morgan Kaufmann) pp 37–45

Desimone R 1992 Neural circuits for visual attention in the primate brain Neural Networks for Vision and Image

Processing ed G A Carpenter and S Grossberg (Cambridge, MA: MIT Press) pp 343–64

Földiák P 1989 Adaptive network for optimal linear feature extraction Proc. Int. Joint Conf. on Neural Networks

(Washington, DC) (Piscataway, NJ: IEEE) vol I, pp 401–5

Hubbard R S and Marshall J A 1994 Self-organizing neural network model of the visual inertia phenomenon in

motion perception Technical Report 94-001 Department of Computer Science, University of North Carolina

at Chapel Hill, 26 pp

Jutten C and Herault J 1991 Blind separation of sources, part I: an adaptive algorithm based on neuromimetic

architecture Signal Process. 24 1–10

Kohonen T 1982 Self-organized formation of topologically correct feature maps Biol. Cybern. 43 59–69

Marr D 1982 Vision: A Computational Investigation into the Human Representation and Processing of Visual

Information (San Francisco, CA: Freeman)

Marr D and Poggio T 1976 Cooperative computation of stereo disparity Science 194 283–7

Marshall J A 1990a Self-organizing neural networks for perception of visual motion Neural Networks 3 45–74

——1990b A self-organizing scale-sensitive neural network Proc. Int. Joint Conf. on Neural Networks

(San Diego, CA) (Piscataway, NJ: IEEE) vol III, pp 649–54

——1990c Adaptive neural methods for multiplexing oriented edges Proc. SPIE 1382 (Intelligent Robots and

Computer Vision IX: Neural, Biological, and 3-D Methods, Boston, MA) ed D P Casasent, pp 282–91

——1992 Development of perceptual context-sensitivity in unsupervised neural networks: Parsing, grouping, and

segmentation Proc. Int. Joint Conf. on Neural Networks (Baltimore, MD) (Piscataway, NJ: IEEE) vol III,

pp 315–20

——1995 Adaptive perceptual pattern recognition by self-organizing neural networks: Context, uncertainty,

multiplicity, and scale Neural Networks 8 335–62


Marshall J A, Kalarickal G J and Graves E B 1996 Neural model of visual stereomatching: Slant, transparency,

and clouds Network: Comput. Neural Syst. 7 635–70

Marshall J A, Kalarickal G J and Ross W D 1997 Transparent surface segmentation and ﬁlling-in using local

cortical interactions Investigative Ophthalmol. Visual Sci. 38 641

Marshall J A, Schmitt C P, Kalarickal G J and Alley R K 1998 Neural model of transfer-of-binding in visual

relative motion perception Computational Neuroscience: Trends in Research, 1998 ed J M Bower, to appear

Morse B 1994 Computation of object cores from grey-level images PhD Thesis Department of Computer Science,

University of North Carolina at Chapel Hill

Nigrin A 1993 Neural Networks for Pattern Recognition (Cambridge, MA: MIT Press)

Oja E 1982 A simpliﬁed neuron model as a principal component analyzer J. Math. Biol. 15 267–73

Reggia J A, D’Autrechy C L, Sutton G G and Weinrich M 1992 A competitive redistribution theory of neocortical

dynamics Neural Comput. 4 287–317

Rumelhart D E and McClelland J L 1986 On the learning of past tenses of English verbs Parallel Distributed

Processing: Explorations in the Microstructure of Cognition. Volume 2: Psychological and Biological Models

(Cambridge, MA: MIT Press) pp 216–71

Sattath S and Tversky A 1987 On the relation between common and distinctive feature models Psychol. Rev. 94

16–22

Schmitt C P and Marshall J A 1998 Grouping and disambiguation in visual motion perception: A self-organizing

neural circuit model, in preparation

Yuille A L and Grzywacz N M 1989 A winner-take-all mechanism based on presynaptic inhibition feedback Neural

Comput. 1 334–47