
Biol Cybern (2017) 111:207–227
DOI 10.1007/s00422-017-0716-z

ORIGINAL ARTICLE

An insect-inspired model for visual binding II: functional analysis and visual attention

Brandon D. Northcutt¹ · Charles M. Higgins²

Received: 2 April 2016 / Accepted: 27 February 2017 / Published online: 16 March 2017
© Springer-Verlag Berlin Heidelberg 2017

Abstract We have developed a neural network model capable of performing visual binding inspired by neuronal circuitry in the optic glomeruli of flies: a brain area that lies just downstream of the optic lobes where early visual processing is performed. This visual binding model is able to detect objects in dynamic image sequences and bind together their respective characteristic visual features—such as color, motion, and orientation—by taking advantage of their common temporal fluctuations. Visual binding is represented in the form of an inhibitory weight matrix which learns over time which features originate from a given visual object. In the present work, we show that information represented implicitly in this weight matrix can be used to explicitly count the number of objects present in the visual image, to enumerate their specific visual characteristics, and even to create an enhanced image in which one particular object is emphasized over others, thus implementing a simple form of visual attention. Further, we present a detailed analysis which reveals the function and theoretical limitations of the visual binding network and in this context describe a novel network learning rule which is optimized for visual binding.

✉ Brandon D. Northcutt
brandon@northcutt.net

Charles M. Higgins
higgins@neurobio.arizona.edu

¹ Department of Electrical and Computer Engineering, University of Arizona, 1230 E. Speedway Blvd., Tucson, AZ 85721, USA

² Departments of Neuroscience and Electrical/Computer Eng., University of Arizona, 1040 E. 4th St., Tucson, AZ 85721, USA

Keywords Neural networks · Blind source separation · Visual binding · Object perception · Visual attention · Artificial intelligence

1 Introduction

Visual binding is the process of grouping together the visual characteristics of one object while differentiating them from the characteristics of other objects (Malsburg 1999), without regard to the spatial position of the objects. Based on recently identified structures termed optic glomeruli in the brains of flies and bees (Strausfeld et al. 2007; Strausfeld and Okamura 2007; Okamura and Strausfeld 2007; Paulk et al. 2009; Mu et al. 2012), we have developed a neural network model of visual binding (Northcutt et al. 2017), which encodes the visual binding in a pattern of inhibitory weights. This model's genesis was in the anatomical similarity of insect olfactory and optic glomeruli, and it was inspired by the work of Hopfield (1991), who modeled olfactory binding based on temporal fluctuations in the mammalian olfactory bulb, and by the seminal work of Herault and Jutten (1986) on blind source separation (BSS).

A pattern of inhibitory weights is learned by the binding network from visual experience based upon common temporal fluctuations of spatially global visual characteristics, with the underlying suppositions being that the visual characteristics of any given object fluctuate together, and differently from those of other objects. An example would be when an automobile passes behind an occluding tree: its color, motion, form, and orientation disappear and then reappear together, thus undergoing a common temporal fluctuation.

A detailed description and demonstration of the function of this model is given in a companion paper (Northcutt et al. 2017). In the present work, we describe the essentials of the model, present one representative demonstration of its function in visual binding, and then show how to explicitly interpret the binding output, which the network encodes only implicitly in the pattern of inhibitory weights. We show how to count the number of objects present in the input image sequence and enumerate their characteristics and relative strength. Further, we show how to use this information to create an enhanced image, emphasizing one particular object while de-emphasizing all others, thereby implementing a form of visual attention. Finally, we present a theoretical analysis of the functional limitations and capabilities of this neural network model to provide a better understanding of the network and its potential.

2 The visual binding model

Model simulations were performed in MATLAB (The MathWorks, Natick, MA) using a simulation time step of Δt = 10 ms and image sizes of 500 × 500 pixels.

Figure 1 shows a diagram of the full two-stage neural network used. This network begins with a first stage comprised of three separate, fully connected recurrent inhibitory neural networks used for refining the representation of motion, orientation, and color. The outputs of the first stage are fed into a second stage, a fully connected ten-unit inhibitory neural network which performs visual binding. Despite their apparently disparate purposes, all four neural networks comprising the overall model operate using exactly the same temporal evolution and learning rules, which are given below.

Each of the four recurrent neural networks used in the model (three in the first stage processing individual visual submodalities, and one in the second stage performing visual binding) learned, by using common temporal fluctuations, an inhibitory weight matrix that indicated the degree of inhibition between neurons within each network. We describe the essentials of the operation and training of these networks below; see Northcutt et al. (2017) for full details.

2.1 Computation of inputs to the network

Despite the fact that fly color vision is based on green, blue, and ultraviolet photoreceptors (Snyder 1979), for ease of human visualization and computer representation—and without loss of generality—our color images had red, green, and blue (RGB) color planes. Since this network is based on a model of the insect brain, an elaborated version of the Hassenstein–Reichardt motion detection algorithm (Hassenstein and Reichardt 1956; Santen and Sperling 1985)—the standard model of insect elementary motion computation—was used to compute motion inputs, and difference-of-Gaussian (DoG) filters—the best existing model of early insect orientation processing (Rivera-Alvidrez et al. 2011)—were convolved with the image to compute orientation.


Fig. 1 The two-stage network for visual submodality refinement and visual binding. Large circles represent units in the neural network. Unshaded half-circles at connections indicate excitation, and filled half-circles indicate inhibition. During the first phase of training, neurons in the three first-stage networks learn to mutually inhibit one another, thus refining the representation of motion, orientation, and color and resulting in more selective tunings in each submodality. In the second phase of training with a stimulus comprised of moving bars, the second-stage visual binding network learns the relative strengths of visual object features based on common temporal fluctuations and develops an inhibitory weight matrix to produce outputs that reflect the temporal fluctuations unique to each object


Inputs to the neural network were computed as diagrammed in Fig. 2. For achromatic motion and orientation processing, each RGB image was first converted to grayscale by taking the average of the three color components.



Fig. 2 Diagram of wide-field visual input computation from input images. Each input RGB image was converted to grayscale (G) for orientation and motion processing. Image motion was computed by the Hassenstein–Reichardt (HR) elementary motion detection model in the horizontal (I_H) and vertical (I_V) directions and then separated into four feature images I_left, I_right, I_down, and I_up, respectively, containing strictly positive leftward (H < 0), rightward (H > 0), downward (V < 0), and upward (V > 0) components. Three orientation-selective DoG filter kernels were convolved with the grayscale image to produce three orientation feature images I_0°, I_60°, and I_120°. The individual red, green, and blue color planes were taken as the final three feature images I_red, I_grn, and I_blu. Each of these ten 2D feature images was then spatially summed to create a scalar value representing wide-field image feature content. The inputs i_1(t) through i_10(t) became input to the neural network model of Fig. 1

The Hassenstein–Reichardt motion algorithm was iterated across rows and columns of the grayscale image, resulting in vertical and horizontal "motion images" containing signed local motion outputs. These were further subdivided into non-negative leftward, rightward, downward, and upward motion "feature images" by selecting components of each motion image with a particular sign. DoG filters selective for three orientations (0°, 60°, and 120°) were convolved with the grayscale images to create three orientation feature images. Finally, the red, green, and blue color planes were used as color feature images.

Each of these ten two-dimensional (2D) feature images was spatially summed over both dimensions to produce ten scalar measures of full-image feature content. Each group of scalar signals corresponding to a given visual submodality (motion, orientation, or color) was then normalized to a maximum value of unity, allowing features from different submodalities to be comparable while preserving ratios of features within each group. This normalization divided each group of signals by scalar factors n_m(t), n_o(t), and n_c(t) computed as the maximum value of any signal in the group during the last 2 s, thereby automatically scaling inputs from different submodalities to become comparable with one another and simultaneously adapting the signals to changing visual conditions. The ten normalized scalar values were provided at each simulation time step as inputs i_1(t) through i_10(t) to the neural network of Fig. 1.
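As an illustration, the following Python sketch shows one way this adaptive group normalization could be implemented. The array layout, the explicit history window, and the small divisor guard are assumptions of the sketch, not details taken from the original MATLAB implementation.

```python
import numpy as np

# Input ordering follows Fig. 1: inputs 0-3 motion, 4-6 orientation, 7-9 color.
GROUPS = {"motion": slice(0, 4), "orientation": slice(4, 7), "color": slice(7, 10)}

def normalize_inputs(raw, history):
    """raw: (10,) current wide-field feature sums.
    history: (window, 10) sums over the last 2 s (window = 2 s / dt samples).
    Returns the normalized inputs i(t) and the group factors n_m, n_o, n_c."""
    i = np.empty(10)
    factors = {}
    for name, sl in GROUPS.items():
        # n(t): maximum value of any signal in the group during the last 2 s
        n = max(history[:, sl].max(), 1e-12)  # guard against division by zero
        factors[name] = n
        i[sl] = raw[sl] / n                   # group maximum scaled to unity
    return i, factors
```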

2.2 Network temporal evolution

All inputs to each of the four recurrent networks in the model were high-pass filtered before being processed. This ensured that static features of the visual scene, such as an unchanging background, never became input to the network.

As described in detail in a companion paper (Northcutt et al. 2017), the activation of each neuron in the model—which may be positive or negative—represents for non-spiking neurons the graded potential of the neuron relative to its resting potential, and for spiking neurons the average firing rate relative to the spontaneous rate.

The activation o_n(t) of neuron n in a recurrent inhibitory neural network may be modeled as

$$o_n(t) = i'_n(t) - \sum_{k=1}^{N} W_{n,k} \cdot o_k(t - \tau_i) \qquad (1)$$

where i'_n(t) represents a first-order temporal high-pass filtered version of the input i_n(t) using time constant τ_HI = 1.0 s, W_{n,k} represents the strength of the inhibitory synaptic pathway from neuron k to neuron n, o_k(t) represents the activation of a different neuron k in the network, and τ_i represents the small but finite delay required to produce inhibition. This set of equations can be expressed in matrix form as

$$\mathbf{o}(t) = \mathbf{i}'(t) - \mathbf{W} \cdot \mathbf{o}(t - \tau_i) \qquad (2)$$


where lowercase bold symbols indicate N-element column vectors and uppercase bold symbols represent N × N matrices. Since the biophysical details of optic glomeruli—and thus τ_i—are unknown, but the existence of a finite delay is crucial (as detailed in Northcutt and Higgins 2017), in our simulations we formulate the network temporal dynamics as

$$\mathbf{o}(t) = \mathbf{i}'(t) - \mathbf{W} \cdot \mathbf{o}(t - \Delta t) \qquad (3)$$

where Δt is the simulation time step. By using τ_i = Δt, we provide the smallest finite inhibition delay possible in our simulation. This equation was used in all simulations.
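A minimal sketch of this time-stepping evolution in Python, assuming a standard discrete-time form for the first-order high-pass filter (the filter discretization is not specified in the text):

```python
import numpy as np

TAU_HI = 1.0   # input high-pass time constant (s)
DT = 0.01      # simulation time step (s), 10 ms

def highpass_step(x, x_prev, y_prev, dt=DT, tau=TAU_HI):
    """One step of a first-order high-pass filter (standard discretization)."""
    a = tau / (tau + dt)
    return a * (y_prev + x - x_prev)

def network_step(i_filt, o_prev, W):
    """Eq. (3): outputs from the high-pass filtered inputs and the previous
    outputs, with inhibition delayed by one simulation time step."""
    return i_filt - W @ o_prev
```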

When the time scale of input and output changes is much larger than the simulation time step Δt, (3) may be approximated as

$$\mathbf{o}(t) = \mathbf{i}'(t) - \mathbf{W} \cdot \mathbf{o}(t) \qquad (4)$$

Apart from the use of high-pass filtered inputs, (4) is a common formulation for a fully connected recurrent inhibitory neural network used in BSS (Herault and Jutten 1986; Jutten and Herault 1991; Cichocki et al. 1997). However, (4) represents an idealized system, for which the outputs may be instantaneously computed from the inputs so long as the matrix [I + W]⁻¹ exists (I being the identity matrix). This system of equations can be singular but, quite unlike any realistic recurrent neuronal network, cannot be temporally unstable. The use of (3) for temporal dynamics instead of the more common, seemingly quite reasonable approximation of (4) allows for modeling of the temporal instability of recurrent neuronal networks—which, as shown below, is crucial to understanding their function—while still allowing network temporal evolution to be approximated by (4) when required to make theoretical analysis tractable and relate the present network to previous studies.

As detailed in a companion paper (Northcutt et al. 2017), the temporal stability of linear systems such as the one described by (3) has long been well understood (Trentelman et al. 2012), and stability of such a network may be maintained by simply requiring that the magnitude of all eigenvalues of the weight matrix W be less than unity.
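This stability criterion is easy to check numerically. The toy demonstration below, with arbitrary two-neuron weight matrices of our own choosing, iterates (3) under constant input and shows convergence when the criterion holds and divergence when it does not:

```python
import numpy as np

def is_stable(W):
    """Temporal stability of o(t) = i'(t) - W @ o(t - dt): all eigenvalues
    of the weight matrix must have magnitude less than unity."""
    return bool(np.all(np.abs(np.linalg.eigvals(W)) < 1.0))

# Eigenvalues of these symmetric weight matrices are +/-0.8 and +/-1.2.
W_stable = np.array([[0.0, 0.8], [0.8, 0.0]])
W_unstable = np.array([[0.0, 1.2], [1.2, 0.0]])
for W in (W_stable, W_unstable):
    o = np.zeros(2)
    for _ in range(200):              # iterate eq. (3) with constant i' = [1, 1]
        o = np.ones(2) - W @ o
    print(is_stable(W), o)            # stable case converges; unstable diverges
```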

2.3 Network learning rule

The Hebbian-style network learning rule used to generate inhibitory weight matrices based on common temporal fluctuations of the inputs was modified from that of Cichocki et al. (1997). This spatially asymmetric multiplicative weight update rule was chosen to support the representation of neuronal activation described in Sect. 2.2—which does not explicitly represent neuronal action potentials—and to leverage existing theoretical work on neural network solutions to BSS problems, in awareness of the fact that the underlying biological basis of this learning is likely to be spike-timing-dependent plasticity (Markram et al. 1997), as discussed in Northcutt et al. (2017).

Weight matrices W were initialized to zero so that the state of the network was o(t) = i'(t), and thus network outputs were initially identical to high-pass filtered inputs. The network learning rule is formulated as

$$\frac{dW_{n,k}}{dt} = \gamma \cdot \mu(t) \cdot g(o'_n) \cdot f(o'_k) \qquad (5)$$

where n ≠ k are neuron indices and γ is a scalar learning rate. Diagonal elements of the weight matrix always remained at zero, preventing self-inhibition. Any element of W that became negative after a learning rule update was set to zero, thereby enforcing that network weights were strictly inhibitory.

μ(t) is a learning onset function that has a value of zero at the start of training and rises asymptotically to unity with a time constant of 2 s. Our formulation of μ(t)—quite unlike the identically named function used by Cichocki et al.—is used to gradually turn on the learning rule at the start of training to avoid a powerful transient in weights based solely on the initial phase of the inputs.

o'_n(t) and o'_k(t), respectively, indicate first-order temporal high-pass filtered versions of network outputs n and k with time constant τ_HO = 0.5 s. Note that the use of high-pass filtered outputs in the learning rule, rather than simply the outputs, makes the network's learning dependent on temporal fluctuations of the inputs: specifically, those fluctuations with temporal frequencies greater than the cutoff frequency of the output high-pass filter.

We used network "activation functions" f(x) = x³ and g(x) = tanh(πx) similar to those used by previous authors (Jutten and Herault 1991; Cichocki et al. 1997) to introduce higher-order statistics of the filtered outputs into the learning rule (Hyvärinen and Oja 1998), although the positions of these two functions with respect to rows n and columns k of the weight matrix are exchanged in (5) as compared to the conventional BSS learning rule. This exchange is crucial to the function of our model, and both the role of these activation functions and the requirement that they be exchanged for our model are addressed in detail in Sect. 5.3.5.

For reasons that will become apparent later, we will refer to the learning rule with activation functions in their conventional positions as the cooperative learning rule. In contrast, we will refer to the learning rule of (5) with exchanged activation functions as the competitive learning rule.
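For concreteness, a single Euler step of the learning rule of (5), including the zero diagonal and the clipping of negative weights, might be sketched as follows; the `competitive` flag and the data layout are conventions of this sketch only:

```python
import numpy as np

GAMMA = 0.5                        # learning rate (gamma_2 for the second stage)
f = lambda x: x**3                 # expansive activation function
g = lambda x: np.tanh(np.pi * x)   # compressive activation function

def learning_step(W, o_hp, mu, dt, competitive=True):
    """One Euler step of eq. (5). o_hp is the high-pass filtered output
    vector o'(t); mu is the learning onset function mu(t).
    competitive=True places g() on rows n and f() on columns k as in (5);
    False gives the conventional cooperative rule with the two exchanged."""
    if competitive:
        dW = GAMMA * mu * np.outer(g(o_hp), f(o_hp))  # dW[n,k] = g(o'_n) f(o'_k)
    else:
        dW = GAMMA * mu * np.outer(f(o_hp), g(o_hp))  # conventional BSS rule
    W = W + dW * dt
    np.fill_diagonal(W, 0.0)       # no self-inhibition
    return np.maximum(W, 0.0)      # weights are strictly inhibitory
```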

2.4 Training of the model

Before training of any network began, a visual input was presented and all linear filters and the input adaptive scaling algorithm were allowed to reach steady state to eliminate artifactual startup transients.

Training began with each of the three first-stage networks using a learning rate of γ₁ = 50. This training resulted in the first-stage motion, orientation, and color networks, respectively, learning weight matrices M, O, and C. To rapidly give the first-stage networks sufficient experience to refine each visual submodality, an artificial visual stimulus (detailed in Northcutt et al. 2017) was presented that provided simultaneous temporal fluctuations in all colors, orientations, and directions of motion. This stimulus provided near-identical signals to each input of every network, effectively reducing the learning rule of (5) to a purely Hebbian one (Hebb 1949). This input resulted in uniform symmetric (zero-diagonal) weight matrices, indicating uniform lateral inhibition: a well-known technique for sensory refinement (Linster and Smith 1997). However, with this symmetric stimulus, weight matrices never converge to a stable state, but rather increase in value as long as training continues. For this reason, learning for each first-stage network was terminated by setting its respective learning rate γ₁ to zero when the maximum magnitude of any eigenvalue of the network weight matrix reached a value of V_{1,max} = 0.9. This procedure allowed us to rapidly learn strong lateral inhibition in the first stage while avoiding temporally unstable recurrent networks.

During this first-stage training period, the learning rate γ₂ for the second stage was set to zero. While not strictly necessary, this isolation of the two stages allowed for rapid training of the first-stage network and a clear demonstration of second-stage function.

After first-stage learning was complete, the second-stage learning rate was set to γ₂ = 0.5, after which visual stimuli composed of multiple objects and intended to demonstrate visual binding were presented, as shown in the next section. In this second phase of training, the second-stage network learned an inhibitory connection matrix T indicating the binding among visual submodalities.

To avoid temporal instability of the second-stage network, if any weight matrix update resulted in T having an eigenvalue with a magnitude V greater than V_{2,max} = 0.95, the weight matrix was multiplied by a scalar factor V_{2,max}/V, thus holding the maximum eigenvalue at V_{2,max} and maintaining network temporal stability.
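A minimal sketch of this eigenvalue guard:

```python
import numpy as np

V2_MAX = 0.95   # maximum allowed eigenvalue magnitude for T

def rescale_if_unstable(T, v_max=V2_MAX):
    """After each weight update, rescale T by v_max/V whenever its largest
    eigenvalue magnitude V exceeds v_max, keeping the network stable."""
    V = np.abs(np.linalg.eigvals(T)).max()
    return T * (v_max / V) if V > v_max else T
```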

3 An example of visual binding

To demonstrate the operation of the visual binding network, an artificial visual stimulus composed of moving 50 × 12-pixel bars on a black background was presented. This stimulus consisted of a red bar that started near the upper left corner of the image and moved down and right at −30°, and a green bar that started near the upper right and moved down and left at 210°. Both bars were oriented with their longest axis orthogonal to the direction of motion and moved at 50 pixels per second.

The bars moved through a fixed pattern of multiplicative horizontal sinusoidal shadow with a spatial period of 50 pixels, a mean value of 0.5, and an amplitude of 0.25. As the bars moved independently through this pattern of shadows, the features of each bar fluctuated together, allowing the model to learn their individual characteristics. Bars wrapped around toroidally to re-enter the image, thus putting no time limitation on network training.

Figure 3a and c, respectively, shows the time course of network outputs and the final weight matrix when the network was trained for 15 s with this visual stimulus using the competitive learning rule of (5). For comparison, Fig. 3b and d shows the same data using the cooperative learning rule, in which the activation functions f() and g() of (5) are placed in their conventional position in the BSS literature (Cichocki et al. 1997), with the expansive function f() then applying to row elements, and the compressive function g() to column elements.

The results of Fig. 3a and c using the competitive learning rule correspond to the desired operation of the visual binding network. The red and green output neurons clearly come to dominate all others, and (neglecting very small weights) the columns of the final weight matrix correctly indicate the characteristics of the individual objects which comprised the stimulus. Reading the weights from the "red" column, the bar moved to the right, and down to a lesser extent, and got a roughly equal response from the 0° and 120° orientation filters, indicating an approximate orientation of −30° or equivalently 150°. From the "green" column, the bar moved to the left, downward to a lesser extent, and had an orientation of approximately 30° or equivalently 210°.

In stark contrast, the results shown in Fig. 3b and d reveal clearly that the cooperative learning rule is not applicable to the visual binding model. The reasons for this are detailed in Sect. 5.3.5.

This single example suffices to illustrate the methods of weight matrix interpretation and network functional analysis presented below, but many further experiments are described and full details given in a companion paper (Northcutt et al. 2017).

4 Extracting object-level information

While we have presented and discussed the second-stage network so far as if it learned the features of a static set of objects and then stabilized, the temporal dynamics of learning are in general far more complicated. After initial training of the first stage, the learning of the second stage never stops. This is desirable because it allows the network to continuously adapt to dynamic visual scenery.



Fig. 3 Outputs and weights of the second-stage network as it trained with a visual stimulus comprised of two bars moving through sinusoidal shadow. A legend to identify each trace in panels a and b is shown at upper left. a Network outputs when using the competitive learning rule of (5), in which the activation functions f() and g() are switched in position relative to conventional learning rules for BSS, and thus emphasize patterns of column over row weights. Note that over the time of training, the red and green outputs come to inhibit all others. b Network outputs when using the conventional cooperative learning rule, which emphasizes patterns of row over column weights. No pattern of outputs is evident apart from nearly uniform inhibition. c Final weight matrix T using the competitive learning rule of (5). Brighter colors indicate stronger inhibition. The strongest weights are in the red and green columns, and clearly indicate the features of each bar. d Final weight matrix T using the cooperative learning rule


However, should a set of objects in the visual scene persist sufficiently long, the matrix T will converge to a particular set of weights representing the characteristics of the objects, and the number of outputs o(t) that are significantly nonzero will come to correspond to the number of objects.

The convergence of second-stage learning for a given visual stimulus—and thus the validity of what has been learned—may most simply be determined at the current time t by requiring that the sum of all absolute weight matrix changes over a recent period of time τ_s decline below a threshold S_thr:

$$\int_{t-\tau_s}^{t} \sum_{n=1}^{N} \sum_{k=1}^{N} \left| \frac{dT_{n,k}}{dt} \right| dt < S_{thr} \qquad (6)$$
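Numerically, this criterion can be approximated by retaining the recent weight derivative matrices and summing, as in the following sketch (the history mechanism is an assumption of the sketch):

```python
import numpy as np

def has_converged(dT_history, dt, s_thr):
    """Eq. (6): approximate the integral of summed absolute weight changes
    over the recent window tau_s (dT_history: sequence of dT/dt matrices
    sampled every dt over that window) and compare against S_thr."""
    total = sum(np.abs(dT).sum() for dT in dT_history) * dt
    return total < s_thr
```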

Once the weight matrix has stabilized, the representation of objects in the image developed by the visual binding network is implicit in the activity of the outputs o(t) and the connection matrix T.

4.1 The number of objects and their features

This representation may easily be made more explicit for human interpretation, both making network operation easier to understand and giving practical utility to the model.

The weight matrix T may be simplified by normalizing it to its maximum value over all rows and columns and then removing weights less than a given threshold, which may be expressed as

$$\mathbf{T}^{norm} = \mathbf{T} / \max_{n,k=1 \ldots N} T_{n,k} \qquad (7)$$

$$T^{simp}_{n,k} = \begin{cases} T^{norm}_{n,k} & T^{norm}_{n,k} \ge v_{min} \\ 0 & T^{norm}_{n,k} < v_{min} \end{cases} \qquad (8)$$

We used a threshold value of v_min = 0.33 (1/3 of the maximum weight value). An example of this simplified weight matrix, generated from the raw weight matrix shown in Fig. 3c, is shown in Fig. 4a. Here the column-oriented pattern of weights is even more evident, and the relative strength of each feature may be read out directly from the matrix.

In fact, it is possible to make the representation of objects and their features even more explicit. By summing T^simp vertically, a ten-element row vector t^sum results:

$$t^{sum}_k = \sum_{n=1}^{N} T^{simp}_{n,k} \qquad k = 1 \ldots 10 \qquad (9)$$

Each element t^sum_k represents the sum of all inhibitory weights to other neurons from any neuron k.

Using this information, we can create an object matrix O that explicitly describes the number of objects and their features. Every neuron k for which the entry t^sum_k is above a threshold value (for which we used 0.6) is concluded to have accumulated sufficient inhibitory weight to represent an object. For each such neuron k, the weights from T^simp for column k may be used to create a new row i of the object matrix O as

$$O_{i,j} = \begin{cases} T^{simp}_{j,k} & j \ne k \\ 1 & j = k \end{cases} \qquad j = 1 \ldots 10 \qquad (10)$$

$$r_i = k \qquad (11)$$


where the vector r is used to store the index of the output neuron corresponding to each object matrix row i, and the zero diagonal element of T^simp is replaced with unity to represent the fact that this row of weights in the object matrix originated from neuron k (see Sect. 5.5.1 for a theoretical justification). An example of the resulting object matrix O is shown in Fig. 4b: the number of rows of O corresponds to the number of objects, the columns to each individual visual feature, and the weights in each column to the learned strength of each characteristic feature. In this case, the object matrix O accurately represents the motion, orientation, and color of the two objects in the example visual stimulus.

Fig. 4 Extraction of object-level information. a Simplified visual binding matrix T^simp, from which object features can clearly be read. b Feature matrix O extracted from T^simp using (10). Each row of the object matrix corresponds to a unique object. The associated vector r = [8 9] indicates that the first row of the object matrix corresponds to output 8 (red) and the second row to output 9 (green)
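Equations (7) through (11) translate directly into a few lines of array code. The sketch below is a condensation under the thresholds given in the text; note that it uses zero-based indices, whereas the vector r = [8 9] above is one-based:

```python
import numpy as np

V_MIN = 0.33       # weight threshold of eq. (8)
T_SUM_THR = 0.6    # column-sum threshold for declaring an object

def extract_objects(T):
    """Eqs. (7)-(11): simplify the binding matrix, find columns whose summed
    inhibitory weight marks an object, and build the object matrix O."""
    T_norm = T / max(T.max(), 1e-12)                  # eq. (7)
    T_simp = np.where(T_norm >= V_MIN, T_norm, 0.0)   # eq. (8)
    t_sum = T_simp.sum(axis=0)                        # eq. (9): column sums
    r = np.flatnonzero(t_sum > T_SUM_THR)             # output indices of objects
    O = T_simp[:, r].T                                # eq. (10): columns as rows
    O[np.arange(len(r)), r] = 1.0                     # diagonal replaced by unity
    return O, r                                       # r as in eq. (11)
```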

4.2 Elementary visual attention

As can be seen from the example data shown in Fig. 3a, the mutually inhibitory second-stage outputs corresponding to objects in the visual scene fluctuate over time, with each having greater value in proportion to fluctuations of the characteristics of the represented object, measured in terms of the visual features input to the network. This overall measure of feature strength is closely related to computational models of visual saliency (Itti and Koch 2001), and so the model might reasonably be said to be switching its visual attention (Itti et al. 1998) from one object to another as their relative saliency changes. In fact, visual attention is often modeled as a winner-take-all phenomenon (Lee et al. 1999), which is quite akin to the mutually inhibitory competition of the second-stage network.

The neuron that has the largest output value at any given time corresponds to the most salient object. The movement of this "attentional spotlight" from one object to another can be used to emphasize the currently attended object in the visual image and simultaneously de-emphasize unattended objects. We may synthesize such an "enhanced image" by recombining the feature matrices created in Fig. 2 for every input image using the row of weights from the object matrix O corresponding to the currently winning neuron.

At any given time, let network output o_k(t) currently be the largest. Row i such that r_i = k of the object matrix O represents the visual features of this output. To make the raw feature matrices comparable in magnitude to one another, we make use of the input adaptive group normalization factors already computed and described in Sect. 2.1, which may be composed into a single ten-element vector

$$\mathbf{n}(t) = [\, n_m \;\; n_m \;\; n_m \;\; n_m \;\; n_o \;\; n_o \;\; n_o \;\; n_c \;\; n_c \;\; n_c \,]$$

We may then combine this vector of normalization factors with the feature weights associated with this object and the neuron output to create a vector f(t) of weights to be applied to each feature matrix

$$f_j(t) = \frac{|o_k(t)| \cdot O_{i,j}}{n_j(t)} \qquad j = 1 \ldots 10 \qquad (12)$$

where the absolute value is used so that large values of the output o_k(t), regardless of sign, result in large contributions to the enhanced image. The vector f(t) may then be used to create a linear recombination of the feature matrices computed from the current input image as shown in Fig. 5, resulting in a normalized RGB mask identifying where salient features exist in the current image. By multiplying this mask with the current input image, an enhanced image is created in which the characteristics associated with the object represented by output k are emphasized while effectively de-emphasizing other objects.
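The following sketch outlines the weighting of (12) and the mask recombination of Fig. 5, assuming the ten feature images are stacked along the first axis and have already been high-pass filtered and rectified; these shape conventions are assumptions of the sketch:

```python
import numpy as np

def feature_weights(o, O, r, n):
    """Eq. (12): weights f_j(t) for the currently most salient object.
    o: network outputs; O, r: object matrix and its output indices from
    extract_objects(); n: ten-element group normalization vector n(t).
    Assumes the winning output neuron is one of the object rows."""
    k = np.argmax(np.abs(o))          # winning (most salient) output neuron
    i = np.flatnonzero(r == k)[0]     # object-matrix row for that neuron
    return np.abs(o[k]) * O[i] / n

def enhanced_image(image, feature_imgs, fw):
    """Weight the ten feature images, sum into an RGB mask, normalize,
    and multiply with the input image (Fig. 5). image: (H, W, 3);
    feature_imgs: (10, H, W), already high-pass filtered and rectified."""
    mask = np.zeros(image.shape)
    for j in range(7):                             # motion and orientation
        mask += fw[j] * feature_imgs[j][..., None]  # drive all color planes
    for c, j in enumerate(range(7, 10)):           # red, green, blue features
        mask[..., c] += fw[j] * feature_imgs[j]     # drive only their own plane
    mask /= max(mask.max(), 1e-12)                 # NORM over pixels and planes
    return image * mask                            # point-by-point masking
```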

Figure 6 presents a demonstration of the algorithm to enhance the most salient object in input images using our example two-bar visual stimulus, the network outputs and final weight matrix of which are shown in Fig. 3a and c.


Fig. 5 Computational diagram for creating an enhanced image. Thin lines indicate scalar quantities; thick lines represent matrices. Starting at figure left, each of the ten 2D feature images created in Fig. 2 is first processed through a first-order high-pass filter (HPF) using the same time constant τ_HI as was used for network inputs, and the absolute value of each matrix taken (ABS) so that both increases and decreases in features are represented in the output. Each of the resulting filtered matrices is then multiplied (Π) by the corresponding scalar weight f_j from (12). These weighted feature images are summed point-by-point (Σ) into a single RGB image, after which this image is normalized (NORM) by a scalar value corresponding to its maximum over all pixels and color planes. This results in an RGB "mask" that identifies where salient features of the input image exist. This mask is then multiplied point-by-point (Π) with the input RGB image, resulting in a 2D RGB enhanced image that emphasizes the most salient object. Note that motion and orientation feature images contribute equally to red, green, and blue color planes, but red, green, and blue feature images contribute only to their own color plane

Comparing the input and enhanced images of Fig. 6a and b, taken at 6.5 s into training, the red bar is clearly stronger relative to the green in the enhanced image than in the input. To quantify this effect, the ratio of the maximum red value to the maximum green value in the input image is approximately unity, but in the enhanced image, the maximum red value is 6.2 times stronger than that of green. In contrast, comparing the input and enhanced images shown in Fig. 6c and d, taken at 7.2 s into training, the green bar is enhanced relative to the red; again red and green have nearly the same maximum value in the input image, but green is 4.1 times stronger in the enhanced image.


Fig. 6 Demonstration of enhancement of the most salient object. Each of the four panels shows a 160 × 160-pixel region cropped from the center of a 500 × 500 image, taken at times when the two moving bars passed near one another. a The input image at 6.5 s after the beginning of second-stage training, a time at which the red bar was the most salient (refer to red and green neuron outputs in Fig. 3a). The role of shadowing in saliency is evident here: the red bar is strongly visible in the image, almost maximally out of shadow, whereas the green bar is almost half shadowed. b The enhanced image resulting from the algorithm of Fig. 5, emphasizing the red bar while the green bar is barely visible. c The input image at 7.2 s into second-stage training, a time at which the green object was most salient (refer to Fig. 3a). d The resulting enhanced image, emphasizing the green bar while the red bar is barely visible


Looking more closely at Fig. 6, the effect is not that the characteristics associated with the largest neuron output, representing the most salient object, are greatly increased in the enhanced image; rather, the characteristics of all other objects, even background objects (due to the high-pass filtering of the feature matrices), are weakened. It is notable that this is very reminiscent of attentional effects observed in primate visual cortex (Moran and Desimone 1985): responses to attended objects are not increased by attention, but rather responses to unattended objects are relatively weakened.

This algorithm provides a bottom-up method for automatically attending to the most salient object in an image sequence over time without needing any prior information about objects or their features.

5 Function and limitations of the model

We have presented one example set of experimental results above, and a variety of experiments in a companion paper (Northcutt et al. 2017), practically demonstrating the utility of this two-stage neural network model for sensory refinement, visual binding, internal object-level representation, and even rudimentary visual attention. However, it is impossible to present an exhaustive set of visual stimuli. Under what conditions does this subtly complex neural network model actually perform visual binding, how is this accomplished, and what are its limitations? Intriguingly, the same recurrent neural network subunit is used four times for two distinct purposes in this model.

The first-stage networks (refer to Fig. 1) take as their input feature sensors which may have significant overlap in their particular visual submodality, and the output signals show a significant reduction in overlap of the input sensors: for example, one of the first-stage networks takes inputs which are broadly orientation-tuned and refines them using lateral inhibition into outputs with narrower orientation tuning.

In the first stage, we have applied this same recurrent network to three specific visual submodalities: color, orientation, and motion. Balanced mutual inhibition is a naturally stable state of a fully connected recurrent inhibitory neural network, reached automatically by the learning rule when given an appropriately balanced set of visual stimuli. In fact, our training of the first stage is specifically designed to elicit lateral inhibition.

Lateral inhibition has long been understood to provide refinement of sensory inputs at many levels, its effects having been studied as early as the mid-19th century (Mach 1866), and a number of sensory systems have been proposed to function using this mechanism, including vision (Blakemore and Tobin 1972), audition (Shamma 1985), and olfaction (Linster and Smith 1997).

While sensory refinement via lateral inhibition may well be a useful function in many neural systems, does the first stage contribute anything to the operation of the visual binding network? In fact, we have shown experimentally in a companion paper (Northcutt et al. 2017) that strong mutual inhibition in the first stage is essential to the rapid learning demonstrated in Fig. 3a. Without strong first-stage inhibition, second-stage learning proceeded very slowly, and network weights would have been very unlikely to ever reach the stable state shown in Fig. 3c.


While first-stage refinement of selectivity in motion direction, orientation, and color inputs undoubtedly aids the visual binding network in distinguishing objects, the first stage's most essential function for second-stage learning arises from a more subtle effect. The strong inhibition between first-stage outputs creates a tendency for them to become mutually exclusive, much like a weak winner-take-all function. Thus, when an object passes out of shadow and becomes more salient, the first-stage network aids in causing all of its visual attributes to increase simultaneously by inhibiting weaker attributes in each visual submodality. Since the learning rule of (5) develops inhibitory weights based on simultaneous increases or decreases of visual attributes, the synchronization of visual features from the same object caused by first-stage inhibition greatly speeds the learning of second-stage network weights.

Although the network structure, temporal evolution, and learning rule of the second-stage network (refer to Fig. 1) are the same as the three first-stage networks, the operation of the second stage is much more obscure. The second-stage network learns by associating input signals which are temporally correlated, just as the first-stage networks do. However, as has been proposed for mammalian cortex (Douglas et al. 1989), it is the variety of dissimilar inputs to this network, not its internal structure, that makes its function differ from the first stage. In the present work, these signals are full-field spatial summations of elementary visual features. Based on the results we have shown, the empirical result of this stage's operation is to detect signals which originate from each independently fluctuating object in the input and to quantify how strongly that object expresses each visual feature, thereby providing a solution to the visual binding problem.

5.1 Visual binding as blind source separation

The problem that the second-stage network must solve is really that of blind source separation. The canonical example of BSS is typically a linear mixture of auditory sources: several spatially separated microphones listening to a mixture of spatially distinct independent audio sources can be used to recover the individual audio sources. The use of recurrent neural networks for BSS is a very well studied area, with much of the research stemming from the seminal work of Herault and Jutten (1986).

Mathematically, the problem of BSS can be stated as follows. Consider a column vector of N time-varying sources s(t). These sources are combined in an unknown (but static) linear mixture described by an N × N mixing matrix M to provide a vector of N mixed inputs

$$\mathbf{i}(t) = \mathbf{M} \cdot \mathbf{s}(t) \qquad (13)$$

The challenge of BSS, as it is generally considered, is to recover the "hidden" sources s(t) given only the mixed observations i(t). The mixing matrix M is recovered implicitly in this process but often is of little interest: for example, in the auditory case, M would describe the relative microphone locations, which are usually known. Rather, it is most commonly the hidden source signals s(t) that contain the desired information.

Given the inputs i(t), the fully connected inhibitory neural network used in this paper—as described by the instantaneous update rule of (4), and neglecting the high-pass filter on the inputs—has been shown to converge (given a proper learning rule, and under certain conditions, addressed below) to a state in which the outputs o(t) become scaled versions of the hidden sources s(t) (Jutten and Herault 1991), thus revealing each source separately.

The technique we have developed for processing 2D images into a small number of highly meaningful scalar signals, shown in Fig. 2, allows the problem of visual binding to be reduced to one of BSS. Each of the full-resolution visual feature images is summed into a single time-varying signal which is a linear instantaneous summation of the visual feature contributions of all objects in the scene.

In the case of visual binding—quite opposite to that of auditory BSS—we are not particularly interested in recovering the hidden sources s(t). These "sources" correspond to temporal fluctuations of the visual feature signals from a given object caused, for example, by changes in lighting, movement through occlusions, or changes in distance, and are of little practical value. Rather, we are interested in determining how many independent sources exist in the scene and with which visual features they correspond. This implies that, in visual binding, the mixing matrix M is of primary interest because it reveals the features of objects in the scene and can be used to enumerate them.

5.2 Assumptions of the visual binding model

For any BSS problem, the number of distinct hidden sources and inputs and the details of their mixture determine whether that problem can be solved. After all, solving the BSS problem is not possible in every case: consider separating many auditory sources given microphones all located at the same spatial location! An analysis of the requirements for the isolated second-stage BSS network to function properly is given in the next section. However, due to the vastly greater complexity of the visual binding problem—which takes a sequence of 2D RGB images rather than a time-varying vector of scalar signals as input—the additional assumptions required for our model to perform properly include the following, all of which we assert are reasonable in virtually any practical situation.


1. The visual scene is dynamic. In order for the visual binding model to detect and differentiate objects from the background and other objects, the visual scene must change, as observed in the feature space measured by the inputs to the network. This is in direct contrast to visual BSS models proposed by other authors that are designed for separating mixtures of static images (Jutten and Herault 1991; Guo and Garland 2006). Instead, our model takes as input global spatial summations of local feature detector circuits and relies on changes over time in the scene to produce temporal fluctuations in these signals.

Due to the temporal high-pass filters in the input pathway (3) and in the learning rule (5), there would be no input to the network and no changes to the weight matrix of this system in response to static images, and thus no learning. This assumption implies an inherent interest in novelty detection, ignoring static background features and making the network "prefer" objects which are more "active" in the visual scene. The assumption of a dynamic visual scene is hardly restrictive; in fact, this is unavoidable in almost every practical situation.

Specifically, to be useful for binding, the temporal frequency of fluctuations in the features of an object must not be lower than the cutoff frequency set by the high-pass filter time constant τ_HO used in the learning rule (5), or they will be filtered out. This time constant, of course, can be adjusted to fit the rate of change of the features of an object of interest.

Related to this assumption is a subtle dependence on spatial image resolution. Since the inputs to our network are scalar global spatial sums, it is not obvious what 2D spatial resolution of input images is required to support visual binding. In the limit of very tiny images, no matter what the visual features are, they will have little or no temporal fluctuation due to the fact that there are not enough image pixels. As image size increases, feature fluctuations will become smoother, and thus the quality of features provided to the network will increase. However, this quality will saturate as the image resolution becomes sufficient to clearly visualize objects in a given scene. The optimal spatial resolution will be highly dependent on the scene in which objects are to be observed.

2. Visual features of an object are correlated. For the visual binding network to function properly, visual features originating from the same object, which then become inputs to the network, must have temporal correlation with each other. This requirement simply means that visual features of any given object measured by the network (in our case, motion, color, and orientation) must vary together over time, as they would in any natural situation.

A wide variety of common scene changes satisfy this assumption. Occlusion of an object by another visual scene element causes a rapid decrease in all local visual features of the occluded object, and therefore a reduction in the wide-field summation of these signals; there is a corresponding increase if the object then reappears. Similarly, when an object passes in and out of shadows, or becomes nearer to or more distant from a light source, the visual measures of the object will vary together in proportion to the resulting brightness fluctuations. Also, as an object draws nearer to the camera, the object size increases in the visual image. Since a closer object stimulates a greater number of local feature detection circuits, its wide-field signals grow in inverse proportion to its distance.

3. Fluctuations from different objects are distinct. Herault and Jutten's original work on BSS (Herault and Jutten 1986) inspired many related methods: independent component analysis (Comon 1994), information maximization (Bell and Sejnowski 1995), and mutual information minimization (Yang and Si 1997). The majority of these methods require statistical independence for sources to be separated.

However, we observe that while statistically independent sources can be guaranteed to separate, this assumption is not strictly necessary. For example, as we have shown in Fig. 3, two sinusoidal "hidden sources" with the same frequency but different phase can be separated, despite not being statistically independent. These sources are nonetheless sufficiently distinct to be separable.

Implicit in the requirement of statistical independence is that visual feature fluctuations are stochastic. This will generally be true in practical situations, due to the fact that these fluctuations derive from objects moving through shadows, past occlusions, or moving unpredictably with respect to the camera. The assumption that fluctuations from different objects will be statistically independent is also reasonable in practical cases: visual features from different objects will be independent simply because they result from independent physical processes in the world.

However, the network does not simply fail if this assumption is not satisfied. If the features of two objects are neither statistically independent nor even distinct, the visual binding network will bind these signals together to represent a single object, which in perspective is not an unreasonable conclusion given a set of visual features that are highly correlated with one another.

4. Object features are persistent. If a given visual feature (the color of an object, for example) is to be used in binding, an object must retain that feature over a period of time sufficient for the network to learn a pattern of weights based upon it. While the network relies on temporal fluctuations of these features for learning, we must assume that at least some subset of an object's features (chosen from the color, motion, and orientation submodalities for the present network) are relatively persistent.

For example, the motion feature inputs from a car driving rapidly in a circle, and thus constantly changing direction, would not be useful in solving the visual binding problem, although its color would likely remain constant as it turned and would be useful in binding.

The upper limit on the speed at which visual features of an object may change is set by the learning rate γ in the learning rule of (5). Features of a given object that appear and disappear so quickly that the integration of their effect over time never creates a significant change in the weight matrix T will not affect the output. Within the bounds of numerical stability, the learning rate can be adjusted to fit the time course of feature changes for a given visual stimulus.

5. Measured visual features must be diverse. In order to separate out a variety of different objects, the visual features measured by the model must be diverse and span a range of visual submodalities. Preferably, the feature inputs should span the full space of interest in each visual submodality. An example of this diversity is the feature set we have shown in Fig. 2, fully spanning motion, color, and orientation.

If the visual features measured are not sufficient to separate the objects of interest (for an extreme example, imagine that all ten feature detectors were only sensitive to the amount of red color in the image), in general, visual binding will not be possible.

5.3 Analysis of a two-neuron BSS network

Given the assumptions of the previous section, we can reduce the visual binding problem solved by the model as a whole to one of blind source separation in the second stage. The network output and inhibitory weight temporal dynamics of the second-stage network that we have used for visual binding in the model are surprisingly complex, once equipped with the time-stepping neuron update rule of (3) and the learning rule of (5), both of which are potentially unstable. Outside the context of visual binding, the generalized analysis of an N-neuron BSS network with N(N − 1) inhibitory weights is highly formalized and unrevealing and has already been well addressed in the literature (Jutten and Herault 1991; Sorouchyari 1991; Joho et al. 2000).

However, the second stage is sufficiently complex that even a two-neuron network with only two inhibitory weights can generate extremely unexpected results, and an analysis of this minimal network better serves to clarify how the second stage of our model works, and how our novel changes to the network update and learning rule affect the model's performance. For this reason, we base our theoretical analysis of the second stage around a two-neuron network for which all solutions can be simply enumerated, and generalize our results to the full network wherever possible. We follow with demonstrations of the function of this two-neuron network that can be clearly understood in terms of the analysis shown, and make conclusions about the full ten-neuron second-stage network thereafter.

Our analysis begins by enumerating the possible "correct" solutions to the BSS problem for a two-neuron network without any learning, using the instantaneous network update rule of (4), which for very small simulation time steps approximates the far less analytically tractable time-stepping update rule actually used in the model. For simplicity, in this section we assume that all sources (and thus inputs) have temporal frequencies sufficiently higher than the cutoff frequency of the input high-pass filter, so that i′(t) = i(t).

5.3.1 The typical case

The "typical case" in BSS is one in which the number of distinct, nonzero hidden sources is the same as the number of network inputs and outputs. In this case, as will be shown, there is no single correct solution but rather a family of closely related solutions.

For the two-neuron network, the set of scalar equations described by (13) is

$$i_1(t) = M_{1,1} \cdot s_1(t) + M_{1,2} \cdot s_2(t) \qquad (14)$$
$$i_2(t) = M_{2,1} \cdot s_1(t) + M_{2,2} \cdot s_2(t) \qquad (15)$$

The temporal evolution equation of (4) becomes

$$o_1(t) = i_1(t) - T_{1,2} \cdot o_2(t) \qquad (16)$$
$$o_2(t) = i_2(t) - T_{2,1} \cdot o_1(t) \qquad (17)$$

into which we can substitute (14) and (15) to get

$$o_1(t) = M_{1,1} \cdot s_1(t) + M_{1,2} \cdot s_2(t) - T_{1,2} \cdot o_2(t) \qquad (18)$$
$$o_2(t) = M_{2,1} \cdot s_1(t) + M_{2,2} \cdot s_2(t) - T_{2,1} \cdot o_1(t) \qquad (19)$$

For the blind source separation problem to be solved, the outputs o(t) must become equal to scaled versions of the hidden sources s(t) (Jutten and Herault 1991). As long as the mixing matrix M is nonsingular, there are exactly two possible solutions that can be learned in the inhibitory connection matrix T.

As previous authors have shown (and as may easily be derived from the equations above), the first situation in which correct source separation occurs is when source indices are not permuted with respect to the outputs:

$$o_1(t) = M_{1,1} \cdot s_1(t) \qquad (20)$$
$$o_2(t) = M_{2,2} \cdot s_2(t) \qquad (21)$$

which results in the connection matrix

$$\mathbf{T} = \begin{bmatrix} 0 & \frac{M_{1,2}}{M_{2,2}} \\ \frac{M_{2,1}}{M_{1,1}} & 0 \end{bmatrix} \qquad (22)$$


The only other possible solution occurs when the source indices have exchanged with respect to the outputs:

$$o_1(t) = M_{1,2} \cdot s_2(t) \qquad (23)$$
$$o_2(t) = M_{2,1} \cdot s_1(t) \qquad (24)$$

and results in the connection matrix

$$\mathbf{T} = \begin{bmatrix} 0 & \frac{M_{1,1}}{M_{2,1}} \\ \frac{M_{2,2}}{M_{1,2}} & 0 \end{bmatrix} \qquad (25)$$

In general, for an N-element BSS network in this case, there are N! possible solutions, differing only in the permutation of the outputs o(t) relative to the hidden sources s(t). Note that, even for our modest ten-neuron second-stage network, this still allows for 3,628,800 possible solutions!
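The nonpermuted solution of (22) is simple to verify numerically; the mixing matrix values in this sketch are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))             # two hidden sources
M = np.array([[1.0, 0.3], [0.4, 1.0]])         # arbitrary mixing matrix
i = M @ s                                      # eq. (13)

# Nonpermuted connection matrix, eq. (22):
T = np.array([[0.0, M[0, 1] / M[1, 1]],
              [M[1, 0] / M[0, 0], 0.0]])
# Instantaneous network, eq. (4), rearranged: (I + T) o = i
o = np.linalg.solve(np.eye(2) + T, i)
print(np.allclose(o[0], M[0, 0] * s[0]))       # o1 = M11 * s1  -> True
print(np.allclose(o[1], M[1, 1] * s[1]))       # o2 = M22 * s2  -> True
```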

5.3.2 The overdetermined case

A more unusual situation in the BSS literature is the "overdetermined case" (Joho et al. 2000), in which fewer than N distinct sources are mixed to provide inputs to an N-element network. There are far fewer solutions in this case, and they differ from those of the typical case.

In our simplified two-neuron example, when s₂(t) = 0 (or equivalently M_{1,2} = M_{2,2} = 0) while s₁(t) is nonzero, achieving the nonpermuted solution of (20) and (21) can be accomplished by setting o₂(t) = 0. From (19), this requires

$$T_{2,1} \cdot o_1(t) = M_{2,1} \cdot s_1(t) + M_{2,2} \cdot s_2(t) \qquad (26)$$

and given that s₂(t) = 0 and that our goal is to make o₁(t) = M_{1,1} · s₁(t), T_{2,1} can be written directly as

$$T_{2,1} = \frac{M_{2,1}}{M_{1,1}} \qquad (27)$$

However, this leaves us with no constraint on T_{1,2}, since in the network temporal evolution rule it multiplies a signal that is zero. Therefore, our only requirement is that T_{1,2} ≥ 0 to prevent unintentional excitation, and the resulting connection matrix is

$$\mathbf{T} = \begin{bmatrix} 0 & T_{1,2} \\ \frac{M_{2,1}}{M_{1,1}} & 0 \end{bmatrix} \qquad (28)$$

where $T_{1,2} \geq 0$ (we will be able to place further constraints on $T_{1,2}$ in the next section). Another possible solution with this same source condition is to permute the sources with respect to the outputs as in (23) and (24), which results in the connection matrix

$$T = \begin{pmatrix} 0 & \dfrac{M_{1,1}}{M_{2,1}} \\ T_{2,1} & 0 \end{pmatrix} \qquad (29)$$

where $T_{2,1} \geq 0$ is largely unconstrained.

By symmetry, if instead the sources switched and $s_1(t) = 0$ (or equivalently $M_{1,1} = M_{2,1} = 0$) while $s_2(t)$ were nonzero, the required connection matrix for the nonpermuted case would be

$$T = \begin{pmatrix} 0 & \dfrac{M_{1,2}}{M_{2,2}} \\ T_{2,1} & 0 \end{pmatrix} \qquad (30)$$

where $T_{2,1} \geq 0$, and for the permuted case

$$T = \begin{pmatrix} 0 & T_{1,2} \\ \dfrac{M_{2,2}}{M_{1,2}} & 0 \end{pmatrix} \qquad (31)$$

where $T_{1,2} \geq 0$.

In general, ignoring the indeterminate (and irrelevant) matrix values, for an N-element BSS network in the overdetermined case with only $m < N$ nonzero sources, there are

$$\left(\frac{N!}{(N-m)!}\right)^{2}$$

possible solutions, corresponding to all possible permutations of the m nonzero outputs with any given pattern of m sources, and all permutations of those m sources with respect to the inputs.

While still not trivially small, this is vastly fewer solutions than for the typical case, by a factor of $((N-m)!)^{2}/N!$. For our ten-neuron second-stage network, if only two distinct sources are presented, the number of possible solutions is reduced to a mere 8100.
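Both counts are easy to verify numerically, as the sketch below does for our N = 10 network with m = 2 sources; it also demonstrates, for the two-neuron case of (28) with hypothetical mixing values, that the unconstrained weight $T_{1,2}$ has no effect on the separated outputs.

```python
import math
import numpy as np

# Solution count for the overdetermined case: (N! / (N-m)!)^2
N, m = 10, 2
print(math.perm(N, m) ** 2)   # 8100, versus 10! = 3,628,800 in the typical case

# Overdetermined two-neuron example: s2 = 0, so i = (M11*s1, M21*s1)
M11, M21 = 1.0, 0.6           # hypothetical mixing values
s1 = 0.9
i = np.array([M11 * s1, M21 * s1])

# Connection matrix of (28); T12 is unconstrained, so try several values
for T12 in (0.0, 0.5, 1.2):
    T = np.array([[0.0, T12], [M21 / M11, 0.0]])
    o = np.linalg.solve(np.eye(2) + T, i)
    print(T12, o)             # o2 = 0 and o1 = M11*s1 in every case
```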

5.3.3 Stability of network temporal evolution

The values of the connection matrix T given above represent solutions to the BSS problem in a two-neuron network, but were derived using an approximate temporal evolution rule. Will a recurrent network with this connection matrix using the time-stepping update rule of (3) be stable? We have earlier addressed conditions for stability of this temporal evolution rule and concluded that stability simply requires that all eigenvalues of the connection matrix T have magnitude less than unity.

For the specific values of T given in the “typical case” above, it is possible to compute the eigenvalues directly from the characteristic polynomial, and thus the conditions for stability of the connection matrices of (22) and (25) can be shown, respectively, to be

$$M_{1,2} \cdot M_{2,1} < M_{1,1} \cdot M_{2,2} \qquad (32)$$
$$M_{1,2} \cdot M_{2,1} > M_{1,1} \cdot M_{2,2} \qquad (33)$$


These two conditions are mutually exclusive, meaning that for any given nonsingular mixing matrix M, only one of the two connection matrices given in the typical case for the two-neuron network can be temporally stable.
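The mutual exclusivity of (32) and (33) can be checked empirically: for random positive mixing matrices (an illustrative choice), exactly one of the two typical-case connection matrices has all eigenvalue magnitudes below unity.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_radius(T):
    """Largest eigenvalue magnitude of the connection matrix."""
    return np.max(np.abs(np.linalg.eigvals(T)))

for _ in range(1000):
    M = rng.uniform(0.1, 1.0, size=(2, 2))   # random positive mixing matrix
    # Nonpermuted (22) and permuted (25) connection matrices
    T_a = np.array([[0, M[0, 1] / M[1, 1]], [M[1, 0] / M[0, 0], 0]])
    T_b = np.array([[0, M[0, 0] / M[1, 0]], [M[1, 1] / M[0, 1], 0]])
    stable = (spectral_radius(T_a) < 1, spectral_radius(T_b) < 1)
    assert sum(stable) == 1   # exactly one solution is temporally stable
```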

The computation of eigenvalues may also be carried out for the connection matrices derived for the overdetermined case. For each matrix in (28) through (31), this computation results, respectively, in the conditions for stability

$$\frac{M_{1,1}}{M_{2,1}} > T_{1,2} \geq 0 \qquad (34)$$
$$\frac{M_{2,1}}{M_{1,1}} > T_{2,1} \geq 0 \qquad (35)$$
$$\frac{M_{2,2}}{M_{1,2}} > T_{2,1} \geq 0 \qquad (36)$$
$$\frac{M_{1,2}}{M_{2,2}} > T_{1,2} \geq 0 \qquad (37)$$

Under its respective condition, each of the matrices derived for the overdetermined case is a stable state of the network.

Thus, in the typical case only, the requirement for stability of the time evolution rule eliminates half of the potential solutions for the two-neuron network, and in the general N-neuron situation a potentially large number of the $N!$ theoretically possible connection matrices. In the overdetermined case, where fewer solutions exist to begin with, the requirement for stability places an upper bound on the unconstrained weight but eliminates no potential solutions.

This reduction in the number of valid solutions to the BSS problem could be a crucially important consequence of using the time-stepping network evolution rule of (3) rather than the conventional approximation of (4).

5.3.4 Stability of the network learning rule

Assuming a given connection matrix T does represent a solution to the BSS problem and has eigenvalue magnitudes less than unity, thus allowing network temporal evolution to be stable, do these solutions represent stable states of the learning rule of (5)? If not, they could never be learned by our neural network model.

For a given inhibitory connection matrix T to be a stable state of the learning rule, updates to T made by the learning rule of (5) must be zero-mean over time. This can be expressed as

$$E\!\left[\frac{dT_{1,2}}{dt}\right] = E\!\left[\gamma \cdot g(o'_1(t)) \cdot f(o'_2(t))\right] = 0 \qquad (38)$$
$$E\!\left[\frac{dT_{2,1}}{dt}\right] = E\!\left[\gamma \cdot g(o'_2(t)) \cdot f(o'_1(t))\right] = 0 \qquad (39)$$

where $E[x]$ formally represents the expected value of random variable x, but may also be interpreted for deterministic time variables as the average value of $x(t)$ over some fixed period of time. The fundamental BSS assumption that hidden sources are statistically independent is a requirement for stability of the learning rule, but as discussed previously, the rule may also stabilize for distinctly varying deterministic sources.

Assuming that the network does indeed solve the BSS problem, resulting in one of the connection matrices shown above, we may substitute in high-pass filtered versions of (20) and (21) (both of which are satisfied in all typical and overdetermined cases presented above); factoring out the constant $\gamma$, (38) and (39) become

$$E\!\left[g(M_{1,1} \cdot s'_1(t)) \cdot f(M_{2,2} \cdot s'_2(t))\right] = 0 \qquad (40)$$
$$E\!\left[g(M_{2,2} \cdot s'_2(t)) \cdot f(M_{1,1} \cdot s'_1(t))\right] = 0 \qquad (41)$$

The constants from the mixing matrix M in these equations may be factored out without changing the conditions required to make the equations true, so let us look more closely at the effect of the expected value operator on the high-pass filtered sources and on the learning rule activation functions $f(x)$ and $g(x)$.

Let $x_1(t) = s'_1(t)$ and $x_2(t) = s'_2(t)$. The function of the high-pass filter is to remove low frequencies (the mean, or expected value, being the lowest possible frequency), so due to the operation of the filter

$$E[x_1(t)] = E[x_2(t)] = 0 \qquad (42)$$

Thus, both $x_1(t)$ and $x_2(t)$ are zero-mean random variables due to the action of the high-pass filter, and they retain the statistical independence of the sources $s_1(t)$ and $s_2(t)$ after the linear filtering operation.
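The zero-mean property is easy to illustrate with a discrete first-order high-pass filter applied to a strongly biased source; the filter form and its coefficient below are generic stand-ins, not the model's actual filter.

```python
import numpy as np

def high_pass(x, alpha=0.95):
    """Discrete first-order high-pass filter: y[n] = alpha*(y[n-1] + x[n] - x[n-1])."""
    y = np.zeros_like(x)
    for n in range(1, len(x)):
        y[n] = alpha * (y[n - 1] + x[n] - x[n - 1])
    return y

t = np.linspace(0, 10, 5000)
s = 3.0 + np.sin(2 * np.pi * 2 * t)   # source with a large DC offset
x = high_pass(s)

print(np.mean(s))   # ~3.0: the raw source is far from zero-mean
print(np.mean(x))   # ~0.0: the filtered signal is approximately zero-mean
```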

In our learning rule, the nonlinear functions $g(x) = \tanh(\pi x)$ and $f(x) = x^3$ are applied, respectively, to the statistically independent zero-mean random variables $x_1(t)$ and $x_2(t)$. The conditions for stability of learning rules of this form have been studied in detail by Sorouchyari (1991). By approximating the hyperbolic tangent function with a Taylor series to the third-order term (a good approximation for our model's normalized inputs), it can be shown that this stability is conditioned on the third moment about zero (the skewness, a measure of the asymmetry of the probability density function) of one or both of the random variables $x_1(t)$ and $x_2(t)$ being zero. This directly requires that, for stability of the learning rule, the skewness of one or both of the hidden sources $s_1(t)$ and $s_2(t)$ must be zero.
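The skewness condition can be illustrated by estimating the average learning-rule drift $E[g(x_1) \cdot f(x_2)]$ directly. In the sketch below (with illustrative signal distributions), the drift averages to approximately zero for symmetric, zero-skewness independent signals, but is clearly nonzero when both signals are skewed.

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda x: np.tanh(np.pi * x)   # learning rule activation g(x)
f = lambda x: x ** 3               # learning rule activation f(x)

def mean_drift(x1, x2):
    """Time-averaged weight update E[g(x1)*f(x2)], with gamma factored out."""
    return np.mean(g(x1) * f(x2))

n = 200_000
# Symmetric, zero-skewness independent signals: drift averages to ~0
x1 = rng.normal(0, 0.5, n)
x2 = rng.normal(0, 0.5, n)
print(mean_drift(x1, x2))          # ~0: the solution is a stable learning state

# Skewed zero-mean signals (shifted exponentials): drift is biased away from 0
y1 = rng.exponential(0.5, n) - 0.5
y2 = rng.exponential(0.5, n) - 0.5
print(mean_drift(y1, y2))          # clearly nonzero: the weights would drift
```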

Note that, unlike network temporal stability, this condition for stability of the learning rule requires only that the connection matrices T solve the visual binding problem. The conditions for learning stability are placed primarily on the statistics of the hidden sources $s(t)$.


For distinctly varying deterministic sources, this requirement would imply that the mean value over some period of time of each cubed hidden source signal $s^3(t)$ be zero, which is true of many functions, including the sinusoidal sources used in the experiment of Fig. 3 and in the examples given below.
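As a worked example for a pure sinusoid $s(t) = A \sin(\omega t)$, the triple-angle identity shows directly that the cubed signal has zero mean over a period:

$$s^3(t) = A^3 \sin^3(\omega t) = \frac{A^3}{4}\left(3\sin(\omega t) - \sin(3\omega t)\right)$$

Both terms on the right are sinusoids, each with zero mean over a full period, so the time average of $s^3(t)$ is zero and the skewness condition is satisfied.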

In the more general case of visual binding, these hidden sources represent temporal fluctuations of visual features caused by unpredictable changes in lighting, occlusion, distance, or similar effects. Small or zero skewness of these visual fluctuations is extremely plausible, and thus in this case stability of the learning rule is very likely as well.

5.3.5 Role of the activation functions in learning

If linear weighting functions $g(x) = x$ and $f(x) = x$ were used in the learning rule, changes to diagonally symmetric elements of the weight matrix, $dT_{n,k}/dt$ and $dT_{k,n}/dt$, would necessarily be equal, resulting in equal values of $T_{n,k}$ and $T_{k,n}$ and thus a purely Hebbian, diagonally symmetric connection matrix T. Based on the BSS theory presented above, this learning rule could never solve the overdetermined case, in which all correct weight matrices are asymmetric, and could only solve