Biol Cybern (2017) 111:207–227
DOI 10.1007/s00422-017-0716-z
ORIGINAL ARTICLE
An insect-inspired model for visual binding II: functional analysis
and visual attention
Brandon D. Northcutt¹ · Charles M. Higgins²
Received: 2 April 2016 / Accepted: 27 February 2017 / Published online: 16 March 2017
© Springer-Verlag Berlin Heidelberg 2017
Abstract We have developed a neural network model
capable of performing visual binding inspired by neuronal
circuitry in the optic glomeruli of flies: a brain area that lies
just downstream of the optic lobes where early visual pro-
cessing is performed. This visual binding model is able to
detect objects in dynamic image sequences and bind together
their respective characteristic visual features—such as color,
motion, and orientation—by taking advantage of their com-
mon temporal fluctuations. Visual binding is represented in
the form of an inhibitory weight matrix which learns over
time which features originate from a given visual object.
In the present work, we show that information represented
implicitly in this weight matrix can be used to explicitly count
the number of objects present in the visual image, to enumer-
ate their specific visual characteristics, and even to create an
enhanced image in which one particular object is emphasized
over others, thus implementing a simple form of visual atten-
tion. Further, we present a detailed analysis which reveals the
function and theoretical limitations of the visual binding net-
work and in this context describe a novel network learning
rule which is optimized for visual binding.
Brandon D. Northcutt
brandon@northcutt.net
Charles M. Higgins
higgins@neurobio.arizona.edu
1 Department of Electrical and Computer Engineering, University of Arizona, 1230 E. Speedway Blvd., Tucson, AZ 85721, USA
2 Departments of Neuroscience and Electrical/Computer Eng., University of Arizona, 1040 E. 4th St., Tucson, AZ 85721, USA
Keywords Neural networks · Blind source separation · Visual binding · Object perception · Visual attention · Artificial intelligence
1 Introduction
Visual binding is the process of grouping together the visual
characteristics of one object while differentiating them from
the characteristics of other objects (Malsburg 1999), without
regard to the spatial position of the objects. Based on recently
identified structures termed optic glomeruli in the brains of
flies and bees (Strausfeld et al. 2007; Strausfeld and Okamura
2007; Okamura and Strausfeld 2007; Paulk et al. 2009; Mu
et al. 2012), we have developed a neural network model of
visual binding (Northcutt et al. 2017), which encodes the
visual binding in a pattern of inhibitory weights. This model’s
genesis was in the anatomical similarity of insect olfactory
and optic glomeruli and was inspired by the work of Hopfield
(1991), who modeled olfactory binding based on temporal
fluctuations in the mammalian olfactory bulb, and by the
seminal work of Herault and Jutten (1986) on blind source
separation (BSS).
A pattern of inhibitory weights is learned by the binding
network from visual experience based upon common tempo-
ral fluctuations of spatially global visual characteristics, with
the underlying suppositions being that the visual character-
istics of any given object fluctuate together, and differently
from other objects. An example would be when an automo-
bile passes behind an occluding tree: its color, motion, form,
and orientation disappear and then reappear together, thus
undergoing a common temporal fluctuation.
A detailed description and demonstration of the function
of this model is given in a companion paper (Northcutt et al.
2017). In the present work, we describe the essentials of
the model, present one representative demonstration of its
function in visual binding, and then show how to explicitly
interpret the binding output, which the network encodes only
implicitly in the pattern of inhibitory weights. We show how
to count the number of objects present in the input image
sequence and enumerate their characteristics and relative
strength. Further, we show how to use this information to
create an enhanced image, emphasizing one particular object
while de-emphasizing all others, thereby implementing a
form of visual attention. Finally, we present a theoretical
analysis of the functional limitations and capabilities of this
neural network model to provide a better understanding of
the network and its potential.
2 The visual binding model
Model simulations were performed in MATLAB (The
MathWorks, Natick, MA) using a simulation time step of
Δt = 10 ms and image sizes of 500 × 500 pixels.
Figure 1 shows a diagram of the full two-stage neural network used. This network begins with a first stage comprised
of three separate, fully connected recurrent inhibitory neural
networks used for refining the representation of motion, ori-
entation, and color. The outputs of the first stage are fed into
a second stage, a fully connected ten-unit inhibitory neural
network which performs visual binding. Despite their appar-
ently disparate purposes, all four neural networks comprising
the overall model operate using exactly the same temporal
evolution and learning rules, which are given below.
Each of the four recurrent neural networks used in the
model (three in the first stage processing individual visual
submodalities, and one in the second stage performing visual
binding) learned, by using common temporal fluctuations, an
inhibitory weight matrix that indicated the degree of inhibi-
tion between neurons within each network. We describe the
essentials of the operation and training of these networks
below; see Northcutt et al. (2017) for full details.
2.1 Computation of inputs to the network
Despite the fact that fly color vision is based on green,
blue, and ultraviolet photoreceptors (Snyder 1979), for ease
of human visualization and computer representation—and
without loss of generality—our color images had red, green,
and blue (RGB) color planes. Since this network is based
on a model of the insect brain, an elaborated version of the
Hassenstein–Reichardt motion detection algorithm (Hassen-
stein and Reichardt 1956;Santen and Sperling 1985)—the
standard model of insect elementary motion computation—
was used to compute motion inputs, and difference-of-
Gaussian (DoG) filters—the best existing model of early
[Figure: inputs i1 (left), i2 (right), i3 (up), i4 (down), i5 (0°), i6 (60°), i7 (120°), i8 (red), i9 (green), and i10 (blue) feed the first-stage Motion, Orientation, and Color networks, whose outputs feed the fully connected second stage, which produces outputs o1 through o10.]
Fig. 1 The two-stage network for visual submodality refinement and visual binding. Large circles represent units in the neural network. Unshaded half-circles at connections indicate excitation, and filled half-circles indicate inhibition. During the first phase of training, neurons in the three first-stage networks learn to mutually inhibit one another, thus refining the representation of motion, orientation, and color and resulting in more selective tunings in each submodality. In the second phase of training with a stimulus comprised of moving bars, the second-stage visual binding network learns the relative strengths of visual object features based on common temporal fluctuations and develops an inhibitory weight matrix to produce outputs that reflect the temporal fluctuations unique to each object
insect orientation processing (Rivera-Alvidrez et al. 2011)—
were convolved with the image to compute orientation.
Inputs to the neural network were computed as dia-
grammed in Fig. 2. For achromatic motion and orientation
processing, each RGB image was first converted to grayscale
by taking the average of the 3 color components.
[Figure: the input RGB image passes through grayscale conversion (G), Hassenstein–Reichardt (HR), and DoG feature processing to produce ten 2D feature images, which are spatially summed into wide-field scalar outputs i1 through i10.]
Fig. 2 Diagram of wide-field visual input computation from input images. Each input RGB image was converted to grayscale (G) for orientation and motion processing. Image motion was computed by the Hassenstein–Reichardt (HR) elementary motion detection model in the horizontal (I_H) and vertical (I_V) directions and then separated into four feature images I_left, I_right, I_down, and I_up, respectively, containing strictly positive leftward (H < 0), rightward (H > 0), downward (V < 0), and upward (V > 0) components. Three orientation-selective DoG filter kernels were convolved with the grayscale image to produce three orientation feature images I_0°, I_60°, and I_120°. The individual red, green, and blue color planes were taken as the final three feature images I_red, I_grn, and I_blu. Each of these ten 2D feature images was then spatially summed to create a scalar value representing wide-field image feature content. The inputs i1(t) through i10(t) became input to the neural network model of Fig. 1
The Hassenstein–Reichardt motion algorithm was iterated
across rows and columns of the grayscale image, resulting
in vertical and horizontal “motion images” containing signed
local motion outputs. These were further subdivided into non-
negative leftward, rightward, downward, and upward motion
“feature images” by selecting components of each motion
image with a particular sign. DoG filters selective for three
orientations (0°, 60°, and 120°) were convolved with the
grayscale images to create three orientation feature images.
Finally, the red, green, and blue color planes were used as
color feature images.
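The elementary motion computation just described can be sketched in a few lines. The following is a minimal, basic Hassenstein–Reichardt correlator applied along one image row, not the elaborated version used in the model; the use of a first-order low-pass filter as the delay element and its time constant are illustrative assumptions.

```python
import numpy as np

def hr_motion_row(frames, dt=0.01, tau=0.05):
    """Basic Hassenstein-Reichardt correlator along one image row.
    frames: array of shape (T, X), grayscale intensity over time.
    Returns signed local motion estimates of shape (T, X-1):
    positive for rightward motion, negative for leftward."""
    a = frames[:, :-1].astype(float)   # left photoreceptor of each pair
    b = frames[:, 1:].astype(float)    # right photoreceptor
    alpha = dt / (tau + dt)            # first-order low-pass as the delay
    lp_a = np.zeros_like(a)
    lp_b = np.zeros_like(b)
    out = np.zeros_like(a)
    for t in range(1, frames.shape[0]):
        lp_a[t] = lp_a[t - 1] + alpha * (a[t] - lp_a[t - 1])
        lp_b[t] = lp_b[t - 1] + alpha * (b[t] - lp_b[t - 1])
        # delayed-left times right, minus left times delayed-right
        out[t] = lp_a[t] * b[t] - a[t] * lp_b[t]
    return out
```

The mirror-symmetric subtraction makes the detector direction-selective: a pattern moving left-to-right correlates the delayed left input with the undelayed right input, yielding a positive mean output, while the opposite direction yields a negative one. Running the same correlator down the columns gives the vertical motion image.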
Each of these ten two-dimensional (2D) feature images
was spatially summed over both dimensions to produce ten
scalar measures of full-image feature content. Each group
of scalar signals corresponding to a given visual submodal-
ity (motion, orientation, or color) was then normalized to a
maximum value of unity, allowing features from different
submodalities to be comparable while preserving ratios of
features within each group. This normalization divided each
group of signals by scalar factors n_m(t), n_o(t), and n_c(t) computed as the maximum value of any signal in the group during
the last 2 s, thereby automatically scaling inputs from different submodalities to become comparable with one another
and simultaneously adapting the signals to changing visual
conditions. The ten normalized scalar values were provided
at each simulation time step as inputs i1(t) through i10(t) to
the neural network of Fig. 1.
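The adaptive group normalization might be sketched as follows, assuming a 10-element input vector ordered as in Fig. 2 and a 200-step running window (2 s at Δt = 10 ms); the function and variable names are illustrative.

```python
import numpy as np

def normalize_groups(signals, history, window=200):
    """Adaptive per-submodality scaling (sketch). `signals` is the
    10-element input vector at the current step; `history` is a list
    of recent vectors (the last 2 s = 200 steps at dt = 10 ms).
    Groups follow Fig. 2: motion (0-3), orientation (4-6), color (7-9).
    Each group is divided by the maximum value any of its signals
    reached over the window (the factors n_m, n_o, n_c), making
    submodalities comparable while preserving within-group ratios."""
    history = (history + [signals])[-window:]   # running 2-s buffer
    recent = np.array(history)
    out = np.array(signals, dtype=float)
    for lo, hi in [(0, 4), (4, 7), (7, 10)]:    # the three groups
        peak = recent[:, lo:hi].max()
        if peak > 0:
            out[lo:hi] /= peak
    return out, history
```

Because each group is scaled by its own recent peak, a dim scene and a bright scene produce comparably sized inputs, implementing the adaptation to changing visual conditions described above.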
2.2 Network temporal evolution
All inputs to each of the four recurrent networks in the model
were high-pass filtered before being processed. This ensured
that static features of the visual scene such as an unchanging
background never became input to the network.
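A first-order temporal high-pass filter of this kind might be discretized as in the following sketch; only the time constant τ_HI = 1.0 s and the step Δt = 10 ms come from the text, and the recursive form is an assumption.

```python
import numpy as np

def highpass(x, dt=0.01, tau=1.0):
    """First-order temporal high-pass filter (sketch).
    x: 1-D array of samples at spacing dt; tau is the filter
    time constant (tau_HI = 1.0 s in the paper)."""
    alpha = tau / (tau + dt)              # discrete filter coefficient
    y = np.zeros_like(x, dtype=float)
    for n in range(1, len(x)):
        # output tracks changes in x and decays toward zero otherwise
        y[n] = alpha * (y[n - 1] + x[n] - x[n - 1])
    return y
```

A constant (static) input decays to zero at the output, which is exactly why an unchanging background never reaches the network, while a step or fluctuation passes through and then fades with time constant tau.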
As described in detail in a companion paper (North-
cutt et al. 2017), the activation of each neuron in the
model—which may be positive or negative—represents for
non-spiking neurons the graded potential of the neuron rela-
tive to its resting potential and for spiking neurons the average
firing rate relative to the spontaneous rate.
The activation o_n(t) of neuron n in a recurrent inhibitory
neural network may be modeled as

o_n(t) = i'_n(t) - Σ_{k=1}^{N} W_{n,k} · o_k(t - τ_i)    (1)

where i'_n(t) represents a first-order temporal high-pass filtered version of the input i_n(t) using time constant τ_HI =
1.0 s, W_{n,k} represents the strength of the inhibitory synaptic pathway from neuron k to neuron n, o_k(t) represents the
activation of a different neuron k in the network, and τ_i represents the small but finite delay required to produce inhibition.
This set of equations can be expressed in matrix form as

o(t) = i'(t) - W · o(t - τ_i)    (2)
123
210 Biol Cybern (2017) 111:207–227
where lowercase bold symbols indicate N-element column
vectors and uppercase bold symbols represent N × N matrices. Since the biophysical details of optic glomeruli—and
thus τi—are unknown, but the existence of a finite delay is
crucial (as detailed in Northcutt and Higgins 2017), in our
simulations we formulate the network temporal dynamics as
o(t) = i'(t) - W · o(t - Δt)    (3)

where Δt is the simulation time step. By using τ_i = Δt, we
provide the smallest finite inhibition delay possible in our
simulation. This equation was used in all simulations.
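A single step of (3) is then just a vector subtraction; a minimal sketch:

```python
import numpy as np

def step_network(i_hp, o_prev, W):
    """One simulation step of (3): o(t) = i'(t) - W . o(t - dt).
    i_hp: vector of high-pass filtered inputs i'(t);
    o_prev: output vector from the previous time step;
    W: inhibitory weight matrix (zero diagonal, non-negative)."""
    return i_hp - W @ o_prev
```

With stable weights and a slowly varying input, iterating this update settles to the fixed point o = [I + W]^(-1) i' of the idealized formulation discussed below.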
When the time scale of input and output changes is much
larger than the simulation time step Δt, (3) may be approximated as

o(t) = i'(t) - W · o(t)    (4)
Apart from the use of high-pass filtered inputs, (4) is a common
formulation for a fully connected recurrent inhibitory neural
network used in BSS (Herault and Jutten 1986; Jutten and
Herault 1991; Cichocki et al. 1997). However, (4) represents
an idealized system, for which the outputs may be instantaneously computed from the inputs so long as the matrix
[I + W]^(-1) exists (I being the identity matrix). This system
of equations can be singular, but, quite unlike any realistic
recurrent neuronal network, cannot be temporally unstable.
The use of (3) for temporal dynamics instead of the more
common, seemingly quite reasonable approximation of (4)
allows for modeling of the temporal instability of recurrent
neuronal networks—which, as shown below, is crucial to
understanding their function—while still allowing network
temporal evolution to be approximated by (4) when required
to make theoretical analysis tractable and relate the present
network to previous studies.
As detailed in a companion paper (Northcutt et al. 2017),
the temporal stability of linear systems such as the one
described by (3) has long been well understood (Trentelman et al. 2012), and stability of such a network may be
maintained by simply requiring that the magnitude of all
eigenvalues of the weight matrix W be less than unity.
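That stability criterion can be checked directly; a sketch:

```python
import numpy as np

def is_stable(W):
    """Temporal stability test for the linear dynamics of (3):
    the iteration o(t) = i'(t) - W . o(t - dt) is stable when
    every eigenvalue of W has magnitude less than unity."""
    return np.max(np.abs(np.linalg.eigvals(W))) < 1.0
```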
2.3 Network learning rule
The Hebbian-style network learning rule used to generate
inhibitory weight matrices based on common temporal fluc-
tuations of the inputs was modified from that of Cichocki
et al. (1997). This spatially asymmetric multiplicative weight
update rule was chosen to support the representation of
neuronal activation described in Sect. 2.2—which does not
explicitly represent neuronal action potentials—and to lever-
age existing theoretical work on neural network solutions to
BSS problems, in awareness of the fact that the underlying
biological basis of this learning is likely to be spike-timing-
dependent plasticity (Markram et al. 1997) as discussed in
Northcutt et al. (2017).
Weight matrices W were initialized to zero so that the
state of the network was o(t) = i'(t) and thus network outputs were initially identical to high-pass filtered inputs. The
network learning rule is formulated as

dW_{n,k}/dt = γ · μ(t) · g(o'_n) · f(o'_k)    (5)

where n ≠ k are neuron indices and γ is a scalar learning rate. Diagonal elements of the weight matrix always
remained at zero, preventing self-inhibition. Any element of
W that became negative after a learning rule update was set
to zero, thereby enforcing that network weights were strictly
inhibitory.
μ(t) is a learning onset function that has a value of zero
at the start of training and rises asymptotically to unity with
a time constant of 2 s. Our formulation of μ(t)—quite unlike
the identically named function used by Cichocki et al.—is
used to gradually turn on the learning rule at the start of
training to avoid a powerful transient in weights based solely
on the initial phase of the inputs.
o'_n(t) and o'_k(t), respectively, indicate first-order temporal high-pass filtered versions of network outputs n and k
with time constant τ_HO = 0.5 s. Note that use of high-pass
filtered outputs in the learning rule, rather than simply the
outputs, makes the network’s learning dependent on tempo-
ral fluctuations of the inputs: specifically those fluctuations
with temporal frequencies greater than the cutoff frequency
of the output high-pass filter.
We used network “activation functions” f(x) = x^3 and
g(x) = tanh(x) similar to those used by previous authors
(Jutten and Herault 1991; Cichocki et al. 1997) to introduce
higher-order statistics of the filtered outputs into the learning
rule (Hyvärinen and Oja 1998), although the positions of
these two functions with respect to rows nand columns kof
the weight matrix are exchanged in (5) as compared to the
conventional BSS learning rule. This exchange is crucial to
the function of our model, and both the role of these activation
functions and the requirement that they be exchanged for our
model are addressed in detail in Sect. 5.3.5.
For reasons that will become apparent later, we will refer
to the learning rule with activation functions in their conven-
tional positions as the cooperative learning rule. In contrast,
we will refer to the learning rule of (5) with exchanged acti-
vation functions as the competitive learning rule.
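One Euler step of the competitive learning rule of (5), with the zero diagonal and the clipping of negative weights described above, might be sketched as follows; the explicit Euler discretization and step size are assumptions.

```python
import numpy as np

def update_weights(W, o_hp, gamma, mu, dt=0.01):
    """One Euler step of the competitive rule (5) (sketch).
    o_hp: vector of high-pass filtered outputs o'. The compressive
    g (tanh) applies along rows n and the expansive f (cube) along
    columns k -- the exchange, relative to the conventional BSS
    rule, described in the text."""
    f = o_hp ** 3                       # f(x) = x^3
    g = np.tanh(o_hp)                   # g(x) = tanh(x)
    dW = gamma * mu * np.outer(g, f)    # dW[n,k] = g(o'_n) * f(o'_k)
    W = W + dt * dW
    np.fill_diagonal(W, 0.0)            # no self-inhibition
    return np.maximum(W, 0.0)           # weights stay strictly inhibitory
```

Swapping the two activation-function lines would recover the cooperative rule; everything else (onset function μ, clipping, zero diagonal) is shared between the two variants.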
2.4 Training of the model
Before training of any network began, a visual input was
presented and all linear filters and the input adaptive scaling
algorithm were allowed to reach steady state to eliminate
artifactual startup transients.
Training began with each of the three first-stage networks
using a learning rate of γ1 = 50. This training resulted in the
first-stage motion, orientation, and color networks, respectively, learning weight matrices M, O, and C. To rapidly give
the first-stage networks sufficient experience to refine each
visual submodality, an artificial visual stimulus (detailed in
Northcutt et al. 2017) was presented that provided simulta-
neous temporal fluctuations in all colors, orientations, and
directions of motion. This stimulus provided near-identical
signals to each input of every network, effectively reducing
the learning rule of (5) to a purely Hebbian one (Hebb 1949).
This input resulted in uniform symmetric (zero diagonal)
weight matrices, indicating uniform lateral inhibition: a well-
known technique for sensory refinement (Linster and Smith
1997). However, with this symmetric stimulus weight matri-
ces never converge to a stable state, but rather increase in
value as long as training continues. For this reason, learn-
ing for each first-stage network was terminated by setting
its respective learning rate γ1to zero when the maximum
magnitude of any eigenvalue of the network weight matrix
reached a value of V1,max =0.9. This procedure allowed
us to rapidly learn strong lateral inhibition in the first stage
while avoiding temporally unstable recurrent networks.
During this first-stage training period, the learning rate
γ2 for the second stage was set to zero. While not strictly
necessary, this isolation of the two stages allowed for rapid
training of the first-stage network and a clear demonstration
of second-stage function.
After first-stage learning was complete, the second-stage
learning rate was set to γ2 = 0.5, after which visual stimuli
composed of multiple objects and intended to demonstrate
visual binding were presented, as shown in the next section.
In this second phase of training, the second-stage network
learned an inhibitory connection matrix T indicating the
binding among visual submodalities.
To avoid temporal instability of the second-stage network,
if any weight matrix update resulted in T having an eigenvalue with a magnitude V greater than V2,max = 0.95, the
weight matrix was multiplied by a scalar factor V2,max/V,
thus holding the maximum eigenvalue at V2,max and maintaining network temporal stability.
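This eigenvalue-based rescaling might be sketched as:

```python
import numpy as np

def limit_eigenvalues(T, v_max=0.95):
    """Keep the second-stage network temporally stable (sketch):
    if the largest eigenvalue magnitude V of the weight matrix T
    exceeds V2,max = 0.95, rescale T by V2,max / V so the maximum
    eigenvalue magnitude is held at V2,max."""
    v = np.max(np.abs(np.linalg.eigvals(T)))
    if v > v_max:
        T = T * (v_max / v)
    return T
```

Because scaling a matrix scales all of its eigenvalues by the same factor, this preserves the learned pattern of relative weights while capping the spectral radius below unity.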
3 An example of visual binding
To demonstrate the operation of the visual binding network, an artificial visual stimulus composed of moving
50 × 12-pixel bars on a black background was presented.
This stimulus consisted of a red bar that started near the
upper left corner of the image and moved down and right at
−30°, and a green bar that started near the upper right and
moved down and left at 210°. Both bars were oriented with
their longest axis orthogonal to the direction of motion and
moved at 50 pixels per second.
The bars moved through a fixed pattern of multiplicative
horizontal sinusoidal shadow with a spatial period of 50 pix-
els, a mean value of 0.5, and an amplitude of 0.25. As the bars
moved independently through this pattern of shadows, the
features of each bar fluctuated together, allowing the model
to learn their individual characteristics. Bars wrapped around
toroidally to re-enter the image, thus putting no time limita-
tion on network training.
Figure 3a and c, respectively, shows the time course of
network outputs and the final weight matrix when the network was trained for 15 s with this visual stimulus using the
competitive learning rule of (5). For comparison, Fig. 3b and
d shows the same data using the cooperative learning rule, in
which the activation functions f() and g() of (5) are placed
in their conventional position in the BSS literature (Cichocki
et al. 1997), with the expansive function f() then applying
to row elements, and the compressive function g() to column
elements.
The results of Fig. 3a and c using the competitive learning
rule correspond to desired operation of the visual binding
network. The red and green output neurons clearly come
to dominate all others, and (neglecting very small weights)
the columns of the final weight matrix correctly indicate the
characteristics of the individual objects which comprised the
stimulus. Reading the weights from the “red” column, the
bar moved to the right, and down to a lesser extent, and
got a roughly equal response from the 0° and 120° orientation
filters, indicating an approximate orientation of −30° or
equivalently 150°. From the “green” column, the bar moved
to the left, downward to a lesser extent, and had an orientation
of approximately 30° or equivalently 210°.
In stark contrast, the results shown in Fig. 3b and d reveal
clearly that the cooperative learning rule is not applicable to
the visual binding model. The reasons for this are detailed in
Sect. 5.3.5.
This single example suffices to illustrate the methods of
weight matrix interpretation and network functional analysis
presented below, but many further experiments are described
and full details given in a companion paper (Northcutt et al.
2017).
4 Extracting object-level information
While we have presented and discussed the second-stage net-
work so far as if it learned the features of a static set of objects
and then stabilized, the temporal dynamics of learning are in
general far more complicated. After initial training of the first
stage, the learning of the second stage never stops. This is
212 Biol Cybern (2017) 111:207–227
[Fig. 3 graphics: a stage-2 network outputs vs. time since start of stage-2 training (0–10 s) using the competitive learning rule; b the same outputs using the cooperative learning rule; c final weight matrix using competitive learning; d final weight matrix using cooperative learning. Output traces and weight matrix rows/columns are indexed by Left, Right, Down, Up, 0°, 60°, 120°, Red, Green, Blue; each matrix column holds weights FROM that neuron and each row weights TO that neuron.]
Fig. 3 Outputs and weights of the second-stage network as it trained
with a visual stimulus comprised of two bars moving through sinusoidal
shadow. A legend to identify each trace in panels a and b is shown at
upper left. a Network outputs when using the competitive learning rule
of (5), in which the activation functions f() and g() are switched in
position relative to conventional learning rules for BSS and thus
emphasize patterns of column over row weights. Note that over the time
of training, the red and green outputs come to inhibit all others.
b Network outputs when using the conventional cooperative learning rule,
which emphasizes patterns of row over column weights. No pattern of
outputs is evident apart from nearly uniform inhibition. c Final weight
matrix T using the competitive learning rule of (5). Brighter colors
indicate stronger inhibition. The strongest weights are in the red and
green columns and clearly indicate the features of each bar. d Final
weight matrix T using the cooperative learning rule
desirable because it allows the network to continuously adapt
to dynamic visual scenery.
However, should a set of objects in the visual scene persist
sufficiently long, the matrix T will converge to a particular
set of weights representing the characteristics of the
objects, and the number of outputs o(t) that are significantly
nonzero will come to correspond to the number of objects.
The convergence of second-stage learning for a given
visual stimulus, and thus the validity of what has been
learned, may most simply be determined at the current time
t by requiring that the sum of all absolute weight matrix
changes over a recent period of time τ_s decline below a
threshold S_thr

$$\int_{t-\tau_s}^{t} \sum_{n=1}^{N} \sum_{k=1}^{N} \left| \frac{dT_{n,k}}{dt} \right| \, dt < S_{\mathrm{thr}} \qquad (6)$$
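In discrete time, this criterion amounts to summing absolute weight changes over a sliding window; a minimal sketch, with illustrative window length and threshold values:

```python
import numpy as np
from collections import deque

class ConvergenceMonitor:
    """Discrete-time sketch of the convergence criterion: learning is deemed
    converged when the summed absolute weight change over the last tau_s
    seconds falls below s_thr. Window and threshold values are assumptions."""
    def __init__(self, tau_s=2.0, dt=0.01, s_thr=0.05):
        self.changes = deque(maxlen=int(round(tau_s / dt)))
        self.s_thr = s_thr
        self.prev_T = None

    def update(self, T):
        if self.prev_T is not None:
            # accumulate sum_{n,k} |dT_{n,k}/dt| * dt  =  sum |Delta T|
            self.changes.append(np.sum(np.abs(T - self.prev_T)))
        self.prev_T = T.copy()
        # converged only once a full window of small changes has been seen
        return (len(self.changes) == self.changes.maxlen
                and sum(self.changes) < self.s_thr)
```

Calling `update` once per simulation step returns True only after the weight matrix has remained essentially unchanged for the whole window τ_s.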
Once the weight matrix has stabilized, the representation
of objects in the image developed by the visual binding
network is implicit in the activity of the outputs o(t) and the
connection matrix T.
4.1 The number of objects and their features
This representation may easily be made more explicit for
human interpretation, both making network operation easier
to understand and giving practical utility to the model.
The weight matrix T may be simplified by normalizing
it to its maximum value over all rows and columns and then
removing weights less than a given threshold, which may be
expressed as

$$T^{\mathrm{norm}} = T \big/ \max_{n,k=1,\ldots,N} T_{n,k} \qquad (7)$$

$$T^{\mathrm{simp}}_{n,k} = \begin{cases} T^{\mathrm{norm}}_{n,k} & T^{\mathrm{norm}}_{n,k} \ge v_{\min} \\ 0 & T^{\mathrm{norm}}_{n,k} < v_{\min} \end{cases} \qquad (8)$$
We used a threshold value of v_min = 0.33 (1/3 of the maximum
weight value). An example of this simplified weight
matrix, generated from the raw weight matrix shown in
Fig. 3c, is shown in Fig. 4a. Here the column-oriented pattern
of weights is even more evident, and the relative strength of
each feature may be read out directly from the matrix.
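The normalize-and-threshold step of (7) and (8) can be sketched directly; the function name is illustrative:

```python
import numpy as np

def simplify_weights(T, v_min=0.33):
    """Eqs. (7)-(8): normalize T by its largest element, then zero out
    entries below the threshold v_min (1/3 of the maximum weight)."""
    T_norm = T / np.max(T)
    return np.where(T_norm >= v_min, T_norm, 0.0)
```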
In fact, it is possible to make the representation of objects
and their features even more explicit. By summing T^simp
vertically, a ten-element row vector t^sum results

$$t^{\mathrm{sum}}_{k} = \sum_{n=1}^{N} T^{\mathrm{simp}}_{n,k}, \qquad k = 1 \ldots 10 \qquad (9)$$

Each element t^sum_k represents the sum of all inhibitory
weights to other neurons from any neuron k.
Using this information, we can create an object matrix
O that explicitly describes the number of objects and their
features. Every neuron k for which the entry t^sum_k is above
a threshold value (for which we used 0.6) is concluded to
have accumulated sufficient inhibitory weight to represent
an object. For each such neuron k, the weights from T^simp
for column k may be used to create a new row i of the object
matrix O as

$$O_{i,j} = \begin{cases} T^{\mathrm{simp}}_{j,k} & j \ne k \\ 1 & j = k \end{cases}, \qquad j = 1 \ldots 10 \qquad (10)$$
[Fig. 4 graphics: a simplified visual binding matrix; b object matrix extracted from the binding matrix. Rows and columns are indexed by Left, Right, Down, Up, 0°, 60°, 120°, Red, Green, Blue; the object matrix rows are labeled Object 1 and Object 2.]
Fig. 4 Extraction of object-level information. a Simplified visual
binding matrix T^simp, from which object features can clearly be read.
b Object matrix O extracted from T^simp using (10). Each row of the
object matrix corresponds to a unique object. The associated vector r
= [8 9] indicates that the first row of the object matrix corresponds to
output 8 (red) and the second row to output 9 (green)
$$r_i = k \qquad (11)$$

where the vector r is used to store the index of the output
neuron corresponding to each object matrix row i, and the
zero diagonal element of T^simp is replaced with unity to
represent the fact that this row of weights in the object matrix
originated from neuron k (see Sect. 5.5.1 for a theoretical
justification).
justification). An example of the resulting object matrix Ois
showninFig.4b: The number of rows of Ocorresponds to the
number of objects, the columns to each individual visual fea-
ture, and the weights in each column to the learned strength
of each characteristic feature. In this case, the object matrix
Oaccurately represents the motion, orientation, and color of
the two objects in the example visual stimulus.
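The extraction of (9)–(11) can be sketched as follows, assuming T^simp is a NumPy array; the function name and return convention are illustrative:

```python
import numpy as np

def extract_objects(T_simp, t_thr=0.6):
    """Eqs. (9)-(11): sum the columns of T_simp; each column whose total
    inhibitory weight exceeds t_thr is taken to represent one object.
    Returns the object matrix O (one row per object) and the index
    vector r of the corresponding output neurons."""
    t_sum = T_simp.sum(axis=0)          # eq. (9): vertical (column) sums
    r = np.flatnonzero(t_sum > t_thr)   # eq. (11): winning output indices
    O = T_simp[:, r].T                  # eq. (10): column k becomes row i
    O[np.arange(len(r)), r] = 1.0       # zero diagonal replaced with unity
    return O, r
```

The number of rows of the returned O is the object count, and each row can be read directly as that object's feature profile.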
4.2 Elementary visual attention
As can be seen from the example data shown in Fig. 3a,
the mutually inhibitory second-stage outputs corresponding
to objects in the visual scene fluctuate over time, with each
having greater value in proportion to fluctuations of the char-
acteristics of the represented object, measured in terms of the
visual features input to the network. This overall measure of
feature strength is closely related to computational models
of visual saliency (Itti and Koch 2001), and so the model
might reasonably be said to be switching its visual attention
(Itti et al. 1998) from one object to another as their relative
saliency changes. In fact, visual attention is often modeled
as a winner-take-all phenomenon (Lee et al. 1999), which
is quite akin to the mutually inhibitory competition of the
second-stage network.
The neuron that has the largest output value at any given
time corresponds to the most salient object. The movement
of this “attentional spotlight” from one object to another can
be used to emphasize the currently attended object in the
visual image and simultaneously de-emphasize unattended
objects. We may synthesize such an “enhanced image” by
recombining the feature matrices created in Fig. 2 for every
input image using the row of weights from the object matrix
O corresponding to the currently winning neuron.
At any given time, let network output o_k(t) currently be
the largest. Row i such that r_i = k of the object matrix O
represents the visual features of this output. To make the raw
feature matrices comparable in magnitude to one another, we
make use of the adaptive input group normalization factors
already computed and described in Sect. 2.1, which may be
composed into a single ten-element vector

$$n(t) = [\, n_m \; n_m \; n_m \; n_m \; n_o \; n_o \; n_o \; n_c \; n_c \; n_c \,]$$

We may then combine this vector of normalization factors
with the feature weights associated with this object and the
neuron output to create a vector f(t) of weights to be applied
to each feature matrix

$$f_j(t) = \frac{|o_k(t)| \, O_{i,j}}{n_j(t)}, \qquad j = 1 \ldots 10 \qquad (12)$$
where the absolute value is used so that large values of the
output o_k(t), regardless of sign, result in large contributions
to the enhanced image. The vector f(t) may then be used to
create a linear recombination of the feature matrices computed
from the current input image as shown in Fig. 5, resulting in
a normalized RGB mask identifying where salient features
exist in the current image. By multiplying this mask with the
current input image, an enhanced image is created in which
the characteristics associated with the object represented by
output kare emphasized while effectively de-emphasizing
other objects.
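The weighting of (12) and the mask recombination of Fig. 5 can be sketched as follows. For brevity the mask here is a single plane applied to all color channels, a simplification of the per-color-plane combination described in the Fig. 5 caption, and all argument names are illustrative:

```python
import numpy as np

def enhanced_image(image, feature_images, O, r, o, n):
    """Sketch of the attention pipeline of eq. (12) and Fig. 5: weight the
    (already high-pass-filtered, rectified) feature images by the attended
    object's row of O, sum them into a mask, normalize the mask, and apply
    it to the input image. `feature_images` is a list of ten HxW arrays,
    `o` the output vector, `n` the group normalization factors."""
    k = int(np.argmax(np.abs(o)))        # winning (most salient) output
    i = int(np.flatnonzero(r == k)[0])   # object-matrix row for output k
    f = np.abs(o[k]) * O[i] / n          # eq. (12): per-feature weights
    mask = sum(fj * Fj for fj, Fj in zip(f, feature_images))
    mask = mask / max(np.max(mask), 1e-12)   # NORM: scale mask to [0, 1]
    return image * mask[..., np.newaxis]     # emphasize the attended object
```

Because the mask is at most unity everywhere, attended regions are preserved while everything else is attenuated, matching the suppressive character of the effect described below.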
Figure 6 presents a demonstration of the algorithm to
enhance the most salient object in input images using our
example two-bar visual stimulus, the network outputs and
Fig. 5 Computational diagram for creating an enhanced image. Thin
lines indicate scalar quantities; thick lines represent matrices. Starting
at figure left, each of the ten 2D feature images created in Fig. 2 is first
processed through a first-order high-pass filter (HPF) using the same time
constant τ_HI as was used for network inputs, and the absolute value of
each matrix taken (ABS) so that both increases and decreases in features
are represented in the output. Each of the resulting filtered matrices is
then multiplied (Π) by the corresponding scalar weight f_j from (12).
These weighted feature images are summed point-by-point (Σ) into a
single RGB image, after which this image is normalized (NORM) by
a scalar value corresponding to its maximum over all pixels and color
planes. This results in an RGB “mask” that identifies where salient
features of the input image exist. This mask is then multiplied point-by-
point (Π) with the input RGB image, resulting in a 2D RGB enhanced
image that emphasizes the most salient object. Note that motion and
orientation feature images contribute equally to the red, green, and blue
color planes, but red, green, and blue feature images contribute only to
their own color plane
final weight matrix of which are shown in Fig. 3a and c.
Comparing the input and enhanced images (Fig. 6a, b),
taken at 6.5 s into training, the red bar is clearly stronger
relative to the green in the enhanced image than in the input. To
quantify this effect, the ratio of the maximum red value to the
maximum green value in the input image is approximately
Fig. 6 Demonstration of enhancement of the most salient object. Each
of the four panels shows a 160 × 160-pixel region cropped from the
center of a 500 × 500 image, taken at times when the two moving bars
passed near one another. a The input image at 6.5 s after the beginning
of second-stage training, a time at which the red bar was the most salient
(refer to the red and green neuron outputs in Fig. 3a). The role of shadowing
in saliency is evident here: the red bar is strongly visible in the image,
almost maximally out of shadow, whereas the green bar is almost half
shadowed. b The enhanced image resulting from the algorithm of Fig.
5, emphasizing the red bar while the green bar is barely visible. c
The input image at 7.2 s into second-stage training, a time at which
the green object was most salient (refer to Fig. 3a). d The resulting
enhanced image, emphasizing the green bar while the red bar is barely
visible
unity, but in the enhanced image, the maximum red value is
6.2 times stronger than that of green. In contrast, comparing
the input and enhanced images shown in Fig. 6c and d, taken
at 7.2 s into training, the green bar is enhanced relative to the
red; again, red and green have nearly the same maximum
value in the input image, but green is 4.1 times stronger in
the enhanced image.
Looking more closely at Fig. 6, the effect is not that the
characteristics associated with the largest neuron output, rep-
resenting the most salient object, are greatly increased in
the enhanced image; rather, the characteristics of all other
objects, even background objects (due to the high-pass filter-
ing of the feature matrices), are weakened. It is notable that
this is very reminiscent of attentional effects observed in pri-
mate visual cortex (Moran and Desimone 1985): Responses
to attended objects are not increased by attention, but rather
responses to unattended objects are relatively weakened.
This algorithm provides a bottom-up method for auto-
matically attending to the most salient object in an image
sequence over time without needing any prior information
about objects or their features.
5 Function and limitations of the model
We have presented one example set of experimental results
above and a variety of experiments in a companion paper
(Northcutt et al. 2017) practically demonstrating the utility
of this two-stage neural network model for sensory refine-
ment, visual binding, internal object-level representation, and
even rudimentary visual attention. However, it is impossible
to present an exhaustive set of visual stimuli. Under what
conditions does this subtly complex neural network model
actually perform visual binding, how is this accomplished,
and what are its limitations? Intriguingly, the same recurrent
neural network subunit is used four times for two distinct
purposes in this model.
The first-stage networks (refer to Fig. 1) take as their
input feature sensors which may have significant overlap
in their particular visual submodality, and the output sig-
nals show a significant reduction in overlap of the input
sensors: For example, one of the first-stage networks takes
inputs which are broadly orientation-tuned and refines them
using lateral inhibition into outputs with narrower orientation
tuning.
In the first stage, we have applied this same recurrent
network to three specific visual submodalities: color, orien-
tation, and motion. Balanced mutual inhibition is a naturally
stable state of a fully connected recurrent inhibitory neural
network, reached automatically by the learning rule when
given an appropriately balanced set of visual stimuli. In fact,
our training of the first stage is specifically designed to elicit
lateral inhibition.
Lateral inhibition has long been understood to provide
refinement of sensory inputs at many levels, its effects having
been studied as early as the mid-19th century (Mach 1866),
and a number of sensory systems have been proposed to func-
tion using this mechanism, including vision (Blakemore and
Tobin 1972), audition (Shamma 1985), and olfaction (Linster
and Smith 1997).
While sensory refinement via lateral inhibition may well
be a useful function in many neural systems, does the first
stage contribute anything to the operation of the visual bind-
ing network? In fact, we have shown experimentally in a
companion paper (Northcutt et al. 2017) that strong mutual
inhibition in the first stage is essential to the rapid learning
demonstrated in Fig. 3a. Without strong first-stage inhibition,
second-stage learning proceeded very slowly, and network
weights are very unlikely to ever have reached the stable
state shown in Fig. 3c.
While first-stage refinement of selectivity in motion direc-
tion, orientation, and color inputs undoubtedly aids the visual
binding network in distinguishing objects, the first stage’s
most essential function for second-stage learning arises from
a more subtle effect. The strong inhibition between first-
stage outputs creates a tendency for them to become mutually
exclusive, much like a weak winner-take-all function. Thus,
when an object passes out of shadow and becomes more
salient, the first-stage network aids in causing all of its visual
attributes to increase simultaneously by inhibiting weaker
attributes in each visual submodality. Since the learning rule
of (5) develops inhibitory weights based on simultaneous
increases or decreases of visual attributes, the synchro-
nization of visual features from the same object caused by
first-stage inhibition greatly speeds the learning of second-
stage network weights.
Although the network structure, temporal evolution, and
learning rule of the second-stage network (refer to Fig. 1)
are the same as the three first-stage networks, the operation
of the second stage is much more obscure. The second-stage
network learns by associating input signals which are tempo-
rally correlated, just as the first-stage networks do. However,
as has been proposed for mammalian cortex (Douglas et al.
1989), it is the variety of dissimilar inputs to this network,
not its internal structure, that makes its function differ from
the first stage. In the present work, these signals are full-field
spatial summations of elementary visual features. Based on
the results we have shown, the empirical result of this stage’s
operation is to detect signals which originate from each inde-
pendently fluctuating object in the input and to quantify how
strongly that object expresses each visual feature, thereby
providing a solution to the visual binding problem.
5.1 Visual binding as blind source separation
The problem that the second-stage network must solve is
really that of blind source separation. The canonical example
of BSS is typically a linear mixture of auditory sources. Sev-
eral spatially separated microphones listening to a mixture
of spatially distinct independent audio sources can be used
to recover the individual audio sources. The use of recurrent
neural networks for BSS is a very well studied area, with
much of the research stemming from the seminal work of
Herault and Jutten (1986).
Mathematically, the problem of BSS can be stated as follows.
Consider a column vector of N time-varying sources
s(t). These sources are combined in an unknown (but static)
linear mixture described by an N × N mixing matrix M to
provide a vector of N mixed inputs

$$i(t) = M \cdot s(t) \qquad (13)$$
The challenge of BSS, as it is generally considered, is to
recover the “hidden” sources s(t) given only the mixed
observations i(t). The mixing matrix M is recovered implicitly in
this process but often is of little interest: for example, in the
auditory case, M would describe the relative microphone
locations, which are usually known. Rather, it is most
commonly the hidden source signals s(t) that contain the desired
information.
Given the inputs i(t), the fully connected inhibitory neural
network used in this paper, as described by the instantaneous
update rule of (4) and neglecting the high-pass filter
on the inputs, has been shown to converge (given a proper
learning rule, and under certain conditions addressed below)
to a state in which the outputs o(t) become scaled versions
of the hidden sources s(t) (Jutten and Herault 1991), thus
revealing each source separately.
The technique we have developed for processing 2D
images into a small number of highly meaningful scalar sig-
nals, shown in Fig. 2, allows the problem of visual binding to
be reduced to one of BSS. Each of the full-resolution visual
feature images is summed into a single time-varying signal
which is a linear instantaneous summation of the visual fea-
ture contributions of all objects in the scene.
In the case of visual binding—quite opposite to that of
auditory BSS—we are not particularly interested in recov-
ering the hidden sources s(t). These “sources” correspond
to temporal fluctuations of the visual feature signals from
a given object caused, for example, by changes in lighting,
movement through occlusions, or changes in distance, and
are of little practical value. Rather, we are interested in deter-
mining how many independent sources exist in the scene and
with which visual features they correspond. This implies that,
in visual binding, the mixing matrix M is of primary interest
because it reveals the features of objects in the scene and
can be used to enumerate them.
5.2 Assumptions of the visual binding model
For any BSS problem, the number of distinct hidden sources
and inputs and the details of their mixture determine whether
that problem can be solved. After all, solving the BSS
problem is not possible in every case: consider separating many
auditory sources given microphones all located at the same
spatial location! An analysis of the requirements for the
isolated second-stage BSS network to function properly is
given in the next section. However, due to the vastly greater
complexity of the visual binding problem—which takes a
sequence of 2D RGB images rather than a time-varying vec-
tor of scalar signals as input—the additional assumptions
required for our model to perform properly include the fol-
lowing, all of which we assert are reasonable in virtually any
practical situation.
1. The visual scene is dynamic. In order for the visual binding
model to detect and differentiate objects from the background
and other objects, the visual scene must change, as observed
in the feature space measured by the inputs to the network.
This is in direct contrast to visual BSS models proposed
by other authors that are designed for separating mixtures
of static images (Jutten and Herault 1991;Guo and Gar-
land 2006). Instead, our model takes as input global spatial
summations of local feature detector circuits and relies on
changes over time in the scene to produce temporal fluctua-
tions in these signals.
Due to the temporal high-pass filters in the input pathway
(3) and in the learning rule (5), there would be no input to
the network and no changes to the weight matrix of this sys-
tem in response to static images, and thus no learning. This
assumption implies an inherent interest in novelty detection,
ignoring static background features and making the network
“prefer” objects which are more “active” in the visual scene.
The assumption of a dynamic visual scene is hardly restric-
tive; in fact, this is unavoidable in almost every practical
situation.
Specifically, to be useful for binding, the temporal fre-
quency of fluctuations in the features of an object must not
be lower than the cutoff frequency set by the high-pass filter
time constant τHO used in the learning rule (5) or they will
be filtered out. This time constant, of course, can be adjusted
to fit the rate of change of the features of an object of interest.
Related to this assumption is a subtle dependence on spa-
tial image resolution. Since the inputs to our network are
scalar global spatial sums, it is not obvious what 2D spa-
tial resolution of input images is required to support visual
binding. In the limit of very tiny images, no matter what the
visual features are, they will have little or no temporal fluc-
tuation due to the fact that there are not enough image pixels.
As image size increases, feature fluctuations will become
smoother, and thus, the quality of features provided to the
network will increase. However, this quality will saturate as
the image resolution becomes sufficient to clearly visualize
objects in a given scene. The optimal spatial resolution will
be highly dependent on the scene in which objects are to be
observed.
2. Visual features of an object are correlated. For the visual
binding network to function properly, visual features origi-
nating from the same object, which then become inputs to
the network, must have temporal correlation with each other.
This requirement simply means that visual features of any
given object measured by the network (in our case, motion,
color, and orientation) must vary together over time, as they
would in any natural situation.
A wide variety of common scene changes satisfy this
assumption. Occlusion of an object by another visual scene
element causes a rapid decrease in all local visual features of
the occluded object, and therefore a reduction in the wide-
field summation of these signals; there is a corresponding
increase if the object then reappears. Similarly, when an
object passes in and out of shadows, or becomes nearer to
or more distant from a light source, the visual measures of
the object will vary together in proportion to the resulting
brightness fluctuations. Also, as an object draws nearer to the
camera, the object size increases in the visual image. Since
a closer object stimulates a greater number of local feature
detection circuits, its wide-field signals grow in inverse pro-
portion to its distance.
3. Fluctuations from different objects are distinct. Herault
and Jutten’s original work on BSS (Herault and Jutten
1986) inspired many related methods: independent compo-
nent analysis (Comon 1994), information maximization (Bell
and Sejnowski 1995), and mutual information minimization
(Yang and Si 1997). The majority of these methods require
statistical independence for sources to be separated.
However, we observe that while statistically independent
sources can be guaranteed to separate, this assumption is not
strictly necessary. For example, as we have shown in Fig. 3,
two sinusoidal “hidden sources” with the same frequency but
different phase can be separated, despite not being statisti-
cally independent. However, these sources are sufficiently
distinct to be separable.
Implicit in the requirement of statistical independence is
that visual feature fluctuations are stochastic. This will gen-
erally be true in practical situations, due to the fact that these
fluctuations derive from objects moving through shadows,
past occlusions, or moving unpredictably with respect to
the camera. The assumption that fluctuations from different
objects will be statistically independent is also reasonable
in practical cases: visual features from different objects will
be independent simply because they result from independent
physical processes in the world.
However, the network does not simply fail if this assump-
tion is not satisfied. If features of two objects are not
statistically independent, nor even distinct, the visual binding
network will bind these signals together to represent a single
object, which in perspective is not an unreasonable conclu-
sion given a set of visual features that are highly correlated
with one another.
4. Object features are persistent. If a given visual feature (the
color of an object, for example) is to be used in binding, an
object must retain that feature over a period of time sufficient
for the network to learn a pattern of weights based upon it.
While the network relies on temporal fluctuations of these
features for learning, we must assume that at least some sub-
set of an object’s features (chosen from the color, motion,
and orientation submodalities for the present network) are
relatively persistent.
For example, the motion feature inputs from a car driving
rapidly in a circle, and thus constantly changing direction,
would not be useful in solving the visual binding problem,
although its color would likely remain constant as it turned,
and would be useful in binding.
The upper limit on the speed at which visual features of an
object may change is set by the learning rate γ in the learning
rule of (5). Features of a given object that appear and
disappear so quickly that the integration of their effect over
time never creates a significant change in the weight matrix
T will not affect the output. Within the bounds of numerical
stability, the learning rate can be adjusted to fit the time
course of feature changes for a given visual stimulus.
5. Measured visual features must be diverse. In order to sep-
arate out a variety of different objects, the visual features
measured by the model must be diverse, and span a range
of visual submodalities. Preferably, the feature inputs should
span the full space of interest in each visual submodality. An
example of this diversity is the feature set we have shown in
Fig. 2, fully spanning motion, color, and orientation.
If the visual features measured are not sufficient to sepa-
rate the objects of interest (for an extreme example, imagine
that all ten feature detectors were only sensitive to the amount
of red color in the image), in general, visual binding will not
be possible.
5.3 Analysis of a two-neuron BSS network
Given the assumptions of the previous section, we can reduce
the visual binding problem solved by the model as a whole
to one of blind source separation in the second stage. The
network output and inhibitory weight temporal dynamics of
the second-stage network that we have used for visual binding
in the model are surprisingly complex, once equipped with
the time-stepping neuron update rule of (3) and the learning
rule of (5), both of which are potentially unstable. Outside
the context of visual binding, the generalized analysis of an
N-neuron BSS network with N(N − 1) inhibitory weights
is highly formalized and unrevealing and has already been
well addressed in the literature (Jutten and Herault 1991;
Sorouchyari 1991; Joho et al. 2000).
However, the second stage is sufficiently complex that
even a two-neuron network with only two inhibitory weights
can generate extremely unexpected results, and an analysis
of this minimal network better serves to clarify how the sec-
ond stage of our model works, and how our novel changes
to the network update and learning rule affect the model’s
performance. For this reason, we base our theoretical anal-
ysis of the second stage around a two-neuron network for
which all solutions can be simply enumerated and generalize
our results to the full network wherever possible. We fol-
low with demonstrations of the function of this two-neuron
network that can be clearly understood in terms of the anal-
ysis shown, and make conclusions about the full ten-neuron
second-stage network thereafter.
Our analysis begins by enumerating the possible “correct”
solutions to the BSS problem for a two-neuron network with-
out any learning using the instantaneous network update rule
of (4), which for very small simulation time steps approxi-
mates the far less analytically tractable time-stepping update
rule actually used in the model. For simplicity, in this section
we assume that all sources (and thus inputs) have temporal
frequencies sufficiently higher than the cutoff frequency of
the input high-pass filter, so that the high-pass-filtered inputs
equal the raw inputs i(t).
5.3.1 The typical case
The “typical case” in BSS is one in which the number of
distinct, nonzero hidden sources is the same as the number
of network inputs and outputs. In this case, as will be shown,
there is no single correct solution but rather a family of closely
related solutions.
For the two-neuron network, the set of scalar equations described by (13) is
$$i_1(t) = M_{1,1} \cdot s_1(t) + M_{1,2} \cdot s_2(t) \tag{14}$$
$$i_2(t) = M_{2,1} \cdot s_1(t) + M_{2,2} \cdot s_2(t) \tag{15}$$
The temporal evolution equation of (4) becomes
$$o_1(t) = i_1(t) - T_{1,2} \cdot o_2(t) \tag{16}$$
$$o_2(t) = i_2(t) - T_{2,1} \cdot o_1(t) \tag{17}$$
into which we can substitute (14) and (15) to get
$$o_1(t) = M_{1,1} \cdot s_1(t) + M_{1,2} \cdot s_2(t) - T_{1,2} \cdot o_2(t) \tag{18}$$
$$o_2(t) = M_{2,1} \cdot s_1(t) + M_{2,2} \cdot s_2(t) - T_{2,1} \cdot o_1(t) \tag{19}$$
For the blind source separation problem to be solved, the outputs o(t) must become equal to scaled versions of the hidden sources s(t) (Jutten and Herault 1991). As long as the mixing matrix M is nonsingular, there are exactly two possible solutions that can be learned in the inhibitory connection matrix T.
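Since the diagonal of T is zero, the instantaneous equations (16) and (17) can be collected into the single matrix equation (I + T)·o(t) = i(t), so the outputs follow from one linear solve. A minimal NumPy sketch (with illustrative weight and input values of our own choosing, not taken from the model) makes this concrete:

```python
import numpy as np

def instantaneous_output(T, i):
    """Exact fixed point of the instantaneous update rule:
    o = i - T @ o  =>  (I + T) @ o = i."""
    n = T.shape[0]
    return np.linalg.solve(np.eye(n) + T, i)

# Illustrative two-neuron inhibitory weights and one input sample
T = np.array([[0.0, 0.4],
              [0.3, 0.0]])
i = np.array([1.0, 2.0])
o = instantaneous_output(T, i)

# The solution satisfies the defining equations (16) and (17)
assert np.isclose(o[0], i[0] - T[0, 1] * o[1])
assert np.isclose(o[1], i[1] - T[1, 0] * o[0])
```

When the time-stepping rule of (3) is stable, it relaxes toward this same fixed point, which is why the instantaneous form is a useful analytical stand-in.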
As previous authors have shown (and as may easily be derived from the equations above), the first situation in which correct source separation occurs is when the source indices are not permuted with respect to the outputs
$$o_1(t) = M_{1,1} \cdot s_1(t) \tag{20}$$
$$o_2(t) = M_{2,2} \cdot s_2(t) \tag{21}$$
which results in the connection matrix
$$T = \begin{pmatrix} 0 & M_{1,2}/M_{2,2} \\ M_{2,1}/M_{1,1} & 0 \end{pmatrix} \tag{22}$$
Biol Cybern (2017) 111:207–227 219
The only other possible solution occurs when the source indices are exchanged with respect to the outputs
$$o_1(t) = M_{1,2} \cdot s_2(t) \tag{23}$$
$$o_2(t) = M_{2,1} \cdot s_1(t) \tag{24}$$
and results in the connection matrix
$$T = \begin{pmatrix} 0 & M_{1,1}/M_{2,1} \\ M_{2,2}/M_{1,2} & 0 \end{pmatrix} \tag{25}$$
In general, for an N-element BSS network in this case, there are N! possible solutions differing only in the permutation of the outputs o(t) relative to the hidden sources s(t). Note that, even for our modest ten-neuron second-stage network, this still allows for 3,628,800 possible solutions!
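Both families of solutions are easy to verify numerically. In the sketch below (using an arbitrary nonsingular mixing matrix chosen for illustration, not values from the paper), each connection matrix is built from M as in (22) and (25), and the instantaneous outputs are confirmed to be scaled, possibly permuted, copies of the sources:

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[1.0, 0.3],
              [0.2, 1.5]])                 # arbitrary nonsingular mixing matrix
s = rng.standard_normal((2, 1000))         # two independent source signals
i = M @ s                                  # mixed network inputs, eqs. (14)-(15)

def outputs(T, i):
    # Instantaneous fixed point of (16)-(17): (I + T) @ o = i
    return np.linalg.solve(np.eye(2) + T, i)

# Non-permuted solution, eq. (22)
T_nonperm = np.array([[0.0, M[0, 1] / M[1, 1]],
                      [M[1, 0] / M[0, 0], 0.0]])
o = outputs(T_nonperm, i)
assert np.allclose(o[0], M[0, 0] * s[0])   # o1 = M11 * s1, eq. (20)
assert np.allclose(o[1], M[1, 1] * s[1])   # o2 = M22 * s2, eq. (21)

# Permuted solution, eq. (25)
T_perm = np.array([[0.0, M[0, 0] / M[1, 0]],
                   [M[1, 1] / M[0, 1], 0.0]])
o = outputs(T_perm, i)
assert np.allclose(o[0], M[0, 1] * s[1])   # o1 = M12 * s2, eq. (23)
assert np.allclose(o[1], M[1, 0] * s[0])   # o2 = M21 * s1, eq. (24)
```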
5.3.2 The overdetermined case
A more unusual situation in the BSS literature is the “overdetermined case” (Joho et al. 2000), in which fewer than N distinct sources are mixed to provide inputs to an N-element network. There are far fewer solutions in this case, and they differ from the typical case.
In our simplified two-neuron example, when s2(t) = 0 (or equivalently M1,2 = M2,2 = 0) while s1(t) is nonzero, achieving the nonpermuted solution of (20) and (21) can be accomplished by setting o2(t) = 0. From (19), this requires
$$T_{2,1} \cdot o_1(t) = M_{2,1} \cdot s_1(t) + M_{2,2} \cdot s_2(t) \tag{26}$$
and given that s2(t) = 0 and that our goal is to make o1(t) = M1,1 · s1(t), T2,1 can be written directly as
$$T_{2,1} = \frac{M_{2,1}}{M_{1,1}} \tag{27}$$
However, this leaves us with no constraint on T1,2, since in the network temporal evolution rule it multiplies a signal that is zero. Therefore, our only requirement is that T1,2 ≥ 0 to prevent unintentional excitation, and the resulting connection matrix is
$$T = \begin{pmatrix} 0 & T_{1,2} \\ M_{2,1}/M_{1,1} & 0 \end{pmatrix} \tag{28}$$
where T1,2 ≥ 0 (we will be able to place further constraints on T1,2 in the next section). Another possible solution with this same source condition is to permute the sources with respect to the outputs as in (23) and (24), which results in the connection matrix
$$T = \begin{pmatrix} 0 & M_{1,1}/M_{2,1} \\ T_{2,1} & 0 \end{pmatrix} \tag{29}$$
where T2,1 ≥ 0 is largely unconstrained.
By symmetry, if instead the sources switched and s1(t) = 0 (or equivalently M1,1 = M2,1 = 0) while s2(t) were nonzero, the required connection matrix for the nonpermuted case would be
$$T = \begin{pmatrix} 0 & M_{1,2}/M_{2,2} \\ T_{2,1} & 0 \end{pmatrix} \tag{30}$$
where T2,1 ≥ 0, and for the permuted case
$$T = \begin{pmatrix} 0 & T_{1,2} \\ M_{2,2}/M_{1,2} & 0 \end{pmatrix} \tag{31}$$
where T1,2 ≥ 0.
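A quick numerical check of the overdetermined solution (28), again with an illustrative mixing matrix and a single sinusoidal source of our own choosing, confirms that o2(t) is silenced for any admissible value of the unconstrained weight T1,2:

```python
import numpy as np

# Overdetermined case: s2(t) = 0, a single active source s1(t), eq. (28)
M = np.array([[1.0, 0.3],
              [0.2, 1.5]])                 # illustrative mixing matrix
t = np.linspace(0.0, 1.0, 500)
s1 = np.sin(2 * np.pi * 5 * t)             # the only nonzero hidden source
i = np.vstack([M[0, 0] * s1,               # inputs per (14)-(15) with s2 = 0
               M[1, 0] * s1])

T12_free = 0.25                            # unconstrained weight, T12 >= 0
T = np.array([[0.0, T12_free],
              [M[1, 0] / M[0, 0], 0.0]])   # connection matrix of (28)

o = np.linalg.solve(np.eye(2) + T, i)      # instantaneous fixed point

assert np.allclose(o[1], 0.0)              # o2(t) is silenced
assert np.allclose(o[0], M[0, 0] * s1)     # o1(t) = M11 * s1(t), eq. (20)
```

Changing T12_free to any other nonnegative value (within the stability bound derived below) leaves both assertions intact, since T1,2 multiplies the silenced output.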
In general, ignoring the indeterminate (and irrelevant) matrix values, for an N-element BSS network in the overdetermined case with only m < N nonzero sources, there are
$$\left(\frac{N!}{(N-m)!}\right)^2$$
possible solutions, corresponding to all possible permutations of the m nonzero outputs with any given pattern of m sources, and all permutations of those m sources with respect to the inputs.
While still not trivially small, this is vastly fewer solutions than for the typical case, by a factor of $((N-m)!)^2 / N!$. For our ten-neuron second-stage network, if only two distinct sources are presented, the number of possible solutions is reduced to a mere 8100.
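These counts are simple to verify with a few lines of combinatorics (the helper function name is ours, used purely for illustration):

```python
from math import factorial

def overdetermined_solutions(N, m):
    """Number of valid BSS solutions for an N-element network driven
    by only m < N nonzero sources: (N! / (N - m)!)^2."""
    return (factorial(N) // factorial(N - m)) ** 2

# Typical case for the ten-neuron network: N! permutations
assert factorial(10) == 3_628_800

# Overdetermined case with two distinct sources: (10!/8!)^2 = 90^2
assert overdetermined_solutions(10, 2) == 8_100
```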
5.3.3 Stability of network temporal evolution
The values of the connection matrix T given above represent solutions to the BSS problem in a two-neuron network, but were derived using an approximate temporal evolution rule. Will a recurrent network with this connection matrix using the time-stepping update rule of (3) be stable? We have earlier addressed conditions for stability of this temporal evolution rule and concluded that stability simply requires that all eigenvalues of the connection matrix T have magnitude less than unity.
For the specific values of T given in the “typical case” above, it is possible to compute the eigenvalues directly from the characteristic polynomial, and thus the conditions for stability of the connection matrices of (22) and (25) can be shown, respectively, to be
$$M_{1,2} \cdot M_{2,1} < M_{1,1} \cdot M_{2,2} \tag{32}$$
$$M_{1,2} \cdot M_{2,1} > M_{1,1} \cdot M_{2,2} \tag{33}$$
These two conditions are mutually exclusive, meaning that
for any given nonsingular mixing matrix M, only one of the
two connection matrices given in the typical case for the two-
neuron network can be temporally stable.
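This mutual exclusivity follows from the eigenvalues of T, which for a two-neuron matrix [[0, a], [b, 0]] are ±√(ab). A brief numerical illustration (the mixing values are our own, chosen so that M1,2·M2,1 < M1,1·M2,2):

```python
import numpy as np

def spectral_radius(T):
    """Largest eigenvalue magnitude; < 1 means stable time-stepping."""
    return max(abs(np.linalg.eigvals(T)))

M = np.array([[1.0, 0.3],
              [0.2, 1.5]])   # illustrative mixing: M12*M21 < M11*M22

T_nonperm = np.array([[0.0, M[0, 1] / M[1, 1]],     # eq. (22)
                      [M[1, 0] / M[0, 0], 0.0]])
T_perm = np.array([[0.0, M[0, 0] / M[1, 0]],        # eq. (25)
                   [M[1, 1] / M[0, 1], 0.0]])

# Exactly one of the two solutions is temporally stable, per (32)-(33)
assert spectral_radius(T_nonperm) < 1.0   # condition (32) holds for this M
assert spectral_radius(T_perm) >= 1.0     # condition (33) fails for this M
```

Swapping to a mixing matrix with M1,2·M2,1 > M1,1·M2,2 reverses which of the two matrices is stable.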
The computation of eigenvalues may also be carried out for the connection matrices derived for the overdetermined case. For each matrix in (28) through (31), this computation results, respectively, in the conditions for stability
$$\frac{M_{1,1}}{M_{2,1}} > T_{1,2} \geq 0 \tag{34}$$
$$\frac{M_{2,1}}{M_{1,1}} > T_{2,1} \geq 0 \tag{35}$$
$$\frac{M_{2,2}}{M_{1,2}} > T_{2,1} \geq 0 \tag{36}$$
$$\frac{M_{1,2}}{M_{2,2}} > T_{1,2} \geq 0 \tag{37}$$
Under its respective condition, each of the matrices derived
for the overdetermined case is a stable state of the network.
Thus, only in the typical case does the requirement for stability of the time evolution rule eliminate potential solutions: half of them for the two-neuron network, and in the general N-neuron situation a potentially large number of the N! theoretically possible connection matrices. In the overdetermined case, where fewer solutions exist to begin with, the requirement for stability places an upper bound on the unconstrained weight but eliminates no potential solutions.
This reduction in the number of valid solutions to the BSS
problem could be a crucially important consequence of using
the time-stepping network evolution rule of (3) rather than
the conventional approximation of (4).
5.3.4 Stability of the network learning rule
Assuming a given connection matrix T does represent a solution to the BSS problem and has eigenvalue magnitudes less than unity, thus allowing network temporal evolution to be stable, do these solutions represent stable states of the learning rule of (5)? If not, they could never be learned by our neural network model.
For a given inhibitory connection matrix T to be a stable state of the learning rule, updates to T made by the learning rule of (5) must be zero-mean over time. This can be expressed as
$$E\!\left[\frac{dT_{1,2}}{dt}\right] = E\!\left[\gamma \cdot g(o'_1(t)) \cdot f(o'_2(t))\right] = 0 \tag{38}$$
$$E\!\left[\frac{dT_{2,1}}{dt}\right] = E\!\left[\gamma \cdot g(o'_2(t)) \cdot f(o'_1(t))\right] = 0 \tag{39}$$
where E[x] formally represents the expected value of random variable x, but may also be interpreted for deterministic time variables as the average value of x(t) over some fixed period of time. The fundamental BSS assumption that hidden sources are statistically independent is a requirement for stability of the learning rule, but as discussed previously, the rule may also stabilize for distinctly varying deterministic sources.
Assuming that the network does indeed solve the BSS problem, resulting in one of the connection matrices shown above, we may substitute in high-pass filtered versions of (20) and (21) (both of which are satisfied in all typical and overdetermined cases presented above) and factor out the constant γ, so that (38) and (39) become
$$E\!\left[g(M_{1,1} \cdot s'_1(t)) \cdot f(M_{2,2} \cdot s'_2(t))\right] = 0 \tag{40}$$
$$E\!\left[g(M_{2,2} \cdot s'_2(t)) \cdot f(M_{1,1} \cdot s'_1(t))\right] = 0 \tag{41}$$
The constants from the mixing matrix Min these equa-
tions may be factored out without changing the required
conditions to make the equations true, so let us look more
closely at the effect of the expected value operator on the
high-pass filtered sources and the learning rule activation functions f(x) and g(x).
Let x1(t) = s′1(t) and x2(t) = s′2(t). The function of the high-pass filter is to remove low frequencies (the mean, or expected value, being the lowest possible frequency), so due to the operation of the filter
$$E[x_1(t)] = E[x_2(t)] = 0 \tag{42}$$
Thus, both x1(t) and x2(t) are zero-mean random variables due to the action of the high-pass filter, and they retain the statistical independence of the sources s1(t) and s2(t) after the linear filtering operation.
In our learning rule, the nonlinear functions g(x) = tanh(x) and f(x) = x³ are applied, respectively, to the statistically independent zero-mean random variables x1(t) and x2(t). The conditions for stability of learning rules of this form have been studied in detail by Sorouchyari (1991). By approximating the hyperbolic tangent function with a Taylor series to the third-order term (a good approximation for our model’s normalized inputs), it can be shown that this stability is conditioned on the third moment about zero (the skewness, a measure of the asymmetry of the probability density function) of one or both of the random variables x1(t) and x2(t) being zero. This directly requires that, for stability of the learning rule, the skewness of one or both of the hidden sources s1(t) and s2(t) must be zero.
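This condition is easy to probe numerically. The following Monte Carlo sketch (with distributions of our own choosing) shows that the expected learning-rule updates vanish when at least one of the independent, zero-mean variables has zero skewness, even if the other is strongly skewed:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two independent zero-mean variables: x1 symmetric (zero skewness),
# x2 deliberately skewed (a shifted exponential)
x1 = rng.standard_normal(n)
x2 = rng.exponential(1.0, n) - 1.0

g = np.tanh               # g(x) = tanh(x)
f = lambda x: x ** 3      # f(x) = x^3

# Sample estimates of the expected weight updates in (38) and (39)
u12 = np.mean(g(x1) * f(x2))   # ~ E[tanh(x1)] * E[x2^3] by independence
u21 = np.mean(g(x2) * f(x1))   # ~ E[tanh(x2)] * E[x1^3] by independence

assert abs(u12) < 0.1   # vanishes: tanh(x1) has zero mean by symmetry
assert abs(u21) < 0.1   # vanishes: x1^3 has zero mean (zero skewness)
```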
Note that, unlike network temporal stability, this condition for stability of the learning rule requires only that the connection matrices T solve the visual binding problem. The conditions for learning stability are primarily placed on the statistics of the hidden sources s(t).
For distinctly varying deterministic sources, this requirement would imply that the mean value over some period of time of each cubed hidden source signal s³(t) be zero, which is true of many functions, including the sinusoidal sources used in the experiment of Fig. 3 and in the examples given below.
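For instance, the time average of a cubed sinusoid over an integer number of periods is zero, which a two-line check confirms:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 10_000, endpoint=False)
s = np.sin(2 * np.pi * 5 * t)        # five full periods of a sinusoid

# The mean of the cubed source over whole periods vanishes (zero skewness)
assert abs(np.mean(s ** 3)) < 1e-9
```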
In the more general case of visual binding, these hid-
den sources represent temporal fluctuations of visual features
caused by unpredictable changes in lighting, occlusion, dis-
tance, or similar effects. Small or zero skewness of these
visual fluctuations is extremely plausible, and thus in this
case stability of the learning rule is very likely as well.
5.3.5 Role of the activation functions in learning
If the linear weighting functions g(x) = x and f(x) = x were used in the learning rule, changes to the diagonally opposite elements of the weight matrix, dTn,k/dt and dTk,n/dt, would necessarily be equal, resulting in equal values of Tn,k and Tk,n and thus a purely Hebbian, diagonally symmetric connection matrix T.
Based on the BSS theory presented above, this learning rule could never solve the overdetermined case, in which all correct weight matrices are asymmetric, and could only solve