ORIGINAL RESEARCH ARTICLE
published: 15 December 2014
doi: 10.3389/fncom.2014.00160
Sparsey™: event recognition via deep hierarchical sparse
distributed codes
Gerard J. Rinkus*†
Neurithmic Systems LLC, Newton, MA, USA
Edited by:
Antonio J. Rodriguez-Sanchez,
University of Innsbruck, Austria
Reviewed by:
Alessandro Treves, Scuola
Internazionale Superiore di Studi
Avanzati, Italy
Marc Pomplun, University of
Massachusetts Boston, USA
*Correspondence:
Gerard J. Rinkus, Neurithmic
Systems LLC, 275 Grove St.,
Suite 2-4069, Newton, MA, USA
e-mail: grinkus@brandeis.edu
†
Present address:
Gerard J. Rinkus, Visiting Scientist,
Lisman Lab, Biology, Brandeis
University, Waltham, MA, USA
The visual cortex’s hierarchical, multi-level organization is captured in many biologically
inspired computational vision models, the general idea being that progressively larger
scale (spatially/temporally) and more complex visual features are represented in
progressively higher areas. However, most earlier models use localist representations
(codes) in each representational field (which we equate with the cortical macrocolumn,
“mac”), at each level. In localism, each represented feature/concept/event (hereinafter
“item”) is coded by a single unit. The model we describe, Sparsey, is hierarchical as
well but crucially, it uses sparse distributed coding (SDC) in every mac in all levels. In
SDC, each represented item is coded by a small subset of the mac’s units. The SDCs
of different items can overlap and the size of overlap between items can be used to
represent their similarity. The difference between localism and SDC is crucial because
SDC allows the two essential operations of associative memory, storing a new item and
retrieving the best-matching stored item, to be done in fixed time for the life of the model.
Since the model’s core algorithm, which does both storage and retrieval (inference),
makes a single pass over all macs on each time step, the overall model’s storage/retrieval
operation is also fixed-time, a criterion we consider essential for scalability to huge
(“Big Data”) problems. A 2010 paper described a nonhierarchical version of this model
in the context of purely spatial pattern processing. Here, we elaborate a fully hierarchical
model (arbitrary numbers of levels and macs per level), describing novel model principles
like progressive critical periods, dynamic modulation of principal cells’ activation functions
based on a mac-level familiarity measure, representation of multiple simultaneously active
hypotheses, a novel method of time warp invariant recognition, and we report results
showing learning/recognition of spatiotemporal patterns.
Keywords: sparse distributed codes, cortical hierarchy, sequence recognition, event recognition, deep learning,
critical periods, time warp invariance
INTRODUCTION
In this paper, we provide the hierarchical elaboration of the
macro/mini-column model of cortical computation described in
Rinkus (1996, 2010) which is now named Sparsey. We report
results of initial experiments involving multi-level models with
multiple macrocolumns (“macs”) per level, processing spatiotem-
poral patterns, i.e., “events.” In particular, we show: (a) single-
trial unsupervised learning of sequences where this learning
results in the formation of hierarchical spatiotemporal memory
traces; and (b) recognition of training sequences, i.e., exact or
nearly exact reactivation of complete hierarchical traces over all
frames of a sequence. The canonical macrocolumnar algorithm—
which probabilistically chooses a sparse distributed code (SDC)
as a function of a mac’s entire input, i.e., its bottom-up (U), hori-
zontal (H), and top-down (D) input vectors, at a given moment—
operates similarly, modulo parameters, in both learning and
recognition, in all macs at all levels. Computationally, Sparsey’s
most important property is that a mac both stores (learns) new
input items—which in general are temporal-context-dependent
inputs, i.e., particular spatiotemporal moments—and retrieves
the spatiotemporally closest-matching stored item in time that
remains fixed as the number of items stored in the mac increases.
This property depends critically on the use of SDCs, is essential
for scalability to “Big Data” problems, and has not been shown for
any other computational model, biologically inspired or not!
The model has a number of other interesting neurally plau-
sible properties, including the following. (1) A “critical period”
concept wherein learning is frozen in a mac’s afferent synaptic
projections when those projections reach a threshold saturation.
In a hierarchical setting, freezing will occur beginning with the
lowest level macs (analogous to primary sensory cortex) and
progress upward over the course of experience. (2) A “progressive
persistence” property wherein the activation duration (persis-
tence) of the “neurons” (and thus of the SDCs which are sets
of co-active neurons) increases with level; there is some evidence
for increasing persistence along the ventral visual path (Rolls and
Tovee, 1994; Uusitalo et al., 1997; Gauthier et al., 2012). This
allows an SDC in a mac at level J to associate with sequences of
SDCs in Level J-1 macs with which it is connected, i.e., a chunking
(compression) mechanism. In particular, this provides a means to
learn in unsupervised fashion perceptual invariances produced by
continuous transforms occurring in the environment (e.g., rota-
tion, translation, etc.). Rolls’ VisNet model, introduced in Rolls
(1992) and reviewed in Rolls (2012), uses a similar concept to
explain learning of naturally-experienced transforms, although
his trace-learning-rule-based implementation differs markedly
from ours. (3) During learning, an SDC is chosen on the basis of
signals arriving from all active afferent neurons in the mac’s total
(U, H, and D) receptive field (RF). However, during retrieval, if
the highest-order match, i.e., involving all three (U, H, and D)
input sources, falls below a threshold, the mac considers a pro-
gression of lower-order matches, e.g., involving only its U and D
inputs, but ignoring its H inputs, and if that also falls below a
threshold, a match involving only its U inputs. This “back-off ”
protocol, in conjunction with progressive persistence, allows the
model to rapidly—crucially, without increasing the time complexity
of closest-match retrieval—compare a test sequence (e.g., video
snippet) not only to the set of all sequences actually experienced
and stored, but to a much larger space of nonlinearly time-warped
variants of the actually-
experienced sequences. (4) During retrieval, multiple competing
hypotheses can momentarily (i.e., for one or several frames) be
co-active in any given mac and resolve to a single hypothesis as
subsequent disambiguating information enters.
While the results reported herein are specifically for the unsu-
pervised learning case, Sparsey also implements supervised learn-
ing in the form of cross-modal unsupervised learning, where one
of the input modalities is treated as a label modality. That is,
if the same label is co-presented with multiple (arbitrarily dif-
ferent) inputs in another (raw sensory) modality, then a single
internal representation of that label can be associated with the
multiple (arbitrarily different) internal representations of the sen-
sory inputs. That internal representation of the label then de facto
constitutes a representation of the class that includes all those sen-
sory inputs regardless of how different they are, providing the
model a means to learn essentially arbitrarily nonlinear categories
(invariances), i.e., instances of what Bengio terms “AI Set” prob-
lems (Bengio, 2007). Although we describe this principle in this
paper, its full elaboration and demonstration in the context of
supervised learning will be treated in a future paper.
Regarding the model’s possible neural realization, our pri-
mary concern is that all of the model’s formal structural and
dynamic properties/mechanisms be plausibly realizable by known
neural principles. For example, we do not give a detailed neural
model of the winner-take-all (WTA) competition that we hypoth-
esize to take place in the model’s minicolumns, but rather rely
on the plausibility of any of the many detailed models of WTA
competition in the literature (e.g., Grossberg, 1973; Yu et al.,
2002; Knoblich et al., 2007; Oster et al., 2009; Jitsev, 2010). Nor
do we give a detailed neural model for the mac’s computation
of the overall spatiotemporal familiarity of its input (the “G”
measure), or for the G-contingent modulation of neurons’ acti-
vation functions. Furthermore, the model relies only upon binary
neurons and a simple synaptic learning model. This paper is
really most centrally an explanation of why and how the use
of SDC in conjunction with hierarchy provides a computation-
ally efficient, scalable, and neurally plausible solution to event
(i.e., single- or multimodal spatiotemporal pattern) learning and
recognition.
OVERALL MODEL CONCEPT
The remarkable structural homogeneity across the neocortical
sheet suggests a canonical circuit/algorithm, i.e., a core com-
putational module, operating similarly in all regions (Douglas
et al., 1989; Douglas and Martin, 2004). In addition, DiCarlo
et al. (2012) present compelling first-principles arguments based
on computational efficiency and evolution for a macrocolumn-
sized canonical functional module whose goal they describe as
“cortically local subspace untangling.” We also identify the canon-
ical functional module with the cortical “macrocolumn” (a.k.a.
“hypercolumn” in V1, or “barrel”-related volumes in rat/mouse
primary somatosensory cortex), i.e., a volume of cortex, ∼200–
500 um in diameter, and will refer to it as a “mac.” In our view,
the mac’s essential function, or “meta job description,” in the
terms of DiCarlo et al. (2012), is to operate as a semi-autonomous
content-addressable memory. That is, the mac:
(a) assigns (stores, learns) neural codes, specifically sparse dis-
tributed codes (SDCs), representing its global (i.e., combined
U, H, and D) input patterns; and
(b) retrieves (reactivates) stored codes, i.e., memories, on sub-
sequent occasions when the global input pattern matches a
stored code sufficiently closely.
If the mac’s learning process ensures that similar inputs map to
similar codes (SISC), as Sparsey’s does, then operating as a content
addressable memory is functionally equivalent to local subspace
untangling.
Although the majority of neurophysiological studies through
the decades have formalized the responses of cortical neurons in
terms of purely spatial receptive fields (RFs), evidence revealing
the truly spatiotemporal nature of neuronal RFs is accumulat-
ing (DeAngelis et al., 1993, 1999; Rust et al., 2005; Gavornik and
Bear, 2014; Ramirez et al., 2014). In our mac model, time is dis-
crete: U signals arrive from neurons active on the current time
step while H and D signals arrive from neurons active on the pre-
vious time step. We can view the combined U, H, and D inputs as a
“context-dependent U input” (where the H and D signals are con-
sidered the “context”) or more holistically, as an overall particular
spatiotemporal moment (as suggested earlier).
As will be described in detail, the first step of the mac’s canon-
ical algorithm, during both learning and retrieval, is to combine
its U, H, and D inputs to yield a (scalar) judgment, G, as to the
spatiotemporal familiarity of the current moment. Provided the
number of codes stored in the mac is small enough, G measures
the spatiotemporal similarity of the best matching stored moment,
x, to the current moment, I.
$$G = \max_{x}\,\operatorname{sim}(I, x)$$
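As a toy illustration of this definition only (not Sparsey's actual G computation, which is defined over a mac's combined U, H, and D inputs by the CSA described below), the familiarity of a binary input pattern can be read as its best normalized-overlap match against the stored patterns. The function and variable names here are ours, and the similarity measure is an assumed stand-in:

```python
# Illustrative sketch of G = max_x sim(I, x): familiarity as the best match
# between the current binary pattern and the previously stored patterns.
import numpy as np

def similarity(a, b):
    """Shared active features / larger active-feature count (assumed measure)."""
    return np.sum(a & b) / max(np.sum(a), np.sum(b), 1)

def familiarity_G(current, stored_inputs):
    """G ~ similarity of the closest-matching stored input, in [0,1]."""
    if not stored_inputs:
        return 0.0
    return max(similarity(current, s) for s in stored_inputs)

stored = [np.array([1, 1, 0, 0, 1]), np.array([0, 1, 1, 1, 0])]
print(familiarity_G(np.array([1, 1, 0, 1, 1]), stored))  # 0.75
```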
Figure I-1 shows the envisioned correspondence of Sparsey to
the cortical macrocolumn. In particular, we view the mac’s sub-
population of L2/3 pyramidals as the actual repository of SDCs.
And even more specifically, we postulate that the ∼20 L2/3
FIGURE I-1 | Proposed correspondence between the cortical
macrocolumn and Sparsey’s mac. Left: schematic of a cortical
macrocolumn composed of ∼70 minicolumns (green cylinder). SDCs
representing context-dependent inputs reside in mac’s L2/3 population. An
SDC is a set composed of one active L2/3 pyramidal cell per minicolumn.
Upper Right: 2-photon calcium image of activity in a mac-sized area of cat
V1 given a left-moving vertical bar in the mac’s RF; we have added dashed
hexagonal boundary to suggest the boundary of macrocolumn/
hypercolumn module (adapted from Ohki et al., 2005). Lower Right: two
formats that we use to depict macs; they show only the L2/3 cells. The
hexagonal format mac has 10 minicolumns each with seven cells. The
rectangular format mac has nine minicolumns each with nine cells. Note
that in these formats, active cells are black (or red as in many subsequent
figures); inactive cells are white.
pyramidals in each of the mac’s ∼70 minicolumns function in
WTA fashion. Thus, a single SDC code will consist of 70 L2/3
pyramidals, one per minicolumn. Note: we also refer to mini-
columns as competitive modules (CMs). Two-photon calcium
imaging movies, e.g., Ohki et al. (2005), Sadovsky and MacLean
(2014), provide some support for the existence of such macro-
columnar SDCs as they show numerous instances of ensembles,
consisting of from several to hundreds of neurons, often span-
ning several 100 um, turning on and off as tightly synchronized
wholes. We anticipate that the recently developed super-fast volt-
age sensor ASAP1 (St-Pierre et al., 2014) may allow much higher
fidelity testing of SDCs and Sparsey in general.
Figure I-2 (left) illustrates the three afferent projections to a
particular mac at level L1 (analog of cortical V1), M^1_i (i.e., the ith
mac at level L1). The red hexagon at L0 indicates M^1_i's aperture
onto the thalamic representation of the visual space, i.e., its clas-
sical receptive field (RF), which we can refer to more specifically as
M^1_i's U-RF. This aperture consists of about 40 binary pixels con-
nected all-to-all with M^1_i's cells; black arrows show representative
U-weights (U-wts) from two active pixels. Note that we assume
that visual inputs to the model are filtered to single-pixel-wide
edges and binarized. The blue semi-transparent prism represents
the full bundle of U-wts comprising M^1_i's U-RF.
The all-to-all U-connectivity within the blue prism is essen-
tial because the concept of the RF of a mac as a whole, not of
an individual cell, is central to our theory. This is because the
“atomic coding unit,” or equivalently, the “atomic unit of mean-
ing” in this theory is the SDC, i.e., a set of cells. The activation
of a mac, during both learning and recognition, consists in the
activation of an entire SDC, i.e., simultaneous activation of one
cell in every minicolumn. Similarly, deactivation of a mac con-
sists in the simultaneous deactivation of all cells comprising the
SDC (though in general, some of the cells contained in a mac’s
currently active SDC might also be contained in the next SDC to
become active in that mac). Thus, in order to be able to view an
SDC as collectively (or atomically) representing the input to a mac
as a whole, all cells in a mac must have the same RF (the same set
of afferent cells). This scenario is assumed throughout this report.
In Figure I-2, magenta lines represent the D-wts comprising
M^1_i's afferent D projection, or D-RF. In this case, M^1_i's D-RF con-
sists of only one L2 (analog of V2) mac, M^2_j, which is all-to-all
connected to M^1_i (representative D-wts from just two of M^2_j's cells
are shown). Any given mac also receives complete H-projections
from all nearby macs in its own level (including itself) whose cen-
ters fall within a parameter-specifiable radius of its own center.
Signals propagating via H-wts are defined to take one time step
(one sequence item) to propagate. Green arrows show a small rep-
resentative sample of H-wts mediating signals arriving from cells
active on the prior time step (gray). Red indicates cells active on
the current time step. At right of Figure I-2, we zoom in on one of
M^1_i's minicolumns (CMs) to emphasize that every cell in a CM
has the same H-, U-, and D-RFs. Figure I-3 further illustrates
FIGURE I-2 | Detail of afferent projections to a mac. See text for description.
(using the rectangular format for depicting macs) the concept that
all cells in a given mac have the same U-, H-, and D-RFs and that
those RFs respect the borders of the source macs. Each cell in the
L1 mac, M^1_(2,2) (here we use an alternate (x,y) coordinate indexing
convention for the macs), receives a D-wt from all cells in all five
L2 macs indicated, an H-wt from all cells in M^1_(2,2) and its N, S, E,
and W neighboring macs (green shading), and a U-wt from all 36
cells in the indicated aperture.
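To make the wiring rule concrete, the following is a minimal sketch, with data-structure and parameter names of our own invention, of how this border-respecting, mac-level connectivity could be specified: every cell in a mac shares the mac's U-, H-, and D-RFs, and an H-RF is built from the same-level macs within a given radius of the mac's center (Manhattan radius 1 here, giving the self-plus-N,S,E,W neighborhood of the example above):

```python
# Notional data structures for the connectivity scheme of Figure I-3.
from dataclasses import dataclass, field

@dataclass
class Mac:
    level: int
    xy: tuple          # (x, y) position of the mac within its level
    Q: int = 9         # number of CMs (minicolumns)
    K: int = 9         # cells per CM
    u_rf: list = field(default_factory=list)  # source macs/apertures below
    h_rf: list = field(default_factory=list)  # same-level macs (incl. self)
    d_rf: list = field(default_factory=list)  # source macs above

def build_h_rf(mac, level_macs, radius=1):
    """H-RF = all same-level macs whose centers fall within `radius`
    (Manhattan distance, purely for illustration) of this mac's center."""
    x, y = mac.xy
    return [m for m in level_macs
            if abs(m.xy[0] - x) + abs(m.xy[1] - y) <= radius]

# A 5 x 4 sheet of L1 macs, as in Figure I-3.
L1 = [Mac(level=1, xy=(x, y)) for x in range(5) for y in range(4)]
m22 = next(m for m in L1 if m.xy == (2, 2))
m22.h_rf = build_h_rf(m22, L1, radius=1)
print(len(m22.h_rf))  # 5 macs: self plus its N, S, E, and W neighbors
```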
The hierarchical organization of visual cortex is captured in
many biologically inspired computational vision models with the
general idea being that progressively larger scale (both spatially
and temporally) and more complex visual features are repre-
sented in progressively higher areas (Riesenhuber and Poggio,
1999; Serre et al., 2005). Our cortical model, Sparsey, is hierar-
chical as well, but as noted above, a crucial, in fact, the most
crucial difference between Sparsey and most other biologically
inspired vision models is that Sparsey encodes information at
all levels of the hierarchy, and in every mac at every level, with
SDCs. This stands in contrast to models that use localist repre-
sentations, e.g., all published versions of the HMAX family of
models (e.g., Murray and Kreutz-Delgado, 2007; Serre et al.,
2007) and other cortically-inspired hierarchical models (Kouh
and Poggio, 2008; Litvak and Ullman, 2009; Jitsev, 2010) and
the majority of graphical probability-based models (e.g., hidden
Markov models, Bayesian nets, dynamic Bayesian nets). There
are several other models for which SDC is central, e.g., SDM
(Kanerva, 1988, 1994, 2009; Jockel, 2009), Convergence-Zone
Memory (Moll and Miikkulainen, 1997), Associative-Projective
Neural Networks (Rachkovskij, 2001; Rachkovskij and Kussul,
2001), Cogent Confabulation (Hecht-Nielsen, 2005), Valiant’s
“positive shared” representations (Valiant, 2006; Feldman and
Valiant, 2009), and Numenta’s Grok (described in Numenta white
papers). However, none of these models has been substantially
elaborated or demonstrated in an explicitly hierarchical archi-
tecture and most have not been substantially elaborated for the
spatiotemporal case.
Figure I-4 illustrates the difference between a localist, e.g., an
HMAX-like, model and the SDC-based Sparsey model. The input
level (analogous to thalamus) is the same in both cases: each small
gray/red hexagon in the input level represents the aperture (U-
RF) of a single V1 mac (gray/red hexagon). In Figure I-4A, the
representation used in each mac (at all levels) is localist, i.e., each
feature is represented by a single cell and at any one time, only
one cell (feature) is active (red) in any given mac (here the cell is
depicted with an icon representing the feature it represents). In
contrast, in Figure I-4B, any particular feature is represented by
a set of co-active cells (red), one in each of a mac’s minicolumns:
compare the two macs at lower left of Figure I-4A with the cor-
responding macs in Figure I-4B (blue and brown arrows). Any
given cell will generally participate in the codes of many different
features. A yellow call-out shows codes for other features stored
in the mac, besides the feature that is currently active. If you look
closely, you can see that for some macs, some cells are active in
more than one of the codes.
FIGURE I-3 | Connectivity scheme. Within each of the three afferent
projections, H, U, and D, to a mac, M^1_(2,2) (where the mac index is now in
terms of (x,y) coordinates in the level, and we have switched to the
rectangular mac topology), the connectivity is full and respects mac borders.
L1 is a 5 × 4 sheet of macs (blue borders), each consisting of 36 minicolumns
(pink borders), but the scale is too small to see the individual cells within
minicolumns. L2 is a 4 × 3 sheet of macs, each consisting of nine CMs, each
consisting of nine cells.
Looking at Figure I-4A, adapted from Serre et al. (2005),
one can see the basic principle of hierarchical compositionality
in action. The two neighboring apertures (pink) over the dog’s
nose lead to activation of cells representing a vertical and a
horizontal feature in neighboring V1 macs. Due to the con-
vergence/divergence of U-projections to V2, both of these cells
project to the cells in the left-hand V2 mac. Each of these cells
projects to multiple cells in that V2 mac, however, only the red
(active) cell representing an “upper left corner” feature, is max-
imally activated by the conjunction of these two V1 features.
Similarly, the U-signals from the cell representing the “diagonal”
feature active in the right-hand V1 mac will combine with signals
representing features in nearby apertures to activate the appropriate
higher-level feature in the V2 mac whose U-RF includes these
apertures (small dashed circles in the input level). Note that some
notion of competition (e.g., the “max” operation in HMAX mod-
els) operates amongst the cells of a mac such that at any one time,
only one cell (one feature) can be active.
We underscore that in Figure I-4, we depict simple (solid bor-
der) and complex (dashed border) features within individual
macs, implying that complex and simple features can compete
with each other. We believe that the distinction between simple
and complex features may be largely due to coarseness of older
experimental methods (e.g., using synthetic low-dimensional
stimuli): newer studies are revealing far more precise tuning func-
tions (Nandy et al., 2013), including temporal context specificity,
even as early as V1 (DeAngelis et al., 1993, 1999), and in other
modalities, somatosensory (Ramirez et al., 2014)andauditory
(Theunissen and Elie, 2014).
The same hierarchical compositional scheme as between V1
and V2 continues up the hierarchy (some levels not shown),
causing activation of progressively higher-level features. At higher
levels, we typically call them concepts, e.g., the visual concept of
“Jennifer Aniston,” the visual concept of the class of dogs, the
visual concept of a particular dog, etc. We show most of the fea-
tures at higher levels with dashed outlines to indicate that they
are complex features, i.e., features with particular, perhaps many,
dimensions of invariance, most of which are learned through
experience. In Sparsey, the particular invariances are learned
from scratch and will generally vary from one feature/concept to
FIGURE I-4 | Comparison of a localist (A) and an SDC-based (B) hierarchical vision model. See text.
another, including within the same mac. The particular features
shown in the different macs in this example are purely notional:
it is the overall hierarchical compositionality principle that is
important, not the particular features shown, nor the particular
cortical regions in which they are shown.
The hierarchical compositional process described above in the
context of the localist model of Figure I-4A applies to the SDC-
based model in Figure I-4B as well. However, features/concepts
are now represented by sets of cells rather than single cells. Thus,
the vertical and horizontal features forming part of the dog’s nose
are represented with SDCs in their respective V1 macs (blue and
brown arrows, respectively), rather than with single cells. The U-
signals propagating from these two V1 macs converge on the cells
of the left-hand V2 mac and combine, via Sparsey’s code selection
algorithm (CSA) (described in Section Sparsey’s Core Algorithm),
to activate the SDC representing the “corner” feature, and simi-
larly on up the hierarchy. Each of the orange outlined insets at V2
shows the input level aperture of the corresponding mac, empha-
sizing the idea that the precise input pattern is mapped into the
closest-matching stored feature, in this example, an “upper left 90°
corner” at left and a “NNE-pointing 135° angle” at right. The
inset at bottom of Figure I-4B zooms in to show that the U-
signals to V1 arise from individual pixels of the apertures (which
would correspond to individual LGN projection cells).
In the past, IT cells have generally been depicted as being
narrowly selective to particular objects (Desimone et al., 1984;
Kreiman et al., 2006; Kiani et al., 2007; Rust and DiCarlo, 2010).
However, as DiCarlo et al. (2012) point out, the data overwhelm-
ingly support the view of individual IT cells as having a “diversity
of selectivity”; that is, individual IT cells generally respond to
FIGURE I-5 | Notional mapping of Sparsey to brain.
many different objects and in that sense are much more broadly
tuned. This diversity is notionally suggested in Figures I-4B, I-5
in that individual cells are seen to participate in multiple SDCs
representing different images/concepts. However, the particular
input (stimulus) dimensions for which any given cell ultimately
demonstrates some degree of invariance is not prescribed a priori.
Rather they emerge essentially idiosyncratically over the history
of a cell’s inclusions in SDCs of particular experienced moments.
Thus, the dimensions of invariance in the tuning functions of
even immediately neighboring cells may generally end up quite
different.
Figure I-5 embellishes the scheme shown in Figure I-4B and
(turning it sideways) casts it onto the physical brain. We add
paths from V1 and V2 to an MT representation as well. We add
a notional PFC representation in which a higher-level concept
involving the dog, i.e., the fact that it is being walked, is active. We
show a more complete tiling of macs at V1 than in Figure I-4B
to emphasize that only V1 macs that have a sufficient fraction
of active pixels, e.g., an edge contour, in their aperture become
active (pink). In general, we expect the fraction of active macs to
decrease with level. As this and prior figures suggest, we currently
model the macs as having no overlap with each other (i.e., they
tile the local region), though their RFs [as well as their projec-
tive fields (PFs)] can overlap. However, we expect that in the real
brain, macs can physically overlap. That is, any given minicolumn
could be contained in multiple overlapping macs, where only one
of those macs can be active at any given moment. The degree of
overlap could vary by region, possibly generally increasing anteri-
orly. If so, then this would partially explain (in conjunction with
the extremely limited view of population activity that single/few-
unit electrophysiology has provided through most of the history
of neuroscience) why there has been little evidence thus far for
macs in more frontal regions.
SPARSE DISTRIBUTED CODES vs. LOCALIST CODES
One important difference between SDC and localist representa-
tion is that the space of representations (codes) for a mac using
SDC is exponentially larger than for a mac using a localist repre-
sentation. Specifically, if Q is the number of CMs in a mac and
K is the number of cells per CM, then there are K^Q unique SDC
codes for that mac. A localist mac of the same size only has Q × K
unique codes. Note that it is not the case that an SDC-based mac
can use that entire code space, i.e., store K^Q
features. Rather, the
limiting factor on the number of codes storable in an SDC-based
mac is the fraction of the mac’s afferent synaptic weights that are
set high (our model uses effectively binary weights), i.e., degree
of saturation. In fact, the number of codes storable such that
all stored codes can be retrieved with some prescribed average
retrieval accuracy (error), is probably a vanishingly small frac-
tion of the entire code space. However, real macrocolumns have
Q ≈ 70 minicolumns, each with K ≈ 20 L2/3 principal cells: a
“vanishingly small fraction” of 20^70 can of course still be a large
absolute number of codes.
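To make the size contrast concrete, the two-line calculation below uses the macrocolumn figures quoted above (Q ≈ 70, K ≈ 20); it is purely illustrative arithmetic, not part of the model:

```python
# Code-space sizes for a mac with Q CMs of K cells each (Q=70, K=20 as in
# the text): a localist mac has Q*K distinct codes, an SDC mac has K**Q.
Q, K = 70, 20
localist_codes = Q * K
sdc_codes = K ** Q
print(localist_codes)          # 1400
print(len(str(sdc_codes)))     # 92 -> K**Q has ~92 digits (on the order of 10^91)
```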
While the difference in code space size between localist and
SDC models is important, it is the distributed nature of the SDC
codes per se that is most important. Many have pointed out a key
property of SDC which is that since codes overlap, the number of
cells in common between two codes can be used to represent their
similarity. For example, if a given mac has Q = 100 CMs, then
there are 101 possible degrees of intersection between codes, and
thus 101 degrees of similarity, which can be represented between
concepts stored in that mac. The details of the process/algorithm
that assigns codes to inputs determines the specific definition
of similarity implemented. We will discuss the similarity met-
ric(s) implemented and implementable in Sparsey throughout the
sequel.
However, as stated earlier, the most important distinction
between localism and SDC is that SDC allows the two essential
operations of associative (content-addressable) memory, storing
new inputs and retrieving the best-matching stored input, to be
done in fixed time for the life of the model. That is, given a model
of a fixed size (dominated by the number of weights), and which
therefore has a particular limit on the amount, C, of informa-
tion that it can store and retrieve subject to a prescribed average
retrieval accuracy (error), the time it takes to either store (learn)
a new input or retrieve the best-matching stored input (mem-
ory) remains constant regardless of how much information has
been stored, so long as that amount remains less than C. There
is no other extant model, including all HMAX models, all convo-
lutional network (CN) models, all Deep Learning (DL) models, all
other models in the class of graphical probability models (GPMs),
and the locality-sensitive hashing models, for which this capability—
constant storage and best-match retrieval time over the life of the
system—has been demonstrated. All these other classes of mod-
els realize the benefits of hierarchy per se, i.e., the principle of
hierarchical compositionality which is critical for rapidly learn-
ing highly nonlinear category boundaries, as described in Bengio
et al. (2012), but only Sparsey also realizes the speed benefit, and
therefore ultimately, the scalability benefit, of SDC. We state the
algorithm in Section Sparsey’s Core Algorithm. The reader can see
by inspection of the CSA (Table I-1) that it has a fixed number of
steps; in particular, it does not iterate over stored items.
Another way of understanding the computational power of
SDC compared to localism is as follows. We stated above that in a
localist representation such as in Figure I-4A, only one cell, rep-
resenting one hypothesis can be active at a time. The other cells
in the mac might, at some point prior to the choice of a final
winner, have a distribution of sub-threshold voltages that reflects
the likelihood distribution over all represented hypotheses. But
ultimately, only one cell will win, i.e., go supra-threshold and
spike. Consequently, only that one cell, and thus that one hypoth-
esis, will materially influence the next time step’s decision process
in the same mac (via the recurrent H matrix) and in any other
downstream macs.
In contrast, because SDCs physically overlap, if one particular
SDC (and thus, the hypothesis that it represents) is fully active in
a mac, i.e., if all Q of that code’s cells are active, then all other codes
(and thus, their associated hypotheses) stored in that mac are also
simultaneously physically partially active in proportion to the size
of their intersections with the single fully active code. Furthermore,
if the process/algorithm that assigns the codes to inputs has
enforced the similar-inputs-to-similar-codes (SISC) property, then
all stored inputs (hypotheses) are active with strength in descend-
ing order of similarity to the fully active hypothesis. We assume
that more similar inputs generally reflect more similar world
states and that world state similarity correlates with likelihood.
In this case, the single fully active code also physically functions
as the full likelihood distribution over all SDCs (hypotheses) stored
in a mac. Figure I-6 illustrates this concept. We show five hypo-
thetical SDCs, denoted with φ(), for five input items, A-E (the
actual input items are not shown here), which have been stored
in the mac shown. At right, we show the decreasing intersections
of the codes with φ(A). Thus, when code φ(A) is (fully) active,
φ(B) is 4/7 active, φ(C) is 3/7 active, etc. Since cells represent-
ing all of these hypotheses, not just the most likely hypothesis,
A, actually spike, it follows that all of these hypotheses physically
influence the next time step’s decision processes, i.e., the resulting
likelihood distributions, active on the next time step in the same
and all downstream macs.
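The read-out implied by Figure I-6 can be sketched in a few lines: the strength of each stored hypothesis is simply the fraction of its code's Q cells contained in the currently active code. The codes below are made up for illustration (Sparsey assigns its codes via the CSA), and the function name is ours:

```python
# Sketch of "the active code doubles as a graded distribution over stored
# codes" (Figure I-6). Codes are sets of Q cell indices, one per CM.
def activation_strengths(active_code, stored_codes):
    """Fraction of each stored code's Q cells contained in the active code."""
    Q = len(active_code)
    return {name: len(active_code & code) / Q
            for name, code in stored_codes.items()}

stored = {                                   # Q = 7 CMs in this toy example
    "A": {0, 10, 20, 30, 40, 50, 60},
    "B": {0, 10, 20, 30, 41, 51, 61},        # overlaps A in 4 of 7 CMs
    "C": {0, 10, 20, 31, 41, 51, 61},        # overlaps A in 3 of 7 CMs
}
print(activation_strengths(stored["A"], stored))
# {'A': 1.0, 'B': 0.571..., 'C': 0.428...}  i.e., 7/7, 4/7, and 3/7 active
```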
We believe this difference to be fundamentally important.
In particular, it means that performing a single execution of
the fixed-time CSA transmits the influence of every represented
hypothesis, regardless of how strongly active a hypothesis is, to
every hypothesis represented in downstream macs. We emphasize
that the representation of a hypothesis’s probability (or likeli-
hood) in our model—i.e., as the fraction of a given hypothesis’s
full code (of Q cells) that is active—differs fundamentally from
existing representations in which single neurons encode such
probabilities in their strengths of activation (e.g., firing rates) as
described in the recent review of Pouget et al. (2013).
SPARSEY’S CORE ALGORITHM
During learning, Sparsey’s core algorithm, the code selection
algorithm (CSA), operates on every time step (frame) in every
mac of every level, resulting in activation of a set of cells (an SDC)
in the mac. The CSA can also be used, with one major variation,
during retrieval (recognition). However, there is a much simpler
retrieval algorithm, essentially just the first few steps of the CSA,
which is preferable if the system “knows” that it is in retrieval
mode. Note that this is not the natural condition for autonomous
systems: in general, the system must be able to decide for itself,
on a frame-by-frame basis, whether it needs to be in learning
mode (if, and to what extent, the input is novel) or retrieval mode
(if the input is completely familiar). We first describe the CSA’s
Table I-1 | The CSA during learning.

(1) $\mathrm{Active}(m)=\begin{cases}\text{true}, & \Upsilon(m)<\delta(m)\\ \text{true}, & \pi^-_U\le\pi_U(m)\le\pi^+_U\\ \text{false}, & \text{otherwise}\end{cases}$
    Determine if mac m will become active.

(2) $u(i)=\sum_{j\in RF_U}a(j,t)\,F(\zeta(j,t))\,w(j,i)$;  $h(i)=\sum_{j\in RF_H}a(j,t-1)\,F(\zeta(j,t-1))\,w(j,i)$;  $d(i)=\sum_{j\in RF_D}a(j,t-1)\,F(\zeta(j,t-1))\,w(j,i)$
    Compute the raw U, H, and D input summations.

(3) $U(i)=\begin{cases}\min\!\left(1,\,u(i)/(\pi^-_U\,w_{\max})\right), & L=1\\ \min\!\left(1,\,u(i)/(\min(\pi^-_U,\pi^*_U)\,Q\,w_{\max})\right), & L>1\end{cases}$;  $H(i)=\min\!\left(1,\,h(i)/(\min(\pi^-_H,\pi^*_H)\,Q\,w_{\max})\right)$;  $D(i)=\min\!\left(1,\,d(i)/(\min(\pi^-_D,\pi^*_D)\,Q\,w_{\max})\right)$
    Compute normalized, filtered input summations.

(4) $V(i)=\begin{cases}H(i)^{\lambda_H}\times U(i)^{\lambda_U(t)}\times D(i)^{\lambda_D}, & t\ge 1\\ U(i)^{\lambda_U(0)}, & t=0\end{cases}$
    Compute local evidential support for each cell.

(5) (a) $\zeta_q=\sum_{i=1}^{K}\left[V(i)>V_\zeta\right]$;  (b) $\zeta=\operatorname{rni}\!\left(\sum_{q=1}^{Q}\zeta_q/Q\right)$
    (a) Compute the number of cells representing a maximally competing hypothesis in each CM. (b) Compute the number of maximally active hypotheses, ζ, in the mac.

(6) $F(\zeta)=\begin{cases}\zeta^{A}, & 1\le\zeta\le B\\ 0, & \zeta>B\end{cases}$
    Compute the multiple competing hypotheses (MCH) correction factor, F(ζ), for the mac.

(7) $\hat{V}_j=\max_{i\in C_j}V(i)$
    Find the max V, $\hat{V}_j$, in each CM, $C_j$.

(8) $G=\sum_{q=1}^{Q}\hat{V}_q/Q$
    Compute G as the average $\hat{V}$-value over the Q CMs.

(9) $\eta=1+\left[\dfrac{G-G^-}{1-G^-}\right]_+^{\gamma}\times\chi\times K$
    Determine the expansivity of the sigmoid activation function.

(10) $\psi(i)=\dfrac{\eta-1}{\left(1+\sigma_1 e^{-\sigma_2(V(i)-\sigma_3)}\right)^{\sigma_4}}+1$
    Apply the sigmoid activation function (which collapses to the constant function when G < G^-) to each cell.

(11) $\rho(i)=\psi(i)\Big/\sum_{k\in\mathrm{CM}}\psi(k)$
    In each CM, normalize the relative probabilities of winning (ψ) to final probabilities (ρ) of winning.

(12) Select a final winner in each CM according to the ρ distribution in that CM, i.e., soft max.
learning mode, then its variation for retrieval, then its much sim-
pler retrieval mode. See Table I-2 for definitions of symbols used
in equations and throughout the paper.
CSA: LEARNING MODE
The overall goal of the CSA when in learning mode is to assign
codes to a mac’s inputs in adherence with the SISC property,
i.e., more similar overall inputs to a mac are mapped to more
highly intersecting SDCs. With respect to each of a mac’s individ-
ual afferent RFs, U, H, and D, the similarity metric is extremely
primitive: the similarity of two patterns in an afferent RF is sim-
ply an increasing function of the number of features in common
between the two patterns, thus embodying only what Bengio
et al. (2012) refer to as the weakest of priors, the smoothness
prior. However, the CSA multiplicatively combines these com-
ponent similarity measures and, because the H and D signals
carry temporal information reflecting the history of the sequence
being processed, the CSA implements a spatiotemporal similar-
ity metric. Nevertheless, the ability to learn arbitrarily complex
nonlinear similarity metrics (i.e., category boundaries, or invari-
ances), requires a hierarchical network of macs and the ability
for an individual SDC, e.g., active in one mac, to associate with
multiple (perhaps arbitrarily different) SDCs in one or more
other macs. We elaborate more on Sparsey’s implementation of
this capability in Section Learning arbitrarily complex nonlinear
similarity metrics.
FIGURE I-6 | If the process that assigns SDCs to inputs enforces the
similar-input-to-similar-codes (SISC) property, then the currently active
code in a mac simultaneously physically functions as the entire
likelihood distribution over all hypotheses stored in the mac. At bottom,
we show the activation strength distribution over all five codes (stored
hypotheses), when each of the five codes is fully active. If SISC was enforced
when these codes were assigned (learned), then these distributions are
interpretable as likelihood distributions. See text for further discussion.
The CSA has 12 steps which can be broken into two phases.
Phase 1 (Steps 1–7) culminates in computation of the familiar-
ity, G (normalized to [0,1]), of the overall (H, U, and D) input
to the mac as a whole, i.e., G is a function of the global state
of the mac. To first approximation, G is the similarity of the
current overall input to the closest-matching previously stored
(learned) overall input. As we will see, computing G involves
a round of deterministic (hard max) competition resulting in
one winning cell in each of the Q CMs. In Phase 2 (Steps
8–12), the activation function of the cells is modified based
on G and a second round of competition occurs, resulting in
the final set of Q winners, i.e., the activated code in the mac
on the current time step. The second round of competition is
probabilistic (soft max), i.e., the winner in each CM is cho-
sen as a draw from a probability distribution over the CM’s
K cells.
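The skeleton below is a simplified paraphrase of this two-phase structure, not Sparsey's implementation: it assumes the per-cell local supports V have already been computed (Steps 2–4), skips the mac-activation, normalization, and MCH machinery (Steps 1–3, 5–6), and substitutes placeholder constants for the sigmoid parameters σ1–σ4 of Table I-1. All function and variable names are ours:

```python
# Sketch of the CSA's two phases: Phase 1 -> G (hard max per CM, averaged);
# Phase 2 -> G-modulated soft-max choice of one winner per CM.
import numpy as np
rng = np.random.default_rng(0)

def csa_phase2(V, G, G_minus=0.2, gamma=1.0, chi=100.0, K=20):
    """V: (Q, K) array of local supports in [0,1]. Returns one winner per CM."""
    # Step 9: expansivity of the activation function grows with familiarity G.
    eta = 1.0 + (max(G - G_minus, 0.0) / (1.0 - G_minus)) ** gamma * chi * K
    # Step 10 (simplified sigmoid, placeholder constants): V -> psi.
    psi = 1.0 + (eta - 1.0) / (1.0 + np.exp(-10.0 * (V - 0.5)))
    # Steps 11-12: normalize within each CM and draw a winner (soft max).
    return np.array([rng.choice(K, p=p / p.sum()) for p in psi])

Q, K = 8, 20
V = rng.random((Q, K))                 # stand-in for the outputs of Steps 2-4
G = V.max(axis=1).mean()               # Steps 7-8: mean of per-CM max V
winners = csa_phase2(V, G, K=K)
print(round(float(G), 2), winners)     # one winning cell index per CM
```

When G is near 1 the large expansivity makes the cell with the highest V in each CM win with high probability (approaching the learned code), whereas when G is below G^- the draw is nearly uniform, which is what yields a new, quasi-random code for unfamiliar inputs.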
In neural terms, each of the CSA’s two competitive rounds
entail the principal cells in each CM integrating their inputs,
Frontiers in Computational Neuroscience www.frontiersin.org December 2014 | Volume 8 | Article 160 | 10
Rinkus Sparse deep hierarchical vision model
engaging the local inhibitory circuitry, resulting in a single
spiking winner. The difference is that the cell activation func-
tions (F/I-curves) used during the second round of integration
will generally be very different from those used during the
first round. Broadly, the goal is as follows: as G approaches 1,
make cells with larger inputs compared to others in the CM
increasingly likely to win in the second round, whereas as G
approaches 0, make all cells in a CM equally likely to win in
the second round. We discuss this further in Section Neural
implementation of CSA.
We now describe the steps of the CSA in learning mode.
We will refer to the generic “circuit model” in Figure II-1 in
describing some of the steps. The figure has two internal levels
with one small mac at each level, but the focus, in describing
the algorithm, will be on the L1 mac, M
1
j
, highlighted in yellow.
M
1
j
consists of Q = 4CMs,eachwithK = 3 cells. Gray arrows
represent the U-wts from the input level, L0, consisting of 12
binary pixels. Magenta arrows represent the D-wts from the L2
mac. Green lines depict a subset of the H-wts. The represen-
tation of where the different afferents arrive on the cells is not
intended to be veridical. The depicted “Max” operations are the
hard max operations of CSA Step 7. The blue arrows portray
the mac-global G-based modulation of the cellular V-to-ψ map
(essentially, the F/I curve). The probabilistic draw operation is not
explicitly depicted in this circuit model.
FIGURE II-1 | Generic “circuit model” for reference in describing some steps of the CSA.
Step 1: Determine if the mac will become active
As shown in Equation (1), during learning, a mac, m, becomes
active if either of two conditions hold: (a) if the number of active
features in its U-RF, π_U(m), is between π^-_U and π^+_U; or (b) if
it is already active but the number of frames that it has been
on for, i.e., its code age, ϒ(m), is less than its persistence, δ(m).
That is, during learning, we want to ensure that codes remain
on for their entire prescribed persistence durations. We currently
have no conditions on the number of active features in the H
and D RFs.

$$\mathrm{Active}(m)=\begin{cases}\text{true}, & \Upsilon(m)<\delta(m)\\ \text{true}, & \pi^-_U\le\pi_U(m)\le\pi^+_U\\ \text{false}, & \text{otherwise}\end{cases} \qquad (1)$$
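Equation (1) is a direct two-condition test; a transcription into code (with variable names of our choosing) looks like this:

```python
# Equation (1): a mac stays/becomes active if its current code is still
# within its persistence window, or if the number of active features in its
# U-RF lies within the [pi_U_lo, pi_U_hi] activation bounds.
def mac_active(code_age, persistence, n_active_u_features, pi_U_lo, pi_U_hi):
    if code_age < persistence:                      # Upsilon(m) < delta(m)
        return True                                 # (assumes the mac is already active)
    if pi_U_lo <= n_active_u_features <= pi_U_hi:   # pi_U^- <= pi_U(m) <= pi_U^+
        return True
    return False

print(mac_active(code_age=3, persistence=2, n_active_u_features=5,
                 pi_U_lo=4, pi_U_hi=8))   # True: feature count is in bounds
```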
Step 2: Compute raw U, H, and D-summations for each cell, i, in the
mac
Every cell, i, in the mac computes its three weighted input
summations, u(i), as in Equation (2a). RF_U is a synonym for U-RF.
a(j, t) is pre-synaptic cell j's activation, which is binary, on
the current frame. Note that the synapses are effectively binary.
Although the weight range is [0,127], pre-post correlation causes
a weight to increase immediately to w_max = 127 and the asymptotic
weight distribution will have a tight cluster around 0 (for
weights that are effectively "0") and around 127 (for weights
that are effectively "1"). The learning policy and mechanics are
described in Section Learning policy and mechanics. F(ζ(j, t)) is
a term needed to adjust the weights of afferent signals from cells in
macs in which multiple competing hypotheses (MCHs) are active.
If the number of MCHs (ζ) is small then we want to boost the
weights of those signals, but if it gets too high, in which case we
refer to the source mac as being muddled, those signals will generally
only serve to decrease SNR in target macs and so we disregard
them. Computing and dealing with MCHs is described in Steps 5
and 6. h(i) and d(i) are computed in analogous fashion, Equations
(2b) and (2c), with the slight change that H and D signals are
modeled as originating from codes active on the previous time
step (t − 1).

$$u(i)=\sum_{j\in RF_U}a(j,t)\times F(\zeta(j,t))\times w(j,i) \qquad (2a)$$

$$h(i)=\sum_{j\in RF_H}a(j,t-1)\times F(\zeta(j,t-1))\times w(j,i) \qquad (2b)$$

$$d(i)=\sum_{j\in RF_D}a(j,t-1)\times F(\zeta(j,t-1))\times w(j,i) \qquad (2c)$$
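Each of Equations (2a–c) is just a weighted sum over one afferent RF. The toy snippet below (with invented afferent lists and function name) shows the three ingredients per synapse: the source cell's binary activation, its source mac's MCH factor F(ζ) from Step 6, and the effectively binary weight:

```python
# Equations (2a-c) in miniature: each entry is (binary activation a, source
# mac's F(zeta), weight w). U uses the current frame; H and D use the
# previous frame.
def raw_summation(afferents):
    return sum(a * F_zeta * w for (a, F_zeta, w) in afferents)

u_rf = [(1, 1.0, 127), (1, 1.0, 127), (0, 1.0, 127)]  # U: two active pixels
h_rf = [(1, 2.0, 127)]                                 # H: boosted, zeta = 2 source
d_rf = [(1, 0.0, 127)]                                 # D: muddled source -> F = 0
u_i, h_i, d_i = raw_summation(u_rf), raw_summation(h_rf), raw_summation(d_rf)
print(u_i, h_i, d_i)   # 254 254.0 0.0
```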
Step 3: Normalize and filter the raw summations
The summations, u(i), h(i), and d(i), are normalized to the [0,1]
interval, yielding U(i), H(i), and D(i). We explained above that
a mac m only becomes active if the number of active features
in its U-RF, π_U(m), is between π^-_U and π^+_U, referred to as the
lower and upper mac activation bounds. Given our assumption
that visual inputs to the model are filtered to single-pixel-wide
edges and binarized, we expect relatively straight or low-curvature
edges roughly spanning the diameter of an L0 aperture to occur
rather frequently in natural imagery. Figure II-2 shows two exam-
ples of such inputs, as frames of sequences, involving either only
a single L0 aperture (panel A) or a region consisting of three L0
apertures, i.e., as might comprise the U-RFs of an L2 mac (e.g.,
as in Figure I-4B). The general problem, treated in this figure, is
that the number of features present in a mac's U-RF, π_U(m), may
vary from one frame to the next. Note that for macs at L2 and
higher, the number of features present in an RF is the number
of active macs in that RF, not the total number of active cells in
that RF. The policy implemented in Sparsey is that inputs with
different numbers of active features compete with each other on
an equal footing. Thus, normalizers (denominators) in Equations
(3a–c) use the lower mac activation bounds, π^-_U, π^-_H, and π^-_D.
This necessitates hard limiting the maximum possible normalized
value to 1, so that inputs with between π^-_U and π^+_U active features
yield normalized values confined to [0,1]. There is one additional
nuance. As noted above, if a mac in m's U-RF is muddled, then
we disregard all signals from it, i.e., they are not included in the
u-summations of m's cells. However, since that mac is active, it
will be included in the number of active features, π_U(m). Thus,
we should normalize by the number of active, nonmuddled macs
in m's U-RF (not simply the number of active macs): we denote
this value as π^*_U. Finally, note that when the afferent feature is
represented by a mac, that feature is actually being represented by the
simultaneous activation of, and thus, inputs from, Q cells; thus
the denominator must be adjusted accordingly, i.e., multiplied by
Q and by the maximum weight of a synapse, w_max.
$$U(i)=\begin{cases}\min\!\left(1,\;u(i)\,/\,(\pi^-_U\times w_{\max})\right), & L=1\\ \min\!\left(1,\;u(i)\,/\,(\min(\pi^-_U,\pi^*_U)\times Q\times w_{\max})\right), & L>1\end{cases} \qquad (3a)$$

$$H(i)=\min\!\left(1,\;h(i)\,/\,(\min(\pi^-_H,\pi^*_H)\times Q\times w_{\max})\right) \qquad (3b)$$

$$D(i)=\min\!\left(1,\;d(i)\,/\,(\min(\pi^-_D,\pi^*_D)\times Q\times w_{\max})\right) \qquad (3c)$$
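Following the text's statement that normalized values are hard-limited to 1, Equations (3a–c) can be written as short helper functions; the names and the example numbers are ours:

```python
# Equations (3a-c): normalize the raw summations and cap at 1. For L1 macs
# the denominator is pi_U_lo * w_max; above L1 each afferent "feature" is a
# whole mac of Q co-active cells, so the denominator also gains a factor Q
# and uses min(pi_lo, number of active non-muddled source macs).
W_MAX = 127

def normalize_L1(u, pi_U_lo):
    return min(1.0, u / (pi_U_lo * W_MAX))

def normalize_higher(raw, pi_lo, n_nonmuddled, Q):
    return min(1.0, raw / (min(pi_lo, n_nonmuddled) * Q * W_MAX))

print(normalize_L1(5 * W_MAX, pi_U_lo=5))                             # 1.0
print(normalize_higher(2 * 9 * W_MAX, pi_lo=3, n_nonmuddled=2, Q=9))  # 1.0
```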
Step 4: Compute overall local support for each cell in the mac
The overall local (to the individual cell) measure, V(i), of evi-
dence/support that cell i should be activated is computed by mul-
tiplying filtered versions of the normalized inputs as in Equation
(4). V(i) can also be viewed as the normalized degree of match
of cell i’s total afferent (including U, H, and D) synaptic weight
vector to its total input pattern. We emphasize that the V measure
is not a measure of support for a single hypothesis, since an indi-
vidual cell does not represent a single hypothesis. Rather, in terms
of hypotheses, V(i) can be viewed as the local support for the
set of hypotheses whose representations (codes) include cell i.
The individual normalized summations are raised to powers (λ),
which allows control of the relative sensitivities of V to the dif-
ferent input sources (U, H, and D). Currently, the U-sensitivity
parameter, λ_U, varies with time (index of frame with respect to
beginning of sequence). We will add time-dependence to the H
and D sensitivity parameters as well and explore the space of
policies regarding these schedules in the future. In general terms,
these parameters (along with many others) influence the shapes
of the boundaries of the categories learned by a mac.
$$V(i)=\begin{cases}H(i)^{\lambda_H}\times U(i)^{\lambda_U(t)}\times D(i)^{\lambda_D}, & t\ge 1\\ U(i)^{\lambda_U(0)}, & t=0\end{cases} \qquad (4)$$
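Equation (4) is a multiplicative combination of the normalized inputs, each raised to its sensitivity exponent, with only the U term used on the first frame; a minimal transcription (names ours, and the caller would supply the time-dependent λ_U(t)):

```python
# Equation (4): multiplicative combination of normalized inputs raised to
# their sensitivity exponents; at t = 0 only the U term contributes.
def local_support(U, H, D, t, lam_U, lam_H=1.0, lam_D=1.0):
    if t == 0:
        return U ** lam_U
    return (H ** lam_H) * (U ** lam_U) * (D ** lam_D)

print(local_support(U=1.0, H=1.0, D=1.0, t=1, lam_U=2.0))   # 1.0 (perfect match)
print(local_support(U=0.8, H=0.5, D=1.0, t=1, lam_U=2.0))   # 0.32
```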
FIGURE II-2 | The mac’s normalization policy must be able to deal
with inputs of different sizes, i.e., inputs having different numbers
of active features. (A) An edge rotates through the aperture over
three time steps, but the number of active features (in this case,
pixels) varies from one time step (moment) to the next. In order for
the mac to be able to recognize the 5-pixel input (T = 1) just as
strongly as the 6 or 7-pixel inputs, the u-summations must be divided
by 5. (B) The U-RFs of macs at L2 and higher consist of an integer
number of subjacent level macs, e.g., here, M^2_i's U-RF consists of
three L1 macs (blue border). Each active mac in M^2_i's U-RF represents
one feature. As for panel A, the number of active features varies across
moments, but in this case, the variation is in increments/decrements of
Q synaptic inputs. Grayed-out apertures have too few active pixels for
their associated L1 macs to become active.
As described in Section CSA: Retrieval Mode, during retrieval,
this step is significantly generalized to provide an extremely pow-
erful, general, and efficient mechanism for dealing with arbitrary,
nonlinear invariances, most notably, nonlinear time-warping of
sequences.
Step 5: Compute the number of competing hypotheses that will be
active in the mac once the final code for this frame is activated
To motivate the need for keeping track of the number of compet-
ing hypotheses active in a mac, we consider the case of complex
sequences, in which the same input item occurs multiple times
and in multiple contexts. Figure II-3 portrays a minimal exam-
ple in which item B occurs as the middle state of sequences
[ABC] and [DBE]. Here, the model’s single internal level, L1,
consists of just one mac, with Q = 4 CMs, each with K = 4
cells. Figure II-3A shows notional codes (SDCs) chosen on the
three time steps of [ABC]. The code name convention here is
that φ denotes a code and the superscript "1" indicates the model
level at which the code resides. The subscript indicates the specific
moment of the sequence that the code represents; thus, it is
necessary for the subscript to specify the full temporal context,
from start of sequence, leading up to the current input item.
Successively active codes are chained together, resulting in spa-
tiotemporal memory traces that represent sequences. Green lines
indicate the H-wts that are increased from one code to the next.
Black lines indicate the U-wts that are increased from currently
active pixels to currently active L1 cells (red). Thus, as described
earlier, e.g., in Figure I-2, individual cells learn spatiotemporal
FIGURE II-3 | Portrayal of reason why macs need to know how many
multiple competing hypotheses (MCHs) are/were active in their afferent
macs. (A) Memory trace of the 3-item sequence, [ABC]. This model has a single
internal level with one mac consisting of Q = 4 CMs, each with K = 4 cells. We
show notional SDCs (sets of red cells) for each of the three items. The green
lines represent increased H-wts in the recurrent H-matrix; the trace is shown
unrolled in time. (B) A notional memory trace of sequence [DBE]. The
SDC chosen for item B differs from that in [ABC] because of the different
temporal context signals, i.e., from the code for item D rather than the code
for item A. (C) When we prompt with item B, the model enters a state that has
equal measures of both of B's previously assigned SDCs. Thus multiple (here,
two) hypotheses are equally active. (D) If the model can detect that multiple
hypotheses are active in this mac, then it can boost its efferent H-signals
(multiplying them by the number of MCHs), in which case the combined H
and U signals when the next item, here "C", is presented cause the SDC for
the moment [ABC] to become fully active. See text for more details.
inputs in correlated fashion, as whole SDCs. Learning is
described more thoroughly in Section Learning Policy and
Mechanics.
As portrayed in Figure II-3B, if [ABC] has been previously
learned, then when item B of another sequence, [DBE], is
encountered, the CSA will generally cause a different SDC, here,
φ^1_DB, to be chosen. φ^1_DB will be H-associated with whatever
code is activated for the next item, in this case φ^1_DBE for item
E. This choosing of codes in a context-dependent way (where
the dependency has no fixed Markov order and in practice can
be extremely long), enables subsequent recognition of complex
sequences without confusion.
However, what if in some future recognition test instance, we
prompt the network with item B, i.e., as the first item of the
sequence, as shown in Figure II-3C? In this case, there are no
active H-wts and so the computation of local support Equation
(4) depends only on the U-wts. But, the pixels comprising item
B have been fully associated with the two codes, φ^1_AB and φ^1_DB,
which have been assigned to the two moments when item B was
presented, [AB] and [DB]. We show the two maximally impli-
cated (more specifically, maximally U-implicated) cells in each
CM as orange to indicate that a choice between them in each
CM has not yet been made. However, by the time the CSA com-
pletes for the frame when item B is presented, one winner must
Frontiers in Computational Neuroscience www.frontiersin.org December 2014 | Volume 8 | Article 160 | 14
Rinkus Sparse deep hierarchical vision model
be chosen in each CM (as will become clear as we continue to
explain the CSA throughout the remainder of section Sparsey’s
Core Algorithm). And, because it is the case in each CM, that both
orange cells are equally implicated, we choose winners randomly
between them, resulting in a code that is an equal mix of the win-
ners from φ^1_AB and φ^1_DB. In this case, we refer to the mac as having
multiple competing hypotheses active (MCHs), where we specif-
ically mean that all the active hypotheses (in this case, just two)
are approximately equally strongly active.
The problem can now be seen at the right of Figure II-3C when
C is presented. Clearly, once C is presented, the model has enough
information to know which of the two learned sequences, or more
specifically, which particular moment is intended, [ABC] rather
than [DBE]. However, the cells comprising the code representing
that learned moment, φ^1_ABC, will, at the current test moment
(lower inset in Figure II-3C), have only half the active H-inputs
that they had during the original learning instance (i.e., upper
inset in Figure II-3C). This leads, once processed through steps
2b, 3b, and 4, to V-values that will be far below V = 1; for sim-
plicity, let's say V = 0.5, for the cells comprising φ^1_ABC. As will
be explained in the remaining CSA steps, this ultimately leads to
the model not recognizing the current test trial moment [BC] as
equivalent to the learning trial moment [ABC], and consequently,
to activation of a new code that could in general be arbitrarily
different from φ^1_ABC.
However, there is a fairly general solution to this problem
where multiple competing hypotheses are present in an active
mac code, e.g., in the code for B indicated by the yellow call-
out. The mac can easily detect when an MCH condition exists.
Specifically, it can tally the number of cells with V = 1—or, allow-
ing some slight tolerance for considering a cell to be maximally
implicated, cells with V(i) > V_ζ, where V_ζ is close to 1, e.g.,
V_ζ = 0.95—in each of its Q CMs, as in Equation (5a). It can
then sum ζ_q over all Q CMs and divide by Q (and round to the
nearest integer, "rni"), resulting in the number of MCHs active
in the mac, ζ, as in Equation (5b). In this example, ζ = 2, and
the principle by which the H-input conditions, specifically the
h-summations, for the cells in φ^1_ABC on this test trial moment
[BC] can be made the same as they were during the learning trial
moment [ABC], is simply to multiply all outgoing H-signals from
φ^1_B by ζ = 2. We indicate the inflated H-signals by the thicker
green lines in the lower inset at right of Figure II-3D. This ulti-
mately leads to V = 1 for all four cells comprising φ^1_ABC and,
via the remaining steps of the CSA, reinstatement of φ^1_ABC with
with
very high probability (or with certainty, in the simple retrieval
mode described in Section CSA: Simple Retrieval Mode), i.e.,
with recognition of test trial moment [BC] as equivalent to
learning trial moment [ABC]. The model has successfully gotten
through an ambiguous moment based on presentation of further,
disambiguating inputs.
We note here that uniformly boosting the efferent H-signals
from φ^1_B also causes the h-summations for the four cells compris-
ing the code φ^1_DBE to be the same as they were in the learning
trial moment [DBE]. However, by Equation (4), the V-values
depend on the U-inputs as well. In this case, the four cells of φ^1_DBE
have u-summations of zero, which leads to V = 0, and ultimately
to essentially zero probability of any of these cells winning the
competitions in their respective CMs. Though we don't show the
example here, if on the test trial, we present E instead of C after
B, the situation is reversed; the u-summations of cells compris-
ing the code φ^1_DBE are the same as they were in the learning trial
moment [DBE] whereas those of the cells comprising the code
φ^1_ABC are zero, resulting with high probability (or certainty) in
reinstatement of φ^1_DBE.
$$\zeta_q=\sum_{i=1}^{K}\left[V(i)>V_\zeta\right] \qquad (5a)$$

$$\zeta=\operatorname{rni}\!\left(\sum_{q=1}^{Q}\zeta_q\,/\,Q\right) \qquad (5b)$$
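Equations (5a–b) amount to a per-CM count followed by a rounded average; a short transcription (names and example values ours) using the Q = 4, K = 4 example of Figure II-3C:

```python
# Equations (5a-b): count cells per CM whose V exceeds the near-1 threshold
# V_zeta, then average over CMs and round to the nearest integer ("rni") to
# get zeta, the number of tied competing hypotheses in the mac.
def count_mchs(V, V_zeta=0.95):
    """V: list of per-CM lists of V-values. Returns zeta for the mac."""
    zeta_q = [sum(1 for v in cm if v > V_zeta) for cm in V]   # Eq. (5a)
    return round(sum(zeta_q) / len(V))                        # Eq. (5b)

# Q = 4 CMs, K = 4 cells each; two cells per CM are maximally implicated,
# as when item B is presented as a prompt in Figure II-3C.
V = [[1.0, 1.0, 0.1, 0.0]] * 4
print(count_mchs(V))   # 2
```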
Step 6: Compute correction factor for multiple competing
hypotheses to be applied to efferent signals from this mac
The example in Figure II-3 was rather clean in that it involved
only two sequences having been learned, containing a total of six
moments, [A], [AB], [ABC], [D], [DB], and [DBE], and very lit-
tle pixel-wise overlap between the items. Thus, cross-talk between
the stored codes was minimized. However, in general, macs will
store far more codes. If, for example, the mac of Figure II-3 was
asked to store 10 moments where B was presented, then, if we
prompted the network with B as the first sequence item, we would
expect almost all cells in all CMs to have V = 1. As discussed in
Step 2, when the number of MCHs (ζ) in a mac gets too high,
i.e., when the mac is muddled, its efferent signals will generally
only serve to decrease SNR in target macs (including itself on the
next time step via the recurrent H-wts) and so we disregard them.
Specifically, when ζ is small, e.g., two or three, we want to boost
the value of the signals coming from all active cells in that mac
by multiplying by ζ (as in Figure II-3D). However, as ζ grows
beyond that range, the expected overlap between the competing
codes increases and, to approximately account for that, we begin
to diminish the boost factor as in Equation (6), where A is an
exponent less than 1, e.g., 0.7. Further, once ζ reaches a thresh-
old, B, typically set to 3 or 4, we multiply the outgoing weights by
0, thus effectively disregarding the mac completely in downstream
computations. We denote the correction factor for MCHs as F(ζ),
defined as in Equation (6). We also use the notation F(ζ(j, t)), as
in Equation (2), where ζ(j, t) is the number of hypotheses tied for
maximal activation strength in the owning mac of a pre-synaptic
cell, j, at time (frame) t.
$$F(\zeta)=\begin{cases}\zeta^{A}, & 1\le\zeta\le B\\ 0, & \zeta>B\end{cases} \qquad (6)$$
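Equation (6) transcribes directly into code; with the parameter values quoted in the text (A = 0.7, B = 3), the boost grows sub-linearly with ζ and then cuts off:

```python
# Equation (6): boost efferent signals by zeta**A while zeta is small, and
# gate them off entirely once zeta exceeds the "muddled" threshold B.
def mch_correction(zeta, A=0.7, B=3):
    if 1 <= zeta <= B:
        return zeta ** A
    return 0.0

for z in (1, 2, 3, 4):
    print(z, round(mch_correction(z), 2))   # 1.0, 1.62, 2.16, 0.0
```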
Step 7: Determine the maximum local support in each of the mac’s
CMs
Operationally, this step is quite simple: simply find the cell with
the highest V-value, V̂_j, in each CM, C_j, as in Equation (7).

$$\hat{V}_j=\max_{i\in C_j}\left\{V(i)\right\} \qquad (7)$$
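Equation (7) is the hard max of the CSA's first competitive round; the per-CM maxima are then averaged in Step 8 (Table I-1) to give G. A minimal transcription (names and example values ours):

```python
# Equation (7): the largest V-value in each CM; their average (Step 8) is G.
def max_support_per_cm(V):
    """V: list of per-CM lists of V-values. Returns V-hat for each CM."""
    return [max(cm) for cm in V]

V = [[0.2, 0.9, 0.4], [0.7, 0.1, 0.3], [0.5, 0.5, 0.8], [1.0, 0.0, 0.2]]
V_hat = max_support_per_cm(V)
print(V_hat, sum(V_hat) / len(V_hat))   # [0.9, 0.7, 0.8, 1.0] and G = 0.85
```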
Concept