Sparsey, a memory-centric model of on-line,
fixed-time, unsupervised continual learning
Rod Rinkus
Neurithmic Systems
Newton, MA 02465, USA
rod@neurithmicsystems.com
Abstract
Four hallmarks of human intelligence are: 1) on-line, single/few-trial learning; 2)
important/salient memories and knowledge are permanent over lifelong durations,
though confabulation (semantically plausible retrieval errors) accrues with age;
3) the times to learn a new item and to retrieve the best-matching (most relevant)
item(s) remain constant as the number of stored items grows; and 4) new items can
be learned throughout life (storage capacity is never reached). No machine learning
model, the vast majority of which are optimization-centric, i.e., learning involves
optimizing a global objective (loss, energy), has all these capabilities. Here, I
describe a memory-centric model, Sparsey, which in principle, has them all. I note
prior results showing possession of Hallmarks 1 and 3 and sketch an argument, re-
lying on hierarchy, critical periods, metaplasticity, and the recursive, compositional
(part-whole) structure of natural objects/events, that it also possesses Hallmarks
2 and 4. Two of Sparsey’s essential properties are: i) information is represented
in the form of fixed-size sparse distributed representations (SDRs); and ii) its
fixed-time learning algorithm maps more similar inputs to more highly intersecting
SDRs. Thus, the similarity (statistical) structure over the inputs, not just pair-wise
but in principle, of all orders present, in essence, a generative model, emerges
in the pattern of intersections of the SDRs of individual inputs. Thus, semantic
and episodic memory are fully superposed and semantic memory emerges as a
by-product of storing episodic memories, contrasting sharply with deep learning
(DL) approaches in which semantic and episodic memory are physically separate.
1 Introduction
Any human-like artificial general intelligence (AGI) must possess the hallmarks listed in the abstract.
No machine learning (ML) model, including any deep learning (DL) model, has been shown to have
them all. In particular, ML/DL models—the vast majority of which are optimization-centric, i.e.,
learning involves optimizing a global loss or energy objective—have been subject to catastrophic
forgetting (CF) [24] and so have difficulty with Hallmark 2 and thus, with lifelong continual learning
(CL). Equally important with respect to scaling to lifelong CL, no ML/DL model has been shown to have
Hallmark 3. Here, I describe a radically different AGI model for which Hallmarks 1 and 3 have
already been shown, for sequences as well as purely spatial inputs. I then sketch an argument, relying
on hierarchy, critical periods, metaplasticity, and the recursive, compositional (part-whole) structure
of natural objects/events, that it also possesses Hallmarks 2 and 4, and further, that it can be expected
to retain these properties over an effectively open-ended lifetime.
The core problem of CF is that the converged state of the weights capturing one dataset/task generally
differs from that for any other. This is especially an issue for optimization-centric models, which
www.neurithmicsystems.com and Visiting Scientist, Brandeis, Biology
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.
generally involve repeated, small changes along the objective’s gradient for massive numbers of
weights. If all weights remain permanently subject to change (across datasets/tasks), then previously
stored information can be erased as the weights converge to an optimum for a new dataset/task.
Several solution types have been advanced. 1. Sparsify/orthogonalize memory traces, causing
weights to be used less often, reducing competing influences on individual weights, and increasing
the number of stably learnable mappings [3, 10, 38]. 2. Continually re-present previously learned
items (e.g., from old tasks) interleaved with new items, as in [9]. But, the set of old items needing to
be interleaved continually increases, suggesting difficulty scaling to lifelong CL scenarios. However,
mounting evidence (recent review in [8]) suggests the hippocampus does act as a transient, on-line,
single-trial learner facilitating replay of neocortical memory traces of recent experiences, allowing
gradual formation of representations of higher-order statistical structure of experiences, an idea
formalized in the pioneering Complementary Learning Systems (CLS) model [23]. 3. Adding a
regularization term to the loss function, which penalizes changing weights in proportion to their
importance for previously learned mappings/tasks [1, 20]. But, this increases learning complexity
since the importance measure is continually re-evaluated for every weight throughout the system’s
lifetime. 4. Finally, recent approaches that add an external (episodic) memory for individual inputs to
a core DL model [14, 15, 31, 40], in principle, address CF/CL in a similar vein as the CLS model: cf.
[22]. However, unlike Sparsey, any model in which semantic and episodic memory are physically
separate entails increased complexity in managing the interaction of the two memories, including the
cost of moving information between the two.
2 Brief Summary of Sparsey
Sparsey [32, 33, 35] differs fundamentally from most ML/DL models as it is not optimization-centric,
but rather, memory-centric: it simply assigns/stores memory traces to inputs, e.g., successive frames
of streaming video, as they present, as well as associating (chaining) those traces both sequentially
in time and hierarchically (across model levels). These traces are in the form of fixed-size sparse
distributed representations (SDRs), as in Fig. 1, which allows more similar inputs to be represented
by more highly intersecting SDRs. Sparsey’s learning algorithm, the Code Selection Algorithm
(CSA) (Fig. 2, see refs for details), does statistically preserve similarity in this way (see Fig. 3).
The similarity structure over the inputs, not just pairwise, but in principle, of all orders present,
emerges automatically in the pattern of intersections over the stored SDRs. Thus, semantic memory
emerges as a by-product of storing episodic memories in superposition: the same weight changes
that store an episodic memory also act to create semantic memory, which as noted above, suggests a
potentially large efficiency advantage over models with external episodic memories [14, 15, 31, 40].
More important, this constitutes a fundamentally different, on-line, single-trial method for building a
generative model without any notion of optimization (though supervised and reinforcement learning
can be easily implemented as meta-protocols).
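
The two essential properties above can be made concrete with a minimal, illustrative sketch (not Sparsey's actual code): an SDR is a fixed-size code, one winner in each of Q winner-take-all competitive modules (CMs) of K units, and the similarity of two stored items can be read directly off the intersection of their codes. The SDR class and function names are mine; Q and K follow the usage in Figs. 3 and 4.

```python
# Illustrative sketch only: a fixed-size SDR is one active unit per
# competitive module (CM), so every code has exactly Q active units and
# the similarity of two stored items is read off as code intersection.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SDR:
    """An SDR over Q WTA competitive modules, each with K binary units.

    winners[j] is the index (0..K-1) of the single active unit in CM j.
    """
    Q: int
    K: int
    winners: Tuple[int, ...]  # length Q

def intersection(a: SDR, b: SDR) -> int:
    """Number of CMs in which the two codes share the same active unit."""
    assert (a.Q, a.K) == (b.Q, b.K)
    return sum(1 for x, y in zip(a.winners, b.winners) if x == y)

def similarity(a: SDR, b: SDR) -> float:
    """Normalized code overlap in [0, 1]; higher means more similar inputs."""
    return intersection(a, b) / a.Q

# Two codes agreeing in 7 of Q=10 CMs represent inputs that the learning
# algorithm judged roughly 70% similar.
a = SDR(Q=10, K=10, winners=(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))
b = SDR(Q=10, K=10, winners=(3, 1, 4, 1, 5, 9, 2, 0, 8, 7))
print(similarity(a, b))  # 0.7
```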
Equally important, the CSA runs in fixed-time: the number of algorithmic steps needed to learn a
new item remains constant as the number of stored items grows. In particular, the CSA preserves
similarity without needing to compare new inputs to previously stored inputs, either serially, as in
most nearest-neighbor models, or to a log number of them, as in tree-based models. Closest-match
retrieval time is also fixed. Thus, Sparsey can be viewed as implementing, in a biologically plausible
manner, similar functionality to locality-sensitive hashing (LSH) [19]. In fact, based on a recent
review [39], Sparsey is more general in that: a) its similarity metric is graded, whereas LSH’s is
binary; b) spatiotemporal and spatial metrics are handled qualitatively identically; and c) the “hash
index” is learned from scratch from the data, a crucial capability, especially for large datasets [21].
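
The fixed-time claim can be made concrete with a hypothetical sketch of retrieval under these assumptions: best-match activation is one pass of input summations over the afferent weights plus one winner-take-all choice per CM, with no loop over previously stored items. Array shapes and names below are illustrative, not Sparsey's.

```python
# Hypothetical sketch of fixed-time best-match retrieval: one pass over the
# afferent weights plus one argmax per CM; there is no loop over the set of
# previously stored items, so cost does not grow with the number of items.
import numpy as np

def retrieve_best_match(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Activate the best-matching stored code for binary input x.

    x : (n_inputs,) binary input vector
    W : (Q, K, n_inputs) binary afferent weight tensor (stand-in for weights
        learned by Hebbian updates)
    Returns winners : (Q,) index of the most strongly driven unit in each CM.
    Cost is O(Q * K * n_inputs) regardless of how many items were stored.
    """
    u = W @ x                # (Q, K) bottom-up input summations
    return u.argmax(axis=1)  # independent hard WTA choice in each CM

# Toy usage with random stand-in weights and input.
rng = np.random.default_rng(0)
Q, K, n = 10, 10, 100
W = (rng.random((Q, K, n)) < 0.05).astype(float)
x = (rng.random(n) < 0.2).astype(float)
print(retrieve_best_match(x, W))
```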
As Fig. 1c shows, an overall model instance is a hierarchy with each level consisting of an array of
SDR coding fields (or “macs”: see Fig. 1 caption), with local, bottom-up, top-down, and horizontal
connectivity (most connections not shown). Three other properties (not visible in figs) are essential
to the argument of Sec. 3.
1) Event-based critical periods: learning in a mac (storing new SDR codes) is shut down, i.e., no
further changes allowed to the mac’s afferent synapses, when a threshold fraction of them have been
increased. There is substantial evidence for critical periods in primary cortical areas [2, 4, 6, 7, 11,
18, 25, 27] and olfactory bulb [5, 30]. 2) A large-delta Hebbian learning scheme combined with a
metaplasticity concept, i.e., synaptic resistance to decay, or permanence, denoted θ. The model starts
as a tabula rasa: all synapses initially have w=0 and θ=0. Whenever a synapse experiences a pre-post
coincidence: a) its weight is (re)set to the max (binary “1”); and b) if the time since the last pre-post
coincidence falls within a θ-dependent time window, its θ is increased (decay rate decreased) and the
window length increases [see [35] for details].
Figure 1: a) The model’s coding field [called a “mac” as it is proposed to correspond to the (L2/3
portion) of a cortical macrocolumn] and its receptive field (RF) to which it is fully connected. b)
Example of learned association from an input, A, to its SDR code, φ(A): blue lines denote increased
wts. c) A hierarchical model showing that higher-level macs’ RFs are unions of subjacent macs, e.g.,
the RF of L2 mac M2,1 consists of the seven cyan L1 macs. Thus, codes (SDRs) learned (stored)
in higher-level macs synaptically associate with, and thus bind, codes in lower macs: overall, a
hierarchical, compositional, i.e., recursive part-whole, sparse distributed, memory trace.
Assuming the expected time for a pre-post coincidence due to inputs reflecting a structural regularity
of the world to recur is much (likely exponentially) smaller than that for a pre-post coincidence due
to randomness, this metaplasticity scheme preferentially embeds SDRs (and chains of SDRs) reflecting
the domain’s structural regularities, i.e., its statistics. This permanence scheme is purely local, only
requiring a synapse to count frames since its last weight increase (a minimal sketch is given after this
enumeration), and computationally far simpler than other schemes requiring continual re-evaluation
of each weight’s importance to previously learned tasks/mappings, e.g., [1, 12, 20].
3) Unit activation duration (persistence) increases (e.g., doubles) with level, allowing a code at level J
to associate with multiple sequentially active codes at level J-1, i.e., nested “chunking”.
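
Below is a hedged sketch of the permanence (metaplasticity) rule of property 2 above, at the level of a single synapse. The tabula rasa start, the weight-to-max rule, the θ-dependent recurrence window, and the frame counter follow the description above; the specific window-growth and decay schedules are illustrative assumptions, not the published parameterization (see [35]).

```python
# Hedged sketch of the local permanence (metaplasticity) rule: every quantity
# lives at a single synapse and only a frame count is needed. The window and
# decay schedules are illustrative assumptions, not Sparsey's published ones.
from dataclasses import dataclass

@dataclass
class Synapse:
    w: float = 0.0             # binary-valued weight; tabula rasa start
    theta: int = 0             # permanence; higher = slower decay
    frames_since_inc: int = 0  # frames since the last weight increase

    def window(self) -> int:
        """Recurrence window (in frames) that grows with permanence."""
        return 10 * (2 ** self.theta)  # assumed schedule

    def on_pre_post_coincidence(self) -> None:
        # (b) a recurrence inside the theta-dependent window raises permanence,
        # which both slows decay and lengthens the window
        if self.w == 1.0 and self.frames_since_inc <= self.window():
            self.theta += 1
        # (a) the weight is (re)set to the max on every pre-post coincidence
        self.w = 1.0
        self.frames_since_inc = 0

    def on_frame_without_coincidence(self) -> None:
        self.frames_since_inc += 1
        # low-permanence weights eventually decay back to 0, so traces of
        # spurious (non-recurring) coincidences fade; slowed (rather than
        # skipped) decay at higher theta is omitted here for brevity
        if self.theta == 0 and self.w == 1.0 and self.frames_since_inc > self.window():
            self.w = 0.0
```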
Figure 4 summarizes results [32] showing that an SDR-based mac has high storage capacity and
can learn complex sequences (items can repeat many times in varying contexts, e.g., text), with
single trials, addressing Hallmark 1. In recent work (unpublished), Sparsey achieved 90% accuracy
on MNIST and 67% on Weizmann event recognition with unsupervised learning. The accuracy is
sub-SOA, but likely substantially improvable via thorough parameter space search and adding supervised
learning. However, learning is extremely fast (runs in fixed time), likely much faster than SOA
methods when adjusted for the fact that it uses no machine parallelism, thus addressing Hallmark 3.
3 Combination of Crucial Principles Yields Lifelong Continual Learning
Broadly, the 4-step argument is that the combined effects of SDR, hierarchy, imposed critical periods,
metaplasticity, and the statistics of natural inputs, cause the expected time for a mac’s afferent synaptic
matrices to reach saturation (i.e., for the mac to reach storage capacity) to increase quickly, likely
exponentially, with the mac’s hierarchical level. Thus, in practice, in a system with even a few levels,
e.g., 10, as relevant to human cortex, the macs at the highest levels might never reach capacity, even
over very long lifetimes operating on (or in) naturalistic domains, thus meeting Hallmark 4.
Step 1: The natural world is recursively compositional in both space and time. Objects are made of
parts, which are made of sub-parts, etc. Even allowing for articulation (i.e., parts can move with
respect to their containing wholes), this vastly constrains the space of likely percepts compared to if all
pixels of the visual field varied fully independently. Further, edges carry most of the information in
images/video. Thus, Sparsey’s input images (frames) are edge-filtered, binarized, and skeletonized
(cheap, local operations), as in Fig. 5, further greatly constraining the space of likely percepts.
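
As a concrete illustration of this preprocessing, here is a minimal sketch using scikit-image. The paper specifies the operations (edge filter, binarize, skeletonize) but not the particular filters; the Sobel filter and Otsu threshold below are assumptions.

```python
# Minimal preprocessing sketch using scikit-image; Sobel and Otsu are
# assumptions, since the paper names the operations but not the filters.
import numpy as np
from skimage.filters import sobel, threshold_otsu
from skimage.morphology import skeletonize

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Map a grayscale frame to a sparse binary skeleton of its edges."""
    edges = sobel(frame)                    # local edge filter
    binary = edges > threshold_otsu(edges)  # binarize
    return skeletonize(binary)              # thin edges to a 1-pixel skeleton

# The result is a binary image whose few active pixels are the model's
# input features for that frame.
frame = np.random.rand(64, 64)  # stand-in for an MNIST digit or video frame
print(int(preprocess(frame).sum()), "active features")
```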
Step 2: The human visual system is hierarchical and receptive field (RF) size grows with level.
Consider the 37-pixel hexagonal RFs of Fig. 6. In light of Step 1’s preprocessing, Sparsey implements
the policy that an L1 mac only activates if the number of active pixels in its RF is within some tight
range around its RF’s diameter, e.g., 6-7 pixels. This yields an input space of C(37,6) + C(37,7) ≈
12.6M. But, the combined effect of natural statistics and the preprocessing precludes the vast majority
of those patterns, yielding a vastly smaller likely input space (see Figs. 6e-g).
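
The combinatorial count quoted above, and the L1 activation policy it refers to, can be checked directly; the function name below is hypothetical.

```python
# Arithmetic behind the input-space size quoted above, plus the stated L1
# activation policy (the function name is hypothetical).
from math import comb

raw_space = comb(37, 6) + comb(37, 7)
print(f"{raw_space:,}")  # 12,620,256, i.e., ~12.6M possible 6- or 7-pixel patterns

def l1_mac_activates(n_active_pixels: int, lo: int = 6, hi: int = 7) -> bool:
    """Gate a mac on the number of active (post-preprocessing) pixels in its RF."""
    return lo <= n_active_pixels <= hi

print(l1_mac_activates(7), l1_mac_activates(12))  # True False
```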
Step 3: The highly (structurally) constrained space of inputs likely to occur in a small RF suggests
a small basis (lexicon) might plausibly represent all future inputs to the RF with sufficient fidelity
to support correct inference/classification on larger-scale tasks pertinent to the overall hierarchical
model. When an overall classification process is realized as a hierarchy of component classification
processes occurring in many macs, whose RFs span many spatiotemporal scales, some substantial
portion of the information as to the input’s class resides in which macs are involved at each level.
This decreases the accuracy needed in individual macs to achieve a given accuracy on the overall
task, in turn, suggesting that smaller bases—entailing a greater average difference between a mac’s
actual input and the best-matching stored input (basis element) to which it is mapped—may suffice.
Figs. 6a-d,h illustrate the basic idea. Here, assume the RF of an L2 mac (not shown) includes the
whole field of depicted L1 mac RFs (green hexagons). Even with the very small basis set (Fig. 6b),
a plane is clearly discernible in Fig. 6c. But, the plane is also discernible even if the active basis
elements are randomly chosen (readily seen by squinting while looking at Fig. 6d), suggesting that in
low-level macs, learning can be deliberately frozen (critical period terminated) even relatively early
in the system’s lifetime, thus preventing CF in those macs, while allowing acceptable accuracy of
higher-level inference processes as well as ongoing learning at higher levels, i.e., of new compositions
of frozen basis elements (features). This view is consistent with: a) emerging notions of progressive
disentangling [13, 36, 41] in that lower-level macs will compute partial, learned invariances over
their respective RFs, leaving progressively higher-level invariances to be handled at progressively
higher levels; and b) the idea that by factoring the recognition problem into multiple scale-specific
sub-problems (e.g., carried out at the different levels of a hierarchy), the number of samples needed
to train each scale might be small and the number of samples needed overall might be exponentially
smaller than for the unfactored “flat” approach, cf. viewing the ventral visual stream as reducing
sample complexity [29], Bayesian belief nets [26], and Hinton’s Capsules [37].
Step 4: As also suggested in Figure 1c, a level J+1 mac’s RF consists of a patch of level J macs.
Once the level J macs’ bases are frozen, the space of possible inputs to level J+1 macs is further (and,
permanently) constrained. Moreover, just as an L1 mac only activates if the number of active pixels
in its RF is within a tight range, so too, a level J+1 mac only activates if the number of active level J
macs in its RF is within a tight range. While the space of possible raw (L0) inputs falling within the
L0 RF of a level J mac increases exponentially with J, the combined effect of the above constraining
forces allows only a tiny fraction of those possible inputs to ever occur. And, only a tiny fraction
of those that do occur, occur more than once. The permanence policy acts to let memory traces of
structurally produced (and thus more likely to recur) inputs become permanent, while letting traces of
random/spurious inputs (having much longer expected times to recurrence) fade. The effect of these
principles must necessarily increase with level and my working hypothesis is that the magnitude
of the effect increases exponentially with level. That is, the rate at which inputs occur with sufficient
novelty so as to require new memory traces (SDRs) to be assigned/stored, and thus the rate at which
macs’ afferent matrices approach saturation, likely decreases exponentially with level. Thus, macs at
higher levels are softly protected from CF, meaning that critical periods need not be enforced at
higher levels, thus allowing new, permanent learning at the higher levels for effectively open-ended
lifetimes (without requiring on-line creation/allocation of new memory substrate [17, 28]), addressing
Hallmark 4, and providing a solution to Grossberg’s “stability-plasticity” dilemma [16].
This is only a sketch of an argument that a memory-centric, SDR-based, hierarchical system can retain
the ability to learn new information, both episodic and semantic, throughout essentially unbounded
lifetimes, without suffering CF, and while retaining fixed response time for learning and best-match
retrieval. Quantitatively analyzing this overall argument is a current major research goal.
References
[1]
Aljundi, Rahaf, Babiloni, Francesca, Elhoseiny, Mohamed, Rohrbach, Marcus, & Tuytelaars, Tinne. 2017.
Memory Aware Synapses: Learning what (not) to forget. CoRR,abs/1711.09601.
[2]
Barkat, T.R., Polley, D.B., & Hensch, T.K. 2011. A critical period for auditory thalamocortical activity.
Nature Neuroscience, 14(9), 1189–1196.
[3]
Bengio, Emmanuel, Bacon, Pierre-Luc, Pineau, Joelle, & Precup, Doina. 2015. Conditional Computation in
Neural Networks for faster models. ArXiv e-prints, Nov., arXiv:1511.06297.
[4]
Blakemore, Colin, & Van Sluyters, Richard C. 1974. Reversal of the physiological effects of monocular
deprivation in kittens: further evidence for a sensitive period. The Journal of Physiology,237(1), 195–216.
[5]
Cheetham, Claire E., & Belluscio, Leonardo. 2014. An Olfactory Critical Period. Science, 344(6180), 157–158.
[6]
Daw, N. W., & Wyatt, H. J. 1976. Kittens reared in a unidirectional environment: evidence for a critical
period. The Journal of Physiology,257(1), 155–170.
[7]
Erzurumlu, Reha S., & Gaspar, Patricia. 2012. Development and Critical Period Plasticity of the Barrel
Cortex. The European Journal of Neuroscience,35(10), 1540–1553.
[8] Foster, David J. 2017. Replay Comes of Age. Annual Review of Neuroscience,40(1), 581–602.
[9]
French, R. M. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4), 128–135.
[10]
French, Robert. 1994. Dynamically Constraining Connectionist Networks to Produce Distributed, Orthog-
onal Representations to Reduce Catastrophic Interference. Pages 335–340 of: In Proceedings of the 16th
Annual Cognitive Science Society Conference. Erlbaum.
[11]
Friedmann, Naama, & Rusou, Dana. 2015. Critical period for first language: the crucial role of language
input during the first year of life. Current Opinion in Neurobiology,35, 27–34.
[12]
Fusi, Stefano, Drew, Patrick J., & Abbott, L. F. 2005. Cascade Models of Synaptically Stored Memories.
Neuron,45(4), 599–611.
[13]
Fusi, Stefano, Miller, Earl K., & Rigotti, Mattia. 2016. Why neurons mix: high dimensionality for higher
cognition. Current Opinion in Neurobiology,37, 66–74.
[14]
Graves, Alex, Wayne, Greg, & Danihelka, Ivo. 2014. Neural Turing Machines. ArXiv e-prints, Oct.,
arXiv:1410.5401.
[15]
Graves, Alex, Wayne, Greg, Reynolds, Malcolm, Harley, Tim, Danihelka, Ivo, Grabska-Barwińska,
Agnieszka, Colmenarejo, Sergio Gómez, Grefenstette, Edward, Ramalho, Tiago, Agapiou, John, Badia,
Adrià Puigdomènech, Hermann, Karl Moritz, Zwols, Yori, Ostrovski, Georg, Cain, Adam, King, Helen,
Summerfield, Christopher, Blunsom, Phil, Kavukcuoglu, Koray, & Hassabis, Demis. 2016. Hybrid computing
using a neural network with dynamic external memory. Nature,538, 471.
[16] Grossberg, S. 1980. How does a brain build a cognitive code? Psychological Review,87(1), 1–51.
[17]
Hintzman, Douglas L. 1984. MINERVA 2: A simulation model of human memory. Behavior Research
Methods, Instruments, and Computers,16(2), 96–101.
[18] Hubel, D. H., & Wiesel, T. N. 1970. The period of susceptibility to the physiological effects of unilateral
eye closure in kittens. The Journal of Physiology,206(2), 419–436.
[19]
Indyk, Piotr, & Motwani, Rajeev. 1998. Approximate Nearest Neighbors: Towards Removing the Curse
of Dimensionality. Pages 604–613 of: Proceedings of the Thirtieth Annual ACM Symposium on Theory of
Computing. STOC ’98. New York, NY, USA: ACM.
[20]
Kirkpatrick, James, Pascanu, Razvan, Rabinowitz, Neil, Veness, Joel, Desjardins, Guillaume, Rusu,
Andrei A., Milan, Kieran, Quan, John, Ramalho, Tiago, Grabska-Barwinska, Agnieszka, Hassabis, Demis,
Clopath, Claudia, Kumaran, Dharshan, & Hadsell, Raia. 2017. Overcoming catastrophic forgetting in neural
networks. PNAS,114(13), 3521–3526.
[21]
Kraska, Tim, Beutel, Alex, Chi, Ed H., Dean, Jeffrey, & Polyzotis, Neoklis. 2017. The Case for Learned
Index Structures. ArXiv e-prints, Dec., arXiv:1712.01208.
[22]
Kumaran, Dharshan, Hassabis, Demis, & McClelland, James L. 2016. What Learning Systems do
Intelligent Agents Need? Complementary Learning Systems Theory Updated. Trends in Cognitive Sciences,
20(7), 512–534.
[23]
McClelland, J. L., McNaughton, B. L., & O’Reilly, R. C. 1995. Why there are complementary learning
systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models
of learning and memory. Psychol. Rev.,102, 419–457.
[24]
McCloskey, M., & Cohen, N. J. 1989. Catastrophic Interference in Connectionist Networks: The Sequential
Learning Problem. Vol. 24. Academic Press. Pages 109–165.
[25] Muir, Darwin W., & Mitchell, Donald E. 1975. Behavioral deficits in cats following early selected visual
exposure to contours of a single orientation. Brain Research,85(3), 459–477.
[26]
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo,
CA: Morgan Kaufmann.
[27]
Pettigrew, J. D., & Freeman, R. D. 1973. Visual Experience without Lines: Effect on Developing Cortical
Neurons. Science,182(4112), 599–601.
[28]
Pickett, Marc, Al-Rfou, Rami, Shao, Louis, & Tar, Chris. 2016. A Growing Long-term Episodic &
Semantic Memory. ArXiv e-prints, Oct., arXiv:1610.06402.
[29]
Poggio, T., Mutch, J., Leibo, J. Z., Rosasco, L., & Taccheti, A. 2012. The computational magic of the
ventral stream: sketch of a theory (and why some deep architectures work). Report TR-2012-035. MIT
CSAIL.
[30] Poo, Cindy, & Isaacson, Jeffry S. 2007. An Early Critical Period for Long-Term Plasticity and Structural
Modification of Sensory Synapses in Olfactory Cortex. J. Neurosci.,27(28), 7553–7558.
[31]
Pritzel, Alexander, Uria, Benigno, Srinivasan, Sriram, Badia, Adrià Puigdomènech, Vinyals, Oriol, Has-
sabis, Demis, Wierstra, Daan, & Blundell, Charles. 2017. Neural Episodic Control. Pages 2827–2836
of: Precup, Doina, & Teh, Yee Whye (eds), Proceedings of the 34th International Conference on Machine
Learning. Proceedings of Machine Learning Research, vol. 70. International Convention Centre, Sydney,
Australia: PMLR.
[32]
Rinkus, Gerard. 1996. A Combinatorial Neural Network Exhibiting Episodic and Semantic Memory
Properties for Spatio-Temporal Patterns. Thesis.
[33]
Rinkus, Gerard. 2010. A cortical sparse distributed coding model linking mini- and macrocolumn-scale
functionality. Frontiers in Neuroanatomy,4.
[34]
Rinkus, Gerard. 2017. A Radically New Theory of how the Brain Represents and Computes with
Probabilities. arXiv preprint arXiv:1701.07879.
[35]
Rinkus, Gerard J. 2014. Sparsey™: event recognition via deep hierarchical sparse distributed codes.
Frontiers in Computational Neuroscience,8(160).
[36]
Rust, Nicole C., & DiCarlo, James J. 2010. Selectivity and Tolerance (“Invariance”) Both Increase as Visual
Information Propagates from Cortical Area V4 to IT. The Journal of Neuroscience,30(39), 12978–12995.
[37]
Sabour, Sara, Frosst, Nicholas, & E Hinton, Geoffrey. 2017. Dynamic Routing Between Capsules. ArXiv
e-prints, Oct., arXiv:1710.09829.
[38]
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, & Salakhutdinov, Ruslan. 2014.
Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1), 1929–1958.
[39]
Wang, J., Liu, W., Kumar, S., & Chang, S. F. 2016. Learning to Hash for Indexing Big Data - A Survey.
Proceedings of the IEEE,104(1), 34–57.
[40]
Weston, Jason, Chopra, Sumit, & Bordes, Antoine. 2014. Memory Networks. ArXiv e-prints, Oct.,
arXiv:1410.3916.
[41]
Zoccolan, Davide, Kouh, Minjoon, Poggio, Tomaso, & DiCarlo, James J. 2007. Trade-Off between Object
Selectivity and Tolerance in Monkey Inferotemporal Cortex. J. Neurosci.,27(45), 12292–12307.
4 Supplementary Material
Figure 2: A simple, non-hierarchical variant of Sparsey’s Code Selection Algorithm (CSA), involving
only the combination of bottom-up (U) and horizontal (H) signals; see [33] for details. The H
matrix carries signals (recurrently) from the previously active (at T-1) SDR code(s). More general,
hierarchical variants are given in [34, 35]. One can readily see by inspection that the algorithm
does not iterate over stored items. Its dominant step is Step 2, requiring a single iteration over afferent
weights, for all units, i.e., input summations. There is one feedforward pass through the steps. In the
case of a hierarchical model with many macs on many levels, on each time step (e.g., frame of an
input sequence), there is a single upward processing pass through all the levels, in which, generally, a
subset of the macs meet activation criteria and execute the CSA, thus assigning/storing codes (or in
the retrieval case, activating the best-matching stored codes). The details of how the CSA preserves
similarity are given in [32–35]. A Java app implementing a slightly simplified version of the CSA
(http://www.sparsey.com/CSA_explainer_app_page.html) is available, allowing experimentation with
parameters affecting the learned similarity-preserving mapping.
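
To make the caption's points concrete (no iteration over stored items; one pass of input summations; a transfer function whose expansivity is modulated), here is a heavily simplified, illustrative sketch of a single CSA step for one mac with U and H inputs only. The normalization, the familiarity measure G, and the sigmoid stand-in below are assumptions; the actual equations are in [33, 35].

```python
# Heavily simplified, illustrative CSA step for one mac with only bottom-up
# (U) and horizontal (H) signals. Normalization, the familiarity measure G,
# and the sigmoid stand-in are assumptions; see [33, 35] for the actual CSA.
import numpy as np

def csa_step(x, prev_code, W_u, W_h, chi=100.0, rng=None):
    """Assign/activate an SDR code (one winner per CM) for the current input.

    x         : (n_in,) binary bottom-up input (current frame)
    prev_code : (n_h,)  binary vector of the code active at T-1 (H source)
    W_u, W_h  : (Q, K, n_in) and (Q, K, n_h) binary weight tensors
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    Q, K, _ = W_u.shape

    # Step 2 (dominant cost): one pass of input summations for all Q*K units.
    u = W_u @ x          # (Q, K) bottom-up support
    h = W_h @ prev_code  # (Q, K) temporal-context support
    v = (u / max(x.sum(), 1)) * (h / max(prev_code.sum(), 1) if prev_code.any() else 1.0)

    # Familiarity G in [0, 1]: high when some unit in every CM matches well.
    G = v.max(axis=1).mean()

    # G modulates the expansivity ("mult"/chi) of the transfer function:
    # familiar inputs -> near-deterministic reactivation of the best-matching
    # code; novel inputs -> near-uniform choice, i.e., a new, well-separated code.
    psi = np.exp(chi * G * v)
    probs = psi / psi.sum(axis=1, keepdims=True)

    # Final step: one soft WTA draw per CM; no loop over stored codes anywhere.
    return np.array([rng.choice(K, p=probs[j]) for j in range(Q)])
```

The design point the sketch tries to convey is that a single scalar familiarity signal controls how deterministic the per-CM choices are, which is what statistically maps more similar inputs to more highly intersecting codes.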
Figure 3: Simulation results of a simplified CSA and a model with only bottom-up (U) inputs from
an input level to a single SDR coding field, consisting of Q=10 WTA CMs (in one case, Q=20), each
with K=10 binary units, showing that spatial input similarity (pixel-wise overlap, x-axis, decreasing
towards the right) is statistically preserved in the size of the intersection of the SDR codes (y-axis).
These results are produced with the Java app mentioned in the previous caption
(http://www.sparsey.com/CSA_explainer_app_page.html). The various curves correspond to different
settings of some of the CSA parameters controlling the shape and size of the sigmoid transfer function.
“mult” corresponds to parameter χ in Step 6 of Fig. 2. “ecc” and “inflect” have their usual meaning
with respect to a sigmoid function, but don’t correspond precisely to the σ parameters in Fig. 2. The
main point of this figure is simply to show that the fixed-time CSA statistically preserves similarity
from pixel overlap to SDR intersection, and does so in a smoothly graded way.
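
The measurement behind this figure can be sketched as follows: present inputs with graded pixel overlap to an encoder and record how input overlap maps onto code intersection. Here `encode` is a placeholder for a CSA-like encoder (e.g., the sketch following Fig. 2's caption) that returns, for a binary input vector, a length-Q array of per-CM winner indices; all other names are mine.

```python
# Sketch of the measurement: `encode` is a placeholder encoder returning a
# length-Q array of per-CM winner indices for a binary input vector; it is
# assumed to be run deterministically (retrieval mode) for this test.
import numpy as np

def overlap_curve(encode, n_pixels=100, n_active=20, Q=10, seed=0):
    """Return (input_overlap, code_overlap) pairs for graded input variants."""
    rng = np.random.default_rng(seed)
    base_idx = rng.choice(n_pixels, n_active, replace=False)
    base = np.zeros(n_pixels)
    base[base_idx] = 1
    base_code = encode(base)
    points = []
    for n_changed in range(0, n_active + 1, 2):
        variant = base.copy()
        # swap n_changed active pixels for previously inactive ones
        off = rng.choice(base_idx, n_changed, replace=False)
        on = rng.choice(np.setdiff1d(np.arange(n_pixels), base_idx), n_changed, replace=False)
        variant[off] = 0
        variant[on] = 1
        input_overlap = (base * variant).sum() / n_active
        code_overlap = (base_code == encode(variant)).sum() / Q
        points.append((input_overlap, code_overlap))
    return points
```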
Figure 4: a) Example synthetic data sequence (10 frames, 20 out of 100 randomly chosen features
per frame) for testing model capacity. b) Capacity is linear in the weights. “Uncorrelated” denotes
randomly created frames (20 out of 100 features, chosen independently on each frame). “Correlated”
denotes the complex sequence case: an alphabet of 100 frames is pre-created, and sequences are then
created by choosing 10 frames from the lexicon randomly with replacement. The model had Q=100
CMs, each with K=40 units, and approximately 16M weights. As the chart shows, over 3000 such
uncorrelated sequences were learned/stored with one trial each, while permitting an average retrieval
accuracy of 97%. See [32] and http://www.sparsey.com/Sparsey_Storage_Capacity.html for more
details. Also see [32] for other experiments demonstrating the ability to learn extremely long time
dependencies.
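
For concreteness, a sketch of the synthetic data described in this caption (names and the random seed are mine): “uncorrelated” sequences draw each frame's 20-of-100 active features independently, while “correlated” (complex) sequences draw 10 frames with replacement from a fixed 100-frame lexicon, so frames repeat in varying contexts.

```python
# Sketch of the synthetic capacity-test data described in this caption.
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, ACTIVE, SEQ_LEN, LEXICON_SIZE = 100, 20, 10, 100

def random_frame() -> np.ndarray:
    """One frame: 20 of 100 binary features chosen at random."""
    f = np.zeros(N_FEATURES, dtype=int)
    f[rng.choice(N_FEATURES, ACTIVE, replace=False)] = 1
    return f

def uncorrelated_sequence() -> np.ndarray:
    """Each frame's features are drawn independently."""
    return np.stack([random_frame() for _ in range(SEQ_LEN)])

# "Correlated" (complex) case: frames come from a fixed 100-frame lexicon,
# drawn with replacement, so the same frame recurs in varying contexts.
lexicon = np.stack([random_frame() for _ in range(LEXICON_SIZE)])

def correlated_sequence() -> np.ndarray:
    return lexicon[rng.integers(0, LEXICON_SIZE, SEQ_LEN)]
```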
Figure 5: Example showing the pre-processing applied to spatial inputs, e.g., MNIST images, or
frames of video. We edge filter, binarize, and skeletonize the inputs, all extremely cheap, local
operations. These are a few frames from a Weizmann video.
Figure 6: Explanation of why the pre-processing (e.g., Fig. 5) and a further constraint on the
number of active features in a mac’s bottom-up RF, in combination with the recursive, compositional
(part-whole) structure of natural objects, greatly constrain the size of the basis needed for a mac to
adequately represent its input space, i.e., represent all future inputs to its RF with sufficient fidelity to
support expected future tasks (both probabilistic inference tasks and classifications), in particular,
tasks defined at higher spatial/temporal scales, which will depend, in a complex way, on the fidelities
of all macs at all levels of the hierarchy. For a-d, see text. e) Examples of inputs that are statistically
plausible, but have too many active features, and so do not activate the mac and are not stored.
f) Examples of inputs that are statistically plausible, but will not survive the preprocessing, and so
will not be stored. g) Examples of inputs that will survive the preprocessing but are statistically very
unlikely (i.e., unlikely to be due to structural regularities), and so will not be stored. h) At left of each
row is a statistically plausible input with an acceptable number of active features, which will thus
be stored (assigned an SDR) if presented. Examples to right of dashed line have varying degrees of
pixel overlap with the leftmost pattern, but would presumably be well-represented by the leftmost
pattern, and thus not need to be explicitly stored, supporting the adequacy of a smaller basis.