Loihi: A Neuromorphic Manycore Processor with On-Chip Learning

Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday,
Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, Yuyun Liao, Chit-Kwan Lin, Andrew Lines,
Ruokun Liu, Deepak Mathaikutty, Steve McCoy, Arnab Paul, Jonathan Tse,
Guruguhanathan Venkataramanan, Yi-Hsin Weng, Andreas Wild, Yoonseok Yang, Hong Wang
Contact: mike.davies@intel.com
Intel Labs, Intel Corporation
Abstract: Loihi is a 60 mm² chip fabricated in Intel’s 14 nm process that advances the state-of-the-art modeling of spiking neural networks in
silicon. It integrates a wide range of novel features for the field, such as hierarchical connectivity, dendritic compartments, synaptic delays, and
most importantly programmable synaptic learning rules. Running a spiking convolutional form of the Locally Competitive Algorithm, Loihi can solve
LASSO optimization problems with over three orders of magnitude superior energy-delay-product compared to conventional solvers running on a
CPU iso-process/voltage/area. This provides an unambiguous example of spike-based computation outperforming all known conventional
solutions.
Keywords: Neural nets, Neuromorphic computing, Artificial Intelligence, Machine learning, Computing Methodologies, Other Architecture Styles
1 INTRODUCTION
Neuroscience offers a bountiful source of inspiration for novel
hardware architectures and algorithms. Through their
complex interactions at large scales, biological neurons exhibit an
impressive range of behaviors and properties that we currently
struggle to model with modern analytical tools, let alone replicate
with our design and manufacturing technology. Some of the magic
that we see in the brain undoubtedly stems from exotic device and
material properties that will remain out of our fabs’ reach for
many years to come. Yet highly simplified abstractions of neural
networks are now revolutionizing computing by solving difficult
and diverse machine learning problems of great practical value.
Perhaps other less simplified models may also yield near-term
value.
Artificial neural networks (ANNs) are reasonably well served
by today’s von Neumann CPU architectures and GPU variants,
especially when assisted by coprocessors optimized for streaming
matrix arithmetic. Spiking neural network models, on the other
hand, are exceedingly poorly served by conventional
architectures. Just as the value of ANNs was not fully appreciated
until the advent of sufficiently fast CPUs and GPUs, the same could
be the case for spiking models—except different computing
architectures will be required.
The neuromorphic computing field of research spans a range
of different neuron models and levels of abstraction. Loihi
(pronounced "low-EE-hee") is one stake in the ground motivated
by a particular class of algorithmic results and perspectives from
our survey of computational neuroscience and recent
neuromorphic advances. We approach the field with an eye for
mathematical rigor, top-down modeling, rapid architecture
iteration, and quantitative benchmarking. Our aim is to develop
algorithms and hardware in a principled way as much as possible.
We begin this paper with our definition of the SNN
computational model and the features that motivated Loihi’s
architectural requirements. We then describe the architecture
that supports those requirements and provide an overview of the
chip’s asynchronous design implementation. We conclude with
some preliminary 14nm silicon results.
Importantly, Section 2.2 presents a result that unambiguously
demonstrates the value of spike-based computation for one
foundational problem. We view this as a significant result in light
of ongoing debate about the value of spikes as a computational
tool in both mainstream and neuromorphic communities. The
skepticism towards spikes is well founded, but in our research we
have moved on from this question, given the existence of an
example that potentially generalizes to a very broad class of
neural networks, namely all recurrent networks.
2 SPIKING NEURAL NETWORKS
We consider a spiking neural network (SNN) as a model of
computation with neurons as the basic processing elements.
Different from artificial neural networks, SNNs incorporate time
as an explicit dependency in their computations. At some instant
in time, one or more neurons may send out single-bit impulses,
the spike, to neighbors through directed connections known as
synapses, with a potentially nonzero traveling time. Neurons have
local state variables with rules governing their evolution and
timing of spike generation. Hence the network is a dynamical
system where individual neurons interact through spikes.
This article has been accepted for publication in IEEE Micro but has not yet been fully edited.
Some content may change prior to final publication.
Digital Object Identifier 10.1109/MM.2018.112130359 0272-1732/$26.00 2018 IEEE
2.1 Spiking Neural Unit
A spiking neuron integrates its spike train input in some fashion,
usually by low pass filter, and fires once a state variable exceeds a
threshold. Mathematically, each spike train is a sum of Dirac delta
functions $\sigma(t) = \sum_k \delta(t - t_k)$, where $t_k$ is the time of the k-th
spike. We adopt a variation of the well-known CUBA
leaky-integrate-and-fire model that has two internal state variables, the
synaptic response current $u_i(t)$ and the membrane potential $v_i(t)$.
The synaptic response current is the sum of filtered input spike
trains and a constant bias current:
$$u_i(t) = \sum_{j \neq i} w_{ij}\,(\alpha_u * \sigma_j)(t) + b_i \qquad (1)$$
where $w_{ij}$ is the synaptic weight from neuron-j to i,
$\alpha_u(t) = \exp(-t/\tau_u)\,H(t)$ is the synaptic filter impulse response
parameterized by the time constant $\tau_u$ with $H(t)$ the unit step
function, and $b_i$ is a constant bias. The synaptic current is further
integrated as the membrane potential, and the neuron sends out
a spike when its membrane potential passes its firing threshold $\theta_i$.
$$\dot{v}_i(t) = -\frac{1}{\tau_v}\,v_i(t) + u_i(t) - \theta_i\,\sigma_i(t) \qquad (2)$$
Note that the integration is leaky, as captured by the time
constant $\tau_v$. $v_i$ is initialized with a value less than $\theta_i$, and is reset to
0 right after a spiking event occurs.
Loihi, a fully digital architecture, approximates the above
continuous time dynamics using a fixed-size discrete timestep
model. In this model, all neurons need to maintain a consistent
understanding of time so their distributed dynamics can evolve in
a well-defined, synchronized manner. It is worth clarifying that
these fixed-size, synchronized time steps relate to the algorithmic
time of the computation, and need not have a direct relationship
to the hardware execution time.
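As an illustration of such a discrete-timestep approximation, the following Python sketch applies a forward-Euler step of unit size to the synaptic current and membrane potential dynamics described above. It is a floating-point sketch under stated assumptions, not Loihi's fixed-point update: the function name `simulate_cuba_lif`, the parameter values, and the choice of Euler integration are all hypothetical.

```python
import numpy as np

def simulate_cuba_lif(w, b, theta, tau_u=5.0, tau_v=10.0, steps=200):
    """Forward-Euler sketch of a CUBA leaky integrate-and-fire network.

    w[i, j] is the weight from neuron j to neuron i, b[i] a constant bias,
    and theta[i] the firing threshold. One loop iteration corresponds to one
    algorithmic timestep (dt = 1), independent of wall-clock time.
    """
    n = len(b)
    s = np.zeros(n)                    # exponentially filtered spike trains
    v = np.zeros(n)                    # membrane potentials, start below threshold
    spikes = np.zeros((steps, n), dtype=bool)
    for t in range(steps):
        u = w @ s + b                  # synaptic response current plus bias
        v = v + (-v / tau_v + u)       # leaky integration of the current
        fired = v >= theta
        v[fired] = 0.0                 # reset right after a spiking event
        s = s * (1.0 - 1.0 / tau_u) + fired   # decay filter, add new spikes
        spikes[t] = fired
    return spikes

# Two mutually inhibiting neurons; neuron 0 receives the larger bias.
w = np.array([[0.0, -0.5], [-0.5, 0.0]])
b = np.array([0.30, 0.10])
theta = np.array([1.0, 1.0])
spikes = simulate_cuba_lif(w, b, theta)
rates = spikes.mean(axis=0)            # average spike rate per neuron
```

All neurons are updated from the same state snapshot within an iteration, which is the software analogue of the synchronized algorithmic timestep.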
2.2 Computation with Spikes and Fine-grained Parallelism
Computations in SNNs are carried out through the interacting
dynamics of neuron states. An instructive example is the $\ell_1$-minimizing
sparse coding problem, also known as LASSO, which
we can solve with the SNN in Figure 1a using the Spiking Locally
Competitive Algorithm [2]. The objective of this problem is to
determine a sparse set of coefficients that best represents a given
input as the linear combination of features from a feature
dictionary. The coefficients can be viewed as the activities of the
spiking neurons in Figure 1a that are competing to form an
accurate representation of the data. By properly configuring the
network, it can be established that as the network dynamics
evolve, the average spike rates of the neurons will converge to a
fixed point, and this fixed point is identical to the solution of the
optimization problem.
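For reference, the sparse coding objective described above is conventionally written in the following standard form. The symbols $x$ (input), $D$ (feature dictionary with columns $d_i$), $a$ (coefficient vector), and $\lambda$ (sparsity penalty) are notation assumed here for illustration; the exact formulation used for S-LCA is given in [2].

```latex
\min_{a} \; \tfrac{1}{2}\,\lVert x - D a \rVert_2^2 \; + \; \lambda\,\lVert a \rVert_1 ,
\qquad b_i = d_i^{\top} x
```

Under this notation, $b_i$ is the correlation between the input and the i-th feature vector, which is the quantity each neuron receives as its input.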
Such computation exhibits completely different characteristics
from conventional linear algebra based approaches. Figure 1b
compares the computational efficiency of an SNN with the
conventional solver FISTA [3] by having them both solve a sparse
coding problem on a single-threaded CPU. The SNN approach
(labelled S-LCA) gives a rapid initial drop in error and obtains a
good approximate solution faster than FISTA. After this, the S-LCA
convergence speed significantly slows down, and FISTA instead
finds a much more precise solution quicker. Hence an interesting
efficiency-accuracy tradeoff arises that makes the SNN solution
particularly attractive for applications that do not require highly
precise solutions, e.g., a solution that is within 1% of the optimal.
The remarkable algorithmic efficiency of S-LCA can be
attributed to its ability to exploit the temporal ordering of spikes,
a general property of the SNN computational model. In Figure 1a,
the neuron that has the largest external input to win the
competition is more likely to spike at the earliest time, causing
immediate inhibition of the other neurons. This inhibition
happens with only a single one-to-many spike communication, in
contrast to the usual need for all-to-all state exchanges with
matrix arithmetic based solutions such as FISTA and other
conventional solvers. This implies that the SNN solution is
communication efficient, and it may solve the optimization
problem with a reduced number of arithmetic operations. We
point interested readers to [1] for more discussions.
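The temporal-ordering argument can be made concrete with a small sketch. Assuming idealized non-leaky integrators with a common threshold and constant external inputs (a simplification of the neuron model in Section 2.1; the function and neuron names are hypothetical), the neuron with the largest input reaches threshold first:

```python
import math

def first_spike_step(b, theta=1.0):
    """Timesteps a non-leaky integrator with constant input b needs to reach theta."""
    return math.ceil(theta / b)

# External inputs of three competing neurons (illustrative values).
inputs = {"n1": 0.40, "n2": 0.25, "n3": 0.10}
first_spikes = {name: first_spike_step(b) for name, b in inputs.items()}

# The earliest spiker wins; one one-to-many spike then inhibits the others.
winner = min(first_spikes, key=first_spikes.get)
```

The winner is identified by spike timing alone, before any neuron has exchanged state with any other.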
Our CPU-based evaluation has yet to exploit one important
advantage of SNN-based algorithms: the inherent abundant
parallelism. The dominant part of SNN computations—the
evolution of individual neuron states within a timestep—can all be
computed concurrently. However, harnessing such speedup can
be a nontrivial task especially on a conventional CPU architecture.
The parallelizable work for each neuron only consists of a few
variable updates. Given that the parallel segment of the work can
be executed very quickly, the underlying architecture must
support a fine granularity of parallelism with minimal overhead in
coordinating the order of computations. These observations
motivate fundamental features of the Loihi architecture,
described in Section 3.
Fig. 1: (a) The network topology for solving LASSO. Each neuron
receives the correlation $b_i$ between the input data and a predefined
feature vector as its input. Bottom figure shows the evolution of
membrane potential in a 3-neuron example; the spike rates of the
neurons stabilize to fixed values. (b) Algorithmic efficiency
comparison of a solution based on spiking network (S-LCA) and
conventional optimization methods (FISTA). Both algorithms are
implemented on a CPU with single thread. Y-axis is the normalized
difference to the optimal objective function value. Figures taken from
[1] with detailed information therein.
2.3 Learning with Local Information
Learning in an SNN refers to adapting the synaptic weights and
hence varying the SNN dynamics to a desired one. Similar to
conventional machine learning, we wish to express learning as the
minimization of a particular loss function over many training
samples. In the sparse coding case, learning involves finding the
set of synaptic weights that allows the best performing sparse
representation, expressed as minimizing the sum of all sparse
coding losses. Learning in an SNN naturally proceeds in an online
manner, where training samples are sent to the network
sequentially.
SNN synaptic weight adaptation rules must satisfy a locality
constraint: each weight can only be accessed and modified by the
destination neuron, and the rule can only make use of locally
available information, such as the spike trains from the
presynaptic (source) and postsynaptic (destination) neurons. The
locality constraint imposes a significant challenge on the design of
learning algorithms, as most conventional optimization
procedures do not satisfy it. Although the development of such
decentralized learning algorithms is still in active research, some
pioneering work exists showing the promise of this approach.
They range from the simple Oja’s rule for finding principal
components, to the Widrow-Hoff rule for supervised learning and
its generalization to exploit precise spike timing information [4],
to the more complex unsupervised sparse dictionary learning
using feedback [5] and event-driven random back-propagation
[6].
Once a learning rule satisfies the locality constraint, the
inherent parallelism offered by SNNs will then allow the adaptive
network to be scaled up to large sizes in a way that can be
computed efficiently. If the rule also minimizes a loss function,
then the system will have well defined dynamics.
To support the development of such scalable learning rules,
Loihi offers a variety of local information to a programmable
synaptic learning process:
• Spike traces corresponding to filtered presynaptic and
  postsynaptic spike trains with configurable time constants
  (Section 3.4.4). In particular, a short time constant allows
  the learning rule to utilize precise spike timing
  information, while a long time constant captures the
  information in spike rates.
• Multiple spike traces for a given spike train filtered with
  different time constants. This provides support for
  differential Hebbian learning by measuring perturbations
  in spike patterns and Bienenstock-Cooper-Munro learning
  using triplet STDP [7], among others.
• Two additional state variables per synapse, besides the
  normal weight, in order to provide more flexibility for
  learning. For example, these can be used as synaptic tags
  for reinforcement learning.
• Reward traces that correspond to special reward spikes
  carrying signed impulse values to represent reward or
  punishment signals for reinforcement learning. Reward
  spikes are broadcast to defined sets of synapses in the
  network that may connect to many different source and
  destination neurons.
Loihi is the first fully integrated digital SNN chip that supports any
of the above features. Some small-scale neuromorphic chips with
analog synapse and neuron circuits have prototyped synaptic
plasticity using spike traces, for example [8], but these prior chips
have orders of magnitude lower network capacity compared to
Loihi as well as far less programmability.
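To illustrate how one spike train yields different information under different time constants, the sketch below filters the same train with a short and a long exponential trace. The floating-point filter, the function name `update_trace`, and the parameter values are assumptions for illustration; Loihi's actual trace implementation is the subject of Section 3.4.4.

```python
import math

def update_trace(trace, spiked, tau, impulse=1.0):
    """Decay an exponential spike trace by one timestep, then add an impulse on a spike."""
    trace *= math.exp(-1.0 / tau)
    if spiked:
        trace += impulse
    return trace

spike_train = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
fast, slow = 0.0, 0.0
for s in spike_train:
    fast = update_trace(fast, s, tau=1.0)    # short constant: recent spike timing
    slow = update_trace(slow, s, tau=50.0)   # long constant: approximate spike count
```

Six steps after the last spike, the fast trace has nearly vanished, while the slow trace still reflects that two spikes occurred.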
2.4 Other Computational Primitives
Loihi includes several computational primitives related to other
active areas of SNN algorithmic research:
• Stochastic noise. Uniformly distributed pseudorandom
  numbers may be added to a neuron’s synaptic response
  current, membrane voltage, and refractory delay. This
  provides support for algorithms such as Neural Sampling
  [9], which can solve probabilistic inference and constraint
  satisfaction problems using stochastic dynamics and a
  form of Markov chain Monte Carlo sampling.
• Configurable and adaptable synaptic, axon, and refractory
  delays. This provides support for novel forms of temporal
  computation such as polychronous dynamics [10], in
  which subsets of neurons may synchronize over periods of
  varying timescales. The number of polychronous groups
  far exceeds the number of stable attractors in
  conventional attractor networks, suggesting a productive
  space for computational development.
• Configurable dendritic tree processing. Neurons in the
  SNN may be decomposed into a tree of compartment
  units, with the neuron’s input synapses distributed over
  those compartments. Each compartment supports the
  same state variables as a neuron, but only the root of the
  tree (soma compartment) generates spike outputs. The
  compartments’ state variables are combined in a
  configurable manner by programming different join
  functions for each compartment junction.
• Neuron threshold adaptation in support of intrinsic
  excitability homeostasis.
• Scaling and saturation of synaptic weights in support of
  “permanence” levels that exceed the range of weights
  used during inference.
The combination of these features in one device, especially in
combination with Loihi’s learning capabilities, is novel for the field
of SNN silicon implementation.
3 ARCHITECTURE
3.1 Chip Overview
Loihi features a manycore mesh comprising 128 neuromorphic
cores, three embedded x86 processor cores, and off-chip
communication interfaces that hierarchically extend the mesh in
four planar directions to other chips. An asynchronous network-
on-chip (NoC) transports all communication between cores in the
form of packetized messages. The NoC supports write, read
request, and read response messages for core management and
x86-to-x86 messaging, spike messages for SNN computation, and
barrier messages for time synchronization between cores. All
message types may be sourced externally by a host CPU or on-chip
by the x86 cores, and these may be directed to any on-chip core.
Messages may be hierarchically encapsulated for off-chip
communication over a second-level network. The mesh protocol
supports scaling to 4096 on-chip cores and, via hierarchical
addressing, up to 16,384 chips.
Each neuromorphic core implements 1,024 primitive spiking
neural units (compartments) grouped into sets of trees
constituting neurons. The compartments, along with their fanin
and fanout connectivity, share configuration and state variables in
ten architectural memories. Their state variables are updated in a
time-multiplexed, pipelined manner every algorithmic timestep.
When a neuron’s activation exceeds some threshold level, it
generates a spike message that is routed to a set of fanout
compartments contained in some number of destination cores.
Flexible and well provisioned SNN connectivity features are
crucial for supporting a broad range of workloads. Some desirable
networks may call for dense, all-to-all connectivity while others
may call for sparse connectivity; some may have uniform graph
degree distributions, others power law distributions; some may
require high precision synaptic weights, e.g. to support learning,
while others can make do with binary connections. As a rule,
algorithmic performance scales with increasing network size,
measured not only by neuron counts but especially neuron-to-
neuron fanout degrees. We see this rule holding all the way to
biological levels (1:10,000). Due to the $O(N^2)$ scaling of
connectivity state in the number of fanouts, it becomes an
enormous challenge to support networks with high connectivity
using today’s integrated circuit technology.
To address this challenge, Loihi supports a range of features to
relax the sometimes severe constraints that other neuromorphic
designs have imposed on the programmer:
1) Sparse network compression. Besides a common dense
matrix connectivity model, Loihi supports three sparse
matrix compression models in which fanout neuron
indices are computed based on index state stored with
each synapse’s state variables.
2) Core-to-core multicast. Any neuron may direct a single
spike to any number of destination cores, as the network
connectivity may require.
3) Variable synaptic formats. Loihi supports any weight
precision between one and nine bits, signed or unsigned,
and weight precisions may be mixed (with scale
normalization) even within a single neuron’s fanout
distribution.
4) Population-based hierarchical connectivity. As a
generalized weight sharing mechanism, e.g. to support
convolutional neural network types, connectivity
templates may be defined and mapped to specific
population instances during operation. This feature can
reduce a network’s required connectivity resources by
over an order of magnitude.
Loihi is the first fully integrated SNN chip that supports any of the
above features. All prior chips, for example the previously most
synaptically dense chip [11], store their synapses in dense matrix
form that significantly constrains the space of networks that may
be efficiently supported.
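A rough storage count shows why dense-only storage is constraining for sparse networks. The sketch below compares a dense weight matrix against a generic index-list format in which every stored synapse carries its destination index. This index-list format, the 5% density, and the bit widths are assumptions chosen for illustration; they are not a description of Loihi's three compression models.

```python
def dense_bits(n_src, n_dst, weight_bits):
    """Dense matrix: one weight slot for every (source, destination) pair."""
    return n_src * n_dst * weight_bits

def sparse_bits(n_synapses, weight_bits, index_bits):
    """Generic index list: each stored synapse carries its destination index."""
    return n_synapses * (weight_bits + index_bits)

n_src, n_dst = 1024, 1024
weight_bits, index_bits = 8, 10           # 10 bits address 1,024 destinations
n_synapses = int(0.05 * n_src * n_dst)    # assumed 5% connection density

dense = dense_bits(n_src, n_dst, weight_bits)
sparse = sparse_bits(n_synapses, weight_bits, index_bits)
```

Under these assumptions the index list needs roughly a ninth of the dense storage; the advantage shrinks as density rises, which is why supporting both dense and sparse models is useful.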
Each Loihi core includes a programmable learning engine that
can evolve synaptic state variables over time as a function of
historical spike activity. In order to support the broadest possible
class of rules, the learning engine operates on filtered spike traces.
Learning rules are microcode programmable and support a rich
selection of input terms and output synaptic target variables.
Specific sets of these rules are associated with a learning profile
bound to each synapse to be modified. The profile is mapped by
some combination of presynaptic neuron, postsynaptic neuron, or
class of synapse. The learning engine supports simple pairwise
STDP rules and also much more complicated rules such as triplet
STDP, reinforcement learning with synaptic tag assignments, and
complex rules that reference both rate averaged and spike-timing
traces.
All logic in the chip is digital, functionally deterministic, and
implemented in an asynchronous bundled data design style. This
allows spikes to be generated, routed, and consumed in an event-
driven manner with maximal activity gating during idle periods.
This implementation style is well suited for spiking neural
networks that fundamentally feature a high degree of sparseness
in their activity across both space and time.
3.2 Mesh Operation
Figure 2 shows the operation of the neuromorphic mesh as it
executes a spiking neural network model. All cores begin at
algorithmic timestep t. Each core independently iterates over its
set of neuron compartments, and any neurons that enter a firing
state generate spike messages that the NoC distributes to all cores
that contain their synaptic fanouts. Spike distributions for two
such example neurons n1 and n2 in cores A and B are illustrated in
Figure 2b, with additional spike distributions from other firing
neurons adding to the NoC traffic in Figure 2c.
Fig. 2: Mesh Operation. (a) Initial idle state for timestep t. Each
square represents a core in the mesh containing multiple neurons.
(b) Neurons n1 and n2 in cores A and B fire and generate spike
messages. (c) Spikes from all other neurons firing on timestep t in
cores A and B are distributed to their destination cores. (d) Each core
advances its algorithmic timestep to t+1 as it handshakes with its
neighbors via barrier synchronization messages.
The NoC distributes spike (and all other) messages according
to a dimension-order routing algorithm. The NoC itself only
supports unicast distributions. To multicast spikes, the output
process of each core iterates over a list of destination cores for a
firing neuron’s fanout distribution and sends one spike per core.
For deadlock protection reasons relating to read and chip-to-chip
message transactions, the mesh uses two independent physical
router networks. For bandwidth efficiency, the cores alternate
sending their spike messages across the two physical networks.
This is possible because SNN computation does not depend on the
spike sequence ordering within a timestep.
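The routing behavior described above can be sketched as follows. The (x, y) core coordinates, the x-before-y dimension order, and the function names are assumptions for illustration; the text specifies only that routing is dimension-ordered, that the NoC is unicast, and that cores alternate between the two physical networks.

```python
def xy_route(src, dst):
    """Dimension-order route: travel along x first, then along y."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

def multicast_spike(src, dest_cores):
    """Emulate multicast with one unicast message per destination core.

    Successive messages alternate between physical networks 0 and 1, which
    is safe because spike ordering within a timestep does not matter.
    """
    return [(dst, k % 2, xy_route(src, dst)) for k, dst in enumerate(dest_cores)]

messages = multicast_spike((0, 0), [(2, 1), (0, 3)])
```

Each entry of `messages` holds the destination core, the physical network used, and the sequence of cores visited.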
At the end of the timestep, a mechanism is needed to ensure
that all spikes have been delivered and that it’s safe for the cores
to proceed to timestep t + 1. Rather than using a globally distributed
time reference (clock) that must pessimize for the worst-case
chip-wide network activity, we use a barrier synchronization
mechanism, illustrated in Figure 2d. As each core finishes servicing
its compartments for timestep t, it exchanges barrier messages
with its neighboring cores. The barrier messages flush any spikes
in flight and, in a second phase, propagate a timestep-advance
notification to all cores. As cores receive the second phase of
barrier messages, they advance their timestep and proceed to
update compartments for time t + 1.
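The guarantee provided by the barrier can be emulated in software. The sketch below is a sequential stand-in, not the asynchronous neighbor-to-neighbor handshake itself: it reproduces only the two-phase property that every spike sent in timestep t is delivered before any core starts timestep t + 1. The `Core` class, the `outgoing` table, and the spike labels are hypothetical.

```python
from collections import deque

class Core:
    def __init__(self, name):
        self.name = name
        self.timestep = 0
        self.inbox = deque()   # spikes delivered during the previous timestep
        self.consumed = []     # (timestep, spike) pairs processed so far

    def service(self, outgoing):
        """Consume delivered spikes, then emit this timestep's spikes."""
        while self.inbox:
            self.consumed.append((self.timestep, self.inbox.popleft()))
        return outgoing.get((self.name, self.timestep), [])

def run(cores, outgoing, steps):
    in_flight = []
    for _ in range(steps):
        for core in cores.values():
            in_flight.extend(core.service(outgoing))
        # Barrier phase 1: flush every spike still in flight.
        for dst, spike in in_flight:
            cores[dst].inbox.append(spike)
        in_flight.clear()
        # Barrier phase 2: all cores advance to the next timestep together.
        for core in cores.values():
            core.timestep += 1

cores = {"A": Core("A"), "B": Core("B")}
# (source core, timestep) -> list of (destination core, spike label)
outgoing = {("A", 0): [("B", "n1")], ("B", 0): [("A", "n2")]}
run(cores, outgoing, steps=2)
```

Spikes emitted at timestep 0 are consumed by their destinations at timestep 1, regardless of the order in which the cores were serviced.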
As long as management activity is restricted to a specific
“preemption” phase of the barrier synchronization process that
any embedded x86 core or off-chip host may introduce on
demand, the Loihi mesh is provably deadlock free.
3.3 Network Connectivity Architecture
In its most abstract formulation, the neural network mapped to
the Loihi architecture is a directed multigraph structure $\mathcal{G} = (N, S)$,
where N is the set of neurons in the network and S is a set of
synapses (edges) connecting pairs of neurons. Each synapse $s \in S$
corresponds to a 5-tuple: (i,j,wgt,dly,tag), where $i, j \in N$ identify
the source and destination neurons of the synapse, and wgt, dly,
and tag are integer-valued properties of the synapse. In general,
Loihi will autonomously modify the synaptic variables
(wgt,dly,tag) according to programmed learning rules. All other
network parameters remain constant unless they are modified by
x86 core intervention.
An abstract network is mapped to the mesh by assigning
neurons to cores, subject to each core’s resource constraints.
Figure 3 shows an example of a simple seven neuron network
mapped to three cores. Given a particular neuron-to-core
mapping for N, each neuron’s synaptic fanin state (wgt, dly, and
tag) must be stored in the core’s synaptic memory. These
schematically correspond to the synaptic spike markers in Figure
3. Each neuron’s fanout edges are projected to a list of core-to-
core edges (colored yellow), and each core-to-core edge is
assigned an axon_id identifier unique to each destination core
(colored red). The neuron’s synaptic fanout contained within each
destination core is associated with the corresponding axon_id and
organized as a list of 4-tuples (j,wgt,dly,tag) stored in the synaptic
memory in some suitably compressed form. When neuron i
spikes, the mesh routes each axon_id to the appropriate fanout
core which then expands it to the corresponding synaptic list.
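The mapping just described can be sketched with ordinary dictionaries. The small network below, the neuron-to-core assignment, and the names `map_network` and `spike` are hypothetical, and no compression is applied; the sketch shows only how per-core axon_id values index lists of (j, wgt, dly, tag) tuples.

```python
from collections import defaultdict

# Flat network: synapses as (i, j, wgt, dly, tag); core_of maps neuron -> core.
synapses = [("A", "D", 5, 1, 0), ("A", "E", -3, 2, 0),
            ("A", "B", 7, 1, 0), ("B", "F", 2, 1, 0)]
core_of = {"A": 1, "B": 2, "C": 2, "D": 3, "E": 3, "F": 3}

def map_network(synapses, core_of):
    axon_out = defaultdict(list)                      # source -> [(dst_core, axon_id)]
    syn_mem = defaultdict(lambda: defaultdict(list))  # dst_core -> axon_id -> fanout list
    ids = {}                                          # (source, dst_core) -> axon_id
    next_id = defaultdict(int)                        # axon_id counter per destination core
    for i, j, wgt, dly, tag in synapses:
        dst = core_of[j]
        if (i, dst) not in ids:                       # first edge from i into this core
            ids[(i, dst)] = next_id[dst]
            next_id[dst] += 1
            axon_out[i].append((dst, ids[(i, dst)]))
        syn_mem[dst][ids[(i, dst)]].append((j, wgt, dly, tag))
    return axon_out, syn_mem

def spike(i, axon_out, syn_mem):
    """Send one axon_id per destination core, then expand it to its synaptic list."""
    return [syn for dst, axon_id in axon_out[i] for syn in syn_mem[dst][axon_id]]

axon_out, syn_mem = map_network(synapses, core_of)
delivered = spike("A", axon_out, syn_mem)
```

Neuron A needs only two core-to-core edges for its three synapses, because both synapses into core 3 share one axon_id.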
This connectivity architecture can support arbitrary
multigraph networks subject to the cores’ resource constraints:
1) The total number of neurons assigned to any core may
not exceed 1,024 (Ncx).
2) The total synaptic fanin state mapped to any core must
not exceed 128KB (Nsyn × 64b, subject to compression
and list alignment considerations.)
3) The total number of core-to-core fanout edges mapped
to any given core must not exceed 4,096 (Naxout). This
corresponds to the number of output-side routing slots
highlighted in yellow in Figure 3.
4) The total number of distribution lists, associated by
axon_id, in any core must not exceed 4,096 (Naxin). This
is the number of input-side axon_id routing slots
highlighted in red in Figure 3.
In practice, constraints 2 and 4 tend to be the most limiting.
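The four constraints can be collected into a simple feasibility check. The limits are taken from the list above; the function name and the flat `bits_per_synapse` cost are assumptions, and the check deliberately ignores the compression and list-alignment effects mentioned in constraint 2.

```python
N_CX = 1024                     # neurons (compartments) per core
SYN_MEM_BITS = 128 * 1024 * 8   # 128 KB of synaptic memory, in bits
N_AXOUT = 4096                  # output-side core-to-core fanout edges
N_AXIN = 4096                   # input-side axon_id distribution lists

def check_core(n_neurons, n_synapses, bits_per_synapse, n_axout, n_axin):
    """Return the names of violated constraints (empty list if the mapping fits)."""
    violations = []
    if n_neurons > N_CX:
        violations.append("neurons")
    if n_synapses * bits_per_synapse > SYN_MEM_BITS:
        violations.append("synaptic memory")
    if n_axout > N_AXOUT:
        violations.append("output axons")
    if n_axin > N_AXIN:
        violations.append("input axons")
    return violations

fits = check_core(1024, 50_000, 16, 100, 100)
too_big = check_core(1024, 100_000, 16, 100, 5000)
```

The second mapping fails on synaptic memory and input axons, the two constraints noted above as the most limiting in practice.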
In order to exploit structure that may exist in the network,
Loihi supports a hierarchical network model. This feature can
significantly reduce the chip-wide connectivity and synaptic
resources needed to map convolutional-style networks in which a
template of synaptic connections is applied to many neurons in a
uniform way.
Formally, we represent the hierarchical template network as a
directed multigraph $\mathcal{H} = (\mathcal{T}, \mathcal{E})$ where $\mathcal{T}$ is a set of disjoint neuron
population types and $\mathcal{E}$ defines a set of edges connecting between
pairs $T_{src}, T_{dst} \in \mathcal{T}$. An edge $E \in \mathcal{E}$ associated with the $(T_{src}, T_{dst})$
population type pair is a set of synapses where each $s \in E$
connects a neuron $i \in T_{src}$ to a neuron $j \in T_{dst}$.
In order to hierarchically compress the resource mapping of
the desired flat network $\mathcal{G} = (N, S)$, a set of disjoint neuron
population instances $\mathcal{P}$ must be defined where each $P \in \mathcal{P}$ is a
subset of neurons $P \subset N$. Each population instance is associated
with a population type $T \in \mathcal{T}$ from the hierarchical template
network $\mathcal{H}$. Neurons $n \in N$ belonging to some population
instance $P \in \mathcal{P}$ are said to be population-mapped. By configuring
the connectivity in hardware, the redundant connectivity in $\mathcal{G}$
is implied and doesn’t consume resources, beyond what it takes
to map the population-level connectivity of $\mathcal{H}$ as if it were a flat
network.
Fig. 3: Neuron-to-neuron mesh routing model
Population-mapped neurons produce population spike
messages whose axon_id fields identify (1) the destination
population $P_{dst}$, (2) the source neuron index $i \in P_{src}$ within the
source population, and (3) the particular edge connecting
between $T_{src}$ and $T_{dst}$ when there is more than one. One
population spike must be sent per destination population rather
than per destination core, as in the flat case. This marginally higher
level of spike traffic is more than offset by the savings in network
mapping resources.
Convolutional artificial neural networks (ConvNets), in which a
single kernel of weights is repeatedly applied to different patches
of input pixels, are an example class of network that greatly benefits
from hierarchy. By treating such a weight kernel as the template
connectivity that is applied to the different image patches
(population instances), Loihi can support a spiking form of such
networks. The S-LCA network discussed in Section 5.2 features a
similar kernel-style convolutional network topology which
additionally includes lateral inhibitory connections between the
feature neurons of each population instance.
3.4 Learning Engine
3.4.1 Baseline STDP
A number of neuromorphic chip architectures to date have
incorporated the most basic form of pairwise, nearest-neighbor
spike time dependent plasticity (STDP). Pairwise STDP is simple,
event-driven, and highly amenable to hardware implementation.
For a given synapse connecting presynaptic neuron j to
postsynaptic neuron i, an implementation needs only maintain
the most recent spike times for the two neurons ($t_j^{pre}$ and $t_i^{post}$).
Given a spike arrival at time t, one local nonlinear computation
needs to be evaluated in order to update the synaptic weight:
$$\Delta w_{i,j} = \begin{cases} A_-\,\mathcal{F}(t - t_i^{post}), & \text{on presynaptic spike} \\ A_+\,\mathcal{F}(t - t_j^{pre}), & \text{on postsynaptic spike} \end{cases} \qquad (3)$$
where $\mathcal{F}(t)$ is some approximation of $e^{-t/\tau} \cdot H(t)$, for constants
$A_- < 0$, $A_+ > 0$, and $\tau > 0$. Since a design must already perform a
lookup of weight $w_{i,j}$ on any presynaptic spike arrival, the first case
above matches the natural dataflow present in any neuromorphic
implementation. To support this depressive half of the STDP
learning rule, the handling of a presynaptic spike arrival simply
turns a read of the weight state into a read-modify-write
operation, assuming availability of the $t_i^{post}$ spike time.
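A minimal sketch of the pairwise rule in Equation 3 follows, assuming an exact exponential kernel in floating point. The function name, the constant values, and the `on` argument are hypothetical; a hardware implementation would use some approximation of the kernel, as the text notes.

```python
import math

def stdp_update(w, t, t_pre, t_post, on, a_minus=-0.5, a_plus=1.0, tau=10.0):
    """Nearest-neighbor pairwise STDP with an exponential kernel (assumed form).

    t_pre and t_post are the most recent pre- and postsynaptic spike times,
    both assumed to be no later than the current time t.
    """
    if on == "pre":    # presynaptic spike arrives: depress (a_minus < 0)
        return w + a_minus * math.exp(-(t - t_post) / tau)
    if on == "post":   # postsynaptic spike occurs: potentiate (a_plus > 0)
        return w + a_plus * math.exp(-(t - t_pre) / tau)
    raise ValueError(f"unknown event type: {on!r}")

# Presynaptic spike arriving in the same step as the last postsynaptic spike.
w_dep = stdp_update(10.0, t=5, t_pre=0, t_post=5, on="pre")
# Postsynaptic spike two steps after the last presynaptic spike.
w_pot = stdp_update(10.0, t=5, t_pre=3, t_post=0, on="post")
```

The `"pre"` branch needs only the weight being read anyway plus one stored spike time, which is why it fits the natural dataflow; the `"post"` branch is the one that requires finding the weight from the postsynaptic side.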
The potentiating half of Equation 3 is the only significant
challenge that pairwise STDP introduces. To handle this weight
update in an event-driven manner, symmetric to the depressive
case, the implementation needs to perform a backwards routing
table lookup, obtaining $w_{i,j}$ from the firing postsynaptic neuron i.
This is at odds with the algorithmic impetus for more complex and
diverse network routing functions $R : j \to Y$, where $i \in Y$. The
more complex R becomes, the more expensive, in general, it
becomes to implement an inverse lookup $R^{-1}$ efficiently in
hardware. Some implementations have explored creative
solutions to this problem [12], but in general these approaches
constrain network topologies and are not scalable.
For Loihi, we adopt a less event-driven epoch-based synaptic
modification architecture in the interest of supporting arbitrarily
complex R and extending the architecture to more advanced
learning rules. This architecture delays the updating of all synaptic
state to the end of a periodic learning epoch time $T_{\mathrm{epoch}}$.
An epoch-based architecture fundamentally requires iteration
over each core’s active input axons, which Loihi does sequentially.
In theory this is a disadvantage that a direct implementation of the
$R^{-1}$ reverse lookup may avoid. However, in practice, any pipelined
digital core implementation still requires iteration over active
input axons in order to maintain spike timestamp or trace state.
Even the fully transposable synaptic crossbar architecture used in
[12] includes an iteration over all input axons per timestep for this
reason.
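The deferred, epoch-based update schedule can be illustrated with a short sketch. The data structures here (`spikes_by_step`, `fanout`) and the callback `update_synapse` are hypothetical names chosen for the example, and the sketch covers only updates triggered by presynaptic activity.

```python
def run_learning_epochs(spikes_by_step, fanout, t_epoch, update_synapse):
    """Defer synaptic updates to epoch boundaries.

    spikes_by_step[k] is the set of input axons that spiked at timestep k + 1;
    fanout maps an input axon to the synapses it drives (both hypothetical).
    Returns a log of (timestep, axon, synapse) for every update performed.
    """
    active = set()
    log = []
    for t, axons in enumerate(spikes_by_step, start=1):
        active |= axons  # during the epoch, only record which axons were active
        if t % t_epoch == 0:  # epoch boundary reached
            for axon in sorted(active):  # sequential pass over active input axons
                for synapse in fanout[axon]:
                    update_synapse(axon, synapse)
                    log.append((t, axon, synapse))
            active.clear()
    return log
```

No inverse routing lookup is needed in this arrangement: every update is reached by walking forward from an input axon to its synapses.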
3.4.2 Advancing Beyond Pairwise STDP
A number of architectural challenges arise in the pursuit of
supporting more advanced learning rules. First, the functional
forms describing $w_{i,j}$ become more complex and seemingly
arbitrary. These rules are at the frontier of algorithm research and
therefore require a high degree of configurability. Second, the
rules involve multiple synaptic variables, not just weights. Finally,
advanced learning rules rely on temporal correlations in spiking
activity over a range of timescales, which means more than just
the most recent spike times must be maintained. These challenges
motivate the central features of Loihi’s learning architecture,
described below.
3.4.3 Learning Rule Functional Form
On every learning epoch, a synapse will be updated whenever the
appropriate pre- or post-synaptic conditions are satisfied. A set of
microcode operations associated with the synapse determines the
functional form of one or more transformations to apply to the
synapse’s state variables. The rules are specified in sum-of-
products form:
≔+
(, +,)


( 4 )
where z is the transformed synaptic variable (either wgt, dly, or
tag), Vi,j refers to some choice of input variable available to the
learning engine, and Ci,j and Si are microcode-specified signed
constants.
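A sum-of-products rule of the kind in Equation 4 can be evaluated in a few lines. The function name `apply_rule` and the nested-list encoding of the products are assumptions for illustration. The sketch uses unbounded Python integers and does not model the hardware's register widths or microcode format.

```python
from math import prod


def apply_rule(z, products):
    """Return z + sum_i S_i * prod_j (V_ij + C_ij).

    `products` is a list of (S_i, terms) pairs, where `terms` is a list of
    (V_ij, C_ij) pairs. An empty `terms` list contributes S_i alone.
    """
    return z + sum(s * prod(v + c for v, c in terms) for s, terms in products)


# Two products: 2 * (3 + 0) * (1 + 1) and -1 * (4 - 1).
new_z = apply_rule(10, [(2, [(3, 0), (1, 1)]), (-1, [(4, -1)])])
```

Here the first product contributes 12 and the second contributes -3, so `new_z` is 19.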
Table 1 provides a comprehensive list of product terms as
encoded by a 4-bit field in each microcode op. The multiplications
and summations of Equation 4 are computed iteratively by the
hardware and accumulated in 16-bit registers. The epoch period
is globally configured per core up to a maximum value of 63, with
typical values in the 2 to 8 range. To avoid receiving more than
one spike in a given epoch, the epoch period is normally set to the
minimum refractory delay of all neurons in the network.
This article has been accepted for publication in IEEE Micro but has not yet been fully edited.
Some content may change prior to final publication.
Digital Object Identifier 10.1109/MM.2018.112130359 0272-1732/$26.00 2018 IEEE
The basic pairwise STDP rule only requires two products
involving four of these terms (0, 1, 3, and 4) and two constants.
The Loihi microcode format can specify this rule in a single 32-bit
word. With an encoding capacity of up to sixteen 32-bit words and
the full range of terms in Table 1, the learning engine provides
considerable headroom for far more complex rules.
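As an illustration, one plausible two-product encoding of pairwise STDP (an assumption based on the term list in Table 1, not a microcode listing from the authors) pairs the presynaptic spike count $x_0$ with the first postsynaptic trace $y_1$, and the postsynaptic spike count $y_0$ with the first presynaptic trace $x_1$, with all additive constants C set to zero:

$$
w := w + S_1 \, x_0 \, y_1 + S_2 \, y_0 \, x_1, \qquad S_1 < 0, \; S_2 > 0.
$$

The first product is nonzero only in epochs containing a presynaptic spike and depresses the weight in proportion to the postsynaptic trace. The second is nonzero only in epochs containing a postsynaptic spike and potentiates the weight in proportion to the presynaptic trace.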
3.4.4 Trace Evaluation
The trace variables (x1,x2,y1,y2,y3,r1) in Table 1 refer to filtered
spike trains associated with each synapse that the learning engine
modifies. The filtering function associated with each trace is
defined by two configurable quantities: an impulse amount δ
added on every spike event and a decay factor α. Given a spike
arrival sequence $s[t] \in \{0, 1\}$, an ideal trace sequence x[t] over
time is defined as follows:
$$
x[t] = \alpha \cdot x[t-1] + \delta \cdot s[t]. \qquad (5)
$$
The Loihi hardware computes a low-precision (seven bit)
approximation of this first-order filter using stochastic rounding.
By setting δ to 1 (typically with relatively small α), x[t]
saturates on each spike and its decay measures elapsed time since
the most recent spike. Such trace configurations exactly
implement the baseline STDP rules dependent only on nearest-
neighbor pre/post spike time separations described in Section
3.4.1. On the other hand, setting δ to a value less than 1,
specifically $1 - \alpha^{T_{\min}}$, where $T_{\min}$ is the minimum spike period,
causes sufficiently closely spaced spike impulses to accumulate
over time and x[t] reflects the average spike rate over a timescale
of τ = −1/log α.
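The trace filter and its two configurations can be explored with a small sketch. The function names, the integer scaling of the impulse, the saturation at the seven-bit maximum, and the particular stochastic rounding scheme are all assumptions for illustration. The source states only that the hardware uses a seven-bit approximation with stochastic rounding.

```python
import math
import random


def ideal_trace(spikes, alpha, delta):
    """x[t] = alpha * x[t-1] + delta * s[t], starting from x = 0."""
    x, out = 0.0, []
    for s in spikes:
        x = alpha * x + delta * s
        out.append(x)
    return out


def stochastic_round(value, rng):
    """Round up with probability equal to the fractional part, else down."""
    low = math.floor(value)
    return low + (1 if rng.random() < value - low else 0)


def quantized_trace(spikes, alpha, impulse, bits=7, seed=0):
    """Integer trace in [0, 2**bits - 1]; `impulse` is delta in integer units."""
    rng = random.Random(seed)
    x_max = (1 << bits) - 1
    x, out = 0, []
    for s in spikes:
        x = min(x_max, stochastic_round(alpha * x + impulse * s, rng))
        out.append(x)
    return out


# Rate-estimating configuration: delta = 1 - alpha**T_min.
alpha, t_min = 0.9, 4
delta = 1 - alpha ** t_min
spikes = [1 if t % t_min == 0 else 0 for t in range(400)]
peak = max(ideal_trace(spikes, alpha, delta))
tau = -1 / math.log(alpha)  # filter timescale in timesteps
```

With spikes arriving at the minimum period, the peaks of the ideal trace form a geometric series that converges to $\delta / (1 - \alpha^{T_{\min}}) = 1$. This choice of δ therefore maps the maximum spike rate to a full-scale trace value.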
4 DESIGN IMPLEMENTATION
4.1 Core Microarchitecture
Figure 4 shows the internal structure of the Loihi neuromorphic
core. Colored blocks in this diagram represent the major
memories that store the connectivity, configuration, and dynamic
state of all neurons mapped to the core. The core’s total SRAM
capacity is 2Mb including ECC overhead. The coloring of memories
and dataflow arcs illustrates the core’s four primary operating
modes: input spike handling (green), neuron compartment
updates (purple), output spike generation (blue), and synaptic
updates (red). Each of these modes operates independently with
minimal synchronization at a variety of frequencies, based on the
state and configuration of the core. The black structure marked
UCODE represents the configurable learning engine.
The values annotated by each memory indicate its number of
logical addresses, which correspond to the core’s major resource
constraints. The number of input and output axons (Naxin and
Naxout), the synaptic memory size (Nsyn), and the total number of
neuron compartments (Ncx) impose network connectivity
constraints as described in Section 3.3. The parameter Nsdelay
indicates the minimum number of synaptic delay units supported,
eight in Loihi. Larger synaptic delay values, up to 62, may be
supported when fewer neuron compartments are needed by a
particular mapped network.
Varying degrees of parallelism and serialization are applied to
sections of the core’s pipeline in order to balance the throughput
bottlenecks that typical workloads will encounter. Dataflow arcs
drawn with finely dotted arrows in Figure 4 indicate parts of the
design where single events are expanded into a potentially large
number of dependent events. In these areas, we generally
parallelize the hardware.
For example, synapses are extracted from SYNAPSE_MEM’s
64-bit words with up to four-way parallelism, depending on the
synaptic encoding format, and that parallelism is extended to
DENDRITE_ACCUM and throughout the synaptic modification
pipeline in the learning engine. Conversely, the presynaptic trace
state is stored together with SYNAPSE_MEM pointer entries in the
SYNAPSE_MAP memory, which then may result in multiple serial
accesses per ingress spike. This balances pipeline throughputs for
ingress learning-enabled axons when their synaptic fanout factor
within the core is on the order of 10:1 while maintaining the best
possible area efficiency.
Read-modify-write (RMW) memory accesses, shown as loops
around the relevant memories in Figure 4, are fundamental to the
neuromorphic computational model and unusually pervasive
compared to many other microarchitecture domains. Such loops
can introduce significant design challenges, particularly for
performance. We manage this challenge with an asynchronous
design pattern that encapsulates and distributes the memory’s
state over a collection of single-ported SRAM banks. The
encapsulation wrapper presents a simple dual-ported interface to
the environment logic and avoids severely stalling the pipeline
except for statistically rare address conflicts.
| Encoding | Term ($T_{i,j}$) | Bits | Description |
|---|---|---|---|
| 0 | x0 + C | 5b (U) | Presynaptic spike count |
| 1 | x1 + C | 7b (U) | 1st presynaptic trace |
| 2 | x2 + C | 7b (U) | 2nd presynaptic trace |
| 3 | y0 + C | 5b (U) | Postsynaptic spike count |
| 4 | y1 + C | 7b (U) | 1st postsynaptic trace |
| 5 | y2 + C | 7b (U) | 2nd postsynaptic trace |
| 6 | y3 + C | 7b (U) | 3rd postsynaptic trace |
| 7 | r0 + C | 1b (U) | Reward spike |
| 8 | r1 + C | 8b (S) | Reward trace |
| 9 | wgt + C | 9b (S) | Synaptic weight |
| 10 | dly + C | 6b (U) | Synaptic delay |
| 11 | tag + C | 9b (S) | Synaptic tag |
| 12 | sgn(wgt + C) | 1b (S) | Sign of case 9 (±1) |
| 13 | sgn(dly + C) | 1b (S) | Sign of case 10 (±1) |
| 14 | sgn(tag + C) | 1b (S) | Sign of case 11 (±1) |
| 15 | C | 8b (S) | Constant term. (Variant 1) |
| 15 | $S_m \cdot 2^{S_e}$ | 4b (S) | Scaling term. 4b mantissa, 4b exponent. (Variant 2) |

TABLE 1: Learning rule product terms
4.2 Asynchronous Design Methodology
Biological neural networks are fundamentally asynchronous, as
reflected by the absence of an explicit synchronization assumption
in the continuous time SNN model given in Section 2. Accordingly,
asynchronous design methods have long been seen as the
appropriate tool for prototyping spiking neural networks in silicon,
and most published chips to date use this methodology. Loihi is no
different and in fact the asynchronous design methodology
developed for Loihi is the most advanced of its kind.
For rapid neuromorphic design prototyping, we extended and
improved on an earlier asynchronous design methodology used to
develop several generations of commercial Ethernet switches. In
this methodology, designs are entered according to a top-down
decomposition process using the CAST and CSP languages.
Modules in each level of design hierarchy communicate over
message-passing channels that are later mapped to a circuit-level
implementation, which in this case is a bundled data
implementation comprising a data payload with request and
acknowledge handshaking signals that mediate the propagation of
data tokens through the system. Figure 5 shows a template
pipeline example. Each pipeline stage has at least one pulse
generator, such as the one shown in Figure 6, that implements the
two-phase handshake and latch sequencing.
Fine-grain flow control is an important property of
asynchronous design that offers several benefits for
neuromorphic applications. First, since the activity in SNNs is
highly sparse in both space and time, the activity gating that
comes automatically with asynchronous flow control eliminates
the power that would often be wasted by a continuously running
clock. Second, local flow control allows different modules in the
same design to run at their natural microarchitectural
frequencies. This property complements the need for spiking
neuron processes to run at a variety of timescales dependent on
workload and can significantly simplify back-end timing closure.
Finally, asynchronous techniques can reduce or eliminate timing
margin. In Loihi, the mesh-level barrier synchronization
mechanism is the best example of asynchronous handshaking
providing a globally significant performance advantage by
eliminating needless mesh-wide idle time.
Given a hierarchical design decomposition written in CSP, a
pipeline synthesis tool converts the CSP module descriptions to
Verilog representations that are compatible with standard EDA
tools. The initial Verilog representation supports logic synthesis to
both synchronous and asynchronous implementations with full
functional equivalence, providing support for synchronous FPGA
emulation of the design.
Fig. 4: Core Top-Level Microarchitecture. The SYNAPSE unit processes all incoming spikes and reads out the associated synaptic weights from the memory.
The DENDRITE unit updates the state variables u and v of all neurons in the core. The AXON unit generates spike messages for all fanout cores of each firing
neuron. The LEARNING unit updates synaptic weights using the programmed learning rules at epoch boundaries.
Fig. 5: Bundled data pipeline stage
[Diagram labels: LATCH, PULSE GENERATOR, EN, DEN, L.q, L.a, R.q, R.a, pulse-width extension, pipeline datapath, DATA IN, DATA OUT, Logic]
Fig. 6: Bundled data pulse generator circuit
[Diagram labels: PULSE GENERATOR, LATCH (D, Q), EN, DEN, L.q, L.a, R.q, R.a]
The asynchronous back-end layout flow uses standard tools
with an almost fully standard cell library. Here, the asynchronous
methodology simplifies the layout closure problem. At every level
of layout hierarchy, all timing constraints apply only to
neighboring, physically proximate pipeline stages. This greatly
facilitates convergent timing closure, especially at the chip level.
For example, the Loihi mesh assembles by physical abutment
without needing any unique clock distribution layout or timing
analysis for different mesh dimensions or core types.
5 RESULTS
5.1 Silicon Realization
Loihi was fabricated in Intel’s 14nm FinFET process. The chip
instantiates a total of 2.07 billion transistors and 33 MB of SRAM
over its 128 neuromorphic cores and three x86 cores, with a die
area of 60 mm². The device is functional over a supply voltage
range of 0.50V to 1.25V. Table 2 provides a selection of energy and
performance measurements from pre-silicon SDF and SPICE
simulations, consistent with early post-silicon characterization.
Loihi includes a total of 16MB of synaptic memory. With its
densest 1-bit synapse format, this provides a total of 2.1 million
unique synaptic variables per mm², over three times higher than
TrueNorth, the previously most dense SNN chip [11]. This does not
consider Loihi’s hierarchical network support that can significantly
boost its effective synaptic density. On the other hand, Loihi’s
maximum neuron density of 2,184 per mm² is marginally worse
than TrueNorth’s. Process normalized, this represents a
reduction in the design’s neuron density, which may be
interpreted as the cost of Loihi’s greatly expanded feature set, an
intentional design choice.
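As a rough consistency check on the neuron density figure, assume 1,024 neuron compartments per core (a figure not stated in this section) and take the density over the full 60 mm² die:

$$
\frac{128 \times 1024}{60\ \mathrm{mm}^2} = \frac{131{,}072}{60\ \mathrm{mm}^2} \approx 2{,}184.5\ \mathrm{mm}^{-2},
$$

which agrees with the quoted maximum of 2,184 neurons per mm².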
5.2 Algorithmic Results
On an earlier iteration of the Loihi architecture, we quantitatively
assessed the efficiency of Spiking LCA to solve LASSO, as described
in Section 2.2. We used a 1.67 GHz Atom CPU running both LARS
and FISTA [3] numerical solvers as a reference architecture for
benchmarking. These solvers are among the best known for this
problem. Both chips were fabricated in 14nm technology, were
evaluated at a 0.75V supply voltage, and required similar active
silicon areas (5 mm²).
| Measured parameter | Value at 0.75V |
|---|---|
| Cross-sectional spike bandwidth per tile | 3.44 Gspike/s |
| Within-tile spike energy | 1.7 pJ |
| Within-tile spike latency | 2.1 ns |
| Energy per tile hop (E-W / N-S) | 3.0 pJ / 4.0 pJ |
| Latency per tile hop (E-W / N-S) | 4.1 ns / 6.5 ns |
| Energy per synaptic spike op (min) | 23.6 pJ |
| Time per synaptic spike op (max) | 3.5 ns |
| Energy per synaptic update (pairwise STDP) | 120 pJ |
| Time per synaptic update (pairwise STDP) | 6.1 ns |
| Energy per neuron update (active / inactive) | 81 pJ / 52 pJ |
| Time per neuron update (active / inactive) | 8.4 ns / 5.3 ns |
| Mesh-wide barrier sync time (1-32 tiles) | 113-465 ns |

TABLE 2: Loihi pre-silicon performance and energy measurements
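The per-hop figures in Table 2 lend themselves to back-of-envelope routing estimates. The sketch below assumes a simple additive cost model (one within-tile cost plus a cost per mesh hop). This model is an assumption for illustration, not one the measurements are stated to follow, and the function name `spike_route_cost` is hypothetical.

```python
# Table 2 values at 0.75 V (pre-silicon estimates).
WITHIN_TILE_ENERGY_PJ = 1.7
WITHIN_TILE_LATENCY_NS = 2.1
HOP_ENERGY_PJ = {"EW": 3.0, "NS": 4.0}
HOP_LATENCY_NS = {"EW": 4.1, "NS": 6.5}


def spike_route_cost(ew_hops, ns_hops):
    """Return (energy_pJ, latency_ns) for one spike under an additive model."""
    energy = (WITHIN_TILE_ENERGY_PJ
              + ew_hops * HOP_ENERGY_PJ["EW"]
              + ns_hops * HOP_ENERGY_PJ["NS"])
    latency = (WITHIN_TILE_LATENCY_NS
               + ew_hops * HOP_LATENCY_NS["EW"]
               + ns_hops * HOP_LATENCY_NS["NS"])
    return energy, latency


# Example: three east-west hops and two north-south hops.
energy_pj, latency_ns = spike_route_cost(3, 2)
```

Under these assumptions the example route costs about 18.7 pJ and 27.4 ns.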
Fig. 8: Image reconstruction from the sparse coefficients
computed using the Loihi predecessor. (a) Original; (b) Reconstruction.
The largest problem we evaluated is a convolutional sparse
coding problem on a 52×52 image with a 224-atom dictionary, a
patch size of 8×8, and a patch stride of 4 pixels. Loihi’s hierarchical
connectivity provided a factor of 18 compression in synaptic
resources for this network. We solved the sparse coding problem
to a solution within 1% of the optimal solution. Figure 8 compares
the original and the reconstructed image using the computed
sparse coefficients.
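The size of this problem follows directly from the stated image, patch, and dictionary dimensions. The sketch below assumes patches are taken without padding. The function name is hypothetical.

```python
def conv_sparse_coding_unknowns(image_side, patch_side, stride, atoms):
    """Number of sparse coefficients for a square image tiled by strided patches."""
    positions_per_side = (image_side - patch_side) // stride + 1
    return positions_per_side ** 2 * atoms


# 52x52 image, 8x8 patches, stride 4, 224-atom dictionary.
unknowns = conv_sparse_coding_unknowns(52, 8, 4, 224)
```

There are (52 - 8) / 4 + 1 = 12 patch positions per side, so the problem has 12 × 12 × 224 = 32,256 unknowns, which matches the largest problem size reported in Table 3.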
Table 3 shows the comparison in computational efficiency
between these two architectures, as measured by EDP. It is not
surprising to see that the conventional LARS solver can handle
problems of small sizes and very sparse solutions quite efficiently.
On the other hand, the conventional solvers do not scale well for
the large problem and the Loihi predecessor achieves the target
objective value with over 5,000 times lower EDP.
| No. unknowns | 400 | 1,700 | 32,256 |
|---|---|---|---|
| No. nonzeros in solutions | 10 | 30 | 420 |
| Energy | 2.58x | 8.08x | 48.74x |
| Delay | 0.27x | 2.76x | 118.18x |
| EDP | 0.7x | 22.33x | 5760x |

TABLE 3: Comparison of solving $\ell_1$ minimization on Loihi and Atom.
Results are expressed as improvement ratios Atom/Loihi. The Atom
numbers are taken from the more efficient of the two solvers, LARS
and FISTA.
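Because EDP is the product of energy and delay, each EDP ratio in Table 3 should equal the product of the corresponding energy and delay ratios. A quick check, using only the values reported in the table:

```python
# (energy ratio, delay ratio) per problem size, from Table 3 (Atom / Loihi).
ratios = {
    400: (2.58, 0.27),
    1700: (8.08, 2.76),
    32256: (48.74, 118.18),
}

# EDP improvement ratio = energy ratio * delay ratio.
edp = {n: energy * delay for n, (energy, delay) in ratios.items()}
```

The products agree with the reported EDP row to within rounding. For the smallest problem the ratio is below 1, meaning the conventional solver is more efficient there.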
Fig. 7: Loihi chip plot
Loihi’s flexible learning engine allows one to explore and
experiment with various learning methods. We have developed
and validated the following networks in pre-silicon FPGA
emulation with all learning taking place on chip:
- A single-layer classifier using a supervised variant of STDP
  similar to [4] as the learning method. This network, when
  trained with local-intensity-change based temporally
  spike-coded image samples, can achieve 96% accuracy on
  the MNIST dataset using ten neurons, in line with a
  reference ANN of the same structure.
- Solving the shortest path problem of a weighted graph.
  Vertices and edges are represented as neurons and
  synapses respectively. The algorithm is based on the
  effects of STDP on a propagating wavefront of spikes [13].
- Solving a one-dimensional, non-Markovian sequential
  decision making problem. The network learns the decision
  making policy in response to delayed reward and
  punishment feedback similar to [14].
The algorithmic development and characterization of Loihi is
just beginning. These proof-of-concept examples use only a
fraction of the resources and features available in the chip. With
Loihi now in hand, our focus turns to scaling and further evaluating
these networks.
6 CONCLUSION
Loihi is Intel’s fifth and most complex fabricated chip in a family of
devices that explore different points in the neuromorphic design
space spanning architectural variations, circuit methodologies,
and process technology. In some respects, its flexibility may go too
far, while in others, not far enough. Further optimizations of the
architecture and implementation are planned. The pursuit of
commercially viable neuromorphic architectures and algorithms
may well end at design points far from what we have described in
this paper, but we hope Loihi provides a step in the right direction.
We offer it as a vehicle for collaborative exploration with the
broader research community.
REFERENCES
[1] P. T. P. Tang, T.-H. Lin, and M. Davies, “Sparse coding by spiking neural
networks: Convergence theory and computational results,” arXiv e-
prints, 2017.
[2] S. Shapero, M. Zhu, J. Hasler, and C. Rozell, “Optimal sparse
approximation with integrate and fire neurons,” International journal of
neural systems, vol. 24, no. 5, p. 1440001, 2014.
[3] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding
algorithm for linear inverse problems,” SIAM Journal on Imaging
Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[4] F. Ponulak and A. Kasiński, “Supervised learning in spiking neural
networks with ReSuMe: sequence learning, classification, and spike
shifting,” Neural Computation, vol. 22, no. 2, pp. 467–510, 2010.
[5] T.-H. Lin, “Local Information with Feedback Perturbation Suffices for
Dictionary Learning in Neural Circuits,” arXiv e-prints, 2017.
[6] E. Neftci, C. Augustine, S. Paul, and G. Detorakis, “Event-Driven Random
Back-Propagation: Enabling Neuromorphic Deep Learning Machines,”
Frontiers in neuroscience, vol. 11, p. 324, 2017.
[7] J. Gjorgjieva, C. Clopath, J. Audet, and J.-P. Pfister, “A triplet spike-
timing dependent plasticity model generalizes the Bienenstock Cooper
Munro rule to higher-order spatiotemporal correlations,” Proceedings
of the National Academy of Sciences, vol. 108, no. 48, pp. 19383–19388,
2011.
[8] N. Qiao, H. Mostafa, F. Corradi, M. Osswald, F. Stefanini, D.
Sumislawska, and G. Indiveri, “A reconfigurable on-line learning spiking
neuromorphic processor comprising 256 neurons and 128K synapses,”
Frontiers in Neuroscience, vol. 9, p. 141, 2015.
[9] L. Buesing, J. Bill, B. Nessler, and W. Maass, “Neural dynamics as
sampling: a model for stochastic computation in recurrent networks of
spiking neurons,” PLoS computational biology, vol. 7, no. 11, p.
e1002211, 2011.
[10] E. M. Izhikevich, “Polychronization: Computation with Spikes,” Neural
Computation, vol. 18, no. 2, pp. 245–282, 2006, PMID: 16378515.
[11] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy,
J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B.
Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D.
Flickner, W. P. Risk, R. Manohar, and D. S. Modha, “A million spiking-
neuron integrated circuit with a scalable communication network and
interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
[12] J.-s. Seo, B. Brezzo, Y. Liu, B. D. Parker, S. K. Esser, R. K. Montoye, B.
Rajendran, J. A. Tierno, L. Chang, D. S. Modha, and D. J. Friedman, “A
45nm CMOS neuromorphic chip with a scalable architecture for
learning in networks of spiking neurons,” in 2011 IEEE Custom
Integrated Circuits Conference (CICC), Sept 2011, pp. 1–4.
[13] F. Ponulak and J. J. Hopfield, “Rapid, parallel path planning by
propagating wavefronts of spiking neural activity,” Frontiers in
Computational Neuroscience, vol. 7, p. 98, 2013.
[14] R. V. Florian, “Reinforcement learning through modulation of spike-
timing-dependent synaptic plasticity,” Neural Computation, vol. 19, no.
6, pp. 1468–1502, 2007.
AUTHOR INFORMATION
At the time of development, all authors were researchers in the
Architecture and Design Research (ADR) division of Intel Labs. Loihi chip
development and algorithms research was performed in the Microarchitecture
Research Lab (MRL) headed by Hong Wang, Intel Fellow. Mike Davies led
silicon development, Narayan Srinivasa led algorithms research and
architectural modeling. Tsung-Han Lin is a researcher in MRL focused on
sparse coding and related learning algorithms. Gautham Chinya, also in MRL
and focused on advanced IP prototyping, led validation and SDK development.
Georgios Dimou, Prasad Joshi, Andrew Lines, Ruokun Liu, Steve McCoy,
Jonathan Tse, and Yi-Hsin Weng developed Loihi’s asynchronous architecture,
design flow, and design components, and Sri Harsha Choday contributed to
asynchronous circuit validation. Yongqiang Cao, Nabil Imam, Arnab Paul, and
Andreas Wild contributed to Loihi’s algorithms, feature set, and modeling.
Shweta Jain, Chit-Kwan Lin, Deepak Mathaikutty, Guruguhanathan
Venkataramanan, and Yoonseok Yang prototyped proof-of-concept networks
and software to demonstrate the chip’s learning capabilities and validate its
functionality, and also provided synchronous and FPGA design development
support. Yuyun Liao, a silicon implementation manager in ADR, helped to
validate all aspects of the final Loihi layout implementation. Going forward,
Mike Davies leads all ongoing neuromorphic research in Intel Labs as head of
its Neuromorphic Computing Lab. Any inquiries should be directed to him.
... Since the start of the third wave of artificial neural network research in 2006, artificial intelligence, and in particular deep neural network (DNN) model-based artificial intelligence, has made amazing progress in data-centric applications [1][2][3][4] . DNNs have too many weights to be stored in the internal memory, and the transfer of weights between the external memory and the processing units leads to substantial power consumption, known as the von Neumann bottleneck [5][6][7] . ...
Article
Full-text available
In-memory computing could enhance computing energy efficiency by directly implementing multiply accumulate (MAC) operations in a crossbar memory array with low energy consumption (around femtojoules for a single operation). However, a crossbar memory array cannot execute nonlinear activation; moreover, activation processes are power-intensive (around milliwatts), limiting the overall efficiency of in-memory computing. Here we develop an ultrafast bipolar flash memory to execute self-activated MAC operations. Based on atomically sharp van der Waals heterostructures, the basic flash cell has an ultrafast n/p program speed in the range of 20–30 ns and an endurance of 8 × 10⁶ cycles. Utilizing sign matching between the input voltage signal and the storage charge type, our bipolar flash can realize a rectified linear unit activation function during the MAC process with a power consumption for each operation of just 30 nW (or 5 fJ of energy). Using a convolutional neural network, we find that the self-activated MAC method has a simulated accuracy of 97.23%, tested on the Modified National Institute of Standards and Technology dataset, which is close to the conventional method where the MAC and activation operations are separated.
... The Loihi is a neuromorphic many-core processor that supports on-chip learning, making it a viable option for both inference and deployment. The Loihi has 128 cores, simulates up to 131,072 neurons with 130,000,000 synapses possible; furthermore, each Loihi can be put in parallel with up to 16,384 other chips, reaping the benefits of massive parallelism by allowing the number of effective on-chip cores to be 4,096 (Davies et al., 2018). Another common implementation tool for SNNs is SpiNNaker, a million-core, open-access, ARM-based neuromorphic computing platform developed and housed in the University of Machester: SpiNNaker is used for simulation and testing for applications that do not require on-site implementations (Furber et al., 2014). ...
Preprint
Full-text available
Biological neural networks continue to inspire breakthroughs in neural network performance. And yet, one key area of neural computation that has been under-appreciated and under-investigated is biologically plausible, energy-efficient spiking neural networks, whose potential is especially attractive for low-power, mobile, or otherwise hardware-constrained settings. We present a literature review of recent developments in the interpretation, optimization, efficiency, and accuracy of spiking neural networks. Key contributions include identification, discussion, and comparison of cutting-edge methods in spiking neural network optimization, energy-efficiency, and evaluation, starting from first principles so as to be accessible to new practitioners.
... In recent years, several neuromorphic chips have been introduced both from academia and industry, including analog [41], digital [42][43][44][45][46][47][48] and mixed-signal [49] implementations. A dominant trend across these approaches is to implement a large number of spiking neurons communicating with each other via event-driven packets that encapsulate spiking information. ...
Chapter
Full-text available
Reverse-engineering the human brain has been a grand challenge for researchers in machine learning, experimental neuroscience, and computer architecture. Current deep neural networks (DNNs), motivated by the same challenge, have achieved remarkable results in Machine Learning applications. However, despite their original inspiration from the brain, DNNs have largely moved away from biological plausibility, resorting to intensive statistical processing on huge amounts of data. This has led to exponentially increasing demand on hardware compute resources that is quickly becoming economically and technologically unsustainable. Recent neuroscience research has led to a new theory on human intelligence, that suggests Cortical Columns (CCs) as the fundamental processing units in the neocortex that encapsulate intelligence. Each CC has the potential to learn models of complete objects through continuous predict-sense-update loops. This leads to the overarching question: Can we build Cortical Columns Computing Systems (C3S) that possess brain-like capabilities as well as brain-like efficiency? This chapter presents ongoing research in the Neuromorphic Computer Architecture Lab (NCAL) at Carnegie Mellon University (CMU) focusing on addressing this question. Our initial findings indicate that designing truly intelligent and extremely energy-efficient C3S-based sensory processing units, using off-the-shelf digital CMOS technology and tools, is quite feasible and very promising, and certainly warrants further research exploration.
Article
Full-text available
Crop protection is a key activity for the sustainability and feasibility of agriculture in a current context of climate change, which is causing the destabilization of agricultural practices and an increase in the incidence of current or invasive pests, and a growing world population that requires guaranteeing the food supply chain and ensuring food security. In view of these events, this article provides a contextual review in six sections on the role of artificial intelligence (AI), machine learning (ML) and other emerging technologies to solve current and future challenges of crop protection. Over time, crop protection has progressed from a primitive agriculture 1.0 (Ag1.0) through various technological developments to reach a level of maturity closelyin line with Ag5.0 (section 1), which is characterized by successfully leveraging ML capacity and modern agricultural devices and machines that perceive, analyze and actuate following the main stages of precision crop protection (section 2). Section 3 presents a taxonomy of ML algorithms that support the development and implementation of precision crop protection, while section 4 analyses the scientific impact of ML on the basis of an extensive bibliometric study of >120 algorithms, outlining the most widely used ML and deep learning (DL) techniques currently applied in relevant case studies on the detection and control of crop diseases, weeds and plagues. Section 5 describes 39 emerging technologies in the fields of smart sensors and other advanced hardware devices, telecommunications, proximal and remote sensing, and AI-based robotics that will foreseeably lead the next generation of perception-based, decision-making and actuation systems for digitized, smart and real-time crop protection in a realistic Ag5.0. Finally, section 6 highlights the main conclusions and final remarks.
Article
Full-text available
We present an innovative working mechanism (the SBC memory ) and surrounding infrastructure ( BitBrain ) based upon a novel synthesis of ideas from sparse coding, computational neuroscience and information theory that enables fast and adaptive learning and accurate, robust inference. The mechanism is designed to be implemented efficiently on current and future neuromorphic devices as well as on more conventional CPU and memory architectures. An example implementation on the SpiNNaker neuromorphic platform has been developed and initial results are presented. The SBC memory stores coincidences between features detected in class examples in a training set, and infers the class of a previously unseen test example by identifying the class with which it shares the highest number of feature coincidences. A number of SBC memories may be combined in a BitBrain to increase the diversity of the contributing feature coincidences. The resulting inference mechanism is shown to have excellent classification performance on benchmarks such as MNIST and EMNIST, achieving classification accuracy with single-pass learning approaching that of state-of-the-art deep networks with much larger tuneable parameter spaces and much higher training costs. It can also be made very robust to noise. BitBrain is designed to be very efficient in training and inference on both conventional and neuromorphic architectures. It provides a unique combination of single-pass, single-shot and continuous supervised learning; following a very simple unsupervised phase. Accurate classification inference that is very robust against imperfect inputs has been demonstrated. These contributions make it uniquely well-suited for edge and IoT applications.
Preprint
The connectivity in the brain is locally dense and globally sparse - giving rise to a small-world graph. This is a principle that has persisted during the evolution of many species - indicating a universal solution to the efficient routing of information. However, existing circuit architectures for artificial neural networks neither leverage this organization nor do they efficiently support small-world neural network models. Here, we propose the neuromorphic Mosaic: a non-von Neumann systolic architecture that uses distributed memristors, not only for in-memory computing, but also for in-memory routing, to efficiently implement small-world graph topologies. We design, fabricate, and experimentally demonstrate the building blocks of this architecture, using integrated memristors with 130 nm CMOS technology. We demonstrate that neural networks implemented following this approach can achieve competitive accuracy figures compared to equivalent unconstrained and full-precision networks, for three real-time benchmarks: classification of electrocardiography signals, keyword spotting and motor control via reinforcement learning. The Mosaic shows improvements between one and four orders of magnitude, compared to other event-based neuromorphic architectures for routing events across the network. The Mosaic opens up a new scalable approach for designing edge AI systems based on distributed computing and in-memory routing, offering a natural platform onto which architectures inspired by biological nervous systems can be readily mapped.
Chapter
Despite the tremendous advancements in deep neural network research to achieve Artificial General Intelligence, it continues to suffer from various issues such as higher power consumption and longer training time. Many of these issues result from a fundamental drawback of the current computing architecture, that is, the von Neumann bottleneck. Therefore, there is growing research to develop computing architectures to eliminate this bottleneck. One of the most promising approaches is neuromorphic computing, which takes direct inspiration from the structure of a biological neuron. This chapter discusses core neuromorphic computing concepts and reviews several ongoing projects on neuromorphic hardware accelerators. Keywords: Neuromorphic computing; Spiking neural networks; Memristors; Magnetoresistive random access memory; Neuromorphic hardware accelerators
Article
An ongoing challenge in neuromorphic computing is to devise general and computationally efficient models of inference and learning which are compatible with the spatial and temporal constraints of the brain. One increasingly popular and successful approach is to take inspiration from inference and learning algorithms used in deep neural networks. However, the workhorse of deep learning, the gradient-descent back-propagation (BP) rule, often relies on the immediate availability of network-wide information stored with high-precision memory during learning, and precise operations that are difficult to realize in neuromorphic hardware. Remarkably, recent work showed that exact backpropagated gradients are not essential for learning deep representations. Building on these results, we demonstrate an event-driven random BP (eRBP) rule that uses an error-modulated synaptic plasticity for learning deep representations. Using a two-compartment Leaky Integrate & Fire (I&F) neuron, the rule requires only one addition and two comparisons for each synaptic weight, making it very suitable for implementation in digital or mixed-signal neuromorphic hardware. Our results show that using eRBP, deep representations are rapidly learned, achieving classification accuracies on permutation invariant datasets comparable to those obtained in artificial neural network simulations on GPUs, while being robust to neural and synaptic state quantizations during learning.
Article
Implementing compact, low-power artificial neural processing systems with real-time on-line learning abilities is still an open challenge. In this paper we present a full-custom mixed-signal VLSI device with neuromorphic learning circuits that emulate the biophysics of real spiking neurons and dynamic synapses for exploring the properties of computational neuroscience models and for building brain-inspired computing systems. The proposed architecture allows the on-chip configuration of a wide range of network connectivities, including recurrent and deep networks, with short-term and long-term plasticity. The device comprises 128 K analog synapse and 256 neuron circuits with biologically plausible dynamics and bi-stable spike-based plasticity mechanisms that endow it with on-line learning abilities. In addition to the analog circuits, the device also comprises asynchronous digital logic circuits for setting different synapse and neuron properties as well as different network configurations. This prototype device, fabricated using a 180 nm 1P6M CMOS process, occupies an area of 51.4 mm², and consumes approximately 4 mW for typical experiments, for example involving attractor networks. Here we describe the details of the overall architecture and of the individual circuits and present experimental results that showcase its potential. By supporting a wide range of cortical-like computational modules comprising plasticity mechanisms, this device will enable the realization of intelligent autonomous systems with on-line learning capabilities.
Article
Learning from instructions or demonstrations is a fundamental property of our brain necessary to acquire new knowledge and develop novel skills or behavioral patterns. This type of learning is thought to be involved in most of our daily routines. Although the concept of instruction-based learning has been studied for several decades, the exact neural mechanisms implementing this process remain unrevealed. One of the central questions in this regard is, How do neurons learn to reproduce template signals (instructions) encoded in precisely timed sequences of spikes? Here we present a model of supervised learning for biologically plausible neurons that addresses this question. In a set of experiments, we demonstrate that our approach enables us to train spiking neurons to reproduce arbitrary template spike patterns in response to given synaptic stimuli even in the presence of various sources of noise. We show that the learning rule can also be used for decision-making tasks. Neurons can be trained to classify categories of input signals based on only a temporal configuration of spikes. The decision is communicated by emitting precisely timed spike trains associated with given input categories. Trained neurons can perform the classification task correctly even if stimuli and corresponding decision times are temporally separated and the relevant information is consequently highly overlapped by the ongoing neural activity. Finally, we demonstrate that neurons can be trained to reproduce sequences of spikes with a controllable time shift with respect to target templates. A reproduced signal can follow or even precede the targets. This surprising result points out that spiking neurons can potentially be applied to forecast the behavior (firing times) of other reference neurons or networks.
Article
Efficient path planning and navigation is critical for animals, robotics, logistics and transportation. We study a model in which spatial navigation problems can rapidly be solved in the brain by parallel mental exploration of alternative routes using propagating waves of neural activity. A wave of spiking activity propagates through a hippocampus-like network, altering the synaptic connectivity. The resulting vector field of synaptic change then guides a simulated animal to the appropriate selected target locations. We demonstrate that the navigation problem can be solved using realistic, local synaptic plasticity rules during a single passage of a wavefront. Our model can find optimal solutions for competing possible targets or learn and navigate in multiple environments. The model provides a hypothesis on the possible computational mechanisms for optimal path planning in the brain; at the same time, it is useful for neuromorphic implementations, where the parallelism of information processing proposed here can fully be harnessed in hardware.
Conference Paper
Efforts to achieve the long-standing dream of realizing scalable learning algorithms for networks of spiking neurons in silicon have been hampered by (a) the limited scalability of analog neuron circuits; (b) the enormous area overhead of learning circuits, which grows with the number of synapses; and (c) the need to implement all inter-neuron communication via off-chip address-events. In this work, a new architecture is proposed to overcome these challenges by combining innovations in computation, memory, and communication, respectively, to leverage (a) robust digital neuron circuits; (b) novel transposable SRAM arrays that share learning circuits, which grow only with the number of neurons; and (c) crossbar fan-out for efficient on-chip inter-neuron communication. Through tight integration of memory (synapses) and computation (neurons), a highly configurable chip comprising 256 neurons and 64K binary synapses with on-chip learning based on spike-timing dependent plasticity is demonstrated in 45nm SOI-CMOS. Near-threshold, event-driven operation at 0.53V is demonstrated to maximize power efficiency for real-time pattern classification, recognition, and associative memory tasks. Future scalable systems built from the foundation provided by this work will open up possibilities for ubiquitous ultra-dense, ultra-low power brain-like cognitive computers.
Article
The organization of computations in networks of spiking neurons in the brain is still largely unknown, in particular in view of the inherently stochastic features of their firing activity and the experimentally observed trial-to-trial variability of neural systems in the brain. In principle there exists a powerful computational framework for stochastic computations, probabilistic inference by sampling, which can explain a large number of macroscopic experimental data in neuroscience and cognitive science. But it has turned out to be surprisingly difficult to create a link between these abstract models for stochastic computations and more detailed models of the dynamics of networks of spiking neurons. Here we create such a link and show that under some conditions the stochastic firing activity of networks of spiking neurons can be interpreted as probabilistic inference via Markov chain Monte Carlo (MCMC) sampling. Since common methods for MCMC sampling in distributed systems, such as Gibbs sampling, are inconsistent with the dynamics of spiking neurons, we introduce a different approach based on non-reversible Markov chains that is able to reflect inherent temporal processes of spiking neuronal activity through a suitable choice of random variables. We propose a neural network model and show by a rigorous theoretical analysis that its neural activity implements MCMC sampling of a given distribution, both for the case of discrete and continuous time. This provides a step towards closing the gap between abstract functional models of cortical computation and more detailed models of networks of spiking neurons.
Article
While the sparse coding principle can successfully model information processing in sensory neural systems, it remains unclear how learning can be accomplished under neural architectural constraints. Feasible learning rules must rely solely on synaptically local information in order to be implemented on spatially distributed neurons. We describe a neural network with spiking neurons that can address the aforementioned fundamental challenge and solve the L1-minimizing dictionary learning problem, representing the first model able to do so. Our major innovation is to introduce feedback synapses to create a pathway to turn the seemingly non-local information into local ones. The resulting network encodes the error signal needed for learning as the change of network steady states caused by feedback, and operates akin to the classical stochastic gradient descent method.
Article
Inspired by the brain’s structure, we have developed an efficient, scalable, and flexible non–von Neumann architecture that leverages contemporary silicon technology. To demonstrate, we built a 5.4-billion-transistor chip with 4096 neurosynaptic cores interconnected via an intrachip network that integrates 1 million programmable spiking neurons and 256 million configurable synapses. Chips can be tiled in two dimensions via an interchip communication interface, seamlessly scaling the architecture to a cortexlike sheet of arbitrary size. The architecture is well suited to many applications that use complex neural networks in real time, for example, multiobject detection and classification. With 400-pixel-by-240-pixel video input at 30 frames per second, the chip consumes 63 milliwatts.
Article
Sparse approximation is a hypothesized coding strategy where a population of sensory neurons (e.g. V1) encodes a stimulus using as few active neurons as possible. We present the Spiking LCA (locally competitive algorithm), a rate encoded Spiking Neural Network (SNN) of integrate and fire neurons that calculate sparse approximations. The Spiking LCA is designed to be equivalent to the nonspiking LCA, an analog dynamical system that converges on an ℓ₁-norm sparse approximation exponentially. We show that the firing rate of the Spiking LCA converges on the same solution as the analog LCA, with an error inversely proportional to the sampling time. We simulate in NEURON a network of 128 neuron pairs that encode 8 × 8 pixel image patches, demonstrating that the network converges to nearly optimal encodings within 20 ms of biological time. We also show that when using more biophysically realistic parameters in the neurons, the gain function encourages additional ℓ₀-norm sparsity in the encoding, relative both to ideal neurons and digital solvers.
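As a rough illustration of the nonspiking (analog) LCA that this abstract refers to, the following sketch integrates the commonly cited LCA dynamics, tau * du/dt = b - u - (G - I) a with a = soft_threshold(u, lam), using forward Euler. It is not the Spiking LCA of the paper. The function names, the step size, and the assumption that the dictionary columns have unit norm are choices made here for illustration only.

```python
# Sketch of the nonspiking (analog) LCA, integrated with forward Euler.
# Dynamics: tau * du/dt = b - u - (G - I) a,  a = soft_threshold(u, lam),
# with b = Phi^T s and G = Phi^T Phi. Assumes unit-norm dictionary columns,
# so the diagonal of (G - I) is zero. All names are illustrative.

def soft_threshold(u, lam):
    """Elementwise soft threshold: shrink each entry toward zero by lam."""
    return [max(abs(ui) - lam, 0.0) * (1.0 if ui > 0 else -1.0) for ui in u]

def lca(Phi, s, lam, steps=2000, dt_over_tau=0.05):
    m, n = len(Phi), len(Phi[0])
    # Feedforward drive b = Phi^T s
    b = [sum(Phi[i][j] * s[i] for i in range(m)) for j in range(n)]
    # Gram matrix G = Phi^T Phi provides the lateral inhibition weights
    G = [[sum(Phi[i][j] * Phi[i][k] for i in range(m)) for k in range(n)]
         for j in range(n)]
    u = [0.0] * n  # internal (membrane-like) state
    for _ in range(steps):
        a = soft_threshold(u, lam)  # thresholded activations
        for j in range(n):
            # Inhibition from the other active units (off-diagonal part of G)
            inhibition = sum(G[j][k] * a[k] for k in range(n) if k != j)
            u[j] += dt_over_tau * (b[j] - u[j] - inhibition)
    return soft_threshold(u, lam)

if __name__ == "__main__":
    Phi = [[1.0, 0.6],
           [0.0, 0.8]]  # two unit-norm, overlapping dictionary columns
    print(lca(Phi, [1.0, 0.0], lam=0.1))
```

For this small example the LASSO optimum can be worked out by hand as approximately [0.9, 0.0], and the integrated state should approach it: the second unit is held below threshold by inhibition from the first.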
Article
We consider the class of iterative shrinkage-thresholding algorithms (ISTA) for solving linear inverse problems arising in signal/image processing. This class of methods, which can be viewed as an extension of the classical gradient algorithm, is attractive due to its simplicity and thus is adequate for solving large-scale problems even with dense matrix data. However, such methods are also known to converge quite slowly. In this paper we present a new fast iterative shrinkage-thresholding algorithm (FISTA) which preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically. Initial promising numerical results for wavelet-based image deblurring demonstrate the capabilities of FISTA which is shown to be faster than ISTA by several orders of magnitude.
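To make the ISTA/FISTA idea concrete, here is a minimal sketch for the LASSO problem, minimize 0.5 * ||A x - b||^2 + lam * ||x||_1. It is an illustrative standard-library implementation, not the authors' code. The helper names are hypothetical, and the step size uses the squared Frobenius norm of A as a conservative upper bound on the Lipschitz constant of the gradient, which is an assumption made here for simplicity.

```python
# Minimal ISTA/FISTA sketch for LASSO:
#   minimize 0.5 * ||A x - b||^2 + lam * ||x||_1
# Pure standard-library Python; all helper names are illustrative.

def soft_threshold(v, t):
    """Shrinkage operator (proximal map of t * |.|), applied elementwise."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]

def matvec(A, x):
    """Compute A x for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Compute A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A)))
            for j in range(len(A[0]))]

def fista(A, b, lam, iters=200, accelerate=True):
    # Squared Frobenius norm upper-bounds the largest eigenvalue of A^T A,
    # so 1/L is a safe (if conservative) gradient step size.
    L = sum(a * a for row in A for a in row)
    x = [0.0] * len(A[0])
    y = x[:]   # extrapolated point at which the gradient is evaluated
    t = 1.0    # momentum parameter
    for _ in range(iters):
        residual = [ai - bi for ai, bi in zip(matvec(A, y), b)]
        grad = rmatvec(A, residual)
        # Gradient step followed by shrinkage (this alone is one ISTA step)
        x_new = soft_threshold([yi - gi / L for yi, gi in zip(y, grad)],
                               lam / L)
        if accelerate:
            # FISTA momentum: extrapolate along the last displacement
            t_new = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
            mom = (t - 1.0) / t_new
            y = [xn + mom * (xn - xo) for xn, xo in zip(x_new, x)]
            t = t_new
        else:
            y = x_new  # plain ISTA: no extrapolation
        x = x_new
    return x

if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0]]
    print(fista(A, [3.0, 0.5], lam=1.0))
```

With the identity matrix the LASSO solution is simply the soft-thresholded input, so the example should converge toward [2.0, 0.0]. Passing `accelerate=False` runs plain ISTA with the same step size, which allows a side-by-side comparison of the two iterations.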