Real-Time Olivary Neuron Simulations on
Dataflow Computing Machines
Georgios Smaragdos1, Craig Davies3, Christos Strydis1, Ioannis Sourdis5,
Cătălin Ciobanu5, Oskar Mencer3,4, and Chris I. De Zeeuw1,2
1Dept. of Neuroscience, Erasmus Medical Center, Rotterdam, The Netherlands
2Netherlands Institute for Neuroscience, Amsterdam, The Netherlands
3Maxeler Technologies Inc.
4Imperial College London
5Dept. of Computer Science & Engineering, Chalmers University of Technology,
Gothenburg, Sweden
Abstract. The Inferior-Olivary nucleus (ION) is a well-charted brain
region, heavily associated with the sensorimotor control of the body. It
comprises neural cells with unique properties which facilitate sensory
processing and motor-learning skills. Simulations of such neurons become
rapidly intractable when biophysically plausible models and meaning-
ful network sizes (at least in the order of some hundreds of cells) are
modeled. To overcome this problem, we accelerate a highly detailed ION
network model using a Maxeler Dataflow Computing Machine. The design
simulates a 330-cell network at real-time speed and achieves maximum
throughputs of 24.7 GFLOPS. The Maxeler machine, integrating a Virtex-6 FPGA, yields speedups of ×92-×102 and ×2-×8 compared to a reference-C implementation running on an Intel Xeon 2.66GHz, and to a pure Virtex-7 FPGA implementation, respectively.
1 Introduction
The United-States National Academy of Engineering has classified brain simulation as one of the Grand Engineering Challenges [7]. Biologically accurate brain simulation is a highly relevant topic for neuroscience for a number of reasons; the main goal is to accelerate brain research by creating more advanced research platforms. Even though significant effort is spent on understanding brain functionality, the exact mechanisms in most cases are still only hypothesized. Fast and real-time simulation platforms can enable the neuroscientific community to test these hypotheses more efficiently. Better
understanding of brain functionality can lead to a number of critical practical
applications: (1) Brain rescue: If brain functions can be simulated accurately
enough and in real-time, this can lead to robotic prosthetics and implants for
restoring lost brain functionality. (2) Advanced A.I.: Artificial Neural Networks
(ANNs) have already been successfully used in this field. It is believed that greater
understanding of biological systems and the richer computational dynamics of
their models, like spiking neural networks (SNNs) [5], can lead to more advanced,
artificial-intelligence and robotic applications. (3) New architectural paradigms:
Alternatives to the typical Von-Neumann architectures.
In many of these applications, real-time performance of Neural Networks (NN)
is desirable or, even, required. The main challenge in building such systems lies
largely in the computational and communication load of the network simulations.
Furthermore, biological NNs execute these computations with massive parallelism, which conventional, CPU-based execution cannot match well. As a result, neuron-population sizes and interconnectivity remain quite low when running on CPUs, greatly impeding the efficiency of brain research in relation to the above goals of brain simulation.
Other HPC computing alternatives fall short in a number of aspects. General-
Purpose GPUs can exploit application parallelism better and can be more efficient
in running neuron models. Yet, in the case of complex models or very large-scale
networks, they may not be able to provide real-time performance due to the
high rates of data exchange between the neurons and are less efficient in terms
of energy and power. Supercomputers can emulate the behavior and parallelism
of biological networks with sufficient speed but have limited availability and
require large space, implementation, maintenance and energy costs while lacking
any kind of mobility. Mixed-signal, Very-Large-Scale-Integration (VLSI) circuits
achieve adequate simulation speeds while simulating biological systems more
accurately since they model neurons through analog signals. However, they are
much more difficult to implement, lack flexibility and often suffer from problems
typical in analog design (transistor mismatching, crosstalk etc.).
Implementing the neural network in parallel, digital hardware can efficiently
match the parallelism of biological models and provide real-time performance.
While Application-Specific Integrated-Circuits (ASICs) design is certainly an
option, it is expensive, time-consuming and – most importantly – inflexible (i.e.
cannot be altered after fabrication), a characteristic often required when fitting
novel neuron models. Most of these issues can be tackled through the use of
reconfigurable hardware. Although slower than ASICs, it can still provide high
performance for real-time and hyper-real-time neuron simulations, by exploiting
the inherent parallelism of hardware. Besides requiring a lot less energy and, in
some cases, less space than most of the above solutions for delivering the same
computational power, the reconfiguration property of such platforms provides
the flexibility of modifying brain models on demand.
In this paper we present a hardware-accelerated application for a specific,
biophysically-meaningful SNN model, using single-precision floating-point (FP)
arithmetic computations. The application is mapped onto a Maxeler Dataflow Machine [9] which employs a dataflow-computing paradigm utilizing highly
efficient reconfigurable-hardware-based engines. The targeted application models
an essential area of the Olivocerebellar system: the Inferior-Olivary Nucleus (ION).
This design is a part of an ongoing effort to replicate the full Olivocerebellar
circuit on reconfigurable hardware-based platforms, for the purpose of real-time
experimentation on meaningful network sizes.
2 The Biological Neuron, the Cerebellum and the
Inferior Olive
Neurons are electrochemically excitable cells that process and transmit signals in the brain. The biological neuron comprises, in general (although in truth it is a much more complicated system), three parts, called compartments in neuromodeling jargon: the Dendrites, the Soma and the Axon. The dendritic compartment
represents the cell input stage. In turn, the soma processes the stimuli and
translates them into a cell membrane potential which evokes a cell response called
an action potential or, simply, a spike. This response is transferred through the
axon – the cell output stage – to other cells through a synapse.
The cerebellum is one of the most complex and dense regions in the brain,
playing an important role in sensorimotor control. It does not initiate movement
but influences the sensorimotor region in order to precisely coordinate the body’s
activities and motor learning skills. It also plays an important role in the sensing
of rhythm, enabling the handling of concepts such as music or harmony. The
Olivocerebellar circuitry is an essential part of the Cerebellar functionality. Main
areas of the circuit are the Purkinje cells (PC), the Deep Cerebellar Nuclei
(DCN) and the Inferior-Olivary Nucleus (ION) [3]. The system is theorized to be
organized in separate groups of PC, DCN and ION cells (Microsections) that – in
combination – control distinctive parts of muscle movement and may or may not
be interconnected to each other [8]. The network sizes in these microsections are
hypothesized to vary; for the ION in particular, a microsection can contain from a dozen to several hundred cells. ION cells especially are also
interconnected by purely electrical connections between their dendrites, called
gap junctions, considered to be important for the synchronization of activity
within the nucleus and the Olivocerebellar system, in general.
3 Related Work
In the past, a number of designs have been proposed for the implementation
of neuron and network models using reconfigurable hardware. In this section, we
present some of the most notable works in the field.
Two of the most interesting implementations in terms of computation (perfor-
mance) density [10] using Izhikevich neuron modeling [4] are the designs proposed
by Cheung et al. [2] and by Moore et al. (Bluehive [6]). Each approach proposed
a reconfigurable-hardware architecture for very-large-scale SNNs. To improve on
their initial memory-bandwidth requirements, Cheung et al. also used a Maxeler
Dataflow Machine [2]. The size of the implemented network achieved was 64K
neurons, each having about 1,000 synaptic connections to neighboring neurons.
In the Bluehive device, Moore et al. employed DDR2 RAMs and built custom-made SATA-to-PCI connections for stacking FPGA devices, facilitating large SNN simulations. Only a small portion of data was stored on-board the FPGAs.
(We use the terms neuron and (neural) cell interchangeably in this paper.)

Fig. 1. Illustration of (a) a 6-cell network-model example, (b) the 3-compartmental model of a single ION cell, and (c) its output (spiking) response to an input pulse.
In a Stratix IV FPGA, the authors simulated 64K Izhikevich neurons with 64M
simple synapses at real-time performance.
These works have incorporated fixed-point arithmetic to implement the compu-
tation of their neuron models. Zhang et al. [11] have proposed a Hodgkin-Huxley
(HH) modeling [4] accelerator using dedicated Floating Point (FP) units. The
FP units were custom-designed to provide better accuracy, resource usage and
performance. The 32-bit FP arithmetic used in the model produced a neuro-
processor architecture which could simulate 4 complete cells (synapse, dendrite,
soma) at real-time speed. Smaragdos et al. [10] have also ported an HH-based
model of Olivocerebellar cells onto a Xilinx Virtex-7 FPGA. The model entails
demanding FP computations per cell as well as high inter-cell communication
(up to all-to-all connectivity). Real-time simulations of up to 96 neurons could
be achieved. It must be noted that the neuron model used was estimated to be about ×18 more complex than those of the other related works in terms of FP operations.
4 ION-Model Description
The application accelerated in this paper models the behavior of the ION neurons
using an extended-HH model description [1]. This model is a first step towards building a high-performance, Olivocerebellar-system simulation platform. The
model not only divides the cells into multiple compartments but also creates a
network onto which neurons are interconnected by modeling gap junctions.
The ION cell model divides the neuron into three computational compartments
– closely resembling biological counterparts – as shown in Figure 1(b). Every
compartment has a state that holds pertinent electrochemical variables and, on
every simulation step, the state is updated based on: i) the previous state, ii) the
other compartments’ previous state, iii) the other neurons’ previous state, and
iv) any externally evoked input stimulus to the cell or network.
The computational model operates in a fashion allowing concurrent execution
of the three compartments. The model is calibrated to produce every per-neuron output value with a 50-μs time step. This means that, in order to support real-time simulations, all neurons must complete one iteration of compartmental calculations within 50 μs. Due to the realistic electrochemical variables handled by the model, most of the computations require FP arithmetic.
Figure 1(a) illustrates the network-model architecture with an example size
of 6 cells. Every cell receives, through the dendritic compartment, the influence
of all other cells in the network, thus modeling the massive number of biological
gap junctions present in the Inferior Olive. The gap-junction computations are
repeated for every neighboring input, computing the accumulated influence of
all neighboring connections. The ION network must be synchronized in order
to guarantee the correct exchange of cell-state data when multiple cells and
compartments are being computed simultaneously. The system works in lock-
step, computing discrete output values (with a 50-μs time step) that, when aggregated in time, contribute to form the electrical waveform response of the system (Figure 1(c)).
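The lock-step network update described above can be sketched as follows. This is an illustrative sketch only, with hypothetical names and placeholder dynamics; the actual extended-HH compartment equations [1] are far more involved than the stand-in functions below.

```python
def coupling(dv):
    # stand-in for the gap-junction current as a function of the
    # dendritic voltage difference between two coupled cells
    return 0.01 * dv

def f_dend(state, i_gj, i_app):   # dendrite: the cell's input stage
    return state["v_dend"] + 0.1 * (i_app - i_gj)

def f_soma(state):                # soma: membrane-potential dynamics
    return 0.9 * state["v_soma"] + 0.1 * state["v_dend"]

def f_axon(state):                # axon: the cell's output stage
    return 0.9 * state["v_axon"] + 0.1 * state["v_soma"]

def gap_junction_influence(v_self, v_dend_all):
    # repeated for every neighbouring input: N iterations per cell,
    # hence N*N gap-junction computations per step (all-to-all network)
    return sum(coupling(v_self - v_other) for v_other in v_dend_all)

def simulation_step(states, stimuli):
    # New states depend only on the previous network state, so the whole
    # network advances in lock-step and one old copy of each state suffices.
    v_dend_all = [s["v_dend"] for s in states]
    new_states = []
    for s, i_app in zip(states, stimuli):
        i_gj = gap_junction_influence(s["v_dend"], v_dend_all)
        new_states.append({"v_dend": f_dend(s, i_gj, i_app),
                           "v_soma": f_soma(s),
                           "v_axon": f_axon(s)})
    return new_states
```

Note that `simulation_step` only swaps in the new states after every cell has finished, mirroring the synchronization requirement stated above.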
An extensive profiling of the application, conducted in [10], reveals that
in a fully interconnected ION network the gap-junction computations increase
quadratically with network size. A meaningful network size for the ION model
would be one large enough to enable as extensive as possible exploration of
microsection organizations. Organizations of either a high number of microsections
with few cells or a low number of microsections with many cells can reveal a
range of different system behavioral dynamics and, thus, validate (or invalidate)
a set of hypotheses on brain functionality. According to our neuroscientific expert
partners, a minimum ION-network size for meaningful experimentation would be
around 100 cells. Any improvement in achievable size beyond this point would
enable greater possibilities for the exploration of microsection behavior. For such
network sizes (100 cells and higher), gap-junction computations dominate all cell computations.
5 DFE Implementation of the ION Model
The design presented in this paper is a continuation of previous work on an FPGA implementation of an ION-network hardware accelerator [10]. The FPGA kernel was designed to work alongside a host processor and executed the
aforementioned model in a step-by-step process. The hardware description of the
kernel was coded in C using the Vivado HLS v2013.2 tool targeting a Virtex-7
device. Although this design offered considerable speedup over a reference CPU
implementation, it was still unable to fully exploit the parallelism of the model –
which essentially is a dataflow application – without considerable restructuring
of the initial model.
On the other hand, a Maxeler Dataflow Computing Machine [9], based on data-
flow engines (or DFEs), has the ability to better exploit the inherent parallelism
of the model and has the potential to achieve even greater speed-ups with minor
changes in the model architecture. Also based on FPGA hardware, DFE boards
and their compiling tools are designed to fully exploit loop-level parallelism by
Fig. 2. Illustration of (a) a single instance of the FPGA ION kernel of [10], with non-pipelined datapaths and a pipelined, partially unrolled gap-junction (GJ) loop, and (b) a single instance of the DFE ION kernel, with fully pipelined datapaths, per-state BRAMs and control-flow counters (D = Dendrite, S = Soma, A = Axon).
fine-grain pipelining of computations and efficient usage of the FPGA fabric,
while additionally including efficient I/O between the chip and the on-board
resources (e.g. on-board DRAMs).
5.1 The ION-kernel DFE Architecture
The DFE implementation of the ION network can be seen in Figure 2(b). The
design incorporates 3 internal pipelines, one for each part of the cell (Dendrite, Soma, Axon), executing the respective computations. The cell states consist
of 19 FP values. Each parameter for each neuron is stored on its own BRAM
block, for fast read/update of the network state. Since every new cell state is
dependent only on the network state of the previous simulation step, only one
copy of each neuron state is required. The input stream of the DFE kernel comes
from the on-board RAM and represents the evoked inputs (one value for each
neuron per simulation step) used in the dendritic computations comprising the
network input. Only for the first simulation step, the initial state and the neighboring (gap-junction) influence are also streamed in from the on-board memory as each neuron begins execution. The network output (represented by the axonal voltage) is also streamed to the on-board memory at the same moment as it is updated on its respective BRAM block.
Due to the dataflow paradigm followed, the DFE kernel executes the complete
simulation run when activated, as opposed to the control-flow-based FPGA
kernel that only executes the simulation step by step, under the supervision
of a MicroBlaze core. As such, the DFE kernel additionally receives scalar
input parameters, denoting the simulation duration and the network size to
be simulated. Program flow is tracked using hardware counters that monitor gap-junction loop iterations, neurons executed and the number of simulation steps concluded. All scalar parameters, kernel activation, input-data preparation before execution and output visualization after execution are handled by an off-board host processor. The data flows through the DFE pipelines with each
kernel execution step (or tick), consuming the respective input or producing the
respective output and state. Each kernel tick represents the completion of one
gap-junction loop iteration. As a result, the DFE execution naturally pipelines
not only the gap-junction loop iterations but the execution of different neurons
within one simulation step as well, as opposed to the FPGA kernel that was
capable of only the former (Figure 2(a)). Simulation steps are not pipelined in
an all-to-all network, as every neuron must have the previous state of all other
neurons ready for its gap-junction computations before a new step begins. The
DFE pipeline is, thus, flushed before a new simulation step begins execution. The
execution of a single simulation step requires N² ticks to be completed, where N is the network size.
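Since each tick corresponds to one gap-junction loop iteration, this tick count directly bounds the real-time network size of a single, non-unrolled kernel. The following back-of-the-envelope check is our own illustrative arithmetic (it ignores pipeline fill/flush overhead and assumes the 150 MHz maximum kernel clock reported later, in Section 6):

```python
# Real-time bound for a single, non-unrolled DFE kernel: a network step
# costs N*N ticks, and all of them must fit inside the 50-us model step.

CLOCK_HZ = 150e6          # maximum DFE kernel frequency (Section 6)
STEP_S   = 50e-6          # model time step

def step_time_s(n_cells, clock_hz=CLOCK_HZ):
    # wall-clock time of one network simulation step, in seconds
    return (n_cells ** 2) / clock_hz

def max_realtime_cells(clock_hz=CLOCK_HZ, step_s=STEP_S):
    # largest N satisfying N*N / clock <= step
    return int((step_s * clock_hz) ** 0.5)

print(max_realtime_cells())        # 86 cells without any unrolling
print(step_time_s(330) * 1e6)      # ~726 us for 330 cells
```

This makes clear why the optimizations of Section 5.2 are needed: without them, a 330-cell step would take roughly 726 μs, far beyond the 50-μs budget.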
5.2 Additional Design Optimizations
There are two straightforward ways to speed up execution of the DFE kernel.
One is to use multiple instances of the kernel in a single DFE, if the DFE spare
resources allow it, thus doubling the network size achievable within a certain
time frame. The other is to unroll the gap-junction loop by replicating the single-
iteration hardware logic, essentially executing multiple iterations of the loop per
kernel tick. If U is the unroll factor of the loop, the number of ticks required
for a network simulation step is ⌈N²/U⌉, denoting a potentially considerable
speed-up. Both of these techniques are subject to area but also timing constraints.
Loop unrolling, in particular, could cause extra pressure in the routing of the
hardware, limiting the maximum achievable frequency of the DFE kernel.
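The unrolling trade-off can be checked with the same simple tick arithmetic. This is our own cross-check (again ignoring pipeline-flush overhead), using the configurations evaluated in Section 6:

```python
import math

# With unroll factor U, one network step takes ceil(N*N / U) kernel ticks.
# Check the evaluated configurations against the 50-us real-time budget.

def step_time_unrolled_s(n_cells, unroll, clock_hz):
    ticks = math.ceil(n_cells ** 2 / unroll)   # ticks per network step
    return ticks / clock_hz

t_u16 = step_time_unrolled_s(330, unroll=16, clock_hz=140e6)  # ~48.6 us
t_u8  = step_time_unrolled_s(330, unroll=8,  clock_hz=150e6)  # ~90.8 us

# Largest N meeting real-time at U=16 and 140 MHz:
n_max = int((16 * 140e6 * 50e-6) ** 0.5)       # 334
```

Only the U=16, 140 MHz configuration meets the 50-μs budget for 330 cells, and its theoretical ceiling of 334 cells is consistent with the 330-cell real-time maximum reported in Section 6.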
6 Evaluation
The design was implemented on a Maxeler device using the Vectis architecture.
The Vectis boards include a Virtex 6 SX475T FPGA in their DFEs. The maximum
frequency that an ION kernel achieved on the DFE board was 150MHz. This
design could be optimized by either using a second kernel instance within the same
DFE or unrolling the gap-junction loop. Unfortunately, the spare resources did
not enable the use of both optimizations simultaneously. As a result, 2 versions
of the design were tested, one with 2 ION kernels within the DFE and one with
a single kernel and loop unrolling. The maximum unroll factor achieved for the
frequency of 150MHz was 8. One last design was also evaluated: by reducing the DFE frequency to 140 MHz, the unroll factor could be raised to 16, expected to balance out the performance loss due to the lower frequency. Larger unroll factors were not achievable due to both timing and area constraints.

Fig. 3. Simulation-step execution time for the DFE kernels and the FPGA kernel [10].

Fig. 4. Speed-up of the best DFE and FPGA implementations compared to single-FP CPU-based execution on an Intel Xeon 2.66GHz with 20GB RAM.
In Figure 3 we can see the execution time for all the DFE-based designs and
the FPGA-kernel version (deployed on a Virtex 7 XC7VX485T device running
at 100 MHz), which includes 8 instances of the ION kernel shown in Figure 2(a).
Indeed, the best-performing Vectis implementation is the one with the lowest
frequency but the highest unroll factor. The gain of loop unrolling supersedes
the gains of using the extra kernel instance or the higher frequency. The FPGA
implementation incorporates also unrolling optimizations but of lower factor (4)
and, combined with its coarser-grain pipelining and lower operating frequency,
performs worse than the DFE implementation. In effect, the DFE can simulate up to 330 Inferior-Olivary cells at real-time speed (within the 50-μs time step) and is ×7.7 to ×2.26 faster than the FPGA implementation. In terms of speed-up compared to single-core CPU execution, the fastest DFE implementation
(For fairness in comparisons, the Maxeler DFE and the Xilinx FPGA board contain similar resources.)
(We use the single-core C implementation as a reference point. It would be possible and interesting to explore a multi-core implementation and assess the speedups, but this subject is out of the scope of this paper.)
Design                    Cheung et al. [2]    Smaragdos et al. [10]   This work
Model                     Izhikevich           Extended HH             Extended HH
Time step (ms)            1                    0.05                    0.05
Real-time network size    64,000               96                      330
Arithmetic precision      Fixed point          Floating point          Floating point
Model OPs × net. size     >832*                2,131.2                 24,684
Speed-up vs. CPU          -                    ×12.5 (C code)          ×92-×102 (C code)
FPGA chip                 Virtex 6 SX475T      Virtex 7 XC7VX485T      Virtex 6 SX475T
                          (Maxeler machine)                            (Maxeler machine)
Device capacity (LUTs)    297,600 6-input      303,600 6-input         297,600 6-input
Computation density       2,796*               7,019                   82,943
(FLOPS/LUT)
* Fixed-point operations

Table 1. Overview of this and related real-time SNN implementations and their achievable real-time network sizes. CPU speed-up for the ION designs is measured against a Xeon E5430 2.66GHz/20GB RAM.
has a speed-up of ×92 to ×102 compared to the single-FP C implementation
(Figure 4). It uses about 74% of the total logic available in the DFE hardware;
more specifically, about 64% of LUTs, 60% of FF, 30% of DSPs and 41% of
on-chip BRAMs. The maximum network size that can be simulated is 7,680 ION
neurons before we run out of resources.
To quantify the computation density of the design, we use the same method
of calculating FP operations per second (FLOPS) per logic unit (LUTs) as in [10].
To the best of our knowledge, the only other SNN implementation on a Maxeler
DFE is the one by Cheung et al. [2]. A comparison of this work to the FPGA-
based kernel and our new Maxeler-based design can be seen on Table 1. The
Maxeler-based design can achieve 24.7 GFLOPS when executing its maximum
real-time network (330 cells) and has a computation (performance) density of
82,943 FLOPS/LUT (6-input LUTs), as opposed to 2.1 GFLOPS and 7,019
FLOPS/LUT for the conventional FPGA kernel, respectively.
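The density figure follows directly from the numbers above; the small discrepancy in the cross-check below comes from rounding the throughput to 24.7 GFLOPS, whereas the reported 82,943 uses the unrounded value:

```python
# Computation density, same method as [10]: sustained FLOPS while running
# the maximum real-time network, divided by 6-input LUT device capacity.

GFLOPS = 24.7e9        # max throughput, 330-cell real-time network
LUTS   = 297_600       # Virtex-6 SX475T 6-input LUTs (Table 1)

density = GFLOPS / LUTS
print(round(density))  # ~83,000 FLOPS/LUT, within rounding of 82,943
```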
7 Conclusions
In this paper a dataflow implementation of a model of the Inferior-olivary Nucleus
was presented with significant speed-ups compared to the CPU implementation
of the same model and related works. The model itself is a highly complex,
extended-HH, biophysically-meaningful model of its biological counterpart. The
inherent application parallelism was exploited to a much greater extent when
implemented in a single DFE of a Maxeler Dataflow Machine. This, alongside
with the higher operating frequency, led to a significant improvement to the
previous design implemented on a conventional FPGA board. The fastest DFE
implementation achieved a real-time network of 330 neurons – ×3.4 larger than the FPGA one – and achieved almost ×2-×8 greater speed-ups compared to the FPGA port. The real-time network sustains a little over 24 GFLOPS and has almost ×11 greater computation density than the conventional FPGA.
The larger real-time-network size as well as the high modeling accuracy, have
the potential to enable deeper exploration of the Olivocerebellar microsection
behavior compared to the previous FPGA implementation. Besides the speedup,
this DFE-based implementation has also opened new possibilities for future
simulation-based brain research: The Maxeler DFE platforms offer extended
capabilities and significantly higher programming ease compared to conventional
FPGAs. Such capabilities include the use of the large DRAMs located on the DFE boards, fast network connections directly to the DFE fabric and the ability to combine multiple DFE boards for running massive-scale network simulations.
References

1. P. Bazzigaluppi, J. R. De Gruijl, R. S. Van Der Giessen, S. Khosrovani, C. I. De Zeeuw, and M. T. G. De Jeu. Olivary subthreshold oscillations and burst activity revisited. Frontiers in Neural Circuits, 6(91), 2012.
2. K. Cheung, S. R. Schultz, and W. Luk. A large-scale spiking neural network accelerator for FPGA systems. In Int. Conf. on Artificial Neural Networks and Machine Learning, ICANN'12, pages 113–120, 2012.
3. C. I. De Zeeuw, F. E. Hoebeek, L. W. J. Bosman, M. Schonewille, L. Witter, and S. K. Koekkoek. Spatiotemporal firing patterns in the cerebellum. Nat Rev Neurosci, 12(6):327–344, Jun 2011.
4. E. Izhikevich. Which model to use for cortical spiking neurons? IEEE Trans. on Neural Networks, 15(5), 2004.
5. W. Maass. Noisy spiking neurons with temporal coding have more computational power than sigmoidal neurons. In Neural Inf. Proc. Systems, pages 211–217, 1996.
6. S. W. Moore, P. J. Fox, S. J. Marsh, A. T. Markettos, and A. Mujumdar. Bluehive — a field-programmable custom computing machine for extreme-scale real-time neural network simulation. In IEEE Int. Symp. on FCCM, pages 133–140, 2012.
7. National Academy of Engineering. Grand Challenges for Engineering, 2010.
8. S. P. Marshall and E. J. Lang. Inferior olive oscillations gate transmission of motor cortical activity to the cerebellum. The Journal of Neuroscience, 24(50):11356–11367, 2004.
9. Maxeler Technologies. MPC-X Series.
10. G. Smaragdos, S. Isaza, M. V. Eijk, I. Sourdis, and C. Strydis. FPGA-based biophysically-meaningful modeling of olivocerebellar neurons. In 22nd ACM/SIGDA Int. Symposium on FPGAs (FPGA), 2014.
11. Y. Zhang, J. P. McGeehan, E. M. Regan, S. Kelly, and J. L. Nunez-Yanez. Biophysically accurate floating point neuroprocessors for reconfigurable logic. IEEE Trans. on Computers, 62(3):599–608, March 2013.
... It, further, allows for applications to be implemented in a deeply pipelined fashion leading to a high computational throughput. The performance benefits due to the dataflow paradigm, when compared to the control-flow paradigm, are shown in [14]. Moreover, by programming with the MaxJ toolflow, the programming complexity is significantly reduced in comparison to using low-level (e.g. ...
... end if 14: return iComp 15: end function with both neighbouring compartments. This results in (14) for the calculation of the current: ...
... As follows from (13) to (15), (14) (the current when compartment i is between other compartments) is the sum of (13) (the current at the starting position) and (15) (the current at the ending position). Consequently, (13) is stored in I comp,next and (15) is stored in I comp,prev and based on the position one, of these currents or the sum of these current is chosen for I comp,i . ...
Full-text available
The Hodgkin-Huxley (HH) neuron is one of the most biophysically-meaningful models used in computational neuroscience today. Ironically, the model’s high experimental value is offset by its disproportional computational complexity. To such an extent that neuroscientists have either resorted to simpler models, losing precious neuron detail, or to using high-performance computing systems, to gain acceleration, for complex models. However, multicore/multinode CPU-based systems have proven too slow while FPGA-based ones have proven too time-consuming to (re)deploy to. Clearly, a solution that bridges user friendliness and high speedups is necessary. This paper presents flexHH, a flexible FPGA library implementing five popular, highly parameterizable variants of the HH neuron model. flexHH is the first crucial step towards making FPGA-based simulations of compute-intensive neural models available to neuroscientists without the debilitating penalty of re-engineering and re-synthesis. Through flexHH, the user can instantiate custom models and immediately take advantage of the acceleration without the mediation of an engineer, which has proven to be a major inhibitor to full adoption of FPGAs in neuroscience labs. In terms of performance, flexHH achieves speedups between 8×–20× compared to sequential-C implementations, while only a small drop in real-time capabilities is observed when compared to hardcoded FPGA-based versions of the models tested.
... The DFE implementation of the InfOli application can be seen in Figure 4 and is a more advanced version of the work done in [26]. Added features include the addition of programmable connectivity and programmable neuron state by the user between experiments runs without the need to re-synthesize the design. ...
... On purely dataflow neuromodeling applications, the DFE can have great benefits both in large scale networks but also in real-time network performance [30]. Even in the cases of HH neurons that include highly accurate interconnectivity modeling (breaking the pure dataflow nature), the DFEs can accomplish greater benefits than traditional FPGA acceleration [26]. ...
Full-text available
Objective: The advent of High-Performance Computing (HPC) in recent years has led to its increasing use in brain study through computational models. The scale and complexity of such models are constantly increasing, leading to challenging computational requirements. Even though modern HPC platforms can often deal with such challenges, the vast diversity of the modeling field does not permit for a homogeneous acceleration platform to effectively address the complete array of modeling requirements. Approach: In this paper we propose and build BrainFrame, a heterogeneous acceleration platform that incorporates three distinct acceleration technologies, an Intel Xeon-Phi CPU, a NVidia GP-GPU and a Maxeler Dataflow Engine. The PyNN software framework is also integrated into the platform. As a challenging proof of concept, we analyze the performance of BrainFrame on different experiment instances of a state-of-the-art neuron model, representing the Inferior-Olivary Nucleus using a biophysically-meaningful, extended Hodgkin-Huxley representation. The model instances take into account not only the neuronal-network dimensions but also different network-connectivity densities, which can drastically affect the workload's performance characteristics. Main results: The combined use of different HPC fabrics demonstrated that BrainFrame is better able to cope with the modeling diversity encountered in realistic experiments. Our performance analysis shows clearly that the model directly affects performance and all three technologies are required to cope with all the model use cases. Significance: The BrainFrame framework is designed to transparently configure and select the appropriate back-end accelerator technology for use per simulation run. The PyNN integration provides a familiar bridge to the vast number of models already available. 
Additionally, it gives a clear roadmap for extending the platform support beyond the proof of concept, with improved usability and directly useful features to the computational-neuroscience community, paving the way for wider adoption.&#13.
... This work, however, uses non-biophysically-meaningful modeling for the cerebellar circuit, lacking many of the intricate details of the underlying biological processes and resorting to black boxes within the modeling structure. Additionally, for HH models, GPU implementations have been shown to be less efficient than reconfigurable-hardware solutions [11], [12], even though they provide notable speedups [13]. ...
... The DFE implementation of the InfOli application depicted in Figure 4 is based on the design presented in [11]. It incorporates 3 internal pipelines, one for each part of the neuron (Dendrite, Soma, Axon), each performing the respective computations. ...
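The three-pipeline split (Dendrite, Soma, Axon) can be illustrated with a single-cell update step. This is a minimal sketch only: the constants and the simplified leak-plus-coupling dynamics are placeholders, not the published extended Hodgkin-Huxley equations, which involve several gating variables per compartment.

```python
# Illustrative three-compartment update (forward Euler). All coefficients
# are invented for illustration; the real InfOli model is far richer.

def step_cell(v_dend, v_soma, v_axon, i_app, dt=0.05):
    """Advance one cell's three compartment voltages by one time step."""
    # Dendrite: leak toward rest + internal coupling from soma + input current
    dv_dend = (-0.1 * (v_dend + 60.0)
               + 0.2 * (v_soma - v_dend)
               + i_app)
    # Soma: leak + internal coupling from both neighboring compartments
    dv_soma = (-0.1 * (v_soma + 60.0)
               + 0.2 * (v_dend - v_soma)
               + 0.2 * (v_axon - v_soma))
    # Axon: leak + internal coupling from soma
    dv_axon = -0.1 * (v_axon + 60.0) + 0.2 * (v_soma - v_axon)
    return (v_dend + dt * dv_dend,
            v_soma + dt * dv_soma,
            v_axon + dt * dv_axon)
```

On a DFE, the three compartment computations map naturally onto three hardware pipelines that run concurrently, since within a step each depends only on the previous step's state.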
... The goal of our work is BrainFrame, a heterogeneous acceleration platform that incorporates three distinct acceleration technologies: Intel Xeon-Phi CPUs, NVIDIA GP-GPUs, and Maxeler Dataflow Engines (DFEs). So far, we have simulated the ION model on single-node GPU [7], Xeon-Phi [29], and DFE [30] setups as well as on a multi-node (eight-way) Xeon-Phi [1] setup. Eventually, BrainFrame will move toward multi-node heterogeneity and into the Cloud for all to access. ...
Full-text available
In-silico brain simulations are the de-facto tools computational neuroscientists use to understand large-scale and complex brain-function dynamics. Current brain simulators do not scale efficiently enough to large-scale problem sizes (e.g., >100,000 neurons) when simulating biophysically complex neuron models. The goal of this work is to explore the use of true multi-GPU acceleration through NVIDIA’s GPUDirect technology on computationally challenging brain models and to assess their scalability. The brain model used is a state-of-the-art, extended Hodgkin-Huxley, biophysically meaningful, three-compartmental model of the inferior-olivary nucleus. The Hodgkin-Huxley model is the most widely adopted conductance-based neuron representation, and thus the results from simulating this representative workload are relevant for many other brain experiments. Not only the actual network-simulation times but also the network-setup times were taken into account when designing and benchmarking the multi-GPU version, an aspect often ignored in similar previous work. Network sizes varying from 65K to 2M cells, with 10 and 1,000 synapses per neuron were executed on 8, 16, 24, and 32 GPUs. Without loss of generality, simulations were run for 100 ms of biological time. Findings indicate that communication overheads do not dominate overall execution while scaling the network size up is computationally tractable. This scalable design proves that large-network simulations of complex neural models are possible using a multi-GPU design with GPUDirect.
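A multi-GPU simulation of this kind begins by splitting the cell population across devices. A minimal block-partitioning sketch follows; the function name and the even-remainder policy are my own, not taken from the paper:

```python
def partition(num_cells, num_gpus):
    """Block-partition cell indices across devices, spreading any
    remainder one extra cell per leading device."""
    base, rem = divmod(num_cells, num_gpus)
    parts, start = [], 0
    for g in range(num_gpus):
        size = base + (1 if g < rem else 0)
        parts.append(range(start, start + size))
        start += size
    return parts
```

With a partition like this in place, GPUDirect peer-to-peer transfers would carry only the boundary (gap-junction neighbor) state between devices each step.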
... The computational complexity of conductance-based models is orders of magnitude higher than that of IaF models, posing a significant challenge for their efficient simulation. For HH models, GPU implementations have been shown to be less efficient than reconfigurable-hardware solutions [22], [23], even though they provide notable speedups [24]. ...
Simulation of brain neurons in real time using biophysically meaningful models is a prerequisite for a comprehensive understanding of how neurons process information and communicate with each other, efficiently complementing in-vivo experiments. State-of-the-art neuron simulators are, however, capable of simulating at most a few tens or hundreds of biophysically accurate neurons in real time, due to the exponential growth of inter-neuron communication costs with the number of simulated neurons. In this paper, we propose a real-time, reconfigurable, multichip system architecture based on localized communication, which effectively reduces the communication cost to linear growth. All parts of the system are generated automatically, based on the neuron connectivity scheme. Experimental results indicate that the proposed system architecture supports over 3,000 to 19,200 (depending on the connectivity scheme) biophysically accurate neurons over multiple chips.
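The quadratic-versus-linear communication growth mentioned above is easy to make concrete. A sketch counting per-step messages under both schemes (the function names and the simple neighbor-count model are mine, for illustration only):

```python
def messages_all_to_all(n):
    """Every neuron broadcasts its state to every other neuron: O(n^2)."""
    return n * (n - 1)

def messages_localized(n, k):
    """Each neuron exchanges state only with k local neighbors: O(n*k)."""
    return n * k
```

At n = 1000 cells with k = 8 local neighbors, localized communication needs roughly two orders of magnitude fewer messages per step, which is what makes the linear-growth claim matter at scale.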
Conference Paper
The Inferior-Olivary nucleus (ION) is a well-charted region of the brain, heavily associated with sensorimotor control of the body. It comprises ION cells with unique properties which facilitate sensory processing and motor-learning skills. Various simulation models of ION-cell networks have been written in an attempt to unravel their mysteries. However, simulations become rapidly intractable when biophysically plausible models and meaningful network sizes (>=100 cells) are modeled. To overcome this problem, in this work we port a highly detailed ION cell network model, originally coded in Matlab, onto an FPGA chip. It was first converted to ANSI C code and extensively profiled. It was then translated to HLS C code for the Xilinx Vivado toolflow, and various algorithmic and arithmetic optimizations were applied. The design was implemented in a Virtex 7 (XC7VX485T) device and can simulate a 96-cell network at real-time speed, yielding a speedup of x700 compared to the original Matlab code and x12.5 compared to the reference C implementation running on an Intel Xeon 2.66GHz machine with 20GB RAM. For a 1,056-cell network (non-real-time), an FPGA speedup of x45 against the C code can be achieved, demonstrating the design's usefulness in accelerating neuroscience research. Limited by the available on-chip memory, the FPGA can maximally support a 14,400-cell network (non-real-time) with online parameter configurability for cell state and network size. The maximum throughput of the FPGA ION-network accelerator can reach 2.13 GFLOPS.
Spiking neural networks (SNN) aim to mimic membrane potential dynamics of biological neurons. They have been used widely in neuromorphic applications and neuroscience modeling studies. We design a parallel SNN accelerator for producing large-scale cortical simulation targeting an off-the-shelf Field-Programmable Gate Array (FPGA)-based system. The accelerator parallelizes synaptic processing with run time proportional to the firing rate of the network. Using only one FPGA, this accelerator is estimated to support simulation of 64K neurons 2.5 times real-time, and achieves a spike delivery rate which is at least 1.4 times faster than a recent GPU accelerator with a benchmark toroidal network.
The inferior olive (IO) forms one of the major gateways for information that travels to the cerebellar cortex. Olivary neurons process sensory and motor signals that are subsequently relayed to Purkinje cells. The intrinsic subthreshold membrane potential oscillations of the olivary neurons are thought to be important for gating this flow of information. In vitro studies have revealed that the phase of the subthreshold oscillation determines the size of the olivary burst and may gate the information flow or encode the temporal state of the olivary network. Here, we investigated whether the same phenomenon occurred in murine olivary cells in an intact olivocerebellar system using the in vivo whole-cell recording technique. Our in vivo findings revealed that the number of wavelets within the olivary burst did not encode the timing of the spike relative to the phase of the oscillation but was related to the amplitude of the oscillation. Manipulating the oscillation amplitude by applying harmaline confirmed the inverse relationship between the amplitude of oscillation and the number of wavelets within the olivary burst. Furthermore, we demonstrated that electrotonic coupling between olivary neurons affects this modulation of the olivary burst size. Based on these results, we suggest that the olivary burst size might reflect the "expectancy" of a spike to occur rather than the spike timing, and that this process requires the presence of gap junction coupling.
Neurons are generally considered to communicate information by increasing or decreasing their firing rate. However, in principle, they could in addition convey messages by using specific spatiotemporal patterns of spiking activities and silent intervals. Here, we review expanding lines of evidence that such spatiotemporal coding occurs in the cerebellum, and that the olivocerebellar system is optimally designed to generate and employ precise patterns of complex spikes and simple spikes during the acquisition and consolidation of motor skills. These spatiotemporal patterns may complement rate coding, thus enabling precise control of motor and cognitive processing at a high spatiotemporal resolution by fine-tuning sensorimotor integration and coordination.
Bluehive is a custom 64-FPGA machine targeted at scientific simulations with demanding communication requirements. Bluehive is designed to be extensible, with a reconfigurable communication topology suited to algorithms with demanding high-bandwidth and low-latency communication, something which is unattainable with commodity GPGPUs and CPUs. We demonstrate that a spiking-neuron algorithm can be efficiently mapped to Bluehive using Bluespec SystemVerilog by taking a communication-centric approach. This contrasts with many FPGA-based neural systems which are very focused on parallel computation, resulting in inefficient use of FPGA resources. Our design allows 64k neurons with 64M synapses per FPGA and is scalable to a large number of FPGAs.
This paper presents a high-performance and biophysically accurate neuroprocessor architecture based on floating-point arithmetic and compartmental modeling. It aims to overcome the limitations of traditional hardware neuron models that simplify the required arithmetic using fixed-point representations, which can result in arbitrary loss of precision due to rounding errors and data truncation. On the other hand, a neuroprocessor based on a floating-point bio-inspired model, such as the one presented in this work, is able to capture additional cell properties and accurately mimic cellular behaviors required in many neuroscience experiments. The architecture is prototyped in reconfigurable logic, obtaining a flexible and adaptable cell and network structure together with real-time performance by using the available floating-point hardware resources in parallel. The paper also demonstrates model scalability by combining the basic processor components that describe the soma, dendrite and synapse of organic cells to form more complex neuron structures.
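The fixed-point precision loss that motivates the floating-point design above can be demonstrated with a one-line quantizer. This is a generic illustration of fixed-point rounding, not the specific number format of the paper:

```python
def to_fixed(x, frac_bits=8):
    """Quantize x onto a fixed-point grid with `frac_bits` fractional bits,
    as a fixed-point datapath would store it. Worst-case error is half an
    LSB, i.e. 2**-(frac_bits + 1)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale
```

With 8 fractional bits, a membrane potential like -65.4321 mV is stored with millivolt-scale error per operation, and such errors accumulate over the millions of steps of a long simulation; doubles keep the per-operation error near 1e-16 relative.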
Inferior olivary (IO) neurons display spontaneous oscillatory activity, yet the importance of these oscillations for shaping the responses of this system to its afferents is uncertain. We used multiple electrode recording of crus 2a Purkinje cell complex spikes (CSs) in ketamine-xylazine-anesthetized rats to investigate olivocerebellar responses to activation of motor cortico-olivary pathways. Trains of electrical stimuli were applied to the motor cortex at frequencies between 4 and 30 Hz. Various frequency-response curves were observed, with the most common types being unimodal with a maximum at 9.5 +/- 2.3 Hz and bimodal with peaks at 8.9 +/- 1.0 and 15.1 +/- 1.3 Hz. To determine whether IO oscillatory properties underlie the resonance peaks in the frequency-response curves, apamin and charybdotoxin were injected into the IO. These toxins, which weaken and enhance spontaneous IO oscillations, respectively, had corresponding effects on the sharpness of resonance peaks. Next, the variation of CS entrainment patterns with frequency was investigated to characterize the nature of the IO oscillator. Low-frequency (4 Hz) stimulation was relatively ineffective in entraining CS activity. Between 4 and 30 Hz, two predominant entrainment patterns emerged. For low-frequency (4-6 Hz) and high-frequency (17-30 Hz) ranges, a 1:2 entrainment dominated, whereas in the intermediate range (6-17 Hz), 1:1 entrainment was most prevalent. These results indicate that IO neurons respond as nonlinear oscillators to afferent signals.
We discuss the biological plausibility and computational efficiency of some of the most useful models of spiking and bursting neurons. We compare their applicability to large-scale simulations of cortical neural networks.
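A representative of the efficient spiking-neuron models surveyed there is Izhikevich's two-variable model, whose published dynamics are dv/dt = 0.04v^2 + 5v + 140 - u + I and du/dt = a(bv - u), with a reset (v <- c, u <- u + d) when v reaches 30 mV. A minimal forward-Euler sketch (the step ordering and the choice of regular-spiking defaults are mine):

```python
def izhikevich_step(v, u, i_in, a=0.02, b=0.2, c=-65.0, d=8.0, dt=1.0):
    """One Euler step of Izhikevich's two-variable spiking model, with
    regular-spiking default parameters. Returns (v, u, fired)."""
    v_new = v + dt * (0.04 * v * v + 5.0 * v + 140.0 - u + i_in)
    u_new = u + dt * a * (b * v - u)
    if v_new >= 30.0:               # spike: reset v, bump recovery variable
        return c, u_new + d, True
    return v_new, u_new, False
```

Two coupled scalar updates per neuron per step is what makes this class of model so much cheaper than conductance-based HH models, at the cost of biophysical interpretability, which is exactly the trade-off the comparison examines.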