Real-Time Olivary Neuron Simulations on
Dataflow Computing Machines
Georgios Smaragdos1, Craig Davies3, Christos Strydis1, Ioannis Sourdis5,
Cătălin Ciobanu5, Oskar Mencer3,4, and Chris I. De Zeeuw1,2
1Dept. of Neuroscience, Erasmus Medical Center, Rotterdam, The Netherlands
2Netherlands Institute for Neuroscience, Amsterdam, The Netherlands
3Maxeler Technologies Inc.
4Imperial College London
5Dept. of Computer Science & Engineering, Chalmers University of Technology,
Gothenburg, Sweden
Abstract. The Inferior-Olivary nucleus (ION) is a well-charted brain
region, heavily associated with the sensorimotor control of the body. It
comprises neural cells with unique properties which facilitate sensory
processing and motor-learning skills. Simulations of such neurons become
rapidly intractable when biophysically plausible models and meaning-
ful network sizes (at least in the order of some hundreds of cells) are
modeled. To overcome this problem, we accelerate a highly detailed ION
network model using a Maxeler Dataflow Computing Machine. The design
simulates a 330-cell network at real-time speed and achieves maximum
throughputs of 24.7 GFLOPS. The Maxeler machine, integrating a Virtex-6 FPGA, yields speedups of ×92-×102 and ×2-×8 compared to a reference-C implementation running on an Intel Xeon 2.66GHz, and to a pure Virtex-7 FPGA implementation, respectively.
1 Introduction
The United-States National Academy of Engineering has classified brain simulation as one of the Grand Engineering Challenges [7]. Biologically accurate brain simulation is a highly relevant topic for neuroscience for a number of reasons; the main goal is to accelerate brain research by creating more advanced research platforms. Even though significant effort is spent on understanding brain functionality, the exact mechanisms in most cases are still only hypothesized. Fast and real-time simulation platforms can enable the neuroscientific community to test these hypotheses more efficiently. Better
understanding of brain functionality can lead to a number of critical practical
applications: (1) Brain rescue: If brain functions can be simulated accurately
enough and in real-time, this can lead to robotic prosthetics and implants for
restoring lost brain functionality. (2) Advanced A.I.: Artificial Neural Networks
(ANNs) have already been successfully used in this field. It is believed that greater
understanding of biological systems and the richer computational dynamics of
their models, like spiking neural networks (SNNs) [5], can lead to more advanced,
artificial-intelligence and robotic applications. (3) New architectural paradigms:
Alternatives to the typical Von-Neumann architectures.
In many of these applications, real-time performance of Neural Networks (NN)
is desirable or, even, required. The main challenge in building such systems lies
largely in the computational and communication load of the network simulations.
Furthermore, biological NNs execute these computations with massive parallelism, which conventional, CPU-based execution cannot match well. As a result, neuron-population sizes and interconnectivity remain quite low when running on CPUs, greatly impeding the efficiency of brain research in relation to the above goals of brain simulation.
Other HPC computing alternatives fall short in a number of aspects. General-
Purpose GPUs can exploit application parallelism better and can be more efficient
in running neuron models. Yet, in the case of complex models or very large-scale
networks, they may not be able to provide real-time performance due to the
high rates of data exchange between the neurons and are less efficient in terms
of energy and power. Supercomputers can emulate the behavior and parallelism
of biological networks with sufficient speed but have limited availability and
require large space, implementation, maintenance and energy costs while lacking
any kind of mobility. Mixed-signal, Very-Large-Scale-Integration (VLSI) circuits
achieve adequate simulation speeds while simulating biological systems more
accurately since they model neurons through analog signals. However, they are
much more difficult to implement, lack flexibility and often suffer from problems
typical in analog design (transistor mismatching, crosstalk etc.).
Implementing the neural network in parallel, digital hardware can efficiently
match the parallelism of biological models and provide real-time performance.
While Application-Specific Integrated-Circuits (ASICs) design is certainly an
option, it is expensive, time-consuming and – most importantly – inflexible (i.e.
cannot be altered after fabrication), a characteristic often required when fitting
novel neuron models. Most of these issues can be tackled through the use of
reconfigurable hardware. Although slower than ASICs, it can still provide high
performance for real-time and hyper-real-time neuron simulations, by exploiting
the inherent parallelism of hardware. Besides requiring a lot less energy and, in
some cases, less space than most of the above solutions for delivering the same
computational power, the reconfiguration property of such platforms provides
the flexibility of modifying brain models on demand.
In this paper we present a hardware-accelerated application for a specific,
biophysically-meaningful SNN model, using single-precision floating-point (FP)
arithmetic computations. The application is mapped onto a Maxeler Dataflow Machine [9] which employs a dataflow-computing paradigm utilizing highly
efficient reconfigurable-hardware-based engines. The targeted application models
an essential area of the Olivocerebellar system: the Inferior-Olivary Nucleus (ION).
This design is a part of an ongoing effort to replicate the full Olivocerebellar
circuit on reconfigurable hardware-based platforms, for the purpose of real-time
experimentation on meaningful network sizes.
2 The Biological Neuron, the Cerebellum and the
Inferior Olive
Neurons are electrochemically excitable cells that process and transmit signals in the brain. The biological neuron comprises, in general (although in truth it is a much more complicated system), three parts, called compartments in neuromodeling jargon: the Dendrites, the Soma and the Axon. The dendritic compartment
represents the cell input stage. In turn, the soma processes the stimuli and
translates them into a cell membrane potential which evokes a cell response called
an action potential or, simply, a spike. This response is transferred through the
axon – the cell output stage – to other cells through a synapse.
The cerebellum is one of the most complex and dense regions in the brain,
playing an important role in sensorimotor control. It does not initiate movement
but influences the sensorimotor region in order to precisely coordinate the body’s
activities and motor learning skills. It also plays an important role in the sensing
of rhythm, enabling the handling of concepts such as music or harmony. The
Olivocerebellar circuitry is an essential part of the Cerebellar functionality. Main
areas of the circuit are the Purkinje cells (PC), the Deep Cerebellar Nuclei
(DCN) and the Inferior-Olivary Nucleus (ION) [3]. The system is theorized to be
organized in separate groups of PC, DCN and ION cells (Microsections) that – in
combination – control distinctive parts of muscle movement and may or may not
be interconnected to each other [8]. The network sizes in these microsections are
hypothesized to vary; for the ION in particular, a microsection can contain from a dozen to several hundred cells. ION cells especially are also
interconnected by purely electrical connections between their dendrites, called
gap junctions, considered to be important for the synchronization of activity
within the nucleus and the Olivocerebellar system, in general.
3 Related Work
In the past, a number of designs have been proposed for the implementation
of neuron and network models using reconfigurable hardware. In this section, we
present some of the most notable works in the field.
Two of the most interesting implementations in terms of computation (perfor-
mance) density [10] using Izhikevich neuron modeling [4] are the designs proposed
by Cheung et al. [2] and by Moore et al. (Bluehive [6]). Each approach proposed
a reconfigurable-hardware architecture for very-large-scale SNNs. To improve on
their initial memory-bandwidth requirements, Cheung et al. also used a Maxeler
Dataflow Machine [2]. The size of the implemented network achieved was 64K
neurons, each having about 1,000 synaptic connections to neighboring neurons.
In the Bluehive device, Moore et al. employed DDR2 RAMs and built custom-made SATA-to-PCI connections for stacking FPGA devices, facilitating large SNN simulations. Only a small portion of data was stored on-board the FPGAs.
(We use the terms neuron and (neural) cell interchangeably in this paper.)

Fig. 1. Illustration of (a) a 6-cell network-model example, (b) the 3-compartmental model of a single ION cell, and (c) its output (spiking) response to an input pulse.
In a Stratix IV FPGA, the authors simulated 64K Izhikevich neurons with 64M
simple synapses at real-time performance.
These works have incorporated fixed-point arithmetic to implement the compu-
tation of their neuron models. Zhang et al. [11] have proposed a Hodgkin-Huxley
(HH) modeling [4] accelerator using dedicated Floating Point (FP) units. The
FP units were custom-designed to provide better accuracy, resource usage and
performance. The 32-bit FP arithmetic used in the model produced a neuro-
processor architecture which could simulate 4 complete cells (synapse, dendrite,
soma) at real-time speed. Smaragdos et al. [10] have also ported an HH-based
model of Olivocerebellar cells onto a Xilinx Virtex-7 FPGA. The model entails
demanding FP computations per cell as well as high inter-cell communication
(up to all-to-all connectivity). Real-time simulations of up to 96 neurons could
be achieved. It must be noted that the neuron model used was estimated to be about ×18 more complex than those of the other related works in terms of FP operations.
4 ION-Model Description
The application accelerated in this paper models the behavior of the ION neurons
using an extended-HH model description [1]. This model is a first step towards building a high-performance, Olivocerebellar-system simulation platform. The
model not only divides the cells into multiple compartments but also creates a
network onto which neurons are interconnected by modeling gap junctions.
The ION cell model divides the neuron into three computational compartments
– closely resembling biological counterparts – as shown in Figure 1(b). Every
compartment has a state that holds pertinent electrochemical variables and, on
every simulation step, the state is updated based on: i) the previous state, ii) the
other compartments’ previous state, iii) the other neurons’ previous state, and
iv) any externally evoked input stimulus to the cell or network.
The computational model operates in a fashion allowing concurrent execution
of the three compartments. The model is calibrated to produce every per-neuron output value with a 50-μs time step. This means that, in order to support real-time simulations, all neurons must complete one iteration of compartmental calculations within 50 μs. Due to the realistic electrochemical variables handled by the model, most of the computations require FP arithmetic.
Figure 1(a) illustrates the network-model architecture with an example size
of 6 cells. Every cell receives, through the dendritic compartment, the influence
of all other cells in the network, thus modeling the massive number of biological
gap junctions present in the Inferior Olive. The gap-junction computations are
repeated for every neighboring input, computing the accumulated influence of
all neighboring connections. The ION network must be synchronized in order
to guarantee the correct exchange of cell-state data when multiple cells and
compartments are being computed simultaneously. The system works in lock-
step, computing discrete output values (with a 50-μs time step) that, when aggregated in time, contribute to form the electrical waveform response of the system (Figure 1(c)).
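The lock-step network update described above can be sketched as follows. This is an illustrative sketch only, with hypothetical names and placeholder dynamics; the actual extended-HH compartment equations [1] are far more involved than the stand-in functions below.

```python
def coupling(dv):
    # stand-in for the gap-junction current as a function of the
    # dendritic voltage difference between two coupled cells
    return 0.01 * dv

def f_dend(state, i_gj, i_app):   # dendrite: the cell's input stage
    return state["v_dend"] + 0.1 * (i_app - i_gj)

def f_soma(state):                # soma: membrane-potential dynamics
    return 0.9 * state["v_soma"] + 0.1 * state["v_dend"]

def f_axon(state):                # axon: the cell's output stage
    return 0.9 * state["v_axon"] + 0.1 * state["v_soma"]

def gap_junction_influence(v_self, v_dend_all):
    # repeated for every neighbouring input: N iterations per cell,
    # hence N*N gap-junction computations per step (all-to-all network)
    return sum(coupling(v_self - v_other) for v_other in v_dend_all)

def simulation_step(states, stimuli):
    # New states depend only on the previous network state, so the whole
    # network advances in lock-step and one old copy of each state suffices.
    v_dend_all = [s["v_dend"] for s in states]
    new_states = []
    for s, i_app in zip(states, stimuli):
        i_gj = gap_junction_influence(s["v_dend"], v_dend_all)
        new_states.append({"v_dend": f_dend(s, i_gj, i_app),
                           "v_soma": f_soma(s),
                           "v_axon": f_axon(s)})
    return new_states
```

Note that `simulation_step` only swaps in the new states after every cell has finished, mirroring the synchronization requirement stated above.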
An extensive profiling of the application, conducted in [10], reveals that
in a fully interconnected ION network the gap-junction computations increase
quadratically with network size. A meaningful network size for the ION model
would be one large enough to enable as extensive as possible exploration of
microsection organizations. Organizations of either a high number of microsections
with few cells or a low number of microsections with many cells can reveal a
range of different system behavioral dynamics and, thus, validate (or invalidate)
a set of hypotheses on brain functionality. According to our neuroscientific expert
partners, a minimum ION-network size for meaningful experimentation would be
around 100 cells. Any improvement in achievable size beyond this point would
enable greater possibilities for the exploration of microsection behavior. For such
network sizes (100 cells and higher), gap-junction computations dominate all cell computations.
5 DFE Implementation of the ION Model
The design presented in this paper is a continuation of previous work on an FPGA implementation of an ION-network hardware accelerator [10]. The FPGA kernel was designed to work alongside a host processor and executed the
aforementioned model in a step-by-step process. The hardware description of the
kernel was coded in C using the Vivado HLS v2013.2 tool targeting a Virtex-7
device. Although this design offered considerable speedup over a reference CPU
implementation, it was still unable to fully exploit the parallelism of the model –
which essentially is a dataflow application – without considerable restructuring
of the initial model.
On the other hand, a Maxeler Dataflow Computing Machine [9], based on data-
flow engines (or DFEs), has the ability to better exploit the inherent parallelism
of the model and has the potential to achieve even greater speed-ups with minor
changes in the model architecture. Also based on FPGA hardware, DFE boards
and their compiling tools are designed to fully exploit loop-level parallelism by
Fig. 2. Illustration of (a) a single instance of the FPGA ION kernel of [10], with non-pipelined datapaths and a pipelined, partially unrolled gap-junction (GJ) loop, and (b) a single instance of the DFE ION kernel, with fully pipelined datapaths, per-state BRAMs and control-flow counters (D = Dendrite, S = Soma, A = Axon).
fine-grain pipelining of computations and efficient usage of the FPGA fabric,
while additionally including efficient I/O between the chip and the on-board
resources (e.g. on-board DRAMs).
5.1 The ION-kernel DFE Architecture
The DFE implementation of the ION network can be seen in Figure 2(b). The
design incorporates 3 internal pipelines, one for each part of the cell (Dendrite, Soma, Axon), executing the respective computations. The cell states consist
of 19 FP values. Each parameter for each neuron is stored on its own BRAM
block, for fast read/update of the network state. Since every new cell state is
dependent only on the network state of the previous simulation step, only one
copy of each neuron state is required. The input stream of the DFE kernel comes
from the on-board RAM and represents the evoked inputs (one value for each
neuron per simulation step) used in the dendritic computations comprising the
network input. Only for the first simulation step, the initial state and the neighboring (gap-junction) influence are also streamed in from the on-board memory as each neuron begins execution. The network output (represented by the axonal voltage) is also streamed to the on-board memory at the same moment as it is updated on its respective BRAM block.
Due to the dataflow paradigm followed, the DFE kernel executes the complete
simulation run when activated, as opposed to the control-flow-based FPGA
kernel that only executes the simulation step by step, under the supervision
of a MicroBlaze core. As such, the DFE kernel additionally receives scalar
input parameters, denoting the simulation duration and the network size to
be simulated. Program flow is tracked using hardware counters that monitor gap-junction loop iterations, neurons executed and the number of simulation steps concluded. All scalar parameters, kernel activation, input-data preparation before execution and output visualization after execution are handled by an off-board host processor. The data flows through the DFE pipelines with each
kernel execution step (or tick), consuming the respective input or producing the
respective output and state. Each kernel tick represents the completion of one
gap-junction loop iteration. As a result, the DFE execution naturally pipelines
not only the gap-junction loop iterations but the execution of different neurons
within one simulation step as well, as opposed to the FPGA kernel that was
capable of only the former (Figure 2(a)). Simulation steps are not pipelined in
an all-to-all network, as every neuron must have the previous state of all other
neurons ready for its gap-junction computations before a new step begins. The
DFE pipeline is, thus, flushed before a new simulation step begins execution. The
execution of a single simulation step requires N² ticks to be completed, where N is the network size.
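Since each tick corresponds to one gap-junction loop iteration, this tick count directly bounds the real-time network size of a single, non-unrolled kernel. The following back-of-the-envelope check is our own illustrative arithmetic (it ignores pipeline fill/flush overhead and assumes the 150 MHz maximum kernel clock reported later, in Section 6):

```python
# Real-time bound for a single, non-unrolled DFE kernel: a network step
# costs N*N ticks, and all of them must fit inside the 50-us model step.

CLOCK_HZ = 150e6          # maximum DFE kernel frequency (Section 6)
STEP_S   = 50e-6          # model time step

def step_time_s(n_cells, clock_hz=CLOCK_HZ):
    # wall-clock time of one network simulation step, in seconds
    return (n_cells ** 2) / clock_hz

def max_realtime_cells(clock_hz=CLOCK_HZ, step_s=STEP_S):
    # largest N satisfying N*N / clock <= step
    return int((step_s * clock_hz) ** 0.5)

print(max_realtime_cells())        # 86 cells without any unrolling
print(step_time_s(330) * 1e6)      # ~726 us for 330 cells
```

This makes clear why the optimizations of Section 5.2 are needed: without them, a 330-cell step would take roughly 726 μs, far beyond the 50-μs budget.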
5.2 Additional Design Optimizations
There are two straightforward ways to speed up execution of the DFE kernel.
One is to use multiple instances of the kernel in a single DFE, if the DFE spare
resources allow it, thus doubling the network size achievable within a certain
time frame. The other is to unroll the gap-junction loop by replicating the single-
iteration hardware logic, essentially executing multiple iterations of the loop per
kernel tick. If U is the unroll factor of the loop, the number of ticks required
for a network simulation step is ⌈N²/U⌉, denoting a potentially considerable
speed-up. Both of these techniques are subject to area but also timing constraints.
Loop unrolling, in particular, could cause extra pressure in the routing of the
hardware, limiting the maximum achievable frequency of the DFE kernel.
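The unrolling trade-off can be checked with the same simple tick arithmetic. This is our own cross-check (again ignoring pipeline-flush overhead), using the configurations evaluated in Section 6:

```python
import math

# With unroll factor U, one network step takes ceil(N*N / U) kernel ticks.
# Check the evaluated configurations against the 50-us real-time budget.

def step_time_unrolled_s(n_cells, unroll, clock_hz):
    ticks = math.ceil(n_cells ** 2 / unroll)   # ticks per network step
    return ticks / clock_hz

t_u16 = step_time_unrolled_s(330, unroll=16, clock_hz=140e6)  # ~48.6 us
t_u8  = step_time_unrolled_s(330, unroll=8,  clock_hz=150e6)  # ~90.8 us

# Largest N meeting real-time at U=16 and 140 MHz:
n_max = int((16 * 140e6 * 50e-6) ** 0.5)       # 334
```

Only the U=16, 140 MHz configuration meets the 50-μs budget for 330 cells, and its theoretical ceiling of 334 cells is consistent with the 330-cell real-time maximum reported in Section 6.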
6 Evaluation
The design was implemented on a Maxeler device using the Vectis architecture.
The Vectis boards include a Virtex 6 SX475T FPGA in their DFEs. The maximum
frequency that an ION kernel achieved on the DFE board was 150MHz. This
design could be optimized by either using a second kernel instance within the same
DFE or unrolling the gap-junction loop. Unfortunately, the spare resources did
not enable the use of both optimizations simultaneously. As a result, 2 versions
of the design were tested, one with 2 ION kernels within the DFE and one with
a single kernel and loop unrolling. The maximum unroll factor achieved for the
frequency of 150MHz was 8. One last design was also evaluated: by reducing the DFE frequency to 140 MHz, the unroll factor could be raised to 16, expected to balance out the performance loss due to the lower frequency. Larger unroll factors were not achievable due to both timing and area constraints.

Fig. 3. Simulation-step execution time for the DFE kernels and the FPGA kernel [10].

Fig. 4. Speed-up of the best DFE and FPGA implementations compared to single-FP CPU-based execution on an Intel Xeon 2.66GHz with 20GB RAM.
In Figure 3 we can see the execution time for all the DFE-based designs and
the FPGA-kernel version (deployed on a Virtex 7 XC7VX485T device running
at 100 MHz), which includes 8 instances of the ION kernel shown in Figure 2(a).
Indeed, the best-performing Vectis implementation is the one with the lowest
frequency but the highest unroll factor. The gain of loop unrolling supersedes
the gains of using the extra kernel instance or the higher frequency. The FPGA
implementation incorporates also unrolling optimizations but of lower factor (4)
and, combined with its coarser-grain pipelining and lower operating frequency,
performs worse than the DFE implementation. In effect, the DFE can simulate up to 330 Inferior-Olivary cells at real-time speed (within the 50-μs time step) and is ×7.7 to ×2.26 faster than the FPGA implementation. In terms of speed-up compared to single-core CPU execution, the fastest DFE implementation
(For fairness in comparisons, the Maxeler DFE and the Xilinx FPGA board contain similar resources.)
(We use the single-core C implementation as a reference point. It would be possible and interesting to explore a multi-core implementation and assess the speedups, but this subject is out of the scope of this paper.)
Design                    Cheung et al. [2]    Smaragdos et al. [10]   This work
Model                     Izhikevich           Extended HH             Extended HH
Time step (ms)            1                    0.05                    0.05
Real-time network size    64,000               96                      330
Arithmetic precision      Fixed point          Floating point          Floating point
Model OPs × net. size     >832*                2,131.2                 24,684
Speed-up vs. CPU          -                    ×12.5 (C code)          ×92-×102 (C code)
FPGA chip                 Virtex 6 SX475T      Virtex 7 XC7VX485T      Virtex 6 SX475T
                          (Maxeler machine)                            (Maxeler machine)
Device capacity (LUTs)    297,600 6-input      303,600 6-input         297,600 6-input
Computation density       2,796*               7,019                   82,943
(FLOPS/LUT)
* Fixed-point operations

Table 1. Overview of this and related real-time SNN implementations and their achievable real-time network sizes. CPU speed-up for the ION designs is measured against a Xeon E5430 2.66GHz/20GB RAM.
has a speed-up of ×92 to ×102 compared to the single-FP C implementation
(Figure 4). It uses about 74% of the total logic available in the DFE hardware;
more specifically, about 64% of LUTs, 60% of FF, 30% of DSPs and 41% of
on-chip BRAMs. The maximum network size that can be simulated is 7,680 ION
neurons before we run out of resources.
To quantify the computation density of the design, we use the same method
of calculating FP operations per second (FLOPS) per logic unit (LUTs) as in [10].
To the best of our knowledge, the only other SNN implementation on a Maxeler
DFE is the one by Cheung et al. [2]. A comparison of this work to the FPGA-
based kernel and our new Maxeler-based design can be seen on Table 1. The
Maxeler-based design can achieve 24.7 GFLOPS when executing its maximum
real-time network (330 cells) and has a computation (performance) density of
82,943 FLOPS/LUT (6-input LUTs), as opposed to 2.1 GFLOPS and 7,019
FLOPS/LUT for the conventional FPGA kernel, respectively.
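The density figure follows directly from the numbers above; the small discrepancy in the cross-check below comes from rounding the throughput to 24.7 GFLOPS, whereas the reported 82,943 uses the unrounded value:

```python
# Computation density, same method as [10]: sustained FLOPS while running
# the maximum real-time network, divided by 6-input LUT device capacity.

GFLOPS = 24.7e9        # max throughput, 330-cell real-time network
LUTS   = 297_600       # Virtex-6 SX475T 6-input LUTs (Table 1)

density = GFLOPS / LUTS
print(round(density))  # ~83,000 FLOPS/LUT, within rounding of 82,943
```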
7 Conclusions
In this paper a dataflow implementation of a model of the Inferior-olivary Nucleus
was presented with significant speed-ups compared to the CPU implementation
of the same model and related works. The model itself is a highly complex,
extended-HH, biophysically-meaningful model of its biological counterpart. The
inherent application parallelism was exploited to a much greater extent when
implemented in a single DFE of a Maxeler Dataflow Machine. This, alongside
with the higher operating frequency, led to a significant improvement to the
previous design implemented on a conventional FPGA board. The fastest DFE
implementation achieved a real-time network of 330 neurons – ×3.4 larger than the FPGA one – and achieved almost ×2-×8 greater speed-ups compared to the FPGA port. The real-time network sustains a little over 24 GFLOPS and has almost ×11 greater computation density than the conventional FPGA.
The larger real-time-network size as well as the high modeling accuracy, have
the potential to enable deeper exploration of the Olivocerebellar microsection
behavior compared to the previous FPGA implementation. Besides the speedup,
this DFE-based implementation has also opened new possibilities for future
simulation-based brain research: The Maxeler DFE platforms offer extended
capabilities and significantly higher programming ease compared to conventional
FPGAs. Such capabilities include the use of the large DRAMs located on the DFE boards, fast network connections directly to the DFE fabric and the ability to combine multiple DFE boards for running massive-scale network simulations.
References

1. P. Bazzigaluppi, J. R. De Gruijl, R. S. Van Der Giessen, S. Khosrovani, C. I. De Zeeuw, and M. T. G. De Jeu. Olivary subthreshold oscillations and burst activity revisited. Frontiers in Neural Circuits, 6(91), 2012.
2. K. Cheung, S. R. Schultz, and W. Luk. A large-scale spiking neural network accelerator for FPGA systems. In Int. Conf. on Artificial Neural Networks and Machine Learning, ICANN'12, pages 113–120, 2012.
3. C. I. De Zeeuw, F. E. Hoebeek, L. W. J. Bosman, M. Schonewille, L. Witter, and S. K. Koekkoek. Spatiotemporal firing patterns in the cerebellum. Nat Rev Neurosci, 12(6):327–344, Jun 2011.
4. E. Izhikevich. Which model to use for cortical spiking neurons? IEEE Trans. on Neural Networks, 15(5), 2004.
5. W. Maass. Noisy spiking neurons with temporal coding have more computational power than sigmoidal neurons. In Neural Inf. Proc. Systems, pages 211–217, 1996.
6. S. W. Moore, P. J. Fox, S. J. Marsh, A. T. Markettos, and A. Mujumdar. Bluehive — a field-programmable custom computing machine for extreme-scale real-time neural network simulation. In IEEE Int. Symp. on FCCM, pages 133–140, 2012.
7. National Academy of Engineering. Grand Challenges for Engineering, 2010.
8. S. P. Marshall and E. J. Lang. Inferior olive oscillations gate transmission of motor cortical activity to the cerebellum. The Journal of Neuroscience, 24(50):11356–11367, 2004.
9. Maxeler Technologies. MPC-X Series.
10. G. Smaragdos, S. Isaza, M. V. Eijk, I. Sourdis, and C. Strydis. FPGA-based biophysically-meaningful modeling of olivocerebellar neurons. In 22nd ACM/SIGDA Int. Symposium on FPGAs (FPGA), 2014.
11. Y. Zhang, J. P. McGeehan, E. M. Regan, S. Kelly, and J. L. Nunez-Yanez. Biophysically accurate floating point neuroprocessors for reconfigurable logic. IEEE Trans. on Computers, 62(3):599–608, March 2013.
... It, further, allows for applications to be implemented in a deeply pipelined fashion leading to a high computational throughput. The performance benefits due to the dataflow paradigm, when compared to the control-flow paradigm, are shown in [14]. Moreover, by programming with the MaxJ toolflow, the programming complexity is significantly reduced in comparison to using low-level (e.g. ...
... end if 14: return iComp 15: end function with both neighbouring compartments. This results in (14) for the calculation of the current: ...
... As follows from (13) to (15), (14) (the current when compartment i is between other compartments) is the sum of (13) (the current at the starting position) and (15) (the current at the ending position). Consequently, (13) is stored in I comp,next and (15) is stored in I comp,prev and based on the position one, of these currents or the sum of these current is chosen for I comp,i . ...
Full-text available
The Hodgkin-Huxley (HH) neuron is one of the most biophysically-meaningful models used in computational neuroscience today. Ironically, the model’s high experimental value is offset by its disproportional computational complexity. To such an extent that neuroscientists have either resorted to simpler models, losing precious neuron detail, or to using high-performance computing systems, to gain acceleration, for complex models. However, multicore/multinode CPU-based systems have proven too slow while FPGA-based ones have proven too time-consuming to (re)deploy to. Clearly, a solution that bridges user friendliness and high speedups is necessary. This paper presents flexHH, a flexible FPGA library implementing five popular, highly parameterizable variants of the HH neuron model. flexHH is the first crucial step towards making FPGA-based simulations of compute-intensive neural models available to neuroscientists without the debilitating penalty of re-engineering and re-synthesis. Through flexHH, the user can instantiate custom models and immediately take advantage of the acceleration without the mediation of an engineer, which has proven to be a major inhibitor to full adoption of FPGAs in neuroscience labs. In terms of performance, flexHH achieves speedups between 8×–20× compared to sequential-C implementations, while only a small drop in real-time capabilities is observed when compared to hardcoded FPGA-based versions of the models tested.
... The DFE implementation of the InfOli application can be seen in Figure 4 and is a more advanced version of the work done in [26]. Added features include the addition of programmable connectivity and programmable neuron state by the user between experiments runs without the need to re-synthesize the design. ...
... On purely dataflow neuromodeling applications, the DFE can have great benefits both in large scale networks but also in real-time network performance [30]. Even in the cases of HH neurons that include highly accurate interconnectivity modeling (breaking the pure dataflow nature), the DFEs can accomplish greater benefits than traditional FPGA acceleration [26]. ...
Full-text available
Objective: The advent of High-Performance Computing (HPC) in recent years has led to its increasing use in brain study through computational models. The scale and complexity of such models are constantly increasing, leading to challenging computational requirements. Even though modern HPC platforms can often deal with such challenges, the vast diversity of the modeling field does not permit for a homogeneous acceleration platform to effectively address the complete array of modeling requirements. Approach: In this paper we propose and build BrainFrame, a heterogeneous acceleration platform that incorporates three distinct acceleration technologies, an Intel Xeon-Phi CPU, a NVidia GP-GPU and a Maxeler Dataflow Engine. The PyNN software framework is also integrated into the platform. As a challenging proof of concept, we analyze the performance of BrainFrame on different experiment instances of a state-of-the-art neuron model, representing the Inferior-Olivary Nucleus using a biophysically-meaningful, extended Hodgkin-Huxley representation. The model instances take into account not only the neuronal-network dimensions but also different network-connectivity densities, which can drastically affect the workload's performance characteristics. Main results: The combined use of different HPC fabrics demonstrated that BrainFrame is better able to cope with the modeling diversity encountered in realistic experiments. Our performance analysis shows clearly that the model directly affects performance and all three technologies are required to cope with all the model use cases. Significance: The BrainFrame framework is designed to transparently configure and select the appropriate back-end accelerator technology for use per simulation run. The PyNN integration provides a familiar bridge to the vast number of models already available. 
Additionally, it gives a clear roadmap for extending the platform support beyond the proof of concept, with improved usability and directly useful features to the computational-neuroscience community, paving the way for wider adoption.&#13.
... This work, however, uses non-biophysically-meaningful modeling for the cerebellar circuit, lacking many of the intricate details of the underlying biological processes and resorting to black boxes within the modeling structure. Additionally, for HH models, GPU implementations have been shown to be less efficient than reconfigurable-hardware solutions [11], [12], even though they provide notable speedups [13]. ...
... The DFE implementation of the InfOli application depicted in Figure 4 is based on the design presented in [11]. It incorporates 3 internal pipelines, one for each part of the neuron (Dendrite, Soma, Axon), each performing the respective computations. ...
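The three-pipeline split (Dendrite, Soma, Axon) can be illustrated with a single-cell update step. This is a minimal sketch only: the constants and the simplified leak-plus-coupling dynamics are placeholders, not the published extended Hodgkin-Huxley equations, which involve several gating variables per compartment.

```python
# Illustrative three-compartment update (forward Euler). All coefficients
# are invented for illustration; the real InfOli model is far richer.

def step_cell(v_dend, v_soma, v_axon, i_app, dt=0.05):
    """Advance one cell's three compartment voltages by one time step."""
    # Dendrite: leak toward rest + internal coupling from soma + input current
    dv_dend = (-0.1 * (v_dend + 60.0)
               + 0.2 * (v_soma - v_dend)
               + i_app)
    # Soma: leak + internal coupling from both neighboring compartments
    dv_soma = (-0.1 * (v_soma + 60.0)
               + 0.2 * (v_dend - v_soma)
               + 0.2 * (v_axon - v_soma))
    # Axon: leak + internal coupling from soma
    dv_axon = -0.1 * (v_axon + 60.0) + 0.2 * (v_soma - v_axon)
    return (v_dend + dt * dv_dend,
            v_soma + dt * dv_soma,
            v_axon + dt * dv_axon)
```

On a DFE, the three compartment computations map naturally onto three hardware pipelines that run concurrently, since within a step each depends only on the previous step's state.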
... The goal of our work is BrainFrame, a heterogeneous acceleration platform that incorporates three distinct acceleration technologies: Intel Xeon-Phi CPUs, NVIDIA GP-GPUs, and Maxeler Dataflow Engines (DFEs). So far, we have simulated the ION model on single-node GPU [7], Xeon-Phi [29], and DFE [30] setups as well as on a multi-node (eight-way) Xeon-Phi [1] setup. Eventually, BrainFrame will move toward multi-node heterogeneity and into the Cloud for all to access. ...
Full-text available
In-silico brain simulations are the de-facto tools computational neuroscientists use to understand large-scale and complex brain-function dynamics. Current brain simulators do not scale efficiently enough to large-scale problem sizes (e.g., >100,000 neurons) when simulating biophysically complex neuron models. The goal of this work is to explore the use of true multi-GPU acceleration through NVIDIA’s GPUDirect technology on computationally challenging brain models and to assess their scalability. The brain model used is a state-of-the-art, extended Hodgkin-Huxley, biophysically meaningful, three-compartmental model of the inferior-olivary nucleus. The Hodgkin-Huxley model is the most widely adopted conductance-based neuron representation, and thus the results from simulating this representative workload are relevant for many other brain experiments. Not only the actual network-simulation times but also the network-setup times were taken into account when designing and benchmarking the multi-GPU version, an aspect often ignored in similar previous work. Network sizes varying from 65K to 2M cells, with 10 and 1,000 synapses per neuron were executed on 8, 16, 24, and 32 GPUs. Without loss of generality, simulations were run for 100 ms of biological time. Findings indicate that communication overheads do not dominate overall execution while scaling the network size up is computationally tractable. This scalable design proves that large-network simulations of complex neural models are possible using a multi-GPU design with GPUDirect.
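A multi-GPU simulation of this kind begins by splitting the cell population across devices. A minimal block-partitioning sketch follows; the function name and the even-remainder policy are my own, not taken from the paper:

```python
def partition(num_cells, num_gpus):
    """Block-partition cell indices across devices, spreading any
    remainder one extra cell per leading device."""
    base, rem = divmod(num_cells, num_gpus)
    parts, start = [], 0
    for g in range(num_gpus):
        size = base + (1 if g < rem else 0)
        parts.append(range(start, start + size))
        start += size
    return parts
```

With a partition like this in place, GPUDirect peer-to-peer transfers would carry only the boundary (gap-junction neighbor) state between devices each step.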
... The computational complexity of conductance-based models is orders of magnitude higher than that of IaF models, posing a significant challenge for their efficient simulation. For HH models, GPU implementations have been shown to be less efficient than reconfigurable-hardware solutions [22], [23], even though they provide notable speedups [24]. ...
Simulation of brain neurons in real time using biophysically meaningful models is a prerequisite for a comprehensive understanding of how neurons process information and communicate with each other, efficiently complementing in-vivo experiments. State-of-the-art neuron simulators are, however, capable of simulating at most a few tens or hundreds of biophysically accurate neurons in real time, due to the exponential growth of inter-neuron communication costs with the number of simulated neurons. In this paper, we propose a real-time, reconfigurable, multichip system architecture based on localized communication, which effectively reduces the communication cost to linear growth. All parts of the system are generated automatically, based on the neuron connectivity scheme. Experimental results indicate that the proposed system architecture supports over 3,000 to 19,200 (depending on the connectivity scheme) biophysically accurate neurons over multiple chips.
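The quadratic-versus-linear communication growth mentioned above is easy to make concrete. A sketch counting per-step messages under both schemes (the function names and the simple neighbor-count model are mine, for illustration only):

```python
def messages_all_to_all(n):
    """Every neuron broadcasts its state to every other neuron: O(n^2)."""
    return n * (n - 1)

def messages_localized(n, k):
    """Each neuron exchanges state only with k local neighbors: O(n*k)."""
    return n * k
```

At n = 1000 cells with k = 8 local neighbors, localized communication needs roughly two orders of magnitude fewer messages per step, which is what makes the linear-growth claim matter at scale.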
Conference Paper
The Inferior-Olivary nucleus (ION) is a well-charted region of the brain, heavily associated with sensorimotor control of the body. It comprises ION cells with unique properties which facilitate sensory processing and motor-learning skills. Various simulation models of ION-cell networks have been written in an attempt to unravel their mysteries. However, simulations become rapidly intractable when biophysically plausible models and meaningful network sizes (>=100 cells) are modeled. To overcome this problem, in this work we port a highly detailed ION cell network model, originally coded in Matlab, onto an FPGA chip. It was first converted to ANSI C code and extensively profiled. It was then translated to HLS C code for the Xilinx Vivado toolflow, and various algorithmic and arithmetic optimizations were applied. The design was implemented in a Virtex 7 (XC7VX485T) device and can simulate a 96-cell network at real-time speed, yielding a speedup of x700 compared to the original Matlab code and x12.5 compared to the reference C implementation running on an Intel Xeon 2.66GHz machine with 20GB RAM. For a 1,056-cell network (non-real-time), an FPGA speedup of x45 against the C code can be achieved, demonstrating the design's usefulness in accelerating neuroscience research. Limited by the available on-chip memory, the FPGA can maximally support a 14,400-cell network (non-real-time) with online parameter configurability for cell state and network size. The maximum throughput of the FPGA ION-network accelerator can reach 2.13 GFLOPS.
Spiking neural networks (SNN) aim to mimic membrane potential dynamics of biological neurons. They have been used widely in neuromorphic applications and neuroscience modeling studies. We design a parallel SNN accelerator for producing large-scale cortical simulation targeting an off-the-shelf Field-Programmable Gate Array (FPGA)-based system. The accelerator parallelizes synaptic processing with run time proportional to the firing rate of the network. Using only one FPGA, this accelerator is estimated to support simulation of 64K neurons 2.5 times real-time, and achieves a spike delivery rate which is at least 1.4 times faster than a recent GPU accelerator with a benchmark toroidal network.
The inferior olive (IO) forms one of the major gateways for information that travels to the cerebellar cortex. Olivary neurons process sensory and motor signals that are subsequently relayed to Purkinje cells. The intrinsic subthreshold membrane potential oscillations of the olivary neurons are thought to be important for gating this flow of information. In vitro studies have revealed that the phase of the subthreshold oscillation determines the size of the olivary burst and may gate the information flow or encode the temporal state of the olivary network. Here, we investigated whether the same phenomenon occurred in murine olivary cells in an intact olivocerebellar system using the in vivo whole-cell recording technique. Our in vivo findings revealed that the number of wavelets within the olivary burst did not encode the timing of the spike relative to the phase of the oscillation but was related to the amplitude of the oscillation. Manipulating the oscillation amplitude by applying harmaline confirmed the inverse relationship between the amplitude of oscillation and the number of wavelets within the olivary burst. Furthermore, we demonstrated that electrotonic coupling between olivary neurons affects this modulation of the olivary burst size. Based on these results, we suggest that the olivary burst size might reflect the "expectancy" of a spike to occur rather than the spike timing, and that this process requires the presence of gap junction coupling.
Neurons are generally considered to communicate information by increasing or decreasing their firing rate. However, in principle, they could in addition convey messages by using specific spatiotemporal patterns of spiking activities and silent intervals. Here, we review expanding lines of evidence that such spatiotemporal coding occurs in the cerebellum, and that the olivocerebellar system is optimally designed to generate and employ precise patterns of complex spikes and simple spikes during the acquisition and consolidation of motor skills. These spatiotemporal patterns may complement rate coding, thus enabling precise control of motor and cognitive processing at a high spatiotemporal resolution by fine-tuning sensorimotor integration and coordination.
Bluehive is a custom 64-FPGA machine targeted at scientific simulations with demanding communication requirements. Bluehive is designed to be extensible, with a reconfigurable communication topology suited to algorithms with demanding high-bandwidth and low-latency communication, something which is unattainable with commodity GPGPUs and CPUs. We demonstrate that a spiking-neuron algorithm can be efficiently mapped to Bluehive using Bluespec SystemVerilog by taking a communication-centric approach. This contrasts with many FPGA-based neural systems which are very focused on parallel computation, resulting in inefficient use of FPGA resources. Our design allows 64k neurons with 64M synapses per FPGA and is scalable to a large number of FPGAs.
This paper presents a high-performance and biophysically accurate neuroprocessor architecture based on floating-point arithmetic and compartmental modeling. It aims to overcome the limitations of traditional hardware neuron models that simplify the required arithmetic using fixed-point representations, which can result in arbitrary loss of precision due to rounding errors and data truncation. On the other hand, a neuroprocessor based on a floating-point bio-inspired model, such as the one presented in this work, is able to capture additional cell properties and accurately mimic cellular behaviors required in many neuroscience experiments. The architecture is prototyped in reconfigurable logic, obtaining a flexible and adaptable cell and network structure together with real-time performance by using the available floating-point hardware resources in parallel. The paper also demonstrates model scalability by combining the basic processor components that describe the soma, dendrite and synapse of organic cells to form more complex neuron structures.
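The fixed-point precision loss that motivates the floating-point design above can be demonstrated with a one-line quantizer. This is a generic illustration of fixed-point rounding, not the specific number format of the paper:

```python
def to_fixed(x, frac_bits=8):
    """Quantize x onto a fixed-point grid with `frac_bits` fractional bits,
    as a fixed-point datapath would store it. Worst-case error is half an
    LSB, i.e. 2**-(frac_bits + 1)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale
```

With 8 fractional bits, a membrane potential like -65.4321 mV is stored with millivolt-scale error per operation, and such errors accumulate over the millions of steps of a long simulation; doubles keep the per-operation error near 1e-16 relative.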
Inferior olivary (IO) neurons display spontaneous oscillatory activity, yet the importance of these oscillations for shaping the responses of this system to its afferents is uncertain. We used multiple electrode recording of crus 2a Purkinje cell complex spikes (CSs) in ketamine-xylazine-anesthetized rats to investigate olivocerebellar responses to activation of motor cortico-olivary pathways. Trains of electrical stimuli were applied to the motor cortex at frequencies between 4 and 30 Hz. Various frequency-response curves were observed, with the most common types being unimodal with a maximum at 9.5 +/- 2.3 Hz and bimodal with peaks at 8.9 +/- 1.0 and 15.1 +/- 1.3 Hz. To determine whether IO oscillatory properties underlie the resonance peaks in the frequency-response curves, apamin and charybdotoxin were injected into the IO. These toxins, which weaken and enhance spontaneous IO oscillations, respectively, had corresponding effects on the sharpness of resonance peaks. Next, the variation of CS entrainment patterns with frequency was investigated to characterize the nature of the IO oscillator. Low-frequency (4 Hz) stimulation was relatively ineffective in entraining CS activity. Between 4 and 30 Hz, two predominant entrainment patterns emerged. For low-frequency (4-6 Hz) and high-frequency (17-30 Hz) ranges, a 1:2 entrainment dominated, whereas in the intermediate range (6-17 Hz), 1:1 entrainment was most prevalent. These results indicate that IO neurons respond as nonlinear oscillators to afferent signals.
We discuss the biological plausibility and computational efficiency of some of the most useful models of spiking and bursting neurons. We compare their applicability to large-scale simulations of cortical neural networks.
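A representative of the efficient spiking-neuron models surveyed there is Izhikevich's two-variable model, whose published dynamics are dv/dt = 0.04v^2 + 5v + 140 - u + I and du/dt = a(bv - u), with a reset (v <- c, u <- u + d) when v reaches 30 mV. A minimal forward-Euler sketch (the step ordering and the choice of regular-spiking defaults are mine):

```python
def izhikevich_step(v, u, i_in, a=0.02, b=0.2, c=-65.0, d=8.0, dt=1.0):
    """One Euler step of Izhikevich's two-variable spiking model, with
    regular-spiking default parameters. Returns (v, u, fired)."""
    v_new = v + dt * (0.04 * v * v + 5.0 * v + 140.0 - u + i_in)
    u_new = u + dt * a * (b * v - u)
    if v_new >= 30.0:               # spike: reset v, bump recovery variable
        return c, u_new + d, True
    return v_new, u_new, False
```

Two coupled scalar updates per neuron per step is what makes this class of model so much cheaper than conductance-based HH models, at the cost of biophysical interpretability, which is exactly the trade-off the comparison examines.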