Synergy: A HW/SW Framework for High
Throughput CNNs on Embedded
Heterogeneous SoC
Guanwen Zhong∗, Akshat Dubey†, Tan Cheng‡, and Tulika Mitra§
School of Computing, National University of Singapore
Abstract
Convolutional Neural Networks (CNNs) have been widely deployed
in diverse application domains. There has been significant progress in
accelerating both their training and inference using high-performance
GPUs, FPGAs, and custom ASICs for datacenter-scale environments.
The recent proliferation of mobile and IoT devices has necessitated
real-time, energy-efficient deep neural network inference on embedded-
class, resource-constrained platforms. In this context, we present Syn-
ergy, an automated, hardware-software co-designed, pipelined, high-
throughput CNN inference framework on embedded heterogeneous
system-on-chip (SoC) architectures (Xilinx Zynq). Synergy leverages,
through multi-threading, all the available on-chip resources, which include
the dual-core ARM processor along with the FPGA and the
NEON SIMD engines as accelerators. Moreover, Synergy provides
a unified abstraction of the heterogeneous accelerators (FPGA and
NEON) and can adapt to different network configurations at runtime
without changing the underlying hardware accelerator architecture by
balancing workload across accelerators through work-stealing. Syn-
ergy achieves 7.3X speedup, averaged across seven CNN models, over
a well-optimized software-only solution. Synergy demonstrates sub-
stantially better throughput and energy-efficiency compared to the
contemporary CNN implementations on the same SoC architecture.
∗zhguanwen@gmail.com
†akshatdubey@nus.edu.sg
‡tancheng@comp.nus.edu.sg
§tulika@comp.nus.edu.sg
arXiv:1804.00706v1 [cs.DC] 28 Mar 2018
1 Introduction
Convolutional Neural Networks (CNNs) are a popular class of deep learning
methods with a wide range of applications, including computer vision,
image/video processing, natural language processing, and others. A typical
CNN consists of multiple layers. Given an application, such as image classi-
fication, the network is first trained with the training dataset. The trained
network is then deployed for inference, i.e., classification of a new image.
Both the training and the inference are compute- and memory-intensive, but
also exhibit massive intrinsic parallelism. Thus, there exist numerous efforts
to improve the performance and the energy-efficiency of CNN implementa-
tions through architectures and computing substrates that support extensive
parallelism, such as GPUs, FPGAs, or even ASIC accelerators. This line of
research has primarily focused on the high-performance computing platforms
in datacenters or clouds.
The proliferation of the mobile devices and the recent emergence of the
IoT (Internet of Things) have transformed the computing landscape. There
is a compelling need to realise real-time, energy-efficient CNN inference on
resource-constrained mobile and IoT edge devices. However, an efficient im-
plementation of CNN-based inference on embedded platforms remains chal-
lenging given the resource limitations. In this context, we present Synergy, an
automated, transparent, pipelined, high-throughput, hardware-software
co-designed CNN inference framework on embedded heterogeneous SoC architectures.
We prototype Synergy on the Xilinx Zynq XC7Z020 device,
leveraging all its available on-chip compute resources, namely the dual-core
ARM processor with NEON SIMD engines and the FPGA. Synergy is a com-
plete system-level solution including a multi-threaded software component,
multi-threaded FPGA and NEON accelerators, an interface between hard-
ware and software components, support for dynamic workload balancing,
as well as an architecture generator for customized solutions (if required).
Figure 1 depicts the Synergy framework mapping a CNN model on a het-
erogeneous SoC. Synergy distinguishes itself from the state-of-the-art along
multiple dimensions.
Heterogeneous HW/SW Acceleration: Synergy leverages all the
compute resources available on a heterogeneous SoC for maximum perfor-
mance. The convolutional (referred to as CONV hereafter) layers are the
most compute-intensive component of the CNN consuming more than 90%
of the execution time [24]. All contemporary CNN designs [6][20][21] on
the Xilinx Zynq platform offload the CONV layers to the FPGA. We ob-
serve that the NEON SIMD (Single-Instruction Multiple-Data) engines in
the ARM cores are quite effective in accelerating the CONV layers as well.
Therefore, harnessing the compute power of the FPGA in conjunction with
the NEON engines can reduce the execution latency of the CONV layers
significantly. Embracing the heterogeneity (hardware accelerators on FPGA
and software accelerators on NEON) for a single computational kernel is
a difficult proposition. Synergy effectively transforms the computation of a
CONV layer into a tiled matrix multiplication operation, where the tiles can
be processed independently. Synergy then seamlessly feeds the tiles to both
the hardware and the software accelerators to exploit the parallelism within
CONV layers. Synergy improves the overall latency and throughput by 12%
and 15% respectively, averaged across multiple CNN models, using NEON
and FPGA compared to FPGA-only solutions.
Figure 1: Synergy: Mapping CNNs on Heterogeneous SoC
(Diagram: a network configuration and the weights feed the multithreaded CNN
software, with CONV, POOL and FC layers, on the multi-core processor, which
has two CPUs with NEON engines; a hardware configuration feeds the Hardware
Architecture Generator; the FPGA hosts PEs grouped into Cluster 0, Cluster 1
and Cluster 2; a software-hardware interface connects the two sides.)
Transparent Consolidation of Accelerators: Most contemporary
FPGA-based CNN frameworks [3][6][13][17][20][21][23][24] rely heavily on
customizing the CONV layer accelerators for each individual network to
minimize latency. The configuration of a CNN (number and type of layers,
nodes, connections, etc.) is dependent on the application. Given a specific
CNN model, existing approaches perform an automated (or manual) design
space exploration (DSE) to identify the optimal accelerator architectures for
the CONV layers of that network on the target FPGA device. This approach
has two drawbacks. First, the application developer needs to be involved in the
DSE and High-Level Synthesis (HLS) to map the given CNN model on the
FPGA. Even if the DSE and HLS steps can be fully automated, there are
still some quirks [16] that make this process quite challenging for an applica-
tion developer with limited experience in FPGAs. Second, in an embedded
FPGA device with strict resource constraints, a single accelerator design is
used by all the CONV layers of the network in a time-multiplexed fashion
even though the different layers have diverse compute requirements. In other
words, the single accelerator is a compromise to offer the best average per-
formance across all the CONV layers of the network, but it is not ideal for
any particular CONV layer [16]. Moreover, this single CONV accelerator is
still custom-designed for each network through extensive DSE and HLS.
In contrast, Synergy accelerators (FPGA, NEON) are network-agnostic.
A fixed set of accelerators is used irrespective of the network and layer as
the CONV layer computation is transformed into tiled matrix multiplica-
tions. Using fine-grained tiled matrix multiplication operations as funda-
mental primitives (as opposed to complete CONV layer) in conjunction with
a work-stealing software scheduler that distributes these tiles to the different
accelerators and balances the workload, Synergy achieves comparable perfor-
mance to the customized network-specific implementations returned through
DSE. Thus, Synergy can bypass the DSE and HLS for each individual CNN
model and provide a completely out-of-the-box solution.
High-Throughput HW/SW Multi-Threaded Pipeline: The transparent
consolidation of heterogeneous HW/SW accelerators in Synergy provides
a powerful abstraction layer for any CNN implementation. The abundance
of sensors on mobile and IoT devices capturing continuous data streams
demands in-situ real-time inference (e.g., continuous object detection in
image/video stream [2]). Here throughput (i.e., frames per second) is the
defining metric as opposed to minimizing the latency for each individual frame
in isolation. Synergy employs a HW/SW multi-threaded pipelined design of
the different layers of the network that allows consecutive frames (images)
from the streaming input to be processed concurrently exploiting inter-frame
parallelism and improving throughput. However, in this pipelined design,
CONV layers from different frames may need to be processed simultaneously
on the FPGA. This inter-frame parallelism is easy to support in Synergy as
the different matrix multiplications from the different CONV layers simply
generate matrix multiplication tiles and the tiles from different layers get dis-
tributed in a transparent fashion to the set of heterogeneous accelerators to be
executed in parallel. Synergy achieves 39.5 – 136.4 frames/second throughput
and consumes 14.4 – 55.8 mJ/frame energy depending on the CNN model.
This is substantially better than the contemporary CNN implementations
on the Xilinx Zynq XC7Z020 device (see Table 4). Moreover, the concur-
rent execution of multiple CONV layers on FPGA accelerators in Synergy
greatly improves their utilization. The low utilization of the accelerators is a
critical issue in full-system implementation of CNNs on resource-constrained
devices where non-compute intensive layers (e.g., pooling, activation and
fully-connected) implemented in software on not so powerful CPUs (e.g.,
ARM) take up significant execution time while the FPGA remains inactive.
The pipelined design with multiple frames in-flight keeps the accelerators
busy in Synergy with 99.8% average utilization.
Automated Customized Accelerator Design: Synergy offers a default
plug-n-play solution for a naïve application developer that avoids the
complex DSE and HLS for each individual CNN model. An experienced
designer, on the other hand, may want to further improve the performance
by designing accelerators that are optimized for the CNN model at hand.
Synergy offers an automated approach to customized acceleration design for
a specific CNN model. The framework accepts the network configuration
(number and type of layers) corresponding to a CNN model as an input. The
designer only needs to provide the architectural parameters for the matrix
multiplication accelerators. The Synergy framework not only can automati-
cally synthesize the accelerators (according to designer-defined parameters)
on the FPGA fabric but also generate the appropriate hardware-software in-
terface to synergistically engage these newly synthesized accelerators. This
is shown as the “Hardware Architecture Generator” in Figure 1. In addition,
the same architecture generator can be used to synthesize the accelerators
and the HW/SW interface for a new SoC device. Thus, Synergy provides
a complete push-button solution to CNN acceleration on a heterogeneous
platform.
2 Related Works
The state-of-the-art FPGA-based CNN works are shown in Table 1. To the
best of our knowledge, there is no work focusing on heterogeneous HW/SW
acceleration (with CPUs, NEONs and FPGA) for CNNs. We classify the ex-
isting works into two categories: network-dependent and network-independent
FPGA-based CNN frameworks.
Network-dependent Frameworks generally require designers to ex-
plore different configurations to generate a hardware architecture for a spe-
cific CNN network manually or with the help of scripts provided, and per-
form synthesis (which normally takes half to one hour) to generate the
bitstream. Given a new network, designers need to redo the above steps,
which is time-consuming. This approach is well-explored and can produce
extremely high-performance CNN accelerators, but it sacrifices
flexibility across different networks. Almost all the existing FPGA-based CNN
works [3][6][8][13][16][17][18][19][20][21][23][24][25] use the network-dependent
approach. [16][17][24][25] require designers to manually explore architectures
for different networks, while [3][18][19][20][21][23] propose automated
toolchains for mapping CNNs on FPGAs. [3][6][13][17][20][21][23][24] mainly
focus on exploiting intra-frame parallelism within layers and execute layers
in a CNN in sequence, ignoring inter-frame parallelism across layers. [18][25]
map all layers onto FPGAs and enable hardware pipelining to exploit the
inter-frame parallelism. [18] proposed an automated framework to accelerate
binarized CNNs. Their work maps all layers in a binarized CNN on FPGA
and enables hardware pipelining. [25] proposes an approach by mapping
convolutional, normalization, pooling, activation and fully-connected layers
onto multiple FPGA devices (1 Xilinx ZC706 + 6 Xilinx VC709 boards).
Each device is in charge of a specific one or more CNN layers and devices are
connected in a ring network. The pipelining flow is controlled by dual-core
ARM processors on the Xilinx Zynq board. However, the cost of the deeply
pipelined FPGA cluster is too high as it requires multiple high-end FPGA
devices and the setup is difficult. Different from [18][25], [8] starts with
multi-threaded CNN inference codes and converts all its layers into FPGA
accelerators. However, the workload of different layers in a CNN could be
imbalanced, which leads to low accelerator utilization, wasting the precious
FPGA resources. [16] statically splits a single large processing engine (PE)
used to accelerate convolutional layers into multiple small PEs. Their ap-
proach can allow multiple layers running simultaneously with different image
frames. However, the evaluation of their work is based on simulation and the
performance (execution cycles) of PEs is obtained by Vivado HLS. The per-
formance number is not accurate as it does not consider the runtime overhead
of the real platform.
Table 1: Current State-of-the-art vs. Synergy
Reference                     Automated  Inter-frame  Self-balancing  HW Reuse*  Network Agnostic  On-board Evaluation
[24] [FPGA’15]                ✗          ✗            ✗               ✗          ✗                 ✓
[13] [FPGA’16]                ✗          ✗            ✗               ✓          ✗                 ✓
[17] [FPGA’16]                ✗          ✗            ✗               ✗          ✗                 ✓
[6] [CASES’16]                ✗          ✗            ✗               ✓          ✗                 ✓
[21] [DAC’16]                 ✓          ✗            ✗               ✗          ✗                 ✓
[23] [ICCAD’16]               ✓          ✗            ✗               ✗          ✗                 ✓
[19] [FCCM’16], [20] [FPGA’17] ✓         ✗            ✗               ✗          ✗                 ✓
[18] [FPGA’17]                ✓          ✓            ✗               ✗          ✗                 ✓
[3] [FCCM’17]                 ✓          ✗            ✗               ✓          ✗                 ✓
[25] [ISLPED’16]              ✗          ✓            ✗               ✗          ✗                 ✓
[8] [SOCC’17]                 ✗          ✓            ✗               ✗          ✗                 ✓
[16] [ISCA’17]                ✗          ✓            ✗               ✓          ✗                 ✗
[4] [TCAD’17]                 ✓          ✗            ✗               ✓          ✓                 ✓
Proposed Synergy              ✓          ✓            ✓               ✓          ✓                 ✓
* HW Reuse: different CONV layers and FC layers can share the same FPGA accelerators
Network-independent Frameworks leverage a fixed optimized hardware
architecture for various CNN networks. To adapt to different networks
and achieve good hardware efficiency, this approach relies on either static
(compiler) or dynamic (runtime scheduler) techniques. The key advantage
of this approach is that designers can easily switch between networks at
runtime without going through the time-consuming synthesis step. [4] belongs
to this category. Their approach relies on a compiler developed to statically
generate a set of instructions (describing the process of CNN execution) that
execute on the fixed hardware architecture. Layers are executed in sequence
in their work. Moreover, as [4] includes data quantization to reduce memory
requirement, their approach can support large networks on embedded FPGA
devices. However, their approach cannot run multiple layers concurrently
with different input frames, which might result in low accelerator
utilization.
Synergy also belongs to the network-independent category. More specifically,
we propose a hardware abstraction to unify various computing elements (NEON
cores and FPGA) within an FPGA-based SoC architecture. Thus, Synergy
can leverage all computing elements (multiple ARM cores, its NEON cores
and FPGA) to accelerate CNNs via HW/SW multi-threading, unleashing the
true power of heterogeneity. Different from [4], Synergy adapts to various
networks by leveraging a work-stealing scheduler (Section 3.1.3) in software to
automatically balance the workload of accelerators at runtime without chang-
ing hardware or software implementations. Moreover, Synergy provides an
automated toolchain to allow designers to explore various accelerator archi-
tectures or migrate designs to other embedded FPGA devices.
3 The Synergy Framework
Synergy, as shown in Figure 1, is an automated framework to map CNN
models onto embedded FPGA-based heterogeneous SoC platforms. Synergy
targets the CNN inference phase and successfully unleashes the power of
heterogeneity of the SoC architectures by leveraging all its compute elements
(CPUs, NEON engines, and FPGA).
A CNN model contains multiple layers such as convolutional, normaliza-
tion, pooling, activation and fully connected layers. The input of a layer is the
output of the previous layer. When input frames stream into the CNN, the
layers can process different frames concurrently. This inter-frame parallelism
can be exploited to improve throughput.
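As a rough illustration of why inter-frame parallelism helps, consider an idealized model (not a measurement from Synergy) in which the network has $L$ layers, layer $i$ needs time $t_i$ per frame, and every layer has its own compute resource. The symbols $L$ and $t_i$ and the numbers below are hypothetical.

```latex
\[
\text{Throughput}_{\text{sequential}} = \frac{1}{\sum_{i=1}^{L} t_i},
\qquad
\text{Throughput}_{\text{pipelined}} = \frac{1}{\max_{1 \le i \le L} t_i}
\]
% Hypothetical example with three layers: t = (10 ms, 6 ms, 4 ms)
% sequential: 1 / (20 ms) = 50 frames/s
% pipelined:  1 / (10 ms) = 100 frames/s (bounded by the slowest layer)
```

In practice the layers share CPU cores and accelerators, so the gain is smaller than this bound; the bound does show why the slowest stage deserves the most acceleration.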
Synergy uses the FPGA logic and the NEON engines to accelerate the
most compute-intensive layers (CONV) in a CNN, while the CPU cores work
on the other layers (such as pooling, activation and fully-connected layers)
and preprocessing functions (e.g., normalization, scaling and data layout
transformation). As shown in Figure 1, a designer can instantiate multiple
processing engines (referred to as PE hereafter) on the FPGA to accelerate
the CONV layers. The computation in a CONV layer is transformed into
a set of independent tiled matrix-multiplication operations, called jobs
(see Section 3.1.1). These jobs are executed by the FPGA and the
NEON accelerators in parallel.
To improve the inference throughput and accelerator utilization, Synergy
supports a HW/SW multi-threaded pipeline where the CPU cores and
the accelerators work on different layers of different frames concurrently.
Therefore, we extend the traditional single-threaded CNN framework with
multi-threaded support. Specifically, the workload in each layer is handled
by the corresponding thread and the communication between layers is
performed through a mailbox (a synchronized first-in-first-out buffer) accessible
by the threads. Multiple threads collaborate with each other in a producer-
consumer fashion constructing the full network.
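The producer-consumer structure described above can be sketched with POSIX threads. This is a minimal illustration and not the ReconOS or Synergy implementation: the names `mailbox_t`, `mailbox_put`, `mailbox_get`, `layer_thread` and `run_pipeline`, the capacity `MAILBOX_CAP`, and the "add one" stand-in for a layer's computation are all assumptions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAILBOX_CAP 4 /* hypothetical capacity */

/* A mailbox: a bounded FIFO protected by a mutex and two condition variables. */
typedef struct {
    void *slot[MAILBOX_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} mailbox_t;

static void mailbox_init(mailbox_t *mb) {
    mb->head = mb->tail = mb->count = 0;
    pthread_mutex_init(&mb->lock, NULL);
    pthread_cond_init(&mb->not_empty, NULL);
    pthread_cond_init(&mb->not_full, NULL);
}

/* Producer side: blocks while the mailbox is full. */
static void mailbox_put(mailbox_t *mb, void *msg) {
    pthread_mutex_lock(&mb->lock);
    while (mb->count == MAILBOX_CAP)
        pthread_cond_wait(&mb->not_full, &mb->lock);
    mb->slot[mb->tail] = msg;
    mb->tail = (mb->tail + 1) % MAILBOX_CAP;
    mb->count++;
    pthread_cond_signal(&mb->not_empty);
    pthread_mutex_unlock(&mb->lock);
}

/* Consumer side: blocks while the mailbox is empty. */
static void *mailbox_get(mailbox_t *mb) {
    pthread_mutex_lock(&mb->lock);
    while (mb->count == 0)
        pthread_cond_wait(&mb->not_empty, &mb->lock);
    void *msg = mb->slot[mb->head];
    mb->head = (mb->head + 1) % MAILBOX_CAP;
    mb->count--;
    pthread_cond_signal(&mb->not_full);
    pthread_mutex_unlock(&mb->lock);
    return msg;
}

typedef struct { mailbox_t *in, *out; int *frames; int n; } stage_t;

/* Streaming input: pushes n frames into the first mailbox. */
static void *source_thread(void *arg) {
    stage_t *s = arg;
    for (int f = 0; f < s->n; ++f) {
        s->frames[f] = f;
        mailbox_put(s->out, &s->frames[f]);
    }
    return NULL;
}

/* One thread per layer: consume a frame, process it, pass it on. */
static void *layer_thread(void *arg) {
    stage_t *s = arg;
    for (int f = 0; f < s->n; ++f) {
        int *frame = mailbox_get(s->in);
        *frame += 1; /* stand-in for the layer's computation */
        mailbox_put(s->out, frame);
    }
    return NULL;
}

/* Streams n frames through two chained layer threads. Returns 1 if every
   frame arrives in order after passing through both layers. */
static int run_pipeline(int n) {
    enum { MAX_FRAMES = 64 };
    assert(n >= 0 && n <= MAX_FRAMES);
    int frames[MAX_FRAMES];
    mailbox_t m0, m1, m2;
    mailbox_init(&m0); mailbox_init(&m1); mailbox_init(&m2);
    stage_t src = { NULL, &m0, frames, n };
    stage_t l0 = { &m0, &m1, NULL, n };
    stage_t l1 = { &m1, &m2, NULL, n };
    pthread_t t[3];
    pthread_create(&t[0], NULL, source_thread, &src);
    pthread_create(&t[1], NULL, layer_thread, &l0);
    pthread_create(&t[2], NULL, layer_thread, &l1);
    int ok = 1;
    for (int f = 0; f < n; ++f) {
        int *out = mailbox_get(&m2);
        if (*out != f + 2) ok = 0;
    }
    for (int i = 0; i < 3; ++i) pthread_join(t[i], NULL);
    return ok;
}
```

Because each mailbox has a single producer and a single consumer, frames leave the pipeline in the order they entered, while the two layer threads may work on different frames at the same time.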
As multi-threading is a software concept, hardware accelerators cannot
directly share the well-established mechanisms in the multi-threading model
such as mutex, semaphore, and mailbox. To abstract away the hardware
accelerators as hardware threads and extend the operating system to support
HW/SW threads, we adapt ReconOS [10], an open-source operating sys-
tem for reconfigurable computing. ReconOS provides the HW/SW multi-
threading technique that we build upon to accelerate CNN applications on
FPGA-based heterogeneous SoC platforms. Each hardware accelerator or
PE is represented by a delegate thread in the software space that behaves
just like traditional software threads as shown in Figure 2 and explained in
detail in Section 3.1.2.
The accelerators (PEs and NEONs) can be grouped into multiple clus-
ters so that each CONV layer can have its own private cluster. However,
Synergy accelerators are not customized given a specific CNN model. Thus
the generic multi-cluster configuration may not be optimal for each network
and may lead to imbalance in execution time of the different CONV lay-
ers. Synergy employs work-stealing (detailed in Section 3.1.3), a dynamic
workload scheduling technique, to deal with the workload imbalance among
the clusters. The jobs (independent tiled matrix-multiplication operations)
from the different CONV layers are distributed to the different accelerator
clusters. An idle cluster steals workload from the other busy clusters and
thereby achieves workload balance across clusters and maximizes through-
put. Within a cluster, the jobs are dispatched to the available accelerators
(NEONs and FPGA-based PEs) in a round-robin fashion.
The Synergy framework provides a default architecture of the FPGA-
based PEs and their cluster configuration that has been designed to provide
quality performance across a range of CNN models. These clusters and their
constituent PEs are pre-synthesized in Synergy corresponding to each unique
FPGA device/platform and do not need to be generated for each individ-
ual CNN model. In other words, the FPGA bitstream remains unchanged
across different CNN models and the FPGA device need not be reconfigured
corresponding to each CNN model unless desired by the application devel-
oper. Given a CNN model corresponding to an application, Synergy takes
in a network configuration file that defines the architecture of the CNN as
input. The CPU-based host code used to control the hardware accelerators,
Linux kernels and HW/SW multi-threaded library are written as templates.
With the network configuration file and the software templates, Synergy au-
tomatically generates a software/hardware multi-threaded CNN application
in C.
If an advanced application developer wants to customize the PE and
cluster design for a specific CNN model, the Synergy framework offers a
hardware architecture generator (Section 3.3). In this case, Synergy takes in
a hardware configuration file as input and creates the hardware architecture
by instantiating the HLS-RTL accelerator templates in C corresponding to
the tiled matrix-multiplication operations. These FPGA-based accelerators
for the CONV layers are generated by a commercial HLS tool from the C
templates, while accelerator interfaces and memory subsystem are created by
RTL templates. The generation of both the software and the hardware
components is completely automated.
Figure 2: Overview of the Software Architecture
(Diagram: the streaming input flows through the CONV-0, CONV-1 and CONV-2
threads, the pooling threads and the FC thread to the streaming output; each
CONV thread has a courier that pushes jobs through the hardware abstraction
into Job Queue 0 or Job Queue 1 of Cluster-0 and Cluster-1; Delegate Threads
0 to 4 front the FPGA PEs; the clusters also contain the two NEONs; a Thief
sits between the clusters.)
3.1 Software Architecture
Figure 2 shows the software component in Synergy. We explain how the
software implements the CONV layers, the other layers, and the
preprocessing functions.
3.1.1 CONV Layers
CONV layers are the most compute-intensive components in a CNN, occu-
pying more than 90% of the execution time during inference [24]. They take
in input feature maps and convolve them with convolutional filters to ob-
tain output feature maps. As we target the low-end embedded platforms,
the FPGA resources are quite limited and we cannot generate a dedicated
accelerator for each convolutional layer in a CNN model like [13][24][23][25].
Therefore, in our implementation, we need to share the hardware accelerators
among the convolutional layers. We transform the convolution operations
into matrix multiplication (MM) by flattening and rearranging the input
features [3][17]. A data layout transformation is required to convert the 3D
array in CONV layers into a 2D array; this is known as the image-to-column
(im2col) function in many popular open-source CNN frameworks such as
Caffe [7] and Darknet [14]. Details related to the data layout transformation
can be found in [7][17]. Synergy leverages both the FPGA-based PEs and
the NEON engines to accelerate the MM computations.
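The convolution-to-MM transformation can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used by Synergy, Caffe or Darknet: it assumes stride 1, no padding, a square K x K filter and a channel-major (C x H x W) layout, and the helper names `im2col`, `matmul`, `conv_direct` and `im2col_matches_direct` are hypothetical.

```c
#include <assert.h>

/* Unfold a C x H x W input into a (C*K*K) x (OH*OW) matrix
   (stride 1, no padding), so that convolution becomes one MM. */
static void im2col(const float *in, int C, int H, int W, int K, float *col) {
    assert(K >= 1 && K <= H && K <= W);
    int OH = H - K + 1, OW = W - K + 1;
    for (int c = 0; c < C; ++c)
        for (int kh = 0; kh < K; ++kh)
            for (int kw = 0; kw < K; ++kw) {
                int row = (c * K + kh) * K + kw;
                for (int oh = 0; oh < OH; ++oh)
                    for (int ow = 0; ow < OW; ++ow)
                        col[row * OH * OW + oh * OW + ow] =
                            in[(c * H + oh + kh) * W + ow + kw];
            }
}

/* out (F x P) = w (F x R) * col (R x P), all row-major. */
static void matmul(const float *w, const float *col, float *out,
                   int F, int R, int P) {
    for (int f = 0; f < F; ++f)
        for (int p = 0; p < P; ++p) {
            float acc = 0.0f;
            for (int r = 0; r < R; ++r)
                acc += w[f * R + r] * col[r * P + p];
            out[f * P + p] = acc;
        }
}

/* Reference: direct convolution with F filters of size C x K x K. */
static void conv_direct(const float *in, const float *w, float *out,
                        int C, int H, int W, int K, int F) {
    int OH = H - K + 1, OW = W - K + 1;
    for (int f = 0; f < F; ++f)
        for (int oh = 0; oh < OH; ++oh)
            for (int ow = 0; ow < OW; ++ow) {
                float acc = 0.0f;
                for (int c = 0; c < C; ++c)
                    for (int kh = 0; kh < K; ++kh)
                        for (int kw = 0; kw < K; ++kw)
                            acc += w[((f * C + c) * K + kh) * K + kw] *
                                   in[(c * H + oh + kh) * W + ow + kw];
                out[(f * OH + oh) * OW + ow] = acc;
            }
}

/* Returns 1 if im2col followed by MM reproduces the direct convolution
   on a small example with integer-valued data. */
static int im2col_matches_direct(void) {
    enum { C = 2, H = 4, W = 4, K = 3, F = 2,
           OH = H - K + 1, OW = W - K + 1, R = C * K * K, P = OH * OW };
    float in[C * H * W], w[F * R], col[R * P], a[F * P], b[F * P];
    for (int i = 0; i < C * H * W; ++i) in[i] = (float)(i % 7 - 3);
    for (int i = 0; i < F * R; ++i) w[i] = (float)(i % 5 - 2);
    im2col(in, C, H, W, K, col);
    matmul(w, col, a, F, R, P);
    conv_direct(in, w, b, C, H, W, K, F);
    for (int i = 0; i < F * P; ++i)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

The unfolded matrix repeats each input pixel up to K*K times, which is the memory cost paid for turning every CONV layer into the same MM kernel.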
Listing 1: Tiled Matrix Multiplication
 1 /* Tile Size: TS; Loop bounds: N, M, K */
 2 Tile_t1: for (t1 = 0; t1 < floor(N/TS); ++t1) {
 3   Tile_t2: for (t2 = 0; t2 < floor(M/TS); ++t2) {
 4     ... // Initialization
 5     tiled_mm: for (t3 = 0; t3 < floor(K/TS); ++t3) {
 6       // Step 1: Copy data from DDR to local buffer (a, b, c);
 7       data_copy(ddr_a, ddr_b, a, b, offsetA, offsetB);
 8       // Step 2: Kernel Computation
 9       loop1: for (i = 0; i < TS; ++i) {
10         loop2: for (j = 0; j < TS; ++j) {
11           loop3: for (k = 0; k < TS; ++k) {
12             c[i][j] += a[i][k] * b[k][j]; }}}}
13     // Step 3: Write data from local buffer to DDR
14     data_send(c, ddr_c, offsetC);
15 }}
After flattening and rearranging the input features, the input matrices of
the matrix multiplication in convolutional layers are generally too large to be
accommodated on an FPGA platform. Loop Tiling is employed to partition
the iteration space of the loop into smaller tiles so that the working data set of
a tile can be easily accommodated using the available on-chip BRAM storage
in FPGAs. Listing 1 shows the tiled matrix multiplication after Loop Tiling.
The highlighted portion (Lines 5-14) is the key computation of a tile, and we
accelerate this portion with FPGA-based PEs (explained in Section 3.2.1)
and NEON cores.
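A self-contained software version of the tiled loop nest may make the data movement easier to follow. It is a sketch under assumptions, not the PE or NEON kernel: matrices are row-major `float` arrays (A is N x K, B is K x M, C is N x M), all dimensions are multiples of the tile size, the tile size `TS` is set to a small hypothetical value, and `compute_tile` and `tiled_mm` are illustrative names.

```c
#include <assert.h>
#include <string.h>

#define TS 2 /* tile size; small hypothetical value for illustration */

/* Computes output tile (t1, t2) of C = A * B, i.e. the per-tile work of
   Listing 1. A is N x K, B is K x M, C is N x M, all row-major. */
static void compute_tile(const float *A, const float *B, float *C,
                         int M, int K, int t1, int t2) {
    float a[TS][TS], b[TS][TS], c[TS][TS];
    memset(c, 0, sizeof c); /* initialization of the output tile */
    for (int t3 = 0; t3 < K / TS; ++t3) {
        /* Step 1: copy tiles A(t1,t3) and B(t3,t2) into local buffers. */
        for (int i = 0; i < TS; ++i)
            for (int j = 0; j < TS; ++j) {
                a[i][j] = A[(t1 * TS + i) * K + t3 * TS + j];
                b[i][j] = B[(t3 * TS + i) * M + t2 * TS + j];
            }
        /* Step 2: kernel computation on the local buffers. */
        for (int i = 0; i < TS; ++i)
            for (int j = 0; j < TS; ++j)
                for (int k = 0; k < TS; ++k)
                    c[i][j] += a[i][k] * b[k][j];
    }
    /* Step 3: write the finished output tile back. */
    for (int i = 0; i < TS; ++i)
        for (int j = 0; j < TS; ++j)
            C[(t1 * TS + i) * M + t2 * TS + j] = c[i][j];
}

/* Visits every output tile; each call to compute_tile is independent. */
static void tiled_mm(const float *A, const float *B, float *C,
                     int N, int M, int K) {
    assert(N % TS == 0 && M % TS == 0 && K % TS == 0);
    for (int t1 = 0; t1 < N / TS; ++t1)
        for (int t2 = 0; t2 < M / TS; ++t2)
            compute_tile(A, B, C, M, K, t1, t2);
}
```

Since no output tile depends on another, the calls to `compute_tile` can be handed to different accelerators in any order, which is the property the framework relies on.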
Workload Granularity and Computation: Figure 3 shows a tiled
MM example with 2 × 2 tile size. In Synergy, the workload granularity of a
tiled MM computation is called a job, which is defined as the computation
required to output a tile, C(i,j), of an output feature map C. A job is a
structure as shown in Listing 2 containing the base addresses of the arrays
(A, B and C), the input data dimensions (m, n and k), the tile index and
the layer ID, which is used to identify the CONV layer that owns the job.
Each CONV layer generates a set of jobs. In a CONV thread, we implement
a courier function that sends the jobs to the accelerators (PEs and NEONs).
When an accelerator gets a job, it first calculates the memory addresses of
the required tiles of input/output feature maps with the base address, data
dimension and tile index provided by the job, and fetches the tiles from the
external DDR memory to its local storage with the memory controller. After
the computation is completed, the PE stores the output tile back to the DDR.
Figure 3: Job: Workload Granularity of a Tiled MM
(Diagram: matrices A (N × K), B (K × M) and C (N × M) split into tiles. Tile
calculation: $C_{(i,j)} = \sum_{k=1} A_{(i,k)} B_{(k,j)}$. Each of the four
output tiles C(1,1), C(1,2), C(2,1) and C(2,2) is produced by one job, Job 1
to Job 4.)
Listing 2: The Structure of Job
typedef struct {
  /* The base addresses of input and output feature maps */
  DATA_TYPE A_addr; DATA_TYPE B_addr; DATA_TYPE C_addr;
  /* Data dimension of input/output feature maps */
  DATA_TYPE m; DATA_TYPE n; DATA_TYPE k;
  /* Index used to locate the tile */
  DATA_TYPE t1; DATA_TYPE t2;
  DATA_TYPE layer_id; /* To track the CONV layer */
} job_t;
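How a CONV layer might enumerate its jobs, and how an accelerator might turn a job into tile addresses, can be sketched as below. Several points are assumptions rather than details given in the text: `DATA_TYPE` is taken to be a 32-bit unsigned integer, the matrices are row-major `float` arrays with A of size m x k, B of size k x n and C of size m x n, the tile size `TS` is 32, and the helpers `generate_jobs`, `tile_addr`, `job_a_tile`, `job_b_tile` and `job_c_tile` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TS 32               /* tile size: assumed value */
typedef uint32_t DATA_TYPE; /* assumed 32-bit fields */

typedef struct {
    DATA_TYPE A_addr, B_addr, C_addr; /* base addresses of A, B and C */
    DATA_TYPE m, n, k;                /* data dimensions */
    DATA_TYPE t1, t2;                 /* index of the output tile */
    DATA_TYPE layer_id;               /* CONV layer that owns the job */
} job_t;

/* Emits one job per output tile of an m x n result; returns the job count. */
static int generate_jobs(job_t *jobs, int max_jobs,
                         DATA_TYPE A, DATA_TYPE B, DATA_TYPE C,
                         DATA_TYPE m, DATA_TYPE n, DATA_TYPE k,
                         DATA_TYPE layer_id) {
    assert(m % TS == 0 && n % TS == 0 && k % TS == 0);
    int count = 0;
    for (DATA_TYPE t1 = 0; t1 < m / TS; ++t1)
        for (DATA_TYPE t2 = 0; t2 < n / TS; ++t2) {
            assert(count < max_jobs);
            jobs[count++] = (job_t){ A, B, C, m, n, k, t1, t2, layer_id };
        }
    return count;
}

/* Byte address of the first element of tile (row_tile, col_tile) in a
   row-major float matrix that has `cols` columns. */
static DATA_TYPE tile_addr(DATA_TYPE base, DATA_TYPE cols,
                           DATA_TYPE row_tile, DATA_TYPE col_tile) {
    return base + (row_tile * TS * cols + col_tile * TS)
                  * (DATA_TYPE)sizeof(float);
}

/* Tiles an accelerator would fetch or store for step t3 of a job. */
static DATA_TYPE job_a_tile(const job_t *j, DATA_TYPE t3) {
    return tile_addr(j->A_addr, j->k, j->t1, t3);
}
static DATA_TYPE job_b_tile(const job_t *j, DATA_TYPE t3) {
    return tile_addr(j->B_addr, j->n, t3, j->t2);
}
static DATA_TYPE job_c_tile(const job_t *j) {
    return tile_addr(j->C_addr, j->n, j->t1, j->t2);
}
```

Because a job carries only base addresses, dimensions and a tile index, it stays small regardless of the layer size, and the same accelerator can serve jobs from any layer.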
Heterogeneous Accelerators: As we target the Xilinx Zynq SoC, Synergy
uses the FPGA-based PEs and two NEON cores in the ARM A9 processor
as the accelerators. A PE is an FPGA implementation of tiled MM. PEs can
have different optimizations, and thus their performance might differ.
The number of PEs depends on the available resources in the target FPGA
device. We explain the PE design in Section 3.2.1 in more detail. To leverage
the NEON cores, we have implemented the MM kernel in NEON assembly
code. This assembly code is encapsulated in two separate software threads,
one corresponding to each NEON core, creating two NEON accelerators.
Accelerator Clusters: From the software perspective, Synergy groups
the heterogeneous accelerators into clusters. For example, in Figure 2, Cluster-
0 has two NEON cores and two FPGA-based PEs, while Cluster-1 groups
three PEs. Each cluster has a private workload pool, Job Queue, as shown in
Figure 2. A job queue is a synchronous buffer, storing the address of the jobs.
Each CONV layer is assigned to a cluster by default. Different CONV layers
can share the same cluster, for example CONV-0 and CONV-1 are mapped
to Cluster-0 and CONV-2 uses Cluster-1. Mapping of CONV layers and clus-
ters is decided by the number of jobs a CONV layer has. A CONV layer with
less workload will be mapped onto a less powerful cluster and vice-versa. In
addition, a designer can define the number of clusters and the corresponding
accelerator combinations simply with a hardware configuration file as shown
in Figure 8. In this case, the hardware accelerators will be synthesized and
the required hardware-software interface will be automatically generated in
the Synergy framework (see Section 3.3).
The CONV layers assigned to a cluster send their jobs to the Job Queue
and use all the available accelerators in the cluster. Once the cluster de-
tects jobs in the job queue, it dispatches the jobs to the synchronous buffers
attached to each accelerator. Then, the accelerators work on the jobs and
inform the cluster when they finish.
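The dispatch step inside a cluster can be sketched as a single-threaded model that moves job identifiers from the Job Queue to per-accelerator buffers in round-robin order, as stated in the framework overview. The real buffers are synchronous and shared between threads; the types `cluster_t` and `acc_buffer_t`, the functions `cluster_init`, `cluster_push` and `cluster_dispatch`, and the capacities are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_ACC 8  /* hypothetical limit on accelerators per cluster */
#define BUF_CAP 64 /* hypothetical buffer capacity */

/* Buffer attached to one accelerator (PE or NEON). */
typedef struct { int job_ids[BUF_CAP]; int count; } acc_buffer_t;

typedef struct {
    int queue[BUF_CAP]; /* Job Queue: FIFO of job identifiers */
    int head, count;
    acc_buffer_t acc[MAX_ACC];
    int n_acc;
    int next;           /* next accelerator in round-robin order */
} cluster_t;

static void cluster_init(cluster_t *c, int n_acc) {
    assert(n_acc >= 1 && n_acc <= MAX_ACC);
    memset(c, 0, sizeof *c);
    c->n_acc = n_acc;
}

/* A CONV layer's courier appends a job to the cluster's Job Queue. */
static void cluster_push(cluster_t *c, int job_id) {
    assert(c->count < BUF_CAP);
    c->queue[(c->head + c->count) % BUF_CAP] = job_id;
    c->count++;
}

/* Drains the Job Queue, handing jobs to accelerators in round-robin
   order. Returns the number of jobs dispatched. */
static int cluster_dispatch(cluster_t *c) {
    int moved = 0;
    while (c->count > 0) {
        acc_buffer_t *a = &c->acc[c->next];
        assert(a->count < BUF_CAP);
        a->job_ids[a->count++] = c->queue[c->head];
        c->head = (c->head + 1) % BUF_CAP;
        c->count--;
        c->next = (c->next + 1) % c->n_acc;
        ++moved;
    }
    return moved;
}
```

With three accelerators and jobs 0 to 6, the first accelerator receives jobs 0, 3 and 6, the second 1 and 4, and the third 2 and 5.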
3.1.2 Delegate Threads
To abstract away the hardware accelerators, we deploy delegate threads in-
troduced in ReconOS [10]. A delegate thread is a software wrapper for an
FPGA-based accelerator, which can execute operating system (OS) calls on
behalf of the associated accelerator. From the OS perspective, the delegate
threads are software threads and can interact with the traditional software
threads.
In Synergy, a delegate thread is created corresponding to each FPGA-based
PE. Once launched, it initializes the hardware system and sends a start
signal to the associated accelerator via the first-in-first-out (FIFO) control
buffer shown in Figure 5. Then, the delegate thread waits for a request from
the accelerator to execute a job. When an accelerator sends a job request, the
delegate thread obtains the address of the job from its associated cluster,
sends it back to the accelerator, and waits for the accelerator’s acknowledgment.
Upon receiving the address of the job, the accelerator obtains the contents of
a job structure, fetches the tile data of input arrays via the memory controller
and performs the MM calculations. Once it finishes, the accelerator issues
a signal to the delegate thread and acknowledges the completion of the tile
calculation. The delegate thread repeats the above steps until all the jobs
are finished.
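Figure 4: Work Stealing Execution Flow
(Diagram: CONV threads push jobs to Job Queue 0 and Job Queue 1 of Cluster-0
and Cluster-1, which contain the PEs and NEONs; a cluster reports "work done"
to the thief thread, whose manager updates the idle book and activates the
stealer; the stealer steals jobs from a job queue and pushes them to the idle
cluster.)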
3.1.3 Self-balancing: Work Stealing Scheduler
Synergy clusters the FPGA-based PEs and the NEON accelerators into mul-
tiple clusters, so that the threads corresponding to multiple CONV layers
can execute concurrently achieving better throughput. This approach also in-
creases the accelerator utilization. However, as the workload of CONV layers
varies depending on the data dimensions, an improper cluster configuration
may lead to workload imbalance among the clusters. Some clusters might
complete their workload early and stay idle, wasting precious computing re-
sources. Therefore, clusters should be carefully partitioned and statically
mapped to different CONV layers, so that the runtime of each cluster spent
on processing the associated workload is balanced [16]. This can increase the
accelerator utilization and improve the performance. However, finding the
optimal cluster configuration is not easy. It requires profiling the performance
of different cluster combinations for the input data dimensions of the CONV
layers for the specific CNN model and performing a detailed design space
exploration to identify the best cluster configuration for static mapping. Then the
identified clusters and PEs have to be synthesized on the FPGA. However,
this approach is challenging and time-consuming, especially without extensive
FPGA expertise. In Synergy, we introduce a dynamic workload balancing
technique, work-stealing, to bypass this optimization problem.
This self-balancing technique is based on the job granularity and does
not require the best cluster configuration as the idle cluster can steal jobs
from the busy clusters. Synergy enables work stealing by introducing a thief
thread. The thief thread consists of a manager, an idle book and a stealer. The
manager checks the status (idle or busy) of the clusters and activates the
stealer if necessary. The idle book records IDs of the idle clusters, while the
stealer can steal jobs from the victim clusters and push these jobs to the idle
clusters. Figure 4 shows the work-stealing flow. Initially, Synergy dispatches
the jobs of different CONV layers to job queues of different clusters. Due
to the workload imbalance of the CONV layers, some clusters may finish
the assigned workload earlier and remain idle. Let us assume that Cluster-0
finishes first and Cluster-1 is still busy. Cluster-0 then notifies the manager
of the thief thread, as its work has been done. The manager records Cluster-0
in the idle book and activates the stealer. After activation, the stealer tries
to steal jobs from the clusters that are not in the idle book. Once it succeeds,
the stealer dispatches the jobs to the idle clusters and the manager removes
the clusters from the idle book. In this manner, Synergy can fully utilize the
accelerator resources and achieve load balancing. Different from the static
mapping technique, the work-stealing approach does not rely on any specific
cluster configuration to achieve workload balance. It removes the need to
search for the best cluster configuration and requires no effort from the
designer.
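
The manager, idle book and stealer roles described above can be sketched in C as follows. This is only an illustrative, single-threaded model under stated assumptions: the names (thief_t, notify_idle, steal_for), the queue capacity, the choice of the busiest cluster as victim, and the policy of stealing half of the victim's jobs are all hypothetical and are not taken from the Synergy implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_JOBS 64      /* assumed queue capacity */
#define NUM_CLUSTERS 2   /* two clusters, as in Figure 4 */

typedef struct {
    int jobs[MAX_JOBS];  /* job IDs (tiled-MM jobs) */
    int count;           /* number of pending jobs */
} job_queue_t;

typedef struct {
    job_queue_t queue[NUM_CLUSTERS];
    bool idle_book[NUM_CLUSTERS];  /* true once a cluster reported "work done" */
} thief_t;

/* Manager: record a cluster that has finished its assigned jobs. */
void notify_idle(thief_t *t, int cluster) {
    assert(cluster >= 0 && cluster < NUM_CLUSTERS);
    t->idle_book[cluster] = true;
}

/* Stealer: take jobs from a cluster that is not in the idle book and push
 * them to the idle cluster. Assumed policy: the busiest cluster is the
 * victim and half of its pending jobs are moved. Returns the number of
 * jobs stolen. */
int steal_for(thief_t *t, int idle) {
    int victim = -1;
    for (int c = 0; c < NUM_CLUSTERS; c++) {
        if (c == idle || t->idle_book[c]) continue;
        if (victim < 0 || t->queue[c].count > t->queue[victim].count)
            victim = c;
    }
    if (victim < 0 || t->queue[victim].count < 2) return 0;

    job_queue_t *v = &t->queue[victim], *d = &t->queue[idle];
    int n = v->count / 2;
    assert(d->count + n <= MAX_JOBS);
    for (int i = 0; i < n; i++)
        d->jobs[d->count++] = v->jobs[--v->count];  /* steal from the tail */

    t->idle_book[idle] = false;  /* manager removes the cluster from the book */
    return n;
}
```

In a real system the queues would be shared between threads and would need locking or lock-free access; the sketch leaves that out to keep the control flow visible.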
3.1.4 Other Layers and Preprocessing functions
A CNN contains many other layers, which are executed by the ARM CPU
cores in the Synergy framework. Fully connected (FC) Layer: This layer is
usually used at the end of a network to compute the class scores, resulting in
as many outputs as there are classes. Pooling layer: This layer progressively
reduces the spatial size of the output from the previous layer to reduce the
number of parameters and the amount of computation in the network.
Activation layer: This layer comprises a non-linear function that maps each
output of the previous layer one-to-one to an activation value. Synergy
supports all kinds of activation functions.
A CNN also contains a few preprocessing functions within the layers
such as im2col and normalization that take non-negligible time on embedded
CPUs. im2col (mentioned in Section 3.1.1) is used for data layout transfor-
mation. Normalization is used to scale all the input data to values between 0
and 1 during the inference phase. The overheads of these sequential portions
are partially hidden by the HW/SW multi-threaded pipeline in Synergy.
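
To make the two preprocessing steps concrete, the following is a minimal C sketch. It assumes 8-bit input pixels scaled by 1/255, a channel-major (C x H x W) input layout and no padding; the function names normalize_u8 and im2col and these layout choices are illustrative assumptions, not the routines used in Synergy or Darknet.

```c
#include <assert.h>
#include <stddef.h>

/* Scale 8-bit pixel values to the range [0, 1] (assumed input range 0..255). */
void normalize_u8(const unsigned char *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = (float)in[i] / 255.0f;
}

/* im2col for a C x H x W input and a K x K kernel, no padding.
 * The output is a row-major matrix with C*K*K rows and out_h*out_w columns,
 * so that convolution becomes a matrix multiplication with the weights. */
void im2col(const float *in, int C, int H, int W, int K, int stride,
            float *out) {
    assert(K <= H && K <= W && stride > 0);
    int out_h = (H - K) / stride + 1;
    int out_w = (W - K) / stride + 1;
    for (int r = 0; r < C * K * K; r++) {
        int kx = r % K;          /* kernel column */
        int ky = (r / K) % K;    /* kernel row */
        int c  = r / (K * K);    /* input channel */
        for (int oy = 0; oy < out_h; oy++)
            for (int ox = 0; ox < out_w; ox++) {
                int iy = oy * stride + ky;
                int ix = ox * stride + kx;
                out[(r * out_h + oy) * out_w + ox] = in[(c * H + iy) * W + ix];
            }
    }
}
```

Both loops are plain sequential code with no data reuse across frames, which is why the text treats them as sequential portions to be hidden by the pipeline.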
Figure 5: The Hardware Architecture
[Diagram labels: multi-core CPU with SW threads, delegate threads
(DelegateT-0 to DelegateT-3), Synergy Library, user space and kernel space;
Synergy Hardware Architecture (FPGA) with PE-0 to PE-3 in Cluster-0 and
Cluster-1, control FIFOs (if_sw2hw, if_hw2sw), memory FIFOs (if_hw2mem,
if_mem2hw) and the Memory Subsystem (MEM Arbiters, MMUs, Proc Arbiter, Proc,
MEM Controllers); system bus, DDR DRAM, Ethernet and other peripherals
(USB, UART, etc.).]
3.2 Hardware Architecture
Figure 5 shows an example Synergy hardware architecture with four
FPGA-based PEs. The architecture is adapted from ReconOS [10]. In this
architecture, the software communicates with the hardware accelerators via
control FIFOs (if_hw2sw and if_sw2hw). On the software side, a delegate
thread (DelegateT) interacts with other software threads on behalf of the
associated PE. Data transactions of a PE are handled by the Memory
Subsystem via two FIFOs (if_hw2mem and if_mem2hw). In the following
subsections, we discuss the accelerator design, memory subsystem, and the
hardware architecture generator.
3.2.1 Accelerator Design
As mentioned earlier, Synergy processes CONV layers as matrix multiplica-
tion (MM) operations accelerated using NEON cores and FPGA-based PEs.
In this section, we mainly focus on FPGA-based accelerators and discuss
several design challenges. The FPGA-based accelerator for MM is the pro-
cessing engine (PE) shown in Figure 5, which is generated by a commercial
high-level synthesis (HLS) tool, Vivado HLS [22]. HLS is used to convert
algorithms in high-level specification (i.e., C/C++) into hardware languages
(VHDL/Verilog). It provides optimization options, a.k.a. pragmas, such as
loop unrolling, array partitioning and pipelining, to explore diverse hardware
architectures with different area and performance tradeoffs.
As mentioned in Section 3.1.1, due to the large input size of MM in
CONV layers, we deploy Loop Tiling on MM and partition the iteration
space of the loop into smaller tiles so that data size of a tile can be easily
accommodated on available BRAM. Loop Tiling exposes potential parallelism
of MM as different tiles are independent. We exploit the parallelism by
instantiating multiple PEs under FPGA resource budget to process the tiles
simultaneously, while exploring hardware architectures of a PE with HLS
pragmas. Opening up more parallelism per PE limits the number of PEs
that can be accommodated on FPGA due to resource constraints [26].
Listing 3: Pseudo Code for the HLS Template of a PE
 1  ProcessingEngine(if_sw2hw, if_hw2sw,
 2                   if_hw2mem, if_mem2hw) {
 3    /* Simplified pragmas */
 4    #pragma interface ap_fifo port=if_sw2hw
 5    #pragma interface ap_fifo port=if_hw2sw
 6    #pragma interface ap_fifo port=if_hw2mem
 7    #pragma interface ap_fifo port=if_mem2hw
 8    #pragma interface ap_ctrl_none port=return
 9    ...
10    wait_for_start_signal();
11    job_t job;
12    while (1) {
13      uint32 job_address = ask_for_a_job();
14      job = read_job(job_address);
15      parse_job(job, &Aaddr, &Baddr, &Caddr,
16                &m, &n, &k, &t1, &t2, &layerID);
17      tiled_mm(Aaddr, Baddr, Caddr, m, n, k, t1, t2,
18               if_hw2mem, if_mem2hw);
19      send_acknowledgment(layerID); }
20  }
Processing Engine (PE): The pseudo code for the HLS template in
Listing 3 demonstrates the general execution flow of a PE. A PE interacts
with its associated delegate thread in the user space via control FIFOs
(if_hw2sw and if_sw2hw). For data transactions, the PE cooperates with the
Memory Subsystem (Section 3.2.2) through memory FIFOs (if_hw2mem and
if_mem2hw). At the beginning, the PE waits for a start signal issued by
its associated delegate thread. Lines 13-19 in Listing 3 show the logic to
compute a job. The PE first acquires a job by sending a request to the
delegate thread. The actual MM computation is performed in tiled_mm. The
skeleton of tiled_mm is shown in Lines 5-14 of Listing 1. The mm_tile
function can be summarized in the following five steps: (1) it computes the
locations of the required tiles of the input arrays (A and B) in the main
memory; (2) it fetches a tile of data into local memory (a and b); (3) it
performs matrix multiplication and accumulates the partial result in a local
array c; (4) it repeats from Step 1 until it exhausts a row of A and a column
of B; (5) it stores the output data back to the main memory. An
acknowledgment is sent to the delegate thread once the PE finishes the
computation.
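
The five steps can be illustrated with a small software-only tiled matrix multiplication in C. This is a sketch under assumptions, not the code of Listing 1: it uses a tile size of 4 instead of 32 to stay short, assumes row-major matrices whose dimensions are multiples of the tile size, and reads main memory through plain array indexing rather than the memory FIFOs.

```c
#include <assert.h>
#include <string.h>

#define TS 4  /* tile size; the paper uses 32, a small value keeps this short */

/* C (m x n) = A (m x k) * B (k x n), row-major.
 * Assumption: m, n and k are multiples of TS (no border handling here). */
void tiled_mm(const float *A, const float *B, float *C, int m, int n, int k) {
    float a[TS][TS], b[TS][TS], c[TS][TS];  /* local buffers (BRAM on FPGA) */
    assert(m % TS == 0 && n % TS == 0 && k % TS == 0);

    for (int ti = 0; ti < m; ti += TS)
        for (int tj = 0; tj < n; tj += TS) {
            memset(c, 0, sizeof c);
            /* Step 4: walk along a row of A tiles and a column of B tiles. */
            for (int tk = 0; tk < k; tk += TS) {
                /* Steps 1-2: locate the tiles and fetch them locally. */
                for (int i = 0; i < TS; i++)
                    for (int j = 0; j < TS; j++) {
                        a[i][j] = A[(ti + i) * k + (tk + j)];
                        b[i][j] = B[(tk + i) * n + (tj + j)];
                    }
                /* Step 3: multiply and accumulate into the local array c. */
                for (int i = 0; i < TS; i++)
                    for (int j = 0; j < TS; j++)
                        for (int p = 0; p < TS; p++)
                            c[i][j] += a[i][p] * b[p][j];
            }
            /* Step 5: store the finished output tile to main memory. */
            for (int i = 0; i < TS; i++)
                for (int j = 0; j < TS; j++)
                    C[(ti + i) * n + (tj + j)] = c[i][j];
        }
}
```

Each (ti, tj) output tile depends only on its own row of A tiles and column of B tiles, which is what allows different tiles to be handed to different PEs as independent jobs.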
Computation optimizations in mm_tile: Loop pipelining is a crucial
optimization option provided by HLS. As the technique can overlap the
execution of operations from different iterations, it has great impact on
system throughput. Throughput of a loop depends on the initiation interval
(II), which is defined as the number of cycles between consecutive
initiations of the loop. In this work, we apply loop pipelining at loop2 in
Listing 1. With this optimization, the HLS tool merges loop1 and loop2 into a
new loop with a larger loop bound ($newBound = TS \times TS$) and completely
unrolls the innermost loop (loop3). We define $lat_{loop3}$ as the latency of
loop3. Then the latency $lat_{kernel}$ of the nested loop for kernel
computation is calculated as
$lat_{kernel} = (newBound - 1) \times II + lat_{loop3}$.
When $newBound$ is large enough, $lat_{kernel}$ is dominated by II.
As operations inside loop3 in Listing 1 have no data dependence, when
loop3 is completely unrolled, operations in different iterations can ideally
be executed in parallel. However, the parallelism is constrained by the
memory bandwidth. Local buffers (a and b) are implemented with FPGA BRAM
resources. By default, a local buffer has only two read ports and one write
port. Thus, when loop3 is completely unrolled, only two memory read requests
to each buffer (a and b) can be served per cycle, even if TS read requests
are generated. This makes II equal to TS/2 and limits the performance of an
accelerator. To improve II, we can leverage the array partitioning pragma to
split the buffer into multiple banks, where each bank has two read ports and
one write port. With loop pipelining and array partitioning, the accelerator
requires more multiplication and addition units, and thus more compute
resources. Opening up more parallelism per PE limits the number of PEs that
can be accommodated on FPGA due to resource constraints. Given an FPGA
device, the tile size, the settings for HLS pragmas, and the number of PEs
can be decided automatically via design space exploration (DSE) [26].
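
As a rough numerical illustration of the latency model and of the effect of array partitioning, the C sketch below evaluates II and the kernel latency. It assumes an idealized model in which the TS reads of the unrolled loop3 are spread evenly over the banks and each bank serves two reads per cycle; the function names and the example value of lat_loop3 are hypothetical.

```c
#include <assert.h>

/* Initiation interval when loop3 is fully unrolled: ts read requests per
 * buffer per iteration, two read ports per bank (idealized assumption:
 * accesses are spread evenly across banks). */
int initiation_interval(int ts, int banks) {
    assert(ts > 0 && banks > 0);
    int reads_per_cycle = 2 * banks;
    return (ts + reads_per_cycle - 1) / reads_per_cycle;  /* ceiling */
}

/* lat_kernel = (newBound - 1) * II + lat_loop3, with newBound = TS * TS. */
long kernel_latency(int ts, int ii, int lat_loop3) {
    long new_bound = (long)ts * ts;
    return (new_bound - 1) * ii + lat_loop3;
}
```

With TS = 32 and a single bank the model gives II = 16, i.e. TS/2 as stated above; partitioning each buffer into 16 banks brings II down to 1, at the cost of the additional multipliers and adders mentioned in the text.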
Communication optimization in mm_tile: For tiled matrix multiplication,
Synergy overlaps the data transfer cost with the computation cost by
leveraging double buffering, i.e., instantiating two buffers for each local
array. This significantly improves the throughput of tiled MM.
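
The buffer-swapping structure behind double buffering can be sketched in C as below. This is a sequential software model, so it shows only the ping-pong control flow; on the FPGA the fetch of the next tile and the computation on the current tile would run concurrently. The helper names (fetch_tile, compute_tile, process_tiles), the tile length and the trivial sum used as "computation" are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define TILE 4  /* elements per tile (hypothetical size) */

/* Stand-in for a burst read from main memory into a local buffer. */
static void fetch_tile(const float *mem, int tile, float *buf) {
    memcpy(buf, mem + (size_t)tile * TILE, TILE * sizeof(float));
}

/* Stand-in for the compute stage; here it simply sums the tile. */
static float compute_tile(const float *buf) {
    float s = 0.0f;
    for (int i = 0; i < TILE; i++) s += buf[i];
    return s;
}

/* Process num_tiles tiles with two local buffers used in ping-pong fashion. */
float process_tiles(const float *mem, int num_tiles) {
    float buf[2][TILE];
    float total = 0.0f;
    int cur = 0;
    if (num_tiles <= 0) return 0.0f;

    fetch_tile(mem, 0, buf[cur]);                  /* prologue: first tile */
    for (int t = 0; t < num_tiles; t++) {
        if (t + 1 < num_tiles)
            fetch_tile(mem, t + 1, buf[1 - cur]);  /* overlaps in hardware */
        total += compute_tile(buf[cur]);
        cur = 1 - cur;                             /* swap buffer roles */
    }
    return total;
}
```

The price of the overlap is twice the local storage per array, which is why the technique interacts with the BRAM budget discussed earlier.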
Figure 6: Virtual To Physical Address Translation [1]
[Diagram labels: virtual address split into L1 offset, L2 offset and page
offset; register R; L1 Page Table; L2 Page Table; physical address; steps 1
to 5.]

Zero Padding in mm_tile: In Synergy, the hardware accelerators are
shared among the convolutional layers. This implies that the same fixed-size
MM accelerator is used in different layers. As the loop bounds (or data
dimensions) of MM differ between convolutional layers, we may encounter
scenarios where the fixed-size MM accelerator attempts to read input-matrix
data beyond the loop bounds or to write data outside the bounds of the output
matrix. Hence, we include border detection in mm_tile. When fetching data, if
the memory address exceeds the matrix border, the corresponding portion of
the local buffer is set to zero. Similarly, when writing data, mm_tile
ignores write requests whose memory address exceeds the given matrix
borders.
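
A minimal C sketch of this border detection is given below, assuming row-major matrices and a tile size of 4 for brevity. The function names fetch_tile_padded and store_tile_clipped are hypothetical, and the sketch checks row and column indices rather than raw memory addresses.

```c
#include <assert.h>

#define TS 4  /* tile size; the paper uses 32 */

/* Fetch the TS x TS tile whose top-left corner is (row0, col0) from a
 * rows x cols row-major matrix. Elements beyond the border read as zero. */
void fetch_tile_padded(const float *M, int rows, int cols,
                       int row0, int col0, float tile[TS][TS]) {
    for (int i = 0; i < TS; i++)
        for (int j = 0; j < TS; j++) {
            int r = row0 + i, c = col0 + j;
            tile[i][j] = (r < rows && c < cols) ? M[r * cols + c] : 0.0f;
        }
}

/* Store a TS x TS tile back; writes beyond the border are ignored. */
void store_tile_clipped(float *M, int rows, int cols,
                        int row0, int col0, float tile[TS][TS]) {
    for (int i = 0; i < TS; i++)
        for (int j = 0; j < TS; j++) {
            int r = row0 + i, c = col0 + j;
            if (r < rows && c < cols)
                M[r * cols + c] = tile[i][j];
        }
}
```

Because the padded elements are zero, they contribute nothing to the multiply-accumulate, so the fixed-size kernel produces the same result as a kernel sized exactly to the layer.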
3.2.2 Memory Subsystem
The Memory Subsystem shown in Figure 5 is used to process memory requests
from multiple PEs. It consists of memory arbiters (MEM Arbiter), memory
management units (MMUs), memory controllers (MEM Controllers), a Proc
Arbiter and a Proc unit. The MMU translates virtual addresses to physical
addresses, while the MEM Arbiter allocates memory transaction requests to
the shared MMU. The Proc unit obtains the first-level translation page table
address and handles page fault requests, and the Proc Arbiter allows multiple
MMUs to access the Proc unit. The MEM Controllers access the DDR memory
using the AXI4 burst mode protocol. All the components in the Memory
Subsystem are written in RTL code and constitute the Hardware Template
Library shown in Figure 8.
Virtual to Physical Address Translation: In a traditional HW/SW
co-design approach, a device driver normally has a contiguous memory address
space in the Linux kernel. When a delegate thread tries to communicate with
an FPGA PE, it first copies data from the user space to the allocated
contiguous memory (kernel space) in the device driver and sends the physical
address of the memory to the PE. Then the PE obtains the data from the DDR
memory via the MEM Controllers. In Synergy, we avoid this extra data copy in
the acceleration of CONV layers. As mentioned in Sections 3.1.2 and 3.2.1, a
PE obtains the address of a job directly from the delegate thread in the user
space, and the job content includes the base memory addresses of the
input/output arrays in the user space. Those are virtual addresses. In the
ARM Cortex-A9 architecture [1], virtual addresses are translated to physical
addresses by a two-level (L1 and L2) page table walk as shown in Figure 6.
The base address of the L1 page table is stored in a CPU system register
R [1], which can be accessed in the kernel space. Synergy supports this
two-level page table walk in FPGA. During the FPGA initialization in Synergy,
the Proc unit obtains the base address of the L1 page table via its device
driver. Then, the Memory Subsystem translates the virtual address to a
physical address following the steps in Figure 6. In case of a page fault,
the Proc unit triggers a CPU interrupt, obtains a new base address and
repeats the translation process.
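
The two-level walk can be sketched in C as follows. The sketch models only the 4 KB small-page path of the ARMv7-A short-descriptor format (L1 index in bits [31:20], L2 index in bits [19:12], page offset in bits [11:0]) and treats every other descriptor type, including 1 MB sections, as a fault. The simulated memory array, the read_word helper and the PAGE_FAULT sentinel are assumptions for illustration and do not describe the RTL in Synergy's MMU.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_FAULT 0xFFFFFFFFu  /* hypothetical sentinel for a failed walk */
#define MEM_WORDS  (1u << 16)   /* size of the simulated physical memory */

/* Simulated physical memory, word-addressed (stand-in for DDR reads). */
static uint32_t phys_mem[MEM_WORDS];

static uint32_t read_word(uint32_t paddr) {
    return phys_mem[(paddr / 4) % MEM_WORDS];
}

/* Translate vaddr using the L1 table located at physical address l1_base. */
uint32_t translate(uint32_t l1_base, uint32_t vaddr) {
    uint32_t l1_index = vaddr >> 20;            /* bits [31:20] */
    uint32_t l2_index = (vaddr >> 12) & 0xFFu;  /* bits [19:12] */
    uint32_t offset   = vaddr & 0xFFFu;         /* bits [11:0]  */

    /* First level: the entry must point to an L2 page table (type 0b01). */
    uint32_t l1_desc = read_word(l1_base + 4 * l1_index);
    if ((l1_desc & 0x3u) != 0x1u) return PAGE_FAULT;

    /* Second level: the entry must describe a small page (type 0b1x). */
    uint32_t l2_base = l1_desc & 0xFFFFFC00u;
    uint32_t l2_desc = read_word(l2_base + 4 * l2_index);
    if ((l2_desc & 0x2u) == 0) return PAGE_FAULT;

    return (l2_desc & 0xFFFFF000u) | offset;
}
```

On a fault, the hardware described above would raise a CPU interrupt and retry; the sketch just returns the sentinel so the caller can decide what to do.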
Figure 7: Single-MMU vs. Multi-MMU Performance
[Two plots, (a) ReconOS and (b) Synergy. x-axis: number of PEs (1 to 6);
y-axes: performance (ms) and speedup compared to a single PE.]
Multiple MMU Support: The ReconOS architecture [10] contains a single
MMU and MEM Controller. The memory transactions from the PEs compete for
the resources in the Memory Subsystem. As the number of PEs increases, the
memory contention significantly degrades the system performance, as shown in
Figure 7a. To solve the problem, Synergy instantiates multiple MMUs, with at
most two PEs sharing an MMU and MEM Controller. As the frequency of page
faults is generally low in our case, the multiple MMUs in Synergy share the
same Proc unit via the arbiter logic Proc Arbiter. Figure 7b shows that the
speedup increases linearly as we instantiate more PEs in Synergy.
Figure 8: Hardware Architecture Generator
[Diagram labels: *.hw_config; Architecture Parameters; tile size; pragmas;
Tiled MM (C++); HLS Tool; Accelerators; Hardware Template Library; Synergy
Hardware Architecture; FPGA Bitstream; Library Generator; HW/SW
Multi-threading Lib.]
Simplified *.hw_config shown in the figure:
    …
    cluster_num = 2
    clockFreq = 100000000
    tile_size = 32
    fifo_os = 128
    fifo_mem = 128
    hw@PE0
    hlsopt = pragmas0.tcl
    cluster = 0
    hw@PE1
    hlsopt = pragmas1.tcl
    cluster = 0
    hw@PE2
    hlsopt = pragmas2.tcl
    cluster = 0
    hw@PE3
    hlsopt = pragmas3.tcl
    cluster = 1
    ...
3.3 Hardware Architecture Generator
Synergy provides a default accelerator architecture on a given FPGA device.
However, for a new FPGA device or in case the developer is interested in cus-
tomizing the accelerator architecture corresponding to a CNN model, Synergy
provides an architecture generator as shown in Figure 8. This automates the
process of generating the PEs with HLS, the Hardware Architecture, and the
final FPGA bitstream. The input to the generator is a configuration file,
*.hw_config, containing the architecture parameters. The simplified format
of this configuration file is shown on the left side of Figure 8, which
creates the Hardware Architecture shown in Figure 5. Moreover, based on the
configuration file, the generator also compiles the HW/SW multi-threading
library (Synergy Library) to provide the APIs required by Section 3.1.
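
To show how such a configuration could be consumed, here is a small C sketch that reads the simplified format from Figure 8 and counts the PEs assigned to each cluster. The exact grammar of *.hw_config is not given in the text, so the parsing rules (one `key = value` pair per line, a `hw@PE<n>` line opening a PE section), the struct hw_config_t and the function parse_hw_config are assumptions for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CLUSTERS 8  /* assumed upper bound on the number of clusters */

typedef struct {
    int cluster_num;                   /* value of cluster_num */
    int tile_size;                     /* value of tile_size */
    int pe_count;                      /* number of hw@PE sections seen */
    int pes_in_cluster[MAX_CLUSTERS];  /* PEs assigned to each cluster */
} hw_config_t;

/* Parse configuration text held in memory. Returns 0 on success, -1 on a
 * line that is too long or a cluster index that is out of range. */
int parse_hw_config(const char *text, hw_config_t *cfg) {
    char line[128];
    int in_pe = 0, value;
    memset(cfg, 0, sizeof *cfg);
    while (*text) {
        size_t len = strcspn(text, "\n");
        if (len >= sizeof line) return -1;
        memcpy(line, text, len);
        line[len] = '\0';
        text += len + (text[len] == '\n');  /* skip the newline if present */

        if (strncmp(line, "hw@PE", 5) == 0) {
            in_pe = 1;                      /* a new PE section starts */
            cfg->pe_count++;
        } else if (sscanf(line, " cluster_num = %d", &value) == 1) {
            cfg->cluster_num = value;
        } else if (sscanf(line, " tile_size = %d", &value) == 1) {
            cfg->tile_size = value;
        } else if (in_pe && sscanf(line, " cluster = %d", &value) == 1) {
            if (value < 0 || value >= MAX_CLUSTERS) return -1;
            cfg->pes_in_cluster[value]++;
        }                                   /* other keys are ignored here */
    }
    return 0;
}
```

Applied to the example in Figure 8, this sketch would report four PEs, three in cluster 0 and one in cluster 1.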
4 Experimental Evaluation
In this section, we evaluate the Synergy framework with multiple represen-
tative CNN models.
Synergy has been implemented on two heterogeneous SoC platforms, Zedboard
[22] and Xilinx ZC702, both featuring the same Xilinx Zynq XC7Z020 device.
The Xilinx Zynq XC7Z020 is a low-end SoC in terms of its compute capability
and its limited FPGA resources. We report the performance and power numbers
from the Xilinx ZC702 evaluation board because it supports power
measurement. All performance results are collected using an FPGA-based
timer.
We use Darknet [14] as our deep learning package. Darknet is an open
source neural network framework written in C. We use Darknet because it has
a highly-optimized single-threaded software implementation and does not de-
pend on any external library. We first compile Darknet for the ARM core in
the Xilinx Zynq device. Apart from the single-threaded software implementa-
tion, we create a multi-threaded pipelined version of Darknet to take advan-
tage of inter-frame parallelism for high-throughput CNN implementation.
The CPU-only implementations for various CNNs in this section are well-
optimized and compiled by gcc with -O3 optimization. As Darknet uses 32-bit
floating-point CNN models, we also use 32-bit floating-point implementation
both in software and hardware accelerators. The performance-power numbers
of Synergy will improve substantially if 32-bit floating-point implementation
is replaced by N-bit fixed-point implementation where N << 32. However,
this optimization is orthogonal and complementary to our current approach.
Even with floating-point, we achieve better throughput and energy-efficiency
compared to contemporary fixed-point implementations on the same device.
We write assembly-language code to generate highly optimized NEON ac-
celerators for the tiled matrix-multiplication operations. For hardware accel-
erator generation, we use Vivado Design Suite version 2016.2 for High-Level
Synthesis (HLS). The tiled matrix multiplication code is written in C and is
synthesized on FPGA using Vivado HLS with appropriate pragma settings as
presented in Section 3.2.1. Synergy uses two clusters (Cluster-0: 2 NEONs
+ 2 S-PE; Cluster-1: 6 F-PE) configuration across all benchmarks. The
cluster configuration is chosen based on power/performance results across
multiple CNNs and the work stealing technique can ensure that other CNN
applications could work well on this fixed hardware architecture as well by
balancing workload at runtime. The FPGA-based PEs run at 100MHz. The
HW/SW multi-threading is implemented by adapting the ReconOS open-source
operating system for reconfigurable computing [10][15]. The ARM cores run
Linux, which is augmented with ReconOS to interface with the hardware
accelerators.
The entire Synergy framework is set up on a PC with an Intel Xeon CPU
E5-2620 running at 2.10 GHz with 64GB RAM, running Ubuntu 14.04 OS. Given a
CNN model, the Synergy framework is responsible for generating the
appropriate software threads corresponding to the different layers of the
network, interfacing the software threads with the delegate threads of the
hardware accelerators, and creating a default mapping between the CONV
layers and the accelerator clusters. The Synergy framework can also automate
the FPGA bitstream generation, given a hardware accelerator architecture
configuration from the designer, to customize the Synergy implementation for
a particular CNN model (if desired) or to generate one for a new device.

Table 2: Network Architectures of Benchmark CNN Models
Benchmark            CONV Layers  Num. of Layers  Description
CIFAR Darknet [14]   4            9               Object Recognition
CIFAR Alex [5]       3            8               Object Recognition
CIFAR Alex+ [5]      3            9               Object Recognition
CIFAR full [7]       3            9               Object Recognition
MNIST [9]            2            7               Digit Recognition
SVHN [12]            3            8               Digit Recognition
MPCNN [11]           3            9               Gesture Recognition
Benchmarks: Table 2 shows seven CNN models used in this work and
trained with Darknet.
4.1 Synergy Throughput and Energy-Efficiency
Throughput: Compared with the original single-threaded Darknet
implementation running on an ARM core, Synergy achieves an average 7.3x
throughput improvement, as shown in Figure 9.
Power and Energy Consumption: Figure 10 depicts the power distribution
and energy consumption of the Synergy system. The FPGA logic accounts
for only 27% of the total power consumption (around 2.08 W) averaged across
all CNN models. The ARM cores and the DDR memory account for most
of the power consumption. Compared with the power (1.52 W on average)
measured for the CPU+NEON only implementations, Synergy incurs 36.63%
more power consumption.
Figure 9: Throughput improvement using Synergy
[Bar chart. y-axis: throughput (frames/sec); series: Original-Darknet and
Darknet-With-Synergy. Bar labels: Original-Darknet 1.0x for every benchmark;
Darknet-With-Synergy 7.8x, 6.0x, 8.7x, 8.2x, 6.6x, 9.4x, 4.5x.]

Table 3 shows the energy and performance per watt comparison between
the original single-threaded Darknet implementation running on ARM cores
and the Synergy design. The Synergy design consumes 36.63% more power on
average, as it fully leverages the heterogeneity of the ZYNQ platform.
Although the power consumption increases, the Synergy implementation
achieves much higher throughput (7.3x speedup), and thus reduces energy
consumption by 80.13%, averaged across all CNN models, compared to the
original Darknet on ARM cores.
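
The relation between power, throughput and energy per frame used in this comparison can be written as a short C sketch. The helper names energy_mj_per_frame and percent_change are hypothetical, and the formula assumes that average power is constant while a frame is processed.

```c
#include <assert.h>

/* Energy per frame in mJ, given average power in W and throughput in
 * frames per second: E = 1000 * P / throughput. */
double energy_mj_per_frame(double power_w, double frames_per_s) {
    assert(frames_per_s > 0.0);
    return 1000.0 * power_w / frames_per_s;
}

/* Relative change in percent from a baseline value to a new value. */
double percent_change(double before, double after) {
    assert(before != 0.0);
    return 100.0 * (after - before) / before;
}
```

For example, percent_change(142.18, 25.36) evaluates to about -82.16, which matches the reduction reported for CIFAR Darknet in Table 3. Dividing the 2.08 W average power by the 96.2 frames/s reported for MNIST gives roughly 21.6 mJ/frame, close to the 22.78 mJ/frame in Table 3; the gap is expected because 2.08 W is an average across models rather than the MNIST measurement.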
FPGA Resource Consumption: Hardware accelerators generated by
Vivado HLS have great impact on FPGA resource consumption. With the
limited FPGA resource budget, opening up more parallelism via HLS prag-
mas reduces the number of hardware accelerators that can fit in Xilinx
ZC702. Therefore, we explore diverse architectures of hardware accelerators
by traversing different tile sizes and HLS pragma combinations consisting of
loop unrolling, loop pipelining and array partitioning. In this work, the
tile size is set to 32 based on empirical evaluation. On the ZC702 device we
instantiate 6 faster FPGA-based processing engines (F-PE) with the loop
pipelining pragma applied at loop2 in Listing 1 and 2 slower PEs (S-PE) with
loop unrolling (factor = 2) and loop pipelining at loop3.
Comparison with State-of-the-art: Table 4 compares Synergy with
recent FPGA-based CNN works. Note that CaffePresso [6] uses a development
platform with significantly more resources and runs at a higher clock speed.
Moreover, as Darknet does not support data quantization and fixed-point
implementation, Synergy uses a 32-bit floating-point design that consumes
much more resources than 32/16-bit fixed-point designs on FPGAs. As shown in
Table 4, even though we have handicapped ourselves with floating-point
operations, our implementations (both CIFAR full and MNIST) are superior to
[6][21] in terms of throughput (frames per second),
giga-operations-per-second (GOPS), and energy consumption. Compared to [20],
the GOPS of our MNIST and MPCNN designs are 4.5x and 1.8x higher,
respectively. Table 4 demonstrates that Synergy can provide high-throughput
and energy-efficient mapping of CNN models on embedded heterogeneous SoC
platforms.

Figure 10: Power Distribution and Energy Consumption
[Chart. Axes: power consumption (W) and energy (mJ/frame); series: FPGA,
ARM, DDR, Energy.]
4.2 Advantage of Heterogeneity
We now investigate the impact of heterogeneity in improving the CNN
performance in Synergy. Figure 11 shows the latency of different
non-pipelined CNN implementations (single-threaded, leveraging a single ARM
core): CPU+NEON, CPU+FPGA, and CPU+Het, which consists of FPGA and NEON
accelerators, compared to the baseline single-core ARM design. Compared to
the CPU+FPGA design, the heterogeneous implementation with FPGA and NEON
(CPU+Het) improves the latency by 12% on average, with a maximum improvement
of 45% for the MPCNN model.
The throughput speedup of different pipelined CNN implementations
(multi-threaded, using two ARM cores), CPU+NEON, CPU+FPGA, and CPU+Het,
compared to the baseline single-core ARM design is shown in Figure 12.
Compared to the CPU+FPGA designs, the heterogeneous implementations with
FPGA and NEON (CPU+Het) achieve 15% better throughput on average (37%
maximum improvement for the MNIST benchmark).
Table 3: Energy and Performance per Watt Comparison: Original Darknet
Versus Synergy
                 Energy (mJ/frame)                 Performance per watt (GOPS/W)
Benchmark        Original  Synergy  Reduction (%)  Original  Synergy  Speedup
CIFAR Darknet    142.18    25.36    -82.16         0.14      0.80     5.61x
CIFAR Alex       105.03    23.43    -77.70         0.16      0.80     4.48x
CIFAR Alex+      326.62    55.81    -82.91         0.16      0.70     5.85x
CIFAR full       196.41    33.71    -82.84         0.13      0.94     5.83x
MNIST            112.90    22.78    -79.83         0.20      0.78     4.96x
SVHN             193.67    28.07    -85.50         0.14      0.98     6.90x
MPCNN            47.87     14.37    -69.99         0.20      0.68     3.33x
mean                                -80.13                            5.28x

Table 4: Comparison With Recent FPGA-based CNN Works. '*' indicates
values estimated from charts
Design                Device  Clock (MHz)  Precision
CaffePresso [6]       7Z045   180          16-bit Fixed-point
fpgaConvNet [19][20]  7Z020   100          16-bit Fixed-point
DeepBurning [21]      7Z020   100          Fixed-point
Synergy               7Z020   100          32-bit Floating-point

Design                Benchmark   Latency  Throughput  GOPS   Energy
                                  (ms)     (frames/s)         (mJ/frame)
CaffePresso [6]       MNIST       16.0     62.5        1.19   >200*
CaffePresso [6]       CIFAR full  28.0     35.7        0.94   >500*
fpgaConvNet [19][20]  MNIST       –        –           0.48   –
fpgaConvNet [19][20]  MPCNN       –        –           0.74   –
DeepBurning [21]      MNIST       14.3     69.9        1.33*  150*
DeepBurning [21]      CIFAR full  21.4     46.7        1.23*  63
Synergy               MNIST       24.3     96.2        2.15   22.8
Synergy               CIFAR full  33.2     63.5        1.67   33.7
Synergy               MPCNN       12.2     136.4       1.33   14.4
4.3 Transparent Accelerators: Work Stealing
We show the advantage of dynamic load balancing across accelerators using
work-stealing in Synergy versus static mapping of the CONV layers to the ac-
celerators. We consider two different clusters and PE configurations for static
mapping. The first cluster configuration consists of two clusters (Cluster-0:
2 NEONs + 2 S-PE; Cluster-1: 6 F-PE) used in Synergy across all bench-
marks. But unlike Synergy, the CONV layers are statically assigned to the
clusters based on their workload. We refer to this as static-mapping+fixed-
architecture (SF). Figure 13 shows that the SF designs can achieve 6.1x better
throughput compared to the well-optimized CPU designs.
However, the SF designs are inefficient, as the workload assigned to the
clusters might not be balanced due to the different computation requirements
of each CONV layer. Figure 14a presents the execution time of each CONV
layer in the CIFAR Alex model with this configuration. The CONV-0 layer is
mapped to Cluster-0, while the CONV-1 and CONV-2 layers are mapped to
Cluster-1. As shown in Figure 14a, the runtimes of Cluster-0 and Cluster-1
are 24.3 ms and 12.3 ms per frame, respectively. This imbalance in execution
time between the clusters leads to poor cluster utilization and throughput.

Figure 11: Latency Improvement with Accelerators Compared to CPU-only
Solutions for Non-Pipelined Designs
[Bar chart. y-axis: speedup compared to CPU-only. Het: Heterogeneous
Accelerators with NEON and FPGA. Bar labels per series, in benchmark order:
CPU+NEON: 1.78, 1.52, 1.70, 1.92, 1.50, 1.41, 1.01
CPU+FPGA: 3.30, 2.69, 3.78, 3.75, 2.27, 4.69, 1.84
CPU+Het:  3.41, 2.79, 3.84, 3.87, 2.83, 4.81, 2.68]
Synergy employs work-stealing to automatically balance the workload of
different clusters. This makes Synergy network-agnostic, as the jobs from
different CONV layers are automatically distributed across the different
clusters to achieve self-balancing. With the same generic cluster
configuration used in the SF designs, Figure 13 shows that Synergy improves
the throughput by 24% on average compared to the SF designs. The performance
improvement comes from the balanced clusters. Figure 14b presents the
execution time of each CONV layer of the Synergy design for the CIFAR Alex
benchmark. The runtimes of Cluster-0 and Cluster-1 are 22.2 ms and 20.9 ms
per frame, respectively. Compared to the SF design in Figure 14a, the
workloads of Cluster-0 and Cluster-1 are balanced.
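
One simple way to quantify this balance, offered here only as an illustrative metric and not one used in the paper, is to compare the mean cluster runtime with the longest cluster runtime. The C sketch below uses the hypothetical helper names bottleneck_ms and balance.

```c
#include <assert.h>

/* Longest per-frame runtime among the clusters, i.e. the cluster that the
 * others end up waiting for. */
double bottleneck_ms(const double *cluster_ms, int n) {
    assert(n > 0);
    double worst = cluster_ms[0];
    for (int i = 1; i < n; i++)
        if (cluster_ms[i] > worst) worst = cluster_ms[i];
    return worst;
}

/* Balance = mean runtime / longest runtime; 1.0 means perfectly balanced. */
double balance(const double *cluster_ms, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += cluster_ms[i];
    return (sum / n) / bottleneck_ms(cluster_ms, n);
}
```

With the CIFAR Alex numbers quoted above, the SF design (24.3 ms and 12.3 ms) gives a balance of about 0.75, while Synergy (22.2 ms and 20.9 ms) gives about 0.97.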
Finally, we show that Synergy's work-stealing with a generic cluster
architecture is competitive with, and even better than, CNN-model-specific
customized cluster configurations. We call these
static-mapping+custom-architecture (SC) designs. In the SC designs, we find
the best multi-cluster configuration for each CNN model by exploring all
possible cluster configurations. The best multi-cluster configurations are
shown in Table 5 (1). The CONV layers are statically mapped to these
clusters. Note that unlike the optimized cluster configurations in the SC
designs, Synergy leverages the same generic cluster configuration used in
the SF designs for various CNN models. As shown in Figure 13, Synergy still
achieves 6% better throughput than the SC designs. This is because in the
static mapping approaches (SF and SC) an entire CONV layer is assigned to a
cluster and it is hard to perfectly balance the cluster workloads. In
contrast, the work-stealing in Synergy at the granularity of job-level
(tiled MM) can easily balance the workload even with un-optimized generic
accelerators. The work-stealing feature in Synergy empowers developers to
easily switch between different networks at runtime without losing
performance.
(1) The number of clusters in this work can be t, where t ∈ N.

Figure 12: Throughput Improvement with Accelerators Compared to CPU-only
Solutions for Pipelined Designs
[Bar chart. y-axis: speedup compared to CPU-only. Het: Heterogeneous
Accelerators with NEON and FPGA. Bar labels per series, in benchmark order:
CPU+NEON: 1.89, 2.14, 2.23, 2.26, 1.61, 2.23, 1.29
CPU+FPGA: 6.96, 5.80, 7.95, 7.49, 4.81, 7.47, 4.12
CPU+Het:  7.82, 5.95, 8.71, 8.15, 6.62, 9.44, 4.47]

Figure 13: Advantage of Work stealing
[Bar chart. y-axis: speedup compared to CPU-only. SF: Static Mapping + Fixed
Architecture; SC: Static Mapping + Custom Architecture; Synergy: Dynamic
Mapping + Fixed Architecture. Bar labels per series, in benchmark order:
SF:      7.32, 4.31, 7.07, 7.05, 4.82, 7.65, 3.55
SC:      7.69, 5.85, 8.11, 7.81, 5.98, 8.38, 4.34
Synergy: 7.82, 5.95, 8.71, 8.15, 6.62, 9.44, 4.47]
Figure 14: Dynamic Load Balancing in CIFAR Alex. SF Conf.: Cluster-0 (2
NEONs + 2 S-PE), Cluster-1 (6 F-PE)
[Stacked bar charts of execution time (ms) per cluster, broken down by
CONV-0, CONV-1 and CONV-2. (a) SF Configuration with Two Clusters:
Cluster-0 24.3, Cluster-1 12.3, with a CONV-2 segment labeled 11.70.
(b) Synergy: same SF configuration + work-stealing: Cluster-0 22.2,
Cluster-1 20.9.]

Table 5: Best Cluster Configurations for CNN Models under Static Mapping
+ Custom Architectures
                 Cluster 0                 Cluster 1
Benchmark        NEON  FPGA IP             NEON  FPGA IP
CIFAR Darknet    0     2 S-PE + 1 F-PE     2     5 F-PE
CIFAR Alex       0     2 S-PE + 2 F-PE     2     4 F-PE
CIFAR Alex+      2     2 S-PE + 2 F-PE     0     4 F-PE
CIFAR full       0     2 S-PE + 2 F-PE     2     4 F-PE
MNIST            2     2 S-PE + 2 F-PE     0     4 F-PE
SVHN             2     2 S-PE + 2 F-PE     0     4 F-PE
MPCNN            0     2 S-PE + 2 F-PE     2     4 F-PE
To better understand the performance improvement, Table 6 shows the
accelerator cluster utilization of various designs. The non-pipelined
designs are the best single-threaded implementations (the blue bars in
Figure 11) leveraging a single CPU, a NEON core and the FPGA accelerators.
As shown in Table 6, the cluster utilization of the non-pipelined designs is
very low, indicating that the FPGA is idle for 43.95% (= 1 − 56.05%) of the
total execution time on average. The reason is that in a non-pipelined
design, the FPGA accelerators have to wait for the CPU or NEON core to
finish their work. With multi-threading support, the pipelined designs
significantly increase the cluster utilization (above 90% on average), as
various computing elements can work simultaneously. Table 6 shows that the
SF designs increase the accelerator cluster utilization from 56.1% to 92.5%
on average. Compared to the SF designs, the cluster utilization of the SC
designs reaches 96.5% averaged across the benchmarks. This is because the SC
designs use fine-tuned cluster configurations and the workload assigned to
the clusters is more balanced. As mentioned above, since Synergy leverages
the work-stealing scheduler, which works at the finer granularity of
job-level (tiled MM), the scheduler helps to improve the cluster utilization
at runtime by balancing the workload across clusters. The average cluster
utilization of Synergy reaches 99.8%, as shown in Table 6.

Table 6: Accelerator Cluster Utilization Comparison Across SF, SC and
Synergy
                 Non-pipelined  Pipelined (%)
Benchmark        (%)            SF     SC     Synergy
CIFAR Darknet    50.77          95.32  97.55  99.89
CIFAR Alex       53.56          92.72  96.61  99.83
CIFAR Alex+      61.28          98.81  98.73  99.95
CIFAR full       54.06          93.53  94.97  100.00
MNIST            59.03          85.63  96.09  99.89
SVHN             53.00          94.72  96.86  99.26
MPCNN            60.62          86.47  94.45  99.79
mean             56.05          92.46  96.47  99.80
5 Conclusion
This paper presents Synergy, an automated, transparent, hardware-software
co-designed CNN inference framework for embedded FPGA-based heterogeneous
SoC architectures. Synergy exploits this heterogeneity by leveraging the
diverse computing resources (CPUs, NEON engines and FPGA) to accelerate
CNNs. Moreover, Synergy provides a software work-stealing scheduler that
automatically balances the workload across accelerators, so that it can
adapt to various networks at runtime without changing the hardware or
software implementation. Our results show that Synergy achieves a 7.3x
speedup, averaged across seven representative CNN models, over a
well-optimized software-only solution. Compared to contemporary CNN
implementations on the same SoC platform, Synergy delivers better throughput
as well as better energy efficiency.
References
[1] ARM Infocenter. http://infocenter.arm.com. 2018.
[2] A. Dundar, J. Jin, B. Martini, and E. Culurciello. Embedded streaming
deep neural networks accelerator with applications. IEEE Transactions
on Neural Networks and Learning Systems, 28(7):1572–1583, July 2017.
[3] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang,
and J. Cong. Fp-dnn: An automated framework for mapping deep
neural networks onto fpgas with rtl-hls hybrid templates. In 2017 IEEE
25th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), pages 152–159, April 2017.
[4] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang,
and H. Yang. Angel-eye: A complete design flow for mapping cnn onto
embedded fpga. IEEE Transactions on Computer-Aided Design of In-
tegrated Circuits and Systems, 37(1):35–47, Jan 2018.
[5] S. Hashemi, N. Anthony, H. Tann, R. I. Bahar, and S. Reda. Un-
derstanding the impact of precision quantization on the accuracy and
energy of neural networks. In Design, Automation Test in Europe Con-
ference Exhibition (DATE), 2017, pages 1474–1479, March 2017.
[6] Gopalakrishna Hegde, Siddhartha, Nachiappan Ramasamy, and
Nachiket Kapre. Caffepresso: An optimized library for deep learning
on embedded accelerator-based platforms. In Proceedings of the In-
ternational Conference on Compilers, Architectures and Synthesis for
Embedded Systems, CASES ’16, pages 14:1–14:10, New York, NY, USA,
2016. ACM.
[7] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan
Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe:
Convolutional architecture for fast feature embedding. In Proceedings of
the 22nd ACM International Conference on Multimedia, MM ’14, pages
675–678, New York, NY, USA, 2014. ACM.
[8] J. H. Kim, B. Grady, R. Lian, J. Brothers, and J. H. Anderson. Fpga-
based cnn inference accelerator synthesized from multi-threaded c soft-
ware. In 2017 30th IEEE International System-on-Chip Conference
(SOCC), pages 268–273, Sept 2017.
[9] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, Nov 1998.
[10] Enno Lübbers and Marco Platzner. Reconos: Multithreaded programming
for reconfigurable computers. ACM Trans. Embed. Comput. Syst.,
9(1):8:1–8:33, October 2009.
[11] J. Nagi, F. Ducatelle, G. A. Di Caro, D. Cireşan, U. Meier, A. Giusti,
F. Nagi, J. Schmidhuber, and L. M. Gambardella. Max-pooling
convolutional neural networks for vision-based hand gesture recognition. In
2011 IEEE International Conference on Signal and Image Processing
Applications (ICSIPA), pages 342–347, Nov 2011.
[12] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu,
and Andrew Y Ng. Reading digits in natural images with unsupervised
feature learning. In NIPS workshop on deep learning and unsupervised
feature learning, 2011.
[13] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin
Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and
Huazhong Yang. Going deeper with embedded fpga platform for con-
volutional neural network. In Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, FPGA
’16, pages 26–35, New York, NY, USA, 2016. ACM.
[14] Joseph Redmon. Darknet: Open source neural networks in c, February
2018.
[15] Christoph Rüthing et al. Self-Adaptation in Programmable Automation
Controllers based on Hybrid Multi-Cores. Master Thesis, University of
Paderborn, 2016.
[16] Yongming Shen, Michael Ferdman, and Peter Milder. Maximizing cnn
accelerator efficiency through resource partitioning. In Proceedings of
the 44th Annual International Symposium on Computer Architecture,
ISCA ’17, pages 535–547, New York, NY, USA, 2017. ACM.
[17] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei
Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. Throughput-optimized
opencl-based fpga accelerator for large-scale convolutional neural net-
works. In Proceedings of the 2016 ACM/SIGDA International Sympo-
sium on Field-Programmable Gate Arrays, FPGA ’16, pages 16–25, New
York, NY, USA, 2016. ACM.
[18] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela
Blott, Philip Leong, Magnus Jahre, and Kees Vissers. Finn: A frame-
work for fast, scalable binarized neural network inference. In Pro-
ceedings of the 2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, FPGA ’17, pages 65–74, New York, NY,
USA, 2017. ACM.
[19] S. I. Venieris and C. S. Bouganis. fpgaconvnet: A framework for mapping
convolutional neural networks on fpgas. In 2016 IEEE 24th Annual In-
ternational Symposium on Field-Programmable Custom Computing Ma-
chines (FCCM), pages 40–47, May 2016.
[20] Stylianos I. Venieris and Christos-Savvas Bouganis. fpgaconvnet: Au-
tomated mapping of convolutional neural networks on fpgas (abstract
only). In Proceedings of the 2017 ACM/SIGDA International Sympo-
sium on Field-Programmable Gate Arrays, FPGA ’17, pages 291–292,
New York, NY, USA, 2017. ACM.
[21] Ying Wang, Jie Xu, Yinhe Han, Huawei Li, and Xiaowei Li. Deep-
burning: Automatic generation of fpga-based learning accelerators for
the neural network family. In Proceedings of the 53rd Annual Design
Automation Conference, DAC ’16, pages 110:1–110:6, New York, NY,
USA, 2016. ACM.
[22] Xilinx Inc. http://www.xilinx.com. 2018.
[23] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason
Cong. Caffeine: Towards uniformed representation and acceleration
for deep convolutional neural networks. In Proceedings of the 35th In-
ternational Conference on Computer-Aided Design, ICCAD ’16, pages
12:1–12:8, New York, NY, USA, 2016. ACM.
[24] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and
Jason Cong. Optimizing fpga-based accelerator design for deep con-
volutional neural networks. In Proceedings of the 2015 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pages
161–170, New York, NY, USA, 2015. ACM.
[25] Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason
Cong. Energy-efficient cnn implementation on a deeply pipelined fpga
cluster. In Proceedings of the 2016 International Symposium on Low
Power Electronics and Design, ISLPED ’16, pages 326–331, New York,
NY, USA, 2016. ACM.
[26] G. Zhong, A. Prakash, S. Wang, Y. Liang, T. Mitra, and S. Niar. Design
space exploration of fpga-based accelerators with multi-level parallelism.
In Design, Automation Test in Europe Conference Exhibition (DATE),
2017, pages 1141–1146, March 2017.