Synergy: A HW/SW Framework for High
Throughput CNNs on Embedded
Heterogeneous SoC
Guanwen Zhong, Akshat Dubey, Tan Cheng, and Tulika Mitra
School of Computing, National University of Singapore
zhguanwen@gmail.com, akshatdubey@nus.edu.sg, tancheng@comp.nus.edu.sg, tulika@comp.nus.edu.sg
Abstract
Convolutional Neural Networks (CNN) have been widely deployed
in diverse application domains. There has been significant progress in
accelerating both their training and inference using high-performance
GPUs, FPGAs, and custom ASICs for datacenter-scale environments.
The recent proliferation of mobile and IoT devices has necessitated
real-time, energy-efficient deep neural network inference on embedded-
class, resource-constrained platforms. In this context, we present Syn-
ergy, an automated, hardware-software co-designed, pipelined, high-
throughput CNN inference framework on embedded heterogeneous
system-on-chip (SoC) architectures (Xilinx Zynq). Synergy leverages,
through multi-threading, all the available on-chip resources, which in-
cludes the dual-core ARM processor along with the FPGA and the
NEON SIMD engines as accelerators. Moreover, Synergy provides
a unified abstraction of the heterogeneous accelerators (FPGA and
NEON) and can adapt to different network configurations at runtime
without changing the underlying hardware accelerator architecture by
balancing workload across accelerators through work-stealing. Syn-
ergy achieves 7.3X speedup, averaged across seven CNN models, over
a well-optimized software-only solution. Synergy demonstrates sub-
stantially better throughput and energy-efficiency compared to the
contemporary CNN implementations on the same SoC architecture.
1 Introduction
Convolutional Neural Networks (CNNs) are a popular class of deep learning
methods with a wide range of applications, including computer vision, im-
age/video processing, natural language processing, and others. A typical
CNN consists of multiple layers. Given an application, such as image classi-
fication, the network is first trained with the training dataset. The trained
network is then deployed for inference, i.e., classification of a new image.
Both the training and the inference are compute- and memory-intensive, but
also exhibit massive intrinsic parallelism. Thus, there exist numerous efforts
to improve the performance and the energy-efficiency of CNN implementa-
tions through architectures and computing substrates that support extensive
parallelism, such as GPUs, FPGAs, or even ASIC accelerators. This line of
research has primarily focused on the high-performance computing platforms
in datacenters or clouds.
The proliferation of mobile devices and the recent emergence of the
IoT (Internet of Things) have transformed the computing landscape. There
is a compelling need to realise real-time, energy-efficient CNN inference on
resource-constrained mobile and IoT edge devices. However, an efficient im-
plementation of CNN-based inference on embedded platforms remains chal-
lenging given the resource limitations. In this context, we present Synergy, an
automated, transparent, pipelined, high-throughput, hardware-software co-
designed CNN inference framework on embedded heterogeneous SoC archi-
tectures. We design Synergy prototype on the Xilinx Zynq XC7Z020 device
leveraging all its available on-chip compute resources, namely the dual-core
ARM processor with NEON SIMD engines and the FPGA. Synergy is a com-
plete system-level solution including a multi-threaded software component,
multi-threaded FPGA and NEON accelerators, an interface between hard-
ware and software components, support for dynamic workload balancing,
as well as an architecture generator for customized solutions (if required).
Figure 1 depicts the Synergy framework mapping a CNN model on a het-
erogeneous SoC. Synergy distinguishes itself from the state-of-the-art along
multiple dimensions.
Heterogeneous HW/SW Acceleration: Synergy leverages all the
compute resources available on a heterogeneous SoC for maximum perfor-
mance. The convolutional (referred to as CONV hereafter) layers are the
most compute-intensive component of the CNN consuming more than 90%
of the execution time [24]. All contemporary CNN designs [6][20][21] on
the Xilinx Zynq platform offload the CONV layers to the FPGA. We ob-
serve that the NEON SIMD (Single-Instruction Multiple-Data) engines in
the ARM cores are quite effective in accelerating the CONV layers as well.
Therefore, harnessing the compute power of the FPGA in conjunction with
Figure 1: Synergy: Mapping CNNs on Heterogeneous SoC
the NEON engines can reduce the execution latency of the CONV layers sig-
nificantly. Embracing the heterogeneity — hardware accelerators on FPGA
and software accelerators on NEON — for a single computational kernel is
a difficult proposition. Synergy effectively transforms the computation of a
CONV layer into a tiled matrix multiplication operation, where the tiles can
be processed independently. Synergy then seamlessly feeds the tiles to both
the hardware and the software accelerators to exploit the parallelism within
CONV layers. Synergy improves the overall latency and throughput by 12%
and 15% respectively, averaged across multiple CNN models, using NEON
and FPGA compared to FPGA-only solutions.
Transparent Consolidation of Accelerators: Most contemporary
FPGA-based CNN frameworks [3][6][13] [17][20][21][24][23] rely heavily on
customizing the CONV layer accelerators for each individual network to
minimize latency. The configuration of a CNN (number and type of layers,
nodes, connections, etc.) is dependent on the application. Given a specific
CNN model, existing approaches perform an automated (or manual) design
space exploration (DSE) to identify the optimal accelerator architectures for
the CONV layers of that network on the target FPGA device. This approach
has two drawbacks. First, the application developer needs to be involved in the
DSE and High-Level Synthesis (HLS) to map the given CNN model on the
FPGA. Even if the DSE and HLS steps can be fully automated, there are
still some quirks [16] that make this process quite challenging for an applica-
tion developer with limited experience in FPGAs. Second, in an embedded
FPGA device with strict resource constraints, a single accelerator design is
used by all the CONV layers of the network in a time-multiplexed fashion
even though the different layers have diverse compute requirements. In other
words, the single accelerator is a compromise to offer the best average per-
formance across all the CONV layers of the network, but it is not ideal for
any particular CONV layer [16]. Moreover, this single CONV accelerator is
still custom-designed for each network through extensive DSE and HLS.
In contrast, Synergy accelerators (FPGA, NEON) are network-agnostic.
A fixed set of accelerators is used irrespective of the network and layer as
the CONV layer computation is transformed into tiled matrix multiplica-
tions. Using fine-grained tiled matrix multiplication operations as funda-
mental primitives (as opposed to complete CONV layers) in conjunction with
a work-stealing software scheduler that distributes these tiles to the different
accelerators and balances the workload, Synergy achieves comparable perfor-
mance to the customized network-specific implementations returned through
DSE. Thus, Synergy can bypass the DSE and HLS for each individual CNN
model and provide a completely out-of-the-box solution.
High-Throughput HW/SW Multi-Threaded Pipeline: The trans-
parent consolidation of heterogeneous HW/SW accelerators in Synergy pro-
vides a powerful abstraction layer for any CNN implementation. The abun-
dance of sensors on mobile and IoT devices capturing continuous data streams
demand in-situ real-time inference (e.g., continuous object detection in im-
age/video stream [2]). Here throughput (i.e., frames per second) is the defin-
ing metric as opposed to minimizing the latency for each individual frame
in isolation. Synergy employs a HW/SW multi-threaded pipelined design of
the different layers of the network that allows consecutive frames (images)
from the streaming input to be processed concurrently exploiting inter-frame
parallelism and improving throughput. However, in this pipelined design,
CONV layers from different frames may need to be processed simultaneously
on the FPGA. This inter-frame parallelism is easy to support in Synergy as
the different matrix multiplications from the different CONV layers simply
generate matrix multiplication tiles and the tiles from different layers get dis-
tributed in a transparent fashion to the set of heterogeneous accelerators to be
executed in parallel. Synergy achieves 39.5 – 136.4 frames/second throughput
and consumes 14.4 – 55.8 mJ/frame energy depending on the CNN model.
This is substantially better than the contemporary CNN implementations
on the Xilinx Zynq XC7Z020 device (see Table 4). Moreover, the concur-
rent execution of multiple CONV layers on FPGA accelerators in Synergy
greatly improves their utilization. The low utilization of the accelerators is a
critical issue in full-system implementation of CNNs on resource-constrained
devices where non-compute intensive layers (e.g., pooling, activation and
fully-connected) implemented in software on not so powerful CPUs (e.g.,
ARM) take up significant execution time while the FPGA remains inactive.
The pipelined design with multiple frames in-flight keeps the accelerators
busy in Synergy with 99.8% average utilization.
Automated Customized Accelerator Design: Synergy offers a de-
fault plug-and-play solution for a naïve application developer that avoids the
complex DSE and HLS for each individual CNN model. An experienced
designer, on the other hand, may want to further improve the performance
by designing accelerators that are optimized for the CNN model at hand.
Synergy offers an automated approach to customized acceleration design for
a specific CNN model. The framework accepts the network configuration
(number and type of layers) corresponding to a CNN model as an input. The
designer only needs to provide the architectural parameters for the matrix
multiplication accelerators. The Synergy framework not only can automati-
cally synthesize the accelerators (according to designer-defined parameters)
on the FPGA fabric but also generate the appropriate hardware-software in-
terface to synergistically engage these newly synthesized accelerators. This
is shown as the “Hardware Architecture Generator” in Figure 1. In addition,
the same architecture generator can be used to synthesize the accelerators
and the HW/SW interface for a new SoC device. Thus, Synergy provides
a complete push-button solution to CNN acceleration on a heterogeneous
platform.
2 Related Works
The state-of-the-art FPGA-based CNN works are shown in Table 1. To the
best of our knowledge, there is no work focusing on heterogeneous HW/SW
acceleration (with CPUs, NEONs and FPGA) for CNNs. We classify the ex-
isting works into two categories: network-dependent and network-independent
FPGA-based CNN frameworks.
Network-dependent Frameworks generally require designers to ex-
plore different configurations to generate a hardware architecture for a spe-
cific CNN network manually or with the help of scripts provided, and per-
form synthesis (which normally takes half to one hour) to generate the
bitstream. Given a new network, designers need to redo the above steps,
which is time consuming. This approach is well-explored and can pro-
duce extremely high-performance CNN accelerators, but sacrifices flexi-
bility across different networks. Almost all the existing FPGA-based CNN
works [3][6][8][13][16][17][18][19][20][21][23][25][24] use the network-dependent
approach. [16][17][25][24] require designers to manually explore architec-
tures for different networks, while [3][18][19][20][21][23] propose automated
toolchains for mapping CNNs on FPGAs. [3][6][13][17][20][21][23][24] mainly
focus on exploiting intra-frame parallelism within layers and execute layers
in a CNN in sequence, ignoring inter-frame parallelism across layers. [18][25]
map all layers onto FPGAs and enable hardware pipelining to exploit the
inter-frame parallelism. [18] proposes an automated framework to accelerate
binarized CNNs. Their work maps all layers in a binarized CNN on FPGA
and enables hardware pipelining. [25] proposes an approach that maps
convolutional, normalization, pooling, activation and fully-connected layers
onto multiple FPGA devices (1 Xilinx ZC706 + 6 Xilinx VC709 boards).
Each device is in charge of one or more specific CNN layers, and the devices are
connected in a ring network. The pipelining flow is controlled by dual-core
ARM processors on the Xilinx Zynq board. However, the cost of the deeply
pipelined FPGA cluster is too high as it requires multiple high-end FPGA
devices and the setup is difficult. Different from [18][25], [8] starts with
multi-threaded CNN inference codes and converts all its layers into FPGA
accelerators. However, the workload of different layers in a CNN could be
imbalanced, which leads to low accelerator utilization, wasting the precious
FPGA resources. [16] statically splits the single large processing engine (PE)
used to accelerate convolutional layers into multiple small PEs. Their ap-
proach allows multiple layers to run simultaneously on different image
frames. However, the evaluation of their work is based on simulation and the
performance (execution cycles) of PEs is obtained by Vivado HLS. The per-
formance number is not accurate as it does not consider the runtime overhead
of the real platform.
Network-independent Frameworks leverage a fixed optimized hard-
ware architecture for various CNN networks. To adapt to different networks
and achieve good hardware efficiency, this approach relies on either static
(compiler) or dynamic (runtime scheduler) techniques. The key advantage
of this approach is that designers can easily switch different networks at run-
Table 1: Current State-of-the-art vs. Synergy

Reference                         Automated  Inter-frame  Self-balancing  HW Reuse*  Network Agnostic  On-board Evaluation
[24] [FPGA’15]                        ✗           ✗             ✗             ✗              ✗                  ✓
[13] [FPGA’16]                        ✗           ✗             ✗             ✓              ✗                  ✓
[17] [FPGA’16]                        ✗           ✗             ✗             ✗              ✗                  ✓
[6] [CASES’16]                        ✗           ✗             ✗             ✓              ✗                  ✓
[21] [DAC’16]                         ✓           ✗             ✗             ✗              ✗                  ✓
[23] [ICCAD’16]                       ✓           ✗             ✗             ✗              ✗                  ✓
[19] [FCCM’16], [20] [FPGA’17]        ✓           ✗             ✗             ✗              ✗                  ✓
[18] [FPGA’17]                        ✓           ✓             ✗             ✗              ✗                  ✓
[3] [FCCM’17]                         ✓           ✗             ✗             ✓              ✗                  ✓
[25] [ISLPED’16]                      ✗           ✓             ✗             ✗              ✗                  ✓
[8] [SOCC’17]                         ✗           ✓             ✗             ✗              ✗                  ✓
[16] [ISCA’17]                        ✗           ✓             ✗             ✓              ✗                  ✗
[4] [TCAD’17]                         ✓           ✗             ✗             ✓              ✓                  ✓
Proposed Synergy                      ✓           ✓             ✓             ✓              ✓                  ✓

* HW Reuse: different CONV layers and FC layers can share the same FPGA accelerators
time without going through the time-consuming synthesis step. [4] belongs
to this category. Their approach relies on a compiler developed to statically
generate a set of instructions (describing the process of CNN execution) that
execute on the fixed hardware architecture. Layers are executed in sequence
in their work. Moreover, as [4] includes data quantization to reduce memory
requirement, their approach can support large networks on embedded FPGA
devices. However, their approach cannot allow multiple layers to run con-
currently with different input frames, which might result in low accelerator
utilization.
Synergy supports the network-independent approach. More specifically, we pro-
pose a hardware abstraction to unify various computing elements (NEON
cores and FPGA) within an FPGA-based SoC architecture. Thus, Synergy
can leverage all computing elements (multiple ARM cores, its NEON cores
and FPGA) to accelerate CNNs via HW/SW multi-threading, unleashing the
true power of heterogeneity. Different from [4], Synergy adapts to various
networks by leveraging a work-stealing scheduler (Section 3.1.3) in software to
automatically balance the workload of accelerators at runtime without chang-
ing hardware or software implementations. Moreover, Synergy provides an
automated toolchain to allow designers to explore various accelerator archi-
tectures or migrate designs to other embedded FPGA devices.
3 The Synergy Framework
Synergy, as shown in Figure 1, is an automated framework to map CNN
models onto embedded FPGA-based heterogeneous SoC platforms. Synergy
targets the CNN inference phase and successfully unleashes the power of
heterogeneity of the SoC architectures by leveraging all its compute elements
(CPUs, NEON engines, and FPGA).
A CNN model contains multiple layers such as convolutional, normaliza-
tion, pooling, activation and fully connected layers. The input of a layer is the
output of the previous layer. When input frames stream into the CNN, the
layers can process different frames concurrently. This inter-frame parallelism
can be exploited to improve throughput.
Synergy uses the FPGA logic and the NEON engines to accelerate the
most compute-intensive layers (CONV) in a CNN, while the CPU cores work
on the other layers (such as pooling, activation and fully-connected layers)
and preprocessing functions (e.g., normalization, scaling and data layout
transformation). As shown in Figure 1, a designer can instantiate multiple
processing engines (referred to as PE hereafter) on the FPGA to accelerate
the CONV layers. The computation in a CONV layer is transformed into
a set of independent tiled matrix-multiplication operations, called jobs as
mentioned in Section 3.1.1. These jobs are executed by the FPGA and the
NEON accelerators in parallel.
To improve the inference throughput and accelerator utilization, Syn-
ergy supports HW/SW multi-threaded pipeline where the CPU cores and
the accelerators work on different layers of different frames concurrently.
Therefore, we extend the traditional single-threaded CNN framework with
multi-threaded support. Specifically, the workload in each layer is handled
by the corresponding thread and the communication between layers is per-
formed through a mailbox (a synchronized first-in-first-out buffer) accessible
by the threads. Multiple threads collaborate with each other in a producer-
consumer fashion constructing the full network.
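To make the producer-consumer structure concrete, the following is a minimal sketch of a mailbox and a layer thread built on POSIX threads; the names mailbox_t, mbox_put and mbox_get are illustrative and are not Synergy's actual API.

/* Minimal sketch of the producer-consumer layer pipeline, assuming a
 * pthread-based mailbox; names are illustrative, not Synergy's actual API. */
#include <pthread.h>
#include <stdlib.h>

#define MBOX_DEPTH 4

typedef struct {
    void *slots[MBOX_DEPTH];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} mailbox_t;

static void mbox_put(mailbox_t *m, void *frame) {
    pthread_mutex_lock(&m->lock);
    while (m->count == MBOX_DEPTH)
        pthread_cond_wait(&m->not_full, &m->lock);
    m->slots[m->tail] = frame;
    m->tail = (m->tail + 1) % MBOX_DEPTH;
    m->count++;
    pthread_cond_signal(&m->not_empty);
    pthread_mutex_unlock(&m->lock);
}

static void *mbox_get(mailbox_t *m) {
    pthread_mutex_lock(&m->lock);
    while (m->count == 0)
        pthread_cond_wait(&m->not_empty, &m->lock);
    void *frame = m->slots[m->head];
    m->head = (m->head + 1) % MBOX_DEPTH;
    m->count--;
    pthread_cond_signal(&m->not_full);
    pthread_mutex_unlock(&m->lock);
    return frame;
}

/* Each layer thread consumes frames from its input mailbox and produces
 * results into its output mailbox, so consecutive frames flow through the
 * pipeline concurrently. */
typedef struct { mailbox_t *in, *out; } layer_ctx_t;

static void *pool_layer_thread(void *arg) {
    layer_ctx_t *ctx = (layer_ctx_t *)arg;
    for (;;) {
        void *frame = mbox_get(ctx->in);     /* wait for the previous layer */
        /* ... pooling computation on frame ... */
        mbox_put(ctx->out, frame);           /* hand over to the next layer */
    }
    return NULL;
}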
As multi-threading is a software concept, hardware accelerators cannot
directly share the well-established mechanisms in the multi-threading model
such as mutex, semaphore, and mailbox. To abstract away the hardware
accelerators as hardware threads and extend the operating system to support
HW/SW threads, we adapt ReconOS [10], an open-source operating sys-
tem for reconfigurable computing. ReconOS provides the HW/SW multi-
threading technique that we build upon to accelerate CNN applications on
FPGA-based heterogeneous SoC platforms. Each hardware accelerator or
PE is represented by a delegate thread in the software space that behaves
just like traditional software threads as shown in Figure 2 and explained in
detail in Section 3.1.2.
The accelerators (PEs and NEONs) can be grouped into multiple clus-
ters so that each CONV layer can have its own private cluster. However,
Synergy accelerators are not customized given a specific CNN model. Thus
the generic multi-cluster configuration may not be optimal for each network
and may lead to imbalance in execution time of the different CONV lay-
ers. Synergy employs work-stealing (detailed in Section 3.1.3), a dynamic
workload scheduling technique, to deal with the workload imbalance among
the clusters. The jobs (independent tiled matrix-multiplication operations)
from the different CONV layers are distributed to the different accelerator
clusters. An idle cluster steals workload from the other busy clusters and
thereby achieves workload balance across clusters and maximizes through-
put. Within a cluster, the jobs are dispatched to the available accelerators
(NEONs and FPGA-based PEs) in a round-robin fashion.
The Synergy framework provides a default architecture of the FPGA-
based PEs and their cluster configuration that has been designed to provide
quality performance across a range of CNN models. These clusters and their
constituent PEs are pre-synthesized in Synergy corresponding to each unique
FPGA device/platform and do not need to be generated for each individ-
ual CNN model. In other words, the FPGA bitstream remains unchanged
across different CNN models and the FPGA device need not be reconfigured
corresponding to each CNN model unless desired by the application devel-
oper. Given a CNN model corresponding to an application, Synergy takes
in a network configuration file that defines the architecture of the CNN as
input. The CPU-based host code used to control the hardware accelerators,
Linux kernels and HW/SW multi-threaded library are written as templates.
With the network configuration file and the software templates, Synergy au-
tomatically generates a software/hardware multi-threaded CNN application
in C.
If an advanced application developer wants to customize the PE and
cluster design for a specific CNN model, the Synergy framework offers a
hardware architecture generator (Section 3.3). In this case, Synergy takes in
a hardware configuration file as input and creates the hardware architecture
by instantiating the HLS-RTL accelerator templates in C corresponding to
the tiled matrix-multiplication operations. These FPGA-based accelerators
Figure 2: Overview of the Software Architecture
for the CONV layers are generated by a commercial HLS tool from the C
templates, while accelerator interfaces and memory subsystem are created by
RTL templates. The generation of both the software and hardware components
is completely automated.
3.1 Software Architecture
Figure 2 shows the software component in Synergy. We explain the software
functionality that implements the CONV layers, the other layers, and the
preprocessing functions.
3.1.1 CONV Layers
CONV layers are the most compute-intensive components in a CNN, occu-
pying more than 90% of the execution time during inference [24]. They take
in input feature maps and convolve them with convolutional filters to ob-
tain output feature maps. As we target the low-end embedded platforms,
the FPGA resources are quite limited and we cannot generate a dedicated
accelerator for each convolutional layer in a CNN model like [13][24][23][25].
Therefore, in our implementation, we need to share the hardware accelerators
among the convolutional layers. We transform the convolution operations
into matrix multiplication (MM) by flattening and rearranging the input
features [3][17]. A data layout transformation is required to convert the 3D
arrays in CONV layers into 2D arrays, which is known as the image-to-column
(im2col) function in many popular open-source CNN frameworks such as
Caffe [7] and Darknet [14]. Details related to the data layout transformation
can be found in [7][17]. Synergy leverages both the FPGA-based PEs and
the NEON engines to accelerate the MM computations.
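As an illustration of this data layout transformation, the following is a simplified im2col sketch in the spirit of the im2col_cpu routines found in Caffe and Darknet; it is written here for exposition (single image, no dilation) and is not copied from either framework.

/* Simplified im2col sketch: each row of the output matrix corresponds to one
 * (channel, kernel-row, kernel-column) position, each column to one output
 * pixel; out-of-image positions are zero-padded. */
void im2col_sketch(const float *im, int channels, int height, int width,
                   int ksize, int stride, int pad, float *col) {
    int out_h = (height + 2 * pad - ksize) / stride + 1;
    int out_w = (width  + 2 * pad - ksize) / stride + 1;
    int row = 0;
    for (int c = 0; c < channels; ++c)
        for (int kh = 0; kh < ksize; ++kh)
            for (int kw = 0; kw < ksize; ++kw, ++row)
                for (int oh = 0; oh < out_h; ++oh)
                    for (int ow = 0; ow < out_w; ++ow) {
                        int ih = oh * stride - pad + kh;
                        int iw = ow * stride - pad + kw;
                        float v = 0.0f;   /* zero padding outside the image */
                        if (ih >= 0 && ih < height && iw >= 0 && iw < width)
                            v = im[(c * height + ih) * width + iw];
                        col[row * out_h * out_w + oh * out_w + ow] = v;
                    }
}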
Listing 1: Tiled Matrix Multiplication
1  /* Tile Size: TS; Loop bounds: N, M, K */
2  Tile_t1: for (t1 = 0; t1 < floor(N/TS); ++t1) {
3  Tile_t2:   for (t2 = 0; t2 < floor(M/TS); ++t2) {
4               ... // Initialization
5  tiled_mm:    for (t3 = 0; t3 < floor(K/TS); ++t3) {
6                 // Step 1: Copy data from DDR to local buffers (a, b, c);
7                 data_copy(ddr_a, ddr_b, a, b, offsetA, offsetB);
8                 // Step 2: Kernel Computation
9  loop1:          for (i = 0; i < TS; ++i) {
10 loop2:            for (j = 0; j < TS; ++j) {
11 loop3:              for (k = 0; k < TS; ++k) {
12                        c[i][j] += a[i][k] * b[k][j]; }}}}
13               // Step 3: Write data from local buffer to DDR
14               data_send(c, ddr_c, offsetC);
15 }}
After flattening and rearranging the input features, the input matrices of
the matrix multiplication in convolutional layers are generally too large to be
accommodated on an FPGA platform. Loop Tiling is employed to partition
the iteration space of the loop into smaller tiles so that the working data set of
a tile can be easily accommodated using the available on-chip BRAM storage
in FPGAs. Listing 1 shows the tiled matrix multiplication after Loop Tiling.
The highlighted portion (Lines 5-14) is the key computation of a tile and we
accelerate this portion with FPGA-based PEs (explained in Section 3.2.1)
and NEON cores.
Workload Granularity and Computation: Figure 3 shows a tiled
MM example with a 2×2 tile size. In Synergy, the workload granularity of a
tiled MM computation is called a job, which is defined as the computation
required to output a tile, C(i,j), of an output feature map C. A job is a
structure as shown in Listing 2 containing the base addresses of the arrays
(A, B and C), the input data dimensions (m, n and k), the tile index and
the layer ID, which is used to identify the CONV layer that owns the job.
Each CONV layer generates a set of jobs. In a CONV thread, we implement
a courier function that sends the jobs to the accelerators (PEs and NEONs).
When an accelerator gets a job, it first calculates the memory addresses of
the required tiles of input/output feature maps with the base address, data
dimension and tile index provided by the job, and fetches the tiles from the
external DDR memory to its local storage with the memory controller. After
Figure 3: Job: Workload Granularity of a Tiled MM. Tile calculation: $C(i,j) = \sum_{k=1}^{K} A(i,k) \cdot B(k,j)$
computation is completed, the PE stores the output tile back to the DDR.
Listing 2: The Structure of Job
typedef struct {
    /* The base addresses of input and output feature maps */
    DATA_TYPE A_addr; DATA_TYPE B_addr; DATA_TYPE C_addr;
    /* Data dimensions of input/output feature maps */
    DATA_TYPE m; DATA_TYPE n; DATA_TYPE k;
    /* Index used to locate the tile */
    DATA_TYPE t1; DATA_TYPE t2;
    DATA_TYPE layer_id; /* To track the CONV layer */
} job_t;
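A minimal sketch of a courier is shown below: it enumerates the output tiles of a CONV layer's matrix multiplication and pushes one job_t (Listing 2) per tile into the cluster's job queue. The helper queue_push_job is an assumption for illustration and is not Synergy's actual interface; job_t and DATA_TYPE are as defined in Listing 2, and TS is the tile size (32 in this work).

#include <stdlib.h>

extern void queue_push_job(void *job_queue, job_t *job);   /* assumed helper */

void courier_sketch(void *job_queue, DATA_TYPE A_addr, DATA_TYPE B_addr,
                    DATA_TYPE C_addr, int n, int m, int k, int layer_id) {
    /* One job per output tile C(t1, t2) of the N x M output feature map;
     * loop bounds are rounded up so border tiles are included (they are
     * zero-padded as described in Section 3.2.1). */
    for (int t1 = 0; t1 < (n + TS - 1) / TS; ++t1) {
        for (int t2 = 0; t2 < (m + TS - 1) / TS; ++t2) {
            job_t *job = malloc(sizeof(job_t));
            job->A_addr = A_addr;  job->B_addr = B_addr;  job->C_addr = C_addr;
            job->m = m;  job->n = n;  job->k = k;
            job->t1 = t1;  job->t2 = t2;
            job->layer_id = layer_id;
            queue_push_job(job_queue, job);   /* accelerators pop these jobs */
        }
    }
}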
Heterogeneous Accelerators: As we target Xilinx Zynq SoC, Synergy
uses the FPGA-based PEs and two NEON cores in the ARM A9 processor
as the accelerators. A PE is an FPGA implementation of tiled MM. PEs can
have different optimizations, and thus the performance of PEs might differ.
The number of PEs depends on the available resources in the target FPGA
device. We explain the PE design in more detail in Section 3.2.1. To leverage
the NEON cores, we have implemented the MM kernel in NEON assembly
code. This assembly code is encapsulated in two separate software threads,
one corresponding to each NEON core, creating two NEON accelerators.
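Synergy's NEON accelerators are written in hand-optimized assembly; as a rough illustration of the per-tile computation they perform, the following intrinsics-based sketch multiplies two TS x TS tiles and accumulates into the output tile, assuming TS is a multiple of 4 and row-major float buffers.

/* Intrinsics-based sketch of the NEON tile multiply-accumulate; the actual
 * Synergy kernel is hand-written assembly. */
#include <arm_neon.h>

#define TS 32

static void neon_tile_mm(const float a[TS][TS], const float b[TS][TS],
                         float c[TS][TS]) {
    for (int i = 0; i < TS; ++i) {
        for (int j = 0; j < TS; j += 4) {
            float32x4_t acc = vld1q_f32(&c[i][j]);    /* partial sums */
            for (int k = 0; k < TS; ++k) {
                float32x4_t bv = vld1q_f32(&b[k][j]); /* 4 columns of b */
                acc = vmlaq_n_f32(acc, bv, a[i][k]);  /* acc += a[i][k] * bv */
            }
            vst1q_f32(&c[i][j], acc);
        }
    }
}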
Accelerator Clusters: From the software perspective, Synergy groups
the heterogeneous accelerators into clusters. For example, in Figure 2, Cluster-
0 has two NEON cores and two FPGA-based PEs, while Cluster-1 groups
three PEs. Each cluster has a private workload pool, Job Queue, as shown in
Figure 2. A job queue is a synchronous buffer storing the addresses of the jobs.
Each CONV layer is assigned to a cluster by default. Different CONV layers
can share the same cluster, for example CONV-0 and CONV-1 are mapped
to Cluster-0 and CONV-2 uses Cluster-1. The mapping between CONV layers and
clusters is decided by the number of jobs a CONV layer has. A CONV layer with
a smaller workload is mapped onto a less powerful cluster and vice-versa. In
addition, a designer can define the number of clusters and the corresponding
accelerator combinations simply with a hardware configuration file as shown
in Figure 8. In this case, the hardware accelerators will be synthesized and
the required hardware-software interface will be automatically generated in
the Synergy framework (see Section 3.3).
The CONV layers assigned to a cluster send their jobs to the Job Queue
and use all the available accelerators in the cluster. Once the cluster de-
tects jobs in the job queue, it dispatches the jobs to the synchronous buffers
attached to each accelerator. Then, the accelerators work on the jobs and
inform the cluster when they finish.
3.1.2 Delegate Threads
To abstract away the hardware accelerators, we deploy delegate threads in-
troduced in ReconOS [10]. A delegate thread is a software wrapper for an
FPGA-based accelerator, which can execute operating system (OS) calls on
behalf of the associated accelerator. From the OS perspective, the delegate
threads are software threads and can interact with the traditional software
threads.
In Synergy, a delegate thread is created corresponding to each FPGA-
based PE. Once launched, it initializes the hardware system and sends a start
signal to the associated accelerator via the first-in-first-out (FIFO) control
buffer shown in Figure 5. Then, the delegate thread waits for a request from
the accelerator to execute a job. When an accelerator sends a job request, the
delegate thread obtains the address of the job from its associated cluster and
sends it back to the accelerator, waiting for the accelerator’s acknowledgment.
Upon receiving the address of the job, the accelerator obtains the contents of
a job structure, fetches the tile data of input arrays via the memory controller
and performs the MM calculations. Once it finishes, the accelerator issues
a signal to the delegate thread and acknowledges the completion of the tile
calculation. The delegate thread repeats the above steps until all the jobs
are finished.
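The following is a minimal sketch of such a delegate thread's control loop. The FIFO helpers (fifo_read/fifo_write), the job-queue helpers (cluster_pop_job, cluster_mark_done) and the command codes are assumptions for illustration and do not reflect the exact ReconOS/Synergy interface.

#include <stdint.h>

typedef struct {
    int sw2hw, hw2sw;       /* control FIFO handles of the associated PE */
    void *cluster;          /* cluster that owns this PE */
} pe_handle_t;

enum { CMD_START = 1, CMD_STOP = 2, REQ_JOB = 3 };

extern void     fifo_write(int fifo, uint32_t word);              /* assumed */
extern uint32_t fifo_read(int fifo);                              /* assumed */
extern void    *cluster_pop_job(void *cluster);                   /* assumed */
extern void     cluster_mark_done(void *cluster, uint32_t layer); /* assumed */

void *delegate_thread_sketch(void *arg) {
    pe_handle_t *pe = (pe_handle_t *)arg;
    fifo_write(pe->sw2hw, CMD_START);                 /* start the PE */
    for (;;) {
        if (fifo_read(pe->hw2sw) != REQ_JOB)          /* PE asks for work */
            break;
        void *job = cluster_pop_job(pe->cluster);
        if (!job) {                                   /* no jobs left */
            fifo_write(pe->sw2hw, CMD_STOP);
            break;
        }
        fifo_write(pe->sw2hw, (uint32_t)(uintptr_t)job);  /* job address */
        uint32_t layer_id = fifo_read(pe->hw2sw);     /* completion ack */
        cluster_mark_done(pe->cluster, layer_id);
    }
    return NULL;
}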
Figure 4: Work Stealing Execution Flow
3.1.3 Self-balancing: Work Stealing Scheduler
Synergy groups the FPGA-based PEs and the NEON accelerators into mul-
tiple clusters, so that the threads corresponding to multiple CONV layers
can execute concurrently achieving better throughput. This approach also in-
creases the accelerator utilization. However, as the workload of CONV layers
varies depending on the data dimensions, an improper cluster configuration
may lead to workload imbalance among the clusters. Some clusters might
complete their workload early and stay idle, wasting precious computing re-
sources. Therefore, clusters should be carefully partitioned and statically
mapped to different CONV layers, so that the runtime of each cluster spent
on processing the associated workload is balanced [16]. This can increase the
accelerator utilization and improve the performance. However, finding the
optimal cluster configuration is not easy. It requires profiling the performance
of different cluster combinations for the input data dimensions of the CONV
layers of the specific CNN model and performing a detailed design space explo-
ration to identify the best cluster configuration for static mapping. Then the
identified clusters and PEs have to be synthesized on the FPGA. However,
this approach is challenging and time-consuming, especially without exten-
sive FPGA expertise. In Synergy, we introduce a dynamic workload balancing
technique, work-stealing, to bypass this optimization problem.
This self-balancing technique is based on the job granularity and does
not require the best cluster configuration as the idle cluster can steal jobs
from the busy clusters. Synergy enables work stealing by introducing a thief
thread. The thief thread consists of a manager, an idle book and a stealer. The
manager checks the status (idle or busy) of the clusters and activates the
stealer if necessary. The idle book records IDs of the idle clusters, while the
stealer can steal jobs from the victim clusters and push these jobs to the idle
clusters. Figure 4 shows the work-stealing flow. Initially, Synergy dispatches
the jobs of different CONV layers to job queues of different clusters. Due
to the workload imbalance of the CONV layers, some clusters may finish
the assigned workload earlier and remain idle. Let us assume that Cluster-0
finishes first and Cluster-1 is still busy. Cluster-0 then notifies the manager
of the thief thread, as its work has been done. The manager records Cluster-0
in the idle book and activates the stealer. After activation, the stealer tries
to steal jobs from the clusters that are not in the idle book. Once it succeeds,
the stealer dispatches the jobs to the idle clusters and the manager removes
the clusters from the idle book. In this manner, Synergy can fully utilize the
accelerator resources and achieve load balancing. Different from the static
mapping technique, the work-stealing approach does not rely on any specific
cluster configuration to achieve workload balance. It eases the pressure of
seeking the best cluster configuration and does not require the designer’s effort.
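The sketch below illustrates the manager/idle-book/stealer logic of the thief thread. The helpers cluster_is_idle, queue_steal and queue_push are assumed for illustration, and the polling loop is a simplification of Synergy's notification-based manager.

#include <unistd.h>

#define MAX_STEAL 16

typedef struct cluster cluster_t;
extern int  cluster_is_idle(cluster_t *c);                        /* assumed */
extern int  queue_steal(cluster_t *victim, void **jobs, int max); /* assumed */
extern void queue_push(cluster_t *idle, void **jobs, int n);      /* assumed */

void thief_thread_sketch(cluster_t **clusters, int num_clusters) {
    for (;;) {
        for (int i = 0; i < num_clusters; ++i) {
            /* Manager: record idle clusters (the idle book). */
            if (!cluster_is_idle(clusters[i]))
                continue;
            /* Stealer: take jobs from a busy victim cluster ... */
            for (int v = 0; v < num_clusters; ++v) {
                if (v == i || cluster_is_idle(clusters[v]))
                    continue;
                void *stolen[MAX_STEAL];
                int n = queue_steal(clusters[v], stolen, MAX_STEAL);
                if (n > 0) {
                    /* ... and push them to the idle cluster's job queue. */
                    queue_push(clusters[i], stolen, n);
                    break;
                }
            }
        }
        usleep(100);   /* pacing for this polling sketch */
    }
}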
3.1.4 Other Layers and Preprocessing functions
A CNN contains many other layers, which are executed by the ARM CPU
cores in the Synergy framework. Fully connected (FC) Layer: This layer is
usually used at the end of a network to compute the class scores, resulting in
as many outputs as there are classes. Pooling layer: This layer progressively
reduces the spatial size of the output from the previous layer to reduce the
amount of parameters and computation in the network. Activation layer:
This layer comprises a non-linear function that does a 1-to-1 mapping of
each of the outputs from the previous layer to an activation value. Synergy
supports all kinds of activation functions.
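As simple illustrations of the CPU-side work these layers involve, the following sketches show a ReLU activation and a 2x2 max-pooling pass; they are written for exposition and are not taken from the Darknet implementation.

#include <math.h>

/* 1-to-1 non-linear mapping of each output to an activation value. */
static void relu_layer(float *x, int n) {
    for (int i = 0; i < n; ++i)
        x[i] = x[i] > 0.0f ? x[i] : 0.0f;
}

/* 2x2 max pooling: halves the spatial size of an h x w feature map. */
static void maxpool2x2(const float *in, float *out, int h, int w) {
    for (int i = 0; i < h / 2; ++i)
        for (int j = 0; j < w / 2; ++j) {
            float m = in[(2 * i) * w + (2 * j)];
            m = fmaxf(m, in[(2 * i) * w + (2 * j + 1)]);
            m = fmaxf(m, in[(2 * i + 1) * w + (2 * j)]);
            m = fmaxf(m, in[(2 * i + 1) * w + (2 * j + 1)]);
            out[i * (w / 2) + j] = m;
        }
}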
A CNN also contains a few preprocessing functions within the layers
such as im2col and normalization that take non-negligible time on embedded
CPUs. im2col (mentioned in Section 3.1.1) is used for data layout transfor-
mation. Normalization is used to scale all the input data to values between 0
and 1 during the inference phase. The overheads of these sequential portions
are partially hidden by the HW/SW multi-threaded pipeline in Synergy.
Figure 5: The Hardware Architecture
3.2 Hardware Architecture
Figure 5 shows an example Synergy hardware architecture with
four FPGA-based PEs. The architecture is adapted from ReconOS [10].
In this architecture, the software communicates with the hardware acceler-
ators via control FIFOs (if_hw2sw and if_sw2hw). On the software side, a
delegate thread (DelegateT) interacts with other software threads on behalf
of the associated PE. Data transactions of a PE are handled by the Memory
Subsystem via two FIFOs (if_hw2mem and if_mem2hw). In the following
subsections, we discuss the accelerator design, memory subsystem, and the
hardware architecture generator.
3.2.1 Accelerator Design
As mentioned earlier, Synergy processes CONV layers as matrix multiplica-
tion (MM) operations accelerated using NEON cores and FPGA-based PEs.
In this section, we mainly focus on FPGA-based accelerators and discuss
several design challenges. The FPGA-based accelerator for MM is the pro-
cessing engine (PE) shown in Figure 5, which is generated by a commercial
high-level synthesis (HLS) tool, Vivado HLS [22]. HLS is used to convert
algorithms in a high-level specification (i.e., C/C++) into hardware description languages
(VHDL/Verilog). It provides optimization options, a.k.a. pragmas, such as
loop unrolling, array partitioning and pipelining, to explore diverse hardware
architectures with different area-performance tradeoffs.
As mentioned in Section 3.1.1, due to the large input size of MM in
CONV layers, we deploy Loop Tiling on MM and partition the iteration
space of the loop into smaller tiles so that the data of a tile can be easily
accommodated in the available BRAM. Loop Tiling exposes the potential parallelism
of MM as different tiles are independent. We exploit this parallelism by
instantiating multiple PEs under the FPGA resource budget to process the tiles
simultaneously, while exploring hardware architectures of a PE with HLS
pragmas. Opening up more parallelism per PE limits the number of PEs
that can be accommodated on the FPGA due to resource constraints [26].
Listing 3: Pseudo Code for the HLS Template of a PE
1  Processing_Engine(if_sw2hw, if_hw2sw,
2                    if_hw2mem, if_mem2hw) {
3    /* Simplified pragmas */
4    #pragma interface ap_fifo port=if_sw2hw
5    #pragma interface ap_fifo port=if_hw2sw
6    #pragma interface ap_fifo port=if_hw2mem
7    #pragma interface ap_fifo port=if_mem2hw
8    #pragma interface ap_ctrl_none port=return
9    ...
10   wait_for_start_signal();
11   job_t job;
12   while (1) {
13     uint32 job_address = ask_for_a_job();
14     job = read_job(job_address);
15     parse_job(job, &Aaddr, &Baddr, &Caddr,
16               &m, &n, &k, &t1, &t2, &layerID);
17     tiled_mm(Aaddr, Baddr, Caddr, m, n, k, t1, t2,
18              if_hw2mem, if_mem2hw);
19     send_acknowledgment(layerID); }
20 }
Processing Engine (PE): The pseudo code for the HLS template in
Listing 3 demonstrates the general execution flow of a PE. A PE inter-
acts with its associated delegate thread in the user space via control FIFOs
(if_hw2sw and if_sw2hw). For data transactions, the PE cooperates with the
Memory Subsystem (Section 3.2.2) through memory FIFOs (if_hw2mem and
if_mem2hw). At the beginning, the PE waits for a start signal issued from
its associated delegate thread. Lines 13-19 in Listing 3 show the logic to
compute a job. The PE first acquires a job by sending requests to the del-
egate thread. The real computation of MM is performed in tiled_mm (referred
to as mm_tile hereafter), whose skeleton is shown in Lines 5-14 of Listing 1. The
function can be summarized as the following five steps: (1) it computes the locations
of the required tiles of the input arrays (A and B) in the main memory; (2) it
then fetches a tile of data to local memory (a and b); (3) it performs matrix
multiplication and accumulates the partial result into a local array c; (4) mm_tile
repeats Step 1 until it exhausts a row of A and a column of B; (5) mm_tile
stores the output data back to the main memory. An acknowledgment is
sent to the delegate thread once the PE finishes computation.
Computation optimizations in mm_tile: Loop pipelining is a crucial
optimization option provided by HLS. As the technique can overlap the exe-
cution of operations from different iterations, it has great impact on system
throughput. Throughput of a loop depends on the initiation interval (II),
which is defined as the number of cycles between consecutive initiations of
the loop. In this work, we apply loop pipelining at loop2 in Listing 1. With
the optimization, the HLS tool merges loop1 and loop2 into a new loop with
a larger loop bound ($newBound = TS \times TS$) and completely unrolls the in-
nermost loop (loop3). We define $lat_{loop3}$ as the latency of loop3. Then the
latency $lat_{kernel}$ of the nested loop for kernel computation is calculated as
$lat_{kernel} = (newBound - 1) \times II + lat_{loop3}$. When $newBound$ is large enough,
$lat_{kernel}$ of the nested loop is decided by $II$.
As operations inside loop3 in Listing 1 have no data dependence, when
loop3 is completely unrolled, operations in different iterations can be ideally
executed in parallel. However, the parallelism is constrained by the memory
bandwidth. Local buffers (a and b) are implemented with FPGA BRAM re-
sources. By default, a local buffer has only two read-ports and one write-port.
Thus, when loop3 is completely unrolled, only two memory read requests to
each buffer (a and b) can be served, even if TS read requests are gener-
ated. This makes II equal to TS/2 and limits the performance of an accelerator.
To improve II, we can leverage the array partitioning pragma to split the
buffer into multiple banks where each bank has two read-ports and one write-
port. With loop pipelining and array partitioning, the accelerator requires
more multiplication and addition units, and thus more compute resources.
Opening up more parallelism per PE limits the number of PEs that can be
accommodated on the FPGA due to resource constraints. Given an FPGA device,
the tile size, the settings for HLS pragmas, and the number of PEs can be
decided automatically via design space exploration (DSE) [26].
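The sketch below shows how loop pipelining and array partitioning could be applied to the kernel loop nest of Listing 1 in Vivado HLS. The partition factor shown is only an example; as noted above, the actual tile size, pragma settings and number of PEs are chosen through DSE.

#define TS 32

void kernel_sketch(float a[TS][TS], float b[TS][TS], float c[TS][TS]) {
/* Split each local BRAM buffer into banks so that the unrolled loop3 can
 * issue several reads per cycle; the factor 8 is illustrative. */
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8 dim=2
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8 dim=1
loop1: for (int i = 0; i < TS; ++i) {
loop2:   for (int j = 0; j < TS; ++j) {
/* Pipelining loop2 makes the tool unroll loop3 completely; the achieved II
 * depends on how many read ports the partitioned buffers expose. */
#pragma HLS PIPELINE
loop3:     for (int k = 0; k < TS; ++k) {
             c[i][j] += a[i][k] * b[k][j];
           }
         }
       }
}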
Communication optimization in mm_tile: For tiled matrix multi-
plication, Synergy overlaps the data transfer cost with the computation cost
by leveraging double buffering, i.e., instantiating two buffers for each local
array. This significantly improves the throughput of tiled MM.
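A simplified sketch of the double-buffering (ping-pong) structure for the t3 loop of Listing 1 is shown below; data_copy and kernel_compute stand in for Steps 1 and 2 of Listing 1, and the scheduling shown is a software-style illustration of the overlap rather than the exact HLS implementation.

#define TS 32

extern void data_copy(float *ddr_a, float *ddr_b,
                      float a[TS][TS], float b[TS][TS], int tile);  /* assumed */
extern void kernel_compute(float a[TS][TS], float b[TS][TS],
                           float c[TS][TS]);                        /* assumed */

void tiled_mm_double_buffered(float *ddr_a, float *ddr_b,
                              float c[TS][TS], int K) {
    float a0[TS][TS], b0[TS][TS];    /* ping buffers */
    float a1[TS][TS], b1[TS][TS];    /* pong buffers */
    int num_tiles = K / TS;

    data_copy(ddr_a, ddr_b, a0, b0, 0);                 /* prefetch first tile */
    for (int t3 = 0; t3 < num_tiles; ++t3) {
        if (t3 % 2 == 0) {
            if (t3 + 1 < num_tiles)                     /* fetch next into pong */
                data_copy(ddr_a, ddr_b, a1, b1, t3 + 1);
            kernel_compute(a0, b0, c);                  /* compute on ping */
        } else {
            if (t3 + 1 < num_tiles)                     /* fetch next into ping */
                data_copy(ddr_a, ddr_b, a0, b0, t3 + 1);
            kernel_compute(a1, b1, c);                  /* compute on pong */
        }
    }
}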
Zero Padding in mm_tile: In Synergy, the hardware accelerators are
shared among the convolutional layers. This implies that the same MM acceler-
ator of fixed size is used in different layers. As the loop bounds (or data
dimensions) of MM in different convolutional layers are different, we may
encounter scenarios where the fixed-size MM accelerator attempts to read
data outside the bounds of the input matrices or write data outside the
Figure 6: Virtual To Physical Address Translation [1]
bounds of the output matrix. Hence, we include border detection in mm_tile.
When fetching data, if the memory address exceeds the matrix border, the
specific portion of the local buffer will be set to zero. Similarly, for writing
data, mm_tile ignores write requests if a memory address exceeds the given
matrix borders.
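A minimal sketch of the border check when fetching an input tile is shown below: elements of the local buffer that would fall outside the N x K input matrix are zeroed, so the fixed-size MM produces correct results at the matrix edges. The function name and layout are illustrative, not the exact accelerator code.

#define TS 32

void fetch_tile_with_padding(const float *ddr_a, float a[TS][TS],
                             int n, int k, int t1, int t3) {
    for (int i = 0; i < TS; ++i) {
        for (int j = 0; j < TS; ++j) {
            int row = t1 * TS + i;       /* row in the full N x K matrix */
            int col = t3 * TS + j;       /* column in the full N x K matrix */
            if (row < n && col < k)
                a[i][j] = ddr_a[row * k + col];
            else
                a[i][j] = 0.0f;          /* zero padding beyond the border */
        }
    }
}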
3.2.2 Memory Subsystem
The Memory Subsystem shown in Figure 5 is used to process memory re-
quests from multiple PEs. It consists of memory arbiters (MEM Arbiter),
memory management units (MMUs), memory controllers (MEM Controllers),
a Proc Arbiter and a Proc unit. The MMU is used to translate virtual addresses
to physical addresses, while the MEM Arbiter is employed to allocate memory
transaction requests to the shared MMU. The Proc unit is used to obtain the first-
level translation page table address and handle page fault requests, and the Proc
Arbiter allows multiple MMUs to access the Proc unit. The MEM Controllers are
implemented to access the DDR memory with the AXI4 burst mode protocol.
All the components in the Memory Subsystem are written in RTL code and
constitute the Hardware Template Library as shown in Figure 8.
Virtual to Physical Address Translation: In a traditional HW/SW
co-design approach, a device driver normally has a continuous memory ad-
dress space in the Linux kernel. When a delegate thread tries to communicate
with an FPGA PE, it first copies data from the user space to the allocated
continuous memory (kernel space) in the device driver and sends the phys-
ical address of the memory to the PE. Then the PE obtains the data from
the DDR memory via the MEM Controllers. In Synergy, we avoid the extra
data copy in the acceleration of CONV layers. As mentioned in Section 3.1.2
and 3.2.1, a PE obtains an address of a job directly from the delegate thread
in the user space and the job content includes the base memory address of
input/output arrays in the user space. Those are virtual addresses. In ARM
Cortex-A9 architecture [1], virtual addresses are translated to physical ad-
dresses by a two-level (L1 and L2) page table walk as shown in Figure 6.
The base address of the L1 page table is stored in a CPU system register
R[1], which can be accessed in the kernel space. Synergy supports this two-
level page table walk in FPGA. During the FPGA initialization in Synergy,
the Proc unit obtains the base address of the L1 page table via its device
driver. Then, the Memory Subsystem translates the virtual address to phys-
ical address following the steps in Figure 6. In case of a page fault, the Proc
unit triggers a CPU interrupt, obtains a new base address and repeats the
translation process.
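The following C model sketches the two-level walk for the ARMv7 short-descriptor format with 4 KB small pages, which is the translation the FPGA Memory Subsystem performs; section mappings and permission bits are ignored, and read_phys/page_fault are assumed stand-ins for a physical memory read and the Proc unit's fault handling.

#include <stdint.h>

extern uint32_t read_phys(uint32_t paddr);      /* assumed helper */
extern uint32_t page_fault(uint32_t vaddr);     /* assumed: triggers CPU interrupt */

uint32_t translate_sketch(uint32_t l1_table_base, uint32_t vaddr) {
    uint32_t l1_index = vaddr >> 20;                      /* VA[31:20] */
    uint32_t l1_desc  = read_phys(l1_table_base + l1_index * 4);
    if ((l1_desc & 0x3) != 0x1)                           /* not a page table entry */
        return page_fault(vaddr);

    uint32_t l2_base  = l1_desc & 0xFFFFFC00;             /* L2 table address */
    uint32_t l2_index = (vaddr >> 12) & 0xFF;             /* VA[19:12] */
    uint32_t l2_desc  = read_phys(l2_base + l2_index * 4);
    if ((l2_desc & 0x2) == 0)                             /* no small page mapped */
        return page_fault(vaddr);

    uint32_t page_base = l2_desc & 0xFFFFF000;            /* PA[31:12] */
    return page_base | (vaddr & 0xFFF);                   /* add the page offset */
}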
Figure 7: Single-MMU vs. Multi-MMU Performance ((a) ReconOS, (b) Synergy; performance in ms and speedup over a single PE versus the number of PEs)
Multiple MMU Support: The ReconOS architecture [10] contains a single
MMU and MEM Controller. The memory transactions from the PEs com-
pete for the resources in the Memory Subsystem. As the number of PEs
increases, the memory contention significantly degrades the system perfor-
mance as shown in Figure 7a. To solve the problem, Synergy instantiates
multiple MMUs with at most two PEs sharing an MMU and MEM Con-
troller. As the frequency of page faults is generally low in our case, multiple
MMUs in Synergy share the same Proc unit via the arbiter logic Proc Arbiter.
Figure 7b shows that the performance speedup increases linearly as we in-
stantiate more PEs in Synergy.
Example *.hw_config (simplified, as shown on the left of Figure 8):
    cluster_num = 2
    clockFreq = 100000000
    tile_size = 32
    fifo_os = 128
    fifo_mem = 128
    hw@PE0
        hlsopt = pragmas0.tcl
        cluster = 0
    hw@PE1
        hlsopt = pragmas1.tcl
        cluster = 0
    hw@PE2
        hlsopt = pragmas2.tcl
        cluster = 0
    hw@PE3
        hlsopt = pragmas3.tcl
        cluster = 1
    ...

Figure 8: Hardware Architecture Generator
3.3 Hardware Architecture Generator
Synergy provides a default accelerator architecture on a given FPGA device.
However, for a new FPGA device or in case the developer is interested in cus-
tomizing the accelerator architecture corresponding to a CNN model, Synergy
provides an architecture generator as shown in Figure 8. This automates the
processes of generating the PEs with HLS, the Hardware Architecture, and the final
FPGA bitstream. The input to the generator is a configuration file, *.hw_config,
containing the architecture parameters. The simplified format of this config-
uration file is shown in the left side of Figure 8, which creates the Hardware
Architecture shown in Figure 5. Moreover, based on the configuration file, the
generator also compiles the HW/SW multi-threading library (Synergy Library)
to provide the APIs required in Section 3.1.
4 Experimental Evaluation
In this section, we evaluate the Synergy framework with multiple represen-
tative CNN models.
Synergy has been implemented on heterogeneous SoC platforms Zed-
board [22] and Xilinx ZC702, both featuring the same Xilinx Zynq XC7Z020
device. Xilinx Zynq XC7Z020 is a low-end SoC in terms of its compute
capability and the availability of limited FPGA resources. We report the
performance and power numbers from the Xilinx ZC702 evaluation board
because it has the power measurement support. All performance results are
collected using an FPGA-based timer.
We use Darknet [14] as our deep learning package. Darknet is an open
source neural network framework written in C. We use Darknet because it has
a highly-optimized single-threaded software implementation and does not de-
pend on any external library. We first compile Darknet for the ARM core in
the Xilinx Zynq device. Apart from the single-threaded software implementa-
tion, we create a multi-threaded pipelined version of Darknet to take advan-
tage of inter-frame parallelism for high-throughput CNN implementation.
The CPU-only implementations for various CNNs in this section are well-
optimized and compiled by gcc with -O3 optimization. As Darknet uses 32-bit
floating-point CNN models, we also use 32-bit floating-point implementation
both in software and hardware accelerators. The performance-power numbers
of Synergy will improve substantially if 32-bit floating-point implementation
is replaced by N-bit fixed-point implementation where N << 32. However,
this optimization is orthogonal and complementary to our current approach.
Even with floating-point, we achieve better throughput and energy-efficiency
compared to contemporary fixed-point implementations on the same device.
We write assembly-language code to generate highly optimized NEON ac-
celerators for the tiled matrix-multiplication operations. For hardware accel-
erator generation, we use Vivado Design Suite version 2016.2 for High-Level
Synthesis (HLS). The tiled matrix multiplication code is written in C and is
synthesized on the FPGA using Vivado HLS with appropriate pragma settings as
presented in Section 3.2.1. Synergy uses a two-cluster configuration (Cluster-0: 2 NEONs
+ 2 S-PEs; Cluster-1: 6 F-PEs) across all benchmarks. The
cluster configuration is chosen based on power/performance results across
multiple CNNs, and the work-stealing technique ensures that other CNN
applications work well on this fixed hardware architecture as well by
balancing the workload at runtime. The FPGA-based PEs run at 100MHz. The
HW/SW multi-threading is implemented by adapting the ReconOS open-source
operating system for reconfigurable computing [10][15]. The ARM cores run
Linux, which is augmented with ReconOS to interface with the hardware
accelerators.
The entire Synergy framework is set up on a PC with an Intel Xeon CPU
E5-2620 core running at 2.10GHz with 64GB RAM, running Ubuntu 14.04
OS. Given a CNN model, the Synergy framework is responsible for generating
the appropriate software threads corresponding to the different layers of the
network, interfacing the software threads with the delegate threads of the
hardware accelerators, and creating a default mapping between the CONV
Table 2: Network Architectures of Benchmark CNN Models

Benchmark            CONV Layers   Num. of Layers   Description
CIFAR Darknet [14]        4              9           Object Recognition
CIFAR Alex [5]            3              8           Object Recognition
CIFAR Alex+ [5]           3              9           Object Recognition
CIFAR full [7]            3              9           Object Recognition
MNIST [9]                 2              7           Digit Recognition
SVHN [12]                 3              8           Digit Recognition
MPCNN [11]                3              9           Gesture Recognition
layers and the accelerator clusters. The Synergy framework can also automate
the FPGA bitstream generation given a hardware accelerator architecture
configuration by the designer to customize Synergy implementation for a
particular CNN model (if desired) or generate one for a new device.
Benchmarks: Table 2 shows the seven CNN models used in this work, all
trained with Darknet.
4.1 Synergy Throughput and Energy-Efficiency
Throughput: Compared with the original single-threaded Darknet imple-
mentation running on the ARM core, Synergy achieves an average 7.3x throughput
improvement, as shown in Figure 9.
Power and Energy Consumption: Figure 10 depicts the power distri-
bution and energy consumption of the Synergy system. The FPGA logic accounts
for only 27% of the total power consumption (around 2.08 W) averaged across
all CNN models. The ARM cores and the DDR memory account for most
of the power consumption. Compared with the power (1.52 W on average)
measured for the CPU+NEON only implementations, Synergy incurs 36.63%
more power consumption.
Table 3 shows the energy and performance per watt comparison between
the original single-threaded Darknet implementation running on ARM cores
and the Synergy design. Considering power consumption, the Synergy design
consumes 36.63% more power on average, as it fully leverages the heterogene-
ity of the ZYNQ platform. Although the power consumption increases, the
Synergy implementation achieves much higher throughput (7.3x speedup),
Figure 9: Throughput improvement using Synergy (frames/sec; per-model speedups of 4.5x to 9.4x over the original Darknet)
and thus reduces energy consumption by 80.13%, averaged across all CNN mod-
els, compared to the original Darknet on ARM cores.
FPGA Resource Consumption: Hardware accelerators generated by
Vivado HLS have a great impact on FPGA resource consumption. With the
limited FPGA resource budget, opening up more parallelism via HLS prag-
mas reduces the number of hardware accelerators that can fit in the Xilinx
ZC702. Therefore, we explore diverse architectures of hardware accelerators
by traversing different tile sizes and HLS pragma combinations consisting of loop
unrolling, loop pipelining and array partitioning. In this work, the tile size is
set to 32 based on empirical evaluation. On the ZC702 device we instantiate 6
faster FPGA-based processing engines (F-PE) with the loop pipelining pragma
applied at loop2 in Listing 1 and 2 slower PEs (S-PE) with loop unrolling
(factor = 2) and loop pipelining at loop3.
Comparison with State-of-the-art: Table 4 compares Synergy with
the recent FPGA-based CNN works. Note that CaffePresso [6] is using a
development platform with significantly more resources, and is running at a
higher clock speed. Moreover, as Darknet does not support data quantization
and fixed-point implementations, Synergy uses a 32-bit floating-point de-
sign that consumes much more resources than 32/16-bit fixed-point designs
on FPGAs. As shown in Table 4, even though we have handicapped our-
selves with floating-point operations, our implementations (both CIFAR full
and MNIST) are superior to [6][21] in terms of throughput (frames per sec-
ond), giga-operations-per-second (GOPS), and energy consumption. Com-
Figure 10: Power Distribution and Energy Consumption (FPGA, ARM and DDR power in W; energy in mJ/frame)
pared to [20], GOPS of our MNIST and MPCNN designs achieve 4.5x and
1.8x speedup, respectively. Table 4 demonstrates that Synergy can provide
high-throughput and energy-efficient mapping of CNN models on embedded
heterogeneous SoC platforms.
4.2 Advantage of Heterogeneity
We now investigate the impact of heterogeneity in improving the CNN
performance in Synergy. Figure 11 shows the latency of different non-
pipelined CNN implementations (single-threaded, leveraging a single ARM
core): CPU+NEON, CPU+FPGA, and CPU+Het (which consists of FPGA
and NEON accelerators), compared to the baseline single-core ARM design.
Compared to the CPU+FPGA design, the heterogeneous implementation
with FPGA and NEON (CPU+Het) improves the latency by 12% on aver-
age, with a maximum improvement of 45% for the MPCNN model.
Figure 12 shows the throughput speedup of the different pipelined CNN
implementations (multi-threaded, using both ARM cores): CPU+NEON, CPU+FPGA,
and CPU+Het, again relative to the baseline single-core ARM design. Compared
to the CPU+FPGA designs, the heterogeneous CPU+Het implementations achieve
15% better throughput on average, with a maximum improvement of 37% for the
MNIST benchmark.
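As an illustration only, the sketch below shows the general structure of such
a pipelined, multi-threaded inference loop: each pipeline stage runs as a
POSIX thread and stages are connected by a bounded FIFO, so the
accelerator-backed CONV stage and the CPU stage operate on different frames
concurrently. The queue, the stage split, and the placeholder functions
(run_conv_layers_on_accelerators, run_remaining_layers_on_cpu) are our
assumptions for exposition, not the actual Synergy code.

#include <pthread.h>
#include <stdio.h>

#define QSIZE   4            /* bounded FIFO depth between pipeline stages */
#define NFRAMES 16

typedef struct {             /* simple blocking FIFO of frame ids */
    int buf[QSIZE], head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} fifo_t;

static void fifo_init(fifo_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void fifo_push(fifo_t *q, int v) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QSIZE) pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = v; q->tail = (q->tail + 1) % QSIZE; q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int fifo_pop(fifo_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0) pthread_cond_wait(&q->not_empty, &q->lock);
    int v = q->buf[q->head]; q->head = (q->head + 1) % QSIZE; q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return v;
}

static fifo_t conv_to_fc;    /* FIFO between the CONV stage and the rest */

/* Stage 1: CONV layers of each frame, offloaded to the accelerator clusters. */
static void *conv_stage(void *arg) {
    (void)arg;
    for (int f = 0; f < NFRAMES; f++) {
        /* run_conv_layers_on_accelerators(f);   placeholder for the offload */
        fifo_push(&conv_to_fc, f);
    }
    fifo_push(&conv_to_fc, -1);      /* end-of-stream marker */
    return NULL;
}

/* Stage 2: remaining layers (e.g., FC, softmax) on the second ARM core. */
static void *fc_stage(void *arg) {
    (void)arg;
    int f;
    while ((f = fifo_pop(&conv_to_fc)) != -1) {
        /* run_remaining_layers_on_cpu(f);       placeholder */
        printf("frame %d done\n", f);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    fifo_init(&conv_to_fc);
    pthread_create(&t1, NULL, conv_stage, NULL);
    pthread_create(&t2, NULL, fc_stage, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}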
Table 3: Energy and Performance per Watt Comparison: Original Darknet
Versus Synergy

                 Energy (mJ/frame)                    Performance per watt (GOPS/W)
Benchmarks       Original   Synergy   Reduction (%)   Original   Synergy   Speedup
CIFAR Darknet    142.18     25.36     -82.16          0.14       0.80      5.61x
CIFAR Alex       105.03     23.43     -77.70          0.16       0.80      4.48x
CIFAR Alex+      326.62     55.81     -82.91          0.16       0.70      5.85x
CIFAR full       196.41     33.71     -82.84          0.13       0.94      5.83x
MNIST            112.90     22.78     -79.83          0.20       0.78      4.96x
SVHN             193.67     28.07     -85.50          0.14       0.98      6.90x
MPCNN             47.87     14.37     -69.99          0.20       0.68      3.33x
Mean                                  -80.13                               5.28x
Table 4: Comparison With Recent FPGA-based CNN Works. '*' indicates
values estimated from charts; '-' indicates values not listed.

                      CaffePresso [6]       fpgaConvNet [19][20]   DeepBurning [21]      Synergy
Device                7Z045                 7Z020                  7Z020                 7Z020
Clock (MHz)           180                   100                    100                   100
Precision             16-bit Fixed-point    16-bit Fixed-point     Fixed-point           32-bit Floating-point
Benchmarks            MNIST   CIFAR full    MNIST   MPCNN          MNIST   CIFAR full    MNIST   CIFAR full   MPCNN
Latency (ms)          16.0    28.0          14.3    21.4           -       -             24.3    33.2         12.2
Throughput (frames/s) 62.5    35.7          69.9    46.7           -       -             96.2    63.5         136.4
GOPS                  1.19    0.94          0.48    0.74           1.33*   1.23*         2.15    1.67         1.33
Energy (mJ/frame)     >200*   >500*         150*    63             -       -             22.8    33.7         14.4
4.3 Transparent Accelerators: Work Stealing
We now show the advantage of dynamic load balancing across accelerators
using work-stealing in Synergy versus static mapping of the CONV layers to
the accelerators. We consider two different cluster and PE configurations for
static mapping. The first configuration consists of the same two clusters
(Cluster-0: 2 NEONs + 2 S-PE; Cluster-1: 6 F-PE) that Synergy uses across all
benchmarks; but unlike Synergy, the CONV layers are statically assigned to
the clusters based on their workload. We refer to this as
static-mapping+fixed-architecture (SF). Figure 13 shows that the SF designs
achieve 6.1x better throughput than the well-optimized CPU designs.
However, the SF designs are inefficient because the workload assigned to the
clusters may not be balanced, given the different computation requirements
of each CONV layer.
Figure 11: Latency Improvement with Accelerators Compared to CPU-only
Solutions for Non-Pipelined Designs. Het: Heterogeneous Accelerators with
NEON and FPGA. (Y-axis: speedup over the CPU-only baseline (1.00x).
Per-benchmark speedups -- CPU+NEON: 1.78, 1.52, 1.70, 1.92, 1.50, 1.41, 1.01;
CPU+FPGA: 3.30, 2.69, 3.78, 3.75, 2.27, 4.69, 1.84; CPU+Het: 3.41, 2.79,
3.84, 3.87, 2.83, 4.81, 2.68.)
Figure 14a presents the execution time of each CONV layer in the CIFAR Alex
model under this configuration. The CONV-0 layer is mapped to Cluster-0,
while the CONV-1 and CONV-2 layers are mapped to Cluster-1. As shown in
Figure 14a, the runtimes of Cluster-0 and Cluster-1 are 24.3 ms and 12.3 ms
per frame, respectively. This imbalance in execution time between the
clusters leads to poor cluster utilization and throughput.
Synergy employs work-stealing to automatically balance the workload across
the clusters. This is what makes Synergy network-agnostic: jobs from
different CONV layers are automatically distributed across the clusters, so
the system is self-balancing. With the same generic cluster configuration
used in the SF designs, Figure 13 shows that Synergy improves throughput by
24% on average over the SF designs. The improvement comes from the balanced
clusters. Figure 14b presents the execution time of each CONV layer of the
Synergy design for the CIFAR Alex benchmark. The runtimes of Cluster-0 and
Cluster-1 are 22.2 ms and 20.9 ms per frame, respectively; compared to the
SF design in Figure 14a, the workloads of Cluster-0 and Cluster-1 are now
balanced.
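The sketch below illustrates the job-level work-stealing idea only; the data
structures, function names, and job fields are our own simplification (the
real Synergy scheduler dispatches tiled-MM jobs to NEON and FPGA PEs behind
a unified accelerator interface). Each cluster owns a queue of tiled-MM
jobs; when a cluster runs out of local jobs, it steals from the other
cluster instead of idling.

#include <pthread.h>

#define MAXJOBS   1024
#define NCLUSTERS 2

typedef struct {             /* one tiled-MM job produced by a CONV layer */
    int layer, tile_row, tile_col;
} job_t;

typedef struct {             /* per-cluster job deque (simplified, lock-based) */
    job_t jobs[MAXJOBS];
    int head, tail;          /* owner pops from the tail, thieves steal from the head */
    pthread_mutex_t lock;
} deque_t;

static deque_t cluster_q[NCLUSTERS];

static void queues_init(void) {
    for (int c = 0; c < NCLUSTERS; c++) {
        cluster_q[c].head = cluster_q[c].tail = 0;
        pthread_mutex_init(&cluster_q[c].lock, NULL);
    }
}

/* Called by the CONV-layer thread that produces jobs for a cluster. */
static void push_job(int cluster, job_t j) {
    deque_t *q = &cluster_q[cluster];
    pthread_mutex_lock(&q->lock);
    if (q->tail < MAXJOBS) q->jobs[q->tail++] = j;
    pthread_mutex_unlock(&q->lock);
}

static int pop_local(deque_t *q, job_t *out) {
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->tail > q->head) { *out = q->jobs[--q->tail]; ok = 1; }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

static int steal(deque_t *q, job_t *out) {
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->tail > q->head) { *out = q->jobs[q->head++]; ok = 1; }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Called by a cluster's dispatcher thread: prefer local jobs, otherwise
 * steal from another cluster so that no cluster sits idle. */
static int get_next_job(int my_cluster, job_t *out) {
    if (pop_local(&cluster_q[my_cluster], out)) return 1;
    for (int c = 0; c < NCLUSTERS; c++)
        if (c != my_cluster && steal(&cluster_q[c], out)) return 1;
    return 0;                /* no work left anywhere */
}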
Finally, we show that Synergy's work-stealing with a generic cluster
architecture is competitive with, and even better than, CNN-model-specific
customized cluster configurations, which we call
static-mapping+custom-architecture (SC) designs. In the SC designs, we find
the best multi-cluster configuration for each CNN model by exploring all
possible cluster configurations.
Figure 12: Throughput Improvement with Accelerators Compared to CPU-only
Solutions for Pipelined Designs. Het: Heterogeneous Accelerators with NEON
and FPGA. (Y-axis: speedup over the CPU-only baseline. Per-benchmark
speedups -- CPU+NEON: 1.89, 2.14, 2.23, 2.26, 1.61, 2.23, 1.29; CPU+FPGA:
6.96, 5.80, 7.95, 7.49, 4.81, 7.47, 4.12; CPU+Het: 7.82, 5.95, 8.71, 8.15,
6.62, 9.44, 4.47.)

Figure 13: Advantage of Work Stealing. (Y-axis: speedup over the CPU-only
baseline. Per-benchmark speedups -- SF (static mapping + fixed architecture):
7.32, 4.31, 7.07, 7.05, 4.82, 7.65, 3.55; SC (static mapping + custom
architecture): 7.69, 5.85, 8.11, 7.81, 5.98, 8.38, 4.34; Synergy (dynamic
mapping + fixed architecture): 7.82, 5.95, 8.71, 8.15, 6.62, 9.44, 4.47.)
The best multi-cluster configurations are shown in Table 5¹. The CONV layers
are statically mapped to these clusters. Note that, unlike the per-model
optimized cluster configurations of the SC designs, Synergy uses the same
generic cluster configuration as the SF designs for all CNN models. As shown
in Figure 13, Synergy still achieves 6% better throughput than the SC
designs. This is because in the static mapping approaches (SF and SC) an
entire CONV layer is assigned to a cluster, which makes it hard to perfectly
balance the cluster workloads. In contrast, work-stealing in Synergy at the
granularity of individual jobs (tiled MM) can easily balance the workload
even with un-optimized, generic accelerators. The work-stealing feature in
Synergy allows developers to switch between different networks at runtime
without losing performance.
¹The number of clusters in this work can be any t, where t ∈ ℕ.
Figure 14: Dynamic Load Balancing in CIFAR Alex. SF Conf.: Cluster-0 (2
NEONs + 2 S-PE), Cluster-1 (6 F-PE). (Per-cluster execution time in ms,
broken down by CONV-0, CONV-1, and CONV-2. (a) SF configuration with two
clusters: Cluster-0 = 24.3 ms, Cluster-1 = 12.3 ms. (b) Synergy, same SF
configuration + work-stealing: Cluster-0 = 22.2 ms, Cluster-1 = 20.9 ms.)
Table 5: Best Cluster Configurations for CNN Models under Static Mapping
+ Custom Architectures

                 Cluster 0                       Cluster 1
Benchmarks       NEON   FPGA IP                  NEON   FPGA IP
CIFAR Darknet    0      2 S-PE + 1 F-PE          2      5 F-PE
CIFAR Alex       0      2 S-PE + 2 F-PE          2      4 F-PE
CIFAR Alex+      2      2 S-PE + 2 F-PE          0      4 F-PE
CIFAR full       0      2 S-PE + 2 F-PE          2      4 F-PE
MNIST            2      2 S-PE + 2 F-PE          0      4 F-PE
SVHN             2      2 S-PE + 2 F-PE          0      4 F-PE
MPCNN            0      2 S-PE + 2 F-PE          2      4 F-PE
To better understand the performance improvement, Table 6 shows the
accelerator cluster utilization of the various designs. The non-pipelined
designs are the best single-threaded implementations (the blue bars in
Figure 11) leveraging a single CPU core, the NEON engine, and the FPGA
accelerators. As shown in Table 6, the cluster utilization of the
non-pipelined designs is very low, indicating that the FPGA is idle for
43.95% (i.e., 100% - 56.05%) of the total execution time on average. The
reason is that in a non-pipelined design the FPGA accelerators have to wait
for the CPU or the NEON core to finish their work. With multi-threading
support, the pipelined designs significantly increase the cluster
utilization (above 90%), as the various computing elements can work
simultaneously. Table 6 shows that the SF designs increase the average
accelerator cluster utilization from 56.1% to 92.5%.
Table 6: Accelerator Cluster Utilization Comparison Across SF, SC and
Synergy

                 Non-pipelined (%)   Pipelined (%)
Benchmarks                           SF       SC       Synergy
CIFAR Darknet    50.77               95.32    97.55    99.89
CIFAR Alex       53.56               92.72    96.61    99.83
CIFAR Alex+      61.28               98.81    98.73    99.95
CIFAR full       54.06               93.53    94.97    100.00
MNIST            59.03               85.63    96.09    99.89
SVHN             53.00               94.72    96.86    99.26
MPCNN            60.62               86.47    94.45    99.79
Mean             56.05               92.46    96.47    99.80
Compared to the SF designs, the cluster utilization of the SC designs
reaches 96.5%, averaged across the benchmarks, because the SC designs use
fine-tuned cluster configurations and the workload assigned to the clusters
is therefore more balanced. Since Synergy's work-stealing scheduler operates
at the finer granularity of individual jobs (tiled MM), it improves the
cluster utilization further at runtime by balancing the workload across the
clusters. As shown in Table 6, the average cluster utilization of Synergy
reaches 99.8%.
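The utilization figures in Table 6 follow the usual definition (the notation
here is ours):

$$U_{\text{cluster}} = \frac{t_{\text{busy}}}{t_{\text{total}}},
\qquad \text{idle fraction} = 1 - U_{\text{cluster}},
\quad \text{e.g., } 1 - 0.5605 \approx 0.4395 \ (43.95\%).$$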
5 conclusion
This paper presents Synergy, an automated, transparent hardware-software
co-designed CNN inference framework on an embedded FPGA-based hetero-
geneous SoC architecture. Synergy fully utilizes the heterogeneity by lever-
aging diverse computing resources (CPUs, NEONs and FPGA) to accelerate
CNNs. Moreover, Synergy provides a software work-stealing scheduler that
automatically balances the workload across the accelerators, so it can adapt
to different networks at runtime without changing the hardware or software
implementation. Our results show that Synergy achieves a 7.3x speedup,
averaged across seven representative CNN models, over a well-optimized
software-only solution. Compared to contemporary CNN implementations on the
same SoC platform, Synergy delivers better throughput as well as better
energy efficiency.
References
[1] ARM Infocenter. http://infocenter.arm.com. 2018.
[2] A. Dundar, J. Jin, B. Martini, and E. Culurciello. Embedded streaming
deep neural networks accelerator with applications. IEEE Transactions
on Neural Networks and Learning Systems, 28(7):1572–1583, July 2017.
[3] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang,
and J. Cong. FP-DNN: An automated framework for mapping deep
neural networks onto FPGAs with RTL-HLS hybrid templates. In 2017 IEEE
25th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), pages 152–159, April 2017.
[4] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang,
and H. Yang. Angel-Eye: A complete design flow for mapping CNN onto
embedded FPGA. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 37(1):35–47, Jan 2018.
[5] S. Hashemi, N. Anthony, H. Tann, R. I. Bahar, and S. Reda. Un-
derstanding the impact of precision quantization on the accuracy and
energy of neural networks. In Design, Automation Test in Europe Con-
ference Exhibition (DATE), 2017, pages 1474–1479, March 2017.
[6] Gopalakrishna Hegde, Siddhartha, Nachiappan Ramasamy, and
Nachiket Kapre. Caffepresso: An optimized library for deep learning
on embedded accelerator-based platforms. In Proceedings of the In-
ternational Conference on Compilers, Architectures and Synthesis for
Embedded Systems, CASES ’16, pages 14:1–14:10, New York, NY, USA,
2016. ACM.
[7] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan
Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe:
Convolutional architecture for fast feature embedding. In Proceedings of
the 22Nd ACM International Conference on Multimedia, MM ’14, pages
675–678, New York, NY, USA, 2014. ACM.
[8] J. H. Kim, B. Grady, R. Lian, J. Brothers, and J. H. Anderson.
FPGA-based CNN inference accelerator synthesized from multi-threaded C
software. In 2017 30th IEEE International System-on-Chip Conference
(SOCC), pages 268–273, Sept 2017.
[9] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, Nov 1998.
[10] Enno Lübbers and Marco Platzner. ReconOS: Multithreaded programming
for reconfigurable computers. ACM Trans. Embed. Comput. Syst.,
9(1):8:1–8:33, October 2009.
[11] J. Nagi, F. Ducatelle, G. A. Di Caro, D. Cireşan, U. Meier, A. Giusti,
F. Nagi, J. Schmidhuber, and L. M. Gambardella. Max-pooling convolutional
neural networks for vision-based hand gesture recognition. In
2011 IEEE International Conference on Signal and Image Processing
Applications (ICSIPA), pages 342–347, Nov 2011.
[12] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu,
and Andrew Y Ng. Reading digits in natural images with unsupervised
feature learning. In NIPS workshop on deep learning and unsupervised
feature learning, 2011.
[13] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin
Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and
Huazhong Yang. Going deeper with embedded FPGA platform for
convolutional neural network. In Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, FPGA
’16, pages 26–35, New York, NY, USA, 2016. ACM.
[14] Joseph Redmon. Darknet: Open source neural networks in C, February
2018.
[15] Christoph Rüthing et al. Self-Adaptation in Programmable Automation
Controllers based on Hybrid Multi-Cores. Master Thesis, University of
Paderborn, 2016.
[16] Yongming Shen, Michael Ferdman, and Peter Milder. Maximizing CNN
accelerator efficiency through resource partitioning. In Proceedings of
the 44th Annual International Symposium on Computer Architecture,
ISCA ’17, pages 535–547, New York, NY, USA, 2017. ACM.
[17] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei
Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. Throughput-optimized
OpenCL-based FPGA accelerator for large-scale convolutional neural
networks. In Proceedings of the 2016 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, FPGA ’16, pages 16–25, New York,
NY, USA, 2016. ACM.
[18] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela
Blott, Philip Leong, Magnus Jahre, and Kees Vissers. FINN: A framework
for fast, scalable binarized neural network inference. In Proceedings
of the 2017 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, FPGA ’17, pages 65–74, New York, NY, USA, 2017. ACM.
[19] S. I. Venieris and C. S. Bouganis. fpgaConvNet: A framework for mapping
convolutional neural networks on FPGAs. In 2016 IEEE 24th Annual
International Symposium on Field-Programmable Custom Computing
Machines (FCCM), pages 40–47, May 2016.
[20] Stylianos I. Venieris and Christos-Savvas Bouganis. fpgaConvNet:
Automated mapping of convolutional neural networks on FPGAs (abstract
only). In Proceedings of the 2017 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, FPGA ’17, pages 291–292,
New York, NY, USA, 2017. ACM.
[21] Ying Wang, Jie Xu, Yinhe Han, Huawei Li, and Xiaowei Li. DeepBurning:
Automatic generation of FPGA-based learning accelerators for
the neural network family. In Proceedings of the 53rd Annual Design
Automation Conference, DAC ’16, pages 110:1–110:6, New York, NY,
USA, 2016. ACM.
[22] Xilinx Inc. http://www.xilinx.com. 2018.
[23] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason
Cong. Caffeine: Towards uniformed representation and acceleration
for deep convolutional neural networks. In Proceedings of the 35th In-
ternational Conference on Computer-Aided Design, ICCAD ’16, pages
12:1–12:8, New York, NY, USA, 2016. ACM.
[24] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and
Jason Cong. Optimizing FPGA-based accelerator design for deep
convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pages
161–170, New York, NY, USA, 2015. ACM.
[25] Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason
Cong. Energy-efficient CNN implementation on a deeply pipelined FPGA
cluster. In Proceedings of the 2016 International Symposium on Low
Power Electronics and Design, ISLPED ’16, pages 326–331, New York,
NY, USA, 2016. ACM.
[26] G. Zhong, A. Prakash, S. Wang, Y. Liang, T. Mitra, and S. Niar. Design
space exploration of FPGA-based accelerators with multi-level parallelism.
In Design, Automation Test in Europe Conference Exhibition (DATE),
2017, pages 1141–1146, March 2017.
... operational efficiency, such as microprocessor energy efficiency and power usage effectiveness (PUE), optimization from industry [41,45,80,82] and academia [74,96,134]. Further operational energy efficiency improvement is increasingly more challenging [13,32,53,65,69,73,102,116,117,118,127,132]. In addition to efficiency optimization, renewable energy, such as solar and wind, is increasingly adopted to reduce computing's carbon footprint (CF) [2,54,106,114]. ...
... Carbon-efficiency variability can stem from interference between and within applications [123] and the stability of network [22]. Unfortunately, state-of-the-art approaches, such as [13,32,53,54,65,66,68,69,73,88,102,114,116,117,118,127,130,132], determine the execution target primarily relying on the operational characteristics, such as performance and energy, without considering the aforementioned features (i.e., location-and time-dependent renewable energy availability, embodied emissions, and runtime variance), leaving a significant room for improvement (Fig. 1). This is due to the absence of a tool to quantify and analyze the carbon emissions of the infrastructure components by taking into account the aforementioned features, due to various challenges such as the heterogeneous interface across the system stacks and multiple organizations of infrastructure components. ...
... Traditionally, edge devices have been mainly used as user-end sensors, user interfaces, or both, in many mobile services. Recently, with the advancements in powerful mobile systems-on-a-chip (SoCs) [52,58,118], a varying amount of computations can be executed locally on the edge devices [102,116,117,118,122,127,132]. By doing so, the services can improve their response time and remove the data transmission overhead [13,32,53,65]. ...
Preprint
Full-text available
To improve the environmental implications of the growing demand of computing, future applications need to improve the carbon-efficiency of computing infrastructures. State-of-the-art approaches, however, do not consider the intermittent nature of renewable energy. The time and location-based carbon intensity of energy fueling computing has been ignored when determining how computation is carried out. This poses a new challenge -- deciding when and where to run applications across consumer devices at the edge and servers in the cloud. Such scheduling decisions become more complicated with the stochastic runtime variance and the amortization of the rising embodied emissions. This work proposes GreenScale, a framework to understand the design and optimization space of carbon-aware scheduling for green applications across the edge-cloud infrastructure. Based on the quantified carbon output of the infrastructure components, we demonstrate that optimizing for carbon, compared to performance and energy efficiency, yields unique scheduling solutions. Our evaluation with three representative categories of applications (i.e., AI, Game, and AR/VR) demonstrate that the carbon emissions of the applications can be reduced by up to 29.1% with the GreenScale. The analysis in this work further provides a detailed road map for edge-cloud application developers to build green applications.
Chapter
Convolutional neural networks (CNNs)-based inference is a quintessential component in mobile machine learning applications. Privacy and real-time response requirements require applications to perform inference on the mobile (edge) devices themselves. Heterogeneous multi-processor system-on-chips (HMPSoCs) within the edge devices enable high-throughput, low-latency edge inference. An HMPSoC contains several processing cores, each capable of independently performing CNN inference. However, to meet stringent performance requirements, an application must simultaneously involve all core types in inferencing. A software-based CNN inference pipeline design allows for synergistic engagement of all the cores in an HMPSoC for a high-throughput and low-latency CNN inference. In this chapter, we present two different CNN inference pipeline designs. The first design creates a pipeline between two different types of CPU cores. The second design extends the pipeline from CPU to GPU. We also provide a future perspective and research directions on the subject.
Chapter
The recent advances in artificial intelligence (AI) and the Internet of Things (IoT) have given rise to a pyramid of intelligent mobile applications such as autonomous driving, smart health monitoring, and virtual reality, which are running on ubiquitous edge devices such as smartphones and vehicles.
Chapter
In previous chapters, we have presented several edge computing frameworks that address the rationality and heterogeneity challenges in the SEC paradigm. In this chapter, we shift our focus to real-time AI in the social edge that investigates the challenge of building time-sensitive AI models in SEC. In particular, we focus on a widely adopted AI model—deep neural networks (DNN), and review a novel optimal batching algorithm called EdgeBatch that can fully utilize the data parallelization feature of DNN to significantly expedite the execution time of DNN-based AI models at the edge. EdgeBatch represents a line of research that addresses the emerging challenges at the intersection of real-time and AI communities.KeywordsReal-time AIEdgeBatchIntelligent edge systemsEnd-to-end delayBatching strategyEnergy consumption
Chapter
The advent of edge computing pushes the frontier of computation, service, and data along the cloud-to-things continuum to the edge of the network, and brings new opportunities for human-centric applications (e.g., social sensing, smart mobile computing, edge intelligence). By coupling those applications with edge computing, the individually owned edge devices form a federation of computational nodes where the data collected from them can be processed and consumed “on the spot”. In this chapter, we offer a high-level view of the SEC paradigm, its background, motivation, trends, enabling technologies, and examples of applications.KeywordsSocial edge computingTrendEnabling technologyApplications
Article
With the evaluation of Artificial Intelligence (AI), there has been a resurgence of interest in how to use AI algorithms on low-power embedded systems to broaden potential use cases of the Internet of Things (IoT). To mimic multimodal human perception, multimodal deep neural networks (M-DNN) have recently become very popular with the classification task due to their impressive performance for computer vision and audio processing tasks. This paper presents TinyM ² Net-V2 - a compact low-power software hardware architecture for m ulti m odal deep neural networks for resource-constrained tiny devices. In order to compress the models to implement on tiny devices, cyclicly sparsification and hybrid quantization (4-bits weights and 8-bits activations) methods are used. Although model compression techniques are an active research area, we are the first to demonstrate their efficacy for multimodal deep neural networks, using cyclicly sparsification and hybrid quantization of weights/activations. TinyM ² Net-V2 shows that even a tiny multimodal deep neural network model can improve the classification accuracy more than that of any unimodal counterparts. Parameterized M-DNN model architecture was designed to be evaluated in two different case-studies: vehicle detection from multimodal images and audios and COVID-19 detection from multimodal audio recordings. The most compressed TinyM ² Net-V2 achieves 92.5% COVID-19 detection accuracy (6.8% improvement from the unimodal full precision model) and 90.6% vehicle classification accuracy (7.7% improvement from the unimodal full precision model). A parameterized and flexible FPGA hardware accelerator was designed as well for TinyM ² Net-V2 models. To the best of our knowledge, this is the first work accelerating multimodal deep neural network models on low power Artix-7 FPGA hardware. We achieved energy efficiency of 9.04 GOP/s/W and 15.38 GOP/s/W for case-study 1 and case-study 2 respectively which is comparable to the state-of-the-art results. Finally, we compared our tiny FPGA hardware implementation results with off-the-shelf resource-constrained devices and showed our implementation is faster and consumed less power compared to the off-the-shelf resource-constrained devices.
Article
Stochastic computing (SC) has recently emerged as a promising method for efficient machine learning acceleration. Its high compute density, affinity with dense linear algebra primitives, and approximation properties have an uncanny level of synergy with the deep neural network computational requirements. However, there is a conspicuous lack of works trying to integrate SC hardware with sparsity awareness, which has brought significant performance improvements to conventional architectures. In this work, we identify why common sparsity-exploiting techniques are not easily applicable to SC accelerators and propose a new architecture—SASCHA—sparsity-aware SC hardware architecture for the neural network acceleration that addresses those issues. SASCHA encompasses a set of techniques that make utilizing sparsity in inference practical for different types of SC computation. At 90% weight sparsity, SASCHA can be up to $6.5\times $ faster and $5.5\times $ more energy-efficient than comparable dense SC accelerators with a similar area without sacrificing the dense network throughput. SASCHA also outperforms sparse fixed-point accelerators by up to $4\times $ in terms of latency. To the best of our knowledge, SASCHA is the first SC accelerator architecture oriented around sparsity.
Article
Recently, Reinforcement Learning (RL) has shown great performance in solving sequential decision-making and control in dynamic environment problems. Despite its achievements, deploying Deep Neural Network (DNN) based RL is expensive in terms of time and power due to the large number of episodes required to train agents with high dimensional image representations. Additionally, at the interference the large energy footprint of deep neural networks can be a major drawback. Embedded edge devices as the main platform for deploying RL applications, are intrinsically resource-constrained and deploying deep neural network based RL on them is a challenging task. As a result, reducing the number of actions taken by the RL agent to learn desired policy, along with the energy-efficient deployment of RL is crucial. In this paper, we propose Energy Efficient Hierarchical Reinforcement Learning (E2HRL), which is a scalable hardware architecture for RL applications. E2HRL utilizes a cross-layer design methodology for achieving better energy efficiency, smaller model size, higher accuracy, and system integration at the software and hardware layers. Our proposed model for RL agent is designed based on the learning hierarchical policies, which makes the network architecture more efficient for implementation on mobile devices. We evaluated our model in three different RL environment with different level of complexity. Simulation results with our analysis illustrate that hierarchical policy learning with several levels of control improves RL agents training efficiency and the agent learns the desired policy faster compared to a none hierarchical model. This improvement is specifically more observable as the environment or the task becomes more complex with multiple objective subgoals. We tested our model with different hyperparameters to achieve the maximum reward by the RL agent while minimizing the model size, parameters, and required number of operations. E2HRL model enables efficient deployment of RL agent on a resource constraint embedded devices with the proposed custom hardware architecture which is scalable and fully parameterized with respect to the number of input channels, filter size, and depth. The number of processing engines (PE) in the proposed hardware can vary between 1 to 8, which provides the flexibility of trade-off different factors such as latency, throughput, power and energy efficiency. By performing a systematic hardware parameter analysis and design space exploration, we implemented the most energy-efficient hardware architectures of E2HRL on Xilinx Artix-7 FPGA and NVIDIA Jetson TX2. Comparing the implementation results shows Jetson TX2 boards achieve 0.1 ∼ 1.3 GOP/S/W energy efficiency while Artix-7 FPGA achieves 1.1 ∼ 11.4 GOP/S/W, which denotes 8.8X ∼ 11X better energy efficiency of E2HRL when model is implemented on FPGA. Additionally, compared to similar works our design shows better performance and energy efficiency.
Conference Paper
Full-text available
In recent years, Convolutional Neural Networks (ConvNets) have become the state-of-the-art in several Artificial Intelligence tasks. Across the range of applications, the performance needs vary significantly, from high-throughput image recognition to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on the different performance needs. However, the complexity of ConvNet models keeps increasing leading to a large design space. This work presents fpgaConvNet, an end-to-end framework for mapping ConvNets on FPGAs. The proposed framework employs an automated design methodology based on the Synchronous Dataflow (SDF) paradigm and defines a set of transformations on the SDF graph in order to efficiently explore the architectural design space. By treating high-throughput and latency-critical systems separately, the presented tool is able to efficiently explore the architectural design space and to generate hardware designs from high-level ConvNet specifications, explicitly optimised for the performance metric of interest. Overall our framework yields designs that improve the performance density and the performance efficiency by up to 6× and 4.49× respectively over existing highly-optimised FPGA, DSP and embedded GPU work.
Article
Full-text available
Deep neural networks are gaining in popularity as they are used to generate state-of-the-art results for a variety of computer vision and machine learning applications. At the same time, these networks have grown in depth and complexity in order to solve harder problems. Given the limitations in power budgets dedicated to these networks, the importance of low-power, low-memory solutions has been stressed in recent years. While a large number of dedicated hardware using different precisions has recently been proposed, there exists no comprehensive study of different bit precisions and arithmetic in both inputs and network parameters. In this work, we address this issue and perform a study of different bit-precisions in neural networks (from floating-point to fixed-point, powers of two, and binary). In our evaluation, we consider and analyze the effect of precision scaling on both network accuracy and hardware metrics including memory footprint, power and energy consumption, and design area. We also investigate training-time methodologies to compensate for the reduction in accuracy due to limited bit precision and demonstrate that in most cases, precision scaling can deliver significant benefits in design metrics at the cost of very modest decreases in network accuracy. In addition, we propose that a small portion of the benefits achieved when using lower precisions can be forfeited to increase the network size and therefore the accuracy. We evaluate our experiments, using three well-recognized networks and datasets to show its generality. We investigate the trade-offs and highlight the benefits of using lower precisions in terms of energy and memory footprint.
Conference Paper
Full-text available
In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are com-putational-intensive and resource-consuming, and thus are hard to be integrated into embedded systems such as smart phones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNN, but the limited bandwidth and on-chip memory size limit the performance of FPGA accelerator for CNN. In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for Image-Net large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that Convolutional layers are computational-centric and Fully-Connected layers are memory-centric. Then the dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNN are proposed to improve the bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure a high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented on FPGA end-to-end so far. The system on Xilinx Zynq ZC706 board achieves a frame rate at 4.45 fps with the top-5 accuracy of 86.66% using 16-bit quantization. The average performance of convolutional layers and the full CNN is 187.8 GOP/s and 137.0 GOP/s under 150MHz working frequency, which outperform previous approaches significantly.
Conference Paper
A deep-learning inference accelerator is synthesized from a C-language software program parallelized with Pthreads. The software implementation uses the well-known producer/consumer model with parallel threads interconnected by FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes threads into parallel FPGA hardware, translating software parallelism into spatial parallelism. A complete system is generated where convolution, pooling and padding are realized in the synthesized accelerator, with remaining tasks executing on an embedded ARM processor. The accelerator incorporates reduced precision, and a novel approach for zero-weight-skipping in convolution. On a mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective GOPS.
Article
Convolutional Neural Network (CNN) has become a successful algorithm in the region of artificial intelligence and a strong candidate for many computer vision (CV) algorithms. But the computation complexity of CNN is much higher than traditional algorithms. With the help of GPU acceleration, CNN based applications are widely deployed in servers. However, for embedded platforms, CNN-based solutions are still too complex to be applied. Various dedicated hardware designs on FPGAs have been carried out to accelerate CNNs, while few of them explore the whole design flow for both fast deployment and high power efficiency.
Conference Paper
Applications containing compute-intensive kernels with nested loops can effectively leverage FPGAs to exploit fine- and coarse-grained parallelism. HLS tools used to translate these kernels from high-level languages (e.g., C/C++), however, are inefficient in exploiting multiple levels of parallelism automatically, thereby producing sub-optimal accelerators. Moreover, the large design space resulting from the various combinations of fine- and coarse-grained parallelism options makes exhaustive design space exploration prohibitively time-consuming with HLS tools. Hence, we propose a rapid estimation framework, MPSeeker, to evaluate performance/area metrics of various accelerator options for an application at an early design phase. Experimental results show that MPSeeker can rapidly (in minutes) explore the complex design space and accurately estimate performance/area of various design points to identify the near-optimal (95.7% performance of the optimal on average) combination of parallelism options.
Conference Paper
Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 µs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 µs latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.