
Synergy: An HW/SW framework for high throughput CNNs on embedded heterogeneous SoC

March 2019
ACM Transactions on Embedded Computing Systems (In press)


DOI:10.1145/3301278

Authors:

Guanwen Zhong

Xilinx Inc.

Cheng Tan

National University of Singapore

Tulika Mitra

National University of Singapore

Convolutional Neural Networks (CNNs) have been widely deployed in diverse application domains. There has been significant progress in accelerating both their training and inference using high-performance GPUs, FPGAs, and custom ASICs for datacenter-scale environments. The recent proliferation of mobile and IoT devices has necessitated real-time, energy-efficient deep neural network inference on embedded-class, resource-constrained platforms. In this context, we present Synergy, an automated, hardware-software co-designed, pipelined, high-throughput CNN inference framework on embedded heterogeneous system-on-chip (SoC) architectures (Xilinx Zynq). Synergy leverages, through multi-threading, all the available on-chip resources, which include the dual-core ARM processor along with the FPGA and the NEON SIMD engines as accelerators. Moreover, Synergy provides a unified abstraction of the heterogeneous accelerators (FPGA and NEON) and can adapt to different network configurations at runtime without changing the underlying hardware accelerator architecture by balancing workload across accelerators through work-stealing. Synergy achieves 7.3X speedup, averaged across seven CNN models, over a well-optimized software-only solution. Synergy demonstrates substantially better throughput and energy efficiency compared to the contemporary CNN implementations on the same SoC architecture.



Guanwen Zhong∗, Akshat Dubey†, Tan Cheng‡, and Tulika Mitra§

School of Computing, National University of Singapore


1 Introduction

Convolutional Neural Networks (CNNs) are a popular class of deep learning methods with a wide range of applications, including computer vision, image/video processing, natural language processing, and others. A typical CNN consists of multiple layers. Given an application, such as image classification, the network is first trained with the training dataset. The trained network is then deployed for inference, i.e., classification of a new image. Both training and inference are compute- and memory-intensive, but they also exhibit massive intrinsic parallelism. Thus, there have been numerous efforts to improve the performance and the energy efficiency of CNN implementations through architectures and computing substrates that support extensive parallelism, such as GPUs, FPGAs, or even ASIC accelerators. This line of research has primarily focused on high-performance computing platforms in datacenters or clouds.

∗zhguanwen@gmail.com
†akshatdubey@nus.edu.sg
‡tancheng@comp.nus.edu.sg
§tulika@comp.nus.edu.sg

arXiv:1804.00706v1 [cs.DC] 28 Mar 2018

The proliferation of mobile devices and the recent emergence of the IoT (Internet of Things) have transformed the computing landscape. There is a compelling need to realise real-time, energy-efficient CNN inference on resource-constrained mobile and IoT edge devices. However, an efficient implementation of CNN-based inference on embedded platforms remains challenging given the resource limitations. In this context, we present Synergy, an automated, transparent, pipelined, high-throughput, hardware-software co-designed CNN inference framework on embedded heterogeneous SoC architectures. We design the Synergy prototype on the Xilinx Zynq XC7Z020 device, leveraging all its available on-chip compute resources, namely the dual-core ARM processor with NEON SIMD engines and the FPGA. Synergy is a complete system-level solution including a multi-threaded software component, multi-threaded FPGA and NEON accelerators, an interface between hardware and software components, support for dynamic workload balancing, and an architecture generator for customized solutions (if required). Figure 1 depicts the Synergy framework mapping a CNN model on a heterogeneous SoC. Synergy distinguishes itself from the state of the art along multiple dimensions.

Heterogeneous HW/SW Acceleration: Synergy leverages all the compute resources available on a heterogeneous SoC for maximum performance. The convolutional (referred to as CONV hereafter) layers are the most compute-intensive component of the CNN, consuming more than 90% of the execution time [24]. All contemporary CNN designs [6][20][21] on the Xilinx Zynq platform offload the CONV layers to the FPGA. We observe that the NEON SIMD (Single-Instruction Multiple-Data) engines in the ARM cores are quite effective in accelerating the CONV layers as well. Therefore, harnessing the compute power of the FPGA in conjunction with the NEON engines can reduce the execution latency of the CONV layers significantly. Embracing the heterogeneity (hardware accelerators on FPGA and software accelerators on NEON) for a single computational kernel is a difficult proposition. Synergy effectively transforms the computation of a CONV layer into a tiled matrix multiplication operation, where the tiles can be processed independently. Synergy then seamlessly feeds the tiles to both the hardware and the software accelerators to exploit the parallelism within CONV layers. Using NEON and FPGA together, Synergy improves the overall latency and throughput by 12% and 15% respectively, averaged across multiple CNN models, compared to FPGA-only solutions.

Figure 1: Synergy: Mapping CNNs on Heterogeneous SoC
(Figure labels: network configuration and weights; hardware configuration; Hardware Architecture Generator; multithreaded CNN in software with CONV, POOL and FC layers; software-hardware interface; multi-core processor with CPU and NEON; FPGA with Clusters 0 to 2 of PEs.)

Transparent Consolidation of Accelerators: Most contemporary FPGA-based CNN frameworks [3][6][13][17][20][21][24][23] rely heavily on customizing the CONV layer accelerators for each individual network to minimize latency. The configuration of a CNN (number and type of layers, nodes, connections, etc.) is dependent on the application. Given a specific CNN model, existing approaches perform an automated (or manual) design space exploration (DSE) to identify the optimal accelerator architectures for the CONV layers of that network on the target FPGA device. This approach has two drawbacks. First, the application developer needs to be involved in the DSE and High-Level Synthesis (HLS) to map the given CNN model on the FPGA. Even if the DSE and HLS steps can be fully automated, there are still some quirks [16] that make this process quite challenging for an application developer with limited experience in FPGAs. Second, in an embedded FPGA device with strict resource constraints, a single accelerator design is used by all the CONV layers of the network in a time-multiplexed fashion, even though the different layers have diverse compute requirements. In other words, the single accelerator is a compromise that offers the best average performance across all the CONV layers of the network, but it is not ideal for any particular CONV layer [16]. Moreover, this single CONV accelerator is still custom-designed for each network through extensive DSE and HLS.

In contrast, Synergy accelerators (FPGA, NEON) are network-agnostic. A fixed set of accelerators is used irrespective of the network and layer, as the CONV layer computation is transformed into tiled matrix multiplications. Using fine-grained tiled matrix multiplication operations as fundamental primitives (as opposed to a complete CONV layer), in conjunction with a work-stealing software scheduler that distributes these tiles to the different accelerators and balances the workload, Synergy achieves comparable performance to the customized network-specific implementations returned through DSE. Thus, Synergy can bypass the DSE and HLS for each individual CNN model and provide a completely out-of-the-box solution.

High-Throughput HW/SW Multi-Threaded Pipeline: The transparent consolidation of heterogeneous HW/SW accelerators in Synergy provides a powerful abstraction layer for any CNN implementation. The abundance of sensors on mobile and IoT devices capturing continuous data streams demands in-situ real-time inference (e.g., continuous object detection in an image/video stream [2]). Here throughput (i.e., frames per second) is the defining metric, as opposed to minimizing the latency for each individual frame in isolation. Synergy employs a HW/SW multi-threaded pipelined design of the different layers of the network that allows consecutive frames (images) from the streaming input to be processed concurrently, exploiting inter-frame parallelism and improving throughput. In this pipelined design, however, CONV layers from different frames may need to be processed simultaneously on the FPGA. This inter-frame parallelism is easy to support in Synergy: the matrix multiplications from the different CONV layers simply generate matrix multiplication tiles, and the tiles from different layers are distributed transparently to the set of heterogeneous accelerators to be executed in parallel. Synergy achieves a throughput of 39.5 to 136.4 frames/second and consumes 14.4 to 55.8 mJ/frame of energy, depending on the CNN model. This is substantially better than the contemporary CNN implementations on the Xilinx Zynq XC7Z020 device (see Table 4). Moreover, the concurrent execution of multiple CONV layers on FPGA accelerators in Synergy greatly improves their utilization. Low accelerator utilization is a critical issue in full-system implementations of CNNs on resource-constrained devices, where layers that are not compute-intensive (e.g., pooling, activation and fully-connected), implemented in software on relatively weak CPUs (e.g., ARM), take up significant execution time while the FPGA remains inactive. The pipelined design with multiple frames in flight keeps the accelerators busy in Synergy, with 99.8% average utilization.

Automated Customized Accelerator Design: Synergy offers a default plug-and-play solution for a naïve application developer that avoids the complex DSE and HLS for each individual CNN model. An experienced designer, on the other hand, may want to further improve the performance by designing accelerators that are optimized for the CNN model at hand. Synergy offers an automated approach to customized accelerator design for a specific CNN model. The framework accepts the network configuration (number and type of layers) corresponding to a CNN model as an input. The designer only needs to provide the architectural parameters for the matrix multiplication accelerators. The Synergy framework can not only automatically synthesize the accelerators (according to designer-defined parameters) on the FPGA fabric but also generate the appropriate hardware-software interface to synergistically engage these newly synthesized accelerators. This is shown as the “Hardware Architecture Generator” in Figure 1. In addition, the same architecture generator can be used to synthesize the accelerators and the HW/SW interface for a new SoC device. Thus, Synergy provides a complete push-button solution to CNN acceleration on a heterogeneous platform.

2 Related Works

The state-of-the-art FPGA-based CNN works are shown in Table 1. To the best of our knowledge, there is no work focusing on heterogeneous HW/SW acceleration (with CPUs, NEONs and FPGA) for CNNs. We classify the existing works into two categories: network-dependent and network-independent FPGA-based CNN frameworks.

Network-dependent Frameworks generally require designers to explore different configurations to generate a hardware architecture for a specific CNN, either manually or with the help of provided scripts, and to perform synthesis (which normally takes half an hour to one hour) to generate the bitstream. Given a new network, designers need to redo these steps, which is time-consuming. This approach is well explored and can produce extremely high-performance CNN accelerators, but it sacrifices flexibility across networks. Almost all the existing FPGA-based CNN works [3][6][8][13][16][17][18][19][20][21][23][25][24] use the network-dependent approach. [16][17][25][24] require designers to manually explore architectures for different networks, while [3][18][19][20][21][23] propose automated toolchains for mapping CNNs on FPGAs. [3][6][13][17][20][21][23][24] mainly focus on exploiting intra-frame parallelism within layers and execute the layers of a CNN in sequence, ignoring inter-frame parallelism across layers. [18][25] map all layers onto FPGAs and enable hardware pipelining to exploit the inter-frame parallelism. [18] proposes an automated framework to accelerate binarized CNNs; it maps all layers of a binarized CNN onto the FPGA and enables hardware pipelining. [25] maps convolutional, normalization, pooling, activation and fully-connected layers onto multiple FPGA devices (1 Xilinx ZC706 + 6 Xilinx VC709 boards). Each device is in charge of one or more specific CNN layers, and the devices are connected in a ring network. The pipelining flow is controlled by dual-core ARM processors on the Xilinx Zynq board. However, the cost of the deeply pipelined FPGA cluster is too high, as it requires multiple high-end FPGA devices, and the setup is difficult. Unlike [18][25], [8] starts with multi-threaded CNN inference code and converts all its layers into FPGA accelerators. However, the workload of different layers in a CNN can be imbalanced, which leads to low accelerator utilization and wastes precious FPGA resources. [16] statically splits the single large processing engine (PE) used to accelerate convolutional layers into multiple small PEs. This approach allows multiple layers to run simultaneously on different image frames. However, the evaluation of that work is based on simulation, and the performance (execution cycles) of the PEs is obtained from Vivado HLS. These performance numbers are not accurate, as they do not account for the runtime overhead of a real platform.

Network-independent Frameworks leverage a fixed optimized hardware architecture for various CNN networks. To adapt to different networks and achieve good hardware efficiency, this approach relies on either static (compiler) or dynamic (runtime scheduler) techniques. The key advantage of this approach is that designers can easily switch between networks at runtime without going through the time-consuming synthesis step. [4] belongs to this category. Their approach relies on a compiler that statically generates a set of instructions (describing the process of CNN execution) that execute on the fixed hardware architecture. Layers are executed in sequence in their work. Moreover, as [4] includes data quantization to reduce the memory requirement, their approach can support large networks on embedded FPGA devices. However, their approach cannot run multiple layers concurrently on different input frames, which might result in low accelerator utilization.

Table 1: Current State-of-the-art vs. Synergy

Reference                        Automated  Inter-frame  Self-balancing  HW Reuse*  Network Agnostic  On-board Evaluation
[24] [FPGA’15]                   ✗          ✗            ✗               ✗          ✗                 ✓
[13] [FPGA’16]                   ✗          ✗            ✗               ✓          ✗                 ✓
[17] [FPGA’16]                   ✗          ✗            ✗               ✗          ✗                 ✓
[6] [CASES’16]                   ✗          ✗            ✗               ✓          ✗                 ✓
[21] [DAC’16]                    ✓          ✗            ✗               ✗          ✗                 ✓
[23] [ICCAD’16]                  ✓          ✗            ✗               ✗          ✗                 ✓
[19] [FCCM’16], [20] [FPGA’17]   ✓          ✗            ✗               ✗          ✗                 ✓
[18] [FPGA’17]                   ✓          ✓            ✗               ✗          ✗                 ✓
[3] [FCCM’17]                    ✓          ✗            ✗               ✓          ✗                 ✓
[25] [ISLPED’16]                 ✗          ✓            ✗               ✗          ✗                 ✓
[8] [SOCC’17]                    ✗          ✓            ✗               ✗          ✗                 ✓
[16] [ISCA’17]                   ✗          ✓            ✗               ✓          ✗                 ✗
[4] [TCAD’17]                    ✓          ✗            ✗               ✓          ✓                 ✓
Synergy (proposed)               ✓          ✓            ✓               ✓          ✓                 ✓

* HW Reuse: different CONV layers and FC layers can share the same FPGA accelerators.

Synergy is a network-independent framework. More specifically, we propose a hardware abstraction to unify the various computing elements (NEON cores and FPGA) within an FPGA-based SoC architecture. Thus, Synergy can leverage all computing elements (the multiple ARM cores, their NEON cores and the FPGA) to accelerate CNNs via HW/SW multi-threading, unleashing the true power of heterogeneity. Unlike [4], Synergy adapts to various networks by leveraging a work-stealing scheduler (Section 3.1.3) in software to automatically balance the workload of the accelerators at runtime without changing the hardware or software implementations. Moreover, Synergy provides an automated toolchain that allows designers to explore various accelerator architectures or migrate designs to other embedded FPGA devices.

3 The Synergy Framework

Synergy, as shown in Figure 1, is an automated framework to map CNN models onto embedded FPGA-based heterogeneous SoC platforms. Synergy targets the CNN inference phase and unleashes the power of heterogeneity of the SoC architecture by leveraging all its compute elements (CPUs, NEON engines, and FPGA).

A CNN model contains multiple layers such as convolutional, normalization, pooling, activation and fully connected layers. The input of a layer is the output of the previous layer. When input frames stream into the CNN, the layers can process different frames concurrently. This inter-frame parallelism can be exploited to improve throughput.
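As an idealized illustration of why this helps (a back-of-the-envelope model, not one given in the paper), suppose the network is split into S pipeline stages, stage s needs time t_s per frame, and every stage has its own compute resource. The symbols S and t_s are introduced here only for this sketch.

```latex
\text{Throughput}_{\text{sequential}} = \frac{1}{\sum_{s=1}^{S} t_s},
\qquad
\text{Throughput}_{\text{pipelined}} \approx \frac{1}{\max_{1 \le s \le S} t_s}
```

Under these assumptions the latency of a single frame is unchanged, but in steady state one frame completes every max t_s. On a real device the stages share CPU cores and accelerators, so the achievable gain is smaller than this bound suggests.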

Synergy uses the FPGA logic and the NEON engines to accelerate the most compute-intensive layers (CONV) in a CNN, while the CPU cores work on the other layers (such as pooling, activation and fully-connected layers) and preprocessing functions (e.g., normalization, scaling and data layout transformation). As shown in Figure 1, a designer can instantiate multiple processing engines (referred to as PEs hereafter) on the FPGA to accelerate the CONV layers. The computation in a CONV layer is transformed into a set of independent tiled matrix-multiplication operations, called jobs (Section 3.1.1). These jobs are executed by the FPGA and the NEON accelerators in parallel.

To improve the inference throughput and accelerator utilization, Synergy supports a HW/SW multi-threaded pipeline in which the CPU cores and the accelerators work on different layers of different frames concurrently. To this end, we extend the traditional single-threaded CNN framework with multi-threading support. Specifically, the workload of each layer is handled by a corresponding thread, and the communication between layers is performed through a mailbox (a synchronized first-in-first-out buffer) accessible by the threads. Multiple threads collaborate with each other in a producer-consumer fashion to construct the full network.
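The mailbox described above can be sketched as a small bounded FIFO guarded by a mutex and two condition variables. This is a minimal illustration using POSIX threads, not Synergy's or ReconOS's actual mailbox; the names mailbox_t, mbox_put, mbox_get and the capacity MBOX_CAP are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MBOX_CAP 4  /* illustrative capacity, chosen for this sketch */

/* Hypothetical mailbox: a synchronized first-in-first-out buffer. */
typedef struct {
    void *slot[MBOX_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} mailbox_t;

void mbox_init(mailbox_t *mb)
{
    mb->head = mb->tail = mb->count = 0;
    pthread_mutex_init(&mb->lock, NULL);
    pthread_cond_init(&mb->not_empty, NULL);
    pthread_cond_init(&mb->not_full, NULL);
}

/* Producer side (e.g. the thread of layer i): blocks while full. */
void mbox_put(mailbox_t *mb, void *msg)
{
    pthread_mutex_lock(&mb->lock);
    while (mb->count == MBOX_CAP)
        pthread_cond_wait(&mb->not_full, &mb->lock);
    mb->slot[mb->tail] = msg;
    mb->tail = (mb->tail + 1) % MBOX_CAP;
    mb->count++;
    pthread_cond_signal(&mb->not_empty);
    pthread_mutex_unlock(&mb->lock);
}

/* Consumer side (e.g. the thread of layer i+1): blocks while empty. */
void *mbox_get(mailbox_t *mb)
{
    pthread_mutex_lock(&mb->lock);
    while (mb->count == 0)
        pthread_cond_wait(&mb->not_empty, &mb->lock);
    void *msg = mb->slot[mb->head];
    mb->head = (mb->head + 1) % MBOX_CAP;
    mb->count--;
    pthread_cond_signal(&mb->not_full);
    pthread_mutex_unlock(&mb->lock);
    return msg;
}
```

In a pipeline built this way, each layer thread would loop on mbox_get from its input mailbox, process the frame, and mbox_put the result into the mailbox of the next layer; the bounded capacity limits how many frames can be in flight between two layers.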

As multi-threading is a software concept, hardware accelerators cannot directly share the well-established mechanisms of the multi-threading model such as mutex, semaphore, and mailbox. To abstract away the hardware accelerators as hardware threads and extend the operating system to support HW/SW threads, we adapt ReconOS [10], an open-source operating system for reconfigurable computing. ReconOS provides the HW/SW multi-threading technique that we build upon to accelerate CNN applications on FPGA-based heterogeneous SoC platforms. Each hardware accelerator or PE is represented by a delegate thread in the software space that behaves just like a traditional software thread, as shown in Figure 2 and explained in detail in Section 3.1.2.

The accelerators (PEs and NEONs) can be grouped into multiple clusters so that each CONV layer can have its own private cluster. However, Synergy accelerators are not customized for a specific CNN model. Thus the generic multi-cluster configuration may not be optimal for each network and may lead to an imbalance in the execution time of the different CONV layers. Synergy employs work-stealing (detailed in Section 3.1.3), a dynamic workload scheduling technique, to deal with the workload imbalance among the clusters. The jobs (independent tiled matrix-multiplication operations) from the different CONV layers are distributed to the different accelerator clusters. An idle cluster steals workload from the other busy clusters, thereby balancing the workload across clusters and maximizing throughput. Within a cluster, the jobs are dispatched to the available accelerators (NEONs and FPGA-based PEs) in a round-robin fashion.
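The two scheduling rules in this paragraph (steal when idle, round-robin inside a cluster) can be sketched as follows. This is a single-threaded illustration only: the types and function names are hypothetical, locking is omitted, and stealing from the cluster with the most pending jobs is an assumption made for this sketch, since the victim selection policy is not stated in this passage.

```c
#include <assert.h>

#define NUM_CLUSTERS 2   /* illustrative values for this sketch */
#define QUEUE_CAP 64

typedef struct {
    int jobs[QUEUE_CAP];  /* job ids standing in for job addresses */
    int head, tail;       /* pending jobs are jobs[head .. tail-1] */
    int num_accels;       /* accelerators (PEs and NEONs) in the cluster */
    int rr_next;          /* next accelerator in round-robin order */
} cluster_t;

int pending(const cluster_t *c) { return c->tail - c->head; }

void push_job(cluster_t *c, int job)
{
    assert(c->tail < QUEUE_CAP);
    c->jobs[c->tail++] = job;
}

/* Next job for cluster `self`: its own queue first; if that is empty,
 * steal from the cluster with the most pending jobs (assumed policy).
 * Returns -1 when no cluster has work left. */
int take_job(cluster_t cl[], int self)
{
    int victim = self;
    if (pending(&cl[self]) == 0)
        for (int i = 0; i < NUM_CLUSTERS; ++i)
            if (pending(&cl[i]) > pending(&cl[victim]))
                victim = i;
    if (pending(&cl[victim]) == 0)
        return -1;
    return cl[victim].jobs[cl[victim].head++];
}

/* Round-robin choice of the accelerator that receives the next job. */
int pick_accel(cluster_t *c)
{
    int a = c->rr_next;
    c->rr_next = (c->rr_next + 1) % c->num_accels;
    return a;
}
```

A real implementation would protect the queues with locks or use the synchronized buffers the framework already provides, because clusters and CONV threads access them concurrently.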

The Synergy framework provides a default architecture of the FPGA-based PEs and their cluster configuration that has been designed to provide quality performance across a range of CNN models. These clusters and their constituent PEs are pre-synthesized in Synergy for each unique FPGA device/platform and do not need to be generated for each individual CNN model. In other words, the FPGA bitstream remains unchanged across different CNN models, and the FPGA device need not be reconfigured for each CNN model unless desired by the application developer. Given a CNN model corresponding to an application, Synergy takes as input a network configuration file that defines the architecture of the CNN. The CPU-based host code used to control the hardware accelerators, the Linux kernels and the HW/SW multi-threaded library are written as templates. With the network configuration file and the software templates, Synergy automatically generates a software/hardware multi-threaded CNN application in C.

If an advanced application developer wants to customize the PE and cluster design for a specific CNN model, the Synergy framework offers a hardware architecture generator (Section 3.3). In this case, Synergy takes in a hardware configuration file as input and creates the hardware architecture by instantiating the HLS-RTL accelerator templates in C corresponding to the tiled matrix-multiplication operations. These FPGA-based accelerators for the CONV layers are generated by a commercial HLS tool from the C templates, while the accelerator interfaces and the memory subsystem are created from RTL templates. The generation of both the software and the hardware components is completely automated.

Figure 2: Overview of the Software Architecture
(Figure labels: streaming input and output; software threads CONV-0, CONV-1, CONV-2 and two pooling threads, with a courier in each CONV thread; hardware abstraction with Cluster-0 and Cluster-1, Job Queue 0 and Job Queue 1, Delegate Threads 0 to 4 and a thief; PEs on the FPGA and NEON.)

3.1 Software Architecture

Figure 2 shows the software component of Synergy. We explain the software functionality that implements the CONV layers, the other layers, and the preprocessing functions.

3.1.1 CONV Layers

CONV layers are the most compute-intensive components in a CNN, occupying more than 90% of the execution time during inference [24]. They take in input feature maps and convolve them with convolutional filters to obtain output feature maps. As we target low-end embedded platforms, the FPGA resources are quite limited and we cannot generate a dedicated accelerator for each convolutional layer in a CNN model as in [13][24][23][25]. Therefore, in our implementation, we need to share the hardware accelerators among the convolutional layers. We transform the convolution operations into matrix multiplication (MM) by flattening and rearranging the input features [3][17]. A data layout transformation is required to convert the 3D array in CONV layers into a 2D array; this is known as the image-to-column (im2col) function in many popular open-source CNN frameworks such as Caffe [7] and Darknet [14]. Details of the data layout transformation can be found in [7][17]. Synergy leverages both the FPGA-based PEs and the NEON engines to accelerate the MM computations.
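To make the transformation concrete, the following is a minimal sketch of im2col followed by a plain matrix multiplication. It is not the Caffe, Darknet or Synergy code: the function names im2col_sketch and conv_as_mm are hypothetical, and the sketch assumes stride 1, no padding and row-major storage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: unrolls a C x H x W input into a
 * (C*KH*KW) x (OH*OW) matrix, assuming stride 1 and no padding,
 * where OH = H-KH+1 and OW = W-KW+1. */
void im2col_sketch(const float *in, int C, int H, int W,
                   int KH, int KW, float *col)
{
    assert(KH <= H && KW <= W);
    int OH = H - KH + 1, OW = W - KW + 1;
    for (int c = 0; c < C; ++c)
        for (int kh = 0; kh < KH; ++kh)
            for (int kw = 0; kw < KW; ++kw) {
                int row = (c * KH + kh) * KW + kw;
                for (int oh = 0; oh < OH; ++oh)
                    for (int ow = 0; ow < OW; ++ow)
                        col[(size_t)row * OH * OW + oh * OW + ow] =
                            in[((size_t)c * H + oh + kh) * W + ow + kw];
            }
}

/* Convolution as MM: out (F x Ncols) = filt (F x Kdim) * col (Kdim x Ncols),
 * with Kdim = C*KH*KW and Ncols = OH*OW. */
void conv_as_mm(const float *filt, const float *col,
                int F, int Kdim, int Ncols, float *out)
{
    for (int f = 0; f < F; ++f)
        for (int n = 0; n < Ncols; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < Kdim; ++k)
                acc += filt[f * Kdim + k] * col[k * Ncols + n];
            out[f * Ncols + n] = acc;
        }
}
```

For a single-channel 3×3 input holding 1 to 9 and one 2×2 filter of ones, im2col_sketch produces a 4×4 matrix, and conv_as_mm returns the four window sums 12, 16, 24 and 28. The matrix product in conv_as_mm is the part that gets tiled and handed to the accelerators.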

Listing 1: Tiled Matrix Multiplication

 1  /* Tile size: TS; loop bounds: N, M, K */
 2  Tile_t1: for (t1 = 0; t1 < floor(N/TS); ++t1) {
 3    Tile_t2: for (t2 = 0; t2 < floor(M/TS); ++t2) {
 4      ... // Initialization
 5      tiled_mm: for (t3 = 0; t3 < floor(K/TS); ++t3) {
 6        // Step 1: Copy data from DDR to local buffer (a, b, c);
 7        data_copy(ddr_a, ddr_b, a, b, offsetA, offsetB);
 8        // Step 2: Kernel Computation
 9        loop1: for (i = 0; i < TS; ++i) {
10          loop2: for (j = 0; j < TS; ++j) {
11            loop3: for (k = 0; k < TS; ++k) {
12              c[i][j] += a[i][k] * b[k][j]; }}}}
13      // Step 3: Write data from local buffer to DDR
14      data_send(c, ddr_c, offsetC);
15  }}

After flattening and rearranging the input features, the input matrices of the matrix multiplication in convolutional layers are generally too large to be accommodated on an FPGA platform. Loop tiling is employed to partition the iteration space of the loop into smaller tiles so that the working data set of a tile can be easily accommodated in the available on-chip BRAM storage of the FPGA. Listing 1 shows the tiled matrix multiplication after loop tiling. The highlighted portion (Lines 5-14) is the key computation of a tile, and we accelerate this portion with FPGA-based PEs (explained in Section 3.2.1) and NEON cores.
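For readers who want to run the loop nest, here is a plain C rendering of Listing 1 under stated assumptions: matrices are row-major, N, M and K are multiples of the tile size, the local arrays a, b and c stand in for on-chip buffers, and the data_copy and data_send steps are replaced by ordinary loops. The function name tiled_mm_sketch and the tile size of 2 are choices made for this sketch only.

```c
#include <assert.h>
#include <string.h>

#define TS 2  /* illustrative tile size, not the one Synergy uses */

/* C (N x M) = A (N x K) * B (K x M), all row-major. */
void tiled_mm_sketch(const float *A, const float *B, float *C,
                     int N, int M, int K)
{
    assert(N % TS == 0 && M % TS == 0 && K % TS == 0);
    float a[TS][TS], b[TS][TS], c[TS][TS];  /* stand-ins for local buffers */

    for (int t1 = 0; t1 < N / TS; ++t1) {
        for (int t2 = 0; t2 < M / TS; ++t2) {
            memset(c, 0, sizeof c);          /* initialization */
            for (int t3 = 0; t3 < K / TS; ++t3) {
                /* Step 1: copy one tile of A and one tile of B */
                for (int i = 0; i < TS; ++i)
                    for (int j = 0; j < TS; ++j) {
                        a[i][j] = A[(t1 * TS + i) * K + (t3 * TS + j)];
                        b[i][j] = B[(t3 * TS + i) * M + (t2 * TS + j)];
                    }
                /* Step 2: kernel computation on the tiles */
                for (int i = 0; i < TS; ++i)
                    for (int j = 0; j < TS; ++j)
                        for (int k = 0; k < TS; ++k)
                            c[i][j] += a[i][k] * b[k][j];
            }
            /* Step 3: write the finished output tile back */
            for (int i = 0; i < TS; ++i)
                for (int j = 0; j < TS; ++j)
                    C[(t1 * TS + i) * M + (t2 * TS + j)] = c[i][j];
        }
    }
}
```

Each (t1, t2) iteration touches only one output tile and reads its inputs independently of every other iteration, which is what allows the iterations to be handed to different accelerators.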

Workload Granularity and Computation: Figure 3 shows a tiled MM example with 2 × 2 tile size. In Synergy, the workload granularity of a tiled MM computation is called a job, which is defined as the computation required to output a tile, C(i,j), of an output feature map C. A job is a structure, as shown in Listing 2, containing the base addresses of the arrays (A, B and C), the input data dimensions (m, n and k), the tile index and the layer ID, which identifies the CONV layer that owns the job. Each CONV layer generates a set of jobs. In a CONV thread, we implement a courier function that sends the jobs to the accelerators (PEs and NEONs). When an accelerator gets a job, it first calculates the memory addresses of the required tiles of the input/output feature maps from the base address, data dimension and tile index provided by the job, and fetches the tiles from the external DDR memory to its local storage through the memory controller. After the computation is completed, the PE stores the output tile back to the DDR.

Figure 3: Job: Workload Granularity of a Tiled MM
(Figure content: input tiles A(1,1), A(1,2), B(1,1) and B(2,1) produce output tile C(1,1); tile calculation $C(i,j) = \sum_{k} A(i,k) \times B(k,j)$; each of the four output tiles C(1,1), C(1,2), C(2,1), C(2,2) is one job, Jobs 1 to 4.)

Listing 2: The Structure of Job

typedef struct {
  /* The base addresses of input and output feature maps */
  DATA_TYPE A_addr; DATA_TYPE B_addr; DATA_TYPE C_addr;
  /* Data dimension of input/output feature maps */
  DATA_TYPE m; DATA_TYPE n; DATA_TYPE k;
  /* Index used to locate the tile */
  DATA_TYPE t1; DATA_TYPE t2;
  DATA_TYPE layer_id; /* To track the CONV layer */
} job_t;
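The address calculation an accelerator performs on receiving a job can be sketched as below. This is an illustration rather than Synergy's implementation: it uses a hypothetical job_sketch_t with concrete C types in place of DATA_TYPE, assumes row-major storage with A of size m × k, B of size k × n and C of size m × n (the roles of m, n and k are an assumption), and works in element offsets rather than byte addresses.

```c
#include <assert.h>
#include <stddef.h>

#define TS 32  /* illustrative tile size */

typedef struct {
    float *A_addr, *B_addr, *C_addr;  /* base addresses */
    int m, n, k;     /* assumed: A is m x k, B is k x n, C is m x n */
    int t1, t2;      /* row and column index of the output tile */
    int layer_id;    /* CONV layer that owns the job */
} job_sketch_t;

/* Element offset of the first entry of output tile C(t1, t2). */
size_t c_tile_offset(const job_sketch_t *job)
{
    return (size_t)job->t1 * TS * job->n + (size_t)job->t2 * TS;
}

/* Element offsets of the A and B tiles needed at inner step t3. */
size_t a_tile_offset(const job_sketch_t *job, int t3)
{
    return (size_t)job->t1 * TS * job->k + (size_t)t3 * TS;
}

size_t b_tile_offset(const job_sketch_t *job, int t3)
{
    return (size_t)t3 * TS * job->n + (size_t)job->t2 * TS;
}

/* One job per output tile; returns the number of jobs written. */
int make_jobs(float *A, float *B, float *C, int m, int n, int k,
              int layer_id, job_sketch_t *jobs, int max_jobs)
{
    assert(m % TS == 0 && n % TS == 0 && k % TS == 0);
    int count = 0;
    for (int t1 = 0; t1 < m / TS; ++t1)
        for (int t2 = 0; t2 < n / TS; ++t2) {
            assert(count < max_jobs);
            jobs[count++] =
                (job_sketch_t){A, B, C, m, n, k, t1, t2, layer_id};
        }
    return count;
}
```

With m = 64, n = 96 and k = 32, make_jobs yields six jobs, one per 32 × 32 output tile. Because a job carries only base addresses, dimensions and a tile index, it stays small enough to pass around by address while the accelerator derives everything else locally.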

Heterogeneous Accelerators: As we target the Xilinx Zynq SoC, Synergy uses the FPGA-based PEs and the two NEON cores in the ARM Cortex-A9 processor as the accelerators. A PE is an FPGA implementation of tiled MM. PEs can have different optimizations, and thus the performance of PEs may differ. The number of PEs depends on the available resources in the target FPGA device. We explain the PE design in more detail in Section 3.2.1. To leverage the NEON cores, we have implemented the MM kernel in NEON assembly code. This assembly code is encapsulated in two separate software threads, one for each NEON core, creating two NEON accelerators.

Accelerator Clusters: From the software perspective, Synergy groups the heterogeneous accelerators into clusters. For example, in Figure 2, Cluster-0 has two NEON cores and two FPGA-based PEs, while Cluster-1 groups three PEs. Each cluster has a private workload pool, the Job Queue, as shown in Figure 2. A job queue is a synchronous buffer storing the addresses of the jobs. Each CONV layer is assigned to a cluster by default. Different CONV layers can share the same cluster; for example, CONV-0 and CONV-1 are mapped to Cluster-0 and CONV-2 uses Cluster-1. The mapping of CONV layers to clusters is decided by the number of jobs a CONV layer has. A CONV layer with less workload is mapped onto a less powerful cluster and vice versa. In addition, a designer can define the number of clusters and the corresponding accelerator combinations simply with a hardware configuration file, as shown in Figure 8. In this case, the hardware accelerators will be synthesized and the required hardware-software interface will be automatically generated by the Synergy framework (see Section 3.3).

The CONV layers assigned to a cluster send their jobs to the Job Queue and use all the available accelerators in the cluster. Once the cluster detects jobs in the job queue, it dispatches the jobs to the synchronous buffers attached to each accelerator. Then, the accelerators work on the jobs and inform the cluster when they finish.

3.1.2 Delegate Threads

To abstract away the hardware accelerators, we deploy delegate threads, introduced in ReconOS [10]. A delegate thread is a software wrapper for an FPGA-based accelerator, which can execute operating system (OS) calls on behalf of the associated accelerator. From the OS perspective, the delegate threads are software threads and can interact with the traditional software threads.

In Synergy, a delegate thread is created for each FPGA-based PE. Once launched, it initializes the hardware system and sends a start signal to the associated accelerator via the first-in-first-out (FIFO) control buffer shown in Figure 5. Then, the delegate thread waits for a request from the accelerator to execute a job. When the accelerator sends a job request, the delegate thread obtains the address of a job from its associated cluster, sends it back to the accelerator, and waits for the accelerator’s acknowledgment. Upon receiving the address of the job, the accelerator reads the contents of the job structure, fetches the tile data of the input arrays via the memory controller and performs the MM calculations. Once it finishes, the accelerator signals the delegate thread to acknowledge the completion of the tile calculation. The delegate thread repeats the above steps until all the jobs are finished.

[Figure 4: Work Stealing Execution Flow. The diagram shows Cluster-0 and Cluster-1 with Job Queue 0 and Job Queue 1, the CONV threads that push jobs into the queues, and the thief thread (manager, idle book, stealer); the edges are labelled push, work done, activate and steal.]

3.1.3 Self-balancing: Work Stealing Scheduler

Synergy groups the FPGA-based PEs and the NEON accelerators into multiple clusters, so that the threads corresponding to multiple CONV layers can execute concurrently, achieving better throughput. This approach also increases the accelerator utilization. However, as the workload of CONV layers varies with the data dimensions, an improper cluster configuration may lead to workload imbalance among the clusters. Some clusters might complete their workload early and stay idle, wasting precious computing resources. Therefore, clusters should be carefully partitioned and statically mapped to different CONV layers, so that the time each cluster spends processing its associated workload is balanced [16]. This can increase the accelerator utilization and improve performance. However, finding the optimal cluster configuration is not easy. It requires profiling the performance of different cluster combinations for the input data dimensions of the CONV layers of the specific CNN model and performing a detailed design space exploration to identify the best cluster configuration for static mapping. The identified clusters and PEs then have to be synthesized on the FPGA. This approach is challenging and time-consuming, especially without extensive FPGA expertise. In Synergy, we introduce a dynamic workload balancing technique, work-stealing, to bypass this optimization problem.

This self-balancing technique operates at job granularity and does not require the best cluster configuration, as an idle cluster can steal jobs from the busy clusters. Synergy enables work stealing by introducing a thief thread. The thief thread consists of a manager, an idle book and a stealer. The manager checks the status (idle or busy) of the clusters and activates the stealer if necessary. The idle book records the IDs of the idle clusters, while the stealer steals jobs from the victim clusters and pushes these jobs to the idle clusters. Figure 4 shows the work-stealing flow. Initially, Synergy dispatches the jobs of different CONV layers to the job queues of different clusters. Due to the workload imbalance of the CONV layers, some clusters may finish their assigned workload earlier and remain idle. Let us assume that Cluster-0 finishes first and Cluster-1 is still busy. Cluster-0 then notifies the manager of the thief thread that its work is done. The manager records Cluster-0 in the idle book and activates the stealer. After activation, the stealer tries to steal jobs from the clusters that are not in the idle book. Once it succeeds, the stealer dispatches the jobs to the idle clusters and the manager removes those clusters from the idle book. In this manner, Synergy can fully utilize the accelerator resources and achieve load balancing. Unlike the static mapping technique, the work-stealing approach does not rely on any specific cluster configuration to achieve workload balance. It removes the need to search for the best cluster configuration and requires no effort from the designer.
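The manager, idle book and stealer roles described above can be sketched in a few lines of C++. This is a single-threaded illustration only, not the Synergy library: the type names (`Job`, `Cluster`, `ThiefThread`), the choice of the busiest cluster as victim, and the "steal half of the pending jobs" policy are all assumptions made for the example, since the text does not specify them.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>
#include <vector>

// Hypothetical types for illustration; not the Synergy library API.
struct Job { int layer_id; int tile_id; };

struct Cluster {
    std::deque<Job> queue;  // pending jobs of this cluster
};

struct ThiefThread {
    std::set<std::size_t> idle_book;  // IDs of clusters that reported "work done"

    // Manager: record the idle cluster, then activate the stealer.
    void notify_idle(std::vector<Cluster>& clusters, std::size_t id) {
        assert(id < clusters.size());
        idle_book.insert(id);
        steal(clusters);
    }

    // Stealer: move jobs from a busy victim to each cluster in the idle book.
    void steal(std::vector<Cluster>& clusters) {
        for (auto it = idle_book.begin(); it != idle_book.end();) {
            // Assumed policy: the victim is the non-idle cluster with the most
            // pending jobs, and it must hold at least two jobs.
            std::size_t victim = clusters.size();
            std::size_t most = 1;
            for (std::size_t c = 0; c < clusters.size(); ++c) {
                if (idle_book.count(c) == 0 && clusters[c].queue.size() > most) {
                    victim = c;
                    most = clusters[c].queue.size();
                }
            }
            if (victim == clusters.size()) { ++it; continue; }  // nothing to steal
            // Assumed policy: steal half of the victim's pending jobs.
            for (std::size_t i = 0; i < most / 2; ++i) {
                clusters[*it].queue.push_back(clusters[victim].queue.back());
                clusters[victim].queue.pop_back();
            }
            it = idle_book.erase(it);  // the cluster is busy again
        }
    }
};
```

Because stealing happens per job (one tile of an MM) rather than per layer, the two queues in this sketch end up nearly equal in length regardless of how the layers were initially assigned.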

3.1.4 Other Layers and Preprocessing functions

A CNN contains many other layers, which are executed by the ARM CPU cores in the Synergy framework. Fully connected (FC) layer: This layer is usually used at the end of a network to compute the class scores, resulting in as many outputs as there are classes. Pooling layer: This layer progressively reduces the spatial size of the output from the previous layer to reduce the number of parameters and the amount of computation in the network. Activation layer: This layer comprises a non-linear function that performs a 1-to-1 mapping of each output of the previous layer to an activation value. Synergy supports all kinds of activation functions.

A CNN also contains a few preprocessing functions within the layers, such as im2col and normalization, that take non-negligible time on embedded CPUs. im2col (mentioned in Section 3.1.1) is used for data layout transformation. Normalization is used to scale all the input data to values between 0 and 1 during the inference phase. The overheads of these sequential portions are partially hidden by the HW/SW multi-threaded pipeline in Synergy.
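To make the data layout transformation concrete, the following is a minimal sketch of im2col in its common formulation: every convolution window is unrolled into one column, so that the convolution becomes a matrix multiplication. The function signature and the row-major, channel-first layout are assumptions for this example, not code taken from Synergy or Darknet.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unrolls a (channels x height x width) image into a matrix with
// channels*ksize*ksize rows and out_h*out_w columns (row-major).
// Positions that fall into the padding region are left at zero.
std::vector<float> im2col(const std::vector<float>& im, int channels, int height,
                          int width, int ksize, int stride, int pad) {
    assert(static_cast<int>(im.size()) == channels * height * width);
    const int out_h = (height + 2 * pad - ksize) / stride + 1;
    const int out_w = (width + 2 * pad - ksize) / stride + 1;
    const int rows = channels * ksize * ksize;
    std::vector<float> col(static_cast<std::size_t>(rows) * out_h * out_w, 0.0f);
    for (int r = 0; r < rows; ++r) {
        const int kw = r % ksize;            // column inside the kernel window
        const int kh = (r / ksize) % ksize;  // row inside the kernel window
        const int c = r / (ksize * ksize);   // input channel
        for (int oh = 0; oh < out_h; ++oh) {
            for (int ow = 0; ow < out_w; ++ow) {
                const int ih = oh * stride + kh - pad;
                const int iw = ow * stride + kw - pad;
                if (ih >= 0 && ih < height && iw >= 0 && iw < width) {
                    col[(r * out_h + oh) * out_w + ow] = im[(c * height + ih) * width + iw];
                }
            }
        }
    }
    return col;
}
```

With this layout, a CONV layer with F filters reduces to multiplying an F x (channels*ksize*ksize) weight matrix by the returned matrix, which is the MM that the accelerators process tile by tile.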

[Figure 5: The Hardware Architecture. Left: the multi-core CPU with the software threads, the Synergy Library and the delegate threads DelegateT-0 to DelegateT-3 (user space and kernel space). Right: the Synergy Hardware Architecture on the FPGA, with PE-0 to PE-3 grouped into Cluster-0 and Cluster-1, control FIFOs (if_sw2hw, if_hw2sw) and memory FIFOs (if_hw2mem, if_mem2hw) per PE, and the Memory Subsystem (MEM Arbiters, MMUs, MEM Controllers, Proc Arbiter, Proc). The system bus also connects the DDR DRAM, Ethernet and other peripherals (USB, UART, etc.).]

3.2 Hardware Architecture

Figure 5 shows an example Synergy hardware architecture with four FPGA-based PEs. The architecture is adapted from ReconOS [10]. In this architecture, the software communicates with the hardware accelerators via control FIFOs (if_hw2sw and if_sw2hw). On the software side, a delegate thread (DelegateT) interacts with other software threads on behalf of the associated PE. Data transactions of a PE are handled by the Memory Subsystem via two FIFOs (if_hw2mem and if_mem2hw). In the following subsections, we discuss the accelerator design, the memory subsystem, and the hardware architecture generator.

3.2.1 Accelerator Design

As mentioned earlier, Synergy processes CONV layers as matrix multiplication (MM) operations accelerated using NEON cores and FPGA-based PEs. In this section, we mainly focus on the FPGA-based accelerators and discuss several design challenges. The FPGA-based accelerator for MM is the processing engine (PE) shown in Figure 5, which is generated by a commercial high-level synthesis (HLS) tool, Vivado HLS [22]. HLS converts algorithms written in a high-level specification (i.e., C/C++) into hardware description languages (VHDL/Verilog). It provides optimization options, a.k.a. pragmas, such as loop unrolling, array partitioning and pipelining, to explore diverse hardware architectures with different area and performance tradeoffs.

As mentioned in Section 3.1.1, due to the large input size of MM in CONV layers, we apply Loop Tiling to MM and partition the iteration space of the loop into smaller tiles, so that the data of a tile can be easily accommodated in the available BRAM. Loop Tiling exposes the potential parallelism of MM, as different tiles are independent. We exploit this parallelism by instantiating multiple PEs, within the FPGA resource budget, to process the tiles simultaneously, while exploring hardware architectures of a PE with HLS pragmas. Opening up more parallelism per PE limits the number of PEs that can be accommodated on the FPGA due to resource constraints [26].

Listing 3: Pseudo Code for the HLS Template of a PE

     1  ProcessingEngine(if_sw2hw, if_hw2sw,
     2                   if_hw2mem, if_mem2hw) {
     3    /* Simplified pragmas */
     4    #pragma interface ap_fifo port=if_sw2hw
     5    #pragma interface ap_fifo port=if_hw2sw
     6    #pragma interface ap_fifo port=if_hw2mem
     7    #pragma interface ap_fifo port=if_mem2hw
     8    #pragma interface ap_ctrl_none port=return
     9    ...
    10    wait_for_start_signal();
    11    job_t job;
    12    while (1) {
    13      uint32 job_address = ask_for_a_job();
    14      job = read_job(job_address);
    15      parse_job(job, &Aaddr, &Baddr, &Caddr,
    16                &m, &n, &k, &t1, &t2, &layerID);
    17      tiled_mm(Aaddr, Baddr, Caddr, m, n, k, t1, t2,
    18               if_hw2mem, if_mem2hw);
    19      send_acknowledgment(layerID); }
    20  }

Processing Engine (PE): The pseudo code for the HLS template in Listing 3 demonstrates the general execution flow of a PE. A PE interacts with its associated delegate thread in the user space via the control FIFOs (if_hw2sw and if_sw2hw). For data transactions, the PE cooperates with the Memory Subsystem (Section 3.2.2) through the memory FIFOs (if_hw2mem and if_mem2hw). At the beginning, the PE waits for a start signal issued by its associated delegate thread. Lines 13-19 in Listing 3 show the logic to compute a job. The PE first acquires a job by sending a request to the delegate thread. The actual MM computation is performed in tiled_mm, whose skeleton is shown in Lines 5-14 of Listing 1. The mm_tile function can be summarized in the following five steps: (1) it computes the locations in main memory of the required tiles of the input arrays (A and B); (2) it fetches a tile of data into local memory (a and b); (3) it performs the matrix multiplication and accumulates the partial result into a local array c; (4) it repeats from Step 1 until it exhausts a row of A and a column of B; (5) it stores the output data back to main memory. An acknowledgment is sent to the delegate thread once the PE finishes the computation.

Computation optimizations in mm_tile: Loop pipelining is a crucial optimization option provided by HLS. As the technique can overlap the execution of operations from different iterations, it has a great impact on system throughput. The throughput of a loop depends on the initiation interval (II), which is defined as the number of cycles between the initiations of consecutive loop iterations. In this work, we apply loop pipelining at loop2 in Listing 1. With this optimization, the HLS tool merges loop1 and loop2 into a new loop with a larger loop bound ($newBound = TS \times TS$) and completely unrolls the innermost loop (loop3). We define $lat_{loop3}$ as the latency of loop3. Then the latency $lat_{kernel}$ of the nested loop for kernel computation is calculated as $lat_{kernel} = (newBound - 1) \times II + lat_{loop3}$. When $newBound$ is large enough, $lat_{kernel}$ of the nested loop is decided by $II$.
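As a quick worked example of the latency formula above, the helper below evaluates it for a given tile size. The value used for the latency of loop3 (10 cycles) is a hypothetical placeholder; the real value depends on the synthesized datapath.

```cpp
#include <cassert>
#include <cstdint>

// lat_kernel = (newBound - 1) * II + lat_loop3, with newBound = TS * TS.
std::uint64_t kernel_latency(std::uint64_t ts, std::uint64_t ii,
                             std::uint64_t lat_loop3) {
    assert(ts > 0);
    const std::uint64_t new_bound = ts * ts;  // merged loop1 x loop2
    return (new_bound - 1) * ii + lat_loop3;
}
```

For TS = 32 and the assumed 10-cycle loop3, `kernel_latency(32, 1, 10)` gives 1033 cycles while `kernel_latency(32, 16, 10)` gives 16378 cycles, which shows how strongly II dominates once newBound is large.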

As the operations inside loop3 in Listing 1 have no data dependence, when loop3 is completely unrolled, the operations of different iterations can ideally be executed in parallel. However, the parallelism is constrained by the memory bandwidth. The local buffers (a and b) are implemented with FPGA BRAM resources. By default, a local buffer has only two read ports and one write port. Thus, when loop3 is completely unrolled, only two memory read requests to each buffer (a and b) can be served per cycle, even though $TS$ read requests are generated. This forces $II$ to $TS/2$ and limits the performance of an accelerator. To improve II, we can leverage the array partitioning pragma to split a buffer into multiple banks, where each bank has two read ports and one write port. With loop pipelining and array partitioning, the accelerator requires more multiplication and addition units, and thus more compute resources. Opening up more parallelism per PE limits the number of PEs that can be accommodated on the FPGA due to resource constraints. Given an FPGA device, the tile size, the settings for the HLS pragmas, and the number of PEs can be decided automatically via design space exploration (DSE) [26].
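The effect of array partitioning on II can be estimated with a simple port-counting model. This is a sketch under two assumptions that the text does not state explicitly: the TS reads of an unrolled iteration are spread evenly over the banks (for example with cyclic partitioning), and BRAM read ports are the only bottleneck.

```cpp
#include <cassert>

// Each bank serves two reads per cycle, so `banks` banks serve 2*banks reads.
// The TS reads of one unrolled loop3 iteration then need ceil(TS / (2*banks)) cycles.
unsigned initiation_interval(unsigned ts, unsigned banks) {
    assert(ts > 0 && banks > 0);
    const unsigned reads_per_cycle = 2 * banks;
    return (ts + reads_per_cycle - 1) / reads_per_cycle;
}
```

With TS = 32, an unpartitioned buffer (`banks = 1`) yields II = 16, matching TS/2, while 16 banks bring II down to 1 at the cost of more BRAM banks and arithmetic units per PE.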

Communication optimization in mm_tile: For tiled matrix multiplication, Synergy overlaps the data transfer cost with the computation cost by leveraging double buffering, i.e., instantiating two buffers for each local array. This significantly improves the throughput of tiled MM.
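The control structure of double buffering can be illustrated with a ping-pong loop. The sketch below is sequential software, so nothing actually overlaps; it only shows how two buffers alternate between "being filled" and "being computed on". The helper names (`load_tile`, `compute_tile`, `process_tiles`) and the use of a simple sum as the computation are assumptions for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Stands in for a DDR-to-BRAM transfer of tile `t`.
void load_tile(const std::vector<float>& src, std::size_t t, std::size_t tile,
               std::vector<float>& dst) {
    const auto first = src.begin() + static_cast<std::ptrdiff_t>(t * tile);
    dst.assign(first, first + static_cast<std::ptrdiff_t>(tile));
}

// Stands in for the tile computation (here: a sum of the tile's elements).
float compute_tile(const std::vector<float>& buf) {
    float s = 0.0f;
    for (float v : buf) s += v;
    return s;
}

float process_tiles(const std::vector<float>& src, std::size_t tile) {
    assert(tile > 0 && src.size() % tile == 0);
    const std::size_t tiles = src.size() / tile;
    std::array<std::vector<float>, 2> buf;  // the two buffers of one local array
    float total = 0.0f;
    if (tiles == 0) return total;
    load_tile(src, 0, tile, buf[0]);  // prologue: fill the first buffer
    for (std::size_t t = 0; t < tiles; ++t) {
        const std::size_t cur = t % 2;
        const std::size_t nxt = 1 - cur;
        // In hardware, this prefetch runs concurrently with the compute below.
        if (t + 1 < tiles) load_tile(src, t + 1, tile, buf[nxt]);
        total += compute_tile(buf[cur]);
    }
    return total;
}
```

In an HLS design the prefetch and the computation are independent because they touch different buffers, which is what allows the tool to schedule them in parallel.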

[Figure 6: Virtual To Physical Address Translation [1]. The virtual address is split into an L1 offset, an L2 offset and a page offset; the L1 and L2 page tables are walked in turn to obtain the physical address.]

Zero Padding in mm_tile: In Synergy, the hardware accelerators are shared among the convolutional layers. This implies that the same fixed-size MM accelerator is used in different layers. As the loop bounds (or data dimensions) of MM differ across convolutional layers, we may encounter scenarios where the fixed-size MM accelerator attempts to read data beyond the bounds of the input matrices or to write data outside the bounds of the output matrix. Hence, we include border detection in mm_tile. When fetching data, if a memory address exceeds the matrix border, the corresponding portion of the local buffer is set to zero. Similarly, when writing data, mm_tile ignores write requests whose memory addresses exceed the given matrix borders.
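The border detection described above can be sketched as a software tiled MM. This is an illustrative reconstruction under assumptions, not the paper's Listing 1: the function signature, the row-major layout, and the small tile size are chosen for the example (the evaluation uses a tile size of 32).

```cpp
#include <cassert>
#include <vector>

constexpr int TS = 4;  // tile size for this sketch only

// C (m x n) = A (m x k) * B (k x n), all matrices row-major.
void tiled_mm(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C, int m, int n, int k) {
    assert(static_cast<int>(A.size()) == m * k);
    assert(static_cast<int>(B.size()) == k * n);
    assert(static_cast<int>(C.size()) == m * n);
    for (int ti = 0; ti < m; ti += TS) {
        for (int tj = 0; tj < n; tj += TS) {
            float c[TS][TS] = {};  // local accumulator tile
            for (int tk = 0; tk < k; tk += TS) {
                float a[TS][TS];
                float b[TS][TS];
                // Fetch with border detection: out-of-range elements become zero.
                for (int i = 0; i < TS; ++i) {
                    for (int j = 0; j < TS; ++j) {
                        a[i][j] = (ti + i < m && tk + j < k) ? A[(ti + i) * k + tk + j] : 0.0f;
                        b[i][j] = (tk + i < k && tj + j < n) ? B[(tk + i) * n + tj + j] : 0.0f;
                    }
                }
                // Fixed-size tile multiplication; zero padding adds nothing to c.
                for (int i = 0; i < TS; ++i)
                    for (int j = 0; j < TS; ++j)
                        for (int p = 0; p < TS; ++p)
                            c[i][j] += a[i][p] * b[p][j];
            }
            // Write back, ignoring positions outside the output matrix.
            for (int i = 0; i < TS; ++i)
                for (int j = 0; j < TS; ++j)
                    if (ti + i < m && tj + j < n) C[(ti + i) * n + tj + j] = c[i][j];
        }
    }
}
```

Because the padded elements are zero, the fixed-size inner computation never needs to know the real matrix dimensions, which is what lets one accelerator serve layers of different sizes.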

3.2.2 Memory Subsystem

The Memory Subsystem shown in Figure 5 is used to process memory requests from multiple PEs. It consists of memory arbiters (MEM Arbiters), memory management units (MMUs), memory controllers (MEM Controllers), a Proc Arbiter and a Proc unit. An MMU translates virtual addresses to physical addresses, while a MEM Arbiter allocates memory transaction requests to the shared MMU. The Proc unit obtains the first-level translation page table address and handles page fault requests, and the Proc Arbiter allows multiple MMUs to access the Proc unit. The MEM Controllers access the DDR memory using the AXI4 burst mode protocol. All the components in the Memory Subsystem are written in RTL code and constitute the Hardware Template Library shown in Figure 8.

Virtual to Physical Address Translation: In a traditional HW/SW co-design approach, a device driver normally owns a contiguous memory address space in the Linux kernel. When a delegate thread tries to communicate with an FPGA PE, it first copies data from the user space to the allocated contiguous memory (kernel space) in the device driver and sends the physical address of that memory to the PE. Then the PE obtains the data from the DDR memory via the MEM Controllers. In Synergy, we avoid this extra data copy in the acceleration of CONV layers. As mentioned in Sections 3.1.2 and 3.2.1, a PE obtains the address of a job directly from the delegate thread in the user space, and the job content includes the base memory addresses of the input/output arrays in the user space. Those are virtual addresses. In the ARM Cortex-A9 architecture [1], virtual addresses are translated to physical addresses by a two-level (L1 and L2) page table walk, as shown in Figure 6. The base address of the L1 page table is stored in a CPU system register R[1], which can be accessed in the kernel space. Synergy supports this two-level page table walk in the FPGA. During the FPGA initialization in Synergy, the Proc unit obtains the base address of the L1 page table via its device driver. Then, the Memory Subsystem translates the virtual address to a physical address following the steps in Figure 6. In case of a page fault, the Proc unit triggers a CPU interrupt, obtains a new base address and repeats the translation process.
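A two-level page table walk of this kind can be sketched in software as follows. The bit layout used here is that of the ARMv7-A short-descriptor format with 4 KB small pages (12-bit L1 index, 8-bit L2 index, 12-bit page offset); whether Synergy's MMU handles exactly these cases is an assumption, and sections, large pages and access permissions are deliberately left out. Physical memory is modelled with a hypothetical `PhysMem` map.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Word-addressed model of physical memory; missing entries read as zero.
using PhysMem = std::unordered_map<std::uint32_t, std::uint32_t>;

std::uint32_t read_word(const PhysMem& mem, std::uint32_t addr) {
    const auto it = mem.find(addr);
    return it == mem.end() ? 0u : it->second;
}

// Returns false on a translation fault; otherwise writes the physical address.
bool translate(const PhysMem& mem, std::uint32_t l1_base, std::uint32_t va,
               std::uint32_t& pa) {
    assert((l1_base & 0x3FFFu) == 0);  // assumes a 16 KB-aligned L1 table
    const std::uint32_t l1_index = va >> 20;  // VA[31:20]
    const std::uint32_t l1_desc = read_word(mem, l1_base + (l1_index << 2));
    if ((l1_desc & 0x3u) != 0x1u) return false;  // not a pointer to an L2 table
    const std::uint32_t l2_base = l1_desc & 0xFFFFFC00u;
    const std::uint32_t l2_index = (va >> 12) & 0xFFu;  // VA[19:12]
    const std::uint32_t l2_desc = read_word(mem, l2_base + (l2_index << 2));
    if ((l2_desc & 0x2u) == 0) return false;  // fault or large page: not handled
    pa = (l2_desc & 0xFFFFF000u) | (va & 0xFFFu);  // page base + VA[11:0]
    return true;
}
```

A `false` return corresponds to the page fault case in the text, where the Proc unit would interrupt the CPU before the translation is retried.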

[Figure 7: Single-MMU vs. Multi-MMU Performance. (a) ReconOS and (b) Synergy. X-axis: number of PEs (1 to 6); Y-axes: performance (ms) and speedup compared to a single PE.]

Multiple MMU Support: The ReconOS architecture [10] contains a single MMU and MEM Controller. The memory transactions from the PEs compete for the resources in the Memory Subsystem. As the number of PEs increases, the memory contention significantly degrades the system performance, as shown in Figure 7a. To solve this problem, Synergy instantiates multiple MMUs, with at most two PEs sharing an MMU and a MEM Controller. As the frequency of page faults is generally low in our case, the multiple MMUs in Synergy share the same Proc unit via the arbiter logic (Proc Arbiter). Figure 7b shows that the speedup increases linearly as we instantiate more PEs in Synergy.

[Figure 8: Hardware Architecture Generator. Blocks shown: the *.hw_config input file (architecture parameters, tile size), the Tiled MM (C++) source and pragmas fed to the HLS tool to produce the accelerators, the Hardware Template Library, the Synergy Hardware Architecture, the FPGA bitstream, and the Library Generator that produces the HW/SW multi-threading library. The simplified *.hw_config shown on the left of the figure reads:]

    …
    cluster_num = 2
    clockFreq = 100000000
    tile_size = 32
    fifo_os = 128
    fifo_mem = 128
    hw@PE0
    hlsopt = pragmas0.tcl
    cluster = 0
    hw@PE1
    hlsopt = pragmas1.tcl
    cluster = 0
    hw@PE2
    hlsopt = pragmas2.tcl
    cluster = 0
    hw@PE3
    hlsopt = pragmas3.tcl
    cluster = 1
    ...

3.3 Hardware Architecture Generator

Synergy provides a default accelerator architecture on a given FPGA device. However, for a new FPGA device, or in case the developer is interested in customizing the accelerator architecture for a particular CNN model, Synergy provides an architecture generator, as shown in Figure 8. It automates the generation of the PEs with HLS, of the Hardware Architecture, and of the final FPGA bitstream. The input to the generator is a configuration file, *.hw_config, containing the architecture parameters. The simplified format of this configuration file is shown on the left side of Figure 8; this example creates the Hardware Architecture shown in Figure 5. Moreover, based on the configuration file, the generator also compiles the HW/SW multi-threading library (Synergy Library) to provide the APIs required by Section 3.1.
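To illustrate how such a configuration file could be consumed, the sketch below parses the simplified format from Figure 8 into global parameters and per-PE sections. The parsing rules are assumptions inferred from that figure (`key = value` lines, a `hw@<name>` line opening a PE section), and the `HwConfig`, `PeConfig` and `parse_hw_config` names are hypothetical; the actual generator may accept a richer syntax.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct PeConfig {
    std::string name;                         // e.g. "PE0"
    std::map<std::string, std::string> opts;  // e.g. hlsopt, cluster
};

struct HwConfig {
    std::map<std::string, std::string> globals;  // e.g. cluster_num, tile_size
    std::vector<PeConfig> pes;
};

static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

HwConfig parse_hw_config(const std::string& text) {
    HwConfig cfg;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty()) continue;
        if (line.compare(0, 3, "hw@") == 0) {  // start of a new PE section
            cfg.pes.push_back(PeConfig{line.substr(3), {}});
            continue;
        }
        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;  // skip lines such as "..."
        const std::string key = trim(line.substr(0, eq));
        const std::string value = trim(line.substr(eq + 1));
        if (cfg.pes.empty()) {
            cfg.globals[key] = value;  // before the first hw@ line: global parameter
        } else {
            cfg.pes.back().opts[key] = value;  // belongs to the current PE
        }
    }
    return cfg;
}
```

Grouping the parsed PEs by their `cluster` option then gives the cluster composition, for example three PEs in cluster 0 and one in cluster 1 for the file shown in Figure 8.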

4 Experimental Evaluation

In this section, we evaluate the Synergy framework with multiple representative CNN models.

Synergy has been implemented on two heterogeneous SoC platforms, Zedboard [22] and Xilinx ZC702, both featuring the same Xilinx Zynq XC7Z020 device. The Xilinx Zynq XC7Z020 is a low-end SoC in terms of its compute capability and its limited FPGA resources. We report the performance and power numbers from the Xilinx ZC702 evaluation board because it has power measurement support. All performance results are collected using an FPGA-based timer.

We use Darknet [14] as our deep learning package. Darknet is an open source neural network framework written in C. We use Darknet because it has a highly-optimized single-threaded software implementation and does not depend on any external library. We first compile Darknet for the ARM core in the Xilinx Zynq device. Apart from the single-threaded software implementation, we create a multi-threaded pipelined version of Darknet to take advantage of inter-frame parallelism for high-throughput CNN implementation. The CPU-only implementations of the various CNNs in this section are well-optimized and compiled by gcc with -O3 optimization. As Darknet uses 32-bit floating-point CNN models, we also use 32-bit floating-point implementations in both the software and the hardware accelerators. The performance-power numbers of Synergy will improve substantially if the 32-bit floating-point implementation is replaced by an N-bit fixed-point implementation where N << 32. However, this optimization is orthogonal and complementary to our current approach. Even with floating-point, we achieve better throughput and energy-efficiency compared to contemporary fixed-point implementations on the same device.

We write assembly-language code to generate highly optimized NEON accelerators for the tiled matrix-multiplication operations. For hardware accelerator generation, we use Vivado Design Suite version 2016.2 for High-Level Synthesis (HLS). The tiled matrix multiplication code is written in C and is synthesized on the FPGA using Vivado HLS with appropriate pragma settings, as presented in Section 3.2.1. Synergy uses a two-cluster configuration (Cluster-0: 2 NEONs + 2 S-PE; Cluster-1: 6 F-PE) across all benchmarks. The cluster configuration is chosen based on power/performance results across multiple CNNs, and the work stealing technique ensures that other CNN applications also work well on this fixed hardware architecture by balancing the workload at runtime. The FPGA-based PEs run at 100 MHz. The HW/SW multi-threading is implemented by adapting ReconOS, an open-source operating system for reconfigurable computing [10][15]. The ARM cores run Linux, which is augmented with ReconOS to interface with the hardware accelerators.

The entire Synergy framework is set up on a PC with an Intel Xeon E5-2620 CPU running at 2.10 GHz with 64 GB RAM, running Ubuntu 14.04. Given a CNN model, the Synergy framework is responsible for generating the appropriate software threads corresponding to the different layers of the network, interfacing the software threads with the delegate threads of the hardware accelerators, and creating a default mapping between the CONV layers and the accelerator clusters. The Synergy framework can also automate the FPGA bitstream generation, given a hardware accelerator architecture configuration from the designer, to customize the Synergy implementation for a particular CNN model (if desired) or to generate one for a new device.

Table 2: Network Architectures of Benchmark CNN Models

| Benchmark | CONV Layers | Num. of Layers | Description |
|---|---|---|---|
| CIFAR Darknet [14] | 4 | 9 | Object Recognition |
| CIFAR Alex [5] | 3 | 8 | Object Recognition |
| CIFAR Alex+ [5] | 3 | 9 | Object Recognition |
| CIFAR full [7] | 3 | 9 | Object Recognition |
| MNIST [9] | 2 | 7 | Digit Recognition |
| SVHN [12] | 3 | 8 | Digit Recognition |
| MPCNN [11] | 3 | 9 | Gesture Recognition |

Benchmarks: Table 2 shows seven CNN models used in this work and

trained with Darknet.

4.1 Synergy Throughput and Energy-Eﬃciency

Throughput: Compared with the original single-threaded Darknet implementation running on an ARM core, Synergy achieves an average throughput improvement of 7.3x, as shown in Figure 9.

Power and Energy Consumption: Figure 10 depicts the power distribution and energy consumption of the Synergy system. The FPGA logic accounts for only 27% of the total power consumption (around 2.08 W), averaged across all CNN models. The ARM cores and the DDR memory account for most of the power consumption. Compared with the power (1.52 W on average) measured for the CPU+NEON only implementations, Synergy incurs 36.63% more power consumption.

[Figure 9: Throughput improvement using Synergy. Y-axis: throughput (frames/sec); series: Original-Darknet (1.0x baseline) and Darknet-With-Synergy, with speedup labels 7.8x, 6.0x, 8.7x, 8.2x, 6.6x, 9.4x and 4.5x in plot order.]

Table 3 shows the energy and performance-per-watt comparison between the original single-threaded Darknet implementation running on ARM cores and the Synergy design. The Synergy design consumes 36.63% more power on average, as it fully leverages the heterogeneity of the Zynq platform. Although the power consumption increases, the Synergy implementation achieves much higher throughput (7.3x speedup), and thus reduces the energy consumption by 80.13%, averaged across all CNN models, compared to the original Darknet on ARM cores.
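The relationship between power, throughput and energy per frame behind these numbers can be checked with a small calculation. The helpers below are an illustrative sketch (the function names are not from the paper); they use the standard relation that energy per frame equals average power divided by throughput.

```cpp
#include <cassert>

// Energy per frame in mJ, from average power in W and throughput in frames/s.
double energy_mj_per_frame(double power_w, double frames_per_sec) {
    assert(frames_per_sec > 0.0);
    return 1000.0 * power_w / frames_per_sec;
}

// Percentage by which `improved` is lower than `baseline`.
double reduction_percent(double baseline, double improved) {
    assert(baseline > 0.0);
    return 100.0 * (baseline - improved) / baseline;
}
```

For example, `reduction_percent(142.18, 25.36)` reproduces the 82.16% reduction reported for CIFAR Darknet in Table 3. As a rough cross-check of the averages, 1.3663 times the power at 7.3 times the throughput gives about 0.19 of the baseline energy per frame, i.e. a reduction of roughly 81%, in line with the reported 80.13% (which is a mean of the per-benchmark reductions).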

FPGA Resource Consumption: The hardware accelerators generated by Vivado HLS have a great impact on FPGA resource consumption. With the limited FPGA resource budget, opening up more parallelism via HLS pragmas reduces the number of hardware accelerators that can fit on the Xilinx ZC702. Therefore, we explore diverse hardware accelerator architectures by traversing different tile sizes and HLS pragma combinations consisting of loop unrolling, loop pipelining and array partitioning. In this work, the tile size is set to 32 based on empirical evaluation. On the ZC702 device, we instantiate 6 faster FPGA-based processing engines (F-PE), with the loop pipelining pragma applied at loop2 in Listing 1, and 2 slower PEs (S-PE), with loop unrolling (factor = 2) and loop pipelining at loop3.

Comparison with State-of-the-art: Table 4 compares Synergy with recent FPGA-based CNN works. Note that CaffePresso [6] uses a development platform with significantly more resources and runs at a higher clock speed. Moreover, as Darknet does not support data quantization and fixed-point implementation, Synergy uses a 32-bit floating-point design that consumes much more resources than 32/16-bit fixed-point designs on FPGAs. As shown in Table 4, even though we have handicapped ourselves with floating-point operations, our implementations (both CIFAR full and MNIST) are superior to [6][21] in terms of throughput (frames per second), giga-operations-per-second (GOPS), and energy consumption. Compared to [20], the GOPS of our MNIST and MPCNN designs are 4.5x and 1.8x higher, respectively. Table 4 demonstrates that Synergy can provide high-throughput and energy-efficient mapping of CNN models on embedded heterogeneous SoC platforms.

[Figure 10: Power Distribution and Energy Consumption. Axes: power consumption (W) and energy (mJ/frame); series: FPGA, ARM, DDR and Energy.]

4.2 Advantage of Heterogeneity

We now investigate the impact of heterogeneity on the CNN performance in Synergy. Figure 11 shows the latency improvement of different non-pipelined CNN implementations (single-threaded, using a single ARM core), namely CPU+NEON, CPU+FPGA, and CPU+Het (which combines FPGA and NEON accelerators), relative to the baseline single-core ARM design. Compared to the CPU+FPGA design, the heterogeneous CPU+Het implementation with FPGA and NEON improves the latency by 12% on average, with a maximum improvement of 45% for the MPCNN model.

The throughput speedup of the different pipelined CNN implementations (multi-threaded, using two ARM cores), CPU+NEON, CPU+FPGA, and CPU+Het, relative to the baseline single-core ARM design is shown in Figure 12. Compared to the CPU+FPGA designs, the heterogeneous CPU+Het implementations with FPGA and NEON achieve 15% better throughput on average (with a maximum improvement of 37% for the MNIST benchmark).

Table 3: Energy and Performance per Watt Comparison: Original Darknet Versus Synergy

| Benchmarks | Energy, Original (mJ/frame) | Energy, Synergy (mJ/frame) | Energy Reduction (%) | Perf. per watt, Original (GOPS/W) | Perf. per watt, Synergy (GOPS/W) | Speedup |
|---|---|---|---|---|---|---|
| CIFAR Darknet | 142.18 | 25.36 | -82.16 | 0.14 | 0.80 | 5.61x |
| CIFAR Alex | 105.03 | 23.43 | -77.70 | 0.16 | 0.80 | 4.48x |
| CIFAR Alex+ | 326.62 | 55.81 | -82.91 | 0.16 | 0.70 | 5.85x |
| CIFAR full | 196.41 | 33.71 | -82.84 | 0.13 | 0.94 | 5.83x |
| MNIST | 112.90 | 22.78 | -79.83 | 0.20 | 0.78 | 4.96x |
| SVHN | 193.67 | 28.07 | -85.50 | 0.14 | 0.98 | 6.90x |
| MPCNN | 47.87 | 14.37 | -69.99 | 0.20 | 0.68 | 3.33x |
| mean | | | -80.13 | | | 5.28x |

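Table 4: Comparison With Recent FPGA-based CNN Works. ‘*’ indicates values estimated from charts

| Work | Device | Clock (MHz) | Precision | Benchmark | Latency (ms) | Throughput (frames/s) | GOPS | Energy (mJ/frame) |
|---|---|---|---|---|---|---|---|---|
| CaffePresso [6] | 7Z045 | 180 | 16-bit Fixed-point | MNIST | 16.0 | 62.5 | 1.19 | >200* |
| CaffePresso [6] | 7Z045 | 180 | 16-bit Fixed-point | CIFAR full | 28.0 | 35.7 | 0.94 | >500* |
| fpgaConvNet [19][20] | 7Z020 | 100 | 16-bit Fixed-point | MNIST | – | – | 0.48 | – |
| fpgaConvNet [19][20] | 7Z020 | 100 | 16-bit Fixed-point | MPCNN | – | – | 0.74 | – |
| DeepBurning [21] | 7Z020 | 100 | Fixed-point | MNIST | 14.3 | 69.9 | 1.33* | 150* |
| DeepBurning [21] | 7Z020 | 100 | Fixed-point | CIFAR full | 21.4 | 46.7 | 1.23* | 63 |
| Synergy | 7Z020 | 100 | 32-bit Floating-point | MNIST | 24.3 | 96.2 | 2.15 | 22.8 |
| Synergy | 7Z020 | 100 | 32-bit Floating-point | CIFAR full | 33.2 | 63.5 | 1.67 | 33.7 |
| Synergy | 7Z020 | 100 | 32-bit Floating-point | MPCNN | 12.2 | 136.4 | 1.33 | 14.4 |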

4.3 Transparent Accelerators: Work Stealing

We show the advantage of dynamic load balancing across accelerators using work-stealing in Synergy versus static mapping of the CONV layers to the accelerators. We consider two different cluster and PE configurations for static mapping. The first configuration consists of the two clusters (Cluster-0: 2 NEONs + 2 S-PE; Cluster-1: 6 F-PE) used in Synergy across all benchmarks. But unlike in Synergy, the CONV layers are statically assigned to the clusters based on their workload. We refer to this as static-mapping+fixed-architecture (SF). Figure 13 shows that the SF designs achieve 6.1x better throughput compared to the well-optimized CPU designs.

However, the SF designs are inefficient, as the workload assigned to the clusters might not be balanced due to the different computation requirements of each CONV layer. Figure 14a presents the execution time of each CONV layer in the CIFAR Alex model with this configuration. The CONV-0 layer is mapped to Cluster-0, while the CONV-1 and CONV-2 layers are mapped to Cluster-1. As shown in Figure 14a, the runtimes of Cluster-0 and Cluster-1 are 24.3 ms and 12.3 ms per frame, respectively. This imbalance in execution time between the clusters leads to poor cluster utilization and throughput.

[Figure 11: Latency Improvement with Accelerators Compared to CPU-only Solutions for Non-Pipelined Designs. Y-axis: speedup compared to CPU-only. Het: heterogeneous accelerators with NEON and FPGA. Bar labels in plot order: CPU+NEON 1.78, 1.52, 1.70, 1.92, 1.50, 1.41, 1.01; CPU+FPGA 3.30, 2.69, 3.78, 3.75, 2.27, 4.69, 1.84; CPU+Het 3.41, 2.79, 3.84, 3.87, 2.83, 4.81, 2.68.]

Synergy employs work-stealing to automatically balance the workload of different clusters. This makes Synergy network-agnostic, as the jobs from different CONV layers are automatically distributed across the different clusters to achieve self-balancing. With the same generic cluster configuration used in the SF designs, Figure 13 shows that Synergy improves the throughput by 24% on average compared to the SF designs. The performance improvement comes from the balanced clusters. Figure 14b presents the execution time of each CONV layer of the Synergy design for the CIFAR Alex benchmark. The runtimes of Cluster-0 and Cluster-1 are 22.2 ms and 20.9 ms per frame, respectively. Compared to the SF design in Figure 14a, the workloads of Cluster-0 and Cluster-1 are balanced.

Finally, we show that Synergy work-stealing with a generic cluster architecture is competitive with, and even better than, CNN-model-specific customized cluster configurations. We call these static-mapping+custom-architecture (SC) designs. In the SC designs, we find the best multi-cluster configuration for each CNN model by exploring all possible cluster configurations. The best multi-cluster configurations are shown in Table 5 (the number of clusters in this work can be t, where t ∈ N). The CONV layers are statically mapped to these clusters. Note that unlike the optimized cluster configurations in the SC designs, Synergy leverages the same generic cluster configuration used in the SF designs for the various CNN models. As shown in Figure 13, Synergy still achieves 6% better throughput than the SC designs. This is because in the static mapping approaches (SF and SC) an entire CONV layer is assigned to a cluster, and it is hard to perfectly balance the cluster workloads. In contrast, work-stealing in Synergy operates at job-level granularity (tiled MM) and can easily balance the workload even with un-optimized generic accelerators. The work stealing feature in Synergy empowers developers to easily switch between different networks at runtime without losing performance.

[Figure 12: Throughput Improvement with Accelerators Compared to CPU-only Solutions for Pipelined Designs. Y-axis: speedup compared to CPU-only. Het: heterogeneous accelerators with NEON and FPGA. Bar labels in plot order: CPU+NEON 1.89, 2.14, 2.23, 2.26, 1.61, 2.23, 1.29; CPU+FPGA 6.96, 5.80, 7.95, 7.49, 4.81, 7.47, 4.12; CPU+Het 7.82, 5.95, 8.71, 8.15, 6.62, 9.44, 4.47.]

[Figure 13: Advantage of Work stealing. Y-axis: speedup compared to CPU-only. SF: static mapping + fixed architecture; SC: static mapping + custom architecture; Synergy: dynamic mapping + fixed architecture. Bar labels in plot order: SF 7.32, 4.31, 7.07, 7.05, 4.82, 7.65, 3.55; SC 7.69, 5.85, 8.11, 7.81, 5.98, 8.38, 4.34; Synergy 7.82, 5.95, 8.71, 8.15, 6.62, 9.44, 4.47.]

[Figure 14: Dynamic Load Balancing in CIFAR Alex. SF Conf.: Cluster-0 (2 NEONs + 2 S-PE), Cluster-1 (6 F-PE). Y-axis: execution time (ms) per cluster, broken down by CONV-0, CONV-1 and CONV-2. (a) SF Configuration with Two Clusters: Cluster-0 24.3 ms, Cluster-1 12.3 ms. (b) Synergy: same SF configuration + work-stealing: Cluster-0 22.2 ms, Cluster-1 20.9 ms.]

Table 5: Best Cluster Configurations for CNN Models under Static Mapping + Custom Architectures

Benchmarks      Cluster 0                  Cluster 1
                NEON   FPGA IP             NEON   FPGA IP
CIFAR Darknet   0      2 S-PE + 1 F-PE     2      5 F-PE
CIFAR Alex      0      2 S-PE + 2 F-PE     2      4 F-PE
CIFAR Alex+     2      2 S-PE + 2 F-PE     0      4 F-PE
CIFAR full      0      2 S-PE + 2 F-PE     2      4 F-PE
MNIST           2      2 S-PE + 2 F-PE     0      4 F-PE
SVHN            2      2 S-PE + 2 F-PE     0      4 F-PE
MPCNN           0      2 S-PE + 2 F-PE     2      4 F-PE
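One way to read Table 5 is as a set of partitions of a single accelerator pool across two clusters. The sketch below transcribes the table, and the generic SF configuration from the Figure 14 caption, into a small data structure and sums the resources per benchmark. The `Cluster` type and `pool` helper are hypothetical names introduced here; the observation that every row sums to the same pool is derived from the table itself rather than stated in the text.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    neon: int  # number of NEON cores
    s_pe: int  # number of S-PE engines on the FPGA
    f_pe: int  # number of F-PE engines on the FPGA


# Transcribed from Table 5: (Cluster 0, Cluster 1) per benchmark.
SC_CONFIGS = {
    "CIFAR Darknet": (Cluster(0, 2, 1), Cluster(2, 0, 5)),
    "CIFAR Alex":    (Cluster(0, 2, 2), Cluster(2, 0, 4)),
    "CIFAR Alex+":   (Cluster(2, 2, 2), Cluster(0, 0, 4)),
    "CIFAR full":    (Cluster(0, 2, 2), Cluster(2, 0, 4)),
    "MNIST":         (Cluster(2, 2, 2), Cluster(0, 0, 4)),
    "SVHN":          (Cluster(2, 2, 2), Cluster(0, 0, 4)),
    "MPCNN":         (Cluster(0, 2, 2), Cluster(2, 0, 4)),
}

# Generic SF configuration, as given in the Figure 14 caption.
SF_CONFIG = (Cluster(2, 2, 0), Cluster(0, 0, 6))


def pool(config):
    """Total (NEON, S-PE, F-PE) resources used by a configuration."""
    return tuple(sum(values) for values in zip(*config))


pools = {name: pool(cfg) for name, cfg in SC_CONFIGS.items()}
for name, total in pools.items():
    print(f"{name:14s} NEON={total[0]} S-PE={total[1]} F-PE={total[2]}")
print("SF (generic)   NEON=%d S-PE=%d F-PE=%d" % pool(SF_CONFIG))
```

Under this reading, the SC designs differ only in how the same accelerators are grouped per network, which is the tuning step that work stealing makes unnecessary.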

To better understand the performance improvement, Table 6 shows the accelerator cluster utilization of the various designs. The non-pipelined designs are the best single-threaded implementations (the blue bars in Figure 11), leveraging a single CPU, a NEON core and the FPGA accelerators. As shown in Table 6, the cluster utilization of the non-pipelined designs is low: the FPGA is idle for 43.95% (= 1 − 56.05%) of the total execution time on average. The reason is that in a non-pipelined design the FPGA accelerators have to wait for the CPU or the NEON core to finish their work. With multi-threading support, the pipelined designs significantly increase the cluster utilization (above 90% on average), as the various computing elements can work simultaneously. Table 6 shows that the SF designs increase the accelerator cluster utilization from 56.1% to 92.5% on average. The SC designs raise the cluster utilization further, to 96.5% averaged across the benchmarks, because they use fine-tuned cluster configurations and the workload assigned to the clusters is therefore more balanced. As mentioned above, Synergy's work-stealing scheduler works at the finer granularity of job-level (tiled MM) and improves the cluster utilization at runtime by balancing the workload across clusters. The average cluster utilization of Synergy reaches 99.8%, as shown in Table 6.

Table 6: Accelerator Cluster Utilization Comparison Across SF, SC and Synergy

Benchmarks      Non-pipelined (%)   Pipelined (%)
                                    SF      SC      Synergy
CIFAR Darknet   50.77               95.32   97.55   99.89
CIFAR Alex      53.56               92.72   96.61   99.83
CIFAR Alex+     61.28               98.81   98.73   99.95
CIFAR full      54.06               93.53   94.97   100.00
MNIST           59.03               85.63   96.09   99.89
SVHN            53.00               94.72   96.86   99.26
MPCNN           60.62               86.47   94.45   99.79
mean            56.05               92.46   96.47   99.80
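The averages quoted above can be re-derived from the per-benchmark rows of Table 6. The short sketch below recomputes the mean utilization of each design and the corresponding idle fraction (100% minus utilization). The values are transcribed from the table; the variable names are illustrative only.

```python
from statistics import mean

# Cluster utilization (%) transcribed from Table 6.
# Column order: non-pipelined, SF, SC, Synergy.
DESIGNS = ("Non-pipelined", "SF", "SC", "Synergy")
UTILIZATION = {
    "CIFAR Darknet": (50.77, 95.32, 97.55, 99.89),
    "CIFAR Alex":    (53.56, 92.72, 96.61, 99.83),
    "CIFAR Alex+":   (61.28, 98.81, 98.73, 99.95),
    "CIFAR full":    (54.06, 93.53, 94.97, 100.00),
    "MNIST":         (59.03, 85.63, 96.09, 99.89),
    "SVHN":          (53.00, 94.72, 96.86, 99.26),
    "MPCNN":         (60.62, 86.47, 94.45, 99.79),
}

# Average each column over the seven benchmarks.
averages = {
    design: mean(row[i] for row in UTILIZATION.values())
    for i, design in enumerate(DESIGNS)
}

for design, util in averages.items():
    print(f"{design:14s} utilization {util:6.2f}%   idle {100 - util:5.2f}%")
```

The recomputed column means agree with the "mean" row of Table 6 to two decimal places, and the non-pipelined idle fraction matches the 43.95% quoted in the text.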

5 Conclusion

This paper presents Synergy, an automated, transparent, hardware-software co-designed CNN inference framework on an embedded FPGA-based heterogeneous SoC architecture. Synergy fully exploits the heterogeneity by leveraging diverse computing resources (CPUs, NEONs and FPGA) to accelerate CNNs. Moreover, Synergy provides a work-stealing scheduler in software that automatically balances the workload across accelerators, so that it can easily adapt to various networks at runtime without changing the hardware or software implementation. Our results show that Synergy achieves a 7.3x speedup, averaged across seven representative CNN models, over a well-optimized software-only solution. Compared to contemporary CNN implementations on the same SoC platform, Synergy delivers better throughput as well as better energy efficiency.

