Towards a Component-based Acceleration of Convolutional Neural Networks on FPGAs
Danielle Tchuinkou Kwadjo, Erman Nghonda Tchinda, Joel Mandebi Mbongue, Christophe Bobda
Electrical and Computer Engineering Department, University of Florida, Gainesville, FL 32603, USA
Abstract
In recent years, Convolutional Neural Networks (CNNs) have been extensively adopted in broad Artificial Intelligence (AI) applications and have demonstrated ability and effectiveness in solving learning problems. However, developing high-performance hardware accelerators on Field Programmable Gate Arrays (FPGAs) for CNNs often demands skills in hardware design and verification, accurate distribution localization, and long development cycles. Besides, the depth of CNN architectures increases by reusing and replicating several layers. In this work, we take advantage of the replication of CNN layers to improve design performance and productivity. We propose a programming flow for CNNs on FPGA that generates high-performance accelerators by assembling pre-implemented CNN components as a puzzle based on the graph topology. Using pre-implemented components allows us to use a minimum of resources, predict the performance, and gain in productivity, since there is no need to synthesize any Hardware Description Language (HDL) source code. Furthermore, the pre-implemented components are reused across a range of applications, reducing engineering time. Through prototyping, we demonstrate the viability and relevance of our approach. Experiments show a productivity improvement of up to 69% compared to a traditional FPGA implementation while achieving over 1.75× higher Fmax with lower resource utilization and higher energy efficiency.
Keywords: FPGA, Data Flow Graph, CNN Inference, Pre-implemented Flow
1. Introduction
In recent years, the implementation of Convolutional Neural Networks (CNNs) on Field Programmable Gate Arrays (FPGAs) has drawn considerable attention, as the need for more efficiency and accuracy is tempered by the rapid increase in computational cost. CNNs achieve a higher quality of result (QoR) at the cost of significant computing and memory requirements due to their deep topological structures, complicated neural connections, and massive amounts of data to process [1, 2]. As a result, to keep up with the versatile performance needs of CNN applications in several domains like image and video processing, hardware accelerators such as Application-Specific Integrated Circuits (ASICs), FPGAs, and Graphics Processing Units (GPUs) are increasingly used to improve the throughput of CNN networks. Though ASICs and GPUs have long been the default solution to speed up computation in high-performance computing platforms, FPGAs are now a rising trend, mainly due to their performance/watt advantage over GPUs and flexibility over ASICs [3, 4]. Furthermore, the continuous growth of integration capacity in FPGA technology has led to the advent of large devices capable of hosting millions of logic components and thousands of hard IP blocks [5, 6, 7]. The innovation in FPGA hardware architecture provides the basis for unprecedented flexibility and acceleration in high-performance computing and embedded system applications. It also requires computer-aided design (CAD) tools capable of extracting application- and domain-specific features to leverage the resources available in high-end FPGAs. As the complexity of FPGA architectures increases, so does the need for improved productivity and performance in several computing domains such as image processing, financial analytics, edge computing, and deep learning. However, vendor tools are primarily general-purpose. They attempt to provide acceptable results on various applications, which may not exploit application- and domain-specific characteristics to deliver higher QoR.
This paper presents a divide-and-conquer design flow that enables application- and domain-specific optimization of CNN architectures on Xilinx FPGAs. The proposed approach follows three fundamental steps. The first step consists in dividing the design into smaller components. The granularity of the sub-components is left to the designer's choice. Next, each component is synthesized and implemented. Finally, we generate the targeted CNN architecture by assembling pre-built components with minimal QoR loss to achieve various design goals such as decreased latency, reduced power consumption, optimized maximum frequency, and low hardware utilization. Recent research has demonstrated that such an approach provides improved QoR compared to the traditional Vivado flow in some instances [8, 9, 10]. By pre-implementing specific components of a design, higher performance is achieved locally and maintained with minimal loss when assembling the final circuit. This observation is supported by two major considerations: (1) vendor tools such as Vivado tend to deliver high-performance results on small modules in a design [8, 11]. (2) Computing applications such as CNN architectures increase in size and complexity by replicating design modules.
Figure 1: Motivation example. (a) Compilation time comparison. (b) Fmax comparison. The results from previous research show that the pre-implemented design flow with RapidWright can lead to improved productivity and QoR compared to the traditional design flow with Vivado [9] (MM = Matrix Multiplication; OP = Outer Product; RC = Robert Cross; SM = Smoothing).
By carefully selecting the principal design modules found in common CNNs, we leverage Vivado optimization to generate highly specialized implementations that can be combined into the desired CNN topology.
As a motivating example illustrating the advantages of the pre-implemented flow, Figure 1 summarizes a few results from the work of Mandebi et al. [9]. It studies the compilation time and maximum frequency achieved when implementing FPGA accelerators for matrix multiplication, outer product, Roberts Cross edge detection, and image smoothing using the Vivado flow and the pre-implemented flow with RapidWright. The results show that the pre-implemented design flow achieves up to 37% gain in productivity and 33% higher Fmax compared to compiling the same designs with Vivado. While few details are provided on the choice of granularity for the pre-built components, the work demonstrated that pre-implementing modules could significantly improve the QoR when exploiting application- and domain-specific features. However, the work only focused on micro-kernels and small-scale applications [9]. As a result, it remains necessary to evaluate the performance benefits of using the pre-implemented flow on more complex data flow architectures such as CNN accelerators.
In the context of this work, we aim to explore the performance that can be achieved when utilizing RapidWright in the design flow of an FPGA accelerator for CNNs. Specifically, our contributions include:
(1) Reviewing the design of CNN architectures on FPGAs: we explore the features of state-of-the-art CNNs that are suitable for FPGA-based acceleration.
(2) Reviewing the pre-implemented flow with RapidWright: we discuss the key steps to follow to efficiently leverage RapidWright in the design of crafted application- and domain-specific FPGA accelerators.
(3) A complete component-based framework that maps CNN models to FPGA implementations without tedious HDL programming and verification, while improving the QoR compared to the traditional design flow with Vivado.
(4) An effective and efficient algorithm for high-quality and scalable routability-driven placement of components on the FPGA.
As opposed to vendor tools that are closed source, we believe that access to RapidWright's internal features and design resources makes it suitable for design flow exploration and the implementation of targeted FPGA solutions.
2. Overview of CNN FPGA Architectures
2.1. Architecture Topology
CNN inference refers to the forward propagation of M input images through L layers. In recent years, multiple CNN architectures on FPGA have been proposed [12, 2]. They can be grouped into two categories: Single Instruction, Multiple Data (SIMD) based accelerators and streaming-based accelerators. In this section, we highlight the potential benefits of designing FPGA-based CNN architectures with the pre-implemented flow, as well as the challenges that may arise. We do not discuss any architecture implementation details.
In general, SIMD accelerators start by fetching feature maps and weights from an external memory to on-chip buffers [13, 14, 15]. The data is then streamed into computing engines composed of several processing elements (PEs). The PEs typically implement general-purpose matrix multiplication circuits in which computations are scheduled to execute the CNN layers in sequence [16, 17]. At the end of the PE computations, the results are streamed back to on-chip buffers and, if necessary, to the external memory to be processed in subsequent CNN layers. Each PE is configurable and has its own computing resources, mainly DSP blocks, and its own data caching relying on on-chip registers. Computing engines are usually composed of hundreds of identical PEs that are replicated across the chip to accelerate specific layers of the CNN [18]. This repetition of components within CNN architectures makes them suitable candidates for RapidWright implementation, as the CNN sub-modules can be optimized for performance, and the achieved performance can be preserved when replicating and relocating the modules across the FPGA. The main advantage of this approach is its flexibility, as it fits the implementation of various CNN topologies. However, the resulting CNN architecture has a major limitation. It requires frequent memory transfers between FPGA on-chip scratchpad memories and external memory (DDR/HBM) to fetch the weights and feature maps. Furthermore, the layer-by-layer execution flow makes real-time inference difficult.
Accelerators with the streaming architecture always tailor the hardware with respect to the target network [19, 12]. The topology of such CNN accelerators is transformed into a layer-by-layer execution schedule, following the structure of the DAG [20]. Shen et al. [21] note that FPGA-based acceleration used a Convolutional Layer Engine (CLE) to process consecutive CNN layers one at a time. The intermediate results between layers are stored in registers, on-chip memory (OCM), or directly pipelined into the next layer. However, since the dimensions as well as the filter parameters of consecutive layers might be different, using a fixed CLE for all layers leads to poor performance and inefficient utilization of resources. For an L-layer CNN, they propose using Q CLEs, where Q < L, to maximize the BRAM availability for each CLE. With Q < L, some layers are replicated in the design, thus making this architecture suitable for the pre-implemented flow. In the same line of work, a streamed accelerator [19, 22] consists of a sequential execution of all the layers of a given CNN. Similarly, the fpgaConvNet framework [23, 24] generates a specialized circuit to run a convolutional neural network given its high-level description. The framework then partitions the network's graph and generates distinct bitstreams for each part of the graph to dynamically configure the FPGA. The architecture analyzes inter-output and kernel parallelism. Given the area constraints associated with the design, fpgaConvNet shares MAC units to reduce the required resources, which creates a trade-off between area and performance. The main advantage of this type of architecture is that it minimizes the latency caused by communication with off-chip memory and thereby maximizes on-chip memory communication, ensuring high throughput and avoiding additional latency [25]. On the downside, this accelerator architecture cannot scale to arbitrarily large CNNs and can hardly be applied to embedded devices. It is essentially restricted by the available on-chip resources needed to implement compute units for each CNN layer, and critically, the size of the OCM required to store the weights.
In this work, we propose a framework to generate an accelerator with a streaming architecture. Regarding such accelerators, several semi-automated design flows have been proposed in the literature. In the section below, we review some of them.
2.2. Component-based Approaches
Previous research has shown that pre-implementing specific design components is a practical approach to reduce FPGA compilation time [26, 27]. The placement and routing of complex FPGA designs on large devices is generally time consuming. Therefore, component-based design flows are leveraged to reduce implementation time on FPGA architectures. Just-In-Time (JIT) compilation [28] supplies users with a Domain-Specific Language (DSL) to develop FPGA designs, linking the design patterns with precompiled bitstreams at runtime. On top of the FPGA, an overlay connects all the reconfigurable tiles, similarly to switch boxes, but it only routes word-wide data. In the same line of work, [27] presents a block-based compilation framework to reduce compilation time through partial reconfiguration (PR). The FPGA capacity is divided into a set of regions of predefined size. The physical PR regions are dedicated to the user logic. Finally, a packet-switched overlay network provides connectivity between the PR regions. In [29], the authors propose a placement model of pre-implemented components using set theory. The module placer is implemented as a constraint solver, which computes feasible placement positions for relocatable modules. Most of the works presented above use a Network-on-Chip (NoC) to interconnect the different components/modules, adding additional logic. Furthermore, the number and the size of the PR regions are predefined, which limits flexibility and might lead to under-utilization of the PR regions by the components.
2.3. Design Flows
To improve the compute unit resources and data movement, several optimizations [30, 31] are employed on each layer of a given CNN. FlexCNN [30, 31] uses a SW/HW co-design approach to compose an architecture for different types of convolution layers using techniques including dynamic tiling and data layout optimization across different layers. With OpenPose as an application driver, they drive up effective DSP utilization on 3×3 and 1×1 kernels to reach the lowest latency. Nguyen et al. [31] focus on two layer-specific optimizations: layer-specific mixed data flow and layer-specific mixed precision. The mixed data flow aims to minimize off-chip accesses while demanding minimal OCM resources. The mixed-precision quantization achieves both lossless accuracy and model compression to reduce the off-chip accesses. Finally, a Bayesian optimization approach is used to select the best sparsity for each layer, achieving the best trade-off between accuracy and compression.
Numerous works [32, 33, 34] present a multi-layer processor approach, in which a dedicated hardware unit processes each layer to maximize the utilization of computing resources. Since the OCM is not sufficient for multiple hardware units, the data must be stored in off-chip memory. Therefore, these works require a considerable amount of memory accesses for data. Even though they work fine for shallow networks, it is challenging to scale them up to deeper networks.
Streaming dataflow aims to tailor the accelerator with regard to the model topology. In this line of work, Sharma et al. [34] proposed DNNWEAVER, a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level specification in Caffe. To achieve large benefits while preserving automation, DNNWEAVER generates accelerators by using a virtual instruction set to describe a network. The model is then translated into an instruction sequence. The sequence is mapped as hardware FSM states. The design flow in [20] searches for the optimized parameters of a handcrafted Verilog template with the input network description and platform constraints, which leads to a uniform mapping of PEs that reduces the accelerator architecture complexity. The acceleration strategy is further generalized for different CNN models with varying dimensions and topologies. Ahmad et al. [35] analyze Winograd minimal filtering, or fast convolution, algorithms to reduce the arithmetic complexity of the convolutional layers of CNNs. They propose a pipelined and parallel Winograd convolution engine that improves the throughput and power efficiency while reducing the computational complexity of the overall system. The proposed techniques focus on automatically generating the HDL design based on the network parameters. The main contribution of this approach is the selection of an intermediate-level description of the network to cover the gap between the high-level network description and the low-level hardware design.
3. Pre-implemented Flow with Vivado and RapidWright
In this section, we present RapidWright and describe its integration into common design steps with Vivado. We also elaborate on background concepts such as the "out-of-context" design flow in Vivado. Finally, we provide the necessary background on pre-implementing design components.
3.1. RapidWright
RapidWright [8] is an open-source Java framework from Xilinx Research Labs that provides a bridge to the Vivado backend at different compilation stages (synthesis, optimization, placement, routing, etc.) using design checkpoint (DCP) files, as illustrated in Figure 2. Once a DCP is loaded within RapidWright, the logical/physical netlist data structures and functions provided in the RapidWright APIs enable custom netlist manipulations such as cell and net instantiation, editing, and deletion. The hundreds of APIs in RapidWright make it possible to directly access logic and routing resources such as look-up tables (LUTs), flip-flops (FFs), and programmable interconnect points (PIPs) from a high-level programming language like Java. They also provide means to run operations such as timing analysis, placement, and routing. Upon completing the netlist manipulation, RapidWright enables storing the changes to a netlist into a new DCP that can directly be loaded into Vivado.
Figure 2: Vivado and RapidWright interaction
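To make this interaction concrete, the sketch below shows the typical DCP round trip with the RapidWright Java API. It is a minimal illustration rather than our exact generator code, and the checkpoint file names are hypothetical.

import com.xilinx.rapidwright.design.Cell;
import com.xilinx.rapidwright.design.Design;
import com.xilinx.rapidwright.edif.EDIFNetlist;

public class DcpRoundTrip {
    public static void main(String[] args) {
        // Load a design checkpoint produced by Vivado at any compilation stage.
        Design design = Design.readCheckpoint("conv_engine_ooc.dcp");

        // Both the logical netlist and the placed/routed physical netlist are exposed.
        EDIFNetlist netlist = design.getNetlist();
        System.out.println("Top cell: " + netlist.getTopCell().getName());
        for (Cell c : design.getCells()) {
            System.out.println(c.getName() + " -> " + c.getSite());
        }

        // After custom netlist manipulation, write a new DCP that Vivado can reopen.
        design.writeCheckpoint("conv_engine_modified.dcp");
    }
}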
3.2. Out-Of-Context Design Flow
The "out-of-context" (OOC) flow [36] is a netlist generation mode ensuring that the placement of I/O buffers is disabled at compile time to facilitate the design of internal components of an architecture. It has several advantages, among which: (1) it allows implementing and analyzing (resource analysis, timing analysis, power analysis, etc.) a module independently of the rest of the design; (2) it enables reusing and preserving the characteristics of placed and routed modules within a top-level design.
3.3. Pre-implementing Design Components
Vendor CAD tools such as Vivado use heuristics for the physical implementation (placement and routing). They consider the number of cells in a design, their connections, and the physical architecture of the target FPGA device to generate a circuit according to specified constraints. Consequently, vendor tools generally achieve better QoR on smaller designs, as the resource allocation problem addressed in the physical implementation is well known to be NP-hard [37]. Focusing the optimization on smaller modules may therefore lead to overall QoR improvement in a design. Furthermore, several works in the literature have shown that pre-implementing components or macros can significantly decrease the overall FPGA compilation time with performance benefits [28, 38, 9]. The pre-implemented flow therefore aims to generate high-performance implementations by reusing high-quality, customized pre-built circuits in multiple contexts and chip locations. Using an iterative process, the top-level design can then be constructed by assembling the pre-built circuits with minimal QoR loss.
To fully exploit the benefits of the pre-implemented flow with RapidWright, the design architect must first restructure the CNN HDL code hierarchically. The reorganization of the HDL sources must consider three main design characteristics [8]:
1. Modularity: highlights the design structure so that it can be strategically mapped to architectural models.
2. Module replication: when modules are replicated, it allows the reuse of high-quality solutions in the design while increasing productivity.
3. Latency tolerance: if the modules in a design tolerate additional latency, inserting pipeline elements between them improves both synchronization performance and off-shoring.
4. Proposed Design Flow
In this section, we present the design exploration steps implemented to optimize CNN components and fully exploit the benefit of our approach. The overview of the pre-implemented flow is presented in Figure 3. The flow has two major steps: function optimization and architecture optimization. The function optimization essentially consists in performing a design space exploration of the performance that can be achieved on sub-functions.
Figure 3: General overview of the proposed design flow. (1) Function optimization: granularity and performance exploration of the CNN DFG models under timing, floorplanning, and device constraints, iterating to meet the constraints and producing a database of locked, placed-and-routed pre-built checkpoints (design space exploration to optimize sub-function performance: Fmax, area, power). (2) Architecture optimization: from the CNN architecture definition, the hardware generator performs component extraction, component matching, architecture composition, and inter-component routing, assembling architecture-specific components with minimal QoR degradation into a partially routed CNN and, finally, the CNN accelerator.
It takes into consideration design constraints such as device, timing, floorplanning, and power. If the design space exploration results in satisfactory performance, the produced netlists are saved into a database in the form of DCPs. This step is semi-manual, as the designer must choose and pre-compile the sub-functions of a design using vendor tools. It is, however, performed exactly once, and the saved netlists may serve in multiple designs. The architecture optimization is a fully automated process that aims to combine the pre-built components (the netlists saved in the function optimization phase) into a CNN architecture as defined by the user.
4.1. Function Optimization
This section describes the major steps involved in the design
of optimized sub-functions.
4.1.1. Granularity Exploration
The design space exploration only supports CNNs. A typical CNN is usually composed of:
(1) Convolution: the convolution layer convolves the input image with a set of learnable filters, each producing one feature map in the output image.
(2) Pooling: max-pooling splits the input image into a set of non-overlapping rectangles and, for each of these sub-regions, outputs the maximum value.
(3) Rectified Linear Unit (ReLU): given an input value x, the ReLU is a simple calculation that returns the input value x directly if x > 0 and 0 otherwise. Several ReLU variants exist and might be employed.
(4) Fully Connected (FC): each activation of an FC layer is the result of a scalar product of input values, weights, and a bias.
By porting these four layers onto the FPGA, the vast majority of forward-processing networks can be implemented. The module implementations should revolve around this minimum granularity. Automated decomposition of user logic into leaf components is complementary future work.
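For reference, the spatial output size that drives these granularity choices follows the standard relation below, where W is the input width, K the kernel size, P the padding, and S the stride:

\[ W_{\text{out}} = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1 \]

For instance, LeNet's first 5×5 convolution over a 32×32 input with no padding and unit stride gives (32 - 5)/1 + 1 = 28, i.e., the 6@28×28 feature maps mentioned in Section 4.2.1.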
4.1.2. Performance Exploration
We start by manually building the CNN components OOC. The OOC flow ensures that I/O buffers and global clock resources are not inserted into the netlists, as the pre-built components are still to be inserted into the top-level module of the design. While efficiently designing components OOC requires hardware expertise, it is done exactly once, and the pre-built netlists may be reused in several other applications.
Optimization and implementation of components: the architecture of a convolution engine is depicted in Figure 4a. The input samples are read in a streaming fashion from the local buffer. For a convolution of size K, at least K-1 lines of data must be fetched before the circuit can process the first sample. All the computations performed before this are simply discarded through the use of conditionals. Each line buffer can store up to K+1 data items. When a new sample is read, another sample is pushed out of the line buffer.
Figure 4: Pre-implemented design components. (a) Convolution engine. (b) Architecture of a compute unit (CU): parallel PEs process the input feature maps and feed an adder tree.
Interestingly, the newest sample is used in the calculation, then stored into the line buffer while the oldest sample is ejected. Therefore, only K+1 lines must be cached rather than an unknown number of lines, which minimizes local storage use. Figure 4b illustrates the compute unit (CU) datapath of a convolution engine. It starts by multiplying the input feature-map data and the corresponding weights via multiple parallel multiplication arrays; the final cumulative values are then determined by a pipelined adder tree. Several slices of CUs constitute the convolution engine. The 27×18 multiplier in the DSP48E2 slice carries out two parallel 8-bit multiplications that share one common operand. Precisely, the two INT8 operands are packed into the 27-bit port and then multiplied in parallel by the third, shared INT8 operand on the 18-bit port. The output of the DSP48E2 slice generates the two parallel products in a particular full-precision dual-product format, halving the number of DSP48E2 slices required.
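The arithmetic behind this packing can be checked with the short model below: a plain Java sketch of the identity (a·2^18 + b)·c = a·c·2^18 + b·c, not the actual DSP configuration, with arbitrary example operands.

public class Int8Packing {
    public static void main(String[] args) {
        long a = 23, b = -45, c = 67;        // example INT8 operands; c is the shared multiplicand

        long packed = (a << 18) + b;          // two operands packed into the 27-bit port
        long product = packed * c;            // one wide multiply on the 27x18 multiplier

        long low18 = product & 0x3FFFFL;      // lower field holds b*c in two's complement
        long bc = (low18 << 46) >> 46;        // sign-extend the 18-bit field
        long ac = (product - bc) >> 18;       // upper field holds a*c after removing the borrow

        System.out.println(ac == a * c && bc == b * c);   // prints true
    }
}

In the hardware, a similarly small sign-correction step around the DSP output plays the role of the subtraction above.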
The design of the pooling layer is similar to [39] and is depicted in Figure 5. The design implements max-pooling with a stride of 2. The main components of the architecture are a shift register with only L+2 stages, a comparator core, and a controller. The comparator core consists of three comparators for finding the maximum pixel value in the 2×2 window. This circuit is implemented independently from the convolution engine to accommodate most CNN topologies. The I/O interfaces of the components implement a producer-consumer scheme with a globally asynchronous, locally synchronous approach for weight storage, whereby memory resources operate faster than the compute logic. This allows adjacent layers to start operating before the producer layers are completed. Each layer feeds its output to the next layer using similar datatype layouts and allows the layers to overlap their operation once enough data has been accumulated in the previous layer. With a streaming architecture, there is no need to store each layer's intermediate results in off-chip memory since they are directly piped down the stream.
To achieve high QoR in the performance exploration phase, the implementation of components follows the design considerations below; an illustrative constraint sketch is given after the list:
Figure 5: Max-pooling computing engine (input feature maps enter a shift register; three comparators and a controller select the maximum value).
(a) Strategic floorplanning: utilizing pblock constraints allows carefully selecting the FPGA resources that will be used by each design component. It helps improve the module-level performance and area. Hence, the designer has the possibility to use only the necessary resources, as opposed to letting the CAD tool utilize as many chip tiles as it wants. Given that Xilinx architectures generally replicate the resource structures (CLBs, DSPs, BRAM, URAM, etc.) over an entire column of clock regions, the smaller the area of a pblock is, the more RapidWright will be capable of relocating the design components across the chip, which increases reusability. The automated definition of pblock ranges is out of the scope of this work.
(b) Strategic port planning: the placement of the ports when pre-implementing modules is one of the most important steps to ensure high performance and productivity improvement. Failure to plan the location of the ports of the pre-implemented modules may result in long compilation time, poor performance, and high congestion in the design in which they are inserted.
As an example, let us consider a design in which we pre-implement two sub-modules, Module1 and Module2. In order to preserve the QoR of the sub-modules in the final design, we should foresee the length of the nets connecting the cells at the interface of the sub-modules. In fact, the maximum frequency is limited by the highest delay on the timing paths. We must therefore reduce the length of the nets between the modules by ensuring that the cells at the interface of the pre-built components are placed near the edge of the pblocks of the modules. However, the modules are pre-implemented independently. Hence, the CAD tool is not aware of the context in which the modules will be inserted into a design and connected to other components. A pre-implemented component may then achieve a high maximum frequency standalone but perform poorly when inserted into a design because of very long inter-module nets. We therefore pre-implement the modules with partition pin constraints (PartPins) [36] to specify the interconnect tiles that will route the nets connecting to the other modules of a design. Figure 6a presents a possible outcome of pre-implementing the two sub-modules without considering the port planning. The distance between the two FFs that are identified by the green and red marks may become a source of high delay, resulting in reduced maximum frequency. On the other hand, Figure 6b shows that PartPins make it possible to shorten the length of the net, which decreases the Fmax degradation when connecting the components in a design.
(c) Clock routing: to accurately run the timing analysis on the OOC modules, source clock buffers must be specified using the constraint HD.CLK_SRC. Though the buffers are not inserted in the OOC modules, clock signals are partially routed to the interconnect tiles, and the timing analysis tool can then run timing estimations.
(d) Logic locking: the main goal of the performance exploration is to achieve high QoR locally. Once a module attains a desirable performance (Fmax, area, power, etc.), we lock the placement and routing to prevent Vivado from altering the design later and to preserve the design performance. The other advantage of locking the design is that the final inter-module routing with Vivado will only consider non-routed nets. This decreases compilation time and improves productivity.
(e) Checkpoint file generation: pre-implemented modules
are stored in the form of DCPs. The top-level design will
then implement synthesis black-boxes that will be filled
by the optimized pre-built modules.
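As announced above, the sketch below illustrates the kind of OOC constraints referred to in items (a)-(c), emitted here from Java as plain XDC text in the spirit of the .xdc output of the flow. The pblock ranges, port names, and clock-buffer site are hypothetical placeholders, not the values used in our experiments.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class OocConstraints {
    public static void main(String[] args) throws IOException {
        String xdc = String.join("\n",
            // (a) Strategic floorplanning: confine the component to a small, relocatable pblock.
            "create_pblock pb_conv0",
            "resize_pblock pb_conv0 -add {SLICE_X0Y0:SLICE_X23Y59 DSP48E2_X0Y0:DSP48E2_X2Y23}",
            "add_cells_to_pblock pb_conv0 [get_cells -hierarchical conv_engine*]",
            // (b) Strategic port planning: pin the interface nets to interconnect tiles at the pblock edge.
            "set_property HD.PARTPIN_LOCS INT_X24Y30 [get_ports data_in*]",
            // (c) Clock routing: declare the clock source so OOC timing analysis is meaningful.
            "set_property HD.CLK_SRC BUFGCE_X0Y0 [get_ports clk]",
            "");
        Files.write(Paths.get("conv0_ooc.xdc"), xdc.getBytes());
    }
}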
The implementation here is done using vendor tools and considers several constraints such as timing and floorplanning. The pblock partitioning is performed for each component according to its needs in terms of hardware resources and the physical structure of the FPGA. However, when synthesizing components OOC, there is no control over how the I/O ports are placed. With pblocks and timing constraints, I/O ports might end up anywhere in the pblock, resulting in routing congestion and timing issues around the I/O interfaces when generating the whole design, as described in Section 4.2.
4.2. Architecture Optimization
In this section, we discuss the generation of a CNN accelerator based on the user definition. The architecture optimization follows four major stages: component extraction, component matching, architecture composition, and inter-component routing. The following paragraphs elaborate on each of these phases.
Figure 6: Illustration of the importance of port planning. (a) Without PartPins, the FFs are placed randomly, resulting in a large separation. (b) Defining PartPins provides context to Vivado when pre-implementing the modules, which results in a shorter distance between the FFs.
4.2.1. Component Extraction
From the library of pre-built components, users compose the CNN hardware accelerator's resources on the FPGA. This implies providing information about the topology and the type of layers that compose the CNN, in a form that we call the CNN architecture definition. In the following stage, a CNN hardware generator designed with the RapidWright API automatically produces the corresponding CNN accelerator. The major function of the component extraction is to parse the CNN architecture definition from the DFG specification and identify the components. It then creates a data flow graph (DFG) structure in which the nodes represent the components and the edges account for the connections between them. Each node of the graph can be a component candidate. Nevertheless, consecutive nodes in the graph can be pre-implemented as one component if the data movement between them does not require a memory controller. In that case, a simple handshake protocol is enough to provide node-to-node communication with simple single-source, single-sink FIFO queues of unbounded length. For instance, the first convolution of LeNet outputs 6@28×28 feature maps, and pooling outputs 6@14×14 feature maps from a 2×2 sliding window. This architecture requires a memory controller to compose the addresses to read/store the data from/to memory and feed the FIFOs, as shown in Figure 7. That constraint is not required for the following ReLU, and the operation can be directly applied to the intermediate results of the pooling layer.
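A minimal sketch of the DFG data structure produced by the component extraction is shown below; the class and field names are our own illustration, not the exact types used in the hardware generator.

import java.util.ArrayList;
import java.util.List;

/** One node of the CNN data flow graph handed to the hardware generator. */
class DfgNode {
    enum LayerType { CONV, POOL, RELU, FC }

    LayerType type;
    int inWidth, inHeight, channels;   // input feature-map geometry
    int kernel, stride, padding;       // layer hyper-parameters (unused for RELU)
    String checkpoint;                 // DCP of the matching pre-built component
    List<DfgNode> successors = new ArrayList<>();

    /** A ReLU consumes its producer's stream directly, so it can be merged with it. */
    boolean canMergeWith(DfgNode next) {
        return next.type == LayerType.RELU;   // no memory controller needed in between
    }
}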
4.2.2. Component Matching
Figure 7: Communication interface between components
The RapidWright application first parses the DFG using a breadth-first search (BFS) approach (Algorithm 1, lines 1-10). This enables efficiently discovering the components to load into the CNN architecture as well as their connectivity. We choose the BFS traversal because the DFGs representing CNN architectures are generally deeper than they are wide. Each node is described with a set of characteristics. For instance, a convolution is identified with information such as the input width and height, the number of channels, the kernel size, the padding, and the stride. The hardware generator that we implement with the RapidWright API loads the DCPs corresponding to the components defined in the CNN architecture definition from the database of pre-built checkpoints to compose the final architecture.
To achieve physical hardware reusability, some requirements must be fulfilled: each component must implement a specific interface to communicate with the other design modules. As shown in Figure 7, components are pre-implemented with two interfaces. The first interface, called "source", is a dedicated memory controller that reads data from memory and feeds the computing units. The second interface, called "sink", controls the writing of feature maps to OCM. Finally, since all the components implement a well-known interface, we use the RapidWright API to create the interconnections. This is done by inserting specific nets in the netlist of the design to implement logic routing between the different components that communicate in the design (Algorithm 1, lines 11-18). After stitching, the blocks are placed, a DCP file is generated, and it is then read into Vivado to complete the inter-component routing.
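The stitching step relies on RapidWright's module reuse facilities; a condensed sketch is given below. Checkpoint names, instance names, and anchor sites are hypothetical, net creation and error handling are omitted, and the snippet should be read as an outline rather than the full generator implementation.

import com.xilinx.rapidwright.design.Design;
import com.xilinx.rapidwright.design.Module;
import com.xilinx.rapidwright.design.ModuleInst;
import com.xilinx.rapidwright.device.Site;

public class StitchSketch {
    public static void main(String[] args) {
        // Empty top-level design on the target device.
        Design top = new Design("cnn_top", "xcku060-ffva1156-2-e");

        // Load a pre-implemented, locked component and instantiate it twice.
        Module conv = new Module(Design.readCheckpoint("conv_engine_ooc.dcp"));
        ModuleInst conv0 = top.createModuleInst("conv0", conv);
        ModuleInst conv1 = top.createModuleInst("conv1", conv);

        // Relocate each instance to the anchor chosen by the placement algorithm.
        Site anchor0 = top.getDevice().getSite("SLICE_X0Y0");
        Site anchor1 = top.getDevice().getSite("SLICE_X0Y60");
        conv0.place(anchor0);
        conv1.place(anchor1);

        // Sink/source nets between the instances are then created (omitted here)
        // before the partially routed DCP is handed to Vivado.
        top.writeCheckpoint("cnn_partially_routed.dcp");
    }
}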
4.2.3. Component Placement
The placement algorithm is based on the Xilinx UltraScale architecture, which is an array of programmable logic blocks consisting of configurable logic blocks (CLBs), embedded memory (BRAM), and multiplier (DSP) blocks. CLB slices are organized in a regular array. Each connects to a switch box to access general routing resources, which run vertically and horizontally between rows and columns. The device is surrounded by I/O blocks allowing off-chip connections. DSP blocks and BRAMs are arranged column-wise and spread across the device. We aim to find a congestion-aware, timing-driven placement for the components of the input graph.
Problem Formulation. Given an UltraScale FPGA with logic elements, its architecture, and a graph G of components, we need to map each component's netlist to the logic elements of the FPGA and determine their positions so as to minimize routed wirelength and congestion. In summary, (1) each component must be assigned to a valid position on the FPGA, and (2) the placement legalization rules of each tile must be satisfied.
The algorithm works as follows: we recursively parse the input graph and place the first component. Since components are pre-implemented within pblocks, the number of resources used and allocated is reported. For each adjacent component, we assign a location on the FPGA grid with minimal interconnect wire length, i.e., the estimated half-perimeter wire length (HPWL) from the placed cell locations. To fulfill that requirement, we define timing and congestion cost functions to evaluate the cost of the assigned location.
The timing cost is defined by the wire length between two components:

\[ \text{timing\_cost} = \sum_{i=1,\ i<j}^{n-1} HPWL(W_{i,j}) \qquad (1) \]
where W_{i,j} is the wire between components i and j (the distance from the physical net's source pin to its sink pin). A net with a fan-out greater than one will, in most cases, have some branching (reusing a path). In this case, the unit of length is the dimension of a tile.
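For clarity, the HPWL term of Equation (1) can be computed as in the sketch below, a straightforward bounding-box computation over tile coordinates; the pin-coordinate record is our own illustration.

import java.util.List;

/** Half-perimeter wire length of a net over the tile grid. */
final class Hpwl {
    /** Pin location expressed in tile coordinates. */
    record Pin(int tileX, int tileY) {}

    static int of(List<Pin> pins) {
        int minX = Integer.MAX_VALUE, maxX = Integer.MIN_VALUE;
        int minY = Integer.MAX_VALUE, maxY = Integer.MIN_VALUE;
        for (Pin p : pins) {
            minX = Math.min(minX, p.tileX);
            maxX = Math.max(maxX, p.tileX);
            minY = Math.min(minY, p.tileY);
            maxY = Math.max(maxY, p.tileY);
        }
        // Bounding-box half perimeter, in units of one tile.
        return (maxX - minX) + (maxY - minY);
    }
}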
Congestion estimation: for optimal routing, a placement algorithm must consider the number of resources used by each inter-component net and the interaction between them. For instance, if all nets are confined to a relatively small portion of the chip area, the demand for routing paths there will probably be very high.
Algorithm 1: Hardware Generation Algorithm
Input : Design d, Graph G, Node root
Output: DCP file
 1  let Q be a queue
 2  let t be the max number of iterations
 3  mark root as discovered
 4  Q.enqueue(root)
 5  while Q.size() != 0 do
 6      Node v = Q.dequeue(); if v is the goal then
 7          return v
 8      end
 9      ParseGraph(v, G, Q, 0)
10  end
11  Function ParseGraph(G, v, Q, iter):
12      q = locate an optimal placement for v
13      addNodeToDesign(q)
14      Nodes w = G.Next()
15      foreach edge from v to w do
16          if w is not marked then
17              p = locate an optimal placement for w
18              addNodeToDesign(p)
19              Ports ports_v = selectPortOfInterest(v)
20              Ports ports_w = selectPortOfInterest(w)
21              nets = create_nets(ports_v, ports_w, v, w)
22              timing, cost = TimingEstimation(nets, v, w)
23              if not (timing & cost) and (iter < t) then
24                  iter++
25                  p = G.Previous()
26                  ParseGraph(G, p, Q)
27              end
28              mark w as discovered
29              w.parent = v
30              Q.enqueue(w)
31          end
32      end
33  End Function
34  Function TimingEstimation(nets, v, w):
35      foreach net in nets do
36          if size(net) > 1 then
37              timing = getTileSize()
38          else
39              timing = timing_cost(nets)
40          end
41          cost = cgt_cost(v, w)
42      end
43      return (timing, cost)
44  End Function
Furthermore, the number of switch boxes to traverse factors into the total delay [40]. The algorithm tries to build a solution incrementally, one component at a time, removing those solutions that fail to satisfy the problem's constraints at any point in time. A placement is valid if the costs are lower than a defined threshold. Otherwise, we unplace each previously placed component and find another location until the costs are satisfied. After a certain number of predefined iterations, if no placement satisfying the constraints is found, we move to the next component. The proposed solution works as follows: we recursively parse the input graph and place the first component (lines 3-13). For each adjacent component, we assign a valid location on the FPGA grid (lines 14-18). We evaluate the cost of the current assignment (line 22). If the placement satisfies the constraints (timing and congestion cost), we move to the next component (lines 28-30). Otherwise, we recursively unplace previous components and try to find another location (lines 23-27).
\[ cgt\_coef = \sum_{i} C_i, \qquad cgt\_cost = \sum_{W_{i,j}} \frac{\omega_{i,j} \times cgt\_coef_k}{\#SwBox} \qquad (2) \]
where C_i is the number of components overlapping within a tile k, ω_{i,j} is a weight proportional to the number of pins of wire W_{i,j}, and #SwBox is the number of switch boxes traversed by the nets.
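A direct reading of Equation (2) as code is sketched below; the data structures are illustrative, with overlapPerTile playing the role of the per-tile accumulation of C_i.

/** Congestion cost of the inter-component nets, following Equation (2). */
final class CongestionCost {
    /** Each wire is described by {index of the tile region it crosses, pin count, switch boxes traversed}. */
    static double of(int[] overlapPerTile, int[][] wires) {
        double cost = 0.0;
        for (int[] w : wires) {
            int tileIdx = w[0], pinCount = w[1], swBoxes = w[2];
            double cgtCoef = overlapPerTile[tileIdx];   // components overlapping tile k
            double weight = pinCount;                   // omega proportional to the pin count
            cost += (weight * cgtCoef) / Math.max(1, swBoxes);
        }
        return cost;
    }
}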
Although RapidWright provides a lightweight timing model [41], it works at the BEL level, which is more fine-grained than what is required here.
Figure 8: Xilinx UltraScale architecture: CLBs (LUTs, FFs, and MUXes) connected through switch boxes and interconnect wires, with BRAM and DSP columns and surrounding I/O banks.
4.2.4. Datapath Regularization
To reduce the overall latency and data management overhead, datapaths must be regularized. Each component comes with its own latency in number of clock cycles. We must therefore ensure that operands arrive at the boundary of each component at the same time to obtain correct results.
| Design                 | CLB LUTs        | CLB Registers   | BRAMs        | DSPs         |
| LeNet                  | 17005 (7.90%)   | 7591 (1.61%)    | 109 (18.44%) | 121 (5.22%)  |
| Pre-implemented LeNet  | 14533 (14.89%)  | 6847 (9.57%)    | 104 (5.43%)  | 121.00       |
| VGG-16                 | 100767 (41.35%) | 151646 (38.80%) | 462 (37.84%) | 986 (50.83%) |
| Pre-implemented VGG-16 | 85204 (15.44%)  | 121879 (11.25%) | 437 (2.07%)  | 935 (1.12%)  |
Table 1: FPGA resource utilization (data precision: 8 bit)
This is done by inserting FFs on the critical path. Inserting FFs does not increase the overall latency, as the number of inserted FFs corresponds to the cumulative latency of the operations on the datapath.
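One way to express this balancing rule, assuming L_max is the largest component latency among the paths converging at a boundary and L_i the latency of path i, is:

\[ \mathrm{FF}_i = L_{\max} - L_i \]

so that all operands reach the boundary after exactly L_max cycles.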
4.2.5. Inter-component Routing
After the architecture composition, the design contains all the necessary CNN modules. Each design module still has its logic and internal routing locked. However, the RapidWright hardware generator only enables the logical routing between the components. While recent updates to the RapidWright API provide some functions to route designs, the routing heuristics are still a work in progress and are not as mature as Vivado's. We therefore utilize Vivado for the final routing, which essentially consists in finding FPGA interconnects to implement the logical routes created within RapidWright in a way that minimizes timing delays.
5. Experimental Results
5.1. Evaluation Platform and Setup
For evaluation purposes, designs are implemented on a Xilinx XCKU060 FPGA. The hardware is generated using Vivado v2019.2 and RapidWright v2019.1, and the components are implemented with Vivado HLS. The hardware generation is conducted on a computer equipped with an Intel Core i7-9700K CPU @ 3.60 GHz and 32 GB of RAM.
5.2. Benchmarks
We study two CNN architectures: LeNet [42] and VGG [43]. We run the applications individually with the purpose of assessing the achievable performance, in particular: (1) global latency, (2) maximum frequency (Fmax) and productivity, and (3) resource utilization, when comparing pre-implemented to fully implemented CNNs. For both networks, we use an 8-bit data precision. Table 2 presents the workload and memory requirements of the two networks. We compare the pre-implemented circuits to the corresponding classic implementations. Here, a classic implementation refers to a circuit generated from a single top-level file, following the Vivado design flow [44].
|               | LeNet-5 | VGG-16 |
| # CONV layers | 2       | 16     |
|   # weights   | 26 K    | 14.7 M |
|   # MACs      | 1.9 M   | 15.3 G |
| # FC layers   | 2       | 3      |
|   # weights   | 406 K   | 124 M  |
|   # MACs      | 405 K   | 124 M  |
| Total weights | 431 K   | 138 M  |
| Total MACs    | 2.3 M   | 15.5 G |
Table 2: Computational hardware resources for state-of-the-art DNNs.
5.2.1. LeNet Architecture
It is built by replicating four main modules: (1) the convolution module, which performs the convolution computation using a systolic array architecture; the fully connected layers are also implemented as convolutions, with the kernel size equal to the input data size; (2) the max-pooling layers; (3) the ReLU layers; (4) the memory management unit, which walks through the input data and feeds the computing units. The weights and biases are hard-coded in ROM. This choice was made for simplicity.
5.2.2. VGG-16 Architecture
VGG consists of 16 convolutional layers and is very appealing for the pre-implemented flow because of its very uniform architecture. Input images are passed through a stack of convolutional layers with a fixed filter size of 3×3 and a stride of 1. There are five max-pooling layers interleaved between the convolutional layers. The stack of convolutional layers is followed by 3 fully connected layers. Each convolution stage is made of 2 or 3 convolutions with the same parameters, followed by a pooling layer. The replicability of layers within VGG suits the pre-implemented flow. With 124 M weights, there are not enough on-chip resources to store them. We use off-chip memory to store the coefficient data and the data layout configuration files. The coefficient data files contain the parameters of each layer, and the data layout configuration files include the size of the input feature map and the output feature map, as well as the shape of the tensor coefficients. The off-chip memory allocation is based on a best-fit with coalescing algorithm. The goal of this allocator is to support defragmentation via coalescing. The principle behind this algorithm is to divide the memory into a series of memory blocks, each of which is managed by a block data structure. From the block structure, information such as the base address of the memory block, the state of use of the memory block, the size of the block, and pointers to the previous and next blocks can be obtained. All the memory can thus be represented by block structures in a doubly linked list.
Figure 9: Performance exploration of LeNet
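A condensed sketch of the block record used by such an allocator is given below; the field names are illustrative, and the best-fit search and block splitting performed by the actual allocator are omitted.

/** One block of the off-chip memory pool managed by the best-fit-with-coalescing allocator. */
final class MemBlock {
    long baseAddress;      // start address of the block in off-chip memory
    long size;             // size of the block in bytes
    boolean inUse;         // state of use of the block
    MemBlock prev, next;   // doubly linked list over the whole memory pool

    /** Merge this free block with a free right-hand neighbor (coalescing on free). */
    void coalesceWithNext() {
        if (!inUse && next != null && !next.inUse) {
            size += next.size;              // absorb the neighbor's space
            next = next.next;               // unlink the absorbed block
            if (next != null) next.prev = this;
        }
    }
}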
5.3. Resource Utilization
Pre-implementing basic components has the potential to reduce resource utilization, as shown in Table 1. The classic implementations of LeNet and VGG use respectively 7.90% and 41.35% of the LUTs, and 1.61% and 17.80% of the registers. The pre-implemented version of LeNet uses 14.89% fewer LUTs and 9.57% fewer registers (resp. VGG uses 9.09% fewer LUTs and 11.25% fewer registers) when compared to the classic implementation. Overall, the pre-implemented networks use fewer resources than the baseline implementations. When the design is small, Vivado can provide a better optimization of the resources. Furthermore, when pre-implementing components, we define pblocks, which limit the amount of resources that Vivado can use, hence forcing some area optimizations. When the design is bigger, Vivado tends to maximize its capacity of adaptation, and it becomes difficult to capture all the design's specificities. A snapshot of LeNet on the FPGA fabric is shown in Figure 10.
LeNet uses 18.44% of the BRAM available on the chip. This is simply because the weights and biases are hard-coded in ROM and use more resources. The pre-implemented LeNet (resp. pre-implemented VGG) uses 5.43% less BRAM (resp. 37.84%). Vivado can optimize the intermediate results of individual components without inserting BRAM, while it adds such resources when compiling a bigger design, which translates into a higher power consumption. The number of DSPs is the same for both LeNet implementations. However, we notice a slight decrease of 1.12% for the pre-implemented VGG. By defining a pblock for each component, we sometimes provide more DSPs than needed in order to have enough resources to place the design. This is due to the topology of Xilinx FPGAs, which are organized column-wise.
5.4. Productivity
With the continuous growth of CNN parameters and depth, improving productivity is an important factor when it comes to hardware design. In this section, we show how the proposed flow can leverage component reuse to reduce both compile time and implementation cycles.
Figure 10: LeNet circuit on the FPGA fabric. (a) Design implemented with Vivado. (b) Design implemented with our approach.
Table 4 presents the time, in minutes, to generate the design checkpoints with both RapidWright and Vivado. This time measures the implementation and the generation of the DCP. For the baseline LeNet and VGG, the implementation time is the sum of Vivado's opt_design, place_design, phys_opt_design, and route_design functions. For the networks that are pre-implemented, since the components have already been implemented offline, we only measure the DCP generation with RapidWright and the inter-component routing with Vivado. With the pre-implemented flow, it takes 13.54 min (resp. 41.94 min) to generate LeNet (resp. VGG). There is a productivity improvement of 69% for LeNet and 61% for VGG when using the pre-implemented flow. For LeNet (resp. VGG), the stitching with RapidWright represents only 6.2% (resp. 8%) of the total time; RapidWright has minimal impact on productivity. The biggest portion of the time is used to route the nets between components.
5.5. Performance
This section presents a comparison with FPGA designs that utilize a batch size of 1; we report both latency and frequency. In Figure 9, we present the performance of each component as well as of the pre-implemented LeNet.
|                            | CPU   | Cloud-DNN [45] | Caffeine [46] | Biookaghazadeh et al. [47] | Zhu et al. [48] | Pang et al. [49] | Ours  |
| FPGA Device                | -     | XC7Z045        | KU060         | Arria 10                   | ZCU102          | VC707            | KU060 |
| Data Precision             | float | 16             | 16            | 8/16                       | 16              | 16               | 8     |
| Power (W)                  | 140   | 49.25          | 26            | 23                         | 23.6            | 8.152            | 25.2  |
| Energy Efficiency (GOPS/W) | 1.36  | 37.13          | 28.85         | 39.1                       | 13.05           | 23.1             | 42    |
Table 3: Energy efficiency comparison
LeNet:
|             | Pre-Implemented Flow: RapidWright | Pre-Implemented Flow: Inter-node Routing | Classic LeNet: Placement | Classic LeNet: Routing |
| Time (min)  | 0.84                              | 12.7                                     | 18.32                    | 25.67                  |
| Ratio       | 6.2%                              | 96.54%                                   | 39.6%                    | 60.4%                  |
| Total (min) | 13.54 (69% gain)                  |                                          | 45                       |                        |

VGG:
|             | Pre-Implemented Flow: RapidWright | Pre-Implemented Flow: Routing | Classic VGG: Placement | Classic VGG: Routing |
| Time (min)  | 5.27                              | 35.67                         | 8.77                   | 128.23               |
| Ratio       | 8.00%                             | 92.00%                        | 6.40%                  | 93.60%               |
| Total (min) | 41.94 (61% gain)                  |                               | 137.00                 |                      |
Table 4: Design generation time for the implementation of LeNet and VGG with Vivado and with the pre-implemented flow, in minutes.
Overall, LeNet achieves up to 1.2× higher frequency than the classic stream-like architecture. The first convolution reaches 562 MHz. However, with a higher number of parameters (from 156 in conv1 to 2416 in conv2), the number of multiplications increases from 117600 to 240000, which has a negative impact on the frequency. Furthermore, the frequency of the pre-built design is upper-bounded by the slowest component in the design. Figure 9 also presents the variation of the latency, in microseconds (µs), of each component. The pre-implemented LeNet reaches a 16.3% lower latency.
The pre-implemented VGG has a 1.17× higher frequency than the baseline VGG implementation, with a 23.19% lower latency (Figure 11). Hence, for a given design reflecting the properties of modularity, module replication, and latency tolerance, a circuit generated with our approach will have better performance than the classic implementation. In contrast to LeNet, VGG has more, and denser, layers to place and route on the chip. When several design components must be spread around the chip, a rising issue is how to deal with fabric discontinuities such as erratic tile patterns and I/O columns. Those discontinuities lengthen the datapath and have a negative effect on the performance. Hence, inserting pipeline elements such as FFs on the critical path improves the timing performance while increasing the overall latency. Even with a projected higher latency, the proposed flow succeeds in providing better performance.
To show the performance of our approach, we compare our implementation of VGG-16 with state-of-the-art accelerators in Table 5. For each work, we report the architecture topology, data precision, resource utilization, and throughput in GOPS. Due to differences in technology, hardware resources, and system setup, it is hard to make an apples-to-apples comparison between different implementations, but we list some recent works for qualitative reference. The latency here represents the time it takes for a single-frame inference. McDanel et al. [19] have the lowest latency. They can achieve such performance because they use a Selector-Accumulator (SAC) for a multiplication-free systolic array, which reduces the number of operations by 92× for VGG-16. We want to highlight that the SAC implementation can also be used to pre-implement the components to achieve competitive results. When it comes to throughput, ELB-NN [51] has the highest performance of 3.3 TOPS with an ultra-low data precision of 4 bits. Despite the impressive throughput, the accuracy of the proposed circuit drops to 55.8%. Our work achieves a throughput of 1059 GOPS, which is lower than Cloud-DNN and ELB-NN. Nevertheless, it uses less than 50% of the FPGA fabric, with a 2× lower latency than Cloud-DNN. Overall, our design has the best performance/resource ratio, with the highest frequency.
We also compare the FPGA energy efficiency to existing designs on FPGAs and CPU (Table 3). For a fair comparison, we use GOPS/W as the standard metric. Our implementation, using 8-bit fixed-point, has the highest energy efficiency for a batch size of 1. Despite having a higher frequency than most of the designs, our implementation has the smallest number of resources, which results in a lower power consumption.
6. Conclusion
This paper proposes a pre-implemented flow based on a divide-and-conquer approach to accelerate model inference on FPGA. The flow takes as input an abstract representation of the CNN model inference to perform model mapping and design checkpoint generation, by assembling pre-implemented CNN components with RapidWright. With the pre-implemented flow, each component is implemented to reach maximum performance. Experiments and results show that our approach yields improvements in terms of latency and maximum frequency, with little to no impact on the number of resources used. Our workflow is designed in a modular fashion, allowing easy integration of new layer types.
However, there are still several aspects that we plan to investigate with the goal of improving the current work, such as supporting a more exhaustive range of DNNs, and in particular an optimized and automated floorplanning step to achieve higher performance. Furthermore, the maximum frequency of the pre-implemented network is bounded by the slowest component of the design.
|                       | Biookaghazadeh et al. [47] | Super-LIP [50] | ELB-NN [51]  | Caffeine [46] | McDanel et al. [19] | fpgaConvNet [23] | Venieris et al. [52] | Cloud-DNN [45] | Ours     |
| FPGA Device           | Arria 10                   | ZCU102         | ZC706        | KU060         | VC707               | Zynq-7045        | XC7Z045              | XC7Z045        | KU060    |
| Architecture Topology | SIMD                       | Dataflow       | Dataflow     | SIMD          | SIMD                | Dataflow         | Dataflow             | SIMD-Dataflow  | Dataflow |
| Fmax (MHz)            | 212                        | 200            | 200          | 200           | 170                 | 125              | 125                  | 214            | 263      |
| Data Precision (bit)  | 8-16                       | 16             | 4            | 8             | 5                   | 16               | 16                   | 16             | 8        |
| DSP                   | -                          | 57.87%         | 33% (298)    | 38% (1058)    | 4% (112)            | 95%              | 100% (900)           | 78%            | 48%      |
| BRAM                  | -                          | 92.43%         | 93% (509)    | 56% (782)     | 81% (834)           | -                | -                    | 74.4%          | 36%      |
| LUTs                  | -                          | -              | 52% (112992) | 60% (200K)    | 78% (239K)          | -                | 89% (216.60 K)       | 58.5%          | 40%      |
| Latency (ms)          | 26.52 - 30.3               | 71.46          | 5.84         | 25.3          | 2.28                | 249.50           | 249.50               | 16.92          | 8.51     |
| Throughput (GOPS)     | 990                        | -              | 3.3 TOPS     | 365           | -                   | 155.81           | 123.12               | 1828.61        | 1059     |
Table 5: VGG-16 performance comparison with state-of-the-art approaches
Figure 11: Performance Exploration of VGG

Figure 12: VGG architecture with labelled components
Furthermore, the maximum frequency of the pre-implemented
network is bounded by the slowest component of the design. We
plan to investigate optimization approaches to improve the
performance of components during the function optimization
stage. In addition, since the input to this framework is a "CNN
architecture definition", we are working on extending our
current flow to support other frameworks such as ONNX and
PyTorch. We also plan to expand our approach to utilize multiple
FPGAs with larger models in the future.
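As an example of what such front-end support could look like (a sketch assuming the official onnx Python package; the op-type mapping and file name are illustrative placeholders, not the framework's actual definition format), an ONNX graph can be walked to recover the layer sequence the flow consumes:

```python
# Hypothetical front-end sketch: walk an ONNX graph and emit the abstract
# layer list that the pre-implemented flow consumes.

import onnx

ONNX_TO_COMPONENT = {
    "Conv": "conv",
    "MaxPool": "maxpool",
    "Relu": "relu",
    "Gemm": "fc",
}

def extract_layers(model_path: str) -> list:
    """Load an ONNX model and return the layers the flow knows how to map."""
    model = onnx.load(model_path)
    layers = []
    for node in model.graph.node:
        if node.op_type in ONNX_TO_COMPONENT:
            layers.append({
                "name": node.name or node.output[0],
                "component": ONNX_TO_COMPONENT[node.op_type],
                "inputs": list(node.input),
                "outputs": list(node.output),
            })
    return layers

# Usage (path is a placeholder): layers = extract_layers("vgg16.onnx")
```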
Acknowledgement
This work was funded by the National Science Foundation
(NSF) under Grant CNS 2007320.
References
[1] H. Kung, B. McDanel, S. Q. Zhang, X. Dong, C. C. Chen, Maestro: A
memory-on-logic architecture for coordinated parallel use of many sys-
tolic arrays, in: 2019 IEEE 30th International Conference on Application-
specific Systems, Architectures and Processors (ASAP), Vol. 2160, IEEE,
2019, pp. 42–50.
[2] Y. Chen, Y. Xie, L. Song, F. Chen, T. Tang, A survey of accelerator archi-
tectures for deep neural networks, Engineering 6 (3) (2020) 264–274.
[3] P. Bhowmik, J. H. Pantho, J. M. Mbongue, C. Bobda, Esca: Event-
based split-cnn architecture with data-level parallelism on ultrascale+
fpga, in: 2021 IEEE 29th Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM), IEEE, 2021, pp.
176–180.
[4] J. Fowers, G. Brown, P. Cooke, G. Stitt, A performance and energy com-
parison of fpgas, gpus, and multicores for sliding-window applications,
in: Proceedings of the ACM/SIGDA international symposium on Field
Programmable Gate Arrays, 2012, pp. 47–56.
[5] Xilinx, Alveo u250 data center accelerator card,
https://www.xilinx.com/u250 (2019).
[6] Microsoft, Project catapult, https://www.microsoft.com/en-
us/research/project/project-catapult/(2018).
[7] Intel, Intel arria 10 product table,
https://www.intel.com/content/dam/www/programmable/us/en/
pdfs/literature/pt/arria-10-product-table.pdf (2018).
[8] C. Lavin, A. Kaviani, Rapidwright: Enabling custom crafted implemen-
tations for fpgas, in: 2018 IEEE 26th Annual International Symposium
on Field-Programmable Custom Computing Machines (FCCM), IEEE,
2018, pp. 133–140.
[9] J. M. Mbongue, D. T. Kwadjo, C. Bobda, Automatic generation of
application-specific fpga overlays with rapidwright, in: 2019 Interna-
tional Conference on Field-Programmable Technology (ICFPT), IEEE,
2019, pp. 303–306.
[10] D. T. Kwadjo, J. M. Mbongue, C. Bobda, Performance exploration on pre-
implemented cnn hardware accelerator on fpga, in: 2020 International
Conference on Field-Programmable Technology (ICFPT), IEEE, 2020,
pp. 298–299.
[11] D. T. Kwadjo, J. M. Mbongue, C. Bobda, Exploring a layer-based pre-
implemented flow for mapping cnn on fpga, in: 2021 IEEE International
Parallel and Distributed Processing Symposium Workshops (IPDPSW),
IEEE, 2021, pp. 116–123.
[12] S. Mittal, A survey of fpga-based accelerators for convolutional neural
networks, Neural computing and applications (2020) 1–31.
[13] X. Wei, Y. Liang, J. Cong, Overcoming data transfer bottlenecks in
fpga-based dnn accelerators via layer conscious memory management,
in: 2019 56th ACM/IEEE Design Automation Conference (DAC), IEEE,
2019, pp. 1–6.
[14] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, J. Cong,
Automated systolic array architecture synthesis for high throughput cnn
inference on fpgas, in: Proceedings of the 54th Annual Design Automa-
tion Conference 2017, ACM, 2017, p. 29.
[15] Y. Xing, S. Liang, L. Sui, X. Jia, J. Qiu, X. Liu, Y. Wang, Y. Shan,
Y. Wang, Dnnvm: End-to-end compiler leveraging heterogeneous op-
timizations on fpga-based cnn accelerators, IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems (2019).
[16] K. Abdelouahab, M. Pelcat, J. Serot, F. Berry, Accelerating cnn inference
on fpgas: A survey, arXiv preprint arXiv:1806.01683 (2018).
[17] P. Haghi, T. Geng, A. Guo, T. Wang, M. Herbordt, Fp-amg: Fpga-based
acceleration framework for algebraic multigrid solvers, in: 2020 IEEE
28th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), IEEE, 2020, pp. 148–156.
[18] M. J. H. Pantho, P. Bhowmik, C. Bobda, Towards an efficient cnn infer-
ence architecture enabling in-sensor processing, Sensors 21 (6) (2021)
1955.
[19] B. McDanel, S. Q. Zhang, H. Kung, X. Dong, Full-stack optimization for
accelerating cnns using powers-of-two weights with fpga validation, in:
Proceedings of the ACM International Conference on Supercomputing,
2020, pp. 449–460.
[20] Y. Ma, Y. Cao, S. Vrudhula, J.-s. Seo, An automatic rtl compiler for high-
throughput fpga implementation of diverse deep convolutional neural net-
works, in: 2017 27th International Conference on Field Programmable
Logic and Applications (FPL), IEEE, 2017, pp. 1–8.
[21] Y. Shen, M. Ferdman, P. Milder, Maximizing cnn accelerator efficiency
through resource partitioning, in: 2017 ACM/IEEE 44th Annual Inter-
national Symposium on Computer Architecture (ISCA), IEEE, 2017, pp.
535–547.
[22] S. Hussain, M. Javaheripi, P. Neekhara, R. Kastner, F. Koushanfar,
Fastwave: Accelerating autoregressive convolutional neural networks on
fpga, arXiv preprint arXiv:2002.04971 (2020).
[23] S. I. Venieris, C. S. Bouganis, fpgaConvNet: Mapping Regular and Irreg-
ular Convolutional Neural Networks on FPGAs, IEEE Transactions on
Neural Networks and Learning Systems (2018) 1–17.
[24] S. I. Venieris, C.-S. Bouganis, fpgaConvNet: A Framework for Map-
ping Convolutional Neural Networks on FPGAs, in: 2016 IEEE 24th
Annual International Symposium on Field-Programmable Custom Com-
puting Machines (FCCM), 2016, pp. 40–47.
[25] L. Petrica, T. Alonso, M. Kroes, N. Fraser, S. Cotofana, M. Blott,
Memory-efficient dataflow inference for deep cnns on fpga, in: 2020
International Conference on Field-Programmable Technology (ICFPT),
IEEE, 2020, pp. 48–55.
[26] C. Lavin, M. Padilla, S. Ghosh, B. Nelson, B. Hutchings, M. Wirthlin,
Using hard macros to reduce fpga compilation time, in: 2010 Interna-
tional Conference on Field Programmable Logic and Applications, IEEE,
2010, pp. 438–441.
[27] Y. Xiao, D. Park, A. Butt, H. Giesen, Z. Han, R. Ding, N. Magnezi,
R. Rubin, A. DeHon, Reducing fpga compile time with separate com-
pilation for fpga building blocks, in: 2019 International Conference on
Field-Programmable Technology (ICFPT), IEEE, 2019, pp. 153–161.
[28] S. Ma, Z. Aklah, D. Andrews, Just in time assembly of accelerators,
in: Proceedings of the 2016 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, 2016, pp. 173–178.
[29] A. Wold, D. Koch, J. Torresen, Component based design using constraint
programming for module placement on fpgas, in: 2013 8th International
Workshop on Reconfigurable and Communication-Centric Systems-on-
Chip (ReCoSoC), IEEE, 2013, pp. 1–8.
[30] A. Sohrabizadeh, J. Wang, J. Cong, End-to-end optimization of deep
learning applications, in: Proceedings of the 2020 ACM/SIGDA Interna-
tional Symposium on Field-Programmable Gate Arrays, 2020, pp. 133–
139.
[31] D. T. Nguyen, H. Kim, H.-J. Lee, Layer-specific optimization for mixed
data flow with mixed precision in fpga design for cnn-based object detec-
tors, IEEE Transactions on Circuits and Systems for Video Technology
(2020).
[32] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, J. Cong, Optimizing fpga-
based accelerator design for deep convolutional neural networks, in: Pro-
ceedings of the 2015 ACM/SIGDA international symposium on field-
programmable gate arrays, 2015, pp. 161–170.
[33] M. Blott, T. B. Preußer, N. J. Fraser, G. Gambardella, K. O’Brien,
Y. Umuroglu, M. Leeser, K. Vissers, Finn-r: An end-to-end deep-learning
framework for fast exploration of quantized neural networks, ACM Trans-
actions on Reconfigurable Technology and Systems (TRETS) 11 (3)
(2018) 1–23.
[34] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra,
H. Esmaeilzadeh, From high-level deep neural models to fpgas, in: The
49th Annual IEEE/ACM International Symposium on Microarchitecture,
IEEE Press, 2016, p. 17.
[35] A. Ahmad, M. A. Pasha, Towards design space exploration and optimiza-
tion of fast algorithms for convolutional neural networks (cnns) on fpgas,
in: 2019 Design, Automation & Test in Europe Conference & Exhibition
(DATE), IEEE, 2019, pp. 1106–1111.
[36] Xilinx, Hierarchical design, https://www.xilinx.com/support/documentation/
sw_manuals/xilinx2017_1/ug905-vivado-hierarchical-design.pdf (2017).
[37] G. Miranda, H. P. L. Luna, G. R. Mateus, R. P. M. Ferreira, A performance
guarantee heuristic for electronic components placement problems in-
cluding thermal effects, Computers & operations research 32 (11) (2005)
2937–2957.
[38] C. Lavin, M. Padilla, J. Lamprecht, P. Lundrigan, B. Nelson, B. Hutch-
ings, Hmflow: Accelerating fpga compilation with hard macros for
rapid prototyping, in: 2011 IEEE 19th Annual International Symposium
on Field-Programmable Custom Computing Machines, IEEE, 2011, pp.
117–124.
[39] W.-J. Hwang, Y.-J. Jhang, T.-M. Tai, An efficient fpga-based architecture
for convolutional neural networks, in: 2017 40th International Conference
on Telecommunications and Signal Processing (TSP), IEEE, 2017, pp.
582–588.
[40] Xilinx, Ultrascale architecture configurable logic block,
https://www.xilinx.com/support/documentation/user_guides/ug574-
ultrascale-clb.pdf (2017).
[41] P. Maidee, C. Neely, A. Kaviani, C. Lavin, An open-source lightweight
timing model for rapidwright, in: 2019 International Conference on Field-
Programmable Technology (ICFPT), IEEE, 2019, pp. 171–178.
[42] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning ap-
plied to document recognition, Proceedings of the IEEE 86 (11) (1998)
2278–2324.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Pro-
ceedings of the IEEE conference on computer vision and pattern recogni-
tion, 2015, pp. 1–9.
[44] Xilinx, Vivado design suite user guide, UG892 (v2020.2) (2021).
[45] Y. Chen, J. He, X. Zhang, C. Hao, D. Chen, Cloud-dnn: An open frame-
work for mapping dnn models to cloud fpgas, in: Proceedings of the 2019
ACM/SIGDA international symposium on field-programmable gate ar-
rays, 2019, pp. 73–82.
[46] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, J. Cong, Caffeine: Towards
uniformed representation and acceleration for deep convolutional neural
networks, IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems (2018).
[47] S. Biookaghazadeh, P. K. Ravi, M. Zhao, Toward multi-fpga accelera-
tion of the neural networks, ACM Journal on Emerging Technologies in
Computing Systems (JETC) 17 (2) (2021) 1–23.
[48] C. Zhu, K. Huang, S. Yang, Z. Zhu, H. Zhang, H. Shen, An efficient hard-
ware accelerator for structured sparse convolutional neural networks on
fpgas, IEEE Transactions on Very Large Scale Integration (VLSI) Sys-
tems 28 (9) (2020) 1953–1965.
[49] W. Pang, C. Wu, S. Lu, An energy-efficient implementation of group
pruned cnns on fpga, IEEE Access 8 (2020) 217033–217044.
[50] W. Jiang, E. H.-M. Sha, X. Zhang, L. Yang, Q. Zhuge, Y. Shi, J. Hu,
Achieving super-linear speedup across multi-fpga for real-time dnn in-
ference, ACM Transactions on Embedded Computing Systems (TECS)
18 (5s) (2019) 1–23.
[51] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, D. Chen, Design flow of accel-
erating hybrid extremely low bit-width neural network in embedded fpga,
in: 2018 28th International Conference on Field Programmable Logic and
Applications (FPL), IEEE, 2018, pp. 163–1636.
[52] S. I. Venieris, C.-S. Bouganis, Latency-driven design for fpga-based con-
volutional neural networks, in: 2017 27th International Conference on
Field Programmable Logic and Applications (FPL), IEEE, 2017, pp. 1–8.