AnyHLS: High-Level Synthesis with Partial Evaluation
M. Akif Özkan‡, Arsène Pérard-Gayot†, Richard Membarth†∗ , Philipp Slusallek†∗ , Roland Leißa∗,
Sebastian Hack∗, Jürgen Teich‡, and Frank Hannig‡
‡Friedrich-Alexander University Erlangen-Nürnberg (FAU), Germany
∗Saarland University (UdS), Germany †German Research Center for Artificial Intelligence (DFKI), Germany
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this
material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract—FPGAs excel in low power and high throughput
computations, but they are challenging to program. Traditionally,
developers rely on hardware description languages like Verilog or
VHDL to specify the hardware behavior at the register-transfer
level. High-Level Synthesis (HLS) raises the level of abstraction,
but still requires FPGA design knowledge. Programmers usually
write pragma-annotated C/C++ programs to define the hardware
architecture of an application. However, each hardware vendor
extends its own C dialect using its own vendor-specific set
of pragmas. This prevents portability across different vendors.
Furthermore, pragmas are not first-class citizens in the language.
This makes it hard to use them in a modular way or design proper
abstractions.
In this paper, we present AnyHLS, an approach to synthesize
FPGA designs in a modular and abstract way. AnyHLS is
able to raise the abstraction level of existing HLS tools by
resorting to programming language features such as types and
higher-order functions as follows: It relies on partial evaluation
to specialize and to optimize the user application based on
a library of abstractions. Then, vendor-specific HLS code is
generated for Intel and Xilinx FPGAs. Portability is obtained
by avoiding any vendor-specific pragmas in the source code. In
order to validate achievable gains in productivity, a library for
the domain of image processing is introduced as a case study,
and its synthesis results are compared with several state-of-the-
art Domain-Specific Language (DSL) approaches for this domain.
I. INTRODUCTION
Field Programmable Gate Arrays (FPGAs) consist of a network of
reconfigurable digital logic cells that can be configured to
implement any combinatorial logic or sequential circuits. This
allows the design of custom application-tailored hardware. In
particular, memory-intensive applications benefit from FPGA
implementations by exploiting fast on-chip memory for high
throughput. These features make FPGA implementations orders of
magnitude faster/more energy-efficient than CPU implementations in
these areas. However, FPGA programming poses challenges to
programmers unacquainted with hardware design.
FPGAs are traditionally programmed at Register-Transfer Level (RTL).
This requires modeling digital signals, their timing, and their flow
between registers, as well as the operations performed on them.
Hardware Description Languages (HDLs) such as Verilog or VHDL allow
for the explicit description of arbitrary circuits but require
significant coding effort and verification time. This makes design
iterations time-consuming and error-prone, even for experts: The
code needs to be rewritten for different performance or area
objectives. In recent languages such as Chisel [1], VeriScala [2],
and MyHDL [3], programmers can create a functional description of
their design but remain at the RTL.
High-Level Synthesis (HLS) increases the abstraction level to an
untimed high-level specification similar to imperative programming
languages and automatically solves low-level design issues such as
clock-level timing, register allocation, and structural
pipelining [4]. However, HLS code that is optimized for the
synthesis of high-performance circuits is fundamentally different
from a software program delivering high performance on a CPU. This
is due to the significant gap between the programming paradigms. An
HLS compiler has to optimize the memory hierarchy of a hardware
implementation and parallelize its data paths [5].
In order to achieve good Quality of Results (QoR), HLS languages
require programmers to specify not just the algorithm of an
application but also its hardware architecture. For this reason, HLS
languages offer hardware-specific pragmas. This ad-hoc mix of
software and hardware features makes it difficult for programmers to
optimize an application. In addition, most HLS tools rely on their
own C dialect, which prevents code portability. For example, Xilinx
Vivado HLS [6] uses C++ as base language while Intel SDK [7]
(formerly Altera) uses OpenCL C. These severe restrictions make it
hard to use existing HLS languages in a portable and modular way.
In this paper, we advocate describing FPGA designs using functional
abstractions and partial evaluation to generate optimized HLS code.
Consider Figure 1 for an example from image processing: With a
functional language, we separate the description of the sobel_x
operator from its realization in hardware. The hardware realization
make_local_op is a function that specifies the data path, the
parallelization, and memory architecture. Thus, the algorithm and
the hardware architecture are described by a set of higher-order
functions. A partial evaluator, ultimately, combines these functions
to generate HLS code that delivers high-performance circuit designs
when compiled with HLS tools. Since the initial descriptions are
high-level, compact, and functional, they are reusable and
distributable as a library.
We leverage the AnyDSL compiler framework [8] to perform partial
evaluation and extend it to generate input code for HLS tools
targeting Intel and Xilinx FPGA devices. We claim that this approach
leads to more modular and portable code than existing HLS
approaches, and is able to produce highly efficient hardware
implementations.
arXiv:2002.05796v2 [cs.PL] 21 Jul 2020
[Figure 1 diagram. Recoverable labels: line buffers (with row sel and col sel), sliding window, local operator (op1 ... opv), Mem1D (W×H, v), Mem2D (1, h, v), Mem2D (w+v−1, h, 1). Image © Blender Foundation (CC BY 3.0).]
let sobel_x = @|img, x, y|
    -1 * img.read(x-1, y-1) + 1 * img.read(x+1, y-1) +
    -2 * img.read(x-1, y  ) + 2 * img.read(x+1, y  ) +
    -1 * img.read(x-1, y+1) + 1 * img.read(x+1, y+1);
let input = make_img_mem1d("sandiego.jpg");
let output = make_img_mem1d("output.jpg");
let operator = make_local_op(sobel_x);
with generate(vhls) { operator(input, output) }
Figure 1. AnyHLS example: The algorithm description sobel_x is decoupled from its realization in hardware make_local_op. The hardware realization
is a function that specifies important transformations for the exploitation of parallelism and memory architecture. The function generate(vhls) selects the
backend for code generation, which is Vivado HLS in this case. Ultimately, an optimized input code for HLS is generated by partially evaluating the algorithm
and realization functions.
In summary, this paper makes the following contributions:
• We present AnyHLS¹, raising the abstraction level in HLS by using
  partial evaluation of higher-order functions as a core compiler
  technology. It guarantees the well-typedness of the residual
  program and offers considerably higher productivity than existing
  DSL design techniques and C/C++-based approaches (see Section II).
• AnyHLS offers unprecedented target independence, and thus
  portability, across different HLS tools by avoiding tool-specific
  pragma extensions and generating target-specific OpenCL or C++
  code as input to existing HLS tools (see Section III).
• Productivity, modularity, and portability gains are demonstrated
  by presenting an image processing library as a case study in
  Section IV. For this domain, we show that a competitive
  performance in terms of throughput and resource usage can be
  achieved in comparison with existing state-of-the-art DSLs (see
  Section V).
II. OVERVIEW, BACKGROUND, AND RELATED WORK
In the following, we briefly discuss prior work (Sections II-A and
II-B) and fundamental concepts of AnyDSL (Section II-C).
A. QoR and Portability of Code in C-based HLS
HLS increases the abstraction level from a fully-timed RTL to an
untimed high-level specification such as C/C++ or OpenCL. This eases
the hardware design problem by eliminating low-level issues such as
clock-level timing, register allocation, and gate-level
pipelining [4], [9], [10]. Modern HLS tools are able to generate
high-quality results for DSP and datapath-oriented applications.
Several authors (e.g., [4], [11], [12]) have argued the following
points as key to this success: (i) advancements in RTL design tools,
(ii) device-specific code generation, (iii) domain-specific focus on
the target applications, and (iv) generating both software and
hardware from the same code. Modern HLS tools such as Intel FPGA SDK
for OpenCL (AOCL) and Xilinx SDX offer system synthesis to map
program parts to either software or hardware. This enables
software-like development for library design and verification.
¹ https://github.com/AnyDSL/anyhls
There is an ongoing discussion whether C-based languages are good
candidates for HLS [4], [12]–[15]. Yet, the most commonly used HLS
compilers (e.g., Vivado HLS, AOCL, Catapult, LegUp) are based on
C-based languages [4], [6], [7], [10]. The modularity and
readability of C/C++ or OpenCL descriptions often conflict with best
coding practices of HLS compilers [16], [17]. In the hardware design
context, QoR refers to the ratio between the performance of the
circuit (latency, throughput) and design cost (circuit area, energy
consumption). C-based HLS code optimized for satisfactory QoR is
entirely different from a typical software program [16]–[20]. The
developer must express the FPGA implementation of an application
using the language abstractions of software, i.e., arrays and loops,
to specify the memory hierarchy and hardware pipelining. Language
extensions like pragmas fill the gap left by the lacking
FPGA-centric features. However, pragmas are specific to HLS tools,
and they cannot be used in a modular way because the preprocessor
already resolves them (e.g., pragmas cannot be passed as function
parameters). This ad-hoc mix of software and hardware abstractions
of programming languages in HLS makes optimizations hard [15], [17],
[19]. Furthermore, the lack of standardization in HLS languages and
compilers hinders the portability of code across them. Often, the
code optimized for one HLS tool must be changed significantly to
target another HLS tool even when the same FPGA design is described.
For these reasons, we believe that the next step for HLS requires an
increased level of abstraction on the language side, which can
reduce the need for expert knowledge.
B. Raising the Abstraction Level in HLS
Recent work suggests raising the abstraction level in HLS by
designing libraries, DSLs, or source-to-source compilers to hide
low-level implementation details. This improves the modularity and
reduces code duplication, but such tools are hard to develop and
maintain if the well-typedness of programs is to be preserved. The
approaches in [16]–[19] make extensive use of C++ template
metaprogramming to provide libraries that are optimized for Vivado
HLS. Generic programs can be optimized for compile-time known values
using metaprogramming techniques, but this has the following
drawbacks: (i) The well-typedness of the generated program cannot be
guaranteed in metaprogramming. This makes it difficult to understand
error messages. (ii) Metaprograms are hard to develop, maintain, and
understand since the meta language is different from the core
language (C++ core vs. C++ template language). For this reason, code
cannot be easily moved between the core and the meta language.
(iii) Lambda expressions are not allowed to be used as template
arguments in C++. We refer to [8] for more details. In particular,
[16], [18] explain the challenges of implementing higher-order
algorithms in C++ for Vivado HLS. OpenCL C does not support template
metaprogramming and thus forces users to use preprocessor macros for
generic library design. Therefore, libraries developed using C++
template metaprogramming have to be rewritten completely for
OpenCL C, that is, for AOCL.
DSLs use domain-specific knowledge to parallelize algorithms and
generate low-level, optimized code [21]. Programming accelerators
using DSLs is thus easier, in particular for FPGAs, because the
compiler performs scheduling. A prominent example of that is the
FPGA version of Spiral [22]. It generates HDL for digital signal
processing applications. In the domain of image processing, recent
projects include Darkroom [23], Rigel [24], and the work of Pu et
al. [25] based on Halide [26]. Hipacc [27], PolyMage [28],
SODA [29], and RIPL [30] create image processing pipelines from a
DSL. Rigel/Halide, PolyMage, and RIPL are declarative DSLs, whereas
Hipacc is embedded into C++. All of these compilers, except Rigel,
generate HLS code in order to simplify their backends. Other
examples include LIFT that targets FPGAs via algorithmic
patterns [31] and Tiramisu [32] for data-parallel algorithms on
dense arrays. Tiramisu takes as input a set of scheduling commands
from the user and feeds it to the polyhedral analysis of the
compiler. However, a considerable portion of these scheduling
primitives remains platform-specific [33]. Spatial [15] is a
language for programming Coarse-Grained Reconfigurable Architectures
(CGRAs) and FPGAs. Spatial provides language constructs to express
control, memory, and interfaces of a hardware implementation.
In this paper, we show that the abstraction level in HLS can be
raised by using recent compiler technology, in particular partial
evaluation and higher-order functions. Unlike the aforementioned DSL
compilers, AnyHLS allows programmers to build the basic blocks and
abstractions necessary for their application domain by themselves
(see Section III). AnyHLS is built on top of AnyDSL [8] (see
Section II-C). AnyDSL offers partial evaluation to enable shallow
embedding [34] without the need for modifying a compiler. This means
that there is no need to change the compiler when adding support for
a new application domain, since programmers can design custom
control structures. Partial evaluation specializes algorithmic
variants of a program at compile-time. Compared to metaprogramming,
partial evaluation operates in a single language and preserves the
well-typedness of programs [8]. Furthermore, different combinations
of static/dynamic parameters can be instantiated from the same code.
Previously, we have shown how to abstract image border handling
implementations for Intel FPGAs using AnyDSL [35]. In this paper, we
present AnyHLS and an image processing library to synthesize FPGA
designs in a modular and abstract way for both Intel and Xilinx
FPGAs.
C. AnyDSL Compiler Framework
AnyDSL² [8], [34] is a compiler framework for designing
high-performance, domain-specific libraries. It provides the
imperative and functional language Impala. Impala’s syntax is
inspired by Rust. We will now briefly discuss Impala’s most
important features that we rely on in AnyHLS.
1) Partial Evaluation: Partial evaluation is a technique for
program optimization by specialization with respect to compile-time
known values. Assume that each input of a program $F$ is classified
as either static $s$ or dynamic $d$, and values for all of the
static inputs are given. Then, partial evaluation produces an
optimized (residual) program $F_s$ such that
\[ [\![F_s]\!](d) = [\![F]\!](s, d) \tag{1} \]
and running $F_s$ on the dynamic inputs produces the same result as
running the original program $F$ on all of the inputs [36]. Compiler
techniques such as constant propagation, loop unrolling, or inlining
are examples of partial evaluation. Typically, the user has no
control over when a compiler applies these optimizations.
Impala allows programmers to partially evaluate [37] their program
at compile time. Programmers control the partial evaluator via
filters [38]. These are Boolean expressions of the form @(expr) that
annotate function signatures. Each call site instantiates the
callee’s filter with the corresponding argument list. The call is
specialized when the expression evaluates to true. The expression
?expr yields true if expr is known at compile-time; the expression
$expr is never considered constant by the evaluator. For example,
the following @(?n) filter will only specialize calls to pow if n is
statically known at compile-time:
fn @(?n) pow(x: int, n: int) -> int {
  if n == 0 {
    1
  } else {
    if n % 2 == 0 {
      let y = pow(x, n / 2);
      y * y
    } else {
      x * pow(x, n - 1)
    }
  }
}
Thus, the calls
let z = pow(x, 5);
let z = pow(3, 5);
will result in the following equivalent sequences of instructions
after specialization, respectively:
let y = x * x;
let z = x * y * y;
and
let z = 243;
As syntactic sugar, @ is available as shorthand for @(true).
This causes the partial evaluator to always specialize the
annotated function.
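The effect of the @(?n) filter can be imitated outside of Impala. The following Python sketch is not AnyDSL's partial evaluator; it is a hand-written specializer (the names specialize_pow, pow_n, and pow_generic are made up for this illustration) that treats n as the static input and x as the dynamic input and emits a residual program without branches or recursion. Under these assumptions it is an instance of Equation (1) with F = pow, s = n, and d = x.

```python
def specialize_pow(n):
    """Build a residual program for pow(x, n) with n known up front.

    Returns (source, function). The recursion over n runs here, at
    "specialization time"; only multiplications on x remain in the output.
    """
    lines = []

    def emit(expr):
        # Bind an intermediate result to a fresh temporary.
        name = f"t{len(lines)}"
        lines.append(f"    {name} = {expr}")
        return name

    def go(k):
        if k == 0:
            return "1"
        if k % 2 == 0:
            y = go(k // 2)
            return emit(f"{y} * {y}")
        rest = go(k - 1)
        return emit(f"x * {rest}")

    result = go(n)
    source = "def pow_n(x):\n" + "".join(line + "\n" for line in lines)
    source += f"    return {result}\n"
    namespace = {}
    exec(source, namespace)  # turn the residual source into a callable
    return source, namespace["pow_n"]


def pow_generic(x, n):
    """The unspecialized program F; both inputs are dynamic."""
    if n == 0:
        return 1
    if n % 2 == 0:
        y = pow_generic(x, n // 2)
        return y * y
    return x * pow_generic(x, n - 1)


source, pow5 = specialize_pow(5)
print(source)
```

Calling pow5(x) gives the same value as pow_generic(x, 5) for any x. Unlike the Impala result, this sketch does not simplify the multiplication by 1 in its first temporary; a real partial evaluator would fold it away.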
FPGA implementations must be statically defined for QoR: types,
loops, functions, and interfaces must be resolved at
compile-time [16], [18], [19]. Partial evaluation has many
advantages compared to metaprogramming as discussed in Section II-B.
Hence, Impala’s partial evaluation is particularly useful to
optimize HLS descriptions.
² https://anydsl.github.io
2) Generators: Because iteration on various domains is a common
pattern, Impala provides syntactic sugar for invoking certain
higher-order functions. The loop
for var1, ..., varn in iter(arg1, ..., argn) { /* ... */ }
translates to
iter(arg1, ..., argn, |var1, ..., varn| { /* ... */ });
The body of the for loop and the iteration variables constitute an
anonymous function
|var1, ..., varn| { /* ... */ }
that is passed to iter as the last argument. We call functions that
are invokable like this generators. Domain-specific libraries
implemented in Impala make heavy use of these features as they allow
programmers to write custom generators that take advantage of both
domain knowledge and certain hardware features, as we will see in
the next section.
Generators are particularly powerful in combination with
partial evaluation. Consider the following functions:
type Body = fn(int) -> ();
fn @(?a & ?b) unroll(a: int, b: int, body: Body) -> () {
if a < b { body(a); unroll(a+1, b, body) }
}
fn @ range(a: int, b: int, body: Body) -> () {
unroll($a, b, body)
}
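Both generators iterate from a (inclusive) to b (exclusive) while
invoking body each time. The filter of unroll tells the partial
evaluator to completely unroll the recursion if both loop bounds are
statically known at a particular call site. In range, the lower
bound is passed as $a and is therefore never considered static, so
the recursion is not unrolled and a loop remains in the residual
program.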
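What the two generators leave behind after partial evaluation can be sketched by letting each one return the text of its residual loop. The Python below is only an emulation under that assumption: range_loop, for_each, and the C-like output format are hypothetical, and Impala's evaluator works on its intermediate representation rather than on strings.

```python
def unroll(a, b, body):
    """Both bounds are static: unfold the recursion, one body per index."""
    if a < b:
        return body(str(a)) + unroll(a + 1, b, body)
    return []


def range_loop(a, b, body):
    """Lower bound treated as dynamic ($a): keep a loop around one body."""
    inner = ["    " + line for line in body("i")]
    return [f"for (int i = {a}; i < {b}; ++i) {{"] + inner + ["}"]


def for_each(generator, a, b, body):
    """`for i in generator(a, b) { body }` is `generator(a, b, |i| body)`."""
    return generator(a, b, body)


# The loop body is an ordinary function of the iteration variable.
body = lambda i: [f"out[{i}] = in[{i}] + 1;"]

unrolled = for_each(unroll, 0, 3, body)
looped = for_each(range_loop, 0, 3, body)
print("\n".join(unrolled))
print("\n".join(looped))
```

The same body function is used in both cases; only the generator passed to for_each decides whether three straight-line statements or a single loop are produced.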
III. THE ANYHLS LIBRARY
Efficient and resource-friendly FPGA designs require
application-specific optimizations. These optimizations and
transformations are well known in the community. For example, de
Fine Licht et al. [20] discuss the key transformations of HLS codes
such as loop unrolling and pipelining. They describe the whole
hardware design from the low-level memory layout to the operator
implementations with support for low-level loop transformations
throughout the design. In our setting, the programmer defines and
provides these abstractions using AnyDSL for a given domain in the
form of a library. We rely on partial evaluation to combine those
abstractions and to remove overhead associated with them.
Ultimately, the AnyDSL compiler synthesizes optimized HLS code (C++
or OpenCL C) from a given functional description of an algorithm as
shown in Figure 2. The generated code goes to the selected HLS tool.
This is in contrast to other domain-specific approaches like
Halide-HLS [25] or Hipacc [27], which rely on domain-specific
compilers to instantiate predefined templates or macros. Hipacc
makes use of two distinct libraries to synthesize algorithmic
abstractions to Vivado HLS and Intel AOCL, while AnyHLS uses the
same image processing library that is described in Impala.
A. HLS Code Generation
For HLS code generation, we implemented an intrinsic named vhls in
AnyHLS to emit Vivado HLS and an intrinsic named opencl to emit
AOCL:
with vhls() { body() }
with opencl() { body() }
[Figure 2 diagram, three flows from left to right.
Halide: halide-app.cpp → Halide compiler (Vivado backend) → VHLS-code.cpp + template library → VHLS → Xilinx FPGA.
Hipacc: hipacc-app.cpp → Hipacc compiler (Vivado backend, AOCL backend) → VHLS-code.cpp / AOCL-code.cl + template libraries → VHLS / AOCL → Xilinx / Intel FPGA.
AnyHLS: anyhls-app.impala + Image Processing Lib.impala → AnyDSL compiler (partial evaluator) → VHLS-code.cpp / AOCL-code.cl → VHLS / AOCL → Xilinx / Intel FPGA.]
Figure 2. FPGA code generation flows for Halide, Hipacc, and AnyHLS (from
left to right). VHLS and AOCL are used as acronyms for Vivado HLS and
Intel FPGA SDK for OpenCL, respectively. Halide and Hipacc rely on
domain-specific compilers for image processing that instantiate template
libraries. AnyHLS allows defining all abstractions for a domain in a
language called Impala and relies on partial evaluation for code
specialization. This ensures maintainability and extensibility of the
provided domain-specific library, for image processing in this example.
With opencl we use a grid and block size of (1, 1, 1) to generate a
single work-item kernel, as the official AOCL documentation
recommends [7]. We extended AnyDSL’s OpenCL runtime by the
extensions of Intel OpenCL SDK. To provide an abstraction over both
HLS backends, we create a wrapper generate that expects a code
generation function:
type Backend = fn(fn() -> ()) -> ();
fn @ generate(be: Backend, body: fn() -> ()) -> () {
with be() { body() }
}
Switching backends is now just a matter of passing an
appropriate function to generate:
let backend = vhls; // or opencl
with generate(backend) { body() }
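The wrapper works because a backend is an ordinary function value. A minimal Python sketch of that idea follows; the dictionaries returned by vhls and opencl are hypothetical stand-ins chosen for this illustration, whereas the real intrinsics trigger code generation inside the AnyDSL compiler.

```python
def vhls(body):
    """Stand-in for the Vivado HLS backend intrinsic."""
    return {"target": "Vivado HLS", "language": "C++", "kernel": body()}


def opencl(body):
    """Stand-in for the AOCL backend intrinsic (single work-item kernel)."""
    return {"target": "AOCL", "language": "OpenCL C",
            "grid": (1, 1, 1), "block": (1, 1, 1), "kernel": body()}


def generate(backend, body):
    """Backend-agnostic entry point: the backend is just an argument."""
    return backend(body)


# The kernel body is written once and handed to either backend.
kernel = lambda: "operator(input, output)"
for backend in (vhls, opencl):
    print(generate(backend, kernel)["target"])
```

Application code only ever calls generate, so retargeting means changing the one value bound to backend.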
B. Building Abstractions for FPGA Designs
In the following, we present abstractions for the key
transformations and design patterns that are common in FPGA design.
These include (a) important loop transformations, (b) control flow
and data flow descriptions such as reductions, (c) Finite State
Machines (FSMs), and (d) the explicit utilization of different
memory types. Approaches like Spatial [15] expose these patterns
within the language; new patterns require dedicated support from the
compiler. Hence, these languages and compilers are restricted to the
specialized application domain they have been designed for. In
AnyHLS, Impala’s functional language and partial evaluation allow us
to design the abstractions needed for FPGA synthesis in the form of
a library. New patterns can be added to the library without
dedicated support from the compiler. This makes AnyHLS easier to
extend compared to the approaches mentioned above.
1) Loop Transformations: C++ compilers usually provide certain
preprocessor directives that perform particular code
transformations. A common feature is to unroll loops, as in the
first listing (C++ with a pragma):
for (int i=0; i<N/W; ++i) {
  for (int w=0; w<W; ++w) {
    #pragma unroll
    body(i*W + w);
  }
}
The second listing shows the same loop nest in Impala:
for i in range(0, N/W) {
  for w in unroll(0, W) {
    body(i*W + w);
  }
}
[Figure 3 diagram: instances of body for four cases: no unrolling, unroll inner loop, unroll outer loop, unroll inner and outer loop.]
Figure 3. Parallel processing
Such pragmas are built into the compiler. The Impala version (the
second listing above) uses generators that are entirely implemented
as a library. Partial evaluation optimizes Impala’s range and unroll
abstractions as well as the input body function according to their
static inputs, i.e., N and W. The residual program consists of W
consecutive instances of the body function per loop iteration, as
shown in Figure 3. This generates concise and clean code for the
target HLS compiler, which is drastically different from using a
pragma.
Generators, unlike C++ pragmas, are first-class citizens of the
Impala language. This allows programmers to implement sophisticated
loop transformations. For example, the following function tile
returns a new generator. It instantiates a tiled loop nest of the
specified tile size with the Loops inner and outer:
type Loop = fn(int, int, fn(int) -> ()) -> ();
fn @ tile(size: int, inner: Loop, outer: Loop) -> Loop {
  @|beg, end, body| outer(0, (end-beg)/size,
    |i| inner(i*size + beg, (i+1)*size + beg, |j| body(j)))
}
let schedule = tile(W, unroll, range);
for i in schedule(0, N) {
  body(i)
}
Passing W for the tiling size, unroll for the inner loop, and range
for the outer loop yields a generator that is identical to the loop
nest at the beginning of this paragraph. With this design, we can
reuse or explore iteration techniques without touching the actual
body of a for loop. For example, consider the processing options for
a two-dimensional loop nest as shown in Figure 3: When just passing
range as inner and outer loop, the partial evaluator will keep the
loop nest and, hence, not unroll body and instantiate it only once.
Unrolling the inner loop replicates body and increases the bandwidth
requirements accordingly. Unrolling the outer loop also replicates
body, but in a way that benefits data reuse from the temporal
locality of an iterative algorithm. Unrolling both loops replicates
body for increased bandwidth and data reuse for the temporal
locality.
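The composition of generators can be tried out with ordinary closures. The Python sketch below mirrors tile under two assumptions that are not taken from the paper: the upper bound of each tile is (i+1)*size + beg, so that tiles do not overlap, and range_loop and unroll merely execute the iterations, since Python has no partial evaluator that would distinguish a kept loop from an unfolded one.

```python
def range_loop(beg, end, body):
    """Sequential loop: stands for a loop that survives specialization."""
    i = beg
    while i < end:
        body(i)
        i += 1


def unroll(beg, end, body):
    """Recursive loop: stands for a loop unfolded when bounds are static."""
    if beg < end:
        body(beg)
        unroll(beg + 1, end, body)


def tile(size, inner, outer):
    """Return a new generator running `inner` over tiles chosen by `outer`."""
    def tiled(beg, end, body):
        outer(0, (end - beg) // size,
              lambda i: inner(i * size + beg, (i + 1) * size + beg, body))
    return tiled


N, W = 8, 4
visited = []
schedule = tile(W, unroll, range_loop)
schedule(0, N, visited.append)
print(visited)
```

Every combination of inner and outer generator visits the same indices in the same order, which is why the schedule can be swapped without touching the loop body. As with the integer division in the Impala version, trailing iterations are skipped when size does not divide end - beg.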
C/C++-based HLS solutions often use a pragma to mark a loop amenable
for pipelining. This means that the loop iterations execute in
parallel (overlapped) in hardware. For example, the following C++
code (first listing) uses an initiation interval (II) of 3, i.e., a
new iteration starts every three clock cycles:
for (int i=0; i<N; ++i) {
  #pragma HLS pipeline II=3
  body(i);
}
The second listing shows the Impala counterpart:
let II = 3;
for i in pipeline(II, 0, N) {
  body(i)
}
Instead of a pragma (first listing), AnyHLS uses the intrinsic
generator pipeline (second listing). Unlike the above loop
abstractions (e.g., unroll), which are resolved by partial
evaluation, Impala emits a tool-specific pragma for the pipeline
abstraction. Because the pragma is generated per backend rather than
written in the source, the code stays portable across different HLS
tools. Furthermore, it allows the programmer to invoke and pass
around pipeline just like any other generator.
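How a single pipeline call can lead to different pragmas may be sketched as follows. The paper does not list the pragma text that AnyHLS emits for each tool, so both spellings below are assumptions: the Vivado HLS form is the one used in the C++ example above, and `#pragma ii` placed in front of the loop is assumed for Intel's OpenCL compiler. The tool names "vhls" and "aocl" and the string-based emitter are hypothetical as well.

```python
def pipeline(tool, ii, beg, end, body):
    """Emit a pipelined loop for `tool`; `body` maps the loop variable to code."""
    head = f"for (int i = {beg}; i < {end}; ++i) {{"
    inner = ["    " + line for line in body("i")]
    if tool == "vhls":
        # Vivado HLS style: the pragma sits inside the loop body.
        return [head, f"    #pragma HLS pipeline II={ii}"] + inner + ["}"]
    if tool == "aocl":
        # Assumed Intel style: an `ii` pragma in front of the loop.
        return [f"#pragma ii {ii}", head] + inner + ["}"]
    raise ValueError(f"unknown HLS tool: {tool}")


# One loop description, two tool-specific outputs.
body = lambda i: [f"out[{i}] = 2 * in[{i}];"]
vhls_code = pipeline("vhls", 3, 0, 16, body)
aocl_code = pipeline("aocl", 3, 0, 16, body)
print("\n".join(vhls_code))
print("\n".join(aocl_code))
```

The loop description is identical for both tools; only the emitter knows where the pragma goes and how it is spelled.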
2) Reductions: Reductions are useful in many contexts. The
following function takes an array of values, a range within it, and
an operator:
type T = int;
fn @(?beg & ?end) reduce(beg: int, end: int, input: &[T],
                         op: fn(T, T) -> T) -> T {
  let n = end - beg;
  if n == 1 {
    input(beg)
  } else {
    let m = (end + beg) / 2;
    let a = reduce(beg, m, input, op);
    let b = reduce(m, end, input, op);
    op(a, b)
  }
}
With the above filter, the recursion will be completely unfolded if
the range is statically known. Thus,
reduce(0, 4, [a, b, c, d], |x, y| x + y)
yields: (a+b)+(c+d).
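The shape of the unfolded expression can be checked with a direct Python transcription of reduce. This is an illustrative sketch; reduce_tree is a made-up name, and the recursion runs at run time here instead of being unfolded by a partial evaluator.

```python
def reduce_tree(beg, end, values, op):
    """Pairwise (tree-shaped) reduction over values[beg:end]; needs end > beg."""
    if end - beg == 1:
        return values[beg]
    mid = (beg + end) // 2
    left = reduce_tree(beg, mid, values, op)
    right = reduce_tree(mid, end, values, op)
    return op(left, right)


# Reducing symbols instead of numbers exposes the unfolded expression.
shape = reduce_tree(0, 4, ["a", "b", "c", "d"], lambda x, y: f"({x}+{y})")
total = reduce_tree(0, 5, [1, 2, 3, 4, 5], lambda x, y: x + y)
print(shape, total)
```

Splitting the range in the middle yields a balanced operator tree whose depth grows logarithmically with the number of inputs, whereas a left-to-right loop would produce a chain whose depth grows linearly.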
3) Finite State Machines: AnyHLS models computations that depend
not only on the inputs but also on an internal state with an FSM. To
define an FSM, programmers need to specify states and a transition
function that determines when to change the current state based on
the machine’s input. This is especially beneficial for modeling
control flow. To describe an FSM in Impala, we start by introducing
types to represent the states and the machine itself:
type State = int;
struct FSM {
add: fn(State, fn() -> (), fn() -> State) -> (),
run: fn(State) -> ()
}
An object of type FSM provides two operations: adding one state with
add or running the computation with run. The add method takes the
name of the state, an action to be performed for this state, and a
transition function associated with this state. Once all states are
added, the programmer runs the machine by passing the initial state
as an input parameter. The following example adds 1 to every element
of an array:
let buf = /*...*/;
let mut (idx, pixel) = (0, 0);
let fsm = make_fsm();
fsm.add(Read, || pixel = buf(idx),
|| if idx>=len { Exit } else { Compute });
fsm.add(Compute, || pixel += 1, || Write);
fsm.add(Write, || buf(idx++) = pixel, || Read );
fsm.run(Read);
Similar to the other abstractions introduced in this section, the
constructor for an FSM is not a built-in function of the compiler
but a regular Impala function. In some cases, we want to execute the
FSM in a pipelined way. For this scenario, we add a second method
run_pipelined. As all the methods, e.g., make_fsm, add, and run, are
annotated for partial evaluation (by @), input functions to these
methods will be optimized according to their static inputs.
Ultimately, AnyHLS will emit the states of an FSM as part of a loop
according to the selected run method.
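That an FSM constructor needs no compiler support can be illustrated with plain closures. The following Python sketch is one possible reading of make_fsm, not the library's implementation: the state table, the integer state names, and the bounds guard in read are assumptions made for this example, and run_pipelined is omitted.

```python
READ, COMPUTE, WRITE, EXIT = range(4)


def make_fsm():
    """Return (add, run) closures that share one table of states."""
    states = {}

    def add(state, action, transition):
        states[state] = (action, transition)

    def run(start):
        current = start
        while current in states:  # EXIT has no entry, so the loop stops
            action, transition = states[current]
            action()
            current = transition()

    return add, run


buf = [10, 20, 30]
reg = {"idx": 0, "pixel": 0}  # mutable machine registers


def read():
    # Guard added for Python, which checks array bounds on every access.
    if reg["idx"] < len(buf):
        reg["pixel"] = buf[reg["idx"]]


def compute():
    reg["pixel"] += 1


def write():
    buf[reg["idx"]] = reg["pixel"]
    reg["idx"] += 1


add, run = make_fsm()
add(READ, read, lambda: EXIT if reg["idx"] >= len(buf) else COMPUTE)
add(COMPUTE, compute, lambda: WRITE)
add(WRITE, write, lambda: READ)
run(READ)
print(buf)
```

The run function is the loop that dispatches on the current state; in AnyHLS the corresponding loop is what ends up in the generated HLS code.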
4) Memory Types and Memory Abstractions: FPGAs have different
memory types of varying sizes and access properties. Impala supports
four memory types specific to hardware design (see Figure 4): global
memory, on-chip memory, registers, and streams. Global memory
(typically DRAM) is allocated on the host using our runtime and
accessed through regular pointers. On-chip memory (e.g., BRAM or
M10K/M20K) for the FPGA is allocated using the reserve_onchip
compiler intrinsic. Memory accesses using the pointer returned by
this intrinsic will map to on-chip memory. Standard variables are
mapped to registers, and a specific stream type is available to
allow for the communication between FPGA kernels. Memory-wise, a
stream is mapped to registers or on-chip memory by the HLS tools.
These FPGA-specific memory types in Impala will be mapped to their
corresponding tool-specific declarations in the residual program
(on-chip memory will be defined as local memory for AOCL whereas it
will be defined as an array in Vivado HLS).
a) Memory partitioning: An array partitioning pragma must be
defined as follows to implement a C array with hardware registers
using Vivado HLS [6]:
typedef int T;
T Regs1D[size];
#pragma HLS array_partition variable=Regs1D complete dim=0
Listing 1. A typical way of partitioning an array by using pragmas in existing
HLS tools.
Other HLS tools offer similar pragmas for the same task.
Instead, AnyHLS provides a more concise description of a
register array without using any tool-specific pragma by the
recursive declaration of registers as follows:
type T = int;
struct Regs1D {
read: fn(int) -> T,
write: fn(int, T) -> (),
size: int
}
fn @ make_regs1d(size: int) -> Regs1D {
if size == 0 {
Regs1D {
read: @|_| 0,
write: @|_, _| (),
size: size
}
}else {
let mut reg: T;
let others = make_regs1d(size - 1);
Regs1D {
read: @|i| if i+1 == size { reg }
else { others.read(i) },
write: @|i, v| if i+1 == size { reg = v }
else { others.write(i, v) },
size: size
}
}
}
Listing 2. Recursive description of a register array using partial evaluation
instead of declaring an array and partitioning it by HLS pragmas.
When the size is not zero, each recursive call to this function
allocates a register variable named reg, and creates a smaller
register array with one element less named others. The read and
write functions test if the index i is equal to the index of the
current register. In the case of a match, the current register is
used. Otherwise, the search continues in the smaller array.
[Figure 4 diagram: global memory, on-chip memory, register, stream.]
Figure 4. Memory types provided for FPGA design
[Figure 5 diagram: Regs1D (1D register array), Regs2D (2D register array), OnChipArray (on-chip array), StreamArray (stream array).]
Figure 5. Memory abstractions
The generator (make_regs1d) returns an Impala variable that can be
read and written by index values (regs in the following code),
similar to C arrays.
let regs = make_regs1d(size);
However, it defines size registers in the residual program instead of declaring
an array and partitioning it by tool-specific pragmas as in Listing 1. The
generated code does not contain any compiler directives; hence, it can be used
with different HLS tools (e.g., Vivado HLS, AOCL). Since we annotated
make_regs1d, read, and write for partial evaluation, any call to these functions
will be inlined recursively. This means that the search for the register to read
from or write to is performed at compile time. These registers are optimized by
the AnyDSL compiler just like any other variables: unnecessary assignments are
avoided, and clean HLS code is generated.
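The recursive construction of Listing 2 can be mimicked in plain Python to see how the closures chain together. This is only a behavioral sketch: Python has no partial evaluator, so the index search below runs at run time, whereas AnyDSL resolves it at compile time. The one-element list cell (standing in for the mutable register reg) and the namedtuple (standing in for the Regs1D struct) are choices made for this illustration.

```python
from collections import namedtuple

# Mirrors the Regs1D struct of Listing 2: two accessor closures plus a size.
Regs1D = namedtuple("Regs1D", ["read", "write", "size"])

def make_regs1d(size):
    """Build a register array of `size` elements out of nested closures."""
    if size == 0:
        # Base case: an empty array reads as 0 and ignores writes.
        return Regs1D(read=lambda i: 0, write=lambda i, v: None, size=0)

    cell = [0]                      # stands in for `let mut reg: T`
    others = make_regs1d(size - 1)  # array holding the lower indices

    def read(i):
        return cell[0] if i + 1 == size else others.read(i)

    def write(i, v):
        if i + 1 == size:
            cell[0] = v
        else:
            others.write(i, v)

    return Regs1D(read=read, write=write, size=size)

regs = make_regs1d(4)
for i in range(regs.size):
    regs.write(i, 10 * i)
values = [regs.read(i) for i in range(regs.size)]
```

Each level of the recursion owns exactly one storage cell, which corresponds to one register in the residual program.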
Correspondingly, AnyHLS provides generators (similar to Listing 2) for one- and
two-dimensional arrays of on-chip memory (e.g., line buffers in Section IV),
global memory, and streams (as illustrated in Figure 5) instead of using the
memory partitioning pragmas encouraged in existing HLS tools (as in Listing 1).
IV. A LIBRARY FOR IMAGE PROCESSING ON FPGA
AnyHLS allows for defining domain-specific abstractions and optimizations that
are used and applied prior to generating customized input to existing HLS tools.
In this section, we introduce a library that is developed to support HLS for the
domain of image processing applications. It is based on the fundamental
abstractions introduced in Section III-B. Our low-level implementation is
similar to existing domain-specific languages targeting FPGAs [24], [27]. For
this reason, we focus on the interface of our abstractions as seen by the
programmer.
We design applications by decoupling their algorithmic description from their
schedule and memory operations. For instance, typical image operators, such as
the following Sobel filter, just resort to the make_local_op generator.
Similarly, we implement a point operator for RGB-to-gray color conversion as
follows (Listing 3):
fn sobel_edge(output: &mut [T], input: &[T]) -> () {
    let img = make_raw_mem2d(width, height, input);
    let dx  = make_raw_mem2d(width, height, output);
    let sobel_extents = extents(1, 1); // for 3x3 filter
    let operator = make_local_op(4, // vector factor
        sobel_operator_x, sobel_extents, mirror, mirror);
    with generate(hls) { operator(img, dx); }
}
fn rgb2gray(output: &mut [T], input: &[T]) -> () {
    let img  = make_raw_img(width, height, input);
    let gray = make_raw_img(width, height, output);
    let operator = make_point_op(@|pix| {
        let r = pix & 0xFF;
        let g = (pix >> 8) & 0xFF;
        let b = (pix >> 16) & 0xFF;
        (r+g+b)/3
    });
    with generate(hls) { operator(img, gray); }
}
Listing 3. Sobel filter and RGB-to-gray color conversion as example
applications described by using our library.
The image data structure is opaque. The target platform mapping determines its
layout. AnyHLS provides common border handling functions as well as point and
global operators such as reductions (see Section III-B2). These operators are
composable to allow for more sophisticated ones.
A. Vectorization
Image processing applications consist of loops that possess a very high degree
of spatial parallelism. This should be exploited to reach the bandwidth of the
memory technology. A resource-efficient approach, so-called vectorization or
loop coarsening, is to aggregate the input pixels into vectors and process
multiple input data at the same time to calculate multiple output pixels in
parallel [39]–[41]. This replicates only the arithmetic operations applied to
the data (the so-called datapath) instead of the whole accelerator, similar to
Single Instruction Multiple Data (SIMD) architectures. Vectorization requires a
control structure specialized for the considered hardware design.
We support the automatic vectorization of an application by a given factor v
when using our image processing library. In particular, our library uses the
vectorization techniques proposed in [40]. For example, the make_local_op
function has an additional parameter to specify the desired vectorization and
will propagate this information to the functions it uses internally:
make_local_op(op, v). For brevity, we omit the parameter for the vectorization
factor for the remaining abstractions in this section.
B. Memory Abstractions for Image Processing
1) Memory Accessor: In order to optimize memory access
and encapsulate the contained memory type (on-chip memory,
etc.) into a data structure, we decouple the data transfer from
the data use via the following memory abstractions:
struct Mem1D {
    read:   fn(int) -> T,
    write:  fn(int, T) -> (),
    update: fn(int) -> (),
    size:   int
}
struct Mem2D {
    read:   fn(int, int) -> T,
    write:  fn(int, int, T) -> (),
    update: fn(int, int) -> (),
    width: int, height: int
}
Similar to hardware design practices, these memory abstractions require the
memory address to be updated before the read/write operations. The update
function transfers data from/to the encapsulated memory to/from staging
registers using vector data types. Then, the read/write functions access an
element of the vector. This increases data reuse and DRAM-to-on-chip memory
bandwidth [42].
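The update-then-access protocol can be illustrated with a small Python model. It is a sketch under assumptions rather than the library's implementation: the factory names make_input_mem and make_output_mem are hypothetical, the backing memory is a flat Python list, and its length is assumed to be a multiple of the vector factor v.

```python
from types import SimpleNamespace

def make_input_mem(data, v):
    """Read side: update(idx) loads one vector of v elements into staging registers."""
    staging = [0] * v

    def update(idx):
        staging[:] = data[idx * v:(idx + 1) * v]

    def read(i):
        return staging[i]          # access one lane of the staged vector

    return SimpleNamespace(read=read, update=update, size=len(data) // v)

def make_output_mem(data, v):
    """Write side: write(i, x) fills a lane, update(idx) stores the whole vector."""
    staging = [0] * v

    def write(i, value):
        staging[i] = value

    def update(idx):
        data[idx * v:(idx + 1) * v] = staging

    return SimpleNamespace(write=write, update=update, size=len(data) // v)

src = list(range(8))
dst = [0] * 8
v = 4
img, out = make_input_mem(src, v), make_output_mem(dst, v)
for idx in range(img.size):        # one vector-wide transfer per iteration
    img.update(idx)
    for i in range(v):             # lane-wise processing
        out.write(i, img.read(i) + 1)
    out.update(idx)
```

With v = 4, the eight elements are moved in two vector-wide transfers per direction instead of eight scalar ones.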
2) Stream Processing: Inter-kernel dependencies of an algorithm should be
accessed on-the-fly in combination with fine-granular communication in order to
pipeline the full implementation with a fixed throughput. That is, as soon as a
block produces a data item, the next block consumes it. In the best case, this
requires only a single register or a small buffer instead of reading/writing
temporary images.
[Diagram: Kernel1, Kernel2, and Kernel3 connected in a chain through Mem1D streams.]
We define a stream between two kernels as follows:
fn make_mem_from_stream(size: int, data: stream) -> Mem1D;
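As a loose analogy, Python generators behave like such streams: each element passes through all kernels before the next one is produced, so no intermediate image is materialized. The three per-pixel operations below are placeholders chosen for this sketch and do not correspond to any application in the paper.

```python
def source(pixels):
    """Producer: emits the input pixels one by one."""
    for p in pixels:
        yield p

def kernel(op, stream):
    """Consumes one element from `stream` at a time and emits op(element)."""
    for p in stream:
        yield op(p)

pixels = [1, 2, 3, 4]
# Kernel1 -> Kernel2 -> Kernel3, connected by streams instead of temporary images.
k1 = kernel(lambda p: p + 1, source(pixels))
k2 = kernel(lambda p: p * p, k1)
k3 = kernel(lambda p: p - 1, k2)
result = list(k3)
```

Only the final list is stored; the values between k1, k2, and k3 exist one at a time, which is the software counterpart of a single register between two hardware blocks.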
3) Line Buffers: Storing an entire image to on-chip memory before execution is
not feasible since on-chip memory blocks are limited in FPGAs. On the other
hand, feeding the data on demand from main memory is extremely slow. Still, it
is possible to leverage fast on-chip memory by using it as FIFO buffers
containing only the necessary lines of the input images (W pixels per line).
[Diagram: line buffers (W, h, v) between a Mem1D (W, v) and a Mem2D (1, h, v).]
This enables parallel reads at the output for every pixel read at the input. We
model a line buffer as follows:
type LineBuf1D = fn(Mem1D) -> Mem1D;
fn make_linebuf1d(width: int) -> LineBuf1D;
// similar for LineBuf2D
Akin to Regs1D (see Section III-B4), a recursive call builds an array of line
buffers (each line buffer will be declared by a separate memory component in the
residual program, similar to the on-chip array in Figure 5).
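A behavioral sketch of a line buffer in Python may help here. It assumes a vector factor of 1 and a raster-order input; the name make_linebuf and its interface are hypothetical and only illustrate the principle of keeping h − 1 previous lines on chip.

```python
def make_linebuf(width, h):
    """Return update(col, pixel) -> list of h vertically adjacent pixels (oldest first)."""
    lines = [[0] * width for _ in range(h - 1)]   # h-1 buffered image lines

    def update(col, pixel):
        # The output column consists of the buffered pixels plus the new one.
        column = [line[col] for line in lines] + [pixel]
        # Shift the column by one line: the oldest pixel is dropped.
        for k in range(h - 1):
            lines[k][col] = column[k + 1]
        return column

    return update

W, H, h = 4, 3, 3
image = [[10 * y + x for x in range(W)] for y in range(H)]
update = make_linebuf(W, h)
# Feed the image in raster order; every input pixel yields one column of h pixels.
columns = [update(x, image[y][x]) for y in range(H) for x in range(W)]
```

Every pixel is read from the input exactly once, yet h pixels are available in parallel at the output.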
4) Sliding Window: Registers are the most amenable resources to hold data for
highly parallelized access. A sliding window of size w×h updates the
constituting shift registers by a new column of h pixels and enables parallel
access to w·h pixels.
[Diagram: the sliding window takes a Mem2D (1, h, v) and provides a Mem2D (w, h, 1).]
This provides high data reuse for temporal locality and avoids wasting on-chip
memory blocks that might be utilized for a similar data bandwidth. Our
implementation uses make_regs2d for an explicit declaration of registers and
supports pixel-based indexing at the output. This will instantiate w·h registers
in the residual program, as explained in Section III-B4.
type Swin2D = fn(Mem2D) -> Mem2D;
fn @ make_sliding_window(w: int, h: int) -> Swin2D {
    let win = make_regs2d(w, h);
    // ...
}
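The shift-register behavior of the sliding window can be sketched in Python as follows. The interface (update taking a column of h pixels, read taking window coordinates) is an assumption made for this example, and the elided body of make_sliding_window above may differ.

```python
from types import SimpleNamespace

def make_sliding_window(w, h):
    """Model a w x h window of shift registers, stored as win[y][x]."""
    win = [[0] * w for _ in range(h)]

    def update(column):
        # Shift every row left by one and append the new column on the right.
        for y in range(h):
            win[y] = win[y][1:] + [column[y]]

    def read(x, y):
        return win[y][x]           # all w*h registers are readable in parallel

    return SimpleNamespace(update=update, read=read, width=w, height=h)

swin = make_sliding_window(3, 2)
for column in ([1, 2], [3, 4], [5, 6], [7, 8]):
    swin.update(column)
window = [[swin.read(x, y) for x in range(3)] for y in range(2)]
```

After four updates, the window holds the three most recent columns; the first column has been shifted out.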
C. Loop Abstractions for Image Processing
1) Point Operators: Algorithms such as image scaling and color transformation
calculate an output pixel for every input pixel. The point operator abstraction
(see Listing 4) in AnyHLS yields a vectorized pipeline over the input and output
image. This abstraction is parametric in its vector factor v and the desired
operator function op.
type PointOp = fn(Mem1D) -> Mem1D;
fn @ make_point_op(v: int, op: Op) -> PointOp {
    @|img, out| {
        for idx in pipeline(1, 0, img.size) {
            img.update(idx);
            for i in unroll(0, v) {
                out.write(i, op(img.read(i)));
            }
            out.update(idx);
        }
    }
}
Listing 4. Implementation of the point operator abstraction.
The total latency is
$$ L = L_{\text{arith}} + \lceil W/v \rceil \cdot H \ \text{cycles} \qquad (2) $$
where W and H are the width and height of the input image, and
$L_{\text{arith}}$ is the latency of the data path.
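Equation (2) is easy to evaluate for concrete parameters. The helper below is a minimal sketch; the value used for the data path latency in the example calls is hypothetical, since it depends on the synthesized operator.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)

def point_op_latency(width, height, v, l_arith):
    """Total latency of a point operator in clock cycles, following Equation (2)."""
    return l_arith + ceil_div(width, v) * height

# l_arith = 10 is a made-up data path latency used only for illustration.
latency_v1 = point_op_latency(1024, 1024, 1, 10)
latency_v4 = point_op_latency(1024, 1024, 4, 10)
```

Increasing v from 1 to 4 divides the dominant term by four, while the data path latency stays a constant offset.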
2) Local Operators: Algorithms such as Gaussian blur and Sobel edge detection
calculate an output pixel by considering the corresponding input pixel and a
certain neighborhood of it in a local window. Thus, a local operator with a w×h
window requires w·h pixel reads for every output. The same (w−1)·h pixels are
used to calculate results at the image coordinates (x, y) and (x+1, y). This
spatial locality is transformed into temporal locality when input images are
read in raster order for burst mode, and subsequent pixels are sequentially
processed with a streaming pipeline implementation. The local operator
implementation in AnyHLS (shown in Listing 5) consists of line buffers and a
sliding window to hold dependency pixels in on-chip memory and calculates a new
result for every new pixel read.
[Diagram: architecture of the local operator. Components: Mem1D (W×H, v) input and output, line buffers producing a Mem2D (1, h, v), sliding window with Mem2D (w+v−1, h, 1), row and column selection (row sel, col sel), and the operator instances op1 … opv.]
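This provides a throughput of v pixels per clock cycle at the cost of an initial
latency (v is the vectorization factor)
$$ L_{\text{initial}} = L_{\text{arith}} + \left( \lfloor h/2 \rfloor \cdot \lceil W/v \rceil + \lfloor \lceil w/v \rceil / 2 \rfloor \right) \qquad (3) $$
that is spent for caching neighboring pixels of the first calculation. The final
latency is thus:
$$ L = L_{\text{initial}} + \left( \lceil W/v \rceil \cdot H \right) \qquad (4) $$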
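Equations (3) and (4) can be checked numerically. The sketch below evaluates them for the Gauss configuration reported later in Table II (1020×1020 image, 5×5 window, v = 1, data path latency of 14 cycles); the function names are chosen for this example only.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)

def local_op_initial_latency(W, w, h, v, l_arith):
    """Initial latency in cycles, following Equation (3)."""
    return l_arith + (h // 2) * ceil_div(W, v) + ceil_div(w, v) // 2

def local_op_latency(W, H, w, h, v, l_arith):
    """Total latency in cycles, following Equation (4)."""
    return local_op_initial_latency(W, w, h, v, l_arith) + ceil_div(W, v) * H

# Gauss, 5x5 window, 1020x1020 image, v = 1, L_arith = 14 (values from the evaluation).
gauss_initial = local_op_initial_latency(1020, 5, 5, 1, 14)
gauss_total = local_op_latency(1020, 1020, 5, 5, 1, 14)
```

The result of 1042456 cycles agrees with the latency listed for AnyHLS' Gauss implementation with v = 1 in Table II.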
type LocalOp = fn(Mem1D) -> Mem1D;
fn @ make_local_op(v: int, op: Op, ext: Extents,
                   bh_lower: FnBorder,
                   bh_upper: FnBorder) -> LocalOp {
    @|img, out| {
        let mut (col, row, idx) = (0, 0, 0);
        let wait = /* initial latency */;
        let fsm = make_fsm();
        fsm.add(Read, || img.update(idx), || Compute);
        fsm.add(Compute, || {
            line_buffer.update(col);
            sliding_window.update(row);
            col_sel.update(col);
            for i in unroll(0, v) {
                out.write(i, op(col_sel.read(i)));
            }
        }, || if idx > wait { Write } else { Index });
        fsm.add(Write, || out.update(idx-wait-1), || Index);
        fsm.add(Index, || {
            idx++; col++;
            if col == img_width { col=0; row++; }
        }, || if idx < img.size { Read } else { Exit });
        fsm.run_pipelined(Read, 1, 0, img.size);
    }
}
Listing 5. Implementation of the local operator abstraction.
Compared to the local operator in Figure 1, we also support boundary handling.
We specify the extent of the local operator (filter size / 2) as well as
functions specifying the boundary handling for the lower and upper bounds. Then,
row and column selection functions apply border handling correspondingly in x-
and y-directions by using one-dimensional multiplexer arrays similar to Özkan et
al. [40].
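To make the notion of a border handling function concrete, the following Python sketch maps an out-of-range index back into the valid range. The signatures are hypothetical and do not reflect the FnBorder type of the library; in particular, the mirror variant shown here repeats the edge pixel, which is only one of several common conventions, and it assumes that the extent is smaller than the image.

```python
def clamp(idx, lower, upper):
    """Repeat the border pixel for every out-of-range index."""
    return min(max(idx, lower), upper)

def mirror(idx, lower, upper):
    """Reflect at the border, repeating the edge pixel (-1 -> 0, -2 -> 1)."""
    if idx < lower:
        return lower + (lower - idx - 1)
    if idx > upper:
        return upper - (idx - upper - 1)
    return idx

row = [10, 20, 30, 40]
extent = 2                              # corresponds to a window width of 5
x = 0                                   # window centered on the first pixel
taps = range(x - extent, x + extent + 1)
clamped = [row[clamp(i, 0, len(row) - 1)] for i in taps]
mirrored = [row[mirror(i, 0, len(row) - 1)] for i in taps]
```

In hardware, such an index mapping becomes a multiplexer that selects among the registers of the window, which is why only the lower and upper bounds need separate handling.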
V. EVALUATION AND RESULTS
In the following, we compare the Post Place and Route (PPnR) results using
AnyHLS and other state-of-the-art domain-specific approaches including
Halide-HLS [25] and Hipacc [27]. The generated HLS codes are compiled using
Intel FPGA SDK for OpenCL 18.1 and Xilinx Vivado HLS 2017.2 targeting a
Cyclone V GT 5CGTD9 FPGA and a Zynq XC7Z020 FPGA, respectively.
The generated hardware designs are evaluated for their throughput, latency, and
resource utilization. FPGAs possess two types of resources: (i) computational:
LUTs and DSP blocks; (ii) memory: flip-flops (FFs) and on-chip memory
(BRAM/M20K). A SLICE/ALM comprises look-up tables (LUTs) and flip-flops and thus
indicates the resource usage when considered together with the DSP blocks and
on-chip memory blocks. The implementation results presented for Vivado HLS
feature only the kernel logic, while those by Intel OpenCL include PCIe
interfaces. The execution time of an FPGA circuit (Vivado HLS implementation)
equals Tclk · latency, where Tclk is the clock period of the maximum achievable
clock frequency (lower is better). We measured the timing results for Intel
OpenCL by executing the applications on a Cyclone V GT 5CGTD9 FPGA. This is the
case for all analyzed applications. We have no intention nor license rights
[43, §4] [44, §2] to benchmark and compare the considered FPGA technologies or
HLS tools.
A. Applications
In our experimental evaluation, we consider the following applications:
• Gaussian (Gauss) blurring an image with a 5×5 integer kernel
• Harris corner detector (Harris) consisting of 9 kernels that resort to integer arithmetic and horizontal/vertical derivatives
• Jacobi smoothing an image with a 3×3 integer kernel
• filter chain (FChain) consisting of 3 convolution kernels as a pre-processing algorithm
• bilateral filter (Bilateral), a 5×5 floating-point kernel as an edge-preserving and noise-reducing function based on exponential functions
• mean filter (MF), a 5×5 filter that determines the average within a local window via 8-bit arithmetic
• SobelLuma, an edge detection algorithm provided as a design example by Intel. The algorithm consists of RGB to Luma color conversion, Sobel filters, and thresholding
[Figure 6. Execution time [ms] for naïve and streaming pipeline implementations of the Harris and FChain for an Intel Cyclone V for images of 1024×1024.]
B. Library Optimizations
AnyHLS exploits stream processing and performs implicit parallelization. The
following subsections show the impact of those optimizations.
1) Stream Processing: Memory transfers between the FPGA’s programmable logic and
external memory are among the most time-consuming parts of many image processing
applications. The AnyHLS streaming pipeline optimization passes dependency
pixels directly from the producer to the consumer kernel, as explained in
Section IV-B2. This allows pipelined kernel execution and makes intermediate
images between kernels superfluous. The more intermediate images are eliminated,
the better the performance of the resulting designs. For example, this
eliminates 8 intermediate images in Harris corner and 2 in filter chain; see
Figure 6 for the performance impact.
The throughput of both streaming pipeline implementations is indeed determined
by their slowest individual kernel, which is a local operator. Consider Table I,
which displays the Vivado HLS reports. The latency results correspond to
Equation (4).
Table I
Streaming pipeline implementations of Harris and FChain on a Xilinx Zynq. Data
is transferred to the FPGA only once, thus similar throughputs are achieved.
Image sizes are 1024×1024, v = 1, f_target = 200 MHz.
App.     Largest mask   Sequential dependency    Latency [cyc.]   Throughput [MB/s]
FChain   5×5            local + local + local    1050649          821
Harris   3×3            local + local + point    1049634          825
2) Vectorization: Many FPGA implementations benefit from parallel processing in
order to increase memory bandwidth. AnyHLS implicitly parallelizes a given image
pipeline by a vectorization factor v. As an example, Figure 7 shows the PPnR
results, along with the achieved memory throughput for different vectorization
factors for the mean filter on a Cyclone V. The memory bound of the Cyclone V is
reported by Intel’s diagnosis tool. The speedup is almost linear, whereas
resource utilization grows sub-linearly with the vectorization factor, as
Figure 7 depicts. AnyHLS exploits the data reuse between consecutive iterations
of the local operators. Data is read and written with the vectorized data types.
The line buffers and the sliding window are extended to hold dependency pixels
for vectorized processing. Thus, only the datapath is replicated instead of the
whole accelerator implementation (see Section IV-A). All the considered
applications except Bilateral in Figure 9 reach the memory bound. Bilateral is
compute-bound due to its large number of floating-point operations.
[Figure 7. PPnR results of AnyHLS’s mean filter implementation on an Intel Cyclone V. The memory bound of the device for our setup is 1344.80 MB/s. Panels: throughput [MB/s] and resource usage in % (on-chip memory blocks, logic resources), each over the vectorization factor v = 1, 2, 4, 8, 16, 32.]
C. Hardware Design Evaluation
We evaluate the generated hardware designs based on their throughput, latency,
and resource utilization. As a reference, we use the designs generated by
Halide-HLS [25] and Hipacc [27], two state-of-the-art image processing DSLs that
generate better results than previous approaches (e.g., Xilinx OpenCV). In
contrast to these, which implement dedicated HLS code generators, AnyHLS is
essentially implemented as a library within the AnyDSL framework, as illustrated
in Figure 2. Our focus is to show that higher-order abstractions, together with
partial evaluation, are powerful enough to design a library targeting different
HLS compilers.
1) Experiments using Xilinx Vivado HLS: We evaluate the results of circuits
generated using AnyHLS in comparison with the domain-specific language
approaches Hipacc and Halide-HLS. We consider two representative applications
from the Halide-HLS repository with different configurations (border handling
mode and vectorization factor): Gauss and Harris. These DSLs have been developed
by FPGA experts and perform better than many other existing libraries. The
applications are rewritten for Hipacc and AnyHLS by respecting their original
descriptions. This ensures that Halide-HLS applications have been implemented
with adequate scheduling primitives. Hipacc and AnyHLS implementations require
only the algorithm descriptions as input.
For almost all applications in Tables II and III, AnyHLS implementations demand
fewer resources and deliver higher performance. Of course, this improvement
mainly stems from our library implementation. AnyHLS achieves a lower latency
mainly for the following reasons:
i) The latency of a local operator generated from AnyHLS’ image processing
library corresponds to the theoretical latency given in Equation (4), which is
L = Larith + 1,042,442 clock cycles for Gauss when v = 1. Larith = 14 for
AnyHLS’ Gauss implementation, as shown in Table II.
ii) Halide-HLS pads input images according to the selected border handling mode
(even when no border handling is defined). This increases the input image size
from (W, H) to (W + w − 1, H + h − 1) and thus the latency.
iii) Hipacc does not pad input images but runs (H + ⌊h/2⌋) · (W + ⌊w/2⌋) loop
iterations for a (W×H) image and a (w×h) window. This is similar to the
convolution example in the Vivado Design Suite User Guide [6], but not optimal.
The execution time of an implementation equals Tclk · latency, where Tclk is the
clock period of the maximum achievable clock frequency (lower is better).
Overall, AnyHLS processes a given image faster than the other DSL
implementations.
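The relation between latency and execution time can be worked through with a short Python sketch. The achieved clock frequencies are not listed in this section, so the example uses the target frequency of 200 MHz from Table II as a stand-in; the resulting times are therefore illustrative and not measured values.

```python
def execution_time_ms(latency_cycles, f_mhz):
    """Execution time T_clk * latency in milliseconds for a clock of f_mhz MHz."""
    t_clk_ns = 1e3 / f_mhz              # clock period in nanoseconds
    return latency_cycles * t_clk_ns * 1e-6

# Gauss latencies for v = 1 from Table II; 200 MHz is an assumed clock frequency.
t_anyhls = execution_time_ms(1042456, 200.0)
t_halide = execution_time_ms(1052673, 200.0)
```

At equal clock frequency, the difference in execution time follows directly from the difference in latency; a design that closes timing at a higher frequency gains additionally through a shorter Tclk.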
Halide-HLS uses more on-chip memory for line buffers (see Section IV-C2)
compared to Hipacc and AnyHLS because of its image padding for border handling.
Let us consider the number of BRAMs utilized for the Gaussian blur: The line
buffers need to hold 4 image lines for the 5×5 kernel. The image width is 1024
and the pixel size is 32 bits. Therefore, AnyHLS and Hipacc use eight 18K BRAMs
as shown in Table II. However, Halide-HLS stores 1028 integer pixels per line,
which requires sixteen 18K BRAMs to buffer four image lines. This doubles the
BRAM usage (see Table III).
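The BRAM counts above can be reproduced with a simple accounting model. The sketch assumes that each 18K BRAM is used in a 1024×18 bit configuration and that every image line is mapped to its own group of BRAMs; this is one plausible mapping that matches the reported numbers, not necessarily the one chosen by the synthesis tool.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return -(-a // b)

def bram18_count(lines, pixels_per_line, bits_per_pixel, depth=1024, width=18):
    """Number of 18K BRAMs for `lines` line buffers (assumed 1024 x 18 configuration)."""
    brams_wide = ceil_div(bits_per_pixel, width)    # BRAMs side by side for the word width
    brams_deep = ceil_div(pixels_per_line, depth)   # BRAMs stacked for the line length
    return lines * brams_wide * brams_deep

unpadded = bram18_count(4, 1024, 32)   # line width used by AnyHLS and Hipacc
padded = bram18_count(4, 1028, 32)     # line width after padding by w - 1 = 4 pixels
```

Under this model, the four additional pixels per line push each line buffer past the 1024-entry depth of one BRAM, which doubles the count from 8 to 16.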
AnyHLS uses the vectorization architecture proposed in [40]. This improves the
use of the registers compared to Hipacc and Halide.
The performance metrics and resource usage reported by Vivado HLS correlate with
our Impala descriptions; hence, we claim that the HLS code generated from
AnyHLS’ image processing library does not entail severe side effects for the
synthesis of Vivado HLS. Hipacc and Halide-HLS have dedicated compiler backends
for HLS code generation. These can be improved to achieve similar performance to
AnyHLS. However, this is not a trivial task and prone to errors. The advantage
of AnyDSL’s partial evaluation is that the user has control over code
generation. Extending AnyHLS’ image processing library only requires adding new
functions in Impala (see Figure 2). Our intention in comparing AnyHLS with these
DSLs is to show that we can generate equally good designs without creating an
entire compiler backend.
Table II
PPnR results for the Xilinx Zynq board for images of size 1020×1020 and
T_target = 5 ns (corresponds to f_target = 200 MHz). Border handling is
undefined.
App      v   Framework    #BRAM   #SLICE   #DSP   Latency [cyc.]   Throughput [MB/s]
Gauss    1   AnyHLS       8       463      16     1042456          828.2
Gauss    1   Halide-HLS   8       1823     50     1052673          438.2
Gauss    1   Hipacc       8       473      16     1044500          764.7
Gauss    4   AnyHLS       16      1441     80     260626           3041.4
Gauss    4   Halide-HLS   16      4112     180    266241           1640.1
Gauss    4   Hipacc       16      1519     64     261649           3064.6
Harris   1   AnyHLS       20      1405     22     1041450          829.0
Harris   1   Halide-HLS   16      2688     35     1052673          464.0
Harris   1   Hipacc       20      1457     34     1042466          828.2
Harris   2   AnyHLS       20      2513     44     520740           1450.4
Harris   2   Halide-HLS   16      4011     70     528385           895.0
Harris   2   Hipacc       20      2326     68     521756           1637.8
Table III
PPnR results for the Gaussian blur with clamping at the borders. Image sizes are
1024×1024, v = 1, f_target = 200 MHz.
Framework    #BRAM   #SLICE   #DSP   Latency [cyc.]   Throughput [MB/s]
AnyHLS       8       1646     16     1050641          801.8
Halide-HLS   16      2096     50     1060897          458.7
Hipacc       8       1709     16     1052693          820.1
2) Experiments using Intel FPGA SDK for OpenCL (AOCL): Table IV presents the
implementation results for an edge detection algorithm provided as a design
example by Intel. The algorithm consists of RGB to Luma color conversion, Sobel
filters, and thresholding. Intel’s implementation consists of a single-work-item
kernel that utilizes shift registers according to the FPGA design paradigm.
These types of techniques are recommended by Intel’s optimization guide [7],
even though the same OpenCL code performs drastically worse on other computing
platforms.
Table IV
PPnR results of an edge detection application for the Intel Cyclone V. Image
sizes are 1024×1024. None of the implementations use DSPs.
v    Framework      #M10K   #ALM    #DSP   Throughput [MB/s]
1    Intel’s Imp.   290     23830   0      419.5
1    AnyHLS         291     23797   0      422.5
1    Hipacc         318     25258   0      449.1
16   Intel’s Imp.   -       -       0      -
16   AnyHLS         337     29126   0      1278.3
16   Hipacc         362     35079   0      1327.7
32   Intel’s Imp.   -       -       0      -
32   AnyHLS         401     38069   0      1303.8
32   Hipacc         421     44059   0      1320.0
We described Intel’s handwritten SobelLuma example using Hipacc and AnyHLS. Both
Hipacc and AnyHLS provide a higher throughput even without vectorization. In
order to reach the memory bound, we would have to rewrite Intel’s hand-tuned
design example to exploit further parallelism. AnyHLS uses slightly fewer
resources, whereas Hipacc provides slightly higher throughput for all the
vectorization factors. Similar to Figure 7, both frameworks yield throughputs
very close to the memory bound of the Intel Cyclone V.
[Figure 8. Design space for a 5×5 mean filter using an NDRange kernel (using the num_compute_units/num_simd_work_items attributes) and AnyHLS (using the vectorization factor v) for an Intel Cyclone V. Axes: hardware resources (logic utilization [%]) and throughput [MPixel/s].]
[Figure 9. Throughput measurements [MPixel/s] for an Intel Cyclone V for the implementations of MF, Gauss, Jacobi, Bilateral, FChain, and Harris generated from AnyHLS and Hipacc. Resource utilization for the same implementations is shown in Table V.]
The OpenCL NDRange kernel paradigm conveys multiple concurrent threads for
data-level parallelism. OpenCL-based HLS tools exploit this paradigm to
synthesize hardware. AOCL provides attributes for NDRange kernels to transform
its iteration space. The num_compute_units attribute replicates the kernel
logic, whereas num_simd_work_items vectorizes the kernel implementation³.
Combinations of those provide a vast design space for the same NDRange kernel.
However, as Figure 8 demonstrates, AnyHLS achieves implementations that are
orders of magnitude faster than using attributes in AOCL.
Finally, Table V and Figure 9 present a comparison between AnyHLS and the AOCL
backend of Hipacc [45]. As shown in Figure 2, Hipacc has an individual backend
and template library written with preprocessor directives to generate
high-performance OpenCL code for FPGAs. In contrast, the application and library
code in AnyHLS stays the same. The generated AOCL code consists of a loop that
iterates over the input image. Compared to Hipacc, AnyHLS achieves similar
performance but outperforms Hipacc for multi-kernel applications such as the
Harris corner detector. This shows that AnyHLS optimizes the inter-kernel
dependencies better than Hipacc (see Section IV-B2).
³ These parallelization attributes are suggested in [7] for NDRange kernels, not
for the single-work-item kernels using shift registers such as the edge
detection application shown in Table IV.
Table V
PPnR for the Intel Cyclone V. Missing numbers (-) indicate that the generated
implementations do not fit the board.
App      v    Framework   #M10K   #ALM    #DSP   Throughput [MB/s]
Gauss    16   AnyHLS      401     37509   0      1330.1
Gauss    16   Hipacc      402     35090   0      1301.2
Jacobi   16   AnyHLS      370     31446   0      1328.8
Jacobi   16   Hipacc      372     30296   0      1282.9
Bilat.   1    AnyHLS      399     79270   153    326.6
Bilat.   1    Hipacc      422     79892   159    434.7
MF       16   AnyHLS      400     39266   0      1255.68
MF       16   Hipacc      -       -       -      -
MF       8    Hipacc      351     31796   0      1275.9
FChain   8    AnyHLS      418     44807   0      1230.6
FChain   8    Hipacc      645     64225   0      427.4
Harris   8    AnyHLS      442     50537   96     1158.5
Harris   8    Hipacc      668     74246   96     187.14
VI. CONCLUSIONS
In this paper, we advocate the use of modern compiler technologies for
high-level synthesis. We combine functional abstractions with the power of
partial evaluation to decouple a high-level algorithm description from the
hardware design that implements the algorithm. This process is entirely driven
by code refinement, generating input code to HLS tools, such as Vivado HLS and
AOCL, from the same code base. To specify important abstractions for hardware
design, we have introduced a set of basic primitives. Library developers can
rely on these primitives to create domain-specific libraries. As an example, we
have implemented an image processing library for synthesis to both Intel and
Xilinx FPGAs. Finally, we have shown that our results are on par or even better
in performance compared to state-of-the-art approaches.
ACKNOWLEDGMENTS
This work is supported by the Federal Ministry of Education and Research (BMBF)
as part of the Metacca, MetaDL, ProThOS, and REACT projects as well as the Intel
Visual Computing Institute (IVCI) at Saarland University. It was also partially
funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)
– project number 146371743 – TRR 89 “Invasive Computing”. Many thanks to our
colleague Puya Amiri for his work on the pipeline support.
REFERENCES
[1]
J. Bachrach et al., “Chisel: Constructing hardware in a Scala
embedded language”, in Proc. of the 49th Annual Design
Automation Conf. (DAC), IEEE, Jun. 3–7, 2012.
[2]
Y. Liu et al., “A scala based framework for developing accel-
eration systems with FPGAs”, Journal of Systems Architecture,
vol. 98, 2019.
[3]
J. Decaluwe, “MyHDL: A Python-based hardware description
language”, Linux Journal, no. 127, 2004.
[4]
J. Cong et al., “High-level synthesis for FPGAs: From
prototyping to deployment”, IEEE Trans. on Computer-Aided
Design of Integrated Circuits and Systems (TCAD), vol. 30, no.
4, 2011.
[5]
J. Cong et al., “Automated accelerator generation and opti-
mization with composable, parallel and pipeline architecture”,
in Proc. of the 55th Annual Design Automation Conf. (DAC),
ACM, Jun. 24–29, 2018.
[6]
Xilinx, Vivado Design Suite user guide high-level synthesis
UG902, 2017.
[7]
Intel, Intel FPGA SDK for OpenCL: Best practices guide, 2017.
[8]
R. Leißa et al., “AnyDSL: A partial evaluation framework for
programming high-performance libraries”, Proc. of the ACM
on Programming Languages (PACMPL), vol. 2, no. OOPSLA,
Nov. 4–9, 2018.
[9]
L.
-
N. Pouchet et al., “Polyhedral-based data reuse optimization
for configurable computing”, in Proc. of the ACM/SIGDA
international symposium on Field programmable gate arrays,
ACM, 2013.
[10]
R. Nane et al., “A survey and evaluation of FPGA high-level
synthesis tools”, IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, vol. 35, no. 10, 2015.
[11]
G. Martin and G. Smith, “High-level synthesis: Past, present,
and future”, IEEE Design & Test of Computers, vol. 26, no. 4,
2009.
[12]
D. F. Bacon et al., “FPGA programming for the masses”,
Communications of the ACM, vol. 56, no. 4, 2013.
[13]
S. A. Edwards, “The challenges of synthesizing hardware from
C-like languages”, IEEE Design & Test of Computers, vol. 23,
no. 5, 2006.
[14]
J. Sanguinetti, “A different view: Hardware synthesis from
SystemC is a maturing technology”, IEEE Design & Test of
Computers, vol. 23, no. 5, 2006.
[15]
D. Koeplinger et al., “Spatial: A language and compiler for
application accelerators”, in Proc. of the 39th ACM SIGPLAN
Conf. on Programming Language Design and Implementation
(PLDI), ACM, Jun. 18–22, 2018.
[16]
H. Eran et al., “Design patterns for code reuse in HLS packet
processing pipelines”, in 27th Annual Int’l Symp. on Field-
Programmable Custom Computing Machines (FCCM), IEEE,
2019.
[17]
J. S. da Silva et al., “Module-per-object: A human-driven
methodology for C++-based high-level synthesis design”, in
27th Annual Int’l Symp. on Field-Programmable Custom
Computing Machines (FCCM), IEEE, 2019.
[18]
D. Richmond et al., “Synthesizable higher-order functions for
C++”, Trans. on Computer-Aided Design of Integrated Circuits
and Systems, vol. 37, no. 11, 2018.
[19]
M. A. Özkan et al., “A highly efficient and comprehensive
image processing library for C++-based high-level synthesis”,
in Proc. of the 4th Int’l Workshop on FPGAs for Software
Programmers (FSP), VDE, 2017.
[20]
J. de Fine Licht et al., “Transformations of high-level synthesis
codes for high-performance computing”, The Computing Re-
search Repository (CoRR), 2018. arXiv: 1805.08288
[cs.DC]
.
[21]
G. Ofenbeck et al., “Spiral in Scala: Towards the systematic
construction of generators for performance libraries”, in Proc.
of the Int’l Conf. on Generative Programming: Concepts &
Experiences (GPCE), ACM, Oct. 27–28, 2013.
[22]
P. Milder et al., “Computer generation of hardware for linear
digital signal processing transforms”, ACM Trans. on Design
Automation of Electronic Systems (TODAES), vol. 17, no. 2,
2012.
[23]
J. Hegarty et al., “Darkroom: Compiling high-level image
processing code into hardware pipelines”, ACM Trans. on
Graphics (TOG), vol. 33, no. 4, 2014.
[24]
J. Hegarty et al., “Rigel: Flexible multi-rate image processing
hardware”, ACM Trans. on Graphics (TOG), vol. 35, no. 4,
2016.
[25]
J. Pu et al., “Programming heterogeneous systems from an
image processing DSL”, ACM Trans. on Architecture and Code
Optimization (TACO), vol. 14, no. 3, 2017.
[26]
J. Ragan-Kelley et al., “Halide: A language and compiler for
optimizing parallelism, locality, and recomputation in image
processing pipelines”, in Proc. of the Conf. on Programming
Language Design and Implementation (PLDI), ACM, Jun. 16–
19, 2013.
[27]
O. Reiche et al., “Generating FPGA-based image processing
accelerators with Hipacc”, in Proc. of the Int’l Conf. On
Computer Aided Design (ICCAD), IEEE, Nov. 13–16, 2017.
[28]
N. Chugh et al., “A DSL compiler for accelerating image
processing pipelines on FPGAs”, in Proc. of the Int’l Conf.
on Parallel Architecture and Compilation Techniques (PACT),
ACM, Sep. 11–15, 2016.
[29]
Y. Chi et al., “Soda: Stencil with optimized dataflow archi-
tecture”, in 2018 IEEE/ACM Int’l Conf. on Computer-Aided
Design (ICCAD), IEEE, 2018.
[30]
R. Stewart et al., “A dataflow IR for memory efficient
RIPL compilation to FPGAs”, in Proc. of the Int’l Conf. on
Algorithms and Architectures for Parallel Processing (ICA3PP),
Springer, Dec. 14–16, 2016.
[31]
M. Kristien et al., “High-level synthesis of functional patterns
with Lift”, in Proc. of the 6th ACM SIGPLAN Int’l Workshop on
Libraries, Languages and Compilers for Array Programming,
ARRAY@PLDI 2019, Phoenix, AZ, USA, June 22, 2019., 2019.
[32]
R. Baghdadi et al., “Tiramisu: A polyhedral compiler for
expressing fast and portable code”, in Proc. of the IEEE/ACM
Int’l Symp. on Code Generation and Optimization (CGO),
IEEE, Feb. 16–20, 2019.
[33]
E. Del Sozzo et al., “A unified backend for targeting FPGAs
from DSLs”, in Proc. of the 29th Annual IEEE Int’l Conf.
on Application-specific Systems, Architectures and Processors
(ASAP), IEEE, Jul. 10–12, 2018.
[34]
R. Leißa et al., “Shallow embedding of DSLs via online partial
evaluation”, in Proc. of the Int’l Conf. on Generative Program-
ming: Concepts & Experiences (GPCE), ACM, Oct. 26–27,
2015.
[35]
M. A. Özkan et al., “A journey into DSL design using
generative programming: FPGA mapping of image border
handling through refinement”, in Proc. of the 5th Int’l Workshop
on FPGAs for Software Programmers (FSP), VDE, 2018.
[36]
N. D. Jones et al., Partial evaluation and automatic program
generation. Peter Sestoft, 1993.
[37]
Y. Futamura, “Partial computation of programs”, in Proc. of the
RIMS Symposia on Software Science and Engineering, 1982.
[38]
C. Consel, “New insights into partial evaluation: The SCHISM
experiment”, in Proc. of the 2nd European Symp. on Program-
ming (ESOP), Springer, Mar. 21–24, 1988.
[39]
M. Schmid et al., “Loop coarsening in C-based high-level
synthesis”, in Proc. of the 26th Annual IEEE Int’l Conf.
on Application-specific Systems, Architectures and Processors
(ASAP), IEEE, 2015.
[40]
M. A. Özkan et al., “Hardware design and analysis of efficient
loop coarsening and border handling for image processing”,
in Proc. of the Int’l Conf. on Application-specific Systems,
Architectures and Processors (ASAP), IEEE, Jul. 10–12, 2017.
[41]
G. Stitt et al., “Scalable window generation for the Intel
Broadwell+Arria 10 and high-bandwidth FPGA systems”, in
Proc. of the ACM/SIGDA Int’l Symp. on Field-Programmable
Gate Arrays (FPGA), ACM, Feb. 25–27, 2018.
[42]
Y.-k. Choi et al., “A quantitative analysis on microarchitectures
of modern CPU-FPGA platforms”, in Proc. of the 53rd Annual
Design Automation Conf. (DAC), ACM, Jun. 5–9, 2016.
[43]
Core evaluation license agreement, version 2014.06, Xilinx,
Inc., Jun. 2014. [Online]. Available: https://www.xilinx.com/
products/intellectual-property/license/core-evaluation-license-
agreement.html.
[44]
Intel program license subscription agreement, version Rev.
10/2009, Intel Corporation, Oct. 2009. [Online]. Available:
https://www.intel.com/content/www/us/en/programmable/
downloads/software/license/lic-prog_lic.html.
[45]
M. A. Özkan et al., “FPGA-based accelerator design from
a domain-specific language”, in Proc. of the 26th Int’l Conf.
on Field-Programmable Logic and Applications (FPL), IEEE,
Aug. 29–Sep. 2, 2016.