HBM Connect: High-Performance HLS Interconnect
for FPGA HBM
Young-kyu Choi, Yuze Chi, Weikang Qiao, Nikola Samardzic, and Jason Cong
Computer Science Department, University of California, Los Angeles
{ykchoi,chiyuze}@cs.ucla.edu,{wkqiao2015,nikola.s}@ucla.edu,cong@cs.ucla.edu
ABSTRACT
With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bound applications to benefit from FPGA acceleration. However, fully utilizing the available bandwidth may not be an easy task. If an application requires multiple processing elements to access multiple HBM channels, we observed a significant drop in the effective bandwidth. The existing high-level synthesis (HLS) programming environment has limitations in producing an efficient communication architecture. In order to solve this problem, we propose HBM Connect, a high-performance customized interconnect for FPGA HBM boards. Novel HLS-based optimization techniques are introduced to increase the throughput of AXI bus masters and switching elements. We also present a high-performance customized crossbar that may replace the built-in crossbar. The effectiveness of HBM Connect is demonstrated using Xilinx's Alveo U280 HBM board. Based on bucket sort and merge sort case studies, we explore several design spaces and find the design point with the best resource-performance trade-off. The result shows that HBM Connect improves the resource-performance metrics by 6.5X–211X.
KEYWORDS
High Bandwidth Memory; high-level synthesis; eld-programmable
gate array; on-chip network; performance optimization
ACM Reference Format:
Young-kyu Choi, Yuze Chi, Weikang Qiao, Nikola Samardzic, and Jason
Cong. 2021. HBM Connect: High-Performance HLS Interconnect for FPGA
HBM. In Proceedings of the 2021 ACM/SIGDA International Symposium on
Field Programmable Gate Arrays (FPGA ’21), February 28-March 2, 2021,
Virtual Event, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.
1145/3431920.3439301
1 INTRODUCTION
Although eld-programmable gate array (FPGA) is known to pro-
vide a high-performance and energy-ecient solution for many
applications, there is one class of applications where FPGA is gen-
erally known to be less competitive: memory-bound applications.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
FPGA ’21, February 28-March 2, 2021, Virtual Event, USA
©2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8218-2/21/02. . . $15.00
https://doi.org/10.1145/3431920.3439301
In a recent study [8], the authors report that GPUs typically outperform FPGAs in applications that require high external memory bandwidth. The Virtex-7 690T FPGA board used for the experiment reportedly has only 13 GB/s peak DRAM bandwidth, which is much smaller than the 290 GB/s bandwidth of the Tesla K40 GPU board used in the study (even though the two boards are based on the same 28 nm technology). This result is consistent with comparative studies for earlier generations of FPGAs and GPUs [9, 10]—FPGAs traditionally were at a disadvantage compared to GPUs for applications with a low reuse rate. The FPGA DRAM bandwidth was also lower than that of CPUs—the Sandy Bridge E5-2670 (32 nm, a similar generation as the Virtex-7 in [8]) has a peak bandwidth of 42 GB/s [21].
But with the recent emergence of High Bandwidth Memory 2 (HBM2) [15] FPGA boards, there is a good chance that future FPGAs can compete with GPUs to achieve higher performance in memory-bound applications. HBM benchmarking works [19, 27] report that Xilinx's Alveo U280 [28] (two HBM2 stacks) provides an HBM bandwidth of 422–425 GB/s, which approaches that of Nvidia's Titan V GPU [23] (650 GB/s, three HBM2 stacks). Similar numbers are reported for Intel's Stratix 10 MX [13] as well. Since FPGAs already have an advantage over GPUs in terms of their custom datapaths and custom data types [10, 22], enhancing external memory bandwidth with HBM could allow FPGAs to accelerate a wider range of applications.
The large external memory bandwidth of HBM originates from
multiple independent HBM channels (e.g., Fig. 1). To take full advantage of this architecture, we need to determine the most efficient way to transfer data from multiple HBM channels to multiple processing elements (PEs). It is worth noting that the Convey HC-1ex platform [2] also has multiple (64) DRAM channels like the FPGA HBM boards. But unlike Convey HC-1ex PEs, which issue individual FIFO requests of 64b data, HBM PEs are connected to a 512b AXI bus interface. Thus, utilizing the bus burst access feature has a large impact on the performance of FPGA HBM boards. Also, the Convey HC-1ex has a pre-synthesized full crossbar between PEs and DRAM, but FPGA HBM boards require programmers to customize the interconnect.
Table 1: Eective bandwidth of memory-bound applications
on Alveo U280 using Vivado HLS and Vitis tools
Appli- PC KClk EBW EBW/PC
cation # (MHz) (GB/s) (GB/s)
MV Mult 16 300 211 13.2
Stencil 16 293 206 12.9
Bucket sort 16 277 65 4.1
Merge sort 16 196 9.4 0.59
In order to verify that we can achieve high performance on an
FPGA HBM board, we have implemented several memory-bound
Figure 1: Alveo U280 Architecture
applications on Alveo U280 (Table 1). We were not able to complete
the routing for all 32 channels, so we used the next largest power-of-two number of HBM pseudo channels (PCs), which is 16. The kernels are written in C (Xilinx Vivado HLS [31]) for ease of programming and a faster development cycle [17]. For dense matrix-vector (MV) multiplication and stencil, the effective bandwidth per PC is similar to the board's sequential access bandwidth (Section 4.1.1). Both applications can evenly distribute the workload among the available HBM PCs, and their long sequential memory access pattern allows a single PE to fully saturate an HBM PC's available bandwidth.
However, the eective bandwidth is far lower for bucket and
merge sort. In bucket sort, a PE distributes keys to multiple HBM
PCs (one HBM PC corresponds to one bucket). In merge sort, a
PE collects sorted keys from multiple HBM PCs. Such an oper-
ation is conducted in all PEs—thus, we need to perform all PEs
to all PCs communication. Alveo U280 provides an area-ecient
built-in crossbar to facilitate this communication pattern. But, as
will be explained in Section 6.1, enabling external memory burst
access to multiple PCs in the current high-level synthesis (HLS)
programming environment is dicult. Instantiating a burst buer
is a possible option, but we will show this leads to high routing
complexity and large BRAM consumption (details to be presented in
Section 6.2). Also, shared links among the built-in switches (called
lateral connections) become a bottleneck that limits the eective
bandwidth (details to be presented in Section 4.2).
This paper proposes HBM Connect, a high-performance customized interconnect for FPGA HBM boards. We first evaluate the performance of the Alveo U280 built-in crossbar and analyze the cause of bandwidth degradation when PEs access several PCs. Next, we propose a novel HLS buffering scheme that increases the effective bandwidth of the built-in crossbar and consumes fewer BRAMs. We also present a high-performance custom crossbar architecture to remove the performance bottleneck from the lateral connections. As will be demonstrated in the experimental result section, we found that it is sometimes more efficient to completely ignore the built-in crossbar and only utilize our proposed customized crossbar architecture. The proposed design is fully compatible with Vivado HLS C syntax and does not require RTL coding.
The contributions of this paper can be summarized as follows:
• A BRAM-efficient HLS buffering scheme that increases the AXI burst length and the effective bandwidth when PEs access several PCs.
• An HLS-based solution that increases the throughput of a 2×2 switching element of the customized crossbar.
• A design space exploration of the customized crossbar and AXI burst buffer that finds the best area-performance trade-off in an HBM many-to-many unicast environment.
• An evaluation of the built-in crossbar on Alveo U280 and an analysis of its performance.
The scope of this paper is currently limited to Xilinx’s Alveo
U280 board, but we plan to extend it to other Xilinx and Intel HBM
boards in the future.
2 BACKGROUND
2.1 High Bandwidth Memory 2
High Bandwidth Memory [15] is a 3D-stacked DRAM designed to
provide a high memory bandwidth. There are 2~8 HBM dies and
1024 data I/Os in each stack. The HBM dies are connected to a base
logic die using Through Silicon Via (TSV) technology. The base
logic die connects to FPGA/GPU/CPU dies through an interposer.
The maximum I/O data rate is improved from 1 Gbps in HBM1 to
2 Gbps in HBM2. This is partially enabled by the use of two pseudo
channels (PCs) per physical channel to hide the latency [13, 16].
Sixteen PCs exist per stack, and we can access PCs independently.
2.2 HBM2 FPGA Platforms and Built-In Crossbar
Intel and Xilinx have recently released HBM2 FPGA boards: Xilinx’s
Alveo U50 [29], U280 [28], and Intel's Stratix 10 MX [13]. These
boards consist of an FPGA and two HBM2 stacks (8 HBM2 dies). The
FPGA and the HBM2 dies are connected through 32 independent
PCs. Each PC has 256MB of capacity (8 GB in total).
In Stratix 10 MX (early-silicon version), each PC is connected to the FPGA PHY layer through 64 data I/Os that operate at 800 MHz (double data rate). The data communication between the kernels (user logic) and the HBM2 memory is managed by the HBM controller (400 MHz). AXI4 [1] and Avalon [14] interfaces (both with 256b data bitwidth) are used to communicate with the kernel side. The clock frequency of the kernels may vary (capped at 450 MHz) depending on their complexity. Since the frequency of the HBM controllers is fixed at 400 MHz, rate matching (RM) FIFOs are inserted between the kernels and the memory controllers.
In Xilinx Alveo U280, the FPGA is composed of three super logic
regions (SLRs). The overall architecture of U280 is shown in Fig. 1.
The FPGA connects to the HBM2 stacks on the bottom SLR (SLR0).
The 64b data I/Os to the HBM operate at the frequency of 900 MHz
(double data rate). The data transaction is managed by the HBM
memory controllers (MCs). A MC communicates with the user logic
Figure 2: Bucket sort application
(kernel) via a 256b AXI3 slave interface running at 450 MHz [30].
The user logic has a 512b AXI3 master interface, and the clock
frequency of the user logic is capped at 300 MHz. The ideal memory
bandwidth is 460 GB/s (= 256b * 32PCs * 450MHz = 64b * 32PCs * 2
* 900MHz).
Four user logic AXI masters can directly communicate with any
of the four adjacent PC AXI slaves through a fully-connected unit
switch. For example, the rst AXI master (M0) has direct connec-
tions to PCs 0–4 (Fig. 1). If an AXI master needs to access non-
adjacent PCs, it can use the lateral connections among the unit
switches—but the network contention may limit the eective band-
width [
30
]. For example in Fig. 1, M16 and M17 AXI masters and
the lower lateral AXI master may compete with each other to use
the upper lateral AXI slave for communication with PC 0–15. Each
AXI master is connected to four PC AXI slaves and two lateral
connections (see M5 in Fig. 1).
The thermal design power (TDP) of Alveo U280 is 200W. Note
that Alveo U280 also has traditional DDR DRAM—but we decided
not to utilize the traditional DRAM because the purpose of this
paper is to evaluate the HBM memory. We refer readers to the
work in [19, 27] for a comparison of HBM and DDR memory and also the work in [20] for optimization case studies on heterogeneous external memory architectures.
2.3 Case Studies
In order to quantify the resource consumption and the performance
of HBM interconnect when PEs access multiple PCs, we select two
applications for case studies: bucket sort and merge sort. A bucket
sort PE writes to multiple PCs, and a merge sort PE reads from
multiple PCs. These applications also have an ideal characteristic of
accessing each PC in a sequential fashion—allowing us to analyze
the resource-performance trade-o more clearly.
2.3.1 Bucket Sort. Arrays of unsorted keys are stored in input
PCs. A bucket PE sequentially reads the unsorted keys from each
input PC and classify them into dierent output buckets based on
the value of the keys. Each bucket is stored in a single HBM PC,
and this allows a second stage of sorting (e.g., with merge sort) to
work independently on each bucket. Several bucket PEs may send
their keys to the same PC—thus, all-to-all unicast communication
Figure 3: Merge sort application
architecture is needed for write as shown in Fig. 2. Since the keys
within a bucket does not need to be in a specic order, we combine
all the buckets in the same PC and write the keys to the same output
memory space.
Since our primary goal is to analyze and explore the HBM PE-PC
interconnect architecture, we make several simplications on the
sorter itself. We assume a key is 512b long. We also assume that the
distribution of keys is already known, and thus we preset a splitter
value that divides the keys into equal-sized buckets. Also, we do not
implement the second-stage intra-bucket sorter—the reader may
refer to [3, 12, 25, 26] for high-performance sorters that utilize the external memory.
We limit the number of used PCs to 16 for two reasons. First,
we were not able to utilize all 32 PCs due to routing congestion
(more details in Section 4.1.1). Second, we wanted to simplify the
architecture by keeping the number of used PCs to the power of
two.
2.3.2 Merge Sort. In contrast to the bucket sort application which
sends the data to a PC bucket before sorting within a PC bucket, we
can also sort the data within a PC rst and then collect and merge
the data among dierent PCs. Fig. 3 demonstrates this process. The
intra-PC sorted data is sent to one of the PEs depending on the
range of its value, and each PE performs merge sort on the incoming
data. Each PE reads from 16 input PCs and writes to one PC. This
sorting process is a hybrid of bucketing and merge sort—but for
convenience, we will simply refer to this application as merge sort
in the rest of this paper.
This application requires a many-to-many unicast architecture
between PCs and PEs for data read, and a one-to-one connection
is needed for data write. It performs both reading and writing in a
sequential address. We make a similar simplication as we did for
the bucket sort—we assume 512b key and equal key distribution,
and we omit the rst-stage intra-PC sorter.
Figure 4: Conventional HLS coding style to send keys to multiple output PCs (buckets) using the built-in crossbar
2.4 Conventional HLS Programming for Bucket Sort
We program the kernel and host in C using Xilinx’s Vitis [
33
] and
Vivado HLS [
31
] tools. We employ dataow programming style
(C functions executing in parallel and communicating through
streaming FIFOs) for kernels to achieve high throughput with small
BRAM consumption [31].
Alveo U280 and Vivado HLS oer a particular coding style to
access multiple PCs. An example HLS code for bucket sort is shown
in Fig. 4. The output write function
key_write
reads an input data
and data’s bucket ID (line 15), and it writes the data to the function
argument that corresponds to the bucket ID (lines 17 to 20). We
can specify the output PC (bucket ID) of the function arguments
in Makele (lines 22 to 25). Notice that a common bundle (
M0
) was
assigned to all function arguments (lines 2 to 5). A bundle is a
Vivado HLS concept that corresponds to an AXI master. That is,
key_write
uses a single AXI master
M0
and the built-in crossbar
to distribute the keys to all PCs.
Although easy-to-code and area-ecient, this conventional HLS
coding style has two problems. First, while accessing multiple PCs
from a single AXI master, data from dierent AXI masters will
frequently share the lateral connections and reduce the eective
bandwidth (more details in Section 4.2). Second, the bucket ID of a
key read in line 15 may dier in the next iteration of the while loop.
Thus, Vivado HLS will set the AXI burst length to one for each
key write. This also degrades the HBM eective bandwidth (more
details in Section 6.1). In the following sections, we will examine
solutions to these problems.
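Since Fig. 4 is not reproduced here, the following is a minimal sketch of the conventional coding style it describes, under stated assumptions: the key type, stream names, and a reduced count of four buckets are hypothetical simplifications; only the single shared bundle M0 and the per-key destination switch mirror the description above.

```cpp
// Hypothetical sketch of the conventional coding style described above.
// All bucket arguments share one AXI master (bundle M0); the destination
// PC is chosen per key, so Vivado HLS cannot infer burst access.
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<512> key_t;

void key_write(hls::stream<key_t>& key_fifo,
               hls::stream<int>&   bid_fifo,     // bucket ID per key
               key_t* bucket0, key_t* bucket1,
               key_t* bucket2, key_t* bucket3,
               int num_keys) {
#pragma HLS INTERFACE m_axi port=bucket0 bundle=M0
#pragma HLS INTERFACE m_axi port=bucket1 bundle=M0
#pragma HLS INTERFACE m_axi port=bucket2 bundle=M0
#pragma HLS INTERFACE m_axi port=bucket3 bundle=M0
  int wptr[4] = {0, 0, 0, 0};                    // next free slot per bucket
  for (int i = 0; i < num_keys; i++) {
#pragma HLS PIPELINE II=1
    key_t key = key_fifo.read();
    int bid = bid_fifo.read();                   // destination bucket (PC)
    switch (bid) {                               // burst length degenerates to 1
      case 0: bucket0[wptr[0]++] = key; break;
      case 1: bucket1[wptr[1]++] = key; break;
      case 2: bucket2[wptr[2]++] = key; break;
      default: bucket3[wptr[3]++] = key; break;
    }
  }
}
```

The mapping of each bucket argument to a specific HBM PC would then be given to the linker (the "Makefile" settings referred to in the text, typically v++ --sp options).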
Figure 5: Overall architecture of HBM Connect and the explored design space
3 DESIGN SPACE AND PROBLEM FORMULATION
Let us denote a PE that performs computation as $PE_i$ ($0 \le i < I$) and an HBM PC as $PC_j$ ($0 \le j < J$). A PE is a coarse-grain computational unit composed of a single function and may contain multiple fine-grain computational units inside its function. $PE_i$ transfers $data_{ij}$ to $PC_j$. If $PE_i$ makes no communication with $PC_j$, $data_{ij}$ equals 0. We denote the averaged effective bandwidth of transferring $data_{ij}$ as $BW_{ij}$. The total effective bandwidth of the system $BW$ is equal to the summation of $BW_{ij}$ over all $(i, j)$.
We make the following assumptions. First, the kernel is written in a dataflow style, where functions execute in parallel and communicate through streaming FIFOs. Second, we read or write $data_{ij}$ from/to $PC_j$ at a sequential address (see Section 2.3 for examples). Third, PEs read and write data every cycle (II=1) if their input/output FIFOs are not empty or full. Fourth, the pipeline depth of PEs is negligible compared to the total execution time $t_{TOT}$.
In Fig. 5, we show the design space of HBM Connect. It consists of a custom crossbar, an AXI burst buffer, and the built-in AXI crossbar.
The purpose of the custom crossbar is to partly replace the functionality of the built-in AXI crossbar and increase the effective bandwidth. We employ a multi-stage butterfly network for a reason we will explain in Section 5.1. As a design space, we may use $CXBAR = 0, 1, 2, \ldots, \log(16)$ stages of custom crossbar.
An AXI burst buffer is needed to enable burst access in the built-in crossbar (more details in Section 6.1). The design space of the AXI buffer size is $ABUF = 0, 1, 2, 4, \ldots, 128, 256$.
The aim of this work is to nd a
𝑃𝐸𝑖
-
𝑃𝐶 𝑗
interconnect architec-
ture (among all
𝑖
’s and
𝑗
’s) that has a good trade-o between the data
transmission time and the interconnect resource usage. For quanti-
tative evaluation, we use metrics that are similar to the inverse of
the classic area-delay-square product (
𝐴𝑇2
) metric. Specically, we
divide the squared value of the eective HBM bandwidth by LUT
(
𝐵𝑊 2/𝐿𝑈𝑇
), FF (
𝐵𝑊 2/𝐹𝐹
), or BRAM (
𝐵𝑊 2/𝐵𝑅𝐴𝑀
). These metrics
intuitively match a typical optimization goal of maximizing the
eective bandwidth while using as small resource as possible. The
eective bandwidth term is squared with an assumption that the
HBM boards will be more popular for memory-bound applications—
that is, the bandwidth is a more important criteria than the resource
consumption in the HBM boards.
The problem we solve in this paper is formulated as:
Given
𝑑𝑎𝑡𝑎𝑖 𝑗
, nd a design space (
𝐶𝑋 𝐵𝐴𝑅
,
𝐴𝐵𝑈 𝐹
) that minimizes
𝐵𝑊 2/𝐿𝑈 𝑇 .
The metric $BW^2/LUT$ in the formulation may be replaced with $BW^2/FF$ or $BW^2/BRAM$. The choice among the three metrics will depend upon the bottleneck resource of the PEs.
We will explain the details of the HBM Connect major compo-
nents in the following sections. Section 4 provides an analysis of the
built-in crossbar. The architecture and optimization of the custom
crossbar is presented in Section 5. The HLS-based optimization of
the AXI burst buer will be described in Section 6.
4 BUILT-IN CROSSBAR AND HBM
This section provides an analysis of the built-in AXI crossbar and
HBM. The analysis is used to estimate the eective bandwidth
of the built-in interconnect system and guide the design space
exploration. See [
4
] for more details on our memory access analysis.
We also refer readers to the related HBM benchmarking studies in
[18, 19, 27].
4.1 Single PC Characteristics
We measure the eective bandwidth when a PE uses a single AXI
master to connect to a single HBM PC. We assume that the PE is
designed with Vivado HLS.
4.1.1 Maximum Bandwidth. The maximum memory bandwidth of
the HBM boards is measured with a long (64MB) sequential access
pattern. The experiment performs a simple data copy with read
& write, read only, and write only operations. We use the Alveo’s
default user logic data bitwidth of 512b.
A related RTL-based HBM benchmarking tool named Shuhai [27] assumes that the total effective bandwidth can be estimated by multiplying the bandwidth of a single PC by the total number of PCs. In practice, however, we found that it is difficult to utilize all PCs. PCs 30 and 31 partially overlap with the PCIe static region, and Vitis was not able to complete the routing even for a simple traffic generator for PCs 30 and 31. The routing is further complicated by the location of the HBM MCs—they are placed on the bottom SLR (SLR0), and the user logic of memory-bound applications tends to get placed near the bottom. For this reason, we used 16 PCs (the nearest power-of-two number of usable PCs) for evaluation throughout this paper.
Table 2: Maximum eective per-PC memory bandwidth
with sequential access pattern on Alveo U280 (GB/s)
Read & Write Read only Write only Ideal
12.9 13.0 13.1 14.4
Table 2 shows the measurement result. The eective bandwidth
per PC is similar to 13.3 GB/s measured in RTL-based Shuhai [
27
].
The result demonstrates that we can obtain about 90% of the ideal
bandwidth. The bandwidth can be saturated with read-only or
write-only access.
Figure 6: Effective memory bandwidth per PC (a single AXI master accesses a single PC) with varying sequential data access size. (a) Read BW (b) Write BW
4.1.2 Short Sequential Access Bandwidth. In most practical applica-
tions, it is unlikely that we can fetch such a long (64MB) sequential
data. The bucket PE, for example, needs to write to multiple PCs,
and there is a constraint on the size of write buer for each PC (more
details in Section 6). Thus, each write must be limited in length. A
similar constraint exists on the merge sort PE’s read length.
HLS applications require several cycles of delay when making an external memory access. We measure the memory latency $LAT$ using the method described in [5, 6] (Table 3).
Table 3: Read/write memory latency
Read lat Write lat
Total 289 ns 151 ns
Let us divide $data_{ij}$ into $CNUM_{ij}$ data chunks of size $BLEN_{ij}$:

$$data_{ij} = CNUM_{ij} \times BLEN_{ij} \qquad (1)$$

The time $t_{BLEN_{ij}}$ taken to complete one burst transaction of length $BLEN_{ij}$ to an HBM PC can be modeled as [7, 24]:

$$t_{BLEN_{ij}} = BLEN_{ij}/BW_{max} + LAT \qquad (2)$$

where $BW_{max}$ is the maximum effective bandwidth (Table 2) of one PC, and $LAT$ is the memory latency (Table 3). Then the effective bandwidth when a single AXI master accesses a PC is:

$$BW_{ij} = BLEN_{ij}/t_{BLEN_{ij}} \qquad (3)$$
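As a hedged numerical illustration of Eqs. 2 and 3 (plugging in the measured write values from Tables 2 and 3, and assuming a 512b data width so that a 32-beat burst is 2 KB):

$$t_{BLEN} = \frac{2\,\text{KB}}{13.1\,\text{GB/s}} + 151\,\text{ns} \approx 156\,\text{ns} + 151\,\text{ns} = 307\,\text{ns}, \qquad BW = \frac{2\,\text{KB}}{307\,\text{ns}} \approx 6.7\,\text{GB/s}$$

This is consistent with the observation in Section 6.2 that a burst of around 32 is needed to reach roughly half of the per-PC maximum write bandwidth.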
Fig. 6 shows the comparison between the estimated eective
bandwidth and the measured eective bandwidth after varying the
length (
𝐵𝐿𝐸 𝑁𝑖 𝑗
) of sequential data access on a single PC. Note that
the trend of the eective bandwidth in this gure resembles that of
other non-HBM, DRAM-based FPGA platforms [5, 6].
4.2 Many-to-Many Unicast Characteristics
In this section, we consider the case when multiple AXI masters access multiple PCs in round-robin. Since each AXI master accesses only one PC at a time, we will refer to this access pattern as many-to-many unicast. We vary the number of PCs accessed by the AXI masters. For example, in the many-to-many write unicast test with an (AXI masters × PCs) = (2 × 2) configuration, AXI master M0 writes to PC0/PC1, M1 writes to PC0/PC1, M2 writes to PC2/PC3, M3 writes to PC2/PC3, and so on. The AXI masters access different PCs in round-robin.
Figure 7: Many-to-many unicast eective memory band-
width among 2~16 PCs (a) Read BW (b) Write BW
Figure 8: Maximum bandwidth ($BW_{max}$) for many-to-many unicast on Alveo U280 (GB/s). (a) Read BW (b) Write BW
Another example of this would be the many-to-many read unicast test with an (AXI masters × PCs) = (4 × 4) configuration. All of M0, M1, M2, and M3 read from PC0, PC1, PC2, and PC3 in round-robin. The AXI masters are not synchronized, and it is possible that some masters will idle waiting for other masters to finish their transaction.
Fig. 7 shows the eective bandwidth after varying the burst
length and the number of PCs accessed by AXI masters. The write
bandwidth (Fig. 7(b)) is generally higher than the read bandwidth
(Fig. 7(a)) for the same burst length because the write memory
latency is smaller than the read memory latency (Table 3). Shorter
memory latency decreases the time needed per transaction (Eq. 2).
For 16
×
16 unicast, which is the conguration used in bucket
sort and merge sort, the lateral connections become the bottleneck.
For example, M0 needs to cross three lateral connections of unit
switches to reach PC12–PC15. Multiple crossings severly reduces
the overall eective bandwidth.
Fig. 8 summarizes the maximum bandwidth observed in Fig. 7.
The reduction in the maximum bandwidth becomes more severe as
more AXI masters contend with each other to access the same PC.
We can predict the eective bandwidth of many-to-many unicast
by replacing the
𝐵𝑊𝑚𝑎𝑥
in Eq. 2 with the maximum many-to-many
unicast bandwidth in Fig. 8. The maximum many-to-many unicast
bandwidth can be reasonably well estimated (
𝑅2
=0.95 ~ 0.96) by
tting the experimentally obtained values with a second-order
polynomial. The tting result is shown in Fig. 8.
5 CUSTOM CROSSBAR
5.1 Network Topology
As demonstrated in Section 4.2, it is not possible to reach the maxi-
mum bandwidth when an AXI master tries to access multiple PCs.
To reduce the contention, we add a custom crossbar.
Figure 9: The buttery custom crossbar architecture (when
𝐶𝑋 𝐵𝐴𝑅 =4)
We found that Vitis was unable to nish routing when we tried
to make a fully connected crossbar. Thus, we decided to employ a
multi-stage network. To further simplify the routing process, we
compose the network with 2×2 switching elements.
There are several multi-stage network topologies. Examples include Omega, Clos, Benes, and butterfly networks. Among them, we chose the butterfly network shown in Fig. 9. We chose this topology because the butterfly network allows sending data across many hops of AXI masters with just a few stages. For example, let us assume we deploy only the first stage of the butterfly network in Fig. 9. Data sent from PE0 to PC8–PC15 can avoid going through two or three lateral connections with just a single switch SW1_0. The same benefit applies to the data sent from PE8 to PC0–PC7. We can achieve a good trade-off between the LUT consumption and the effective bandwidth due to this characteristic. The butterfly network reduces its hop distance at the later stages of the custom crossbar. Note that the performance and the resource usage are similar to those of an Omega network if all four stages are used.
Adding more custom stages will reduce the amount of trac
crossing the lateral connection at the cost of more LUT/FF usage. If
we implement two stages of buttery as in Fig. 5, each AXI master
has to cross a single lateral connection. If we construct all four
stages as in Fig. 9, the AXI master in the built-in crossbar only
accesses a single PC.
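A minimal sketch of the per-stage routing decision under an assumed bit-wise addressing scheme (consistent with the SW1_0 example above, where the first stage separates PC0–PC7 from PC8–PC15; the paper does not spell out the switch-level routing rule, so this helper is hypothetical):

```cpp
// Hypothetical routing rule for a 16-endpoint butterfly: stage s (0 = first,
// valid range 0..3) picks the 2x2 switch output port from one bit of the
// 4-bit destination PC ID, MSB first, so stage 0 routes between the
// PC0-PC7 and PC8-PC15 halves.
#include <ap_int.h>

inline ap_uint<1> butterfly_out_port(ap_uint<4> dest_pc, int stage) {
  return dest_pc[3 - stage];   // bit 3 at stage 0, bit 2 at stage 1, ...
}
```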
5.2 Mux-Demux Switch
A 2×2 switching element in a multi-stage network reads two input data and writes them to output ports based on the destination PC. A typical 2×2 switch can send both input data to the output if the data's output ports are different. If they are the same, one of them has to stall until the next cycle. Assuming the 2×2 switch has an initiation interval (II) of 1 and the output port of the input data is random, the averaged number of output data per cycle is 1.5.
We propose an HLS-based switch architecture named the mux-demux switch to increase the throughput. A mux-demux switch decomposes a 2×2 switch into simple operations to be performed in parallel. Next, we insert buffers between the basic operators so that there is a higher chance that some data will exist to be demuxed/muxed. We implement the buffers as FIFOs for a simpler coding style.
Figure 10: Architecture of mux-demux switch
Fig. 10 shows the architecture of the mux-demux switch. After reading data from input0 and input1, the two demux modules independently classify the data based on the destination PC. Then, instead of directly comparing the input data of the two demux modules, we store them in separate buffers. In parallel, the two mux modules each read data from two buffers in round-robin and send the data to their output ports.
As long as the consecutive length of data intended for a particular output port is less than the buffer size, this switch can almost produce two output elements per cycle. In essence, this architecture trades off buffer space for performance.
We estimate the performance of the mux-demux switch with a Markov chain model (MCM), where the number of remaining buffer slots corresponds to a single MCM state. The transition probability between MCM states is modeled from the observation that the demux module will send data to one of the buffers with 50% probability every cycle for random input (thus reducing the buffer space by one) and that the mux module will read from each buffer every two cycles in round-robin (thus increasing the buffer space by one). The mux module does not produce an output if the buffer is in the "empty" MCM state. The MCM-estimated throughput with various buffer sizes is provided in the last row of Table 4.
Table 4: Resource consumption (post-PnR) and throughput (experimental and estimated) comparison of a typical 2×2 switch and the proposed 2×2 mux-demux switch in a stand-alone Vivado HLS test

               Typ SW   Mux-Demux SW
Buffer size    -        4      8      16
LUT            3184     3732   3738   3748
FF             4135     2118   2124   2130
Thr (Exp.)     1.49     1.74   1.86   1.93
Thr (Est.)     1.5      1.74   1.88   1.94
We measure the resource consumption and averaged throughput after generating random input in a stand-alone Vivado HLS test. We compare the result with a typical 2×2 HLS switch that produces two output data only when its two input data's destination ports are different. One might expect that a mux-demux switch would consume much more resource than a typical switch because it requires 4 additional buffers (implemented as FIFOs). But the result (Table 4) indicates that the post-PnR resource consumption is similar. This is due to the complex control circuit of the typical switch, which compares the two inputs for a destination port conflict on every cycle (II=1). A mux-demux switch, on the other hand, decomposes this comparison into 4 simpler operations. Thus, the resource consumption is still comparable. In terms of throughput, a mux-demux switch clearly outperforms a typical switch.
We x the buer size of the mux-demux switch to 16 in HBM
Connect, because it gives the best throughput-resource trade-o.
Table 4 conrms that the experimental throughput well matches
the throughput estimated by the MCM.
6 AXI BURST BUFFER
6.1 Direct Access from PE to AXI Master
In bucket sort, PEs distribute the keys to output PCs based on their values (each PC corresponds to a bucket). Since each AXI master can send data to any PC using the built-in crossbar (Section 2.2), we first make a one-to-one connection between a bucket PE and an AXI master. Then we utilize the built-in AXI crossbar to perform the key distribution. We use the coding style in lines 17 to 20 of Fig. 4 to directly access different PCs.
With this direct access coding style, however, we were only able to achieve 59 GB/s among 16 PCs (with two stages of custom crossbar). We obtain such a low effective bandwidth because there is no guarantee that two consecutive keys from an input PC will be sent to the same bucket (output PC). Existing HLS tools do not automatically hold the data in a buffer for burst AXI access to each HBM PC. Thus, the AXI burst length is set to one. Non-burst access to an HBM PC severely degrades the effective bandwidth (Fig. 6 and Fig. 7). A similar problem occurs when making direct accesses for the read many-to-many unicast in merge sort.
6.2 FIFO-Based Burst Buer
An intuitive solution to this problem is to utilize a FIFO-based AXI
burst buer for each PC [
4
]. Based on data’s output PC information,
data is sent to a FIFO burst buer reserved for that PC. Since all the
data in a particular burst buer is guaranteed to be sent to a single
HBM PC, the AXI bus can now be accessed in a burst mode. We
may choose to enlarge the burst buer size to increase the eective
bandwidth.
However, we found that this approach hinders with eective
usage of FPGA on-chip memory resource. It is ideal to use BRAM
as the burst buer because BRAM is a dedicated memory resource
with higher memory density than LUT (LUT might be more ecient
as a compute resource). But BRAM has a minimum depth of 512
[
32
]. As was shown in Fig. 6, we need a burst access of around
32 (2KB) to reach a half of the maximum bandwidth and saturate
the HBM bandwidth with simultaneous memory read and write.
Setting the burst buer size to 32 will under-utilize the minimum
BRAM depth (512). Table 5 conrms the high resource usage of the
FIFO-based burst buers.
Another problem is that this architecture scatters data to multiple
FIFOs and again gathers data to a single AXI master. This further
complicates the PnR process. Due to the high resource usage and
the routing complexity, we were not able to route the designs with
FIFO-based burst buer (Table 5).
Table 5: Eective bandwidth and FPGA resource consump-
tion of bucket sort with dierent AXI burst buer schemes
(𝐶𝑋 𝐵𝐴𝑅 =2)
Buf Bur CX FPGA Resource KClk EBW
Sch Len bar LUT/FF/DSP/BRAM (MHz) (GB/s)
Direct access 2 126K / 238K / 0 / 248 178 56
FIFO 16 2 195K / 335K / 0 / 728 PnR failed
Burst 32 2 193K / 335K / 0 / 728 PnR failed
Buf 64 2 195K / 335K / 0 / 728 PnR failed
HLS 16 2 134K / 233K / 0 / 368 283 116
Virt 32 2 134K / 233K / 0 / 368 286 185
Buf 64 2 134K / 233K / 0 / 368 300 180
Figure 11: HLS virtual buer architecture for 8 PCs
Figure 12: HLS code for HLS virtual buer (for write)
6.3 HLS Virtual Buer
In this section, we propose an HLS-based solution to solve all of
the aforementioned problems: the burst access problem, the BRAM
under-utilization problem, and the FIFO scatter/gather problem.
The idea is to share the BRAM as a burst buer for many dierent
Figure 13: Abstracted HLS virtual buer syntax (for read)
target PCs. But none of current HLS tools oer such functionality.
Thus, we propose a new HLS-based buering scheme called HLS
virtual buer (HVB). HVB allows a single physical FIFO to be shared
among multiple virtual channels [
11
] in HLS. As a result, we can
have a higher utilization of BRAM depth as the FIFOs for many
dierent PCs. Another major advantage is that the HVB physically
occupies one buer space—we can avoid scattering/gathering data
from multiple FIFOs and improve the PnR process.
We present the architecture of the HVB in Fig. 11 and its HLS code in Fig. 12. A physical buffer (pbuf) is partitioned into virtual FIFO buffers for 8 target PCs. The buffer for each PC has a size of $ABUF$, and we implement it as a circular buffer with a write pointer (wptr) and a read pointer (rptr). At each cycle, the HVB reads a data element from in_fifo in a non-blocking fashion (line 24) and writes it to the target PC's virtual buffer (line 27). The partition among different PCs in pbuf is fixed.
Whereas the target PC of the input data is random, the output data is sent in a burst for the same target PC. Before initiating a write transfer for a new target PC, the HVB passes the target PC and the number of elements through out_info_fifo (line 20). Then it transmits the output data in a burst as shown in lines 7 to 14. A separate write logic (omitted) receives the burst information and relays the data to an AXI master.
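Since Fig. 12 is not reproduced here, the following is a minimal sketch of the write-side HVB under stated assumptions: the flush-when-full policy, the simple fill/drain state machine, the in_pc_fifo and out_fifo stream names, and the omitted final flush are simplifications of this sketch, not necessarily the code of Fig. 12 (only pbuf, wptr/rptr, in_fifo, and out_info_fifo come from the text).

```cpp
// Hypothetical HLS virtual buffer (HVB) sketch for the write path: one
// physical BRAM buffer (pbuf) is carved into NUM_PC circular sub-buffers
// of depth ABUF, each tracked by its own write/read pointer.
#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<512> key_t;
const int NUM_PC = 8;    // virtual channels (target PCs) sharing one physical buffer
const int ABUF   = 64;   // burst length / depth of each virtual channel (power of two)

struct burst_info_t { int pc; int len; };

void hls_virtual_buffer(hls::stream<key_t>&        in_fifo,
                        hls::stream<int>&          in_pc_fifo,    // target PC per key
                        hls::stream<key_t>&        out_fifo,
                        hls::stream<burst_info_t>& out_info_fifo,
                        int num_keys) {
  key_t pbuf[NUM_PC * ABUF];                       // single shared physical buffer
#pragma HLS RESOURCE variable=pbuf core=RAM_2P_BRAM
  int wptr[NUM_PC], rptr[NUM_PC], count[NUM_PC];
  for (int p = 0; p < NUM_PC; p++) {
#pragma HLS UNROLL
    wptr[p] = 0; rptr[p] = 0; count[p] = 0;
  }
  int  consumed = 0;
  bool draining = false;
  int  drain_pc = 0, drain_left = 0;

  while (consumed < num_keys || draining) {
#pragma HLS PIPELINE II=1
    if (draining) {
      // Send one element of the current burst per cycle to the write logic.
      out_fifo.write(pbuf[drain_pc * ABUF + rptr[drain_pc]]);
      rptr[drain_pc] = (rptr[drain_pc] + 1) & (ABUF - 1);
      if (--drain_left == 0) { count[drain_pc] = 0; draining = false; }
    } else if (!in_fifo.empty() && !in_pc_fifo.empty()) {
      // Non-blocking input: store the key into its target PC's virtual FIFO.
      int pc = in_pc_fifo.read();
      pbuf[pc * ABUF + wptr[pc]] = in_fifo.read();
      wptr[pc] = (wptr[pc] + 1) & (ABUF - 1);
      count[pc]++;
      consumed++;
      if (count[pc] == ABUF) {
        // Virtual FIFO full: announce (PC, length), then stream the burst out.
        burst_info_t info; info.pc = pc; info.len = ABUF;
        out_info_fifo.write(info);
        draining = true; drain_pc = pc; drain_left = ABUF;
      }
    }
  }
  // Final flush of partially filled virtual channels omitted for brevity.
}
```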
The HVB for the read operation (e.g., in merge sort) is implemented with code similar to that in Fig. 12, except that the input data is collected in a burst from a single source PC and the output data is sent in a round-robin fashion among different PCs.
Table 5 shows that the overall LUT/FF resource consumption
of HVB is similar to the direct access scheme. The performance
is much better than the direct access scheme because we send
data through the built-in crossbar in a burst. Compared to the
FIFO burst buer scheme, we reduce the BRAM usage as expected
because HVB better utilizes the BRAM by sharing. Also, the LUT/FF
usage has been reduced because we only use a single physical FIFO.
The routing for HVB is successful because of the small resource
consumption and the low complexity.
We can estimate the performance of the HVB by setting the $BW_{max}$ of Eq. 2 to that of Fig. 8 and $BLEN$ to the buffer size of the HVB ($ABUF$).
It is dicult for novice HLS users to incorporate the code in
Fig. 12 into their design. For better abstraction, we propose us-
ing the syntax shown in Fig. 13. A programmer can instantiate
a physical buer
pfifo
and use a new virtual buer read key-
word
vfifo_read
. The virtual channel can be specied with a tag
Table 6: Eective bandwidth (on-board test), 𝐵𝑊 2/resource metrics, and resource consumption (post-PnR) of bucket sort after
varying the number of crossbar stages
Cus AXI Bur FPGA Resource KClk EBW 𝐵𝑊 2/Resource Metrics
Xbar Xbar Len LUT/FF/DSP/BRAM (MHz) (GB/s) 𝐵𝑊 2/𝐿𝑈 𝐵𝑊 2/𝐹𝐹 𝐵𝑊 2/𝐵𝑅
0 4 0 102K / 243K / 0 / 248 277 65 1.0 1.0 1.0
0 4 64 122K / 243K / 0 / 480 166 108 2.3 2.7 1.4
1 3 64 121K / 231K / 0 / 368 281 160 5.1 6.4 4.1
2 2 64 134K / 233K / 0 / 368 300 180 5.8 8.0 5.2
3 1 64 155K / 243K / 0 / 368 299 195 5.9 9.0 6.1
4 0 0 189K / 305K / 0 / 248 207 203 5.3 7.8 9.8
vir_ch0. Then an automated tool can be used to perform a code-to-code transformation from this abstracted code to the detailed implementation in Fig. 12.
7 DESIGN SPACE EXPLORATION
As explained in Section 3, we explore the design space of $CXBAR$ = 0, 1, 2, ..., log(16) and $ABUF$ = 0, 1, 2, 4, ..., 128, 256. The throughput is estimated using the methods described in Sections 4, 5, and 6. The resource usage is estimated by first generating a few designs and obtaining the unit resource consumption of the components.
Table 7 shows the unit resource consumption of the major HBM Connect components. The BRAM consumption of the HVB is estimated by multiplying the burst buffer depth by the number of supported PCs, ceiled to the 512 minimum BRAM depth. Next, we count the number of components based on the design point ($CXBAR$, $ABUF$). We can estimate the total resource consumption by multiplying the unit consumption by the number of components.
Table 7: FPGA resource unit consumption (post-PnR) of major components (data bitwidth: 512b)

                         LUT    FF     DSP   BRAM
HLS AXI master           2220   6200   0     15.5
Mux-Demux switch         3748   2130   0     0
HVB (ABUF=64, 8 PCs)     160    601    0     7.5
HVB (ABUF=128, 8 PCs)    189    612    0     14.5
Since there are only 5 ($CXBAR$) × 9 ($ABUF$) = 45 design points, which can be estimated in seconds, we enumerate all of them. The design space exploration result will be presented in Section 8.
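A minimal sketch of this enumeration, under stated assumptions: the unit LUT costs are taken from Table 7, the component counts (16 AXI masters, 8 mux-demux switches per custom stage) are assumptions, and estimate_bw is a deliberately simple placeholder for the bandwidth models of Sections 4–6 rather than the authors' actual estimator.

```cpp
// Hypothetical enumeration of (CXBAR, ABUF) design points: estimate resources
// as (unit cost x component count), score each point by BW^2/LUT, keep the best.
#include <cstdio>

// Placeholder bandwidth model (GB/s). In the paper this comes from Eq. 2/3
// with the contention-dependent BW_max of Fig. 8; numbers here are illustrative.
static double estimate_bw(int cxbar, int abuf) {
  double burst_factor = (abuf == 0) ? 0.1 : (abuf >= 64 ? 1.0 : abuf / 64.0);
  return 65.0 + (206.0 - 65.0) * (cxbar / 4.0) * burst_factor;
}

// Resource model from unit costs (Table 7) and assumed component counts.
// HVB LUT cost is small and omitted here for brevity.
static double estimate_lut(int cxbar) {
  const double AXI_MASTER_LUT = 2220.0, SWITCH_LUT = 3748.0;
  const int NUM_MASTER = 16, SWITCHES_PER_STAGE = 8;   // 16-wide butterfly
  return NUM_MASTER * AXI_MASTER_LUT + cxbar * SWITCHES_PER_STAGE * SWITCH_LUT;
}

int main() {
  const int ABUF_SET[] = {0, 1, 2, 4, 8, 16, 32, 64, 128, 256};  // ABUF values of Section 3
  double best = 0.0;
  int best_cx = 0, best_ab = 0;
  for (int cxbar = 0; cxbar <= 4; cxbar++) {            // 0 .. log2(16) custom stages
    for (int a = 0; a < 10; a++) {
      double bw  = estimate_bw(cxbar, ABUF_SET[a]);
      double lut = estimate_lut(cxbar);
      double metric = bw * bw / lut;                    // BW^2/LUT (FF/BRAM analogous)
      if (metric > best) { best = metric; best_cx = cxbar; best_ab = ABUF_SET[a]; }
    }
  }
  printf("best estimated design point: CXBAR=%d, ABUF=%d\n", best_cx, best_ab);
  return 0;
}
```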
8 EXPERIMENTAL RESULT
8.1 Experimental Setup
We use Alveo U280 board for experiment. The board’s FPGA re-
source is shown in Table 8. For programming, we utilize Xilinx’s
Vitis [33] and Vivado HLS [31] 2019.2 tools.
Table 8: FPGA resource on Alveo U280
LUT FF DSP BRAM
1.30M 2.60M 9.02K 2.02K
8.2 Case Study 1: Bucket Sort
In Table 5, we have already presented a quantitative resource-performance analysis when enlarging the HLS virtual buffer (after fixing the number of custom crossbar stages). In this section, we first analyze the effect of varying the number of custom crossbar stages. We fix the HLS virtual buffer size to 64 for a clearer comparison.
The result is shown in Table 6. We only account for the post-PnR resource consumption of the user logic and exclude the resource consumption of the static region, the MCs, and the built-in crossbars. The $BW^2/LUT$, $BW^2/FF$, and $BW^2/BRAM$ metrics are normalized to the baseline design with no custom crossbar stage and no virtual buffer. A larger value of these metrics indicates a better design.
As we add more custom crossbar stages, we can observe a steady increase in LUT and FF usage because more switches are needed. A larger number of custom crossbar stages reduces the data transactions through the lateral connections and increases the effective bandwidth. But as long as more than one AXI master communicates with a common PC through the built-in AXI crossbar, the bandwidth loss due to contention is unavoidable (Section 4.2). When the custom crossbar (4 stages) completely replaces the built-in crossbar, one AXI master communicates with only a single PC. The data received from multiple PEs is written to the same memory space because the keys within a bucket do not need to be ordered. The one-to-one connection between an AXI master and a PC removes the contention in the built-in crossbar, and we can reach the best effective bandwidth (203 GB/s). Note that this performance closely approaches the maximum bandwidth of 206 GB/s (= 16 PCs × 12.9 GB/s) achieved with the sequential access microbenchmark on 16 PCs (Table 2).
In terms of the resource-performance metrics ($BW^2/LUT$, $BW^2/FF$, and $BW^2/BRAM$), the designs with a few custom crossbar stages are much better than the baseline design with no custom crossbar. For example, the design with two stages of custom crossbar and a virtual buffer depth of 64 per PC is superior by factors of 5.8X/8.0X/5.2X. Even though adding more custom crossbar stages results in increased resource consumption, the gain in effective bandwidth is far greater. This result shows that memory-bound applications can benefit from adding a few custom crossbar stages to reduce the lateral connection traffic.
We can observe a very interesting peak at the design point that has 4 stages of custom crossbar. Since this design has the largest number of switches, its $BW^2/LUT$ (5.3) is slightly worse than that of the design with two custom crossbar stages (5.8). But in this design, a PE only needs to communicate with a single bucket in a PC. Thus, we can infer burst access without an AXI burst buffer and remove the HVB. The BRAM usage of this design point is lower than the others, and its $BW^2/BRAM$ is superior (9.8). We can deduce that if the data from multiple PEs can be written to the same memory space and BRAM is the most precious resource, it might be worth building enough custom crossbar stages to ensure a one-to-one connection between an AXI master and a PC.
Table 9: Bucket sort’s design points with best 𝐵𝑊 2/𝐿𝑈𝑇 and
𝐵𝑊 2/𝐵𝑅𝐴𝑀 metrics (normalized to a baseline design with
𝐶𝑋 𝐵𝐴𝑅 =0 and 𝐴𝐵𝑈 𝐹 =0). Y-axis is the number custom cross-
bar stages and the X-axis is the virtual buer depth. The best
and the second best designs are in bold.
(𝐵𝑊 2/𝐿𝑈𝑇 ) (𝐵𝑊 2/𝐵𝑅𝐴𝑀)
0 16 32 64 128 0 16 32 64 128
0 1.0 0.9 2.6 2.3 NA 0 1.0 0.7 2.0 1.4 NA
1 1.0 2.8 6.5 5.1 3.5 1 1.1 2.2 5.2 4.1 2.2
2 0.6 2.4 6.2 5.8 4.7 2 0.7 2.1 5.5 5.2 4.2
3 0.8 2.3 3.8 5.9 5.3 3 1.1 2.3 3.9 6.1 5.5
4 5.3 - - - - 4 9.8 - - - -
Table 9 presents the design space exploration results with a varying number of custom/built-in crossbar stages and virtual buffer sizes. We present the numbers for the $BW^2/LUT$ and $BW^2/BRAM$ metrics but omit the table for $BW^2/FF$ because it has a similar trend to the $BW^2/LUT$ table. In terms of the $BW^2/BRAM$ metric, ($CXBAR$=4, $ABUF$=0) is the best design point for the reason explained in the interpretation of Table 6. In terms of the $BW^2/LUT$ metric, the design points with $CXBAR$=1~3 have similar values and clearly outperform the design points with $CXBAR$=0. This agrees with the result in Fig. 7(b), where the 2×2 to 8×8 configurations all have a similar effective bandwidth and are much better than the 16×16 configuration. For both metrics, the design points with $ABUF$ less than 16 are not competitive because the effective bandwidth is too small (Fig. 7). The design points with $ABUF$ larger than 64 are also not competitive because almost an equal amount of read and write is performed on each PC—the effective bandwidth per PC cannot increase beyond 6.5 GB/s (= 12.9 GB/s ÷ 2) even with a large $ABUF$.
8.3 Case Study 2: Merge Sort
Table 10 shows the design space exploration of merge sort, which uses HBM Connect in its read interconnect. The absolute values of the $BW^2/LUT$ and $BW^2/BRAM$ metrics are considerably higher than those of bucket sort for most of the design points. This is because the read effective bandwidth of the baseline implementation ($CXBAR$=0, $ABUF$=0) is 9.4 GB/s, which is much lower than the write effective bandwidth (65 GB/s) of the bucket sort baseline implementation.
As mentioned in Section 4.2, the read operation requires a longer burst length than the write operation to saturate the effective bandwidth because the read latency is relatively longer. Thus the $BW^2/LUT$ metric reaches its highest point at a burst length of 128–256, which is larger than the 32–64 burst length observed in bucket sort (Table 9). The $BW^2/BRAM$ metric, on the other hand, reaches its peak at the shorter burst length of 64 because a larger $ABUF$ requires more BRAMs.
Table 10: Merge sort’s design points with best 𝐵𝑊 2/𝐿𝑈𝑇 and
𝐵𝑊 2/𝐵𝑅𝐴𝑀 metric (normalized to a baseline design with
𝐶𝑋 𝐵𝐴𝑅 =0 and 𝐴𝐵𝑈 𝐹 =0). Y-axis is the number custom cross-
bar stages and the X-axis is the virtual buer depth.
(𝐵𝑊 2/𝐿𝑈𝑇 ) (𝐵𝑊 2/𝐵𝑅𝐴𝑀)
0 32 64 128 256 0 32 64 128 256
0 1.0 64 52 NA NA 0 1.0 57 34 NA NA
1 1.8 82 120 100 114 1 1.6 62 66 36 25
2 1.7 88 149 119 168 2 1.6 66 81 42 35
3 1.5 86 141 154 211 3 1.6 70 84 60 48
4 12 85 137 181 191 4 15 70 85 73 46
Similar to bucket sort, replacing the built-in crossbar with a custom crossbar provides better performance because there is less contention in the built-in crossbar. As a result, design points with $CXBAR$=4 or $CXBAR$=3 generally have better $BW^2/LUT$ and $BW^2/BRAM$. But unlike bucket sort, the peak in $BW^2/BRAM$ for $CXBAR$=4 does not stand out—it has a similar value to $CXBAR$=3. This is because merge sort needs to read from 16 different memory spaces regardless of the number of custom crossbar stages (explained in Section 2.3.2). Each memory space requires a separate virtual channel in the HVB. Thus, we cannot completely remove the virtual buffer as in bucket sort.
9 CONCLUSION
We have implemented memory-bound applications on a recently released FPGA HBM board and found that it is difficult to fully exploit the board's bandwidth when multiple PEs access multiple HBM PCs. HBM Connect has been developed to meet this challenge. We have proposed several HLS-compatible optimization techniques, such as the HVB and the mux-demux switch, to remove the limitations of the current HLS HBM syntax. We also have tested the effectiveness of a butterfly multi-stage custom crossbar in reducing the contention on the lateral connections of the built-in crossbar. We found that adding AXI burst buffers and custom crossbar stages significantly improves the effective bandwidth. We also found, in the case of bucket sort, that completely replacing the built-in crossbar with a full custom crossbar may provide the best trade-off in terms of BRAMs if the output from multiple PEs can be written into a single memory space. The proposed architecture improves the baseline implementation by a factor of 6.5X–211X for the $BW^2/LUT$ metric and 9.8X–85X for the $BW^2/BRAM$ metric. As future work, we plan to apply HBM Connect to Intel HBM boards and also generalize it beyond the two cases studied in this paper.
10 ACKNOWLEDGMENTS
This research is in part supported by Xilinx Adaptive Compute
Cluster (XACC) Program, Intel and NSF Joint Research Center
on Computer Assisted Programming for Heterogeneous Architec-
tures (CAPA) (CCF-1723773), NSF Grant on RTML: Large: Accelera-
tion to Graph-Based Machine Learning (CCF-1937599), NIH Award
(U01MH117079), and Google Faculty Award. We thank Thomas
Bollaert, Matthew Certosimo, and David Peascoe at Xilinx for help-
ful discussions and suggestions. We also thank Marci Baun for
proofreading this article.
REFERENCES
[1] ARM. 2011. AMBA AXI and ACE Protocol Specification: AXI3, AXI4, and AXI4-Lite, ACE and ACE-Lite. www.arm.com
[2] J. Bakos. 2010. High-performance heterogeneous computing with the Convey HC-1. IEEE Comput. Sci. Eng. 12, 6 (2010), 80–87.
[3] R. Chen, S. Siriyal, and V. Prasanna. 2015. Energy and memory efficient mapping of bitonic sorting on FPGA. In Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays. 240–249.
[4] Y. Choi, Y. Chi, J. Wang, L. Guo, and J. Cong. 2020. When HLS meets FPGA HBM: Benchmarking and bandwidth optimization. ArXiv Preprint (2020). https://arxiv.org/abs/2010.06075
[5] Y. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei. 2016. A quantitative analysis on microarchitectures of modern CPU-FPGA platform. In Proc. Ann. Design Automation Conf. 109–114.
[6] Y. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei. 2019. In-depth analysis on microarchitectures of modern heterogeneous CPU-FPGA platforms. ACM Trans. Reconfigurable Technology and Systems 12, 1 (Feb. 2019).
[7] Y. Choi, P. Zhang, P. Li, and J. Cong. 2017. HLScope+: Fast and accurate performance estimation for FPGA HLS. In Proc. IEEE/ACM Int. Conf. Computer-Aided Design. 691–698.
[8] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang. 2018. Understanding performance differences of FPGAs and GPUs. In IEEE Ann. Int. Symp. Field-Programmable Custom Computing Machines. 93–96.
[9] P. Cooke, J. Fowers, G. Brown, and G. Stitt. 2015. A tradeoff analysis of FPGAs, GPUs, and multicores for sliding-window applications. ACM Trans. Reconfigurable Technol. Syst. 8, 1 (Mar. 2015), 1–24.
[10] B. Cope, P. Cheung, W. Luk, and L. Howes. 2010. Performance comparison of graphics processors to reconfigurable logic: A case study. IEEE Trans. Computers 59, 4 (Apr. 2010), 433–448.
[11] W. J. Dally and C. L. Seitz. 1987. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans. Computers C-36, 5 (May 1987), 547–553.
[12] K. Fleming, M. King, and M. C. Ng. 2008. High-throughput pipelined mergesort. In Int. Conf. Formal Methods and Models for Co-Design.
[13] Intel. 2020. High Bandwidth Memory (HBM2) Interface Intel FPGA IP User Guide. https://www.intel.com/
[14] Intel. 2020. Avalon Interface Specifications. https://www.intel.com/
[15] JEDEC. 2020. High Bandwidth Memory (HBM) DRAM. https://www.jedec.org/standards-documents/docs/jesd235a
[16] H. Jun, J. Cho, K. Lee, H. Son, K. Kim, H. Jin, and K. Kim. 2017. HBM (High Bandwidth Memory) DRAM technology and architecture. In Proc. IEEE Int. Memory Workshop. 1–4.
[17] S. Lahti, P. Sjövall, and J. Vanne. 2019. Are we there yet? A study on the state of high-level synthesis. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems 38, 5 (May 2019), 898–911.
[18] R. Li, H. Huang, Z. Wang, Z. Shao, X. Liao, and H. Jin. 2020. Optimizing memory performance of Xilinx FPGAs under Vitis. ArXiv Preprint (2020). https://arxiv.org/abs/2010.08916
[19] A. Lu, Z. Fang, W. Liu, and L. Shannon. 2021. Demystifying the memory system of modern datacenter FPGAs for software programmers through microbenchmarking. In Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays.
[20] H. Miao, M. Jeon, G. Pekhimenko, K. S. McKinley, and F. X. Lin. 2019. StreamBox-HBM: Stream analytics on high bandwidth hybrid memory. In Proc. Int. Conf. Architectural Support for Programming Languages and Operating Systems. 167–181.
[21] D. Molka, D. Hackenberg, and R. Schöne. 2014. Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer. In Proc. Workshop on Memory Systems Performance and Correctness. 1–10.
[22] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and G. Boudoukh. 2017. Can FPGAs beat GPUs in accelerating next-generation deep neural networks?. In Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays. 5–14.
[23] Nvidia. 2020. Nvidia Titan V. https://www.nvidia.com/en-us/titan/titan-v/
[24] J. Park, P. Diniz, and K. Shayee. 2004. Performance and area modeling of complete FPGA designs in the presence of loop transformations. IEEE Trans. Computers 53, 11 (Sept. 2004), 1420–1435.
[25] M. Saitoh, E. A. Elsayed, T. V. Chu, S. Mashimo, and K. Kise. 2018. A high-performance and cost-effective hardware merge sorter without feedback datapath. In IEEE Ann. Int. Symp. Field-Programmable Custom Computing Machines. 197–204.
[26] N. Samardzic, W. Qiao, V. Aggarwal, M. F. Chang, and J. Cong. 2020. Bonsai: High-performance adaptive merge tree sorting. In Ann. Int. Symp. Comput. Architecture. 282–294.
[27] Z. Wang, H. Huang, J. Zhang, and G. Alonso. 2020. Shuhai: Benchmarking High Bandwidth Memory on FPGAs. In IEEE Ann. Int. Symp. Field-Programmable Custom Computing Machines.
[28] Xilinx. 2020. Alveo U280 Data Center Accelerator Card User Guide. https://www.xilinx.com/support/documentation/boards_and_kits/accelerator-cards/ug1314-u280-reconfig-accel.pdf
[29] Xilinx. 2020. Alveo U50 Data Center Accelerator Card User Guide. https://www.xilinx.com/support/documentation/boards_and_kits/accelerator-cards/ug1371-u50-reconfig-accel.pdf
[30] Xilinx. 2020. AXI High Bandwidth Memory Controller v1.0. https://www.xilinx.com/support/documentation/ip_documentation/hbm/v1_0/pg276-axi-hbm.pdf
[31] Xilinx. 2020. Vivado High-Level Synthesis (UG902). https://www.xilinx.com/
[32] Xilinx. 2020. UltraScale Architecture Memory Resources (UG573). https://www.xilinx.com/
[33] Xilinx. 2020. Vitis Unified Software Platform. https://www.xilinx.com/products/design-tools/vitis/vitis-platform.html
... These factors limit the maximum transfer capacity between channels, making seamless, efficient access to all channels without bandwidth degradation a significant challenge in multi-channel access scenarios. As HBM increasingly replaces DDR memory, these issues of suboptimal bandwidth utilization become more pronounced, highlighting the need for improved system design to address these limitations [7,8]. ...
... Prakash et al. (2022) [10] demonstrated the use of FPGA overlay NoCs in a multi-die FPGA, but this introduced additional complexity due to on-chip routing. Choi et al. (2021) [7] introduced a butterfly network-based interconnect to interface processing elements with HBM. Additionally, Xue et al. (2023) [11] proposed a custom switching network based on a CLOS topology to replace the built-in crossbar, optimizing HBM access scheduling. ...
... Prakash et al. (2022) [10] demonstrated the use of FPGA overlay NoCs in a multi-die FPGA, but this introduced additional complexity due to on-chip routing. Choi et al. (2021) [7] introduced a butterfly network-based interconnect to interface processing elements with HBM. Additionally, Xue et al. (2023) [11] proposed a custom switching network based on a CLOS topology to replace the built-in crossbar, optimizing HBM access scheduling. ...
Article
The integration of High-Bandwidth Memory (HBM) into Field-Programmable Gate Arrays (FPGAs) has significantly enhanced data processing capabilities. However, the segmentation of HBM into 32 pseudo-channels, each managed by a performance-limited crossbar, imposes a significant bottleneck on data throughput. To overcome this challenge, we propose a transparent HBM access framework that integrates a non-blocking network-on-chip (NoC) module and fine-grained burst control transmission, enabling efficient multi-channel memory access in HBM. Our Omega-based NoC achieves a throughput of 692 million packets per second, surpassing state-of-the-art solutions. When implemented on the Xilinx Alveo U280 FPGA board, the proposed framework attains near-maximum single-channel write bandwidth, delivering 12.94 GB/s in many-to-many unicast communication scenarios, demonstrating its effectiveness in optimizing memory access for high-performance applications.
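To make the self-routing property such an Omega-based NoC relies on concrete, the following minimal C++ sketch simulates destination-tag routing through an N-input shuffle-exchange (Omega) network. It is an illustrative model under assumed parameters (2x2 switches, one stage per destination bit, processing elements mapped one-to-one to HBM pseudo-channels), not the framework's implementation.

#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

// Route one packet by destination-tag (self-routing) and return the position
// it occupies after each of the log2(N) shuffle-exchange stages.
std::vector<uint32_t> omega_route(uint32_t src, uint32_t dst, uint32_t n_bits) {
  const uint32_t mask = (1u << n_bits) - 1;
  uint32_t pos = src & mask;
  std::vector<uint32_t> trace;
  for (uint32_t stage = 0; stage < n_bits; ++stage) {
    // Perfect shuffle: cyclic left rotation of the n-bit position.
    pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask;
    // 2x2 exchange: the switch output is chosen by one destination bit,
    // consumed MSB-first.
    uint32_t bit = (dst >> (n_bits - 1 - stage)) & 1u;
    pos = (pos & ~1u) | bit;
    trace.push_back(pos);
  }
  assert(pos == (dst & mask));  // self-routing always reaches the destination
  return trace;
}

int main() {
  // 8 PEs to 8 HBM pseudo-channels (n_bits = 3): PE 5 writes to channel 2.
  for (uint32_t p : omega_route(/*src=*/5, /*dst=*/2, /*n_bits=*/3))
    std::cout << p << ' ';  // prints 2 5 2
  std::cout << '\n';
  return 0;
}

Because each stage consumes exactly one destination bit, a packet reaches its pseudo-channel after log2(N) stages regardless of where it entered, which is what lets the switches be set without central arbitration.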
... In the 1960s, the Benes network [Beneš 1965] emerged, reducing cost while remaining rearrangeable for 1-to-1 connections. In 1968, a polynomial-time algorithm for routing on the Benes network was presented [Waksman 1968]. Although multistage and Benes networks remain important for interconnecting high-performance computers in datacenters [Zhao et al. 2024], for FPGAs with HBM memories [Choi et al. 2021], and for CGRAs with dynamic compilation [Silva et al. 2019], there is no proof that the Benes network and its routing algorithm remain valid in the presence of multicast. ...
... Omega networks, on the other hand, especially when expanded with additional stages, have shown promise in overcoming this limitation, although a definitive proof of their rearrangeability in multicast settings still requires future investigation. Computational exploration of the solution space, aided by GPUs and by applying machine learning algorithms to the generated data, showed potential to improve the prediction and optimization of network configurations, suggesting new research directions for multistage networks in high-performance computing under multicast, which can extend current works that are limited to unicast [Zhao et al. 2024, Choi et al. 2021]. ...
Conference Paper
Multistage networks are an economical alternative, with a cost of O(n log(n)) switching elements. However, finding a valid routing in these networks is a major challenge. One possible solution is to use rearrangeable networks, which can be reconfigured to connect any input to any output by reorganizing their internal connections to avoid blocking. In real applications, however, connections usually follow a multicast pattern, which makes routing more complex even in rearrangeable networks. We show that, although the Benes network is widely recognized in the literature as the main reference for rearrangeable networks, it loses this property in the presence of multicast. To work around this problem, we explore Omega (shuffle-exchange) networks with extra stages and show that they can be reorganized to avoid blocking. Since the solution search space is exponential, we use a GPU implementation, which allowed us to derive some important properties. We also evaluate machine learning algorithms using the data generated by the GPU. Finally, this study points to new directions for the development of rearrangeable networks in the presence of multicast.
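A small brute-force check makes the blocking argument concrete: route a complete unicast permutation through a plain Omega network and flag any stage where two packets land on the same intermediate port. This toy C++ sketch covers the unicast baseline only and does not model the multicast or extra-stage variants the paper studies.

#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

// dst[i] is the output requested by the packet entering at input i.
// Returns true only if no two packets ever occupy the same wire.
bool omega_routable(const std::vector<uint32_t>& dst, uint32_t n_bits) {
  const uint32_t n = 1u << n_bits;
  std::vector<uint32_t> pos(n);
  for (uint32_t i = 0; i < n; ++i) pos[i] = i;  // packet i starts at input i
  for (uint32_t stage = 0; stage < n_bits; ++stage) {
    std::set<uint32_t> used;
    for (uint32_t i = 0; i < n; ++i) {
      uint32_t p = ((pos[i] << 1) | (pos[i] >> (n_bits - 1))) & (n - 1);
      uint32_t bit = (dst[i] >> (n_bits - 1 - stage)) & 1u;
      pos[i] = (p & ~1u) | bit;
      if (!used.insert(pos[i]).second) return false;  // two packets collide
    }
  }
  return true;
}

int main() {
  std::vector<uint32_t> identity = {0, 1, 2, 3, 4, 5, 6, 7};
  std::vector<uint32_t> bit_reversal = {0, 4, 2, 6, 1, 5, 3, 7};
  std::cout << omega_routable(identity, 3) << ' '
            << omega_routable(bit_reversal, 3) << '\n';  // prints 1 0
  return 0;
}

The bit-reversal permutation already collides in the first stage of the plain 8x8 network, which is exactly the kind of blocking that extra stages or rearrangement are meant to remove; extending such checks to multicast is what blows up the search space the paper explores on GPUs.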
... Networks-on-Chip (NoCs) serve as the communication backbone for connecting accelerators, memories, and I/Os across large chips [1], [2]. NoCs mapped to soft FPGA logic are key components in many designs [3], [4], [5], [6]. However, as FPGAs become larger, soft NoCs are approaching their limits, especially when FPGAs incorporate multiple chiplets, which results in more communication along chiplet boundaries. ...
Preprint
With the advent of modern multi-chiplet FPGA architectures, vendors have begun integrating hardened NoCs to address the scalability, resource usage, and frequency disadvantages of soft NoCs. However, as this work shows, effectively harnessing these hardened NoCs is not trivial. It requires detailed knowledge of the microarchitecture and how it relates to the physical design of the FPGA. Existing literature has provided in-depth analyses of NoCs in MPSoC devices, but few studies have systematically evaluated hardened NoCs in FPGAs, which have several unique implications. This work aims to bridge this knowledge gap by demystifying the performance and design trade-offs of hardened NoCs on FPGAs. Our work performs a detailed performance analysis of hard (and soft) NoCs under different settings, including diverse NoC topologies, routing strategies, traffic patterns, and different external memories under various NoC placements. In the context of Versal FPGAs, our results show that using hardened NoCs in multi-SLR designs can reduce expensive cross-SLR link usage by up to 30–40%, eliminate general-purpose logic overhead, and remove most critical paths caused by large on-chip crossbars. However, under certain aggressive traffic patterns, the frequency advantage of hardened NoCs is outweighed by inefficiency in the network microarchitecture. We also observe suboptimal solutions from the NoC compiler and distinct performance variations between the vertical and horizontal interconnects, underscoring the need for careful design. These findings serve as practical guidelines for effectively integrating hardened NoCs and highlight important trade-offs for future FPGA-based systems.
... Hardware has evolved significantly over the past decades, and new technologies, particularly high-bandwidth memory (HBM), have profoundly influenced computer architectures [22], [23], [24]. These improvements in hardware and computation power have fundamentally altered the landscape of many fields, including neural networks, wireless communications, and science. ...
Preprint
Network coding enhances performance in network communications and distributed storage by increasing throughput and robustness while reducing latency. Batched Sparse (BATS) codes are a class of capacity-achieving network codes, but their practical application is hindered by their structure and by the computational intensity and power demands of finite-field operations. Most of the literature focuses on algorithm-level techniques to improve coding efficiency; optimization through algorithm/hardware co-design has long been neglected. Leveraging the unique structure of BATS codes, we first present CS-BATS, a hardware-friendly variant. Next, we propose a simple but effective bounded-value generator to reduce the size of a finite-field multiplier by up to 70%. Finally, we report on a scalable and resource-efficient FPGA-based network coding accelerator that achieves a throughput of 27 Gbps, a speedup of more than 300x over software.
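As a point of reference for why finite-field arithmetic dominates the cost of network-coding kernels, the sketch below shows a textbook bit-serial multiplication in GF(2^8) in C++. The field size and the reduction polynomial x^8 + x^4 + x^3 + x + 1 are common choices but are assumptions here, not details taken from the preprint; its bounded-value generator further restricts coefficient values so that this multiplier can be shrunk.

#include <cstdint>
#include <iostream>

// Shift-and-add multiplication in GF(2^8) with an assumed reduction
// polynomial of 0x11B (x^8 + x^4 + x^3 + x + 1).
uint8_t gf256_mul(uint8_t a, uint8_t b) {
  uint8_t product = 0;
  for (int i = 0; i < 8; ++i) {
    if (b & 1) product ^= a;   // add (XOR) the current partial product
    bool overflow = a & 0x80;  // does shifting a produce an x^8 term?
    a <<= 1;
    if (overflow) a ^= 0x1B;   // reduce the x^8 term modulo the polynomial
    b >>= 1;
  }
  return product;
}

int main() {
  // Encoding one byte of a packet: coefficient 0x53 times symbol 0xCA.
  std::cout << std::hex << int(gf256_mul(0x53, 0xCA)) << '\n';  // prints 1
  return 0;
}

Eight shift-and-XOR steps per coefficient, repeated for every byte of every coded packet, is the cost that hardware multipliers, including the cited accelerator's reduced-width variant, are built to amortize.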
Article
Conventional homogeneous multicore processors are not able to provide the continued performance and energy improvement that we have expected from past endeavors. Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among different heterogeneous devices, FPGAs that can be reconfigured to accelerate a broad class of applications with orders-of-magnitude performance/watt gains, are attracting increased attention from both academia and industry. As a consequence, a variety of CPU-FPGA acceleration platforms with diversified microarchitectural features have been supplied by industry vendors. Such diversity, however, poses a serious challenge to application developers in selecting the appropriate platform for a specific application or application domain. This paper aims to address this challenge by determining which microarchitectural characteristics affect performance, and in what ways. Specifically, we conduct a quantitative comparison and an in-depth analysis on five state-of-the-art CPU-FPGA acceleration platforms: 1) the Alpha Data board and 2) the Amazon F1 instance that represent the traditional PCIe-based platform with private device memory; 3) the IBM CAPI that represents the PCIe-based system with coherent shared memory; 4) the first generation of the Intel Xeon+FPGA Accelerator Platform that represents the QPI-based system with coherent shared memory; and 5) the second generation of the Intel Xeon+FPGA Accelerator Platform that represents a hybrid PCIe (non-coherent) and QPI (coherent) based system with shared memory. Based on the analysis of their CPU-FPGA communication latency and bandwidth characteristics, we provide a series of insights for both application developers and platform designers. Furthermore, we conduct two case studies to demonstrate how these insights can be leveraged to optimize accelerator designs. The microbenchmarks used for evaluation have been released for public use.
Article
To increase productivity in designing digital hardware components, high-level synthesis (HLS) is seen as the next step in raising the design abstraction level. However, the quality of results (QoR) of HLS tools has tended to be behind those of manual register-transfer level (RTL) flows. In this paper, we survey the scientific literature published since 2010 about the QoR and productivity differences between the HLS and RTL design flows. Altogether, our survey spans 46 papers and 118 associated applications. Our results show that on average, the QoR of RTL flow is still better than that of the state-of-the-art HLS tools. However, the average development time with HLS tools is only a third of that of the RTL flow, and a designer obtains over four times as high productivity with HLS. Based on our findings, we also present a model case study to sum up the best practices in comparative studies between HLS and RTL. The outcome of our case study is also in line with the survey results, as using an HLS tool is seen to increase the productivity by a factor of six. In addition, to help close the QoR gap, we present a survey of literature focused on improving HLS. Our results let us conclude that HLS is currently a viable option for fast prototyping and for designs with short time to market.
Conference Paper
With the public availability of FPGAs from major cloud service providers like AWS, Alibaba, and Nimbix, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS). The major goal of this paper is to figure out how to efficiently access the memory system of modern datacenter FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design only utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including 1) the number of concurrent memory access ports, 2) the data width of each port, 3) the maximum burst access length for each port, and 4) the size of consecutive data accesses. Then we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the Xilinx Alveo U200 and U280 FPGA memory systems when changing those affecting factors, and provide insights into efficient memory access in HLS-based accelerator designs. To demonstrate the usefulness of our insights, we also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms. Compared to the baseline designs, optimized designs leveraging our insights achieve about 3.5x and 8.5x speedups for the KNN and SpMV accelerators.
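The four factors identified above map directly onto HLS interface pragmas. The fragment below is a minimal Vitis HLS-style sketch rather than the paper's benchmark code; the kernel name, bundle names, and pragma values are illustrative assumptions chosen to show a wide 512-bit data path, two independent AXI ports, and a unit-stride loop from which long bursts can be inferred.

#include <ap_int.h>

// Hypothetical kernel illustrating bandwidth-friendly HLS memory access:
// wide data words, separate AXI bundles, and consecutive accesses for bursts.
extern "C" void copy512(const ap_uint<512>* in, ap_uint<512>* out, int n) {
#pragma HLS INTERFACE m_axi port=in bundle=gmem0 offset=slave max_read_burst_length=64 num_read_outstanding=16
#pragma HLS INTERFACE m_axi port=out bundle=gmem1 offset=slave max_write_burst_length=64 num_write_outstanding=16
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return

copy_loop:
  for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
    out[i] = in[i];  // unit-stride accesses let the tool infer long AXI bursts
  }
}

Widening each port to 512 bits and keeping the access pattern consecutive is what lifts a naive design, reported in the study at under 5% of the available off-chip bandwidth, toward the per-channel maximum.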
Conference Paper
Stream analytics has an insatiable demand for memory and performance. Emerging hybrid memories combine commodity DDR4 DRAM with 3D-stacked High Bandwidth Memory (HBM) DRAM to meet such demands. However, achieving this promise is challenging because (1) HBM is capacity-limited and (2) HBM boosts performance best for sequential access and high parallelism workloads. At first glance, stream analytics appears a particularly poor match for HBM because they have high capacity demands and data grouping operations, their most demanding computations, use random access. This paper presents the design and implementation of StreamBox-HBM, a stream analytics engine that exploits hybrid memories to achieve scalable high performance. StreamBox-HBM performs data grouping with sequential access sorting algorithms in HBM, in contrast to random access hashing algorithms commonly used in DRAM. StreamBox-HBM solely uses HBM to store Key Pointer Array (KPA) data structures that contain only partial records (keys and pointers to full records) for grouping operations. It dynamically creates and manages prodigious data and pipeline parallelism, choosing when to allocate KPAs in HBM. It dynamically optimizes for both the high bandwidth and limited capacity of HBM, and the limited bandwidth and high capacity of standard DRAM. StreamBox-HBM achieves 110 million records per second and 238 GB/s memory bandwidth while effectively utilizing all 64 cores of Intel's Knights Landing, a commercial server with hybrid memory. It outperforms stream engines with sequential access algorithms without KPAs by 7x and stream engines with random access algorithms by an order of magnitude in throughput. To the best of our knowledge, StreamBox-HBM is the first stream engine optimized for hybrid memories.
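The Key Pointer Array idea is easy to state in code: grouping sorts a compact array of keys and record pointers that matches HBM's bandwidth and capacity profile, while the full records stay in commodity DRAM. The following C++ sketch paraphrases that layout under assumed field names and sizes; it is not StreamBox-HBM's actual data structure.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Record {          // full record, kept in capacity-rich DDR4 DRAM
  uint64_t key;
  char payload[56];      // never touched during grouping
};

struct KpaEntry {        // compact entry, kept in bandwidth-rich HBM
  uint64_t key;          // copy of the grouping key
  const Record* record;  // pointer back to the full record
};

// Grouping becomes a sequential-access sort over the compact KPA only.
void group_by_key(std::vector<KpaEntry>& kpa) {
  std::sort(kpa.begin(), kpa.end(),
            [](const KpaEntry& a, const KpaEntry& b) { return a.key < b.key; });
}

Sorting 16-byte entries instead of full records keeps the working set inside HBM's limited capacity and turns the grouping traffic into the sequential pattern that HBM rewards.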
Conference Paper
We propose a high-performance and cost-effective hardware merge sorter (HMS) without any feedback datapaths, aiming to develop the fastest hardware sorting accelerator. The operating frequencies of existing HMSs are severely limited by the presence of feedback datapaths. We present the idea of eliminating these feedback datapaths and propose a concrete architecture that adopts it, together with several implementation optimizations. The evaluation results show that our HMS achieves a 1.59x throughput improvement with fewer hardware resources than the state-of-the-art HMS.
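For reference, the function any hardware merge sorter computes is the ordinary two-way streaming merge below, written here in plain C++. The cited contribution is architectural, removing the feedback datapath that caps clock frequency, so this sketch only fixes the semantics a feedback-free design must preserve and says nothing about that microarchitecture.

#include <cstdint>
#include <iostream>
#include <vector>

// Merge two sorted streams into one sorted stream, one comparison per output.
std::vector<uint32_t> merge_streams(const std::vector<uint32_t>& a,
                                    const std::vector<uint32_t>& b) {
  std::vector<uint32_t> out;
  out.reserve(a.size() + b.size());
  size_t i = 0, j = 0;
  while (i < a.size() && j < b.size())
    out.push_back(a[i] <= b[j] ? a[i++] : b[j++]);
  while (i < a.size()) out.push_back(a[i++]);
  while (j < b.size()) out.push_back(b[j++]);
  return out;
}

int main() {
  for (uint32_t v : merge_streams({1, 4, 9}, {2, 3, 10}))
    std::cout << v << ' ';  // prints 1 2 3 4 9 10
  std::cout << '\n';
  return 0;
}

In hardware, the compare-and-select in the inner loop is where a feedback path normally appears, because the next selection depends on which element was just consumed; restructuring the datapath so that this dependence no longer limits the clock is the cited work's contribution.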