In-Depth Analysis on Microarchitectures of Modern
Heterogeneous CPU-FPGA Platforms
Center for Domain-Specic Computing, University of California, Los Angeles
Conventional homogeneous multicore processors are not able to provide the continued performance and
energy improvement that we have expected from past endeavors. Heterogeneous architectures that feature
specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among
different heterogeneous devices, FPGAs, which can be reconfigured to accelerate a broad class of applications
with orders-of-magnitude performance/watt gains, are attracting increased attention from both academia and
industry. As a consequence, a variety of CPU-FPGA acceleration platforms with diversified microarchitectural
features have been supplied by industry vendors. Such diversity, however, poses a serious challenge to
application developers in selecting the appropriate platform for a specific application or application domain.
This paper aims to address this challenge by determining which microarchitectural characteristics affect
performance, and in what ways. Specifically, we conduct a quantitative comparison and an in-depth analysis
of five state-of-the-art CPU-FPGA acceleration platforms: 1) the Alpha Data board and 2) the Amazon F1
instance, which represent the traditional PCIe-based platform with private device memory; 3) IBM CAPI, which
represents the PCIe-based system with coherent shared memory; 4) the first generation of the Intel Xeon+FPGA
Accelerator Platform, which represents the QPI-based system with coherent shared memory; and 5) the second
generation of the Intel Xeon+FPGA Accelerator Platform, which represents a hybrid PCIe (non-coherent) and
QPI (coherent) based system with shared memory. Based on the analysis of their CPU-FPGA communication
latency and bandwidth characteristics, we provide a series of insights for both application developers and
platform designers. Furthermore, we conduct two case studies to demonstrate how these insights can be
leveraged to optimize accelerator designs. The microbenchmarks used for evaluation have been released for
public use.
CCS Concepts: • Computer systems organization → Heterogeneous (hybrid) systems.
Additional Key Words and Phrases: Heterogeneous Computing, CPU-FPGA Platform, Xeon+FPGA, CAPI,
ACM Reference Format:
Young-kyu Choi, Jason Cong, Zhenman Fang, Yuchen Hao, Glenn Reinman, and Peng Wei. 2018. In-Depth
Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms. ACM Trans. Reconfig. Technol.
Syst. 1, 1, Article 1 (January 2018), 21 pages.
Extension of Conference Paper [10].
Zhenman Fang is also an Adjunct Professor at Simon Fraser University, Canada.
Peng Wei is the corresponding author, email:
Authors’ address: Young-kyu Choi; Jason Cong; Zhenman Fang; Yuchen Hao; Glenn Reinman; Peng Wei
Center for Domain-Specic Computing, University of California, Los Angeles.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from
©2018 Association for Computing Machinery.
1936-7406/2018/1-ART1 $15.00
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
1:2 Y. Choi et al.
1 Introduction

In today's datacenter designs, power and energy efficiency have become two of the primary constraints. The increasing demand for energy-efficient high-performance computing has stimulated a growing number of heterogeneous architectures that feature hardware accelerators or coprocessors, such as GPUs (graphics processing units), FPGAs (field-programmable gate arrays), and ASICs (application-specific integrated circuits). Among various heterogeneous acceleration platforms, the FPGA-based approach is considered one of the most promising directions, since FPGAs provide low power and high energy efficiency, and can be reprogrammed to accelerate different applications. Motivated by such advantages, leading cloud service providers have begun to incorporate FPGAs into their datacenters. For instance, Microsoft has designed a customized FPGA board called Catapult and integrated it into conventional computer clusters to accelerate large-scale production workloads, such as search engines [ ] and neural networks [ ]. In its Elastic Compute Cloud (EC2), Amazon also introduces the F1 compute instance [2] that equips a server with one or more FPGA boards. Intel, with its $16.7 billion acquisition of Altera, has predicted that approximately 30% of servers could have FPGAs in 2020 [ ], indicating that FPGAs can play an important role in datacenter computing.
With the trend of adopting FPGAs in datacenters, various CPU-FPGA acceleration platforms with diversified microarchitectural features have been developed. We classify state-of-the-art CPU-FPGA platforms in Table 1 according to their physical integration and memory models. Traditionally, the most widely used integration is to connect an FPGA to a CPU via the PCIe interface, with both components equipped with private memories. Many FPGA boards built on top of Xilinx or Intel FPGAs use this way of integration because of its extensibility. The customized Microsoft Catapult board integration is such an example. Another example is the Alpha Data FPGA board [30] with the Xilinx FPGA fabric, which can leverage the Xilinx SDAccel development environment [ ] to support efficient accelerator design using high-level programming languages, including C/C++ and OpenCL. The Amazon F1 instance also adopts this software/hardware environment to allow high-level accelerator design. On the other hand, vendors like IBM tend to support a PCIe connection with a coherent, shared memory model for easier programming. For example, IBM has been developing the Coherent Accelerator Processor Interface (CAPI) on POWER8 [27] for such an integration, and has used this platform in the IBM data engine for NoSQL [ ]. Meanwhile, the CCIX consortium has proposed the Cache Coherent Interconnect for Accelerators, which can connect FPGAs with ARM processors through the PCIe interface with coherent shared memory as well [4]. More recently, closer CPU-FPGA integration has become available using a new class of processor interconnects, such as the front-side bus (FSB) and the newer QuickPath Interconnect (QPI), that provide a coherent, shared memory; examples include the FSB-based Convey machine [6] and the Intel Xeon+FPGA accelerator platform [22]. While the first generation of the Xeon+FPGA platform (Xeon+FPGA v1) connects a CPU to an FPGA only through a coherent QPI channel, the second generation of the Xeon+FPGA platform (Xeon+FPGA v2) adds two non-coherent PCIe data communication channels between the CPU and the FPGA, resulting in a hybrid CPU-FPGA communication model.
The evolution of various CPU-FPGA platforms brings up a challenging question: which platform should we choose to gain better performance and energy efficiency for a given application to accelerate? There are numerous factors that can affect the choice, such as platform cost, programming models and efforts, logic resources and frequency of the FPGA fabric, and CPU-FPGA communication latency and bandwidth, to name just a few. While some of them are easy to figure out, others are nontrivial, especially the communication latency and bandwidth between the CPU and FPGA under different integrations. One reason is that there are few publicly available documents for the newly announced platforms like the Xeon+FPGA family, CAPI, and the Amazon F1 instance. More importantly, those
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms 1:3
Table 1. Classification of modern CPU-FPGA platforms

                                 Separate Private Memory        Shared Memory
  PCIe Peripheral Interconnect   Alpha Data [30],               IBM CAPI [27],
                                 Microsoft Catapult [24],       CCIX [4]
                                 Amazon F1 [2]
  Processor Interconnect         N/A                            Intel Xeon+FPGA v1 [22] (QPI),
                                                                Convey HC-1 [6] (FSB)
  Hybrid                         N/A                            Intel Xeon+FPGA v2 (QPI, PCIe)
Fig. 1. Summary of CPU-FPGA communication bandwidth and latency (not to scale)
architectural parameters in the datasheets are often advertised values, which are usually difficult to achieve in practice. In fact, there can sometimes be a huge gap between the advertised numbers and the practical numbers. For example, the advertised bandwidth of the PCIe Gen3 x8 interface is 8GB/s; however, our experimental results show that the PCIe-equipped Alpha Data platform can only provide 1.6GB/s PCIe-DMA bandwidth using the OpenCL APIs implemented by Xilinx (see Section 3.2.1). Quantitative evaluation and in-depth analysis of such microarchitectural characteristics could help CPU-FPGA platform users accurately predict the performance of a computation kernel on various candidate platforms, and make the right choice. Furthermore, it could also benefit CPU-FPGA platform designers in identifying performance bottlenecks and providing better hardware and software support.
Motivated by those potential benefits to both platform users and designers, this paper aims to discover which microarchitectural characteristics affect the performance of modern CPU-FPGA platforms, and to evaluate what that effect will be. We conduct our quantitative comparison on five state-of-the-art CPU-FPGA platforms: 1) the Alpha Data board and 2) the Amazon F1 instance, which represent the conventional PCIe-based platform with private device memory; 3) IBM CAPI, which represents the PCIe-based system with coherent shared memory; 4) Intel Xeon+FPGA v1, which represents the QPI-based system with coherent shared memory; and 5) Xeon+FPGA v2, which represents a hybrid PCIe (non-coherent) and QPI (coherent) based system with shared memory. These five platforms cover various CPU-FPGA interconnection approaches, as well as different memory models.
In summary, this paper makes the following contributions.
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
1:4 Y. Choi et al.
The rst quantitative characterization and comparison on the microarchitectures of state-of-the-
art CPU-FPGA acceleration platforms—including the Alpha Data board and Amazon F1 instance,
IBM CAPI, and Intel Xeon+FPGA v1 and v2—which covers the whole range of CPU-FPGA
connections. We quantify each platform’s CPU-FPGA communication latency and bandwidth
and the results are summarized in Fig. 1.
An in-depth analysis of the big gap between advertised and practically achievable performance
(Section 3), with step-by-step decomposition of the ineciencies.
Seven insights for both application developers to improve accelerator designs and platform
designers to improve platform support (Section 4). Specically, we suggest that accelerator
designers avoid using the advertised platform parameters to estimate the acceleration eect,
which almost always leads to an overly optimistic estimation. Moreover, we analyze the trade-o
between private-memory and shared-memory platforms, and analytically model the eective
bandwidth with the introduction of the memory data reuse ratio r. We also propose the metric
of computation-to-communication (CTC) ratio to measure when the CPU-FPGA communication
latency and bandwidth are critical. Finally, we suggest that the complicated communication
stack and hard-to-use coherent cache system may improve in the next-generation of CPU-FPGA
Two case studies in real applications to demonstrate how these insights can be leveraged,
including matrix multiplication and matrix-vector multiplication. The former is a well-known
compute-intensive application, and the latter is bounded by communication. We use these two
applications to demonstrate how to choose the appropriate platform by applying the proposed
A high-performance interconnect between the host processor and the FPGA is crucial to the overall performance of CPU-FPGA platforms. In this section we first summarize existing CPU-FPGA architectures with typical PCIe and QPI interconnects. Then we present the private and shared memory models of different platforms. Finally, we discuss related work.
2.1 Common CPU-FPGA Architectures
Typical PCIe-based CPU-FPGA platforms feature direct memory access (DMA) and private device DRAM (Fig. 2(a)). To interface with the device DRAM as well as the host-side CPU-attached memory, a memory controller IP and a PCIe endpoint with a DMA IP need to be implemented on the FPGA, in addition to user-defined accelerator function units (AFUs). Fortunately, vendors have provided hard IP solutions to enable efficient data copy and faster development cycles. For example, Xilinx releases device support for the Alpha Data card [30] in the SDAccel development environment [ ]. As a consequence, users can focus on designing application-related AFUs and easily swap them into the device support to build customized CPU-FPGA acceleration platforms.

IBM integrates the Coherent Accelerator Processor Interface (CAPI) [27] into its Power8 and future systems, which provides virtual addressing, cache coherence and virtualization for PCIe-based accelerators (Fig. 2(b)). A coherent accelerator processor proxy (CAPP) unit is introduced into the processor to maintain coherence for the off-chip accelerator. Specifically, it maintains the directory of all cache blocks of the accelerator, and it is responsible for snooping the CPU bus for cache block status and data on behalf of the accelerator. On the FPGA side, IBM also supplies a power service layer (PSL) unit alongside the user AFU. The PSL handles address translation and
coherency functions while sending and receiving traffic as native PCIe-DMA packets. With the ability to access the coherent shared memory of the host core, the device DRAM and memory controller become optional for users.

Fig. 2. A tale of five CPU-FPGA platforms: (a) conventional PCIe-based platforms (e.g., the Alpha Data board and F1 instance); (b) the PCIe-based CAPI platform; (c) the QPI-based Xeon+FPGA v1; (d) Xeon+FPGA v2 with 1 QPI and 2 PCIe.
Intel Xeon+FPGA v1 [22] brings the FPGA one step closer to the processor via QPI, where an accelerator hardware module (AHM) occupies the other processor socket in a 2-socket motherboard. By using the QPI interconnect, data coherency is maintained between the last-level cache (LLC) in the processor and the FPGA cache. As shown in Fig. 2(c), an Intel QPI IP that contains a 64KB cache is required to handle coherent communication with the processor, and a system protocol layer (SPL) is introduced to provide address translation and request reordering to the user AFU. Specifically, a page table of 1024 entries, each associated with a 2MB page (2GB in total), is implemented in the SPL and loaded by the device driver at runtime. Though the current addressable memory is limited to 2GB and private high-density memory for the FPGA is not supported, this low-latency coherent interconnect has distinct implications for the programming models and overall processing models of CPU-FPGA platforms.
Xeon+FPGA v2 co-packages the CPU and FPGA to deliver even higher bandwidth and lower latency than discrete forms. As shown in Fig. 2(d), the communication between the CPU and FPGA is supported by two PCIe Gen3 x8 links and one QPI link (UPI in Skylake and later architectures). These are presented as virtual channels on the user interface. The FPGA logic is divided into two parts: the Intel-provided FPGA interface unit (FIU) and the user AFU. The FIU provides platform capabilities such as a unified address space, a coherent FPGA cache and partial reconfiguration of the user AFU, in addition to implementing the interface protocols for the three physical links. Moreover, a memory properties factory for higher-level memory services and semantics is supplied to provide a push-button development experience for end-users.
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
1:6 Y. Choi et al.
Fig. 3. Developer view of separate and shared memory spaces
2.2 CPU-FPGA Memory Models

Accelerators with physical addressing effectively adopt a separate address space paradigm (Fig. 3). Data shared between the host and device must be allocated in both the host-side CPU-attached memory and the private device DRAM, and explicitly copied between them by the host program. Although copying array-based data structures is straightforward, moving pointer-based data structures such as linked lists and trees presents complications. Also, separate address spaces cause data replication, resulting in extra latency and overhead. To mitigate this performance penalty, users usually consolidate data movement into one upfront bulk transfer from the host memory to the device memory. In this paper, the evaluated Alpha Data and Amazon F1 platforms fall into this category.
With tighter logical CPU-FPGA integration, the ideal case would be to have a unified shared address space between the CPU and FPGA. In this case (Fig. 3), instead of allocating two copies in both host and device memories, only a single allocation is necessary. This has a variety of benefits, including the elimination of explicit data copies, pointer semantics, and increased performance of fine-grained memory accesses. CAPI enables a unified address space through additional hardware module and operating system support; cacheline-aligned memory spaces allocated using the provided allocation API are allowed in the host program. Xeon+FPGA v1 provides the convenience of a unified shared address space using pinned host memory, which allows the device to directly access data at that memory location. However, users must rely on special APIs, rather than normal C or C++ allocation (e.g., malloc/new), to allocate pinned memory space.

Xeon+FPGA v2 supports both memory models by configuring the supplied memory properties factory, so that users can decide whether the benefit of having a unified address space outweighs the address translation overhead based on their use case.
2.3 Related Work
In this section we discuss three major categories of related work.
First, in addition to the commodity CPU-FPGA integrated platforms in Table 1, there is also a large body of academic work that focuses on how to efficiently integrate hardware accelerators into general-purpose processors. Yesil et al. [ ] surveyed existing custom accelerators and integration techniques for accelerator-rich systems in the context of data centers, but without a quantitative study as we did. Chandramoorthy et al. [ ] examined the performance of different design points, including tightly coupled accelerators (TCAs) and loosely coupled accelerators (LCAs) customized for computer vision applications. Cota et al. [ ] specifically analyzed the integration and interaction of TCAs and LCAs at different levels in the memory hierarchy. CAMEL [ ] featured reconfigurable fabric to improve the utilization and longevity of on-chip accelerators. All of these studies were done using simulated environments instead of commodity CPU-FPGA platforms.

Second, a number of approaches have been proposed to make accelerators more programmable by supporting shared virtual memory. NVIDIA introduced “unified virtual addressing” beginning with
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms 1:7
the Fermi architecture [ ]. The Heterogeneous System Architecture Foundation announced heterogeneous Uniform Memory Accesses (hUMA), which will implement the shared address paradigm in future heterogeneous processors [ ]. Cong et al. [ ] proposed supporting address translation using two-level TLBs and host page walks for accelerator-centric architectures. Shared virtual memory support for CPU-FPGA platforms has been explored in CAPI and the Xeon+FPGA family [ ]. This paper covers both the separate memory model (Alpha Data and F1 instance) and the shared memory model (CAPI, Xeon+FPGA v1 and v2).
Third, there is also a large body of work that evaluates modern CPU and GPU microarchitectures. For example, Fang et al. [ ] evaluated the memory system microarchitectures of commodity multicore and many-core CPUs. Wong et al. [ ] evaluated the microarchitectures of modern GPUs. This work is the first to evaluate the microarchitectures of modern CPU-FPGA platforms with an in-depth analysis.
This work aims to reveal how the underlying microarchitectures, i.e., processor or peripheral interconnect, and shared or private memory model, affect the performance of CPU-FPGA platforms. To achieve this goal, in this section we quantitatively study those microarchitectural characteristics, with a key focus on the effective bandwidth and latency of CPU-FPGA communication on five state-of-the-art platforms: Alpha Data, CAPI, Xeon+FPGA v1 and v2, and the Amazon F1 instance.¹
3.1 Experimental Setup
To measure the CPU-FPGA communication bandwidth and latency, we design and implement our own microbenchmarks, based on the Xilinx SDAccel SDK 2017.4 [ ] for Alpha Data and the F1 instance, the Alpha Data CAPI Design Kit [ ] for CAPI, and the Intel AALSDK 5.0.3 [ ] for Xeon+FPGA v1 and v2. Each microbenchmark consists of two parts: a host program and a computation kernel. Following each platform's typical programming model, we use the C language to write the host programs for all platforms, and describe the kernel design using OpenCL for Alpha Data and the F1 instance, and Verilog HDL for the other three platforms.
The hardware congurations of Alpha Data, CAPI, Xeon+FPGA v1 and v2, and Amazon F1 in
our study are listed in Table 2.
Table 2. Platform configurations of Alpha Data, F1, CAPI, Xeon+FPGA v1 and v2

  Platform        Alpha Data       CAPI              Xeon+FPGA v1     Xeon+FPGA v2      Amazon EC2 F1
  Host CPU        Xeon E5-2620v3   Power8 Turismo    Xeon E5-2680v2   Xeon E5-2600v4    Xeon E5-2686v4
  Host Memory     64GB DDR3-1600   16GB DDR3-1600    96GB DDR3-1600   64GB DDR4-2133    64GB DDR4-2133
  FPGA Fabric     Xilinx Virtex 7  Xilinx Virtex 7   Intel Stratix V  Intel Arria 10    Xilinx UltraScale+
  CPU-FPGA Link   PCIe Gen3 x8     PCIe Gen3 x8      Intel QPI        1×Intel QPI &     PCIe Gen3 x16
                                                                      2×PCIe Gen3 x8,
                                                                      25.6GB/s
  Device Memory   16GB DDR3-1600   16GB DDR3-1600    N/A              N/A               64GB DDR4-2133

Note: The device memory in CAPI is not used in this work. The user clock can be easily configured to 137/200/273 MHz using the supplied SDK, in addition to the max 400MHz frequency.
3.2 Eective Bandwidth
3.2.1 Eective Bandwidth for Alpha Data. Traditional CPU-FPGA platforms like Alpha Data contain
two communication phases: 1) PCIe-based direct memory access (DMA) between host memory
Results in this publication were generated using pre-production hardware or software, and may not reect the performance
of future products.
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
1:8 Y. Choi et al.
Fig. 4. Eective bandwidth of Alpha Data, CAPI, Xeon+FPGA v1 and v2, and F1 instance
(a) Alpha Data (b) Amazon EC2 F1
Fig. 5. PCIe-DMA bandwidth breakdown
and device memory, and 2) device memory access. We measure the eective bandwidths with
various payload sizes for both phases. The measurement results are illustrated in Fig. 4. Since the
bandwidths for both directions of the PCIe-DMA transfer are almost identical (less than 4% difference), we only present the unidirectional PCIe-DMA bandwidth in Fig. 4.
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms 1:9
While Fig. 4 illustrates a relatively high private DRAM bandwidth (9.5GB/s for read, 8.9GB/s for write), the PCIe-DMA bandwidth (1.6GB/s) reaches merely 20% of PCIe's advertised bandwidth (8GB/s). That is, the expectation of a high DMA bandwidth with PCIe is far from being fulfilled. The first reason is that there is non-payload data overhead on top of the useful payload transfer [ ]. In a PCIe transfer, a payload is split into small packets, each equipped with a header. Along with the payload packets, a large number of control packets are also transferred through PCIe. As a result, the maximum supplied bandwidth for the actual payload, which we call the theoretical bandwidth, is already smaller than the advertised value.
Another important reason is that a PCIe-DMA transaction involves not only the PCIe transfer, but also host buffer allocation and host memory copy [ ]. The host memory stores user data in a pageable (unpinned) space from which the FPGA cannot directly retrieve data. A page-locked (pinned), physically contiguous memory buffer in the operating system kernel space serves as a staging area for the PCIe transfer. When a PCIe-DMA transaction starts, a pinned buffer is first allocated in the host memory, followed by a memory copy of the pageable data to this pinned buffer. The data is then transferred from the pinned buffer to the device memory through PCIe. These three steps (buffer allocation, host memory copy, and PCIe transfer) are sequentially processed in Alpha Data, which significantly decreases the PCIe-DMA bandwidth.
Moreover, there could be some implementation deficiencies in the vendor-provided environment that serve as another source of overhead.³ One possibility could be the data transfer overhead between the endpoint of the PCIe channel, i.e., the vendor-provided FPGA DMA IP, and the FPGA-attached device DRAM. Specifically, the device DRAM does not directly connect to the PCIe channel; instead, the data from the host side first reach the on-chip DMA IP and are then written into the device DRAM through the vendor-provided DRAM controller IP. If this extra step is not well overlapped with the actual data transfer over the PCIe channel through careful pipelining, it further reduces the effective bandwidth. Our experiments show that a considerable gap still exists between the measured bandwidth and the theoretical value, indicating that the vendor-provided environment could potentially be improved with further performance tuning.
Next, we quantitatively evaluate the large PCIe-DMA bandwidth gap step by step, with results
shown in Fig. 5.
- The non-payload data transfer lowers the theoretical PCIe bandwidth to 6.8GB/s from the advertised 8GB/s [20].
- Possible implementation deficiencies in the vendor-provided environment prevent the 6.8GB/s PCIe bandwidth from being fully exploited. As a result, the highest achieved effective PCIe-DMA bandwidth without buffer allocation and host memory copy decreases to 5.2GB/s.
- The memory copy between the pageable and pinned buffers further degrades the PCIe-DMA bandwidth to 2.7GB/s.
- The buffer allocation overhead degrades the final effective PCIe-DMA bandwidth to only 1.6GB/s. This is the actual bandwidth that end-users can obtain.
3.2.2 Eective Bandwidth for CAPI. CPU-FPGA platforms that realize the shared memory model,
such as CAPI, Xeon+FPGA v1 and v2, allow the FPGA to retrieve data directly from the host
memory. Such platforms therefore contain only one communication phase: host memory access
through the communication channel(s). For the PCIe-based CAPI platform, we simply measure the
eective read and write bandwidths of its PCIe channel for a variety of payload sizes, as shown in
Fig. 4.
If not specically indicated, the bandwidth appearing in the remainder of this paper refers to the maximum achievable
3The Xilinx SDAccel environment is close-sourced, so we were not able to pin-point this overhead.
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
1:10 Y. Choi et al.
Compared to the Alpha Data board, CAPI supplies end-users with a much higher effective PCIe bandwidth (3.3GB/s vs. 1.6GB/s). This is because CAPI provides efficient API support for application developers to directly allocate and manipulate pinned memory buffers, eliminating the memory copy overhead between the pageable and pinned buffers. However, Alpha Data's private local memory read and write bandwidths (9.5GB/s, 8.9GB/s) are much higher than those of CAPI's shared remote memory access. This phenomenon offers opportunities for both platforms. As will be discussed in Section 4, if an accelerator is able to efficiently use the private memory as a "shortcut" for accessing the host memory, it will probably obtain a similar or even higher effective CPU-FPGA communication bandwidth on a traditional platform like Alpha Data than on a shared-memory platform like CAPI or Xeon+FPGA.
Another remarkable phenomenon shown in Fig. 4 is the dramatic falloff of the PCIe bandwidth at the 4MB payload size. This could be a side effect of CAPI's memory coherence mechanism. CAPI shares the last-level cache (LLC) of the host CPU with the FPGA, and the data access latency varies significantly between LLC hits and misses. Therefore, one possible explanation for the falloff is that CAPI shares 2MB of the 8MB LLC with the FPGA. Payloads no larger than 2MB can fit into the LLC, resulting in a low LLC hit latency that is well amortized over a few megabytes of data. Nevertheless, when the payload size grows to 4MB and can no longer fit into the LLC, the average access latency of the payload data suddenly increases, leading to the observed falloff. As the payload size continues to grow, this higher latency is gradually amortized, and the PCIe bandwidth gradually reaches its maximum value.
3.2.3 Eective Bandwidth for Xeon+FPGA v1. The CPU-FPGA communication of Xeon+FPGA v1
involves only one step: host memory access through QPI; therefore, we just measure a set of eective
read and write bandwidths for dierent payload sizes, as shown in Fig. 4. We can see that both the
read and write bandwidths (7.0GB/s, 4.9GB/s) are much higher than the PCIe bandwidths of Alpha
Data and CAPI. That is, the QPI-based CPU-FPGA integration demonstrates a higher eective
bandwidth than the PCIe-based integration. However, the remote memory access bandwidths of
Xeon+FPGA v1 are still lower than those of Alpha Data’s local memory access. Thus, similar to
CAPI, Xeon+FPGA v1 can possibly be outperformed by Alpha Data if an accelerator keeps reusing
the data in the device memory as a “shortcut” for accessing the host memory.
We need to mention that Xeon+FPGA v1 provides a 64KB cache on its FPGA chip for coherency
purposes [
]. Each CPU-FPGA communication will rst go through this cache and then go to the
host memory if a cache miss happens. Therefore, the CPU-FPGA communication of Xeon+FPGA v1
follows the classic cache access pattern. Since the bandwidth study mainly focuses on large payloads,
our microbenchmarks simply ush the cache before accessing any payload to ensure all requests
go through the host memory. The bandwidths illustrated in Fig. 4are, more accurately, miss
bandwidths. Section 3.3.1 discusses the cache behaviors in detail.
3.2.4 Eective Bandwidth for Xeon+FPGA v2. While the CPU-FPGA communication of Xeon+FPGA v2
involves only one step as well, Xeon+FPGA v2 allows the user accelerator to operate at dierent
clock frequencies. Therefore, we measure the eective bandwidths of various payloads sizes at
200MHz and 400MHz. 200MHz is the frequency that is also used by Alpha Data and Xeon+FPGA v1;
400MHz is the maximal frequency supported by Xeon+FPGA v2. Note that this change does not
aect the frequency of the internal logic of the communication channels, but just the interface
between the channels and the user accelerator. Fig. 4illustrates the measurement results. We can
see that Xeon+FPGA v2 outperforms the aforementioned three CPU-FPGA platforms in terms of
eective bandwidth (20GB/s at 400MHz, 12.8GB/s at 200MHz). This is because Xeon+FPGA v2
connects the CPU and the FPGA through three communication channels—one QPI channel and
two PCIe channels—resulting in a signicantly high aggregate bandwidth.
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms 1:11
Like its rst generation, Xeon+FPGA v2 also provides a 64KB on-chip cache for coherency
purposes. However, this cache maintains coherence only for the QPI channel, and the two PCIe
channels have no coherence properties. As a consequence, Xeon+FPGA v2 actually delivers a
partially coherent memory model. We will discuss the cache behaviors of the Xeon+FPGA family
in Section 3.3.1, and the coherence issue in Section 4.
3.2.5 Eective Bandwidth for F1 Instance. The Amazon EC2 F1 instance represents the state-of-
the-art advance of the canonical PCIe-based platform architecture. Like the Alpha Data board, it
connects the CPU with the FPGA through the PCIe channel, with private memory attached to both
components. It is powered by the Xilinx SDAccel environment as well to achieve the behavior-level
hardware accelerator development. Both the PCIe channel and the private DRAM are upgraded to
supply higher bandwidths. However, the end-to-end bandwidth delivered to the end user is rather
surprising. As illustrated in Fig. 4, the eective PCIe bandwidth of the F1 instance turns out to
be even worse than that of the Alpha Data board which adopts an old-generation technique. The
breakdown in Fig. 5shows that the F1 PCIe bandwidth is twice the amount of the Alpha Data
bandwidth if the buer allocation and the memory copy overhead are not considered. This suggests
that the buer allocation and memory copy impose more overhead on the F1 instance than on the
Alpha Data board. It might be due to the virtualization overhead of the F1 instance.
As mentioned before, the fact that the CPU-FPGA bandwidth of Xeon+FPGA v1 lies between the PCIe
bandwidth and the private device DRAM bandwidth of the Alpha Data board provides opportunities
for both platforms. The F1 instance and Xeon+FPGA v2 form another (more advanced) pair of CPU-
FPGA platforms that follows this relation. We expect that this relation between a private-memory
platform and a shared-memory platform will continue to exist in future CPU-FPGA platforms,
and the platform suitability will depend on the characteristics of each application. This will be
discussed in Section 4.
3.3 Eective Latency
3.3.1 Coherent Cache Behaviors. As described in Sections 3.2.3 and 3.2.4, the QPI channel of the
Xeon+FPGA family includes a 64KB cache for coherence purposes, and the QPI-based communication
thus follows the classic cache access pattern. A cache transaction is typically characterized by its
hit time and miss penalty. We follow this traditional methodology for the cache study and quantify the
hit time and miss latency of the Xeon+FPGA coherent cache, as shown in Table 3.
Table 3. CPU-FPGA access latency in Xeon+FPGA
Access Type Latency (ns)
Read Hit 70
Write Hit 60
Read Miss avg: 355
Write Miss avg: 360
A noteworthy phenomenon is the long hit time: 70ns (14 FPGA cycles) for a read hit and 60ns (12
FPGA cycles) for a write hit in this 64KB cache. We investigate this phenomenon by decomposing
the hit time into three phases (address translation, cache access, and transaction reordering) and
measuring the elapsed time of each phase, as shown in Table 4.
The data reveal a possibly exorbitant price (up to 100% extra time) paid for address translation
and transaction reordering. Worse still, the physical cache access latency itself is prohibitively
high: 35ns (7 FPGA cycles). Given this small but long-latency cache, it is extremely hard, if not
impossible, for an accelerator to harness the caching functionality.
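To make this cost concrete, the hit and miss numbers in Table 3 can be folded into a simple average-memory-access-time model. The sketch below is illustrative: the read latencies come from our measurements, while the hit rate is a free parameter.

```c
#include <assert.h>

/* Average read access time (ns) of the Xeon+FPGA coherent cache,
 * using the measured read latencies from Table 3. */
double amat_ns(double hit_rate) {
    const double t_hit_ns  = 70.0;   /* read hit: 14 cycles at 200MHz */
    const double t_miss_ns = 355.0;  /* average read miss             */
    return hit_rate * t_hit_ns + (1.0 - hit_rate) * t_miss_ns;
}
```

Even a perfect hit rate still costs 14 FPGA cycles per access, versus roughly one cycle for an on-chip BRAM read, which is why the cache is so hard to exploit.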
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
1:12 Y. Choi et al.
Table 4. Hit latency breakdown in Xeon+FPGA
Access Step Read Latency (ns) Write Latency (ns)
Address Translation 20 20
Cache Access 35 35
Transaction Reordering 15 5
It is worth noting that the Xeon+FPGA v2 platform supports higher clock frequencies than
its first generation, and thus can potentially achieve a lower cache access latency. However, this
does not change the fact that the latency of accessing the coherent cache is still
much longer than that of accessing the on-chip BRAM blocks. Therefore, Xeon+FPGA v2 does
not fundamentally improve the usability of the coherent cache. As discussed in Section 4, we still
suggest that accelerator designers stick to the conventional FPGA design principle of explicitly
managing the on-chip BRAM resources.
3.3.2 Communication Channel Latencies. We now compare the effective latencies among the PCIe
transfer of Alpha Data, the F1 instance and CAPI, the device memory access of Alpha Data and the F1
instance, and the QPI transfer of the Xeon+FPGA family. Table 5 lists the measured latencies of
all five platforms for transferring a single 512-bit cache block (since all of them have the same
512-bit interface bitwidth). We can see that the QPI transfer delivers orders-of-magnitude lower
latency than the PCIe transfer, and its latency is even smaller than that of Alpha Data's or Amazon F1's
private DRAM access. This rather surprising observation is largely due to the implementation of
the vendor-provided environment. In particular, Xilinx SDAccel connects the accelerator circuit to
the FPGA-attached DRAM through not only the DRAM controller but also an AXI interface that
is implemented on the FPGA chip. The data movement back and forth through the AXI interface imposes a
significant overhead on the effective device DRAM access latency, with the result that the
local DRAM access latency of Alpha Data is even longer than the remote memory access latency of
Xeon+FPGA. This phenomenon implies that a QPI-based platform is preferable for applications with
fine-grained CPU-FPGA interaction. In addition, we can see that CAPI's PCIe transfer latency is
much lower than that of the Alpha Data board. This is because the Alpha Data board harnesses the
SDAccel SDK, which enables accelerator design and integration through high-level programming
languages. Such a higher level of abstraction introduces extra CPU-FPGA communication
overhead in processing the high-level APIs.
Table 5. Latencies of transferring a single 512-bit cache block

Platform   Alpha Data    CAPI    Xeon+FPGA v1   Xeon+FPGA v2   Amazon EC2 F1
Latency    PCIe: 160µs   882ns   355ns          323ns          PCIe: 127µs
           DRAM: 542ns                                         DRAM: 561ns
Based on our quantitative studies, we now analyze how these microarchitectural characteristics
affect the performance of CPU-FPGA platforms, and propose seven insights for platform users (to
optimize their accelerator designs) and platform designers (to improve the hardware and software
support in future CPU-FPGA platform development).
For simplicity, we mainly discuss the CPU-to-FPGA read case; the observation is similar for the FPGA-to-CPU write case.
While we are not able to fully explain the long latency of the Xilinx platforms, we have confirmed with Xilinx that the
phenomenon is observed by Xilinx as well, and that the AXI bus is one of the major causes.
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms 1:13
4.1 Insights for Platform Users
Insight 1: Application developers should never use the advertised communication parameters, but
measure the practically achievable parameters to estimate the CPU-FPGA platform performance.
Experienced accelerator designers are generally aware of the data transfer overhead from
non-payload data, e.g., packet headers, checksums, control packets, etc., and expect approximately 10%
to 20% or even less bandwidth degradation. Quite often, this results in a significant overestimation of
the end-to-end bandwidth due to unawareness of the overhead generated by the system software
stack, like the host-to-kernel memory copy discussed in this paper. As analyzed in Section 3, the
effective bandwidth provided by a CPU-FPGA platform to end users is often far worse than the
advertised value that reflects the physical limit of the communication channel. For example, the
PCIe-based DMA transfer of the Alpha Data board fulfills only 20% of the 8GB/s bandwidth of the
PCIe Gen3 x8 channel; the Amazon F1 instance, which adopts a more advanced data communication
technique, delivers an even worse effective bandwidth to the end user. Evaluating a CPU-FPGA
platform using these advertised values will probably result in a significant overestimation of the
platform performance. Worse still, even the relatively low effective bandwidth is not always achievable.
In fact, the communication bandwidth for a small payload is up to two orders of magnitude smaller
than the maximum achievable effective bandwidth. A specific application may not always be able to
supply each communication transaction with a sufficiently large payload to reach a high bandwidth.
Platform users need to consider this issue as well in platform selection.
Insight 2: In terms of effective bandwidth, both the private-memory and shared-memory platforms
have opportunities to outperform each other. The key metric is the device memory reuse ratio r.
Bounded by the low-bandwidth PCIe-based DMA transfer, the Alpha Data board generally
reaches a lower CPU-FPGA effective bandwidth than a shared-memory platform like CAPI
or Xeon+FPGA v1. The higher private memory bandwidth, however, does provide opportunities
for a private-memory platform to perform better in some cases. For example, given 1GB of input data
sent to the device memory through PCIe, if the FPGA accelerator iteratively reads the data a
large number of times, then the low DMA bandwidth will be amortized by the high private memory
bandwidth, and the effective CPU-FPGA bandwidth will be nearly equal to the private memory
bandwidth, which is higher than that of the shared-memory platform. Therefore, the data reuse
of the FPGA's private DRAM determines the effective CPU-FPGA bandwidth of a private-memory
platform, and whether it can achieve a higher effective bandwidth than a shared-memory platform.
Quantitatively, we dene the device memory reuse ratio, r, as:
r=Íde vSde v
Ídma Sd ma
Íde vSde v
denotes the aggregate data size of all device memory accesses, and
Ídma Sd ma
denotes the aggregate data size of all DMA transactions between the host and the device memory.
Then, the eective CPU-FPGA bandwidth for a private-memory platform,
bwc pu f pдa
, can be
dened as:
bwc pu f pдa=1
bwd ev
bwd ev
denote the bandwidths of the DMA transfer and the device memory access,
The above formula suggests that larger
leads to higher eective CPU-FPGA bandwidth. It
is worth noting that since the FPGA on-chip BRAM data reuse is typically important for FPGA
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
1:14 Y. Choi et al.
design optimization, the above finding suggests that accelerator designers using a private-memory
platform need to consider both on-chip BRAM data reuse and off-chip DRAM data reuse. Moreover,
by comparing this effective CPU-FPGA bandwidth of a private-memory platform to the DRAM
bandwidth of a shared-memory platform, we can derive a threshold device memory reuse ratio,
r_threshold. If the r value of an application is larger than r_threshold, the private-memory platform will
achieve a higher bandwidth, and vice versa. This can serve as an initial guideline for application
developers to choose the appropriate platform for a specific application. One example is the
logistic regression application, whose computation kernel is a series of matrix-vector multiplication
operations that iterate over the same input matrix. Our matrix-vector multiplication case study in
Section 5.2 demonstrates that the Alpha Data board starts to outperform Xeon+FPGA v1 when the
iteration number grows beyond 7, which is the value of r_threshold for Alpha Data and Xeon+FPGA v1.
Also, this phenomenon continues to exist between the next-generation platforms, i.e., the Amazon
EC2 F1 instance and Xeon+FPGA v2. This suggests that the proposed device memory reuse ratio is
not merely applicable to the platforms evaluated in this paper, but also provides guidance for the
selection of private-memory versus shared-memory CPU-FPGA platforms across generations.
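The reuse-ratio model of Insight 2 can be turned into a small calculator. The sketch below (bandwidth figures in GB/s are illustrative inputs; effective_bw models the amortized DMA-plus-device-DRAM path) solves for the threshold reuse ratio at which a private-memory platform matches a shared-memory one:

```c
#include <assert.h>
#include <math.h>

/* Effective CPU-FPGA bandwidth of a private-memory platform for a given
 * device memory reuse ratio r (Insight 2's formula). */
double effective_bw(double r, double bw_dma, double bw_dev) {
    return 1.0 / (1.0 / (r * bw_dma) + 1.0 / bw_dev);
}

/* Reuse ratio at which the private-memory platform matches a shared-memory
 * platform of bandwidth bw_shared (requires bw_shared < bw_dev). */
double r_threshold(double bw_dma, double bw_dev, double bw_shared) {
    return (bw_dev * bw_shared) / (bw_dma * (bw_dev - bw_shared));
}
```

Any r above the threshold favors the private-memory platform. Note that the empirical threshold of 7 observed in Section 5.2 also folds in end-to-end effects that this idealized model omits.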
Fundamentally, the device memory reuse ratio quantifies the trade-off between private-memory
and shared-memory platforms based on the following two observations. First, local memory access
is usually faster than remote memory access. Using the same technology, the private-memory
platform achieves a higher device memory access bandwidth compared to the shared-memory
platform, which retrieves data from the CPU-attached memory. Second, the end-to-end CPU-FPGA
data transfer routine of the shared-memory platform is a subset of that of the private-memory
platform. Specifically, the routine of the private-memory platform contains data transfers 1) from
CPU-attached memory to FPGA, 2) from FPGA to FPGA-attached memory, and 3) from FPGA-
attached memory to FPGA; whereas the shared-memory platform performs only the first step.
Since these observations are not likely to change over time, we expect that the trade-off between
these two types of platforms will continue to exist, and the device memory reuse ratio will remain
a critical parameter.
Insight 3: In terms of effective latency, the shared-memory platform generally outperforms the private-
memory platform, and the QPI-based platform outperforms the PCIe-based platform.
As shown in Table 5, the shared-memory platform generally achieves a lower communication
latency than the private-memory platform with the same communication technology (CAPI vs.
Alpha Data). This is because the private-memory platform first caches the data in its device memory
and then allows the FPGA to access the data, resulting in a longer communication routine. This
advantage, together with an easier programming model, motivates the new trend of CPU-FPGA
platforms with a PCIe connection and coherent shared memory, such as CAPI and CCIX.
Meanwhile, compared to the PCIe-based platform, the QPI-based platform brings the FPGA closer
to the CPU, leading to a lower communication latency. Therefore, a QPI-based, shared-memory
platform is preferred for latency-sensitive applications, especially those that require frequent
(random) fine-grained CPU-FPGA communication. Examples like high-frequency trading
(HFT), online transaction processing (OLTP), or autonomous driving might benefit from the low
communication latency of the QPI channel. Compared to Xeon+FPGA systems, the major advantage
of the PCIe-based shared-memory system is its extensibility to more FPGA boards in large-scale
deployments.
Insight 4: CPU-FPGA communication is critical to some applications, but not all. The key metric is
the computation-to-communication (C2C) ratio.
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms 1:15
Double buering and dataow are well-used techniques in accelerator design optimizations. Such
techniques can realize a coarse-grained data processing pipeline by overlapping the computation
and data communication processes. As a result, the performance of the FPGA accelerator is generally
bounded by the coarse-grained pipeline stage that consumes more time. Based on this criterion,
FPGA accelerators can be roughly categorized into two classes: 1) computation-bounded ones
where the computation stage takes a longer time, and 2) communication-bounded ones where the
communication stage takes longer time.
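This coarse-grained overlap can be captured by a simple timing model. The sketch below (tile counts and per-stage times are hypothetical inputs) models a two-stage double-buffered pipeline whose steady-state rate is set by the slower stage:

```c
#include <assert.h>

/* Total time of processing n_tiles through a two-stage (transfer, compute)
 * double-buffered pipeline: the first tile pays both stages, while every
 * later tile is hidden behind the slower stage. */
double pipeline_time(int n_tiles, double t_transfer, double t_compute) {
    double t_max = t_transfer > t_compute ? t_transfer : t_compute;
    return t_transfer + t_compute + (n_tiles - 1) * t_max;
}
```

As n_tiles grows, the throughput approaches 1/max(t_transfer, t_compute): the accelerator is computation-bounded when t_compute dominates and communication-bounded otherwise.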
If an accelerator is communication-bounded, then a better CPU-FPGA communication stack
will greatly improve its overall performance. As demonstrated in our matrix-vector multiplication
case study (Section 5.2), the overall performance of the FPGA accelerator is determined by the
effective CPU-FPGA communication bandwidth that the platform can provide. In this case, the
high-bandwidth F1 instance and Xeon+FPGA v2 platform are preferred. On the other hand, if an
accelerator is computation-bounded, then switching to another platform with a better communication
stack does not make a considerable difference. As an example of a computation-bounded accelerator, the
matrix multiplication accelerator performs almost the same on Alpha Data, CAPI, the Xeon+FPGA
family, and the F1 instance when scaled to the same 200MHz frequency. In this case, the traditional
PCIe-based private-memory platform may be preferred because of its good compatibility. One
may even prefer not to choose the platform with the best CPU-FPGA communication technology,
for cost efficiency. Application developers should find out whether the application to accelerate is
compute-intensive or communication-intensive in order to select the appropriate platform.
Quantitatively, we use the computation-to-communication (C2C) ratio [ ] (which is also named
"memory intensity" in [ ]) to judge whether a computation kernel is computation- or communication-
bounded. Specifically, the C2C ratio is defined as the ratio of the computation throughput to
the data transfer throughput:

    C2C ratio = Throughput_compute / Throughput_transfer

The computation throughput refers to the speed at which a given FPGA accelerator processes a
certain size of input; the data transfer throughput refers to the speed of transferring this
input into or out of the FPGA fabric. When the C2C ratio of a kernel is above 1, the
kernel is computation-bounded; otherwise, it is communication-bounded. In general, the data
transfer time is linearly proportional to the input size. Therefore, a computation kernel with
super-linear time complexity, such as matrix multiplication, is computation-bounded. Meanwhile,
computation kernels with linear time complexity, like matrix-vector multiplication, are often
bounded by the CPU-FPGA communication. For computation-bounded kernels, the CPU-FPGA
communication bandwidth is not the performance bottleneck, so accelerator designers do
not need to chase high-end communication interfaces. On the other hand, CPU-FPGA
communication is critical to communication-bounded kernels with a C2C ratio less than 1, and the
efficiency of the communication interface is then a key factor in platform selection.
Insight 5: CPU-FPGA memory coherence is promising, but impractical to use in accelerator design,
at least for now.
The newly-announced CPU-FPGA platforms, including CAPI, CCIX and the Xeon+FPGA family,
attempt to provide memory coherence support between the CPU and the FPGA, either through
PCIe or QPI. Their implementation methodology is similar: constructing a coherent cache on
the FPGA fabric that realizes the classic snoopy protocol with the last-level cache of the host CPU.
However, although the FPGA fabric supplies a few megabytes of on-chip BRAM blocks, only 64KB
(the Xeon+FPGA family) or 128KB (CAPI) of them are organized into the coherent cache. That
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
1:16 Y. Choi et al.
is, these platforms maintain memory coherence for less than 5% of the on-chip memory space,
leaving the majority as scratchpads whose coherence needs to be maintained by application
developers. Although users may choose to ignore the 95% scratchpads and store data on chip
only through the coherent cache to obtain transparent coherence maintenance, this approach is
clearly inefficient. For one thing, the coherent cache has a much longer access latency than
the scratchpads. Also, the coherent cache provides far less parallel access capability
compared to the scratchpads, which can potentially feed thousands of data elements per cycle. As a consequence,
application developers may still have to stick to the conventional FPGA accelerator design principle
of explicitly managing the coherence of the scratchpad data.
While the current implementation of CPU-FPGA memory coherence seems impractical
due to the aforementioned prohibitively high overhead, the methodology does create great
potential for reducing FPGA programming effort. The coherent cache is particularly beneficial
for computation kernels with unforeseeable memory access patterns, such as hashing. As will be
discussed in Insight 7, implementing the coherent cache on the FPGA fabric considerably restricts
its capacity, latency, and bandwidth. If the cache were implemented as a dedicated ASIC memory block
in future FPGA platforms, application developers could harness the power of cache coherence.
4.2 Insights for Platform Designers
Insight 6: There still exists large room for improvement in bridging the gap between the
practically achieved bandwidth and the physical limit of the communication channel.
For example, none of Alpha Data, CAPI or the Amazon F1 instance fulfills the 8GB/s PCIe bandwidth,
even without considering the overhead of pinned memory allocation and pageable-to-pinned memory
copy. Meanwhile, allowing direct pageable data transfer through PCIe, as realized in the CAPI
platform, proves to be a good alternative for alleviating the communication overhead.
Another alternative is to offer end users the capability to directly manipulate pinned memory
space. For example, both the Xeon+FPGA family and unified virtual addressing (CUDA for GPU)
provide efficient and flexible API support that allows software developers to operate on allocated
pinned memory arrays or objects just like those allocated by [ ]. Nevertheless, these
solutions result in "fragmented" payloads, i.e., the payload data may be stored in discrete memory
pages, causing reduced communication bandwidth.
Another alternative optimization is to form the CPU-FPGA communication stack into a coarse-
grained pipeline, like the CUDA streams technique in GPUs. This may slightly increase the
communication latency of an individual payload, but can significantly boost the throughput of
CPU-FPGA communication for concurrent transactions.
Both approaches should solve the problem, and the solution has been verified by two of our
follow-up studies. Guided by this insight, [ ] proposes a new programming environment for
streaming applications that achieves a much higher effective bandwidth on the Amazon F1 instance;
[ ] proposes a deep pipeline stack to overlap the communication and computation steps. More
discussions are provided in Section 5.3.
Insight 7: The coherent cache design could be greatly improved.
The coherent cache of the recently-announced CPU-FPGA platforms aims to provide the classic
functionalities of CPU caches: data caching and memory coherence that are transparent to
programmers. However, the long latency and small capacity make this component impractical to
use efficiently in FPGA accelerator design.
One important reason for such a long-latency, small-capacity design is that the coherent cache is
implemented on the FPGA fabric. Therefore, compared to its CPU cache counterpart, the FPGA-
based coherent cache has a much lower frequency and thus much worse performance. One
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms 1:17
possible approach to address this issue is to move the coherent cache module out of the FPGA fabric
as a hard ASIC circuit. This could potentially reduce the difference between the FPGA's
cache latency and the scratchpad latency, and also enlarge the cache capacity. As a result, the cache could
be used more efficiently in FPGA accelerator designs.
Another important reason is that the cache structure generally has a very limited number of
data access ports. In contrast, the BRAM blocks on the FPGA fabric can be partitioned to supply
thousands of data elements in parallel. To exploit such a parallel data supply, it is also common practice to
assign a dedicated BRAM buffer to multiple processing elements in FPGA accelerator design (e.g.,
[ ]). Since massive parallelism is a widely adopted way to achieve FPGA acceleration, future cache
designs may also need to take this into consideration; e.g., a distributed, many-port cache design
might be preferred to a centralized, single-port design.
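To illustrate the parallel data supply that a single-port cache cannot match, here is a hypothetical HLS-style kernel sketch (the function and array names are our own; the HLS pragmas follow Xilinx Vivado HLS syntax and are ignored by an ordinary C compiler). The scratchpad is partitioned cyclically so that all PE reads in one unrolled iteration hit different BRAM banks in the same cycle:

```c
#define N  1024   /* elements in the scratchpad        */
#define PE 16     /* processing elements fed per cycle */

/* Each PE accumulates its own strided slice of the scratchpad. With the
 * cyclic partition, buf[i*PE+0] .. buf[i*PE+PE-1] live in PE different
 * BRAM banks, so one pipelined iteration can read all of them at once. */
void pe_array(const float in[N], float out[PE]) {
    float buf[N];
#pragma HLS array_partition variable=buf cyclic factor=16 dim=1
    for (int i = 0; i < N; i++)   /* burst-fill the scratchpad once */
        buf[i] = in[i];
    for (int p = 0; p < PE; p++)
        out[p] = 0.0f;
    for (int i = 0; i < N / PE; i++) {
#pragma HLS pipeline II=1
        for (int p = 0; p < PE; p++)   /* fully unrolled by the HLS tool */
            out[p] += buf[i * PE + p];
    }
}
```

A coherent cache with one or two ports would serialize these sixteen reads; the partitioned scratchpad serves them all per cycle, at the cost of the designer managing coherence explicitly.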
To demonstrate the usefulness of our proposed insights, we conduct two case studies that utilize
the insights (for platform users) to optimize accelerator designs: matrix multiplication and
matrix-vector multiplication. These two applications share the same basic operations, floating-point
addition and multiplication, but belong to different categories: the former is computation-bounded,
and the latter is communication-bounded. We use them to demonstrate how to leverage our two
proposed metrics, the device memory reuse ratio r and the computation-to-communication ratio
C2C, to choose the right platform.
The basic settings of the two cases are almost identical. The matrices are
4096×4096, and the vectors are 4096-dimensional. The input data are initially stored in the
CPU-attached memory, and the output data are stored back to the CPU-attached memory. The
same systolic-array-based matrix multiplication accelerator design presented in [ ] is implemented
on all five platforms, and the matrix-vector multiplication design has a similar architecture. For
comparison purposes, the computation architecture is almost identically designed on all
platforms; thus the capacity of the FPGA fabric has no effect on our case evaluation.
5.1 Matrix Multiplication
Given two 4096×4096 floating-point matrices as input and a 4096×4096 matrix as output, the computation time
complexity (O(n³) for n×n matrices) is higher than that of data communication (O(n²)), resulting in the algorithm
being computation-bounded. Fig. 6 illustrates the accelerator performance on the five platforms.
We can see that the performance is proportional to the frequency of the accelerator design. In
detail, since Alpha Data and Xeon+FPGA v1 have the same operating frequency, 200MHz, the
same accelerator design leads to almost identical performance. For the Xeon+FPGA v2 platform that
supports multiple frequencies, we configure the accelerator under both 200MHz and 273MHz to 1)
make a fair comparison with the other platforms, and 2) demonstrate the impact of frequency. The results
show that the 200MHz design on the Xeon+FPGA v2 platform does not achieve any superiority
over those on Alpha Data and Xeon+FPGA v1, even though it has a much higher CPU-FPGA
communication bandwidth. Meanwhile, the CAPI platform and the F1 instance, with their 250MHz
frequency, deliver better performance than the 200MHz designs on the other platforms.
The fundamental reason for these results is that the matrix multiplication design is computation-
bounded. The data communication is therefore completely overlapped by the accelerator
computation, which is determined by the operating frequency given the same circuit design. This
verifies our insight that a carefully designed accelerator may diminish the impact of CPU-FPGA
communication if it is computation-bounded. Application developers can thus focus on other
factors in platform selection, like compatibility, cost, etc.
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
1:18 Y. Choi et al.
Fig. 6. Matrix multiplication kernel execution time
Fig. 7. Matrix-vector multiplication kernel execution time
5.2 Matrix-Vector Multiplication
Given a 4096×4096 floating-point matrix and a 4096-dimensional vector as input, and a 4096-dimensional
vector as output, the computation time complexity (O(n²)) is the same as that of the data
communication (O(n²)). The algorithm is therefore generally communication-bounded. To avoid interference from
the accelerator design and the operating frequency, we use the same design on Alpha Data and
Xeon+FPGA under the same 200MHz frequency. In order to demonstrate the impact of the device
memory reuse ratio, we iteratively perform the matrix-vector multiplication with the same matrix,
but with the updated vector generated in the previous iteration. This is the typical computation
pattern of the widely used logistic regression application.
Fig. 7 illustrates the performance on the different platforms with various iteration numbers. Note
that the accelerator performance on Xeon+FPGA is not affected by the iteration number, so we
show just one value for it. We can see that the accelerator performance on the Alpha Data board
improves as the iteration number increases, because the one-time, low-bandwidth PCIe
transfer is amortized by the high-bandwidth device memory access. After the iteration number
(device memory reuse ratio r) grows beyond 7, the value of r_threshold between Alpha Data and
Xeon+FPGA, the Alpha Data board starts to outperform Xeon+FPGA.
5.3 Optimization of Communication Overhead
While the case studies demonstrate how the insights work, a set of follow-up studies
demonstrate the effectiveness of the insights in accelerator design and platform optimization.
In Section 3, we claimed that the effective host-to-FPGA bandwidth is reduced by various
factors such as PCIe packet header overhead, host buffer allocation, and host memory copy. In
ACM Trans. Recong. Technol. Syst., Vol. 1, No. 1, Article 1. Publication date: January 2018.
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms 1:19
addition to the analysis we provided, this claim may also be verified by optimizing away the aforementioned overheads and measuring the performance improvement. We summarize these studies as follows. In ST-Accel [26], a new communication library is proposed to provide a direct connection between the I/O and the FPGA, thereby bypassing the host DRAM and FPGA DRAM data copies. This improved the effective host-to-FPGA bandwidth from 2.59 GB/s to 11.89 GB/s. In [13], the data transfer from FPGA to JVM is pipelined with the FPGA computation and the host-JVM communication, effectively hiding the low PCIe bandwidth. This improves the throughput by 4.9× on average. The work in [8] provides another case study on DNA sequencing where the host-to-FPGA bandwidth is improved by batch processing and by sharing the FPGA accelerator among multiple threads. These techniques reduce the communication overhead from 1000× the computation time down to only 16% of the overall execution time.
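The pipelining idea behind these optimizations can be illustrated with a small host-side sketch (hypothetical transfer and compute stages, not the actual ST-Accel or JVM-bridge code): while the accelerator computes on batch i, the host transfers batch i+1, so the slower stage, rather than the sum of both, bounds throughput.

```python
import queue
import threading

def pipelined_offload(batches, transfer, compute):
    """Overlap (simulated) host-to-FPGA transfer with accelerator compute
    using a bounded queue as a double buffer."""
    q = queue.Queue(maxsize=2)  # at most two batches in flight

    def producer():
        for b in batches:
            q.put(transfer(b))  # stage 1: DMA transfer (simulated)
        q.put(None)             # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (item := q.get()) is not None:
        results.append(compute(item))  # stage 2: accelerator compute (simulated)
    return results
```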
To the best of our knowledge, this is the first comprehensive study that aims to evaluate and analyze the microarchitectural characteristics of state-of-the-art CPU-FPGA platforms in depth. The paper covers all the latest-announced shared-memory platforms, as well as the traditional private-memory Alpha Data board and the Amazon EC2 F1 instance, with detailed data published (most of which are not available from public datasheets). We found that the advertised communication parameters are often too ideal to be delivered to end users in practice, and we suggest that application developers avoid overestimating platform performance by considering the effective bandwidth and the communication payload. Moreover, we demonstrate that communication-bounded accelerators can be significantly affected by different platform implementations, and propose the device memory reuse ratio as a metric to quantify the boundary of platform selection between a private-memory platform and a shared-memory platform. Also, we demonstrate that the CPU-FPGA communication may not matter for computation-bounded applications where the data movement can be overlapped with the accelerator computation, and propose the computation-to-communication ratio (CTC) to measure it. In addition, we point out that the transparent data caching and memory coherence functionalities may be impractical in the current platforms because of the low-capacity and high-latency cache design.
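As a rough illustration of how the CTC metric can be read (a simplified model, not the exact formulation used in our evaluation): when compute and data movement are fully overlapped, a CTC above 1 means the communication is completely hidden behind the computation.

```python
def ctc(compute_time, comm_time):
    """Computation-to-communication ratio of an accelerator invocation."""
    return compute_time / comm_time

def overlapped_runtime(compute_time, comm_time):
    """With perfect overlap (e.g., double buffering), total time is bounded
    by the slower stage rather than the sum of the two stages."""
    return max(compute_time, comm_time)
```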
We believe these results and insights can aid platform users in choosing the best platform for a given application to accelerate, and can facilitate the maturity of CPU-FPGA platforms. To help the community measure other platforms, we have also released our microbenchmarks.
ACKNOWLEDGMENTS
This research is supported in part by the Intel and National Science Foundation (NSF) Joint Research Center on Computer Assisted Programming for Heterogeneous Architectures (CAPA) (CCF-1723773); the NSF/Intel InTrans Program (CCF-1436827); CRISP, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA (GI18518.156870); A Community Effort to Translate Protein Data to Knowledge: An Integrated Platform (U54GM114833-01); the UCLA Institute for Digital Research and Education Postdoc Fellowship; Fujitsu and Huawei under the CDSC industrial partnership; and C-FAR, one of the six SRC STARnet Centers, sponsored by MARCO and DARPA. We thank Amazon for the AWS F1 credit, Intel and Xilinx for the hardware donation, and Janice Wheeler for proofreading this paper.
1:20 Y. Choi et al.
REFERENCES
[1] 2016. Intel to Start Shipping Xeons With FPGAs in Early 2016. (2016). intel-to-start-shipping-xeons-with-fpgas-in-early-2016.html
[2] 2017. Amazon EC2 F1 Instance. (2017).
[3] 2017. SDAccel Development Environment. (2017).
[4] 2018. Cache Coherent Interconnect for Accelerators. (2018).
[5] Brad Brech, Juan Rubio, and Michael Hollinger. 2015. IBM Data Engine for NoSQL - Power Systems Edition. Technical Report. IBM Systems Group.
[6] Tony M. Brewer. 2010. Instruction set innovations for the Convey HC-1 computer. IEEE Micro 2 (2010), 70–79.
[7] Nandhini Chandramoorthy, Giuseppe Tagliavini, Kevin Irick, Antonio Pullini, Siddharth Advani, Sulaiman Al Habsi, Matthew Cotter, John Sampson, Vijaykrishnan Narayanan, and Luca Benini. 2015. Exploring architectural heterogeneity in intelligent vision systems. In HPCA-21.
[8] Yu-Ting Chen, Jason Cong, Zhenman Fang, Jie Lei, and Peng Wei. 2016. When Apache Spark meets FPGAs: A case study for next-generation DNA sequencing acceleration. In HotCloud.
[9] Young-kyu Choi and Jason Cong. 2016. Acceleration of EM-based 3D CT reconstruction using FPGA. IEEE Transactions on Biomedical Circuits and Systems 10, 3 (2016), 754–767.
[10] Young-kyu Choi, Jason Cong, Zhenman Fang, Yuchen Hao, Glenn Reinman, and Peng Wei. 2016. A quantitative analysis on microarchitectures of modern CPU-FPGA platforms. In DAC-53.
[11] Jason Cong, Zhenman Fang, Yuchen Hao, and Glenn Reinman. 2017. Supporting Address Translation for Accelerator-Centric Architectures. In HPCA-23.
[12] Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, Hui Huang, and Glenn Reinman. 2013. Composable accelerator-rich microprocessor enhanced for adaptivity and longevity. In ISLPED.
[13] Jason Cong, Peng Wei, and Cody Hao Yu. 2018. From JVM to FPGA: Bridging abstraction hierarchy via optimized deep pipelining. In HotCloud.
[14] Shane Cook. 2012. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Newnes.
[15] Emilio G. Cota, Paolo Mantovani, Giuseppe Di Guglielmo, and Luca P. Carloni. 2015. An analysis of accelerator coupling in heterogeneous architectures. In DAC-52.
[16] Zhenman Fang, Sanyam Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. 2015. Measuring microarchitectural details of multi- and many-core memory systems through microbenchmarking. ACM Transactions on Architecture and Code Optimization (TACO) 11, 4 (2015), 55.
[17] IBM 2015. Coherent Accelerator Processor Interface User's Manual Xilinx Edition. IBM. Rev. 1.1.
[18] Intel 2016. BDW+FPGA Beta Release 5.0.3 Core Cache Interface (CCI-P) Interface Specification. Intel. Rev. 1.0.
[19] J. Jang, S. Choi, and V. Prasanna. 2005. Energy- and time-efficient matrix multiplication on FPGAs. IEEE TVLSI 13, 11 (2005), 1305–1319.
[20] Jason Lawley. 2014. Understanding Performance of PCI Express Systems. Xilinx. Rev. 1.2.
[21] NVIDIA 2009. NVIDIA's Next Generation CUDA Compute Architecture: FERMI. NVIDIA. Rev. 1.1.
[22] Neal Oliver, Rahul R. Sharma, Stephen Chang, Bhushan Chitlur, Elkin Garcia, Joseph Grecco, Aaron Grier, Nelson Ijih, Yaping Liu, Pratik Marolia, and others. 2011. A reconfigurable computing system based on a cache-coherent fabric. In ReConFig.
[23] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric S. Chung. 2015. Toward accelerating deep learning at scale using specialized hardware in the datacenter. In Hot Chips.
[24] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, and others. 2014. A reconfigurable fabric for accelerating large-scale datacenter services. In ISCA-41.
[25] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[26] Zhenyuan Ruan, Tong He, Bojie Li, Peipei Zhou, and Jason Cong. 2018. ST-Accel: A high-level programming platform for streaming applications on FPGA. In FCCM.
[27] J. Stuecheli, Bart Blaner, C. R. Johns, and M. S. Siegel. 2015. CAPI: A Coherent Accelerator Processor Interface. IBM Journal of Research and Development 59, 1 (2015), 7:1–7:7.
[28] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009), 65–76.
[29] Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos. 2010. Demystifying GPU microarchitecture through microbenchmarking. In ISPASS.
[30] Xilinx 2017. ADM-PCIE-7V3 Datasheet. Xilinx. Rev. 1.3.
[31] Serif Yesil, Muhammet Mustafa Ozdal, Taemin Kim, Andrey Ayupov, Steven Burns, and Ozcan Ozturk. 2015. Hardware Accelerator Design for Data Centers. In ICCAD.
[32] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). 161–170.