
Adopting OpenCAPI for High Bandwidth Database Accelerators

Authors:
Jian Fang, Yvo T.B. Mulder, Kangli Huang, Yang Qiao, Xianwei Zeng, H. Peter Hofstee*, Jinho Lee*, and Jan Hidders†
Delft University of Technology, *IBM Research, †Vrije Universiteit Brussel
j.fang-1@tudelft.nl, {y.t.b.mulder, k.huang-5, y.qiao, x.zeng}@student.tudelft.nl, *{hofstee, leejinho}@us.ibm.com, †jan.hidders@vub.be
1. INTRODUCTION
Due to the scaling difficulties and high power consumption of
CPUs, data center applications look for solutions that improve
performance while reducing energy consumption. Among
these solutions, heterogeneous architectures that combine
CPUs with accelerators such as FPGAs show promising results.
Compared to a general-purpose processor, FPGAs have greater
potential to deliver high-throughput, low-latency, and
power-efficient designs. However, wide adoption of FPGAs
is held back by the relatively low bandwidth between the
CPU and the FPGA, which restricts their use mainly to
computation-intensive problems.
Meanwhile, database systems have sought ways to achieve
high-bandwidth access to data. One trend is the increasing
use of in-memory database systems, which offer much faster
data access than disk-based database systems and therefore
a higher data processing rate.
To leverage emerging heterogeneous architectures for
accelerating databases, the interconnect bottleneck must be
solved. A recent advancement is the introduction of the
Open Coherent Accelerator Processor Interface
(OpenCAPI) [1], which provides a significant increase in
bandwidth compared to the current state of the art (PCIe
gen 3). This change requires a re-evaluation of our current
accelerator design methodologies.
This abstract presents our ongoing work towards a heterogeneous
architecture for databases with FPGAs attached at high memory
bandwidth. Based on this architecture, we propose three
accelerator design examples that promise high throughput and
can keep up with the increased bandwidth.
2. SYSTEM ARCHITECTURE OVERVIEW
Modern systems consist of one or more CPUs that contain
multiple cores and are coupled with DRAM. Accelerators
are most commonly connected to the host using PCIe. The
widely adopted PCIe gen 3 interconnect attains a bandwidth
of roughly 8 Gb/s per lane. In contrast, OpenCAPI enables a
low-latency, high-bandwidth interconnect using 25 Gb/s
differential signaling. The aggregate OpenCAPI bandwidth
can rival or exceed that of DDR memory, making
OpenCAPI-attached accelerators candidates for bandwidth-
limited applications.
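As a rough comparison, the sketch below (plain Python, assuming an x16 PCIe gen 3 link, an 8-lane OpenCAPI link at 25 Gb/s per lane, and a single 64-bit DDR4-2400 channel; encoding and protocol overhead are ignored, so the figures are indicative rather than measured) illustrates why an OpenCAPI-attached accelerator can approach memory-class bandwidth:

# Back-of-the-envelope, per-direction bandwidth comparison.
# Lane counts and rates are illustrative assumptions, not measurements.

def link_gb_per_s(lanes, gbit_per_lane):
    # Aggregate raw bandwidth of a multi-lane serial link in GB/s.
    return lanes * gbit_per_lane / 8.0

pcie_gen3_x16 = link_gb_per_s(16, 8.0)     # ~16 GB/s
opencapi_x8   = link_gb_per_s(8, 25.0)     # ~25 GB/s
ddr4_2400     = 2.4e9 * 8 / 1e9            # 64-bit channel: ~19.2 GB/s

print(f"PCIe gen 3 x16: {pcie_gen3_x16:.1f} GB/s")
print(f"OpenCAPI x8:    {opencapi_x8:.1f} GB/s")
print(f"DDR4-2400 ch.:  {ddr4_2400:.1f} GB/s")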
3. ACCELERATOR DESIGN EVALUATION
Three high-bandwidth streaming accelerators for database
queries have been studied: decompress-filter, hash-join, and
merge-sort. Each has different buffering requirements, which
are challenging to meet at this speed; the requirements range
from hiding memory latency to providing a sufficient number
of read ports.
The critical path of a merge-sort occurs in the last pass,
which merges multiple sorted streams into a single in-order
stream. Guaranteeing a high output rate in this pass demands
a merge engine with a stable and high processing rate. At the
same time, this type of sorter requests data from the different
streams in a data-dependent, effectively random order.
Consequently, an obvious but tough challenge is how to hide
the high access latency of main memory when the stream from
which the next entry is requested is not known in advance.
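The challenge can be illustrated with a software analogue. The minimal sketch below (plain Python; in-memory lists stand in for the sorted streams and a small per-stream buffer stands in for an on-chip prefetch buffer, all hypothetical rather than part of the actual design) shows that which stream must be refilled next is decided by the data itself, which is exactly what makes latency hiding hard:

import heapq
from itertools import islice

def k_way_merge(streams, buf_size=4):
    # Each stream gets a small buffer, analogous to an on-chip prefetch buffer.
    iters = [iter(s) for s in streams]
    bufs = [list(islice(it, buf_size)) for it in iters]
    heap = [(buf[0], sid) for sid, buf in enumerate(bufs) if buf]
    heapq.heapify(heap)
    out = []
    while heap:
        value, sid = heapq.heappop(heap)
        out.append(value)
        bufs[sid].pop(0)
        if not bufs[sid]:
            # Buffer empty: the data decides which stream is refilled next,
            # so the memory access pattern across streams is unpredictable.
            bufs[sid] = list(islice(iters[sid], buf_size))
        if bufs[sid]:
            heapq.heappush(heap, (bufs[sid][0], sid))
    return out

# k_way_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]) -> [1, 2, ..., 9]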
Hash-join is a memory-bound operation. Increasing the
bandwidth helps reduce the data transfer time. However, the
bandwidth cannot be fully utilized because of the low locality
of the data and the multiple passes of data transfers. To keep
up with this high bandwidth, an algorithm is required that
needs fewer passes over the data and achieves higher
throughput in each pass.
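To make the trade-off between passes and locality concrete, the sketch below (plain Python with hypothetical build/probe inputs of (key, payload) pairs; it is not the accelerator design itself) contrasts a single-pass join against a radix-partitioned variant that touches the data twice but probes much smaller tables:

def hash_join(build, probe):
    # One pass over each input, but a single large hash table is probed
    # with poor locality.
    table = {}
    for key, payload in build:
        table.setdefault(key, []).append(payload)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]

def partitioned_hash_join(build, probe, bits=4):
    # Radix-partition both inputs first (an extra pass over the data),
    # then join partition by partition so each hash table stays small.
    fanout = 1 << bits
    bparts = [[] for _ in range(fanout)]
    pparts = [[] for _ in range(fanout)]
    for key, payload in build:
        bparts[hash(key) & (fanout - 1)].append((key, payload))
    for key, payload in probe:
        pparts[hash(key) & (fanout - 1)].append((key, payload))
    out = []
    for b, p in zip(bparts, pparts):
        out.extend(hash_join(b, p))
    return out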
In contrast, the Parquet decompress-filter is computation-
bound. The main bottleneck is the dependency between the
interpretation of adjacent tokens. Achieving a high processing
rate when decompressing a single stream demands special
designs that avoid or resolve this dependency. Another route
to high throughput is to instantiate multiple identical, but
small, decompression engines; this, however, requires careful
arbitration to schedule work and balance load across the
engines.
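The token dependency is easy to see in a minimal LZ77-style decode loop (a simplified Python sketch with hypothetical ('lit', bytes) and ('copy', offset, length) tokens, not the actual Snappy or Parquet format): the output position at which token i+1 applies, and the data it may copy, are only known after token i has been processed:

def decode(tokens):
    # tokens: ('lit', bytes) or ('copy', offset, length).
    # Each copy reads bytes produced by earlier tokens, so adjacent tokens
    # cannot simply be interpreted in parallel.
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out += tok[1]
        else:
            _, offset, length = tok
            start = len(out) - offset
            for i in range(length):       # a copy may overlap its own output
                out.append(out[start + i])
    return bytes(out)

# decode([('lit', b'ab'), ('copy', 2, 4)]) == b'ababab'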
4. REFERENCES
[1] J. Stuecheli. OpenCAPI: A New Standard for High
Performance Memory, Acceleration and Networks. HPC
Advisory Council Swiss Conference 2017, 2017.