Conference Paper

RIFFA: A Reusable Integration Framework for FPGA Accelerators

Authors: M. Jacobsen et al.

Abstract

We present RIFFA, a reusable integration framework for FPGA accelerators. RIFFA provides communication and synchronization for FPGA-accelerated software using a standard interface. Our goal is to expand the use of FPGAs as an acceleration platform by releasing, as open source, a no-cost framework that easily integrates software on traditional CPUs with FPGA-based IP cores, over PCIe, with minimal custom configuration. RIFFA requires no specialized hardware or fee-licensed IP cores. It can be deployed on common Linux workstations with a PCIe bus and has been tested on two different Linux distributions using Xilinx FPGAs.


... RIFFA [19], inspired by Microsoft Research's SIRC [20], proposes a reusable integration framework that targets the integration of accelerators deployed on programmable logic. RIFFA combines both software and hardware parts. ...
... It is interesting to note that the early RIFFA-1 [19] used DMA transfers to provide access to devices through a host OS (most commonly Linux). As explained in [24], the principle is to provide "a simple, narrow, and idealized view of hardware" ...
Thesis
Hardware development pace is increasing both in speed and complexity. This is for instance the case for Multiprocessor Systems-on-Chip, a class of systems optimized towards a specific application or specific classes of applications. Even though these systems tend to be reusable, they need to embed specific hardware Intellectual Properties (IPs) to achieve the right power and price budgets. Because those systems evolve very quickly, a large number of platform versions are released over short periods of time, themselves embedding a large number of IPs. Efficient hardware design being still a real challenge, the huge number of IPs to integrate in a large number of different platforms exacerbates the already difficult hardware/software integration challenge. This integration process deals with a dozen to a hundred IPs, each one having dozens to hundreds of hardware registers, each of them being a potential bitfield with many functions for software drivers to take into account. Getting a hardware/software system working in those conditions requires a huge amount of work and debugging, so it is no surprise that the vast majority of bugs comes from drivers. The problem is even worse in the context of FPGAs: their reconfigurability makes the number of system and IP versions grow even faster. FPGA adoption is growing well with the improvement of tools and process automation, but to push it further in all domains, from embedded systems to cloud computing, the integration process needs a major leap in efficiency. This thesis addresses that specific problem by proposing a simple and effective approach. The approach does not require expensive or new technologies, and it can be adopted incrementally without having to throw existing solutions out. It only requires following an end-to-end design that can be implemented using existing tools and technologies. The core principles of our solution are inspired by the success story of USB: 1) a message conduit hiding the register- and interrupt-based hardware/software frontier, and 2) device classes allowing one driver to drive multiple devices by following a generic message protocol. The proposal defines an open and abstract conduit for sending and receiving messages across the hardware/software frontier. This message conduit is co-designed across both sides of the frontier using existing concepts that each side is familiar with. On the hardware side, it provides devices with a simple interface based on streaming interfaces in which messages travel, as well as a generic message protocol giving them a simple and yet very useful lifecycle to follow. On the software side, it provides a simple message-based API, easy to understand for software developers. This API hides hardware details such as interrupts or hardware registers behind communication channels in which messages are exchanged with devices. We first designed our solution for small embedded systems, addressing the integration challenge by implementing a message conduit across the hardware and the software. The prototype we built demonstrates that our solution fits small systems with low-latency and low-throughput devices without growing the size of either the hardware or the software implementation much. We then integrated the solution into the Linux kernel, demonstrating the feasibility of our solution for bigger systems. Our experiments show that our solution can fit high-performance devices with negligible overheads, and that it can be well integrated within the kernel beside existing integration solutions. In the context of cloud computing, our solution offers great benefits for improving hardware support for guest operating systems. We have shown that using messages is a good solution for supporting and sharing devices among guests. It also fits well in hypervisors, in particular Xen, which has all the features our solution requires and follows compatible concepts to simplify drivers in guest operating systems.
... In order to demonstrate that the homomorphic encryption/decryption operations of the SEAL library can be accelerated considerably, we designed a proof-of-concept accelerator framework that includes SEAL software and an FPGA accelerator that implements our architectures. For communication between the software stack and the FPGA, we utilized the RIFFA driver [26], which employs a PCIe connection between the CPU and the FPGA. The resulting framework is shown in Fig. 6. ...
... This approach is utilized to enable a pipelined architecture and maximize performance. In [26], it is shown that RIFFA is able to achieve only 76% of the maximum theoretical bandwidth. Therefore, the bandwidth of the PCIe module is assumed to be ∼24 Gbps. ...
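As a sanity check on these figures (an illustration based on our own assumption about the link configuration, which the excerpt does not state): a PCIe Gen2 x8 link delivers 5 GT/s per lane, and 8b/10b encoding leaves 80% of that as payload bandwidth, so

\[
0.76 \times \left(5\ \text{GT/s} \times 8\ \text{lanes} \times 0.8\right) \approx 0.76 \times 32\ \text{Gbps} \approx 24.3\ \text{Gbps},
\]

which matches the ∼24 Gbps assumed above.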
Article
Fully homomorphic encryption (FHE) is a technique that allows computations on encrypted data without the need for decryption and it provides privacy in various applications such as privacy-preserving cloud computing. In this article, we present two hardware architectures optimized for accelerating the encryption and decryption operations of the Brakerski/Fan-Vercauteren (BFV) homomorphic encryption scheme with high-performance polynomial multipliers. For proof of concept, we utilize our architectures in a hardware/software codesign accelerator framework, in which encryption and decryption operations are offloaded to an FPGA device, while the rest of operations in the BFV scheme are executed in software running on an off-the-shelf desktop computer. Specifically, our accelerator framework is optimized to accelerate Simple Encrypted Arithmetic Library (SEAL), developed by the Cryptography Research Group at Microsoft Research. The hardware part of the proposed framework targets the XILINX VIRTEX-7 FPGA device, which communicates with its software part via a peripheral component interconnect express (PCIe) connection. For proof of concept, we implemented our designs targeting 1024-degree polynomials with 8-bit and 32-bit coefficients for plaintext and ciphertext, respectively. The proposed framework achieves almost 12x and 7x latency speedups, including I/O operations for the offloaded encryption and decryption operations, respectively, compared to their pure software implementations.
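For reference, the offloaded operations in the standard BFV formulation (textbook notation from the Fan-Vercauteren scheme, not taken from this abstract) are:

\[
\mathrm{Enc}(m):\ \ c_0 = [\Delta m + p_0 u + e_1]_q,\qquad c_1 = [p_1 u + e_2]_q,\qquad \Delta = \lfloor q/t \rfloor,
\]
\[
\mathrm{Dec}(c_0, c_1):\ \ m = \left[\left\lfloor \tfrac{t}{q}\,[c_0 + c_1 s]_q \right\rceil\right]_t,
\]

where (p_0, p_1) is the public key, s the secret key, and u, e_1, e_2 small random polynomials; every product here is a polynomial multiplication in R_q, which is why high-performance polynomial multipliers dominate the accelerator design.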
... Achieving fast reconfiguration is challenging in such a general computing environment due to the generally supported slow external configuration interfaces such as JTAG. Furthermore, use of external interfaces adds additional complexity in terms of cabling and driver support on the host. Recently, researchers have developed open source frameworks to enable easier interfacing between a host PC and FPGA boards [150,151]. These platforms offer an API that abstracts the interface, enabling FPGA designs to be accessed efficiently from software on the host. ...
... However, Ethernet fails to offer the throughput capabilities often required for such software-hardware systems. Recently, frameworks that interface over higher throughput PCI Express (PCIe) links, such as RIFFA [151,155] and OCPI [156], have emerged. These frameworks enable static FPGA designs to be accessed through an abstracted software API on the host, including hooks for different programming languages. ...
... Recently, researchers have developed open source frameworks to enable easier interfacing between a host PC and FPGA boards [4], [5], [6]. These frameworks offer an API that abstracts the interface, enabling FPGA designs to be accessed efficiently from software on the host. ...
... SIRC [4] interfaces a Windows host PC and an FPGA board over Ethernet but fails to offer the throughput capabilities often required for such software-hardware systems. Recently, frameworks that interface over higher throughput PCI Express (PCIe) links, such as RIFFA [5], [12] and OCPI, have emerged. These frameworks enable static FPGA designs to be accessed through an abstracted software API on the host, including hooks for different programming languages. ...
Conference Paper
Full-text available
Integrating FPGAs with a general purpose computer remains difficult, but recent efforts have resulted in open frameworks that offer a software API and hardware interface to allow easier integration. However, such systems only support static FPGA designs. With the addition of partial reconfiguration (PR) support, such frameworks can enable more effective use of FPGAs. Now, designers can incorporate hardware accelerators within their software applications, and these can be loaded dynamically as required. We present a PR-enabled FPGA platform that allows user modules to be loaded onto the FPGA, inputs to be applied, results obtained, and functions to be swapped at runtime. The interface and PR management logic are part of the static region, while multiple accelerators can be loaded using high level functions provided by the API. Reconfiguration and data transfer are both managed over the PCIe interface from the host PC, with communication throughput of more than 1.5 GB/s (75% of peak PCIe bandwidth) and reconfiguration of a large accelerator in 20 ms.
... But the system also needs a platform that takes charge of the communication between FPGA and CPU [7]. There are also several FPGA-based accelerator platforms that use PCIe to communicate with the CPU [1] [9]. M. Jacobsen et al. have built a reusable integration framework for an FPGA-based accelerator [9]. ...
... There are also several FPGA-based accelerator platforms that use PCIe to communicate with the CPU [1] [9]. M. Jacobsen et al. have built a reusable integration framework for an FPGA-based accelerator [9]. They only used a 1× Gen1.1 PCIe interface. ...
Conference Paper
Full-text available
Modern cloud storage requires a high throughput and low latency data protection system, which is usually implemented with an Advanced Encryption Standard (AES) hardware accelerator connected to the CPU through PCI Express (PCIe). However, most existing systems cannot simultaneously achieve high throughput and low latency, as these goals impose conflicting requirements on the block size of packets used in PCIe. High throughput requires the block size to be larger, while low latency requires the block size to be smaller. To provide both high throughput and low latency, we have developed an FPGA based data protection system called sAES. It uses a highly pipelined Direct Memory Access (DMA) based PCIe interface. It can achieve 10.4 Gbps throughput when the block size is 512 bytes, which is 51 times higher than the state-of-the-art Speedy PCIe interface [1]. The worst-case latency of sAES is only 4.368 μs when the block size is 512 bytes.
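Purely as an order-of-magnitude illustration derived from the stated figures (our inference, not a claim from the paper): at 10.4 Gbps, streaming one 512-byte block takes

\[
t_{\text{block}} = \frac{512 \times 8\ \text{bits}}{10.4 \times 10^9\ \text{bits/s}} \approx 0.39\ \mu\text{s},
\]

so the 4.368 μs worst-case latency amounts to roughly eleven block-times, consistent with a deeply pipelined DMA path keeping many blocks in flight.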
... The output is then used to calculate the normalization base values and both are DMA transferred to the host workstation over a PCIe connection. We used the RIFFA [17] framework to connect the FPGA to the host workstation (and thus the GPU). ...
... Our heterogeneous design is controlled by a C++ program and compiled using GCC 4.4 and CUDA Toolkit 4.2. The C++ program interfaces with the CUDA API and the RIFFA API [17] to access the GPU and FPGA, respectively. It provides simulated camera data to the FPGA and coordinates transferring data to and from the FPGA and CPU/GPU. ...
Conference Paper
Full-text available
Real-time optical mapping technology is a technique that can be used in cardiac disease study and treatment technology development to obtain accurate and comprehensive electrical activity over the entire heart. It provides dense spatial electrophysiology: each pixel essentially plays the role of a probe on that location of the heart. However, the high throughput nature of the computation causes significant challenges in implementing a real-time optical mapping algorithm. This is exacerbated by the high-frame-rate video required by many medical applications (on the order of 1000 fps). Accelerating optical mapping technologies using multiple CPU cores yields modest improvements, but still only performs at 3.66 frames per second (fps). A highly tuned GPU implementation achieves 578 fps. An FPGA-only implementation is infeasible due to the resource requirements for processing intermediate data arrays generated by the algorithm. We present an FPGA-GPU-CPU architecture that is a real-time implementation of the optical mapping algorithm running at 1024 fps. This represents a 273× speedup over a multi-core CPU implementation.
... FPGAs are used to implement the core functionality of the Intelligent Personal Assistant (IPA) because of their significantly better performance per watt compared to CPUs and GPUs [26]. FPGA integration into personal computers is enabled by the open-source framework of [27], with communication throughput close to the capacity of the modern PCIe interface. Reference [28] considers configuration and communication using PCIe. ...
Preprint
Full-text available
Finding the optimum virtual machine placement (VMP) approach in cloud infrastructure is one of the most important optimization problems. Furthermore, in recent years, with the semiconductor industry's increasing development, there has been a growing interest in fabricated chips including multiple homogeneous or heterogeneous processing elements (PEs). Modern chips include multiple general-purpose cores alongside reconfigurable fabrics (RFs) used for high-performance computing, which perform on par with ASIC hardware. Using RFs at different hierarchy levels of the cloud architecture significantly improves the performance of VM tasks. The methodology in our previous work is used to design a hierarchical RF-aware VMP. The Communication Delay (CD) parameter is used to perform the optimal virtual machine (VM) placement considering a multi-hierarchy RF cloud architecture. The synthetic workload simulation results show that the proposed VMP algorithm outperforms other algorithms that work with our proposed cloud architecture model.
... FPGAs, with their performance per watt remarkably better than that of CPUs and GPUs, are used to implement core functions in intelligent personal assistants (IPAs) [25]. Open-source frameworks allow FPGAs to be integrated into personal computers with communication throughput close to that of the modern PCIe interface [26]. Configuration and communication using PCIe are considered in [27]. ...
Article
Full-text available
Finding the best approach for virtual machine placement (VMP) in cloud infrastructure is one of the most important optimization problems. The obtained solution of this problem significantly impacts costs, energy, performance, etc. Physical machine (PM) processing capacity and virtual machine (VM) workloads have played important roles in VMP. Besides, in recent years, with the increasing development of the semiconductor industry, fabricated chips including multiple homogeneous or heterogeneous processing elements (PEs) are of interest. The latest chips contain several general-purpose cores side by side with reconfigurable fabrics (RF), which have been used for accelerated computing and perform on par with ASIC hardware. In this paper a methodology is proposed to design VMP algorithms using arbitrary PEs. Moreover, a novel algorithm to address the VMP problem using RF elements in cloud infrastructure is proposed. The methodology includes discovery, the evaluation environment, models, parameter extraction, limitations, adaptation, problem formulation and a heuristic. Among those, parameter extraction has a critical role in the overall performance. The extracted parameters are employed to make a decision about which PM is more appropriate for hosting the desired VM. According to simulation results on synthetic workloads, our proposed VMP algorithm outperforms others in operation with our proposed cloud architecture model.
... In order to speed up the encryption operation of the SEAL library, we designed a proof-of-concept framework that includes SEAL and an FPGA-based accelerator. To establish communication between the software stack and the FPGA, we utilized the RIFFA driver [19], which employs a PCIe connection between the CPU and the FPGA. The resulting framework is shown in Figure 5. Inside SEAL, there is an encrypt function, which works as described in Section III. ...
Conference Paper
Full-text available
In this paper, we present an optimized FPGA implementation of a novel, fast and highly parallelized NTT-based polynomial multiplier architecture, which proves to be effective as an accelerator for lattice-based homomorphic cryptographic schemes. As I/O operations are as time-consuming as NTT operations during homomorphic computations in a host processor/accelerator setting, instead of achieving the fastest NTT implementation possible on the target FPGA, we focus on a balanced time performance between the NTT and I/O operations. Even with this goal, we achieved the fastest NTT implementation in the literature, to the best of our knowledge. For proof of concept, we utilize our architecture in a framework for the Fan-Vercauteren (FV) homomorphic encryption scheme, utilizing a hardware/software co-design approach, in which polynomial multiplication operations are offloaded to the accelerator via the PCIe bus while the rest of the operations in the FV scheme are executed in software running on an off-the-shelf desktop computer. Specifically, our framework is optimized to accelerate Simple Encrypted Arithmetic Library (SEAL), developed by the Cryptography Research Group at Microsoft Research, for the FV encryption scheme, where large degree polynomial multiplications are utilized extensively. The hardware part of the proposed framework targets the Xilinx Virtex-7 FPGA device, and the proposed framework achieves almost 11x latency speedup for the offloaded operations compared to their pure software implementations. We achieved a throughput of almost 800K polynomial multiplications per second, for polynomials of degree 1024 with 32-bit coefficients.
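To make the accelerated operation concrete, below is a minimal software reference for NTT-based polynomial multiplication (cyclic convolution modulo a prime). It is a textbook radix-2 implementation with toy parameters (N = 8, p = 257, primitive root of unity 64) chosen for readability; it is not the paper's parallelized architecture or its degree-1024 parameter set:

```c
#include <stdio.h>
#include <stdint.h>

#define N 8     /* transform size (power of two) */
#define P 257   /* prime with N | P-1 */
#define W 64    /* primitive N-th root of unity mod P (64^8 = 1 mod 257) */

static uint32_t mulmod(uint32_t a, uint32_t b) { return (uint64_t)a * b % P; }

static uint32_t powmod(uint32_t b, uint32_t e) {
    uint32_t r = 1;
    for (; e; e >>= 1, b = mulmod(b, b))
        if (e & 1) r = mulmod(r, b);
    return r;
}

/* In-place iterative radix-2 NTT (bit-reversal followed by butterflies). */
static void ntt(uint32_t *a, uint32_t root) {
    for (uint32_t i = 1, j = 0; i < N; i++) {      /* bit-reverse permutation */
        uint32_t bit = N >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { uint32_t t = a[i]; a[i] = a[j]; a[j] = t; }
    }
    for (uint32_t len = 2; len <= N; len <<= 1) {
        uint32_t wlen = powmod(root, N / len);
        for (uint32_t i = 0; i < N; i += len)
            for (uint32_t k = 0, w = 1; k < len / 2; k++, w = mulmod(w, wlen)) {
                uint32_t u = a[i + k], v = mulmod(a[i + k + len / 2], w);
                a[i + k] = (u + v) % P;
                a[i + k + len / 2] = (u + P - v) % P;
            }
    }
}

int main(void) {
    /* (1 + 2z + 3z^2 + 4z^3)(5 + 6z + 7z^2 + 8z^3) via length-8 cyclic convolution */
    uint32_t x[N] = {1, 2, 3, 4, 0, 0, 0, 0}, y[N] = {5, 6, 7, 8, 0, 0, 0, 0};
    ntt(x, W); ntt(y, W);
    for (int i = 0; i < N; i++) x[i] = mulmod(x[i], y[i]);  /* pointwise product */
    ntt(x, powmod(W, P - 2));                               /* inverse root = W^-1 */
    uint32_t ninv = powmod(N, P - 2);                       /* scale by N^-1 mod P */
    for (int i = 0; i < N; i++) printf("%u ", mulmod(x[i], ninv));
    printf("\n");  /* prints: 5 16 34 60 61 52 32 0 */
    return 0;
}
```

Pointwise multiplication in the transform domain replaces the O(N²) coefficient products of schoolbook multiplication with O(N log N) butterfly operations; it is exactly these butterflies that the paper parallelizes in hardware while keeping PCIe I/O in balance.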
... RIFFA [5] and RIFFA 2.0 [6] are open source frameworks for accessing hardware accelerators. The System Level FPGA Device Driver with High Level Synthesis support [7] extends RIFFA v1.0, supporting DDR and Ethernet transactions. ...
... The communication between the PC and the FPGA board is carried through the PCI port. The RIFFA framework (Jacobsen et al., 2012) is used to facilitate the communication between our IP core and the PCI endpoint on the board. ...
Conference Paper
Full-text available
Support Vector Machines (SVMs) are a powerful supervised learning method in machine learning. However, their applicability to large problems, where frequent retraining of the system is required, has been limited due to the time-consuming training stage, whose computational cost scales quadratically with the number of examples. In this work, a complete FPGA-based system for kernelized SVM training using ensemble learning is presented. The proposed framework builds on the FPGA architecture and utilises a cascaded multiprecision training flow, exploits the heterogeneity within the training problem by tuning the number representation used, and supports ensemble training tuned to each internal memory structure so as to address very large datasets. Its performance evaluation shows that the proposed system achieves more than an order of magnitude better results compared to state-of-the-art CPU and GPU-based implementations, providing a stepping stone for researchers and practitioners to tackle large-scale SVM problems that require frequent retraining.
... These frameworks have greatly reduced the overall system development time for researchers/designers with limited knowledge of hardware/software design [3] [4] [5] [6]. The Reusable Integration Framework for FPGA Accelerators (RIFFA) [7] is an open-source, reusable framework focused on facilitating the use of an FPGA as a hardware accelerator. Its authors have integrated FPGAs into a traditional software environment. ...
... The Reusable Integration Framework for FPGA Accelerators (RIFFA) was developed at the University of California (USA), and its first version is presented in [36]. A second version, removing limitations of the initial version, is presented in [37]. ...
Article
Full-text available
One of the key future challenges for reconfigurable computing is to enable higher design productivity and an easier way to use reconfigurable computing systems for users that are unfamiliar with the underlying concepts. One way of doing this is to provide standardization and abstraction, usually supported and enforced by an operating system. This article gives a historical review and a summary of ideas and key concepts to include reconfigurable computing aspects in operating systems. The article also presents an overview of published and available operating systems targeting the area of reconfigurable computing. The purpose of this article is to identify and summarize common patterns among those systems that can be seen as de facto standards. Furthermore, open problems, not covered by these already available systems, are identified.
... With our simplified implementation we have measured data rates on the PCIe Gen2 ×8 interface of approximately 230 MB/s, and this figure has to be divided by 4 to account for the aforementioned lack of compression. We can therefore expect to gain a significant speed-up by a sensible implementation of the transfer protocol — for instance by integrating the RIFFA framework [42], which gets very close to the theoretical limit of 4 GB/s — but this goes beyond the scope of our paper. Evidently we have a longer input transfer to the DFE (35 µs) compared to the output phase (16 µs). ...
Article
Even though it seems that FPGAs have finally made the transition from research labs to the consumer devices' market, programming them remains challenging. Despite the improvements made by High-Level Synthesis (HLS), which removed the language and paradigm barriers that prevented many computer scientists from working with them, producing a new design typically requires at least several hours, making data- and context-dependent adaptations virtually impossible. In this paper we present a new framework that off-loads, on the fly and transparently to both the user and the developer, computationally intensive code fragments to FPGAs. While the performance should not surpass that of hand-crafted HDL code, or even code produced by HLS, our results come with no additional development costs and do not require producing and deploying a new bit-stream to the FPGA each time a change is made. Moreover, since optimizations are made at run-time, they may fit particular datasets or usage scenarios, something which is rarely foreseeable at design or compile time. Our proposal revolves around an overlay architecture that is pre-programmed on the FPGA and dynamically reconfigured by our framework to execute code fragments extracted from the Data Flow Graph (DFG) of computationally intensive routines. We validated our solution using standard benchmarks and proved we are able to off-load to FPGAs without the developer's intervention.
... The Float KNLMS core was integrated with a RIFFA 2.2.0 [15] PCI Express (PCIe) interface as illustrated in Figure 7. Data ingress and egress are controlled by 512-word FIFOs, and a P-word memory (P must be a multiple of L) for each of the 4 parameters shown in the bottom left module of Figure 2 was used to store the parameters to be searched. Separate memories indexed by l (as detailed in Section III-E) are used to store dictionary and weight values. ...
... Since only one flit can be injected and ejected in a single cycle in the NoC, this constraint is automatically ensured. Our implementation uses the "Network and Router Options" for the NoC generated using CONNECT (topology and number of endpoints specified as required). The evaluation was done on a Xilinx Virtex-6 ML605 with an Intel i7 host; the hardware-software link between them was implemented using RIFFA 2.0 [13]. The multithreaded message passing software version (processing elements corresponding to threads) was evaluated on a 6-core Xeon (E5-2620). ...
Article
Full-text available
Today's algorithm-to-hardware high-level synthesis (HLS) tools are purported to produce hardware comparable in quality to handcrafted designs, particularly with user-directive-driven or domain-specific HLS. However, HLS tools are not readily equipped for cases where an application/algorithm needs to scale. We present a (work-in-progress) semi-automated framework to map applications over a packet-switched network of modules (single FPGA) and then to seamlessly partition such a network over multiple FPGAs over quasi-serial links. We illustrate the framework through three application case studies: LDPC Decoding, Particle Filter based Object Tracking, and Matrix Vector Multiplication over GF(2). Starting with high-level representations of each case application, we first express them in an intermediate message passing formulation, a model of communicating processing elements. Once the processing elements are identified, these are either handcrafted or realized using HLS. The rest of the flow is automated: the processing elements are plugged on to a configurable network-on-chip (CONNECT) topology of choice, followed by partitioning the 'on-chip' links to work seamlessly across chips/FPGAs.
... [19] projects, which were published after we started our project, follow a similar approach to ours, also using PCIe. The implementations of the earlier versions of these projects [20,21] were too slow for our project. King et al. [22] present a method for managing process communication over the software/hardware boundary comparable to our API. ...
Article
This paper presents SAccO (Scalable Accelerator platform Osnabrück), a novel framework for implementing data-intensive applications using scalable and portable reconfigurable hardware accelerators. Instead of using expensive "reconfigurable supercomputers", SAccO is based on standard PCs and PCI-Express extension cards featuring Field-Programmable Gate Arrays (FPGAs) and memory. In our framework, we exploit task-level parallelism by manually partitioning applications into several parallel processes using the SAccO communication API for data streams. This also allows pure software implementations on PCs without FPGA cards. If an FPGA accelerator is present, the same API calls transfer data between the PC's CPU and the FPGA. Then, the processes implemented in hardware can exploit instruction-level and pipelining parallelism as well. Furthermore, SAccO components follow a set of hardware implementation rules which enable portable and scalable designs. Device specific hardware wrappers hide the FPGA's and board's idiosyncrasies from the application developer.
... • Our driver supports direct streaming of data from Host↔FPGA user logic over PCIe, as in the case of RIFFA. Writes peak at over 1.3 GB/s, while reads peak at over 1.5 GB/s, assuming similar termination at user logic as in [6]. In Fig. 4(b), we see that non-blocking transfers (with deferred synchronization) offer better behavior for small transfers. ...
Conference Paper
Full-text available
We can exploit the standardization of communication abstractions provided by modern high-level synthesis tools like Vivado HLS, Bluespec and SCORE to provide stable system interfaces between the host and PCIe-based FPGA accelerator platforms. At a high level, our FPGA driver attempts to provide CUDA-like driver behavior, and more, to FPGA programmers. On the FPGA fabric, we develop an AXI-compliant, lightweight interface switch coupled to multiple physical interfaces (PCIe, Ethernet, DRAM) to provide programmable, portable routing capability between the host and user logic on the FPGA. On the host, we adapt the RIFFA 1.0 driver to provide enhanced communication APIs along with bitstream configuration capability allowing low-latency, high-throughput communication and safe, reliable programming of user logic on the FPGA. Our driver only consumes 21% BRAMs and 14% logic overhead on a Xilinx ML605 platform or 9% BRAMs and 8% logic overhead on a Xilinx V707 board. We are able to sustain DMA transfer throughput (to DRAM) of 1.47GB/s (74% peak) of the PCIe (x4 Gen2) bandwidth, 120.2MB/s (96%) of the Ethernet (1G) bandwidth and 5.93GB/s (92.5%) of DRAM bandwidth.
... As a result, many, if not most, FPGA uses involve standalone designs. This is the motivation that led to the development of RIFFA [5]. RIFFA 2.0 is a reusable interface for FPGA accelerators. ...
Conference Paper
Full-text available
We present RIFFA 2.0, a reusable integration framework for FPGA accelerators. RIFFA 2.0 provides communication and synchronization for FPGA accelerated applications using simple interfaces for hardware and software. Our goal is to expand the use of FPGAs as an acceleration platform by releasing, as open source, a framework that easily integrates software running on commodity CPUs with FPGA cores. RIFFA 2.0 uses PCIe to connect FPGAs to a CPU's system bus. RIFFA 2.0 extends the original RIFFA project by supporting more classes of Xilinx FPGAs, multiple FPGAs in a system, more PCIe link configurations, higher bandwidth, and Linux and Windows operating systems. This release also supports C/C++, Java, and Python bindings. Tests show that data transfers between hardware and software can saturate the PCIe link to achieve the highest bandwidth possible.
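To illustrate the "simple interfaces ... for software" concretely, here is a minimal host-side sketch written against the C bindings as commonly documented for the RIFFA 2.x releases (fpga_open/fpga_send/fpga_recv/fpga_close, with lengths counted in 32-bit words); the exact signatures should be checked against the riffa.h of the release in use:

```c
#include <stdio.h>
#include <stdlib.h>
#include "riffa.h"   /* RIFFA 2.x user library header */

int main(void) {
    int chnl = 0;                 /* RIFFA channel wired to the user core */
    int len = 1024;               /* transfer length in 32-bit words */
    unsigned int *buf = malloc(len * sizeof(unsigned int));
    for (int i = 0; i < len; i++) buf[i] = i;

    fpga_t *fpga = fpga_open(0);  /* open the FPGA with id 0 */
    if (fpga == NULL) { fprintf(stderr, "no FPGA found\n"); return 1; }

    /* Send len words on the channel: destoff = 0, last = 1 (end of
       transaction), timeout = 0 (block until done). Returns words sent. */
    int sent = fpga_send(fpga, chnl, buf, len, 0, 1, 0);

    /* Receive the core's response into the same buffer (blocking). */
    int recvd = fpga_recv(fpga, chnl, buf, len, 0);

    printf("sent %d words, received %d words\n", sent, recvd);
    fpga_close(fpga);
    free(buf);
    return 0;
}
```

On the hardware side, the same transaction reaches the user core as a word stream on the corresponding channel's RX ports and is answered on its TX ports, so the core never handles PCIe packets directly.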
Conference Paper
This paper describes two different approaches to emulate an Ethernet communication link between a host computer and a RISC-V multiprocessor system running on an FPGA accelerator, using PCIe as the real communication link. Two approaches are tested: one based on user-level applications using TUN/TAP drivers, and another based on kernel-mode drivers implemented on the Linux operating system. We have functionally validated the approaches in multiple RISC-V systems and measured the achieved performance. A maximum bandwidth of 32.5 Mbps has been achieved in a Lagarto Hun system running at 100 MHz.
Thesis
Full-text available
The rapidly increasing spread of information and communication technology in the everyday lives of many people goes hand in hand with an equally rapidly increasing demand for energy-efficient computing power. Easily programmable processors currently provide the majority of this computing power. In the past, such systems were able to meet the increasing demands on computing power through architectural improvements and increases in system clock rates. Approaching the physical limits of conventional processor technology will require the use of new architectural approaches in the future in order to meet the increasing demand for efficient computing power. Field Programmable Gate Arrays (FPGAs) have the potential to be an essential part of these new architectures due to their high flexibility and computing efficiency. FPGAs are already of great importance as specialized implementation options, especially where high computing power is required. As a result, FPGA circuit design has become an increasingly important competence of highly trained electrical engineers. The possible transition from highly specialized niche technology to mainstream technology will require teaching FPGA concepts to a wide variety of target groups, including potential users. Here, the very high complexity of FPGA design represents a major challenge. This thesis presents a comprehensive tool for flexible and precise scaling of the abstraction of the FPGA design flow. The components of this work's approach are suitable for simplifying the complexity of the different aspects of FPGA circuit design individually, in such a way that the teaching of groups with different levels of knowledge becomes possible. The presented approach preserves all essential concepts of FPGA circuit design in order to guarantee the successful teaching of the core aspects. Several case studies show the successful application of the framework developed in the context of this work. The youth competition "INVENT a CHIP" shows the fruitful teaching of basic FPGA concepts to pupils in grades 8 to 13. The laboratory "Design Methods for FPGAs" at the Leibniz University Hannover offers master students in the field of electrical engineering the opportunity to gain in-depth and detailed knowledge about FPGAs. The concept of a novel laboratory for software developers shows the possible abstraction of FPGA design with a focus on hardware-related programming. In addition, the simplification of FPGA design with the help of the presented tool is able to shorten the design time in rapid prototyping projects significantly. The implementation of a state-of-the-art FPGA demonstration system for video-based person detection using the framework illustrates this aspect.
Article
In this work, we present DE-ZFP: a hardware implementation of modified ZFP compression and decompression algorithms on a Field Programmable Gate Array (FPGA). It can be used to accelerate applications running on a host CPU that generates large volumes of floating point data. The proposed design uses dictionary-based encoding (DE) in lieu of ZFP’s original embedded encoding to maximize hardware performance. Furthermore, the block encoder logic was optimized such that the loss of compression efficiency due to DE remains within 4%–13% of the original ZFP software implementation, with up to 19x improvement in throughput.
Article
Platforms combining Central Processing Units (CPUs) with Field Programmable Gate Arrays (FPGAs) have become popular, as they promise high performance with energy efficiency. This is the result of the combination of FPGA accelerators tuned to the application with the CPU providing the programming flexibility. Unfortunately, the security of these new platforms has received little attention: the classic software security assumption that hardware is immutable no longer holds. It is expected that attack surfaces will expand and threats will evolve; hence trust models and security solutions should be prepared. The attacker model should be enhanced to consider the following three basic entities as the source of threats: applications run by users, accelerators designed by third-party developers, and the cloud service providers enabling the computation on their platforms. In our work, we review current trust models and existing security assumptions and point out their shortcomings. We survey existing research that targets secure remote FPGA configuration, the protection of intellectual property, and secure shared use of FPGAs. When combined, these are the foundations to build a solution for secure use of FPGAs in the cloud. In addition to analysing the existing research, we provide discussions on how to improve it and disclose various concerns that have not been addressed yet.
Article
Multiplication of polynomials of large degrees is the predominant operation in lattice-based cryptosystems in terms of execution time. This motivates the study of its fast and efficient implementations in hardware. Also, applications such as those using homomorphic encryption need to operate with polynomials of different parameter sets. This calls for the design of configurable hardware architectures that can support multiplication of polynomials of various degrees and coefficient sizes. In this work, we present the design and an FPGA implementation of a run-time configurable and highly parallelized NTT-based polynomial multiplication architecture, which proves to be effective as an accelerator for lattice-based cryptosystems. The proposed polynomial multiplier can also be used to perform Number Theoretic Transform (NTT) and Inverse NTT (INTT) operations. It supports 6 different parameter sets, which are used in lattice-based homomorphic encryption and/or post-quantum cryptosystems. We also present a hardware/software co-design framework, which provides high-speed communication between the CPU and the FPGA, connected by the standard PCIe interface provided by the RIFFA driver [1]. For proof of concept, the proposed polynomial multiplier is deployed in this framework to accelerate the decryption operation of the Brakerski/Fan-Vercauteren (BFV) homomorphic encryption scheme implemented in the Simple Encrypted Arithmetic Library (SEAL), developed by the Cryptography Research Group at Microsoft Research [2]. In the proposed framework, the polynomial multiplication operation in the decryption of the BFV scheme is offloaded to the accelerator in the FPGA via the PCIe bus, while the rest of the operations in the decryption are executed in software running on an off-the-shelf desktop computer. The hardware part of the proposed framework targets the Xilinx Virtex-7 FPGA device, and the framework achieves a speedup of almost 7× in latency for the offloaded operations compared to their pure software implementations, excluding I/O overhead.
Article
Kernel adaptive filters (KAFs) are online machine learning algorithms which are amenable to highly efficient streaming implementations. They require only a single pass through the data and can act as universal approximators, i.e. approximate any continuous function with arbitrary accuracy. KAFs are members of a family of kernel methods which apply an implicit non-linear mapping of input data to a high dimensional feature space, permitting learning algorithms to be expressed entirely as inner products. Such an approach avoids explicit projection into the feature space, enabling computational efficiency. In this paper, we propose the first fully pipelined implementation of the kernel normalised least mean squares algorithm for regression. Independent training tasks necessary for hyperparameter optimisation fill pipeline stages, so no stall cycles to resolve dependencies are required. Together with other optimisations to reduce resource utilisation and latency, our core achieves 161 GFLOPS on a Virtex 7 XC7VX485T FPGA for a floating point implementation and 211 GOPS for fixed point. Our PCI Express based floating-point system implementation achieves 80% of the core’s speed, this being a speedup of 10× over an optimised implementation on a desktop processor and 2.66× over a GPU.
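Because the abstract describes the algorithm only at a high level, the sketch below shows a plain software version of kernel normalised least mean squares, with the usual coherence-criterion dictionary and a Gaussian kernel, to make the per-sample computation concrete. The hyperparameters and the toy sine-learning demo are illustrative assumptions; this is a sequential reference, not the paper's pipelined fixed/floating-point core:

```c
/* build: cc knlms.c -lm */
#include <math.h>
#include <stdio.h>

#define DMAX 64                  /* maximum dictionary size */
#define M 1                      /* input dimension */
#define TWO_PI 6.28318530717959

typedef struct {
    double dict[DMAX][M];        /* dictionary of admitted inputs */
    double alpha[DMAX];          /* weights */
    int    size;
    double gamma;                /* Gaussian kernel width */
    double eta;                  /* step size */
    double eps;                  /* normalisation regulariser */
    double mu0;                  /* coherence threshold for admission */
} knlms_t;

static double kern(const double *x, const double *y, double gamma) {
    double d2 = 0;
    for (int i = 0; i < M; i++) { double t = x[i] - y[i]; d2 += t * t; }
    return exp(-gamma * d2);
}

/* One online step: predict the target for x, then update the weights.
   Returns the prediction made before the update. */
double knlms_step(knlms_t *f, const double *x, double d) {
    double h[DMAX], yhat = 0, hh = 0, coh = 0;
    for (int j = 0; j < f->size; j++) {
        h[j] = kern(x, f->dict[j], f->gamma);   /* kernelised input vector */
        yhat += f->alpha[j] * h[j];
        hh   += h[j] * h[j];
        if (h[j] > coh) coh = h[j];
    }
    if (f->size < DMAX && coh <= f->mu0) {      /* coherence criterion */
        for (int i = 0; i < M; i++) f->dict[f->size][i] = x[i];
        f->alpha[f->size] = 0;
        h[f->size] = 1.0;                       /* kern(x, x) for a Gaussian */
        hh += 1.0;
        f->size++;
    }
    double g = f->eta * (d - yhat) / (f->eps + hh);  /* normalised LMS gain */
    for (int j = 0; j < f->size; j++) f->alpha[j] += g * h[j];
    return yhat;
}

int main(void) {
    knlms_t f = { .size = 0, .gamma = 10.0, .eta = 0.5, .eps = 1e-2, .mu0 = 0.9 };
    for (int n = 0; n < 5000; n++) {            /* learn one period of a sine */
        double x = (n % 100) / 100.0;
        knlms_step(&f, &x, sin(TWO_PI * x));
    }
    double x = 0.25;
    printf("dictionary size %d, f(0.25) ~ %f (target %f)\n",
           f.size, knlms_step(&f, &x, sin(TWO_PI * x)), sin(TWO_PI * x));
    return 0;
}
```

Each step is dominated by the kernel evaluations against the dictionary, and consecutive steps depend on the updated weights; the paper's pipelining trick is to fill those dependent stages with independent training tasks from hyperparameter optimisation so that no stall cycles are needed.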
Conference Paper
Complex system designs with complex specifications, where more than one address and data item must be processed simultaneously, require suitable interface protocols. Various serial and parallel protocols are available, but nowadays the AMBA AXI protocols are generally used for the interface. In this research work we present a flexible module with generic interfaces. Our implementation is based on the AXI-4.0 and PCI Express protocols, which are high-speed, high-performance protocols. We implement the interconnect between the host processor and a PCI endpoint. We also show how packets are transferred, in terms of address and data, between host and device. We describe the interface module, which connects to various processors and high-speed modules such as DMA.
Article
Full-text available
As the Internet becomes more and more widespread, the power consumption associated with the Internet infrastructure grows rapidly, contributing to a significant increase in the operational costs of service providers. The paper presents a proof-of-concept solution consisting of an FPGA expansion card with a dedicated image processing accelerator, connected to a server via a PCI Express interface. The use of a dedicated accelerator allows faster completion of the task performed by the server, resulting in over tenfold improvement in energy efficiency.
Article
Long DRAM latency is a critical performance bottleneck in current systems. DRAM access latency is defined by three fundamental operations that take place within the DRAM cell array: (i) activation of a memory row, which opens the row to perform accesses; (ii) precharge, which prepares the cell array for the next memory access; and (iii) restoration of the row, which restores the values of cells in the row that were destroyed due to activation. There is significant latency variation for each of these operations across the cells of a single DRAM chip due to irregularity in the manufacturing process. As a result, some cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation. The goal of this work is to (i) experimentally characterize and understand the latency variation across cells within a DRAM chip for these three fundamental DRAM operations, and (ii) develop new mechanisms that exploit our understanding of the latency variation to reliably improve performance. To this end, we comprehensively characterize 240 DRAM chips from three major vendors, and make several new observations about latency variation within DRAM. We find that (i) there is large latency variation across the cells for each of the three operations; (ii) variation characteristics exhibit significant spatial locality: slower cells are clustered in certain regions of a DRAM chip; and (iii) the three fundamental operations exhibit different reliability characteristics when the latency of each operation is reduced. Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance. The key idea of FLY-DRAM is to exploit the spatial locality of slower cells within DRAM, and access the faster DRAM regions with reduced latencies for the fundamental operations. Our evaluations show that FLY-DRAM improves the performance of a wide range of applications by 13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors' real DRAM chips, in a simulated 8-core system. We conclude that the experimental characterization and analysis of latency variation within modern DRAM, provided by this work, can lead to new techniques that improve DRAM and system performance.
Article
We present RIFFA 2.1, a reusable integration framework for Field-Programmable Gate Array (FPGA) accelerators. RIFFA provides communication and synchronization for FPGA accelerated applications using simple interfaces for hardware and software. Our goal is to expand the use of FPGAs as an acceleration platform by releasing, as open source, a framework that easily integrates software running on commodity CPUs with FPGA cores. RIFFA uses PCI Express (PCIe) links to connect FPGAs to a CPU’s system bus. RIFFA 2.1 supports FPGAs from Xilinx and Altera, Linux and Windows operating systems, and allows multiple FPGAs to connect to a single host PC system. It has software bindings for C/C++, Java, Python, and Matlab. Tests show that data transfers between hardware and software can reach 97% of the achievable PCIe link bandwidth.
Conference Paper
Especially in complex system-of-systems scenarios, where multiple high-performance or real-time processing functions need to co-exist and interact, reconfigurable devices together with virtualization techniques show considerable promise to increase efficiency, ease integration and maintain functional and non-functional properties of the individual functions. In this paper, we propose a flexible interface architecture with low overhead for coupling reconfigurable coprocessors to high-performance general-purpose processors, allowing customized yet efficient construction of heterogeneous processing systems. Our implementation is based on PCI Express (PCIe) and optimized for virtualized systems, taking advantage of the SR-IOV capabilities in modern PCIe implementations. We describe the interface architecture and its fundamental technologies, detail the services provided to individual coprocessors and accelerator modules, and quantify key corner performance indicators relevant for virtualized applications.
Article
Markov Chain Monte Carlo (MCMC) is a method to draw samples from a given probability distribution. Its frequent use for solving probabilistic inference problems, where large-scale data are repeatedly processed, means that MCMC runtimes can be unacceptably large. This paper focuses on population-based MCMC, a popular family of computationally intensive MCMC samplers; we propose novel, highly optimized accelerators on three parallel hardware platforms (multi-core CPUs, GPUs and FPGAs), in order to address the performance limitations of sequential software implementations. For each platform, we jointly exploit the nature of the underlying hardware and the special characteristics of population-based MCMC. We focus particularly on the use of custom arithmetic precision, introducing two novel methods which employ custom precision in the largest part of the algorithm in order to reduce runtime, without causing sampling errors. We apply these methods to all platforms. The FPGA accelerators are up to 114x faster than multi-core CPUs and up to 53x faster than GPUs when doing inference on mixture models.
Conference Paper
We present an FPGA (field-programmable gate array) based PCIe (PCI Express) root complex architecture for SOPCs (Systems-on-a-Programmable-Chip). In our work, the system on the FPGA serves as a PCIe master device rather than a PCIe endpoint, the latter being the common practice for a co-processing device driven by a desktop computer or a server. We use this system to control a PCIe endpoint, which is an FPGA-based endpoint implemented on another FPGA board. This architecture requires only IP cores that are free of charge. We also provide a basic software driver so that specific device drivers can be developed on top of it to control popular PCIe devices in the future, e.g., Ethernet cards or graphics cards. The whole architecture has been implemented on Xilinx Virtex-6 FPGAs to show that it is a feasible approach to standalone SOPCs, with better efficiency than approaches requiring additional generic controlling processors.
Conference Paper
Markov Chain Monte Carlo (MCMC) is a ubiquitous stochastic method, used to draw random samples from arbitrary probability distributions, such as the ones encountered in Bayesian inference. MCMC often requires prohibitively long runtimes to give a representative sample in problems with high dimensions and large-scale data. Field-Programmable Gate Arrays (FPGAs) have proven to be a suitable platform for MCMC acceleration due to their ability to support massive parallelism. This paper introduces an automated method, which minimizes the floating point precision of the most computationally intensive part of an FPGA-mapped MCMC sampler, while keeping the precision-related bias in the output within a user-specified tolerance. The method is based on an efficient bias estimator, proposed here, which is able to estimate the bias in the output with only a few random samples. The optimization process involves FPGA pre-runs, which estimate the bias and choose the optimized precision. This precision is then used to reconfigure the FPGA for the final, long MCMC run, allowing for higher sampling throughputs. The process requires no user intervention. The method is tested on two Bayesian inference case studies: mixture models and neural network regression. The achieved speedups over double-precision FPGA designs were 3.5x-5x (including the optimization overhead). Comparisons with a sequential CPU and a GPGPU showed speedups of 223x-446x and 16x-18x respectively.
Conference Paper
A common type of triangulation-based active 3D scanner outputs sets of surface coordinates, called profiles, by extracting the salient features of 2D images formed from an object illuminated by a narrow plane of light. Because a conventional 2D image must be digitized and processed for each profile, current systems do not always provide adequate speed and resolution to meet application demands. To address this challenge, a special purpose image sensor is being developed. Using Compressive Sensing, this sensor will be able to digitize compressed measurements of highly structured images, such as those formed in active 3D scanning, at a rate that would represent the conventional equivalent of 50G pixels/second. It is a significant challenge to process such a high-speed data stream at rates approaching real time. Therefore, we present a single-chip FPGA design for the extraction of surface profiles from a compressed image stream originating from a 1024 by 768 pixel array at a rate of 14K images per second.
Conference Paper
Recent architectural advancements in reconfigurable devices have exposed the ability to support massive parallelism inside of small, low-cost, embedded devices. The massive parallelism inside of these reconfigurable devices has promised to bring an unprecedented level of performance to the embedded systems domain. However, the complexity of programming these reconfigurable devices is daunting - too daunting for the average programmer. This paper presents Hthreads. Hthreads is a computational architecture which aims to bridge the gap between regular programmers and powerful but complex reconfigurable devices. Hthreads accomplishes this goal using three layers of abstraction built upon standard reconfigurable devices: an operating system capable of supporting a diverse collection of computational models within a reconfigurable device, an intermediate form representation which eases the development of applications on reconfigurable devices, and support for high-level languages which are familiar to most programmers.
Conference Paper
We present HybridOS, a set of operating system extensions for supporting fine-grained reconfigurable accelerators integrated with general-purpose computing platforms. HybridOS specifically targets the application integration, data movement and communication overheads for a CPU/accelerator model when running a commodity operating system. HybridOS provides a simple API for applications and a well-defined hardware interface for reconfigurable accelerators. The goal is to reduce the difficulty in mapping applications into a CPU/accelerator model compared to an unrestrained FPGA platform while achieving whole-application speedups. HybridOS is integrated into a full Linux distribution running on the embedded processor of an FPGA. Application-specific accelerators are implemented in the reconfigurable fabric of the FPGA and are allocated to user applications running on Linux. We have developed and evaluated four methods for accessing the data buffers required by hardware-accelerated applications using our prototype. The results of our work show the feasibility of our system for a case study, JPEG encoding with two accelerators, and an evaluation of HybridOS for varying data movement requirements that can be used as a guide for future application developers.
Conference Paper
Reconfigurable computing applications often need to divide computation between software running on a conventional desktop processor and hardware mapped to an FPGA. However, the reconfigurable computing development platforms available today either do not provide a sufficient mechanism for the communication and synchronization that is needed, or else employ a complex and proprietary API specific to a given toolflow or device, limiting code portability. The Simple Interface for Reconfigurable Computing (SIRC) project provides a straightforward, portable and extensible open-source communication and synchronization API. It consists of both a software-side interface and a hardware-side interface that allows C++ code running on a host PC to communicate and synchronize with a Verilog-based circuit mapped to an FPGA. One key feature of this API is that both the hardware and software user interfaces can remain consistent across all platforms and future releases. This allows applications built for existing systems to migrate to different platforms without significant modification to user code.
Conference Paper
Accelerators are special purpose processors designed to speed up compute-intensive sections of applications. Two extreme endpoints in the spectrum of possible accelerators are FPGAs and GPUs, which can often achieve better performance than CPUs on certain workloads. FPGAs are highly customizable, while GPUs provide massive parallel execution resources and high memory bandwidth. Applications typically exhibit vastly different performance characteristics depending on the accelerator. This is an inherent problem attributable to architectural design, middleware support and programming style of the target platform. For the best application-to-accelerator mapping, factors such as programmability, performance, programming cost and sources of overhead in the design flows must be all taken into consideration. In general, FPGAs provide the best expectation of performance, flexibility and low overhead, while GPUs tend to be easier to program and require less hardware resources. We present a performance study of three diverse applications - Gaussian elimination, data encryption standard (DES), and Needleman-Wunsch - on an FPGA, a GPU and a multicore CPU system. We perform a comparative study of application behavior on accelerators considering performance and code complexity. Based on our results, we present an application characteristic to accelerator platform mapping, which can aid developers in selecting an appropriate target architecture for their chosen application.