Table 1 - uploaded by Zhenman Fang
Source publication
Conventional homogeneous multicore processors are not able to provide the continued performance and energy improvement that we have expected from past endeavors. Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among different heterogeneous devices, FPGAs...
Contexts in source publication
Context 1
... the trend of adopting FPGAs in datacenters, various CPU-FPGA acceleration platforms with diversified microarchitectural features have been developed. We classify state-of-the-art CPU-FPGA platforms in Table 1 ...
Context 2
... summary, this paper makes the following contributions. 1. The first quantitative characterization and comparison of the microarchitectures of state-of-the-art CPU-FPGA acceleration platforms, including the Alpha Data board and Amazon F1 instance, IBM CAPI, and Intel Xeon+FPGA v1 and v2, which covers the whole range of CPU-FPGA connections. We quantify each platform's CPU-FPGA communication latency and bandwidth; the results are summarized in Fig. 1. 2. An in-depth analysis of the large gap between advertised and practically achievable performance (Section 3), with a step-by-step decomposition of the inefficiencies. ...
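The advertised-versus-achievable gap discussed above ultimately reduces to simple arithmetic on timed transfers. A minimal sketch of how such an effective-bandwidth figure can be derived (the function name and numbers here are illustrative assumptions, not the paper's measurement code):

```python
# Hypothetical sketch: deriving an effective-bandwidth number like those
# summarized in the paper's Fig. 1 from one timed CPU-FPGA transfer.
# Payload size and elapsed time are invented for illustration.

def effective_bandwidth_gbps(payload_bytes: int, elapsed_s: float) -> float:
    """Effective bandwidth in GiB/s for one timed transfer."""
    return payload_bytes / elapsed_s / 2**30

# Example: a 1 GiB DMA transfer that completes in 0.5 s achieves
# 2 GiB/s effective bandwidth, which can sit well below an advertised
# peak once setup latency and protocol overhead are included.
print(effective_bandwidth_gbps(1 << 30, 0.5))  # 2.0
```

Repeating such measurements across payload sizes exposes the latency-dominated regime (small transfers) versus the throughput-dominated regime (large transfers).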
Context 3
... in addition to the commodity CPU-FPGA integrated platforms in Table 1, there is also a large body of academic work that focuses on how to efficiently integrate hardware accelerators into general-purpose processors. Yesil et al. [31] surveyed existing custom accelerators and integration techniques for accelerator-rich systems in the context of data centers, but without the kind of quantitative study that we provide. ...
Similar publications
Quantum computers are traditionally operated by programmers at the granularity of a gate-based instruction set. However, the actual device-level control of a quantum computer is performed via analog pulses. We introduce a compiler that exploits direct control at this microarchitectural level to achieve significant improvements for quantum programs....
Citations
... Although parallel acceleration of Vina achieves gains on mainstream computing platforms such as CPUs and GPUs, these approaches also introduce significant energy consumption [15]. FPGA-based accelerators are considered one of the most promising directions, since FPGAs provide low power consumption and high energy efficiency and can be reprogrammed to accelerate different applications [16]. Moreover, FPGAs are a potential solution for accelerating the MD process, as has been proven in previous MD tools [17], [18]. ...
AutoDock Vina (Vina) stands out among numerous molecular docking tools due to its precision and comparatively high speed, playing a key role in the drug discovery process. Hardware acceleration of Vina on FPGA platforms offers a highly energy-efficient approach to speeding up the docking process. However, previous FPGA-based Vina accelerators exhibit several shortcomings: 1) simple uniform quantization results in an inevitable accuracy drop; 2) due to Vina's complex computing process, the evaluation and optimization phase of the hardware design becomes extended; 3) the iterative computations in Vina constrain the potential for further parallelization; 4) the system's scalability is limited by its unwieldy architecture. To address these challenges, we propose Vina-FPGA-cluster, a multi-FPGA-based molecular docking tool enabling high-accuracy, multi-level parallel Vina acceleration. Standing on the shoulders of Vina-FPGA, we first adopt hybrid fixed-point quantization to minimize accuracy loss. We then propose a SystemC-based model that accelerates the evaluation of the hardware accelerator architecture design. Next, we propose a novel bidirectional AG module for data-level parallelism. Finally, we optimize the system architecture for scalable deployment on multiple Xilinx ZCU104 boards, achieving task-level parallelism. Vina-FPGA-cluster is tested on three representative molecular docking datasets. The experimental results indicate that in terms of RMSD (for successful docking outcomes with metrics below 2Å), Vina-FPGA-cluster shows a mere 0.2% loss. Relative to the CPU and Vina-FPGA, Vina-FPGA-cluster achieves 27.33× and 7.26× speedups, respectively. Notably, Vina-FPGA-cluster delivers a 1.38× speedup over the GPU implementation (Vina-GPU) with just 28.99% of its power consumption.
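The hybrid fixed-point quantization mentioned in the abstract can be pictured as giving each group of variables its own fixed-point format, spending integer bits on dynamic range and the rest on precision. The sketch below is a toy model of that idea with invented names and bit-allocation rules, not the actual Vina-FPGA-cluster scheme:

```python
# Illustrative sketch of hybrid fixed-point quantization: variables with
# a wide dynamic range get fewer fractional bits, narrow-range variables
# get more. All names and the allocation heuristic are assumptions.
import math

def quantize(x: float, frac_bits: int) -> float:
    """Round x onto a fixed-point grid with 2**-frac_bits resolution."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def choose_frac_bits(values, total_bits=16):
    """Spend integer bits on the observed range, the rest on the fraction."""
    peak = max(abs(v) for v in values)
    int_bits = max(1, math.ceil(math.log2(peak + 1)))
    return total_bits - int_bits - 1  # 1 bit reserved for the sign

energies = [103.7, -58.2, 12.9]        # wide range  -> fewer fraction bits
gradients = [0.0312, -0.0075, 0.0018]  # narrow range -> more fraction bits

print(choose_frac_bits(energies))   # 8 fractional bits
print(choose_frac_bits(gradients))  # 14 fractional bits
```

Uniform quantization would force both groups into one format, wasting precision on the narrow-range values — which is the accuracy drop the paper attributes to "simple uniform quantization".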
... Another significant advantage of FPGAs over ASIC solutions is flexibility, since they can be reconfigured as coprocessors according to the required application. There is an upward trend in the academic and industrial sectors toward the use and manufacture of heterogeneous CPU-FPGA platforms, such as the Alpha Data FPGA board [61], the Amazon F1 instance [62], and Intel Xeon+FPGA [63], among others. ...
This paper proposes efficient implementations of addition/subtraction based on decimal floating point with Densely Packed Decimal (DPD) and Binary Integer Decimal (BID) encodings in FPGA devices. The designs use novel techniques based on the efficient utilization of dedicated resources in programmable devices. The implementations target Xilinx UltraScale+. The DPD adder/subtractor has computation times of 7.7 ns for Decimal32, 8.1 ns for Decimal64, and 8.5 ns for Decimal128; the BID adder/subtractor has a computation time of 13.5 ns for Decimal64. The proposed architecture achieves better computation times than related works. Compared to previous architectures, the proposed DPD implementation achieves a 1.86× speedup and 47% better LUT occupation. Likewise, the BID adder/subtractor achieves a 3× speedup and 5% lower LUT occupation.
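For readers unfamiliar with the two encodings: BID stores a decimal number as a binary-integer coefficient plus a decimal exponent, so decimal addition is exponent alignment followed by an integer add. A minimal software illustration of that arithmetic (this is the encoding's semantics, not the paper's FPGA datapath):

```python
# Minimal sketch of BID (Binary Integer Decimal) addition: a value is a
# (coefficient, exponent) pair meaning coefficient * 10**exponent, so
# 1.5 is (15, -1). Addition aligns exponents and adds the integer
# coefficients exactly, with no binary rounding error.

def bid_add(a, b):
    """Add two (coefficient, exponent) decimal values exactly."""
    (ca, ea), (cb, eb) = a, b
    if ea > eb:                      # rescale the larger-exponent operand
        ca, ea = ca * 10 ** (ea - eb), eb
    elif eb > ea:
        cb, eb = cb * 10 ** (eb - ea), ea
    return (ca + cb, ea)

# 1.5 + 2.25 = 3.75 exactly: coefficient 375, exponent -2
print(bid_add((15, -1), (225, -2)))  # (375, -2)
```

DPD, by contrast, packs three decimal digits into 10 bits, so its adder works digit-group by digit-group — which is why the two designs in the paper have such different critical paths.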
... For example, waiting for the most suitable resource to become available can lead to higher energy efficiency than resorting to an immediately available but less suitable resource, such as a CPU core or a reconfigurable accelerator. Besides the complex runtime scheduling problem, recent studies show that execution times for applications of domain-specific systems are on the nanosecond scale [1,7]. Therefore, the classical scheduling problem encounters a new challenge in heterogeneous DSSoCs, because domain-specific tasks executed on their specialized pipelines can run in the order of nanoseconds, i.e., at least two or three orders of magnitude faster than on general-purpose cores. ...
Domain-specific systems on chip (DSSoCs) aim to narrow the gap between general-purpose processors and application-specific designs. CPU clusters enable programmability, whereas hardware accelerators tailored to the target domain minimize task execution times and power consumption. Traditional operating system (OS) schedulers can diminish the potential of DSSoCs, as their execution times can be orders of magnitude larger than the task execution time. To address this problem, we propose a dynamic adaptive scheduling (DAS) framework that combines the advantages of a fast, low-overhead scheduler and a sophisticated, high-performance scheduler with a larger overhead. We present a novel runtime classifier that chooses the better scheduler type as a function of the system workload, leading to improved system performance and energy-delay product (EDP). Experiments with five real-world streaming applications indicate that DAS consistently outperforms fast, low-overhead, and slow, sophisticated schedulers. DAS achieves a 1.29× speedup and a 45% lower EDP than the sophisticated scheduler under low data rates and a 1.28× speedup and a 37% lower EDP than the fast scheduler when the workload complexity increases. Furthermore, we demonstrate that the superior performance of the DAS framework also applies to hardware platforms, with up to a 48% and 52% reduction in the execution time and EDP, respectively.
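The core DAS idea — a cheap runtime classifier that switches between a low-overhead scheduler and a heavier, better-optimizing one — can be sketched in a few lines. Everything below (the feature, the threshold, the scheduler bodies) is an invented toy, not the DAS implementation:

```python
# Toy sketch of a dual-scheduler design in the spirit of DAS: a cheap
# classifier inspects a workload feature and picks the fast scheduler
# under light load, the sophisticated one when the workload grows.
# Threshold and cost model are assumptions for illustration.

def classify(jobs_in_flight: int, threshold: int = 8) -> str:
    """Choose the scheduler type from a cheap workload feature."""
    return "sophisticated" if jobs_in_flight >= threshold else "fast"

def fast_schedule(task, accelerators):
    # O(1): static lookup of the task's preferred accelerator.
    return accelerators[task["kind"]][0]

def sophisticated_schedule(task, accelerators):
    # Higher overhead: scan for the least-loaded matching accelerator.
    return min(accelerators[task["kind"]], key=lambda a: a["queue"])

accelerators = {"fft": [{"id": 0, "queue": 3}, {"id": 1, "queue": 0}]}
task = {"kind": "fft"}

print(classify(2))    # fast
print(classify(20))   # sophisticated
print(sophisticated_schedule(task, accelerators)["id"])  # 1
```

The payoff claimed in the abstract follows from this structure: under light load the classifier avoids the sophisticated scheduler's overhead, and under heavy load it avoids the fast scheduler's poor placements.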
... In particular, data is first copied from the FPGA memory to the shared memory space and then transferred to another FPGA. Compared with CPU-FPGA communication, FPGA-to-FPGA communication is much slower because it requires additional data copying to the CPU memory [26]. Thus, we propose to fetch data directly from the CPU memory. ...
As the size of real-world graphs increases, training Graph Neural Networks (GNNs) has become time-consuming and requires acceleration. While previous works have demonstrated the potential of utilizing FPGA for accelerating GNN training, few works have been carried out to accelerate GNN training with multiple FPGAs due to the necessity of hardware expertise and substantial development effort. To this end, we propose HitGNN, a framework that enables users to effortlessly map GNN training workloads onto a CPU-Multi-FPGA platform for acceleration. In particular, HitGNN takes the user-defined synchronous GNN training algorithm, GNN model, and platform metadata as input, determines the design parameters based on the platform metadata, and performs hardware mapping onto the CPU+Multi-FPGA platform, automatically. HitGNN consists of the following building blocks: (1) high-level application programming interfaces (APIs) that allow users to specify various synchronous GNN training algorithms and GNN models with only a handful of lines of code; (2) a software generator that generates a host program that performs mini-batch sampling, manages CPU-FPGA communication, and handles workload balancing among the FPGAs; (3) an accelerator generator that generates GNN kernels with optimized datapath and memory organization. We show that existing synchronous GNN training algorithms such as DistDGL and PaGraph can be easily deployed on a CPU+Multi-FPGA platform using our framework, while achieving high training throughput. Compared with the state-of-the-art frameworks that accelerate synchronous GNN training on a multi-GPU platform, HitGNN achieves up to 27.21x bandwidth efficiency, and up to 4.26x speedup using much less compute power and memory bandwidth than GPUs. In addition, HitGNN demonstrates good scalability to 16 FPGAs on a CPU+Multi-FPGA platform.
... The aim of this chapter is to provide an extensive overview of the evolution of reconfigurable systems from different points of view, with a specific focus on those based on Field Programmable Gate Arrays (FPGAs). We will explore the evolution of these systems from standalone to hybrid solutions [1], the evolution of the toolchains developed both to increase productivity and to widen the user base of these reconfigurable fabrics [2], and the differentiation of the paradigms employed and their application scenarios [3]. Considering the magnitude of these topics, we will cover the time span from a period when only a restricted elite knew about and exploited reconfigurable systems to the present day, when they are often integrated into datacenters and provided as services to a wider audience. ...
... The second attempt to increase the usage of reconfigurable systems by a larger number of users was to combine them with general-purpose processors and, later on, with software-programmable vector engines. The coupling with micro-controllers and hard processors opens up different application scenarios but also introduces new challenges in interconnection and memory coherency [1]. Indeed, the aforementioned heterogeneity and high connectivity favor the adoption of reconfigurable systems in the cloud computing ecosystem, where the power wall hit with dark silicon [11] leaves providers craving energy-efficient solutions such as reconfigurable systems. ...
... Following the aforementioned improvements, and considering that homogeneous multi-core processors, especially in data centres, are failing to provide the desired energy efficiency and performance, new devices have been deployed, specifically heterogeneous architectures [1]. The integration of hardware accelerators into these architectures is gaining interest as a promising solution. ...
Reconfigurable computing is an expanding field that, during the last decades, has evolved from a relatively closed community, where highly skilled developers deployed high-performance systems based on their knowledge of the underlying physical system, into an attractive solution for both industry and academia. In this chapter, we explore the different lines of development in the field, namely the need for new tools to shorten development time, the creation of heterogeneous platforms that couple hardware accelerators with general-purpose processors, and the demand to move from general to specific solutions. Starting with the identification of the main limitations that have driven improvements in the field, we explore the emergence of a wide range of computer-aided design tools that allow the use of high-level languages and guide the user through the whole process of system deployment. This opening to a wider public, together with their high performance at relatively low power consumption, has facilitated their spread into data centers, where, apart from the undeniable benefits, we have explored critical issues. We conclude with the latest trends in the field, such as the use of hardware as a service and the shift to domain-specific architectures based on reconfigurable fabrics.
... Logic resources: The logic resources of an FPGA are used for function customization [20,32,33]. Typical FPGA logic resources comprise functional resources and wiring resources, which can be further divided into pre-defined control logic and programmable units. ...
... The off-chip memory of an FPGA is mainly DRAM-based storage. Several high-end devices are equipped with an emerging memory technology called high-bandwidth memory (HBM) (e.g., Xilinx Alveo U280 [36] & U50 [32], and Intel Stratix 10 MX [33]). ...
... The devices are programmed by loading bitstreams into the configuration memory. Thus, reconfiguration is the key hardware support for FPGA sharing [32,33]. An FPGA can be shared under different configurations from different tenants. ...
Cloud vendors are actively adopting FPGAs into their infrastructures to enhance performance and efficiency. As cloud services continue to evolve, FPGA (field-programmable gate array) systems will play an even more important role in the future. In this context, FPGA sharing in multi-tenancy scenarios is crucial for the wide adoption of FPGAs in the cloud. Recently, much work has been done toward effective FPGA sharing at different layers of the cloud computing stack.
In this work, we provide a comprehensive survey of recent works on FPGA sharing. We examine prior art from different aspects and encapsulate relevant proposals on a few key topics. On the one hand, we discuss representative papers on FPGA resource sharing schemes; on the other hand, we also summarize important SW/HW techniques that support effective sharing. Importantly, we further analyze the system design cost behind FPGA sharing. Finally, based on our survey, we identify key opportunities and challenges of FPGA sharing in future cloud scenarios.
... Using the traditional approach, we would need to create two separate prediction models, one for the low-end edge environment and another for the high-end cloud environment, each requiring a large number of samples. Our goal is to leverage an ML-based performance model trained on a low-end local system to: (1) predict the performance of a new, unknown, high-end FPGA-based system, and (2) predict the performance of a new, unknown application, overcoming the limitations of current ML-based performance models in a cloud environment. ...
Machine-learning-based models have recently gained traction as a way to overcome the slow downstream implementation process of FPGAs by building models that provide fast and accurate performance predictions. However, these models suffer from two main limitations: (1) training requires large amounts of data (features extracted from FPGA synthesis and implementation reports), which is cost-inefficient because of the time-consuming FPGA design cycle; (2) a model trained for a specific environment cannot predict for a new, unknown environment. In a cloud system, where getting access to platforms is typically costly, data collection for ML models can significantly increase the total cost of ownership (TCO) of a system. To overcome these limitations, we propose LEAPER, a transfer-learning-based approach for FPGA-based systems that adapts an existing ML-based model to a new, unknown environment to provide fast and accurate performance and resource-utilization predictions. Experimental results show that our approach delivers, on average, 85% accuracy when we use our transferred model for prediction in a cloud environment with 5-shot learning, and reduces design-space exploration time by 10×, from days to only a few hours.
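The few-shot transfer idea can be illustrated with a deliberately tiny stand-in: reuse a predictor trained on the cheap local platform, then fit only a small correction from a handful of target-platform samples. The linear "models" and data below are invented to show the mechanism; LEAPER's actual models and features are far richer:

```python
# Hedged sketch of few-shot model transfer: a base predictor from the
# source platform plus a closed-form scale-and-shift correction fitted
# on 5 target-platform samples. All data and model forms are assumptions.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, pure Python."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Base model learned on the low-end platform: latency ~ 2*size + 1
base = lambda x: 2 * x + 1

# Five "shots" measured on the new platform (true relation: 3*size + 2)
shots_x = [1, 2, 3, 4, 5]
shots_y = [3 * x + 2 for x in shots_x]

# Adapt: regress the target latency against the base model's prediction
a, b = fit_line([base(x) for x in shots_x], shots_y)
adapted = lambda x: a * base(x) + b

print(adapted(10))  # 32.0 — matches the target relation 3*10 + 2
```

Because the base model already captures the workload-dependent shape, the five samples only have to pin down how the new platform rescales it — that is the data saving the abstract refers to.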
... The need for energy efficiency from edge to cloud computing has boosted the widespread adoption of FPGAs in cloud computing [16,21,63,89,379,384]. ... (A part of this chapter is published as "Modeling FPGA-Based Systems via Few-Shot Learning" in FPGA 2021.)
The cost of moving data between the memory units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. At the same time, we are witnessing an enormous amount of data being generated across multiple application domains. These trends suggest a need for a paradigm shift towards a data-centric approach where computation is performed close to where the data resides. Further, a data-centric approach can enable a data-driven view where we take advantage of vast amounts of available data to improve architectural decisions. As a step towards modern architectures, this dissertation contributes to various aspects of the data-centric approach and proposes several data-driven mechanisms. First, we design NERO, a data-centric accelerator for a real-world weather prediction application. Second, we explore the applicability of different number formats, including fixed-point, floating-point, and posit, for different stencil kernels. Third, we propose NAPEL, an ML-based application performance and energy prediction framework for data-centric architectures. Fourth, we present LEAPER, the first use of few-shot learning to transfer FPGA-based computing models across different hardware platforms and applications. Fifth, we propose Sibyl, the first reinforcement learning-based mechanism for data placement in hybrid storage systems. Overall, this thesis provides two key conclusions: (1) hardware acceleration on an FPGA+HBM fabric is a promising solution to overcome the data movement bottleneck of our current computing systems; (2) data should drive system and design decisions by leveraging inherent data characteristics to make our computing systems more efficient.
... Table 1: Related work (§4) in the integration of network (net), storage (sto), and accelerator (accel) devices.
Net + Accel: SmartNICs [5,110], AcclNet [53], hXDP [35]
Net + GPU: GPUDirect [102], GPUNet [78]
Sto + GPU: Donard [22], SPIN [25], GPUfs [124], GPUDirect [103], nvidia BAM [113]
Net + Sto: iSCSI, NVMoF (offload [117], BlueField [5]), i10 [68], ReFlex [80]
Sto + Accel: ASIC/CPU [60,83,121], GPUs [25,26,124], FPGA [69,116,119,143], Hayagui [15]
Hybrid: System with ARM SoC [3,47,90], BEE3 [44], hybrid CPU-FPGA systems [39,41]
DPUs: Hyperion (stand-alone), Fungible (MIPS64 R6 cores) DPU processor [54], Pensando (host-attached P4 programmable processor) [108], BlueField (host-attached, with ARM cores) [5] ...
... Second, a direct consequence of keeping a CPU-driven design is to inherit its choices of memory addressing, translation, and protection mechanisms such as virtual memory, paging, and segmentation [45]. When an accelerator such as FPGA, is attached to a CPU as an external device [39] or as a co-processor [41], there is a temptation to provide/port the familiar memory abstractions like unified virtual memory [84] and/or shared memory [94]. This design necessitates a complex integration with further CPU-attached memory abstractions such as page tables and TLBs, virtualization, huge pages, IOMMU, etc., while keeping such an integration coherent with the CPU view of the system [84,94]. ...
Since the inception of computing, we have been reliant on CPU-powered architectures. However, today this reliance is challenged by manufacturing limitations (CMOS scaling), performance expectations (stalled clocks, Turing tax), and security concerns (microarchitectural attacks). To re-imagine our computing architecture, in this work we take a more radical but pragmatic approach and propose to eliminate the CPU with its design baggage, and integrate three primary pillars of computing, i.e., networking, storage, and computing, into a single, self-hosting, unified CPU-free Data Processing Unit (DPU) called Hyperion. In this paper, we present the case for Hyperion, its design choices, initial work-in-progress details, and seek feedback from the systems community.
... The authors of Todman et al. (2005) provide background information on notable aspects of older FPGA technologies and explain the fundamental architectures and design methods for codesign. Furthermore, the work in Choi et al. (2019) is another comprehensive study that evaluates and analyzes the microarchitectural characteristics of state-of-the-art CPU-FPGA platforms in depth. That paper covers most of the shared-memory platforms with detailed benchmarks. ...
In this community review report, we discuss applications and techniques for fast machine learning (ML) in science—the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs.