Fig 3 - uploaded by Zhenman Fang
Source publication
Conventional homogeneous multicore processors are not able to provide the continued performance and energy improvement that we have expected from past endeavors. Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among different heterogeneous devices, FPGAs...
Contexts in source publication
Context 1
... with physical addressing effectively adopt a separate address space paradigm (Fig. 3). Data shared between the host and device must be allocated in both the host-side CPU-attached memory and the private device DRAM, and explicitly copied between them by the host program. Although copying array-based data structures is straightforward, moving pointer-based data structures such as linked lists and trees presents ...
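To make the copy-based model concrete, the following minimal host-side sketch (in C, using the OpenCL host API purely for illustration; the platforms discussed in the paper expose their own vendor APIs, and all names and sizes here are hypothetical) shows the two allocations and the explicit transfers that the excerpt describes:

```c
/* Minimal sketch of the separate-address-space (copy-based) model.
 * OpenCL host API used for illustration only; vendor-specific APIs for
 * PCIe-attached FPGA boards follow the same pattern. Names and sizes
 * are hypothetical. */
#include <CL/cl.h>
#include <stdlib.h>

#define N (1024 * 1024)

void copy_based_example(cl_context ctx, cl_command_queue q, cl_kernel krnl)
{
    cl_int err;

    /* Allocation #1: buffer in host-side, CPU-attached memory. */
    float *host_buf = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++)
        host_buf[i] = (float)i;

    /* Allocation #2: a second copy in the private device DRAM. */
    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    N * sizeof(float), NULL, &err);

    /* Explicit host -> device copy, driven by the host program. */
    clEnqueueWriteBuffer(q, dev_buf, CL_TRUE, 0,
                         N * sizeof(float), host_buf, 0, NULL, NULL);

    clSetKernelArg(krnl, 0, sizeof(cl_mem), &dev_buf);
    clEnqueueTask(q, krnl, 0, NULL, NULL);

    /* Explicit device -> host copy to retrieve the results. */
    clEnqueueReadBuffer(q, dev_buf, CL_TRUE, 0,
                        N * sizeof(float), host_buf, 0, NULL, NULL);

    clReleaseMemObject(dev_buf);
    free(host_buf);
}
```

This pattern works well for flat arrays; for a linked list or tree, every host pointer embedded in the data would be meaningless in the device's separate address space and would have to be serialized or translated before the copy, which is the difficulty the excerpt alludes to.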
Context 2
... tighter logical CPU-FPGA integration, the ideal case would be to have a unified shared address space between the CPU and FPGA. In this case (Fig. 3), instead of allocating two copies in both host and device memories, only a single allocation is necessary. This has a variety of benefits, including the elimination of explicit data copies, the preservation of pointer semantics, and increased performance for fine-grained memory accesses. CAPI enables a unified address space through additional hardware ...
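By contrast, a single shared allocation can be sketched as below. OpenCL 2.0 fine-grained shared virtual memory (SVM) is used here only as a stand-in for the unified-address-space mechanisms discussed in the excerpt; CAPI exposes its own, lower-level interface, and the names below are hypothetical:

```c
/* Sketch of the unified/shared-address-space model, illustrated with
 * OpenCL 2.0 fine-grained SVM. This is not CAPI's own interface; it only
 * illustrates the single-allocation idea. Names are hypothetical. */
#include <CL/cl.h>

#define N (1024 * 1024)

void shared_address_example(cl_context ctx, cl_command_queue q, cl_kernel krnl)
{
    /* Single allocation, visible to both the CPU and the accelerator. */
    float *shared_buf = clSVMAlloc(ctx,
                                   CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                   N * sizeof(float), 0);

    /* The host accesses it through ordinary pointers... */
    for (int i = 0; i < N; i++)
        shared_buf[i] = (float)i;

    /* ...and passes the same pointer to the kernel: no explicit copies,
     * and pointers stored inside the buffer keep their meaning on the
     * device (pointer semantics are preserved). */
    clSetKernelArgSVMPointer(krnl, 0, shared_buf);
    clEnqueueTask(q, krnl, 0, NULL, NULL);
    clFinish(q);

    clSVMFree(ctx, shared_buf);
}
```

On a device that only supports coarse-grained SVM, the host accesses would additionally be bracketed by clEnqueueSVMMap/clEnqueueSVMUnmap, but the single-allocation structure stays the same.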
Similar publications
Quantum computers are traditionally operated by programmers at the granularity of a gate-based instruction set. However, the actual device-level control of a quantum computer is performed via analog pulses. We introduce a compiler that exploits direct control at this microarchitectural level to achieve significant improvements for quantum programs....
Citations
... In particular, data is first copied from the FPGA memory to the shared memory space and then transferred to another FPGA. Compared with CPU-FPGA communication, FPGA-to-FPGA communication is much slower because it requires additional data copying to the CPU memory [26]. Thus, we propose to fetch data directly from the CPU memory. ...
As the size of real-world graphs increases, training Graph Neural Networks (GNNs) has become time-consuming and requires acceleration. While previous works have demonstrated the potential of utilizing FPGAs to accelerate GNN training, few works have been carried out to accelerate GNN training with multiple FPGAs due to the necessity of hardware expertise and substantial development effort. To this end, we propose HitGNN, a framework that enables users to effortlessly map GNN training workloads onto a CPU+Multi-FPGA platform for acceleration. In particular, HitGNN takes the user-defined synchronous GNN training algorithm, GNN model, and platform metadata as input, determines the design parameters based on the platform metadata, and automatically performs hardware mapping onto the CPU+Multi-FPGA platform. HitGNN consists of the following building blocks: (1) high-level application programming interfaces (APIs) that allow users to specify various synchronous GNN training algorithms and GNN models with only a handful of lines of code; (2) a software generator that generates a host program that performs mini-batch sampling, manages CPU-FPGA communication, and handles workload balancing among the FPGAs; (3) an accelerator generator that generates GNN kernels with optimized datapath and memory organization. We show that existing synchronous GNN training algorithms such as DistDGL and PaGraph can be easily deployed on a CPU+Multi-FPGA platform using our framework, while achieving high training throughput. Compared with the state-of-the-art frameworks that accelerate synchronous GNN training on a multi-GPU platform, HitGNN achieves up to 27.21x higher bandwidth efficiency and up to 4.26x speedup using much less compute power and memory bandwidth than GPUs. In addition, HitGNN demonstrates good scalability to 16 FPGAs on a CPU+Multi-FPGA platform.
... The aim of this chapter is to provide an extensive overview of the evolution of reconfigurable systems from different points of view, with a specific focus on those based on Field Programmable Gate Arrays (FPGAs). We will explore the evolution of these systems from standalone to hybrid solutions [1], the evolution of the toolchains developed both to increase productivity and to widen the user base of these reconfigurable fabrics [2], and the differentiation of the paradigms employed and their application scenarios [3]. Considering the magnitude of the topics, we will cover the time span from a period when only a restricted elite knew and exploited reconfigurable systems to the present day, when they are often integrated into datacenters and provided as services to a wider audience. ...
... The second attempt to increase the usage of reconfigurable systems by a larger number of users was combining them with general-purpose processors and, later on, with software-programmable vector engines. The coupling with micro-controllers and hard processors opens up different application scenarios but also introduces new challenges in interconnection and memory coherency [1]. Indeed, the aforementioned heterogeneity and high connectivity favor the adoption of reconfigurable systems in the cloud computing ecosystem, where the power wall and the resulting dark silicon [11] make providers crave energy-efficient solutions such as reconfigurable systems. ...
... Following the aforementioned improvements, and considering that homogeneous multi-core processors, especially in data-centres, are failing to provide the desired energy efficiency and performance, new devices have been deployed, specifically heterogeneous architectures [1]. The integration of hardware accelerators into these architectures is gaining interest as a promising solution. ...
Reconfigurable computing is an expanding field that, during the last decades, has evolved from a relatively closed community, where highly skilled developers deployed high-performance systems based on their knowledge of the underlying physical system, to an attractive solution for both industry and academia. With this chapter, we explore the different lines of development in the field, namely the need for new tools to shorten the development time, the creation of heterogeneous platforms which couple hardware accelerators with general-purpose processors, and the demand to move from general to specific solutions. Starting with the identification of the main limitations that have led to improvements in the field, we explore the emergence of a wide range of computer-aided design tools that allow the use of high-level languages and guide the user through the whole process of system deployment. This opening to a wider public, together with their high performance and relatively low power consumption, facilitates their spread in data centers, where, beyond the undeniable benefits, we also explore critical issues. We conclude with the latest trends in the field, such as the use of hardware as a service and the shift to Domain Specific Architectures based on reconfigurable fabrics.
... Using the traditional approach, we would need to create two separate prediction models, one for the low-end edge environment and another one for the high-end cloud environment, each one requiring a large number of samples. Our goal is to leverage an ML-based performance model trained on a low-end local system to: (1) predict the performance in a new, unknown, high-end FPGA-based system, and (2) predict the performance of a new, unknown application, to overcome the limitations of current ML-based performance models in a cloud environment. ...
Machine-learning-based models have recently gained traction as a way to overcome the slow downstream implementation process of FPGAs by building models that provide fast and accurate performance predictions. However, these models suffer from two main limitations: (1) training requires large amounts of data (features extracted from FPGA synthesis and implementation reports), which is cost-inefficient because of the time-consuming FPGA design cycle; (2) a model trained for a specific environment cannot predict for a new, unknown environment. In a cloud system, where getting access to platforms is typically costly, data collection for ML models can significantly increase the total cost of ownership (TCO) of a system. To overcome these limitations, we propose LEAPER, a transfer learning-based approach for FPGA-based systems that adapts an existing ML-based model to a new, unknown environment to provide fast and accurate performance and resource utilization predictions. Experimental results show that our approach delivers, on average, 85% accuracy when we use our transferred model for prediction in a cloud environment with 5-shot learning and reduces design-space exploration time by 10x, from days to only a few hours.
... The need for energy efficiency from edge to cloud computing has boosted the widespread adoption of FPGAs. In cloud computing [16,21,63,89,379,384], FPGAs ... (A part of this chapter is published as "Modeling FPGA-Based Systems via Few-Shot Learning" in FPGA 2021.) ...
The cost of moving data between the memory units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. At the same time, we are witnessing an enormous amount of data being generated across multiple application domains. These trends suggest a need for a paradigm shift towards a data-centric approach where computation is performed close to where the data resides. Further, a data-centric approach can enable a data-driven view where we take advantage of vast amounts of available data to improve architectural decisions. As a step towards modern architectures, this dissertation contributes to various aspects of the data-centric approach and proposes several data-driven mechanisms. First, we design NERO, a data-centric accelerator for a real-world weather prediction application. Second, we explore the applicability of different number formats, including fixed-point, floating-point, and posit, for different stencil kernels. Third, we propose NAPEL, an ML-based application performance and energy prediction framework for data-centric architectures. Fourth, we present LEAPER, the first use of few-shot learning to transfer FPGA-based computing models across different hardware platforms and applications. Fifth, we propose Sibyl, the first reinforcement learning-based mechanism for data placement in hybrid storage systems. Overall, this thesis provides two key conclusions: (1) hardware acceleration on an FPGA+HBM fabric is a promising solution to overcome the data movement bottleneck of our current computing systems; (2) data should drive system and design decisions by leveraging inherent data characteristics to make our computing systems more efficient.
... Table 1: Related work (§4) in the integration of network (net), storage (sto), and accelerators (accel) devices.
Net + Accel: SmartNICs [5,110], AcclNet [53], hXDP [35]
Net + GPU: GPUDirect [102], GPUNet [78]
Sto + GPU: Donard [22], SPIN [25], GPUfs [124], GPUDirect [103], NVIDIA BAM [113]
Net + Sto: iSCSI, NVMoF (offload [117], BlueField [5]), i10 [68], ReFlex [80]
Sto + Accel: ASIC/CPU [60,83,121], GPUs [25,26,124], FPGA [69,116,119,143], Hayagui [15]
Hybrid: System with ARM SoC [3,47,90], BEE3 [44], hybrid CPU-FPGA systems [39,41]
DPUs: Hyperion (stand-alone), Fungible (MIPS64 R6 cores) DPU processor [54], Pensando (host-attached P4 programmable processor) [108], BlueField (host-attached, with ARM cores) [5] ...
... Second, a direct consequence of keeping a CPU-driven design is to inherit its choices of memory addressing, translation, and protection mechanisms such as virtual memory, paging, and segmentation [45]. When an accelerator, such as an FPGA, is attached to a CPU as an external device [39] or as a co-processor [41], there is a temptation to provide/port the familiar memory abstractions like unified virtual memory [84] and/or shared memory [94]. This design necessitates a complex integration with further CPU-attached memory abstractions such as page tables and TLBs, virtualization, huge pages, IOMMU, etc., while keeping such an integration coherent with the CPU view of the system [84,94]. ...
Since the inception of computing, we have been reliant on CPU-powered architectures. However, today this reliance is challenged by manufacturing limitations (CMOS scaling), performance expectations (stalled clocks, Turing tax), and security concerns (microarchitectural attacks). To re-imagine our computing architecture, in this work we take a more radical but pragmatic approach and propose to eliminate the CPU with its design baggage, and integrate three primary pillars of computing, i.e., networking, storage, and computing, into a single, self-hosting, unified CPU-free Data Processing Unit (DPU) called Hyperion. In this paper, we present the case for Hyperion, its design choices, initial work-in-progress details, and seek feedback from the systems community.
... The authors in Todman et al. (2005) have provided background information on notable aspects of older FPGA technologies and simultaneously explained the fundamental architectures and design methods for codesign. Furthermore, the work in Choi et al. (2019) is another comprehensive study that aims to evaluate and analyze the microarchitectural characteristics of state-of-the-art CPU-FPGA platforms in depth. That paper covers most of the shared-memory platforms with detailed benchmarks. ...
In this community review report, we discuss applications and techniques for fast machine learning (ML) in science—the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs.
... This makes cc-accelerators suitable for µs-scale acceleration/offloading. Currently, some cc-accelerators are commercially available [27,28,68] and more commercial cc-accelerators will emerge as the next-generation Intel Xeon CPUs begin to support CXL [127]. ...
Responding to the "datacenter tax" and "killer microseconds" problems for datacenter applications, diverse solutions including Smart NIC-based ones have been proposed. Nonetheless, they often suffer from high overhead of communications over network and/or PCIe links. To tackle the limitations of the current solutions, this paper proposes ORCA, a holistic network and architecture co-design solution that leverages current RDMA and emerging cache-coherent off-chip interconnect technologies. Specifically, ORCA consists of four hardware and software components: (1) unified abstraction of inter- and intra-machine communications managed by one-sided RDMA write and cache-coherent memory write; (2) efficient notification of requests to accelerators assisted by cache coherence; (3) cache-coherent accelerator architecture directly processing requests received by NIC; and (4) adaptive device-to-host data transfer for modern server memory systems consisting of both DRAM and NVM exploiting state-of-the-art features in CPUs and PCIe. We prototype ORCA with a commercial system and evaluate three popular datacenter applications: in-memory key-value store, chain replication-based distributed transaction system, and deep learning recommendation model inference. The evaluation shows that ORCA provides 30.1% to 69.1% lower latency, up to 2.5x higher throughput, and 3x higher power efficiency than the current state-of-the-art solutions.
... For datacenter FPGAs, efforts such as the Intel Xeon-FPGA multi-chip package and IBM CAPI also provide coherent shared cache memory support for Xeon/PowerPC CPUs and FPGAs. A quantitative evaluation of modern CPU-FPGA platforms with and without coherency support can be found in References [43,44]. However, such cache designs are shared by the CPU and FPGA; there is no practically available, dedicated on-chip cache for the FPGA accelerators themselves yet. ...
... Here, we mainly focus on direct memory access (DMA) from DDR or high-bandwidth memory (HBM). For the optimization of communication between the host program and the FPGA accelerators, we refer interested readers to References [43,44] for more details. ...
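For context on what such DMA-style accesses look like on the FPGA side, the sketch below uses the common Vitis/Vivado HLS idiom of staging data from DDR/HBM into an on-chip buffer with burst transfers. It is a generic, hypothetical kernel rather than code from the cited works; the names, sizes, and placeholder computation are invented for illustration:

```c
/* Hypothetical HLS C sketch: burst (DMA-style) reads from DDR/HBM into an
 * on-chip buffer, a simple placeholder computation, and burst writes back.
 * Follows the common Vitis/Vivado HLS m_axi idiom; illustrative only. */
#include <string.h>

#define TILE 1024

void scale_kernel(const float *in, float *out, int n)
{
#pragma HLS INTERFACE m_axi     port=in  offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=in
#pragma HLS INTERFACE s_axilite port=out
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return

    float buf[TILE];

    for (int base = 0; base < n; base += TILE) {
        int len = (n - base < TILE) ? (n - base) : TILE;

        /* memcpy over a contiguous range lets the HLS tool infer a burst
         * read from the DDR/HBM port into on-chip BRAM. */
        memcpy(buf, in + base, len * sizeof(float));

        for (int i = 0; i < len; i++) {
#pragma HLS PIPELINE II=1
            buf[i] = buf[i] * 2.0f;   /* placeholder computation */
        }

        /* Burst write back to device DRAM (DDR or an HBM pseudo-channel). */
        memcpy(out + base, buf, len * sizeof(float));
    }
}
```

Host-side communication optimizations (for example, overlapping transfers with computation) are the subject of References [43,44] cited in the excerpt above.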
FPGA-based accelerators are increasingly popular across a broad range of applications, because they offer massive parallelism, high energy efficiency, and great flexibility for customizations. However, difficulties in programming and integrating FPGAs have hindered their widespread adoption. Since the mid 2000s, there has been extensive research and development toward making FPGAs accessible to software-inclined developers, besides hardware specialists. Many programming models and automated synthesis tools, such as high-level synthesis, have been proposed to tackle this grand challenge. In this survey, we describe the progression and future prospects of the ongoing journey in significantly improving the software programmability of FPGAs. We first provide a taxonomy of the essential techniques for building a high-performance FPGA accelerator, which requires customizations of the compute engines, memory hierarchy, and data representations. We then summarize a rich spectrum of work on programming abstractions and optimizing compilers that provide different trade-offs between performance and productivity. Finally, we highlight several additional challenges and opportunities that deserve extra attention by the community to bring FPGA-based computing to the masses.
... Moreover, the Harpv2 has three PCI buses to transfer data between main memory and the FPGA, with a maximum bandwidth of 26 GB/s [5], [19]. The AWS EC2 F1 FPGA has 4 local DDR memories and only one PCI bus between the FPGA and the CPU. ...
... The highest value of k requires less than 25% of the total DSPs available in the AWS EC2 F1 FPGA. Furthermore, the AWS EC2 F1 FPGA has a favorable outlook, as Xilinx's VU9P FPGA has 6,760 DSPs [19] compared to the 1,518 DSPs of the Arria 10 in Harpv2 [19]. ...