Conference Paper

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

... Non-linear operations in transformers, particularly Layer-Norm and softmax, are notoriously time-consuming [1], [2], [7], [8]. The recent surge in popularity of FlashAttention [9] has reduced softmax latency by 80% by integrating it with matrix multiplication, and several hardware accelerators [10]- [14] have further optimized softmax. ...
... Multiple studies have been conducted to accelerate the normalization operation within the transformer architecture. Previous methods, including DFX [2], [26], and [27], modified the variance computation in a way that improves parallelism and reduces the processing latency of the LayerNorm computation. In [7], the intermediate results for the variance computation are dynamically compressed to low precision, reducing both energy consumption and latency. ...
... Most previous accelerators for LayerNorm [7] [37] [38] are either designed and evaluated as ASICs or do not reveal their detailed architecture [39] [40]. To ensure a fair comparison, we compare the HAAN accelerator with DFX [2] by extracting the LayerNorm latency from the overall latency reported in their study. We also reproduce SOLE [7] and MHAA [40] under settings aligned with HAAN's. ...
Preprint
Large language models (LLMs) have revolutionized natural language processing (NLP) tasks by achieving state-of-the-art performance across a range of benchmarks. Central to the success of these models is the integration of sophisticated architectural components aimed at improving training stability, convergence speed, and generalization capabilities. Among these components, normalization operations, such as layer normalization (LayerNorm), emerge as a pivotal technique, offering substantial benefits to overall model performance. However, previous studies have indicated that normalization operations can substantially elevate processing latency and energy usage. In this work, we adopt the principles of algorithm and hardware co-design, introducing HAAN, a holistic normalization acceleration method. The evaluation results demonstrate that HAAN achieves significantly better hardware performance than state-of-the-art solutions.
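As background for the variance-computation changes mentioned in the excerpts above, a commonly used single-pass formulation of LayerNorm (shown here as a generic illustration, not necessarily the exact scheme used by DFX or HAAN) computes the variance from running sums of x and x^2:

\mathrm{LayerNorm}(x_i) = \gamma\,\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,
\qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i,
\qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} x_i^2 - \mu^2 .

Accumulating \sum x_i and \sum x_i^2 together removes the second pass over the hidden dimension that a naive mean-then-variance implementation requires, which is the source of the parallelism and latency benefits the excerpts refer to.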
... However, when faced with the significantly larger scale of LLM workloads, FPGAs have yet to fully realize their potential. Recent works, DFX [7] and FlightLLM [8], have demonstrated the feasibility of deploying LLMs on cloud-based Alveo U280 FPGAs with high-bandwidth memory (HBM). However, these approaches are not applicable to embedded FPGAs. ...
... As LLMs are memory-intensive, recent work has focused on accelerating LLMs using HBM-equipped cloud FPGAs, such as the Xilinx Alveo U280 and VCU128, achieving decoding speed improvements over GPUs in single-batch LLM inference setups. DFX [7] represents one of the first studies on FPGA accelerators for decoder-only transformers, utilizing four Alveo U280 FPGAs to accelerate the GPT-2 1.5B model [18]. FlightLLM [8] has deployed the LLaMA2-7B model [9] on both the Alveo U280 and the Versal VHK158, demonstrating advantages over NVIDIA GPUs in single-batch inference tasks. ...
... Fig. 3 shows the fine-grained pipelining dataflow in the attention layer during the inference process. Unlike DFX's [7] coarse-grained pipeline, where the query, key, and value projections occur before the multi-head attention, we adopt a fine-grained, head-wise pipeline that fuses the projection and attention computations. In the pipeline of each single head, the query projection occurs first, followed by the key projection. ...
Preprint
The extremely high computational and storage demands of large language models have excluded most edge devices, which were widely used for efficient machine learning, from being viable options. A typical edge device usually has only 4GB of memory capacity and a bandwidth of less than 20GB/s, while a large language model quantized to 4-bit precision with 7B parameters already requires 3.5GB of capacity, and its decoding process is purely bandwidth-bound. In this paper, we aim to explore these limits by proposing a hardware accelerator for large language model (LLM) inference on the Zynq-based KV260 platform, equipped with 4GB of 64-bit 2400Mbps DDR4 memory. We successfully deploy a LLaMA2-7B model, achieving a decoding speed of around 5 tokens/s, utilizing 93.3% of the memory capacity and reaching 85% of the decoding speed allowed by the theoretical memory bandwidth limit. To fully reserve the memory capacity for model weights and the key-value cache, we develop the system in a bare-metal environment without an operating system. To fully reserve the bandwidth for model weight transfers, we implement a customized dataflow with an operator fusion pipeline and propose a data arrangement format that can maximize data transaction efficiency. This research marks the first attempt to deploy a 7B-level LLM on a standalone embedded field-programmable gate array (FPGA) device. It provides key insights into efficient LLM inference on embedded FPGA devices and offers guidelines for future architecture design.
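A quick sanity check of the bandwidth-bound claim, using only the numbers quoted in the abstract (the arithmetic below is ours, not the authors'):

B_{\mathrm{peak}} = 64\,\mathrm{bit} \times 2400\,\mathrm{Mbps} / 8 = 19.2\,\mathrm{GB/s},
\qquad \frac{19.2\,\mathrm{GB/s}}{3.5\,\mathrm{GB/token}} \approx 5.5\,\mathrm{tokens/s},
\qquad 0.85 \times 5.5 \approx 4.7\,\mathrm{tokens/s},

which is consistent with the reported decoding speed of around 5 tokens/s at 85% of the theoretical memory bandwidth limit.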
... Thus, 48 documents were selected for evaluation, the review of which revealed 1 document that did not contain data for 2022-2024, and 4 documents that did not contain relevant information regarding the posed research questions. 43 documents were selected for review: [6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48]. The review of each document was performed according to the review map (appendix A). ...
... For January and February 2024, there are only articles in journals (2), and conference proceedings articles are absent. In total, for the period 2022-2024, the number of articles in journals (22) is equal to the number of conference proceedings articles (21). The increase may indicate a more thorough coverage of the issue in scientific journals compared to conference proceedings in recent years. ...
... Table 9 (data annotation types used in the reviewed articles, with the articles using each type): Sentence [7,9,11,12,14,15,17,18,19,20,22,23,25,26,29,30,31,32,33,36,39,40,41,45,46,48]; Paragraph [7,9,12,15,17,18,19,20,23,29,37,14,39,40,41,45,46,48]; Document [9,12,18,19,20,29,35,40,41,42,45]; Question-answer [7,9,12,17,18,19,22,26,45,47]; Descriptive tables [13,21,30,31,33,34,36,43,44]; Translations [18,27,31,33,36,37,45]; Stories [9,31,33,36]; Images [14,29,37,38]; Audio files [29,28]; Video clips [16]; Computer programs [18]; Not specified [6,8,10,24]. ...
Conference Paper
Full-text available
Recent years have witnessed significant advancements in neural text generation driven by the emergence of large language models and growing interest in this field. This systematic review aims to identify and summarize current trends, approaches, and methods in neural text generation from 2022 to 2024, complementing the findings of a previous review covering 2015-2021. Following the PRISMA methodology, 43 articles were selected from the Scopus database for analysis. The review reveals a shift towards innovative model architectures like Transformer-based models (GPT-2, GPT-3, BERT), attention mechanisms, and controllable text generation. While BLEU, ROUGE, and human evaluation remain the most popular evaluation metrics, new metrics like BERTScore have emerged. Datasets span diverse domains and data types, with growing interest in unlabeled data. Applications have expanded to areas such as table-to-text generation, knowledge graph-based generation, and medical text generation. Although English dominates, there is increasing research on low-resource languages. The findings highlight the rapid evolution of neural text generation methods, the broadening of application areas, and promising avenues for future research.
... Transformer [44], BERT [9], and GPT [39] have been widely used for natural language processing (NLP) services at datacenters. Although GPUs are commonly used to accelerate the inference of deep learning models, GPUs are less effective in handling transformer models because of multi-head attention and inference stages that are memory-bound [19]. To address the limitations of GPUs for transformer models, many recent works [13,14,34,45] have proposed to accelerate multi-head attention through dedicated accelerators and algorithmic changes; however, these prior works do not fully address the challenges of end-to-end inference acceleration. ...
... Recently, DFX [19] proposed an FPGA-based appliance that is designed for memory-bound transformer inference stages; however, it is sub-optimal for the compute-bound stages in end-to-end inference. ...
... The IANUS architecture in this work shares similarities as it represents a "middle ground" architecture between NPU and PIM architectures. IANUS improves the performance of GPT-2 by 6.2× and 3.2× compared to the A100 GPU and the state-of-the-art prior work (DFX [19]), respectively. 3. System integration and FPGA prototyping: To demonstrate the feasibility of IANUS, we build an integrated system including a commercial NPU, commercial PIM chips, and an FPGA-based PIM controller. ...
Preprint
Full-text available
Accelerating end-to-end inference of transformer-based large language models (LLMs) is a critical component of AI services in datacenters. However, diverse compute characteristics of end-to-end LLM inference present challenges as previously proposed accelerators only address certain operations or stages (e.g., self-attention, generation stage, etc.). To address the unique challenges of accelerating end-to-end inference, we propose IANUS -- Integrated Accelerator based on NPU-PIM Unified Memory System. IANUS is a domain-specific system architecture that combines a Neural Processing Unit (NPU) with a Processing-in-Memory (PIM) to leverage both the NPU's high computation throughput and the PIM's high effective memory bandwidth. In particular, IANUS employs a unified main memory system where the PIM memory is used both for PIM operations and for NPU's main memory. The unified main memory system ensures that memory capacity is efficiently utilized and the movement of shared data between NPU and PIM is minimized. However, it introduces new challenges since normal memory accesses and PIM computations cannot be performed simultaneously. Thus, we propose novel PIM Access Scheduling that manages normal memory accesses and PIM computations through workload mapping and scheduling across the PIM and the NPU. Our detailed simulation evaluations show that IANUS improves the performance of GPT-2 by 6.2× and 3.2×, on average, compared to the NVIDIA A100 GPU and the state-of-the-art accelerator. As a proof-of-concept, we develop a prototype of IANUS with a commercial PIM, NPU, and an FPGA-based PIM controller to demonstrate the feasibility of IANUS.
... To accommodate the large model size and improve performance, the computation workloads are evenly distributed across available PIM channels and banks to maximize utilization of compute resources and on-chip bandwidth. Compared to existing Transformer accelerators [15][16][17], the proposed PIM-GPT supports large GPT models end-to-end without the need for expensive HBM, making it an efficient and practical solution for GPT acceleration. Benchmarking analysis shows the proposed PIM-GPT achieves state-of-the-art speedup and energy efficiency for GPT inference tasks. ...
... Earlier PIM implementations focus on VMM acceleration in neural networks. But inter-layer functions in GPT such as LayerNorm and Residual can contribute ~10% of total latency [16]. Hence, efficient end-to-end PIM acceleration of GPT inference is of high interest. ...
... The high speedup originates from three aspects: (1) Memory bottleneck is effectively removed by performing the memory-intensive VMM operations inside the PIM channel; (2) The mapping strategy maximizes computation parallelism and data locality; (3) Different workloads are efficiently distributed between PIM and ASIC. In comparison, the GPU is not suitable for sequential token generation, since the large memory footprint and low data reuse rate under-utilize the GPU computation resources [16]. ...
Article
Full-text available
Decoder-only Transformer models such as Generative Pre-trained Transformers (GPT) have demonstrated exceptional performance in text generation by autoregressively predicting the next token. However, the efficiency of running GPT on current hardware systems is bounded by the low compute-to-memory ratio and high memory access. In this work, we propose a Processing-in-Memory (PIM) GPT accelerator, PIM-GPT, which achieves end-to-end acceleration of GPT inference with high performance and high energy efficiency. PIM-GPT leverages DRAM-based PIM designs for executing multiply-accumulate (MAC) operations directly in the DRAM chips, eliminating the need to move matrix data off-chip. Non-linear functions and data communication are supported by an application-specific integrated circuit (ASIC). At the software level, mapping schemes are designed to maximize data locality and computation parallelism. Overall, PIM-GPT achieves 41–137× and 631–1074× speedup, and 123–383× and 320–602× energy efficiency, over GPU and CPU baselines, respectively, on 8 GPT models with up to 1.4 billion parameters.
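The abstract's remark about distributing matrix-vector work evenly across PIM channels and banks can be illustrated with a small sketch; the channel/bank counts and the row-wise granularity below are placeholder assumptions, not PIM-GPT's actual mapping scheme.

# Hypothetical sketch: evenly distribute the rows of a weight matrix for a
# matrix-vector product across PIM channels and banks so every bank gets a
# near-equal share of the MAC work (illustrative only; not PIM-GPT's scheme).

def map_rows_to_banks(num_rows, num_channels=16, banks_per_channel=16):
    units = num_channels * banks_per_channel
    base, extra = divmod(num_rows, units)
    mapping, start = [], 0
    for unit in range(units):
        count = base + (1 if unit < extra else 0)   # spread the remainder evenly
        channel, bank = divmod(unit, banks_per_channel)
        mapping.append((channel, bank, range(start, start + count)))
        start += count
    return mapping

# Example: a 4096-row weight matrix maps to 16 rows per bank.
for channel, bank, rows in map_rows_to_banks(4096)[:3]:
    print(channel, bank, rows)

Balancing rows in this way keeps each bank's local weight footprint and MAC workload roughly equal, which is the data-locality and parallelism goal the abstract describes.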
... Low execution time of LLM inference is crucial for both user experience in real-time applications and hardware utilization efficiency [10]. The LLM decoding phase dominates the execution time in the LLM inference tasks [11,12]. For instance, the serial decoding of the GPT-3 175B model is responsible for 96% of the overall execution time when the input and output lengths are 32 [13]. ...
... Prior works explore hardware LLM accelerators to improve LLM inference performance. DFX [11] introduces a multi-FPGA accelerator with high-bandwidth memory (HBM) for end-to-end inference acceleration, and provides an efficient dataflow when the decoding stage is memory-bound. However, even when using HBM, such designs still suffer from the memory bottleneck, especially when attention kernels exhibit very low arithmetic intensity [91]. ...
Preprint
Full-text available
Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM model relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) one-size-fits-all approach of designing PIM units inefficient due to a large degree of heterogeneity even in memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly-used LLMs show that PAPI achieves 1.8× and 11.1× speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.
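The dynamic scheduling idea in the PAPI abstract can be sketched as an arithmetic-intensity (Op/B) check against the ridge point of the compute-centric accelerator; the numbers, kernel names, and threshold rule below are illustrative assumptions rather than PAPI's actual characterization mechanism.

# Toy sketch of arithmetic-intensity-based kernel dispatch (not PAPI's actual
# policy): kernels whose Op/B falls below the accelerator's ridge point are
# routed to PIM units, the rest to the compute-centric accelerator.

def dispatch(kernels, peak_ops=400e12, peak_bw=2e12):
    ridge = peak_ops / peak_bw            # Op/B at which the accelerator
    schedule = {}                         # becomes compute-bound
    for name, ops, bytes_moved in kernels:
        op_per_byte = ops / bytes_moved   # re-measured at runtime, since batch
        schedule[name] = "PIM" if op_per_byte < ridge else "accelerator"
        # size / sequence length changes can shift a kernel across the boundary
    return schedule

kernels = [
    ("attention_decode", 2e9, 1e9),       # low Op/B: memory-bound
    ("ffn_prefill",      8e12, 3e10),     # high Op/B: compute-bound
]
print(dispatch(kernels))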
... InTAR exhibits 1.8× and 7.1× speedup compared with the corresponding dataflow and sequential accelerators. We further present InTAR on the GPT-2 medium model as a complete DNN example, which achieves a speedup of 3.65∼39.14× and a 1.72∼10.44× improvement in DSP efficiency compared to the SoTA accelerators (Allo [9] and DFX [23]). Moreover, InTAR demonstrates 1.66∼7.17× better power efficiency compared to GPUs. ...
... The designs are both placed and routed for the U280 and VPK180. We select Allo [9] and DFX [23] as the SoTA accelerators for the GPT-2 medium model to compare their throughput with InTAR. Table IV lists the resource utilization and frequency. ...
Preprint
The rise of deep neural networks (DNNs) has driven a boom in AI services, which results in an increased demand for computing power and memory. In modern DNNs, the data sizes produced and consumed are highly varied across operations (high data volume variation, HDV). Because existing design paradigms use fixed execution patterns that lead to either low computational efficiency due to pipeline stalls or frequent off-chip memory accesses to manage large intermediate data, HDV applications are challenging to accelerate on FPGAs. To address these challenges, we introduce the Inter-Task Auto-Reconfigurable Accelerator (InTAR), a novel accelerator design for HDV applications on FPGAs. InTAR combines the high computational efficiency of sequential execution with the reduced off-chip memory overhead of dataflow execution. It switches execution patterns automatically with a static schedule determined before circuit design based on resource constraints and model parameters. Unlike previous reconfigurable accelerators, InTAR encodes reconfiguration schedules during circuit design, allowing model-specific optimizations that allocate only the necessary logic and interconnects. Thus, InTAR achieves a high clock frequency with fewer resources and low reconfiguration time. Furthermore, InTAR supports high-level tools such as HLS for fast design generation. We implement a set of multi-task kernels in various HDV DNNs using InTAR. Compared with dataflow and sequential accelerators, InTAR exhibits 1.8× and 7.1× speedups correspondingly. We also implement InTAR for GPT-2 medium as a more complex example, which achieves a speedup of 3.65∼39.14× and a 1.72∼10.44× boost in DSP efficiency compared to the corresponding SoTA accelerators on FPGAs.
... Longformer [204] utilizes a combination of local attention and global attention to specific tokens, while BigBird [205] further adds random attention on top of local and global attention, demonstrating its ability to encompass all sequence-to-sequence functions. Static sparse attention changes the operations in MHSA from ... [flattened survey table: transformer accelerators from 2022-2024 with their platform, target model, reported performance, and optimization technique; DFX [186] is listed as a 2022 FPGA accelerator for GPT-2, alongside [185], X-Former [187], DOTA [188], SPRINT [189], TransPIM [190], Mokey [164], LeOPArd [191], STP [192], HAIMA [193], TF-MVP [194], TiC-SAT [195], Transformer-OPU [196], FACT [197], TaskFusion [198], OliVe [165], C-Transformer [199], SpecPIM [200], ASADI [201], AttAcc [202], and NeuPIMs [203]] ...
... Beyond off-loading scenarios, some studies focus on accelerating LLM inference through distributed systems. For instance, DFX [186] employs model parallelism and an efficient network within a multi-FPGA system, resulting in minimal data synchronization between FPGAs. In another study, the authors of CXL-PNM [210] introduce a processing near memory (PNM) platform using Compute Express Link (CXL), leveraging both model parallelism and data parallelism for workload partitioning. ...
Preprint
Full-text available
The rapid development of large language models (LLMs) has significantly transformed the field of artificial intelligence, demonstrating remarkable capabilities in natural language processing and moving towards multi-modal functionality. These models are increasingly integrated into diverse applications, impacting both research and industry. However, their development and deployment present substantial challenges, including the need for extensive computational resources, high energy consumption, and complex software optimizations. Unlike traditional deep learning systems, LLMs require unique optimization strategies for training and inference, focusing on system-level efficiency. This paper surveys hardware and software co-design approaches specifically tailored to address the unique characteristics and constraints of large language models. This survey analyzes the challenges and impacts of LLMs on hardware and algorithm research, exploring algorithm optimization, hardware design, and system-level innovations. It aims to provide a comprehensive understanding of the trade-offs and considerations in LLM-centric computing systems, guiding future advancements in AI. Finally, we summarize the existing efforts in this space and outline future directions toward realizing production-grade co-design methodologies for the next generation of large language models and AI systems.
... He et al. [211] also propose a new attention flow, SlimAttention, to reduce the KV cache size and ensure precision for efficient LLM inference on CPUs. Their experimental results on Llama2-70B show a token generation rate of 4 tokens/s with 2 sockets and 11.4 tokens/s with 8 sockets on Intel Xeon 8563C CPUs. DFX [212] is a multi-FPGA accelerator that uses model parallelism and optimized dataflow to improve LLM inference speed in both the prefill and decoding phases. With an implementation on four Xilinx Alveo U280 FPGAs, DFX achieves about 120 tokens/s for GPT-2 1.5B. ...
Preprint
Full-text available
Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs like BERT and DeBERTa, generative LLMs like GPT series and Llama series are currently the main focus due to their superior algorithmic performance. The advancements in generative LLMs are closely intertwined with the development of hardware capabilities. Various hardware platforms exhibit distinct hardware characteristics, which can help improve LLM inference performance. Therefore, this paper comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize different optimization methods for different platforms such as CPU, GPU, FPGA, ASIC, and PIM/NDP, and provide inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance with batch sizes 1 and 8 on different hardware platforms by considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the performance of the same optimization methods across different hardware platforms, the performance across different hardware platforms, and the performance of different methods on the same hardware platform. This provides a systematic and comprehensive summary of existing inference acceleration work by integrating software optimization methods and hardware platforms, which can point to the future trends and potential developments of generative LLMs and hardware technology for edge-side scenarios.
... The model showcases its flexibility across various deep network architectures, achieving superior results and consistent training performance on diverse devices. DFX [88] is a low-latency multi-FPGA appliance for accelerating transformer-based text generation. It uses model parallelism to split the transformer model across multiple FPGAs. ...
... Not designed for large-scale training, but explores trade-offs for resource-constrained settings. DFX [88]: 5.58× speedup in text generation compared to four NVIDIA V100 GPUs. ...
Preprint
Full-text available
In recent years, large language models (LLMs) have achieved remarkable success in natural language processing (NLP). LLMs require an extreme amount of parameters to attain high performance. As models grow into the trillion-parameter range, computational and memory costs increase significantly. This makes it difficult for many researchers to access the resources needed to train or apply these models. Optimizing LLM performance involves two main approaches: fine-tuning pre-trained models for specific tasks to achieve state-of-the-art performance, and reducing costs or improving training time while maintaining similar performance. This paper presents a systematic literature review (SLR) following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. We reviewed 65 publications out of 983 from 2017 to December 2023, retrieved from 5 databases. The study presents methods to optimize and accelerate LLMs while achieving cutting-edge results without sacrificing accuracy. We begin with an overview of the development of language modeling, followed by a detailed explanation of commonly used frameworks and libraries, and a taxonomy for improving and speeding up LLMs based on three classes: LLM training, LLM inference, and system serving. We then delve into recent optimization and acceleration strategies such as training optimization, hardware optimization, scalability and reliability, accompanied by the taxonomy and categorization of these strategies. Finally, we provide an in-depth comparison of each class and strategy, with two case studies on optimizing model training and enhancing inference efficiency. These case studies showcase practical approaches to address LLM resource limitations while maintaining performance.
... [14], [53], [58] proposed accelerators that support quantization of the weight or embedding vector of the transformer model. Meanwhile, [18], [26] proposed accelerators that process the transformer model losslessly. [18] proposed a system that distributes processing across multiple FPGAs for low-latency inference of the transformer model and [26] proposed a dataflow for the accelerator to improve the efficiency of the attention operation. ...
Preprint
Large language models (LLMs) have emerged due to their capability to generate high-quality content across diverse contexts. To reduce their explosively increasing demands for computing resources, a mixture of experts (MoE) has emerged. The MoE layer enables exploiting a huge number of parameters with less computation. Applying state-of-the-art continuous batching increases throughput; however, it leads to frequent DRAM access in the MoE and attention layers. We observe that conventional computing devices have limitations when processing the MoE and attention layers, which dominate the total execution time and exhibit low arithmetic intensity (Op/B). Processing MoE layers only with devices targeting low Op/B, such as processing-in-memory (PIM) architectures, is challenging due to the fluctuating Op/B in the MoE layer caused by continuous batching. To address these challenges, we propose Duplex, which comprises an xPU tailored for high-Op/B operations and Logic-PIM to effectively perform low-Op/B operations within a single device. Duplex selects the most suitable processor based on the Op/B of each layer within LLMs. As the Op/B of the MoE layer is at least 1 and that of the attention layer is 4-8 for grouped query attention, prior PIM architectures, which place processing units inside DRAM dies and only target extremely low-Op/B (under one) operations, are not efficient. Based on recent trends, Logic-PIM adds more through-silicon vias (TSVs) to enable high-bandwidth communication between the DRAM die and the logic die and places powerful processing units on the logic die, which is best suited for handling low-Op/B operations ranging from a few to a few dozen. To maximally utilize the xPU and Logic-PIM, we propose expert and attention co-processing.
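A back-of-the-envelope reading of the 4-8 Op/B figure for grouped-query attention (our derivation, assuming an FP16 KV cache and counting only the score/context multiply-accumulates): with g query heads sharing one KV head, every 2-byte KV element that is streamed in is reused by g queries, i.e. roughly 2g operations, giving

\mathrm{Op/B} \approx \frac{2g\ \mathrm{ops}}{2\ \mathrm{bytes\ (FP16)}} = g,

so common GQA group sizes of 4-8 land in the quoted range, while non-grouped attention (g = 1) stays near the extremely low Op/B that earlier in-DRAM PIM designs target.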
... There has been a large body of research aimed at developing efficient hardware and software for LLM inference serving systems. Some works develop customized hardware techniques for accelerating LLM inference serving [23], [49], while others focus on developing optimized system software for GPU-based scale-out systems [10], [33], [41], [46], [52]. Recently, a few pioneering works propose to consider hardware and software together when designing holistic end-to-end accelerated systems [22], [48]. ...
... To handle batched requests from multiple users, scale-out serving systems often comprise hundreds of nodes, each equipped with multiple high-performance AI accelerators with high-bandwidth and high-capacity memories [22], [45], [48], [49]. Recently, several studies have explored solutions involving software, hardware, or both for such large-scale LLM serving systems [21], [23], [38]. However, the absence of an effective system-level simulator for scale-out LLM serving systems remains a major barrier for researchers and engineers who continue to explore solutions. ...
Preprint
Recently, there has been an extensive research effort in building efficient large language model (LLM) inference serving systems. These efforts not only include innovations in the algorithm and software domains but also constitute developments of various hardware acceleration techniques. Nevertheless, there is a lack of simulation infrastructure capable of accurately modeling versatile hardware-software behaviors in LLM serving systems without extensively extending the simulation time. This paper aims to develop an effective simulation tool, called LLMServingSim, to support future research in LLM serving systems. In designing LLMServingSim, we focus on two limitations of existing simulators: (1) they lack consideration of the dynamic workload variations of LLM inference serving due to its autoregressive nature, and (2) they incur repetitive simulations without leveraging algorithmic redundancies in LLMs. To address these limitations, LLMServingSim simulates LLM serving at the granularity of iterations, leveraging the computation redundancies across decoder blocks and reusing the simulation results from previous iterations. Additionally, LLMServingSim provides a flexible framework that allows users to plug in any accelerator compiler-and-simulation stacks for exploring various system designs with heterogeneous processors. Our experiments demonstrate that LLMServingSim produces simulation results closely following the performance behaviors of a real GPU-based LLM serving system with an error rate of less than 14.7%, while offering 91.5x faster simulation speed compared to existing accelerator simulators.
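The iteration-granularity reuse described in the abstract can be pictured as memoizing the simulated cost of one decoder block and reusing it across identical blocks and across iterations; the cache key, bucketing, and cost function below are made-up placeholders, not LLMServingSim's implementation.

# Illustrative memoization of per-decoder-block simulation results (keys and
# cost model are hypothetical; LLMServingSim's real mechanism differs in detail).
from functools import lru_cache

@lru_cache(maxsize=None)
def simulate_block(batch_size, kv_len_bucket, hidden_dim):
    # Stand-in for an expensive cycle-level simulation of one decoder block.
    return 1e-6 * batch_size * (hidden_dim + 64 * kv_len_bucket)

def simulate_iteration(batch_size, kv_len, num_blocks=32, hidden_dim=4096):
    bucket = kv_len // 128          # coarse bucketing so nearby lengths reuse results
    per_block = simulate_block(batch_size, bucket, hidden_dim)
    return num_blocks * per_block   # identical blocks reuse one simulation

latencies = [simulate_iteration(batch_size=8, kv_len=t) for t in range(512, 1024)]
print(sum(latencies), simulate_block.cache_info().hits)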
... For the generation stage, Tender still works and provides benefits by decomposing the activation. However, the under-utilization issue of most commercial accelerators (e.g., GPU, TPU) can be significant, as prior work points out [24]. To mitigate this, there are ongoing studies on batched decoding [51], [68], and Tender can work synergistically with those schemes. ...
... DNN Accelerators. Domain-specific accelerators for DNNs have been extensively studied over the past decade [3], [7], [8], [14], [15], [23], [24], [28], [39], [45], [47]. The processing units and dataflow of these accelerators are highly specialized for DNN computation, leading to high performance and energy efficiency. ...
Preprint
Full-text available
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference poses challenges due to the high compute and memory requirements stemming from the enormous model size and the difficulty of running it in the integer pipelines. In this paper, we present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision. Based on our analysis of outlier values in LLMs, we propose a decomposed quantization technique in which the scale factors of decomposed matrices are powers of two apart. The proposed scheme allows us to avoid explicit requantization (i.e., dequantization/quantization) when accumulating the partial sums from the decomposed matrices, with a minimal extension to the commodity tensor compute hardware. Our evaluation shows that Tender achieves higher accuracy and inference performance compared to the state-of-the-art methods while also being significantly less intrusive to the existing accelerators.
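The key arithmetic property behind the decomposed quantization described in Tender's abstract is that scale factors a power of two apart let integer partial sums be merged with shifts instead of dequantize/requantize steps. The toy NumPy check below demonstrates that property only; the grouping and scale choices are simplified stand-ins, not Tender's actual algorithm.

import numpy as np

# Toy demonstration (not Tender's exact scheme): two decomposed activation
# groups are quantized with scales s0 and s0 * 2**k. Because the scales are a
# power of two apart, the second group's integer partial sum can be merged by
# a left shift before a single final dequantization.

rng = np.random.default_rng(0)
w_int = rng.integers(-127, 128, size=(64, 16), dtype=np.int32)
x0_int = rng.integers(-127, 128, size=64, dtype=np.int32)   # "normal" group
x1_int = rng.integers(-127, 128, size=64, dtype=np.int32)   # "outlier" group
s_w, s0, k = 0.01, 0.02, 3                                   # s1 = s0 * 2**k

# Reference: dequantize each partial sum separately, then add in floating point.
ref = s_w * s0 * (w_int.T @ x0_int) + s_w * (s0 * 2**k) * (w_int.T @ x1_int)

# Shift-and-accumulate: stay in the integer domain, shift the high-scale group.
acc_int = (w_int.T @ x0_int) + ((w_int.T @ x1_int) << k)
out = s_w * s0 * acc_int

print(np.allclose(ref, out))   # True: the shift replaces explicit requantization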
... The HPTA [25] accelerator provided support for several transformer variants without needing FPGA reconfiguration. DFX [26] presented an end-to-end design for Generative Pretrained Transformer (GPT) model inference. [27] proposed efficient algorithm-hardware co-designs with sparse attention and dynamic pipelining. ...
Preprint
Full-text available
Large Language Model (LLM) deployment on edge devices is typically constrained by the need for off-chip memory access, leading to high power consumption and limited throughput. Ternary quantization for LLMs is promising in maintaining model accuracy while reducing memory footprint. However, existing accelerators have not exploited this potential for on-chip inference. We present TerEffic, an FPGA-based accelerator that carefully co-designs memory architecture and computational units to unlock highly efficient LLM inference with fully on-chip execution. Through weight compression, custom computational units, and memory hierarchy optimization, we achieve unprecedented efficiency by eliminating off-chip memory bandwidth bottlenecks. We propose two architectural variants: a fully on-chip design for smaller models and an HBM-assisted design for larger ones. When evaluated on a 370M parameter model with single-batch inference, our on-chip design achieves 12,700 tokens/sec (149 times higher than NVIDIA's Jetson Orin Nano) with a power efficiency of 467 tokens/sec/W (19 times better than Jetson Orin Nano). The HBM-assisted design provides 521 tokens/sec on a 2.7B parameter model (2 times higher than NVIDIA's A100) with 33W power consumption, achieving a power efficiency of 16 tokens/sec/W (8 times better than A100).
... There have been many recent works on accelerating LLM inference, but most have focused on LLM inference with shorter contexts (i.e., up to 4K or 8K tokens) [5,21,69,72]. As the context length increases, both the memory capacity and the memory bandwidth become a greater bottleneck during LLM inference. ...
Preprint
The expansion of large language models (LLMs) with hundreds of billions of parameters presents significant challenges to computational resources, particularly data movement and memory bandwidth. Long-context LLMs, which process sequences of tens of thousands of tokens, further increase the demand on the memory system as the complexity in attention layers and key-value cache sizes is proportional to the context length. Processing-in-Memory (PIM) maximizes memory bandwidth by moving compute to the data and can address the memory bandwidth challenges; however, PIM is not necessarily scalable to accelerate long-context LLM because of limited per-module memory capacity and the inflexibility of fixed-functional unit PIM architecture and static memory management. In this work, we propose LoL-PIM which is a multi-node PIM architecture that accelerates long context LLM through hardware-software co-design. In particular, we propose how pipeline parallelism can be exploited across a multi-PIM module while a direct PIM access (DPA) controller (or DMA for PIM) is proposed that enables dynamic PIM memory management and results in efficient PIM utilization across a diverse range of context length. We developed an MLIR-based compiler for LoL-PIM extending a commercial PIM-based compiler where the software modifications were implemented and evaluated, while the hardware changes were modeled in the simulator. Our evaluations demonstrate that LoL-PIM significantly improves throughput and reduces latency for long-context LLM inference, outperforming both multi-GPU and GPU-PIM systems (up to 8.54x and 16.0x speedup, respectively), thereby enabling more efficient deployment of LLMs in real-world applications.
... The data shows that RSN-XNN maintains a low overhead, both in absolute terms and as a percentage of the total design area. The table also includes data for two overlays: DLA [12], which controls execution at the tile level, and DFX [17], which controls execution at the layer level. RSN-XNN's area overhead is comparable to existing overlays but offers greater flexibility and better resource utilization. ...
Preprint
Overlay is an effective approach for creating FPGA-based AI accelerators, enabling software-programmable specialized hardware datapaths to flexibly support various DNN operations. Traditional DNN overlays typically base their instruction set design on the von Neumann model but adapt them to be more coarse-grained. These instruction sets control execution at the layer granularity and impose restricted patterns for mapping computation and bandwidth resources. Such constraints cause inefficiencies from the imperfect match between supported execution patterns and diverse DNN layer shapes and types. This work proposes a Reconfigurable Stream Network architecture, a unique ISA abstraction tailored for flexible FPGA overlay execution at low cost, marking it as the first known FPGA design to support dynamic sequential linear layer pipelining. This novel architecture presents a datapath abstraction modeled after a specialized circuit-switched network with stateful functional units (FUs) as nodes and data streaming on edges. Programming a computation corresponds to triggering a network path in this stream-connected datapath. The program can individually control FUs to form paths that exploit both spatial and pipeline parallelism between independent and dependent concurrent computations. We present a proof-of-concept design RSN-XNN on the Versal VCK190. Evaluations show a 22x latency reduction for BERT compared to the state of the art, along with throughput improvements of 3.2x, 2.4x, 2.5x, and 2.8x for BERT, VIT, NCF, and MLP, respectively. RSN-XNN matches the latency of the T4 GPU with the same FP32 performance but only 18% of the memory bandwidth. Compared to the A100 GPU under the same 7nm process node, it achieves 2.1x/4.5x better operating/dynamic energy efficiency in FP32.
... Modern data-intensive workloads (e.g., AI inference tasks for large language models [10], [17], [40], [47], [48], [92], [97], [114], recommendation systems [63], [72], [73], [86], and graph processing [2], [19], [35], [113], [117]) are memory-bound as they pose unprecedented demand for large data. Despite the increasing demand for high memory bandwidth, integrating a larger number of DRAM I/O pins at the processor die is challenging because of form factor constraints and issues related to signal integrity [67]. ...
Preprint
Full-text available
Processing-in-memory (PIM) has emerged as a promising solution for accelerating memory-intensive workloads as they provide high memory bandwidth to the processing units. This approach has drawn attention not only from the academic community but also from the industry, leading to the development of real-world commercial PIM devices. In this work, we first conduct an in-depth characterization of UPMEM's general-purpose PIM system and analyze the bottlenecks caused by the data transfers across the DRAM and PIM address spaces. Our characterization study reveals several critical challenges associated with DRAM to/from PIM data transfers in memory-bus-integrated PIM systems, for instance, their high CPU core utilization, high power consumption, and low read/write throughput for both DRAM and PIM. Driven by our key findings, we introduce the PIM-MMU architecture, a hardware/software co-design that enables energy-efficient DRAM to/from PIM transfers for PIM systems. PIM-MMU synergistically combines a hardware-based data copy engine, a PIM-optimized memory scheduler, and a heterogeneity-aware memory mapping function, the utilization of which is supported by our PIM-MMU software stack, significantly improving the efficiency of DRAM to/from PIM data transfers. Experimental results show that PIM-MMU improves the DRAM to/from PIM data transfer throughput by an average of 4.1x and enhances its energy efficiency by 4.1x, leading to a 2.2x end-to-end speedup for real-world PIM workloads.
... They also proposed a novel, model-independent, post-training quantization search algorithm adaptable to various hardware environments. Besides the work from Wojcicki et al. [7], other transformer implementations on FPGAs have been developed, such as those by Li et al. [6], Peng et al. [15], Tzanos et al. [8], Hong et al. [16], and Han et al. [9]. However, these works mainly focus on the FPGA implementation of specific models. ...
Preprint
This study presents an efficient implementation of transformer architectures in Field-Programmable Gate Arrays (FPGAs) using hls4ml. We demonstrate the strategy for implementing the multi-head attention, softmax, and normalization layers and evaluate three distinct models. Their deployment on a VU13P FPGA chip achieved latency of less than 2 µs, demonstrating the potential for real-time applications. hls4ml's compatibility with any TensorFlow-built transformer model further enhances the scalability and applicability of this work. Index Terms: FPGAs, machine learning, transformers, high energy physics, LIGO
... In 2022, Hong et al. presented DFX [10] for the acceleration of the transformer networks used in LLMs. Similarly to NPE, the DFX architecture proposed a modular design consisting of several compute cores for the acceleration of transformer networks. ...
Preprint
Full-text available
Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. In this paper, we present a comprehensive survey of the several research efforts that have been presented for the acceleration of transformer networks for Large Language Models using hardware accelerators. The survey presents the frameworks that have been proposed and then performs a qualitative and quantitative comparison regarding the technology, the processing platform (FPGA, ASIC, In-Memory, GPU), the speedup, the energy efficiency, the performance (GOPs), and the energy efficiency (GOPs/W) of each framework. The main challenge in this comparison is that every proposed scheme is implemented on a different process technology, making a fair comparison difficult. The main contribution of this paper is that we extrapolate the performance and energy-efficiency results to the same technology to enable two fair comparisons: one theoretical and one more practical. We implement parts of the LLMs on several FPGA chips to extrapolate the results to the same process technology and then make a fair comparison of the performance.
... For instance, the architecture of NVIDIA GPUs is optimized for parallel processing, where many threads execute simultaneously across multiple cores. Especially in the generation stage, the GPU cannot effectively route the incoming bandwidth to a single core that requires computation with a single vector at a time, which causes underutilization of both compute cores and memory bandwidth, as mentioned by Hong et al. [5]. This innate problem is more pronounced in smaller models due to even smaller operands (i.e., inputs and activations). According to Neubig et al. and Kitaev et al., a myriad of software techniques, such as in-flight batching, key-value caching, and other algorithmic optimizations, have been proposed to raise the efficiency of GPUs [6,7]. ...
Preprint
Full-text available
The explosive arrival of OpenAI's ChatGPT has fueled the globalization of large language model (LLM), which consists of billions of pretrained parameters that embodies the aspects of syntax and semantics. HyperAccel introduces latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. LPU perfectly balances the memory bandwidth and compute logic with streamlined dataflow to maximize performance and efficiency. LPU is equipped with expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements LPU as an intuitive software framework to run LLM applications. LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B model, respectively, which is 2.09x and 1.37x faster than the GPU. LPU, synthesized using Samsung 4nm process, has total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33x and 1.32x energy efficiency over NVIDIA H100 and L4 servers, respectively.
... The model showcases its flexibility across various deep network architectures, achieving superior results and consistent training performance on diverse devices. DFX [86] is a low-latency multi-FPGA appliance for accelerating transformer-based text generation. It uses model parallelism to split the transformer model across multiple FPGAs. ...
Article
Full-text available
In recent years, large language models (LLMs) have achieved remarkable success in natural language processing (NLP). LLMs require an extreme amount of parameters to attain high performance. As models grow into the trillion-parameter range, computational and memory costs increase significantly. This makes it difficult for many researchers to access the resources needed to train or apply these models. Optimizing LLM performance involves two main approaches: fine-tuning pre-trained models for specific tasks to achieve state-of-the-art performance, and reducing costs or improving training time while maintaining similar performance. This paper presents a systematic literature review (SLR) following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. We reviewed 65 publications out of 983 from 2017 to December 2023, retrieved from 5 databases. The study presents methods to optimize and accelerate LLMs while achieving cutting-edge results without sacrificing accuracy. We begin with an overview of the development of language modeling, followed by a detailed explanation of commonly used frameworks and libraries, and a taxonomy for improving and speeding up LLMs based on three classes: LLM training, LLM inference, and system serving. We then delve into recent optimization and acceleration strategies such as training optimization, hardware optimization, scalability and reliability, accompanied by the taxonomy and categorization of these strategies. Finally, we provide an in-depth comparison of each class and strategy, with two case studies on optimizing model training and enhancing inference efficiency. These case studies showcase practical approaches to address LLM resource limitations while maintaining performance.
... Beyond ASICs, FPGA-based accelerators are being actively investigated for their potential to provide more flexible and faster turnaround solutions. For example, DFX [17] utilizes model parallelism and enables rapid concurrent execution of transformer-based workloads, while FlightLLM [62] introduces a configurable compute unit and a LLM mapping flow to support LLM inference. ...
Preprint
Full-text available
Large Language Models (LLMs) have become extremely potent instruments with exceptional capacities for comprehending and producing human-like text in a wide range of applications. However, the increasing size and complexity of LLMs present significant challenges in both training and deployment, leading to substantial computational and storage costs as well as heightened energy consumption. In this paper, we provide a review of recent advancements and research directions aimed at addressing these challenges and enhancing the efficiency of LLM-based systems. We begin by discussing algorithm-level acceleration techniques focused on optimizing LLM inference speed and resource utilization. We also explore LLM-hardware co-design strategies with a vision to improve system efficiency by tailoring hardware architectures to LLM requirements. Further, we delve into LLM-to-accelerator compilation approaches, which involve customizing hardware accelerators for efficient LLM deployment. Finally, as a case study to leverage LLMs for assisting circuit design, we examine LLM-aided design methodologies for an important task: High-Level Synthesis (HLS) functional verification, by creating a new dataset that contains a large number of buggy and bug-free codes, which can be essential for training LLMs to specialize on HLS verification and debugging. For each aspect mentioned above, we begin with a detailed background study, followed by the presentation of several novel solutions proposed to overcome specific challenges. We then outline future research directions to drive further advancements. Through these efforts, we aim to pave the way for more efficient and scalable deployment of LLMs across a diverse range of applications.
Article
Modern transformer-based deep neural networks present unique technical challenges for effective acceleration in real-world applications. Apart from the vast amount of linear operations needed due to their sizes, modern transformer models are increasingly reliant on precise non-linear computations that make traditional low-bitwidth quantization methods and fixed-dataflow matrix accelerators ineffective for end-to-end acceleration. To address this need to accelerate both linear and non-linear operations in a unified and programmable framework, this paper introduces TATAA. TATAA employs 8-bit integer (int8) arithmetic for quantized linear layer operations through post-training quantization, while it relies on bfloat16 floating-point arithmetic to approximate non-linear layers of a transformer model. TATAA hardware features a transformable arithmetic architecture that supports both formats during runtime with minimal overhead, enabling it to switch between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations. An end-to-end compiler is presented to enable flexible mapping from emerging transformer models to the proposed hardware. Experimental results indicate that our mixed-precision design incurs only 0.14% to 1.16% accuracy drop when compared with the pre-trained single-precision transformer models across a range of vision, language, and generative text applications. Our prototype implementation on the Alveo U280 FPGA currently achieves 2935.2 GOPS throughput on linear layers and a maximum of 189.5 GFLOPS for non-linear operations, outperforming related works by up to 1.45× in end-to-end throughput and 2.29× in DSP efficiency, while achieving 2.19× higher power efficiency than modern NVIDIA RTX4090 GPU.
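The int8-linear / floating-point-non-linear split that TATAA's abstract describes can be sketched as follows; per-tensor symmetric post-training quantization and the tanh-based GELU are our illustrative choices, and float32 stands in for the bfloat16 vector path.

import numpy as np

# Illustrative int8-linear / float-nonlinear split (per-tensor symmetric
# quantization; bfloat16 is approximated here by float32 for simplicity).

def quantize(x):
    scale = np.max(np.abs(x)) / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def int8_linear(x, w):
    (xq, sx), (wq, sw) = quantize(x), quantize(w)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)   # integer (systolic-array-style) path
    return acc.astype(np.float32) * (sx * sw)         # dequantize once at the end

def gelu(x):                                          # vectorized non-linear path
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.random.randn(4, 768).astype(np.float32)
w = np.random.randn(768, 3072).astype(np.float32)
y = gelu(int8_linear(x, w))
print(y.shape)

The point of the split is that the matrix multiplication tolerates int8 quantization well, while the non-linear functions (softmax, GELU, LayerNorm) are evaluated in a wider floating-point format to preserve accuracy.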
Preprint
Full-text available
For FPGA-based neural network accelerators, digital signal processing (DSP) blocks have traditionally been the cornerstone for handling multiplications. This paper introduces LUTMUL, which harnesses the potential of look-up tables (LUTs) for performing multiplications. The availability of LUTs typically outnumbers that of DSPs by a factor of 100, offering a significant computational advantage. By exploiting this advantage of LUTs, our method demonstrates a potential boost in the performance of FPGA-based neural network accelerators with a reconfigurable dataflow architecture. Our approach challenges the conventional peak performance on DSP-based accelerators and sets a new benchmark for efficient neural network inference on FPGAs. Experimental results demonstrate that our design achieves the best inference speed among all FPGA-based accelerators, achieving a throughput of 1627 images per second and maintaining a top-1 accuracy of 70.95% on the ImageNet dataset.
Preprint
Benefiting from the self-attention mechanism, Transformer models have attained impressive contextual comprehension capabilities for lengthy texts. The requirements of high-throughput inference arise as the large language models (LLMs) become increasingly prevalent, which calls for large-scale token parallel processing (LTPP). However, existing dynamic sparse accelerators struggle to effectively handle LTPP, as they solely focus on separate stage optimization, and with most efforts confined to computational enhancements. By re-examining the end-to-end flow of dynamic sparse acceleration, we pinpoint an ever-overlooked opportunity that the LTPP can exploit the intrinsic coordination among stages to avoid excessive memory access and redundant computation. Motivated by our observation, we present SOFA, a cross-stage compute-memory efficient algorithm-hardware co-design, which is tailored to tackle the challenges posed by LTPP of Transformer inference effectively. We first propose a novel leading zero computing paradigm, which predicts attention sparsity by using log-based add-only operations to avoid the significant overhead of prediction. Then, a distributed sorting and a sorted updating FlashAttention mechanism are proposed with a cross-stage coordinated tiling principle, which enables fine-grained and lightweight coordination among stages, helping optimize memory access and latency. Further, we propose a SOFA accelerator to support these optimizations efficiently. Extensive experiments on 20 benchmarks show that SOFA achieves 9.5× speedup and 71.5× higher energy efficiency than Nvidia A100 GPU. Compared to 8 SOTA accelerators, SOFA achieves an average 15.8× energy efficiency, 10.3× area efficiency and 9.3× speedup, respectively.
Article
The explosive arrival of OpenAI’s ChatGPT has fueled the globalization of large language model (LLM), which consists of billions of pretrained parameters that embodies the aspects of syntax and semantics. HyperAccel introduces latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. LPU perfectly balances the memory bandwidth and compute logic with streamlined dataflow to maximize performance and efficiency. LPU is equipped with expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements LPU as an intuitive software framework to run LLM applications. LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B model, respectively, which is 2.09× and 1.37× faster than the GPU. LPU, synthesized using Samsung 4nm process, has total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33× and 1.32× energy efficiency over NVIDIA H100 and L4 servers, respectively.
Article
Special-purpose hardware accelerators are increasingly pivotal for sustaining performance improvements in emerging applications, especially as the benefits of technology scaling continue to diminish. However, designers currently lack effective tools and methodologies to construct complex, high-performance accelerator architectures in a productive manner. Existing high-level synthesis (HLS) tools often require intrusive source-level changes to attain satisfactory quality of results. Despite the introduction of several new accelerator design languages (ADLs) aiming to enhance or replace HLS, their advantages are more evident in relatively simple applications with a single kernel. Existing ADLs prove less effective for realistic hierarchical designs with multiple kernels, even if the design hierarchy is flattened. In this paper, we introduce Allo, a composable programming model for efficient spatial accelerator design. Allo decouples hardware customizations, including compute, memory, communication, and data type, from the algorithm specification, and encapsulates them as a set of customization primitives. Allo preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner. This approach facilitates holistic optimizations that span across function boundaries. We conduct comprehensive experiments on commonly-used HLS benchmarks and several realistic deep learning models. Our evaluation shows that Allo can outperform state-of-the-art HLS tools and ADLs on all test cases in the PolyBench. For the GPT2 model, the inference latency of the Allo generated accelerator is 1.7x faster than the NVIDIA A100 GPU with 5.4x higher energy efficiency, demonstrating the capability of Allo to handle large-scale designs.
Article
Due to emerging workloads that require high memory bandwidth, Processing-in-Memory (PIM) has gained significant attention and led several industrial PIM products to be introduced which are integrated with conventional computing systems. This study characterizes the data transfer overheads between conventional DRAM address space and PIM address space within a PIM-integrated system using the commercialized PIM device made by UPMEM. Our findings highlight the need for optimization in PIM-integrated systems to address these overheads, offering critical insights for future PIM technologies.
Article
Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4× speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2× speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9× speedup and a 5.7× improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.
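The flavor of such an analytical model can be captured with a simple roofline-style bound: a layer's latency is limited by either compute throughput or off-chip bandwidth. The sketch below is a generic illustration under assumed hardware numbers, not the paper's actual model or parameters.

```python
# Minimal roofline-style latency estimate for one decoder layer (illustrative
# assumptions only: layer shape, FP16 weights, 8 TFLOP/s compute, 460 GB/s HBM).

def layer_latency_s(flops: float, bytes_moved: float,
                    peak_flops: float, hbm_bw: float) -> float:
    """Latency is bounded by the larger of compute time and memory time."""
    return max(flops / peak_flops, bytes_moved / hbm_bw)

d = 1600                      # assumed model dimension (GPT-2-XL scale)
params = 12 * d * d           # QKV/output projections (4*d*d) plus MLP (8*d*d)
flops = 2 * params            # one multiply-accumulate per weight in a mat-vec
bytes_moved = 2 * params      # each FP16 weight streamed from off-chip once
print(layer_latency_s(flops, bytes_moved, peak_flops=8e12, hbm_bw=460e9))
# Memory time dominates here, which is why single-batch decode is memory-bound.
```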
Article
Full-text available
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
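Numerically, the heart of that MAC array is 8-bit multiplication with wide accumulation; the snippet below reproduces only that arithmetic (not the systolic dataflow, tiling, or quantization scheme), with toy shapes chosen for illustration.

```python
# Arithmetic-only analogue of an 8-bit MAC matrix unit: int8 operands,
# accumulation in wider integers. Shapes and data are illustrative.
import numpy as np

def int8_matmul(a_q: np.ndarray, w_q: np.ndarray) -> np.ndarray:
    """Multiply quantized int8 activations and weights with int32 accumulation."""
    return a_q.astype(np.int32) @ w_q.astype(np.int32)

rng = np.random.default_rng(0)
a_q = rng.integers(-128, 128, size=(4, 256), dtype=np.int8)
w_q = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
acc = int8_matmul(a_q, w_q)   # int32 accumulators, as a MAC array would hold
print(acc.shape, acc.dtype)
```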
Article
Full-text available
In this paper, we propose a novel neural network model called RNN Encoder--Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder--Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
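A deliberately tiny sketch of the encoder-decoder idea follows: one recurrent network folds the source sequence into a single vector, and a second unrolls the target from that vector. Plain tanh cells and random weights stand in for the paper's gated hidden units, so this is an illustration of the structure only.

```python
# Toy encoder-decoder with plain tanh RNN cells (not the paper's gated units).
import numpy as np

rng = np.random.default_rng(0)
H, V = 8, 5                                    # hidden size, toy vocabulary size
W_in, W_hh = rng.normal(size=(H, V)), rng.normal(size=(H, H))
W_out = rng.normal(size=(V, H))

def encode(src_onehots):
    """Fold the source sequence into one fixed-length context vector."""
    h = np.zeros(H)
    for x in src_onehots:
        h = np.tanh(W_in @ x + W_hh @ h)
    return h

def decode(context, steps):
    """Greedily unroll a target sequence from the context vector."""
    h, x, out = context, np.zeros(V), []
    for _ in range(steps):
        h = np.tanh(W_in @ x + W_hh @ h)
        y = int(np.argmax(W_out @ h))
        x = np.eye(V)[y]
        out.append(y)
    return out

src = [np.eye(V)[i] for i in [1, 3, 2]]
print(decode(encode(src), steps=3))
```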
Article
Full-text available
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
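For reference, a single LSTM step can be written out in a few lines; the version below uses the now-standard gated formulation with an additive cell-state update (the "constant error carousel" the abstract refers to), with random placeholder weights and stacked gate parameters as illustrative assumptions.

```python
# One LSTM time step: gated, additive update of the cell state.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """W, U, b hold the stacked input/forget/output/candidate parameters."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # additive cell update
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

H, D = 16, 8
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h = c = np.zeros(H)
for t in range(5):                              # run a short toy sequence
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
```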
Article
We present CAiRE, an end-to-end generative empathetic chatbot designed to recognize user emotions and respond in an empathetic manner. Our system adapts the Generative Pre-trained Transformer (GPT) to empathetic response generation task via transfer learning. CAiRE is built primarily to focus on empathy integration in fully data-driven generative dialogue systems. We create a web-based user interface which allows multiple users to asynchronously chat with CAiRE. CAiRE also collects user feedback and continues to improve its response quality by discarding undesirable generations via active learning and negative training.
Article
Google's TPU supercomputers train deep neural networks 50x faster than general-purpose supercomputers running a high-performance computing benchmark.
Article
High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding on how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI and NVSwitch, from six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity and routing, while one is caused by PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling and migration in a shared environment (e.g., AI cloud and HPC centers), as well as communication-oriented performance tuning.
Conference Paper
We focus on essay generation, which is a challenging task that generates a paragraph-level text with multiple topics. Progress towards understanding different topics and expressing diversity in this task requires more powerful generators and richer training and evaluation resources. To address this, we develop a multi-topic aware long short-term memory (MTA-LSTM) network. In this model, we maintain a novel multi-topic coverage vector, which learns the weight of each topic and is sequentially updated during the decoding process. Afterwards this vector is fed to an attention model to guide the generator. Moreover, we automatically construct two paragraph-level Chinese essay corpora, 305,000 essay paragraphs and 55,000 question-and-answer pairs. Empirical results show that our approach obtains much better BLEU score compared to various baselines. Furthermore, human judgment shows that MTA-LSTM has the ability to generate essays that are not only coherent but also closely related to the input topics.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
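The core operation is compact enough to state directly: scaled dot-product attention is softmax(QK^T / sqrt(d_k))V. Below is a minimal single-head NumPy version without masking or the multi-head projections, with random toy inputs.

```python
# Minimal single-head scaled dot-product attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stabilized
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 64)) for _ in range(3))
print(attention(Q, K, V).shape)   # (4, 64)
```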
Article
There is a practically unlimited amount of natural language data available. Still, recent work in text comprehension has focused on datasets which are small relative to current computing possibilities. This article makes a case for the community to move to larger data and, as a step in that direction, it proposes the BookTest, a new dataset similar to the popular Children's Book Test (CBT) but more than 60 times larger. We show that training on the new data improves the accuracy of our Attention-Sum Reader model on the original CBT test data by a much larger margin than many recent attempts to improve the model architecture. On one version of the dataset our ensemble even exceeds the human baseline provided by Facebook. We then show in our own human study that there is still space for further improvement.
Article
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
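The beam-rescoring idea mentioned above can be illustrated in a few lines: each hypothesis's log-probability is divided by a length penalty so that longer translations are not unfairly punished. The penalty form and the alpha value below follow the commonly cited GNMT-style formulation, but treat the exact constants as assumptions.

```python
# Length-normalized beam scoring (illustrative constants).

def length_penalty(length: int, alpha: float = 0.6) -> float:
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Rescore a hypothesis by its length-normalized log-probability."""
    return log_prob / length_penalty(length, alpha)

# A longer hypothesis with a lower raw log-prob can still win after normalization.
print(normalized_score(-6.0, length=5), normalized_score(-7.0, length=12))
```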
Article
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
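Layer normalization as described reduces to a few lines: compute the mean and variance over a layer's summed inputs for a single example, normalize, then apply a learned gain and bias. A minimal NumPy version follows.

```python
# Layer normalization over the feature dimension of each example.
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each row of x over its features, then rescale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

x = np.random.default_rng(0).normal(size=(2, 512))
y = layer_norm(x, gain=np.ones(512), bias=np.zeros(512))
print(y.mean(axis=-1), y.std(axis=-1))   # ~0 mean, ~1 std per example
```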
Article
In this paper, we present an alternative to the Turing Test that has some conceptual and practical advantages. A Winograd schema is a pair of sentences that differ only in one or two words and that contain a referential ambiguity that is resolved in opposite directions in the two sentences. We have compiled a collection of Winograd schemas, designed so that the correct answer is obvious to the human reader, but cannot easily be found using selectional restrictions or statistical techniques over text corpora. A contestant in the Winograd Schema Challenge is presented with a collection of one sentence from each pair, and required to achieve human-level accuracy in choosing the correct disambiguation. Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
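The residual reformulation itself is one line: the block learns a residual function F(x) and outputs F(x) + x through an identity shortcut. The toy fully-connected block below illustrates that structure (the original paper's blocks are convolutional).

```python
# Identity-shortcut residual block, fully-connected toy version.
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x), with F a small two-layer ReLU transformation."""
    return x + W2 @ np.maximum(W1 @ x, 0.0)

rng = np.random.default_rng(0)
D = 64
W1, W2 = 0.1 * rng.normal(size=(D, D)), 0.1 * rng.normal(size=(D, D))
y = residual_block(rng.normal(size=D), W1, W2)
print(y.shape)   # (64,)
```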
Conference Paper
Increasing demand for higher-bandwidth DRAM drives TSV technology development. With the capacity of fine-pitch wide I/O [1], DRAM can be directly integrated on the interposer or host chip and communicate with the memory controller. However, there are many limitations, such as reliability and testability, in developing the technology. It is advantageous to adopt a logic-interface chip between the interposer and stacked DRAM with thousands of TSVs. The logic interface chip in the base level of high-bandwidth memory (HBM) decreases the CIO, repairs chip-to-chip connection failures, supports better testability, and improves reliability.
TransferTransfo: A transfer learning approach for neural network based conversational agents
  • T Wolf
  • V Sanh
  • J Chaumond
  • C Delangue
GPipe: Efficient training of giant neural networks using pipeline parallelism
  • Y Huang
  • Y Cheng
  • A Bapna
  • O Firat
  • D Chen
  • M Chen
  • H Lee
  • J Ngiam
  • Q V Le
  • Y Wu
Comparing BERT against traditional machine learning text classification
  • S González-Carvajal
  • E C Garrido-Merchán
Language models are unsupervised multitask learners
  • A Radford
  • J Wu
  • R Child
  • D Luan
  • D Amodei
  • I Sutskever
Gaussian error linear units (GELUs)
  • D Hendrycks
  • K Gimpel
Efficient algorithms for device placement of dnn graph operators
  • J M Tarnawski
  • A Phanishayee
  • N Devanur
  • D Mahajan
  • F Nina Paravecino
BERT: Pre-training of deep bidirectional transformers for language understanding
  • J Devlin
  • M.-W Chang
  • K Lee
  • K Toutanova
Megatron-LM: Training multi-billion parameter language models using model parallelism
  • M Shoeybi
  • M Patwary
  • R Puri
  • P Legresley
  • J Casper
  • B Catanzaro
Language models are few-shot learners
  • T B Brown
  • B Mann
  • N Ryder
  • M Subbiah
  • J Kaplan
  • P Dhariwal
  • A Neelakantan
  • P Shyam
  • G Sastry
First-generation inference accelerator deployment at Facebook
  • M Anderson
  • B Chen
  • S Chen
  • S Deng
  • J Fix
  • M Gschwind
  • A Kalaiah
  • C Kim
  • J Lee
  • J Liang
Mesh-TensorFlow: Deep learning for supercomputers
  • N Shazeer