Figure 7: DFX compute core microarchitecture. Available via license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International.
The compute core's control unit decides which modules to run. It is composed of the controller, scheduler, and scoreboard.
Controller
The controller's main job is to receive the start signal and the system configuration from the host. The system configuration includes the core ID, the number of cores in the system, the number of decoder layers, and the number of tokens the system needs to process. These parameters determine the behavior of each core. The core ID and the number of cores tell the corresponding core which section of the model weights to work on and which peer device to receive from and transmit to. The number of decoder layers determines when single-token processing completes, and the number of input and output tokens determines when the entire service completes. Since a different portion of the HBM needs to be accessed for each layer, the layer number designates the address the DMA needs to access. The token number indicates where to mask during MaskedMM. Lastly, the controller returns the done signal to the host once the entire GPT-2 operation finishes.
Scheduler
The scheduler receives the decoded system configuration from the controller and instructions from the instruction buffer. It contains a finite state machine for each instruction type that checks the status of the DMA, the processing units, the register file, and the router to decide whether to run or wait on each instruction type. The chosen instruction is sent to the scoreboard for a final dependency check against the currently running instruction.
Scoreboard
The register file needs to be checked for dependencies before an instruction runs, based on the chaining method.
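To make the controller's bookkeeping concrete, the following Python sketch models how the decoded system configuration could steer each core. It is not taken from the DFX paper: the class and field names, the per-layer HBM stride, and the mask-index arithmetic are hypothetical placeholders that only illustrate how the core ID, core count, layer number, and token number drive weight partitioning, DMA addressing, MaskedMM masking, and the done signal.

```python
from dataclasses import dataclass

@dataclass
class SystemConfig:
    """Configuration pushed by the host with the start signal (names are illustrative)."""
    core_id: int         # which core this is
    num_cores: int       # total cores in the system
    num_layers: int      # decoder layers per token
    num_in_tokens: int   # tokens in the input context
    num_out_tokens: int  # tokens to generate

class Controller:
    """Decodes the host configuration and tracks layer/token progress."""
    # Hypothetical memory-map constant, not taken from the paper.
    LAYER_STRIDE = 0x0100_0000   # HBM bytes reserved per decoder layer

    def __init__(self, cfg: SystemConfig):
        self.cfg = cfg
        self.layer = 0
        self.token = 0
        self.done = False

    def weight_slice(self):
        # Core ID / core count pick this core's share of the model weights
        # (a column-wise split across cores is one possible partitioning).
        return self.cfg.core_id, self.cfg.num_cores

    def dma_base_address(self) -> int:
        # A different HBM region is accessed for each decoder layer,
        # so the layer number selects the DMA base address.
        return self.layer * self.LAYER_STRIDE

    def mask_index(self) -> int:
        # MaskedMM masks attention scores beyond the current token position.
        return self.cfg.num_in_tokens + self.token

    def step_layer(self):
        # Advance one layer; when all layers finish, one token is complete.
        self.layer += 1
        if self.layer == self.cfg.num_layers:
            self.layer = 0
            self.token += 1
            if self.token == self.cfg.num_out_tokens:
                self.done = True   # done signal returned to the host
```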
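The scheduler's per-instruction-type state machines and the scoreboard's final dependency check can be sketched in the same spirit. The instruction types, the resources they require, and the register names below are assumptions made purely for illustration; in particular, the real design relies on register chaining (a consumer may start streaming from a register while the producer is still filling it), whereas this simplified model just blocks issue on a direct read-after-write conflict.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    WAIT = auto()   # a required resource is busy or a dependency is unresolved
    RUN = auto()    # instruction dispatched

class Scheduler:
    """One finite state machine per instruction type.

    Resource flags (dma, mpu, vpu, router) stand in for the handshake
    signals a real core would observe; names are illustrative only.
    """
    NEEDS = {
        "LOAD":   {"dma"},
        "STORE":  {"dma"},
        "MATMUL": {"mpu"},
        "VECTOR": {"vpu"},
        "SYNC":   {"router"},
    }

    def __init__(self):
        self.state = {op: State.IDLE for op in self.NEEDS}

    def try_issue(self, op: str, busy: dict, scoreboard) -> bool:
        # Wait while any required unit is busy; otherwise hand the instruction
        # to the scoreboard for the final dependency check.
        if any(busy[r] for r in self.NEEDS[op]) or not scoreboard.clear_to_run(op):
            self.state[op] = State.WAIT
            return False
        self.state[op] = State.RUN
        return True

class Scoreboard:
    """Final dependency check against the instruction currently running."""

    def __init__(self):
        self.running = None          # (op, destination register) of the in-flight instruction
        self.deps = {                # hypothetical source registers per instruction type
            "MATMUL": {"r_act", "r_weight"},
            "VECTOR": {"r_act"},
            "STORE":  {"r_out"},
            "LOAD":   set(),
            "SYNC":   set(),
        }

    def start(self, op: str, dest_register: str):
        self.running = (op, dest_register)

    def finish(self):
        self.running = None

    def clear_to_run(self, op: str) -> bool:
        # Simplified rule: block issue only if this instruction reads the exact
        # register the running instruction writes (a chained design would let
        # the consumer start as the producer fills that register).
        if self.running is None:
            return True
        _, dest = self.running
        return dest not in self.deps[op]
```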
Source publication
Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which needs the processing of a large input context in the summariza...
Context in source publication
Context 1
... MICROARCHITECTURE Figure 7 shows the proposed compute core's microarchitecture, which mainly consists of a matrix processing unit and a vector processing unit. The primary goal of the microarchitecture is to efficiently process text generation workloads, which involve sequential processing of non-batched input. ...
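As a rough illustration of that split, the sketch below marks which steps of a GPT-style decoder layer would map to the matrix processing unit ([MPU]) and which to the vector processing unit ([VPU]) when a single, non-batched token vector is processed. The weight names, cache layout, and the use of ReLU in place of GELU are placeholders, not the actual DFX instruction sequence.

```python
import numpy as np

def decoder_layer(x, w, k_cache, v_cache, token_idx):
    """One GPT-style decoder layer for a single (non-batched) token vector x.

    k_cache / v_cache are preallocated (max_tokens, d) buffers, as an
    HBM-resident KV cache might be. Comments mark which unit would execute
    each step; all names and shapes are illustrative assumptions.
    """
    def layernorm(v):                                    # [VPU]
        return (v - v.mean()) / (v.std() + 1e-5)

    h = layernorm(x)
    q = w["wq"] @ h                                      # [MPU] query projection
    k_cache[token_idx] = w["wk"] @ h                     # [MPU] key projection into cache
    v_cache[token_idx] = w["wv"] @ h                     # [MPU] value projection into cache

    scores = k_cache @ q / np.sqrt(q.size)               # [MPU] attention scores
    scores[token_idx + 1:] = -np.inf                     # [VPU] MaskedMM: token number tells where to mask
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                 # [VPU] softmax
    ctx = v_cache.T @ probs                              # [MPU] context vector
    x = x + w["wo"] @ ctx                                # [MPU] output projection + residual

    h = layernorm(x)
    h = w["w2"] @ np.maximum(w["w1"] @ h, 0.0)           # [MPU] FFN (ReLU stands in for GELU)
    return x + h                                         # [VPU] residual add
```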
Similar publications
Software requirements specification is undoubtedly critical for the whole software life-cycle. Currently, writing software requirements specifications primarily depends on human work. Although massive studies have been proposed to speed up the process via proposing advanced elicitation and analysis techniques, it is still a time-consuming and error...
Citations
... In 2022, Hong et al. presented DFX [14] for the acceleration of the Transformer networks used in LLMs. Similar to NPE, DFX proposed a modular architecture consisting of several compute cores for accelerating Transformer networks. ...
Large language models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. As the demand for more sophisticated LLMs continues to grow, there is a pressing need to address the computational challenges associated with their scale and complexity. This paper presents a comprehensive survey of hardware accelerators designed to enhance the performance and energy efficiency of large language models. By examining a diverse range of accelerators, including GPUs, FPGAs, and custom-designed architectures, we explore the landscape of hardware solutions tailored to meet the unique computational demands of LLMs. The survey encompasses an in-depth analysis of architecture, performance metrics, and energy efficiency considerations, providing valuable insights for researchers, engineers, and decision-makers aiming to optimize the deployment of LLMs in real-world applications.
This study presents an efficient implementation of transformer architectures on Field-Programmable Gate Arrays (FPGAs) using hls4ml. We demonstrate the strategy for implementing the multi-head attention, softmax, and normalization layers and evaluate three distinct models. Their deployment on a VU13P FPGA chip achieved a latency of less than 2 μs, demonstrating the potential for real-time applications. hls4ml's compatibility with any TensorFlow-built transformer model further enhances the scalability and applicability of this work.