Seongmin Hong’s research while affiliated with Korea Advanced Institute of Science and Technology and other places


Publications (10)


ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput
  • Preprint

March 2025 · 12 Reads

Junsoo Kim · Hunjong Lee · Geonwoo Ko · [...] · Joo-Young Kim

The growing adoption of Large Language Models (LLMs) across various domains has driven the demand for efficient and scalable AI-serving solutions. Deploying LLMs requires optimizations to manage their significant computational and data demands. The prefill stage processes large numbers of input tokens in parallel, increasing computational load, while the decoding stage relies heavily on memory bandwidth due to the auto-regressive nature of LLMs. Current hardware, such as GPUs, often fails to balance these demands, leading to inefficient utilization. While batching improves hardware efficiency, it delays response times, degrading Quality-of-Service (QoS). This disconnect between vendors, who aim to maximize resource efficiency, and users, who prioritize low latency, highlights the need for a better solution. To address this, we propose ADOR, a framework that automatically identifies and recommends hardware architectures tailored to LLM serving. By leveraging predefined architecture templates specialized for heterogeneous dataflows, ADOR optimally balances throughput and latency. It efficiently explores design spaces to suggest architectures that meet the requirements of both vendors and users. ADOR demonstrates substantial performance improvements, achieving 2.51x higher QoS and 4.01x better area efficiency compared to the A100 at high batch sizes, making it a robust solution for scalable and cost-effective LLM serving.
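As a rough illustration of the template-based design-space search the abstract describes, the sketch below enumerates a few hypothetical architecture templates and selects the highest-throughput point that meets a latency budget. The templates, cost model, and all numbers are illustrative assumptions, not parameters from the ADOR paper.

```python
# Hypothetical sketch of a template-based design-space search that trades off
# throughput and latency, in the spirit of the exploration loop described above.
# All templates, cost models, and numbers are illustrative assumptions.
from dataclasses import dataclass
from itertools import product

@dataclass
class DesignPoint:
    template: str        # e.g. "mac_tree" or "systolic" (assumed template names)
    compute_units: int   # parallel MAC lanes
    hbm_channels: int    # memory channels

def estimate(dp: DesignPoint, batch: int):
    """Toy analytical model: prefill is compute-bound, decode is bandwidth-bound."""
    prefill_tps = dp.compute_units * 1.0e3            # tokens/s, compute-limited
    decode_ms   = 80.0 / dp.hbm_channels              # ms/token, bandwidth-limited
    throughput  = batch * 1000.0 / decode_ms          # tokens/s across the batch
    return decode_ms, min(throughput, prefill_tps)

def explore(latency_budget_ms: float, batch: int):
    """Return the design point with the best throughput under the latency budget."""
    best = None
    for tmpl, cu, ch in product(["mac_tree", "systolic"], [64, 128, 256], [8, 16, 32]):
        dp = DesignPoint(tmpl, cu, ch)
        latency, tput = estimate(dp, batch)
        if latency <= latency_budget_ms and (best is None or tput > best[1]):
            best = (dp, tput, latency)
    return best

print(explore(latency_budget_ms=10.0, batch=32))
```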


LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

November 2024 · 16 Reads · 2 Citations

IEEE Micro

The explosive arrival of OpenAI’s ChatGPT has fueled the global adoption of large language models (LLMs), which consist of billions of pretrained parameters that embody syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for accelerating LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency, and is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B models, respectively, which is 2.09× and 1.37× faster than the GPU. Synthesized in a Samsung 4 nm process, the LPU has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33× and 1.32× higher energy efficiency than NVIDIA H100 and L4 servers, respectively.
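As a back-of-the-envelope check on why decode latency tracks memory bandwidth, the sketch below estimates per-token latency as model weight bytes divided by effective bandwidth. The bandwidth figure is an assumption for illustration, not an LPU specification.

```python
# Back-of-the-envelope estimate of decode latency for a memory-bandwidth-bound
# accelerator: each generated token must stream essentially all model weights.
# The bandwidth value is an illustrative assumption, not an LPU specification.
def decode_ms_per_token(params_billion: float,
                        bytes_per_param: int = 2,       # FP16 weights
                        bandwidth_gbps: float = 2000):  # assumed effective GB/s
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (bandwidth_gbps * 1e9) * 1e3  # milliseconds

# With these assumptions a 1.3B-parameter model needs ~1.3 ms/token, the same
# order as the 1.25 ms/token reported in the abstract.
print(decode_ms_per_token(1.3))   # ~1.3 ms
# A 66B model needs ~66 ms/token on a single device with the same assumed
# bandwidth; the 20.9 ms/token figure presumably aggregates multiple LPUs.
print(decode_ms_per_token(66))
```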


Figure 6. LPU implementation and specification.
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

August 2024 · 91 Reads

The explosive arrival of OpenAI's ChatGPT has fueled the global adoption of large language models (LLMs), which consist of billions of pretrained parameters that embody syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for accelerating LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency, and is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B models, respectively, which is 2.09x and 1.37x faster than the GPU. Synthesized in a Samsung 4 nm process, the LPU has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33x and 1.32x higher energy efficiency than NVIDIA H100 and L4 servers, respectively.




Figure 1. Illustration of transformer-based text generation.
Figure 2. GPT-2 structure and illustration of summarization and generation stages in text generation.
Figure 7. DFX compute core microarchitecture. The control logic decides which modules to run and is composed of the controller, scheduler, and scoreboard. Controller: receives the start signal and system configuration from the host. The configuration includes the core ID, the number of cores in the system, and the number of decoder layers and tokens to run. The core ID and core count direct each core to its section of the model weights and to the peer devices it receives from and transmits to; the number of decoder layers determines when single-token processing completes, and the number of input and output tokens determines when the entire service completes. Since a different portion of the HBM is accessed for each layer, the layer number designates the address the DMA needs to access, and the token number indicates where to mask during MaskedMM. The controller returns the done signal to the host once the entire GPT-2 operation finishes. Scheduler: receives the decoded system configuration from the controller and instructions from the instruction buffer. It contains a finite state machine per instruction type that checks the status of the DMA, processing units, register file, and router to decide whether to run or wait; the chosen instruction is sent to the scoreboard for a final dependency check against the running instructions. Scoreboard: checks the register file for dependencies so that instructions can execute according to the chaining method.
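A hypothetical sketch of the per-core configuration described above (core ID, core count, decoder layers, token counts) and how a core might derive its weight slice and peer devices from it; the field names and partitioning scheme are assumptions for illustration, not DFX's actual interface.

```python
# Hypothetical sketch of the per-core system configuration described above and
# how a core could derive its weight partition and ring peers from it.
# Field names and the partitioning scheme are assumptions, not DFX's design.
from dataclasses import dataclass

@dataclass
class CoreConfig:
    core_id: int          # this core's position in the appliance
    num_cores: int        # total compute cores across the FPGAs
    num_layers: int       # decoder layers to execute per token
    num_in_tokens: int    # summarization-stage input length
    num_out_tokens: int   # generation-stage output length

def weight_slice(cfg: CoreConfig, hidden_dim: int):
    """Assume each core works on a contiguous column slice of every layer's weights."""
    cols = hidden_dim // cfg.num_cores
    start = cfg.core_id * cols
    return start, start + cols

def ring_peers(cfg: CoreConfig):
    """Peer devices to receive from / transmit to, assuming a ring topology."""
    prev = (cfg.core_id - 1) % cfg.num_cores
    nxt = (cfg.core_id + 1) % cfg.num_cores
    return prev, nxt

cfg = CoreConfig(core_id=2, num_cores=4, num_layers=48,
                 num_in_tokens=64, num_out_tokens=64)
print(weight_slice(cfg, hidden_dim=1600), ring_peers(cfg))
```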
DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

September 2022 · 1,434 Reads

Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, the Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which requires processing a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance degrades significantly in the generation stage because of its sequential nature. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages. DFX uses model parallelism and an optimized, model-and-hardware-aware dataflow for fast simultaneous workload execution among devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves a 5.58x speedup and 3.99x higher energy efficiency than four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.
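To illustrate why the sequential generation stage dominates end-to-end latency, the toy model below contrasts a single parallel summarization pass with token-by-token generation. The per-step timings are assumed for illustration and are not measurements from the paper.

```python
# Illustrative model of why the generation stage dominates end-to-end latency:
# summarization processes the whole input context in one parallel pass, while
# generation emits tokens one at a time. Timings below are assumptions.
def text_gen_latency_ms(in_tokens: int, out_tokens: int,
                        parallel_pass_ms: float = 5.0,   # assumed summarization pass
                        per_token_ms: float = 4.0):      # assumed sequential step
    # Summarization is assumed roughly constant because the in_tokens-long
    # context is processed in parallel; generation is strictly sequential.
    summarization = parallel_pass_ms
    generation = out_tokens * per_token_ms
    return summarization + generation

# Even for a short 64-token output, the sequential stage dwarfs the parallel one.
print(text_gen_latency_ms(in_tokens=512, out_tokens=64))  # 5 + 256 = 261 ms
```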



Accelerating Deep Convolutional Neural Networks Using Number Theoretic Transform

January 2022 · 6 Reads · 6 Citations

IEEE Transactions on Circuits and Systems I: Regular Papers

Modern deep convolutional neural networks (CNNs) suffer from high computational complexity due to excessive convolution operations. Recently, fast convolution algorithms such as the fast Fourier transform (FFT) and Winograd transform have gained attention to address this problem. They reduce the number of multiplications required in the convolution operation by replacing it with element-wise multiplication in the transform domain. However, fast convolution-based CNN accelerators have three major concerns: expensive domain transform, large memory overhead, and limited flexibility in kernel size. In this paper, we present a novel CNN accelerator based on the number theoretic transform (NTT), which overcomes the existing limitations. We propose low-cost NTT and inverse-NTT converters that use only adders and shifters for on-chip domain transform, which solves the inflated bandwidth problem and enables more parallel computation in the accelerator. We also propose an accelerator architecture that includes multiple tile engines with optimized dataflow and mapping. Finally, we implement the proposed NTT-based CNN accelerator on the Xilinx Alveo U50 FPGA and evaluate it for popular deep CNN models. As a result, the proposed accelerator achieves 2859.5, 990.3, and 805.6 GOPS throughput for VGG-16, GoogLeNet, and Darknet-19, respectively. It outperforms existing fast convolution-based CNN accelerators by up to 9.6×.
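For readers unfamiliar with the transform-domain trick, the sketch below shows a textbook NTT-based convolution in which multiplication becomes element-wise after the transform. The modulus and primitive root are standard choices and are unrelated to the paper's hardware parameters.

```python
# Minimal sketch of convolution via the number theoretic transform (NTT):
# multiplication becomes element-wise in the transform domain, mirroring the
# FFT/Winograd idea contrasted in the abstract. The modulus and root are
# standard textbook choices, not parameters from the accelerator.
MOD = 998244353   # prime with 2^23 dividing MOD - 1
ROOT = 3          # primitive root modulo MOD

def ntt(a, invert=False):
    n = len(a)
    # bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # iterative Cooley-Tukey butterflies
    length = 2
    while length <= n:
        w = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w = pow(w, MOD - 2, MOD)
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * wn % MOD
                a[k], a[k + length // 2] = (u + v) % MOD, (u - v) % MOD
                wn = wn * w % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a

def conv(x, h):
    n = 1
    while n < len(x) + len(h) - 1:
        n <<= 1
    X = ntt(x + [0] * (n - len(x)))
    H = ntt(h + [0] * (n - len(h)))
    Y = [a * b % MOD for a, b in zip(X, H)]   # element-wise multiply in NTT domain
    return ntt(Y, invert=True)[: len(x) + len(h) - 1]

print(conv([1, 2, 3], [4, 5, 6]))  # [4, 13, 28, 27, 18], same as direct convolution
```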



FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism

February 2021 · 167 Reads

In this paper, we present FIXAR, a deep reinforcement learning platform that, for the first time, employs fixed-point data types and arithmetic units through a SW/HW co-design approach. Starting from 32-bit fixed-point data, quantization-aware training (QAT) reduces the data precision based on the range of activations and retrains the network to minimize reward degradation. FIXAR introduces an adaptive array processing core composed of configurable processing elements that supports both intra-layer and intra-batch parallelism for high-throughput inference and training. Implemented on a Xilinx Alveo U50 FPGA, FIXAR achieves a training throughput of 25,293.3 inferences per second (IPS) and an accelerator efficiency of 2,638.0 IPS/W, which is 2.7 times faster and 15.4 times more energy efficient than the CPU-GPU platform, without any accuracy degradation.
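A minimal sketch of range-based fixed-point fake quantization, the kind of step quantization-aware training applies to activations; the bit width, range handling, and rounding scheme here are illustrative assumptions, not FIXAR's implementation.

```python
# Minimal sketch of range-based fixed-point fake quantization, illustrating the
# quantization-aware training step described above. Bit width and rounding
# scheme are illustrative assumptions, not FIXAR's design.
import numpy as np

def fixed_point_quantize(x: np.ndarray, total_bits: int = 16):
    """Pick the fractional bit count from the observed activation range, then
    round values onto that fixed-point grid (straight-through in real QAT)."""
    max_abs = np.max(np.abs(x)) + 1e-12
    int_bits = max(0, int(np.ceil(np.log2(max_abs))) + 1)   # integer bits (conservative)
    frac_bits = total_bits - 1 - int_bits                   # remaining precision
    scale = 2.0 ** frac_bits
    lo = -(2 ** (total_bits - 1)) / scale
    hi = (2 ** (total_bits - 1) - 1) / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

acts = np.random.randn(4, 8).astype(np.float32) * 3.0
q = fixed_point_quantize(acts, total_bits=16)
print(np.max(np.abs(acts - q)))   # error bounded by 2**(-frac_bits) / 2
```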

Citations (6)


... Techniques like tiling and pipelining, which break down LLM computations into smaller units for optimal processing, are under study [4]. Methods to optimize data transfer between accelerators and memory, or between accelerators, are also proposed [5]. ...

Reference:

A Review on Proprietary Accelerators for Large Language Models
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
  • Citing Article
  • November 2024

IEEE Micro

... To achieve this, ADOR employs a MAC tree architecture that allows weights read from DRAM to be fed directly into the compute units without first being stored in SRAM. This approach ensures that the data is promptly processed, minimizing latency [23]. ...

HyperAccel Latency Processing Unit (LPU TM ) Accelerating Hyperscale Models for Generative AI
  • Citing Conference Paper
  • August 2023

... The HPTA [25] accelerator provided support for several transformer variants without needing FPGA reconfiguration. DFX [26] presented an end-to-end design for Generative Pretrained Transformer (GPT) model inference. [27] proposed efficient algorithm-hardware co-designs with sparse attention and dynamic pipelining. ...

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
  • Citing Conference Paper
  • October 2022

... For ResNet-18 workloads, USEFUSE uses 34.5% less BRAM resources compared to the design presented in [25]. The proposed design achieves significant throughput improvements of 3.7×, 3.48×, 9.2×, and 1.9× for VGG-16 workloads compared to TGPA [33], [57], ShortcutFusion [58], and [59] respectively. Similarly, USEFUSE achieved throughput improvements of 1.2×, 2.82×, 12.6×, and 1.82× compared to the designs presented in [25], T-DLA [26], [60], and RDLA [61] respectively, for ResNet-18 workloads. ...

Accelerating Deep Convolutional Neural Networks Using Number Theoretic Transform
  • Citing Article
  • January 2022

IEEE Transactions on Circuits and Systems I Regular Papers

... Networked and heterogeneous FPGA clusters [32]-[34] are proposed for cloud and edge computing. Existing works on scalable FPGA architectures [11]-[14], [35]-[38] primarily focus on accelerating applications, running emulation on multiple FPGAs, and comparing performance and power with other accelerators such as GPUs. Some of the multi-FPGA systems [15], [16] connect FPGAs and host CPUs into a hybrid network. ...

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
  • Citing Conference Paper
  • August 2022