Junsoo Kim’s research while affiliated with Korea Advanced Institute of Science and Technology and other places


Publications (15)


ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput
  • Preprint
  • File available

March 2025 · 10 Reads

Junsoo Kim · Hunjong Lee · Geonwoo Ko · [...] · Joo-Young Kim

The growing adoption of Large Language Models (LLMs) across various domains has driven the demand for efficient and scalable AI-serving solutions. Deploying LLMs requires optimizations to manage their significant computational and data demands. The prefill stage processes large numbers of input tokens in parallel, increasing computational load, while the decoding stage relies heavily on memory bandwidth due to the auto-regressive nature of LLMs. Current hardware, such as GPUs, often fails to balance these demands, leading to inefficient utilization. While batching improves hardware efficiency, it delays response times, degrading Quality-of-Service (QoS). This disconnect between vendors, who aim to maximize resource efficiency, and users, who prioritize low latency, highlights the need for a better solution. To address this, we propose ADOR, a framework that automatically identifies and recommends hardware architectures tailored to LLM serving. By leveraging predefined architecture templates specialized for heterogeneous dataflows, ADOR optimally balances throughput and latency. It efficiently explores design spaces to suggest architectures that meet the requirements of both vendors and users. ADOR demonstrates substantial performance improvements, achieving 2.51x higher QoS and 4.01x better area efficiency compared to the A100 at high batch sizes, making it a robust solution for scalable and cost-effective LLM serving.
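The abstract contrasts the compute-bound prefill stage with the memory-bandwidth-bound decode stage and the throughput/latency tension that batching creates. Below is a minimal roofline-style sketch of that reasoning; all model and hardware numbers are assumptions chosen for illustration and are not taken from the ADOR paper.

```python
# Illustrative roofline-style estimate of prefill vs. decode time for LLM serving.
# All hardware and model numbers below are assumptions for illustration only.

def prefill_time_s(n_params, prompt_tokens, batch, peak_flops):
    # Prefill is roughly compute-bound: ~2 FLOPs per parameter per token.
    flops = 2 * n_params * prompt_tokens * batch
    return flops / peak_flops

def decode_time_per_token_s(n_params, bytes_per_param, mem_bw):
    # Auto-regressive decode is roughly memory-bound: every generated token
    # streams the full weight set from DRAM, regardless of batch size.
    return (n_params * bytes_per_param) / mem_bw

if __name__ == "__main__":
    N = 7e9          # assumed 7B-parameter model
    PEAK = 300e12    # assumed 300 TFLOP/s of compute
    BW = 2e12        # assumed 2 TB/s of memory bandwidth
    for batch in (1, 8, 32):
        tp = prefill_time_s(N, prompt_tokens=512, batch=batch, peak_flops=PEAK)
        td = decode_time_per_token_s(N, bytes_per_param=2, mem_bw=BW)
        # Batching amortizes weight traffic across requests (higher throughput),
        # but requests must wait for a batch to form, which is the QoS tension
        # the abstract describes.
        print(f"batch={batch:3d}  prefill={tp*1e3:7.1f} ms  "
              f"decode={td*1e3:5.1f} ms/token  "
              f"decode throughput={batch/td:8.0f} tok/s")
```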


LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

November 2024 · 16 Reads · 2 Citations

IEEE Micro

The explosive arrival of OpenAI’s ChatGPT has fueled the global adoption of large language models (LLMs), which consist of billions of pretrained parameters that embody the aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency. It is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token for the 1.3B and 66B models, respectively, which is 2.09× and 1.37× faster than the GPU. Synthesized in a Samsung 4 nm process, the LPU has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33× and 1.32× higher energy efficiency than NVIDIA H100 and L4 servers, respectively.
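The reported per-token latencies can be read through the memory-bound view of decoding described above. The sketch below converts the abstract's ms/token figures into tokens/s and into the effective weight bandwidth they would imply if decode were fully memory-bound; the FP16 weight format and the bandwidth interpretation are assumptions, not LPU specifications.

```python
# Back-of-the-envelope check relating ms/token to weight traffic during decode.
# The 1.25 ms/token (1.3B) and 20.9 ms/token (66B) figures come from the abstract;
# FP16 weights and the derived bandwidth values are assumptions for illustration.

def tokens_per_second(ms_per_token):
    return 1000.0 / ms_per_token

def implied_bandwidth_gbps(n_params, bytes_per_param, ms_per_token):
    # If decode is fully memory-bound, each token streams all weights once,
    # so the achieved latency implies an effective aggregate bandwidth
    # (possibly across several LPUs).
    return (n_params * bytes_per_param) / (ms_per_token / 1000.0) / 1e9

for name, n_params, ms in [("1.3B", 1.3e9, 1.25), ("66B", 66e9, 20.9)]:
    print(f"{name}: {tokens_per_second(ms):6.1f} tok/s, "
          f"~{implied_bandwidth_gbps(n_params, 2, ms):7.0f} GB/s effective "
          f"weight bandwidth (assuming FP16 weights)")
```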


FIGURE 6. LPU implementation and specification.
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

August 2024 · 91 Reads

The explosive arrival of OpenAI's ChatGPT has fueled the global adoption of large language models (LLMs), which consist of billions of pretrained parameters that embody the aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency. It is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token for the 1.3B and 66B models, respectively, which is 2.09x and 1.37x faster than the GPU. Synthesized in a Samsung 4 nm process, the LPU has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33x and 1.32x higher energy efficiency than NVIDIA H100 and L4 servers, respectively.


EPU: An Energy-Efficient Explainable AI Accelerator With Sparsity-Free Computation and Heat Map Compression/Pruning

March 2024 · 12 Reads · 1 Citation

IEEE Journal of Solid-State Circuits

Deep neural networks (DNNs) have recently gained significant prominence in various real-world applications such as image recognition, natural language processing, and autonomous vehicles. However, due to their black-box nature, the underlying mechanisms of DNNs behind the inference results remain opaque to users. To address this challenge, researchers have focused on developing explainable artificial intelligence (AI) algorithms. Explainable AI aims to provide a clear and human-understandable explanation of the model’s decision, thereby building more reliable systems. However, the explanation task differs from the well-known inference and training processes as it involves interactions with the user. Consequently, existing inference and training accelerators face inefficiencies when processing explainable AI on edge devices. This article introduces the explainable processing unit (EPU), the first hardware accelerator designed for explainable AI workloads. The EPU utilizes a novel data compression format for the output heat maps and intermediate gradients to enhance the overall system performance by reducing both the memory footprint and external memory access. Its sparsity-free computing core efficiently handles the input sparsity with negligible control overhead, resulting in a throughput boost of up to 9.48×. The article also proposes dynamic workload scheduling with a customized on-chip network for the distinct inference and explanation tasks to maximize internal data reuse, reducing external memory access by 63.7%. Furthermore, the EPU incorporates point-wise gradient pruning (PGP), which, combined with the proposed compression format, reduces the size of heat maps by a factor of 7.01×. Finally, the EPU chip fabricated in a 28 nm CMOS process achieves a heat map generation rate of 367 frames/s for ResNet-34 while maintaining state-of-the-art area and energy efficiency of 112.3 GOPS/mm² and 26.55 TOPS/W, respectively.
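The abstract attributes much of the savings to point-wise gradient pruning plus heat map compression. The sketch below illustrates the general idea of magnitude-based point-wise pruning followed by sparse (index, value) packing of a saliency map; the keep-ratio policy and the storage format are assumptions for illustration, not the EPU's actual PGP or compression format.

```python
import numpy as np

# Illustrative point-wise pruning + sparse packing of a saliency heat map.
# The thresholding policy and the (index, value) storage format are assumptions;
# the EPU's PGP and compression format are defined in the paper.

def prune_and_pack(heat_map: np.ndarray, keep_ratio: float = 0.1):
    flat = heat_map.ravel()
    k = max(1, int(keep_ratio * flat.size))
    # Keep only the k largest-magnitude attributions (point-wise pruning).
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    packed = (idx.astype(np.uint32), flat[idx].astype(np.float16))
    dense_bytes = flat.size * 4                        # FP32 dense map
    packed_bytes = packed[0].nbytes + packed[1].nbytes
    return packed, dense_bytes / packed_bytes          # compression ratio

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hm = rng.standard_normal((224, 224)).astype(np.float32)
    _, ratio = prune_and_pack(hm, keep_ratio=0.1)
    print(f"dense/packed size ratio = {ratio:.2f}x")
```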





Figure 1. Illustration of transformer-based text generation.
Figure 2. GPT-2 structure and illustration of summarization and generation stages in text generation.
Figure 7. DFX compute core microarchitecture.
DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

September 2022 · 1,434 Reads

Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, the Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which needs the processing of a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance degrades significantly in the generation stage due to its sequential characteristic. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages. DFX uses model parallelism and an optimized dataflow that is model-and-hardware-aware for fast simultaneous workload execution among devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves 5.58x speedup and 3.99x energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.
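The abstract states that DFX relies on model parallelism across four FPGAs. The sketch below shows a generic intra-layer (tensor-parallel) split of a decoder layer's attention heads and FFN columns across devices, as one way such a partition can be expressed; the even split and the configuration numbers are assumptions for illustration, not DFX's exact mapping or instruction flow.

```python
# Illustrative intra-layer (model-parallel) partitioning of a decoder layer
# across devices. The even head/column split is a generic tensor-parallel
# scheme used for illustration; DFX's actual mapping is described in the paper.

from dataclasses import dataclass

@dataclass
class Shard:
    device: int
    heads: range      # attention heads handled by this device
    ffn_cols: range   # FFN hidden columns handled by this device

def partition_layer(n_heads: int, ffn_hidden: int, n_devices: int):
    assert n_heads % n_devices == 0 and ffn_hidden % n_devices == 0
    h, f = n_heads // n_devices, ffn_hidden // n_devices
    return [Shard(d, range(d * h, (d + 1) * h), range(d * f, (d + 1) * f))
            for d in range(n_devices)]

# Hypothetical 16-head, 6400-column configuration split over 4 devices.
for shard in partition_layer(n_heads=16, ffn_hidden=6400, n_devices=4):
    print(shard)
```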



OpenMDS: An Open-Source Shell Generation Framework for High-Performance Design on Xilinx Multi-Die FPGAs

July 2022 · 3 Reads · 5 Citations

IEEE Computer Architecture Letters

FPGAs are a promising platform for hardware design due to their design flexibility and fast development cycle, despite their limited hardware resources. To address this, the latest FPGAs have adopted a multi-die architecture that employs multiple dies in a single device to provide abundant hardware resources. However, the multi-die architecture causes critical timing issues when signal paths cross the die-to-die boundaries, adding another design challenge to using FPGAs. We propose OpenMDS, an open-source shell generation framework for high-performance design on Xilinx multi-die FPGAs. Based on the user's design requirements, it generates an optimized shell for the target FPGA via die-level kernel encapsulation, automated bus pipelining, and customized floorplanning. To evaluate our shell generation, we compare its implementation results against Xilinx's Vitis framework. As a result, OpenMDS uses on average 20% fewer logic resources than Vitis for the same shell functionality. To show its practicality, we use OpenMDS for the design of a machine learning accelerator that contains multiple systolic-array processors. OpenMDS achieves 247 MHz and 235 MHz kernel frequencies and 400 MHz and 429 MHz memory bus frequencies for the U50 and U280, respectively, for the accelerator design at over 90% logic utilization, delivering up to 12.27% and 22.92% higher kernel and memory bus frequencies than Vitis.
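The abstract's automated bus pipelining addresses timing on die-crossing paths. The sketch below is a simple estimate of how many pipeline register stages a bus crossing several die (SLR) boundaries might need to close timing at a target frequency; the per-crossing and local routing delays and the stage policy are assumptions for illustration, not OpenMDS's actual pipelining algorithm.

```python
import math

# Illustrative estimate of pipeline stages for a bus crossing die (SLR)
# boundaries on a multi-die FPGA. The delay numbers are assumptions chosen
# for illustration; OpenMDS derives pipelining from the actual design/device.

def stages_needed(target_mhz: float, n_crossings: int,
                  crossing_delay_ns: float = 1.0, local_delay_ns: float = 1.5):
    period_ns = 1000.0 / target_mhz
    # Total path delay if the bus ran unpipelined end to end (assumed model).
    total_ns = n_crossings * crossing_delay_ns + local_delay_ns
    # Each added register splits the path; every segment must fit one period.
    return max(0, math.ceil(total_ns / period_ns) - 1)

for mhz in (200, 300, 400):
    print(f"{mhz} MHz, 2 SLR crossings -> "
          f"{stages_needed(mhz, n_crossings=2)} pipeline stage(s)")
```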


Citations (9)


... Techniques like tiling and pipelining, which break down LLM computations into smaller units for optimal processing, are under study [4]. Methods to optimize data transfer between accelerators and memory, or between accelerators, are also proposed [5]. ...

Reference:

A Review on Proprietary Accelerators for Large Language Models
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
  • Citing Article
  • November 2024

IEEE Micro

... In highly dynamic scenarios such as real-time network slicing and vehicular network slicing, the time and energy efficiency of XAI methods is critical. Lightweight or simplified interpretability models represent one of the future research directions, ensuring efficient operation on resource-constrained edge nodes [232], [233]. Techniques such as model compression, quantization, and the use of specialized hardware accelerators can significantly reduce the energy consumption of XAI methods [234], [235]. Furthermore, exploring decentralized or distributed XAI approaches could further enhance energy efficiency and reduce latency [236]. ...

EPU: An Energy-Efficient Explainable AI Accelerator With Sparsity-Free Computation and Heat Map Compression/Pruning
  • Citing Article
  • March 2024

IEEE Journal of Solid-State Circuits

... To achieve this, ADOR employs a MAC tree architecture that allows weights read from DRAM to be fed directly into the compute units without first being stored in SRAM. This approach ensures that the data is promptly processed, minimizing latency [23]. ...

HyperAccel Latency Processing Unit (LPU™) Accelerating Hyperscale Models for Generative AI
  • Citing Conference Paper
  • August 2023

... Since edge devices have more stringent energy constraints compared to centralized data processing methods, energy-efficient data processing is a major challenge [4,5]. In response, various technological efforts have been made to enhance the energy efficiency of edge computing, one of which is the integration of processing-in-memory (PIM) architecture with edge devices [6,7]. PIM performs data processing inside or near the memory array, reducing latency due to data movement and addressing the key design goal of high energy efficiency in edge devices for memory-centric tasks such as AI applications [8]. ...

T-PIM: An Energy-Efficient Processing-in-Memory Accelerator for End-to-End On-Device Training
  • Citing Article
  • January 2022

IEEE Journal of Solid-State Circuits

... Other studies also evaluate the localization ability of CAM methods by turning the attributions into segmentation masks and comparing the IoU or classification accuracy [132,135,155]. Additionally, [156] compare attention networks and CAM variants on the metrics max-sensitivity and average % drop/increase in confidence. Regarding other xAI approaches, the attention weights are evaluated in [144] by inspecting drops in the accuracy for crop mapping when the transformer model is trained on a subset of dates with the highest attention values. ...

Federated Onboard-Ground Station Computing With Weakly Supervised Cascading Pyramid Attention Network for Satellite Image Analysis

IEEE Access

... The HPTA [25] accelerator provided support for several transformer variants without needing FPGA reconfiguration. DFX [26] presented an end-to-end design for Generative Pre-trained Transformer (GPT) model inference. [27] proposed efficient algorithm-hardware co-designs with sparse attention and dynamic pipelining. ...

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
  • Citing Conference Paper
  • October 2022

... Networked and heterogeneous FPGA clusters [32]-[34] are proposed for cloud and edge computing. Existing works on scalable FPGA architectures [11]-[14], [35]-[38] primarily focus on accelerating applications, running emulation on multiple FPGAs, and comparing the performance and power with other accelerators such as GPUs. Some of the multi-FPGA systems [15], [16] connect FPGAs and host CPUs into a hybrid network. ...

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
  • Citing Conference Paper
  • August 2022

... Given the equality of each port, maintaining fairness in data transfer between ports becomes paramount, and the doubled inter-die latency adds complexity to timing convergence, particularly for logic circuits spanning multiple dies. The interconnections in the multi-die architecture are thus susceptible to routing congestion [24,25]. ...

OpenMDS: An Open-Source Shell Generation Framework for High-Performance Design on Xilinx Multi-Die FPGAs
  • Citing Article
  • July 2022

IEEE Computer Architecture Letters

... By placing the processing unit in/near the memory, the high latency and power consumption caused by data movement can be significantly reduced, making PIM superior for accelerating data-intensive applications. By leveraging the benefits of PIM, various fully customized designs based on SRAM have been proposed with its high reliability [11], [12], [13], [14], [15], [16], [17], [18], [19]. However, due to the increasing demand for higher memory capacity with increasing model sizes, PIM designs using the cell with higher density, such as embedded DRAM (eDRAM), have been recently proposed [20], [21], [22], [23], [24] to provide better area and power efficiency compared to SRAM-based approach. ...

T-PIM: A 2.21-to-161.08TOPS/W Processing-In-Memory Accelerator for End-to-End On-Device Training
  • Citing Conference Paper
  • April 2022