Junsoo Kim’s research while affiliated with Korea Advanced Institute of Science and Technology and other places


Publications (14)


LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
  • Article

November 2024 · 13 Reads · 1 Citation

IEEE Micro

Seungjae Moon · Jung-Hoon Kim · Junsoo Kim · [...] · Joo-Young Kim

The explosive arrival of OpenAI's ChatGPT has fueled the global adoption of large language models (LLMs), which consist of billions of pretrained parameters that embody aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for accelerating LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency, and is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B models, respectively, which is 2.09× and 1.37× faster than the GPU. Synthesized in a Samsung 4 nm process, the LPU has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33× and 1.32× higher energy efficiency than NVIDIA H100 and L4 servers, respectively.
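A quick back-of-the-envelope check of these figures, written as a small Python snippet (the numbers are taken from the abstract; the GPU latencies are derived from the stated speedups, not measured here):

```python
# Back-of-the-envelope check of the LPU latency figures reported above.
# All numbers come from the abstract; the GPU baselines are derived, not measured.

lpu_latency_ms = {"1.3B": 1.25, "66B": 20.9}   # ms per generated token on the LPU
speedup_vs_gpu = {"1.3B": 2.09, "66B": 1.37}   # reported LPU speedup over the GPU baseline

for model, ms_per_token in lpu_latency_ms.items():
    tokens_per_s = 1000.0 / ms_per_token                    # single-stream throughput
    implied_gpu_ms = ms_per_token * speedup_vs_gpu[model]   # implied GPU ms/token
    print(f"{model}: LPU {ms_per_token:.2f} ms/token "
          f"({tokens_per_s:.1f} tokens/s), implied GPU ~{implied_gpu_ms:.2f} ms/token")
```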


Figure 6. LPU implementation and specification.
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
  • Preprint
  • File available

August 2024 · 71 Reads



EPU: An Energy-Efficient Explainable AI Accelerator With Sparsity-Free Computation and Heat Map Compression/Pruning

March 2024 · 10 Reads · 1 Citation

IEEE Journal of Solid-State Circuits

Deep neural networks (DNNs) have recently gained significant prominence in real-world applications such as image recognition, natural language processing, and autonomous vehicles. However, due to their black-box nature, the mechanisms behind DNN inference results remain opaque to users. To address this challenge, researchers have developed explainable artificial intelligence (AI) algorithms, which aim to provide clear, human-understandable explanations of a model's decisions and thereby build more reliable systems. The explanation task, however, differs from the well-known inference and training processes because it involves interaction with the user, so existing inference and training accelerators are inefficient when processing explainable AI on edge devices. This article introduces the explainable processing unit (EPU), the first hardware accelerator designed for explainable AI workloads. The EPU utilizes a novel data compression format for output heat maps and intermediate gradients that enhances overall system performance by reducing both memory footprint and external memory access. Its sparsity-free computing core handles input sparsity with negligible control overhead, boosting throughput by up to 9.48×. The EPU also employs dynamic workload scheduling with a customized on-chip network for the distinct inference and explanation tasks, maximizing internal data reuse and reducing external memory access by 63.7%. Furthermore, it incorporates point-wise gradient pruning (PGP), which, combined with the proposed compression format, reduces the size of heat maps by a factor of 7.01×. Finally, the EPU chip, fabricated in a 28 nm CMOS process, achieves a heat map generation rate of 367 frames/s for ResNet-34 while maintaining state-of-the-art area and energy efficiency of 112.3 GOPS/mm² and 26.55 TOPS/W, respectively.
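The pruning-plus-compression idea can be illustrated with a small, hypothetical sketch: magnitude-based point-wise pruning of a gradient heat map followed by a naive sparse (index, value) encoding. This is only a conceptual stand-in, not the EPU's actual PGP algorithm or compression format; the keep ratio and data types are arbitrary.

```python
import numpy as np

def prune_and_compress(heat_map: np.ndarray, keep_ratio: float = 0.15):
    """Keep only the largest-magnitude attributions and store them sparsely.

    Hypothetical stand-in for point-wise gradient pruning: the real EPU pipeline
    uses its own threshold selection and a custom compression format.
    """
    flat = heat_map.ravel()
    k = max(1, int(keep_ratio * flat.size))
    # Magnitude threshold chosen so that roughly `keep_ratio` of points survive.
    threshold = np.partition(np.abs(flat), flat.size - k)[flat.size - k]
    mask = np.abs(flat) >= threshold
    indices = np.nonzero(mask)[0].astype(np.uint32)   # positions of surviving points
    values = flat[mask].astype(np.float16)            # pruned heat-map values
    dense_bytes = flat.size * 4                       # fp32 dense storage
    sparse_bytes = indices.nbytes + values.nbytes
    return indices, values, dense_bytes / sparse_bytes

heat_map = np.random.randn(224, 224).astype(np.float32)
_, _, ratio = prune_and_compress(heat_map)
print(f"compression ratio ~ {ratio:.1f}x")
```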





Figure 1. Illustration of transformer-based text generation.
Figure 2. GPT-2 structure and illustration of the summarization and generation stages in text generation.
Figure 7. DFX compute core microarchitecture: the control path that decides which modules to run is composed of the controller, scheduler, and scoreboard. The controller receives the start signal and system configuration from the host; the configuration includes the core ID, the number of cores, and the number of decoder layers and tokens the system runs on. The core ID and number of cores tell each core which section of the model weights to work on and which peer devices to receive from and transmit to; the number of decoder layers determines when single-token processing completes, and the number of input and output tokens determines when the entire service completes. Because a different portion of the HBM is accessed for each layer, the layer number designates the address the DMA accesses, while the token number indicates where to mask during MaskedMM. The controller returns the done signal to the host once the entire GPT-2 operation finishes. The scheduler receives the decoded system configuration from the controller and instructions from the instruction buffer; it contains a finite state machine per instruction type that checks the status of the DMA, processing units, register file, and router to decide whether to run or wait on each instruction type. The chosen instruction is sent to the scoreboard for a final dependency check against running instructions, since the register file runs instructions based on the chaining method.
DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

September 2022 · 1,421 Reads

The transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, the Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which requires processing a large input context in the summarization stage, followed by a generation stage that produces a single word at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance degrades significantly in the generation stage due to its sequential nature. An efficient hardware platform is therefore required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both the summarization and generation stages. DFX uses model parallelism and an optimized, model-and-hardware-aware dataflow for fast simultaneous workload execution across devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs, utilizing all channels of the high-bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves a 5.58× speedup and 3.99× higher energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21× more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.
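The two-stage workload that DFX targets can be sketched at a high level as below. The layer partitioning, function names, and parameters here are illustrative assumptions, not DFX's actual instruction set or runtime; the point is that the summarization stage consumes the whole prompt at once, while the generation stage advances strictly one token per step.

```python
from typing import List

NUM_DEVICES = 4     # four FPGA cards, as in the DFX appliance
NUM_LAYERS = 48     # e.g., GPT-2 1.5B has 48 decoder layers

# Model parallelism: each device owns a contiguous slice of the decoder layers.
layer_slices = [range(d * NUM_LAYERS // NUM_DEVICES,
                      (d + 1) * NUM_LAYERS // NUM_DEVICES) for d in range(NUM_DEVICES)]

def run_slice(device: int, hidden):            # placeholder for on-device decoder layers
    for _layer in layer_slices[device]:
        pass                                   # attention + FFN math elided
    return hidden

def sample_next_token(hidden) -> int:          # placeholder for LM head + sampling
    return 0

def generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    # Summarization stage: the full prompt is processed at once (parallel-friendly).
    hidden = prompt
    for device in range(NUM_DEVICES):          # hidden state streams device-to-device
        hidden = run_slice(device, hidden)
    tokens = list(prompt)
    tokens.append(sample_next_token(hidden))

    # Generation stage: strictly one token per step; this sequential loop is the
    # latency bottleneck that DFX's multi-device dataflow is built to shorten.
    for _ in range(max_new_tokens - 1):
        hidden = [tokens[-1]]
        for device in range(NUM_DEVICES):
            hidden = run_slice(device, hidden)
        tokens.append(sample_next_token(hidden))
    return tokens

print(generate(prompt=[101, 102, 103], max_new_tokens=8))
```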



OpenMDS: An Open-Source Shell Generation Framework for High-Performance Design on Xilinx Multi-Die FPGAs

July 2022 · 3 Reads · 4 Citations

IEEE Computer Architecture Letters

FPGAs are a promising platform for hardware design thanks to their design flexibility and fast development cycle, despite their limited hardware resources. To address this limitation, the latest FPGAs adopt a multi-die architecture that provides abundant hardware resources by employing multiple dies in a single device. However, the multi-die architecture causes critical timing issues when signal paths cross die-to-die boundaries, adding another design challenge to FPGA development. We propose OpenMDS, an open-source shell generation framework for high-performance design on Xilinx multi-die FPGAs. Based on the user's design requirements, it generates an optimized shell for the target FPGA via die-level kernel encapsulation, automated bus pipelining, and customized floorplanning. To evaluate our shell generation, we compare its implementation results against Xilinx's Vitis framework; OpenMDS uses on average 20% fewer logic resources than Vitis for the same shell functionality. To show its practicality, we use OpenMDS to design a machine learning accelerator containing multiple systolic-array processors. For this accelerator design, at over 90% logic utilization, OpenMDS achieves kernel frequencies of 247 MHz and 235 MHz and memory bus frequencies of 400 MHz and 429 MHz on the U50 and U280, respectively, up to 12.27% and 22.92% higher kernel and memory bus frequencies than Vitis.
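To illustrate why automated bus pipelining matters, the toy calculation below estimates how many register stages a bus needs when its path crosses die (SLR) boundaries at a given target frequency. This is not OpenMDS code, and the delay figures are made-up placeholders rather than characterized values for any Xilinx device.

```python
# Toy model of die-crossing bus pipelining. The delay figures are illustrative
# placeholders, not characterized numbers for a specific FPGA.

import math

TARGET_FREQ_MHZ = 400.0          # desired memory-bus clock
PERIOD_NS = 1000.0 / TARGET_FREQ_MHZ

SLR_CROSSING_DELAY_NS = 1.9      # assumed routing delay of one die-to-die hop
LOGIC_DELAY_NS = 0.6             # assumed delay of the logic between registers

def stages_for_path(num_crossings: int) -> int:
    """Register stages needed so that no stage exceeds the clock period."""
    total_delay = num_crossings * SLR_CROSSING_DELAY_NS + LOGIC_DELAY_NS
    return max(1, math.ceil(total_delay / PERIOD_NS))

# A path spanning all three SLRs of a U280 makes two boundary hops.
for crossings in (1, 2):
    print(f"{crossings} SLR crossing(s): {stages_for_path(crossings)} "
          f"pipeline register stage(s) at {TARGET_FREQ_MHZ:.0f} MHz")
```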


Design of Processing-in-Memory With Triple Computational Path and Sparsity Handling for Energy-Efficient DNN Training

June 2022 · 8 Reads · 2 Citations

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

As machine learning (ML) and artificial intelligence (AI) have become mainstream technologies, many accelerators have been proposed to cope with their computation kernels. However, they access external memory frequently due to the large size of deep neural network models, suffering from the von Neumann bottleneck. Moreover, as privacy issues become more critical, on-device training is emerging as a solution. On-device training is challenging, however, because it must operate under a limited power budget while requiring far more computation and memory access than inference. In this paper, we present T-PIM, an energy-efficient processing-in-memory (PIM) architecture supporting end-to-end on-device training. Its macro design includes an 8T-SRAM cell-based PIM block that computes in-memory AND operations and three computational datapaths for end-to-end training. The three computational paths integrate arithmetic units for forward propagation, backward propagation, and gradient calculation with weight update, respectively, allowing the weight data stored in the memory to remain stationary. T-PIM also supports variable bit precision to cover various ML scenarios: fully variable input bit precision with 2-bit, 4-bit, 8-bit, or 16-bit weight precision for forward propagation, and the same input bit precision with 16-bit weight precision for backward propagation. In addition, T-PIM implements sparsity-handling schemes that skip computation for input data and turn off the arithmetic units for weight data, reducing both unnecessary computation and leakage power. Finally, we fabricate the T-PIM chip on a 5.04 mm² die in a 28 nm CMOS logic process. It operates at 50–280 MHz with a supply voltage of 0.75–1.05 V, dissipating 5.25–51.23 mW in inference and 6.10–37.75 mW in training. As a result, it achieves 17.90–161.08 TOPS/W energy efficiency for inference with 1-bit activations and 2-bit weights, and 0.84–7.59 TOPS/W for training with 8-bit activations/errors and 16-bit weights. T-PIM is the first PIM chip that supports end-to-end training, demonstrating a 2.02× performance improvement over the latest PIM design that partially supports training.
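The in-memory AND primitive and bit-serial accumulation can be mimicked in software. The sketch below is a conceptual illustration only, not T-PIM's actual datapath: it shows how AND operations between input and weight bit-planes, weighted by bit position, reproduce an unsigned multiply-accumulate, and how all-zero input bit-planes can be skipped in the spirit of the sparsity handling described above.

```python
import numpy as np

def bit_serial_mac(inputs: np.ndarray, weights: np.ndarray,
                   in_bits: int = 8, w_bits: int = 16) -> int:
    """Unsigned bit-serial MAC built from AND operations, PIM-style.

    Each (input bit-plane, weight bit-plane) pair contributes
    popcount(AND) << (i + j); skipping all-zero input bit-planes mirrors,
    conceptually, the input-sparsity handling described above.
    """
    acc = 0
    for i in range(in_bits):
        in_plane = (inputs >> i) & 1
        if not in_plane.any():          # sparsity handling: no work for zero bit-planes
            continue
        for j in range(w_bits):
            w_plane = (weights >> j) & 1
            acc += int(np.sum(in_plane & w_plane)) << (i + j)
    return acc

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=64, dtype=np.int64)       # 8-bit activations
w = rng.integers(0, 1 << 16, size=64, dtype=np.int64)   # 16-bit weights
assert bit_serial_mac(x, w) == int(np.dot(x, w))        # matches a plain dot product
print(bit_serial_mac(x, w))
```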


Citations (8)


... Turbo Sparse [130], ProSparse [131] LLM-pruner [132], SparseGPT [133], Wanda [134], E-Sparse [135], Flash-LLM [136], Agarwalla et al. [137], Sparse Transformer [139], Bigbird [140], StreamingLLM [141], Longformer [142], Adaptively Sparse Attention [143], Reformer [144], Sparse Flash Attention [145], Sparse Sinkhorn Attention [146], H2O [147] FlightLLM [120], EdgeLLM [121] Spatten [148], TF-MVP [149], SOFA [150] LauWS [151], HARDSEA [152], Sharda et al. [129] Fast Decoding LLMA [153], Speculative decoding [154], Lookahead [155], Medusa [156], EAGLE [157,158], Ouroboros [159], Sequoia [160], Draft&Verify [161], Kangaroo [162], LayerSkip [163], Adainfer [164], RAEE [165], MOD [166] C-Transformer [167] SpecPIM [168] Operator Optimization FlashAttention [169,170], FlashDecoding [171], FlashDecoding++ [172], DeepSpeed [173], vLLM [174], OpenPPL [175], cuBLAS [176], TensorRT-LLM [177], CUTLASS [178], ByteTransformer [179] LPU [180], Groq LPU [181], ConSmax [182], MARCA [183], TCP [184], Habana Gaudi [185], Gaudi2 [186], Gaudi3 [187], Cerebras WSE-3 [188] PIMnast [189], AttentionLego [190], PIM-GPT [191], SAL-PIM [192], PipePIM [193] ...

Reference:

Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
  • Citing Article
  • November 2024

IEEE Micro

... In highly dynamic scenarios such as real-time network slicing and vehicular network slicing, the time and energy efficiency of XAI methods is critical. Lightweight or simplified interpretability models represent one of the future research directions, ensuring efficient operation on resource-constrained edge nodes [232], [233]. Techniques such as model compression, quantization, and the use of specialized hardware accelerators can significantly reduce the energy consumption of XAI methods [234], [235]. Furthermore, exploring decentralized or distributed XAI approaches could further enhance energy efficiency and reduce latency [236]. ...

EPU: An Energy-Efficient Explainable AI Accelerator With Sparsity-Free Computation and Heat Map Compression/Pruning
  • Citing Article
  • March 2024

IEEE Journal of Solid-State Circuits

... Since edge devices have more stringent energy constraints compared to centralized data processing methods, energy-efficient data processing is a major challenge [4,5]. In response, various technological efforts have been made to enhance the energy efficiency of edge computing, one of which is the integration of processing-in-memory (PIM) architecture with edge devices [6,7]. PIM performs data processing inside or near the memory array, reducing latency due to data movement and addressing the key design goal of high energy efficiency in edge devices for memory-centric tasks such as AI applications [8]. ...

T-PIM: An Energy-Efficient Processing-in-Memory Accelerator for End-to-End On-Device Training
  • Citing Article
  • January 2022

IEEE Journal of Solid-State Circuits

... Other studies also evaluate the localization ability of CAM methods by turning the attributions into segmentation masks and comparing the IoU or classification accuracy [132,135,155]. Additionally, [156] compare attention networks and CAM variants on the metrics max-sensitivity and average % drop/increase in confidence. Regarding other xAI approaches, the attention weights are evaluated in [144] by inspecting drops in the accuracy for crop mapping when the transformer model is trained on a subset of dates with the highest attention values. ...

Federated Onboard-Ground Station Computing With Weakly Supervised Cascading Pyramid Attention Network for Satellite Image Analysis

IEEE Access

... InTAR exhibits 1.8× and 7.1× speedup compared with the corresponding dataflow and sequential accelerator. We further present InTAR on the GPT-2 medium model for a complete DNN example, which achieves a speedup of 3.65 ∼ 39.14× and a 1.72 ∼ 10.44× improvement in DSP efficiency compared to the SoTA accelerators (Allo [9] and DFX [23]). Moreover, InTAR demonstrated 1.66 ∼ 7.17× better power efficiency compared to GPUs. ...

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
  • Citing Conference Paper
  • October 2022

... Various transformer acceleration frameworks for efficient inference have been proposed based on GPU [45,50], ASIC [41,18,16], and FPGA [20,43,42,17,51,11,49,21,48,52,53,54,55,46,56,47]. Unlike GPU and ASIC designs, FPGA has attracted much attention recently, thanks to the configurable and flexible nature of FPGA devices, which alleviates the low hardware utilization issue of fixed-architecture GPU or ASIC designs [47]. ...

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
  • Citing Conference Paper
  • August 2022

... Given the equality of each port, maintaining fairness in data transfer between ports becomes paramount, and the doubled inter-die latency adds complexity to timing convergence, particularly for logic circuits spanning multiple dies. The interconnections in the multi-die architecture are thus susceptible to routing congestion [24,25]. ...

OpenMDS: An Open-Source Shell Generation Framework for High-Performance Design on Xilinx Multi-Die FPGAs
  • Citing Article
  • July 2022

IEEE Computer Architecture Letters

... By placing the processing unit in/near the memory, the high latency and power consumption caused by data movement can be significantly reduced, making PIM superior for accelerating data-intensive applications. By leveraging the benefits of PIM, various fully customized designs based on SRAM have been proposed, owing to its high reliability [11], [12], [13], [14], [15], [16], [17], [18], [19]. However, due to the increasing demand for higher memory capacity with growing model sizes, PIM designs using higher-density cells, such as embedded DRAM (eDRAM), have recently been proposed [20], [21], [22], [23], [24] to provide better area and power efficiency than the SRAM-based approach. ...

T-PIM: A 2.21-to-161.08TOPS/W Processing-In-Memory Accelerator for End-to-End On-Device Training
  • Citing Conference Paper
  • April 2022