Jung-Hoon Kim’s research while affiliated with Korea Advanced Institute of Science and Technology and other places


Publications (30)


LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
  • Article

November 2024 · 16 Reads · 2 Citations · IEEE Micro

Seungjae Moon · Jung-Hoon Kim · Junsoo Kim · [...] · Joo-Young Kim

The explosive arrival of OpenAI's ChatGPT has fueled the global adoption of large language models (LLMs), which consist of billions of pretrained parameters that embody syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency. It is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B models, respectively, which is 2.09× and 1.37× faster than the GPU. Synthesized in a Samsung 4-nm process, the LPU has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33× and 1.32× higher energy efficiency than NVIDIA H100 and L4 servers, respectively.
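As a quick, back-of-the-envelope reading of the reported latencies (this conversion is ours, not from the article), the per-token figures correspond to single-stream generation throughput of:

\[
\frac{1}{1.25\ \text{ms/token}} = 800\ \text{tokens/s (1.3B)}, \qquad
\frac{1}{20.9\ \text{ms/token}} \approx 47.8\ \text{tokens/s (66B)}.
\]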


[Figure 6: LPU implementation and specification.]
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
  • Preprint
  • File available

August 2024 · 92 Reads



SP-PIM: A Super-Pipelined Processing-In-Memory Accelerator With Local Error Prediction for Area/Energy-Efficient On-Device Learning

August 2024 · 19 Reads · 2 Citations · IEEE Journal of Solid-State Circuits

Over the past few years, on-device learning (ODL) has become an integral aspect of the success of edge devices that embrace machine learning (ML) since it plays a crucial role in restoring ML model accuracy when the edge environment changes. However, implementing ODL on battery-limited edge devices poses significant challenges due to the generation of large-size intermediate data during ML training and the frequent data movement between the processor and memory, resulting in substantial power consumption. To address this limitation, certain ML accelerators in edge devices have adopted a processing-in-memory (PIM) paradigm, integrating computing logic into memory. Nevertheless, these accelerators still face hurdles such as long latency caused by the lack of a pipelined approach in the training process, notable power and area overheads related to floating-point arithmetic, and incomplete handling of data sparsity during training. This article presents a high-throughput super-pipelined PIM accelerator, named SP-PIM, designed to overcome the limitations of existing PIM-based ODL accelerators. To this end, SP-PIM implements a holistic multi-level pipelining scheme based on local error prediction (EP), enhancing training speed by 7.31×. In addition, SP-PIM introduces a local EP unit (LEPU), a lightweight circuit that performs accurate EP leveraging power-of-two (PoT) random weights. This strategy significantly reduces power-hungry external memory access (EMA) by 59.09%. Moreover, SP-PIM fully exploits sparsities in both activation and error data during training, facilitated by a highly optimized PIM macro design. Finally, the SP-PIM chip, fabricated using 28-nm CMOS technology, achieves a training speed of 8.81 epochs/s. It occupies a die area of 5.76 mm² and consumes between 6.91 and 433.25 mW at operating frequencies of 20–450 MHz with a supply voltage of 0.56–1.05 V. We demonstrate that it can successfully execute end-to-end ODL for the CIFAR10 and CIFAR100 datasets. Consequently, it achieves state-of-the-art area efficiency (560.6 GFLOPS/mm²) and competitive power efficiency (22.4 TFLOPS/W), marking a 3.95× higher figure-of-merit (area efficiency × power efficiency × capacity) than previous work. Furthermore, we implemented a cycle-level simulator using Python to investigate and validate the scalability of SP-PIM. By conducting architectural experiments in various hardware configurations, we successfully verified that the core computing unit within SP-PIM possesses both scale-up and scale-out capabilities.
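For reference, the figure-of-merit mentioned above combines the three reported axes multiplicatively; the capacity term is not quoted in this abstract, so it is left symbolic here (this restatement is ours, not from the article):

\[
\mathrm{FoM} = \text{area efficiency} \times \text{power efficiency} \times \text{capacity}
= 560.6\ \tfrac{\text{GFLOPS}}{\text{mm}^2} \times 22.4\ \tfrac{\text{TFLOPS}}{\text{W}} \times \text{capacity},
\]

which SP-PIM reports as 3.95× higher than previous work.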







Executing the Source-Radiation Targeting on Aerofoil Trailing Edge Noise by the Finlet-Serration

June 2022 · 33 Reads

Recently, many studies on the topology of the owl's wing have been conducted in order to replicate its low-noise flight characteristics in an aerofoil geometry. Finlets mimic the feather covering of the wing, while serrations mimic the trailing edge fringes. Both are known to reduce trailing edge noise by targeting the turbulent source and the scattering efficiency, respectively. This paper studies the effects of three finlet parameters, (1) finlet spacing, (2) finlet height, and (3) streamwise placement with respect to the trailing edge, on an asymmetrical aerofoil with and without a trailing edge serration, in terms of the far-field noise and the flow fields. The finlets were installed only on the pressure side of the aerofoil at a Reynolds number of 4.5 × 10⁵ based on the chord length of 0.15 m and the flow speed of 45 m/s. The geometrical angle of attack of the aerofoil was fixed at 0°, although at this loading condition lift is already generated due to the asymmetric geometry. Finlets with smaller spanwise spacing are more effective in broadband noise reduction (at f > 5 kHz) than those with larger spacing. If a smaller spanwise spacing is taken as the optimum design point, combining it with a shorter finlet height resembles a Gurney flap geometry that can produce loud vortex shedding noise at low frequency, although broadband noise reduction at high frequency is still achieved. Therefore, the aeroacoustically optimal combination is a small finlet spanwise spacing and a large finlet height. This type of finlet is then chosen to study the effects of streamwise placement with respect to the trailing edge. The results clearly demonstrate that moving it upstream, i.e. away from the trailing edge, is detrimental to the level of broadband noise reduction. In fact, if the finlet is too far away from the trailing edge, the radiated trailing edge noise level becomes even larger than that of the baseline aerofoil. Therefore, the best finlet location is flush with the trailing edge, without any streamwise offset. Finally, repeating the same range of finlet placements, but with an add-on trailing edge serration present, yields interesting results. For the otherwise worst-performing case, where the finlet is placed at the most upstream location, adding a serration can reduce the radiated noise to a level even lower than the serration-alone (no finlet) case. This strongly suggests that the flow manipulated by the upstream finlet creates a favourable condition for the serration downstream. Keeping the same serration while slowly moving the finlet back to the trailing edge continues to improve the level of broadband noise reduction. In fact, in some cases the noise reduction from the combination of finlets and serration is larger than the sum of the individual reductions.
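As a quick consistency check (the kinematic viscosity of air, about 1.5 × 10⁻⁵ m²/s, is our assumption and does not appear in the abstract), the quoted chord length and flow speed reproduce the stated Reynolds number:

\[
Re = \frac{U\,c}{\nu} = \frac{45\ \text{m/s} \times 0.15\ \text{m}}{1.5 \times 10^{-5}\ \text{m}^2/\text{s}} = 4.5 \times 10^{5}.
\]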


Exploration of Systolic-Vector Architecture with Resource Scheduling for Dynamic ML Workloads

June 2022 · 43 Reads

As artificial intelligence (AI) and machine learning (ML) technologies disrupt a wide range of industries, cloud datacenters face ever-increasing demand for inference workloads. However, conventional CPU-based servers cannot handle the excessive computational requirements of deep neural network (DNN) models, while GPU-based servers suffer from high power consumption and operating cost. In this paper, we present a scalable systolic-vector architecture that can cope with dynamically changing DNN workloads in cloud datacenters. We first devise a lightweight DNN model description format, called unified model format (UMF), that enables general model representation and fast decoding in the hardware accelerator. Based on this model format, we propose a heterogeneous architecture that features a load balancer for high-level workload distribution and multiple systolic-vector clusters, in which each cluster consists of a programmable scheduler, throughput-oriented systolic arrays, and function-oriented vector processors. We also propose a heterogeneity-aware scheduling algorithm that enables concurrent execution of multiple DNN workloads while maximizing heterogeneous hardware utilization based on computation and memory access time estimation. Finally, we build an architecture simulation framework based on actual synthesis and place-and-route implementation results and conduct design space exploration for the proposed architecture. As a result, the proposed systolic-vector architecture achieves 10.9× higher throughput and 30.17× higher energy efficiency than a compatible GPU on realistic ML workloads. The proposed heterogeneity-aware scheduling algorithm improves throughput and energy efficiency by 81% and 20%, respectively, compared to standard round-robin scheduling.
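The abstract does not spell out the scheduling algorithm itself; the sketch below is a minimal, hypothetical Python illustration of the general idea of heterogeneity-aware dispatch, assigning each layer to the engine (systolic array or vector processor) with the earliest estimated finish time under a roofline-style cost model. All names, cost models, and numbers are illustrative assumptions, not values from the paper.

from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    peak_ops_per_s: float      # assumed sustained compute throughput (ops/s)
    mem_bw_bytes_per_s: float  # assumed sustained memory bandwidth (bytes/s)
    busy_until: float = 0.0    # time at which the engine becomes free (s)

def estimated_time(engine: Engine, ops: float, bytes_moved: float) -> float:
    # Roofline-style estimate: execution is bound by the slower of compute and memory.
    return max(ops / engine.peak_ops_per_s, bytes_moved / engine.mem_bw_bytes_per_s)

def dispatch(layers, engines):
    # Greedy heterogeneity-aware assignment: for each (ops, bytes_moved) layer
    # descriptor, pick the engine with the earliest estimated finish time.
    schedule = []
    for ops, bytes_moved in layers:
        best = min(engines, key=lambda e: e.busy_until + estimated_time(e, ops, bytes_moved))
        start = best.busy_until
        best.busy_until = start + estimated_time(best, ops, bytes_moved)
        schedule.append((best.name, start, best.busy_until))
    return schedule

if __name__ == "__main__":
    engines = [
        Engine("systolic_array",   peak_ops_per_s=32e12, mem_bw_bytes_per_s=400e9),
        Engine("vector_processor", peak_ops_per_s=2e12,  mem_bw_bytes_per_s=400e9),
    ]
    # Two toy layers: a compute-heavy GEMM-like layer and a memory-bound elementwise layer.
    layers = [(4e9, 8e6), (1e7, 4e7)]
    for name, start, end in dispatch(layers, engines):
        print(f"{name}: {start * 1e6:.1f} us -> {end * 1e6:.1f} us")

In this toy run the compute-heavy layer lands on the systolic array and the memory-bound elementwise layer on the vector processor, which is the kind of placement a heterogeneity-aware scheduler aims for when balancing throughput-oriented and function-oriented units.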


Citations (18)


... Techniques like tiling and pipelining, which break down LLM computations into smaller units for optimal processing, are under study [4]. Methods to optimize data transfer between accelerators and memory, or between accelerators, are also proposed [5]. ...

Reference:

A Review on Proprietary Accelerators for Large Language Models
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
  • Citing Article
  • November 2024

IEEE Micro

... Previous simulators, which run on CPU platforms, are not well suited for debugging the structural characteristics of hardware [18]. In contrast, emulators implement the entire PIM architecture in hardware, making them less flexible than simulators in terms of system configuration adjustments. ...

PRIMO: A Full-Stack Processing-in-DRAM Emulation Framework for Machine Learning Workloads
  • Citing Conference Paper
  • October 2023

... To achieve this, ADOR employs a MAC tree architecture that allows weights read from DRAM to be fed directly into the compute units without first being stored in SRAM. This approach ensures that the data is promptly processed, minimizing latency [23]. ...

HyperAccel Latency Processing Unit (LPU™) Accelerating Hyperscale Models for Generative AI
  • Citing Conference Paper
  • August 2023

... By placing the processing unit in/near the memory, the high latency and power consumption caused by data movement can be significantly reduced, making PIM superior for accelerating data-intensive applications. By leveraging the benefits of PIM, various fully customized SRAM-based designs have been proposed, owing to SRAM's high reliability [11], [12], [13], [14], [15], [16], [17], [18], [19]. However, due to the increasing demand for higher memory capacity with increasing model sizes, PIM designs using cells with higher density, such as embedded DRAM (eDRAM), have recently been proposed [20], [21], [22], [23], [24] to provide better area and power efficiency compared to SRAM-based approaches. ...

SP-PIM: A 22.41TFLOPS/W, 8.81Epochs/Sec Super-Pipelined Processing-In-Memory Accelerator with Local Error Prediction for On-Device Learning
  • Citing Conference Paper
  • June 2023

... Nevertheless, the pipeline computation remained out-of-sync or required complicated hardware to maintain high PE utilization. Besides, the PE utilization of the depthwise (DW) convolution remained low (down to 0.4%) due to the channel-wise hardware parallelism [20], [21]. Shortcut mining and fusion techniques [22], [23] focused on SCB optimization, as SCB accounts for 19% to 38% of total input feature accesses in mainstream RB-based models. ...

A 409.6 GOPS and 204.8 GFLOPS Mixed-Precision Vector Processor System for General-Purpose Machine Learning Acceleration
  • Citing Conference Paper
  • November 2022

... The construction of smart cities has become a major focus of attention in China and around the world [10,11]. The COVID-19 pandemic has led to a profound rethinking of the planning and construction of smart cities as people begin to explore how to support their efficient operations and respond quickly and intelligently to emergencies [12]. In recent years, cities in China have gradually explored various development paths with unique characteristics in their smart city initiatives. ...

How Should the Structure of Smart Cities Change to Predict and Overcome a Pandemic?

... The uncertainty in the difference of Cs and Cd with and without slat is estimated to be 1.68% (Coleman and Steele 2009), which is much smaller than the observed differences in the aerodynamic force coefficients with and without slat. A similar experimental set-up was previously used to measure aerodynamic loads on an aerofoil model with leading-edge undulations (Kim et al. 2022) and on a pitching wing model (Dong et al. 2020 and 2022a), as well as for flow control of an aerofoil with plasma actuators (Dong et al. 2022b). ...

Aerodynamic and Aeroacoustic Optimization of Leading-Edge Undulation of a NACA 65(12)-10 Airfoil

AIAA Journal

... The number of valid pages in the block with the least number of valid pages cannot exceed 3. The figure above shows the average case, where all blocks have the same number of valid pages. Assuming that the write requests from the upper layer are: (1,2,3,4,10,11,7,5,4,12,6,9,7,8,8,10,11,9,10,11,12,10,11,12), it can be seen from the write sequence that the upper layer sends write requests to all logical pages, i.e., all logical pages correspond to a physical page in the physical block. Assuming that the number of valid pages in a block is greater than 3, there must be a valid page in the preceding block that needs to be invalidated, so the number of valid pages in the block with the least number of valid pages becomes 2, which is consistent with our previous analysis. ...

An FTL-Aware Host System Alleviating Severe Long Latency of NAND Flash-based Storage
  • Citing Conference Paper
  • August 2021

... On the other hand, butterfly pea flower powder contains anthocyanin polyphenols, which give the wet noodles a blue color [36]. In addition, the cooking process oxidizes polyphenols, so the whiteness of the noodles decreases [37]. The results of the brightness level measurement (L*) are presented in Table 4. ...

Plant based protein products: Characterization and functionality of dried tofu noodles containing lotus root powder
  • Citing Article
  • August 2021

Food Bioscience