Junsoo Kim’s research while affiliated with Korea Advanced Institute of Science and Technology and other places


Publications (15)


Design of Processing-in-Memory With Triple Computational Path and Sparsity Handling for Energy-Efficient DNN Training
  • Article

June 2022 · 8 Reads · 2 Citations · IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Wontak Han · Jaehoon Heo · Junsoo Kim · [...] · Joo-Young Kim

As machine learning (ML) and artificial intelligence (AI) have become mainstream technologies, many accelerators have been proposed to cope with their computation kernels. However, they access external memory frequently due to the large size of deep neural network models, suffering from the von Neumann bottleneck. Moreover, as privacy concerns become more critical, on-device training is emerging as a solution. On-device training is challenging, however, because it requires far more computation and memory access than inference while operating under a limited power budget. In this paper, we present T-PIM, an energy-efficient processing-in-memory (PIM) architecture supporting end-to-end on-device training. Its macro design includes an 8T-SRAM cell-based PIM block that computes in-memory AND operations and three computational datapaths for end-to-end training. The three computational paths integrate arithmetic units for forward propagation, backward propagation, and gradient calculation with weight update, respectively, allowing the weight data stored in the memory to remain stationary. T-PIM also supports variable bit precision to cover various ML scenarios: fully variable input bit precision with 2-bit, 4-bit, 8-bit, or 16-bit weight precision for forward propagation, and the same input bit precision with 16-bit weight precision for backward propagation. In addition, T-PIM implements sparsity handling schemes that skip computation for zero input data and turn off the arithmetic units for zero weight data, reducing both unnecessary computation and leakage power. Finally, we fabricate the T-PIM chip on a 5.04 mm² die in a 28-nm CMOS logic process. It operates at 50–280 MHz with a supply voltage of 0.75–1.05 V, dissipating 5.25–51.23 mW in inference and 6.10–37.75 mW in training. As a result, it achieves 17.90–161.08 TOPS/W energy efficiency for inference with 1-bit activation and 2-bit weight data, and 0.84–7.59 TOPS/W for training with 8-bit activation/error and 16-bit weight data. In conclusion, T-PIM is the first PIM chip that supports end-to-end training, demonstrating a 2.02× performance improvement over the latest PIM design that only partially supports training.
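
To make the bit-serial, AND-based computation and the two sparsity-handling schemes in this abstract concrete, the following Python sketch is a hedged functional model of how a dot product could be built from in-memory AND operations with input-bit-plane skipping and zero-weight gating. The bit widths, array size, and function names are illustrative assumptions, not details of the fabricated chip.

```python
# Functional sketch (not the authors' RTL): models how a bit-serial PIM macro
# could compute a dot product from in-memory AND operations, with the two
# sparsity ideas from the abstract — skip all-zero input bit-planes and gate
# zero-valued weights. Sizes and bit widths are illustrative assumptions.
import numpy as np

def bit_serial_pim_dot(inputs, weights, in_bits=8, w_bits=16):
    """Unsigned dot product via bitwise AND + shift-and-add accumulation."""
    inputs = inputs.astype(np.int64)
    weights = weights.astype(np.int64)
    active = weights != 0                # gate columns whose weight is zero
    acc = 0
    for i in range(in_bits):             # serialize the input bits
        in_plane = (inputs >> i) & 1
        if not in_plane.any():           # zero-skipping: nothing to do for this bit-plane
            continue
        for j in range(w_bits):          # weight bits stay stationary in the array
            w_plane = (weights >> j) & 1
            partial = np.sum((in_plane & w_plane)[active])  # in-memory AND + count
            acc += partial << (i + j)    # shift-and-add recovers the full product
    return acc

x = np.random.randint(0, 2**8, size=256)          # 8-bit activations
w = np.random.randint(0, 2**4, size=256)          # 4-bit weights, one supported mode
w[np.random.rand(256) < 0.5] = 0                  # sparse weights exercise the gating path
assert bit_serial_pim_dot(x, w, in_bits=8, w_bits=4) == int(np.dot(x, w))
```

Skipping an all-zero input bit-plane removes whole compute cycles, while the weight gate stands in for switching off arithmetic units to cut leakage, mirroring the two mechanisms described above.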




T-PIM: An Energy-Efficient Processing-in-Memory Accelerator for End-to-End On-Device Training

January 2022 · 10 Reads · 21 Citations · IEEE Journal of Solid-State Circuits

Recently, on-device training has become crucial for the success of edge intelligence. However, frequent data movement between computing units and memory during training has been a major problem for battery-powered edge devices. Processing-in-memory (PIM) is a novel computing paradigm that merges computing logic into memory and can address the data movement problem with excellent power efficiency. However, previous PIM accelerators cannot support the entire training process on chip due to its computational complexity. This article presents a PIM accelerator for end-to-end on-device training (T-PIM), the first PIM realization that enables end-to-end on-device training as well as high-speed inference. Its full-custom PIM macro contains 8T-SRAM cells that perform energy-efficient in-cell AND operations, and its bit-serial computation logic enables fully variable bit precision for input data. The macro supports various data mapping methods and computational paths for both fully connected and convolutional layers in order to handle the complex training process. An efficient tiling scheme is also proposed to enable T-PIM to compute deep neural networks of any size with the implemented hardware. In addition, configurable arithmetic units in the forward propagation path let T-PIM handle power-of-two bit precisions for weight data, enabling a significant performance boost during inference. Moreover, T-PIM efficiently handles sparsity in both operands by skipping the computation of zeros in the input data and by gating off computing units when the weight data are zero. Finally, we fabricate the T-PIM chip in 28-nm CMOS technology, occupying a die area of 5.04 mm² and including five T-PIM cores. It dissipates 5.25–51.23 mW at a 50–280 MHz operating frequency with a 0.75–1.05-V supply voltage. We successfully demonstrate that T-PIM can run the end-to-end training of the VGG16 model on the CIFAR10 and CIFAR100 datasets, achieving 0.13–161.08- and 0.25–7.59-TOPS/W power efficiency during inference and training, respectively. The result shows that T-PIM is 2.02× more energy-efficient than the state-of-the-art PIM chip that supports only backward propagation rather than the whole training process. Furthermore, we conduct an architectural experiment using a cycle-level simulator based on actual measurement results, which suggests that the T-PIM architecture is scalable and that its scaled-up version provides up to 203.26× higher power efficiency than a comparable GPU.
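
As a rough illustration of the tiling scheme mentioned in this abstract, the sketch below shows how a fully connected layer larger than one PIM macro could be split into macro-sized tiles, with partial sums accumulated outside the array. The macro dimensions, function name, and NumPy modeling are assumptions for illustration, not the chip's actual parameters.

```python
# Illustrative sketch of the kind of tiling the article describes: a layer
# larger than one PIM macro is split into macro-sized tiles, and partial sums
# are accumulated outside the array. Tile sizes below are assumed, not the
# chip's actual macro dimensions.
import numpy as np

MACRO_ROWS, MACRO_COLS = 128, 64          # assumed PIM array capacity

def tiled_fc_layer(x, W):
    """Computes y = W @ x tile by tile on a fixed-size macro."""
    out_dim, in_dim = W.shape
    y = np.zeros(out_dim, dtype=np.int64)
    for r0 in range(0, out_dim, MACRO_COLS):        # output-channel tiles
        for c0 in range(0, in_dim, MACRO_ROWS):     # input-channel tiles
            tile = W[r0:r0 + MACRO_COLS, c0:c0 + MACRO_ROWS]
            x_tile = x[c0:c0 + MACRO_ROWS]
            # One macro invocation: weights stay resident, inputs stream in.
            y[r0:r0 + MACRO_COLS] += tile @ x_tile  # partial-sum accumulation
    return y

x = np.random.randint(-8, 8, size=500)
W = np.random.randint(-8, 8, size=(300, 500))       # layer larger than one macro
assert np.array_equal(tiled_fc_layer(x, W), W @ x)
```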


FIGURE 3: Scenario of the proposed federated XAI computing of onboard-ground station in satellite image analysis.
FIGURE 4: The background bias problem of conventional visual explanation based on the top convolution layer. The trained DL model infers the ground-truth label by focusing on the background context around the target objects.
FIGURE 5: The spatial information loss caused by spatial pooling operations in a pyramid network. The highlighted region shows valuable context that is missed in the visual explanation of the top convolution layer.
FIGURE 6: The overall architecture of the proposed onboard-ground station federated XAI computing with cascading pyramid attention and weakly supervised refinement.
FIGURE 7: The proposed CPANet for generating the global context to explain the satellite images. It consists of a perception branch and a cascading attention branch that transmit useful context from local explanations to the global explanation via the bottom-up pathway. Detailed formulations are shown in Section III-B.


Federated Onboard-Ground Station Computing With Weakly Supervised Cascading Pyramid Attention Network for Satellite Image Analysis
  • Article
  • Full-text available

January 2022 · 70 Reads · 8 Citations · IEEE Access

With advances in NanoSats (CubeSats) and high-resolution sensors, the amount of raw data to be analyzed by human supervisors has been increasing explosively in satellite image analysis. To reduce the raw data, onboard AI processing with low-power COTS (Commercial Off-The-Shelf) hardware has emerged from a real satellite mission. It filters out useless data (e.g., cloudy images) that is of no value to supervisors, achieving efficient satellite-ground station communication. For complex object recognition, however, additional explanation is required to make the AI prediction reliable because of its limited accuracy. Although various eXplainable AI (XAI) methods for providing human-interpretable explanations have been studied, the pyramid architecture in a deep network leads to the background bias problem, in which the visual explanation focuses only on the background context around the object. Missing small objects in a tiny region leads to poor explainability even when the AI model predicts the correct object class. To resolve these problems, we propose a novel federated onboard-ground station (FOGS) computing scheme with a Cascading Pyramid Attention Network (CPANet) for reliable onboard XAI in object recognition. We present an XAI architecture with a cascading attention mechanism that mitigates the background bias in onboard processing. By exploiting the localization ability of pyramid feature blocks, we can extract high-quality visual explanations covering both the semantic and the small-scale contexts of an object. To enhance the visual explainability of complex satellite images, we also describe a novel computing federation with the ground station and supervisors. In the ground station, active learning-based sample selection and an attention refinement scheme with a simple feedback method are conducted to achieve robust explanations and an efficient supervisor annotation cost simultaneously. Experiments on various datasets show that the proposed system improves object recognition accuracy and produces accurate visual explanations that detect the small contexts of objects even in peripheral regions. Furthermore, our attention refinement mechanism demonstrates that inconsistent explanations can be efficiently resolved with only very simple selection-based feedback.
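
To make the cascading fusion idea above more concrete, here is a hedged Python/PyTorch sketch of how class-activation-style maps from several pyramid levels could be upsampled and blended so that fine, small-object context contributes to the global explanation. The function name, fusion weights, and map sizes are illustrative assumptions; this is a simplified stand-in, not the published CPANet.

```python
# Hedged sketch: fuse attention maps from multiple pyramid levels into one
# global explanation by upsampling each level and blending it into a running
# fusion, so that fine levels that localize small objects are not lost.
import torch
import torch.nn.functional as F

def cascade_fuse(level_maps, out_size=(224, 224)):
    """level_maps: list of [H_i, W_i] attention maps, ordered coarsest to finest."""
    fused = None
    for m in level_maps:
        m = m.unsqueeze(0).unsqueeze(0)                   # -> [1, 1, H, W]
        m = F.interpolate(m, size=out_size, mode="bilinear", align_corners=False)
        m = (m - m.min()) / (m.max() - m.min() + 1e-6)    # normalize each level
        fused = m if fused is None else fused * 0.5 + m * 0.5   # cascading blend
    return fused.squeeze()

maps = [torch.rand(7, 7), torch.rand(28, 28), torch.rand(56, 56)]
explanation = cascade_fuse(maps)          # [224, 224] global visual explanation
print(explanation.shape)
```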


Citations (9)


... Techniques like tiling and pipelining, which break down LLM computations into smaller units for optimal processing, are under study [4]. Methods to optimize data transfer between accelerators and memory, or between accelerators, are also proposed [5]. ...

Reference:

A Review on Proprietary Accelerators for Large Language Models
LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
  • Citing Article
  • November 2024

IEEE Micro

... In highly dynamic scenarios such as real-time network slicing and vehicular network slicing, the time and energy efficiency of XAI methods is critical. Lightweight or simplified interpretability models represent one of the future research directions, ensuring efficient operation on resource-constrained edge nodes [232], [233]. Techniques such as model compression, quantization, and the use of specialized hardware accelerators can significantly reduce the energy consumption of XAI methods [234], [235]. Furthermore, exploring decentralized or distributed XAI approaches could further enhance energy efficiency and reduce latency [236]. ...

EPU: An Energy-Efficient Explainable AI Accelerator With Sparsity-Free Computation and Heat Map Compression/Pruning
  • Citing Article
  • March 2024

IEEE Journal of Solid-State Circuits

... To achieve this, ADOR employs a MAC tree architecture that allows weights read from DRAM to be fed directly into the compute units without first being stored in SRAM. This approach ensures that the data is promptly processed, minimizing latency [23]. ...

HyperAccel Latency Processing Unit (LPU™) Accelerating Hyperscale Models for Generative AI
  • Citing Conference Paper
  • August 2023

... Since edge devices have more stringent energy constraints compared to centralized data processing methods, energy-efficient data processing is a major challenge [4,5]. In response, various technological efforts have been made to enhance the energy efficiency of edge computing, one of which is the integration of processing-in-memory (PIM) architecture with edge devices [6,7]. PIM performs data processing inside or near the memory array, reducing latency due to data movement and addressing the key design goal of high energy efficiency in edge devices for memory-centric tasks such as AI applications [8]. ...

T-PIM: An Energy-Efficient Processing-in-Memory Accelerator for End-to-End On-Device Training
  • Citing Article
  • January 2022

IEEE Journal of Solid-State Circuits

... Other studies also evaluate the localization ability of CAM methods by turning the attributions into segmentation masks and comparing the IoU or classification accuracy [132,135,155]. Additionally, [156] compare attention networks and CAM variants on the metrics max-sensitivity and average % drop/increase in confidence. Regarding other xAI approaches, the attention weights are evaluated in [144] by inspecting drops in the accuracy for crop mapping when the transformer model is trained on a subset of dates with the highest attention values. ...

Federated Onboard-Ground Station Computing With Weakly Supervised Cascading Pyramid Attention Network for Satellite Image Analysis

IEEE Access

... The HPTA [25] accelerator provided support for several transformer variants without needing FPGA reconfiguration. DFX [26] presented an end-to-end design for Generative Pretrained Transformer (GPT) model inference. [27] proposed efficient algorithm-hardware co-designs with sparse attention and dynamic pipelining. ...

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
  • Citing Conference Paper
  • October 2022

... Networked and heterogeneous FPGA clusters [32]-[34] are proposed for cloud and edge computing. Existing works on scalable FPGA architectures [11]-[14], [35]-[38] primarily focus on accelerating applications, running emulation on multiple FPGAs, and comparing the performance and power with other accelerators such as GPUs. Some multi-FPGA systems [15], [16] connect FPGAs and host CPUs into a hybrid network. ...

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
  • Citing Conference Paper
  • August 2022

... Given the equality of each port, maintaining fairness in data transfer between ports becomes paramount, and the doubled inter-die latency adds complexity to timing convergence, particularly for logic circuits spanning multiple dies. The interconnections in the multi-die architecture are thus susceptible to routing congestion [24,25]. ...

OpenMDS: An Open-Source Shell Generation Framework for High-Performance Design on Xilinx Multi-Die FPGAs
  • Citing Article
  • July 2022

IEEE Computer Architecture Letters

... By placing the processing unit in/near the memory, the high latency and power consumption caused by data movement can be significantly reduced, making PIM superior for accelerating data-intensive applications. Leveraging the benefits of PIM, various fully customized designs based on SRAM have been proposed for its high reliability [11], [12], [13], [14], [15], [16], [17], [18], [19]. However, due to the increasing demand for higher memory capacity with growing model sizes, PIM designs using higher-density cells, such as embedded DRAM (eDRAM), have recently been proposed [20], [21], [22], [23], [24] to provide better area and power efficiency than SRAM-based approaches. ...

T-PIM: A 2.21-to-161.08TOPS/W Processing-In-Memory Accelerator for End-to-End On-Device Training
  • Citing Conference Paper
  • April 2022