About
40
Publications
7,349
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
711
Citations
Citations since 2017
Introduction
Skills and Expertise
Publications
Publications (40)
When training deep neural networks (DNNs), expensive floating point arithmetic units are used in GPUs or custom neural processing units (NPUs). To reduce the burden of floating point arithmetic, community has started exploring the use of more efficient data representations, e.g., block floating point (BFP). The BFP format allows a group of values t...
When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupied most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of the convolution or linear layers. In recent mobile-friendly DNNs, however, the relative number of operations...
Embedding layers, which are widely used in various deep learning (DL) applications, are very large in size and are increasing. We propose scalable embedding memory system (SEMS) to deal with the inference of DL applications with a large embedding layer. SEMS is built using scalable embedding memory (SEM) modules, which include FPGA for acceleration...
In this brief, we present a novel design methodology of cost-effective approximate radix-4 Booth multipliers, which can significantly reduce the power consumption of error-resilient signal processing tasks. In contrast that the prior studies only focus on the approximation of either the partial product generation with encoders or the partial produc...
This paper presents a 46 nF/10 MΩ-range, digital-intensive, reconfigurable RC-to-digital converter (R²CDC) that can readout multiple C and R sensors in a time-interleaved fashion. Ratio-metric conversion using swing-boosted period-modulation (SB-PM) front-end by the R²CDC results in 114 aFrms/0.37 Ωrms resolutions and a worst-case temperature-drift...
In this paper, we present Zero-data Based Repeated bit flip Attack (ZeBRA) that precisely destroys deep neural networks (DNNs) by synthesizing its own attack datasets. Many prior works on adversarial weight attack require not only the weight parameters, but also the training or test dataset in searching vulnerable bits to be attacked. We propose to...
In this paper, we present a novel approximate computing scheme suitable for realizing the energy-efficient multiply-accumulate (MAC) processing. In contrast to the prior works that suffer from the error accumulation limiting the approximate range, we utilize different approximate multipliers in an interleaved way to compensate errors in the opposit...
This article discusses the high-performance near-memory neural network (NN) accelerator architecture utilizing the logic die in three-dimensional (3D) High Bandwidth Memory– (HBM) like memory. As most of the previously reported 3D memory-based near-memory NN accelerator designs used the Hybrid Memory Cube (HMC) memory, we first focus on identifying...
We propose a HW-SW co-optimization technique to perform energy-efficient spGEMM operations for deep learning. First, we present an automated pruning algorithm, named AutoRelax, that allows some level of relaxation to achieve higher compression ratio. As the benefit of the proposed pruning algorithm may be limited by the sparsity level of a given we...
In this paper, we present deep partitioned training to accelerate computations involved in training DNN models. This is the first work that partitions a DNN model across storage devices, an NPU and a host CPU forming a unified compute node for training workloads. To validate the benefit of using the proposed system during DNN training, a trace-base...
This paper presents an approximate computing method of long short-term memory (LSTM) operations for energy-efficient end-to-end speech recognition. We newly introduce the concept of similarity score, which can measure how much the inputs of two adjacent LSTM cells are similar to each other. Then, we disable the highly-similar LSTM operations and di...
Cache side-channel attack is one of the critical security threats to modern computing systems. As a representative cache side-channel attack, Flush+Reload attack allows an attacker to steal confidential information (e.g., private encryption key) by monitoring a victim's cache access patterns while generating the confidential values. Meanwhile, for...
In this paper, we propose a novel storage architecture, called an Inference-Enabled SSD (IESSD), which employs FPGA-based DNN inference accelerators inside an SSD. IESSD is capable of performing DNN operations inside an SSD, avoiding frequent data movements between application servers and data storage. This boosts up analytics performance of DNN ap...
The Long Short-Term Memory (LSTM) is a widely used neural network model for dealing with time-varying data. To reduce the memory requirement, pruning is often applied to the weight matrix of the LSTM, which makes the matrix sparse. In this paper, we present a new sparse matrix format, named Rearranged Compressed Sparse Column (RCSC), to maximize th...
In this paper, we present an integrated solution to design a high-performance LSTM accelerator. We propose a fast and flexible hardware architecture, named Peregrine, supported by a stack of innovations from algorithm to hardware design. Peregrine first minimizes the memory footprint by limiting the synaptic connection patterns within the LSTM netw...
Memory performance is a key bottleneck for deep learning systems. Binarization of both activations and weights is one promising approach that can best scale to realize the highest energy efficient system using the lowest possible precision. In this paper, we utilize and analyze the binarized neural network in doing human detection on infrared image...
The versatility of cellular nonlinear network (CNN) made it attractive in various scientific and engineering applications. The strength of solving problems using CNN comes from the fact that it deals with nonlinear dynamic systems. In this paper, feasible criteria for adaptive hardware approximation, automated method to switch between high and low...
The fast and energy-efficient simulation of dynamical systems defined by coupled ordinary/partial differential equations has emerged as an important problem. The accelerated simulation of coupled ODE/PDE is critical for analysis of physical systems as well as computing with dynamical systems. This paper presents a fast and programmable accelerator...
The fast and energy-efficient simulation of dynamical systems defined by coupled ordinary/partial differential equations has emerged as an important problem. The accelerated simulation of coupled ODE/PDE is critical for analysis of physical systems as well as computing with dynamical systems. This paper presents a fast and programmable accelerator...
This paper proposes that approximation by reducing bit-precision and using inexact multiplier can save power consumption of digital multilayer perceptron accelerator during the classification of MNIST (inference) with negligible accuracy degradation. Based on the error sensitivity precomputed during the training, synaptic weights with less sensitiv...
3D integration provides opportunities to design high-bandwidth and low-power CMOS image sensors (CIS) [1–4]. The 3D stacking of pixel tier, peripheral tier, memory tier(s), and compute tier(s) enables high degree of parallel processing. Also, each tier can be designed in different technology nodes (heterogeneous integration) to further improve powe...
This paper presents methodology of feedback-controlled dynamic approximation to enable energy-accuracy trade-off in digital recurrent neural network (RNN). A low-power digital RNN engine is presented that employs the proposed dynamic approximation. The on-chip feedback controller is realized by utilizing hysteretic or proportional controller. The d...
This paper presents a programmable and scalable digital neuromorphic architecture based on 3D high-density memory integrated with logic tier for efficient neural computing. The proposed architecture consists of clusters of processing engines, connected by 2D mesh network as a processing tier, which is integrated in 3D with multiple tiers of DRAM. T...
This paper studies the opportunities of energy-accuracy tradeoff in cellular neural network (CNN). Algorithmic characteristics of CNN is coupled with hardware-induced error distribution of a digital CNN cell to evaluate energy-accuracy tradeoff for simple image processing tasks as well as a complex application. The analysis shows that errors modula...
This paper experimentally demonstrates a methodology for proactive estimation of spatiotemporal variations in junction temperature of a silicon chip using multi-input multi-output (MIMO) thermal filters. The presented approach performs on-chip measurements to estimate the relations between power and temperature variations in the frequency domain to...
A methodology is presented for on-line and realtime estimation of transient variations in temperature and average power of a chip after fabrication and packaging. Measurements from a 130nm CMOS test chip demonstrate the accuracy of the proposed approach.
Thermal analysis involves solving a differential equation, thus inherently takes a long time. Any thermal optimization techniques require a compact model that can be evaluated fast while accuracy is not sacrificed too much. Several models have been proposed, but their accuracy and runtime have not been fully understood and compared. Three models, n...
An edge-triggered flip-flop is a de facto standard sequencing element in ASIC designs. As sequencing elements occupy increasing portion of timing and power, it is necessary to explore other types of elements. We identify pulsed-latch and dual edge-triggered flip-flop as two promising candidates. The challenges when they are employed for conventiona...
A floorplanning has a potential to reduce chip temperature due to the conductive nature of heat. If floorplan optimization, which is usually based on simulated annealing, is employed to reduce temperature, its evaluation should be done extremely fast with high accuracy. A new thermal index, named thermal signature, is proposed. It approximates the...