Article

CRIMP: Compact & Reliable DNN Inference on In-Memory Processing via Crossbar-Aligned Compression and Non-ideality Adaptation

Abstract

Crossbar-based In-Memory Processing (IMP) accelerators have been widely adopted to achieve high-speed and low-power computing, especially for deep neural network (DNN) models with numerous weights and high computational complexity. However, floating-point (FP) arithmetic is not compatible with crossbar architectures. In addition, the redundant weights of current DNN models occupy too many crossbars, limiting the efficiency of crossbar accelerators. Meanwhile, due to the inherent non-ideal behavior of crossbar devices, such as write variations, pre-trained DNN models suffer from accuracy degradation when they are deployed on a crossbar-based IMP accelerator for inference. Although some approaches have been proposed to address these issues, they often fail to consider the interaction among them and introduce significant hardware overhead for solving each issue. To deploy complex models on IMP accelerators, we should compact the model and mitigate the influence of device non-ideal behaviors without introducing significant overhead from each technique. In this paper, we first propose to reuse the bit-shift units in crossbars to approximately multiply the scaling factors of our quantization scheme, avoiding the use of FP processors. Second, we propose kernel-group pruning and crossbar pruning to eliminate the hardware units for data aligning, and we design a zerorize-recover training process for our pruning method to achieve higher accuracy. Third, we adopt runtime-aware non-ideality adaptation with a self-compensation scheme that exploits the features of crossbars to relieve the impact of non-ideality. Finally, we integrate these three optimization procedures into one training process to form a comprehensive learning framework for co-optimization, which achieves higher accuracy. The experimental results indicate that our comprehensive learning framework obtains significant improvements over the original model when inferring on the crossbar-based IMP accelerator, reducing computing power and computing area by an average of 100.02× and 17.37×, respectively. Furthermore, we obtain fully integer-only, pruned, and reliable VGG-16 and ResNet-56 models for the CIFAR-10 dataset on IMP accelerators, with accuracy drops of only 2.19% and 1.26%, respectively, without any hardware overhead.
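To make the bit-shift idea in the abstract concrete, the sketch below shows one common way such a scheme can work: an arbitrary floating-point scale factor is greedily approximated by a short sum of signed powers of two, so multiplying an integer value by the scale reduces to shifts and adds. This is only an illustrative Python sketch under that assumption; the function names and the two-term budget are hypothetical and not taken from the paper.

import numpy as np

def power_of_two_approx(scale, num_terms=2):
    # Greedily approximate `scale` as a signed sum of powers of two so that
    # multiplying by `scale` can be realized with shift-and-add hardware.
    residual = scale
    terms = []
    for _ in range(num_terms):
        if residual == 0:
            break
        sign = 1 if residual > 0 else -1
        exp = int(np.round(np.log2(abs(residual))))
        terms.append((sign, exp))
        residual -= sign * (2.0 ** exp)
    approx = sum(s * 2.0 ** e for s, e in terms)
    return terms, approx

def shift_multiply(x, terms):
    # Multiply an integer x by the approximated scale using only shifts and adds.
    acc = 0
    for sign, exp in terms:
        shifted = x << exp if exp >= 0 else x >> -exp
        acc += sign * shifted
    return acc

terms, approx = power_of_two_approx(0.093)       # -> [(1, -3), (-1, -5)], approx = 0.09375
print(shift_multiply(1000, terms))               # 94, versus the exact 1000 * 0.093 = 93

With the example scale 0.093, the two-term approximation is 2^-3 - 2^-5 = 0.09375, so the shift-based product of 1000 is 94 against the exact 93, which conveys why a few shift terms can stand in for an FP multiplier.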

... Furthermore, [412] investigates the efficiency bottleneck of INT8 quantization and introduces hardware-friendly search space design to enable efficient INT8 quantization. More recently, [450,451] explore INT8 quantization to compress redundant CNNs for efficient in-memory computing infrastructures. In addition to quantizing CNNs, [413] turns back to transformers and leverages INT8 quantization to quantize computation-intensive transformers in order to boost the inference efficiency for general NLP tasks. ...
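For readers unfamiliar with the INT8 quantization mentioned in the citing work above, a generic symmetric per-tensor INT8 scheme (an assumption chosen for brevity, not the exact scheme of the cited papers) looks as follows in Python.

import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor INT8 quantization: w ~= scale * q with q in [-127, 127].
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
print(np.max(np.abs(w - dequantize_int8(q, scale))))   # rounding error, at most scale / 2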
Article
Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs upon real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate the above computational gap and enable ubiquitous embedded intelligence, we, in this survey, focus on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. Furthermore, we also envision promising future directions and trends, which have the potential to deliver more ubiquitous embedded intelligence. We believe this survey has its merits and can shed light on future research, helping researchers quickly and smoothly get started in this emerging field.
Article
Full-text available
Two of the most popular neural network frameworks, Theano and TensorFlow, are compared in this study for how well they perform on a given problem. The MNIST database is used for this specific problem, the recognition of handwritten digits from one to nine. This database is a good basis for comparing the frameworks because it is the subject of active research and has produced excellent results. However, in order to be trained and deliver accurate results, neural networks need a sizeable amount of sample data, as will be covered in more detail later; big data practitioners frequently encounter problems of this nature. As the project description implies, we do not just present a standard comparison; instead, we compare the networks' performance in a Big Data environment using distributed computing. The Fashion MNIST (FMNIST) database and CIFAR-10 are also tested (using the same neural network design), extending the scope of the comparison beyond MNIST. The same code with the same structure is reused thanks to the higher-level library Keras, which runs on top of the aforementioned backends (in our case, Theano or TensorFlow). There has been a surge in open-source parallel GPU implementation research and development as a result of the high computational cost of training CNNs on large datasets, yet few studies have assessed the performance characteristics of those implementations. In this study, we compare these implementations carefully across a wide range of parameter configurations, investigate potential performance bottlenecks, and pinpoint a number of areas that could benefit from further fine-tuning.
Article
Full-text available
Resistive random-access memory (RRAM) is a promising technology for energy-efficient neuromorphic accelerators. However, when a pre-trained deep neural network (DNN) model is programmed to an RRAM array for inference, the model suffers from accuracy degradation due to RRAM non-idealities such as device variations, quantization error, and stuck-at faults. Previous solutions involving multiple read-verify-write (R-V-W) operations on the RRAM cells require cell-by-cell compensation and, thus, an excessive amount of processing time. In this paper, we propose a joint algorithm-design solution to mitigate the accuracy degradation: 1) We first leverage Knowledge Distillation (KD), where the model is trained with the RRAM non-idealities to increase the robustness of the model under device variations. 2) Furthermore, we propose random sparse adaptation (RSA), which integrates a small on-chip memory with the main RRAM array for post-mapping adaptation. Only the on-chip memory is updated to recover the inference accuracy. The joint algorithm-design solution achieves the state-of-the-art accuracy of 99.41% for MNIST (LeNet-5) and 91.86% for CIFAR-10 (VGG-16) with up to 5% of the parameters as overhead, while providing a 15-150X speedup compared to R-V-W.
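A common building block behind variation-aware training of the kind described above is to expose the network to simulated device noise during the forward pass. The PyTorch sketch below is a hedged illustration under the assumption of log-normal multiplicative conductance noise; it is not the paper's KD or RSA implementation, and the layer name and sigma value are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Linear):
    # Linear layer that simulates RRAM device variation during training by
    # applying log-normal multiplicative noise to the weights (assumed noise model).
    def __init__(self, in_features, out_features, sigma=0.1, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.sigma = sigma

    def forward(self, x):
        if self.training and self.sigma > 0:
            noise = torch.exp(self.sigma * torch.randn_like(self.weight))
            return F.linear(x, self.weight * noise, self.bias)
        return F.linear(x, self.weight, self.bias)

# Drop-in usage: replace nn.Linear(128, 10) with NoisyLinear(128, 10, sigma=0.1)
# so the model learns weights that stay accurate under programmed-conductance noise.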
Article
Full-text available
Crossbar architecture has been widely adopted in neural network accelerators due to its efficient implementation of vector-matrix multiplication operations. However, in the case of convolutional neural networks (CNNs), the efficiency is compromised dramatically because of the large amount of data reuse. Although some mapping methods have been designed to achieve a balance between execution throughput and resource overhead, the resource consumption cost is still huge while maintaining the throughput. Network pruning is a promising and widely studied method to shrink the model size, whereas prior work on CNN compression rarely considered the crossbar architecture and the corresponding mapping method, and thus cannot be directly utilized by crossbar-based neural network accelerators. This paper proposes a crossbar-aware pruning framework based on a formulated L0-norm constrained optimization problem. Specifically, we design an L0-norm constrained gradient descent with relaxant probabilistic projection to solve this problem. Two types of sparsity are successfully achieved: 1) intuitive crossbar-grain sparsity and 2) column-grain sparsity with output recombination, based on which we further propose an input feature map reorder method to improve the model accuracy. We evaluate our crossbar-aware pruning framework on the medium-scale CIFAR-10 dataset and the large-scale ImageNet dataset with VGG and ResNet models. Our method is able to reduce the crossbar overhead by 44%–72% with insignificant accuracy degradation. This paper significantly reduces the resource overhead and the related energy cost, and provides a new co-design solution for mapping CNNs onto various crossbar devices with much better efficiency.
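The crossbar-grain sparsity discussed above can be pictured as tiling a weight matrix into crossbar-sized blocks and removing whole blocks, so that entire crossbars never need to be mapped. The sketch below uses a simple magnitude criterion rather than the paper's L0-constrained optimization, and the 128x128 crossbar size and keep ratio are assumptions for illustration.

import numpy as np

def crossbar_grain_prune(w, xbar_rows=128, xbar_cols=128, keep_ratio=0.5):
    # Zero out whole crossbar-sized tiles of `w`, keeping the tiles with the
    # largest Frobenius norms; a fully zero tile never needs a physical crossbar.
    pruned = w.copy()
    tiles = []
    for r in range(0, w.shape[0], xbar_rows):
        for c in range(0, w.shape[1], xbar_cols):
            block = w[r:r + xbar_rows, c:c + xbar_cols]
            tiles.append((np.linalg.norm(block), r, c))
    tiles.sort(key=lambda t: t[0])                       # weakest tiles first
    n_prune = int(len(tiles) * (1.0 - keep_ratio))
    for _, r, c in tiles[:n_prune]:
        pruned[r:r + xbar_rows, c:c + xbar_cols] = 0.0
    return pruned

w = np.random.randn(512, 512)
print(np.mean(crossbar_grain_prune(w) == 0.0))           # ~0.5: half of the tiles removed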
Article
Full-text available
Human activity recognition (HAR) is a classification task for recognizing human movements. Methods of HAR are of great interest as they have become tools for measuring occurrences and durations of human actions, which are the basis of smart assistive technologies and manual process analysis. Recently, deep neural networks have been deployed for HAR in the context of activities of daily living using multichannel time series. These time series are acquired from body-worn devices, which are composed of different types of sensors. The deep architectures process these measurements to find basic and complex features in human corporal movements and to classify them into a set of human actions. As the devices are worn at different parts of the human body, we propose a novel deep neural network for HAR. This network handles sequence measurements from different body-worn devices separately. An evaluation of the architecture is performed on three datasets, the Opportunity, PAMAP2, and an industrial dataset, outperforming the state-of-the-art. In addition, different network configurations are also evaluated. We find that applying convolutions per sensor channel and per body-worn device improves the capabilities of convolutional neural networks (CNNs).
Conference Paper
Full-text available
Memristor-based synaptic networks have been widely investigated and applied to neuromorphic computing systems for their fast computation and low design cost. As memristors continue to mature and achieve higher density, bit failures within crossbar arrays can become a critical issue, which can degrade the computation accuracy significantly. In this work, we propose a defect-rescuing design to restore the computation accuracy. In our proposed design, significant weights in a specified network are first identified, and retraining and remapping algorithms are then applied. For a two-layer neural network with 92.64% classification accuracy on MNIST digit recognition, our evaluation based on real device testing shows that our design can recover almost its full performance when 20% random defects are present.
Article
Full-text available
A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks. This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated for each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8×, 5.5×, and 7.5× in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.
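At a behavioral level, the analog dot-product engine described above can be modeled as a vector-matrix product between read voltages and a discretized conductance matrix. The sketch below is a deliberately coarse Python model for intuition only: it assumes 16 conductance levels and ignores ADCs, bit-slicing, DAC resolution, and wire parasitics, all of which ISAAC's design space exploration accounts for.

import numpy as np

def weights_to_conductances(w, g_min=1e-6, g_max=1e-4, levels=16):
    # Map weights onto a small number of discrete conductance states in [g_min, g_max].
    norm = (w - w.min()) / (w.max() - w.min() + 1e-12)
    states = np.round(norm * (levels - 1)) / (levels - 1)
    return g_min + states * (g_max - g_min)

def crossbar_mvm(voltages, conductances):
    # Ideal crossbar: each bit-line current is the dot product of the input
    # voltages with that column's conductances (Kirchhoff's current law).
    return voltages @ conductances

w = np.random.randn(128, 64)
g = weights_to_conductances(w)
v = 0.2 * np.random.rand(128)      # assumed read voltages
print(crossbar_mvm(v, g).shape)    # 64 bit-line currents, one per output neuron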
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
Future processors will likely have large on-chip caches, with a possibility of dedicating an entire die for on-chip storage in a 3D stacked design. With the ever-growing disparity between transistor and wire delay, the properties of such large caches will primarily depend on the characteristics of the interconnection networks that connect various sub-modules of a cache. CACTI 6.0 is a significantly enhanced version of the tool that primarily focuses on interconnect design for large caches. In addition to strengthening the existing analytical model of the tool for dominant cache components, CACTI 6.0 includes two major extensions over earlier versions: first, the ability to model Non-Uniform Cache Access (NUCA), and second, the ability to model different types of wires, such as RC-based wires with different power, delay, and area characteristics, and differential low-swing buses. This report details the analytical model assumed for the newly added modules along with their validation analysis.
Article
Full-text available
Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
Article
Resistive random-access memory (ReRAM) based crossbar arrays (RCAs) are a promising platform to accelerate vector-matrix multiplication in deep neural networks (DNNs). There are, however, some practical issues, especially device variation, that hinder the versatile development of ReRAM in neural computing systems. Device variation includes device-to-device variation (DDV) and cycle-to-cycle variation (CCV), which deviate the device resistances in the RCA from their target states. Such resistance deviation seriously degrades the inference accuracy of DNNs. To address this issue, we propose a software-hardware compensation solution that includes compensation training based on scale factors (CTSF) and variation-aware compensation training based on scale factors (VACTSF) to protect the ReRAM-based DNN accelerator against device variation. The scale factors in CTSF can be flexibly set to reduce the accuracy loss due to device variation once the weights programmed into the RCA are determined. To effectively handle CCV, the scale factors are introduced into the training process to obtain variation-tolerant weights by leveraging the inherent self-healing ability of DNNs. Simulation results based on our method confirm that the accuracy losses due to device variation on LeNet-5, ResNet, and VGG-16 with different datasets are less than 5% under large device variation with CTSF. More robust weights for conquering CCV are also obtained by VACTSF. The simulation results show that our method is competitive in comparison with other variation-tolerant methods.
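The scale-factor compensation idea above can be approximated, in its simplest form, as computing one multiplicative factor per output column so that a varied crossbar's outputs re-align with the ideal ones on a calibration batch. The following sketch is an assumed least-squares variant for illustration and not the CTSF/VACTSF training procedure itself.

import numpy as np

def column_scale_factors(w_ideal, w_varied, x_calib):
    # Least-squares factor per output column: argmin_s || X @ w_ideal - s * (X @ w_varied) ||^2.
    y_ideal = x_calib @ w_ideal
    y_varied = x_calib @ w_varied
    num = np.sum(y_ideal * y_varied, axis=0)
    den = np.sum(y_varied * y_varied, axis=0) + 1e-12
    return num / den

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 64))
w_dev = w * np.exp(0.1 * rng.standard_normal(w.shape))   # simulated device-to-device variation
x = rng.standard_normal((512, 256))
s = column_scale_factors(w, w_dev, x)
print(np.mean((x @ w - x @ w_dev) ** 2),                 # error before compensation
      np.mean((x @ w - (x @ w_dev) * s) ** 2))           # smaller error after compensation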
Article
The high computational complexity and large number of parameters of deep neural networks (DNNs) have become the most intensive burden of deep learning hardware design, limiting efficient storage and deployment. With the advantages of high-density storage, non-volatility, and low energy consumption, resistive RAM (RRAM) crossbar-based in-memory computing (IMC) has emerged as a promising technique for DNN acceleration. To fully exploit crossbar-based IMC efficiency, a systematic compression design that considers both hardware and algorithm is necessary. In this brief, we present a system-level design considering low-precision weights and activations, structured pruning, and RRAM crossbar mapping. The proposed multi-group Lasso algorithm and hardware implementations have been evaluated on ResNet/VGG models for the CIFAR-10/ImageNet datasets. With the fully quantized 4-bit ResNet-18 for CIFAR-10, we achieve up to 65.4× compression compared to the full-precision software baseline, and 7× energy reduction compared to the 4-bit unpruned RRAM IMC hardware with 1.1% accuracy loss. For the fully quantized 4-bit ResNet-18 model for the ImageNet dataset, we achieve up to 10.9× structured compression with 1.9% accuracy degradation.
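The multi-group Lasso mentioned above builds on the standard group Lasso penalty, which adds the sum of per-group L2 norms to the loss so that entire groups (for example, the weights mapped to one crossbar column or output channel) are driven to zero. The sketch below shows this penalty in PyTorch under the assumption that groups are output channels; the grouping used in the cited brief is tailored to RRAM crossbar mapping and may differ.

import torch

def group_lasso_penalty(weight, lam=1e-4):
    # Group Lasso over output channels: lam * sum_g ||W_g||_2.  For a conv weight of
    # shape (out_ch, in_ch, k, k) each output channel is one group, so the penalty
    # drives whole channels (and the crossbar columns they map to) toward zero.
    groups = weight.reshape(weight.shape[0], -1)
    return lam * groups.norm(p=2, dim=1).sum()

# Assumed usage inside a training loop (model, criterion, optimizer are placeholders):
#   loss = criterion(model(x), y)
#   for m in model.modules():
#       if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
#           loss = loss + group_lasso_penalty(m.weight)
#   loss.backward()
#   optimizer.step()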
Conference Paper
With its in-memory processing ability, ReRAM-based computing is becoming more and more attractive for accelerating neural networks (NNs). However, most ReRAM-based accelerators cannot support efficient mapping for sparse NNs, and the whole dense matrix must be mapped onto the ReRAM crossbar array to achieve O(1) computation complexity. In this paper, we propose a sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization. Further, we propose a crossbar-grained pruning algorithm to remove the crossbars with low utilization. Finally, since most current ReRAM devices cannot achieve high precision, we analyze the effect of quantization precision for sparse NNs, and propose to perform high-precision composing in the analog domain with the related peripheral circuits. In our experiments, we discuss how the system performs with different crossbar sizes to choose the optimized design. Our results show that our mapping scheme for sparse NNs with the proposed pruning algorithm achieves 3-5× higher energy efficiency and more than 2.5-6× speedup, compared with accelerators for dense NNs. The accuracy experiments also show that our pruning method incurs almost no accuracy loss.
Article
Neuro-inspired architectures based on synaptic memory arrays have been proposed for on-chip acceleration of weighted sum and weight update in machine/deep learning algorithms. In this paper, we developed NeuroSim, a circuit-level macro model that estimates the area, latency, dynamic energy and leakage power to facilitate the design space exploration of neuro-inspired architectures with mainstream and emerging device technologies. NeuroSim provides flexible interface and a wide variety of design options at the circuit and device level. Therefore, NeuroSim can be used by neural networks as a supporting tool to provide circuit-level performance evaluation. With NeuroSim, an integrated framework can be built with hierarchical organization from the device level (synaptic device properties) to the circuit level (array architectures) and then to the algorithm level (neural network topology), enabling instruction-accurate evaluation on the learning accuracy as well as the circuit-level performance metrics at the run-time of online learning. Using multilayer perceptron (MLP) as a case-study algorithm, we investigated the impact of the “analog” emerging non-volatile memory (eNVM)’s “non-ideal” device properties and benchmarked the trade-offs between SRAM, digital and analog eNVM based architectures for online learning and offline classification.
Conference Paper
An RRAM-based computing system (RCS) is an attractive hardware platform for implementing neural computing algorithms. Online training for RCS enables hardware-based learning for a given application and reduces the additional error caused by device parameter variations. However, a high occurrence rate of hard faults due to immature fabrication processes and limited write endurance restrict the applicability of online training for RCS. We propose a fault-tolerant online training method that alternates between a fault-detection phase and a fault-tolerant training phase. In the fault-detection phase, a quiescent-voltage comparison method is utilized. In the training phase, a threshold-training method and a re-mapping scheme are proposed. Our results show that, compared to neural computing without fault tolerance, the recognition accuracy for the CIFAR-10 dataset improves from 37% to 83% when using low-endurance RRAM cells, and from 63% to 76% when using RRAM cells with high endurance but a high percentage of initial faults.
Article
Processing-in-memory (PIM) is a promising solution to address the "memory wall" challenges for future computer systems. Prior proposed PIM architectures put additional computation logic in or near memory. The emerging metal-oxide resistive random access memory (ReRAM) has shown its potential to be used for main memory. Moreover, with its crossbar array structure, ReRAM can perform matrix-vector multiplication efficiently, and has been widely studied to accelerate neural network (NN) applications. In this work, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory. In PRIME, a portion of the ReRAM crossbar arrays can be configured as accelerators for NN applications or as normal memory for a larger memory space. We provide microarchitecture and circuit designs to enable the morphable functions with an insignificant area overhead. We also design a software/hardware interface for software developers to implement various NNs on PRIME. Benefiting from both the PIM architecture and the efficiency of using ReRAM for NN computation, PRIME distinguishes itself from prior work on NN acceleration, with significant performance improvement and energy saving. Our experimental results show that, compared with a state-of-the-art neural processing unit design, PRIME improves the performance by ~2360× and reduces the energy consumption by ~895×, across the evaluated machine learning benchmarks.
Article
This tutorial introduces the basics of emerging nonvolatile memory (NVM) technologies, including spin-transfer-torque magnetic random access memory (STT-MRAM), phase-change random access memory (PCRAM), and resistive random access memory (RRAM). Emerging NVM cell characteristics are summarized, and device-level engineering trends are discussed. Emerging NVM array architectures are introduced, including the one-transistor one-resistor (1T1R) array and the cross-point array with selectors. Design challenges such as scaling the write current and minimizing the sneak-path current in the cross-point array are analyzed. Recent progress on megabit- to gigabit-level prototype chip demonstrations is summarized. Finally, the prospective applications of emerging NVM are discussed, ranging from the last-level cache to storage-class memory in the memory hierarchy. Topics of three-dimensional (3D) integration and radiation-hard NVM are discussed. Novel applications beyond conventional memory are also surveyed, including physical unclonable functions for hardware security, reconfigurable routing switches for field-programmable gate arrays (FPGAs), logic-in-memory and nonvolatile cache/register/flip-flops for nonvolatile processors, and synaptic devices for neuro-inspired computing.
Article
The Resistive Random Access Memory (RRAM) is a new type of non-volatile memory based on the resistive memory device. Researchers are currently moving from resistive device development to memory circuit design and implementation, hoping to fabricate memory chips that can be deployed in the market in the near future. However, so far the low manufacturing yield is still a major issue. In this paper, we propose defect and fault models specific to RRAM, i.e., the Over-Forming (OF) defect and the Read-One-Disturb (R1D) fault. We then propose a March algorithm to cover these defects and faults in addition to the conventional RAM faults, which is called March C*. We also develop a novel squeeze-search scheme to identify the OF defect, which leads to the Stuck-At Fault (SAF). The proposed test algorithm is applied to a first-cut 4-Mb HfO2-based RRAM test chip. Results show that OF defects and R1D faults do exist in the RRAM chip. We also identify specific failure patterns from the test results, which are shown to be induced by multiple short defects between bit-lines. By identifying the defects and faults, designers and process engineers can improve the RRAM yield in a more cost-effective way.
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
In this issue, “Best of the Web” presents the modified National Institute of Standards and Technology (MNIST) resources, consisting of a collection of handwritten digit images used extensively in optical character recognition and machine learning research.