Article

A Dual 7T SRAM-Based Zero-Skipping Compute-In-Memory Macro With 1-6b Binary Searching ADCs for Processing Quantized Neural Networks


Abstract

This article presents a novel dual 7T static random-access memory (SRAM)-based compute-in-memory (CIM) macro for processing quantized neural networks. The proposed SRAM-based CIM macro decouples read/write operations and employs a zero-input/weight skipping scheme. A 65nm test chip with 528×128 integrated dual 7T bitcells demonstrated reconfigurable-precision multiply-and-accumulate operations with 384× binary inputs (0/1) and 384×128 programmable multi-bit weights (3/7/15 levels). Each column comprises 384× bitcells for a dot product, 48× bitcells for offset calibration, and 96× bitcells for binary-searching analog-to-digital conversion. The analog-to-digital converter (ADC) converts the voltage difference between two read bitlines (i.e., an analog dot-product result) to a 1-6b digital output code by binary searching over replica bitcells in 1-6 conversion cycles. The test chip with 66Kb embedded dual SRAM bitcells was evaluated for processing neural networks, including MNIST image classification using a multi-layer perceptron (MLP) model with a 784-256-256-256-10 layer configuration. The measured classification accuracies are 97.62%, 97.65%, and 97.72% for the 3-, 7-, and 15-level weights, respectively; the accuracy degradations are only 0.58-0.74% off the software-simulation baseline. For the VGG6 model using the CIFAR-10 image dataset, the accuracies are 88.59%, 88.21%, and 89.07% for the 3-, 7-, and 15-level weights, with degradations of only 0.6-1.32% off the software baseline. The measured energy efficiencies are 258.5, 67.9, and 23.9 TOPS/W for the 3-, 7-, and 15-level weights, respectively, measured at 0.45/0.8V supplies.
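To illustrate the conversion scheme, the following is a minimal behavioral sketch (in Python, not the authors' circuit implementation) of a binary-searching ADC that resolves one output bit per cycle, so an N-bit code needs only N comparison cycles instead of 2^N−1. The full-scale range v_fs and the example input value are illustrative assumptions.

def binary_search_adc(v_diff, n_bits=6, v_fs=1.0):
    """Convert a differential read-bitline voltage to an n_bits digital code."""
    code = 0
    lo, hi = 0.0, v_fs                       # assumed conversion range
    for _ in range(n_bits):                  # one comparison per conversion cycle
        v_ref = 0.5 * (lo + hi)              # reference level set by replica bitcells
        bit = 1 if v_diff >= v_ref else 0    # comparator (sense-amplifier) decision
        code = (code << 1) | bit
        if bit:
            lo = v_ref                       # narrow the search window upward
        else:
            hi = v_ref                       # narrow the search window downward
    return code

# Example: a 0.37 V bitline difference on a 1 V range, 6-bit conversion in 6 cycles.
print(binary_search_adc(0.37))               # -> 23 (0.37 * 64 ~= 23.7)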


... By summarizing the relationship of energy efficiency with ENOB and frequency across different process nodes in the current articles [4]-[26], [32]-[73], the trend shown in Fig. 12 emerges. First, with the advancement of process nodes, the energy efficiency of CIMs has improved significantly, especially in low-power designs, where advanced processes allow the ADCs to keep the CIM architecture at high energy efficiency even at high frequencies by optimizing the circuit structure and reducing static power consumption. ...
Article
Full-text available
Computing-in-Memory (CIM) architectures have emerged as a pivotal technology for next-generation artificial intelligence (AI) and edge computing applications. By enabling computations directly within memory cells, CIM architectures effectively minimize data movement and significantly enhance energy efficiency. In the CIM system, the analog-to-digital converter (ADC) bridges the gap between efficient analog computation and general digital processing, while influencing the overall accuracy, speed and energy efficiency of the system. This review presents theoretical analyses and practical case studies on the performance requirements of ADCs and their optimization methods in CIM systems, aiming to provide ideas and references for the design and optimization of CIM systems. The review comprehensively explores the relationship between the design of CIM architectures and ADC optimization, and raises the issue of design trade-offs between low power consumption, high speed operation and compact integration design. On this basis, novel customized ADC optimization methods are discussed in depth, and a large number of current CIM systems and their ADC optimization examples are reviewed, with optimization methods summarized and classified in terms of power consumption, speed, and area. In the final part, this review analyzes energy efficiency, ENOB, and frequency scaling trends, demonstrating how advanced processes enable ADCs to balance speed, power, and area trade-offs, guiding ADC optimization for next-gen CIM systems.
Article
Full-text available
Artificial intelligence (AI) and machine learning (ML) are revolutionizing many fields of study, such as visual recognition, natural language processing, autonomous vehicles, and prediction. Traditional von-Neumann computing architecture with separated processing elements and memory devices have been improving their computing performances rapidly with the scaling of process technology. However, in the era of AI and ML, data transfer between memory devices and processing elements becomes the bottleneck of the system. To address this data movement issue, memory-centric computing takes an approach of merging the memory devices with processing elements so that computations can be done in the same location without moving any data. Processing-In-Memory (PIM) has attracted research community’s attention because it can improve the energy efficiency of memory-centric computing systems substantially by minimizing the data movement. Even though the benefits of PIM are well accepted, its limitations and challenges have not been investigated thoroughly. This paper presents a comprehensive investigation of state-of-the-art PIM research works based on various memory device types, such as static-random-access-memory (SRAM), dynamic-random-access-memory (DRAM), and resistive memory (ReRAM). We will present the overview of PIM designs in each memory type, covering from bit cells, circuits, and architecture. Then, a new software stack standard and its challenges for incorporating PIM with the conventional computing architecture will be discussed. Finally, we will discuss various future research directions in PIM for further reducing the data conversion overhead, improving test accuracy, and minimizing intra-memory data movement.
Article
Full-text available
A compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro, named CAP-RAM, is presented for energy-efficient convolutional neural network (CNN) inference. It leverages a novel charge-domain multiply-and-accumulate (MAC) mechanism and circuitry to achieve superior linearity under process variations compared to conventional IMC designs. The adopted semi-parallel architecture efficiently stores filters from multiple CNN layers by sharing eight standard 6T SRAM cells with one charge-domain MAC circuit. Moreover, up to six levels of weight bit-width with two encoding schemes and eight levels of input activations are supported. A 7-bit charge-injection SAR (ciSAR) analog-to-digital converter (ADC), which eliminates the sample-and-hold (S&H) and input/reference buffers, further improves the overall energy efficiency and throughput. A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM. A single 512×128 macro stores a complete pruned and quantized CNN model to achieve 98.8% inference accuracy on the MNIST data set and 89.0% on the CIFAR-10 data set, with a 573.4-giga operations per second (GOPS) peak throughput and a 49.4-tera operations per second (TOPS)/W energy efficiency.
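As a rough behavioral analogy (not CAP-RAM's exact circuitry), charge-domain accumulation can be modeled as charge sharing across equal capacitors, so the shared column voltage becomes proportional to the count of '1' bitwise products. The unit supply value below is an assumption for illustration.

def charge_domain_mac(bit_products, v_dd=1.0):
    """bit_products: 0/1 multiply results along one column of MAC capacitors."""
    # Each capacitor is charged to v_dd (product = 1) or left at 0 V (product = 0);
    # after charge sharing, the shared node settles to the average voltage,
    # which is proportional to the accumulated sum.
    return v_dd * sum(bit_products) / len(bit_products)

products = [1, 0, 1, 1, 0, 0, 1, 1]
print(charge_domain_mac(products))   # 0.625, encoding a MAC count of 5 out of 8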
Article
Full-text available
The current mobile applications have rapidly growing memory footprints, posing a great challenge for memory system design. Insufficient DRAM main memory will incur frequent data swaps between memory and storage, a process that hurts performance, consumes energy, and deteriorates the write endurance of typical flash storage devices. Alternately, a larger DRAM has higher leakage power and drains the battery faster. Furthermore, DRAM scaling trends make further growth of DRAM in the mobile space prohibitive due to cost. Emerging nonvolatile memory (NVM) has the potential to alleviate these issues due to its higher capacity per cost than DRAM and minimal static power. Recently, a wide spectrum of NVM technologies, including phase-change memories (PCMs), memristor, and 3-D XPoint has emerged. Despite the mentioned advantages, NVM has longer access latency compared to DRAM and NVM writes can incur higher latencies and wear costs. Therefore, the integration of these new memory technologies in the memory hierarchy requires a fundamental rearchitecting of traditional system designs. In this work, we propose a hardware-accelerated memory manager (HMMU) that addresses in a flat address space, with a small partition of the DRAM reserved for subpage block-level management. We design a set of data placement and data migration policies within this memory manager such that we may exploit the advantages of each memory technology. By augmenting the system with this HMMU, we reduce the overall memory latency while also reducing writes to the NVM. The experimental results show that our design achieves a 39% reduction in energy consumption with only a 12% performance degradation versus an all-DRAM baseline that is likely untenable in the future.
Article
Full-text available
A multi-functional in-memory inference processor integrated circuit (IC) in a 65-nm CMOS process is presented. The prototype employs a deep in-memory architecture (DIMA), which enhances both energy efficiency and throughput over conventional digital architectures via simultaneous access of multiple rows of a standard 6T bitcell array (BCA) per precharge, and embedding column pitch-matched low-swing analog processing at the BCA periphery. In doing so, DIMA exploits the synergy between the dataflow of machine learning (ML) algorithms and the SRAM architecture to reduce the dominant energy cost due to data movement. The prototype IC incorporates a 16-kB SRAM array and supports four commonly used ML algorithms--the support vector machine, template matching, k-nearest neighbor, and the matched filter. Silicon measured results demonstrate simultaneous gains (dot product mode) in energy efficiency of 10x and in throughput of 5.3x leading to a 53x reduction in the energy-delay product with negligible (≤1%) degradation in the decision-making accuracy, compared with the conventional 8-b fixed-point single-function digital implementations.
Article
Full-text available
A versatile reconfigurable accelerator architecture for binary/ternary deep neural networks is presented. In-memory neural network processing without any external data accesses, sustained by the symmetry and simplicity of the computation of the binary/ternary neural network, improves the energy efficiency dramatically. The prototype chip is fabricated, and it achieves 1.4 TOPS (tera operations per second) peak performance with 0.6-W power consumption at a 400-MHz clock. An application evaluation is also conducted.
Article
Full-text available
We introduce a method to train Quantized Neural Networks (QNNs) --- neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6 bits as well, which enables gradient computation using only bit-wise operations. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved accuracy comparable to their 32-bit counterparts using only 4 bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
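The sketch below shows the basic idea of uniform k-bit weight quantization in the spirit of QNNs; the clipping range and rounding scheme are illustrative assumptions and omit the paper's activation/gradient quantization and straight-through-estimator training details.

import numpy as np

def quantize_uniform(w, bits):
    """Uniformly quantize weights clipped to [-1, 1] onto 2**bits levels."""
    levels = 2 ** bits - 1
    w = np.clip(w, -1.0, 1.0)
    return np.round((w + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=5)
print(quantize_uniform(w, bits=1))   # binary weights in {-1, +1}
print(quantize_uniform(w, bits=2))   # 4-level weights in {-1, -1/3, +1/3, +1}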
Article
Full-text available
We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power efficiency. To validate the effectiveness of BNNs we conduct two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN datasets. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.
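A minimal sketch of the bit-wise arithmetic that replaces multiplications in a BNN: a ±1 dot product computed with XNOR and popcount on packed bits. The bit-packing convention below is illustrative, not the paper's GPU kernel.

def binary_dot(a_bits, w_bits, n):
    """Dot product of two length-n ±1 vectors packed as bits (1 -> +1, 0 -> -1)."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask    # XNOR: 1 wherever the signs agree
    matches = bin(agree).count("1")      # popcount of agreements
    return 2 * matches - n               # agreements minus disagreements

# Elements listed from the least significant bit:
a = 0b1011   # encodes [+1, +1, -1, +1]
w = 0b1101   # encodes [+1, -1, +1, +1]
print(binary_dot(a, w, 4))               # -> 0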
Article
This article presents a compact, robust, and transposable SRAM in-memory computing (IMC) macro to support feed-forward (FF) and back-propagation (BP) computation within a single macro. The transpose macro is created with a clustering structure, and eight 6T bitcells are shared with one charge-domain computing unit (CCU) to efficiently deploy the DNN weights. The normalized area overhead of the clustering structure compared to the 6T SRAM cell is only 0.37. During computation, the CCU performs robust charge-domain operations on the parasitic capacitances of the local bitlines in the IMC cluster. In the FF mode, the proposed design supports 128-input 1b XNOR and 1b AND multiplications and accumulations (MACs). The 1b AND can be extended to multi-bit MAC via bit-serial (BS) mapping, which can support DNNs with various precisions. A power-gated auto-zero Flash analog-to-digital converter (ADC) reducing the input offset voltage maintains the overall energy efficiency and throughput. The proposed macro is prototyped in a 28-nm CMOS process. It demonstrates a 1b energy efficiency of 166 | 257 TOPS/W in FF-XNOR | AND mode, respectively, and 31.8 TOPS/W in BP mode. The macro achieves 80.26% | 85.07% classification accuracy for the CIFAR-10 dataset with 1b | 4b CNN models. Besides, 95.50% MNIST classification accuracy (95.66% software accuracy) is achieved by the BP mode of the proposed transpose IMC macro.
Article
In this paper, we present a 10T SRAM compute-in-memory (CiM) macro to process the multiplication-accumulation (MAC) operations between ternary inputs and binary weights. In the proposed 10T SRAM bitcell, charge-domain analog computations are employed to improve the noise tolerance of the bit-line (BL) signals where the MAC results are represented in CiM. Parallel processing of 3 different analog levels for ternary input activations is also performed in the proposed single 10T bitcell. To reduce the analog-to-digital converter (ADC) bit-resolutions without sacrificing deep neural network (DNN) accuracies, a confined-slope non-uniform integration (CS-NUI) ADC is proposed, which can provide layer-wise adaptive quantization for multiple layers with different MAC distributions. In addition, by sharing the ADC reference voltage generator in every single column of the SRAM array, the ADC area is effectively reduced with improved CiM energy efficiencies. The 256×64 10T SRAM CiM macro with the proposed charge-sharing scheme and CS-NUI ADCs has been implemented using a 28nm CMOS process. Silicon measurement results show that the proposed CiM achieves accuracies of 98.66% on the MNIST dataset with an MLP and 88.48% on the CIFAR-10 dataset with VGGNet-7, with an energy efficiency of 2941 TOPS/W and an area efficiency of 59.584 TOPS/mm².
Article
This work introduces a digital SRAM-based near-memory compute macro for DNN inference, improving on-chip weight memory capacity and area efficiency compared to state-of-the-art digital computing-in-memory (CIM) macros. A 20×256 1-16b reconfigurable digital computing near-memory (NM) macro is proposed, supporting reconfigurable 1-16b precision through a bit-serial computing scheme and a weight- and input-gating architecture for sparsity-aware operations. Each reconfigurable column MAC comprises 16× custom-designed 7T SRAM bitcells to store 1-16b weights, a conventional 6T SRAM for zero-weight skip control, a bitwise multiplier, and a full adder with a register for partial-sum accumulations. 20× parallel partial-sum outputs are post-accumulated to generate a sub-partitioned output feature map, which is concatenated to produce the final convolution result. Besides, the pipelined array structure improves the throughput of the proposed macro. The proposed near-memory computing macro implements 80Kb of binary weight storage in a 0.473 mm² die area using a 65nm process. It presents an area/energy efficiency of 4329-270.6 GOPS/mm² and 315.07-1.23 TOPS/W at 1-16b precision.
Article
Computing-in-memory (CIM) is an emerging approach for alleviating the Von Neumann bottleneck of latency and energy overheads and improving energy efficiency and throughput. In this brief, we present a novel CIM macro aimed at improving the energy efficiency and throughput of edge devices when running 4b multiply-and-accumulate (MAC) operations. The proposed architecture uses (1) a customized 9T1C bit-cell with charge-domain computation for sensing-margin improvement and a compact design; (2) a hierarchical capacity attenuator for 4b weight accumulation without complicated controlling switches and signals, improving throughput; (3) an input-sparsity-sensing-based flash analog-to-digital converter (ADC) readout scheme to improve energy efficiency and throughput. Fabricated in 28nm CMOS technology, the proposed 32Kb SRAM CIM macro demonstrates an average energy efficiency of 646.6 TOPS/W (normalized to 4b/4b input/weight) and a throughput of 1638.4 GOPS while achieving 84.89% classification accuracy on the CIFAR-10 dataset with 4b inputs and weights.
Article
This article proposes a low-power real-time hand gesture recognition (HGR) system with high recognition accuracy for smart edge devices. This design balances accuracy and power consumption by utilizing computation-efficient hybrid classifiers assisted with a majority-voting scheme. By combining the recognition results of consecutive frames, the HGR system shows improved immunity to misclassification. In addition, compressing the input data before high-level processing dramatically reduces the on-chip memory and computational load. The proposed Edge-convolutional neural network (CNN) core with interactable processing engines reduces memory accesses and the feature-register toggling rate by 27% and 50%, respectively. The sequence analyzer based on majority voting improves the static and dynamic gesture recognition accuracy by ~7% and ~8% with only 9.4% hardware overhead. The test chip was fabricated in 65-nm CMOS technology, occupying an area of 1 × 1.5 mm². It consumes as little as 184 μW at 25 MHz and 0.6 V. The proposed HGR system can recognize six static gestures and 24 dynamic hand gestures with average accuracies of 87.25%-95% and 85.4%-94.9%, respectively.
Article
Acoustic-resolution photoacoustic microscopy (AR-PAM) can achieve deeper imaging depth in biological tissue, at the sacrifice of imaging resolution compared with optical-resolution photoacoustic microscopy (OR-PAM). Here we aim to enhance AR-PAM image quality toward that of OR-PAM images, which specifically includes enhancing imaging resolution, restoring micro-vasculature, and reducing artifacts. To address this issue, a network (MultiResU-Net) is first trained as a generative model with simulated AR-OR image pairs, which are synthesized with a physical transducer model. Moderate enhancement results can already be obtained when applying this model to in vivo AR imaging data. Nevertheless, the perceptual quality is unsatisfactory due to domain shift. Further, a domain transfer learning technique under a generative adversarial network (GAN) framework is proposed to drive the enhanced image's manifold toward that of real OR images. In this way, perceptually convincing AR-to-OR enhancement results are obtained, which is also supported by quantitative analysis. Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) values are significantly increased from 14.74 dB to 19.01 dB and from 0.1974 to 0.2937, respectively, validating the improvement of reconstruction correctness and overall perceptual quality. The proposed algorithm has also been validated across different imaging depths with experiments conducted in both shallow and deep tissue. The above AR-to-OR domain transfer learning with GAN (AODTL-GAN) framework has enabled the enhancement target with a limited amount of matched in vivo AR-OR imaging data.
Chapter
SRAM-based PIM gained popularity for its implementation simplicity, using only active devices, and its compatibility with the standard CMOS logic process. Unlike DRAM macros, which are typically placed off-chip, SRAM macros are implemented on-chip to serve as cache memory. Neural Cache [1] and CSRAM [2] took advantage of the conventional SRAM implementation and re-purposed the macro to run compute operations as well as normal memory storage operations without significant changes to the hardware.
Article
In this work, we present a novel 8T static random access memory (SRAM)-based compute-in-memory (CIM) macro for processing neural networks with high energy efficiency. The proposed 8T bitcell is free from disturb issues thanks to the decoupled read channels obtained by adding two extra transistors to the standard 6T bitcell. A 128×128 8T SRAM array offers massively parallel binary multiply and accumulate (MAC) operations with 64× binary inputs (0/1) and 64×128 binary weights (+1/–1). After parallel MAC operations, 128 column-based neurons generate 128× 1-5 bit outputs in parallel. The proposed column-based neuron comprises 64× bitcells for dot-product, 32× bitcells for the analog-to-digital converter (ADC), and 32× bitcells for offset calibration. The column ADC with 32× replica SRAM bitcells converts the analog MAC results (i.e., a differential read bitline (RBL/RBLb) voltage) to the 1-5 bit output code by sweeping their reference levels in 1-31 cycles (i.e., 2^N−1 cycles for an N-bit ADC). The measured linearity results [differential nonlinearity (DNL) and integral nonlinearity (INL)] are +0.314/–0.256 least significant bit (LSB) and +0.27/–0.116 LSB, respectively, after offset calibration. The simulated image classification results are 96.37% for the Mixed National Institute of Standards and Technology database (MNIST) using a multi-layer perceptron (MLP) with two hidden layers, and 87.1%/82.66% for CIFAR-10 using VGG-like/ResNet-18 convolutional neural networks (CNNs), demonstrating slight accuracy degradations (0.67%-1.34%) compared with the software baseline. A test chip with a 16K 8T SRAM bitcell array is fabricated using a 65-nm process. The measured energy efficiency is 490-15.8 TOPS/W for 1-5 bit ADC resolution using a 0.45-/0.8-V core supply.
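For comparison with the binary-searching ADC sketched earlier, the reference-sweeping conversion described here can be modeled as follows; the conversion range is an illustrative assumption.

def sweep_adc(v_mac, n_bits, v_fs=1.0):
    """Reference-sweep conversion: 2**n_bits - 1 comparison cycles per output."""
    steps = 2 ** n_bits - 1
    code = 0
    for k in range(1, steps + 1):        # one replica reference level per cycle
        v_ref = k * v_fs / (steps + 1)
        if v_mac >= v_ref:
            code += 1
    return code

print(sweep_adc(0.37, n_bits=5))         # 31 cycles for a 5-bit code -> 11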
Article
This paper presents a mixed-signal SRAM-based in-memory computing (IMC) macro for processing binarized neural networks. The IMC macro consists of 128×128 (16K) SRAM-based bitcells. Each bitcell consists of a standard 6T SRAM bitcell, an XNOR-based binary multiplier, and a pseudo-differential voltage-mode driver (i.e., an accumulator unit). Multiply-and-accumulate (MAC) operations between 64 pairs of inputs and weights (stored in the first 64 SRAM bitcells) are performed in 128 rows of the macro, all in parallel. A weight-stationary architecture, which minimizes off-chip memory accesses, effectively reduces energy-hungry data communications. A row-by-row analog-to-digital converter (ADC) based on 32 replica bitcells and a sense amplifier reduces the ADC area overhead and compensates for nonlinearity and variation. The ADC converts the MAC result from each row to an N-bit digital output, taking 2^N−1 cycles per conversion by sweeping the reference level of 32 replica bitcells. The remaining 32 replica bitcells in the row are utilized for offset calibration. In addition, this paper presents a pseudo-differential voltage-mode accumulator to address issues in current-mode or single-ended voltage-mode accumulators. A test chip including a 16Kbit SRAM IMC bitcell array is fabricated using a 65nm CMOS technology. The measured energy and area efficiencies are 741-87 TOPS/W with a 1-5bit ADC at a 0.5V supply and 3.97 TOPS/mm², respectively.
Article
This article presents a computing-in-memory (CIM) structure aimed at improving the energy efficiency of edge devices running multi-bit multiply-and-accumulate (MAC) operations. The proposed scheme includes a 6T SRAM-based CIM (SRAM-CIM) macro capable of: 1) weight-bitwise MAC (WbwMAC) operations to expand the sensing margin and improve the readout accuracy for high-precision MAC operations; 2) a compact 6T local computing cell to perform multiplication with suppressed sensitivity to process variation; 3) an algorithm-adaptive low MAC-aware readout scheme to improve energy efficiency; 4) a bitline header selection scheme to enlarge signal margin; and 5) a small-offset margin-enhanced sense amplifier for robust read operations against process variation. A fabricated 28-nm 64-kb SRAM-CIM macro achieved access times of 4.1-8.4 ns with energy efficiency of 11.5-68.4 TOPS/W, while performing MAC operations with 4- or 8-b input and weight precision.
Article
In-memory computing establishes a new and promising computing paradigm aimed at solving problems caused by the von Neumann bottleneck. It eliminates the need for frequent data transfer between the memory and processing modules and enables the parallel activation of multiple lines. However, vertical data storage is generally required, increasing the implementation complexity for the SRAM writing mode. This article proposes a 10-transistor (10T) SRAM to omit vertical data storage and improve the stability of in-memory computing. A cross-layout of the word line enables arrays with multirow or multicolumn parallel activation to perform vector logic operations in two directions. In addition, the novel horizontal read channel allows matrix transposition. By reconfiguring the data lines, sense amplifiers, and multiplexing read ports, the proposed SRAM can be regarded as a content-addressable memory (CAM), and its symmetry provides selectable data search by column or by row according to the application that easily fits the SRAM storage mode without additional data adjustments. A proposed self-termination structure aims to decrease search energy consumption by approximately 38.5% at 0.9 V at the TT process corner. To verify the effectiveness of the proposed design, a 4 Kb SRAM was implemented in 28-nm CMOS technology. The read margin of the proposed 10T SRAM cell is three times higher than that of the conventional 6-transistor cell. At 0.9 V, logic operations can be performed at approximately 300 MHz, and binary CAM search operations are achieved at approximately 260 MHz with around 1 fJ of energy consumption per search/bit.
Article
This article (Colonnade) presents a fully digital bit-serial compute-in-memory (CIM) macro. The digital CIM macro is designed for processing neural networks with reconfigurable 1-16 bit input and weight precisions based on a bit-serial computing architecture and a novel all-digital bitcell structure. A column of bitcells forms a column MAC and is used for computing a multiply-and-accumulate (MAC) operation. The column MACs placed in a row work as a single neuron and compute a dot-product, which is an essential building block of neural network accelerators. Several key features differentiate the proposed Colonnade architecture from existing analog and digital implementations. First, its fully digital circuit implementation is free from the process variation, noise susceptibility, and data-conversion overhead that are prevalent in prior analog CIM macros. A bitwise MAC operation in a bitcell is performed in the digital domain using a custom-designed XNOR gate and a full adder. Second, the proposed CIM macro is fully reconfigurable in both weight and input precision from 1 to 16 bit. So far, most analog macros have been used for processing quantized neural networks with very low input/weight precisions, mainly due to a memory density issue. Recent digital accelerators have implemented reconfigurable precisions, but they are inferior in energy efficiency due to significant off-chip memory access. We present a regular digital bitcell array that is readily reconfigured to a 1-16 bit weight-stationary bit-serial CIM macro. The macro computes parallel dot-product operations between the weights stored in memory and inputs that are serialized from LSB to MSB. Finally, the bit-serial computing scheme significantly reduces the area overhead while sacrificing latency due to bit-by-bit operation cycles. Based on the benefits of digital CIM, reconfigurability, and the bit-serial computing architecture, Colonnade can achieve both high performance and energy efficiency (i.e., the benefits of both prior analog and digital accelerators) for processing neural networks. A test chip with 128×128 SRAM-based bitcells for digital bit-serial computing is implemented using 65-nm technology and tested with 1-16 bit weight/input precisions. The measured energy efficiency is 117.3 TOPS/W at 1 bit and 2.06 TOPS/W at 16 bit.
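A minimal sketch of the bit-serial dot-product scheme: inputs are streamed from LSB to MSB, and each cycle's bit-plane partial sum is shifted into the accumulator according to its bit significance. Operand values and precisions are illustrative assumptions.

def bit_serial_dot(inputs, weights, in_bits):
    """Dot product with in_bits-wide unsigned inputs streamed bit-serially."""
    acc = 0
    for b in range(in_bits):                            # LSB to MSB
        bit_plane = [(x >> b) & 1 for x in inputs]      # current input bit-plane
        partial = sum(ib * w for ib, w in zip(bit_plane, weights))
        acc += partial << b                             # apply bit significance
    return acc

inputs = [5, 3, 7]             # 3-bit activations
weights = [2, -1, 4]           # weights stored in the column MACs
print(bit_serial_dot(inputs, weights, in_bits=3))       # 5*2 + 3*(-1) + 7*4 = 35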
Article
We present an energy-efficient processing-in-memory (PIM) architecture named Z-PIM that supports both sparsity handling and fully variable bit-precision in weight data for energy-efficient deep neural networks. Z-PIM adopts bit-serial arithmetic that performs a multiplication bit-by-bit through multiple cycles to reduce the complexity of the operation in a single cycle and to provide flexibility in bit-precision. To this end, it employs a zero-skipping convolution SRAM, which performs in-memory AND operations based on custom 8T-SRAM cells and channel-wise accumulations, and a diagonal accumulation SRAM that performs bit- and spatial-wise accumulation on the channel-wise accumulation results using diagonal logic and adders to produce the final convolution outputs. We propose a hierarchical bitline structure for energy-efficient weight-bit pre-charging and computational readout by reducing the parasitic capacitances of the bitlines. Its charge-reuse scheme reduces the switching rate by 95.42% for the convolution layers of the VGG-16 model. In addition, Z-PIM's channel-wise data mapping enables sparsity handling by skip-reading the input channels with zero weight. Its read-operation pipelining, enabled by read-sequence scheduling, improves the throughput by 66.1%. The Z-PIM chip is fabricated in a 65-nm CMOS process on a 7.568-mm² die, while consuming an average power of 5.294 mW at a 1.0-V supply and a 200-MHz frequency. It achieves 0.31-49.12 TOPS/W energy efficiency for convolution operations as the weight sparsity and bit-precision vary from 0.1 to 0.9 and 1 to 16 bit, respectively. For a figure of merit considering input bit-width, weight bit-width, and energy efficiency, Z-PIM shows more than 2.1 times improvement over state-of-the-art PIM implementations.
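The zero-skipping idea can be sketched as follows: terms with a zero input or zero weight are never read or accumulated, so sparse layers spend cycles only on nonzero products. The flat data layout here is an illustrative simplification of Z-PIM's channel-wise mapping.

def zero_skip_dot(inputs, weights):
    """Accumulate only nonzero products; count how many terms were skipped."""
    acc, skipped = 0, 0
    for x, w in zip(inputs, weights):
        if x == 0 or w == 0:      # skip-read: no memory access or accumulation
            skipped += 1
            continue
        acc += x * w
    return acc, skipped

print(zero_skip_dot([0, 3, 0, 1], [2, 0, 5, 4]))   # (4, 3): one nonzero product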
Article
A novel 4T2C ternary embedded DRAM (eDRAM) cell is proposed for computing a vector-matrix multiplication in the memory array. The proposed eDRAM-based compute-in-memory (CIM) architecture addresses the well-known von Neumann bottleneck in the traditional computer architecture and improves both latency and energy in processing neural networks. The proposed ternary eDRAM cell takes a smaller area than prior SRAM-based bitcells using 6-12 transistors. Nevertheless, the compact eDRAM cell stores a ternary state (−1, 0, or +1), while the SRAM bitcells can only store a binary state. We also present a method to mitigate the compute accuracy degradation due to device mismatches and variations. Besides, we extend the eDRAM cell retention time to 200 μs by adding a custom metal capacitor at the storage node. With the improved retention time, the overall energy consumption of the eDRAM macro, including a regular refresh operation, is lower than most prior SRAM-based CIM macros. A 128×128 ternary eDRAM macro computes a vector-matrix multiplication between a vector with 64 binary inputs and a matrix with 64×128 ternary weights. Hence, 128 outputs are generated in parallel. Note that both weight and input bit-precisions are programmable for supporting a wide range of edge computing applications with different performance requirements. The bit-precisions are readily tunable by assigning a variable number of eDRAM cells per weight or adding multiple pulses to the input. An embedded column ADC based on replica cells sweeps the reference level for 2^N−1 cycles and converts the analog accumulated bitline voltage to a 1-5bit digital output. A critical bitline accumulate operation is simulated (Monte Carlo, 3K runs). It shows a standard deviation of 2.84%, which could degrade the classification accuracy of the MNIST dataset by 0.6% and the CIFAR-10 dataset by 1.3% versus a baseline with no variation. The simulated energy is 1.81 fJ/operation, and the energy efficiency is 552.5-17.8 TOPS/W (for a 1-5bit ADC) at 200MHz using 65nm technology.
Article
In this work, we present a compute-in-memory (CIM) macro built around a standard two-port compiler macro using a foundry 8T bit-cell in 7-nm FinFET technology. The proposed design supports 1024 4b × 4b multiply-and-accumulate (MAC) computations simultaneously. The 4-bit input is represented by the number of read word-line (RWL) pulses, while the 4-bit weight is realized by charge sharing among binary-weighted computation caps. Each unit of computation cap is formed by the inherent cap of the sense amplifier (SA) inside the 4-bit Flash ADC, which saves area and minimizes the kick-back effect. Access time is 5.5 ns with a 0.8-V power supply at room temperature. The proposed design achieves an energy efficiency of 351 TOPS/W and a throughput of 372.4 GOPS. Implications of our design from neural network implementation and accuracy perspectives are also discussed.
Article
Previous SRAM-based computing-in-memory (SRAM-CIM) macros suffer small read margins for high-precision operations, large cell-array area overhead, and limited compatibility with many input and weight configurations. This work presents a 1-to-8-bit configurable SRAM CIM unit-macro using: 1) a hybrid structure combining 6T-SRAM-based in-memory binary product-sum (PS) operations with digital near-memory-computing multibit PS accumulation to increase read accuracy and reduce area overhead; 2) column-based place-value-grouped weight mapping and a serial-bit input (SBIN) mapping scheme to facilitate reconfiguration and increase array efficiency under various input and weight configurations; 3) a self-reference multilevel reader (SRMLR) to reduce read-out energy and achieve a sensing margin 2× that of the mid-point reference scheme; and 4) an input-aware bitline voltage compensation scheme to ensure successful read operations across various input-weight patterns. A 4-Kb configurable 6T-SRAM CIM unit-macro was fabricated using a 55-nm CMOS process with foundry 6T-SRAM cells. The resulting macro achieved access times of 3.5 ns per cycle (pipeline) and energy efficiency of 0.6-40.2 TOPS/W under binary to 8-b input/8-b weight precision.
Article
We present XNOR-SRAM, a mixed-signal in-memory computing (IMC) SRAM macro that computes ternary-XNOR-and-accumulate (XAC) operations in binary/ternary deep neural networks (DNNs) without row-by-row data access. The XNOR-SRAM bitcell embeds circuits for ternary XNOR operations, which are accumulated on the read bitline (RBL) by simultaneously turning on all 256 rows, essentially forming a resistive voltage divider. The analog RBL voltage is digitized with a column-multiplexed 11-level flash analog-to-digital converter (ADC) at the XNOR-SRAM periphery. XNOR-SRAM is prototyped in a 65-nm CMOS and achieves an energy efficiency of 403 TOPS/W for ternary-XAC operations with 88.8% test accuracy on the CIFAR-10 data set at a 0.6-V supply. This marks 33x better energy efficiency and 300x better energy-delay product than conventional digital hardware, and also represents one of the best tradeoffs between energy efficiency and DNN accuracy.
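Functionally, the ternary XNOR-and-accumulate operation reduces to a dot product over {-1, 0, +1} values accumulated across all rows at once; the sketch below models only the arithmetic, not the resistive voltage divider or the 11-level flash ADC, and the values are illustrative.

def ternary_xac(inputs, weights):
    """Ternary XAC: accumulate products of values drawn from {-1, 0, +1}."""
    return sum(x * w for x, w in zip(inputs, weights))

x = [+1, 0, -1, +1, -1]
w = [-1, +1, -1, +1, 0]
print(ternary_xac(x, w))       # -1 + 0 + 1 + 1 + 0 = 1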
Article
Computation-in-memory (CIM) is a promising candidate to improve the energy efficiency of multiply-and-accumulate (MAC) operations in artificial intelligence (AI) chips. This work presents a static random access memory (SRAM) CIM unit-macro using: 1) compact-rule-compatible twin-8T (T8T) cells for weighted CIM MAC operations to reduce area overhead and vulnerability to process variation; 2) an even-odd dual-channel (EODC) input mapping scheme to extend input bandwidth; 3) a two's complement weight mapping (C2WM) scheme to enable MAC operations using positive and negative weights within a cell array in order to reduce area overhead and computational latency; and 4) a configurable global-local reference voltage generation (CGLRVG) scheme for kernels of various sizes and bit precisions. A 64 × 60b T8T unit-macro with 1-, 2-, 4-b inputs, 1-, 2-, 5-b weights, and up to 7-b MAC-value (MACV) outputs was fabricated as a test chip using a foundry 55-nm process. The proposed SRAM-CIM unit-macro achieved access times of 5 ns and energy efficiency of 37.5-45.36 TOPS/W under 5-b MACV output.
Article
The trend of pushing inference from cloud to edge due to concerns of latency, bandwidth, and privacy has created demand for energy-efficient neural network hardware. This paper presents a mixed-signal binary convolutional neural network (CNN) processor for always-on inference applications that achieves 3.8 μJ/classification at 86% accuracy on the CIFAR-10 image classification data set. The goal of this paper is to establish the minimum-energy point for the representative CIFAR-10 inference task, using the available design tradeoffs. The BinaryNet algorithm for training neural networks with weights and activations constrained to +1 and -1 drastically simplifies multiplications to XNOR and allows integrating all memory on-chip. A weight-stationary, data-parallel architecture with input reuse amortizes memory access across many computations, leaving wide vector summation as the remaining energy bottleneck. This design features an energy-efficient switched-capacitor (SC) neuron that addresses this challenge, employing a 1024-bit thermometer-coded capacitive digital-to-analog converter (CDAC) section for summing pointwise products of CNN filter weights and activations and a 9-bit binary-weighted section for adding the filter bias. The design occupies 6 mm² in 28-nm CMOS, contains 328 kB of on-chip SRAM, operates at 237 frames/s (FPS), and consumes 0.9 mW from 0.6 V/0.8 V supplies. The corresponding energy per classification (3.8 μJ) amounts to a 40× improvement over the previous low-energy benchmark on CIFAR-10, achieved in part by sacrificing some programmability. The SC neuron array is 12.9× more energy efficient than a synthesized digital implementation, which amounts to a 4× advantage in system-level energy per classification.
Conference Paper
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values, resulting in 32× memory savings. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of the number of high-precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.
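A minimal sketch of the Binary-Weight-Network approximation: a real-valued filter W is replaced by alpha * sign(W), where alpha is the mean absolute value of W; the example filter values are illustrative.

import numpy as np

def binarize_filter(w):
    """Approximate w by alpha * b with binary b and scalar alpha = mean(|w|)."""
    alpha = float(np.mean(np.abs(w)))
    b = np.sign(w)
    b[b == 0] = 1.0                  # keep b strictly binary (+1/-1)
    return alpha, b

w = np.array([0.4, -0.2, 0.1, -0.7])
alpha, b = binarize_filter(w)
print(alpha, b)                      # ~0.35 and [ 1. -1.  1. -1.]
print(alpha * b)                     # binary-weight approximation of w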
Article
We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time and when computing the parameters' gradient at train-time. We conduct two sets of experiments, each based on a different framework, namely Torch7 and Theano, where we train BNNs on MNIST, CIFAR-10 and SVHN, and achieve nearly state-of-the-art results. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to a great increase in power-efficiency. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available.
Conference Paper
Our challenge is clear: The drive for performance and the end of voltage scaling have made power, and not the number of transistors, the principal factor limiting further improvements in computing performance. Continuing to scale compute performance will require the creation and effective use of new specialized compute engines, and will require the participation of application experts to be successful. If we play our cards right, and develop the tools that allow our customers to become part of the design process, we will create a new wave of innovative and efficient computing devices.