Article

A 65-nm 8T SRAM Compute-in-Memory Macro With Column ADCs for Processing Neural Networks


Abstract

In this work, we present a novel 8T static random access memory (SRAM)-based compute-in-memory (CIM) macro for processing neural networks with high energy efficiency. The proposed 8T bitcell is free from disturb issues thanks to read channels decoupled by adding two extra transistors to the standard 6T bitcell. A 128 × 128 8T SRAM array offers massively parallel binary multiply and accumulate (MAC) operations with 64 binary inputs (0/1) and 64 × 128 binary weights (+1/−1). After the parallel MAC operations, 128 column-based neurons generate 128 × 1–5 bit outputs in parallel. Each column-based neuron comprises 64 bitcells for the dot product, 32 bitcells for the analog-to-digital converter (ADC), and 32 bitcells for offset calibration. The column ADC with 32 replica SRAM bitcells converts the analog MAC result (i.e., a differential read bitline (RBL/RBLb) voltage) to a 1–5 bit output code by sweeping its reference levels over 1–31 cycles (i.e., 2^N − 1 cycles for an N-bit ADC). The measured linearity results [differential nonlinearity (DNL) and integral nonlinearity (INL)] are +0.314/−0.256 least significant bit (LSB) and +0.27/−0.116 LSB, respectively, after offset calibration. The simulated image classification results are 96.37% for the Modified National Institute of Standards and Technology (MNIST) dataset using a multi-layer perceptron (MLP) with two hidden layers and 87.1%/82.66% for CIFAR-10 using VGG-like/ResNet-18 convolutional neural networks (CNNs), demonstrating slight accuracy degradations (0.67%–1.34%) compared with the software baseline. A test chip with a 16K 8T SRAM bitcell array is fabricated using a 65-nm process. The measured energy efficiency is 490–15.8 TOPS/W for 1–5 bit ADC resolution using a 0.45-/0.8-V core supply.
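The following behavioral sketch (not the silicon implementation) illustrates the column operation the abstract describes: a binary dot product of 64 inputs (0/1) against 64 weights (+1/−1), followed by an idealized ramp ADC that sweeps 2^N − 1 reference levels, one comparison per cycle. The uniform reference spacing and ±64 full-scale range are assumptions made for illustration.

```python
import numpy as np

def binary_mac(inputs, weights):
    """Behavioral dot product for one column: 64 binary activations (0/1)
    against 64 binary weights (+1/-1); the result lies in [-64, +64]."""
    return int(np.dot(np.asarray(inputs), np.asarray(weights)))

def ramp_adc(mac_value, n_bits, full_scale=64):
    """Idealized column ramp ADC: sweep 2^N - 1 reference levels, one per
    cycle, and count how many of them the analog MAC result exceeds."""
    n_refs = (1 << n_bits) - 1                                     # 2^N - 1 comparison cycles
    refs = np.linspace(-full_scale, full_scale, n_refs + 2)[1:-1]  # assumed uniform references
    return int(np.sum(mac_value > refs))                           # thermometer count -> output code

# Example: one column with random data, converted at 5-bit resolution (31 cycles)
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 64)          # binary inputs (0/1)
w = rng.choice([-1, 1], 64)         # binary weights (+1/-1)
print(ramp_adc(binary_mac(x, w), n_bits=5))
```

In the actual macro the comparison happens on a differential RBL/RBLb voltage and the reference levels are produced by the 32 replica bitcells; the numeric references above are only a stand-in for that behavior.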


... The goal of this paper is to reduce the proportion of error-prone columns and increase operational throughput by introducing offset calibration [6], [7] to PUD. The offset calibration technique, especially common in SRAM-based approaches [7], allocates specific cells in the dedicated rows to counteract the column-specific offsets caused by process variations. ...
... The goal of this paper is to reduce the proportion of error-prone columns and increase operational throughput by introducing offset calibration [6], [7] to PUD. The offset calibration technique, especially common in SRAM-based approaches [7], allocates specific cells in the dedicated rows to counteract the column-specific offsets caused by process variations. Our approach replaces the uniform neutral data in PUD with column-specific calibration data (Fig. 1b). ...
... The small number of calibration rows restricts the variety of offsets, making it difficult to comprehensively adapt to the distribution of threshold voltage variations. In a SRAM-based approach that employs offset calibration, 32 out of 256 simultaneously opened rows are allocated for calibration purposes [7]. This allows for the adjustment of convergence voltage offsets in 33 levels with a granularity of 1/256 of the supply voltage. ...
Preprint
Recently, practical analog in-memory computing has been realized using unmodified commercial DRAM modules. The underlying Processing-Using-DRAM (PUD) techniques enable high-throughput bitwise operations directly within DRAM arrays. However, the presence of inherent error-prone columns hinders PUD's practical adoption. While selectively using only error-free columns would ensure reliability, this approach significantly reduces PUD's computational throughput. This paper presents PUDTune, a novel high-precision calibration technique for increasing the number of error-free columns in PUD. PUDTune compensates for errors by applying pre-identified column-specific offsets to PUD operations. By leveraging multi-level charge states of DRAM cells, PUDTune generates fine-grained and wide-range offset variations despite the limited available rows. Our experiments with DDR4 DRAM demonstrate that PUDTune increases the number of error-free columns by 1.81× compared to conventional implementations, improving addition and multiplication throughput by 1.88× and 1.89×, respectively.
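For intuition only, the sketch below mimics the column-specific offset idea described above: per-column errors are estimated once from known calibration patterns and then subtracted from later raw results. Function and variable names are illustrative and not taken from PUDTune.

```python
import numpy as np

def estimate_column_offsets(raw_results, expected_results):
    """Average signed error per column over a set of known calibration patterns."""
    return np.mean(np.asarray(raw_results) - np.asarray(expected_results), axis=0)

def compensate(raw_row, offsets):
    """Subtract the pre-identified column-specific offsets from one raw readout."""
    return np.asarray(raw_row) - offsets

# Toy example: 8 columns, 4 calibration patterns, a fixed (hypothetical) per-column error
rng = np.random.default_rng(1)
true = rng.integers(0, 16, size=(4, 8))
col_error = np.array([0, 1, 0, -1, 2, 0, 0, 1])
offsets = estimate_column_offsets(true + col_error, true)
print(compensate(true[0] + col_error, offsets))   # recovers true[0]
```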
... The following section provides an overview of each technique to clarify the pros and cons. Current-mode: Current-mode CIM accelerators perform accumulation by combining the pull-down currents of multiple cells on the bitline (BL) [4], [5]. Initially, the BL is precharged and then connected to the SRAM cells. ...
... However, in [14] and [4], if the WL/WLB voltages reach VDD and the BL/BLB voltages approach VSS during the operation, the SRAM enters a write state, causing unintended data corruption. [5] presents a current-mode CIM using an 8T-SRAM. This circuit features a structure where the internal nodes and access transistors are connected through gates instead of drains, and the BL and internal nodes are decoupled during the operation, thus avoiding data corruption. ...
... At an operating voltage of 0.8 V, the average offset voltage was 0.21 mV with a standard deviation of 16 mV, with the required comparison time being 135 ps in the worst-case scenario. This is comparable to the previous research [5], which reported average offset voltage of 1.1 mV and standard deviation of 14.2 mV at 0.8 V with 65-nm process. ...
Article
Full-text available
This paper proposes a novel 8T-SRAM based computing-in-memory (CIM) accelerator for Binary/Ternary neural networks. The proposed split dual-port 8T-SRAM cell has two input ports by adding two transistors to 6T-SRAM, simultaneously performing two binary multiply-and-accumulate (MAC) operations on the left and right bitlines. This approach enables a twofold increase in throughput without significantly increasing area or power consumption, since the area overhead for doubling throughput is only two additional WL wires compared to the conventional 8T-SRAM. In addition, the proposed circuit supports binary and ternary activation inputs, allowing flexible adjustment between high energy efficiency and high inference accuracy depending on the application. The proposed SRAM macro consists of a 128×128 SRAM array that outputs the MAC operation results of 96 binary/ternary inputs and 96×128 binary weights as 1-5 bit digital values. The proposed circuit performance was evaluated by post-layout simulation with the 22-nm process layout of the overall CIM macro. The proposed circuit is capable of high-speed operation at 1 GHz. It achieves a maximum area efficiency of 3320 TOPS/mm², which is 3.4× higher compared to existing research, with a reasonable energy efficiency of 1471 TOPS/W. The simulated inference accuracies of the proposed circuit are 96.45%/97.67% for the MNIST dataset with binary/ternary MLP models, and 86.32%/88.56% for the CIFAR-10 dataset with binary/ternary VGG-like CNN models.
... To further improve the energy efficiency and computing speed of CIM, binary neural networks (BNNs) were proposed, where 1-bit precision is used for both weights and activations, and the computation can be simplified to XNOR operations [15,25,35]. The binarized XNOR-Net and XNOR-SRAM achieve faster convolutional operation and memory saving by performing XNOR-and-accumulation (XAC) operations [16,20,34,40]. ...
... There are few studies on the scenario of CIM bitline current leakage, where the leakage current needs to be compensated on the bitline. For example, in certain XAC approaches based on the bitline discharge method [34,40], bitline leakage current may lead to a deviation from the expected results, eventually resulting in errors in the ADC outputs. In binary content-addressable memory (BCAM) operation, bitline leakage current will reduce the bitline voltage, cause a mismatch with the reference voltage VREF, and produce an erroneous output from the sense amplifier (SA) [23,26]. ...
Article
Full-text available
Computing-in-memory (CIM) technology has been developed to improve computer hardware efficiency for intelligent big data applications. An 8T-SRAM CIM macro with bitline leakage compensation ability is proposed in this study. In SRAM read mode, the proposed circuit solves the problem of read failure caused by bitline leakage current. In the write operation, the proposed circuit writes through the double-wordline, which can effectively improve the write static noise margin (WSNM) of the bitcell and reduce its write delay. When performing XNOR-and-accumulation (XAC) operations, the proposed circuit can effectively compensate the offset of the accumulated results due to bitline leakage current, thereby reducing output errors. In addition, this circuit can also be configured as binary content-addressable memory (BCAM) mode, and leakage compensation can be performed in this mode. A logic-AND sense amplifier (LASA) structure is also proposed, which can be applied in the SRAM read and BCAM modes. The proposed circuit is simulated in 55 nm CMOS process, under 1.2 V supply voltage. The proposed system demonstrates an energy efficiency of 62.7 TOPS/W, and effectively improves the anti-leakage ability of the array in all three modes.
... The fabricated chip does not include the integrators and the comparator which are implemented in software after obtaining the crossbar output. In-memory ramp ADC using SRAM has been demonstrated earlier 45 in a different architecture with a much larger (100%) overhead compared to the memory used for MAC operations; however, nonlinear functions have not been integrated with the ADC. Implementing the proposed scheme using SRAM would require many cells for each step due to the different step sizes. ...
... The energy consumption and area information of the ripple counter are determined through Spice simulation and reference paper 36 respectively. The data regarding integrators and comparators are obtained from reference 51 and reference 45 and scaled respectively. According to timing of in-memory computing circuits in Fig. 2b, MAC and NL-ADC operation requires a total of 64 clock cycles. ...
Preprint
Analog In-memory Computing (IMC) has demonstrated energy-efficient and low latency implementation of convolution and fully-connected layers in deep neural networks (DNN) by using physics for computing in parallel resistive memory arrays. However, recurrent neural networks (RNN) that are widely used for speech-recognition and natural language processing have tasted limited success with this approach. This can be attributed to the significant time and energy penalties incurred in implementing nonlinear activation functions that are abundant in such models. In this work, we experimentally demonstrate the implementation of a non-linear activation function integrated with a ramp analog-to-digital conversion (ADC) at the periphery of the memory to improve in-memory implementation of RNNs. Our approach uses an extra column of memristors to produce an appropriately pre-distorted ramp voltage such that the comparator output directly approximates the desired nonlinear function. We experimentally demonstrate programming different nonlinear functions using a memristive array and simulate its incorporation in RNNs to solve keyword spotting and language modelling tasks. Compared to other approaches, we demonstrate manifold increase in area-efficiency, energy-efficiency and throughput due to the in-memory, programmable ramp generator that removes digital processing overhead.
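A rough numerical sketch of the pre-distorted-ramp idea follows, assuming a sigmoid target and uniformly spaced output codes (it is not the memristive implementation): the reference levels are placed at the inverse of the desired nonlinearity, so counting comparator trips directly yields the quantized activation.

```python
import numpy as np

def predistorted_ramp(inverse_fn, n_bits):
    """Reference levels at the inverse of the target nonlinearity, so a plain
    thermometer count of comparator trips approximates f(x)."""
    codes = np.arange(1, 1 << n_bits)             # output codes 1 .. 2^N - 1
    return inverse_fn(codes / (1 << n_bits))      # ramp values in the input domain

def nonlinear_adc(x, ramp):
    """Count how many swept ramp levels the input exceeds (one comparison each)."""
    return int(np.sum(x > ramp))

# Example: approximate a sigmoid activation with a 5-bit nonlinear ramp
sigmoid_inverse = lambda y: np.log(y / (1.0 - y))
ramp = predistorted_ramp(sigmoid_inverse, n_bits=5)
for x in (-4.0, 0.0, 4.0):
    print(x, nonlinear_adc(x, ramp) / 32)         # roughly 1 / (1 + exp(-x))
```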
... The application of edge devices has become increasingly widespread across diverse fields, including smart cities, industrial automation, agriculture, healthcare, and transportation [5]. Recently, edge devices have evolved significantly, integrating artificial intelligence (AI) technologies and thus transforming into sophisticated AI edge devices capable of real-time data processing [6]. AI has established itself as an innovative technology for addressing real-time computational challenges, such as speech and facial recognition, where immediate data interpretation is critical. ...
... However, this approach inevitably leads to an increase in both cell size and power consumption, presenting trade-offs between power efficiency and memory density. Chengshuo Yu [6] introduced an 8T SRAM-based CIM macro with integrated analog computing capabilities, specifically targeting compute-in-memory (CIM) applications. Nevertheless, Yu's design relies on high-resolution analog-to-digital converters (ADCs) to interpret analog outputs, which significantly increases both the system's complexity and its power requirements. ...
Article
Full-text available
The traditional Von Neumann architecture creates bottlenecks due to data movement. The compute-in-memory (CIM) architecture performs computations within memory bit-cell arrays, enhancing computational performance. Edge devices utilizing artificial intelligence (AI) address real-time problems and have established themselves as groundbreaking technology. The 8T structure proposed in this paper has strengths over other existing structures in that it better withstands environmental changes within the SRAM and consumes lower power during memory operation. This structure minimizes reliance on complex ADCs, instead utilizing a simplified voltage differential approach for multiply-and-accumulate (MAC) operations, which enhances both power efficiency and stability. Based on these strengths, it can achieve higher battery efficiency in AI edge devices and improve system performance. The proposed integrated circuit was simulated in a 90 nm CMOS process and operated on a 1 V supply voltage.
... Compared with the all-digital accelerators, they largely improve throughput and energy efficiency. Since SRAM is the most user-friendly memory with unlimited endurance and is fabricated by a process which is almost compatible with the CMOS one, many works on inference accelerators based on SRAM cells have been published [3]–[7]. ...
... However, it requires control circuits for WL pulse width, which makes the array size larger and the system vulnerable to process variations. To avoid the read disturb, non-6T SRAM (8T, 10T, etc.) cells have been adopted in which their read paths are separated from the write paths [3], [9]–[14]. However, they obviously make the crossbar array size larger and the nonlinearity issue remains unsolved. ...
Article
Full-text available
A binarized neural network (BNN) inference accelerator is designed in which weights are stored in loadless four-transistor static random access memory (4T SRAM) cells. A time-multiplexed exclusive NOR (XNOR) multiplier with switched capacitors is proposed which prevents the loadless 4T SRAM cell from being destroyed in the operation. An accumulator with a current sensing scheme is also proposed to make the multiply-accumulate operation (MAC) completely linear and read-disturb free. The BNN inference accelerator is applied to the MNIST dataset recognition problem with an accuracy of 96.2% for 500 data, and the throughput, energy efficiency and area efficiency are confirmed to be 15.50 TOPS, 72.17 TOPS/W and 50.13 TOPS/mm², respectively, by HSPICE simulation in 32nm technology. Compared with the conventional SRAM cell based BNN inference accelerators which are scaled to 32nm technology, the synapse cell size is reduced to less than 16% (0.235 μm²) and the cell efficiency (synapse array area/synapse array plus peripheral circuits) is 73.27%, which is equivalent to the state-of-the-art of the SRAM cell based BNN accelerators.
... Reliable MAJX: PUD's MAJX operations inherently contain errors in some columns of commercial DRAM modules [32], [36]. To address this reliability challenge, MV-DRAM employs Frac operations [34] and calibration techniques [48] to increase the number of reliable columns, which achieves error-free computation. The number of reliable columns is shown in Table I. ...
Preprint
General matrix-vector multiplication (GeMV) remains a critical latency bottleneck in large language model (LLM) inference, even with quantized low-bit models. Processing-Using-DRAM (PUD), an analog in-DRAM computing technique, has the potential to repurpose on-device DRAM as a GeMV engine, offering additional high-throughput processing capabilities to widespread consumer devices without DRAM modifications. However, applying PUD to GeMV operations in the LLM inference pipeline incurs significant overheads before and after in-DRAM computation, diminishing the benefits of its high-throughput processing capabilities. This paper presents MVDRAM, the first practical system to accelerate GeMV operations for low-bit LLM inference using unmodified DRAM. By leveraging the data sharing patterns and mathematical linearity in GeMV operations, MVDRAM orchestrates the processor and DRAM to eliminate the costs associated with pre-arranging inputs and bit-transposition of outputs required in conventional PUD approaches. Our experimental evaluation with four DDR4 DRAM modules shows that MVDRAM achieves comparable or even better inference speed than the processor-based implementation for GeMV operations in low-bit (under 4-bit) LLM. In particular, MVDRAM achieves up to 7.29× speedup and 30.5× energy efficiency for low-bit GeMV operations. For end-to-end LLM inference, MVDRAM achieves 2.18× and 1.31× throughput improvements, along with 3.04× and 2.35× energy efficiency, for 2-bit and 4-bit quantized low-bit models, respectively. MVDRAM has the potential to redefine the AI hardware landscape by demonstrating the feasibility of standard DRAM as an LLM accelerator.
... The fabricated chip does not include the integrators and the comparator which are implemented in software after obtaining the crossbar output. In-memory ramp ADC using SRAM has been demonstrated earlier 44 in a different architecture with a much larger (100%) overhead compared to the memory used for MAC operations; however, nonlinear functions have not been integrated with the ADC. Implementing the proposed scheme using SRAM would require many cells for each step due to the different step sizes. ...
Article
Full-text available
Analog In-memory Computing (IMC) has demonstrated energy-efficient and low latency implementation of convolution and fully-connected layers in deep neural networks (DNN) by using physics for computing in parallel resistive memory arrays. However, recurrent neural networks (RNN) that are widely used for speech-recognition and natural language processing have tasted limited success with this approach. This can be attributed to the significant time and energy penalties incurred in implementing nonlinear activation functions that are abundant in such models. In this work, we experimentally demonstrate the implementation of a non-linear activation function integrated with a ramp analog-to-digital conversion (ADC) at the periphery of the memory to improve in-memory implementation of RNNs. Our approach uses an extra column of memristors to produce an appropriately pre-distorted ramp voltage such that the comparator output directly approximates the desired nonlinear function. We experimentally demonstrate programming different nonlinear functions using a memristive array and simulate its incorporation in RNNs to solve keyword spotting and language modelling tasks. Compared to other approaches, we demonstrate manifold increase in area-efficiency, energy-efficiency and throughput due to the in-memory, programmable ramp generator that removes digital processing overhead.
... So it will encounter the same issue of interference with the weights stored in SRAM during computation as voltage-domain based SRAM-CIM. [32] designed an 8T SRAM-CIM cell in which two access transistors are added to the standard SRAM cell. This modification creates dedicated channels for read operations, effectively mitigating issues related to read/write interference. ...
Article
Full-text available
Neural network models have been widely used in various fields as the main way to solve problems in the current artificial intelligence (AI) field. Efficient execution of neural network models requires devices with massively parallel multiply-accumulate (MAC) and matrix-vector multiplication (MVM) computing capability. However, existing computing devices based on the von Neumann architecture suffer from bottlenecks: the separation of the memory and computation modules forces data movement that wastes a great deal of computation time and energy. Computing-in-memory (CIM), which performs MAC computation inside the memory, is considered a promising direction to solve this problem. However, large-scale application of CIM still faces challenges due to the non-idealities of current CIM devices and the lack of a common and reliable programmable interface on the application side. In this paper, we comprehensively analyze the current problems faced by CIMs from various perspectives, such as CIM memory arrays, peripheral circuits, and application-side design, and discuss possible future development opportunities for CIMs.
... This also indicates that the ADCs and the interconnect architecture should be further optimized for developing energy- and area-efficient PIM systems. We also benchmarked our system-level simulations with previously reported results using different types of synaptic devices, as summarized in Table 1 [30]–[39]. As shown in the table, the fabricated CTF device performs exceedingly well in terms of energy efficiency and latency, compared to other devices. ...
Article
Full-text available
Processing-in-memory (PIM) is gaining tremendous research and commercial interest because of its potential to replace the von Neumann bottleneck in current computing architectures. In this study, we implemented a PIM hardware architecture (circuit) based on the charge-trap flash (CTF) as a synaptic device. The PIM circuit with a CT memory performed exceedingly well by reducing the inference energy in the synapse array. To evaluate the image recognition accuracy, a Visual Geometry Group (VGG)-8 neural network was used for training, using the Canadian Institute for Advanced Research (CIFAR)-10 dataset for off-chip learning applications. In addition to the system accuracy for neuromorphic applications, the energy efficiency, computing efficiency, and latency were closely investigated in the presumably integrated PIM architecture. Simulations that were performed incorporated cycle-to-cycle device variations, synaptic array size, and technology node scaling, along with other hardware-sense considerations.
... The column-routed bit lines and straight-shaped poly help to reduce the layout area of the bit-cell by 9.5% when compared to 8T SRAM (3.24 μm²). 26 A total area of 634 μm² is saved in the 64 × 64 array of the proposed IMC architecture by using the AO-8T SRAM bit-cell, at the cost of a slight increase in bit line capacitance. The D-FF2 is used to latch the computation result and then write it back to the target row. ...
Article
Full-text available
In-Memory Computing (IMC) is an emerging paradigm that aims to shift computational workload away from CPUs. The bit-serial IMC architecture suffers from larger latency when performing logic and arithmetic operations. In this paper, a general-purpose, energy-efficient Bit Parallel IMC Architecture (BP-IMCA) based on Area-Optimized (AO-8T) static random access memory (SRAM) bit-cell is proposed to perform In-Memory Boolean Logic Computation (IMBC) and Near-Memory Arithmetic (NMA) operations with variable bit-width from 1- to 8-bit. The decoupled read/write paths of the employed AO-8T SRAM bit-cell eliminate compute disturbance during IMBC and NMA operations. A self-terminating read word line decoding scheme is proposed to disconnect the RBL discharging path from GND, which decreases the energy consumption of the proposed IMC architecture by 27.71% at 1 V for IMBC operations. In addition to this, a VREF-based Low-offset Symmetric Differential Sense Amplifier (LSDSA) is proposed to achieve fast and reliable sensing for both normal read and IMBC operations in the proposed IMC architecture. Further, a 4Kb SRAM array is implemented in 65-nm technology to analyze the IMC architecture at a supply voltage of 1 V. An operating frequency of 1,355 MHz and average energy consumption of 7.04 fJ/bit is achieved during logic (IMBC) operations. The 8-bit addition and 8-bit multiplication operations achieve an energy efficiency of 11.1 TOPS/W and 2.28 TOPS/W, respectively, at 1 V and 970 MHz. Cumulatively, the proposed architecture achieves the lowest figure of merit compared to the state-of-the-art IMC architectures.
... The emerging computing-in-memory (CIM) techniques address these shortcomings by performing the MAC operations directly upon reading the synaptic weights from the memory [5], [6], [7], [8], [9], [10], [11], [12]. The CIM circuit designs usually convert the ANN's digital inputs into analog signals to control the read wordlines (RWLs) of the memory cells whose responses are added in an analog way on the read bitlines (RBLs) to produce the analog MAC results. ...
Article
Full-text available
This article demonstrates the first functional neuromorphic spiking neural network (SNN) that processes the time-to-first-spike (TTFS) encoded analog spiking signals with the second-order leaky integrate-and-fire (SOLIF) neuron model to achieve superior biological plausibility. An 8-kb SRAM macro is used to implement the synapses of the neurons to enable analog computing in memory (ACIM) operation and produce current-type dendrite signals of the neurons. A novel low-leakage 8T (LL8T) SRAM cell is proposed for implementing the SRAM macro to reduce the read leakage currents on the read bitlines (RBLs) when performing ACIM. Each neuron's soma is implemented with low-power analog circuits to realize the SOLIF model for processing the dendrite signals and generating the final analog output spikes. No data converters are required in our design by virtue of analog computing's nature. A test chip implementing the complete output layer of the proposed SNN was fabricated in 90-nm CMOS. The active area is 553.4 × 118.6 μm². The measurement results show that our SNN implementation achieves an average inference latency of 196 ns and an inference accuracy of 81.4%. It consumes 242 μW with an energy efficiency of 4.74 pJ/inference/neuron.
... In this situation, signals are sampled in the analog domain [14] and the computation speed is required to keep up with the sensor sampling speed. Another field which shows their advantage is sparse lightweight networks [15], [16], a main category under neural networks. In contrast to digital SRAM-based CIMs, which trade more area and computing time for higher precision [17], [18], [19], [20], analog-mixed-signal SRAM-based CIMs perform small kernel convolution computations in a single cycle, significantly reducing computation time at the cost of a minor recognition rate loss [21], [22], [23]. ...
Article
Full-text available
In this paper, we present an analog-mixed-signal 6T SRAM computing-in-memory (CIM) macro. The macro uses dual-wordline 6T bitcells to reduce power consumption and write-disturb issues. The macro also proposes an analog computation logic circuit for high precision, energy efficient charge-domain computation. The bitcell structure combined with the analog computation logic circuit allows direct input of signed activations and weights to the chip for full signed computation. The proposed macro consists of four CIM blocks, each with four 32x8 compute blocks, a pulse generator, an analog computation logic circuit and a SAR-ADC. Fabricated in a 55 nm process, our CIM macro test chip achieves an energy efficiency of 7.3 TOPS/W. A comprehensive computing test that encompasses the entire range of inputs and weights has been conducted. The results show that the CIM macro test chip can achieve a precision of 79.51% in a 1-FE error range of 71.88%. The target application of the proposed CIM macro is lightweight neural networks, this is demonstrated by mapping a pre-trained network into the macro and achieving a recognition accuracy of 92.28% on the CIFAR-10 dataset. The design surpasses existing designs in comprehensive consideration of energy efficiency, technology and bit width.
... However, the above MAC operation only performs binary multiplication, not the multi-bit multiplication typically required for most real-world applications. To solve this issue, several techniques are actively researched [5]–[7], [9], [10], [14]–[23], where some of the main ideas with 4-bit resolution are illustrated in Fig. 2. The most common approaches to multi-bit expressions based on SRAM bit cells group multiple bit cells and assign MSB to LSB accordingly. Case I shown in Fig. 2 is one such example, where the read access transistors are sized in a binary-weighted manner so that the read current changes linearly depending on their sizes. ...
Article
Full-text available
Process-in-memory (PIM) is an emerging computing paradigm to overcome the energy bottleneck inherent in conventional computing platforms. While PIM utilizes several types of memory elements, SRAM-based PIM has been researched extensively for its high scalability and feasibility using CMOS process technology. In this work, we propose an 8T SRAM-based process-in-memory system with current mirrors for accurate MAC operation. To resolve the nonlinearity issue inherent in current-based SRAM MAC macros, we utilize cascaded current mirrors. Furthermore, to enable more precise MAC operation, we allocate RWL timing pulses, which drive the 8T SRAM bitcells, in the middle of the timing duration. Our PIM architecture realizes 4-bit input, 4-bit weight and 4-bit output precisions. A prototype chip has been fabricated using TSMC 65nm GP process technology. Measurement results prove the MAC operation linearity of the designed 8T SRAM-based PIM system, with a throughput of 0.29 GOPS and an energy efficiency of 1.05 TOPS/W. Software-level analysis shows that our design can achieve up to 98.02 percent accuracy for MNIST dataset classification and 94.93 percent accuracy for CIFAR-10 dataset classification.
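The multi-bit schemes mentioned in the surrounding context (grouping bit cells and assigning MSB to LSB) reduce, in the ideal case, to a bit-plane decomposition of the dot product. The snippet below shows that decomposition in software for unsigned 4-bit operands; it is a mathematical illustration, not the current-mirror circuit of the cited chip.

```python
import numpy as np

def multibit_mac_bitplanes(x, w, x_bits=4, w_bits=4):
    """Multi-bit MAC from binary dot products: each (input bit-plane, weight
    bit-plane) pair contributes a binary MAC weighted by 2^(i+j)."""
    x = np.asarray(x)
    w = np.asarray(w)
    acc = 0
    for i in range(x_bits):
        xi = (x >> i) & 1                         # i-th input bit-plane
        for j in range(w_bits):
            wj = (w >> j) & 1                     # j-th weight bit-plane
            acc += (1 << (i + j)) * int(np.dot(xi, wj))
    return acc

x = np.array([3, 9, 14, 1])                       # unsigned 4-bit activations
w = np.array([7, 2, 11, 5])                       # unsigned 4-bit weights
print(multibit_mac_bitplanes(x, w), int(np.dot(x, w)))   # both print 198
```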
... CIM minimizes power consumption by localizing runtime memory and performing computation directly within memory at the expense of low memory density [69], [70], [71]. For instance, to accommodate the need for both compute and data storage, a typical 6T SRAM is modified by adding an additional pair of transistors to decouple each SRAM cell sharing one bit-line, addressing the read-disturb issue during a compute cycle at the expense of reduced density [72]. Alternatively, novel non-volatile memory-based CIM solutions have also been proposed to increase density and reduce standby energy consumption. ...
Article
The tutorial explores key security and functional safety challenges for Artificial Intelligence (AI) in embedded automotive systems, including aspects from adversarial attacks, long life cycles of products, and limited energy resources of automotive platforms within safety-critical environments in diverse use cases. It provides a set of recommendations for how the security and safety engineering of machine learning can address these challenges. It also provides an overview of contemporary security and functional safety engineering practices, encompassing up-to-date legislative and technical prerequisites. Finally, we identify the role of AI edge processing in enhancing security and functional safety within embedded automotive systems.
... Recently, there has been a surge in the introduction of SRAM-based PIM solutions for energy-efficient DNN processing. The proposed PIM architecture, leveraging the SRAM bitcell, offers not only commendable processing speed but also logic compatibility [1][2][3][4]. Nonetheless, the SRAM bitcell encounters limitations due to its reduced integration potential arising from the bitcell size. Additionally, for ensuring stable multiply-accumulate (MAC) operations, supplementary transistors and bitlines become prerequisites [5,6]. ...
Article
Full-text available
This paper introduces an n-type pseudo-static gain cell (PS-nGC) embedded within dynamic random-access memory (eDRAM) for high-speed processing-in-memory (PIM) applications. The PS-nGC leverages a two-transistor (2T) gain cell and employs an n-type pseudo-static leakage compensation (n-type PSLC) circuit to significantly extend the eDRAM’s retention time. The implementation of a homogeneous NMOS-based 2T gain cell not only reduces write access times but also benefits from a boosted write wordline technique. In a comparison with the previous pseudo-static gain cell design, the proposed PS-nGC exhibits improvements in write and read access times, achieving 3.27 times and 1.81 times reductions in write access time and read access time, respectively. Furthermore, the PS-nGC demonstrates versatility by accommodating a wide supply voltage range, spanning from 0.7 to 1.2 V, while maintaining an operating frequency of 667 MHz. Fabricated using a 28 nm complementary metal oxide semiconductor (CMOS) process, the prototype features an efficient active area, occupying a mere 0.284 µm² per bitcell for the 4 kb eDRAM macro. Under various operational conditions, including different processes, voltages, and temperatures, the proposed PS-nGC of eDRAM consistently provides speedy and reliable read and write operations.
... Previous works focus on accelerating graph learning applications, such as classification [39] and graph mining [40], but rarely address high-level graph reasoning applications like graph memorization and reconstruction. [41], [42] inspire the possibility of interconnecting multiple FPGAs or heterogeneous systems for larger scale tasks, while [43]- [47] address the ever-increasing memory/storage requirement. Hyperdimensional computing (HDC) has recently shown much more potential in graph applications compared to traditional machine learning methods such as CNN or GNN [6], [19], [48], [49]. ...
Article
The successful implementation of artificial intelligence algorithms depends on the capacity to execute numerous repeated operations, which, in turn, requires systems with high data throughput. Although emerging computing‐in‐memory (CIM) eliminates the need for frequent data transfer between the memory and processing blocks and enables parallel activation of multiple rows, the traditional structure, where each row has only one identical input value, significantly limits its further application. To solve this problem, this study proposes a dual‐SRAM CIM architecture in which two SRAM arrays are coupled such that all operands are different, thus rendering the use of CIM considerably more flexible. The proposed dual‐SRAM array was implemented through a 55‐nm process, essentially delivering a frequency of 361 MHz for a 1.2‐V supply and energy efficiency of 161 TOPS/W at 0.9 V supply.
Article
This paper presents an ultra-low-power Spiking Neural Network co-Processor (RSNNP) with a reconfigurable computing-in-memory (CiM) cell array, leveraging both differential-pair SRAM and memristive memory. The proposed RSNNP introduces adaptability through three reconfigurable strategies: supporting both volatile (SRAM) and non-volatile (memristive) memory cell, free-extending the CiM array scale, and flexible-switching between rate and temporal coding schemes. A prototype chip, fabricated using 55nm technology, achieves exceptional efficiency, with power consumption as low as 0.057μW (@SRAM) and 0.0102μW (@memristive) for a 24fps image recognition task. Experimental results demonstrate the RSNNP's superiority in adapting to different SNN tasks and outperforming state-of-the-art SNN processors in power efficiency with lower power consumption. This work marks a significant step toward energy-efficient and adaptable SNN computation, suitable for edge computing and neuromorphic hardware applications.
Preprint
Full-text available
Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method to adapt large language models (LLMs) for downstream tasks. In this paper, we first propose to deploy the LoRA-finetuned LLMs on the hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights onto RRAM and LoRA onto SRAM). To address performance degradation from RRAM's inherent noise, we design a novel Hardware-aware Low-rank Adaption (HaLoRA) method, aiming to train a LoRA branch that is both robust and accurate by aligning the training objectives under both ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving up to 22.7 improvement in average score while maintaining robustness at various noise levels.
Article
Compute-in-memory (CIM) with high parallelism and energy efficiency is an attractive solution for next-generation orbital edge computing. However, the reliability of the CIM circuits in radiation environment is rarely studied. In this paper, we present an analog CIM macro based on Static Random-Access Memory (SRAM) in a 55-nm CMOS process, featuring multiply-accumulate (MAC) operations. The performance degradation including computational linearity and MAC operation accuracy is evaluated by means of γ-ray irradiation experiment. The mechanism of radiation effect of the analog CIM is revealed and a Radiation-Induced Calculation Error (RICE) model is proposed. The simulation results show that the accuracy of MAC operation decreases by about 5% after irradiation to 250 krad(Si), in good agreement with the measurement results. This paper provides guidance for the design and application of SRAM-based CIM processors in the radiation environment.
Conference Paper
Full-text available
In this paper, we propose a 6T SRAM-based all-digital compute-in-memory (CIM) macro for multi-bit multiply-and-accumulate (MAC) operations. We propose a novel 2T bitwise multiplier, which is a direct improvement over the previously proposed 4T NOR gate-based multiplier. The 2T multiplier also eliminates the need to invert the input bits, which is required when using NOR gates for multipliers. We propose an efficient digital MAC computation flow based on a barrel shifter, which significantly reduces the latency of the shift operation. This brings down the overall latency incurred while performing MAC operations to 13ns/25ns for 4b/8b operands (in 65nm CMOS @ 0.6V), compared to 10ns/18ns (in 22nm CMOS @ 0.72V) of the previous work. The proposed CIM macro is fully re-configurable in weight bits (4/8/12/16) and input (4/8) bits. It can perform concurrent MAC and weight update operations. Moreover, its fully digital implementation circumvents the challenges associated with analog CIM macros. For MAC operation with 4b weight and input, the macro achieves 24 TOPS/W at 1.2 V and 81 TOPS/W at 0.7 V. When using low-threshold-voltage transistors in the 2T multiplier, the macro works reliably even at 0.6V while achieving 101 TOPS/W.
Article
Full-text available
The compute-in-memory (CIM) which embeds computation inside memory is an attractive scheme to circumvent von Neumann bottlenecks. This study proposes a logic-compatible embedded DRAM architecture that supports data storage as well as versatile digital computations. The proposed configurable memory unit operates in three modes: (1) memory mode in which it works as a normal dynamic memory, (2) logic–arithmetic mode where it performs bit-wise Boolean logic and full adder operations on two words stored within the memory array, and (3) convolution mode in which it executes digitally XNOR-and-accumulate (XAC) operation for binarized neural networks. A 1.0-V 4096-word × 8-bit computational DRAM implemented in a 45-nanometer CMOS technology performs memory, logic and arithmetic operations at 241, 229, and 224 MHz while consuming the energy of 7.92, 8.09, and 8.19 pJ/cycle. Compared with conventional digital computing, it saves energy and latency of the arithmetic operation by at least 47% and 46%, respectively. For VDD = 1.0 V, the proposed CIM unit performs two 128-input XAC operations at 292 MHz with an energy consumption of 20.8 pJ/cycle, achieving 24.6 TOPS/W. This marks at least 11.9× better energy efficiency and 38.8× better delay, thereby achieving at least 461× better energy-delay product than traditional 8-bit wide computing hardware.
Article
Artificial neural networks have led to a higher computational burden, complicating inference tasks on low-power edge devices. Spiking neural network (SNN), which leverages sparse spikes for computation and data transmission, is an effective energy-efficient computing technique. However, the length of spike sequences in SNN varies significantly depending on the input coding method, among which rate coding still results in substantial data movement. A highly energy-efficient SNN accelerator with a time-domain CIM processor is proposed with three key features: 1) time-domain bitcell array for high linearity with lower energy, reducing power consumption by 58.6% compared to an inverter-chain architecture, 2) time-domain multi-bit accumulate for assisting multi-bit weights without an analog-to-digital converter, achieving a 47.2% reduction of domain-conversion energy, 3) analog precision reconstruction unit for supporting phase coding. The proposed TS-CIM is designed in 65 nm CMOS technology and achieves 701.7 TOPS/W energy efficiency, marking a 1.58× enhancement compared to the state-of-the-art SNN CIM.
Article
The in-memory computation (IMC) is a potential technique to improve the speed and energy efficiency of data-intensive designs. However, the scalability of IMC to large systems is hindered by the non-linearities of analog multiply-and-accumulate (MAC) operations and process variation, which impacts the precision of high bit-width MAC operations. In this paper, we present an IMC architecture that is capable of performing multi-bit MAC operations with improved speed, linearity, and computational accuracy. To improve the speed/linearity of the IMC-MAC operations, the image and weight data are applied by using the pulse amplitude modulation (PAM) and thermometric techniques, respectively. Although the PAM technique improves the speed of the IMC-MAC operations, it has linearity issues that need to be addressed. Based on the detailed linearity analysis of the IMC-MAC circuit, we proposed two approaches to improve the linearity and the signal margin (SM) of the IMC architecture. The proposed configurable current steering thermometric digital-to-analog converter (CST-DAC) array is employed to provide the PAM signals with various dynamic ranges and non-linear gaps that are required to improve the linearity/SM. The proposed combined PAM and thermometric IMC (PT-IMC) architecture is designed and fabricated in the TSMC 180-nm CMOS process. The post-silicon calibration of the design point mitigates the process-variation issues and provides the maximum SM (close to the simulation results). Furthermore, the proposed PT-IMC architecture performs MNIST/CIFAR-10 data set classification with an accuracy of 98%/88%. In addition, the PT-IMC architecture achieves a peak throughput of 12.41 GOPS, a normalized energy efficiency of 30.64 TOPS/W, a normalized figure-of-merit (FOM) of 3039, a loss in the SM of 8.3% with respect to the ideal SM, and a computational error of 0.41%.
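Behaviorally, the thermometric weight coding described above amounts to building each multi-level weight from identical unit cells, which the PAM input amplitude then scales. The toy model below (illustrative names, with an assumed 15-level weight) shows that decomposition; it says nothing about the CST-DAC circuit itself.

```python
def thermometer_encode(level, n_units=15):
    """Unary (thermometer) code: `level` of `n_units` identical unit cells turned on."""
    return [1] * level + [0] * (n_units - level)

def pam_thermometer_mac(input_amplitudes, weight_levels, n_units=15):
    """Each input amplitude (PAM) scales the on-count of its weight's unit cells."""
    acc = 0.0
    for x, w in zip(input_amplitudes, weight_levels):
        acc += x * sum(thermometer_encode(w, n_units))   # equals x * w, built from unit cells
    return acc

print(pam_thermometer_mac([0.2, 0.8, 1.0], [3, 7, 15]))  # 0.2*3 + 0.8*7 + 1.0*15 = 21.2
```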
Article
In this letter, we present an analog compute-in-memory (CIM) macro design which incorporates near-CIM analog memory and nonlinearity activation unit (NAU) to alleviate the DAC/ADC power bottleneck. Fully differential analog memory is designed with switched capacitor storage circuits. Activation function, e.g., rectified linear unit, is also performed in analog domain in NAU. The CIM macro is fabricated using TSMC 55-nm technology, with a peak macro-level efficiency of 44.3 TOPS/W and a system energy efficiency of 27.7 TOPS/W for analog input and output with 4-bit weight. The near-CIM analog memory and NAU solution brings 76.0% energy reduction compared with DAC/ADC solution, which contributes 1.34× to 2.37× energy efficiency improvement.
Article
This article presents a novel dual 7T static random-access memory (SRAM)-based compute-in-memory (CIM) macro for processing quantized neural networks. The proposed SRAM-based CIM macro decouples read/write operations and employs a zero-input/weight skipping scheme. A 65nm test chip with 528 × 128 integrated dual 7T bitcells demonstrated reconfigurable precision multiply and accumulate operations with 384 binary inputs (0/1) and 384 × 128 programmable multi-bit weights (3/7/15-levels). Each column comprises 384 bitcells for a dot product, 48 bitcells for offset calibration, and 96 bitcells for binary-searching analog-to-digital conversion. The analog-to-digital converter (ADC) converts a voltage difference between two read bitlines (i.e., an analog dot-product result) to a 1-6b digital output code using binary searching in 1-6 conversion cycles using replica bitcells. The test chip with 66Kb embedded dual SRAM bitcells was evaluated for processing neural networks, including MNIST image classification using a multi-layer perceptron (MLP) model with a layer configuration of 784-256-256-256-10. The measured classification accuracies are 97.62%, 97.65%, and 97.72% for the 3, 7, and 15 level weights, respectively. The accuracy degradations are only 0.58 to 0.74% off the baseline with software simulations. For the VGG6 model using the CIFAR-10 image dataset, the accuracies are 88.59%, 88.21%, and 89.07% for the 3, 7, and 15 level weights, with degradations of only 0.6 to 1.32% off the software baseline. The measured energy efficiencies are 258.5, 67.9, and 23.9 TOPS/W for the 3, 7, and 15 level weights, respectively, measured at 0.45/0.8V supplies.
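For contrast with the ramp conversion of the main paper, the binary-searching conversion described above can be modeled as a SAR-style loop with one comparison per output bit. The voltage range and midpoint references below are illustrative assumptions, not values from the chip.

```python
def binary_search_adc(v_diff, n_bits=6, v_min=-1.0, v_max=1.0):
    """SAR-style conversion: one comparison per output bit, halving the
    reference window each cycle (N cycles for N bits vs. 2^N - 1 for a ramp)."""
    lo, hi = v_min, v_max
    code = 0
    for _ in range(n_bits):
        mid = 0.5 * (lo + hi)            # reference set by replica bitcells in the actual macro
        bit = int(v_diff > mid)
        code = (code << 1) | bit
        lo, hi = (mid, hi) if bit else (lo, mid)
    return code

print(binary_search_adc(0.37))           # 6 comparisons instead of 63 ramp steps
```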
Article
Artificial intelligence workloads demand a wide range of multiply and accumulate (MAC) precision. Pitch-matching constraints in compute-in-memory (CIM) engines limit the analog-to-digital converter (ADC) precision to about 8 bits. This letter demonstrates a method of mapping a suitable input conditioned MAC range to the input dynamic range of the on-chip 7-b ADC, thereby achieving up to 10 bits of output MAC precision. A 424Kb SRAM CIM macro was fabricated in TSMC 28nm, which computes 72 MACs in parallel per cycle. Measurement results at nominal supply voltage show an energy efficiency of 196.6–102 TOPS/W/b for a 2–10 bit output MAC precision. Inference results on MNIST, CIFAR10, and CIFAR100 are shown with ≤1% accuracy loss from the software baseline.
Article
This work presents the implementation of a 6T SRAM-based array that computes matrix vector multiplication in binarized neural networks. A 6T SRAM bitcell with PMOS access transistors is proposed, which mitigates the read disturb issue that is attributed to the conventional 6T-SRAM bitcell. The degradation in classification accuracy is minimized by the lower mobility of PMOS access devices and by connecting custom Metal-Oxide-Metal capacitors to the bitlines in compute-mode. A single slope ADC is also proposed that enables the macro to compute partial multiply and accumulate value with 5b output precision. The estimated inference accuracy on MNIST and CIFAR-10 datasets is 96.5% and 87.2%, respectively. The macro consumes 1.01 (6.30) fJ of energy per operation to compute 1b-MAC (5b-MAC), and thus achieves a simulated energy efficiency of 984 (157.6) TOPS/W and a compute density of 242 (15.16) TOPS/mm² at 370 MHz in a TSMC 65nm LP CMOS process.
Article
This article presents a low-cost PMOS-based 8T (P-8T) static random access memory (SRAM) compute-in-memory (CIM) macro that efficiently reduces the hardware cost associated with a digital-to-analog converter (DAC) and an analog-to-digital converter (ADC). By utilizing the bitline (BL) charge-sharing technique, the area and power consumption of the proposed DAC have been reduced while achieving similar conversion linearity compared to a conventional DAC. The BL charge-sharing also facilitates the multiply-accumulate (MAC) operation to produce variation-tolerant and linear outputs. To reduce ADC area and power consumption, a 4-bit coarse-fine flash ADC has been collaboratively used with an in-SRAM reference voltage generation, where the ADC reference voltages are generated in the same way as the MAC operation mechanism. Moreover, to find the suitable ADC sample range and resolution for our CIM macro, a CIM noise-considered accuracy simulation has been conducted. Based on the simulation results, a 4-bit ADC resolution with a cutoff ratio of 0.5 is chosen, maintaining high accuracy. The 256 × 80 P-8T SRAM CIM prototype chip has been fabricated in a 28-nm CMOS process. By leveraging charge-domain computing, the proposed CIM operates in a wide range of supply voltage from 0.6 to 1.2 V with an energy efficiency of 50.1 TOPS/W at 0.6 V. The accuracies of 91.26% and 65.20% are measured for CIFAR-10 and CIFAR-100 datasets, respectively. Compared to the state-of-the-art SRAM CIM works, this work improves energy efficiency by 1.2× and area efficiency by 6.5× due to the reduced analog circuit hardware costs.
Article
The deployment of neural networks on edge devices has created a growing need for energy-efficient computing. In this paper, we propose an all-digital standard cell-based time-domain compute-in-memory (TDCIM) macro for binary neural networks (BNNs) that is compatible with commercial digital design flow. The TDCIM macro utilizes multiple computing chains that share one threshold chain, and supports double-edge operation, parallel computing and data reuse. Time-domain wave-pipelining technique is introduced to enhance throughput while preserving accuracy. Regular placement (RP) and custom routing (CR) are employed during place and route (P&R) to reduce systematic variations. We show computing delay, POOL computation accuracy, and network test accuracy at different voltages, indicating that the proposed TDCIM macro can maintain high accuracy under PVT variations. We implemented two versions of the TDCIM macro in 22nm FDSOI technology using foundry-provided delay cells DLY40 and DLY60, respectively. At a voltage of 0.5V, the TDCIM macro achieved an energy efficiency of 1.2 (1.05) POPS/W for DLY40 (DLY60), while maintaining a baseline accuracy of 98.9% on the MNIST dataset for both designs.
Article
A reconfigurable cognitive computation matrix (RCCM) in static random access memory (SRAM) suitable for sensor edge applications is proposed in this article. The proposed RCCM can take multiple analog currents or digital integers as the input vector and perform vector-matrix multiplication with a weight integer matrix. The RCCM can carry out 1-quadrant, 2-quadrant, or 4-quadrant multiplications in the analog domain. Therefore, the digital integers for the inputs or weights stored in the SRAM can be either signed or unsigned, providing extensive usage flexibilities. Furthermore, three commonly used activation functions (AFs), the rectified linear unit (ReLU), radial basis function (RBF), and logistic function are available, converting multiply–accumulation outputs to single-ended currents as the computation results. The resultant output currents can be adopted as the input currents of other RCCMs to facilitate multiple-layer network implementation. A concept-proving prototype chip, including a 16 × 16 RCCM with 4-bit input and weight resolutions, is designed and fabricated in a 0.18-μm CMOS process. The computation accuracy that is deteriorated by process variation can be significantly improved by adopting 48 mismatch parameters after calibration. A handwritten digit recognition database, MNIST, is employed to evaluate the chip performance, achieving an average efficiency of 3.355 TOPS/W.
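The three activation functions named above are standard; purely as a reference for their shapes when applied to MAC outputs, a compact software definition is given below (the center and gamma of the RBF are assumed values).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbf(x, center=0.0, gamma=1.0):
    return np.exp(-gamma * (x - center) ** 2)

mac_outputs = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])   # illustrative MAC results
for fn in (relu, logistic, rbf):
    print(fn.__name__, np.round(fn(mac_outputs), 3))
```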
Article
In this work, we design and implement a 1-Mb resistive random access memory (RRAM) processing-in-memory (PIM) chip based on a 180-nm CMOS technology. In this design, a time-division multiplexing (TDM) circuit along with a sparsity-aware sense amplifier (SA) and an asynchronous counter module are proposed to free the chip from the digital-to-analog converter (DAC) and analog-to-digital converter (ADC). A sparsity-aware input module (SAIM) is designed to improve computational efficiency through bit-level input sparsity detection. A technique based on quantization-aware training (QAT), dynamically reconfigurable shifters (RecSTRs), and tree adders (TAs) is used to achieve system reconfigurability for 1–8-bit input, 1–8-bit weight, and 6–22-bit output. With this technique, optimized quantization to 4-bit weight 4-bit activation (W4A4) can reduce the number of network parameters to 1/8 of that required for the 32-bit floating-point (FP32) version. The number of calculation cycles can also be reduced to 1/4 of that of the FP32 version. This design has achieved a weight density of 13.32 Mb/mm² normalized to the 22-nm node and an energy efficiency of 17.36 TOPS/W for 4-bit integer (INT4) activation and weight.
Article
Full-text available
A compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro, named CAP-RAM, is presented for energy-efficient convolutional neural network (CNN) inference. It leverages a novel charge-domain multiply-and-accumulate (MAC) mechanism and circuitry to achieve superior linearity under process variations compared to conventional IMC designs. The adopted semi-parallel architecture efficiently stores filters from multiple CNN layers by sharing eight standard 6T SRAM cells with one charge-domain MAC circuit. Moreover, up to six levels of bit-width of weights with two encoding schemes and eight levels of input activations are supported. A 7-bit charge-injection SAR (ciSAR) analog-to-digital converter (ADC) getting rid of sample and hold (S&H) and input/reference buffers further improves the overall energy efficiency and throughput. A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM. A single 512 × 128 macro stores a complete pruned and quantized CNN model to achieve 98.8% inference accuracy on the MNIST data set and 89.0% on the CIFAR-10 data set, with a 573.4-giga operations per second (GOPS) peak throughput and a 49.4-tera operations per second (TOPS)/W energy efficiency.
Article
Full-text available
In this work, we review Binarized Neural Networks (BNNs). BNNs are deep neural networks that use binary values for activations and weights, instead of full precision values. With binary values, BNNs can execute computations using bitwise operations, which reduces execution time. Model sizes of BNNs are much smaller than their full precision counterparts. While the accuracy of a BNN model is generally less than full precision models, BNNs have been closing accuracy gap and are becoming more accurate on larger datasets like ImageNet. BNNs are also good candidates for deep learning implementations on FPGAs and ASICs due to their bitwise efficiency. We give a tutorial of the general BNN methodology and review various contributions, implementations and applications of BNNs.
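The bitwise trick underlying BNN execution, mapping ±1 values onto bits so a dot product becomes XNOR plus popcount, can be written in a few lines; this is a generic illustration rather than any particular accelerator's datapath.

```python
import numpy as np

def xnor_popcount_dot(x_bits, w_bits):
    """Binary dot product with +/-1 values stored as bits (1 -> +1, 0 -> -1):
    XNOR counts agreements, and 2 * popcount - N recovers the signed result."""
    agree = np.logical_not(np.logical_xor(x_bits, w_bits))   # bitwise XNOR
    return 2 * int(np.sum(agree)) - len(x_bits)

x = np.array([1, 0, 1, 1, 0, 0, 1, 0])
w = np.array([1, 1, 0, 1, 0, 1, 1, 0])
print(xnor_popcount_dot(x, w), int(np.dot(2 * x - 1, 2 * w - 1)))   # both print 2
```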
Article
Full-text available
A multi-functional in-memory inference processor integrated circuit (IC) in a 65-nm CMOS process is presented. The prototype employs a deep in-memory architecture (DIMA), which enhances both energy efficiency and throughput over conventional digital architectures via simultaneous access of multiple rows of a standard 6T bitcell array (BCA) per precharge, and embedding column pitch-matched low-swing analog processing at the BCA periphery. In doing so, DIMA exploits the synergy between the dataflow of machine learning (ML) algorithms and the SRAM architecture to reduce the dominant energy cost due to data movement. The prototype IC incorporates a 16-kB SRAM array and supports four commonly used ML algorithms--the support vector machine, template matching, k-nearest neighbor, and the matched filter. Silicon measured results demonstrate simultaneous gains (dot product mode) in energy efficiency of 10x and in throughput of 5.3x leading to a 53x reduction in the energy-delay product with negligible (≤1%) degradation in the decision-making accuracy, compared with the conventional 8-b fixed-point single-function digital implementations.
Article
Full-text available
A versatile reconfigurable accelerator architecture for binary/ternary deep neural networks is presented. In-memory neural network processing without any external data accesses, sustained by the symmetry and simplicity of binary/ternary neural network computation, improves energy efficiency dramatically. The prototype chip is fabricated and achieves 1.4 TOPS (tera operations per second) peak performance with 0.6-W power consumption at a 400-MHz clock. An application examination is also conducted.
Article
This paper presents a mixed-signal SRAM-based in-memory computing (IMC) macro for processing binarized neural networks. The IMC macro consists of 128 × 128 (16K) SRAM-based bitcells. Each bitcell consists of a standard 6T SRAM bitcell, an XNOR-based binary multiplier, and a pseudo-differential voltage-mode driver (i.e., an accumulator unit). Multiply-and-accumulate (MAC) operations between 64 pairs of inputs and weights (stored in the first 64 SRAM bitcells) are performed in 128 rows of the macro, all in parallel. A weight-stationary architecture, which minimizes off-chip memory accesses, effectively reduces energy-hungry data communications. A row-by-row analog-to-digital converter (ADC) based on 32 replica bitcells and a sense amplifier reduces the ADC area overhead and compensates for nonlinearity and variation. The ADC converts the MAC result from each row to an N-bit digital output, taking 2^N − 1 cycles per conversion by sweeping the reference level of the 32 replica bitcells. The remaining 32 replica bitcells in the row are utilized for offset calibration. In addition, this paper presents a pseudo-differential voltage-mode accumulator to address issues in current-mode or single-ended voltage-mode accumulators. A test chip including a 16-kb SRAM IMC bitcell array is fabricated using a 65-nm CMOS technology. The measured energy and area efficiency are 741–87 TOPS/W with a 1–5 bit ADC at a 0.5-V supply and 3.97 TOPS/mm², respectively.
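The reference-sweep conversion described above can be modeled behaviorally as 2^N − 1 comparisons against monotonically stepped references, with the count of "above reference" decisions forming the output code. The sketch below assumes an ideal linear reference ramp rather than replica-bitcell levels, so it is only an illustration of the cycle count and code formation.

```python
# Behavioral sketch of an N-bit reference-sweep conversion: the analog MAC voltage is
# compared against 2^N - 1 stepped reference levels over 2^N - 1 cycles, and the
# number of "above reference" decisions is the output code (thermometer -> binary).
# Idealized: real reference levels come from replica bitcells, not an ideal ramp.
def sweep_adc(v_in: float, n_bits: int, v_full_scale: float = 1.0) -> int:
    levels = 2 ** n_bits
    code = 0
    for k in range(1, levels):                # 2^N - 1 comparison cycles
        v_ref = k * v_full_scale / levels     # monotonically swept reference level
        code += int(v_in > v_ref)             # accumulate thermometer decisions
    return code                               # 0 .. 2^N - 1

for v in (0.03, 0.27, 0.52, 0.94):
    print(v, "->", sweep_adc(v, n_bits=5))
```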
Article
This article presents a computing-in-memory (CIM) structure aimed at improving the energy efficiency of edge devices running multi-bit multiply-and-accumulate (MAC) operations. The proposed scheme includes a 6T SRAM-based CIM (SRAM-CIM) macro capable of: 1) weight-bitwise MAC (WbwMAC) operations to expand the sensing margin and improve the readout accuracy for high-precision MAC operations; 2) a compact 6T local computing cell to perform multiplication with suppressed sensitivity to process variation; 3) an algorithm-adaptive low MAC-aware readout scheme to improve energy efficiency; 4) a bitline header selection scheme to enlarge signal margin; and 5) a small-offset margin-enhanced sense amplifier for robust read operations against process variation. A fabricated 28-nm 64-kb SRAM-CIM macro achieved access times of 4.1-8.4 ns with energy efficiency of 11.5-68.4 TOPS/W, while performing MAC operations with 4- or 8-b input and weight precision.
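As a dataflow-level illustration of weight-bitwise MAC, the sketch below splits a multi-bit weight vector into binary bit planes, accumulates each plane separately, and recombines the partial sums with shift-and-add; the sensing-margin and readout circuits of the macro are not modeled, and all names are illustrative.

```python
# Sketch of a weight-bitwise MAC: a multi-bit weight vector is split into binary bit
# planes, each plane is multiply-accumulated against the inputs separately (keeping the
# per-plane analog sum small), and the planes are recombined digitally with
# power-of-two weights. Illustrative dataflow only.
import numpy as np

def weight_bitwise_mac(x: np.ndarray, w: np.ndarray, w_bits: int = 4) -> int:
    """x: unsigned inputs; w: unsigned integers with `w_bits` precision."""
    total = 0
    for b in range(w_bits):
        plane = (w >> b) & 1                   # binary weight bit plane
        partial = int((x * plane).sum())       # per-plane MAC (smaller dynamic range)
        total += partial << b                  # digital shift-and-add recombination
    return total

rng = np.random.default_rng(8)
x = rng.integers(0, 16, 64)     # 4-bit inputs
w = rng.integers(0, 16, 64)     # 4-bit unsigned weights
assert weight_bitwise_mac(x, w) == int((x * w).sum())
print(weight_bitwise_mac(x, w))
```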
Article
In-memory computing establishes a new and promising computing paradigm aimed at solving problems caused by the von Neumann bottleneck. It eliminates the need for frequent data transfer between the memory and processing modules and enables the parallel activation of multiple lines. However, vertical data storage is generally required, increasing the implementation complexity of the SRAM writing mode. This article proposes a 10-transistor (10T) SRAM that avoids vertical data storage and improves the stability of in-memory computing. A cross-layout of the word line enables arrays with multirow or multicolumn parallel activation to perform vector logic operations in two directions. In addition, the novel horizontal read channel allows matrix transposition. By reconfiguring the data lines, sense amplifiers, and multiplexing read ports, the proposed SRAM can be regarded as a content-addressable memory (CAM), and its symmetry allows data search by column or by row, depending on the application, so it easily fits the SRAM storage mode without additional data adjustments. A proposed self-termination structure aims to decrease search energy consumption by approximately 38.5% at 0.9 V at the TT process corner. To verify the effectiveness of the proposed design, a 4-Kb SRAM was implemented in 28-nm CMOS technology. The read margin of the proposed 10T SRAM cell is three times higher than that of the conventional 6-transistor cell. At 0.9 V, logic operations can be performed at approximately 300 MHz, and binary CAM search operations are achieved at approximately 260 MHz with around 1 fJ of energy consumption per search/bit.
Article
This article (Colonnade) presents a fully digital bit-serial compute-in-memory (CIM) macro. The digital CIM macro is designed for processing neural networks with reconfigurable 1–16 bit input and weight precisions based on a bit-serial computing architecture and a novel all-digital bitcell structure. A column of bitcells forms a column MAC and is used for computing a multiply-and-accumulate (MAC) operation. The column MACs placed in a row work as a single neuron and compute a dot-product, which is an essential building block of neural network accelerators. Several key features differentiate the proposed Colonnade architecture from the existing analog and digital implementations. First, its fully digital circuit implementation is free from the process variation, noise susceptibility, and data-conversion overhead that are prevalent in prior analog CIM macros. A bitwise MAC operation in a bitcell is performed in the digital domain using a custom-designed XNOR gate and a full-adder. Second, the proposed CIM macro is fully reconfigurable in both weight and input precision from 1 to 16 bit. So far, most analog macros have been used for processing quantized neural networks with very low input/weight precisions, mainly due to a memory density issue. Recent digital accelerators have implemented reconfigurable precisions, but they are inferior in energy efficiency due to significant off-chip memory access. We present a regular digital bitcell array that is readily reconfigured to a 1–16 bit weight-stationary bit-serial CIM macro. The macro computes parallel dot-product operations between the weights stored in memory and inputs that are serialized from LSB to MSB. Finally, the bit-serial computing scheme significantly reduces the area overhead while sacrificing latency due to bit-by-bit operation cycles. Based on the benefits of digital CIM, reconfigurability, and the bit-serial computing architecture, Colonnade can achieve both high performance and energy efficiency (i.e., both benefits of prior analog and digital accelerators) for processing neural networks. A test chip with 128 × 128 SRAM-based bitcells for digital bit-serial computing is implemented using 65-nm technology and tested with 1–16 bit weight/input precisions. The measured energy efficiency is 117.3 TOPS/W at 1 bit and 2.06 TOPS/W at 16 bit.
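The bit-serial dataflow can be summarized in a short sketch: inputs are streamed LSB-first, each cycle contributes a 1-bit multiply-accumulate, and partial sums are weighted by the bit position. This is a behavioral model of the scheme, not the bitcell-level XNOR/full-adder implementation, and the array sizes are illustrative.

```python
# Sketch of a bit-serial dot product: inputs are streamed LSB-first, one bit per cycle;
# each cycle performs a 1-bit multiply-accumulate, and the partial sum is weighted
# by 2^cycle. Illustrative model of the dataflow only.
import numpy as np

def bit_serial_dot(x: np.ndarray, w: np.ndarray, in_bits: int) -> int:
    """x: unsigned integers with `in_bits` precision; w: integer weights."""
    acc = 0
    for b in range(in_bits):                    # one cycle per input bit, LSB to MSB
        x_bit = (x >> b) & 1                    # serialized input bit-plane
        acc += int((x_bit * w).sum()) << b      # partial sum, shifted by bit position
    return acc

rng = np.random.default_rng(2)
x = rng.integers(0, 16, 128)        # 4-bit inputs
w = rng.integers(-8, 8, 128)        # 4-bit signed weights
assert bit_serial_dot(x, w, in_bits=4) == int((x * w).sum())
print(bit_serial_dot(x, w, in_bits=4))
```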
Article
We present an energy-efficient processing-in-memory (PIM) architecture named Z-PIM that supports both sparsity handling and fully variable bit-precision in weight data for energy-efficient deep neural networks. Z-PIM adopts bit-serial arithmetic that performs a multiplication bit-by-bit through multiple cycles to reduce the complexity of the operation in a single cycle and to provide flexibility in bit-precision. To this end, it employs a zero-skipping convolution SRAM, which performs in-memory AND operations based on custom 8T-SRAM cells and channel-wise accumulations, and a diagonal accumulation SRAM that performs bit- and spatial-wise accumulation on the channel-wise accumulation results using diagonal logic and adders to produce the final convolution outputs. We propose a hierarchical bitline structure for energy-efficient weight-bit precharging and computational readout by reducing the parasitic capacitances of the bitlines. Its charge-reuse scheme reduces the switching rate by 95.42% for the convolution layers of the VGG-16 model. In addition, Z-PIM's channel-wise data mapping enables sparsity handling by skip-reading the input channels with zero weight. Its read-operation pipelining, enabled by read-sequence scheduling, improves throughput by 66.1%. The Z-PIM chip is fabricated in a 65-nm CMOS process on a 7.568-mm² die, consuming an average power of 5.294 mW at a 1.0-V supply and 200-MHz frequency. It achieves 0.31–49.12-TOPS/W energy efficiency for convolution operations as the weight sparsity and bit-precision vary from 0.1 to 0.9 and from 1 to 16 bit, respectively. For a figure of merit considering input bit-width, weight bit-width, and energy efficiency, Z-PIM shows more than a 2.1× improvement over state-of-the-art PIM implementations.
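A minimal sketch of the channel-wise zero-skipping idea, under the assumption that skipping simply means omitting the read/accumulate step for zero-weight channels; the shapes and sparsity level below are hypothetical, and the bit-serial and pipelining details are not modeled.

```python
# Sketch of channel-wise zero-skipping: input channels whose weight is zero are
# skipped, so the number of read/accumulate steps scales with weight density.
# Hypothetical shapes; only the skipping logic is the point.
import numpy as np

def sparse_channel_dot(x: np.ndarray, w: np.ndarray):
    """x: [C] activations, w: [C] weights (many zeros). Returns (dot, channels_read)."""
    acc, reads = 0, 0
    for c in range(len(w)):
        if w[c] == 0:
            continue                 # skip-read: a zero weight contributes nothing
        acc += int(x[c]) * int(w[c])
        reads += 1
    return acc, reads

rng = np.random.default_rng(3)
w = rng.integers(-8, 8, 256)
w[rng.random(256) < 0.8] = 0         # ~80% weight sparsity
x = rng.integers(0, 16, 256)
dot, reads = sparse_channel_dot(x, w)
assert dot == int((x * w).sum())
print(f"channels read: {reads}/256")
```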
Article
A novel 4T2C ternary embedded DRAM (eDRAM) cell is proposed for computing a vector-matrix multiplication in the memory array. The proposed eDRAM-based compute-in-memory (CIM) architecture addresses the well-known von Neumann bottleneck in the traditional computer architecture and improves both latency and energy in processing neural networks. The proposed ternary eDRAM cell takes a smaller area than prior SRAM-based bitcells using 6–12 transistors. Nevertheless, the compact eDRAM cell stores a ternary state (−1, 0, or +1), while the SRAM bitcells can only store a binary state. We also present a method to mitigate the compute accuracy degradation caused by device mismatches and variations. Besides, we extend the eDRAM cell retention time to 200 μs by adding a custom metal capacitor at the storage node. With the improved retention time, the overall energy consumption of the eDRAM macro, including a regular refresh operation, is lower than that of most prior SRAM-based CIM macros. A 128 × 128 ternary eDRAM macro computes a vector-matrix multiplication between a vector with 64 binary inputs and a matrix with 64 × 128 ternary weights. Hence, 128 outputs are generated in parallel. Note that both weight and input bit-precisions are programmable for supporting a wide range of edge computing applications with different performance requirements. The bit-precisions are readily tunable by assigning a variable number of eDRAM cells per weight or by applying multiple pulses to the input. An embedded column ADC based on replica cells sweeps the reference level for 2^N − 1 cycles and converts the analog accumulated bitline voltage to a 1–5 bit digital output. A critical bitline accumulate operation is simulated (Monte Carlo, 3K runs). It shows a standard deviation of 2.84%, which could degrade the classification accuracy of the MNIST dataset by 0.6% and the CIFAR-10 dataset by 1.3% versus a baseline with no variation. The simulated energy is 1.81 fJ/operation, and the energy efficiency is 552.5–17.8 TOPS/W (for a 1–5 bit ADC) at 200 MHz using 65-nm technology.
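To make the dimensions concrete, the following sketch performs the 64-input, 128-output ternary vector-matrix multiply described above and stands in an idealized quantizer for the replica-cell column ADC; the eDRAM cell behavior, retention, and variation are not modeled, and the quantizer mapping is an assumption.

```python
# Sketch of the ternary vector-matrix multiply: 64 binary inputs (0/1) against a
# 64x128 matrix of ternary weights (-1/0/+1), producing 128 column sums that a
# column ADC would quantize to 1-5 bits. Dimensions follow the abstract; the ADC
# transfer function below is an idealized stand-in, not the replica-cell sweep.
import numpy as np

rng = np.random.default_rng(4)
x = rng.integers(0, 2, 64)                     # binary input vector
W = rng.integers(-1, 2, (64, 128))             # ternary weight matrix
mac = x @ W                                    # 128 column sums (ideal analog result)

def column_adc(value, n_bits, full_scale=64):
    """Idealized 1-5 bit quantizer standing in for the reference-sweep column ADC."""
    levels = 2 ** n_bits
    code = np.clip((value + full_scale) * levels // (2 * full_scale), 0, levels - 1)
    return int(code)

print([column_adc(v, n_bits=5) for v in mac[:8]])
```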
Article
In this work, we present a compute-in-memory (CIM) macro built around a standard two-port compiler macro using a foundry 8T bit-cell in 7-nm FinFET technology. The proposed design supports 1024 simultaneous 4-b × 4-b multiply-and-accumulate (MAC) computations. The 4-bit input is represented by the number of read word-line (RWL) pulses, while the 4-bit weight is realized by charge sharing among binary-weighted computation capacitors. Each unit computation capacitor is formed by the inherent capacitance of the sense amplifier (SA) inside the 4-bit flash ADC, which saves area and minimizes the kickback effect. Access time is 5.5 ns with a 0.8-V power supply at room temperature. The proposed design achieves an energy efficiency of 351 TOPS/W and a throughput of 372.4 GOPS. Implications of our design from neural network implementation and accuracy perspectives are also discussed.
Article
Previous SRAM-based computing-in-memory (SRAM-CIM) macros suffer from small read margins for high-precision operations, large cell-array area overhead, and limited compatibility with many input and weight configurations. This work presents a 1-to-8-bit configurable SRAM CIM unit-macro using: 1) a hybrid structure combining 6T-SRAM-based in-memory binary product-sum (PS) operations with digital near-memory-computing multibit PS accumulation to increase read accuracy and reduce area overhead; 2) column-based place-value-grouped weight mapping and a serial-bit input (SBIN) mapping scheme to facilitate reconfiguration and increase array efficiency under various input and weight configurations; 3) a self-reference multilevel reader (SRMLR) to reduce read-out energy and achieve a sensing margin 2× that of the mid-point reference scheme; and 4) an input-aware bitline voltage compensation scheme to ensure successful read operations across various input-weight patterns. A 4-Kb configurable 6T-SRAM CIM unit-macro was fabricated using a 55-nm CMOS process with foundry 6T-SRAM cells. The resulting macro achieved access times of 3.5 ns per cycle (pipelined) and energy efficiency of 0.6–40.2 TOPS/W under binary to 8-b input/8-b weight precision.
Article
This article presents C3SRAM, an in-memory-computing SRAM macro. The macro is an SRAM module with the circuits embedded in bitcells and peripherals to perform hardware acceleration for neural networks with binarized weights and activations. The macro utilizes analog-mixed-signal (AMS) capacitive-coupling computing to evaluate the main computations of binary neural networks, binary-multiply-and-accumulate operations. Without the need to access the stored weights by individual row, the macro asserts all its rows simultaneously and forms an analog voltage at the read bitline node through capacitive voltage division. With one analog-to-digital converter (ADC) per column, the macro realizes fully parallel vector-matrix multiplication in a single cycle. The network type that the macro supports and the computing mechanism it utilizes are determined by the robustness and error tolerance necessary in AMS computing. The C3SRAM macro is prototyped in a 65-nm CMOS. It demonstrates an energy efficiency of 672 TOPS/W and a speed of 1638 GOPS (20.2 TOPS/mm²), achieving 3975× better energy-delay product than the conventional digital baseline performing the same operation. The macro achieves 98.3% accuracy for MNIST and 85.5% for CIFAR-10, which is among the best in-memory computing works in terms of energy efficiency and inference accuracy tradeoff.
Article
We present XNOR-SRAM, a mixed-signal in-memory computing (IMC) SRAM macro that computes ternary-XNOR-and-accumulate (XAC) operations in binary/ternary deep neural networks (DNNs) without row-by-row data access. The XNOR-SRAM bitcell embeds circuits for ternary XNOR operations, which are accumulated on the read bitline (RBL) by simultaneously turning on all 256 rows, essentially forming a resistive voltage divider. The analog RBL voltage is digitized with a column-multiplexed 11-level flash analog-to-digital converter (ADC) at the XNOR-SRAM periphery. XNOR-SRAM is prototyped in a 65-nm CMOS and achieves an energy efficiency of 403 TOPS/W for ternary-XAC operations with 88.8% test accuracy for the CIFAR-10 data set at a 0.6-V supply. This marks 33× better energy efficiency and 300× better energy-delay product than conventional digital hardware and also represents one of the best tradeoffs between energy efficiency and DNN accuracy.
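A behavioral model of the ternary XAC readout: if every bitcell pulls the read bitline toward VDD or GND depending on its ±1 product, the settled voltage is roughly VDD times the fraction of +1 products, which an 11-level flash ADC then digitizes. The sketch below assumes an ideal divider and uniformly spaced ADC thresholds, which the actual circuit does not guarantee.

```python
# Behavioral sketch of the XNOR-and-accumulate readout: bitcells pull the read bitline
# toward VDD (+1 product) or GND (-1 product), so the settled RBL voltage is roughly
# VDD * n_up / (n_up + n_down); an 11-level flash ADC then digitizes it.
# Idealized divider model; device variation and real ADC thresholds are ignored.
import numpy as np

VDD = 0.6

def xac_rbl_voltage(x: np.ndarray, w: np.ndarray) -> float:
    """x, w: arrays of +1/-1. Returns the ideal divider voltage on the read bitline."""
    products = x * w
    n_up = int((products == 1).sum())
    return VDD * n_up / len(products)

def flash_adc_11level(v: float) -> int:
    return int(np.clip(round(v / VDD * 10), 0, 10))   # 11 output levels (0..10)

rng = np.random.default_rng(5)
x = rng.choice([-1, 1], 256)
w = rng.choice([-1, 1], 256)
print("XAC sum:", int((x * w).sum()),
      "-> ADC level:", flash_adc_11level(xac_rbl_voltage(x, w)))
```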
Article
Computation-in-memory (CIM) is a promising candidate to improve the energy efficiency of multiply-and-accumulate (MAC) operations in artificial intelligence (AI) chips. This work presents a static random access memory (SRAM) CIM unit-macro using: 1) compact-rule-compatible twin-8T (T8T) cells for weighted CIM MAC operations to reduce area overhead and vulnerability to process variation; 2) an even–odd dual-channel (EODC) input mapping scheme to extend input bandwidth; 3) a two's complement weight mapping (C2WM) scheme to enable MAC operations using positive and negative weights within a cell array in order to reduce area overhead and computational latency; and 4) a configurable global–local reference voltage generation (CGLRVG) scheme for kernels of various sizes and bit precisions. A 64 × 60 b T8T unit-macro with 1-, 2-, and 4-b inputs, 1-, 2-, and 5-b weights, and up to 7-b MAC-value (MACV) outputs was fabricated as a test chip using a foundry 55-nm process. The proposed SRAM-CIM unit-macro achieved access times of 5 ns and energy efficiency of 37.5–45.36 TOPS/W under 5-b MACV output.
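The two's-complement weight mapping can be illustrated arithmetically: a signed n-bit weight is the sum of its positive bit planes minus its MSB plane, so a signed MAC decomposes into binary product-sum operations. A NumPy sketch under that assumption follows; the array sizes and names are illustrative, not the macro's mapping.

```python
# Sketch of a two's-complement weight decomposition: a signed n-bit weight is
# w = -2^(n-1)*b_(n-1) + sum_{i<n-1} 2^i*b_i, so a signed multi-bit MAC can be built
# from binary product-sum operations whose MSB plane is subtracted rather than added.
import numpy as np

def c2_mac(x: np.ndarray, w: np.ndarray, w_bits: int = 5) -> int:
    """x: unsigned inputs; w: signed integers representable in `w_bits` two's complement."""
    w_u = w & ((1 << w_bits) - 1)                    # two's-complement bit pattern
    acc = 0
    for b in range(w_bits):
        plane = (w_u >> b) & 1
        partial = int((x * plane).sum())             # binary product-sum per bit plane
        acc += (-partial if b == w_bits - 1 else partial) << b   # MSB plane is negative
    return acc

rng = np.random.default_rng(9)
x = rng.integers(0, 4, 60)          # 2-bit inputs
w = rng.integers(-16, 16, 60)       # 5-bit signed weights
assert c2_mac(x, w) == int((x * w).sum())
print(c2_mac(x, w))
```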
Article
Large-scale matrix-vector multiplications, which dominate in deep neural networks (DNNs), are limited by data movement in modern VLSI technologies. This paper addresses data movement via an in-memory-computing accelerator that employs charge-domain mixed-signal operation for enhancing compute SNR and, thus, scalability. The architecture supports analog/binary input activation (IA)/weight first layer (FL) and binary/binary IA/weight hidden layers (HLs), with batch normalization and input-output (IO) buffering circuitry to enable cascading, if desired, for realizing different DNN layers. The architecture is arranged as 8 × 8 = 64 in-memory-computing neuron tiles, supporting up to 512 HL neurons with 3 × 3 × 512 inputs and 64 FL neurons with 3 × 3 × 3 inputs, configurable via tile-level clock gating. In-memory computing is achieved using an 8T bit cell with an overlaying metal-oxide-metal (MOM) capacitor, yielding a structure with 1.8× the area of a standard 6T bit cell. Implemented in 65-nm CMOS, the design achieves HL/FL energy efficiency of 866/1.25 TOPS/W and throughput of 18876/43.2 GOPS (1498/3.43 GOPS/mm²) when implementing convolution layers, and 658/0.95 TOPS/W and 9438/10.47 GOPS (749/0.83 GOPS/mm²) when implementing convolution followed by batch normalization layers. Several large-scale neural networks are demonstrated, showing performance on standard benchmarks (MNIST, CIFAR-10, and SVHN) equivalent to ideal digital computing.
Article
This paper presents an energy-efficient static random access memory (SRAM) with embedded dot-product computation capability for binary-weight convolutional neural networks. A 10T bit-cell-based SRAM array is used to store the 1-b filter weights. The array implements the dot-product as a weighted average of the bitline voltages, which are proportional to the digital input values. Local integrating analog-to-digital converters compute the digital convolution outputs corresponding to each filter. We have successfully demonstrated functionality (>98% accuracy) with the 10,000 test images of the MNIST hand-written digit recognition data set, using 6-b inputs/outputs. Compared to conventional fully digital implementations using small bit-widths, we achieve similar or better energy efficiency by reducing data transfer, owing to the highly parallel in-memory analog computations.
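An idealized linear model of the readout described above: each bitline settles to a voltage proportional to its digital input, and the column averages the bitlines selected by the stored binary weights. The supply voltage, input width, and function names below are assumptions for illustration only.

```python
# Sketch of the in-SRAM dot product as a weighted average: each bitline discharges to a
# voltage proportional to its 6-b digital input, and the column averages the bitlines
# selected by the stored binary weights. Idealized linear model for intuition only.
import numpy as np

VDD, IN_BITS = 1.0, 6

def weighted_average_dot(inputs: np.ndarray, weights: np.ndarray) -> float:
    v_bl = VDD * inputs / (2 ** IN_BITS - 1)     # bitline voltage proportional to input
    return float((weights * v_bl).mean())        # averaging of weight-selected bitlines

rng = np.random.default_rng(6)
x = rng.integers(0, 64, 64)                      # 6-bit inputs
w = rng.integers(0, 2, 64)                       # binary weights
print(f"average voltage: {weighted_average_dot(x, w):.4f} V "
      f"(ideal dot = {int((x * w).sum())})")
```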
Article
The trend of pushing inference from cloud to edge due to concerns of latency, bandwidth, and privacy has created demand for energy-efficient neural network hardware. This paper presents a mixed-signal binary convolutional neural network (CNN) processor for always-on inference applications that achieves 3.8 μJ/classification at 86% accuracy on the CIFAR-10 image classification data set. The goal of this paper is to establish the minimum-energy point for the representative CIFAR-10 inference task, using the available design tradeoffs. The BinaryNet algorithm for training neural networks with weights and activations constrained to +1 and -1 drastically simplifies multiplications to XNOR and allows integrating all memory on-chip. A weight-stationary, data-parallel architecture with input reuse amortizes memory access across many computations, leaving wide vector summation as the remaining energy bottleneck. This design features an energy-efficient switched-capacitor (SC) neuron that addresses this challenge, employing a 1024-bit thermometer-coded capacitive digital-to-analog converter (CDAC) section for summing pointwise products of CNN filter weights and activations and a 9-bit binary-weighted section for adding the filter bias. The design occupies 6 mm² in 28-nm CMOS, contains 328 kB of on-chip SRAM, operates at 237 frames/s (FPS), and consumes 0.9 mW from 0.6 V/0.8 V supplies. The corresponding energy per classification (3.8 μJ) amounts to a 40× improvement over the previous low-energy benchmark on CIFAR-10, achieved in part by sacrificing some programmability. The SC neuron array is 12.9× more energy efficient than a synthesized digital implementation, which amounts to a 4× advantage in system-level energy per classification.
Article
This paper presents a machine-learning classifier where computations are performed in a standard 6T SRAM array, which stores the machine-learning model. Peripheral circuits implement mixed-signal weak classifiers via columns of the SRAM, and a training algorithm enables a strong classifier through boosting and also overcomes circuit nonidealities by combining multiple columns. A prototype 128 × 128 SRAM array, implemented in a 130-nm CMOS process, demonstrates ten-way classification of MNIST images (using image-pixel features downsampled from 28 × 28 = 784 to 9 × 9 = 81, which yields a baseline accuracy of 90%). In SRAM mode (bit-cell read/write), the prototype operates up to 300 MHz, and in classify mode, it operates at 50 MHz, generating a classification every cycle. With accuracy equivalent to a discrete SRAM/digital-MAC system, the system achieves ten-way classification at an energy of 630 pJ per decision, 113× lower than a discrete system with the standard training algorithm and 13× lower than a discrete system with the proposed training algorithm.
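For readers unfamiliar with boosting, the sketch below combines hypothetical per-column threshold classifiers by AdaBoost-style weighted voting; it illustrates only the idea of building a strong classifier from weak column classifiers, not the paper's mixed-signal training algorithm, and every name and dataset in it is invented for the example.

```python
# Sketch of boosting weak "column" classifiers into a strong one by weighted voting.
# Each weak learner is a hypothetical 1-D threshold stump on one feature column.
import numpy as np

rng = np.random.default_rng(10)
X = rng.standard_normal((200, 16))                   # 16 "column" features
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200))

def weak_predict(X, col, thresh):                    # one column = one weak classifier
    return np.where(X[:, col] > thresh, 1.0, -1.0)

# Very small AdaBoost-style loop over the columns.
sample_w = np.full(len(y), 1.0 / len(y))
ensemble = []
for col in range(X.shape[1]):
    pred = weak_predict(X, col, 0.0)
    err = min(max(sample_w[pred != y].sum(), 1e-9), 1 - 1e-9)
    alpha = 0.5 * np.log((1 - err) / err)            # vote weight for this column
    ensemble.append((col, alpha))
    sample_w *= np.exp(-alpha * y * pred)            # re-weight misclassified samples
    sample_w /= sample_w.sum()

strong = np.sign(sum(a * weak_predict(X, c, 0.0) for c, a in ensemble))
print("strong-classifier accuracy:", (strong == y).mean())
```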
Article
We introduce a method to train Binarized Neural Networks (BNNs): neural networks with binary weights and activations at run-time and when computing the parameters' gradient at train-time. We conduct two sets of experiments, each based on a different framework, namely Torch7 and Theano, where we train BNNs on MNIST, CIFAR-10, and SVHN, and achieve nearly state-of-the-art results. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to a great increase in power efficiency. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available.
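A minimal sketch of the binarization-with-straight-through-estimator idea commonly used when training BNNs: the forward pass uses sign(w), and the backward pass copies the gradient through only where |w| ≤ 1. Plain NumPy, no framework; the learning rate, shapes, and helper names are arbitrary assumptions.

```python
# Minimal sketch of BNN-style binarization with a straight-through estimator (STE):
# forward uses sign(w) in {-1, +1}; backward treats d(sign)/dw as 1 inside [-1, 1]
# and 0 outside, so real-valued "latent" weights can still be updated.
import numpy as np

def binarize_forward(w: np.ndarray) -> np.ndarray:
    return np.where(w >= 0, 1.0, -1.0)

def binarize_backward(w: np.ndarray, grad_out: np.ndarray) -> np.ndarray:
    """STE: pass the gradient through where |w| <= 1, zero it elsewhere."""
    return grad_out * (np.abs(w) <= 1.0)

# One toy update step on the latent full-precision weights.
rng = np.random.default_rng(7)
w = rng.uniform(-1.5, 1.5, 8)
wb = binarize_forward(w)                  # binary weights used in the forward pass
grad_wrt_wb = rng.standard_normal(8)      # pretend upstream gradient
grad_wrt_w = binarize_backward(w, grad_wrt_wb)
w -= 0.1 * grad_wrt_w                     # update the latent full-precision weights
print(wb, grad_wrt_w, sep="\n")
```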
Conference Paper
Our challenge is clear: The drive for performance and the end of voltage scaling have made power, and not the number of transistors, the principal factor limiting further improvements in computing performance. Continuing to scale compute performance will require the creation and effective use of new specialized compute engines, and will require the participation of application experts to be successful. If we play our cards right, and develop the tools that allow our customers to become part of the design process, we will create a new wave of innovative and efficient computing devices.
16.4 An 89TOPS/W and 16.3TOPS/mm
  • Y.-D. Chih