MC2-RAM: An In-8T-SRAM Computing Macro Featuring Multi-Bit Charge-Domain Computing and ADC-Reduction Weight Encoding

Zhiyu Chen, Qing Jin, Jingyu Wang, Yanzhi Wang, and Kaiyuan Yang
Rice University, Houston, TX; Northeastern University, Boston, MA
Abstract—In-memory computing (IMC) is a promising hardware architecture for circumventing the memory wall in data-intensive applications such as deep learning. Among various memory technologies, static random-access memory (SRAM) stands out thanks to its high computing accuracy, reliability, and scalability to advanced technology nodes. This paper presents a novel multi-bit capacitive convolution in-SRAM computing macro for high-accuracy, high-throughput, and high-efficiency deep learning inference. It realizes fully parallel charge-domain multiply-and-accumulate (MAC) within compact 8-transistor 1-capacitor (8T1C) SRAM cells that are only 41% larger than standard 6T cells. It performs MAC with multi-bit activations without conventional digital bit-serial shift-and-add schemes, drastically improving throughput for high-precision CNN models. An ADC-reduction weight encoding scheme complements the compact SRAM design by halving the number of needed ADCs for energy and area savings. A 576×130 macro with 64 ADCs is evaluated in 65 nm with post-layout simulations, showing 4.60 TOPS/mm2 compute density and 59.7 TOPS/W energy efficiency with 4/4-bit activations/weights. MC2-RAM also achieves excellent linearity, with only 0.14 mV (4.5% of the LSB) standard deviation of the output voltage in Monte Carlo simulations.
Index Terms—CMOS; SRAM; in-memory computation; mixed-signal computation; convolutional neural networks (CNNs); deep learning accelerator
I. INTRODUCTION
Deep convolutional neural networks (CNNs) have achieved unprecedented success in the field of artificial intelligence (AI) in the past decade. However, the intensive computation required for even inference makes it challenging to deploy pre-trained models on resource-constrained edge devices. The essential and computationally dominant operation in CNN models, the convolution, requires overwhelming numbers of multiply-and-accumulate (MAC) operations with excessive on-/off-chip memory accesses. It is well known that the energy bottleneck of such computation lies in the data movement rather than the arithmetic operations, leading to the so-called memory wall [1].
Recent progress in in-memory computing (IMC) provides an attractive solution to circumvent the memory wall. The key idea behind IMC is to perform the computation directly inside the memory by accessing multiple rows simultaneously. The local computation significantly reduces data movement, and the parallel access amortizes the read energy [2]. Emerging embedded non-volatile memories (eNVM) [3]–[5] are promising candidates since they eliminate off-chip memory accesses and offer high storage density with potential multi-level cell states. Nevertheless, their computing accuracy is largely compromised by inaccurate storage and the small readout dynamic range of the cells. On the other hand, SRAM is a mature embedded memory technology that has attracted growing interest for IMC implementations in recent years [6]–[15] because of its superb computing efficiency and technology scalability. Silicon-verified results [11], [12] prove that in-SRAM computing can achieve accuracy competitive with digital ASIC accelerators. More importantly, SRAM scales well with advanced technology nodes while eNVM falls behind transistor scaling. As an example, in-SRAM computing in 7 nm [16] provides about 2 times higher storage density than in-RRAM computing in 22 nm [17].
The concept of in-SRAM computing was first proposed by [18] and verified in silicon by [19]. This current-domain computing scheme turns on multiple wordlines (WLs) simultaneously and accumulates the current on the bitline (BL). It was further developed by [7], [10] to support multi-bit CNN models. Simple as the implementation is, it suffers from process variation and the nonlinear I-V characteristics of transistors, which limit the sensing margin. Recently, charge-domain computation, proposed by [8], has proven to be an effective method of enhancing computing linearity and robustness to PVT variations. The analog multiplication is performed in local capacitors in the cells, and the accumulation is achieved by charge sharing [8], [11] or capacitive coupling [9], so that no transistor nonideality is involved throughout the operations. However, due to large area overhead, those designs sacrifice either the versatility of analog computation (supporting only binary or ternary operations) [9], [11] or the parallelism, with a group of cells sharing one analog computing circuit [8]. Even so, the cell area of such designs is still 2 to 3 times larger than the logic-rule 6T, leading to degraded compute density (TOPS/mm2).
Beyond the computing methods, the optimization of the weight encoding scheme has been overlooked. Most in-SRAM computing designs utilize the traditional 2's complement weight encoding [7], [11], [12], where the convolution of each bit is processed in the analog domain separately and the partial sums are shift-and-added in the digital periphery. This scheme requires k power-hungry ADCs to read out one output partial sum in a k-bit CNN model.

Fig. 1: Comparison of state-of-the-art in-SRAM-computing schemes.
This paper proposes a Multi-bit Capacitive Convolution (MC2) SRAM macro using 8T SRAM cells, featuring: (1) high computing accuracy without transistor non-idealities, (2) multi-bit computation with full parallelism, (3) highly compact cell area, and (4) a reduced number of ADCs and the associated energy and area overheads. Our main contributions include:
• A novel MC2 in-SRAM computing scheme that performs multi-bit charge-domain MAC with full parallelism and high accuracy in 8T1C SRAM cells. The cell area is only 41% larger than a logic-rule 6T cell.
• An ADC-reduction weight encoding scheme that halves the number of ADCs for identical throughput.
• A complete MC2-RAM macro with all peripherals and compact layouts, achieving state-of-the-art energy efficiency and compute density in post-layout simulations.
II. DESIGN CONSIDERATIONS AND RELATED WORK
This section summarizes the key design considerations for in-SRAM computing macros and analyzes existing designs.
A. Computing Schemes
Fig. 1 abstracts the working principles of the state-of-the-art schemes and the proposed MC2 in-SRAM computing scheme. Current-domain IMC [7], [10], [19] activates multiple WLs at the same time and accumulates the current on the BL using 6T or 8T cells. This scheme features simple implementation and compatibility with standard SRAM cells, yet it faces severe transistor nonidealities, as the process variation of the SRAM access transistors directly affects the output current.
The PVT variation and the nonlinearity can be addressed by charge-domain IMC, which has drawn increasing interest in recent studies [6], [8], [9], [11], [20]. One charge-domain approach is capacitive coupling [9] (see Fig. 1). The binary (or ternary) input directly drives the bottom plates of local capacitors, forming a capacitive voltage divider. This scheme has a relatively simple cell structure and excellent computing linearity, but cannot support multi-bit analog MAC, because multi-bit inputs would require power-hungry analog drivers. On the other hand, the charge-sharing approach [8], [11], [20] performs the analog multiplication via a 1-bit multiplier (which can be implemented as a single PMOS) and performs the accumulation by sharing the charge to the output BLs (see Fig. 1) through an output switch. We term this the top-plate sampling, top-plate charge-sharing (TSTC) scheme. Multi-bit MAC can be supported by sampling the input voltage on the local capacitor via an additional input switch. However, these extra switches incur large area and power overheads.

TABLE I: Performance comparison of bit-serial analog MAC and multi-bit analog MAC in a BX-bit CNN model.

                     BIT-SERIAL     MULTI-BIT
Energy per cycle     E_BS           (1+α)·E_BS
Latency per cycle    T_BS           (1+β)·T_BS
Energy per MAC       BX·E_BS        (1+α)·E_BS
Latency per MAC      BX·T_BS        (1+β)·T_BS
B. Analog Multi-Bit MAC Support
To match the computing accuracy of digital accelerators, it is crucial to support CNN models with high precision (e.g., 4-bit). Most IMC macros with multi-bit MAC support [7], [11], [12] perform 1- or 2-bit analog MAC in memory and accumulate the partial sums in a digital shift-and-add periphery in a serial manner. While this simplifies the cell design, the serial computation causes a huge energy and throughput penalty (see Table I, where α and β are the energy and latency overheads of implementing multi-bit analog MAC). For example, although the estimated energy and latency overhead of the multi-bit MC2 is about 20% higher than a one-bit implementation, the potential improvement of the overall performance in a 4-bit CNN model is 3.3 times.
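To make the trade-off concrete, the sketch below evaluates the per-MAC costs from Table I. The 20% overhead (α = β = 0.2) and BX = 4 follow the example above, while the unit costs E_BS and T_BS are arbitrary normalizations.

```python
# Per-MAC energy/latency model from Table I (normalized units).
def per_mac_cost(bx, alpha, beta, e_bs=1.0, t_bs=1.0):
    serial = (bx * e_bs, bx * t_bs)                   # bit-serial: one cycle per input bit
    multi = ((1 + alpha) * e_bs, (1 + beta) * t_bs)   # multi-bit: a single cycle
    return serial, multi

serial, multi = per_mac_cost(bx=4, alpha=0.2, beta=0.2)
print(serial[0] / multi[0])  # energy improvement: 4 / 1.2 = 3.3x
print(serial[1] / multi[1])  # latency improvement: 4 / 1.2 = 3.3x
```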
C. Computing Parallelism
Another feature of the proposed SRAM macro is full computing parallelism, where all memory cells can be accessed simultaneously. Recently, several studies [8], [12], [20] have proposed a semi-parallel computing structure in which cells are clustered to share one local computing circuit, and only one cell per cluster is activated in each cycle. Despite the reduced cell area, both the peak compute density and the energy efficiency are compromised in such designs. The memory energy of a fully parallel IMC can be modeled as
E_F = (N_V·L_V·C_V + N_H·L_H·C_H) · V_DD·V_SW,    (1)

where N_V/N_H is the number of computing or control wires in the vertical/horizontal direction, L_V/L_H is the length of the wires, C_V/C_H is the unit metal capacitance, and V_SW is the voltage swing on the wires.

Fig. 2: The 8T1C MC2 cell structure and logic-rule layout.
Fig. 3: Comparison of the cell area in TSMC 65 nm and the simulated standard deviation of the output voltage as a result of process variation.
For a semi-parallel design with k-cell clusters, the modeled energy becomes

E_S = (N_V·L_V·C_V + (1/k)·N_H·L_H·C_H) · V_DD·V_SW.    (2)

In other words, the throughput of full parallelism is k times higher than that of semi-parallelism, while the energy consumption is much less than k·E_S thanks to better read-energy amortization.
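A small sketch of this wire-energy model, Equations (1) and (2), is given below. The wire counts per row/column follow Section V.E (four horizontal input lines, one vertical output line), but the wire lengths, unit capacitances, voltages, and k are hypothetical placeholders, not the macro's actual parameters.

```python
# Wire-energy model for fully parallel vs. semi-parallel IMC (Eqs. 1 and 2).
def full_parallel_energy(nv, lv, cv, nh, lh, ch, vdd, vsw):
    # E_F = (N_V*L_V*C_V + N_H*L_H*C_H) * VDD * V_SW
    return (nv * lv * cv + nh * lh * ch) * vdd * vsw

def semi_parallel_energy(nv, lv, cv, nh, lh, ch, vdd, vsw, k):
    # E_S = (N_V*L_V*C_V + (1/k)*N_H*L_H*C_H) * VDD * V_SW
    return (nv * lv * cv + nh * lh * ch / k) * vdd * vsw

# Hypothetical numbers: nh=4 and nv=1 per Sec. V.E; everything else illustrative.
args = dict(nv=1, lv=1e-3, cv=0.2e-15, nh=4, lh=1e-3, ch=0.2e-15,
            vdd=1.2, vsw=1.2)
ef = full_parallel_energy(**args)
es = semi_parallel_energy(**args, k=8)
# Full parallelism gives 8x the throughput at far less than 8x the energy.
print(ef, 8 * es, ef < 8 * es)
```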
III. THE MC2 SCHEME IN 8T SRAM
We propose the MC2 scheme to realize charge-domain computation with multi-bit inputs and full computing parallelism, using compact 8-transistor 1-capacitor (8T1C) SRAM cells (see Fig. 2). The benefits of these design features are as follows. 1) The charge-domain computation ensures high computing linearity that approaches the CNN inference accuracy of digital hardware. 2) MAC with direct multi-bit input support avoids serial digital processing, which has long latency and large energy overhead (see Table I). 3) Fully parallel computing better amortizes the control-wire energy and thus achieves higher energy efficiency (see Equation 2) and higher compute density. 4) A compact cell structure (see Fig. 3) increases the compute density. 5) The novel switching scheme in MC2 requires fewer control wires for analog MAC than previous charge-domain designs [2], [20], reducing one of the dominant sources of energy in IMC operations.
The key concept behind MC2 is a bottom-plate sampling, top-plate charge-sharing (BSTC) method that replaces the local output switches in conventional TSTC schemes with a single global switch per column (see Fig. 1). The input voltage is first sampled onto the bottom plates of the local capacitors with the global switch on. After the analog multiplication, each bottom plate is connected to either ground or VDD with the global switch turned off, so that the charge sharing is performed on the top plates.
This scheme leads to an 8T1C cell design with two extra transistors, controlled by the differential SRAM values (weight), that perform input voltage sampling and local multiplication between inputs and weights (Fig. 2). Although this cell design looks the same as that in [9], the BSTC scheme in MC2 is fundamentally different from the capacitive coupling scheme in C3SRAM, because it reuses the two-transistor analog multiplier as the multi-bit input sampling switch; it is the first to achieve fully parallel MAC with multi-bit inputs using 8T SRAM cells. In physical layout, each MC2 SRAM cell requires 8 transistors and one local metal-oxide-metal (MOM) capacitor (see Fig. 2). The MOM capacitor can be placed above the transistors without any area overhead. The cell layout follows the conventional 6T layout style, with the two pass-gate PMOS (T1 and T2) sharing the gates of the cross-coupled inverters. Overall, we realize a highly compact cell layout that is only 41% larger than the logic-rule thin-cell 6T. As shown in Fig. 3, MC2-RAM is significantly smaller than prior art with similar computing accuracy (measured by the variation of the computing result due to process variations), even designs that only support 1-2 bit inputs.
More specifically, MC2 computing starts with a DAC phase (see Fig. 4). The 4-bit input signal is sampled onto the bottom plate via one of the input lines (INA1 or INA2) by a current-steering DAC [21]. In the multiplication phase, INA1 is driven to VDD so that the MOM capacitor is either charged to VDD (equivalent to multiplying by '0') or keeps its sampled voltage (equivalent to multiplying by '1'), depending on the data stored in the cell. Despite the floating state of some bottom plates, leakage does not have a significant effect, since those bottom plates are connected to INA2 with a large parasitic capacitance (180 fF) and the only leakage path is through P2. Experiments show that at 125 °C the worst-case leakage current is 9.5 nA, leading to only a 0.034 mV change in the final result (equivalent to 0.01 LSB). In the final accumulation phase, all the bottom plates are connected to VDD and the charge sharing is performed directly on the top plates without the need for any local switches. In the final state, all the bottom plates are connected to VDD and the ADC starts the quantization. All switches in the MAC switching unit (MSU) are PMOS, while N1 and N2 are NMOS.
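A minimal behavioral sketch of the three BSTC phases is shown below, assuming ideal switches, identical unit capacitors, a floating shared top plate during accumulation, and no parasitics. The precharge voltage v_pre and the direct mapping from DAC output to volts are illustrative assumptions, not circuit-level details from the paper.

```python
import numpy as np

def bstc_column(v_in, w, vdd=1.2, v_pre=0.0):
    """Idealized MC2 BSTC column: returns the shared top-plate voltage.

    v_in: per-cell sampled input voltages (DAC phase, bottom plates)
    w:    per-cell 1-bit weights (multiplication phase)
    """
    v_in = np.asarray(v_in, dtype=float)
    w = np.asarray(w, dtype=int)
    # Multiplication: cells storing '0' have their bottom plate charged
    # to VDD; cells storing '1' keep the sampled input voltage.
    v_bottom = np.where(w == 1, v_in, vdd)
    # Accumulation: global switch off, all bottom plates driven to VDD.
    # With equal caps and a floating common top plate, charge conservation
    # shifts the top plate by the mean bottom-plate voltage step.
    delta = vdd - v_bottom           # nonzero only where w == 1
    return v_pre + delta.mean()      # proportional to (1/N)*sum(w_i*(VDD - v_in_i))

# Example: the output shift encodes the dot product of weights and inputs.
print(bstc_column(v_in=[0.3, 0.6, 0.9, 1.2], w=[1, 0, 1, 1]))
```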
IV. ADC-REDUCTION ENCODING SCHEME
Two's complement weight encoding is a natural choice for most in-SRAM computing macros, because one SRAM cell stores only 1 bit of information and weights are easy to decompose under this scheme. Each column processes only one bit of the multi-bit weight and converts the analog MAC into the digital domain. A digital shift-and-add module then accumulates the partial sums according to the significance of each bit position. Note that the partial sum on the MSB (sign bit) is negative and its polarity needs to be flipped. The major limitation of this scheme is that each bit position needs one power-hungry ADC (i.e., one ADC per column), which brings large power overhead. Meanwhile, it restricts the layout space of the ADCs, leading to low-quality layout matching and degraded computing accuracy.

Fig. 4: Operations of the MC2 scheme.
Fig. 5: ADC-reduction encoding for 4-bit weights.
In MC2-RAM, because the cell footprint is significantly reduced as described in Section III, fitting one ADC per column is even more challenging and would dominate the total area. Therefore, we propose a weight encoding scheme that effectively halves the number of ADCs. It takes advantage of the natural subtraction performed by differential ADCs (see Fig. 5). Unlike 2's complement encoding, where only the sign bit is negative, every two bits of the weight are paired with opposite polarities, and neighboring cells are connected to the two inputs of one ADC. The significance of the negative bit is twice that of the positive bit. For instance, the bit significances of a 4-bit number are {−8, 4, −2, 1}, so the decimal number −7 is encoded as 1001 ((−8)×1 + 4×0 + (−2)×0 + 1×1). The subtraction of the negative and positive MAC results is performed concurrently with the quantization process in the differential ADCs. After that, the quantized partial sums from the bit pairs are shift-and-added in the digital domain.
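The following sketch illustrates the encoding and the digital recombination. The bit significances (−8, 4, −2, 1), the bias c = 2, and the pairwise differential readout are taken from the text; the exhaustive-search encoder and the ideal (noise-free, unquantized) column sums are simplifications for illustration.

```python
import numpy as np

SIG = (-8, 4, -2, 1)  # per-bit significances of the 4-bit encoding

def encode(v):
    """Return the 4-bit code whose signed significances sum to v."""
    for code in range(16):
        bits = [(code >> (3 - i)) & 1 for i in range(4)]
        if sum(b * s for b, s in zip(bits, SIG)) == v:
            return bits
    raise ValueError(f"{v} is outside the representable range [-10, 5]")

assert encode(-7) == [1, 0, 0, 1]  # the example from the text

def mac(x, weights, c=2):
    """Dot product under ADC-reduction encoding (ideal analog columns)."""
    x = np.asarray(x)
    # Bias each weight by -c so that [-8, 7] maps into [-10, 5].
    codes = np.array([encode(w - c) for w in weights])  # shape (n, 4)
    col = x @ codes                    # ideal per-column analog MAC sums
    # Each differential ADC digitizes (positive - 2 * negative) per pair.
    pair_hi = col[1] - 2 * col[0]      # bits with significances +4 and -8
    pair_lo = col[3] - 2 * col[2]      # bits with significances +1 and -2
    raw = 4 * pair_hi + pair_lo        # inter-pair digital shift-and-add
    return raw + c * x.sum()           # the dummy column restores the bias

x = np.array([3, 1, 0, 2])
w = [7, -8, 5, -3]
assert mac(x, w) == int(np.dot(x, w))  # matches the true dot product
```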
According to the encoding table in Fig. 5, the range of the representable numbers is not symmetrical. To match the range of 2's complement encoding, each weight needs a constant bias c (e.g., 2 in the 4-bit encoding), and thus c·Σ_i X_i must be added to the final MAC results. Note that the bias is the same for all results in each convolution. Therefore, one dummy column shared by the whole array, with all cells storing '1', computes Σ_i X_i × 1 with negligible overhead. This scheme can be extended to any weight bitwidth with more inter-column digital shift-and-adds and a different constant bias c, while always benefiting from the natural subtraction of the ADCs. It is worth mentioning that the encoding scheme can also be disabled to implement the widely used ternary-weighted neural networks.

Fig. 6: System diagram of the proposed MC2-RAM.
V. IMPLEMENTATION AND EVALUATIONS
A. MC2-RAM Implementation
Leveraging the proposed MC2 scheme and the ADC-reduction encoding, we implement an MC2-RAM macro consisting of a 576×130 8T SRAM array, with 128 columns performing normal MAC and 2 dummy columns, plus peripherals for computation and read/write (Fig. 6). Each row includes a 4-bit current-steering DAC and two MSUs for positive (INAs) and negative (INBs) inputs. Differential 8-bit ADCs directly quantize the difference between positive and negative columns.
A charge-injection SAR (ciSAR) ADC [22] is adapted for the macro, similar to [20]. In this design, the capacitive DAC is replaced by groups of long-channel transistors that behave as capacitors, which have much smaller area and require no power-hungry analog buffers to drive the reference voltages. The ciSAR ADC achieves a 7.85-bit effective number of bits (ENOB) in transient-noise simulations, showing great effectiveness in low-precision applications.
We implement the complete layout of MC2-RAM and verify its linearity and performance in a TSMC 65 nm LP process, as shown in Fig. 7. The total area of the macro is 0.280 mm2. The 9 KB SRAM array, the ciSAR ADCs, and the input processing circuitry (DACs and MSUs) occupy 47.5%, 30.8%, and 8.7% of the total area, respectively; 2.6% of the area is used by the read/write circuitry and another 2.6% by the WL drivers and decoders. Moreover, an in-house simulator is used to verify the computing accuracy and the effectiveness of the encoding scheme in end-to-end CNN applications.
B. Computing Linearity and Variation
Monte Carlo simulations (Fig. 8(a)) verify the excellent computing linearity of MC2 (R2 = 0.9999) and show negligible one-sigma variation of the output voltage (less than 0.14 mV, 4.5% of the LSB).

Fig. 7: Layout of the MC2-RAM macro.
Fig. 8: Monte Carlo simulations (100 iterations) of the computing linearity, sweeping (a) the number of activated DACs and (b) the input code of the DACs.

The reason is that the charge-sharing operation is inherently linear. It is also immune to process variation because the capacitor mismatch is orders of magnitude smaller than the transistor mismatch. One-bit DACs are implemented in this experiment to exclude the non-idealities of the DACs. The capacitor mismatch is included using the foundry MOM capacitor variation model. Only local mismatch is considered, because global systematic variation can be compensated by tuning the bias voltages of the ADCs and DACs.
The mismatch and nonlinearity of the DACs are further included in Fig. 8(b) by sweeping the input code of the 4-bit DACs with all DACs activated. The standard deviation in the worst-case scenario is 0.27 mV, which is less than 8.6% of the LSB. In both linearity results, the standard deviation of the output voltage increases with the voltage, which is expected as a result of accumulating independent capacitor variations. The linearity results confirm that the system is able to accurately calculate dot products with different combinations of inputs and weights.
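As a rough illustration of why the output sigma grows with the accumulated voltage, the Monte Carlo sketch below charge-shares over unit capacitors with random mismatch. The 0.2% mismatch sigma and the uniform input pattern are hypothetical numbers for illustration, not the foundry model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def output_sigma(n_active, n_total=576, sigma_c=0.002, trials=200, vdd=1.2):
    """Std. dev. of the charge-shared output under capacitor mismatch."""
    dv = np.zeros(n_total)
    dv[:n_active] = rng.uniform(0.0, vdd, n_active)  # one fixed input pattern
    outs = []
    for _ in range(trials):
        caps = 1.0 + sigma_c * rng.standard_normal(n_total)  # mismatched caps
        outs.append(np.sum(caps * dv) / np.sum(caps))        # charge conservation
    return np.std(outs)

# The spread grows as more rows (hence more output voltage) accumulate.
for n in (72, 144, 288, 576):
    print(n, output_sigma(n))
```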
C. Post-Layout System Verification
Fig. 9 illustrates the operational waveforms of the four input lines (INA1, INA2, INB1 and INB2) of one row and two output lines (OLN and OLP) of neighboring positive and negative columns in the system-level post-layout simulation. In this experiment, we set a specific input and weight pattern so that OLN stays at 0 V while OLP reaches 3/4 of its full range (150 mV out of 200 mV). The system time-interleaves the MC2 operations in the positive and negative columns to minimize the effect of parasitic coupling, which is critical for OLN since it is floating after the operations in the negative columns. It is observed that OLN does not stay at 0 V (it settles at 22.8 mV, see Fig. 9) due to coupling from the switching of the INBs, but this only introduces a fixed offset because the transition of the INBs is independent of the inputs. The post-layout simulation of the entire analog chain (see Fig. 10), including the non-idealities of the memory, the ADC, and the DACs, further verifies the excellent linearity of the system.

Fig. 9: Waveforms in the system post-layout simulation.
Fig. 10: Computing linearity of the complete analog chain.
D. Inference Accuracy on an End-to-End CNN Model
A simulator performing end-to-end CNN inference verifies the effectiveness of the ADC-reduction encoding scheme and the high computing accuracy of the macro. The functions of the simulator include: (1) mapping the CNN models to the SRAM macro, (2) decomposing the multi-bit weights into multiple columns as required by the encoding scheme, (3) quantizing the analog MAC partial sums in each column, and (4) adding non-idealities to the convolution operations. A 4-bit quantized ResNet-20 is mapped to the macros in this experiment. As Fig. 11 shows, if an ideal differential ADC and ideal analog computation are assumed, the encoding scheme does not cause significant accuracy loss on the CIFAR-10 dataset compared to a baseline model that has no ADC quantization errors. Further, the ideal MAC results are mapped through 64 transfer curves of the entire analog chain (from digital input to ADC output) in the simulator, where the transfer curves are exported from 64 Monte Carlo simulations. With all the non-idealities considered, the system with 8-bit ADCs achieves 90.2% inference accuracy.
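A sketch of step (4) of such a simulator is shown below, mapping ideal MAC partial sums through per-column transfer curves by interpolation. The synthetic curve shapes and the assignment of columns to the 64 curves are hypothetical stand-ins for the Monte Carlo exports described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical transfer curves: ADC output code vs. ideal MAC value, one per
# Monte Carlo instance (the paper exports 64 such curves from simulation).
mac_axis = np.linspace(-64, 64, 129)
curves = [np.clip(np.round((mac_axis + rng.normal(0, 0.5)) *
                           (1 + rng.normal(0, 0.01))), -128, 127)
          for _ in range(64)]

def apply_nonidealities(ideal_partial_sums, column_ids):
    """Map ideal per-column MAC results through per-column transfer curves."""
    out = np.empty_like(ideal_partial_sums, dtype=float)
    for j, col in enumerate(column_ids):
        curve = curves[col % 64]  # each physical column reuses one curve
        out[j] = np.interp(ideal_partial_sums[j], mac_axis, curve)
    return out

ideal = np.array([-20.0, 3.0, 41.0])
print(apply_nonidealities(ideal, column_ids=[0, 1, 2]))
```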
E. Energy and Throughput
The clock frequency of the macro is 70 MHz and the power consumption is 21.6 mW, of which the ADCs take 75.1%, the array takes 21.6%, and the controller and others occupy 3.3%. The system achieves 4.60 TOPS/mm2 compute density and 59.7 TOPS/W energy efficiency with 4/4-bit activations/weights. Since the bit precision significantly affects the performance of IMC (see Section II.B), we adopt a bitwise metric to characterize the performance, similar to [12], [20], where the number of bit operations (bOPs) is defined as

# bOPs = # OPs × input bitwidth × weight bitwidth.    (3)

MC2-RAM simultaneously achieves 955.2 TbOPS/W peak efficiency and 73.6 TbOPS/mm2 peak compute density, outperforming state-of-the-art SRAM-based IMC macros.

Fig. 11: End-to-end CNN inference accuracy on CIFAR-10.
Fig. 12: Comparison of energy efficiency and compute density of IMC macros. Results are normalized to 65 nm technology.
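For example, with 4-bit activations and 4-bit weights, each OP corresponds to 4 × 4 = 16 bOPs, so the 59.7 TOPS/W and 4.60 TOPS/mm2 figures above scale by 16× to the 955.2 TbOPS/W and 73.6 TbOPS/mm2 peak bitwise metrics.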
The high compute density results from the compact charge-domain cell design and the fully parallel computation. Since the ADC is the energy bottleneck of IMC systems, the proposed multi-bit analog MAC avoids the serial digital accumulation that requires multiple iterations of ADC readout. Meanwhile, the encoding scheme further halves the ADC energy. The array takes only a small portion of the total energy because the computing scheme requires only four horizontal wires (INA1, INA2, INB1 and INB2) per row and one vertical wire (OL) per column.
VI. CONCLUSION
In summary, this work presents MC2-RAM, a multi-bit charge-domain in-SRAM computing macro with 8T1C cells and an ADC-reduction encoding scheme. The proposed BSTC scheme enables analog MAC with multi-bit inputs at a significantly reduced cell area, while maintaining high computing accuracy and robustness to process variation. Simulations show only 0.14 mV variation in the worst case, which is 56.7 times better than a conventional current-domain design. Meanwhile, the proposed ADC-reduction encoding scheme reduces the ADC energy and relaxes layout congestion. Overall, MC2-RAM simultaneously improves bitwise energy efficiency and compute density over state-of-the-art IMC SRAM macros. The end-to-end ResNet-20 inference, with all physical non-idealities included, achieves 90.2% accuracy on CIFAR-10.
REFERENCES
[1] M. Horowitz, “1.1 Computing’s energy problem (and what we can do
about it),” in 2014 IEEE International Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), pp. 10–14, IEEE, 2014.
[2] N. Verma, H. Jia, H. Valavi, Y. Tang, M. Ozatay, L.-Y. Chen, B. Zhang, and P. Deaville, “In-memory computing: Advances and prospects,” IEEE Solid-State Circuits Magazine, vol. 11, no. 3, pp. 43–55, 2019.
[3] V. Joshi, M. Le Gallo, S. Haefeli, I. Boybat, S. R. Nandakumar,
C. Piveteau, M. Dazzi, B. Rajendran, A. Sebastian, and E. Eleftheriou,
“Accurate deep neural network inference using computational phase-change memory,” Nature Communications, vol. 11, no. 1, pp. 1–13, 2020.
[4] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and
H. Qian, “Fully hardware-implemented memristor convolutional neural
network,” Nature, vol. 577, no. 7792, pp. 641–646, 2020.
[5] M.-H. Wu, M.-S. Huang, Z. Zhu, F.-X. Liang, M.-C. Hong, J. Deng,
J.-H. Wei, S.-S. Sheu, C.-I. Wu, G. Liang, and T.-H. Hou, “Compact
probabilistic poisson neuron based on back-hopping oscillation in STT-
MRAM for all-spin deep spiking neural network,” in IEEE Symposium
on VLSI Technology (VLSI), June 2020.
[6] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, “A 64-Tile
2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-
Domain Compute,” IEEE Journal of Solid-State Circuits (JSSC), vol. 54,
pp. 1789–1799, June 2019.
[7] X. Si, J.-J. Chen, Y.-N. Tu, W.-H. Huang, J.-H. Wang, Y.-C. Chiu, W.-C.
Wei, S.-Y. Wu, X. Sun, R. Liu, et al., “A twin-8T SRAM computation-in-
memory unit-macro for multibit CNN-based AI edge processors,” IEEE
Journal of Solid-State Circuits (JSSC), vol. 55, no. 1, pp. 189–202, 2019.
[8] A. Biswas and A. P. Chandrakasan, “CONV-SRAM: An energy-efficient
SRAM with in-memory dot-product computation for low-power convo-
lutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 54,
no. 1, pp. 217–230, 2018.
[9] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, “C3SRAM: An in-memory-
computing SRAM macro based on robust capacitive coupling computing
mechanism,” IEEE Journal of Solid-State Circuits, vol. 55, no. 7,
pp. 1888–1897, 2020.
[10] S. K. Gonugondla, M. Kang, and N. Shanbhag, “A 42pJ/decision 3.12
TOPS/W robust in-memory machine learning classifier with on-chip
training,” in 2018 IEEE International Solid-State Circuits Conference
(ISSCC), pp. 490–492, 2018.
[11] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, “A programmable
heterogeneous microprocessor based on bit-scalable in-memory comput-
ing,” IEEE Journal of Solid-State Circuits (JSSC), 2020.
[12] X. Si et al., “15.5 A 28nm 64Kb 6T SRAM computing-in-memory
macro with 8b MAC operation for AI edge chips,” in 2020 IEEE
International Solid- State Circuits Conference (ISSCC), pp. 246–248,
Feb. 2020.
[13] J. Yue et al., “14.3 A 65nm computing-in-memory-based CNN processor
with 2.9-to-35.8 TOPS/W system energy efficiency using dynamic-
sparsity performance-scaling architecture and energy-efficient inter/intra-
macro data reuse,” in IEEE International Solid-State Circuits Conference
(ISSCC), pp. 234–236, 2020.
[14] H. Kim, T. Yoo, T. T.-H. Kim, and B. Kim, “Colonnade: A re-
configurable SRAM-based digital bit-serial compute-in-memory macro
for processing neural networks,” IEEE Journal of Solid-State Circuits
(JSSC), 2021.
[15] H. Jia, M. Ozatay, Y. Tang, H. Valavi, R. Pathak, J. Lee, and N. Verma,
“15.1 A programmable neural-network inference accelerator based on
scalable in-memory computing,” in IEEE International Solid-State Cir-
cuits Conference (ISSCC), vol. 64, pp. 236–238, 2021.
[16] Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W. Khwa, H. Liao,
Y. Wang, and J. Chang, “15.3 A 351TOPS/W and 372.4GOPS compute-
in-memory SRAM macro in 7nm FinFET CMOS for machine-learning
applications,” in 2020 IEEE International Solid-State Circuits Confer-
ence (ISSCC), pp. 242–244, Feb. 2020.
[17] C.-X. Xue et al., “15.4 A 22nm 2Mb ReRAM compute-in-memory
macro with 121-28TOPS/W for multibit MAC computing for tiny AI
edge devices,” in IEEE International Solid-State Circuits Conference
(ISSCC), pp. 244–246, 2020.
[18] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz,
“An energy-efficient VLSI architecture for pattern recognition via deep
embedding of computation in SRAM,” in 2014 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 8326–8330, May 2014.
[19] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of a machine-learning classifier in a standard 6T SRAM array,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, 2017.
[20] Z. Chen, Z. Yu, Q. Jin, Y. He, J. Wang, S. Lin, D. Li, Y. Wang,
and K. Yang, “CAP-RAM: A charge-domain in-memory computing 6T-
SRAM for accurate and precision-programmable CNN inference,” IEEE
Journal of Solid-State Circuits (JSSC), pp. 1924–1935, 2021.
[21] B. Razavi, “The current-steering DAC [A Circuit for All Seasons],” IEEE Solid-State Circuits Magazine, vol. 10, no. 1, pp. 11–15, 2018.
[22] K. D. Choo, J. Bell, and M. P. Flynn, “Area-efficient 1GS/s 6b SAR
ADC with charge-injection-cell-based DAC,” in 2016 IEEE International
Solid-State Circuits Conference (ISSCC), pp. 460–461, 2016.
Article
In-memory computing (IMC) addresses the cost of accessing data from memory in a manner that introduces a tradeoff between energy/throughput and computation signal-to-noise ratio (SNR). However, low SNR posed a primary restriction to integrating IMC in larger, heterogeneous architectures required for practical workloads due to the challenges with creating robust abstractions necessary for the hardware and software stack. This work exploits recent progress in high-SNR IMC to achieve a programmable heterogeneous microprocessor architecture implemented in 65-nm CMOS and corresponding interfaces to the software that enables mapping of application workloads. The architecture consists of a 590-Kb IMC accelerator, configurable digital near-memory-computing (NMC) accelerator, RISC-V CPU, and other peripherals. To enable programmability, microarchitectural design of the IMC accelerator provides the integration in the standard processor memory space, areaand energy-efficient analog-to-digital conversion for interfacing to NMC, bit-scalable computation (1-8 b), and input-vector sparsity-proportional energy consumption. The IMC accelerator demonstrates excellent matching between computed outputs and idealized software-modeled outputs, at 1b TOPS/W of 192|400 and 1b-TOPS/mm2 of 0.60|0.24 for MAC hardware, at V DD of 1.2|0.85 V, both of which scale directly with the bit precision of the input vector and matrix elements. Software libraries developed for application mapping are used to demonstrate CIFAR-10 image classification with a ten-layer CNN, achieving accuracy, throughput, and energy of 89.3%|92.4%, 176|23 images/s, and 5.31|105.2 μJ/image, for 1|4 b quantization levels.