MC2-RAM: An In-8T-SRAM Computing Macro
Featuring Multi-Bit Charge-Domain Computing and
ADC-Reduction Weight Encoding
Zhiyu Chen∗, Qing Jin†, Jingyu Wang∗, Yanzhi Wang†, and Kaiyuan Yang∗
∗Rice University, Houston TX †Northeastern University, Boston MA
Abstract—In-memory computing (IMC) is a promising hard-
ware architecture to circumvent the memory walls in data-
intensive applications, like deep learning. Among various memory
technologies, static random-access memory (SRAM) is promising
thanks to its high computing accuracy, reliability, and scalability
to advanced technology nodes. This paper presents a novel
multi-bit capacitive convolution in-SRAM computing macro for
high accuracy, high throughput and high efficiency deep learn-
ing inference. It realizes fully parallel charge-domain multiply-
and-accumulate (MAC) within compact 8-transistor 1-capacitor
(8T1C) SRAM arrays whose cells are only 41% larger than
standard 6T cells. It performs MAC with multi-bit activations without
conventional digital bit-serial shift-and-add schemes, leading to
drastically improved throughput for high-precision CNN models.
An ADC-reduction encoding scheme complements the compact
SRAM design by halving the number of needed ADCs
for energy and area savings. A 576×130 macro with 64 ADCs
is evaluated in 65nm with post-layout simulations, showing 4.60
TOPS/mm2 compute density and 59.7 TOPS/W energy efficiency
with 4/4-bit activations/weights. The MC2-RAM also achieves
excellent linearity with only 0.14 mV (4.5% of the LSB) standard
deviation of the output voltage in Monte Carlo simulations.
Index Terms—CMOS; SRAM; in-memory computation; mixed-
signal computation; convolutional neural networks (CNNs); deep
learning accelerator
I. INTRODUCTION
Deep convolutional neural networks (CNNs) have achieved
unprecedented success in the field of artificial intelligence
(AI) in the past decade. However, the intensive computation
required for even inference makes it challenging to deploy
pre-trained models on resource-constrained edge devices. The
essential and computationally dominant operation in CNN
models, the convolution, requires an overwhelming number of
multiply-and-accumulate (MAC) operations with excessive on-/off-chip
memory access. It is well-known that the energy bottleneck
in such computation lies in the data movement rather than the
arithmetic operations, leading to the so-called memory wall
[1].
Recent progress in in-memory computing (IMC) provides
an attractive solution to circumvent the memory wall. The
key idea behind IMC is to perform the computation directly
inside the memory by accessing multiple rows simultaneously.
Thus, the local computation significantly reduces the data
movement and the parallel accessing amortizes the read energy
[2]. Emerging embedded non-volatile memories (eNVM) [3]–
[5] are promising candidates since they eliminate the off-chip
memory access and have high storage density with potential
multi-level cell states. Nevertheless, their computing accuracy
is largely compromised due to the inaccurate storage and
small readout dynamic range of the cells. On the other hand,
SRAM is a mature embedded memory technology that has
attracted growing interest for IMC implementations in recent
years [6]–[15] because of its superb computing efficiency and
technology scalability. Silicon-verified results [11], [12] prove
that in-SRAM computing can achieve accuracy competitive
with digital ASIC accelerators. More importantly,
SRAM scales well with advanced technology nodes while
eNVM falls behind the transistor scaling. As an example,
the in-SRAM computing in 7nm [16] provides about 2 times
higher storage density than in-RRAM computing in 22nm [17].
The concept of in-SRAM-computing was first proposed by
[18], and verified in silicon by [19]. This current-domain
computing scheme turns on multiple wordlines (WLs) simul-
taneously and accumulates the current on the bitline (BL). It
is further developed by [7], [10] to support multi-bit CNN
models. While the implementation is simple, it suffers from
process variation and the nonlinear I-V characteristics of
transistors, which limit the sensing margin. Recently, charge-
domain computation, proposed by [8], has proved to be an effective
method of enhancing the computing linearity and robustness
to PVT variations. The analog multiplication is performed in
local capacitors in the cells and the accumulation is achieved
by charge sharing [8], [11] or capacitive coupling [9], where
no transistor nonideality is involved throughout the operations.
However, due to large area overhead, those designs sacrifice
either the versatility of analog computation (only support
binary or ternary operations) [9], [11] or the parallelism with a
group of cells sharing one analog computing circuit [8]. Even
so, the cell area of such designs is still 2 to 3 times larger
than the logic-rule 6T cell, leading to degraded compute density
(TOPS/mm2).
Beyond the computing methods, the optimization of weight
encoding schemes is often overlooked. Most in-SRAM-computing
designs utilize the traditional 2’s complement weight encoding
[7], [11], [12], where the convolution of each bit is processed
in analog domain separately and the partial sums are shift-
and-added in the digital periphery. This scheme requires k
power-hungry ADCs to read out one output partial sum in a
k-bit CNN model.

Fig. 1: Comparison of state-of-the-art in-SRAM-computing schemes.
This paper proposes a Multi-bit Capacitive Convolution
(MC2) SRAM macro using 8T SRAM cells, featuring: (1) high
computing accuracy without transistor non-idealities, (2) multi-
bit computation with full parallelism, (3) highly compact cell
area and (4) reduced number of ADCs and associated energy
and area overheads. Our main contributions include:
• A novel MC2 in-SRAM computing scheme that performs
multi-bit charge-domain MAC with full parallelism and
high accuracy in 8T1C SRAM cells. The cell area is only
41% larger than the logic-rule 6T cell.
• An ADC-reduction weight encoding scheme that reduces
the number of ADCs by half for identical throughput.
• A complete MC2-RAM macro with all peripherals and
compact layouts, achieving state-of-the-art energy effi-
ciency and compute density in post-layout simulations.
II. DESIGN CONSIDERATIONS AND RELATED WORK
This section summarizes the key design considerations for
in-SRAM-computing macros and analyzes existing designs.
A. Computing Schemes
Fig. 1 abstracts the working principles of state-of-the-art
and the proposed MC2 in-SRAM computing schemes. Current-
domain IMC [7], [10], [19] activates multiple WLs at the
same time and accumulates the current on the BL using 6T
or 8T cells. This scheme features simple implementations
and compatibility with standard SRAM cells, yet faces severe
transistor nonidealities where the process variation of the
SRAM access transistor directly affects the output current.
The PVT variation and the nonlinearity can be addressed
by the charge-domain IMC that draws increasing interests in
recent studies [6], [8], [9], [11], [20]. One charge-
domain computing approach is capacitive coupling [9] (see
Fig. 1). The binary (or ternary) input directly drives the
bottom plates of local capacitors, forming a capacitive voltage
divider. This scheme has a relatively simple cell structure
and excellent computing linearity, but cannot support multi-
bit analog MAC because the multi-bit input requires power-
hungry analog drivers. On the other hand, the charge sharing
approach [8], [11], [20] performs the analog multiplication via
a 1-bit multiplier (can be implemented as a single PMOS) and
performs the accumulation by sharing the charge to the output
TABLE I: Performance comparison of bit-serial analog MAC
and multi-bit analog MAC in a BX-bit CNN model.

                     BIT-SERIAL    MULTI-BIT
Energy per cycle     E_BS          (1+α)·E_BS
Latency per cycle    T_BS          (1+β)·T_BS
Energy per MAC       BX·E_BS       (1+α)·E_BS
Latency per MAC      BX·T_BS       (1+β)·T_BS
BLs (see Fig. 1) through an output switch. This is termed the
top-plate sampling, top-plate charge-sharing (TSTC) scheme.
Multi-bit MAC can be supported by sampling the input voltage
on the local capacitor via an additional input switch. However,
these extra switches incur large area and power overheads.
B. Analog Multi-Bit MAC Support
To match the computing accuracy of the digital accelerators,
it is crucial to support CNN models with high precision
(≥4-bit). Most IMC macros with multi-bit MAC support [7],
[11], [12] perform 1- or 2-bit analog MAC in memory and
accumulate the partial sums in digital shift-and-add periphery
in a serial manner. While it simplifies the cell design, the serial
computation causes huge energy and throughput penalties (see
Table I, where α and β are the energy and latency overheads
of implementing multi-bit analog MAC). For example, although
the multi-bit MC2 incurs an estimated 20% energy and latency
overhead over the one-bit implementation, the potential
improvement of the overall performance in a 4-bit CNN model
is 3.3 times.
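The trade-off summarized in Table I can be checked numerically. This is a sketch, not a measurement: α = β = 0.2 encode the roughly 20% overhead estimated above, and E_BS, T_BS are arbitrary per-cycle baselines.

```python
# Per-MAC energy/latency for bit-serial vs. multi-bit analog MAC in a
# BX-bit CNN model, following Table I.  alpha/beta are the assumed
# ~20% multi-bit overheads; E_BS/T_BS are arbitrary baseline units.
def per_mac_cost(BX, E_BS=1.0, T_BS=1.0, alpha=0.2, beta=0.2):
    bit_serial = {"energy": BX * E_BS, "latency": BX * T_BS}
    multi_bit = {"energy": (1 + alpha) * E_BS, "latency": (1 + beta) * T_BS}
    return bit_serial, multi_bit

bs, mb = per_mac_cost(BX=4)
gain = bs["energy"] / mb["energy"]   # 4 / 1.2 ≈ 3.3x
print(f"energy gain per MAC: {gain:.1f}x")
```

For BX = 4, the sketch reproduces the 3.3 times potential improvement quoted above.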
C. Computing Parallelism
Another feature of the proposed SRAM macro is the full
computing parallelism, where all the memory cells can be
accessed simultaneously. Recently, several studies [8], [12],
[20] propose a semi-parallel computing structure where cells
are clustered to share one local computing circuit, and only
one of the cells is activated in each cycle. Despite
the reduced cell area, both the peak compute density and
energy efficiency are compromised in this design. The
memory energy of a fully parallel IMC can be modeled as
E_F = (N_V·L_V·C_V + N_H·L_H·C_H) · V_DD · V_SW,    (1)

where N_V/N_H is the number of computing or control wires in
the vertical/horizontal direction, L_V/L_H represents the length
of the wires, C_V/C_H is the unit metal capacitance, and V_SW is
the voltage swing on the wires. For the semi-parallel design
with k-cell clusters, the modeled energy becomes

E_S = (N_V·L_V·C_V + (1/k)·N_H·L_H·C_H) · V_DD · V_SW.    (2)

In other words, the throughput of full parallelism is k times
larger than that of semi-parallelism, while the energy consumption
is much less than k·E_S thanks to better read-energy amortization.

Fig. 2: The 8T1C MC2 cell structure and logic-rule layout.

Fig. 3: Comparison of the cell area in TSMC 65 nm and the
simulated standard deviation of the output voltage as a result
of process variation.
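The full- vs. semi-parallel trade-off in Equations (1) and (2) can be sanity-checked with a small model. The wire counts and unit capacitances below are illustrative placeholders, not values extracted from the layout.

```python
from fractions import Fraction as F

# Wire-energy model of Eqs. (1)-(2).  k = 1 gives the fully parallel
# E_F; k > 1 gives the semi-parallel E_S with k-cell clusters.
def wire_energy(NV, LV, CV, NH, LH, CH, VDD, VSW, k=1):
    return (NV * LV * CV + F(1, k) * NH * LH * CH) * VDD * VSW

# Illustrative parameters where horizontal (control) wiring dominates.
p = dict(NV=1, LV=F(1), CV=F(1), NH=4, LH=F(1), CH=F(1), VDD=F(1), VSW=F(1))
EF = wire_energy(**p, k=1)   # fully parallel, Eq. (1)
ES = wire_energy(**p, k=8)   # 8-cell clusters, Eq. (2)

# Full parallelism does 8x the work per cycle, yet costs much less
# than 8x the semi-parallel energy: the control wires are amortized.
assert EF < 8 * ES
print(EF, ES)
```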
III. THE MC2 SCHEME IN 8T SRAM
We propose the MC2 scheme to realize charge-domain compu-
tation with multi-bit input and full computing parallelism, using
compact 8-transistor 1-capacitor (8T1C) SRAM cells (see
Fig. 2). The benefits of these design features are as follows.
1) The charge-domain computation ensures high computing
linearity that approaches the CNN inference accuracy of digital
hardware. 2) MAC with direct multi-bit input support avoids
serial digital processing that has long latency and large energy
overhead (see Table I). 3) Fully parallel computing better
amortizes the control wire energy and thus achieves higher
energy efficiency (see Equation 2), and higher compute density.
4) A compact cell structure (see Fig. 3) increases the compute
density. 5) The novel switching scheme in MC2 requires fewer
control wires for analog MAC than previous charge-domain
designs [2], [20], reducing one of the dominant sources of
energy in IMC operations.
The key concept behind MC2 is a bottom-plate sampling,
top-plate charge-sharing (BSTC) method that replaces the local
output switches in conventional TSTC schemes with a single
global switch per column (see Fig. 1). The input voltage is first
sampled on the bottom plate of the local capacitors with the
global switch on. After the analog multiplication, the bottom
plate is connected to either ground or VDD with the global
switch turned off so that the charge-sharing is performed on
the top plate.
This scheme leads to the 8T1C cell design with two extra
transistors controlled by the differential SRAM values (weight)
to perform input voltage sampling and local multiplication
between inputs and weights (Fig. 2). Although this cell design
looks the same as that in [9], the BSTC scheme in MC2 is
fundamentally different from the capacitive coupling scheme in
C3SRAM: it reuses the two-transistor analog multiplier
as the multi-bit input sampling switch, making it the first
design to achieve fully parallel MAC with multi-bit inputs in
8T SRAM cells. In the physical layout, each MC2 SRAM cell
requires 8 transistors and one local metal-oxide-metal (MOM)
capacitor (see Fig. 2). The MOM cap can be placed above
the transistors without any area overhead. The cell layout
follows the conventional 6T layout style while the two pass
gate PMOS (T1 and T2) share the gate of the cross-coupled
inverters. Overall, we realize a highly compact cell layout that
is only 41% larger than the logic-rule thin-cell 6T cell. As
shown in Fig. 3, MC2-RAM is significantly smaller
than prior arts with similar computing accuracy (measured by
the variation of the computing result due to process variations),
even those that only support 1- or 2-bit inputs.
More specifically, the MC2 computation starts with a DAC
phase (see Fig. 4). The 4-bit input signal will be sampled on
the bottom plate via one of the input lines (INA1 or INA2)
by a current-steering DAC [21]. At the multiplication phase,
INA1 will be driven to VDD so that the MOM cap will either
be charged to VDD (equivalent to multiplying ‘0’) or keep
its sampled voltage (equivalent to multiplying ‘1’) depending
on the stored data in the cell. Despite the floating state of
some bottom plates, the leakage does not have a significant
effect since those bottom plates are connected to INA2 with
large parasitic capacitance (∼180 fF) and the only leakage
path is through P2. Simulations show that at 125 °C, the
worst-case leakage current is 9.5 nA, leading to only a
0.034 mV change in the final result (equivalent to
0.01 LSB). In the final accumulation phase, all
the bottom plates will be connected to VDD and the charge-
sharing is directly performed on top plates without the need
for any local switches. At the final state, all the bottom plates
are connected to VDD and the ADC starts the quantization.
All switches in the MAC switching unit (MSU) are PMOS
while N1 and N2 are NMOS.
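The three-phase operation above can be captured by an idealized behavioral model. This is a simplification under stated assumptions (equal unit capacitors, ideal switches, top plates precharged to 0 V before floating), not a circuit netlist: charge conservation on the floating top-plate node yields V_out = (1/N)·Σ_i w_i·(VDD − V_in,i).

```python
from fractions import Fraction as F

# Idealized BSTC model.  After the multiplication phase, the bottom
# plate of cell i sits at V_in[i] if the stored bit w[i] = 1, or at
# VDD if w[i] = 0.  Driving all bottom plates to VDD while the
# floating top plates share charge then yields a linear dot product
# with no transistor in the analog signal path.
def bstc_output(w, v_in, vdd=F(1)):
    n = len(w)
    v_bot = [v if wi else vdd for wi, v in zip(w, v_in)]  # after multiply
    # charge conservation on the shared floating top-plate node
    return sum(vdd - vb for vb in v_bot) / n

w = [1, 0, 1, 1]
v_in = [F(1, 4), F(1, 2), F(3, 4), F(1, 2)]
v_out = bstc_output(w, v_in)
# matches (1/N) * sum_i w[i] * (VDD - v_in[i]) exactly
assert v_out == sum(F(wi) * (1 - vi) for wi, vi in zip(w, v_in)) / 4
print(v_out)   # 3/8
```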
IV. ADC-REDUCTION ENCODING SCHEME
Two’s complement weight encoding is a natural choice
for most in-SRAM computing macros because one SRAM
cell stores only 1 bit of information and the weights are easy
to decompose under this scheme. Each column only processes
one bit of the multi-bit weight and converts the analog
MAC into digital domain. The digital shift-and-add module
accumulates the partial sums according to the significance of
the bit position. Note that the partial sum on the MSB (sign
bit) is negative and the polarity needs to be flipped. The major
limitation of this scheme is that each bit position needs one
power-hungry ADC (i.e., one ADC per column), which brings
large power overhead. Meanwhile, it restricts the layout space
of ADCs, leading to low-quality layout matching and degraded
computing accuracy.

Fig. 4: Operations of the MC2 scheme.

Fig. 5: ADC-reduction encoding for 4-bit weights.
In MC2-RAM, because the cell footprint is significantly
reduced as described in Section III, fitting one ADC per
column is even more challenging, and the ADCs would
dominate the total area.
Therefore, we propose a weight encoding scheme to effectively
reduce the number of ADCs by half. It takes advantage of
the natural subtraction performed by differential ADCs (see
Fig. 5). Unlike 2's complement encoding, where only
the sign bit is negative, every two bits of the weights are
paired with opposite polarities. As a result, neighboring
cells are connected to different inputs of one ADC. The
significance of the negative bit is twice that of the positive bit.
For instance, the bit significances of a 4-bit number are
{-8, 4, -2, 1}, so that the decimal number -7 is encoded as
1001 ((-8)×1 + 4×0 + (-2)×0 + 1×1 = -7). The subtraction of the
negative and positive MAC results is performed concurrently
with the quantization process in the differential ADCs. After
that, the quantized partial sums from multiple bit pairs will
then be shift-added in digital.
According to the encoding table in Fig. 5, the range of
the encoded numbers is not symmetrical. To match the range of
2's complement encoding, each weight needs a constant bias c
(e.g., 2 in 4-bit encoding), which adds ∑_i c·X_i to the final
MAC results. Note that the bias is the same for all results in
each convolution. Therefore, one dummy column shared by
the whole array, with all cells storing '1', calculates ∑_i X_i × 1
with negligible overhead. This scheme can be extended to any
bitwidth of the weights with more inter-column digital shift-
and-adds and a different constant bias c, while always benefiting
from the natural subtraction of ADCs. It is worth mentioning
that the encoding scheme can also be disabled to implement
the widely used ternary-weight neural networks.

Fig. 6: System diagram of the proposed MC2-RAM.
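The ADC-reduction encoding can be verified exhaustively in a few lines. The sketch follows the {−8, 4, −2, 1} significances and the c = 2 bias described in this section; the brute-force enumeration is for clarity only, not a hardware procedure.

```python
# ADC-reduction encoding for 4-bit weights: bit significances
# {-8, +4, -2, +1}, MSB first.  Negative bits feed the negative ADC
# input and positive bits the positive input, so the subtraction is
# absorbed by the differential ADC.
SIG = (-8, 4, -2, 1)
BIAS = 2   # constant c restoring the 2's-complement range

def value(bits):
    """Raw code value as seen through the differential ADC."""
    return sum(s * b for s, b in zip(SIG, bits))

def all_codes():
    return [[(c >> (3 - i)) & 1 for i in range(4)] for c in range(16)]

# The in-text example: code 1001 evaluates to (-8) + 1 = -7.
assert value([1, 0, 0, 1]) == -7

# Raw codes span [-10, 5]; adding the bias c = 2 (realized by the
# dummy column computing sum_i X_i) recovers the range [-8, 7].
vals = sorted(value(b) for b in all_codes())
assert vals == list(range(-10, 6))
assert [v + BIAS for v in vals] == list(range(-8, 8))
```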
V. IMPLEMENTATION AND EVALUATIONS
A. MC2-RAM Implementation
Leveraging the proposed MC2 scheme and ADC-reduction
encoding, we implement a MC2-RAM macro consisting of
a 576×130 8T SRAM array, with 128 columns performing
normal MAC and 2 dummy columns, and peripherals for
computation and read/write (Fig. 6). Each row includes a 4-
bit current-steering DAC and two MSUs for positive (INAs)
and negative inputs (INBs). Differential 8-bit ADCs directly
quantize the difference between positive and negative columns.
A charge-injection SAR (ciSAR) ADC [22] is adapted for
the macro, similar to [20]. In this design, the capacitive DAC
is replaced by groups of transistors with a long channel length
that behave as capacitors, which have much smaller area and
require no power-hungry analog buffers to drive the reference
voltages. The ciSAR ADC achieves 7.85-bit effective number
of bits (ENOB) in transient noise simulations, showing great
effectiveness in low-precision applications.
We implement the complete layout of MC2-RAM and verify
the linearity and performance in TSMC 65nm LP process,
as shown in Fig. 7. The total area of the macro is 0.280
mm2. The 9 KB SRAM array, the ciSAR ADCs and the input
processing circuitry (DACs & MSUs) occupy 47.5%, 30.8%
and 8.7% of the total area, respectively. 2.6% of the area is
used by read/write circuitry and 2.6% is occupied by the WL
drivers and decoders. Moreover, an in-house simulator is used
to verify the computing accuracy and the effectiveness of the
encoding scheme in end-to-end CNN applications.
B. Computing Linearity and Variation
Monte Carlo simulations (Fig. 8(a)) verify the excellent
computing linearity of MC2 (R2 = 0.9999) and show negligible
one-sigma variation of the output voltage (less than 0.14 mV,
4.5% of the LSB). The reason is that the charge-sharing
operation is inherently linear. It is also immune to process
variation because the capacitor mismatch is orders of magnitude
smaller than transistor mismatch. One-bit DACs are implemented
in this experiment to exclude the non-idealities of the DACs.
The capacitor mismatch is included using the foundry MOM
capacitor variation model. Only local mismatch is considered
because the global systematic variation can be compensated by
tuning the bias voltages of the ADCs and DACs.

Fig. 7: Layout of the MC2-RAM macro.

Fig. 8: Monte Carlo simulations (100 iterations) for the
computing linearity by sweeping (a) the activated number of
DACs and (b) the input code of DACs.
The mismatch and nonlinearity of DACs are further included
in Fig. 8(b) by sweeping the input code of 4-bit DACs with
all DACs activated. The standard deviation in the worst-case
scenario is 0.27 mV, which is less than 8.6% of the LSB. In
both linearity results, the standard deviation of output voltage
increases with the voltage, which is expected as a result of the
accumulation of independent capacitor variations. The linearity
results confirm that the system can accurately calculate dot
products with different combinations of inputs and weights.
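The growth of the spread with the output voltage can be reproduced with a toy Monte Carlo model of capacitor mismatch. The 0.1% unit-cap sigma is an assumed figure, not the foundry model, and the charge sharing is idealized as a capacitively weighted average.

```python
import random
import statistics

# Toy mismatch model: n_total unit caps share one top-plate node;
# the n_active caps contribute a unit voltage step, the rest zero.
def output(n_total, n_active, sigma, rng):
    caps = [1.0 + rng.gauss(0.0, sigma) for _ in range(n_total)]
    return sum(caps[:n_active]) / sum(caps)

rng = random.Random(0)   # fixed seed for reproducibility

def spread(n_active, trials=2000):
    return statistics.stdev(output(64, n_active, 1e-3, rng)
                            for _ in range(trials))

# More activated cells accumulate more independent cap variations,
# so the standard deviation of the shared output voltage grows.
assert spread(32) > spread(4)
```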
C. Post-Layout System Verification
Fig. 9 illustrates the operational waveforms of four input
lines (INA1, INA2, INB1 and INB2) from one row and two
output lines (OLN and OLP) from neighboring positive and
negative columns in the system-level post-layout simulation.
In this experiment, we set one specific input and weight
pattern so that OLN stays at 0 V while OLP reaches 3/4
of its full range (150 mV out of 200 mV). The system
time-interleaves the MC2 operations in positive and negative
columns to minimize the effect of parasitic coupling, which
is critical to OLN since it is floating after the operations in
negative columns. It is observed that OLN does not stay at
0 V (a 22.8 mV offset, see Fig. 9) due to the coupling effect
coming from the switching of the INBs, but this only introduces
a fixed offset because the transition of the INBs is independent
of the inputs.

Fig. 9: Waveforms in the system post-layout simulation.

Fig. 10: Computing linearity of the complete analog chain.
The post-layout simulation of the entire analog chain (see
Fig. 10), including the nonidealities of the memory, the ADC
and the DACs, further verifies the excellent linearity of the
system.
D. Inference Accuracy on End-to-End CNN Model
The simulator performing end-to-end CNN inference veri-
fies the effectiveness of the ADC-reduction encoding scheme
and the high computing accuracy of the macro. The functions
of the simulator include: (1) mapping the CNN models to
the SRAM macro, (2) decomposing the multi-bit weights
into multiple columns as required by the encoding scheme,
(3) quantizing the analog MAC partial sums in each column
and (4) adding non-idealities to the convolution operations.
A 4-bit quantized ResNet-20 is mapped to the macros in this
experiment. As Fig. 11 shows, if ideal differential ADC and
analog computation are assumed, the encoding scheme will
not cause significant accuracy loss on the CIFAR-10 dataset
compared to the baseline model that has no ADC quantization
errors. Further, the ideal MAC results are mapped to 64
transfer curves of the entire analog chain (from digital input
to ADC output) in the simulator, where the transfer curves are
exported under 64 Monte Carlo simulations. With all the non-
idealities considered, the system with 8-bit ADCs achieves
90.2% inference accuracy.
E. Energy and Throughput
The clock frequency of the macro is 70 MHz, while the
power consumption is 21.6 mW, of which the ADCs take
75.1%, the array 21.6%, and the controller and others occupy
3.3% of the total power. The system achieves 4.60 TOPS/mm2
compute density and 59.7 TOPS/W energy efficiency with
4/4-bit activations/weights.

Fig. 11: End-to-end CNN inference accuracy on CIFAR-10.

Fig. 12: Comparison of energy efficiency and compute density
of IMC macros. Results are normalized to 65 nm technology.

Since the bit precision significantly affects the performance
of IMC (see Section II.B), we utilize a
bitwise metric to characterize the performance, similar to [12],
[20], where the number of bit operations (bOPs) is defined as

# bOPs = # OPs × input bitwidth × weight bitwidth.    (3)
MC2-RAM simultaneously achieves 955.2 TbOPS/W peak
efficiency and 73.6 TbOPS/mm2 peak compute density, out-
performing state-of-the-art SRAM-based IMC macros.
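Equation (3) directly ties the bitwise figures to the raw ones: with 4-bit activations and weights, the scale factor is 16.

```python
# Bitwise performance metrics per Eq. (3): scale the raw figure by
# the input and weight bitwidths.
def to_bitwise(metric, in_bits, w_bits):
    return metric * in_bits * w_bits

# Reported numbers: 59.7 TOPS/W and 4.60 TOPS/mm2 at 4/4-bit.
assert abs(to_bitwise(59.7, 4, 4) - 955.2) < 1e-9   # TbOPS/W
assert abs(to_bitwise(4.60, 4, 4) - 73.6) < 1e-9    # TbOPS/mm2
```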
The high compute density results from the compact charge-
domain cell design and fully parallel computation. Since
ADC is the energy bottleneck of IMC systems, the proposed
multi-bit analog MAC avoids serial digital accumulation that
requires multiple iterations of ADC readout. Meanwhile, the
encoding scheme further halves the ADC energy. The array
only takes a small portion of the total energy because the
computing scheme requires only four horizontal wires (INA1,
INA2, INB1 and INB2) per row and one vertical wire (OL)
per column.
VI. CONCLUSION
In summary, this work presents MC2-RAM, a multi-bit
charge-domain in-SRAM computing macro with 8T1C cells
and an ADC-reduction encoding scheme. The proposed BSTC
scheme enables analog MAC with multi-bit inputs with signif-
icantly reduced cell area, while maintaining high computing
accuracy and robustness to process variation. Simulations show
only 0.14 mV variation in the worst case, which is 56.7 times
better than the conventional current-domain design. Mean-
while, the proposed ADC-reduction encoding scheme reduces
ADC energy and relaxes layout congestion. Overall, MC2-
RAM simultaneously improves bitwise energy efficiency and
compute density over state-of-the-art IMC SRAM macros. The
end-to-end ResNet-20 inference, with all physical nonidealities
included, achieves 90.2% accuracy on CIFAR-10.
REFERENCES
[1] M. Horowitz, “1.1 Computing’s energy problem (and what we can do
about it),” in 2014 IEEE International Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), pp. 10–14, IEEE, 2014.
[2] N. Verma, H. Jia, H. Valavi, Y. Tang, M. Ozatay, L.-Y. Chen, B. Zhang,
and P. Deaville, “In-memory computing: Advances and prospects,” IEEE
Solid-State Circuits Magazine, vol. 11, no. 3, pp. 43–55, 2019.
[3] V. Joshi, M. Le Gallo, S. Haefeli, I. Boybat, S. R. Nandakumar,
C. Piveteau, M. Dazzi, B. Rajendran, A. Sebastian, and E. Eleftheriou,
“Accurate deep neural network inference using computational phase-
change memory,” Nature communications, vol. 11, no. 1, pp. 1–13, 2020.
[4] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and
H. Qian, “Fully hardware-implemented memristor convolutional neural
network,” Nature, vol. 577, no. 7792, pp. 641–646, 2020.
[5] M.-H. Wu, M.-S. Huang, Z. Zhu, F.-X. Liang, M.-C. Hong, J. Deng,
J.-H. Wei, S.-S. Sheu, C.-I. Wu, G. Liang, and T.-H. Hou, “Compact
probabilistic poisson neuron based on back-hopping oscillation in STT-
MRAM for all-spin deep spiking neural network,” in IEEE Symposium
on VLSI Technology (VLSI), June 2020.
[6] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, “A 64-Tile
2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-
Domain Compute,” IEEE Journal of Solid-State Circuits (JSSC), vol. 54,
pp. 1789–1799, June 2019.
[7] X. Si, J.-J. Chen, Y.-N. Tu, W.-H. Huang, J.-H. Wang, Y.-C. Chiu, W.-C.
Wei, S.-Y. Wu, X. Sun, R. Liu, et al., “A twin-8T SRAM computation-in-
memory unit-macro for multibit CNN-based AI edge processors,” IEEE
Journal of Solid-State Circuits (JSSC), vol. 55, no. 1, pp. 189–202, 2019.
[8] A. Biswas and A. P. Chandrakasan, “CONV-SRAM: An energy-efficient
SRAM with in-memory dot-product computation for low-power convo-
lutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 54,
no. 1, pp. 217–230, 2018.
[9] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, “C3SRAM: An in-memory-
computing SRAM macro based on robust capacitive coupling computing
mechanism,” IEEE Journal of Solid-State Circuits, vol. 55, no. 7,
pp. 1888–1897, 2020.
[10] S. K. Gonugondla, M. Kang, and N. Shanbhag, “A 42pJ/decision 3.12
TOPS/W robust in-memory machine learning classifier with on-chip
training,” in 2018 IEEE International Solid-State Circuits Conference
(ISSCC), pp. 490–492, 2018.
[11] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, “A programmable
heterogeneous microprocessor based on bit-scalable in-memory comput-
ing,” IEEE Journal of Solid-State Circuits (JSSC), 2020.
[12] X. Si et al., “15.5 A 28nm 64Kb 6T SRAM computing-in-memory
macro with 8b MAC operation for AI edge chips,” in 2020 IEEE
International Solid- State Circuits Conference (ISSCC), pp. 246–248,
Feb. 2020.
[13] J. Yue et al., “14.3 A 65nm computing-in-memory-based CNN processor
with 2.9-to-35.8 TOPS/W system energy efficiency using dynamic-
sparsity performance-scaling architecture and energy-efficient inter/intra-
macro data reuse,” in IEEE International Solid-State Circuits Conference
(ISSCC), pp. 234–236, 2020.
[14] H. Kim, T. Yoo, T. T.-H. Kim, and B. Kim, “Colonnade: A re-
configurable SRAM-based digital bit-serial compute-in-memory macro
for processing neural networks,” IEEE Journal of Solid-State Circuits
(JSSC), 2021.
[15] H. Jia, M. Ozatay, Y. Tang, H. Valavi, R. Pathak, J. Lee, and N. Verma,
“15.1 A programmable neural-network inference accelerator based on
scalable in-memory computing,” in IEEE International Solid-State Cir-
cuits Conference (ISSCC), vol. 64, pp. 236–238, 2021.
[16] Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W. Khwa, H. Liao,
Y. Wang, and J. Chang, “15.3 A 351TOPS/W and 372.4GOPS compute-
in-memory SRAM macro in 7nm FinFET CMOS for machine-learning
applications,” in 2020 IEEE International Solid-State Circuits Confer-
ence (ISSCC), pp. 242–244, Feb. 2020.
[17] C.-X. Xue et al., “15.4 A 22nm 2Mb ReRAM compute-in-memory
macro with 121-28TOPS/W for multibit MAC computing for tiny AI
edge devices,” in IEEE International Solid-State Circuits Conference
(ISSCC), pp. 244–246, 2020.
[18] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz,
“An energy-efficient VLSI architecture for pattern recognition via deep
embedding of computation in SRAM,” in 2014 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 8326–8330, May 2014.
[19] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of a
machine-learning classifier in a standard 6T SRAM array,” IEEE Journal
of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, 2017.
[20] Z. Chen, Z. Yu, Q. Jin, Y. He, J. Wang, S. Lin, D. Li, Y. Wang,
and K. Yang, “CAP-RAM: A charge-domain in-memory computing 6T-
SRAM for accurate and precision-programmable CNN inference,” IEEE
Journal of Solid-State Circuits (JSSC), pp. 1924 – 1935, 2021.
[21] B. Razavi, “The current-steering DAC [A Circuit for All Seasons],” IEEE
Solid-State Circuits Magazine, vol. 10, no. 1, pp. 11–15, 2018.
[22] K. D. Choo, J. Bell, and M. P. Flynn, “Area-efficient 1GS/s 6b SAR
ADC with charge-injection-cell-based DAC,” in 2016 IEEE International
Solid-State Circuits Conference (ISSCC), pp. 460–461, 2016.