A Ferroelectric FET Based In-memory Architecture
for Multi-Precision Neural Networks
Abstract—Computing-in-memory (CIM) is a promising ap-
proach to improve the throughput and the energy efﬁciency
of deep neural network (DNN) processors. So far, resistive
nonvolatile memories have been adapted to build crossbar-based
accelerators for DNN inference. However, such structures suffer
from several drawbacks such as sneak paths, large ADCs/DACs,
high write energy, etc. In this paper we present a mixed signal
in-memory hardware accelerator for CNNs. We propose an in-
memory inference system that uses FeFETs as the main non-
volatile memory cell. We show how the proposed crossbar unit
cell can overcome the aforementioned issues while reducing unit
cell size and power consumption. The proposed system decom-
poses multi-bit operands down to single bit operations. We then
re-combine them without any loss of precision using accumulators
and shifters within the crossbar and across different crossbars.
Simulations demonstrate that we can outperform state-of-the-art
efficiencies with 3.28 TOPS/W and can pack 1.64 TOPS in an area
of 1.52 mm2 using 22 nm FDSOI technology.
Index Terms—FeFET Crossbar Array, In-Memory Computation, Convolutional Neural Networks
I. INTRODUCTION
Deep Neural Networks (DNNs) have been widely used for
various complex tasks such as image recognition or natural
language processing, at the cost of higher computational
resource requirements. These requirements, combined with the
limited available computational resources and costly memory accesses,
impose strict constraints on such systems. At the same time, the size
of state-of-the-art DNNs keeps growing larger to achieve better
accuracy. This growth in DNN size presents more challenging
requirements for the on-chip memory capacity and the off-chip
memory accesses. This problem can be observed in CMOS
ASIC designs, in which the on-chip memory is the biggest
bottleneck for energy-efficient computation. Also, off-chip
memory loading results in both substantial energy and
latency in such systems. Over the past years, in-memory
computation was presented as a solution to fulfill these
requirements. An emerging in-memory computation technique is
characterized by the use of crossbar arrays of Non-Volatile
Memory (NVM) units to perform matrix-vector multiplication.
In NVM crossbar arrays, weights are stored within the memory
unit. Several technologies have been introduced for such
crossbars, e.g. Resistive Random-Access Memories (RRAMs),
Phase Change Memories (PCM), and Ferroelectric
Field-Effect Transistors (FeFETs). In this work, FeFETs
are adopted as the main NVM for the crossbar units due to
their high on/off ratio, long-term retention, low write and
operation voltages, as well as scalability.
In this work, we introduce a mixed signal architecture that
aims to exploit opportunities FeFET memory elements can
offer in both digital and analog domains. We ﬁrst optimize the
unit size in the crossbar array, requiring only 1 FeFET for each
AND operation. Furthermore, we analyze the functionality of
our in-memory architecture at various parallelism levels, to
allow for an accelerated computation of Convolutional Neural
Networks (CNNs). In summary, our contributions are the following:
• Novel mixed-signal architecture for in-memory computation
of CNNs using FeFETs as memory cells.
• Low Analog-to-Digital Converter (ADC) overhead
through efficient parallel bit-decomposed MAC operation
compared to the state of the art.
• Elimination of Digital-to-Analog Converters (DACs).
• Multi-precision targeted convolution operations.
• Functionality analysis and mapping operations at three
levels of parallelism:
– Simultaneous activation of multiple crossbar lines.
– Activation of multiple crossbar columns.
– Parallel operation of different crossbars.
This paper is structured as follows: In section II we go
through related work. Then, section III provides a background
on the basics of the FeFET crossbar design and the mode of
operation as well as convolution neural networks. In section
IV, we introduce our architecture. We focus on the CNN
system components and the atomic operations required. Also
in this section we go through how the operations can be
mapped to different components as well as the levels of
parallelism the structure is offering and how it reﬂects on the
data routing. Finally, we present the evaluation of our crossbar
architecture, as well as a comparison to state-of-the-art systems
in section V.
II. RELATED WORK
In the last few years, several studies regarding the usage of
crossbar in-memory architecture for DNN inference have been
presented. Many of these works focused on binarized neural
networks, which limits their use to the set of networks that
can adopt such a data representation.
The currently available work for multi-precision convolution
operation follows one of two main directions. The first
direction tries to maximize the use of the analog properties
the crossbar structure offers. The analog crossbar is
complemented with DACs to convert the digital input values
to analog signals for the crossbar, and with ADCs to yield
the final output in digital form. This method is followed
in several architectures.
However, the cost of this approach appears in the power
consumption and area used by such a structure. For example,
ISAAC shows that the 8-bit ADC accounted for 58% of total
power and 31% of the area. Also, in PRIME one can observe
the cost of the extra hardware added for the decoder driving
the analog signals as well as the sense amplifier circuit
yielding the digital values. To mitigate this, several
architectures lower the bit precision of the operands, which
minimizes the effect of these blocks. However, this limits
the use of these architectures to networks that can tolerate
such bit precision.
On the other hand, several studies started to adopt more
digitally oriented architectures. In the work presented by Y.
Long, a full-precision, fully digital inference architecture
was proposed. However, the system loses one of the main
advantages the crossbar offers by activating only a single
memory cell at a time. The system also shows very poor
utilization in many layers: when the system contains a large
number of processing crossbars that can serve and benefit
very deep layers, it performs poorly in the shallow layers
compared to its peak performance. Moreover, the operating
frequency suggested by the work (2 GHz) leads to high power
consumption.
III. BACKGROUND
A. Convolutional Neural Network
CNNs are a special kind of DNN that adopts convolutional
layers as the main component of the network. In addition to
convolutional layers, a CNN usually has pooling and fully
connected layers as well. However, the convolution operation
is the main operation and can be expressed as:
$$f_o(x, y, z) = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} \sum_{h=0}^{C_{in}-1} f_i(x+i,\, y+j,\, h)\; K_z(i, j, h) \qquad (1)$$

Here $f_o$ and $f_i$ represent the output and input feature
maps, respectively. $K_z$ represents the kernel $z$, where $z$ is the
output depth; $k$ and $C_{in}$ represent the kernel size and input depth, respectively.
Based on equation 1, we can further decompose this operation
down to the bit level, as shown in equation 2, where $I_p$ and $W_p$
refer to the input and weight bit precisions, respectively:

$$f_o(x, y, z) = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} \sum_{h=0}^{C_{in}-1} \sum_{m=0}^{I_p-1} \sum_{n=0}^{W_p-1} \big(f_{i_m}(x+i,\, y+j,\, h)\; K_{z_n}(i, j, h)\big) \cdot 2^{m+n} \qquad (2)$$

Equation 2 represents the lowest level of operation: an AND
gate between two bits yields the elementary product. The summation
of each set of such operations must then be shifted by $m+n$ to
produce the expected output.
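The bit-level decomposition in equation 2 can be sanity-checked numerically. The following sketch (our illustration, not code from the paper) compares a full-precision dot product, i.e. one output pixel of equation 1, against its bit-decomposed form for unsigned operands:

```python
import random

random.seed(0)
Ip, Wp = 8, 4  # input and weight bit precisions, as used in this paper
x = [random.randrange(2 ** Ip) for _ in range(16)]  # unsigned activations
w = [random.randrange(2 ** Wp) for _ in range(16)]  # unsigned weights

# Full-precision dot product (one output pixel of equation 1).
ref = sum(a * b for a, b in zip(x, w))

# Bit-decomposed form of equation 2: AND the m-th input bit with the
# n-th weight bit, accumulate across the vector, then shift by m + n.
acc = 0
for m in range(Ip):
    for n in range(Wp):
        partial = sum(((a >> m) & 1) & ((b >> n) & 1) for a, b in zip(x, w))
        acc += partial << (m + n)

assert acc == ref  # the decomposition is lossless
```

Because every partial term is re-weighted by $2^{m+n}$, the recombination is exact, which is why the architecture can split MACs into single-bit AND operations without precision loss.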
B. FeFET Technology
FeFETs are three-terminal non-volatile memory elements.
However, until the discovery of ferroelectricity in HfO2 thin
films, this device concept was lacking in terms of data storage
retention. The high coercive field (Ec), low dielectric constant,
and CMOS compatibility make HfO2 very suitable compared to
conventional ferroelectrics such as lead zirconate titanate.
Furthermore, its ferroelectricity persists down to ultra-thin
films in the nanometer range, enabling highly scaled devices.
This has enabled implementations in the 28 nm bulk
and 22 nm FDSOI technology nodes. In FeFETs, the
polarization state of the ferroelectric layer affects the transfer
characteristics of the transistor, thus resulting in a shift of
the threshold voltage Vt (see Fig. 1(B)). Due to the high
coercive voltages and high remnant polarization values of HfO2,
large memory windows (MW) are achievable, which are linked to
a high on/off ratio reaching values in the range of 10^3 to 10^5.
The low trap density and the low dielectric constant result in
low gate leakage current and low operation voltage, respectively.
For the design study, we use individual FeFET devices
out of the 22FDX platform of GlobalFoundries. The
transconductance curve of an exemplary device is given
in Fig. 1(B). For the further analysis, we assume a Vt shift
of 1 V with respect to Fig. 1(B); the assumed scale is
highlighted in green. Such an offset Vt shift is obtainable
by work-function engineering and additional back-bias
application. The offset Vt shift is necessary to suppress
any unintended leakage current within the crossbar.
Published implementations for compute-in-memory (CiM)
accelerators typically make use of several bit-cell concepts,
often utilizing an additional select transistor implemented
per bit-cell, as shown in Fig. 1(C). Here, we use a 1-FeFET
bit-cell, which reduces the area to about
0.007 µm2 (CPP × MMP), as shown schematically in Fig. 1(D).
The program scheme follows conventional schemes, as discussed
previously for the FeFET concept.
In inference mode the applied FeFETs can be selected by the
bitlines (BL) and wordlines (WL). Non-activated columns are
grounded. The individual columns are only weakly capacitively
coupled and are assumed to be independent. Besides an increased
area efficiency, necessary for complex on-chip neural network
computation tasks, such scaled devices effectively reduce
the saturation current in the strong-inversion case. For such
an aggressively scaled device with a W/L configuration of
20 × 80 nm2, a variation of σVt/MW of about 3 is expected
at the current stage of technology maturity. However,
operation in saturation under strong inversion significantly
reduces the influence of device variability on the compute-
in-memory result. In our configuration we activate eight units
per column per operation. The output current for all permutations
of the eight input signals and associated FeFET states was
evaluated; the individual states were clearly separated.
Fig. 1. FDSOI FeFET crossbar considerations. (A) Schematic of an FDSOI n-FET device, (B) transconductance curves, (C) schematic-level
representation of the applied FeFET array for compute-in-memory application, (D) layout of the used 1-FeFET configuration.
IV. FEFET CROSSBAR ACCELERATOR
In this section we present our FeFET-based crossbar ac-
celerator dedicated for DNNs. More speciﬁcally, we discuss
the structure organization as well as the inference operation
mapping to the different blocks.
A. FeFET Processing Element Design
We start here by deﬁning Processing Element (PE) as the
main unit in our architecture. As shown in Fig. 2(E) and 2(F), each
PE consists of a FeFET crossbar and a mixed-signal block.
We explore the operations performed by each block and their functionality below.
1) Crossbar Cell Operation: As explained in equation 2
the convolution operation can be decomposed into a number
of AND operations between the input feature map and weights.
Similarly, each cell in the crossbar can perform a one-bit
AND operation. The crossbar is configured such
that each unit cell stores a single weight bit. The gate, drain
and source of the transistor are connected to the weight line
(WL), bit line (BL), and source line (SL), respectively. The weight
line represents the row activation, which corresponds to the
input feature map bit in case of an active row. The source line
represents a control signal that specifies the activated columns in
each clock cycle. Finally, the bit line yields the final result of
the operation.
In our structure, we follow the configuration of activating
multiple input rows at the same time. We chose eight simultaneous
row activations, which corresponds to eight
AND operations being computed and accumulated through the
column. Such a configuration is based on the well-defined level
separation for different accumulation values shown in Section
III. Also, limiting the accumulated values to eight allows
a smaller ADC of only 3 bits. We show in Fig. 2(A)
an example for such MAC operations where the weight bits
are stored in the unit cells, and the input feature map bits are
sequenced in their order.
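A minimal sketch of this column operation (our illustration; the clipping of the count of eight to the value 7 is our assumption of how the 3-bit approximation behaves):

```python
def column_mac(input_bits, weight_bits):
    """One crossbar column: 8 active rows, each cell computing a 1-bit AND."""
    assert len(input_bits) == len(weight_bits) == 8
    analog_sum = sum(i & w for i, w in zip(input_bits, weight_bits))  # 0..8
    return min(analog_sum, 7)  # 3-bit ADC reading; the value 8 is clipped

print(column_mac([1, 1, 0, 1, 1, 0, 1, 1], [1, 0, 1, 1, 1, 1, 1, 1]))  # -> 5
```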
2) FeFET Crossbar Section: We split each crossbar into
sections. In each section, groups of columns are formed, where
each group is connected to a mixed-signal block. In this
configuration, within each group only one column is
activated at any time. The advantages of this approach are:
• Reducing the number of needed mixed-signal blocks, as they
correspond to the number of groups and not of columns.
• Reducing the power consumption and area needed for the
mixed-signal blocks.
• Storing different layers' kernels within the same group.
In Fig. 2(F), an example of the crossbar section is
illustrated. Here, the convolution operation is unrolled by the
following factors. The summation over the kernel window in
equation 2 is parallelized by a factor of eight, which corresponds
to the simultaneously activated rows. Also, the summation
$\sum_{n=0}^{W_p-1}$ over the weight bits is parallelized by the
number of groups per crossbar section. Though a group result can have
a maximum value of eight, approximating it to a 3-bit value
does not affect the network's accuracy, which allows the use
of a 3-bit ADC. The 3-bit results out of the ADCs are connected
to the adder with different displacements according to the
shifting value of n mentioned in equation 2. Such an adder is
illustrated in Fig. 2(F), where we show the predefined group
connections. The adder produces a partial sum that is routed to
the cluster adders (explained in the next section) for further
processing.
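The shift-and-add performed by the section adder can be sketched as follows (the function and values are illustrative, assuming one group per weight bit n):

```python
def section_partial_sum(adc_results):
    """Combine the four group ADC readings; group n stores weight bit n,
    so its result is displaced (shifted) by n before the addition."""
    return sum(r << n for n, r in enumerate(adc_results))

print(section_partial_sum([5, 3, 7, 2]))  # 5*1 + 3*2 + 7*4 + 2*8 = 55
```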
3) Overall Crossbar: Each crossbar consists of four sections,
where each section computes a different kernel. Each
section has its own adder that yields a partial sum for a
different output feature depth. The four sections have no
physical separation; the split is purely algorithmic, which
means that the inputs applied to the crossbar rows are
applied to all of them. In our architecture, crossbars are of
size 256×256. Every 16 columns form a single group, which
is connected to one 3-bit ADC. Every four groups form a
section, where the ADC results are connected to an adder with
different displacements according to the stored weight-bit
displacement n of each group.
This crossbar organization offers the advantage of lowering
the required data traffic. At any point in time, only
eight bits of input and four result words of nine bits each
need to be routed to the cluster adders.
B. PE Cluster
Each PE cluster collects the results from the enclosed
PEs and further accumulates the values to compute either
further partial sums or the ﬁnal output. Each cluster consists
of eight PEs, 4 adders with 8 operand inputs each, 4 shifters and
4 accumulators. As illustrated in Fig. 2(D), the values of the
PEs are summed together to form 4 different partial sums.
Then, these sums are shifted by m according to the input bit
displacement. Finally, this value is accumulated through the 4
Fig. 2. (A) Convolution operation mapped to a crossbar structure. (B) The system architecture which consists of columns of PE clusters as well as the control
unit and the buffers. (C) Column module where further accumulations can be performed as well as max-pooling operation. (D) The structure of PE cluster
and blocks needed for partial sum computation. (E) Single crossbar structure and sectioning where each section computes a different kernel. (F) Crossbar
section output block where analog signals are converted and added.
different accumulators. The iterations needed for accumulation
depend solely on the input feature map precision. The main
goal of the cluster structure is to further reduce the data trafﬁc
and accumulate the values correctly. The final output of each
cluster is 4 different words, each corresponding to 64 full-
precision MAC operations.
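The adder-shifter-accumulator chain of the cluster can be sketched as follows (our illustration; `pe_sums_per_iteration[m]` holds the eight PE partial sums produced during input-bit iteration m):

```python
def cluster_accumulate(pe_sums_per_iteration):
    """Sum the eight PE partial sums, shift by the input bit index m,
    and accumulate across the Ip input-bit iterations."""
    acc = 0
    for m, pe_sums in enumerate(pe_sums_per_iteration):
        acc += sum(pe_sums) << m
    return acc

print(cluster_accumulate([[1] * 8, [2] * 8]))  # 8*1 + 16*2 = 40
```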
C. Cluster Column
The proposed system consists of columns of the mentioned
clusters. Deep convolutional neural networks are known for
the high number of MAC operations reaching thousands of
operations for powerful DNNs. However, within the cluster,
the number of parallel MAC operations is limited to 64 pixels.
The column of clusters can further parallelize MAC operations
by having 4 accumulators for each cluster column that
can receive and accumulate the accumulated pixels. Hence, for
c clusters per column, if for a certain layer the number of MAC
operations is less than or equal to 64c, then the final 4 different
output feature map pixels can be computed as soon as the cluster
results are ready. If it is more than 64c, then several clusters can
be used to compute the needed accumulations for the output
pixel. Such a unit is illustrated in Fig. 2(C), which also
includes the structure needed to perform max-pooling.
The cluster column is illustrated in Fig. 2(B). We build each
cluster column from 4 different clusters.
D. System Grid
The ﬁnal system is seen as a grid of clusters which consist
of columns of crossbars. On the crossbar level, four kernels
are computed at the same time. However, convolution layers
consist of larger number of kernels. Across different columns
of clusters, the kernels are parallelized. Considering r as the
number of cluster columns, the total number of kernels that
can be computed in parallel is 4r. Hence, if the output
feature map depth is less than or equal to 4r, the full output
feature map depth will be computed simultaneously.
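With 4 kernels per cluster column, the number of passes over the grid that a layer requires can be estimated as follows (a back-of-the-envelope sketch; names are ours):

```python
import math

def grid_passes(output_depth, r):
    """Passes needed when r cluster columns each compute 4 kernels in parallel."""
    return math.ceil(output_depth / (4 * r))

print(grid_passes(128, 32))  # 128 kernels on 32 cluster columns -> 1 pass
print(grid_passes(512, 32))  # -> 4 passes
```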
Another benefit of this scheme is the extensive reuse of
the input feature map pixels: each row of clusters shares the
input feature map at each clock cycle, which dramatically
reduces the complexity and the data traffic needed
across the grid. In Fig. 2(B), we illustrate the ﬁnal overview
of the system.
E. Buffers
The grid is complemented with an input feature map buffer
and an output feature map buffer. The input buffer is split into
partitions where each is targeting a certain row of clusters. The
input feature data mapping is done to preserve a one-to-one
relation between the buffers partition and the row of clusters.
This relation simpliﬁes the reading from the buffers. Mapping
of the partition content is deﬁned prior to inference.
The output feature map buffer is preceded by the max-pooling
unit within the cluster column, such that for max-pooling
layers only the result of this unit is stored. The output
buffer reduces the external memory transactions.
F. Control Unit
The whole system is conﬁgured by the control unit, which
receives, at the beginning of each layer's inference, the
configuration it should follow. This information includes the input
feature map precision, which defines the number of iterations
the crossbar needs to complete the operation. It also includes the
input feature map depth and kernel size, which define which PE
cluster results need to be added and the required accumulations.
Furthermore, it receives the kernel mappings in the crossbars
in order to activate the corresponding columns and map inputs to
the correct rows. Such mappings are done statically, and
the weights have already been stored in the crossbar sub-arrays
according to the mapping. Finally, it receives the number of
kernels, which defines whether the input needs to be applied more
than once on the grid to compute different kernels.
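The per-layer configuration can be pictured as a small record; the field names below are our own illustration of the information described in this section, not an interface from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class LayerConfig:
    input_precision: int  # number of crossbar iterations per operation
    input_depth: int      # with kernel_size, defines the required accumulations
    kernel_size: int
    num_kernels: int      # whether inputs must be reapplied across the grid
    kernel_mapping: dict = field(default_factory=dict)  # static kernel placement

cfg = LayerConfig(input_precision=8, input_depth=64, kernel_size=3, num_kernels=128)
```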
V. EVALUATION
To evaluate the accelerator, a SPICE simulation was used.
The device elements of the crossbar were simulated within
22FDX technology, while the FeFET properties in this technology
were extracted from the device characterization. The losses for
every line were assumed as if all branches were activated,
allowing us to decouple the different states from the parasitics. In the worst
case scenario, this approach overestimates the losses. The input
source has a high internal impedance. The input and activation
states are binary, while the transistors are set to a program
or erase state. The crossbar simulation is integrated into the
rest of the overall structure. To have a full system modeling,
Synopsys Design Compiler is used to model power and area
of the synthesized components.
For the analysis of the power consumption, we ignore the
power overhead related to the movement of the input feature
map from the external memory to the architecture buffers
as well as for the final output feature map. The validity of
this approximation is based on one of the main goals of this
structure, which is keeping data transfer to the external
memory to a minimum, as the weights are stored internally
and inter-layer results are kept in the internal buffers.
The simulation only considers 1024 PEs with total size of
64Mb which is organized as 32 cluster columns. Each column
consists of 4 PE clusters where each contains 8 different PEs.
The prototype specification is summarized in Table II.

TABLE I: BENCHMARKING MODELS.
                 SqueezeNet  Dilated Model  ResNet18  ResNet20
FP Acc. [%]           56.67          63.08     68.68     91.04
4-bit Acc. [%]        53.77          62.43     68.26     91.03
#Input [M]            1.743          2.383     34.96     0.211
#Param [M]            0.722          4.957    11.175     0.271
#MAC [GOPs]           0.287          1.826      18.9     0.041
Perf. [FPS]            3773             79       768     24319
Perf. [TOPS]          1.083           1.49       1.4     0.993
We map and compute different convolutional models. We
benchmark our system with SqueezeNet for ImageNet,
ResNet18 for ImageNet, ResNet20 for CIFAR-10, and the
dilated model for Cityscapes. As shown in Table I, these
networks vary in the size of the parameters and the
input/output feature map sizes. They also vary in the number
of MAC operations to be performed. However, based on our
quantization optimization we were able to maintain high
accuracy while using only 4-bit activations, with only 2.9%
accuracy loss in the worst case. Due to the
limited space, we will not go through our model optimization here.

TABLE II: DESIGN SPECIFICATION FOR THE SYSTEM PROTOTYPE.
No. of PEs                     1024
Crossbar size                  256 × 256 (64 Kb)
Power [W]                      0.5
Total Area [mm2]               1.52
Operating frequency            400 MHz
Peak performance [TOPS]        1.64
Area Efficiency [TOPS/mm2]     1.08
Energy Efficiency [TOPS/W]     3.28
Convolution operation          8-bit Activation / 4-bit Weight
A. Performance Analysis
In this section we discuss three main performance metrics:
computation performance, area and power efficiency of our
architecture. We ran our experiments at a frequency of 400 MHz.
Although crossbar unit cells can operate at a higher frequency,
we use this frequency to meet the targeted ADC and digital
blocks' overall area and energy requirements.
1) Computational Performance: For the Computational
performance, we tested the previously mentioned CNN models
in the simulation environment and measured the performance.
The current structure can reach 1.638 tera operations per
second (TOPS), as shown in Table II. Each operation represents
a full MAC operation for an 8-bit input and 4-bit weight.
Though the performance per layer is affected by the layer
dimensions, the system shows a high level of utilization by
parallelizing operations across different levels to maximize
the system performance. For example, the dilated model
achieves full utilization in 93% of its layers, which reflects
directly on the system performance. As shown in Table III,
across the different model structures and memory/computational
loads, our system keeps a high performance, reaching 90% of
the peak performance across the whole model. The system also
shows good extendability to larger numbers of PEs and crossbar
sizes while maintaining a high computation efficiency. However,
in order to keep high utilization at such extended dimensions,
replicated copies of the weights need to be stored.
B. Area Efﬁciency
We also analyzed the area occupied by our design. This
investigation includes the crossbar area including the
ADC circuitry; the adders and accumulators across the PEs
and the overall system; the needed input/output feature
map buffers; and finally the control unit logic needed for
system operation. The design occupies an area of 1.52 mm2.
With such an area, we achieve 1.08 TOPS/mm2, which outperforms
the state of the art by a factor of 2.3x, through balancing
the operations between digital and analog computation while
keeping the overhead related to such computations to a
minimum. The complete elimination of DACs and the use of
only 3-bit ADCs keep the area overhead occupied by support
logic minimal. Furthermore, spreading the adders across
TABLE III: SIMULATION RESULTS FOR VARIOUS IN-MEMORY COMPUTING SYSTEMS USING DIFFERENT TECHNOLOGIES (SYSTEM-LEVEL COMPARISON).
                              SCOPE  Neural Cache  ISAAC  AtomLayer  VMM-based  Our work
Technology                     22nm          28nm   28nm          -       28nm      22nm
Parameter Storage              DRAM          SRAM  ReRAM      ReRAM      FeFET     FeFET
Power [W]                     176.4          52.9   65.8        4.8       18.2       0.5
Area [mm2]                      273             -   85.4       6.89       49.6      1.52
Peak performance [TOPS]        7.08            28      -       3.23      16.38      1.64
Energy Efficiency [TOPS/W]     0.04         0.529   0.38       0.68      0.896      3.28
Area Efficiency [TOPS/mm2]    0.026             -   0.46       0.47       0.33      1.08
the structure reduces the data to be transported, which
reflects on the bus area.
C. Power Efﬁciency
The power consumption of the architecture was investigated.
In the simulations we do not consider the energy required
to program the weights within the crossbar. This assumption
validity is based on considering only real-time inference
operation. Compared to the state of the art, our design
consumes only 0.5 W, as shown in Table III. This corresponds
to an energy efficiency of 3.28 TOPS/W. With such performance,
our system outperforms the state of the art by a factor of 3.6x.
This high efficiency is further emphasized by the high
utilization the system makes of the available hardware. Similar
to the area efficiency, the elimination of DACs and large ADCs
is reflected in the power consumption. In terms of power
efficiency, the system shows a good balance between analog
and digital computation, which reflects the good distribution
of computational parallelization between these blocks. Also,
keeping the frequency at 400 MHz reduces the power consumption.
VI. CONCLUSION
In this work, we presented an architecture that utilizes the
FeFET crossbar technology for multi-precision neural network
acceleration. Furthermore, we evaluated the performance of
our architecture across several widely used neural networks
models and compared it to state of the art systems. Through
combining the FeFET technology and innovative structure for
neural network acceleration, we achieved an area efﬁciency
enhancement of 2.3x to the current available state of the
art. We consider this work to be the foundation of further
optimized architectures that can drive innovations that current
structures can not.
REFERENCES
 Y. Kwon and M. Rhu, “Beyond the memory wall: A case for memory-
centric hpc system for deep learning,” in MICRO, 2018.
 S. Yu, “Neuro-inspired computing with emerging nonvolatile memorys,”
Proceedings of the IEEE, 2018.
 S. Yu, Neuro-inspired computing using resistive synaptic devices.
 S. Ambrogio et al., “Equivalent-accuracy accelerated neural-network
training using analogue memory,” Nature, 2018.
 M. Jerry et al., “Ferroelectric fet analog synapse for acceleration of deep
neural network training,” in 2017 IEDM.
 M. Bocquet et al., “In-memory and error-immune differential rram
implementation of binarized deep neural networks,” in IEDM, 2018.
 X. Chen et al., “Design and optimization of fefet-based crossbars for
binary convolution neural networks,” in DATE ’18.
 S. Yu et al., “Binary neural network with 16 mb rram macro chip for
classiﬁcation and online training,” in IEDM, 2016.
 K. Ando et al., “Brein memory: A single-chip binary/ternary reconﬁg-
urable in-memory deep neural network accelerator achieving 1.4 tops at
0.6 w,” IEEE Journal of Solid-State Circuits, 2018.
 A. Shaﬁee et al., “Isaac: A convolutional neural network accelerator
with in-situ analog arithmetic in crossbars,” in ISCA, 2016.
 Z. Zhu et al., “Mixed size crossbar based rram cnn accelerator with
overlapped mapping method,” in ICCAD, 2018.
 Y. Long et al., “Reram-based processing-in-memory architecture for
recurrent neural network acceleration,” IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 26, no. 12, pp. 2781–2794, 2018.
 Z. Zhu et al., “A conﬁgurable multi-precision cnn computing framework
based on single bit rram,” in DAC, 2019.
 P. Chi et al., “Prime: A novel processing-in-memory architecture for
neural network computation in reram-based main memory,” in ISCA, 2016.
 Y. Long et al., “A ferroelectric fet-based processing-in-memory archi-
tecture for dnn acceleration,” IEEE Journal on Exploratory Solid-State
Computational Devices and Circuits, 2019.
 T. S. Böscke et al., “Ferroelectricity in hafnium oxide thin films,” Appl.
Phys. Lett., 2011.
 J. Müller et al., “Ferroelectric hafnium oxide based materials and
devices: Assessment of current status and future prospects,” ECS J. Solid
State Sci. Technol., 2015.
 S. Dunkel et al., “A fefet based super-low-power ultra-fast embedded
nvm technology for 22nm fdsoi and beyond,” in IEDM ’17.
 S. L. Miller et al., “Physics of the ferroelectric nonvolatile memory ﬁeld
effect transistor,” J. Appl. Phys., 1992.
 S. Beyer et al., “Fefet: A versatile cmos compatible device with game-
changing potential,” in IMW, 2020.
 K. D. Choo et al., “27.3 area-efﬁcient 1gs/s 6b sar adc with charge-
injection-cell-based dac,” in ISSCC, 2016.
 S. Li et al., “Scope: A stochastic computing engine for dram-based in-
situ accelerator,” in MICRO, 2018.
 C. Eckert et al., “Neural cache: Bit-serial in-cache acceleration of deep
neural networks,” in ISCA, 2018.
 X. Qiao et al., “Atomlayer: A universal reram-based cnn accelerator
with atomic layer computation,” in 2018 55th ACM/ESDA/IEEE Design
Automation Conference (DAC), 2018, pp. 1–6.
 F. N. Iandola et al., “Squeezenet: Alexnet-level accuracy with 50x fewer
parameters and <0.5MB model size,” ArXiv, vol. abs/1602.07360, 2016.
 J. Deng et al., “ImageNet: A Large-Scale Hierarchical Image Database,”
in CVPR09, 2009.
 K. He et al., “Deep residual learning for image recognition,” in CVPR, 2016.
 A. Krizhevsky et al., “Cifar-10 (canadian institute for advanced re-
search).” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
 F. Yu and V. Koltun, “Multi-scale context aggregation by dilated
convolutions,” in ICLR, 2016.
 M. Cordts et al., “The cityscapes dataset for semantic urban scene
understanding,” in CVPR, 2016.