A Ferroelectric FET Based In-memory Architecture
for Multi-Precision Neural Networks
Abstract—Computing-in-memory (CIM) is a promising ap-
proach to improve the throughput and the energy efficiency
of deep neural network (DNN) processors. So far, resistive
nonvolatile memories have been adapted to build crossbar-based
accelerators for DNN inference. However, such structures suffer
from several drawbacks such as sneak paths, large ADCs/DACs,
high write energy, etc. In this paper we present a mixed signal
in-memory hardware accelerator for CNNs. We propose an in-
memory inference system that uses FeFETs as the main non-
volatile memory cell. We show how the proposed crossbar unit
cell can overcome the aforementioned issues while reducing unit
cell size and power consumption. The proposed system decom-
poses multi-bit operands down to single bit operations. We then
re-combine them without any loss of precision using accumulators
and shifters within the crossbar and across different crossbars.
Simulations demonstrate that we can outperform state-of-the-art
efficiencies with 3.28 TOPS/J and can pack 1.64 TOPS in an area
of 1.52 mm² using 22 nm FDSOI technology.
Index Terms—FeFET Crossbar array, In-Memory Computation, Convolutional Neural Networks
I. INTRODUCTION

Deep Neural Networks (DNNs) have been widely used for
various complex tasks such as image recognition or natural
language processing, at the cost of higher computational
resource requirements. These requirements, set against the limited available computational resources and costly memory accesses, place increasingly strict constraints on the systems. At the same time, the size
of state-of-the-art DNNs keeps growing larger to achieve better
accuracy. This growth in DNN size presents more challenging
requirements for the on-chip memory capacity and the off-chip
memory accesses. This problem can be observed in the CMOS
ASIC designs in which the on-chip memory is the biggest
bottleneck for efficient energy computation [1]. Also, the off-
chip memory loading results in both substantial energy and
latency in such systems [2]. Over the past years, in-memory computation has been presented as a solution to fulfill these requirements. An emerging in-memory computation technique is characterized by the usage of crossbar arrays of Non-Volatile Memory (NVM) units to perform matrix-vector multiplications.
In NVM crossbar arrays, weights are stored within the mem-
ory unit. Several technologies have been introduced for such
crossbars, e.g. Resistive Random-Access Memories (RRAMs)
[3] , Phase Change Memories (PCM) [4] and Ferroelectric
Field-Effect Transistors (FeFETs) [5]. In this work, FeFETs are adopted as the main NVM for the crossbar units due to their high on/off ratio, long-term retention, low writing and operation voltages, as well as their scalability.
In this work, we introduce a mixed-signal architecture that aims to exploit the opportunities FeFET memory elements offer in both the digital and analog domains. We first optimize the unit size in the crossbar array, requiring only one FeFET per AND operation. Furthermore, we analyze the functionality of our in-memory architecture at various parallelism levels to allow for an accelerated computation of Convolutional Neural
Networks (CNNs). In summary, our contributions are the following:
• A novel mixed-signal architecture for in-memory computation of CNNs using FeFETs as memory cells.
• Low Analog-to-Digital Converter (ADC) overhead through an efficient parallel bit-decomposed MAC operation compared to the state of the art.
• Elimination of Digital-to-Analog Converters (DACs).
• Multi-precision targeted convolution operations.
• A functionality analysis and operation mapping at three parallelism levels:
  – Simultaneous activation of multiple crossbar lines.
  – Activation of multiple crossbar columns.
  – Parallel operation of different crossbars.
This paper is structured as follows: In section II we go
through related work. Then, section III provides a background
on the basics of the FeFET crossbar design and the mode of
operation as well as convolutional neural networks. In section IV, we introduce our architecture. We focus on the CNN
system components and the atomic operations required. Also
in this section we go through how the operations can be
mapped to different components as well as the levels of
parallelism the structure is offering and how it reflects on the
data routing. Finally, we present the evaluation of our crossbar
architecture, as well as a comparison to state-of-the-art systems
in section V.
II. RELATED WORK

In the last few years, several studies regarding the usage of
crossbar in-memory architecture for DNN inference have been
presented. Many of these works, such as [6], [7], [8] and [9], focused on binarized neural networks, which limits their usage to the set of networks that can adopt such a data representation.
The currently available work on multi-precision convolution operations follows one of two main directions. The first direction tries to maximize the usage of the analog properties
the crossbar structure is offering. The analog crossbar is
complemented with DACs to convert the digital input values to analog signals for the crossbar, and with ADCs to yield the final output in digital form. This method is followed in several architectures such as [10], [11], [12] and [13].
However, the cost of this approach appears in the power consumption and area used by such structures. For example, ISAAC [10] shows that the 8-bit ADC accounted for 58% of the total power and 31% of the area. The cost can also be observed in PRIME [14] in the extra hardware added: decoders to drive the analog signals as well as sense amplifier circuits to yield the digital values. To address this, several architectures lower the bit precision of the operands, which minimizes the effect of these blocks. However, this limits the usage of these architectures to networks that can tolerate such bit precision.
On the other hand, several studies have adopted more digitally oriented architectures. In the work presented by Y. Long et al. [15], a full-precision, fully digital inference architecture was proposed. However, the system loses one of the main advantages the crossbar offers by activating only a single memory cell at a time. The system also shows very poor utilization in many layers: while it contains a large number of processing crossbars that can serve and benefit very deep layers, it performs poorly in shallow layers compared to its peak performance. Moreover, the operating frequency suggested by the work (2 GHz) leads to high power consumption.
III. BACKGROUND

A. Convolutional Neural Network
CNNs are a special kind of DNN that adopts convolutional layers as the main component of the network. In addition to convolutional layers, a CNN usually has pooling and fully connected layers as well. The convolution operation, however, is the main operation and can be expressed as:
f_o(x, y, z) = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} \sum_{h=1}^{C_{in}} f_i(x+i, y+j, h) K_z(i, j, h)    (1)

where f_o and f_i represent the output and input feature maps, respectively, K_z represents the kernel z where z is the output depth, and k and C_in represent the kernel size and input depth, respectively.
Based on equation 1, we can further decompose this operation down to the bit-operation level, as shown in equation 2. I_p and W_p refer to the input and weight bit precisions, respectively.
f_o(x, y, z) = \sum_{m=0}^{I_p-1} \sum_{n=0}^{W_p-1} \Big( \big( \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} \sum_{h=1}^{C_{in}} f_{i_m}(x+i, y+j, h) \wedge K_{z_n}(i, j, h) \big) \cdot 2^{m+n} \Big)    (2)

where f_{i_m} denotes the m-th bit of the input feature map and K_{z_n} the n-th bit of kernel z.
Equation 2 represents the lowest level of operation: an AND gate between two bits yields the expected value. However, the sum of each set of operations must be shifted by a value equal to m+n to produce the expected output.
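The equivalence between the full-precision MAC of equation 1 and the bit-decomposed form of equation 2 can be checked with a short numerical sketch (the operand widths and values below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Ip, Wp = 4, 4                      # input and weight bit precisions
x = rng.integers(0, 2**Ip, 16)     # unsigned activations
w = rng.integers(0, 2**Wp, 16)     # unsigned weights

# Reference: full-precision MAC (equation 1, flattened over i, j, h)
ref = int(np.dot(x, w))

# Equation 2: decompose into single-bit AND operations, then recombine
# with a shift of m+n per (input bit m, weight bit n) pair
acc = 0
for m in range(Ip):
    for n in range(Wp):
        xm = (x >> m) & 1          # m-th bit-plane of the inputs
        wn = (w >> n) & 1          # n-th bit-plane of the weights
        acc += int(np.sum(xm & wn)) << (m + n)

assert acc == ref
```

The inner `np.sum(xm & wn)` is exactly the quantity one crossbar column accumulates, which is why no precision is lost in the recombination.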
B. FeFET Technology
FeFETs are three-terminal non-volatile memory elements.
However, until the discovery of ferroelectricity in HfO2 thin films, this device concept was lacking in terms of data storage retention. The high coercive field (Ec), low dielectric constant and CMOS compatibility make HfO2 very suitable compared to conventional ferroelectrics such as lead zirconate titanate. Furthermore, its ferroelectricity persists down to ultra-thin films
in the nanometer range [16], enabling highly scaled devices.
This enabled implementations in the 28 nm bulk [17] and 22 nm FDSOI [18] technology nodes. In FeFETs, the polarization state of the ferroelectric layer affects the transfer characteristics of the transistor, resulting in a shift of the threshold voltage Vt (see Fig. 1(B), as extracted from [18]). Due to the high coercive voltages and high remanent polarization values of HfO2 [17], large memory windows (MW) are achievable, which are linked to a high on/off ratio [19] reaching values in the range of 10^3–10^5. The low trap
density and the low dielectric constant result in low gate
leakage current and low operation voltage, respectively.
For the design study, we use individual FeFET devices from the 22FDX platform of GlobalFoundries. The transconductance curve of an exemplary device is given in Fig. 1(B) (extracted from [18]). For the further analysis, we assume a Vt shift of 1 V with respect to Fig. 1(B); the assumed scale is highlighted in green. Such an offset Vt shift is obtainable by work function engineering and additional back-bias application. The offset Vt shift is necessary to suppress any unintended leakage current within the crossbar.
Published implementations of compute-in-memory (CiM) accelerators typically make use of several bit-cell concepts, often utilizing an additional select transistor implemented per bit-cell, as shown in Fig. 1(C). Here, we use a 1-FeFET bit-cell, which reduces the area to about 0.007 µm² (CPP×MMP), as shown schematically in Fig. 1(D). The programming scheme follows conventional schemes as discussed previously for the FeFET concept [20].
In inference mode, the applied FeFETs can be selected by the bitlines (BL) and wordlines (WL). Columns that are not activated are grounded. All individual columns are only weakly capacitively coupled and assumed to be independent. Besides the increased area efficiency, necessary for complex on-chip neural network computation tasks, such scaled devices effectively reduce the saturation current in the strong-inversion case. For such an aggressively scaled device with a W/L configuration of 20 × 80 nm², a σVt/MW variation of about 3 is expected at the current stage of technology maturity [20]. However, operation in saturation under strong inversion significantly reduces the influence of the device variability on the compute-in-memory result. In our configuration, we activate eight units per column per operation. The output current for all permutations of the eight input signals and associated FeFET states was evaluated; the individual states were clearly separated without overlap.
Fig. 1. FDSOI FeFET crossbar considerations (A) Schematic of a FDSOI n-FET device, (B) Transconductance curves as extracted from [18], (C) Schematic-
level representation of the applied FeFET array for compute-in-memory application, (D) Layout of the used 1-FeFET configuration
IV. ARCHITECTURE

In this section, we present our FeFET-based crossbar accelerator dedicated to DNNs. More specifically, we discuss the structure organization as well as the mapping of the inference operations to the different blocks.
A. FeFET Processing Element Design
We start here by defining the Processing Element (PE) as the main unit in our architecture. As shown in Fig. 2(E) and 2(F), each PE consists of a FeFET crossbar and a mixed-signal block. We explore the operations performed by each block in the following.
1) Crossbar Cell Operation: As explained in equation 2, the convolution operation can be decomposed into a number of AND operations between the input feature map and weight bits. Accordingly, each cell in the crossbar performs a one-bit AND operation. The crossbar is configured such that each unit cell stores a single weight bit. The gate, drain and source of the transistor are connected to the weight line (WL), bit line (BL), and source line (SL), respectively. The weight line represents the row activation, which corresponds to the input feature map bit in case of an active row. The source line is a control signal that specifies the activated columns in each clock cycle. Finally, the bit line yields the final result of the operation.
In our structure, we follow the configuration of activating multiple input rows at the same time. We chose eight simultaneous row activations, which corresponds to eight AND operations being computed and accumulated along the column. This configuration is based on the well-defined level separation between different accumulation values shown in Section III. Also, limiting the accumulated value to eight allows a smaller ADC of only 3 bits. We show in Fig. 2(A) an example of such MAC operations, where the weight bits are stored in the unit cells and the input feature map bits are applied in their sequence order.
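A behavioral sketch of one column operation under this scheme follows; note that mapping the 0..8 accumulation onto the 3-bit ADC by clipping the value eight to seven is our modeling assumption, not a detail stated above:

```python
def column_mac(input_bits, weight_bits):
    """One crossbar column with eight simultaneously activated rows:
    each unit cell computes a 1-bit AND of its input bit and stored
    weight bit, and the results accumulate on the bit line."""
    assert len(input_bits) == len(weight_bits) == 8
    return sum(i & w for i, w in zip(input_bits, weight_bits))  # 0..8

def adc_3bit(column_sum):
    # The 0..8 accumulation is approximated by a 3-bit ADC; here we
    # assume simple clipping to the 3-bit range 0..7.
    return min(column_sum, 7)

ones = [1] * 8
print(adc_3bit(column_mac(ones, ones)))  # fully active column -> 7
```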
2) FeFET Crossbar Section: We split each crossbar into sections. In each section, groups of columns are formed, where each group is connected to a mixed-signal block. Within each group, only one column is activated at any time. The advantages of this approach are:
• It reduces the number of needed mixed-signal blocks, as they correspond to the number of groups and not the number of columns.
• It reduces the power consumption and area needed for the crossbar unit.
• Kernels of different layers can be stored within the same group.
In Fig. 2(F), an example of the crossbar section is illustrated. Here, the convolution operation is unrolled by the following factors. The summation \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} \sum_{h=1}^{C_{in}} is parallelized by a factor of eight, which corresponds to the simultaneously activated rows. Also, \sum_{n=0}^{W_p-1} is parallelized by the number of groups per crossbar section. Though the group result can have a maximum value of eight, approximating it to a 3-bit value does not affect the network accuracy, which allows the usage of a 3-bit ADC based on the work presented in [21]. The 3-bit results of the ADCs are connected to the adder with different displacements according to the shifting value of n in equation 2. This adder is illustrated in Fig. 2(F), where we show the predefined group connections. The adder yields a partial sum that is routed to the cluster adders (explained in the next section) for further processing.
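The group recombination can be sketched as follows, assuming 4-bit weights spread over four groups with group n holding weight-bit n:

```python
def section_partial_sum(group_adc_values):
    """Adder of one crossbar section: group n stores weight-bit n, so
    its 3-bit ADC result enters the adder displaced by n (the shifting
    value of equation 2)."""
    return sum(v << n for n, v in enumerate(group_adc_values))

# e.g. all four groups reporting the maximum 3-bit value 7:
print(section_partial_sum([7, 7, 7, 7]))  # -> 105
```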
3) Overall Crossbar: Each crossbar consists of four sections, where each section computes a different kernel. Each section has its own adder, which yields a partial sum for a different output feature depth. The four sections have no physical separation; the split is purely algorithmic, which means that the inputs applied to the crossbar rows are applied to all of them. In our architecture, crossbars are of size 256×256. Every 16 columns form a single group, which is connected to one 3-bit ADC. Every four groups form a section, where the ADC results are connected to an adder with different displacements according to the stored weight-bit displacement n of each group.
This crossbar organization offers the advantage of lowering the required data traffic. At any point in time, only eight bits of input are needed, and only four result words of nine bits each need to be routed to the cluster adders.
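The stated organization and per-cycle data traffic can be checked with simple arithmetic:

```python
# 256x256 crossbar: 4 sections x 4 groups x 16 columns per group
sections, groups_per_section, cols_per_group = 4, 4, 16
assert sections * groups_per_section * cols_per_group == 256

adcs_per_crossbar = sections * groups_per_section  # sixteen 3-bit ADCs
input_bits_per_cycle = 8                           # activated rows
output_bits_per_cycle = sections * 9               # four 9-bit result words
print(adcs_per_crossbar, output_bits_per_cycle)    # -> 16 36
```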
B. PE Cluster
Each PE cluster collects the results from the enclosed PEs and further accumulates the values to compute either further partial sums or the final output. Each cluster consists of eight PEs, four adders with eight operand inputs each, four shifters and four accumulators. As illustrated in Fig. 2(D), the values of the PEs are summed together to form four different partial sums. Then, these sums are shifted by m according to the input bit
displacement. Finally, this value is accumulated through the 4
different accumulators. The iterations needed for accumulation depend solely on the input feature map precision. The main goal of the cluster structure is to further reduce the data traffic and to accumulate the values correctly. The final output of each cluster is four words, each corresponding to 64 full-precision MAC operations.

Fig. 2. (A) Convolution operation mapped to a crossbar structure. (B) The system architecture, consisting of columns of PE clusters as well as the control unit and the buffers. (C) Column module, where further accumulations as well as the max-pooling operation can be performed. (D) The structure of a PE cluster and the blocks needed for partial-sum computation. (E) Single crossbar structure and sectioning, where each section computes a different kernel. (F) Crossbar section output block, where analog signals are converted and added.
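A behavioral sketch of the cluster accumulation, assuming the input bit-planes are applied from least- to most-significant bit:

```python
def cluster_accumulate(partial_sums_per_input_bit):
    """One cluster accumulator: in iteration m, the eight PE partial
    sums belonging to input bit-plane m are added, shifted by m, and
    accumulated, as in Fig. 2(D)."""
    acc = 0
    for m, pe_sums in enumerate(partial_sums_per_input_bit):
        acc += sum(pe_sums) << m   # shift by input bit displacement m
    return acc

# Two input bit-planes, eight PE results each:
print(cluster_accumulate([[1] * 8, [1] * 8]))  # -> 8 + (8 << 1) = 24
```

The number of iterations equals the input feature map precision, matching the statement above.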
C. Cluster Column
The proposed system consists of columns of the aforementioned clusters. Deep convolutional neural networks are known for their high number of MAC operations, reaching thousands of operations for powerful DNNs. Within a cluster, however, the number of parallel MAC operations is limited to 64. The column of clusters can further parallelize MAC operations by having four accumulators per cluster column that can receive and accumulate the cluster outputs. Hence, for c clusters per column, if the number of MAC operations of a certain layer is less than or equal to 64c, then the final four output feature map pixels can be computed as soon as the cluster results are ready. If it exceeds 64c, several clusters can be used to compute the needed accumulations for the output pixel. Such a unit is illustrated in Fig. 2(C), which also includes the structure needed to perform max-pooling.
The cluster column is illustrated in Fig. 2(B). We build each cluster column from four different clusters, which corresponds to 32 PEs.
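The mapping decision described above can be sketched as follows (the function and parameter names are ours, chosen for illustration):

```python
import math

def clusters_per_output_pixel(layer_macs, c, macs_per_cluster=64):
    """For a column of c clusters, each covering 64 parallel MACs,
    return how many clusters must be combined per output pixel and
    whether a single column suffices for the layer."""
    needed = math.ceil(layer_macs / macs_per_cluster)
    return needed, needed <= c

# A 3x3 kernel over 64 input channels needs 576 MACs per output pixel:
print(clusters_per_output_pixel(3 * 3 * 64, c=4))  # -> (9, False)
```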
D. System Grid
The final system is a grid of clusters, consisting of columns of crossbars. At the crossbar level, four kernels are computed at the same time. However, convolution layers consist of a larger number of kernels, so the kernels are parallelized across the different columns of clusters. Considering r as the number of cluster columns, the total number of kernels that can be computed in parallel is 4r. Hence, if the output feature map depth is less than or equal to 4r, the full output feature map depth is computed simultaneously.
Another benefit of this scheme is the extensive reuse of the input feature map pixels: each row of clusters shares the input feature map at each clock cycle, which dramatically reduces the complexity and the data traffic needed across the grid. In Fig. 2(B), we illustrate the final overview of the system.
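The kernel parallelism across the grid follows directly from these numbers (a small arithmetic sketch):

```python
import math

def grid_passes(output_depth, r):
    """With four kernels per crossbar (hence per cluster column) and r
    cluster columns, 4*r kernels run in parallel; deeper output feature
    maps require reusing the input on the grid."""
    return math.ceil(output_depth / (4 * r))

# With the prototype's 32 cluster columns, 128 kernels fit in one pass:
print(grid_passes(128, 32), grid_passes(256, 32))  # -> 1 2
```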
E. Buffers
The grid is complemented with an input feature map buffer and an output feature map buffer. The input buffer is split into partitions, where each partition targets a certain row of clusters. The input feature data mapping preserves a one-to-one relation between the buffer partitions and the rows of clusters, which simplifies reading from the buffers. The mapping of the partition contents is defined prior to inference.
The output feature map buffer is preceded by the max-pooling unit within the cluster column, such that for max-pooling layers only the result of this unit is stored. The output buffer reduces the external memory transactions.
F. Control Unit
The whole system is configured by the control unit, which receives the configuration it should follow at the beginning of each layer's inference. This information includes the input feature map precision, which defines the number of iterations the crossbar needs to complete the operation, as well as the input feature map depth and kernel size, which define which PE cluster results need to be added and the required accumulations. The control unit also receives the kernel mappings in the crossbars in order to activate the corresponding columns and map the inputs to the correct rows. Such mappings are determined statically, and the weights have already been stored in the crossbar sub-arrays according to the mapping. Finally, the number of kernels defines whether the input needs to be applied to the grid more than once to compute different kernels.
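The per-layer configuration received by the control unit can be summarized in a small structure (the field names are illustrative assumptions, not taken from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class LayerConfig:
    input_precision: int   # crossbar iterations (one per input bit-plane)
    input_depth: int       # with kernel_size, defines which PE cluster
    kernel_size: int       #   results must be added and accumulated
    num_kernels: int       # whether the input is reused across the grid
    kernel_mapping: dict = field(default_factory=dict)  # static column/row map

    def crossbar_iterations(self) -> int:
        # One iteration per input bit-plane (Section IV-A)
        return self.input_precision

cfg = LayerConfig(input_precision=4, input_depth=64, kernel_size=3,
                  num_kernels=128)
print(cfg.crossbar_iterations())  # -> 4
```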
V. EVALUATION

To evaluate the accelerator, SPICE simulations were used. The device elements of the crossbar were simulated in the 22FDX technology, with the FeFET properties in this technology extracted from [18]. The losses for every line were assumed as if all branches were activated, allowing the different states to be decoupled from the parasitics; in the worst case, this approach overestimates the losses. The input source has a high internal impedance. The input and activation states are binary, while the transistors are set to a program or erase state. The crossbar simulation is integrated into the rest of the overall structure. For full system modeling, Synopsys Design Compiler is used to model the power and area of the synthesized components.
For the analysis of the power consumption, we ignore the power overhead related to the movement of the input feature map from the external memory to the architecture buffers, as well as that of the final output feature map. The validity of this approximation rests on one of the main goals of this structure: keeping the data transfer to the external memory to a minimum, as the weights are stored internally and inter-layer results are kept in the internal buffers.
The simulation considers 1024 PEs with a total size of 64 Mb, organized as 32 cluster columns. Each column consists of four PE clusters, each containing eight different PEs. The prototype specification is summarized in Table II.
TABLE I
                 SqueezeNet  Dilated Model  ResNet18  ResNet20
FP Acc.¹ [%]     56.67       63.08          68.68     91.04
4-bit Acc. [%]   53.77       62.43          68.26     91.03
#Input [M]       1.743       2.383          34.96     0.211
#Param [M]       0.722       4.957          11.175    0.271
#MAC [GOPs]      0.287       1.826          18.9      0.041
Perf. [FPS]      3773        79             768       24319
Perf. [TOPS]     1.083       1.49           1.4       0.993
We map and compute different convolutional models. We benchmark our system with SqueezeNet [25] on ImageNet [26], ResNet18 [27] on ImageNet, ResNet20 [27] on CIFAR10 [28], and the dilated model [29] on Cityscapes [30]. As shown in Table I, these networks vary in the size of their parameters and input/output feature maps; they also vary in the number of MAC operations to be performed. However, based on our quantization optimization, we were able to maintain high accuracy while using only 4-bit activations, with only 2.9% accuracy loss in the worst case. Due to the limited space, we do not go through our model optimization.

¹ Top-1 accuracy.

TABLE II
No. of PEs                   1024
Crossbar size                256 × 256 (64 Kb)
Power [W]                    0.5
Total area [mm²]             1.52
Operating frequency          400 MHz
Peak performance [TOPS]      1.64
Area efficiency [TOPS/mm²]   1.08
Energy efficiency [TOP/J]    3.28
Convolution operation        8-bit activation / 4-bit weight
A. Performance Analysis
In this section, we discuss three main performance metrics: the computational performance, area efficiency and power efficiency of our architecture. We ran our experiments at a frequency of 400 MHz. Although the crossbar unit cells can operate at a higher frequency, we use this frequency to meet the overall area and energy requirements of the targeted ADC and digital blocks.
1) Computational Performance: For the computational performance, we tested the previously mentioned CNN models in the simulation environment and measured the performance. The current structure can reach 1.638 tera operations per second (TOPS), as shown in Table II. Each operation represents a full MAC operation for an 8-bit input and a 4-bit weight.
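The reported peak throughput can be reproduced with a back-of-the-envelope calculation, under the assumption that all 128 clusters (32 columns × 4 clusters) deliver their 4 × 64 full-precision MACs every 8 cycles, i.e. one cycle per 8-bit input bit-plane:

```python
clusters = 32 * 4              # cluster columns x clusters per column
macs_per_cluster_pass = 4 * 64 # four output words, 64 MACs each
cycles_per_pass = 8            # one iteration per input bit (8-bit inputs)
f_clk = 400e6                  # operating frequency [Hz]

tops = clusters * macs_per_cluster_pass * f_clk / cycles_per_pass / 1e12
print(round(tops, 3))          # -> 1.638
```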
Though the performance per layer is affected by the layer dimensions, the system shows a high level of utilization by parallelizing operations across the different levels to maximize the system performance. For example, the dilated model achieves full utilization in 93% of its layers, which reflects directly on the system performance. As shown in Table III, across the different model structures and memory/computational loads, our system keeps a high performance, reaching 90% of the peak performance across the whole model. The system also shows good extendability to larger numbers of PEs and crossbar sizes while maintaining a high computation efficiency. However, in order to keep high utilization at such extended dimensions, replicas of the weights need to be stored.
B. Area Efficiency
We also analyzed the area occupied by our design. This area investigation includes the crossbar size including the ADC circuitry, the adders and accumulators across the PEs and across the overall system, the needed input/output feature map buffers, and finally the control unit logic needed for the system operation. The design occupies an area of 1.52 mm². With this area, we achieve 1.08 TOPS/mm², which outperforms the state of the art by a factor of 2.3x, through balancing the operations between digital and analog computations while keeping the overhead related to such computations to a minimum. The complete elimination of DACs and the use of only 3-bit ADCs keep the area overhead occupied by support
logic to a minimum. Furthermore, spreading the adders across the structure reduces the amount of data to be transported, which is reflected in the bus area.

                            SCOPE [22]  Neural Cache [23]  ISAAC [10]  AtomLayer [24]  VMM based [15]  Our work
Technology                  22nm        28nm               28nm        -               28nm            22nm
Power [W]                   176.4       52.9               65.8        4.8             18.2            0.5
Area [mm²]                  273         -                  85.4        6.89            49.6            1.52
Peak performance [TOPS]     7.08        28                 -           3.23            16.38           1.64
Energy efficiency [TOP/J]   0.04        0.529              0.38        0.68            0.896           3.28
Area efficiency [TOPS/mm²]  0.026       -                  0.46        0.47            0.33            1.08
C. Power Efficiency
The power consumption of the architecture was investigated. In the simulations, we do not consider the energy required to program the weights within the crossbar; this assumption is valid when considering only real-time inference operation. Compared to the state of the art, our design shows a reduced power consumption of 0.5 W, as shown in Table I. This corresponds to an energy efficiency of 3.28 TOPS/W. With this performance, our system outperforms the state of the art by a factor of 3.6x.
This high efficiency is further emphasized by the high utilization the system makes of the available hardware. Similar to the area efficiency, the elimination of DACs and large ADCs is reflected in the power consumption. In terms of power efficiency, the system shows a good balance between analog and digital computations, which reflects the good distribution of computational parallelization between these blocks. Also, keeping the frequency at 400 MHz reduces the power consumption.

VI. CONCLUSION

In this work, we presented an architecture that utilizes the
FeFET crossbar technology for multi-precision neural network acceleration. Furthermore, we evaluated the performance of our architecture across several widely used neural network models and compared it to state-of-the-art systems. By combining the FeFET technology with an innovative structure for neural network acceleration, we achieved an area efficiency enhancement of 2.3x over the currently available state of the art. We consider this work the foundation of further optimized architectures that can drive innovations that current structures cannot.
REFERENCES

[1] Y. Kwon and M. Rhu, "Beyond the memory wall: A case for memory-centric HPC system for deep learning," in MICRO, 2018.
[2] S. Yu, "Neuro-inspired computing with emerging nonvolatile memorys," Proceedings of the IEEE, 2018.
[3] S. Yu, Neuro-inspired computing using resistive synaptic devices. Springer, 2017.
[4] S. Ambrogio et al., "Equivalent-accuracy accelerated neural-network training using analogue memory," Nature, 2018.
[5] M. Jerry et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training," in IEDM, 2017.
[6] M. Bocquet et al., "In-memory and error-immune differential RRAM implementation of binarized deep neural networks," in IEDM, 2018.
[7] X. Chen et al., "Design and optimization of FeFET-based crossbars for binary convolution neural networks," in DATE, 2018.
[8] S. Yu et al., "Binary neural network with 16 Mb RRAM macro chip for classification and online training," in IEDM, 2016.
[9] K. Ando et al., "BRein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W," IEEE Journal of Solid-State Circuits, 2018.
[10] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in ISCA, 2016.
[11] Z. Zhu et al., "Mixed size crossbar based RRAM CNN accelerator with overlapped mapping method," in ICCAD, 2018.
[12] Y. Long et al., "ReRAM-based processing-in-memory architecture for recurrent neural network acceleration," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 12, pp. 2781–2794, 2018.
[13] Z. Zhu et al., "A configurable multi-precision CNN computing framework based on single bit RRAM," in DAC, 2019.
[14] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in ISCA, 2016.
[15] Y. Long et al., "A ferroelectric FET-based processing-in-memory architecture for DNN acceleration," IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, 2019.
[16] T. S. Böscke et al., "Ferroelectricity in hafnium oxide thin films," Appl. Phys. Lett., 2011.
[17] J. Müller et al., "Ferroelectric hafnium oxide based materials and devices: Assessment of current status and future prospects," ECS J. Solid State Sci. Technol., 2015.
[18] S. Dünkel et al., "A FeFET based super-low-power ultra-fast embedded NVM technology for 22nm FDSOI and beyond," in IEDM, 2017.
[19] S. L. Miller et al., "Physics of the ferroelectric nonvolatile memory field effect transistor," J. Appl. Phys., 1992.
[20] S. Beyer et al., "FeFET: A versatile CMOS compatible device with game-changing potential," in IMW, 2020.
[21] K. D. Choo et al., "27.3 Area-efficient 1GS/s 6b SAR ADC with charge-injection-cell-based DAC," in ISSCC, 2016.
[22] S. Li et al., "SCOPE: A stochastic computing engine for DRAM-based in-situ accelerator," in MICRO, 2018.
[23] C. Eckert et al., "Neural cache: Bit-serial in-cache acceleration of deep neural networks," in ISCA, 2018.
[24] X. Qiao et al., "AtomLayer: A universal ReRAM-based CNN accelerator with atomic layer computation," in DAC, 2018, pp. 1–6.
[25] F. N. Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size," arXiv:1602.07360, 2016.
[26] J. Deng et al., "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[27] K. He et al., "Deep residual learning for image recognition," in CVPR, 2016.
[28] A. Krizhevsky et al., "CIFAR-10 (Canadian Institute for Advanced Research)." [Online]. Available: kriz/cifar.html
[29] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," 2015.
[30] M. Cordts et al., "The cityscapes dataset for semantic urban scene understanding," in CVPR, 2016.