Ultra-Low Power Flexible Precision FeFET Based Analog In-Memory Computing

Abstract

This paper presents an efficient crossbar design and implementation intended for analog compute-in-memory (ACiM) acceleration of artificial neural networks based on ferroelectric FET (FeFET) technology. The novel mixed signal blocks presented in this work reduce the device-to-device variation and are optimized for low area, low power and high throughput. In addition, we illustrate the operation and programmability of the crossbar that adopts bit decomposition techniques for MAC operation. Our crossbar based ACiM accelerator achieves a record peak performance of 13714 TOPS/W.
T. Soliman1*, F. Müller2*, T. Kirchner1 , T. Hoffmann2, H. Ganem1 , E. Karimov2, T. Ali2,
M. Lederer2, C. Sudarshan3, T. Kämpfe2, A. Guntoro1 and N. Wehn3
1Robert Bosch GmbH, Renningen Germany
2Fraunhofer IPMS, Center Nanoelectronic Technologies (CNT), Dresden, Germany,
3TU Kaiserslautern, Kaiserslautern, Germany
*equal contributions, email: taha.soliman@de.bosch.com, thomas.kaempfe@ipms.fraunhofer.de
I. INTRODUCTION
Deep neural networks (DNNs) have become, over the past
few years, the state-of-the-art technique for various
tasks such as image processing and speech recognition.
However, such networks come at the cost of high computational
and memory requirements. The computational kernels in these
networks rely on matrix-vector multiplications as their main
operation. In the last few years, various in-memory
architectures were introduced which perform multiply and
accumulate (MAC) operations in an analog way [1]. However,
this approach requires power- and area-hungry, latency-intensive
ADCs (analog-to-digital converters), which have so far limited
the overall performance and efficiency. In addition, the
device-to-device variation in ultra-scaled memory cells, particularly
non-volatile memory (NVM), adds a further challenge to such
architectures as it affects the MAC operation results [2].
In this paper, we present an energy-efficient,
high-performance FeFET-based crossbar that supports
access in different operation modes and is robust against
variability. The influence of device-to-device variation on the
MAC operations is strongly reduced. We present optimized
peripheral IP specialized to improve power and area efficiency
on both single block level and overall system level. At the
system level, the proposed architecture is based on bit
decomposition of the MAC operations. Finally, we present our
results for different architectural configurations such as ADC
precision, number of activated cells and input feature/weight
precision.
II. PROPOSAL OF A 1FEFET1R CELL
The FeFET technology (Fig.2) has been introduced as a
cost-effective, ultra-low-power NVM solution for IoT
applications in 28nm bulk HKMG technology [3].
Ferroelectric memories have attracted broad interest for
in-memory computing [4]. The FeFET is
particularly interesting because it features a high memory window
(MW), a very high Ion/Ioff ratio, a fast and extremely low-power
write operation (<1fJ), low device-to-device variation (MW >
30σ for a width W and length L configuration of
450×450nm²) and high data retention (>10^8 s at 80°C) [5].
FeFET cells can be co-integrated with standard CMOS FETs,
which is particularly interesting for ultra-scaled periphery
circuitry.
Associated with the high Ion/Ioff ratio is a large variability in
the IDS of the low-VT (LVT) state, even for a very small VT
variation. To compensate for the IDS variation, we propose
forming a 1FeFET1R (1F1R) bitcell. For a sufficiently high
resistance, the output current variability is strongly reduced
(Fig.3). Furthermore, variability originating from the word line
(WL) is suppressed due to the large operation window of VGS.
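The effect of the series resistor can be illustrated with a deliberately crude model. The sketch below treats the conducting LVT FeFET as a plain resistance with a log-normal spread and places it in series with R. The resistance values, the drain voltage and all function names are hypothetical and are not taken from the paper, whose results rely on a calibrated device model rather than on this simplification.

```python
import random
import statistics

def cell_current(v_ds, r_series, r_on):
    """Current of a conducting cell modeled as r_on in series with r_series."""
    return v_ds / (r_series + r_on)

def relative_spread(r_series, r_on_samples, v_ds=1.0):
    """Standard deviation of the cell current divided by its mean."""
    currents = [cell_current(v_ds, r_series, r) for r in r_on_samples]
    return statistics.pstdev(currents) / statistics.mean(currents)

random.seed(0)
# Hypothetical LVT channel resistances in ohms (median around 100 kOhm).
r_on_samples = [random.lognormvariate(11.5, 0.5) for _ in range(1000)]

spread_1f = relative_spread(0.0, r_on_samples)     # FeFET alone
spread_1f1r = relative_spread(5e6, r_on_samples)   # assumed 5 MOhm series resistor

print(f"relative current spread, 1F:   {spread_1f:.3f}")
print(f"relative current spread, 1F1R: {spread_1f1r:.3f}")
```

In this toy model the current is set almost entirely by R once R is much larger than the channel resistance, which is why the spread of the channel resistance barely shows up at the output.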
III. CROSSBAR ORGANIZATION
In [6], a mixed-signal system algorithm that can eliminate
digital-to-analog converters (DACs), as shown in Fig.1, was
presented. Instead of large, power- and area-expensive ADCs,
smaller and more efficient blocks are beneficial. We start here by
presenting the ultra-low-power ADC developed to support
such a system.
A. Mixed Signal Blocks
Our ADC operates in high-side current mode with a
thermometer coding, schematically depicted in Fig.4. It is
connected between the matrix and the power supply Vdd. The
circuitry consists of two blocks, a ladder-style ADC and a
reference voltage generator. The ADC consists of stacked
current mirrors. In the case of a 2b ADC these are M1/M7,
M4/M9 and M6/M11. The set current of these mirrors
decreases linearly from top to bottom.
As current is drawn by the column, the lowest current mirror
maintains a high voltage potential at its output Out1 as long as
the matrix current (Ires) is smaller than the reference current Iref,3. If
Ires rises beyond Iref,3, the voltage at Out1
drops sharply as the current through M11 is limited.
This turns on M12, which then provides a bypass for any extra
current drawn by the matrix. The sum of both currents through
M11 and M12 then also runs through M9. If that current keeps
rising and exceeds the set current of the M4/M9 mirror (Iref,2),
the voltage at Out2 drops, too. This process repeats for all
subsequent stages. The principle is explained here for three stages,
but a higher number of stages can be used, depending on the
required ADC precision. Inverters following the ADC stages drive
the result into the subsequent digital circuits. Transient
simulations upon switch activation reveal the latency (see
Fig. 6). The extracted parameters are summarized in Tab. 2.
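The stage-by-stage behavior described above can be summarized in a purely behavioral sketch. It ignores the transistor-level details (mirrors, bypass devices, inverters) and only reproduces the rule that each stage output drops once the column current exceeds that stage's reference current. The threshold values and function names are hypothetical.

```python
def thermometer_adc(i_res, i_refs):
    """Return the stage outputs [Out1, Out2, ...] as 1 (high) or 0 (dropped).

    i_refs holds the reference currents in ascending order, starting with
    the lowest mirror. A stage output drops once i_res exceeds its reference.
    """
    if list(i_refs) != sorted(i_refs):
        raise ValueError("reference currents must be sorted ascending")
    return [0 if i_res > ref else 1 for ref in i_refs]

def decode(outputs):
    """Convert the thermometer code into the number of tripped stages."""
    return sum(1 for out in outputs if out == 0)

# Assumed reference currents for a three-stage (2b) ADC, in amperes.
I_REFS = [50e-9, 150e-9, 250e-9]

for i_res in (0.0, 120e-9, 300e-9):
    outputs = thermometer_adc(i_res, I_REFS)
    print(f"Ires = {i_res:.1e} A -> outputs {outputs}, code {decode(outputs)}")
```

Because the code is a simple count of tripped stages, adding precision only means adding reference currents to the list, which mirrors the statement that more stages can be stacked when needed.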
A reference generator converts multiple reference currents
into gate voltages for the current mirrors. Since it solely drives
the gates of the ADC transistors, it can be shared by multiple
ADCs. As the generated voltages are derived from the same
transistor type that is used in the ADC, the generator automatically
accounts for device manufacturing and temperature
variations. Choosing a current-input ADC also reduces the
influence of wire resistance variation on the crossbar output.
B. Crossbar
We use a 1FeFET1R memory cell crossbar configuration
as shown in Fig. 6. The input feature bits (Vinput) are applied to
the WLs, and the operation result is read at the drain lines (DL).
For simulation, the experimentally obtained IDVG curves were
used to calibrate a Preisach-based FeFET model [7]. We
implemented the cell variability using the experimentally
obtained VT variation for the LVT and HVT states, as shown in Fig.
7. The Monte-Carlo (MC) transient analysis upon WL
activation suggests a fault-free digitization (Fig. 8). For larger
device-to-device variation, a small deviation can be observed at
large activated FeFET counts (Fig. 9). The 64×64 crossbar
consists, for example, of 64 segments of 8×8 cells (Fig. 10).
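Why errors appear first at large activated FeFET counts can be made plausible with a small Monte-Carlo sketch. It assumes that each activated LVT cell contributes a Gaussian-distributed current and that the ADC thresholds sit halfway between ideal current levels. Both assumptions, the spread values and all names are hypothetical. The paper's own analysis uses a calibrated Preisach-based device model with measured VT variation instead.

```python
import random

def column_code(n_active, i_cell, sigma_rel, n_thresholds, rng):
    """Digitize the summed current of n_active cells with midpoint thresholds."""
    total = sum(rng.gauss(i_cell, sigma_rel * i_cell) for _ in range(n_active))
    thresholds = [(k + 0.5) * i_cell for k in range(n_thresholds)]
    return sum(total > t for t in thresholds)

def error_rate(n_active, sigma_rel, trials=2000, seed=1):
    """Fraction of trials in which the code differs from the active-cell count."""
    rng = random.Random(seed)
    errors = sum(
        column_code(n_active, 1.0, sigma_rel, 8, rng) != n_active
        for _ in range(trials)
    )
    return errors / trials

for sigma_rel in (0.02, 0.2):
    rates = [error_rate(n, sigma_rel) for n in (1, 4, 8)]
    print(f"relative spread {sigma_rel}: error rates for 1/4/8 cells = {rates}")
```

The summed spread grows with the square root of the number of active cells while the threshold spacing stays fixed, so the margin shrinks as more cells are activated.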
Within a segment, the source terminals of all segment columns
are connected either to the programming/inhibit voltage pins or
to a shared resistor, which limits the current
contributed by one FeFET. Therefore, only one FeFET per
segment is activated during inference mode, namely the one whose
column and row are both activated. On the crossbar level, all gates
of a single row are connected. All drain terminals of a crossbar
column are connected either to the programming/inhibit
voltage or to the ADC, which is shared among e.g. every 8
columns, of which only one is activated at a time (Fig. 11).
To reduce leakage by activated LVT FeFETs, we furthermore
apply a read inhibit voltage Vinh,WL. The layout of the 4kb
crossbar is shown in Fig. 12. Every FeFET within a segment is
programmed individually. To program the FeFET to one of the
two binary states (HVT/LVT), a WL voltage of ±4V (Vwrite) is
applied while keeping the SL/DL at 0V. To prevent other
FeFETs within the same row from being programmed, an
inhibit voltage of ±2.7V (V^write_inh,SL/BL) is applied at the SLs on
segment level and the DLs on the crossbar level. The further
parameter configuration is shown in Tab. 2.
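The organization and the write-inhibit scheme described above can be sketched in a few lines. The sizes and voltages are the ones quoted in the text. The row-major numbering of segments and ADCs, the function names, and the assumption that the inhibit voltage has the same sign as the write pulse are illustrative choices and are not stated in the paper.

```python
SEGMENT_SIZE = 8      # 8x8 cells per segment (from the text)
CROSSBAR_SIZE = 64    # 64x64 crossbar (from the text)
COLUMNS_PER_ADC = 8   # one ADC shared by every 8 columns (from the text)
V_WRITE = 4.0         # WL write amplitude in volts (from the text)
V_INHIBIT = 2.7       # SL/DL inhibit amplitude in volts (from the text)

def locate(row, col):
    """Map a crossbar cell to (segment index, position in segment, ADC index)."""
    if not (0 <= row < CROSSBAR_SIZE and 0 <= col < CROSSBAR_SIZE):
        raise ValueError("cell outside the crossbar")
    seg_row, local_row = divmod(row, SEGMENT_SIZE)
    seg_col, local_col = divmod(col, SEGMENT_SIZE)
    return (seg_row, seg_col), (local_row, local_col), col // COLUMNS_PER_ADC

def valid_inference_selection(cells):
    """Check that at most one FeFET per segment is activated."""
    segments = [locate(row, col)[0] for row, col in cells]
    return len(segments) == len(set(segments))

def write_gate_bias(selected, polarity=+1):
    """Gate-to-line voltage seen by a cell on the written row.

    Selected cells keep their SL/DL at 0 V. Other cells in the row get the
    inhibit voltage, assumed here to share the sign of the write pulse.
    """
    v_wl = polarity * V_WRITE
    v_line = 0.0 if selected else polarity * V_INHIBIT
    return v_wl - v_line

print(locate(10, 19))
print(valid_inference_selection([(0, 0), (8, 0)]))
print(f"selected: {write_gate_bias(True):.1f} V, inhibited: {write_gate_bias(False):.1f} V")
```

Under the same-sign assumption, an inhibited cell sees only the difference between the write and inhibit amplitudes across its gate stack instead of the full ±4V.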
Furthermore, we investigated the applicability of further
scaled devices. The statistical extraction of the IDVG curves of such
devices is shown in Fig.13 and summarized in Fig.14. The
variability is affected by the gate area. In general, the MW/σ
ratio decreases with reduced W/L. Proper selection of the
FeFET W/L can still preserve a high ADC accuracy while
significantly reducing the crossbar area.
IV. BENCHMARK OF THE PROPOSED FEFET ACIM
We now compare the 1FeFET1R-based ACiM solution with
other memory cell concepts, such as static and resistive random
access memory (SRAM, RRAM). The FeFET memory cell
implemented here is non-volatile with a long data retention.
Compared with an SRAM cell, this reduces the need for large external
memory and for write energy upon restart, e.g. in an edge device.
The large Ion/Ioff ratio allows for low-latency operation.
The cell area is comparable to or smaller than that of SRAM.
We simulated four different ADC precision configurations
(2b, 3b, 4b and 5b) connected to the crossbar. The performance
is affected by both the latency and the power consumption
imposed by the ADC and the number of activated rows.
The configuration with the 3b ADC shows the highest
performance efficiency and requires a chip area of only 2μm².
It could be operated at 1GS/s while consuming a worst-case
power of 600nW. As the power consumption depends on the
measurement current, the average power consumption is
expected to be below 300nW.
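The power and sampling-rate figures quoted for the 3b ADC translate directly into an energy per conversion. The short calculation below assumes one conversion per sample at the full 1GS/s rate. The function name is illustrative.

```python
def energy_per_conversion(power_w, sample_rate_hz):
    """Energy in joules spent per ADC sample at a constant power draw."""
    return power_w / sample_rate_hz

worst_case_j = energy_per_conversion(600e-9, 1e9)     # 600 nW at 1 GS/s
average_bound_j = energy_per_conversion(300e-9, 1e9)  # expected average bound

print(f"worst case: {worst_case_j * 1e15:.2f} fJ per conversion")
print(f"average bound: {average_bound_j * 1e15:.2f} fJ per conversion")
```

With these inputs the worst case is 0.6 fJ per conversion and the expected average stays below 0.3 fJ.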
The presented crossbar offers a flexible MAC operation
precision. It allows for an activation precision within the range
of 1b to 8b and weight precision of 1b, 2b or 4b. This flexibility
allows for coverage of various networks while keeping the
accuracy loss within an acceptable range. The architecture can
also support layer-based precision to maintain the highest
possible throughput. For our experiments, we simulated various
networks with the datasets CIFAR10 and MNIST. In Fig. 15,
we show the crossbar performance at different precisions and at
different configurations, in which it can reach a peak power
efficiency of 13714 TOPS/W, which significantly advances the
state of the art (see Fig.16). This value can be further increased
by taking moderate sparsity into account.
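The flexible precision follows from the bit decomposition of the MAC operation: multi-bit operands are split into single bits, each pair of bit planes is combined by an AND and accumulated along a column, and the partial sums are shifted according to their bit positions. The sketch below shows this arithmetic for unsigned operands with an ideal digitization. Mapping the weight bit planes to separate columns, the unsigned number format and the function name are assumptions made for illustration.

```python
def bit_serial_mac(activations, weights, act_bits, weight_bits):
    """Unsigned dot product built from 1-bit AND and accumulate steps."""
    result = 0
    for a_pos in range(act_bits):          # one activation bit plane per cycle
        for w_pos in range(weight_bits):   # one weight bit plane per column
            # Column accumulation: count the cells where both bits are set.
            column_sum = sum(
                ((a >> a_pos) & 1) & ((w >> w_pos) & 1)
                for a, w in zip(activations, weights)
            )
            # Shift by the combined bit position before adding.
            result += column_sum << (a_pos + w_pos)
    return result

activations = [3, 5, 7, 2]   # fit into 3 bits
weights = [1, 2, 3, 0]       # fit into 2 bits
expected = sum(a * w for a, w in zip(activations, weights))
print(bit_serial_mac(activations, weights, act_bits=3, weight_bits=2), expected)
```

The number of outer iterations grows with the activation precision, so lowering the precision of a layer directly reduces the cycles spent on it.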
V. CONCLUSIONS
In this work, we propose a novel high-performance
1FeFET1R-based analog compute-in-memory (ACiM) architecture, which
mitigates the FeFET VT variation. By decreasing the current
sensing threshold with a novel current-input ADC, we
drastically improve the power and area efficiency while
preserving the accuracy after digitization. The functionality
has been verified on the cell and array level by both experiments
and simulations. The high parallelism of the resulting ACiM
architecture supports various network precisions with a power
efficiency of up to 13714 TOPS/W.
VI. ACKNOWLEDGMENT
This work received funding within the ECSEL Joint
Undertaking project TEMPO in collaboration with the
European Union’s H2020 Framework Program (H2020/2014-
2020) and National Authorities, under grant agreement number
826655. We thank Globalfoundries for the provision of 28nm
technology FeFET structures.
VII. REFERENCES
[1] A. Sebastian et al., “In-memory hyperdimensional computing,” Nature
Electronics, 3, 327-337, 2020.
[2] D. Ielmini, H.-S. P. Wong, “In-memory computing with resistive
switching devices,” Nature Electronics, 1, 333-343, 2018.
[3] M. Trentzsch et al., “A 28nm HKMG super low power embedded NVM
technology based on ferroelectric FETs,” IEDM 2016.
[4] R. Berdan et al., “Low-power linear computation using nonlinear
ferroelectric tunnel junction memristors,” Nature Electronics, 3, 259-266,
2020.
[5] S. Beyer et al., “FeFET: A versatile CMOS compatible device with
game-changing potential,” IMW 2020.
[6] T. Soliman et al., “Efficient FeFET crossbar accelerator for
multiprecision neural networks,” SOCC 2020.
[7] K. Ni et al., “A circuit compatible accurate compact model for
ferroelectric-FETs,” VLSI Symposium, 2018.
0RWLYDWLRQ8OWUDORZSRZHULQPHPRU\FRPSXWLQJXVLQJQRQYRODWLOH)H)(7 VHOHFWRUV
3HULSKHUDOVIRUUHOLDEOH)H)(7 EDVHGLQPHPRU\FRPSXWLQJ
)LJ 'HHS OHDUQLQJ DOJRULWKPV UHTXLUH PHPRU\LQWHQVLYH 0$& FDOFXODWLRQV
$QDORJ FRPSXWHLQPHPRU\ $&L0 XVLQJ 190 ZHLJKWV DQG DGDSWHG $'&V
LPSOHPHQWHG LQWR D WLOH DUFKLWHFWXUH RIIHU KLJK IOH[LELOLW\ WR HPXODWH DQG WR DGDSW
WR YDULRXV ODUJH QHXUDO QHWZRUNV ,PSHGLPHQW LV DQ KLJKO\ UHOLDEOH DQG DFFXUDWH
GLJLWDOL]DWLRQ
([SHULPHQWDO'HPRQVWUDWLRQRI3URSRVHG)HUURHOHFWULF&$0&HOO
)LJ )HUURHOHFWULF )(7V )H)(7V DFW DV QRQYRODWLOH
VHOHFWRUV ZLWK KLJK ,RQ,RII UDWLR LI FRQQHFWHG WR D UHVLVWRU
)5 7KH FXUUHQW YDULDELOLW\ LV KHUHE\ VWURQJO\ UHGXFHG
)LJ $&L0 DFFXUDF\ IRU WKH
H[SHULPHQWDO )H)(7 GHYLFHWRGHYLFH
97YDULDWLRQ DQG PXOWLWXGHV RI WKLV )RU
ߪ௅௏ൌ ͷͶ WKH $'& UHYHDOV WKH
FRUUHFW UHVXOW IRU DOO 0& VDPSOHV
$QDORJ&RPSXWH,Q0HPRU\$&L0
&DOFXODWLRQ )ORZ
190 190
190 190
190 190
190 190
190 190
190 190
190 190
190 190
$'& $'& $'& $'&
3URFHVVLQJ(OHPHQW3(
$
%
&
3(
3(
3(
3(
3(
3(
3(
3(
6KLIWHU
'
(
7LOH DUFKLWHFWXUH
6KLIWLQJ YDOXH LV WKH FXUUHQW LQSXW ELW
SRVLWLRQ
)LQDORXWSXW RI 0$&RSHUDWLRQV
DIWHUQFORFN F\FOHV ZKHUH QLV WKH
LQSXW SUHFLVLRQ
(
'
&
6LQJOHELW GLJLWDODFWLYDWLRQ
$FFXPXODWLRQ RI $1'RSHUDWLRQ RQWKDW
OLQH
'LJLWDOL]DWLRQ RI FURVVEDU RXWSXW WR ELW
YDOXH
$
%
ߪ௅௏ P9
/97
+97
)LJ 3URSRVHG FXUUHQWLQSXW WKHUPRPHWHUFRGH
$'& ZLWK DGDSWDEOH WKUHVKROG FXUUHQW SHU VWDJH
 VWDJHV VNHWFKHG
)LJ )H)(7 EDVHG FURVVEDU FROXPQV
DV XVHG LQ WKH VLPXODWLRQ
)LJ D )H)(7 GHYLFH PRGHOLQJ EDVHG RQ D
FDOLEUDWHG 3UHLVDFK )( PRGHO >@ DGGLQJ DQ
H[SHULPHQWDO YDULDELOLW\ ZLWK ߪ௅௏ൌ ͷͶǤE
7KH VLPXODWLRQ IRU WKH SURSRVHG )5 FRQILJ
XUDWLRQ PDWFKHV WKH H[SHULPHQWDO FRQGLWLRQ
9'6 P9
GHYLFHV
)LJ 7UDQVLHQW DQDO\VLV RI WKH SURSRVHG $'&
ZLWK  VWDJHV
)LJ 7UDQVLHQW UHVSRQVH XSRQ VHTXHQWLDO
DFWLYDWLRQ RI WKH /97 )H)(7V LQ WKH FROXPQ
DV ZHOO DV WKH GLJLWDOL]DWLRQ RI WKH FXUUHQW
UHVSRQVH XSRQ H[SHULPHQWDOO\ H[WUDFWHG
GHYLFH YDULDELOLW\
)LJ 7(0 LPDJH DQG PDWHULDO VWDFN RI IHUURHOHFWULF )(7V
,'9*FXUYHV RI /97+97 RI  GHYLFHV
+LJKRQRIIUDWLR
/DUJHFXUUHQW YDULDELOLW\
+LJKRQRIIUDWLR
/RZRFXUUHQW YDULDELOLW\
µµ µµ µµ µµ µµ
µµ
5HSRUWHG FRQILJXUDWLRQ
3URSRVHG FRQILJXUDWLRQ
9'6 9
GHYLFHV
5 0ȍ
5HIHUHQFH*HQHUDWRU $'& 3UHFLVLRQ E E E E
3HDN3RZHUX:    
/DWHQF\QV    
7DE $'& SHUIRUPDQFH SDUDPHWHUV
QP
QPVLOLFLGH
QPSRO\ 6L
QP7L1
QP+I2
aQP6L2
S6L
D E F
9'6 9
GHYLFHV
5 0ȍ
&ROXPQ
)
)5
ȍ
ȍ
ȍ
,& ,& ,&
9*6
9*6
9*6
[Pð [Pð [Pð [Pð[Pð [Pð
)H)(7 DUHD VFDOLQJ RSSRUWXQLWLHV EHQFKPDUN
)H)(7 &URVVEDU2UJDQL]DWLRQDQG3K\VLFDO,PSOHPHQWDWLRQ
)LJ 3URJUDP DQG LQIHUHQFH FRQILJXUDWLRQ RI
)H)(7 VHJPHQWV FRQQHFWHG WR D VLQJOH UHVLVWRU
SHU VHJPHQW LPSURYLQJ DUHD HIILFLHQF\




%LW %LW %LW %LW %LW %LW %LW %LW %LW %LW %LW %LW
ELW$'& ELW$'& ELW$'& ELW$'&
%LW
%LW
%LW
%LW
%LW
%LW
%LW
%LW
7236:
)LJ 6\VWHP SHUIRUPDQFH LQ 7236: 23 VWDQGV IRU D IXOO 0$& RSHUDWLRQ
IRU GLIIHUHQW ELWSUHFLVLRQ $'&V FRQQHFWHG WR WKH FURVVEDU DQG DW GLIIHUHQW
DFWLYDWLRQ SUHFLVLRQ LQ WKH UDQJH RI  WR  ELWV DQG ZHLJKWV SUHFLVLRQ RI  DQG
 ELWV
$FWLYDWLRQ
3UHFLVLRQ
7DE6\VWHP DQG GHYLFH SDUD
PHWHUV XVHG LQ WKLV VLPXODWLRQ
)LJ %HQFKPDUN RI WKH SURSRVHG )H)(7 EDVHG $&L0 ZLWK
H[LVWLQJ VROXWLRQV ZLWK UHVSHFW WR SRZHU HIILFLHQF\ DQG
SHUIRUPDQFH $ VFDODEOH DFFHOHUDWRU FRQFHSW LV SURYLGHG ZLWK
SRZHU HIILFLHQF\ !7236:
)LJ /D\RXW RI D NE )H)(7
FURVVEDU EHLQJ EXLOW RI î
VHJPHQWV VXSSRUWLQJ  0$&
RSHUDWLRQV EE SHU FORFN
F\FOH
)LJ &URVVEDU DUUDQJHPHQW
DV FRQVWUXFWHG E\ VHJPHQWV
DQG $'&V
65$0 55$0 )H)(7
9RODWLOLW\ 9RODWLOH 1RQYRODWLOH 1RQYRODWLOH
5HWHQWLRQ 6WDWLF V#& V#&
,RQ,RII 
$UHD>Pð@  #
QP&026
#QP
&026
 #
QP&026
(QGXUDQFH   
)LJ ([SHULPHQWDO ,'9*FXUYHV DQG
H[WUDFWHG 97KLVWRJUDPV IRU UHSUHVHQW
DWLYH :/ FRQILJXUDWLRQV RI WKH
)H)(7V $ FRQILJXUDWLRQ VXFK DV
îPð SUHVKULQN RIIHUV VXLWDEOH
YDULDWLRQ RI ı970: 7KLV UHVXOWV
LQ D ELW VL]H RI DERXW Pð 7DE &RPSDULVRQ RI )H)(7 ZLWK RWKHU PHPRU\
FRQFHSWV VWXGLHG IRU LPSOHPHQWDWLRQ LQ $&L0
/97 9
+97 9
ı/97 9
ı+97 9
9LQSXW 9
9LQK:/ P9
9ZULWH 9
9
9
3XOVH
GXUDWLRQ QV
,QSXW a
2XWSXW a
$'& 
6\VWHPEHQFKPDUNLQJ
)LJ 7KH YDULDELOLW\ RI WKH +97 DQG /97 VWDWHV IRU
YDULRXV :/ FRQILJXUDWLRQV SORWWHG RYHU DUHD
:HLJKW 3UHFLVLRQ
7KLVZRUN
ZLQGRZ RI
RSHUDWLRQ
[Pð [Pð
[Pð
[Pð
[Pð
+97
/97
9LQK:/
ZULWH
9LQK6/%/
ZULWH
... The feasibility of 28 nm technology node-based Fe-FETs was successfully demonstrated with the fabrication of such synaptic device in GlobalFoundries facilities (Soliman et al., 2020). Figure 2A illustrates the 28 nm technology node-based FeFET with the schematic and transmission electron microscopic (TEM) image. ...
... (A) Schematic illustration with the transmission electron micrograph of the FeFET cell fabricated in the GlobalFoundries' 28-nm HKMG facility. (B) Program--erase operation conducted across sixty test devices using 1 µs pulse of amplitude ±5 V. (C) Analog program-erase operations in a single FeFET cell show the impact of V g read on the number of different states and linearity(Soliman et al., 2020;Lederer et al., 2021). ...
Article
Full-text available
This work presents 2 bits/cell operation in deeply scaled ferroelectric finFETs (Fe-finFET) with 1µs write pulse of maximum ±5V amplitude and WRITE endurance above 109 cycles. Fe-finFET devices with single and multiple fins have been fabricated on SOI wafer using a gate first process, with gate lengths down to 70 nm and fin width 20 nm. Extrapolated retention above ten years also ensures stable inference operation for ten years without any need for re-training. Statistical modeling of device-to-device and cycle-to-cycle variation is performed based on measured data and applied to neural network simulations using the CIMulator software platform. Stochastic device-to-device variation is mainly compensated during online training and has virtually no impact on training accuracy. On the other hand, stochastic cycle-to-cycle threshold voltage variation up to 400mV can be tolerated for MNIST handwritten digits recognition. Substantial inference accuracy drop with systematic retention degradation was observed in analog neural networks. However, quaternary neural networks (QNN) and binary neural networks (BNN) with Fe-finFETs as synaptic devices demonstrated excellent immunity towards the cumulative impact of stochastic and systematic variations
... Yin et al. [64] designed a PIM Chiplet based on FeRAM, whose area and power consumption are 58% and 64% of the SRAM, respectively. Soliman et al. [65] prepared an FeRAM Chiplet with 28 nm CMOS technology; the energy efficiency and latency are 13714 TOPS/W and 0.5 ns, respectively, when 2-bit data operations are performed, as shown in Figure 8e. FeRAM has low read/write time and power consumption and has the best compatibility with CMOS technology; however, the FeRAM Chiplet has a high cost because its electrode materials are noble metals (Pt, Ir). ...
... Due to the unique characteristics of hysteretic, FinFET [62], Copyright 2020, with permission from IEEE); (e) FeRAM Chiplet vs. traditional computing system. (Reprinted from [65], Copyright 2020, with permission from IEEE); (f) MRAM Chiplet (Reprinted from [68], Copy-right 2021, with permission from ELSEVIER); (g) 3D PCRAM Chiplet, (Reprinted from [68], Copy-right 2009, with permission from IEEE). ...
Article
Full-text available
Computing systems are widely used in medical diagnosis, climate prediction, autonomous vehicles, etc. As the key part of electronics, the performance of computing systems is crucial in the intellectualization of the equipment. The conflict between performance, efficiency, and cost can be solved by choosing an appropriate computing system architecture. In order to provide useful advice and instructions for the designers to fabricate high-performance computing systems, this paper reviews the Chiplet-based computing system architectures, including computing architecture and memory architecture. Firstly, the computing architecture used for high-performance computing, mobile, and PC is presented and summarized. Secondly, the memory architecture based on mainstream memory and emerging non-volatile memory used for data storing and processing are introduced, and the key parameters of memory are compared and discussed. Finally, this paper is concluded, and the future perspectives of computing system architecture based on Chiplet are presented.
... Initially, the hardware of SNNs was implemented with complex complementary metal-oxide-semiconductor field-effect transistor (CMOSFET) circuits where a single neuron or synapse was realized by using a great number of CMOSFETs, resulting in the low integration density and high energy consumption [10], [11]. In recent decades, emerging technologies, such as resistive random access memories (RRAMs) [12]- [16], phase change memories (PCMs) [17]- [20], and ferroelectric field-effect transistors (FeFETs) [21]- [23], have shown remarkable performance in emulating spiking neurons and plastic synapses. However, most of them cannot directly convert signals collected from the environment into spikes to serve as the inputs to SNNs without relying on a large number of complex analog-to-digital converters (ADCs) and voltage-tofrequency converters (VFCs) [ Fig. 1(a)]. ...
... In this work, we adopt an alternative method proposed in [21], which is to limit the ON current variability by a limiter. In this way, the ON current of FeFET is effectively independent on both the applied V G and the stored V TH state, therefore, the V TH variation of Fe- Integration of the series resistor with FeFET is necessary to fully exploit the benefit of current limiter. ...
Preprint
Full-text available
Content addressable memory (CAM) is widely used in associative search tasks for its highly parallel pattern matching capability. To accommodate the increasingly complex and data-intensive pattern matching tasks, it is critical to keep improving the CAM density to enhance the performance and area efficiency. In this work, we demonstrate: i) a novel ultra-compact 1FeFET CAM design that enables parallel associative search and in-memory hamming distance calculation; ii) a multi-bit CAM for exact search using the same CAM cell; iii) compact device designs that integrate the series resistor current limiter into the intrinsic FeFET structure to turn the 1FeFET1R into an effective 1FeFET cell; iv) a successful 2-step search operation and a sufficient sensing margin of the proposed binary and multi-bit 1FeFET1R CAM array with sizes of practical interests in both experiments and simulations, given the existing unoptimized FeFET device variation; v) 89.9x speedup and 66.5x energy efficiency improvement over the state-of-the art alignment tools on GPU in accelerating genome pattern matching applications through the hyperdimensional computing paradigm.
... Recent progress in the research on 28nm high-k metal gate (HKMG) based ferroelectric field-effect transistors (Fe-FETs), and deeply scaled ferroelectric fin field-effect transistors (Fe-finFETs) corroborates this fact. This progress has accelerated the application of deeply scaled FeFETs for computing-in-memory (CIM) applications [14][15][16][17][18][19][20][21][22]. However, device-to-device (D2D) variations in HfO2 based deeply scaled FeFETs pose a severe threat towards accurately executing vector-matrix multiplication operations. ...
Preprint
Full-text available
This work proposes a hybrid-precision neural network training framework with an eNVM based computational memory unit executing the weighted sum operation and another SRAM unit, which stores the error in weight update during back propagation and the required number of pulses to update the weights in the hardware. The hybrid training algorithm for MLP based neural network with 28 nm ferroelectric FET (FeFET) as synaptic devices achieves inference accuracy up to 95% in presence of device and cycle variations. The architecture is primarily evaluated using behavioral or macro-model of FeFET devices with experimentally calibrated device variations and we have achieved accuracies compared to floating-point implementations.
... In our work, we present an architecture based on a 1FeFET1R model that greatly reduces the previously mentioned variability and limits I ds necessary for ultralow power operation [9], [10], [11]. The proposed design operates in the range of 100nA output per activated FeFET with very low current variability. ...
Conference Paper
More and more applications use deep neural networks (DNN) to execute complex tasks. Depending on the application, there are high requirements regarding latency and performance parameters, especially for edge devices (edge AI). Due to the increasing challenges regarding Moore's Law, new approaches are required. A promising candidate for this is analog in-memory computing (AIMC). In-memory computing platforms are based on crossbar structures using resistive non-volatile memories. However, many demonstrated approaches suffer from the problems of reduced accuracy as well as reliability, large analog-to-digital as well as digital-to-analog converters (ADCs/DACs), and power efficiency. The in-memory computing architecture presented in this paper is utilizing ferroelectric field effect transistors (FeFETs), which are used as a non-volatile memory cell. The crossbar-based in-memory architecture eliminates the need for any DAC and also provides high parallelism with only 3-bit ADCs. This results in lower area and power usage. The implementation of operations with binary as well as variable bit depth and power efficiencies >10000 TOPS/W are presented.
... 2). To address this, we utilize Thermometer coding [14] where the number of positive pulses is proportional to a representation level. We show that Thermometer coding can achieve higher noise robustness than the bit slicing approach in the following sub-sections. ...
Preprint
Binary memristive crossbars have gained huge attention as an energy-efficient deep learning hardware accelerator. Nonetheless, they suffer from various noises due to the analog nature of the crossbars. To overcome such limitations, most previous works train weight parameters with noise data obtained from a crossbar. These methods are, however, ineffective because it is difficult to collect noise data in large-volume manufacturing environment where each crossbar has a large device/circuit level variation. Moreover, we argue that there is still room for improvement even though these methods somewhat improve accuracy. This paper explores a new perspective on mitigating crossbar noise in a more generalized way by manipulating input binary bit encoding rather than training the weight of networks with respect to noise data. We first mathematically show that the noise decreases as the number of binary bit encoding pulses increases when representing the same amount of information. In addition, we propose Gradient-based Bit Encoding Optimization (GBO) which optimizes a different number of pulses at each layer, based on our in-depth analysis that each layer has a different level of noise sensitivity. The proposed heterogeneous layer-wise bit encoding scheme achieves high noise robustness with low computational cost. Our experimental results on public benchmark datasets show that GBO improves the classification accuracy by ~5-40% in severe noise scenarios.
... For the 1T FeFET (FEoL approach), the ideal suitability for hardware accelerated neural networks, due to the lowpower write and read operation [15] and the excellent peak performance exceeding 13700 TOPS/W [16], was already shown. In addition, high dynamic range, multi-bit operation, and good linearity as well as symmetry was reported [15]. ...
Conference Paper
Advanced non-volatile memory concepts such as the 1T1C ferroelectric (FE) random-access memory (FeRAM) and the 1T1C FE field-effect transistor (FeFET) can be realized by connecting a metal-ferroelectric-metal (MFM) capacitor placed in the back end of line (BEoL) of a microchip to the drain and gate contacts of a standard logic device, respectively. With the vertical distributed select devices in the front-end of line (FEoL) and the storage elements in the BEoL, both concepts increase the effective memory density of a microchip without introducing major changes in the FEoL fabrication technology. However, for advanced neuromorphic computing architectures, the 1T1C FeFET is the device of choice, since it provides non-destructive readout. The most promising material for the integration of FE non-volatile memory functionalities into the BEoL is Zr doped HfO 2 (HZO). It crystallizes at low temperatures in the orthorhombic phase (the one with FE properties) and with a polycrystalline structure. The latter is important to enable analogue like switching in synaptic devices. Herein, the above-mentioned memory concepts are introduced and key steps to optimize the HZO films for the BEoL integration and for the neuromorphic computing use case are described.
Thesis
Full-text available
The discovery of ferroelectricity in hafnium oxide spurred a growing research field due to hafnium oxides compatibility with processes in microelectronics as well as its unique properties. Notably, its application in non-volatile memories, neuromorphic devices as well as piezo- and pyroelectric sensors is investigated. However, the behavior of ferroelectric hafnium oxide is not understood into depth compared to common perovskite structure ferroelectrics. Due the the metastable nature of the ferroelectric phase, process conditions have a strong influence during and after its deposition. In this work, the physical properties of hafnium oxide, process influences on the microstructure as well as reliability aspects in non-volatile and neuromorphic devices are investigated. With respect to the physical properties, strong evidence is provided that the antiferroelectric-like behavior in hafnium oxide based thin films is governed by ferroelastic 90° domain wall movement. Furthermore, the discovery of an electric field-induced crystallization process in this material system is reported. For the analysis of the microstructure, the novel method of transmission Kikuchi diffraction is introduced, allowing an investigation of the local crystallographic phase, orientation and grain structure. Here, strong crystallographic textures are observed in dependence of the substrate, doping concentration and annealing temperature. Based on these results, the observed reliability behavior in the electronic devices is explainable and engineering of the present defect landscape enables further optimization. Finally, the behavior in neuromorphic devices is explored as well as process and design guidelines for the desired behavior are provided.
Chapter
Due to the slow-down of Moore’s Law and Dennard Scaling, new disruptive computer architectures are mandatory. One such new approach is Neuromorphic Computing, which is inspired by the functionality of the human brain. In this position paper, we present the projected SEC-Learn ecosystem, which combines neuromorphic embedded architectures with Federated Learning in the cloud, and performance with data protection and energy efficiency.
Conference Paper
This paper presents a novel ferroelectric field-effect transistor (FeFET) in-memory computing architecture dedicated to accelerating Binary Neural Networks (BNNs). We present in-memory convolution, batch normalization and dense layer processing through a grid of small crossbars with reduced unit size, which enables multi-bit operation and value accumulation. Additionally, we explore how operations can be parallelized to maximize computational performance. Simulation results show that our new architecture achieves a computing performance of up to 2.46 TOPS, a power efficiency of up to 111.8 TOPS/W, and an area of 0.026 mm² in 22 nm FDSOI technology.
Article
Hyperdimensional computing is an emerging computational framework that takes inspiration from attributes of neuronal circuits including hyperdimensionality, fully distributed holographic representation and (pseudo)randomness. When employed for machine learning tasks, such as learning and classification, the framework involves manipulation and comparison of large patterns within memory. A key attribute of hyperdimensional computing is its robustness to the imperfections associated with the computational substrates on which it is implemented. It is therefore particularly amenable to emerging non-von Neumann approaches such as in-memory computing, where the physical attributes of nanoscale memristive devices are exploited to perform computation. Here, we report a complete in-memory hyperdimensional computing system in which all operations are implemented on two memristive crossbar engines together with peripheral digital complementary metal–oxide–semiconductor (CMOS) circuits. Our approach can achieve a near-optimum trade-off between design complexity and classification accuracy based on three prototypical hyperdimensional computing-related learning tasks: language classification, news classification and hand gesture recognition from electromyography signals. Experiments using 760,000 phase-change memory devices performing analog in-memory computing achieve comparable accuracies to software implementations.
Article
Analogue in-memory computing using memristors could alleviate the performance constraints imposed by digital von Neumann systems in data-intensive tasks. Conventional linear memristors typically operate at high currents, potentially limiting power efficiency and scalability in practical applications. Here, we show that nonlinear ferroelectric tunnel junction memristors can perform linear computation at ultralow currents. Using logarithmic line drivers, we demonstrate that analogue-voltage-amplitude vector–matrix multiplication (VMM) can be performed in selectorless ferroelectric tunnel junction crossbars by exploiting a device nonlinearity factor that remains constant for multiple conductive states. We also show that our ferroelectric tunnel junction crossbars have the attributes required to scale analogue VMM-intensive applications, such as neural inference engines, towards energy efficiencies above 100 tera-operations per second per watt. Nonlinear ferroelectric tunnel junction memristors can be used to perform linear vector–matrix multiplication operations at ultralow currents.
Article
Modern computers are based on the von Neumann architecture in which computation and storage are physically separated: data are fetched from the memory unit, shuttled to the processing unit (where computation takes place) and then shuttled back to the memory unit to be stored. The rate at which data can be transferred between the processing unit and the memory unit represents a fundamental limitation of modern computers, known as the memory wall. In-memory computing is an approach that attempts to address this issue by designing systems that compute within the memory, thus eliminating the energy-intensive and time-consuming data movement that plagues current designs. Here we review the development of in-memory computing using resistive switching devices, where the two-terminal structure of the devices, their resistive switching properties, and direct data processing in the memory can enable area- and energy-efficient computation. We examine the different digital, analogue, and stochastic computing schemes that have been proposed, and explore the microscopic physical mechanisms involved. Finally, we discuss the challenges in-memory computing faces, including the required scaling characteristics, in delivering next-generation computing. This Review Article examines the development of in-memory computing using resistive switching devices.
M. Trentzsch et al., "A 28nm HKMG super low power embedded NVM technology based on ferroelectric FETs," IEDM, 2016.
K. Ni et al., "A circuit compatible accurate compact model for ferroelectric-FETs," VLSI Symposium, 2018.