All content in this area was uploaded by Thomas Kämpfe on Dec 29, 2020
Ultra-Low Power Flexible Precision FeFET Based
Analog In-Memory Computing
T. Soliman1*, F. Müller2*, T. Kirchner1 , T. Hoffmann2, H. Ganem1 , E. Karimov2, T. Ali2,
M. Lederer2, C. Sudarshan3, T. Kämpfe2, A. Guntoro1 and N. Wehn3
1Robert Bosch GmbH, Renningen Germany
2Fraunhofer IPMS, Center Nanoelectronic Technologies (CNT), Dresden, Germany,
3TU Kaiserslautern, Kaiserslautern, Germany
*equal contributions, email: taha.soliman@de.bosch.com, thomas.kaempfe@ipms.fraunhofer.de
Abstract— This paper presents an efficient crossbar design and
implementation intended for analog compute-in-memory (ACiM)
acceleration of artificial neural networks based on ferroelectric
FET (FeFET) technology. The novel mixed signal blocks
presented in this work reduce the device-to-device variation and
are optimized for low area, low power and high throughput. In
addition, we illustrate the operation and programmability of the
crossbar that adopts bit decomposition techniques for MAC
operation. Our crossbar based ACiM accelerator achieves a
record peak performance of 13714 TOPS/W.
I. INTRODUCTION
Deep neural networks (DNNs) have become, over the past
few years, the state-of-the-art technique for various
tasks such as image processing and speech recognition.
However, such networks come at the cost of high computational
and memory requirements. The computational kernels in these
networks rely on matrix-vector multiplications as their main
operation. In the last few years, various in-memory based
architectures were introduced which perform multiply and
accumulate (MAC) operations in an analog way [1]. However,
this approach requires power- and area-hungry, latency-intensive
analog-to-digital converters (ADCs), which so far seem to limit
the overall performance and efficiency. In addition, the
device-to-device variation in ultra-scaled memory cells,
particularly non-volatile memory (NVM), poses a further challenge
to such architectures because it affects the MAC operation results [2].
In this paper, we present an energy-efficient,
high-performance FeFET-based crossbar that supports
different operation modes and is robust to
variability. The influence of device-to-device variation on the
MAC operations is strongly reduced. We present optimized
peripheral IP specialized to improve power and area efficiency
on both single block level and overall system level. At the
system level, the proposed architecture is based on bit
decomposition of the MAC operations. Finally, we present our
results for different architectural configurations such as ADC
precision, number of activated cells and input feature/weight
precision.
II. PROPOSAL OF A 1FEFET1R CELL
The FeFET technology (Fig.2) has been introduced as a
cost-effective, ultra-low-power NVM solution for IoT
applications in 28nm bulk HKMG technology [3].
Ferroelectric memories have attracted broad interest for
in-memory computing [4]. The FeFET is particularly attractive
because it features a high memory window (MW), a very high
Ion/Ioff ratio, a fast and extremely low-power write operation
(<1fJ), low device-to-device variation (MW > 30σ for a width W
and length L configuration of 450×450nm²) and high data
retention (>10^8 s at 80°C) [5].
FeFET cells can be co-integrated with standard CMOS FETs,
which is particularly interesting for ultra-scaled periphery
circuitry.
The high Ion/Ioff ratio comes with a large variability in
the IDS of the low-VT (LVT) state, even for a very small VT
variation. To compensate for the IDS variation, we propose
a 1FeFET1R (1F1R) bitcell that connects the FeFET to a
resistor. For sufficiently high resistance, the output current
variability is strongly reduced (Fig.3). Furthermore, variability
originating from the word line (WL) is suppressed due to the
large operation window of VGS.
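The effect of the resistor can be illustrated with a toy model. The sketch below is not the calibrated device model used in this work: it assumes a simple linear-region channel resistance, and every numeric value (gain factor, VGS, VDS, mean VT, VT spread, resistor value) is a hypothetical placeholder chosen only to show the trend.

```python
import random
import statistics


def fefet_on_resistance(vt, vgs=1.0, k=2e-4):
    """Toy linear-region channel resistance R_on = 1 / (k * (VGS - VT)).

    k, vgs and the functional form are illustrative assumptions.
    """
    overdrive = vgs - vt
    if overdrive <= 0:
        return float("inf")  # device is off in this toy model
    return 1.0 / (k * overdrive)


def cell_current(vt, vds=0.1, r_series=0.0):
    """Current through a FeFET with an optional series resistor."""
    return vds / (fefet_on_resistance(vt) + r_series)


def relative_spread(r_series, n=2000, vt_mean=0.3, vt_sigma=0.054, seed=0):
    """Standard deviation of the cell current divided by its mean."""
    rng = random.Random(seed)
    currents = [
        cell_current(rng.gauss(vt_mean, vt_sigma), r_series=r_series)
        for _ in range(n)
    ]
    return statistics.pstdev(currents) / statistics.mean(currents)


if __name__ == "__main__":
    print("FeFET only :", relative_spread(0.0))
    print("1F1R, 1 MOhm:", relative_spread(1e6))
```

Once the resistor dominates the channel resistance, the current is set mainly by VDS/R, so the VT spread contributes only through the small remaining channel term.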
III. CROSSBAR ORGANIZATION
In [6], a mixed-signal system algorithm was presented that
eliminates digital-to-analog converters (DACs), as shown in Fig.1.
Instead of large, power- and area-expensive ADCs,
smaller and more efficient blocks are beneficial. We start by
presenting the ultra-low-power ADC developed to support
such a system.
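Before turning to the ADC, the DAC-free idea can be sketched in software. The example below is a minimal illustration, assuming unsigned integer activations, binary weights and an ideal ADC; the function name and argument layout are hypothetical and not taken from [6]. In each cycle only one bit of every activation is applied, so the inputs are purely digital levels, and the digitized column sum is shifted by the current bit position before accumulation.

```python
def bit_serial_mac(activations, weights, act_bits=8):
    """Bit-decomposed multiply-accumulate.

    activations: unsigned integers below 2**act_bits
    weights:     binary values (0 or 1), one per row of a column
    """
    acc = 0
    for bit in range(act_bits):
        # One clock cycle: bit `bit` of each activation drives its word line.
        wl_bits = [(a >> bit) & 1 for a in activations]
        # The column sums the AND of input bit and stored weight;
        # an ideal ADC is assumed to return this count exactly.
        column_sum = sum(x & w for x, w in zip(wl_bits, weights))
        # The shifter weights the partial sum by the input bit position.
        acc += column_sum << bit
    return acc


if __name__ == "__main__":
    acts = [3, 5, 7, 200]
    wts = [1, 0, 1, 1]
    print(bit_serial_mac(acts, wts))
```

The result equals the ordinary dot product of activations and weights, obtained after as many cycles as there are activation bits.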
A. Mixed Signal Blocks
Our ADC operates in high-side current mode with
thermometer coding, schematically depicted in Fig.4. It is
connected between the matrix and the power supply Vdd. The
circuitry consists of two blocks, a ladder-style ADC and a
reference voltage generator. The ADC consists of stacked
current mirrors. In the case of a 2b ADC these are M1/M7,
M4/M9 and M6/M11. The set current of these mirrors
decreases linearly from top to bottom.
As current is drawn by the column, the lowest current mirror
maintains a high voltage potential at its output Out1 as long as
the matrix current (Ires) is smaller than the reference current
Iref,3. If Ires rises beyond Iref,3, the voltage at Out1
drops sharply as the current through M11 is limited.
This turns on M12, which then provides a bypass for any extra
current drawn by the matrix. The sum of both currents through
M11 and M12 then also runs through M9. If that current keeps
rising and exceeds the set current of the M4/M9 mirror (Iref,2),
the voltage at Out2 drops, too. This process repeats for all
subsequent stages. The principle is explained here for three
stages only; depending on the required ADC precision, a higher
number of stages can be defined. Inverters follow the ADC stages
to drive the result into the subsequent digital circuits. Transient
simulations upon activated switches reveal the latency (see
Fig. 6). The extracted parameters are summarized in Tab. 2.
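The stage-by-stage behavior described above amounts to a thermometer code. The following is a purely behavioral sketch, not a circuit model: it ignores transients and mirror non-idealities, and the threshold values in the usage example (given in units of one cell current) are hypothetical.

```python
def thermometer_adc(i_res, i_refs):
    """Behavioral model of a current-input thermometer-code ADC.

    i_res:  column current drawn by the matrix
    i_refs: reference currents of the stages, in any order
    Returns (tripped, code): `tripped[k]` is True once the k-th lowest
    reference is exceeded (that stage's output node has dropped), and
    `code` is the number of tripped stages.
    """
    thresholds = sorted(i_refs)
    tripped = [i_res > i_ref for i_ref in thresholds]
    return tripped, sum(tripped)


if __name__ == "__main__":
    # Three stages, as in the 2b example of the text.
    for current in (0.0, 1.0, 2.0, 3.0):
        print(current, thermometer_adc(current, [0.5, 1.5, 2.5]))
```

Because each stage only trips after the one below it, the outputs always form a contiguous run of tripped stages, which is what makes the count a valid digital code.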
A reference generator converts multiple reference currents
into gate voltages for the current mirrors. Since it solely drives
the gates of the ADC transistors, it can be reused for multiple
ADCs. As the generated voltages are derived from the same
transistor type that is used in the ADC, they automatically
account for device manufacturing and temperature
variations. Choosing a current-input ADC also reduces the
influence of wire resistance variation on the crossbar output.
B. Crossbar
We use a 1FeFET1R memory cell crossbar configuration
as shown in Fig. 6. The input feature bits (Vinput) are applied to
the WLs, and the operation result is read at the drain lines (DL).
For simulation, the experimentally obtained IDVG curves were
used to calibrate a Preisach-based FeFET model [7]. We
implemented the cell variability using the experimentally
obtained VT variation for the LVT and HVT states, as shown in Fig.
7. The Monte-Carlo (MC) transient analysis upon WL
activation suggests fault-free digitization (Fig. 8). For larger
device-to-device variation, a small deviation can be observed at
a large activated FeFET count (Fig. 9). As an example, the 64×64
crossbar consists of 64 segments of 8×8 cells each (Fig. 10).
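The Monte-Carlo check described above can be mimicked at a much higher level of abstraction. The sketch below does not use the Preisach model or the measured VT data: it assumes that each activated cell contributes one unit of current with a Gaussian relative error, and that the ADC thresholds sit halfway between integer multiples of that unit. The seven thresholds (a 3b ADC), the spread values and the trial count are hypothetical.

```python
import random


def adc_code(i_col, n_thresholds):
    """Thermometer code with thresholds at 0.5, 1.5, ... unit currents."""
    return sum(i_col > (k + 0.5) for k in range(n_thresholds))


def count_errors(n_active, rel_sigma, n_thresholds=7, trials=2000, seed=1):
    """Count trials in which the ADC code differs from the number of
    activated cells, for a given relative spread of the cell current."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        i_col = sum(1.0 + rng.gauss(0.0, rel_sigma) for _ in range(n_active))
        if adc_code(i_col, n_thresholds) != n_active:
            errors += 1
    return errors


if __name__ == "__main__":
    for rel_sigma in (0.0005, 0.02, 0.08):
        print(rel_sigma, count_errors(n_active=6, rel_sigma=rel_sigma))
```

In this abstraction the accumulated error grows with the number of activated cells, which is consistent with deviations appearing first at large activated FeFET counts.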
Within a segment, the source terminals of all segment columns
are connected either to the programming/inhibit voltage pins or
to a shared resistor, which limits the current
contributed by one FeFET. Therefore, only one FeFET per
segment is activated during inference mode, selected by
activating its column and row. On the crossbar level, all gates
of a single row are connected. All drain terminals of a crossbar
column are connected either to the programming/inhibit
voltage or to the ADC, which is shared among, e.g., every 8
columns, of which only one is activated at a time (Fig. 11).
To reduce leakage by activated LVT FeFETs, we furthermore
apply a read inhibit voltage Vinh,WL. The layout of the 4kb
crossbar is shown in Fig. 12. Every FeFET within a segment is
programmed individually. To program the FeFET to one of the
two binary states (HVT/LVT), a WL voltage of ±4V (Vwrite) is
applied while keeping the SL/DL at 0V. To prevent other
FeFETs within the same row from being programmed, a write
inhibit voltage of ±2.7V (Vinh,SL/BL) is applied at the SLs on
segment level and the DLs on the crossbar level. The further
parameter configuration is shown in Tab. 2.
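The write and inhibit biasing can be made concrete with a short calculation. The sketch below assumes that the voltage relevant for programming is the WL potential minus the SL/DL potential, that the inhibit voltage has the same sign as the write voltage, and that a row spans 8 columns as in one segment; the helper name is hypothetical.

```python
V_WRITE = 4.0    # magnitude of the WL write voltage, from the text
V_INHIBIT = 2.7  # magnitude of the SL/DL inhibit voltage, from the text


def row_write_bias(selected_col, n_cols=8, polarity=+1):
    """Voltage across each cell of the written row.

    The selected column keeps SL/DL at 0 V; all other columns are
    raised to the inhibit voltage (assumed to follow the write polarity).
    """
    v_wl = polarity * V_WRITE
    bias = []
    for col in range(n_cols):
        v_line = 0.0 if col == selected_col else polarity * V_INHIBIT
        bias.append(v_wl - v_line)
    return bias


if __name__ == "__main__":
    print(row_write_bias(selected_col=2))
    print(row_write_bias(selected_col=2, polarity=-1))
```

Under these assumptions the selected cell sees the full 4V, while the other cells in the row see about 1.3V.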
Furthermore, we investigated the applicability of further
scaled devices. The statistical extraction of the IDVG curves of
such devices is shown in Fig.13 and summarized in Fig.14. The
variability is affected by the gate area. In general, the MW/σ
ratio decreases for smaller W/L configurations. Proper selection
of the FeFET W/L can still preserve a high ADC accuracy and can
significantly reduce the crossbar area.
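A rough feel for this trade-off can be obtained from an area-scaling rule of thumb. The sketch below is an assumption, not the extraction used in this work: it takes a Pelgrom-type dependence, in which σ scales with the inverse square root of the gate area, keeps MW constant, and anchors the curve at MW/σ = 30 for 450×450nm² (the lower bound quoted earlier in the text). Real FeFET variability may deviate from this rule.

```python
import math


def mw_over_sigma(width_nm, length_nm, ref_nm=450.0, ref_ratio=30.0):
    """Estimated MW/sigma for a given gate size.

    Assumes sigma ~ 1/sqrt(W*L) and a constant memory window, anchored
    at `ref_ratio` for a ref_nm x ref_nm device (illustrative only).
    """
    return ref_ratio * math.sqrt(width_nm * length_nm) / ref_nm


if __name__ == "__main__":
    for size in (450, 300, 225, 150):
        print(f"{size}x{size} nm^2: MW/sigma ~ {mw_over_sigma(size, size):.1f}")
```

Under this assumption, halving both W and L halves the MW/σ ratio while reducing the cell footprint to a quarter.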
IV. BENCHMARK OF THE PROPOSED FEFET ACIM
We now compare the 1FeFET1R-based ACiM solution with
other memory cell concepts, such as static and resistive random
access memory (SRAM, RRAM). The FeFET memory cell implemented
here is non-volatile with long data retention,
which reduces the need for a large external memory and for the
write energy upon restart that a volatile SRAM cell requires,
e.g. in an edge device. The large Ion/Ioff ratio allows for
low-latency operation. The cell area is comparable to or smaller
than that of SRAM.
We simulated four different ADC precision configurations
(2b, 3b, 4b and 5b) connected to the crossbar. The performance
is affected by both the latency and the power consumption
imposed by the ADC and the number of activated rows.
The configuration with the 3b ADC shows the highest
performance efficiency and requires a chip area of only 2μm².
It could be operated at 1GS/s while consuming a worst-case
power of 600nW. As the power consumption depends on the
measurement current, the average power consumption is
expected to be below 300nW.
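These figures translate directly into an energy per conversion. The short calculation below uses only the power and sampling rate quoted above; the helper name is hypothetical, and the 300nW case is an upper bound because the text states the average power only as "below 300nW".

```python
def energy_per_sample_fj(power_w, sample_rate_hz):
    """Energy per ADC conversion in femtojoules: E = P / f_s."""
    return power_w / sample_rate_hz * 1e15


if __name__ == "__main__":
    print("worst case :", energy_per_sample_fj(600e-9, 1e9), "fJ")
    print("avg. bound :", energy_per_sample_fj(300e-9, 1e9), "fJ")
```

At 1GS/s, 600nW corresponds to 0.6fJ per conversion and 300nW to 0.3fJ.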
The presented crossbar offers a flexible MAC operation
precision. It allows for an activation precision within the range
of 1b to 8b and weight precision of 1b, 2b or 4b. This flexibility
allows for coverage of various networks while ensuring
accuracy loss within an acceptable range. The architecture can
also allow for layer based precision to maintain the highest
possible throughput. For our experiments, we simulated various
networks with the datasets CIFAR10 and MNIST. In Fig. 15,
we show the crossbar performance at different precisions and at
different configurations, in which it can reach a peak power
efficiency of 13714 TOPS/W, which significantly advances the
state of the art (see Fig.16). This value can be further increased
by taking moderate sparsity into account.
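The peak efficiency can also be read as an energy per operation. The conversion below is plain unit arithmetic, assuming that one OP counts as one full MAC operation; the helper name is hypothetical.

```python
def energy_per_op_aj(tops_per_watt):
    """Energy per operation in attojoules for a given TOPS/W figure.

    1 TOPS/W = 1e12 operations per joule, so E = 1 / (TOPS/W * 1e12) J.
    """
    joules = 1.0 / (tops_per_watt * 1e12)
    return joules * 1e18


if __name__ == "__main__":
    print(f"{energy_per_op_aj(13714):.1f} aJ per operation")
```

For 13714 TOPS/W this gives roughly 73aJ per operation.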
V. CONCLUSIONS
In this work, we propose a novel high-performance 1FeFET1R-based
analog compute-in-memory (ACiM) architecture, which
mitigates the FeFET VT variation. By decreasing the current
sensing threshold with a novel current-input ADC, we
drastically improve the power and area efficiency while
preserving the accuracy after digitalization. The functionality
has been verified on the cell and array level by both experiments
and simulations. The high parallelism of the resulting ACiM
architecture supports various network precisions with a power
efficiency of up to 13714 TOPS/W.
VI. ACKNOWLEDGMENT
This work received funding within the ECSEL Joint
Undertaking project TEMPO in collaboration with the
European Union’s H2020 Framework Program (H2020/2014-
2020) and National Authorities, under grant agreement number
826655. We thank Globalfoundries for the provision of 28nm
technology FeFET structures.
VII. REFERENCES
[1] A. Sebastian et al., “In-memory hyperdimensional computing,” Nature
Electronics, 3, 327-337, 2020.
[2] D. Ielmini, H.-S. P. Wong, “In-memory computing with resistive
switching devices,” Nature Electronics, 1, 333-343, 2018.
[3] M. Trentzsch et al., “A 28nm HKMG super low power embedded NVM
technology based on ferroelectric FETs,” IEDM, 2016.
[4] R. Berdan et al., “Low-power linear computation using nonlinear
ferroelectric tunnel junction memristors,” Nature Electronics, 3,
259-266, 2020.
[5] S. Beyer et al., “FeFET: A versatile CMOS compatible device with
game-changing potential,” IMW, 2020.
[6] T. Soliman et al., “Efficient FeFET crossbar accelerator for
multiprecision neural networks,” SOCC, 2020.
[7] K. Ni et al., “A circuit compatible accurate compact model for
ferroelectric-FETs,” VLSI Symposium, 2018.
[Figure panel titles: Motivation: ultra-low-power in-memory computing using non-volatile FeFET selectors; Peripherals for reliable FeFET-based in-memory computing; Experimental demonstration of proposed ferroelectric CAM cell; FeFET crossbar organization and physical implementation; FeFET area scaling opportunities benchmark; System benchmarking]
[Figure: Analog compute-in-memory (ACiM) calculation flow and tile architecture. Deep learning algorithms require memory-intensive MAC calculations. ACiM using NVM weights and adapted ADCs, implemented in a tile architecture of processing elements (PEs) with a shifter, offers high flexibility to emulate and adapt to various large neural networks. The impediment is a highly reliable and accurate digitalization. Flow steps: single-bit digital activation; accumulation of the AND operation on that line; digitalization of the crossbar output to a bit value; shifting, where the shift value is the current input bit position; final output of the MAC operations after n clock cycles, where n is the input precision.]
[Figure: TEM image and material stack of ferroelectric FETs (silicide, poly-Si, TiN, HfO2, SiO2, p-Si), and IDVG curves of the LVT and HVT states.]
[Figure: Ferroelectric FETs (FeFETs) act as non-volatile selectors with a high Ion/Ioff ratio if connected to a resistor (1F1R). The current variability is thereby strongly reduced. Panels: reported configuration (high on/off ratio, large current variability) and proposed configuration (high on/off ratio, low current variability).]
[Figure: Proposed current-input thermometer-code ADC with an adaptable threshold current per stage.]
[Figure: Transient analysis of the proposed ADC.]
[Table: ADC performance parameters. Rows: peak power (µW), latency (ns). Columns: reference generator and the ADC precisions.]
[Figure: FeFET-based crossbar columns as used in the simulation.]
[Figure: (a) FeFET device modeling based on a calibrated Preisach FE model [7], adding an experimental variability with σ = 54. (b) The simulation for the proposed 1F1R configuration matches the experimental condition.]
[Figure: Transient response upon sequential activation of the LVT FeFETs in the column, and digitalization of the current response with the experimentally extracted device variability.]
[Figure: ACiM accuracy for the experimental FeFET device-to-device VT variation and multiples of it. For σ = 54, the ADC reveals the correct result for all MC samples.]
[Figure: Program and inference configuration of FeFET segments connected to a single resistor per segment, improving area efficiency.]
[Figure: Crossbar arrangement as constructed from segments and ADCs.]
[Figure: Layout of a FeFET crossbar built of segments, supporting MAC operations per clock cycle.]
[Figure: Experimental IDVG curves and extracted VT histograms for representative W/L configurations of the FeFETs, with a pre-shrink configuration that offers a suitable σVT/MW variation.]
[Figure: Variability of the HVT and LVT states for various W/L configurations, plotted over area.]
[Figure: System performance in TOPS/W (OP stands for a full MAC operation) for ADCs of different bit precision connected to the crossbar, at different activation and weight precisions. Axes: activation precision, weight precision.]
[Figure: Benchmark of the proposed FeFET-based ACiM against existing solutions with respect to power efficiency and performance. A scalable accelerator concept is provided.]
[Table: System and device parameters used in this simulation. Entries: LVT, HVT, σLVT, σHVT, Vinput, Vinh,WL, Vwrite, write inhibit voltages (WL and SL/BL), pulse duration, input, output, ADC.]
[Table: Comparison of FeFET with other memory concepts (SRAM, RRAM) studied for implementation in ACiM. Rows: volatility (SRAM volatile; RRAM and FeFET non-volatile), retention, Ion/Ioff, area in µm² at the respective CMOS node, endurance.]