Article

A 0.11pJ/bit read energy embedded NanoBridge non-volatile memory and its integration in a 28nm 32-bit RISC-V microcontroller units

Authors:
  • NanoBridge Semiconductor Inc.

Abstract

A 28nm 512Kb NanoBridge non-volatile memory is developed for an energy-efficient microcontroller unit. A read energy of 0.11pJ/bit is achieved by using an inverter sense scheme, enabled by the large ON/OFF conductance ratio of a split-electrode NanoBridge. The read energy is 71% and 54% less than those of a ReRAM and a SONOS commercial embedded NOR Flash at the same technology node, respectively. Moreover, a 28nm 32-bit RISC-V microcontroller unit embedded with a 2Mb NanoBridge non-volatile memory is fabricated and achieves an 80MHz operation frequency.
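
The figures in the abstract can be turned into a quick back-of-the-envelope check. The sketch below is illustrative arithmetic only: it back-calculates the comparator read energies implied by the quoted percentage reductions, and estimates average read power under the assumption (not stated in the abstract) that the MCU fetches one 32-bit word from the memory on every cycle at 80MHz. The function names `implied_baseline` and `read_power_uw` are hypothetical helpers introduced here.

```python
# Illustrative arithmetic only. The baseline energies are back-calculated from
# the percentages quoted in the abstract; they are not reported measurements.

READ_ENERGY_PJ_PER_BIT = 0.11  # read energy quoted in the abstract


def implied_baseline(energy_pj: float, reduction: float) -> float:
    """Energy of a baseline that `energy_pj` undercuts by `reduction` (0 to 1)."""
    return energy_pj / (1.0 - reduction)


def read_power_uw(energy_pj_per_bit: float, bits_per_access: int,
                  accesses_per_s: float) -> float:
    """Average read power in microwatts for a given access width and rate."""
    watts = energy_pj_per_bit * 1e-12 * bits_per_access * accesses_per_s
    return watts * 1e6


if __name__ == "__main__":
    # Baseline read energies implied by each quoted reduction.
    for reduction in (0.71, 0.54):
        baseline = implied_baseline(READ_ENERGY_PJ_PER_BIT, reduction)
        print(f"{reduction:.0%} reduction implies a baseline of {baseline:.3f} pJ/bit")

    # ASSUMPTION: one 32-bit read per cycle at 80 MHz (worst-case fetch rate).
    power = read_power_uw(READ_ENERGY_PJ_PER_BIT, 32, 80e6)
    print(f"Estimated read power: {power:.1f} uW")
```

Under these assumptions the implied baselines are roughly 0.38 pJ/bit and 0.24 pJ/bit, and the memory read power stays below 0.3 mW; a real workload with caching or idle cycles would read less often.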

Article
In-memory computing (IMC) has emerged as a new computing paradigm able to alleviate or suppress the memory bottleneck, which is the major concern for energy efficiency and latency in modern digital computing. While the IMC concept is simple and promising, the details of its implementation cover a broad range of problems and solutions, including various memory technologies, circuit topologies, and programming/processing algorithms. This Perspective aims at providing an orientation map across the wide topic of IMC. First, the memory technologies will be presented, including both conventional complementary metal-oxide-semiconductor-based and emerging resistive/memristive devices. Then, circuit architectures will be considered, describing their aim and application. Circuits include both popular crosspoint arrays and other more advanced structures, such as closed-loop memory arrays and ternary content-addressable memory. The same circuit might serve completely different applications, e.g., a crosspoint array can be used for accelerating matrix-vector multiplication for forward propagation in a neural network and outer product for backpropagation training. The different algorithms and memory properties to enable such diversification of circuit functions will be discussed. Finally, the main challenges and opportunities for IMC will be presented.
Article
The analog AI core concept is appealing for deep‐learning (DL) because it combines computation and memory functions into a single device. Yet, significant challenges such as noise and weight drift will impact large‐scale analog in‐memory computing. Here, effects of flicker noise and drift on large DL systems are explored using a new flicker‐noise model with memory, which preserves temporal correlations, including a flicker noise figure of merit (FOM) A_r to quantify impacts on system performance. Flicker noise is characterized for Ge2Sb2Te5 (GST) based phase‐change memory (PCM) cells with a discovery of read‐noise asymmetry tied to shape asymmetry of mushroom cells. This experimental read polarity dependence is consistent with Pirovano's trap activation and defect annihilation model in an asymmetric GST cell. The impact of flicker noise and resistance drift of analog PCM synaptic devices on deep‐learning hardware is assessed for six large‐scale deep neural networks (DNNs) used for image classification, finding that the inference top‐1 accuracy degraded with the accumulated device flicker noise and drift as ∝ A_r × t_wait and ∝ t_wait^(−ν), respectively, where ν is the drift coefficient. These negative impacts could be mitigated with a new hardware‐aware (HWA) (pre)‐training of the DNNs, which is applied before programming to the analog arrays. Flicker noise with memory is included in simulations of large image classification neural networks. Figures of merit (FOM) are derived and compared with flicker noise and drift measurements on phase‐change memory (PCM) cells. A new noise asymmetry is found, correlated with the cell's structural asymmetry. New hardware training aware algorithms are explored to mitigate noise impacts.
Article
To support the increasing demands for efficient deep neural network processing, accelerators based on analog in-memory computation of matrix multiplication have recently gained significant attention for reducing the energy of neural network inference. However, analog processing within memory arrays must contend with the issue of parasitic voltage drops across the metal interconnects, which distort the results of the computation and limit the array size. This work analyzes how parasitic resistance affects the end-to-end inference accuracy of state-of-the-art convolutional neural networks, and comprehensively studies how various design decisions at the device, circuit, architecture, and algorithm levels affect the system’s sensitivity to parasitic resistance effects. A set of guidelines are provided for how to design analog accelerator hardware that is intrinsically robust to parasitic resistance, without any explicit compensation or re-training of the network parameters.
Conference Paper
A 28nm 512Kb NanoBridge non-volatile memory is developed for an energy-efficient microcontroller unit. A read energy of 0.11pJ/bit is achieved by using an inverter sense scheme, enabled by the large ON/OFF conductance ratio of a split-electrode NanoBridge. The read energy is 54% or 71% less than that of a ReRAM or a SONOS commercial eFLASH at the same technology node, respectively.
Chapter
Ever since the emergence of the electrical computer and the Von Neumann model, computer architects have adhered to a well-structured hierarchy of memory solutions, clearly trading off performance and capacity with cost. The ubiquitous memory technologies, dominated by SRAM, DRAM, Flash, and magnetic hard disks, are each situated in a well-defined location within this hierarchy. However, as processes have scaled into deep nanometer feature sizes and the demand for larger capacities and bandwidths increases, these traditional options face tough challenges and may be limited in their ability to continue to provide the new requirements. New technologies, such as phase change memory (PCM), magnetic RAM (MRAM), and resistive RAM (RRAM), have been researched and developed over the recent past in an attempt to meet these demands and replace some or all of the traditional technologies. In this article, the primary technologies are overviewed, including those that currently fill the memory hierarchy pyramid, the primary candidates to join or replace them in the near-term, and a number of newer candidates that may arise as legitimate solutions in the farther term. In addition, an overview of the current state of processing within the emerging memory technologies is provided, as an attempt to break free of the traditional Von Neumann paradigm to overcome the energy and performance bottlenecks in modern systems.
Article
With the expansion of Internet of Things (IoT) by artificial intelligence (AI), there is a strong demand for higher performance, higher intelligence, and lower cost in endpoint devices for crossover area located at the boundary between high-end micro controller unit (MCU) area and low-end micro processor unit (MPU) area, such as home automation, machine vision, robotics, and so on. This article presents a 40-nm embedded split-gate MONOS (SG-MONOS) flash macro with optimized memory array architecture and charge-assisted offset cancellation sense amplifier (CAOC-SA), which achieve high-speed random read operation of 200 MHz and high density of 7.91 Mb/mm² suitable for crossover devices.
Article
The majority of IoT edge devices are embedded systems with a tiny microcontroller unit (MCU), which acts as its brain. When users want their edge devices to continuously improve for better edge-analytics results, there is a need to equip their devices with algorithms that can learn/train from the continuously evolving real-world data. Currently, such devices are not capable of executing any machine learning (ML)-based model training tasks due to their resource constraints, such as limited memory (SRAM, Flash, and EEPROM), low operations per second, and inability to perform parallel processing. In this article, we provide ML-MCU, a framework with our novel Optimized-Stochastic Gradient Descent (Opt-SGD) and Optimized One-Versus-One (Opt-OVO) algorithms to enable both binary and multiclass ML classifier training directly on MCUs. Thus, ML-MCU enables billions of MCU-based IoT edge devices to self learn/train (offline) after their deployment, using live data from a wide range of IoT use cases. When evaluating our algorithms on multiple popular MCUs, using various data sets of different sizes and feature dimensions, one of the most exciting findings was that our Opt-OVO algorithm trained a multiclass classifier using a data set of class count 50 on a $3 resource-constrained MCU and also performed onboard unit inference for the same 50 class data in super real time (6.2 ms).
Conference Paper
Programmable Logic (PL) with a high logic density is demonstrated by cross-bar (xbar) of atom switches, which are programmed through logic transistors. The PL has 4 4-input LUTs to minimize area-delay product owing to small area & capacitance of atom switch. Xbar with 50% and 100% populations mixed and programming lines shared architecture achieves a 2× higher logic density compared with a commercial PL chip on the same technology node of 40 nm. 3× higher operation frequency and 40% lower power consumption are also assessed.
Article
The authors demonstrate an ultra-low-power microcontroller unit (MCU) with an embedded atom-switch ROM, which performs 0.39-V operation voltage and 18.26-pJ/cycle minimum active energy (or 18.26-μW/MHz minimum active power) at 14.3 MHz. The MCU is fabricated using an embedded atom-switch process with a hybrid silicon-on-thin-buried-oxide (SOTB) core and bulk I/O transistors. The atom switch is suitable for an ultra-low-voltage operation because of its high on/off conductance ratio. The SOTB CMOS with a body-bias voltage control realizes a high operation frequency of 40 MHz at 0.54 V and an ultra-low sleep power of 0.628 μW, simultaneously.
Article
The designs of resistive RAM (ReRAM) macros are limited by 1) a small sensing margin, limited read-VDDmin, and slow read access time (TAC) caused by a high cell-resistance and small cell-resistance-ratio (R-ratio) and 2) poor power integrity and increased energy waste attributable to a large SET dc-current (IDC-SET) resulting from the wide distribution of write (SET)-times (TSET). This study proposes a swing-sample-and-couple (SSC) voltage-mode sense amplifier (VSA) to enable an approximately 1.8+ x greater sensing margin for lower VDDmin and a 1.7+x faster read speed across a wide VDD range, compared with conventional VSAs. A 4T self-boost-write-termination (SBWT) scheme is proposed to cut off the IDC-SET of devices with a rapid TSET. The SBWT scheme reduces 99+% of the IDC-SET with an area penalty below 0.5%. A fabricated 512 row 28 nm 1 Mb ReRAM macro achieved TAC=404 ns when VDD=0.27 V and confirmed the IDC-SET cutoff by the SBWT.
Article
For the first time, an area-efficient nonvolatile carry chain combining look-up tables and a pass-transistor-logic-based adder is newly developed using complementary atom switches without additional CMOS circuits. A proposed tristate switch composed of three pairs of complementary atom switches selects one of "0", "1", and the "carry_in" signal as the input of a common multiplexer for both a look-up table and an adder. The developed nonvolatile carry chain achieves the reductions of 20% area, 17% delay, and 17% power consumption, respectively, in comparison with a conventional nonvolatile carry chain using dedicated CMOS gates.
Article
The dc current-stress tolerance of the ON-state Cu atom switch is evaluated at elevated temperature. It is revealed that the reset-direction current stress causes time-dependent failures, which originate from the -field-driven diffusion of Cu in the conducting bridge. A new empirical lifetime estimation model, including the Joule heating effect, gives an allowable maximum current per atom switch of , which is large enough to satisfy the requirements for signal routing under currents that are average (18 ) and peak (63 ) in the reconfigurable switch block operated at 500 MHz at 125 °C.
Conference Paper
Emerging nonvolatile memories (NVMs) have a potential to overcome the issues in the conventional static random-access memory (SRAM) based reconfigurable logic cell arrays (RLCAs). Replacing a CMOS switch element composed of a SRAM and a pass transistor by a NVM reduces chip size. And non-volatility reduces the stand-by power. More importantly, the compactness of NVM allows fine-grain logic cells (small cluster size), which advantageously enables a highly efficient cell usage, resulting in compact circuit for applications. In this paper, we investigate the fine-grain cell architecture using atom switch which is one of the NVMs. We evaluate the effect of the cluster size and the segment length on the atom-switch-based RLCA to confirm the optimal point considering area-delay product. Cluster size is optimized to be 4, which is smaller than that in the conventional SRAM- and multiplexer-based RLCA. This optimization is originated from the fact that the inter-delay among clusters is only twice of the intra-delay in cluster for atom-switch-based RLCA with routing block formed by crossbar switches because of very small capacitance and resistance of atom switches. On the other hand, the segment length is optimized to be 4, which is the same as that in the conventional SRAM- and multiplexer-based RLCA.
Article
Current overshoot during a set operation has significant impacts on the ON conductance and reliability of resistive change devices such as atom switches. We break the set operation into three steps: incubation, transition, and settling. We clarify their contributions to the determinations of the ON conductance of the atom switch. The variation in the transition time causes a significant variation in the ON conductance. On the basis of the ON conductance distribution, the median of the transition time of 1 ns and its distribution are revealed. (C) 2014 The Japan Society of Applied Physics
Article
A polymer solid-electrolyte (PSE) switch has been embedded in a 90-nm-node CMOS featuring a forming-less programming and extremely high on/off ratio of 10⁵. A fast programming of 10 ns is also demonstrated for 50-nmΦ 1 k-b array by introducing the PSE switches integrated with a fully logic compatible process below 350°C. A high free volume in the PSE is supposed to result in the smooth formation of the Cu bridge without destroying the electrolyte, thereby also resulting in forming-less programming and high breakdown voltage. High disturbance reliability (T50; 50% fail) is extracted to be over 10 years at operation condition. The improved switching characteristics enable us to accurately program the crossbar circuit in a practical scale (32 × 32) without cell transistors. The developed switch is a strong candidate for realizing a low-power and low-cost nonvolatile programmable logic.
I. Kouznetsov, "Embedded 28 nm charge-trap NVM technology," Flash Memory Summit, 2017.
K.-Y. Hsiang et al., "Unipolar switching based antiferroelectric HfZrO2 diode beyond endurance >10¹¹ cycles with low operation voltage for FeRAM application," Ext. Abstr. Solid State Devices and Materials, 2021, p. 89.
X. Bai et al., "Area-efficient nanobridge-based FPGA with optimized architecture," Ext. Abstr. Solid State Devices and Materials, 2016, p. 451.
X. Bai et al., "A low-power Cu atom switch programmable logic fabricated in a 40nm-node CMOS technology," 2017 Symp. on VLSI Technology (Kyoto, Japan), 2017, p. T28.
J. Lee et al., "Highly reliable 28 nm embedded flash process development for high-density and high-speed automotive grade-1 application," IEEE Int. Memory Workshop (IMW), 2021, p. 1.