Conventional memory crossbar array for ANN computing
a, Conventional memory crossbar array to perform analogue vector–matrix multiplication. b, Vector–matrix multiplication is prevalent in ANN computing: it is used to transfer data from one layer to the next.
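As a quick illustration of panel a, the sketch below models an ideal crossbar in Python: row voltages encode the input vector, cell conductances encode the weight matrix, and each column current accumulates the products via Ohm's and Kirchhoff's laws. All values and names are illustrative, not a device model.

```python
import numpy as np

# Ideal crossbar: G[i, j] is the conductance of the cell at row i,
# column j; v[i] is the voltage applied to row i. Each column current
# is sum_i v[i] * G[i, j], i.e. one analogue multiply-accumulate per
# column (Ohm's law for the products, Kirchhoff's law for the sums).
def crossbar_vmm(v: np.ndarray, G: np.ndarray) -> np.ndarray:
    return v @ G  # column currents = vector-matrix product

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))  # conductances (siemens)
v = np.array([0.1, 0.2, 0.0, 0.3])        # row voltages (volts)
print(crossbar_vmm(v, G))                 # column currents (amperes)
```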


Source publication
Article
Full-text available
Implementations of artificial neural networks that borrow analogue techniques could potentially offer low-power alternatives to fully digital approaches1–3. One notable example is in-memory computing based on crossbar arrays of non-volatile memories4–7 that execute, in an analogue manner, multiply–accumulate operations prevalent in artificial neural networks...

Citations

... Moreover, the complexity can be further enhanced by linking multiple confined geometries, which could provide an ultra-low-energy alternative to neuromorphic computing based on arrays of nanoscale spintronic oscillators [34][35][36]. Another advantage is the possibility of combining the all-magnetic linear read-out with in-memory computing, for example, magnetic random-access memory [37]. Furthermore, the scalability of this concept to nanoscale dimensions reduces the displacement distances, which would give rise to latency in the nanosecond regime [17]. ...
Article
Full-text available
Reservoir computing (RC) has been considered one of the key computational principles beyond von Neumann computing. Magnetic skyrmions, topological particle-like spin textures in magnetic films, are particularly promising for implementing RC, since they respond strongly nonlinearly to external stimuli and feature inherent multiscale dynamics. However, despite several theoretical proposals for skyrmion reservoir computing, experimental realizations have been elusive until now. Here, we propose and experimentally demonstrate a conceptually new approach to skyrmion RC that leverages the thermally activated diffusive motion of skyrmions. By confining the electrically gated and thermal skyrmion motion, we find that a single skyrmion in a confined geometry already suffices to realize nonlinearly separable functions, which we demonstrate for the XOR gate along with all other Boolean logic gate operations. Besides this universality, the reservoir computing concept ensures low training costs and ultra-low-power operation with current densities orders of magnitude smaller than those used in existing spintronic reservoir computing demonstrations. Our proposed concept is robust against device imperfections and can be readily extended by linking multiple confined geometries and/or by including more skyrmions in the reservoir, suggesting high potential for scalable and low-energy reservoir computing.
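As an illustrative aside, the reservoir idea above can be captured in a few lines: a fixed nonlinear map (standing in for the skyrmion's response to gating) plus a trained linear readout suffices for XOR. This is a generic RC sketch with made-up parameters, not the paper's device model.

```python
import numpy as np

# Generic reservoir-computing sketch: only the linear readout is
# trained, which is why training costs stay low.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)   # XOR: not linearly separable

W_in = rng.normal(size=(2, 20))           # fixed random input coupling
states = np.tanh(X @ W_in + rng.normal(size=20))  # reservoir states

# Ridge regression (closed form) for the readout weights.
lam = 1e-6
W_out = np.linalg.solve(states.T @ states + lam * np.eye(20), states.T @ y)
print((states @ W_out > 0.5).astype(int))  # expected: [0 1 1 0]
```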
... The key mechanism is based on the change of orientation of one of the layers of the stack, which results in a variation in the device's electrical resistance. Several companies, such as Samsung [41], are showing growing interest in these architectures, and we expect that real devices might be commercially available shortly. Therefore, we hypothesize that MRAM-PuM architectures are a good fit to accelerate TSA, and particularly sDTW. ...
Preprint
Full-text available
Time Series Analysis (TSA) is a critical workload for consumer-facing devices. Accelerating TSA is vital for many domains, as it enables the extraction of valuable information and the prediction of future events. The state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping (sDTW) algorithm. However, sDTW's computational complexity increases quadratically with the time series' length, resulting in two performance implications. First, the amount of data parallelism available is significantly higher than the small number of processing units offered by commodity systems (e.g., CPUs). Second, sDTW is bottlenecked by memory because it 1) has low arithmetic intensity and 2) incurs a large memory footprint. To tackle these two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ computation where the data resides, using the memory cells. PuM provides a promising solution to alleviate data movement bottlenecks and exposes immense parallelism. In this work, we present MATSA, the first MRAM-based Accelerator for Time Series Analysis. The key idea is to exploit magneto-resistive memory crossbars to enable energy-efficient and fast time series computation in memory. MATSA provides the following key benefits: 1) it leverages high levels of parallelism in the memory substrate by exploiting column-wise arithmetic operations, and 2) it significantly reduces data movement costs by performing computation using the memory cells. We evaluate three versions of MATSA to match the requirements of different environments (e.g., embedded, desktop, or HPC computing) based on MRAM technology trends. We perform a design space exploration and demonstrate that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU and PNM architectures, respectively.
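For readers unfamiliar with sDTW, a minimal reference implementation is sketched below: the first row of the cost matrix is initialised with local costs only, so a match may start anywhere, and the minimum of the last row gives the best match cost. The O(n·m) dependency chain and memory traffic visible here are what MATSA moves into the memory array; the code itself is illustrative, not MATSA's kernel.

```python
import numpy as np

def sdtw(query: np.ndarray, series: np.ndarray) -> float:
    """Best-matching-subsequence DTW cost of `query` within `series`."""
    n, m = len(query), len(series)
    D = np.full((n, m), np.inf)
    D[0, :] = (query[0] - series) ** 2            # free start position
    for i in range(1, n):
        D[i, 0] = D[i - 1, 0] + (query[i] - series[0]) ** 2
        for j in range(1, m):
            D[i, j] = (query[i] - series[j]) ** 2 + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[-1].min())                     # free end position

print(sdtw(np.array([1., 2., 3.]), np.array([0., 1., 2., 3., 4., 0.])))  # 0.0
```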
... The new RF connection scheme based on frequency multiplexing proposed here also constitutes a promising alternative to the crossbar array geometry for densely connecting neurons through synaptic devices with limited resistance variations. Passive crossbar arrays indeed typically require devices with an OFF/ON ratio above 100 to avoid excessive parasitic currents between their columns and rows 3 , and they are not adapted to connect low OFF/ON memristors or magnetic tunnel junctions 52 . In contrast, the presented RF connection scheme does not suffer from sneak paths: by construction, the synaptic chains feeding each output neuron are not interconnected, as the communication between neurons and synapses relies on frequency multiplexing instead of pure wiring, as can be seen in Fig. 1a. ...
Preprint
Full-text available
Spintronic nano-synapses and nano-neurons perform complex cognitive computations with high accuracy thanks to their rich, reproducible and controllable magnetization dynamics. These dynamical nanodevices could transform artificial intelligence hardware, provided that they implement state-of-the-art deep neural networks. However, there is today no scalable way to connect them in multilayers. Here we show that the flagship nano-components of spintronics, magnetic tunnel junctions, can be connected into multilayer neural networks where they implement both synapses and neurons thanks to their magnetization dynamics, and communicate by processing, transmitting and receiving radio frequency (RF) signals. We build a hardware spintronic neural network composed of nine magnetic tunnel junctions connected in two layers, and show that it natively classifies nonlinearly separable RF inputs with an accuracy of 97.7%. Using physical simulations, we demonstrate that a large network of nanoscale junctions can achieve state-of-the-art identification of drones from their RF transmissions, without digitization, and consuming only a few milliwatts, which is a gain of more than four orders of magnitude in power consumption compared to currently used techniques. This study lays the foundation for deep, dynamical, spintronic neural networks.
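A toy numerical sketch of the frequency-multiplexed connection scheme: each input rides on its own carrier, each synapse is a frequency-selective gain, and the receiving neuron sums total power, with no shared rows or columns to create sneak paths. Frequencies, gains, and the power-summing neuron model are illustrative assumptions.

```python
import numpy as np

fs, T = 1e6, 1e-3                         # sample rate (Hz), duration (s)
t = np.arange(0, T, 1 / fs)
freqs = np.array([50e3, 120e3, 200e3])    # one carrier per input
x = np.array([0.2, 0.8, 0.5])             # input amplitudes
w = np.array([1.5, 0.3, 0.9])             # synaptic gains per carrier

# Multiplexed signal at the neuron: all tones are summed on one wire.
signal = sum(wi * xi * np.sin(2 * np.pi * f * t)
             for wi, xi, f in zip(w, x, freqs))

# Orthogonal carriers make cross terms average out, so the measured
# power reduces to sum((w*x)^2)/2: a weighted sum without dense wiring.
print(np.mean(signal ** 2), np.sum((w * x) ** 2) / 2)
```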
... These systems alleviate the memory wall problem in conventional architectures, while also leveraging the efficiency and parallelism of analog computation (Sebastian et al., 2020;Xiao et al., 2020). A variety of computational memory devices have been proposed as artificial synapses for DNNs: resistive random access memories (ReRAM) (Li et al., 2018;Yao et al., 2020), phase change memories (Barbera et al., 2018;Joshi et al., 2020), electrochemical memories (Gkoupidenis et al., 2015;Lin et al., 2016;Li et al., 2021;Kireev et al., 2022), designer ionic/electronic thin films (Robinson et al., 2022), magnetic memories (Jung et al., 2022), and others. However, these synaptic devices cannot directly implement BNN weights, which are not static but are sampled from trained probability distributions. ...
Article
Full-text available
Bayesian neural networks (BNNs) combine the generalizability of deep neural networks (DNNs) with a rigorous quantification of predictive uncertainty, which mitigates overfitting and makes them valuable for high-reliability or safety-critical applications. However, the probabilistic nature of BNNs makes them more computationally intensive on digital hardware and so far, less directly amenable to acceleration by analog in-memory computing as compared to DNNs. This work exploits a novel spintronic bit cell that efficiently and compactly implements Gaussian-distributed BNN values. Specifically, the bit cell combines a tunable stochastic magnetic tunnel junction (MTJ) encoding the trained standard deviation and a multi-bit domain-wall MTJ device independently encoding the trained mean. The two devices can be integrated within the same array, enabling highly efficient, fully analog, probabilistic matrix-vector multiplications. We use micromagnetics simulations as the basis of a system-level model of the spintronic BNN accelerator, demonstrating that our design yields accurate, well-calibrated uncertainty estimates for both classification and regression problems and matches software BNN performance. This result paves the way to spintronic in-memory computing systems implementing trusted neural networks at a modest energy budget.
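The probabilistic matrix-vector multiply described above can be sketched as follows: each weight is redrawn from N(mean, std) on every pass, with the mean and std playing the roles of the domain-wall MTJ and stochastic MTJ states. Shapes, magnitudes, and the sampling loop are illustrative, not the micromagnetic model.

```python
import numpy as np

rng = np.random.default_rng(2)
mean = rng.normal(size=(8, 4))                     # trained means
std = np.abs(rng.normal(0.1, 0.02, size=(8, 4)))   # trained std devs
x = rng.normal(size=8)

# One fresh weight sample per inference pass, as the stochastic
# bit cells would provide physically.
samples = [x @ (mean + std * rng.normal(size=mean.shape))
           for _ in range(100)]
print(np.mean(samples, axis=0))   # predictive mean
print(np.std(samples, axis=0))    # predictive uncertainty
```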
... The connectivity of nodes, that is, synaptic weights, can be modulated. Each bit cell consists of two FET switches and two magnetic tunnel junctions, and these bit cells are connected in series to form a column in the crossbar array (Jung et al., 2022). All images in the diagram are adapted from the corresponding references. ...
... stores a synaptic weight W (R_L or R_H; each is the sum of the MTJ and FET switching resistances), while the MTJ-FET path on the right stores a weight complementary to W. Then, the left or right path can be selected by IN, generating the resistance (R_L or R_H) of the selected path as the bit cell output (Jung et al., 2022). ...
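A minimal behavioural sketch of that two-path bit cell, under assumed resistance values: the stored bit decides which path holds the low resistance, and the input bit IN selects which path is read, so the output resistance realizes a one-bit multiply.

```python
R_L, R_H = 5e3, 25e3   # assumed low/high path resistances (ohms)

def bitcell_read(stored_bit: int, in_bit: int) -> float:
    """Return the resistance of the path selected by IN."""
    left, right = (R_L, R_H) if stored_bit else (R_H, R_L)
    return left if in_bit else right

for w in (0, 1):
    for i in (0, 1):
        print(f"W={w} IN={i} -> {bitcell_read(w, i):.0f} ohms")
```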
Article
Full-text available
Biologically inspired neuromorphic computing paradigms are computational platforms that imitate synaptic and neuronal activities in the human brain to process big data flows in an efficient and cognitive manner. In the past decades, neuromorphic computing has been widely investigated in various application fields such as language translation, image recognition, modeling of phase, and speech recognition, especially in neural networks (NNs) utilizing emerging nanotechnologies; owing to their inherent miniaturization and low power cost, these technologies can alleviate the technical barriers that neuromorphic computing faces with traditional silicon technology in practical applications. In this work, we review recent advances in the development of brain-inspired computing (BIC) systems from the perspective of a system designer, from the device-technology and circuit levels up to the architecture and system levels. In particular, we sort out the NN architectures determined by the data structures centered on big data flows in application scenarios. Finally, the interactions of the system level with the architecture level and the circuit/device level are discussed. Consequently, this review can serve the future development and opportunities of BIC system design.
... STT-MRAM is a nonvolatile device with high density and near-zero leakage. STT-MRAM-based MAC can perform a range of arithmetic, logic, and vector operations for general-purpose or binary/ternary CNNs [23][24][25][26][27][28]. Multilevel-cell (MLC) STT-MRAM has been used for the bitwise operations of BNNs. ...
Article
In-memory computing (IMC) quantized neural network (QNN) accelerators are extensively used to improve energy efficiency. However, ternary neural network (TNN) accelerators with bitwise operations in nonvolatile memory are lacking. In addition, specific accelerators are generally tied to a single algorithm, limiting their applications. In this report, a multiply-and-accumulate (MAC) circuit based on ternary spin-torque transfer magnetic random access memory (STT-MRAM) is proposed, which allows writing, reading, and multiplying operations in memory and accumulation near memory. The design is a promising scheme for implementing hybrid binary and ternary neural network accelerators.
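As a software analogue of the ternary MAC above, the sketch below splits ternary values into magnitude and sign planes, so each "multiply" reduces to bitwise logic while the accumulation happens separately, mirroring the in-memory multiply / near-memory accumulate split. The encoding is a common TNN trick, not necessarily the circuit's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.integers(-1, 2, size=16)    # ternary weights in {-1, 0, +1}
x = rng.integers(-1, 2, size=16)    # ternary inputs in {-1, 0, +1}

mag = np.abs(w) & np.abs(x)         # AND of magnitude bits: product != 0
sgn = np.sign(w) * np.sign(x)       # sign plane (an XOR in bitwise form)
print(int(np.sum(mag * sgn)), int(w @ x))   # both MACs agree
```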
... Their real-time data have highlighted the need to overcome the latency and energy costs induced by data transfer between the processing unit and memory in the von Neumann architecture. Therefore, researchers have shown interest in building an in-memory-computing (IMC)-based alternative paradigm [13,1,42,28,21], where the computation is done inside the memory, reducing latency and energy cost. The quintessential example of IMC is vector-matrix multiplication (VMM) with non-volatile memories (NVMs), which is applied to many high-level applications such as neuromorphic computing and to computationally hard problems [40,14,43,9]. ...
Preprint
Full-text available
This article reports an improvement in the performance of hafnium oxide-based (HfO2) ferroelectric field-effect transistors (FeFETs) achieved by a synergistic approach of interfacial layer (IL) engineering and READ-voltage optimisation. FeFET devices with silicon dioxide (SiO2) and silicon oxynitride (SiON) as the IL were fabricated and characterised. Although the FeFETs with SiO2 interfaces demonstrated better low-frequency characteristics than the FeFETs with SiON interfaces, the latter demonstrated better WRITE endurance and retention. Finally, a neuromorphic simulation was conducted to evaluate the performance of FeFETs with SiO2 and SiON ILs as synaptic devices. We observed that the WRITE endurance of both FeFETs was insufficient (<10^8 cycles) to carry out online neural network training. Therefore, we consider inference-only operation with offline neural network training. The system-level simulation reveals that the impact of systematic degradation via retention loss is much more significant for inference-only operation than low-frequency noise. The neural network with SiON-IL FeFETs in the synaptic core shows 96% accuracy for inference on handwritten digits from the Modified National Institute of Standards and Technology (MNIST) data set in the presence of flicker noise and retention degradation, which is only a 2.5% deviation from the software baseline.
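A toy version of that inference-only evaluation: weights are programmed once offline, then perturbed by a retention-drift factor and read noise before each matrix-vector product. The logarithmic drift model and noise magnitudes are invented for illustration and are not the paper's device characterisation.

```python
import numpy as np

rng = np.random.default_rng(4)
W = 0.05 * rng.normal(size=(10, 784))   # offline-trained weights

def degraded_inference(x: np.ndarray, t_hours: float) -> np.ndarray:
    drift = 1.0 - 0.01 * np.log1p(t_hours)      # assumed retention loss
    noise = rng.normal(0, 1e-3, size=W.shape)   # assumed read noise
    return (W * drift + noise) @ x

x = rng.normal(size=784)
print(degraded_inference(x, 0.0)[:3])     # fresh weights
print(degraded_inference(x, 1e4)[:3])     # after long retention
```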
... A slew of new processor architectures employing analog techniques is keenly sought to reduce power dissipation and improve computational speed. Crossbar-based processing-in-memory (PIM) architectures [2,10,18,19,26] and integrated optical neural networks (ONNs) [6,8,9,20,21,28] are two prominent examples in this direction. ...
Preprint
Analog computing has been recognized as a promising low-power alternative to digital counterparts for neural network acceleration. However, conventional analog computing is mainly done in a mixed-signal manner, and the cost of tedious analog/digital (A/D) conversion significantly limits the overall system's energy efficiency. In this work, we devise an efficient analog activation unit with magnetic tunnel junction (MTJ)-based analog content-addressable memory (MACAM), simultaneously realizing nonlinear activation and A/D conversion in a fused fashion. To compensate for the nascent and therefore currently limited representation capability of MACAM, we propose to mix our analog activation unit with digital activation dataflow. A fully differentiable framework, SuperMixer, is developed to search for an optimized activation workload assignment, adaptive to various activation energy constraints. The effectiveness of our proposed methods is evaluated on a silicon photonic accelerator. Compared to standard activation implementations, our mixed activation system with the searched assignment achieves competitive accuracy with >60% energy saving on A/D conversion and activation.
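To make the workload-assignment idea concrete, here is a deliberately simplified greedy variant: each layer's activation runs on the cheap-but-approximate analog unit while an error budget allows, and falls back to the exact digital path otherwise. SuperMixer itself is a differentiable search; the costs, errors, and greedy rule below are illustrative stand-ins.

```python
layers = ["conv1", "conv2", "fc1", "fc2"]
analog = {"energy": 1.0, "error": 0.004}   # assumed per-layer figures
digital = {"energy": 5.0, "error": 0.0}
error_budget = 0.01                        # assumed accuracy constraint

assignment, err = {}, 0.0
for name in layers:                        # greedy: analog while it fits
    if err + analog["error"] <= error_budget:
        assignment[name], err = "analog", err + analog["error"]
    else:
        assignment[name] = "digital"
print(assignment)  # conv1/conv2 analog, fc1/fc2 digital under this budget
```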
... Neural networks based on memristive devices [1][2][3] have shown potential in substantially improving throughput and energy efficiency for machine learning [4] and artificial intelligence [5], especially in edge applications. [6][7][8][9][10][11][12][13][14][15][16][17][18][19] Because training a neural network model from scratch is very costly, it is impractical to do it individually on billions of memristive neural networks distributed at the edge. A practical approach would be to download the synaptic weights obtained from the cloud training and program them directly into memristors for the commercialization of edge applications (Figure 1a). ...
Preprint
Full-text available
Neural networks based on memristive devices have shown potential in substantially improving throughput and energy efficiency for machine learning and artificial intelligence, especially in edge applications. Because training a neural network model from scratch is very costly, it is impractical to do it individually on billions of memristive neural networks distributed at the edge. A practical approach would be to download the synaptic weights obtained from cloud training and program them directly into memristors for the commercialization of edge applications. Some post-tuning of memristor conductance may follow afterward or during applications. Therefore, a critical requirement on memristors for neural network applications is a high-precision programming ability to guarantee uniform and accurate performance across a massive number of memristive networks. This translates into the requirement of many distinguishable conductance levels on each memristive device, not just for lab-made devices but, more importantly, for devices fabricated in foundries. High-precision memristors also benefit other neural network applications, such as training and scientific computing. Here we report over 2048 conductance levels, the largest number among all types of memories ever reported, achieved with memristors in fully integrated chips with 256×256 memristor arrays monolithically integrated on CMOS circuits in a standard foundry. We have unearthed the underlying physics that previously limited the number of achievable conductance levels in memristors and developed electrical operation protocols to circumvent such limitations. These results reveal insights into the fundamental understanding of the microscopic picture of memristive switching and provide approaches to enabling high-precision memristors for various applications.
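The role of the conductance-level count can be sketched with a simple programming model: trained weights are mapped linearly onto a conductance window and snapped to the nearest of N uniform levels, so the mean programming error shrinks as N grows. The window and the uniform-level assumption are illustrative, not the chip's calibration.

```python
import numpy as np

def program(w: np.ndarray, n_levels: int,
            g_min: float = 1e-6, g_max: float = 1e-4):
    """Map weights to the nearest of n_levels uniform conductances."""
    target = g_min + (w - w.min()) / (w.max() - w.min()) * (g_max - g_min)
    step = (g_max - g_min) / (n_levels - 1)
    quantized = g_min + np.round((target - g_min) / step) * step
    return target, quantized

rng = np.random.default_rng(5)
w = rng.normal(size=(256, 256))
for n in (16, 256, 2048):
    target, q = program(w, n)
    print(n, float(np.abs(q - target).mean()))  # error falls with levels
```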
... MAC-DO's computation accuracy has been tested in the same way as [61], by measuring the Top-1 accuracy drop when MAC-DO performs MAC operations for a convolution layer of the LeNet-5 [60] neural network on the MNIST dataset [59]. Other layers, such as the non-linear functions, are supported in software. ...
... For the test, the C3 convolution layer is executed by the MAC-DO test circuit using transistor-level simulation for 448 test-set images from the MNIST dataset, and the Top-1 accuracy is calculated from the collection of the final results. Other layers are executed with full precision in software, in a similar way to [61]. To dequantize the analog MAC results, we use four images as training data to find proper dequantization parameters. ...
Preprint
Deep neural networks (DNNs) have proven effective in various areas such as classification problems, image processing, video segmentation, and speech recognition. Accelerator-in-memory (AiM) architectures are a promising solution to efficiently accelerate DNNs, as they can avoid the memory bottleneck of the traditional von Neumann architecture. As the main memory is usually DRAM in many systems, a highly parallel multiply-accumulate (MAC) array within the DRAM can maximize the benefit of AiM by reducing both the distance and the amount of data movement between the processor and the main memory. This paper presents an analog MAC array-based AiM architecture named MAC-DO. In contrast with previous in-DRAM accelerators, MAC-DO makes an entire DRAM array participate in MAC computations simultaneously without idle cells, leading to higher throughput and energy efficiency. This improvement is made possible by exploiting a new analog computation method based on charge steering. In addition, MAC-DO innately supports multi-bit MACs with good linearity. MAC-DO remains compatible with current 1T1C DRAM technology without any modification of the DRAM cell or array. A MAC-DO array can accelerate matrix multiplications based on output-stationary mapping and thus supports most of the computations performed in DNNs. Our evaluation using transistor-level simulation shows that a test MAC-DO array with 16×16 MAC-DO cells achieves 188.7 TOPS/W and 97.07% Top-1 accuracy on the MNIST dataset without retraining.
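Output-stationary mapping, mentioned above, is easy to state in code: each (i, j) accumulator stays tied to one output element, and on every step all cells update at once with one rank-1 contribution, so no cell idles. The loop mirrors the temporal order of a hardware array rather than an efficient software kernel.

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)    # input matrix (2x3)
B = np.arange(12, dtype=float).reshape(3, 4)   # weight matrix (3x4)
acc = np.zeros((2, 4))                         # one accumulator per cell

for k in range(3):                             # one step per shared index
    acc += np.outer(A[:, k], B[k, :])          # every cell updates at once
print(np.allclose(acc, A @ B))                 # True
```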