Article

T-PIM: An Energy-Efficient Processing-in-Memory Accelerator for End-to-End On-Device Training


Abstract

Recently, on-device training has become crucial for the success of edge intelligence. However, frequent data movement between computing units and memory during training has been a major problem for battery-powered edge devices. Processing-in-memory (PIM) is a novel computing paradigm that merges computing logic into memory and can address the data movement problem with excellent power efficiency. However, previous PIM accelerators cannot support the entire training process on chip because of its computational complexity. This article presents a PIM accelerator for end-to-end on-device training (T-PIM), the first PIM realization that enables end-to-end on-device training as well as high-speed inference. Its full-custom PIM macro contains 8T-SRAM cells that perform energy-efficient in-cell AND operations, and its bit-serial computation logic enables fully variable bit precision for input data. The macro supports various data mapping methods and computational paths for both fully connected and convolutional layers in order to handle the complex training process. An efficient tiling scheme is also proposed so that T-PIM can compute deep neural networks of any size with the implemented hardware. In addition, configurable arithmetic units in the forward-propagation path allow T-PIM to handle power-of-two bit precision for weight data, enabling a significant performance boost during inference. T-PIM also handles sparsity in both operands efficiently by skipping the computation of zeros in the input data and by gating off computing units when the weight data are zero. Finally, we fabricate the T-PIM chip in 28-nm CMOS technology, occupying a die area of 5.04 mm², including five T-PIM cores. It dissipates 5.25–51.23 mW at a 50–280-MHz operating frequency with a 0.75–1.05-V supply voltage. We successfully demonstrate that T-PIM can run end-to-end training of the VGG16 model on the CIFAR10 and CIFAR100 datasets, achieving 0.13–161.08-TOPS/W and 0.25–7.59-TOPS/W power efficiency during inference and training, respectively. The results show that T-PIM is 2.02× more energy-efficient than the state-of-the-art PIM chip that supports only backward propagation rather than the whole training process. Furthermore, we conduct an architectural experiment using a cycle-level simulator based on actual measurement results, which suggests that the T-PIM architecture is scalable and that a scaled-up version provides up to 203.26× higher power efficiency than a comparable GPU.
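The bit-serial, AND-based in-memory computation and the input zero-skipping described in the abstract can be pictured with a short behavioral model. The following Python sketch is only an illustration under simplifying assumptions (unsigned integers, a software sum standing in for the PIM macro's column adders); the function name and structure are not taken from the paper.

def bit_serial_dot(inputs, weights, input_bits=8):
    """Dot product of unsigned integer inputs and weights, computed bit-serially."""
    acc = 0
    for b in range(input_bits):                      # one pass per input bit-plane
        bit_plane = [(x >> b) & 1 for x in inputs]
        if not any(bit_plane):                       # input zero-skipping: nothing to add
            continue
        # In-cell AND: each column contributes weight AND input-bit; the column
        # results are then summed, as the digital adder logic in the macro would.
        partial = sum(w for w, x_bit in zip(weights, bit_plane) if x_bit)
        acc += partial << b                          # weight the partial sum by 2**b
    return acc

# The bit-serial result matches the ordinary integer dot product.
xs, ws = [3, 0, 5, 1], [2, 7, 4, 6]
assert bit_serial_dot(xs, ws) == sum(x * w for x, w in zip(xs, ws))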


... Another field in which they show an advantage is sparse lightweight networks [15], [16], a main category of neural networks. In contrast to digital SRAM-based CIMs, which trade more area and computing time for higher precision [17], [18], [19], [20], analog-mixed-signal SRAM-based CIMs perform small-kernel convolution computations in a single cycle, significantly reducing computation time at the cost of a minor recognition-rate loss [21], [22], [23]. The lightweight network's demand for highly parallel, energy-efficient computing with small convolution kernels, as well as its high tolerance for reduced computational accuracy, perfectly matches the characteristics of analog in-memory computing. ...
... This activation input method harnesses the benefits of both pulse-width modulation and multiple-signal control to enhance accuracy. This makes our method stand out from existing single-measure methodologies such as wordline pulse count or pulse-width modulation [4], [6], [18], analog input voltage on the wordline [26], and multiple input wordlines or signals [1], [27]. Our approach to result accumulation also corresponds with our objective. ...
... Compared to [2], the statistics show the advantage of structural improvement over an advanced technology process. Compared to [17], [18], [19], [20], the advantage of analog-mixed-signal SRAM-based CIM designs over digital CIM designs when performing signed multi-bit computing operations is revealed, even with a smaller input sparsity. ...
Article
Full-text available
In this paper, we present an analog-mixed-signal 6T SRAM computing-in-memory (CIM) macro. The macro uses dual-wordline 6T bitcells to reduce power consumption and write-disturb issues. The macro also introduces an analog computation logic circuit for high-precision, energy-efficient charge-domain computation. The bitcell structure combined with the analog computation logic circuit allows direct input of signed activations and weights to the chip for full signed computation. The proposed macro consists of four CIM blocks, each with four 32x8 compute blocks, a pulse generator, an analog computation logic circuit, and a SAR-ADC. Fabricated in a 55 nm process, our CIM macro test chip achieves an energy efficiency of 7.3 TOPS/W. A comprehensive computing test that encompasses the entire range of inputs and weights has been conducted. The results show that the CIM macro test chip can achieve a precision of 79.51% in a 1-FE error range of 71.88%. The target application of the proposed CIM macro is lightweight neural networks; this is demonstrated by mapping a pre-trained network into the macro and achieving a recognition accuracy of 92.28% on the CIFAR-10 dataset. The design surpasses existing designs in a comprehensive consideration of energy efficiency, technology, and bit width.
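To make the charge-domain computation path concrete, the behavioral Python model below treats each bitcell's contribution as an ideal charge proportional to activation times weight, sums the contributions as charge sharing on the bitline would, and quantizes the result with an n-bit SAR-ADC. The parameters (full-scale range, ADC resolution) and the absence of noise or mismatch are assumptions for illustration, not the macro's measured behavior.

def charge_domain_mac(acts, weights, adc_bits=8, full_scale=64.0):
    """Idealized charge-domain MAC followed by uniform SAR-ADC quantization."""
    analog = sum(a * w for a, w in zip(acts, weights))   # charge-domain summation
    clamped = max(-full_scale, min(full_scale, analog))  # ADC input range
    lsb = 2.0 * full_scale / (2 ** adc_bits)
    return round(clamped / lsb) * lsb                    # quantized digital output

# Signed activations and weights, mirroring the macro's full signed computation.
print(charge_domain_mac([1, -2, 3, -1], [5, 4, -2, 7]))  # -16.0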
... Another alternative is in-memory bit-serial computation used, for example, by existing SRAM-based PIM technologies [10], [11]. As a consequence, this requires the data to be reorganized in a bit-serial manner and incurs significant overhead in both performance and complexity. ...
... Finally, SRAM-based PIM technologies such as [10], [11] all perform bit-serial computations. They hence suffer from data reformulation and lack the support of SOTA hardware, as these are often bit-parallel. ...
... However, as the number of banks increases, the area becomes dominated by other digital circuits, and a larger cost is associated with each bank. Table II compares DAISM to Z-PIM [10] and T-PIM [11], which both use digital in-SRAM computation logic. Despite using 45nm technology, DAISM achieves comparable energy efficiency. ...
Preprint
Full-text available
DNNs are one of the most widely used Deep Learning models. The matrix multiplication operations for DNNs incur significant computational costs and are bottlenecked by data movement between the memory and the processing elements. Many specialized accelerators have been proposed to optimize matrix multiplication operations. One popular idea is to use Processing-in-Memory where computations are performed by the memory storage element, thereby reducing the overhead of data movement between processor and memory. However, most PIM solutions rely either on novel memory technologies that have yet to mature or bit-serial computations which have significant performance overhead and scalability issues. In this work, an in-SRAM digital multiplier is proposed to take the best of both worlds, i.e. performing GEMM in memory but using only conventional SRAMs without the drawbacks of bit-serial computations. This allows the user to design systems with significant performance gains using existing technologies with little to no modifications. We first design a novel approximate bit-parallel multiplier that approximates multiplications with bitwise OR operations by leveraging multiple wordlines activation in the SRAM. We then propose DAISM - Digital Approximate In-SRAM Multiplier architecture, an accelerator for convolutional neural networks, based on our novel multiplier. This is followed by a comprehensive analysis of trade-offs in area, accuracy, and performance. We show that under similar design constraints, DAISM reduces energy consumption by 25% and the number of cycles by 43% compared to state-of-the-art baselines.
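One way to picture the OR-based approximation is the sketch below: instead of adding the shifted partial products with carry propagation, the partial products are simply OR-ed together, which multiple-wordline activation in an SRAM can produce directly. This Python model is my reading of the idea for illustration only, not DAISM's exact circuit, and it is restricted to unsigned operands.

def or_approx_mult(a, b, bits=8):
    """Approximate a*b by OR-ing the shifted partial products (no carries)."""
    approx = 0
    for i in range(bits):
        if (b >> i) & 1:
            approx |= a << i          # OR instead of add
    return approx

# Exact when the partial products do not overlap (e.g. b a power of two),
# and an under-estimate otherwise, since x | y <= x + y for non-negative x, y.
for a, b in [(13, 8), (13, 11), (7, 7)]:
    print(a, b, a * b, or_approx_mult(a, b))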
... In [36], an accelerator is claimed to be the first digital IMC IC that can support end-to-end on-device training, as well as inference, for edge devices. The IMC macro block includes 8T-SRAM cells that perform energy-efficient in-cell AND operations. ...
... This is in contrast to most of the solutions included in Table VII and Table VIII. The digital IMC accelerator ASIC in [36] is a very interesting exception, and operates with low power consumption also during training. The digital FPGA/ASIC solution in [35] also supports training. ...
Article
The Tsetlin Machine (TM) is a novel machine learning algorithm that uses Tsetlin automata (TAs) to define propositional logic expressions (clauses) for classification. This paper describes a field-programmable gate array (FPGA) accelerator for image classification based on the Convolutional Coalesced Tsetlin Machine. The accelerator classifies booleanized images of 28×28 pixels into 10 classes, and is configured with 128 clauses in a highly parallel architecture. To achieve fast clause evaluation and class prediction, the TA action signals and the clause weights per class are available from registers. Full on-device training is included, and the TAs are implemented with 34 Block RAM (BRAM) instances which operate in parallel. Each BRAM is addressed by the clause number and has a 72-bit word width that supports 8 TAs. The design is implemented in a Xilinx Zynq UltraScale+ XCZU7 FPGA. Running at 50 MHz, the accelerator core achieves 134k image classifications per second, with an energy consumption per classification of 13.3 μJ. A single training epoch of 60k samples requires a processing time of 1.5 seconds. The accelerator obtains a test accuracy of 97.6% on MNIST, 84.1% on Fashion-MNIST and 82.8% on Kuzushiji-MNIST.
... Recently, a few training-oriented SoCs that fit the power budget of extreme-edge applications have been presented. T-PIM [37] is a processing-in-memory accelerator in 28 nm technology for on-device learning. It reaches up to 250 GOPS/W during training with 0% sparsity, within a power envelope of 51.23 mW at a 280 MHz operating frequency. ...
... Cambricon-Q is also 17.7× more power-hungry than our design, and therefore not suitable for TinyML applications. Similar considerations hold for T-PIM [37], a training chip designed in 28 nm technology that features an in-memory computing core for high energy efficiency but only works with 16-bit integer precision, not satisfying the precision requirements to enable on-chip training. TSUNAMI [39] and Trainer [40] are conceived for energy-efficient embedded training and extensively use pruning and sparse matrix generation to increase energy efficiency and reduce the number of required MAC operations during training with zero-skipping. ...
Preprint
The increasing interest in TinyML, i.e., near-sensor machine learning on power budgets of a few tens of mW, is currently pushing toward enabling TinyML-class training as opposed to inference only. Current training algorithms, based on various forms of error and gradient backpropagation, rely on floating-point matrix operations to meet the precision and dynamic range requirements. So far, the energy and power cost of these operations has been considered too high for TinyML scenarios. This paper addresses the open challenge of near-sensor training on a few mW power budget and presents RedMulE (Reduced-Precision Matrix Multiplication Engine), a low-power specialized accelerator conceived for multi-precision floating-point General Matrix-Matrix Operations (GEMM-Ops) acceleration, supporting FP16, as well as hybrid FP8 formats, with {sign, exponent, mantissa}=({1,4,3}, {1,5,2}). We integrate RedMulE into a Parallel Ultra-Low-Power (PULP) cluster containing eight energy-efficient RISC-V cores sharing a tightly-coupled data memory and implement the resulting system in a 22 nm technology. At its best efficiency point (@ 470 MHz, 0.65 V), the RedMulE-augmented PULP cluster achieves 755 GFLOPS/W and 920 GFLOPS/W during regular General Matrix-Matrix Multiplication (GEMM), and up to 1.19 TFLOPS/W and 1.67 TFLOPS/W when executing GEMM-Ops, respectively, for FP16 and FP8 input/output tensors. At its best performance point (@ 613 MHz, 0.8 V), RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for FP16 and FP8, respectively, with 99.4% utilization of the array of Computing Elements and consuming less than 60 mW on average, thus enabling on-device training of deep learning models in TinyML application scenarios while retaining the flexibility to tackle other classes of common linear algebra problems efficiently.
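The two hybrid FP8 formats quoted above, {sign, exponent, mantissa} = {1,4,3} and {1,5,2}, can be decoded with a few lines of Python. The sketch below illustrates the number formats only, not RedMulE's datapath; treating the all-ones exponent as an ordinary value (rather than infinity/NaN) is a simplifying assumption.

def fp8_decode(byte, exp_bits, man_bits):
    """Decode one FP8 value with the given exponent/mantissa widths."""
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if (byte >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:                                    # subnormal: no implicit leading one
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# The same byte means different things in the two formats.
print(fp8_decode(0x3C, exp_bits=4, man_bits=3))     # {1,4,3}: +1.5
print(fp8_decode(0x3C, exp_bits=5, man_bits=2))     # {1,5,2}: +1.0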
... Since edge devices have more stringent energy constraints compared to centralized data processing methods, energy-efficient data processing is a major challenge [4,5]. In response, various technological efforts have been made to enhance the energy efficiency of edge computing, one of which is the integration of processing-in-memory (PIM) architecture with edge devices [6,7]. PIM performs data processing inside or near the memory array, reducing latency due to data movement and addressing the key design goal of high energy efficiency in edge devices for memory-centric tasks such as AI applications [8]. ...
Article
Full-text available
The rapid advancement of artificial intelligence (AI) technology, combined with the widespread proliferation of Internet of Things (IoT) devices, has significantly expanded the scope of AI applications, from data centers to edge devices. Running AI applications on edge devices requires a careful balance between data processing performance and energy efficiency. This challenge becomes even more critical when the computational load of applications dynamically changes over time, making it difficult to maintain optimal performance and energy efficiency simultaneously. To address these challenges, we propose a novel processing-in-memory (PIM) technology that dynamically optimizes performance and power consumption in response to real-time workload variations in AI applications. Our proposed solution consists of a new PIM architecture and an operational algorithm designed to maximize its effectiveness. The PIM architecture follows a well-established structure known for effectively handling data-centric tasks in AI applications. However, unlike conventional designs, it features a heterogeneous configuration of high-performance PIM (HP-PIM) modules and low-power PIM (LP-PIM) modules. This enables the system to dynamically adjust data processing based on varying computational load, optimizing energy efficiency according to the application’s workload demands. In addition, we present a data placement optimization algorithm to fully leverage the potential of the heterogeneous PIM architecture. This algorithm predicts changes in application workloads and optimally allocates data to the HP-PIM and LP-PIM modules, improving energy efficiency. To validate and evaluate the proposed technology, we implemented the PIM architecture and developed an embedded processor that integrates this architecture. We performed FPGA prototyping of the processor, and functional verification was successfully completed. Experimental results from running applications with varying workload demands on the prototype PIM processor demonstrate that the proposed technology achieves up to 29.54% energy savings.
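The data placement idea, deciding per data block whether it should live in the high-performance (HP-PIM) or low-power (LP-PIM) modules based on the predicted workload, can be sketched as a simple greedy policy. The tile abstraction, thresholds, and capacities below are hypothetical illustrations; the article's actual prediction and allocation algorithm is not reproduced here.

def place_tiles(predicted_ops, hp_capacity, lp_capacity, hp_threshold):
    """Greedy placement: predicted-heavy tiles go to HP-PIM, the rest to LP-PIM."""
    hp, lp = [], []
    for tile_id, ops in sorted(predicted_ops.items(), key=lambda kv: -kv[1]):
        if ops >= hp_threshold and len(hp) < hp_capacity:
            hp.append(tile_id)
        elif len(lp) < lp_capacity:
            lp.append(tile_id)
        else:
            hp.append(tile_id)        # fall back to HP-PIM once LP-PIM is full
    return hp, lp

# Four tiles with predicted operation counts (arbitrary units).
hp, lp = place_tiles({"t0": 900, "t1": 120, "t2": 640, "t3": 80},
                     hp_capacity=2, lp_capacity=2, hp_threshold=500)
print("HP-PIM:", hp, "LP-PIM:", lp)   # HP-PIM: ['t0', 't2']  LP-PIM: ['t1', 't3']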
... CIM accelerators can be broadly classified into analog and digital categories. The analog CIMs accumulate their operands in either charge domain or current domain, while digital CIMs use digital arithmetic units (AUs) to perform logic operations [8][9][10]. Therefore, analog CIMs require the analog-to-digital converter (ADC) and digital-to-analog converter (DAC) for data domain conversion. ...
Article
Full-text available
Recently, frequent data movement between computing units and memory during floating-point arithmetic has become a major problem for scientific computing. Computing-in-memory (CIM) is a novel computing paradigm that merges computing logic into memory, which can address the data movement problem with excellent power efficiency. However, the previous CIM paradigm failed to support the double-precision floating-point format (FP64) due to its computing complexity. This paper presents a novel all-digital CIM macro, DCIM-FF, that completes the FP64-based fused multiply-add (FMA) operation for the first time. With 16 sub-CIM cells integrating digital multipliers to complete the mantissa multiplication, DCIM-FF is able to provide correctly rounded implementations for normalized/denormalized inputs in round-to-nearest-even mode and round-to-zero mode, respectively. To evaluate our design, we synthesized and tested the DCIM-FF macro in 55-nm CMOS technology. With a minimum power consumption of 0.12 mW and a maximum computing efficiency of 26.9 TOPS/W, we successfully demonstrated that DCIM-FF can run the FP64-based FMA operation without error. Compared to related works, the proposed DCIM-FF macro shows a significant power efficiency improvement and less area overhead based on CIM technology. This work paves a novel pathway for high-performance implementation of FP64-based matrix-vector multiplication (MVM) operations, which are essential for hyperscale scientific computing.
Preprint
Processing-in-Memory (PIM) architectures offer promising solutions for efficiently handling AI applications in energy-constrained edge environments. While traditional PIM designs enhance performance and energy efficiency by reducing data movement between memory and processing units, they are limited in edge devices due to continuous power demands and the storage requirements of large neural network weights in SRAM and DRAM. Hybrid PIM architectures, incorporating non-volatile memories like MRAM and ReRAM, mitigate these limitations but struggle with a mismatch between fixed computing resources and dynamically changing inference workloads. To address these challenges, this study introduces a Heterogeneous-Hybrid PIM (HH-PIM) architecture, comprising high-performance MRAM-SRAM PIM modules and low-power MRAM-SRAM PIM modules. We further propose a data placement optimization algorithm that dynamically allocates data based on computational demand, maximizing energy efficiency. FPGA prototyping and power simulations with processors featuring HH-PIM and other PIM types demonstrate that the proposed HH-PIM achieves up to 60.43 percent average energy savings over conventional PIMs while meeting application latency requirements. These results confirm the suitability of HH-PIM for adaptive, energy-efficient AI processing in edge devices.
Conference Paper
DNNs are widely used but face significant computational costs due to matrix multiplications, especially from data movement between the memory and processing units. One promising approach is therefore Processing-in-Memory as it greatly reduces this overhead. However, most PIM solutions rely either on novel memory technologies that have yet to mature or bit-serial computations that have significant performance overhead and scalability issues. Our work proposes an in-SRAM digital multiplier, that uses a conventional memory to perform bit-parallel computations, leveraging multiple wordlines activation. We then introduce DAISM, an architecture leveraging this multiplier, which achieves up to two orders of magnitude higher area efficiency compared to the SOTA counterparts, with competitive energy efficiency.
Article
Full-text available
Extensive research on deep learning and big data has led to efficient methods for processing large volumes of data, as well as research on conserving computing resources. Particularly in domains like the IoT (Internet of Things), where computing power is constrained, efficiently processing large volumes of data to conserve resources is crucial. The processing-in-memory (PIM) architecture was introduced as a method for efficient large-scale data processing. However, PIM focuses on changes within the memory itself rather than addressing the needs of low-cost solutions such as the IoT. This paper proposes a new approach using the PIM architecture to overcome memory bottlenecks effectively in domains with computing performance constraints. We adopt the RISC-V instruction set architecture for our proposed PIM system's design, implementation, and comprehensive performance evaluation. Our proposal aims to make efficient use of low-spec systems such as IoT devices by minimizing core modifications and introducing PIM instructions at the ISA level, enabling solutions that leverage PIM capabilities. We evaluate the performance of our proposed architecture by comparing it with existing structures using convolution operations, the fundamental unit of deep-learning and big data computations. The experimental results show our proposed structure achieves a 34.4% improvement in processing speed and an 18% reduction in power consumption compared to conventional von Neumann-based architectures. This substantiates its effectiveness at the application level, extending to fields such as deep learning and big data.
Article
Resistive random-access memory (ReRAM) based processing-in-memory (PIM) architectures are used extensively to accelerate inferencing/training with convolutional neural networks (CNNs). Three-dimensional (3D) integration is an enabling technology to integrate many PIM cores on a single chip. In this work, we propose the design of a thermally efficient dataflow-aware monolithic 3D (M3D) NoC architecture referred to as TEFLON to accelerate CNN inferencing without creating any thermal bottlenecks. TEFLON reduces the Energy-Delay-Product (EDP) by 42%, 46%, and 45% on average compared to a conventional 3D mesh NoC for systems with 36, 64, and 100 PIM cores, respectively. TEFLON reduces the peak chip temperature by 25 K and improves the inference accuracy by up to 11% compared to the solely performance-optimized SFC-based counterpart for inferencing with diverse deep CNN models using the CIFAR-10/100 datasets on a 3D system with 100 PIM cores.
Article
Over the past few years, on-device learning (ODL) has become an integral aspect of the success of edge devices that embrace machine learning (ML) since it plays a crucial role in restoring ML model accuracy when the edge environment changes. However, implementing ODL on battery-limited edge devices poses significant challenges due to the generation of large-size intermediate data during ML training and the frequent data movement between the processor and memory, resulting in substantial power consumption. To address this limitation, certain ML accelerators in edge devices have adopted a processing-in-memory (PIM) paradigm, integrating computing logic into memory. Nevertheless, these accelerators still face hurdles such as long latency caused by the lack of a pipelined approach in the training process, notable power and area overheads related to floating-point arithmetic, and incomplete handling of data sparsity during training. This article presents a high-throughput super-pipelined PIM accelerator, named SP-PIM, designed to overcome the limitations of existing PIM-based ODL accelerators. To this end, SP-PIM implements a holistic multi-level pipelining scheme based on local error prediction (EP), enhancing training speed by 7.31×. In addition, SP-PIM introduces a local EP unit (LEPU), a lightweight circuit that performs accurate EP leveraging power-of-two (PoT) random weights. This strategy significantly reduces power-hungry external memory access (EMA) by 59.09%. Moreover, SP-PIM fully exploits sparsities in both activation and error data during training, facilitated by a highly optimized PIM macro design. Finally, the SP-PIM chip, fabricated using 28-nm CMOS technology, achieves a training speed of 8.81 epochs/s. It occupies a die area of 5.76 mm² and consumes between 6.91 and 433.25 mW at operating frequencies of 20–450 MHz with a supply voltage of 0.56–1.05 V. We demonstrate that it can successfully execute end-to-end ODL for the CIFAR10 and CIFAR100 datasets. Consequently, it achieves state-of-the-art area efficiency (560.6 GFLOPS/mm²) and competitive power efficiency (22.4 TFLOPS/W), marking a 3.95× higher figure-of-merit (area efficiency × power efficiency × capacity) than previous work. Furthermore, we implemented a cycle-level simulator using Python to investigate and validate the scalability of SP-PIM. By doing architectural experiments in various hardware configurations, we successfully verified that the core computing unit within SP-PIM possesses both scale-up and scale-out capabilities.
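The power-of-two (PoT) random weights behind the local error prediction are what keep the LEPU lightweight: every multiply collapses to a sign flip plus a binary shift. The Python sketch below illustrates that property under assumed exponent ranges; it is not SP-PIM's hardware or its actual error-prediction algorithm.

import math
import random

def random_pot_weights(n, min_exp=-4, max_exp=0, seed=0):
    """Fixed random weights stored as (sign, exponent) pairs, i.e. +/- 2**exponent."""
    rng = random.Random(seed)
    return [(rng.choice([-1, 1]), rng.randint(min_exp, max_exp)) for _ in range(n)]

def pot_dot(acts, pot_weights):
    """Dot product in which each multiply is only a sign flip and a shift (ldexp)."""
    return sum(sign * math.ldexp(a, exp) for a, (sign, exp) in zip(acts, pot_weights))

w = random_pot_weights(4)
print(w, pot_dot([0.5, -1.0, 2.0, 0.25], w))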
Article
Training Deep Neural Networks (DNNs) requires a large number of operations, among which matrix-vector multiplies (MVMs), often of high dimensionality, dominate. In-Memory Computing (IMC) is a promising approach to enhance MVM compute efficiency and throughput, but introduces fundamental tradeoffs with dynamic range of the computed outputs. While IMC has been successful in DNN inference systems, it has not yet shown feasibility for training, which is more sensitive to dynamic range. This work leverages recent work on alternative radix-4 number formats in DNN training on digital architectures, together with recent work on high-precision analog IMC with multi-level inputs, to enable IMC training. Furthermore, we implement a mapping of radix-4 operands to multi-level analog-input IMC in a manner that improves robustness to analog noise effects. The proposed approach is shown in simulations calibrated to silicon-measured IMC noise to be capable of training DNNs on the CIFAR-10 dataset to within 10% of the testing accuracy of standard DNN training approaches, while analysis shows that further reduction of IMC noise to feasible levels results in accuracy within 2% of standard DNN training approaches.
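The radix-4 operand mapping can be illustrated by decomposing each integer activation into base-4 digits, feeding each digit as one multi-level input pass, and recombining the per-digit results with powers of 4. The sketch below is an unsigned, noise-free illustration of that arithmetic identity; the paper's actual radix-4 training format and its noise-robust mapping are not reproduced.

def to_radix4(x, n_digits):
    """Little-endian base-4 digits (each 0..3) of a non-negative integer."""
    return [(x >> (2 * i)) & 0b11 for i in range(n_digits)]

def radix4_dot(inputs, weights, n_digits=4):
    """Dot product computed digit-by-digit, as a multi-level-input IMC array would."""
    acc = 0
    for d in range(n_digits):
        digits = [to_radix4(x, n_digits)[d] for x in inputs]
        partial = sum(w * dig for w, dig in zip(weights, digits))  # one MVM pass
        acc += partial * (4 ** d)
    return acc

xs, ws = [9, 14, 3], [2, -1, 5]
assert radix4_dot(xs, ws) == sum(x * w for x, w in zip(xs, ws))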
Article
Deep neural networks (DNNs) have recently gained significant prominence in various real-world applications such as image recognition, natural language processing, and autonomous vehicles. However, due to their black-box nature, the underlying mechanisms of DNNs behind the inference results remain opaque to users. In order to address this challenge, researchers have focused on developing explainable artificial intelligence (AI) algorithms. Explainable AI aims to provide a clear and human-understandable explanation of the model’s decision, thereby building more reliable systems. However, the explanation task differs from well-known inference and training processes as it involves interactions with the user. Consequently, existing inference and training accelerators face inefficiencies when processing explainable AI on edge devices. This article introduces the explainable processing unit (EPU), the first hardware accelerator designed for explainable AI workloads. The EPU utilizes a novel data compression format for the output heat maps and intermediate gradients to enhance the overall system performance by reducing both memory footprint and external memory access. Its sparsity-free computing core efficiently handles the input sparsity with negligible control overhead, resulting in a throughput boost of up to 9.48×. It also proposes dynamic workload scheduling with a customized on-chip network for distinct inference and explanation tasks to maximize internal data reuse, hence reducing external memory access by 63.7%. Furthermore, the EPU incorporates point-wise gradient pruning (PGP) that can significantly reduce the size of heat maps by a factor of 7.01× combined with the proposed compression format. Finally, the EPU chip fabricated in a 28 nm CMOS process achieves a remarkable heat map generation rate of 367 frames/s for ResNet-34 while maintaining the state-of-the-art area and energy efficiency of 112.3 GOPS/mm² and 26.55 TOPS/W, respectively.
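The point-wise gradient pruning idea, keeping only the largest-magnitude gradient points of a heat map and storing them sparsely, can be sketched in a few lines. The threshold rule and the (index, value) encoding below are illustrative assumptions, not the EPU's compression format.

def prune_and_compress(gradients, keep_ratio=0.25):
    """Keep roughly the largest-magnitude fraction of points as (index, value) pairs."""
    k = max(1, int(len(gradients) * keep_ratio))
    threshold = sorted((abs(g) for g in gradients), reverse=True)[k - 1]
    return [(i, g) for i, g in enumerate(gradients) if abs(g) >= threshold]

grads = [0.01, -0.9, 0.03, 0.7, -0.02, 0.05, -0.6, 0.0]
sparse = prune_and_compress(grads)
print(sparse)                               # [(1, -0.9), (3, 0.7)]
print(len(grads) / (2 * len(sparse)))       # rough footprint reduction vs. dense storage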
Article
State-of-the-art in-memory computing (IMC) based 2.5D systems are tailored only for deep neural network (DNN) inference. Therefore, they do not address the issue of high storage requirements and high on-package as well as on-chip communication volume while performing DNN training. To this end, we propose an architecture for IMC-based 2.5D systems with multiple networks-on-package (NoPs) to store and communicate the data required for DNN training in an energy-efficient manner. Specifically, we introduce banks of scratch pad memories (SPM) which are local to different parts of the 2.5D system. The SPM banks are connected to chiplets with IMC elements (IMC-chiplets) through a different NoP than the NoP connecting the IMC-chiplets with each other, thus forming a multi-NoP architecture. Experimental evaluations with a wide range of DNNs show up to 84% improvement in the energy-delay-area product (EDAP) with respect to state-of-the-art IMC-based 2.5D systems.
Article
Training deep/convolutional neural networks (DNNs/CNNs) requires a large amount of memory and iterative computation, which necessitates speedup and energy reduction, especially for edge devices with resource/energy constraints. In this work, we present an 8-bit floating-point (FP8) training processor which implements: 1) highly parallel tensor cores (fused multiply–add trees) that maintain high utilization throughout forward propagation (FP), backward propagation (BP), and weight update (WU) phases of the training process; 2) hardware-efficient channel gating for dynamic output activation sparsity; 3) dynamic weight sparsity (WS) based on group Lasso; and 4) gradient skipping based on the FP prediction error. We develop a custom instruction set architecture (ISA) to flexibly support different CNN topologies and training parameters. The 28-nm prototype chip demonstrates large improvements in floating-point operations (FLOPs) reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×), for both supervised and self-supervised training tasks.
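The group-Lasso-based dynamic weight sparsity mentioned above relies on a standard penalty: the L2 norm of each weight group is added to the loss, which drives whole groups toward zero so they can be skipped. The Python sketch below shows that penalty and a skip test; the grouping (e.g. per output channel) and the threshold are assumptions for illustration, not the chip's exact mechanism.

import math

def group_lasso_penalty(weight_groups, lam):
    """lam * sum over groups of the L2 norm of each group."""
    return lam * sum(math.sqrt(sum(w * w for w in g)) for g in weight_groups)

def active_groups(weight_groups, eps=1e-3):
    """Indices of groups that are still non-zero and therefore must be computed."""
    return [i for i, g in enumerate(weight_groups)
            if math.sqrt(sum(w * w for w in g)) > eps]

groups = [[0.0, 0.0, 0.0], [0.3, -0.2, 0.1], [1e-4, -1e-4, 0.0]]
print(group_lasso_penalty(groups, lam=0.01), active_groups(groups))   # only group 1 is active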
Conference Paper
Full-text available
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X–30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X–80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
Article
Full-text available
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Conference Paper
Full-text available
CMOS technology scaling continues to drive higher levels of integration in VLSI design, which adds more compute engines on a die. To meet the overall performance-scaling needs, high-speed and high-bandwidth memory is becoming increasingly important. Conventional VLSI systems often rely on on-die SRAMs to address the performance gap between CPU and main memory, DRAM. However, with the rapid growth in capacity needs for high-performance memory, SRAM is not always sufficient to meet the demands of bandwidth-intense applications. Embedded DRAM (eDRAM) has been explored as an alternative to satisfy the high-performance and density needs in memory [1-3]. In this paper, a high-performance eDRAM based on a 22nm tri-gate CMOS technology is introduced. This eDRAM technology enables the integration of an eDRAM cell into the logic technology platform [4]. The design features a well-balanced configuration to achieve both optimal array efficiency and bandwidth. By leveraging the high-performance and low-voltage tri-gate transistor at 22nm generation, the eDRAM achieves a wide range in operating voltage, from 1.1V down to 0.7V, which is essential for low-power logic applications.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
In this paper, we propose a novel neural network model called RNN Encoder--Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder--Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
Conference Paper
Full-text available
In order to achieve autonomous operation of a vehicle in urban situations with unpredictable traffic, several realtime systems must interoperate, including environment perception, localization, planning, and control. In addition, a robust vehicle platform with appropriate sensors, computational hardware, networking, and software infrastructure is essential. We previously published an overview of Junior, Stanford's entry in the 2007 DARPA Urban Challenge. This race was a closed-course competition which, while historic and inciting much progress in the field, was not fully representative of the situations that exist in the real world. In this paper, we present a summary of our recent research towards the goal of enabling safe and robust autonomous operation in more realistic situations. First, a trio of unsupervised algorithms automatically calibrates our 64-beam rotating LIDAR with accuracy superior to tedious hand measurements. We then generate high-resolution maps of the environment which are subsequently used for online localization with centimeter accuracy. Improved perception and recognition algorithms now enable Junior to track and classify obstacles as cyclists, pedestrians, and vehicles; traffic lights are detected as well. A new planning system uses this incoming data to generate thousands of candidate trajectories per second, choosing the optimal path dynamically. The improved controller continuously selects throttle, brake, and steering actuations that maximize comfort and minimize trajectory error. All of these algorithms work in sun or rain and during the day or night. With these systems operating together, Junior has successfully logged hundreds of miles of autonomous operation in a variety of real-life conditions.
Article
As machine learning (ML) and artificial intelligence (AI) have become mainstream technologies, many accelerators have been proposed to cope with their computation kernels. However, they access external memory frequently due to the large size of deep neural network models, suffering from the von Neumann bottleneck. Moreover, as privacy issues become more critical, on-device training is emerging as a solution. However, on-device training is challenging because the training must be performed under a limited power budget while requiring far more computations and memory accesses than inference. In this paper, we present an energy-efficient processing-in-memory (PIM) architecture supporting end-to-end on-device training named T-PIM. Its macro design includes an 8T-SRAM cell-based PIM block to compute the in-memory AND operation and three computational datapaths for end-to-end training. Each of the three computational paths integrates arithmetic units for forward propagation, backward propagation, and gradient calculation and weight update, respectively, allowing the weight data stored in the memory to remain stationary. T-PIM also supports variable bit precision to cover various ML scenarios. It can use fully variable input bit precision and 2-bit, 4-bit, 8-bit, and 16-bit weight bit precision for forward propagation, and the same input bit precision with 16-bit weight bit precision for backward propagation. In addition, T-PIM implements sparsity handling schemes that skip the computation for zero input data and turn off the arithmetic units for zero weight data to reduce both unnecessary computations and leakage power. Finally, we fabricate the T-PIM chip on a 5.04-mm² die in a 28-nm CMOS logic process. It operates at 50–280 MHz with a supply voltage of 0.75–1.05 V, dissipating 5.25–51.23 mW in inference and 6.10–37.75 mW in training. As a result, it achieves 17.90–161.08-TOPS/W energy efficiency for inference with 1-bit activation and 2-bit weight data, and 0.84–7.59 TOPS/W for training with 8-bit activation/error and 16-bit weight data. In conclusion, T-PIM is the first PIM chip that supports end-to-end training, demonstrating a 2.02× performance improvement over the latest PIM that only partially supports training.
Article
We present an energy-efficient processing-in-memory (PIM) architecture named Z-PIM that supports both sparsity handling and fully variable bit-precision in weight data for energy-efficient deep neural networks. Z-PIM adopts bit-serial arithmetic, which performs a multiplication bit by bit over multiple cycles to reduce the complexity of the operation in a single cycle and to provide flexibility in bit-precision. To this end, it employs a zero-skipping convolution SRAM, which performs in-memory AND operations based on custom 8T-SRAM cells and channel-wise accumulations, and a diagonal accumulation SRAM that performs bit- and spatial-wise accumulation on the channel-wise accumulation results using diagonal logic and adders to produce the final convolution outputs. We propose a hierarchical bitline structure for energy-efficient weight bit pre-charging and computational readout by reducing the parasitic capacitances of the bitlines. Its charge reuse scheme reduces the switching rate by 95.42% for the convolution layers of the VGG-16 model. In addition, Z-PIM’s channel-wise data mapping enables sparsity handling by skip-reading the input channels with zero weight. Its read-operation pipelining, enabled by read-sequence scheduling, improves the throughput by 66.1%. The Z-PIM chip is fabricated in a 65-nm CMOS process on a 7.568-mm² die, while consuming an average power of 5.294 mW at a 1.0-V supply voltage and a 200-MHz frequency. It achieves 0.31–49.12-TOPS/W energy efficiency for convolution operations as the weight sparsity and bit-precision vary from 0.1 to 0.9 and from 1 to 16 bits, respectively. For the figure of merit considering input bit-width, weight bit-width, and energy efficiency, Z-PIM shows more than a 2.1× improvement over state-of-the-art PIM implementations.
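Z-PIM's channel-wise sparsity handling, skip-reading input channels whose weights are all zero, can be shown with a small behavioral sketch. This Python model is an illustration of the dataflow idea only, not the SRAM read-sequence scheduler.

def zero_skip_channel_dot(acts_by_channel, weights_by_channel):
    """Accumulate only over input channels that contain at least one non-zero weight."""
    acc = 0
    for ch, w_ch in enumerate(weights_by_channel):
        if not any(w_ch):                  # skip-read: this channel's weights are all zero
            continue
        acc += sum(a * w for a, w in zip(acts_by_channel[ch], w_ch))
    return acc

acts = [[1, 2], [3, 4], [5, 6]]
weights = [[0, 0], [2, -1], [0, 3]]        # channel 0 is entirely zero and is skipped
assert zero_skip_channel_dot(acts, weights) == sum(
    a * w for a_ch, w_ch in zip(acts, weights) for a, w in zip(a_ch, w_ch))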
Article
A binary neural network (BNN) chip explores the limits of energy efficiency and computational density for an all-digital deep neural network (DNN) inference accelerator. The chip intersperses data storage and computation using computation near memory (CNM) to reduce interconnect and data movement costs. It performs wide inner product operations to leverage parallelism inherent in DNN computations. The BNN chip leverages lightweight pipelining at a near-threshold voltage (NTV) to reduce the overhead of sequential elements. It employs optimized data access patterns to reduce memory accesses for convolutional operation with pooling layers. The combination of these techniques enables the BNN chip to achieve a peak energy efficiency of 617 TOPS/W. The digital BNN chip approaches the energy efficiency of analog in-memory techniques while also ensuring deterministic, scalable, and bit-accurate operation. Moreover, the all-digital design leverages process scaling and does not require additional memory transistors or passive devices to attain a peak compute density of 418 TOPS/mm² and a memory density of 414 KB/mm². The binary design is extended to enable bit-serial integer precision operation with a reconfigurable 1-b multiplication circuit and element-wise partial sum shift and accumulate. This technique allows for fine-grain mixed precision and retains energy efficiency by exploiting parallelism inherent in DNNs. The bit-serial binary operation allows for bit-accurate operation and high DNN accuracy that multibit analog compute-in-memory designs struggle to attain. It provides favorable energy tradeoffs compared with small-integer digital DNN accelerators.
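The bit-serial extension from binary to integer precision can be illustrated by decomposing both operands into bits, performing only 1-bit multiplies per pass, and shifting and accumulating the element-wise partial sums. The Python sketch below shows that the scheme reproduces the exact unsigned dot product; it is an illustration of the arithmetic, not the chip's reconfigurable circuit.

def bit_serial_int_dot(xs, ws, x_bits=4, w_bits=4):
    """Exact unsigned dot product built only from 1-bit multiplies and shifts."""
    acc = 0
    for i in range(x_bits):
        for j in range(w_bits):
            ones = sum(((x >> i) & 1) & ((w >> j) & 1) for x, w in zip(xs, ws))
            acc += ones << (i + j)       # element-wise partial sum, shift, accumulate
    return acc

xs, ws = [3, 10, 7], [5, 2, 12]
assert bit_serial_int_dot(xs, ws) == sum(x * w for x, w in zip(xs, ws))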
Article
A scalable deep-learning accelerator supporting the training process is implemented for device personalization of deep convolutional neural networks (CNNs). It consists of three processor cores operating with distinct energy-efficient dataflows for different types of computation in CNN training. Unlike previous works, which implement design techniques that exploit the same characteristics as inference, we analyze the major issues that occur during training in a resource-constrained system in order to resolve the bottlenecks. A masking scheme in the propagation core reduces the massive amount of intermediate activation data storage; it eliminates frequent off-chip memory accesses for holding the generated activation data until the backward path. A disparate dataflow architecture is implemented for the weight-gradient computation to enhance PE utilization while maximally reusing the input data. Furthermore, the modified weight-update system enables an 8-bit fixed-point computing datapath. The processor is implemented in 65-nm CMOS technology and occupies 10.24 mm² of core area. It operates with a supply voltage from 0.63 to 1.0 V, and the computing engine runs at a near-threshold voltage of 0.5 V. The chip consumes 40.7 mW at 50 MHz at its highest efficiency and achieves 47.4 μJ/epoch of training efficiency for the customized CNN model.
Article
This article proposes a general-purpose hybrid in-/near-memory compute SRAM (CRAM) that combines an 8T transposable bit cell with vector-based, bit-serial in-memory arithmetic to accommodate a wide range of bit-widths, from single to 32 or 64 bits, as well as a complete set of operation types, including integer and floating-point addition, multiplication, and division. This approach provides the flexibility and programmability necessary for evolving software algorithms ranging from neural networks to graph and signal processing. The proposed design was implemented in a small Internet of Things (IoT) processor in the 28-nm CMOS consisting of a Cortex-M0 CPU and 8 CRAM banks of 16 kB each (128 kB total). The system achieves 475-MHz operation at 1.1 V and, with all CRAMs active, produces 30 GOPS or 1.4 GFLOPS on 32-bit operands. It achieves an energy efficiency of 0.56 TOPS/W for 8-bit multiplication and 5.27 TOPS/W for 8-bit addition at 0.6 V and 114 MHz.
Article
A recent trend in deep neural network (DNN) development is to extend the reach of deep learning applications to platforms that are more resource- and energy-constrained, e.g., mobile devices. These endeavors aim to reduce the DNN model size and improve the hardware processing efficiency and have resulted in DNNs that are much more compact in their structures and/or have high data sparsity. These compact or sparse models are different from the traditional large ones in that there is much more variation in their layer shapes and sizes, and they often require specialized hardware to exploit sparsity for performance improvement. Therefore, many DNN accelerators designed for large DNNs do not perform well on these models. In this paper, we present Eyeriss v2, a DNN accelerator architecture designed for running compact and sparse DNNs. To deal with the widely varying layer shapes and sizes, it introduces a highly flexible on-chip network, called hierarchical mesh, that can adapt to the different amounts of data reuse and bandwidth requirements of different data types, which improves the utilization of the computation resources. Furthermore, Eyeriss v2 can process sparse data directly in the compressed domain for both weights and activations and therefore is able to improve both processing speed and energy efficiency with sparse models. Overall, with sparse MobileNet, Eyeriss v2 in a 65-nm CMOS process achieves a throughput of 1470.6 inferences/s and 2560.3 inferences/J at a batch size of 1, which is 12.6× faster and 2.5× more energy-efficient than the original Eyeriss running MobileNet.
Conference Paper
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving; Exploiting sparsity saves 10x; Weight sharing gives 8x; Skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
Neural networks are a family of powerful machine learning models. This book focuses on the application of neural network models to natural language data. The first half of the book (Parts I and II) covers the basics of supervised machine learning and feed-forward neural networks, the basics of working with machine learning over language data, and the use of vector-based rather than symbolic representations for words. It also covers the computation-graph abstraction, which allows to easily define and train arbitrary neural networks, and is the basis behind the design of contemporary neural network software libraries. The second part of the book (Parts III and IV) introduces more specialized neural network architectures, including 1D convolutional neural networks, recurrent neural networks, conditioned-generation models, and attention-based models. These architectures and techniques are the driving force behind state-of-the-art algorithms for machine translation, syntactic parsing, and many other a...
Article
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; Exploiting sparsity saves 10×; Weight sharing gives 8×; Skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10⁴ frames/sec with a power dissipation of only 600mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.
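The computation EIE accelerates, a sparse matrix-vector multiply over a pruned, weight-shared network that also skips zero activations, can be sketched as below. The row-major (column index, codebook index) layout is a simplification for illustration; EIE's actual compressed storage format and scheduling are not reproduced here.

def sparse_shared_mvm(rows, codebook, activations):
    """rows: per output row, a list of (column, codebook index) pairs."""
    out = [0.0] * len(rows)
    nonzero = {j for j, a in enumerate(activations) if a != 0.0}
    for r, entries in enumerate(rows):
        acc = 0.0
        for col, code in entries:
            if col in nonzero:                     # zero activations trigger no work
                acc += codebook[code] * activations[col]
        out[r] = acc
    return out

codebook = [0.5, -1.0, 2.0]                        # shared weight values
rows = [[(0, 2), (3, 1)],                          # row 0: w[0]=2.0, w[3]=-1.0
        [(1, 0), (2, 0), (3, 2)]]                  # row 1: w[1]=0.5, w[2]=0.5, w[3]=2.0
acts = [1.0, 0.0, 4.0, -2.0]
print(sparse_shared_mvm(rows, codebook, acts))     # [4.0, -2.0]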
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets. © 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov.
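For reference, the technique reads compactly in code. The sketch below uses the common "inverted dropout" formulation (scaling by the keep probability at training time so the network is used unchanged at test time); this is the standard recipe, not code from the paper, which instead scales weights down at test time.

import random

def dropout(activations, p_drop, training=True, seed=None):
    """Randomly zero each unit with probability p_drop during training."""
    if not training or p_drop == 0.0:
        return list(activations)                   # test time: full network, no scaling
    rng = random.Random(seed)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

print(dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5, seed=42))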
Article
Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer high computational capacity/area ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the CNNs and DNNs memory footprint, while large, is not beyond the capability of the on chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to the place and route at 28nm, containing a combination of custom storage and computational units, with industry-grade interconnects.
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Article
This book is required reading for anyone working with accelerator-based computing systems. From the Foreword by Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory. CUDA is a computing architecture designed to facilitate the development of parallel programs. In conjunction with a comprehensive software platform, the CUDA Architecture enables programmers to draw on the immense power of graphics processing units (GPUs) when building high-performance applications. GPUs, of course, have long been available for demanding graphics and game applications. CUDA now brings this valuable resource to programmers working on applications in other domains, including science, engineering, and finance. No knowledge of graphics programming is required; just the ability to program in a modestly extended version of C. CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. The authors introduce each area of CUDA development through working examples. After a concise introduction to the CUDA platform and architecture, as well as a quick-start guide to CUDA C, the book details the techniques and trade-offs associated with each key CUDA feature. You'll discover when to use each CUDA C extension and how to write CUDA software that delivers truly outstanding performance. Major topics covered include parallel programming, thread cooperation, constant memory and events, texture memory, graphics interoperability, atomics, streams, CUDA C on multiple GPUs, advanced atomics, and additional CUDA resources. All the CUDA software tools you'll need are freely available for download from NVIDIA: http://developer.nvidia.com/object/cuda-by-example.html
Article
Based on the bit-pair (a_i, b_i) truth table, the carry propagate p_i and carry generate g_i have dominated the carry-look-ahead formation process for more than two decades. This paper presents a new scheme in which the new carry propagation is examined by including the neighboring pairs (a_i, b_i; a_{i+1}, b_{i+1}). This scheme not only reduces the component count in design, but also requires fewer logic levels in adder implementation. In addition, this new algorithm offers an astonishingly uniform loading in fan-in and fan-out nesting.
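For reference, the classical single-pair definitions that this scheme extends are given below (one common convention, shown here for context; the paper's pairwise formulation is not reproduced):

\[
g_i = a_i b_i,\qquad p_i = a_i \oplus b_i,\qquad c_{i+1} = g_i + p_i\,c_i,\qquad s_i = p_i \oplus c_i .
\]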
Low-Power AI Startup Eta Compute Delivers First Commercial Chips
  • S K Moore