
Sarma Vrudhula, Arizona State University | ASU · School of Computing, Informatics, and Decision Systems Engineering
Doctor of Philosophy
246 publications · 33,938 reads · 8,472 citations
Publications (246)
This paper presents TULIP, a new architecture for a variable precision Quantized Neural Network (QNN) inference. It is designed with the goal of maximizing energy efficiency per classification. TULIP is constructed by arranging a collection of unique processing elements (TULIP-PEs) in a single instruction multiple data (SIMD) fashion. Each TULIP-...
Decades of progress in energy-efficient and low-power design have successfully reduced the operational carbon footprint in the semiconductor industry. However, this has led to an increase in embodied emissions, encompassing carbon emissions arising from design, manufacturing, packaging, and other infrastructural activities. While existing research...
A new design methodology for reducing the area and power of standard cell ASICs that uses a combination of differential flipflops and a method of deliberate clock-skewing, called local clocking (LC), is described. LC introduces clock skew without the use of extra buffers in the clock network. This is done by having some flipflops, called sources, g...
Machine Learning (ML) workloads are increasingly deployed at the edge. Enabling efficient inference execution while considering model and system heterogeneity remains challenging, especially for ML tasks built with a network of DNNs. The challenge is to maximize the utilization of all available resources on the multiprocessor system on a chip (MPSo...
Flash has been the workhorse technology for non-volatile memory design for many years now. In this chapter, we show that flash technology can be used to design a variety of general-purpose circuits, both digital and analog. We illustrate this via case studies that demonstrate two styles of flash-based ASIC design (including a secure variant), flas...
In this paper, we describe a design of a mixed-signal circuit for a binary neuron (a.k.a. perceptron, threshold logic gate) and a methodology for automatically embedding such cells in ASICs. The binary neuron, referred to as an FTL (flash threshold logic), uses floating gate or flash transistors whose threshold voltages serve as a proxy for the weig...
This paper presents a framework to enable the energy-efficient execution of convolutional neural networks (CNNs) on edge devices. The framework consists of a pair of edge devices connected via a wireless network: a performance and energy-constrained device D as the first recipient of data, and an energy-unconstrained device N as an accelerator for...
In this paper, we describe a design of a mixed-signal circuit for a binary neuron (a.k.a. perceptron, threshold logic gate) and a methodology for automatically embedding such cells in ASICs. The binary neuron, referred to as an FTL (flash threshold logic), uses floating gate or flash transistors whose threshold voltages serve as a proxy for the weigh...
This paper presents a DRAM-based processing-in-memory (PIM) architecture, called CIDAN-XE. It contains a novel computing unit called the neuron processing element (NPE). Each NPE can perform a variety of operations that include logical, arithmetic, relational, and predicate operations on multi-bit operands. Furthermore, they can be reconfigured to...
Numerous applications such as graph processing, cryptography, databases, bioinformatics, etc., involve the repeated evaluation of Boolean functions on large bit vectors. In-memory architectures which perform processing in memory (PIM) are tailored for such applications. This paper describes a different architecture for in-memory computation called...
The flexibility of field-programmable gate arrays (FPGAs) is attributed to the reconfigurability of their basic logic elements (BLEs). Traditionally, the BLEs are comprised of one or more lookup tables (LUTs) of $n$ inputs, that are designed to implement Boolean functions of $n$ or fewer inputs. In an attempt to reduce the area and power consum...
This paper presents TULIP, a new architecture for a binary neural network (BNN) that uses an optimal schedule for executing the operations of an arbitrary BNN. It was constructed with the goal of maximizing energy efficiency per classification. At the top-level, TULIP consists of a collection of unique processing elements (TULIP-PEs) that are organ...
Object detection using deep neural networks (DNNs) involves a huge amount of computation, which impedes its implementation on resource/energy-limited user-end devices. DNNs succeed because they hold knowledge over all different domains of observed environments. However, we need only limited knowledge of the observed environment at...
The next significant step in the evolution and proliferation of artificial intelligence technology will be the integration of neural network (NN) models within embedded and mobile systems. This calls for the design of compact, energy efficient NN models in silicon. In this article, we present a scalable application-specific integrated circuit (ASIC...
Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) is known for its capability in modeling sequence learning tasks such as language modeling. However, due to the large number of model parameters and compute-intensive operations, existing FPGA implementations of LSTMs are not sufficiently energy-efficient as they require large area and exh...
The next significant step in the evolution and proliferation of artificial intelligence technology will be the integration of neural network (NN) models within embedded and mobile systems. This calls for the design of compact, energy efficient NN models in silicon. In this paper, we present a scalable ASIC design of an LSTM accelerator named ELSA,...
This paper describes a novel design of a threshold logic gate (a binary perceptron) and its implementation as a standard cell. This new cell structure, referred to as flash threshold logic (FTL), uses floating gate (flash) transistors to realize the weights associated with a threshold function. The threshold voltages of the flash transistors serve...
Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) is known for its capability in modeling temporal aspects of data and has been shown to produce promising results in sequence learning tasks such as language modeling. However, due to the large number of model parameters and compute-intensive operations, existing FPGA implementations of LS...
The recently reported successes of convolutional neural networks (CNNs) in many areas have generated wide interest in the development of FPGA-based accelerators. To achieve high performance and energy efficiency, an FPGA-based accelerator must fully utilize the limited computation resources and minimize the data communication and memory access, both...
A broad range of applications are increasingly benefiting from the rapid and flourishing development of convolutional neural networks (CNNs). The FPGA-based CNN inference accelerator is gaining popularity due to its high performance and low power as well as FPGA’s conventional advantage of reconfigurability and flexibility. Without a general compil...
The success of mobile devices and applications is directly linked to a user's satisfaction of the quality of service — a metric used to denote the user's perception of the quality of an application. The first and necessary building block to manage user satisfaction is to establish accurate performance and power models which are sensitive to the mob...
The rapid improvement in computation capability has made convolutional neural networks (CNNs) a great success in recent years on image classification tasks, which has also spurred the development of object detection algorithms with significantly improved accuracy. However, during the deployment phase, many applications demand low latency proce...
As convolution contributes most of the operations in a convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate operations with four levels of loops, which results in a large design space. Prior works either employ...
Systems powered by harvested energy must consume very low power and withstand frequent interruptions in power. Nonvolatile logic (NVL) addresses the latter by saving the system state in flipflops enhanced with spin-transfer torque magnetic tunnel junctions (STT-MTJs) as the nonvolatile storage devices. Manufacturing variations in the STT-MTJs and i...
Deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to the large volume of data, the extensive amount of computation and frequent memory accesses. Although existing high-level synthesis tools (e.g. HLS, OpenCL) for FPGAs dramatically reduce the design time, the resulting implementations are still inefficient...
A new method for reducing power and area of standard cell ASICs is described. The method is based on deliberately introducing clock skew without the use of extra buffers in the clock network. This is done by having some flipflops, called sources, generate clock signals for other flipflops, called targets. The method involves two key features: (1) t...
As convolution layers contribute most operations in convolutional neural network (CNN) algorithms, an effective convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution in CNNs involves three-dimensional multiply and accumulate (MAC) operations with four levels of loops, which r...
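The loop structure mentioned in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract is truncated and does not define the four levels, so the numbering used here (kernel window, input feature maps, scan over one feature map, output feature maps) is an assumed convention, and `conv_layer` is a hypothetical name rather than anything from the paper. Stride 1 and no padding are also assumed.

```python
def conv_layer(inputs, weights):
    """Direct convolution with stride 1 and no padding (assumed for brevity).

    inputs:  nested lists shaped [n_in][height][width]
    weights: nested lists shaped [n_out][n_in][k][k]
    returns: nested lists shaped [n_out][height - k + 1][width - k + 1]
    """
    n_in = len(inputs)
    height, width = len(inputs[0]), len(inputs[0][0])
    n_out = len(weights)
    k = len(weights[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    outputs = [[[0] * out_w for _ in range(out_h)] for _ in range(n_out)]

    for o in range(n_out):                  # assumed level 4: output feature maps
        for y in range(out_h):              # assumed level 3: scan over one feature map
            for x in range(out_w):
                acc = 0
                for c in range(n_in):       # assumed level 2: input feature maps
                    for ky in range(k):     # assumed level 1: MACs inside the kernel window
                        for kx in range(k):
                            acc += inputs[c][y + ky][x + kx] * weights[o][c][ky][kx]
                outputs[o][y][x] = acc
    return outputs
```

Every choice of which loops to unroll, tile, or reorder gives a different hardware mapping of the same arithmetic, which is one way to read the "large design space" the abstract refers to.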
Spike timing dependent plasticity (STDP) is an important neural process that enables biological neural networks to learn by strengthening or weakening synaptic connections between neurons. This work presents simulation results and post-silicon experimental data that demonstrate for the first time the possibility of tuning the on state resistance of...
In technology mapping, enumeration of subcircuits or cuts to be replaced by a standard cell is an important step that decides both the quality of the solution and execution speed. In this work, we view cuts as sets of edges rather than sets of nodes and, based on this view, provide a classification of cuts. It is shown that if enumeration is restricted to...
In this paper, we describe a new approach to reduce dynamic power, leakage, and area of application-specific integrated circuits, without sacrificing performance. The approach is based on a design of threshold logic gates (TLGs) and their seamless integration with conventional standard-cell design flow. We first describe a new robust, standard-cel...
This paper proposes a method to completely hide the functionality of a digital standard cell. This is accomplished by a differential threshold logic gate (TLG). A TLG with $n$ inputs implements a subset of Boolean functions of $n$ variables that are linear threshold functions. The output of such a gate is one if and only if an integer weighted line...
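The definition of a threshold function in this abstract is cut off, so the following is a minimal sketch assuming the standard textbook form: the gate outputs 1 if and only if the integer-weighted sum of its inputs is at least a threshold $T$. The function names `threshold_gate` and `truth_table` are hypothetical and chosen for illustration only.

```python
from itertools import product


def threshold_gate(weights, threshold, inputs):
    """Return 1 iff sum(w_i * x_i) >= threshold (assumed standard definition)."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def truth_table(weights, threshold):
    """Evaluate the gate on all 2^n input combinations, from 00...0 to 11...1."""
    n = len(weights)
    return [threshold_gate(weights, threshold, bits)
            for bits in product((0, 1), repeat=n)]


# The same two-input structure realizes different functions
# depending only on the threshold value.
and_table = truth_table([1, 1], 2)       # [0, 0, 0, 1]
or_table = truth_table([1, 1], 1)        # [0, 1, 1, 1]
majority = truth_table([1, 1, 1], 2)     # [0, 0, 0, 1, 0, 1, 1, 1]
```

Because the realized function depends only on the weights and the threshold, two gates with the same structure can compute different functions, which is consistent with the abstract's idea of hiding a cell's functionality.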
Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Due to multiple convolution and fully-connected layers that are compute-/memory-intensive, it is difficult to perform re...
Many recent advances in sparse coding have led to its wide adoption in signal processing, pattern classification, and object recognition applications. Even with improved performance in state-of-the-art algorithms and the hardware platform of CPUs/GPUs, solving a sparse coding problem still requires expensive computations, making real-time large-scale learn...
A neuro-inspired computing paradigm beyond the von Neumann architecture is emerging and it generally takes advantage of massive parallelism and is aimed at complex tasks that involve intelligence and learning. The cross-point array architecture with synaptic devices has been proposed for on-chip implementation of the weighted sum and weight update...
Temperature has a negative impact on metal resistance and thus wire delay. In state-of-the-art VLSI circuits, large thermal gradients usually exist due to the uneven distribution of heat sources. The difference in wire temperature can lead to performance mismatch because wires of the same length can have different delay. Traditional floorplanning a...
This paper proposes a parallel architecture with resistive crosspoint array. The design of its two essential operations, read and write, is inspired by the biophysical behavior of a neural system, such as integrate-and-fire and local synapse weight update. The proposed hardware consists of an array with resistive random access memory (RRAM) and CMO...
In this work, a zero-leakage nonvolatile flip-flop architecture based on a differential CMOS sense-amplifier flip-flop is presented. The flip-flop stores data in complementarily programmed resistive memory devices during the inactive period while the power supply is turned off and then restores the data to flip-flop outputs once power supply is turned back...
Threshold-logic gates have long been known to result in more compact and faster circuits when compared to conventional AND/OR logic equivalents [1]. However, threshold logic based design has not entered the mainstream design technology (neither custom ASIC nor FPGA) due to the lack of efficient and reliable gate implementations and the necessary in...
This paper proposes a parallel programming scheme for the cross-point array with resistive random access memory (RRAM). Synaptic plasticity in unsupervised learning is realized by tuning the conductance of each RRAM cell. Inspired by the spike-timing-dependent-plasticity (STDP), the programming strength is encoded into the spike firing rate (i.e.,...
This paper proposes a parallel architecture with resistive crosspoint array. The design of its two essential operations, Read and Write, is inspired by the biophysical behavior of a neural system, such as integrate-and-fire and time-dependent synaptic plasticity. The proposed hardware consists of an array with resistive random access memory (RRAM)...
Chalcogenide glass-based programmable metallization cell (PMC) devices undergo Ag+-ion transport and controlled resistance change under the application of electrical bias. In this paper, photo-doped PMC devices are characterized with impedance spectroscopy. Photo doping is an important step in PMC fabrication as it introduces the mobile Ag into the...
Recent empirical studies have shown that multicore scaling is fast becoming power limited, and consequently, an increasing fraction of a multicore processor has to be under clocked or powered off. Therefore, in addition to fundamental innovations in architecture, compilers and parallelization of application programs, there is a need to develop prac...
In this work, we investigate the resistance switching behavior of Ag–Ge–Se based resistive memory (ReRAM) devices, otherwise known as programmable metallization cells (PMC). The devices studied are switched between high and low resistive states under externally applied electrical bias. The presence of multiple resistive states observed under both d...
A method of mapping threshold gate cells into a Boolean network is disclosed. In one embodiment, cuts are enumerated within the Boolean network. Next, a subset of the cuts within the Boolean network that are threshold is identified. To minimize power, cuts in the subset of the cuts are selected.
Differential mode threshold-logic gates can be programmed to compute complex logic functions within a single cell, resulting in significant reduction in area and power. However the circuit yield reduces if they are operated at low voltages. This paper describes a novel integration of RRAM with such threshold-logic gates to achieve robust, low volta...
One of the challenges that all accelerators face is to execute loops that have if-then-else constructs. There are three ways to accelerate loops with an if-then-else construct on a coarse-grained reconfigurable architecture (CGRA): full predication, partial predication, and dual-issue scheme. In comparison with the other schemes, dual-issue scheme...
Energy harvesting is a promising solution for reducing network maintenance and the overhead of replacing chemical batteries in sensor networks. In this article, problems related to controlling an active wireless sensor network comprised of nodes powered by both rechargeable batteries and solar energy are investigated. The objective of this control...
Energy efficiency has taken center stage in all aspects of computing, regardless of whether it is performed on a portable battery-powered device, a desktop PC, on servers in a data center, or on a supercomputer. It is expressed as performance-per-watt (PPW), which is equal to the number of instructions that are executed per Joule of energy. The shi...
Spiking neural P systems (in short, SN P systems) have been introduced as computing devices inspired by the structure and functioning of neural cells. The presence of unreliable components in SN P systems can be considered in many different aspects. In this paper we focus on two types of unreliability: the stochastic delays of the spiking rules and...
Systems and methods for identifying a Boolean function as either a threshold function or a non-threshold function are disclosed. In one embodiment, in order to identify a Boolean function as either a threshold function or a non-threshold function, a determination is first made as to whether the Boolean function satisfies one or more predefined cond...
Coarse-Grained Reconfigurable Architectures (CGRAs) are an extremely attractive platform when both performance and power efficiency are paramount. Although the power-efficiency of CGRAs can be very high, their performance critically hinges upon the capabilities of the compiler. This is because a CGRA compiler has to perform explicit pipelining, sch...
This paper describes a novel, first-of-its-kind architecture for a threshold logic gate using conventional MOSFETs and an STT-MTJ (Spin Transfer Torque-Magnetic Tunneling Junction) device. The resulting cell, called STL, is extremely compact and can be programmed to realize a large number of threshold functions, many of which would require a multi...
This paper describes the design of a standard cell library of differential mode threshold gates, referred to as a Threshold Logic Latch or TLL, and new threshold function identification and decomposition methods to map a conventional logic network consisting of logic gates and flipflops, into a hybrid network that consists of both TLLs and conventi...
Coarse-Grained Reconfigurable Architectures (CGRAs) are an attractive platform that promises simultaneous high performance and high power-efficiency. One of the primary challenges in using CGRAs is to develop efficient compilers that can automatically and efficiently map applications to the CGRA. To this end, this paper makes several contributions:...
Energy harvesting in a sensor network is essential in situations where it is either difficult or not cost effective to access the network's nodes to replace the batteries. In this paper, we investigate the problems involved in controlling an active wireless sensor network that is powered both by rechargeable batteries and solar energy. The objectiv...
Extracting high performance from multi-core processors requires increased use of thermal management techniques. In contrast to offline thermal management techniques, online techniques are capable of sensing changes in the workload distribution and setting the processor controls accordingly. Hence, online solutions are more accurate and are able to...
Coarse-Grained Reconfigurable Arrays or CGRAs are programmable fabrics that promise both high performance and high power efficiency. Traditionally, CGRAs were used to accelerate extremely-embedded systems, and were typically manually programmed. However, as CGRAs are conceived to be used as more general-purpose accelerators, there is a need to deve...
This paper presents a new four-moduli residue number system of the form $\{2^k, 2^n-1, 2^n+1, 2^{n+1}-1\}$, $n \le k \le 2n$, which is an enhancement of the popular four-moduli set $\{2^n, 2^n-1, 2^n+1, 2^{n+1}-1\}$ (for even $n$). Our k-mod4 moduli set achieves a higher dynamic range and a better balancing of the binary channels. Using the proposed k-mod4 moduli set helps in reducing t...
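To make the residue number system idea concrete, here is a minimal sketch. It assumes the moduli set in this abstract reads $\{2^k, 2^n-1, 2^n+1, 2^{n+1}-1\}$ with $n \le k \le 2n$ and even $n$ (the exponents are damaged in the extracted text, so this reading is an assumption). The helper names are hypothetical, and the reverse conversion uses the generic Chinese Remainder Theorem, not the converter design the paper proposes. Python 3.8 or later is assumed for `math.prod` and the modular inverse via `pow`.

```python
from math import prod


def moduli_set(n, k):
    """Assumed reading of the moduli set: {2^k, 2^n - 1, 2^n + 1, 2^(n+1) - 1}."""
    return [2**k, 2**n - 1, 2**n + 1, 2**(n + 1) - 1]


def to_residues(x, moduli):
    """Forward conversion: one small residue per channel."""
    return [x % m for m in moduli]


def from_residues(residues, moduli):
    """Reverse conversion via the generic Chinese Remainder Theorem."""
    dynamic_range = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = dynamic_range // m
        x += r * partial * pow(partial, -1, m)  # modular inverse of partial mod m
    return x % dynamic_range


base = moduli_set(4, 4)       # [16, 15, 17, 31], dynamic range 126480
wider = moduli_set(4, 8)      # [256, 15, 17, 31], dynamic range 2023680

# Addition is carried out independently in each channel, with no carries between them.
a, b = 1000, 2345
channel_sums = [(ra + rb) % m
                for ra, rb, m in zip(to_residues(a, wider), to_residues(b, wider), wider)]
recovered = from_residues(channel_sums, wider)   # 3345
```

With $n = 4$, raising $k$ from 4 to 8 multiplies the dynamic range by 16 while leaving the three odd moduli unchanged, which illustrates the higher dynamic range mentioned in the abstract.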
This paper presents a new and efficient heuristic procedure for determining whether or not a given Boolean function is a threshold function, when the Boolean function is given in the form of a decision diagram. The decision diagram based method is significantly different from earlier methods that are based on solving linear inequalities in Boolean...
Advances in chip-multiprocessor processing capabilities have led to increased power consumption and temperature hotspots. Reducing the on-die peak temperature is important from the power reduction and reliability considerations. However, the presence of task deadlines constrains the reduction of peak temperature and thus complicates the determina...