About
189
Publications
21,763
Reads
1,171
Citations
Publications (189)
Genome Informatics (GI) involves accurate computational investigations of strongly correlated subsystems that demand interdisciplinary approaches to problem solving. With the volume of genomic sequencing data growing at an alarming rate, High Performance Computing (HPC) solutions offer the right platform to address the computational needs. GI re...
Polynomial factorization is a classical algorithmic problem in algebra, which has a wide range of applications. Of special interest is factorization over finite fields, among which the field of order two is probably the most important one due to the relationship to Boolean functions. In particular, factorization of Boolean polynomials corresponds t...
As the complexity of circuit design increases, verification of these circuits through simulation also becomes extremely challenging. This creates a bottleneck in the IC design process. Distributed simulation is one way of solving this problem where the simulation workload is distributed among the parallel processors involved in the simulation. Howe...
We present efficient realization of Generalized Givens Rotation (GGR) based QR factorization that achieves 3-100x better performance in terms of Gflops/watt over state-of-the-art realizations on multicore, and General Purpose Graphics Processing Units (GPGPUs). GGR is an improvement over classical Givens Rotation (GR) operation that can annihilate...
In this paper, we present an efficient realization of the Kalman Filter (KF) that can achieve up to 65% of the theoretical peak performance of the underlying architecture platform. KF is realized using the Modified Faddeeva Algorithm (MFA) as a basic building block due to its versatility, and the REDEFINE Coarse Grained Reconfigurable Architecture (CGRA) is used as a...
In this paper we present the design and analysis of scalable hardware architectures for training the learning parameters of an RBFNN to classify large data sets. We design scalable hardware architectures for the K-means clustering algorithm, to train the positions of hidden nodes in the hidden layer of the RBFNN, and for the pseudoinverse algorithm for weight adjustments at outp...
REDEFINE is a distributed dynamic dataflow architecture, designed for exploiting parallelism at various granularities as an embedded system-on-chip (SoC). This paper dwells on the flexibility of REDEFINE architecture and its execution model in accelerating real-time applications coupled with a WCET analyzer that computes execution time bounds of re...
Transistor supply voltages no longer scale at the same rate as transistor density and frequency of operation. This has led to the Dark Silicon problem, wherein only a fraction of transistors can operate at maximum frequency and nominal voltage, in order to ensure that the chip functions within the power and thermal budgets. Heterogeneous computing...
We present an efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design, where we achieve a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. T...
In this paper we present the design and analysis of a scalable real-time Face Recognition (FR) module to perform 450 recognitions per second. We introduce an algorithm for FR, which is a combination of Weighted Modular Principal Component Analysis and Radial Basis Function Neural Networks. This algorithm offers better recognition accuracy in various pr...
Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate performance of the HPC applications. Performance in such tuned packages is attained through tuning of several algorithmic and architectural parameters such as number of pa...
Basic Linear Algebra Subprograms (BLAS) play a key role in high performance and scientific computing applications. Experimentally, yesteryear multicore and General Purpose Graphics Processing Units (GPGPUs) are capable of achieving up to 15 to 57% of the peak performance at 65W to 240W of power respectively in underlying platform for compute bound op...
Performance of classification using Feed-Forward Backpropagation Neural Network (FFBPNN) on a reconfigurable hardware architecture is evaluated in this paper. The hardware architecture used for implementation of FFBPNN in this paper is a set of interconnected HyperCells which serve as reconfigurable data paths for the network. The architecture is e...
Performance of an application on a many-core machine primarily hinges on the ability of the architecture to exploit parallelism and to provide fast memory accesses. Exploiting parallelism in static application graphs on a multicore target is relatively easy owing to the fact that compilers can map them onto an optimal set of processing elements and...
In this paper we propose the architecture of a processor for the vector operations involved in on-line learning of neural networks. We aim to implement on-line learning on a Radial Basis Function Neural Network (RBFNN) based Face Recognition (FR) system that has pseudo inverse computation as an essential component during training. Synaptic weights of RB...
Security is becoming one of the main aspects of Multiprocessor System-on-Chip (MP-SoC) design. Software attacks, the most common type of attacks, mainly exploit vulnerabilities like buffer overflow. This is possible if proper access control to memory is absent in the system. In this paper, we propose three hardware based mechanisms to implement Rol...
With transistor energy efficiency not scaling at the same rate as transistor density and frequency, CMOS technology has hit a utilization wall, whereby large portions of the chip remain underclocked. To improve performance, while keeping power dissipation at a realistic level, future computing devices will consist of heterogeneous application spec...
The growing number of applications and processing units in modern Multiprocessor Systems-on-Chips (MPSoCs) comes along with reduced time to market. Different IP cores can come from different vendors, and their trust levels are also different, but typically they use Network-on-Chip (NoC) as their communication infrastructure. An MPSoC can have multip...
A scalable and reconfigurable architecture for accelerating classification using Radial Basis Function Neural Network (RBFNN) is presented in this paper. The proposed accelerator comprises a set of interconnected HyperCells, which serve as the reconfigurable datapath on which the RBFNN is realized. The dimensions of RBFNN that can be supported on...
The topology and channel width in Networks-on-Chip (NoCs) impact the throughput and latency and therefore the area of deployment. In this paper, an NoC based on a three dimensional, toroidal rectangular honeycomb topology using a two-tuple (x, y) address is discussed. It employs a minimal and deterministic routing algorithm utilizing Virtual Chann...
Fast Fourier Transform is an integral part of OFDM systems. FFT is the most compute intensive operation that critically affects the OFDM system performance. In order to support the various OFDM standards, a scalable and reconfigurable FFT architecture is necessary. This paper presents an energy efficient and scalable FFT architecture, which can be...
NoC based high performance MP-SoCs can have multiple secure regions or Trusted Execution Environments (TEEs). These TEEs can be separated by non-secure regions or Rich Execution Environments (REEs) in the same MP-SoC. All communications between two TEEs need to cross the in-between REEs. Without any security mechanisms, these traffic flows can face...
The EDA industry has recently witnessed the growing popularity of densely populated, IP rich SoC designs targeting high performance computing platforms. Such SoCs require effective logic simulation, with high levels of accuracy and throughput, for a fault free design and faster time to market. Hardware-Assisted Simulation (HAS) is the appropriate c...
A hundred fold performance improvement of a CAD tool can bring about a similar increase in design size. Hardware accelerators for CAD tools offer such performance improvements, but are application specific. Though some accelerators can be programmed to adapt themselves to somewhat different applications, it takes as much time and money to reconfigu...
The routing of nets or a set of interconnection points is a compute intensive application encountered in CAD for VLSI design. In this paper, we propose a gridless algorithm for area routing based on the Hightower's Maze routing algorithm for message passing multiprocessor systems. The suitability of k-d trees as a data structuring technique for gri...
Advances in wireless internet, sensor technologies, mobile technologies, and global positioning technologies have renewed interest in location based services (LBSs) among mobile users. LBSs on smartphones allow consumers to locate nearby products and services, in exchange for their location information. Precision of location data helps for accurate...
Radial Basis Function Neural Networks (RBFNN) are used in a variety of applications such as pattern recognition, control, time series prediction, and nonlinear identification. An RBFNN with a Gaussian function as the basis function is considered for classification purposes. Training is done offline using the K-means clustering method for center learning and...
LU and QR factorizations are the computationally expensive part of many applications ranging from large scale simulations (e.g., computational fluid dynamics) to augmented reality. These factorizations exhibit time complexity of O(n^3) and are difficult to accelerate due to the presence of bandwidth bound kernels, BLAS-1 or BLAS-2 (level-1 or level-2 Basic L...
The objective of this paper is to come up with a scalable modular hardware solution for real-time Face Recognition (FR) on large databases. Existing hardware solutions use algorithms with low recognition accuracy suitable for real-time response. In addition, database size for these solutions is limited by on-chip resources making them unsuitable...
FFT is the most compute intensive operation that critically affects the OFDM system performance. In order to support the various OFDM standards, a scalable and reconfigurable FFT architecture is necessary. This paper presents an energy efficient and scalable FFT architecture, which can be dynamically reconfigured to adapt to specifications of diffe...
This paper presents two hardware architectures of bi-cubic convolution interpolation termed Parallelized Row Column Interpolation Architecture (PRCIA) and Serialized Row Column Interpolation Architecture (SRCIA) for real-time image scaling. These architectures factor in the challenges of high computational complexity, redundant computations and re...
A new layered reconfigurable architecture is proposed which exploits modularity, scalability and flexibility to achieve high energy efficiency and memory bandwidth. Using two flavors of Column-wise Givens rotation, derived from traditional Fast Givens and Square root and Division Free Givens Rotation algorithms, the architecture is thoroughly evalua...
In this paper we present a framework for realizing arbitrary instruction set extensions (IE) that are identified post-silicon. The proposed framework has two components, viz., an IE synthesis methodology and the architecture of a reconfigurable data-path for realization of such IEs. The IE synthesis methodology ensures maximal utilization of res...
Coarse Grained Reconfigurable Architectures (CGRA) are emerging as embedded application processing units in computing platforms for Exascale computing. Such CGRAs are distributed memory multi-core compute elements on a chip that communicate over a Network-on-chip (NoC). Numerical Linear Algebra (NLA) kernels are key to several high performance comp...
Givens Rotation is a key computation-intensive block in embedded wireless applications. In order to achieve an efficient mapping which smoothly scales to the underlying architecture, we propose two new Column-based Givens Rotation algorithms, derived from traditional Fast Givens and Square-root and Division Free Givens algorithms. These algorithms...
QR decomposition (QRD) is a widely used Numerical Linear Algebra (NLA) kernel with applications ranging from SONAR beam forming to wireless MIMO receivers. In this paper, we propose a novel Givens Rotation (GR) based QRD (GR-QRD) where we reduce the computational complexity of GR and exploit a higher degree of parallelism. This low complexity Column-...
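For readers unfamiliar with the baseline that GR-QRD builds on, the following is a minimal sketch of classical Givens-rotation QR, not the low complexity column-wise variant proposed in the paper (whose details are not given in the abstract). The function name `givens_qr` and the list-of-rows matrix layout are assumptions made for illustration.

```python
import math


def givens_qr(a):
    """Return the upper-triangular factor R of a (a list of rows).

    Illustrative only: this is the classical Givens scheme, in which each
    rotation annihilates exactly one sub-diagonal element.
    """
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]  # work on a copy
    for j in range(n):  # column by column
        for i in range(m - 1, j, -1):  # zero entries from the bottom up
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue  # both entries already zero, nothing to rotate
            c, s = x / h, y / h
            # Rotate rows i-1 and i; columns left of j are already zero there.
            for k in range(j, n):
                top, bot = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * top + s * bot
                r[i][k] = -s * top + c * bot
    return r
```

Because every rotation is orthogonal, the result satisfies R^T R = A^T A, which gives a simple correctness check that does not require forming Q explicitly.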
The cloud computing model separates usage from ownership in terms of control on resource provisioning. Resources in the cloud are projected as a service and are realized using various service models like IaaS, PaaS and SaaS. In the IaaS model, end users get to use a VM whose capacity they can specify but not the placement on a specific host or with which o...
This paper presents a Radix-4^3 based FFT architecture suitable for OFDM based WLAN applications. The radix-4^3 parallel unrolled architecture presented here uses a radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. A 64 point FFT processor based on the proposed architecture has be...
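The behaviour of a radix-4 butterfly that takes four inputs in parallel and emits one selected output can be sketched in software as follows. This is a functional model under stated assumptions, not the hardware design from the paper: the function name `radix4_butterfly` and the `select` argument are hypothetical, and inter-stage twiddle factors are left out.

```python
def radix4_butterfly(x, select):
    """Return output number `select` (0..3) of a 4-point DFT of x[0..3].

    Illustrative model only. The 4-point DFT twiddles are powers of -j,
    so each term needs only a sign change or a real/imaginary swap.
    """
    if select not in (0, 1, 2, 3):
        raise ValueError("select must be 0, 1, 2 or 3")
    w = (1, -1j, -1, 1j)  # (-j)**0 .. (-j)**3
    return sum(x[n] * w[(select * n) % 4] for n in range(4))
```

Calling the function with `select` set to 0, 1, 2 and 3 in turn yields the full 4-point DFT of the inputs.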
Elasticity in cloud systems provides the flexibility to acquire and relinquish computing resources on demand. However, in current virtualized systems resource allocation is mostly static. Resources are allocated during VM instantiation and any change in workload leading to significant increase or decrease in resources is handled by VM migration. He...
In this paper we propose a fully parallel 64K point radix-4^4 FFT processor. The radix-4^4 parallel unrolled architecture uses a novel radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. The radix-4^4 block can take all 256 inputs in parallel and can use the select control signals to...
The overall system cost and other considerations require that processors deliver high performance within a given power budget. Pipelining and supply voltage both affect performance and power consumption. Therefore, pipelining depth can be varied simultaneously with appropriate adjustments in supply voltage so as to keep power consumption const...
Realization of cloud computing has been possible due to availability of virtualization technologies on commodity platforms. Measuring resource usage on the virtualized servers is difficult because of the fact that the performance counters used for resource accounting are not virtualized. Hence, many of the prevalent virtualization technologies like...
Monitoring of infrastructural resources in clouds plays a crucial role in providing application guarantees like performance, availability, and security. Monitoring is crucial from two perspectives - the cloud-user and the service provider. The cloud user's interest is in doing an analysis to arrive at appropriate Service-level agreement (SLA) deman...
The highest levels of security can be achieved through the use of more than one type of cryptographic algorithm for each security function. In this paper, the REDEFINE polymorphic architecture is presented as an architecture framework that can optimally support a varied set of crypto algorithms without losing high performance. The presented solutio...
Future mobile multimedia systems will have wearable computing devices as their front ends, supported by database servers, I/O servers, and compute servers over a backbone network. Multimedia applications on such systems are demanding in terms of network and compute resources, and have stringent Quality of Service (QoS) requirements. Providing QoS h...
In this paper we present a hardware-software hybrid technique for modular multiplication over large binary fields. The technique involves application of Karatsuba-Ofman algorithm for polynomial multiplication and a novel technique for reduction. The proposed reduction technique is based on the popular repeated multiplication technique and Barrett r...
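The Karatsuba-Ofman step for polynomial multiplication over binary fields can be sketched as below. This covers only the multiplication; the paper's novel reduction technique is not described in the abstract and is not reproduced here. Polynomials over GF(2) are encoded as Python integers (bit i is the coefficient of x^i), and the function names and the `threshold` parameter are assumptions made for illustration.

```python
def clmul_schoolbook(a: int, b: int) -> int:
    """Carry-less product in GF(2)[x] by shift-and-XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in GF(2)[x] is XOR
        a <<= 1
        b >>= 1
    return result


def clmul_karatsuba(a: int, b: int, threshold: int = 32) -> int:
    """Karatsuba-Ofman product in GF(2)[x]: three half-size products, not four.

    `threshold` (assumed to be at least 1) is the operand size in bits at
    or below which the schoolbook method is used.
    """
    n = max(a.bit_length(), b.bit_length())
    if n <= threshold:
        return clmul_schoolbook(a, b)
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = clmul_karatsuba(a_lo, b_lo, threshold)
    hi = clmul_karatsuba(a_hi, b_hi, threshold)
    # In characteristic 2, subtraction is also XOR, so the middle term is
    # (a_lo + a_hi)(b_lo + b_hi) + lo + hi.
    mid = clmul_karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, threshold) ^ lo ^ hi
    return (hi << (2 * half)) ^ (mid << half) ^ lo
```

For example, `clmul_schoolbook(0b11, 0b11)` returns `0b101`, since (x + 1)^2 = x^2 + 1 over GF(2). A full field multiplication would follow this product with a reduction modulo the field polynomial.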
Mapping applications onto a Coarse Grained Re-configurable Architecture (CGRA) requires knowledge about the interconnect topology used on the reconfigurable fabric. In order to make communication as efficient as possible, the application sub-structures or partitions need to be mapped to appropriate Compute Elements on the fabric, such that frequent...
The rapid evolution of reconfigurable computing places a great demand on Floating Point Multipliers (FPMs) capable of supporting a wide range of application domains from scientific computing to multimedia applications. While the former needs the support of higher precision formats like Double Precision (DP) / Extended Precision (EP), the latter needs Sing...
Coarse Grain Reconfigurable Architectures (CGRA) support spatial and temporal computation to speedup execution and reduce reconfiguration time. Thus compilation involves partitioning instructions spatially and scheduling them temporally. The task of partitioning is governed by the opposing forces of being able to expose as much parallelism as possi...
Video decoders used in emerging applications need to be flexible to handle a large variety of video formats and deliver scalable performance to handle wide variations in workloads. In this paper we propose a unified software and hardware architecture for video decoding to achieve scalable performance with flexibility. The light weight processor til...
Coarse Grain Reconfigurable Architectures (CGRA) support Spatial and Temporal computation to speedup execution and reduce reconfiguration time. Thus compilation involves partitioning instructions spatially and scheduling them temporally. We extend Edge-Betweenness Centrality scheme, originally used for detecting community structures in social and bi...
Flexibility in implementation of the underlying field algebra kernels often dictates the life-span of an Elliptic Curve Cryptography solution. The systems/methods designed to realize binary field arithmetic operations can be tuned either for performance or for flexibility. Usually flexibility of these solutions adversely affects their performance....
The emerging trend of multicore servers promises to be the panacea for all data-center issues with system virtualization as the enabling technology. System virtualization allows one to create virtual replicas of the physical system, over which independent virtual machines can be created, complete with their own, individual operating systems, software,...
Prevalent and popular virtualization technologies have concentrated on consolidating servers based on the CPU component of the workload. Other system resources, particularly I/O devices like network interfaces and disks, have always been designed to be under the control of the OS that is managing system resources. Sharing of these devices has been through...
3GPP Long Term Evolution (LTE) is targeted towards variable transmission bandwidths to improve universal mobile telecommunications. This requires support for variable point FFT/IFFT (128 through 4096). ASIC solutions, one for each particular FFT size, are not cost-effective, while DSP solutions are not performance-effective. We hence provide a solution on RED...
In the world of high performance computing, considerable effort has gone into accelerating Numerical Linear Algebra (NLA) kernels like QR Decomposition (QRD) with the added advantage of reconfigurability and scalability. While popular custom hardware solutions in the form of systolic arrays can deliver high performance, they are not scalable, and hence not com...
Numerical Linear Algebra (NLA) kernels are at the heart of all computational problems. These kernels require hardware acceleration for increased throughput. NLA Solvers for dense and sparse matrices differ in the way the matrices are stored and operated upon although they exhibit similar computational properties. While ASIC solutions for NLA Solver...
REDEFINE [3] is a polymorphic ASIC, in which arbitrary computational structures on hardware are defined at runtime. The REDEFINE execution fabric comprises Compute Elements (CEs) interconnected by a Honeycomb network, which also serves as the distributed Network-on-chip. Each computational structure is dynamically assigned to a subset of the CEs on the...
In Dynamically Reconfigurable Processors (DRPs), compilation involves breaking an application into sub-tasks for piecewise execution on the fabric. These sub-tasks are sequenced based on data and control dependences. In DRPs, sub-task prefetching is used to hide the reconfiguration time while another sub-task executes. In REDEFINE, our target DRP,...
REDEFINE is a runtime reconfigurable hardware platform. In this paper, we trace the development of a runtime reconfigurable hardware from a general purpose processor, by eliminating certain characteristics such as: Register files and Bypass network. We instead allow explicit write backs to the reservation stations as in Transport Triggered Architec...
A comparison between an automated and a semi-custom design and synthesis of an H.264 restricted baseline profile decoder is the subject of this paper. The automated approach models the H.264 decoder as a Kahn Process Network (KPN) that is mapped on a multi-processor Field Programmable Gate Array (FPGA) execution platform. The semi-custom approach...
RECONNECT is a network-on-chip using a honeycomb topology. In this paper we focus on properties of general rules applicable to a variety of routing algorithms for the NoC which take into account the missing links of the honeycomb topology when compared to a mesh. We also extend the original proposal and show a method to insert and extract data to a...
Emerging embedded applications are based on evolving standards (e.g., MPEG2/4, H.264/265, IEEE802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single chip solution that can dynamically interoperate between different standards and their derivatives. In order to achieve high resource utilizat...
Flexible constraint length channel decoders are required for software defined radios. This paper presents a novel scalable scheme for realizing flexible constraint length Viterbi decoders on a de Bruijn interconnection network. Architectures for flexible decoders using the flattened butterfly and shuffle-exchange networks are also described. It is...