Ranjani Narayan's research while affiliated with Morphing Machines and other places

Publications (58)

Article
Full-text available
We present efficient realization of Generalized Givens Rotation (GGR) based QR factorization that achieves 3-100x better performance in terms of Gflops/watt over state-of-the-art realizations on multicore, and General Purpose Graphics Processing Units (GPGPUs). GGR is an improvement over classical Givens Rotation (GR) operation that can annihilate...
Article
Full-text available
In this paper, we present efficient realization of Kalman Filter (KF) that can achieve up to 65% of the theoretical peak performance of underlying architecture platform. KF is realized using Modified Faddeeva Algorithm (MFA) as a basic building block due to its versatility and REDEFINE Coarse Grained Reconfigurable Architecture (CGRA) is used as a...
Conference Paper
REDEFINE is a distributed dynamic dataflow architecture, designed for exploiting parallelism at various granularities as an embedded system-on-chip (SoC). This paper dwells on the flexibility of REDEFINE architecture and its execution model in accelerating real-time applications coupled with a WCET analyzer that computes execution time bounds of re...
Article
Transistor supply voltages no longer scale at the same rate as transistor density and frequency of operation. This has led to the Dark Silicon problem, wherein only a fraction of transistors can operate at maximum frequency and nominal voltage, in order to ensure that the chip functions within the power and thermal budgets. Heterogeneous computing...
Article
Full-text available
We present efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design where we achieve performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. T...
Article
In this paper we present design and analysis of a scalable real-time Face Recognition (FR) module to perform 450 recognitions per second. We introduce an algorithm for FR, which is a combination of Weighted Modular Principal Component Analysis and Radial Basis Function Neural Networks. This algorithm offers better recognition accuracy in various pr...
Article
Full-text available
Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate performance of the HPC applications. Performance in such tuned packages is attained through tuning of several algorithmic and architectural parameters such as number of pa...
Article
Full-text available
Basic Linear Algebra Subprograms (BLAS) play a key role in high performance and scientific computing applications. Experimentally, yesteryear multicore and General Purpose Graphics Processing Units (GPGPUs) are capable of achieving up to 15 to 57% of the peak performance at 65W to 240W of power respectively in underlying platform for compute bound op...
Conference Paper
Performance of an application on a many-core machine primarily hinges on the ability of the architecture to exploit parallelism and to provide fast memory accesses. Exploiting parallelism in static application graphs on a multicore target is relatively easy owing to the fact that compilers can map them onto an optimal set of processing elements and...
Conference Paper
In this paper we propose architecture of a processor for vector operations involved in on-line learning of neural networks. We target to implement on-line learning on a Radial Basis Function Neural Network (RBFNN) based Face Recognition (FR) system that has pseudo inverse computation as an essential component during training. Synaptic weights of RB...
Conference Paper
With transistor energy efficiency not scaling at the same rate as transistor density and frequency, CMOS technology has hit a utilization wall, whereby large portions of the chip remain under clocked. To improve performance, while keeping power dissipation at a realistic level, future computing devices will consist of heterogeneous application spec...
Article
Full-text available
The growing number of applications and processing units in modern Multiprocessor Systems-on-Chips (MPSoCs) come along with reduced time to market. Different IP cores can come from different vendors, and their trust levels are also different, but typically they use Network-on-Chip (NoC) as their communication infrastructure. An MPSoC can have multip...
Conference Paper
Full-text available
The topology and channel width in Network-on-Chips (NoC) impact the throughput and latency and therefore the area of deployment. In this paper an NoC based on a three dimensional, toroidal rectangular honeycomb topology using a two tupled (x, y) address, is discussed. It employs a minimal and deterministic routing algorithm utilizing Virtual Chann...
Article
Full-text available
Fast Fourier Transform is an integral part of OFDM systems. FFT is the most compute intensive operation that critically affects the OFDM system performance. In order to support the various OFDM standards, a scalable and reconfigurable FFT architecture is necessary. This paper presents an energy efficient and scalable FFT architecture, which can be...
Conference Paper
Full-text available
NoC based high performance MP-SoCs can have multiple secure regions or Trusted Execution Environments (TEEs). These TEEs can be separated by non-secure regions or Rich Execution Environments (REEs) in the same MP-SoC. All communications between two TEEs need to cross the in-between REEs. Without any security mechanisms, these traffic flows can face...
Conference Paper
Full-text available
Radial Basis Function Neural Networks (RBFNN) are used in variety of applications such as pattern recognition, control and time series prediction and nonlinear identification. RBFNN with Gaussian Function as the basis function is considered for classification purpose. Training is done offline using K-means clustering method for center learning and...
Conference Paper
LU and QR factorizations are the computationally dear part of many applications ranging from large scale simulations (e.g. Computational fluid dynamics) to augmented reality. These factorizations exhibit time complexity of O(n^3) and are difficult to accelerate due to presence of bandwidth bound kernels, BLAS-1 or BLAS-2 (level-1 or level-2 Basic L...
Conference Paper
Full-text available
The objective of this paper is to come up with a scalable modular hardware solution for real-time Face Recognition (FR) on large databases. Existing hardware solutions use algorithms with low recognition accuracy suitable for real-time response. In addition, database size for these solutions is limited by on-chip resources making them unsuitable...
Conference Paper
FFT is the most compute intensive operation that critically affects the OFDM system performance. In order to support the various OFDM standards, a scalable and reconfigurable FFT architecture is necessary. This paper presents an energy efficient and scalable FFT architecture, which can be dynamically reconfigured to adapt to specifications of diffe...
Article
In this paper we present a framework for realizing arbitrary instruction set extensions (IE) that are identified post-silicon. The proposed framework has two components viz., an IE synthesis methodology and the architecture of a reconfigurable data-path for realization of such IEs. The IE synthesis methodology ensures maximal utilization of res...
Conference Paper
Coarse Grained Reconfigurable Architectures (CGRA) are emerging as embedded application processing units in computing platforms for Exascale computing. Such CGRAs are distributed memory multi-core compute elements on a chip that communicate over a Network-on-chip (NoC). Numerical Linear Algebra (NLA) kernels are key to several high performance comp...
Conference Paper
QR decomposition (QRD) is a widely used Numerical Linear Algebra (NLA) kernel with applications ranging from SONAR beam forming to wireless MIMO receivers. In this paper, we propose a novel Givens Rotation (GR) based QRD (GR-QRD) where we reduce the computational complexity of GR and exploit higher degree of parallelism. This low complexity Column-...
Conference Paper
This paper presents a Radix-4^3 based FFT architecture suitable for OFDM based WLAN applications. The radix-4^3 parallel unrolled architecture presented here uses a radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. A 64 point FFT processor based on the proposed architecture has be...
Conference Paper
Full-text available
In this paper we propose a fully parallel 64K point radix-4^4 FFT processor. The radix-4^4 parallel unrolled architecture uses a novel radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. The radix-4^4 block can take all 256 inputs in parallel and can use the select control signals to...
Article
Full-text available
The highest levels of security can be achieved through the use of more than one type of cryptographic algorithm for each security function. In this paper, the REDEFINE polymorphic architecture is presented as an architecture framework that can optimally support a varied set of crypto algorithms without losing high performance. The presented solutio...
Conference Paper
In this paper we present a hardware-software hybrid technique for modular multiplication over large binary fields. The technique involves application of Karatsuba-Ofman algorithm for polynomial multiplication and a novel technique for reduction. The proposed reduction technique is based on the popular repeated multiplication technique and Barrett r...
Article
Coarse Grain Reconfigurable Architectures (CGRA) support spatial and temporal computation to speedup execution and reduce reconfiguration time. Thus compilation involves partitioning instructions spatially and scheduling them temporally. The task of partitioning is governed by the opposing forces of being able to expose as much parallelism as possi...
Conference Paper
Coarse Grain Reconfigurable Architectures (CGRA) support Spatial and Temporal computation to speedup execution and reduce reconfiguration time. Thus compilation involves partitioning instructions spatially and scheduling them temporally. We extend Edge-Betweenness Centrality scheme, originally used for detecting community structures in social and bi...
Conference Paper
Flexibility in implementation of the underlying field algebra kernels often dictates the life-span of an Elliptic Curve Cryptography solution. The systems/methods designed to realize binary field arithmetic operations can be tuned either for performance or for flexibility. Usually flexibility of these solutions adversely affects their performance....
Conference Paper
3GPP Long Term Evolution (LTE) is targeted towards variable transmission bandwidths to improve universal mobile telecommunications. This requires support for variable point FFT/IFFT (128 through 4096). ASIC solutions, one for each particular FFT size, are not cost-effective, while DSP solutions are not performance-effective. We hence provide a solution on RED...
Conference Paper
Full-text available
Numerical Linear Algebra (NLA) kernels are at the heart of all computational problems. These kernels require hardware acceleration for increased throughput. NLA Solvers for dense and sparse matrices differ in the way the matrices are stored and operated upon although they exhibit similar computational properties. While ASIC solutions for NLA Solver...
Conference Paper
In the world of high performance computing huge efforts have been put to accelerate Numerical Linear Algebra (NLA) kernels like QR Decomposition (QRD) with the added advantage of reconfigurability and scalability. While popular custom hardware solution in form of systolic arrays can deliver high performance, they are not scalable, and hence not com...
Conference Paper
In Dynamically Reconfigurable Processors (DRPs), compilation involves breaking an application into sub-tasks for piecewise execution on the fabric. These sub-tasks are sequenced based on data and control dependences. In DRPs, sub-task prefetching is used to hide the reconfiguration time while another sub-task executes. In REDEFINE, our target DRP,...
Conference Paper
Full-text available
REDEFINE is a runtime reconfigurable hardware platform. In this paper, we trace the development of a runtime reconfigurable hardware from a general purpose processor, by eliminating certain characteristics such as: Register files and Bypass network. We instead allow explicit write backs to the reservation stations as in Transport Triggered Architec...
Conference Paper
A comparison between an automated, and a semi-custom design, and synthesis of a H.264 restricted baseline profile decoder is the subject of this paper. The automated approach models the H.264 decoder as a Kahn Process Network (KPN), that is mapped on a multi-processor Field Programmable Gate Array (FPGA) execution platform. The semi-custom approach...
Conference Paper
Full-text available
RECONNECT is a network-on-chip using a honeycomb topology. In this paper we focus on properties of general rules applicable to a variety of routing algorithms for the NoC which take into account the missing links of the honeycomb topology when compared to a mesh. We also extend the original proposal and show a method to insert and extract data to a...
Article
Full-text available
Emerging embedded applications are based on evolving standards (e.g., MPEG2/4, H.264/265, IEEE802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single chip solution that can dynamically interoperate between different standards and their derivatives. In order to achieve high resource utilizat...
Conference Paper
Full-text available
This paper reports the design of an input-triggered polymorphic ASIC for H.264 baseline decoder. Hardware polymorphism is achieved by selectively reusing hardware resources at system and module level. Complete design is done using ESL design tools following a methodology that maintains consistency in testing and verification throughout the design f...
Conference Paper
Full-text available
In this paper we develop compilation techniques for the realization of applications described in a High Level Language (HLL) onto a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps) that are subgraphs of a data flow graph (of an application) and comprise elementary operations that have strong producer-consumer...
Conference Paper
Full-text available
In this paper we explore an implementation of a high-throughput, streaming application on REDEFINE-v2, which is an enhancement of REDEFINE. REDEFINE is a polymorphic ASIC combining the flexibility of a programmable solution with the execution speed of an ASIC. In REDEFINE Compute Elements are arranged in an 8x8 grid connected via a Network on Chip...
Conference Paper
Full-text available
A polymorphic ASIC is a runtime reconfigurable hardware substrate comprising compute and communication elements. It is a "future proof" custom hardware solution for multiple applications and their derivatives in a domain. Interoperability between application derivatives at runtime is achieved through hardware reconfiguration. In this paper...
Conference Paper
Full-text available
Application accelerators are predominantly ASICs. The cost of ASIC solutions is orders of magnitude higher than that of programmable processing cores. Despite this, ASIC solutions are preferred when both high performance and low power are the target. ASICs offer no flexibility in terms of being able to cater to application derivatives, unless this has b...
Conference Paper
Full-text available
Run-time interoperability between different applications based on H.264/AVC is an emerging need in networked infotainment, where media delivery must match the desired resolution and quality of the end terminals. In this paper, we describe the architecture and design of a polymorphic ASIC to support this. The H.264 decoding flow is partitioned into...
Conference Paper
In this paper we propose the architecture of a SoC fabric onto which applications described in a HLL are synthesized. The fabric is a homogeneous layout of computation, storage and communication resources on silicon. Through a process of composition of resources (as opposed to decomposition of applications), application specific computational struc...
Conference Paper
Full-text available
Grid environment, being a collection of heterogeneous and geographically distributed resources, is prone to many kinds of failures. In order to fully realize the potential of the grid environment, we need mechanisms that guarantee quality of service, even in the face of failures. In this paper, we introduce the notion of scheduling based on "availa...

Citations

... In this section we describe the QRD-MV Beamformer implementation [12] using the CGR algorithm described in [26]. The acceleration process is explained in the literature survey; here we show how column-wise Givens rotation is used to find the upper triangular matrix and how it is more efficient than the standard methods. ...
... The second main feature of this work is the decentralized aspect of the allocation process, which is an improvement upon our prior work, which relied on a centralized architecture [16,39]. Decentralized means here that no central element alone decides the allocation for the rest of the architecture. ...
... Givens rotation was introduced by Wallace Givens in 1950. Merchant et al. [16] generalized the Givens rotation algorithm to annihilate multiple elements of an input matrix simultaneously, where the annihilation regime spans over columns and rows. The generalization is intended to expose higher parallelism in Givens rotation and to reduce the total number of computations. ...
... QR factorization/decomposition (QRF/QRD) is a prevalent operation encountered in several engineering and scientific operations ranging from Kalman Filter (KF) to computational finance [1] [2]. QR factorization of a non-singular matrix A of size m × n is given by ...
... To ensure correct orientation of applications, a set of constraints used in [12] is also needed. In order to enforce this, the numbering of the CUs on the architecture is used. ...
... The Logistic map is the simplest and the easiest encryption technique [25,27,33,41]. It is a function of a nonlinear system generated by iterating the equation shown in Eq. 1 up to (x-1). ...
... This shift towards reconfigurable, adaptable systems is an inescapable need to guarantee tomorrow's performance and power consumption [9]: examples include new heterogeneous multi-core adaptable software architectures [10] and application-specific accelerators on reconfigurable hardware (FPGAs) [11]. The vast majority of modern cyber-physical systems are adaptable in some way [12], due to, e.g., market demand for upgrades [13]. ...
... By the performance of a pipeline, we mean its throughput. This is due to the widespread use of computing pipelines: for the design of signal processors [1,2], co-processors [3,4,5], applications for processing big data [6]. They are used for image processing [7], for load balancing [8,9], for processing big amounts of data and for supporting multithreaded cloud applications. ...
... In the case of Strassen's multiplication of two 2 × 2 matrices, the computation step required to obtain each element of the result matrix is different whereas the arithmetic steps to compute each element in naïve multiplication are uniform [23]. For instance, as shown in Figures 3a and 4a, in the multiplication of two 2 × 2 matrices to get C11, C12, C21 and C22, computing C11 requires four arithmetic steps, whereas computing C12 requires three steps. ...
... Additionally, to reduce hardware cost, the bit width on the critical path must be optimized. In general, bit-width optimization is implemented based on the result of error analysis [1,9,18,20,21,22,23,24,25,26,27,28,29]. However, it is time-consuming to dynamically monitor the error by exhaustively considering all combinations of inputs and outputs to optimize the bit width [8,30]. ...