Figure 12 - uploaded by Alexander Fell

Mapping of 1024-point FFT onto REDEFINE-v2
Source publication
In this paper we explore an implementation of a high-throughput, streaming application on REDEFINE-v2, which is an enhancement of REDEFINE. REDEFINE is a polymorphic ASIC combining the flexibility of a programmable solution with the execution speed of an ASIC. In REDEFINE Compute Elements are arranged in an 8x8 grid connected via a Network on Chip...
Citations
... However, these two kernels have very different datapaths and storage requirements. REDEFINE as proposed in [16], [17] is an architecture framework that can be customized for an application domain to meet the desired performance. In this paper we specifically design and implement domain specialization of REDEFINE for NLA kernels. ...
... REDEFINE is an execution engine in which multiple tiles are connected through a toroidal honeycomb packet-switched network [17]. Each tile comprises a Compute Element (CE) and a Router. ...
... In order to tailor REDEFINE for a specific application domain, compiler directives may be used to force partitioning and assignment of HyperOps. Further, domain-specific Custom Function Units (CFUs), which are microarchitectural hardware assists, may be handcrafted to work in tandem with the ALU [17]. In the following sub-sections we elaborate on NLA-specific enhancements to REDEFINE in order to meet expected performance goals in a scenario where inputs are streamed, to: ...
Numerical Linear Algebra (NLA) kernels are at the heart of all computational problems. These kernels require hardware acceleration for increased throughput. NLA Solvers for dense and sparse matrices differ in the way the matrices are stored and operated upon, although they exhibit similar computational properties. While ASIC solutions for NLA Solvers can deliver high performance, they are not scalable, and hence are not commercially viable. In this paper, we show how NLA kernels can be accelerated on REDEFINE, a scalable runtime reconfigurable hardware platform. Compared to a software implementation, the Direct Solver (Modified Faddeev's algorithm) on REDEFINE shows a 29X improvement on average, and the Iterative Solver (Conjugate Gradient algorithm) shows a 15-20% improvement. We further show that the solution on REDEFINE scales to larger problem sizes without any notable degradation in performance.
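For reference, the Conjugate Gradient method named above can be sketched in its textbook form; this is a generic formulation for illustration only, not the REDEFINE implementation described in the abstract.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Textbook Conjugate Gradient for a symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # next A-conjugate direction
        rs_old = rs_new
    return x
```

Each iteration is dominated by one matrix-vector product and a few vector operations, which is what makes the kernel attractive for streaming acceleration.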
... The Support Logic comprises the HyperOp Launcher (HL), Load Store Unit (LSU), Inter HyperOp Data Forwarder (IHDF), Hardware Resource Manager (HRM) and Resource Binder (RB). In [4], a functional description of these modules is briefly provided. In REDEFINE, diverse data paths are composed in terms of computational structures at runtime. ...
REDEFINE [3] is a polymorphic ASIC, in which arbitrary computational structures on hardware are defined at runtime. The REDEFINE execution fabric comprises Compute Elements (CEs) interconnected by a Honeycomb network, which also serves as the distributed Network-on-Chip. Each computational structure is dynamically assigned to a subset of the CEs on the execution fabric by the REDEFINE support logic. A HLL specification of the application is compiled into Hyper Operations (HyperOps) by the REDEFINE compiler [3], where each HyperOp is a set of interacting operations. The compiler also determines partitions of the HyperOps (pHyperOps) that can be assigned to CEs to suitably meet the structural constraints of the execution fabric. In this paper we propose an algorithm to map HyperOps onto computational structures. A pHyperOp communication graph (PCG) captures the communication between the various pHyperOps. Through a sequence of transformations, the PCG is transformed into a Cayley tree. The Cayley tree is then overlaid on the Cayley graph to form a computational structure. The proposed mapping algorithm offers a solution that incurs a penalty of 18% on average over that of the optimal.
... Explicit Transports are facilitated through the use of a Network on Chip [6]. Any data communication across HyperOps, i.e. inter-HyperOp data traffic is facilitated through the inter-HyperOp data forwarder ( Figure 5; [7]) and data is stored in the global wait match unit, which is a part of the Scheduler. For performing transports, the operations of the HyperOp need to know the exact placement of the dependent operations and their position on the reconfigurable fabric. ...
... The GPP run was performed on a Pentium 4 running at 2.26GHz. The number of cycles was measured using Intel VTune, as described in [7]. The REDEFINE cycle count was obtained through a cycle-accurate simulation, for an 8x8 fabric of tiles. ...
... The performance of REDEFINE with a custom FU is about 4x better than without one. A similar improvement in performance is seen in the context of FFT, as reported in [7] (Figure 5). ...
REDEFINE is a runtime reconfigurable hardware platform. In this paper, we trace the development of a runtime reconfigurable hardware from a general purpose processor, by eliminating certain characteristics such as register files and the bypass network. We instead allow explicit write-backs to the reservation stations as in a Transport Triggered Architecture (TTA), but use a dataflow paradigm unlike TTA. The compiler and hardware requirements for such a reconfigurable platform are detailed. The performance comparison of REDEFINE with a GPP yields a 1.91x improvement for the SHA-1 application. The performance can be improved further through the use of custom IP blocks inside the compute elements. This yields a 4x improvement in performance for the Shift-Reduce kernel, which is a part of the field multiplication operation. We also list other optimizations to the platform so as to improve its performance.
Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as embedded application processing units in computing platforms for Exascale computing. Such CGRAs are distributed-memory multi-core compute elements on a chip that communicate over a Network-on-Chip (NoC). Numerical Linear Algebra (NLA) kernels are key to several high performance computing applications. In this paper we propose a systematic methodology to obtain the specification of Compute Elements (CEs) for such CGRAs. We analyze block Matrix Multiplication and block LU Decomposition algorithms in the context of a CGRA, and obtain theoretical bounds on communication requirements and memory sizes for a CE. Support for high performance custom computations common to NLA kernels is met through Custom Function Units (CFUs) in the CEs. We present results to justify the merits of such CFUs.
In this paper we present a framework for realizing arbitrary instruction set extensions (IEs) that are identified post-silicon. The proposed framework has two components, viz., an IE synthesis methodology and the architecture of a reconfigurable data-path for realization of such IEs. The IE synthesis methodology ensures maximal utilization of resources on the reconfigurable data-path. In this context we present the techniques used to realize IEs for applications that demand high throughput or that must process data streams. The reconfigurable hardware, called HyperCell, comprises a reconfigurable execution fabric. The fabric is a collection of interconnected compute units. A typical use case of HyperCell is as a co-processor with a host, accelerating the execution of IEs that are defined post-silicon. We demonstrate the effectiveness of our approach by evaluating the performance of some well-known integer kernels realized as IEs on HyperCell. Our methodology for realizing IEs through HyperCells permits overlapping of potentially all memory transactions with computations. We show significant improvement in performance for streaming applications over general purpose processor based solutions, by fully pipelining the data-path.
In Dynamically Reconfigurable Processors (DRPs), compilation involves breaking an application into sub-tasks for piecewise execution on the fabric. These sub-tasks are sequenced based on data and control dependences. In DRPs, sub-task prefetching is used to hide the reconfiguration time while another sub-task executes. In REDEFINE, our target DRP, sub-tasks are referred to as HyperOps. Determining the successor of a HyperOp requires merging information from the control flow graph and the HyperOp dataflow graph. Succession in many cases is data dependent. Since hardware branch predictors cannot be applied due to the non-binary branches, we employ a speculative prefetch unit together with a profile-based prediction scheme. Simulation results show around a 7-33% reduction in overall execution time, when compared to the execution time without prefetching. We observe better performance when fewer resources on the fabric are used to execute prefetched HyperOps.
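The profile-based prediction idea can be sketched as a toy successor predictor: record which HyperOp followed which during profiling, then predict the most frequently observed successor. This is an illustration of the general technique only; the actual prefetch unit described above also merges control-flow and dataflow information.

```python
from collections import Counter, defaultdict

class ProfilePredictor:
    """Toy profile-based successor predictor for non-binary 'branches':
    a HyperOp may have many possible successors, so we count observed
    transitions and predict the most frequent one."""

    def __init__(self):
        self.profile = defaultdict(Counter)  # hyperop -> successor counts
        self.prev = None

    def observe(self, hyperop):
        """Record one step of an executed HyperOp trace."""
        if self.prev is not None:
            self.profile[self.prev][hyperop] += 1
        self.prev = hyperop

    def predict(self, hyperop):
        """Return the most frequently observed successor, or None."""
        succ = self.profile[hyperop]
        return succ.most_common(1)[0][0] if succ else None
```

A prefetch unit built on such a table would speculatively load the predicted HyperOp's configuration while the current one executes, falling back to a demand fetch on a misprediction.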
In the world of high performance computing, huge efforts have been put into accelerating Numerical Linear Algebra (NLA) kernels like QR Decomposition (QRD), with the added advantages of reconfigurability and scalability. While popular custom hardware solutions in the form of systolic arrays can deliver high performance, they are not scalable, and hence not commercially viable. In this paper, we show how systolic solutions of QRD can be realized efficiently on REDEFINE, a scalable runtime reconfigurable hardware platform. We propose various enhancements to REDEFINE to meet the custom needs of accelerating NLA kernels. We further perform a design space exploration of the proposed solution for an arbitrary application of size n × n. We determine the right size of the sub-array in accordance with the optimal pipeline depth of the core execution units and the number of such units to be used per sub-array.
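The per-cell operation of a systolic QRD array is the Givens rotation; a plain sequential sketch of QRD via Givens rotations is shown below for reference. This is the textbook kernel, not the REDEFINE realization described in the abstract.

```python
import numpy as np

def givens_qr(A):
    """QR decomposition via Givens rotations, the operation a systolic
    QRD array performs cell by cell. Returns Q, R with A = Q @ R."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):                      # zero column j below the diagonal
        for i in range(m - 1, j, -1):
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])   # 2x2 rotation annihilating b
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T  # accumulate Q
    return Q, R
```

In a systolic array, each rotation is computed by a boundary cell and propagated across a row of internal cells, which is why the kernel maps naturally onto a grid of compute elements.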
3GPP Long Term Evolution (LTE) targets variable transmission bandwidths to improve universal mobile telecommunications. This requires support for variable-point FFT/IFFT (128 through 4096). ASIC solutions, each dedicated to a particular FFT size, are not cost-effective, while DSP solutions are not performance-effective. We hence provide a solution on REDEFINE, a dynamically reconfigurable platform, by appropriately characterizing the compute elements. We make use of the idle resources of REDEFINE to improve throughput by supporting multiple channels of FFT/IFFT.
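A radix-2 decimation-in-time FFT handles any power-of-two length, which covers the 128 through 4096 point range LTE requires from a single kernel; the sketch below is the textbook recursion, not the REDEFINE mapping.

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT. len(x) must be a power of two,
    so one routine serves every LTE size from 128 to 4096 points."""
    n = len(x)
    if n == 1:
        return x[:]
    even = fft(x[0::2])          # transform of even-indexed samples
    odd = fft(x[1::2])           # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

An IFFT of the same sizes follows by conjugating the twiddle factors and scaling by 1/n, so one parameterizable datapath can serve both directions.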