Table 2 - uploaded by S. K. Nandy
Source publication
In this paper we develop compilation techniques for the realization of applications described in a High Level Language (HLL)
onto a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps) that are subgraphs of a data
flow graph (of an application) and comprise elementary operations that have strong producer-consumer...
Contexts in source publication
Context 1
... is because some of the high-latency inter-BB transports are converted into intra-HyperOp transports. Table 2 summarizes the maximum buffer size needed to store HyperOps waiting for input operands. The larger the buffer, the greater the hardware complexity, since this hardware's primary task is matching operands. ...
Similar publications
Variable length encoding can considerably decrease code size in VLIW processors by reducing the number of bits wasted on encoding No Operations (NOPs). A processor may have different instruction templates in which different execution slots are implicitly NOPs, but not all combinations of NOPs may be supported by the instruction templates. The efficienc...
Citations
... Many CGRAs have been proposed to date, but few also provide an actual programming tool-chain that transforms high-level language code into a configuration bitstream [3], [5]. Instead, designers rely on manual optimizations and derive the configuration for each target application by hand, which is tedious and error-prone. ...
Coarse-Grained Reconfigurable Architectures (CGRAs) have proven advantageous over fine-grained architectures, massively parallel GPUs and generic CPUs in terms of energy and flexibility. However, the key challenge of programmability is preventing widespread adoption. To exploit the instruction level parallelism inherent to such architectures, optimal scheduling and mapping of algorithmic kernels is essential. Transforming an input algorithm in the form of a Data Flow Graph (DFG) into a CGRA schedule and mapping configuration is very challenging, due to the necessity to consider architectural details such as memory bandwidth requirements, communication patterns, pipelining and heterogeneity in order to extract maximum performance.
In this paper, an algorithm is proposed that employs Force-Directed Scheduling concepts to solve such scheduling and resource minimization problems. Our heuristic extensions are flexible enough for generic heterogeneous CGRAs, allowing the execution time of an algorithm to be estimated under different configurations while maximizing the utilization of available hardware. Besides our own experiments, we also compare against CGRA configurations produced by state-of-the-art mapping algorithms such as EPIMap; our schedule achieves optimal resource utilization while reducing overall DFG execution time by 39% on average.
... Similar to DEBUG_NOC [1][2][3], this define is used to print debugging messages for the Assembly Unit (AU) (refer to section 3.3.1). Together with the provided test bench, it is particularly useful for checking whether the conversion from the union representing different packet types into the bit structure of flits, and back, has been implemented correctly. ...
... RETARGET generates a data flow graph (DFG) of the application in which the nodes represent Basic Blocks (BBs) and the edges the dependencies among them. The DFG is then further processed to obtain Hyper Operations (HyperOps) [2,3], which are acyclic, disjoint, directed subgraphs of the DFG. In addition, it is required that between two HyperOps there is always a producer-consumer relationship (the convexity condition). ...
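The convexity condition on candidate subgraphs can be made concrete: no path may leave the candidate node set and re-enter it. The following is a minimal illustrative sketch, not code from RETARGET; the adjacency-map representation and the function name are assumptions for illustration.

```python
def is_convex(dfg, subset):
    """Check the convexity condition for a candidate HyperOp.

    dfg    : dict mapping each basic-block node to a list of successors
    subset : set of nodes proposed to be grouped into one HyperOp

    Convexity fails if any path leaves the subset and re-enters it, i.e.
    if the subset is reachable again from one of its outside successors.
    """
    # Nodes directly outside the subset that consume values produced inside it.
    frontier = [s for n in subset for s in dfg.get(n, []) if s not in subset]
    seen = set()
    # Walk forward from the frontier; re-entering the subset breaks convexity.
    while frontier:
        node = frontier.pop()
        if node in subset:
            return False
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(dfg.get(node, []))
    return True
```

For a diamond-shaped DFG A→B, A→C, B→D, C→D, the set {A, B, D} is not convex because the path A→C→D leaves the set and re-enters it, whereas {A, B} is convex.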
... In a 6 × 6 Fabric the relative destination tuple always equals (2,2). Hence toroidal links are included in this configuration as heavily as any other links. ...
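On a torus, a relative destination tuple like the one above is typically computed as a wrap-around-aware coordinate difference in each dimension. A small sketch under that assumption (the function name and the tie-breaking rule for distances of exactly half the ring are ours, not from the thesis):

```python
def torus_relative(src, dst, k=6):
    """Relative (dx, dy) from src to dst on a k x k torus.

    Picks the shorter of the direct and wrap-around hop count in each
    dimension; a tie (distance exactly k/2) resolves to the positive
    direction in this sketch.
    """
    def rel(a, b):
        d = (b - a) % k            # hops in the positive direction
        return d if d <= k // 2 else d - k   # prefer the shorter way round
    return (rel(src[0], dst[0]), rel(src[1], dst[1]))
```

For example, on a 6 × 6 fabric the route from (0,0) to (5,5) uses the toroidal links and comes out as (-1,-1) rather than (5,5).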
In this thesis a Network-on-Chip (NoC) router implementation called RECONNECT, realized in Bluespec System Verilog (BSV), is presented. It is highly configurable in terms of, among other things, flit size, the number of provided Input Port/Output Port (IP/OP) pairs, and support for reconfiguration at runtime. Depending on the number of available IP/OP pairs, the router can be integrated into different topologies; because it can be reconfigured at runtime, the router can even support multiple topologies. A developer is then able to choose, among the available topologies, the one that promises the highest performance for an application. However, this work concentrates only on tessellations such as toroidal mesh, honeycomb and hexagonal.
The routing algorithms that had to be developed or adapted are presented. In addition, a step-by-step example of developing the routing algorithm for the honeycomb topology is included in this thesis. This enables a system designer who wishes to use RECONNECT with a topology not discussed here, such as a hypercube or a ring, to develop the required routing algorithm quickly and easily.
The impact of the chosen topology on the execution time of several real-life algorithms has been analyzed by executing these algorithms on a target architecture called REDEFINE, a dataflow multi-processor consisting of Compute Elements (CEs) and Support Logic (SL). For this purpose, an NoC comprising RECONNECT routers, which establishes communication links among the CEs, has been integrated into REDEFINE. It was found that for very small algorithms the execution time does not depend on the choice of topology, whereas for larger applications such as AES encryption and decryption it becomes evident that the honeycomb topology performs worst and the hexagonal one best. However, in many cases the additional links provided by the hexagonal topology, compared with the mesh, are not utilized, because the REDEFINE SL is unaware of the topology. Hence the algorithm execution time for the mesh topology is often on par with the hexagonal one.
In addition to the chosen topology, it is investigated how the size of the flit affects these algorithms. As expected, the performance of the NoC decreases if the flit size is reduced so that packets have to be segmented into more flits. It is further analyzed whether the NoC performance is sufficient to support high-level streaming algorithms such as the H.264 decoder. Such algorithms must not only perform the necessary computations within a time constraint; the data also needs to be fed to the CEs fast enough. In H.264 the time constraint is the frame rate, meaning that each frame needs to be processed in a specified fraction of a second. The current RECONNECT implementation cannot deliver the data within this requirement. As a result, the necessity for a pipelined router version is presented.
To allow a fair comparison of network performance with implementations found in the current literature, and to validate this approach, the NoC has been put under stress by artificial traffic generators that can be configured to generate uniform and self-similar traffic patterns. Furthermore, different destination-address generation algorithms have been developed for each of these traffic patterns: normal (randomly selecting a destination located anywhere in the network), close-neighbor communication, bit complement, and tornado. It was observed that in general the honeycomb topology performs worst, followed by the mesh, and topped by the hexagonal topology. From the artificial traffic generators it can be concluded that the richer the topology, the higher the throughput.
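Bit complement and tornado are standard synthetic traffic patterns in NoC evaluation. The sketches below show how such destination generators are commonly defined; the node numbering, grid layout, and function names are illustrative assumptions, not the thesis's code.

```python
import random

def dest_normal(src, n):
    """Uniform random destination anywhere in an n-node network."""
    return random.randrange(n)

def dest_bit_complement(src, n):
    """Bit complement: flip every address bit of the source node id."""
    bits = (n - 1).bit_length()          # address width for n nodes
    return src ^ ((1 << bits) - 1)

def dest_tornado(src_xy, k):
    """Tornado on a k x k grid: send (ceil(k/2) - 1) hops along x,
    stressing the links halfway around a ring/torus dimension."""
    x, y = src_xy
    return ((x + (k + 1) // 2 - 1) % k, y)
```

With 16 nodes, for instance, bit complement sends node 5 (0101) to node 10 (1010), the classic worst case for dimension-ordered routing on a mesh.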
The different router designs have been synthesized to obtain approximate area and power consumption figures. Depending on the flit size, the single-cycle router, which is able to forward an incoming flit in the next clock cycle if no congestion occurs, dissipates between 13 and 35 mW in a honeycomb topology operating at a frequency of 450 MHz. The power increases by approximately 25% for each IP/OP pair added to a router integrated in a honeycomb topology. The area required for a router in a honeycomb network was found to be between 96167 and 301339 cells, depending on the flit size. A router supporting a mesh or a hexagonal topology needs 50% or 91% more area, respectively, than the honeycomb router.
Depending on the flit size, the pipelined version of the router dissipates between 70 and 270 mW, 75 and 294 mW, and 85 and 337 mW for the honeycomb, mesh and hexagonal topologies respectively. The area required for a single router is between 213898 and 839334 cells for honeycomb, 238139 and 957548 cells for mesh, and 286328 and 1182129 cells for hexagonal router configurations. The tremendous increase in both power dissipation and area is caused by the additional buffers required for each stage. The maximum clock frequency of the pipelined version reaches 1.4 GHz.
... An application written in the high-level language C is transformed into coarse-grain operations called "HyperOps" [19] by RETARGET, the compiler for REDEFINE. In addition, the compiler partitions each HyperOp into several pHyperOps, and each pHyperOp is assigned to a CE. ...
Numerical Linear Algebra (NLA) kernels are at the heart of all computational problems. These kernels require hardware acceleration for increased throughput. NLA solvers for dense and sparse matrices differ in the way the matrices are stored and operated upon, although they exhibit similar computational properties. While ASIC solutions for NLA solvers can deliver high performance, they are not scalable, and hence not commercially viable. In this paper, we show how NLA kernels can be accelerated on REDEFINE, a scalable runtime reconfigurable hardware platform. Compared to a software implementation, the Direct Solver (Modified Faddeev's algorithm) on REDEFINE shows a 29X improvement on average, and the Iterative Solver (Conjugate Gradient algorithm) shows a 15-20% improvement. We further show that the solution on REDEFINE is scalable to larger problem sizes without any notable degradation in performance.
... In order to specify explicit transports between two operations, the compiler -RETARGET [4] constructs a dataflow graph from the SSA representation for the application specification (written in C language), generated by LLVM [5]. LLVM transforms the application into SSA form based on a virtual instruction set architecture (VISA). ...
... The operations in the VISA are simple and non-orthogonal. The application is subdivided into smaller units called HyperOps [4]. HyperOps are constructed by grouping together basic blocks such that a total order of HyperOps can be constructed for execution. ...
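The requirement that a total order of HyperOps exists is equivalent to the HyperOp dependence graph being acyclic, so any topological sort yields one valid execution order. A minimal sketch (the dependence-map representation and names are assumptions for illustration, not from the compiler):

```python
from graphlib import TopologicalSorter

def hyperop_order(deps):
    """Return one total execution order for HyperOps.

    deps maps each HyperOp to the set of HyperOps it depends on
    (its producers). Raises CycleError if no total order exists.
    """
    return list(TopologicalSorter(deps).static_order())
```

For example, with H2 depending on H1 and H3 depending on both, any returned order schedules H1 before H2 before H3.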
... A SystemC/C++ based simulator developed in-house. The computation of time in seconds, derived from the number of cycles and the frequency, does not match the time reported in milliseconds; however, the results reported by Intel VTune are reproduced directly. ...
REDEFINE is a runtime reconfigurable hardware platform. In this paper, we trace the development of a runtime reconfigurable hardware platform from a general purpose processor by eliminating certain characteristics such as register files and the bypass network. We instead allow explicit write-backs to the reservation stations, as in a Transport Triggered Architecture (TTA), but use a dataflow paradigm unlike TTA. The compiler and hardware requirements for such a reconfigurable platform are detailed. A performance comparison of REDEFINE with a GPP yields a 1.91x improvement for the SHA-1 application. The performance can be improved further through the use of custom IP blocks inside the compute elements. This yields a 4x improvement in performance for the Shift-Reduce kernel, which is part of the field multiplication operation. We also list other optimizations to the platform to improve its performance.
... [2] provides a quantitative comparison between REDEFINE and FPGAs. RETARGET is a compiler tool chain that compiles applications to an intermediate form and converts it into dataflow graphs [1]. These dataflow graphs are directed graphs in which each node represents a HyperOp. ...
... It is the job of the HRM to identify "ready" HyperOps, arbitrate among them, and launch them for execution. While within a HyperOp a static dataflow execution paradigm is followed, across HyperOps a dynamic dataflow schedule is used [3], [1]. The Global Wait-Match Unit (GWMU), resident within the HRM, holds the HyperOps waiting for input operands. ...
... In the case of a load operation, the memory address is sent along with the relative X, Y coordinates of the receiving CE and the relative X, Y coordinates of the CE that processes the next load/store instruction in the memory chain. (The memory chains are formed to maintain sequential memory consistency [3], [1].) In the case of a store operation, the memory address, the data to be stored, and the acknowledgement address are sent to the LSU. ...
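The load and store payloads described above can be modeled as two simple record types. This is only a sketch of the fields named in the text; the field names and types are illustrative assumptions, not the actual flit format.

```python
from dataclasses import dataclass

@dataclass
class LoadRequest:
    """Fields a load sends, per the description above (names are ours)."""
    mem_addr: int   # memory address to read
    recv_dx: int    # relative X of the CE receiving the loaded value
    recv_dy: int    # relative Y of that CE
    next_dx: int    # relative X of the CE with the next op in the memory chain
    next_dy: int    # relative Y of that CE

@dataclass
class StoreRequest:
    """Fields a store sends to the LSU, per the description above."""
    mem_addr: int   # memory address to write
    data: int       # value to be stored
    ack_addr: int   # where the acknowledgement is sent
```

The chained next-CE coordinates are what let the LSU preserve the sequential order of memory operations across CEs.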
In this paper we explore an implementation of a high-throughput, streaming application on REDEFINE-v2, an enhancement of REDEFINE. REDEFINE is a polymorphic ASIC combining the flexibility of a programmable solution with the execution speed of an ASIC. In REDEFINE, Compute Elements are arranged in an 8x8 grid connected via a Network on Chip (NoC) called RECONNECT, to realize the various macrofunctional blocks of an equivalent ASIC. For a 1024-point FFT we carry out an application-architecture design space exploration by examining various characterizations of the Compute Elements in terms of the size of the instruction store. We further study the impact of using application-specific, vectorized FUs. By setting up different partitions of the FFT algorithm for persistent execution on REDEFINE-v2, we derive the benefits of pipelined execution for higher performance. The impact of the REDEFINE-v2 micro-architecture for any arbitrary N-point FFT (N > 4096) is also analyzed. We report the various algorithm-architecture tradeoffs in terms of area and execution speed against an ASIC implementation. In addition we compare the performance gain with respect to a GPP.
Trading patterns in financial markets determine the conduct of pricing returns, risk management and market surveillance. The search for these patterns usually involves the analysis of large data sets. A direct tool to represent the interactions within a data set is a graph. This study seeks to better understand the dynamics of financial markets by considering transaction costs as the main ingredient for the construction of a market graph. Based on 168 global financial instruments, a market graph is obtained that features strong connectivity, some traces of a power law in the degree distribution, and an intensive presence of cliques. Thereby, very specific trading blocs are behind both the transmission of disturbances and the driving of market behavior. For example, infrastructure, sustainability and commodity indexes from APEC, the EU and NAFTA account for most of the interactions in markets.
This work presents a methodology for synchronizing distributed Finite State Machines (FSMs) that are generated while implementing different algorithms on a coarse-grain reconfigurable architecture. These FSMs interact with each other while executing algorithms and depend on each other; thus they need to be synchronized for correct execution. The algorithms presented in this paper make appropriate use of the different strategies available for synchronizing these FSMs. The tool hides all low-level details from the programmer. It lets the designer focus on the details of the algorithm (at a higher level of abstraction), while cycle-by-cycle timings are resolved automatically.
The energy consumption of buildings, from construction to operation, accounts for nearly 45% of total energy consumption, so energy saving and emission reduction in the building field have a long way to go. Taking consumer demand as the starting point, combined with the current energy consumption and management situation of commercial housing in our country, this article explores ways to realize energy saving and emission reduction effectively through contract energy management in the civil building field, with the purpose of promoting the concept of green consumption and the implementation of green building.