Fig 1 - uploaded by Zoltán Endre Rákossy
An example of a time-extended fabric consisting of 4 PEs, connected to a NoC router within a mesh topology.


Source publication
Conference Paper
Full-text available
Coarse-Grained Reconfigurable Architectures (CGRAs) are proven to be advantageous over fine-grained architectures, massively parallel GPUs and generic CPUs in terms of energy and flexibility. However, the key challenge of programmability is preventing widespread adoption. To exploit the instruction-level parallelism inherent to such architectures, optim...

Context in source publication

Context 1
... time index is a time unit representing the greatest common divisor of the clock cycles required by all operations supported by the fabric. Further unidirectional edges are introduced connecting the same PE across the time indexes t_i and t_{i+1} (refer to the example given in Figure 1). When an edge hops across one or more time indexes, it is a delayed interconnect, possibly requiring local storage. ...
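To make the time extension concrete, the sketch below builds such a graph in Python from a spatial PE/interconnect description: PE nodes are replicated per time index, each PE is linked to its copy in the next index, and spatial hops become delayed edges. It is a minimal illustration of the excerpt above, not the authors' tool; the function name, the 2×2 mesh example, and the operation latencies are assumptions.

```python
from math import gcd
from functools import reduce

def time_extend(pe_edges, op_latencies, num_indexes):
    """Replicate a spatial PE graph across time indexes.

    pe_edges: iterable of (src_pe, dst_pe) spatial interconnect edges.
    op_latencies: clock-cycle counts of the supported operations; the
        time index is their greatest common divisor (per the excerpt).
    num_indexes: how many time indexes to unroll.
    Returns (time_unit, edges) where edges are ((pe, t), (pe', t')).
    """
    time_unit = reduce(gcd, op_latencies)          # one time index in cycles
    pes = {p for e in pe_edges for p in e}
    edges = set()
    for t in range(num_indexes - 1):
        for p in pes:                              # same PE across t_i -> t_{i+1}
            edges.add(((p, t), (p, t + 1)))
        for src, dst in pe_edges:                  # interconnect crossing one index:
            edges.add(((src, t), (dst, t + 1)))    # a delayed hop, may need local storage
    return time_unit, edges

# Example: 4 PEs in an assumed 2x2 mesh, 3 time indexes, op latencies 2/4/6 cycles.
mesh = [(0, 1), (1, 0), (2, 3), (3, 2), (0, 2), (2, 0), (1, 3), (3, 1)]
unit, g = time_extend(mesh, op_latencies=[2, 4, 6], num_indexes=3)
print(unit, len(g))   # -> 2 24
```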

Similar publications

Article
Full-text available
Many computation-intensive iterative or recursive applications commonly found in digital signal processing and image processing can be represented by data-flow graphs (DFGs). The execution of all tasks of a DFG is called an iteration, and the average computation time of an iteration is the iteration period. A great deal of research has...
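The iteration period mentioned above is bounded from below by the DFG's maximum cycle ratio (total computation time over total delay elements around a directed cycle), commonly called the iteration bound in the DFG literature. The brute-force sketch below computes it for a tiny example; the graph, delays and timings are invented for illustration and the enumeration is only meant for small graphs.

```python
def iteration_bound(nodes, edges, comp_time):
    """Lower bound on the iteration period of a DFG: the maximum, over all
    directed cycles, of (total computation time) / (total delay elements).
    edges: dict (u, v) -> number of delays (registers) on edge u -> v.
    comp_time: dict node -> computation time of that node's task.
    """
    succ = {}
    for (u, v), d in edges.items():
        succ.setdefault(u, []).append((v, d))

    best = 0.0

    def dfs(start, node, visited, t_sum, d_sum):
        nonlocal best
        for nxt, d in succ.get(node, []):
            if nxt == start and d_sum + d > 0:          # closed a cycle with >= 1 delay
                best = max(best, t_sum / (d_sum + d))
            elif nxt not in visited and nxt > start:    # each cycle found from its smallest node
                dfs(start, nxt, visited | {nxt}, t_sum + comp_time[nxt], d_sum + d)

    for s in nodes:
        dfs(s, s, {s}, comp_time[s], 0)
    return best

# Two coupled loops: 1 -> 2 -> 1 (one delay) and 2 -> 3 -> 2 (two delays).
edges = {(1, 2): 0, (2, 1): 1, (2, 3): 0, (3, 2): 2}
print(iteration_bound([1, 2, 3], edges, {1: 2.0, 2: 3.0, 3: 5.0}))   # -> 5.0
```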

Citations

... On the other hand, with moderate sacrifice, homogeneous CGRAs (such as HyCUBE [13], HReA [18] and CCF [20]) are usually able to simplify the compilation problem by providing a PE array (PEA) with symmetric PEs and interconnect networks between the PEs. The compilers [21-30] for those CGRAs use a modulo-scheduling-based approach to schedule the operations inside the data flow graph (DFG) of the loop kernel. Modulo scheduling can efficiently deploy a large DFG onto a CGRA with a limited number of PEs and fully exploit instruction-level and loop-level parallelism. ...
... Modulo scheduling is a promising way of deploying loops onto CGRAs [21, 24-27, 30, 41]. Previous works have formulated the CGRA modulo-scheduling problem as mapping the data flow graph of a loop body onto the modulo time-extended abstracted hardware resource graph [24, 42], and this mapping problem has been proven to be NP-complete [23, 24]. In 2002, researchers began to explore efficient heuristics [21-30], whose main goals are to generate a schedule that minimizes the initiation interval (II) between consecutive iterations of the loop while reducing compilation time. These algorithms may use different mapping strategies by leveraging different properties of the CGRA. ...
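As a concrete illustration of the II lower bound these modulo schedulers work against, the helper below computes the standard minimum II as the maximum of the resource-constrained and recurrence-constrained bounds. This is the generic textbook formula, not the formulation of any particular cited compiler; the operation count, PE count and recurrence values are invented.

```python
from math import ceil

def min_initiation_interval(num_ops, num_pes, recurrences):
    """Standard lower bound used by modulo schedulers: II >= max(ResMII, RecMII).

    recurrences: (total_latency, loop_carried_distance) per recurrence cycle
    in the DFG. Illustrative helper only.
    """
    res_mii = ceil(num_ops / num_pes)                               # resource-constrained bound
    rec_mii = max((ceil(lat / dist) for lat, dist in recurrences), default=1)
    return max(res_mii, rec_mii)

# 14 operations on a 2x2 CGRA (4 PEs), one recurrence of latency 3 and distance 1.
print(min_initiation_interval(14, 4, [(3, 1)]))                     # -> 4
```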
Article
Full-text available
Modulo-scheduled coarse-grained reconfigurable array (CGRA) processors have shown their potential for exploiting loop-level parallelism at high energy efficiency. However, these CGRAs need frequent reconfiguration during execution, which makes them suffer from large area and power overhead for context memory and context fetching. To tackle this challenge, this paper uses an architecture/compiler co-designed method for context reduction. From an architecture perspective, we carefully partition the context into several subsections and, whenever fetching a new context, fetch only the subsections that differ from the previous context word. We package each differing subsection with an opcode and index value to formulate a context-fetching primitive (CFP), and explore the hardware design space by providing centralized and distributed CFP-fetching CGRAs to support this CFP-based context-fetching scheme. From the software side, we develop a similarity-aware tuning algorithm and integrate it into state-of-the-art modulo scheduling and memory access conflict optimization algorithms. The whole compilation flow can efficiently improve the similarity between contexts in each PE, reducing both context-fetching latency and context footprint. Experimental results show that our HW/SW co-designed framework can improve area efficiency and energy efficiency by up to 34% and 21%, respectively, with only 2% performance overhead.
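A minimal sketch of the context-diffing idea, assuming a context word represented as a list of subsection strings; the opcode name and the tuple layout of the CFP are illustrative, not the paper's actual encoding.

```python
def diff_to_cfps(prev_ctx, new_ctx, opcode="UPD"):
    """Split a context word into subsections and fetch only those that differ
    from the previous context, each packaged with an opcode and its index.
    Field layout and opcode name are hypothetical.
    """
    assert len(prev_ctx) == len(new_ctx)
    return [(opcode, i, sub) for i, (old, sub) in enumerate(zip(prev_ctx, new_ctx))
            if sub != old]

prev = ["add r1", "route N->E", "const 0", "store"]
new  = ["add r1", "route N->S", "const 7", "store"]
print(diff_to_cfps(prev, new))   # only subsections 1 and 2 are re-fetched
```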
... We use ILP to formulate the enhanced scheduling for two reasons: 1) As our enhanced scheduling involves graph modification, e.g., routing-node insertion and edge cutting, traditional scheduling approaches such as list scheduling [18], force-directed scheduling [19], [20] or even SDC [21] cannot handle this problem. 2) An ILP approach can formulate most problems as clear mathematical representations and makes it possible to find the optimal solution [22], [23]. ...
... The earliest time step at which operator n can be scheduled, which can easily be obtained with As-Soon-As-Possible (ASAP) scheduling [19] for a given T_l. As shown in Fig. 5(b), the earliest time steps of operators 1, 4, 8 and 9 are 0, 3, 1 and 2, respectively.
• Latest time step for scheduling (L_n): the latest time step at which operator n can be scheduled, which can easily be obtained with As-Late-As-Possible (ALAP) scheduling [19] for a given T_l. As shown in Fig. 5 ...
• Latest time step for routing (L̄_n): the latest time step at which operator n's routing node can be inserted, which is equal to max_{n′ ∈ OE_n}(L_{n′} − 1), where OE_n denotes the set of sub-operators of operator n. ...
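The ASAP/ALAP bounds referred to above can be computed with two linear passes over the dependence graph. The sketch below assumes unit operation latency and topologically numbered nodes, and omits the routing-node handling of the cited formulation; the diamond graph is an invented example, not the Fig. 5 kernel.

```python
def asap_alap(edges, num_nodes, total_steps):
    """As-Soon-As-Possible / As-Late-As-Possible time-step bounds on a DFG.
    edges: list of (u, v) dependence edges; nodes 0..num_nodes-1 are assumed
    to already be a topological order. total_steps plays the role of T_l.
    """
    preds = {v: [] for v in range(num_nodes)}
    succs = {v: [] for v in range(num_nodes)}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    asap = [0] * num_nodes                       # earliest time step E_n
    for v in range(num_nodes):
        for u in preds[v]:
            asap[v] = max(asap[v], asap[u] + 1)

    alap = [total_steps - 1] * num_nodes         # latest time step L_n
    for v in reversed(range(num_nodes)):
        for w in succs[v]:
            alap[v] = min(alap[v], alap[w] - 1)
    return asap, alap

print(asap_alap([(0, 1), (0, 2), (1, 3), (2, 3)], num_nodes=4, total_steps=4))
# -> ([0, 1, 1, 2], [1, 2, 2, 3])
```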
Article
Full-text available
Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising solution for domain-specific applications thanks to their energy efficiency and flexibility. To improve performance on a CGRA, modulo scheduling is commonly applied to the Data Dependence Graph (DDG) of loops by minimizing the Initiation Interval (II) between adjacent loop iterations. The mapping process usually consists of scheduling and placement-and-routing (P&R). As existing approaches do not fully and globally explore the routing strategies of long dependencies in a DDG at the scheduling stage, the subsequent P&R is prone to failure, leading to performance loss. To this end, this paper proposes a routability-enhanced scheduling for CGRA mapping using an Integer Linear Programming (ILP) formulation, in which a globally optimized schedule is found to improve the success rate of P&R. Experimental results show that our approach achieves 1.12× and 1.22× performance speedup and 28.7% and 50.2% compilation time reduction, respectively, compared with two state-of-the-art heuristics.
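To show the flavor of such an ILP, the toy model below assigns operators to time steps under dependence and modulo resource constraints. It uses the PuLP modeling library with its bundled CBC solver purely as an assumed stand-in (pip install pulp); the paper's actual formulation additionally models routing-node insertion and edge cutting, which are omitted here.

```python
import pulp

def ilp_modulo_schedule(edges, num_nodes, ii, num_pes, horizon):
    """Toy ILP: binary x[n, t] = 1 if operator n is scheduled at time step t."""
    prob = pulp.LpProblem("modulo_schedule", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", [(n, t) for n in range(num_nodes)
                                    for t in range(horizon)], cat="Binary")
    # each operator is placed in exactly one time step
    for n in range(num_nodes):
        prob += pulp.lpSum(x[n, t] for t in range(horizon)) == 1
    # dependence: consumer at least one step after producer
    for u, v in edges:
        prob += (pulp.lpSum(t * x[v, t] for t in range(horizon))
                 >= pulp.lpSum(t * x[u, t] for t in range(horizon)) + 1)
    # modulo resource constraint: at most num_pes operators per slot mod II
    for slot in range(ii):
        prob += pulp.lpSum(x[n, t] for n in range(num_nodes)
                           for t in range(slot, horizon, ii)) <= num_pes
    # objective: schedule every operator as early as possible
    prob += pulp.lpSum(t * x[n, t] for n in range(num_nodes) for t in range(horizon))
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {n: next(t for t in range(horizon) if x[n, t].value() > 0.5)
            for n in range(num_nodes)}

print(ilp_modulo_schedule([(0, 1), (0, 2), (1, 3), (2, 3)],
                          num_nodes=4, ii=2, num_pes=2, horizon=4))
# -> {0: 0, 1: 1, 2: 1, 3: 2}
```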
... The DFG needs to be transformed to match the structure given by the Execution Fabric. Scheduling instructions temporally and mapping them spatially onto PUs has long been a focus of research, and many heuristics exist, such as [7], [8], [9], [10]. On the architectural side, many CGRAs have been proposed to date, such as ADRES [14], KressArray [11], Layered CGRA [19], [20], PACT XPP [5], [17] and REDEFINE [3]. ...
Article
Full-text available
In this paper, DFGenTool, a dataflow graph (DFG) generation tool, is presented, which converts loops in a sequential program written in a high-level language such as C into a DFG. DFGenTool adapts DFGs for mapping to Coarse-Grained Reconfigurable Architectures (CGRAs), enabling a variety of CGRA implementations and compilers to be benchmarked against a standard set of DFGs. Several kernels have been converted and are presented in this paper as case studies. The output of DFGenTool is in DOT, a popular graph description format that can be used with a variety of CGRA compilers. Furthermore, DFGenTool has been released as open source.
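As a usage illustration, a DFG for a small multiply-accumulate loop body can be emitted in DOT as sketched below. The node labels and the emission helper are hypothetical and need not match DFGenTool's actual output schema.

```python
def dfg_to_dot(name, nodes, edges):
    """Emit a small dataflow graph in DOT text.
    nodes: dict id -> operation mnemonic; edges: list of (src, dst) pairs.
    Attribute names are illustrative, not DFGenTool's schema.
    """
    lines = [f"digraph {name} {{"]
    lines += [f'  n{i} [label="{op}"];' for i, op in nodes.items()]
    lines += [f"  n{u} -> n{v};" for u, v in edges]
    lines.append("}")
    return "\n".join(lines)

# Loop body: c[i] = a[i] * b[i]
print(dfg_to_dot("mac_kernel",
                 {0: "load a", 1: "load b", 2: "mul", 3: "add", 4: "store c"},
                 [(0, 2), (1, 2), (2, 3), (3, 4)]))
```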
... Generally, when mapping in a scalable way onto a scalable architecture, the execution window size has to match the size of the array for efficiency and respect the available memory bandwidth. An efficient block-based scheduling and mapping solution is discussed in earlier work [4], where Layers had a fixed 4 × 4 PE array and 8 ports, yielding a fixed mapping, while automation of this is attempted in [6]. ...
Conference Paper
Full-text available
A scalable mapping is proposed for three important kernels from the Numerical Linear Algebra domain, exploiting architectural features to reach asymptotically optimal efficiency and low energy consumption. Performance and power evaluations were done with input data set matrix sizes ranging from 64×64 to 16384×16384. Twelve architectural variants with up to 10×10 processing elements were used to explore the scalability of the mapping and the architecture, achieving less than 10% energy increase for architectures up to 8×8 PEs, coupled with performance speedups of more than an order of magnitude. This enables a clean area-performance trade-off on the Layers architecture while keeping energy constant over the variants.
Article
The utilization of computation resources and the reconfiguration time have a large impact on the performance of a reconfigurable system. To improve performance, a dynamic self-reconfiguration mechanism for a data-driven cell array is proposed. Cells fire only when the needed data arrive, and the cell array operates in two modes: fixed execution and reconfiguration. In reconfiguration mode, cell functions and dataflow directions are changed automatically at run time according to contexts. By using an H-tree interconnection network and pre-storing multiple application mapping contexts in a reconfiguration buffer, multiple applications can execute concurrently and context-switching time is minimized. To verify system performance, several algorithms are mapped onto the proposed structure, and the number of configuration contexts and the execution time are recorded for statistical analysis. The results show that the proposed self-reconfiguration mechanism reduces the number of contexts efficiently and has a low computing time.
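The data-driven firing rule can be illustrated in a few lines of Python: a cell buffers incoming operands and executes only once all of them have arrived. This is a minimal sketch of the described behavior under assumed names; the reconfiguration mode and the H-tree context distribution are not modeled.

```python
class Cell:
    """A cell fires only when every input operand has arrived."""

    def __init__(self, func, num_inputs):
        self.func = func
        self.operands = [None] * num_inputs

    def receive(self, port, value):
        self.operands[port] = value
        if all(v is not None for v in self.operands):    # firing condition
            result = self.func(*self.operands)
            self.operands = [None] * len(self.operands)  # consume the tokens
            return result
        return None

adder = Cell(lambda a, b: a + b, num_inputs=2)
print(adder.receive(0, 3))   # None -- still waiting for the second operand
print(adder.receive(1, 4))   # 7    -- fires once both operands have arrived
```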
Article
Coarse-grained reconfigurable architectures (CGRAs) are expected to be used for embedded systems, Internet of Things (IoT) devices, and edge computing thanks to their high energy efficiency and programmability. In essence, a CGRA is an array of numerous processing elements. To exploit this abundant computation resource, a compiler for CGRAs has to fulfill more tasks than one for general-purpose processors. Therefore, many studies have proposed optimization methods, especially for application mapping, because performance and energy efficiency strongly depend on optimization at compile time. However, many works focus only on performance improvement or resource minimization, although such optimization objectives are not always appropriate across various use cases. In this work, we propose GenMap, an application mapping framework using multiobjective optimization based on a genetic algorithm, so that users can set optimization criteria as needed. In addition, it provides aggressive power optimization using our dynamic power model and a leakage minimization technique. The proposed method is applied to three fabricated CGRA chips for evaluation. Experimental results show that GenMap achieves a 15.7% reduction in wire length while maintaining processing element utilization compared with conventional methods. In addition, according to real-chip experiments, energy consumption is reduced by 12.1%-46.8%, and up to a 2x speedup is achieved for several architectures compared with the other two approaches.
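The selection principle behind such multiobjective mapping search can be illustrated with a Pareto-front filter over candidate mappings; the objective names and values below are invented, and GenMap's actual genetic operators are not reproduced here.

```python
def pareto_front(candidates):
    """Keep the non-dominated candidate mappings under multiple objectives
    (all to be minimized). candidates: dict name -> tuple of objective values.
    """
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return {n: v for n, v in candidates.items()
            if not any(dominates(o, v) for m, o in candidates.items() if m != n)}

# (wire length, PE utilization gap, estimated energy) per candidate mapping
print(pareto_front({"map_a": (120, 0.10, 3.2),
                    "map_b": (150, 0.05, 3.0),
                    "map_c": (160, 0.12, 3.5)}))   # map_c is dominated by map_a
```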
Article
With VLSI process technology scaling into the nanoscale regime, increasingly serious aging issues (e.g., NBTI and HCI effects) pose a significant threat to system reliability. Coarse-grained reconfigurable architectures (CGRAs) can reconfigure and execute different mapping schemes (maps) dynamically, which can mitigate aging effectively. In this paper, a two-level stress-aware loop mapping algorithm is proposed that combines intra-kernel and inter-kernel stress optimizations for CGRA-mapped designs. Together with pipelining, the intra-kernel optimization employs stress-aware force-directed and effective MCC (Maximal Compatibility Class) methods to optimize operation placement and mapping distribution on CGRAs. This prevents many operations from being mapped onto a few processing elements (PEs) and reduces the stress accumulated on PEs. By leveraging the dynamic reconfiguration feature, the inter-kernel stress optimization develops a multi-map scheduling method that finds an ordered set of maps to reconfigure dynamically, diversifying PE usage and compensating for the stress accumulated on different PEs. Experimental results show that our approach achieves a maximum stress reduction of 82.0% for NBTI and 79.4% for HCI, and improves aging efficiency by 6.01X and MTTF by 3.16X, while the optimized performance is maintained.
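A greedy sketch of the inter-kernel idea: given alternative maps with different per-PE stress profiles, pick for each kernel invocation the map that keeps the maximum accumulated stress lowest, so usage spreads across PEs. The stress numbers are made up, and the paper's actual multi-map scheduling method is more elaborate.

```python
def schedule_maps(maps, invocations):
    """Greedily order map usage to balance accumulated per-PE stress.
    maps: dict name -> list of per-PE stress added by one run under that map.
    """
    num_pes = len(next(iter(maps.values())))
    acc = [0.0] * num_pes
    order = []
    for _ in range(invocations):
        # choose the map that minimizes the resulting maximum accumulated stress
        name, stress = min(maps.items(),
                           key=lambda kv: max(a + s for a, s in zip(acc, kv[1])))
        acc = [a + s for a, s in zip(acc, stress)]
        order.append(name)
    return order, acc

order, acc = schedule_maps({"map0": [3, 1, 0, 0], "map1": [0, 0, 3, 1]}, 4)
print(order, acc)   # alternates the maps so stress accumulates evenly across PEs
```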