Conference Paper

PARLGRAN: parallelism granularity selection for scheduling task chains on dynamically reconfigurable architectures

Center for Embedded Comput. Syst., California Univ., Irvine, CA, USA
DOI: 10.1109/ASPDAC.2006.1594733 Conference: Design Automation, 2006. Asia and South Pacific Conference on
Source: IEEE Xplore

ABSTRACT Partial dynamic reconfiguration, often called RTR (run-time reconfiguration), is a key feature in modern reconfigurable platforms. While partial RTR enables additional application performance, it imposes physical constraints that necessitate simultaneous scheduling and placement when mapping application task graphs onto such architectures. In this paper, we present PARLGRAN, an approach that maximizes the performance of application task chains by selecting a suitable granularity of data-parallelism for individual data-parallel tasks. Our approach focuses on reconfiguration delay overhead and placement-related issues (such as fragmentation) while selecting individual data-parallelism granularity as an integral part of simultaneous scheduling and placement. We demonstrate that our heuristic generates high-quality schedules on an extensive set of more than 1000 synthetic experiments by comparing the results with an approach that tries to statically maximize data-parallelism, i.e., one that does not consider the overheads and constraints associated with partial RTR. A detailed case study on JPEG encoding additionally confirms that blindly maximizing data-parallelism can result in schedules even worse than those generated by a simple (but RTR-aware) approach that is oblivious to data-parallelism.
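The trade-off described in the abstract, where more data-parallel copies shorten computation but add reconfiguration delay, can be illustrated with a toy model. The sketch below is not the PARLGRAN heuristic: it ignores placement and fragmentation, assumes a single reconfiguration port that loads instances one after another, and splits the data evenly. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
def makespan(work, reconfig_delay, copies):
    """Finish time of one data-parallel task split evenly across `copies` instances.

    Toy assumptions (not from the paper): a single reconfiguration port loads
    the instances sequentially, and each instance processes work / copies.
    """
    # The last instance finishes loading at copies * reconfig_delay and then
    # processes its share of the data, so it determines the finish time.
    return copies * reconfig_delay + work / copies


def best_granularity(work, reconfig_delay, max_copies):
    """Return (copies, makespan) minimizing the toy makespan over 1..max_copies."""
    candidates = range(1, max_copies + 1)
    best = min(candidates, key=lambda k: makespan(work, reconfig_delay, k))
    return best, makespan(work, reconfig_delay, best)


if __name__ == "__main__":
    # Hypothetical numbers: 100 time units of work, 10 units per
    # reconfiguration, and area for at most 8 instances.
    work, reconfig_delay, max_copies = 100.0, 10.0, 8
    print("selected granularity:", best_granularity(work, reconfig_delay, max_copies))
    print("maximum parallelism: ", makespan(work, reconfig_delay, max_copies))
    print("no parallelism:      ", makespan(work, reconfig_delay, 1))
```

With these hypothetical numbers the model selects three copies, and using all eight copies gives a longer schedule than the selected granularity, which is the effect the abstract attributes to statically maximizing data-parallelism without accounting for reconfiguration overhead.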

  • ABSTRACT: Partial dynamic reconfiguration (often referred to as partial RTR) enables true on-demand computing. A dynamically invoked application is assigned resources such as data bandwidth and configurable logic, and the limited logic resources are customized during application execution with partial RTR. In this work, we present key theoretical principles for maximizing application performance when available bandwidth is limited. We exploit bandwidth very effectively by selecting a suitable clock frequency for each task, and we maximize performance with partial RTR by exploiting the data-parallelism property of common image-processing tasks. Our theoretical principles are integrated in our scheduling strategy, SCHEDRTR. We present detailed application case studies on a cycle-accurate simulation platform that addresses microarchitectural concerns and includes detailed resource considerations of the Virtex XC2V3000 device. Our results demonstrate that applying SCHEDRTR to common image-filtering applications leads to a 15-20% performance gain in scenarios with limited bandwidth, when compared to a sophisticated RTR scheduling strategy with data-parallelism but simpler bandwidth considerations.
    Proceedings of the 44th annual Design Automation Conference; 06/2007
  • ABSTRACT: The focus of this dissertation is on kernel loops (K-loops), which are loop nests that contain hardware-mapped kernels in the loop body. In this thesis, we propose methods for improving the performance of such K-loops by using standard loop transformations to expose and exploit the coarse-grain loop-level parallelism. We target a reconfigurable architecture that is a heterogeneous system consisting of a general-purpose processor and a field-programmable gate array (FPGA). Research projects targeting reconfigurable architectures are trying to answer several questions: how to partition the application (that is, decide which parts are to be accelerated on the FPGA), how to optimize these parts (the kernels), and what the performance gain is. However, only a few try to exploit the coarse-grain loop-level parallelism. This work goes towards automatically deciding the number of kernel instances to place into the reconfigurable hardware, in a flexible way that can balance between area and performance. In this dissertation, we propose a general framework that helps determine the optimal degree of parallelism for each hardware-mapped kernel within a K-loop, taking into account area, memory size and bandwidth, and performance considerations. In the future it can also take power into account. Furthermore, we present algorithms and mathematical models for several loop transformations in the context of K-loops. The algorithms are used to determine the best degree of parallelism for a given K-loop, while the mathematical models are used to determine the corresponding performance improvement. The algorithms are validated with experimental results. The loop transformations that we analyze in this thesis are loop unrolling, loop shifting, K-pipelining, loop distribution, and loop skewing. An algorithm that decides which transformations to use for a given K-loop is also provided. Finally, we also present an analysis of possible situations and justifications of when and why the loop transformations do or do not have a significant impact on the K-loop performance.
  • ABSTRACT: Reconfigurable Computing (RC) is one of the research directions that focuses on accelerating applications. In the presented approach we assume the Molen machine organization and the Molen programming paradigm as our framework. Molen combines a general purpose processor (GPP) and a Field Programmable Gate Array (FPGA), having the advantages of both the speed of hardware and the flexibility of software execution. In this paper we present a method that allows complete automation of efficient code generation with the Molen compiler for reconfigurable architectures in the Delft WorkBench project. The proposed algorithm computes the optimal degree of parallelism for a kernel K called from inside a loop or loop nest, in order to achieve the maximum performance, taking into consideration the resource constraints. The input data for the algorithm consists of profiling information about the execution times for running K in both hardware and software, the memory transfers and the occupied area.
