Conference Paper

Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model.

DOI: 10.1007/978-3-540-78791-4_9 Conference: Compiler Construction, 17th International Conference, CC 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29 - April 6, 2008. Proceedings
Source: DBLP

ABSTRACT The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant body of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization together with locality optimization for the general case of arbitrarily-nested loop sequences remains a challenging problem. We propose an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as for locality optimization. The approach enables the minimization of inter-tile communication volume in the processor space, and minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops or pipelined parallelism at various levels can be detected. Preliminary studies of the framework show promising results.
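
A rough sketch of the ILP formulation referred to in the abstract (simplified notation, written for this summary; see the paper for the complete algorithm): for each statement S, a one-dimensional affine transform (hyperplane) \phi_S is sought such that every dependence has a non-negative component along it, and an affine function of the program parameters \vec{n} bounds all dependence distances; minimizing that bound, level by level, yields the tiling hyperplanes.

\begin{align*}
&\phi_S(\vec{i}) \;=\; \vec{h}_S \cdot \vec{i} + h_{S,0} \\
&\text{validity:}\quad \phi_{T_e}(\vec{t}) - \phi_{S_e}(\vec{s}) \;\ge\; 0
    \qquad \forall (\vec{s},\vec{t}) \in P_e,\ \forall e \\
&\text{cost bound:}\quad \phi_{T_e}(\vec{t}) - \phi_{S_e}(\vec{s}) \;\le\; \vec{u}\cdot\vec{n} + w
    \qquad \forall (\vec{s},\vec{t}) \in P_e,\ \forall e \\
&\text{objective:}\quad \operatorname{minimize}_{\prec}\ (\vec{u}, w)
\end{align*}

The universally quantified constraints over each dependence polyhedron P_e are turned into a finite set of linear constraints on the unknown coefficients via the affine form of Farkas' lemma.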

    • "The Pluto algorithm based on [4] [5] is the most recent among these, and has been shown to be suitable for architectures where extracting coarse-grained parallelism and locality are crucial — prominently modern generalpurpose multicore processors. The Pluto algorithm employs an objective function based on minimization of dependence distances [4]. The objective function makes certain practical trade-offs to avoid a combinatorial explosion in determining the transformations. "
    ABSTRACT: Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler transformations. Existing affine transformation frameworks like the Pluto algorithm, that include a cost function for modern multicore architectures where coarse-grained parallelism and locality are crucial, consider only a sub-space of transformations to avoid a combinatorial explosion in finding the transformations. The ensuing practical trade-offs lead to the exclusion of certain useful transformations, in particular, transformation compositions involving loop reversals and loop skewing by negative factors. In this paper, we propose an approach to address this limitation by modeling a much larger space of affine transformations in conjunction with the Pluto algorithm's cost function. We perform an experimental evaluation of both the effect on compilation time and the performance of generated codes. The evaluation shows that our new framework, Pluto+, provides no degradation in performance in any of the Polybench benchmarks. For Lattice Boltzmann Method (LBM) codes with periodic boundary conditions, it provides a mean speedup of 1.33x over Pluto. We also show that Pluto+ does not increase compile times significantly. Experimental results on Polybench show that Pluto+ increases overall polyhedral source-to-source optimization time only by 15%. In cases where it improves execution time significantly, it increases polyhedral optimization time only by 2.04x.
    Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015), San Francisco, CA, USA; 02/2015
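
    To make the kind of excluded transformation concrete, here is an illustrative sketch (constructed for this summary, not taken from the paper; the arrays and the functions f and g are placeholders): fusing the two loops below for reuse of B requires reversing the second loop, i.e., a schedule j -> (N-1) - j whose coefficient on j is -1, which a search restricted to non-negative transform coefficients cannot express.

    /* Illustrative example: loop reversal (a negative-coefficient affine
       transform) needed to enable fusion and reuse of B. */
    #define N 1024
    double A[N], B[N], C[N];
    double f(double x) { return 2.0 * x; }
    double g(double x) { return x + 1.0; }

    /* Original: the second loop consumes B in reverse order, so fusing the
       loops as written would read B[N-1-j] before it has been produced. */
    void original(void) {
        for (int j = 0; j < N; j++) B[j] = f(A[j]);
        for (int j = 0; j < N; j++) C[j] = g(B[N - 1 - j]);
    }

    /* After reversing the second loop (schedule j -> (N-1) - j), both
       statements touch B[j] in the same iteration, so they can be fused and
       B's reuse distance drops to O(1). */
    void transformed(void) {
        for (int j = 0; j < N; j++) {
            B[j] = f(A[j]);
            C[N - 1 - j] = g(B[j]);
        }
    }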
    • "In this section we briefly introduce the polyhedral model and the notation that we will use in this paper. A detailed description of polyhedral models can be found in [5] [6] [13] [20]. In this work we are using the polyhedral model to consider data access patterns for communication between sets of loop nests, and to optimize this communication ordering in order to perform fine-grained communication and enable parallelization within a loop nest and pipelining between loop nests through loop transformation. "
    ABSTRACT: High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises the performance and energy efficiency of hardware designs with a lower barrier to entry in design expertise, and shorter design time. State-of-the-art high level synthesis now includes a wide variety of powerful optimizations that implement efficient hardware. These optimizations can implement some of the most important features generally performed in manual designs including parallel hardware units, pipelining of execution both within a hardware unit and between units, and fine-grained data communication. We may generally classify the optimizations as those that optimize hardware implementation within a code block (intra-block) and those that optimize communication and pipelining between code blocks (inter-block). However, both optimizations are in practice difficult to apply. Real-world applications contain data-dependent blocks of code and communicate through complex data access patterns. Existing high level synthesis tools cannot apply these powerful optimizations unless the code is inherently compatible, severely limiting the optimization opportunity. In this paper we present an integrated framework to model and enable both intra- and inter-block optimizations. This integrated technique substantially improves the opportunity to use the powerful HLS optimizations that implement parallelism, pipelining, and fine-grained communication. Our polyhedral model-based technique systematically defines a set of data access patterns, identifies effective data access patterns, and performs the loop transformations to enable the intra- and inter-block optimizations. Our framework automatically explores transformation options, performs code transformations, and inserts the appropriate HLS directives to implement the HLS optimizations. Furthermore, our framework can automatically generate the optimized communication blocks for fine-grained communication between hardware blocks. Experimental evaluation demonstrates that we can achieve an average of 6.04X speedup over the high level synthesis solution without our transformations to enable intra- and inter-block optimizations.
    Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays; 01/2013
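
    As a rough illustration of the intra-block (loop pipelining) and inter-block (dataflow pipelining between blocks) optimizations the abstract above refers to, here is a generic sketch using Vivado-HLS-style pragmas; the directive names, function names, and coding style are assumptions made for illustration and are not taken from the paper.

    /* Generic HLS sketch: two code blocks communicate through an intermediate
       array. The dataflow pragma lets the tool overlap execution of the two
       blocks; the pipeline pragma requests an initiation interval of 1 for
       the loop inside each block. */
    #define N 1024

    static void block1(const int in[N], int mid[N]) {
        for (int i = 0; i < N; i++) {
    #pragma HLS pipeline II=1
            mid[i] = in[i] * 3;
        }
    }

    static void block2(const int mid[N], int out[N]) {
        for (int i = 0; i < N; i++) {
    #pragma HLS pipeline II=1
            out[i] = mid[i] + 7;
        }
    }

    void top(const int in[N], int out[N]) {
    #pragma HLS dataflow
        int mid[N];
        block1(in, mid);
        block2(mid, out);
    }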
    • "2) of all hyperplanes as rows. T R of any statement is thus a square matrix which can be obtained by eliminating the columns producing translation and any other rows meant to specify loop distribution at any level [21], [10], [5]. As the inter-tile dependences are all intra-statement and are not affected by the translation components of tiling hyperplanes, we have: F ′ = T R · F where F corresponds to approximate inter-tile dependences in original iteration space. "
    ABSTRACT: Most stencil computations allow tile-wise concurrent start, i.e., there always exists a face of the iteration space and a set of tiling hyperplanes such that all tiles along that face can be started concurrently. This provides load balance and maximizes parallelism. However, existing automatic tiling frameworks often choose hyperplanes that lead to pipelined start-up and load imbalance. We address this issue with a new tiling technique that ensures concurrent start-up as well as perfect load-balance whenever possible. We first provide necessary and sufficient conditions on tiling hyperplanes to enable concurrent start for programs with affine data accesses. We then provide an approach to find such hyperplanes. Experimental evaluation on a 12-core Intel Westmere shows that our code is able to outperform a tuned domain-specific stencil code generator by 4% to 27%, and previous compiler techniques by a factor of 2× to 10.14×.
    High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for; 01/2012
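
    For intuition about concurrent start, consider the 1-D heat-style stencil below (an illustrative sketch with made-up names and bounds, not the paper's formulation). Its dependences in (t, i) are (1,-1), (1,0), and (1,1). Tiling with hyperplanes (1,0) and (1,1) is valid but forces tiles along i to start in a pipelined fashion, whereas hyperplanes (1,1) and (1,-1) (diamond-shaped tiles) are also valid and let every tile touching the t = 0 face start concurrently.

    /* Illustrative stencil. Dependences carried by the t loop: (1,-1), (1,0), (1,1).
       - Hyperplanes (1,0), (1,1): valid, but a tile must wait for its left
         neighbor along i to finish -> pipelined start-up, load imbalance.
       - Hyperplanes (1,1), (1,-1): each has a non-negative component along
         every dependence, and all tiles touching t = 0 are mutually
         independent -> concurrent start. */
    #define T 1000
    #define N 4000
    double A[2][N];

    void stencil(void) {
        for (int t = 0; t < T; t++)
            for (int i = 1; i < N - 1; i++)
                A[(t + 1) % 2][i] =
                    0.25 * (A[t % 2][i - 1] + 2.0 * A[t % 2][i] + A[t % 2][i + 1]);
    }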