Conference Paper

Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model.

DOI: 10.1007/978-3-540-78791-4_9 Conference: Compiler Construction, 17th International Conference, CC 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29 - April 6, 2008. Proceedings
Source: DBLP


The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant body of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization together with locality optimization for the general case of arbitrarily-nested loop sequences remains a challenging problem. We propose an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as for locality optimization. The approach enables the minimization of inter-tile communication volume in the processor space, and minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops or pipelined parallelism at various levels can be detected. Preliminary studies of the framework show promising results.
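
For context, the constraints behind such an Integer Linear Programming formulation are conventionally written as follows; the notation (statement-wise affine hyperplanes \phi_S, dependence polyhedra P_e, and structure parameters \vec{n}) is the standard one for this line of work and is assumed here rather than quoted from the paper. For every dependence edge e from statement S_i to S_j:

    % Sketch of the per-dependence constraints (assumed notation)
    \begin{align*}
      \phi_{S_j}(\vec{t}) - \phi_{S_i}(\vec{s}) &\;\ge\; 0
        && \forall (\vec{s},\vec{t}) \in P_e && \text{(legality)} \\
      \phi_{S_j}(\vec{t}) - \phi_{S_i}(\vec{s}) &\;\le\; \vec{u}\cdot\vec{n} + w
        && \forall (\vec{s},\vec{t}) \in P_e && \text{(communication/reuse bound)}
    \end{align*}

Minimizing the bound (\vec{u}, w) keeps dependence distances along each hyperplane small, which is what bounds inter-tile communication volume and reuse distances; Farkas' lemma turns the universally quantified constraints into finitely many linear ones, so the tiling hyperplanes can be found by an ILP solver.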

    • "The Pluto algorithm based on [4] [5] is the most recent among these, and has been shown to be suitable for architectures where extracting coarse-grained parallelism and locality are crucial — prominently modern generalpurpose multicore processors. The Pluto algorithm employs an objective function based on minimization of dependence distances [4]. The objective function makes certain practical trade-offs to avoid a combinatorial explosion in determining the transformations. "
    ABSTRACT: Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler transformations. Existing affine transformation frameworks like the Pluto algorithm, which include a cost function for modern multicore architectures where coarse-grained parallelism and locality are crucial, consider only a sub-space of transformations to avoid a combinatorial explosion in finding the transformations. The ensuing practical trade-offs lead to the exclusion of certain useful transformations, in particular, transformation compositions involving loop reversals and loop skewing by negative factors. In this paper, we propose an approach to address this limitation by modeling a much larger space of affine transformations in conjunction with the Pluto algorithm's cost function. We perform an experimental evaluation of both the effect on compilation time and the performance of generated code. The evaluation shows that our new framework, Pluto+, provides no degradation in performance on any of the Polybench benchmarks. For Lattice Boltzmann Method (LBM) codes with periodic boundary conditions, it provides a mean speedup of 1.33x over Pluto. We also show that Pluto+ does not increase compile times significantly. Experimental results on Polybench show that Pluto+ increases overall polyhedral source-to-source optimization time by only 15%. In cases where it improves execution time significantly, it increases polyhedral optimization time by only 2.04x.
    Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015), San Francisco, CA, USA; 02/2015
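    To make concrete the kind of composition the Pluto+ abstract above refers to, the following is a minimal C sketch (the loop nest, array names, and sizes are hypothetical illustrations, not code from either paper) of a single affine mapping that composes a loop reversal with skewing by a negative factor; reaching such mappings requires negative transformation coefficients, which is the part of the space the quoted excerpt says is excluded by the practical trade-offs.

        #include <stdio.h>

        #define N 8

        int main(void) {
            static double A[N][N], C[N][N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    A[i][j] = i * N + j;

            /* Original nest:  for (i) for (j) C[i][j] = 2.0 * A[i][j];
             * Transformed with the affine mapping (ip, jp) = (N-1-i, j-i):
             * ip reverses the i loop, jp skews j by the negative factor -1.
             * The statement carries no dependences, so the mapping is legal. */
            for (int ip = 0; ip < N; ip++)
                for (int jp = ip - (N - 1); jp <= ip; jp++) {
                    int i = N - 1 - ip;   /* undo the reversal      */
                    int j = jp + i;       /* undo the negative skew */
                    C[i][j] = 2.0 * A[i][j];
                }

            printf("C[2][3] = %.1f\n", C[2][3]);
            return 0;
        }

    The mapping's coefficient rows are (-1, 0) and (-1, 1); compositions like this are exactly the loop-reversal and negative-skewing cases that Pluto+ aims to bring back into the search space.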
    • "is a method of space-time tiling that provides parallelism, including concurrent startup, while maintaining data locality. The original presentation of diamond tiling demonstrated excellent performance and scaling, beating the previous state of the art [6] that required a wavefront startup, and, therefore, had less available parallelism. "
    ABSTRACT: Stencil computations figure prominently in the core kernels of many scientific computations, such as partial differential equation solvers. Parallel scaling of stencil computations can be significantly improved on multicore processors using advanced tiling techniques that include the time dimension, such as diamond tiling. Such techniques are difficult to include in general purpose optimizing compilers because of the need for interprocedural pointer and array data-flow analysis, plus the need to tune scheduling strategies and tile size parameters for each pairing of stencil computation and machine. Since a fully automatic solution is problematic, we propose to provide parameterized space and time tiling iterators through libraries. Ideally, the execution schedule or tiling code will be expressed orthogonally to the computation. This supports code reuse, easier tuning, and improved programmer productivity. Chapel iterators provide this capability implicitly. We present an advanced, parameterized tiling approach that we have implemented using Chapel parallel iterators. We show how such iterators can be used by programmers in stencil computations with multiple spatial dimensions. We also demonstrate that these new iterators provide better scaling than a traditional data parallel schedule.
    Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, Newport Beach, CA, United States of America; 06/2014
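    As a rough illustration of what parameterized space-time tiling looks like when the schedule is kept separate from the per-point computation, the sketch below is a hedged C version of a skew-tiled 1-D Jacobi-style stencil (plain C rather than Chapel iterators; the stencil, the names, and the tile sizes are assumptions for illustration, not code from the cited paper).

        #include <stdio.h>

        #define T 64     /* time steps   */
        #define N 1024   /* space points */

        static double A[2][N];

        /* The computation, written independently of the schedule. */
        static void update(int t, int i) {
            A[(t + 1) % 2][i] =
                (A[t % 2][i - 1] + A[t % 2][i] + A[t % 2][i + 1]) / 3.0;
        }

        int main(void) {
            for (int i = 0; i < N; i++)          /* initialize both buffers;   */
                A[0][i] = A[1][i] = (double)i;   /* boundary values stay fixed */

            const int TT = 8, TS = 32;           /* tile-size parameters of the schedule */

            /* Skewed, tiled schedule: space is skewed by time (s = i + t), so
             * rectangular time-by-space tiles respect the stencil dependences
             * when visited in lexicographic order. */
            for (int tt = 0; tt < T; tt += TT)
                for (int ss = 0; ss < N + T; ss += TS)
                    for (int t = tt; t < tt + TT && t < T; t++)
                        for (int s = ss; s < ss + TS; s++) {
                            int i = s - t;       /* undo the skew */
                            if (i >= 1 && i <= N - 2)
                                update(t, i);
                        }

            printf("A[%d][%d] = %f\n", T % 2, N / 2, A[T % 2][N / 2]);
            return 0;
        }

    Here update() plays the role of the computation while the four tile loops play the role of a reusable, tunable iterator; the cited work's point is to express such schedules (including diamond tiling with concurrent startup) behind Chapel parallel iterators so they can be tuned and reused independently of the stencil body.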
    • "In this section we briefly introduce the polyhedral model and the notation that we will use in this paper. A detailed description of polyhedral models can be found in [5] [6] [13] [20]. In this work we are using the polyhedral model to consider data access patterns for communication between sets of loop nests, and to optimize this communication ordering in order to perform fine-grained communication and enable parallelization within a loop nest and pipelining between loop nests through loop transformation. "
    ABSTRACT: High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises the performance and energy efficiency of hardware designs with a lower barrier to entry in design expertise, and shorter design time. State-of-the-art high level synthesis now includes a wide variety of powerful optimizations that implement efficient hardware. These optimizations can implement some of the most important features generally performed in manual designs including parallel hardware units, pipelining of execution both within a hardware unit and between units, and fine-grained data communication. We may generally classify the optimizations as those that optimize hardware implementation within a code block (intra-block) and those that optimize communication and pipelining between code blocks (inter-block). However, both optimizations are in practice difficult to apply. Real-world applications contain data-dependent blocks of code and communicate through complex data access patterns. Existing high level synthesis tools cannot apply these powerful optimizations unless the code is inherently compatible, severely limiting the optimization opportunity. In this paper we present an integrated framework to model and enable both intra- and inter-block optimizations. This integrated technique substantially improves the opportunity to use the powerful HLS optimizations that implement parallelism, pipelining, and fine-grained communication. Our polyhedral model-based technique systematically defines a set of data access patterns, identifies effective data access patterns, and performs the loop transformations to enable the intra- and inter-block optimizations. Our framework automatically explores transformation options, performs code transformations, and inserts the appropriate HLS directives to implement the HLS optimizations. Furthermore, our framework can automatically generate the optimized communication blocks for fine-grained communication between hardware blocks. Experimental evaluation demonstrates that we can achieve an average of 6.04X speedup over the high level synthesis solution without our transformations to enable intra- and inter-block optimizations.
    Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays; 01/2013
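    Finally, as a rough sketch of the style of directive-annotated code such HLS flows consume, the fragment below shows a hypothetical producer/consumer pair with a pipelined loop in each function (an intra-block optimization) and task-level dataflow between the two functions (an inter-block optimization). The pragma spellings follow Vivado HLS conventions; the functions, the kernel, and the directive choices are assumptions for illustration, not output of the paper's framework.

        #define N 1024

        /* Intra-block: each loop is pipelined to start one iteration per cycle. */
        void producer(const int in[N], int mid[N]) {
            for (int i = 0; i < N; i++) {
        #pragma HLS PIPELINE II=1
                mid[i] = in[i] * 3;
            }
        }

        void consumer(const int mid[N], int out[N]) {
            for (int i = 0; i < N; i++) {
        #pragma HLS PIPELINE II=1
                out[i] = mid[i] + 1;
            }
        }

        /* Inter-block: with in-order accesses to mid, the two loops can be
         * overlapped as a task-level pipeline communicating through mid. */
        void top(const int in[N], int out[N]) {
        #pragma HLS DATAFLOW
            int mid[N];
            producer(in, mid);
            consumer(mid, out);
        }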