Conference Paper

Autotuning multigrid with PetaBricks

DOI: 10.1145/1654059.1654065 Conference: Proceedings of the ACM/IEEE Conference on High Performance Computing, SC 2009, November 14-20, 2009, Portland, Oregon, USA
Source: DBLP


Algorithmic choice is essential to realizing optimal computational performance in any problem domain. Multigrid is a prime example: not only is it possible to make choices at the highest grid resolution, but a program can switch techniques as the problem is recursively attacked on coarser grid levels, taking advantage of algorithms with different scaling behaviors. Additionally, users with different convergence criteria must experiment with parameters to yield a tuned algorithm that meets their accuracy requirements. Even after a tuned algorithm has been found, users often have to start over when migrating from one machine to another.
We present an algorithm and autotuning methodology that address these issues in a near-optimal and efficient manner. The freedom of independently tuning both the algorithm and the number of iterations at each recursion level results in an exponential search space of tuned algorithms that have different accuracies and performances. To search this space efficiently, our autotuner utilizes a novel dynamic programming method to build efficient tuned algorithms from the bottom up. The results are customized multigrid algorithms that invest targeted computational power to yield the accuracy required by the user.
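The bottom-up dynamic-programming search described above can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the candidate set, accuracy targets, and the cost/accuracy functions (which in the real autotuner come from timing and measuring candidates on the target machine) are all hypothetical stand-ins.

```python
# Toy sketch of a bottom-up DP autotuner: for each grid level and each
# accuracy target, keep the cheapest tuned algorithm found, reusing the
# tuned sub-algorithms already chosen for the coarser level.

ACCURACY_TARGETS = [1e1, 1e3, 1e5]           # required error-reduction factors
CANDIDATES = ["jacobi", "sor", "direct"]     # per-level solver choices

def measured_cost(level, algo, iters, sub_choice):
    """Stand-in for timing a candidate on real hardware."""
    base = {"jacobi": 1.0, "sor": 1.2, "direct": 4.0 ** level}[algo]
    return base * iters + (sub_choice["cost"] if sub_choice else 0.0)

def achieved_accuracy(algo, iters, sub_choice):
    """Stand-in for measuring a candidate's error reduction."""
    per_iter = {"jacobi": 1.5, "sor": 2.0, "direct": 1e6}[algo]
    sub = sub_choice["accuracy"] if sub_choice else 1.0
    return per_iter ** iters * sub

def tune(max_level):
    best = {}                                # (level, target) -> cheapest plan
    for level in range(max_level + 1):
        for target in ACCURACY_TARGETS:
            plans = []
            for algo in CANDIDATES:
                for iters in (1, 2, 4, 8):
                    # iterative methods recurse to the coarser level's plan;
                    # a direct solve terminates the recursion
                    sub = best.get((level - 1, target)) if algo != "direct" else None
                    acc = achieved_accuracy(algo, iters, sub)
                    if acc >= target:        # meets the accuracy requirement
                        plans.append({
                            "algo": algo, "iters": iters, "sub": sub,
                            "accuracy": acc,
                            "cost": measured_cost(level, algo, iters, sub),
                        })
            if plans:
                best[(level, target)] = min(plans, key=lambda p: p["cost"])
    return best
```

Because each level's table only consults the level below it, the exponential space of per-level choices is searched in time linear in the number of levels times the number of candidates and targets.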
The techniques we describe allow the user to automatically generate tuned multigrid cycles of different shapes targeted to the user's specific combination of problem, hardware, and accuracy requirements. These cycle shapes dictate the order in which grid coarsening and grid refinement are interleaved both with iterative methods, such as Jacobi or Successive Over-Relaxation, and with direct methods, which tend to have superior performance for small problem sizes. The need to choose among all of these methods brings the issue of variable accuracy to the forefront. Not only must the autotuning framework compare different possible multigrid cycle shapes against each other, but it also needs the ability to compare tuned cycles against both direct and (non-multigrid) iterative methods. We address this problem by using an accuracy metric to measure the effectiveness of tuned cycle shapes and making comparisons over all algorithmic types based on this common yardstick. In our results, we find that the flexibility to trade off performance against accuracy at all levels of recursive computation enables us to achieve excellent performance on a variety of platforms compared to algorithmically static implementations of multigrid.
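One way to picture a "cycle shape" is as a per-level recipe: which method to apply, how many smoothing iterations to run, and how many times to recurse to the coarser grid (one recursion per visit traces a V-cycle, two a W-cycle). The sketch below is a hypothetical representation for illustration; the operator callbacks are placeholders for real grid operations.

```python
# Minimal executor for a tuned cycle shape: `shape` maps each grid level to
# {"method", "iters", "recursions"}, with a "direct" entry terminating the
# recursion at the coarsest level. All names are illustrative.

def run_cycle(shape, level, solve_direct, smooth, coarsen, refine, x):
    step = shape[level]
    if step["method"] == "direct":           # base case: direct solve
        return solve_direct(x)
    for _ in range(step["iters"]):
        x = smooth(step["method"], x)        # e.g. "jacobi" or "sor"
    for _ in range(step["recursions"]):      # 1 -> V-cycle, 2 -> W-cycle
        x = refine(run_cycle(shape, level - 1, solve_direct,
                             smooth, coarsen, refine, coarsen(x)))
    return x
```

Because the method, iteration count, and recursion count can differ at every level, the same executor covers V-cycles, W-cycles, and the irregular tuned shapes the autotuner discovers.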
Our implementation uses PetaBricks, an implicitly parallel programming language where algorithmic choices are exposed in the language. The PetaBricks compiler uses these choices to analyze, autotune, and verify the PetaBricks program. These language features, most notably the autotuner, were key in enabling our implementation to be clear, correct, and fast.
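The core idea of exposing algorithmic choice to the compiler can be illustrated outside PetaBricks as well. The sketch below is plain Python, not PetaBricks syntax: the program declares interchangeable implementations of the same computation, and a tuner, rather than the programmer, decides which one to run by timing them on representative input. All function names here are hypothetical.

```python
# Illustration of autotuning over an algorithmic choice: two sort
# implementations with different scaling behavior, and a tuner that
# empirically picks the faster one for a given training input.
import bisect
import time

def sort_insertion(xs):
    """O(n^2) overall, but low overhead on small inputs."""
    out = []
    for v in xs:
        bisect.insort(out, v)
    return out

def sort_merge(xs):
    """O(n log n) merge sort, better asymptotic scaling."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = sort_merge(xs[:mid]), sort_merge(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

CHOICES = {"insertion": sort_insertion, "merge": sort_merge}

def autotune(train_input):
    """Time each candidate on the training input; return the fastest."""
    timings = {}
    for name, fn in CHOICES.items():
        t0 = time.perf_counter()
        fn(train_input)
        timings[name] = time.perf_counter() - t0
    return min(timings, key=timings.get)
```

In PetaBricks the equivalent choice is written once in the language itself, letting the compiler generate, time, and select variants automatically, including hybrids that switch implementation at a tuned input-size cutoff.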



Available from: Saman P. Amarasinghe
  • Source
    • "Many of these efforts examined 2D or constant-coefficient problems — features rarely seen in real-world applications. Chan et al. explored how, using an auto-tuned approach, one could restructure the multigrid V-cycle to improve time-to-solution in the context of a 2D, constant-coefficient Laplacian [5]. This approach is orthogonal to our implemented optimizations and their technique could be incorporated in future work. "

    Full-text · Technical Report · Dec 2012
  • Source
    ABSTRACT: The NAS parallel benchmarks, originally developed by NASA for evaluating performance of their high-performance computers, have been regarded as one of the most widely used benchmark suites for side-by-side comparisons of high-performance machines. However, even though the NAS parallel benchmarks have grown tremendously in the last two decades, documentation is lagging behind because of rapid changes and additions to the collection of benchmark codes, primarily due to rapid innovation of parallel architectures. Consequently, the learning curve for beginning graduate students, researchers, or software systems engineers to pick up these benchmarks is typically huge. In this paper, we document and assess the NAS parallel benchmark suite by identifying parallel patterns within the NAS benchmark codes. We believe that such documentation of the benchmarks will allow researchers as well as those in industry to understand, use and modify these codes more effectively. INTRODUCTION: What we have come to know as "high-performance computing" today has been and may still likely remain pivotal in scientific advancement. Advances in cancer research, neuroscience, renewable energy, and space exploration have been heavily dependent on the quality of the high-performance machines used for scientific simulation. Yet, a central issue in the area of high-performance computing research is the problem of performance analysis and benchmarking. The evaluation of high-performance parallel machines involves consideration of many architectural features such as memory hierarchy, interconnect topology, the memory consistency model, cache coherency, and simultaneous multi-threading. Coupled with this, the number of parallel applications being supported continues to increase at a rapid pace.
    Preview · Article · Jan 2010
  • Source
    Article: PetaBricks.
    ABSTRACT: Building adaptable and more efficient programs for the multi-core era is now within reach.
    Full-text · Article · Sep 2010 · Crossroads