PERI Auto-Tuning
Jacqueline Chame^4, Chun Chen^4, Jack Dongarra^5, Mary Hall^4, Jeffrey K.
Hollingsworth^3, Paul Hovland^1, Shirley Moore^5, Keith Seymour^5, Jaewook Shin^1,
Ananta Tiwari^3, Sam Williams^2, Haihang You^5, David H. Bailey^2
^1 Argonne National Laboratory, Argonne, IL 60439
^2 Lawrence Berkeley National Laboratory, Berkeley, CA 94720
^3 University of Maryland, College Park, MD 20742
^4 USC/ISI, Marina del Rey, CA 90292
^5 University of Tennessee, Knoxville, TN 37996
E-mail: mhall@isi.edu
Abstract. The enormous and growing complexity of today's high-end systems has
increased the already significant challenges of obtaining high performance on
equally complex scientific applications. Application scientists are faced with a daunting
challenge in tuning their codes to exploit performance-enhancing architectural features. The
Performance Engineering Research Institute (PERI) is working towards the goal of automating
portions of the performance tuning process. This paper describes PERI’s overall strategy for
auto-tuning tools, and recent progress in both building auto-tuning tools and demonstrating
their success on kernels, some taken from large-scale applications.
1. Introduction
As we enter the era of petascale systems, there is likely to be a significant gap between peak
and sustained performance on the new hardware, as there was on earlier high-end systems.
Historically, the burden of achieving high performance on new platforms has largely fallen on the
application scientists. To relieve application scientists of this burden, we would like to provide
performance tools that are (largely) automatic, a long-term goal commonly called auto-tuning. This
goal encompasses tools that analyze a scientific application, both as source code and during execution,
generate a space of tuning options, and search for a near-optimal performance solution. There are
numerous challenges to fully realizing this vision, including enhancement of automatic code
manipulation tools, automatic run-time parameter selection, automatic communication optimization,
and intelligent heuristics to control the combinatorial explosion of tuning possibilities. On the other
hand, we are encouraged by recent successful results such as ATLAS, which has automatically tuned
components of the LAPACK linear algebra library [1]. We are also studying techniques used in the
highly successful FFTW library [2] and several other related projects [3-6].
The Performance Engineering Research Institute (PERI) is formalizing a performance tuning
methodology used by application developers and automating portions of this process. Auto-tuning is
one of three aspects of performance tuning on which PERI focuses, the others being performance
modeling and application engagement. In the context of these three aspects, the goal of PERI is to
migrate automatic and semi-automatic prototypes into practice for a set of important applications.
The remainder of this document focuses on the PERI auto-tuning strategy (Section 2), recent
progress in developing common interfaces to auto-tuning tools (Section 3), and tool infrastructure and
experimental results (Section 4), followed by a conclusion.
2. PERI Auto-tuning Conceptual Diagram
Figure 1 provides a conceptual diagram of auto-tuning in PERI. Several phases are shown and
described below; this document focuses on Transformation, Code Generation and Offline Search.
1. Triage. This step involves performance measurement, analysis and modeling to determine whether
an application has opportunities for optimization.
2. Semantic analysis. This step involves analysis of program semantics to support safe
transformation of the source code. The analyses include traditional compiler analyses to determine
data and control dependences and can exploit semantic information provided by the user through
annotations or information about domain-specific abstractions.
3. Transformation. Transformations include traditional optimizations such as loop optimizations and
in-lining, as well as more aggressive data structure reorganizations and domain-specific
optimizations. Transformations such as tiling may be parameterized to allow tuning for input size
and machine characteristics.
4. Code generation. This phase produces a set of possible implementations to be considered.
5. Offline search. Offline search entails running the generated code and searching for the best-
performing implementation. The search process may be constrained by guidance from a
performance model or user input (a driver for this step is sketched after this list).
Figure 1: Flow diagram of the auto-tuning process, with the Transformation API and Search API
(Section 3) marking where the common interfaces enter.
6. Application assembly. At this point, the optimized code components are integrated to produce an
executable code, possibly including instrumentation and support for dynamic tuning.
7. Training runs. Training runs involve a separate execution step designed mainly to produce
performance data for feedback into the optimization process.
8. Online adaptation. Finally, optimizations may occur during production runs, especially for
problems or machines whose optimal configuration changes during execution.
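To make steps 4 and 5 concrete, the following minimal sketch (in Python) shows an offline-search
driver of the kind these phases imply: for each candidate parameter setting it generates a code
variant, compiles and runs it, and keeps the fastest. The generator script gen_variant.py, the file
names, and the compiler invocation are hypothetical stand-ins rather than actual PERI tooling.

import itertools
import subprocess
import time

def build_and_time(block, unroll):
    # Emit a C variant for this (block, unroll) pair via a hypothetical
    # generator script, compile it, and measure one timed run.
    subprocess.run(["python", "gen_variant.py", str(block), str(unroll),
                    "-o", "variant.c"], check=True)
    subprocess.run(["gcc", "-O2", "-o", "variant", "variant.c"], check=True)
    start = time.perf_counter()
    subprocess.run(["./variant"], check=True)
    return time.perf_counter() - start

# Exhaustive search over a small two-dimensional parameter space.
best = min(itertools.product([16, 32, 64, 128], [1, 2, 4, 8]),
           key=lambda point: build_and_time(*point))
print("best (block, unroll):", best)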
3. Evolving an Auto-tuning System through Common Interfaces
Automating the performance tuning process requires that three issues be addressed:
1. The number of code variants for a complete application can be enormous. Strategies are needed
to avoid code explosion and to judiciously select which transformation techniques to apply to
different sections of the application code, keeping the tuning time at manageable levels.
2. As the number of tuning parameters increases, the search space becomes high-dimensional and
exponential in size. Search algorithms are needed that can cope with exponential spaces and
deliver results within a few search iterations.
3. A metric that measures and isolates the performance of the section of code being optimized
within an application is needed to accurately guide search algorithms.
Within PERI, there are five different research groups working on developing auto-tuning tools to
address these issues. These projects have complementary strengths and can, therefore, be brought
together to develop an integrated auto-tuning system. Towards that end, we are working to develop a
common framework to allow auto-tuning tools to share information and search strategies. Through
common APIs, we can evolve an auto-tuning system that brings together the best capabilities of each
of these tools, and also engage the broader community of tool developers beyond PERI researchers.
We have focused development of the interfaces on two portions of the auto-tuning process. Any
compiler-based approach will apply code transformations to rewrite application code from its original
form to one that more effectively exploits architectural features such as registers, caches, SIMD
compute engines, and multiple cores. Commonly used code transformations include loop unrolling,
blocking for cache, and software pipelining. Thus, we are designing a transformation API that will be
input to the Transformation box in Figure 1. This API provides a transformation recipe that describes
how to transform original source into an optimized source representation. By accepting a common
transformation recipe, the auto-tuning system permits code transformation strategies derived by PERI
compilers and tools (or users) to be implemented using any transformation and code generation tools,
such as Chill (USC/ISI), LoopProcessor (LLNL) and POET (UTSA). The API supports the
specification of unbound transformation parameters that are then tuned using search algorithms. The
initial draft of the API includes a naming convention for specifying language constructs in source code
and code transformations available in Chill and POET.
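As an illustration only, a recipe with unbound parameters might be represented as in the sketch
below. The command names echo Chill-style loop transformations, but the data layout, field names
and the notional apply_recipe() driver are invented here and do not reproduce the draft API.

# Hypothetical transformation recipe, expressed as Python data.
recipe = {
    "source": "mm.c",             # original source file
    "procedure": "mm",            # routine to transform
    "commands": [
        "tile(0, 2, TI)",         # tile loop level 2 with unbound parameter TI
        "tile(0, 3, TJ)",         # tile loop level 3 with unbound parameter TJ
        "unroll(0, 4, TU)",       # unroll innermost loop by unbound factor TU
    ],
    "parameters": ["TI", "TJ", "TU"],   # left unbound for the search layer
}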
A search API provides input into the empirical optimization process of running a series of
experiments on actual hardware to determine the best optimized implementation. The search API
allows the auto-tuning tools to exchange information about their available tuning options and
constraints on the search space, and to plug in different search algorithms. The common framework
will support both auto-tuning using training runs (and re-compilation) and continuous
optimization during production runs. For the search API, we are working on developing a simple and
extensible language that standardizes the parameter space representation. Using the language,
developers and researchers can expose tunable parameters to tuning frameworks. Relationships
(ordering, dependencies, constraints and ranking) between tunable parameters can also be expressed.
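To suggest what such a parameter-space description might look like, the sketch below encodes
ranges, constraints and a ranking for the recipe parameters above; the field names are invented
for illustration and do not reproduce the language under development.

# Hypothetical search-space description for a recipe's unbound parameters.
space = {
    "TI": {"type": "int", "range": [8, 256], "step": 8},
    "TJ": {"type": "int", "range": [8, 256], "step": 8},
    "TU": {"type": "int", "range": [1, 16]},
    "constraints": [
        "TI * TJ * 8 <= 32768",   # double-precision tile fits a 32 KB cache
        "TI % TU == 0",           # unroll factor must divide the tile size
    ],
    "rank": ["TI", "TJ", "TU"],   # tune the most influential parameter first
}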
To understand the requirements of integrating auto-tuning tools developed independently by PERI
researchers, we have been engaging in the integration of several PERI tools. Researchers at the
University of Tennessee have integrated auto-tuning search (described below) with the ROSE
LoopProcessor (LLNL) and the POET code generator (UTSA). Similarly, an initial integration of the
Active Harmony system (UMD) and the Chill transformation framework (USC/ISI) is providing
experience in how to effectively integrate these separate tools into an auto-tuning system.
4. Auto-tuning Infrastructure and Experimental Results
We now describe recent performance results from applying auto-tuning to dense and sparse linear
algebra kernels. The dense linear algebra experiments use our prototype compiler and code-generation
tools, while the more difficult-to-analyze sparse linear algebra experiments use a library-based auto-
tuning approach. This section describes these results and the tool infrastructures used to derive them.
Code generation and empirical search. In conjunction with the previously described APIs, a goal in
PERI is to easily substitute different code generators and search engines in the auto-tuning process.
To that end, researchers at the University of Tennessee have developed a simple syntax for specifying
how code should be generated to evaluate the code variants. To permit switching search
techniques, the application-specific aspects of the evaluation process are described by a simple
specification of the search bounds and constraints. This system permits evaluation of various search
techniques to determine which work best in an auto-tuning context. We have evaluated classic
techniques such as genetic algorithms, simulated annealing, and particle swarm optimization, as well
as more ad hoc techniques such as a modified orthogonal search.
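As one example of a pluggable strategy, the sketch below implements a simple orthogonal
(coordinate-descent) search behind a generic interface: it tunes one dimension at a time against an
evaluate() callback that would, in practice, compile and time a variant. The toy cost function
stands in for measured run time; this is a schematic of the idea, not UTK's modified algorithm.

def orthogonal_search(dims, evaluate):
    """dims maps each parameter name to its candidate values; returns the
    best point found by sweeping one dimension at a time until no sweep
    improves the cost."""
    point = {name: vals[0] for name, vals in dims.items()}
    best = evaluate(point)
    improved = True
    while improved:
        improved = False
        for name, vals in dims.items():        # sweep each dimension in turn
            for v in vals:
                trial = dict(point, **{name: v})
                cost = evaluate(trial)
                if cost < best:
                    best, point, improved = cost, trial, True
    return point, best

# Toy cost function standing in for compiling and timing a generated variant.
point, cost = orthogonal_search(
    {"block": [16, 32, 64, 80, 128], "unroll": [1, 2, 4, 8]},
    lambda p: abs(p["block"] - 80) + 10 * abs(p["unroll"] - 2))
print(point, cost)     # finds block=80, unroll=2 on this toy cost surface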
As a simple example of an optimization search space, Figure 2 shows optimization of square
matrix-matrix multiplication (N=400), using an exhaustive search over two dimensions: block sizes up
to 128 and unroll factors up to 128. The x and y axes represent block size and unrolling amount,
respectively, while the z axis represents the performance in Mflop/s of the generated code. In general,
we see the best results along the blocking axis with a low unrolling amount as well as along the
diagonal where blocking and unrolling are equal, but there are also peaks along areas where the block
size is evenly divisible by the unrolling amount. The best performance was found with block size 80
and unrolling amount 2. This code variant ran at 1459 Mflop/s compared to 778 Mflop/s for the naive
version compiled with gcc.
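The kind of generator behind such a sweep can be compact. The sketch below emits a blocked C
matrix-multiply with the block size and unroll factor baked in, assuming the block size divides N
and the unroll factor divides the block size; the file and routine names are illustrative, and the
actual UTK generator syntax is richer than this.

def gen_mm(block, unroll, n=400):
    # Unrolled body: one accumulation per k within the unrolled group.
    body = "\n".join(
        f"        c[i*{n}+j] += a[i*{n}+(k+{u})] * b[(k+{u})*{n}+j];"
        for u in range(unroll))
    return f"""
#define N {n}
void mm(const double *a, const double *b, double *c) {{
  for (int ii = 0; ii < N; ii += {block})
   for (int kk = 0; kk < N; kk += {block})
    for (int jj = 0; jj < N; jj += {block})
     for (int i = ii; i < ii + {block}; i++)
      for (int k = kk; k < kk + {block}; k += {unroll})
       for (int j = jj; j < jj + {block}; j++) {{
{body}
       }}
}}
"""

with open("mm_variant.c", "w") as out:
    out.write(gen_mm(block=80, unroll=2))   # the best point found in Figure 2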
Compiler-based infrastructure. Researchers at USC/ISI are developing the TUNE auto-tuning
compiler and its constituent Chill transformation framework, which provide a compiler-based
infrastructure for general application code. Chill takes source code and a transformation recipe with
bound parameter values as input. Using a polyhedral framework, Chill extracts a mathematical
representation of the iteration space and array accesses, and composes the transformations in the
recipe through rewriting rules to generate optimized code. A parameter sweep interface takes a
transformation recipe with unbound parameters, together with ranges and constraints on their
values, and derives a series of alternative implementations.
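A sweep driver in that spirit can be as simple as enumerating the legal bindings of a recipe's
unbound parameters; the sketch below (a hypothetical interface, echoing the space description in
Section 3) collects the candidate implementations to hand to the search engine.

# Instantiate each legal (TI, TU) binding of a recipe's unbound parameters.
variants = [
    {"TI": TI, "TU": TU}
    for TI in range(8, 257, 8)
    for TU in (1, 2, 4, 8)
    if TI % TU == 0                 # constraint: unroll divides tile size
]
print(len(variants), "candidate implementations")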
TUNE has been used to achieve hand-tuned levels of performance on dense linear algebra kernels
such as matrix-matrix multiply, matrix-vector multiply and LU factorization for older architectures
such as the SGI R10K, Pentium M and Pentium D. Recent work to optimize matrix-matrix multiply
for newer architectures is promising, but shows gaps between compiler-optimized and hand-tuned
performance. Performance results for the Jacquard system at LBNL are 40% below hand-tuned
performance; we attribute the gap to code generation for SSE-3, instruction scheduling and
control of prefetch. Collaborating researchers at Argonne, experimenting with manually
scheduled inner loop bodies, have achieved performance within 83% of peak and are developing
automatic approaches to close this gap. On the Intel Core2Duo, we have achieved performance that is
28% below near-peak hand-tuned results. The primary performance gap appears to be control of
software prefetch at the source code level.
Auto-tuning sparse matrix-vector multiply for multi-core architectures. In a study of auto-
tuning a sparse matrix-vector multiply (SpMV) kernel, PERI researchers at Lawrence Berkeley
National Laboratory compared performance on a dual-socket Intel Core2 Xeon, single- and dual-core
AMD processors, a quad-core Opteron, a Sun Niagara2, and an IBM Cell Broadband Engine. This
study found that these processors differed markedly in the effectiveness of various auto-tuning schemes
on the SpMV kernel [6]. In two subsequent studies [7,8], these researchers analyzed similar but more
sophisticated automatic optimizations for SpMV, the explicit heat equation PDE on a regular grid
(Stencil), and a lattice Boltzmann application (LBMHD). This auto-tuning approach employs a code
generator that produces multiple versions of the computational kernels using a set of optimizations
with varying parameter settings. The optimizations include: TLB and cache blocking, loop unrolling,
code reordering, software prefetching, streaming stores, and use of SIMD instructions.
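To give a flavor of this approach, the Python sketch below generates CSR SpMV kernels in C with
the inner loop unrolled to a chosen depth, one of the optimizations listed above. The real
generators in [7,8] also emit register-blocked, prefetched, SIMD and streaming-store variants; the
file and routine names here are illustrative, not taken from that work.

def gen_spmv(unroll):
    # Guarded unrolled accumulation over the nonzeros of one row.
    accum = "\n".join(
        f"      if (k + {u} < end) sum += val[k+{u}] * x[col[k+{u}]];"
        for u in range(unroll))
    return f"""
void spmv(int nrows, const int *ptr, const int *col,
          const double *val, const double *x, double *y) {{
  for (int r = 0; r < nrows; r++) {{
    double sum = 0.0;
    int end = ptr[r+1];
    for (int k = ptr[r]; k < end; k += {unroll}) {{
{accum}
    }}
    y[r] = sum;
  }}
}}
"""

for depth in (1, 2, 4, 8):                 # candidate unroll depths to time
    with open(f"spmv_u{depth}.c", "w") as out:
        out.write(gen_spmv(depth))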
The impact of each optimization varied significantly across architecture and kernel, necessitating a
machine-dependent approach to automatic tuning in this study. In addition, detailed analysis revealed
performance bottlenecks for each computation on the various systems mentioned above. The overall
performance results showed that the Cell processor offers the highest raw performance and power
efficiency for these computations, despite having peak double-precision performance and memory
bandwidth that is lower than many of the other platforms in our study. The key architectural
advantage of Cell is explicit software control of data movement between the local store (cache) and
main memory, which is a major departure from conventional programming. In any event, these
studies make it clear that there is still considerable room for improvement in automatic tuning methods
for all of the candidate architectures.
Acknowledgment
This work was supported in part by the U.S. Dept. of Energy under Contract DE-AC02-06CH11357.
References
[1] Whaley, C., Petitet, A., Dongarra, J., "Automated Empirical Optimizations of Software and the
ATLAS Project," Parallel Computing, Vol. 27 (2001), No. 1, pp. 3-25.
[2] Frigo, M., Johnson, S., "FFTW: An Adaptive Software Architecture for the FFT," Proceedings
of the International Conference on Acoustics, Speech, and Signal Processing, Seattle,
Washington, May 1998.
[3] Bilmes, J., Asanovic, K., Chin, C. W., Demmel, J., "Optimizing Matrix Multiply using
PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology," Proceedings of the
International Conference on Supercomputing, Vienna, Austria, ACM SIGARCH, July 1997.
[4] Vuduc, R., Demmel, J., Yelick, K., "OSKI: A Library of Automatically Tuned Sparse Matrix
Kernels," Proceedings of SciDAC 2005, Journal of Physics: Conference Series, June 2005.
[5] Chen, C., Chame, J., Hall, M., "Combining Models and Guided Empirical Search to Optimize
for Multiple Levels of the Memory Hierarchy," Proceedings of the Conference on Code
Generation and Optimization, March 2005.
[6] Williams, S., Carter, J., Oliker, L., Shalf, J., Yelick, K., "Lattice Boltzmann Simulation
Optimization on Leading Multicore Platforms," International Parallel & Distributed
Processing Symposium (IPDPS), 2008 (to appear; best paper, applications track).
[7] Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J., "Optimization of Sparse
Matrix-Vector Multiplication on Emerging Multicore Platforms," Proceedings of SC07,
ACM/IEEE, November 2007.
[8] Williams, S., Datta, K., Carter, J., Oliker, L., Shalf, J., Yelick, K., Bailey, D., "PERI -
Auto-tuning Memory Intensive Kernels for Multicore," available at
http://crd.lbl.gov/~dhbailey/dhbpapers/scidac08_peri.pdf.