# Autotuning Multigrid with PetaBricks




Cy Chan†, Jason Ansel†, Yee Lok Wong‡, Saman Amarasinghe†, Alan Edelman†‡

†CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139

‡Dept. of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139

ABSTRACT

Algorithmic choice is essential in any problem domain to realizing optimal computational performance. Multigrid is a prime example: not only is it possible to make choices at the highest grid resolution, but a program can switch techniques as the problem is recursively attacked on coarser grid levels to take advantage of algorithms with different scaling behaviors. Additionally, users with different convergence criteria must experiment with parameters to yield a tuned algorithm that meets their accuracy requirements. Even after a tuned algorithm has been found, users often have to start all over when migrating from one machine to another.

We present an algorithm and autotuning methodology that address these issues in a near-optimal and efficient manner. The freedom of independently tuning both the algorithm and the number of iterations at each recursion level results in an exponential search space of tuned algorithms that have different accuracies and performances. To search this space efficiently, our autotuner utilizes a novel dynamic programming method to build efficient tuned algorithms from the bottom up. The results are customized multigrid algorithms that invest targeted computational power to yield the accuracy required by the user.

The techniques we describe allow the user to automatically generate tuned multigrid cycles of different shapes targeted to the user’s specific combination of problem, hardware, and accuracy requirements. These cycle shapes dictate the order in which grid coarsening and grid refinement are interleaved with both iterative methods, such as Jacobi or Successive Over-Relaxation, as well as direct methods, which tend to have superior performance for small problem sizes. The need to make choices between all of these methods brings the issue of variable accuracy to the forefront. Not only must the autotuning framework compare different possible multigrid cycle shapes against each other, but it also needs the ability to compare tuned cycles against both direct and (non-multigrid) iterative methods. We address this problem by using an accuracy metric for measuring the effectiveness of tuned cycle shapes and making comparisons over all algorithmic types based on this common yardstick. In our results, we find that the flexibility to trade performance versus accuracy at all levels of recursive computation enables us to achieve excellent performance on a variety of platforms compared to algorithmically static implementations of multigrid.

Our implementation uses PetaBricks, an implicitly parallel programming language where algorithmic choices are exposed in the language. The PetaBricks compiler uses these choices to analyze, autotune, and verify the PetaBricks program. These language features, most notably the autotuner, were key in enabling our implementation to be clear, correct, and fast.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SC09 November 14-20, 2009, Portland, Oregon, USA

Copyright 2009 ACM 978-1-60558-744-8/09/11 ...$10.00.

1. INTRODUCTION

While multigrid is currently one of the most popular techniques for efficiently solving partial differential equations over a grid, it has become clear that restricting ourselves to a single technique in any problem domain is rarely the optimal strategy. It is often the case that we want to choose between different algorithms based on some characteristics of the input. For example, we may use the input’s magnitude as the criterion in a factoring algorithm or the input’s length in a sorting algorithm. The optimal cutoff is almost always dependent on underlying machine properties, and it is the goal of autotuning packages such as FFTW [7, 8], ATLAS [17, 18], and OSKI [16] to discover which situations warrant the application of each available technique.

In some cases, tuning algorithmic choice could simply mean choosing the appropriate top-level technique during the initial function invocation; however, for many problems including multigrid, it is better to be able to utilize multiple techniques within a single function call or solve. For example, in the C++ Standard Template Library’s sort routine, the algorithm switches from using the divide-and-conquer $O(n \log n)$ merge sort to $O(n^2)$ insertion sort once the working array size falls below a set cutoff. In multigrid, an analogous strategy might switch from recursive multigrid calls to a direct method such as Cholesky factorization and triangular solve once the problem size falls below a threshold.
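The cutoff idea in the sorting example can be sketched in a few lines. This is an illustrative Python version written for this discussion, not the STL implementation; the function names and the default cutoff of 16 are assumptions, and the cutoff is exactly the kind of machine-dependent parameter an autotuner would search over.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place; quadratic, but cheap on short ranges."""
    for i in range(lo + 1, hi):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def hybrid_sort(a, lo=0, hi=None, cutoff=16):
    """Merge sort that switches to insertion sort at or below `cutoff` elements.

    `cutoff` is a hypothetical tunable; 16 is an arbitrary default.
    """
    if hi is None:
        hi = len(a)
    if hi - lo <= max(cutoff, 1):      # small range: take the shortcut
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_sort(a, lo, mid, cutoff)    # recursive case on each half
    hybrid_sort(a, mid, hi, cutoff)
    merged = []
    i, j = lo, mid
    while i < mid and j < hi:          # stable merge of the two sorted halves
        if a[j] < a[i]:
            merged.append(a[j])
            j += 1
        else:
            merged.append(a[i])
            i += 1
    merged.extend(a[i:mid])
    merged.extend(a[j:hi])
    a[lo:hi] = merged
```

Every cutoff produces the same sorted output; only the running time changes, which is why the choice can be left to an automated search.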

This paper analyzes the optimizations of algorithmic choice in multigrid. When confronted with the problem of training the autotuner to choose between a recursive multigrid call and a call to an iterative or direct solver, one quickly realizes that no comparison between methods can be fair without considering the relative accuracies of each. Indeed, we found that in some cases sacrificing accuracy at lower levels of recursion has little impact on the accuracy of the final result, while in other cases improving accuracy at a lower level reduces the number of (more expensive) iterations needed at a higher level.

In this paper we describe a novel dynamic programming strategy that allows us to make fair comparisons between various iterative, recursive, and direct methods, resulting in an efficient, tuned algorithm for user-specified convergence criteria. The resulting algorithms can be visualized as tuned multigrid cycles that apply targeted computational power to meet the accuracy requirements of the user. Our methodology does not tune cycle shapes by manipulating the shapes directly; it instead categorizes algorithms based on the accuracy of the results produced, allowing it to compare all types of algorithms (direct, iterative, and recursive) and make tuning decisions based on that common yardstick. Additionally, our tuning algorithm has the flexibility of utilizing different accuracy constraints for various components within a single algorithm, allowing the autotuner to independently trade performance and accuracy at each level of multigrid recursion.

This work on multigrid was developed using the PetaBricks programming language [2]. PetaBricks is an implicitly parallel programming language where algorithmic choice is a first class construct, to help programmers express and tune algorithmic choices and cutoffs such as these to obtain the fastest combination of algorithms to solve a problem. While traditional compiler optimizations can be successful at optimizing a single algorithm, when an algorithmic change is required to boost performance the burden is put on the programmer to incorporate the new algorithm. Programs written in PetaBricks can naturally describe multiple algorithms for solving a problem and how they can fit together. This information is used by the PetaBricks compiler and runtime to create and autotune an optimized multigrid algorithm.

1.1 Outline

We first describe in Section 2 the algorithmic choices available in multigrid and detail the dynamic programming approach to autotuning for accuracy and performance. Section 3 describes the PetaBricks language and implementation of the compiler and autotuning system that makes tuning over algorithmic choice possible. Section 4 presents experimental results. Finally, Sections 5, 6 and 7 describe related work, future work, and conclusions.

1.2 Contributions

We make the following contributions:

• We introduce an autotuner that can tune over algorithmic choice in multigrid problems.

• We describe how an accuracy metric can be used to make reasonable comparisons between direct, iterative, and recursive algorithms in a multigrid setting for the purposes of autotuning.

• We show how the use of dynamic programming can help us efficiently build tuned multigrid algorithms that combine methods with varying levels of accuracy while ensuring that a final target accuracy is met.

• We demonstrate that the performance of our tuned multigrid algorithms is superior to more basic reference approaches.

• We show that optimally tuned multigrid algorithms can be dependent on machine architecture, demonstrating the utility of a portable solution.

2. AUTOTUNING MULTIGRID

Although multigrid is a versatile technique that can be used to solve many different types of problems, we will use the 2D Poisson’s equation as an example and benchmark to guide our discussion. The techniques presented here are generalizable to higher dimensions and the broader set of multigrid problems.

Poisson’s equation is a partial differential equation that describes many processes in physics, electrostatics, fluid dynamics, and various other engineering disciplines. The continuous and discrete versions are

$$\nabla^2 \phi = f \quad \text{and} \quad Tx = b, \tag{1}$$

where $T$, $x$, and $b$ are the finite difference discretizations of the Laplace operator, $\phi$, and $f$, respectively.
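To make the discrete operator concrete, the sketch below applies $T$ to a grid using the standard 5-point stencil. The text only says "finite difference discretization", so the 5-point stencil, the uniform spacing `h`, and the function name `apply_laplacian` are assumptions made for illustration.

```python
def apply_laplacian(x, h):
    """Return T x on the interior of a square grid (boundary rows/columns left at 0).

    x is a list of N rows of N floats; h is the grid spacing.
    Assumes the 5-point stencil: (west + east + south + north - 4*center) / h^2.
    """
    n = len(x)
    out = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = (x[i - 1][j] + x[i + 1][j] + x[i][j - 1] + x[i][j + 1]
                         - 4.0 * x[i][j]) / (h * h)
    return out
```

A quick sanity check is that the stencil is exact for quadratics: for $\phi = u^2 + v^2$ the result is $4$ at every interior point, matching $\nabla^2 \phi = 4$.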

To build an autotuned multigrid solver for Poisson’s equation, we consider the use of three basic algorithmic building blocks: one direct (band Cholesky factorization through LAPACK’s DPBSV routine), one iterative (Red-Black Successive Over Relaxation), and one recursive (multigrid). The table below shows the computational complexity of using any single algorithm to compute a solution.

| Algorithm | Direct | SOR | Multigrid |
| --- | --- | --- | --- |
| Complexity | $n^2$ ($N^4$) | $n^{1.5}$ ($N^3$) | $n$ ($N^2$) |

From left to right, each of the methods has a larger overhead, but yields a better asymptotic serial complexity [6]. $N$ is the size of the grid on a side, and $n = N^2$ is the number of cells in the grid.

2.1 Algorithmic choice in multigrid

Multigrid is a recursive algorithm that uses the solution to a coarser grid resolution as part of the algorithm. We will first address tuning symmetric “V-type” cycles. An extension to full multigrid will be presented in Section 2.4.

For simplicity, we assume all inputs are of size $N = 2^k + 1$ for some positive integer $k$. Let $x$ be the initial state of the grid, and $b$ be the right hand side of Equation (1).

    MULTIGRID-V-SIMPLE(x, b)
    1: if N = 3 then
    2:   Solve directly
    3: else
    4:   Relax using some iterative method
    5:   Compute the residual and restrict to half resolution
    6:   Recursively call MULTIGRID-V-SIMPLE on coarser grid
    7:   Interpolate result and add correction term to current solution
    8:   Relax using some iterative method
    9: end if

It is at the recursive call on line 6 that our autotuning compiler can make a choice of whether to continue making recursive calls to multigrid or take a shortcut by using the direct solver or one of the iterative solvers at the current resolution. Figure 1 shows these possible paths of the multigrid algorithm.

Figure 1: Simplified illustration of choices in the multigrid algorithm. The diagonal arrows represent the recursive case, while the dotted horizontal arrows represent the shortcut case where a direct or iterative solution may be substituted. Depending on the desired level of accuracy a different choice may be optimal at each decision point. This figure does not illustrate the autotuner’s capability of using multiple iterations at different levels of recursion; it shows a single iteration at each level.

The idea of choice can be implemented by defining a top level function MULTIGRID-V, which makes calls to either the direct, iterative, or recursive solution. The function RECURSE implements the recursive solution.

    MULTIGRID-V(x, b)
    1: either
    2:   Solve directly
    3:   Use an iterative method
    4:   Call RECURSE for some number of iterations
    5: end either

    RECURSE(x, b)
    1: if N = 3 then
    2:   Solve directly
    3: else
    4:   Relax using some iterative method
    5:   Compute the residual and restrict to half resolution
    6:   On the coarser grid, call MULTIGRID-V
    7:   Interpolate result and add correction term to current solution
    8:   Relax using some iterative method
    9: end if

Making the choice in line 1 of MULTIGRID-V has two implications. First, the time to complete the algorithm is choice dependent. Second, the accuracy of the result is also dependent on choice since the various methods have different abilities to reduce error (depending on parameters such as number of iterations or weights). To make a fair comparison between choices, we must take both performance and accuracy of each choice into account. To this end, during the tuning process, we keep track of not just a single optimal algorithm at every recursion level, but a set of such optimal algorithms for varying levels of desired accuracy.
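The MULTIGRID-V / RECURSE pair can be sketched in runnable form. To stay short, the sketch below is a 1D analogue (the paper's benchmark is 2D) with lexicographic rather than red-black SOR, a tridiagonal Thomas solve standing in for the band Cholesky direct solver, and full-weighting restriction with linear interpolation. The `plan` dictionary, which maps a grid size to the branch of the "either" taken at that size, is a hypothetical stand-in for the choices an autotuner would record.

```python
import math


def sor_sweep(x, b, h, omega):
    """One in-place SOR sweep for (x[i-1] - 2*x[i] + x[i+1]) / h^2 = b[i]."""
    for i in range(1, len(x) - 1):
        gauss_seidel = 0.5 * (x[i - 1] + x[i + 1] - h * h * b[i])
        x[i] += omega * (gauss_seidel - x[i])


def solve_direct(x, b, h):
    """Thomas algorithm on the interior unknowns; boundary values of x are kept."""
    m = len(x) - 2
    rhs = [h * h * b[i + 1] for i in range(m)]
    rhs[0] -= x[0]                      # move known boundary values to the right side
    rhs[-1] -= x[-1]
    diag = [-2.0] * m
    for i in range(1, m):               # forward elimination (off-diagonals are 1)
        w = 1.0 / diag[i - 1]
        diag[i] -= w
        rhs[i] -= w * rhs[i - 1]
    x[m] = rhs[m - 1] / diag[m - 1]     # back substitution
    for i in range(m - 2, -1, -1):
        x[i + 1] = (rhs[i] - x[i + 2]) / diag[i]


def recurse(x, b, plan):
    """One V-shaped pass: relax, restrict, solve coarse problem, interpolate, relax."""
    n = len(x)
    h = 1.0 / (n - 1)
    if n == 3:
        solve_direct(x, b, h)
        return
    sor_sweep(x, b, h, 1.15)
    r = [0.0] * n                       # residual r = b - T x
    for i in range(1, n - 1):
        r[i] = b[i] - (x[i - 1] - 2.0 * x[i] + x[i + 1]) / (h * h)
    nc = (n + 1) // 2
    rc = [0.0] * nc                     # full-weighting restriction
    for j in range(1, nc - 1):
        rc[j] = 0.25 * (r[2 * j - 1] + 2.0 * r[2 * j] + r[2 * j + 1])
    ec = [0.0] * nc                     # coarse correction, zero initial guess
    multigrid_v(ec, rc, plan)
    for j in range(nc - 1):             # linear interpolation, add correction
        x[2 * j] += ec[j]
        x[2 * j + 1] += 0.5 * (ec[j] + ec[j + 1])
    sor_sweep(x, b, h, 1.15)


def multigrid_v(x, b, plan):
    """The 'either': plan[n] = (kind, iterations); default is one RECURSE call."""
    n = len(x)
    h = 1.0 / (n - 1)
    kind, iters = plan.get(n, ("recurse", 1))
    if kind == "direct":
        solve_direct(x, b, h)
    elif kind == "sor":
        omega = 2.0 / (1.0 + math.sin(math.pi * h))
        for _ in range(iters):
            sor_sweep(x, b, h, omega)
    else:
        for _ in range(iters):
            recurse(x, b, plan)


def error_norm(x, exact):
    """Euclidean norm of x - exact."""
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(x, exact)))
```

For example, `multigrid_v(x, b, {65: ("recurse", 10), 9: ("direct", 1)})` runs ten cycles on a 65-point grid and takes the direct-solve shortcut once the recursion reaches 9 points. Different plans reach different accuracies at different costs, which is the trade-off the following sections formalize.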

2.2 Full dynamic programming solution

We will first describe a full dynamic programming solution to handling variable accuracy, then restrict it to a discrete set of accuracies. We define an algorithm’s accuracy level to be the ratio between the error norm of its input $x_{in}$ versus the error norm of its output $x_{out}$ compared to the optimal solution $x_{opt}$:

$$\frac{\|x_{in} - x_{opt}\|_2}{\|x_{out} - x_{opt}\|_2}.$$

We choose this ratio instead of its reciprocal so that a higher accuracy level is better, which is more intuitive. In order to measure the accuracy level of a potential tuned algorithm, we assume we have access to representative training data so that the accuracy level of our algorithms during tuning closely reflects their accuracy level during use.

Figure 2: (a) Possible algorithmic choices with optimal set designated by squares (both hollow and solid). The choices designated by solid squares are the ones remembered by the PetaBricks compiler, being the fastest algorithms better than each accuracy cutoff line. (b) Choices across different accuracies in multigrid. At each level, the autotuner picks the best algorithm one level down to make a recursive call. The path highlighted in red is an example of a possible path for accuracy level $p_2$.
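The accuracy level translates directly into code. The helper below is a minimal sketch; the function names are chosen here for illustration, and returning infinity for an exact output is a convention assumed in this sketch rather than something the text specifies.

```python
import math


def l2_distance(x, y):
    """Euclidean norm of x - y for two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def accuracy_level(x_in, x_out, x_opt):
    """Ratio of input error to output error; larger means more accurate."""
    err_out = l2_distance(x_out, x_opt)
    if err_out == 0.0:
        return math.inf          # assumed convention: an exact answer has unbounded accuracy
    return l2_distance(x_in, x_opt) / err_out
```

An algorithm that shrinks the error norm by a factor of ten therefore has accuracy level 10, regardless of how large the initial error was.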

Let level $k$ refer to an input size of $N = 2^k + 1$. Suppose that for level $k - 1$, we have solved for some set $A_{k-1}$ of optimal algorithms, where optimality is defined such that no optimal algorithm is dominated by any other algorithm in both accuracy and compute time.

In order to construct the optimal set $A_k$, we try substituting all algorithms in $A_{k-1}$ for step 6 of RECURSE. We also try varying parameters in the other steps of the algorithm, including the choice of iterative methods and the number of iterations (possibly zero) in steps 4 and 8 of RECURSE and steps 3 and 4 of MULTIGRID-V.

Trying all of these possibilities will yield many algorithms that can be plotted as in Figure 2(a) according to their accuracy and compute time. The optimal algorithms we add to $A_k$ are the dominant ones designated by square markers.

The reason to remember algorithms of multiple accuracies for use in step 6 of RECURSE is that it may be better to use a less accurate, fast algorithm and then iterate multiple times, rather than use a more accurate, slow algorithm. Note that even if we use a direct solver in step 6, the interpolation in step 7 will invariably introduce error at the higher resolution.
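Selecting the dominant algorithms amounts to computing a Pareto front over (compute time, accuracy) pairs. The sketch below assumes each candidate is a `(name, time, accuracy)` tuple; that representation and the function name are illustrative choices, not PetaBricks data structures.

```python
import math


def pareto_front(candidates):
    """Return the non-dominated candidates, ordered from fastest to slowest.

    A candidate is dominated if another one is at least as fast and at least
    as accurate. Each candidate is a (name, time, accuracy) tuple.
    """
    front = []
    best_accuracy = -math.inf
    # Visit fastest first; among equal times, the most accurate first.
    for name, time, accuracy in sorted(candidates, key=lambda c: (c[1], -c[2])):
        if accuracy > best_accuracy:   # beats everything that is at least as fast
            front.append((name, time, accuracy))
            best_accuracy = accuracy
    return front
```

Along the returned list both time and accuracy increase, which corresponds to the staircase of square markers in Figure 2(a).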

2.3 Discrete dynamic programming solution

Since the optimal set of tuned algorithms can grow to be very large, the PetaBricks autotuner offers an approximate version of the above solution. Instead of remembering the full optimal set $A_k$, the compiler remembers the fastest algorithm yielding an accuracy of at least $p_i$ for each $p_i$ in some set $\{p_1, p_2, \ldots, p_m\}$. The vertical lines in Figure 2(a) indicate the discrete accuracy levels $p_i$, and the optimal algorithms (designated by solid squares) are the ones remembered by PetaBricks. Each highlighted algorithm is associated with a function MULTIGRID-V_i, which achieves accuracy $p_i$ on all input sizes.
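The discrete variant keeps one entry per accuracy cutoff instead of the whole front. A minimal sketch, again assuming hypothetical `(name, time, accuracy)` tuples measured on training data:

```python
def fastest_per_cutoff(candidates, cutoffs):
    """For each cutoff p, pick the fastest candidate whose accuracy is at least p.

    Returns a dict mapping each cutoff to the chosen (name, time, accuracy)
    tuple, or to None when no candidate is accurate enough.
    """
    table = {}
    for p in cutoffs:
        accurate_enough = [c for c in candidates if c[2] >= p]
        table[p] = min(accurate_enough, key=lambda c: c[1]) if accurate_enough else None
    return table
```

The table has at most one entry per cutoff, so its size stays fixed no matter how many candidates are generated at each level.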

Due to restricted time and computational resources, to further narrow the search space, we only use SOR as the iteration function since we found experimentally that it performed better than weighted Jacobi on our particular training data for similar computation cost per iteration. In MULTIGRID-V_i, we fix the weight parameter of SOR to $\omega_{opt}$, the optimal value for the 2D discrete Poisson’s equation with fixed boundaries [6]. In RECURSE_i, we fix SOR’s weight parameter to 1.15 (chosen by experimentation to be a good parameter when used in multigrid). We also fix the number of iterations of SOR in steps 4 and 8 in RECURSE_i to one. As more powerful computational resources become available over time, the restrictions on the algorithmic search space presented here may be relaxed to find a better solution.

The resulting accuracy-aware Poisson solver is a family of functions, where i is the accuracy parameter:

    MULTIGRID-V_i(x, b)
    1: either
    2:   Solve directly
    3:   Iterate using SOR(ω_opt) until accuracy p_i is achieved
    4:   For some j, iterate with RECURSE_j until accuracy p_i is achieved
    5: end either

    RECURSE_i(x, b)
    1: if N = 3 then
    2:   Solve directly
    3: else
    4:   Compute one iteration of SOR(1.15)
    5:   Compute the residual and restrict to half resolution
    6:   On the coarser grid, call MULTIGRID-V_i
    7:   Interpolate result and add correction term to current solution
    8:   Compute one iteration of SOR(1.15)
    9: end if

The autotuning process determines what choices to make in MULTIGRID-V_i for each i and for each input size. Since the optimal choice for any single accuracy for an input of size $2^k + 1$ depends on the optimal algorithms for all accuracies for inputs of size $2^{k-1} + 1$, the PetaBricks autotuner tunes all accuracies at a given level before moving to a higher level. In this way, the autotuner builds optimal algorithms for every specified accuracy level and for each input size up to a user specified maximum, making use of the tuned sub-algorithms as it goes.
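The level-by-level order of tuning can be illustrated with a small dynamic program. The sketch below does not time real solvers: it uses a toy cost model invented for this illustration (direct solve costs $n^2$, one SOR sweep costs $n$, a RECURSE pass costs $4n$ plus the coarse solve, and a pass with coarse accuracy $p_j$ is assumed to reduce the error by the factor $0.1 + 1/p_j$). The accuracy targets are example values as well. Only the SOR convergence factor $\omega_{opt} - 1 = (1 - \sin \pi h)/(1 + \sin \pi h)$ is a standard result for the model problem. A real autotuner would replace the model with measurements on training data.

```python
import math

# Example accuracy targets p_i (illustrative values only).
ACCURACIES = [1e1, 1e3, 1e5, 1e7, 1e9]


def iterations_needed(target, reduction):
    """Iterations so that reduction**iters <= 1/target (at least one)."""
    if reduction <= 0.0:
        return 1
    return max(1, math.ceil(math.log(target) / -math.log(reduction) - 1e-12))


def tune(max_level, accuracies=ACCURACIES):
    """Return table[k][i] = (modeled cost, choice) for grid side 2^k + 1, target p_i."""
    table = {}
    for k in range(1, max_level + 1):          # finish every accuracy at level k ...
        side = 2 ** k + 1
        cells = side * side
        s = math.sin(math.pi / (side - 1))
        sor_reduction = (1.0 - s) / (1.0 + s)  # omega_opt - 1 for the model problem
        row = []
        for target in accuracies:
            options = [(float(cells) ** 2, "direct")]
            sweeps = iterations_needed(target, sor_reduction)
            options.append((cells * sweeps, "sor x%d" % sweeps))
            if k > 1:                          # ... using the finished level k - 1
                for j, sub_accuracy in enumerate(accuracies):
                    pass_cost = 4 * cells + table[k - 1][j][0]
                    pass_reduction = min(0.99, 0.1 + 1.0 / sub_accuracy)  # toy model
                    passes = iterations_needed(target, pass_reduction)
                    options.append((passes * pass_cost, "recurse j=%d x%d" % (j, passes)))
            row.append(min(options))           # cheapest option meeting the target
        table[k] = row
    return table
```

Calling `tune(8)` fills the table for grids up to 257 points on a side. Because each entry at level $k$ only reads entries from level $k - 1$, the work grows with the number of levels times the square of the number of accuracy targets, not with the exponential number of possible cycle shapes.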

The final set of multigrid algorithms produced by the autotuner can be visualized as in Figure 2(b). Each of the versions has the flexibility to choose any of the other versions during its recursive calls, and the optimal path may switch between accuracies many times as we recurse down towards either the base case or a shortcut case.

Figure 3: Conceptual breakdown of full multigrid into an estimation phase and a solve phase. The estimation phase can be thought of as just a recursive call to full multigrid up to a coarser grid resolution. We make use of this recursive structure, in addition to our autotuned “V-type” multigrid cycles, in constructing tuned full multigrid cycles.

2.4 Extension to Autotuning Full Multigrid

Full multigrid methods have been shown to exhibit better convergence behavior than traditional symmetric cycle shapes such as the V and W cycles by utilizing an estimation phase before the solve phase (see Figure 3). The estimation phase of the full multigrid algorithm can be thought of as just a recursive call to itself at a coarser grid resolution. We extend the autotuning ideas presented thus far to leverage this structure and produce autotuned full multigrid cycles.

The following simplified code for ESTIMATE and FULL-MULTIGRID illustrates how to construct an autotuned full multigrid cycle.

    ESTIMATE_i(x, b)
    1: Compute residual and restrict to half resolution
    2: Call FULL-MULTIGRID_i on restricted problem
    3: Interpolate result and add correction to x

    FULL-MULTIGRID_i(x, b)
    1: either
    2:   Solve directly
    3:   For some j, compute estimate by calling ESTIMATE_j(x, b), then either:
    4:     Iterate using SOR(ω_opt) until accuracy p_i is achieved
    5:     For some k, iterate with RECURSE_k until accuracy p_i is achieved
    6: end either

Here we take advantage of the discrete dynamic programming analogue presented in Section 2.3 where we maintain only finite sets of optimized functions FULL-MULTIGRID_j and MULTIGRID-V_k to use in recursive calls. In FULL-MULTIGRID_i, there are three choices: the first is just a direct solve (line 2), while the latter two choices (lines 4 and 5) are similar to those given in MULTIGRID-V_i except an estimate is first calculated and then used as a starting point for iteration. Note that this structure is descriptive enough to include the standard full multigrid V or W cycle shapes, just as the MULTIGRID-V_i algorithm can produce standard regular V or W cycles.

The parameters j and k in FULL-MULTIGRID can be chosen independently, providing a great deal of flexibility in the construction of the optimized full multigrid cycle shape. In cases where the user does not require much accuracy in the final output, it may make sense to invest more heavily in the estimation phase, while in cases where very high precision is needed, a high precision estimate may not be as helpful as most of the computation would be done in relaxations at the highest resolution. Indeed, we found patterns of this type during our experiments.

2.5 Limitations

It should be clear that the algorithms produced by the autotuner are not meant to be optimal in any theoretical sense. Because of the compromises made in the name of efficiency, the resulting autotuning algorithm merely strives to discover near-optimal algorithms from within the restricted space of cycle shapes reachable during the search. There are many cycle shapes that fall outside the space of searched algorithms; for example, our approach does not check algorithms that utilize different choices in succession at the same recursion depth instead of choosing a single choice and iterating. Future work may examine the extent to which this restriction impacts performance.

Additionally, the scalar accuracy metric is an imperfect measure of the effectiveness of a multigrid cycle. Each cycle may have different effects on the various error modes (frequencies) of the current guess, all of which would be impossible to capture in a single number. Future work may expand the notion of an “optimal” set of sub-algorithms to include separate classes of algorithms that work best to reduce different types of error. Though such an approach could lead to a better final tuned algorithm, this extension would obviously make the auto-tuning process more complex.

We will demonstrate in Section 4 that although our methodology is not exhaustive, it can be quite descriptive, discovering cycle shapes that are both unconventional and efficient. That section will present actual cycle shapes produced by our multigrid autotuner and show their performance compared to less sophisticated heuristics. We will first describe the PetaBricks language and autotuning compiler in further detail.

3. PETABRICKS LANGUAGE

A key element that made our approach to multigrid possible was the PetaBricks programming language [2]. PetaBricks is a new implicitly parallel programming language in which algorithmic choice is a first class language construct. PetaBricks programs describe many possible ways to solve a problem and how they fit together. The PetaBricks compiler and runtime use these choices to autotune the program in order to find an optimal hybrid algorithm. Our implementation was written in the PetaBricks language, and we use the PetaBricks autotuner to tune our algorithms. For more information about the PetaBricks language and compiler see our prior work [2]; the following summary is included for background.

3.1 PetaBricks Language Design

The main goal of the PetaBricks language was to expose algorithmic choice to the compiler in order to empower the compiler to perform autotuning over aspects of the program not normally available to it. PetaBricks is an implicitly parallel language, where the compiler automatically parallelizes PetaBricks programs.

The PetaBricks language is built around two major constructs, transforms and rules. The transform, analogous to a function, defines an algorithm that can be called from other transforms or invoked from the command line. The header for a transform defines from, to, and through arguments, which represent inputs, outputs, and intermediate data used within the transform. The size in each dimension of these arguments is expressed symbolically in terms of free variables, the values of which must be determined by the PetaBricks runtime.

The user encodes choice by defining multiple rules in each transform. Each rule computes a region of data in order to make progress towards a final goal state. Rules can have different granularities and intermediate state. The compiler is required to find a sequence of rule applications that will compute all outputs of the program. Rules have explicit dependencies, allowing automatic parallelization and automatic detection and handling of corner cases by the compiler. The rule header references from and to regions, which are the inputs and outputs for the rule. Free variables in these regions can be set by the compiler, allowing a rule to be applied repeatedly in order to compute a larger data region. The body of a rule consists of C++-like code to perform the actual work.

3.2 PetaBricks Implementation

The PetaBricks implementation consists of three components: a source-to-source compiler from the PetaBricks language to C++, an autotuning system and choice framework to find optimal choices and set parameters, and a runtime library used by the generated code.

3.2.1 PetaBricks Compiler

The PetaBricks compiler works using three main phases. In the first phase, applicable regions (regions where each rule can legally be applied) are calculated for each possible choice using an inference system. Next, the applicable regions are aggregated together into choice grids. The choice grid divides each matrix into rectilinear regions where uniform sets of rules may legally be applied. Finally, a choice dependency graph is constructed and analyzed. The choice dependency graph consists of edges between symbolic regions in the choice grids. Each edge is annotated with the set of choices that require that edge, a direction of the data dependency, and an offset between rule centers for that dependency. The output code is generated from this choice dependency graph.

PetaBricks code generation has two modes. In the default mode, choices and information for autotuning are embedded in the output code. This binary can then be dynamically tuned, generating an optimized configuration file; subsequent runs can then use the saved configuration file. In the second mode, a previously tuned configuration file is applied statically during code generation. The second mode is included since the C++ compiler can make the final code incrementally more efficient when the choices are fixed.

3.2.2

The autotuner uses the choice dependency graph encoded

in the compiled application. This choice dependency graph

is also used by the parallel scheduler. This choice depen-

dency graph contains the choices for computing each region

and also encodes the implications of different choices on de-

Autotuning System and Choice Framework

Page 6

pendencies.

The intuition of the autotuning algorithm is that we take a

bottom-up approach to tuning. To simplify autotuning, we

assume that the optimal solution to smaller sub-problems

is independent of the larger problem. In this way we build

algorithms incrementally, starting on small inputs and work-

ing up to larger inputs.

The autotuner builds a multi-level algorithm. Each level consists of a range of input sizes and a corresponding algorithm and set of parameters. Rules that recursively invoke themselves result in algorithmic compositions. In the spirit of a genetic tuner, a population of candidate algorithms is maintained. This population is seeded with all single-algorithm implementations. The autotuner starts with a small training input and on each iteration doubles the size of the input. At each step, each algorithm in the population is tested. New algorithm candidates are generated by adding levels to the fastest members of the population. Finally, slower candidates in the population are dropped until the population is below a maximum size threshold. Since the best algorithms from the previous input size are used to generate candidates for the next input size, optimal algorithms are iteratively built from the bottom up.

In addition to tuning algorithm selection, PetaBricks uses an n-ary search tuning algorithm to optimize additional parameters such as parallel-sequential cutoff points for individual algorithms, iteration orders, block sizes (for data parallel rules), data layout, as well as user specified tunable parameters.

All choices are represented in a flat configuration space. Dependencies between these configurable parameters are exported to the autotuner so that the autotuner can choose a sensible order to tune different parameters. The autotuner starts by tuning the leaves of the graph and works its way up. If there are cycles in the dependency graph, it tunes all parameters in the cycle in parallel, with progressively larger input sizes. Finally, it repeats the entire training process, using the previous iteration as a starting point, a small number of times to better optimize the result.
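The n-ary search used for scalar parameters such as a parallel-sequential cutoff can be sketched as follows. The text does not give the details of the PetaBricks implementation, so this is a generic version under stated assumptions: the parameter is an integer, the cost is unimodal over the search range, and `cost` is a hypothetical callback that would time the program with the given parameter value.

```python
def nary_search(cost, lo, hi, n=4):
    """Minimise `cost` over the integers lo..hi, assuming it is unimodal.

    Each round samples n + 1 evenly spaced points and keeps only the
    interval between the neighbours of the best sample.
    """
    assert n >= 3, "with fewer than 3 sub-intervals the range never shrinks"
    while hi - lo > n:
        step = (hi - lo) / n
        points = sorted({lo + round(i * step) for i in range(n + 1)})
        best = min(points, key=cost)
        i = points.index(best)
        lo = points[max(i - 1, 0)]
        hi = points[min(i + 1, len(points) - 1)]
    # The remaining range is tiny, so test every value in it.
    return min(range(lo, hi + 1), key=cost)
```

For example, `nary_search(lambda c: (c - 37) ** 2, 1, 1000)` narrows the range in a handful of rounds instead of evaluating all 1000 candidate values.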

3.2.3 Runtime Library

The runtime library is primarily responsible for managing parallelism, data, and configuration. It includes a runtime scheduler as well as code responsible for reading, writing, and managing inputs, outputs, and configurations. The runtime scheduler dynamically schedules tasks (that have their input dependencies satisfied) across processors to distribute work. The scheduler attempts to maximize locality using a greedy algorithm that schedules tasks in a depth-first search order. Following the approach taken by Cilk [9], we distribute work with thread-private deques and a task stealing protocol.
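The deque discipline behind Cilk-style work stealing can be shown with a small single-threaded simulation. This is only a sketch of the idea, not the PetaBricks runtime: there are no real threads or locks, and `Worker` and `run_round_robin` are names invented for the example. Each worker takes the newest task from its own deque, which gives the depth-first order mentioned above, and an idle worker steals the oldest task from another worker.

```python
from collections import deque

class Worker:
    def __init__(self, name):
        self.name = name
        self.tasks = deque()    # private deque; the owner works at the right end
        self.executed = []      # record of tasks this worker ran

    def spawn(self, task):
        self.tasks.append(task)

    def next_task(self, others):
        if self.tasks:
            return self.tasks.pop()              # newest task first (depth-first)
        for victim in others:
            if victim.tasks:
                return victim.tasks.popleft()    # steal the victim's oldest task
        return None

def run_round_robin(workers):
    """Let each worker run one task per round until every deque is empty."""
    while any(w.tasks for w in workers):
        for w in workers:
            task = w.next_task([v for v in workers if v is not w])
            if task is not None:
                w.executed.append(task)
```

Stealing from the opposite end of the deque keeps the owner working on recently created tasks while thieves take older ones.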

4. RESULTS

In this section, we present the results of the PetaBricks autotuner when optimizing our multigrid algorithm on three parallel architectures designed for a variety of purposes: Intel Xeon E7340 server processor, AMD Opteron 2356 Barcelona server processor, and the Sun Fire T200 Niagara low power, high throughput server processor. These machines provided architectural diversity, allowing us to show not only how autotuned multigrid cycles outperform reference multigrid algorithms, but also how the shape of optimal autotuned cycles can be dependent on the underlying machine architecture.

To the best of our knowledge, there are no standard data distributions currently in wide use for benchmarking multigrid solvers, so it was not clear what the best choice is for training and benchmarking our tuned solvers. We decided to use matrices with entries drawn from two different random distributions: 1) uniform over $[-2^{32}, 2^{32}]$ (unbiased), and 2) the same distribution shifted in the positive direction by $2^{31}$ (biased). The random entries were used to generate right-hand sides (b in Equation 1) and boundary conditions (boundaries of x) for the problem. We also experimented with specifying a finite number of random point sources/sinks in the right-hand side, but since the observed results were similar to those found with the unbiased random distribution, we did not include them in the interest of space. If one wishes to obtain tuned multigrid cycles for a different input distribution, the training should be done using that data distribution.
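A minimal sketch of how such training inputs could be generated is given here. The function name `make_training_input`, the list-of-lists grid layout, and the zero interior initial guess are assumptions for illustration. Only the two distributions, uniform over $[-2^{32}, 2^{32}]$ and the same range shifted by $2^{31}$, come from the text.

```python
import random

def make_training_input(n, biased=False, seed=0):
    """Return (x, b) for an n-by-n grid.

    b holds the random right-hand side; x holds random boundary values and
    a zero interior (the zero initial guess is an assumption of this sketch).
    """
    rng = random.Random(seed)
    shift = 2.0**31 if biased else 0.0

    def draw():
        return rng.uniform(-2.0**32, 2.0**32) + shift

    b = [[draw() for _ in range(n)] for _ in range(n)]
    x = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i in (0, n - 1) or j in (0, n - 1):
                x[i][j] = draw()   # boundary condition
    return x, b
```

Calling `make_training_input(2049, biased=True)` would produce one biased training instance at the grid size used in the figures. A fixed seed makes tuning runs repeatable.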

4.1 Autotuned multigrid cycle shapes

During the tuning process for the MULTIGRID-V$_i$ algorithm presented in Section 2.3, the autotuner first computes the number of iterations needed for the SOR and RECURSE$_j$ choices before determining which is the fastest option to attain accuracy $p_i$ for each input size. Representative training data is required to make this determination. Once the number of required iterations of each choice is known, the autotuner times each choice and chooses the fastest option.
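The two-step selection described above (first find how many iterations each choice needs, then compare total times) can be sketched as follows. In the paper both quantities are measured on training data. Here they are replaced by hypothetical constants: each choice is assumed to improve accuracy by a fixed factor per iteration and to take a fixed time per iteration. The function names and the numbers in the usage example are illustrative only.

```python
def iterations_to_reach(target, gain_per_iteration, max_iterations=1000):
    """Count iterations until the accumulated accuracy reaches `target`."""
    accuracy, iterations = 1.0, 0
    while accuracy < target and iterations < max_iterations:
        accuracy *= gain_per_iteration
        iterations += 1
    return iterations if accuracy >= target else None

def pick_fastest(choices, target):
    """choices maps a name to (accuracy gain per iteration, seconds per iteration)."""
    best = None
    for name, (gain, seconds_per_iteration) in choices.items():
        iterations = iterations_to_reach(target, gain)
        if iterations is None:
            continue   # this choice cannot reach the target accuracy
        total = iterations * seconds_per_iteration
        if best is None or total < best[2]:
            best = (name, iterations, total)
    return best
```

With the made-up inputs `{"DIRECT": (float("inf"), 50.0), "SOR": (1.5, 1.0), "RECURSE_1": (10.0, 4.0)}` and a target of `1e5`, the sketch selects `RECURSE_1` with 5 iterations, since 5 x 4.0 s beats both the 29 SOR iterations and the single 50 s direct solve.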

Figures 4(a) and 4(b) show the traces of calls to the tuned MULTIGRID-V$_4$ algorithms for unbiased and biased uniform random inputs of size N = 4097, on the Intel machine. As you can see, the algorithm utilizes multiple accuracy levels throughout the call stack. In general, whenever greater accuracy is required by our tuned algorithm, it is achieved through some repetition of optimal substructures determined by the dynamic programming method. This may be easier to visualize by examining the resulting tuned cycles corresponding to the autotuned multigrid calls.

Figures 5(a) and 5(b) show some tuned “V-type” cycles created by the autotuner for unbiased and biased uniform random inputs of size N = 2049 on the AMD Opteron machine. The cycles are shown using standard multigrid notation with some extensions: The path of the algorithm progresses from left to right through time. As the path moves down, it represents a restriction to a coarser resolution, while paths up represent interpolations. Dots represent red-black SOR relaxations, solid horizontal arrows represent calls to the direct solver, and dashed horizontal arrows represent calls to the iterative solver.

As seen in the figure, a different cycle shape is used depending on what level of accuracy is required by the user. Cycles shown are tuned to produce final accuracy levels of $10$, $10^3$, $10^5$, and $10^7$. The leverage of optimal subproblems is clearly seen in the common patterns that appear across cycles. Note that in Figure 5(b), the call to the direct solver in cycle i) occurs at level 4, while for the other three cycles, the direct call occurs at level 5. This is an example of the autotuner trading accuracy for performance while accounting for the accuracy requirements of the user.

Figures 5(c) and 5(d) show autotuned full multigrid cycles for unbiased and biased uniform random inputs of size N = 2049 on the AMD Opteron machine. Although similar substructures are shared between these cycles and the “V-type” cycles in 5(a) and 5(b), some of the expensive higher resolution relaxations are avoided by allowing work to occur at the coarser grids during the estimation phase of the full multigrid algorithm. The tuned full multigrid cycle in Figure 5(d)-iv) shows how the additional flexibility of using an estimation phase can dramatically alter the tuned cycle shape when compared to Figure 5(b)-iv).

Figure 5: Optimized multigrid V (a and b) and full multigrid (c and d) cycles created by the autotuner for solving the 2D Poisson’s equation on an input of size N = 2049. Subfigures a) and c) were trained on unbiased uniform random data, while b) and d) were trained on biased uniform random data. Cycles i), ii), iii), and iv) correspond to algorithms that yield accuracy levels of $10$, $10^3$, $10^5$, and $10^7$, respectively. The solid arrows at the bottom of the cycles represent shortcut calls to the direct solver, while the dashed arrow in c)-i) represents an iterative solve using SOR. The dots present in the cycle represent single relaxations. Note that some paths in the full multigrid cycles skip relaxations while moving to a higher grid resolution. The recursion level is displayed on the left, where the size of the grid at level k is $2^k + 1$.

It is important to realize that the call stacks in Figure 4 and the cycle shapes in Figure 5 are all dependent on the specific situation at hand. They would all likely change were the autotuner run on other architectures, using different training data, or solving other multigrid problems. The flexibility to adapt to any of these changing variables by tuning over algorithmic choice is the autotuner’s greatest strength.

4.2 Performance

This section will provide data showing the performance of our tuned multigrid Poisson’s equation solver versus reference algorithms and heuristics. Test data was produced from the same distributions used for training described in Section 4. Section 4.2.1 describes performance of the autotuned MULTIGRID-V algorithm, and Section 4.2.2 describes the performance of the autotuned FULL-MULTIGRID algorithm.

4.2.1 Autotuned multigrid V algorithm

To demonstrate the effectiveness of our dynamic programming methodology, we compare the autotuned MULTIGRID-V algorithm against more basic approaches to solving the 2D Poisson’s equation to an accuracy of $10^9$, including several multigrid variations. Results presented in this section were collected on the Intel Xeon server testbed machine.

Figure 6 shows the performance of our autotuned multigrid algorithm for accuracy $10^9$ on unbiased uniform random inputs of different sizes. The autotuned algorithm uses internal accuracy levels of $\{10, 10^3, 10^5, 10^7, 10^9\}$ during its recursive calls. The figure compares the autotuned algorithm with the direct solver, iterated calls to SOR, and iterated calls to MULTIGRID-V-SIMPLE (labeled Multigrid). Each of the iterative methods is run until an accuracy of at least $10^9$ is achieved.

As expected, the autotuned algorithm outperforms all of the simple algorithms shown in Figure 6. At sizes greater than N = 65, the autotuned algorithm performs slightly better than MULTIGRID-V-SIMPLE because it utilizes a more complex tuned strategy.

Figure 4: Call stacks generated by calls to autotuned MULTIGRID-V$_4$ for a) unbiased and b) biased random inputs of size N = 4097 on an Intel Xeon server. Discrete accuracies used during autotuning were $(p_i)_{i=1..5} = (10, 10^3, 10^5, 10^7, 10^9)$. The recursion level is displayed on the left, where the size of the grid at level k is $2^k + 1$. Note that each arrow connecting to a lower recursion level actually represents a call to RECURSE$_i$, which handles grid coarsening, followed by a call to MULTIGRID-V$_i$.

Figure 7 compares the tuned algorithm with various heuristics more complex than MULTIGRID-V-SIMPLE. The training data used in this graph was drawn from the biased uniform distribution. Strategy $10^9$ refers to requiring an accuracy of $10^9$ at each recursive level of multigrid until the base case direct method is called at N = 65. Strategies of the form $10^x/10^9$ refer to requiring an accuracy of $10^x$ at each recursive level below that of the input size, which requires an accuracy of $10^9$. Thus, all strategies presented result in a final accuracy of $10^9$; they differ only in what accuracies are required at lower recursion levels. All heuristic strategies call the direct method for smaller input sizes whenever it is more efficient to meet the accuracy requirement.
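The heuristic strategies of the form $10^x/10^9$ amount to a fixed table of accuracy requirements per recursion level. The sketch below builds that table. It simplifies the description above by always switching to the direct method at level 6 (N = 65), whereas the heuristics in the text call the direct method whenever it is more efficient. The function name and its parameters are assumptions for illustration.

```python
def heuristic_accuracies(top_level, lower_exponent, final_exponent=9, direct_level=6):
    """List (level, grid size, required accuracy) from the finest level down.

    The finest level must reach 10**final_exponent; every coarser level must
    reach 10**lower_exponent; level `direct_level` uses the direct method.
    """
    plan = []
    for level in range(top_level, direct_level, -1):
        exponent = final_exponent if level == top_level else lower_exponent
        plan.append((level, 2**level + 1, 10.0**exponent))
    plan.append((direct_level, 2**direct_level + 1, "direct"))
    return plan
```

For instance, `heuristic_accuracies(8, 3)` describes Strategy $10^3/10^9$ on an input of size 257: accuracy $10^9$ at level 8, $10^3$ at level 7, and a direct solve at level 6. Passing `lower_exponent=9` gives Strategy $10^9$.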

The lines in Figure 7 are somewhat close together and difficult to see on the logarithmic time scale, so Figure 8 presents the same data but showing the ratio of times taken versus the autotuned algorithm. We can more clearly see in this figure that as the input size increases, the most efficient heuristic changes from Strategy $10^1/10^9$ to $10^3/10^9$ to $10^5/10^9$. The autotuner does better than just choosing the best from among these heuristics, since it can also tune the desired accuracy at each recursion level independently, allowing greater flexibility. This figure highlights the complexity of finding an optimal strategy and showcases the utility of an autotuner that can efficiently find this optimum.

Another big advantage of using PetaBricks for autotuning is that it allows a single program to be optimized for both sequential performance and parallel performance. We have observed our autotuner make different choices when running on different numbers of cores. Figure 9 shows the speedup achieved by our tuned MULTIGRID-V algorithms on our Intel testbed machine.

[Figure 6 plot: Time (s) versus Input Size; series: Direct, SOR, Multigrid, Autotuned.]

Figure 6: Performance for algorithms to solve Poisson’s equation on unbiased uniform random data up to an accuracy of $10^9$ using 8 cores. The basic direct and SOR algorithms as well as the standard V-cycle multigrid algorithm are all compared to our tuned multigrid algorithm. The iterated SOR algorithm uses the corresponding optimal weight $\omega_{opt}$ for each of the different input sizes.

4.2.2 Autotuned full multigrid algorithm

In order to evaluate the performance of our autotuned MULTIGRID-V and FULL-MULTIGRID algorithms on multiple architectures, we ran them for problem sizes up to N = 4097 (up to 2049 on the Sun Niagara) for target accuracy levels of $10^5$ and $10^9$ alongside two reference algorithms: an iterated V cycle and a full multigrid algorithm. The reference V cycle algorithm runs standard V cycles until the accuracy target is reached, while the reference full multigrid algorithm runs a standard full multigrid cycle (as in Figure 3), then standard V cycles until the accuracy target is reached.

We chose these two reference algorithms since they are generally deemed good starting points for those interested in implementing multigrid for the first time. Since they are easy to understand and commonly implemented, we felt they were a reasonable point of reference for our results. From these starting points, performance tweaks can be manually applied to tailor the solver to each user’s specific application domain. The goal of our autotuner is to discover and make these tweaks automatically.

Figure 10 shows the performance of both reference and autotuned multigrid algorithms for unbiased uniform random data relative to the reference iterated V-cycle algorithm on all three testbed machines. Figure 11 shows similar comparisons for biased uniform random data. The relative time (lower is better) to compute the solution up to an accuracy level of $10^5$ is plotted against problem size.

[Figures 10 and 11 plots, panels (a) to (c): Relative Time (ratio) versus Problem Size; series: Reference V, Reference Full MG, Autotuned V, Autotuned Full MG.]

Figure 10: Relative performance of multigrid algorithms versus reference V cycle algorithm for solving the 2D Poisson’s equation on unbiased, uniform random data to an accuracy level of $10^5$ on a) Intel Harpertown, b) AMD Barcelona, and c) Sun Niagara.

Figure 11: Relative performance of multigrid algorithms versus reference V cycle algorithm for solving the 2D Poisson’s equation on biased uniform random data to an accuracy level of $10^5$ on a) Intel Harpertown, b) AMD Barcelona, and c) Sun Niagara.

On all three architectures, we see that the autotuned algorithms provide an improvement over the reference algorithms’ performances. There is an especially marked difference for small problem sizes due to the autotuned algorithms’ use of the direct solve without incurring the overhead of recursion. Speedups relative to the reference full multigrid algorithm are also observed at higher problem sizes: e.g., for problem size N = 2049, we observed speedups of 1.2x, 1.1x, and 1.8x on the unbiased uniform test inputs, and 2.9x, 2.5x, and 1.8x on the biased uniform test inputs for the Intel, AMD, and Sun machines, respectively.
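The reference iterated V-cycle algorithm described in this section (run standard V cycles until the accuracy target is reached) can be sketched in a few lines. The sketch is a simplified stand-in, not the benchmarked code: it solves the 1D rather than the 2D Poisson equation, uses weighted Jacobi instead of red-black SOR, and takes accuracy to be the ratio of the RMS error of the initial guess to the RMS error of the current iterate, measured against a directly computed solution. All function names are assumptions, and NumPy is assumed to be available.

```python
import numpy as np

def residual(u, f, h):
    """r = f - A u for the 1D operator (A u)_i = (2u_i - u_{i-1} - u_{i+1}) / h^2."""
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / h**2
    return r

def relax(u, f, h, sweeps, omega=2 / 3):
    """Weighted Jacobi sweeps on the interior points (updates u in place)."""
    for _ in range(sweeps):
        u[1:-1] += omega * 0.5 * h**2 * residual(u, f, h)[1:-1]

def v_cycle(u, f, h):
    n = len(u) - 1                      # number of intervals, a power of two
    if n == 2:                          # one interior point: solve exactly
        u[1] = 0.5 * (h**2 * f[1] + u[0] + u[2])
        return u
    relax(u, f, h, sweeps=2)
    r = residual(u, f, h)
    coarse_r = np.zeros(n // 2 + 1)     # full-weighting restriction
    coarse_r[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    coarse_e = v_cycle(np.zeros_like(coarse_r), coarse_r, 2 * h)
    e = np.zeros_like(u)                # linear interpolation of the correction
    e[::2] = coarse_e
    e[1::2] = 0.5 * (coarse_e[:-1] + coarse_e[1:])
    u += e
    relax(u, f, h, sweeps=2)
    return u

def direct_solve(f, h):
    """Dense direct solve with zero boundary values, used as the reference."""
    m = len(f) - 2
    a = (2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h**2
    u = np.zeros_like(f)
    u[1:-1] = np.linalg.solve(a, f[1:-1])
    return u

def rms(v):
    return float(np.sqrt(np.mean(v**2)))

def iterated_v_cycles(f, h, target, max_cycles=50):
    """Run V cycles from a zero guess until the accuracy ratio reaches target."""
    exact = direct_solve(f, h)
    u = np.zeros_like(f)
    initial_error = rms(exact - u)
    cycles, accuracy = 0, 1.0
    while accuracy < target and cycles < max_cycles:
        v_cycle(u, f, h)
        cycles += 1
        err = rms(exact - u)
        accuracy = initial_error / err if err > 0 else float("inf")
    return u, cycles, accuracy
```

A typical call builds a random right-hand side on a grid of $2^k + 1$ points, for example `f = np.random.default_rng(0).uniform(-1.0, 1.0, 65)`, and runs `iterated_v_cycles(f, 1.0 / 64, target=1e5)`. The returned cycle count is the quantity the autotuner tries to reduce by reshaping the cycle.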

Figures 12 and 13 show similar performance comparisons, except to an accuracy level of $10^9$. The autotuner had a more difficult time beating the reference full multigrid algorithm when training for both high accuracy and large size (greater than N = 257). For sizes greater than 257, autotuned performance is essentially tied with the reference full multigrid algorithm on the Intel and AMD machines, while improvements were still possible on the Sun machine. For input size N = 2049, a speedup of 1.9x relative to the reference full multigrid algorithm was observed on the Niagara for both input distributions. We suspect that performance gains are more difficult to achieve when solving for both high accuracy and size in some part due to a greater percentage of compute time being spent on unavoidable relaxations at the finest grid resolution.

4.3 Effect of Architecture on Autotuning

Multicore architectures have drastically increased the processor design space, resulting in a large variance in processors currently on the market. Such variance significantly hinders porting efforts of performance critical code.

Figure 14 shows the different optimized cycles chosen by the autotuner on the three testbed architectures. Though all cycles were tuned to yield the same accuracy level of $10^5$, the autotuner found a different optimized cycle shape on each architecture. These differences take advantage of the specific characteristics of each machine. For example, the AMD and Sun machines recurse down to a coarse grid level of $2^4$ versus $2^5$ on the Intel machine. The AMD and Sun’s cycles appear to make up for the reduced accuracy of the coarser direct solve by doing more relaxations at medium grid resolutions (levels 9 and 10).

We found that the performance of tuned multigrid cycles can be quite sensitive to where the autotuning is performed in some cases. For example, the use of the autotuned full multigrid cycle for unbiased uniform inputs of size N = 2049 trained on the Sun Niagara but run on the Intel Xeon results in a 29% slowdown compared to the natively trained algorithm. Likewise, using the cycle trained on the Xeon results in a 79% slowdown compared to using the natively trained cycle on the Niagara.

[Figures 12 and 13 plots, panels (a) to (c): Relative Time (ratio) versus Problem Size; series: Reference V, Reference Full MG, Autotuned V, Autotuned Full MG.]

Figure 12: Relative performance of multigrid algorithms versus reference V cycle algorithm for solving the 2D Poisson’s equation on unbiased, uniform random data to an accuracy level of $10^9$ on a) Intel Harpertown, b) AMD Barcelona, and c) Sun Niagara.

Figure 13: Relative performance of multigrid algorithms versus reference V cycle algorithm for solving the 2D Poisson’s equation on biased, uniform random data to an accuracy level of $10^9$ on a) Intel Harpertown, b) AMD Barcelona, and c) Sun Niagara.

5. RELATED WORK

Some multigrid solvers using algorithmic choice have been presented in the past. SuperSolvers [3] is not an autotuner but rather a system for designing composite algorithms that leverage multiple algorithmic choices to solve sparse linear systems reliably. Our approach differs by the use of tuning algorithmic choice at different levels of the multigrid hierarchy and the use of tuned subproblems during recursion. Unfortunately, no direct performance comparison was possible for this paper due to the lack of availability of source code.

Cache-aware implementations of multigrid have also been developed. In [15], [14], and [11] optimizations improve cache utilization by reducing capacity and conflict misses during linear relaxation and inter-grid transfers. An autotuner was presented in [5] to automatically search the space of cache and memory optimizations for the relaxation step over a variety of hardware architectures. The optimizations presented in these related works are for the most part orthogonal to the approach taken in this paper. There is no reason lower-level optimizations cannot be combined with algorithmic tuning at the level of cycle shape.

A number of empirical autotuning frameworks have been developed for building efficient, portable libraries in other specific domains. PHiPAC [4] is an autotuning system for dense matrix multiply, generating portable C code and search scripts to tune for specific systems. ATLAS [17, 18] utilizes empirical autotuning to produce a cache-contained matrix multiply, which is then used in larger matrix computations in BLAS and LAPACK. FFTW [7, 8] uses empirical autotuning to combine solvers for FFTs. Other autotuning systems include SPARSITY [10] for sparse matrix computations, SPIRAL [13] for digital signal processing, UHFFT [1] for FFT on multicore systems, OSKI [16] for sparse matrix kernels, and an autotuning framework for optimizing parallel sorting algorithms by Olszewski and Voss [12].

6. FUTURE WORK

An interesting direction we wish to take this work is in the domain of tuning multi-level algorithms across distributed memory systems. The problem of discovering the best data layout and communications pattern for such a solver is very complex.

One specific problem this framework may help address is when to migrate data between machines. For example, we may want to use a smaller subset of machines once the problem is sufficiently small to reduce the surface area to volume ratio of each machine’s working set. Doing so reduces the communications overhead of relaxations, but incurs the cost of the data transfer. We wish to extend the ideas presented here to produce “optimal” algorithms parameterized not just on size and accuracy, but also on data layout. The dynamic programming search could then take data transfers into account when comparing the costs of utilizing various “optimal” sub-algorithms, each with their own associated layouts.

Another direction we plan to explore is the use of dynamic tuning where an algorithm has the ability to adapt during execution based on some features of the intermediate state. Such flexibility would allow the autotuned algorithm to classify inputs and intermediate states into different distribution classes and then switch between tuned versions of itself, providing better performance across a broader range of inputs. For example, we may want to switch between cycle shapes during execution depending on the dominant error frequencies observed in the residual.

[Figure 7 plot: Time (s) versus Input Size; series: Strategy $10^9$, Strategy $10^7/10^9$, Strategy $10^5/10^9$, Strategy $10^3/10^9$, Strategy $10^1/10^9$, Autotuned.]

Figure 7: Performance for algorithms to solve Poisson’s equation up to an accuracy of $10^9$ using 8 cores. The autotuned multigrid algorithm is presented alongside various possible heuristics. The graph omits sizes less than N = 65 since all cases call the direct method for those inputs. To see the trends more clearly, Figure 8 shows the same data as this figure, but as ratios of times taken versus the autotuned algorithm.

7. CONCLUSIONS

It has become nearly impossible to tune individual algorithms by hand for portable performance, and multigrid algorithms are no exception. No single choice of parameters can yield the best possible result for different user environments, which include problem, machine architecture, and accuracy requirements. The high performance computing community has always known that in many problem domains, the best sequential algorithm is different from the best parallel algorithm. Varying problem size and data sets will also require different algorithms. Currently there is no viable way for incorporating all these algorithmic choices into a single multigrid program to produce portable programs with consistently high performance.

In this paper we introduced a novel dynamic programming approach to autotuning multigrid algorithms. Our approach tunes with an awareness of accuracy that allows fair comparison between various direct, iterative, and recursive algorithmic types such that optimal solutions are built from the bottom up. We demonstrated that the resulting tuned cycles achieve excellent performance compared to algorithmically static implementations of multigrid.

8. ACKNOWLEDGEMENTS

This work is partially supported by NSF Award CCF-0832997 and an award from the Gigascale Systems Research Center. We wish to thank the UC Berkeley EECS department for generously letting us use one of their machines for benchmarking. Finally, we would like to thank the anonymous reviewers for their constructive feedback.

[Figure 8 plot: Times slower than Autotuned versus Input Size; series: Strategy $10^9$, Strategy $10^7/10^9$, Strategy $10^5/10^9$, Strategy $10^3/10^9$, Strategy $10^1/10^9$, Autotuned.]

Figure 8: Speedup of tuned algorithm compared to various simple heuristics to solve Poisson’s equation up to an accuracy of $10^9$ using 8 cores. The data presented in this graph is the same as in Figure 7 except that the ratio of time taken versus the autotuned algorithm is plotted. Notice that as the problem size increases, the higher accuracy heuristics become more favored since they require fewer iterations at high resolution grid sizes.

[Figure 9 plot: Speedup versus Number of Threads; series: Autotuned Poisson.]

Figure 9: Parallel scalability. Speedup as more worker threads are added. Run on an 8 core (2 processor × 4 core) x86_64 Intel Xeon System.

Figure 14: Comparison of tuned full multigrid cycles across machine architectures: i) Intel Harpertown, ii) AMD Barcelona, iii) Sun Niagara. All cycles solve the 2D Poisson’s equation on unbiased uniform random input to an accuracy of $10^5$ for an initial grid size of $2^{11}$.

9. REFERENCES

[1] A. Ali, L. Johnsson, and J. Subhlok. Scheduling FFT computation on SMP and multicore systems. In ICS ’07: Proceedings of the 21st annual international conference on Supercomputing, pages 293–301, 2007.

[2] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A language and compiler for algorithmic choice. In PLDI ’09: Proceedings of ACM SIGPLAN Conference on Programming Language Design and Implementation, 2009.

[3] S. Bhowmick, P. Raghavan, and K. Teranishi. A combinatorial scheme for developing efficient composite solvers. In ICCS ’02: Proceedings of the International Conference on Computational Science-Part II, pages 325–334, London, UK, 2002. Springer-Verlag.

[4] J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In ICS ’97: Proceedings of the 11th international conference on Supercomputing, pages 340–347, 1997.

[5] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press.

[6] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, August 1997.

[7] M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proc. 1998 IEEE Intl. Conf. Acoustics Speech and Signal Processing, volume 3, pages 1381–1384. IEEE, 1998.

[8] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, February 2005. Invited paper, special issue on “Program Generation, Optimization, and Platform Adaptation”.

[9] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 212–223, Montreal, Quebec, Canada, Jun 1998. Proceedings published ACM SIGPLAN Notices, Vol. 33, No. 5, May, 1998.

[10] E. Im and K. Yelick. Optimizing sparse matrix computations for register reuse in SPARSITY. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 127–136. Springer, 2001.

[11] M. Kowarschik and C. Weiss. Dimepack – a cache-optimized multigrid library. In Proc. of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2001), volume I, pages 425–430. CSREA, CSREA Press, 2001.

[12] M. Olszewski and M. Voss. Install-time system for automatic generation of optimized parallel sorting algorithms. In PDPTA, pages 17–23, 2004.

[13] M. Püschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE.

[14] G. Rivera and C.-W. Tseng. Tiling optimizations for 3D scientific computations. In Supercomputing ’00: Proceedings of the 2000 ACM/IEEE conference on Supercomputing (CDROM), page 32, Washington, DC, USA, 2000. IEEE Computer Society.

[15] S. Sellappa and S. Chatterjee. Cache-efficient multigrid algorithms. Int. J. High Perform. Comput. Appl., 18(1):115–133, 2004.

[16] R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing.

[17] R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Supercomputing ’98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), pages 1–27, Washington, DC, USA, 1998. IEEE Computer Society.

[18] R. C. Whaley and A. Petitet. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience, 35(2):101–121, February 2005.
