
Automatic Transformations for

Communication-Minimized Parallelization and Locality

Optimization in the Polyhedral Model

Uday Bondhugula¹, Muthu Baskaran¹, Sriram Krishnamoorthy¹,

J. Ramanujam², Atanas Rountev¹, and P. Sadayappan¹

¹ Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

{bondhugu,baskaran,krishnsr,rountev,saday}@cse.ohio-state.edu

² Dept. of Electrical and Computer Engg., Louisiana State University, Baton Rouge, LA, USA

jxr@ece.lsu.edu

Abstract. Many compute-intensive applications spend a significant fraction of

their time in nested loops. The polyhedral model provides powerful abstractions

to optimize loop nests with regular accesses for parallel execution. Affine trans-

formations in this model capture a complex sequence of execution-reordering

loop transformations that can improve performance by parallelization as well as

locality enhancement. Although a significant amount of research has addressed

affine scheduling and partitioning, the problem of automatically finding good

affine transforms for communication-optimized coarse-grained parallelization along

with locality optimization for the general case of arbitrarily-nested loop sequences

remains a challenging problem.

In this paper, we propose an automatic transformation framework to optimize

arbitrarily-nested loop sequences with affine dependences for parallelism and

locality simultaneously. The approach finds good tiling hyperplanes by embed-

ding a powerful and versatile cost function into an Integer Linear Programming

formulation. These tiling hyperplanes are used for communication-minimized

coarse-grained parallelization as well as locality optimization. It enables the min-

imization of inter-tile communication volume in the processor space, and min-

imization of reuse distances for local execution at each node. Programs requir-

ing one-dimensional versus multi-dimensional time schedules (with scheduling-

based approaches) are all handled with the same algorithm. Synchronization-free

parallelism, permutable loops or pipelined parallelism at various levels can be

detected. Preliminary results from the implemented framework show promising

performance and scalability with input size.

1 Introduction and Motivation

Current trends in architecture are increasingly towards a larger number of processing elements

on a chip. This has led to multi-core architectures becoming mainstream along

with the emergence of several specialized parallel architectures or accelerators like

the Cell processor and General-Purpose GPUs. The difficulty of programming these

architectures to effectively tap the potential of multiple on-chip processing units is a

well-known challenge. Among several approaches to addressing this issue, one that is

very promising but simultaneously very challenging is automatic parallelization. This

requires no effort on the part of the programmer in the process of parallelization and optimization

and is therefore very attractive.


Many compute-intensive applications spend most of their running time in

nested loops. This is particularly common in scientific and engineering applications.

The polyhedral model [13,7,14] provides a powerful abstraction to reason about trans-

formations on such loop nests by viewing a dynamic instance (iteration) of each state-

ment as an integer point in a well-defined space which is the statement’s polyhedron.

With such a representation for each statement and a precise characterization of inter- or

intra-statement dependences, it is possible to reason about the correctness and goodness

of a sequence of complex loop transformations using machinery from Linear Algebra

and Linear Programming. The transformations finally reflect in the generated code as

reordered execution with improved cache locality and/or loops that have been paral-

lelized. The polyhedral model is readily applicable to loop nests in which the data

access functions and loop bounds are affine combinations (linear combination with a

constant) of the enclosing loop variables and parameters. While a precise characteriza-

tion of data dependences is feasible for programs with static control structure and affine

references/loop-bounds, codes with non-affine array access functions or code with dy-

namic control can also be handled, but with conservative assumptions on some depen-

dences.

Early approaches to program transformation and automatic parallelization applied

only to perfectly nested loops and involved the application of a sequence of transfor-

mations to the program’s structure represented as an attributed abstract syntax tree. The

polyhedral model has enabled much more complex programs to be handled and easy

composition and application of more sophisticated transformations [7,14]. The task of

program optimization in the polyhedral model may be viewed in terms of three phases:

(1) static dependence analysis of the input program, (2) transformations in the polyhe-

dral abstraction, and (3) generation of efficient loop code for the transformed program.

In spite of progress in these techniques in the nineties, several scalability challenges

limited applicability to small loop nests. Significant recent advances in dependence

analysis [29] and more importantly, in code generation [24,3,28], have demonstrated

the applicability of the polyhedral techniques to code representative of real applica-

tions. However, current state-of-the-art polyhedral implementations still apply transfor-

mations manually and significant time is spent by an expert to determine the best set of

transformations that lead to improved performance [7,14]. An important open issue is

that of the choice of transformations from the huge space of valid transforms. Our work

addresses this problem, by formulating a way of obtaining good transformations fully

automatically.

Tiling is a key transformation and has been studied from two perspectives - data

locality optimization and parallelization. Tiling for locality requires grouping points in

an iteration space into smaller blocks (tiles) allowing reuse in multiple directions when

the block fits in a faster memory (registers, L1, or L2 cache). Tiling for coarse-grained

parallelism fundamentally involves partitioning the iteration space into tiles that may

be concurrently executed on different processors with a reduced frequency and volume

of inter-processor communication: a tile is atomically executed on a processor with

communication required only before and after execution. Hence, one of the key aspects

of an automatic transformation framework is to find good ways of performing tiling.

Existing automatic transformation frameworks [21,20,19,2,15] have one or more

drawbacks or restrictions that do not allow them to effectively parallelize/optimize loop

nests. All of them lack a realistic cost model that is suitable for coarse-grained parallel

execution as is used in practice with manually developed parallel applications. With the


exception of Griebl [15], previous work generally focuses on one or the other of the

complementary aspects of parallelization and locality optimization. The approach we

develop answers the following question: What is the best way to tile imperfectly nested

loop sequences to minimize the volume of communication between tiles (in processor

space) as well as improve data reuse at each processor?

The rest of this paper is organized as follows. Section 2 provides an overview of the

polyhedral model and notation. In Section 3, we describe our automatic transformation

framework in detail. Section 4 shows step-by-step application of our approach through

an example. Section 5 provides a summary of the implementation and initial results.

Section 6 discusses related work and conclusions are presented in Section 7. Full details

of the framework, transformations and optimized codes obtained for various codes, and

experimental results are available in extended reports [4,5].

2 Overview of the Polyhedral Framework

This section presents an overview of the polyhedral framework and notation used throughout

the paper.

The set X of all vectors $x \in \mathbb{Z}^n$ such that $h.x = k$, for $k \in \mathbb{Z}$, forms an affine

hyperplane. The set of parallel hyperplane instances corresponding to different values

of k is characterized by the vector h which is normal to the hyperplane. Each instance

of a hyperplane is an n − 1 dimensional affine sub-space of the n-dimensional space.

Two vectors $x_1$ and $x_2$ lie in the same hyperplane if $h.x_1 = h.x_2$.
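The definition above can be checked mechanically. The following is a minimal sketch, assuming integer points are represented as tuples; the function names and the sample points are illustrative and do not come from the paper.

```python
def hyperplane_instance(h, x):
    """Return k such that the integer point x lies on the hyperplane h.x = k."""
    return sum(hi * xi for hi, xi in zip(h, x))


def same_hyperplane(h, x1, x2):
    """Two points lie in the same hyperplane instance iff h.x1 == h.x2."""
    return hyperplane_instance(h, x1) == hyperplane_instance(h, x2)


# Illustrative data: hyperplane normal (1, 1) in a 2-d space.
h = (1, 1)
p, q, r = (2, 3), (4, 1), (3, 3)
print(same_hyperplane(h, p, q))  # p and q both satisfy h.x = 5
print(same_hyperplane(h, p, r))  # r satisfies h.x = 6
```

Each distinct value of k returned by `hyperplane_instance` identifies one of the parallel hyperplane instances described in the text.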

The set of all vectors $x \in \mathbb{Z}^n$ such that $Ax + b \ge 0$, where A is an integer matrix,

defines a (convex) integer polyhedron. A polytope is a bounded polyhedron. Each run-

time instance of a statement S is identified by its iteration vector i which contains

values for the indices of the loops surrounding it, from outermost to innermost. Hence,

a statement S is associated with a polytope which is characterized by a set of bounding

hyperplanes or faces. This is true when the loop bounds are affine combinations of

outer loop indices and program parameters (typically, symbolic constants representing

the problem size). Let p be the vector of the program parameters.

A well-known result useful for polyhedral analyses is the affine form of the

Farkas Lemma.

Lemma 1 (Affine form of Farkas Lemma). Let D be a non-empty polyhedron defined by s affine inequalities or faces: $a_k.x + b_k \ge 0$, $1 \le k \le s$. Then an affine form ψ is non-negative everywhere in D iff it is a positive affine combination of the faces:

$$\psi(x) \equiv \lambda_0 + \sum_{k} \lambda_k (a_k x + b_k), \quad \lambda_k \ge 0 \tag{1}$$

The non-negative constants $\lambda_k$ are referred to as Farkas multipliers. Proof of the if part is obvious. For the only if part, see Schrijver [27].
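As a small worked instance of Lemma 1, consider the one-dimensional polyhedron $D = \{x \mid x - 1 \ge 0,\ 10 - x \ge 0\}$ and the affine form $\psi(x) = 2x + 3$, which is non-negative on D. Both D and ψ are illustrative choices, not taken from the paper. Equating coefficients on the two sides of the identity yields one valid set of Farkas multipliers:

```latex
\begin{align*}
\psi(x) = 2x + 3 &\equiv \lambda_0 + \lambda_1 (x - 1) + \lambda_2 (10 - x) \\
\text{coefficient of } x:&\quad 2 = \lambda_1 - \lambda_2 \\
\text{constant term}:&\quad 3 = \lambda_0 - \lambda_1 + 10\lambda_2 \\
\text{one solution}:&\quad \lambda_0 = 5,\ \lambda_1 = 2,\ \lambda_2 = 0
\end{align*}
```

All three multipliers are non-negative, as the lemma requires. The multipliers are not unique in general.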

2.1 Polyhedral Dependences

Our dependence model is of exact affine dependences and is the same as the one used in [11,

20,7,29,23]. Dependences are determined precisely through dataflow analysis [10], but

we consider all dependences including anti (write-after-read), output (write-after-write)


and input (read-after-read) dependences, i.e., input code does not require conversion to

single-assignment form. The Data Dependence Graph (DDG) is a directed multi-graph

with each vertex representing a statement, and an edge, e ∈ E, from node $S_i$ to $S_j$

representing a polyhedral dependence from a dynamic instance of $S_i$ to one of $S_j$: it is

characterized by a polyhedron, $P_e$, called the dependence polyhedron that captures the

exact dependence information corresponding to edge, e. The dependence polyhedron is

in the sum of the dimensionalities of the source and target statement’s polyhedra (with

dimensions for program parameters as well).

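$$P_e \equiv \left[\begin{array}{c} DP_s \\ DP_t \\ \hline h_e \end{array}\right] \begin{pmatrix} s \\ t \\ p \\ 1 \end{pmatrix} \quad \begin{array}{l} \ge 0 \\ = 0 \end{array} \tag{2}$$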

The equalities in $P_e$ typically represent the affine function mapping the target iteration vector t to the particular source s that is the last access to the conflicting memory location, also known as the h-transformation [11]. The last access condition is not necessary though; in general, the equalities can be used to eliminate variables from $P_e$. In the rest of this section, it is assumed for convenience that s can be completely eliminated using $h_e$, being substituted by $h_e(t)$.
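As an illustration, consider the perfectly nested 1-d Jacobi of Fig. 1(a) and the dependence from the write of a[t−1,i−1] to its read in iteration (t, i). The sketch below writes out the corresponding dependence polyhedron; the primed names $t'$, $i'$ for the source iteration are introduced here for readability and are not the paper's notation.

```latex
\begin{align*}
P_e = \{\, (t', i', t, i) \;\mid\;
  & 1 \le t' \le T-1,\quad 2 \le i' \le N-2, && \text{(source domain)} \\
  & 1 \le t \le T-1,\quad 2 \le i \le N-2,   && \text{(target domain)} \\
  & t' = t - 1,\quad i' = i - 1 \,\}          && \text{(equalities, } h_e\text{)}
\end{align*}
% h_e(t, i) = (t - 1, i - 1), so the dependence distance is (t, i) - h_e(t, i) = (1, 1).
```

Substituting $h_e(t, i) = (t-1, i-1)$ for the source eliminates $t'$ and $i'$, leaving a polyhedron in the target iteration and the parameters only.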

A one-dimensional affine transform for statement $S_k$ is defined by:

$$\phi_{s_k} = \begin{bmatrix} f_1 & \dots & f_{m_{S_k}} \end{bmatrix} \begin{pmatrix} i \end{pmatrix} + f_0 = f_{S_k}\, i + f_0, \quad \text{where } f_{S_k} = [f_1, \dots, f_{m_{S_k}}],\ f_i \in \mathbb{Z} \tag{3}$$

A multi-dimensional affine transformation for a statement can now be represented by a matrix with each row being an affine hyperplane. If such a transformation matrix has full column rank, it completely specifies when and where an iteration executes (one-to-one mapping from source to target). The total number of rows in the matrix may be much larger as some special rows, splitters, may represent unfused loops at a level. Consider the code in Fig. ?? for example. Such transformations capture the fusion structure as well as compositions of permutation, reversal, relative shifting, and skewing transformations. This representation for transformations has been used by many researchers [12,17,7,14], and directly fits with scattering functions that a code generator like CLooG [3] supports. Our problem is thus to find the transformation matrices that are best for parallelism and locality.
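To make the matrix view concrete, the following minimal sketch applies a transformation, given as a list of hyperplane rows with one constant offset per row, to an iteration vector. The function name, the argument layout, and the sample 2-d transformation are assumptions made for illustration only.

```python
def apply_affine_transform(rows, offsets, iteration):
    """Map an iteration vector to its transformed coordinates.

    rows:    one hyperplane (list of integer coefficients) per matrix row
    offsets: the constant f_0 of each row
    """
    return tuple(
        sum(f * x for f, x in zip(row, iteration)) + f0
        for row, f0 in zip(rows, offsets)
    )


# Hypothetical transformation for a statement with iteration vector (t, i):
# the first row is the hyperplane (1, 0), the second row is (1, 1).
T_rows = [[1, 0], [1, 1]]
T_offsets = [0, 0]

print(apply_affine_transform(T_rows, T_offsets, (3, 5)))  # (3, 8)
```

Because this sample matrix has full column rank, distinct iterations map to distinct points in the transformed space.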

3 Finding good transformations

3.1 Legality of tiling imperfectly-nested loops

Theorem 1. Let $\phi_{s_i}$ be a one-dimensional affine transform for statement $S_i$. For $\{\phi_{s_1}, \phi_{s_2}, \dots, \phi_{s_k}\}$ to be a legal (statement-wise) tiling hyperplane, the following should hold for each edge e from $S_i$ to $S_j$:

$$\phi_{s_j}(t) - \phi_{s_i}(s) \ge 0, \quad \langle s, t \rangle \in P_e \tag{4}$$


Proof. Tiling of a statement’s iteration space defined by a set of tiling hyperplanes is said to be legal if each tile can be executed atomically and a valid total ordering of the tiles can be constructed. This implies that there exist no two tiles such that they both influence each other. Let $\{\phi^1_{s_1}, \phi^1_{s_2}, \dots, \phi^1_{s_k}\}$, $\{\phi^2_{s_1}, \phi^2_{s_2}, \dots, \phi^2_{s_k}\}$ be two statement-wise 1-d affine transforms that satisfy (4). Consider a tile formed by aggregating a group of hyperplane instances along $\phi^1_{s_i}$ and $\phi^2_{s_i}$. Due to (4), for any dynamic dependence, the target iteration is mapped to the same hyperplane or a greater hyperplane than the source, i.e., the set of all iterations that are outside of the tile and are influenced by it always lie in the forward direction along one of the independent tiling dimensions ($\phi^1$ and $\phi^2$ in this case). Similarly, all iterations influencing a tile are either in that tile or in the backward direction along one or more of the hyperplanes. The above argument holds true for both intra- and inter-statement dependences. For inter-statement dependences, this leads to an interleaved execution of tiles of iteration spaces of each statement when code is generated from these mappings. Hence, $\{\phi^1_{s_1}, \phi^1_{s_2}, \dots, \phi^1_{s_k}\}$, $\{\phi^2_{s_1}, \phi^2_{s_2}, \dots, \phi^2_{s_k}\}$ represent rectangularly tilable loops in the transformed space. If such a tile is executed on a processor, communication would be needed only before and after its execution. From a locality point of view, if such a tile is executed with the associated data fitting in a faster memory, reuse is exploited in multiple directions. □

The above condition was well-known for the case of single-statement perfectly nested loops from the work of Irigoin and Triolet [16] (as $h^T.R \ge 0$). We have generalized it above for multiple iteration spaces with exact affine dependences with possibly different dimensionalities and imperfect nestings for statements.

Tiling at an arbitrary depth. Note that the legality condition as written in (4) is imposed

on all dependences. However, if it is imposed only on dependences that have not been

carried up to a certain depth, the independent φ’s that satisfy the condition represent

tiling hyperplanes at that depth, i.e., rectangular blocking (stripmine/interchange) at

that level in the transformed program is legal.

Consider the perfectly nested version of 1-d Jacobi shown in Fig. 1(a) as an exam-

ple. This discussion also applies to the imperfectly nested version, but for convenience

we first look at the single-statement perfectly nested version. We first describe solutions

obtained by existing state of the art approaches: Lim and Lam’s affine partitioning [21,

20] and Griebl’s space and time tiling with Forward Communication-Only (FCO) placement

[15].

    for (t = 1; t < T; t++)
        for (i = 2; i < N-1; i++)
            a[t,i] = 0.33 * (a[t-1,i] + a[t-1,i-1] + a[t-1,i+1]);

(a) 1-d Jacobi: perfectly nested

    for (t = 1; t < T; t++) {
        for (i = 2; i < N-1; i++)
    S1:     b[i] = 0.33 * (a[i-1] + a[i] + a[i+1]);
        for (i = 2; i < N-1; i++)
    S2:     a[i] = b[i];
    }

(b) 1-d Jacobi: imperfectly nested

Fig. 1. 1-d Jacobi

Fig. 2. Communication volume with different valid hyperplanes for perfectly nested 1-d Jacobi. [Figure not reproduced: panels in the (i, t) iteration space showing tiles mapped to processors P0 to P3, with dependence directions (1,0), (1,1) and (2,1) and the sequential (seq) and parallel directions marked.]

Lim and Lam define legal time partitions which have the same property of tiling hyperplanes we described above. Their algorithm obtains affine partitions that minimize the order of communication while maximizing the degree of parallelism. (4) gives legality constraints: $c_t \ge 0$; $c_t + c_i \ge 0$; $c_t - c_i \ge 0$ corresponding to dependences (1,0), (1,1) and (1,-1). There are infinitely many valid solutions with the same order complexity of synchronization, but with different communication volumes that may impact performance. Although it may seem that the volume may not affect performance considering

the fact that communication startup time on modern interconnects is significant, for

higher dimensional problems like n-d Jacobi, the ratio of communication to computa-

tion increases (proportional to tile size raised to n−1). Existing works on tiling [26,25,

33] can find near communication-optimal tiles for perfectly nested loops with constant

dependences, but cannot handle arbitrarily nested loops. For 1-d Jacobi, all solutions

within the cone formed by the vectors (1,1) and (1,−1) are valid tiling hyperplanes.

For the imperfectly nested version of 1-d Jacobi, the valid cone is (2,1) and (2,−1)

(discussed later). For imperfectly nested Jacobi, Lim’s algorithm [21] finds two valid

independent solutions without optimizing for any particular criterion. In particular, the

solutions found by their algorithm (Algorithm A in [21]) are (2,−1) and (3,−1) which

are clearly not the best tiling hyperplanes to minimize communication volume, though

they do minimize the order of synchronization which is O(N) (in this case any valid

hyperplane has O(N) synchronization). Figure 2 shows that the required communica-

tion increases as the hyperplane gets more and more oblique. For a hyperplane with

normal (k,1), one would need (k + 1)T values from the neighboring tile.
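For uniform dependences, the legality condition (4) reduces to requiring a non-negative dot product between the hyperplane normal and every dependence distance vector. The sketch below checks this for the perfectly nested 1-d Jacobi; the function name and the list of candidate hyperplanes are illustrative assumptions.

```python
def is_legal_tiling_hyperplane(h, dependences):
    """Condition (4) for uniform dependences: h.d >= 0 for every distance vector d."""
    return all(sum(hk * dk for hk, dk in zip(h, d)) >= 0 for d in dependences)


# Distance vectors of the perfectly nested 1-d Jacobi, in (t, i) order.
JACOBI_DEPS = [(1, 0), (1, 1), (1, -1)]

for h in [(1, 0), (1, 1), (1, -1), (2, 1), (0, 1), (1, 2)]:
    print(h, is_legal_tiling_hyperplane(h, JACOBI_DEPS))
```

The hyperplanes inside the cone spanned by (1,1) and (1,−1) pass the check, while (0,1) and (1,2) fall outside it and are rejected.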

Using Griebl’s approach, we first find that only space tiling is enabled with Feautrier’s

schedule being θ(t,i) = t, i.e., using (1,0) as the scheduling hyperplane. With forward

communication-only (FCO) placement, an allocation is found such that dependences

have positive components along space dimensions thereby enabling tiling of the time

dimension too; this decreases the frequency of communication. In this case, time tiling

is enabled with FCO placement along (1,1). Note, however, that communication in the

processor space then occurs along (1,1), i.e., two lines of the array are required. In contrast,

using (1,0) and (1,1) as tiling hyperplanes with (1,0) as space and (1,1) as inner time and

a tile space schedule of (2,1) leads to only one line of communication along (1,0). Our

algorithm finds such a solution. We now develop a cost function for an affine transform

that captures communication volume and reuse distance.

3.2 A Linear Cost Function

Consider the following affine form $\delta_e$:

$$\delta_e(t) = \phi_{s_i}(t) - \phi_{s_j}(h_e(t)), \quad t \in P_e \tag{5}$$

The affine form δe(t) holds much significance. This function is the number of hyper-

planes the dependence e traverses along the hyperplane normal. It gives us a measure

of the reuse distance if the hyperplane is used as time, i.e., if the hyperplanes are ex-

ecuted sequentially. Also, this function is a factor in the communication volume if the


hyperplane is used to generate tiles for parallelization and used as a processor space di-

mension. An upper bound on this function would mean that the number of hyperplanes

that would be communicated as a result of the dependence at the tile boundaries would

not exceed this bound. We are particularly interested if this function can be reduced to

a constant amount or zero by choosing a suitable direction for φ: if this is possible, then

that particular dependence leads to a constant or no communication for this hyperplane.

Note that each $\delta_e$ is an affine function of the loop indices. The challenge is to use this

function to obtain a suitable objective for optimization in the affine framework.
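When a dependence is uniform, $\delta_e$ is simply the dot product of the hyperplane normal with the dependence distance vector. The following sketch evaluates it for the perfectly nested 1-d Jacobi under three candidate hyperplanes; the function name is hypothetical and the candidates are chosen for illustration.

```python
def dependence_distances(h, dependences):
    """delta_e = h.d for each uniform dependence distance vector d."""
    return [sum(hk * dk for hk, dk in zip(h, d)) for d in dependences]


# Distance vectors of the perfectly nested 1-d Jacobi, in (t, i) order.
JACOBI_DEPS = [(1, 0), (1, 1), (1, -1)]

for h in [(1, 0), (1, 1), (2, 1)]:
    deltas = dependence_distances(h, JACOBI_DEPS)
    print(h, deltas, "max:", max(deltas))
```

The maximum grows from 1 for (1,0) to 2 for (1,1) and 3 for (2,1), in line with the earlier observation that a hyperplane with normal (k,1) requires (k + 1)T values from the neighboring tile.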

Challenges. The constraints obtained from (4) only guarantee legality of tiling (per-

mutability). However, several problems are encountered when one tries to apply a per-

formance factor to find a good tile shape out of the several possibilities. Farkas Lemma

has been used by many approaches in the affine literature [11,12,21,15] to eliminate

loop variables from constraints by getting equivalent linear inequalities. The affine form

in the loop variables is represented as a positive linear combination of the faces of the

dependence polyhedron. When this is done, the coefficients of the loop variables on

the left and right hand side are equated to eliminate the constraints of variables. This

is done for each of the dependences, and the constraints obtained are aggregated. The

resulting constraints are entirely in the coefficients of the tile mappings and Farkas mul-

tipliers. All Farkas multipliers can be eliminated, some by Gaussian elimination and

the rest by Fourier-Motzkin elimination [27]. However, an attempt to minimize com-

munication volume ends up in an objective function involving both loop variables and

hyperplane coefficients. For example, $\phi(t) - \phi(h_e(t))$ could be $c_1 i + (c_2 - c_3) j$, where

$1 \le i \le N \wedge 1 \le j \le N \wedge i \le j$. One ends up with such a form when a dependence

is not uniform or for an inter-statement dependence, making it hard to construct

an objective function involving only the unknown hyperplane coefficients.

3.3 Cost Function Bounding and Minimization

Theorem 2. If all iteration spaces are bounded, there exists at least one affine form v in the structure parameters p, that bounds $\delta_e(t)$ for every dependence edge e, i.e., there exists

$$v(p) = u.p + w \tag{6}$$

such that

$$v(p) - \left(\phi_{s_i}(t) - \phi_{s_j}(h_e(t))\right) \ge 0, \quad t \in P_e, \quad \forall e \in E$$

i.e.,

$$v(p) - \delta_e(t) \ge 0, \quad t \in P_e, \quad \forall e \in E. \tag{7}$$

The idea behind the above is that even if $\delta_e$ involves loop variables, one can find large enough constants in u that would be sufficient to bound $\delta_e(t)$. Note that the loop variables themselves are bounded by affine functions of the parameters, and hence the maximum value taken by $\delta_e(t)$ will be bounded by such an affine form. Also, since $v(p) \ge \delta_e(t) \ge 0$, v should increase with an increase in the structural parameters, i.e., the coordinates of u are non-negative. The reuse distance or communication volume for each dependence is bounded in this fashion by the same affine form. Such a bounding function was used by Feautrier [11] to find minimum latency schedules.
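To make the bound concrete, take the non-uniform form $c_1 i + (c_2 - c_3) j$ over $1 \le i \le j \le N$ mentioned in Section 3.2, with the coefficient values $c_1 = 1$, $c_2 = 2$, $c_3 = 1$ assumed purely for illustration. The derivation below shows one affine form in N that bounds it, together with non-negative multipliers (in the sense of Lemma 1) over the faces $i - 1 \ge 0$, $N - j \ge 0$ and $j - i \ge 0$ that certify the bound:

```latex
\begin{align*}
\delta_e(i, j) &= i + j, \qquad 1 \le i \le j \le N \\
\delta_e(i, j) &\le N + N = 2N \\
v(N) &= 2N \quad (u = 2,\ w = 0) \\
v(N) - \delta_e(i, j) &= 2N - i - j
  \equiv 0 + 0\,(i - 1) + 2\,(N - j) + 1\,(j - i)
\end{align*}
```

Expanding the right-hand side gives $2N - 2j + j - i = 2N - i - j$, so the identity holds with all multipliers non-negative.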


Now, we apply Farkas Lemma to (7).

$$v(p) - \delta_e(t) \equiv \lambda_{e0} + \sum_{k=1}^{m_e} \lambda_{ek} P_e^k, \quad \lambda_{ek} \ge 0 \tag{8}$$

where $P_e^k$ is a face of $P_e$. The above is an identity and the coefficients of each of the loop indices in i and parameters in p on the left and right hand side can be gathered and equated. We now get linear inequalities entirely in coefficients of the affine mappings for all statements, components of row vector u, and w. The above inequalities can at once be solved by finding a lexicographic minimal solution with u and w in the leading position, and the other variables following in any order. Let $u = (u_1, u_2, \dots, u_k)$.

$$\text{minimize}_{\prec} \; \{u_1, u_2, \dots, u_k, w, \dots, c_i\text{'s}, \dots\} \tag{9}$$

Finding the lexicographic minimal solution for a system of linear inequalities is within

the reach of the simplex algorithm and can be handled by the Parametric Integer Pro-

gramming (PIP) software [9]. Since the structural parameters are quite large, we first

want to minimize their coefficients. We do not lose the optimal solution since an optimal

solution would have the smallest possible values for u’s. Note that the relative order-

ing of the structural parameters and their values at runtime may affect the solution, but

considering this is beyond the scope of this approach.

The solution gives a hyperplane for each statement. Note that the application of the

Farkas Lemma to (7) is not required in all cases. When a dependence is uniform, the

corresponding $\delta_e$ is independent of any loop variables, and Farkas Lemma need not be

applied. In such cases, we just have $w \ge \delta_e$.
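For the uniform case, where u = 0 and only w has to be minimized, the optimization can be imitated on small inputs by exhaustive search. The sketch below is not the ILP/PIP formulation used in the paper: it is a brute-force stand-in over a small coefficient range, and the names `find_min_cost_hyperplane` and `coeff_range` are hypothetical.

```python
from itertools import product


def find_min_cost_hyperplane(dependences, coeff_range=range(-3, 4)):
    """Brute-force stand-in for the ILP (uniform dependences only).

    Among legal, non-zero hyperplanes with small integer coefficients,
    return (w, h) with the smallest bound w = max over e of h.d_e.
    Ties are broken by the lexicographic order of h.
    """
    dim = len(dependences[0])
    best = None
    for h in product(coeff_range, repeat=dim):
        if not any(h):
            continue  # the zero vector is not a hyperplane
        deltas = [sum(hk * dk for hk, dk in zip(h, d)) for d in dependences]
        if min(deltas) < 0:
            continue  # violates the legality condition (4)
        candidate = (max(deltas), h)
        if best is None or candidate < best:
            best = candidate
    return best


# Perfectly nested 1-d Jacobi, distance vectors in (t, i) order.
print(find_min_cost_hyperplane([(1, 0), (1, 1), (1, -1)]))  # (1, (1, 0))
```

For 1-d Jacobi the search returns the hyperplane (1,0) with w = 1, matching the single line of communication along (1,0) discussed in Section 3.1.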

3.4 Iteratively Finding Independent Solutions

Solving the ILP formulation in the previous section gives us a single solution to the

coefficients of the best mappings for each statement. We need at least as many inde-

pendent solutions as the dimensionality of the polytope associated with each statement.

Hence, once a solution is found, we augment the ILP formulation with new constraints

and obtain the next solution; the new constraints make sure of linear independence with

solutions already found. Let the rows of $H_S$ represent the solutions found so far for a statement S. Then, the sub-space orthogonal to $H_S$ [22,18] is given by:

$$H_S^{\perp} = I - H_S^T \left(H_S H_S^T\right)^{-1} H_S \tag{10}$$

Note that $H_S^{\perp}.H_S^T = 0$, i.e., the rows of $H_S$ are orthogonal to those of $H_S^{\perp}$. Let $h_S^*$ be the next row (linear portion of the hyperplane) to be found for statement S. Let $H_S^{i\perp}$ be a row of $H_S^{\perp}$. Then, any one of the inequalities given by $\forall i,\ H_S^{i\perp}.h_S^* > 0,\ H_S^{i\perp}.h_S^* < 0$ gives the necessary constraint to be added for statement S to make sure that $h_S^*$ has a non-zero component in the sub-space orthogonal to $H_S$. This leads to a non-convex space, and ideally, all cases have to be tried and the best among those kept. When the number of statements is large, this leads to a combinatorial explosion. In such cases, we restrict ourselves to the sub-space of the orthogonal space where all the constraints are positive, i.e., the following constraints are added to the ILP formulation for linear independence:

$$\forall i,\ H_S^{i\perp}.h_S^* \ge 0 \;\wedge\; \sum_{i} H_S^{i\perp} h_S^* \ge 1 \tag{11}$$

By just considering a particular convex portion of the orthogonal sub-space, we are discarding solutions that usually involve loop reversals or combination of reversals with other transformations; however, in practice, we believe this does not make a difference. The mappings found are independent on a per-statement basis. When there are statements with different dimensionalities, the number of such independent mappings found for each statement is equal to the number of outer loops it has. Hence, no more orthogonality constraints need be added for statements for which enough independent solutions have been found (the rest of the rows get automatically filled with zeros or linearly dependent rows). As mentioned in Sec. 2, the number of rows in the transformation matrix is the same for each statement and the depth of the deepest loop nest in the target code is the same as that of the source loop nest. Overall, a hierarchy of fully permutable loop nest sets are found, and a lower level in the hierarchy will not be obtained unless constraints corresponding to dependences that have been carried by the parent permutable set have been removed.
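The orthogonal sub-space of (10) and the linear-independence constraints of (11) can be computed exactly with rational arithmetic. The sketch below uses only the Python standard library; the helper names (`transpose`, `matmul`, `inverse`, `orthogonal_subspace`, `satisfies_independence`) are hypothetical, and the code assumes the rows of H are linearly independent so that the inverse exists.

```python
from fractions import Fraction


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(m):
    """Gauss-Jordan inverse of a small non-singular square matrix of Fractions."""
    n = len(m)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def orthogonal_subspace(H):
    """H_perp = I - H^T (H H^T)^{-1} H, as in (10)."""
    H = [[Fraction(v) for v in row] for row in H]
    n = len(H[0])
    Ht = transpose(H)
    proj = matmul(matmul(Ht, inverse(matmul(H, Ht))), H)
    return [[Fraction(int(i == j)) - proj[i][j] for j in range(n)] for i in range(n)]


def satisfies_independence(H_perp, h):
    """Constraints (11): every row product is >= 0 and their sum is >= 1."""
    products = [sum(a * b for a, b in zip(row, h)) for row in H_perp]
    return all(p >= 0 for p in products) and sum(products) >= 1


# After finding (1, 0) for 1-d Jacobi, the next hyperplane must have a
# positive component along i.
H_perp = orthogonal_subspace([[1, 0]])
print(H_perp)
print(satisfies_independence(H_perp, (1, 1)), satisfies_independence(H_perp, (1, -1)))
```

With (1,0) already found, (1,1) satisfies (11) while (1,−1) does not. This shows how the restriction to the positive portion of the orthogonal sub-space discards solutions that involve reversal.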

3.5 Communication and locality optimization unified

From the algorithm described above, both synchronization-free and pipelined parallelism

are found. Note that the best possible solution to (9) is with (u = 0,w = 0) and

this happens when we find a hyperplane that has no dependence components along its

normal, which is a fully parallel loop requiring no synchronization if it is at the outer

level (outer parallel); it could be an inner parallel loop if some dependences were re-

moved previously and so a synchronization is required after the loop is executed in

parallel. Thus, in each of the steps that we find a new independent hyperplane, we

end up first finding all synchronization-free hyperplanes; these are followed by a set of

fully permutable hyperplanes that are tilable and pipelined parallel requiring constant

boundary communication (u = 0;w > 0) w.r.t the tile sizes. In the worst case, we have

a hyperplane with u > 0,w ≥ 0 resulting in long communication from non-constant

dependences. It is important to note that the latter are pushed to the innermost level. By

bringing in the notion of communication volume and its minimization, all degrees of

parallelism are found in the order of their preference.
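The three cases distinguished above depend only on the solved values of u and w. The small sketch below encodes that case analysis; the function name and the returned labels are illustrative, and it assumes u is a tuple of non-negative integers and w is non-negative.

```python
def classify_hyperplane(u, w):
    """Classify a hyperplane from its bound v(p) = u.p + w (case analysis of Sec. 3.5)."""
    if all(c == 0 for c in u):
        if w == 0:
            # No dependence component along the normal: a parallel loop,
            # synchronization-free when it is at the outer level.
            return "parallel"
        # Constant boundary communication: tilable, pipelined parallel.
        return "pipelined parallel"
    # Bound grows with the problem size: non-constant communication.
    return "non-constant communication"


print(classify_hyperplane((0,), 0))
print(classify_hyperplane((0,), 2))
print(classify_hyperplane((1,), 0))
```

Since the lexicographic minimization places u before w, hyperplanes are discovered in the order in which these labels are listed.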

From the point of view of data locality, note that the hyperplanes that are used to

scan the tile space are the same as the ones that scan points in a tile. Hence, data locality

is optimized from two angles: (1) cache misses at tile boundaries are minimized for local

execution (as cache misses at local tile boundaries are equivalent to communication

along processor tile boundaries); (2) by reducing reuse distances, we are increasing the

local cache tile sizes. The former is due to selection of good tile shapes and the latter

by the right permutation of hyperplanes (which is implicit in the order in which we find

hyperplanes).

3.6 Space and time in transformed iteration space

By minimizing φ(t) − φ(s) as we find hyperplanes from outermost to innermost, we

push dependence carrying to inner loops and also ensure that loops do not have negative


dependence components (to the extent possible) so that target loops can be tiled. Once

this is done, if the outer loops are used as space (as many as desired, say k), and

the rest are used as time (note that at least one time loop is required unless all loops are

synchronization-free parallel), communication in the processor space is optimized as

the outer space loops are the k best ones. Whenever the loops can be tiled, they result in

coarse-grained parallelism as well as better reuse within a tile. In practice, we usually

do not need more than two degrees of parallelism. If a degree of communication-free

parallelism exists, that particular loop (assuming it has a large extent) is sufficient to

expose enough coarse-grained parallelism. Note that all loops detected as parallel need

not be marked parallel.

3.7 Fusion

The algorithm described in the previous section can also enable fusion across multiple

iteration spaces that are weakly connected, as in sequences of producer-consumer loops.

Solving for hyperplanes for multiple statements leads to a schedule for each statement

such that all statements in question are finely interleaved: this is indeed fusion. This

generalization of fusion is the same as the one proposed in [7,14]. Note that we leave the

structure parameter p out of our affine transform definition in 4. The extended version

of this paper [4] describes how fusion naturally integrates into the framework.

3.8 Summary

The algorithm is summarized below. It can be viewed as transforming to a tree of

permutable loop nest sets/bands: each node of the tree is a good permutable loop nest

set. Step 12 of the repeat-until block in Algorithm 1 (Fig. 3) finds such a band of permutable

loops. If all loops are tilable, there is just one node containing all the loops that are per-

mutable. On the other extreme, if no loops are tilable, each node of the tree has just one

loop and so no tiling is possible. At least two hyperplanes should be found at any level

(without dependence removal/cutting) to enable tiling. Dependences from previously

found solutions are thus not removed unless they have to be (Step 17): to allow the next

permutable band to be found, and so on. Hence, partially tilable or untilable input is

all handled. Loops in each node of the target tree can be stripmined/interchanged when

there are at least two of them in it; however, it is illegal to move a stripmined loop across

different levels in the tree.

3.9 Accuracy of Cost Function and Refinement

The metric we presented here can be refined while keeping the problem within ILP.

The motivation behind taking a max is to avoid multiple counting of the same set of

points that need to be communicated for different dependences. This happens when all

dependences originate from the same data space and the same order volume of commu-

nication is required for each of them. Using the sum of max’es on a per-array basis is a

more accurate metric. Also, even for a single array, sets of points with very little overlap

or no overlap may have to be communicated for different dependences. Also, differ-

ent dependences may have source dependence polytopes of different dimensionalities.

Note that the image of the source dependence polytope under the data access func-

tion associated with the dependence gives the actual set of points to be communicated.


Input: Generalized dependence graph $G = (V, E)$ (includes dependence polyhedra $P_e$, $e \in E$)
 1: $S_{max}$: statement with maximum domain dimensionality
 2: for each dependence $e \in E$ do
 3:   Build legality constraints: apply Farkas Lemma on $\phi(t) - \phi(h_e(t)) \ge 0$ under $t \in P_e$, and eliminate all Farkas multipliers
 4:   Build communication volume/reuse distance bounding constraints: apply Farkas Lemma to $v(p) - (\phi(t) - \phi(h_e(t))) \ge 0$ under $P_e$, and eliminate all Farkas multipliers
 5:   Aggregate constraints from both into $C_e(i)$
 6: end for
 7: repeat
 8:   $C = \emptyset$
 9:   for each dependence edge $e \in E$ do
10:     $C \leftarrow C \cup C_e(i)$
11:   end for
12:   Compute lexicographic minimal solution with $u$'s coefficients in the leading position followed by $w$ to iteratively find independent solutions to $C$ (orthogonality constraints are added as each solution is found)
13:   if no solutions were found then
14:     Cut dependences between two strongly-connected components in the GDG and insert the appropriate splitter in the transformation matrices of the statements
15:   end if
16:   Compute $E_c$: dependences carried by solutions of Step 12/14; update necessary dependence polyhedra (when a portion of it is satisfied)
17:   $E \leftarrow E - E_c$; reform the GDG $(V, E)$
18: until $H^{\perp}_{S_{max}} = 0$ and $E = \emptyset$
Output: A transformation matrix for each statement (with the same number of rows)

Fig. 3. Algorithm 1

Hence, just using the communication rate (number of hyperplanes on the tile boundary)

as the metric may not be accurate enough. This can be taken care of by having different

bounding functions for dependences with different orders of communication, and us-

ing the bound coefficients for dependences with higher orders of communication as the

leading coefficients while finding the lexicographic minimal solution.
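One possible reading of the per-array refinement can be written out as a sketch. The set of arrays $A$, the subsets $E_a \subseteq E$ of dependences whose data space is array $a$, and the per-array bounding functions $v_a(p) = u_a \cdot p + w_a$ are notation assumed here for illustration; they follow the form of $v(p)$ used in Step 4 of Fig. 3 and are not defined in this section.

```latex
% Single bounding function: one v(p) bounds every dependence (a max over all of E).
v(p) - \bigl(\phi(t) - \phi(h_e(t))\bigr) \ge 0,
    \quad t \in P_e,\ \forall e \in E

% Per-array variant: one bounding function per array; their sum is the cost.
v_a(p) - \bigl(\phi(t) - \phi(h_e(t))\bigr) \ge 0,
    \quad t \in P_e,\ \forall e \in E_a,\ \forall a \in A
\qquad \text{cost: } \sum_{a \in A} v_a(p)
```

Under these assumptions each $v_a$ is affine in its coefficients, so the summed cost stays linear and every constraint can still be linearized with the Farkas Lemma, which is why the refinement would keep the problem within ILP.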

4 Example

Figure 4 shows an example from the literature [8] with affine non-constant dependences.
We exclude the constant $c_0$ from the mappings as we have a single statement. Dependence
analysis produces the dependence polyhedra and h-transformations shown in
Fig. 4.

Dependence 1: Tiling legality constraint:

\[ c_i i + c_j j - c_i i - c_j (j - 1) \ge 0 \;\Rightarrow\; c_j \ge 0 \]

Since this is a constant dependence, the volume bounding constraint gives:

\[ w - c_j \ge 0 \]


for (i=0; i<N; i++) {
  for (j=2; j<N; j++) {
    a[i,j] = a[j,i]+a[i,j-1];
  }
}

[Figure 4 graphic: iteration space with axes i and j, transformed space and time axes, and tiles labeled P0 to P5]

Dependences, h-transformations and dependence polyhedra:

a[i',j'] → a[i,j − 1];  h1: i' = i, j' = j − 1;  2 ≤ j ≤ N, 1 ≤ i ≤ N
a[i',j'] → a[j,i];      h2: i' = j, j' = i;      2 ≤ j ≤ N, 1 ≤ i ≤ N, i − j ≥ 1
a[j',i'] → a[i,j];      h3: j' = i, i' = j;      2 ≤ j ≤ N, 1 ≤ i ≤ N, i − j ≥ 1

Fig. 4. Example: Non-constant dependences

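Dependence 2: Tiling legality constraint:

\[ (c_i i + c_j j) - (c_i j + c_j i) \ge 0, \quad (i,j) \in P_2 \]

Applying Farkas Lemma, we have:

\[
\begin{aligned}
(c_i - c_j)\, i + (c_j - c_i)\, j \;\equiv\;& \lambda_0 + \lambda_1 (N - i) + \lambda_2 (N - j) \\
 &+ \lambda_3 (i - j - 1) + \lambda_4 (i - 1) + \lambda_5 (j - 1), \\
 & \lambda_0, \lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5 \ge 0
\end{aligned}
\tag{12}
\]

Equating LHS and RHS coefficients for i, j, N and the constants in (12), and eliminating
Farkas multipliers through Fourier-Motzkin, we obtain the following:

\[ c_i - c_j \ge 0 \]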
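For this dependence the elimination is small enough to follow by hand. The following worked step matches coefficients in (12) exactly as it is written above and eliminates the multipliers by direct substitution rather than a general Fourier-Motzkin pass:

```latex
\begin{aligned}
i:&\quad c_i - c_j = -\lambda_1 + \lambda_3 + \lambda_4 \\
j:&\quad c_j - c_i = -\lambda_2 - \lambda_3 + \lambda_5 \\
N:&\quad 0 = \lambda_1 + \lambda_2 \\
\text{const}:&\quad 0 = \lambda_0 - \lambda_3 - \lambda_4 - \lambda_5 \\[4pt]
&\lambda_1 + \lambda_2 = 0,\ \lambda_1,\lambda_2 \ge 0
   \;\Rightarrow\; \lambda_1 = \lambda_2 = 0 \\
&\text{adding the } i \text{ and } j \text{ rows: } 0 = \lambda_4 + \lambda_5
   \;\Rightarrow\; \lambda_4 = \lambda_5 = 0 \\
&\Rightarrow\; c_i - c_j = \lambda_3 \ge 0, \qquad \lambda_0 = \lambda_3
\end{aligned}
```

The only condition left on the transformation coefficients is $c_i - c_j \ge 0$.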

Volume bounding constraint:

\[ u_1 N + w - (c_i i + c_j j - c_i j - c_j i) \ge 0, \quad (i,j) \in P_2 \]

Application of Farkas Lemma in a similar way as above and elimination of the multipliers
yields:

\[
\begin{aligned}
u_1 &\ge 0 \\
u_1 - c_i + c_j &\ge 0 \\
3u_1 + w - c_i + c_j &\ge 0
\end{aligned}
\tag{13}
\]

Dependence 3: Due to symmetry with respect to i and j, the third dependence does not

give anything more than the second one.
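The legality and bounding conditions for a candidate hyperplane can be sanity-checked by enumeration for a fixed N. The sketch below is an illustration only: `max_dep_distance` is a hypothetical helper, not part of the authors' tool. It uses the polyhedra bounds listed in Fig. 4 (1 ≤ i ≤ N, 2 ≤ j ≤ N) and skips the third dependence because of the symmetry noted above.

```c
#include <assert.h>

/* Hypothetical helper (not from the paper): for a candidate hyperplane
 * phi(i,j) = ci*i + cj*j, enumerate the dependence polyhedra of Fig. 4
 * for a fixed N and return the largest value of phi(t) - phi(h(t)).
 * Returns -1 if any dependence has a negative component along the
 * hyperplane, i.e. the hyperplane is not legal for tiling. */
int max_dep_distance(int ci, int cj, int n)
{
    assert(n >= 3);   /* smallest N for which P2 is non-empty */
    int worst = 0;
    for (int i = 1; i <= n; i++) {
        for (int j = 2; j <= n; j++) {
            /* Dependence 1: source (i, j-1), target (i, j). */
            int d1 = (ci * i + cj * j) - (ci * i + cj * (j - 1));
            if (d1 < 0) return -1;
            if (d1 > worst) worst = d1;

            /* Dependence 2: source (j, i), target (i, j), where i - j >= 1. */
            if (i - j >= 1) {
                int d2 = (ci * i + cj * j) - (ci * j + cj * i);
                if (d2 < 0) return -1;
                if (d2 > worst) worst = d2;
            }
        }
    }
    return worst;
}
```

For the hyperplane (1,1) the function returns 1 for any N, while for (1,0) and (2,1) the result is N − 2 and grows with the problem size; (0,1) is rejected because the second dependence has a negative component along it.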


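Finding the hyperplanes. Aggregating legality and volume bounding constraints for all
dependences, we obtain:

\[
\begin{aligned}
c_j &\ge 0 \\
w - c_j &\ge 0 \\
c_i - c_j &\ge 0 \\
u_1 &\ge 0 \\
u_1 - c_i + c_j &\ge 0 \\
3u_1 + w - c_i + c_j &\ge 0 \\
\text{minimize}_{\prec}\;\; & (u_1, w, c_i, c_j)
\end{aligned}
\tag{14}
\]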

The lexicographic minimal solution for the vector $(u_1, w, c_i, c_j)$ is (0,1,1,1) (the zero
vector is a trivial solution and is avoided). Hence, we get $c_i = c_j = 1$. Note that $c_i = 1$
and $c_j = 0$ is not obtained even though it is a valid tiling hyperplane, as it involves more
communication: it requires $u_1$ to be positive.

The next solution is forced to have a positive component in the subspace orthogonal
to (1,1), which (10) gives as (1,−1). This leads to the addition of the constraint $c_i - c_j \ge 1$ or
$c_i - c_j \le -1$ to the existing formulation. Adding $c_i - c_j \ge 1$ to (14), the lexicographic
minimal solution is (1,0,1,0), i.e., $u_1 = 1$, $w = 0$, $c_i = 1$, $c_j = 0$ ($u_1 = 0$ is no
longer valid). Hence, (1,1) and (1,0) are the tiling hyperplanes obtained. (1,1) is used
as space with one line of communication between processors, and the hyperplane (1,0)
is used as time in a tile. The outer tile schedule is (2,1) (= (1,1) + (1,0)).
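Both solutions can be reproduced with a brute-force scan over the constraints in (14). This is a minimal sketch under stated assumptions: the search is limited to integers in [0, bound], "trivial solution" is read as the zero hyperplane $c_i = c_j = 0$, and the names `feasible` and `lexmin` are hypothetical. The authors' implementation uses PipLib, not enumeration.

```c
#include <assert.h>
#include <stdbool.h>

/* Constraints (14) on (u1, w, ci, cj). When `second` is true, the
 * orthogonality constraint ci - cj >= 1 is added as well. */
static bool feasible(int u1, int w, int ci, int cj, bool second)
{
    if (ci == 0 && cj == 0) return false;      /* skip the zero hyperplane */
    if (second && ci - cj < 1) return false;   /* independence from (1,1)  */
    return cj >= 0 && w - cj >= 0 && ci - cj >= 0 && u1 >= 0
        && u1 - ci + cj >= 0 && 3 * u1 + w - ci + cj >= 0;
}

/* Scans candidates in lexicographic order of (u1, w, ci, cj), so the first
 * feasible point is the lexicographic minimum within [0, bound]^4.
 * Stores it in out[0..3] and returns true; returns false if none exists. */
bool lexmin(int bound, bool second, int out[4])
{
    assert(bound >= 0);
    for (int u1 = 0; u1 <= bound; u1++)
        for (int w = 0; w <= bound; w++)
            for (int ci = 0; ci <= bound; ci++)
                for (int cj = 0; cj <= bound; cj++)
                    if (feasible(u1, w, ci, cj, second)) {
                        out[0] = u1; out[1] = w; out[2] = ci; out[3] = cj;
                        return true;
                    }
    return false;
}
```

Calling `lexmin(3, false, s)` yields (0,1,1,1), and `lexmin(3, true, s)` yields (1,0,1,0), matching the two hyperplanes derived in the text.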

This transformation is in contrast to other approaches based on schedules which ob-

tain a schedule and then the rest of the transformation matrix. Feautrier’s greedy heuris-

tic gives the schedule θ(i,j) = 2i+j−3 which carries all dependences. However, using

this as either space or time does not lead to communication or locality optimization. The

(2,1) hyperplane has non-constant communication along it. In fact, the only hyperplane

that has constant communication along it is (1,1). This is the best hyperplane to be used

as a space loop if the nest is to be parallelized, and is the first solution that our algo-

rithm finds. The (1,0) hyperplane is used as time leading to a solution with one degree

of pipelined parallelism with one line per tile of near-neighbor communication (along

(1,1)) as shown in Fig. 4. Hence, a good schedule that tries to carry all dependences (or

as many as possible) is not necessarily a good loop for the transformed iteration space.
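To see that the two hyperplanes define a legal execution order, the loop nest of Fig. 4 can be rescanned along c1 = i + j and c2 = i and compared against the original order. The sketch below is only an untiled illustration and not the code CLooG would generate; the problem size `N`, the initial values and the function names are assumptions made for the example, and the array is written with ordinary C indexing.

```c
#include <assert.h>
#include <string.h>

#define N 8   /* illustrative problem size, not from the paper */

/* Original loop nest of Fig. 4. */
void run_original(unsigned a[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 2; j < N; j++)
            a[i][j] = a[j][i] + a[i][j - 1];
}

/* Same iterations scanned along the two hyperplanes:
 * c1 = i + j (outer loop) and c2 = i (inner loop). */
void run_transformed(unsigned a[N][N])
{
    for (int c1 = 2; c1 <= 2 * N - 2; c1++) {
        int lo = c1 - (N - 1) > 0 ? c1 - (N - 1) : 0;   /* from j <= N-1 */
        int hi = c1 - 2 < N - 1 ? c1 - 2 : N - 1;       /* from j >= 2   */
        for (int c2 = lo; c2 <= hi; c2++) {
            int i = c2, j = c1 - c2;
            a[i][j] = a[j][i] + a[i][j - 1];
        }
    }
}

/* Returns 1 if both orders produce identical arrays from the same input. */
int orders_agree(void)
{
    assert(N >= 3);
    unsigned x[N][N], y[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = y[i][j] = (unsigned)(i * N + j + 1);
    run_original(x);
    run_transformed(y);
    return memcmp(x, y, sizeof x) == 0;
}
```

Unsigned arithmetic is used so that any overflow wraps identically in both versions. The two orders agree because every dependence has a non-negative component along both hyperplanes, so each source iteration still runs before its target.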

5 Implementation and Preliminary Results

We have implemented our transformation framework using PipLib 1.3.3 [9]. Our tool

takes as input dependence information (dependence polyhedra and h-transformations)

from LooPo’s [1] dependence tester and generates statement-wise affine transforma-

tions. The transforms generated by our tool are provided to CLooG [3] as scattering

functions. The goal is to get tiled shared memory parallel code, for example, OpenMP

code for multi-core architectures. As a final step, the detected parallelism can be mapped

to a desired parallel architecture depending on the number of degrees of parallelism

required. Results show that the tool (with preliminary optimizations) already runs ex-

tremely fast making further refinements to the model very attractive. The number of

loops shown in the table is the sum of the number of outer loops of all statements in

the original code. In theory, since our approach relies on integer programming, it has


a worst-case exponential time complexity. However, it runs extremely fast in practice.

This is mainly because program polyhedra have a simple structure, and the ILP for-

mulations resulting from them are quickly solved. Due to space constraints, we are not

including results from experimental evaluation of the transformed code. A summary of

the results can be found in Table 1. [20,19,21,15] represent the state-of-the-art from the
research community, while ICC 10.1 with '-fast -parallel' was used as the native compiler.
The results were taken on an Intel Core 2 Quad (Q6600, 2.4 GHz). The detailed
experimental evaluation can be found in [5].

                              Single core speedup                Multi-core speedup (4 cores)
Benchmark                     over native   over state-of-the-   over native   over state-of-the-
                              compiler      art research         compiler      art research
1-d Jacobi (imperfect nest)   5.23x         2.1x                 20x           1.7x
2-d FDTD                      3.7x          3.1x                 7.4x          2.46x
3-d Gauss-Seidel              1.6x          -                    4.5x          -
LU decomposition              5.56x         5.74x                14x           3.76x
Matrix Vec Transpose          9.3x          5.5x                 13x           6.96x

Table 1. Initial results: speedup over state-of-the-art

6 Related work

Iteration space tiling [16,31,32,25] is a standard approach for aggregating a set of loop

iterations into tiles, with each tile being executed atomically. In addition, researchers

have considered the problem of selecting tile shape and size to minimize communica-

tion, improve locality or minimize finish time [25,33]. These works are restricted to a

single perfectly nested loop nest with uniform dependences or similar restrictions which

are far away from real-world code.

Loop parallelization has been studied extensively. The reader is referred to the survey
of Boulet et al. [6] for a detailed summary of older parallelization algorithms which
accepted restricted input and/or are based on weaker dependence abstractions than exact
polyhedral dependences.

Scheduling with affine functions using faces of the polytope by application of the

Farkas algorithm was first proposed by Feautrier [11]. Feautrier explored various pos-

sible approaches to obtain good affine schedules that minimize latency. The schedules

carry all dependences and so all the inner loops can be parallel. However, transforming

to permutable loops that are amenable to tiling was not addressed. Though schedules

yield inner parallel loops, the time loops cannot be tiled unless communication in the

space loops is in the forward direction (dependences have positive components along all

dimensions). Several works [15,7,23] make use of such schedules. Overall, Feautrier’s

classic works [11,12] are geared towards finding maximum fine-grained parallelism as

opposed to tilability for coarse-grained parallelization with minimized communication

and better locality.

Lim and Lam [21,20] proposed an affine partitioning framework that identifies
outer parallel loops (communication-free space partitions) and permutable loops (pipelined
parallel or tilable loops) to maximize the degree of parallelism and minimize the order
of synchronization. They employ the same machinery for blocking [19]. Several


(infinitely many) solutions equivalent in terms of the criterion they optimize for result

from their algorithm, and these significantly differ in communication cost and locality;

no metric is provided to differentiate between these solutions. As seen in Sec. 3, without

a cost function, the solutions obtained even for the simplest input are not satisfactory.

Ahmed et al. [2] proposed a framework for locality optimization of imperfectly

nested loops for sequential execution. The approach determines the embedding for each

statement into a product space, which is then optimized for locality through another

transformation. Their heuristic sets reuse distances in the target space for some de-

pendences to zero (or a constant) to obtain solutions to the embedding/transformation

matrix coefficients. However, the choice of the dependences and the number, which is

important, is determined heuristically. Also, such an approach need not completely de-

termine the embedding function/transformation matrix coefficients. Exploring all pos-

sibilities here is infeasible. Overall, the automatability and robustness of the heuristic

even for simple code is not clear from the description.

Griebl [15] presents an integrated framework for optimizing locality and paral-

lelism with space and time tiling. Griebl’s approach enables time tiling by using a for-

ward communication-only placement with an existing schedule. As mentioned earlier

(Sec. 3), using schedules as time loops may not lead to communication or locality-

optimized solutions.

Cohen et al. and Girbal et al. [7,14] proposed and developed a framework (URUK/WRAP-IT)
to compose sequences of transformations in a semi-automatic fashion. Transformations
are applied automatically, but specified manually by an expert. Pouchet et al. [23]
search the space of transformations (one-dimensional schedules) to find good ones
through iterative optimization by employing performance counters. On the other hand,

our approach is fully-automatic. However, some amount of empirical and iterative op-

timization is required to choose transforms that work best in practice. This is true when

several fusion choices exist. Also, effective determination of tile sizes and unroll factors

for transformed whole-programs may only be possible through some amount of empir-

ical search. A combination of our algorithm and empirical search in a smaller space is

an interesting approach to pursue. Alternatively, more powerful cost models like those

based on computing Ehrhart polynomials [30] can be employed once transformations

in a smaller space can be enumerated.

7 Conclusions

We have presented a single framework that addresses automatic parallelization and

data locality optimization using transformations in the polyhedral model. The proposed

algorithm finds communication-minimized tiling hyperplanes for parallelization of a

sequence of arbitrarily nested loops. The same hyperplanes also minimize reuse dis-

tances and improve data locality. The approach also enables fusion in the presence of

producing-consuming loops. To the best of our knowledge, our work is the first to pro-

pose a practical cost model to drive automatic transformation in the polyhedral model.

The framework has been implemented into a tool to perform transformations in a fully

automatic way from C/Fortran code using the LooPo infrastructure and CLooG. Pre-

liminary results show very good scalability of the running time of the framework with

input size and complexity of input code.
