Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model.
ABSTRACT The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant body of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization together with locality optimization for the general case of arbitrarily-nested loop sequences remains a challenging problem. We propose an automatic transformation framework to optimize arbitrarily nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as for locality optimization. The approach enables the minimization of inter-tile communication volume in the processor space, and minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops or pipelined parallelism at various levels can be detected. Preliminary studies of the framework show promising results.

Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model
Uday Bondhugula (1), Muthu Baskaran (1), Sriram Krishnamoorthy (1), J. Ramanujam (2), Atanas Rountev (1), and P. Sadayappan (1)
(1) Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
{bondhugu,baskaran,krishnsr,rountev,saday}@cse.ohio-state.edu
(2) Dept. of Electrical and Computer Engg., Louisiana State University, Baton Rouge, LA, USA
jxr@ece.lsu.edu
Abstract. Many compute-intensive applications spend a significant fraction of their time in nested loops. The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses for parallel execution. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant amount of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization along with locality optimization for the general case of arbitrarily-nested loop sequences remains a challenging problem.
In this paper, we propose an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as locality optimization. It enables the minimization of inter-tile communication volume in the processor space, and minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops or pipelined parallelism at various levels can be detected. Preliminary results from the implemented framework show promising performance and scalability with input size.
1 Introduction and Motivation
Current trends in architecture are increasingly towards a larger number of processing elements on a chip. This has led to multi-core architectures becoming mainstream along with the emergence of several specialized parallel architectures or accelerators like the Cell processor and General-Purpose GPUs. The difficulty of programming these architectures to effectively tap the potential of multiple on-chip processing units is a well-known challenge. Among several approaches to addressing this issue, one that is very promising but simultaneously very challenging is automatic parallelization. This requires no effort on the part of the programmer in the process of parallelization and optimization and is therefore very attractive.
Many compute-intensive applications spend most of their running time in nested loops. This is particularly common in scientific and engineering applications. The polyhedral model [13,7,14] provides a powerful abstraction to reason about transformations on such loop nests by viewing a dynamic instance (iteration) of each statement as an integer point in a well-defined space which is the statement's polyhedron. With such a representation for each statement and a precise characterization of inter- or intra-statement dependences, it is possible to reason about the correctness and goodness of a sequence of complex loop transformations using machinery from Linear Algebra and Linear Programming. The transformations finally reflect in the generated code as reordered execution with improved cache locality and/or loops that have been parallelized. The polyhedral model is readily applicable to loop nests in which the data access functions and loop bounds are affine combinations (linear combination with a constant) of the enclosing loop variables and parameters. While a precise characterization of data dependences is feasible for programs with static control structure and affine references/loop-bounds, codes with non-affine array access functions or code with dynamic control can also be handled, but with conservative assumptions on some dependences.
Early approaches to program transformation and automatic parallelization applied only to perfectly nested loops and involved the application of a sequence of transformations to the program's structure represented as an attributed abstract syntax tree. The polyhedral model has enabled much more complex programs to be handled and easy composition and application of more sophisticated transformations [7,14]. The task of program optimization in the polyhedral model may be viewed in terms of three phases: (1) static dependence analysis of the input program, (2) transformations in the polyhedral abstraction, and (3) generation of efficient loop code for the transformed program. In spite of progress in these techniques in the nineties, several scalability challenges limited applicability to small loop nests. Significant recent advances in dependence analysis [29] and, more importantly, in code generation [24,3,28] have demonstrated the applicability of the polyhedral techniques to code representative of real applications. However, current state-of-the-art polyhedral implementations still apply transformations manually, and significant time is spent by an expert to determine the best set of transformations that lead to improved performance [7,14]. An important open issue is that of the choice of transformations from the huge space of valid transforms. Our work addresses this problem by formulating a way of obtaining good transformations fully automatically.
Tiling is a key transformation and has been studied from two perspectives: data locality optimization and parallelization. Tiling for locality requires grouping points in an iteration space into smaller blocks (tiles) allowing reuse in multiple directions when the block fits in a faster memory (registers, L1, or L2 cache). Tiling for coarse-grained parallelism fundamentally involves partitioning the iteration space into tiles that may be concurrently executed on different processors with a reduced frequency and volume of inter-processor communication: a tile is atomically executed on a processor with communication required only before and after execution. Hence, one of the key aspects of an automatic transformation framework is to find good ways of performing tiling.
Existing automatic transformation frameworks [21,20,19,2,15] have one or more drawbacks or restrictions that do not allow them to effectively parallelize/optimize loop nests. All of them lack a realistic cost model that is suitable for coarse-grained parallel execution as is used in practice with manually developed parallel applications. With the exception of Griebl [15], previous work generally focuses on one or the other of the complementary aspects of parallelization and locality optimization. The approach we develop answers the following question: What is the best way to tile imperfectly nested loop sequences to minimize the volume of communication between tiles (in processor space) as well as improve data reuse at each processor?
The rest of this paper is organized as follows. Section 2 provides an overview of the polyhedral model and notation. In Section 3, we describe our automatic transformation framework in detail. Section 4 shows step-by-step application of our approach through an example. Section 5 provides a summary of the implementation and initial results. Section 6 discusses related work and conclusions are presented in Section 7. Full details of the framework, transformations and optimized codes obtained for various codes, and experimental results are available in extended reports [4,5].
2 Overview of the Polyhedral Framework
This section presents an overview of the polyhedral framework and notation used throughout the paper.
The set X of all vectors $x \in \mathbb{Z}^n$ such that $h \cdot x = k$, for $k \in \mathbb{Z}$, forms an affine hyperplane. The set of parallel hyperplane instances corresponding to different values of k is characterized by the vector h which is normal to the hyperplane. Each instance of a hyperplane is an (n − 1)-dimensional affine subspace of the n-dimensional space. Two vectors $x_1$ and $x_2$ lie in the same hyperplane if $h \cdot x_1 = h \cdot x_2$.
The set of all vectors $x \in \mathbb{Z}^n$ such that $Ax + b \ge 0$, where A is an integer matrix, defines a (convex) integer polyhedron. A polytope is a bounded polyhedron. Each run-time instance of a statement S is identified by its iteration vector i which contains values for the indices of the loops surrounding it, from outermost to innermost. Hence, a statement S is associated with a polytope which is characterized by a set of bounding hyperplanes or faces. This is true when the loop bounds are affine combinations of outer loop indices and program parameters (typically, symbolic constants representing the problem size). Let p be the vector of the program parameters.
A well-known result useful for polyhedral analyses is the affine form of the Farkas Lemma.
Lemma 1 (Affine form of Farkas Lemma). Let D be a non-empty polyhedron defined by s affine inequalities or faces: $a_k \cdot x + b_k \ge 0$, $1 \le k \le s$. Then an affine form $\psi$ is non-negative everywhere in D iff it is a positive affine combination of the faces:
\[ \psi(x) \equiv \lambda_0 + \sum_{k} \lambda_k (a_k x + b_k), \quad \lambda_k \ge 0 \tag{1} \]
The non-negative constants $\lambda_k$ are referred to as Farkas multipliers. Proof of the if part is obvious. For the only if part, see Schrijver [27].
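As a small worked instance of Lemma 1 (the polyhedron and the affine forms below are chosen purely for illustration and do not come from the paper), take the interval $D = \{x \mid x \ge 0,\ 10 - x \ge 0\}$ with faces $x$ and $10 - x$. The form $\psi(x) = 2x + 3$ is non-negative on D and admits the Farkas multipliers $\lambda_0 = 3$, $\lambda_1 = 2$, $\lambda_2 = 0$:
\[ 2x + 3 \equiv 3 + 2\,(x) + 0\,(10 - x). \]
In contrast, $\psi'(x) = 5 - x$ is negative at $x = 10$. Equating coefficients in $5 - x \equiv \lambda_0 + \lambda_1 x + \lambda_2 (10 - x)$ gives
\[ \lambda_1 - \lambda_2 = -1, \qquad \lambda_0 + 10\,\lambda_2 = 5. \]
With $\lambda_0 \ge 0$ the second equation forces $\lambda_2 \le 1/2$, hence $\lambda_1 = \lambda_2 - 1 < 0$, so no non-negative multipliers exist. This coefficient matching is the step that later turns constraints over loop variables into linear constraints over the unknown transformation coefficients.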
2.1 Polyhedral Dependences
Our dependence model is of exact affine dependences and is the same as the one used in [11,20,7,29,23]. Dependences are determined precisely through dataflow analysis [10], but we consider all dependences including anti (write-after-read), output (write-after-write) and input (read-after-read) dependences, i.e., input code does not require conversion to single-assignment form. The Data Dependence Graph (DDG) is a directed multi-graph with each vertex representing a statement, and an edge, $e \in E$, from node $S_i$ to $S_j$ representing a polyhedral dependence from a dynamic instance of $S_i$ to one of $S_j$: it is characterized by a polyhedron, $P_e$, called the dependence polyhedron that captures the exact dependence information corresponding to edge e. The dependence polyhedron is in the sum of the dimensionalities of the source and target statement's polyhedra (with dimensions for program parameters as well).
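\[ P_e \equiv \left\{ \begin{bmatrix} D_{P_s} \\ D_{P_t} \\ h_e \end{bmatrix} \begin{pmatrix} s \\ t \\ p \\ 1 \end{pmatrix} \ \begin{matrix} \ge 0 \\ = 0 \end{matrix} \right\} \tag{2} \]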
The equalities in $P_e$ typically represent the affine function mapping the target iteration vector t to the particular source s that is the last access to the conflicting memory location, also known as the h-transformation [11]. The last access condition is not necessary though; in general, the equalities can be used to eliminate variables from $P_e$. In the rest of this section, it is assumed for convenience that s can be completely eliminated using $h_e$, being substituted by $h_e(t)$.
A one-dimensional affine transform for statement $S_k$ is defined by:
\[ \phi_{s_k}(i) = \begin{bmatrix} f_1 & \dots & f_{m_{S_k}} \end{bmatrix} \left( i \right) + f_0 = f_{S_k}\, i + f_0, \quad \text{where } f_{S_k} = [f_1, \dots, f_{m_{S_k}}],\ f_i \in \mathbb{Z} \tag{3} \]
A multi-dimensional affine transformation for a statement can now be represented by a matrix with each row being an affine hyperplane. If such a transformation matrix has full column rank, it completely specifies when and where an iteration executes (one-to-one mapping from source to target). The total number of rows in the matrix may be much larger as some special rows, splitters, may represent unfused loops at a level. Consider the code in Fig. ?? for example. Such transformations capture the fusion structure as well as compositions of permutation, reversal, relative shifting, and skewing transformations. This representation for transformations has been used by many researchers [12,17,7,14], and directly fits with scattering functions that a code generator like CLooG [3] supports. Our problem is thus to find the transformation matrices that are best for parallelism and locality.
3 Finding good transformations
3.1 Legality of tiling imperfectly-nested loops
Theorem 1. Let $\phi_{s_i}$ be a one-dimensional affine transform for statement $S_i$. For $\{\phi_{s_1}, \phi_{s_2}, \dots, \phi_{s_k}\}$ to be a legal (statement-wise) tiling hyperplane, the following should hold for each edge e from $S_i$ to $S_j$:
\[ \phi_{s_j}(t) - \phi_{s_i}(s) \ge 0, \quad \langle s, t \rangle \in P_e \tag{4} \]
Proof. Tiling of a statement's iteration space defined by a set of tiling hyperplanes is said to be legal if each tile can be executed atomically and a valid total ordering of the tiles can be constructed. This implies that there exist no two tiles such that they both influence each other. Let $\{\phi^1_{s_1}, \phi^1_{s_2}, \dots, \phi^1_{s_k}\}$, $\{\phi^2_{s_1}, \phi^2_{s_2}, \dots, \phi^2_{s_k}\}$ be two statement-wise 1-d affine transforms that satisfy (4). Consider a tile formed by aggregating a group of hyperplane instances along $\phi^1_{s_i}$ and $\phi^2_{s_i}$. Due to (4), for any dynamic dependence, the target iteration is mapped to the same hyperplane or a greater hyperplane than the source, i.e., the set of all iterations that are outside of the tile and are influenced by it always lie in the forward direction along one of the independent tiling dimensions ($\phi^1$ and $\phi^2$ in this case). Similarly, all iterations outside of a tile influencing it are either in that tile or in the backward direction along one or more of the hyperplanes. The above argument holds true for both intra- and inter-statement dependences. For inter-statement dependences, this leads to an interleaved execution of tiles of iteration spaces of each statement when code is generated from these mappings. Hence, $\{\phi^1_{s_1}, \phi^1_{s_2}, \dots, \phi^1_{s_k}\}$, $\{\phi^2_{s_1}, \phi^2_{s_2}, \dots, \phi^2_{s_k}\}$ represent rectangularly tilable loops in the transformed space. If such a tile is executed on a processor, communication would be needed only before and after its execution. From the locality point of view, if such a tile is executed with the associated data fitting in a faster memory, reuse is exploited in multiple directions. □
The above condition was well-known for the case of single-statement perfectly nested loops from the work of Irigoin and Triolet [16] (as $h^T \cdot R \ge 0$). We have generalized it above for multiple iteration spaces with exact affine dependences with possibly different dimensionalities and imperfect nestings for statements.
Tiling at an arbitrary depth. Note that the legality condition as written in (4) is imposed on all dependences. However, if it is imposed only on dependences that have not been carried up to a certain depth, the independent $\phi$'s that satisfy the condition represent tiling hyperplanes at that depth, i.e., rectangular blocking (strip-mine/interchange) at that level in the transformed program is legal.
Consider the perfectly nested version of 1-d Jacobi shown in Fig. 1(a) as an example. This discussion also applies to the imperfectly nested version, but for convenience we first look at the single-statement perfectly nested version. We first describe solutions obtained by existing state-of-the-art approaches: Lim and Lam's affine partitioning [21,20] and Griebl's space and time tiling with Forward Communication Only (FCO) placement [15].
for (t=1; t<T; t++)
    for (i=2; i<N-1; i++)
        a[t][i] = 0.33*(a[t-1][i] + a[t-1][i-1] + a[t-1][i+1]);
(a) 1-d Jacobi: perfectly nested

for (t=1; t<T; t++) {
    for (i=2; i<N-1; i++)
S1:     b[i] = 0.33*(a[i-1] + a[i] + a[i+1]);
    for (i=2; i<N-1; i++)
S2:     a[i] = b[i];
}
(b) 1-d Jacobi: imperfectly nested
Fig. 1. 1-d Jacobi
Fig. 2. Communication volume with different valid hyperplanes for perfectly nested 1-d Jacobi (axes t and i; panels show the hyperplanes (1,0), (1,1) and (2,1), each marked sequential or parallel, with tiles assigned to processors P0 to P3)
Lim and Lam define legal time partitions which have the same property of tiling hyperplanes we described above. Their algorithm obtains affine partitions that minimize the order of communication while maximizing the degree of parallelism. (4) gives the legality constraints $c_t \ge 0$; $c_t + c_i \ge 0$; $c_t - c_i \ge 0$, corresponding to dependences (1,0), (1,1) and (1,−1). There are infinitely many valid solutions with the same order complexity of synchronization, but with different communication volumes that may impact performance. Although it may seem that the volume may not affect performance considering the fact that communication startup time on modern interconnects is significant, for higher dimensional problems like n-d Jacobi, the ratio of communication to computation increases (proportional to tile size raised to n−1). Existing works on tiling [26,25,33] can find near communication-optimal tiles for perfectly nested loops with constant dependences, but cannot handle arbitrarily nested loops. For 1-d Jacobi, all solutions within the cone formed by the vectors (1,1) and (1,−1) are valid tiling hyperplanes. For the imperfectly nested version of 1-d Jacobi, the valid cone is (2,1) and (2,−1) (discussed later). For imperfectly nested Jacobi, Lim's algorithm [21] finds two valid independent solutions without optimizing for any particular criterion. In particular, the solutions found by their algorithm (Algorithm A in [21]) are (2,−1) and (3,−1), which are clearly not the best tiling hyperplanes to minimize communication volume, though they do minimize the order of synchronization, which is O(N) (in this case any valid hyperplane has O(N) synchronization). Figure 2 shows that the required communication increases as the hyperplane gets more and more oblique. For a hyperplane with normal (k,1), one would need (k + 1)T values from the neighboring tile.
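The legality constraints and the growth of communication with obliqueness can be checked numerically for the perfectly nested 1-d Jacobi. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes uniform dependences given as distance vectors (dt, di), which is the case for Fig. 1(a) but not for the general framework.

```python
# Distance vectors (dt, di) of the dependences in Fig. 1(a): iteration (t, i)
# reads values produced at (t-1, i), (t-1, i-1) and (t-1, i+1).
DEPS = [(1, 0), (1, 1), (1, -1)]

def dot(h, d):
    """Number of instances of hyperplane h crossed by distance vector d."""
    return h[0] * d[0] + h[1] * d[1]

def is_legal(h, deps=DEPS):
    """Condition (4) specialised to uniform dependences: h.d >= 0 for all d."""
    return all(dot(h, d) >= 0 for d in deps)

def max_crossing(h, deps=DEPS):
    """Largest number of hyperplane instances crossed by any dependence."""
    return max(dot(h, d) for d in deps)

if __name__ == "__main__":
    for h in [(1, 0), (1, 1), (1, -1), (2, 1), (3, 1), (0, 1), (1, 2)]:
        if is_legal(h):
            print(h, "legal, max crossing =", max_crossing(h))
        else:
            print(h, "illegal")
```

Under these assumptions the legal hyperplanes are exactly those with $c_t \ge |c_i|$, and a hyperplane with normal (k,1) has a maximum crossing of k + 1, the same factor that appears in the (k + 1)T estimate above.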
Using Griebl's approach, we first find that only space tiling is enabled with Feautrier's schedule being $\theta(t,i) = t$, i.e., using (1,0) as the scheduling hyperplane. With forward-communication-only (FCO) placement, an allocation is found such that dependences have positive components along space dimensions, thereby enabling tiling of the time dimension too; this decreases the frequency of communication. In this case, time tiling is enabled with FCO placement along (1,1). However, note that communication in the processor space occurs along (1,1), i.e., two lines of the array are required. In contrast, using (1,0) and (1,1) as tiling hyperplanes with (1,0) as space and (1,1) as inner time and a tile space schedule of (2,1) leads to only one line of communication along (1,0). Our algorithm finds such a solution. We now develop a cost function for an affine transform that captures communication volume and reuse distance.
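The claim that (1,0) and (1,1) form a legal pair of tiling hyperplanes for the perfectly nested 1-d Jacobi can be checked by brute force. The following Python sketch is not part of the framework: the helper names, the problem sizes and the tile size are assumptions made for the example. It runs tiles in lexicographic order of their coordinates along two hyperplanes and reports whether every dependence source executes before its target.

```python
from itertools import product

# Distance vectors (dt, di) of the dependences in Fig. 1(a).
DEPS = [(1, 0), (1, 1), (1, -1)]

def tiled_order(points, hyperplanes, tile):
    """Order in which a rectangularly tiled loop nest visits the points:
    tiles in lexicographic order, then points inside a tile likewise."""
    def key(p):
        coords = tuple(h[0] * p[0] + h[1] * p[1] for h in hyperplanes)
        tiles = tuple(c // tile for c in coords)  # floor also for negatives
        return (tiles, coords)
    return sorted(points, key=key)

def respects_dependences(points, hyperplanes, tile, deps=DEPS):
    """True if every dependence source is executed before its target."""
    position = {p: n for n, p in enumerate(tiled_order(points, hyperplanes, tile))}
    for (t, i) in points:
        for (dt, di) in deps:
            src = (t - dt, i - di)
            if src in position and position[src] > position[(t, i)]:
                return False
    return True

# Hypothetical sizes: T time steps, N array elements, square tiles of size B.
T, N, B = 8, 10, 3
POINTS = list(product(range(1, T), range(2, N - 1)))

if __name__ == "__main__":
    print("(1,0),(1,1):", respects_dependences(POINTS, [(1, 0), (1, 1)], B))
    print("(1,0),(0,1):", respects_dependences(POINTS, [(1, 0), (0, 1)], B))
```

With these sizes the skewed pair passes the check, while tiling along the original axes (1,0) and (0,1) fails, because the dependence (1,−1) has a negative component along (0,1) and so violates (4).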
3.2 A Linear Cost Function
Consider the following affine form $\delta_e$:
\[ \delta_e(t) = \phi_{s_i}(t) - \phi_{s_j}(h_e(t)), \quad t \in P_e \tag{5} \]
The affine form $\delta_e(t)$ holds much significance. This function is the number of hyperplanes the dependence e traverses along the hyperplane normal. It gives us a measure of the reuse distance if the hyperplane is used as time, i.e., if the hyperplanes are executed sequentially. Also, this function is a factor in the communication volume if the hyperplane is used to generate tiles for parallelization and used as a processor space dimension. An upper bound on this function would mean that the number of hyperplanes that would be communicated as a result of the dependence at the tile boundaries would not exceed this bound. We are particularly interested in whether this function can be reduced to a constant amount or zero by choosing a suitable direction for $\phi$: if this is possible, then that particular dependence leads to a constant or no communication for this hyperplane. Note that each $\delta_e$ is an affine function of the loop indices. The challenge is to use this function to obtain a suitable objective for optimization in the affine framework.
Challenges. The constraints obtained from (4) only guarantee legality of tiling (permutability). However, several problems are encountered when one tries to apply a performance factor to find a good tile shape out of the several possibilities. Farkas Lemma has been used by many approaches in the affine literature [11,12,21,15] to eliminate loop variables from constraints by getting equivalent linear inequalities. The affine form in the loop variables is represented as a positive linear combination of the faces of the dependence polyhedron. When this is done, the coefficients of the loop variables on the left and right hand side are equated to eliminate the constraints of variables. This is done for each of the dependences, and the constraints obtained are aggregated. The resulting constraints are entirely in the coefficients of the tile mappings and Farkas multipliers. All Farkas multipliers can be eliminated, some by Gaussian elimination and the rest by Fourier-Motzkin elimination [27]. However, an attempt to minimize communication volume ends up in an objective function involving both loop variables and hyperplane coefficients. For example, $\phi(t) - \phi(h_e(t))$ could be $c_1 i + (c_2 - c_3) j$, where $1 \le i \le N \wedge 1 \le j \le N \wedge i \le j$. One ends up with such a form when a dependence is not uniform or for an inter-statement dependence, making it hard to construct an objective function involving only the unknown hyperplane coefficients.
3.3 Cost Function Bounding and Minimization
Theorem 2. If all iteration spaces are bounded, there exists at least one affine form v in the structure parameters p that bounds $\delta_e(t)$ for every dependence edge e, i.e., there exists
\[ v(p) = u \cdot p + w \tag{6} \]
such that
\[ v(p) - \left( \phi_{s_i}(t) - \phi_{s_j}(h_e(t)) \right) \ge 0, \quad t \in P_e, \quad \forall e \in E \]
i.e.,
\[ v(p) - \delta_e(t) \ge 0, \quad t \in P_e, \quad \forall e \in E. \tag{7} \]
The idea behind the above is that even if $\delta_e$ involves loop variables, one can find large enough constants in u that would be sufficient to bound $\delta_e(t)$. Note that the loop variables themselves are bounded by affine functions of the parameters, and hence the maximum value taken by $\delta_e(t)$ will be bounded by such an affine form. Also, since $v(p) \ge \delta_e(t) \ge 0$, v should increase with an increase in the structural parameters, i.e., the coordinates of u are positive. The reuse distance or communication volume for each dependence is bounded in this fashion by the same affine form. Such a bounding function was used by Feautrier [11] to find minimum latency schedules.
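To see the bound at work on the non-uniform example given earlier, take $\delta_e = c_1 i + (c_2 - c_3) j$ over $1 \le i \le j \le N$, with the single parameter $p = (N)$. The derivation below is only an illustration of Theorem 2 and yields a valid bound, not necessarily the smallest one that the minimization would return. Since both loop variables lie between 1 and N,
\[ \delta_e = c_1 i + (c_2 - c_3)\, j \ \le\ |c_1|\, N + |c_2 - c_3|\, N , \]
so the affine form
\[ v(p) = u N + w, \qquad u = |c_1| + |c_2 - c_3|, \quad w = 0 \]
satisfies (7) for this dependence. The loop variables have disappeared from the bound, which is what makes u and w usable as an objective over the unknown hyperplane coefficients alone.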
Now, we apply Farkas Lemma to (7).
\[ v(p) - \delta_e(t) \equiv \lambda_{e0} + \sum_{k=1}^{m_e} \lambda_{ek} P_e^k, \quad \lambda_{ek} \ge 0 \tag{8} \]
where $P_e^k$ is a face of $P_e$. The above is an identity and the coefficients of each of the loop indices in i and parameters in p on the left and right hand side can be gathered and equated. We now get linear inequalities entirely in coefficients of the affine mappings for all statements, components of row vector u, and w. The above inequalities can at once be solved by finding a lexicographic minimal solution with u and w in the leading position, and the other variables following in any order. Let $u = (u_1, u_2, \dots, u_k)$.
\[ \text{minimize}_{\prec} \left\{ u_1, u_2, \dots, u_k, w, \dots, c_i\text{'s}, \dots \right\} \tag{9} \]
Finding the lexicographic minimal solution for a system of linear inequalities is within the reach of the simplex algorithm and can be handled by the Parametric Integer Programming (PIP) software [9]. Since the structural parameters are quite large, we first want to minimize their coefficients. We do not lose the optimal solution since an optimal solution would have the smallest possible values for the u's. Note that the relative ordering of the structural parameters and their values at runtime may affect the solution, but considering this is beyond the scope of this approach.
The solution gives a hyperplane for each statement. Note that the application of the Farkas Lemma to (7) is not required in all cases. When a dependence is uniform, the corresponding $\delta_e$ is independent of any loop variables, and Farkas Lemma need not be applied. In such cases, we just have $w \ge \delta_e$.
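For the uniform case the minimization can be imitated by exhaustive search. The Python sketch below is a stand-in for the ILP and not the PIP-based solver: it assumes the uniform dependences of the perfectly nested 1-d Jacobi (so u = 0 and only w matters), searches a small hypothetical coefficient range, and breaks ties by the lexicographic order of the coefficients. All names are illustrative.

```python
from itertools import product

# Distance vectors (dt, di) of the dependences in Fig. 1(a).
DEPS = [(1, 0), (1, 1), (1, -1)]

def dot(h, d):
    return sum(a * b for a, b in zip(h, d))

def min_w(h, deps=DEPS):
    """Smallest w with w >= delta_e for every uniform dependence e."""
    return max(dot(h, d) for d in deps)

def best_hyperplane(deps=DEPS, bound=3):
    """Exhaustive search over coefficients in [-bound, bound] for a legal,
    non-zero hyperplane with the smallest w; returns (w, hyperplane)."""
    best = None
    for h in product(range(-bound, bound + 1), repeat=2):
        if h == (0, 0):
            continue                      # the zero vector is not a hyperplane
        if min(dot(h, d) for d in deps) < 0:
            continue                      # violates the legality condition (4)
        candidate = (min_w(h, deps), h)
        if best is None or candidate < best:
            best = candidate
    return best

if __name__ == "__main__":
    print(best_hyperplane())
```

Under these assumptions the search returns the hyperplane (1,0) with w = 1, whereas the legal but more oblique choices (1,1) and (2,1) need w = 2 and w = 3.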
3.4 Iteratively Finding Independent Solutions
Solving the ILP formulation in the previous section gives us a single solution to the coefficients of the best mappings for each statement. We need at least as many independent solutions as the dimensionality of the polytope associated with each statement. Hence, once a solution is found, we augment the ILP formulation with new constraints and obtain the next solution; the new constraints make sure of linear independence with solutions already found. Let the rows of $H_S$ represent the solutions found so far for a statement S. Then, the subspace orthogonal to $H_S$ [22,18] is given by:
\[ H_S^{\perp} = I - H_S^T \left( H_S H_S^T \right)^{-1} H_S \tag{10} \]
Note that $H_S^{\perp} \cdot H_S^T = 0$, i.e., the rows of $H_S$ are orthogonal to those of $H_S^{\perp}$. Let $h_S^{*}$ be the next row (linear portion of the hyperplane) to be found for statement S. Let $H_S^{i\perp}$ be a row of $H_S^{\perp}$. Then, any one of the inequalities given by $\forall i,\ H_S^{i\perp} \cdot h_S^{*} > 0,\ H_S^{i\perp} \cdot h_S^{*} < 0$ gives the necessary constraint to be added for statement S to make sure that $h_S^{*}$ has a non-zero component in the subspace orthogonal to $H_S$. This leads to a non-convex space, and ideally, all cases have to be tried and the best among those kept. When the number of statements is large, this leads to a combinatorial explosion. In such cases, we restrict ourselves to the subspace of the orthogonal space where all the constraints are positive, i.e., the following constraints are added to the ILP formulation for linear independence:
\[ \forall i,\ H_S^{i\perp} \cdot h_S^{*} \ge 0 \ \wedge\ \sum_i H_S^{i\perp} \cdot h_S^{*} \ge 1 \tag{11} \]
By just considering a particular convex portion of the orthogonal subspace, we are discarding solutions that usually involve loop reversals or combination of reversals with other transformations; however, in practice, we believe this does not make a difference. The mappings found are independent on a per-statement basis. When there are statements with different dimensionalities, the number of such independent mappings found for each statement is equal to the number of outer loops it has. Hence, no more orthogonality constraints need be added for statements for which enough independent solutions have been found (the rest of the rows get automatically filled with zeros or linearly dependent rows). As mentioned in Sec. 2, the number of rows in the transformation matrix is the same for each statement and the depth of the deepest loop nest in the target code is the same as that of the source loop nest. Overall, a hierarchy of fully permutable loop nest sets is found, and a lower level in the hierarchy will not be obtained unless constraints corresponding to dependences that have been carried by the parent permutable set have been removed.
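The orthogonal subspace of (10) and the independence constraints of (11) can be tried out on the perfectly nested 1-d Jacobi. The Python sketch below is illustrative only: it assumes uniform dependences, replaces the ILP by exhaustive search over a small hypothetical coefficient range, and assumes the first solution found was (1,0). All function names are invented for the example.

```python
from fractions import Fraction
from itertools import product

DEPS = [(1, 0), (1, 1), (1, -1)]  # 1-d Jacobi distance vectors (dt, di)

def transpose(A):
    return [list(col) for col in zip(*A)]

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Gauss-Jordan inverse over exact fractions (A square, non-singular)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(int(r == c)) for c in range(n)]
         for r, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [x / scale for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

def orthogonal_subspace(H):
    """Equation (10): I - H^T (H H^T)^{-1} H, for H with independent rows."""
    Ht = transpose(H)
    P = mat_mul(mat_mul(Ht, inverse(mat_mul(H, Ht))), H)
    n = len(H[0])
    return [[Fraction(int(r == c)) - P[r][c] for c in range(n)] for r in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def independent(h, H_perp):
    """Constraints (11): each row product non-negative, their sum at least 1."""
    products = [dot(row, h) for row in H_perp]
    return all(p >= 0 for p in products) and sum(products) >= 1

def next_hyperplane(H, deps=DEPS, bound=3):
    """Exhaustive stand-in for the ILP: legal, independent of the rows of H,
    and with the smallest w; returns (w, hyperplane) or None."""
    H_perp = orthogonal_subspace(H)
    best = None
    for h in product(range(-bound, bound + 1), repeat=len(H[0])):
        crossings = [dot(h, d) for d in deps]
        if min(crossings) < 0 or not independent(h, H_perp):
            continue
        candidate = (max(crossings), h)
        if best is None or candidate < best:
            best = candidate
    return best

if __name__ == "__main__":
    print(orthogonal_subspace([[1, 0]]))
    print(next_hyperplane([[1, 0]]))
```

With $H_S = [(1,0)]$ the orthogonal subspace is spanned by (0,1), so (11) reduces to $c_i \ge 1$, and the search returns (1,1) with w = 2. Together with (1,0) this is the pair of tiling hyperplanes discussed for this example in Section 3.1.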
3.5 Communication and locality optimization unified
From the algorithm described above, both synchronization-free and pipelined parallelism are found. Note that the best possible solution to (9) is with (u = 0, w = 0) and this happens when we find a hyperplane that has no dependence components along its normal, which is a fully parallel loop requiring no synchronization if it is at the outer level (outer parallel); it could be an inner parallel loop if some dependences were removed previously and so a synchronization is required after the loop is executed in parallel. Thus, in each of the steps in which we find a new independent hyperplane, we end up first finding all synchronization-free hyperplanes; these are followed by a set of fully permutable hyperplanes that are tilable and pipelined parallel requiring constant boundary communication (u = 0; w > 0) w.r.t. the tile sizes. In the worst case, we have a hyperplane with u > 0, w ≥ 0 resulting in long communication from non-constant dependences. It is important to note that the latter are pushed to the innermost level. By bringing in the notion of communication volume and its minimization, all degrees of parallelism are found in the order of their preference.
From the point of view of data locality, note that the hyperplanes that are used to scan the tile space are the same as the ones that scan points in a tile. Hence, data locality is optimized from two angles: (1) cache misses at tile boundaries are minimized for local execution (as cache misses at local tile boundaries are equivalent to communication along processor tile boundaries); (2) by reducing reuse distances, we are increasing the local cache tile sizes. The former is due to selection of good tile shapes and the latter to the right permutation of hyperplanes (which is implicit in the order in which we find hyperplanes).
3.6 Space and time in transformed iteration space
By minimizing $\phi(t) - \phi(s)$ as we find hyperplanes from outermost to innermost, we push dependence carrying to inner loops and also ensure that loops do not have negative dependence components (to the extent possible) so that target loops can be tiled. Once this is done, if the outer loops are used as space (however many are desired, say k), and the rest are used as time (note that at least one time loop is required unless all loops are synchronization-free parallel), communication in the processor space is optimized as the outer space loops are the k best ones. Whenever the loops can be tiled, they result in coarse-grained parallelism as well as better reuse within a tile. In practice, we usually do not need more than two degrees of parallelism. If a degree of communication-free parallelism exists, that particular loop (assuming it has a large extent) is sufficient to expose enough coarse-grained parallelism. Note that all loops detected as parallel need not be marked parallel.
3.7 Fusion
The algorithm described in the previous section can also enable fusion across multiple iteration spaces that are weakly connected, as in sequences of producer-consumer loops. Solving for hyperplanes for multiple statements leads to a schedule for each statement such that all statements in question are finely interleaved: this is indeed fusion. This generalization of fusion is the same as the one proposed in [7,14]. Note that we leave the structure parameter p out of our affine transform definition in 4. The extended version of this paper [4] describes how fusion naturally integrates into the framework.
3.8 Summary
The algorithm is summarized below. It can be viewed as transforming to a tree of permutable loop nest sets/bands: each node of the tree is a good permutable loop nest set. Step 12 of the repeat-until block in Algorithm 3 finds such a band of permutable loops. If all loops are tilable, there is just one node containing all the loops that are permutable. At the other extreme, if no loops are tilable, each node of the tree has just one loop and so no tiling is possible. At least two hyperplanes should be found at any level (without dependence removal/cutting) to enable tiling. Dependences from previously found solutions are thus not removed unless they have to be (Step 17): to allow the next permutable band to be found, and so on. Hence, partially tilable or untilable input is all handled. Loops in each node of the target tree can be strip-mined/interchanged when there are at least two of them in it; however, it is illegal to move a strip-mined loop across different levels in the tree.
3.9 Accuracy of Cost Function and Refinement.
The metric we presented here can be refined while keeping the problem within ILP.
The motivation behind taking a max is to avoid multiple counting of the same set of
points that need to be communicated for different dependences. This happens when all
dependences originate from the same data space and the same order of communication
volume is required for each of them. Using the sum of max'es on a per-array basis is a
more accurate metric. Also, even for a single array, sets of points with very little overlap
or no overlap may have to be communicated for different dependences. In addition,
different dependences may have source dependence polytopes of different dimensionalities.
Note that the image of the source dependence polytope under the data access function
associated with the dependence gives the actual set of points to be communicated.
Input: Generalized dependence graph G = (V, E) (includes dependence polyhedra P_e, e ∈ E)
 1: S_max: statement with maximum domain dimensionality
 2: for each dependence e ∈ E do
 3:     Build legality constraints: apply Farkas Lemma on φ(t) − φ(h_e(t)) ≥ 0 under t ∈ P_e,
        and eliminate all Farkas multipliers
 4:     Build communication volume/reuse distance bounding constraints: apply Farkas Lemma
        to v(p) − (φ(t) − φ(h_e(t))) ≥ 0 under P_e, and eliminate all Farkas multipliers
 5:     Aggregate constraints from both into C_e(i)
 6: end for
 7: repeat
 8:     C = ∅
 9:     for each dependence edge e ∈ E do
10:         C ← C ∪ C_e(i)
11:     end for
12:     Compute lexicographic minimal solution with u's coefficients in the leading position
        followed by w to iteratively find independent solutions to C (orthogonality
        constraints are added as each solution is found)
13:     if no solutions were found then
14:         Cut dependences between two strongly-connected components in the GDG and insert
            the appropriate splitter in the transformation matrices of the statements
15:     end if
16:     Compute E_c: dependences carried by solutions of Step 12/14; update necessary
        dependence polyhedra (when a portion of it is satisfied)
17:     E ← E − E_c; reform the GDG (V, E)
18: until H⊥_{S_max} = 0 and E = ∅
Output: A transformation matrix for each statement (with the same number of rows)
Fig. 3. Algorithm 1
Hence, just using the communication rate (number of hyperplanes on the tile boundary)
as the metric may not be accurate enough. This can be taken care of by having different
bounding functions for dependences with different orders of communication, and using
the bound coefficients for dependences with higher orders of communication as the
leading coefficients while finding the lexicographic minimal solution.
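One possible way to write down the per-array refinement discussed in this section is sketched here. The per-array edge sets $E_A$ and bounding functions $v_A$ are notation introduced only for this illustration and are an assumption, not part of the formulation given in the paper; how the resulting coefficients would be ordered in the lexicographic objective is left open.

```latex
% Single bounding function over all dependences (the metric used so far)
v(\vec{p}) - \bigl(\phi(\vec{t}) - \phi(h_e(\vec{t}))\bigr) \ge 0,
\qquad \forall\, \vec{t} \in P_e,\ \forall\, e \in E

% Assumed per-array variant: one bound v_A per array A, over the
% dependences E_A that originate from A, with the sum of bounds minimized
v_A(\vec{p}) - \bigl(\phi(\vec{t}) - \phi(h_e(\vec{t}))\bigr) \ge 0,
\qquad \forall\, \vec{t} \in P_e,\ \forall\, e \in E_A,
\qquad \text{minimize } \sum_{A} v_A(\vec{p})
```

Each $v_A$ plays the role of the max over the dependences of one array, so their sum corresponds to the "sum of max'es" mentioned above.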
4 Example
Figure 4 shows an example from the literature [8] with affine non-constant dependences.
We exclude the constant c0 from the mappings as we have a single statement. Dependence
analysis produces the dependence polyhedra and h-transformations shown in
Fig. 4.
Dependence 1: Tiling legality constraint:
\[ c_i i + c_j j - c_i i - c_j (j - 1) \ge 0 \;\Rightarrow\; c_j \ge 0 \]
Since this is a constant dependence, the volume bounding constraint gives:
\[ w - c_j \ge 0 \]
for (i = 0; i < N; i++) {
    for (j = 2; j < N; j++) {
        a[i][j] = a[j][i] + a[i][j-1];
    }
}
[Figure: tiled iteration space with axes i and j; tiles are labeled with processors P0 to P5 along the space and time directions.]
Dependence 1: a[i',j'] → a[i,j − 1]
    h1: i' = i, j' = j − 1;
    2 ≤ j ≤ N, 1 ≤ i ≤ N
Dependence 2: a[i',j'] → a[j,i]
    h2: i' = j, j' = i;
    2 ≤ j ≤ N, 1 ≤ i ≤ N, i − j ≥ 1
Dependence 3: a[j',i'] → a[i,j]
    h3: j' = i, i' = j;
    2 ≤ j ≤ N, 1 ≤ i ≤ N, i − j ≥ 1
Fig. 4. Example: Non-constant dependences
Dependence 2: Tiling legality constraint:
\[ (c_i i + c_j j) - (c_i j + c_j i) \ge 0, \quad (i,j) \in P_2 \]
Applying Farkas Lemma, we have:
\[
\begin{aligned}
(c_i - c_j) i + (c_j - c_i) j \equiv{} & \lambda_0 + \lambda_1 (N - i) + \lambda_2 (N - j) \\
& + \lambda_3 (i - j - 1) + \lambda_4 (i - 1) + \lambda_5 (j - 1), \\
& \lambda_0, \lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5 \ge 0
\end{aligned}
\tag{12}
\]
Equating LHS and RHS coefficients for i, j, N and the constants in (12), and eliminating
Farkas multipliers through Fourier-Motzkin, we obtain the following:
\[ c_i - c_j \ge 0 \]
Volume bounding constraint:
\[ u_1 N + w - (c_i i + c_j j - c_i j - c_j i) \ge 0, \quad (i,j) \in P_2 \]
Application of Farkas Lemma in a similar way as above and elimination of the multipliers
yields:
\[
\begin{aligned}
u_1 &\ge 0 \\
u_1 - c_i + c_j &\ge 0 \\
3u_1 + w - c_i + c_j &\ge 0
\end{aligned}
\tag{13}
\]
Dependence 3: Due to symmetry with respect to i and j, the third dependence does not
give anything more than the second one.
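As a sanity check on the constraints derived for Dependence 2, the following minimal sketch compares them against direct enumeration of the dependence polyhedron. It is a brute-force check over small, arbitrarily chosen coefficient and N ranges (both are assumptions of this sketch), not the Farkas/Fourier-Motzkin procedure itself; the function names are hypothetical.

```python
from itertools import product


def p2_points(n):
    """Integer points of P2 for a given N: 2 <= j <= N, 1 <= i <= N, i - j >= 1."""
    return [(i, j) for j in range(2, n + 1) for i in range(j + 1, n + 1)]


def holds_by_enumeration(ci, cj, u1, w, n_values):
    """Check legality and volume bounding directly at every point of P2."""
    for n in n_values:
        for i, j in p2_points(n):
            # phi(t) - phi(h2(t)) with t = (i, j) and h2(t) = (j, i)
            delta = (ci * i + cj * j) - (ci * j + cj * i)
            if delta < 0 or u1 * n + w - delta < 0:
                return False
    return True


def holds_by_derived(ci, cj, u1, w):
    """Constraints obtained after eliminating the Farkas multipliers."""
    return (ci - cj >= 0 and u1 >= 0
            and u1 - ci + cj >= 0 and 3 * u1 + w - ci + cj >= 0)


def mismatches(coeffs=range(-2, 4), n_values=range(3, 17)):
    """Coefficient tuples (ci, cj, u1, w) on which the two checks disagree."""
    return [c for c in product(coeffs, repeat=4)
            if holds_by_enumeration(*c, n_values) != holds_by_derived(*c)]


print(len(mismatches()))  # 0 for these ranges
```

The N range is kept large relative to the coefficient range so that a negative coefficient of N cannot be masked by the constant term within the enumerated sizes.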
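Finding the hyperplanes. Aggregating legality and volume bounding constraints for all
dependences, we obtain:
\[
\begin{aligned}
c_j &\ge 0 \\
w - c_j &\ge 0 \\
c_i - c_j &\ge 0 \\
u_1 &\ge 0 \\
u_1 - c_i + c_j &\ge 0 \\
3u_1 + w - c_i + c_j &\ge 0 \\
\text{minimize}_{\prec}\ & (u_1, w, c_i, c_j)
\end{aligned}
\tag{14}
\]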
The lexicographic minimal solution for the vector $(u_1, w, c_i, c_j)$ is $(0, 1, 1, 1)$ (the zero
vector is a trivial solution and is avoided). Hence, we get $c_i = c_j = 1$. Note that $c_i = 1$
and $c_j = 0$ is not obtained even though it is a valid tiling hyperplane as it involves more
communication: it requires $u_1$ to be positive.
The next solution is forced to have a positive component in the subspace orthogonal
to (1,1), given by (10) as (1,−1). This leads to the addition of the constraint $c_i - c_j \ge 1$ or
$c_i - c_j \le -1$ to the existing formulation. Adding $c_i - c_j \ge 1$ to (14), the lexicographic
minimal solution is (1, 0, 1, 0), i.e., $u_1 = 1$, $w = 0$, $c_i = 1$, $c_j = 0$ ($u_1 = 0$ is no
longer valid). Hence, (1,1) and (1,0) are the tiling hyperplanes obtained. (1,1) is used
as space with one line of communication between processors, and the hyperplane (1,0)
is used as time in a tile. The outer tile schedule is (2,1) ( = (1,1) + (1,0)).
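The two solutions above can be reproduced with a small brute-force search, shown here as a sketch. It enumerates non-negative integer vectors up to an arbitrary bound (an assumption of this sketch) and relies on Python's tuple ordering for the lexicographic minimum; the actual implementation uses a parametric integer programming solver instead, and the helper names here are hypothetical.

```python
from itertools import product


def satisfies_14(u1, w, ci, cj):
    """Aggregated legality and volume bounding constraints of (14)."""
    return (cj >= 0 and w - cj >= 0 and ci - cj >= 0 and u1 >= 0
            and u1 - ci + cj >= 0 and 3 * u1 + w - ci + cj >= 0)


def lexmin(extra=lambda ci, cj: True, bound=4):
    """Lexicographically smallest (u1, w, ci, cj) with a non-zero hyperplane.

    `extra` adds a constraint on (ci, cj), e.g. an orthogonality constraint.
    """
    candidates = [
        (u1, w, ci, cj)
        for u1, w, ci, cj in product(range(bound), repeat=4)
        if satisfies_14(u1, w, ci, cj) and (ci, cj) != (0, 0) and extra(ci, cj)
    ]
    return min(candidates)  # tuples compare lexicographically


first = lexmin()
second = lexmin(extra=lambda ci, cj: ci - cj >= 1)
print(first)   # (0, 1, 1, 1)
print(second)  # (1, 0, 1, 0)
```

Restricting the search to non-negative values loses nothing here, since (14) already forces all four unknowns to be non-negative.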
This transformation is in contrast to other approaches based on schedules which obtain
a schedule and then the rest of the transformation matrix. Feautrier's greedy heuristic
gives the schedule θ(i,j) = 2i+j−3 which carries all dependences. However, using
this as either space or time does not lead to communication or locality optimization. The
(2,1) hyperplane has non-constant communication along it. In fact, the only hyperplane
that has constant communication along it is (1,1). This is the best hyperplane to be used
as a space loop if the nest is to be parallelized, and is the first solution that our algorithm
finds. The (1,0) hyperplane is used as time leading to a solution with one degree
of pipelined parallelism with one line per tile of near-neighbor communication (along
(1,1)) as shown in Fig. 4. Hence, a good schedule that tries to carry all dependences (or
as many as possible) is not necessarily a good loop for the transformed iteration space.
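The contrast between the hyperplanes (1,1), (1,0) and (2,1) can be made concrete by enumerating the dependence distances along each of them. The sketch below does this for Dependences 1 and 2 of Fig. 4 (Dependence 3 is symmetric to Dependence 2 and is omitted); the function name and the sampled values of N are assumptions chosen for illustration.

```python
def distances(hyperplane, n):
    """Set of values phi(t) - phi(h(t)) over Dependences 1 and 2 for a given N."""
    ci, cj = hyperplane

    def phi(i, j):
        return ci * i + cj * j

    found = set()
    for j in range(2, n + 1):
        for i in range(1, n + 1):
            # Dependence 1: h1(i, j) = (i, j - 1)
            found.add(phi(i, j) - phi(i, j - 1))
            # Dependence 2: h2(i, j) = (j, i), defined only where i - j >= 1
            if i - j >= 1:
                found.add(phi(i, j) - phi(j, i))
    return found


for h in [(1, 1), (1, 0), (2, 1)]:
    print(h, [max(distances(h, n)) for n in (4, 8, 16)])
# (1, 1) [1, 1, 1]
# (1, 0) [2, 6, 14]
# (2, 1) [2, 6, 14]
```

Along (1,1) the largest distance stays at 1 regardless of N, while along (2,1) it grows with N even though every distance is at least 1, i.e. all dependences are carried.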
5 Implementation and Preliminary Results
We have implemented our transformation framework using PipLib 1.3.3 [9]. Our tool
takes as input dependence information (dependence polyhedra and h-transformations)
from LooPo's [1] dependence tester and generates statement-wise affine transformations.
The transforms generated by our tool are provided to CLooG [3] as scattering
functions. The goal is to get tiled shared memory parallel code, for example, OpenMP
code for multi-core architectures. As a final step, the detected parallelism can be mapped
to a desired parallel architecture depending on the number of degrees of parallelism
required. Results show that the tool (with preliminary optimizations) already runs
extremely fast, making further refinements to the model very attractive. The number of
loops shown in the table is the sum of the number of outer loops of all statements in
the original code. In theory, since our approach relies on integer programming, it has
a worst-case exponential time complexity. However, it runs extremely fast in practice.
This is mainly because program polyhedra have a simple structure, and the ILP
formulations resulting from them are quickly solved. Due to space constraints, we are not
including results from experimental evaluation of the transformed code. A summary of
the results can be found in Table 1. [20,19,21,15] represent the state of the art from the
research community, while ICC 10.1 with '-fast -parallel' was used as the native
compiler. The results were taken on an Intel Core 2 Quad (Q6600, 2.4 GHz). The detailed
experimental evaluation can be found in [5].
Table 1. Initial results: speedup over state-of-the-art

                              Single core speedup              Multicore speedup (4 cores)
Benchmark                     over native    over state-of-    over native    over state-of-
                              compiler       the-art research  compiler       the-art research
1d Jacobi (imperfect nest)    5.23x          2.1x              20x            1.7x
2d FDTD                       3.7x           3.1x              7.4x           2.46x
3d Gauss-Seidel               1.6x           -                 4.5x           -
LU decomposition              5.56x          5.74x             14x            3.76x
Matrix Vec Transpose          9.3x           5.5x              13x            6.96x
6 Related work
Iteration space tiling [16,31,32,25] is a standard approach for aggregating a set of loop
iterations into tiles, with each tile being executed atomically. In addition, researchers
have considered the problem of selecting tile shape and size to minimize communication,
improve locality or minimize finish time [25,33]. These works are restricted to a
single perfectly nested loop nest with uniform dependences or similar restrictions which
are far from real-world code.
Loop parallelization has been studied extensively. The reader is referred to the survey
of Boulet et al. [6] for a detailed summary of older parallelization algorithms which
accepted restricted input and/or are based on weaker dependence abstractions than exact
polyhedral dependences.
Scheduling with affine functions using faces of the polytope by application of the
Farkas algorithm was first proposed by Feautrier [11]. Feautrier explored various possible
approaches to obtain good affine schedules that minimize latency. The schedules
carry all dependences and so all the inner loops can be parallel. However, transforming
to permutable loops that are amenable to tiling was not addressed. Though schedules
yield inner parallel loops, the time loops cannot be tiled unless communication in the
space loops is in the forward direction (dependences have positive components along all
dimensions). Several works [15,7,23] make use of such schedules. Overall, Feautrier's
classic works [11,12] are geared towards finding maximum fine-grained parallelism as
opposed to tilability for coarse-grained parallelization with minimized communication
and better locality.
Lim and Lam [21,20] proposed an affine partitioning framework that identifies
outer parallel loops (communication-free space partitions) and permutable loops (pipelined
parallel or tilable loops) to maximize the degree of parallelism and minimize the order
of synchronization. They employ the same machinery for blocking [19]. Several
(infinitely many) solutions equivalent in terms of the criterion they optimize for result
from their algorithm, and these significantly differ in communication cost and locality;
no metric is provided to differentiate between these solutions. As seen in Sec. 3, without
a cost function, the solutions obtained even for the simplest input are not satisfactory.
Ahmed et al. [2] proposed a framework for locality optimization of imperfectly
nested loops for sequential execution. The approach determines the embedding for each
statement into a product space, which is then optimized for locality through another
transformation. Their heuristic sets reuse distances in the target space for some dependences
to zero (or a constant) to obtain solutions to the embedding/transformation
matrix coefficients. However, the choice of the dependences and the number, which is
important, is determined heuristically. Also, such an approach need not completely determine
the embedding function/transformation matrix coefficients. Exploring all possibilities
here is infeasible. Overall, the automatability and robustness of the heuristic
even for simple code is not clear from the description.
Griebl [15] presents an integrated framework for optimizing locality and parallelism
with space and time tiling. Griebl's approach enables time tiling by using a forward
communication-only placement with an existing schedule. As mentioned earlier
(Sec. 3), using schedules as time loops may not lead to communication or locality
optimized solutions.
Cohen et al. and Girbal et al. [7,14] proposed and developed a framework (URUK/WRAP-IT)
to compose sequences of transformations in a semi-automatic fashion. Transformations
are applied automatically, but specified manually by an expert. Pouchet et al. [23]
search the space of transformations (one-dimensional schedules) to find good ones
through iterative optimization by employing performance counters. On the other hand,
our approach is fully automatic. However, some amount of empirical and iterative
optimization is required to choose transforms that work best in practice. This is true when
several fusion choices exist. Also, effective determination of tile sizes and unroll factors
for transformed whole programs may only be possible through some amount of empirical
search. A combination of our algorithm and empirical search in a smaller space is
an interesting approach to pursue. Alternatively, more powerful cost models like those
based on computing Ehrhart polynomials [30] can be employed once transformations
in a smaller space can be enumerated.
7 Conclusions
We have presented a single framework that addresses automatic parallelization and
data locality optimization using transformations in the polyhedral model. The proposed
algorithm finds communication-minimized tiling hyperplanes for parallelization of a
sequence of arbitrarily nested loops. The same hyperplanes also minimize reuse distances
and improve data locality. The approach also enables fusion in the presence of
producing-consuming loops. To the best of our knowledge, our work is the first to propose
a practical cost model to drive automatic transformation in the polyhedral model.
The framework has been implemented into a tool to perform transformations in a fully
automatic way from C/Fortran code using the LooPo infrastructure and CLooG. Preliminary
results show very good scalability of the running time of the framework with
input size and complexity of input code.