Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model.
ABSTRACT The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement. Although a significant body of research has addressed affine scheduling and partitioning, the problem of automatically finding good affine transforms for communication-optimized coarse-grained parallelization together with locality optimization for the general case of arbitrarily-nested loop sequences remains a challenging problem. We propose an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously. The approach finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation. These tiling hyperplanes are used for communication-minimized coarse-grained parallelization as well as for locality optimization. The approach enables the minimization of inter-tile communication volume in the processor space, and minimization of reuse distances for local execution at each node. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, permutable loops or pipelined parallelism at various levels can be detected. Preliminary studies of the framework show promising results.
Automatic Transformations for
Communication-Minimized Parallelization and Locality
Optimization in the Polyhedral Model
Uday Bondhugula1, Muthu Baskaran1, Sriram Krishnamoorthy1,
J. Ramanujam2, Atanas Rountev1, and P. Sadayappan1
1Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
{bondhugu,baskaran,krishnsr,rountev,saday}@cse.ohio-state.edu,
2Dept. of Electrical and Computer Engg., Louisiana State University, Baton Rouge, LA , USA
jxr@ece.lsu.edu
Abstract. Many compute intensive applications spend a significant fraction of
their time in nested loops. The polyhedral model provides powerful abstractions
to optimize loop nests with regular accesses for parallel execution. Affine trans-
formations in this model capture a complex sequence of execution-reordering
loop transformations that can improve performance by parallelization as well as
locality enhancement. Although a significant amount of research has addressed
affine scheduling and partitioning, the problem of automatically finding good
affine transforms for communication-optimized coarse-grained parallelization along
with locality optimization for the general case of arbitrarily-nested loop sequences
remains challenging.
In this paper, we propose an automatic transformation framework to optimize
arbitrarily-nested loop sequences with affine dependences for parallelism and
locality simultaneously. The approach finds good tiling hyperplanes by embed-
ding a powerful and versatile cost function into an Integer Linear Programming
formulation. These tiling hyperplanes are used for communication-minimized
coarse-grained parallelization as well as locality optimization. It enables the min-
imization of inter-tile communication volume in the processor space, and min-
imization of reuse distances for local execution at each node. Programs requir-
ing one-dimensional versus multi-dimensional time schedules (with scheduling-
based approaches) are all handled with the same algorithm. Synchronization-free
parallelism, permutable loops or pipelined parallelism at various levels can be
detected. Preliminary results from the implemented framework show promising
performance and scalability with input size.
1 Introduction and Motivation
Current trends in architecture are increasingly towards a larger number of processing
elements on a chip. This has led to multi-core architectures becoming mainstream along
with the emergence of several specialized parallel architectures or accelerators like
the Cell processor and General-Purpose GPUs. The difficulty of programming these
architectures to effectively tap the potential of multiple on-chip processing units is a
well-known challenge. Among several approaches to addressing this issue, one that is
very promising but simultaneously very challenging is automatic parallelization. This
requires no effort on the part of the programmer in the process of parallelization and
optimization and is therefore very attractive.
Many compute-intensive applications spend most of their running time in
nested loops. This is particularly common in scientific and engineering applications.
The polyhedral model [13,7,14] provides a powerful abstraction to reason about trans-
formations on such loop nests by viewing a dynamic instance (iteration) of each state-
ment as an integer point in a well-defined space which is the statement’s polyhedron.
With such a representation for each statement and a precise characterization of inter or
intra-statement dependences, it is possible to reason about the correctness and goodness
of a sequence of complex loop transformations using machinery from Linear Algebra
and Linear Programming. The transformations finally reflect in the generated code as
reordered execution with improved cache locality and/or loops that have been paral-
lelized. The polyhedral model is readily applicable to loop nests in which the data
access functions and loop bounds are affine combinations (linear combination with a
constant) of the enclosing loop variables and parameters. While a precise characteriza-
tion of data dependences is feasible for programs with static control structure and affine
references/loop-bounds, codes with non-affine array access functions or code with dy-
namic control can also be handled, but with conservative assumptions on some depen-
dences.
Early approaches to program transformation and automatic parallelization applied
only to perfectly nested loops and involved the application of a sequence of transfor-
mations to the program’s structure represented as an attributed abstract syntax tree. The
polyhedral model has enabled much more complex programs to be handled and easy
composition and application of more sophisticated transformations [7,14]. The task of
program optimization in the polyhedral model may be viewed in terms of three phases:
(1) static dependence analysis of the input program, (2) transformations in the polyhe-
dral abstraction, and (3) generation of efficient loop code for the transformed program.
In spite of progress in these techniques in the nineties, several scalability challenges
limited applicability to small loop nests. Significant recent advances in dependence
analysis [29] and more importantly, in code generation [24,3,28], have demonstrated
the applicability of the polyhedral techniques to code representative of real applica-
tions. However, current state-of-the-art polyhedral implementations still apply transfor-
mations manually and significant time is spent by an expert to determine the best set of
transformations that lead to improved performance [7,14]. An important open issue is
that of the choice of transformations from the huge space of valid transforms. Our work
addresses this problem, by formulating a way of obtaining good transformations fully
automatically.
Tiling is a key transformation and has been studied from two perspectives - data
locality optimization and parallelization. Tiling for locality requires grouping points in
an iteration space into smaller blocks (tiles) allowing reuse in multiple directions when
the block fits in a faster memory (registers, L1, or L2 cache). Tiling for coarse-grained
parallelism fundamentally involves partitioning the iteration space into tiles that may
be concurrently executed on different processors with a reduced frequency and volume
of inter-processor communication: a tile is atomically executed on a processor with
communication required only before and after execution. Hence, one of the key aspects
of an automatic transformation framework is to find good ways of performing tiling.
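As a minimal sketch of what rectangular tiling does to a loop nest, the hypothetical functions below (names and the summation workload are illustrative assumptions, not taken from the paper) traverse the same n x n iteration space once in the original order and once tile by tile. The outer two loops of the tiled version enumerate tiles; the inner two enumerate the points of one tile.

```c
#include <assert.h>

/* Reference traversal: plain row-major sweep of an n x n array. */
long untiled_sum(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}

/* Same iteration space visited tile by tile (tile size ts x ts).
 * The two outer loops enumerate tiles; the two inner loops enumerate
 * the points of one tile, clamped at the boundary of the space. */
long tiled_sum(const int *a, int n, int ts)
{
    assert(n > 0 && ts > 0);
    long sum = 0;
    for (int it = 0; it < n; it += ts)
        for (int jt = 0; jt < n; jt += ts)
            for (int i = it; i < it + ts && i < n; i++)
                for (int j = jt; j < jt + ts && j < n; j++)
                    sum += a[i * n + j];
    return sum;
}
```

Both functions visit every point exactly once, so they return the same value; only the order of visits differs, and that order is what matters for reuse and for communication between tiles.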
Existing automatic transformation frameworks [21,20,19,2,15] have one or more
drawbacks or restrictions that do not allow them to effectively parallelize/optimize loop
nests. All of them lack a realistic cost model that is suitable for coarse-grained parallel
execution as is used in practice with manually developed parallel applications. With the
exception of Griebl [15], previous work generally focuses on one or the other of the
complementary aspects of parallelization and locality optimization. The approach we
develop answers the following question: What is the best way to tile imperfectly nested
loop sequences to minimize the volume of communication between tiles (in processor
space) as well as improve data reuse at each processor?
The rest of this paper is organized as follows. Section 2 provides an overview of the
polyhedral model and notation. In Section 3, we describe our automatic transformation
framework in detail. Section 4 shows step-by-step application of our approach through
an example. Section 5 provides a summary of the implementation and initial results.
Section 6 discusses related work and conclusions are presented in Section 7. Full details
of the framework, transformations and optimized codes obtained for various codes, and
experimental results are available in extended reports [4,5].
2 Overview of the Polyhedral Framework
This section presents an overview of the polyhedral framework and notation used
throughout the paper.
The set X of all vectors $x \in \mathbb{Z}^n$ such that $h.x = k$, for $k \in \mathbb{Z}$, forms an affine
hyperplane. The set of parallel hyperplane instances corresponding to different values
of k is characterized by the vector h which is normal to the hyperplane. Each instance
of a hyperplane is an n − 1 dimensional affine sub-space of the n-dimensional space.
Two vectors $x_1$ and $x_2$ lie in the same hyperplane if $h.x_1 = h.x_2$.
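The hyperplane membership test above amounts to comparing two dot products. A small sketch, with hypothetical function names chosen for illustration:

```c
#include <assert.h>

/* Dot product h.x for n-dimensional integer vectors. */
long dot(const int *h, const int *x, int n)
{
    assert(n > 0);
    long s = 0;
    for (int d = 0; d < n; d++)
        s += (long)h[d] * x[d];
    return s;
}

/* Two points lie on the same instance of the hyperplane with normal h
 * exactly when h.x1 equals h.x2. */
int same_hyperplane(const int *h, const int *x1, const int *x2, int n)
{
    return dot(h, x1, n) == dot(h, x2, n);
}
```

For example, with h = (1,1) the points (2,3) and (4,1) both give h.x = 5 and so share a hyperplane instance, while (4,2) gives 6 and lies on the next one.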
The set of all vectors $x \in \mathbb{Z}^n$ such that $Ax + b \ge 0$, where A is an integer matrix,
defines a (convex) integer polyhedron. A polytope is a bounded polyhedron. Each run-
time instance of a statement S is identified by its iteration vector i which contains
values for the indices of the loops surrounding it, from outermost to innermost. Hence,
a statement S is associated with a polytope which is characterized by a set of bounding
hyperplanes or faces. This is true when the loop bounds are affine combinations of
outer loop indices and program parameters (typically, symbolic constants representing
the problem size). Let p be the vector of the program parameters.
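To make the "statement polytope as a set of faces" view concrete, the sketch below stores each face as one row of coefficients over (i, j, N, 1) and tests whether an iteration vector satisfies all of them. The triangular domain and all identifiers are assumptions made for illustration only.

```c
#include <assert.h>

#define NCOLS 4   /* columns: i, j, parameter N, constant term */

/* Hypothetical triangular domain { (i,j) : 0 <= j <= i <= N-1 },
 * written as rows of A and b so that each row means "row . (i,j,N,1) >= 0". */
static const int tri_faces[4][NCOLS] = {
    { 1,  0, 0,  0},   /* i >= 0         */
    {-1,  0, 1, -1},   /* N - 1 - i >= 0 */
    { 0,  1, 0,  0},   /* j >= 0         */
    { 1, -1, 0,  0},   /* i - j >= 0     */
};

/* Returns 1 if (i, j) satisfies every face for problem size N, else 0. */
int in_domain(const int faces[][NCOLS], int nfaces, int i, int j, int N)
{
    assert(nfaces >= 0);
    for (int k = 0; k < nfaces; k++) {
        long v = (long)faces[k][0] * i + (long)faces[k][1] * j
               + (long)faces[k][2] * N + faces[k][3];
        if (v < 0)
            return 0;
    }
    return 1;
}
```

Because the faces are affine in the parameter N, the same rows describe the domain for every problem size.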
A well-known result useful for polyhedral analyses is the affine form of the
Farkas Lemma.
Lemma 1 (Affine form of Farkas Lemma). Let D be a non-empty polyhedron defined
by s affine inequalities or faces: $a_k.x + b_k \ge 0$, $1 \le k \le s$. Then an affine form ψ is
non-negative everywhere in D iff it is a positive affine combination of the faces:
$$\psi(x) \equiv \lambda_0 + \sum_{k} \lambda_k (a_k.x + b_k), \quad \lambda_k \ge 0 \tag{1}$$
The non-negative constants $\lambda_k$ are referred to as Farkas multipliers. Proof of the if part
is obvious. For the only if part, see Schrijver [27].
2.1 Polyhedral Dependences
Our dependence model is of exact affine dependences and is the same as the one used in [11,
20,7,29,23]. Dependences are determined precisely through dataflow analysis [10], but
we consider all dependences including anti (write-after-read), output (write-after-write)
and input (read-after-read) dependences, i.e., input code does not require conversion to
single-assignment form. The Data Dependence Graph (DDG) is a directed multi-graph
with each vertex representing a statement, and an edge, e ∈ E, from node $S_i$ to $S_j$
representing a polyhedral dependence from a dynamic instance of $S_i$ to one of $S_j$: it is
characterized by a polyhedron, $P_e$, called the dependence polyhedron that captures the
exact dependence information corresponding to edge, e. The dependence polyhedron is
in the sum of the dimensionalities of the source and target statement’s polyhedra (with
dimensions for program parameters as well).
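$$P_e \equiv \left[\begin{array}{c} D_{P_s} \\ D_{P_t} \\ h_e \end{array}\right] \begin{pmatrix} s \\ t \\ p \\ 1 \end{pmatrix} \; \begin{array}{l} \ge 0 \\ \ge 0 \\ = 0 \end{array} \tag{2}$$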
The equalities in $P_e$ typically represent the affine function mapping the target iteration
vector t to the particular source s that is the last access to the conflicting memory
location, also known as the h-transformation [11]. The last access condition is not necessary
though; in general, the equalities can be used to eliminate variables from $P_e$. In the rest
of this section, it is assumed for convenience that s can be completely eliminated using
$h_e$, being substituted by $h_e(t)$.
A one-dimensional affine transform for statement $S_k$ is defined by:
$$\phi_{s_k}(i) = \begin{bmatrix} f_1 & \ldots & f_{m_{S_k}} \end{bmatrix} (i) + f_0 = f_{S_k}\, i + f_0, \quad \text{where } f_{S_k} = [f_1, \ldots, f_{m_{S_k}}],\ f_i \in \mathbb{Z} \tag{3}$$
A multi-dimensional affine transformation for a statement can now be represented
by a matrix with each row being an affine hyperplane. If such a transformation
matrix has full column rank, it completely specifies when and where an iteration executes
(one-to-one mapping from source to target). The total number of rows in the matrix
may be much larger as some special rows, splitters, may represent unfused loops at a
level. Consider the code in Fig. ?? for example. Such transformations capture the fusion
structure as well as compositions of permutation, reversal, relative shifting, and
skewing transformations. This representation for transformations has been used by many
researchers [12,17,7,14], and directly fits with scattering functions that a code
generator like CLooG [3] supports. Our problem is thus to find the transformation matrices
that are best for parallelism and locality.
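A multi-dimensional transformation of this kind can be applied to an iteration vector row by row. The sketch below assumes a row-major matrix whose last column holds the constant f_0 of each row; the function name and storage layout are illustrative choices, not part of the paper.

```c
#include <assert.h>

/* Apply a multi-dimensional affine transformation to an iteration vector.
 * T has `rows` rows and `m + 1` columns, stored row-major: each row holds
 * the hyperplane coefficients f_1..f_m followed by the constant f_0. */
void apply_transform(const int *T, int rows, int m, const int *iter, int *out)
{
    assert(rows > 0 && m > 0);
    for (int r = 0; r < rows; r++) {
        const int *row = T + r * (m + 1);
        int v = row[m];                  /* constant term f_0 */
        for (int d = 0; d < m; d++)
            v += row[d] * iter[d];       /* linear part f . i */
        out[r] = v;
    }
}
```

With rows (1,0) and (1,1) and zero constants, the iteration (t,i) = (2,3) is mapped to (2,5), which is a skewing of the inner loop by the outer one; a row (0,1) with constant 1 would instead express a relative shift.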
3 Finding good transformations
3.1 Legality of tiling imperfectly-nested loops
Theorem 1. Let $\phi_{s_i}$ be a one-dimensional affine transform for statement $S_i$. For
$\{\phi_{s_1}, \phi_{s_2}, \ldots, \phi_{s_k}\}$ to be a legal (statement-wise) tiling hyperplane, the following should
hold for each edge e from $S_i$ to $S_j$:
$$\phi_{s_j}(t) - \phi_{s_i}(s) \ge 0, \quad \langle s, t \rangle \in P_e \tag{4}$$
Proof. Tiling of a statement’s iteration space defined by a set of tiling hyperplanes is
said to be legal if each tile can be executed atomically and a valid total ordering of the
tiles can be constructed. This implies that there exist no two tiles such that they both
influence each other. Let $\{\phi^1_{s_1}, \phi^1_{s_2}, \ldots, \phi^1_{s_k}\}$, $\{\phi^2_{s_1}, \phi^2_{s_2}, \ldots, \phi^2_{s_k}\}$ be two
statement-wise 1-d affine transforms that satisfy (4). Consider a tile formed by aggregating a group
of hyperplane instances along $\phi^1_{s_i}$ and $\phi^2_{s_i}$. Due to (4), for any dynamic dependence,
the target iteration is mapped to the same hyperplane or a greater hyperplane than the
source, i.e., the set of all iterations that are outside of the tile and are influenced by it
always lie in the forward direction along one of the independent tiling dimensions ($\phi^1$
and $\phi^2$ in this case). Similarly, all iterations influencing a tile are either in
that tile or in the backward direction along one or more of the hyperplanes. The above
argument holds true for both intra- and inter-statement dependences. For inter-statement
dependences, this leads to an interleaved execution of tiles of iteration spaces of each
statement when code is generated from these mappings. Hence, $\{\phi^1_{s_1}, \phi^1_{s_2}, \ldots, \phi^1_{s_k}\}$,
$\{\phi^2_{s_1}, \phi^2_{s_2}, \ldots, \phi^2_{s_k}\}$ represent rectangularly tilable loops in the transformed space. If
such a tile is executed on a processor, communication would be needed only before
and after its execution. From a locality point of view, if such a tile is executed with the
associated data fitting in a faster memory, reuse is exploited in multiple directions. □
The above condition was well-known for the case of a single-statement perfectly
nested loop from the work of Irigoin and Triolet [16] (as $h^T.R \ge 0$). We have generalized
it above for multiple iteration spaces with exact affine dependences with possibly
different dimensionalities and imperfect nestings for statements.
Tiling at an arbitrary depth. Note that the legality condition as written in (4) is imposed
on all dependences. However, if it is imposed only on dependences that have not been
carried up to a certain depth, the independent φ’s that satisfy the condition represent
tiling hyperplanes at that depth, i.e., rectangular blocking (stripmine/interchange) at
that level in the transformed program is legal.
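For the special case of a single statement with uniform dependences, φ(t) − φ(s) equals h.(t − s), so condition (4) reduces to requiring a non-negative component of every distance vector along the hyperplane normal h. The sketch below checks only this special case; the function name and the flat row-major layout of the distance vectors are assumptions for illustration, and it does not handle general dependence polyhedra.

```c
#include <assert.h>

/* Uniform-dependence special case of the legality condition:
 * a hyperplane with normal h is a legal tiling hyperplane if
 * h.d >= 0 for every distance vector d = t - s.
 * deps holds ndeps vectors of length n, stored row-major. */
int is_legal_tiling_hyperplane(const int *h, const int *deps, int ndeps, int n)
{
    assert(n > 0 && ndeps >= 0);
    for (int e = 0; e < ndeps; e++) {
        long comp = 0;
        for (int d = 0; d < n; d++)
            comp += (long)h[d] * deps[e * n + d];
        if (comp < 0)
            return 0;   /* this dependence would point backwards */
    }
    return 1;
}
```

Using the distance vectors (1,0), (1,1) and (1,−1) of the perfectly nested 1-d Jacobi discussed next, the normals (1,0), (1,1) and (1,−1) pass the check, while (0,1) fails because of the dependence (1,−1).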
Consider the perfectly nested version of 1-d Jacobi shown in Fig. 1(a) as an exam-
ple. This discussion also applies to the imperfectly nested version, but for convenience
we first look at the single-statement perfectly nested version. We first describe solutions
obtained by existing state of the art approaches: Lim and Lam’s affine partitioning [21,
20] and Griebl’s space and time tiling with Forward Communication-Only (FCO)
placement [15].
for (t = 1; t < T; t++)
    for (i = 2; i < N-1; i++)
        a[t][i] = 0.33 * (a[t-1][i] + a[t-1][i-1] + a[t-1][i+1]);
(a) 1-d Jacobi: perfectly nested

for (t = 1; t < T; t++) {
    for (i = 2; i < N-1; i++)
S1:     b[i] = 0.33 * (a[i-1] + a[i] + a[i+1]);
    for (i = 2; i < N-1; i++)
S2:     a[i] = b[i];
}
(b) 1-d Jacobi: imperfectly nested
Fig. 1. 1-d Jacobi
Fig. 2. Communication volume with different valid hyperplanes for perfectly nested 1-d Jacobi

Lim and Lam define legal time partitions which have the same property of tiling
hyperplanes we described above. Their algorithm obtains affine partitions that minimize
the order of communication while maximizing the degree of parallelism. (4) gives
legality constraints: $c_t \ge 0$; $c_t + c_i \ge 0$; $c_t - c_i \ge 0$ corresponding to dependences (1,0), (1,1)
and (1,−1). There are infinitely many valid solutions with the same order complexity of
synchronization, but with different communication volumes that may impact
performance. Although it may seem that the volume may not affect performance considering
the fact that communication startup time on modern interconnects is significant, for
higher dimensional problems like n-d Jacobi, the ratio of communication to computa-
tion increases (proportional to tile size raised to n−1). Existing works on tiling [26,25,
33] can find near communication-optimal tiles for perfectly nested loops with constant
dependences, but cannot handle arbitrarily nested loops. For 1-d Jacobi, all solutions
within the cone formed by the vectors (1,1) and (1,−1) are valid tiling hyperplanes.
For the imperfectly nested version of 1-d Jacobi, the valid cone is (2,1) and (2,−1)
(discussed later). For imperfectly nested Jacobi, Lim’s algorithm [21] finds two valid
independent solutions without optimizing for any particular criterion. In particular, the
solutions found by their algorithm (Algorithm A in [21]) are (2,−1) and (3,−1) which
are clearly not the best tiling hyperplanes to minimize communication volume, though
they do minimize the order of synchronization which is O(N) (in this case any valid
hyperplane has O(N) synchronization). Figure 2 shows that the required communica-
tion increases as the hyperplane gets more and more oblique. For a hyperplane with
normal (k,1), one would need (k + 1)T values from the neighboring tile.
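One way to read the factor (k + 1) above, offered here as an interpretation rather than something the text derives, is as the largest dependence component along the hyperplane normal: for normal (k,1) and the dependence (1,1), that component is k + 1. The hypothetical helper below computes this maximum for uniform dependences.

```c
#include <assert.h>

/* Largest component h.d over a set of uniform distance vectors.
 * deps holds ndeps vectors of length n, stored row-major.
 * Assumption of this sketch: the hyperplane is already known to be legal,
 * so every component is non-negative. */
long max_dependence_component(const int *h, const int *deps, int ndeps, int n)
{
    assert(n > 0 && ndeps > 0);
    long best = 0;
    for (int e = 0; e < ndeps; e++) {
        long comp = 0;
        for (int d = 0; d < n; d++)
            comp += (long)h[d] * deps[e * n + d];
        if (e == 0 || comp > best)
            best = comp;
    }
    return best;
}
```

For the 1-d Jacobi dependences (1,0), (1,1), (1,−1), the normals (1,0), (1,1) and (2,1) give maxima of 1, 2 and 3 respectively, which grows as the hyperplane becomes more oblique.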
Using Griebl’s approach, we first find that only space tiling is enabled with Feautrier’s
schedule being θ(t,i) = t, i.e., using (1,0) as the scheduling hyperplane. With forward
communication-only (FCO) placement, an allocation is found such that dependences
have positive components along space dimensions thereby enabling tiling of the time
dimension too; this decreases the frequency of communication. In this case, time tiling
is enabled with FCO placement along (1,1). However, note that communication in the
processor space occurs along (1,1), i.e., two lines of the array are required. In contrast,
using (1,0) and (1,1) as tiling hyperplanes with (1,0) as space and (1,1) as inner time and
a tile space schedule of (2,1) leads to only one line of communication along (1,0). Our
algorithm finds such a solution. We now develop a cost function for an affine transform
that captures communication volume and reuse distance.
3.2 A Linear Cost Function
Consider the following affine form $\delta_e$ for an edge e from $S_i$ to $S_j$:
$$\delta_e(t) = \phi_{s_j}(t) - \phi_{s_i}(h_e(t)), \quad t \in P_e \tag{5}$$
The affine form δe(t) holds much significance. This function is the number of hyper-
planes the dependence e traverses along the hyperplane normal. It gives us a measure
of the reuse distance if the hyperplane is used as time, i.e., if the hyperplanes are ex-
ecuted sequentially. Also, this function is a factor in the communication volume if the
hyperplane is used to generate tiles for parallelization and used as a processor space di-
mension. An upper bound on this function would mean that the number of hyperplanes
that would be communicated as a result of the dependence at the tile boundaries would
not exceed this bound. We are particularly interested if this function can be reduced to
a constant amount or zero by choosing a suitable direction for φ: if this is possible, then
that particular dependence leads to a constant or no communication for this hyperplane.
Note that each $\delta_e$ is an affine function of the loop indices. The challenge is to use this
function to obtain a suitable objective for optimization in the affine framework.
Challenges. The constraints obtained from (4) only guarantee legality of tiling (per-
mutability). However, several problems are encountered when one tries to apply a per-
formance factor to find a good tile shape out of the several possibilities. Farkas Lemma
has been used by many approaches in the affine literature [11,12,21,15] to eliminate
loop variables from constraints by getting equivalent linear inequalities. The affine form
in the loop variables is represented as a positive linear combination of the faces of the
dependence polyhedron. When this is done, the coefficients of the loop variables on
the left and right hand side are equated to eliminate the constraints of variables. This
is done for each of the dependences, and the constraints obtained are aggregated. The
resulting constraints are entirely in the coefficients of the tile mappings and Farkas mul-
tipliers. All Farkas multipliers can be eliminated, some by Gaussian elimination and
the rest by Fourier-Motzkin elimination [27]. However, an attempt to minimize com-
munication volume ends up in an objective function involving both loop variables and
hyperplane coefficients. For example, $\phi(t) - \phi(h_e(t))$ could be $c_1 i + (c_2 - c_3) j$, where
$1 \le i \le N \wedge 1 \le j \le N \wedge i \le j$. One ends up with such a form when a
dependence is not uniform or for an inter-statement dependence, making it hard to construct
an objective function involving only the unknown hyperplane coefficients.
3.3 Cost Function Bounding and Minimization
Theorem 2. If all iteration spaces are bounded, there exists at least one affine form v
in the structure parameters p, that bounds $\delta_e(t)$ for every dependence edge e, i.e., there
exists
$$v(p) = u.p + w \tag{6}$$
such that
$$v(p) - \left(\phi_{s_j}(t) - \phi_{s_i}(h_e(t))\right) \ge 0, \quad t \in P_e, \quad \forall e \in E$$
i.e.,
$$v(p) - \delta_e(t) \ge 0, \quad t \in P_e, \quad \forall e \in E. \tag{7}$$
The idea behind the above is that even if $\delta_e$ involves loop variables, one can find large
enough constants in u that would be sufficient to bound $\delta_e(t)$. Note that the loop
variables themselves are bounded by affine functions of the parameters, and hence the
maximum value taken by $\delta_e(t)$ will be bounded by such an affine form. Also, since
$v(p) \ge \delta_e(t) \ge 0$, v should not decrease with an increase in the structural parameters,
i.e., the coordinates of u are non-negative. The reuse distance or communication volume for
each dependence is bounded in this fashion by the same affine form. Such a bounding
function was used by Feautrier [11] to find minimum latency schedules.
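The bounding idea can be checked by brute force on the small example from Section 3.2, where the dependence component has the form c1*i + (c2 − c3)*j over 1 ≤ i ≤ j ≤ N. The sketch below is only an illustration under that assumed domain (with a single parameter N and hypothetical function names); the framework itself obtains the bound symbolically through the Farkas Lemma rather than by enumeration.

```c
#include <assert.h>

/* delta(i, j) = c1*i + c23*j on the domain 1 <= i <= j <= N, where c23
 * stands for the combined coefficient (c2 - c3) of the text's example.
 * Returns the maximum of delta over the domain by enumeration. */
long max_delta(int c1, int c23, int N)
{
    assert(N >= 1);
    long best = (long)c1 + c23;          /* value at (1,1), always in the domain */
    for (int i = 1; i <= N; i++)
        for (int j = i; j <= N; j++) {
            long v = (long)c1 * i + (long)c23 * j;
            if (v > best)
                best = v;
        }
    return best;
}

/* Does the affine form v(N) = u*N + w bound delta everywhere for this N? */
int bound_holds(int c1, int c23, int u, int w, int N)
{
    return (long)u * N + w >= max_delta(c1, c23, N);
}
```

With c1 = 1 and c23 = 1 the maximum is 2N, so v(N) = 2N bounds it while v(N) = N does not. With c1 = 1 and c23 = −1 the maximum is 0 for every N, so the constant bound u = 0, w = 0 suffices, which is the kind of hyperplane the minimization prefers.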
Now, we apply Farkas Lemma to (7).
$$v(p) - \delta_e(t) \equiv \lambda_{e0} + \sum_{k=1}^{m_e} \lambda_{ek} P_e^k, \quad \lambda_{ek} \ge 0 \tag{8}$$
where $P_e^k$ is a face of $P_e$. The above is an identity and the coefficients of each of the
loop indices in i and parameters in p on the left and right hand side can be gathered and
equated. We now get linear inequalities entirely in coefficients of the affine mappings
for all statements, components of row vector u, and w. The above inequalities can at
once be solved by finding a lexicographic minimal solution with u and w in the leading
position, and the other variables following in any order. Let $u = (u_1, u_2, \ldots, u_k)$.
$$\operatorname{minimize}_{\prec} \{u_1, u_2, \ldots, u_k, w, \ldots, c_i\text{'s}, \ldots\} \tag{9}$$
Finding the lexicographic minimal solution for a system of linear inequalities is within
the reach of the simplex algorithm and can be handled by the Parametric Integer Pro-
gramming (PIP) software [9]. Since the structural parameters are quite large, we first
want to minimize their coefficients. We do not lose the optimal solution since an optimal
solution would have the smallest possible values for u’s. Note that the relative order-
ing of the structural parameters and their values at runtime may affect the solution, but
considering this is beyond the scope of this approach.
The solution gives a hyperplane for each statement. Note that the application of the
Farkas Lemma to (7) is not required in all cases. When a dependence is uniform, the
corresponding $\delta_e$ is independent of any loop variables, and Farkas Lemma need not be
applied. In such cases, we just have $w \ge \delta_e$.
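For uniform dependences the problem is small enough to solve by enumeration, which gives a feel for what the lexicographic minimization returns. The sketch below is a brute-force stand-in, not the PIP-based solver the paper uses: it hard-codes the distance vectors of the perfectly nested 1-d Jacobi, restricts all unknowns to a small non-negative range (an assumption of this sketch), takes u = 0 because the dependences are uniform, and searches (w, c_t, c_i) in lexicographic order. The parameter `min_ci` is a hypothetical extra lower bound on c_i, standing in for the kind of linear-independence constraint that Section 3.4 adds.

```c
#include <assert.h>

#define NDEPS 3
/* Distance vectors (t, i) of the perfectly nested 1-d Jacobi from the text. */
static const int jacobi_deps[NDEPS][2] = { {1, 0}, {1, 1}, {1, -1} };

typedef struct { int found, w, ct, ci; } lexmin_result;

/* Feasibility of one candidate (w, ct, ci). */
static int feasible(int w, int ct, int ci, int min_ci)
{
    if (ct == 0 && ci == 0)
        return 0;                        /* exclude the trivial zero hyperplane */
    if (ci < min_ci)
        return 0;                        /* hypothetical independence constraint */
    for (int e = 0; e < NDEPS; e++) {
        int delta = ct * jacobi_deps[e][0] + ci * jacobi_deps[e][1];
        if (delta < 0)
            return 0;                    /* legality: component must be >= 0 */
        if (delta > w)
            return 0;                    /* bounding: w >= delta_e */
    }
    return 1;
}

/* First feasible point in lexicographic order of (w, ct, ci) within
 * [0, limit]^3; enumeration order makes it the lexicographic minimum. */
lexmin_result jacobi_lexmin(int limit, int min_ci)
{
    assert(limit >= 0);
    lexmin_result r = {0, 0, 0, 0};
    for (int w = 0; w <= limit; w++)
        for (int ct = 0; ct <= limit; ct++)
            for (int ci = 0; ci <= limit; ci++)
                if (feasible(w, ct, ci, min_ci)) {
                    r.found = 1; r.w = w; r.ct = ct; r.ci = ci;
                    return r;
                }
    return r;
}
```

Under these assumptions, `jacobi_lexmin(4, 0)` yields the hyperplane (1,0) with w = 1, and `jacobi_lexmin(4, 1)` yields (1,1) with w = 2, matching the pair of tiling hyperplanes (1,0) and (1,1) discussed at the end of Section 3.1.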
3.4 Iteratively Finding Independent Solutions
Solving the ILP formulation in the previous section gives us a single solution to the
coefficients of the best mappings for each statement. We need at least as many inde-
pendent solutions as the dimensionality of the polytope associated with each statement.
Hence, once a solution is found, we augment the ILP formulation with new constraints
and obtain the next solution; the new constraints make sure of linear independence with
solutions already found. Let the rows of $H_S$ represent the solutions found so far for a
statement S. Then, the sub-space orthogonal to $H_S$ [22,18] is given by:
$$H_S^{\perp} = I - H_S^T \left(H_S H_S^T\right)^{-1} H_S \tag{10}$$
Note that $H_S^{\perp}.H_S^T = 0$, i.e., the rows of $H_S$ are orthogonal to those of $H_S^{\perp}$. Let $h_S^*$ be
the next row (linear portion of the hyperplane) to be found for statement S. Let $H_S^{i\perp}$ be
a row of $H_S^{\perp}$. Then, any one of the inequalities given by $\forall i,\ H_S^{i\perp}.h_S^* > 0,\ H_S^{i\perp}.h_S^* < 0$
gives the necessary constraint to be added for statement S to make sure that $h_S^*$ has a
non-zero component in the sub-space orthogonal to $H_S$. This leads to a non-convex
space, and ideally, all cases have to be tried and the best among those kept. When the
number of statements is large, this leads to a combinatorial explosion. In such cases,
we restrict ourselves to the sub-space of the orthogonal space where all the constraints
are positive, i.e., the following constraints are added to the ILP formulation for linear
independence:
$$\forall i,\ H_S^{i\perp}.h_S^* \ge 0 \;\wedge\; \sum_i H_S^{i\perp}.h_S^* \ge 1 \tag{11}$$
By just considering a particular convex portion of the orthogonal sub-space, we are
discarding solutions that usually involve loop reversals or combination of reversals with
other transformations; however, in practice, we believe this does not make a difference.
The mappings found are independent on a per-statement basis. When there are
statements with different dimensionalities, the number of such independent mappings found
for each statement is equal to the number of outer loops it has. Hence, no more
orthogonality constraints need be added for statements for which enough independent solutions
have been found (the rest of the rows get automatically filled with zeros or linearly
dependent rows). As mentioned in Sec. 2, the number of rows in the transformation matrix
is the same for each statement and the depth of the deepest loop nest in the target code
is the same as that of the source loop nest. Overall, a hierarchy of fully permutable loop
nest sets is found, and a lower level in the hierarchy will not be obtained unless
constraints corresponding to dependences that have been carried by the parent permutable
set have been removed.
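For a single previously found row h, expression (10) simplifies to I − hᵀh / (h.h). To stay in integer arithmetic, the sketch below computes this matrix scaled by h.h, which leaves the signs of all products unchanged. The restriction to one row, the scaling, and the function names are assumptions of this illustration; with scaling, the second condition of (11) is checked as "the products do not all vanish" rather than with the paper's exact threshold.

```c
#include <assert.h>

/* Scaled orthogonal complement of a single row h with n entries:
 *   out = (h.h) * I - h^T h,
 * i.e. (h.h) times the matrix of (10) for a one-row H. out is n x n,
 * stored row-major. */
void scaled_orthogonal_complement(const int *h, int n, int *out)
{
    assert(n > 0);
    int hh = 0;
    for (int d = 0; d < n; d++)
        hh += h[d] * h[d];
    assert(hh > 0);                      /* h must be non-zero */
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            out[r * n + c] = (r == c ? hh : 0) - h[r] * h[c];
}

/* Constraints in the style of (11): every row of the scaled complement has
 * a non-negative product with the candidate, and the products sum to at
 * least 1 (they are integers, so this means they do not all vanish). */
int is_independent_candidate(const int *perp, int n, const int *cand)
{
    long total = 0;
    for (int r = 0; r < n; r++) {
        long p = 0;
        for (int c = 0; c < n; c++)
            p += (long)perp[r * n + c] * cand[c];
        if (p < 0)
            return 0;
        total += p;
    }
    return total >= 1;
}
```

With h = (1,0) the complement is [[0,0],[0,1]]: the candidate (1,1) is accepted, (2,0) is rejected as linearly dependent, and (1,−1) is rejected because this convex restriction discards directions that involve a reversal.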
3.5 Communication and locality optimization unified
From the algorithm described above, both synchronization-free and pipelined
parallelism are found. Note that the best possible solution to (9) is with (u = 0, w = 0) and
this happens when we find a hyperplane that has no dependence components along its
normal, which is a fully parallel loop requiring no synchronization if it is at the outer
level (outer parallel); it could be an inner parallel loop if some dependences were re-
moved previously and so a synchronization is required after the loop is executed in
parallel. Thus, in each of the steps that we find a new independent hyperplane, we
end up first finding all synchronization-free hyperplanes; these are followed by a set of
fully permutable hyperplanes that are tilable and pipelined parallel requiring constant
boundary communication (u = 0;w > 0) w.r.t the tile sizes. In the worst case, we have
a hyperplane with u > 0,w ≥ 0 resulting in long communication from non-constant
dependences. It is important to note that the latter are pushed to the innermost level. By
bringing in the notion of communication volume and its minimization, all degrees of
parallelism are found in the order of their preference.
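The ordering described above can be sketched as a small classification on the bound coefficients. This is only an illustration: it assumes the bounding function has the form v(p) = u.p + w with non-negative coefficients, and the function names and returned labels are hypothetical.

```python
def classify_hyperplane(u, w):
    """Map the bound coefficients (u, w) of a hyperplane to a kind of parallelism.

    u: tuple of non-negative coefficients on the structure parameters.
    w: non-negative constant term.
    """
    if any(c > 0 for c in u):
        # Dependence distance grows with the problem size.
        return "non-constant communication"
    if w > 0:
        # Constant distance: tilable, pipelined parallel.
        return "pipelined parallel (constant boundary communication)"
    # No dependence component along the normal of the hyperplane.
    return "synchronization-free parallel"

def preference_order(solutions):
    # Lexicographic order with u leading and w next, mirroring the objective.
    return sorted(solutions, key=lambda s: (tuple(s[0]), s[1]))

for u, w in preference_order([((1,), 0), ((0,), 1), ((0,), 0)]):
    print(u, w, classify_hyperplane(u, w))
```

Note that a hyperplane with (u = 0, w = 0) is only synchronization-free when it is found at the outer level; after dependences have been removed it corresponds to an inner parallel loop, a distinction this sketch does not track.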
From the point of view of data locality, note that the hyperplanes that are used to
scan the tile space are the same as the ones that scan points in a tile. Hence, data locality
is optimized from two angles: (1) cache misses at tile boundaries are minimized for local
execution (as cache misses at local tile boundaries are equivalent to communication
along processor tile boundaries); (2) by reducing reuse distances, we are increasing the
local cache tile sizes. The former is due to selection of good tile shapes and the latter
by the right permutation of hyperplanes (which is implicit in the order in which we find
hyperplanes).
3.6 Space and time in transformed iteration space
By minimizing φ(t) − φ(s) as we find hyperplanes from outermost to innermost, we
push dependence carrying to inner loops and also ensure that loops do not have negative
dependence components (to the extent possible) so that target loops can be tiled. Once
this is done, if the outer loops are used as space (as many as desired, say k), and
the rest are used as time (note that at least one time loop is required unless all loops are
synchronization-free parallel), communication in the processor space is optimized as
the outer space loops are the k best ones. Whenever the loops can be tiled, they result in
coarse-grained parallelism as well as better reuse within a tile. In practice, we usually
do not need more than two degrees of parallelism. If a degree of communication-free
parallelism exists, that particular loop (assuming it has a large extent) is sufficient to
expose enough coarse-grained parallelism. Note that all loops detected as parallel need
not be marked parallel.
3.7 Fusion
The algorithm described in the previous section can also enable fusion across multiple
iteration spaces that are weakly connected, as in sequences of producer-consumer loops.
Solving for hyperplanes for multiple statements leads to a schedule for each statement
such that all statements in question are finely interleaved: this is indeed fusion. This
generalization of fusion is the same as the one proposed in [7,14]. Note that we leave the
structure parameter p out of our affine transform definition in 4. The extended version
of this paper [4] describes how fusion naturally integrates into the framework.
3.8 Summary
The algorithm is summarized below. It can be viewed as transforming to a tree of
permutable loop nest sets/bands: each node of the tree is a good permutable loop nest
set. Step 12 of the repeat-until block in Algorithm 1 (Fig. 3) finds such a band of permutable
loops. If all loops are tilable, there is just one node containing all the loops that are per-
mutable. On the other extreme, if no loops are tilable, each node of the tree has just one
loop and so no tiling is possible. At least two hyperplanes should be found at any level
(without dependence removal/cutting) to enable tiling. Dependences from previously
found solutions are thus not removed unless they have to be (Step 17): to allow the next
permutable band to be found, and so on. Hence, partially tilable or untilable input is
all handled. Loops in each node of the target tree can be stripmined/interchanged when
there are at least two of them in it; however, it is illegal to move a stripmined loop across
different levels in the tree.
3.9 Accuracy of Cost Function and Refinement
The metric we presented here can be refined while keeping the problem within ILP.
The motivation behind taking a max is to avoid multiple counting of the same set of
points that need to be communicated for different dependences. This happens when all
dependences originate from the same data space and the same order of communication
volume is required for each of them. Using the sum of max'es on a per-array basis is a
more accurate metric. Also, even for a single array, sets of points with very little overlap
or no overlap may have to be communicated for different dependences. Also, different
dependences may have source dependence polytopes of different dimensionalities.
Note that the image of the source dependence polytope under the data access func-
tion associated with the dependence gives the actual set of points to be communicated.
Input: Generalized dependence graph G = (V, E) (includes dependence polyhedra P_e, e ∈ E)
 1: S_max: statement with maximum domain dimensionality
 2: for each dependence e ∈ E do
 3:   Build legality constraints: apply Farkas Lemma on φ(t) − φ(h_e(t)) ≥ 0 under t ∈ P_e,
      and eliminate all Farkas multipliers
 4:   Build communication volume/reuse distance bounding constraints: apply Farkas Lemma
      to v(p) − (φ(t) − φ(h_e(t))) ≥ 0 under P_e, and eliminate all Farkas multipliers
 5:   Aggregate constraints from both into C_e(i)
 6: end for
 7: repeat
 8:   C = ∅
 9:   for each dependence edge e ∈ E do
10:     C ← C ∪ C_e(i)
11:   end for
12:   Compute lexicographic minimal solution with u's coefficients in the leading position
      followed by w to iteratively find independent solutions to C (orthogonality constraints
      are added as each solution is found)
13:   if no solutions were found then
14:     Cut dependences between two strongly-connected components in the GDG and insert
        the appropriate splitter in the transformation matrices of the statements
15:   end if
16:   Compute E_c: dependences carried by solutions of Step 12/14; update necessary
      dependence polyhedra (when a portion of it is satisfied)
17:   E ← E − E_c; reform the GDG (V, E)
18: until H⊥_Smax = 0 and E = ∅
Output: A transformation matrix for each statement (with the same number of rows)
Fig. 3. Algorithm 1
Hence, just using the communication rate (number of hyperplanes on the tile boundary)
as the metric may not be accurate enough. This can be taken care of by having different
bounding functions for dependences with different orders of communication, and us-
ing the bound coefficients for dependences with higher orders of communication as the
leading coefficients while finding the lexicographic minimal solution.
4 Example
Figure 4 shows an example from the literature [8] with affine non-constant dependences.
We exclude the constant $c_0$ from the mappings as we have a single statement.
Dependence analysis produces the dependence polyhedra and h-transformations shown in
Fig. 4.
Dependence 1: Tiling legality constraint:
$$c_i i + c_j j - c_i i - c_j (j - 1) \ge 0 \;\Rightarrow\; c_j \ge 0$$
Since this is a constant dependence, the volume bounding constraint gives:
$$w - c_j \ge 0$$
for (i=0; i<N; i++) {
  for (j=2; j<N; j++) {
    a[i][j] = a[j][i] + a[i][j-1];
  }
}
[Figure: iteration space over axes i and j, with tile labels P0 to P5 and the directions labeled space and time]
a[i',j'] → a[i,j − 1]
h1 : i' = i, j' = j − 1;
2 ≤ j ≤ N, 1 ≤ i ≤ N
a[i',j'] → a[j,i]
h2 : i' = j, j' = i;
2 ≤ j ≤ N, 1 ≤ i ≤ N, i − j ≥ 1
a[j',i'] → a[i,j]
h3 : j' = i, i' = j;
2 ≤ j ≤ N, 1 ≤ i ≤ N, i − j ≥ 1
Fig. 4. Example: Non-constant dependences
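Dependence 2: Tiling legality constraint:
$$(c_i i + c_j j) - (c_i j + c_j i) \ge 0, \quad (i,j) \in P_2$$
Applying Farkas Lemma, we have:
$$(c_i - c_j)\, i + (c_j - c_i)\, j \equiv \lambda_0 + \lambda_1 (N - i) + \lambda_2 (N - j) + \lambda_3 (i - j - 1) + \lambda_4 (i - 1) + \lambda_5 (j - 1),$$
$$\lambda_0, \lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5 \ge 0 \qquad (12)$$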
Equating LHS and RHS coefficients for i, j, N and the constants in (12), and eliminating
Farkas multipliers through Fourier-Motzkin, we obtain the following:
$$c_i - c_j \ge 0$$
Volume bounding constraint:
$$u_1 N + w - (c_i i + c_j j - c_i j - c_j i) \ge 0, \quad (i,j) \in P_2$$
Application of Farkas Lemma in a similar way as above and elimination of the multipliers
yields:
$$u_1 \ge 0, \qquad u_1 - c_i + c_j \ge 0, \qquad 3u_1 + w - c_i + c_j \ge 0 \qquad (13)$$
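As a sanity check on (13), the bound can be tested by brute force over the integer points of the dependence polyhedron. The sketch below is not the Farkas/Fourier-Motzkin procedure itself: it enumerates points of P2 (as listed in Fig. 4) for a finite range of N and compares the outcome with the three inequalities in (13). It uses the bound in the form of Step 4 of the algorithm, v(p) − (φ(t) − φ(h_e(t))) ≥ 0 with v = u1 N + w; all function names are hypothetical, and coefficients are assumed non-negative.

```python
def distance_dep2(ci, cj, i, j):
    # phi(t) - phi(s) for dependence 2: target iteration (i, j), source (j, i).
    return (ci * i + cj * j) - (ci * j + cj * i)

def points_p2(n):
    # Integer points of P2 from Fig. 4: 2 <= j <= N, 1 <= i <= N, i - j >= 1.
    return [(i, j) for i in range(1, n + 1) for j in range(2, n + 1) if i - j >= 1]

def bound_holds(u1, w, ci, cj, n_values):
    # Check u1*N + w - (phi(t) - phi(s)) >= 0 at every point, for every N given.
    return all(u1 * n + w - distance_dep2(ci, cj, i, j) >= 0
               for n in n_values for (i, j) in points_p2(n))

def satisfies_13(u1, w, ci, cj):
    # The three inequalities of (13).
    return u1 >= 0 and u1 - ci + cj >= 0 and 3 * u1 + w - ci + cj >= 0

# Compare both tests on a small grid of non-negative coefficients.
mismatches = [(u1, w, ci, cj)
              for u1 in range(3) for w in range(3)
              for ci in range(3) for cj in range(3)
              if satisfies_13(u1, w, ci, cj) != bound_holds(u1, w, ci, cj, range(3, 13))]
print("mismatches:", mismatches)
```

The range of N is finite, so this only checks the constraints on small inputs; it is a consistency test, not a proof.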
Dependence 3: Due to symmetry with respect to i and j, the third dependence does not
give anything more than the second one.
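Finding the hyperplanes. Aggregating legality and volume bounding constraints for all
dependences, we obtain:
$$\begin{aligned}
c_j &\ge 0 \\
w - c_j &\ge 0 \\
c_i - c_j &\ge 0 \\
u_1 &\ge 0 \\
u_1 - c_i + c_j &\ge 0 \\
3u_1 + w - c_i + c_j &\ge 0
\end{aligned}
\qquad \text{minimize}_{\prec}\ (u_1, w, c_i, c_j) \qquad (14)$$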
The lexicographic minimal solution for the vector $(u_1, w, c_i, c_j)$ is $(0,1,1,1)$ (the zero
vector is a trivial solution and is avoided). Hence, we get $c_i = c_j = 1$. Note that $c_i = 1$
and $c_j = 0$ is not obtained even though it is a valid tiling hyperplane as it involves more
communication: it requires $u_1$ to be positive.
The next solution is forced to have a positive component in the subspace orthogonal
to (1,1) given by (10) as (1,-1). This leads to the addition of the constraint $c_i - c_j \ge 1$ or
$c_i - c_j \le -1$ to the existing formulation. Adding $c_i - c_j \ge 1$ to (14), the lexicographic
minimal solution is (1, 0, 1, 0), i.e., $u_1 = 1$, $w = 0$, $c_i = 1$, $c_j = 0$ ($u_1 = 0$ is no
longer valid). Hence, (1,1) and (1,0) are the tiling hyperplanes obtained. (1,1) is used
as space with one line of communication between processors, and the hyperplane (1,0)
is used as time in a tile. The outer tile schedule is (2,1) ( = (1,1) + (1,0)).
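The two solutions above can be reproduced with a brute-force search, shown here as a minimal sketch rather than the parametric ILP solver used by the tool. It assumes all unknowns are non-negative integers below a small bound and excludes the zero hyperplane with the extra constraint ci + cj ≥ 1; both are assumptions made for the illustration, and `feasible` and `lexmin` are hypothetical names.

```python
from itertools import product

def feasible(u1, w, ci, cj):
    # Constraints of (14), plus ci + cj >= 1 (assumed) to avoid the zero hyperplane.
    return (cj >= 0 and w - cj >= 0 and ci - cj >= 0 and u1 >= 0
            and u1 - ci + cj >= 0 and 3 * u1 + w - ci + cj >= 0
            and ci + cj >= 1)

def lexmin(extra=lambda u1, w, ci, cj: True, bound=4):
    # itertools.product yields tuples in lexicographic order, so the first
    # feasible candidate is the lexicographic minimum of (u1, w, ci, cj).
    for cand in product(range(bound), repeat=4):
        if feasible(*cand) and extra(*cand):
            return cand
    return None

# First hyperplane: no independence constraint yet.
first = lexmin()
# Second hyperplane: add ci - cj >= 1, the positive side of the (1,-1) direction.
second = lexmin(extra=lambda u1, w, ci, cj: ci - cj >= 1)
print(first, second)
```

Enumeration is practical only because this example has four unknowns; it is meant to make the ordering of the objective visible, with u1 minimized first, then w, then the hyperplane coefficients.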
This transformation is in contrast to other approaches based on schedules which ob-
tain a schedule and then the rest of the transformation matrix. Feautrier’s greedy heuris-
tic gives the schedule θ(i,j) = 2i+j−3 which carries all dependences. However, using
this as either space or time does not lead to communication or locality optimization. The
(2,1) hyperplane has non-constant communication along it. In fact, the only hyperplane
that has constant communication along it is (1,1). This is the best hyperplane to be used
as a space loop if the nest is to be parallelized, and is the first solution that our algo-
rithm finds. The (1,0) hyperplane is used as time leading to a solution with one degree
of pipelined parallelism with one line per tile of near-neighbor communication (along
(1,1)) as shown in Fig. 4. Hence, a good schedule that tries to carry all dependences (or
as many as possible) is not necessarily a good loop for the transformed iteration space.
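The difference between these hyperplanes can be checked numerically by evaluating φ(t) − φ(s) for the two dependences of the example over a finite iteration space. The sketch below assumes φ(i, j) = ci·i + cj·j and the domain listed in Fig. 4; the function names are hypothetical.

```python
def dependence_distances(c, n):
    """Set of values phi(t) - phi(s) for phi(i, j) = ci*i + cj*j on an N = n domain."""
    ci, cj = c
    phi = lambda i, j: ci * i + cj * j
    dists = set()
    for i in range(1, n + 1):
        for j in range(2, n + 1):
            # Dependence 1: source iteration (i, j - 1).
            dists.add(phi(i, j) - phi(i, j - 1))
            if i - j >= 1:
                # Dependence 2: source iteration (j, i).
                dists.add(phi(i, j) - phi(j, i))
    return dists

def max_distance(c, n):
    return max(dependence_distances(c, n))

for c in [(1, 1), (1, 0), (2, 1)]:
    print(c, [max_distance(c, n) for n in (10, 20, 40)])
```

Along (1,1) the largest distance stays at 1 for every N tried, while along (1,0) and (2,1) it grows with N, which matches the observation that only (1,1) has constant communication along it.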
5 Implementation and Preliminary Results
We have implemented our transformation framework using PipLib 1.3.3 [9]. Our tool
takes as input dependence information (dependence polyhedra and h-transformations)
from LooPo’s [1] dependence tester and generates statement-wise affine transforma-
tions. The transforms generated by our tool are provided to CLooG [3] as scattering
functions. The goal is to get tiled shared memory parallel code, for example, OpenMP
code for multi-core architectures. As a final step, the detected parallelism can be mapped
to a desired parallel architecture depending on the number of degrees of parallelism
required. Results show that the tool (with preliminary optimizations) already runs ex-
tremely fast making further refinements to the model very attractive. The number of
loops shown in the table is the sum of the number of outer loops of all statements in
the original code. In theory, since our approach relies on integer programming, it has
a worst-case exponential time complexity. However, it runs extremely fast in practice.
This is mainly because program polyhedra have a simple structure, and the ILP for-
mulations resulting from them are quickly solved. Due to space constraints, we are not
including results from experimental evaluation of the transformed code. A summary of
the results can be found in Table 1. [20,19,21,15] represent the state-of-the-art from the
research community, while ICC 10.1 with '-fast -parallel' was used as the native compiler.
The results were taken on an Intel Core 2 Quad (Q6600, 2.4 GHz). The detailed
experimental evaluation can be found in [5].
                             | Single core speedup               | Multi-core speedup (4 cores)
Benchmark                    | over native    | over state-of-   | over native    | over state-of-
                             | compiler       | the-art research | compiler       | the-art research
1-d Jacobi (imperfect nest)  | 5.23x          | 2.1x             | 20x            | 1.7x
2-d FDTD                     | 3.7x           | 3.1x             | 7.4x           | 2.46x
3-d Gauss-Seidel             | 1.6x           | -                | 4.5x           | -
LU decomposition             | 5.56x          | 5.74x            | 14x            | 3.76x
Matrix Vec Transpose         | 9.3x           | 5.5x             | 13x            | 6.96x
Table 1. Initial results: speedup over state-of-the-art
6 Related work
Iteration space tiling [16,31,32,25] is a standard approach for aggregating a set of loop
iterations into tiles, with each tile being executed atomically. In addition, researchers
have considered the problem of selecting tile shape and size to minimize communica-
tion, improve locality or minimize finish time [25,33]. These works are restricted to a
single perfectly nested loop nest with uniform dependences or similar restrictions which
are far away from real-world code.
Loop parallelization has been studied extensively. The reader is referred to the
survey of Boulet et al. [6] for a detailed summary of older parallelization algorithms which
accepted restricted input and/or are based on weaker dependence abstractions than exact
polyhedral dependences.
Scheduling with affine functions using faces of the polytope by application of the
Farkas algorithm was first proposed by Feautrier [11]. Feautrier explored various pos-
sible approaches to obtain good affine schedules that minimize latency. The schedules
carry all dependences and so all the inner loops can be parallel. However, transforming
to permutable loops that are amenable to tiling was not addressed. Though schedules
yield inner parallel loops, the time loops cannot be tiled unless communication in the
space loops is in the forward direction (dependences have positive components along all
dimensions). Several works [15,7,23] make use of such schedules. Overall, Feautrier’s
classic works [11,12] are geared towards finding maximum fine-grained parallelism as
opposed to tilability for coarse-grained parallelization with minimized communication
and better locality.
Lim and Lam [21,20] proposed an affine partitioning framework that identifies
outer parallel loops (communication-free space partitions) and permutable loops (pipelined
parallel or tilable loops) to maximize the degree of parallelism and minimize the
order of synchronization. They employ the same machinery for blocking [19]. Several
(infinitely many) solutions equivalent in terms of the criterion they optimize for result
from their algorithm, and these significantly differ in communication cost and locality;
no metric is provided to differentiate between these solutions. As seen in Sec. 3, without
a cost function, the solutions obtained even for the simplest input are not satisfactory.
Ahmed et al. [2] proposed a framework for locality optimization of imperfectly
nested loops for sequential execution. The approach determines the embedding for each
statement into a product space, which is then optimized for locality through another
transformation. Their heuristic sets reuse distances in the target space for some de-
pendences to zero (or a constant) to obtain solutions to the embedding/transformation
matrix coefficients. However, the choice of the dependences and the number, which is
important, is determined heuristically. Also, such an approach need not completely de-
termine the embedding function/transformation matrix coefficients. Exploring all pos-
sibilities here is infeasible. Overall, the automatability and robustness of the heuristic
even for simple code is not clear from the description.
Griebl [15] presents an integrated framework for optimizing locality and paral-
lelism with space and time tiling. Griebl’s approach enables time tiling by using a for-
ward communication-only placement with an existing schedule. As mentioned earlier
(Sec. 3), using schedules as time loops may not lead to communication or locality-
optimized solutions.
Cohen et al., Girbal et al. [7,14] proposed and developed a framework (URUK/WRAP-IT)
to compose sequences of transformations in a semi-automatic fashion. Transformations
are applied automatically, but specified manually by an expert. Pouchet et al. [23]
search the space of transformations (one-dimensional schedules) to find good ones
through iterative optimization by employing performance counters. On the other hand,
our approach is fully-automatic. However, some amount of empirical and iterative op-
timization is required to choose transforms that work best in practice. This is true when
several fusion choices exist. Also, effective determination of tile sizes and unroll factors
for transformed whole-programs may only be possible through some amount of empir-
ical search. A combination of our algorithm and empirical search in a smaller space is
an interesting approach to pursue. Alternatively, more powerful cost models like those
based on computing Ehrhart polynomials [30] can be employed once transformations
in a smaller space can be enumerated.
7 Conclusions
We have presented a single framework that addresses automatic parallelization and
data locality optimization using transformations in the polyhedral model. The proposed
algorithm finds communication-minimized tiling hyperplanes for parallelization of a
sequence of arbitrarily nested loops. The same hyperplanes also minimize reuse dis-
tances and improve data locality. The approach also enables fusion in the presence of
producing-consuming loops. To the best of our knowledge, our work is the first to pro-
pose a practical cost model to drive automatic transformation in the polyhedral model.
The framework has been implemented into a tool to perform transformations in a fully
automatic way from C/Fortran code using the LooPo infrastructure and CLooG. Pre-
liminary results show very good scalability of the running time of the framework with
input size and complexity of input code.