
APOLLO: Automatic speculative POLyhedral Loop Optimizer


Abstract

A few weeks ago, we were glad to announce the first release of Apollo, the Automatic speculative POLyhedral Loop Optimizer. Apollo applies polyhedral optimizations on-the-fly to loop nests whose control flow and memory access patterns cannot be determined at compile-time. In contrast to existing tools, Apollo can handle any kind of loop nest, including those whose memory accesses are performed through pointers and indirections. At runtime, Apollo builds a predictive polyhedral model, which is used for speculative optimization including parallelization. Being a dynamic system, Apollo can even apply the polyhedral model to nonlinear loops. This paper describes Apollo from the perspective of a user, as well as some of its main contributions and mechanisms, including just-in-time polyhedral compilation, which significantly extends the scope of polyhedral techniques.
Juan Manuel Martinez
Univ. of Strasbourg, France
Dpt of Computer Science and
Ohio State University, USA
Artiom Baloian
Univ. of Strasbourg, France
Manuel Selva
Univ. of Strasbourg, France
Philippe Clauss
Univ. of Strasbourg, France
Is it legal to parallelize the code in Listing 1? Can you
apply tiling? Can your polyhedral compiler handle it?
Listing 1: Sparse Matrix-Matrix multiplication
    for (row = 1; row <= left->Size; row++) {
        Pelem = left->FirstInRow[row];
        while (Pelem) {
            for (col = 1; col <= cols; col++) {
                result[row][col] +=
                    Pelem->Real * right[Pelem->Col][col];
            }
            Pelem = Pelem->NextInRow;
        }
    }
Seventh International Workshop on Polyhedral Compilation Techniques
Jan 23, 2017, Stockholm, Sweden
In conjunction with HiPEAC 2017.
The polyhedral model [9], or polytope model, is well-
known for performing aggressive loop transformations de-
voted to parallelism and data-locality. Although very effec-
tive, compilers relying on this model [3, 10] are restricted
to a small class of compute-intensive codes that can only
be handled at compile-time. However, most codes are not
amenable to this model, because they use dynamic data
structures accessed through indirect references or pointers,
which prevent a precise static dependency analysis. On
the other hand, Thread-Level Speculation (TLS) [18] is a
promising approach to overcome this limitation: regions of
code are executed in parallel before all the dependencies
are known. Hardware or software mechanisms track register
and memory accesses to determine if any dependency viola-
tion occurs. But traditional TLS systems implement only
a straightforward loop parallelization strategy, consisting
of slicing the target loop into consecutive parallel threads,
where each thread follows the original serial schedule of loop
iterations and statements.
A few weeks ago, we were glad to announce the first release
of Apollo [1], which is a TLS software framework implementing
a speculative and dynamic adaptation of the polyhedral
model, where parallelizing and optimizing transformations
are performed on-the-fly for loops exhibiting a
polyhedral-compliant behavior at runtime. This software is the
outcome of seven years of research and development, three PhD
theses [11, 20, 15] and several Master's theses. It is based on
an initial prototype called VMAD [13, 12], which implemented
some seminal concepts that were later improved
and extended in Apollo [21, 22, 23, 5].
We begin the next Section with an overview of the framework
in Subsection 2.1, then some performance results in
Subsection 2.2, and finally in Subsection 2.3, we describe
Apollo from the user’s perspective and define the kinds of
codes that may be good candidates. Then, we present two
key concepts: Section 3 details the prediction model, which
makes it possible to detect polyhedral-compliant runtime
behaviors and to perform speculative polyhedral optimization; and
Section 4 details Apollo’s hybrid code generation mechanism
based on code bones, which allows optimized code resulting
from any polyhedral transformation to be generated on-the-fly.
Section 5 describes the pitfalls that we overcame to make
the polyhedral model usable at runtime, and presents the
polyhedral model community with some new challenges for
making runtime polyhedral optimization even more effective
and the scope of polyhedral techniques even larger. Conclusions
are given in Section 6.
This section gives an overview of the Apollo framework
along with the achieved speedups and how to use it.
2.1 Overview
Apollo is capable of applying polyhedral loop optimizations
to any kind of loop nest¹, even if it contains unpredictable
control and memory accesses through pointers or
indirections, as long as it exhibits a polyhedral-compliant
behavior at runtime, at least in phases. A polyhedral-compliant
behavior is characterized as follows:
- linear loops: when the target loop nests are running,
  (1) every memory instruction references a series of
  memory addresses that can be defined as an affine
  function of the surrounding loop counters; (2) the loop
  trip count of each loop of the nest can be defined as
  an affine function of the surrounding loop counters;
  (3) every scalar variable depending on scalar variables
  defined in previous iterations behaves as an induction
  variable, making its series of values definable as an
  affine function of the surrounding loop counters.
- nonlinear loops: either the same characteristics hold,
  or (1) when not linear, a memory instruction references
  a series of memory addresses which can be approximated
  by a pair of parallel regression hyperplanes
  of dimension d − 1, defined by an affine function of
  the d surrounding loop counters, and forming a sufficiently
  narrow tube inside which every accessed memory
  address lies; (2) when not linear,
  a loop trip count can also be approximated as in (1).
  More details regarding this modeling are given in
  Section 3.
The framework is made of two components: a static com-
piler, whose role is to prepare the target code for speculative
parallelization, and implemented as passes of the Clang-
LLVM compiler [14]; and a runtime system, that orches-
trates the execution of the code.
New virtual iterators (or loop counters), starting at zero
with step one, are systematically inserted at compile-time
in the original loop nest. They are used for handling any
kind of loop in the same manner, and serve as a basis for
building the prediction model and for reasoning about code
transformations.
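To make the idea of a virtual iterator concrete, here is a minimal source-level sketch. It is only an illustration: Apollo inserts these iterators in its compiler passes rather than in the source, and the `struct node` type and both function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical node type, used only for this illustration. */
struct node {
    double value;
    struct node *next;
};

/* Original loop: no explicit counter, trip count unknown at compile-time. */
double sum_original(const struct node *p)
{
    double sum = 0.0;
    while (p) {
        sum += p->value;
        p = p->next;
    }
    return sum;
}

/* Same loop with a virtual iterator vi (starts at zero, step one).
 * The final value of vi is the trip count that a profiler could record. */
double sum_with_virtual_iterator(const struct node *p, long *trip_count)
{
    double sum = 0.0;
    long vi = 0;
    while (p) {
        sum += p->value;
        p = p->next;
        ++vi;
    }
    *trip_count = vi;
    return sum;
}
```

Both functions compute the same result; the second one simply exposes a counter against which addresses, bounds and scalars can be expressed.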
Apollo’s static compiler analyzes each target loop nest
regarding its memory accesses, its loop bounds and the
evolution of its scalar variables. It classifies these objects as being
static or dynamic. For example, if the target addresses of
a memory instruction can be defined as a linear function of
the iterators at compile-time, then it is considered as static.
Otherwise, it is dynamic. Dynamic instructions require
instrumentation so that their memory access patterns can be
observed and analyzed at runtime. The same is achieved
for the loop bounds and scalars. This classification is used
to build an instrumented version of the code, where
instructions collecting values of the dynamic objects are inserted, as
well as instructions collecting the initial values of the static
objects (e.g. base addresses of regular data structures).
¹ for-loops, while-loops, do-while-loops, goto-loops, ...
At runtime, Apollo executes the target loop nest in successive
phases, where each phase corresponds to a slice of
the outermost loop (see Figure 1):
- First, an on-line profiling phase ① is launched, executing
  serially a small number of iterations, and recording
  memory addresses, loop trip counts and scalar values.
- From the recorded values, linear equalities and inequalities
  are obtained, through interpolation or regression,
  to build a polyhedral prediction model. This
  process is further addressed in Section 3. Using this
  model, the loop optimizations to be applied are determined
  by invoking Pluto [4] on-line. From the Pluto-suggested
  transformation, the corresponding parallel
  code is generated, with additional instructions devoted
  to the verification of the speculation. These last steps
  for code generation are detailed in Section 4. In order
  to mask the time overhead of these steps, the original
  serial version of the loop is launched in parallel ③.
- A backup ④ of the memory regions that are predicted
  to be updated during the execution of the next slice
  is performed. An early detection of a potential misprediction
  is also performed, by checking that all the
  memory locations that are predicted to be accessed are
  actually allocated to the process.
- A large slice of iterations is executed ⑤ using the parallel
  optimized version of the code. While executing,
  the prediction model is continually verified by comparing
  the actual reached values against their predictions.
  At the end of the slice, if no misprediction was
  detected, Apollo performs a new backup for the next
  slice ⑥, and executes this slice using the same optimized
  version ⑦. If a misprediction is detected, memory
  is restored ⑧ to cancel the execution of the current
  slice. Then, the execution of the slice is re-initiated using
  the original serial version ⑨ of the code, in order
  to overcome the faulty execution point. Finally, a profiling
  slice is launched again to capture the changing
  behavior and build a new prediction model.
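The backup, speculative execution and rollback steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not Apollo's runtime: the `slice_fn` signature, `run_slice` and the two example slice functions are hypothetical, and the whole array is backed up, whereas Apollo only saves the regions predicted to be updated.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature: runs iterations [begin, end) on data and
 * returns false if a misprediction was detected. */
typedef bool (*slice_fn)(double *data, size_t len, long begin, long end);

/* Example "original" serial version: always valid. */
bool increment_serial(double *d, size_t len, long b, long e)
{
    for (long i = b; i < e && (size_t)i < len; ++i)
        d[i] += 1.0;
    return true;
}

/* Stand-in for an optimized version whose prediction turns out wrong:
 * it leaves invalid updates behind and reports the misprediction. */
bool increment_mispredicted(double *d, size_t len, long b, long e)
{
    for (long i = b; i < e && (size_t)i < len; ++i)
        d[i] += 100.0;
    return false;
}

/* Executes one slice speculatively. Returns true if the optimized
 * version committed, false if the slice was rolled back and re-executed
 * with the original version. */
bool run_slice(double *data, size_t len, long begin, long end,
               slice_fn optimized, slice_fn original)
{
    double *backup = malloc(len * sizeof *backup);
    if (!backup) {                      /* no backup possible: stay safe */
        original(data, len, begin, end);
        return false;
    }
    memcpy(backup, data, len * sizeof *backup);      /* backup */

    bool ok = optimized(data, len, begin, end);      /* speculative run */
    if (!ok) {
        memcpy(data, backup, len * sizeof *backup);  /* rollback */
        original(data, len, begin, end);             /* safe re-execution */
    }
    free(backup);
    return ok;
}
```

After a rollback, a real runtime would go back to profiling to build a new prediction model; that part is left out of this sketch.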
2.2 Performance results
In Figure 2, we show the speed-up obtained by using
Apollo over the best serial version generated by either the
gcc-4.8 or clang-3.4 compiler with optimization level 3 (-O3).
Our experiments were run on two machines: a general-purpose
multi-core server, with two AMD Opteron 6172 processors
of 12 cores each; and an embedded system multi-core chip,
with one ARM Cortex A53 64-bit processor of 8 cores. The
benchmarks were executed using 8 threads.
The set of benchmarks has been built from a collection
of benchmark suites, such that the selected codes include
a main kernel loop nest and highlight Apollo’s capabilities:
SOR from the Scimark suite [17]; Backprop and Needle from
the Rodinia suite [7]; DMatmat, ISPMatmat, SPMatmat and
Pcg from the SPARK00 suite of irregular codes [24]; and
Mri-q from the Parboil suite [19]. In Table 1, we identify
the characteristics of each program that make it impossible
to parallelize at compile-time, where:
Has indirections means that the kernel loop accesses
memory through array indirections (e.g., A[B[i]]).
Figure 1: Execution in slices of iterations from iteration 0 to 228 of the outermost original loop
Figure 2: Speedup of Apollo, using 8 threads, over the best sequential version generated with clang/gcc.
Has pointers means that the kernel loop accesses mem-
ory through pointers (e.g., linked list).
Unpredictable bounds means that some loop bounds
cannot be known at compile-time (e.g., while loops or
for loops bounded by runtime variables).
Unpredictable scalars means that the values taken by
some scalars cannot be known at compile-time.
Both “has indirections” and “has pointers” lead to memory
accesses that cannot be identified as linear statically.
Benchmark   Has ind.   Has pointers   Unpredict. bounds   Unpredict. scalars
Mri-q X
Needle X
Backprop X X
DMatmat X X
ISPMatmat X X X X
SPMatmat X X X X
Table 1: Characteristics of each benchmark.
For the SPMatmat kernel, five inputs with different data
layouts were used to highlight some key features of Apollo.
In Table 2, we show the transformations that were selected
by Pluto at runtime. Reported results are obtained by computing
the mean and standard deviation from the outcome
of five runs. We can observe that the versions optimized
with Apollo can be up to 20× faster than the original serial
version. In many cases, speed-ups over 8× (more than
the number of threads being used) were reached thanks to
transformations that also improve data locality.
Benchmark    Selected Optimization
Mri-q        Interchange
Needle       Skewing + Interchange + Tiling
SOR          Skewing + Tiling
Backprop     Interchange
PCG          Identity
DMatmat      Tiling
ISPMatmat    Tiling
SPMatmat     Tiling
Table 2: Transformations suggested by Pluto at runtime.
2.3 How to use it?
To use Apollo, the programmer has to enclose the target
loop nests using a dedicated #pragma directive (shown in
Listing 2). This #pragma does not imply any semantics; it
is only used to identify loop nests that may be interesting
to optimize with Apollo. Inside the #pragma, any kind of
loop is allowed, although there are still some restrictions:
1. The target loops must not contain any instruction performing
dynamic memory allocation, although dynamic
allocation is obviously allowed outside the target loops.
2. Since Apollo does not handle inter-procedural analyses,
the target loops should not generally contain any
function invocation. Nevertheless, a called function
may be inlined in some cases, thus lifting this restriction.
Listing 2: Example of the #pragma directive.
    #pragma apollo dcop
    for (row = 1; row <= left->Size; row++) {
        ...
    }
Then, the programmer compiles the code using our specialized
compiler, which is based on LLVM. Two commands
showing how to compile source code with Apollo are presented
in Listing 3. Any compiler flag available with clang
may be used. The results of these commands are specialized
executables containing invocations to the runtime system of
Apollo, as well as static analysis results and important data
that are required for runtime code generation.
The generated executables can be launched by the user in
the standard way. Once the execution reaches a loop nest
previously marked with the special #pragma, the runtime
system of Apollo takes control of the execution and specu-
lative execution starts, as described in the previous Section.
When the original loop exit is reached, the execution returns
to its normal flow.
Listing 3: Usage of the static component.
    apolloc -O3 source.c -o myexecutable
    apolloc++ -O3 source.cpp -o myexecutable
In contrast to most TLS systems, Apollo builds a model
that predicts the behavior of the loop nest. This is the key
for performing speculative polyhedral transformations. By
assuming that this prediction is valid, Apollo deduces
dependencies between iterations and instructions, and applies
aggressive code optimizations that involve reordering the
execution of iterations and instructions.
This model predicts: (1) memory accesses, (2) loop bounds,
and (3) basic scalars. Intuitively, memory accesses and loop
bounds must be predicted to enable precise dependency anal-
ysis for selecting an optimizing transformation. Basic scalars
correspond to the φ-instructions in the header of each loop.
Typically, they correspond to loop iterators and accumula-
tors updated at each loop iteration. They introduce very
restrictive dependencies that may forbid any optimization.
By predicting the values that are taken by these scalars,
Apollo removes these dependencies.
Memory accesses.
Apollo embeds three possible modelings for memory ac-
cesses: (i) linear, (ii) tube and (iii) range. In Figure 3, we
depict each model.
Figure 3: Different modelings for memory accesses.
When it is possible to interpolate a linear function from
the registered memory addresses, a memory access is predicted
as (i) linear. Future accesses are predicted to follow
exactly the interpolating function. However, memory accesses
may not follow a perfect linear pattern. In this case,
a regression hyperplane is calculated using the least squares
method. The regression hyperplane coefficients (initially of
type real) are then rounded to their nearest integer value.
Pearson’s correlation coefficient is then computed. If
it is greater than 0.9, future memory accesses are predicted
to happen inside a (ii) tube, otherwise inside a (iii) range.
This criterion is derived from experimental evaluation [20,
22, 23]. The tube consists of the regression hyperplane, a
tube width and a predicted alignment. The tube width is
the maximum observed deviation from the regression hyper-
plane, rounded to the next bigger multiple of the word size.
The range consists of a maximum and a minimum memory
address between which all the memory accesses are predicted
to occur. In many cases, if a memory access occurs outside
the predicted region, the transformation may still remain
valid if no new dependencies are introduced. When a mis-
prediction occurs, a more complex verification mechanism
checks if this is the case, to avoid any useless rollback.
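The choice between the three memory-access modelings can be sketched for the simplest case of a single surrounding iterator. The following is a minimal illustration, assuming the addresses were recorded at iterations 0, 1, 2, ... of one loop; `classify_access` and the enum names are hypothetical, and the tube width and coefficient rounding are left out. The 0.9 threshold is the one quoted above, and the test `r > 0.9` is evaluated on squared quantities to avoid a square root.

```c
#include <stddef.h>

enum access_model { ACCESS_LINEAR, ACCESS_TUBE, ACCESS_RANGE };

/* Classifies the addresses observed at virtual iterations 0..n-1. */
enum access_model classify_access(const long *addr, size_t n)
{
    if (n < 2)
        return ACCESS_LINEAR;   /* a single sample fits trivially */

    /* (i) linear: every sample lies exactly on addr[0] + stride * vi. */
    long stride = addr[1] - addr[0];
    int exact = 1;
    for (size_t vi = 0; vi < n; ++vi) {
        if (addr[vi] != addr[0] + stride * (long)vi) {
            exact = 0;
            break;
        }
    }
    if (exact)
        return ACCESS_LINEAR;

    /* Pearson correlation r between vi and addr[vi]. */
    double mx = 0.0, my = 0.0;
    for (size_t vi = 0; vi < n; ++vi) {
        mx += (double)vi;
        my += (double)addr[vi];
    }
    mx /= (double)n;
    my /= (double)n;

    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t vi = 0; vi < n; ++vi) {
        double dx = (double)vi - mx;
        double dy = (double)addr[vi] - my;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }

    /* r = sxy / sqrt(sxx * syy), so r > 0.9 is equivalent to
     * sxy > 0 and sxy^2 > 0.81 * sxx * syy. */
    if (sxy > 0.0 && sxy * sxy > 0.81 * sxx * syy)
        return ACCESS_TUBE;     /* (ii) tube around the regression line */
    return ACCESS_RANGE;        /* (iii) only min/max addresses predicted */
}
```

For instance, the sequence 1000, 1008, 1024, 1024, 1040, 1040 is not exactly linear but correlates strongly with the iterator, so it would be modeled as a tube.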
Loop bounds.
There are two possible modelings, (i) linear, or (ii) tube.
Figure 4 visually depicts the different types of predictions
for loop bounds. Notice that the lower bound is always 0,
and only the upper bound is predicted. The linear predic-
tion mirrors the memory accesses linear prediction. For the
tube case, a regression hyperplane is computed and its coef-
ficients are rounded to the nearest integer values. Then two
hyperplanes are derived, predicting a maximum and mini-
mum number of iterations for a loop to execute. These new
hyperplanes are parallel to the regression hyperplane, but
passing through the maximum positive and negative devia-
tions from the number of executed iterations. When select-
ing a transformation, the iteration space is divided in two,
and all the iterations situated below the minimum predicted
number of iterations may be aggressively transformed; on
the other hand, the iterations between the predicted mini-
mum and maximum must be executed sequentially, until the
loop exit has been reached.
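The tube prediction for a loop bound can be illustrated with a one-iterator sketch. It assumes the trip counts of an inner loop were recorded for outer iterations 0..n-1 (with n of at least 2); `struct bound_tube`, `fit_bound_tube`, `predicted_min` and `predicted_max` are hypothetical names, not part of Apollo.

```c
#include <stddef.h>

/* Regression line with integer coefficients, plus the extreme observed
 * deviations that define the two parallel bounding lines. */
struct bound_tube {
    long slope, intercept;
    long dev_min, dev_max;
};

static long round_nearest(double x)
{
    return (long)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Fits trip[vi] ~ slope * vi + intercept by least squares (n >= 2),
 * rounds the coefficients, then records the extreme deviations. */
struct bound_tube fit_bound_tube(const long *trip, size_t n)
{
    double mx = 0.0, my = 0.0;
    for (size_t vi = 0; vi < n; ++vi) {
        mx += (double)vi;
        my += (double)trip[vi];
    }
    mx /= (double)n;
    my /= (double)n;

    double sxy = 0.0, sxx = 0.0;
    for (size_t vi = 0; vi < n; ++vi) {
        double dx = (double)vi - mx;
        sxy += dx * ((double)trip[vi] - my);
        sxx += dx * dx;
    }
    double slope = sxy / sxx;

    struct bound_tube t;
    t.slope = round_nearest(slope);
    t.intercept = round_nearest(my - slope * mx);
    t.dev_min = t.dev_max = trip[0] - t.intercept;
    for (size_t vi = 1; vi < n; ++vi) {
        long d = trip[vi] - (t.slope * (long)vi + t.intercept);
        if (d < t.dev_min) t.dev_min = d;
        if (d > t.dev_max) t.dev_max = d;
    }
    return t;
}

/* Iterations below this count may be transformed aggressively. */
long predicted_min(const struct bound_tube *t, long vi)
{
    return t->slope * vi + t->intercept + t->dev_min;
}

/* Iterations between predicted_min and this count run sequentially. */
long predicted_max(const struct bound_tube *t, long vi)
{
    return t->slope * vi + t->intercept + t->dev_max;
}
```

With observed trip counts 10, 12, 15, 16, 19, the rounded line is 2·vi + 10 and the deviations span 0 to 1, so the next outer iteration is predicted to run between 20 and 21 inner iterations.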
Figure 4: Different modelings for the loop bounds.
Basic scalars.
There are three possible modelings, (i) linear, (ii) semi-linear
and (iii) reduction. Again, the linear case resembles
the memory access linear case. A semi-linear scalar is characterized
by a constant increment at each iteration of its
parent loop, although the initial value of the scalar at the
beginning of the loop cannot be predicted. The memory
locations used for computing the initial value of the basic
scalar must be predicted to remain unmodified during the
execution of the loop. Any other behaviors are considered as
being reductions. Unfortunately, the underlying polyhedral
tools used in Apollo are not able to handle reductions at
all, preventing the generation of optimized code when they
occur.
Once the prediction model is obtained, Apollo is ready
to select a polyhedral optimizing and parallelizing transformation,
and to generate binary executable code from it.
These tasks are detailed in the next Section.
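Returning to basic scalars, the following sketch illustrates why predicting a linear scalar removes the dependency it carries. The functions `fill_original` and `fill_predicted` are hypothetical; the second one runs the iterations in reverse order only to show that they have become independent.

```c
/* Original loop: idx is carried from one iteration to the next,
 * which forces the iterations to run in order. */
void fill_original(double *out, long start, long step, long n)
{
    long idx = start;
    for (long vi = 0; vi < n; ++vi) {
        out[idx] = (double)vi;
        idx += step;
    }
}

/* With idx predicted as the affine function start + step * vi, each
 * iteration recomputes its own value and no longer depends on the
 * previous one, so any execution order gives the same result. */
void fill_predicted(double *out, long start, long step, long n)
{
    for (long vi = n - 1; vi >= 0; --vi) {
        long idx = start + step * vi;
        out[idx] = (double)vi;
    }
}
```

In a speculative setting, the predicted value would additionally be checked against the value computed the original way.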
A key component of Apollo is its optimized code gen-
eration mechanism. This mechanism intervenes both at
compile-time and at runtime.
At compile-time, some building blocks, common to ev-
ery transformation that may be selected at runtime, are ex-
tracted from the original source code. We call them Code-
Bones [5]. These are embedded in the binary executable to
be used later by the runtime system. At runtime, Apollo’s
code generation mechanism is in charge of: encoding the
Code-Bones and the prediction model in a polyhedral repre-
sentation; passing it to Pluto and CLooG [2, 8] to obtain an
optimizing and parallelizing polyhedral transformation and
its associated scan of iteration domains; and finally gener-
ating optimized binary code. In this Section, we provide an
overview of this code generation mechanism.
Any speculatively optimized code is generally composed of
two types of operations: (1) operations extracted from the
original target code, whose schedule and parameters have
been modified for optimization purposes; and (2) operations
devoted to the verification of the speculation, whose role
is to ensure semantic correctness and to launch a recovery
process in case of wrong speculation. From the control-flow
graph (CFG) of the target loop nest, we extract the different
Code-Bones that reflect both roles:
(1) Each memory write instruction in the original code
yields an associated code bone, that includes all instructions
belonging to the backward static slice of the
memory write instruction. In other words, these are
all the instructions required to execute an instance of
the memory write. Notice that memory read instructions
are also included in Code-Bones, since the role
of any read instruction is related to the accomplishment
of at least one write instruction. This first set of
Code-Bones is called computation bones.
(2) For each memory instruction (read/write) of the computation
bones, an associated verification bone is created.
Additionally, verification bones for the scalars
(one for the initial value and one for the increment),
and for the loop bounds, are also created. These bones
contain instructions devoted to verifying the validity of
the prediction model.
All the generated bones are embedded in the executable in
their LLVM intermediate representation form.
Figure 5: Code-Bone computing a store instruction,
for the code of Listing 1, after runtime patching.
Figure 6: Code-Bone verifying the prediction of a
store instruction, for the code of Listing 1, after runtime
patching.
Recall the code in Listing 1. In Figures 5 and 6, we de-
pict the computation bone associated to the unique store
instruction of the code and the verification bone in charge
of verifying this access. These have been simplified for peda-
gogical purposes. We depict a single verification bone among
many others, that are dedicated to the verification of loop
bounds, scalars and the rest of the memory accesses.
The first four instructions in the computation bone com-
pute the memory addresses that will be accessed, by using
the predicting linear functions. The linear function’s coeffi-
cients are computed and appended in the IR by the runtime
system, from the interpolation of the collected addresses,
when profiling a small slice of the target loop nest. Then, the
value to be written to memory is calculated and a store in-
struction is finally executed. Similarly, the verification bone
calculates memory addresses and basic scalar values using
the predicting linear functions. Then, the original pointer is
computed. If both values are different, then misprediction is
detected. The code bone returns the misprediction status.
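The two kinds of bones described above can be approximated in C as follows. This is a loose sketch, not the LLVM IR that Apollo embeds: `struct affine`, `predict`, `computation_bone` and `verification_bone` are hypothetical, and only two virtual iterators are used to keep it short.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicted affine function: base + a*vi + b*vj (bytes).
 * In the paper, the coefficients are patched in by the runtime. */
struct affine {
    uintptr_t base;
    long a;
    long b;
};

static uintptr_t predict(const struct affine *f, long vi, long vj)
{
    return f->base + (uintptr_t)(f->a * vi + f->b * vj);
}

/* Computation bone for the store of Listing 1:
 * result[..] += Pelem->Real * right[..], where every address comes
 * from a predicting function instead of the original pointer chase. */
void computation_bone(const struct affine *res, const struct affine *real,
                      const struct affine *right, long vi, long vj)
{
    double *dst = (double *)predict(res, vi, vj);
    const double *lhs = (const double *)predict(real, vi, vj);
    const double *rhs = (const double *)predict(right, vi, vj);
    *dst += *lhs * *rhs;
}

/* Verification bone: compares the predicted address with the address
 * computed the original way; returns true on misprediction. */
bool verification_bone(const struct affine *f, long vi, long vj,
                       const void *actual)
{
    return predict(f, vi, vj) != (uintptr_t)actual;
}
```

A caller would run the verification bones for an iteration and trigger a rollback as soon as one of them returns true.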
At runtime, once the prediction model has been obtained,
the bones embedded in the executable are loaded and parsed
to identify the memory access instructions, scalars and verifi-
cation instructions, which characterize them. For now, other
internal computations can be completely ignored, since they
do not affect the polyhedral representation.
Using the available bones, a loop structure that mimics
the original nest is created, complemented with dynamic in-
formation obtained thanks to the prediction model. The
reconstructed nest is verified against the predicted depen-
dencies in the original nest to ensure their equivalence. The
final result is a loop nest, with well-defined statements, with
memory accesses and loop bounds described by linear equal-
ities and inequalities. This code can now be encoded into a
polyhedral representation.
Then Apollo invokes Pluto² to determine an optimizing
and parallelizing transformation. The transformed representation
is passed to CLooG to compute scanning loops.
A dedicated compiler pass transforms CLooG’s output to
LLVM-IR. Then, this IR is optimized and passed to the
LLVM Just-In-Time (JIT) compiler for generating binary
code, which is then launched in the next chunk. This binary
code is reused in the successive optimized chunks that are
launched, until a misprediction occurs, which invalidates the
previous prediction model, or until the completion of the
loop nest.
Instructions devoted to the verification of the predictions
are something unique to Apollo. These instructions exhibit
some properties that can be exploited to achieve better per-
formance. The optimizations detailed below exploit some of
these properties.
The first optimization consists of moving all the verification
bones that do not participate in dependencies into
a separate loop nest, to be executed before the rest of the
code. This verification loop nest is encoded in its own polyhedral
representation and optimized through Pluto, CLooG
and LLVM JIT separately from the computation loop nest.
This enables an inspector-executor way of launching optimized
chunks, thanks to an early detection of any misprediction.
In some cases, this can completely eliminate the
need for memory backups and rollbacks.
Other optimizations can reduce the algorithmic time-com-
plexity of the verification code. Consider a verification bone
that does not participate in any dependency, and where all
the coefficients of the predicting linear functions at a given
loop level are equal to zero. For this loop level, the input of
the code bone remains the same, since all the predictions and
original address computations are not affected by changes of
the corresponding virtual iterator. Hence, verifying a single
iteration for this loop level is enough. We depict this opti-
mization with the example in Figure 7. This bone verifies an
access to an array that is performed through an indirection.
The iterator vj does not participate in any computation of
this bone, since the coefficients multiplying it are equal to
zero. Hence we can safely remove this loop.
Using polyhedral tools at runtime raises several challenges.
In this section, we describe these challenges and
how Apollo handles them.
² Pluto is used as a shared library.
    for (vi = 0; vi < N; ++vi)
        for (vj = 0; vj < N; ++vj)
            for (vk = 0; vk < N; ++vk)
                if (&(A[vi] + vk) != 400*vi + 0*vj + 8*vk)
                    rollback()

    for (vi = 0; vi < N; ++vi)
        for (vk = 0; vk < N; ++vk)
            if (&(A[vi] + vk) != 400*vi + 8*vk)
                rollback()
Figure 7: Verification code: before (Top) and after
(Bottom) optimization.
5.1 Apollo’s internal solutions
Time overhead.
The motivation behind Code-Bones is to provide a good
trade-off between the set of supported transformations and
the time taken by the invoked polyhedral tools to work with
such blocks. Instead of Code-Bones, we could have considered
basic block nodes of the control-flow graph as polyhedral
statements. This is the approach adopted by Polly [10].
However, such regions are too coarse and would restrict the
set of supported transformations. For example, it would be
impossible to schedule differently instructions that belong
to the same basic block of the original code.
A further approach providing even finer schedules would
be to directly consider individual memory instructions of
the LLVM IR as polyhedral statements. Unfortunately, due
to exponential complexity in the number of statements, the
polyhedral tools would be too slow and no longer suitable for
the runtime usage of Apollo.
Quality of the optimizations vs. time overhead.
Pluto exposes multiple options that must be tuned to
obtain the best optimizing transformation. No single
setup of options always outperforms the others.
Furthermore, the best set may depend on the target
code or the hardware. However, numerous experiments
led us to define a set of options yielding well-performing
optimized code in most cases. Intra-tile optimization
(--intratileopt) is always activated since it enables Pluto to
do loop interchanges to improve data locality. We always
enable parallelization (--parallel), unless there is a single
processor core. Loop unrolling (--unroll) is enabled with a
factor of 2; larger factors did not yield any significant
performance improvements, while also greatly increasing the code
size, harming the LLVM JIT performance. Maximum fission
is always set (--nofuse); this configuration provides the
best performance results and keeps CLooG’s and the LLVM
JIT compilation times low. Additionally, by relying on a
simple heuristic, Apollo automatically decides when tiling
should be beneficial. However, we never found level 2 tiling
(--l2tile) to be profitable, and it greatly increases CLooG’s
execution time. For the rest of the options, we kept the
default behavior. Notice that, in a dynamic context, it is not
mandatory to obtain the best performing optimized code.
One must consider a trade-off between the time taken to
obtain optimized code and its execution performance, since
global performance no longer depends solely on the target
code itself, but also on the time overhead of the runtime
system.
CLooG embeds an option to optimize the control of the
generated code, at the price of increasing the code size. We
are compelled to deactivate this option since it greatly in-
creases CLooG’s total execution time. Additionally, such
larger code size would also increase the time taken by the
LLVM JIT compiler to generate binary code, the last step
in our code generation pipeline.
Integer overflows.
The inequalities predicting the memory accesses, that are
obtained from interpolation or regression, cannot generally
be used as they are, to obtain good optimizing transforma-
tions. Since they reference memory as a one-dimensional
array by addressing bytes, polyhedral tools often generate
integer overflows, thus crashing the user’s application. This
happens due to some large values taken by the coefficients
participating in these inequalities. To overcome this issue,
multiple analyses are performed in Apollo to recover high
level information about memory accesses. This not only
helps to prevent integer overflows, but also improves the
quality of the selected transformation, especially if a skewing
transformation is involved. For this purpose, several steps are
performed: (i) aliasing groups are identified, each associated
to its own array; (ii) for each array, it is sometimes possible
to recover the dimensions and access functions of a multi-
dimensional array. If successful, the arithmetic complex-
ity of the computations related to dependency analysis and
transformation selection is significantly lowered. Our imple-
mentation, which has to be fast, is derived from Maslov’s
delinearization technique [16]. Since all the values of the
coefficients in the linear access functions are known at run-
time, this task is greatly simplified: some coefficients may
explicitly exhibit some dimension sizes. Finally, notice that
implementations of polyhedral tools that use the GNU Mul-
tiple Precision Arithmetic Library (GMP) are not suitable
for a runtime usage, due to excessive time-overhead.
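As an illustration of how known coefficients can expose dimension sizes, consider a byte-address function such as 400·vi + 8·vk with 8-byte elements: it is consistent with an access A[vi][vk] to an array whose rows hold 50 elements. The sketch below only performs this very simplified two-dimensional check; it is not Maslov's delinearization technique, and `struct access2d` and `delinearize_2d` are hypothetical names.

```c
#include <stdbool.h>

/* Recovered two-dimensional view of a linearized byte access. */
struct access2d {
    long row_len;    /* number of elements per row */
    long elem_size;  /* size of one element in bytes */
};

/* Given the byte coefficients of the outer and inner iterators, tries
 * to interpret the access as A[outer][inner]. Succeeds only when the
 * inner coefficient equals the element size and the outer coefficient
 * is a positive multiple of it. */
bool delinearize_2d(long c_outer, long c_inner, long elem_size,
                    struct access2d *out)
{
    if (elem_size <= 0 || c_inner != elem_size)
        return false;
    if (c_outer <= 0 || c_outer % elem_size != 0)
        return false;
    out->row_len = c_outer / elem_size;
    out->elem_size = elem_size;
    return true;
}
```

Working with the subscripts (vi, vk) and a row length of 50 instead of raw byte offsets keeps the values handled by the polyhedral tools small.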
5.2 Dynamic polyhedral kernels
Apollo significantly extends the scope of polyhedral
techniques. First, polyhedral approaches are no longer limited
to codes having a convenient syntactic structure that
explicitly exhibits affine loop bounds and array references.
Now, any kind of loop whose runtime behavior is polyhedral is
eligible for polyhedral optimizations. Second, even nonlinear
loops can be handled thanks to smart runtime modelings of the
memory and iterative behaviors. Apollo can apply polyhedral
transformations to nonlinear loops by representing series
of nonlinear values as tubes, defined by affine inequalities.
However, to go further in spreading the use of polyhedral
techniques to general programs, polyhedral kernels³ that
are better adapted to dynamic usage are required. This
opens interesting perspectives for many new research
developments. Despite the challenges faced, it has been shown
that Apollo is able to benefit from the entire polyhedral
model stack at runtime. However, some of the threats are
only mitigated, not completely solved.
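The notion of a tube can be illustrated with a small sketch: a regression line is fitted through profiled values, and two parallel affine bounds enclose every observed sample. This is a simplified single-iterator version written for illustration only; the function names fit_tube and inside_tube are hypothetical, and the construction actually used in Apollo (see [23]) may differ.

```python
def fit_tube(samples):
    """Fit value ~ a*i + b by least squares and bound the residuals.

    samples: list of (i, value) pairs observed during profiling.
    Returns (a, b, lo, hi) such that every sample satisfies the two affine
    inequalities  a*i + b + lo <= value <= a*i + b + hi.
    """
    n = len(samples)
    mean_i = sum(i for i, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_i = sum((i - mean_i) ** 2 for i, _ in samples)
    cov = sum((i - mean_i) * (v - mean_v) for i, v in samples)
    a = cov / var_i if var_i else 0.0
    b = mean_v - a * mean_i
    residuals = [v - (a * i + b) for i, v in samples]
    return a, b, min(residuals), max(residuals)


def inside_tube(tube, i, value, eps=1e-9):
    """Check whether a newly observed value stays within the predicted tube."""
    a, b, lo, hi = tube
    return a * i + b + lo - eps <= value <= a * i + b + hi + eps


# A nonlinear series that nevertheless stays close to a line.
samples = [(i, 4 * i + (i * i) % 3) for i in range(20)]
tube = fit_tube(samples)
```

A value falling outside the tube during optimized execution would correspond to a misprediction; how such cases are detected and handled is outside the scope of this sketch.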
³ We call “polyhedral kernels” fundamental tools performing
code analyses and transformations, using mathematical
operations on polyhedra, like schedulers and code generators.
When using Pluto, some parameters cannot be set through
the library interface: the tile sizes cannot be set to a size
different from the default, and it is impossible to add
arbitrary constraints to the transformation. Even worse, it is
impossible to describe a tube or range of memory accesses,
although it is possible with tools like Candl [6]. To overcome
this latter issue, we did the following: first, the OpenScop
representation is passed to Candl to perform the dependence
analysis. Then, in the OpenScop representation, all tube and
range accesses are replaced by accesses using their regression
hyperplanes, and the computed dependencies are attached
to the OpenScop representation. We modified Pluto to use
the attached dependencies and to ignore the access functions
encoded in the OpenScop representation. In this way,
the transformation is finally selected by Pluto based on the
dependencies computed by Candl, and using the equations
describing nonlinear memory accesses.
More generally for our dynamic context, polyhedral kernels
that propose sub-optimal solutions, but that guarantee
a smaller time overhead, could be advantageous. Current
kernels, such as Pluto, seek the best solution with respect to
their heuristics, although a more straightforward solution,
determined in a short time, would already provide interesting
speed-ups. A polyhedral scheduler working incrementally
could be a good approach: a first straightforward solution
could be produced and launched, and, while it is running, a
better solution could be computed in a separate thread,
which would in turn be launched as soon as it has been fully
determined, and so on. When several solutions cannot be
ranked by their predicted performance, they could
be evaluated dynamically by executing them in successive
chunks, to finally select the best-performing one.
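The incremental approach suggested above can be sketched as follows. This is a hypothetical illustration, not an existing Apollo or Pluto feature: run_incrementally, the two toy schedules, and slow_scheduler are names invented for the example, and a "schedule" is reduced to a function that executes a chunk of independent iterations in some order.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_incrementally(n_iters, chunk, quick_schedule, better_scheduler):
    """Run iterations in chunks, switching schedules as soon as a better one exists.

    quick_schedule   : schedule available immediately, used for the first chunks
    better_scheduler : callable computing an improved schedule in the background
    Returns the results in original iteration order and the schedule used per chunk.
    """
    current = quick_schedule
    results, used = [], []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(better_scheduler)
        for start in range(0, n_iters, chunk):
            # Switch only at chunk boundaries, once the new schedule is complete.
            if pending is not None and pending.done():
                current = pending.result()
                pending = None
            stop = min(start + chunk, n_iters)
            results.extend(current(start, stop))
            used.append(current.__name__)
    return results, used


def body(i):
    return i * i


def original_order(start, stop):
    return [body(i) for i in range(start, stop)]


def reversed_order(start, stop):
    # A different execution order, legal here because iterations are independent.
    out = [body(i) for i in reversed(range(start, stop))]
    return out[::-1]


def slow_scheduler():
    time.sleep(0.01)  # stands in for an expensive scheduling computation
    return reversed_order


results, used = run_incrementally(100, 10, original_order, slow_scheduler)
```

Switching only at chunk boundaries keeps the hand-over simple. Whether the improved schedule arrives early enough to pay off depends on the scheduling time relative to the loop's duration.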
More generally, polyhedral kernels need not remain
exclusively static. Their heuristics may be assisted and
strengthened by runtime analysis. For example, it has been
shown in some works that it is difficult to predict the
effectiveness of one or another code transformation. In some
cases, the resulting control complexity of the new loops may
hamper the benefits of the transformation. Runtime analysis
would provide the performance actually achieved.
Current polyhedral schedulers consider statements as the
smallest entity to be scheduled, as they usually apply to
source code. However, data dependencies are related to
memory references, which occur in elementary memory
instructions of intermediate code representations such as the
LLVM IR. Being able to schedule such instructions would yield
more efficient solutions. A typical example of a good
candidate is a stencil computation, where one unique statement
embeds many inter-dependent memory references. Nevertheless,
since schedulers’ complexity is exponential in the
number of statements, the scheduling granularity could be
different and adjusted according to the memory and
computing costs of the statements. More generally, polyhedral
kernels should operate on compilers’ intermediate forms, and
no longer exclusively on source code. This has been initiated
by Polly. But polyhedral kernels should simultaneously
operate at runtime, and no longer exclusively at compile-time.
This has been initiated by Apollo.
Regarding polyhedral code generators, it may be useless
to address some code optimizations that are handled anyway
by lower-level JIT compilers, such as loop-invariant
code motion. The focus should be put on what is exclusively
provided by polyhedral concepts. The rest should be
delegated to general underlying optimizing mechanisms. Such
an approach may lower the time overhead of polyhedral code
generation.
Finally, the polyhedral model can be viewed as the most
accurate and efficient model of program analysis and
optimization. Thus, one of its important goals is to extend its
scope to general-purpose programs, in order to be used in
modern applications. By being usable at runtime, possibly
supported by speculative techniques or new behavior
modelings, it is likely to provide very good answers to the
multi-core and processor heterogeneity challenges.
Apollo is proof that polyhedral techniques are effective at
runtime on more general loops than traditional Fortran-like
loops. The target loops may be while-loops with memory
references through pointers and indirections, exhibiting a
linear or nonlinear behavior at runtime. A few weeks ago,
the first release of this framework was made available.
We invite you to contribute to further developments
related to runtime polyhedral techniques, to make the
polyhedral model’s benefits available to every programmer and
[1] APOLLO: Automatic POLyhedral speculative Loop
[2] C. Bastoul. Code generation in the polyhedral model
is easier than you think. In PACT’13 IEEE
International Conference on Parallel Architecture and
Compilation Techniques, pages 7–16, Juan-les-Pins,
France, September 2004.
[3] U. Bondhugula, A. Hartono, J. Ramanujam, and
P. Sadayappan. A practical automatic polyhedral
parallelizer and locality optimizer. PLDI, 2008.
[4] U. K. R. Bondhugula. Effective automatic
parallelization and locality optimization using the
polyhedral model. PhD thesis, Ohio State University,
[5] J. M. M. Caamaño, W. Wolff, and P. Clauss. Code
bones: Fast and flexible code generation for dynamic
and speculative polyhedral optimization. In Euro-Par
2016 Parallel Processing - 22nd International
Conference, Grenoble, France. Proceedings. (Best
paper mention), 2016.
[6] Candl: Data dependence analysis tool in the
polyhedral model.˜bastoul/development/candl.
[7] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer,
S. H. Lee, and K. Skadron. Rodinia: A benchmark
suite for heterogeneous computing. In Workload
Characterization, 2009. IISWC 2009. IEEE
International Symposium on, pages 44–54, Oct 2009.
[8] Cloog: the chunky loop generator.
[9] P. Feautrier. Some efficient solutions to the affine
scheduling problem. part ii. multidimensional time.
International Journal of Parallel Programming,
21(6):389–420, 1992.
[10] T. Grosser, A. Größlinger, and C. Lengauer. Polly -
performing polyhedral optimizations on a low-level
intermediate representation. Parallel Processing
Letters, 22(4), 2012.
[11] A. Jimborean. Adapting the polytope model for
dynamic and speculative parallelization. PhD thesis,
Université de Strasbourg, Sept. 2012.
[12] A. Jimborean, P. Clauss, J. Dollinger, V. Loechner,
and J. M. M. Caamaño. Dynamic and speculative
polyhedral parallelization using compiler-generated
skeletons. International Journal of Parallel
Programming, 42(4):529–545, 2014.
[13] A. Jimborean, L. Mastrangelo, V. Loechner, and
P. Clauss. VMAD: An Advanced Dynamic Program
Analysis and Instrumentation Framework, pages
220–239. Springer Berlin Heidelberg, Berlin,
Heidelberg, 2012.
[14] LLVM compiler infrastructure.
[15] J. M. Martinez Caamaño. Fast and Flexible
Compilation Techniques for Effective Speculative
Polyhedral Parallelization. PhD thesis, Université de
Strasbourg, Sept. 2016.
[16] V. Maslov. Delinearization: An efficient way to break
multiloop dependence equations. In Proceedings of the
ACM SIGPLAN 1992 Conference on Programming
Language Design and Implementation, PLDI ’92,
pages 152–161, New York, NY, USA, 1992. ACM.
[17] SciMark benchmark suite.
[18] J. Steffan and T. Mowry. The Potential for Using
Thread-Level Data Speculation to Facilitate
Automatic Parallelization. In Proceedings of the 4th
International Symposium on High-Performance
Computer Architecture, HPCA ’98, Washington, DC,
USA, 1998. IEEE Computer Society.
[19] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid,
L. Chang, G. Liu, and W.-M. W. Hwu. Parboil: A
revised benchmark suite for scientific and commercial
throughput computing. Technical Report
IMPACT-12-01, University of Illinois at
Urbana-Champaign, Urbana, Mar. 2012.
[20] A. Sukumaran-Rajam. Beyond the Realm of the
Polyhedral Model: Combining Speculative Program
Parallelization with Polyhedral Compilation. PhD
thesis, Université de Strasbourg, Nov. 2015.
[21] A. Sukumaran-Rajam, J. M. M. Caamaño, W. Wolff,
A. Jimborean, and P. Clauss. Speculative program
parallelization with scalable and decentralized runtime
verification. In Runtime Verification - 5th
International Conference, RV 2014, Toronto, ON,
Canada, September 22-25, 2014. Proceedings, pages
124–139, 2014.
[22] A. Sukumaran-Rajam, L. E. Campostrini, J. M. M.
Caamaño, and P. Clauss. Speculative runtime
parallelization of loop nests: Towards greater scope
and efficiency. In 2015 IEEE International Parallel
and Distributed Processing Symposium Workshop,
IPDPS 2015, Hyderabad, India, May 25-29, 2015,
pages 245–254, 2015.
[23] A. Sukumaran-Rajam and P. Clauss. The polyhedral
model of nonlinear loops. ACM Trans. Archit. Code
Optim., 12(4):48:1–48:27, Dec. 2015.
[24] H. L. A. van der Spek, E. M. Bakker, and H. A. G.
Wijshoff. SPARK00: A benchmark package for the
compiler evaluation of irregular/sparse codes. CoRR,
abs/0805.3897, 2008.
... The PLuTo engine is used in other compilers, for example, in Apollo [6], PPCG [7], PTile [8], and Autogen framework [9] as well as in commercial IBM-XL and R-STREAM from the Reservoir Lab [10]. ...
... The key limitation of non-speculative compilers is that ensuring that two tasks are independent is often impossible at compiling time [9,10]. On the other hand, speculative parallelism [11][12][13] is a compiling technique that increases memory usage and keeps more processors busy at runtime but suffers from costly speculation and makes parallelism unprofitable in practice for applications with frequent conflicts. For example, Ref. [9] proposed a compiler called Pluto, which often has to stop without any explanation to the programmer. ...
Full-text available
Due to the design of computer systems in the multi‐core and/or multi‐processor form, it is possible to use the maximum capacity of processors to run an application with the least time consumed through parallelisation. This is the responsibility of parallel compilers, which perform parallelisation in several steps by distributing iterations between different processors and executing them simultaneously to achieve lower runtime. The present paper focuses on the uniformisation of three‐level perfect nested loops as an important step in parallelisation and proposes a method called Towards Three‐Level Loop Parallelisation (TLP) that uses a combination of a Frog Leaping Algorithm and Fuzzy to achieve optimal results because in recent years, many algorithms have worked on volumetric data, that is, three‐dimensional spaces. Results of the implementation of the TLP algorithm in comparison with existing methods lead to a wide variety of optimal results at desired times, with minimum cone size resulting from the vectors. Besides, the maximum number of input dependence vectors is decomposed by this algorithm. These results can accelerate the process of generating parallel codes and facilitate their development for High‐Performance Computing purposes.
... The use of Polyhedral compilers [11], [12] is to parallelize loops that have regular accesses into arrays and similar structures but most programs are irregular, and for such programs, compilers have limited visibility when invoking codes, a fact that impedes nonspeculative parallelization. On the other hand, speculative parallelism [13]- [15] is a technique that increases memory usage and keeps more processors busy at runtime but suffers from costly speculation and makes parallelism unprofitable in practice for applications with frequent conflicts. For example, [6] proposed a compiler called t4. ...
Full-text available
The growth of software techniques for implementing applications must go hand in hand with the growth of computer system hardware in the design of multi-core and multi-processor systems; otherwise, we cannot expect to be able to use maximum hardware capacities. One of the most important and challenging techniques for running applications is to run them in parallel with a focus on loop parallelism to reduce execution time. On the other hand, in recent years, many algorithms have been working on volumetric data, i.e., three-dimensional spaces; therefore, parallelization must be possible for all types of two-dimensional and three-dimensional loops. Uniformization is an important part of loop parallelism, and also the present paper’s focus. The proposed algorithm in the present paper performed uniformization with a combination of the frog leaping algorithm and the fuzzy system for two- and three-dimensional loops on a wide range of input dependence vectors and achieved a considerable variety of results in the desired time. The results of this study can be used to facilitate the development of parallel codes.
... There are several efforts in the literature that facilitate speculative execution utilizing higher-level constructs. Among the many we list a few pertinent to this study such as the use of transactional memory at the software level [36], compiler-assisted methods [1,11,37] and libraries such as Galois [31], ParlayLib [6] and SPETABARU [10]. ...
Full-text available
Handling the ever-increasing complexity of mesh generation codes along with the intricacies of newer hardware often results in codes that are both difficult to comprehend and maintain. Different facets of codes such as thread management and load balancing are often intertwined, resulting in efficient but highly complex software. In this work, we present a framework which aids in establishing a core principle, deemed separation of concerns, where functionality is separated from performance aspects of various mesh operations. In particular, thread management and scheduling decisions are elevated into a generic and reusable tasking framework. The results indicate that our approach can successfully abstract the load balancing aspects of two case studies, while providing access to a plethora of different execution back-ends. One would expect, this new flexibility to lead to some additional cost. However, for the configurations studied in this work, we observed up to \(13\%\) speedup for some meshing operations and up to \(5.8\%\) speedup over the entire application runtime compared to hand-optimized code. Moreover, we show that by using different task creation strategies, the overhead compared to straight-forward task execution models can be improved dramatically by as much as \(1200\%\) without compromises in portability and functionality.
... SeaHorn [33] provides new abstractions for developing new verification techniques. Polly [8, b le n d e r _ r d e e p s je n g _ r im a g ic k _ r lb m _ r le e la _ r m c f _ r n a b _ r n a m d _ r o m n e t p p _ r p a r e s t _ r p e r lb e n c h _ r x 2 6 4 _ r x a la n c b m k _ r 32], PLUTO [21], HALIDE [5,45], Tiramisu [14,19], and APOLLO [22] provide abstractions to suit polyhedral optimizations, which target loops characterized by regular control and data flows. TensorFlow [12], a widely used machine learning framework, uses high-level graph representations that allow graph optimizations more discoverable [13]. ...
Full-text available
Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, they should support custom tools designed to meet these demands with higher-level analysis-powered abstractions of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE is modular and demand-driven, making it easy-to-extend and adaptable to custom-tool-specific needs without unduly wasting compile time and memory. This paper shows the power of NOELLE by presenting a diverse set of ten custom tools built upon it, with a 33.2% to 99.2% reduction in code size (LoC) compared to their counterparts without NOELLE.
Full-text available
Процесс распараллеливания программ может быть затруднён ввиду их оптимизации под последовательное выполнение. Из-за этого полученная параллельная версия может быть неэффективной, а в некоторых случаях распараллеливание оказывается невозможным. Решить указанные проблемы помогают преобразования исходного кода программ. В данной статье рассматривается реализации в системе автоматизированного распараллеливания SAPFOR (System FOR Automated Parallelization) преобразований последовательных Фортран-программ, позволяющих облегчить работу пользователя в системе и существенно снизить трудоемкость распараллеливания программ. Применение реализованных преобразований в системе SAPFOR продемонстрировано на прикладной программе, решающей систему нелинейных дифференциальных уравнений в частных производных. Также было произведено сравнение производительности полученной параллельной версией с версиями, распараллелеными вручную с использованием DVM и MPI технологий. The process of parallelizing programs can be difficult due to their optimization for sequential execution. Because of this, the resulting parallel version may be inefficient, and in some cases parallelization is not possible. Transformations of the source code of programs help to solve these problems. This article discusses the implementation of transformations of sequential Fortran programs in the SAPFOR (System FOR Automated Parallelization) system, which make it possible to facilitate the user's work in the system and significantly reduce the complexity of program parallelization. The application of the implemented transformations in the SAPFOR system is demonstrated on a program that solves a system of non-linear partial differential equations. The performance of the obtained parallel version was also compared with the versions parallelized manually using DVM and MPI technologies.
Full-text available
This dissertation presents symbolic loop compilation, the first full-fledged approach to symbolically map loop nests onto tightly coupled processor arrays (TCPAs), a class of loop accelerators that consist of a grid of processing elements (PEs). It is: Full-fledged because it covers all steps of compilation, including space-time mapping, code generation, and generation of configuration data for all involved hardware components. A full-fledged compiler is paramount because manual mapping for accelerators, such as TCPAs, is difficult, tedious, and, most of all, error-prone. Symbolic because symbolic loop compilation assumes the loop bounds and number of allocated PEs to be unknown during compile time, thus allowing them to be chosen at run time.This flexibility benefits resource-aware applications where the number of PEs is known only at run time. Symbolic loop compilation is a hybrid static/dynamic approach with two phases: At compile time, all involved NP-hard problems (such as resource-constrained modulo scheduling) are solved symbolically, resulting in a so-called symbolic configuration, which is a space-efficient intermediate representation parameterized in the loop bounds and number of PEs. This phase is called symbolic mapping. Because it takes place at compile time, there is ample time to solve the involved NP-hard problems. At run time, for each requested accelerated execution of a loop program with given loop bounds and number of allocated PEs, concrete PE programs and configuration data are generated from the symbolic configuration according to these parameter values. This phase is called instantiation. In the context of these two phases, this dissertation presents the following contributions: 1. Symbolic modulo scheduling is a technique for solving resource-constrained modulo scheduling for multi-dimensional loop nests when the loop bounds and number of available PEs are unknown. 
We show that a latency-minimal solution can be found if the number of PEs is known beforehand and a near latency-minimal solution if it is not. 2. Polyhedral syntax trees are a space-efficient, parameterized representation of a set of PE program variants from which the necessary concrete PE programs are generated at run time. 3. Instantiation includes methods to generate concrete programs and configuration data from a symbolic configuration in a manner whose time complexity is not proportional to the loop bounds or number of allocated PEs. 4. Run-time enforcement for loops is a technique that utilizes the flexibility granted by symbolic loop compilation to enforce requirements on non-functional properties by dynamically adapting the mapping before execution. An example is to allocate a number of PEs that satisfies a given latency bound. In summary, the methods presented in this dissertation enable, for the first time, the full-fledged symbolic compilation of loop nests onto TCPAs. Without these methods, a given loop nest would need to be fully recompiled each time the loop bounds or number of available PEs change, which would render run-time mapping impractical and even conventional compilation overly time- and space-consuming.
The SAPFOR and DVM systems were primary designed to simplify the development of parallel programs of scientific-technical calculations. SAPFOR is a software development suite that aims to produce a parallel version of a sequential program in a semi-automatic way. Fully automatic parallelization is also possible if the program is well-formed and satisfies certain requirements. SAPFOR uses the DVMH directive-based programming model to expose parallelism in the code. The DVMH model introduces CDVMH and Fortran-DVMH (FDVMH) programming languages which extend standard C and Fortran languages by parallelism specifications. We present MPI-aware extension of the SAPFOR system that exploits opportunities provided by the new features of the DVMH model to extend existing MPI programs with intra-node parallelism. In that way, our approach reduces the cost of parallel program maintainability and allows the MPI program to utilize accelerators and multi-core processors. SAPFOR extension has been implemented for both Fortran and C programming languages. In this paper, we use the NAS Parallel Benchmarks to evaluate the performance of generated programs.
Full-text available
In this thesis, we present a Thread-Level Speculation (TLS) framework whose main feature is to speculatively parallelize a sequential loop nest in various ways, to maximize performance. We perform code transformations by applying the polyhedral model that we adapted for speculative and runtime code parallelization. For this purpose, we designed a parallel code pattern which is patched by our runtime system according to the profiling information collected on some execution samples. We show on several benchmarks that our framework yields good performance on codes which could not be handled efficiently by previously proposed TLS systems.
Conference Paper
Full-text available
Runtime loop optimization and speculative execution are becoming more and more prominent to leverage performance in the current multi-core and many-core era. However, a wider and more efficient use of such techniques is mainly hampered by the prohibitive time overhead induced by centralized data race detection, dynamic code behavior modeling and code generation. Most of the existing Thread Level Speculation (TLS) systems rely on slicing the target loops into chunks, and trying to execute the chunks in parallel with the help of a centralized performance-penalizing verification module that takes care of data races. Due to the lack of a data dependence model, these speculative systems are not capable of doing advanced transformations and, more importantly, the chances of rollback are high. The polytope model is a well known mathematical model to analyze and optimize loop nests. The current state-of-art tools limit the application of the polytope model to static control codes. Thus, none of these tools can handle codes with while loops, indirect memory accesses or pointers. Apollo (Automatic POLyhedral Loop Optimizer) is a framework that goes one step beyond, and applies the polytope model dynamically by using TLS. Apollo can predict, at runtime, whether the codes are behaving linearly or not, and applies polyhedral transformations on-the-fly. This paper presents a novel system, which extends the capability of Apollo to handle codes whose memory accesses are not necessarily linear. More generally, this approach expands the applicability of the polytope model at runtime to a wider class of codes.
Full-text available
The Parboil benchmarks are a set of throughput computing applications useful for studying the performance of throughput computing architecture and compilers. The name comes from the culinary term for a partial cooking process, which represents our belief that useful throughput computing benchmarks must be "cooked", or preselected to implement a scalable algorithm with fine-grained parallel tasks. But useful benchmarks for this field cannot be "fully cooked", because the architectures and programming models and supporting tools are evolving rapidly enough that static benchmark codes will lose relevance very quickly. We have collected benchmarks from throughput computing application researchers in many different scientific and commercial fields including image processing, biomolecular simulation, fluid dynamics, and astronomy. Each benchmark includes several implementations. Some implementations we provide as readable base implementations from which new optimization efforts can begin, and others as examples of the current state-of-the-art targeting specific CPU and GPU architectures. As we continue to optimize these benchmarks for new and existing architectures ourselves, we will also gladly accept new implementations and benchmark contributions from developers to recognize those at the frontier of performance optimization on each architecture. Finally, by including versions of varying levels of optimization of the same fundamental algorithm, the bench-marks present opportunities to demonstrate tools and architectures that help programmers get the most out of their parallel hardware. Less optimized versions are presented as challenges to the compiler and architecture research communities: to develop the technology that automatically raises the performance of simpler implementations to the performance level of sophisticated programmer-optimized implementations, or demonstrate any other performance or programmability improvements. 
We hope that these benchmarks will facilitate effective demonstrations of such technology.
Conference Paper
Full-text available
Thread Level Speculation (TLS) is a dynamic code parallelization technique proposed to keep the software in pace with the advances in hardware, in particular, to automatically parallelize programs to take advantage of the multi-core processors. Being speculative, frameworks of this type unavoidably rely on verification systems that are similar to software transactional memory, and that require voluminous inter-thread communications or centralized registering of the performed memory accesses. The high degree of communication is against the basic principles of high performance parallel computing, does not scale with an increasing number of processor cores, and yields weak performance. Moreover, TLS systems often apply one unique parallelization strategy consisting in slicing a loop into several parallel speculative threads. Such a strategy is also against the basic principles since loops in the original serial code are not necessarily parallel and also, it is well-known that the parallel schedule must promote data locality which is crucial in obtaining good performance. This situation appeals to scalable and decentralized verification systems and new strategies to dynamically generate efficient parallel code resulting from advanced optimizing parallelizing transformations. Such transformations require a more complex verification system that allows intra-thread iterations to be reordered. In this paper, we propose a verification system of this kind, based on a model built at runtime and predicting a linear memory behavior. This strategy is part of the Apollo speculative code parallelizer which is based on an adaptation for dynamic usage of the polyhedral model.
Conference Paper
Full-text available
VMAD (Virtual Machine for Advanced Dynamic analysis) is a platform for advanced profiling and analysis of programs, consisting in a static component and a runtime system. The runtime system is organized as a set of decoupled modules, dedicated to specific instrumenting or optimizing operations, dynamically loaded when required. The program binary files handled by VMAD are previously processed at compile time to include all necessary data, instrumentation instructions and callbacks to the runtime system. For this purpose, the LLVM compiler has been extended to automatically generate multiple versions of the code, each of them tailored for the targeted instrumentation or optimization strategies. The compiler chooses the most suitable intermediate representation for each version, depending on the information to be acquired and on the optimizations to be applied. The control flow graph is adapted to include the new versions and to transfer the control to and from the runtime system, which is in charge of the execution flow orchestration. The strength of our system resides in its extensibility, as one can add support for various new profiling or optimization strategies, independently of the existing modules. VMAD’s potential is illustrated by presenting several analysis and optimization applications dedicated to loop nests: instrumentation by sampling, dynamic dependence analysis, adaptive version selection.
Full-text available
We propose a framework based on an original generation and use of algorithmic skeletons, and dedicated to speculative parallelization of scientific nested loop kernels, able to apply at run-time polyhedral transformations to the target code in order to exhibit parallelism and data locality. Parallel code generation is achieved almost at no cost by using binary algorithmic skeletons that are generated at compile-time, and that embed the original code and operations devoted to instantiate a polyhedral parallelizing transformation and to verify the speculations on dependences. The skeletons are patched at run-time to generate the executable code. The run-time process includes a transformation selection guided by online profiling phases on short samples, using an instrumented version of the code. During this phase, the accessed memory addresses are used to compute on-the-fly dependence distance vectors, and are also interpolated to build a predictor of the forthcoming accesses. Interpolating functions and distance vectors are then employed for dependence analysis to select a parallelizing transformation that, if the prediction is correct, does not induce any rollback during execution. In order to ensure that the rollback time overhead stays low, the code is executed in successive slices of the outermost original loop of the nest. Each slice can be either a parallel version which instantiates a skeleton, a sequential original version, or an instrumented version. Moreover, such slicing of the execution provides the opportunity of transforming differently the code to adapt to the observed execution phases, by patching differently one of the pre-built skeletons. The framework has been implemented with extensions of the LLVM compiler and an x86-64 runtime system. Significant speed-ups are shown on a set of benchmarks that could not have been handled efficiently by a compiler.
In this thesis, we present our contributions to Apollo, an automatic parallelization compiler that combines polyhedral optimization with Thread-Level Speculation to optimize dynamic codes on-the-fly. Thanks to an online profiling phase and a speculation model of the target code's behavior, Apollo is able to select an optimization and to generate code based on it. During the execution of the optimized code, Apollo constantly verifies the validity of the speculation model. The main contribution of this thesis is a code generation mechanism, called Code-Bones, that is able to instantiate any polyhedral transformation at runtime without incurring a major time overhead. This mechanism is currently in use inside Apollo. It provides significant performance benefits when compared to other approaches.
Conference Paper
In this paper, we present a new runtime code generation technique for speculative loop optimization and parallelization. It generates, on-the-fly, the code resulting from any polyhedral optimizing transformation of loop nests, such as tiling, skewing, fission, fusion or interchange, without introducing a penalizing time overhead. The proposed strategy is based on the generation of code bones at compile time. Code bones are parametrized code snippets dedicated either to speculation management or to computations of the original target program. They are instantiated and assembled at runtime to constitute the speculatively optimized code, as soon as an optimizing polyhedral transformation has been determined. Their granularity is fine enough to apply any polyhedral transformation, while still enabling fast runtime code generation. This strategy has been implemented in the speculative loop parallelizing framework Apollo.
In this thesis, we present our contributions to Apollo (Automatic speculative POLyhedral Loop Optimizer), an automatic compiler that combines speculative parallelization with the polyhedral model to optimize code on-the-fly. By partially instrumenting the code during execution and interpolating the collected data, Apollo is able to build a speculative polyhedral model dynamically. This speculative model is then passed to Pluto, a static polyhedral scheduler. Apollo then selects one of the statically generated code optimization skeletons and instantiates it. The dynamic part of Apollo continuously monitors the execution of the code in order to detect any dependence violation in a decentralized manner. Another important contribution of this thesis is our extension of the polyhedral model to codes exhibiting nonlinear behavior. Thanks to Apollo's dynamic and speculative context, nonlinear behaviors are modeled either by linear regression hyperplanes forming tubes or by ranges of reached values. Our approach enables the application of polyhedral transformations to nonlinear codes thanks to a hybrid speculation verification system that combines centralized and decentralized verifications.
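The idea of a regression "tube" can be illustrated in one dimension. The sketch below is only an assumption-laden illustration, not Apollo's implementation: the names `fit_tube` and `in_tube` are hypothetical, a single loop index stands in for the general hyperplane case, and exact rational arithmetic is used only to keep the example deterministic. It fits a least-squares line through profiled (iteration, address) pairs and takes the largest residual as the half-width of the tube.

```python
from fractions import Fraction


def fit_tube(samples):
    """Least-squares line through (iteration, address) samples plus a half-width.

    Returns (slope, intercept, half_width), where half_width is the largest
    absolute residual, so every profiled sample lies inside the tube.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [Fraction(x) for x, _ in samples]
    ys = [Fraction(y) for _, y in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("samples must cover at least two distinct iterations")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    half_width = max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))
    return slope, intercept, half_width


def in_tube(tube, x, y):
    """Check whether address y at iteration x stays inside the predicted tube."""
    slope, intercept, half_width = tube
    return abs(y - (slope * x + intercept)) <= half_width
```

In this simplified picture, an access that leaves the tube during optimized execution would count as a misprediction, while accesses that stay inside it remain covered by the speculative model.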
Runtime code optimization and speculative execution are becoming increasingly prominent means of improving performance in the current multi- and many-core era. However, a wider and more efficient use of such techniques is mainly hampered by the prohibitive time overhead induced by centralized data race detection, dynamic code behavior modeling, and code generation. Most existing Thread-Level Speculation (TLS) systems rely on naively slicing the target loops into chunks and trying to execute the chunks in parallel, with the help of a centralized, performance-penalizing verification module that takes care of data races. Due to the lack of a data dependence model, these speculative systems are not capable of advanced transformations and, more importantly, the chances of rollback are high. The polyhedral model is a well-known mathematical model for analyzing and optimizing loop nests. Current state-of-the-art tools limit the application of the polyhedral model to static control codes. Thus, none of these tools can generally handle codes with while loops, indirect memory accesses, or pointers. Apollo (Automatic speculative POLyhedral Loop Optimizer) is a framework that goes one step further and applies the polyhedral model dynamically by using TLS. Apollo can predict, at runtime, whether the codes are behaving linearly or not, and it applies polyhedral transformations on-the-fly. This article presents a novel system that enables Apollo to handle codes whose memory accesses and loop bounds are not necessarily linear. More generally, this approach expands the applicability of the polyhedral model at runtime to a wider class of codes. Combining both linear and nonlinear accesses in the dependence prediction model enables the application of polyhedral loop optimizing transformations even to nonlinear code kernels, while also allowing low-cost speculation verification.