International Journal of Parallel Programming manuscript No.
(will be inserted by the editor)
Automatic Parallelization: Executing Sequential
Programs on a Task-Based Parallel Runtime
Alcides Fonseca · Bruno Cabral · João Rafael · Ivo Correia
Received: date / Accepted: date
Abstract There are billions of lines of sequential code inside today's
software that do not benefit from the parallelism available in modern
multicore architectures. Automatically parallelizing sequential code, to
promote an efficient use of the available parallelism, has been a research
goal for some time now.
This work proposes a new approach for achieving this goal. We created a
new parallelizing compiler that analyzes the read and write instructions and
the control-flow modifications in programs to identify a set of dependencies
between the instructions in the program. Afterwards, based on the generated
dependency graph, the compiler rewrites and organizes the program in a
task-oriented structure. Parallel tasks are composed of instructions that
cannot be executed in parallel with one another. A work-stealing-based
parallel runtime is responsible for scheduling and managing the granularity
of the generated tasks. Furthermore, a compile-time granularity control
mechanism avoids creating unnecessary data structures.
This work focuses on the Java language, but the techniques are general
enough to be applied to other programming languages.
We have evaluated our approach on 8 benchmark programs against OoOJava,
achieving higher speedups. In some cases, values were close to those of
a manual parallelization. The resulting parallel code also has the advantage
of being readable and easily configured to further improve its performance
manually.
This work was partially supported by the Portuguese Research Agency FCT, through
CISUC (R&D Unit 326/97) and the CMU|Portugal program (R&D Project Æminium
CMU-PT/SE/0038/2008).
Supported by the Portuguese National Foundation for Science and Technology (FCT)
through a Doctoral Grant (SFRH/BD/84448/2012).
Alcides Fonseca, Bruno Cabral, João Rafael, Ivo Correia
Department of Informatics Engineering
Universidade de Coimbra
Tel: +351 239 790 000
Fax: +351 239 701 266
E-mail: {amaf, bcabral}@dei.uc.pt,
E-mail: {jprafael, icorreia}@student.dei.uc.pt
arXiv:1604.03211v1 [cs.PL] 11 Apr 2016
Keywords Automatic parallelization · Task-based runtime · Symbolic analysis
1 Introduction
Nowadays, in order to achieve the best performance on multicore machines,
programmers have to write parallel programs. This is typically done using
threads, either directly or indirectly through other high-level constructs of the
language. Traditionally, manually defining threads and synchronizing them
has been the only way to achieve the best results. However, this process is
cumbersome and error-prone, often leading to problems such as deadlocks and
race conditions. Furthermore, as the code base grows, it becomes increasingly
hard to detect interference between executing threads.
Writing, debugging and tuning multi-threaded code is very time-consuming,
as there are multiple combinations of executions that make the performance
and visibility of errors non-deterministic. Furthermore, there are billions of
lines of source code inside existing software that are not able to benefit from
today's multicore architectures. Parallelizing these programs is a daunting and
extremely costly task, one that hardly anyone is eager to undertake.
The automatic parallelization of existing software has been a long-running
objective and a prominent research subject [4]. Existing research has been
mainly focused on the analysis and transformation of loops, since these have
always been perceived as the main source of potential parallelism in
sequential programs [16]. Nonetheless, other models have also been studied,
such as the parallelization of recursive methods [5], and of sub-expressions
in functional languages. Focusing only on the parallelization of loops is not
enough in most cases, and other approaches have not revealed significant
performance improvements.
In this paper we introduce a new approach for performing a fine-grained
automatic parallelization of programs. This approach is distinct from others,
since it parallelizes all the instructions that can, effectively, be executed in
parallel. To identify which instructions can be parallelized, we infer instruc-
tion signatures from the source code of the program. These signatures include
dependency and control flow information, which allows us to organize instruc-
tions into a task-oriented structure. The result is a program that exhibits the
maximum possible parallelism at the finest granularity level (e.g. one task
can equal one instruction). However, in order to achieve good performance
and decrease the overhead in run-time task management, the granularity of
tasks is coarsened during compilation and also during run-time. At run-time,
the system load influences granularity control. Furthermore, a work-stealing
scheduler is used to efficiently manage execution and control dependencies.
Title Suppressed Due to Excessive Length 3
Our approach can parallelize irregular recursive programs with a low runtime
overhead, resulting in speedups of up to 20x on a 24-thread machine, with an
average speedup of 5x. Because of dependency tracking and transformations
during compilation, we are able to avoid the harsh runtime overheads from
which existing solutions suffer. This paper contributes a hybrid methodology
for analyzing procedural source code and translating it to a parallel version
with a broad level of parallelism and granularity, which is fine-tuned during
execution by runtime granularity control mechanisms. The parallelization
approach was tested with popular benchmarks for task-based parallelism, and
compared with two other approaches.
We have applied this approach to the Java language, one of the most pop-
ular programming languages, since it has a large code base of legacy sequen-
tial software. Nonetheless, our approach can be applied to any procedural or
object-oriented language.
The framework presented includes two language front-ends (the Æminium
language compiler [30] and the JPar compiler for Java), the Æminium Runtime,
and an ÆminiumGPU [18] compiler and runtime that enable GPU execution of
data-parallel programs.
This paper is organized as follows. This section has introduced the Æminium
framework. Section 2 presents the related work in automatic parallelization.
Section 3 describes the architecture of the framework in further detail,
mainly from the dynamic perspective. Section 4 explains the parallelization
technique applied. Section 5 presents the runtime support for executing
parallel programs. Finally, Section 6 evaluates the platform on different
programs, and Section 7 draws the conclusions.
2 Related Work
Given the wide availability of multicore processors, GPUs and other accelera-
tors such as FPGAs and the Xeon Phi, research on concurrent programming
has increased in the last decade. New programming models, languages and
runtime systems have been developed to improve the expression and execution
of parallel programs. Much of this work has culminated in new languages,
such as X10[9], Fortress[29] and Chapel[7], in which most language constructs
are parallel by default (for loops, for instance). These languages also
provide constructs to explicitly inform the compiler that certain memory re-
gions are independent and, therefore, accesses to them can be executed in
parallel. Unlike these languages, which mostly target scientific computing, the
Æminium language[30] has focused on dependable systems programming. By
annotating variables with access permissions, programs could be automatically
parallelized with guarantees that the execution would not break the defined
contracts.
Another approach for writing parallel programs is semi-automatic
parallelization. In this approach, programmers annotate existing sequential
programs with enough information for the compiler to automatically parallelize
parts of the code. Cilk[19] and OpenMP[11] are the two most common examples
of this approach and work on top of the C language. Cilk focuses on
divide-and-conquer recursive algorithms, while OpenMP has focused mostly on
symmetrical parallelism in for loops. OpenMP 3.0 introduced unstructured
parallelism via the concept of Tasks[2][3]. More recently, OpenMP has
also started to generate code for GPUs[23].
The third approach is to translate unmodified sequential programs
automatically to a parallel version of themselves. Functional languages, like
LISP[20] and Haskell[25], can be easily parallelized since sub-expressions do
not interfere with each other. Imperative languages such as C and Java make
this task more difficult, since different parts of the code can access the same
memory location. In order to be able to parallelize code, some verifications
have to be made. The main focus of research has been the parallelization of
for loops. Different techniques can be used, depending on the type of loop,
such as DO-ALL, DO-ACROSS or DO-PIPE. DO-ALL parallelism does not contain any
interference between loop iterations: each iteration (or set of iterations,
called a slice) can be executed in parallel, and all of them must synchronize
at the end of the loop. DO-ACROSS has a part of the loop (usually minimal
compared to the rest) that interferes with other iterations. For these cases,
variable privatization can be used to aggregate values per thread, and another
loop is then executed sequentially at the end to aggregate the private
variables. DO-PIPE can be parallelized by using different threads for
different parts of the loop body, in which dependencies between different
threads are minimized. In order to verify whether loops can be parallelized,
the Polyhedral Model is frequently used[6]. Cetus[12] and Par4All[1] are
compilers that perform this kind of transformation, and can also target
GPGPUs. Loop parallelization has also been done during runtime[34], but
without any relevant speedups.
Automatic non-loop parallelization has been less studied, but it is still
a popular way of expressing parallelism, especially in divide-and-conquer
algorithms. This analysis has been implemented in zJava[8] by analyzing data
writes and reads at a local level. In order to allow a correct parallelization,
zJava uses a runtime registry of regions, to which threads can be assigned.
This allows threads to access shared data at the cost of a synchronization
overhead. OoOJava[21] also performs static analysis to retrieve read and write
information on annotated tasks. It then compiles the program to a speculative
C program that has runtime checks to resolve conflicts between threads.
Because of this speculation, OoOJava does not support I/O instructions.
MP-Tomasulo[33] also uses out-of-order execution to automatically parallelize
code for FPGAs, but removes write-after-read and write-after-write
dependencies by renaming parameters, keeping control flow separate. This
approach is not compatible with regular multicore processors, however.
Jrpm[10] also performs thread-level speculation at runtime, operating over
Java bytecode instead of Java source code, and reveals a worse speedup than
compile-time strategies. FJComp[28] also focuses on divide-and-conquer
algorithms, using the Fork-Join framework. However, the compiler requires the
programmer to annotate tasks and optionally define the cut-off mechanism,
making FJComp more of a translator from recursive calls to FJ-style calls
than an automatic parallelizing compiler.
3 Architecture
[Figure 1: diagram of the framework components (J2JPar compiler, ÆminiumGPU
Compiler, Java Compiler, JVM, Æminium Runtime, ÆminiumGPU Runtime) and the
source and JAR files exchanged between them.]
Fig. 1 Information Flow in the Æminium Framework
The software architecture of the Æminium framework is heavily based on
the Java stack, making use of the Java Compiler to generate bytecode and a
JVM for executing such bytecode. The framework includes two compilers: the
JPar compiler and the ÆminiumGPU compiler. Both make use of the
Spoon library[26] to operate on the Java Abstract Syntax Tree (AST). Figure 1
shows the information flow between the framework components. The ÆminiumGPU
components are optional, as not all programs can take advantage of the GPU.
The JPar Compiler is a Source-to-Source Compiler for the Java language.
It parses the original Java code into an AST, performs static analysis to infer
access permissions of each node, and generates Java code that wraps some of
the operations in calls to the Æminium Runtime.
The Æminium Runtime is a Java library that provides an API for express-
ing task-parallelism. These APIs can be targeted by compilers, or directly by
programmers. The API allows the definition of tasks and dependencies be-
tween tasks. Tasks are wrappers around a set of Java statements that can
execute asynchronously, and can have any number of instructions. Internally,
the most important components are the Scheduler, Decider and Profiler. The
Runtime includes a Work-Sharing scheduler, but defaults to a Work-Stealing
scheduler, in which the programmer can configure the stealing policy. The De-
cider is a component that determines in real-time whether a new task should
be created or if it should be inlined in the caller site. Finally, the Profiler
records information during execution, such as number of tasks created, steals
and dependencies unfulfilled.
The ÆminiumGPU Compiler is a Source-to-Source Compiler for the Java
language that identifies data-parallel operations, such as map and reduce, and
translates them to OpenCL. The ÆminiumGPU Runtime is a library that
executes data-parallel operations on the GPU. During execution, the runtime
decides whether to use the GPU based on how heavy the operation is and/or
how large the dataset is. If it does, the JavaCL binding library[13] is used to
schedule OpenCL code and copy memory between the host and the GPU.
4 Compilation
This section details the compilation process, starting from access permission
analysis to code generation. The JPar compiler is based upon the J2JPar
Compiler[27], but simplifies the access permission analysis and performs
parallelization during code generation, reducing compilation times. An
overview of the compilation phases can be seen in Figure 2.
[Figure 2: JPar compiler pipeline, from Java source to Java source: Parsing,
Signature Extraction, Method Cloning, Parallelization, Code Generation.]
Fig. 2 JPar Compiler Phases
4.1 Signature Extraction
In order to automatically parallelize the program, it is necessary to analyze
how memory is accessed to understand dependencies between parts of the
program. If two program parts access the same variable and at least one of
them writes to it, then they cannot be parallelized while guaranteeing
determinism. Thus, the first step of the compiler is to understand what each
AST node reads and modifies. Datagroups[24] are used to represent different
memory sections: if two method calls share no datagroup, they can be executed
in parallel. After this phase, each AST node will have a signature, composed
of one or more datagroup permissions. An example of signatures in code can be
seen in Listing 1. Datagroup permissions can be one of the following:
– read(dg) - the AST subtree reads the variables represented by datagroup
dg;
– write(dg) - the subtree writes to the objects in datagroup dg;
– control(dg) - the control flow of other operations in datagroup dg may
be altered. This is the case with return statements, breaks, continues, ifs
and whiles.
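As an illustration of how such signatures could be compared, the following is a minimal sketch, not the JPar implementation. The names SignatureSketch, Kind, Permission and conflicts are hypothetical, and control permissions are simply ignored here because they constrain ordering rather than memory access.

```java
import java.util.Set;

final class SignatureSketch {
    // Hypothetical permission kinds mirroring read(dg), write(dg), control(dg).
    enum Kind { READ, WRITE, CONTROL }

    // A permission pairs a kind with the name of a datagroup (assumed to be a string).
    record Permission(Kind kind, String datagroup) {}

    // Two signatures conflict when they touch the same datagroup and at least
    // one of the two accesses is a write. Read/read sharing is harmless.
    // Control permissions are skipped in this sketch.
    static boolean conflicts(Set<Permission> a, Set<Permission> b) {
        for (Permission p : a) {
            for (Permission q : b) {
                if (p.kind() == Kind.CONTROL || q.kind() == Kind.CONTROL) continue;
                boolean sameGroup = p.datagroup().equals(q.datagroup());
                boolean anyWrite = p.kind() == Kind.WRITE || q.kind() == Kind.WRITE;
                if (sameGroup && anyWrite) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Signatures loosely modeled on the Fibonacci statements of Listing 1.
        Set<Permission> callA = Set.of(new Permission(Kind.READ, "n"), new Permission(Kind.WRITE, "a"));
        Set<Permission> callB = Set.of(new Permission(Kind.READ, "n"), new Permission(Kind.WRITE, "b"));
        Set<Permission> sum = Set.of(new Permission(Kind.READ, "a"), new Permission(Kind.READ, "b"));
        System.out.println(conflicts(callA, callB)); // false: both only read n
        System.out.println(conflicts(callA, sum));   // true: a is written, then read
    }
}
```

Under these assumptions, the two recursive calls share only a read of n and could run in parallel, while the final sum must wait for both.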
In the previous approach, a full analysis of the AST was possible due to a
two-pass verification. In the first pass, invocations produced a call(dg) per-
mission, and aliasing effects produced merge(dg1, dg2) permissions. On the
second pass, these permissions were replaced with the true permissions, that
could be looked up in the rest of the program. The new approach converts
the two passes into one. Aliasing is handled in-place, using HashMaps to find
the right permission of the aliased element. When finding a method invoca-
tion, the current element being processed is saved, and the compiler processes
the method declaration first. When it is complete, it returns to the method
invocation and the method data-group is already available. The only place
in which this is not possible is in recursive (direct or indirect) calls. In this
case, the stack detects cycles in the recursion and uses the partial
permissions collected so far to fill in the full permission of the recursive
invocation. This process is now a two-step process only for recursive calls,
reducing the analysis time for all other operations.
int f(int n) {
    if (n < 2) {          // read(n), control(f)
        return n;         // write(return), control(f)
    }
    int a = f(n - 1);     // call(f), read(n), write(a)
    int b = f(n - 2);     // call(f), read(n), write(b)
    return a + b;         // read(a), read(b), control(f), write(return)
}
Listing 1 Examples of Signatures in Fibonacci Program
In this phase, whenever an operation can have different results, as is the
case with an if statement, a conservative approach is taken, using the union
of the two possible branches as the signature for that node. This approach does
not perform thread-level speculation, guaranteeing instead the same semantics
as the original program and supporting I/O and other operations that
cannot be transactional. For example, all accesses to external objects, such
as the System.out object, are placed inside a single global datagroup.
The resulting bottleneck can be removed by explicitly expressing the
signatures for those methods in a special signatures file.
4.2 Method Cloning
Executing a method in parallel may not always be worthwhile. For instance,
in the Fibonacci example, the cost of creating a new task is higher than the
cost of executing the method for a low input number. Thus, creating a task is
only useful when another thread can execute it. As such, the alternative is to
execute the original method sequentially after a certain point.
In this compiler phase, methods are cloned. The original method will be
parallelized, while the clone will serve as a backup sequential version of the
method. The decision of when to switch to the sequential method is introduced
at the beginning of the parallel version. The decision itself is a call to the
Runtime API.
For recursive calls, using anonymous inner classes proved to introduce a very
large overhead, which is not noticeable with regular parallel tasks. As such,
in this phase, for each recursive call in the code, a static class is created
to represent the asynchronous call to that method.
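The cloning idea can be sketched with the standard java.util.concurrent fork/join classes instead of the Æminium Runtime, whose API is not reproduced here. The class names and the shouldSeq() policy below are assumptions made for illustration only; the real decision is made by the runtime.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;

final class MethodCloningSketch {
    // Sequential clone: the original method body, kept untouched.
    static long fibSequential(int n) {
        return n < 2 ? n : fibSequential(n - 1) + fibSequential(n - 2);
    }

    // Stand-in for the runtime decision (hypothetical policy): fall back to the
    // sequential clone when this worker already holds surplus queued tasks.
    static boolean shouldSeq() {
        return ForkJoinTask.getSurplusQueuedTaskCount() > 2;
    }

    // Parallel version, written as a static task class rather than an
    // anonymous inner class, in the spirit of the recursive-call treatment.
    static final class FibTask extends RecursiveTask<Long> {
        private final int n;

        FibTask(int n) { this.n = n; }

        @Override
        protected Long compute() {
            if (n < 2) return (long) n;
            if (shouldSeq()) return fibSequential(n); // switch to the clone
            FibTask a = new FibTask(n - 1);
            a.fork();                                  // may be stolen by an idle worker
            long b = new FibTask(n - 2).compute();     // computed by the current worker
            return a.join() + b;
        }
    }

    static long fib(int n) {
        return ForkJoinPool.commonPool().invoke(new FibTask(n));
    }

    public static void main(String[] args) {
        System.out.println(fib(30)); // 832040
    }
}
```

The result does not depend on how often the sequential clone is chosen; only the number of tasks created changes.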
4.3 Parallelization
In this phase, JPar also deviates from its predecessor. While the previous
compiler would take two passes, one for generating dependencies between
operations and another for generating the tasks, the new version generates the
dependencies when it is creating a task. This change avoids creating
tasks that would only be used for dependency purposes.
Firstly, this approach identifies parallelism with the finest granularity
possible. However, parallelizing all possible paths is not useful, as the
scheduling overhead may be significant. As such, the compiler only creates
tasks around parts of the code considered large enough. By default, only
invocations and loops can be parallelized. In order to be parallelized,
methods have to contain loops, at least 10 instructions, calls to other
expensive methods, or recursive calls. The 10-instruction threshold is a
heuristic limit that can be configured.
The compiler performs three types of parallelization: Parallel Invocations,
DO-ALL and DO-ACROSS. All of the three types can be nested inside each
other.
4.3.1 Parallel Invocation
Every method invocation node is considered for parallelization if the target
function is considered large enough. Invocations are converted into a Future
call[31], with a set of dependency tasks. Using Futures has the advantage of
producing readable parallel Java code, which the developer can use to learn,
debug or manually fine-tune.
The invocation is replaced by a call to the get() method of the future.
The original invocation is moved to a lambda that represents the computation
(task body). That lambda is wrapped in a typed Future object that
represents the asynchronous execution of the task. An example is shown in
Listing 2, which shows a translation of the code in Listing 1, disregarding
the fact that it is a recursive method. Recursive methods have the lambda
converted into a static class, to avoid task creation overheads.
int f(int n) {
    if (RuntimeManager.shouldSeq())
        return jpar_sequential_version_of_f(n);
    if (n < 2) {
        return n;
    }
    Future<Integer> b_tmp = new Future<Integer>(task -> f(n-2));
    Future<Integer> a_tmp = new Future<Integer>(task -> f(n-1));
    int a = a_tmp.get();
    int b = b_tmp.get();
    return a + b;
}
Listing 2 The Fibonacci program translated with futures, without considering the special
case.
When the Future object is instantiated, the task is marked for execution
and an available thread may start to execute it. When the get() method
is called, the current task waits for the execution of the task to finish and
reads its result.
The main decision to make is where to introduce the Future creation, max-
imizing parallelism while keeping the same semantics of the original program.
The main requirements for the position of the Future creation are:
– Must not be declared before the declaration of all used variables;
– Must not be declared before an expression which may return inside that
method;
– Must not be declared before an expression which may change the control
flow inside that block (break, continue);
– Must not be executed before an expression that may write to a variable
accessed inside the lambda;
– Must not be executed before an expression that may read a variable that
is written inside the lambda;
– Must be before the Future get() call.
Considering these requirements, Algorithm 1 is used to find the best position
to create the future, where θ is the function that returns the access
permission set of an AST node, meth is the method in which the invocation
is found, node is the invocation being processed, stmt is the statement being
analyzed, and block is the current block being analyzed. This block starts as
the outermost scope (the method body) and moves deeper until it is the scope
block in which the invocation is defined. This order tries to maximize how soon
the invocation can start to execute. Finally, the algorithm outputs the Hard
Dependency, which is the instruction after which the Future can be safely
introduced, and the Soft Dependencies, a set of already defined tasks that the
current future will have to wait for before executing.
The invocation parallelization is completed when the Future declaration is
at the right position, the Future call is at the invocation site, and there is a
granularity control introduced in the beginning of each parallel method.
Algorithm 1 Algorithm to find the Hard Dependency and Soft Dependencies
for a Future for the current node
harddep ← None
softdeps ← ∅
for stmt ∈ block do
    if control(meth) ∈ θ(stmt) ∨ control(block) ∈ θ(stmt) then
        harddep ← stmt
        continue
    end if
    if ∃a, [read(a) ∈ θ(stmt) ∧ write(a) ∈ θ(node)]
           ∨ [write(a) ∈ θ(stmt) ∧ read(a) ∈ θ(node)]
           ∨ [write(a) ∈ θ(stmt) ∧ write(a) ∈ θ(node)] then
        if isTask(stmt) then
            softdeps ← softdeps ∪ {stmt}
        else
            harddep ← stmt
        end if
    end if
    if stmt = node then
        break
    end if
end for
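A minimal Java sketch of Algorithm 1 for a single flat block follows. The Stmt and Placement records are hypothetical simplifications: permissions are reduced to sets of variable names, and a boolean flag stands in for the control(meth) and control(block) permissions. The sketch also makes one deliberate deviation: it tests for the invocation itself before the conflict checks, so that the node is never compared with its own permissions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class FutureDependencySketch {
    // Hypothetical statement model: the variables it reads and writes, whether it
    // may alter control flow, and whether it has already been turned into a task.
    record Stmt(String label, Set<String> reads, Set<String> writes,
                boolean altersControlFlow, boolean isTask) {}

    // Output of the search: the hard dependency (may be null) and the soft dependencies.
    record Placement(Stmt hardDep, List<Stmt> softDeps) {}

    static boolean overlaps(Set<String> x, Set<String> y) {
        for (String s : x) {
            if (y.contains(s)) return true;
        }
        return false;
    }

    static Placement find(List<Stmt> block, Stmt node) {
        Stmt hardDep = null;
        List<Stmt> softDeps = new ArrayList<>();
        for (Stmt stmt : block) {
            if (stmt == node) break;               // deviation: stop before self-comparison
            if (stmt.altersControlFlow()) {        // return, break, continue, ...
                hardDep = stmt;
                continue;
            }
            boolean conflict = overlaps(stmt.reads(), node.writes())
                    || overlaps(stmt.writes(), node.reads())
                    || overlaps(stmt.writes(), node.writes());
            if (conflict) {
                if (stmt.isTask()) softDeps.add(stmt); // wait on the earlier task
                else hardDep = stmt;                   // must be placed after stmt
            }
        }
        return new Placement(hardDep, softDeps);
    }

    public static void main(String[] args) {
        Stmt guard = new Stmt("if (n < 2) return n", Set.of("n"), Set.of(), true, false);
        Stmt aTask = new Stmt("a = f(n - 1)", Set.of("n"), Set.of("a"), false, true);
        Stmt node = new Stmt("c = g(a)", Set.of("a"), Set.of("c"), false, false);
        Placement p = find(List.of(guard, aTask, node), node);
        System.out.println(p.hardDep().label());  // if (n < 2) return n
        System.out.println(p.softDeps().size());  // 1
    }
}
```

In this example the future for g(a) can be created right after the guard, but it has to wait for the task that writes a.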
4.3.2 Parallel DO-ALL
Besides invocations, for and for-each loops are also targets for parallelization.
However, there are two scenarios, DO-ALL and DO-ACROSS. First, we will
focus on DO-ALL, when iterations of the loop are independent and have de-
pendencies only with code outside the for loop.
In order to verify if this is the situation, accesses to arrays or ArrayLists
are annotated with an indexed datagroup. This means that the code array[i] = 1
will have a permission write(array[i]) that is treated as a write(array) for
all code outside loops. Inside loops, the indexed permission is used to verify
if reads and writes are independent. The verification performed is rather
naïve, as it only considers for-loops in which the iteration variable is only
increased or decreased. Nevertheless, it is possible to apply the polyhedral
model, obtaining a better degree of parallelism.
The code generation of DO-ALL simply replaces the for loop with a call
to a static helper method, that will dynamically create parallel tasks for slices
of the range. The iteration code is defined as a lambda function passed to the
helper method. The helper method will return a Future that can be used as a
soft dependency for later Futures.
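The shape of such a helper can be sketched with the standard CompletableFuture class. The helper name forAll and its slices parameter are assumptions, and the range is divided statically for brevity, whereas the runtime described in Section 5.2 splits ranges dynamically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntConsumer;

final class DoAllSketch {
    // Hypothetical helper: runs body.accept(i) for every i in [from, to),
    // one asynchronous task per slice, and returns a future that completes
    // when all slices are done (usable as a dependency for later work).
    static CompletableFuture<Void> forAll(int from, int to, int slices, IntConsumer body) {
        int n = Math.max(0, to - from);
        int parts = Math.max(1, slices);
        int chunk = Math.max(1, (n + parts - 1) / parts);
        List<CompletableFuture<Void>> tasks = new ArrayList<>();
        for (int start = from; start < to; start += chunk) {
            final int lo = start;
            final int hi = Math.min(to, start + chunk);
            tasks.add(CompletableFuture.runAsync(() -> {
                for (int i = lo; i < hi; i++) body.accept(i);
            }));
        }
        return CompletableFuture.allOf(tasks.toArray(new CompletableFuture[0]));
    }

    public static void main(String[] args) {
        int[] array = new int[1000];
        // Original loop: for (int i = 0; i < array.length; i++) array[i] = 2 * i;
        forAll(0, array.length, 8, i -> array[i] = 2 * i).join();
        System.out.println(array[999]); // 1998
    }
}
```

Each iteration writes only array[i], which is what the indexed datagroup check is meant to establish before this rewrite is applied.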
4.3.3 Parallel DO-ACROSS
In order to parallelize DO-ACROSS loops, the loop must satisfy the same
conditions as for DO-ALL, but some write permissions are allowed inside the
loop, namely those of operations that are commutative and associative. By
default, the compiler considers for these tasks the operators +, - and * and
the methods Math.min() and Math.max(). However, any other operation can be
annotated as such, and will be parallelized using the same mechanism.
The compiler generates a Map-Reduce operation for the DO-ACROSS loop.
The map lambda contains the loop code, saving the new changes inside the
lambda, instead of the global state. The reduce lambda aggregates two states
together. The return type of the operation is also a Future, in order to be used
as a soft dependency.
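A minimal sketch of this Map-Reduce shape, again using CompletableFuture rather than the Æminium Runtime: the helper name mapReduce, the static slicing and the long-valued accumulator are assumptions chosen to keep the example short.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.IntToLongFunction;
import java.util.function.LongBinaryOperator;

final class DoAcrossSketch {
    // Hypothetical helper: each slice accumulates into a private variable (map),
    // and the partial results are then combined pairwise (reduce).
    // reduce is assumed to be associative and commutative, with the given identity.
    static CompletableFuture<Long> mapReduce(int from, int to, int slices, long identity,
                                             IntToLongFunction map, LongBinaryOperator reduce) {
        int n = Math.max(0, to - from);
        int parts = Math.max(1, slices);
        int chunk = Math.max(1, (n + parts - 1) / parts);
        CompletableFuture<Long> result = CompletableFuture.completedFuture(identity);
        for (int start = from; start < to; start += chunk) {
            final int lo = start;
            final int hi = Math.min(to, start + chunk);
            CompletableFuture<Long> part = CompletableFuture.supplyAsync(() -> {
                long acc = identity; // private state instead of the shared variable
                for (int i = lo; i < hi; i++) acc = reduce.applyAsLong(acc, map.applyAsLong(i));
                return acc;
            });
            result = result.thenCombine(part, (x, y) -> reduce.applyAsLong(x, y));
        }
        return result;
    }

    public static void main(String[] args) {
        // Original loop: long sum = 0; for (int i = 0; i < 1000; i++) sum += (long) i * i;
        long sum = mapReduce(0, 1000, 8, 0L, i -> (long) i * i, Long::sum).join();
        System.out.println(sum); // 332833500
    }
}
```

Because no slice touches shared state while it runs, no lock is needed; the only synchronization is the final combination of partial results.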
4.4 ÆminiumGPU Integration
The ÆminiumGPU Compiler is a Java-to-Java source-to-source compiler, in
which the final Java code has some extra OpenCL code. The compiler targets
lambda functions used inside Map-Reduce style functions. For each
of these lambdas, the compiler generates an OpenCL function. This function
is compiled as a kernel during the compilation phase, but can also be dy-
namically compiled during execution, if merged with another function. This
dynamic merging of functions into one kernel is used to avoid overheads in
kernel scheduling and eventual memory transfers to and from the GPU.
It is important to note that not all Java code can be translated to
OpenCL. The ÆminiumGPU compiler does not support arbitrary method calls,
non-local variables, for-each loops, object instantiation or exceptions. It
does support the common subset of Java and C99, with some extra features such
as static accesses, calls to methods and references to fields of the Math
object.
Since DO-ALL and DO-ACROSS loops generate Map and Map-Reduce
function calls, the lambdas can be automatically translated by the Æmini-
umGPU compiler and handled by the ÆminiumGPU runtime, thus taking
advantage of available GPU processing power.
5 Runtime Execution
5.1 Tasks and Dependencies
The Æminium Runtime is a Java library that exposes APIs for expressing
asynchronous execution of code. The Runtime is composed of modules that
allow for an efficient execution of the source code, by leveraging the multiple
hardware threads available.
The core concept of the Æminium Runtime is the task as a representation
of code that can execute asynchronously. Tasks have a body, which can be
represented as a lambda, an anonymous inner class or as a regular class (useful
when doing recursive calls). Tasks are also defined by a set of dependencies on
other tasks. If A depends on B, it means that A cannot execute before B is
completed. Tasks can also have a parent task to represent subcomputations.
If A is the parent of B, then A is only considered as completed when both
the body of A has completed and B is considered as completed. Finally, tasks
can be characterized with hints, such as Recursive, Loops, Small or Large. An
example task graph can be seen in Figure 3, which represents 6 tasks with
dependencies among them, as well as parent-child relationships.
[Figure 3: task graph with six tasks (A to F) connected by "depends on" and
"is child of" edges.]
Fig. 3 Example of a task graph.
Each task can be of one of three types:
– Non-Blocking Tasks are all operations that are purely computational.
– Blocking Tasks are tasks that have at least one operation that performs
input or output, such as disk reads/writes, communication over sockets or
other interactions with the Operating System.
– Atomic Tasks are tasks that cannot execute at the same time as other
Atomic Tasks that share the same Data Group. The Data Group acts as
the lock that each atomic task must acquire before executing and release
after executing. However, two Atomic Tasks with different Data Groups
can execute concurrently.
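The locking behavior of Atomic Tasks can be sketched as one lock per Data Group. This is an assumption-laden illustration (Data Groups are identified by plain strings, and runAtomic is a hypothetical name), not the runtime's actual mechanism.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

final class AtomicTaskSketch {
    // One lock per Data Group name, created on first use.
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    // Runs the body while holding the lock of its Data Group. Bodies that use
    // different Data Groups acquire different locks and may overlap in time.
    static void runAtomic(String dataGroup, Runnable body) {
        ReentrantLock lock = LOCKS.computeIfAbsent(dataGroup, k -> new ReentrantLock());
        lock.lock();
        try {
            body.run();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] counter = new int[1];
        Runnable work = () -> {
            for (int i = 0; i < 10_000; i++) runAtomic("counter", () -> counter[0]++);
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(counter[0]); // 20000
    }
}
```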
Figure 4 shows the lifecycle of tasks inside the Runtime. When a task is
submitted to the runtime, along with its dependencies, the runtime analyzes
if the dependencies are already met. If so, the task is sent to a queue for
execution. If not, no action is performed at this point.
The Æminium Runtime does not create a Thread for each task, as the overhead
would be very noticeable. Instead, there are always n threads running,
one for each processor core available in the machine (this number can be
configured per program execution). These threads are responsible for executing
tasks that are considered ready. In order to reduce locking on the queues,
each thread has its own queue. When a thread finds its queue empty, it will
"steal" a task from the queue of another thread. The Æminium Runtime has a
few stealing algorithms, including a random steal, stealing from the largest
queue or from the one with more dependent tasks.
Non-Blocking and Atomic tasks are stored on those regular queues. Since
Blocking Tasks may take a long time to execute, because of OS dependencies
such as Sockets or Files, they are added to a special ThreadPool-backed
queue. This avoids blocking Work-Stealing workers with Blocking tasks.
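The per-thread queues and the "steal from the largest queue" policy can be sketched as follows. The class and method names are hypothetical, and two details are assumptions rather than facts about the Æminium Runtime: the owner takes its newest task, and a thief takes the victim's oldest task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;

final class StealingSketch {
    // One double-ended queue per worker, so workers rarely contend on the same queue.
    private final List<ConcurrentLinkedDeque<Runnable>> queues = new ArrayList<>();

    StealingSketch(int workers) {
        for (int i = 0; i < workers; i++) queues.add(new ConcurrentLinkedDeque<>());
    }

    void submit(int worker, Runnable task) {
        queues.get(worker).addLast(task);
    }

    // Returns the next task for the given worker, or null if nothing is available.
    Runnable next(int worker) {
        Runnable own = queues.get(worker).pollLast(); // assumed: owner takes newest task
        if (own != null) return own;
        ConcurrentLinkedDeque<Runnable> victim = null;
        for (int i = 0; i < queues.size(); i++) {
            if (i == worker) continue;
            ConcurrentLinkedDeque<Runnable> q = queues.get(i);
            if (victim == null || q.size() > victim.size()) victim = q; // largest queue
        }
        return victim == null ? null : victim.pollFirst(); // assumed: thief takes oldest task
    }

    public static void main(String[] args) {
        StealingSketch runtime = new StealingSketch(2);
        runtime.submit(0, () -> System.out.println("old task"));
        runtime.submit(0, () -> System.out.println("new task"));
        runtime.next(1).run(); // worker 1 is idle and steals: prints "old task"
        runtime.next(0).run(); // worker 0 takes from its own queue: prints "new task"
    }
}
```

The demonstration is single-threaded on purpose; it only shows which task each worker would pick.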
[Figure 4: diagram of the runtime areas a task moves through (Unscheduled,
Waiting for Dependencies, Ready, Waiting for Children, Completed (GC)), with
the non-blocking queues, CPU cores and blocking queue marked in the legend.]
Fig. 4 Runtime Areas for the different phases of the Task Lifecycle, for a quad-core machine.
When a task completes, it will check if there are any child tasks that were
scheduled during execution and belong to the logical concern of the current
task. When all child tasks have finished, the task is marked as completed, and
it will notify both the dependent tasks and parent task that they do not need
to wait for it anymore.
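The lifecycle described above can be modelled with two counters per task: unmet dependencies and unfinished children. The following single-threaded sketch is only an illustration; LifecycleSketch, its Task fields and its method names are hypothetical and do not correspond to the actual runtime classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical single-threaded model of the task lifecycle. */
final class LifecycleSketch {
    enum State { UNSCHEDULED, WAITING_FOR_DEPENDENCIES, READY, WAITING_FOR_CHILDREN, COMPLETED }

    static final class Task {
        final Task parent;                       // null for a top-level task
        final List<Task> dependents = new ArrayList<>();
        int pendingDependencies;
        int pendingChildren;
        boolean bodyFinished;
        State state = State.UNSCHEDULED;

        Task(Task parent) {
            this.parent = parent;
        }
    }

    final Queue<Task> readyQueue = new ArrayDeque<>();

    /** Submission: the task becomes ready only if all dependencies are already completed. */
    void submit(Task task, List<Task> dependencies) {
        if (task.parent != null) {
            task.parent.pendingChildren++;
        }
        for (Task dep : dependencies) {
            if (dep.state != State.COMPLETED) {
                dep.dependents.add(task);
                task.pendingDependencies++;
            }
        }
        if (task.pendingDependencies == 0) {
            makeReady(task);
        } else {
            task.state = State.WAITING_FOR_DEPENDENCIES;
        }
    }

    /** Called when the body of a task has finished executing. */
    void bodyFinished(Task task) {
        task.bodyFinished = true;
        if (task.pendingChildren == 0) {
            complete(task);
        } else {
            task.state = State.WAITING_FOR_CHILDREN;
        }
    }

    private void makeReady(Task task) {
        task.state = State.READY;
        readyQueue.add(task);
    }

    /** Completion notifies the dependent tasks and the parent task. */
    private void complete(Task task) {
        task.state = State.COMPLETED;
        for (Task dependent : task.dependents) {
            if (--dependent.pendingDependencies == 0) {
                makeReady(dependent);
            }
        }
        Task parent = task.parent;
        if (parent != null && --parent.pendingChildren == 0 && parent.bodyFinished) {
            complete(parent);
        }
    }
}
```

In this model, a task whose body has finished stays in WAITING_FOR_CHILDREN until its last child completes, and only then releases the tasks that depend on it.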
5.2 Executing DO-ALL and DO-ACROSS
Since each iteration may take a different time to execute, DO-ALL and
DO-ACROSS loops cannot simply be divided into equal parts and executed in
slices. In order to balance loads across cores, a more dynamic approach is
required. We provide two different approaches: Binary Splitting and Lazy
Binary Splitting [32]. Binary Splitting divides the current range in two if
the Decider module considers that it is still useful to create new tasks. If
not, it executes the iterations of the current range immediately. Lazy Binary
Splitting has a parameter, PPS, which defines how frequently the runtime
checks whether the range should be split in half. A PPS of 3 means that every
3 iterations the runtime checks if the remaining range should be split in two.
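Both splitting strategies can be sketched on top of Java's Fork-Join framework. This is an illustration under assumptions, not the Æminium implementation: the class names are hypothetical, and the Decider module is replaced by a stand-in that allows splitting while the surplus of locally queued tasks is at most 3.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.function.BooleanSupplier;
import java.util.function.IntConsumer;

final class SplittingSketch {
    /** Stand-in for the Decider module (assumption): split while few tasks are queued. */
    static final BooleanSupplier DECIDER = () -> ForkJoinTask.getSurplusQueuedTaskCount() <= 3;

    /** Binary Splitting: decide once per range, then either halve it or run it. */
    static final class BinarySplit extends RecursiveAction {
        final int lo, hi;
        final IntConsumer body;

        BinarySplit(int lo, int hi, IntConsumer body) {
            this.lo = lo;
            this.hi = hi;
            this.body = body;
        }

        @Override
        protected void compute() {
            if (hi - lo > 1 && DECIDER.getAsBoolean()) {
                int mid = lo + (hi - lo) / 2;
                invokeAll(new BinarySplit(lo, mid, body), new BinarySplit(mid, hi, body));
            } else {
                for (int i = lo; i < hi; i++) {
                    body.accept(i);
                }
            }
        }
    }

    /** Lazy Binary Splitting: run pps iterations, then reconsider splitting the rest. */
    static final class LazyBinarySplit extends RecursiveAction {
        final int lo, hi, pps;
        final IntConsumer body;

        LazyBinarySplit(int lo, int hi, int pps, IntConsumer body) {
            this.lo = lo;
            this.hi = hi;
            this.pps = Math.max(1, pps);
            this.body = body;
        }

        @Override
        protected void compute() {
            Deque<LazyBinarySplit> forked = new ArrayDeque<>();
            int i = lo;
            int end = hi;
            while (i < end) {
                int stop = Math.min(end, i + pps);
                for (; i < stop; i++) {
                    body.accept(i);
                }
                if (end - i > 1 && DECIDER.getAsBoolean()) {
                    int mid = i + (end - i) / 2;
                    LazyBinarySplit upper = new LazyBinarySplit(mid, end, pps, body);
                    upper.fork();       // the upper half becomes a new task
                    forked.push(upper);
                    end = mid;          // this task keeps the lower half
                }
            }
            while (!forked.isEmpty()) {
                forked.pop().join();    // join the most recently forked task first
            }
        }
    }

    /** Runs both strategies over [0, n) and checks that each index ran once per strategy. */
    static boolean demo(int n) {
        AtomicIntegerArray hits = new AtomicIntegerArray(n);
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            pool.invoke(new BinarySplit(0, n, hits::incrementAndGet));
            pool.invoke(new LazyBinarySplit(0, n, 3, hits::incrementAndGet));
        } finally {
            pool.shutdown();
        }
        for (int i = 0; i < n; i++) {
            if (hits.get(i) != 2) {
                return false;
            }
        }
        return true;
    }
}
```

The difference is in when the decision is taken: BinarySplit asks the decider once per range, before executing anything, while LazyBinarySplit interleaves execution and decisions every pps iterations.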
For DO-ACROSS loops, the Map-Reduce approach is better than creating
lock-protected atomic blocks, since it avoids lock contention when all
threads want to access the same data. However, only associative and
commutative operations can be converted into Map-Reduce. This is not a large
problem, as most data-intensive computations are based on such operations,
for example + and * (a running subtraction can be rewritten as the addition
of negated values).
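As an illustration of this conversion, the sketch below shows a loop with a carried dependency on an accumulator and an equivalent map-reduce formulation. It uses Java parallel streams rather than the Æminium Runtime, and the class and method names are hypothetical. The reduction is only safe because + is associative and commutative, so partial sums can be combined in any order.

```java
import java.util.stream.LongStream;

final class ReductionSketch {
    /** Sequential loop: every iteration reads and writes acc (DO-ACROSS shape). */
    static long sequentialSumOfSquares(long n) {
        long acc = 0;
        for (long i = 1; i <= n; i++) {
            acc += i * i;
        }
        return acc;
    }

    /** Same computation as a map (square) followed by a reduce (+), without locks. */
    static long mapReduceSumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                .parallel()
                .map(i -> i * i)
                .reduce(0L, Long::sum);
    }
}
```

A loop of the form acc -= f(i) fits the same pattern if it is expressed as a sum over the negated values -f(i).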
5.3 Controlling Granularity
One of the most important decisions when executing irregular programs is
whether to execute a new task in parallel or to inline the task body inside
the current task. This decision has a great impact, since the overhead of task
creation can be high enough to prevent recursive programs from achieving any
speedup. The solution is to start by calling the parallel version and, after a
certain point, switch to the sequential version of the method.
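The switch from the parallel to the sequential version can be sketched with a recursive Fibonacci on Java's Fork-Join framework. This is a hand-written illustration, not code generated by the JPar compiler; the class name and the cut-off value MAX_LEVEL = 4 are assumptions chosen only for the example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

final class FibCutoffSketch extends RecursiveTask<Long> {
    static final int MAX_LEVEL = 4; // hypothetical cut-off depth
    final int n;
    final int level;

    FibCutoffSketch(int n, int level) {
        this.n = n;
        this.level = level;
    }

    /** Sequential version of the method, used once the cut-off is reached. */
    static long sequentialFib(int n) {
        return n < 2 ? n : sequentialFib(n - 1) + sequentialFib(n - 2);
    }

    /** Parallel version: spawns a task only above the cut-off depth. */
    @Override
    protected Long compute() {
        if (n < 2 || level >= MAX_LEVEL) {
            return sequentialFib(n);
        }
        FibCutoffSketch left = new FibCutoffSketch(n - 1, level + 1);
        left.fork();
        long right = new FibCutoffSketch(n - 2, level + 1).compute();
        return left.join() + right;
    }

    static long fib(int n) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new FibCutoffSketch(n, 0));
        } finally {
            pool.shutdown();
        }
    }
}
```

With this cut-off, at most 2^4 = 16 subtrees run sequentially at the bottom of the task tree, which bounds the number of created tasks regardless of n.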
The Æminium Runtime provides several mechanisms for controlling the
granularity of tasks:
Maximum task recursion level (max-level) - Divide-and-conquer algorithms
create tasks in a tree-shaped structure. In order to avoid the creation of
too many tasks, the cut-off limit may be defined by the depth of the
recursion [15], which can be calculated from the number of ancestors of the
running task.
This approach limits the depth of the task hierarchy graph, which may be
suitable for programs with balanced parallelism, but not for more dynamic,
unbalanced programs.
Maximum number of tasks (max-tasks) - In this approach, tasks are
created until the total number of active tasks reaches a certain threshold [15].
After that point, all new computations are inlined instead of spawning another
task. When the number of active tasks drops, new tasks can be created until
the threshold is reached again.
The threshold in this approach is typically defined as the number of
processor threads on the machine, adapting to different machines, but being
oblivious to other factors such as memory and processor speed.
Load Based - This simple heuristic is based on whether all cores are being
used or not. A new task is only created if there is at least one idle core[14].
Surplus Queued Task Count - This approach is included in Java's Fork-Join
framework [22] and relies on the size of the work-stealing queues. Before
creating a new task, the number of tasks queued in the current thread in
excess of the number of tasks in other queues is compared to a threshold
(usually 3 in existing ForkJoin benchmarks).
In order to decrease the overhead of computing the size of queues, the size
of the other queues is estimated from the size of the current queue after
applying a factor of (number of idle threads / active threads), because idle
threads are known to have 0 tasks in their queue. This estimation assumes a
regular distribution among threads, which may not always happen.
Adaptive Tasks Cut-Off (ATC) - Adaptive Tasks Cut-Off [14] changes
the policy of the cut-off mechanisms according to the recursion lifecycle. Tasks
are only created if two conditions are met. The first is that there are fewer
tasks than the number of threads on a given recursion level. This condition
forces the threads to expand in depth, creating work for all threads while
staying within a certain bound. The second condition is that the depth level
is less than a certain threshold. This is, in fact, the usage of max-level and
max-tasks together.
ATC adds a profiler that records how much time a sub-tree takes to execute
and uses it to predict the time of further sub-trees (if the prediction is
larger than 1ms, the task will not be created). This is, however, based on the
assumption that all tasks inside a level have a similar behavior, which does
not happen in unbalanced parallelism.
Maximum Queue Size - We introduce this new approach, which limits
the number of tasks in the local queue to a certain threshold. This approach
differs from max-tasks in only looking at the local queue, instead of all the
queues, reducing the decision time by not accessing information from other
threads. If the threshold is one or two tasks higher than the threshold of
max-tasks, queues will have extra tasks that can be stolen by other threads.
Stack Size - In recursive divide-and-conquer programs, the recursion limit
of the platform imposes heavy limitations on the parallelization of programs.
Recursive calls are necessary to inline the execution of tasks inside the same
worker thread. As such, the performance of programs decreases when the stack
grows beyond a certain size.
System Monitoring - Instead of looking at the task and runtime state,
this decider method analyses the system load. If it is below a given CPU
occupation and below a given memory occupation, then a new task is created.
Both the CPU and memory occupation limits are configurable.
The Runtime can use any of these methods for each program execution. It
is also possible to combine two or more of the methods at the cost of increasing
the overhead of the decision.
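A few of these mechanisms, and their combination, can be expressed as predicates over a snapshot of the runtime state. The sketch below is hypothetical: the RuntimeState fields and the factory method names are assumptions, and the actual Decider interface of the runtime may differ.

```java
import java.util.function.Predicate;

final class GranularitySketch {
    /** Snapshot of the state a decider may inspect (hypothetical fields). */
    static final class RuntimeState {
        final int depth;           // number of ancestors of the running task
        final int activeTasks;     // tasks alive in the whole runtime
        final int localQueueSize;  // tasks queued in the current thread
        final int idleCores;       // cores without work

        RuntimeState(int depth, int activeTasks, int localQueueSize, int idleCores) {
            this.depth = depth;
            this.activeTasks = activeTasks;
            this.localQueueSize = localQueueSize;
            this.idleCores = idleCores;
        }
    }

    /** Each decider answers: should a new task be created in this state? */
    static Predicate<RuntimeState> maxLevel(int limit) {
        return s -> s.depth < limit;
    }

    static Predicate<RuntimeState> maxTasks(int limit) {
        return s -> s.activeTasks < limit;
    }

    static Predicate<RuntimeState> maxQueueSize(int limit) {
        return s -> s.localQueueSize < limit;
    }

    static Predicate<RuntimeState> loadBased() {
        return s -> s.idleCores > 0;
    }
}
```

Two deciders can then be combined with Predicate.and, for example maxLevel(12).and(maxTasks(24)), at the cost of evaluating both conditions on each decision.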
5.4 ÆminiumGPU Integration
The ÆminiumGPU Runtime allows the execution of data-parallel operations
(such as map and reduce) on GPUs. Operations are lazy and are only executed
when the result is needed. This differs from regular Java semantics, and it
avoids unnecessary overheads when two GPU operations are chained. In
that case, we merge the GPU kernels into one, making only one data copy in
each direction, and starting only one kernel.
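The effect of lazy evaluation on chained operations can be illustrated with a CPU-only model. This is not the ÆminiumGPU API: the class LazyMapSketch is hypothetical, and function composition stands in for the merging of GPU kernels. Each call to map only records the operation; the data is traversed once, when get is called.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical CPU-only model of lazy, fused map operations. */
final class LazyMapSketch {
    private final double[] input;
    private final DoubleUnaryOperator fused;

    LazyMapSketch(double[] input) {
        this(input, DoubleUnaryOperator.identity());
    }

    private LazyMapSketch(double[] input, DoubleUnaryOperator fused) {
        this.input = input;
        this.fused = fused;
    }

    /** Records the operation without touching the data. */
    LazyMapSketch map(DoubleUnaryOperator operation) {
        return new LazyMapSketch(input, fused.andThen(operation));
    }

    /** Forces evaluation: a single pass applies every chained operation. */
    double[] get() {
        double[] output = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            output[i] = fused.applyAsDouble(input[i]);
        }
        return output;
    }
}
```

On a GPU, the single pass in get would correspond to one copy of the input, one kernel launch and one copy of the output, instead of one round trip per chained operation.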
The first step is to decide how to divide work between the GPU and the
CPU cores. If the GPU is going to be used, an OpenCL kernel is compiled, data
is sent to the GPU and the kernel executes. After completion, data is copied
back to the main memory.
The decision whether to use the GPU is first made by analyzing the size of
the input data. If it is below a given threshold (which depends on the
machine) or the operation is not complex enough, then it is executed on the
regular Æminium Runtime as a DO-ALL or DO-ACROSS loop. Otherwise, it can
execute on the GPU or on a mix of GPU and CPU. This decision has been
improved by using Machine Learning [18], in which features from static
analysis (number of memory writes and reads, global vs local accesses,
cyclomatic complexity, number of branch instructions and number of
mathematically intensive operations) and runtime information (size of data,
number of chained operations) are used to decide which platform to use for
execution.
6 Evaluation
In this section, we evaluate several design options of the JPar compiler,
as well as Runtime configurations for its programs. We then compare it with
OoOJava and with a manually parallelized version. The experiments shown
were performed on a machine with an Intel Xeon X5660 @ 2.8GHz processor,
with 12 cores and 24 hyper-threads, and 24GB of RAM. The machine was
running Ubuntu 14.04 server 64-Bit and the Java Hotspot 64-Bit Server
JVM. The machine was chosen for its high number of cores. Unless specified
otherwise, programs were executed 7 times and the average value was used.
6.1 Benchmark Programs
In order to evaluate the performance of the compiler, we used the sequential
version of 8 programs from the Æminium Benchmark Suite[17]. The configu-
ration for each program is described in Table 1, as well as the parallelization
performed for each program.
Program         Parallelism          Input size
Black-Scholes   DO-ACROSS            100000
FFT             Recursive            16777216
Fibonacci       Recursive            n=51
Health          Recursive, DO-ALL    n=6
Integrate       Recursive            s=-2101, e=1700, error=1014
MergeSort       Recursive            n=251658240
N-Body          DO-ALL               it=10, bodies=25000
Pi              DO-ACROSS            n=1500000000
Table 1 Description of the programs used in the benchmark
6.2 Compiler Granularity Control
The JPar compiler selects which tasks should be parallelized based on
the number of instructions. To evaluate whether this is useful,
we compared two versions of the compiler: one with fine-grained parallelism,
and another that only creates a task if its body has 10 or more instructions,
or contains recursive calls or loops.
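The selection rule of the second compiler version can be written as a simple predicate. The sketch is only illustrative: the class and method names are hypothetical, and it assumes the instruction count and the presence of recursive calls and loops have already been computed for the candidate task body.

```java
final class CompileTimeFilterSketch {
    static final int MIN_INSTRUCTIONS = 10; // threshold used in this evaluation

    /** Returns true if the candidate body is considered worth turning into a task. */
    static boolean worthParallelizing(int instructionCount,
                                      boolean hasRecursiveCall,
                                      boolean hasLoop) {
        return instructionCount >= MIN_INSTRUCTIONS || hasRecursiveCall || hasLoop;
    }
}
```

Under this rule, a lightweight method with a handful of instructions and no loops or recursion is executed inline, while a short body that contains a recursive call is still treated as a task.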
We used the FFT benchmark because it works with an array of Complex
objects, each with several lightweight methods, such as sin, cos, tan, times and
divide. The Full Parallel version converts invocations that can occur in parallel
into tasks, while the Partial Parallel version only creates parallel tasks for
the main recursive function. Figure 5 shows the distribution of execution times
of both versions, as well as of the sequential version. With an input size of
262144 elements, there is no speedup in either version. However, with larger
input sizes, the Full Parallel version takes more than one hour to
execute, while the Partial Parallel version provides speedups. Thus, we
confirm that enabling a threshold for task parallelization benefits programs.
Fig. 5 Distribution of execution times for each version of the FFT program on a random
array with 262144 elements.
6.3 Binary vs Lazy Binary Splitting
Programs that have for loops parallelized can generate tasks in two different
ways: Binary Splitting or Lazy Binary Splitting. For Lazy Binary Splitting, we
used the recommended value of 3 for the PPS parameter, and a higher value
of 10 for less frequent decisions. Figure 6 shows the speedup over the
sequential programs of the three approaches in programs with loops. The Lazy
Binary Splitting version achieved the best results in the Black-Scholes
program, running in less than half the time of its Binary Splitting
counterpart, but could not complete the other programs within a 5-minute
time-out, resulting in no speedup. The conclusion is that Binary Splitting is
a conservative approach that can be used for any program, while Lazy Binary
Splitting can achieve the best results, but should only be applied after
verifying that there is a speedup in that particular program.
6.4 Cutoff Mechanism
One of the most important variables for tuning a parallel program is the
granularity of tasks. Besides the compile-time decision of whether to
parallelize a task, the Æminium Runtime has several automatic granularity
control mechanisms. Figure 7 shows the speedup achieved by each method on the
8 programs. For each mechanism, several parameter values were tested
beforehand, and we used the best.
Fig. 6 Average speedup of Binary Split and Lazy Binary Split (PPS=3 and 10) versions of
the programs with loops.
Black-Scholes is a program with a high number of tasks, each of which
performs very little work. Creating extra tasks in this kind of program brings
a high penalty, because the task overhead is several times more expensive than
executing the task itself. max-level is the mechanism with the best performance,
since all loop iterations take equal time to execute. However, a maximum level
other than 12 would not perform as well. The same mechanism does
not perform well on Fibonacci or Integrate, given the very irregular and
unbalanced nature of these problems. The skewness in the computation tree would
leave one task with a workload similar to that of the sequential version.
Black-Scholes is the program that benefits the most from being
parallelized on this machine. ATC, Load Balance, Max-Level and Surplus show
good performance. When cutting parallelism by the stack depth or system
resources, too many tasks are created, and there is a high, unnecessary
overhead.
The FFT program did not achieve high speedup values because of memory
issues. Each recursive call allocates a large array, and this proved to be a
performance hit on the Java platform. The approaches that limited the JVM
stack depth achieved the best results, but still below the desired values.
The same happened with the N-Body simulation because of compile-time
granularity. N-Body is made of two loops, and the JPar compiler
over-parallelizes the program by trying to convert both loops into DO-ALL
operations. Since the inner loop is small enough to be considered only one
task, the compiler introduces a high overhead by trying to parallelize a block
that is already small enough.
Fig. 7 Speedups of different cut-off approaches for each program.
The comparison between approaches for the overall benchmark can be seen
in Figure 8. Despite Max-Level having the highest speedup, we used the
Surplus3 approach as the default because of its lower variance. Most of the
speedup advantage of Max-Level is due to only one program: Black-Scholes.
6.5 Comparison with Other Approaches
We evaluated the JPar compiler against two other approaches: OoOJava and
a human programmer. For OoOJava, the source programs were annotated with sese
statements to identify the main parallel tasks. Without these annotations, the
generated version would be equivalent to the sequential version. Since the
OoOJava compiler generates C code, a sequential version using the same
compiler is also presented. The OoOJava compiler was unable to compile the
Black-Scholes, Health and N-Body programs, failing to identify dependencies
during code generation. The Æminium Benchmark includes versions of the
programs manually written on top of the Fork-Join framework, and on top of
the Æminium Runtime. Manual or automatic cut-off mechanisms were selected
using a manual local search across mechanisms and parameters.
Fig. 8 Average speedups of different cut-off approaches across the 8 different programs.
Figure 9 shows the speedup achieved by the three approaches, each with
different configurations of platform and granularity control mechanism; in
the case of OoOJava, the sequential version is also included. Overall, the
performance of the JPar compiler was superior to that of OoOJava. OoOJava did
not even achieve a speedup compared with the OoOJava sequential version.
The reason is that the automatic granularity control mechanism of OoOJava
over-parallelizes, resulting in overheads on smaller tasks and heavier
lock contention. The results presented in [21] mask the overhead by
introducing a manual cut-off decision in the sequential programs, which should
not be aware that they are parallel.
Overall, the human-written Fork-Join and Æminium programs performed
better than those of the JPar compiler, which is expected, since programmers
have better knowledge of the nature of the program and select the best
parts to parallelize. Black-Scholes is an exception, since the Human version
only parallelizes inside each loop, and does not consider parallelizing different
tasks at the same time. This is a case where the compiler could find parallelism
hidden in plain sight, and this can be used to further improve the benchmarks.
In the other cases, the Human versions outperformed the JPar compiler with
the best general cut-off. It is important to note that the best threshold in
some programs would show a speedup similar to the human version; however,
we decided not to include a manual tuning of the parallel version generated
by the JPar compiler.
Fig. 9 Speedup comparison between JParCompiler, Human and OoOJava approaches. The
JPar version includes the two best-performing cut-off mechanisms, maxtasks2 and surplus3.
The Human version includes one version on top of the Fork-Join framework, and two on top
of the Æminium Runtime, one with a manual cut-off and another with an automatic cut-off
mechanism. For the OoOJava compiler, the serial and parallel versions are shown.
7 Conclusion
We have presented the JPar compiler and the Æminium framework for auto-
matically parallelizing sequential programs. By analyzing the data dependen-
cies in the sequential program, we were able to conservatively extract paral-
lelism without changing the program semantics. We have improved over our
previous work by performing the data-dependency analysis in one pass, and
by generating source-code similar to the original, but with futures replacing
parallel computations. This change allows developers to understand how par-
allelization occurs, and how to improve it.
We have also studied two granularity control mechanisms. The JPar compiler
only considers for parallelization tasks that are large enough.
This change introduced speedups in several programs of the benchmark that
would not have had them otherwise. We have also applied several existing
runtime cut-off mechanisms, as well as three new ones (StackSize, System
Monitor, and a combination of StackSize with Max Tasks), which can be used to
improve the performance of programs. We have also studied the usage of Binary
versus Lazy Binary Splitting.
Finally, we compared our compiler with another state-of-the-art compiler,
OoOJava, and with human parallelization. While our results were not as good
as those of programs written and fine-tuned by a human, JPar-generated
programs outperformed OoOJava.
For future work, we intend to select the best granularity control mechanism
for a given program by looking at its structure. This can be performed using
machine learning techniques over a large dataset of programs. A low-overhead
combination of mechanisms that can be used to improve any program is also
currently being explored. Another aspect that needs improvement is a better
assessment of the granularity of tasks at compile-time, for which a cost model
can be of use.
Acknowledgements This work would not have been possible without the contributions to
the Æminium language and runtime from Sven Stork, Paulo Marques and Jonathan Aldrich.
References
1. Amini, M., Creusillet, B., Even, S., Keryell, R., Goubier, O., Guelton, S., McMahon,
J.O., Pasquier, F.X., Péan, G., Villalon, P.: Par4all: From convex array regions to
heterogeneous computing. In: IMPACT 2012: Second International Workshop on Polyhedral
Compilation Techniques HiPEAC 2012 (2012)
2. Ayguadé, E., Copty, N., Duran, A., Hoeflinger, J., Lin, Y., Massaioli, F., Teruel, X.,
Unnikrishnan, P., Zhang, G.: The design of openmp tasks. Parallel and Distributed
Systems, IEEE Transactions on 20(3), 404–418 (2009)
3. Ayguadé, E., Duran, A., Hoeflinger, J., Massaioli, F., Teruel, X.: An experimental
evaluation of the new openmp tasking model. In: Languages and Compilers for Parallel
Computing, pp. 63–77. Springer (2008)
4. Banerjee, U., Eigenmann, R., Nicolau, A., Padua, D.A., et al.: Automatic program
parallelization. Proceedings of the IEEE 81(2), 211–243 (1993)
5. Bik, A.J., Gannon, D.B.: Automatically exploiting implicit parallelism in java. Concur-
rency - Practice and Experience 9(6), 579–619 (1997)
6. Bondhugula, U., Baskaran, M., Krishnamoorthy, S., Ramanujam, J., Rountev, A., Sa-
dayappan, P.: Automatic transformations for communication-minimized parallelization
and locality optimization in the polyhedral model. In: Compiler Construction, pp. 132–
146. Springer (2008)
7. Chamberlain, B., Callahan, D., Zima, H.: Parallel programmability and the chapel lan-
guage. International Journal of High Performance Computing Applications 21(3), 291–
312 (2007)
8. Chan, B., Abdelrahman, T.S.: Run-time support for the automatic parallelization of
java programs. The Journal of Supercomputing 28(1), 91–117 (2004)
9. Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K.,
Von Praun, C., Sarkar, V.: X10: an object-oriented approach to non-uniform cluster
computing. In: ACM SIGPLAN Notices, vol. 40, pp. 519–538. ACM (2005)
10. Chen, M.K., Olukotun, K.: The jrpm system for dynamically parallelizing java programs.
In: Computer Architecture, 2003. Proceedings. 30th Annual International Symposium
on, pp. 434–445. IEEE (2003)
11. Dagum, L., Menon, R.: Openmp: an industry standard api for shared-memory
programming. Computational Science & Engineering, IEEE 5(1), 46–55 (1998)
12. Dave, C., Bae, H., Min, S.J., Lee, S., Eigenmann, R., Midkiff, S.: Cetus: A source-to-
source compiler infrastructure for multicores. Computer (12), 36–42 (2009)
13. Dominguez, R.M.: Evaluating different java bindings for opencl (2013)
14. Duran, A., Corbalán, J., Ayguadé, E.: An adaptive cut-off for task parallelism. In: High
Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. Interna-
tional Conference for, pp. 1–11. IEEE (2008)
15. Duran, A., Corbalán, J., Ayguadé, E.: Evaluation of openmp task scheduling strategies.
In: OpenMP in a new era of parallelism, pp. 100–110. Springer (2008)
16. Feautrier, P.: Automatic parallelization in the polytope model. In: The Data Parallel
Programming Model, pp. 79–103. Springer (1996)
17. Fonseca, A.: Æminium Benchmark Suite. https://github.com/AEminium/AeminiumBenchmarks
(2013). [Online; accessed 23-October-2013]
18. Fonseca, A., Cabral, B.: Æminiumgpu: An intelligent framework for gpu programming.
In: Facing the Multicore-Challenge III, pp. 96–107. Springer (2013)
19. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the cilk-5 multi-
threaded language. In: ACM Sigplan Notices, vol. 33, pp. 212–223. ACM (1998)
20. Hogen, G., Kindler, A., Loogen, R.: Automatic parallelization of lazy functional pro-
grams. In: ESOP’92, pp. 254–268. Springer (1992)
21. Jenista, J.C., Demsky, B.C., et al.: Ooojava: software out-of-order execution. In: ACM
SIGPLAN Notices, vol. 46, pp. 57–68. ACM (2011)
22. Lea, D.: A java fork/join framework. In: Proceedings of the ACM 2000 conference on
Java Grande, pp. 36–43. ACM (2000)
23. Lee, S., Min, S.J., Eigenmann, R.: Openmp to gpgpu: a compiler framework for auto-
matic translation and optimization. ACM Sigplan Notices 44(4), 101–110 (2009)
24. Leino, K., Poetzsch-Heffter, A., Zhou, Y.: Using data groups to specify and check side
effects. ACM SIGPLAN Notices 37(5), 246–257 (2002)
25. Marlow, S., Peyton Jones, S., Singh, S.: Runtime support for multicore haskell. In:
ACM Sigplan Notices, vol. 44, pp. 65–78. ACM (2009)
26. Pawlak, R., Monperrus, M., Petitprez, N., Noguera, C., Seinturier, L.: Spoon v2: Large
scale source code analysis and transformation for java. Tech. Rep. hal-01078532, INRIA
(2006). URL https://hal.inria.fr/hal-01078532
27. Rafael, J., Correia, I., Fonseca, A., Cabral, B.: Dependency-based automatic paral-
lelization of java applications. In: Euro-Par 2014: Parallel Processing Workshops, pp.
182–193. Springer (2014)
28. Senghor, A., Konate, K.: Fjcomp, a java parallelizing compiler for dealing with divide-
and-conquer algorithm. In: Computer Applications Technology (ICCAT), 2013 Inter-
national Conference on, pp. 1–5. IEEE (2013)
29. Steele, G.: Parallel programming and parallel abstractions in fortress. Lecture Notes in
Computer Science 3945, 1 (2006)
30. Stork, S., Naden, K., Sunshine, J., Mohr, M., Fonseca, A., Marques, P., Aldrich, J.:
Æminium: A permission-based concurrent-by-default programming language approach.
ACM Transactions on Programming Languages and Systems (TOPLAS) 36(1), 2 (2014)
31. Swaine, J., Tew, K., Dinda, P., Findler, R.B., Flatt, M.: Back to the futures: incremental
parallelization of existing sequential runtime systems. In: ACM Sigplan Notices, vol. 45,
pp. 583–597. ACM (2010)
32. Tzannes, A., Caragea, G.C., Barua, R., Vishkin, U.: Lazy binary-splitting: a run-time
adaptive work-stealing scheduler. ACM Sigplan Notices 45(5), 179–190 (2010)
33. Wang, C., Li, X., Zhang, J., Zhou, X., Nie, X.: Mp-tomasulo: A dependency-aware
automatic parallel execution engine for sequential programs. ACM Transactions on
Architecture and Code Optimization (TACO) 10(2), 9 (2013)
34. Zhao, J., Rogers, I., Kirkham, C., Watson, I.: Loop parallelisation for the jikes rvm.
In: Parallel and Distributed Computing, Applications and Technologies, 2005. PDCAT
2005. Sixth International Conference on, pp. 35–39. IEEE (2005)
... But the main challenge is to be able to identify components of codes that can be run in parallel. This is perhaps the reason automated application parallelisation has been a longstanding and prominent research topic ever since 1990 [3][4][5]. ...
... Due to the high execution time of loops in applications [39], there are other classifications for automatic parallelisation, namely loop parallelisation and non-loop parallelisation [3]. Among the methods based on non-loop parallelisation, in Ref. [35] the data is analysed on reads and writes at the local level, and in Ref. [40] the programme complies into a speculative C programme. ...
... The interaction between threads is checked at runtime but does not support I/O instructions. Other methods in this group, such as [3,35,40] and optimisations that have been made in recent years to automatically modify the parallelisation of Java applications [6,8,37], either cannot adapt with common multi-core processors or have poor speed, or need high programing skills. Finally, irregular programs must decide whether to be parallel or not. ...
Article
Full-text available
Due to the design of computer systems in the multi‐core and/or multi‐processor form, it is possible to use the maximum capacity of processors to run an application with the least time consumed through parallelisation. This is the responsibility of parallel compilers, which perform parallelisation in several steps by distributing iterations between different processors and executing them simultaneously to achieve lower runtime. The present paper focuses on the uniformisation of three‐level perfect nested loops as an important step in parallelisation and proposes a method called Towards Three‐Level Loop Parallelisation (TLP) that uses a combination of a Frog Leaping Algorithm and Fuzzy to achieve optimal results because in recent years, many algorithms have worked on volumetric data, that is, three‐dimensional spaces. Results of the implementation of the TLP algorithm in comparison with existing methods lead to a wide variety of optimal results at desired times, with minimum cone size resulting from the vectors. Besides, the maximum number of input dependence vectors is decomposed by this algorithm. These results can accelerate the process of generating parallel codes and facilitate their development for High‐Performance Computing purposes.
... OpenMP [3] and Cilk [4] are two well-known approaches that use instructions for extracting parallel codes. Another classification, the one in [5], classifies parallelism into three categories. The first is parallel programming which greatly challenges programmer skills. ...
... This is firstly because parallel programming requires a high level of specialized skills, and secondly, there are problems such as deadlocks and so on that have to be dealt with. Perhaps this is why automated application parallelization has been a long-standing and prominent research topic back from 1990 [5], [7]. In recent years, the issue of automatic parallelization of programs with irregular structures has widely been considered [8]- [10]. ...
... , d m }. Furthermore, the dependence cone C(D) can be explained as follows defined by the Equation (5). ...
Article
Full-text available
The growth of software techniques for implementing applications must go hand in hand with the growth of computer system hardware in the design of multi-core and multi-processor systems; otherwise, we cannot expect to be able to use maximum hardware capacities. One of the most important and challenging techniques for running applications is to run them in parallel with a focus on loop parallelism to reduce execution time. On the other hand, in recent years, many algorithms have been working on volumetric data, i.e., three-dimensional spaces; therefore, parallelization must be possible for all types of two-dimensional and three-dimensional loops. Uniformization is an important part of loop parallelism, and also the present paper’s focus. The proposed algorithm in the present paper performed uniformization with a combination of the frog leaping algorithm and the fuzzy system for two- and three-dimensional loops on a wide range of input dependence vectors and achieved a considerable variety of results in the desired time. The results of this study can be used to facilitate the development of parallel codes.
... Il s'est également agit de proposer l'automatisation de la génération de programmes parallèles à partir de leur écriture séquentielle classique. Différentes techniques existent, dont, à nouveau, l'exploitation des compilateurs pour analyser le code source et produire, sans que soit nécessaire une intervention de la part du développeur, un programme dont l'exécution est parallèle [Zima et al. 1988 ;Blume et al. 1995 ;Ahmad et al. 1997 ;Fonseca et al. 2016]. Similairement, des langages de programmation se sont mis à intégrer des constructions orientées vers la parallélisation et plus ou moins transparentes pour le développeur [Roscoe et Charles Antony Richard Hoare 1988 ;Loveman 1993 ;Chamberlain et al. 2007 ;Rieger et al. 2019]. ...
... Généralement, les outils de parallélisation automatique logicielle fonctionnent directement comme des compilateurs [Blume et al. 1995 ;Fonseca et al. 2016], générant un exécutable parallèle, ou comme des méta-compilateurs [Zima et al. 1988 ;Ahmad et al. 1997], produisant un nouveau code source parallèle à partir d'un code source séquentiel. Il existe également des langages intrinsèquement parallèles [Roscoe et Charles Antony Richard Hoare 1988 ;Loveman 1993 ;Chamberlain et al. 2007] car ils incluent dans leur définition des éléments faisant de la parallélisation automatique par le compilateur une fonctionnalité obligatoire, à l'inverse de la majorité des langages pour lesquels cette optimisation est un atout propre au compilateur qui la propose. ...
Thesis
Full-text available
Writing parallel programs, as opposed to "classic" sequential programs that use only one processor, has become a necessity. Indeed, while until the beginning of the millennium the computing power of computers depended mainly on processor frequency, it is now tied to the number of compute cores, which keeps growing. Yet, because of the difficulties introduced by parallelization, most programs are still written in the "classic" way. In particular, it can be complicated to determine, given a subpart of a program, whether parallelization is possible, that is, whether it will not change the program's behavior. However, even knowing precisely what can be parallelized, doing it correctly is also a difficult task. This thesis presents two approaches to simplify the writing of parallel programs. We propose an active library (using template metaprogramming, it acts during compilation) that gathers information about a portion of a program, corresponding to a loop, by means of expression templates. This information is used to analyze the instructions and identify which ones can be executed in parallel. The analysis relies on two levels of knowledge: the set of variables used, distinguishing read accesses from write accesses, and, since these are often arrays, index functions. The variables and their access modes tell us which instructions are interdependent, while the index functions let us determine, for a group of instructions, whether the iterations of the loop can be executed in parallel.
The goal of this library is to provide a framework, both for developers, so that they can write loops that will automatically be executed in parallel when possible, and at a higher level, for integrating new parallelization methods or other rules to be used in the analysis. We also propose a second library, also active, oriented toward assisted parallelization, using algorithmic skeletons as the interface for the developer. It makes it possible to represent complete algorithms as assemblies of execution patterns: a sequence of tasks; repeated execution of a task in parallel; ... Using this knowledge, we can make choices in how tasks executed in parallel are distributed across the processors. Moreover, we chose to make the expression of data transmission between tasks explicit, contrary to what is usually done. Thanks to this, the library automates, in particular, the transmission of parameters that must not be shared by parallel tasks. This allows us in particular to guarantee the repeatability of executions, including when, for example, tasks use pseudo-random numbers. By taking into account the chosen execution policy and the possible numbers of processors, we reduce the required quantity of these parameters that must not be shared. Thus, this second library also offers a multi-level programming framework. It is extensible in terms of its execution policies and of the patterns used to build algorithmic skeletons. It can be used to define a variety of algorithmic skeletons, which a developer can then use to write programs whose parallelization is made easier.
... Unlike our approach, the other existing works in the literature typically focus on the parallelization of sequential code by applying transformations [7,21] that make some instructions, or sequences of them, parallel. Some other works focus on the analysis of the code in order to find independent tasks that can be executed in parallel [15]. Finally, Genetic Programming (GP) has also been applied to parallelize code [53,56], with the drawback that this technique cannot guarantee that the transformed parallel program keeps the same semantics as the original one. ...
... The tools rely on carefully inspecting all the possible data dependencies, in order to identify all concurrent methods in the source program. Fonseca et al. [15] propose a method to decompose the program into a number of tasks that can be executed in parallel while respecting the dependencies identified between tasks. They achieved a speedup of up to around 11x on a sample program on a 12-core machine. ...
Article
We present a novel method to automatically generate new parallel solvers for optimization problems, called the Virtual Savant. It applies machine learning to model a reference algorithm (which is treated as a black box) from its solutions to a given problem, and is afterwards able to efficiently and accurately reproduce its solutions on new, unseen problem instances. Additionally, the generated parallel algorithm is scalable to a different problem dimension. We analyze the performance and accuracy of our novel technique on a scheduling problem, using instances of different features and sizes. Virtual Savant was used to learn from an exact algorithm, a well-known heuristic, and a specialized parallel metaheuristic. During our experiments, our method solved to optimality up to 95.5% of the 200 unseen instances tested. For larger instances, it is not only highly competitive with the reference algorithms, but it is even able to outperform them in some cases. Another outstanding result is that Virtual Savant improved its accuracy when scaling the problem size (without additional training), with respect to the original algorithm.
... Fonseca et al. [17] studied an automatic parallelization system. This system analyzed memory accesses at compile time to determine the dependencies between two program parts. ...
Article
Recently, with the wide usage of multicore architectures, automatic parallelization has become a pressing issue. Speculative parallelization, one of the most popular automatic parallelization techniques, depends on estimating which code parts are probably parallelizable. This in turn motivates the use of data dependence detection techniques that report whether these code parts contain dependences, so that they can be parallelized. In this paper, we propose a runtime data-dependence detection technique that is based on abstract interpretation at the intermediate representation (IR) level. We apply our proposed approach to the most frequently visited blocks of the code, hot loops. Unlike most existing approaches, in which data analysis occurs at compile time, our proposed method conducts the analysis immediately while interpreting the code, which in turn saves the analysis time for potentially parallelizable loops. Specifically, the proposed technique relies on abstract interpretation to analyze the hot loops at runtime. This is done by first computing the abstract domain for each program point of a hot loop. Each abstract domain is computed incrementally until a fixpoint is reached for all program points, at which point the analysis terminates and the existence of data dependence can be determined. Once the analysis reports that a finished hot loop can be parallelized, the interpreter invokes the compiler to resume execution in parallel, as recommended by our proposed approach. The proposed technique is implemented on the LLVM compiler and then used to test dependence detection for a set of kernels from the Polybench framework, and the data dependence analysis required for each kernel is studied in terms of computation overhead.
... However, data race detection is a runtime checking method, which was not intended as a target for the present study. Task parallel executions using dependency graphs consisting of instructions in programs written in a PGAS language are also beyond the scope of this work [6]. ...
Conference Paper
Data independence of iterations of a loop statement in a partitioned global address space (PGAS) language is a sufficient condition to enable parallel processing of the loop iterations on distributed memories. However, checking data independence is generally difficult. In this paper, we propose the non-interference property of statements and design a sub-language of a directive-based PGAS language XcalableMP with a type system using the notion of vertex centricity. Although data independence and non-interference are generally mutually orthogonal, non-interference of a statement in the sub-language, which can be checked easily on the type system, implies data independence. We also implemented type checking on the Omni compiler for XcalableMP and confirmed the effectiveness of our approach using case studies of directive-based parallelization and temporal blocking optimization of stencil kernels.
... Due to the indexing nature of TBs in CUDA programming, we can also profile the data flow between thread blocks to identify dependencies and generate the dependency graph. In prior work [15] parallel task-based dependencies were extracted from sequential programs using a similar technique. ...
Conference Paper
GPUs lack fundamental support for data-dependent parallelism and synchronization. While CUDA Dynamic Parallelism signals progress in this direction, many limitations and challenges still remain. This paper introduces Wireframe, a hardware-software solution that enables generalized support for data-dependent parallelism and synchronization. Wireframe enables applications to naturally express execution dependencies across different thread blocks through a dependency graph abstraction at run-time, which is sent to the GPU hardware at kernel launch. At run-time, the hardware enforces the dependencies specified in the dependency graph through a dependency-aware thread block scheduler. Overall, Wireframe is able to improve total execution time by up to 65.20%, with an average of 45.07%.
... In fact, program dependence graphs [8], which we used in this work, are commonly used in the context of automatic parallelization. A recent study on automatic parallelization was performed by Fonseca et al. [10]. Their compiler parallelized general Java programs by extracting task parallelism with dependence analysis. ...
Article
MapReduce is a framework for large-scale data processing proposed by Google, and its open-source implementation, Hadoop MapReduce, is now widely used. Several language systems have been proposed to make developing MapReduce programs easier, for instance, Sawzall, FlumeJava, Pig, Hive, and Crunch. These language systems mainly target applications that can be naturally solved by using a MapReduce-like programming model. In this study, we propose a new MapReduce-program generator that accepts programs manipulating one-dimensional arrays. By using the proposed generator, users only need to write sequential programs to generate Hadoop MapReduce programs automatically. We applied some program optimization techniques to the generation of Hadoop MapReduce programs. In this paper, we also report our experiment results that compare programs generated by the proposed generator with hand-written MapReduce programs.
Article
It is now well established that the device scaling predicted by Moore's Law is no longer a viable option for increasing the clock frequency of future uniprocessor systems at the rate that had been sustained during the last two decades. As a result, future systems are rapidly moving from uniprocessor to multiprocessor configurations, so as to use parallelism instead of frequency scaling as the foundation for increased compute capacity. The dominant emerging multiprocessor structure for the future is a Non-Uniform Cluster Computing (NUCC) system with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies, and interconnected in horizontally scalable cluster configurations such as blade servers. Unlike previous generations of hardware evolution, this shift will have a major impact on existing software. Current OO language facilities for concurrent and distributed programming are inadequate for addressing the needs of NUCC systems because they do not support the notions of non-uniform data access within a node, or of tight coupling of distributed nodes. We have designed a modern object-oriented programming language, X10, for high performance, high productivity programming of NUCC systems. A member of the partitioned global address space family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities embodied in async, future, foreach, and ateach constructs; a construct for termination detection (finish); the use of lock-free synchronization (atomic blocks); and the manipulation of cluster-wide global data structures. We present an overview of the X10 programming model and language, experience with our reference implementation, and results from some initial productivity comparisons between the X10 and Java™ languages.
Conference Paper
There are billions of lines of sequential code inside today's software which do not benefit from the parallelism available in modern multicore architectures. Transforming legacy sequential code into a parallel version of the same programs is a complex and cumbersome task. Trying to perform such a transformation automatically and without the intervention of a developer has been a striking research objective for a long time. This work proposes an elegant way of achieving such a goal. By targeting a task-based runtime which manages execution using a task dependency graph, we developed a translator for sequential Java code which generates a highly parallel version of the same program. The translation process interprets the AST nodes for signatures such as read-write accesses and execution-flow modifications, among others, and generates a set of dependencies between executable tasks. This process has been applied to well-known problems, such as the recursive Fibonacci and FFT algorithms, resulting in versions capable of maximizing resource usage. For the case of two CPU-bound applications we were able to obtain 10.97x and 9.0x speedups on a 12-core machine.
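As a rough illustration of what a task-based version of the recursive Fibonacci mentioned in this abstract can look like, the following sketch uses the standard `java.util.concurrent` fork/join framework, which is also scheduled by work stealing. It is not the code emitted by the translator described above, nor the runtime it targets; the class name `FibTask` and the `SEQUENTIAL_CUTOFF` value are assumptions made for this example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical illustration only: a hand-written task decomposition,
// not the output of the translator described in the abstract.
class FibTask extends RecursiveTask<Long> {
    // Assumed cutoff: below it, spawning tasks costs more than it gains.
    static final int SEQUENTIAL_CUTOFF = 15;
    private final int n;

    FibTask(int n) {
        this.n = n;
    }

    // Plain sequential version, used for small subproblems.
    static long fibSequential(int n) {
        return n < 2 ? n : fibSequential(n - 1) + fibSequential(n - 2);
    }

    @Override
    protected Long compute() {
        if (n <= SEQUENTIAL_CUTOFF) {
            return fibSequential(n);
        }
        // fib(n-1) and fib(n-2) share no data, so one can run as a separate task.
        FibTask left = new FibTask(n - 1);
        left.fork();
        long right = new FibTask(n - 2).compute();
        // The addition depends on both results, so it waits for the forked task.
        return left.join() + right;
    }

    static long fibParallel(int n) {
        return ForkJoinPool.commonPool().invoke(new FibTask(n));
    }

    public static void main(String[] args) {
        System.out.println(fibParallel(30)); // prints 832040
    }
}
```

The only dependency edges in this sketch are from the two recursive calls to the final addition, which is why the two calls may overlap while the sum may not start early.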
Article
Recent compilers provide an incremental way of converting software toward accelerators. For instance, the PGI Accelerator [14] or HMPP [3] require the use of directives. The programmer must select the pieces of source that are to be executed on the accelerator, providing optional directives that act as hints for data allocations and transfers. The compiler generates all code automatically. JCUDA [15] offers a simpler interface to target CUDA from Java. Data transfers are automatically generated for each call. Arguments can be declared as IN, OUT, or INOUT to avoid useless transfers, but no piece of data can be kept in the GPU memory between two kernel launches. There have also been several initiatives to automate transformations for OpenMP annotated source code to CUDA [10, 11]. The GPU programming model and the host accelerator paradigm greatly restrict the potential of this approach, since OpenMP is designed for shared memory computers. Recent work [6, 9] adds extensions to OpenMP that account for CUDA specificity. These make programs easier to write, but the developer is still responsible for designing and writing communications code, and usually the programmer has to specialize his source code for a particular architecture.
Article
We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager Binary Splitting work-stealing (EBS) implemented in Intel's Threading Building Blocks (TBB), but improves performance and ease-of-programming. In its simplest form (SP), EBS requires manual tuning by repeatedly running the application under carefully controlled conditions to determine a stop-splitting-threshold (sst) for every do-all loop in the code. This threshold limits the parallelism and prevents excessive overheads for fine-grain parallelism. Besides being tedious, this tuning also over-fits the code to some particular dataset, platform and calling context of the do-all loop, resulting in poor performance portability for the code. LBS overcomes both the performance portability and ease-of-programming pitfalls of a manually fixed threshold by adapting dynamically to run-time conditions without requiring tuning. We compare LBS to Auto-Partitioner (AP), the latest default scheduler of TBB, which does not require manual tuning either but lacks context portability, and outperform it by 38.9% using TBB's default AP configuration, and by 16.2% after we tuned AP to our experimental platform. We also compare LBS to SP by manually finding SP's sst using a training dataset and then running both on a different execution dataset. LBS outperforms SP by 19.5% on average, while allowing for improved performance portability without requiring tedious manual tuning. LBS also outperforms SP with sst=1, its default value when undefined, by 56.7%, and serializing work-stealing (SWS), another work-stealer, by 54.7%. Finally, compared to serializing inner parallelism (SI), which has been used by OpenMP, LBS is 54.2% faster.
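The lazy-splitting idea in this abstract can be approximated with Java's fork/join framework, as in the sketch below: a task splits its remaining range only when its own work queue looks empty (so an idle worker would have nothing to steal), and otherwise keeps running iterations itself. This is a loose analogue written for illustration, not the TBB-based scheduler from the paper; the class name `LazySplitSum` and the `CHUNK` size are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch of lazy splitting for a do-all style reduction.
class LazySplitSum extends RecursiveTask<Long> {
    // Assumed number of iterations executed between two queue checks.
    static final int CHUNK = 1024;
    private final long[] data;
    private final int lo;
    private final int hi;

    LazySplitSum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        int from = lo;
        int to = hi;
        long sum = 0;
        List<LazySplitSum> forked = new ArrayList<>();
        while (from < to) {
            // Split only if nothing is queued locally for thieves to take.
            if (to - from > CHUNK && ForkJoinTask.getQueuedTaskCount() == 0) {
                int mid = from + (to - from) / 2;
                LazySplitSum upper = new LazySplitSum(data, mid, to);
                upper.fork();          // upper half becomes stealable work
                forked.add(upper);
                to = mid;              // keep the lower half for this task
            } else {
                // Otherwise run one chunk of iterations sequentially.
                int end = Math.min(from + CHUNK, to);
                for (int i = from; i < end; i++) {
                    sum += data[i];
                }
                from = end;
            }
        }
        // Join the most recently forked tasks first.
        for (int i = forked.size() - 1; i >= 0; i--) {
            sum += forked.get(i).join();
        }
        return sum;
    }

    static long sum(long[] data) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new LazySplitSum(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data));
    }
}
```

Unlike a fixed stop-splitting threshold, the amount of splitting here depends on how often other workers actually steal, which is the property the abstract attributes to LBS; the chunk size still bounds how frequently the queue is checked.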
Conference Paper
Reasoning precisely about the side effects of procedure calls is important to many program analyses. This paper introduces a technique for specifying and statically checking the side effects of methods in an object-oriented language. The technique uses data groups, which abstract over variables that are not in scope, and limits program behavior by two alias-confining restrictions, pivot uniqueness and owner exclusion. The technique is shown to achieve modular soundness and is simpler than previous attempts at solving this problem.
Article
Writing concurrent applications is extremely challenging, not only in terms of producing bug-free and maintainable software, but also for enabling developer productivity. In this article we present the Æminium concurrent-by-default programming language. Using Æminium programmers express data dependencies rather than control flow between instructions. Dependencies are expressed using permissions, which are used by the type system to automatically parallelize the application. The Æminium approach provides a modular and composable mechanism for writing concurrent applications, preventing data races in a provable way. This allows programmers to shift their attention from low-level, error-prone reasoning about thread interleaving and synchronization to focus on the core functionality of their applications. We study the semantics of Æminium through μÆminium, a sound core calculus that leverages permission flow to enable concurrent-by-default execution. After discussing our prototype implementation we present several case studies of our system. Our case studies show up to 6.5X speedup on an eight-core machine when leveraging data group permissions to manage access to shared state, and more than 70% higher throughput in a Web server application.
Conference Paper
The Programming Language Research Group at Sun Microsystems Laboratories seeks to apply lessons learned from the Java (TM) Programming Language to the next generation of programming languages. The Java language supports platform-independent parallel programming with explicit multithreading and explicit locks. As part of the DARPA program for High Productivity Computing Systems, we are developing Fortress, a language intended to support large-scale scientific computation. One of the design principles is that parallelism be encouraged everywhere (for example, it is intentionally just a little bit harder to write a sequential loop than a parallel loop). Another is to have rich mechanisms for encapsulation and abstraction; the idea is to have a fairly complicated language for library writers that enables them to write libraries that present a relatively simple set of interfaces to the application programmer. We will discuss ideas for using a rich polymorphic type system to organize multithreading and data distribution on large parallel machines. The net result is similar in some ways to data distribution facilities in other languages such as HPF and Chapel, but more open-ended, because in Fortress the facilities are defined by user-replaceable libraries rather than wired into the compiler.
Article
This article presents MP-Tomasulo, a dependency-aware automatic parallel task execution engine for sequential programs. Applying the instruction-level Tomasulo algorithm to MPSoC environments, MP-Tomasulo detects and eliminates Write-After-Write (WAW) and Write-After-Read (WAR) inter-task dependencies in the dataflow execution, thereby enabling out-of-order task execution on heterogeneous units. We implemented the prototype system within a single FPGA. Experimental results on EEMBC applications demonstrate that MP-Tomasulo can execute the tasks out-of-order to achieve as high as 93.6% to 97.6% of ideal peak speedup. A comparative study against a state-of-the-art dataflow execution scheme is illustrated with a classic JPEG application. The promising results show that MP-Tomasulo enables programmers to uncover more task-level parallelism on heterogeneous systems, as well as easing the burden on programmers.
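The dependency kinds named in this abstract can be made concrete with a small sketch that classifies the hazards between two tasks from their read and write sets. It covers only the detection step, under the assumption that each task declares the variables it reads and writes; it does not model the renaming that MP-Tomasulo uses to eliminate WAW and WAR dependencies. The names `TaskDeps`, `Task`, `Hazard`, and `setOf` are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: classify inter-task dependencies from read/write sets.
class TaskDeps {
    enum Hazard { RAW, WAR, WAW }

    // Assumed task descriptor: names of the variables a task reads and writes.
    static final class Task {
        final Set<String> reads;
        final Set<String> writes;

        Task(Set<String> reads, Set<String> writes) {
            this.reads = reads;
            this.writes = writes;
        }
    }

    static Set<String> setOf(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // 'earlier' precedes 'later' in the original sequential program order.
    static Set<Hazard> hazards(Task earlier, Task later) {
        Set<Hazard> found = EnumSet.noneOf(Hazard.class);
        if (!Collections.disjoint(earlier.writes, later.reads)) {
            found.add(Hazard.RAW); // true dependency: later needs earlier's result
        }
        if (!Collections.disjoint(earlier.reads, later.writes)) {
            found.add(Hazard.WAR); // later overwrites what earlier still reads
        }
        if (!Collections.disjoint(earlier.writes, later.writes)) {
            found.add(Hazard.WAW); // both write the same location
        }
        return found;
    }

    public static void main(String[] args) {
        Task t1 = new Task(setOf("a"), setOf("x"));
        Task t2 = new Task(setOf("x"), setOf("a"));
        Task t3 = new Task(setOf("b"), setOf("x"));
        System.out.println(hazards(t1, t2)); // prints [RAW, WAR]
        System.out.println(hazards(t1, t3)); // prints [WAW]
    }
}
```

Two tasks with an empty hazard set can run in either order or concurrently; RAW hazards must be respected, whereas WAR and WAW hazards are the ones a renaming scheme can remove.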
Article
In this paper we consider productivity challenges for parallel programmers and explore ways that parallel language design might help improve end-user productivity. We offer a candidate list of desirable qualities for a parallel programming language, and describe how these qualities are addressed in the design of the Chapel language. In doing so, we provide an overview of Chapel's features and how they help address parallel productivity. We also survey current techniques for parallel programming and describe ways in which we consider them to fall short of our idealized productive programming model.