Exploiting High-Performance Heterogeneous Hardware for Java Programs using Graal

James Clarkson
The University of Manchester
Manchester, UK
james.clarkson@manchester.ac.uk
Juan Fumero
The University of Manchester
Manchester, UK
juan.fumero@manchester.ac.uk
Michail Papadimitriou
The University of Manchester
Manchester, UK
michail.papadimitriou@manchester.ac.uk
Foivos S. Zakkak
The University of Manchester
Manchester, UK
foivos.zakkak@manchester.ac.uk
Maria Xekalaki
The University of Manchester
Manchester, UK
maria.xekalaki@manchester.ac.uk
Christos Kotselidis
The University of Manchester
Manchester, UK
christos.kotselidis@manchester.ac.uk
Mikel Luján
The University of Manchester
Manchester, UK
mikel.lujan@manchester.ac.uk
ABSTRACT
The proliferation of heterogeneous hardware in recent years means
that every system we program is likely to include a mix of compute
elements, each with different characteristics. By utilizing these
available hardware resources, developers can improve the performance
and energy efficiency of their applications. However, existing tools
for heterogeneous programming neglect developers who do not
have the time or inclination to switch programming languages or
learn the intricacies of a specific piece of hardware.
This paper presents a framework that enables Java applications
to be deployed across a variety of heterogeneous systems while
exploiting any available multi- or many-core processor. The novel
aspect of our approach is that it does not require any a priori
knowledge of the hardware, or for the developer to worry about managing
disparate memory spaces. Java applications are transparently
compiled and optimized for the hardware at run-time.
We also present a performance evaluation of our just-in-time
(JIT) compiler using a framework to accelerate SLAM, a complex
computer vision application entirely written in Java. We show that
we can accelerate SLAM up to 150x compared to the Java reference
implementation, rendering 107 frames per second (FPS).
CCS CONCEPTS
• Software and its engineering → Virtual machines; Just-in-time compilers;
KEYWORDS
Heterogeneous Hardware, Java, Virtual Machine, Graal, OpenCL
ManLang’18, September 12–14, 2018, Linz, Austria
©2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
This is the author’s version of the work. It is posted here for your personal use. Not
for redistribution. The definitive Version of Record was published in 15th International
Conference on Managed Languages & Runtimes (ManLang’18), September 12–14, 2018,
Linz, Austria, https://doi.org/10.1145/3237009.3237016.
ACM Reference Format:
James Clarkson, Juan Fumero, Michail Papadimitriou, Foivos S. Zakkak,
Maria Xekalaki, Christos Kotselidis, and Mikel Luján. 2018. Exploiting High-
Performance Heterogeneous Hardware for Java Programs using Graal. In
15th International Conference on Managed Languages & Runtimes (ManLang’18),
September 12–14, 2018, Linz, Austria. ACM, New York, NY, USA,
13 pages. https://doi.org/10.1145/3237009.3237016
1 INTRODUCTION
The majority of programming languages used by everyday developers
make the fundamental assumption that the whole program will
execute on a single processor. Moreover, the portability of these
languages is due to the fact that the majority of systems used today
run on the same type of processor, whether it be x86, POWER,
ARM or MIPS. In this homogeneous world, software development
is simplified because computing systems use only a single type of
processor, in either a single-core or multi-core configuration. As a
result, porting languages in their entirety to a different processor
architecture was enough. Until recently, we did not have the need, or
requirement, to use multiple processors of different architectures
within a single application.
The pervasion of hardware accelerators into mainstream computing
systems is rapidly changing the programming landscape.
For example, we can find general purpose graphics accelerators,
or GPGPUs, in mobile phones, tablets, laptops, PCs, and servers.
Since these accelerators are programmable, it is natural to assume
that developers wish to utilize them in order to achieve improvements
in performance and/or energy efficiency. However, in order
to develop effective programming languages for this heterogeneous
hardware we need to invalidate one long-standing assumption:
that all applications execute exclusively on homogeneous hardware.
To this end, heterogeneous programming languages have emerged.
These languages are specifically designed so that applications can
utilize multiple types of devices in concert. The popular ones, like
CUDA [11], OpenCL [24], and OpenACC [38], are born out of the
necessity to efficiently program GPGPUs. The consequence of this
is that they all adopt a position where work is offloaded from a
host-device onto an accelerator, mirroring how the rendering of
complex computer graphics is offloaded from a processor onto a
GPU. For example, languages such as OpenACC are geared towards
the creation of applications which offload computation onto one
or more devices of the same type. A true heterogeneous language,
however, needs to offer more to developers, such as the ability to
construct complex processing pipelines across multiple devices, or
the ability to map computation onto the device which is closest to
the data it needs to process. These efforts led us to the design and
implementation of Tornado, a framework that enables the execution
of managed languages on any OpenCL compatible device such as
CPUs, GPUs, FPGAs, and Intel Xeon Phi. Tornado builds upon our
previous work [9, 33] and is part of the Beehive Ecosystem [1, 44].
Tornado ultimately aims to assist developers to transparently
execute their code on any OpenCL compatible device. Furthermore,
by exploiting the portability of the Java language, via the
Java Virtual Machine (JVM), Tornado is able to execute across any
JVM compatible architecture, extending its reach beyond that
of existing approaches. Finally, we showcase Tornado’s maturity
and ability to execute real-world complex applications by accelerating
a Java version of the Kinect Fusion (KF) application, an
application that is typically beyond the computational capability of
non-hardware-accelerated implementations.
More specifically, this paper makes the following contributions:
• It presents Tornado, a heterogeneous programming framework
for Java that, through JIT compilation, transparently
accelerates applications using hardware accelerators.
• It presents how Tornado can be used to accelerate a complex
computer vision application that is written entirely in Java.
• It evaluates the performance of a complex computer vision
application written using Tornado and demonstrates
throughput of up to 107 frames per second (FPS).
2 RELATED WORK
This section reviews the work most closely related to our approach.
We focus on Java JIT compilation for heterogeneous computing
and, in particular, for GPUs.
Parallel API. Aparapi [2] is one of the most well-known APIs and
JIT compilers from Java byte-code to OpenCL. Aparapi programmers
extend their classes from a common Aparapi base-class and
override a run method. Data within the kernel can be accessed if it
is declared in the same lexical scope or in fields of the same class.
However, Java exceptions, allocation of new arrays, and creation of
objects are not allowed. Moreover, only programs that follow the
parallel map semantics can be expressed with Aparapi.
Sumatra [39] uses the new Java 8 Stream API to generate parallel
code for HSAIL architectures using Graal. Sumatra makes use of
the forEach construct (map parallel semantics) and offloads, at
run-time, the Java code passed to the construct to HSAIL (a new
assembly standard for heterogeneous devices).
Other frameworks such as JOCL [31] and JCUDA [43] provide OpenCL
and CUDA wrappers for Java. Using these frameworks, programmers have to
explicitly implement their kernels in OpenCL or CUDA. This is a
handicap for many high-level users, because it requires knowledge
about the new parallel architecture and programming model.
Fumero et al. [22, 23] provide a Java API for function composition
to program heterogeneous hardware. The API follows a pure
functional style within Java to easily identify and generate parallel
code. That API, however, relies on Java lambdas and well-known
parallel skeletons, such as map, reduce and pipeline, thus requiring
the user to change the code from a non-functional style.
Tornado provides just a few annotations to add to existing
sequential Java code (in a similar way to OpenMP or OpenACC) to
inform the compiler that certain loops are potentially parallel and
candidates to execute on parallel hardware (e.g., GPUs). It also
provides a very light API to group methods (task-schedule) in order
to optimize data transfers.
GPU Compilation for Java. Liquid Metal and the Lime Compiler [3, 15]
are a runtime and a language implementation based
on Java to execute on GPUs and FPGAs. The Lime compiler statically
generates heterogeneous code. Habanero Java [26] is also a
Java based language that generates OpenCL code at run-time. It
combines compile-time and run-time code generation. In a similar
way, Rootbeer [40] statically generates CUDA for CUDA devices.
With our approach in Tornado, we generate code at run-time. This
allows us to specialize the generated code depending on the target
device, as we show in Section 3.3.
JaBEE [45] is a compiler framework that generates, at run-time,
CUDA from Java byte-code. JaBEE supports many Java features
such as virtual methods and exceptions. However, in our opinion,
the reported performance is not good. Tornado makes use of existing
compiler optimizations of the JVM, such as escape analysis,
aggressive inlining and loop unrolling, that improve the generated
code. Tornado also contains an optimizing runtime that improves
data movement between Java and the accelerator.
Ishizaki et al. [30] present a GPU JIT compiler for generating
CUDA code for Java collections. In a similar way to Sumatra [39],
their compiler generates parallel code from Java lambda expressions
for the forEach construct of the Java Stream API. To take advantage
of the automatic GPU JIT compilation, programmers need to adapt
their code to use these streams. Tornado takes a different approach
in which changes to the source code are minimal (as we show
in Section 3). Moreover, Tornado can compile arbitrary Java
code, not only lambda expressions representing the map parallel
semantics.
GPU JIT compilation for other managed languages. Several works
generate GPU code at run-time from high-level and interpreted
languages, including Haskell [8, 28, 34], Python [4, 6, 32, 41],
Scala [7, 37], MATLAB [4, 10], JavaScript [29], Lua [10], R [21] and
Ruby [42]. As Fumero et al. [21] have demonstrated, the presented
solution for Java can be extended to other high-level
programming languages.
Summary. Tornado differs from prior work by: 1) not using a
super-set of the Java language [3, 27], 2) not using ahead-of-time
compilation [13, 40], 3) not requiring developers to write heterogeneous
code in another language [14, 31], 4) not requiring manual
parallelization of kernels [2], and 5) supporting native library calls.
Furthermore, to the best of our knowledge, Tornado is the first
framework that is able to automatically accelerate Java computer
vision applications on GPGPUs, as we show in detail in Section 4.
Figure 1: Tornado Overview
3 TORNADO
Tornado is a Java-based parallel programming framework that enables
managed programming languages to take advantage of heterogeneous
hardware platforms. Figure 1 shows an overview of
Tornado, which comprises three main software-layers:
Tornado API: a parallel API which enables developers to identify
loops that can be executed in parallel. It also provides
an API to compose and build a pipeline of multiple tasks, in
which dependencies and optimizations between them are
automatically managed in our runtime.
Tornado Runtime: an optimizing runtime that performs data
dependence analysis, optimizes data transfers, and orchestrates
the parallel execution between the Java host and the
target parallel devices.
Tornado JIT Compiler: a JIT compiler that dynamically generates
heterogeneous and optimized machine code for the
target devices.
The following sections describe each component in more detail.
3.1 Tornado API
To efficiently and safely compile Java code for heterogeneous platforms,
Tornado relies on a minimal API that works alongside existing
code without requiring developers to re-write their code. We
achieve that by providing support for expressing data-parallelism
and allowing developers to mark up the induction variable of any
data-parallel loop with the @Parallel annotation. This signals to the
compiler that each iteration of the loop can be executed independently
and that, consequently, it is safe to parallelize it. Note that
the use of the annotation does not provide any guarantees that the
loop will be parallelized or any information about how it can be
parallelized; it states only that each loop iteration can be performed
independently. If the code is executed on a machine without hardware
accelerators, the JVM will simply ignore any Tornado annotations.
A key feature of Tornado is its portability across different hardware
platforms, as we show in Section 4. For this reason, Tornado
prohibits developers from deliberately parallelizing code for a specific
architecture by not providing a mechanism to explicitly map
code onto individual threads. The parallelization is applied automatically
by the compiler and is discussed further in Section 3.3.
Furthermore, Tornado encourages developers not to specialize data-parallel
code for a specific accelerator by using techniques such
as loop-tiling, as these are device specific and therefore restrict
portability.
Listing 1: A simple Tornado example of array addition.
    public class Compute {
        public void add(int[] a, int[] b, int[] c) {
            for (@Parallel int i = 0; i < c.length; i++) {
                c[i] = a[i] + b[i];
            }
        }
        public void compute(int[] a, int[] b, int[] c) {
            TaskSchedule s = new TaskSchedule("s0")
                .task("t0", this::add, a, b, c)
                .streamOut(c).execute();
        }
    }
To manage the execution of Java code on parallel hardware,
Tornado employs a task-based programming model. A task is a
reference to an existing Java method that has the potential to compute
data-parallel code. Each task encapsulates: a) the code to execute,
b) the data it should operate on, and c) meta-data that contains
both the compiler and runtime configurations for the task. As
data-parallel code is always enclosed within a task, we use tasks as the
basic unit of execution for heterogeneous code. Developers have
the ability to map each task onto a different device in the following
ways: in the application (either statically or dynamically), automatically
by the Tornado runtime system, or as a tuning parameter on
the command line.
3.1.1 Task-Schedules. A key feature of Tornado is composability:
the ability to write applications with many tasks that have
complex data-dependencies. To achieve that, tasks are not executed
directly by the application. Instead, they are scheduled indirectly
via a task-schedule which provides the Tornado runtime system
greater scope for optimizing the execution of tasks. A task-schedule
is simply a group of multiple tasks. Therefore, it is a group of multiple
Java methods executing data parallel code. A task-schedule
provides developers with an easy way to compose complex processing
pipelines which might run multiple tasks across multiple
accelerators. The task-schedule exists to shield developers from
the complexities of scheduling data-movement in complex applications.
The result is that Tornado is able to infer all data-movement
from the task-schedule and automatically exploit any available
task-parallelism within it. Moreover, Tornado enables task-schedules to
be executed asynchronously and, hence, developers do not need to
wait for their completion. This allows developers to automatically
overlap code execution between the application running on the
JVM and the code running on the accelerators.
Figure 2: Data Management. (a) Schedule with nodes copy(a), copy(b),
alloc(c), and add(): by default, data are left in situ so that they can
be reused by subsequent tasks running on the same device. (b) Schedule
with the additional node copy(c): an explicit copy needs to be generated
using the streamOut(c) operator to transfer the data back to the host.
Listing 1 illustrates a simple Tornado example of adding two
arrays of integers. To execute the add method on a hardware accelerator,
a task-schedule, s0, that contains a single task, t0, is created.
Here the task-schedule can be thought of as a lexical closure: task
t0 will invoke the add method with parameters a, b, and c, but only
when the task-schedule is executed. In Tornado, task-schedules are
represented internally as data-flow graphs (also called task-graphs)
that explicitly model the data-movement between tasks. The nodes
in these graphs are individual tasks that could be executing code,
allocating memory, or transferring data. The benefit of using this
representation is that it is straightforward to generate an optimal
execution schedule that exploits task-parallelism but satisfies all
data-dependencies.
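The closure-like behaviour of a task-schedule can be illustrated with a minimal plain-Java sketch. It does not use Tornado's API: the class DeferredSchedule and its methods are hypothetical names introduced only for this illustration, and the tasks run sequentially on the JVM. The point it shows is that a task and its arguments are recorded when the schedule is built, while the work itself happens only when execute is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a task-schedule; this is NOT Tornado's TaskSchedule class.
final class DeferredSchedule {
    private final String name;
    private final List<String> taskNames = new ArrayList<>();
    private final List<Runnable> tasks = new ArrayList<>();

    DeferredSchedule(String name) {
        this.name = name;
    }

    // Records a task without running it; the Runnable captures its arguments.
    DeferredSchedule task(String taskName, Runnable body) {
        taskNames.add(name + "." + taskName);
        tasks.add(body);
        return this;
    }

    // Runs the recorded tasks in the order they were added.
    void execute() {
        for (Runnable task : tasks) {
            task.run();
        }
    }

    List<String> taskNames() {
        return taskNames;
    }

    // Sequential version of the add method from Listing 1.
    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = {10, 20, 30};
        int[] c = new int[3];
        DeferredSchedule s0 = new DeferredSchedule("s0").task("t0", () -> add(a, b, c));
        System.out.println("before execute: " + Arrays.toString(c));
        s0.execute();
        System.out.println("after execute:  " + Arrays.toString(c));
    }
}
```

In this sketch the schedule is only an ordered list; a real runtime would, as the text describes, turn it into a data-flow graph and add nodes for allocation and data transfer.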
3.1.2 Data Transfers. As each task has the potential to execute
on a different device, managing data-movement is critical to obtain
high performance. Tornado manages all data-movement within
the task-schedule automatically. By default, all reference types
(objects and arrays) are copied onto the accelerator automatically
the first time they are used but are never copied back to the host.
This way the data will always remain on the last device on which
it was created and an explicit request must be made to transfer it
back to the host; hence the use of the streamOut operator. Figure 2
illustrates the semantics of the task-schedule defined in Listing 1
and why the streamOut operator is necessary to transfer c back to
the host. Although this may seem counter-intuitive for developers
who are used to shared memory programming, it provides a
new dimension of optimization for heterogeneous programming:
locality. On balance, the ability to exploit locality in this way
helps to dramatically increase performance, as opposed to enforcing
coherency between disparate physical memories, which severely
degrades performance. Finally, since the task-schedule works as a
closure it is also a synchronization point. That means that when
control is returned from execute, the host is guaranteed to view
all required memory updates.
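The copy-in-once, copy-out-on-request semantics can be modelled with a small plain-Java simulation. This is a sketch under stated assumptions, not Tornado code: SimulatedDevice, onDevice, and copyIns are hypothetical names, the "device memory" is just a map of cloned arrays, and the output array is copied in rather than merely allocated (Figure 2 shows an alloc node for c instead).

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical model of one accelerator's memory; the names are illustrative only.
final class SimulatedDevice {
    // Device-side copies, keyed by the identity of the host array.
    private final Map<int[], int[]> deviceCopies = new IdentityHashMap<>();
    private int copyIns = 0;

    // Copies a host array to the device on first use and reuses the copy afterwards.
    private int[] onDevice(int[] host) {
        int[] dev = deviceCopies.get(host);
        if (dev == null) {
            dev = host.clone();
            deviceCopies.put(host, dev);
            copyIns++;
        }
        return dev;
    }

    // Runs the "kernel" on device-side copies only; host arrays are not modified.
    void add(int[] a, int[] b, int[] c) {
        int[] da = onDevice(a);
        int[] db = onDevice(b);
        int[] dc = onDevice(c);
        for (int i = 0; i < dc.length; i++) {
            dc[i] = da[i] + db[i];
        }
    }

    // Explicit transfer back to the host, playing the role of streamOut.
    void streamOut(int[] host) {
        int[] dev = deviceCopies.get(host);
        if (dev != null) {
            System.arraycopy(dev, 0, host, 0, host.length);
        }
    }

    int copyIns() {
        return copyIns;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = {10, 20, 30};
        int[] c = new int[3];
        SimulatedDevice device = new SimulatedDevice();
        device.add(a, b, c);
        System.out.println("host c before streamOut: " + Arrays.toString(c));
        device.streamOut(c);
        System.out.println("host c after streamOut:  " + Arrays.toString(c));
    }
}
```

Running the kernel a second time on the same arrays performs no further copy-ins, which is the locality benefit the section describes; the host only observes the result after the explicit transfer.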
3.1.3 Task Execution. By default, Tornado will execute all tasks
on the first accelerator it finds in the system. However, the API
allows assigning names to each task-schedule and task (s0 and t0
respectively in Listing 1). This way, task-schedules and tasks
can be referenced by name and configured on the command line
(or elsewhere in the application). Therefore, different properties,
such as the accelerator on which to execute a set of tasks, are not
embedded in the source code. This immediately benefits the developer by
enabling the configuration of the application without the need for
modifications to the source code and a re-compilation of the whole
application.
3.1.4 Summary of the API. Tornado provides a minimal and
clean API for achieving heterogeneous execution of Java applications.
It is mainly based on the notion of tasks and task-schedules
that are compositions of calls to existing code, allowing code to
be re-used extensively. Furthermore, the non-intrusive annotations
can be used by developers to improve performance without sacrificing
backwards-compatibility since they are ignored by the compiler
when Tornado is not activated.
3.2 Tornado Runtime
Figure 3 illustrates a more detailed overview of the Tornado components
as well as their interaction along with a typical execution flow.
In this section, we briefly discuss each step of the execution flow
and the actions taken by Tornado. The execution is driven by task-schedules
provided by the developer, which describe a data-flow
graph of tasks (referred to as a task-graph).
Task Graph Optimizer. A task-graph is constructed the first time a
task-schedule is executed via execute or schedule and is passed to
the Graph Optimizer (1). At this stage the task-graph only contains
tasks which execute code. Since Tornado handles data-transfers
automatically, the next step is to augment the task-graph with
tasks (or nodes) that handle data-transfers.
Tornado Sketchers. To achieve this, the Graph Optimizer passes
each node of the task-graph to the Sketcher which creates a sketch
of the code that is executed by the node (2). Essentially, the sketcher
constructs a High-level Intermediate Representation (HIR) of each
task from Java byte-code and places it in a HIR cache so it can
be retrieved by the code generator in the future. The Graph Optimizer
is also able to query each sketch to aid in the optimization
of the task-graph. For example, the sketcher determines the usage
of every object accessed within a task, which can be read-only,
write-only, read and write, or unknown. Knowing this information,
the graph optimizer is able to fully populate the task-graph with
the nodes to perform the required data-transfers. This means that
data-movement is automatically inferred from the code, unlike
OpenACC, OpenMP and OpenCL, where developers have to manually
handle data-movement.
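As an illustration of how access information could drive the insertion of transfer nodes, the following plain-Java sketch maps each object's usage to a node. The class TransferInference, the Access enum, and the mapping rule are assumptions made for this example: write-only objects get an alloc node, and everything else (including unknown usage) is conservatively copied in. With the usage of Listing 1 this reproduces the node set shown in Figure 2(a), but it is not taken from Tornado's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: derive data-transfer nodes from per-object access modes.
final class TransferInference {
    enum Access { READ_ONLY, WRITE_ONLY, READ_WRITE, UNKNOWN }

    // Returns the transfer/allocation nodes followed by the task node itself.
    static List<String> populate(String task, Map<String, Access> usage) {
        List<String> nodes = new ArrayList<>();
        for (Map.Entry<String, Access> entry : usage.entrySet()) {
            switch (entry.getValue()) {
                case WRITE_ONLY:
                    // Assumed rule: the device never reads it, so allocation suffices.
                    nodes.add("alloc(" + entry.getKey() + ")");
                    break;
                default:
                    // Assumed rule: read, read-write, and unknown usage need a copy-in.
                    nodes.add("copy(" + entry.getKey() + ")");
                    break;
            }
        }
        nodes.add(task + "()");
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Access> usage = new LinkedHashMap<>();
        usage.put("a", Access.READ_ONLY);
        usage.put("b", Access.READ_ONLY);
        usage.put("c", Access.WRITE_ONLY);
        System.out.println(populate("add", usage));
    }
}
```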
This “split-compilation” approach is highly efficient since the
HIR graphs of the tasks are constructed once and can be used
multiple times, either to compile tasks for different devices or to
re-compile for the same device using a different set of optimizations.
Additionally, the cache decouples the front-end and the back-end
of the compiler, making it possible for multiple back-ends to share
the same front-end. This is desirable because it makes it possible to
use different heterogeneous frameworks and code generators like
OpenCL [24], CUDA/PTX [11, 12] or HSA/HSAIL [19, 20].
Once all sketches are available, the optimizer tries to eliminate
as many unnecessary nodes as possible from the task-graph and
then generates an optimal execution schedule for the tasks. Here
the optimizer aims to minimize the length of the critical path by
overlapping data-transfers with code execution where
possible. The result is a serialized list of low-level tasks, each one
describing an action that is to be applied to an abstract accelerator,
e.g., data-transfers and code execution (4). To avoid having
to repeatedly call the Graph Optimizer, the serialized schedule is
cached.
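The serialization step can be approximated by a topological sort of the task-graph. The sketch below uses Kahn's algorithm, which is a technique swapped in for illustration: it yields an order that respects every dependency but makes no attempt to minimize the critical path or overlap transfers with execution, as the optimizer described above does. ScheduleSerializer and the node labels are hypothetical, and every node is assumed to appear as a key in the dependency map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: serialize a task-graph into a dependency-respecting list.
final class ScheduleSerializer {
    // deps maps each node to the nodes it depends on; every node must be a key.
    static List<String> serialize(Map<String, List<String>> deps) {
        Map<String, Integer> pending = new LinkedHashMap<>();
        Map<String, List<String>> dependents = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : deps.entrySet()) {
            pending.put(entry.getKey(), entry.getValue().size());
            for (String dependency : entry.getValue()) {
                dependents.computeIfAbsent(dependency, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : pending.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.poll();
            order.add(node);
            for (String next : dependents.getOrDefault(node, Collections.emptyList())) {
                // A node becomes ready once all of its dependencies were emitted.
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("cycle or undeclared node in task-graph");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("copy(a)", Collections.emptyList());
        deps.put("copy(b)", Collections.emptyList());
        deps.put("alloc(c)", Collections.emptyList());
        deps.put("add()", Arrays.asList("copy(a)", "copy(b)", "alloc(c)"));
        deps.put("copy(c)", Arrays.asList("add()"));
        System.out.println(serialize(deps));
    }
}
```

Caching the returned list, keyed by the task-schedule, would correspond to the caching of the serialized schedule mentioned above.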
Figure 3: Tornado Architecture Outline.
Execution Engine. The execution of the task schedule is performed
by passing the serialized schedule to the Task Executor (5).
The Task Executor reads the serialized tasks in order and translates
them into calls to the driver API (6), in our case the OpenCL
Runtime API. If the current task executes code on the accelerator,
the Task Executor retrieves the compiled code from a code cache
(7). In the event that no compiled code exists in the cache, a HIR
compilation will be triggered. Note that in this case the HIR will
be retrieved from the HIR cache. Furthermore, any parallelization
strategy or device specific built-ins are applied to the HIR at this
stage. The output of the code generator is compatible with the
driver API; in our case we generate OpenCL C code.
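The look-up-then-compile behaviour of the code cache can be sketched in plain Java as follows. CompiledCodeCache, the string key format, and the use of strings to stand in for compiled kernels are all assumptions for illustration; the sketch only shows that compilation is triggered on a cache miss and that code is cached per task and device.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiFunction;

// Hypothetical sketch of a code cache keyed by task and target device.
final class CompiledCodeCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicInteger compilations = new AtomicInteger();
    private final BiFunction<String, String, String> compiler;

    // The compiler function stands in for the JIT: (task, device) -> generated code.
    CompiledCodeCache(BiFunction<String, String, String> compiler) {
        this.compiler = compiler;
    }

    // Returns cached code, compiling only when no entry exists for this pair.
    String get(String task, String device) {
        return cache.computeIfAbsent(task + "@" + device, key -> {
            compilations.incrementAndGet();
            return compiler.apply(task, device);
        });
    }

    int compilations() {
        return compilations.get();
    }

    public static void main(String[] args) {
        CompiledCodeCache cache =
            new CompiledCodeCache((task, device) -> "kernel for " + task + " on " + device);
        cache.get("s0.t0", "gpu");
        cache.get("s0.t0", "gpu"); // served from the cache
        cache.get("s0.t0", "cpu"); // different device, compiled separately
        System.out.println("compilations: " + cache.compilations());
    }
}
```

Keying on the device as well as the task reflects the statement that device-specific parallelization is applied at compilation time, so code generated for one device cannot simply be reused for another.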
3.3 Tornado JIT Compiler
In Tornado, the JIT compiler is responsible for both parallelizing
code and generating the Driver API code. The Tornado JIT compiler
is built using Graal’s API and compiler framework [16, 17]. Graal is
an industrial strength JIT compiler with the ability to generate machine
code directly from Java byte-code. Tornado augments Graal
with the ability to parallelize code (discussed later) and generate
the Driver API code.
Figure 4 shows the main workflow of Tornado, emphasizing the
JIT compilation part. On the top left is a Java method that performs
a vector addition. Note that the loop is annotated with @Parallel.
As usual, the method is compiled by javac into Java byte-code.
Then, at run-time, Tornado analyzes the data dependencies, performs
optimizations, generates the Driver API code (right side of
Figure 4), and orchestrates the parallel execution. As indicated by
the gray box on the lower left side of Figure 4, Tornado utilizes the
Graal Compiler API and JVM compiler interface (JVMCI) to interact
with the JVM. This interaction enables one of the key features of
the Tornado JIT compiler: the integration of the Tornado optimization
pipeline with the JVM optimization pipeline. This integration
allows for better optimization by combining typical compiler optimizations,
such as inlining and loop unrolling, with automatic
parallelization. Moreover, it ensures proper Java semantics for both
the code executed on the JVM host as well as for the code executed
on the different modules of the underlying hardware.
Figure 4: The Tornado workflow.
Apart from the analyses and optimization passes which are already
present in Graal, Tornado also applies the following optimization
phases to the IR-graph:
Tornado Data Flow Analysis: Analyzes data dependencies between tasks.
Tornado Reduce Replacements: Detects reductions and performs node
replacement with special Tornado reduction nodes.
Tornado Task Specialization: Specializes the IR-graph by inlining
fields and objects.
Tornado Driver API Intrinsics: Sets nodes for Driver API
intrinsics such as barriers and debug information.
Tornado Snippet Post-Processing: Processes non-lowerable
IR nodes that are introduced from snippets during the lowering phases.
Tornado Shape Analysis: Analyzes the loop index space and
determines the correct indices when using strides in loops.
Tornado Parallel Scheduler: Optimizes data-parallel loops
for the target device.
Presently, the Tornado JIT compiler supports two parallelization
schemes: 1) the assignment of a thread to a block of iterations (block
mapping), and 2) the assignment of one thread to each iteration
in the loop. By default, the choice of scheme is governed by the
type of the target device, but it can also be configured dynamically.
The first scheme provides a coarser thread granularity
which suits latency oriented devices, such as x86 cores, whereas the
second provides a finer thread granularity which is preferred by
throughput oriented devices, like GPGPUs. Listings 2 and 3 demonstrate
the generated Driver API code after the parallel scheduler
phase transforms the add function from the example provided in
Listing 1. Listing 2 shows the result of assigning a thread to a block
of iterations, suitable for latency oriented devices. In contrast,
Listing 3 shows the result of assigning one thread to each iteration
in the loop, better suited for throughput oriented devices.
Listing 2: Loop Re-Written For CPUs.
    int id = get_global_id(0);
    int size = get_global_size(0);
    int block_size = (c.length + size - 1) / size;
    int start = id * block_size;
    int end = min(start + block_size, c.length);
    for (int i = start; i < end; i++) {
        c[i] = a[i] + b[i];
    }
Listing 3: Loop Re-Written For GPGPUs.
    int idx = get_global_id(0);
    int size = get_global_size(0);
    for (int i = idx; i < c.length; i += size) {
        c[i] = a[i] + b[i];
    }
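The two mappings in Listings 2 and 3 can be checked on the JVM with a small simulation. In this sketch the OpenCL built-ins get_global_id(0) and get_global_size(0) are replaced by the explicit parameters id and threads, and the "threads" are simply executed one after another; LoopMappings and its method names are hypothetical. Both mappings compute every element of c for any number of threads, including when the array length is not a multiple of the thread count.

```java
import java.util.Arrays;

// Hypothetical JVM simulation of the two loop mappings from Listings 2 and 3.
final class LoopMappings {
    // Block mapping: thread `id` handles one contiguous block of iterations.
    static void addBlock(int id, int threads, int[] a, int[] b, int[] c) {
        int blockSize = (c.length + threads - 1) / threads; // ceiling division
        int start = id * blockSize;
        int end = Math.min(start + blockSize, c.length);
        for (int i = start; i < end; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Strided mapping: thread `id` starts at its own index and steps by `threads`.
    static void addStrided(int id, int threads, int[] a, int[] b, int[] c) {
        for (int i = id; i < c.length; i += threads) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int n = 10;
        int threads = 4;
        int[] a = new int[n];
        int[] b = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i + 1;
            b[i] = 100;
        }
        int[] blockResult = new int[n];
        int[] stridedResult = new int[n];
        for (int id = 0; id < threads; id++) {
            addBlock(id, threads, a, b, blockResult);
            addStrided(id, threads, a, b, stridedResult);
        }
        System.out.println(Arrays.equals(blockResult, stridedResult));
    }
}
```

When the strided mapping is launched with as many threads as iterations, each thread executes exactly one iteration, which is the one-thread-per-iteration case described in the text.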
Figure 5 illustrates these compiler transformations at the IR level
when optimizing for throughput oriented devices. The left side of
the figure corresponds to the IR before the optimization and the
right to the IR after it. Each rectangle represents a node in the
IR-graph, solid arrows indicate the control flow, and dashed arrows
indicate the data flow. The right side of the figure shows the result
of applying this transformation to execute on a GPU (Listing 3).
Note that two new nodes appear (GlobalID and GlobalSize). These
nodes are later translated to OpenCL intrinsics to obtain the correct
indexes from the loop iteration space on the parallel hardware.
Figure 5: GPU Parallel Scheduler transformation.
OpenCL C as the Driver API. As is evident from Figures 3 and 4, as
well as from Listings 2 and 3, the current implementation of Tornado
uses OpenCL C as the Driver API. The decision to use OpenCL C as
the Driver API was mainly based on the fact that OpenCL supports
a plethora of devices, allowing us to accelerate Java applications
on all these devices with a single back-end. Although
OpenCL allows us to target a wide and diverse range of accelerators,
it prohibits us from implementing some vital features of the Java
language (e.g., exception handling) while it introduces difficulties
when implementing others (e.g., objects). Furthermore, OpenCL
comes at the cost of invoking a second JIT compiler. We first need
to compile Java byte-code to OpenCL C using the Tornado JIT
compiler and then compile the OpenCL C to machine code using an
OpenCL-compatible compiler. These issues can be resolved, albeit at
the cost of developing multiple back-ends, by generating PTX [12]
or HSAIL [20] code for the Driver API.
3.4 Java Coverage
Throughout the development of Tornado, we have found that the only real constraint is the support of features which require calls, either internally to the JVM or externally to a native library or the OS. Typically, this means that features like Java reflection, I/O, or the threading API cannot be used inside Tornado tasks. The reason behind these restrictions is that tasks need to be able to execute on devices other than the one hosting the OS. Theoretically, Tornado can support the majority of the Java language. However, its ability to do so depends on the type of the generated low-level code. For example, a major issue we discovered using OpenCL C is the lack of support for direct branches, which inhibits our ability to adequately support exceptions. Currently, Tornado is unable to create objects on accelerators and move them under the control of the memory manager inside the JVM or remove objects from under the control of the memory manager. To handle such cases, Tornado takes a practical approach where an attempt is made to compile all code and, if that is not possible, execution reverts to running the sequential Java inside the JVM.
Listing 4: Handling Library Calls in Tornado.
    public static void testCos(double[] a) {
        // The loop is annotated as safe to execute in parallel.
        for (@Parallel int i = 0; i < a.length; i++) {
            // Math.cos is handled through intrinsic substitution (Section 3.4.1).
            a[i] = Math.cos(a[i]);
        }
    }
Figure 6: RGB-D camera combines RGB with Depth information to form a 3D reconstruction of a scene (right).
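The compile-or-fall-back strategy described in Section 3.4 can be sketched in a few lines of Java. This is only an illustration: the exception type, the helper method, and the use of Consumer are hypothetical and are not part of Tornado’s API. The sketch assumes that the accelerated path reports an unsupported feature before it modifies any data, as would be the case for a failure at compilation time.

```java
import java.util.function.Consumer;

final class FallbackSketch {

    /** Hypothetical signal that a task uses a feature the device compiler cannot handle. */
    static final class UnsupportedOnDeviceException extends RuntimeException {
        UnsupportedOnDeviceException(String message) {
            super(message);
        }
    }

    /**
     * Tries the accelerated version first and reverts to the sequential Java version
     * if the accelerated one cannot be produced. Returns true if the accelerated path ran.
     */
    static boolean runWithFallback(Consumer<double[]> deviceVersion,
                                   Consumer<double[]> sequentialVersion,
                                   double[] data) {
        try {
            deviceVersion.accept(data);
            return true;
        } catch (UnsupportedOnDeviceException e) {
            sequentialVersion.accept(data);
            return false;
        }
    }

    public static void main(String[] args) {
        double[] data = {0.0, 1.0, 2.0};
        // Stand-in for a device path that rejects the task before touching the data.
        Consumer<double[]> device = d -> {
            throw new UnsupportedOnDeviceException("object allocation on device");
        };
        // Ordinary sequential Java, executed inside the JVM.
        Consumer<double[]> sequential = d -> {
            for (int i = 0; i < d.length; i++) {
                d[i] = Math.cos(d[i]);
            }
        };
        boolean accelerated = runWithFallback(device, sequential, data);
        System.out.println("accelerated=" + accelerated + ", data[0]=" + data[0]);
    }
}
```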
3.4.1 Library Calls. Tornado also supports the invocation of library calls in Tornado tasks. Java libraries are usually implemented in Java itself. However, some libraries rely on native implementations, e.g., java.util.concurrent. For libraries implemented in Java, Tornado automatically inlines the library call using Graal (see Section 3.3). However, Tornado has no way to automatically handle invocations of library calls with native implementations. To support such library calls, the Tornado compiler relies on intrinsic substitution during compilation. These intrinsics need to be manually defined in the Tornado compiler. Currently, Tornado ships with pre-defined intrinsics for the java.lang.Math library. Other library calls to native code are currently unsupported. Listing 4 gives an example of a Tornado task invoking a method from the java.lang.Math library.
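Intrinsic substitution can be pictured as a lookup from a Java method to a device built-in. The sketch below is a hypothetical illustration, not the data structure used inside the Tornado compiler: the table, its keys, and the helper names are assumptions. The target names (cos, sin, sqrt, exp) are OpenCL C built-in math functions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class IntrinsicTableSketch {

    // Hypothetical, manually defined table: qualified Java method -> OpenCL C built-in.
    private static final Map<String, String> INTRINSICS = new HashMap<>();

    static {
        INTRINSICS.put("java.lang.Math.cos", "cos");
        INTRINSICS.put("java.lang.Math.sin", "sin");
        INTRINSICS.put("java.lang.Math.sqrt", "sqrt");
        INTRINSICS.put("java.lang.Math.exp", "exp");
    }

    /** Returns the device built-in for a Java method, if one has been defined. */
    static Optional<String> substitute(String qualifiedMethod) {
        return Optional.ofNullable(INTRINSICS.get(qualifiedMethod));
    }

    /** Emits the call expression for the device, or fails if no intrinsic is defined. */
    static String emitCall(String qualifiedMethod, String argument) {
        String builtin = substitute(qualifiedMethod).orElseThrow(
                () -> new UnsupportedOperationException("no intrinsic for " + qualifiedMethod));
        return builtin + "(" + argument + ")";
    }

    public static void main(String[] args) {
        System.out.println(emitCall("java.lang.Math.cos", "a[i]"));
        System.out.println(substitute("java.lang.System.nanoTime").isPresent());
    }
}
```

A call without an entry in the table corresponds to the unsupported case described above, for which the task would have to run as ordinary Java.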
4 ACCELERATING THE KINECT FUSION COMPUTER VISION APPLICATION
To further stress Tornado and demonstrate its capability of running real applications, we accelerate end-to-end a complex Computer Vision (CV) application, namely Kinect Fusion (KF) [36]. Another goal of this demonstration is to show that using Tornado we can: 1) execute across as many devices as possible without requiring code modifications, and 2) achieve high performance.
4.1 Kinect Fusion
Kinect Fusion [36] processes a stream of depth images, obtained by an RGB-D camera, and reconstructs a three-dimensional representation of the space (Figure 6). In order to achieve its quality of service (QoS) target (i.e., real-time reconstruction of the environment) Kinect Fusion needs to operate at the frame-rate of the camera, which is 30 frames per second (FPS). Implementation-wise, some of Kinect Fusion’s kernels are very large (about 250 lines of code) and utilize a much wider range of Java language features than other benchmarks that are typically used to evaluate the performance of heterogeneous programming frameworks.
ManLang’18, September 12–14, 2018, Linz, Austria J. Clarkson et al.
Figure 7: Kinect Fusion Pipeline stages
Kinect Fusion comprises a six-stage processing pipeline (depicted in Figure 7) to process the input stream of depth images:
acquisition: obtains the next RGB-D frame, either from a camera or from a file.
pre-processing: applies a bilateral filter to remove anomalous values, re-scales the input data to represent distances in millimeters, and builds a pyramid of vertex and normal maps using three different image resolutions.
tracking: estimates the difference in camera pose between frames. This is achieved by matching the incoming data to an internal model of the scene using a technique called Iterative Closest Point (ICP) [5, 46].
integrate: fuses the current frame into the internal model, if the tracking error is less than a predetermined threshold.
raycast: constructs a new reference point cloud from the internal representation of the scene.
rendering: uses the same ray-casting technique as the previous stage to produce a visualization of the 3D scene.
In SLAMBench, each stage of the Kinect Fusion pipeline is composed of a series of kernels. Typically, a single frame requires the execution of 18 to 54 kernels. Therefore, to achieve the target frame-rate of 30 FPS, the application must sustain the execution of 540 to 1620 kernels per second.
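The kernel rates above follow directly from multiplying the kernels per frame by the target frame-rate. The short Java calculation below reproduces them and, as an added illustration, derives the mean time available per kernel under the simplifying assumption that kernels run back to back with no other overhead (roughly 1.85 ms at 18 kernels per frame and 0.62 ms at 54). The class and method names are hypothetical.

```java
final class KernelBudget {

    /** Kernels that must complete every second to sustain the given frame-rate. */
    static int kernelsPerSecond(int kernelsPerFrame, int framesPerSecond) {
        return kernelsPerFrame * framesPerSecond;
    }

    /** Mean time per kernel in milliseconds, assuming kernels execute back to back. */
    static double meanBudgetMillis(int kernelsPerFrame, int framesPerSecond) {
        return 1000.0 / kernelsPerSecond(kernelsPerFrame, framesPerSecond);
    }

    public static void main(String[] args) {
        int targetFps = 30;
        for (int kernelsPerFrame : new int[] {18, 54}) {
            System.out.printf("%d kernels/frame -> %d kernels/s, %.2f ms per kernel%n",
                    kernelsPerFrame,
                    kernelsPerSecond(kernelsPerFrame, targetFps),
                    meanBudgetMillis(kernelsPerFrame, targetFps));
        }
    }
}
```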
4.2 Java & Tornado Implementation
A common characteristic of CV applications, regardless of the scenario in which they are used, is their extreme computational demands. Typically, they are written in programming languages such as C++, with OpenMP or binding extensions for OpenCL or CUDA execution. A common drawback of such implementations is the lack of portability, since the application has to be recompiled and optimized for each underlying hardware platform. Building and optimizing CV applications on top of a managed runtime system such as the Java Virtual Machine (JVM) would enable single implementations to run across multiple devices such as desktop machines or low-power devices. To demonstrate this, we evaluate Kinect Fusion against a diverse set of heterogeneous hardware resources in Section 4.3.
Our Java reference implementation is derived from the open-source C++ version provided by SLAMBench [35]. During porting, we ensured that the Java implementation produces bit-exact results when compared to the C++ one (see footnote 1). This is highly important, and challenging, since Java does not support unsigned integers. Therefore, we had to modify the code to use signed representations while maintaining correctness. Although all kernels produce near identical results during unit-testing, each implementation can produce slightly different results when the kernels are combined, due to the nature of floating-point arithmetic.
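Two of the porting concerns mentioned above, unsigned values held in signed Java types and comparing floating-point results in ULPs, can be illustrated with standard library calls. The sketch below shows one possible approach and is not necessarily what the authors did; the class and method names are hypothetical. Short.toUnsignedInt and Integer.divideUnsigned are part of the Java SE API (since Java 8). The ULP distance assumes finite, non-NaN inputs.

```java
final class PortingChecks {

    /** Interprets a 16-bit value stored in a signed short as an unsigned value (0..65535). */
    static int unsigned16(short raw) {
        return Short.toUnsignedInt(raw);
    }

    /** Divides two 32-bit values as if both were unsigned. */
    static int divideUnsigned32(int dividend, int divisor) {
        return Integer.divideUnsigned(dividend, divisor);
    }

    /** Number of representable floats between a and b (0 means bit-identical, ignoring -0.0 vs 0.0). */
    static long ulpDistance(float a, float b) {
        return Math.abs(orderedBits(a) - orderedBits(b));
    }

    // Maps the IEEE 754 bit pattern to a value that increases monotonically with the float.
    private static long orderedBits(float f) {
        int bits = Float.floatToIntBits(f);
        return bits < 0 ? (long) Integer.MIN_VALUE - bits : bits;
    }

    public static void main(String[] args) {
        short rawDepth = (short) 0xFFFF; // would read as -1 if treated as signed
        System.out.println(unsigned16(rawDepth));
        System.out.println(divideUnsigned32(-2, 2));
        float reference = 1.0f;
        float candidate = Math.nextUp(reference);
        System.out.println(ulpDistance(reference, candidate) <= 5); // within the 5 ULP tolerance
    }
}
```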
We have developed the Java implementation with minimal dependencies on third-party code and we do not use any form of Foreign Function Interface (FFI) or native libraries. Our only dependency is on the EJML library [18] for its implementation of Singular Value Decomposition (SVD).
During preliminary performance analysis, we discovered that the C++ implementation is 3.4× faster than Java. Despite outperforming Java, the C++ implementation barely manages to achieve 4 FPS, which is much lower than the expected QoS target of 30 FPS. After the initial validation and performance analysis of our serial Java Kinect Fusion implementation, we ported Kinect Fusion to Tornado. To enable our baseline Java Kinect Fusion implementation to take advantage of Tornado’s capabilities we: 1) used the Tornado API to describe the processing pipeline, 2) annotated loops that are safe to be executed in parallel with @Parallel, and 3) executed the task-graph at an appropriate point in the application. The Tornado Kinect Fusion implementation has eight separate task graphs in total: one for each of the pre-processing, integrate, raycast and rendering stages, and four for the tracking stage (one to create the image pyramid and one to process each of the three levels of the pyramid).
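The three porting steps can be mimicked with a self-contained stand-in. Everything in the sketch below is hypothetical: MiniTaskGraph and the locally declared Parallel annotation only imitate the shape of a task-based API and are not Tornado’s classes, whose exact signatures are not reproduced in this section. The stand-in simply runs its tasks in order on the host.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TaskGraphSketch {

    /** Hypothetical stand-in for a loop annotation; it has no effect in this sketch. */
    @Target(ElementType.LOCAL_VARIABLE)
    @interface Parallel {}

    /** Hypothetical minimal task graph: records tasks and runs them in order on the host. */
    static final class MiniTaskGraph {
        private final List<Runnable> tasks = new ArrayList<>();

        MiniTaskGraph add(Runnable task) {
            tasks.add(task);
            return this;
        }

        void execute() {
            tasks.forEach(Runnable::run);
        }
    }

    // Step 2: a loop whose iterations are independent is marked as safe to parallelize.
    static void scale(float[] in, float[] out, float factor) {
        for (@Parallel int i = 0; i < in.length; i++) {
            out[i] = in[i] * factor;
        }
    }

    public static void main(String[] args) {
        float[] depthMeters = {0.5f, 1.0f, 2.0f};
        float[] depthMillimeters = new float[depthMeters.length];

        // Step 1: describe a pipeline stage as a graph of tasks.
        MiniTaskGraph preprocessing = new MiniTaskGraph()
                .add(() -> scale(depthMeters, depthMillimeters, 1000f));

        // Step 3: execute the graph at the appropriate point in the application.
        preprocessing.execute();
        System.out.println(Arrays.toString(depthMillimeters));
    }
}
```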
4.3 Kinect Fusion Evaluation
To evaluate our Tornado Kinect Fusion implementation, we use four distinct classes of heterogeneous systems (shown in Table 1). Each system has a multi-core processor, along with a minimum of one GPGPU that can be used for acceleration. To provide fair comparisons, all experiments use the same application configuration and scene from the ICL-NUIM data-set [25].
Table 2 provides the frame-rates we achieved during our experiments for all tested implementations. To better understand the results we provide three different Tornado implementations: Tornado-NR, Tornado-JR, and Tornado-OR. Tornado-NR (No Reduce) does not support reduction operations. Tornado-JR (Java Reduce) has limited support for reduction operations, which are written in pure Java. Tornado-OR (OpenCL Reduce) fully supports reduction operations, but relies on hand-crafted OpenCL kernels (for the reductions only). The three implementations are discussed in more detail later in the evaluation.
Footnote 1: Where bit-exactness was not achieved, the results were within 5 Units in the Last Place (ULP).
Machine Name | OS (kernel)          | CPU                          | Cores   | OpenCL (CPU) | GPGPU                       | CU | OpenCL (GPGPU)
Laptop       | OSX 10.11.6 (14.5.0) | Intel i7-4850HQ @ 2.3 GHz    | 4 (8)   | 1.2 (Apple)  | Intel Iris Pro 5200         | 40 | 1.2 (Apple)
             |                      |                              |         |              | NVIDIA GT 750M @ 925 MHz    | 2  | 1.2 (Apple)
Desktop      | Fedora 21 (4.1.10)   | AMD A10-7850K @ 1.7 GHz      | 4       | 1.2 (AMD)    | AMD Radeon R7 @ 720 MHz     | 8  | 2.0 (AMD)
Enterprise   | CentOS 6.8 (2.6.32)  | Intel Xeon E5-2620 @ 1.2 GHz | 12 (24) | 1.2 (Intel)  | NVIDIA Tesla K20m @ 705 MHz | 13 | 1.2 (NVIDIA)
Table 1: Hardware Configurations, CU: Number of OpenCL Compute Units.
Machine    | Java | C++  | OMP   | OpenCL CPU | OpenCL GPU 1 | OpenCL GPU 2 | NR CPU | NR GPU 1 | NR GPU 2 | JR CPU | JR GPU 1 | JR GPU 2 | OR GPU 1 | OR GPU 2
Laptop     | 0.87 | 3.69 | -     | e          | 57.93        | e            | 15.01  | 24.84    | 20.15    | 15.39  | 45.66    | 21.00    | 48.36    | 24.61
Desktop    | 0.40 | 3.14 | 7.87  | e          | e            | -            | 5.28   | 16.80    | -        | 7.91   | 15.51    | -        | 21.78    | -
Enterprise | 0.71 | 2.40 | 19.63 | 29.02      | 138.10       | -            | 21.60  | 31.25    | -        | 30.65  | 52.26    | -        | 107.78   | -
Table 2: SLAMBench performance in FPS for each implementation (NR, JR, OR: Tornado-NR, Tornado-JR, Tornado-OR; e: failed to produce a valid result)
4.3.1 Measuring Performance and Accuracy. A challenge when comparing different implementations of Kinect Fusion, and CV algorithms in general, is that performance and accuracy measures are subjective. Normally, this is because the real measure of the algorithmic quality is the user experience: does the user notice slow performance and is it accurate enough for their needs? Nevertheless, we must ensure that each implementation of Kinect Fusion does the same work and produces the same answer. Therefore, out of a number of Kinect Fusion implementations we have selected the ones provided by SLAMBench since they provide ready-made infrastructure to measure the performance and accuracy, enabling reliable comparisons between different implementations.
The accuracy of each reconstruction is determined by comparing the estimated trajectory of the camera against a provided ground truth, and is reported as an absolute trajectory error (ATE). The ground truths are provided by the synthetically generated ICL-NUIM data-set [25]. Finally, the performance is measured as the average frame-rate achieved when processing the entire data-set. For a result to be considered valid, an implementation needs to return a mean ATE of under 5 cm, as per the criteria set out by SLAMBench.
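The validity criterion can be expressed as a small check. The sketch below assumes that the ATE of a frame is the Euclidean distance between the estimated and ground-truth camera positions, expressed in meters; SLAMBench’s own computation may differ in details (for example, in how trajectories are aligned), and the class and method names are hypothetical.

```java
final class TrajectoryErrorSketch {

    /** Validity threshold for the mean ATE: 5 cm, expressed in meters. */
    static final double MAX_MEAN_ATE_METERS = 0.05;

    /** Mean Euclidean distance between estimated and ground-truth positions (x, y, z in meters). */
    static double meanAte(double[][] estimated, double[][] groundTruth) {
        if (estimated.length == 0 || estimated.length != groundTruth.length) {
            throw new IllegalArgumentException("trajectories must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            double dx = estimated[i][0] - groundTruth[i][0];
            double dy = estimated[i][1] - groundTruth[i][1];
            double dz = estimated[i][2] - groundTruth[i][2];
            sum += Math.sqrt(dx * dx + dy * dy + dz * dz);
        }
        return sum / estimated.length;
    }

    /** A run counts as valid only if its mean ATE is strictly below the threshold. */
    static boolean isValid(double meanAteMeters) {
        return meanAteMeters < MAX_MEAN_ATE_METERS;
    }

    public static void main(String[] args) {
        double[][] estimated = {{0, 0, 0}, {1, 0, 0}};
        double[][] groundTruth = {{0, 0, 0.03}, {1, 0.04, 0}};
        double ate = meanAte(estimated, groundTruth);
        System.out.println("mean ATE (m): " + ate + ", valid: " + isValid(ate));
    }
}
```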
4.3.2 Portability. The first notable outcome of our experiments is that the OpenCL implementation produced valid results on only three of the seven devices in Table 2 (about 43%), whereas Tornado produced valid results on all devices. This result strengthens our argument that
Tornado with its dynamic JIT compilation is able to provide high performing heterogeneous implementations that are portable across a wide range of devices. In contrast, the OpenCL implementation could not execute on all devices due to the assumptions made by the developers when they initially implemented SLAMBench on their device of choice. These assumptions concern work-group dimensions and the amount of local memory available. If either of these assumptions is invalid on the target device, the reduce kernel fails to execute correctly. These problems are avoided in Tornado as resource usage is determined automatically by the runtime system and is based on the preferences of the target device. Additionally, Tornado provides developers with a number of run-time configuration options to influence how resources are allocated, meaning that a number of these issues can be corrected without re-compiling the application.
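To make the work-group assumption concrete, the sketch below derives launch dimensions from a device limit instead of hard-coding them. It is a hypothetical illustration and not Tornado’s actual policy. In a real OpenCL program the limit would be queried from the device (CL_DEVICE_MAX_WORK_GROUP_SIZE); here it is passed in as a plain parameter. Because the global size is rounded up to a multiple of the local size, which OpenCL 1.x requires, a kernel launched this way must guard its body with a bounds check.

```java
final class WorkGroupSizing {

    /** Picks a work-group size that never exceeds what the device reports as its maximum. */
    static int localSize(int preferred, int deviceMaxWorkGroupSize) {
        return Math.max(1, Math.min(preferred, deviceMaxWorkGroupSize));
    }

    /** Rounds the number of iterations up to the next multiple of the work-group size. */
    static int globalSize(int iterations, int localSize) {
        return ((iterations + localSize - 1) / localSize) * localSize;
    }

    public static void main(String[] args) {
        int iterations = 1000;
        // The same request adapts to devices with different limits.
        for (int deviceMax : new int[] {64, 256, 1024}) {
            int local = localSize(256, deviceMax);
            int global = globalSize(iterations, local);
            System.out.println("max=" + deviceMax + " -> local=" + local + ", global=" + global);
        }
    }
}
```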
4.3.3 Performance Study. Inspecting the results in more detail, we observe that the baseline Tornado implementation (Tornado-NR) of SLAMBench achieves a speedup of 12-43× over the reference Java implementation and in one case it exceeds our minimum level of QoS at 31.25 FPS. Nevertheless, it produces 0.36× the performance of the OpenCL implementation. To understand where the performance loss occurs we ran a number of additional experiments with finer grained measurements. Figures 8a and 8b present the results. Figure 8a compares the amount of time spent in each pipeline stage, while Figure 8b compares the mean execution times of each Tornado pipeline stage against the OpenCL implementation on the Enterprise system. From these figures we see that: 1) the total execution times of the parallel implementations in OpenMP, OpenCL, and Tornado are dominated by the tracking stage, and 2) Tornado achieves less than 0.15× the performance of the OpenCL implementation in the tracking stage. These observations indicate that the tracking stage is the performance bottleneck in our implementation.
Studying the tracking stage of our implementation, we traced the issue to data transfers between the device and the host. According to our measurements, the Tornado-NR implementation requires over 14 MB of data to be moved between the device and host per frame. In the OpenCL version, this problem is addressed by using a hand-crafted reduction operation which compresses the size of the tracking result before transferring it back to the host. In our initial Tornado implementation (Tornado-NR), we chose not to port this kernel because it cannot be expressed in serial Java. The main issue preventing the reduction operation from being written in Java is the inability to express the movement of data between different threads and work-groups, since such notions do not exist in the language. Note also that providing a hand-crafted implementation would negatively impact the portability of our implementation.
Figure 8: Performance breakdown on the Enterprise system. (a) Breakdown of time spent in each stage of the pipeline for Java, C++, OpenMP, and for Tornado and OpenCL on the Xeon E5-2620 and the Tesla K20m. (b) Tornado execution times normalized to the OpenCL equivalent (lower is better). (A: acquisition, P: pre-processing, T: tracking, I: integration, Ra: raycast, Re: rendering)
Figure 9: Illustration of the different reduction algorithms. (a) Java Based Reduction. (b) OpenCL Based Reduction.
Nevertheless, since the tracking stage has become a performance bottleneck, we devised two solutions. The first solution was to implement a reduction function, in Java, that does not use inter-thread communication (Tornado-JR), while the second was to provide Tornado with a hand-crafted OpenCL C kernel that does (Tornado-OR). The advantage of the first approach is that the code remains portable across all devices at the cost of performance, whereas the second approach yields better performance but sacrifices portability.
Figures 9a and 9b illustrate the reduction operations implemented in Java and OpenCL respectively (Tornado-JR and Tornado-OR). The Java reduction uses a fixed number of threads to combine results in a thread-cyclic manner, creating one partial result per used thread. However, the inability to communicate data between threads means that we cannot fully utilize the hardware (dashed boxes). The OpenCL implementation, similarly to the Java one, is a multi-stage reduction, but it is able to utilize more threads by exploiting inter-thread/intra-work-group communication. The most important difference from the Java one is that an extra reduction is performed inside each work-group, which results in a single value being produced per work-group. We have also added the ability to change the number of work-groups used in the reduction, varying the utilization of the GPGPU’s compute units. It is important to note that Tornado allows developers to add their own user-defined reduction kernels and hand-optimize their implementations if performance becomes an issue.
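The difference between the two reduction schemes can be sketched with a sequential Java simulation of a sum reduction. This is an illustration under stated assumptions, not the kernels used in Tornado-JR or Tornado-OR: the names are hypothetical, threads and work-groups are modeled by ordinary loops, and the work-group size is assumed to be a power of two. In both cases the remaining partial results would be transferred to the host and combined there.

```java
final class ReductionSketch {

    /** Thread-cyclic reduction: thread t sums in[t], in[t + T], ... giving one partial per thread. */
    static float[] threadCyclicPartials(float[] in, int threads) {
        float[] partials = new float[threads];
        for (int t = 0; t < threads; t++) { // each iteration models one independent thread
            float acc = 0f;
            for (int i = t; i < in.length; i += threads) {
                acc += in[i];
            }
            partials[t] = acc;
        }
        return partials;
    }

    /** Work-group reduction: items accumulate cyclically, then each group combines its items into one value. */
    static float[] workGroupPartials(float[] in, int groups, int localSize) {
        float[] partials = new float[groups];
        int globalSize = groups * localSize;
        for (int g = 0; g < groups; g++) {
            float[] local = new float[localSize]; // models the work-group's local memory
            for (int l = 0; l < localSize; l++) {
                for (int i = g * localSize + l; i < in.length; i += globalSize) {
                    local[l] += in[i];
                }
            }
            // Tree reduction inside the group; on a device a barrier separates the steps.
            for (int stride = localSize / 2; stride > 0; stride /= 2) {
                for (int l = 0; l < stride; l++) {
                    local[l] += local[l + stride];
                }
            }
            partials[g] = local[0];
        }
        return partials;
    }

    /** Final combination of the partial results, as performed on the host. */
    static float sum(float[] values) {
        float total = 0f;
        for (float v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        float[] input = new float[100];
        for (int i = 0; i < input.length; i++) {
            input[i] = i + 1;
        }
        float[] cyclic = threadCyclicPartials(input, 6);
        float[] grouped = workGroupPartials(input, 2, 4);
        System.out.println(cyclic.length + " partials, sum " + sum(cyclic));
        System.out.println(grouped.length + " partials, sum " + sum(grouped));
    }
}
```

With eight work-items arranged as two groups of four, only two values are returned to the host, whereas the thread-cyclic version returns one value for every thread it uses.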
4.3.4 Performance Improvements. To evaluate the impact of our different reduction kernels we repeated our experiments using a number of configurations. The Tornado-JR kernel can be configured at run-time to use different numbers of threads: we used values of 512, 1024, 2048, and 4096 on the GPGPUs, and 0.5, 1, 2, and 4× the number of available compute units on the CPUs. The Tornado-OR kernel is implemented to dynamically adjust the work-group size and the number of work-groups it uses, to help us vary the utilization of the compute units. By default, we assign a single work-group with the largest possible dimensions to each compute unit. As shown in Table 2, Tornado’s performance has improved in both the Tornado-JR and Tornado-OR configurations compared to the baseline one (Tornado-NR).
As shown in Figure 10, the Tornado implementations obtain significantly higher levels of performance. Regarding the JR experiments, we observe performance improvements across all devices with a maximum speedup of 74× over the reference implementation and a maximum frame-rate of 52 FPS. More importantly, we see that three devices are now able to exceed our QoS threshold of 30 FPS. This means that we have been able to exceed the QoS threshold on the same devices as OpenCL by using an implementation written entirely in Java. Moreover, if we compare these results to OpenMP we see that although we started from a performance point of 3-7× lower than C++, our Tornado implementation is able to achieve higher performance on all CPU implementations. By using the OR version, we have managed to obtain the highest performance with a maximum speedup of 150× over the reference implementation and a maximum frame-rate of 107 FPS on the Tesla K20m.
Figure 10: Performance in FPS after implementing the reduce kernel (N: Tornado-NR, Hand: Tornado-OR, Rest: Tornado-JR with variable number of threads). Vertical axis: Frames Per Second. Devices: AMD A10-7850K, Intel i7-4850HQ, Intel E5-2620, AMD Radeon R7, Intel Iris Pro 5200, NVIDIA GT 750M, NVIDIA Tesla K20m.
Finally, regarding the comparison with the OpenCL implementation, Tornado achieves 0.59× the performance of the OpenCL implementation by using only Java (Tornado-JR), and if a developer wishes to sacrifice a little portability by using a single hand-crafted OpenCL kernel this rises to 0.77×.
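The headline speedups can be checked against Table 2 with simple ratios. The snippet below uses the Enterprise row (Java 0.71 FPS, Tornado-JR on the GPGPU 52.26 FPS, Tornado-OR 107.78 FPS, OpenCL on the GPGPU 138.10 FPS); the class and method names are hypothetical. The per-device ratios against OpenCL that it prints are for this one machine only and are not the aggregate 0.59× and 0.77× figures, whose averaging method is not detailed in this section.

```java
final class SpeedupCheck {

    /** Ratio of a measured frame-rate to a baseline frame-rate. */
    static double ratio(double fps, double baselineFps) {
        return fps / baselineFps;
    }

    public static void main(String[] args) {
        // Values copied from the Enterprise row of Table 2.
        double java = 0.71;
        double tornadoJr = 52.26;
        double tornadoOr = 107.78;
        double openCl = 138.10;

        System.out.printf("Tornado-JR vs Java:   %.1fx%n", ratio(tornadoJr, java));
        System.out.printf("Tornado-OR vs Java:   %.1fx%n", ratio(tornadoOr, java));
        System.out.printf("Tornado-JR vs OpenCL: %.2fx%n", ratio(tornadoJr, openCl));
        System.out.printf("Tornado-OR vs OpenCL: %.2fx%n", ratio(tornadoOr, openCl));
    }
}
```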
4.3.5 Host Side Performance. Figure 11a shows the average compilation times of all tasks across all devices. In general, we observe that compilation takes between 100 and 200 milliseconds and is split, almost evenly, between the Graal and the OpenCL compiler. Tornado provides the ability to manually trigger the optimization of task-graphs and the JIT compilation of tasks before a task-schedule is executed; this way the corresponding overheads can be removed from time-sensitive task-schedules, such as the ones in Kinect Fusion. Using this option mirrors how the compilation overheads are handled in the OpenCL implementation of Kinect Fusion.
Figure 11b shows the time spent to process each frame during execution for the Tornado JR (2048), OR (Hand), and OpenCL implementations on the Enterprise system. During the early stages of the benchmark, all Tornado implementations experience extra overheads from JIT compilation and garbage collection. However, performance stabilizes after approximately 100 frames and continues to improve. After profiling, we discovered that the memory usage of our Tornado implementations stabilizes at around 400 MB, resulting in minimal GC interference after the warmup period.
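The effect of the warmup period on a reported frame-rate can be illustrated by computing the average FPS with and without the initial frames. This is a hypothetical sketch: the per-frame times are made-up inputs for demonstration, not measurements from the evaluation, and the class and method names are assumptions.

```java
final class FrameRateSketch {

    /** Average frames per second over the frames from index 'from' (inclusive) to the end. */
    static double averageFps(long[] frameNanos, int from) {
        int frames = frameNanos.length - from;
        if (from < 0 || frames <= 0) {
            throw new IllegalArgumentException("no frames left after skipping " + from);
        }
        long totalNanos = 0L;
        for (int i = from; i < frameNanos.length; i++) {
            totalNanos += frameNanos[i];
        }
        return frames / (totalNanos / 1e9);
    }

    public static void main(String[] args) {
        // Illustrative timings: two slow frames (JIT compilation, GC) followed by steady state.
        long[] frameNanos = {
            100_000_000L, 100_000_000L,
            10_000_000L, 10_000_000L, 10_000_000L, 10_000_000L,
            10_000_000L, 10_000_000L, 10_000_000L, 10_000_000L
        };
        System.out.printf("whole run:    %.2f FPS%n", averageFps(frameNanos, 0));
        System.out.printf("after warmup: %.2f FPS%n", averageFps(frameNanos, 2));
    }
}
```

Averaging over the entire data-set, as done in the evaluation, includes the warmup frames and therefore gives a more conservative figure than the steady-state rate.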
5 CONCLUSIONS AND FUTURE WORK
We demonstrate that through holistic design it is possible to develop a practical heterogeneous programming framework. The distinguishing feature of Tornado is that it enables developers to write portable heterogeneous code in pure Java. This allows them to write applications that can be quickly deployed across different hardware accelerators and operating systems. Moreover, our dynamic design allows them to avoid making a priori decisions; instead, applications can be dynamically configured at run-time.
We demonstrate that by using Tornado it is possible to write a single implementation of a complex Computer Vision application and deploy it across a variety of heterogeneous systems, while maintaining comparable performance to hand-crafted optimized equivalents. What makes Tornado unique is that it has been developed to provide heterogeneous programming support to the general purpose Java programming language, a language that would not normally be associated with writing either high-performance or heterogeneous code. By utilizing the introduced Tornado framework, we managed to obtain speedups of 18-150× over the reference Java implementation. The results show that not only can we obtain levels of performance that meet our QoS target of 30 FPS, but we can also exceed our target by up to 3× (at 107 FPS). Moreover, we demonstrate that Tornado is able to utilize 2.3× more devices than the hand-written OpenCL.
Experiences with OpenCL. Although Tornado is able to execute complex Java applications across a wide range of accelerators successfully, a number of key issues remain. Tornado is currently based on OpenCL. More desirable options such as CUDA/PTX or even HSA/HSAIL would have severely restricted the diversity and number of accelerators that we could use. Aside from issues regarding Java/OpenCL compatibility, we struggled to find an OpenCL-compatible way to create on-device managed heaps and subsequently objects. The problems arise from the fact that we are forced to access device memory via OpenCL buffers while having no standard way of resolving device-side addresses, either relative or absolute. Advanced OpenCL features, like Shared Virtual Memory, could not help solve our problems as this is not yet supported on the majority of devices we used. The main reason is that vendors’ support for OpenCL is variable across operating systems and devices; sometimes getting access to an SDK is near impossible.
Future work. As future work, we aim to further improve Tornado’s performance by implementing more compiler and runtime optimizations, such as control-flow minimization, predication, use of constant memory, and compressed object layouts for objects residing on the hardware accelerators. We also plan to enable new object allocation from tasks running on devices other than the one running the JVM host. Furthermore, we plan to extend Tornado’s reach to more devices and diverse accelerators such as Intel’s Xeon Phi and FPGAs. We also plan to open-source and release the complete Tornado JIT compiler and the task-based API on GitHub, under the beehive-lab GitHub organization of the University of Manchester (https://github.com/beehive-lab/).
Figure 11: Compilation times and per-frame performance. (a) Average per-task compilation times for each device (time in seconds, split between the OpenCL and Graal compilers). (b) Per-frame performance on NVIDIA Tesla K20m (Frames Per Second against Frame Number). Key: Hand - OR implementation, 2048 - JR with 2048 threads.
ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers for their constructive feedback and the effort they put into reviewing this manuscript. This work is partially supported by the EPSRC grants PAMELA EP/K008730/1 and AnyScale Apps EP/L000725/1, and the EU Horizon 2020 E2Data 780245.
REFERENCES
[1] 2015. Project Beehive: A Hardware/Software Co-designed Stack for Runtime and Architectural Research. CoRR abs/1509.04085 (2015). arXiv:1509.04085 http://arxiv.org/abs/1509.04085 Withdrawn.
[2] AMD. 2016. Aparapi. Retrieved December 19, 2018 from http://developer.amd.com/tools-and-sdks/heterogeneous-computing/aparapi
[3] Joshua Auerbach, David F. Bacon, Perry Cheng, and Rodric Rabbah. 2010. Lime: A Java-compatible and Synthesizable Language for Heterogeneous Architectures. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’10). ACM, New York, NY, USA, 89–108. https://doi.org/10.1145/1869459.1869469
[4] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU Math Expression Compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). 3–10.
[5] P. J. Besl and H. D. McKay. 1992. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14, 2 (Feb 1992), 239–256.
[6] Bryan Catanzaro, Michael Garland, and Kurt Keutzer. 2011. Copperhead: Compiling an Embedded Data Parallel Language. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP ’11). ACM, New York, NY, USA, 47–56. https://doi.org/10.1145/1941553.1941562
[7] Olivier Chafik. 2015. ScalaCL: Faster Scala: optimizing compiler plugin + GPU-based collections (OpenCL). Retrieved December 19, 2018 from https://github.com/nativelibs4java/ScalaCL
[8] Manuel M.T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. 2011. Accelerating Haskell Array Codes with Multicore GPUs. In Proceedings of the Sixth Workshop on Declarative Aspects of Multicore Programming (DAMP ’11). ACM, New York, NY, USA, 3–14. https://doi.org/10.1145/1926354.1926358
[9] James Clarkson, Christos Kotselidis, Gavin Brown, and Mikel Luján. 2017. Boosting Java Performance Using GPGPUs. In Architecture of Computing Systems - ARCS 2017, Jens Knoop, Wolfgang Karl, Martin Schulz, Koji Inoue, and Thilo Pionteck (Eds.). Springer International Publishing, Cham, 59–70.
[10] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. 2011. Torch7: A Matlab-like Environment for Machine Learning. In BigLearn, NIPS Workshop.
[11] NVIDIA Corporation. 2017. CUDA. Retrieved December 19, 2018 from http://developer.nvidia.com/cuda-zone
[12] NVIDIA Corporation. 2017. Parallel Thread Execution ISA Version 4.0. Retrieved December 19, 2018 from https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
[13] Georg Dotzler, Ronald Veldema, and Michael Klemm. 2010. JCudaMP. In Proceedings of the 3rd International Workshop on Multicore Software Engineering. 10–17. https://doi.org/10.1145/1808954.1808959
[14] Christophe Dubach, Perry Cheng, Rodric Rabbah, David Bacon, and Stephen Fink. 2012. Compiling a High-Level Language for GPUs (via Language Support for Architectures and Compilers). In Proceedings of the 33rd ACM SIGPLAN Symposium on Programming Language Design and Implementation (PLDI). 1–12. papers/dubach12pldi.pdf
[15] Christophe Dubach, Perry Cheng, Rodric Rabbah, David F. Bacon, and Stephen J. Fink. 2012. Compiling a High-level Language for GPUs: (via Language Support for Architectures and Compilers). In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’12). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/2254064.2254066
[16] G. Duboscq, L. Stadler, T. Würthinger, D. Simon, C. Wimmer, and H. Mössenböck. 2013. Graal IR: An extensible declarative intermediate representation. In Asia-Pacific Programming Languages and Compilers.
[17] Gilles Duboscq, Thomas Würthinger, Lukas Stadler, Christian Wimmer, Doug Simon, and Hanspeter Mössenböck. 2013. An Intermediate Representation for Speculative Optimizations in a Dynamic Compiler. In Proceedings of the 7th ACM Workshop on Virtual Machines and Intermediate Languages (VMIL ’13). ACM, New York, NY, USA, 1–10. https://doi.org/10.1145/2542142.2542143
[18] EJML. 2017. EJML. Retrieved December 19, 2018 from http://ejml.org
[19] HSA Foundation. 2016. HSA Foundation. Retrieved December 19, 2018 from http://www.hsafoundation.com
[20] HSA Foundation. 2017. HSAIL Virtual ISA and Programming Model, Compiler Writer’s Guide, and Object Format (BRIG) .95. Retrieved December 19, 2018 from https://hsafoundation.box.com/s/m6mrsjv8b7r50kqeyyalg
[21] Juan Fumero, Michel Steuwer, Lukas Stadler, and Christophe Dubach. 2017. Just-In-Time GPU Compilation for Interpreted Languages with Partial Evaluation. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’17). ACM, New York, NY, USA, 60–73. https://doi.org/10.1145/3050748.3050761
[22] Juan José Fumero, Toomas Remmelg, Michel Steuwer, and Christophe Dubach. 2015. Runtime Code Generation and Data Management for Heterogeneous Computing in Java. In Proceedings of the Principles and Practices of Programming on The Java Platform (PPPJ ’15). ACM, New York, NY, USA, 16–26. https://doi.org/10.1145/2807426.2807428
[23]
Juan José Fumero, Michel Steuwer, and Christophe Dubach. 2014. A Composable
Array Function Interface for Heterogeneous Computing in Java. In Proceedings
of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers
for Array Programming (ARRAY’14). ACM, New York, NY, USA, 44:44–44:49.
https://doi.org/10.1145/2627373.2627381
[24]
Khronos Group. 2017. OpenCL. Retrieved December 19, 2018 from https:
//www.khronos.org/opencl
[25]
A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison. 2014. A Benchmark for
RGB-D Visual Odometry, 3D Reconstruction and SLAM. In IEEE Intl. Conf. on
Robotics and Automation, ICRA. Hong Kong, China, 1524–1531.
[26]
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, and Vivek Sarkar.
2013. Accelerating Habanero-Java Programs with OpenCL Generation. In Proceed-
ings of the 2013 International Conference on Principles and Practices of Programming
on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ ’13). ACM,
New York, NY, USA, 124–134. https://doi.org/10.1145/2500828.2500840
[27]
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, and Vivek Sarkar.
2013. Accelerating Habanero-Java Programs with OpenCL Generation. In Pro-
ceedings of the 2013 International Conference on Principles and Practices of Pro-
gramming on the Java Platform: Virtual Machines, Languages, and Tools. 124–134.
https://doi.org/10.1145/2500828.2500840
[28]
Sylvain Henry. 2013. ViperVM: A Runtime System for Parallel Functional
High-performance Computing on Heterogeneous Architectures. In Proceedings of the
2nd ACM SIGPLAN Workshop on Functional High-performance Computing (FHPC
’13). ACM, New York, NY, USA, 3–12. https://doi.org/10.1145/2502323.2502329
[29]
Stephan Herhut, Richard L. Hudson, Tatiana Shpeisman, and Jaswanth Sreeram.
2013. River Trail: A Path to Parallelism in JavaScript. In Proceedings of the 2013
ACM SIGPLAN International Conference on Object Oriented Programming Systems
Languages & Applications (OOPSLA ’13). ACM, New York, NY, USA, 729–744.
https://doi.org/10.1145/2509136.2509516
[30]
Kazuaki Ishizaki, Akihiro Hayashi, Gita Koblents, and Vivek Sarkar. 2015.
Compiling and Optimizing Java 8 Programs for GPU Execution. In Proceedings
of the 2015 International Conference on Parallel Architecture and Compilation
(PACT) (PACT ’15). IEEE Computer Society, Washington, DC, USA, 419–431.
https://doi.org/10.1109/PACT.2015.46
[31]
JOCL. 2017. Java bindings for OpenCL. Retrieved December 19, 2018 from
http://www.jocl.org/
[32]
Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, and
Ahmed Fasih. 2012. PyCUDA and PyOpenCL: A Scripting-based Approach to
GPU Run-time Code Generation. Parallel Comput. 38, 3 (March 2012), 157–174.
[33]
Christos Kotselidis, James Clarkson, Andrey Rodchenko, Andy Nisbet, John
Mawer, and Mikel Luján. 2017. Heterogeneous Managed Runtime Systems: A
Computer Vision Case Study. In Proceedings of the 13th ACM SIGPLAN/SIGOPS
International Conference on Virtual Execution Environments (VEE ’17). ACM, New
York, NY, USA, 74–82. https://doi.org/10.1145/3050748.3050764
[34]
Georey Mainland and Greg Morrisett. 2010. Nikola: Embedding Compiled GPU
Functions in Haskell. In Proceedings of the Third ACM Haskell Symposium on
Haskell (Haskell ’10). ACM, New York, NY, USA, 67–78. https://doi.org/10.1145/
1863523.1863533
[35]
Luigi Nardi, Bruno Bodin, M. Zeeshan Zia, John Mawer, Andy Nisbet, Paul H. J.
Kelly, Andrew J. Davison, Mikel Luján, Michael F. P. O’Boyle, Graham Riley, Nigel
Topham, and Steve Furber. 2015. Introducing SLAMBench, a performance and
accuracy benchmarking methodology for SLAM. In IEEE Intl. Conf. on Robotics
and Automation (ICRA).
[36]
Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David
Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and
Andrew Fitzgibbon. 2011. KinectFusion: Real-time Dense Surface Mapping and
Tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed
and Augmented Reality (ISMAR ’11). IEEE Computer Society, Washington, DC,
USA, 127–136. https://doi.org/10.1109/ISMAR.2011.6092378
[37]
Nathaniel Nystrom, Derek White, and Kishen Das. 2011. Firepile: Run-time
Compilation for GPUs in Scala. In Proceedings of the 10th ACM International
Conference on Generative Programming and Component Engineering (GPCE ’11).
ACM, New York, NY, USA, 107–116. https://doi.org/10.1145/2047862.2047883
[38]
OpenACC.org. 2017. OpenACC: Directives for Accelerators. Retrieved December
19, 2018 from http://www.openacc-standard.org
[39]
OpenJDK. 2017. Project Sumatra. Retrieved December 19, 2018 from
http://openjdk.java.net/projects/sumatra
[40]
P.C. Pratt-Szeliga, J.W. Fawcett, and R.D. Welch. 2012. Rootbeer: Seamlessly
Using GPUs from Java. In Proceedings of 14th International IEEE High Performance
Computing and Communication Conference on Embedded Software and Systems.
https://doi.org/10.1109/HPCC.2012.57
[41]
Alex Rubinsteyn, Eric Hielscher, Nathaniel Weinman, and Dennis Shasha. 2012.
Parakeet: A Just-in-time Parallel Accelerator for Python. In Proceedings of the 4th
USENIX Conference on Hot Topics in Parallelism (HotPar’12). USENIX Association,
Berkeley, CA, USA, 14–14.
[42]
Matthias Springer, Peter Wauligmann, and Hidehiko Masuhara. 2017. Modular
Array-based GPU Computing in a Dynamically-typed Language. In Proceedings
of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and
Compilers for Array Programming (ARRAY 2017). ACM, New York, NY, USA,
48–55. https://doi.org/10.1145/3091966.3091974
[43]
Yonghong Yan, Max Grossman, and Vivek Sarkar. 2009. JCUDA: A
Programmer-Friendly Interface for Accelerating Java Programs with CUDA. In Euro-Par 2009
Parallel Processing, Henk Sips, Dick Epema, and Hai-Xiang Lin (Eds.), Vol. 5704.
Springer Berlin Heidelberg.
[44]
Foivos S. Zakkak, Andy Nisbet, John Mawer, Tim Hartley, Nikos Foutris, Orion
Papadakis, Andreas Andronikakis, Iain Apreotesei, and Christos Kotselidis. 2018.
On the Future of Research VMs: A Hardware/Software Perspective. In Conference
Companion of the 2nd International Conference on Art, Science, and Engineering
of Programming (Programming’18 Companion). ACM, New York, NY, USA,
51–53. https://doi.org/10.1145/3191697.3191729
[45]
Wojciech Zaremba, Yuan Lin, and Vinod Grover. 2012. JaBEE: Framework for
Object-oriented Java Bytecode Compilation and Execution on Graphics Processor
Units. In Proceedings of the 5th Annual Workshop on General Purpose Processing
with Graphics Processing Units (GPGPU-5). ACM, New York, NY, USA, 74–83.
https://doi.org/10.1145/2159430.2159439
[46]
Zhengyou Zhang. 1994. Iterative Point Matching for Registration of Free-form
Curves and Surfaces. Int. J. Comput. Vision 13, 2 (Oct. 1994), 119–152.