Dynamic Application Reconfiguration on Heterogeneous Hardware

Juan Fumero, The University of Manchester, United Kingdom, juan.fumero@manchester.ac.uk
Michail Papadimitriou, The University of Manchester, United Kingdom, mpapadimitriou@cs.man.ac.uk
Foivos S. Zakkak, The University of Manchester, United Kingdom, foivos.zakkak@manchester.ac.uk
Maria Xekalaki, The University of Manchester, United Kingdom, maria.xekalaki@manchester.ac.uk
James Clarkson, The University of Manchester, United Kingdom, james.clarkson@manchester.ac.uk
Christos Kotselidis, The University of Manchester, United Kingdom, ckotselidis@cs.man.ac.uk
Abstract
By utilizing diverse heterogeneous hardware resources, developers can significantly improve the performance of their applications. Currently, in order to determine which parts of an application suit a particular type of hardware accelerator better, an offline analysis that uses a priori knowledge of the target hardware configuration is necessary. To make matters worse, the above process has to be repeated every time the application or the hardware configuration changes.
This paper introduces TornadoVM, a virtual machine capable of reconfiguring applications, at run-time, for hardware acceleration based on the currently available hardware resources. Through TornadoVM, we introduce a new level of compilation in which applications can benefit from heterogeneous hardware. We showcase the capabilities of TornadoVM by executing a complex computer vision application and six benchmarks on a heterogeneous system that includes a CPU, an FPGA, and a GPU. Our evaluation shows that by using dynamic reconfiguration, we achieve an average of 7.7x speedup over the statically-configured accelerated code.
CCS Concepts: Software and its engineering -> Virtual machines;
Keywords: Dynamic Reconfiguration, FPGAs, GPUs, JVM
ACM Reference Format:
Juan Fumero, Michail Papadimitriou, Foivos S. Zakkak, Maria Xekalaki, James Clarkson, and Christos Kotselidis. 2019. Dynamic Application Reconfiguration on Heterogeneous Hardware. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '19), April 14, 2019, Providence, RI, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3313808.3313819
1 Introduction
The advent of heterogeneous hardware acceleration as a means to combat the stall imposed by Moore's law [39] created new challenges and research questions regarding programmability, deployment, and integration with current frameworks and runtime systems. The evolution from single-core to multi- or many-core systems was followed by the introduction of hardware accelerators into mainstream computing systems. General Purpose Graphics Processing Units (GPGPUs), Field-Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), and integrated many-core accelerators (e.g., Xeon Phi) are some examples of hardware devices capable of achieving higher performance than CPUs when executing suitable workloads. Whether using a GPU or an FPGA for accelerating specific workloads, developers need to employ programming models such as CUDA [9] and OpenCL [17], or High Level Synthesis (HLS) tools [29] to program and accelerate their code.
The integration of these new programming models into mainstream computing has not been fully achieved in all aspects of programming or programming languages. For example, in the Java world, excluding IBM's J9 [21] GPU support and APARAPI [2], there are no other commercial solutions available for automatically and transparently executing Java programs on hardware accelerators. The situation is even more challenging for FPGA acceleration, where the programming models are not only different from the typical ones, but the tool-chains are also, in the majority of cases, separated from the programming languages [29]. Consequently, programming language developers need to create either new bindings to transition from one programming language to another [22], or specific (static or dynamic) compilers that compile a subset of an existing programming language to another one tailored for a specific device [1, 12, 13, 19, 33, 38].
Therefore, applications are becoming more heterogeneous in their code bases (i.e., mixing programming languages and paradigms), resulting in code that is harder to maintain and debug.
Ideally, developers should follow the programming norm of "write-once-run-anywhere" and allow the underlying runtime system to dynamically adjust the execution depending on the provisioned hardware. Achieving the unification or co-existence of the various programming models under a common runtime would not only result in more efficient code development but also in portable applications, in terms of performance as well, where the system software adapts the application to the underlying hardware configuration.
At the same time, the question of which parts of the code should execute on which accelerator remains open, further increasing the applications' complexity. Various techniques such as manual code inspection [36], offline machine-learning-based models [20], heuristics [6], analytical models [3], and statistical models [16] have been proposed in order to identify which parts of an application are more suitable for acceleration by the available hardware devices. Such approaches, however, require either high expertise on the developer's side in order to reason about which parts of their source code would be better accelerated, or a priori decision making regarding the type and characteristics of the devices upon which the offline analysis will be performed. Therefore, the majority of these approaches require developers' intervention and offline work to achieve the desired results.
Naturally, the research question that arises is: "Is it possible for a system to find the best configuration and execution profile automatically and dynamically?".
In this paper we show that the answer to this question can be positive. We introduce a novel mechanism that tackles the aforementioned challenges by allowing the dynamic and transparent execution of code segments on diverse hardware devices. Our mechanism performs execution permutations, at run-time, in order to find the highest performing configuration of the application. To achieve this, we employ nested application virtualization for Java applications running on a standard Java Virtual Machine (JVM). At the first level of virtualization, standard Java bytecodes are executed either in interpreted or just-in-time (JIT) compiled mode on the CPU. At the second level of virtualization, the code compiled for heterogeneous hardware is executed via a secondary lightweight bytecode interpreter that allows code migration between devices, while handling both execution and data management automatically. This results in a system capable of dynamically adapting its execution until it discovers the highest performing configuration, completely transparently to the developer and the application. In detail, this paper makes the following contributions:
- Presents TornadoVM: a virtualization layer enabling dynamic migration of tasks between different devices.
- Analyses how TornadoVM performs a best-effort execution in order to automatically and dynamically (i.e., at run-time) discover which device or combination of devices results in the best performing execution.
- Discusses how TornadoVM can be used, by existing JVMs with tiered compilation, as a new tier, breaking the CPU-only compilation boundaries.
- Showcases that by using TornadoVM we are able to achieve an average of 7.7x performance improvement over the statically-configured accelerated code for a representative set of benchmarks.
2 Background and Motivation
This work builds upon Tornado [23], an open-source parallel programming framework that enables dynamic JIT compilation and execution of Java applications onto OpenCL-compatible devices, transparently to the user. This way, it enables users inexperienced with hardware accelerators to accelerate their Java applications by introducing a minimal set of changes to their code and choosing an accelerator to target. Tornado consists of the following three main components: a parallel API, a runtime system, and a JIT compiler and driver.
Tornado API: Tornado provides a task-based parallel API for parallel programming within Java. By using the Tornado API, developers express parallelism in existing Java applications with minimal alterations of the sequential Java code. Each task comprises a Java method handle containing the pure Java code and the data it accesses. The Tornado API provides interfaces to create task-schedules: groups of tasks that will be automatically scheduled for execution by the runtime. In addition to defining tasks, Tornado allows developers to indicate that a loop is a candidate for parallel execution through the @Parallel annotation.
Listing 1 shows a parallel map/reduce computation using the Tornado API. The Java class Compute contains two methods, map in line 2 and reduce in line 7. These two methods are written in Java augmented with the @Parallel annotation. The first method performs a vector multiplication while the second computes an addition of the elements.
Lines 13-16 create a task-schedule containing the two tasks of our example along with their input and output arguments. Both the task-schedule and the individual tasks receive string identifiers (s0, t0, and t1) that enable programmers to reference them at runtime.
Furthermore, since our example performs a map-reduce operation, the intermediate results of map (t0) are passed to reduce (t1) through the temp array. Line 16 specifies the array to be copied back from the device to the host through the streamOut method call. Finally, we invoke the execute method (line 17) to signal the execution of the task-schedule.
Listing 1. Example of the Tornado Task Parallel API.
 1  public class Compute {
 2    public void map(float[] in, float[] out) {
 3      for (@Parallel int i = 0; i < n; i++) {
 4        out[i] = in[i] * in[i];
 5      }
 6    }
 7    public void reduce(float[] in, float[] out) {
 8      for (@Parallel int i = 1; i < n; i++) {
 9        out[0] += in[i];
10      }
11    }
12    public void run(float[] in, float[] out, float[] temp) {
13      new TaskSchedule("s0")
14        .task("t0", this::map, in, temp)
15        .task("t1", this::reduce, temp, out)
16        .streamOut(out)
17        .execute();
18    }
19  }

Tornado Runtime: The role of the Tornado runtime system is to analyze data dependencies between tasks within a task-schedule, and to use this information to minimize data transfers between a host (e.g., a CPU) and the devices (e.g., a GPU). In the example of Listing 1, the Tornado runtime will discover the read-after-write dependency on the temp array and, instead of copying it back to the host, it will persist it on the device. Additionally, due to this dependency, it will ensure that task t1 will not be scheduled before task t0 completes.

Figure 1. Tornado JIT compiler outline.
Tornado JIT Compiler and Driver: At runtime, Tornado has a two-tier JIT compilation mode that allows it to first compile Java bytecode to OpenCL C, and then OpenCL C to machine code. Figure 1 provides a high-level overview of Tornado's compilation chain. As shown, Java bytecode is transformed into an Intermediate Representation (IR) graph (Step 1), which is then optimized and lowered incrementally from High-level IR (HIR), to Mid-level IR (MIR), and finally to the Low-level IR (LIR) state, which is close to the assembly level (Step 2). From that point, instead of generating assembly instructions, a special compiler phase is invoked which rewrites the LIR graph to OpenCL C code (the OpenCL code generator) (Step 3). After the OpenCL C source code is created, depending on the target device on which execution will take place, the respective OpenCL device compiler is invoked (Step 4). Finally, the generated binary code gets installed into the code cache (Step 5) and is ready for execution.
Figure 2. Tornado speedup over sequential Java on a range of different input sizes for DFT (Discrete Fourier Transform).
2.1 Motivation
Due to architectural differences, different hardware accelerators tend to favour different workloads. For instance, GPUs are well known for their high efficiency when all threads of the same warp perform the same operation, but they fail to efficiently accelerate parallel applications with complex control flows [26]. Experienced developers with an in-depth understanding of their applications and the underlying hardware can potentially argue about which devices better suit their applications. However, even in such cases, choosing the best configuration or the best device from a family of devices is not trivial [5, 6, 11].
To better understand how different accelerators or data sizes affect the performance of an application, we run the DFT (Discrete Fourier Transform) benchmark using Tornado on three different devices while varying the input size. Figure 2 depicts the obtained results, with the x-axis showing the range of the input size and the y-axis showing the speedup over the sequential Java implementation. Each line in the graph represents one of the three different devices we run our experiment on: an Intel i7 CPU, an NVIDIA GTX 1060 GPU, and an Intel Altera FPGA. Overall, DFT performs better when running on the NVIDIA GPU. However, when running with small sizes, the highest performing device varies. For example, for input sizes between 2^6 and 2^8, the parallel execution on the multi-core CPU is the highest performing one.
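For reference, a DFT kernel written against the Tornado API follows the same pattern as Listing 1. The sketch below is illustrative only: the array names and the exact loop structure are assumptions rather than the benchmark's actual source.

    public class DFT {
        // Illustrative sketch: the outer loop over output bins is annotated as
        // parallel, while each bin accumulates over all input samples.
        public static void dft(float[] inReal, float[] inImag,
                               float[] outReal, float[] outImag) {
            int n = inReal.length;
            for (@Parallel int k = 0; k < n; k++) {
                float sumReal = 0.0f;
                float sumImag = 0.0f;
                for (int t = 0; t < n; t++) {
                    float angle = (float) (2.0 * Math.PI * t * k / n);
                    sumReal += inReal[t] * Math.cos(angle) + inImag[t] * Math.sin(angle);
                    sumImag += inImag[t] * Math.cos(angle) - inReal[t] * Math.sin(angle);
                }
                outReal[k] = sumReal;
                outImag[k] = sumImag;
            }
        }
    }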
The Importance of Dynamic Reconfiguration: As showcased by our experiment, to achieve the highest performance one needs to explore a large space of different executions before discovering the best possible configuration for an application. To make matters worse, this configuration is considered the "best possible" only for the given system setup and input data. Each time the code changes, a new device is introduced, or the input data changes, we need to perform further exploration and potentially restart our application to apply the new configuration.
To address the challenge of discovering the highest performing configuration, we introduce TornadoVM: a system that is able to automatically and dynamically adapt execution to the best possible configuration, according to the user's requirements, for each application and input data size in a heterogeneous system, and without the need to restart the application.
3 TornadoVM
In order to enable dynamic application reconfiguration of the executed applications on heterogeneous hardware, we implement a virtualization layer that uses the Tornado parallel programming framework to run Java applications on heterogeneous hardware. The implemented virtualization layer is responsible for executing and performing code migration of the generated JIT compiled code through a lightweight bytecode-based mechanism, as well as for managing the memory between the different devices. The combination of the aforementioned components results in a heterogeneous JVM, called TornadoVM, capable of dynamically reconfiguring the executed applications on the available hardware resources, completely transparently to the users.
TornadoVM is implemented in Java and, as illustrated in Figure 3, runs inside a standard Java Virtual Machine (e.g., the HotSpot JVM [25, 32]), resulting in a VM that runs inside another VM (a VM in a VM). The TornadoVM interprets TornadoVM bytecodes, manages the corresponding memory, and orchestrates the execution on the heterogeneous devices. The JVM executes Java bytecodes and the interpreter methods of TornadoVM.
To implement TornadoVM, we augment the original Tornado components (shown in light blue in Figure 3) with the components shown in dark green. Namely, we introduce: a) a bytecode generator (Section 3.2) responsible for the generation of TornadoVM-specific bytecodes (Section 3.1) that are used to execute code on heterogeneous devices; b) a bytecode interpreter (Section 3.1) that executes the generated bytecodes; c) a device heap manager (Section 3.3) responsible for managing data across the different devices, ensuring a consistent memory view; and d) a task migration manager (Section 3.4) responsible for migrating tasks between devices. All the components of TornadoVM are device agnostic except for the final connection with the underlying OpenCL driver.
Initially, the application starts on the standard JVM (host), which can execute it on CPUs. When the execution reaches a Tornado API method invocation, the control flow of the execution is passed to the Tornado compiler in order to create and optimize the data flow graph for the task-schedule at hand. The data flow graph is then passed to the TornadoVM bytecode generator, which generates an optimized, compact sequence of TornadoVM bytecodes describing the corresponding instructions of the task-schedule. In contrast to the original Tornado, at this point, TornadoVM does not JIT compile the tasks involved in the compiled task-schedule to binaries. The task compilation, from Java bytecode to binary, is performed lazily by the TornadoVM upon attempting to execute a task whose code has not yet been compiled for the corresponding target device. For each device there is a code cache maintaining the binaries corresponding to the tasks that have already been compiled for this device, to avoid paying the compilation overhead multiple times.
Figure 3. TornadoVM overview and workflow.
3.1 TornadoVM Bytecodes
TornadoVM relies on a custom set of bytecodes that are specifically tailored to describe task-schedules, resulting in a more compact representation which is also easier to parse and translate into heterogeneous hardware management actions. Table 1 lists the bytecodes that are currently generated and supported by the TornadoVM. TornadoVM employs 11 bytecodes that allow the VM to prepare the execution, to perform data allocation and transfers (between the host and the devices), as well as to launch the kernels. All the bytecodes are hardware agnostic and are used to express a task-schedule regardless of the device(s) it will run on.
All the TornadoVM bytecodes take at least one argument, the context identifier, which is a unique number used to identify a task-schedule. TornadoVM generates a new context identifier for each task-schedule in the program. The context identifier is used at run-time to obtain a context object which, among others, contains references to the data accessed by the task-schedule, and information about the device on which the tasks will be executed. Additionally, all bytecodes except BEGIN, END, and BARRIER take at least a bytecode index as an argument. The bytecode indices are a way to uniquely identify bytecodes so that we can then reference them from other bytecodes. In addition, they are used for synchronization and ordering purposes, since 6 out of the 11 bytecodes are non-blocking in order to increase performance by overlapping data transfers and the execution of kernels. The TornadoVM bytecodes can be conceptually grouped into the following categories:
Table 1. TornadoVM bytecodes.

Bytecode         Operands                              Blocking  Description
BEGIN            <context>                             Yes       Creates a new parallel execution context.
ALLOC            <context, BytecodeIndex, object>      No        Allocates a buffer on the target device.
STREAM_IN        <context, BytecodeIndex, object>      No        Performs a copy of an object from host to device.
COPY_IN          <context, BytecodeIndex, object>      No        Allocates and copies an object from host to device.
STREAM_OUT       <context, BytecodeIndex, object>      No        Performs a copy of an object from device to host.
COPY_OUT         <context, BytecodeIndex, object>      No        Allocates and copies an object from device to host.
COPY_OUT_BLOCK   <context, BytecodeIndex, object>      Yes       A blocking COPY_OUT operation.
LAUNCH           <context, BytecodeIndex, task, Args>  No        Executes a task, compiling it if needed.
ADD_DEP          <context, BytecodeIndices>            Yes       Adds a dependency between labels.
BARRIER          <context>                             Yes       Waits for all previous bytecodes to be finished.
END              <context>                             Yes       Ends the parallel execution context.

Initialization and Termination: Bytecode sections in the TornadoVM are encapsulated in regions that start with the BEGIN bytecode and conclude with the END bytecode. These bytecodes essentially signal the activation and deactivation of a TornadoVM context.
Memory Allocation and Data Transferring: In order to execute code on a heterogeneous device, memory has to be allocated and data need to be transferred from the host to the heterogeneous device. The ALLOC bytecode allocates sufficient memory on the device heap (see Section 3.3) to accommodate the objects passed to it as an argument. The COPY_IN bytecode both allocates memory and transfers the object to the device, while the STREAM_IN bytecode only copies the object (assuming a previous allocation). Note that the COPY_IN bytecode is used for read-only data and implements a caching mechanism that allows it to skip data transfers if the corresponding data are already on the target device. On the other hand, the STREAM_IN bytecode is used for data streaming on the heterogeneous device, in which the kernel is executed multiple times with an open channel for receiving new data. The corresponding bytecodes for copying the data back from the device to the host are COPY_OUT and STREAM_OUT.
Synchronization: TornadoVM bytecodes can be ordered and synchronized through the BARRIER, ADD_DEP, and END bytecodes. BARRIER and END wait for all previous bytecodes to reach completion, while ADD_DEP waits only for those corresponding to the indices passed to it as parameters.
Computation: The LAUNCH bytecode is used to execute a kernel. To execute the code on the target device, the TornadoVM first checks whether a binary that targets the corresponding device (according to the context) has already been generated for the task at hand. Upon success, it directly executes the binary on the heterogeneous device. Otherwise, TornadoVM compiles the input task, through Tornado, and installs the binary into the code cache, from where it is then retrieved for execution.
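The check performed by LAUNCH can be pictured with the minimal Java sketch below. All types here are placeholders defined for the example; they are not TornadoVM's real classes.

    // Illustrative sketch of the LAUNCH logic: look up a per-device code cache
    // and JIT compile the task lazily on a miss. Types are placeholders.
    interface Binary { void run(Object... args); }
    interface Device { Binary compile(Task task); }   // stands in for the Tornado JIT path
    interface Task   { String name(); }

    final class LaunchSketch {
        private final java.util.Map<String, Binary> codeCache = new java.util.HashMap<>();

        void launch(Device device, Task task, Object... args) {
            // Key the cache by device and task so each device keeps its own binary.
            String key = device + ":" + task.name();
            Binary binary = codeCache.get(key);
            if (binary == null) {
                binary = device.compile(task);   // lazy JIT compilation
                codeCache.put(key, binary);      // install into the code cache
            }
            binary.run(args);                    // execute on the target device
        }
    }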
3.2 TornadoVM: Bytecode Generator
TornadoVM relies on Tornado to obtain a data flow graph for each task-schedule. The data flow graph is essentially a data structure that describes data dependencies between tasks. During the compilation of task-schedules, Tornado builds this graph and optimizes it to reduce data transfers. This optimized data dependency graph is then used to generate the final TornadoVM bytecode. The TornadoVM bytecode generation is a simple process of traversing the graph and generating, for each input node in the data flow graph, a set of TornadoVM bytecodes. Listing 2 demonstrates the generated TornadoVM bytecode that corresponds to the code from Listing 1.

Listing 2. Generated TornadoVM bytecodes for Listing 1.
 1  BEGIN <0>                            // Starts a new context
 2  COPY_IN <0, bi1, in>                 // Allocates and copies <in>
 3  ALLOC <0, bi2, temp>                 // Allocates <temp> on device
 4  ADD_DEP <0, bi1, bi2>                // Waits for copy and alloc
 5  LAUNCH <0, bi3, @map, in, temp>      // Runs map
 6  ALLOC <0, bi4, out>                  // Allocates <out> on device
 7  ADD_DEP <0, bi3, bi4>                // Waits for alloc and map
 8  LAUNCH <0, bi5, @reduce, temp, out>  // Runs reduce
 9  ADD_DEP <0, bi5>                     // Waits for reduce
10  COPY_OUT_BLOCK <0, bi6, out>         // Copies <out> back
11  END <0>                              // Ends context
TornadoVM's Bytecode Interpreter: The TornadoVM implements a bytecode interpreter for running the TornadoVM bytecodes. Since TornadoVM uses only a limited set of 11 bytecodes, we implement the interpreter as a simple switch statement in Java. TornadoVM bytecodes are not JIT compiled, but the interpreter itself can be JIT compiled by the underlying JVM (e.g., Oracle HotSpot) to improve performance. Note that the TornadoVM bytecodes only orchestrate the execution between the accelerators and the host machine; they do not perform the actual computation. The latter is JIT compiled by Tornado.
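A minimal sketch of such an interpreter loop is shown below. The opcode constants and the comments standing in for the handler bodies are invented for illustration and do not mirror TornadoVM's internal encoding.

    // Minimal sketch of a bytecode interpreter as a plain Java switch statement.
    final class InterpreterSketch {
        static final byte BEGIN = 0, COPY_IN = 1, ALLOC = 2, ADD_DEP = 3,
                          LAUNCH = 4, COPY_OUT_BLOCK = 5, END = 6;

        void run(byte[] bytecodes) {
            for (int pc = 0; pc < bytecodes.length; pc++) {
                switch (bytecodes[pc]) {
                    case BEGIN:          /* activate the execution context */         break;
                    case COPY_IN:        /* allocate and copy an object to device */  break;
                    case ALLOC:          /* allocate a buffer on the device heap */   break;
                    case ADD_DEP:        /* wait for the referenced bytecodes */      break;
                    case LAUNCH:         /* compile lazily if needed, then execute */ break;
                    case COPY_OUT_BLOCK: /* blocking copy of the result to host */    break;
                    case END:            /* deactivate the context */                 return;
                    default: throw new IllegalStateException("Unknown bytecode");
                }
            }
        }
    }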
As shown in Listing 2, a new TornadoVM context starts by running the BEGIN bytecode with context-id 0 (line 1). Note that the context-id maps to a context object that contains initial information regarding the device on which execution will take place. Then, the TornadoVM performs an allocation and a data transfer through the COPY_IN bytecode (line 2). In line 3, TornadoVM performs an allocation for the temp Java array on the target device, and in line 4 it blocks to wait for the copy and the allocation to complete. Note that the ADD_DEP in line 4 receives the bytecode indices of the COPY_IN and the ALLOC bytecodes. Then, in line 5 it launches the map task. At this stage, the TornadoVM compiles the map task by invoking the Tornado JIT compiler and launches the generated binary on the target device. Line 6 allocates the output variable of the reduce task. In addition, since the input for the reduce task is the output of the previous task, a dependency is added (line 7) and execution waits for the finalization of the LAUNCH bytecode at line 5, as well as for the allocation at line 6. Line 8 launches the reduce kernel, line 9 waits for the kernel to complete, and then the result is copied back from the device to the host in line 10. Finally, the current TornadoVM context ends with END at line 11. Each context of the TornadoVM manages one device, meaning that all tasks that are launched from the same context are executed on the same device.
3.3 TornadoVM Memory Manager
Since heterogeneous systems typically comprise a number of distinct memories that are not always shared nor coherent, TornadoVM implements a memory manager which is responsible for keeping the data consistent across the different devices, as well as for allocating and de-allocating memory on them. To minimize the overhead of accelerating code on heterogeneous devices with distinct non-shared memories, TornadoVM pre-allocates a memory region on each accelerator. This region can be seen as a heap extension on the target device and is managed by the TornadoVM memory manager. The initial device heap size is by default configured to be equal to the maximum capacity of global memory on each target device. However, this value can be tuned depending on the needs of each application. By pre-allocating the device heaps, TornadoVM's memory manager becomes solely responsible for transferring data between the host and the target heaps to ensure memory consistency at run-time.
In the common case, TornadoVM just copies the input data from the host to the target device and copies the results back from the target device to the host, according to the corresponding TornadoVM bytecodes for each task-schedule. The most interesting cases where the TornadoVM memory manager acts are: a) migrating task-schedules to a different device (Section 3.4); and b) dynamic reconfiguration (Section 4). In the case of task migration, TornadoVM allocates a new memory area on the new target device and performs the necessary data transfers from the previous target device to the new target device.
In the case of dynamic reconfiguration, in which a single task may be running on more than one device, the process has more steps. Figure 4 sketches how TornadoVM manages memory in such cases. On the top left part of the figure is a Tornado task-schedule that creates a task with two parameters, a and b. Parameter a represents an array of floats and is given as input to the task to be executed. Parameter b represents an array of floats where the user expects to obtain the output. These two variables are maintained in the Java heap on the host side, as in any other Java program.
Figure 4. Overview of the TornadoVM Memory Manager.
However, to enable code acceleration, such variables need to be copied to the target device when the latter does not have access to the host memory. As a result, TornadoVM categorizes variables into two groups: host variables and device variables.
Host Variables: Following the data-flow programming model, TornadoVM splits data into input and output. Data that are used solely as input are considered read-only and thus safe to be accessed by more than one device at a time. Output data, on the other hand, contain the results of some computation and are mandatory for the correctness of the algorithm. When running the same task on different devices concurrently, despite expecting to obtain the same result at the end, we cannot use the same memory for storing that result. Different devices require different amounts of time to perform the computation, and thus one device may overwrite the data of another in an unpredictable order, possibly resulting in stale data. For this reason, the TornadoVM duplicates output variables on the host side. This way, each device writes back the output data to a different memory segment, avoiding the above issue. The code running on the host accesses these data through a proxy. When the execution on all devices finishes and the TornadoVM chooses the best device depending on the input policies (as we will present in Section 4), the TornadoVM sets the proxy to redirect accesses to the corresponding memory segment. For example, in Figure 4, if the selected device is the FPGA, the proxy of b will redirect accesses to the b-FPGA buffer.
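The role of the proxy can be illustrated with the simplified Java sketch below, in which each device writes into its own copy of an output array and reads are redirected once a winner is selected. The class is illustrative and not part of TornadoVM's API.

    // Illustrative sketch of output duplication with a host-side proxy.
    final class OutputProxy {
        private final java.util.Map<String, float[]> perDeviceCopies = new java.util.HashMap<>();
        private float[] selected;

        // Each device obtains (and writes into) its own copy of the output buffer.
        float[] copyFor(String deviceName, int length) {
            return perDeviceCopies.computeIfAbsent(deviceName, d -> new float[length]);
        }

        // Called once the reconfiguration policy has picked the best device.
        void select(String deviceName) {
            selected = perDeviceCopies.get(deviceName);
        }

        // Host code reads results through the proxy.
        float get(int index) {
            return selected[index];
        }
    }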
Device Variables: On the device side, different devices have different characteristics. For instance, integrated GPUs have direct coherent access to the host memory. Other devices may be able to directly access the host memory through their driver, but they still require proper synchronization to ensure coherence (e.g., external GPUs). Finally, there are devices that require explicit memory copies to and from the device. To maximize data throughput, TornadoVM dynamically queries devices for their capabilities and adapts its memory management accordingly.
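A simplified view of this adaptation is sketched below. The two capability flags are hypothetical and merely stand in for the properties that would be queried through the OpenCL driver.

    // Illustrative sketch: pick a data-movement strategy from (hypothetical)
    // device capability flags.
    enum TransferStrategy { SHARED_MEMORY, MAPPED_WITH_SYNC, EXPLICIT_COPY }

    final class TransferPlanner {
        TransferStrategy planFor(boolean hostUnifiedMemory, boolean supportsMappedBuffers) {
            if (hostUnifiedMemory) {
                return TransferStrategy.SHARED_MEMORY;      // e.g., integrated GPUs
            } else if (supportsMappedBuffers) {
                return TransferStrategy.MAPPED_WITH_SYNC;   // e.g., external GPUs via the driver
            } else {
                return TransferStrategy.EXPLICIT_COPY;      // explicit to/from device copies
            }
        }
    }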
3.4 Task Migration Manager
The TornadoVM task-migration manager is a component within the VM that handles code and data migration from one device to another. By employing the bytecodes and the new virtualization layer, TornadoVM is capable of migrating the executing task-schedules to different devices at runtime. Task migration is signalled by changing the target device of a task-schedule. To safely migrate task-schedules without losing data, task migrations are only allowed after task-schedules finish execution, and they become effective on the next execution.
Whenever a task-schedule completes its execution, TornadoVM checks whether the target device has been changed. If the target device has changed, TornadoVM performs two main actions: a) it transfers all the necessary data from the current device to the new target device through its memory manager, and b) it invokes the Tornado JIT compiler to compile all the tasks in the task-schedule for the new target device, if not already compiled. After the transfers reach completion and the code gets compiled, TornadoVM can safely launch the corresponding binary on the target device and continue execution. Section 4 presents how task migration enables TornadoVM to dynamically detect and use the best configuration, according to some policy, for the running application.
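The migration protocol can be summarized by the following simplified sketch, executed each time a task-schedule completes. All names are placeholders; the actual TornadoVM implementation may differ.

    // Illustrative sketch of the task-migration check; names are placeholders.
    final class MigrationSketch {
        private String currentDevice = "CPU";

        void onTaskScheduleCompleted(String requestedDevice) {
            if (!requestedDevice.equals(currentDevice)) {
                // a) move the data the schedule needs onto the new device
                transferLiveData(currentDevice, requestedDevice);
                // b) compile the tasks for the new device if no binary is cached
                compileIfNeeded(requestedDevice);
                currentDevice = requestedDevice;   // the next execution runs there
            }
        }

        private void transferLiveData(String from, String to) { /* memory manager */ }
        private void compileIfNeeded(String device)           { /* Tornado JIT compiler */ }
    }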
4 Dynamic Reconfiguration
With task migration, TornadoVM can dynamically reconfigure the running applications in order to discover the most efficient mapping of task-schedules on devices. TornadoVM starts executing task-schedules on the CPU and, in parallel, it explores the different devices on the system (e.g., the GPU) and collects profiling data. Then, according to a reconfiguration policy, it assigns scores to each device and selects the best candidate to execute each task-schedule.
4.1 Reconfiguration Policies
A reconfiguration policy is essentially the definition of an execution plan, and a function that, given a set of metrics (e.g., total runtime) obtained by executing a task-schedule on a device according to the execution plan, returns an efficiency score. The higher the score, the more efficient the execution of the task-schedule on the corresponding device according to that policy. TornadoVM currently features three such policies: end-to-end, latency, and peak performance:
- End-to-end: Measures the total execution time of the task-schedule by performing a single cold run on each device. The total execution time includes the time spent on JIT compilation, data transfers, and computation. The device that yields the shortest total execution time is considered the most efficient.
- Latency: Same as end-to-end, but does not wait for the profiling of all the devices to complete. By the time the fastest device reaches completion, TornadoVM chooses that device and continues execution, discarding the execution on the rest of the devices.
- Peak performance: Measures the time required to transfer data that are not already cached on the device, and the computation time. JIT compilation and initial data transfers are not included in the measurements. To obtain these measurements, the task-schedule is executed multiple times on the target device to warm it up before obtaining them.
The end-to-end policy is suitable for debugging and optimizing code. Getting access to the end-to-end measurements for each device gives users the power to tweak their programs to favour specific devices, or to identify bottlenecks and fix them to improve performance. The latency policy is more suitable for short-running applications that are not expected to live long enough to offset the overhead of JIT compilation and warming up. The peak performance policy, on the other hand, is more suitable for long-running applications that run the same task-schedules multiple times and thus amortize the initial overhead.
A policy is by default set for all the task-schedules in the whole application and can be altered through a parameter when starting the application. However, to give users more control, we extend the task-based parallel API in Tornado to allow users to specify different policies per task-schedule execution. To avoid complicating the API, we overload the execute method with an optional parameter that defines the reconfiguration policy for the task-schedule at hand. If no parameter is passed, then TornadoVM uses the reconfiguration policy set for the whole application. For instance, to execute a task-schedule using the performance policy we use taskSchedule.execute(Policy.PERFORMANCE).
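For example, using the Compute class from Listing 1, a per-execution policy could be requested as sketched below. Only Policy.PERFORMANCE appears in the text; keeping the TaskSchedule object around and re-executing it in this way is shown for illustration.

    // Re-running the schedule of Listing 1 under an explicit policy (sketch).
    public void runWithPolicy(float[] in, float[] out, float[] temp) {
        TaskSchedule s0 = new TaskSchedule("s0")
            .task("t0", this::map, in, temp)
            .task("t1", this::reduce, temp, out)
            .streamOut(out);

        s0.execute();                      // uses the application-wide policy
        s0.execute(Policy.PERFORMANCE);    // re-runs under the peak performance policy
    }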
Note that, in addition to the aforementioned policies, TornadoVM allows the implementation of custom reconfiguration policies, giving its users the flexibility to set the metric on which they want their application to become more efficient, e.g., energy instead of performance.
4.2 Device Exploration
TornadoVM automatically starts an exhaustive exploration by running each task-schedule on all available devices and profiles their performance in accordance with the selected reconfiguration policy. This way, TornadoVM is able to select the best device for each task-schedule. In addition, TornadoVM does not require an application restart or any prior knowledge from the user's perspective to execute and adapt their code to a target device.

Figure 5. Overview of device selection in TornadoVM.

Figure 5 illustrates how device selection is performed within the TornadoVM. When execution is invoked with a policy, the TornadoVM spawns a set of Java threads. Each thread executes a copy of the input task-schedules for a particular device. Therefore, TornadoVM spawns one thread per heterogeneous device on the system. In parallel with the dynamic exploration, a Java thread is also running the task-schedule on the CPU. This is done to ensure that the application makes progress while we explore alternatives, and to obtain measurements that allow us to compare the heterogeneous execution with sequential Java. Once the execution is finished, TornadoVM collects the profiling information and selects the most suitable device, according to the reconfiguration policy, for the task-schedule at hand. From this point on, TornadoVM remembers the target device for each input task-schedule and policy. Note that the same task-schedule may run multiple times, potentially with different reconfiguration policies, through the overloaded execute method. In this case, whenever a new policy is encountered, the TornadoVM starts the exploration and makes a new decision that better adapts to the given policy. In conclusion, dynamic reconfiguration enables programmers to effortlessly accelerate their applications on any system equipped with heterogeneous hardware. Furthermore, it enables the applications to dynamically adapt to changes in the underlying hardware, e.g., in cases of dynamic resource provisioning.
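The exploration step can be pictured with the simplified sketch below: one Java thread per device runs a copy of the task-schedule, each run is timed, and the fastest device wins. The types and the timing-based score are placeholders for the selected reconfiguration policy.

    // Illustrative sketch of device exploration; a Runnable stands in for a copy
    // of the task-schedule bound to one device, and elapsed time is the score.
    final class ExplorationSketch {
        String explore(java.util.Map<String, Runnable> scheduleCopies) throws InterruptedException {
            java.util.Map<String, Long> elapsed = new java.util.concurrent.ConcurrentHashMap<>();
            java.util.List<Thread> threads = new java.util.ArrayList<>();
            for (java.util.Map.Entry<String, Runnable> e : scheduleCopies.entrySet()) {
                Thread t = new Thread(() -> {
                    long start = System.nanoTime();
                    e.getValue().run();                                // run the copy on this device
                    elapsed.put(e.getKey(), System.nanoTime() - start);
                });
                threads.add(t);
                t.start();
            }
            for (Thread t : threads) t.join();                         // wait for all devices
            // Shortest time wins; the winner is remembered for later executions.
            return elapsed.entrySet().stream()
                          .min(java.util.Map.Entry.comparingByValue())
                          .map(java.util.Map.Entry::getKey)
                          .orElseThrow(IllegalStateException::new);
        }
    }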
4.3 A New High-Performance Compilation Tier
Contemporary managed runtime systems employ tiered compilation to achieve better performance (e.g., the tiered compiler within the JVM). Tiered compilation enables the runtime system to use faster compilers that produce less optimized code for code that is not invoked frequently. As the number of invocations of a code segment increases, the runtime re-compiles it with higher-tier compilers, which might be slower but produce more optimized code. The main idea behind tiered compilation is that it is only worth investing time to optimize code that will be invoked multiple times. Currently, after a JIT compiler reaches the maximum compilation tier (e.g., C2 compilation in the JVM), there are no further optimizations.
Table 2. Experimental Platform.

Hardware
  Processor       Intel Core i7-7700 @ 4.2GHz
  Cores           4 (8 HyperThreads)
  RAM             64GB
  GPU             NVIDIA GTX 1060 (Pascal), up to 1.7GHz, 6GB GDDR5, 1280 CUDA cores
  FPGA            Nallatech 385A, Intel Arria 10 FPGA, two banks of 4GB DDR3 SDRAM each
Software
  OS              CentOS 7.4 (Linux kernel 3.10.0-693)
  OpenCL (CPU)    2.0 (Intel)
  OpenCL (GPU)    1.2 (NVIDIA CUDA 9.0.282)
  OpenCL (FPGA)   1.0 (Intel), Intel FPGA SDK 17.1, HPC Board Support Package (BSP) by Nallatech
  JVM             Java SE 1.8.0_131 64-Bit JVMCI VM
Following the tiered-compilation concept of the JVM, we add dynamic reconfiguration as a new compilation tier, improving the state-of-the-art by enabling it to further optimize code and take advantage of hardware accelerators. The HotSpot JVM employs an interpreter and two compilers in its tiered compilation (C1 and C2). When a method is optimized with the highest-tier compiler (C2), TornadoVM takes action and explores more efficient alternatives, possibly utilizing heterogeneous devices. This integration allows TornadoVM to pay the exploration overhead only for methods that are invoked multiple times or contain loops with multiple iterations. A current limitation of the TornadoVM is that it can only optimize code that is written using the Tornado API; thus, it is not a generally applicable compilation tier.
5 Evaluation
This section presents the experimental evaluation of TornadoVM. We first describe the experimental setup and methodology, then the benchmarks we use, and finally we present and discuss the results.
5.1 Evaluation Setup and Methodology
For the evaluation of TornadoVM we use a heterogeneous system comprising three different processing units: an Intel CPU, an external NVIDIA GPU, and an Intel Altera FPGA. This configuration essentially covers all the currently supported types of target devices of Tornado, on which TornadoVM relies for heterogeneous JIT compilation. Table 2 details the hardware and software configuration of our system.
TornadoVM, being a VM running inside another VM, falls into the performance methodology traits of managed runtime systems [14]. VMs comprise a number of complex subsystems, like the JIT compiler and the garbage collector, that add a level of non-determinism to the obtained results. Adhering to standard techniques for evaluating VMs, we first perform a warm-up phase for every benchmark to stabilize the performance of the JVM. After the warm-up phase finishes, we perform one more run, from which we obtain the measurements that we report. Note that in our measurements the TornadoVM instance itself is not warmed up.
The results we report depend on the reconfiguration policy that we evaluate each time. For example, for the peak performance policy, we only include execution and data transfer times, excluding JIT compilation and device initialization times. On the other hand, for the end-to-end and latency policies, we include all times related to JIT compilation (except for FPGA compilation), device initialization, data transfers, and execution. Anticipating that in the future FPGA synthesis times will decrease, we chose to exclude JIT compilation times from the FPGA measurements when using the end-to-end and latency policies. The current state-of-the-art FPGA synthesis tools take between 60 and 90 minutes to compile our benchmarks; therefore, including the FPGA compilation time would result in non-comparable measurements. However, FPGA initialization and kernel loading are still included in all our measurements. We discuss all JIT compilation times in more detail in Section 5.4.
5.2 Benchmarks
To evaluate TornadoVM, we employ six different benchmarks and a complex computer vision application. Namely, the six benchmarks we use are Saxpy, MonteCarlo, RenderTrack, BlackScholes, NBody, and DFT (Discrete Fourier Transform), and the computer vision application is the Kinect Fusion (KFusion) [31] implementation provided by SLAMBench [30]. The domain of the chosen benchmarks ranges from mathematical and financial applications to physics and linear-algebra kernels, while KFusion creates a 3D representation from a stream of depth images produced by an RGB-D camera such as the Microsoft Kinect.
We ported each benchmark as well as KFusion from C++ and OpenCL to pure Java using the Tornado API. Porting existing applications to the Tornado API requires: a) the creation of a task-schedule to pass the references of existing Java methods, and b) the addition of the @Parallel annotations to the existing loops. This results in 4-8 extra lines of code per task-schedule, regardless of its size. The porting of the six benchmarks resulted in six Tornado-based applications with a single-task task-schedule each, while the porting of KFusion resulted in an application with multiple task-schedules, both single- and multi-task. When TornadoVM runs the sequential version, it ignores the @Parallel annotation and the code is compiled and executed by the standard JVM (e.g., HotSpot).
Workload Size: In addition to exploring different workloads, we also explore the impact of the workload size. Thus, for each benchmark, we vary the input data sizes in powers of two. Table 3 summarizes the input sizes used for each benchmark. Please note that for MonteCarlo, we copy only a single element (the number of iterations) and obtain a new array with the MonteCarlo computation, while for KFusion we use the input scenes from the ICL-NUIM data-set [18].

Table 3. Input and data sizes for the given set of benchmarks.

Benchmark      Data range (min-max)   Max input (MB)   Max output (MB)
Saxpy          256 - 33554432         270              135
MonteCarlo     256 - 33554432         <0.1             268
RenderTrack    64 - 4096              70               50
N-Body         256 - 131072           0.5              0.5
BlackScholes   256 - 4194304          270              135
DFT            64 - 65536             4                1
5.3 Dynamic Reconfiguration Evaluation
Figure 6 shows the performance evaluation of the six benchmarks on all three types of supported hardware devices (CPU, GPU, and FPGA). The x-axis shows the range of input sizes while the y-axis shows the speedup over the sequential Java code optimized by the HotSpot JVM. Each point represents the performance achieved by a particular Tornado device, while the solid black line highlights the speedup of TornadoVM through dynamic reconfiguration. Note that in the cases where the line does not overlap with a point, the code is executed by the HotSpot JVM, since it happens to outperform the Tornado devices. Using the end-to-end and peak-performance policies, we are able to observe the performance of each device on the system. The top of the figure shows the performance results when the end-to-end policy is used, and the bottom shows the results when the peak performance policy is used.
This figure shows dynamic reconfiguration in action: we deploy a benchmark and alter the input data that we pass to it. This way we can observe how TornadoVM dynamically reconfigures it to run on the best device, according to the reconfiguration policy. Additionally, thanks to the profiling data that both the end-to-end and peak-performance policies gather, we can observe the differences between the different devices for each benchmark and input size.
Evaluation of the End-to-End Policy: When using the end-to-end policy, we observe that for small input sizes the HotSpot-optimized sequential code outperforms the other configurations. This is attributed to the relatively small workload which, no matter how fast it executes, cannot hide the overhead of JIT compilation and device initialization. As the input data size increases, we observe that TornadoVM manages to find better suited devices for all the benchmarks except Saxpy. This indicates that the performance gain is not enough to hide the compilation, device initialization, and data transfer overheads for this benchmark. The fact that less slowdown is observed as the input size increases indicates that the JIT compilation time dominates and, as we increase the workload, it starts taking a smaller portion of the end-to-end execution time.
[Figure 6 comprises six panels (Saxpy, MonteCarlo, RenderTrack, BlackScholes, NBody, and DFT) plotting speedup over sequential Java against the input range for Tornado-CPU, Tornado-GPU, Tornado-FPGA, and TornadoVM.]
Figure 6. Speedup of TornadoVM over sequential Java for two different policies. The top set of figures shows the performance of TornadoVM over sequential Java for the end-to-end policy. The bottom set of figures shows the performance of TornadoVM using the peak performance policy.
The rest of the benchmarks are split into two groups. MonteCarlo, BlackScholes, and DFT, after a certain input size, seem to stabilize and perform better on the GPU. RenderTrack and NBody, on the other hand, yield comparable performance when run on the GPU or in parallel on the CPU. As a result, depending on the input size, TornadoVM might prefer one over the other. These two benchmarks showcase why dynamic reconfiguration is important. Imagine moving these workloads to a system with a slower GPU, and then to yet another system with a faster GPU. TornadoVM will be able to use the CPU in the first case and the GPU in the second, getting the best possible performance out of each system without the need for the user to change anything in their code. On average, TornadoVM is 7.7x faster than the execution on the best parallel device for this policy.
Evaluation of the Peak Performance Policy: When using the peak performance policy, we observe a slightly different behaviour, as shown in the lower part of Figure 6. In this case, TornadoVM takes into account only the execution time and the data transfers, excluding JIT compilation and device initialization times. In contrast to the end-to-end policy, we observe that when using the peak performance policy, no matter the input size, Tornado outperforms the HotSpot sequential execution in all benchmarks except Saxpy. This is expected, given that the peak performance policy does not take into account device initialization and JIT compilation, which are the main overheads of cold runs. What is more interesting is that for Saxpy, when using the end-to-end policy, larger input sizes were reducing the gap between sequential Java and the execution on the Tornado devices. On the contrary, when using the peak performance policy, we observe the opposite. This is an indication that in Saxpy the first dominating cost is JIT compilation and device initialization, and the second one is data transfers.
For the rest of the benchmarks, we observe again that they can be split into two groups; however, this time the groups are different. MonteCarlo, RenderTrack, and BlackScholes perform better on the GPU, no matter the input size. This indicates that these benchmarks feature highly parallel computations which also take a significant part of the total execution time. The second group includes NBody and DFT, which for smaller sizes perform better when run in parallel on the multi-core CPU than on the GPU.
Note that, although the FPGA is never selected in our system, it gives high performance for BlackScholes and DFT (25x and 260x respectively compared to sequential Java, and 2.5x and 5x compared to the parallel execution on the CPU). Therefore, users that do not have a powerful GPU can benefit from executing on FPGAs, with the additional advantage that they consume less energy than running on the GPU or CPU [34].
TornadoVM vs Tornado Using the Latency Policy: Figure 7 shows the speedups of TornadoVM, using the latency reconfiguration policy, over the corresponding Tornado executions using the best, on average, device (in our case the GPU). Recall that the latency policy starts running task-schedules on all devices and selects the first to finish, ignoring the rest. It then stores these data to avoid running again on the less optimal devices. We see that for applications such as Saxpy, TornadoVM does not switch to an accelerator, since sequential Java optimized by HotSpot performs better.
Figure 7. Speedup of TornadoVM over Tornado running on the best (on average) device.
Table 4. Breakdown analysis of execution times (in milliseconds, unless otherwise noted) per benchmark.

               Compilation Time                  Host to Device         Execution                   Device to Host          Rest
Benchmark      CPU     FPGA      Load   GPU      CPU    FPGA   GPU      CPU      FPGA     GPU       CPU    FPGA   GPU       CPU   FPGA   GPU
Saxpy          7.44    53 mins   1314   99.64    19.78  19.85  59.01    57.04    248.64   2.72      10.06  13.54  20.57     1.15  20.14  1.50
MonteCarlo     85.85   54 mins   1368   87.60    1 ns   1 ns   1 ns     240.88   456.96   2.75      21.61  59.71  41.14     0.70  0.57   0.70
RenderTrack    111.10  51 mins   1380   105.03   18.59  40.10  58.35    24.50    242.15   1.96      3.86   6.84   7.70      0.69  3.03   2.61
BlackScholes   178.61  114 mins  1420   243.02   16.84  10.30  30.97    1036.12  400.31   4.43      21.07  20.45  41.14     1.09  2.36   0.93
NBody          144.68  51 mins   1387   151.25   0.05   0.04   0.08     101.81   441.48   7.47      0.04   0.10   0.08      1.31  1.21   1.04
DFT            83.80   68 mins   1398   161.96   0.09   0.10   0.16     31674.15 4424.13  460.68    0.05   0.10   0.08      1.03  1.85   1.24
This gives programmers speedups of up to 45x compared to the static Tornado [8]. More interesting are benchmarks such as MonteCarlo, BlackScholes, and DFT. For these benchmarks, TornadoVM can run sequential Java for the smaller input sizes and migrate to the GPU for bigger input sizes, dynamically getting the best performance.
For RenderTrack and NBody, TornadoVM ends up using all devices, except for the FPGA, depending on the input size. For example, in the case of RenderTrack, TornadoVM starts by running sequential Java, then it switches execution to the GPU, and finally to parallel execution on the CPU, giving up to 30x speedup over Tornado. Note that the speedup over Tornado goes down as the input size increases. This is due to the fact that for large input sizes the GPU manages to hide the overhead of the data transfers by processing the data significantly faster than the other devices. As a result, for large input sizes, the GPU ends up dominating in both TornadoVM and Tornado, thus resulting in the same performance, which can be up to three orders of magnitude better than running on a standard JVM.
KFusion: To assess the correctness of TornadoVM and demonstrate its maturity, we opted to use it to accelerate KFusion as well. KFusion is composed of five distinct task-schedules, both single- and multi-task. KFusion is a streaming application that takes as input pictures captured by an RGB-D camera and processes them. After the first few frames have been processed, the system stabilizes and runs the optimized code from the code cache. As a result, we use the peak performance policy to accelerate it. Our evaluation results show that TornadoVM can successfully accelerate KFusion on the evaluation platform, yielding 135.27 FPS (frames per second), compared to the 1.69 FPS achieved by the HotSpot JVM, by automatically selecting the GPU.
5.4 End-to-End Times Breakdown
Table 4 shows the breakdown of the total execution time, in milliseconds, for the largest input size of each benchmark, divided into five categories: compilation, host-to-device data transfers, kernel execution, device-to-host data transfers, and the rest. The rest is the time spent executing the parts of the applications that cannot be accelerated, and it is computed as the total time to execute the benchmark minus all the other categories (TotalT = CompT + H2DT + ExecT + D2HT + RestT).
The compilation time (CompT) includes the time to build and optimize the Tornado data flow graph, and the time to compile the input tasks to the corresponding binary. For FPGA compilation we break the time down into two columns: the first one (FPGA) shows the compilation time including the FPGA synthesis in minutes, and the second (Load) shows the time needed to load the compiled bitstream onto the FPGA in milliseconds. On average, CPU and GPU compilation is in the range of hundreds of milliseconds and is up to four orders of magnitude faster than FPGA compilation. From TornadoVM's perspective this is a limitation of the available FPGA synthesis tools, and it is the reason why it supports both just-in-time and ahead-of-time compilation for FPGAs.
Note that although we use the same configuration for all devices, data transfers between the host and the device (H2DT and D2HT) are faster for FPGAs [4]. This is because of the use of pinned memory (page-locked memory), which enables fast DMA transfers between the host and the accelerators, and appears to have a greater impact on the FPGA. Since Tornado pre-allocates a buffer region for the heap, we cannot have zero copies, because it does not have the pointers to the user data at the moment of calling malloc for the heap.
Regarding the kernel execution (ExecT), we observe that the GPU performs better than the rest of the devices. We also notice that the FPGA kernel execution performs much better than the multi-core CPU for BlackScholes and DFT, with speedups of 2.5x and 7.1x respectively over the multi-core execution. Finally, regarding the rest (RestT), our measurements highlight that TornadoVM practically does not introduce any overhead, except for the FPGA execution, in which we connect our framework with the Altera tools and drivers. In the case of the memory-intensive application, Saxpy, the overhead comes from the delay between enqueuing the data into the command queue and starting the computation.
6 Related Work
This section presents the most relevant related work on dynamic application reconfiguration.
Dynamic Reconfiguration
gMig [28] is a solution for the live migration of GPU processes to other GPUs through virtualization. Compared with TornadoVM, gMig has the limitation that it only targets GPUs. Aira [27] is a compiler and a runtime system that allows developers to automatically instrument existing parallel and heterogeneous code to select the best mapping device. Aira makes use of performance prediction models and resource allocation policies to perform CPU/GPU selection at runtime. TornadoVM, in contrast, does not predict upfront the best device on which to run a set of tasks. Instead, it adapts the execution, at runtime, to the heterogeneous device that best suits the input policy. TornadoVM can thus adapt the execution to any system configuration and achieve the best possible performance with the available resources.
VirtCL [40] is a framework for programming multi-GPU systems in a transparent manner by using data from previous runs to build regression models. These models predict the total execution time of a task on each device. TornadoVM, instead, focuses on a single device at a time, moving execution from the CPU to the best available device. Rethinagiri et al. [36] proposed a new heterogeneous system architecture and a set of applications that can fully exploit all the available hardware. However, their solutions are highly customized for this new platform and mix programming languages, thus increasing software complexity. Che et al. [6] studied the performance of a set of applications on FPGAs and GPUs and presented application characteristics that help decide where to place the code. However, the categorization is very generic and limited to the three benchmarks studied.
HPVM [24] is a compiler, a runtime system, and a virtual instruction set for targeting different heterogeneous systems. The concept of the virtual instruction set in HPVM is similar to the TornadoVM bytecodes. However, all the bytecodes described in TornadoVM are completely hardware-agnostic, allowing them to be easily shipped between different machines and different hardware. Besides, HPVM does not support task reconfiguration as TornadoVM does. Hayashi et al. [20] and Grewe et al. [16] employed machine learning techniques to address the challenge of device selection. In contrast, TornadoVM adapts execution with no prior knowledge of, or models for, the input programs.
Reconfiguration for Interpreted Languages and DSLs
In the dynamic programming language domain, Qunaibit et al. [35] presented MegaGuard, a compiler framework for compiling and running Python programs on GPUs. MegaGuard is able to choose the fastest device to offload the computation to. Although this approach is similar to ours, the analysis is performed on a single GPU instead of multiple devices such as GPUs, FPGAs, and CPUs. Dandelion [7, 37] combines a runtime system and a set of compilers for running Language Integrated Queries (LINQ) on heterogeneous hardware. Dandelion compiles .NET bytecodes to heterogeneous code and generates data-flow graphs for the orchestration of the execution. In contrast, TornadoVM compiles Java bytecodes to heterogeneous code, and data-flow graphs to custom bytecodes which it then interprets for the orchestration of the execution. Leo [10] builds on top of Dandelion and provides dynamic profiling and optimization for heterogeneous execution on GPUs. TornadoVM provides a more generic framework in which tasks can be profiled and re-scheduled between different types of devices at runtime (e.g., from an FPGA to a GPU).
7 Conclusions and Future Work
In this paper we present TornadoVM, a virtualization layer that works in cooperation with standard JVMs and is able to automatically compile and execute code on heterogeneous hardware. In addition, TornadoVM is capable of automatically discovering, at runtime, the best combination of heterogeneous devices for running the input tasks in order to increase the performance of running applications, completely transparently to the users. We also present TornadoVM as a new compilation and execution tier that makes transparent use of heterogeneous hardware. To the best of our knowledge, there is no prior work that can dynamically compile and reconfigure running applications on heterogeneous hardware, including FPGAs, without requiring any a priori knowledge of the underlying hardware and the applications. Finally, we demonstrate that TornadoVM can achieve an average 7.7× speedup over statically-configured parallel executions for a set of six benchmarks.
Future Work
We plan to extend TornadoVM by implementing a batch execution mechanism that will allow users to run big-data applications that do not fit in the memory of a single device. We also plan to introduce new policies, such as ones based on the power draw and energy consumption of each device. Another interesting policy to implement is the ability
to adapt the code based on the real cost (in dollars) of running the code on each device, trying to minimize this cost while maintaining reasonable performance. Furthermore, we plan to extend the dynamic reconfiguration capabilities to individual tasks within a task-schedule, as we currently explore migration at the task-schedule level. We also plan to integrate this solution with multiple VMs running on different nodes in distributed scenarios, namely cloud-based VMs in which code can adapt to different heterogeneous hardware. Moreover, we would like to extend TornadoVM with the ability to share resources, as in [15], allowing users to run multiple task-schedules on GPUs and FPGAs concurrently. Lastly, we plan to introduce a fault-tolerance mechanism that allows users to automatically reconfigure running applications in case of failures.
Acknowledgments
This work is partially supported by the European Union's Horizon 2020 E2Data 780245 and ACTiCLOUD 732366 grants. The authors would also like to thank Athanasios Stratikopoulos and the anonymous reviewers for their valuable feedback.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, Vol. 16. 265–283.
[2] AMD. 2016. Aparapi. (2016). http://aparapi.github.io/.
[3] Ulysse Beaugnon, Antoine Pouille, Marc Pouzet, Jacques Pienaar, and Albert Cohen. 2017. Optimization Space Pruning Without Regrets. In Proceedings of the 26th International Conference on Compiler Construction (CC 2017). ACM, New York, NY, USA, 34–44. https://doi.org/10.1145/3033019.3033023
[4] Ray Bittner, Erik Ruf, and Alessandro Forin. 2014. Direct GPU/FPGA Communication via PCI Express. Cluster Computing 17 (01 Jun 2014). https://doi.org/10.1007/s10586-013-0280-9
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron. 2009. Rodinia: A Benchmark Suite for Heterogeneous Computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). 44–54. https://doi.org/10.1109/IISWC.2009.5306797
[6] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach. 2008. Accelerating Compute-Intensive Applications with GPUs and FPGAs. In 2008 Symposium on Application Specific Processors. 101–107. https://doi.org/10.1109/SASP.2008.4570793
[7] Eric Chung, John Davis, and Jaewon Lee. 2013. LINQits: Big Data on Little Clients. In 40th International Symposium on Computer Architecture. ACM.
[8] James Clarkson, Juan Fumero, Michail Papadimitriou, Foivos S. Zakkak, Maria Xekalaki, Christos Kotselidis, and Mikel Luján. 2018. Exploiting High-performance Heterogeneous Hardware for Java Programs Using Graal. In Proceedings of the 15th International Conference on Managed Languages & Runtimes (ManLang '18). ACM, New York, NY, USA, Article 4, 13 pages. https://doi.org/10.1145/3237009.3237016
[9] NVIDIA Corporation. 2019. CUDA. (2019). Retrieved March 20, 2019 from http://developer.nvidia.com/cuda-zone
[10] Naila Farooqui, Christopher J. Rossbach, Yuan Yu, and Karsten Schwan. 2014. Leo: A Profile-Driven Dynamic Optimization Framework for GPU Applications. In 2014 Conference on Timely Results in Operating Systems (TRIOS 14). USENIX Association, Broomfield, CO. https://www.usenix.org/conference/trios14/technical-sessions/presentation/farooqui
[11] Jeremy Fowers, Greg Brown, Patrick Cooke, and Greg Stitt. 2012. A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-window Applications. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '12). ACM, New York, NY, USA, 47–56. https://doi.org/10.1145/2145694.2145704
[12] Juan Fumero, Michel Steuwer, Lukas Stadler, and Christophe Dubach. 2017. Just-In-Time GPU Compilation for Interpreted Languages with Partial Evaluation. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17). ACM, New York, NY, USA, 60–73. https://doi.org/10.1145/3050748.3050761
[13] Juan José Fumero, Toomas Remmelg, Michel Steuwer, and Christophe Dubach. 2015. Runtime Code Generation and Data Management for Heterogeneous Computing in Java. In Proceedings of the Principles and Practices of Programming on The Java Platform (PPPJ '15). ACM, New York, NY, USA, 16–26. https://doi.org/10.1145/2807426.2807428
[14] Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically Rigorous Java Performance Evaluation. In Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications (OOPSLA '07). ACM, New York, NY, USA, 57–76. https://doi.org/10.1145/1297027.1297033
[15] A. Goswami, J. Young, K. Schwan, N. Farooqui, A. Gavrilovska, M. Wolf, and G. Eisenhauer. 2016. GPUShare: Fair-Sharing Middleware for GPU Clouds. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 1769–1776. https://doi.org/10.1109/IPDPSW.2016.94
[16] Dominik Grewe and Michael F. P. O'Boyle. 2011. A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL. In Compiler Construction, Jens Knoop (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 286–305.
[17] Khronos Group. 2017. OpenCL. (2017). Retrieved March 20, 2019 from https://www.khronos.org/opencl
[18] A. Handa, T. Whelan, J. B. McDonald, and A. J. Davison. 2014. A Benchmark for RGB-D Visual Odometry, 3D Reconstruction and SLAM. In IEEE Intl. Conf. on Robotics and Automation, ICRA. Hong Kong, China, 1524–1531.
[19] Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, and Vivek Sarkar. 2013. Accelerating Habanero-Java Programs with OpenCL Generation. In Proceedings of the 2013 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ '13). ACM, New York, NY, USA, 124–134. https://doi.org/10.1145/2500828.2500840
[20] Akihiro Hayashi, Kazuaki Ishizaki, Gita Koblents, and Vivek Sarkar. 2015. Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection. In Proceedings of the Principles and Practices of Programming on The Java Platform (PPPJ '15). ACM, New York, NY, USA, 27–36. https://doi.org/10.1145/2807426.2807429
[21] IBM. 2018. IBM J9 Virtual Machine. (2018). https://www.ibm.com/support/knowledgecenter/en/SSYKE2_7.0.0/com.ibm.java.win.70.doc/user/java_jvm.html.
[22] JOCL. 2017. Java Bindings for OpenCL. (2017). http://www.jocl.org/.
[23] Christos Kotselidis, James Clarkson, Andrey Rodchenko, Andy Nisbet, John Mawer, and Mikel Luján. 2017. Heterogeneous Managed Runtime Systems: A Computer Vision Case Study. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17). ACM, New York, NY, USA, 74–82. https://doi.org/10.1145/3050748.3050764
[24] Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, and Sarita Adve. 2018. HPVM: Heterogeneous Parallel Virtual Machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). 68–80. https://doi.org/10.1145/3178487.3178493
[25] Thomas Kotzmann, Christian Wimmer, Hanspeter Mössenböck, Thomas Rodriguez, Kenneth Russell, and David Cox. 2008. Design of the Java HotSpot™ Client Compiler for Java 6. ACM Trans. Archit. Code Optim. 5, 1, Article 7 (May 2008), 32 pages. https://doi.org/10.1145/1369396.1370017
[26] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. 2010. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 451–460. https://doi.org/10.1145/1815961.1816021
[27] Robert Lyerly, Alastair Murray, Antonio Barbalace, and Binoy Ravindran. 2018. AIRA: A Framework for Flexible Compute Kernel Execution in Heterogeneous Platforms. In IEEE Transactions on Parallel and Distributed Systems. https://doi.org/10.1109/TPDS.2017.2761748
[28] Jiacheng Ma, Xiao Zheng, Yaozu Dong, Wentai Li, Zhengwei Qi, Bingsheng He, and Haibing Guan. 2018. gMig: Efficient GPU Live Migration Optimized by Software Dirty Page for Full Virtualization. In Proceedings of the 14th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '18). ACM, New York, NY, USA, 31–44. https://doi.org/10.1145/3186411.3186414
[29] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels. 2016. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 10 (Oct 2016), 1591–1604. https://doi.org/10.1109/TCAD.2015.2513673
[30] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. J. Kelly, A. J. Davison, M. Luján, M. F. P. O'Boyle, G. Riley, N. Topham, and S. Furber. 2015. Introducing SLAMBench, a Performance and Accuracy Benchmarking Methodology for SLAM. In 2015 IEEE International Conference on Robotics and Automation (ICRA). 5783–5790. https://doi.org/10.1109/ICRA.2015.7140009
[31] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. 2011. KinectFusion: Real-time Dense Surface Mapping and Tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR '11). IEEE Computer Society, Washington, DC, USA, 127–136. https://doi.org/10.1109/ISMAR.2011.6092378
[32] Michael Paleczny, Christopher Vick, and Cliff Click. 2001. The Java HotSpot™ Server Compiler. In Proceedings of the 2001 Symposium on Java™ Virtual Machine Research and Technology Symposium - Volume 1 (JVM '01). USENIX Association, Berkeley, CA, USA, 1–1. http://dl.acm.org/citation.cfm?id=1267847.1267848
[33] P. C. Pratt-Szeliga, J. W. Fawcett, and R. D. Welch. 2012. Rootbeer: Seamlessly Using GPUs from Java. In Proceedings of the 14th International IEEE High Performance Computing and Communication Conference on Embedded Software and Systems. https://doi.org/10.1109/HPCC.2012.57
[34] Berten Digital Processing. 2016. White Paper: GPU vs FPGA Performance Comparison. Technical Report. http://www.bertendsp.com/pdf/whitepaper/BWP001_GPU_vs_FPGA_Performance_Comparison_v1.0.pdf
[35] Mohaned Qunaibit, Stefan Brunthaler, Yeoul Na, Stijn Volckaert, and Michael Franz. 2018. Accelerating Dynamically-Typed Languages on Heterogeneous Platforms Using Guards Optimization. In 32nd European Conference on Object-Oriented Programming (ECOOP 2018) (Leibniz International Proceedings in Informatics (LIPIcs)), Todd Millstein (Ed.), Vol. 109. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 16:1–16:29. https://doi.org/10.4230/LIPIcs.ECOOP.2018.16
[36] S. K. Rethinagiri, O. Palomar, J. A. Moreno, O. Unsal, and A. Cristal. 2015. Trigeneous Platforms for Energy Efficient Computing of HPC Applications. In 2015 IEEE 22nd International Conference on High Performance Computing (HiPC). 264–274. https://doi.org/10.1109/HiPC.2015.19
[37] Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. Dandelion: A Compiler and Runtime for Heterogeneous Systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 49–68. https://doi.org/10.1145/2517349.2522715
[38] Sumatra. 2015. Sumatra OpenJDK. (2015). http://openjdk.java.net/projects/sumatra/.
[39] Thomas N. Theis and H. S. Philip Wong. 2017. The End of Moore's Law: A New Beginning for Information Technology. Computing in Science and Engineering 19, 2 (March 2017), 41–50. https://doi.org/10.1109/MCSE.2017.29
[40] Yi-Ping You, Hen-Jung Wu, Yeh-Ning Tsai, and Yen-Ting Chao. 2015. VirtCL: A Framework for OpenCL Device Abstraction and Management. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). 161–172. https://doi.org/10.1145/2688500.2688505