Dynamic Application Reconguration on
Heterogeneous Hardware
Juan Fumero
The University of Manchester
United Kingdom
juan.fumero@manchester.ac.uk
Michail Papadimitriou
The University of Manchester
United Kingdom
mpapadimitriou@cs.man.ac.uk
Foivos S. Zakkak
The University of Manchester
United Kingdom
foivos.zakkak@manchester.ac.uk
Maria Xekalaki
The University of Manchester
United Kingdom
maria.xekalaki@manchester.ac.uk
James Clarkson
The University of Manchester
United Kingdom
james.clarkson@manchester.ac.uk
Christos Kotselidis
The University of Manchester
United Kingdom
ckotselidis@cs.man.ac.uk
Abstract
By utilizing diverse heterogeneous hardware resources, developers can significantly improve the performance of their applications. Currently, in order to determine which parts of an application suit a particular type of hardware accelerator better, an offline analysis that uses a priori knowledge of the target hardware configuration is necessary. To make matters worse, the above process has to be repeated every time the application or the hardware configuration changes.
This paper introduces TornadoVM, a virtual machine capable of reconfiguring applications, at run-time, for hardware acceleration based on the currently available hardware resources. Through TornadoVM, we introduce a new level of compilation in which applications can benefit from heterogeneous hardware. We showcase the capabilities of TornadoVM by executing a complex computer vision application and six benchmarks on a heterogeneous system that includes a CPU, an FPGA, and a GPU. Our evaluation shows that by using dynamic reconfiguration, we achieve an average of 7.7× speedup over the statically-configured accelerated code.
CCS Concepts • Software and its engineering → Virtual machines;

Keywords Dynamic Reconfiguration, FPGAs, GPUs, JVM
ACM Reference Format:
Juan Fumero, Michail Papadimitriou, Foivos S. Zakkak, Maria Xekalaki, James Clarkson, and Christos Kotselidis. 2019. Dynamic Application Reconfiguration on Heterogeneous Hardware. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '19), April 14, 2019, Providence, RI, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3313808.3313819

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
VEE '19, April 14, 2019, Providence, RI, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6020-3/19/04...$15.00
https://doi.org/10.1145/3313808.3313819
1 Introduction
The advent of heterogeneous hardware acceleration as a means to combat the stall imposed by Moore's law [39] created new challenges and research questions regarding programmability, deployment, and integration with current frameworks and runtime systems. The evolution from single-core to multi- or many-core systems was followed by the introduction of hardware accelerators into mainstream computing systems. General Purpose Graphics Processing Units (GPGPUs), Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), and integrated many-core accelerators (e.g., Xeon Phi) are some examples of hardware devices capable of achieving higher performance than CPUs when executing suitable workloads. Whether using a GPU or an FPGA for accelerating specific workloads, developers need to employ programming models such as CUDA [9] and OpenCL [17], or High Level Synthesis (HLS) tools [29] to program and accelerate their code.
The integration of these new programming models into mainstream computing has not been fully achieved in all aspects of programming or programming languages. For example, in the Java world, excluding IBM's J9 [21] GPU support and APARAPI [2], there are no other commercial solutions available for automatically and transparently executing Java programs on hardware accelerators. The situation is even more challenging for FPGA acceleration, where not only are the programming models different from the typical ones, but the tool-chains are also, in the majority of cases, separated from the programming languages [29]. Consequently, programming language developers need to create either new bindings to transition from one programming language to another [22], or specific (static or dynamic) compilers that compile a subset of an existing programming language to another one tailored for a specific device [1, 12, 13, 19, 33, 38].
Therefore, applications are becoming more heterogeneous in their code bases (i.e., mixing programming languages and paradigms), resulting in code that is harder to maintain and debug.
Ideally, developers should follow the programming norm of "write-once-run-anywhere" and allow the underlying runtime system to dynamically adjust the execution depending on the provisioned hardware. Achieving the unification or co-existence of the various programming models under a common runtime would result not only in more efficient code development but also in applications that are portable, in terms of performance as well, with the system software adapting the application to the underlying hardware configuration.
At the same time, the question of which parts of the code should execute on which accelerator remains open, further increasing the applications' complexity. Various techniques such as manual code inspection [36], offline machine-learning-based models [20], heuristics [6], analytical models [3], and statistical models [16] have been proposed in order to identify which parts of an application are more suitable for acceleration by the available hardware devices. Such approaches, however, require either high expertise on the developer's side in order to reason about which part of their source code would be better accelerated, or a priori decision making regarding the type and characteristics of the devices upon which the offline analysis will be performed. Therefore, the majority of these approaches require developers' intervention and offline work to achieve the desired results.
Naturally, the research question that arises is: "Is it possible for a system to find the best configuration and execution profile automatically and dynamically?"
In this paper we show that the answer to this question can be positive. We introduce a novel mechanism that tackles the aforementioned challenges by allowing the dynamic and transparent execution of code segments on diverse hardware devices. Our mechanism performs execution permutations, at run-time, in order to find the highest performing configuration of the application. To achieve this, we employ nested application virtualization for Java applications running on a standard Java Virtual Machine (JVM). At the first level of virtualization, standard Java bytecodes are executed either in interpreted or just-in-time (JIT) compiled mode on the CPU. At the second level of virtualization, the code compiled for heterogeneous hardware is executed via a secondary lightweight bytecode interpreter that allows code migration between devices, while automatically handling both execution and data management. This results in a system capable of dynamically adapting its execution until it discovers the highest performing configuration, completely transparently to the developer and the application. In detail, this paper makes the following contributions:
• Presents TornadoVM: a virtualization layer enabling dynamic migration of tasks between different devices.
• Analyses how TornadoVM performs a best-effort execution in order to automatically and dynamically (i.e., at run-time) discover which device or combination of devices results in the best performing execution.
• Discusses how TornadoVM can be used, by existing JVMs with tiered compilation, as a new tier, breaking the CPU-only compilation boundaries.
• Showcases that by using TornadoVM we are able to achieve an average of 7.7× performance improvement over the statically-configured accelerated code for a representative set of benchmarks.
2 Background and Motivation
This work builds upon Tornado [23], an open-source parallel programming framework that enables dynamic JIT compilation and execution of Java applications onto OpenCL-compatible devices, transparently to the user. This way, it enables users who are inexperienced with hardware accelerators to accelerate their Java applications by introducing a minimal set of changes to their code and choosing an accelerator to target. Tornado consists of the following three main components: a parallel API, a runtime system, and a JIT compiler and driver.
Tornado API: Tornado provides a task-based parallel API for parallel programming within Java. By using the Tornado API, developers express parallelism in existing Java applications with minimal alterations of the sequential Java code. Each task comprises a Java method handle containing the pure Java code and the data it accesses. The Tornado API provides interfaces to create task-schedules: groups of tasks that will be automatically scheduled for execution by the runtime. In addition to defining tasks, Tornado allows developers to indicate that a loop is a candidate for parallel execution through the @Parallel annotation.
Listing 1 shows a parallel map/reduce computation using the Tornado API. The Java class Compute contains two methods, map in line 2 and reduce in line 7. These two methods are written in Java augmented with the @Parallel annotation. The first method performs a vector multiplication while the second computes an addition of the elements.
Lines 13–16 create a task-schedule containing the two tasks of our example along with their input and output arguments. Both the task-schedule and the individual tasks receive string identifiers (s0, t0, and t1) that enable programmers to reference them at runtime.
Furthermore, since our example performs a map-reduce operation, the intermediate results of map (t0) are passed to reduce (t1) through the temp array. Line 16 specifies the array to be copied back from the device to the host through the streamOut method call. Finally, we invoke the execute method (line 17) to signal the execution of the task-schedule.
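To make the flow concrete, the following minimal sketch drives the task-schedule of Listing 1 from plain Java. The array size is illustrative, and we assume the field n used by the loops in Compute matches the array length (neither is shown in Listing 1):

float[] in   = new float[8192];
float[] temp = new float[8192];     // intermediate results of map (t0)
float[] out  = new float[1];        // reduce (t1) accumulates into out[0]
// ... fill <in> with input values ...
new Compute().run(in, out, temp);   // builds the task-schedule and executes it
// out[0] now holds the reduction result, copied back via streamOut(out)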
Figure 1. Tornado JIT compiler outline.
Listing 1. Example of the Tornado Task Parallel API.
1  public class Compute {
2    public void map(float[] in, float[] out) {
3      for (@Parallel int i = 0; i < n; i++) {
4        out[i] = in[i] * in[i];
5      }
6    }
7    public void reduce(float[] in, float[] out) {
8      for (@Parallel int i = 1; i < n; i++) {
9        out[0] += in[i];
10     }
11   }
12   public void run(float[] in, float[] out, float[] temp) {
13     new TaskSchedule("s0")
14         .task("t0", this::map, in, temp)
15         .task("t1", this::reduce, temp, out)
16         .streamOut(out)
17         .execute();
18   }
19 }
Tornado Runtime: The role of the Tornado runtime system is to analyze data dependencies between tasks within a task-schedule, and use this information to minimize data transfers between a host (e.g., a CPU) and the devices (e.g., a GPU). In the example of Listing 1, the Tornado runtime will discover the read-after-write dependency on the temp array and, instead of copying it back to the host, it will persist it on the device. Additionally, due to this dependency, it will ensure that task t1 will not be scheduled before task t0 completes.
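As an illustration of this dependency analysis (a minimal sketch, not the actual Tornado implementation), a read-after-write dependency can be detected by checking whether a later task reads any object that an earlier task writes:

import java.util.List;
import java.util.Set;

final class DependencyCheck {
    // A later task depends on an earlier one if it reads an object that the
    // earlier task writes (read-after-write): the object can then stay on the
    // device, and the later task must wait for the earlier one to complete.
    static boolean dependsOn(Set<Object> writesOfEarlier, List<Object> readsOfLater) {
        for (Object read : readsOfLater) {
            if (writesOfEarlier.contains(read)) {
                return true;
            }
        }
        return false;
    }
}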
Tornado JIT Compiler and Driver: At runtime, Tornado has a two-tier JIT compilation mode that allows it to first compile Java bytecode to OpenCL C, and then from OpenCL C to machine code. Figure 1 provides a high-level overview of Tornado's compilation chain. As shown, Java bytecode is transformed to an Intermediate Representation (IR) graph (Step 1), which is then optimized and lowered incrementally from High-level IR (HIR), to Mid-level IR (MIR), finally reaching the Low-level IR (LIR) state, which is close to the assembly level (Step 2). From that point, instead of generating assembly instructions, a special compiler phase is invoked which rewrites the LIR graph to OpenCL C code (the OpenCL code generator) (Step 3). After the OpenCL C source code is created, depending on the target device on which execution will take place, the respective OpenCL device compiler is invoked (Step 4). Finally, the generated binary code gets installed into the code cache (Step 5) and is ready for execution.
Figure 2. Tornado speedup over sequential Java on a range of different input sizes for DFT (Discrete Fourier Transform).
2.1 Motivation
Due to architectural differences, different hardware accelerators tend to favour different workloads. For instance, GPUs are well known for their high efficiency when all threads of the same warp perform the same operation, but they fail to efficiently accelerate parallel applications with complex control flows [26]. Experienced developers with an in-depth understanding of their applications and the underlying hardware can potentially reason about which devices better suit their applications. However, even in such cases, choosing the best configuration or the best device from a family of devices is not trivial [5, 6, 11].
To better understand how different accelerators or data sizes affect the performance of an application, we run the DFT (Discrete Fourier Transform) benchmark using Tornado on three different devices while varying the input size. Figure 2 depicts the obtained results, with the x-axis showing the range of the input size and the y-axis showing the speedup over the sequential Java implementation. Each line in the graph represents one of the three different devices we run our experiment on: an Intel i7 CPU, an NVIDIA GTX 1060 GPU, and an Intel Altera FPGA. Overall, DFT performs better when running on the NVIDIA GPU. However, when running with small sizes, the highest performing device varies. For example, for input sizes between 2^6 and 2^8, the parallel execution on the multi-core CPU is the highest performing one.
The Importance of Dynamic Reconfiguration: As showcased by our experiment, to achieve the highest performance one needs to explore a large space of different executions before discovering the best possible configuration for an application. To make matters worse, this configuration is considered the "best possible" only for the given system setup and input data. Each time the code changes, a new device is introduced, or the input data changes, we need to perform further exploration and potentially restart our application to apply the new configuration.
To address the challenge of discovering the highest performing configuration, we introduce TornadoVM: a system that is able to automatically and dynamically adapt execution to the best possible configuration, according to the user's requirements, for each application and input data size in a heterogeneous system, and without the need to restart the application.
3 TornadoVM
In order to enable dynamic application reconfiguration of the executed applications on heterogeneous hardware, we implement a virtualization layer that uses the Tornado parallel programming framework to run Java applications on heterogeneous hardware. The implemented virtualization layer is responsible for executing and performing code migration of the generated JIT compiled code through a lightweight bytecode-based mechanism, as well as managing the memory between the different devices. The combination of the aforementioned components results in a heterogeneous JVM, called TornadoVM, capable of dynamically reconfiguring the executed applications on the available hardware resources completely transparently to the users.
TornadoVM is implemented in Java and, as illustrated in Figure 3, runs inside a standard Java Virtual Machine (e.g., the HotSpot JVM [25, 32]); resulting in a VM that runs inside another VM (a VM in a VM). The TornadoVM interprets TornadoVM bytecodes, manages the corresponding memory, and orchestrates the execution on the heterogeneous devices. The JVM executes Java bytecodes and the interpreter methods of TornadoVM.
To implement TornadoVM, we augment the original Tornado components (shown in light blue color in Figure 3) with the components shown in dark green color. Namely, we introduce: a) a bytecode generator (Section 3.2) responsible for the generation of TornadoVM-specific bytecodes (Section 3.1) that are used to execute code on heterogeneous devices; b) a bytecode interpreter (Section 3.1) that executes the generated bytecodes; c) a device heap manager (Section 3.3) responsible for managing data across the different devices, ensuring a consistent memory view; and d) a task migration manager (Section 3.4) responsible for migrating tasks between devices. All the components of TornadoVM are device agnostic except for the final connection with the underlying OpenCL driver.
Initially, the application starts on the standard JVM (host), which can execute it on CPUs. When the execution reaches a Tornado API method invocation, the control flow of the execution is passed to the Tornado compiler in order to create and optimize the data flow graph for the task-schedule at hand. The data flow graph is then passed to the TornadoVM bytecode generator, which generates an optimized compact sequence of TornadoVM bytecodes describing the corresponding instructions of the task-schedule. In contrast to the original Tornado, at this point, TornadoVM does not JIT compile the tasks involved in the compiled task-schedule to binaries. The task compilation, from Java bytecode to binary, is performed lazily by the TornadoVM upon attempting to execute a task whose code has not yet been compiled for the corresponding target device. For each device there is a code cache maintaining the binaries corresponding to the tasks that have already been compiled for this device, to avoid paying the compilation overhead multiple times.
Figure 3. TornadoVM overview and workflow.
3.1 TornadoVM Bytecodes
TornadoVM relies on a custom set of bytecodes that are specifically tailored to describe task-schedules, resulting in a more compact representation which is also easier to parse and translate to heterogeneous hardware management actions. Table 1 lists the bytecodes that are currently generated and supported by the TornadoVM. TornadoVM employs 11 bytecodes that allow the VM to prepare the execution, to perform data allocation and transferring (between the host and the devices), as well as to launch the kernels. All the bytecodes are hardware agnostic and are used to express a task-schedule regardless of the device(s) it will run on.
All the TornadoVM bytecodes take at least one argument, the context identifier, which is a unique number used to identify a task-schedule. TornadoVM generates a new context identifier for each task-schedule in the program. The context identifier is used at run-time to obtain a context object which, among others, contains references to the data accessed by the task-schedule, and information about the device on which the tasks will be executed. Additionally, all bytecodes except BEGIN, END, and BARRIER take at least a bytecode index as an argument. The bytecode indices are a way to uniquely identify bytecodes so that we can then reference them from other bytecodes. In addition, they are used for synchronization and ordering purposes, since 6 out of the 11 bytecodes are non-blocking in order to increase performance by overlapping data transfers and execution of kernels. The TornadoVM bytecodes can be conceptually grouped in the following categories:
Initialization and Termination: Bytecode sections in the TornadoVM are encapsulated in regions that start with the BEGIN bytecode and conclude with the END bytecode. These bytecodes essentially signal the activation and deactivation of a TornadoVM context.
Table 1. TornadoVM bytecodes.

Bytecode       | Operands                             | Blocking | Description
BEGIN          | <context>                            | Yes      | Creates a new parallel execution context.
ALLOC          | <context, BytecodeIndex, object>     | No       | Allocates a buffer on the target device.
STREAM_IN      | <context, BytecodeIndex, object>     | No       | Performs a copy of an object from host to device.
COPY_IN        | <context, BytecodeIndex, object>     | No       | Allocates and copies an object from host to device.
STREAM_OUT     | <context, BytecodeIndex, object>     | No       | Performs a copy of an object from device to host.
COPY_OUT       | <context, BytecodeIndex, object>     | No       | Allocates and copies an object from device to host.
COPY_OUT_BLOCK | <context, BytecodeIndex, object>     | Yes      | A blocking COPY_OUT operation.
LAUNCH         | <context, BytecodeIndex, task, Args> | No       | Executes a task, compiling it if needed.
ADD_DEP        | <context, BytecodeIndices>           | Yes      | Adds a dependency between labels.
BARRIER        | <context>                            | Yes      | Waits for all previous bytecodes to be finished.
END            | <context>                            | Yes      | Ends the parallel execution context.
Memory Allocation and Data Transferring: In order to execute code on a heterogeneous device, memory has to be allocated and data need to be transferred from the host to the heterogeneous device. The ALLOC bytecode allocates sufficient memory on the device heap (see Section 3.3) to accommodate the objects passed to it as an argument. The COPY_IN bytecode both allocates memory and transfers the object to the device, while the STREAM_IN bytecode only copies the object (assuming a previous allocation). Note that the COPY_IN bytecode is used for read-only data and implements a caching mechanism that allows it to skip data transfers if the corresponding data are already on the target device. On the other hand, the STREAM_IN bytecode is used for data streaming on the heterogeneous device, in which the kernel is executed multiple times with an open channel for receiving new data. The corresponding bytecodes for copying the data back from the device to the host are COPY_OUT and STREAM_OUT.
Synchronization: TornadoVM bytecodes can be ordered and synchronized through the BARRIER, ADD_DEP, and END bytecodes. BARRIER and END wait for all previous bytecodes to reach completion, while ADD_DEP waits only for those corresponding to the indices passed to it as parameters.
Computation: The LAUNCH bytecode is used to execute a kernel. To execute the code on the target device, the TornadoVM first checks if a binary that targets the corresponding device (according to the context) has been generated for the task at hand. Upon success, it directly executes the binary on the heterogeneous device. Otherwise, TornadoVM compiles the input task, through Tornado, and installs the binary into the code cache from where it is then retrieved for execution.
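The check-then-compile step behind LAUNCH can be sketched as follows; this is a minimal illustration with hypothetical names (Binary, codeCache, tornadoCompiler, device), not the actual TornadoVM classes:

Binary binary = codeCache.lookup(device, task);
if (binary == null) {
    // Java bytecode -> OpenCL C -> device binary, via the Tornado JIT compiler
    binary = tornadoCompiler.compile(task, device);
    codeCache.install(device, task, binary);  // pay the compilation cost once per device
}
device.execute(binary, args);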
3.2 TornadoVM: Bytecode Generator
TornadoVM relies on Tornado to obtain a data flow graph for each task-schedule. The data flow graph is essentially a data structure that describes data dependencies between tasks. During the compilation of task-schedules, Tornado builds this graph and optimizes it to reduce data transfers. This optimized data dependency graph is then used to generate the final TornadoVM bytecode. The TornadoVM bytecode generation is a simple process of traversing the graph and generating, for each input node in the data flow graph, a set of TornadoVM bytecodes.
Listing 2. Generated TornadoVM bytecodes for Listing 1.
1  BEGIN <0>                            // Starts a new context
2  COPY_IN <0, bi1, in>                 // Allocates and copies <in>
3  ALLOC <0, bi2, temp>                 // Allocates <temp> on device
4  ADD_DEP <0, bi1, bi2>                // Waits for copy and alloc
5  LAUNCH <0, bi3, @map, in, temp>      // Runs map
6  ALLOC <0, bi4, out>                  // Allocates <out> on device
7  ADD_DEP <0, bi3, bi4>                // Waits for alloc and map
8  LAUNCH <0, bi5, @reduce, temp, out>  // Runs reduce
9  ADD_DEP <0, bi5>                     // Waits for reduce
10 COPY_OUT_BLOCK <0, bi6, out>         // Copies <out> back
11 END <0>                              // Ends context
Listing 2 demonstrates the generated TornadoVM bytecode that corresponds to the code from Listing 1.
TornadoVM's Bytecode Interpreter: The TornadoVM implements a bytecode interpreter for running the TornadoVM bytecodes. Since TornadoVM uses only a limited set of 11 bytecodes, we implement the interpreter as a simple switch statement in Java. TornadoVM bytecodes are not JIT compiled, but the interpreter itself can be JIT compiled by the underlying JVM (e.g., Oracle HotSpot) to improve performance. Note that the TornadoVM bytecodes only orchestrate the execution between the accelerators and the host machine; they do not perform the actual computation. The latter is JIT compiled by Tornado.
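A minimal sketch of such a dispatch loop is shown below. The bytecode names match Table 1; everything else (the Instruction type, context object, and helper methods) is illustrative rather than the actual TornadoVM implementation:

Instruction instr = fetchNext();
while (instr.opcode != Opcode.END) {
    switch (instr.opcode) {
        case ALLOC:    context.device().allocate(instr.object);          break;
        case COPY_IN:  context.device().allocateAndCopyIn(instr.object); break;
        case LAUNCH:   launch(context, instr.task, instr.args);          break; // compiles lazily
        case ADD_DEP:  waitFor(instr.bytecodeIndices);                   break;
        case BARRIER:  waitForAll(context);                              break;
        // ... the remaining bytecodes of Table 1 are handled analogously ...
        default:       break;
    }
    instr = fetchNext();
}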
As shown in Listing 2, a new TornadoVM context starts by running the BEGIN bytecode with context-id 0 (line 1). Note that the context-id maps to a context object that contains initial information regarding the device on which execution will take place. Then, the TornadoVM performs an allocation and a data transfer through the COPY_IN bytecode (line 2). In line 3, TornadoVM performs an allocation for the temp Java array on the target device, and in line 4 it blocks to wait for the copy and the allocation to be completed. Note that the ADD_DEP in line 4 receives the bytecode indices of the COPY_IN and the ALLOC bytecodes. Then, in line 5 it launches the map task. At this stage, the TornadoVM compiles the map task by invoking the Tornado JIT compiler and launches the generated binary on the target device. Line 6 allocates the output variable of the reduce task. In addition, since the input for the reduce task is the output of the previous task, a dependency is added (line 7) and execution waits for the finalization of the LAUNCH bytecode at line 5, as well as for the allocation at line 6. Line 8 launches the reduce kernel, line 9 waits for the kernel to complete, and then the result is copied back from the device to the host in line 10. Finally, the current TornadoVM context ends with END at line 11. Each context of the TornadoVM manages one device, meaning that all tasks that are launched from the same context are executed on the same device.
3.3 TornadoVM Memory Manager
Since heterogeneous systems typically comprise a number of distinct memories that are not always shared nor coherent, TornadoVM implements a memory manager which is responsible for keeping the data consistent across the different devices, as well as for allocating and de-allocating memory on them. To minimize the overhead of accelerating code on heterogeneous devices with distinct non-shared memories, TornadoVM pre-allocates a memory region on each accelerator. This region can be seen as a heap extension on the target device and is managed by the TornadoVM memory manager. The initial device heap size is by default configured to be equal to the maximum capacity of global memory on each target device. However, this value can be tuned depending on the needs of each application. By pre-allocating the device heaps, TornadoVM's memory manager becomes solely responsible for transferring data between the host and the target heaps to ensure memory consistency at run-time.
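A minimal sketch of this pre-allocation, with hypothetical names (device, DeviceBuffer), assuming the default sizing described above:

long heapSize = device.getGlobalMemorySize();         // default: the device's full global memory
DeviceBuffer deviceHeap = device.allocate(heapSize);  // one-off allocation per accelerator
// Later ALLOC bytecodes carve regions out of <deviceHeap> instead of
// requesting a new buffer from the OpenCL driver for every object.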
In the common case, TornadoVM just copies the input data from the host to the target device and copies back the results from the target device to the host, according to the corresponding TornadoVM bytecode for each task-schedule. The most interesting cases, where the TornadoVM memory manager acts, are: a) migrating task-schedules to a different device (Section 3.4); and b) dynamic reconfiguration (Section 4). In the case of task migration, TornadoVM allocates a new memory area on the new target device and performs the necessary data transfers from the previous target device to the new target device.
In the case of dynamic reconfiguration, in which a single task may be running on more than one device, the process has more steps. Figure 4 sketches how TornadoVM manages memory in such cases. On the top left part of the figure is a Tornado task-schedule that creates a task with two parameters, a and b. Parameter a represents an array of floats and is given as input to the task to be executed. Parameter b represents an array of floats where the user expects to obtain the output. These two variables are maintained in the Java heap on the host side, as in any other Java program.
Figure 4. Overview of the TornadoVM Memory Manager.
However, to enable code acceleration, such variables need to be copied to the target device when the latter does not have access to the host memory. As a result, TornadoVM categorizes variables into two groups: host variables and device variables.
Host Variables: Following the data-flow programming model, TornadoVM splits data into input and output. Data that are used solely as input are considered read-only and thus safe to be accessed by more than one device at a time. Output data, on the other hand, contain the results of some computation and are mandatory for the correctness of the algorithm. When running the same task on different devices concurrently, despite expecting to obtain the same result at the end, we cannot use the same memory for storing that result. Different devices require different time to perform the computation, and thus one device may overwrite the data of the other in an unpredictable order, possibly resulting in stale data. For this reason, the TornadoVM duplicates output variables on the host side. This way, each device writes back the output data to a different memory segment, avoiding the above issue. The code running on the host accesses these data through a proxy. When the execution for all devices finishes and the TornadoVM chooses the best device depending on the input policies (as we will present in Section 4), the TornadoVM sets the proxy to redirect accesses to the corresponding memory segment. For example, in Figure 4, if the selected device is the FPGA, the proxy of b will redirect accesses to the b-FPGA buffer.
Device Variables: On the device side, different devices have different characteristics. For instance, integrated GPUs have direct coherent access to the host memory. Other devices may be able to directly access the host memory through their driver, but they still require proper synchronization to ensure coherence, e.g., external GPUs. Finally, there are devices that require explicit memory copies to and from the device. To maximize data throughput, TornadoVM dynamically queries devices for their capabilities and adapts its memory management accordingly.
3.4 Task Migration Manager
The TornadoVM task-migration manager is a component within the VM that handles code and data migration from one device to another. By employing the bytecodes and the new virtualization layer, TornadoVM is capable of migrating the executing task-schedules to different devices at runtime. Task migration is signalled by changing the target device of a task-schedule. To safely migrate task-schedules without losing data, task migrations are only allowed after task-schedules finish execution, and they become effective on the next execution.
Whenever a task-schedule completes its execution, TornadoVM checks whether the target device has been changed. If the target device has changed, TornadoVM performs two main actions: a) it transfers all the necessary data from the current device to the new target device through its memory manager, and b) it invokes the Tornado JIT compiler to compile all the tasks in the task-schedule for the new target device, if not already compiled. After the transfers reach completion and the code gets compiled, TornadoVM can safely launch the corresponding binary on the target device and continue execution. Section 4 presents how task migration enables TornadoVM to dynamically detect and use the best configuration, according to some policy, for the running application.
4 Dynamic Reconfiguration
With task migration, TornadoVM can dynamically reconfigure the running applications in order to discover the most efficient mapping of task-schedules on devices. TornadoVM starts executing task-schedules on the CPU, and in parallel it explores different devices on the system (e.g., the GPU) and collects profiling data. Then, according to a reconfiguration policy, it assigns scores to each device and selects the best candidate to execute each task-schedule.
4.1 Reconfiguration Policies
Areconguration policy is essentially the denition of an
execution-plan, and a function that given a set of metrics
(e.g., total runtime), obtained by executing a task-schedule
on a device according to the execution-plan, returns an ef-
ciency score. The higher the score, the more ecient the
execution of the task-schedule on the corresponding device
according to that policy. TornadoVM currently features three
such policies, end-to-end,latency and peak performance:
• End-to-end: Measures the total execution time of the task-schedule by performing a single cold run on each device. The total execution time includes the time spent on JIT compilation, data transfers, and computation. The device that yields the shortest total execution time is considered the most efficient.
• Latency: Same as end-to-end, but does not wait for the profiling of all the devices to complete. By the time the fastest device reaches completion, TornadoVM chooses that device and continues execution, discarding the execution of the rest of the devices.
• Peak performance: Measures the time required to transfer data that are not already cached on the device, and the computation time. JIT compilation and initial data transfers are not included in the measurements. To obtain these measurements, the task-schedule is executed multiple times on the target device to warm it up before obtaining them.
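Viewed as code, a policy pairs an execution plan with a scoring function, as in the following minimal sketch (the interfaces and types are illustrative, not the actual TornadoVM API):

interface ExecutionPlan { }                    // e.g., one cold run vs. several warmed-up runs
interface Metrics { double totalTimeMs(); }    // whatever the plan measured on a device
interface ReconfigurationPolicy {
    ExecutionPlan plan();                      // how to execute the task-schedule while profiling
    double score(Metrics m);                   // higher score = more efficient device under this policy
}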
The end-to-end policy is suitable for debugging and optimizing code. Getting access to the end-to-end measurements for each device gives users the power to tweak their programs to favour specific devices, or to identify bottlenecks and fix them to improve performance. The latency policy is more suitable for short-running applications that are not expected to live long enough to offset the overhead of JIT compilation and warming up. The peak performance policy, on the other hand, is more suitable for long-running applications that run the same task-schedules multiple times and thus diminish the initial overhead.
A policy is by default set for all the task-schedules in the whole application, and can be altered through a parameter when starting the application. However, to allow users to have more control, we extend the task-based parallel API in Tornado to allow users to specify different policies per task-schedule execution. To avoid complicating the API, we overload the execute method with an optional parameter that defines the reconfiguration policy for the task-schedule at hand. If no parameters are passed, then TornadoVM uses the reconfiguration policy set for the whole application. For instance, to execute a task-schedule using the performance policy we use taskSchedule.execute(Policy.PERFORMANCE).
Note that, in addition to the aforementioned policies, TornadoVM allows the implementation of custom reconfiguration policies, giving its users the flexibility to set the metric on which they want their application to become more efficient, e.g., energy instead of performance.
4.2 Device Exploration
TornadoVM automatically starts an exhaustive exploration by running each task-schedule on all available devices, and profiles their performance in accordance with the selected reconfiguration policy. This way, TornadoVM is able to select the best device for each task-schedule. In addition, TornadoVM does not require an application restart or any prior knowledge from the user's perspective to execute and adapt their code to a target device.
Figure 5. Overview of device selection in TornadoVM.

Figure 5 illustrates how device selection is performed within the TornadoVM. When execution is invoked with a policy, the TornadoVM spawns a set of Java threads. Each thread executes a copy of the input task-schedules for a particular device. Therefore, TornadoVM spawns one thread
per heterogeneous device on the system. In parallel with the dynamic exploration, a Java thread is also running the task-schedule on the CPU. This is done to ensure that the application makes progress while we explore alternatives, and to obtain measurements that allow us to compare the heterogeneous execution with the sequential Java one. Once the execution is finished, TornadoVM collects the profiling information and selects the most suitable device, according to the reconfiguration policy, for the task-schedule at hand. From this point on, TornadoVM remembers the target device for each input task-schedule and policy. Note that the same task-schedule may run multiple times, potentially with different reconfiguration policies, through the overloaded execute method. In this case, whenever a new policy is encountered, the TornadoVM starts the exploration and makes a new decision that better adapts to the given policy. In conclusion, dynamic reconfiguration enables programmers to effortlessly accelerate their applications on any system equipped with heterogeneous hardware. Furthermore, it enables the applications to dynamically adapt to changes in the underlying hardware, e.g., in cases of dynamic resource provisioning.
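The exploration step can be sketched as follows (java.util imports omitted); all names (Device, platform, Profiles, runCopyOf) are hypothetical, and a real implementation of the latency policy would take the first device to finish instead of joining all threads:

List<Device> devices = platform.allDevices();          // CPU, GPU, FPGA, ...
Profiles profiles = new Profiles();
List<Thread> workers = new ArrayList<>();
for (Device device : devices) {
    Thread worker = new Thread(() ->
            profiles.record(device, runCopyOf(schedule, device)));
    workers.add(worker);
    worker.start();
}
// Meanwhile another thread keeps running the task-schedule as sequential
// Java, so the application makes progress during exploration.
for (Thread worker : workers) worker.join();           // InterruptedException handling omitted
Device best = profiles.best(policy);                   // remembered for this schedule and policy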
4.3 A new High-Performance Tier Compilation
Contemporary managed runtime systems employ tiered compilation to achieve better performance (e.g., the tiered compiler within the JVM). Tiered compilation enables the runtime system to use faster compilers that produce less optimized code for code that is not invoked as frequently. As the number of invocations of a code segment increases, the runtime re-compiles it with higher-tier compilers, which might be slower but produce more optimized code. The main idea behind tiered compilation is that it is only worth investing time to optimize code that will be invoked multiple times. Currently, after a JIT compiler reaches the maximum compilation tier (e.g., C2 compilation in the JVM), there are no further optimizations.
Table 2. Experimental Platform
Hardware
Processor Intel Core i7-7700 @ 4.2GHz
Cores 4 (8 HyperThreads)
RAM 64GB
GPU NVIDIA GTX 1060 (Pascal), up to 1.7GHz,
6GB GDDR5, 1280 CUDA Cores
FPGA Nallatech 385A, Intel Arria 10 FPGA,
Two banks of 4GB DDR3 SDRAM each
Soware
OS CentOS 7.4 (Linux Kernel 3.10.0-693)
OpenCL (CPU) 2.0 (Intel)
OpenCL (GPU) 1.2 (Nvidia CUDA 9.0.282)
OpenCL (FPGA) 1.0 (Intel), Intel FPGA SDK 17.1,
HPC Board Support Package (BSP) by Nallatech
JVM Java SE 1.8.0_131 64-Bit JVMCI VM
Following the tiered-compilation concept of the JVM, we add dynamic reconfiguration as a new compilation tier, improving the state-of-the-art by enabling it to further optimize code and take advantage of hardware accelerators. The HotSpot JVM employs an interpreter and two compilers in its tiered compilation (C1 and C2). When a method is optimized with the highest-tier compiler (C2), TornadoVM takes action and explores more efficient alternatives, possibly utilizing heterogeneous devices. This integration allows TornadoVM to pay the exploration overhead only for methods that are invoked multiple times or contain loops with multiple iterations. A current limitation of the TornadoVM is that it can only optimize code that is written using the Tornado API; thus, it is not a generally applicable compilation tier.
5 Evaluation
This section presents the experimental evaluation of TornadoVM. We first describe the experimental setup and methodology, then the benchmarks we use, and finally we present and discuss the results.
5.1 Evaluation Setup and Methodology
For the evaluation of TornadoVM we use a heterogeneous system comprising three different processing units: an Intel CPU, an external NVIDIA GPU, and an Intel Altera FPGA. This configuration essentially covers all the currently supported types of target devices of Tornado, on which TornadoVM relies for heterogeneous JIT compilation. Table 2 details the hardware and software configurations of our system.
TornadoVM, being a VM running in another VM, falls into the performance methodology traits of managed runtime systems [14]. VMs comprise a number of complex subsystems, like the JIT compiler and the Garbage Collector, that add a level of non-determinism to the obtained results. Adhering to standard techniques of evaluating VMs, we first perform a warm-up phase for every benchmark to stabilize the performance of the JVM. After the warm-up phase finishes, we perform one more run, from which we obtain the measurements that we report. Note that in our measurements the TornadoVM instance itself is not warmed up.
The results we report depend on the reconfiguration policy that we evaluate each time. For example, for the peak performance policy, we only include execution and data transfer times, excluding JIT compilation and device initialization times. On the other hand, for the end-to-end and latency policies, we include all times related to JIT compilation (except for FPGA compilation), device initialization, data transferring, and execution. Anticipating that in the future FPGA synthesis times will decrease, we chose to exclude JIT compilation times from the FPGA measurements when using the end-to-end and latency policies. The current state-of-the-art FPGA synthesis tools take between 60 and 90 minutes to compile our benchmarks. Therefore, including the FPGA compilation time was resulting in non-comparable measurements. However, FPGA initialization and kernel loading are still included in all our measurements. We discuss all JIT compilation times in more detail in Section 5.4.
5.2 Benchmarks
To evaluate TornadoVM, we employ six different benchmarks and a complex computer vision application. Namely, the six benchmarks we use are Saxpy, MonteCarlo, RenderTrack, BlackScholes, NBody, and DFT (Discrete Fourier Transform), and the computer vision application is the Kinect Fusion (KFusion) [31] implementation provided by SLAMBench [30]. The domain of the chosen benchmarks ranges from mathematical and financial applications to physics and linear-algebra kernels, while KFusion creates a 3D representation from a stream of depth images produced by an RGB-D camera such as the Microsoft Kinect.
We ported each benchmark as well as KFusion from C++ and OpenCL to pure Java using the Tornado API. Porting existing applications to the Tornado API requires: a) the creation of a task-schedule to pass the references of existing Java methods, and b) the addition of the @Parallel annotations to the existing loops. This results in 4–8 extra lines of code per task-schedule, regardless of its size. The porting of the six benchmarks resulted in six Tornado-based applications with a single-task task-schedule each, while the porting of KFusion resulted in an application with multiple task-schedules, both single- and multi-task. When TornadoVM runs the sequential version, it ignores the @Parallel annotation and the code is compiled and executed by the standard JVM (e.g., HotSpot).
Workload Size: In addition to exploring different workloads, we also explore the impact of the workload size. Thus, for each benchmark, we vary the input data sizes in powers of two. Table 3 summarizes the input sizes used for each benchmark. Please note that for MonteCarlo, we copy only a single element (the number of iterations) and obtain a new array with the MonteCarlo computation, while for KFusion we use the input scenes from the ICL-NUIM data-set [18].
Table 3. Input and data sizes for the given set of benchmarks. The input and output sizes are the maxima over the data range.

Benchmark    | Data Range (min–max) | Input (MB) | Output (MB)
Saxpy        | 256 – 33554432       | 270        | 135
MonteCarlo   | 256 – 33554432       | <0.1       | 268
RenderTrack  | 64 – 4096            | 70         | 50
N-Body       | 256 – 131072         | 0.5        | 0.5
BlackScholes | 256 – 4194304        | 270        | 135
DFT          | 64 – 65536           | 4          | 1
5.3 Dynamic Reconfiguration Evaluation
Figure 6 shows the performance evaluation of the six benchmarks on all three types of supported hardware devices (CPU, GPU, and FPGA). The x-axis shows the range of input sizes, while the y-axis shows the speedup over the sequential Java code optimized by the HotSpot JVM. Each point represents the performance achieved by a particular Tornado device, while the solid black line highlights the speedup of TornadoVM through dynamic reconfiguration. Note that in the cases where the line does not overlap with a point, the code is executed by the HotSpot JVM, since it happens to outperform the Tornado devices. Using the end-to-end and peak-performance policies, we are able to observe the performance of each device on the system. The top of the figure shows the performance results when the end-to-end policy is used. The bottom of the figure shows the results when the peak performance policy is used.
This figure shows dynamic reconfiguration in action: we deploy a benchmark and alter the input data that we pass to it. This way we can observe how TornadoVM dynamically reconfigures it to run on the best device, according to the reconfiguration policy. Additionally, thanks to the profiling data that both the end-to-end and peak-performance policies gather, we can observe the differences between the different devices for each benchmark and input size.
Evaluation of End-to-End Policy: When using the end-to-end policy, we observe that for small input sizes the HotSpot-optimized sequential code outperforms the other configurations. This is attributed to the relatively small workload which, no matter how fast it executes, cannot hide the overhead of JIT compilation and device initialization. As the input data size increases, we observe that TornadoVM manages to find better suited devices for all the benchmarks except for Saxpy. This indicates that the performance gain is not enough to hide the compilation, device initialization, and data transfer overheads for this benchmark. The fact that less slowdown is observed as the input size increases indicates that JIT compilation time dominates, and as we increase the workload, it starts taking a smaller portion of the end-to-end execution time.
Figure 6. Speedup of TornadoVM over sequential Java for two different policies. The top set of figures shows the performance of TornadoVM over sequential Java for the end-to-end policy. The bottom set of figures shows the performance of TornadoVM using the peak performance policy. (Six panels per set: Saxpy, MonteCarlo, RenderTrack, BlackScholes, NBody, and DFT; lines: Tornado-CPU, Tornado-GPU, Tornado-FPGA, and TornadoVM.)
The rest of the benchmarks are split into two groups. MonteCarlo, BlackScholes, and DFT, after a certain input size, seem to stabilize and perform better on the GPU. RenderTrack and NBody, on the other hand, yield comparable performance when run on the GPU or in parallel on the CPU. As a result, depending on the input size, TornadoVM might prefer one over the other. These two benchmarks showcase why dynamic reconfiguration is important. Imagine moving these workloads to a system with a slower GPU, and to yet another system with a faster GPU. TornadoVM will be able to use the CPU in the first case and the GPU in the second case, getting the best possible performance out of each system without the need for the user to change anything in their code. On average, TornadoVM is 7.7× faster compared to the execution on the best parallel device for this policy.
Evaluation of the Peak Performance Policy: When using the peak performance policy, we observe a slightly different behaviour, as shown at the lower part of Figure 6. In this case, TornadoVM takes into account only the execution time and the data transfers, excluding JIT compilation and device initialization times. In contrast to the end-to-end policy, we observe that when using the peak performance policy, no matter the input size, Tornado outperforms the HotSpot sequential execution in all benchmarks except Saxpy. This is expected, given that the peak performance policy does not take into account device initialization and JIT compilation, which are the main overheads of cold runs. What is more interesting is that for Saxpy, when using the end-to-end policy, larger input sizes were reducing the gap between sequential Java and the execution on the Tornado devices. On the contrary, when using the peak performance policy, we observe the opposite. This is an indication that in Saxpy the first dominating part is JIT compilation and device initialization, and the second one is data transfers.
For the rest of the benchmarks, we observe again that they can be split into two groups; however, this time the groups are different. MonteCarlo, RenderTrack, and BlackScholes perform better on the GPU, no matter the input size. This indicates that these benchmarks feature highly parallel computations which also take a significant part of the total execution time. The second group includes NBody and DFT, which for smaller sizes perform better when run in parallel on the multi-core CPU than on the GPU.
Note that, although the FPGA is never selected in our system, it yields high performance for BlackScholes and DFT (25× and 260× respectively compared to sequential Java, and 2.5× and 5× compared to the parallel execution on the CPU). Therefore, users that do not have a powerful GPU can benefit from executing on FPGAs, with the additional advantage that they consume less energy than running on the GPU or CPU [34].
TornadoVM vs Tornado Using the Latency Policy: Figure 7 shows the speedups of TornadoVM, using the latency reconfiguration policy, over the corresponding Tornado executions using the best device on average (in our case, the GPU). Recall that the latency policy starts running task-schedules on all devices and selects the first to finish, ignoring the rest. It then stores these data to avoid running again on the less optimal devices. We see that for applications such as Saxpy, TornadoVM does not switch to an accelerator, since sequential Java optimized by HotSpot performs better.
Figure 7. Speedup of TornadoVM over Tornado running on the best (on average) device.
Table 4. Breakdown analysis of execution times (in milliseconds, unless otherwise noted) per benchmark. Compilation Time is reported for the CPU, FPGA (synthesis, in minutes), FPGA bitstream Load, and GPU; the remaining column groups report CPU, FPGA, and GPU times.

Benchmark    | Compilation (CPU, FPGA, Load, GPU) | Host to Device (CPU, FPGA, GPU) | Execution (CPU, FPGA, GPU) | Device to Host (CPU, FPGA, GPU) | Rest (CPU, FPGA, GPU)
Saxpy        | 7.44, 53 mins, 1314, 99.64    | 19.78, 19.85, 59.01 | 57.04, 248.64, 2.72       | 10.06, 13.54, 20.57 | 1.15, 20.14, 1.50
MonteCarlo   | 85.85, 54 mins, 1368, 87.60   | 1 ns, 1 ns, 1 ns    | 240.88, 456.96, 2.75      | 21.61, 59.71, 41.14 | 0.70, 0.57, 0.70
RenderTrack  | 111.10, 51 mins, 1380, 105.03 | 18.59, 40.10, 58.35 | 24.50, 242.15, 1.96       | 3.86, 6.84, 7.70    | 0.69, 3.03, 2.61
BlackScholes | 178.61, 114 mins, 1420, 243.02| 16.84, 10.30, 30.97 | 1036.12, 400.31, 4.43     | 21.07, 20.45, 41.14 | 1.09, 2.36, 0.93
NBody        | 144.68, 51 mins, 1387, 151.25 | 0.05, 0.04, 0.08    | 101.81, 441.48, 7.47      | 0.04, 0.10, 0.08    | 1.31, 1.21, 1.04
DFT          | 83.80, 68 mins, 1398, 161.96  | 0.09, 0.10, 0.16    | 31674.15, 4424.13, 460.68 | 0.05, 0.10, 0.08    | 1.03, 1.85, 1.24
This gives programmers speedups of up to 45× compared to the static Tornado [8]. More interesting are benchmarks such as MonteCarlo, BlackScholes, and DFT. For these benchmarks, TornadoVM can run sequential Java for the smaller input sizes and migrate to the GPU for bigger input sizes, dynamically getting the best performance.
For RenderTrack and NBody, TornadoVM ends up using all devices, except for the FPGA, depending on the input size. For example, in the case of RenderTrack, TornadoVM starts running sequential Java, then it switches execution to the GPU, and finally to parallel execution on the CPU, giving up to 30× speedup over Tornado. Note that the speedup over Tornado goes down as the input size increases. This is due to the fact that for large input sizes the GPU manages to hide the overhead of the data transfers by processing the data significantly faster than the other devices. As a result, for large input sizes, the GPU ends up dominating on both the TornadoVM and Tornado, thus resulting in the same performance, which can be up to three orders of magnitude better than running on a standard JVM.
KFusion: To assess the correctness of TornadoVM and demonstrate its maturity, we opted to use it to accelerate KFusion as well. KFusion is composed of five distinct task-schedules, both single- and multi-task. KFusion is a streaming application that takes as input pictures captured by an RGB-D camera and processes them. After the first few frames have been processed, the system stabilizes and runs the optimized code from the code cache. As a result, we use the peak performance policy to accelerate it. Our evaluation results show that TornadoVM can successfully accelerate KFusion on the evaluation platform, yielding 135.27 FPS (frames per second) by automatically selecting the GPU, compared to the 1.69 FPS achieved by the HotSpot JVM.
5.4 End-to-end Times Breakdown
Table 4 shows the breakdown of the total execution time, in milliseconds, for the largest input size of each benchmark, divided into five categories: compilation, host-to-device data transfers, kernel execution, device-to-host data transfers, and the rest. The rest is the time spent executing the parts of the applications that cannot be accelerated, and it is computed as the total time to execute the benchmark minus all the other categories (TotalT = CompT + H2DT + ExecT + D2HT + RestT).
The compilation time (CompT) includes the time to build and optimize the Tornado data flow graph, and the time to compile the input tasks to the corresponding binary. For FPGA compilation we break the time down into two columns: the first one (FPGA) shows the compilation time including the FPGA synthesis in minutes, and the second (Load) shows the time needed to load the compiled bitstream on the FPGA in milliseconds. On average, CPU and GPU compilation is in the range of hundreds of milliseconds and is up to four orders of magnitude faster than FPGA compilation. From TornadoVM's perspective this is a limitation of the available FPGA synthesis tools, and it is the reason why TornadoVM supports both just-in-time and ahead-of-time compilation for FPGAs.
Note that, although we use the same configuration for all devices, data transfers between the host and the device (H2DT and D2HT) are faster for FPGAs [4]. This is due to the use of pinned (page-locked) memory, which enables fast DMA transfers between the host and the accelerators and appears to have a greater impact on the FPGA. Since Tornado pre-allocates a buffer region for the heap, we cannot achieve zero-copy transfers: the pointers to the user data are not yet known at the moment the heap region is allocated.
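To illustrate the mechanism, the following sketch uses the JOCL bindings [22] to obtain a pinned region by letting the OpenCL driver allocate the host memory and then mapping it. This is an illustrative assumption, not TornadoVM's actual allocator.

    import java.nio.ByteBuffer;
    import org.jocl.*;
    import static org.jocl.CL.*;

    // Sketch: allocate a page-locked (pinned) heap region through the
    // OpenCL driver and map it into the host address space, so that
    // subsequent transfers can be performed via DMA.
    final class PinnedHeap {
        static ByteBuffer allocate(cl_context context,
                                   cl_command_queue queue, long bytes) {
            cl_mem buffer = clCreateBuffer(context,
                    CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                    bytes, null, null);
            // The mapped ByteBuffer is backed by the pinned region.
            return clEnqueueMapBuffer(queue, buffer, true, CL_MAP_WRITE,
                    0, bytes, 0, null, null, null);
        }
    }

The zero-copy limitation follows from this scheme: because the pinned region must exist before the user data does, user arrays still have to be copied into it once per transfer.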
Regarding the kernel execution time (ExecT), we observe that the GPU performs better than the rest of the devices. We also notice that the FPGA kernel execution performs much better than the multi-core CPU for BlackScholes and DFT, with speedups of 2.5× and 7.1×, respectively, over the multi-core execution. Finally, regarding the rest of the time (RestT), our measurements highlight that TornadoVM introduces practically no overhead, except for the FPGA execution, in which we connect our framework with the Altera tools and drivers. In the case of the memory-intensive application, saxpy, the overhead comes from the delay between enqueuing the data into the command queue and starting the computation.
6 Related Work
This section presents the most relevant related work on dynamic application reconfiguration.
Dynamic Reconguration
gMig [
28
] is a solution for live
migration of GPU processes to other GPUs, through virtual-
ization. When compared with TornadoVM, gMig poses the
limitation that it only targets GPUs. Aira [
27
] is a compiler
and a runtime system that allows developers to automati-
cally instrument existing parallel and heterogeneous code
to select the best mapping device. Aira makes use of perfor-
mance prediction models and resource allocation policies
to perform CPU/GPU selection at runtime. TornadoVM, in
contrast, does not predict upfront the best device to run a
set of tasks. Instead, it adapts the execution, during runtime,
to the heterogeneous device that best suits the input pol-
icy. TornadoVM can thus adapt the execution to any system
conguration and achieve the best possible performance
utilizing the available resources.
VirtCL [40] is a framework for transparently programming multiple GPUs that uses data from previous runs to build regression models. These models predict the total execution time of a task on each device. TornadoVM, instead, focuses on a single device at a time, moving execution from the CPU to the best device. Rethinagiri et al. [36] proposed a new heterogeneous system architecture and a set of applications that can fully exploit all the available hardware. However, their solutions are highly customized for this new platform, mixing programming languages and thus increasing software complexity. Che et al. [6] studied the performance of a set of applications on FPGAs and GPUs and presented application characteristics for deciding where to place the code. However, the categorization is very generic and limited to the three benchmarks studied.
HPVM [24] is a compiler, a runtime system, and a virtual instruction set for targeting different heterogeneous systems. The concept of the virtual instruction set in HPVM is similar to the TornadoVM bytecodes. However, the TornadoVM bytecodes are completely hardware-agnostic, which makes it easy to ship them between different machines and different hardware. Besides, HPVM does not support task reconfiguration as TornadoVM does. Hayashi et al. [20] and Grewe et al. [16] employed machine-learning techniques to address the challenge of device selection. In contrast, TornadoVM adapts execution with no prior knowledge of, or models for, the input programs.
Reconguration for Interpreted Languages and DSLs
In the dynamic programming language domain, Qunaibit et
al. [
35
] presented MegaGuard, a compiler framework for com-
piling and running Python programs on GPUs. MegaGuard
is able to choose the fastest device to ooad the computa-
tion. Although this approach is similar to ours, the analysis
was performed on a single GPU instead of multiple devices
such as GPUs, FPGAs and CPUs. Dandelion [
7
,
37
] combines
a runtime system and a set of compilers for running Lan-
guage Integrated Queries (LINQs) on heterogeneous hard-
ware. Dandelion compiles .NET bytecodes to heterogeneous
code, and generates data-ow-graphs for the orchestration
of the execution. In contrast, TornadoVM compiles Java byte-
codes to heterogeneous code, and data-ow-graphs to cus-
tom bytecode which it then interprets for the orchestration
of the execution. Leo [
10
] builds on top of Dandelion and pro-
vides dynamic proling and optimization for heterogeneous
execution on GPUs. TornadoVM provides a more generic
framework in which tasks can be proled and re-scheduled
between dierent types of devices at runtime (e.g., from an
FPGA to a GPU).
7 Conclusions and Future Work
In this paper we present TornadoVM, a virtualization layer that works in cooperation with standard JVMs and is able to automatically compile and execute code on heterogeneous hardware. In addition, TornadoVM is capable of automatically discovering, at runtime, the best combination of heterogeneous devices for running the input tasks, increasing the performance of running applications completely transparently to the users. We also present TornadoVM as a new tier of compilation and execution that makes transparent use of heterogeneous hardware. To the best of our knowledge, there is no prior work that can dynamically compile and reconfigure running applications on heterogeneous hardware, including FPGAs, without requiring any a priori knowledge of the underlying hardware and the applications. Finally, we demonstrate that TornadoVM achieves, on average, a 7.7× speedup over statically-configured parallel executions for a set of six benchmarks.
Future Work. We plan to extend TornadoVM by implementing a batch-execution mechanism that will allow users to run big-data applications that do not fit in the memory of a single device. We also plan to introduce new policies, such as ones based on the power draw and energy consumption of each device. Another interesting policy to implement is the ability to adapt the code based on the real cost (in dollars) of running it on each device, trying to minimize those costs at reasonable performance. Furthermore, we plan to extend the dynamic reconfiguration capabilities to individual tasks within a task-schedule, since we currently explore migration at the task-schedule level. We also plan to integrate this solution with multiple VMs running on different nodes in distributed scenarios, namely cloud-based VMs in which code can adapt to different heterogeneous hardware. Moreover, we would like to extend TornadoVM with the ability to share resources, as in [15], allowing users to run multiple task-schedules on GPUs and FPGAs concurrently. Lastly, we plan to introduce a fault-tolerance mechanism that allows users to automatically reconfigure running applications in case of failures.
Acknowledgments
This work is partially supported by the European Union's Horizon 2020 E2Data 780245 and ACTiCLOUD 732366 grants. The authors would also like to thank Athanasios Stratikopoulos and the anonymous reviewers for their valuable feedback.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, Vol. 16. 265–283.
[2] AMD. 2016. Aparapi. (2016). http://aparapi.github.io/.
[3] Ulysse Beaugnon, Antoine Pouille, Marc Pouzet, Jacques Pienaar, and Albert Cohen. 2017. Optimization Space Pruning Without Regrets. In Proceedings of the 26th International Conference on Compiler Construction (CC 2017). ACM, New York, NY, USA, 34–44. https://doi.org/10.1145/3033019.3033023
[4] Ray Bittner, Erik Ruf, and Alessandro Forin. 2014. Direct GPU/FPGA Communication via PCI Express. Cluster Computing 17 (01 Jun 2014). https://doi.org/10.1007/s10586-013-0280-9
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron. 2009. Rodinia: A Benchmark Suite for Heterogeneous Computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). 44–54. https://doi.org/10.1109/IISWC.2009.5306797
[6] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach. 2008. Accelerating Compute-Intensive Applications with GPUs and FPGAs. In 2008 Symposium on Application Specific Processors. 101–107. https://doi.org/10.1109/SASP.2008.4570793
[7] Eric Chung, John Davis, and Jaewon Lee. 2013. LINQits: Big Data on Little Clients. In 40th International Symposium on Computer Architecture. ACM.
[8] James Clarkson, Juan Fumero, Michail Papadimitriou, Foivos S. Zakkak, Maria Xekalaki, Christos Kotselidis, and Mikel Luján. 2018. Exploiting High-performance Heterogeneous Hardware for Java Programs Using Graal. In Proceedings of the 15th International Conference on Managed Languages & Runtimes (ManLang '18). ACM, New York, NY, USA, Article 4, 13 pages. https://doi.org/10.1145/3237009.3237016
[9] NVIDIA Corporation. 2019. CUDA. (2019). Retrieved March 20, 2019 from http://developer.nvidia.com/cuda-zone
[10] Naila Farooqui, Christopher J. Rossbach, Yuan Yu, and Karsten Schwan. 2014. Leo: A Profile-Driven Dynamic Optimization Framework for GPU Applications. In 2014 Conference on Timely Results in Operating Systems (TRIOS 14). USENIX Association, Broomfield, CO. https://www.usenix.org/conference/trios14/technical-sessions/presentation/farooqui
[11] Jeremy Fowers, Greg Brown, Patrick Cooke, and Greg Stitt. 2012. A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-window Applications. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '12). ACM, New York, NY, USA, 47–56. https://doi.org/10.1145/2145694.2145704
[12] Juan Fumero, Michel Steuwer, Lukas Stadler, and Christophe Dubach. 2017. Just-In-Time GPU Compilation for Interpreted Languages with Partial Evaluation. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17). ACM, New York, NY, USA, 60–73. https://doi.org/10.1145/3050748.3050761
[13] Juan José Fumero, Toomas Remmelg, Michel Steuwer, and Christophe Dubach. 2015. Runtime Code Generation and Data Management for Heterogeneous Computing in Java. In Proceedings of the Principles and Practices of Programming on The Java Platform (PPPJ '15). ACM, New York, NY, USA, 16–26. https://doi.org/10.1145/2807426.2807428
[14] Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically Rigorous Java Performance Evaluation. In Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications (OOPSLA '07). ACM, New York, NY, USA, 57–76. https://doi.org/10.1145/1297027.1297033
[15] A. Goswami, J. Young, K. Schwan, N. Farooqui, A. Gavrilovska, M. Wolf, and G. Eisenhauer. 2016. GPUShare: Fair-Sharing Middleware for GPU Clouds. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 1769–1776. https://doi.org/10.1109/IPDPSW.2016.94
[16] Dominik Grewe and Michael F. P. O'Boyle. 2011. A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL. In Compiler Construction, Jens Knoop (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 286–305.
[17] Khronos Group. 2017. OpenCL. (2017). Retrieved March 20, 2019 from https://www.khronos.org/opencl
[18] A. Handa, T. Whelan, J.B. McDonald, and A.J. Davison. 2014. A Benchmark for RGB-D Visual Odometry, 3D Reconstruction and SLAM. In IEEE Intl. Conf. on Robotics and Automation, ICRA. Hong Kong, China, 1524–1531.
[19] Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, and Vivek Sarkar. 2013. Accelerating Habanero-Java Programs with OpenCL Generation. In Proceedings of the 2013 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ '13). ACM, New York, NY, USA, 124–134. https://doi.org/10.1145/2500828.2500840
[20] Akihiro Hayashi, Kazuaki Ishizaki, Gita Koblents, and Vivek Sarkar. 2015. Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection. In Proceedings of the Principles and Practices of Programming on The Java Platform (PPPJ '15). ACM, New York, NY, USA, 27–36. https://doi.org/10.1145/2807426.2807429
[21] IBM. 2018. IBM J9 Virtual Machine. (2018). https://www.ibm.com/support/knowledgecenter/en/SSYKE2_7.0.0/com.ibm.java.win.70.doc/user/java_jvm.html.
[22] JOCL. 2017. Java Bindings for OpenCL. (2017). http://www.jocl.org/.
[23] Christos Kotselidis, James Clarkson, Andrey Rodchenko, Andy Nisbet, John Mawer, and Mikel Luján. 2017. Heterogeneous Managed Runtime Systems: A Computer Vision Case Study. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17). ACM, New York, NY, USA, 74–82. https://doi.org/10.1145/3050748.3050764
[24] Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, and Sarita Adve. 2018. HPVM: Heterogeneous Parallel Virtual Machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). 68–80. https://doi.org/10.1145/3178487.3178493
[25] Thomas Kotzmann, Christian Wimmer, Hanspeter Mössenböck, Thomas Rodriguez, Kenneth Russell, and David Cox. 2008. Design of the Java HotSpot Client Compiler for Java 6. ACM Trans. Archit. Code Optim. 5, 1, Article 7 (May 2008), 32 pages. https://doi.org/10.1145/1369396.1370017
[26] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. 2010. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 451–460. https://doi.org/10.1145/1815961.1816021
[27] Robert Lyerly, Alastair Murray, Antonio Barbalace, and Binoy Ravindran. 2018. AIRA: A Framework for Flexible Compute Kernel Execution in Heterogeneous Platforms. In IEEE Transactions on Parallel and Distributed Systems. https://doi.org/10.1109/TPDS.2017.2761748
[28] Jiacheng Ma, Xiao Zheng, Yaozu Dong, Wentai Li, Zhengwei Qi, Bingsheng He, and Haibing Guan. 2018. gMig: Efficient GPU Live Migration Optimized by Software Dirty Page for Full Virtualization. In Proceedings of the 14th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '18). ACM, New York, NY, USA, 31–44. https://doi.org/10.1145/3186411.3186414
[29] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels. 2016. A Survey and Evaluation of FPGA High-Level Synthesis Tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 10 (Oct 2016), 1591–1604. https://doi.org/10.1109/TCAD.2015.2513673
[30] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. J. Kelly, A. J. Davison, M. Luján, M. F. P. O'Boyle, G. Riley, N. Topham, and S. Furber. 2015. Introducing SLAMBench, a Performance and Accuracy Benchmarking Methodology for SLAM. In 2015 IEEE International Conference on Robotics and Automation (ICRA). 5783–5790. https://doi.org/10.1109/ICRA.2015.7140009
[31] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. 2011. KinectFusion: Real-time Dense Surface Mapping and Tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR '11). IEEE Computer Society, Washington, DC, USA, 127–136. https://doi.org/10.1109/ISMAR.2011.6092378
[32] Michael Paleczny, Christopher Vick, and Cliff Click. 2001. The Java HotSpot Server Compiler. In Proceedings of the 2001 Symposium on Java Virtual Machine Research and Technology Symposium - Volume 1 (JVM '01). USENIX Association, Berkeley, CA, USA, 1–1. http://dl.acm.org/citation.cfm?id=1267847.1267848
[33] P.C. Pratt-Szeliga, J.W. Fawcett, and R.D. Welch. 2012. Rootbeer: Seamlessly Using GPUs from Java. In Proceedings of the 14th International IEEE High Performance Computing and Communication Conference on Embedded Software and Systems. https://doi.org/10.1109/HPCC.2012.57
[34] Berten Digital Processing. 2016. White Paper: GPU vs FPGA Performance Comparison. Technical Report. http://www.bertendsp.com/pdf/whitepaper/BWP001_GPU_vs_FPGA_Performance_Comparison_v1.0.pdf
[35] Mohaned Qunaibit, Stefan Brunthaler, Yeoul Na, Stijn Volckaert, and Michael Franz. 2018. Accelerating Dynamically-Typed Languages on Heterogeneous Platforms Using Guards Optimization. In 32nd European Conference on Object-Oriented Programming (ECOOP 2018) (Leibniz International Proceedings in Informatics (LIPIcs)), Todd Millstein (Ed.), Vol. 109. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 16:1–16:29. https://doi.org/10.4230/LIPIcs.ECOOP.2018.16
[36] S. K. Rethinagiri, O. Palomar, J. A. Moreno, O. Unsal, and A. Cristal. 2015. Trigeneous Platforms for Energy Efficient Computing of HPC Applications. In 2015 IEEE 22nd International Conference on High Performance Computing (HiPC). 264–274. https://doi.org/10.1109/HiPC.2015.19
[37] Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. Dandelion: A Compiler and Runtime for Heterogeneous Systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 49–68. https://doi.org/10.1145/2517349.2522715
[38] Sumatra. 2015. Sumatra OpenJDK. (2015). http://openjdk.java.net/projects/sumatra/.
[39] Thomas N. Theis and H. S. Philip Wong. 2017. The End of Moore's Law: A New Beginning for Information Technology. Computing in Science and Engg. 19, 2 (March 2017), 41–50. https://doi.org/10.1109/MCSE.2017.29
[40] Yi-Ping You, Hen-Jung Wu, Yeh-Ning Tsai, and Yen-Ting Chao. 2015. VirtCL: A Framework for OpenCL Device Abstraction and Management. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). 161–172. https://doi.org/10.1145/2688500.2688505