CloudCL: Distributed Heterogeneous Computing on Cloud Scale

IEEE Copyright Notice
Copyright © IEEE. Personal use of this material is permitted. However, permission to
reprint/republish this material for advertising or promotional purposes or for creating new
collective works for resale or redistribution to servers or lists, or to reuse any copyrighted
component of this work in other works must be obtained from the IEEE.
This material is presented to ensure timely dissemination of scholarly and technical
work. Copyright and all rights therein are retained by authors or by other copyright
holders. All persons copying this information are expected to adhere to the terms and
constraints invoked by each author's copyright. In most cases, these works may not be
reposted without the explicit permission of the copyright holder.
This work has been published in “2017 Fifth International Symposium on Computing and
Networking (CANDAR)”, 19-22 November 2017, Aomori, Japan.
DOI: 10.1109/CANDAR.2017.49
CloudCL: Distributed Heterogeneous
Computing on Cloud Scale
Max Plauth, Florian Rösler and Andreas Polze
Operating Systems and Middleware Group
Hasso Plattner Institute for Digital Engineering
University of Potsdam, Germany
Abstract—The ever-growing demand for computing resources
has reached a wide range of application domains. Even though
the ubiquitous availability of cloud-based GPU instances pro-
vides an abundance of computing resources, the programmatic
complexity of utilizing heterogeneous hardware in a scale-out
scenario is not yet addressed sufficiently. We deal with this
issue by introducing the CloudCL framework, which enables
developers to focus their implementation efforts on compute
kernels without having to consider inter-node communication.
Using CloudCL, developers can access the resources of an entire
cluster as if they were local resources. The framework also
facilitates the development of cloud-native application behavior
by supporting dynamic addition and removal of resources at
runtime. The combination of a straightforward job design and the
corresponding job scheduling framework make sure that cluster
resources are used efficiently and fairly. In an extensive performance evaluation, we demonstrate that the framework provides close-to-linear scale-out capabilities in multi-node deployments.

I. INTRODUCTION

In the age of big data, compute tasks are gaining complexity
and data volumes are growing by the day. Unlike the High-
Performance Computing (HPC) domain, this ever-growing
demand for computing resources is not restricted to selected
applications and algorithms, but it concerns a wide range of
application domains. As a result, an increasing number of
everyday use cases are developing a demand for computing
resources that approaches that of certain HPC use cases [1, 2].
Even though hardware accelerators such as Graphics Pro-
cessing Units (GPUs) or Field-Programmable Gate Arrays
(FPGAs) are a popular approach for satisfying these de-
mands, operating a heterogeneous compute infrastructure is
still expensive [3] and involves a high degree of application
complexity [4]. To a certain degree, the economic concerns
are alleviated by the wide availability of cloud-based ac-
celerator instances equipped with GPUs, FPGAs, or other
accelerators. However, the programmatic complexity of uti-
lizing heterogeneous hardware in a scale-out scenario is not
yet addressed sufficiently. Implementing applications for such
massively parallel, distributed systems is already a challenging
task for software engineers that are well-trained in parallel
implementation strategies. For domain experts without deeper
software expertise, it is very hard to write code that efficiently
utilizes heterogeneous resources, especially in a distributed
environment [5].
We address this issue by introducing the CloudCL frame-
work, which enables developers and domain experts to focus
their implementation efforts on compute kernels without hav-
ing to consider inter-node communication and management
tasks for heterogeneous compute devices (see Figure 1). Based
on the API-forwarding capabilities of dOpenCL [4], CloudCL
enables developers to access the resources of an entire cluster
as if they were local resources. The complexity of kernel
development is further reduced by employing Aparapi [6] as
a high-level, Java-based programming interface for domain
experts. CloudCL combines the capabilities of these base
technologies and augments them with a scalable job design
concept and a corresponding job scheduling system, enabling
cloud-native application behavior by supporting dynamic ad-
dition and removal of resources at runtime. In an extensive
performance evaluation, we demonstrate that the framework
provides close-to-linear scale-out capabilities both in an on-
premise hosting environment and using Amazon EC2-based
public cloud resources.
This paper is structured as follows: Section II provides back-
ground about the employed base technologies. Subsequently,
Section III reviews related work in the field. CloudCL and
its fundamental concepts are introduced in Section IV. Lastly,
Section V evaluates the scale-out behavior of the framework
before a conclusion is reached in Section VI.
Fig. 1: The CloudCL framework hides compute device management (Aparapi) and inter-node communication (dOpenCL), allowing developers to focus on compute kernel development.
II. BACKGROUND

To provide background about the base technologies employed by CloudCL, this section outlines the properties of dOpenCL and Aparapi.
A. dOpenCL
dOpenCL is a wrapper library that enables users to trans-
parently utilize OpenCL devices installed in remote machines
based on API forwarding techniques [4]. The library provides
its own Installable Client Driver (ICD), which forwards the
API calls to specified remote machines in the network that
run a dOpenCL daemon. The calls are received by the daemon
and are executed using the available native OpenCL ICDs
on the remote machine with the results being returned via
network. This allows utilizing remote devices as if they were
installed locally in the host machine. For example, a remote
GPU appears to the application as if it were installed in the local
machine’s PCI Express slot. Therefore, OpenCL kernels do not
require changes to run remotely, as dOpenCL hides network
transfers behind the standard OpenCL API. An overview of
the architecture of an example cluster is shown in Figure 2.
Fig. 2: dOpenCL ties in remote resources in a local ICD.
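The forwarding principle can be sketched in plain Java. The `Device`, `Daemon`, and `ForwardingDevice` types below are illustrative stand-ins for the OpenCL API, the native ICD, and the dOpenCL daemon, not actual dOpenCL interfaces; network transport is elided:

```java
import java.util.function.Function;

// Minimal sketch of API forwarding: the client-side "ICD" exposes the same
// interface as a local device, but ships each call to a daemon that executes
// it on the machine where the device is actually installed.
public class ForwardingSketch {

    // The device interface the application programs against (stands in for OpenCL).
    interface Device {
        int[] runKernel(Function<Integer, Integer> kernel, int[] input);
    }

    // A native device executing calls locally (stands in for a native OpenCL ICD).
    static class LocalDevice implements Device {
        public int[] runKernel(Function<Integer, Integer> kernel, int[] input) {
            int[] out = new int[input.length];
            for (int i = 0; i < input.length; i++) out[i] = kernel.apply(input[i]);
            return out;
        }
    }

    // The daemon on the remote machine wraps its native device.
    static class Daemon {
        private final Device nativeDevice = new LocalDevice();
        int[] receive(Function<Integer, Integer> kernel, int[] input) {
            // Execute the forwarded call with the native ICD and return results.
            return nativeDevice.receiveHelper(this, kernel, input);
        }
    }

    // The forwarding "ICD": implements the same Device interface, so the
    // application cannot tell that the device is remote.
    static class ForwardingDevice implements Device {
        private final Daemon daemon;
        ForwardingDevice(Daemon daemon) { this.daemon = daemon; }
        public int[] runKernel(Function<Integer, Integer> kernel, int[] input) {
            return daemon.receive(kernel, input);  // network transfer elided
        }
    }

    public static int[] demo(int[] input) {
        Device device = new ForwardingDevice(new Daemon());
        return device.runKernel(x -> x * 2, input);  // looks like a local call
    }
}
```

To the calling code, `ForwardingDevice` is indistinguishable from `LocalDevice`, which mirrors how a remote GPU appears locally installed.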
dOpenCL supports shared cluster environments, in which
multiple OpenCL programs run concurrently, by employing a
device manager. The device manager handles the assignment
of devices to specific kernels and keeps track of device
utilization within the cluster. Thus, it ensures that a device
is used only by a single kernel at each point in time.
B. Aparapi
Aparapi is a library that offers extensive functionality to
ease the usage of the OpenCL API and minimize programming
effort when developing OpenCL kernels [6]. Firstly, it contains
bindings to access OpenCL functions through a Java Native
Interface (JNI), abstracting and bundling multiple low-level
calls within Java high-level functions. Secondly, Aparapi is
able to translate Java code to valid OpenCL kernels. Input
data may be defined in Java code itself and the results are
again available in Java after execution. This is possible as
Aparapi automatically copies the participating data back and
forth between the host code and the executing device. For a
more granular approach, developers can also define explicitly
which data should be written or read in order to improve
performance.

Fig. 3: Aparapi translates Java code to OpenCL kernels.
The framework drastically reduces code complexity, taking
care of tasks such as device selection and data handling when
no explicit control is required. This enables developers to
program algorithms considerably faster and offers beginners
a simplified access to OpenCL features without knowledge of
low-level mechanisms.
III. RELATED WORK

In this section, we provide an overview of related OpenCL API forwarding approaches and frameworks for aggregating heterogeneous cluster resources.
A. SnuCL
Instead of simply forwarding API calls to remote machines,
SnuCL [7] heavily transforms kernels depending on the avail-
able runtimes of a machine. For example, it translates OpenCL
to CUDA when only the CUDA platform is provided by the
machine. Additionally, it can transform OpenCL code to C,
which is then executed within a thread for each core of a CPU.
Furthermore, SnuCL introduces a virtual global memory, in
which buffers may be shared among devices. It manages the
consistency among shared buffers and attempts to minimize
copy operations throughout the execution. Buffers that are
not written can therefore remain on a device without the
requirement to be copied back to the host.
B. DistCL
DistCL aims at fusing multiple GPUs into a single virtual
OpenCL device [8]. To achieve this, it abstracts the devices
by representing them as one unified device while handling
kernel distribution and data transfers transparently. In order to
enable parallelization of a kernel, DistCL automatically splits
it into multiple kernels with their respective required data,
called subranges. For this, programmers have to supply a meta-
function that determines the memory access pattern. Based on
the given function, DistCL can transfer only relevant data to
a device that executes a subrange.
C. MultiCL
MultiCL is an extension of SnuCL and schedules kernels
across multiple heterogeneous devices in a cluster [9]. It offers
a round robin approach as well as an autofit mechanism. When
queuing a kernel, a flag can be attached to it, which labels
the assumed execution type. The available flags comprise
compute-intensive, memory bound, I/O bound, or iterative. The
scheduling mechanism can employ a static or a dynamic
algorithm. The static approach profiles all available devices
with regard to memory bandwidth and instruction throughput.
Based on these measurements, the best-fitting device is selected
with respect to the kernel flag.
IV. CLOUDCL

Here, we provide an overview of the major characteristics of the job design, the job scheduling mechanisms, and the dynamic scale-out capabilities in CloudCL.
A. Job Design
In order to fully utilize as many remote computing resources
as possible, CloudCL has to provide a mechanism for splitting
tasks. It is mandatory to ensure that each split has all the
necessary data for its correct computation, while keeping
the amount of memory transfers at a minimum. Based on
a survey of the literature, we identified Manual Splits, Naive Buffer
Replication [10], Intelligent Buffer Replication [11] and Meta
Functions [8, 12] as potential splitting strategies. All methods
come with a tradeoff between performance and programming
complexity: Naive Buffer Replication is inefficient and does
not scale with increasing split counts. Intelligent Buffer Repli-
cation requires an extra step for sampling memory accesses,
which includes intermediate code translation. Although Meta
Functions try to unburden developers, they are still required
to define the access patterns manually. Manual Splits put the
entire responsibility into the hands of the developer. While
it is the least automated method, it also grants full control
over performance. Thus, CloudCL employs the Manual Splits
strategy. Listing 1 demonstrates how the splitting approach is
integrated into OpenCL and that writing a split and merge
algorithm can be intuitive and does not require much code. To
achieve a higher degree of parallelism, partialCount can be
dynamically changed to any divisor of the array length.
public class AdditionKernel extends Kernel {
    int[] a, b, result;

    public AdditionKernel(int[] a, int[] b) {
        this.a = a;
        this.b = b;
        this.result = new int[a.length];
    }

    @Override
    public void run() {
        int i = getGlobalId();
        result[i] = a[i] + b[i];
    }
}

public class Addition {
    public static void main(String[] args) {
        final int partialCount = 2;
        int[] a = new int[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int[] b = new int[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int[] result = new int[a.length];

        int partialWidth = a.length / partialCount;
        for (int i = 0; i < partialCount; i++) {
            int[] aPartial = Arrays.copyOfRange(a, i * partialWidth, (i + 1) * partialWidth);
            int[] bPartial = Arrays.copyOfRange(b, i * partialWidth, (i + 1) * partialWidth);

            AdditionKernel additionKernel = new AdditionKernel(aPartial, bPartial);
            additionKernel.execute(partialWidth);
            System.arraycopy(additionKernel.result, 0, result, i * partialWidth, partialWidth);
        }
    }
}
Listing 1: Implementation of Manual Splits based on
Aparapi. This example portrays the serial execution of the
partials for the sake of simplicity.
The execution model of CloudCL requires developers to
define Jobs, each comprising at least one Partial, which
is an extended Aparapi kernel. Partials contain the data that is
split manually. After a job is constructed it can be submitted to
the Job Executor that assigns and executes the partials across
the available devices of its cluster. The job partial execution
model is illustrated in Figure 4.
Fig. 4: The CloudCL execution model splits jobs into partials, which are distributed across the cluster resources.
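The Job/Partial execution model described above can be sketched as follows. The class names mirror the paper's abstractions, but the implementation is an illustrative simplification in which a thread pool stands in for the cluster's compute devices:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the Job/Partial model: a Job is a set of Partials (extended
// kernels plus their manually split data); the Job Executor assigns each
// Partial to some available device and merges the results.
public class JobModelSketch {

    // A Partial bundles one split of the data with the kernel that runs on it.
    interface Partial {
        int[] execute();
    }

    // A Job comprises at least one Partial.
    static class Job {
        final List<Partial> partials = new ArrayList<>();
        Job add(Partial p) { partials.add(p); return this; }
    }

    // The Job Executor dispatches partials to "devices" (worker threads here)
    // and concatenates the partial results in submission order.
    static class JobExecutor {
        int[] submit(Job job) throws Exception {
            ExecutorService devices = Executors.newFixedThreadPool(2);
            try {
                List<Future<int[]>> futures = new ArrayList<>();
                for (Partial p : job.partials) futures.add(devices.submit(p::execute));
                List<Integer> merged = new ArrayList<>();
                for (Future<int[]> f : futures)
                    for (int v : f.get()) merged.add(v);
                int[] result = new int[merged.size()];
                for (int i = 0; i < result.length; i++) result[i] = merged.get(i);
                return result;
            } finally {
                devices.shutdown();
            }
        }
    }

    public static int[] demo() {
        try {
            // Two partials, each contributing its half of the overall result.
            Job job = new Job()
                    .add(() -> new int[]{2, 4})
                    .add(() -> new int[]{6, 8});
            return new JobExecutor().submit(job);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```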
B. Job Scheduling
In order to allow the scheduler to serve specific cluster
requirements, a pluggable two-tiered architecture is proposed.
As such, the scheduler consists of two modules, called Job
Scheduler and Device Scheduler. CloudCL allows switching
the modules during runtime in order to enable clusters to
adapt to varying situations and usage scenarios. The overall
scheduling architecture is depicted in Figure 5. In the fol-
lowing sections, the Job Scheduler and Device Scheduler are
explained in detail.
Fig. 5: The two-tier scheduling architecture of CloudCL considers the job abstraction level on its first tier and compute device characteristics on the second tier.
1) Job Scheduler (First Tier): The first scheduling tier
considers only the system state on a job abstraction level
without having any knowledge about the actual partial struc-
ture or available devices in the cluster. Instead, this tier is
mainly concerned with enforcing predefined fairness policies
and as such controls the order in which jobs are eligible for
consideration during a scheduling round. Algorithms available
in CloudCL for this tier are First-In First-Out and Round-
Robin. Still, the first tier hands over the partial order to the
second tier, which may choose to ignore the given order based
on the nature of the underlying algorithm.
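The two first-tier policies can be sketched as follows; the method signatures and the representation of a job as a list of partial identifiers are illustrative assumptions, not CloudCL's actual interfaces:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of the first scheduling tier: it orders jobs purely for fairness,
// without any knowledge of partial structure or cluster devices.
public class FirstTierSketch {

    // First-In First-Out: all partials of the oldest job are eligible first.
    static List<String> fifo(List<List<String>> jobs) {
        List<String> order = new ArrayList<>();
        for (List<String> job : jobs) order.addAll(job);
        return order;
    }

    // Round-Robin: one partial per job per turn, so jobs progress evenly.
    static List<String> roundRobin(List<List<String>> jobs) {
        List<String> order = new ArrayList<>();
        List<Deque<String>> queues = new ArrayList<>();
        for (List<String> job : jobs) queues.add(new ArrayDeque<>(job));
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Deque<String> q : queues) {
                if (!q.isEmpty()) { order.add(q.poll()); progress = true; }
            }
        }
        return order;
    }
}
```

Either policy merely proposes an order; as noted above, the second tier is free to deviate from it.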
2) Device Scheduler (Second Tier): The second scheduling
tier receives its input from the preceding tier and is not concerned
with the high-level job abstraction. Instead, it focuses on the
actual assignment of partials to devices. Executing OpenCL
code in a heterogeneous environment can lead to drastically
different performance behavior based on the algorithm imple-
mentation and the executing device. For example, code that
may perform exceptionally well on a GPU may run poorly on a
CPU because of varying computational capabilities. Therefore,
CloudCL enables developers to express preferences in terms
of the device types that their algorithm should be computed
on. The mechanism is implemented through a defined Device
Preference attribute that can be attached to every partial.
Valid values for the attribute include: None, CPU only, CPU
preferred, GPU only, and GPU preferred.
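One plausible interpretation of these preference values, distinguishing hard restrictions ("only") from soft rankings ("preferred"), is sketched below; the enum and filtering logic are assumptions about the semantics, not CloudCL's actual implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the Device Preference attribute: "only" values restrict the
// candidate device set, "preferred" values merely rank it, None leaves it as-is.
public class DevicePreferenceSketch {

    enum Type { CPU, GPU }
    enum Preference { NONE, CPU_ONLY, CPU_PREFERRED, GPU_ONLY, GPU_PREFERRED }

    static List<Type> candidates(List<Type> devices, Preference pref) {
        switch (pref) {
            case CPU_ONLY:
                return devices.stream().filter(d -> d == Type.CPU).collect(Collectors.toList());
            case GPU_ONLY:
                return devices.stream().filter(d -> d == Type.GPU).collect(Collectors.toList());
            case CPU_PREFERRED:
                return rank(devices, Type.CPU);
            case GPU_PREFERRED:
                return rank(devices, Type.GPU);
            default:
                return devices;
        }
    }

    // Stable sort: the preferred device type comes first, but every device
    // remains eligible, so the partial is never starved.
    private static List<Type> rank(List<Type> devices, Type preferred) {
        List<Type> out = new ArrayList<>(devices);
        out.sort(Comparator.comparing((Type d) -> d == preferred ? 0 : 1));
        return out;
    }
}
```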
For CloudCL, not only the performance of the execution
device matters, but also the speed of the network connection
is a crucial factor. This becomes especially important when
remote resources are tied in via a wide area network, which
inherently cannot offer the same bandwidth and latency as
a local network. A naïve approach that combines measuring
device performance and networking capabilities is the usage
of historical data. With the assumption that jobs are split into
many equally sized partials, one can identify the best-suited
device for a partial given the history of previous executions
within the cluster.
In order to provide scheduling algorithms on the second
tier with meaningful metrics, CloudCL provides the following
metrics per partial:
a) Kernel Data Approximation: Kernel classes report the
actual size of the overall kernel including its data. The size
is gained by employing Java reflection on an Aparapi kernel,
identifying employed data types and data structures. However,
this measure should only be treated as an approximation, as it
does not consider the effects of explicit memory management.
b) Data Transfer History: To take explicit memory man-
agement into account, a history of transferred data volumes is
maintained. For this purpose, the memory management calls
of Aparapi were modified to accumulate the sizes of the
corresponding data structures in a kernel attribute.
c) Performance History: CloudCL maintains a history
of execution times of partials for every device. The observed
timespan reaches from invocation to completion of a kernel
from the application host, thus including network transfers.
In the current version of CloudCL, we implemented a
Performance Based Device Scheduler based on the Perfor-
mance History. It assigns a partial to the available device
with the best performance history for the given kernel class.
With the available metrics, more sophisticated and fine-grained
schedulers can be implemented in the future. It should be noted
that schedulers on the second tier may ignore the order of
partials suggested by the first tier. For example, the first tier
might hand over a list of partials in FIFO order. If this list
is handled by a simplex-based scheduler on the second tier,
the principle of FIFO may be violated in favor of finding the
optimal distribution.
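The Performance Based Device Scheduler can be sketched as follows. The bookkeeping shown (execution times recorded per kernel class and device, with unprofiled devices tried first) is a plausible reading of the description above, not the actual CloudCL code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the Performance Based Device Scheduler: per (kernel class, device)
// pair it keeps observed execution times (invocation to completion, so network
// transfers are included) and assigns the next partial of that class to the
// available device with the best average so far.
public class PerfHistorySketch {

    private final Map<String, Map<String, List<Double>>> history = new HashMap<>();

    void record(String kernelClass, String device, double seconds) {
        history.computeIfAbsent(kernelClass, k -> new HashMap<>())
               .computeIfAbsent(device, d -> new ArrayList<>())
               .add(seconds);
    }

    // Pick the available device with the lowest mean execution time; devices
    // without any history are tried first so that each device gets profiled.
    String pick(String kernelClass, Set<String> available) {
        Map<String, List<Double>> perDevice = history.getOrDefault(kernelClass, Map.of());
        String best = null;
        double bestMean = Double.POSITIVE_INFINITY;
        for (String device : available) {
            List<Double> times = perDevice.get(device);
            if (times == null || times.isEmpty()) return device;  // unprofiled
            double mean = times.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            if (mean < bestMean) { bestMean = mean; best = device; }
        }
        return best;
    }
}
```

Because the history covers the full invocation-to-completion span, a fast device behind a slow link is naturally penalized, folding network performance into the same metric.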
C. Dynamic Scale-Out Capabilities
The general idea behind scaling performance within
CloudCL is that, regardless of whether on-premise or cloud-
based Infrastructure as a Service (IaaS) resources are em-
ployed, additional resources can be added in times of high
capacity demand. Similarly, in times of low demand, re-
sources can be released again in order to reduce costs or
yield resources to other workloads. Widely employed cluster
resource management solutions include functionalities to add
or remove nodes dynamically. Offering similar functionality
in a dOpenCL-based solution, however, differs substantially.
OpenCL and Aparapi are built to run on a single machine
and as such assume that the available hardware does not
change during operation. For example,
Aparapi caches device queries to OpenCL, as devices are
usually added or removed during maintenance procedures
while a machine is powered down. dOpenCL overcomes
such limitations and enables the host node to have devices
virtually installed by adding nodes to the dOpenCL cluster at
runtime. dOpenCL therefore extends the OpenCL standard by
adding the custom methods clCreateComputeNodeWWU
and clReleaseComputeNodeWWU.
As OpenCL itself has no multi-machine capabilities, it offers
information only for the respective built-in devices through
the standard API. Therefore all devices appear as if they were
installed in the host node that runs the dOpenCL library. This
makes identifying the owning machine per device impossible.
Knowing the relation between devices and machines is crucial
for providing basic functionality, such as releasing superfluous
machines, where it has to be ensured that no partials are run-
ning anymore on its respective devices. In order to provide the
machine-device relation, dOpenCL introduces another method
called clGetDeviceIDsFromComputeNodeWWU.
The mentioned dOpenCL methods are available to C++ pro-
grammers when including the respective header files that con-
tain the extensions. Thus, host code that wants to benefit from
dynamic device management has to be modified appropriately.
In its standard version, Aparapi is bound to the standard
OpenCL specification and has no understanding of the offered
dOpenCL extensions. Therefore it was necessary to modify the
Aparapi JNI to enable dynamic resource scaling. In order to do
so, two new JNI methods were implemented, addNode and
removeNode. Both call the respective dOpenCL functions,
with addNode also reporting back the available devices of
the added node.
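The bookkeeping these extensions enable can be sketched as follows. `NodeRegistrySketch` and its method names are hypothetical, illustrating only why the device-to-node relation exposed by clGetDeviceIDsFromComputeNodeWWU matters when releasing nodes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the bookkeeping behind dynamic scale-out: adding a node records
// which devices belong to it, and a node may only be released once none of
// its devices is still executing a partial.
public class NodeRegistrySketch {

    private final Map<String, List<String>> devicesOfNode = new HashMap<>();
    private final Set<String> busyDevices = new HashSet<>();

    // Mirrors addNode: registers a node and reports back its devices.
    List<String> addNode(String node, List<String> devices) {
        devicesOfNode.put(node, new ArrayList<>(devices));
        return devices;
    }

    void markBusy(String device) { busyDevices.add(device); }
    void markIdle(String device) { busyDevices.remove(device); }

    // Mirrors removeNode: refuses to release a node with running partials.
    boolean removeNode(String node) {
        List<String> devices = devicesOfNode.getOrDefault(node, List.of());
        for (String d : devices)
            if (busyDevices.contains(d)) return false;  // still in use
        devicesOfNode.remove(node);
        return true;
    }
}
```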
V. EVALUATION

To evaluate the scale-out capabilities of CloudCL, this
section starts with a detailed description of the test
environment, including the employed hardware
as well as the general method of measurement. Afterwards,
the results of our scale-out benchmarks are presented and
discussed.
TABLE I: Specifications of on-premise nodes.
HPE ProLiant m710p Server Cartridge
CPU 1 ×Intel Xeon E3-1284L v4 (8 logical cores)
Memory 4 ×8GB PC3L-12800
NIC Mellanox Connect-X3 Pro (10 GBit/s)
OS Ubuntu 16.04.1 64 Bit
TABLE II: Network performance of on-premise deployment.
HPE ProLiant m710p Server Cartridge
iperf3 9.409 Gbit/s (σ = 0.051)
ping 0.14 ms (σ = 0.012)
A. Test Environment & Method of Measurement
With CloudCL targeting dynamic deployment strategies on
cloud infrastructure, the evaluation distinguishes between three
scenarios: an on-premise private cloud deployment scenario,
an Amazon EC2 based public cloud deployment scenario and
a hybrid cloud deployment scenario. For each environment, we
are utilizing well-defined hardware configurations and provide
detailed hardware specifications. Since CloudCL heavily de-
pends on network performance, we further provide network
performance measurements based on iperf3 and the ping
utilities, conducted between the application host and a compute
node. For the on-premise private cloud deployment scenario,
we utilize HPE ProLiant m710p server cartridges [13] for
both compute nodes and the application host, with the de-
tailed specifications denoted in Table I. The practical network
performance is documented in Table II. For the public cloud
deployment scenario, g2.2xlarge instances are used to assess
the performance of GPU-equipped compute nodes, whereas
c4.8xlarge instances are employed to assess the performance
of CPU-equipped compute nodes. In both cases, a c4.8xlarge
instance is used for the application host. The detailed specifi-
cations and the network performance are reported in Table III
and Table IV, respectively. The hybrid cloud deployment
scenario uses one m710p on-premise compute node and up
to three c4.8xlarge public cloud compute nodes. The network
performance is documented in Table V.
TABLE III: Specifications of public cloud nodes.
c4.8xlarge g2.2xlarge
CPU Intel Xeon E5-2666 v3, 36 cores Intel Xeon E5-2670, 8 cores
Memory 60GB 15GB
NIC 10 GBit/s 1 GBit/s
OS Ubuntu 14.04.4 64 Bit Ubuntu 14.04.4 64 Bit
OpenCL 1.2 CUDA 367.57
TABLE IV: Network performance of public cloud deployment.
c4.8xlarge g2.2xlarge
iperf3 9.44 Gbit/s (σ = 0.068) 992.6 Mbit/s (σ = 4.8)
ping 0.158 ms (σ = 0.021) 1.29 ms (σ = 0.047)
TABLE V: Network performance of hybrid deployment.
HPE ProLiant m710p & c4.8xlarge
iperf3 169.26 Mbit/s (σ = 11.93)
ping 18.9 ms (σ = 0.17)
Regarding the method of measurement, the application
host and the compute nodes are always hosted on separate
machines, hence kernel partials and the corresponding data
always have to pass the network interface regardless of the
number of compute nodes. This decision is based on the ob-
servation by Tausche et al. [14], who have demonstrated that,
given a fast network interconnect, the forwarding technique
implemented by the dOpenCL framework incurs very little
overhead compared to native execution.
To demonstrate the scaling capabilities of CloudCL, we
decided to use naïve matrix multiplication as a very data-
intensive benchmark workload in order to test the framework
under the most challenging conditions.
It should be noted that a naïve matrix multiplication can-
not be perfectly divided into partials. Instead, one of the
matrices has to be present in its entirety in each partial, while
the second matrix can be split. This means that there is a
growing overhead correlating with the number of splits. With
every additional split, the input overhead is increased by the
data size of the indivisible matrix, which constitutes 50% of the
overall input. Therefore, while a higher degree of parallelization
may harness more of the computational capacity of a cluster,
it also requires more data transfers.
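This overhead can be quantified: with k partials, each partial receives the full indivisible matrix plus a 1/k slice of the other, so the total input shipped grows as (k+1)/2 times the unsplit input. A small sketch of this arithmetic (illustrative, not from the paper):

```java
// Quantifies the split overhead of naive matrix multiplication: each of the
// k partials needs one N x N matrix in full plus a 1/k slice of the other,
// so the data shipped grows with every additional split.
public class SplitOverheadSketch {

    // Total input elements transferred for k partials of an N x N problem.
    static long transferredElements(long n, int k) {
        long full = n * n;        // indivisible matrix, replicated per partial
        long slice = n * n / k;   // split matrix, divided among the partials
        return k * (full + slice);
    }

    // Overhead factor relative to the unsplit input of 2 * N * N elements.
    static double overheadFactor(long n, int k) {
        return transferredElements(n, k) / (2.0 * n * n);
    }
}
```

For one split the factor is 1.0 (no overhead); for two splits it is already 1.5, and for four splits 2.5, which is why the benchmark stresses the network so heavily.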
B. Benchmark Results
1) On-Premise Private Cloud Deployment Scenario: The
performance for the on-premise deployment scenario is pre-
sented in Figure 6. Adding a second m710p compute node
yields substantial performance improvements ranging from at
least 1.65x speedup for small matrices up to 1.95x speedup
for larger problem sizes. Adding a third m710p compute node
pays off for larger problem sizes, where a speedup of 2.8x
is achieved. Overall, CloudCL scales well for larger problem
sizes in the on-premise deployment scenario, most likely due
to the high-bandwidth and low-latency network connection
between nodes.
Fig. 6: Close-to-linear speedup in the on-premise deployment (matrix dimension N x N, one to three m710p compute nodes).
Fig. 7: Using public cloud resources, GPU-based compute nodes (a, g2.2xlarge) achieve linear speedup, while CPU-based compute nodes (b, c4.8xlarge) offer close-to-linear speedup.
Fig. 8: Hybrid clouds (one m710p plus one to three c4.8xlarge nodes) are not feasible for data-intensive workloads such as matrix multiplication (a), but perform well for compute-intensive workloads such as the Mandelbrot set (b).
2) Public Cloud Deployment Scenario: The performance
results for the public cloud deployment scenario are illustrated
in Figure 7a (GPU-based instances) and Figure 7b (CPU-based
instances). Using GPU-based instances, CloudCL manages to
achieve close to ideal scale-out performance with an average
1.96x speedup for using two g2.2xlarge compute nodes and
an average 2.94x speedup when a third node is added. On
the side of CPU-based instances, using a second c4.8xlarge
compute node yields 1.94x speedup for large matrices. Adding
a third node likewise pays off only for large matrices, where
speedups of up to 2.77x are achieved.
The discrepancy between the scaling behavior on GPU-
based versus CPU-based compute nodes is likely to be caused
by the differing network performance of employed instance
types (see Table IV). In the case of the GPU-based setup, the
application host node is equipped with a 10 Gbit/s Ethernet
link which can easily saturate the 1 Gbit/s Ethernet links of
the compute nodes. For the CPU-based setup, all nodes are
equipped with a 10 Gbit/s Ethernet link, which no longer
allows the application host node to fully saturate the network
links of the compute nodes. To validate this hypothesis, addi-
tional experiments have to be conducted where the application
host node is equipped with a faster network link.
3) Hybrid Cloud Deployment Scenario: Even though net-
work connectivity is a massively limiting factor, we also con-
ducted a benchmark using a hybrid cloud deployment scenario
for the sake of completeness, where some compute nodes are
hosted in the on-premise private cloud and other compute
nodes are tied in from public cloud resources. Figure 8a
presents the results of this experiment. As expected, using a
data-intensive workload such as matrix multiplication, incor-
porating resources from the public cloud is not feasible at all.
However, we conducted a second benchmark using the Mandelbrot
set as a compute-intensive but data-insensitive workload in
order to evaluate whether hybrid cloud deployment scenarios
might be feasible for less data-dependent workloads. The
results of the Mandelbrot set benchmark depicted in Figure 8b
indicate much better scaling behavior for such workloads.
4) Overarching Evaluation: Our performance evaluation
has demonstrated that CloudCL scales well both in on-premise
and public cloud deployments, using matrix multiplication as
a data-intensive worst case workload. With network connec-
tivity being the predominating bottleneck, future versions of
CloudCL might mitigate the impact of network performance
by implementing lightweight compression or broadcast mecha-
nisms in the underlying dOpenCL API forwarding mechanism.
VI. CONCLUSION

In this paper, we introduced the CloudCL framework,
which enables developers to focus their implementation efforts
on compute kernels without having to consider inter-node
communication. To achieve this goal, CloudCL removes two
obstacles: First, by using Aparapi, developers can implement
compute kernels in Java and no longer have to deal with
tedious OpenCL boilerplate code for device management.
For novice developers, this simplification greatly alleviates
the hurdles towards getting started. Second, developers do not
have to consider distributed aspects such as inter-node com-
munication, since the underlying dOpenCL library empowers
developers to interact with remote compute devices as if they
were installed locally. CloudCL extends both base technologies
to enable the development of cloud-native application behavior
by supporting dynamic addition and removal of compute nodes
at runtime. Additionally, the combination of a straightforward
job design and the corresponding job scheduling framework
make sure that cluster resources are used efficiently and fairly.
In an extensive performance evaluation, we demonstrate that
the framework provides close-to-linear scale-out performance
in on-premise private cloud deployment scenarios and in
public cloud deployment scenarios, using matrix multipli-
cation as a data-intensive worst case workload. For data-
intensive workloads, hybrid cloud deployment scenarios suffer
from insufficient wide area network connectivity. However,
for compute-intensive but less data-sensitive workloads such
as the computation of the Mandelbrot set, the hybrid cloud
deployment scenario also revealed good performance.
ACKNOWLEDGMENT
This paper has received funding from the European Union’s
Horizon 2020 research and innovation programme 2014-2018
under grant agreement No. 644866.
This paper reflects only the authors’ views and the European
Commission is not responsible for any use that may be made
of the information it contains.
REFERENCES
[1] L. Rampasek and A. Goldenberg, “TensorFlow: Biology’s
Gateway to Deep Learning?” Cell Systems, vol. 2, no. 1,
pp. 12–14, 2016.
[2] M. Johnson et al., “Google’s Multilingual Neural Machine
Translation System: Enabling Zero-Shot Translation,”
CoRR, vol. abs/1611.04558, 2016. [Online]. Available:
[3] F. Silla, S. Iserte, C. Reaño, and J. Prades, “On the ben-
efits of the remote GPU virtualization mechanism: The
rCUDA case,” Concurrency and Computation: Practice
and Experience, vol. 29, no. 13, 2017.
[4] P. Kegel, M. Steuwer, and S. Gorlatch, “dOpenCL: To-
wards a Uniform Programming Approach for Distributed
Heterogeneous Multi-/Many-Core Systems,” in Proceed-
ings of the 2012 IEEE 26th International Parallel and
Distributed Processing Symposium Workshops & PhD
Forum, ser. IPDPSW ’12. Washington, DC, USA: IEEE
Computer Society, 2012, pp. 174–186.
[5] S. Nanz, S. West, and K. S. da Silveira, Examining the
Expert Gap in ParallelProgramming. Berlin, Heidel-
berg: Springer Berlin Heidelberg, 2013, pp. 434–445.
[6] R. LaMothe and G. Frost. Official AMD Aparapi
Repository. [Online]. Available:
[7] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, “SnuCL:
An OpenCL Framework for Heterogeneous CPU/GPU
Clusters,” in Proceedings of the 26th ACM International
Conference on Supercomputing, ser. ICS ’12. New York,
NY, USA: ACM, 2012, pp. 341–352.
[8] T. Diop, S. Gurfinkel, J. Anderson, and N. E. Jerger,
“DistCL: A Framework for the Distributed Execution
of OpenCL Kernels,” in Proceedings of the 2013 IEEE
21st International Symposium on Modelling, Analysis &
Simulation of Computer and Telecommunication Systems,
ser. MASCOTS ’13. Washington, DC, USA: IEEE
Computer Society, 2013, pp. 556–566.
[9] A. M. Aji, A. J. Pe˜
na, P. Balaji, and W.-c. Feng, “Au-
tomatic Command Queue Scheduling for Task-Parallel
Workloads in OpenCL,” in Proceedings of the 2015
IEEE International Conference on Cluster Computing,
ser. CLUSTER ’15. Washington, DC, USA: IEEE
Computer Society, 2015, pp. 42–51.
[10] C. S. de la Lama, P. Toharia, J. L. Bosque, and O. D. Rob-
les, “Static Multi-device Load Balancing for OpenCL,
in Proceedings of the 2012 IEEE 10th International
Symposium on Parallel and Distributed Processing with
Applications, ser. ISPA ’12. Washington, DC, USA:
IEEE Computer Society, 2012, pp. 675–682.
[11] J. Kim, H. Kim, J. H. Lee, and J. Lee, “Achieving a
Single Compute Device Image in OpenCL for Multiple
GPUs,” SIGPLAN Not., vol. 46, no. 8, pp. 277–288, Feb.
[12] P. Li, E. Brunet, F. Trahay, C. Parrot, G. Thomas,
and R. Namyst, “Automatic OpenCL Code Generation
for Multi-device Heterogeneous Architectures,” in Pro-
ceedings of the 2015 44th International Conference on
Parallel Processing (ICPP), ser. ICPP ’15. Washington,
DC, USA: IEEE Computer Society, 2015, pp. 959–968.
[13] Hewlett Packard Enterprise, “HPE ProLiant m710p
Server Cartridge QuickSpecs,”,
[14] K. Tausche, M. Plauth, and A. Polze, “dOpenCL –
Evaluation of an API-Forwarding Implementation,” in
Proceedings of the Fourth HPI Cloud Symposium “Op-
erating the Cloud” 2016, Potsdam, Germany, 2017.